Introduction
This project aims to develop efficient, well-tested graph algorithm implementations for Neo4j 3.1 and 3.2.
Releases are available here: https://github.com/neo4j-contrib/neo4j-graph-algorithms/releases
The goal is to provide parallel versions of common graph algorithms for Neo4j, exposed as Cypher user-defined procedures:
Centralities:

* Page Rank
* Betweenness Centrality
* Closeness Centrality

Graph Partitioning:

* Label Propagation
* (Weakly) Connected Components
* Strongly Connected Components

Path Finding:

* Minimum Weight Spanning Tree
* All Pairs and Single Source Shortest Path
These procedures work on a subgraph, optionally filtered by label and relationship-type. Future versions will also provide filtering and projection using Cypher queries.
We’d love your feedback, so please try out these algorithms and let us know how well they work for your use case. Also please note anything you miss in the installation instructions, readme, etc.
Please raise GitHub issues for anything you encounter, or join the neo4j-users Slack group and ask in the #neo4j-graph-algorithm channel.
Installation
Just copy the graph-algorithms-algo-*.jar from the matching release into your $NEO4J_HOME/plugins directory and restart Neo4j.
Running CALL dbms.procedures(); should then also list the algorithm procedures.
CALL dbms.procedures() YIELD name, description, signature
WHERE name STARTS WITH "algo."
RETURN name, description, signature
ORDER BY name
Warning

For safety reasons, in Neo4j 3.2.x you will need to add/enable this line in your neo4j.conf: dbms.security.procedures.unrestricted=algo.*
Introduction
Graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between nodes. A graph is made up of nodes (vertices) which are connected by relationships (edges). A graph may be undirected, meaning that there is no distinction between the two nodes associated with each relationship, or its relationships may be directed from one node to another. Relationships are what a graph is all about: two nodes are joined by a relationship when they are related in a specified way.
We are tied to our friends. Cities are connected by roads and airline routes. Flora and fauna are bound together in a food web. Countries are involved in trading relationships. The World Wide Web is a virtual network of information.

Note that Neo4j can only save directed relationships, but we can treat them as though they are undirected when doing the analysis.
Usage
These algorithms are exposed as Neo4j procedures. You can call them directly from Cypher in your Neo4j Browser, from cypher-shell, or from your client code.
For most algorithms there are two procedures: one that writes results back to the graph as node properties, and another (named algo.<name>.stream) that returns a stream of data, e.g. node ids and computed values.
The general call syntax is:
CALL algo.<name>([label],[relType],{config})
For example for page rank on dbpedia:
CALL algo.pageRank('Page','Link',{iterations:5, dampingFactor:0.85, write: true, writeProperty:'pagerank'});
// YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty
CALL algo.pageRank.stream('Page','Link',{iterations:5, dampingFactor:0.85})
YIELD node, score
RETURN node, score
ORDER BY score DESC LIMIT 10;
Cypher Loading
If label and relationship-type are not selective enough to describe the subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph.
Then use a node-statement instead of the label parameter and a relationship-statement instead of the relationship-type, and set graph:'cypher' in the config.
You can also return a property value or weight (according to your config) in addition to the ids from these statements.
CALL algo.pageRank(
'MATCH (p:Page) RETURN id(p) as id',
'MATCH (p1:Page)-[:Link]->(p2:Page) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', iterations:5, write: true});
Building
Currently aiming at Neo4j 3.1 and 3.2 (in the 3.2 branch).
git clone https://github.com/neo4j-contrib/neo4j-graph-algorithms
cd neo4j-graph-algorithms
mvn clean install
cp algo/target/graph-algorithms-*.jar $NEO4J_HOME/plugins/
$NEO4J_HOME/bin/neo4j restart
Overview of Algorithm Procedures
qualified name  description 
Algorithms
Graph algorithms are used to compute metrics for graphs, nodes or relationships.
They can provide insights on relevant entities (centralities, ranking) in the graph or inherent structures like communities (community detection, graph partitioning, clustering).
Many of them are iterative approaches that frequently traverse the graph for the computation, using random walks, breadth-first or depth-first searches, or pattern matching.
Due to the exponential growth of possible paths with increasing distance, many of the approaches also have high algorithmic complexity.
Fortunately, optimized algorithms exist that utilize certain structures of the graph, memoize already explored parts, and parallelize operations where possible. Whenever possible, we tried to apply these optimizations.
Page Rank
PageRank is Google’s popular search algorithm. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
History, Explanation
In order to measure the relative importance of web pages, Sergey Brin and Larry Page proposed PageRank, a method for computing a ranking for every web page based on the graph of the web. PageRank has applications in search, browsing, and traffic estimation.
PageRank is defined in the original Google paper as follows:
We assume page A has pages T1…Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
The underlying mathematics of PageRank has to do with random walks on networks, akin to how random surfers propagate through a network. Precisely, PageRank is an example of a discrete ergodic Markov Chain. Random surfers follow links, but occasionally teleport to random vertices. The PageRank of a node is the probability it is visited by a random surfer with teleportation. PageRank is now widely recognized as a way of detecting central nodes in a network, and even has applications in systems biology.
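The iteration behind the formula above can be sketched in a few lines of Python. This is an illustrative, single-threaded toy, not the procedure's implementation (which is parallel and operates on Neo4j's internal node ids); `out_links` is a hypothetical adjacency mapping:

```python
def pagerank(out_links, d=0.85, iterations=20):
    """Toy PageRank: PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)).

    out_links maps each node to the list of nodes it links to.
    """
    pr = {node: 1.0 for node in out_links}
    for _ in range(iterations):
        new_pr = {}
        for a in out_links:
            # pages T that point to A each contribute PR(T) / C(T)
            incoming = (t for t in out_links if a in out_links[t])
            new_pr[a] = (1 - d) + d * sum(pr[t] / len(out_links[t]) for t in incoming)
        pr = new_pr
    return pr
```

For a symmetric two-page graph such as `{'a': ['b'], 'b': ['a']}` every page keeps rank 1.0, illustrating that the ranks are a fixed point of the formula.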
When to use it / use cases
The mathematics of PageRank are entirely general and apply to any graph or network in any domain. Thus, PageRank is now regularly used in bibliometrics, social and information network analysis, and for link prediction and recommendation. It’s even used for systems analysis of road networks, as well as biology, chemistry, neuroscience, and physics.
In neuroscience, the PageRank of a neuron in a neural network has been found to correlate with its relative firing rate.
Personalized PageRank is used by Twitter to present users with other accounts they may wish to follow.
PageRank has been used to rank spaces or streets to predict how many people (pedestrians or vehicles) come to the individual spaces or streets. In lexical semantics it has been used to perform Word Sense Disambiguation and semantic similarity, and also to automatically rank WordNet synsets according to how strongly they possess a given semantic property, such as positivity or negativity.
In any ecosystem, a modified version of PageRank may be used to determine species that are essential to the continuing health of the environment.
For the analysis of protein networks in biology PageRank is also a useful tool.
PageRank has recently been used to quantify the scientific impact of researchers. The underlying citation and collaboration networks are used in conjunction with the PageRank algorithm to come up with a ranking system for individual publications, which propagates to individual authors. The new index, known as the pagerank-index (Pi), is demonstrated to be fairer than the h-index, in the context of the many drawbacks exhibited by the h-index.
Algorithm explanation on simple sample graph
CREATE (home:Page{name:'Home'})
,(about:Page{name:'About'})
,(product:Page{name:'Product'})
,(links:Page{name:'Links'})
,(a:Page{name:'Site A'})
,(b:Page{name:'Site B'})
,(c:Page{name:'Site C'})
,(d:Page{name:'Site D'})
CREATE (home)-[:LINKS]->(about)
,(about)-[:LINKS]->(home)
,(product)-[:LINKS]->(home)
,(home)-[:LINKS]->(product)
,(links)-[:LINKS]->(home)
,(home)-[:LINKS]->(links)
,(links)-[:LINKS]->(a)-[:LINKS]->(home)
,(links)-[:LINKS]->(b)-[:LINKS]->(home)
,(links)-[:LINKS]->(c)-[:LINKS]->(home)
,(links)-[:LINKS]->(d)-[:LINKS]->(home)
CALL algo.pageRank.stream('Page', 'LINKS', {iterations:20, dampingFactor:0.85})
YIELD node, score
RETURN node, score ORDER BY score DESC LIMIT 20
CALL algo.pageRank('Page', 'LINKS', {iterations:20, dampingFactor:0.85,
write: true,writeProperty:"pagerank"})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty
name    | pageRank
--------|---------
Home    | 3.232
Product | 1.059
Links   | 1.059
About   | 1.059
Site A  | 0.328
Site B  | 0.328
Site C  | 0.328
Site D  | 0.328
As expected, the Home page has the highest pageRank, because it has incoming links from all other pages. We can also observe that not only the number of incoming links is important, but also the importance of the pages that link to us.
Example Usage
CALL algo.pageRank('Label1', 'TYPE1') YIELD computeMillis
CALL algo.pageRank.stream('Label1', 'TYPE1') YIELD node, score
RETURN node, score ORDER BY score DESC LIMIT 20
Syntax
CALL algo.pageRank(label:String, relationship:String, {iterations:20, dampingFactor:0.85,
write: true,writeProperty:'pagerank'})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty
- calculates page rank and potentially writes back
name          | type    | default    | optional | description
--------------|---------|------------|----------|------------
label         | string  | null       | yes      | label to load from the graph; if null, load all nodes
relationship  | string  | null       | yes      | relationship-type to load from the graph; if null, load all relationships
iterations    | int     | 20         | yes      | how many iterations of PageRank to run
dampingFactor | float   | 0.85       | yes      | damping factor of the PageRank calculation
write         | boolean | true       | yes      | if the result should be written back as a node property
writeProperty | string  | 'pagerank' | yes      | property name written back to
graph         | string  | 'heavy'    | yes      | use 'heavy' to describe the subgraph with label and relationship-type parameters, 'cypher' to describe it with a Cypher node-statement and relationship-statement
name          | type    | description
--------------|---------|------------
nodes         | int     | number of nodes considered
iterations    | int     | number of iterations run
dampingFactor | float   | damping factor used
writeProperty | string  | property name written back to
write         | boolean | if the result was written back as a node property
loadMillis    | int     | milliseconds for loading data
computeMillis | int     | milliseconds for running the algorithm
writeMillis   | int     | milliseconds for writing result data back
CALL algo.pageRank.stream(label:String, relationship:String, {iterations:20, dampingFactor:0.85})
YIELD node, score - calculates page rank and streams results
name          | type   | default | optional | description
--------------|--------|---------|----------|------------
label         | string | null    | yes      | label to load from the graph; if null, load all nodes
relationship  | string | null    | yes      | relationship-type to load from the graph; if null, load all relationships
iterations    | int    | 20      | yes      | how many iterations of PageRank to run
dampingFactor | float  | 0.85    | yes      | damping factor of the PageRank calculation
name  | type  | description
------|-------|------------
node  | long  | node id
score | float | PageRank weight
Cypher loading
If label and relationship-type are not selective enough to describe the subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph.
This can also be used to run algorithms on a virtual graph.
Set graph:'cypher' in the config.
CALL algo.pageRank(
'MATCH (p:Page) RETURN id(p) as id',
'MATCH (p1:Page)-[:Link]->(p2:Page) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', iterations:5, write: true});
Versions
We support the following versions of the pageRank algorithm:

✓ directed, unweighted

❏ directed, weighted

❏ undirected, unweighted

❏ undirected, weighted
Betweenness Centrality
Betweenness Centrality is a measure of centrality in a graph. In graph theory and network analysis, indicators of centrality identify the most important vertices within a graph. It is based on the observation that every pair of vertices exchanges information, and information flows along the geodesic (shortest) path between two vertices. For every pair of vertices in a connected graph, there exists at least one shortest path between the vertices, such that either the number of edges the path passes through (unweighted network) or the sum of the weights of the edges (weighted network) is minimized. The betweenness centrality for each node is the number of these shortest paths that pass through the node.
History, Explanation
Anthonisse (1971) introduced the concept of rush in a graph as the amount a node in a network has to intermediate between other nodes. Freeman (1977, 1979) defined betweenness centrality as one of the “three distinct intuitive conceptions of centrality” (Freeman, 1979: 215). Betweenness centrality is often connected with the notion of control over the flow of information. It is calculated by a breadth-first search algorithm which calculates the shortest paths from every node to all other nodes (Brandes 2001). The nodes which most frequently lie on these shortest paths are favored in the betweenness centrality score.
Betweenness Centrality finds wide application in network theory. It represents the degree to which nodes stand between each other. For example, in a telecommunications network, a node with higher betweenness centrality would have more control over the network, because more information will pass through that node. Betweenness centrality was devised as a general measure of centrality. It applies to a wide range of problems in network theory, including problems related to social networks, biology, transport and scientific cooperation.
When to use it / use cases
Betweenness centrality is useful in finding vertices that serve as a bridge from one part of a graph to another. Consequently, betweenness is a rudimentary measure of the control that a specific node exerts over the flow throughout the full graph. For example, Alice in the sample graph below is the main connection in the graph. Were Alice to be removed, all connections in the graph would be cut off. This makes Alice “important”, because she ensures that no nodes are isolated. When using betweenness centrality as an analysis measure, it indicates a potential gatekeeping or controlling node.
It differs from the other centrality measures. A node can have quite low degree, be connected to others that have low degree, even be a long way from others on average, and still have high betweenness. Consider a node A that lies on a bridge between two groups of vertices within a network. Since any path between nodes in different groups must go through this bridge, node A acquires high betweenness even though it is not well connected (it lies at the periphery of both groups).
Algorithm explanation on simple sample graph
People with high betweenness tend to be the innovators and brokers in any network. They combine different perspectives, transfer ideas between groups (or decide to withhold them), and get power from their ability to make introductions and pull strings.
CREATE (nAlice:User {id:'Alice'})
,(nBridget:User {id:'Bridget'})
,(nCharles:User {id:'Charles'})
,(nDoug:User {id:'Doug'})
,(nMark:User {id:'Mark'})
,(nMichael:User {id:'Michael'})
CREATE (nAlice)-[:MANAGE]->(nBridget)
,(nAlice)-[:MANAGE]->(nCharles)
,(nAlice)-[:MANAGE]->(nDoug)
,(nMark)-[:MANAGE]->(nAlice)
,(nCharles)-[:MANAGE]->(nMichael)
CALL algo.betweenness.stream('User','MANAGE')
YIELD nodeId, centrality
RETURN nodeId, centrality ORDER BY centrality DESC LIMIT 20
CALL algo.betweenness('User','MANAGE', {write:true, writeProperty:'centrality'})
YIELD nodes, minCentrality, maxCentrality, sumCentrality, loadMillis, computeMillis, writeMillis
name    | centrality weight
--------|------------------
Alice   | 4
Charles | 2
Bridget | 0
Michael | 0
Doug    | 0
Mark    | 0
We can see that Alice is the main broker in this network and Charles is a minor broker.
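The scores in the table can be reproduced with a brute-force sketch that enumerates all shortest paths, which is only feasible for tiny graphs (the procedure itself uses Brandes' much faster algorithm). `adj` in the test is a hypothetical adjacency mapping of the sample graph:

```python
from collections import deque
from itertools import permutations

def all_shortest_paths(adj, s, t):
    """BFS over whole paths; keeps every shortest path from s to t."""
    best, found = None, []
    queue = deque([[s]])
    while queue:
        path = queue.popleft()
        if best is not None and len(path) > best:
            break  # all remaining paths are longer than the shortest
        if path[-1] == t:
            best = len(path)
            found.append(path)
            continue
        for n in adj[path[-1]]:
            if n not in path:
                queue.append(path + [n])
    return found

def betweenness(adj):
    """Credit each interior node of every shortest path with its share."""
    bc = {v: 0.0 for v in adj}
    for s, t in permutations(adj, 2):
        paths = all_shortest_paths(adj, s, t)
        for path in paths:
            for v in path[1:-1]:
                bc[v] += 1.0 / len(paths)
    return bc
```

On the directed MANAGE graph above this yields 4 for Alice and 2 for Charles, matching the table.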
Syntax
CALL algo.betweenness(label:String, relationship:String, {write:true, stats:true, writeProperty:'centrality'})
YIELD nodes, minCentrality, maxCentrality, sumCentrality, loadMillis, computeMillis, writeMillis
- calculates betweenness centrality and potentially writes back
name          | type    | default      | optional | description
--------------|---------|--------------|----------|------------
label         | string  | null         | yes      | label to load from the graph; if null, load all nodes
relationship  | string  | null         | yes      | relationship-type to load from the graph; if null, load all relationships
write         | boolean | true         | yes      | if the result should be written back as a node property
stats         | boolean | true         | yes      | if stats about centrality should be returned
writeProperty | string  | 'centrality' | yes      | property name written back to
graph         | string  | 'heavy'      | yes      | use 'heavy' to describe the subgraph with label and relationship-type parameters, 'cypher' to describe it with a Cypher node-statement and relationship-statement
name          | type | description
--------------|------|------------
nodes         | int  | number of nodes considered
minCentrality | int  | minimum centrality value
maxCentrality | int  | maximum centrality value
sumCentrality | int  | sum of all centrality values
loadMillis    | int  | milliseconds for loading data
computeMillis | int  | milliseconds for running the algorithm
writeMillis   | int  | milliseconds for writing result data back
CALL algo.betweenness.stream(label:String, relationship:String) YIELD nodeId, centrality - yields centrality for each node
name         | type   | default | optional | description
-------------|--------|---------|----------|------------
label        | string | null    | yes      | label to load from the graph; if null, load all nodes
relationship | string | null    | yes      | relationship-type to load from the graph; if null, load all relationships

name       | type  | description
-----------|-------|------------
node       | long  | node id
centrality | float | betweenness centrality weight
Cypher loading
If label and relationship-type are not selective enough to describe the subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph.
This can also be used to run algorithms on a virtual graph.
Set graph:'cypher' in the config.
CALL algo.betweenness(
'MATCH (p:User) RETURN id(p) as id',
'MATCH (p1:User)-[:MANAGE]->(p2:User) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', write: true});
Versions
We support the following versions of the betweenness centrality algorithm:

✓ directed, unweighted

❏ directed, weighted

❏ undirected, unweighted

❏ undirected, weighted
Implementations
algo.betweenness()

* implementation of Brandes' BC algorithm with a node-partitioning extension
* if the concurrency parameter is set (and >1), ParallelBetweennessCentrality is used
* ParallelBC spawns N (given by the concurrency param) concurrent threads for the calculation, where each one calculates the BC for one node at a time

algo.betweenness.exp1()

* a Brandes-like algorithm which uses successor sets instead of predecessor sets
* based on Brandes' definition, but with some changes regarding the dependency-accumulation step
Closeness Centrality
Closeness Centrality of a node is a measure of centrality in a network that measures the nearness (as opposed to the distance) from a node to all other nodes. Nodes with a high closeness score have the shortest distances to all other nodes. This is important for the availability of knowledge and resources.
History, Explanation
Sabidussi (1966) described the sum of the shortest path distances from one node to every other node as the node’s farness. Freeman (1979) used this idea to define closeness centrality of a node as the inverse of Sabidussi’s farness. Nodes having a high closeness centrality are nearby all other nodes and have advantages in accessing resources in a network or having a good overview of the agents in a network.
When to use it / use cases
The most central nodes according to closeness centrality can quickly interact with all others because they are close to all others. This measure is preferable to degree centrality, because it takes into account not only direct connections among nodes, but also indirect connections.
Constraints / when not to use it
A key node centrality measure in networks is closeness centrality (Freeman, 1978; Opsahl et al., 2010; Wasserman and Faust, 1994). It is defined as the inverse of farness, which in turn, is the sum of distances to all other nodes. As the distance between nodes in disconnected components of a network is infinite, this measure cannot be applied to networks with disconnected components (Opsahl et al., 2010; Wasserman and Faust, 1994).
Algorithm explanation on simple sample graph

* count farness in each MS-BFS callback
* divide by N-1
Graph:
( A )⇐⇒( B )⇐⇒( C )⇐⇒( D )⇐⇒( E )
Calculation:
N = 5        // number of nodes
k = N-1 = 4  // used for normalization
      A    B    C    D    E
A   | 0    1    2    3    4    // farness between each pair of nodes
B   | 1    0    1    2    3
C   | 2    1    0    1    2
D   | 3    2    1    0    1
E   | 4    3    2    1    0
----------------------------
S   | 10   7    6    7    10   // sum of each column
============================
k/S | 0.4  0.57 0.67 0.57 0.4  // normalized centrality
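The same numbers fall out of a short Python sketch (illustrative only; the procedure computes farness via multi-source BFS and in parallel). `adj` is a hypothetical undirected adjacency mapping of the A-E chain above:

```python
from collections import deque

def closeness(adj):
    """Normalized closeness: (N-1) / (sum of shortest-path distances).

    Assumes a connected graph, as noted in the constraints above.
    """
    n = len(adj)
    scores = {}
    for source in adj:
        # plain BFS to get hop distances from this source
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        scores[source] = (n - 1) / sum(dist.values())
    return scores
```

On the five-node chain this yields 0.4, 0.57, 0.67, 0.57, 0.4, matching the calculation above.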
CALL algo.closeness('Person', 'KNOWS', {write:true, writeProperty:'centrality'})
YIELD nodes, loadMillis, computeMillis, writeMillis - calculates closeness centrality and potentially writes back
CALL algo.closeness.stream('Person', 'KNOWS') YIELD nodeId, centrality
RETURN nodeId, centrality ORDER BY centrality DESC LIMIT 20 - yields centrality for each node
Syntax
CALL algo.closeness(label:String, relationship:String, {write:true, writeProperty:'centrality'})
YIELD nodes,loadMillis, computeMillis, writeMillis
- calculates closeness centrality and potentially writes back
name          | type    | default      | optional | description
--------------|---------|--------------|----------|------------
label         | string  | null         | yes      | label to load from the graph; if null, load all nodes
relationship  | string  | null         | yes      | relationship-type to load from the graph; if null, load all relationships
write         | boolean | true         | yes      | if the result should be written back as a node property
writeProperty | string  | 'centrality' | yes      | property name written back to
graph         | string  | 'heavy'      | yes      | use 'heavy' to describe the subgraph with label and relationship-type parameters, 'cypher' to describe it with a Cypher node-statement and relationship-statement
name          | type | description
--------------|------|------------
nodes         | int  | number of nodes considered
loadMillis    | int  | milliseconds for loading data
computeMillis | int  | milliseconds for running the algorithm
writeMillis   | int  | milliseconds for writing result data back
CALL algo.closeness.stream(label:String, relationship:String) YIELD nodeId, centrality - yields centrality for each node
name         | type   | default | optional | description
-------------|--------|---------|----------|------------
label        | string | null    | yes      | label to load from the graph; if null, load all nodes
relationship | string | null    | yes      | relationship-type to load from the graph; if null, load all relationships

name       | type  | description
-----------|-------|------------
node       | long  | node id
centrality | float | closeness centrality weight
Cypher loading
If label and relationship-type are not selective enough to describe the subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph.
This can also be used to run algorithms on a virtual graph.
Set graph:'cypher' in the config.
CALL algo.closeness(
'MATCH (p:Person) RETURN id(p) as id',
'MATCH (p1:Person)-[:KNOWS]->(p2:Person) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', write: true});
Versions
We support the following versions of the closeness centrality algorithm:

✓ directed, unweighted

❏ directed, weighted

❏ undirected, unweighted

❏ undirected, weighted
Minimum Weight Spanning Tree
A Minimum Weight Spanning Tree (MST) is a subset of the edges of a connected, edge-weighted undirected graph that connects all the vertices together, without any cycles and with the minimum possible total edge weight. That is, it is a spanning tree whose sum of edge weights is as small as possible. It can be used to cluster the graph (k-means).
History, Explanation
In graph theory, a tree is a way of connecting all the vertices together so that there is exactly one path from any one vertex to any other vertex of the tree. If the graph represents a number of cities connected by roads, one could select a number of roads so that each city can be reached from every other, but there is no more than one way to travel from one city to another.
Most of the time, graphs are weighted; each connection between two cities has a weight: it might cost something to travel on a given road, or one connection may be longer than another, meaning it takes more time to travel on that connection.
A minimum spanning tree is a tree. It is different from other trees in that it minimizes the total of the weights attached to the edges. Depending on what the graph looks like, there may be more than one minimum spanning tree. In a graph where all the edges have the same weight, every tree is a minimum spanning tree. If all the edges have different weights, there is exactly one minimum spanning tree.
Czech scientist Otakar Borůvka developed the first known algorithm for finding a minimum spanning tree in 1926. He wanted to solve the problem of finding an efficient coverage of Moravia with electricity. Today, this algorithm is known as Borůvka’s algorithm.
When to use it / use cases
There are quite a few use cases for minimum spanning trees. One example would be a telecommunications company which is trying to lay out cables in new neighborhood. If it is constrained to bury the cable only along certain paths (e.g. along roads), then there would be a graph representing which points are connected by those paths. Some of those paths might be more expensive, because they are longer, or require the cable to be buried deeper; these paths would be represented by edges with larger weights. Currency is an acceptable unit for edge weight – there is no requirement for edge lengths to obey normal rules of geometry such as the triangle inequality. A spanning tree for that graph would be a subset of those paths that has no cycles but still connects to every house; there might be several spanning trees possible. A minimum spanning tree would be one with the lowest total cost, thus would represent the least expensive path for laying the cable.
Algorithm explanation on simple sample graph
       a                   a
    1 / \ 2             1 / \ 2
     /   \               /   \
    b --3-- c    =>     b     c
    |       |           |     |
    4       5           4     5
    |       |           |     |
    d --6-- e           d     e
MATCH (n:Node {start:true})
CALL algo.mst(n, 'cost', {write:true, writeProperty:"mst"})
YIELD loadMillis, computeMillis, writeMillis, weightSum, weightMin,weightMax, relationshipCount
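For intuition, the tree in the drawing above can be computed with Kruskal's algorithm in a few lines of Python (an illustrative sketch, independent of the procedure's internal implementation; the edge list mirrors the drawing's weights):

```python
def minimum_spanning_tree(edges):
    """Kruskal: take edges in weight order, skip any that would close a cycle.

    edges is a list of (weight, node, node) tuples.
    """
    parent = {}

    def find(v):
        # union-find with path halving
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    mst = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # u and v are in different components
            parent[root_u] = root_v   # union the two components
            mst.append((weight, u, v))
    return mst
```

For the sample graph's edges [(1,'a','b'), (2,'a','c'), (3,'b','c'), (4,'b','d'), (5,'c','e'), (6,'d','e')], it keeps the edges with weights 1, 2, 4 and 5 (total 12), as in the drawing.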
Syntax
CALL algo.mst(node:Node, weightProperty:String, {nodeLabelOrQuery:String,
relationshipTypeOrQuery:String, write:boolean, writeProperty:String, stats:boolean})
YIELD loadMillis, computeMillis, writeMillis, weightSum, weightMin, weightMax, relationshipCount
name                    | type    | default | optional | description
------------------------|---------|---------|----------|------------
node                    | Node    | null    | no       | node to load from the graph
weightProperty          | string  | 'cost'  | yes      | property name that contains the weight; must be numeric
nodeLabelOrQuery        | string  | null    | yes      | label to load from the graph; if null, load all nodes
relationshipTypeOrQuery | string  | null    | yes      | relationship-type to load from the graph; if null, load all relationships
write                   | boolean | true    | yes      | if the result should be written back as a node property
writeProperty           | string  | 'mst'   | yes      | relationship-type written back as result
name              | type | description
------------------|------|------------
weightSum         | int  | sum of all weights
weightMax         | int  | maximum weight value
weightMin         | int  | minimum weight value
relationshipCount | int  | number of relationships created
loadMillis        | int  | milliseconds for loading data
computeMillis     | int  | milliseconds for running the algorithm
writeMillis       | int  | milliseconds for writing result data back
All Pairs and Single Source  Shortest Path
A Single Source Shortest Path algorithm calculates a path between a pair of nodes whose summed weight is minimal. A commonly used algorithm is Dijkstra's.
All Pairs Shortest Path, on the other hand, calculates a shortest path forest containing all shortest paths between the nodes in the graph. Algorithms to solve this are Floyd-Warshall or parallel Johnson's algorithm.
History, Explanation
Path finding, in particular searching in a maze, belongs to the classical graph problems, and the classical references are Wiener [1873], Lucas [1882] (describing a method due to C. P. Trémaux), and Tarry [1895] - see Biggs, Lloyd, and Wilson [1976]. They form the basis for depth-first search techniques. Path problems were also studied at the beginning of the 1950s in the context of 'alternate routing', that is, finding a second-shortest route if the shortest route is blocked. This applies to freeway usage (Trueblood [1952]), but also to telephone call routing. At that time, making long-distance calls in the U.S.A. was automated, and alternate routes for telephone calls over the U.S. telephone network nationwide had to be found automatically.
When to use it / use cases
Shortest path algorithms are applied to automatically find directions between physical locations, such as driving directions on web mapping websites like MapQuest or Google Maps.
If one represents a nondeterministic abstract machine as a graph where nodes describe states and relationships describe possible transitions, shortest path algorithms can be used to find an optimal sequence of choices to reach a certain goal state, or to establish lower bounds on the time needed to reach a given state. For example, if nodes represent the states of a puzzle like a Rubik’s Cube and each directed relationship corresponds to a single move or turn, shortest path algorithms can be used to find a solution that uses the minimum possible number of moves.
In a networking or telecommunications mindset, this shortest path problem is sometimes called the min-delay path problem and is usually tied to a widest path problem. For example, the algorithm may seek the shortest (min-delay) widest path, or the widest shortest (min-delay) path.
Other applications, often studied in operations research, include plant and facility layout, robotics, transportation, and VLSI design.
Algorithm explanation on simple sample graph
CREATE (a:Loc{name:'A'}), (b:Loc{name:'B'}), (c:Loc{name:'C'}),
(d:Loc{name:'D'}), (e:Loc{name:'E'}), (f:Loc{name:'F'}),
(a)-[:ROAD {cost:50}]->(b),
(a)-[:ROAD {cost:50}]->(c),
(a)-[:ROAD {cost:100}]->(d),
(a)-[:RAIL {cost:50}]->(d),
(b)-[:ROAD {cost:40}]->(d),
(c)-[:ROAD {cost:40}]->(d),
(c)-[:ROAD {cost:80}]->(e),
(d)-[:ROAD {cost:30}]->(e),
(d)-[:ROAD {cost:80}]->(f),
(e)-[:ROAD {cost:40}]->(f),
(e)-[:RAIL {cost:20}]->(f);
Dijkstra single source shortest path algorithm
MATCH (start:Loc{name:'A'}), (end:Loc{name:'F'})
CALL algo.shortestPath.stream(start, end, 'cost')
YIELD nodeId, cost
RETURN nodeId, cost LIMIT 20
MATCH (start:Loc{name:'A'}), (end:Loc{name:'F'})
CALL algo.shortestPath(start, end, 'cost',{write:true,writeProperty:'sssp'})
YIELD writeMillis,loadMillis,nodeCount, totalCost
RETURN writeMillis,loadMillis,nodeCount,totalCost
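Dijkstra's algorithm itself fits in a dozen lines. The following Python sketch is illustrative only (not the procedure's implementation) and models the sample graph, with both ROAD and RAIL relationships, as a hypothetical weighted adjacency mapping:

```python
import heapq

def dijkstra(adj, start):
    """Settle nodes in order of smallest known distance, using a binary heap.

    adj maps each node to a list of (neighbor, cost) pairs.
    """
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float('inf')):
            continue  # stale heap entry, a shorter path was already found
        for neighbor, cost in adj.get(v, []):
            candidate = d + cost
            if candidate < dist.get(neighbor, float('inf')):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist
```

With all relationships of the sample graph loaded, the cheapest A-to-F route uses the two RAIL links: A-D (50), D-E (30), E-F (20), for a total cost of 100.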
Delta stepping algorithm
MATCH(n:Loc {name:'A'})
CALL algo.deltaStepping.stream(n, 'cost', 3.0)
YIELD nodeId, distance
RETURN nodeId, distance LIMIT 20
MATCH (n:Loc {name:'A'})
CALL algo.deltaStepping(n, 'cost', 3.0, {write:true, writeProperty:'sssp'})
YIELD nodeCount, loadDuration, evalDuration, writeDuration
RETURN nodeCount, loadDuration, evalDuration, writeDuration
Syntax
CALL algo.shortestPath(startNode:Node, endNode:Node, weightProperty:String
{nodeQuery:'labelName', relationshipQuery:'relationshipName', defaultValue:1.0,write:'true',writeProperty:'sssp'})
YIELD nodeCount, totalCost, loadMillis, evalMillis, writeMillis
name | type | default | optional | description
startNode | node | null | no | start node
endNode | node | null | no | end node
weightProperty | string | null | yes | property name that contains the weight; if null, treats the graph as unweighted. Must be numeric.
defaultValue | float | null | yes | default value of the weight in case it is missing or invalid
write | boolean | true | yes | if the result should be written back as a node property
writeProperty | string | 'sssp' | yes | property name written back to; stores the position of the node in the path
nodeQuery | string | null | yes | label to load from the graph; if null, load all nodes
relationshipQuery | string | null | yes | relationship type to load from the graph; if null, load all relationships
name | type | description
nodeCount | int | number of nodes considered
totalCost | float | sum of all weights along the path
loadMillis | int | milliseconds for loading data
evalMillis | int | milliseconds for running the algorithm
writeMillis | int | milliseconds for writing result data back
CALL algo.shortestPath.stream(startNode:Node, endNode:Node, weightProperty:String,
{nodeQuery:'labelName', relationshipQuery:'relationshipName', defaultValue:1.0})
YIELD nodeId, cost
name | type | default | optional | description
startNode | node | null | no | start node
endNode | node | null | no | end node
weightProperty | string | null | yes | property name that contains the weight; if null, treats the graph as unweighted. Must be numeric.
nodeQuery | string | null | yes | label to load from the graph; if null, load all nodes
relationshipQuery | string | null | yes | relationship type to load from the graph; if null, load all relationships
defaultValue | float | null | yes | default value of the weight in case it is missing or invalid
name | type | description
nodeId | int | node id
cost | int | cost it takes to get from the start node to a specific node
Graph Partitioning: Label Propagation
The Label Propagation algorithm (LPA) is an extremely fast graph partitioning method and is widely used in large-scale networks. Graph partitioning (community detection) is an important methodology for understanding the organization of various real-world networks and has applications in problems as diverse as consensus formation in social communities or the identification of functional modules in biochemical networks.
History, Explanation
Community structure is considered one of the most interesting features of complex networks. Many real-world complex systems exhibit community structure, where individuals with similar properties form a partition (community). Identifying the partitions in a network is important for understanding its structure. Graph partitioning in complex networks has therefore gained immense interest over the last decade, and many partitioning methods have been proposed. One of them is the label propagation algorithm (LPA), whose simplicity and time efficiency make it a popular choice.
When to use it / use cases
Label propagation is an algorithm for finding communities (partitions). It is one of the state-of-the-art methods, which estimates labels by propagating label information through a graph. Label propagation assumes that connected nodes should have similar labels. Consequently, the label estimation heavily depends on the relationships in a graph, which represent the similarity of each node pair.
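A minimal Python sketch of this mechanism (illustrative only; the procedure itself is implemented in Java inside the plugin): each node repeatedly adopts the most frequent label among its neighbors, with ties broken by the smallest label so the run is deterministic.

```python
from collections import Counter

def label_propagation(adj, iterations=10):
    """Repeatedly assign each node the most frequent label among its
    neighbors until labels stabilize or the iteration limit is hit."""
    labels = {node: node for node in adj}  # start with unique labels
    for _ in range(iterations):
        changed = False
        for node, neighbors in adj.items():
            if not neighbors:
                continue
            counts = Counter(labels[n] for n in neighbors)
            top = counts.most_common(1)[0][1]
            best = min(l for l, c in counts.items() if c == top)
            if labels[node] != best:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels

# two triangles with no edge between them -> two communities
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
print(label_propagation(adj))
```

Note that this sequential update order is one of several possible variants; the parallel implementation updates nodes in batches instead.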
Algorithm explanation on simple sample graph
CALL algo.labelPropagation('Label', '','OUTGOING', {iterations:1,partitionProperty:'partition', write:true})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, write, partitionProperty
Syntax
CALL algo.labelPropagation(label:String, relationship:String, direction:String, {iterations:1,
weightProperty:'weight', partitionProperty:'partition', write:true})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, write, weightProperty,
partitionProperty - simple label propagation kernel
name | type | default | optional | description
label | string | null | yes | label to load from the graph; if null, load all nodes
relationship | string | null | yes | relationship type to load from the graph; if null, load all relationships
direction | string | 'OUTGOING' | yes | relationship direction to use in the algorithm
iterations | int | 1 | yes | number of iterations
weightProperty | string | 'weight' | yes | property name that contains the weight. Must be numeric.
partitionProperty | string | 'partition' | yes | property name written back; holds the partition of the graph in which the node resides
write | boolean | true | yes | if the result should be written back as a node property
name | type | description
nodes | int | number of nodes considered
iterations | int | number of iterations
loadMillis | int | milliseconds for loading data
computeMillis | int | milliseconds for running the algorithm
writeMillis | int | milliseconds for writing result data back
weightProperty | string | property name that contains the weight
partitionProperty | string | property name written back to
write | boolean | if the result was written back
Connected Components
Connected Components, or UnionFind, finds sets of connected nodes where each node is reachable from any other node in the same set. In graph theory, a connected component of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the graph.
History, Explanation
The Connected Components of a graph represent, in general terms, the pieces of the graph. Two vertices are in the same component of a graph if and only if there is some path between them.
Finding connected components is at the heart of many graph applications. For example, consider the problem of identifying partitions in a set of items. We can represent each item by a node and add an edge between each pair of items that are deemed similar. The connected components of this graph correspond to different classes of items.
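The item-classification idea above can be sketched in a few lines of Python (an illustration, not the plugin's implementation): label every node with a component id by running a breadth-first search from each unvisited node.

```python
from collections import deque

def connected_components(adj):
    """Label every node with the id of its connected component,
    using BFS over an undirected adjacency dict."""
    component = {}
    current = 0
    for start in adj:
        if start in component:
            continue
        component[start] = current
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in component:
                    component[neighbor] = current
                    queue.append(neighbor)
        current += 1
    return component

# "similar items" graph: a-b-c form one class of items, x-y another
adj = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b'], 'x': ['y'], 'y': ['x']}
print(connected_components(adj))
```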
When to use it / use cases
Testing whether a graph is connected is an essential preprocessing step for every graph algorithm. Such tests can be performed so quickly and easily that you should always verify that your input graph is connected, even when you know it has to be. Subtle, difficulttodetect bugs often result when your algorithm is run only on one component of a disconnected graph.
Connected components have other practical use cases, for example, if we are analysing a social network and we want to find all the disconnected groups of people that exist in our graph.
Algorithm explanation on simple sample graph
Recall that an undirected graph is connected if for every pair of vertices there is a path in the graph between those vertices. A connected component of an undirected graph is a maximal connected subgraph of the graph. That means that the direction of the relationships in our graph has no influence, as we treat the graph as undirected. We have two implementations of the connected components algorithm. The first implementation treats the graph as unweighted; the second is weighted, where you can define a threshold on the weight below which relationships are ignored.
CREATE (nAlice:User {id:'Alice'})
,(nBridget:User {id:'Bridget'})
,(nCharles:User {id:'Charles'})
,(nDoug:User {id:'Doug'})
,(nMark:User {id:'Mark'})
,(nMichael:User {id:'Michael'})
CREATE (nAlice)-[:FRIEND {weight:0.5}]->(nBridget)
,(nAlice)-[:FRIEND {weight:4}]->(nCharles)
,(nMark)-[:FRIEND {weight:1}]->(nDoug)
,(nMark)-[:FRIEND {weight:2}]->(nMichael)
Unweighted version:
CALL algo.unionFind.stream('User', 'FRIEND', {})
YIELD nodeId,setId
CALL algo.unionFind('User', 'FRIEND', {write:true, partitionProperty:"partition"})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis
name | partition
Alice | 0
Charles | 0
Bridget | 0
Michael | 4
Doug | 4
Mark | 4
The results show that we have two distinct groups of users with no link between them. The first group has the members Alice, Charles, and Bridget; the second group consists of Michael, Doug, and Mark.
MATCH (u:User)
RETURN distinct(u.partition) AS partition, count(*) AS size_of_partition ORDER BY size_of_partition DESC
Weighted version:
If you specify the property that holds the weight (weightProperty) and a threshold, two nodes are only considered connected if the weight of the relationship between them exceeds the threshold; otherwise the relationship is ignored.
CALL algo.unionFind.stream('User', 'FRIEND', {weightProperty:'weight', defaultValue:0.0, threshold:1.0})
YIELD nodeId,setId
CALL algo.unionFind('User', 'FRIEND', {write:true, partitionProperty:"partition",weightProperty:'weight', defaultValue:0.0, threshold:1.0})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis
name | partition
Alice | 0
Charles | 0
Bridget | 1
Michael | 4
Doug | 4
Mark | 4
We can observe that because the weight of the relationship between Bridget and Alice is only 0.5, the relationship was ignored, so Bridget ends up in her own partition.
Syntax
CALL algo.unionFind(label:String, relationship:String, {threshold:0.42,
defaultValue:1.0, write: true, partitionProperty:'partition',weightProperty:'weight'})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis
 finds connected partitions and potentially writes back to the node as a property partition.
name | type | default | optional | description
label | string | null | yes | label to load from the graph; if null, load all nodes
relationship | string | null | yes | relationship type to load from the graph; if null, load all relationships
weightProperty | string | null | yes | property name that contains the weight; if null, treats the graph as unweighted. Must be numeric.
write | boolean | true | yes | if the result should be written back as a node property
partitionProperty | string | 'partition' | yes | property name written back; holds the id of the partition the node belongs to
threshold | float | null | yes | threshold on the weight; relationships with a weight below this value are ignored
defaultValue | float | null | yes | default value of the weight in case it is missing or invalid
name | type | description
nodes | int | number of nodes considered
setCount | int | number of partitions found
loadMillis | int | milliseconds for loading data
computeMillis | int | milliseconds for running the algorithm
writeMillis | int | milliseconds for writing result data back
CALL algo.unionFind.stream(label:String, relationship:String, {weightProperty:'weight', threshold:0.42, defaultValue:1.0})
YIELD nodeId, setId - yields a setId for each node id
name | type | default | optional | description
label | string | null | yes | label to load from the graph; if null, load all nodes
relationship | string | null | yes | relationship type to load from the graph; if null, load all relationships
weightProperty | string | null | yes | property name that contains the weight; if null, treats the graph as unweighted. Must be numeric.
threshold | float | null | yes | threshold on the weight; relationships with a weight below this value are ignored
defaultValue | float | null | yes | default value of the weight in case it is missing or invalid
name | type | description
nodeId | int | node id
setId | int | partition id
Strongly Connected Components
SCC is a class of algorithms for finding groups of nodes where each node is reachable from every other node in the group. The strongly connected components of an arbitrary directed graph form a partition into subgraphs that are themselves strongly connected. There are several algorithms to compute the SCC.
History, Explanation
Decomposing a directed graph into its strongly connected components is a classic application of depth-first search. The problem of finding connected components is at the heart of many graph applications. Generally speaking, the connected components of a graph correspond to different classes of objects. The first linear-time algorithm for strongly connected components is due to Tarjan (1972).
When to use it / use cases
SCC algorithms can be used as a first step in many graph algorithms that work only on strongly connected graphs. In social networks, groups of people are often strongly connected (for example, students of a class or members of any other common group). Many people in these groups like some common pages or play common games. SCC algorithms can be used to find such groups and suggest the commonly liked pages or games to the people in the group who have not yet liked those pages or played those games.
Algorithm explanation on simple sample graph
A directed graph is strongly connected if there is a path between all pairs of vertices. This algorithm treats the graph as directed, so the direction of the relationships matters, and a strongly connected component exists only if there are relationships between the nodes in both directions.
CREATE (nAlice:User {id:'Alice'})
,(nBridget:User {id:'Bridget'})
,(nCharles:User {id:'Charles'})
,(nDoug:User {id:'Doug'})
,(nMark:User {id:'Mark'})
,(nMichael:User {id:'Michael'})
CREATE (nAlice)-[:FOLLOW]->(nBridget)
,(nAlice)-[:FOLLOW]->(nCharles)
,(nMark)-[:FOLLOW]->(nDoug)
,(nMark)-[:FOLLOW]->(nMichael)
,(nBridget)-[:FOLLOW]->(nMichael)
,(nDoug)-[:FOLLOW]->(nMark)
,(nMichael)-[:FOLLOW]->(nAlice)
,(nAlice)-[:FOLLOW]->(nMichael)
,(nBridget)-[:FOLLOW]->(nAlice)
,(nMichael)-[:FOLLOW]->(nBridget)
CALL algo.scc('User','FOLLOW', {write:true,partitionProperty:'partition'})
YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize
name | partition
Alice | 1
Bridget | 1
Michael | 1
Charles | 0
Doug | 2
Mark | 2
We can see that there are two strongly connected components with more than one member in our sample graph. The first and biggest component has the members Alice, Bridget, and Michael; the second component has Doug and Mark.
MATCH (u:User)
RETURN distinct(u.partition) AS partition, count(*) AS size_of_partition ORDER BY size_of_partition DESC LIMIT 1
Syntax
CALL algo.scc(label:String, relationship:String, {write:true,partitionProperty:'partition'})
YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize
 finds strongly connected partitions and potentially writes back to the node as a property partition.
name | type | default | optional | description
label | string | null | yes | label to load from the graph; if null, load all nodes
relationship | string | null | yes | relationship type to load from the graph; if null, load all relationships
write | boolean | true | yes | if the result should be written back as a node property
partitionProperty | string | 'partition' | yes | property name written back to
name | type | description
setCount | int | number of partitions found
maxSetSize | int | number of members in the biggest partition
minSetSize | int | number of members in the smallest partition
loadMillis | int | milliseconds for loading data
computeMillis | int | milliseconds for running the algorithm
writeMillis | int | milliseconds for writing result data back
Implementations
algo.scc.tarjan
- original recursive Tarjan implementation
algo.scc.tunedTarjan
- also a recursive Tarjan implementation
algo.scc.iterative
- iterative adaption of the Tarjan algorithm
algo.scc.multistep
- parallel SCC algorithm
Implementers Section
Algorithm Procedures API Discussion
We would like to discuss design of the procedure API for algorithms. There are several concerns:
Naming

- simple procedure names
- consistent spelling and casing
- discriminate between write-back and result-streaming procedures with a .stream suffix
- the default algorithm name should be the best implementation
- other implementations can be provided via suffixes
Input Graph Data

- load via label/relationship-type selection and optional weights/values
- load via Cypher statements
- the requirements for node ids, degrees, weights, relationships etc. come from the algorithm
Algorithm parameters

- start nodes
- iterations, damping factor
- concurrency, batch sizes
- thresholds
- write-back, write-back properties
IdMapping

- original node ids from the graph are mapped to a consecutive id range to be used by the algorithm
- this allows simple array offsets to be used within the computation
- this also allows mapping of larger id ranges into the computational range of the algorithm
Result parameters

- output statistics: number of nodes and relationships touched
- time for loading, processing, and writing
- property written to
- errors
- data summary, e.g.
  - min, max, avg of a centrality, rank or weight measure
  - output of all clusters, their sizes and perhaps the root node
- alternatively streaming back node ids and the computed value (score)
Constraints:

- there is no overloading, so we need differing names
- there are default procedure parameters, so we should require only what is really needed to run
- we should use sensible defaults
- result values/columns have to be predeclared; there is no inheritance or generics
Suggestions:
Naming: 2 variants per algorithm

- returning summary statistics after the run: algo.pageRank(…)
- returning a stream of data: algo.pageRank.stream(…)
The default should be the one with statistics, as streaming back results from a medium-sized or large graph will overwhelm the client.
Parameters

- node selector or null: label(s) or a Cypher statement, e.g. ":Person|User" / "MATCH (u:User) RETURN id(u) as id, u.name as value"
- relationship selector or null: relationship type(s) or a Cypher statement, e.g. ":KNOWS|FRIEND" / "MATCH (u1:User)-[r:KNOWS|FRIEND]-(u2:User) RETURN id(u1) as source, id(u2) as target, r.strength as weight"
- configuration map with optional parameters but sensible defaults: {iterations: 5, write:true, nodeValue:"value", nodeWeight:"weight", relationshipWeight:"weight", threshold: 0.85}
We could have the configuration map come from a function (per algorithm?) that handles validation and returns a correct map object for that algorithm? Or just pass the plain map?
Example:
call algo.pageRank("Person","KNOWS",{iterations:10, write:true, [property:"pagerank"]});
call algo.unionFind.stream('
MATCH (p:Person) RETURN id(p) as id
','
MATCH (p1:Person)-[:KNOWS]-(p2:Person) RETURN id(p1) as source, id(p2) as target
', {}) YIELD nodeId, clusterId
API Documentation
Development Notes
Note

When implementing a procedure that yields any results (is not void) and writes back to the database, make sure to consume the Result after running the algorithm call!

Algorithms are executed within a transaction that is opened and closed by Cypher. The transaction is only marked successful if the results are consumed. Closing the Result directly will fail and roll back any open transactions, and thus will revert all write-back operations.
The Model
The basic idea behind our model is to have a fast cache for the topology of the graph containing only relevant nodes and relationships, plus their weights. It implicitly maps the (long) node ids to an internal integer id (32-bit) in ascending order, which ensures that no id gets bigger than the maximum node count. This approach allows us to use primitive arrays as containers, for example.
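The mapping idea can be sketched as follows (a hypothetical IdMap class in Python for illustration; the plugin's actual IdMapping is a Java component with a different API):

```python
class IdMap:
    """Map arbitrary (long) Neo4j node ids to a dense 0..n-1 range so
    that algorithms can use plain arrays instead of hash maps."""

    def __init__(self, original_ids):
        self.forward = {}    # original id -> internal id
        self.backward = []   # internal id -> original id
        for oid in original_ids:
            if oid not in self.forward:
                self.forward[oid] = len(self.backward)
                self.backward.append(oid)

    def to_mapped(self, original_id):
        return self.forward[original_id]

    def to_original(self, internal_id):
        return self.backward[internal_id]

# sparse, large node ids become dense internal ids 0, 1, 2
ids = IdMap([1000004, 42, 987654321])
print(ids.to_mapped(987654321))   # 2
print(ids.to_original(0))         # 1000004
```

With internal ids dense in [0, nodeCount), per-node state such as ranks or distances can live in a primitive array indexed by internal id.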
Graph Interface
The Graph interface specifies methods for iterating over all nodes of the graph as well as iterating over the relationships of a given node in the form of forEach…(..) methods. The Graph knows the nodeCount and degree of each node and can map a nodeId to an internalId and vice versa.
The interface includes methods to:
- return a collection of iterables over every node, partitioned by a given batch size
- return true iff the nodeId is mapped, otherwise false
- get the degree of a node in a given direction
- iterate over each nodeId
- iterate over all relationships of a node
- iterate over all weighted relationships of a node
- get the count of nodes
- iterate over all nodes
- map a Neo4j nodeId to an internal nodeId
- map an internal nodeId back to the original nodeId
Note

Currently we have 3 different implementations aiming for different goals like performance, memory consumption and accuracy. 
HeavyGraph
This implementation utilizes an int matrix for storing connections between vertices. It has a higher memory consumption but performs basic calculations on the graph around 3 times faster than the memory-efficient implementation. Furthermore, the number of edges per node is limited only by the maximum array size (about 2 billion) of the JVM. Relationships can be added in arbitrary order.
LightGraph
This implementation takes about 3 times less heap due to a more intelligent memory layout. The drawbacks are a slightly higher evaluation time and that the data cannot be loaded in parallel (yet).
GraphView
The View is just a simple wrapper around the Neo4j Kernel API. It has been implemented for tests and as a benchmark baseline.
Import
A fair amount of the work is fetching the relevant data from Neo4j and importing it into our model. Fortunately, we can use a multithreaded approach to load the graph. This is part of the current implementation.
The abstract GraphFactory
specifies a constructor and the build
method.
It is responsible for creating the Graph using the Neo4j Kernel API.
Loading the Graph is done by using the GraphLoader which implements a simple fluent Builder Pattern.
final Graph graph = new GraphLoader( graphDatabaseAPI )
.setLabel( .. )
.setRelation( .. )
.setProperty( .. )
.setThreadPool( .. )
.load( FactoryImpl.class );
PageRank
Implementation Details
PageRank is Google's well-known ranking algorithm for web search.
Progress

- ✓ single threaded implementation
- ✓ tests
- ✓ simple benchmark
- ✓ implement procedure
- ✓ benchmark on bigger graphs
- ✓ parallelization
- ✓ evaluation
Requirements

- NodeIterator
- Incoming Relationships
- Outgoing Degrees
Data structures involved
Our current approach needs one double array for storing ranks.
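A single-threaded sketch of this layout in Python (illustrative only; dangling nodes and convergence checks are omitted, and the real implementation works on the mapped internal ids):

```python
def page_rank(out_edges, iterations=50, damping=0.85):
    """Power-iteration PageRank keeping every rank in one flat array,
    mirroring the single double-array layout described above."""
    n = len(out_edges)
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        incoming = [0.0] * n
        for src, targets in enumerate(out_edges):
            if targets:
                share = ranks[src] / len(targets)
                for dst in targets:
                    incoming[dst] += share
        ranks = [(1 - damping) / n + damping * r for r in incoming]
    return ranks

# nodes 0 and 1 both point at node 2, which points back at node 0,
# so node 2 ends up with the highest rank
ranks = page_rank([[2], [2], [0]])
print(ranks)
```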
ToDo
parallelization
One approach to parallelize PageRank might be to partition the nodes into batches, one for each thread. Nonetheless, we may need to synchronize them at the end of each iteration.
evaluation

Performance tests on different dataset sizes / level of concurrency
Future Improvements

we might scale up the ranks to ints for faster multiplication.
Details
Partition-based parallel PageRank based on "An Efficient Partition-Based Parallel PageRank Algorithm" [1]

Each partition thread has its local array of only the nodes that it is responsible for, not for all nodes. Combined, all partitions hold all page rank scores for every node once. Instead of writing partition files and transferring them across the network (as done in the paper since they were concerned with parallelising across multiple nodes), we use integer arrays to write the results to. The actual score is upscaled from a double to an integer by multiplying it with {@code 100_000}.

To avoid contention by writing to a shared array, we partition the result array.

During execution, the scores arrays are shaped like this:
[ executing partition ] -> [ calculated partition ] -> [ local page rank scores ]

Each single partition writes to a partitioned array, calculating the scores for every receiving partition. A single partition only sees:
[ calculated partition ] -> [ local page rank scores ]

The coordinating thread then builds the transpose of all written partitions from every partition:
[ calculated partition ] -> [ executing partition ] -> [ local page rank scores ]

This step does not happen in parallel, but does not involve extensive copying. The local page rank scores needn’t be copied, only the partitioning arrays. All in all, {@code concurrency^2} array element reads and assignments have to be performed.

For the next iteration, every partition first updates its scores, in parallel. A single partition now sees:
[ executing partition ] -> [ local page rank scores ]

That is, a list of all calculated scores for itself, grouped by the partition that calculated these scores. This means most of the synchronization happens in parallel, too.

Partitioning is not done by number of nodes but by the accumulated degree – as described in "Fast Parallel PageRank: A Linear System Approach" [2]. Every partition should have about the same number of relationships to operate on.

This is done to avoid having one partition with super nodes and instead have all partitions run in approximately equal time. Smaller partitions are merged down until we have at most {@code concurrency} partitions, in order to batch partitions and keep the number of threads in use predictable/configurable.
[1]: "An Efficient Partition-Based Parallel PageRank Algorithm"
[2]: "Fast Parallel PageRank: A Linear System Approach", https://www.cs.purdue.edu/homes/dgleich/
Betweenness Centrality
Implementation Details
In graph theory, betweenness centrality is a measure of centrality in a graph based on shortest paths. For every pair of vertices in a connected graph, there exists at least one shortest path between the vertices such that either the number of edges that the path passes through (for unweighted graphs) or the sum of the weights of the edges (for weighted graphs) is minimized. The betweenness centrality for each vertex is the number of these shortest paths that pass through the vertex.
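For the unweighted case, the standard way to compute this is Brandes' algorithm (which the plugin's algo.betweenness is based on, per the Details below). A Python sketch for illustration: one BFS per source, followed by a backward dependency-accumulation pass.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm: for every source, a BFS counts shortest
    paths (sigma), then dependencies (delta) are accumulated backwards."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        pred = {v: [] for v in adj}     # predecessors on shortest paths
        sigma = {v: 0 for v in adj}     # number of shortest s-v paths
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                     # accumulate in reverse BFS order
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# path graph a-b-c: every shortest path between a and c runs through b
print(betweenness({'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}))
```

This counts each undirected pair in both directions; halve the scores if the undirected convention is wanted.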
Progress

- ✓ adapt APOC procedure to algorithm API
- ✓ implement procedure
- ✓ tests
- ✓ edge case tests
- ✓ simple benchmark
- ✓ benchmark on bigger graphs
- ✓ parallelization
- ✓ evaluation
- ✓ documentation
Details
algo.betweenness
- implementation of the Brandes BC algorithm with a node partitioning extension
- if the concurrency parameter is set (and > 1), ParallelBetweennessCentrality is used
- ParallelBC spawns N (given by the concurrency parameter) concurrent threads for the calculation, where each one calculates the BC for one node at a time
algo.betweenness.exp1
- Brandes-like algorithm which uses successor sets instead of predecessor sets
- the algorithm is based on Brandes' definition but with some changes regarding the dependency-accumulation step
- http://cassmt.pnnl.gov/docs/pubs/georgiatechlbnlpnnlfastbcmtaap2009.pdf
Closeness Centrality
Implementation Details
Closeness centrality of a node is a measure of centrality in a network, calculated from the sum of the lengths of the shortest paths between the node and all other nodes in the graph. Thus the more central a node is, the closer it is to all other nodes.
Details

- use org.neo4j.graphalgo.impl.msbfs.MultiSourceBFS for the BFS
- MS-BFS gives the depth and the number of sources
- in this scheme the farness can be calculated as follows:
farness(v) = farness(v) + numberOfSources(v) * depth(v)
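The farness accumulation can be illustrated with a plain per-source BFS (the MS-BFS batches many sources into one traversal, but the arithmetic is the same); closeness here is taken as the reciprocal of farness, one common convention:

```python
from collections import deque

def closeness(adj):
    """Closeness per node: sum the BFS depths to all reachable nodes
    (the farness), then take the reciprocal."""
    result = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        farness = 0
        while queue:
            v = queue.popleft()
            farness += dist[v]          # farness(s) += depth(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        result[s] = 1.0 / farness if farness else 0.0
    return result

# the hub of a star graph is closest to everything
adj = {'hub': ['a', 'b', 'c'], 'a': ['hub'], 'b': ['hub'], 'c': ['hub']}
print(closeness(adj))
```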
Minimum Weight Spanning Tree
Implementation Details
A Minimum Weight Spanning Tree is an acyclic undirected graph which consists of all connected nodes and whose relationship weights are minimal. It can be used to cluster the graph (similar to k-means). Our current approach uses Prim's algorithm to calculate the MST starting at a given node. This might not cover the whole graph, but if the nodes are connected, the MST is always identical regardless of the node at which the execution starts.
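Prim's algorithm in a nutshell, as a Python sketch (illustrative; the plugin's version runs on its internal graph representation with an int-based Fibonacci heap rather than a binary heap):

```python
import heapq

def prim_mst(adj, start):
    """Prim's algorithm: grow a tree from `start` by always adding the
    cheapest edge that connects a not-yet-visited node to the tree."""
    visited = {start}
    edges = [(w, start, v) for v, w in adj[start]]
    heapq.heapify(edges)
    tree = []
    while edges:
        w, u, v = heapq.heappop(edges)
        if v in visited:
            continue                     # edge would close a cycle
        visited.add(v)
        tree.append((u, v, w))
        for nxt, weight in adj[v]:
            if nxt not in visited:
                heapq.heappush(edges, (weight, v, nxt))
    return tree

# undirected weighted graph as adjacency lists of (neighbor, weight)
adj = {
    'a': [('b', 1), ('c', 4)],
    'b': [('a', 1), ('c', 2)],
    'c': [('a', 4), ('b', 2)],
}
print(prim_mst(adj, 'a'))   # [('a', 'b', 1), ('b', 'c', 2)]
```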
Progress

- ✓ single threaded implementation
- ✓ tests
- ✓ simple benchmark
- ✓ implement procedure
- ✓ benchmark on bigger graphs
- [?] parallelization
- ❏ evaluation
Requirements
BothRelationshipIterator & Weights
Data structures involved
- org.neo4j.graphalgo.core.utils.container.UndirectedTree as a container for efficient splitting and iteration
- an int-based Fibonacci heap priority queue
- a container for the visited state
ToDo
benchmark
Implement benchmark on big graph
evaluation

Performance tests on different dataset sizes / level of concurrency
Single Shortest Path
Implementation Details
A Single Source Shortest Path algorithm calculates a path between a pair of nodes whose summed weights are minimal. A commonly used algorithm is Dijkstra's. All Pairs Shortest Path, on the other hand, calculates a shortest path forest containing all paths between the nodes in the graph. Algorithms to solve this are Floyd-Warshall or parallel Johnson's algorithm.
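The core of Dijkstra's algorithm, sketched in Python over the Loc sample graph from the user guide above (ROAD costs only, illustrative; the plugin uses an int-based Fibonacci heap on its internal ids):

```python
import heapq

def dijkstra(adj, source):
    """Dijkstra's single source shortest path: repeatedly settle the
    unvisited node with the smallest tentative distance."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float('inf')):
            continue                     # stale queue entry, skip
        for w, weight in adj[v]:
            nd = d + weight
            if nd < dist.get(w, float('inf')):
                dist[w] = nd
                heapq.heappush(heap, (nd, w))
    return dist

# mirrors the Loc sample graph above (ROAD relationships only)
adj = {
    'A': [('B', 50), ('C', 50), ('D', 100)],
    'B': [('D', 40)],
    'C': [('D', 40), ('E', 80)],
    'D': [('E', 30), ('F', 80)],
    'E': [('F', 40)],
    'F': [],
}
print(dijkstra(adj, 'A'))   # {'A': 0, 'B': 50, 'C': 50, 'D': 90, 'E': 120, 'F': 160}
```

The minimum A-to-F cost of 160 (A-B-D-E-F or A-C-D-E-F) matches what algo.shortestPath reports as totalCost on the sample graph.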
Progress

- ✓ single threaded implementation
- ✓ tests
- ✓ simple benchmark
- ✓ implement procedure
- ❏ benchmark on bigger graphs
- ❏ parallelization
- ❏ evaluation
Requirements
(Outgoing)RelationshipIterator & Weights
Data structures involved
- an int-based Fibonacci heap which implements an efficient priority queue
- different containers for costs / visited state / paths
ToDo
benchmark
Implement benchmark on big graph
parallelization
Parallelizing All Pairs Shortest Path might be easy using Dijkstra on each thread for a different node. An easy approach for Single Source SP may use two threads, one starting at the start node and one at the end node; the first to finish wins. [More](https://www.cs.rice.edu/~vs3/comp422/lecturenotes/comp422lec24s08v2.pdf)
evaluation

Performance tests on different dataset sizes / level of concurrency
Details
algo.shortestPath
- Dijkstra single source shortest path algorithm
- the algorithm computes a shortest path on weighted graphs between a given start and target node id; it returns result tuples of [nodeId, distance] for each node in the path
- there may be more than one shortest path; the algorithm returns only one
- if initialized with a non-existing weight property and a defaultWeight of 1.0, the resulting path is minimal in terms of the count of nodes in the path
algo.shortestPath.deltaStepping
- TODO naming!?
- parallel non-negative single source shortest path algorithm for weighted graphs
- can be tweaked using the delta parameter which controls the degree of concurrency
- returns the minimum distance to all other nodes
- if initialized with a non-existing weight property and a defaultWeight of 1.0, its result can be interpreted as the number of nodes to reach the target
algo.allShortestPaths.stream
- returns a stream of source-target node to distance tuples for each pair of nodes
- since all node ids have already been ordered by the id mapping, we can use an integer counter instead of a queue which simply counts up for each startNodeId as long as it is < nodeCount
- each thread tries to take one int from the counter at a time and starts its computation on it
- the {@link AllShortestPaths#concurrency} value determines the count of workers that should be spawned
- due to the high memory footprint the result set would have, we emit each result into a blocking queue; the result stream takes elements from the queue while the workers add elements to it
- the result stream is limited by N^2; if the stream gets closed prematurely, the workers get closed too
- write-back is not supported!
Label Propagation
Implementation Details
Label Propagation is a graph partitioning algorithm already implemented in the current APOC procedures.
Progress

- ✓ adapt APOC procedure to algorithm API
- ✓ single threaded implementation
- ✓ tests
- ❏ edge case tests
- ✓ implement procedure
- ✓ simple benchmark
- ✓ benchmark on bigger graphs
- ✓ parallelization
- ✓ evaluation
- ✓ documentation
TODO

- adapt the existing procedure to the algorithm API
Weakly Connected Components
Implementation Details
Connected Components, or UnionFind, finds sets of connected nodes where each node is reachable from any other node in the same set. One implementation also evaluates a predicate on each relationship, which allows partitioning of the graph based on relationships and properties.
Progress

- ✓ single threaded implementation
- ✓ tests
- ✓ simple benchmark
- ❏ implement procedure
- ❏ benchmark on bigger graphs
- ❏ parallelization
- ❏ evaluation
Requirements
AllRelationshipIterator & Weights
Data structures involved
We use a disjoint-set structure (DSS) which is based on a parent-array tree. The DSS can be used to efficiently ask whether two nodes are reachable from each other. [More](https://en.wikipedia.org/wiki/Disjoint-set_data_structure)
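The parent-array idea can be sketched in Python (illustration only; the plugin's DisjointSetStruct is a Java structure over preallocated primitive arrays):

```python
class DisjointSet:
    """Parent-array disjoint-set structure with path compression and
    union by rank, answering "are these two nodes in the same set?"."""

    def __init__(self, size):
        self.parent = list(range(size))  # each node starts as its own root
        self.rank = [0] * size

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # compress path
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra             # attach smaller tree under bigger
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1

dss = DisjointSet(5)
dss.union(0, 1)
dss.union(3, 4)
print(dss.find(0) == dss.find(1))   # True
print(dss.find(1) == dss.find(3))   # False
```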
ToDo
parallelization
One approach to parallelize UnionFind might be relationship partitioning, where each thread performs the execution into its own DSS instance on a subset of relationships. Each thread then calculates a distinct set of unions. Later we can merge the DSS instances pairwise, which can also be performed in parallel. Nonetheless, the memory consumption might be high due to the preallocated array in the DSS. We could also switch to a growing container if this is a problem.
evaluation

Performance tests on different dataset sizes / level of concurrency
Details

- writes a cluster id to each node representing a connected component in which each node is reachable from any other node
algo.unionFind
- if a threshold configuration parameter is supplied, only relationships with a property value higher than the threshold are merged
algo.unionFind.exp1
- parallel UnionFind using the ExecutorService only
- based on the idea that the DisjointSetStruct can be built using just a partition of the nodes, which are then merged pairwise
- the implementation is based on a queue which acts as a buffer for each computed DSS; as long as there are more elements in the queue, the algorithm takes two, merges them and adds the result back to the queue until only one element remains
algo.unionFind.exp2
- like exp1, but the resulting DSS of each node partition is merged by the ForkJoin pool, while the calculation of the DSS is done by the ExecutorService
algo.unionFind.exp3
- calculation and merge both use the ForkJoin pool
Strongly Connected Components
Implementation Details
SCC is a class of algorithms for finding groups of nodes where each node is reachable from every other node in the group. There are several algorithms to compute the SCC. Our current implementation (still in the graphtest project) implements Tarjan's SCC algorithm.
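Tarjan's algorithm in a compact recursive Python sketch (illustration only; the recursion depth makes this impractical for large graphs, which is why the iterative and multistep variants below exist):

```python
def tarjan_scc(adj):
    """Tarjan's algorithm: one depth-first search assigns every node an
    index and a low-link; nodes sharing a low-link root form one SCC."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = [0]

    def strongconnect(v):
        index[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in adj[v]:
            if w not in index:
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:       # v is the root of an SCC
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            sccs.append(component)

    for v in adj:
        if v not in index:
            strongconnect(v)
    return sccs

# a<->b form a cycle, c only hangs off it
print(tarjan_scc({'a': ['b'], 'b': ['a', 'c'], 'c': []}))
```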
Progress

- ✓ implement procedure
- ✓ tests
- ✓ edge case tests
- ✓ simple benchmark
- ✓ benchmark on bigger graphs
- ✓ parallelization
- ✓ evaluation
- ✓ documentation
Details
algo.scc.tarjan
- original recursive Tarjan implementation
- the result is a cluster id on all nodes
- each cluster is a SCC
- builds sets of node ids which represent a strongly connected component within the graph; also calculates the minimum and maximum set size as well as the count of distinct sets
algo.scc.tunedTarjan
- also a recursive Tarjan implementation
- the result is a cluster id on all nodes
- https://pdfs.semanticscholar.org/61db/6892a92d1d5bdc83e52cc18041613cf895fa.pdf
algo.scc.iterative
- iterative adaption of the Tarjan algorithm
- the result is a cluster id on all nodes
- http://code.activestate.com/recipes/578507stronglyconnectedcomponentsofadirectedgraph/
algo.scc.multistep
- parallel SCC algorithm
- composition of several SCC algorithms (FWBW, coloring, Tarjan)
- uses FWBW + coloring to find big SCCs
- starts simple Tarjan once the cutoff threshold is reached
- http://www.sandia.gov/~srajama/publications/BFS_and_Coloring.pdf