Introduction
This library provides efficiently implemented, parallel versions of common graph algorithms for Neo4j 3.x, exposed as Cypher procedures.
Releases are available here: https://github.com/neo4j-contrib/neo4j-graph-algorithms/releases
Algorithms
Centralities
These algorithms determine the importance of distinct nodes in a network.

Page Rank (algo.pageRank)
Betweenness Centrality (algo.betweenness)
Closeness Centrality (algo.closeness)
Community Detection
These algorithms evaluate how a group is clustered or partitioned, as well as its tendency to strengthen or break apart.

Louvain (algo.louvain)
Label Propagation (algo.labelPropagation)
(Weakly) Connected Components (algo.unionFind)
Strongly Connected Components (algo.scc)
Triangle Count / Clustering Coefficient (algo.triangleCount)
Path Finding
These algorithms help find the shortest path or evaluate the availability and quality of routes.

Minimum Weight Spanning Tree (algo.mst)
All Pairs and Single Source Shortest Path (algo.shortestPath, algo.allShortestPaths)
These procedures work either on the whole graph or on a subgraph filtered by label and relationship-type. You can also apply filtering and projection using Cypher queries.
We’d love your feedback, so please try out these algorithms and let us know how well they work for your use case. Also please note anything that is missing from the installation instructions or documentation.
Please raise GitHub issues for anything you encounter, or join the neo4j-users Slack group and ask in the #neo4j-graph-algorithm channel.
Installation
Download graph-algorithms-algo-*.jar from the matching release and copy it into your $NEO4J_HOME/plugins directory.
Because the algorithms use the lower-level Kernel API to read from and write to Neo4j, you also have to enable them in the configuration (for security reasons):
dbms.security.procedures.unrestricted=algo.*
Once you’ve done that, restart Neo4j and execute the query CALL algo.list() to see a list of all the algorithms.
Usage
These algorithms are exposed as Neo4j procedures. You can call them directly from Cypher in the Neo4j Browser, from cypher-shell, or from your client code.
For most algorithms we provide two procedures:

one named algo.<name> that writes results back to the graph as node properties and reports statistics;
another named algo.<name>.stream that returns a stream of data, e.g. node ids and computed values.
For large graphs the streaming procedure might return millions or billions of results; that’s why it is often more convenient to store the results of the algorithm and use them in later queries.
We can project the graph we want to run algorithms on with either label and relationship-type projection or Cypher projection.
The projected graph model is separate from Neo4j’s stored graph model, which allows a fast cache of the graph’s topology containing only the relevant nodes, relationships, and, where configured, weights. For now the projected graph model does not support multiple relationships between a single pair of nodes. During projection we allow only one relationship per direction (in, out) between a pair of nodes in the directed case, but two relationships in the undirected (BOTH) case.
Label and relationship-type projection
We can project the subgraph we want to run the algorithm on using the label parameter to describe nodes and the relationship-type parameter to describe relationships.
The general call syntax is:
CALL algo.<name>([label],[relationshipType],{config})
e.g. Page Rank on DBpedia (11M nodes, 116M relationships):
CALL algo.pageRank('Page','Link',{iterations:5, dampingFactor:0.85, write: true, writeProperty:'pagerank'});
// YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty
CALL algo.pageRank.stream('Page','Link',{iterations:5, dampingFactor:0.85})
YIELD node, score
RETURN node.title, score
ORDER BY score DESC LIMIT 10;
Cypher projection
If label and relationship-type projection is not selective enough to describe the subgraph we want to run the algorithm on, we can use Cypher statements to project subsets of our graph.
Use a node-statement instead of the label parameter and a relationship-statement instead of the relationship-type, and set graph:'cypher' in the config.
Relationships described in the relationship-statement will only be projected if both source and target nodes are described in the node-statement; all other relationships are ignored during projection.
We can also return a property value or weight (according to our config) in addition to the ids from these statements.
Cypher projection allows us to be more expressive in describing the subgraph we want to analyse, but projecting the graph may take longer with more complex Cypher queries.
CALL algo.pageRank(
'MATCH (p:Page) RETURN id(p) as id',
'MATCH (p1:Page)-[:Link]->(p2:Page) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', iterations:5, write: true});
Cypher projection can also be used to project a virtual (non-stored) graph. Here is an example of how to project an undirected graph of people who visited the same web page, and run the Louvain community detection algorithm on it using the number of commonly visited web pages between pairs of people as the relationship weight.
CALL algo.louvain(
'MATCH (p:Person) RETURN id(p) as id',
'MATCH (p1:Person)-[:Visit]->(:Page)<-[:Visit]-(p2:Person)
RETURN id(p1) as source, id(p2) as target, count(*) as weight',
{graph:'cypher', iterations:5, write: true});
Graph Loading
As it can take some time to load large graphs into the algorithm data structures, you can pre-load graphs and later refer to them by name in several graph algorithms. After use they can be removed from memory to free resources.
// Load graph
CALL algo.graph.load('mygraph','Label','REL_TYPE',{graph:'heavy',..other config...})
YIELD name, graph, direction, undirected, sorted, nodes, loadMillis, alreadyLoaded,
nodeWeight, relationshipWeight, nodeProperty, loadNodes, loadRelationships;
// Info on loaded graph
CALL algo.graph.info('mygraph')
YIELD name, type, exists, removed, nodes;
// Use graph
CALL algo.pageRank(null,null,{graph:'mygraph',...})
// Remove graph
CALL algo.graph.remove('mygraph')
YIELD name, type, exists, removed, nodes;
Building Locally
Currently aiming at Neo4j 3.x (with a branch per version):
git clone https://github.com/neo4j-contrib/neo4j-graph-algorithms
cd neo4j-graph-algorithms
git checkout 3.3
mvn clean install
cp algo/target/graph-algorithms-*.jar $NEO4J_HOME/plugins/
$NEO4J_HOME/bin/neo4j restart
Yelp
Yelp Open Dataset
Yelp.com has been running the Yelp Dataset challenge for about five years now: a competition that encourages people to explore and research Yelp’s open dataset. In the tenth round of the challenge the dataset contained almost 5 million reviews from over 1.1 million users on over 150,000 businesses from 12 metropolitan areas. Over the years the dataset has become very popular, with hundreds of academic papers written about it. It has well-structured and highly relational data and is therefore a great showcase for Neo4j and graph algorithms on a realistic dataset. We will show how to use graph algorithms on a social network of friends and how to create and analyse an inferred graph (for example, projecting a review co-occurrence graph or similarity between users based on their reviews).
Check out also past winners and their work.
Data
You can get the dataset by filling out their form and downloading the data in JSON format. We will use the APOC plugin to help us with importing and batching the data into Neo4j. There are 6 JSON files available (see the detailed documentation). Photos and check-ins are not the topic of our analysis, so they will not be imported. From the other files we will import a knowledge graph of Yelp’s world, containing data about:

156,639 businesses
1,005,693 tips from users about businesses
4,736,897 reviews of businesses by users
9,489,337 users total
35,444,850 friend relationships
Depending on your setup, the import might take some time, as user.json contains data about a 10-million-person social network of friends. While review.json is even bigger in size, it is mostly made up of text representing the actual reviews, so it imports faster. We do not need the actual text, only metadata about the reviews, e.g. who wrote a review and how they rated a certain business, so the text will not be imported.
Graph Model
The graph consists of nodes labeled User, which can have FRIEND relationships with other users. Users also write reviews and tips about businesses. We store all the metadata (location of the business, etc.) as properties of nodes, except for the categories of the businesses, which are represented by separate nodes labeled Category.
A graph model always depends on the application we have in mind for it. Our application is to analyse (inferred) networks with graph algorithms. If we were to use our graph as a recommendation engine, we might construct a different graph model. Check out this great guide or this educational video on how you can use Neo4j as a recommendation engine.
Import
CALL apoc.schema.assert(
{Category:['name']},
{Business:['id'],User:['id'],Review:['id']});
CALL apoc.periodic.iterate("
CALL apoc.load.json('file:///dataset/business.json') YIELD value RETURN value
","
MERGE (b:Business{id:value.business_id})
SET b += apoc.map.clean(value, ['attributes','hours','business_id','categories','address','postal_code'],[])
WITH b,value.categories as categories
UNWIND categories as category
MERGE (c:Category{id:category})
MERGE (b)-[:IN_CATEGORY]->(c)
",{batchSize: 10000, iterateList: true});
CALL apoc.periodic.iterate("
CALL apoc.load.json('file:///dataset/tip.json') YIELD value RETURN value
","
MATCH (b:Business{id:value.business_id})
MERGE (u:User{id:value.user_id})
MERGE (u)-[:TIP{date:value.date,likes:value.likes}]->(b)
",{batchSize: 20000, iterateList: true});
CALL apoc.periodic.iterate("
CALL apoc.load.json('file:///dataset/review.json')
YIELD value RETURN value
","
MERGE (b:Business{id:value.business_id})
MERGE (u:User{id:value.user_id})
MERGE (r:Review{id:value.review_id})
MERGE (u)-[:WROTE]->(r)
MERGE (r)-[:REVIEWS]->(b)
SET r += apoc.map.clean(value, ['business_id','user_id','review_id','text'],[0])
",{batchSize: 10000, iterateList: true});
CALL apoc.periodic.iterate("
CALL apoc.load.json('file:///dataset/user.json')
YIELD value RETURN value
","
MERGE (u:User{id:value.user_id})
SET u += apoc.map.clean(value, ['friends','user_id'],[0])
WITH u,value.friends as friends
UNWIND friends as friend
MERGE (u1:User{id:friend})
MERGE (u)-[:FRIEND]-(u1)
",{batchSize: 100, iterateList: true});
Networks
Social network
A social network is a theoretical construct useful in the social sciences to study relationships between individuals, groups, organizations, or even entire societies. An axiom of the social network approach to understanding social interaction is that social phenomena should be primarily conceived and investigated through the properties of relationships between and within nodes, instead of the properties of the nodes themselves. Precisely because many different types of relations, singular or in combination, form these network configurations, network analytics are useful to a broad range of research enterprises.
Social network analysis is the process of investigating social structures through the use of networks and graph theory. It characterizes networked structures in terms of nodes (individual actors, people, or things within the network) and the ties, edges, or links (relationships or interactions) that connect them. Examples of social structures commonly visualized through social network analysis include social media networks, meme spread, friendship and acquaintance networks, collaboration graphs, kinship, and disease transmission.
Social network analysis has emerged as a key technique in modern sociology. It has also gained a significant following in anthropology, biology, demography, communication studies, economics, geography, history, information science, organizational studies, political science, social psychology, development studies, sociolinguistics, and computer science.
Yelp’s friendship network is an undirected graph with unweighted FRIEND relationships between users. There are 507,948 users with no friends; they will be ignored in our analysis.
Global graph statistics:
Nodes: 8,981,389
Relationships: 35,444,850
Weakly connected components: 18,512
Nodes in largest WCC: 8,938,630
Edges in largest WCC: 35,420,520
Triangle count:
Average clustering coefficient:
Graph diameter (longest shortest path):
Local graph statistics:
MATCH (u:User)
RETURN avg(apoc.node.degree(u,'FRIEND')) as average_friends,
stdev(apoc.node.degree(u,'FRIEND')) as stdev_friends,
max(apoc.node.degree(u,'FRIEND')) as max_friends,
min(apoc.node.degree(u,'FRIEND')) as min_friends
Average number of friends: 7.47
Standard deviation of friends: 46.96
Minimum count of friends: 1
Maximum count of friends: 14,995
Projecting a review cooccurence graph
We can try to find which businesses are often reviewed by the same users by inferring a co-occurrence network between them. By way of definition, co-occurrence networks are the collective interconnection of nodes based on their paired presence within a specified domain. Our network is generated by connecting pairs of businesses using a set of criteria defining co-occurrence.
The co-occurrence criterion for this network is that any pair of businesses must have at least 5 common reviewers. We save the count of common reviewers as a property of the relationship, to be used as a weight in the community detection analysis. The inferred graph is undirected, as changing the direction of the relationships does not imply any semantic difference. We will limit our network to businesses that have more than 10 reviews and project a co-occurrence relationship between businesses.
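Outside the database, the co-occurrence construction above amounts to inverting the business-to-reviewers mapping and counting shared reviewers per pair of businesses. A minimal sketch in plain Python, with illustrative data shapes (not part of the library):

```python
from collections import Counter
from itertools import combinations

def co_occurrence(reviewers_by_business, threshold=5):
    """Count common reviewers for every pair of businesses and keep the
    pairs whose count exceeds the threshold; the count is the weight."""
    # invert the mapping: reviewer -> set of businesses they reviewed
    businesses_by_reviewer = {}
    for business, reviewers in reviewers_by_business.items():
        for reviewer in reviewers:
            businesses_by_reviewer.setdefault(reviewer, set()).add(business)
    # each reviewer contributes 1 to every pair of businesses they reviewed
    weights = Counter()
    for businesses in businesses_by_reviewer.values():
        for pair in combinations(sorted(businesses), 2):
            weights[pair] += 1
    return {pair: w for pair, w in weights.items() if w > threshold}
```

The Cypher query below performs the same counting with MATCH patterns and COUNT(*), writing the surviving pairs back as weighted relationships.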
CALL apoc.periodic.iterate('
MATCH (b1:Business)
WHERE size((b1)<-[:REVIEWS]-()) > 10 AND b1.city="Las Vegas"
RETURN b1
','
MATCH (b1)<-[:REVIEWS]-(r1)
MATCH (r1)<-[:WROTE]-(u)
MATCH (u)-[:WROTE]->(r2)
MATCH (r2)-[:REVIEWS]->(b2)
WHERE id(b1) < id(b2) AND b2.city="Las Vegas"
WITH b1, b2, COUNT(*) AS weight WHERE weight > 5
MERGE (b1)-[cr:CO_OCCURENT_REVIEWS]-(b2)
ON CREATE SET cr.weight = weight
',{batchSize: 1});
Projecting a review similarity graph
We can try to find similar groups of users by projecting a review similarity network between them. The idea is to start with users that have more than 10 reviews and find all pairs of users who have reviewed more than 10 common businesses. We do this to filter out users without enough data. We could do something similar to filter out users who have reviewed every business (probably a bot, or bored). Once we find pairs of users, we calculate the similarity of their reviews using cosine similarity and only create a relationship if the cosine similarity is greater than 0, sometimes also called hard similarity. We do this so we do not end up with a complete graph, where every pair of users is connected; most community detection algorithms perform poorly on a complete graph. The cosine similarity between a pair of users is saved as a property of the relationship and can be used as a weight in graph algorithms. The projected graph is modeled as undirected, as the direction of the relationship has no semantic value.
Projecting a review similarity graph is often used in recommendations: we calculate which users are similar based on review ratings, so we can recommend to a user what similar users liked.
CALL apoc.periodic.iterate(
"MATCH (p1:User) WHERE size((p1)[:WROTE]>()) > 5 RETURN p1",
"
MATCH (p1)[:WROTE]>(r1)>()<(r2)<[:WROTE](p2)
WHERE id(p1) < id(p2) AND size((p2)[:WROTE]>()) > 10
WITH p1,p2,count(*) as coop, collect(r1.stars) as s1, collect(r2.stars) as s2 where coop > 10
WITH p1,p2, apoc.algo.cosineSimilarity(s1,s2) as cosineSimilarity WHERE cosineSimilarity > 0
MERGE (p1)[s:SIMILAR_REVIEWS](p2) SET s.weight = cosineSimilarity"
, {batchSize:100, parallel:false,iterateList:true});
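For reference, the cosine similarity computed above (here via apoc.algo.cosineSimilarity) reduces to the standard dot-product formula over the two star-rating vectors. A standalone sketch, illustrative rather than the APOC implementation:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length rating vectors:
    dot(a, b) / (|a| * |b|). 1.0 means identical direction, 0.0 orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Because star ratings are all positive, the similarity is never negative here; the "greater than 0" cut-off effectively drops only pairs with no usable overlap.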
Overview of Algorithm Procedures
Algorithms
Graph algorithms are used to compute metrics for graphs, nodes or relationships.
They can provide insights on relevant entities in the graph (centralities, ranking) or inherent structures like communities (community detection, graph partitioning, clustering).
Many of them are iterative approaches that frequently traverse the graph for the computation, using random walks, breadth-first or depth-first searches, or pattern matching.
Due to the exponential growth of possible paths with increasing distance, many of the approaches also have high algorithmic complexity.
Fortunately optimized algorithms exist that utilize certain structures of the graph, memoize already explored parts, and parallelize operations. Whenever possible we’ve applied these optimizations.
Page Rank
PageRank is Google’s popular search algorithm. PageRank counts the number and quality of links to a page to determine an estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
History, Explanation
In order to measure the relative importance of web pages, Sergey Brin and Larry Page proposed PageRank, a method for computing a ranking for every web page based on the graph of the web. PageRank has applications in search, browsing, and traffic estimation.
PageRank is defined in the original Google paper as follows:
We assume page A has pages T1…Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
The underlying mathematics of PageRank has to do with random walks on networks, akin to how random surfers propagate through a network. Random surfers follow links, but occasionally teleport to random vertices. The PageRank of a node is the probability it is visited by a random surfer with teleportation. PageRank is now widely recognized as a way of detecting central nodes in a network.
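The formula above translates directly into an iterative computation. A minimal single-threaded sketch in Python (not the library's parallel implementation), using an illustrative links-dictionary input:

```python
def page_rank(links, d=0.85, iterations=20):
    """Iterate PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)).

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = dict.fromkeys(pages, 1.0)  # start every page with rank 1
    for _ in range(iterations):
        new = {}
        for page in pages:
            # sum contributions PR(T)/C(T) over every page T linking to this one
            incoming = sum(pr[src] / len(targets)
                           for src, targets in links.items() if page in targets)
            new[page] = (1 - d) + d * incoming
        pr = new
    return pr
```

Each page divides its current rank evenly among its outgoing links; damping (d) mixes in the "teleporting" random surfer described above.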
When to use it / use cases
The mathematics of PageRank are entirely general and apply to any graph or network in any domain. PageRank is regularly used in bibliometrics, social and information network analysis, and for link prediction and recommendation. It’s even used for systems analysis of road networks, as well as biology, chemistry, neuroscience, and physics.
These are some other notable uses:

In neuroscience, the PageRank of a neuron in a neural network has been found to correlate with its relative firing rate.

Personalized PageRank is used by Twitter to present users with other accounts they may wish to follow [4].

PageRank has been used to rank spaces or streets to predict how many pedestrians or vehicles come to individual spaces or streets.

In lexical semantics it has been used to perform Word Sense Disambiguation, Semantic similarity, and to automatically rank WordNet synsets according to how strongly they possess a given semantic property, such as positivity or negativity [2].

In any ecosystem, a modified version of PageRank may be used to determine species that are essential to the continuing health of the environment [3].

For the analysis of protein networks in biology PageRank is also a useful tool [6].

PageRank has been used to quantify the scientific impact of researchers [5]. The underlying citation and collaboration networks are used in conjunction with the PageRank algorithm to come up with a ranking system for individual publications that propagates to individual authors. The new index, known as the pagerank-index (Pi), is demonstrated to be fairer than the h-index and addresses many of its drawbacks.
Constraints / when not to use it
There are some major limitations of the PageRank algorithm.

Spider traps: if there are no links from within a group of pages to outside the group, then the group is a spider trap.

Rank sink: this occurs when a network of pages forms an infinite cycle.

Dead ends and dangling links: dead ends occur when pages have no outgoing links. If a page contains a link to another page which has no outgoing links, that link is known as a dangling link. [1]
You can read [7] for more.
Algorithm explanation on simple sample graph
CREATE (home:Page{name:'Home'})
,(about:Page{name:'About'})
,(product:Page{name:'Product'})
,(links:Page{name:'Links'})
,(a:Page{name:'Site A'})
,(b:Page{name:'Site B'})
,(c:Page{name:'Site C'})
,(d:Page{name:'Site D'})
CREATE (home)-[:LINKS]->(about)
,(about)-[:LINKS]->(home)
,(product)-[:LINKS]->(home)
,(home)-[:LINKS]->(product)
,(links)-[:LINKS]->(home)
,(home)-[:LINKS]->(links)
,(links)-[:LINKS]->(a)-[:LINKS]->(home)
,(links)-[:LINKS]->(b)-[:LINKS]->(home)
,(links)-[:LINKS]->(c)-[:LINKS]->(home)
,(links)-[:LINKS]->(d)-[:LINKS]->(home)
CALL algo.pageRank.stream('Page', 'LINKS', {iterations:20, dampingFactor:0.85})
YIELD node, score
RETURN node, score ORDER BY score DESC LIMIT 20
CALL algo.pageRank('Page', 'LINKS',
{iterations:20, dampingFactor:0.85, write: true,writeProperty:"pagerank"})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty
name    | pageRank
Home    | 3.232
Product | 1.059
Links   | 1.059
About   | 1.059
Site A  | 0.328
Site B  | 0.328
Site C  | 0.328
Site D  | 0.328
As we might expect, the Home page has the highest PageRank because it has incoming links from all other pages. We can also observe that it is not only the number of incoming links that matters, but also the importance of the pages that link to us.
Example Usage
We will run PageRank on Yelp’s social network to find potential influencers.
In the import section we stored the social network as an undirected graph.
Relationships in Neo4j always have a direction, but in this domain the direction is irrelevant.
If Person A is a FRIEND of Person B, we can say that Person B is also a FRIEND of Person A.
The default label and relationship-type selection syntax won’t work for us here because it would project a directed social network. Instead, we can project our undirected social network using Cypher loading. We can also apply this approach to other algorithms that use Cypher loading.
CALL algo.pageRank.stream(
'MATCH (u:User) WHERE exists( (u)-[:FRIENDS]-() ) RETURN id(u) as id',
'MATCH (u1:User)-[:FRIENDS]-(u2:User) RETURN id(u1) as source, id(u2) as target',
{graph:'cypher'}
) YIELD node, score WITH node, score ORDER BY score DESC LIMIT 10
RETURN node {.name, .review_count, .average_stars,.useful,.yelping_since,.funny}, score
Syntax
CALL algo.pageRank(label:String, relationship:String,
{iterations:20, dampingFactor:0.85, write: true, writeProperty:'pagerank', concurrency:4})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty
name          | type    | default        | optional | description
label         | string  | null           | yes      | label to load from the graph; if null, load all nodes
relationship  | string  | null           | yes      | relationship-type to load from the graph; if null, load all relationships
iterations    | int     | 20             | yes      | how many iterations of PageRank to run
concurrency   | int     | available CPUs | yes      | number of concurrent threads
dampingFactor | float   | 0.85           | yes      | damping factor of the PageRank calculation
write         | boolean | true           | yes      | if the result should be written back as a node property
writeProperty | string  | 'pagerank'     | yes      | property name written back to
graph         | string  | 'heavy'        | yes      | use 'heavy' when describing the subset of the graph with label and relationship-type parameters, 'cypher' for describing the subset with Cypher node-statement and relationship-statement
name          | type    | description
nodes         | int     | number of nodes considered
iterations    | int     | number of iterations run
dampingFactor | float   | damping factor used
writeProperty | string  | property name written back to
write         | boolean | if the result was written back as a node property
loadMillis    | int     | milliseconds for loading data
computeMillis | int     | milliseconds for running the algorithm
writeMillis   | int     | milliseconds for writing result data back
CALL algo.pageRank.stream(label:String, relationship:String,
{iterations:20, dampingFactor:0.85, concurrency:4})
YIELD node, score - calculates PageRank and streams the results
name          | type   | default        | optional | description
label         | string | null           | yes      | label to load from the graph; if null, load all nodes
relationship  | string | null           | yes      | relationship-type to load from the graph; if null, load all relationships
iterations    | int    | 20             | yes      | how many iterations of PageRank to run
concurrency   | int    | available CPUs | yes      | number of concurrent threads
dampingFactor | float  | 0.85           | yes      | damping factor of the PageRank calculation
graph         | string | 'heavy'        | yes      | use 'heavy' when describing the subset of the graph with label and relationship-type parameters, 'cypher' for describing the subset with Cypher node-statement and relationship-statement
name  | type  | description
node  | long  | node id
score | float | PageRank weight
Cypher projection
If label and relationship-type are not selective enough to describe a subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph.
Don’t forget to set graph:'cypher' in the config.
CALL algo.pageRank(
'MATCH (p:Page) RETURN id(p) as id',
'MATCH (p1:Page)-[:Link]->(p2:Page) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', iterations:5, write: true}
)
Versions
We support the following versions of the PageRank algorithm:

✓ directed, unweighted

❏ directed, weighted

✓ undirected, unweighted (only with Cypher projection)

❏ undirected, weighted
References

https://anthonybonato.com/2016/04/13/the-mathematics-of-game-of-thrones/

[1] http://research.ijcaonline.org/volume110/number12/pxc3901035.pdf

[2] http://nmis.isti.cnr.it/sebastiani/Publications/ACL07.pdf

[4] https://web.stanford.edu/class/msande233/handouts/lecture8.pdf

[7] http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
Betweenness Centrality
Betweenness Centrality is a measure of centrality in a graph based on calculating geodesic (shortest) paths between nodes. There is at least one shortest path between every pair of nodes in a connected graph. The shortest path can be based on the number of relationships that the path passes through in an unweighted network or the sum of the weights of the relationships in a weighted network.
The betweenness centrality of a node is the number of these shortest paths that pass through it. The nodes that most frequently lie on these shortest paths have a higher betweenness centrality score.
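As a concrete illustration of this definition, the unweighted case can be computed with Brandes' algorithm (discussed below): one breadth-first search per source counts shortest paths, followed by a back-propagation of path dependencies. This is a plain-Python sketch, not the library's implementation:

```python
from collections import deque

def betweenness(adjacency):
    """Brandes' betweenness centrality for an unweighted directed graph.
    adjacency maps each node to the list of its successor nodes."""
    nodes = set(adjacency) | {v for targets in adjacency.values() for v in targets}
    centrality = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        stack = []
        preds = {v: [] for v in nodes}       # predecessors on shortest paths
        sigma = dict.fromkeys(nodes, 0)      # number of shortest paths from s
        dist = dict.fromkeys(nodes, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                         # BFS counting shortest paths
            v = queue.popleft()
            stack.append(v)
            for w in adjacency.get(v, []):
                if dist[w] < 0:              # first time we reach w
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # shortest path to w goes via v
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(nodes, 0.0)
        while stack:                         # back-propagate dependencies
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                centrality[w] += delta[w]
    return centrality
```

On the small management graph used later in this section, this sketch reproduces the documented scores (Alice 4, Charles 2).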
History, Explanation
The original idea behind betweenness centrality was introduced by J. M. Anthonisse in his 1971 paper The rush in a directed graph. Anthonisse defined the rush in a graph as the amount a node in a network has to intermediate between other nodes.
Linton Freeman gave the first formal definition of betweenness centrality as one of the “three distinct intuitive conceptions of centrality” in his paper A Set of Measures of Centrality Based on Betweenness. It is calculated via breadth-first search, computing the shortest paths from every node to all other nodes. Ulrik Brandes’ Faster Algorithm for Betweenness Centrality is often used for the calculation.
When to use it / usecases
Betweenness centrality is useful for finding nodes that serve as a bridge from one part of a graph to another. As a result, it is a rudimentary measure of the control that a node exerts over the flow throughout the graph.
In the above example Alice is the main connection in the graph. If Alice is removed all connections in the graph would be cut off. This makes Alice “important” because she ensures that no nodes are isolated.
The betweenness score for a node can be unintuitive. A node may have low degree, be connected to others that have low degree, and even be a long way from others on average, yet still have high betweenness. Consider a node A that lies on a bridge between two groups of nodes within a network. Since any path between nodes in different groups must go through this bridge, node A acquires high betweenness even though it is not well connected within either group.
Betweenness centrality has wide applicability in network theory. It represents the degree to which nodes stand between each other, and has found uses in a variety of domains [1], including:

analyzing social and protein networks

identifying important bloggers

measuring network traffic in communication networks

studying the interaction patterns of players in massively multiplayer online games

analyzing the importance of people in mobile phone call networks

finding the controlling nodes in a telecommunications network
Constraints / when not to use it
Betweenness centrality makes the assumption that all communication between nodes happens along the shortest path and with the same frequency, which isn’t the case in real life. It therefore doesn’t give us a perfect view of the most influential nodes in a graph, but rather a good approximation. [4]
Algorithm explanation on simple sample graph
People with high betweenness tend to be the innovators and brokers in social networks. They combine different perspectives, transfer ideas between groups, and get power from their ability to make introductions and pull strings.
CREATE (nAlice:User {id:'Alice'})
,(nBridget:User {id:'Bridget'})
,(nCharles:User {id:'Charles'})
,(nDoug:User {id:'Doug'})
,(nMark:User {id:'Mark'})
,(nMichael:User {id:'Michael'})
CREATE (nAlice)-[:MANAGE]->(nBridget)
,(nAlice)-[:MANAGE]->(nCharles)
,(nAlice)-[:MANAGE]->(nDoug)
,(nMark)-[:MANAGE]->(nAlice)
,(nCharles)-[:MANAGE]->(nMichael);
CALL algo.betweenness.stream('User','MANAGE',{direction:'out'})
YIELD nodeId, centrality
RETURN nodeId, centrality ORDER BY centrality DESC LIMIT 20;
CALL algo.betweenness('User','MANAGE', {direction:'out',write:true, writeProperty:'centrality'})
YIELD nodes, minCentrality, maxCentrality, sumCentrality, loadMillis, computeMillis, writeMillis;
name    | centrality weight
Alice   | 4
Charles | 2
Bridget | 0
Michael | 0
Doug    | 0
Mark    | 0
We can see that Alice is the main broker in this network and Charles is a minor broker.
Approximation of betweenness centrality
The fastest known algorithm for exactly computing betweenness of all the nodes, designed by Brandes, requires at least O(nm) time for unweighted graphs, where n is the number of nodes and m is the number of relationships.
For large-scale graphs exact centrality computation isn’t practical [2], but we can work with a subset of the nodes and calculate an approximate score.
RA-Brandes Algorithm
The RA-Brandes algorithm, proposed by Brandes and Pich [3], differs from the original Brandes algorithm in one main respect: Brandes’ algorithm considers all nodes within the graph, whereas RA-Brandes considers only a subset of nodes, also known as pivots.
We implement two strategies of selecting the subset of nodes:

random: nodes are selected uniformly at random with a defined probability of selection. The default probability is log10(N) / e^2. If the probability is 1, the algorithm works like the original Brandes, where all nodes are loaded.

degree: first calculates the mean degree of the nodes and then only visits nodes whose degree is higher than the mean, i.e. only dense nodes are visited.
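For a sense of scale, the default selection probability of the random strategy can be computed directly; this tiny helper is illustrative, not part of the library's API:

```python
import math

def default_sampling_probability(n):
    """Default pivot-selection probability of the 'random' strategy,
    as described above: log10(N) / e^2."""
    return math.log10(n) / math.e ** 2
```

For example, for a graph of one million nodes this gives roughly 0.81, so most nodes would still act as pivots; lowering the probability trades accuracy for speed.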
The first step in calculating betweenness centrality is to collect all shortest paths that run through a specific node. With the maxDepth parameter you can limit the depth of the shortest-path traversal.
CALL algo.betweenness.sampled.stream('User','FRIEND',
{strategy:'random', probability:1.0, maxDepth:5})
YIELD nodeId, centrality
CALL algo.betweenness.sampled('User','FRIEND',
{strategy:'random', probability:1.0, writeProperty:'centrality', maxDepth:5})
YIELD nodes, minCentrality, maxCentrality
Syntax
CALL algo.betweenness(label:String, relationship:String,
{direction:'out',write:true, stats:true, writeProperty:'centrality',concurrency:1})
YIELD nodes, minCentrality, maxCentrality, sumCentrality, loadMillis, computeMillis, writeMillis
 calculates betweenness centrality and potentially writes back
name | type | default | optional | description
label | string | null | yes | label to load from the graph; if null, load all nodes
relationship | string | null | yes | relationship-type to load from the graph; if null, load all relationships
direction | string | outgoing | yes | relationship direction to load from the graph; if 'both', treats the relationships as undirected
write | boolean | true | yes | if result should be written back as node property
stats | boolean | true | yes | if stats about centrality should be returned
writeProperty | string | 'centrality' | yes | property name written back to
graph | string | 'heavy' | yes | use 'heavy' when describing the subset of the graph with label and relationship-type parameters, 'cypher' for describing the subset with Cypher node-statement and relationship-statement
concurrency | int | available CPUs | yes | number of concurrent threads
name | type | description
nodes | int | number of nodes considered
minCentrality | int | minimum centrality value
maxCentrality | int | maximum centrality value
sumCentrality | int | sum of all centrality values
loadMillis | int | milliseconds for loading data
evalMillis | int | milliseconds for running the algorithm
writeMillis | int | milliseconds for writing result data back
CALL algo.betweenness.stream(label:String, relationship:String,
{direction:'out',concurrency:1})
YIELD nodeId, centrality  yields centrality for each node
name | type | default | optional | description
label | string | null | yes | label to load from the graph; if null, load all nodes
relationship | string | null | yes | relationship-type to load from the graph; if null, load all relationships
concurrency | int | available CPUs | yes | number of concurrent threads
direction | string | outgoing | yes | relationship direction to load from the graph; if 'both', treats the relationships as undirected
name | type | description
node | long | node id
centrality | float | betweenness centrality weight
CALL algo.betweenness.sampled(label:String, relationship:String,
{direction:'out', strategy:'random', probability: 1, maxDepth: 4, stats:true,
writeProperty:'centrality',concurrency:1})
YIELD nodes, minCentrality, maxCentrality, sumCentrality, loadMillis, computeMillis, writeMillis
 calculates betweenness centrality and potentially writes back
name | type | default | optional | description
label | string | null | yes | label to load from the graph; if null, load all nodes
relationship | string | null | yes | relationship-type to load from the graph; if null, load all relationships
direction | string | outgoing | yes | relationship direction to load from the graph; if 'both', treats the relationships as undirected
write | boolean | true | yes | if result should be written back as node property
strategy | string | 'random' | yes | node selection strategy
probability | float | log10(N) / e^2 | yes | probability a node is selected. Values between 0 and 1. If 1, selects all nodes and works like the original Brandes algorithm
maxDepth | int | Integer.MAX | yes | depth of the shortest path traversal
stats | boolean | true | yes | if stats about centrality should be returned
writeProperty | string | 'centrality' | yes | property name written back to
graph | string | 'heavy' | yes | use 'heavy' when describing the subset of the graph with label and relationship-type parameters, 'cypher' for describing the subset with Cypher node-statement and relationship-statement
concurrency | int | available CPUs | yes | number of concurrent threads
name | type | description
nodes | int | number of nodes considered
minCentrality | int | minimum centrality value
maxCentrality | int | maximum centrality value
sumCentrality | int | sum of all centrality values
loadMillis | int | milliseconds for loading data
evalMillis | int | milliseconds for running the algorithm
writeMillis | int | milliseconds for writing result data back
CALL algo.betweenness.sampled.stream(label:String, relationship:String,
{direction:'out',concurrency:1, strategy:'random', probability: 1, maxDepth: 4})
YIELD nodeId, centrality  yields centrality for each node
name | type | default | optional | description
label | string | null | yes | label to load from the graph; if null, load all nodes
relationship | string | null | yes | relationship-type to load from the graph; if null, load all relationships
concurrency | int | available CPUs | yes | number of concurrent threads
direction | string | outgoing | yes | relationship direction to load from the graph; if 'both', treats the relationships as undirected
strategy | string | 'random' | yes | node selection strategy
probability | float | log10(N) / e^2 | yes | probability a node is selected. Values between 0 and 1. If 1, selects all nodes and works like the original Brandes algorithm
maxDepth | int | Integer.MAX | yes | depth of the shortest path traversal
name | type | description
node | long | node id
centrality | float | betweenness centrality weight
Cypher projection
If label and relationship-type are not selective enough to describe the subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph.
This can also be used to run algorithms on a virtual graph.
Set graph:'cypher'
in the config.
CALL algo.betweenness(
'MATCH (p:User) RETURN id(p) as id',
'MATCH (p1:User)-[:MANAGE]->(p2:User) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', write: true}
);
Versions
We support the following versions of the betweenness centrality algorithm:

✓ directed, unweighted

loading incoming relationships: 'INCOMING','IN','I' or '<'

loading outgoing relationships: 'OUTGOING','OUT','O' or '>'


❏ directed, weighted

✓ undirected, unweighted

direction:'both' or '<>'


❏ undirected, weighted
Implementations
algo.betweenness()

implementation of the Brandes-BC algorithm with a nodePartitioning extension

if the concurrency parameter is set (and > 1), ParallelBetweennessCentrality is used

ParallelBC spawns N (given by the concurrency parameter) concurrent threads for the calculation, where each one calculates the BC for one node at a time
algo.betweenness.exp1()

a Brandes-like algorithm which uses successor sets instead of predecessor sets

The algorithm is based on Brandes’ definition but with some changes regarding the dependency-accumulation step.

Does not support undirected graphs
algo.betweenness.sampled()

Calculates betweenness dependencies on a subset of pivot nodes (instead of all nodes). Up to now, 2 randomization strategies are implemented, which can be set using the optional argument strategy:

random selection (default): strategy:'random' (takes optional argument probability:double (0-1), defaulting to log10(N) / e^2)

degree-based randomization: strategy:'degree' (makes dense nodes more likely)

optional argument: maxDepth:int
References

http://cassmt.pnnl.gov/docs/pubs/georgiatechlbnlpnnlfastbcmtaap2009.pdf

https://www.sci.unich.it/~francesc/teaching/network/betweeness.html

https://econsultancy.com/blog/63682twitternetworkanalysisidentifyinginfluencersandinnovators/

[3] https://pdfs.semanticscholar.org/2e0d/94d072d79ba73c9a153a228b93520b3ce670.pdf

[4] Newman, Mark. Networks: An Introduction (Page 186). OUP Oxford.
Closeness Centrality
The Closeness Centrality of a node measures the distance from that node to all other nodes. Nodes with a high closeness score have the shortest distances to all other nodes. The premise of this algorithm is that nodes with short distance to other nodes can spread information very efficiently through the network. This is important for the availability of knowledge and resources.
History, Explanation
Sabidussi (1966) described the sum of the shortest path distances from one node to every other node as the node’s farness. Freeman (1979) used this idea to define closeness centrality of a node as the inverse of Sabidussi’s farness.
Closeness centrality is defined as the total number of relationships separating a node from all others along the shortest possible paths.
The algorithm operates as follows:

calculate the shortest path between each pair of nodes in the graph

for each node, sum the distances from the node to all other nodes based on those shortest paths
The raw closeness centrality for a node is then calculated using the following formula:
raw closeness centrality(node) = 1 / sum(distance from node to all other nodes)
We can interpret closeness as the potential ability of a node to reach all other nodes as quickly as possible: the smaller the sum of distances from a node, the higher its raw closeness centrality, and the faster information originating at random points in the graph arrives at the node.
Because the raw score depends on the number of nodes in the graph, our algorithm returns a normalized closeness centrality score, where nodes with a higher score are more central.
The formula for normalized closeness centrality is as follows:
normalized closeness centrality(node) = (number of nodes  1) / sum(distance from node to all other nodes)
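As an illustration of the two formulas above (not the library's MS-BFS implementation), here is a short Python sketch over an unweighted adjacency list, using the same five-node chain as the sample graph below:

```python
from collections import deque

def closeness(adj):
    """Raw and normalized closeness for every node of an unweighted,
    connected graph given as {node: [neighbour, ...]}."""
    n = len(adj)
    result = {}
    for s in adj:
        # BFS gives shortest-path distances from s to every other node
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        farness = sum(dist.values())          # sum of distances from s
        result[s] = {'raw': 1.0 / farness,    # 1 / farness
                     'normalized': (n - 1) / farness}
    return result

# five-node chain A-B-C-D-E, as in the sample graph below
adj = {'A': ['B'], 'B': ['A', 'C'], 'C': ['B', 'D'], 'D': ['C', 'E'], 'E': ['D']}
scores = closeness(adj)
```

The middle node C gets the highest normalized score (4/6 ≈ 0.67), matching the worked calculation in this section.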
When to use it / usecases
This algorithm is useful for finding the nodes that can access the entire network most quickly.
We might use it if we’re trying to identify where in the city to place a new public service so that it’s easily accessible for residents. If we’re trying to spread a message on social media we could use the algorithm to find the key influencers that can help us achieve our goal.
This measure is preferable to degree centrality because it takes both indirect and direct connections into account. For closeness centrality in directed networks, we split the concept into in-closeness and out-closeness. The in-closeness variable measures the extent to which a node is easily reached by others, i.e. paths from others to the node. The out-closeness variable measures the extent to which a node can easily reach others, i.e. paths from the node to others.
Constraints / when not to use it
Closeness centrality works best on connected graphs. On disconnected graphs we can end up with an infinite distance between two nodes in separate connected components, which means we’ll end up with an infinite closeness centrality score when we sum up all the distances from that node. [1]
We may therefore choose to run the strongly connected components algorithm over our dataset before trying closeness centrality.
Algorithm explanation on simple sample graph
CREATE (a:Node{id:"A"}),
(b:Node{id:"B"}),
(c:Node{id:"C"}),
(d:Node{id:"D"}),
(e:Node{id:"E"})
CREATE (a)-[:LINK]->(b),
(b)-[:LINK]->(a),
(b)-[:LINK]->(c),
(c)-[:LINK]->(b),
(c)-[:LINK]->(d),
(d)-[:LINK]->(c),
(d)-[:LINK]->(e),
(e)-[:LINK]->(d);
CALL algo.closeness.stream('Node', 'LINK') YIELD nodeId, centrality
RETURN nodeId, centrality ORDER BY centrality DESC LIMIT 20;
 yields centrality for each node
CALL algo.closeness('Node', 'LINK', {write:true, writeProperty:'centrality'})
YIELD nodes,loadMillis, computeMillis, writeMillis;
 calculates closeness centrality and potentially writes back
Calculation:

count farness in each MS-BFS callback

divide by N - 1

N = 5 // number of nodes
k = N - 1 = 4 // used for normalization

     A    B    C    D    E
A    0    1    2    3    4
B    1    0    1    2    3
C    2    1    0    1    2
D    3    2    1    0    1
E    4    3    2    1    0
S   10    7    6    7   10    // farness S (raw closeness = 1/S)
k/S 0.4  0.57 0.67 0.57 0.4   // normalized closeness centrality
Syntax
CALL algo.closeness(label:String, relationship:String,
{write:true, writeProperty:'centrality',graph:'heavy', concurrency:4})
YIELD nodes,loadMillis, computeMillis, writeMillis
 calculates closeness centrality and potentially writes back
name | type | default | optional | description
label | string | null | yes | label to load from the graph; if null, load all nodes
relationship | string | null | yes | relationship-type to load from the graph; if null, load all relationships
write | boolean | true | yes | if result should be written back as node property
concurrency | int | available CPUs | yes | number of concurrent threads
writeProperty | string | 'centrality' | yes | property name written back to
graph | string | 'heavy' | yes | use 'heavy' when describing the subset of the graph with label and relationship-type parameters, 'cypher' for describing the subset with Cypher node-statement and relationship-statement
name | type | description
nodes | int | number of nodes considered
loadMillis | int | milliseconds for loading data
evalMillis | int | milliseconds for running the algorithm
writeMillis | int | milliseconds for writing result data back
CALL algo.closeness.stream(label:String, relationship:String,{concurrency:4})
YIELD nodeId, centrality  yields centrality for each node
name | type | default | optional | description
label | string | null | yes | label to load from the graph; if null, load all nodes
relationship | string | null | yes | relationship-type to load from the graph; if null, load all relationships
concurrency | int | available CPUs | yes | number of concurrent threads
graph | string | 'heavy' | yes | use 'heavy' when describing the subset of the graph with label and relationship-type parameters, 'cypher' for describing the subset with Cypher node-statement and relationship-statement
name | type | description
node | long | node id
centrality | float | closeness centrality weight
Cypher projection
If label and relationship-type are not selective enough to describe the subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph.
This can also be used to run algorithms on a virtual graph.
Set graph:'cypher'
in the config.
CALL algo.closeness(
'MATCH (p:Node) RETURN id(p) as id',
'MATCH (p1:Node)-[:LINK]->(p2:Node) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', write: true}
);
Versions
We support the following versions of the closeness centrality algorithm:

✓ directed, unweighted

❏ directed, weighted

✓ undirected, unweighted

Only with cypher projection


❏ undirected, weighted
References

[1] https://infoscience.epfl.ch/record/200525/files/[EN]ASNA09.pdf?

https://toreopsahl.com/2010/03/20/closenesscentralityinnetworkswithdisconnectedcomponents/

http://www.casos.cs.cmu.edu/publications/papers/CMUISR11113.pdf

http://qualquant.org/wpcontent/uploads/networks/2008%20173.pdf
Harmonic Centrality
Closeness centrality works best on connected graphs; the results can be misleading if there are multiple connected components.
Harmonic centrality (also known as 'valued centrality') is a variant of closeness centrality that addresses this problem. As with many of the centrality algorithms, it originates from the field of social network analysis.
History, Explanation
Harmonic centrality was proposed by Marchiori and Latora [1] while trying to come up with a sensible notion of "average shortest path". They suggested replacing the average distance calculation from the closeness centrality algorithm with the harmonic mean of all distances.
The algorithm operates as follows:

calculate the shortest path between each pair of nodes in the graph

for each node determine the distance from the node to all other nodes based on those shortest paths
The raw harmonic centrality for a node is then calculated using the following formula:
raw harmonic centrality(node) = sum(1 / distance from node to every other node excluding itself)
As with closeness centrality we can also calculate a normalized harmonic centrality with the following formula:
normalized harmonic centrality(node) = sum(1 / distance from node to every other node excluding itself) / (number of nodes  1)
The advantage of harmonic centrality is that infinite distances are handled cleanly. Harmonic centrality and closeness centrality will often produce similar results, but harmonic centrality can also handle graphs that aren’t connected. [3]
Harmonic centrality was proposed independently by Dekker (2005)[4], using the name "valued centrality," and by Rochat (2009)[2].
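A minimal Python sketch of the normalized formula (an illustration, not the library's implementation), treating the graph as undirected and unweighted, with unreachable pairs contributing 1/∞ = 0:

```python
from collections import deque

def harmonic(adj):
    """Normalized harmonic centrality: sum of 1/distance to every other
    node, divided by n - 1. Unreachable nodes contribute 1/inf = 0."""
    n = len(adj)
    scores = {}
    for s in adj:
        # BFS distances from s; nodes missing from `dist` are unreachable
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        scores[s] = sum(1.0 / d for t, d in dist.items() if t != s) / (n - 1)
    return scores

# two components, as in the sample graph below: A-B-C and D-E
adj = {'A': ['B'], 'B': ['A', 'C'], 'C': ['B'], 'D': ['E'], 'E': ['D']}
scores = harmonic(adj)
```

These values match the worked calculation in this section (A ≈ 0.37, B = 0.5, D = 0.25), with no infinite scores despite the disconnected components.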
When to use it / usecases
Harmonic centrality was proposed as an alternative to closeness centrality, and therefore has similar use cases.
For example, we might use it if we’re trying to identify where in the city to place a new public service so that it’s easily accessible for residents. If we’re trying to spread a message on social media we could use the algorithm to find the key influencers that can help us achieve our goal.
Algorithm explanation on simple sample graph
CREATE (a:Node{id:"A"}),
(b:Node{id:"B"}),
(c:Node{id:"C"}),
(d:Node{id:"D"}),
(e:Node{id:"E"})
CREATE (a)-[:LINK]->(b),
(b)-[:LINK]->(c),
(d)-[:LINK]->(e);
CALL algo.closeness.harmonic.stream('Node', 'LINK') YIELD nodeId, centrality
RETURN nodeId,centrality
ORDER BY centrality DESC
LIMIT 20;
 yields centrality for each node
CALL algo.closeness.harmonic('Node', 'LINK', {writeProperty:'centrality'})
YIELD nodes,loadMillis, computeMillis, writeMillis;
 calculates harmonic centrality and potentially writes back
Calculation:

k = N - 1 = 4

     A    B    C    D    E
A    0    1    2    ∞    ∞    // distance between each pair of nodes
B    1    0    1    ∞    ∞    // or infinite if no path exists
C    2    1    0    ∞    ∞
D    ∞    ∞    ∞    0    1
E    ∞    ∞    ∞    1    0

inverse (1/∞ = 0):

     A    B    C    D    E
A    0    1   1/2   0    0
B    1    0    1    0    0
C   1/2   1    0    0    0
D    0    0    0    0    1
E    0    0    0    1    0

sum 1.5   2   1.5   1    1
/k  0.37  0.5 0.37  0.25 0.25

Instead of calculating the farness, we sum the inverse of each cell and divide by k = n - 1.
Syntax
CALL algo.closeness.harmonic(label:String, relationship:String,
{write:true, writeProperty:'centrality',graph:'heavy', concurrency:4})
YIELD nodes,loadMillis, computeMillis, writeMillis
 calculates harmonic centrality and potentially writes back
name | type | default | optional | description
label | string | null | yes | label to load from the graph; if null, load all nodes
relationship | string | null | yes | relationship-type to load from the graph; if null, load all relationships
write | boolean | true | yes | if result should be written back as node property
concurrency | int | available CPUs | yes | number of concurrent threads
writeProperty | string | 'centrality' | yes | property name written back to
graph | string | 'heavy' | yes | use 'heavy' when describing the subset of the graph with label and relationship-type parameters, 'cypher' for describing the subset with Cypher node-statement and relationship-statement
name | type | description
nodes | int | number of nodes considered
loadMillis | int | milliseconds for loading data
evalMillis | int | milliseconds for running the algorithm
writeMillis | int | milliseconds for writing result data back
CALL algo.closeness.harmonic.stream(label:String, relationship:String,{concurrency:4})
YIELD nodeId, centrality  yields centrality for each node
name | type | default | optional | description
label | string | null | yes | label to load from the graph; if null, load all nodes
relationship | string | null | yes | relationship-type to load from the graph; if null, load all relationships
concurrency | int | available CPUs | yes | number of concurrent threads
name | type | description
node | long | node id
centrality | float | harmonic centrality weight
Cypher projection
If label and relationship-type are not selective enough to describe the subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph.
This can also be used to run algorithms on a virtual graph.
Set graph:'cypher'
in the config.
CALL algo.closeness.harmonic(
'MATCH (p:Node) RETURN id(p) as id',
'MATCH (p1:Node)-[:LINK]-(p2:Node) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', writeProperty: 'centrality'}
);
Versions
We support the following versions of the harmonic centrality algorithm:

✓ undirected, unweighted

❏ undirected, weighted
References

[2] https://infoscience.epfl.ch/record/200525/files/ENASNA09.pdf?

[4] https://www.cmu.edu/joss/content/articles/volume6/dekker/index.html
Minimum Weight Spanning Tree
A Minimum Weight Spanning Tree (MST) is a subgraph that contains all the nodes from the original graph, together with the set of relationships that connect the nodes with the minimum possible total weight. Our current approach uses Prim’s algorithm to calculate the MST starting at a given node. The k-spanning tree variant can be used to detect clusters in the graph.
History, Explanation
In graph theory, a tree is a way of connecting all the nodes together so that there is exactly one path from any node to any other node in the tree. If the graph represents cities connected by roads, one could select a number of roads so that each city can be reached from every other, but there is no more than one way to travel from one city to another.
Most of the time graphs are weighted; each connection between two cities has a weight: it might cost something to travel on a given road, or one connection may be longer than another, meaning it takes more time to travel on that connection.
A graph might have more than one minimum spanning tree. In a graph where all the relationships have the same weight, every tree is a minimum spanning tree. If all the relationships have different weights there is exactly one minimal spanning tree.
Czech scientist Otakar Borůvka developed the first known algorithm for finding a minimum spanning tree in 1926, while trying to find an efficient electricity network for Moravia. This algorithm is known as Borůvka’s algorithm.
Prim’s algorithm (sometimes known as the Jarník-Prim algorithm) is one of the simplest and best-known minimum spanning tree algorithms. It is similar to Dijkstra’s algorithm, but rather than minimizing the total length of a path ending at each node, we minimize the length of each relationship individually. Unlike Dijkstra’s, Prim’s can tolerate negative-weight relationships. It is optimal for dense graphs.
The algorithm operates as follows:

start with a trivial tree containing only one node (and no relationship). It does not matter which node is chosen.

select the minimal-weight relationship coming from that node and add it to our tree.

repeatedly choose a minimal-weight relationship that joins any node in the tree to one not in the tree, adding the new relationship and node to our tree. Such a relationship will always exist while the tree is incomplete, because otherwise the partially built tree would be an isolated component, but we know the graph consists of a single connected component.

when there are no more nodes to add, the tree we have built is a minimum spanning tree.
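The steps above can be sketched in a few lines of Python (an illustration over an undirected, weighted adjacency list; the library reads the Neo4j graph directly):

```python
import heapq

def prim_mst(adj, start):
    """Prim's algorithm sketch: grow a tree from `start`, repeatedly adding
    the cheapest relationship that connects a new node to the tree.
    `adj` maps node -> list of (neighbour, weight) pairs (undirected)."""
    visited = {start}
    frontier = [(weight, start, v) for v, weight in adj[start]]
    heapq.heapify(frontier)
    tree = []                 # (from, to, weight) relationships of the MST
    while frontier:
        weight, u, v = heapq.heappop(frontier)
        if v in visited:
            continue          # this relationship would close a cycle
        visited.add(v)
        tree.append((u, v, weight))
        for w, wt in adj[v]:
            if w not in visited:
                heapq.heappush(frontier, (wt, v, w))
    return tree
```

On the sample Place graph below (costs 4, 6, 1, 3, 2, 5), starting from D, the sketch selects D-B (4), B-A (1), A-C (2) and C-E (5), for a total weight of 12.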
When to use it / usecases
Imagine that a telecommunications company is trying to lay out cables in a new neighborhood. If the company is only able to lay the cable along certain paths (e.g. along roads), we could derive a graph that represents the points connected by those paths. Some of those paths might be more expensive, because they are longer, or require the cable to be buried deeper; these paths would be represented by relationships with larger weights. A spanning tree for that graph would be a subset of those paths that has no cycles but still connects to every house; in fact several spanning trees might be possible. A minimum spanning tree would be the one with the lowest total cost, which would represent the least expensive path for laying the cable.
There are also some indirect applications, such as LDPC codes for error correction, image registration with Rényi entropy, learning salient features for real-time face verification, modelling the locality of particle interactions in turbulent fluid flows, and the autoconfig protocol for Ethernet bridging to avoid cycles in a network.
Constraints / when not to use it
The MST algorithm only gives meaningful results when run on a graph where the relationships have different weights. If the graph has no weights or all relationships have the same weight then any spanning tree is a minimum spanning tree.
Algorithm explanation on simple sample graph
CREATE (a:Place{id:"A"}),
(b:Place{id:"B"}),
(c:Place{id:"C"}),
(d:Place{id:"D"}),
(e:Place{id:"E"})
CREATE (d)-[:LINK{cost:4}]->(b),
(d)-[:LINK{cost:6}]->(e),
(b)-[:LINK{cost:1}]->(a),
(b)-[:LINK{cost:3}]->(c),
(a)-[:LINK{cost:2}]->(c),
(c)-[:LINK{cost:5}]->(e);
The minimum weighted spanning tree algorithm visits all nodes that are in the same connected component as the starting node, and returns a spanning tree of all nodes in the component where the total weight of the relationships is minimized.
MATCH (n:Place{id:"D"})
CALL algo.spanningTree.minimum('Place','LINK', 'cost',id(n), {write:true, writeProperty:"MINST"})
YIELD loadMillis, computeMillis, writeMillis, effectiveNodeCount
RETURN loadMillis,computeMillis,writeMillis, effectiveNodeCount;
The maximum weighted spanning tree algorithm is similar to the minimum one, except that it returns a spanning tree of all nodes in the component where the total weight of the relationships is maximized.
MATCH (n:Place{id:"D"})
CALL algo.spanningTree.maximum('Place', 'LINK', 'cost',id(n), {write:true, writeProperty:"MAXST"})
YIELD loadMillis, computeMillis, writeMillis, effectiveNodeCount
RETURN loadMillis,computeMillis,writeMillis, effectiveNodeCount;
k-Spanning tree
Sometimes we want to limit the size of our spanning tree result, because we are only interested in finding a smaller tree within our graph that does not span all nodes. The k-spanning tree algorithm returns a tree with k nodes and k − 1 relationships.
In our sample graph we have 5 nodes.
When we ran the MST algorithm above, it returned a 5-minimum spanning tree that covered all five nodes.
By setting k=3
we ask for a 3-minimum spanning tree that covers 3 nodes and has 2 relationships.
MATCH (n:Place{id:"D"})
CALL algo.spanningTree.kmin('Place', 'LINK', 'cost',id(n), 3, {writeProperty:"kminst"})
YIELD loadMillis, computeMillis, writeMillis, effectiveNodeCount
RETURN loadMillis,computeMillis,writeMillis, effectiveNodeCount;
Place | partition
A | 1
B | 1
C | 1
D | 3
E | 4
Nodes A, B and C form the 3-minimum spanning tree of our graph.
MATCH (n:Place{id:"D"})
CALL algo.spanningTree.kmax('Place', 'LINK', 'cost',id(n), 3, {writeProperty:"kmaxst"})
YIELD loadMillis, computeMillis, writeMillis, effectiveNodeCount
RETURN loadMillis,computeMillis,writeMillis, effectiveNodeCount;
Place | partition
A | 0
B | 1
C | 3
D | 3
E | 3
Nodes C, D and E form the 3-maximum spanning tree of our graph.
When we run this algorithm on a bigger graph, we can use the following query to find the nodes that belong to our k-spanning tree result:
MATCH (n:Place)
WITH n.partition AS partition, count(*) AS count, collect(n) AS nodes
WHERE count = k // substitute the k used above, e.g. 3
RETURN nodes
Syntax
CALL algo.spanningTree(label:String, relationshipType:String, weightProperty:String, startNodeId:int, {writeProperty:String})
YIELD loadMillis, computeMillis, writeMillis, effectiveNodeCount
name | type | default | optional | description
label | String | null | no | label to load from the graph; if null, load all nodes
relationshipType | String | null | no | relationship-type to load from the graph; if null, load all relationships
weightProperty | string | null | no | property name that contains weight. Must be numeric.
startNodeId | long | null | no | start node id
write | boolean | true | yes | if result should be written back as relationships
writeProperty | string | 'mst' | yes | relationship-type written back as result
name | type | description
effectiveNodeCount | int | number of visited nodes
loadMillis | int | milliseconds for loading data
computeMillis | int | milliseconds for running the algorithm
writeMillis | int | milliseconds for writing result data back
CALL algo.spanningTree.k*(label:String, relationshipType:String, weightProperty:String, startNodeId:int, k:int, {writeProperty:String})
YIELD loadMillis, computeMillis, writeMillis, effectiveNodeCount
name | type | default | optional | description
label | String | null | no | label to load from the graph; if null, load all nodes
relationshipType | String | null | no | relationship-type
weightProperty | string | null | no | property name that contains weight. Must be numeric.
startNodeId | int | null | no | start node id
k | int | null | no | result is a tree with k nodes and k − 1 relationships
write | boolean | true | yes | if result should be written back as node property
writeProperty | string | 'mst' | yes | relationship-type written back as result
name | type | description
effectiveNodeCount | int | number of visited nodes
loadMillis | int | milliseconds for loading data
computeMillis | int | milliseconds for running the algorithm
writeMillis | int | milliseconds for writing result data back
Versions
We support the following versions of the minimum weight spanning tree algorithm:

✓ undirected, weighted
Single Source Shortest Path
The Single Source Shortest Path algorithm calculates the shortest weighted path between a pair of nodes. A commonly used algorithm is Dijkstra’s.
History, Explanation
Path finding has a long history and is considered one of the classical graph problems; it has been researched as far back as the 19th century. Classical references are Wiener [1873], Lucas [1882] (describing a method due to C.P. Trémaux), and Tarry [1895]; see Biggs, Lloyd, and Wilson [1976].
Dijkstra thought about the shortest path problem while working as a programmer at the Mathematical Center in Amsterdam in 1956, trying to demonstrate the capabilities of the new ARMAC computer. He wanted to choose both a problem and an answer (that would be produced by the computer) that non-computing people could understand. He designed what is now known as Dijkstra’s algorithm and later implemented it on the ARMAC for a slightly simplified transportation map of 64 cities in the Netherlands.
The shortest path problem gained prominence in the early 1950s in the context of ‘alternate routing’, i.e. finding a second-shortest route if the shortest route is blocked. It found uses in urban planning [1] as well as telephone call routing.
When to use it / usecases
The most common usage of the single source shortest path algorithm is finding directions between physical locations. Web mapping websites such as MapQuest or Google Maps use it to provide driving directions.
It has also found uses in several other areas including:

Telecommunications and networking

Operations research - plant and facility layout, robotics, transportation, and VLSI design

Social networks - finding the degrees of separation between people

IP routing - the Open Shortest Path First protocol
Constraints / when not to use it
Dijkstra’s algorithm does not support negative weights. The algorithm assumes that adding a relationship to a path can never make the path shorter, an invariant that would be violated with negative weights.
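A compact Python sketch of Dijkstra's algorithm over a weighted adjacency list (an illustration, not the library's implementation); it relies on the non-negative-weight invariant described above:

```python
import heapq

def dijkstra(adj, source, target):
    """Dijkstra's algorithm: repeatedly settle the unsettled node with the
    smallest tentative distance. Requires non-negative weights.
    `adj` maps node -> list of (neighbour, cost) pairs."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if v == target:
            return d                  # target settled: distance is final
        if d > dist.get(v, float('inf')):
            continue                  # stale heap entry
        for w, cost in adj[v]:
            nd = d + cost
            if nd < dist.get(w, float('inf')):
                dist[w] = nd
                heapq.heappush(heap, (nd, w))
    return float('inf')               # target unreachable
```

On the sample Loc graph below, the shortest A-to-F cost is 100 (A to D by RAIL at 50, D to E at 30, then E to F by RAIL at 20).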
Algorithm explanation on simple sample graph
CREATE (a:Loc{name:'A'}), (b:Loc{name:'B'}), (c:Loc{name:'C'}),
(d:Loc{name:'D'}), (e:Loc{name:'E'}), (f:Loc{name:'F'}),
(a)-[:ROAD {cost:50}]->(b),
(a)-[:ROAD {cost:50}]->(c),
(a)-[:ROAD {cost:100}]->(d),
(a)-[:RAIL {cost:50}]->(d),
(b)-[:ROAD {cost:40}]->(d),
(c)-[:ROAD {cost:40}]->(d),
(c)-[:ROAD {cost:80}]->(e),
(d)-[:ROAD {cost:30}]->(e),
(d)-[:ROAD {cost:80}]->(f),
(e)-[:ROAD {cost:40}]->(f),
(e)-[:RAIL {cost:20}]->(f);
Dijkstra single source shortest path algorithm
MATCH (start:Loc{name:'A'}), (end:Loc{name:'F'})
CALL algo.shortestPath.stream(start, end, 'cost')
YIELD nodeId, cost
RETURN nodeId, cost LIMIT 20
MATCH (start:Loc{name:'A'}), (end:Loc{name:'F'})
CALL algo.shortestPath(start, end, 'cost',{write:true,writeProperty:'sssp'})
YIELD writeMillis,loadMillis,nodeCount, totalCost
RETURN writeMillis,loadMillis,nodeCount,totalCost
Delta stepping algorithm
MATCH (n:Loc {name:'A'})
CALL algo.shortestPath.deltaStepping.stream(n, 'cost', 3.0)
YIELD nodeId, distance
RETURN nodeId, distance LIMIT 20
MATCH (n:Loc {name:'A'})
CALL algo.shortestPath.deltaStepping(n, 'cost', 3.0, {defaultValue:1.0, write:true, writeProperty:'sssp'})
YIELD nodeCount, loadDuration, evalDuration, writeDuration
RETURN nodeCount, loadDuration, evalDuration, writeDuration
Syntax
CALL algo.shortestPath(startNode:Node, endNode:Node, weightProperty:String
{nodeQuery:'labelName', relationshipQuery:'relationshipName', defaultValue:1.0,write:'true',writeProperty:'sssp',direction:'OUTGOING'})
YIELD nodeCount, totalCost, loadMillis, evalMillis, writeMillis
name | type | default | optional | description
startNode | node | null | no | start node
endNode | node | null | no | end node
weightProperty | string | null | yes | property name that contains weight; if null, treats the graph as unweighted. Must be numeric.
defaultValue | float | null | yes | default value of the weight in case it is missing or invalid
write | boolean | true | yes | if result should be written back as node property
writeProperty | string | 'sssp' | yes | property name the sequence position of the node in the path is written back to
nodeQuery | string | null | yes | label to load from the graph; if null, load all nodes
relationshipQuery | string | null | yes | relationship-type to load from the graph; if null, load all relationships
direction | string | outgoing | yes | relationship direction to load from the graph; if 'both', treats the relationships as undirected
name | type | description
nodeCount | int | number of nodes considered
totalCost | float | sum of all weights along the path
loadMillis | int | milliseconds for loading data
evalMillis | int | milliseconds for running the algorithm
writeMillis | int | milliseconds for writing result data back
CALL algo.shortestPath.stream(startNode:Node, endNode:Node, weightProperty:String
{nodeQuery:'labelName', relationshipQuery:'relationshipName', defaultValue:1.0, direction:'OUTGOING'})
YIELD nodeId, cost
name | type | default | optional | description
startNode | node | null | no | start node
endNode | node | null | no | end node
weightProperty | string | null | yes | property name that contains weight; if null, treats the graph as unweighted. Must be numeric.
nodeQuery | string | null | yes | label to load from the graph; if null, load all nodes
relationshipQuery | string | null | yes | relationship-type to load from the graph; if null, load all relationships
defaultValue | float | null | yes | default value of the weight in case it is missing or invalid
direction | string | outgoing | yes | relationship direction to load from the graph; if 'both', treats the relationships as undirected
name | type | description
nodeId | int | node id
cost | int | cost to get from the start node to this specific node
Versions
We support the following versions of the shortest path algorithms:

✓ directed, unweighted:

direction: 'OUTGOING' or 'INCOMING', weightProperty: null


✓ directed, weighted

direction: 'OUTGOING' or 'INCOMING', weightProperty: 'cost'


✓ undirected, unweighted

direction: 'BOTH', weightProperty: null


✓ undirected, weighted

direction: 'BOTH', weightProperty: 'cost'

Implementations
algo.shortestPath

specify start and end node, find the shortest path between them

Dijkstra single source shortest path algorithm

there may be more than one shortest path; the algorithm returns only one

if initialized with a non-existing weight property, it will treat the graph as unweighted
algo.shortestPath.deltaStepping

specify start node, find the shortest paths to all other nodes

parallel single source shortest path algorithm for weighted graphs with non-negative weights

It can be tuned using the delta parameter, which controls the degree of concurrency.

if initialized with a non-existing weight property, it will treat the graph as unweighted
algo.shortestPaths

specify start node, find the shortest paths to all other nodes

Dijkstra single source shortest path algorithm

if initialized with a non-existing weight property, it will treat the graph as unweighted
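To illustrate how the delta parameter groups work into distance buckets, here is a simplified sequential sketch of delta stepping in Python (the real implementation processes each bucket's nodes in parallel; this sketch only mirrors the bucket logic). The adjacency list mirrors the earlier Loc sample graph:

```python
import math

# adjacency list for the sample graph: node -> [(neighbor, cost), ...]
GRAPH = {
    "A": [("B", 50), ("C", 50), ("D", 100), ("D", 50)],
    "B": [("D", 40)],
    "C": [("D", 40), ("E", 80)],
    "D": [("E", 30), ("F", 80)],
    "E": [("F", 40), ("F", 20)],
    "F": [],
}

def delta_stepping(adj, source, delta):
    """Buckets of width `delta` replace Dijkstra's priority queue."""
    dist = {n: math.inf for n in adj}
    buckets = {}

    def relax(v, d):
        if d < dist[v]:
            old_b = int(dist[v] // delta) if dist[v] < math.inf else None
            if old_b in buckets:               # move v out of its old bucket
                buckets[old_b].discard(v)
                if not buckets[old_b]:
                    del buckets[old_b]
            dist[v] = d
            buckets.setdefault(int(d // delta), set()).add(v)

    relax(source, 0.0)
    while buckets:
        i = min(buckets)               # smallest non-empty bucket
        settled = set()
        while i in buckets:            # light edges (<= delta) may refill bucket i
            frontier = buckets.pop(i)
            settled |= frontier
            for u in frontier:
                for v, w in adj[u]:
                    if w <= delta:
                        relax(v, dist[u] + w)
        for u in settled:              # heavy edges (> delta) are relaxed once
            for v, w in adj[u]:
                if w > delta:
                    relax(v, dist[u] + w)
    return dist
```

A larger delta means bigger buckets and therefore more nodes that can be relaxed concurrently per phase; the final distances are the same for any positive delta.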
All Pairs Shortest Path
All Pairs Shortest Path calculates the shortest weighted path between all pairs of nodes. Algorithms that solve this include Floyd-Warshall and parallel Johnson's algorithm.
History, Explanation
This algorithm finds the shortest paths between all pairs of nodes in the graph.
Some pairs of nodes might not be reachable from each other, in which case no shortest path exists between them.
In this scenario the algorithm returns an Infinity value as the result for these pairs of nodes.
Plain Cypher does not support filtering Infinity values, so the algo.isFinite function was added to help filter Infinity values from results.
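As a rough illustration of the idea (not the procedure's actual internals), the following Python sketch runs Floyd-Warshall on a toy directed graph mirroring the sample Loc graph below. Unreachable pairs keep an infinite distance, which is exactly what the algo.isFinite filter removes:

```python
import math

# directed edges mirroring the sample Loc graph: (source, target, cost)
EDGES = [("A","B",50), ("A","C",50), ("A","D",100), ("A","D",50),
         ("B","D",40), ("C","D",40), ("C","E",80), ("D","E",30),
         ("D","F",80), ("E","F",40), ("E","F",20)]
NODES = ["A", "B", "C", "D", "E", "F"]

def all_pairs_shortest_paths(nodes, edges):
    """Floyd-Warshall: unreachable pairs stay at distance infinity."""
    dist = {u: {v: (0 if u == v else math.inf) for v in nodes} for u in nodes}
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)   # keep the cheapest parallel relationship
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

dist = all_pairs_shortest_paths(NODES, EDGES)
# keep only finite results, analogous to filtering with algo.isFinite
finite = [(u, v, d) for u in NODES for v, d in dist[u].items()
          if u != v and math.isfinite(d)]
```

Since the graph is directed and F has no outgoing relationships, every pair starting at F is unreachable and gets filtered out.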
When to use it / use cases
The all pairs shortest path algorithm has found uses in the following areas:
Algorithm explanation on simple sample graph
CREATE (a:Loc{name:'A'}), (b:Loc{name:'B'}), (c:Loc{name:'C'}),
(d:Loc{name:'D'}), (e:Loc{name:'E'}), (f:Loc{name:'F'}),
(a)-[:ROAD {cost:50}]->(b),
(a)-[:ROAD {cost:50}]->(c),
(a)-[:ROAD {cost:100}]->(d),
(a)-[:RAIL {cost:50}]->(d),
(b)-[:ROAD {cost:40}]->(d),
(c)-[:ROAD {cost:40}]->(d),
(c)-[:ROAD {cost:80}]->(e),
(d)-[:ROAD {cost:30}]->(e),
(d)-[:ROAD {cost:80}]->(f),
(e)-[:ROAD {cost:40}]->(f),
(e)-[:RAIL {cost:20}]->(f);
CALL algo.allShortestPaths.stream('cost',{nodeQuery:'Loc',defaultValue:1.0})
YIELD sourceNodeId, targetNodeId, distance
WITH sourceNodeId, targetNodeId, distance
WHERE algo.isFinite(distance) = true
RETURN sourceNodeId, targetNodeId, distance LIMIT 20
For now only single source shortest path supports loading the relationships as undirected, but we can use Cypher loading to work around this. An undirected graph can be represented as a bidirected graph, that is, a directed graph in which the reverse of every relationship is also a relationship.
We do not have to store these reversed relationships; we can project them using Cypher loading. Note that the relationship query does not specify the direction of the relationship. This applies to all other algorithms that use Cypher loading.
CALL algo.allShortestPaths.stream('cost', {
nodeQuery:'MATCH (n:Loc) RETURN id(n) as id',
relationshipQuery:'MATCH (n:Loc)-[r]-(p:Loc) RETURN id(n) as source, id(p) as target, r.cost as weight',
graph:'cypher', defaultValue:1.0})
Implementations
algo.allShortestPaths.stream

find shortest paths between all pairs of nodes

returns a stream of source-target node pairs together with their distance, for each pair of nodes

write-back is not supported

if initialized with a non-existing weight property, it will treat the graph as unweighted
Community detection: Triangle Counting / Clustering Coefficient
Triangle counting is a community detection graph algorithm that is used to determine the number of triangles passing through each node in the graph. A triangle is a set of three nodes where each node has a relationship to all other nodes.
History, Explanation
Triangle counting gained popularity in social network analysis where it is used to detect communities and measure the cohesiveness of those communities. It is often used as part of the computation of network indices such as the clustering coefficient.
There are two types of clustering coefficient:

The local clustering coefficient of a node is the likelihood that its neighbours are also connected. The computation of this score involves triangle counting.

The global clustering coefficient is the normalized sum of those local clustering coefficients.
The transitivity coefficient of a graph, which is three times the number of triangles divided by the number of connected triples in the graph, is also sometimes used.[2]
When to use it / use cases
Triangle counting has gained popularity in many other domains including:
Algorithm explanation on simple sample graph
CREATE (alice:Person{id:"Alice"}),
(michael:Person{id:"Michael"}),
(karin:Person{id:"Karin"}),
(chris:Person{id:"Chris"}),
(will:Person{id:"Will"}),
(mark:Person{id:"Mark"})
CREATE (michael)-[:KNOWS]->(karin),
(michael)-[:KNOWS]->(chris),
(will)-[:KNOWS]->(michael),
(mark)-[:KNOWS]->(michael),
(mark)-[:KNOWS]->(will),
(alice)-[:KNOWS]->(michael),
(will)-[:KNOWS]->(chris),
(chris)-[:KNOWS]->(karin);
CALL algo.triangle.stream('Person','KNOWS')
YIELD nodeA, nodeB, nodeC;
CALL algo.triangleCount('Person', 'KNOWS',
{concurrency:4, write:true, writeProperty:'triangles',clusteringCoefficientProperty:'coefficient'})
YIELD loadMillis, computeMillis, writeMillis, nodeCount, triangleCount, averageClusteringCoefficient;
CALL algo.triangleCount.stream('Person', 'KNOWS', {concurrency:4})
YIELD nodeId, triangles;
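As a rough illustration of what these procedures compute (not the library's implementation), the following Python sketch counts triangles and local clustering coefficients on an undirected view of the same sample KNOWS graph:

```python
from itertools import combinations

# undirected view of the sample KNOWS graph
EDGES = [("Michael","Karin"), ("Michael","Chris"), ("Will","Michael"),
         ("Mark","Michael"), ("Mark","Will"), ("Alice","Michael"),
         ("Will","Chris"), ("Chris","Karin")]
adj = {}
for a, b in EDGES:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

def triangles(node):
    """Triangles through `node`: pairs of its neighbours that are connected."""
    return sum(1 for u, v in combinations(adj[node], 2) if v in adj[u])

def local_clustering(node):
    """Likelihood that a node's neighbours are also connected to each other."""
    k = len(adj[node])
    return 2 * triangles(node) / (k * (k - 1)) if k > 1 else 0.0

total = sum(triangles(n) for n in adj) // 3   # each triangle is counted at 3 nodes
average = sum(local_clustering(n) for n in adj) / len(adj)
```

Michael sits in all three triangles of this graph, yet his coefficient is lower than Karin's or Mark's because he has more neighbours who are not connected to each other.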
Example Usage
In graph theory, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. Evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterised by a relatively high density of ties; this likelihood tends to be greater than the average probability of a tie randomly established between two nodes.[4]
We check if this holds true for Yelp’s social network of friends.
CALL algo.triangleCount('User', 'FRIEND',
{concurrency:4, write:true, writeProperty:'triangles',clusteringCoefficientProperty:'coefficient'})
YIELD loadMillis, computeMillis, writeMillis, nodeCount, triangleCount, averageClusteringCoefficient;
The average clustering coefficient is 0.0523, which is quite low for a social network. This indicates that groups of friends are not tightly knit together, but rather sparse. We can assume that users are not on Yelp to find and make friends, as they might be on Facebook, but rather for something else, like finding good restaurant recommendations.
The local triangle count and clustering coefficient of nodes can be used as features for finding influencers in social networks.
Syntax
CALL algo.triangle.stream(label:String, relationship:String, {concurrency:4})
YIELD nodeA, nodeB, nodeC // yields nodeA, nodeB and nodeC, which form a triangle
name | type | default | optional | description
label | string | null | yes | label to load from the graph; if null, load all nodes
relationship | string | null | yes | relationship-type to load from the graph; if null, load all relationships
concurrency | int | available CPUs | yes | number of concurrent threads
name | type | description
nodeA | int | id of a node in the given triangle
nodeB | int | id of a node in the given triangle
nodeC | int | id of a node in the given triangle
CALL algo.triangleCount.stream(label:String, relationship:String, {concurrency:4})
YIELD nodeId, triangles // yields nodeId and the number of triangles
name | type | default | optional | description
label | string | null | yes | label to load from the graph. If null, load all nodes
relationship | string | null | yes | relationship-type to load from the graph. If null, load all relationships
concurrency | int | available CPUs | yes | number of concurrent threads
name | type | description
nodeId | int | id of node
triangles | int | number of triangles a node is member of
CALL algo.triangleCount(label:String, relationship:String,
{concurrency:4, write:true, writeProperty:'triangles', clusteringCoefficientProperty:'coefficient'})
YIELD loadMillis, computeMillis, writeMillis, nodeCount, triangleCount, averageClusteringCoefficient
name | type | default | optional | description
label | string | null | yes | label to load from the graph. If null, load all nodes
relationship | string | null | yes | relationship-type to load from the graph. If null, load all relationships
concurrency | int | available CPUs | yes | number of concurrent threads
write | boolean | true | yes | if result should be written back as node property
writeProperty | string | 'triangles' | yes | property name the number of triangles a node is member of is written to
clusteringCoefficientProperty | string | 'coefficient' | yes | property name the clustering coefficient of the node is written to
name | type | description
nodeCount | int | number of nodes considered
loadMillis | int | milliseconds for loading data
computeMillis | int | milliseconds for running the algorithm
writeMillis | int | milliseconds for writing result data back
triangleCount | int | number of triangles in the given graph
averageClusteringCoefficient | float | average clustering coefficient of the given graph
Cypher projection
If label and relationship-type are not selective enough to describe the subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph.
This can also be used to run algorithms on a virtual graph.
Set graph:'cypher'
in the config.
CALL algo.triangleCount(
'MATCH (p:Person) RETURN id(p) as id',
'MATCH (p1:Person)-[:KNOWS]->(p2:Person) RETURN id(p1) as source, id(p2) as target',
{concurrency:4, write:true, writeProperty:'triangle', graph:'cypher', clusteringCoefficientProperty:'coefficient'})
YIELD loadMillis, computeMillis, writeMillis, nodeCount, triangleCount, averageClusteringCoefficient
Versions
We support the following versions of the triangle count algorithms:

✓ undirected, unweighted
Community detection: Label Propagation
Label Propagation (LPA) is an algorithm for finding communities in near linear time [1]. It detects these communities using network structure alone as its guide and requires neither a predefined objective function nor prior information about the communities.
Community detection is an important methodology for understanding the organization of various real-world networks and has applications in problems as diverse as consensus formation in social communities or the identification of functional modules in biochemical networks.
History, Explanation
Many community detection algorithms have been proposed and utilized with varying degrees of effectiveness in community detection research.
The idea of label flooding through a network originated from the L-shell method proposed by Bagrow and Bollt (2005) [2]. The main idea behind LPA is to propagate node labels throughout the network and form communities through this process of label propagation. The intuition behind label flooding is that a single label can quickly become dominant in a densely connected group of nodes, whereas it has trouble crossing a sparsely connected region; labels get trapped inside densely connected groups of nodes. Based on this idea of local trapping of labels, the authors proposed a local community detection method where a node is initialized with a label which propagates step by step through the neighbours until the end of the community is reached.
LPA, which was first proposed by Raghavan et al. (2007) [3], uses unique identifiers of nodes as labels and propagates the labels based on an agreement with the majority of the neighbour nodes.
The algorithm works as follows:

Every node is initialized with a unique label (an identifier)

These labels propagate through the network

At every iteration of propagation, each node updates its label to the one that the maximum number of its neighbours belong to. Ties are broken uniformly at random.

LPA reaches convergence when each node has the majority label of its neighbours.
As labels propagate, densely connected groups of nodes quickly reach a consensus on a unique label. At the end of the propagation only a few labels will remain  most will have disappeared. Nodes that have the same label at convergence are said to belong to the same community.
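The update loop described above can be sketched in Python. This is a simplified sequential version: the node visiting order is fixed and ties are broken deterministically (largest label) so the result is reproducible, whereas the real algorithm visits nodes and breaks ties randomly. The barbell graph is a made-up example with two obvious communities:

```python
from collections import Counter

def label_propagation(adj, max_iter=10):
    labels = {n: n for n in adj}       # every node starts with a unique label
    for _ in range(max_iter):
        changed = False
        for n in adj:                  # real LPA visits nodes in random order
            counts = Counter(labels[m] for m in adj[n])
            best = max(counts.values())
            # deterministic tie-break (largest label); real LPA breaks ties randomly
            new_label = max(l for l, c in counts.items() if c == best)
            if new_label != labels[n]:
                labels[n] = new_label
                changed = True
        if not changed:                # converged: every node holds a majority label
            break
    return labels

# made-up example: two 4-cliques joined by a single bridge (3-4)
edges = [(0,1),(0,2),(0,3),(1,2),(1,3),(2,3),
         (4,5),(4,6),(4,7),(5,6),(5,7),(6,7),(3,4)]
adj = {n: set() for n in range(8)}
for a, b in edges:
    adj[a].add(b); adj[b].add(a)

labels = label_propagation(adj)        # each clique converges to a single label
```

The dense cliques each agree on one label within a couple of passes, while the single bridge relationship is too sparse for either label to cross.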
When to use it / use cases
Community structure is considered one of the most interesting features in complex networks. Many real-world complex systems exhibit community structure, where individuals with similar properties form a community (partition). The identification of partitions in a network is important for understanding its structure.
There are many algorithms that identify community structures in large-scale real-world networks, and many of them require prior information about the number and size of communities. In complex networks the size of communities is very difficult to estimate. With social networks, which are often huge and heterogeneous in nature, it is impossible to predict the number of initial communities.
Algorithms that are able to detect communities in complex networks without any prior knowledge about the number or size of communities are essential. LPA is such an algorithm; it requires neither a predefined objective function nor prior information about the communities.
Constraints / when not to use it
In contrast with other algorithms, label propagation can result in different community structures when run multiple times from the same initial condition. The range of solutions can be narrowed if some nodes are given preliminary labels while others are left unlabelled. Consequently, unlabelled nodes will be more likely to adapt to the labelled ones. For a more accurate finding of communities, Jaccard's index is used to aggregate multiple community structures, containing all important information.[3]
Algorithm explanation on simple sample graph
CREATE (nAlice:User {id:'Alice',predefined_label:52})
,(nBridget:User {id:'Bridget',predefined_label:21})
,(nCharles:User {id:'Charles',predefined_label:43})
,(nDoug:User {id:'Doug',predefined_label:21})
,(nMark:User {id:'Mark',predefined_label: 19})
,(nMichael:User {id:'Michael', predefined_label: 52})
CREATE (nAlice)-[:FOLLOW]->(nBridget)
,(nAlice)-[:FOLLOW]->(nCharles)
,(nMark)-[:FOLLOW]->(nDoug)
,(nBridget)-[:FOLLOW]->(nMichael)
,(nDoug)-[:FOLLOW]->(nMark)
,(nMichael)-[:FOLLOW]->(nAlice)
,(nAlice)-[:FOLLOW]->(nMichael)
,(nBridget)-[:FOLLOW]->(nAlice)
,(nMichael)-[:FOLLOW]->(nBridget)
,(nCharles)-[:FOLLOW]->(nDoug);
CALL algo.labelPropagation('User', 'FOLLOW','OUTGOING',
{iterations:10,partitionProperty:'partition', write:true})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, write, partitionProperty;
name | partition
Alice | 5
Charles | 4
Bridget | 5
Michael | 5
Doug | 4
Mark | 4
The algorithm found two communities with three members each.
Using predefined labels
At the beginning of the algorithm, every node is initialized with a unique label (an identifier), and the labels propagate through the network.
It is possible to define a preliminary set of labels (identifiers) for nodes using the partitionProperty parameter.
We need to save the preliminary set of labels we would like to run the LPA algorithm with as a node property (it must be a number).
In our example graph we saved them as the property predefined_label.
The algorithm first checks if there is a predefined label assigned to a node and loads it if there is one.
If there isn't one, it assigns the node a new unique label (the node id is used).
Using this preliminary set of labels (identifiers), it then sequentially updates each node's label to the most frequent label among its neighbours at every label propagation step (iteration).
This can be used as a semi-supervised way of finding communities, where we hand-pick some initial communities.
CALL algo.labelPropagation('User', 'FOLLOW','OUTGOING',
{iterations:10,partitionProperty:'predefined_label', write:true})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, write, partitionProperty;
Syntax
CALL algo.labelPropagation(label:String, relationship:String, direction:String, {iterations:1,
weightProperty:'weight', partitionProperty:'partition', write:true, concurrency:4})
YIELD nodes, iterations, didConverge, loadMillis, computeMillis, writeMillis, write, weightProperty,
partitionProperty // simple label propagation kernel
name | type | default | optional | description
label | string | null | yes | label to load from the graph. If null, load all nodes
relationship | string | null | yes | relationship-type to load from the graph. If null, load all relationships
direction | string | 'OUTGOING' | yes | relationship direction to use in the algorithm
concurrency | int | available CPUs | yes | number of concurrent threads
iterations | int | 1 | yes | the maximum number of iterations to run
weightProperty | string | 'weight' | yes | property name of node and/or relationship that contains weight. Must be numeric.
partitionProperty | string | 'partition' | yes | property name the partition a node resides in is written back to; can also be used to define the initial set of labels (must be a number)
write | boolean | true | yes | if result should be written back as node property
graph | string | 'heavy' | yes | use 'heavy' when describing the subset of the graph with label and relationship-type parameters, 'cypher' for describing the subset with a Cypher node statement and relationship statement
name | type | description
nodes | int | number of nodes considered
iterations | int | number of iterations that were executed
didConverge | boolean | true if the algorithm converged to a stable labelling within the provided maximum number of iterations
loadMillis | int | milliseconds for loading data
computeMillis | int | milliseconds for running the algorithm
writeMillis | int | milliseconds for writing result data back
weightProperty | string | property name that contains weight
partitionProperty | string | property name written back to
write | boolean | true if the result was written back
Cypher projection
If label and relationship-type are not selective enough to describe the subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph.
This can also be used to run algorithms on a virtual graph.
Set graph:'cypher'
in the config.
CALL algo.labelPropagation(
'MATCH (p:User) RETURN id(p) as id, p.weight as weight, id(p) as value',
'MATCH (p1:User)-[f:FRIEND]->(p2:User)
RETURN id(p1) as source, id(p2) as target, f.weight as weight',
"OUT",
{graph:'cypher',write:true});
Versions
We support the following versions of the label propagation algorithms:

✓ directed, unweighted:

direction: 'INCOMING' or 'OUTGOING', weightProperty: null


✓ directed, weighted

direction: 'INCOMING' or 'OUTGOING', weightProperty : 'weight'


✓ undirected, unweighted

direction: 'BOTH', weightProperty: null


✓ undirected, weighted

direction: 'BOTH', weightProperty: 'weight'

Community detection: Louvain
The Louvain method of community detection is an algorithm for detecting communities in networks that relies upon a heuristic for maximizing the modularity. Communities are groups of nodes within a network that are more densely connected to one another than to other nodes. A typical heuristic would be modularity, which quantifies the quality of an assignment of nodes to communities by evaluating how much more densely connected the nodes within a community are compared to how connected they would be in a random network.
The method consists of repeated application of two steps. The first step is a "greedy" assignment of nodes to communities, favoring local optimizations of modularity. The second step is the definition of a new coarsegrained network based on the communities found in the first step. These two steps are repeated until no further modularityincreasing reassignments of communities are possible.
The Louvain method achieves modularities comparable to preexisting algorithms, typically in less time, so it enables the study of much larger networks. It also reveals a hierarchy of communities at different scales, and this hierarchical perspective can be useful for understanding the global functioning of a network. While there are pitfalls to interpreting the community structure uncovered by the Louvain Method, these difficulties are shared by all modularity optimization algorithms.[2]
Link to the original paper[1]
History, Explanation
A fairly common feature of complex networks is that they consist of sets of nodes that interact more with one another than with nodes outside the set. Social networks, for instance, might consist of tightly knit communities of friends with rarer friendship ties between different communities. In protein interaction networks, certain groups of proteins interact with one another more frequently than they do with proteins in other groups. If you map out a network of projects going on in a large company, certain projects will likely have more conceptual overlap and mutual dependency than others.
In 1962, H.A. Simon proposed that this type of community structure might be a defining characteristic of complex systems, or at least those like the protein interaction network, in which many interacting constituent elements adaptively organize to achieve some higher-order function (e.g., the functioning of an organism). His reasoning was that individual actors have a much higher chance of collectively achieving a higher-order function if that function can be iteratively achieved by constructing intermediate stable forms (also called communities or modules) that achieve simpler functions. The first-order intermediate forms would be communities in terms of the original nodes, but then interactions between these first-order communities could generate second-order communities that accomplish somewhat more complicated functions, and so on. In this way, a hierarchical structure can emerge in complex adaptive systems. However, even when there isn't necessarily adaptive pressure towards achieving some higher-order function (as in the case of some social networks), community structure is a commonly observed feature of complex networks.
The algorithm is initialized with each node in its own community.
In the first stage we iterate through each of the nodes in the network. For each node, we consider the change in modularity if we remove the node from its current community and place it in the community of one of its neighbors. We compute the modularity change for each of the node’s neighbors. If none of these modularity changes are positive, we keep the node in its current community. If some of the modularity changes are positive, we move the node into the community where the modularity change is most positive. Ties are resolved arbitrarily. We repeat this process for each node until one pass through all nodes yields no community assignment changes.
The second stage in the Louvain Method uses the communities that were discovered in the community reassignment stage to define a new coarse-grained network. In this network the newly discovered communities are the nodes. The edge weight between the nodes representing two communities is the sum of the edge weights between the constituent, lower-level nodes of each community. The links within each community generate self-loops in the new, coarse-grained network.
The rest of the Louvain Method consists of repeated application of stages 1 and 2. By applying stage 1 (the community reassignment phase) to the coarse-grained graph, we find a second tier of communities of communities of nodes. Then, in the next application of stage 2, we define a new coarse-grained graph at this higher level of the hierarchy. We keep going like this until an application of stage 1 yields no reassignments. At that point repeated application of stages 1 and 2 will not yield any more modularity-optimizing changes, so the process is complete.[2]
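The quantity that stage 1 greedily improves is modularity. As a rough illustration (not the library's implementation), this Python sketch evaluates modularity for a given community assignment on a made-up toy graph, showing that the "intuitive" split into two cliques scores far higher than lumping everything into one community:

```python
def modularity(adj, community):
    """Q = (1/2m) * sum over same-community pairs of [A_ij - k_i*k_j/(2m)]."""
    two_m = sum(len(neigh) for neigh in adj.values())   # 2 * number of edges
    deg = {n: len(adj[n]) for n in adj}
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] != community[j]:
                continue                                # delta(c_i, c_j)
            a_ij = 1.0 if j in adj[i] else 0.0
            q += a_ij - deg[i] * deg[j] / two_m
    return q / two_m

# made-up toy graph: two 4-cliques joined by a single bridge (3-4)
edges = [(0,1),(0,2),(0,3),(1,2),(1,3),(2,3),
         (4,5),(4,6),(4,7),(5,6),(5,7),(6,7),(3,4)]
adj = {n: set() for n in range(8)}
for a, b in edges:
    adj[a].add(b); adj[b].add(a)

clique_split = {n: (0 if n < 4 else 1) for n in range(8)}
one_community = {n: 0 for n in range(8)}
print(modularity(adj, clique_split), modularity(adj, one_community))
```

Stage 1 moves a node only when such a move increases Q; in practice implementations compute the change in Q incrementally rather than re-evaluating the whole sum as done here.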
When to use it / use cases
One of the applications reported in the original Louvain Method paper was a study of a large Belgian phone call network in which nodes represented customers and weighted links represented the number of phone calls between two customers over a six-month period. The Louvain Method revealed a hierarchy of six levels of communities. At the top level of this hierarchy, the communities representing more than 10,000 customers were strongly segregated by primary language. All except one of these communities had an 85% or greater majority of either French or Dutch speakers. The sole community with a more equitable distribution was positioned at the interface between French and Dutch clusters in the top-level coarse-grained network.
Since 2008 the Louvain Method has found a wide range of applications in analyzing real-world networks. Several of these can be found on the website of the method:

analysis of online social networks such as Twitter, LinkedIn, Flickr, YouTube, and LiveJournal

analysis of collaboration communities in citation networks

analysis of a network of retail transactions

study of brain networks using the Louvain Method [4]
Constraints / when not to use it
Although the Louvain Method, and modularity optimization algorithms more generally, have found wide application across many domains, some problems with these algorithms have been identified.
The resolution limit:
For larger networks the Louvain Method doesn't stop with the "intuitive" communities. Instead there is a second pass through the community modification and coarse-graining stages, in which several of the intuitive communities are merged together. This is a general problem with modularity optimization algorithms: they have trouble detecting small communities in large networks. It is a virtue of the Louvain Method that something close to the intuitive community structure is available as an intermediate step in the process.
The degeneracy problem:
There are typically an exponentially large (in network size) number of community assignments with modularities close to the maximum. This can be a severe problem because, in the presence of a large number of high-modularity solutions, it is hard to find the global maximum and difficult to determine whether the global maximum is truly more scientifically important than local maxima that achieve similar modularity. Research undertaken at the Université catholique de Louvain showed that the different locally optimal community assignments can have quite different structural properties.[3]
Algorithm explanation on simple sample graph
CREATE (nAlice:User {id:'Alice'})
,(nBridget:User {id:'Bridget'})
,(nCharles:User {id:'Charles'})
,(nDoug:User {id:'Doug'})
,(nMark:User {id:'Mark'})
,(nMichael:User {id:'Michael'})
CREATE (nAlice)-[:FRIEND]->(nBridget)
,(nAlice)-[:FRIEND]->(nCharles)
,(nMark)-[:FRIEND]->(nDoug)
,(nBridget)-[:FRIEND]->(nMichael)
,(nCharles)-[:FRIEND]->(nMark)
,(nAlice)-[:FRIEND]->(nMichael)
,(nCharles)-[:FRIEND]->(nDoug);
CALL algo.louvain.stream('User', 'FRIEND', {})
YIELD nodeId, community
RETURN nodeId, community LIMIT 20;
CALL algo.louvain('User', 'FRIEND',
{write:true, writeProperty:'community'})
YIELD nodes, communityCount, iterations, loadMillis, computeMillis, writeMillis;
name | community
Alice | 5
Bridget | 5
Michael | 5
Charles | 4
Doug | 4
Mark | 4
Syntax
CALL algo.louvain(label:String, relationship:String,
{weightProperty:'weight', defaultValue:1.0, write: true,
writeProperty:'community', concurrency:4})
YIELD nodes, communityCount, iterations, loadMillis, computeMillis, writeMillis
name | type | default | optional | description
label | string | null | yes | label to load from the graph. If null, load all nodes
relationship | string | null | yes | relationship-type to load from the graph. If null, load all relationships
weightProperty | string | null | yes | property name that contains weight. If null, treats the graph as unweighted. Must be numeric.
write | boolean | true | yes | if result should be written back as node property
writeProperty | string | 'community' | yes | property name the id of the community a node belongs to is written back to
defaultValue | float | null | yes | default value of the weight in case it is missing or invalid
concurrency | int | available CPUs | yes | number of concurrent threads
graph | string | 'heavy' | yes | use 'heavy' when describing the subset of the graph with label and relationship-type parameters, 'cypher' for describing the subset with a Cypher node statement and relationship statement
name | type | description
nodes | int | number of nodes considered
communityCount | int | number of communities found
iterations | int | number of iterations run
loadMillis | int | milliseconds for loading data
computeMillis | int | milliseconds for running the algorithm
writeMillis | int | milliseconds for writing result data back
CALL algo.louvain.stream(label:String, relationship:String,
{weightProperty:'propertyName', defaultValue:1.0, concurrency:4})
YIELD nodeId, community // yields a community id for each node id
name | type | default | optional | description
label | string | null | yes | label to load from the graph. If null, load all nodes
relationship | string | null | yes | relationship-type to load from the graph. If null, load all relationships
weightProperty | string | null | yes | property name that contains weight. If null, treats the graph as unweighted. Must be numeric.
defaultValue | float | 1.0 | yes | default value of the weight if it is missing or invalid
graph | string | 'heavy' | yes | use 'heavy' when describing the subset of the graph with label and relationship-type parameters, 'cypher' for describing the subset with a Cypher node statement and relationship statement

name | type | description
nodeId | int | node id
community | int | community id
Cypher projection
If label and relationship type are not selective enough to describe the subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph.
This can also be used to run algorithms on a virtual graph.
Set graph:'cypher'
in the config.
CALL algo.louvain(
'MATCH (p:User) RETURN id(p) as id',
'MATCH (p1:User)-[f:FRIEND]-(p2:User)
RETURN id(p1) as source, id(p2) as target, f.weight as weight',
{graph:'cypher',write:true});
Versions

✓ undirected, unweighted

weightProperty: null


✓ undirected, weighted

weightProperty : 'weight'

Community detection: Connected Components
Connected Components (UnionFind) finds sets of connected nodes where each node is reachable from any other node in the same set. In graph theory, a connected component of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the graph.
History, Explanation
The connected components of a graph represent, in general terms, the pieces of the graph. Two vertices are in the same component of a graph if and only if there is some path between them.
Finding connected components is at the heart of many graph applications. For example, consider the problem of identifying partitions in a set of items. We can represent each item by a node and add an edge between each pair of items that are deemed similar. The connected components of this graph correspond to different classes of items.
When to use it / use cases
Testing whether a graph is connected is an essential preprocessing step for every graph algorithm. Such tests can be performed so quickly and easily that you should always verify that your input graph is connected, even when you know it has to be. Subtle, difficult-to-detect bugs often result when your algorithm is run only on one component of a disconnected graph.
Connected components have other practical use cases: for example, if we are analysing a social network, we may want to find all the disconnected groups of people that exist in our graph.
Algorithm explanation on simple sample graph
Recall that an undirected graph is connected if for every pair of vertices there is a path in the graph between those vertices. A connected component of an undirected graph is a maximal connected subgraph of the graph. That means that the direction of the relationships in our graph has no influence, as we treat the graph as undirected. We have two implementations of the connected components algorithm. The first implementation treats the graph as unweighted; the second is weighted, where you can define a weight threshold below which relationships are discarded.
CREATE (nAlice:User {id:'Alice'})
,(nBridget:User {id:'Bridget'})
,(nCharles:User {id:'Charles'})
,(nDoug:User {id:'Doug'})
,(nMark:User {id:'Mark'})
,(nMichael:User {id:'Michael'})
CREATE (nAlice)-[:FRIEND {weight:0.5}]->(nBridget)
,(nAlice)-[:FRIEND {weight:4}]->(nCharles)
,(nMark)-[:FRIEND {weight:1}]->(nDoug)
,(nMark)-[:FRIEND {weight:2}]->(nMichael);
Unweighted version:
CALL algo.unionFind.stream('User', 'FRIEND', {})
YIELD nodeId,setId
RETURN nodeId,setId LIMIT 20;
CALL algo.unionFind('User', 'FRIEND', {write:true, partitionProperty:"partition"})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis;
name  partition 

Alice 
0 
Charles 
0 
Bridget 
0 
Michael 
4 
Doug 
4 
Mark 
4 
The results show that we have two distinct groups of users with no link between them. The first group has the members Alice, Charles and Bridget, and the second group consists of Michael, Doug and Mark.
MATCH (u:User)
RETURN u.partition as partition,count(*) as size_of_partition
ORDER by size_of_partition DESC LIMIT 20;
Weighted version:
If you define both the property that holds the weight (weightProperty) and a threshold, nodes are only connected if the weight of the relationship between them is high enough; otherwise the relationship is discarded.
CALL algo.unionFind.stream('User', 'FRIEND', {weightProperty:'weight', defaultValue:0.0, threshold:1.0})
YIELD nodeId,setId;
CALL algo.unionFind('User', 'FRIEND', {write:true, partitionProperty:"partition",weightProperty:'weight', defaultValue:0.0, threshold:1.0})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis;
name  partition 

Alice 
0 
Charles 
0 
Bridget 
1 
Michael 
4 
Doug 
4 
Mark 
4 
We can observe that because the weight of the relationship between Bridget and Alice is only 0.5, the relationship was discarded and regarded as not existing.
Example Usage
As mentioned, connected components are an essential step in preprocessing your data. One reason is that most centrality algorithms suffer from disconnected components; you may also simply want to find disconnected groups of nodes. Yelp's social network will be used to demonstrate how to proceed when dealing with real-world data. A typical social network consists of one big component and a number of small disconnected components.
CALL algo.unionFind.stream('User', 'FRIEND', {})
YIELD nodeId,setId
RETURN count(distinct setId) as count_of_components;
We get back a count of 18512 disconnected components if we do not count users without friends. Let's now check the size of the top 20 components to get a better picture.
CALL algo.unionFind.stream('User', 'FRIEND', {})
YIELD nodeId,setId
RETURN setId,count(*) as size_of_component
ORDER BY size_of_component DESC LIMIT 20;
The biggest component has 8938630 of the total 8981389 users (99.5%). That is quite high, but not shocking: in a friendship social network we can expect the small-world effect and the six degrees of separation rule, by which you can reach any person in the network; it just depends how long the path is.
We can now move on to the next step of the analysis and run centralities only on the biggest component, so that our results will be more accurate. We simply write the results back to the nodes and use centralities with Cypher loading, or set a new label for the biggest component.
Syntax
CALL algo.unionFind(label:String, relationship:String, {threshold:0.42,
defaultValue:1.0, write: true, partitionProperty:'partition',weightProperty:'weight',graph:'heavy',concurrency:4})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis
- finds connected partitions and potentially writes the result back to the node as the property partition.
name  type  default  optional  description 

label 
string 
null 
yes 
label to load from the graph. If null load all nodes 
relationship 
string 
null 
yes 
relationship type to load from the graph. If null load all relationships 
weightProperty 
string 
null 
yes 
property name that contains weight, if null treats the graph as unweighted. Must be numeric. 
write 
boolean 
true 
yes 
if result should be written back as node property 
partitionProperty 
string 
'partition' 
yes 
property name under which the partition id of each node is written back 
threshold 
float 
null 
yes 
weight threshold; relationships with a weight below this value are discarded 
defaultValue 
float 
null 
yes 
default value of the weight in case it is missing or invalid 
concurrency 
int 
available CPUs 
yes 
number of concurrent threads 
graph 
string 
'heavy' 
yes 
use 'heavy' when describing the subgraph with the label and relationship-type parameters, 'cypher' when describing the subgraph with Cypher node and relationship statements 
name  type  description 

nodes 
int 
number of nodes considered 
setCount 
int 
number of partitions found 
loadMillis 
int 
milliseconds for loading data 
computeMillis 
int 
milliseconds for running the algorithm 
writeMillis 
int 
milliseconds for writing result data back 
CALL algo.unionFind.stream(label:String, relationship:String, {weightProperty:'weight', threshold:0.42, defaultValue:1.0, concurrency:4})
YIELD nodeId, setId - yields a setId for each node id
name  type  default  optional  description 

label 
string 
null 
yes 
label to load from the graph, if null load all nodes 
relationship 
string 
null 
yes 
relationship type to load from the graph, if null load all relationships 
concurrency 
int 
available CPUs 
yes 
number of concurrent threads 
weightProperty 
string 
null 
yes 
property name that contains weight, if null treats the graph as unweighted. Must be numeric. 
threshold 
float 
null 
yes 
weight threshold; relationships with a weight below this value are discarded 
defaultValue 
float 
null 
yes 
default value of the weight in case it is missing or invalid 
graph 
string 
'heavy' 
yes 
use 'heavy' when describing the subgraph with the label and relationship-type parameters, 'cypher' when describing the subgraph with Cypher node and relationship statements 
name 
type 
description 
nodeId 
int 
node id 
setId 
int 
partition id 
Cypher projection
If label and relationship type are not selective enough to describe the subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph.
This can also be used to run algorithms on a virtual graph.
Set graph:'cypher'
in the config.
CALL algo.unionFind(
'MATCH (p:User) RETURN id(p) as id',
'MATCH (p1:User)-[f:FRIEND]->(p2:User)
RETURN id(p1) as source, id(p2) as target, f.weight as weight',
{graph:'cypher',write:true}
);
Implementations
algo.unionFind

if a threshold configuration parameter is supplied, only relationships with a property value higher than the threshold are merged
algo.unionFind.queue

parallel UnionFind using ExecutorService only.

Algorithm based on the idea that DisjointSetStruct can be built using just a partition of the nodes which are then merged pairwise.

The implementation is based on a queue which acts as a buffer for each computed DSS. As long as there are more elements in the queue, the algorithm takes two, merges them and adds the result back to the queue, until only one element remains.
algo.unionFind.forkJoinMerge

As in exp1, the resulting DSS of each node partition is merged by the ForkJoin pool, while the calculation of the DSS is done by the ExecutorService.
algo.unionFind.forkJoin

calculation and merge using forkJoinPool
algo.unionFind.mscoloring

coloring based parallel algorithm
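The queue-based merge used by algo.unionFind.queue can be sketched as follows. This is a simplified, single-threaded illustration that uses plain Sets as stand-ins for the partial DisjointSetStructs produced by the worker threads; the class and method names are hypothetical, not part of the library.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

// Sketch of the queue-based merge: each worker produces a partial result,
// results are buffered in a queue, and as long as more than one element
// remains, two are taken, merged, and the merge result is re-enqueued.
class QueueMerge {
    static Set<Integer> mergeAll(Deque<Set<Integer>> queue) {
        while (queue.size() > 1) {
            Set<Integer> a = queue.poll();
            Set<Integer> b = queue.poll();
            a.addAll(b);      // merge two partial results pairwise
            queue.offer(a);   // re-enqueue until a single result remains
        }
        return queue.poll();
    }
}
```

In the real implementation the merge of two DisjointSetStructs is itself parallelized across the ExecutorService; the loop structure is the same.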
Community detection: Strongly Connected Components
SCC is a class of algorithms for finding groups of nodes where each node is reachable from every other node in the group. The strongly connected components of an arbitrary directed graph form a partition into subgraphs that are themselves strongly connected. There are several algorithms to compute the SCC.
History, Explanation
Decomposing a directed graph into its strongly connected components is a classic application of depth-first search. The problem of finding connected components is at the heart of many graph applications. Generally speaking, the connected components of the graph correspond to different classes of objects. The first linear-time algorithm for strongly connected components is due to Tarjan (1972).
When to use it / use cases
SCC algorithms can be used as a first step in many graph algorithms that work only on strongly connected graphs. In social networks, groups of people are often strongly connected (for example, students of a class or members of any other community). Many people in these groups generally like some common pages or play common games. SCC algorithms can be used to find such groups and suggest the commonly liked pages or games to the people in the group who have not yet liked those pages or played those games.
Algorithm explanation on simple sample graph
A directed graph is strongly connected if there is a path between all pairs of vertices. This algorithm treats the graph as directed, so the direction of the relationships is important, and a strongly connected component exists only if there are relationships between nodes in both directions.
CREATE (nAlice:User {id:'Alice'})
,(nBridget:User {id:'Bridget'})
,(nCharles:User {id:'Charles'})
,(nDoug:User {id:'Doug'})
,(nMark:User {id:'Mark'})
,(nMichael:User {id:'Michael'})
CREATE (nAlice)-[:FOLLOW]->(nBridget)
,(nAlice)-[:FOLLOW]->(nCharles)
,(nMark)-[:FOLLOW]->(nDoug)
,(nMark)-[:FOLLOW]->(nMichael)
,(nBridget)-[:FOLLOW]->(nMichael)
,(nDoug)-[:FOLLOW]->(nMark)
,(nMichael)-[:FOLLOW]->(nAlice)
,(nAlice)-[:FOLLOW]->(nMichael)
,(nBridget)-[:FOLLOW]->(nAlice)
,(nMichael)-[:FOLLOW]->(nBridget);
CALL algo.scc('User','FOLLOW', {write:true,partitionProperty:'partition'})
YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize;
name  partition 

Alice 
1 
Bridget 
1 
Michael 
1 
Charles 
0 
Doug 
2 
Mark 
2 
We can see that we have 2 strongly connected components in our sample graph. The first and biggest component has the members Alice, Bridget and Michael, and the second component has Doug and Mark.
MATCH (u:User)
RETURN u.partition as partition,count(*) as size_of_partition ORDER by size_of_partition DESC LIMIT 1
Syntax
CALL algo.scc(label:String, relationship:String,
{write:true,partitionProperty:'partition',concurrency:4, graph:'heavy'})
YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize
- finds strongly connected partitions and potentially writes the result back to the node as the property partition.
name  type  default  optional  description 

label 
string 
null 
yes 
label to load from the graph. If null load all nodes 
relationship 
string 
null 
yes 
relationship type to load from the graph. If null load all relationships 
write 
boolean 
true 
yes 
if result should be written back as node property 
partitionProperty 
string 
'partition' 
yes 
property name written back to 
concurrency 
int 
available CPUs 
yes 
number of concurrent threads 
graph 
string 
'heavy' 
yes 
use 'heavy' when describing the subgraph with the label and relationship-type parameters, 'cypher' when describing the subgraph with Cypher node and relationship statements 
name  type  description 

setCount 
int 
number of partitions found 
maxSetSize 
int 
number of members in biggest partition 
minSetSize 
int 
number of members in smallest partition 
loadMillis 
int 
milliseconds for loading data 
computeMillis 
int 
milliseconds for running the algorithm 
writeMillis 
int 
milliseconds for writing result data back 
CALL algo.scc.stream(label:String, relationship:String, {concurrency:4})
YIELD nodeId, partition - yields a partition for each node id
name  type  default  optional  description 

label 
string 
null 
yes 
label to load from the graph, if null load all nodes 
relationship 
string 
null 
yes 
relationship type to load from the graph, if null load all relationships 
concurrency 
int 
available CPUs 
yes 
number of concurrent threads 
graph 
string 
'heavy' 
yes 
use 'heavy' when describing the subgraph with the label and relationship-type parameters, 'cypher' when describing the subgraph with Cypher node and relationship statements 
name 
type 
description 
nodeId 
int 
node id 
partition 
int 
partition id 
Cypher projection
If label and relationship type are not selective enough to describe the subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph.
This can also be used to run algorithms on a virtual graph.
Set graph:'cypher'
in the config.
CALL algo.scc(
'MATCH (u:User) RETURN id(u) as id',
'MATCH (u1:User)-[:FOLLOW]->(u2:User) RETURN id(u1) as source,id(u2) as target',
{write:true,graph:'cypher'})
YIELD loadMillis, computeMillis, writeMillis;
Implementations
algo.scc

iterative adaptation (same as
algo.scc.iterative
)
algo.scc.recursive.tarjan

original recursive tarjan implementation
algo.scc.recursive.tunedTarjan

also a recursive tarjan implementation
algo.scc.iterative

iterative adaption of tarjan algorithm
algo.scc.multistep

parallel scc algorithm
Implementers Section
Algorithm Procedures API Discussion
We would like to discuss design of the procedure API for algorithms. There are several concerns:
Naming

simple procedure names,

consistent spelling and casing

discriminate between write-back and result-streaming procedures with the
.stream
suffix 
default algorithm name should be the best implementation

other implementations can be provided via suffixes
Input Graph Data

load via label/reltype selection and optional weight/values

load via cypher statements

the requirements for node ids, degrees, weights, relationships etc. come from the algorithm
Algorithm parameters

start nodes

iterations, dampening

concurrency, batch sizes

thresholds

write-back, write-back properties
IdMapping

original node ids from the graph are mapped to a consecutive id range to be used for the algorithm

this allows simple array offsets to be used within the computation

this also allows mapping of larger id ranges to the computational range of the algorithm
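The id-mapping idea can be sketched as follows; the class and method names here are illustrative, not the library's actual API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of an id mapping: arbitrary (long) original node ids are mapped to
// consecutive ints on first sight, so that plain arrays can back the
// computation; a second array maps internal ids back to the originals.
class IdMap {
    private final Map<Long, Integer> toInternal = new HashMap<>();
    private long[] toOriginal = new long[16];

    // Assign the next consecutive internal id the first time an original id is seen.
    int toMappedNodeId(long originalId) {
        Integer existing = toInternal.get(originalId);
        if (existing != null) return existing;
        int internalId = toInternal.size();
        if (internalId == toOriginal.length) {
            toOriginal = Arrays.copyOf(toOriginal, internalId * 2);
        }
        toInternal.put(originalId, internalId);
        toOriginal[internalId] = originalId;
        return internalId;
    }

    long toOriginalNodeId(int internalId) { return toOriginal[internalId]; }

    int nodeCount() { return toInternal.size(); }
}
```

Because internal ids are dense, an algorithm can store per-node state in a simple `double[nodeCount]` indexed by the internal id.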
Result parameters

output statistics: number of nodes and relationships touched

time for loading, processing, and writing

property written to

errors

data summary, e.g.

min,max,avg of a centrality, rank or weight measure

output of all clusters, their sizes and perhaps the root node


alternatively, streaming back node ids and the computed value (score)
Constraints:

there is no overloading so we need differing names

there are default procedure parameters so we need to require only what’s really required to run

we should use sensible defaults

result values/columns have to be predeclared, there is no inheritance or generics
Suggestions:
Naming: 2 variants per algorithm

returning summary statistics after run:
algo.pageRank(…)

returning a stream of data:
algo.pageRank.stream(…)
the default should be the one with statistics, as the scroll-back from a medium-size / large graph will kill the client
Parameters

node selector or null - label(s) or Cypher statement: ":Person|User" / "MATCH (u:User) RETURN id(u) as id, u.name as value" 

relationship selector or null - reltype(s) or Cypher statement ":KNOWS|FRIEND" / "MATCH (u1:User)-[r:KNOWS|FRIEND]-(u2:User) RETURN id(u1) as source, id(u2) as target, r.strength as weight" 

configuration map with optional parameters but sensible defaults: {iterations: 5, write:true, nodeValue:"value", nodeWeight:"weight", relationshipWeight:"weight", threshold: 0.85} 
We could have the configuration map come from a function (per algorithm?) that handles validation and returns a correct map object for that algorithm? Or just pass the plain map?
Example:
call algo.pageRank("Person","KNOWS",{iterations:10, write:true, [property:"pagerank"]});
call algo.unionFind.stream('
MATCH (p:Person) RETURN id(p) as id
','
MATCH (p1:Person)-[:KNOWS]-(p2:Person) RETURN id(p1) as source, id(p2) as target
', {}) YIELD nodeId, clusterId
API Documentation
Development Notes
Note

When implementing a procedure that yields any results (is not void ) and writes back to the database, make sure to consume the Result after running the algorithm call!

Algorithms are executed within a transaction that is opened and closed by Cypher. The transaction is only marked successful if the results are consumed. Closing the Result directly will fail and roll back any open transactions, and thus will revert all write-back operations.
The Model
The basic idea behind our model is to have a fast cache for the topology of the graph containing only relevant nodes and relationships, and in addition the weights. It implicitly maps (long) node ids to an internal integer id (32 bit) in ascending order, which ensures that no id gets bigger than the maximum node count. This approach allows us to use primitive arrays as containers, for example.
Graph Interface
The Graph
interface specifies methods for iterating over all nodes of the graph as well as iterating over relationships of a given node in the form of forEach…(..)
methods.
The Graph knows the nodeCount
and degree
of each node and can map nodeId to internalId and vice versa.
Method  Description 


return a collection of iterables over every node, partitioned by the given batch size. 

Returns true iff the nodeId is mapped, otherwise false 

degree of the node in that direction 

Iterate over each nodeId 

Iterate over all relationships of a node 

Iterate over all weighted relationships of a node 

count of nodes 

iterate over all nodes 

Map neo4j nodeId to internal nodeId 

Map internal nodeId back to original nodeId 
Note

Currently we have 3 different implementations aiming for different goals like performance, memory consumption and accuracy. 
HeavyGraph
This implementation utilizes an int matrix for storing connections between vertices. It has a higher memory consumption but performs basic calculations on the graph around 3 times faster than the memory-efficient implementation. Furthermore the number of edges per node is limited only by the maximum array size (2 bn) of the JVM. Relationships can be added in arbitrary order.
LightGraph
This implementation takes 3 times less heap due to a more intelligent memory layout. The drawbacks are a slightly higher evaluation time and that the data cannot (yet) be loaded in parallel.
GraphView
The View is just a simple wrapper around the Neo4j Kernel API. It has been implemented for tests and as a baseline for benchmarks.
Import
A fair amount of the work is to fetch the relevant data from Neo4j and import it into our model. Fortunately we can use a multithreading approach to load the graph. This is part of the current implementation.
The abstract GraphFactory
specifies a constructor and the build
method.
It is responsible for creating the Graph using the Neo4j Kernel API.
Loading the Graph is done by using the GraphLoader which implements a simple fluent Builder Pattern.
final Graph graph = new GraphLoader( graphDatabaseAPI )
.setLabel( .. )
.setRelation( .. )
.setProperty( .. )
.setThreadPool( .. )
.load( FactoryImpl.class );
PageRank
Implementation Details
PageRank is Google's popular search algorithm.
Progress

✓ single threaded implementation

✓ tests

✓ simple benchmark

✓ implement procedure

✓ benchmark on bigger graphs

✓ parallelization

✓ evaluation
Requirements

NodeIterator

Incoming Relationships

Outgoing Degrees
Data structures involved
Our current approach needs one double array for storing ranks.
ToDo
parallelization
One approach to parallelize PageRank might be to partition the nodes into batches, one for each thread. Nonetheless we may need to sync them at the end of each iteration.
evaluation

Performance tests on different dataset sizes / level of concurrency
Future Improvements

we might scale up the ranks to ints for faster multiplication.
Details
Partition-based parallel PageRank based on "An Efficient Partition-Based Parallel PageRank Algorithm" [1]

Each partition thread has its local array of only the nodes that it is responsible for, not for all nodes. Combined, all partitions hold all page rank scores for every node once. Instead of writing partition files and transferring them across the network (as done in the paper since they were concerned with parallelising across multiple nodes), we use integer arrays to write the results to. The actual score is upscaled from a double to an integer by multiplying it with {@code 100_000}.

To avoid contention by writing to a shared array, we partition the result array.

During execution, the scores arrays are shaped like this:
[ executing partition ] -> [ calculated partition ] -> [ local page rank scores ]

Each single partition writes into a partitioned array, calculating the scores for every receiving partition. A single partition only sees:
[ calculated partition ] -> [ local page rank scores ]

The coordinating thread then builds the transpose of all written partitions from every partition:
[ calculated partition ] -> [ executing partition ] -> [ local page rank scores ]

This step does not happen in parallel, but does not involve extensive copying. The local page rank scores needn’t be copied, only the partitioning arrays. All in all, {@code concurrency^2} array element reads and assignments have to be performed.

For the next iteration, every partition first updates its scores, in parallel. A single partition now sees:
[ executing partition ] -> [ local page rank scores ]

That is, a list of all calculated scores for itself, grouped by the partition that calculated these scores. This means that most of the synchronization happens in parallel, too.

Partitioning is not done by number of nodes but by the accumulated degree – as described in "Fast Parallel PageRank: A Linear System Approach" [2]. Every partition should have about the same number of relationships to operate on.

This is done to avoid having one partition with super nodes and instead have all partitions run in approximately equal time. Smaller partitions are merged down until we have at most {@code concurrency} partitions, in order to batch partitions and keep the number of threads in use predictable/configurable.
[1] An Efficient Partition-Based Parallel PageRank Algorithm
[2] Fast Parallel PageRank: A Linear System Approach, https://www.cs.purdue.edu/homes/dgleich/
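The degree-based partitioning described above can be sketched like this. The class, method name and the `[start, end)` return shape are assumptions for illustration; the library may merge small partitions further so at most `concurrency` partitions remain.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of partitioning by accumulated degree rather than by node count:
// nodes are assigned to the current partition until the partition's degree
// sum reaches totalDegree / concurrency, so every partition ends up with
// roughly the same number of relationships to operate on.
class DegreePartitioning {
    static List<int[]> partitionByDegree(int[] degrees, int concurrency) {
        long totalDegree = 0;
        for (int d : degrees) totalDegree += d;
        long batchDegree = Math.max(1, totalDegree / concurrency);
        List<int[]> partitions = new ArrayList<>();
        int start = 0;
        long acc = 0;
        for (int node = 0; node < degrees.length; node++) {
            acc += degrees[node];
            if (acc >= batchDegree && node + 1 < degrees.length) {
                partitions.add(new int[]{start, node + 1}); // [start, end) node range
                start = node + 1;
                acc = 0;
            }
        }
        partitions.add(new int[]{start, degrees.length});
        return partitions;
    }
}
```

With a super node of degree 10 among degree-1 nodes, the super node's partition closes early, so no thread is stuck with a disproportionate share of the relationships.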
Betweenness Centrality
Implementation Details
In graph theory, betweenness centrality is a measure of centrality in a graph based on shortest paths. For every pair of nodes in a connected graph, there exists at least one shortest path between the nodes such that either the number of relationships that the path passes through (for unweighted graphs) or the sum of the weights of the relationships (for weighted graphs) is minimized. The betweenness centrality for each vertex is the number of these shortest paths that pass through the vertex.
Progress

✓ adapt apoc procedure to algorithm api

✓ implement procedure

✓ tests

✓ relationship case tests

✓ simple benchmark

✓ benchmark on bigger graphs

✓ parallelization

✓ evaluation

✓ documentation
Details
algo.betweenness

implementation of Brandes' BC algorithm and node-partitioning extension

if
concurrency
parameter is set (and >1) ParallelBetweennessCentrality is used 
ParallelBC spawns N (given by the concurrency param) concurrent threads for calculation, where each one calculates the BC for one node at a time
algo.betweenness.exp1

Brandes-like algorithm which uses successor sets instead of predecessor sets

The algorithm is based on Brandes' definition but with some changes regarding the dependency-accumulation step.

http://cassmt.pnnl.gov/docs/pubs/georgiatechlbnlpnnlfastbcmtaap2009.pdf
algo.betweenness.sampled.*

Calculates betweenness dependencies on a subset of pivot nodes (instead of all nodes). Up to now, 2 randomization strategies are implemented, which can be set using the optional argument strategy:

random selection (default): strategy:'random' (takes an optional argument probability:double (0-1), or log10(N) / e^2 as default)

degree-based randomization: strategy:'degree' (makes dense nodes more likely)

optional Arguments: maxDepth:int
Closeness Centrality
Implementation Details
Closeness centrality of a node is a measure of centrality in a network, calculated from the sum of the lengths of the shortest paths between the node and all other nodes in the graph. Thus the more central a node is, the closer it is to all other nodes.
Details

use
org.neo4j.graphalgo.impl.msbfs.MultiSourceBFS
for BFS 
MSBFS gives depth and number of sources.

in this scheme the farness can be calculated as follows
farness(v) = farness(v) + numberOfSources(v) * depth(v)
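As a naive, single-source illustration of the same accumulation: plain BFS per node over a small adjacency list, where each source contributes its depth one at a time instead of being batched via `numberOfSources(v) * depth(v)`. The class and method names are hypothetical, not the library's API.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

// Naive farness computation: one BFS per source node; farness(v) accumulates
// the depth of v as seen from every source. MS-BFS computes the same sums,
// but processes many sources per BFS pass.
class Closeness {
    static double[] farness(int[][] adj) {
        int n = adj.length;
        double[] farness = new double[n];
        for (int source = 0; source < n; source++) {
            int[] depth = new int[n];
            Arrays.fill(depth, -1);
            ArrayDeque<Integer> queue = new ArrayDeque<>();
            depth[source] = 0;
            queue.add(source);
            while (!queue.isEmpty()) {
                int node = queue.poll();
                farness[node] += depth[node]; // each source adds its distance to node
                for (int next : adj[node]) {
                    if (depth[next] < 0) {
                        depth[next] = depth[node] + 1;
                        queue.add(next);
                    }
                }
            }
        }
        return farness;
    }
}
```

On the path graph 0-1-2 this yields farness [3, 2, 3]: the middle node is closest to all others.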
Harmonic Centrality
Details

use
org.neo4j.graphalgo.impl.msbfs.MultiSourceBFS
for BFS 
MSBFS gives depth and number of sources.

in this scheme the farness can be calculated as follows
inverseFarness(v) = inverseFarness(v) + numberOfSources(v) * (1/depth(v))
Minimum Weight Spanning Tree
Implementation Details
A Minimum Weight Spanning Tree is an acyclic undirected graph which consists of all connected nodes and whose relationship weights are minimal. It can be used to cluster the graph (K-Means). Our current approach uses Prim's algorithm to calculate the MST starting at a given node. This might not cover the whole graph, but if the nodes are connected the MST is always identical regardless of which node the execution starts at.
Progress

✓ single threaded implementation

✓ tests

✓ simple benchmark

✓ implement procedure

✓ benchmark on bigger graphs

[?] parallelization

❏ evaluation
Requirements
BothRelationshipIterator
& Weights
Data structures involved

org.neo4j.graphalgo.core.utils.container.UndirectedTree
as container for efficient splitting and iteration 
An int-based Fibonacci heap priority queue.

Container for visited state
ToDo
benchmark
Implement benchmark on big graph
evaluation

Performance tests on different dataset sizes / level of concurrency
Single Shortest Path
Implementation Details
A Single Source Shortest Path algorithm calculates a path between a pair of nodes whose summed weights are minimal. A commonly used algorithm is Dijkstra's. All Pairs Shortest Path, on the other hand, calculates a shortest path forest containing all paths between the nodes in the graph. Algorithms to solve this are Floyd-Warshall or Parallel Johnson's algorithm.
Progress

✓ single threaded implementation

✓ tests

✓ simple benchmark

✓ implement procedure

❏ benchmark on bigger graphs

❏ parallelization

❏ evaluation
Requirements
(Outgoing)RelationshipIterator
& Weights
Data structures involved

An int-based Fibonacci heap which implements an efficient priority queue.

Different Container for Costs / visited state / paths
ToDo
benchmark
Implement benchmark on big graph
parallelization
Parallelizing All Pairs Shortest Path might be easy using Dijkstra on each thread for a different node. An easy approach for Single Source SP may use two threads: one starting at the start node, one at the end node; the first wins. [More](https://www.cs.rice.edu/~vs3/comp422/lecturenotes/comp422lec24s08v2.pdf)
evaluation

Performance tests on different dataset sizes / level of concurrency
Details
algo.shortestPath

Dijkstra single source shortest path algorithm

The algorithm computes a shortest path on weighted graphs between a given start and target node id. It returns result tuples of [nodeId, distance] for each node in the path

there may be more than one shortest path; the algorithm returns only one

if initialized with a non-existing weight property and a defaultWeight of 1.0, the resulting path is minimal in terms of the number of nodes in the path.
algo.shortestPaths

Dijkstra single source shortest path algorithm

returns minimum distance to all other nodes

if initialized with a non-existing weight property and a defaultWeight of 1.0, the resulting path is minimal in terms of the number of nodes in the path.
algo.shortestPath.deltaStepping
TODO naming!?

parallel non-negative single source shortest path algorithm for weighted graphs

it can be tweaked using the delta parameter which controls the grade of concurrency

returns minimum distance to all other nodes

if initialized with a non-existing weight property and a defaultWeight of 1.0, its result can be interpreted as the number of nodes to reach the target
All Pairs Shortest Path
References

[1] Urban Operations Research  Richard C. Larson and Amedeo R. Odoni

[2] REWIRE: An optimization-based framework for unstructured data center network design - Andrew R. Curtis, Tommy Carpenter, Mustafa Elsheikh, Alejandro López-Ortiz, S. Keshav
Implementation Details
Details
algo.allShortestPaths.stream

returns a stream of source-target node to distance tuples for each pair of nodes 

Since all node ids have already been ordered by the id mapping, we can use an integer which just counts up for each startNodeId, as long as it is < nodeCount, instead of a queue.

Each thread tries to take one int from the counter at one time and starts its computation on it.

The {@link AllShortestPaths#concurrency} value determines the count of workers that should be spawned.

Due to the high memory footprint the result set would have we emit each result into a blocking queue. The result stream takes elements from the queue while the workers add elements to it.

The result stream is limited by N^2. If the stream gets closed prematurely the workers get closed too.

writeback not supported!
Triangle Count
Implementation Details

✓ single threaded implementation

✓ tests

❏ edge case tests

✓ implement procedure

✓ simple benchmark

✓ benchmark on bigger graphs

✓ parallelization

✓ evaluation

✓ documentation
Details
algo.triangle.stream(..) returns a stream of triples of nodeIds, one for each triangle.
algo.triangleCount(..) counts the number of triangles each node is a member of and writes it back. It also counts the total number of triangles in the whole graph and returns it in the stats.
algo.triangleCount.stream(..) counts the number of triangles each node is a member of and returns a stream of nodeId and triangleCount.
Note
sum(triangleCount) == totalTriangleCount * 3, because every triangle adds 1 to the count of each of its 3 nodes.
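This identity can be checked with a small illustrative sketch (naive cubic triple enumeration, not the library's algorithm):

```python
from itertools import combinations

def triangle_counts(adj):
    """Per-node triangle counts for an undirected graph.

    adj maps node -> set of neighbors. Each triangle contributes 1 to
    the count of each of its three nodes, so the per-node counts sum to
    three times the total triangle count.
    """
    counts = {node: 0 for node in adj}
    total = 0
    for a, b, c in combinations(sorted(adj), 3):
        if b in adj[a] and c in adj[a] and c in adj[b]:
            total += 1
            counts[a] += 1
            counts[b] += 1
            counts[c] += 1
    return counts, total
```

For example, a 4-clique contains four triangles and every node is a member of three of them, so the per-node counts sum to 12 = 4 * 3.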
Label Propagation
Implementation Details
Label Propagation is a graph partitioning algorithm already implemented in the current APOC procedures.
Progress

✓ adapt APOC procedure to the algorithm API

✓ single threaded implementation

✓ tests

❏ edge case tests

✓ implement procedure

✓ simple benchmark

✓ benchmark on bigger graphs

✓ parallelization

✓ evaluation

✓ documentation
TODO

adapt existing procedure to the algorithm API
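The core of the algorithm can be sketched as follows (an illustrative Python version with a deterministic tie-break, not the APOC or library implementation):

```python
from collections import Counter

def label_propagation(adj, max_iterations=20):
    """Label propagation over an undirected graph.

    adj maps node -> set of neighbors. Every node starts in its own
    partition; each sweep adopts the most frequent label among the
    neighbors (ties: keep the current label if it is among the most
    frequent, otherwise take the smallest) until no label changes.
    """
    labels = {node: node for node in adj}
    for _ in range(max_iterations):
        changed = False
        for node in sorted(adj):
            if not adj[node]:
                continue
            freq = Counter(labels[nb] for nb in adj[node])
            best = max(freq.values())
            candidates = {lbl for lbl, c in freq.items() if c == best}
            if labels[node] in candidates:
                continue
            labels[node] = min(candidates)
            changed = True
        if not changed:
            break
    return labels
```

Densely connected groups converge to a shared label, which serves as the partition id.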
Louvain
Implementation Details
Louvain is an algorithm for detecting graph partitions in networks that relies on a heuristic for maximizing modularity.

✓ single threaded implementation

✓ tests

❏ edge case tests

✓ implement procedure

✓ simple benchmark

✓ benchmark on bigger graphs

✓ parallelization

❏ evaluation

✓ documentation
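The quantity being maximized is the modularity Q of the current partition. A sketch of the score itself (not the full Louvain loop; the edge-list representation is an assumption for the sketch):

```python
def modularity(edges, communities):
    """Newman modularity Q of a partition of an undirected graph.

    edges is a list of (u, v) pairs; communities maps node -> community id.
    Q = sum over communities c of (e_c / m - (d_c / (2m))^2), where e_c is
    the number of intra-community edges, d_c the total degree inside c,
    and m the total edge count. Louvain greedily moves nodes between
    communities as long as such moves increase Q.
    """
    m = len(edges)
    intra = {}
    degree_sum = {}
    for u, v in edges:
        cu, cv = communities[u], communities[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(
        intra.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )
```

For example, two triangles joined by a single bridge edge, partitioned into the two triangles, score Q = 2 * (3/7 - (7/14)^2) = 5/14.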
Weakly Connected Components
Implementation Details
Connected Components (UnionFind) finds sets of connected nodes, where each node is reachable from any other node in the same set. One implementation also evaluates a Predicate on each relationship, which allows partitioning the graph based on relationships and properties.
Progress

✓ single threaded implementation

✓ tests

✓ simple benchmark

✓ implement procedure

✓ benchmark on bigger graphs

✓ parallelization

✓ evaluation
Requirements
AllRelationshipIterator
& Weights
Data structures involved
We use a disjoint-set structure which is based on a parent-array tree. The DSS can be used to efficiently ask whether two nodes are in the same set. [More](https://en.wikipedia.org/wiki/Disjoint-set_data_structure)
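A minimal sketch of such a parent-array structure (illustrative Python, not the library's Java implementation; the class and method names are ours):

```python
class DisjointSetStruct:
    """Parent-array disjoint-set with path compression and union by rank.

    find follows parent pointers up the tree (compressing the path on
    the way), so asking whether two nodes are in the same connected
    component is nearly O(1) amortized.
    """
    def __init__(self, node_count):
        self.parent = list(range(node_count))  # preallocated parent array
        self.rank = [0] * node_count

    def find(self, node):
        root = node
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[node] != root:       # path compression
            self.parent[node], node = root, self.parent[node]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:      # attach smaller tree below
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1

    def connected(self, a, b):
        return self.find(a) == self.find(b)
```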
ToDo
parallelization
One approach to parallelize UnionFind might be relationship partitioning, where each thread performs the execution on its own DSS instance over a subset of relationships. Each thread thus calculates a distinct set of unions. Afterwards we can merge the DSS instances pairwise, which can also be performed in parallel. Nonetheless the memory consumption might be high due to the preallocated array in each DSS. We could also switch to a growing container if this becomes a problem.
evaluation

Performance tests on different dataset sizes / level of concurrency
Details

writes a clusterId to each node, representing the connected component in which each node is reachable from any other node
algo.unionFind

if a threshold configuration parameter is supplied, only relationships with a property value higher than the threshold are merged
algo.unionFind.queue

parallel UnionFind using ExecutorService only.

Algorithm based on the idea that the DisjointSetStruct can be built using just a partition of the nodes, and the partial structures are then merged pairwise.

The implementation is based on a queue which acts as a buffer for each computed DSS. As long as there are more elements in the queue, the algorithm takes two, merges them, and adds the result back to the queue until only one element remains.
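A sequential sketch of the partition-and-merge idea (the real version runs the build and merge steps on worker threads; class names and the replay-based merge below are illustrative assumptions):

```python
from collections import deque

class SimpleDSS:
    """Minimal parent-array disjoint-set with path halving."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def union_find_merge(node_count, relationships, partitions=4):
    """Build one DSS per relationship partition, then merge the partial
    structures pairwise via a queue until a single DSS remains."""
    if not relationships:
        return SimpleDSS(node_count)
    chunk = max(1, -(-len(relationships) // partitions))  # ceil division
    pending = deque()
    for i in range(0, len(relationships), chunk):
        dss = SimpleDSS(node_count)
        for a, b in relationships[i:i + chunk]:
            dss.union(a, b)
        pending.append(dss)
    while len(pending) > 1:                # pairwise merge, queue as buffer
        left, right = pending.popleft(), pending.popleft()
        for node in range(node_count):     # replay right's sets into left
            left.union(node, right.find(node))
        pending.append(left)
    return pending.popleft()
```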
algo.unionFind.forkJoinMerge

As in exp1, the resulting DSS of each node partition is merged by the ForkJoin pool, while the calculation of the DSS is done by the ExecutorService.
algo.unionFind.forkJoin

calculation and merge using the ForkJoin pool
Strongly Connected Components
Implementation Details
SCC is a class of algorithms for finding groups of nodes where each node is reachable from every other node in the group. There are several algorithms to compute SCCs. Our current implementation (still in the graphtest project) implements Tarjan's SCC algorithm.
Progress

✓ implement procedure

✓ tests

✓ edge case tests

✓ simple benchmark

✓ benchmark on bigger graphs

✓ parallelization

✓ evaluation

✓ documentation
Details
algo.scc.tarjan

original recursive Tarjan implementation

result is a clusterId assigned to each node

each cluster is an SCC

Builds sets of nodeIds which represent a strongly connected component within the graph. Also calculates minimum and maximum setSize as well as the count of distinct sets.
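The recursive scheme can be sketched as follows (illustrative Python, not the project's Java code): one DFS assigns each node an index and a low-link, and a node whose low-link equals its own index is the root of a strongly connected component, which is popped off a shared stack.

```python
def tarjan_scc(adj):
    """Tarjan's strongly connected components.

    adj maps node -> list of successor nodes (directed). Returns the
    list of SCCs as sets of nodes.
    """
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = [0]

    def strongconnect(v):
        index[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in adj.get(v, []):
            if w not in index:
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:        # v is the root of an SCC
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            components.append(component)

    for v in adj:
        if v not in index:
            strongconnect(v)
    return components
```

The recursion depth is bounded by the longest DFS path, which is why an iterative adaptation (see algo.scc.iterative below) is preferable on large graphs.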
algo.scc.tunedTarjan

also a recursive Tarjan implementation

result is a clusterId assigned to each node

https://pdfs.semanticscholar.org/61db/6892a92d1d5bdc83e52cc18041613cf895fa.pdf
algo.scc.iterative

iterative adaptation of Tarjan's algorithm

result is a clusterId assigned to each node

http://code.activestate.com/recipes/578507-strongly-connected-components-of-a-directed-graph/
algo.scc.multistep

parallel SCC algorithm

composition of several scc algorithms (FWBW, coloring, tarjan)

uses FW-BW + coloring to find big SCCs

starts plain Tarjan once the cutoff threshold is reached

http://www.sandia.gov/~srajama/publications/BFS_and_Coloring.pdf
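The FW-BW step can be sketched like this (illustrative Python, omitting the coloring and cutoff-to-Tarjan stages; it assumes every node appears as a key of adj): pick a pivot, intersect its forward- and backward-reachable sets to get the pivot's SCC, and recurse on the three remaining partitions, which in the parallel MultiStep variant can be processed concurrently.

```python
def fwbw_scc(adj):
    """Forward-backward SCC decomposition.

    adj maps node -> list of successor nodes (directed); every node
    must appear as a key. Returns the list of SCCs as sets of nodes.
    """
    radj = {v: [] for v in adj}            # derive the reverse adjacency
    for v, succs in adj.items():
        for w in succs:
            radj.setdefault(w, []).append(v)

    def reach(start, edges, allowed):
        seen, frontier = {start}, [start]
        while frontier:
            v = frontier.pop()
            for w in edges.get(v, []):
                if w in allowed and w not in seen:
                    seen.add(w)
                    frontier.append(w)
        return seen

    components = []

    def decompose(nodes):
        if not nodes:
            return
        pivot = min(nodes)
        fw = reach(pivot, adj, nodes)      # forward reachable set
        bw = reach(pivot, radj, nodes)     # backward reachable set
        components.append(fw & bw)         # the pivot's SCC
        decompose(fw - (fw & bw))          # three independent partitions
        decompose(bw - (fw & bw))
        decompose(nodes - fw - bw)

    decompose(set(adj))
    return components
```

The three recursive calls operate on disjoint node sets and never split an SCC, which is what makes the partitions safe to process in parallel.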