Introduction

This project aims to develop efficient, well-tested graph algorithm implementations for Neo4j 3.1 and 3.2.

The goal is to provide parallel versions of common graph algorithms for Neo4j, exposed as Cypher user-defined procedures:

Centralities:

  • Page Rank

  • Betweenness Centrality

  • Closeness Centrality

Graph Partitioning:

  • Label Propagation

  • (Weakly) Connected Components

  • Strongly Connected Components

Path Finding:

  • Minimum Weight Spanning Tree

  • All-Pairs and Single-Source Shortest Path

These procedures work on a subgraph, optionally filtered by label and relationship-type. Future versions will also provide filtering and projection using Cypher queries.

We’d love your feedback, so please try out these algorithms and let us know how well they work for your use-case. Also please let us know if anything is missing from the installation instructions, readme, etc.

Please raise GitHub issues for anything you encounter or join the neo4j-users Slack group and ask in the #neo4j-graph-algorithm channel.

Installation

Just copy the graph-algorithms-algo-*.jar from the matching release into your $NEO4J_HOME/plugins directory and restart Neo4j.

Running CALL dbms.procedures(); should then also list the algorithm procedures.

CALL dbms.procedures() YIELD name, description, signature
WHERE name STARTS WITH "algo."
RETURN name, description, signature
ORDER BY name
Warning

For safety reasons, in Neo4j 3.2.x you will need to add/enable this line in your $NEO4J_HOME/conf/neo4j.conf:

dbms.security.procedures.unrestricted=algo.*

Introduction

Graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between nodes. A graph is made up of nodes (vertices) which are connected by relationships (edges). A graph may be undirected, meaning that there is no distinction between the two nodes associated with each relationship, or its relationships may be directed from one node to another. Relationships are what a graph is all about: two nodes are joined by a relationship when they are related in a specified way.

We are tied to our friends. Cities are connected by roads and airline routes. Flora and fauna are bound together in a food web. Countries are involved in trading relationships. The World Wide Web is a virtual network of information.

  • Note that Neo4j only stores directed relationships, but we can treat them as though they are undirected when doing the analysis

Usage

These algorithms are exposed as Neo4j procedures. You can call them directly from Cypher in your Neo4j Browser, from cypher-shell or your client code.

For most algorithms there are two procedures, one that writes results back to the graph as node-properties and another (named algo.<name>.stream) that returns a stream of data, e.g. node-ids and computed values.

The general call syntax is:

CALL algo.<name>([label],[relType],{config})

For example for page rank on dbpedia:

CALL algo.pageRank('Page','Link',{iterations:5, dampingFactor:0.85, write: true, writeProperty:'pagerank'});
// YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty

CALL algo.pageRank.stream('Page','Link',{iterations:5, dampingFactor:0.85})
YIELD node, score
RETURN node, score
ORDER BY score DESC LIMIT 10;

Cypher Loading

If label and relationship-type are not selective enough to describe your subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph. Then use a node-statement instead of the label parameter and a relationship-statement instead of the relationship-type and use graph:'cypher' in the config.

You can also return a property value or weight (according to your config) in addition to the ids from these statements.

CALL algo.pageRank(
'MATCH (p:Page) RETURN id(p) as id',
'MATCH (p1:Page)-[:Link]->(p2:Page) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', iterations:5, write: true});

Building

Currently we are targeting Neo4j 3.1 and 3.2 (in the 3.2 branch).

git clone https://github.com/neo4j-contrib/neo4j-graph-algorithms
cd neo4j-graph-algorithms
mvn clean install
cp algo/target/graph-algorithms-*.jar $NEO4J_HOME/plugins/
$NEO4J_HOME/bin/neo4j restart

Overview of Algorithm Procedures

qualified name | description

algo.allShortestPaths.stream

CALL algo.allShortestPaths.stream(weightProperty:String, {nodeQuery:'labelName', relationshipQuery:'relationshipName', defaultValue:1.0}) YIELD sourceNodeId, targetNodeId, distance - yields a stream of {sourceNodeId, targetNodeId, distance}

algo.unionFind.exp2

CALL algo.unionFind(label:String, relationship:String, {property:'weight', threshold:0.42, defaultValue:1.0, write: true, partitionProperty:'partition'}) YIELD nodes, setCount, loadMillis, computeMillis, writeMillis

algo.unionFind.exp2.stream

CALL algo.unionFind.stream(label:String, relationship:String, {property:'propertyName', threshold:0.42, defaultValue:1.0}) YIELD nodeId, setId - yields a setId for each node id

algo.pageRank

CALL algo.pageRank(label:String, relationship:String, {iterations:5, dampingFactor:0.85, write: true, writeProperty:'pagerank'}) YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty - calculates page rank and potentially writes back

algo.pageRank.stream

CALL algo.pageRank.stream(label:String, relationship:String, {iterations:20, dampingFactor:0.85}) YIELD node, score - calculates page rank and streams results

algo.unionFind.exp1

CALL algo.unionFind(label:String, relationship:String, {property:'weight', threshold:0.42, defaultValue:1.0, write: true, partitionProperty:'partition'}) YIELD nodes, setCount, loadMillis, computeMillis, writeMillis

algo.unionFind.exp1.stream

CALL algo.unionFind.stream(label:String, relationship:String, {property:'propertyName', threshold:0.42, defaultValue:1.0}) YIELD nodeId, setId - yields a setId for each node id

algo.betweenness.exp1.stream

CALL algo.betweenness.exp1.stream(label:String, relationship:String, {scaleFactor:1000000}) YIELD nodeId, centrality - yields centrality for each node

algo.betweenness.stream

CALL algo.betweenness.stream(label:String, relationship:String) YIELD nodeId, centrality - yields centrality for each node

algo.betweenness.exp1

CALL algo.betweenness.exp1(label:String, relationship:String, {write:true, writeProperty:'centrality', stats:true, scaleFactor:100000}) YIELD loadMillis, computeMillis, writeMillis, nodes, minCentrality, maxCentrality, sumCentrality - yields status of evaluation

algo.betweenness

CALL algo.betweenness(label:String, relationship:String, {write:true, writeProperty:'centrality', stats:true}) YIELD loadMillis, computeMillis, writeMillis, nodes, minCentrality, maxCentrality, sumCentrality - yields status of evaluation

algo.closeness.stream

CALL algo.closeness.stream(label:String, relationship:String) YIELD nodeId, centrality - yields centrality for each node

algo.closeness

CALL algo.closeness(label:String, relationship:String, {write:true, writeProperty:'centrality'}) YIELD loadMillis, computeMillis, writeMillis, nodes - yields evaluation details

algo.mst

CALL algo.mst(node:Node, weightProperty:String, {nodeLabelOrQuery:String, relationshipTypeOrQuery:String, write:boolean, writeProperty:String, stats:boolean}) YIELD loadMillis, computeMillis, writeMillis, weightSum, weightMin, weightMax, relationshipCount

algo.labelPropagation

CALL algo.labelPropagation(label:String, relationship:String, direction:String, {iterations:1, weightProperty:'weight', partitionProperty:'partition', write:true}) YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, write, weightProperty, partitionProperty - simple label propagation kernel

algo.unionFind.exp3

CALL algo.unionFind(label:String, relationship:String, {property:'weight', threshold:0.42, defaultValue:1.0, write: true, partitionProperty:'partition'}) YIELD nodes, setCount, loadMillis, computeMillis, writeMillis

algo.unionFind.exp3.stream

CALL algo.unionFind.stream(label:String, relationship:String, {property:'propertyName', threshold:0.42, defaultValue:1.0}) YIELD nodeId, setId - yields a setId for each node id

algo.shortestPath.deltaStepping.stream

CALL algo.shortestPath.deltaStepping.stream(startNode:Node, weightProperty:String, delta:Double, {label:'labelName', relationship:'relationshipName', defaultValue:1.0}) YIELD nodeId, distance - yields a stream of {nodeId, distance} from start to end (inclusive)

algo.shortestPath.deltaStepping

CALL algo.shortestPath.deltaStepping(startNode:Node, weightProperty:String, delta:Double, {label:'labelName', relationship:'relationshipName', defaultValue:1.0, write:true, writeProperty:'sssp'}) YIELD loadDuration, evalDuration, writeDuration, nodeCount

algo.unionFind

CALL algo.unionFind(label:String, relationship:String, {weightProperty:'weight', threshold:0.42, defaultValue:1.0, write: true, partitionProperty:'partition'}) YIELD nodes, setCount, loadMillis, computeMillis, writeMillis

algo.unionFind.stream

CALL algo.unionFind.stream(label:String, relationship:String, {weightProperty:'propertyName', threshold:0.42, defaultValue:1.0}) YIELD nodeId, setId - yields a setId for each node id

algo.shortestPath.stream

CALL algo.shortestPath.stream(startNode:Node, endNode:Node, weightProperty:String, {nodeQuery:'labelName', relationshipQuery:'relationshipName', defaultValue:1.0}) YIELD nodeId, cost - yields a stream of {nodeId, cost} from start to end (inclusive)

algo.shortestPath

CALL algo.shortestPath(startNode:Node, endNode:Node, weightProperty:String, {nodeQuery:'labelName', relationshipQuery:'relationshipName', defaultValue:1.0, write:true, writeProperty:'sssp'}) YIELD nodeId, cost, loadMillis, evalMillis, writeMillis - yields nodeCount, totalCost, loadMillis, evalMillis, writeMillis

algo.shortestPaths.stream

CALL algo.shortestPaths.stream(startNode:Node, weightProperty:String, {nodeQuery:'labelName', relationshipQuery:'relationshipName', defaultValue:1.0}) YIELD nodeId, distance - yields a stream of {nodeId, distance} from start to each reachable node

algo.shortestPaths

CALL algo.shortestPaths(startNode:Node, weightProperty:String, {write:true, targetProperty:'path', nodeQuery:'labelName', relationshipQuery:'relationshipName', defaultValue:1.0}) YIELD loadDuration, evalDuration, writeDuration, nodeCount, targetProperty - yields nodeCount, totalCost, loadDuration, evalDuration

algo.scc

CALL algo.scc(label:String, relationship:String, config:Map<String, Object>) YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize

algo.scc.tarjan

CALL algo.scc.tarjan(label:String, relationship:String, config:Map<String, Object>) YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize

algo.scc.tunedTarjan

CALL algo.scc.tunedTarjan(label:String, relationship:String, config:Map<String, Object>) YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize

algo.scc.tunedTarjan.stream

CALL algo.scc.tunedTarjan.stream(label:String, relationship:String, config:Map<String, Object>) YIELD nodeId, partition

algo.scc.iterative

CALL algo.scc.iterative(label:String, relationship:String, config:Map<String, Object>) YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize

algo.scc.iterative.stream

CALL algo.scc.iterative.stream(label:String, relationship:String, config:Map<String, Object>) YIELD nodeId, partition

algo.scc.multistep

CALL algo.scc.multistep(label:String, relationship:String, {write:true, concurrency:4, cutoff:100000}) YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize

algo.scc.multistep.stream

CALL algo.scc.multistep.stream(label:String, relationship:String, {write:true, concurrency:4, cutoff:100000}) YIELD nodeId, partition

algo.scc.forwardBackward.stream

CALL algo.scc.forwardBackward.stream(long startNodeId, label:String, relationship:String, {write:true, concurrency:4}) YIELD nodeId, partition

Algorithms

Graph algorithms are used to compute metrics for graphs, nodes or relationships.

They can provide insights on relevant entities (centralities, ranking) in the graph or inherent structures like communities (community-detection, graph-partitioning, clustering).

Many of them are iterative approaches that frequently traverse the graph for the computation, using random walks, breadth-first or depth-first searches, or pattern matching.

Due to the exponential growth of possible paths with increasing distance, many of these approaches also have high algorithmic complexity.

Fortunately, optimized algorithms exist that utilize certain structures of the graph, memoize already explored parts, and parallelize operations where possible. We have tried to apply these optimizations wherever possible.

Page Rank

PageRank is Google’s popular search algorithm. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

History, Explanation

In order to measure the relative importance of web pages, Sergey Brin and Larry Page proposed PageRank, a method for computing a ranking for every web page based on the graph of the web. PageRank has applications in search, browsing, and traffic estimation.

PageRank is defined in the original Google paper as follows:

We assume page A has pages T1…​Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + …​ + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.

The underlying mathematics of PageRank has to do with random walks on networks, akin to how random surfers propagate through a network. Precisely, PageRank is an example of a discrete ergodic Markov Chain. Random surfers follow links, but occasionally teleport to random vertices. The PageRank of a node is the probability it is visited by a random surfer with teleportation. PageRank is now widely recognized as a way of detecting central nodes in a network, and even has applications in systems biology.
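The iteration described by the formula above can be sketched in a few lines of Python. This is only an illustrative, single-threaded sketch of the PageRank formula, not the parallel implementation behind algo.pageRank, and the tiny two-page graph is hypothetical:

```python
# Illustrative sketch of the PageRank formula quoted above:
#   PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
def page_rank(out_links, d=0.85, iterations=20):
    nodes = list(out_links)
    pr = {n: 1.0 for n in nodes}          # initial rank for every page
    for _ in range(iterations):
        new_pr = {}
        for a in nodes:
            # sum contributions from every page t that links to a,
            # each divided by t's out-degree C(t)
            incoming = sum(pr[t] / len(out_links[t])
                           for t in nodes if a in out_links[t])
            new_pr[a] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# tiny hypothetical web: both pages link to each other,
# so the ranks stay symmetric at the fixed point
ranks = page_rank({'A': ['B'], 'B': ['A']})
```

Because the two pages link only to each other, PR(A) = 0.15 + 0.85 * PR(B) has the symmetric fixed point 1.0 for both pages.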

When to use it / use-cases

The mathematics of PageRank are entirely general and apply to any graph or network in any domain. Thus, PageRank is now regularly used in bibliometrics, social and information network analysis, and for link prediction and recommendation. It’s even used for systems analysis of road networks, as well as biology, chemistry, neuroscience, and physics.

In neuroscience, the PageRank of a neuron in a neural network has been found to correlate with its relative firing rate.

Personalized PageRank is used by Twitter to present users with other accounts they may wish to follow.

PageRank has been used to rank spaces or streets to predict how many people (pedestrians or vehicles) come to the individual spaces or streets. In lexical semantics it has been used to perform Word Sense Disambiguation, to measure semantic similarity, and also to automatically rank WordNet synsets according to how strongly they possess a given semantic property, such as positivity or negativity.

In any ecosystem, a modified version of PageRank may be used to determine species that are essential to the continuing health of the environment.

For the analysis of protein networks in biology PageRank is also a useful tool.

PageRank has recently been used to quantify the scientific impact of researchers. The underlying citation and collaboration networks are used in conjunction with the PageRank algorithm to come up with a ranking system for individual publications which propagates to individual authors. The new index, known as pagerank-index (Pi), is demonstrated to be fairer than the h-index, which exhibits many drawbacks in this context.

Algorithm explanation on simple sample graph

Create sample graph
CREATE (home:Page{name:'Home'})
,(about:Page{name:'About'})
,(product:Page{name:'Product'})
,(links:Page{name:'Links'})
,(a:Page{name:'Site A'})
,(b:Page{name:'Site B'})
,(c:Page{name:'Site C'})
,(d:Page{name:'Site D'})
CREATE (home)-[:LINKS]->(about)
,(about)-[:LINKS]->(home)
,(product)-[:LINKS]->(home)
,(home)-[:LINKS]->(product)
,(links)-[:LINKS]->(home)
,(home)-[:LINKS]->(links)
,(links)-[:LINKS]->(a)-[:LINKS]->(home)
,(links)-[:LINKS]->(b)-[:LINKS]->(home)
,(links)-[:LINKS]->(c)-[:LINKS]->(home)
,(links)-[:LINKS]->(d)-[:LINKS]->(home)
Running algorithm and streaming results
CALL algo.pageRank.stream('Page', 'LINKS', {iterations:20, dampingFactor:0.85})
YIELD node, score
RETURN node, score ORDER BY score DESC LIMIT 20
Running algorithm and writing back results
CALL algo.pageRank('Page', 'LINKS', {iterations:20, dampingFactor:0.85,
write: true,writeProperty:"pagerank"})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty
Table 1. Results
name | pageRank

Home | 3.232

Product | 1.059

Links | 1.059

About | 1.059

Site A | 0.328

Site B | 0.328

Site C | 0.328

Site D | 0.328

As expected, we see that the Home page has the highest PageRank, because it has incoming links from all other pages. We can also observe that not only the number of incoming links is important, but also the importance of the pages that link to us.

Example Usage

minimal
CALL algo.pageRank('Label1', 'TYPE1') YIELD computeMillis
CALL algo.pageRank.stream('Label1', 'TYPE1') YIELD node, score RETURN node, score ORDER BY score DESC LIMIT 20

Syntax

running algorithm and writing back results
CALL algo.pageRank(label:String, relationship:String, {iterations:20, dampingFactor:0.85,
write: true,writeProperty:'pagerank'})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty
- calculates page rank and potentially writes back
Table 2. parameters
name | type | default | optional | description

label | string | null | yes | label to load from the graph, if null load all nodes

relationship | string | null | yes | relationship-type to load from the graph, if null load all relationships

iterations | int | 20 | yes | how many iterations of page-rank to run

dampingFactor | float | 0.85 | yes | damping factor of the page-rank calculation

write | boolean | true | yes | if result should be written back as node property

writeProperty | string | 'pagerank' | yes | property name written back to

graph | string | 'heavy' | yes | use 'heavy' when describing the subset of the graph with label and relationship-type parameter, 'cypher' for describing the subset with a Cypher node-statement and relationship-statement

Table 3. results
name | type | description

nodes | int | number of nodes considered

iterations | int | number of iterations run

dampingFactor | float | damping factor used

writeProperty | string | property name written back to

write | boolean | if result was written back as node property

loadMillis | int | milliseconds for loading data

computeMillis | int | milliseconds for running the algorithm

writeMillis | int | milliseconds for writing result data back

running algorithm and streaming results
CALL algo.pageRank.stream(label:String, relationship:String, {iterations:20, dampingFactor:0.85})
YIELD node, score - calculates page rank and streams results
Table 4. parameters
name | type | default | optional | description

label | string | null | yes | label to load from the graph, if null load all nodes

relationship | string | null | yes | relationship-type to load from the graph, if null load all relationships

iterations | int | 20 | yes | how many iterations of page-rank to run

dampingFactor | float | 0.85 | yes | damping factor of the page-rank calculation

Table 5. results

name | type | description

node | long | node id

score | float | page-rank weight

Cypher loading

If label and relationship-type are not selective enough to describe your subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph. This can also be used to run algorithms on a virtual graph. Set graph:'cypher' in the config.

CALL algo.pageRank(
'MATCH (p:Page) RETURN id(p) as id',
'MATCH (p1:Page)-[:Link]->(p2:Page) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', iterations:5, write: true});

Versions

We support the following versions of the pageRank algorithm:

  • ✓ directed, unweighted

  • ❏ directed, weighted

  • ❏ undirected, unweighted

  • ❏ undirected, weighted

Betweenness Centrality

Betweenness Centrality is a measure of centrality in a graph. In graph theory and network analysis, indicators of centrality identify the most important vertices within a graph. Betweenness centrality is based on the observation that every pair of vertices exchanges information, and that information flows along the geodesic (shortest) path between two vertices. For every pair of vertices in a connected graph, there exists at least one shortest path between them such that either the number of edges the path passes through (unweighted network) or the sum of the weights of its edges (weighted network) is minimized. The betweenness centrality of a node is the number of these shortest paths that pass through the node.

History, Explanation

Anthonisse (1971) introduced the concept of rush in a graph as the amount a node in a network has to intermediate between other nodes. Freeman (1977, 1979) defined betweenness centrality as one of the “three distinct intuitive conceptions of centrality” (Freeman, 1979: 215). Betweenness centrality is often connected with the notion of control over the flow of information. Betweenness centrality is calculated by a breadth-first search algorithm which calculates the shortest paths from every node to all other nodes (Brandes 2001). The nodes which most frequently lie on these shortest paths are favored in the betweenness centrality score.

Betweenness Centrality finds wide application in network theory. It represents the degree of which nodes stand between each other. For example, in a telecommunications network, a node with higher betweenness centrality would have more control over the network, because more information will pass through that node. Betweenness centrality was devised as a general measure of centrality. It applies to a wide range of problems in network theory, including problems related to social networks, biology, transport and scientific cooperation.

When to use it / use-cases


Betweenness centrality is useful in finding vertices that serve as a bridge from one part of a graph to another. Consequently, betweenness is a rudimentary measure of the control that a specific node exerts over the flow throughout the full graph. For example, Alice in the sample graph below is the main connection in the graph. Were Alice to be removed, all connections in the graph would be cut off. This makes Alice “important”, because she ensures that no nodes are isolated. When used as an analysis measure, betweenness centrality indicates a potential gatekeeping or controlling node.

It differs from the other centrality measures. A node can have quite low degree, be connected to others that have low degree, even be a long way from others on average, and still have high betweenness. Consider a node A that lies on a bridge between two groups of vertices within a network. Since any path between nodes in different groups must go through this bridge, node A acquires high betweenness even though it is not well connected (it lies at the periphery of both groups).

Algorithm explanation on simple sample graph

People with high betweenness tend to be the innovators and brokers in any network. They combine different perspectives, transfer ideas between groups (or decide to withhold them), and get power from their ability to make introductions and pull strings.

Create sample graph
CREATE (nAlice:User {id:'Alice'})
,(nBridget:User {id:'Bridget'})
,(nCharles:User {id:'Charles'})
,(nDoug:User {id:'Doug'})
,(nMark:User {id:'Mark'})
,(nMichael:User {id:'Michael'})
CREATE (nAlice)-[:MANAGE]->(nBridget)
,(nAlice)-[:MANAGE]->(nCharles)
,(nAlice)-[:MANAGE]->(nDoug)
,(nMark)-[:MANAGE]->(nAlice)
,(nCharles)-[:MANAGE]->(nMichael)
Running algorithm and streaming results
CALL algo.betweenness.stream('User','MANAGE')
YIELD nodeId, centrality
RETURN nodeId, centrality ORDER BY centrality DESC LIMIT 20
Running algorithm and writing back results
CALL algo.betweenness('User','MANAGE', {write:true, writeProperty:'centrality'})
YIELD nodes, minCentrality, maxCentrality, sumCentrality, loadMillis, computeMillis, writeMillis
Table 6. Results
name | centrality

Alice | 4

Charles | 2

Bridget | 0

Michael | 0

Doug | 0

Mark | 0

We can see that Alice is the main broker in this network and Charles is a minor broker.
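These centrality values can be reproduced with a small brute-force Python sketch. This is illustrative only: it enumerates every shortest path per node pair, whereas the actual procedure uses a parallelized Brandes-style implementation.

```python
from collections import deque
from itertools import permutations

def shortest_paths(adj, s, t):
    # BFS enumerating all shortest simple paths from s to t;
    # paths leave the queue in nondecreasing length order
    best, results, queue = None, [], deque([[s]])
    while queue:
        path = queue.popleft()
        if best is not None and len(path) > best:
            break
        if path[-1] == t:
            best = len(path)
            results.append(path)
            continue
        for n in adj[path[-1]]:
            if n not in path:
                queue.append(path + [n])
    return results

def betweenness(adj):
    # for every ordered pair (s, t), add to each intermediate node
    # the fraction of shortest s->t paths that pass through it
    bc = {v: 0.0 for v in adj}
    for s, t in permutations(adj, 2):
        paths = shortest_paths(adj, s, t)
        if not paths:
            continue
        for v in adj:
            if v not in (s, t):
                bc[v] += sum(1 for p in paths if v in p) / len(paths)
    return bc

# the directed MANAGE graph from the sample above
adj = {'Alice': ['Bridget', 'Charles', 'Doug'], 'Bridget': [],
       'Charles': ['Michael'], 'Doug': [], 'Mark': ['Alice'],
       'Michael': []}
bc = betweenness(adj)   # Alice: 4.0, Charles: 2.0, everyone else: 0.0
```

Alice scores 4 because the paths from Mark to Bridget, Charles, Doug, and Michael all pass through her; Charles scores 2 for the paths from Alice and Mark to Michael.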

Syntax

Running algorithm and writing back results
CALL algo.betweenness(label:String, relationship:String, {write:true, stats:true, writeProperty:'centrality'})
YIELD nodes, minCentrality, maxCentrality, sumCentrality, loadMillis, computeMillis, writeMillis
- calculates betweenness centrality and potentially writes back
Table 7. Parameters
name | type | default | optional | description

label | string | null | yes | label to load from the graph, if null load all nodes

relationship | string | null | yes | relationship-type to load from the graph, if null load all relationships

write | boolean | true | yes | if result should be written back as node property

stats | boolean | true | yes | if stats about centrality should be returned

writeProperty | string | 'centrality' | yes | property name written back to

graph | string | 'heavy' | yes | use 'heavy' when describing the subset of the graph with label and relationship-type parameter, 'cypher' for describing the subset with a Cypher node-statement and relationship-statement

Table 8. Results
name | type | description

nodes | int | number of nodes considered

minCentrality | int | minimum centrality value

maxCentrality | int | maximum centrality value

sumCentrality | int | sum of all centrality values

loadMillis | int | milliseconds for loading data

computeMillis | int | milliseconds for running the algorithm

writeMillis | int | milliseconds for writing result data back

Running algorithm and streaming results
CALL algo.betweenness.stream(label:String, relationship:String) YIELD nodeId, centrality - yields centrality for each node
Table 9. Parameters
name | type | default | optional | description

label | string | null | yes | label to load from the graph, if null load all nodes

relationship | string | null | yes | relationship-type to load from the graph, if null load all relationships

Table 10. Results

name | type | description

node | long | node id

centrality | float | betweenness centrality weight

Cypher loading

If label and relationship-type are not selective enough to describe your subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph. This can also be used to run algorithms on a virtual graph. Set graph:'cypher' in the config.

CALL algo.betweenness(
'MATCH (p:User) RETURN id(p) as id',
'MATCH (p1:User)-[:MANAGE]->(p2:User) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', write: true});

Versions

We support the following versions of the betweenness centrality algorithm:

  • ✓ directed, unweighted

  • ❏ directed, weighted

  • ❏ undirected, unweighted

  • ❏ undirected, weighted

Implementations

algo.betweenness()

  • implementation of brandes-bc algorithm and nodePartitioning extension

  • if concurrency parameter is set (and >1) ParallelBetweennessCentrality is used

  • ParallelBC spawns N(given by the concurrency param) concurrent threads for calculation where each one calculates the BC for one node at a time

algo.betweenness.exp1()

  • brandes-like algorithm which uses successor sets instead of predecessor sets

  • The algorithm is based on Brandes definition but with some changes regarding the dependency-accumulation step.

Closeness Centrality

Closeness Centrality of a node is a measure of centrality in a network that measures the nearness (as opposed to the distance) from a node to all other nodes. Nodes having a high closeness score have the shortest distances to all other nodes. This is important for the availability of knowledge and resources.

History, Explanation

Sabidussi (1966) described the sum of the shortest path distances from one node to every other node as the node’s farness. Freeman (1979) used this idea to define closeness centrality of a node as the inverse of Sabidussi’s farness. Nodes having a high closeness centrality are nearby all other nodes and have advantages in accessing resources in a network or having a good overview of the agents in a network.

When to use it / use-cases

The most central nodes according to closeness centrality can quickly interact with all others because they are close to all others. This measure is preferable to degree centrality because it takes into account not only direct connections among nodes, but also indirect connections.

Constraints / when not to use it

A key node centrality measure in networks is closeness centrality (Freeman, 1978; Opsahl et al., 2010; Wasserman and Faust, 1994). It is defined as the inverse of farness, which in turn, is the sum of distances to all other nodes. As the distance between nodes in disconnected components of a network is infinite, this measure cannot be applied to networks with disconnected components (Opsahl et al., 2010; Wasserman and Faust, 1994).

Algorithm explanation on simple sample graph

  • count farness in each msbfs-callback

  • divide by N-1

Graph:

( A )⇐⇒( B )⇐⇒( C )⇐⇒( D )⇐⇒( E )

Calculation:

N = 5 // number of nodes

k = N-1 = 4 // used for normalization

     A     B     C     D     E
 --|-----------------------------
 A | 0     1     2     3     4       // farness between each pair of nodes
 B | 1     0     1     2     3
 C | 2     1     0     1     2
 D | 3     2     1     0     1
 E | 4     3     2     1     0
 --|-----------------------------
 S | 10    7     6     7     10      // sum each column
 ==|=============================
k/S| 0.4  0.57  0.67  0.57   0.4     // normalized centrality
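The farness-and-normalization calculation above can be sketched in Python. This is an illustrative sketch of the definition, not the multi-source BFS used by the procedure:

```python
from collections import deque

def closeness(adj):
    # closeness = (N-1) / farness, where farness is the sum of
    # shortest-path distances from a node to every other node
    n = len(adj)
    scores = {}
    for source in adj:
        # plain BFS distances, since the graph is unweighted
        dist, queue = {source: 0}, deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        scores[source] = (n - 1) / sum(dist.values())
    return scores

# the path graph from the calculation above: (A)<=>(B)<=>(C)<=>(D)<=>(E)
adj = {'A': ['B'], 'B': ['A', 'C'], 'C': ['B', 'D'],
       'D': ['C', 'E'], 'E': ['D']}
scores = closeness(adj)   # A: 0.4, B: ~0.57, C: ~0.67, matching the table
```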
CALL algo.closeness('Person', 'KNOWS', {write:true, writeProperty:'centrality'})
YIELD nodes,loadMillis, computeMillis, writeMillis
- calculates closeness centrality and potentially writes back
CALL algo.closeness.stream('Person', 'KNOWS') YIELD nodeId, centrality
RETURN nodeId, centrality ORDER BY centrality DESC LIMIT 20 - yields centrality for each node

Syntax

Running algorithm and writing back results
CALL algo.closeness(label:String, relationship:String, {write:true, writeProperty:'centrality'})
YIELD nodes,loadMillis, computeMillis, writeMillis
- calculates closeness centrality and potentially writes back
Table 11. Parameters
name | type | default | optional | description

label | string | null | yes | label to load from the graph, if null load all nodes

relationship | string | null | yes | relationship-type to load from the graph, if null load all relationships

write | boolean | true | yes | if result should be written back as node property

writeProperty | string | 'centrality' | yes | property name written back to

graph | string | 'heavy' | yes | use 'heavy' when describing the subset of the graph with label and relationship-type parameter, 'cypher' for describing the subset with a Cypher node-statement and relationship-statement

Table 12. Results
name | type | description

nodes | int | number of nodes considered

loadMillis | int | milliseconds for loading data

computeMillis | int | milliseconds for running the algorithm

writeMillis | int | milliseconds for writing result data back

Running algorithm and streaming results
CALL algo.closeness.stream(label:String, relationship:String) YIELD nodeId, centrality - yields centrality for each node
Table 13. Parameters
name | type | default | optional | description

label | string | null | yes | label to load from the graph, if null load all nodes

relationship | string | null | yes | relationship-type to load from the graph, if null load all relationships

Table 14. Results
name        type   description
nodeId      long   node id
centrality  float  closeness centrality weight

Cypher loading

If label and relationship-type are not selective enough to describe the subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph. This can also be used to run algorithms on a virtual graph. Set graph:'cypher' in the config.

CALL algo.closeness(
'MATCH (p:Person) RETURN id(p) as id',
'MATCH (p1:Person)-[:KNOWS]->(p2:Person) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', write: true});

Versions

We support the following versions of the closeness centrality algorithm:

  • ✓ directed, unweighted

  • ❏ directed, weighted

  • ❏ undirected, unweighted

  • ❏ undirected, weighted

Minimum Weight Spanning Tree

A Minimum Weight Spanning Tree (MST) is a subset of the edges of a connected, edge-weighted undirected graph that connects all the vertices together, without any cycles and with the minimum possible total edge weight. That is, it is a spanning tree whose sum of edge weights is as small as possible. It can be used to cluster the graph (e.g. k-Means).

History, Explanation

In graph theory, a tree is a way of connecting all the vertices together so that there is exactly one path from any one vertex to any other vertex of the tree. If the graph represents a number of cities connected by roads, one could select a number of roads so that each city can be reached from every other, but there is no more than one way to travel from one city to another.

Most of the time, graphs are weighted; each connection between two cities has a weight: it might cost something to travel on a given road, or one connection may be longer than another, meaning it takes more time to travel on that connection.

A minimum spanning tree is a tree. It is different from other trees in that it minimizes the total of the weights attached to the edges. Depending on what the graph looks like, there may be more than one minimum spanning tree. In a graph where all the edges have the same weight, every spanning tree is a minimum spanning tree. If all the edges have different weights, there is exactly one minimum spanning tree.

Czech scientist Otakar Borůvka developed the first known algorithm for finding a minimum spanning tree in 1926. He wanted to solve the problem of finding an efficient coverage of Moravia with electricity. Today, this algorithm is known as Borůvka’s algorithm.

When to use it / use-cases

There are quite a few use cases for minimum spanning trees. One example would be a telecommunications company trying to lay out cables in a new neighborhood. If it is constrained to bury the cable only along certain paths (e.g. along roads), then there would be a graph representing which points are connected by those paths. Some of those paths might be more expensive, because they are longer or require the cable to be buried deeper; these paths would be represented by edges with larger weights. Currency is an acceptable unit for edge weight; there is no requirement for edge lengths to obey normal rules of geometry such as the triangle inequality. A spanning tree for that graph would be a subset of those paths that has no cycles but still connects every house; there might be several spanning trees possible. A minimum spanning tree would be the one with the lowest total cost, and would thus represent the least expensive way to lay the cable.

Algorithm explanation on simple sample graph

    a                a
1 /   \ 2          /  \
 /     \          /    \
b --3-- c        b      c
|       |   =>   |      |
4       5        |      |
|       |        |      |
d --6-- e        d      e
MATCH(n:Node{start:true})
CALL algo.mst(n, 'cost', {write:true, writeProperty:"mst"})
YIELD loadMillis, computeMillis, writeMillis, weightSum, weightMin,weightMax, relationshipCount
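
The edge selection on the right-hand side of the diagram can be checked with a Kruskal-style sketch (illustrative Python, not the library's implementation, which grows the tree from the given start node):

```python
def kruskal(nodes, edges):
    """Kruskal's MST: greedily add the cheapest edge joining two components."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    tree = []
    for w, a, b in sorted(edges):
        ra, rb = find(a), find(b)
        if ra != rb:  # edge connects two components: keep it
            parent[ra] = rb
            tree.append((a, b, w))
    return tree

# the weighted edges from the diagram above
edges = [(1, 'a', 'b'), (2, 'a', 'c'), (3, 'b', 'c'),
         (4, 'b', 'd'), (5, 'c', 'e'), (6, 'd', 'e')]
mst = kruskal('abcde', edges)  # keeps a-b, a-c, b-d, c-e with total weight 12
```
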

Syntax

Running algorithm and writing back results
CALL algo.mst(node:Node, weightProperty:String, {nodeLabelOrQuery:String,
 relationshipTypeOrQuery:String, write:boolean, writeProperty:String, stats:boolean})
YIELD loadMillis, computeMillis, writeMillis, weightSum, weightMin, weightMax, relationshipCount
Table 15. Parameters
name                     type     default  optional  description
node                     Node     null     no        node to load from the graph
weightProperty           string   'cost'   yes       property name that contains weight. Must be numeric.
nodeLabelOrQuery         string   null     yes       label to load from the graph; if null, load all nodes
relationshipTypeOrQuery  string   null     yes       relationship-type to load from the graph; if null, load all relationships
write                    boolean  true     yes       if result should be written back as node property
writeProperty            string   'mst'    yes       relationship-type written back as result

Table 16. Results
name               type  description
weightSum          int   sum of all weights
weightMax          int   maximum value of weight
weightMin          int   minimum value of weight
relationshipCount  int   number of relationships created
loadMillis         int   milliseconds for loading data
computeMillis      int   milliseconds for running the algorithm
writeMillis        int   milliseconds for writing result data back

All Pairs- and Single Source - Shortest Path

A Single Source Shortest Path algorithm calculates a path between a pair of nodes such that the sum of its relationship weights is minimal. A commonly used algorithm is Dijkstra's.

All Pairs Shortest Path, on the other hand, calculates a shortest-path forest containing all shortest paths between the nodes in the graph. The Floyd-Warshall algorithm or the parallel Johnson's algorithm can be used to solve this.

History, Explanation

Path finding, in particular searching in a maze, belongs to the classical graph problems, and the classical references are Wiener [1873], Lucas [1882] (describing a method due to C.P. Trémaux), and Tarry [1895] – see Biggs, Lloyd, and Wilson [1976]. They form the basis for depth-first search techniques. Path problems were also studied at the beginning of the 1950s in the context of 'alternate routing', that is, finding a second shortest route if the shortest route is blocked. This applies to freeway usage (Trueblood [1952]), but also to telephone call routing. At that time, making long-distance calls in the U.S.A. was being automated, and alternate routes for telephone calls over the nation-wide U.S. telephone network had to be found automatically.

When to use it / use-cases

Shortest path algorithms are applied to automatically find directions between physical locations, such as driving directions on web mapping websites like MapQuest or Google Maps.

If one represents a nondeterministic abstract machine as a graph where nodes describe states and relationships describe possible transitions, shortest path algorithms can be used to find an optimal sequence of choices to reach a certain goal state, or to establish lower bounds on the time needed to reach a given state. For example, if nodes represent the states of a puzzle like a Rubik’s Cube and each directed relationship corresponds to a single move or turn, shortest path algorithms can be used to find a solution that uses the minimum possible number of moves.

In a networking or telecommunications mindset, this shortest path problem is sometimes called the min-delay path problem and usually tied with a widest path problem. For example, the algorithm may seek the shortest (min-delay) widest path, or widest shortest (min-delay) path.

Other applications, often studied in operations research, include plant and facility layout, robotics, transportation, and VLSI design.

Algorithm explanation on simple sample graph

Create sample graph
CREATE (a:Loc{name:'A'}), (b:Loc{name:'B'}), (c:Loc{name:'C'}),
       (d:Loc{name:'D'}), (e:Loc{name:'E'}), (f:Loc{name:'F'}),
       (a)-[:ROAD {cost:50}]->(b),
       (a)-[:ROAD {cost:50}]->(c),
       (a)-[:ROAD {cost:100}]->(d),
       (a)-[:RAIL {cost:50}]->(d),
       (b)-[:ROAD {cost:40}]->(d),
       (c)-[:ROAD {cost:40}]->(d),
       (c)-[:ROAD {cost:80}]->(e),
       (d)-[:ROAD {cost:30}]->(e),
       (d)-[:ROAD {cost:80}]->(f),
       (e)-[:ROAD {cost:40}]->(f),
       (e)-[:RAIL {cost:20}]->(f);
Dijkstra single source shortest path algorithm
Running algorithm and streaming results
MATCH (start:Loc{name:'A'}), (end:Loc{name:'F'})
CALL algo.shortestPath.stream(start, end, 'cost')
YIELD nodeId, cost
RETURN nodeId, cost LIMIT 20
Running algorithm and writing back results
MATCH (start:Loc{name:'A'}), (end:Loc{name:'F'})
CALL algo.shortestPath(start, end, 'cost',{write:true,writeProperty:'sssp'})
YIELD writeMillis,loadMillis,nodeCount, totalCost
RETURN writeMillis,loadMillis,nodeCount,totalCost
Delta stepping algorithm
Running algorithm and streaming results
MATCH(n:Loc {name:'A'})
CALL algo.deltaStepping.stream(n, 'cost', 3.0)
YIELD nodeId, distance
RETURN nodeId, distance LIMIT 20
Running algorithm and writing back results
MATCH(n:Node {name:'s'})
CALL algo.deltaStepping(n, 'cost', 3.0, {write:true, writeProperty:'sssp'})
YIELD nodeCount, loadDuration, evalDuration, writeDuration
RETURN nodeCount, loadDuration, evalDuration, writeDuration
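
On the sample graph above (ROAD and RAIL relationships combined), Dijkstra's behaviour can be sketched like this (illustrative Python, not the library's implementation):

```python
import heapq

def dijkstra(graph, source):
    """Classic Dijkstra: pop the closest unsettled node, relax its edges."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float('inf')):
            continue  # stale heap entry, already settled with a shorter path
        for neighbor, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(neighbor, float('inf')):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

graph = {
    'A': [('B', 50), ('C', 50), ('D', 100), ('D', 50)],
    'B': [('D', 40)],
    'C': [('D', 40), ('E', 80)],
    'D': [('E', 30), ('F', 80)],
    'E': [('F', 40), ('F', 20)],
}
dist = dijkstra(graph, 'A')  # shortest cost from A to F is 100 (A-D-E-F via RAIL)
```
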

Syntax

Running algorithm and writing back results
CALL algo.shortestPath(startNode:Node, endNode:Node, weightProperty:String
{nodeQuery:'labelName', relationshipQuery:'relationshipName', defaultValue:1.0,write:'true',writeProperty:'sssp'})
YIELD nodeCount, totalCost, loadMillis, evalMillis, writeMillis
Table 17. Parameters
name               type     default  optional  description
startNode          node     null     no        start node
endNode            node     null     no        end node
weightProperty     string   null     yes       property name that contains weight; if null, treats the graph as unweighted. Must be numeric.
defaultValue       float    null     yes       default value of the weight in case it is missing or invalid
write              boolean  true     yes       if result should be written back as node property
writeProperty      string   'sssp'   yes       property name written back; the value is the position of the node in the path sequence
nodeQuery          string   null     yes       label to load from the graph; if null, load all nodes
relationshipQuery  string   null     yes       relationship-type to load from the graph; if null, load all relationships

Table 18. Results
name         type   description
nodeCount    int    number of nodes considered
totalCost    float  sum of all weights along the path
loadMillis   int    milliseconds for loading data
evalMillis   int    milliseconds for running the algorithm
writeMillis  int    milliseconds for writing result data back

Running algorithm and streaming results
CALL algo.shortestPath.stream(startNode:Node, endNode:Node, weightProperty:String
{nodeQuery:'labelName', relationshipQuery:'relationshipName', defaultValue:1.0})
 YIELD nodeId, cost
Table 19. Parameters
name               type    default  optional  description
startNode          node    null     no        start node
endNode            node    null     no        end node
weightProperty     string  null     yes       property name that contains weight; if null, treats the graph as unweighted. Must be numeric.
nodeQuery          string  null     yes       label to load from the graph; if null, load all nodes
relationshipQuery  string  null     yes       relationship-type to load from the graph; if null, load all relationships
defaultValue       float   null     yes       default value of the weight in case it is missing or invalid

Table 20. Results
name    type  description
nodeId  int   node id
cost    int   cost it takes to get from the start node to this node

Graph Partitioning: Label Propagation

The Label Propagation algorithm (LPA) is an extremely fast graph partitioning method widely used in large-scale networks. Graph partitioning (community detection) is an important methodology for understanding the organization of various real-world networks and has applications in problems as diverse as consensus formation in social communities or the identification of functional modules in biochemical networks.

History, Explanation

Community structure is considered one of the most interesting features in complex networks. Many real-world complex systems exhibit community structure, where individuals with similar properties form a partition (community). The identification of partitions in a network is important for understanding the structure of said network from a specific perspective. Thus, graph partitioning in complex networks has gained immense interest over the last decade. Many graph partitioning methods have been proposed, one of them being the label propagation algorithm (LPA). The simplicity and time efficiency of the LPA make it a popular graph partitioning method.

When to use it / use-cases

Label propagation is an algorithm for finding communities (partitions). It is one of the state-of-the-art methods that estimate labels by propagating label information through a graph. Label propagation assumes that nodes connected in a graph should have similar labels. Consequently, the label estimation heavily depends on the relationships in the graph, which represent the similarity of each node pair.

Algorithm explanation on simple sample graph

Running algorithm and writing back results
CALL algo.labelPropagation('Label', '','OUTGOING', {iterations:1,partitionProperty:'partition', write:true})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, write, partitionProperty
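
The propagation kernel can be sketched as follows (illustrative Python; the library's parallel kernel breaks ties differently, so partition ids may vary between runs):

```python
from collections import Counter

def label_propagation(adjacency, iterations):
    """Synchronous LPA: every node adopts its neighbors' most frequent label.
    Ties are broken by the smallest label to keep the sketch deterministic."""
    labels = {n: n for n in adjacency}  # each node starts in its own partition
    for _ in range(iterations):
        new = {}
        for node, neighbors in adjacency.items():
            counts = Counter(labels[n] for n in neighbors)
            best = max(counts.values())
            new[node] = min(l for l, c in counts.items() if c == best)
        labels = new
    return labels

# two triangles with no edge between them form two partitions
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
labels = label_propagation(adj, 2)  # {0,1,2} converge to label 0, {3,4,5} to 3
```
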

Syntax

Running algorithm and writing back results
CALL algo.labelPropagation(label:String, relationship:String, direction:String, {iterations:1,
weightProperty:'weight', partitionProperty:'partition', write:true})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, write, weightProperty,
partitionProperty - simple label propagation kernel
Table 21. Parameters
name               type     default      optional  description
label              string   null         yes       label to load from the graph; if null, load all nodes
relationship       string   null         yes       relationship-type to load from the graph; if null, load all relationships
direction          string   'OUTGOING'   yes       relationship-direction to use in the algorithm
iterations         int      1            yes       number of iterations
weightProperty     string   'weight'     yes       property name that contains weight. Must be numeric.
partitionProperty  string   'partition'  yes       property name to which the partition of the node is written back
write              boolean  true         yes       if result should be written back as node property

Table 22. Results
name               type     description
nodes              int      number of nodes considered
iterations         int      number of iterations
loadMillis         int      milliseconds for loading data
computeMillis      int      milliseconds for running the algorithm
writeMillis        int      milliseconds for writing result data back
weightProperty     string   property name that contains weight
partitionProperty  string   property name written back to
write              boolean  whether the result was written back

Connected Components

Connected Components (also known as UnionFind) finds sets of connected nodes where each node is reachable from any other node in the same set. In graph theory, a connected component of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the graph.

History, Explanation

The Connected Components of a graph represent, in general terms, the pieces of the graph. Two vertices are in the same component of a graph if and only if there is some path between them.

Finding connected components is at the heart of many graph applications. For example, consider the problem of identifying partitions in a set of items. We can represent each item by a node and add an edge between each pair of items that are deemed similar. The connected components of this graph correspond to different classes of items.

When to use it / use-cases

Testing whether a graph is connected is an essential preprocessing step for every graph algorithm. Such tests can be performed so quickly and easily that you should always verify that your input graph is connected, even when you know it has to be. Subtle, difficult-to-detect bugs often result when your algorithm is run only on one component of a disconnected graph.

Connected components have other practical use cases, for example, if we are analysing a social network and we want to find all the disconnected groups of people that exist in our graph.

Algorithm explanation on simple sample graph

Recall that an undirected graph is connected if, for every pair of vertices, there is a path in the graph between those vertices. A connected component of an undirected graph is a maximal connected subgraph of the graph. This means that the direction of the relationships in our graph has no influence, as we treat the graph as undirected. We have two implementations of the connected components algorithm. The first implementation treats the graph as unweighted; the second version is weighted, and lets you define a threshold on the weight that relationships must reach to be considered.

Create sample graph
CREATE (nAlice:User {id:'Alice'})
,(nBridget:User {id:'Bridget'})
,(nCharles:User {id:'Charles'})
,(nDoug:User {id:'Doug'})
,(nMark:User {id:'Mark'})
,(nMichael:User {id:'Michael'})
CREATE (nAlice)-[:FRIEND{weight:0.5}]->(nBridget)
,(nAlice)-[:FRIEND{weight:4}]->(nCharles)
,(nMark)-[:FRIEND{weight:1}]->(nDoug)
,(nMark)-[:FRIEND{weight:2}]->(nMichael)
Unweighted version:
Running algorithm and streaming results
CALL algo.unionFind.stream('User', 'FRIEND', {})
YIELD nodeId,setId
Running algorithm and writing back results
CALL algo.unionFind('User', 'FRIEND', {write:true, partitionProperty:"partition"})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis
Table 23. Results
name     partition
Alice    0
Charles  0
Bridget  0
Michael  4
Doug     4
Mark     4

The results show that we have two distinct groups of users with no link between them. The first group contains Alice, Charles and Bridget, and the second group contains Michael, Doug and Mark.

We can also easily check the number and size of partitions using Cypher.
MATCH (u:User)
RETURN DISTINCT u.partition AS partition, count(*) AS size_of_partition
ORDER BY size_of_partition DESC
Weighted version:

If you specify the property that holds the weight (weightProperty) and a threshold, two nodes are only considered connected if the weight of the relationship between them reaches the threshold; otherwise the relationship is discarded.

Running algorithm and streaming results
CALL algo.unionFind.stream('User', 'FRIEND', {weightProperty:'weight', defaultValue:0.0, threshold:1.0})
YIELD nodeId,setId
Running algorithm and writing back results
CALL algo.unionFind('User', 'FRIEND', {write:true, partitionProperty:"partition",weightProperty:'weight', defaultValue:0.0, threshold:1.0})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis
Table 24. Results
name     partition
Alice    0
Charles  0
Bridget  1
Michael  4
Doug     4
Mark     4

We can observe that, because the weight of the relationship between Bridget and Alice is only 0.5, the relationship was discarded and treated as non-existing.
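
The weighted behaviour can be sketched with a small union-find (illustrative Python; that relationships at exactly the threshold are kept is an assumption inferred from the results above):

```python
def union_find(nodes, edges, threshold):
    """Union-find with path compression; edges below the threshold are skipped.
    Assumption: weights >= threshold are kept, matching the table above."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b, weight in edges:
        if weight >= threshold:
            parent[find(a)] = find(b)
    return {n: find(n) for n in nodes}

users = ['Alice', 'Bridget', 'Charles', 'Doug', 'Mark', 'Michael']
friendships = [('Alice', 'Bridget', 0.5), ('Alice', 'Charles', 4),
               ('Mark', 'Doug', 1), ('Mark', 'Michael', 2)]
roots = union_find(users, friendships, threshold=1.0)
# Bridget ends up alone; Alice/Charles and Mark/Doug/Michael form the other sets
```
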

Syntax

Running algorithm and writing back results
CALL algo.unionFind(label:String, relationship:String, {threshold:0.42,
defaultValue:1.0, write: true, partitionProperty:'partition',weightProperty:'weight'})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis
- finds connected partitions and potentially writes back to the node as a property partition.
Table 25. Parameters
name               type     default      optional  description
label              string   null         yes       label to load from the graph; if null, load all nodes
relationship       string   null         yes       relationship-type to load from the graph; if null, load all relationships
weightProperty     string   null         yes       property name that contains weight; if null, treats the graph as unweighted. Must be numeric.
write              boolean  true         yes       if result should be written back as node property
partitionProperty  string   'partition'  yes       property name to which the id of the partition the node belongs to is written back
threshold          float    null         yes       weight threshold; relationships with a weight below it are discarded
defaultValue       float    null         yes       default value of the weight in case it is missing or invalid

Table 26. Results
name           type  description
nodes          int   number of nodes considered
setCount       int   number of partitions found
loadMillis     int   milliseconds for loading data
computeMillis  int   milliseconds for running the algorithm
writeMillis    int   milliseconds for writing result data back

Running algorithm and streaming results
CALL algo.unionFind.stream(label:String, relationship:String, {weightProperty:'weight', threshold:0.42, defaultValue:1.0})
YIELD nodeId, setId - yields a setId for each node id
Table 27. Parameters
name            type    default  optional  description
label           string  null     yes       label to load from the graph; if null, load all nodes
relationship    string  null     yes       relationship-type to load from the graph; if null, load all relationships
weightProperty  string  null     yes       property name that contains weight; if null, treats the graph as unweighted. Must be numeric.
threshold       float   null     yes       weight threshold; relationships with a weight below it are discarded
defaultValue    float   null     yes       default value of the weight in case it is missing or invalid

Table 28. Results
name    type  description
nodeId  int   node id
setId   int   partition id

Strongly Connected Components

SCC is a class of algorithms for finding groups of nodes where each node is reachable from every other node in the group. The strongly connected components of an arbitrary directed graph form a partition into subgraphs that are themselves strongly connected. There are several algorithms to compute the SCCs.

History, Explanation

Decomposing a directed graph into its strongly connected components is a classic application of depth-first search. The problem of finding connected components is at the heart of many graph applications. Generally speaking, the connected components of the graph correspond to different classes of objects. The first linear-time algorithm for strongly connected components is due to Tarjan (1972).

When to use it / use-cases

SCC algorithms can be used as a first step in many graph algorithms that work only on a strongly connected graph. In social networks, a group of people is often strongly connected (for example, students of a class or members of some other common group). Many people in these groups generally like some common pages or play common games. SCC algorithms can be used to find such groups and suggest the commonly liked pages or games to people in the group who have not yet liked those pages or played those games.

Algorithm explanation on simple sample graph


A directed graph is strongly connected if there is a path between all pairs of vertices. This algorithm treats the graph as directed, so the direction of the relationship is important, and a strongly connected component exists only if there are relationships between its nodes in both directions.

Create sample graph
CREATE (nAlice:User {id:'Alice'})
,(nBridget:User {id:'Bridget'})
,(nCharles:User {id:'Charles'})
,(nDoug:User {id:'Doug'})
,(nMark:User {id:'Mark'})
,(nMichael:User {id:'Michael'})
CREATE (nAlice)-[:FOLLOW]->(nBridget)
,(nAlice)-[:FOLLOW]->(nCharles)
,(nMark)-[:FOLLOW]->(nDoug)
,(nMark)-[:FOLLOW]->(nMichael)
,(nBridget)-[:FOLLOW]->(nMichael)
,(nDoug)-[:FOLLOW]->(nMark)
,(nMichael)-[:FOLLOW]->(nAlice)
,(nAlice)-[:FOLLOW]->(nMichael)
,(nBridget)-[:FOLLOW]->(nAlice)
,(nMichael)-[:FOLLOW]->(nBridget)
Running algorithm and writing back results
CALL algo.scc('User','FOLLOW', {write:true,partitionProperty:'partition'})
YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize
Table 29. Results
name     partition
Alice    1
Bridget  1
Michael  1
Charles  0
Doug     2
Mark     2

We can see that we have 2 strongly connected components in our sample graph. The first and biggest component contains Alice, Bridget and Michael, and the second contains Doug and Mark.

Find the largest partition
MATCH (u:User)
RETURN DISTINCT u.partition AS partition, count(*) AS size_of_partition
ORDER BY size_of_partition DESC LIMIT 1
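
The components on the sample graph can be verified with a Kosaraju-style sketch (illustrative Python; the library's implementations are Tarjan variants, listed below):

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm: record DFS finish order, then DFS the reversed graph."""
    order, seen = [], set()
    def dfs(node):  # iterative DFS recording finish order
        stack = [(node, iter(graph[node]))]
        seen.add(node)
        while stack:
            current, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:  # all successors visited: node is finished
                order.append(current)
                stack.pop()
    for n in graph:
        if n not in seen:
            dfs(n)
    reverse = {n: [] for n in graph}
    for n, targets in graph.items():
        for t in targets:
            reverse[t].append(n)
    components, assigned = [], set()
    for n in reversed(order):  # collect one component per unassigned root
        if n in assigned:
            continue
        component, stack = set(), [n]
        assigned.add(n)
        while stack:
            cur = stack.pop()
            component.add(cur)
            for nxt in reverse[cur]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        components.append(frozenset(component))
    return components

follows = {
    'Alice': ['Bridget', 'Charles', 'Michael'],
    'Bridget': ['Michael', 'Alice'],
    'Charles': [],
    'Doug': ['Mark'],
    'Mark': ['Doug', 'Michael'],
    'Michael': ['Alice', 'Bridget'],
}
components = strongly_connected_components(follows)
```
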

Syntax

Running algorithm and writing back results
CALL algo.scc(label:String, relationship:String, {write:true,partitionProperty:'partition'})
YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize

- finds strongly connected partitions and potentially writes back to the node as a property partition.
Table 30. Parameters
name               type     default      optional  description
label              string   null         yes       label to load from the graph; if null, load all nodes
relationship       string   null         yes       relationship-type to load from the graph; if null, load all relationships
write              boolean  true         yes       if result should be written back as node property
partitionProperty  string   'partition'  yes       property name written back to

Table 31. Results
name           type  description
setCount       int   number of partitions found
maxSetSize     int   number of members in biggest partition
minSetSize     int   number of members in smallest partition
loadMillis     int   milliseconds for loading data
computeMillis  int   milliseconds for running the algorithm
writeMillis    int   milliseconds for writing result data back

Implementations

algo.scc.tarjan

  • original recursive Tarjan implementation

algo.scc.tunedTarjan

  • also a recursive Tarjan implementation

algo.scc.iterative

  • iterative adaptation of the Tarjan algorithm

algo.scc.multistep

  • parallel SCC algorithm

Implementers Section

Algorithm Procedures API Discussion

We would like to discuss the design of the procedure API for algorithms. There are several concerns:

Naming

  • simple procedure names,

  • consistent spelling and casing

  • discriminate between write-back and result-streaming procedure with .stream suffix

  • default algorithm name should be the best implementation

  • other implementations can be provided via suffixes

Input Graph Data

  • load via label/rel-type selection and optional weight/values

  • load via cypher statements

  • the requirements for node-ids, degrees, weights, relationships etc. come from the algorithm

Algorithm parameters

  • start-nodes

  • iterations, dampening

  • concurrency, batch-sizes

  • thresholds

  • write-back, write-back properties

Id-Mapping

  • original node-ids from the graph are mapped to a consecutive id-range to be used for the algorithm

  • this allows simple array offsets to be used within the computation

  • this also allows mapping of larger id-ranges to the computational range of the algorithm
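
The mapping idea can be sketched like this (illustrative Python; the library uses primitive int arrays in Java):

```python
class IdMap:
    """Maps arbitrary (long) Neo4j node ids to a dense 0..n-1 range so that
    algorithm state can live in plain arrays instead of hash maps."""
    def __init__(self):
        self.to_internal = {}
        self.to_original = []

    def map(self, original_id):
        internal = self.to_internal.get(original_id)
        if internal is None:  # first sighting: assign the next consecutive id
            internal = len(self.to_original)
            self.to_internal[original_id] = internal
            self.to_original.append(original_id)
        return internal

ids = IdMap()
internal = [ids.map(node_id) for node_id in (1000, 42, 7_000_000)]
# algorithm state can now be a flat array indexed by internal id
ranks = [0.0] * len(ids.to_original)
```
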

Result parameters

  • output statistics: number of nodes and rels touched

  • time for loading, processing, and writing

  • property written to

  • errors

  • data summary, e.g.

    • min,max,avg of a centrality, rank or weight measure

    • output of all clusters, their sizes and perhaps the root node

  • alternatively streaming back node-ids and the computed value (score)

Constraints:

  1. there is no overloading so we need differing names

  2. there are default procedure parameters so we need to require only what’s really required to run

  3. we should use sensible defaults

  4. result values/columns have to be pre-declared, there is no inheritance or generics

Suggestions:

Naming: 2 variants per algorithm
  1. returning summary statistics after run: algo.pageRank(…​)

  2. returning a stream of data: algo.pageRank.stream(…​). The default should be the one with statistics, as streaming back results from a medium-sized or large graph will overwhelm the client

Parameters
  1. node-selector or null - label(s) or cypher statement: ":Person|User" / "MATCH (u:User) RETURN id(u) as id, u.name as value"

  2. relationship-selector or null - rel-type(s) or cypher statement ":KNOWS|FRIEND" / "MATCH (u1:User)-[r:KNOWS|FRIEND]-(u2:User) RETURN id(u1) as source, id(u2) as target, r.strength as weight"

  3. configuration map with optional parameters but sensible defaults: {iterations: 5, write:true, nodeValue:"value", nodeWeight:"weight", relationshipWeight:"weight", threshold: 0.85}

We could have the configuration map come from a function (per algorithm?) that handles validation and returns a correct map object for that algorithm? Or just pass the plain map?

Example:
call algo.pageRank("Person","KNOWS",{iterations:10, write:true, [property:"pagerank"]});

call algo.unionFind.stream('
MATCH (p:Person) RETURN id(p) as id
','
MATCH (p1:Person)-[:KNOWS]-(p2:Person) RETURN id(p1) as source, id(p2) as target
', {}) YIELD nodeId, clusterId

API Documentation

Development Notes

Note
When implementing a procedure that yields any results (is not void) and writes back to the database, make sure to consume the Result after running the algorithm call!

Algorithms are executed within a transaction that is opened and closed by Cypher. The transaction is only marked successful if the results are consumed. Closing the Result directly will fail and roll back any open transactions, thus reverting all write-back operations.

The Model

The basic idea behind our model is to have a fast cache for the topology of the graph containing only the relevant nodes and relationships, plus the weights. It implicitly maps (long) node-ids to an internal integer id (32 bit) in ascending order, which ensures that no id gets bigger than the maximum node count. This approach allows us, for example, to use primitive arrays as containers.

Graph Interface

The Graph interface specifies methods for iterating over all nodes of the graph as well as iterating over relationships of a given node in the form of forEach…​(..)-methods. The Graph knows the nodeCount and degree of each node and can map nodeId to internalId and vice versa.

Method and description:

Collection<PrimitiveIntIterable> batchIterables(int batchSize)
    Returns a collection of iterables over every node, partitioned by the given batch size.

boolean contains(long nodeId)
    Returns true iff the nodeId is mapped, otherwise false.

int degree(int nodeId, Direction direction)
    Degree of the node in that direction.

void forEachNode(IntPredicate consumer)
    Iterates over each nodeId.

void forEachRelationship(int nodeId, Direction direction, RelationshipConsumer consumer)
    Iterates over all relationships of a node.

void forEachRelationship(int nodeId, Direction direction, WeightedRelationshipConsumer consumer)
    Iterates over all weighted relationships of a node.

int nodeCount()
    Count of nodes.

PrimitiveIntIterator nodeIterator()
    Iterates over all nodes.

int toMappedNodeId(long nodeId)
    Maps a Neo4j nodeId to an internal nodeId.

long toOriginalNodeId(int nodeId)
    Maps an internal nodeId back to the original nodeId.

Note
Currently we have 3 different implementations aiming for different goals like performance, memory consumption and accuracy.

HeavyGraph

This implementation utilizes an int-matrix for storing connections between vertices. It has a higher memory consumption but performs basic calculations on the graph around 3 times faster than the memory-efficient implementation. Furthermore, the number of edges per node is limited only by the maximum array size (2 bn) of the JVM. Relationships can be added in arbitrary order.

LightGraph

This implementation takes 3 times less heap due to a more intelligent memory layout. The drawbacks are a slightly higher evaluation time and that the data cannot (yet) be loaded in parallel.

GraphView

The View is just a simple wrapper around the Neo4j Kernel API. It has been implemented for tests and benchmark baselines.

Import

A fair amount of the work is to fetch the relevant data from Neo4j and import it into our model. Fortunately we can use a multithreading approach to load the graph. This is part of the current implementation.

The abstract GraphFactory specifies a constructor and the build method. It is responsible for creating the Graph using the Neo4j Kernel API.


Loading the Graph is done by using the GraphLoader which implements a simple fluent Builder Pattern.

final Graph graph = new GraphLoader( graphDatabaseAPI )
        .setLabel( .. )
        .setRelation( .. )
        .setProperty( .. )
        .setThreadPool( .. )
        .load( FactoryImpl.class );

PageRank

Implementation Details

PageRank is the well-known ranking algorithm popularized by Google's search engine.

Progress
  • ✓ single threaded implementation

  • ✓ tests

  • ✓ simple benchmark

  • ✓ implement procedure

  • ✓ benchmark on bigger graphs

  • ✓ parallelization

  • ✓ evaluation

Requirements
  • NodeIterator

  • Incoming Relationships

  • Outgoing Degrees

Data structures involved

Our current approach needs one double array for storing ranks.
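For reference, a minimal single-threaded sketch of the iteration (our own simplified code, not the project's implementation; dangling-node handling is omitted and two arrays are used for clarity):

```java
// Minimal single-threaded PageRank sketch over adjacency lists;
// node ids are assumed to be dense ints 0..n-1.
public class PageRankSketch {
    public static double[] compute(int[][] outgoing, double damping, int iterations) {
        int n = outgoing.length;
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);
        for (int iter = 0; iter < iterations; iter++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1.0 - damping) / n);
            for (int v = 0; v < n; v++) {
                int degree = outgoing[v].length;
                if (degree == 0) continue; // dangling mass ignored in this sketch
                double share = damping * rank[v] / degree; // each target gets rank/outDegree
                for (int target : outgoing[v]) next[target] += share;
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // tiny graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
        int[][] graph = {{1, 2}, {2}, {0}};
        System.out.println(java.util.Arrays.toString(compute(graph, 0.85, 30)));
    }
}
```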

ToDo
parallelization

One approach to parallelizing PageRank is to partition the nodes into batches, one for each thread. The threads still need to be synchronized at the end of each iteration.

evaluation
  • Performance tests on different dataset sizes / level of concurrency

Future Improvements
  • we might scale the ranks up to ints for faster multiplication.

Details

Partition-based parallel PageRank based on "An Efficient Partition-Based Parallel PageRank Algorithm" [1].

  • Each partition thread has a local array of only the nodes it is responsible for, not of all nodes. Combined, the partitions hold every page rank score exactly once. Instead of writing partition files and transferring them across the network (as done in the paper, which parallelises across multiple machines), we write the results into integer arrays. The actual score is upscaled from a double to an integer by multiplying it with 100_000.

  • To avoid contention by writing to a shared array, we partition the result array.

  • During execution, the scores arrays are shaped like this:

    [ executing partition ] -> [ calculated partition ] -> [ local page rank scores ]
  • Each partition writes into a partitioned array, calculating the scores for every receiving partition. A single partition only sees:

    [ calculated partition ] -> [ local page rank scores ]
  • The coordinating thread then builds the transpose of all written partitions from every partition:

    [ calculated partition ] -> [ executing partition ] -> [ local page rank scores ]
  • This step does not happen in parallel, but it does not involve extensive copying either. The local page rank scores need not be copied, only the partitioning arrays. All in all, concurrency^2 array element reads and assignments have to be performed.

  • For the next iteration, every partition first updates its scores, in parallel. A single partition now sees:

    [ executing partition ] -> [ local page rank scores ]
  • That is, a list of all scores calculated for itself, grouped by the partition that calculated them. This means most of the synchronization happens in parallel, too.

  • Partitioning is not done by number of nodes but by the accumulated degree – as described in "Fast Parallel PageRank: A Linear System Approach" [2]. Every partition should have about the same number of relationships to operate on.

  • This is done to avoid having one partition with super nodes; instead, all partitions run in approximately equal time. Smaller partitions are merged down until at most concurrency partitions remain, in order to batch partitions and keep the number of threads in use predictable/configurable.
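The degree-based batching could be sketched as follows (hypothetical helper, not the project's code; dense node ids and precomputed degrees are assumed):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of degree-based partitioning: nodes are grouped so every partition
// covers roughly the same number of relationships, not the same number of nodes.
public class DegreePartitioning {
    // returns partition boundaries [start, end) over the node id range
    public static List<int[]> partition(int[] degrees, int concurrency) {
        long totalDegree = 0;
        for (int d : degrees) totalDegree += d;
        long batchSize = Math.max(1, totalDegree / concurrency);
        List<int[]> partitions = new ArrayList<>();
        int start = 0;
        long accumulated = 0;
        for (int node = 0; node < degrees.length; node++) {
            accumulated += degrees[node];
            // close a partition once it holds enough relationships,
            // keeping at most `concurrency` partitions overall
            if (accumulated >= batchSize && partitions.size() < concurrency - 1) {
                partitions.add(new int[]{start, node + 1});
                start = node + 1;
                accumulated = 0;
            }
        }
        if (start < degrees.length) partitions.add(new int[]{start, degrees.length});
        return partitions;
    }

    public static void main(String[] args) {
        // a super node (degree 90) followed by low-degree nodes
        int[] degrees = {90, 5, 5, 30, 30, 30, 5, 5};
        for (int[] p : partition(degrees, 4)) {
            System.out.println(p[0] + ".." + p[1]);
        }
    }
}
```

The super node ends up alone in its partition, so no single thread is handed a disproportionate share of the relationships.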

[1] An Efficient Partition-Based Parallel PageRank Algorithm
[2] Fast Parallel PageRank: A Linear System Approach, https://www.cs.purdue.edu/homes/dgleich/

Betweenness Centrality

Implementation Details

In graph theory, betweenness centrality is a measure of centrality in a graph based on shortest paths. For every pair of vertices in a connected graph, there exists at least one shortest path between the vertices such that either the number of edges that the path passes through (for unweighted graphs) or the sum of the weights of the edges (for weighted graphs) is minimized. The betweenness centrality for each vertex is the number of these shortest paths that pass through the vertex.
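A compact sketch of Brandes' algorithm for the unweighted case (our own illustration; the project's parallel version additionally distributes the source nodes across threads):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Brandes' betweenness centrality for an unweighted, directed graph
// given as adjacency lists over dense int node ids.
public class BrandesSketch {
    public static double[] compute(int[][] adj) {
        int n = adj.length;
        double[] centrality = new double[n];
        for (int s = 0; s < n; s++) {
            // BFS from s, counting shortest paths (sigma) and recording predecessors
            int[] sigma = new int[n];
            int[] dist = new int[n];
            Arrays.fill(dist, -1);
            List<List<Integer>> pred = new ArrayList<>();
            for (int i = 0; i < n; i++) pred.add(new ArrayList<>());
            ArrayDeque<Integer> queue = new ArrayDeque<>();
            ArrayDeque<Integer> stack = new ArrayDeque<>();
            sigma[s] = 1; dist[s] = 0; queue.add(s);
            while (!queue.isEmpty()) {
                int v = queue.poll();
                stack.push(v);
                for (int w : adj[v]) {
                    if (dist[w] < 0) { dist[w] = dist[v] + 1; queue.add(w); }
                    if (dist[w] == dist[v] + 1) { sigma[w] += sigma[v]; pred.get(w).add(v); }
                }
            }
            // back-propagate dependencies in order of decreasing distance
            double[] delta = new double[n];
            while (!stack.isEmpty()) {
                int w = stack.pop();
                for (int v : pred.get(w)) {
                    delta[v] += (double) sigma[v] / sigma[w] * (1.0 + delta[w]);
                }
                if (w != s) centrality[w] += delta[w];
            }
        }
        return centrality;
    }

    public static void main(String[] args) {
        // path 0 -> 1 -> 2: node 1 lies on the only shortest path from 0 to 2
        int[][] path = {{1}, {2}, {}};
        System.out.println(Arrays.toString(compute(path)));
    }
}
```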

Progress
  • ✓ adapt apoc-procedure to algorithm-api

  • ✓ implement procedure

  • ✓ tests

  • ✓ edge case tests

  • ✓ simple benchmark

  • ✓ benchmark on bigger graphs

  • ✓ parallelization

  • ✓ evaluation

  • ✓ documentation

Details
algo.betweenness
  • implementation of Brandes' betweenness centrality algorithm with a nodePartitioning extension

  • http://www.algo.uni-konstanz.de/publications/b-fabc-01.pdf

  • if the concurrency parameter is set (and > 1), ParallelBetweennessCentrality is used

  • ParallelBC spawns N concurrent threads (N given by the concurrency parameter); each thread calculates the BC for one node at a time

algo.betweenness.exp1

Closeness Centrality

Implementation Details

Closeness Centrality of a node is a measure of centrality in a network. It is calculated from the sum of the lengths of the shortest paths between the node and all other nodes in the graph (the farness); the smaller that sum, the more central the node, i.e. the closer it is to all other nodes.

Details
  • use org.neo4j.graphalgo.impl.msbfs.MultiSourceBFS for BFS

  • MSBFS gives depth and number of sources.

  • in this scheme the farness can be calculated as follows

    farness(v) = farness(v) + numberOfSources(v) * depth(v)
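With a plain one-source-at-a-time BFS (numberOfSources = 1) the scheme above reduces to the following sketch (illustrative only; the real implementation uses the MultiSourceBFS instead of a per-node BFS):

```java
import java.util.ArrayDeque;
import java.util.Arrays;

// Closeness sketch: sum up BFS depths to get farness, then invert.
public class ClosenessSketch {
    public static double[] closeness(int[][] adj) {
        int n = adj.length;
        double[] result = new double[n];
        for (int source = 0; source < n; source++) {
            int[] dist = new int[n];
            Arrays.fill(dist, -1);
            dist[source] = 0;
            ArrayDeque<Integer> queue = new ArrayDeque<>();
            queue.add(source);
            long farness = 0;
            while (!queue.isEmpty()) {
                int v = queue.poll();
                farness += dist[v]; // farness += numberOfSources * depth, sources = 1 here
                for (int w : adj[v]) {
                    if (dist[w] < 0) { dist[w] = dist[v] + 1; queue.add(w); }
                }
            }
            result[source] = farness == 0 ? 0.0 : 1.0 / farness;
        }
        return result;
    }

    public static void main(String[] args) {
        // undirected path 0 - 1 - 2: the middle node has the smallest farness
        int[][] adj = {{1}, {0, 2}, {1}};
        System.out.println(Arrays.toString(closeness(adj)));
    }
}
```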

Minimum Weight Spanning Tree

Implementation Details

A Minimum Weight Spanning Tree is an acyclic undirected subgraph that connects all nodes while keeping the total relationship weight minimal. It can be used to cluster the graph (e.g. k-means-style clustering). Our current approach uses Prim's algorithm to calculate the MST starting at a given node. This might not cover the whole graph, but if the nodes are connected, the MST is identical regardless of which node the execution starts at.
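A minimal sketch of Prim's algorithm, using java.util.PriorityQueue in place of the project's int-based Fibonacci heap (illustrative, not the project's code):

```java
import java.util.Arrays;
import java.util.PriorityQueue;

// Prim's algorithm on a weighted undirected graph given as adjacency lists
// of {target, weight} pairs; returns parent[] encoding the MST.
public class PrimSketch {
    public static int[] mst(int[][][] edges, int startNode) {
        int n = edges.length;
        int[] parent = new int[n];
        boolean[] visited = new boolean[n];
        Arrays.fill(parent, -1);
        // queue entries: {weight, node, parent}, ordered by weight
        PriorityQueue<int[]> queue = new PriorityQueue<>((a, b) -> a[0] - b[0]);
        queue.add(new int[]{0, startNode, -1});
        while (!queue.isEmpty()) {
            int[] head = queue.poll();
            int node = head[1];
            if (visited[node]) continue; // cheaper edge already added this node
            visited[node] = true;
            parent[node] = head[2];
            for (int[] edge : edges[node]) {
                if (!visited[edge[0]]) queue.add(new int[]{edge[1], edge[0], node});
            }
        }
        return parent;
    }

    public static void main(String[] args) {
        // triangle 0-1 (w1), 1-2 (w1), 0-2 (w10): the MST drops the heavy edge
        int[][][] g = {
            {{1, 1}, {2, 10}},
            {{0, 1}, {2, 1}},
            {{1, 1}, {0, 10}},
        };
        System.out.println(Arrays.toString(mst(g, 0)));
    }
}
```

Only nodes reachable from the start node end up in the tree, mirroring the "might not cover the whole graph" caveat above.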

Progress
  • ✓ single threaded implementation

  • ✓ tests

  • ✓ simple benchmark

  • ✓ implement procedure

  • ✓ benchmark on bigger graphs

  • [?] parallelization

  • ❏ evaluation

Requirements

BothRelationshipIterator & Weights

Data structures involved
  • org.neo4j.graphalgo.core.utils.container.UndirectedTree as a container for efficient splitting and iteration

  • An int-based Fibonacci Heap priority queue.

  • Container for visited state

ToDo
benchmark

Implement benchmark on big graph

evaluation
  • Performance tests on different dataset sizes / level of concurrency

Single Shortest Path

Implementation Details

A Single Source Shortest Path algorithm calculates a path between a pair of nodes whose summed weight is minimal. A commonly used algorithm is Dijkstra's. All Pairs Shortest Path, on the other hand, calculates a shortest-path forest containing all paths between the nodes in the graph. Algorithms to solve this include Floyd-Warshall and parallel Johnson's algorithm.
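A minimal Dijkstra sketch (java.util.PriorityQueue standing in for the int-based Fibonacci heap; illustrative only, not the project's code):

```java
import java.util.Arrays;
import java.util.PriorityQueue;

// Dijkstra single source shortest paths for non-negative weights; the graph
// is given as adjacency lists of {target, weight} pairs.
public class DijkstraSketch {
    public static double[] shortestPaths(int[][][] edges, int source) {
        int n = edges.length;
        double[] cost = new double[n];
        Arrays.fill(cost, Double.POSITIVE_INFINITY);
        cost[source] = 0;
        // queue entries: {node, distance}, ordered by distance
        PriorityQueue<double[]> queue = new PriorityQueue<>((a, b) -> Double.compare(a[1], b[1]));
        queue.add(new double[]{source, 0});
        while (!queue.isEmpty()) {
            double[] head = queue.poll();
            int node = (int) head[0];
            if (head[1] > cost[node]) continue; // stale entry, node already settled cheaper
            for (int[] edge : edges[node]) {
                double candidate = cost[node] + edge[1];
                if (candidate < cost[edge[0]]) {
                    cost[edge[0]] = candidate;
                    queue.add(new double[]{edge[0], candidate});
                }
            }
        }
        return cost;
    }

    public static void main(String[] args) {
        // 0 -> 1 (w4), 0 -> 2 (w1), 2 -> 1 (w2): the best 0..1 path goes via 2
        int[][][] g = {{{1, 4}, {2, 1}}, {}, {{1, 2}}};
        System.out.println(Arrays.toString(shortestPaths(g, 0)));
    }
}
```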

Progress
  • ✓ single threaded implementation

  • ✓ tests

  • ✓ simple benchmark

  • ✓ implement procedure

  • ❏ benchmark on bigger graphs

  • ❏ parallelization

  • ❏ evaluation

Requirements

(Outgoing)RelationshipIterator & Weights

Data structures involved
  • An int-based Fibonacci Heap which implements an efficient priority queue.

  • Different containers for costs / visited state / paths

ToDo
benchmark

Implement benchmark on big graph

parallelization

Parallelizing All Pairs Shortest Path is straightforward: run Dijkstra on each thread for a different start node. An approach for Single Source Shortest Path may use two threads, one searching forward from the start node and one backward from the end node; the search finishes when the two frontiers meet. More: https://www.cs.rice.edu/~vs3/comp422/lecture-notes/comp422-lec24-s08-v2.pdf

evaluation
  • Performance tests on different dataset sizes / level of concurrency

Details
algo.shortestPath
  • Dijkstra single source shortest path algorithm

  • The algorithm computes a shortest path on weighted graphs between a given start and target nodeId. It returns result tuples of [nodeId, distance] for each node in the path

  • there may be more than one shortest path; the procedure returns only one

  • if initialized with a non-existing weight property and a defaultWeight of 1.0, the resulting path is minimal in terms of the number of nodes in the path.

algo.shortestPath.deltaStepping

TODO naming!?
  • parallel non-negative single source shortest path algorithm for weighted graphs

  • It can be tweaked using the delta parameter, which controls the grade of concurrency.

  • returns the minimum distance to all other nodes

  • if initialized with a non-existing weight property and a defaultWeight of 1.0, its result can be interpreted as the number of nodes to reach the target

algo.allShortestPaths.stream
  • returns a stream of source-target node to distance tuples for each pair of nodes

  • Since all nodeIds have already been ordered by the idMapping, we can use an integer counter instead of a queue; it simply counts up for each startNodeId as long as it is < nodeCount.

  • Each thread takes one int from the counter at a time and starts its computation on it.

  • The AllShortestPaths#concurrency value determines the number of workers to spawn.

  • Due to the high memory footprint the full result set would have, we emit each result into a blocking queue. The result stream takes elements from the queue while the workers add elements to it.

  • The result stream is limited to N^2 elements. If the stream gets closed prematurely, the workers are closed too.

  • writeback not supported!
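The shared-counter work distribution could look roughly like this (simplified sketch of our own: the per-node shortest path computation and the blocking result queue are replaced by a plain counter of processed start nodes):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Workers draw the next startNodeId from a shared AtomicInteger until it
// reaches nodeCount, so every start node is processed exactly once.
public class CounterDispatchSketch {
    public static int run(int nodeCount, int concurrency) {
        AtomicInteger counter = new AtomicInteger();
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        for (int i = 0; i < concurrency; i++) {
            pool.execute(() -> {
                int startNodeId;
                // each worker takes one int at a time and computes on it
                while ((startNodeId = counter.getAndIncrement()) < nodeCount) {
                    processed.incrementAndGet(); // stand-in for the per-node shortest paths
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println(run(1000, 4));
    }
}
```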

Label Propagation

Implementation Details

Label Propagation is a graph partitioning algorithm that is already implemented in the current apoc procedures.
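For intuition, a minimal synchronous label propagation sketch (our own simplified version, not the apoc implementation; ties are broken by the smaller label to keep it deterministic):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Every node starts in its own partition and repeatedly adopts the most
// frequent label among its neighbors until nothing changes.
public class LabelPropagationSketch {
    public static int[] propagate(int[][] adj, int maxIterations) {
        int n = adj.length;
        int[] labels = new int[n];
        for (int i = 0; i < n; i++) labels[i] = i;
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            for (int v = 0; v < n; v++) {
                if (adj[v].length == 0) continue;
                Map<Integer, Integer> votes = new HashMap<>();
                for (int w : adj[v]) votes.merge(labels[w], 1, Integer::sum);
                int best = labels[v], bestCount = -1;
                for (Map.Entry<Integer, Integer> e : votes.entrySet()) {
                    // ties broken by the smaller label for determinism
                    if (e.getValue() > bestCount
                            || (e.getValue() == bestCount && e.getKey() < best)) {
                        best = e.getKey(); bestCount = e.getValue();
                    }
                }
                if (best != labels[v]) { labels[v] = best; changed = true; }
            }
            if (!changed) break; // converged
        }
        return labels;
    }

    public static void main(String[] args) {
        // two disconnected triangles: two communities expected
        int[][] adj = {{1, 2}, {0, 2}, {0, 1}, {4, 5}, {3, 5}, {3, 4}};
        System.out.println(Arrays.toString(propagate(adj, 10)));
    }
}
```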

Progress
  • ✓ adapt apoc-procedure to algorithm api

  • ✓ single threaded implementation

  • ✓ tests

  • ❏ edge case tests

  • ✓ implement procedure

  • ✓ simple benchmark

  • ✓ benchmark on bigger graphs

  • ✓ parallelization

  • ✓ evaluation

  • ✓ documentation

TODO
  • adapt existing procedure to algorithm api

Weakly Connected Components

Implementation Details

Connected Components (UnionFind) finds sets of connected nodes where each node is reachable from any other node in the same set. One implementation also evaluates a predicate on each relationship, which allows partitioning the graph based on relationships and properties.

Progress
  • ✓ single threaded implementation

  • ✓ tests

  • ✓ simple benchmark

  • ❏ implement procedure

  • ❏ benchmark on bigger graphs

  • ❏ parallelization

  • ❏ evaluation

Requirements

AllRelationshipIterator & Weights

Data structured involved

We use a disjoint-set structure (DSS) based on a parent-array tree. The DSS can be used to efficiently ask whether two nodes are reachable from each other. More: https://en.wikipedia.org/wiki/Disjoint-set_data_structure
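A compact sketch of such a parent-array disjoint-set structure (simplified: path halving instead of full path compression, and no union by rank):

```java
// Parent-array disjoint-set structure: each entry points at its parent;
// a node that points at itself is the root of its set.
public class DisjointSetSketch {
    private final int[] parent;

    public DisjointSetSketch(int capacity) {
        parent = new int[capacity];
        for (int i = 0; i < capacity; i++) parent[i] = i; // every node is its own set
    }

    // follow parent pointers to the root, halving the path on the way
    public int find(int node) {
        while (parent[node] != node) {
            parent[node] = parent[parent[node]]; // path halving
            node = parent[node];
        }
        return node;
    }

    public void union(int p, int q) { parent[find(p)] = find(q); }

    public boolean connected(int p, int q) { return find(p) == find(q); }

    public static void main(String[] args) {
        DisjointSetSketch dss = new DisjointSetSketch(5);
        dss.union(0, 1);
        dss.union(3, 4);
        System.out.println(dss.connected(0, 1)); // true
        System.out.println(dss.connected(1, 3)); // false
    }
}
```

The preallocated parent array mentioned below is exactly this `int[capacity]`, which is why memory consumption grows with the node count even for sparse graphs.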

ToDo
benchmark

Implement benchmark on big graph and:

  • stream nodeId-setId pairs

  • calculate setSize-setCount

parallelization

One approach to parallelizing UnionFind is relationship partitioning, where each thread runs on its own DSS instance over a subset of relationships. Each thread thus calculates a distinct set of unions. Later we can merge the DSS instances pairwise, which can also be performed in parallel. Nonetheless, memory consumption might be high due to the preallocated array in the DSS. We could switch to a growing container if this becomes a problem.

evaluation
  • Performance tests on different dataset sizes / level of concurrency

Details
  • writes a cluster id to each node, representing the connected component in which each node is reachable from any other node

algo.unionFind
  • if a threshold configuration parameter is supplied, only relationships with a property value higher than the threshold are merged

algo.unionFind.exp1
  • parallel UnionFind using ExecutorService only.

  • The algorithm is based on the idea that the DisjointSetStruct can be built using just a partition of the nodes, which are then merged pairwise.

  • The implementation is based on a queue which acts as a buffer for each computed DSS. As long as there are more elements in the queue, the algorithm takes two, merges them, and adds the result back to the queue until only one element remains.

algo.unionFind.exp2
  • As in exp1, the resulting DSS of each node partition is merged by the ForkJoin pool, while the calculation of the DSS is done by the ExecutorService.

algo.unionFind.exp3
  • calculation and merge using the ForkJoin pool

Strongly Connected Components

Implementation Details

SCC is a class of algorithms for finding groups of nodes where each node is directly reachable from every other node in the group. There are several algorithms to compute SCCs. Our current implementation (still in the graphtest project) implements Tarjan's SCC algorithm.
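A compact sketch of the recursive Tarjan scheme (our own illustration; node ids are assumed dense ints, and the cluster-id result mirrors the procedure output described below):

```java
import java.util.ArrayDeque;
import java.util.Arrays;

// Recursive Tarjan sketch: assigns every node a cluster id so that two nodes
// share an id exactly when they are in the same strongly connected component.
public class TarjanSketch {
    private final int[][] adj;
    private final int[] index, lowLink, clusterId;
    private final boolean[] onStack;
    private final ArrayDeque<Integer> stack = new ArrayDeque<>();
    private int nextIndex = 0, nextCluster = 0;

    public TarjanSketch(int[][] adj) {
        this.adj = adj;
        int n = adj.length;
        index = new int[n]; lowLink = new int[n]; clusterId = new int[n];
        onStack = new boolean[n];
        Arrays.fill(index, -1);
    }

    public int[] run() {
        for (int v = 0; v < adj.length; v++) if (index[v] < 0) strongConnect(v);
        return clusterId;
    }

    private void strongConnect(int v) {
        index[v] = lowLink[v] = nextIndex++;
        stack.push(v); onStack[v] = true;
        for (int w : adj[v]) {
            if (index[w] < 0) { strongConnect(w); lowLink[v] = Math.min(lowLink[v], lowLink[w]); }
            else if (onStack[w]) lowLink[v] = Math.min(lowLink[v], index[w]);
        }
        if (lowLink[v] == index[v]) { // v is the root of an SCC: pop the component
            int w;
            do { w = stack.pop(); onStack[w] = false; clusterId[w] = nextCluster; } while (w != v);
            nextCluster++;
        }
    }

    public static void main(String[] args) {
        // cycle 0 -> 1 -> 2 -> 0 plus a node 3 that only points into the cycle
        int[][] adj = {{1}, {2}, {0}, {0}};
        System.out.println(Arrays.toString(new TarjanSketch(adj).run()));
    }
}
```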

Progress
  • ✓ implement procedure

  • ✓ tests

  • ✓ edge case tests

  • ✓ simple benchmark

  • ✓ benchmark on bigger graphs

  • ✓ parallelization

  • ✓ evaluation

  • ✓ documentation

Details
algo.scc.tarjan
  • original recursive Tarjan implementation

  • the result is a cluster id on each node

  • each cluster is an SCC

  • Builds sets of node ids which represent a strongly connected component within the graph. Also calculates minimum and maximum set size as well as the count of distinct sets.

algo.scc.tunedTarjan
algo.scc.iterative
algo.scc.multistep