Introduction

This library provides efficiently implemented, parallel versions of common graph algorithms for Neo4j 3.x, exposed as Cypher procedures.

Algorithms

Centralities

These algorithms determine the importance of distinct nodes in a network.

  • Page Rank (algo.pageRank)

  • Betweenness Centrality (algo.betweenness)

  • Closeness Centrality (algo.closeness)

Community Detection

These algorithms evaluate how a group is clustered or partitioned, as well as its tendency to strengthen or break apart.

  • Louvain (algo.louvain)

  • Label Propagation (algo.labelPropagation)

  • (Weakly) Connected Components (algo.unionFind)

  • Strongly Connected Components (algo.scc)

  • Triangle Count / Clustering Coefficient (algo.triangleCount)

Path Finding

These algorithms help find the shortest path or evaluate the availability and quality of routes.

  • Minimum Weight Spanning Tree (algo.mst)

  • All-Pairs and Single-Source Shortest Path (algo.shortestPath, algo.allShortestPaths)

These procedures work either on the whole graph or on a subgraph filtered by label and relationship-type. You can also filter and project the graph using Cypher queries.

We’d love your feedback, so please try out these algorithms and let us know how well they work for your use-case. Also please tell us about anything missing from the installation instructions or documentation.

Please raise GitHub issues for anything you encounter or join the neo4j-users Slack group and ask in the #neo4j-graph-algorithm channel.

Installation

Download graph-algorithms-algo-*.jar from the matching release and copy it into your $NEO4J_HOME/plugins directory.

Because the algorithms use the lower level Kernel API to read from and write to Neo4j you will also have to enable them in the configuration (for security reasons):

Add to $NEO4J_HOME/conf/neo4j.conf
dbms.security.procedures.unrestricted=algo.*

Once you’ve done that, restart Neo4j and execute the query CALL algo.list() to see a list of all the algorithms.
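For example, to check that a specific procedure is available (a sketch; it assumes algo.list returns name and description columns, as its entry in the procedure overview below suggests):

CALL algo.list()
YIELD name, description
WHERE name CONTAINS 'pageRank'
RETURN name, description;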

Usage

These algorithms are exposed as Neo4j procedures. You can call them directly from Cypher in the Neo4j Browser, from cypher-shell, or from your client code.

For most algorithms we provide two procedures:

  • one named algo.<name> that writes results back to the graph as node-properties and reports statistics.

  • another named algo.<name>.stream that returns a stream of data, e.g. node-ids and computed values.

For large graphs the streaming procedure might return millions or billions of results; that’s why it is often more convenient to store the results of the algorithm and then use them in later queries.
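For example, we might run the write-back variant once and then use the stored property in plain Cypher queries (a sketch; it assumes the Page/Link graph and the 'pagerank' write-property used in the examples below):

CALL algo.pageRank('Page','Link',{write: true, writeProperty:'pagerank'});

MATCH (p:Page)
RETURN p.title, p.pagerank AS score
ORDER BY score DESC LIMIT 10;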

We can project the graph we want to run algorithms on with either label and relationship-type projection or cypher projection.


The projected graph model is separate from Neo4j’s stored graph model, to enable a fast cache of the graph topology containing only the relevant nodes, relationships and, in addition, the weights. For now the projected graph model does not support multiple relationships between a single pair of nodes. During projection we allow only one relationship per direction (in, out) between a pair of nodes in the directed case, but two relationships when BOTH is used for the undirected case.

Label and relationship-type projection

We can project the subgraph we want to run the algorithm on by using the label parameter to describe nodes and the relationship-type parameter to describe relationships.

The general call syntax is:

CALL algo.<name>([label],[relationshipType],{config})

e.g. Page Rank on DBpedia (11M nodes, 116M relationships):

CALL algo.pageRank('Page','Link',{iterations:5, dampingFactor:0.85, write: true, writeProperty:'pagerank'});
// YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty

CALL algo.pageRank.stream('Page','Link',{iterations:5, dampingFactor:0.85})
YIELD node, score
RETURN node.title, score
ORDER BY score DESC LIMIT 10;

Cypher projection

If label and relationship-type projection is not selective enough to describe our subgraph to run the algorithm on, we can use Cypher statements to project subsets of our graph: use a node-statement instead of the label parameter, a relationship-statement instead of the relationship-type, and set graph:'cypher' in the config.

Relationships described in the relationship-statement are only projected if both their source and target nodes are described in the node-statement; all other relationships are ignored during projection.

We can also return a property value or weight (according to our config) in addition to the ids from these statements.

Cypher projection allows us to be more expressive in describing the subgraph we want to analyse, but projecting the graph may take longer with more complex Cypher queries.

CALL algo.pageRank(
'MATCH (p:Page) RETURN id(p) as id',
'MATCH (p1:Page)-[:Link]->(p2:Page) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', iterations:5, write: true});

Cypher projection can also be used to project a virtual (non-stored) graph. Here is an example of how to project an undirected graph of people who visited the same web page and run the Louvain community detection algorithm on it, using the number of commonly visited web pages between pairs of people as the relationship weight.

CALL algo.louvain(
'MATCH (p:Person) RETURN id(p) as id',
'MATCH (p1:Person)-[:Visit]->(:Page)<-[:Visit]-(p2:Person)
RETURN id(p1) as source, id(p2) as target, count(*) as weight',
{graph:'cypher', iterations:5, write: true});
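The same projection can also be used with the streaming variant to inspect community sizes without writing anything back (a sketch based on the statements above):

CALL algo.louvain.stream(
'MATCH (p:Person) RETURN id(p) as id',
'MATCH (p1:Person)-[:Visit]->(:Page)<-[:Visit]-(p2:Person)
RETURN id(p1) as source, id(p2) as target, count(*) as weight',
{graph:'cypher'})
YIELD nodeId, community
RETURN community, count(*) as size
ORDER BY size DESC LIMIT 10;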

Graph Loading

As it can take some time to load large graphs into the algorithm data structures, you can pre-load graphs and later refer to them by name in several graph algorithms. After usage they can be removed from memory to free the resources they use.

// Load graph
CALL algo.graph.load('my-graph','Label','REL_TYPE',{graph:'heavy',..other config...})
  YIELD name, graph, direction, undirected, sorted, nodes, loadMillis, alreadyLoaded,
        nodeWeight, relationshipWeight, nodeProperty, loadNodes, loadRelationships;

// Info on loaded graph
CALL algo.graph.info('my-graph')
  YIELD name, type, exists, removed, nodes;

// Use graph
CALL algo.pageRank(null,null,{graph:'my-graph',...})


// Remove graph
CALL algo.graph.remove('my-graph')
  YIELD name, type, exists, removed, nodes;

Building Locally

Currently aiming at Neo4j 3.x (with a branch per version)

git clone https://github.com/neo4j-contrib/neo4j-graph-algorithms
cd neo4j-graph-algorithms
git checkout 3.3
mvn clean install
cp algo/target/graph-algorithms-*.jar $NEO4J_HOME/plugins/
$NEO4J_HOME/bin/neo4j restart

Yelp

Yelp Open Dataset

Yelp.com has been running the Yelp Dataset challenge for about five years now, a competition that encourages people to explore and research Yelp’s open dataset. In the tenth round of the challenge the dataset contains almost 5 million reviews from over 1.1 million users on over 150,000 businesses from 12 metropolitan areas. Over the years the dataset has become very popular, with hundreds of academic papers written about it. It is well-structured and highly relational, and is therefore a great showcase for Neo4j and graph algorithms on a realistic dataset. We will show how to use graph algorithms on a social network of friends and how to create and analyse an inferred graph (for example, projecting a review co-occurrence graph or a similarity graph between users based on their reviews).

Also check out past winners and their work.

Data

You can get the dataset by filling out their form. Download the data in JSON format. We will use the APOC plugin to help us with importing and batching the data into Neo4j. There are 6 JSON files available (detailed documentation). Photos and checkins are not the topic of our analysis, so they will not be imported. From the other files we will import a knowledge graph of Yelp’s world, which contains data about:

  • 156639 businesses

  • 1005693 tips from users about businesses

  • 4736897 reviews of businesses by users

  • 9489337 users total

  • 35444850 friend relationships

Depending on your setup the import might take some time, as user.json contains data about a 10 million person social network of friends. While review.json is even bigger in size, it is mostly made up of text representing the actual reviews, so its import will be faster. We also do not need the actual text, only the meta-data about each review, e.g. who wrote it and how they rated a certain business, so the text will not be imported.

Graph Model

(diagram: Yelp graph model)

The graph consists of nodes labeled User, which can have a FRIEND relationship with other users. Users also write reviews and tips about businesses. We store all the meta-data (location of the business etc.) as properties of the nodes, except for the categories of the businesses, which are represented by separate nodes labeled Category.

The graph model always depends on the application we have in mind for it. Our application is to analyse (inferred) networks with graph algorithms. If we were to use our graph as a recommendation engine, we might construct a different graph model. Check this great guide or this educational video on how you can use Neo4j as a recommendation engine.

Import

Define graph schema (constraint/index)
CALL apoc.schema.assert(
{Category:['name']},
{Business:['id'],User:['id'],Review:['id']});
Load businesses
CALL apoc.periodic.iterate("
CALL apoc.load.json('file:///dataset/business.json') YIELD value RETURN value
","
MERGE (b:Business{id:value.business_id})
SET b += apoc.map.clean(value, ['attributes','hours','business_id','categories','address','postal_code'],[])
WITH b,value.categories as categories
UNWIND categories as category
MERGE (c:Category{id:category})
MERGE (b)-[:IN_CATEGORY]->(c)
",{batchSize: 10000, iterateList: true});
Load tips
CALL apoc.periodic.iterate("
CALL apoc.load.json('file:///dataset/tip.json') YIELD value RETURN value
","
MATCH (b:Business{id:value.business_id})
MERGE (u:User{id:value.user_id})
MERGE (u)-[:TIP{date:value.date,likes:value.likes}]->(b)
",{batchSize: 20000, iterateList: true});
Load reviews
CALL apoc.periodic.iterate("
CALL apoc.load.json('file:///dataset/review.json')
YIELD value RETURN value
","
MERGE (b:Business{id:value.business_id})
MERGE (u:User{id:value.user_id})
MERGE (r:Review{id:value.review_id})
MERGE (u)-[:WROTE]->(r)
MERGE (r)-[:REVIEWS]->(b)
SET r += apoc.map.clean(value, ['business_id','user_id','review_id','text'],[0])
",{batchSize: 10000, iterateList: true});
Load users
CALL apoc.periodic.iterate("
CALL apoc.load.json('file:///dataset/user.json')
YIELD value RETURN value
","
MERGE (u:User{id:value.user_id})
SET u += apoc.map.clean(value, ['friends','user_id'],[0])
WITH u,value.friends as friends
UNWIND friends as friend
MERGE (u1:User{id:friend})
MERGE (u)-[:FRIEND]-(u1)
",{batchSize: 100, iterateList: true});

Networks

Social network

A social network is a theoretical construct useful in the social sciences to study relationships between individuals, groups, organizations, or even entire societies. An axiom of the social network approach to understanding social interaction is that social phenomena should be primarily conceived and investigated through the properties of relationships between and within nodes, instead of the properties of these nodes themselves. Precisely because many different types of relations, singular or in combination, form these network configurations, network analytics are useful to a broad range of research enterprises.

Social network analysis is the process of investigating social structures through the use of networks and graph theory. It characterizes networked structures in terms of nodes (individual actors, people, or things within the network) and the ties, edges, or links (relationships or interactions) that connect them. Examples of social structures commonly visualized through social network analysis include social media networks, meme spread, friendship and acquaintance networks, collaboration graphs, kinship networks, and disease transmission.

Social network analysis has emerged as a key technique in modern sociology. It has also gained a significant following in anthropology, biology, demography, communication studies, economics, geography, history, information science, organizational studies, political science, social psychology, development studies, sociolinguistics, and computer science.

Yelp’s friendship network is an undirected graph with unweighted friend relationships between users. There are 507948 users with no friends. They will be ignored in our analysis.
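The number of friendless users can be verified with a simple existence check (a sketch):

MATCH (u:User)
WHERE NOT (u)-[:FRIEND]-()
RETURN count(u) as usersWithoutFriends;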

Global graph statistics:

Nodes : 8981389

Relationships : 35444850

Weakly connected components : 18512

Nodes in largest WCC : 8938630

Edges in largest WCC : 35420520

Triangle count :

Average clustering coefficient :

Graph diameter (longest shortest path):
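Statistics like these can be computed with the procedures from this library, for example (a sketch; the exact signatures are listed in the procedure overview below):

// size of the largest weakly connected component
CALL algo.unionFind.stream('User', 'FRIEND', {})
YIELD nodeId, setId
RETURN setId, count(*) as size
ORDER BY size DESC LIMIT 1;

// triangle count and average clustering coefficient
CALL algo.triangleCount('User', 'FRIEND', {write: false})
YIELD nodeCount, triangleCount, averageClusteringCoefficient;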

Local graph statistics:
Use apoc to calculate local statistics
MATCH (u:User)
RETURN avg(apoc.node.degree(u,'FRIEND')) as average_friends,
       stdev(apoc.node.degree(u,'FRIEND')) as stdev_friends,
       max(apoc.node.degree(u,'FRIEND')) as max_friends,
       min(apoc.node.degree(u,'FRIEND')) as min_friends

Average number of friends : 7.47

Standard deviation of friends : 46.96

Minimum count of friends : 1

Maximum count of friends : 14995


Projecting a review co-occurrence graph

We can try to find which businesses are often reviewed by the same users by inferring a co-occurrence network between them. By way of definition, co-occurrence networks are the collective interconnection of nodes based on their paired presence within a specified domain. Our network is generated by connecting pairs of businesses using a set of criteria defining co-occurrence.

The co-occurrence criterion for this network is that any pair of businesses must have more than 5 common reviewers. We save the count of common reviewers as a property of the relationship, which will be used as a weight in community detection analysis. The inferred graph is undirected, as changing the direction of the relationships does not imply any semantic difference. We will limit our network to businesses that have more than 10 reviews and project a co-occurrence relationship between them.

Project review co-occurrence between businesses
CALL apoc.periodic.iterate('
MATCH (b1:Business)
WHERE size((b1)<-[:REVIEWS]-()) > 10 AND b1.city="Las Vegas"
RETURN b1
','
MATCH (b1)<-[:REVIEWS]-(r1)
MATCH (r1)<-[:WROTE]-(u)
MATCH (u)-[:WROTE]->(r2)
MATCH (r2)-[:REVIEWS]->(b2)
WHERE id(b1) < id(b2) AND b2.city="Las Vegas"
WITH b1, b2, count(*) AS weight WHERE weight > 5
MERGE (b1)-[cr:CO_OCCURENT_REVIEWS]-(b2)
ON CREATE SET cr.weight = weight
',{batchSize: 1});
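Once projected, the inferred relationships can be inspected directly, for example to list the business pairs with the most common reviewers (a sketch):

MATCH (b1:Business)-[cr:CO_OCCURENT_REVIEWS]-(b2:Business)
WHERE id(b1) < id(b2)
RETURN b1.name, b2.name, cr.weight
ORDER BY cr.weight DESC LIMIT 10;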


Projecting a review similarity graph

We can try to find similar groups of users by projecting a review similarity network between them. The idea is to start with users that have more than 10 reviews and find all pairs of users who have reviewed more than 10 common businesses. We do this to filter out users with not enough data. We could do something similar to filter out users who have reviewed every business (probably a bot, or simply bored). Once we find pairs of users, we calculate the similarity of their reviews using cosine similarity and only create a relationship if the cosine similarity is greater than 0, sometimes also called hard similarity. We do this so we do not end up with a complete graph, where every pair of users is connected; most community detection algorithms perform poorly on complete graphs. The cosine similarity between a pair of users is saved as a property of the relationship and can be used as a weight in graph algorithms. The projected graph is modeled as undirected, as the direction of the relationship has no semantic value.

Projecting a review similarity graph is often used in recommendations. We calculate which users are similar based on review ratings, so we can recommend to a user what similar users liked.

Create a review similarity graph
CALL apoc.periodic.iterate(
"MATCH (p1:User) WHERE size((p1)-[:WROTE]->()) > 5 RETURN p1",
"
MATCH (p1)-[:WROTE]->(r1)-->()<--(r2)<-[:WROTE]-(p2)
WHERE id(p1) < id(p2) AND size((p2)-[:WROTE]->()) > 10
WITH p1,p2,count(*) as coop, collect(r1.stars) as s1, collect(r2.stars) as s2 WHERE coop > 10
WITH p1,p2, apoc.algo.cosineSimilarity(s1,s2) as cosineSimilarity WHERE cosineSimilarity > 0
MERGE (p1)-[s:SIMILAR_REVIEWS]-(p2) SET s.weight = cosineSimilarity"
, {batchSize:100, parallel:false,iterateList:true});
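With the SIMILAR_REVIEWS relationships in place, we can run community detection over the inferred graph, for example (a sketch using the weight property written above):

CALL algo.louvain('User', 'SIMILAR_REVIEWS',
{weightProperty:'weight', write: true, writeProperty:'community'})
YIELD nodes, communityCount;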


Overview of Algorithm Procedures

Each entry below gives the qualified procedure name followed by its call signature and a short description.

algo.unionFind.mscoloring

CALL algo.unionFind.mscoloring(label:String, relationship:String, {property:'weight', threshold:0.42, defaultValue:1.0, write: true, partitionProperty:'partition', concurrency:4}) YIELD nodes, setCount, loadMillis, computeMillis, writeMillis

algo.unionFind.mscoloring.stream

CALL algo.unionFind.mscoloring.stream(label:String, relationship:String, {property:'propertyName', threshold:0.42, defaultValue:1.0, concurrency:4}) YIELD nodeId, setId - yields a setId for each node id

algo.closeness.harmonic.stream

CALL algo.closeness.harmonic.stream(label:String, relationship:String, {concurrency:4}) YIELD nodeId, centrality - yields centrality for each node

algo.closeness.harmonic

CALL algo.closeness.harmonic(label:String, relationship:String, {write:true, writeProperty:'centrality', concurrency:4}) YIELD loadMillis, computeMillis, writeMillis, nodes - yields evaluation details

algo.list

CALL algo.list - lists all algorithm procedures, their description and signature

algo.shortestPath.stream

CALL algo.shortestPath.stream(startNode:Node, endNode:Node, weightProperty:String, {nodeQuery:'labelName', relationshipQuery:'relationshipName', direction:'BOTH', defaultValue:1.0}) YIELD nodeId, cost - yields a stream of {nodeId, cost} from start to end (inclusive)

algo.shortestPath

CALL algo.shortestPath(startNode:Node, endNode:Node, weightProperty:String, {nodeQuery:'labelName', relationshipQuery:'relationshipName', direction:'BOTH', defaultValue:1.0, write:'true', writeProperty:'sssp'}) YIELD nodeCount, totalCost, loadMillis, evalMillis, writeMillis - yields nodeCount, totalCost, loadMillis, evalMillis, writeMillis

algo.louvain

CALL algo.louvain(label:String, relationship:String, {weightProperty:'weight', defaultValue:1.0, write: true, writeProperty:'community', concurrency:4}) YIELD nodes, communityCount, iterations, loadMillis, computeMillis, writeMillis

algo.louvain.stream

CALL algo.louvain.stream(label:String, relationship:String, {weightProperty:'propertyName', defaultValue:1.0, concurrency:4}) YIELD nodeId, community - yields a community id for each node id

algo.unionFind.forkJoin

CALL algo.unionFind.forkJoin(label:String, relationship:String, {property:'weight', threshold:0.42, defaultValue:1.0, write: true, partitionProperty:'partition', concurrency:4}) YIELD nodes, setCount, loadMillis, computeMillis, writeMillis

algo.unionFind.forkJoin.stream

CALL algo.unionFind.forkJoin.stream(label:String, relationship:String, {property:'propertyName', threshold:0.42, defaultValue:1.0, concurrency:4}) YIELD nodeId, setId - yields a setId for each node id

algo.unionFind.queue

CALL algo.unionFind.queue(label:String, relationship:String, {property:'weight', threshold:0.42, defaultValue:1.0, write: true, partitionProperty:'partition', concurrency:4}) YIELD nodes, setCount, loadMillis, computeMillis, writeMillis

algo.unionFind.queue.stream

CALL algo.unionFind.queue.stream(label:String, relationship:String, {property:'propertyName', threshold:0.42, defaultValue:1.0, concurrency:4}) YIELD nodeId, setId - yields a setId for each node id

algo.labelPropagation

CALL algo.labelPropagation(label:String, relationship:String, direction:String, {iterations:1, weightProperty:'weight', partitionProperty:'partition', write:true, concurrency:4}) YIELD nodes, iterations, didConverge, loadMillis, computeMillis, writeMillis, write, weightProperty, partitionProperty - simple label propagation kernel

algo.closeness.dangalchev.stream

CALL algo.closeness.dangalchev.stream(label:String, relationship:String, {concurrency:4}) YIELD nodeId, centrality - yields centrality for each node

algo.closeness.dangalchev

CALL algo.closeness.dangalchev(label:String, relationship:String, {write:true, writeProperty:'centrality', concurrency:4}) YIELD loadMillis, computeMillis, writeMillis, nodes - yields evaluation details

algo.closeness.stream

CALL algo.closeness.stream(label:String, relationship:String, {concurrency:4}) YIELD nodeId, centrality - yields centrality for each node

algo.closeness

CALL algo.closeness(label:String, relationship:String, {write:true, writeProperty:'centrality', concurrency:4}) YIELD loadMillis, computeMillis, writeMillis, nodes - yields evaluation details

algo.allShortestPaths.stream

CALL algo.allShortestPaths.stream(weightProperty:String, {nodeQuery:'labelName', relationshipQuery:'relationshipName', defaultValue:1.0, concurrency:4}) YIELD sourceNodeId, targetNodeId, distance - yields a stream of {sourceNodeId, targetNodeId, distance}

algo.shortestPaths.stream

CALL algo.shortestPaths.stream(startNode:Node, weightProperty:String, {nodeQuery:'labelName', relationshipQuery:'relationshipName', defaultValue:1.0}) YIELD nodeId, distance - yields a stream of {nodeId, distance} from the start node (inclusive)

algo.shortestPaths

CALL algo.shortestPaths(startNode:Node, weightProperty:String, {write:true, targetProperty:'path', nodeQuery:'labelName', relationshipQuery:'relationshipName', defaultValue:1.0}) YIELD loadDuration, evalDuration, writeDuration, nodeCount, targetProperty - yields nodeCount, totalCost, loadDuration, evalDuration

algo.spanningTree.kmax

CALL algo.spanningTree.kmax(label:String, relationshipType:String, weightProperty:String, startNodeId:long, k:int, {writeProperty:String}) YIELD loadMillis, computeMillis, writeMillis, effectiveNodeCount

algo.spanningTree.kmin

CALL algo.spanningTree.kmin(label:String, relationshipType:String, weightProperty:String, startNodeId:long, k:int, {writeProperty:String}) YIELD loadMillis, computeMillis, writeMillis, effectiveNodeCount

algo.betweenness.sampled.stream

CALL algo.betweenness.sampled.stream(label:String, relationship:String, {strategy:'random'/'degree', probability:double, maxDepth:int, direction:String, concurrency:int}) YIELD nodeId, centrality - yields centrality for each node

algo.betweenness.stream

CALL algo.betweenness.stream(label:String, relationship:String, {direction:'out', concurrency:4}) YIELD nodeId, centrality - yields centrality for each node

algo.betweenness

CALL algo.betweenness(label:String, relationship:String, {direction:'out',write:true, writeProperty:'centrality', stats:true, concurrency:4}) YIELD loadMillis, computeMillis, writeMillis, nodes, minCentrality, maxCentrality, sumCentrality - yields status of evaluation

algo.betweenness.sampled

CALL algo.betweenness.sampled(label:String, relationship:String, {strategy:'random', probability:double, maxDepth:5, direction:'out',write:true, writeProperty:'centrality', stats:true, concurrency:4}) YIELD loadMillis, computeMillis, writeMillis, nodes, minCentrality, maxCentrality, sumCentrality - yields status of evaluation

algo.mst

CALL algo.mst(label:String, relationshipType:String, weightProperty:String, startNodeId:long, {writeProperty:String}) YIELD loadMillis, computeMillis, writeMillis, effectiveNodeCount

algo.spanningTree

CALL algo.spanningTree(label:String, relationshipType:String, weightProperty:String, startNodeId:long, {writeProperty:String}) YIELD loadMillis, computeMillis, writeMillis, effectiveNodeCount

algo.spanningTree.minimum

CALL algo.spanningTree.minimum(label:String, relationshipType:String, weightProperty:String, startNodeId:long, {writeProperty:String}) YIELD loadMillis, computeMillis, writeMillis, effectiveNodeCount

algo.spanningTree.maximum

CALL algo.spanningTree.maximum(label:String, relationshipType:String, weightProperty:String, startNodeId:long, {writeProperty:String}) YIELD loadMillis, computeMillis, writeMillis, effectiveNodeCount

algo.scc

CALL algo.scc(label:String, relationship:String, config:Map<String, Object>) YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize

algo.scc.stream

CALL algo.scc.stream(label:String, relationship:String, config:Map<String, Object>) YIELD nodeId, partition

algo.scc.recursive.tarjan

CALL algo.scc.recursive.tarjan(label:String, relationship:String, config:Map<String, Object>) YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize

algo.scc.recursive.tunedTarjan

CALL algo.scc.recursive.tunedTarjan(label:String, relationship:String, config:Map<String, Object>) YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize

algo.scc.recursive.tunedTarjan.stream

CALL algo.scc.recursive.tunedTarjan.stream(label:String, relationship:String, config:Map<String, Object>) YIELD nodeId, partition

algo.scc.iterative

CALL algo.scc.iterative(label:String, relationship:String, config:Map<String, Object>) YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize

algo.scc.iterative.stream

CALL algo.scc.iterative.stream(label:String, relationship:String, config:Map<String, Object>) YIELD nodeId, partition

algo.scc.multistep

CALL algo.scc.multistep(label:String, relationship:String, {write:true, concurrency:4, cutoff:100000}) YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize

algo.scc.multistep.stream

CALL algo.scc.multistep.stream(label:String, relationship:String, {write:true, concurrency:4, cutoff:100000}) YIELD nodeId, partition

algo.scc.forwardBackward.stream

CALL algo.scc.forwardBackward.stream(startNodeId:long, label:String, relationship:String, {write:true, concurrency:4}) YIELD nodeId, partition

algo.graph.load

CALL algo.graph.load(name:String, label:String, relationship:String, {direction:'OUT/IN/BOTH', undirected:true/false, sorted:true/false, nodeProperty:'value', nodeWeight:'weight', relationshipWeight:'weight', graph:'heavy/huge/cypher'}) YIELD nodes, relationships, loadMillis, computeMillis, writeMillis, write, nodeProperty, nodeWeight, relationshipWeight - load named graph

algo.graph.remove

CALL algo.graph.remove(name:String) YIELD name, type, exists, removed, nodes - removes a named graph from memory

algo.graph.info

CALL algo.graph.info(name:String) YIELD name, type, exists, removed, nodes - returns details on a named graph

algo.triangle.stream

CALL algo.triangle.stream(label, relationship, {concurrency:4}) YIELD nodeA, nodeB, nodeC - yield nodeA, nodeB and nodeC which form a triangle

algo.triangleCount.stream

CALL algo.triangleCount.stream(label, relationship, {concurrency:8}) YIELD nodeId, triangles - yield nodeId, number of triangles

algo.triangleCount.forkJoin.stream

CALL algo.triangleCount.forkJoin.stream(label, relationship, {concurrency:8}) YIELD nodeId, triangles - yield nodeId, number of triangles

algo.triangleCount

CALL algo.triangleCount(label, relationship, {concurrency:4, write:true, writeProperty:'triangles', clusteringCoefficientProperty:'coefficient'}) YIELD loadMillis, computeMillis, writeMillis, nodeCount, triangleCount, averageClusteringCoefficient

algo.triangleCount.forkJoin

CALL algo.triangleCount.forkJoin(label, relationship, {concurrency:4, write:true, writeProperty:'triangles', clusteringCoefficientProperty:'coefficient'}) YIELD loadMillis, computeMillis, writeMillis, nodeCount, triangleCount, averageClusteringCoefficient

algo.pageRank

CALL algo.pageRank(label:String, relationship:String, {iterations:5, dampingFactor:0.85, write: true, writeProperty:'pagerank', concurrency:4}) YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty - calculates page rank and potentially writes back

algo.pageRank.stream

CALL algo.pageRank.stream(label:String, relationship:String, {iterations:20, dampingFactor:0.85, concurrency:4}) YIELD node, score - calculates page rank and streams results

algo.shortestPath.deltaStepping.stream

CALL algo.shortestPath.deltaStepping.stream(startNode:Node, weightProperty:String, delta:Double, {label:'labelName', relationship:'relationshipName', defaultValue:1.0, concurrency:4}) YIELD nodeId, distance - yields a stream of {nodeId, distance} from the start node (inclusive)

algo.shortestPath.deltaStepping

CALL algo.shortestPath.deltaStepping(startNode:Node, weightProperty:String, delta:Double, {label:'labelName', relationship:'relationshipName', defaultValue:1.0, write:true, writeProperty:'sssp'}) YIELD loadDuration, evalDuration, writeDuration, nodeCount

algo.unionFind

CALL algo.unionFind(label:String, relationship:String, {weightProperty:'weight', threshold:0.42, defaultValue:1.0, write: true, partitionProperty:'partition'}) YIELD nodes, setCount, loadMillis, computeMillis, writeMillis

algo.unionFind.stream

CALL algo.unionFind.stream(label:String, relationship:String, {weightProperty:'propertyName', threshold:0.42, defaultValue:1.0}) YIELD nodeId, setId - yields a setId for each node id

algo.unionFind.forkJoinMerge

CALL algo.unionFind.forkJoinMerge(label:String, relationship:String, {property:'weight', threshold:0.42, defaultValue:1.0, write: true, partitionProperty:'partition', concurrency:4}) YIELD nodes, setCount, loadMillis, computeMillis, writeMillis

algo.unionFind.forkJoinMerge.stream

CALL algo.unionFind.forkJoinMerge.stream(label:String, relationship:String, {property:'propertyName', threshold:0.42, defaultValue:1.0, concurrency:4}) YIELD nodeId, setId - yields a setId for each node id

algo.isFinite

CALL algo.isFinite(value) - return true iff the given argument is a finite value (not ±Infinity, NaN, or null), false otherwise.

algo.isInfinite

CALL algo.isInfinite(value) - return true iff the given argument is not a finite value (±Infinity, NaN, or null), false otherwise.

algo.Infinity

CALL algo.Infinity() - returns Double.POSITIVE_INFINITY as a value.

algo.NaN

CALL algo.NaN() - returns Double.NaN as a value.

Algorithms

Graph algorithms are used to compute metrics for graphs, nodes or relationships.

They can provide insights on relevant entities (centralities, ranking) in the graph or inherent structures like communities (community-detection, graph-partitioning, clustering).

Many of them are iterative approaches that frequently traverse the graph for the computation using random walks, breadth-first or depth-first searches, or pattern matching.

Due to the exponential growth of possible paths with increasing distance, many of the approaches also have high algorithmic complexity.

Fortunately, optimized algorithms exist that utilize certain structures of the graph, memoize already explored parts, and parallelize operations. Whenever possible we’ve applied these optimizations.

Page Rank

PageRank is the ranking algorithm made famous by Google’s search engine. PageRank counts the number and quality of links to a page to determine an estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

History, Explanation

In order to measure the relative importance of web pages, Sergey Brin and Larry Page proposed PageRank, a method for computing a ranking for every web page based on the graph of the web. PageRank has applications in search, browsing, and traffic estimation.

PageRank is defined in the original Google paper as follows:

We assume page A has pages T1…Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
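As a worked example with d = 0.85: if page A is pointed to only by page T1, where PR(T1) = 1 and T1 has two outgoing links (C(T1) = 2), then PR(A) = (1 - 0.85) + 0.85 * (1/2) = 0.575.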

The underlying mathematics of PageRank has to do with random walks on networks, akin to how random surfers propagate through a network. Random surfers follow links, but occasionally teleport to random vertices. The PageRank of a node is the probability it is visited by a random surfer with teleportation. PageRank is now widely recognized as a way of detecting central nodes in a network.

When to use it / use-cases

The mathematics of PageRank are entirely general and apply to any graph or network in any domain. PageRank is regularly used in bibliometrics, social and information network analysis, and for link prediction and recommendation. It’s even used for systems analysis of road networks, as well as biology, chemistry, neuroscience, and physics.

These are some other notable uses:

  • In neuroscience, the PageRank of a neuron in a neural network has been found to correlate with its relative firing rate.

  • Personalized PageRank is used by Twitter to present users with other accounts they may wish to follow [4].

  • PageRank has been used to rank spaces or streets to predict how many pedestrians or vehicles come to individual spaces or streets.

  • In lexical semantics it has been used to perform Word Sense Disambiguation, Semantic similarity, and to automatically rank WordNet synsets according to how strongly they possess a given semantic property, such as positivity or negativity [2].

  • In any ecosystem, a modified version of PageRank may be used to determine species that are essential to the continuing health of the environment [3].

  • For the analysis of protein networks in biology PageRank is also a useful tool [6].

  • PageRank has been used to quantify the scientific impact of researchers [5]. The underlying citation and collaboration networks are used in conjunction with the PageRank algorithm to come up with a ranking system for individual publications that propagates to individual authors. The new index, known as the pagerank-index (Pi), is demonstrated to be fairer than the h-index and addresses many of its drawbacks.

Constraints / when not to use it

There are some major limitations of the PageRank algorithm.

  • Spider traps - if there are no links from within a group of pages to outside the group, then the group of pages is a spider trap.

  • Rank sink - this occurs when a network of pages forms an infinite cycle.

  • Dead-ends and Dangling links - Dead-ends occur when pages have no out-links. If a page contains a link to another page which has no out-links, that link is known as a dangling link. [1]

You can read [7] for more.

Algorithm explanation on simple sample graph

(diagram: PageRank sample graph)
Create sample graph
CREATE (home:Page{name:'Home'})
,(about:Page{name:'About'})
,(product:Page{name:'Product'})
,(links:Page{name:'Links'})
,(a:Page{name:'Site A'})
,(b:Page{name:'Site B'})
,(c:Page{name:'Site C'})
,(d:Page{name:'Site D'})
CREATE (home)-[:LINKS]->(about)
,(about)-[:LINKS]->(home)
,(product)-[:LINKS]->(home)
,(home)-[:LINKS]->(product)
,(links)-[:LINKS]->(home)
,(home)-[:LINKS]->(links)
,(links)-[:LINKS]->(a)-[:LINKS]->(home)
,(links)-[:LINKS]->(b)-[:LINKS]->(home)
,(links)-[:LINKS]->(c)-[:LINKS]->(home)
,(links)-[:LINKS]->(d)-[:LINKS]->(home)
Running algorithm and streaming results
CALL algo.pageRank.stream('Page', 'LINKS', {iterations:20, dampingFactor:0.85})
YIELD node, score
RETURN node, score ORDER BY score DESC LIMIT 20
Running algorithm and writing back results
CALL algo.pageRank('Page', 'LINKS',
  {iterations:20, dampingFactor:0.85, write: true,writeProperty:"pagerank"})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty
Table 1. Results
name | pageRank
Home | 3.232
Product | 1.059
Links | 1.059
About | 1.059
Site A | 0.328
Site B | 0.328
Site C | 0.328
Site D | 0.328

As we might expect, the Home page has the highest PageRank because it has incoming links from all other pages. We can also observe that it is not only the number of incoming links that is important, but also the importance of the pages that link to us.

Example Usage

We will run PageRank on Yelp’s social network to find potential influencers.

In the import section we stored the social network as an undirected graph. Relationships in Neo4j always have a direction, but in this domain the direction is irrelevant: if Person A is a FRIEND of Person B, then Person B is also a FRIEND of Person A.

The default label and relationship-type selection syntax won’t work for us here because it will project a directed social network. Instead, we can project our undirected social network using Cypher loading. We can also apply this approach to other algorithms that use Cypher loading.

Running algorithm on Yelp social network
CALL algo.pageRank.stream(
  'MATCH (u:User) WHERE exists( (u)-[:FRIEND]-() ) RETURN id(u) as id',
  'MATCH (u1:User)-[:FRIEND]-(u2:User) RETURN id(u1) as source, id(u2) as target',
  {graph:'cypher'}
) YIELD node, score WITH node, score ORDER BY score DESC LIMIT 10
RETURN node {.name, .review_count, .average_stars,.useful,.yelping_since,.funny}, score

Syntax

running algorithm and writing back results
CALL algo.pageRank(label:String, relationship:String,
    {iterations:20, dampingFactor:0.85, write: true, writeProperty:'pagerank', concurrency:4})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty
Table 2. parameters
name | type | default | optional | description
label | string | null | yes | label to load from the graph. If null load all nodes
relationship | string | null | yes | relationship-type to load from the graph. If null load all relationships
iterations | int | 20 | yes | how many iterations of page-rank to run
concurrency | int | available CPUs | yes | number of concurrent threads
dampingFactor | float | 0.85 | yes | damping factor of the page-rank calculation
write | boolean | true | yes | if result should be written back as node property
writeProperty | string | 'pagerank' | yes | property name written back to
graph | string | 'heavy' | yes | use 'heavy' when describing the subset of the graph with label and relationship-type parameter, 'cypher' for describing the subset with cypher node-statement and relationship-statement

Table 3. results
name | type | description
nodes | int | number of nodes considered
iterations | int | number of iterations run
dampingFactor | float | damping factor used
writeProperty | string | property name written back to
write | boolean | if result was written back as node property
loadMillis | int | milliseconds for loading data
computeMillis | int | milliseconds for running the algorithm
writeMillis | int | milliseconds for writing result data back

running algorithm and streaming results
CALL algo.pageRank.stream(label:String, relationship:String,
  {iterations:20, dampingFactor:0.85, concurrency:4})
YIELD node, score - calculates page rank and streams results
Table 4. parameters
name | type | default | optional | description
label | string | null | yes | label to load from the graph, if null load all nodes
relationship | string | null | yes | relationship-type to load from the graph, if null load all relationships
iterations | int | 20 | yes | how many iterations of page-rank to run
concurrency | int | available CPUs | yes | number of concurrent threads
dampingFactor | float | 0.85 | yes | damping factor of the page-rank calculation
graph | string | 'heavy' | yes | use 'heavy' when describing the subset of the graph with label and relationship-type parameter, 'cypher' for describing the subset with cypher node-statement and relationship-statement

Table 5. results
name | type | description
node | long | node id
score | float | page-rank weight

Cypher projection

If label and relationship-type are not selective enough to describe a subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph. Don’t forget to set graph:'cypher' in the config.

CALL algo.pageRank(
  'MATCH (p:Page) RETURN id(p) as id',
  'MATCH (p1:Page)-[:Link]->(p2:Page) RETURN id(p1) as source, id(p2) as target',
  {graph:'cypher', iterations:5, write: true}
)

Versions

We support the following versions of the pageRank algorithm:

  • ✓ directed, unweighted

  • ❏ directed, weighted

  • ✓ undirected, unweighted

    • Only with cypher projection

  • ❏ undirected, weighted

Betweenness Centrality

Betweenness Centrality is a measure of centrality in a graph based on calculating geodesic (shortest) paths between nodes. There is at least one shortest path between every pair of nodes in a connected graph. The shortest path can be based on the number of relationships that the path passes through in an unweighted network or the sum of the weights of the relationships in a weighted network.

The betweenness centrality for each node is the number of these shortest paths that pass through that node. The nodes that most frequently lie on these shortest paths will have a higher betweenness centrality score.

History, Explanation

The original idea behind betweenness centrality was introduced by J. M. Anthonisse in his 1971 paper The rush in a directed graph. Anthonisse defined the rush in a graph as the amount a node in a network has to intermediate between other nodes.

Linton Freeman gave the first formal definition of betweenness centrality as one of the “three distinct intuitive conceptions of centrality” in his paper A Set of Measures of Centrality Based on Betweenness. It is calculated by a breadth-first search algorithm that computes the shortest paths from every node to all other nodes. Ulrik Brandes' Faster Algorithm for Betweenness Centrality is often used to make the calculation.

When to use it / use-cases

(diagram: betweenness centrality example)

Betweenness centrality is useful for finding nodes that serve as a bridge from one part of a graph to another. As a result, it is a rudimentary measure of the control that a node exerts over the flow throughout the graph.

In the above example Alice is the main connection in the graph. If Alice is removed, all connections in the graph would be cut off. This makes Alice “important”, because she ensures that no nodes are isolated.

The betweenness score for a node can be non intuitive. The node may have low degree, be connected to others that have low degree, and even be a long way from others on average, yet still have high betweenness. Consider a node A that lies on a bridge between two groups of nodes within a network. Since any path between nodes in different groups must go through this bridge, node A acquires high betweenness even though it is not well connected within either group.

Betweenness Centrality has wide applicability in network theory. It represents the degree to which nodes stand between each other. It has found uses in a variety of different domains [1], including:

  • analyzing social and protein networks

  • identifying important bloggers

  • measuring network traffic in communication networks

  • studying the interaction patterns of players in massively multiplayer online games

  • analyzing the importance of people in mobile phone call networks

  • finding the controlling nodes in a telecommunications network

Constraints / when not to use it

Betweenness centrality makes the assumption that all communication between nodes happens along the shortest path and with the same frequency, which isn’t the case in real life. It therefore doesn’t give us a perfect view of the most influential nodes in a graph, but rather a good approximation. [4]

Algorithm explanation on simple sample graph

People with high betweenness tend to be the innovators and brokers in social networks. They combine different perspectives, transfer ideas between groups, and get power from their ability to make introductions and pull strings.

Create sample graph
CREATE (nAlice:User {id:'Alice'})
,(nBridget:User {id:'Bridget'})
,(nCharles:User {id:'Charles'})
,(nDoug:User {id:'Doug'})
,(nMark:User {id:'Mark'})
,(nMichael:User {id:'Michael'})
CREATE (nAlice)-[:MANAGE]->(nBridget)
,(nAlice)-[:MANAGE]->(nCharles)
,(nAlice)-[:MANAGE]->(nDoug)
,(nMark)-[:MANAGE]->(nAlice)
,(nCharles)-[:MANAGE]->(nMichael);
Running algorithm and streaming results
CALL algo.betweenness.stream('User','MANAGE',{direction:'out'})
YIELD nodeId, centrality
RETURN nodeId, centrality ORDER BY centrality DESC LIMIT 20;
Running algorithm and writing back results
CALL algo.betweenness('User','MANAGE', {direction:'out',write:true, writeProperty:'centrality'})
YIELD nodes, minCentrality, maxCentrality, sumCentrality, loadMillis, computeMillis, writeMillis;
Table 6. Results
name | centrality weight
Alice | 4
Charles | 2
Bridget | 0
Michael | 0
Doug | 0
Mark | 0

We can see that Alice is the main broker in this network and Charles is a minor broker.

Approximation of betweenness centrality

The fastest known algorithm for exactly computing betweenness of all the nodes, designed by Brandes, requires at least O(nm) time for unweighted graphs, where n is the number of nodes and m is the number of relationships. For large-scale graphs exact centrality computation isn’t practical [2], but we can work with a subset of the nodes and calculate an approximate score.

RA-Brandes Algorithm

The RA-Brandes algorithm, proposed by Brandes and Pich [3], differs from the original Brandes algorithm in only one main respect: Brandes’ algorithm considers all nodes within the graph, whereas RA-Brandes considers only a subset of nodes, also known as pivots, from the graph.

We implement two strategies of selecting the subset of nodes:

  • random: nodes are selected uniformly at random with a defined probability of selection. The default probability is log10(N) / e^2. If the probability is 1, the algorithm works like the original Brandes algorithm, where all nodes are loaded.

  • degree: first calculates the mean degree of the nodes and then only visits nodes whose degree is higher than the mean, i.e. only dense nodes are visited.

The first step in calculating betweenness centrality is to collect all shortest paths that run through a specific node. With the maxDepth parameter you can limit the depth of the shortest-path traversal.

Running algorithm and streaming results
CALL algo.betweenness.sampled.stream('User','FRIEND',
  {strategy:'random', probability:1.0, maxDepth:5})
YIELD nodeId, centrality
Running algorithm and writing back results
CALL algo.betweenness.sampled('User','FRIEND',
  {strategy:'random', probability:1.0, writeProperty:'centrality', maxDepth:5})
YIELD nodes, minCentrality, maxCentrality

Syntax

Running Brandes algorithm and writing back results
CALL algo.betweenness(label:String, relationship:String,
  {direction:'out',write:true, stats:true, writeProperty:'centrality',concurrency:1})
YIELD nodes, minCentrality, maxCentrality, sumCentrality, loadMillis, computeMillis, writeMillis
- calculates betweenness centrality and potentially writes back
Table 7. Parameters
name | type | default | optional | description
label | string | null | yes | label to load from the graph. If null load all nodes
relationship | string | null | yes | relationship-type to load from the graph. If null load all relationships
direction | string | outgoing | yes | relationship direction to load from the graph, if 'both' treats the relationships as undirected
write | boolean | true | yes | if result should be written back as node property
stats | boolean | true | yes | if stats about centrality should be returned
writeProperty | string | 'centrality' | yes | property name written back to
graph | string | 'heavy' | yes | use 'heavy' when describing the subset of the graph with label and relationship-type parameter, 'cypher' for describing the subset with cypher node-statement and relationship-statement
concurrency | int | available CPUs | yes | number of concurrent threads

Table 8. Results
name | type | description
nodes | int | number of nodes considered
minCentrality | int | minimum centrality value
maxCentrality | int | maximum centrality value
sumCentrality | int | sum of all centrality values
loadMillis | int | milliseconds for loading data
computeMillis | int | milliseconds for running the algorithm
writeMillis | int | milliseconds for writing result data back

Running Brandes algorithm and streaming results
CALL algo.betweenness.stream(label:String, relationship:String,
{direction:'out',concurrency:1})
YIELD nodeId, centrality - yields centrality for each node
Table 9. Parameters
name | type | default | optional | description
label | string | null | yes | label to load from the graph, if null load all nodes
relationship | string | null | yes | relationship-type to load from the graph, if null load all relationships
concurrency | int | available CPUs | yes | number of concurrent threads
direction | string | outgoing | yes | relationship direction to load from the graph, if 'both' treats the relationships as undirected

Table 10. Results
name | type | description
node | long | node id
centrality | float | betweenness centrality weight

Running RA-Brandes algorithm and writing back results
CALL algo.betweenness.sampled(label:String, relationship:String,
  {direction:'out', strategy:'random', probability: 1, maxDepth: 4, stats:true,
 writeProperty:'centrality',concurrency:1})
YIELD nodes, minCentrality, maxCentrality, sumCentrality, loadMillis, computeMillis, writeMillis
- calculates betweenness centrality and potentially writes back
Table 11. Parameters
name | type | default | optional | description
label | string | null | yes | label to load from the graph, if null load all nodes
relationship | string | null | yes | relationship-type to load from the graph, if null load all relationships
direction | string | outgoing | yes | relationship direction to load from the graph, if 'both' treats the relationships as undirected
write | boolean | true | yes | if result should be written back as node property
strategy | string | 'random' | yes | node selection strategy
probability | float | log10(N) / e^2 | yes | probability a node is selected, between 0 and 1. If 1, all nodes are selected and the algorithm works like the original Brandes algorithm
maxDepth | int | Integer.MAX | yes | depth of the shortest paths traversal
stats | boolean | true | yes | if stats about centrality should be returned
writeProperty | string | 'centrality' | yes | property name written back to
graph | string | 'heavy' | yes | use 'heavy' when describing the subset of the graph with label and relationship-type parameter, 'cypher' for describing the subset with cypher node-statement and relationship-statement
concurrency | int | available CPUs | yes | number of concurrent threads

Table 12. Results
name | type | description
nodes | int | number of nodes considered
minCentrality | int | minimum centrality value
maxCentrality | int | maximum centrality value
sumCentrality | int | sum of all centrality values
loadMillis | int | milliseconds for loading data
computeMillis | int | milliseconds for running the algorithm
writeMillis | int | milliseconds for writing result data back

Running RA-Brandes algorithm and streaming results
CALL algo.betweenness.sampled.stream(label:String, relationship:String,
  {direction:'out',concurrency:1, strategy:'random', probability: 1, maxDepth: 4})
YIELD nodeId, centrality - yields centrality for each node
Table 13. Parameters
name | type | default | optional | description
label | string | null | yes | label to load from the graph, if null load all nodes
relationship | string | null | yes | relationship-type to load from the graph, if null load all relationships
concurrency | int | available CPUs | yes | number of concurrent threads
direction | string | outgoing | yes | relationship direction to load from the graph, if 'both' treats the relationships as undirected
strategy | string | 'random' | yes | node selection strategy
probability | float | log10(N) / e^2 | yes | probability a node is selected, between 0 and 1. If 1, all nodes are selected and the algorithm works like the original Brandes algorithm
maxDepth | int | Integer.MAX | yes | depth of the shortest paths traversal

Table 14. Results
name | type | description
node | long | node id
centrality | float | betweenness centrality weight

Cypher projection

If label and relationship-type are not selective enough to describe your subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph. This can also be used to run algorithms on a virtual graph. Set graph:'cypher' in the config.

CALL algo.betweenness(
  'MATCH (p:User) RETURN id(p) as id',
  'MATCH (p1:User)-[:MANAGE]->(p2:User) RETURN id(p1) as source, id(p2) as target',
  {graph:'cypher', write: true}
);

Versions

We support the following versions of the betweenness centrality algorithm:

  • ✓ directed, unweighted

    • loading incoming relationships: 'INCOMING','IN','I' or '<'

    • loading outgoing relationships: 'OUTGOING','OUT','O' or '>'

  • ❏ directed, weighted

  • ✓ undirected, unweighted

    • direction:'both' or '<>'

  • ❏ undirected, weighted

Implementations

algo.betweenness()

  • implementation of brandes-bc algorithm and nodePartitioning extension

  • if the concurrency parameter is set (and >1), ParallelBetweennessCentrality is used

  • ParallelBC spawns N (given by the concurrency param) concurrent threads for the calculation, where each one calculates the BC for one node at a time

algo.betweenness.exp1()

  • brandes-like algorithm which uses successor sets instead of predecessor sets

  • The algorithm is based on Brandes definition but with some changes regarding the dependency-accumulation step.

  • Does not support undirected graphs

algo.betweenness.sampled()

  • Calculates betweenness-dependencies on a subset of pivot nodes (instead of all nodes). So far two randomization strategies are implemented, which can be set using the optional argument strategy:

  • random selection (default): strategy:'random' (takes the optional argument probability:double(0-1), default log10(N) / e^2)

  • degree-based randomization: strategy:'degree' (makes dense nodes more likely)

  • optional arguments: maxDepth:int

Closeness Centrality

The Closeness Centrality of a node measures the distance from that node to all other nodes. Nodes with a high closeness score have the shortest distances to all other nodes. The premise of this algorithm is that nodes with short distance to other nodes can spread information very efficiently through the network. This is important for the availability of knowledge and resources.

History, Explanation

Sabidussi (1966) described the sum of the shortest path distances from one node to every other node as the node’s farness. Freeman (1979) used this idea to define closeness centrality of a node as the inverse of Sabidussi’s farness.

Closeness centrality is defined as the total number of relationships separating a node from all others along the shortest possible paths.

The algorithm operates as follows:

  • calculate the shortest path for each pair of nodes in the graph

  • for each node sum the distance from the node to all other nodes based on those shortest paths

The raw closeness centrality for a node is then calculated using the following formula:

raw closeness centrality(node) = sum(distance from node to all other nodes)

The greater the raw closeness centrality, the longer it takes for information originating at random points in the graph to arrive at the node. We could also interpret closeness as the potential ability of a node to reach all other nodes as quickly as possible.

It is important to note that raw closeness centrality is an inverse measure of centrality, i.e. the nodes with smaller scores are the most central. Our algorithm returns a normalized closeness centrality score where nodes with a higher score are more central.

The formula for normalized closeness centrality is as follows:

normalized closeness centrality(node) = (number of nodes - 1) / sum(distance from node to all other nodes)
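As a worked example: in the five-node path graph used below, node C lies at distances 2, 1, 1, 2 from the other nodes, so its normalized closeness centrality is (5 - 1) / (2 + 1 + 1 + 2) = 4/6 ≈ 0.67, the highest score in the graph.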

When to use it / use-cases

This algorithm is useful for finding the nodes that can access the entire network most quickly.

We might use it if we’re trying to identify where in the city to place a new public service so that it’s easily accessible for residents. If we’re trying to spread a message on social media we could use the algorithm to find the key influencers that can help us achieve our goal.

This measure is preferable to degree centrality, because it takes both indirect and direct connections into account. For closeness centrality in directed networks, we split the concept into in-closeness and out-closeness. The in-closeness variable measures the extent to which a node is easily reached by others, i.e. paths from others to the node. The out-closeness variable measures the extent to which a node can easily reach others, i.e. paths from the node to others.

Constraints / when not to use it

Closeness centrality works best on connected graphs. On unconnected graphs we can end up with an infinite distance between two nodes in separate connected components. This means that we’ll end up with an infinite closeness centrality score when we sum up all the distances from that node. [1]

We may therefore choose to run a (strongly) connected components algorithm over our dataset before trying closeness centrality, as sketched below.
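One way to do this is to write component ids first and then restrict the closeness computation to a single component with a Cypher projection (a sketch; it uses the union-find (connected components) procedure for illustration and assumes we analyse the component written as partition 0):

CALL algo.unionFind('Node', 'LINK', {write: true, partitionProperty:'partition'});

CALL algo.closeness.stream(
'MATCH (n:Node) WHERE n.partition = 0 RETURN id(n) as id',
'MATCH (n1:Node)-[:LINK]->(n2:Node) RETURN id(n1) as source, id(n2) as target',
{graph:'cypher'})
YIELD nodeId, centrality;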

Algorithm explanation on simple sample graph

Figure: closeness centrality sample graph
Create sample graph
CREATE (a:Node{id:"A"}),
       (b:Node{id:"B"}),
       (c:Node{id:"C"}),
       (d:Node{id:"D"}),
       (e:Node{id:"E"})
CREATE (a)-[:LINK]->(b),
       (b)-[:LINK]->(a),
       (b)-[:LINK]->(c),
       (c)-[:LINK]->(b),
       (c)-[:LINK]->(d),
       (d)-[:LINK]->(c),
       (d)-[:LINK]->(e),
       (e)-[:LINK]->(d);
Running algorithm and streaming results
CALL algo.closeness.stream('Node', 'LINK') YIELD nodeId, centrality
RETURN nodeId, centrality ORDER BY centrality DESC LIMIT 20;

- yields centrality for each node
Running algorithm and writing back results
CALL algo.closeness('Node', 'LINK', {write:true, writeProperty:'centrality'})
YIELD nodes,loadMillis, computeMillis, writeMillis;

- calculates closeness centrality and potentially writes back

Calculation:

  • accumulate farness in each multi-source BFS (MS-BFS) callback

  • divide by N-1

N = 5 // number of nodes

k = N-1 = 4 // used for normalization

     A     B     C     D     E
 --|-----------------------------
 A | 0     1     2     3     4       // distance between each pair of nodes
 B | 1     0     1     2     3
 C | 2     1     0     1     2
 D | 3     2     1     0     1
 E | 4     3     2     1     0
 --|-----------------------------
 S | 10    7     6     7     10      // farness (sum of distances)
 ==|=============================
k/S| 0.4  0.57  0.67  0.57   0.4     // normalized closeness centrality
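
We can cross-check this hand calculation against the procedure output by joining each streamed nodeId back to its node with plain Cypher:

Verifying the calculation on the sample graph
CALL algo.closeness.stream('Node', 'LINK')
YIELD nodeId, centrality
MATCH (n:Node) WHERE id(n) = nodeId
RETURN n.id AS node, centrality
ORDER BY centrality DESC;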

Syntax

Running algorithm and writing back results
CALL algo.closeness(label:String, relationship:String,
{write:true, writeProperty:'centrality',graph:'heavy', concurrency:4})
YIELD nodes,loadMillis, computeMillis, writeMillis
- calculates closeness centrality and potentially writes back
Table 15. Parameters
name          | type    | default        | optional | description
--------------+---------+----------------+----------+------------
label         | string  | null           | yes      | label to load from the graph. If null load all nodes
relationship  | string  | null           | yes      | relationship-type to load from the graph. If null load all relationships
write         | boolean | true           | yes      | if result should be written back as node property
concurrency   | int     | available CPUs | yes      | number of concurrent threads
writeProperty | string  | 'centrality'   | yes      | property name written back to
graph         | string  | 'heavy'        | yes      | use 'heavy' when describing the subset of the graph with label and relationship-type parameter, 'cypher' for describing the subset with cypher node-statement and relationship-statement

Table 16. Results
name          | type | description
--------------+------+------------
nodes         | int  | number of nodes considered
loadMillis    | int  | milliseconds for loading data
computeMillis | int  | milliseconds for running the algorithm
writeMillis   | int  | milliseconds for writing result data back

Running algorithm and streaming results
CALL algo.closeness.stream(label:String, relationship:String,{concurrency:4})
YIELD nodeId, centrality - yields centrality for each node
Table 17. Parameters
name         | type   | default        | optional | description
-------------+--------+----------------+----------+------------
label        | string | null           | yes      | label to load from the graph, if null load all nodes
relationship | string | null           | yes      | relationship-type to load from the graph, if null load all relationships
concurrency  | int    | available CPUs | yes      | number of concurrent threads
graph        | string | 'heavy'        | yes      | use 'heavy' when describing the subset of the graph with label and relationship-type parameter, 'cypher' for describing the subset with cypher node-statement and relationship-statement

Table 18. Results
name       | type  | description
-----------+-------+------------
nodeId     | long  | node id
centrality | float | closeness centrality weight

Cypher projection

If label and relationship-type are not selective enough to describe the subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph. This can also be used to run algorithms on a virtual graph. Set graph:'cypher' in the config.

CALL algo.closeness(
  'MATCH (p:Node) RETURN id(p) as id',
  'MATCH (p1:Node)-[:LINK]->(p2:Node) RETURN id(p1) as source, id(p2) as target',
  {graph:'cypher', write: true}
);

Versions

We support the following versions of the closeness centrality algorithm:

  • ✓ directed, unweighted

  • ❏ directed, weighted

  • ✓ undirected, unweighted

    • Only with cypher projection

  • ❏ undirected, weighted

Harmonic Centrality

Closeness centrality works best on connected graphs; results can be misleading if there are multiple connected components.

Harmonic centrality (also known as 'valued centrality') is a variant of closeness centrality that addresses this problem. As with many of the centrality algorithms, it originates from the field of social network analysis.

History, Explanation

Harmonic centrality was proposed by Marchiori and Latora [1] while trying to come up with a sensible notion of "average shortest path". They suggested replacing the average distance calculation from the closeness centrality algorithm with the harmonic mean of all distances.

The algorithm operates as follows:

  • calculate the shortest path between each pair of nodes in the graph

  • for each node determine the distance from the node to all other nodes based on those shortest paths

The raw harmonic centrality for a node is then calculated using the following formula:

raw harmonic centrality(node) = sum(1 / distance from node to every other node excluding itself)

As with closeness centrality we can also calculate a normalized harmonic centrality with the following formula:

normalized harmonic centrality(node) = sum(1 / distance from node to every other node excluding itself) / (number of nodes - 1)

The advantage of harmonic centrality is that infinite distances are handled cleanly. Harmonic centrality and closeness centrality often produce similar results, but harmonic centrality can also handle graphs that aren't connected. [3]

Harmonic centrality was proposed independently by Dekker (2005)[4], using the name "valued centrality," and by Rochat (2009)[2].

When to use it / use-cases

Harmonic centrality was proposed as an alternative to closeness centrality, and therefore has similar use cases.

For example, we might use it if we’re trying to identify where in the city to place a new public service so that it’s easily accessible for residents. If we’re trying to spread a message on social media we could use the algorithm to find the key influencers that can help us achieve our goal.

Algorithm explanation on simple sample graph

Figure: harmonic centrality sample graph
Create sample graph
CREATE (a:Node{id:"A"}),
       (b:Node{id:"B"}),
       (c:Node{id:"C"}),
       (d:Node{id:"D"}),
       (e:Node{id:"E"})
CREATE (a)-[:LINK]->(b),
       (b)-[:LINK]->(c),
       (d)-[:LINK]->(e);
Running algorithm and streaming results
CALL algo.closeness.harmonic.stream('Node', 'LINK') YIELD nodeId, centrality
RETURN nodeId,centrality
ORDER BY centrality DESC
LIMIT 20;

- yields centrality for each node
Running algorithm and writing back results
CALL algo.closeness.harmonic('Node', 'LINK', {writeProperty:'centrality'})
YIELD nodes,loadMillis, computeMillis, writeMillis;

- calculates harmonic centrality and potentially writes back

Calculation:

k = N-1 = 4

       A     B     C     D     E
 ----|-----------------------------
  A  | 0     1     2     -     -    // distance between each pair of nodes
  B  | 1     0     1     -     -    // or infinite if no path exists
  C  | 2     1     0     -     -
  D  | -     -     -     0     1
  E  | -     -     -     1     0
 ----|-----------------------------
  A  | 0     1    1/2    0     0    // inverse of each distance
  B  | 1     0     1     0     0
  C  |1/2    1     0     0     0
  D  | 0     0     0     0     1
  E  | 0     0     0     1     0
 ----|-----------------------------
 sum |1.5    2    1.5    1     1    // raw harmonic centrality
 ----|-----------------------------
sum/k|0.37  0.5  0.37  0.25  0.25   // normalized harmonic centrality

Instead of summing the distances (farness), we sum the inverse of each cell and divide by n-1.

Syntax

Running algorithm and writing back results
CALL algo.closeness.harmonic(label:String, relationship:String,
{write:true, writeProperty:'centrality',graph:'heavy', concurrency:4})
YIELD nodes,loadMillis, computeMillis, writeMillis
- calculates harmonic centrality and potentially writes back
Table 19. Parameters
name          | type    | default        | optional | description
--------------+---------+----------------+----------+------------
label         | string  | null           | yes      | label to load from the graph. If null load all nodes
relationship  | string  | null           | yes      | relationship-type to load from the graph. If null load all relationships
write         | boolean | true           | yes      | if result should be written back as node property
concurrency   | int     | available CPUs | yes      | number of concurrent threads
writeProperty | string  | 'centrality'   | yes      | property name written back to
graph         | string  | 'heavy'        | yes      | use 'heavy' when describing the subset of the graph with label and relationship-type parameter, 'cypher' for describing the subset with cypher node-statement and relationship-statement

Table 20. Results
name          | type | description
--------------+------+------------
nodes         | int  | number of nodes considered
loadMillis    | int  | milliseconds for loading data
computeMillis | int  | milliseconds for running the algorithm
writeMillis   | int  | milliseconds for writing result data back

Running algorithm and streaming results
CALL algo.closeness.harmonic.stream(label:String, relationship:String,{concurrency:4})
YIELD nodeId, centrality - yields centrality for each node
Table 21. Parameters
name         | type   | default        | optional | description
-------------+--------+----------------+----------+------------
label        | string | null           | yes      | label to load from the graph, if null load all nodes
relationship | string | null           | yes      | relationship-type to load from the graph, if null load all relationships
concurrency  | int    | available CPUs | yes      | number of concurrent threads

Table 22. Results
name       | type  | description
-----------+-------+------------
nodeId     | long  | node id
centrality | float | harmonic centrality weight

Cypher projection

If label and relationship-type are not selective enough to describe the subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph. This can also be used to run algorithms on a virtual graph. Set graph:'cypher' in the config.

CALL algo.closeness.harmonic(
  'MATCH (p:Node) RETURN id(p) as id',
  'MATCH (p1:Node)-[:LINK]-(p2:Node) RETURN id(p1) as source, id(p2) as target',
  {graph:'cypher', writeProperty: 'centrality'}
);

Versions

We support the following versions of the harmonic centrality algorithm:

  • ✓ undirected, unweighted

  • ❏ undirected, weighted

Minimum Weight Spanning Tree

A Minimum Weight Spanning Tree (MST) is a subgraph that contains all the nodes from the original graph together with the set of relationships that connect them with the minimum possible total weight. Our current approach uses Prim’s algorithm to calculate the MST starting at a given node. The k-spanning tree variant can be used to detect clusters in the graph.

History, Explanation

In graph theory, a tree is a way of connecting all the nodes together so that there is exactly one path from any node to any other node in the tree. If the graph represents cities connected by roads, one could select a set of roads so that each city can be reached from every other one, but with no more than one way to travel between any two cities.

Most of the time graphs are weighted: each connection between two cities has a weight. It might cost something to travel on a given road, or one connection may be longer than another, meaning it takes more time to travel on it.

A graph might have more than one minimum spanning tree. In a graph where all the relationships have the same weight, every spanning tree is a minimum spanning tree. If all the relationships have different weights there is exactly one minimum spanning tree.

Czech scientist Otakar Borůvka developed the first known algorithm for finding a minimum spanning tree in 1926, while trying to find an efficient electricity network for Moravia. This algorithm is known as Borůvka’s algorithm.

Prim’s algorithm (sometimes known as the Jarník-Prim algorithm) is one of the simplest and best-known minimum spanning tree algorithms. It is similar to Dijkstra’s algorithm, but rather than minimizing the total length of a path ending at each node, we minimize the length of each relationship individually. Unlike Dijkstra’s algorithm, Prim’s can tolerate negative-weight relationships. It is optimal for dense graphs.

The algorithm operates as follows:

  • start with a trivial tree containing only one node (and no relationship). It does not matter which node is chosen.

  • select the minimal-weight relationship coming from that node and add it to our tree.

  • repeatedly choose a minimal-weight relationship that joins any node in the tree to one not in the tree, adding the new relationship and node to our tree. Such a relationship always exists while the tree is incomplete; otherwise the partially built tree would be an isolated component, contradicting the assumption that the graph is a single connected component.

  • when there are no more nodes to add, the tree we have built is a minimum spanning tree.

When to use it / use-cases

Imagine a telecommunications company is trying to lay out cables in a new neighborhood. If the company is only able to lay the cable along certain paths (e.g. along roads), we could derive a graph that represents the points connected by those paths. Some of those paths might be more expensive, because they are longer, or require the cable to be buried deeper; these paths would be represented by relationships with larger weights. A spanning tree for that graph would be a subset of those paths that has no cycles but still connects to every house; in fact several spanning trees might be possible. A minimum spanning tree would be the one with the lowest total cost, which would represent the least expensive path for laying the cable.

There are also some indirect applications such as LDPC codes for error correction, image registration with Rényi entropy, learning salient features for real-time face verification, modelling the locality of particle interactions in turbulent fluid flows, and the autoconfiguration protocol for Ethernet bridging that avoids cycles in a network.

Constraints / when not to use it

The MST algorithm only gives meaningful results when run on a graph where the relationships have different weights. If the graph has no weights or all relationships have the same weight then any spanning tree is a minimum spanning tree.

Algorithm explanation on simple sample graph

Figure: minimum weight spanning tree sample graph
Create sample graph
CREATE (a:Place{id:"A"}),
       (b:Place{id:"B"}),
       (c:Place{id:"C"}),
       (d:Place{id:"D"}),
       (e:Place{id:"E"})
CREATE (d)-[:LINK{cost:4}]->(b),
       (d)-[:LINK{cost:6}]->(e),
       (b)-[:LINK{cost:1}]->(a),
       (b)-[:LINK{cost:3}]->(c),
       (a)-[:LINK{cost:2}]->(c),
       (c)-[:LINK{cost:5}]->(e);

Minimum weighted spanning tree visits all nodes that are in the same connected component as the starting node and returns a spanning tree of all nodes in the component where the total weight of the relationships is minimized.

Running minimum spanning tree algorithm and writing back results
MATCH (n:Place{id:"D"})
CALL algo.spanningTree.minimum('Place','LINK', 'cost',id(n), {write:true, writeProperty:"MINST"})
YIELD loadMillis, computeMillis, writeMillis, effectiveNodeCount
RETURN loadMillis,computeMillis,writeMillis, effectiveNodeCount;
Figure 1. Minimum spanning tree result
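
To inspect the tree that was written back, we can traverse the new relationships from the start node with plain Cypher. This is a sketch: it assumes, per the parameter table below, that the spanning tree is written back as relationships of the type given in writeProperty (here MINST) pointing away from the start node.

Querying the written-back spanning tree
MATCH path = (n:Place{id:"D"})-[:MINST*]->()
RETURN path;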

The maximum weighted spanning tree algorithm is similar to the minimum one, except that it returns a spanning tree of all nodes in the component where the total weight of the relationships is maximized.

Running maximum spanning tree algorithm and writing back results
MATCH (n:Place{id:"D"})
CALL algo.spanningTree.maximum('Place', 'LINK', 'cost',id(n), {write:true, writeProperty:"MAXST"})
YIELD loadMillis, computeMillis, writeMillis, effectiveNodeCount
RETURN loadMillis,computeMillis,writeMillis, effectiveNodeCount;
Figure: maximum spanning tree result
k-Spanning tree

Sometimes we want to limit the size of our spanning tree result, as we are only interested in finding a smaller tree within our graph that does not span all nodes. The k-spanning tree algorithm returns a tree with k nodes and k − 1 relationships.

Our sample graph has 5 nodes. When we ran MST above, it returned a 5-minimum spanning tree covering all five nodes. By setting k=3 we ask for a 3-minimum spanning tree that covers 3 nodes and has 2 relationships.

Running k-minimum spanning tree algorithm and writing back results
MATCH (n:Place{id:"D"})
CALL algo.spanningTree.kmin('Place', 'LINK', 'cost',id(n), 3, {writeProperty:"kminst"})
YIELD loadMillis, computeMillis, writeMillis, effectiveNodeCount
RETURN loadMillis,computeMillis,writeMillis, effectiveNodeCount;
Table 23. Results
Place | partition
------+----------
A     | 1
B     | 1
C     | 1
D     | 3
E     | 4

Nodes A, B and C form the resulting 3-minimum spanning tree of our graph.

Running k-maximum spanning tree algorithm and writing back results
MATCH (n:Place{id:"D"})
CALL algo.spanningTree.kmax('Place', 'LINK', 'cost',id(n), 3, {writeProperty:"kmaxst"})
YIELD loadMillis, computeMillis, writeMillis, effectiveNodeCount
RETURN loadMillis,computeMillis,writeMillis, effectiveNodeCount;
Table 24. Results
Place | partition
------+----------
A     | 0
B     | 1
C     | 3
D     | 3
E     | 3

Nodes C, D and E form the resulting 3-maximum spanning tree of our graph.

When we run this algorithm on a bigger graph, we can use the following query to find the nodes that belong to our k-spanning tree result (here k = 3).

Find nodes that belong to our k-spanning tree result.
MATCH (n:Place)
WITH n.partition AS partition, count(*) AS count, collect(n) AS nodes
WHERE count = 3
RETURN nodes

Syntax

Running algorithm and writing back results
CALL algo.spanningTree(label:String, relationshipType:String, weightProperty:String, startNodeId:int, {writeProperty:String})
YIELD loadMillis, computeMillis, writeMillis, effectiveNodeCount
Table 25. Parameters
name             | type    | default | optional | description
-----------------+---------+---------+----------+------------
label            | String  | null    | no       | label to load from the graph. If null load all nodes
relationshipType | String  | null    | no       | relationship-type to load from the graph. If null load all relationships
weightProperty   | string  | null    | no       | property name that contains weight. Must be numeric.
startNodeId      | long    | null    | no       | start node id
write            | boolean | true    | yes      | if result should be written back as relationships
writeProperty    | string  | 'mst'   | yes      | relationship-type written back as result

Table 26. Results
name               | type | description
-------------------+------+------------
effectiveNodeCount | int  | number of visited nodes
loadMillis         | int  | milliseconds for loading data
computeMillis      | int  | milliseconds for running the algorithm
writeMillis        | int  | milliseconds for writing result data back

Running k-spanning tree algorithm and writing back results
CALL algo.spanningTree.k*(label:String, relationshipType:String, weightProperty:String, startNodeId:int, k:int, {writeProperty:String})
YIELD loadMillis, computeMillis, writeMillis, effectiveNodeCount
Table 27. Parameters
name             | type    | default | optional | description
-----------------+---------+---------+----------+------------
label            | String  | null    | no       | label to load from the graph. If null load all nodes
relationshipType | String  | null    | no       | relationship type
weightProperty   | string  | null    | no       | property name that contains weight. Must be numeric.
startNodeId      | int     | null    | no       | start node id
k                | int     | null    | no       | result is a tree with k nodes and k − 1 relationships
write            | boolean | true    | yes      | if result should be written back as node property
writeProperty    | string  | 'mst'   | yes      | relationship-type written back as result

Table 28. Results
name               | type | description
-------------------+------+------------
effectiveNodeCount | int  | number of visited nodes
loadMillis         | int  | milliseconds for loading data
computeMillis      | int  | milliseconds for running the algorithm
writeMillis        | int  | milliseconds for writing result data back

Versions

We support the following versions of the minimum weight spanning tree algorithm:

  • ✓ undirected, weighted

Single Source Shortest Path

The Single Source Shortest Path algorithm calculates the shortest weighted path between a pair of nodes. A commonly used algorithm is Dijkstra's.

History, Explanation

Path finding has a long history and is considered one of the classical graph problems; it has been researched as far back as the 19th century. Classical references are Wiener [1873], Lucas [1882] (describing a method due to C.P. Trémaux), and Tarry [1895]; see Biggs, Lloyd, and Wilson [1976].

Dijkstra thought about the shortest path problem while working as a programmer at the Mathematical Center in Amsterdam in 1956, to demonstrate the capabilities of the new ARMAC computer. He wanted to choose both a problem and an answer (that would be produced by the computer) that non-computing people could understand. He designed what is now known as Dijkstra’s algorithm and later implemented it on the ARMAC for a slightly simplified transportation map of 64 cities in the Netherlands.

The shortest path problem gained prominence in the early 1950s in the context of ‘alternate routing’, i.e. finding a second-shortest route if the shortest route is blocked. It found uses in urban planning [1] as well as telephone call routing.

When to use it / use-cases

The most common usage of the single source shortest path algorithm is finding directions between physical locations. Web mapping websites such as MapQuest or Google Maps use it to provide driving directions.

It has also found uses in several other areas including:

  • Telecommunications and networking

  • Operations research - plant and facility layout, robotics, transportation, and VLSI design.

  • Social networks - find the degrees of separation between people

  • IP routing - the Open Shortest Path First protocol

Constraints / when not to use it

Dijkstra does not support negative weights. The algorithm assumes that adding a relationship to a path can never make a path shorter - an invariant that would be violated with negative weights.

Algorithm explanation on simple sample graph

Figure: single source shortest path sample graph
Create sample graph
CREATE (a:Loc{name:'A'}), (b:Loc{name:'B'}), (c:Loc{name:'C'}),
       (d:Loc{name:'D'}), (e:Loc{name:'E'}), (f:Loc{name:'F'}),
       (a)-[:ROAD {cost:50}]->(b),
       (a)-[:ROAD {cost:50}]->(c),
       (a)-[:ROAD {cost:100}]->(d),
       (a)-[:RAIL {cost:50}]->(d),
       (b)-[:ROAD {cost:40}]->(d),
       (c)-[:ROAD {cost:40}]->(d),
       (c)-[:ROAD {cost:80}]->(e),
       (d)-[:ROAD {cost:30}]->(e),
       (d)-[:ROAD {cost:80}]->(f),
       (e)-[:ROAD {cost:40}]->(f),
       (e)-[:RAIL {cost:20}]->(f);
Dijkstra single source shortest path algorithm
Running algorithm and streaming results
MATCH (start:Loc{name:'A'}), (end:Loc{name:'F'})
CALL algo.shortestPath.stream(start, end, 'cost')
YIELD nodeId, cost
RETURN nodeId, cost LIMIT 20
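
The stream yields internal node ids. To make the output readable, we can join each nodeId back to its node with plain Cypher:

Streaming the shortest path with node names
MATCH (start:Loc{name:'A'}), (end:Loc{name:'F'})
CALL algo.shortestPath.stream(start, end, 'cost')
YIELD nodeId, cost
MATCH (n) WHERE id(n) = nodeId
RETURN n.name AS place, cost;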
Running algorithm and writing back results
MATCH (start:Loc{name:'A'}), (end:Loc{name:'F'})
CALL algo.shortestPath(start, end, 'cost',{write:true,writeProperty:'sssp'})
YIELD writeMillis,loadMillis,nodeCount, totalCost
RETURN writeMillis,loadMillis,nodeCount,totalCost
Delta stepping algorithm
Running algorithm and streaming results
MATCH (n:Loc {name:'A'})
CALL algo.shortestPath.deltaStepping.stream(n, 'cost', 3.0)
YIELD nodeId, distance
RETURN nodeId, distance LIMIT 20
Running algorithm and writing back results
MATCH (n:Loc {name:'A'})
CALL algo.shortestPath.deltaStepping(n, 'cost', 3.0, {defaultValue:1.0, write:true, writeProperty:'sssp'})
YIELD nodeCount, loadDuration, evalDuration, writeDuration
RETURN nodeCount, loadDuration, evalDuration, writeDuration

Syntax

Running algorithm and writing back results
CALL algo.shortestPath(startNode:Node, endNode:Node, weightProperty:String
{nodeQuery:'labelName', relationshipQuery:'relationshipName', defaultValue:1.0,write:'true',writeProperty:'sssp',direction:'OUTGOING'})
YIELD nodeCount, totalCost, loadMillis, evalMillis, writeMillis
Table 29. Parameters
name              | type    | default  | optional | description
------------------+---------+----------+----------+------------
startNode         | node    | null     | no       | start node
endNode           | node    | null     | no       | end node
weightProperty    | string  | null     | yes      | property name that contains weight, if null treats the graph as unweighted. Must be numeric.
defaultValue      | float   | null     | yes      | default value of the weight in case it is missing or invalid
write             | boolean | true     | yes      | if result should be written back as node property
writeProperty     | string  | 'sssp'   | yes      | property name the node's position (sequence) in the path is written to
nodeQuery         | string  | null     | yes      | label to load from the graph, if null load all nodes
relationshipQuery | string  | null     | yes      | relationship-type to load from the graph, if null load all relationships
direction         | string  | outgoing | yes      | relationship direction to load from the graph, if 'both' treats the relationships as undirected

Table 30. Results
name        | type  | description
------------+-------+------------
nodeCount   | int   | number of nodes considered
totalCost   | float | sum of all weights along the path
loadMillis  | int   | milliseconds for loading data
evalMillis  | int   | milliseconds for running the algorithm
writeMillis | int   | milliseconds for writing result data back

Running algorithm and streaming results
CALL algo.shortestPath.stream(startNode:Node, endNode:Node, weightProperty:String
{nodeQuery:'labelName', relationshipQuery:'relationshipName', defaultValue:1.0, direction:'OUTGOING'})
 YIELD nodeId, cost
Table 31. Parameters
name              | type   | default  | optional | description
------------------+--------+----------+----------+------------
startNode         | node   | null     | no       | start node
endNode           | node   | null     | no       | end node
weightProperty    | string | null     | yes      | property name that contains weight, if null treats the graph as unweighted. Must be numeric.
nodeQuery         | string | null     | yes      | label to load from the graph, if null load all nodes
relationshipQuery | string | null     | yes      | relationship-type to load from the graph, if null load all relationships
defaultValue      | float  | null     | yes      | default value of the weight in case it is missing or invalid
direction         | string | outgoing | yes      | relationship direction to load from the graph, if 'both' treats the relationships as undirected

Table 32. Results
name   | type | description
-------+------+------------
nodeId | int  | node id
cost   | int  | cost it takes to get from start node to specific node

Versions

We support the following versions of the shortest path algorithms:

  • ✓ directed, unweighted:

    • direction: 'OUTGOING' or 'INCOMING', weightProperty: null

  • ✓ directed, weighted

    • direction: 'OUTGOING' or 'INCOMING', weightProperty: 'cost'

  • ✓ undirected, unweighted

    • direction: 'BOTH', weightProperty: null

  • ✓ undirected, weighted

    • direction: 'BOTH', weightProperty: 'cost'

Implementations

algo.shortestPath

  • specify start and end node, find the shortest path between them

  • Dijkstra single source shortest path algorithm

  • there may be more than one shortest path; the algorithm returns only one

  • if initialized with a non-existing weight-property it will treat the graph as unweighted

algo.shortestPath.deltaStepping

  • specify start node, find the shortest paths to all other nodes

  • parallel non-negative single source shortest path algorithm for weighted graphs

  • it can be tweaked using the delta parameter, which controls the degree of concurrency

  • if initialized with a non-existing weight-property it will treat the graph as unweighted

algo.shortestPaths

  • specify start node, find the shortest paths to all other nodes

  • Dijkstra single source shortest path algorithm

  • if initialized with a non-existing weight-property it will treat the graph as unweighted

All Pairs Shortest Path

All Pairs Shortest Path calculates the shortest weighted path between every pair of nodes. Algorithms that solve this include Floyd-Warshall and parallel Johnson's algorithm.

History, Explanation

This algorithm finds the shortest paths between all pairs of nodes in the graph. Some pairs of nodes might not be reachable from each other, so no shortest path exists between them; in this case the algorithm returns Infinity as the result for that pair of nodes.

Plain Cypher does not support filtering Infinity values, so the algo.isFinite function was added to help filter them out of the results.

When to use it / use-cases

The all pairs shortest path algorithm has found uses in the following areas:

  • urban planning and simulation [1]

  • datacenter network design [2]

  • traffic routing

Algorithm explanation on simple sample graph

Figure: all pairs shortest path sample graph
Create sample graph
CREATE (a:Loc{name:'A'}), (b:Loc{name:'B'}), (c:Loc{name:'C'}),
       (d:Loc{name:'D'}), (e:Loc{name:'E'}), (f:Loc{name:'F'}),
       (a)-[:ROAD {cost:50}]->(b),
       (a)-[:ROAD {cost:50}]->(c),
       (a)-[:ROAD {cost:100}]->(d),
       (a)-[:RAIL {cost:50}]->(d),
       (b)-[:ROAD {cost:40}]->(d),
       (c)-[:ROAD {cost:40}]->(d),
       (c)-[:ROAD {cost:80}]->(e),
       (d)-[:ROAD {cost:30}]->(e),
       (d)-[:ROAD {cost:80}]->(f),
       (e)-[:ROAD {cost:40}]->(f),
       (e)-[:RAIL {cost:20}]->(f);
Running algorithm and streaming results
CALL algo.allShortestPaths.stream('cost',{nodeQuery:'Loc',defaultValue:1.0})
YIELD sourceNodeId, targetNodeId, distance
WITH sourceNodeId, targetNodeId, distance
WHERE algo.isFinite(distance) = true
RETURN sourceNodeId, targetNodeId, distance LIMIT 20

For now only single source shortest path supports loading the relationships as undirected, but we can use cypher loading to work around this. An undirected graph can be represented as a bidirected graph, that is, a directed graph in which the reverse of every relationship is also a relationship.

We do not have to store these reversed relationships; we can project them using cypher loading. Note that the relationship query does not specify a direction on the relationship. This applies to all other algorithms that use cypher loading.

Running all pairs shortest path treating the graph as undirected
CALL algo.allShortestPaths.stream('cost', {
nodeQuery:'MATCH (n:Loc) RETURN id(n) as id',
relationshipQuery:'MATCH (n:Loc)-[r]-(p:Loc) RETURN id(n) as source, id(p) as target, r.cost as weight',
graph:'cypher', defaultValue:1.0})

Implementations

algo.allShortestPaths.stream

  • find shortest paths between all pairs of nodes

  • returns a stream of source-target node to distance tuples for each pair of nodes

  • writeback not supported

  • if initialized with a non-existing weight-property it will treat the graph as unweighted

Community detection: Triangle Counting / Clustering Coefficient

Triangle counting is a community detection graph algorithm used to determine the number of triangles passing through each node in the graph. A triangle is a set of three nodes where each node has a relationship to the other two.

History, Explanation

Triangle counting gained popularity in social network analysis where it is used to detect communities and measure the cohesiveness of those communities. It is often used as part of the computation of network indices such as the clustering coefficient.

There are two types of clustering coefficient:

  • The local clustering coefficient of a node is the likelihood that its neighbours are also connected. The computation of this score involves triangle counting.

  • The global clustering coefficient is the normalized sum of those local clustering coefficients.

The transitivity coefficient of a graph, which is simply three times the number of triangles divided by the number of connected triples in the graph, is also sometimes used. [2]
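
As a minimal sketch (assuming the local coefficients have already been written back to a coefficient node property, as in the write-back call shown below), the global coefficient can be computed by averaging the local scores:

Computing the global clustering coefficient from the local scores
MATCH (p:Person)
WHERE exists(p.coefficient)
RETURN avg(p.coefficient) AS globalClusteringCoefficient;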

When to use it / use-cases

Triangle counting has gained popularity in many other domains including:

  • computer network security - intrusion detection and spamming

  • web applications - community detection and blog analysis

  • social network analysis - link prediction, spam filters, content quality control [5]

  • biological networks - protein-protein interaction analysis [7]

Algorithm explanation on simple sample graph

Figure: triangle count sample graph
Create sample graph
CREATE (alice:Person{id:"Alice"}),
       (michael:Person{id:"Michael"}),
       (karin:Person{id:"Karin"}),
       (chris:Person{id:"Chris"}),
       (will:Person{id:"Will"}),
       (mark:Person{id:"Mark"})
CREATE (michael)-[:KNOWS]->(karin),
       (michael)-[:KNOWS]->(chris),
       (will)-[:KNOWS]->(michael),
       (mark)-[:KNOWS]->(michael),
       (mark)-[:KNOWS]->(will),
       (alice)-[:KNOWS]->(michael),
       (will)-[:KNOWS]->(chris),
       (chris)-[:KNOWS]->(karin);
returns a stream of triples with nodeIds for each triangle.
CALL algo.triangle.stream('Person','KNOWS')
YIELD nodeA, nodeB, nodeC;
counts the number of triangles each node is a member of and writes it back; returns the total triangle count and average clustering coefficient of the given graph.
CALL algo.triangleCount('Person', 'KNOWS',
  {concurrency:4, write:true, writeProperty:'triangles',clusteringCoefficientProperty:'coefficient'})
YIELD loadMillis, computeMillis, writeMillis, nodeCount, triangleCount, averageClusteringCoefficient;
counts the number of triangles each node is a member of and returns a stream with nodeId and triangleCount
CALL algo.triangleCount.stream('Person', 'KNOWS', {concurrency:4})
YIELD nodeId, triangles;

Example Usage

In graph theory, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. Evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterised by a relatively high density of ties; this likelihood tends to be greater than the average probability of a tie randomly established between two nodes.[4]

We check if this holds true for Yelp’s social network of friends.

CALL algo.triangleCount('User', 'FRIEND',
  {concurrency:4, write:true, writeProperty:'triangles',clusteringCoefficientProperty:'coefficient'})
YIELD loadMillis, computeMillis, writeMillis, nodeCount, triangleCount, averageClusteringCoefficient;

The average clustering coefficient is 0.0523, which is really low for a social network. This indicates that groups of friends are not tightly knit together but rather sparse. We can assume that users are not on Yelp to find and make friends, as they might be on Facebook for example, but rather for something else, like finding good restaurant recommendations.

Local triangle count and clustering coefficient of nodes can be used as features in finding influencers in social networks.
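
For instance, an illustrative query (assuming the triangles and coefficient properties were written back as in the Yelp call above) that ranks candidate influencers:

Ranking users by triangle count and clustering coefficient
MATCH (u:User)
WHERE exists(u.triangles)
RETURN id(u) AS userId, u.triangles AS triangles, u.coefficient AS coefficient
ORDER BY u.triangles DESC, u.coefficient DESC
LIMIT 10;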

Syntax

returns a stream of triples with nodeIds for each triangle.
CALL algo.triangle.stream(label:String, relationship:String, {concurrency:4})
YIELD nodeA, nodeB, nodeC - yield nodeA, nodeB and nodeC which form a triangle
Table 33. Parameters
name         | type   | default        | optional | description
-------------+--------+----------------+----------+------------
label        | string | null           | yes      | label to load from the graph, if null load all nodes
relationship | string | null           | yes      | relationship-type to load from the graph, if null load all relationships
concurrency  | int    | available CPUs | yes      | number of concurrent threads

Table 34. Results
name  | type | description
------+------+------------
nodeA | int  | id of node in the given triangle
nodeB | int  | id of node in the given triangle
nodeC | int  | id of node in the given triangle

counts the number of triangles each node is a member of and returns a stream with nodeId and triangleCount
CALL algo.triangleCount.stream(label:String, relationship:String, {concurrency:4})
YIELD nodeId, triangles - yield nodeId, number of triangles
Table 35. Parameters
name         | type   | default        | optional | description
-------------+--------+----------------+----------+------------
label        | string | null           | yes      | label to load from the graph. If null load all nodes
relationship | string | null           | yes      | relationship-type to load from the graph. If null load all relationships
concurrency  | int    | available CPUs | yes      | number of concurrent threads

Table 36. Results
name      | type | description
----------+------+------------
nodeId    | int  | id of node
triangles | int  | number of triangles a node is member of

counts the number of triangles each node is a member of and writes it back; returns the total triangle count and average clustering coefficient of the given graph.
CALL algo.triangleCount(label:String, relationship:String,
{concurrency:4, write:true, writeProperty:'triangles', clusteringCoefficientProperty:'coefficient'})
YIELD loadMillis, computeMillis, writeMillis, nodeCount, triangleCount, averageClusteringCoefficient
Table 37. Parameters
name                          | type    | default        | optional | description
------------------------------+---------+----------------+----------+------------
label                         | string  | null           | yes      | label to load from the graph. If null load all nodes
relationship                  | string  | null           | yes      | relationship-type to load from the graph. If null load all relationships
concurrency                   | int     | available CPUs | yes      | number of concurrent threads
write                         | boolean | true           | yes      | if result should be written back as node property
writeProperty                 | string  | 'triangles'    | yes      | property name the number of triangles a node is member of is written to
clusteringCoefficientProperty | string  | 'coefficient'  | yes      | property name the clustering coefficient of the node is written to

Table 38. Results
name                         | type  | description
-----------------------------+-------+------------
nodeCount                    | int   | number of nodes considered
loadMillis                   | int   | milliseconds for loading data
computeMillis                | int   | milliseconds for running the algorithm
writeMillis                  | int   | milliseconds for writing result data back
triangleCount                | int   | number of triangles in the given graph
averageClusteringCoefficient | float | average clustering coefficient of the given graph

Cypher projection

If label and relationship-type are not selective enough to describe the subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph. This can also be used to run algorithms on a virtual graph. Set graph:'cypher' in the config.

CALL algo.triangleCount(
  'MATCH (p:Person) RETURN id(p) as id',
  'MATCH (p1:Person)-[:KNOWS]->(p2:Person) RETURN id(p1) as source,id(p2) as target',
  {concurrency:4, write:true, writeProperty:'triangle',graph:'cypher', clusteringCoefficientProperty:'coefficient'})
YIELD loadMillis, computeMillis, writeMillis, nodeCount, triangleCount, averageClusteringCoefficient

Versions

We support the following versions of the triangle count algorithms:

  • ✓ undirected, unweighted

Community detection: Label Propagation

Label Propagation (LPA) is an algorithm for finding communities in near linear time [1]. It detects these communities using network structure alone as its guide and requires neither a pre-defined objective function nor prior information about the communities.

Community detection is an important methodology for understanding the organization of various real-world networks and has applications in problems as diverse as consensus formation in social communities or the identification of functional modules in biochemical networks.

History, Explanation

Many community detection algorithms have been proposed and utilized with varying degrees of effectiveness in community detection research.

The idea of flooding a network with labels originated with the L-shell method proposed by Bagrow and Bollt (2005). [2] The main idea behind LPA is to propagate the labels of nodes throughout the network and form communities through this process of label propagation. The intuition is that a single label can quickly become dominant in a densely connected group of nodes, whereas it has trouble crossing a sparsely connected region; labels get trapped inside a densely connected group of nodes. Based on this idea of local trapping of labels, the authors proposed a local community detection method where a node is initialized with a label which propagates step by step through the neighbours until the end of the community is reached.

LPA, first proposed by Raghavan et al. (2007) [3], uses unique node identifiers as labels and propagates the labels based on agreement with the majority of the neighbouring nodes.

The algorithm works as follows:

  • Every node is initialized with a unique label (an identifier)

  • These labels propagate through the network

  • At every iteration of propagation, each node updates its label to the one that the maximum number of its neighbours belong to. Ties are broken uniformly at random.

  • LPA reaches convergence when each node has the majority label of its neighbours.

As labels propagate, densely connected groups of nodes quickly reach a consensus on a unique label. At the end of the propagation only a few labels will remain - most will have disappeared. Nodes that have the same label at convergence are said to belong to the same community.

When to use it / use-cases

Community structure is considered one of the most interesting features in complex networks. Many real-world complex systems exhibit community structure, where individuals with similar properties form a community (partition). The identification of partitions in a network is important for understanding its structure.

There are many algorithms that identify community structures in large-scale real-world networks, and a lot of them require prior information about the number and size of communities. In complex networks the size of communities is very difficult to estimate; with social networks, which are often huge and heterogeneous in nature, it is impossible to predict the number of initial communities.

Algorithms that are able to detect communities from complex networks without having any knowledge about the number of communities or the size of communities are essential. LPA is such an algorithm and requires neither a pre-defined objective function nor prior information about the communities.

Constraints / when not to use it

In contrast with other algorithms, label propagation can result in different community structures when started from the same initial condition. The range of solutions can be narrowed if some nodes are given preliminary labels while others are left unlabelled; unlabelled nodes will then be more likely to adapt to the labelled ones. For more accurate community finding, Jaccard’s index is used to aggregate multiple community structures, containing all important information. [3]

Algorithm explanation on simple sample graph

Figure: label propagation sample graph
Create sample graph
CREATE (nAlice:User {id:'Alice',predefined_label:52})
,(nBridget:User {id:'Bridget',predefined_label:21})
,(nCharles:User {id:'Charles',predefined_label:43})
,(nDoug:User {id:'Doug',predefined_label:21})
,(nMark:User {id:'Mark',predefined_label: 19})
,(nMichael:User {id:'Michael', predefined_label: 52})
CREATE (nAlice)-[:FOLLOW]->(nBridget)
,(nAlice)-[:FOLLOW]->(nCharles)
,(nMark)-[:FOLLOW]->(nDoug)
,(nBridget)-[:FOLLOW]->(nMichael)
,(nDoug)-[:FOLLOW]->(nMark)
,(nMichael)-[:FOLLOW]->(nAlice)
,(nAlice)-[:FOLLOW]->(nMichael)
,(nBridget)-[:FOLLOW]->(nAlice)
,(nMichael)-[:FOLLOW]->(nBridget)
,(nCharles)-[:FOLLOW]->(nDoug);
Running algorithm and writing back results
CALL algo.labelPropagation('User', 'FOLLOW','OUTGOING',
  {iterations:10,partitionProperty:'partition', write:true})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, write, partitionProperty;
Table 39. Results
name    | partition
--------+----------
Alice   | 5
Charles | 4
Bridget | 5
Michael | 5
Doug    | 4
Mark    | 4

Our algorithm found two communities with 3 members each.
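
We can verify this with a plain Cypher query over the written-back partition property:

Listing the detected communities
MATCH (u:User)
RETURN u.partition AS partition, collect(u.id) AS members
ORDER BY partition;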

Using pre-defined labels

At the beginning of the algorithm, every node is initialized with a unique label (an identifier), and the labels propagate through the network.

It is possible to define preliminary labels (identifiers) of nodes using the partition parameter. We need to save the preliminary set of labels we would like to run the LPA algorithm with as a node property (it must be a number). In our example graph we saved them as the property predefined_label. The algorithm first checks whether a predefined label is assigned to the node, and loads it if there is one; otherwise it assigns the node a new unique label (node ids are used). Using this preliminary set of labels (identifiers), it then sequentially updates each node's label to a new one: the most frequent label among its neighbours at each label propagation step (iteration).

This can be used as a semi-supervised way of finding communities, where we hand-pick some initial communities.

Running algorithm with pre-defined labels
CALL algo.labelPropagation('User', 'FOLLOW','OUTGOING',
  {iterations:10,partitionProperty:'predefined_label', write:true})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, write, partitionProperty;

Syntax

Running algorithm and writing back results
CALL algo.labelPropagation(label:String, relationship:String, direction:String, {iterations:1,
weightProperty:'weight', partitionProperty:'partition', write:true, concurrency:4})
YIELD nodes, iterations, didConverge, loadMillis, computeMillis, writeMillis, write, weightProperty,
partitionProperty - simple label propagation kernel
Table 40. Parameters
name              | type    | default        | optional | description
------------------+---------+----------------+----------+------------
label             | string  | null           | yes      | label to load from the graph. If null load all nodes
relationship      | string  | null           | yes      | relationship-type to load from the graph. If null load all relationships
direction         | string  | 'OUTGOING'     | yes      | relationship-direction to use in the algorithm
concurrency       | int     | available CPUs | yes      | number of concurrent threads
iterations        | int     | 1              | yes      | the maximum number of iterations to run
weightProperty    | string  | 'weight'       | yes      | property name of node and/or relationship that contains weight. Must be numeric.
partitionProperty | string  | 'partition'    | yes      | property name the partition a node resides in is written back to; can also be used to define the initial set of labels (must be a number)
write             | boolean | true           | yes      | if result should be written back as node property
graph             | string  | 'heavy'        | yes      | use 'heavy' when describing the subset of the graph with label and relationship-type parameter, 'cypher' for describing the subset with cypher node-statement and relationship-statement

Table 41. Results
name              | type    | description
------------------+---------+------------
nodes             | int     | number of nodes considered
iterations        | int     | number of iterations that were executed
didConverge       | boolean | true if the algorithm did converge to a stable labelling within the provided number of maximum iterations
loadMillis        | int     | milliseconds for loading data
computeMillis     | int     | milliseconds for running the algorithm
writeMillis       | int     | milliseconds for writing result data back
weightProperty    | string  | property name that contains weight
partitionProperty | string  | property name written back to
write             | boolean | whether the result was written back

Cypher projection

If label and relationship-type are not selective enough to describe the subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph. This can also be used to run algorithms on a virtual graph. Set graph:'cypher' in the config.

CALL algo.labelPropagation(
  'MATCH (p:User) RETURN id(p) as id, p.weight as weight, id(p) as value',
  'MATCH (p1:User)-[f:FRIEND]->(p2:User)
   RETURN id(p1) as source, id(p2) as target, f.weight as weight',
  "OUT",
  {graph:'cypher',write:true});

Versions

We support the following versions of the label propagation algorithms:

  • ✓ directed, unweighted:

    • direction: 'INCOMING' or 'OUTGOING', weightProperty: null

  • ✓ directed, weighted

    • direction: 'INCOMING' or 'OUTGOING', weightProperty : 'weight'

  • ✓ undirected, unweighted

    • direction: 'BOTH', weightProperty: null

  • ✓ undirected, weighted

    • direction: 'BOTH', weightProperty: 'weight'

Community detection: Louvain

The Louvain method of community detection is an algorithm for detecting communities in networks that relies upon a heuristic for maximizing the modularity. Communities are groups of nodes within a network that are more densely connected to one another than to other nodes. A typical heuristic would be modularity, which quantifies the quality of an assignment of nodes to communities by evaluating how much more densely connected the nodes within a community are compared to how connected they would be in a random network.

The method consists of repeated application of two steps. The first step is a "greedy" assignment of nodes to communities, favoring local optimizations of modularity. The second step is the definition of a new coarse-grained network based on the communities found in the first step. These two steps are repeated until no further modularity-increasing reassignments of communities are possible.

The Louvain method achieves modularities comparable to pre-existing algorithms, typically in less time, so it enables the study of much larger networks. It also reveals a hierarchy of communities at different scales, and this hierarchical perspective can be useful for understanding the global functioning of a network. While there are pitfalls to interpreting the community structure uncovered by the Louvain Method, these difficulties are shared by all modularity optimization algorithms.[2]

Link to the original paper[1]

History, Explanation

A fairly common feature of complex networks is that they consist of sets of nodes that interact more with one another than with nodes outside the set. Social networks, for instance, might consist of tightly knit communities of friends with rarer friendship ties between different communities. In protein interaction networks, certain groups of proteins interact with one another more frequently than they do with proteins in other groups. If you map out a network of projects going on in a large company, certain projects will likely have more conceptual overlap and mutual dependency than others.

In 1962, H.A. Simon proposed that this type of community structure might be a defining characteristic of complex systems, or at least those like the protein interaction network, in which many interacting constituent elements adaptively organize to achieve some higher-order function (e.g., the functioning of an organism). His reasoning was that individual actors have a much higher chance of collectively achieving a higher-order function if that function can be iteratively achieved by constructing intermediate stable forms (also called communities or modules) that achieve simpler functions. The first-order intermediate forms would be communities in terms of the original nodes, but then interactions between these first-order communities could generate second-order communities that accomplish somewhat more complicated functions, and so on. In this way, a hierarchical structure can emerge in complex adaptive systems. However, even when there isn’t necessarily adaptive pressure towards achieving some higher-order function (as in the case of some social networks), community structure is a common observed feature of complex networks.

The algorithm is initialized with each node in its own community.

In the first stage we iterate through each of the nodes in the network. For each node, we consider the change in modularity if we remove the node from its current community and place it in the community of one of its neighbors. We compute the modularity change for each of the node’s neighbors. If none of these modularity changes are positive, we keep the node in its current community. If some of the modularity changes are positive, we move the node into the community where the modularity change is most positive. Ties are resolved arbitrarily. We repeat this process for each node until one pass through all nodes yields no community assignment changes.

The second stage in the Louvain Method uses the communities that were discovered in the community reassignment stage to define a new coarse-grained network. In this network the newly discovered communities are the nodes. The edge weight between the nodes representing two communities is the sum of the edge weights between the constituent, lower-level nodes of each community. The links within each community generate self-loops in the new, coarse-grained network.

The rest of the Louvain Method consists of repeated application of stages 1 and 2. By applying stage 1 (the community reassignment phase) to the coarse-grained graph, we find a second tier of communities of communities of nodes. Then, in the next application of stage 2, we define a new coarse-grained graph at this higher-level of the hierarchy. We keep going like this until an application of stage 1 yields no reassignments. At that point repeated application of stages 1 and 2 will not yield any more modularity-optimizing changes, so the process is complete.[2]

When to use it / use-cases

One of the applications reported in the original Louvain Method paper was a study of a large Belgian phone call network in which nodes represented customers and weighted links represented the number of phone calls between two customers over a six-month period. The Louvain Method revealed a hierarchy of six levels of communities. At the top level of this hierarchy, the communities representing more than 10,000 customers were strongly segregated by primary language. All except one of these communities had an 85% or greater majority of either French or Dutch speakers. The sole community with a more equitable distribution was positioned at the interface between French and Dutch clusters in the top-level coarse-grained network.

Since 2008 the Louvain Method has found a wide range of applications in analyzing real-world networks. Several of these can be found on the website of the method:

  • analysis of online social networks such as Twitter, LinkedIn, Flickr, Youtube, and LiveJournal

  • analysis of collaboration communities in citation networks

  • analysis of a network of retail transactions

  • study of brain networks using the Louvain Method [4]

Constraints / when not to use it

Although the Louvain Method, and modularity optimization algorithms more generally, have found wide application across many domains, some problems with these algorithms have been identified.

The resolution limit:

For larger networks the Louvain Method doesn’t stop with the "intuitive" communities. Instead there’s a second pass through the community modification and coarse-graining stages, in which several of the intuitive communities are merged together. This is a general problem with modularity optimization algorithms - they have trouble detecting small communities in large networks. It’s a virtue of the Louvain Method that something close to the intuitive community structure is available as an intermediate step in the process.

The degeneracy problem:

There are typically an exponentially large (in network size) number of community assignments with modularities close to the maximum. This can be a severe problem because, in the presence of a large number of high-modularity solutions, it’s hard to find the global maximum and difficult to determine whether the global maximum is truly more scientifically important than local maxima that achieve similar modularity. Research undertaken at the Université Catholique de Louvain showed that different locally optimal community assignments can have quite different structural properties. [3]

Algorithm explanation on simple sample graph

Figure: Louvain sample graph
Create sample graph
CREATE (nAlice:User {id:'Alice'})
,(nBridget:User {id:'Bridget'})
,(nCharles:User {id:'Charles'})
,(nDoug:User {id:'Doug'})
,(nMark:User {id:'Mark'})
,(nMichael:User {id:'Michael'})
CREATE (nAlice)-[:FRIEND]->(nBridget)
,(nAlice)-[:FRIEND]->(nCharles)
,(nMark)-[:FRIEND]->(nDoug)
,(nBridget)-[:FRIEND]->(nMichael)
,(nCharles)-[:FRIEND]->(nMark)
,(nAlice)-[:FRIEND]->(nMichael)
,(nCharles)-[:FRIEND]->(nDoug);
Running algorithm and streaming results
CALL algo.louvain.stream('User', 'FRIEND', {})
YIELD nodeId, community
RETURN nodeId, community LIMIT 20;
Running algorithm and writing back results
CALL algo.louvain('User', 'FRIEND',
  {write:true, writeProperty:'community'})
YIELD nodes, communityCount, iterations, loadMillis, computeMillis, writeMillis;
Table 42. Results
name    | community
--------+----------
Alice   | 5
Bridget | 5
Michael | 5
Charles | 4
Doug    | 4
Mark    | 4
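
The written-back community property can then be aggregated with plain Cypher, for example to list community sizes:

Listing community sizes
MATCH (u:User)
RETURN u.community AS community, count(*) AS size
ORDER BY size DESC;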

Syntax

Running algorithm and writing back results
CALL algo.louvain(label:String, relationship:String,
  {weightProperty:'weight', defaultValue:1.0, write: true,
   writeProperty:'community', concurrency:4})
YIELD nodes, communityCount, iterations, loadMillis, computeMillis, writeMillis
Table 43. Parameters
name           | type    | default        | optional | description
---------------+---------+----------------+----------+------------
label          | string  | null           | yes      | label to load from the graph. If null load all nodes
relationship   | string  | null           | yes      | relationship-type to load from the graph. If null load all relationships
weightProperty | string  | null           | yes      | property name that contains weight. If null treats the graph as unweighted. Must be numeric.
write          | boolean | true           | yes      | if result should be written back as node property
writeProperty  | string  | 'community'    | yes      | property name the id of the community a node belongs to is written back to
defaultValue   | float   | null           | yes      | default value of the weight in case it is missing or invalid
concurrency    | int     | available CPUs | yes      | number of concurrent threads
graph          | string  | 'heavy'        | yes      | use 'heavy' when describing the subset of the graph with label and relationship-type parameter, 'cypher' for describing the subset with cypher node-statement and relationship-statement

Table 44. Results
name           | type | description
---------------+------+------------
nodes          | int  | number of nodes considered
communityCount | int  | number of communities found
iterations     | int  | number of iterations run
loadMillis     | int  | milliseconds for loading data
computeMillis  | int  | milliseconds for running the algorithm
writeMillis    | int  | milliseconds for writing result data back

Running algorithm and streaming results
CALL algo.louvain.stream(label:String, relationship:String,
  {weightProperty:'propertyName', defaultValue:1.0, concurrency:4})
YIELD nodeId, community - yields the community id for each node
Table 45. Parameters
name | type | default | optional | description
label | string | null | yes | label to load from the graph. If null load all nodes
relationship | string | null | yes | relationship-type to load from the graph. If null load all relationships
weightProperty | string | null | yes | property name that contains weight. If null treats the graph as unweighted. Must be numeric.
defaultValue | float | 1.0 | yes | default value of the weight if it is missing or invalid
graph | string | 'heavy' | yes | use 'heavy' when describing the subset of the graph with label and relationship-type parameter, 'cypher' for describing the subset with cypher node-statement and relationship-statement

Table 46. Results

name | type | description
nodeId | int | node id
community | int | community id

Cypher projection

If label and relationship-type are not selective enough to describe the subgraph you want to run the algorithm on, you can use Cypher statements to load or project subsets of your graph. This can also be used to run algorithms on a virtual graph. Set graph:'cypher' in the config.

CALL algo.louvain(
  'MATCH (p:User) RETURN id(p) as id',
  'MATCH (p1:User)-[f:FRIEND]-(p2:User)
   RETURN id(p1) as source, id(p2) as target, f.weight as weight',
  {graph:'cypher',write:true});

Versions

  • ✓ undirected, unweighted

    • weightProperty: null

  • ✓ undirected, weighted

    • weightProperty : 'weight'

Community detection: Connected Components

Connected Components (also called UnionFind, after the data structure used to compute it) finds sets of connected nodes where each node is reachable from any other node in the same set. In graph theory, a connected component of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the graph.

History, Explanation

The Connected Components of a graph represent, in general terms, the pieces of the graph. Two vertices are in the same component of a graph if and only if there is some path between them.

Finding connected components is at the heart of many graph applications. For example, consider the problem of identifying partitions in a set of items. We can represent each item by a node and add an edge between each pair of items that are deemed similar. The connected components of this graph correspond to different classes of items.

When to use it / use-cases

Testing whether a graph is connected is an essential preprocessing step for every graph algorithm. Such tests can be performed so quickly and easily that you should always verify that your input graph is connected, even when you know it has to be. Subtle, difficult-to-detect bugs often result when your algorithm is run only on one component of a disconnected graph.

Connected components have other practical use cases, for example, if we are analysing a social network and we want to find all the disconnected groups of people that exist in our graph.

Algorithm explanation on simple sample graph

Recall that an undirected graph is connected if for every pair of vertices there is a path in the graph between those vertices. A connected component of an undirected graph is a maximal connected subgraph of the graph. This means that the direction of the relationships has no influence, as we treat the graph as undirected. We have two implementations of the connected components algorithm. The first treats the graph as unweighted; the second is weighted, where you can define a threshold on the relationship weight below which relationships are ignored.

Create sample graph
CREATE (nAlice:User {id:'Alice'})
,(nBridget:User {id:'Bridget'})
,(nCharles:User {id:'Charles'})
,(nDoug:User {id:'Doug'})
,(nMark:User {id:'Mark'})
,(nMichael:User {id:'Michael'})
CREATE (nAlice)-[:FRIEND{weight:0.5}]->(nBridget)
,(nAlice)-[:FRIEND{weight:4}]->(nCharles)
,(nMark)-[:FRIEND{weight:1}]->(nDoug)
,(nMark)-[:FRIEND{weight:2}]->(nMichael);
Unweighted version:
Running algorithm and streaming results
CALL algo.unionFind.stream('User', 'FRIEND', {})
YIELD nodeId,setId
RETURN nodeId,setId LIMIT 20;
Running algorithm and writing back results
CALL algo.unionFind('User', 'FRIEND', {write:true, partitionProperty:"partition"})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis;
Table 47. Results
name | partition
Alice | 0
Charles | 0
Bridget | 0
Michael | 4
Doug | 4
Mark | 4

The results show that we have two distinct groups of users with no link between them. The first group contains Alice, Charles and Bridget, the second Michael, Doug and Mark.

We can also easily check the number and size of the partitions using Cypher.
MATCH (u:User)
RETURN u.partition as partition,count(*) as size_of_partition
ORDER by size_of_partition DESC LIMIT 20;
Weighted version:

If you specify the property that holds the weight (weightProperty) and a threshold, two nodes are only considered connected if the weight of the relationship between them exceeds the threshold; otherwise the relationship is ignored.

Running algorithm and streaming results
CALL algo.unionFind.stream('User', 'FRIEND', {weightProperty:'weight', defaultValue:0.0, threshold:1.0})
YIELD nodeId,setId;
Running algorithm and writing back results
CALL algo.unionFind('User', 'FRIEND', {write:true, partitionProperty:"partition",weightProperty:'weight', defaultValue:0.0, threshold:1.0})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis;
Table 48. Results
name | partition
Alice | 0
Charles | 0
Bridget | 1
Michael | 4
Doug | 4
Mark | 4

We can observe that, because the weight of the relationship between Bridget and Alice is only 0.5 and thus below the threshold of 1.0, the relationship was ignored and treated as non-existent.

Example Usage

As mentioned, finding connected components is an essential step in preprocessing your data. One reason is that most centrality algorithms suffer on disconnected components; another is that you may simply want to find the disconnected groups of nodes. Yelp's social network will be used to demonstrate how to proceed when dealing with real-world data. A typical social network consists of one big component and a number of small disconnected components.

Get the count of connected components
CALL algo.unionFind.stream('User', 'FRIEND', {})
YIELD nodeId,setId
RETURN count(distinct setId) as count_of_components;

We get back a count of 18,512 disconnected components if we do not count users without friends. Let's now check the size of the top 20 components to get a better picture.

Get the size of top 20 components
CALL algo.unionFind.stream('User', 'FRIEND', {})
YIELD nodeId,setId
RETURN setId,count(*) as size_of_component
ORDER BY size_of_component DESC LIMIT 20;

The biggest component contains 8,938,630 of the total 8,981,389 users (99.5%). That share is high, but not surprising: in a friendship network we expect the small-world effect and the six-degrees-of-separation rule, by which almost any person is reachable from any other via a short path.

We can now move on to the next step of the analysis and run centralities only on the biggest component, so that our results will be more accurate. We simply write the component ids back to the nodes and then either use centralities with Cypher loading or set a new label on the nodes of the biggest component.

Syntax

Running algorithm and writing back results
CALL algo.unionFind(label:String, relationship:String, {threshold:0.42,
defaultValue:1.0, write: true, partitionProperty:'partition',weightProperty:'weight',graph:'heavy',concurrency:4})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis
- finds connected partitions and optionally writes the partition id back to each node as a property.
Table 49. Parameters
name | type | default | optional | description
label | string | null | yes | label to load from the graph. If null load all nodes
relationship | string | null | yes | relationship-type to load from the graph. If null load all relationships
weightProperty | string | null | yes | property name that contains weight, if null treats the graph as unweighted. Must be numeric.
write | boolean | true | yes | if result should be written back as node property
partitionProperty | string | 'partition' | yes | property name the id of the partition a particular node belongs to is written back to
threshold | float | null | yes | threshold on the relationship weight; relationships with a weight below this value are ignored
defaultValue | float | null | yes | default value of the weight in case it is missing or invalid
concurrency | int | available CPUs | yes | number of concurrent threads
graph | string | 'heavy' | yes | use 'heavy' when describing the subset of the graph with label and relationship-type parameter, 'cypher' for describing the subset with cypher node-statement and relationship-statement

Table 50. Results
name | type | description
nodes | int | number of nodes considered
setCount | int | number of partitions found
loadMillis | int | milliseconds for loading data
computeMillis | int | milliseconds for running the algorithm
writeMillis | int | milliseconds for writing result data back

Running algorithm and streaming results
CALL algo.unionFind.stream(label:String, relationship:String, {weightProperty:'weight', threshold:0.42, defaultValue:1.0, concurrency:4})
YIELD nodeId, setId - yields the setId for each node
Table 51. Parameters
name | type | default | optional | description
label | string | null | yes | label to load from the graph, if null load all nodes
relationship | string | null | yes | relationship-type to load from the graph, if null load all relationships
concurrency | int | available CPUs | yes | number of concurrent threads
weightProperty | string | null | yes | property name that contains weight, if null treats the graph as unweighted. Must be numeric.
threshold | float | null | yes | threshold on the relationship weight; relationships with a weight below this value are ignored
defaultValue | float | null | yes | default value of the weight in case it is missing or invalid
graph | string | 'heavy' | yes | use 'heavy' when describing the subset of the graph with label and relationship-type parameter, 'cypher' for describing the subset with cypher node-statement and relationship-statement

Table 52. Results

name | type | description
nodeId | int | node id
setId | int | partition id

Cypher projection

If label and relationship-type are not selective enough to describe the subgraph you want to run the algorithm on, you can use Cypher statements to load or project subsets of your graph. This can also be used to run algorithms on a virtual graph. Set graph:'cypher' in the config.

CALL algo.unionFind(
  'MATCH (p:User) RETURN id(p) as id',
  'MATCH (p1:User)-[f:FRIEND]->(p2:User)
   RETURN id(p1) as source, id(p2) as target, f.weight as weight',
  {graph:'cypher',write:true}
);

Implementations

algo.unionFind

  • if a threshold configuration parameter is supplied only relationships with a property value higher than the threshold are merged

algo.unionFind.queue

  • parallel UnionFind using ExecutorService only.

  • Algorithm based on the idea that DisjointSetStruct can be built using just a partition of the nodes which are then merged pairwise.

  • The implementation is based on a queue which acts as a buffer for each computed DSS. As long as there are more elements on the queue the algorithm takes two, merges them and adds its result to the queue until only 1 element remains.

algo.unionFind.forkJoinMerge

  • As in exp1, the resulting DSS of each node-partition is merged by the ForkJoin pool, while the calculation of the DSS is done by the ExecutorService.

algo.unionFind.forkJoin

  • calculation and merge using forkJoinPool

algo.unionFind.mscoloring

  • coloring based parallel algorithm

Community detection: Strongly Connected Components

SCC is a class of algorithms for finding groups of nodes where each node is directly reachable from every other node in the group. The strongly connected components of an arbitrary directed graph form a partition into subgraphs that are themselves strongly connected. There are several algorithms to compute the SCC.

History, Explanation

Decomposing a directed graph into its strongly connected components is a classic application of depth-first search. The problem of finding connected components is at the heart of many graph applications. Generally speaking, the connected components of the graph correspond to different classes of objects. The first linear-time algorithm for strongly connected components is due to Tarjan (1972).

When to use it / use-cases

SCC algorithms can be used as a first step in many graph algorithms that work only on strongly connected graphs. In social networks, a group of people is often strongly connected (for example, students of a class or members of any other community). Many people in such groups like the same pages or play the same games. SCC algorithms can be used to find such groups and suggest the commonly liked pages or games to the people in the group who have not yet liked or played them.

Algorithm explanation on simple sample graph


A directed graph is strongly connected if there is a path between all pairs of vertices. This algorithm treats the graph as directed, so the direction of the relationship matters; a strongly connected component of more than one node exists only if there are paths between its nodes in both directions.

Create sample graph
CREATE (nAlice:User {id:'Alice'})
,(nBridget:User {id:'Bridget'})
,(nCharles:User {id:'Charles'})
,(nDoug:User {id:'Doug'})
,(nMark:User {id:'Mark'})
,(nMichael:User {id:'Michael'})
CREATE (nAlice)-[:FOLLOW]->(nBridget)
,(nAlice)-[:FOLLOW]->(nCharles)
,(nMark)-[:FOLLOW]->(nDoug)
,(nMark)-[:FOLLOW]->(nMichael)
,(nBridget)-[:FOLLOW]->(nMichael)
,(nDoug)-[:FOLLOW]->(nMark)
,(nMichael)-[:FOLLOW]->(nAlice)
,(nAlice)-[:FOLLOW]->(nMichael)
,(nBridget)-[:FOLLOW]->(nAlice)
,(nMichael)-[:FOLLOW]->(nBridget);
Running algorithm and writing back results
CALL algo.scc('User','FOLLOW', {write:true,partitionProperty:'partition'})
YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize;
Table 53. Results
name | partition
Alice | 1
Bridget | 1
Michael | 1
Charles | 0
Doug | 2
Mark | 2

We can see that our sample graph has two strongly connected components of size greater than one. The biggest component contains Alice, Bridget and Michael, the second Doug and Mark; Charles ends up in a partition of his own.

Find the largest partition
MATCH (u:User)
RETURN u.partition as partition,count(*) as size_of_partition ORDER by size_of_partition DESC LIMIT 1

Syntax

Running algorithm and writing back results
CALL algo.scc(label:String, relationship:String,
{write:true,partitionProperty:'partition',concurrency:4, graph:'heavy'})
YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize

- finds strongly connected partitions and optionally writes the partition id back to each node as a property.
Table 54. Parameters
name | type | default | optional | description
label | string | null | yes | label to load from the graph. If null load all nodes
relationship | string | null | yes | relationship-type to load from the graph. If null load all relationships
write | boolean | true | yes | if result should be written back as node property
partitionProperty | string | 'partition' | yes | property name written back to
concurrency | int | available CPUs | yes | number of concurrent threads
graph | string | 'heavy' | yes | use 'heavy' when describing the subset of the graph with label and relationship-type parameter, 'cypher' for describing the subset with cypher node-statement and relationship-statement

Table 55. Results
name | type | description
setCount | int | number of partitions found
maxSetSize | int | number of members in biggest partition
minSetSize | int | number of members in smallest partition
loadMillis | int | milliseconds for loading data
computeMillis | int | milliseconds for running the algorithm
writeMillis | int | milliseconds for writing result data back

Running algorithm and streaming results
CALL algo.scc.stream(label:String, relationship:String, {concurrency:4})
YIELD nodeId, partition - yields the partition id for each node
Table 56. Parameters
name | type | default | optional | description
label | string | null | yes | label to load from the graph, if null load all nodes
relationship | string | null | yes | relationship-type to load from the graph, if null load all relationships
concurrency | int | available CPUs | yes | number of concurrent threads
graph | string | 'heavy' | yes | use 'heavy' when describing the subset of the graph with label and relationship-type parameter, 'cypher' for describing the subset with cypher node-statement and relationship-statement

Table 57. Results

name | type | description
nodeId | int | node id
partition | int | partition id

Cypher projection

If label and relationship-type are not selective enough to describe the subgraph you want to run the algorithm on, you can use Cypher statements to load or project subsets of your graph. This can also be used to run algorithms on a virtual graph. Set graph:'cypher' in the config.

CALL algo.scc(
  'MATCH (u:User) RETURN id(u) as id',
  'MATCH (u1:User)-[:FOLLOW]->(u2:User) RETURN id(u1) as source,id(u2) as target',
  {write:true,graph:'cypher'})
YIELD loadMillis, computeMillis, writeMillis;

Implementations

algo.scc

  • iterative adaptation (same as algo.scc.iterative)

algo.scc.recursive.tarjan

  • original recursive tarjan implementation

algo.scc.recursive.tunedTarjan

  • also a recursive tarjan implementation

algo.scc.iterative

  • iterative adaption of tarjan algorithm

algo.scc.multistep

  • parallel scc algorithm

Implementers Section

Algorithm Procedures API Discussion

We would like to discuss the design of the procedure API for algorithms. There are several concerns:

Naming

  • simple procedure names,

  • consistent spelling and casing

  • discriminate between write-back and result-streaming procedure with .stream suffix

  • default algorithm name should be the best implementation

  • other implementations can be provided via suffixes

Input Graph Data

  • load via label/rel-type selection and optional weight/values

  • load via cypher statements

  • the requirements for node-ids, degrees, weights, relationships etc. come from the algorithm

Algorithm parameters

  • start-nodes

  • iterations, damping factor

  • concurrency, batch-sizes

  • thresholds

  • write-back, write-back properties

Id-Mapping

  • original node-ids from the graph are mapped to a consecutive id-range to be used for the algorithm

  • this allows simple array offsets to be used within the computation

  • this also allows mapping of larger id-ranges to the computational range of the algorithm

Result parameters

  • output statistics: number of nodes and rels touched

  • time for loading, processing, and writing

  • property written to

  • errors

  • data summary, e.g.

    • min,max,avg of a centrality, rank or weight measure

    • output of all clusters, their sizes and perhaps the root node

  • alternatively streaming back node-ids and the computed value (score)

Constraints:

  1. there is no overloading so we need differing names

  2. there are default procedure parameters so we need to require only what’s really required to run

  3. we should use sensible defaults

  4. result values/columns have to be pre-declared, there is no inheritance or generics

Suggestions:

Naming: 2 variants per algorithm
  1. returning summary statistics after run: algo.pageRank(…​)

  2. returning a stream of data: algo.pageRank.stream(…​). The default should be the variant that returns statistics, as streaming results back from a medium-sized or large graph can overwhelm the client.

Parameters
  1. node-selector or null - label(s) or cypher statement: ":Person|User" / "MATCH (u:User) RETURN id(u) as id, u.name as value"

  2. relationship-selector or null - rel-type(s) or cypher statement ":KNOWS|FRIEND" / "MATCH (u1:User)-[r:KNOWS|FRIEND]-(u2:User) RETURN id(u1) as source, id(u2) as target, r.strength as weight"

  3. configuration map with optional parameters but sensible defaults: {iterations: 5, write:true, nodeValue:"value", nodeWeight:"weight", relationshipWeight:"weight", threshold: 0.85}

We could have the configuration map come from a function (per algorithm?) that handles validation and returns a correct map object for that algorithm? Or just pass the plain map?

Example:
call algo.pageRank("Person","KNOWS",{iterations:10, write:true, property:"pagerank"});

call algo.unionFind.stream('
MATCH (p:Person) RETURN id(p) as id
','
MATCH (p1:Person)-[:KNOWS]-(p2:Person) RETURN id(p1) as source, id(p2) as target
', {}) YIELD nodeId, clusterId

API Documentation

Development Notes

Note
When implementing a procedure that yields any results (is not void) and writes back to the database, make sure to consume the Result after running the algorithm call!

Algorithms are executed within a transaction that is opened and closed by Cypher. The transaction is only marked successful if the results are consumed. Closing the Result without consuming it will fail and roll back the open transaction, reverting all write-back operations.
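In a test or embedded setting this might look as follows; a minimal sketch, assuming a GraphDatabaseService db is at hand (the concrete procedure call is just an example):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Result;

// drain the Result so the surrounding transaction is marked successful
void runAndConsume(GraphDatabaseService db) {
    Result result = db.execute("CALL algo.unionFind('User', 'FRIEND', {write:true})");
    while (result.hasNext()) {
        result.next(); // consume every row; closing an unconsumed Result rolls back the write-back
    }
    result.close(); // safe to close now that all rows are consumed
}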

The Model

The basic idea behind our model is to have a fast cache for the topology of the graph containing only relevant nodes, relationships and, in addition, the weights. It implicitly maps (long) node-ids to an internal integer-id (32 bit) in ascending order, which ensures that no id gets bigger than the maximum node count. This approach allows us, for example, to use primitive arrays as containers.
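A simplified sketch of this id mapping (the real implementation uses primitive collections rather than a boxed HashMap, but the idea is the same):

import java.util.HashMap;
import java.util.Map;

// maps arbitrary long Neo4j node ids to consecutive ints in insertion order
class SimpleIdMap {
    private final Map<Long, Integer> forward = new HashMap<>();
    private final long[] backward;
    private int nextId = 0;

    SimpleIdMap(int nodeCount) {
        backward = new long[nodeCount];
    }

    int toMappedNodeId(long neoId) {
        Integer mapped = forward.get(neoId);
        if (mapped == null) {
            backward[nextId] = neoId;
            forward.put(neoId, nextId);
            mapped = nextId++;
        }
        return mapped; // always smaller than the node count, so usable as an array offset
    }

    long toOriginalNodeId(int mappedId) {
        return backward[mappedId];
    }
}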

Graph Interface

The Graph interface specifies methods for iterating over all nodes of the graph as well as iterating over relationships of a given node in the form of forEach…​(..)-methods. The Graph knows the nodeCount and degree of each node and can map nodeId to internalId and vice versa.

Method | Description
Collection<PrimitiveIntIterable> batchIterables(int batchSize); | returns a collection of iterables over every node, partitioned by the given batch size
boolean contains(long nodeId); | returns true iff the nodeId is mapped, otherwise false
int degree(int nodeId, Direction direction); | degree of the node in the given direction
void forEachNode(IntPredicate consumer); | iterates over each nodeId
void forEachRelationship(int nodeId, Direction direction, RelationshipConsumer consumer); | iterates over all relationships of a node
void forEachRelationship(int nodeId, Direction direction, WeightedRelationshipConsumer consumer); | iterates over all weighted relationships of a node
int nodeCount(); | count of nodes
PrimitiveIntIterator nodeIterator(); | iterates over all nodes
int toMappedNodeId(long nodeId); | maps a Neo4j nodeId to the internal nodeId
long toOriginalNodeId(int nodeId); | maps an internal nodeId back to the original nodeId

Note
Currently we have 3 different implementations aiming for different goals like performance, memory consumption and accuracy.

HeavyGraph

This implementation utilizes an int-matrix for storing connections between vertices. It has a higher memory consumption but performs basic calculations on the graph around 3 times faster than the memory-efficient implementation. Furthermore, the number of edges per node is limited only by the maximum array size (2bn) of the JVM. Relationships can be added in arbitrary order.

LightGraph

This implementation takes about 3 times less heap due to a more compact memory layout. The drawbacks are a slightly higher evaluation time and that the data cannot (yet) be loaded in parallel.

GraphView

The View is just a simple wrapper around the Neo4j Kernel API. It has been implemented for tests and as a baseline for benchmarks.

Import

A fair amount of the work is to fetch the relevant data from Neo4j and import it into our model. Fortunately we can use a multithreaded approach to load the graph; this is part of the current implementation.

The abstract GraphFactory specifies a constructor and the build method. It is responsible for creating the Graph using the Neo4j Kernel API.


Loading the Graph is done by using the GraphLoader which implements a simple fluent Builder Pattern.

final Graph graph = new GraphLoader( graphDatabaseAPI )
        .setLabel( .. )
        .setRelation( .. )
        .setProperty( .. )
        .setThreadPool( .. )
        .load( FactoryImpl.class );

PageRank

Implementation Details

PageRank is Google's well-known ranking algorithm.

Progress
  • ✓ single threaded implementation

  • ✓ tests

  • ✓ simple benchmark

  • ✓ implement procedure

  • ✓ benchmark on bigger graphs

  • ✓ parallelization

  • ✓ evaluation

Requirements
  • NodeIterator

  • Incoming Relationships

  • Outgoing Degrees

Data structures involved

Our current approach needs one double array for storing ranks.

ToDo
parallelization

One approach to parallelizing PageRank might be to partition the nodes into batches - one for each thread. Nonetheless we may need to synchronize the threads at the end of each iteration.

evaluation
  • Performance tests on different dataset sizes / level of concurrency

Future Improvements
  • we might scale up the ranks to ints for faster multiplication.

Details

Partition-based parallel PageRank, based on "An Efficient Partition-Based Parallel PageRank Algorithm" [1].

  • Each partition thread has its local array of only the nodes that it is responsible for, not for all nodes. Combined, all partitions hold all page rank scores for every node once. Instead of writing partition files and transferring them across the network (as done in the paper, since they were concerned with parallelising across multiple machines), we use integer arrays to write the results to. The actual score is upscaled from a double to an integer by multiplying it with 100_000.

  • To avoid contention by writing to a shared array, we partition the result array.

  • During execution, the scores arrays are shaped like this:

    [ executing partition ] -> [ calculated partition ] -> [ local page rank scores ]
  • Each single partition writes into a partitioned array, calculating the scores for every receiving partition. A single partition only sees:

    [ calculated partition ] -> [ local page rank scores ]
  • The coordinating thread then builds the transpose of all written partitions from every partition:

    [ calculated partition ] -> [ executing partition ] -> [ local page rank scores ]
  • This step does not happen in parallel, but does not involve extensive copying. The local page rank scores needn’t be copied, only the partitioning arrays. All in all, concurrency^2 array element reads and assignments have to be performed.

  • For the next iteration, every partition first updates its scores, in parallel. A single partition now sees:

    [ executing partition ] -> [ local page rank scores ]
  • That is, a list of all calculated scores for itself, grouped by the partition that calculated these scores. This means most of the synchronization happens in parallel, too.

  • Partitioning is not done by number of nodes but by the accumulated degree – as described in "Fast Parallel PageRank: A Linear System Approach" [2]. Every partition should have about the same number of relationships to operate on.

  • This is done to avoid having one partition with super nodes and instead have all partitions run in approximately equal time. Smaller partitions are merged down until we have at most concurrency partitions, in order to batch partitions and keep the number of threads in use predictable/configurable. A sketch of this degree-based partitioning follows below.
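A sketch of what such degree-based partitioning could look like (simplified, not the library's actual code):

import java.util.ArrayList;
import java.util.List;

// cut the (already id-mapped) nodes into consecutive ranges of roughly equal accumulated degree
static List<int[]> partitionByDegree(int[] degrees, int concurrency) {
    long totalDegree = 0;
    for (int degree : degrees) totalDegree += degree;
    long degreesPerPartition = Math.max(1, totalDegree / concurrency);

    List<int[]> partitions = new ArrayList<>();
    int start = 0;
    long accumulated = 0;
    for (int node = 0; node < degrees.length; node++) {
        accumulated += degrees[node];
        if (accumulated >= degreesPerPartition) {
            partitions.add(new int[]{start, node + 1}); // [from, to) range of node ids
            start = node + 1;
            accumulated = 0;
        }
    }
    if (start < degrees.length) {
        partitions.add(new int[]{start, degrees.length}); // remainder partition
    }
    return partitions;
}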

[1] "An Efficient Partition-Based Parallel PageRank Algorithm"
[2] "Fast Parallel PageRank: A Linear System Approach", https://www.cs.purdue.edu/homes/dgleich/

Betweenness Centrality

Implementation Details

In graph theory, betweenness centrality is a measure of centrality in a graph based on shortest paths. For every pair of nodes in a connected graph, there exists at least one shortest path between the nodes such that either the number of relationships that the path passes through (for unweighted graphs) or the sum of the weights of the relationships (for weighted graphs) is minimized. The betweenness centrality for each vertex is the number of these shortest paths that pass through the vertex.

Progress
  • ✓ adapt apoc-procedure to algorithm-api

  • ✓ implement procedure

  • ✓ tests

  • ✓ relationship case tests

  • ✓ simple benchmark

  • ✓ benchmark on bigger graphs

  • ✓ parallelization

  • ✓ evaluation

  • ✓ documentation

Details
algo.betweenness
  • implementation of brandes-bc algorithm and nodePartitioning extension

  • http://www.algo.uni-konstanz.de/publications/b-fabc-01.pdf

  • if concurrency parameter is set (and >1) ParallelBetweennessCentrality is used

  • ParallelBC spawns N (given by the concurrency parameter) concurrent threads for the calculation, where each one calculates the BC for one node at a time

algo.betweenness.exp1
algo.betweenness.sampled.*
  • Calculates betweenness-dependencies on a subset of pivot nodes (instead of all nodes). Up to now 2 randomization strategies are implemented which can be set using the optional argument strategy:

  • random selection (default): strategy:'random' (takes the optional argument probability:double(0-1); the default is log10(N) / e^2)

  • degree-based randomization: strategy:'degree' (makes dense nodes more likely to be selected)

  • optional Arguments: maxDepth:int

Closeness Centrality

Implementation Details

Closeness Centrality of a node is a measure of centrality in a network, calculated as the reciprocal of the sum of the lengths of the shortest paths between the node and all other nodes in the graph. Thus the more central a node is, the closer it is to all other nodes.

Details
  • use org.neo4j.graphalgo.impl.msbfs.MultiSourceBFS for BFS

  • MSBFS gives depth and number of sources.

  • in this scheme the farness can be calculated as follows

    farness(v) = farness(v) + numberOfSources(v) * depth(v)
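As a sketch of how this accumulation might look in code — the callback shape (nodeId, depth, numberOfSources) is an assumption for illustration, not the actual MultiSourceBFS API:

// accumulate farness from MS-BFS visits and derive closeness as its (normalized) inverse
class FarnessAccumulator {
    private final double[] farness;
    private final int nodeCount;

    FarnessAccumulator(int nodeCount) {
        this.nodeCount = nodeCount;
        this.farness = new double[nodeCount];
    }

    // hypothetical MS-BFS callback: node v was reached at this depth from numberOfSources sources
    void onVisit(int v, int depth, int numberOfSources) {
        farness[v] += (double) numberOfSources * depth;
    }

    double closeness(int v) {
        return farness[v] == 0 ? 0.0 : (nodeCount - 1) / farness[v];
    }
}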

Harmonic Centrality

Details

  • use org.neo4j.graphalgo.impl.msbfs.MultiSourceBFS for BFS

  • MSBFS gives depth and number of sources.

  • in this scheme the farness can be calculated as follows

    inverseFarness(v) = inverseFarness(v) + numberOfSources(v) * (1/depth(v))

Minimum Weight Spanning Tree

Implementation Details

A Minimum Weight Spanning Tree is an acyclic undirected graph which consists of all connected nodes and whose total relationship weight is minimal. It can be used to cluster the graph (e.g. k-means-style clustering). Our current approach uses Prim's algorithm to calculate the MST starting at a given node; this might not cover the whole graph. But if the nodes are connected, the MST is identical regardless of which node the execution starts at (assuming distinct relationship weights; with ties, several MSTs can exist).

Progress
  • ✓ single threaded implementation

  • ✓ tests

  • ✓ simple benchmark

  • ✓ implement procedure

  • ✓ benchmark on bigger graphs

  • [?] parallelization

  • ❏ evaluation

Requirements

BothRelationshipIterator & Weights

Data structures involved
  • org.neo4j.graphalgo.core.utils.container.UndirectedTree as a container for efficient splitting and iteration

  • An int-based Fibonacci Heap priority queue.

  • Container for visited state

ToDo
benchmark

Implement benchmark on big graph

evaluation
  • Performance tests on different dataset sizes / level of concurrency

Single Shortest Path

Implementation Details

A Single Source Shortest Path algorithm calculates a path between a pair of nodes whose summed weight is minimal. A commonly used algorithm is Dijkstra's. All Pairs Shortest Path, on the other hand, calculates a shortest path forest containing all shortest paths between the nodes in the graph. Algorithms to solve this include Floyd-Warshall and parallel Johnson's algorithm.

Progress
  • ✓ single threaded implementation

  • ✓ tests

  • ✓ simple benchmark

  • ✓ implement procedure

  • ❏ benchmark on bigger graphs

  • ❏ parallelization

  • ❏ evaluation

Requirements

(Outgoing)RelationshipIterator & Weights

Data structures involved
  • An int-based Fibonacci Heap which implements an efficient priority queue (a Dijkstra sketch using a priority queue follows below).

  • Different Container for Costs / visited state / paths
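A minimal single-source sketch, using java.util.PriorityQueue as a stand-in for the int-based Fibonacci heap and the Graph interface from the implementers section (the weighted consumer signature is again an assumption):

import java.util.Arrays;
import java.util.PriorityQueue;
import org.neo4j.graphdb.Direction;

// Dijkstra over the projected graph; returns the cost of reaching every node from startNode
static double[] dijkstra(Graph graph, int startNode) {
    int nodeCount = graph.nodeCount();
    double[] cost = new double[nodeCount];
    Arrays.fill(cost, Double.POSITIVE_INFINITY);
    cost[startNode] = 0.0;

    boolean[] settled = new boolean[nodeCount];
    // queue entries are {cost, nodeId}, ordered by cost
    PriorityQueue<double[]> queue = new PriorityQueue<>((a, b) -> Double.compare(a[0], b[0]));
    queue.add(new double[]{0.0, startNode});

    while (!queue.isEmpty()) {
        int node = (int) queue.poll()[1];
        if (settled[node]) continue; // skip stale queue entries
        settled[node] = true;
        graph.forEachRelationship(node, Direction.OUTGOING,
                (source, target, relationId, weight) -> {
                    if (cost[source] + weight < cost[target]) {
                        cost[target] = cost[source] + weight;
                        queue.add(new double[]{cost[target], target});
                    }
                    return true;
                });
    }
    return cost;
}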

ToDo
benchmark

Implement benchmark on big graph

parallelization

Parallelizing All Pairs Shortest Path might be easy using Dijkstra on each thread for a different source node. An easy approach for Single Source SP may use two threads, one starting at the start-node and one at the end-node; the first to finish wins. [More](https://www.cs.rice.edu/~vs3/comp422/lecture-notes/comp422-lec24-s08-v2.pdf)

evaluation
  • Performance tests on different dataset sizes / level of concurrency

Details
algo.shortestPath
  • Dijkstra single source shortest path algorithm

  • The algorithm computes a shortest path on weighted graphs between a given start and target node id. It returns result tuples of [nodeId, distance] for each node in the path

  • there may be more than one shortest path; the algorithm returns only one of them

  • if initialized with a non-existent weight-property and a defaultWeight of 1.0, the resulting path is minimal in terms of the count of nodes in the path.

algo.shortestPaths
  • Dijkstra single source shortest path algorithm

  • returns minimum distance to all other nodes

  • if initialized with a non-existent weight-property and a defaultWeight of 1.0, the resulting distances are minimal in terms of the count of nodes in the path.

algo.shortestPath.deltaStepping

TODO naming!?

  • parallel non-negative single source shortest path algorithm for weighted graphs

  • can be tweaked using the delta-parameter which controls the degree of concurrency

  • returns the minimum distance to all other nodes

  • if initialized with a non-existent weight-property and a defaultWeight of 1.0, its result can be interpreted as the number of nodes needed to reach the target

All Pairs Shortest Path


Implementation Details

Details
algo.allShortestPaths.stream
  • returns a stream of source-target node to distance tuples for each pair of nodes

  • Since all nodeIds have already been ordered by the idMapping, we can use a simple integer counter instead of a queue, which counts up for each startNodeId as long as it is < nodeCount (see the sketch after this list).

  • Each thread tries to take one int from the counter at one time and starts its computation on it.

  • The AllShortestPaths concurrency value determines the count of workers that should be spawned.

  • Due to the high memory footprint the full result set would have, we emit each result into a blocking queue. The result stream takes elements from the queue while the workers add elements to it.

  • The result stream is limited by N^2. If the stream gets closed prematurely the workers get closed too.

  • writeback not supported!
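A rough sketch of this counter-and-queue pattern (names and the tuple layout are illustrative only):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// workers repeatedly claim the next start node from a shared counter and emit
// {source, target, distance} tuples into a bounded queue that the result stream drains
static void runWorkers(int concurrency, int nodeCount) {
    AtomicInteger counter = new AtomicInteger(0);
    BlockingQueue<double[]> results = new ArrayBlockingQueue<>(64); // acts as back-pressure
    for (int i = 0; i < concurrency; i++) {
        new Thread(() -> {
            int startNode;
            while ((startNode = counter.getAndIncrement()) < nodeCount) {
                // compute shortest paths from startNode here (e.g. with the Dijkstra sketch above)
                try {
                    results.put(new double[]{startNode, 0, 0.0}); // placeholder tuple
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }).start();
    }
    // the result stream would now take() from the queue until nodeCount^2 tuples have been seen
}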

Triangle Count

Implementation Details

  • ✓ single threaded implementation

  • ✓ tests

  • ❏ edge case tests

  • ✓ implement procedure

  • ✓ simple benchmark

  • ✓ benchmark on bigger graphs

  • ✓ parallelization

  • ✓ evaluation

  • ✓ documentation

Details

algo.triangle.stream(..) returns a Stream of Triples with nodeIds for each triangle.

algo.triangleCount(..) counts the number of triangles a node is a member of and writes it back as a node property. It also counts the total number of triangles in the whole graph and returns it in the stats.

algo.triangleCount.stream(..) counts the number of triangles a node is a member of and returns a stream of nodeId and triangleCount

Note

sum(triangleCount) == 3 * (number of triangles in the graph), because every triangle adds 1 to the count of each of its 3 nodes. For example, a graph consisting of a single triangle has triangleCount 1 on each node, which sums to 3.

Label Propagation

Implementation Details

Label Propagation is a graph partitioning algorithm already implemented in current apoc-procedures.

Progress
  • ✓ adapt apoc-procedure to algorithm api

  • ✓ single threaded implementation

  • ✓ tests

  • ❏ edge case tests

  • ✓ implement procedure

  • ✓ simple benchmark

  • ✓ benchmark on bigger graphs

  • ✓ parallelization

  • ✓ evaluation

  • ✓ documentation

TODO
  • adapt existing procedure to algorithm api

Louvain

Implementation Details

Louvain is an algorithm for detecting graph partitions (communities) in networks that relies on a heuristic for maximizing the modularity.

  • ✓ single threaded implementation

  • ✓ tests

  • ❏ edge case tests

  • ✓ implement procedure

  • ✓ simple benchmark

  • ✓ benchmark on bigger graphs

  • ✓ parallelization

  • ❏ evaluation

  • ✓ documentation

Weakly Connected Components

Implementation Details

Connected Components (UnionFind) finds sets of connected nodes where each node is reachable from any other node in the same set. One implementation also evaluates a Predicate on each relationship, which allows partitioning of the graph based on relationships and properties.

Progress
  • ✓ single threaded implementation

  • ✓ tests

  • ✓ simple benchmark

  • ✓ implement procedure

  • ✓ benchmark on bigger graphs

  • ✓ parallelization

  • ✓ evaluation

Requirements

AllRelationshipIterator & Weights

Data structures involved

We use a disjoint-set structure (DSS) which is based on a parent-array tree. The DSS can be used to efficiently ask if two nodes are in the same set, i.e. reachable from each other. [More](https://en.wikipedia.org/wiki/Disjoint-set_data_structure)
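A minimal sketch of such a parent-array DSS with path compression (the library's actual DisjointSetStruct is more elaborate, e.g. it preallocates and supports merging whole structs):

// parent-array disjoint-set structure with path compression
class DisjointSet {
    private final int[] parent;

    DisjointSet(int nodeCount) {
        parent = new int[nodeCount];
        for (int i = 0; i < nodeCount; i++) parent[i] = i; // every node starts as its own set
    }

    int find(int node) {
        while (parent[node] != node) {
            parent[node] = parent[parent[node]]; // path compression: skip a level
            node = parent[node];
        }
        return node;
    }

    void union(int p, int q) {
        parent[find(p)] = find(q); // link the two set representatives
    }

    boolean connected(int p, int q) {
        return find(p) == find(q);
    }
}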

ToDo
benchmark

Implement benchmark on big graph and:

  • stream nodeId-setId pairs

  • calculate setSize-setCount

parallelization

One approach to parallelizing UnionFind might be relationship partitioning, where each thread performs the execution in its own DSS instance on a subset of relationships. Each thread thus calculates a distinct set of unions. Afterwards we can merge the DSS instances pairwise, which can also be performed in parallel. Nonetheless, the memory consumption might be high due to the preallocated array in each DSS. We could switch to a growing container if this becomes a problem.

evaluation
  • Performance tests on different dataset sizes / level of concurrency

Details
  • writes a cluster-id to each node representing a connected component where each node is reachable from any other node

algo.unionFind
  • if a threshold configuration parameter is supplied only relationships with a property value higher than the threshold are merged

algo.unionFind.queue
  • parallel UnionFind using ExecutorService only.

  • Algorithm based on the idea that DisjointSetStruct can be built using just a partition of the nodes which are then merged pairwise.

  • The implementation is based on a queue which acts as a buffer for each computed DSS. As long as there are more elements on the queue the algorithm takes two, merges them and adds its result to the queue until only 1 element remains.

algo.unionFind.forkJoinMerge
  • As in exp1, the resulting DSS of each node-partition is merged by the ForkJoin pool, while the calculation of the DSS is done by the ExecutorService.

algo.unionFind.forkJoin
  • calculation and merge using forkJoinPool

Strongly Connected Components

Implementation Details

SCC is a class of algorithms for finding groups of nodes where each node is directly reachable from every other node in the group. There are several algorithms to compute SCCs. Our current implementation (still in the graphtest project) implements Tarjan's SCC algorithm.

Progress
  • ✓ implement procedure

  • ✓ tests

  • ✓ edge case tests

  • ✓ simple benchmark

  • ✓ benchmark on bigger graphs

  • ✓ parallelization

  • ✓ evaluation

  • ✓ documentation

Details
algo.scc.tarjan
  • original recursive tarjan implementation

  • result is a cluster-id at all nodes

  • each cluster is a scc

  • Builds sets of node-Ids which represent a strongly connected component within the graph. Also calculates minimum and maximum setSize as well as the count of distinct sets.

algo.scc.tunedTarjan
algo.scc.iterative
algo.scc.multistep