4 Graph Algorithms on Steroids for data Scientists with cuGraph
We, as data scientists have gotten quite comfortable with Pandas or SQL or any other relational database.
We are used to seeing our users in rows with their attributes as columns. But does the real world behave like that?
In a connected world, users cannot be considered as independent entities. They have got certain relationships with each other, and we would sometimes like to include such relationships while building our machine learning models.
Now while in a relational database, we cannot use such relations between different rows(users), in a graph database, it is relatively trivial to do that.
Now, as we know, Python has a great package called Networkx to do this. But the problem with that is that it is not scalable.
A GPU can help solve our scalability problems with its many cores and parallelization. And that is where RAPIDS.ai CuGraph comes in.
The RAPIDS cuGraph library is a collection of graph analytics that process data found in GPU Dataframes — see cuDF . cuGraph aims to provide a NetworkX-like API that will be familiar to data scientists, so they can now build GPU-accelerated workflows more easily.
In this post, I am going to be talking about some of the most essential graph algorithms you should know and how to implement them using Python with cuGraph.
Installation
To install cuGraph you can just use the simple command that you can choose from rapids.ai based on your system and configuration.
The command I used is below and I used a nightly build(recommended):
conda install -c rapidsai-nightly -c nvidia -c numba -c conda-forge -c anaconda cudf=0.10 cuml=0.10 cugraph=0.10
1. Connected Components
We all know how clustering works?
You can think of Connected Components in very layman’s terms as a sort of a hard clustering algorithm which finds clusters/islands in related/connected data.
As a concrete example: Say you have data about roads joining any two cities in the world. And you need to find out all the continents in the world and which city they contain.
How will you achieve that? Come on, give some thought.
The connected components algorithm that we use to do this is based on a special case of BFS/DFS. I won’t talk much about how it works here, but we will see how to get the code up and running using Networkx as well as cuGraph.
Applications
From a Retail Perspective: Let us say, we have a lot of customers using a lot of accounts. One way in which we can use the Connected components algorithm is to find out distinct families in our dataset.
We can assume edges(roads) between CustomerIDs based on same credit card usage, or same address or same mobile number, etc. Once we have those connections, we can then run the connected component algorithm on the same to create individual clusters to which we can then assign a family ID.
We can then use these family IDs to provide personalized recommendations based on family needs. We can also use this family ID to fuel our classification algorithms by creating grouped features based on family.
From a Finance Perspective: Another use case would be to capture fraud using these family IDs. If an account has done fraud in the past, it is highly probable that the connected accounts are also susceptible to fraud.
The possibilities are only limited by your imagination.
Code
We will be using the Networkx module in Python for creating and analyzing our graphs.
Let us start with an example graph which we are using for our purpose. Contains cities and distance information between them.
We first start by creating a list of edges along with the distances which we will add as the weight of the edge:
edgelist = [['Mannheim', 'Frankfurt', 85], ['Mannheim', 'Karlsruhe', 80], ['Erfurt', 'Wurzburg', 186], ['Munchen', 'Numberg', 167], ['Munchen', 'Augsburg', 84], ['Munchen', 'Kassel', 502], ['Numberg', 'Stuttgart', 183], ['Numberg', 'Wurzburg', 103], ['Numberg', 'Munchen', 167], ['Stuttgart', 'Numberg', 183], ['Augsburg', 'Munchen', 84], ['Augsburg', 'Karlsruhe', 250], ['Kassel', 'Munchen', 502], ['Kassel', 'Frankfurt', 173], ['Frankfurt', 'Mannheim', 85], ['Frankfurt', 'Wurzburg', 217], ['Frankfurt', 'Kassel', 173], ['Wurzburg', 'Numberg', 103], ['Wurzburg', 'Erfurt', 186], ['Wurzburg', 'Frankfurt', 217], ['Karlsruhe', 'Mannheim', 80], ['Karlsruhe', 'Augsburg', 250],["Mumbai", "Delhi",400],["Delhi", "Kolkata",500],["Kolkata", "Bangalore",600],["TX", "NY",1200],["ALB", "NY",800]]
Now we want to find out distinct continents and their cities from this graph.
First, we will need to create a cudf dataframe with edges in it. Right now, I am creating a pandas dataframe and converting it to cudf dataframe, but in a real-life scenario, we will read from a csv file of edges.
import cugraph
import cudf
import pandas as pd
# create a pandas dataframe of edges
pandas_df = pd.DataFrame(edgelist)
pandas_df.columns = ['src','dst','distance']
# create a pandas dataframe of reversed edges as we have a undirected graph
rev_pandas_df = pandas_df.copy()
rev_pandas_df.columns = ['dst','src','distance']
rev_pandas_df = rev_pandas_df[['src','dst','distance']]
# concat all edges
pandas_df = pd.concat([pandas_df,rev_pandas_df])
Now our pandas df contains edges in both directions. And our node names in src and dst columns are in str format. Apparently, cuGraph doesn’t like that and only works with integer node IDs.
# CuGraph works with only integer node IDs
unique_destinations = set()
for [src,dst,dis] in edgelist:
unique_destinations.add(src)
unique_destinations.add(dst)
# create a map of city and a unique id
city_id_dict = {}
for i, city in enumerate(unique_destinations):
city_id_dict[city]=i
# create 2 columns that contain the integer IDs for src and dst
pandas_df['src_int'] = pandas_df['src'].apply(lambda x : city_id_dict[x])
pandas_df['dst_int'] = pandas_df['dst'].apply(lambda x : city_id_dict[x])
Now comes the main part that we should focus on:
cuda_g = cudf.DataFrame.from_pandas(pandas_df)
# cugraph needs node IDs to be int32 and weights to be float
cuda_g['src_int'] = cuda_g['src_int'].astype(np.int32)
cuda_g['dst_int'] = cuda_g['dst_int'].astype(np.int32)
cuda_g['distance'] = cuda_g['distance'].astype(np.float)
G = cugraph.Graph()
G.add_edge_list(cuda_g["src_int"],cuda_g["dst_int"] , cuda_g['distance'])
cugraph.weakly_connected_components(G)
The output of the last call is a cudf dataframe.
As we can see, the labels correspond to Connected Components ID.
2. Shortest Path
Continuing with the above example only, we are given a graph with the cities of Germany and the respective distance between them.
You want to find out how to go from Frankfurt (The starting node) to Munchen by covering the shortest distance.
The algorithm that we use for this problem is called Dijkstra. In Dijkstra’s own words:
What is the shortest way to travel from Rotterdam to Groningen , in general: from given city to given city. It is the algorithm for the shortest path , which I designed in about twenty minutes. One morning I was shopping in Amsterdam with my young fiancée, and tired, we sat down on the café terrace to drink a cup of coffee and I was just thinking about whether I could do this, and I then designed the algorithm for the shortest path. As I said, it was a twenty-minute invention. In fact, it was published in ’59, three years later. The publication is still readable, it is, in fact, quite nice. One of the reasons that it is so nice was that I designed it without pencil and paper. I learned later that one of the advantages of designing without pencil and paper is that you are almost forced to avoid all avoidable complexities. Eventually that algorithm became, to my great amazement, one of the cornerstones of my fame. — Edsger Dijkstra, in an interview with Philip L. Frana, Communications of the ACM, 2001 [3]
Applications
Variations of the Dijkstra algorithm is used extensively in Google Maps to find the shortest routes.
You are in a Walmart Store. You have different Aisles and distance between all the aisles. You want to provide the shortest pathway to the customer from Aisle A to Aisle D.
You have seen how LinkedIn shows up 1st-degree connections, 2nd-degree connections. What goes on behind the scenes?
Code
We already have our Graph as before. We can find the shortest distance from a source node to all nodes in the graph.
# get distances from source node 0
distances = cugraph.sssp(G, 0)
# filter infinite distances
distances = cugraph.traversal.filter_unreachable(distances)
distances
Now if we have to find the path between node 0 and 14 we can use the distances cudf.
# Getting the path is as simple as:
path = []
dest = 14
while dest != 0:
dest = distances[distances['vertex'] == dest]['predecessor'].values[0]
path.append(dest)
# reverse the list and print
print(path[::-1])
[0, 11, 9]
3. Pagerank
This is the page sorting algorithm that powered google for a long time. It assigns scores to pages based on the number and quality of incoming and outgoing links.
Applications
Pagerank can be used anywhere where we want to estimate node importance in any network.
It has been used for finding the most influential papers using citations.
Has been used by Google to rank pages
It can be used to rank tweets- User and Tweets as nodes. Create Link between user if user A follows user B and Link between user and Tweets if user tweets/retweets a tweet.
Recommendation engines
Code
For this exercise, we are going to be using Facebook social network data.
# Loading the file as cudf
fb_cudf = cudf.read_csv("facebook_combined.txt", sep=' ', names=['src', 'dst'],dtype =['int32','int32'])
# adding reverse edges also
rev_fb_cudf = fb_cudf[['dst','src']]
rev_fb_cudf.columns = ['src','dst']
fb_cudf = cudf.concat([fb_cudf,rev_fb_cudf])
Creating the graph
# creating the graph
fb_G = cugraph.Graph()
fb_G.add_edge_list(fb_cudf["src"],fb_cudf["dst"])
Now we want to find the users having high influence capability.
Intuitively, the Pagerank algorithm will give a higher score to a user who has a lot of friends who in turn have a lot of FB Friends.
# Call cugraph.pagerank to get the pagerank scores
fb_pagerank = cugraph.pagerank(fb_G)
fb_pagerank.sort_values(by='pagerank',ascending=False).head()
4. Link Prediction
Continuing along with our Facebook example. You might have seen recommended friends in your Facebook account. How can we create our small recommender?
Can we predict which edges will be connected in the future based on current edges?
A straightforward and fast approach to do this is by using the Jaccard Coefficient.
Applications
There could be many applications of link predictions. We could predict
Authors who are going to connect for co-authorships in a citation network
Who will become friends in a social network?
Idea
We calculate the Jaccard coefficient between two nodes i and j as :
Where the numerator is the number of common neighbors of i and j, and the denominator is the total number of distinct neighbors of i and j.
So in the figure, the half red and green nodes are the common neighbors of both A and B. And they have a total of 5 distinct neighbors. So the JaccardCoeff(A, B) is 2/5
Code
We first create a cudf_nodes cudf with all possible node combinations.
max_vertex_id = fb_pagerank['vertex'].max()
data = []
for x in range(0,max_vertex_id+1):
for y in range(0,max_vertex_id+1):
data.append([x,y])
cudf_nodes =cudf.from_pandas(pd.DataFrame(data))
cudf_nodes.columns = ['src','dst']
cudf_nodes['src'] = cudf_nodes['src'].astype(np.int32)
cudf_nodes['dst'] = cudf_nodes['dst'].astype(np.int32)
We can then calculate the Jaccard coefficient between nodes as:
jaccard_coeff_between_nodes = cugraph.link_prediction.jaccard(fb_G,cudf_nodes["src"],cudf_nodes["dst"])
jaccard_coeff_between_nodes.head()
But we are still not done. We need to remove the edges where the source==destination and the edges which are already present in the graph. We will do this using simple join and filter operations which work particularly similar to pandas.
jaccard_coeff_between_nodes=jaccard_coeff_between_nodes[jaccard_coeff_between_nodes['source']!=jaccard_coeff_between_nodes['destination']]
fb_cudf.columns = ['source', 'destination']
fb_cudf['edgeflag']=1
jaccard_coeff_joined_with_edges = jaccard_coeff_between_nodes.merge(fb_cudf,on= ['source', 'destination'],how='left')
# We just want to see the jaccard coeff of new edges
new_edges_jaccard_coeff = jaccard_coeff_joined_with_edges[jaccard_coeff_joined_with_edges['edgeflag']!=1]
This is our final sorted dataframe with the Jaccard coefficient between unconnected nodes. We know what friends to recommend to our platform users.
new_edges_jaccard_coeff.sort_values(by='jaccard_coeff',ascending=False)
Basic Network Statistics
There are a lot of basic measures which you want to know about your network.
Here is how you get them in your network
print("Number of Nodes",fb_G.number_of_nodes())
print("Number of Edges",fb_G.number_of_edges())
Number of Nodes 4039
Number of Edges 176468
You can also compute the indegree and outdegree for each node.
In a directed graph this corresponds to no of followers and no of follows.
fb_G.degrees().head()
Performance Benchmarks
I won’t do any justice to this post if I don’t add certain benchmarks for the different algorithms.
In my benchmark study, I use three datasets in increasing order of scale from the Stanford Large Network Dataset Collection.
ego-Facebook : Undirected graph with 4 K nodes and 88 K edges from Facebook
ego-Twitter : Directed graph with 81 K nodes and 1.7 M edges from Twitter
ego-Gplus : Directed graph with 107 K nodes and 13.6 M edges from Google+
Here are the results of the experiments I performed on NVIDIA <em><strong>Tesla V100 32 GB GPU</strong></em> . Thanks to Josh Patterson from NVIDIA and Richard Ulrich at Walmart Labs for arranging that for me. All the times are given in milliseconds:
I didn’t add Jaccard coefficients in the results as it didn’t run even for facebook using networkX. For cuGraph it had millisecond-level latencies.
Let us visualize these results:
Caveats
Rapids cuGraph is an excellent library for graph analysis, but I feel some things are still missing. Maybe we will get them in the next version.
A little bit of inconvenience that we have to use numbered nodes with data type int32 only. Renumbering helps with that. See my notebook for the benchmark for the exact code. Check the function cugraph.symmetrize_df too for creating undirected graphs.
Some algorithms are still not implemented. For instance, I could not find MST, Centrality measures, etc.
More example notebooks are needed to document best practices. I might be going to be work on some of those.
No visualization component in the library. I have to go to networkx to plot graphs.
But despite that, I would also like to add that the idea to provide graph analysis with GPU is so great that I can live with these small problems. And the way they have made the API so similar to pandas and networkx adds to its value.
I remember how using GPU needed a lot of code in the past. RAPIDS has aimed to make GPU ubiquitous, and that is a fabulous initiative.
Conclusion
In this post, I talked about some of the most powerful graph algorithms that have changed the way we live and how to scale them with GPUs.
I love the way Rapids AI has been working to make GPUs accessible to the typical developer/data scientist and to think that we hadn’t heard about it till a year back. They have come a long way.
Also, here are the newest version 0.9 documentation for cuDF and cuGraph .
You can get the running code in this Google Colab Notebook , and the code with benchmarks on my Github repository as Google Colab fell short on resources while benchmarking.
Continue Learning
If you want to read up more on Graph Algorithms here is a Graph Analytics for Big Data course on Coursera by UCSanDiego , which I highly recommend to learn the basics of graph theory.