GraphFrames PySpark Example : Learn Data Science

This post walks through a GraphFrames example in PySpark, ending with a shortest path query. GraphFrames is a Spark package that provides DataFrame-based graphs in Spark. Spark version 1.6.2 is assumed for all examples. To include the package when starting the PySpark shell:

pyspark --packages graphframes:graphframes:0.1.0-spark1.6

Code:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from graphframes import GraphFrame

# reuse the shell's existing SparkContext, or create one when run as a script
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# create vertex DataFrame for users with id and name attributes
v = sqlContext.createDataFrame([
    ("a", "Alice"),
    ("b", "Bob"),
    ("c", "Charlie"),
], ["id", "name"])

# create edge DataFrame with "src" and "dst" attributes
e = sqlContext.createDataFrame([
    ("a", "b", "friends"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
], ["src", "dst", "relationship"])

# create a GraphFrame with v, e
g = GraphFrame(v, e)

# example : getting in-degrees of each vertex
g.inDegrees.show()

Output:

id inDegree
b 2
c 1
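
As a quick sanity check on the output above, the same in-degrees can be computed in plain Python without Spark. This is only an illustrative sketch over the same three edges; the names `edges` and `in_degrees` are made up for this example and are not part of GraphFrames.

```python
from collections import Counter

# Same edge list as the GraphFrame above: (src, dst, relationship)
edges = [
    ("a", "b", "friends"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
]

# The in-degree of a vertex is the number of edges whose dst is that vertex
in_degrees = Counter(dst for _src, dst, _rel in edges)

for vertex, degree in sorted(in_degrees.items()):
    print(vertex, degree)
```

Vertex "a" does not appear because no edge points to it, which matches the two-row result shown above.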

Example: counting the “follow” relationships in the graph

g.edges.filter("relationship = 'follow'").count()

Output:

2

Example: getting shortest paths to “a” from each vertex

results = g.shortestPaths(landmarks=["a"])
results.select("id", "distances").show()
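
To get a feel for what the "distances" column holds, here is a small plain-Python sketch that computes hop counts from every vertex to each landmark with a breadth-first search. It is not the GraphFrames implementation. It assumes that paths follow edge direction (from src to dst) and that a vertex which cannot reach a landmark simply has no entry for it. The function name `hops_to_landmarks` is hypothetical.

```python
from collections import deque

# Same graph as above, with the relationship attribute dropped
vertices = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "b")]


def hops_to_landmarks(vertices, edges, landmarks):
    """Return {vertex: {landmark: hop count}} for directed, unweighted edges."""
    # Reverse adjacency: for each vertex, the vertices with an edge into it
    predecessors = {v: [] for v in vertices}
    for src, dst in edges:
        predecessors[dst].append(src)

    distances = {v: {} for v in vertices}
    for landmark in landmarks:
        # Breadth-first search backwards from the landmark
        distances[landmark][landmark] = 0
        queue = deque([landmark])
        while queue:
            current = queue.popleft()
            for prev in predecessors[current]:
                if landmark not in distances[prev]:
                    distances[prev][landmark] = distances[current][landmark] + 1
                    queue.append(prev)
    return distances


for vertex, dist in sorted(hops_to_landmarks(vertices, edges, ["a"]).items()):
    print(vertex, dist)
```

Under these assumptions only "a" has an entry for the landmark "a" (distance 0), since no edge in this graph leads into "a". Choosing "b" as the landmark instead gives a distance of 1 from both "a" and "c".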

Feel free to ask your questions in the comments section!