
Learning pyspark – Installation – Part 1

This is part one of a learning series on PySpark, the Python binding to Spark, which itself is written in Scala.

The installation is pretty simple. These steps were done on OS X Mavericks, but they should work on Linux too. Here are the installation steps:

1. Download the Binaries:

In []:
Spark : http://spark.apache.org/downloads.html
Scala : http://www.scala-lang.org/download/

Don't use the latest version of Scala; use Scala 2.10.x, which is the version Spark is built against.

2. Add these lines to your .bash_profile:

In []:
export SCALA_HOME=your_path_to_scala   # directory where you unpacked Scala
export SPARK_HOME=your_path_to_spark   # directory where you unpacked Spark

3. Build Spark (this will take a while):

In []:
brew install sbt      # install sbt, the Scala build tool (via Homebrew on OS X)
cd $SPARK_HOME
sbt/sbt assembly      # build Spark; this step takes a while

4. Start the PySpark shell:

In []:
$SPARK_HOME/bin/pyspark

And voilà, you are running PySpark on your machine.
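The shell also creates a SparkContext for you, exposed as sc. If you'd rather use PySpark from a plain Python script than from the interactive shell, a minimal sketch looks like the following (this assumes $SPARK_HOME/python, and the Py4J zip under $SPARK_HOME/python/lib, are on your Python path; the app name is made up for illustration):

In []:
import os
import sys

# Assumes SPARK_HOME was exported in step 2; put PySpark's Python sources on the path.
spark_home = os.environ["SPARK_HOME"]
sys.path.insert(0, os.path.join(spark_home, "python"))

from pyspark import SparkContext

# In the pyspark shell this context is created for you as `sc`;
# in a standalone script we create it ourselves against the local master.
sc = SparkContext("local", "standalone_test")
print(sc.parallelize(range(10)).sum())  # should print 45
sc.stop()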

To check that everything is properly installed, let's run a simple program:

In []:
test = sc.parallelize([1, 2, 3])  # distribute a small Python list as an RDD
test.count()                      # count the elements of the RDD

This should return 3.
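If you want to exercise a bit more of the API from the same shell, here is a small map/filter sketch; nothing in it depends on your setup beyond the sc the shell already provides:

In []:
# Square the numbers 0..9, keep the even squares, and add them up.
nums = sc.parallelize(range(10))
even_squares = nums.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
even_squares.collect()  # [0, 4, 16, 36, 64]
even_squares.sum()      # 120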

Now, if you also want to read data from HDFS, start Hadoop on your machine first and then launch PySpark:

In []:
cd /usr/local/hadoop/
bin/start-all.sh           # start the Hadoop daemons
jps                        # check that the Hadoop processes are running
$SPARK_HOME/bin/pyspark
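With Hadoop up, Spark can read files straight out of HDFS via sc.textFile. A minimal sketch, assuming you have already copied some text file into HDFS (the hdfs:// URL below is only an example; substitute your own NameNode host, port, and path):

In []:
# Hypothetical HDFS path; replace with a file that actually exists on your cluster.
lines = sc.textFile("hdfs://localhost:9000/user/you/sample.txt")
lines.count()                                   # number of lines in the file
lines.filter(lambda l: "spark" in l).count()    # lines containing the word "spark"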
