Big Data has become synonymous with data engineering, but the line between data engineers and data scientists is blurring day by day. At this point, I think Big Data must be in the repertoire of every data scientist. The reason: too much data is being generated every day. And that brings us to Spark. Most of the Spark documentation, while good, does not explain it from the perspective of a data scientist.
This is part one of a learning series on PySpark, the Python binding to Spark, which is itself written in Scala. The installation is pretty simple. These steps were done on Mac OS X Mavericks but should work on Linux too. Here are the steps for the installation:

1. Download the binaries:
   Spark: http://spark.apache.org/downloads.html
   Scala: http://www.scala-lang.org/download/
   Don't use the latest version of Scala; use Scala 2.10.x.
2. Add these lines to your .