I generally have a use case for Hadoop in my daily job. It has made my life easier in a sense that I am able to get results which I was not able to see with SQL queries. But still I find it painfully slow. I have to write procedural programs while I work. As in merge these two datasets and then filter and then merge another dataset and then filter using some condition and yada-yada.
I have been using Hadoop a lot now a days and thought about writing some of the novel techniques that a user could use to get the most out of the Hadoop Ecosystem. Using Shell Scripts to run your Programs I am not a fan of large bash commands. The ones where you have to specify the whole path of the jar files and the such. You can effectively organize your workflow by using shell scripts.
This is part one of a learning series of pyspark, which is a python binding to the spark program written in Scala. The installation is pretty simple. These steps were done on Mac OS Mavericks but should work for Linux too. Here are the steps for the installation: 1. Download the Binaries: Spark : http://spark.apache.org/downloads.html Scala : http://www.scala-lang.org/download/ Dont use Latest Version of Scala, Use Scala 2.10.x 2. Add these lines to your .
It has been some time since I was stalling learning Hadoop. Finally got some free time and realized that Hadoop may not be so difficult after all. What I understood finally is that Hadoop is basically comprised of 3 elements: A File System Map – Reduce Its many individual Components. Let’s go through each of them one by one. 1. Hadoop as a File System: One of the main things that Hadoop provides is cheap data storage.