Big Data Machine Learning

Hadoop, MapReduce and More – Part 1

I had been stalling on learning Hadoop for some time. I finally got some free time and realized that Hadoop may not be so difficult after all. What I understood in the end is that Hadoop basically comprises three elements:

  • A file system
  • Map-Reduce
  • Its many individual components

Let’s go through each of them one by one.

1. Hadoop as a File System:

One of the main things that Hadoop provides is cheap data storage. Under the hood, the Hadoop system takes a file, cuts it into chunks and stores those chunks at different places in a cluster. Suppose you have a really big file on your local system and you want that file to be:

  • On the cloud for easy access
  • Processable in human time

Then the one tool you can look to is Hadoop.
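
To make the chunking idea concrete, here is a minimal Python sketch of the splitting step. This is only an illustration, not Hadoop's actual code; the 64 MB figure is the classic HDFS default block size.

BLOCK_SIZE = 64 * 1024 * 1024  # classic HDFS default block size

def split_into_blocks(path, block_size=BLOCK_SIZE):
    # Read the file in fixed-size chunks, the way HDFS carves a file
    # into blocks before scattering them across the cluster's machines
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            yield chunk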

The following assumes that you already have Hadoop installed on the Amazon cluster you are working on.

Start the Hadoop Cluster:

You need to run the following commands to start the Hadoop cluster (adjust the path for the location of your Hadoop installation directory):

cd /usr/local/hadoop/
bin/start-all.sh
jps
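# jps lists the running Java processes; after start-all.sh you should
# see daemons such as NameNode, DataNode, JobTracker and TaskTracker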

Adding a file to HDFS: Every HDFS command starts with hadoop fs, and the rest of it works like UNIX syntax. To add a file “purchases.txt” to HDFS:

hadoop fs -put purchases.txt /usr/purchases.txt
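
A few more everyday commands follow the same UNIX-style pattern (all standard hadoop fs subcommands):

hadoop fs -ls /usr                     # list files in an HDFS directory
hadoop fs -cat /usr/purchases.txt      # print a file's contents
hadoop fs -get /usr/purchases.txt .    # copy a file back to the local system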

2. Hadoop for Map-Reduce:

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
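
Before bringing Hadoop into it, here is a minimal pure-Python sketch of the model, using a few made-up sales records: the map step emits key-value pairs, a sort stands in for the shuffle, and the reduce step aggregates per key.

from itertools import groupby
from operator import itemgetter

# Map: turn raw "store,sale,time" records into (store, sale) pairs
records = ["A,300,12:00", "B,234,1:00", "A,123,1:00"]
mapped = [(r.split(",")[0], float(r.split(",")[1])) for r in records]

# Shuffle: group the pairs by key (Hadoop does this between map and reduce)
mapped.sort(key=itemgetter(0))

# Reduce: aggregate the values for each key
for store, pairs in groupby(mapped, key=itemgetter(0)):
    print("{0},{1}".format(store, sum(v for _, v in pairs)))
# prints A,423.0 then B,234.0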

While Hadoop is implemented in Java, you can use almost any language to do map-reduce in Hadoop via Hadoop Streaming. Suppose you have a big file containing the name of a store and its sales for each hour, and you want to find the total sales per store using map-reduce. Let's write sample code for that:

InputFile

A,300,12:00
B,234,1:00
C,234,2:00
D,123,3:00
A,123,1:00
B,346,2:00

Mapper.py

#!/usr/bin/env python
import sys

def mapper():
    # Read "store,value,time" records from stdin and emit "store,value" pairs
    for line in sys.stdin:
        data = line.strip().split(",")
        storeName, value, time = data
        print("{0},{1}".format(storeName, value))

if __name__ == "__main__":
    mapper()

Reducer.py

#!/usr/bin/env python
import sys

def reducer():
    # Input arrives sorted by key (Hadoop's shuffle guarantees this),
    # so each store's sales can be totalled in a single pass
    salesTotal = 0
    oldKey = None
    for line in sys.stdin:
        data = line.strip().split(",")
        # Adding a little bit of defensive programming
        if len(data) != 2:
            continue
        curKey, curVal = data
        if oldKey and oldKey != curKey:
            print("{0},{1}".format(oldKey, salesTotal))
            salesTotal = 0
        oldKey = curKey
        salesTotal += float(curVal)
    if oldKey is not None:
        print("{0},{1}".format(oldKey, salesTotal))

if __name__ == "__main__":
    reducer()

Testing the pipeline locally in the shell using pipes (sort plays the role of Hadoop's shuffle phase):

cat textfile.txt | ./mapper.py | sort | ./reducer.py
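
On the sample input above, this prints (the totals come out as floats because the reducer accumulates with float()):

A,423.0
B,580.0
C,234.0
D,123.0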

Running the program as a map-reduce job using Hadoop Streaming:

hadoop jar contrib/streaming/hadoop-*streaming*.jar \
-file mapper.py -mapper mapper.py \
-file reducer.py -reducer reducer.py \
-input /inputfile -output /outputfile
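# Note: both scripts must be executable (chmod +x mapper.py reducer.py)
# and carry the #!/usr/bin/env python shebang so Hadoop can launch them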

3. Hadoop Components:

Now if you have been following Hadoop you might have heard about Apache, Cloudera, Hortonworks etc. All of these are Hadoop vendors who provide Hadoop along with its components. I will talk about one of the main components of the Hadoop ecosystem here: Hive.

So what exactly is Hive? Hive is a SQL-like interface to map-reduce queries. So if you don't understand all the hocus-pocus of map-reduce but know SQL, you can do map-reduce via Hive. Seems promising? It is. While the syntax is mostly SQL, it is still a little different, and there are some quirks we need to understand to work with Hive.

First of all, let's open the Hive command prompt: for that you just have to type “hive”, and voila, you are in. Here are some general commands:

show databases;      -- see all databases
use databasename;    -- switch to a particular database
show tables;         -- see all tables in the current database
describe tablename;  -- show the schema of a table

Creating an external table and loading data into it (with an external table, dropping the table later does not delete the underlying data):

CREATE EXTERNAL TABLE IF NOT EXISTS BXDataSet
(ISBN STRING, BookTitle STRING, ImageURL STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' STORED AS TEXTFILE;
LOAD DATA INPATH '/user/book.csv' OVERWRITE INTO TABLE BXDataSet;

The query commands work the same way as in SQL. You can do all the usual selects and group bys, and Hive will automatically convert them into map-reduce jobs:

select * from tablename;
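
As an example, the per-store sales total we computed with the mapper and reducer above could be written in one line, against a hypothetical salesTable with storeName and sales columns; Hive compiles the GROUP BY into a map-reduce job behind the scenes:

select storeName, sum(sales) from salesTable group by storeName;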

Stay tuned for Part 2, where we will talk about another component of Hadoop: Pig.
