In this post I give a brief overview of Hadoop. A more extensive look at how it works and what makes it highly distributed and fault tolerant will come in later posts. The goal of this post is to make you comfortable running your first Hadoop job.
In simple terms, Hadoop is a framework for processing large datasets in a distributed environment.
Hadoop includes four basic modules:
1. Hadoop Common:
All the Java libraries and utilities required by the other Hadoop modules.
2. Hadoop YARN:
A framework for job scheduling and cluster resource management.
3. Hadoop Distributed File System (HDFS):
HDFS is the primary storage system used by Hadoop applications. It stores large datasets across many nodes connected over a network, and it is highly fault tolerant and scalable. HDFS keeps metadata in the NameNode and the actual file content in DataNodes. There can be many DataNodes across the network, all of which store file content in blocks of a standard size. The NameNode monitors block replication and keeps the metadata up to date.
To access files and directories in HDFS, commands are prefixed with bin/hdfs dfs.
List all the files in the input folder:
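For example, assuming a folder named input in your HDFS home directory:

```
bin/hdfs dfs -ls input
```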
Remove all the files in the output folder:
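For example, assuming the folder is named output:

```
bin/hdfs dfs -rm output/*
```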
HDFS Architecture (Image source: Hadoop Official Documentation)
4. Hadoop MapReduce:
Hadoop MapReduce is a software framework for writing applications that process vast amounts of data in parallel on large clusters (thousands of nodes). It basically consists of two parts:
Map: takes a set of data and converts it into an intermediate set of data, which must be in <key, value> pair form.
Reduce: takes the intermediate tuples produced by the map phase and combines them into a smaller set of tuples.
Example: say you want to count the occurrence of each word in a file. The input is split between two map tasks; each map emits a <word, 1> pair for every word it sees, and the reduce step then sums the counts per word, as sketched below.
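A minimal walk-through, assuming the input file contains the two lines used in the Hadoop MapReduce tutorial, "Hello World Bye World" and "Hello Hadoop Goodbye Hadoop":

```
First map emits:   < Hello, 1> < World, 1> < Bye, 1> < World, 1>
Second map emits:  < Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1>
Reduce output:     < Bye, 1> < Goodbye, 1> < Hadoop, 2> < Hello, 2> < World, 2>
```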
(This is a simple example to explain MapReduce jobs. You can also use a Combiner after the map phase, which first reduces the data locally at the map level before the final reduction happens at the reduce level.)
Setting up Hadoop using Docker
In the previous tutorial, Introduction of Docker and running it on Mac, I explained how to set up Docker. Please go through that post and set up Docker on your machine. Then run the command below to download and run the Hadoop container.
(I am using version 2.7.1. You can check the latest version on hadoop-docker.)
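A sketch of the run command, following the sequenceiq/hadoop-docker README (adjust the tag if you use a different version):

```
docker run -it sequenceiq/hadoop-docker:2.7.1 /etc/bootstrap.sh -bash
```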
To check that Hadoop is working, try to run a Grep job using the command below:
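A sketch, run from $HADOOP_PREFIX inside the container, using the examples jar that ships with Hadoop 2.7.1 (the grep pattern here is the one from the Hadoop quickstart; any regular expression works):

```
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'dfs[a-z.]+'
```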
(Here we are running an examples jar that contains many MapReduce jobs; you can find their source code in hadoop-examples. After the jar you specify which job you want to run. Here we run the grep job, which needs three parameters: the input directory, the output directory, and the grep pattern, respectively.)
Once you run this program, you can find the results in the output directory, which you can list with the command below:
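For example, assuming the output directory of the grep job above:

```
bin/hdfs dfs -ls output
```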
_SUCCESS is an empty file denoting the success of the MapReduce job. part-r-00000 contains the actual output, which you can view with the command below:
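For example:

```
bin/hdfs dfs -cat output/part-r-00000
```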
Running your MapReduce program in Hadoop:
Place the WordCount source file at $HADOOP_PREFIX/mapreducejob/WordCount.java.
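If you do not already have one, the canonical WordCount from the Hadoop MapReduce tutorial works; a minimal sketch:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit <word, 1> for every word in the input line
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts for each word
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local reduction on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```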
Compiling the program
There are different ways to compile a MapReduce program (a sketch of each is shown after this list):
1. Using only the minimal required libraries
2. The Hadoop minimalistic way
3. Using the actual Hadoop classpath
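All three options are sketched below, assuming Hadoop 2.7.1 under $HADOOP_PREFIX and the source file at $HADOOP_PREFIX/mapreducejob/WordCount.java; jar names and paths may differ on your installation:

```
cd $HADOOP_PREFIX/mapreducejob

# 1. Compile against only the jars WordCount actually needs
javac -classpath $HADOOP_PREFIX/share/hadoop/common/hadoop-common-2.7.1.jar:$HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.7.1.jar -d . WordCount.java

# 2. The Hadoop minimalistic way: let the hadoop command drive javac
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
$HADOOP_PREFIX/bin/hadoop com.sun.tools.javac.Main WordCount.java

# 3. Compile against the full classpath that hadoop itself reports
javac -classpath "$($HADOOP_PREFIX/bin/hadoop classpath)" -d . WordCount.java
```

After compiling, package the classes into a jar and run the job, for example:

```
jar cf wc.jar WordCount*.class
$HADOOP_PREFIX/bin/hadoop jar wc.jar WordCount input output
```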
If you get the following exception while running a job:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory ...
please clear the output directory first using:
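For example, assuming the job's output directory is named output:

```
bin/hdfs dfs -rm -r output
```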
You can also view the HDFS file system through the web UI, but for that you have to expose Docker port 50070. To do so, run the Hadoop image using:
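Again assuming the sequenceiq/hadoop-docker image; the -p flag publishes the NameNode web UI port (50070) to the host:

```
docker run -it -p 50070:50070 sequenceiq/hadoop-docker:2.7.1 /etc/bootstrap.sh -bash
```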
Now you can access the HDFS file system through the web at 127.0.0.1:50070.
Web View HDFS