
How to install and configure Big Data Hadoop in an hour or so (does not include your coffee break) May 25, 2015

Posted by Mich Talebzadeh in Big Data.

Pre-requisites

1) Have a Linux host running a respectable flavor of Linux. Mine is RHES 5.2 64-bit. I also have another host using RHES 5.2, 32-bit which also has Hadoop installed on it.
2) Basic familiarity with Linux commands
3) Oracle Java Development Kit (JDK) 1.7 or higher (Hadoop 2.7 no longer supports Java 6)
4) Configuring SSH access
5) Setting ulimit (optional)

Creating group and user on Linux

Log in as root to your Linux host and create a group called hadoop2 and a user hduser2. Note that since I already have group/user hadoop/hduser, I decided for the sake of this demo to create new ones on the same host.

[root@rhes564 ~]# groupadd hadoop2
[root@rhes564 ~]# useradd -G hadoop2 hduser2

You will see that a home directory for user hduser2 called /home/hduser2 is created.

ls -ltr /home
drwx------ 3 hduser2 hduser2 4096 May 24 09:30 hduser2
[root@rhes564 home]# su - hduser2
[hduser2@rhes564 ~]$ pwd
/home/hduser2

As root, you can edit the /etc/passwd file (vi /etc/passwd) to change the default shell for hduser2.
I changed it to the Korn shell:

hduser2:x:1013:1015::/home/hduser2:/bin/ksh
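
If you prefer not to edit /etc/passwd by hand, the same change can be made with usermod (equivalent result):

[root@rhes564 ~]# usermod -s /bin/ksh hduser2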

Now set the password for hduser2 as below

[root@rhes564 home]# passwd hduser2
Changing password for user hduser2.
New UNIX password:
Retype new UNIX password:
passwd: all authentication tokens updated successfully.

## log in as hduser2

su - hduser2
Password:
 id
uid=1013(hduser2) gid=1015(hduser2) groups=1015(hduser2)

So far so good. We have set up a new user hduser2 belonging to the group hadoop2. Make sure that the ownership is correct. As root do

[root@rhes564 ~]# chown -R hduser2:hadoop2 /home/hduser2

Now you need to set up your shells. The Korn shell uses two startup files, the .profile and the .kshrc. The .profile is read once, by your login ksh, while the .kshrc is read by each new ksh.
I put all my environment settings in the .kshrc file.

Remember that most of the Big Data stack is developed in Java, so you will need an up-to-date version of the JDK installed. The details for me are as follows:

$ which java
/usr/bin/java
$ java -version
java version "1.7.0_21"
Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)

My sample .profile is as follows:

$ cat .profile
stty erase ^?
PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/usr/bin/X11:/usr/X11R6/bin:/root/bin
unset LANG
export ENV=~/.kshrc
. ./.kshrc

I will come to this later.

Setting up ssh

Hadoop relies on ssh to access different nodes and to loop back to itself. Although this is a single-node setup, you will still need to set up ssh. It is pretty straightforward, and if you happen to be a DBA or developer, you have already done this many times for Oracle or Sybase accounts.

ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser2/.ssh/id_rsa):
Created directory '/home/hduser2/.ssh'.
Your identification has been saved in /home/hduser2/.ssh/id_rsa.
Your public key has been saved in /home/hduser2/.ssh/id_rsa.pub.
The key fingerprint is:
97:af:a3:18:f8:30:74:3b:73:53:07:4d:16:e7:1b:8a hduser2@rhes564

Enable SSH access to your local machine with this newly created key.

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
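
If ssh refuses the key, it is usually file permissions; sshd normally insists that the .ssh directory and authorized_keys file are writable only by the owner. Tightening them does no harm in any case:

chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys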

Now test ssh by connecting to localhost as user hduser2 and see if it works. This will also add an entry to the known_hosts file under $HOME/.ssh:

ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 21:75:c2:da:01:68:c4:ef:23:7b:d2:ac:e4:ef:06:02.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Address 127.0.0.1 maps to rhes564, but this does not map back to the address - POSSIBLE BREAK-IN ATTEMPT!

This should work. Otherwise run ssh in debug mode by typing ssh -vv localhost to diagnose the error.

Increasing ulimit parameter for user hduser2

You may find later that you require a file descriptor limit higher than the default value of 1024. As root, edit the file /etc/security/limits.conf and add the following lines at the bottom:

hduser2 soft nofile 4096
hduser2 hard nofile 63536

Reboot the host (a fresh login session is normally enough) for this to take effect. You will need to set it in your shell startup file for hduser2 as well:

ulimit -n 63536
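
After logging back in as hduser2, you can confirm that both the soft and hard limits have taken effect:

ulimit -Sn
ulimit -Hn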

Installing Hadoop

You can download Hadoop from the Apache Hadoop releases page. At the time of writing these notes (May 2015), the most recent release was 2.7.0. Download the binary tarball. It is around 200MB as shown below:

-rw-r--r-- 1 hduser2 hadoop2 210343364 May 24 11:07 hadoop-2.7.0.tar.gz

Unzip and untar the file. It will create a directory called hadoop-2.7.0 as below

drwxr-xr-x 9 hduser2 hadoop2 4096 Apr 10 19:51 hadoop-2.7.0
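
For reference, the extraction itself is a single command (assuming the tarball sits in hduser2's home directory):

cd $HOME
tar -xzf hadoop-2.7.0.tar.gz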

Now you need to set up your environment variables in your shell startup file. Mine is .kshrc:

#HADOOP VARIABLES START
export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=${HOME}/hadoop-2.7.0
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
#HADOOP VARIABLES END
export HADOOP_CLIENT_OPTS="-Xmx2g"
unset CLASSPATH
CLASSPATH=.:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.7.0-tests.jar:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.7.0.jar:hadoop-nfs-2.7.0.jar:$HIVE_HOME/conf
export CLASSPATH
ulimit -n 63536
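
Source the file (or log out and back in) and do a quick sanity check that the new PATH picks up the right binaries:

. ~/.kshrc
which hadoop
hadoop version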

The next step is to look at the directory tree for hadoop.

hduser2@rhes564::/home/hduser2> cd $HADOOP_HOME
hduser2@rhes564::/home/hduser2/hadoop-2.7.0> ls -ltr
total 52
drwxr-xr-x 4 hduser2 hadoop2 4096 Apr 10 19:51 share
drwxr-xr-x 2 hduser2 hadoop2 4096 Apr 10 19:51 libexec
drwxr-xr-x 3 hduser2 hadoop2 4096 Apr 10 19:51 lib
drwxr-xr-x 2 hduser2 hadoop2 4096 Apr 10 19:51 include
drwxr-xr-x 3 hduser2 hadoop2 4096 Apr 10 19:51 etc
drwxr-xr-x 2 hduser2 hadoop2 4096 Apr 10 19:51 bin
-rw-r--r-- 1 hduser2 hadoop2 1366 Apr 10 19:51 README.txt
-rw-r--r-- 1 hduser2 hadoop2 101 Apr 10 19:51 NOTICE.txt
-rw-r--r-- 1 hduser2 hadoop2 15429 Apr 10 19:51 LICENSE.txt
drwxr-xr-x 2 hduser2 hadoop2 4096 May 24 13:02 sbin

Note that unlike older installations there is no conf directory here. Older versions of Hadoop had a conf directory; this has been replaced with the etc directory, below which there is a sub-directory called hadoop that holds all the configuration files.

cd $HADOOP_HOME/etc/hadoop
ls -ltr
total 152
-rw-r--r-- 1 hduser2 hadoop2 690 Apr 10 19:51 yarn-site.xml
-rw-r--r-- 1 hduser2 hadoop2 4567 Apr 10 19:51 yarn-env.sh
-rw-r--r-- 1 hduser2 hadoop2 2250 Apr 10 19:51 yarn-env.cmd
-rw-r--r-- 1 hduser2 hadoop2 2268 Apr 10 19:51 ssl-server.xml.example
-rw-r--r-- 1 hduser2 hadoop2 2316 Apr 10 19:51 ssl-client.xml.example
-rw-r--r-- 1 hduser2 hadoop2 10 Apr 10 19:51 slaves
-rw-r--r-- 1 hduser2 hadoop2 758 Apr 10 19:51 mapred-site.xml.template
-rw-r--r-- 1 hduser2 hadoop2 4113 Apr 10 19:51 mapred-queues.xml.template
-rw-r--r-- 1 hduser2 hadoop2 1383 Apr 10 19:51 mapred-env.sh
-rw-r--r-- 1 hduser2 hadoop2 951 Apr 10 19:51 mapred-env.cmd
-rw-r--r-- 1 hduser2 hadoop2 11237 Apr 10 19:51 log4j.properties
-rw-r--r-- 1 hduser2 hadoop2 5511 Apr 10 19:51 kms-site.xml
-rw-r--r-- 1 hduser2 hadoop2 1631 Apr 10 19:51 kms-log4j.properties
-rw-r--r-- 1 hduser2 hadoop2 1527 Apr 10 19:51 kms-env.sh
-rw-r--r-- 1 hduser2 hadoop2 3518 Apr 10 19:51 kms-acls.xml
-rw-r--r-- 1 hduser2 hadoop2 620 Apr 10 19:51 httpfs-site.xml
-rw-r--r-- 1 hduser2 hadoop2 21 Apr 10 19:51 httpfs-signature.secret
-rw-r--r-- 1 hduser2 hadoop2 1657 Apr 10 19:51 httpfs-log4j.properties
-rw-r--r-- 1 hduser2 hadoop2 1449 Apr 10 19:51 httpfs-env.sh
-rw-r--r-- 1 hduser2 hadoop2 775 Apr 10 19:51 hdfs-site.xml
-rw-r--r-- 1 hduser2 hadoop2 9683 Apr 10 19:51 hadoop-policy.xml
-rw-r--r-- 1 hduser2 hadoop2 2598 Apr 10 19:51 hadoop-metrics2.properties
-rw-r--r-- 1 hduser2 hadoop2 2490 Apr 10 19:51 hadoop-metrics.properties
-rw-r--r-- 1 hduser2 hadoop2 4224 Apr 10 19:51 hadoop-env.sh
-rw-r--r-- 1 hduser2 hadoop2 3670 Apr 10 19:51 hadoop-env.cmd
-rw-r--r-- 1 hduser2 hadoop2 774 Apr 10 19:51 core-site.xml
-rw-r--r-- 1 hduser2 hadoop2 318 Apr 10 19:51 container-executor.cfg
-rw-r--r-- 1 hduser2 hadoop2 1335 Apr 10 19:51 configuration.xsl
-rw-r--r-- 1 hduser2 hadoop2 4436 Apr 10 19:51 capacity-scheduler.xml

Configuring Hadoop

The configuration files are either shell scripts (*.sh) or XML configuration files (*.xml). The important ones are explained below.

hadoop-env.sh

Edit the file hadoop-env.sh and set JAVA_HOME explicitly:

# The java implementation to use.
export JAVA_HOME=/usr/java/latest

This will work as long as ${JAVA_HOME} is set up in your start-up shell.

The XML files below work by specifying properties. To override the default value of a property, specify the new value within the tags, using the following format:

<property>
   <name> </name>
   <value> </value>
   <description> </description>
</property>

core-site.xml

The only parameter I have put here is fs.default.name (deprecated in Hadoop 2 in favour of fs.defaultFS, but still honoured). You need a URI whose scheme and authority determine the FileSystem implementation. The URI's authority is used to determine the host, port, etc. for the filesystem.

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://rhes564:19000</value>
</property>
</configuration>

So in my case I have the host as my host rhes564 and the port as 19000, which is the port the NameNode will listen on for HDFS. Many single-node guides use hdfs://localhost:9000 instead.
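
A quick way to confirm what Hadoop has actually picked up from core-site.xml (fs.default.name resolves to its newer name fs.defaultFS) is the getconf utility; in this setup it should print hdfs://rhes564:19000:

hdfs getconf -confKey fs.defaultFS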

mapred-site.xml

MapReduce Engine runs on Hadoop. MapReduce configuration options are stored in mapred-site.xml file. This file contains configuration information that overrides the default values for MapReduce parameters.

Copy mapred-site.xml.template to mapred-site.xml (the copy command is shown after the configuration block below). Edit this file and add the following lines:

<configuration>
<property>
 <name>mapreduce.framework.name</name>
 <value>yarn</value>
 <description> Execution framework set to Hadoop YARN </description>
</property>

<property>
 <name>mapreduce.job.tracker</name>
 <value>rhes564:54312</value>
 <description> The URL to track mapreduce jobs </description>
</property>

<property>
 <name>mapreduce.job.tracker.reserved.physicalmemory.mb</name>
 <value>1024</value>
 <description> The physical memory allocated for each job </description>
</property>

<property>
 <name>mapreduce.map.memory.mb</name>
 <value>2048</value>
 <description> mapreduce.map.memory.mb is the upper memory limit that Hadoop allows to be allocated to a mapper, in MB. </description>
</property>

<property>
 <name>mapreduce.reduce.memory.mb</name>
 <value>2048</value>
 <description> Larger resource limit for reduces </description>
</property>

<property>
 <name>mapreduce.map.java.opts</name>
 <value>-Xmx3072m</value>
 <description> Larger heap-size for child jvms of maps </description>
</property>

<property>
 <name>mapreduce.reduce.java.opts</name>
 <value>-Xmx6144m</value>
 <description> Larger heap-size for child jvms of reduces </description>
</property>

<property>
 <name>yarn.app.mapreduce.am.resource.mb</name>
 <value>400</value>
 <description> The amount of memory the MR AppMaster needs </description>
</property>
</configuration>
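
The template copy mentioned above is a straight file copy; assuming the default directory layout:

cd $HADOOP_HOME/etc/hadoop
cp mapred-site.xml.template mapred-site.xml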

hdfs-site.xml

One of the most important configuration files. It stores configuration settings for the Hadoop HDFS NameNode and DataNode, among other things. I have already discussed these two elsewhere. You will need to find a location with reasonable space where you want HDFS data to be stored (the directories referenced below should exist and be owned by hduser2; see the commands after the configuration).

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description> This sets the number of replicas. For a single node it is 1 </description>
</property>

<property>
<name>dfs.namenode.name.dir</name>
<value>file:/data4/hadoop/hadoop_store/hdfs/namenode</value>
<description> This is where HDFS metadata is stored </description>
</property>

<property>
<name>dfs.datanode.data.dir</name>
<value>file:/data4/hadoop/hadoop_store/hdfs/datanode</value>
<description> This is where HDFS data is stored </description>
</property>

<property>
<name>dfs.block.size</name>
<value>134217728</value>
<description> This is the default block size. It is set to 128*1024*1024 bytes or 128 MB </description>
</property>
</configuration>
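
It is worth creating the namenode and datanode directories up front and making sure hduser2 owns them. The /data4 path is simply where I have space; substitute your own. As root:

mkdir -p /data4/hadoop/hadoop_store/hdfs/namenode
mkdir -p /data4/hadoop/hadoop_store/hdfs/datanode
chown -R hduser2:hadoop2 /data4/hadoop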

yarn-site.xml

This file provides configuration parameters for YARN (Yet Another Resource Negotiator), the resource manager.

<configuration>
 <property>
 <name>yarn.nodemanager.aux-services</name>
 <value>mapreduce_shuffle</value>
 </property>
<property>
 <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
 <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
 <name>yarn.nodemanager.vmem-check-enabled</name>
 <value>false</value>
</property>
</configuration>

Formatting Hadoop Distributed File System (HDFS)

Before starting Hadoop we need to format the underlying file system. This is required when you set up Hadoop for the first time. Think of it as formatting your drive. You need to shut down Hadoop whenever you format HDFS. Remember that if you format it again you will lose all the data! To perform the format you use the following command:

hdfs namenode -format

If you have set up your PATH in your startup file (you have), then hdfs will be on the path. Otherwise it is under $HADOOP_HOME/bin/hdfs. This is the output from the command:

hdfs namenode -format
15/05/25 19:27:20 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = rhes564/50.140.197.217
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.7.0

----

15/05/25 19:27:21 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
15/05/25 19:27:21 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
15/05/25 19:27:21 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension = 30000
15/05/25 19:27:21 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
15/05/25 19:27:21 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
15/05/25 19:27:21 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
15/05/25 19:27:21 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
15/05/25 19:27:21 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
15/05/25 19:27:21 INFO util.GSet: Computing capacity for map NameNodeRetryCache
15/05/25 19:27:21 INFO util.GSet: VM type = 64-bit
15/05/25 19:27:21 INFO util.GSet: 0.029999999329447746% max memory 888.9 MB = 273.1 KB
15/05/25 19:27:21 INFO util.GSet: capacity = 2^15 = 32768 entries
Re-format filesystem in Storage Directory /data4/hadoop/hadoop_store/hdfs/namenode ? (Y or N) y
15/05/25 19:28:40 INFO namenode.FSImage: Allocated new BlockPoolId: BP-959407252-50.140.197.217-1432578520863
15/05/25 19:28:40 INFO common.Storage: Storage directory /data4/hadoop/hadoop_store/hdfs/namenode has been successfully formatted.
15/05/25 19:28:41 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
15/05/25 19:28:41 INFO util.ExitUtil: Exiting with status 0
15/05/25 19:28:41 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at rhes564/50.140.197.217
************************************************************/

If you run the command again, you will get the message

Re-format filesystem in Storage Directory /data4/hadoop/hadoop_store/hdfs/namenode ? (Y or N)

If you confirm it, it will be reformatted.

Putting the show on the road, Starting Hadoop

Go ahead and start the Hadoop daemons:

start-dfs.sh

15/05/25 19:51:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [rhes564]
rhes564: starting namenode, logging to /home/hduser2/hadoop-2.7.0/logs/hadoop-hduser2-namenode-rhes564.out
localhost: Address 127.0.0.1 maps to rhes564, but this does not map back to the address - POSSIBLE BREAK-IN ATTEMPT!
localhost: starting datanode, logging to /home/hduser2/hadoop-2.7.0/logs/hadoop-hduser2-datanode-rhes564.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: Address 127.0.0.1 maps to rhes564, but this does not map back to the address - POSSIBLE BREAK-IN ATTEMPT!
0.0.0.0: starting secondarynamenode, logging to /home/hduser2/hadoop-2.7.0/logs/hadoop-hduser2-secondarynamenode-rhes564.out
15/05/25 19:51:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

If you get an error saying JAVA_HOME is not set, make sure that you have modified hadoop-env.sh and set JAVA_HOME as above. The WARN messages about the native-hadoop library can be ignored.

To check that Hadoop started on port 19000 (as we set in the core-site.xml file), do the following:

netstat -plten|grep java

(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp 0 0 50.140.197.217:19000 0.0.0.0:* LISTEN 1013 31595 12540/java

There it is, running as process 12540. You can see it in full glory:

ps -efww | grep 12540
hduser2 12540 1 0 19:51 ? 00:00:04 /usr/java/latest/bin/java -Dproc_namenode -Xmx1000m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/home/hduser2/hadoop-2.7.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir= -Dhadoop.id.str=hduser2 -Dhadoop.root.logger=INFO,console -Djava.library.path= -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/home/hduser2/hadoop-2.7.0/logs -Dhadoop.log.file=hadoop-hduser2-namenode-rhes564.log -Dhadoop.home.dir=/home/hduser2/hadoop-2.7.0 -Dhadoop.id.str=hduser2 -Dhadoop.root.logger=INFO,RFA -Djava.library.path=/home/hduser2/hadoop-2.7.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.namenode.NameNode

You can start the other daemons as below:

yarn-daemon.sh start resourcemanager

starting resourcemanager, logging to /home/hduser2/hadoop-2.7.0/logs/yarn-hduser2-resourcemanager-rhes564.out

yarn-daemon.sh start nodemanager

starting nodemanager, logging to /home/hduser2/hadoop-2.7.0/logs/yarn-hduser2-nodemanager-rhes564.out

mr-jobhistory-daemon.sh start historyserver

starting historyserver, logging to /home/hduser2/hadoop-2.7.0/logs/mapred-hduser2-historyserver-rhes564.out
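
As an aside, the resourcemanager and nodemanager can also be started together with the start-yarn.sh script under $HADOOP_HOME/sbin (and stopped with stop-yarn.sh); the individual yarn-daemon.sh calls above simply do the same thing one daemon at a time.

start-yarn.sh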

If all goes well (and it should), you will hopefully get no error messages. Just to check all processes are running OK, do as follows at the command line using jps. jps (the JVM Process Status tool) should come back with the list. It is part of the JDK, lives under $JAVA_HOME/bin and is roughly equivalent to the ps command: it lists all Java processes of a user.

jps

13510 JobHistoryServer
12540 NameNode
12839 SecondaryNameNode
13127 ResourceManager
13374 NodeManager
13629 Jps
12637 DataNode
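
Before moving on, a quick smoke test against HDFS does no harm. The directory and file names here are only examples:

hdfs dfs -mkdir -p /user/hduser2
hdfs dfs -put $HOME/.profile /user/hduser2/
hdfs dfs -ls /user/hduser2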

Shutting down Hadoop

stop-dfs.sh

15/05/25 20:28:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping namenodes on [rhes564]
rhes564: stopping namenode
localhost: Address 127.0.0.1 maps to rhes564, but this does not map back to the address - POSSIBLE BREAK-IN ATTEMPT!
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: Address 127.0.0.1 maps to rhes564, but this does not map back to the address - POSSIBLE BREAK-IN ATTEMPT!
0.0.0.0: stopping secondarynamenode
15/05/25 20:28:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

yarn-daemon.sh stop resourcemanager

stopping resourcemanager

yarn-daemon.sh stop nodemanager

stopping nodemanager

mr-jobhistory-daemon.sh stop historyserver

stopping historyserver

Web Interfaces

Once Hadoop is up and running, you can use the following web interfaces for the various components:

Daemon		                        Web Interface		        Notes
NameNode		                http://host:50070/		Default HTTP port is 50070.
ResourceManager (MapReduce jobs)	http://host:8088/        	Default HTTP port is 8088.

In this setup that means http://rhes564:50070/ for the NameNode UI and http://rhes564:8088/ for tracking MapReduce jobs.
