
Working with Hadoop command lines May 30, 2015

Posted by Mich Talebzadeh in Uncategorized.

In my earlier blog I covered the installation and configuration of Hadoop. We will now focus on working with Hadoop.

When I say working with Hadoop, I mean working with HDFS and the MapReduce engine. If you are familiar with Unix/Linux commands, then it will be easier to find your way through the hadoop commands.

Let us start by looking at the root directory of HDFS. Hadoop file system commands have the following notation:

hdfs dfs

To see the files under “/” we do

hdfs dfs -ls /
Found 5 items
drwxr-xr-x   - hduser supergroup          0 2015-04-26 18:35 /abc
drwxr-xr-x   - hduser supergroup          0 2015-05-03 09:08 /system
drwxrwx---   - hduser supergroup          0 2015-04-14 06:46 /tmp
drwxr-xr-x   - hduser supergroup          0 2015-04-14 09:51 /user
drwxr-xr-x   - hduser supergroup          0 2015-04-24 20:42 /xyz

If you want to go one level further, you can do

hdfs dfs -ls /user
Found 2 items
drwxr-xr-x   - hduser supergroup          0 2015-05-16 08:41 /user/hduser
drwxr-xr-x   - hduser supergroup          0 2015-04-22 16:25 /user/hive

The complete list of Hadoop file system shell commands is given here.
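Most of the familiar file system operations have a direct equivalent. A few common ones, as a quick sketch (the file and directory names below are just examples):

hdfs dfs -mkdir /user/hduser/test              # create a directory in HDFS
hdfs dfs -put myfile.txt /user/hduser/test     # copy a local file into HDFS
hdfs dfs -cat /user/hduser/test/myfile.txt     # display its contents
hdfs dfs -get /user/hduser/test/myfile.txt .   # copy it back to the local file system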

You can run the same command from a remote host, as long as you have the Hadoop software installed on that host. You qualify the target by specifying the hostname and port number on which Hadoop is running:

hdfs dfs -ls hdfs://rhes564:9000/user
Found 2 items
drwxr-xr-x - hduser supergroup 0 2015-05-16 08:41 hdfs://rhes564:9000/user/hduser
drwxr-xr-x - hduser supergroup 0 2015-04-22 16:25 hdfs://rhes564:9000/user/hive

You can put a file from a remote host into HDFS as follows. As an example, first create a simple text file (here it just contains the word hostname):

echo hostname > hostname.txt

Now put that file in hadoop on rhes564

hdfs dfs -put hostname.txt hdfs://rhes564:9000/user/hduser

Check that the file is stored in HDFS OK

hdfs dfs -ls hdfs://rhes564:9000/user/hduser
-rw-r--r--   2 hduser supergroup          9 2015-05-30 19:49 hdfs://rhes564:9000/user/hduser/hostname.txt

Note the full pathname of the file in HDFS. We can go ahead and delete that txt file.

hdfs dfs -rm /user/hduser/hostname.txt
15/05/30 21:10:23 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hduser/hostname.txt

To remove a full directory you can perform

hdfs dfs -rm -r /abc
Deleted /abc

To see the sizes of files and directories under / you can do

 hdfs dfs -du /
0            /system
4033091893   /tmp
22039558766  /user
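The raw byte counts above are hard to read. If you prefer human readable units, the -h flag should do it (a small sketch):

hdfs dfs -du -h /      # sizes in KB/MB/GB rather than bytes
hdfs dfs -df -h /      # overall HDFS capacity, used and remaining space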

How to install and configure Big Data Hadoop in an hour or so (does not include your coffee break) May 25, 2015

Posted by Mich Talebzadeh in Big Data.

Pre-requisites

1) Have a Linux host running a respectable flavor of Linux. Mine is RHES 5.2 64-bit. I also have another host using RHES 5.2, 32-bit which also has Hadoop installed on it.
2) Basic familiarity with Linux commands
3) Oracle Java Development Kit (JDK) 1.7 or higher (Hadoop 2.7 requires Java 7)
4) Configuring SSH access
5) Setting ulimit (optional)

Creating group and user on Linux

Log in as root to your Linux host and create a group called hadoop2 and a user hduser2. Note that since I already have group/user hadoop/hduser, I decided for the sake of this demo to create new ones on the same host.

[root@rhes564 ~]# groupadd hadoop2
[root@rhes564 ~]# useradd -G hadoop2 hduser2

You will see that a home directory for user hduser2 called /home/hduser2 is created.

ls -ltr /home
drwx------ 3 hduser2 hduser2 4096 May 24 09:30 hduser2
[root@rhes564 home]# su - hduser2
[hduser2@rhes564 ~]$ pwd
/home/hduser2

You can edit the /etc/passwd file (as root) to change the default shell for hduser2. I changed it to the Korn shell:

hduser2:x:1013:1015::/home/hduser2:/bin/ksh

Now set the password for hduser2 as below

[root@rhes564 home]# passwd hduser2
Changing password for user hduser2.
New UNIX password:
Retype new UNIX password:
passwd: all authentication tokens updated successfully.

## log in as hduser2

su - hduser2
Password:
 id
uid=1013(hduser2) gid=1015(hduser2) groups=1015(hduser2)

So far so good. We have set up a new user hduser2 belonging to the group hadoop2. Make sure that the ownership is correct. As root do

[root@rhes564 ~]# chown -R hduser2:hadoop2 /home/hduser2

Now you need to set up your shells. The Korn shell uses two startup files, the .profile and the .kshrc. The .profile is read once, by your login ksh, while the .kshrc is read by each new ksh.
I put all my environment settings in the .kshrc file.

Remember that most of this Big Data software is developed in Java, so you will need an up-to-date version of the JDK installed. The details for me are as follows:

$ which java
/usr/bin/java
$ java -version
java version "1.7.0_21"
Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)

My sample .profile is as follows:

$ cat .profile
stty erase ^?
PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/usr/bin/X11:/usr/X11R6/bin:/root/bin
unset LANG
export ENV=~/.kshrc
. ./.kshrc

I will come to this later.

Setting up ssh

Hadoop relies on ssh to access different nodes and to loop back to itself. Although this is a single node setup, you will still need to set up ssh. It is pretty straightforward, and if you happen to be a DBA or developer, you have already done this many times for Oracle or Sybase accounts.

ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser2/.ssh/id_rsa):
Created directory '/home/hduser2/.ssh'.
Your identification has been saved in /home/hduser2/.ssh/id_rsa.
Your public key has been saved in /home/hduser2/.ssh/id_rsa.pub.
The key fingerprint is:
97:af:a3:18:f8:30:74:3b:73:53:07:4d:16:e7:1b:8a hduser2@rhes564

Enable SSH access to your local machine with this newly created key.

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
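ssh is strict about file permissions; if the key-based login later still prompts for a password, it is usually worth checking that the .ssh directory and the authorized_keys file are not group or world writable (a quick sketch):

chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys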

Now test ssh by connecting to localhost as user hduser2 and see if it works. It will also add an entry to the known_hosts file under the $HOME/.ssh directory.

ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 21:75:c2:da:01:68:c4:ef:23:7b:d2:ac:e4:ef:06:02.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Address 127.0.0.1 maps to rhes564, but this does not map back to the address - POSSIBLE BREAK-IN ATTEMPT!

This should work. Otherwise, Google the ssh error and run ssh in debug mode by typing ssh -vv localhost to diagnose the problem.

Increasing ulimit parameter for user hduser2

You may find later that you will require a file descriptor limit higher than the default value of 1024. As root, edit the file /etc/security/limits.conf and add the following lines at the bottom:

hduser2 soft nofile 4096
hduser2 hard nofile 63536

Reboot the host for this to take effect. You will need to set it in the shell startup file for hduser2 as well:

ulimit -n 63536
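You can verify the limits that are actually in force for the hduser2 session as below (a quick check, nothing more):

ulimit -Sn     # current soft limit for open files
ulimit -Hn     # current hard limit for open files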

Installing Hadoop

You can download Hadoop from the Apache Hadoop site here. At the time of writing these notes (May 2015), the most recent release was 2.7.0. Download the binary tarball. It is around 200 MB, as shown below:

-rw-r--r-- 1 hduser2 hadoop2 210343364 May 24 11:07 hadoop-2.7.0.tar.gz

Unzip and untar the file. It will create a directory called hadoop-2.7.0 as below

drwxr-xr-x 9 hduser2 hadoop2 4096 Apr 10 19:51 hadoop-2.7.0
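For reference, the unpacking itself can be done in one step from the directory where the tarball was downloaded (a sketch; adjust the path as needed):

cd ~
tar -xzf hadoop-2.7.0.tar.gz     # unzip and untar in one go, creating ./hadoop-2.7.0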

Now you need to set up your environment variables in your shell startup file. Mine is .kshrc:

#HADOOP VARIABLES START
export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=${HOME}/hadoop-2.7.0
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
#HADOOP VARIABLES END
export HADOOP_CLIENT_OPTS="-Xmx2g"
unset CLASSPATH
CLASSPATH=.:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.7.0-tests.jar:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.7.0.jar:hadoop-nfs-2.7.0.jar:$HIVE_HOME/conf
export CLASSPATH
ulimit -n 63536
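Once the startup file has been edited, it is worth re-reading it and confirming that the Hadoop binaries resolve correctly (a quick sanity check):

. ~/.kshrc              # re-read the startup file in the current shell
echo $HADOOP_HOME       # should print /home/hduser2/hadoop-2.7.0
hadoop version          # should report Hadoop 2.7.0 if the PATH is correct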

The next step is to look at the directory tree for hadoop.

hduser2@rhes564::/home/hduser2> cd $HADOOP_HOME
hduser2@rhes564::/home/hduser2/hadoop-2.7.0> ls -ltr
total 52
drwxr-xr-x 4 hduser2 hadoop2 4096 Apr 10 19:51 share
drwxr-xr-x 2 hduser2 hadoop2 4096 Apr 10 19:51 libexec
drwxr-xr-x 3 hduser2 hadoop2 4096 Apr 10 19:51 lib
drwxr-xr-x 2 hduser2 hadoop2 4096 Apr 10 19:51 include
drwxr-xr-x 3 hduser2 hadoop2 4096 Apr 10 19:51 etc
drwxr-xr-x 2 hduser2 hadoop2 4096 Apr 10 19:51 bin
-rw-r--r-- 1 hduser2 hadoop2 1366 Apr 10 19:51 README.txt
-rw-r--r-- 1 hduser2 hadoop2 101 Apr 10 19:51 NOTICE.txt
-rw-r--r-- 1 hduser2 hadoop2 15429 Apr 10 19:51 LICENSE.txt
drwxr-xr-x 2 hduser2 hadoop2 4096 May 24 13:02 sbin

Note that, unlike older installations, there is no conf directory here. Older versions of Hadoop had a conf directory; this has now been replaced with the etc directory. Below etc there is a sub-directory called hadoop that holds all the configuration files.

cd $HADOOP_HOME/etc/hadoop
ls -ltr
total 152
-rw-r--r-- 1 hduser2 hadoop2 690 Apr 10 19:51 yarn-site.xml
-rw-r--r-- 1 hduser2 hadoop2 4567 Apr 10 19:51 yarn-env.sh
-rw-r--r-- 1 hduser2 hadoop2 2250 Apr 10 19:51 yarn-env.cmd
-rw-r--r-- 1 hduser2 hadoop2 2268 Apr 10 19:51 ssl-server.xml.example
-rw-r--r-- 1 hduser2 hadoop2 2316 Apr 10 19:51 ssl-client.xml.example
-rw-r--r-- 1 hduser2 hadoop2 10 Apr 10 19:51 slaves
-rw-r--r-- 1 hduser2 hadoop2 758 Apr 10 19:51 mapred-site.xml.template
-rw-r--r-- 1 hduser2 hadoop2 4113 Apr 10 19:51 mapred-queues.xml.template
-rw-r--r-- 1 hduser2 hadoop2 1383 Apr 10 19:51 mapred-env.sh
-rw-r--r-- 1 hduser2 hadoop2 951 Apr 10 19:51 mapred-env.cmd
-rw-r--r-- 1 hduser2 hadoop2 11237 Apr 10 19:51 log4j.properties
-rw-r--r-- 1 hduser2 hadoop2 5511 Apr 10 19:51 kms-site.xml
-rw-r--r-- 1 hduser2 hadoop2 1631 Apr 10 19:51 kms-log4j.properties
-rw-r--r-- 1 hduser2 hadoop2 1527 Apr 10 19:51 kms-env.sh
-rw-r--r-- 1 hduser2 hadoop2 3518 Apr 10 19:51 kms-acls.xml
-rw-r--r-- 1 hduser2 hadoop2 620 Apr 10 19:51 httpfs-site.xml
-rw-r--r-- 1 hduser2 hadoop2 21 Apr 10 19:51 httpfs-signature.secret
-rw-r--r-- 1 hduser2 hadoop2 1657 Apr 10 19:51 httpfs-log4j.properties
-rw-r--r-- 1 hduser2 hadoop2 1449 Apr 10 19:51 httpfs-env.sh
-rw-r--r-- 1 hduser2 hadoop2 775 Apr 10 19:51 hdfs-site.xml
-rw-r--r-- 1 hduser2 hadoop2 9683 Apr 10 19:51 hadoop-policy.xml
-rw-r--r-- 1 hduser2 hadoop2 2598 Apr 10 19:51 hadoop-metrics2.properties
-rw-r--r-- 1 hduser2 hadoop2 2490 Apr 10 19:51 hadoop-metrics.properties
-rw-r--r-- 1 hduser2 hadoop2 4224 Apr 10 19:51 hadoop-env.sh
-rw-r--r-- 1 hduser2 hadoop2 3670 Apr 10 19:51 hadoop-env.cmd
-rw-r--r-- 1 hduser2 hadoop2 774 Apr 10 19:51 core-site.xml
-rw-r--r-- 1 hduser2 hadoop2 318 Apr 10 19:51 container-executor.cfg
-rw-r--r-- 1 hduser2 hadoop2 1335 Apr 10 19:51 configuration.xsl
-rw-r--r-- 1 hduser2 hadoop2 4436 Apr 10 19:51 capacity-scheduler.xml

Configuring Hadoop

The configuration files are either shell files (*.sh) or XML configuration files (*.xml). The important ones are explained below.

hadoop-env.sh

Edit the file hadoop-env.sh and set JAVA_HOME explicitly:

# The java implementation to use.
export JAVA_HOME=/usr/java/latest

This will work as long as ${JAVA_HOME} is set up in your start-up shell.

The XML files below work by specifying properties. To override the default value of a property, specify the new value within the tags, using the following format:

<property>
   <name> </name>
   <value> </value>
   <description> </description>
</property>

core-site.xml

The only parameter I have put here is fs.default.name. You need a URI whose scheme and authority determine the FileSystem implementation. The URI’s authority is used to determine the host, port, etc. for the filesystem.

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://rhes564:19000</value>
</property>
</configuration>

So in my case the host is my host rhes564 and the port is 19000. Note that 19000 will be the port Hadoop (the NameNode) is listening on. By default it is localhost:9000.
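You can double-check the value that Hadoop actually picks up with the getconf utility (a quick check):

hdfs getconf -confKey fs.default.name     # should print hdfs://rhes564:19000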

mapred-site.xml

The MapReduce engine runs on Hadoop. MapReduce configuration options are stored in the mapred-site.xml file. This file contains configuration information that overrides the default values for the MapReduce parameters.

Copy mapred-site.xml.template to mapred-site.xml. Edit this file and add the following lines:

<configuration>
<property>
 <name>mapreduce.framework.name</name>
 <value>yarn</value>
 <description> Execution framework set to Hadoop YARN </description>
</property>

<property>
 <name>mapreduce.job.tracker</name>
 <value>rhes564:54312</value>
 <description> The URL to track mapreduce jobs </description>
</property>

<property>
 <name>mapreduce.job.tracker.reserved.physicalmemory.mb</name>
 <value>1024</value>
 <description> The physical memory allocated for each job </description>
</property>

<property>
 <name>mapreduce.map.memory.mb</name>
 <value>2048</value>
 <description> mapreduce.map.memory.mb is the upper memory limit that Hadoop allows to be allocated to a mapper, in MB. </description>
</property>

<property>
 <name>mapreduce.reduce.memory.mb</name>
 <value>2048</value>
 <description> Larger resource limit for reduces </description>
</property>

<property>
 <name>mapreduce.map.java.opts</name>
 <value>-Xmx3072m</value>
 <description> Larger heap-size for child jvms of maps </description>
</property>

<property>
 <name>mapreduce.reduce.java.opts</name>
 <value>-Xmx6144m</value>
 <description> Larger heap-size for child jvms of reduces </description>
</property>

<property>
 <name>yarn.app.mapreduce.am.resource.mb</name>
 <value>400</value>
 <description> The amount of memory the MR AppMaster needs </description>
</property>
</configuration>

hdfs-site.xml

This is one of the most important configuration files. It stores configuration settings for the Hadoop HDFS NameNode and DataNode, among other things. I have already discussed these two elsewhere. You will need to find reasonable space for where you want the HDFS data to be stored.

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description> This sets the number of replicas. For a single node it is 1 </description>
</property>

<property>
<name>dfs.namenode.name.dir</name>
<value>file:/data4/hadoop/hadoop_store/hdfs/namenode</value>
<description> This is where HDFS metadata is stored </description>
</property>

<property>
<name>dfs.datanode.data.dir</name>
<value>file:/data4/hadoop/hadoop_store/hdfs/datanode</value>
<description> This is where HDFS data is stored </description>
</property>

<property>
<name>dfs.block.size</name>
<value>134217728</value>
<description> This is the default block size. It is set to 128*1024*1024 bytes or 128 MB </description>
</property>
</configuration>
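Note that the namenode and datanode directories referenced above must exist and be owned by the Hadoop user before you format HDFS. Something along these lines, run as root with the paths taken from the configuration above:

mkdir -p /data4/hadoop/hadoop_store/hdfs/namenode
mkdir -p /data4/hadoop/hadoop_store/hdfs/datanode
chown -R hduser2:hadoop2 /data4/hadoop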

yarn-site.xml

This file provides configuration parameters for the resource manager, YARN (Yet Another Resource Negotiator).

<configuration>
 <property>
 <name>yarn.nodemanager.aux-services</name>
 <value>mapreduce_shuffle</value>
 </property>
<property>
 <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
 <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
 <name>yarn.nodemanager.vmem-check-enabled</name>
 <value>false</value>
</property>
</configuration>

Formatting Hadoop Distributed File System (HDFS)

Before starting Hadoop we need to format the underlying file system. This is required when you set up Hadoop for the first time. Think of it as formatting your drive. You need to shut down Hadoop whenever you format HDFS. Remember, if you format it again you will lose all data! To perform the format you use the following command:

hdfs namenode -format

If you have set up your PATH in your startup file (as we did above), then hdfs will be on the path. Otherwise it is under $HADOOP_HOME/bin/hdfs. This is the output from the command:

hdfs namenode -format
15/05/25 19:27:20 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = rhes564/50.140.197.217
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.7.0

----

15/05/25 19:27:21 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
15/05/25 19:27:21 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
15/05/25 19:27:21 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension = 30000
15/05/25 19:27:21 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
15/05/25 19:27:21 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
15/05/25 19:27:21 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
15/05/25 19:27:21 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
15/05/25 19:27:21 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
15/05/25 19:27:21 INFO util.GSet: Computing capacity for map NameNodeRetryCache
15/05/25 19:27:21 INFO util.GSet: VM type = 64-bit
15/05/25 19:27:21 INFO util.GSet: 0.029999999329447746% max memory 888.9 MB = 273.1 KB
15/05/25 19:27:21 INFO util.GSet: capacity = 2^15 = 32768 entries
Re-format filesystem in Storage Directory /data4/hadoop/hadoop_store/hdfs/namenode ? (Y or N) y
15/05/25 19:28:40 INFO namenode.FSImage: Allocated new BlockPoolId: BP-959407252-50.140.197.217-1432578520863
15/05/25 19:28:40 INFO common.Storage: Storage directory /data4/hadoop/hadoop_store/hdfs/namenode has been successfully formatted.
15/05/25 19:28:41 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
15/05/25 19:28:41 INFO util.ExitUtil: Exiting with status 0
15/05/25 19:28:41 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at rhes564/50.140.197.217
************************************************************/

If you run the command again, you will get the message

Re-format filesystem in Storage Directory /data4/hadoop/hadoop_store/hdfs/namenode ? (Y or N)

If you confirm it, it will be reformatted.

Putting the show on the road, Starting Hadoop

Go ahead and start the Hadoop daemons:

start-dfs.sh

15/05/25 19:51:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [rhes564]
rhes564: starting namenode, logging to /home/hduser2/hadoop-2.7.0/logs/hadoop-hduser2-namenode-rhes564.out
localhost: Address 127.0.0.1 maps to rhes564, but this does not map back to the address - POSSIBLE BREAK-IN ATTEMPT!
localhost: starting datanode, logging to /home/hduser2/hadoop-2.7.0/logs/hadoop-hduser2-datanode-rhes564.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: Address 127.0.0.1 maps to rhes564, but this does not map back to the address - POSSIBLE BREAK-IN ATTEMPT!
0.0.0.0: starting secondarynamenode, logging to /home/hduser2/hadoop-2.7.0/logs/hadoop-hduser2-secondarynamenode-rhes564.out
15/05/25 19:51:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

If you get an error that JAVA_HOME is not set, make sure that you have modified and set JAVA_HOME in hadoop-env.sh. Ignore the WARN messages.

To check that your Hadoop started on port 19000 (as we set in core-site.xml file), do the following

netstat -plten|grep java

(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp 0 0 50.140.197.217:19000 0.0.0.0:* LISTEN 1013 31595 12540/java

There it is, running as process 12540. You can see it in full glory:

ps -efww | grep 12540
hduser2 12540 1 0 19:51 ? 00:00:04 /usr/java/latest/bin/java -Dproc_namenode -Xmx1000m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/home/hduser2/hadoop-2.7.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir= -Dhadoop.id.str=hduser2 -Dhadoop.root.logger=INFO,console -Djava.library.path= -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/home/hduser2/hadoop-2.7.0/logs -Dhadoop.log.file=hadoop-hduser2-namenode-rhes564.log -Dhadoop.home.dir=/home/hduser2/hadoop-2.7.0 -Dhadoop.id.str=hduser2 -Dhadoop.root.logger=INFO,RFA -Djava.library.path=/home/hduser2/hadoop-2.7.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.namenode.NameNode

You can start other processes as below

yarn-daemon.sh start resourcemanager

starting resourcemanager, logging to /home/hduser2/hadoop-2.7.0/logs/yarn-hduser2-resourcemanager-rhes564.out

yarn-daemon.sh start nodemanager

starting nodemanager, logging to /home/hduser2/hadoop-2.7.0/logs/yarn-hduser2-nodemanager-rhes564.out

mr-jobhistory-daemon.sh start historyserver

starting historyserver, logging to /home/hduser2/hadoop-2.7.0/logs/mapred-hduser2-historyserver-rhes564.out

If all goes well (and it should), you will get no error messages. Just to check that all processes are running OK, use jps at the command line. jps (the JVM Process Status tool) should come back with the list. It is part of the JDK and lives under $JAVA_HOME/bin. It is the Java equivalent of the ps command: it lists all the Java processes of a user.

jps

13510 JobHistoryServer
12540 NameNode
12839 SecondaryNameNode
13127 ResourceManager
13374 NodeManager
13629 Jps
12637 DataNode

Shutting down Hadoop

stop-dfs.sh

15/05/25 20:28:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping namenodes on [rhes564]
rhes564: stopping namenode
localhost: Address 127.0.0.1 maps to rhes564, but this does not map back to the address - POSSIBLE BREAK-IN ATTEMPT!
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: Address 127.0.0.1 maps to rhes564, but this does not map back to the address - POSSIBLE BREAK-IN ATTEMPT!
0.0.0.0: stopping secondarynamenode
15/05/25 20:28:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

yarn-daemon.sh stop resourcemanager

stopping resourcemanager

yarn-daemon.sh stop nodemanager

stopping nodemanager

mr-jobhistory-daemon.sh stop historyserver

stopping historyserver

Web Interfaces

Once Hadoop is up and running, you can use the following web interfaces for the various components:

Daemon                  Web Interface              Notes
NameNode                http://host:50070/         Default HTTP port is 50070.
MapReduce Jobs          http://host:8088/          Default HTTP port is 8088 (YARN ResourceManager).
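A quick way to confirm from the command line that the web interfaces are actually serving pages, without opening a browser (a sketch using curl; the host name is mine):

curl -sL http://rhes564:50070/ | head -5     # NameNode web UI
curl -sL http://rhes564:8088/  | head -5     # ResourceManager (MapReduce jobs) web UI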

Big Data in Action for those familiar with relational databases and SQL, part 2 May 16, 2015

Posted by Mich Talebzadeh in Big Data.

What is schema on write

Most people who deal with relational databases are familiar with schema on write. In an RDBMS there are schemas/databases, tables, other objects and relationships. Every table has a well defined structure with column names, data types and column-level constraints, among other things. A developer needs to know the schema before he/she can insert a row into that table. One cannot insert a varchar value into an integer column. Even adding a column to a table will require planning and code changes. If you are replicating the table then the replication definition will have to be modified as well to reflect that additional column, otherwise data will be out of sync at the replicate. So, in short, you need to know the schema at the time of writing to the database. This is “schema on write”; that is, the schema is applied when the data is being written to the database.

What is schema on read

In contrast to “schema on write”, “schema on read” is applied as the data is being read off the storage. The idea is that we decide what the schema is when we analyze the data, as opposed to when we write the data. In other words, with the massive heap that we have stored in Hadoop, we decide what to take out and analyze (and make sense out of it). So schema on read means write your data first and figure out what it is later. Schema on write means figure out what your data is first, then write it.

Moving on

So there are a number of points to note. In schema on write we sort out and clean data first. In other words we do Extract, Transform and Load (ETL) before storing data in the source table. In schema on read, we add the data as is and then decide how to do the ETL, slicing and dicing the data later. So when we hear that a very large amount of data is loaded daily into Hadoop, that basically means that the base table is augmented with new inserts, new updates and new deletes. OK. I am sure you will be happy with new inserts. That is the data inserted into the source table in Oracle or Sybase. How about updates or deletes? Unlike Oracle or Sybase, we don’t go and update or delete rows in place. In Hive the correct mechanism (that you need to devise) is that every updated or deleted row becomes a new row/entry added to the Hive table, with the correct flag and timestamp to identify the nature of the DML.

There are three record types added to a Hive table in any time interval. They follow the general CRUD (Create, Read, Update, Delete) rules. Within the entity life history, every relational table will have one insert for a given Primary Key (PK), one delete for the same PK and many updates for the same PK. So within a day we will end up with new records, new updates and deleted records. Crucially, a row may in theory be updated an unlimited number of times within a time period, but what matters is the most “recent” update as far as a relational table is concerned. In contrast, the said table in Hive will be a running total of the original direct loads from the Oracle or Sybase table plus all new operations (CRUD) as well. For those familiar with replication, you sync the replicate once and let replication take care of new inserts, updates and deletes thereafter. The crucial point is that rather than updating or deleting rows in place, you append the updated row or deleted row as a new row to the Hive table. Well, that is the art of Big Data, so to speak. I will come back to how to do the ETL in Hive as well.
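To make the idea concrete, here is a minimal sketch wrapped in a hive -e call. The table and column names (trades_history, trade_id, op_type, op_timestamp) are purely hypothetical, but the pattern is the one described above: every insert, update or delete lands as a new row, and a windowed query picks out the latest version of each primary key:

hive -e "
SELECT x.trade_id, x.op_type, x.op_timestamp
FROM (
      SELECT t.*,
             row_number() OVER (PARTITION BY trade_id ORDER BY op_timestamp DESC) AS rn
      FROM trades_history t
     ) x
WHERE x.rn = 1
  AND x.op_type <> 'D'    -- skip keys whose most recent operation was a delete
"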

On the face of it this looks a bit messy. However, we have to take into account the enormous opportunities that this technique offers in terms of periodic reports of what happened, projections, opportunities etc. As a simple example, if I were an auditor, I could go and look at when a trade was created, how often it was updated, when it was settled and when it was finally purged (deleted). The source Oracle or Sybase table may have deleted the record; however, the record will always be accessible in Hive. In other words, as mentioned before, Hive offers the economies of cheap storage plus schema on read.

Cheap storage means that in theory you can use any storage without the need for RAID or SAN. Well, not exactly rushing to PC World to buy yourself some extra hard drives :). These can be JBOD. The name JBOD, just a bunch of disks, refers to an architecture involving multiple hard drives, much like commodity hardware, using them as independent hard drives or as a combined (spanned) single logical volume with no actual RAID functionality. This makes sense as HDFS takes on the role of redundancy by keeping multiple copies of the data on different disks within the HDFS cluster. Having said that, I have seen Big Data applications installed on SAN disks, which sort of defeats the purpose of having redundant data within the cluster.

Setting up Hive

The easiest way to install Hive is to create it under the Hadoop account in Linux. For example, my Hadoop setup runs under user hduser:hadoop


id
uid=1009(hduser) gid=1010(hadoop) groups=1010(hadoop)

The recommendation is that you install Hive under the /usr/lib/ directory. Briefly, use your root account to create the directory /usr/lib/hive and unzip and untar the Hive binaries under that directory. Change the ownership of the directory and files to hduser:hadoop.

You need to modify your start up shell to include the following:

 
export HIVE_HOME=/usr/lib/hive
export PATH=$HIVE_HOME/bin:$PATH

All Big Data components like Hadoop, Hive etc come with various configuration files. In Hive this file is called hive-site.xml and lives under $HIVE_HOME/conf. It is akin to the configuration file SPFILE.ora in Oracle or the cfg file in Sybase ASE. You configure Hive using this xml file.

Let us understand the directory tree under Hive.


hduser@rhes564::/usr/lib/hive> ls -ltr
total 396
drwxr-xr-x 4 hduser hadoop 4096 Mar 7 22:38 examples
drwxr-xr-x 7 hduser hadoop 4096 Mar 7 22:38 hcatalog
drwxr-xr-x 3 hduser hadoop 4096 Mar 7 22:38 scripts
-rw-r--r-- 1 hduser hadoop 340611 Mar 7 22:38 RELEASE_NOTES.txt
-rw-r--r-- 1 hduser hadoop 4048 Mar 7 22:38 README.txt
-rw-r--r-- 1 hduser hadoop 277 Mar 7 22:38 NOTICE
-rw-r--r-- 1 hduser hadoop 23828 Mar 7 22:38 LICENSE
drwxr-xr-x 5 hduser hadoop 4096 Mar 9 08:24 lib
drwxr-xr-x 3 hduser hadoop 4096 Mar 9 09:03 bin
drwxr-xr-x 2 hduser hadoop 4096 Apr 24 23:58 conf

The important ones are the scripts, bin and conf directories. Under the conf directory you will find hive-site.xml. Under the bin directory you have all the executables:


hduser@rhes564::/usr/lib/hive/bin> ls -ltr
total 32
-rwxr-xr-x 1 hduser hadoop 884 Mar 7 22:38 schematool
-rwxr-xr-x 1 hduser hadoop 832 Mar 7 22:38 metatool
-rwxr-xr-x 1 hduser hadoop 885 Mar 7 22:38 hiveserver2
-rwxr-xr-x 1 hduser hadoop 1900 Mar 7 22:38 hive-config.sh
-rwxr-xr-x 1 hduser hadoop 7311 Mar 7 22:38 hive
drwxr-xr-x 3 hduser hadoop 4096 Mar 7 22:38 ext
-rwxr-xr-x 1 hduser hadoop 881 Mar 7 22:38 beeline

Make sure that you have added ${HIVE_HOME}/bin to your path. Believe it or not, there is no direct command that I know of to report the version of Hive. However, you can work out the version by checking the following:


cd /usr/lib/hive/lib

ls hive-hwi*
hive-hwi-0.14.0.jar

That shows that my hive version is 0.14.

Now, Hive is a data warehouse and it needs to store its metadata somewhere. This is very similar to the data stored in the Oracle SYS tablespace or the Sybase master database. It is called the metastore. Remember that in Hive you can create different databases as well.

The metastore is by default created locally in the directory from which you start your Hive session, so you may end up with multiple copies of the metastore lying around. However, you can modify your hive-site.xml to specify where Hive should look for its metastore.

I personally don’t like metadata being stored in a local file. The good thing about Hive is that it allows the metastore to be kept in any RDBMS that supports JDBC. Remember Hive, like Hadoop, is written in Java.

Hive has SQL scripts to create its metadata schema on Oracle, MySQL, PostgreSQL and other databases. These are under /usr/lib/hive/scripts/metastore/upgrade. I created one for Sybase ASE as well.

Let us take the example of creating the Hive metastore in Oracle. I like Oracle as it offers (IMO) the best concurrency and lock management among RDBMSs. Concurrency becomes important when we load data into Hive.

Metastore is a tiny schema so it can be accommodated on any instance. Moreover, you can use Oracle queries to see what is going on and back it up or restore it.

The task is pretty simple. Create an Oracle schema called hiveuser, or anything you like:


create user hiveuser identified by hiveuser
default tablespace app1 temporary tablespace temp2;
grant create session, connect to hiveuser;
grant dba to hiveuser;
exit

Then, using hiveuser, run the attached SQL scripts in Oracle. First run hive-schema-0.14.0.oracle to create the schema. You can optionally run hive-txn-schema-0.14.0.oracle afterwards to allow concurrency. I tested both SQL scripts and they work fine.
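If you run these from the command line, the invocation is roughly as below (a sketch: the ORCL connect identifier is a placeholder for your own instance, and the password is the one chosen above):

cd /usr/lib/hive/scripts/metastore/upgrade/oracle
sqlplus hiveuser/hiveuser@ORCL @hive-schema-0.14.0.oracle.sql
sqlplus hiveuser/hiveuser@ORCL @hive-txn-schema-0.14.0.oracle.sql    # optional, for concurrency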

To be continued ..

Big Data in Action for those familiar with relational databases and SQL May 12, 2015

Posted by Mich Talebzadeh in Big Data.

I thought it best to start by giving a practical introduction to Big Data Hadoop with tools that SQL developers and RDBMS DBAs (Oracle, Sybase, Microsoft) are familiar with. As I have already explained here, Hadoop has two main components, namely 1) the MapReduce engine and 2) the Hadoop Distributed File System (HDFS). Now this does not mean that you can run SQL against Hadoop without using some kind of API that allows SQL to be understood by Hadoop. OK. So if I know how to create a table in an RDBMS and how to interrogate the data in that table, will I be able to do the same without using any language except my dear SQL? Well, the answer is yes, by and large.

Introducing Apache Hive, your gateway to Hadoop via SQL

A while back a tool was developed called Hive which allowed SQL developers to access data in Hadoop. The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop, it provides:

- Tools to enable easy data extract/transform/load (ETL)

- A mechanism to impose structure on a variety of data formats

- Access to files stored either directly in Apache HDFS or in other data storage systems
  such as Apache HBase

- Query execution via Apache Hadoop MapReduce, Apache Tez or Apache Spark frameworks.

Hive has come a long way since its inception. It supports a SQL dialect close to ANSI SQL 92 called Hive Query Language (HiveQL). You can create a table in Hive much like any table in Sybase or Oracle. You can partition a table in Hive, similar to range partitioning in an RDBMS. You can create hash partitions using a feature called clustering into n buckets, and so forth. As of now (Hive 0.14), conventional B-tree style indexes are not supported in Hive. So we don’t have usable indexes, and while there is an optimizer in Hive, don’t confuse it with the traditional Oracle or Sybase optimizer.

So if we do not have indexes and everything is going to end up as a table scan, what use is that? That is a valid question. However, remember we are talking about Big Data and data warehousing. Yes, a database with tables with millions of rows. This is not your usual OLTP database with point queries. You don’t go around running a query to select a handful of rows! Hive is designed around the assumption that you will be doing scans of significant amounts of data, as are most data warehousing type solutions. Simply put, it does not have the right tools to handle efficient lookup of single rows or small ranges of rows. This means that Hive is not a suitable candidate for transactional or front office applications. Think about it. A typical RDBMS has an 8 KB page or block size. On the assumption that each block in Hadoop is 128 MB (the default), 1 GB of data will take around 8 blocks in Hadoop. Our 10 GB table, say, will fit in just 80 blocks in Hadoop. So which one is more efficient for transactional work, taking a few rows out of an Oracle table or the same number of rows from Hadoop with Hive? Doing OLTP work in Hive is like cracking a nut with a sledgehammer.

Hive is great for the massive transformations needed in ETL type processing and for full data set analytics. Hive is improving in terms of concurrency and latency; however, it has a fair bit to go to beat commercial Massive Parallel Processing (MPP) solutions in terms of performance and stability. The key advantages of Hive are storage economics and its flexibility (schema on read). To be continued ..

Technical Architect and the challenge of Big Data Architecture March 23, 2015

Posted by Mich Talebzadeh in Big Data.

In the previous articles I gave an introduction to Big Data and Big Data Architecture. What makes Big Data Architecture challenging is the fact that there are multiple sources of data (Big Data Sources) and one storage pool, namely the Apache Hadoop Distributed File System (HDFS), as part of the data conversion and storage layer, as shown in diagram 1 below.

[Diagram 1: BigDataIngestion]

This makes the job of the Technical Architect very challenging. I believe Technical Architect is an appropriate combined term for someone who deals with Data Architecture and Technology Architecture simultaneously. I combined these two domains together because I personally believe that in almost all cases there is a strong correlation between them! TOGAF defines Data Architecture as “the structure of an organization’s logical and physical data assets and data management resources”. Further, it defines Technology Architecture as “the software and hardware capabilities that are required to support the deployment of business, data and application services, including IT infrastructure, middleware, networks, communications, processing, and standards”. Putting semantics aside, I have never come across a Data Architect who is indifferent to the underlying technology, nor a Technologist who does not care about the type of data and the network bandwidth that is required to deliver that data. Sure, there are plenty of people who seem to spend most of their time creating nice looking diagrams and, to borrow a phrase from George Orwell’s Animal Farm, as soon as these diagrams were ready they were burnt in the furnace.

Anyway, back to the subject: the sources of data have been around for a long time. The new kid in class is Hadoop. To be sure, Hadoop has not been around for long. There are those who are of the opinion that Big Data is still more hype than reality, but who agree that Hadoop might be a keeper. It would only be the 2nd example in the IT industry, though, of a standard promoted by a giant industry player becoming universal, the 1st being Java of course, as Sun Microsystems was a giant in its time, amazing though that now seems.

Big Data Architecture is all about developing reliable and scalable data pipelines with almost complete automation. That is no small feat, as the Technical Architect requires deep knowledge of every layer in the stack, beginning with identifying the sources and quality of data, the data ingestion and conversion processes, the volume of data and the requirements for designing the Hadoop cluster to cater for that data. Think of it as working out, for each data input, what needs to be done to turn the raw data into relevant information; information which in all probability has to conform to the Master Data Model, has to be tailored to the individual consumption layer, and has to carry all the access controls and security needed for it.

It would be appropriate here to define broadly what Hadoop is all about.

Hadoop

Think about a relational database management system (RDBMS) first, as it is more familiar. Examples are Oracle or SAP Sybase ASE. An RDBMS manages resources much like an operating system. It is charged with three basic tasks:

  • It must be able to put data in
  • It must keep that data
  • It must be able to take the data out and work with it

Unlike an RDBMS, Hadoop is not a database. For example, it has only basic APIs to put data in. It certainly keeps the data, but again has very basic APIs to take that data out. So it is an infrastructure; it is also called a framework or ecosystem. Its native API commands are similar to Linux commands. Case in point: in Linux you get the list of root directories by issuing an ls / command. In HDFS (see below) you get the same list by issuing hdfs dfs -ls /.

Hadoop consists of two major components. These are:

  1. A massively scalable Hadoop Distributed File System (HDFS) that can support large amounts of data on a cluster of commodity hardware (JBOD)
  2. A massively parallel processing system called the MapReduce engine, based on the MapReduce programming model, for computing results using a parallel, distributed algorithm on a cluster.

In short, Hadoop allows applications based on MapReduce to run on large clusters of commodity hardware, scaled horizontally to speed up computations and reduce latency. Compare this architecture to a typical RDBMS, which is scaled vertically using a Symmetric Multiprocessing (SMP) architecture in a centralised environment.

Understanding Hadoop Distributed File System (HDFS)

HDFS is basically a resilient and clustered approach to managing files in a big data environment. Since data is written once and read many times thereafter, in contrast to transactional databases with constant reads and writes, HDFS lends itself to supporting big data analysis.

NameNode

As shown in figure 1 there is a NameNode and multiple DataNodes running on a commodity hardware cluster. The NameNode holds the metadata for the entire storage including details of where the blocks for a specific file are kept, the whereabouts of multiple copies of data blocks and access to files among other things.

It is akin to the dynamic data dictionary views in Oracle and the SYS/MDA tables in SAP ASE. It is essential that the metadata is available pretty fast. The NameNode behaves like an in-memory database (IMDB) and uses a disk file called the FsImage to load the metadata at startup. This is the only place where I see value in Solid State Disk, to make this initial load faster. For the remaining period, until HDFS is shut down, the NameNode will use the in-memory cache to access metadata. Like an IMDB, if the amount of metadata in the NameNode exceeds the allocated memory, then it will cease to function. Hence this is where a large amount of RAM helps on the host that the NameNode runs on. With regard to sizing the NameNode to store metadata, one can use the following rule of thumb (heuristic):

  • The NameNode consumes roughly 1 GB of memory for every 1 million blocks (source: Hadoop Operations, Eric Sammer, ISBN: 978-1-499-3205-7). So with a 128 MB block size, you can store 128 * 1E6 / (3 * 1024) ≈ 41,666 GB of data for every 1 GB of NameNode memory. The number 3 comes from the fact that each block is replicated three times (see below). In other words, just under 42 TB of data. So if you have 10 GB of NameNode cache, you can have up to 420 TB of data on your DataNodes (see the quick calculation below).
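The same heuristic as a quick shell calculation (integer arithmetic, so the result is slightly rounded):

echo $(( 128 * 1000000 / (3 * 1024) ))     # ~41666 GB of data per 1 GB of NameNode memory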

DataNodes

DataNodes are the workhorses of HDFS. Within the HDFS cluster, data blocks are replicated across multiple DataNodes (three by default: two on the same rack, one on another) and access is managed by the NameNode. Before going any further let us look at DataNode block sizes.

HDFS Block size

Block size in HDFS can be specified per file by the client application. A client writing to HDFS can pass the block size to the NameNode during its file creation request, and it (the client) may explicitly change the block size for the request, bypassing any cluster-set defaults.

The default block size of a client, which is used when a user does not explicitly specify a block size in their API call, is taken from the configuration parameter dfs.blocksize (dfs.block.size in older releases), given in bytes. By default this value is set to 128 MB (it can be 64 MB, 256 MB and I have seen 512 MB). This block size only applies to the data held on DataNodes and has no bearing on the NameNode.
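As an illustration of the per-file override, something along these lines should work (the file name is hypothetical; dfs.blocksize is given in bytes):

hdfs dfs -D dfs.blocksize=268435456 -put bigfile.dat /user/hduser/        # write with a 256 MB block size
hdfs dfs -stat "block size %o, replication %r" /user/hduser/bigfile.dat   # confirm what was actually used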

It would be interesting to make a passing comparison with a typical block size in an RDBMS. Standard RDBMSs (Oracle, SAP ASE, MSSQL) have 8 KB per block or page. I am ignoring multi-block reads or larger pool sizes for now. In the simplest operation every single physical IO will fetch 8 KB worth of data. If we are talking about Linux, the OS block size is 4 KB. In other words, in every single physical IO we will fetch two OS blocks from the underlying disk into memory. In contrast, with HDFS a single physical IO will fetch 128*1024K/4K = 32,768 OS blocks. This shows the extremely coarse granularity of HDFS blocks, which is suited to batch reads. How about the use of SSDs here? Is creating HDFS directories on SSD going to improve performance substantially? With regard to DataNodes, we are talking about a few writes (say bulk inserts) and many more reads. Let us compare the SSD response to that of magnetic spinning hard disks (HDD) for bulk inserts. Although not on HDFS, I did tests using the Oracle and SAP ASE RDBMS for bulk inserts and full table scans respectively, as described below.

 SSD versus HDD for Bulk Inserts

SSD fared reasonably better than HDD, but not that much faster. The initial seek time for SSD was shorter, but the rest of the writes were sequential, extent after extent, which is not much different from what HDD does.

In summary:

  • First seek time for positioning the disk on the first available space will be much faster for SSD than HDD
  • The rest of the writes will have marginal improvements with SSD over HDD
  • The performance gain I got was roughly a factor of two for SSD compared to HDD for bulk inserts

SSD versus HDD for full table scans

Solid state devices were typically about twice as fast here. When the query results in table scans, the gain is not that great, the reason being that the HDD is reading blocks sequentially and the impact of seek time is considerably less than for random reads.

In summary, Solid State Disks are best suited to random access via index scans, as seek time disappears and accessing any part of the disk is just as fast. When the query results in table scans, the gain is not that great. In HDFS practically all block reads are going to be sequential, so having Solid State Disks for DataNodes may not pay.

How does HDFS keep track of its data

Let us take an example of HDFS configured with a 128 MB block size. So for every 1 GB worth of data we consume eight blocks. However, we will need redundancy. In an RDBMS, redundancy comes from mirroring using SAN disks and having DR resiliency in case of data centre failure. In HDFS the same comes from replicating every block that makes up a file throughout the cluster, so that no single node is a point of failure. So now that blocks are replicated, how do we keep track of them? Recall that I mentioned the NameNode and metadata. That is where all this information is kept. In short, the NameNode records:

  • The entity life history of the files, when they were created, modified, accessed, deleted etc
  • The map of all blocks of the files
  • The access and security of the files
  • The number of files, the overall, used and remaining capacity of HDFS
  • The cluster, the number of nodes, functioning and failed
  • The location of logs and the transaction log for the cluster

Unlike the NameNode, the DataNodes are very much the workers. However, they do provide certain functions, as below:

  • DataNode stores and retrieves the blocks in the local file system of the node
  • Stores the metadata of the block in the local file system, based on the metadata info from the NameNode
  • Does periodic validation of file checksums
  • Sends heartbeat and availability of blocks to NameNode
  • Can exchange metadata and data directly with the client application programs
  • Can forward data to the other DataNodes

Hadoop MapReduce

The second feature of Hadoop is its MapReduce engine. It is called an engine because it is an implementation of the MapReduce algorithm, designed to process large data sets in parallel across a Hadoop cluster. MapReduce, as the name suggests, is a two-step process, namely map and reduce. The job configuration supplies the map and reduce analysis functions and the Hadoop engine provides the scheduling, distribution and parallelization services. During the map phase, the input data is divided into input splits for analysis by map tasks running in parallel across the Hadoop cluster. That is the key point in Massive Parallel Processing (MPP). By default, the MapReduce engine gets its input data from HDFS. The reduce phase uses the results from the map tasks as input to a set of parallel reduce tasks. The reduce tasks consolidate the data into the final results. By default, the MapReduce engine stores results in HDFS. Although the reduce phase depends on output from the map phase, map and reduce processing is not necessarily sequential. That is, reduce tasks can begin as soon as any map task completes; it is not necessary for all map tasks to complete before any reduce task can begin. Let us see this in action.

In the example below I have a table tdash in HDFS that I imported from an Oracle database via Sqoop and Hive (I will cover them in detail later). Sqoop (SQL to Hadoop) uses JDBC drivers to access relational databases like Oracle and SAP ASE and stores the data in HDFS as text delimited files. The Hive software facilitates creating, querying and managing large datasets in relational table format residing in HDFS. So with Sqoop and Hive combined, you can take a table and its data from an RDBMS and create and populate it in HDFS in relational format. Hive has a SQL-like language, similar to Transact-SQL in SAP ASE, called HiveQL. Now, if you run an HQL query in Hive, you will see that it uses MapReduce to get the results out. You can actually create an index in Hive as well, but it won’t be a B-tree index (and in its current state it will not be used by the optimizer). Diagram two below shows the steps needed to import an RDBMS table into HDFS. I will explain this diagram in later sections.

[Diagram 2: RDBMS_import_into_Hadoop]

This table has 4 million rows, and the statistics for this table in Hive can be obtained, believe it or not, from a compute statistics command (BTW, very similar to the old Oracle analyze command):

hive> analyze table tdash compute statistics;
Hadoop job information for Stage-0: number of mappers: 121; number of reducers: 0
2015-03-24 21:10:05,834 Stage-0 map = 0%,  reduce = 0%
2015-03-24 21:10:15,366 Stage-0 map = 2%,  reduce = 0%, Cumulative CPU 6.68 sec
2015-03-24 21:10:18,522 Stage-0 map = 6%,  reduce = 0%, Cumulative CPU 24.72 sec
2015-03-24 21:10:28,896 Stage-0 map = 7%,  reduce = 0%, Cumulative CPU 34.02 sec
…………..
2015-03-24 21:19:04,265 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 326.5 sec
MapReduce Total cumulative CPU time: 5 minutes 26 seconds 500 msec
Ended Job = job_1426887201015_0016
Table oraclehadoop.tdash stats: [numFiles=4, numRows=4000000, totalSize=32568651117, rawDataSize=32564651117]
MapReduce Jobs Launched:
Stage-Stage-0: Map: 121   Cumulative CPU: 326.5 sec   HDFS Read: 32570197317 HDFS Write: 10164 SUCCESS
Total MapReduce CPU Time Spent: 5 minutes 26 seconds 500 msec

So it tells us that this table has 4 million rows (numRows=4000000) and is split into 4 files (numFiles=4). The total size is 30 GB, as shown below (totalSize is quoted in bytes):


0: jdbc:hive2://rhes564:10010/default> select (32568651117/1024/1024/1024) AS GB;

+--------------------+--+
|         gb         |
+--------------------+--+
| 30.33192001003772  |
+--------------------+--+

FYI, this table in Oracle is 61 GB without compression. However, it is good that we have compression in HDFS. If I work this one out, this table should take around 30*8 = 240 HDFS blocks, on the assumption that each block is 128 MB and 1 GB of data takes around 8 blocks.

Now let us run a simple query and see MapReduce in action

0: jdbc:hive2://rhes564:10010/default> select count(1) from tdash
. . . . . . . . . . . . . . . . . . .> where owner = 'PUBLIC'
. . . . . . . . . . . . . . . . . . .> and object_name = 'abc'
. . . . . . . . . . . . . . . . . . .> and object_type = 'SYNONYM'
. . . . . . . . . . . . . . . . . . .> and created = '2009-08-15 00:26:23.0';
INFO  : Hadoop job information for Stage-1: number of mappers: 121; number of reducers: 1
INFO  : 2015-03-24 21:38:40,015 Stage-1 map = 0%,  reduce = 0%
INFO  : 2015-03-24 21:39:01,523 Stage-1 map = 1%,  reduce = 0%, Cumulative CPU 18.53 sec
INFO  : 2015-03-24 21:39:03,575 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 18.74 sec
…………….
INFO  : 2015-03-24 21:40:31,761 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 79.15 sec
INFO  : 2015-03-24 21:40:37,948 Stage-1 map = 19%,  reduce = 6%, Cumulative CPU 83.21 sec
INFO  : 2015-03-24 21:40:44,129 Stage-1 map = 20%,  reduce = 6%, Cumulative CPU 88.91 sec
INFO  : 2015-03-24 21:40:47,197 Stage-1 map = 20%,  reduce = 7%, Cumulative CPU 90.13 sec
…………..
INFO  : 2015-03-24 21:47:23,923 Stage-1 map = 97%,  reduce = 32%, Cumulative CPU 401.5 sec
INFO  : 2015-03-24 21:47:31,207 Stage-1 map = 98%,  reduce = 32%, Cumulative CPU 405.54 sec
INFO  : 2015-03-24 21:47:32,237 Stage-1 map = 98%,  reduce = 33%, Cumulative CPU 405.54 sec
INFO  : 2015-03-24 21:47:37,356 Stage-1 map = 99%,  reduce = 33%, Cumulative CPU 407.87 sec
INFO  : 2015-03-24 21:47:41,452 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 409.14 sec
INFO  : 2015-03-24 21:47:42,474 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 409.88 se
INFO  : MapReduce Total cumulative CPU time: 6 minutes 49 seconds 880 msec
INFO  : Ended Job = job_1426887201015_0017

+------+--+
| _c0  |
+------+--+
| 0    |
+------+--+

The query started with 121 mappers and one reducer. I expected zero rows back. However, you will notice that after doing 19% of the mapping, the reduce process started at 6%, and finally both map and reduce processes finished at 100% after 6 minutes and 49 seconds. This shows that the reduce process does not need to wait for all mappers to complete before it starts.

Planning for Hadoop clustering

The philosophy behind big data is horizontal scaling, so it embraces dynamic addition of nodes to a cluster depending on the famous three Vs, namely the Volume, Velocity and Variety of data. Planning a Hadoop cluster is not much different from planning for Oracle RAC or SAP Shared Disk Cluster. In contrast to these two products, where the databases are in one place, HDFS distributes blocks of data among DataNodes. The same resiliency and DR issues apply to Hadoop as much as to classic Oracle and SAP ASE databases.

Hadoop runs on commodity hardware. That does not necessarily mean that you can deploy any hardware. Commodity hardware is a relative term. Sure, you will not need to invest in SAN technology as you can get away with using JBODs, so there are relative savings to be made here. For Hadoop to function properly, you will need to consider the hardware for NameNodes (AKA masters) and DataNodes (AKA workers). The master nodes require very good resiliency, as the failure of a master node can disrupt processing. In contrast, worker nodes that hold data can fail, as there is built-in redundancy of data across workers.

Hardware for NameNodes

NameNodes hold the metadata for HDFS data plus the job scheduler, job tracker, worker node heartbeat monitor and secondary NameNode. So it is essential that redundancy is built into the NameNode hardware. Bear in mind that the metadata is cached in memory. To this end the NameNode host must have enough RAM and pretty fast disks for the start-up load. Above I suggested having SSD disks for NameNodes, which I think would be beneficial to performance. The NameNode will be memory hungry and light on disk storage. The OS disk needs to be mirrored internally (a default configuration) in all cases. The metadata will include filenames, the distribution of blocks and the location of replicas for each block, permissions, owner and group data. Bear in mind that longer filenames will occupy more bytes in memory. In general, the more files the NameNode needs to track, the more metadata it needs to maintain and the more memory will be consumed. Depending on the size of the cluster, if we are talking about 10-20 DataNodes, and assuming that 1 GB of NameNode memory can track up to 42 TB of data, then we are looking at 420 to 840 TB of data storage, so any host with 24 GB of RAM, a quad-core CPU, a 1 Gbit Ethernet connection and two JBOD drives will do, in addition to the two disks used for the Operating System and its mirror. Larger clusters will require a NameNode with 48, 96 or up to 128 GB of RAM or more.

Data Ingestion

Once we have pinpointed the sources of data, the Technical Architect needs to classify the characteristics of the data that must be processed for each application. Data, regardless of its source, will have common characteristics:

  • Data origin – front office, middle office, back office, external providers, machine generated, email, intranet, wiki, spreadsheets, logs, audio, legacy, ftp
  • Data format – structured, unstructured
  • Data frequency – end of day, intra-day (on demand), continuous, time series
  • Data type – company data, customer reference data, historical, transactional
  • Data size – the volume of data expected from the silos and their daily growth. Your friendly DBAs, plus various other administrators, should be able to provide this information. You will need it to assess the size of the Big Data repository and its growth
  • Data delivery – How the data needs to be processed at the consumption layer

There is a terminology used for manipulating data. It is referred to as Data Ingestion. Data ingestion is the process of getting and processing data for later use or storage. The key point is processing these disparate data. Specifically Data Ingestion includes the following:

  • Data Identification – identifying different data formats for structured data and default formats for unstructured data.
  • Data filtration – only interested in data relevant to master data management (MDM) needs
  • Data validation – continuous validation of data against MDM
  • Data pruning – removing unwanted data
  • Data transformation – migrating from local silo schema to MDM
  • Data Integration – integrating the transformed data into Big Data

Real time data Ingestion

Real time reporting is becoming increasingly important. This has been available for a while as continuous, real time, log-based transactional data capture. One of the challenges is implementing it to deliver real time data to Hadoop. I am currently actively involved in getting this tested and production ready for a solution and I hope to cover it later.

Batch data Ingestion

The most straightforward case is getting data from structured data sources. Think of RDBMS data as the source. In all probability you are familiar with moving data from one RDBMS to another. For example, you can use text delimited files to get data out of, say, SAP ASE and import it into Oracle (BCP the data out of ASE, load it in using SQLLDR).

The only difference here is that the destination is Hadoop. Currently the most widely used tool is Sqoop, which uses JDBC to get data out of an RDBMS and import it into HDFS in different file formats. I will discuss this later. Sqoop can also do incremental imports, the so-called incremental append, by checking the last unique identifier of the rows already imported and importing only the new ones. It will effectively use SQL like SELECT * FROM <TABLE> WHERE IDENTIFIER > SOME_VALUE. However, this will not address deletes and updates unless the full data set is overwritten, which will be time consuming if the underlying table is very large.
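To make the mechanism concrete, here is a minimal sketch of the queries Sqoop effectively issues; the table TRADES and its TRADE_ID check column are hypothetical names used purely for illustration:

-- First (full) import: everything in the source table is pulled across
SELECT * FROM TRADES;

-- Subsequent incremental append runs: only rows whose check column value is
-- higher than the last value seen in the previous run are pulled across
SELECT * FROM TRADES WHERE TRADE_ID > 1000000;

-- Note that this only covers inserts; rows updated or deleted in the source
-- after the previous run are not picked up, as discussed above

Sqoop records the highest value of the check column it has seen (or you supply it yourself on the next run), so each run continues from where the previous one left off.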

To be continued …

Big Data Architecture February 19, 2015

Posted by Mich Talebzadeh in Big Data.
1 comment so far

Data sources

I mentioned before that there are sources and consumers of big data and a big bridge in between. Let us take a heuristic approach and see what we mean by big data sources and the consumption layer. This is depicted in the diagram below:

BigDataSourcesAndConsumptionLayer

This layer provides the foundation and data sources necessary for a data driven enterprise. Looking at the figure, we can clearly see the variety of sources: structured data, mainframe data (ISAM, VSAM and all that) in the form of legacy systems, text, images, spreadsheets and documents, intranet content, XML and message bus data (TIBCO, MQ etc). While analysing more data is valuable but complex, looking at more data sources will likely add value to the business.

The job of analysing the whole of this disparate data is complex. However, companies believe that it will open up new opportunities and revenues. In short, they want to convert this enormous heap into information. Remember, data on its own is practically worthless until it can be analysed and turned into information. So we must keep in mind that the ultimate purpose of data is to support business needs.

We are all familiar with the structured data that we keep in relational databases in one shape or form. Unstructured data such as logs, emails, blogs, conference recordings, webinars etc are everywhere, often hidden below layers of heap, babble, misspelled names and the like that make it difficult to store and turn into information.

Can a relational database be made to deal with unstructured data? Not very likely. Recall that we still have capacity and latency issues with image and text fields stored in relational databases. Those who are familiar with relational databases know full well that replication of text and image columns (CLOB and BLOB) creates challenges for Replication Server in Sybase and materialized views in Oracle, notably the added latency and backlog in pushing the data. Simply put, relational databases were not built to process large amounts of unstructured data in the form of text, images and so on.

How about data warehouses based on columnar models of relational databases? Again, they are not built to provide meaningful and efficient analysis of unstructured data.

Data conversion and storage layer

Now that we have identified the sources of big data, we need to identify the characteristics of the data that must be processed according to each application. Data, regardless of its source, will have common characteristics:

Data origin – front office, middle office, back office, external providers, machine generated, email, intranet, wiki, spreadsheets, logs, audio, legacy, ftp
Data format – structured, unstructured
Data frequency – end of day, intra-day (on demand), continuous, time series
Data type – company data, customer reference data, historical, transactional
Data size – the volume of data expected from the silos and their daily growth. Your friendly DBAs should be able to provide this information plus various other administrators. You will need this information to assess the size of Big Data repository and its growth
Data delivery – How the data needs to be processed at the consumption layer

There is a terminology used for manipulating data. It is referred to as Data Ingestion. Data ingestion is the process of getting and processing data for later use or storage. The key point is processing these disparate data. Specifically Data Ingestion includes the following:

Data Identification – identifying different data formats for structured data and default formats for unstructured data.
Data filtration – only interested in data relevant to master data management (MDM) needs, see below
Data validation – continuous validation of data against MDM
Data pruning – removing unwanted data
Data transformation – migrating from local silo schema to MDM
Data Integration – integrating the transformed data into Big Data

These are shown in diagram 2 below. The golden delivery is data according to Master Data Management (MDM). MDM is defined as a comprehensive method of enabling an organisation to link all of its critical data to one file, called a master file, that provides a common point of reference. It is supposed to streamline data sharing among architectures, platforms and applications. For you and me it is basically the enterprise wide data model, and what data conversion is ultimately trying to achieve is to organise data of different structures into a common schema. Using this approach, the system can relate a myriad of data, including structured data such as trading and risk data and unstructured data such as product documents, wikis, graphs, PowerPoint presentations and conference recordings.

So in summary, what are the objectives of an enterprise wide data model? To run your business, you manipulate the data at hand to make it do something specific for you; to change the way you do business, you take data as-is and determine what new things it can do for you. These two approaches are more powerful combined than on their own.

The challenge is to bring these two schools of thought together. So it does make sense to create a common architecture model to leverage all types of data in the organisation and open new opportunities.

BigDataIngestion

Within the storage layer, you will see we have Hadoop Distributed File System (HDFS) which can be used by a variety of NoSQL databases, plus Enterprise data warehouse.

For the majority of technologists, I thought it would be useful to say a few words on HDFS and how it can be used. Hadoop is a framework for distributed data processing and storage. It is not a database, so it lacks functionality that is necessary in many analytics scenarios. Fortunately, there are many options available for extending Hadoop to support complex analytics, including real-time predictive models. So what are these options?

These are collectively called NoSQL databases. The "No" does not mean they cannot use SQL; it stands for Not Only SQL. Why is that? Because SQL is by far the most widely used language for getting data out of databases. Generations of developers know and use SQL, so it makes sense for tools that deal with data stored in Hadoop to use a dialect of SQL.

In general, a NoSQL database relaxes the rigid requirements of a traditional RDBMS (meaning the ACID properties) by following the CAP theorem (Consistency, Availability, Partition tolerance), typically compromising Consistency in favour of Availability and Partition tolerance, and can implement new usage models that require lower latency and higher levels of scalability. NoSQL databases come in four major types:

Key/Value Object Stores – These are the simplest type of database and can store any kind of digital object. Each object is assigned a key and is retrieved using the same key. Since you can only retrieve an object using its associated key, there is no way to search or analyse data until you have first retrieved it. Data storage and retrieval are extremely fast and scalable, but analytic capabilities are limited. Examples of key/value object stores include MemcacheDB, Redis and Riak.

Document Stores – These are a type of key/value store in which documents are stored in recognizable formats and are accompanied by metadata. Because of the consistent formats and the metadata, you can perform search and analysis without first retrieving the documents. Examples of document stores include MarkLogic, Couchbase and MongoDB.

Columnar Databases – These provide varying degrees of row and column structure, but without the full constraints of a traditional RDBMS. You can use a columnar database to perform more advanced queries on big data. Examples of columnar databases include Cassandra and HBase.

Graph Databases – These store networks of objects that are linked using relationship attributes. For example, the objects could be people in a social network who are related as friends, colleagues or strangers. You can use a graph database to map and quickly analyse very complex networks. Examples of graph databases include InfiniteGraph, AllegroGraph and Neo4j.

Analysis layer

The analysis layer is responsible for transforming the data from Big Data to the Consumption layer. In other words, this is where we need to make business sense of the collected data. The analysis layer will have to adopt different approaches to take advantage of Big Data. The approach will have to include both the business-as-usual analytics used in data warehouses and innovative solutions for unstructured data. The fundamental difference between something like a data warehouse and Hadoop is that a typical data warehouse is scaled vertically using a Symmetric Multiprocessing (SMP) architecture in a centralised environment, whereas Hadoop-based tools use the Hadoop Distributed File System (HDFS) to physically distribute unstructured data based on Massively Parallel Processing (MPP) through the MapReduce engine.

Within the Analysis layer we need to identify what usage we are going to make of data collected in HDFS and enterprise data warehouse. The bulk of analysis tools should be able to answer the following:

What happened -This is the domain of traditional data warehouses. The information can be augmented by using query engines against HDFS for reduced query time

Why did it happen – What can we learn? Traditional BI engines and Decision Management tools can help.

What will happen in short, medium and long term – Pretty self explanatory. Event scoring and Recommendation Engines based on Predictive and Statistical Models can help.

Added value – How can we make it happen to benefit the business? Decision Management engines and Recommendation Engines can be deployed

In addition we need to process real time notifications and event driven requests which are typically stream based and may require Complex Event Processing (CEP) engines such as SAP Event Stream Processor or Oracle CEP.

Modelling can deploy pricing/quant’s models together with statistical models. These are shown in diagram 3 below.

BigDataAnalytics

Finally the whole of Big Data Architecture is shown in diagram 4 below.

BigDataArchitectureLayer

Introduction to Big Data February 19, 2015

Posted by Mich Talebzadeh in Big Data.
add a comment

Big Data is nowadays the vogue of the month, or even the year. I recall that around 10 years ago Linux was in the same category. Anyway, we have got used to these trends. Bottom line: Big Data is a trendy term these days and if you don't know about it and you are in I.T. then you are missing the buzz.

Anyway let us move on and separate the signal from the pedestal so to speak

According to Wikipedia, Big Data is defined as and I quote:

“Big Data is a broad term for data sets so large or complex that they are difficult to process using traditional data processing applications. Challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and information privacy.”

Right, what does so large and complex mean in this context? Broadly it refers to the following characteristics:

Extremely large Volume of data:

– Data coming from message buses (TIBCO, MQ-Series etc)
– Data coming from transactional databases (Oracle, Sybase, Microsoft and others)
– Data coming from legacy sources (ISAM, VSAM etc)
– Data coming from unstructured sources such as emails, word and excel documents etc
– Data coming from algorithmic trading through Complex Event Processing
– Data coming from Internet (HTML, WML etc)
– Data coming from a variety of applications

OK, I used the term unstructured above. So what does that really mean? First, for me structured data means anything kept in a relational database and accessed via Structured Query Language (SQL). So there we go, the word structured popped up. SQL is very structured: get it wrong, misspell or misplace anything and it won't work!

Now what does the term unstructured data mean in this context? It basically means that its structure is not predictable or does not follow a specified format. In other words, it does not fit into a relational context. I am sure some purists will disagree with me on that point.

Extremely High Velocity:

Data is streaming in at unprecedented speed and must be dealt with in a timely manner. This could be batch or real time

Extremely wide variety of data:

OK, if you look at points under extremely large Volume of data, you will see the variety of structured and unstructured data that Big Data deals with.

Traditionally, over the past 20-30 years, the bulk of the data used in many industries, including the financial sector, has been stored in relational databases in structured format. Yes, the one with tables and columns, primary keys, indexes, third normal form and the rest of the model that originated with Dr Codd in 1970. Before that there were hierarchical and network databases, but we won't go into that. The current estimate is that around 20-30% of available data is stored in structured format and the rest is unstructured. Unstructured refers to data that either does not have a pre-defined data model or is not organised in a pre-defined manner. You can pick examples from the list above.

So if you come from a relational database management background like myself, you may rightly be asking yourself: why do we need all this when we have reliable, time-tested relational databases (RDBMS)? The answer is something you probably already know. In short, not all data fits into a structured mould, and unstructured data accounts for around 70% of the data!

You may argue that we have data warehouses these days. However, recall that a data warehouse like SAP Sybase IQ is a columnar implementation of the relational model. In contrast, databases like Oracle and SAP Sybase ASE are row-based implementations of the relational model. Both Oracle and SAP Sybase ASE nowadays offer an in-memory database as well. Oracle, with version 12c, is about to offer a columnar in-memory implementation of the database, but it will still use the same row-based optimizer that has been at the heart of Oracle for many years.

Bottom line: the data that enterprises hold sits in different silos/databases and in different forms and shapes; even the relational schemas are often local and reflect the legacy data model, and they do not adhere to a common enterprise wide data model.

Let me give you an example. Imagine a bank wants to know its total exposure to a certain country today as part of its risk analysis. The position is not clear and is kept in different applications within the bank. Often these applications feed each other one on one. In other words, moving numbers from A to B and back again. A rather simplistic but surprisingly accurate summary!

That does not help to solve the problem. Maybe the bank has lent directly to this country. It may also have dealings with other banks and institutions that have large exposure to the said country. So where is the aggregate risk position for the bank as of now? It cannot go on shifting spreadsheets from one silo to another. Even if it did, it would take ages and would never be real time.

Banks are further under regulatory pressure to abide by many accounting and regulatory standards and will need IT transformation. One of these would be complying with IFRS 9 Financial Instruments (regulation) under Risk and Legal Entity. These will all (eventually) require a golden source of data to provide the required information.

Big Data is not a single technology but incorporates many new and existing technologies. It is not just about having something like the Hadoop framework in place to use as a big massive kitchen sink for all data. It is a myriad of technologies all working together. Ignoring the complexity of the middle layers, we should think about the Big Data sources at one end and the so-called Consumption layer at the other end.

In-Memory Data Grid (IMDG), why do we need it and how different it is from an In-Memory Database January 16, 2015

Posted by Mich Talebzadeh in Data Grid.
add a comment

Introduction

Relational databases have been the backbone of data storage for the majority of data stored in computers since the 1980s. They are extensively used in many domains, including the financial sector.

My experience is heavily biased towards banking, so I can only bore you with what I know. Throughout my long exposure, from front office trading systems to back office and reconciliation, I have come to the conclusion that banking is all about moving numbers from A to B and back again. A rather simplistic but surprisingly accurate summary! Think of it as copying data from one silo to another and, in between, making sure that extract, transform and load (call it ETL if you like) works fine.

This set-up has been around for years: one database, two-tier or three-tier applications in a client-server environment. If you had bandwidth and latency issues you would have used some form of replication. Replication is the process of sharing data at multiple databases and keeping them in sync. Products like Oracle GoldenGate, Sybase Replication Server or simply materialized views are still commonly used to replicate data among multiple geographical locations or among heterogeneous databases. So these were the original distributed data set-ups. Replication is primarily used to address the bandwidth and latency issues. However, the access path with its associated costs, such as physical IO, logical IO, the LRU-MRU chain and so on, would still be there.

The fundamental concern with the traditional data fetch from disk-resident "classic" databases is that a significant portion of time is consumed by what is referred to as the data access overhead, where a service or application spends the majority of its effort retrieving the relevant data and often transforming it to internal data structures. I call this bucketing. Think of it as the fruit of a developer who writes a report to extract end of day or intra-day data, gets the data out of the database, stores it in C# or Java objects (i.e. mapping from a relational model to an object model) and finally pushes that data, or rather cooked data (raw data with added bells and whistles), into another database, ready for yet another report. I am sure most of you are familiar with this process. It may involve literally hundreds of reports/extracts running against the primary or reporting databases, multiple polling and generally a gradual impact on the throughput and performance of the systems.

Another caveat is the fact that individual applications often have their own specific schema or data model that may have to go through a fair bit of transformation to be of value to the target system. For example, you may find dates that are cryptic and need to be changed to a standard format.

The Approach

Delivering the real-time responsiveness required of modern multi-tier applications requires a new set of data services. These data services must be delivered through a complementary architecture that leverages the data within relational databases but crucially optimizes its delivery for real-time application requirements. Remember we still need a persistent data store for permanent storage (the so-called golden source) and audit requirements, so the database is not going to go away. It may move from relational to NoSQL, but it still has to store data one way or another.

We need to consider the following challenges:

  1.  Mapping from relational to object model
  2. Create a schema or model for wider needs
  3. Eliminate the data access path overhead typical of databases
  4. Move from the traditional client/server model to Service Oriented Architecture
  5. Use some form of in-memory caching. By caching data in memory across many servers instead of within a single database server, distributed caching eliminates performance bottlenecks and minimizes access times.
  6. Resiliency is provided by having in-memory data on multiple servers kept in sync
  7. An in-memory data grid automatically scales its capacity and handles network or server failures. It also provides APIs for coordinating data access and asynchronously notifying clients. To this the data grid adds powerful tools, such as parallel query and map/reduce, to give applications the insights they need for fast, in-depth analysis.
  8. We still have to contend with legacy/local databases. Remember that most applications are legacy and they will have to use legacy/local databases. The cost to rewrite these types of applications may be prohibitive unless they are decommissioned or replaced.

Index Join in Oracle and the equivalent Index Intersection in Sybase ASE September 1, 2012

Posted by Mich Talebzadeh in Oracle and Sybase.
1 comment so far

Both Oracle and Sybase ASE provide multiple access paths to the optimizer. Many people have the perception that for a given table only one index can be used by the optimizer. Although this is the expected outcome and it holds true in most cases, under certain conditions where the optimizer decides that it does not need to access the base table, more than one index on the same table can be used to satisfy the query.

Consider a simple case of a select list returning the values of a column or columns based on the columns used in the where clause. In other words, the select list is entirely satisfied by the columns in the where clause. Let us look at a simple example in each case.

Oracle (11gR2)

First create the table:

CREATE TABLE index_joins AS
 SELECT
 ROWNUM AS c1,
 (ROWNUM + trunc(dbms_random.value(0,1000))) AS c2,
 LPAD('A',255,'A') AS c3
 FROM
 DUAL
 CONNECT BY
 LEVEL <=10000;
CREATE UNIQUE INDEX ind_c1 ON index_joins(c1);
CREATE INDEX ind_c2 on index_joins(c2);
EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER, TABNAME=>'INDEX_JOINS', CASCADE=>TRUE)

Let us query this table to get columns c1 and c2

alter system flush buffer_cache;
alter system flush shared_pool;
set timing on
set autotrace on
SELECT c1, c2
FROM index_joins
where   c1 between 100 and 1000
and     c2 between 50 and 100;

The select list contains columns c1 and c2. The index ind_c1 covers column c1 and the index ind_c2 covers c2. These two indexes on the index_joins table satisfy the query, hence there is no need for the optimizer to visit the base table. Let us look at the plan:

---------------------------------------------------------------------------------------
| Id  | Operation          | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                  |     4 |    36 |     6  (17)| 00:00:01 |
|*  1 |  VIEW              | index$_join$_001 |     4 |    36 |     6  (17)| 00:00:01 |
|*  2 |   HASH JOIN        |                  |       |       |            |          |
|*  3 |    INDEX RANGE SCAN| IND_C2           |     4 |    36 |     3  (34)| 00:00:01 |
|*  4 |    INDEX RANGE SCAN| IND_C1           |     4 |    36 |     4  (25)| 00:00:01 |
---------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
 ---------------------------------------------------
 1 - filter("C2"<=100 AND "C1"<=1000 AND "C1">=100 AND "C2">=50)
 2 - access(ROWID=ROWID)
 3 - access("C2">=50 AND "C2"<=100)
 4 - access("C1">=100 AND "C1"<=1000)

Let us look at the plan output. The optimizer does an index range scan of the two indexes, extracting the rowid and c1 from index ind_c1 and the rowid and c2 from ind_c2. It then performs a hash join between these two result sets on the rowid values. It does this because any time the two row sources have a rowid in common, that is the intersection we want, and both the c1 and c2 values can be returned as the result of the query.
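If memory serves, Oracle also provides the INDEX_JOIN hint, which can be handy for testing this access path when the optimizer does not pick it on its own. A minimal sketch against the table created above; treat the hint usage as illustrative rather than a recommendation:

SELECT /*+ INDEX_JOIN(index_joins ind_c1 ind_c2) */ c1, c2
FROM   index_joins
WHERE  c1 BETWEEN 100 AND 1000
AND    c2 BETWEEN 50 AND 100;

Checking the plan with autotrace, as above, should show the same index$_join$ VIEW and the HASH JOIN of the two index range scans.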

Sybase ASE (ASE 15.7 ESD 2)

First we create the equivalent table in Sybase

create table index_joins
(
c1 int not null
, c2 int not null
, c3 varchar(255)
)
lock datarows
go
CREATE UNIQUE INDEX ind_c1 ON index_joins(c1)
CREATE INDEX ind_c2 on index_joins(c2)
go
--
-- the #generator temporary table is used as a row source to populate index_joins
create table #generator (id int not null)
go
declare @i int
declare @rows int
set @rows = 100
set @i = 1
set nocount on
while @i <=@rows
begin
 insert into #generator values (@i)
 set @i = @i + 1
end
set nocount off
insert into
 index_joins
select
 rs.id
 , rs.id + abs(rs.random % 1000)
 , REPLICATE('A',255)
from    ( select
 @rows * ( g1.id - 1 ) + g2.id AS id
 , CAST(NEWID() AS VARBINARY)  AS random
 from
 #generator g1,
 #generator g2
 ) rs
go
update index statistics index_joins
go

So that is the equivalent table in Sybase.

Unlike Oracle, the optimizer in Sybase will not by default consider using two indexes on the same table. The only way to use two different indexes on the same table in a query is to have an index intersection plan. Index intersection is a technique that enables the ASE optimizer to use more than one index on a table to satisfy a given query. This works similarly to the Oracle "index join", whereby each index satisfies part of the predicate in the query. The hash join between the two indexes yields the common Row IDs (RIDs) at the intersection, and the base table rows are then accessed via those RIDs. In ASE this index intersection has to be enabled explicitly with the following set command:

set index_intersection on

In addition, you need to supply the multiscan abstract plan operator m_scan. The query will look like the following code:

set switch on 3604
go
declare @dbname varchar(30)
set @dbname=db_name()
dbcc cachedataremove(@dbname)
go
set statement_cache off
go
set showplan on
set statistics time,io,plancost on
go
set index_intersection on
go
SELECT c1, c2
FROM index_joins
where   c1 between 100 and 1000
and     c2 between 50 and 100
plan '(m_scan index_joins)'
go

The output of the Lava tree operator shows:

==================== Lava Operator Tree ====================
                     Emit
                     (VA = 6)
                     r:0 er:1
                     cpu: 0
                /
              RIDJoin
              (VA = 5)
              r:0 er:1
              l:4 el:7
              p:0 ep:0
     /                                              \
    HashJoin                                     Restrict
    Inner Join                                   (0)(0)(0)(13)(0)
    (VA = 2)                                     (VA = 4)
    r:0 er:1                                     r:0 er:1
    l:4 el:5
    p:0 ep:0
    bufct: 128
   /                      \                        /
IndexScan               IndexScan               TableScan
ind_c2                  ind_c1                  index_joins
(VA = 0)                (VA = 1)                (VA = 3)
r:0 er:1                r:0 er:900              r:0 er:1
l:2 el:2                l:0 el:3                l:0 el:934
p:2 ep:2                p:0 ep:3                p:0 ep:5
============================================================

The execution plan steps are described as follows:

  1. index scan of ind_c2 to get the RIDs of the qualifying rows for "index_joins.c2 between 50 and 100"
  2. index scan of ind_c1 to get the RIDs of the qualifying rows for "index_joins.c1 between 100 and 1000"
  3. does HashJoin between those two sets of RIDs to return the common RID
  4. The stream of unique RIDs is passed to the RID JOIN operator. This operator creates a worktable (aka Dynamic Index) and fills it with a single-column row with each RID.
  5. does a RID Scan on the table index_joins using those RIDs from the worktable and gets back the rows.

Note that there is a subtle difference in behaviour between Oracle and ASE. Oracle does not need to visit the base table. However, ASE will visit the base table to fetch the result set for the RIDs from the worktable created. This method will also work for cases where the result set includes other columns from the table; for example, a select * from the base table will also use this access path in ASE.

Oracle 10g, instance and database diagrams July 23, 2012

Posted by Mich Talebzadeh in Oracle.
add a comment

Use this link for the Oracle10gR2 Architecture diagram.
Use this link for the Oracle10gR2 Instance diagram.
Use this link for the Oracle10gR2 Database diagram.

Understanding AWR Report June 13, 2012

Posted by Mich Talebzadeh in Oracle.
10 comments

Updated December 2014

In the post Automatic Workload Repository (AWR) Performance Monitoring Tool Basics, I described the basic set-up and report generation for AWR. In this post we will try to understand the AWR report itself.

Before going further I must emphasise that this report was generated by running a PL/SQL block immediately after the instance was rebooted. The code was used to simulate a typical OLTP workload with frequent inserts/updates/deletes and commits. The sample code:

  • Performs checkpoints immediately before and after PL/SQL block
  • Manually takes AWR snapshots before and after running PL/SQL block

The code is shown below

ALTER SYSTEM CHECKPOINT;
EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT();
DECLARE
  type ObjIdArray is table of tdash.object_id%TYPE index by binary_integer;
  l_ids objIdArray;
  CURSOR c IS SELECT object_id FROM tdash;
BEGIN
  OPEN c;
  LOOP
    BEGIN
      FETCH c BULK COLLECT INTO l_ids LIMIT 100;

      FORALL rs in 1 .. l_ids.COUNT
        UPDATE testwrites
          SET PADDING1 =  RPAD('y',4000,'y')
        WHERE object_id = l_ids(rs);
      commit;
      FORALL rs in 1 .. l_ids.COUNT
        DELETE FROM testwrites
        WHERE object_id = l_ids(rs);
      commit;
      FORALL rs in 1 .. l_ids.COUNT
       INSERT INTO testwrites
       SELECT * FROM tdash t WHERE t.object_id = l_ids(rs);
      commit;
      EXIT WHEN C%NOTFOUND;
    EXCEPTION
      WHEN NO_DATA_FOUND THEN
        NULL;
      WHEN OTHERS THEN
        DBMS_OUTPUT.PUT_LINE('Transaction failed');
    END;
  END LOOP;
  CLOSE c;
END;
/
ALTER SYSTEM CHECKPOINT;
EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT();

The output from the AWR report is shown below.

The snapshot details

This is normally shown at the beginning of the report.

        Snap Id      Snap Time   Sessions Curs/Sess
      --------- -------------- --------   -----
Begin Snap:4413 17-May-12 09:22:34   28   3.0
  End Snap:4414 17-May-12 09:43:35   29   3.3
   Elapsed:          21.02 (mins)
   DB Time:          21.78 (mins)

Note that unlike the usual hourly sample, this sampling interval was 21 minutes. The snapshots in this case were taken immediately before and after running the PL/SQL block.

Elapsed (time) is just the interval of time between the start and end snapshots. Another important quantity is DB Time, which is the sum of the working sessions' time. So DB Time = sum of database CPU time + waits. In systems with multiple concurrent active sessions, DB Time can be larger than the elapsed time, because DB Time is summed over all active sessions that are using CPU or waiting for an event. Note that background processes are not included in this figure.
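If you want to pull these two figures straight out of the workload repository rather than the report, the time model history is kept in DBA_HIST_SYS_TIME_MODEL (Diagnostics Pack licence assumed). A minimal sketch, using the snapshot ids from this report; the values are cumulative since instance startup and are in microseconds, so the figure for the interval is the difference between the end and begin snapshots:

SELECT sn.snap_id,
       tm.stat_name,
       ROUND(tm.value / 1e6, 1) AS seconds   -- time model values are recorded in microseconds
FROM   dba_hist_sys_time_model tm,
       dba_hist_snapshot       sn
WHERE  sn.snap_id         = tm.snap_id
AND    sn.dbid            = tm.dbid
AND    sn.instance_number = tm.instance_number
AND    tm.stat_name IN ('DB time', 'DB CPU')
AND    sn.snap_id   IN (4413, 4414)
ORDER  BY tm.stat_name, sn.snap_id;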

Load Profile

Next we will look at the Load Profile in AWR report

Load Profile      Per Second    Per Transaction Per Exec Per Call
~~~~~~~~~~~~   --------------- --------------- --------- -------
DB Time(s):            1.0                0.0       0.02    0.39
DB CPU(s):             0.1                0.0       0.00    0.04
Redo size:    27,284,637.6          916,465.9
Logical reads:    46,156.0            1,550.3
Block changes:    31,214.4            1,048.5
Physical reads:    2,446.5               82.2
Physical writes:   5,627.1              189.0
User calls:            2.7                0.1
Parses:               50.8                1.7
Hard parses:           0.8                0.0
W/A MB processed:      0.1                0.0
Logons:                0.1                0.0
Executes:             62.0                2.1
Rollbacks:             0.0                0.0
Transactions:         29.8

Per Second here means that the measure is normalised to one second. It is worth looking at the Per Second statistics to ensure that the amount of work the instance is doing is reasonable for the tasks it has to perform. For example, is doing nearly 30 transactions per second good enough?

To understand what Per Transaction means, we need to look at another statistic further down the report, in the section under Instance Activity Stats:

Statistic                     Total     per Second     per Trans
----------------------- ----------- -------------- -------------
user commits                 37,550           29.8           1.0

Per transaction simply means per commit. So here we had 37,550 commits. To work out the total redo generated, we can take the redo size per transaction at 916,465.9 bytes and multiply it by the number of commits at 37,550. That gives a total of 916,465.9 * 37,550 / 1024 / 1024 ≈ 32,819 MB of redo generated.

Other interesting statistics are Logical reads, Physical reads and Physical writes. These are quoted in blocks, so as before the totals would be 58,213,765 logical block reads, 3,086,610 physical block reads and 7,096,950 physical block writes. Now a useful statistic would be the total amount of writes in MB. Since the database uses an 8K block size, the total amount written is (7,096,950 * 8192)/(1024 * 1024) ≈ 55,445 MB.
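If you prefer to let the database do the arithmetic, a quick sketch using the figures above (the 8K block size is taken from this particular instance and is an assumption for any other system):

SELECT ROUND(916465.9 * 37550 / 1024 / 1024) AS total_redo_mb,   -- redo per transaction * number of commits
       ROUND(7096950  * 8192  / 1024 / 1024) AS total_write_mb   -- physical block writes * block size
FROM   dual;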

Instance Efficiency Percentages

This part is not that useful, but let us have a look at it anyway:

Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            Buffer Nowait %:  100.00       Redo NoWait %:  100.00
            Buffer  Hit   %:   94.70    In-memory Sort %:  100.00
            Library Hit   %:   95.49        Soft Parse %:   98.41
          Execute to Parse %:   18.06         Latch Hit %:  100.00
Parse CPU to Parse Elapsd %:   59.75     % Non-Parse CPU:   98.91

OK, we hit the 100% target in a few places. On its own this section does not tell us much. The Buffer Hit ratio is 94.70%, but this is expected as we rebooted the instance before the test. For the same reason the Library Hit ratio is 95.49%.

Parsing

Think of parsing as basically a two-step process: a syntax check to verify the validity of the statement, and a semantic check to ensure the statement can execute properly. The next step in the parse operation is to see if the statement being parsed has already been processed by another session. If that is the case then we can skip the next two steps, optimization and row source generation. If we can skip these two steps, we have done what is known as a soft parse, a shorter route to getting our query going. If we cannot, and must do all of the steps, we are performing what is known as a hard parse: we must parse, optimize and generate the plan for the query. Needless to say, hard parsing is more CPU intensive than soft parsing. Note that we have 59.75% Parse CPU to Parse Elapsed, so we are losing a fair bit of time while parsing. However, that should not matter much here: after the reboot the shared pool did not contain any useful compiled queries, and parsing is such a small fraction of the work being done.
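To illustrate the soft versus hard parse point, here is a minimal SQL*Plus sketch using the tdash table from the test above; the exact statements are purely illustrative:

-- With a bind variable the statement text never changes, so after the first
-- hard parse every further execution can reuse the shared cursor (a soft parse at worst)
VARIABLE v_id NUMBER
EXEC :v_id := 100
SELECT COUNT(*) FROM tdash WHERE object_id = :v_id;

-- With literals every distinct value produces a new statement text and
-- therefore a fresh hard parse
SELECT COUNT(*) FROM tdash WHERE object_id = 100;
SELECT COUNT(*) FROM tdash WHERE object_id = 101;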

Top 5 Timed Foreground Events

This is probably one of the most important sections of the AWR report. In a nutshell, it is all about wait events. It records the number of waits encountered in the period and the total time spent waiting, together with the average time waited for each event. The section is ordered by the percentage of the total database time that each event is responsible for.

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                          Avg
                                         wait   % DB
Event                   Waits   Time(s)  (ms)   time Wait Class
---------------------------- ----------- ------ ------ ----------
db file sequential read 2,817,676    810   0   62.0 User I/O
DB CPU                               130        9.9
free buffer waits           5,505     70  13    5.3 Configurat
log file sync                 142     46 324    3.5 Commit
log buffer space              371    36   97    2.7 Configurat

Depending on what is seen in this section, other report sections may need to be referenced in order to quantify or check the findings. For example, the wait count for a particular event needs to be assessed against the duration of the reporting period and the number of active sessions at the time. In other words, do the stats look reasonable compared to the reported workload?

In the above, 62% of database time was spent waiting for db file sequential read, which in this case was single block index reads. However, the average wait time was under a millisecond. Another 9.9% of the time was spent on CPU. A tenth of database time is not bad at all. In general, high CPU usage is often a symptom of poorly tuned SQL (excessive logical I/O due to a sub-optimal access path, or excessive physical I/O). In this case the CPU is not busy at all. However, if the CPU were busy, say above 80%, you could look under the section SQL ordered by CPU Time and find the SQL Id that keeps the CPU busy. When you find the SQL Id that is taking high CPU resources, that query can be tuned for better performance.

Further down we have 5.3% of database time spent on free buffer waits. A free buffer wait occurs when a user session reads a block from disk and cannot find a free buffer in the buffer cache to place it in. One frequently cited reason is that the DBWR process is not cleaning out the buffer cache fast enough. However, there can be many reasons and one generic answer will not help; you will need to investigate the cause or causes of the problem.

The log file sync wait event accounts for 3.5% of database time. It is triggered when a user session issues a commit (or a rollback): the user session signals or posts the LGWR to write the log buffer to the redo log file. If you look at the SQL code above, we are indeed issuing frequent commits, after processing every 100 rows. Finally, we have the log buffer space wait event taking 2.7% of DB time. This event occurs when server processes write data into the log buffer faster than the LGWR process can write it out. Refer to the Oracle documentation if you see high readings for this wait event.

The time model statistics

Time model statistics tell you the processing time spent on various metrics during the snapshot interval. You should not expect the % of DB Time to add up to 100% because there is overlap among the statistics. For example, sql execute elapsed time includes the CPU time used for SQL execution. Note that some statistics, such as background elapsed time and background cpu time, are shown as well, but these are not part of % of DB Time.


Time Model Statistics                    DB/Inst: MYDB/mydb  Snaps: 4413-4414
-> Total time in database user-calls (DB Time): 1306.8s
-> Statistics including the word "background" measure background process
time, and so do not contribute to the DB time statistic
-> Ordered by % or DB time desc, Statistic name
Statistic Name                          Time (s) % of DB Time
------------------------------------------ ------------------ ------------
sql execute elapsed time               1,247.7         95.5
DB CPU                                   129.9          9.9
connection management call elapsed time    5.5           .4
parse time elapsed                         4.2           .3
hard parse elapsed time                    3.7           .3
PL/SQL execution elapsed time              1.2           .1
PL/SQL compilation elapsed time            0.4           .0
hard parse (sharing criteria) elapsed time 0.1           .0
repeated bind elapsed time                 0.0           .0
hard parse (bind mismatch) elapsed time    0.0           .0
sequence load elapsed time                 0.0           .0
failed parse elapsed time                  0.0           .0
DB time                                1,306.8
background elapsed time                3,395.9
background cpu time                       19.6
-------------------------------------------------------------

We can go through these figures now. 129.9 seconds of CPU time was used by all user sessions, just under 10% of the database time. In total 1,306.8 seconds of database time was used. The total wait event time can be calculated as DB time – DB CPU, i.e. 1,306.8 – 129.9 = 1,176.9 seconds. The lion's share of database time (95.5%) was spent executing SQL, which is a good sign. The total parse time was 4.2 seconds, of which 3.7 seconds was hard parsing. The rest of the statistics are tiny in this example.

Foreground Wait Class

We are familiar with Oracle wait events. In the AWR report under Wait Class, wait events (to be precise, user generated wait events, AKA foreground wait events) are grouped into classes. These include User I/O, DB CPU, Configuration, Commit, Other, System I/O, Application, Network and Concurrency. For example, exclusive TX locks are generally an application level issue and HW locks are generally a configuration issue. The following list includes common examples of the waits in some of these classes:

  • User I/O: Waits for user IO (db file sequential read, db file scattered read, direct path read, direct path write etc)
  • DB CPU: See above for the definition of this class
  • Configuration: Waits caused by inadequate configuration of database or instance resources (for example, undersized log file sizes, shared pool size etc)
  • Commit: waits for redo log write confirmation after a commit
  • Other: Waits which should not typically occur on a system (for example, ‘wait for EMON to spawn’)
  • System I/O: Waits for background process IO
  • Application: Waits resulting from user application code (for example, lock waits caused by row level locking or explicit lock commands)
  • Network: waits for data to be sent over the network
  • Concurrency: Waits for internal database resources (for example, latches)

For the other classes not covered above, you can get information from the Oracle manuals.
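If you want to check which class a given event belongs to, the mapping is exposed in the V$EVENT_NAME view; a quick sketch:

SELECT wait_class, name
FROM   v$event_name
WHERE  name IN ('db file sequential read', 'log file sync', 'free buffer waits')
ORDER  BY wait_class, name;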

Let us look at this section in our AWR report:

                                                 Avg
                             %Time   Total Wait  wait
Wait Class      Waits        -outs   Time (s)    ms)  %DB time
----------- ---------------- ----- ---------------- -------- -----
User I/O          2,866,247     0     838        0      64.2
DB CPU                                130                9.9
Configuration         5,930     0     110       18       8.4
Commit                  142     0      46      324       3.5
Other                 1,971    96      20       10       1.6
System I/O              417     0       0        0       0.0
Application              88     0       0        0       0.0
Network               2,653     0       0        0       0.0
Concurrency               5     0       0        0       0.0
-------------------------------------------------------------

User I/O taking 64.2% of database time was explained before. The CPU usage was just under 10%. Configuration contributed 8.4% of the wait time; these waits are attributed to database settings. We will see the events contributing to this class below.

Foreground Wait Events

-> s  - second, ms - millisecond -    1000th of a second
-> Only events with Total Wait Time (s) >= .001 are shown
-> ordered by wait time desc, waits desc (idle events last)
-> %Timeouts: value of 0 indicates value was < .5%.  Value of null is truly 0
                                                 Avg
                              %Time Total Wait  wait  Waits   % DB
Event                Waits    -outs   Time (s)  (ms)   /txn   time
-----------------  ------------ ---------- ------- -------- ------
db file sequential r 2,817,676    0   810       0     75.0   62.0
free buffer waits        5,505    0    70      13      0.1    5.3
log file sync              142    0    46     324      0.0    3.5
log buffer space           371    0    36      97      0.0    2.7
db file scattered r     48,458    0    28       1      1.3    2.1
rdbms ipc reply             10   90    19    1949      0.0    1.5
log file switch             48    0     4      81      0.0     .3
------------------------------------------------------------

For the sake of brevity I have only included the wait events that contribute to % DB time. db file sequential read and db file scattered read belong to the User I/O wait class. log file sync is part of the Commit wait class. free buffer waits, log buffer space, rdbms ipc reply and log file switch belong to the Configuration wait class.

Background Wait Events

This section of AWR report provides statistics on wait events for Oracle background processes.

Background Wait Events
-> ordered by wait time desc, waits desc (idle events last)
-> Only events with Total Wait Time (s) >= .001 are shown
-> %Timeouts: value of 0 indicates value was < .5%.  Value of null is truly 0
                                                 Avg
                              %Time Total Wait  wait  Waits   % bg
Event                Waits    -outs   Time (s)  (ms)   /txn   time
-----------------  ------------ ---------- ------- -------- ------
log file parallel write 10,319    0   845      82      0.3   24.9
db file parallel write  43,425    0   207       5      1.2    6.1
direct path write           55    0     2      40      0.0     .1
control file seq re     2,501     0     2       1      0.1     .1
-------------------------------------------------------------

The log file parallel write event shows LGWR waiting for blocks to be written to all online redo log members in a group; LGWR will wait until all blocks have been written to all members. Here 24.9% of the total background time was spent on log file parallel write. The db file parallel write wait event belongs to the DBWR process, since it is the only process that writes dirty blocks from the SGA to the datafiles. DBWR compiles a set of dirty blocks, hands the batch over to the OS, and waits on the db file parallel write event for the I/O to complete. The figure of interest here is Avg wait (ms). In our case it is 5 ms, which is a perfectly respectable figure. Obviously, larger average wait times point to a slower I/O subsystem or poor I/O configuration.

Service Level Statistics

The service statistics section gives information about how particular services configured in the database are performing.

    • A service is a grouping of processes
    • Users may be grouped in SYS$USERS
    • A single user (application login) may be grouped with that user name
Service Statistics
-> ordered by DB Time
                                            Physical      Logical
 Service Name   DB Time (s)   DB CPU (s)    Reads (K)    Reads (K)
------------- ------------ ------------ ------------ ------------
SYS$USERS            1,281          129        3,079       58,172
mydb.mich.local         25            1            2           34
SYS$BACKGROUND           0            0            5           14
mydbXDB                  0            0            0            0
    -------------------------------------------------------------

Note that SYS$USERS is the default service for all applications. You can configure different services for different production applications. Grouping services according to production needs, for example services created for batch applications, enables the DBA to identify the heavily loading services on the system.
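As a minimal sketch of how such a service could be set up with the DBMS_SERVICE package (the service name batch_app is purely illustrative; the application would then connect using that service name in its connect string):

-- create and start a dedicated service for batch work
EXEC DBMS_SERVICE.CREATE_SERVICE(service_name => 'batch_app', network_name => 'batch_app');
EXEC DBMS_SERVICE.START_SERVICE('batch_app');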


Service Wait Class Stats
-> Wait Class info for services in the Service Statistics section.
-> Total Waits and Time Waited displayed for the following wait
   classes:  User I/O, Concurrency, Administrative, Network
-> Time Waited (Wt Time) in seconds
Service Name
 ----------------------------------------------------------------
User I/O  User I/O  Concurcy Concurcy Admin  Admin Network Network
Tot Wts Wt Time   Tot Wts Wt Time Total Wts Wt Time Tot Wts Wt Time
--------- ------- -------- -------- --------- --------- ------- -------
SYS$USERS
2864754     838      5         0         0         0     796     0
mydb.mich.local
1493          0      0         0         0         0    1857     0
SYS$BACKGROUND
761          4      48         1         0         0       0     0
-------------------------------------------------------------

SQL Section

If any SQL statement appears in the top five statements in two or more of the areas below, then it is a prime candidate for tuning. The sections are:

  • Total Elapsed Time
  • Total CPU Time
  • Total Buffer Gets
  • Total Disk Reads
  • Total Executions
  • Total Parse Calls
  • Total Sharable Memory
  • Total Version Count

Let us try to see what these mean.

Total Elapsed Time

Total Elapsed Time = CPU Time + Wait Time. If a SQL statement appears in the total elapsed time area of the report, this means its CPU time plus its wait time made it rise to the top of the pile. Excessive elapsed time can be due to excessive CPU usage or excessive wait times.

This is the area that you need to examine and probably the one that will be reported by the users or application support. From a consumer perspective, the finer details don’t matter. The application is slow. Full stop.

In conjunction with excessive elapsed time, check whether this piece of SQL is also a high consumer under Total CPU Time; it normally is. Otherwise check the wait times and Total Disk Reads. These can indicate either issues with wait times (slow disks, latch gets etc) or too much physical IO associated with table scans or sub-optimal indexes. This section is a gate opener and often you will need to examine other sections.
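Outside the report itself, the same ranking can be pulled from the workload repository; a minimal sketch for the top statements by elapsed and CPU time between the two snapshots used here (Diagnostics Pack licence assumed):

SELECT *
FROM  (SELECT st.sql_id,
              ROUND(SUM(st.elapsed_time_delta) / 1e6, 1) AS elapsed_secs,
              ROUND(SUM(st.cpu_time_delta)     / 1e6, 1) AS cpu_secs,
              SUM(st.buffer_gets_delta)                  AS buffer_gets,
              SUM(st.disk_reads_delta)                   AS disk_reads
       FROM   dba_hist_sqlstat st
       WHERE  st.snap_id >  4413
       AND    st.snap_id <= 4414
       GROUP  BY st.sql_id
       ORDER  BY SUM(st.elapsed_time_delta) DESC)
WHERE  rownum <= 5;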

Total CPU Time

When a statement appears in the Total CPU Time area, it indicates the statement used excessive CPU cycles during its processing. Excessive CPU processing time can be caused by sorting, excessive function usage or long parse times. Indicators that you should be looking at this section for SQL tuning candidates include high CPU percentages in the service section for the service associated with this SQL (a hint: if the SQL is in uppercase it probably comes from a user or application; if it is lowercase it usually comes from internal or background processes). To reduce total CPU time, reduce sorting by using composite indexes that can cover the sort, and use bind variables to reduce parse times.

Total Buffer Gets

High total buffer gets mean a SQL statement is reading a lot of data from the db block buffers. Generally speaking, buffer gets (AKA logical IO or LIO) are OK, except when they become excessive; the old saying holds that if you reduce the logical IO, the physical IO (disk reads) will take care of itself. An LIO may have incurred a PIO in order to get the block into the buffer in the first place. Reducing buffer gets is very important and should not be underestimated. To get a block from the db block buffers, we have to latch it (in order to prevent someone from modifying the data structures we are currently reading from the buffer). Although latches are less persistent than locks, a latch is still a serialization device, and serialization devices inhibit scalability: the more you use them, the less concurrency you get. Therefore, in most cases reducing buffer gets results in improved performance. Also note that by lowering buffer gets you will use less CPU and less latching. Thus, to reduce excessive buffer gets, optimise the SQL to use appropriate indexes and reduce full table scans. You can also look at improving the indexing strategy and consider deploying partitioning (a licensed option).

Total Disk Reads

High total disk reads mean a SQL statement is reading a lot of data from disk rather than from the db block buffers. High physical reads after a server reboot are expected as the cache is cold and data is fetched from disk. However, disk reads (physical reads) are undesirable in an OLTP system, especially when they become excessive, and excessive disk reads do cause performance issues. The usual approach is to increase the db buffer cache to allow more buffers and reduce ageing. High total disk reads are typified by high physical reads, a low buffer cache hit ratio and high IO wait times. Higher wait times for disk IO can have a variety of causes (a busy or over-saturated SAN, slower underlying storage, insufficient HBA capacity and other hardware causes). The Statistics on IO section in AWR, plus Operating System diagnostic tools as simple as iostat, can help in identifying these issues. To reduce excessive disk reads, consider partitioning, use indexes and look at optimizing the SQL to avoid excessive full table scans.

Total Executions

High total executions need to be reviewed to see whether they are genuine executions or loops in the code. I have also seen situations where Autosys jobs fire duplicate code erroneously. In general, statements with high numbers of executions are usually being properly reused. However, there is always the chance of an unnecessary loop in PL/SQL, Java or C#. Statements with a high number of executions and a high number of logical and/or physical reads are candidates for review, to be sure they are not being executed multiple times when a single execution would serve. If the database has excessive physical and logical reads or excessive IO wait times, look at the SQL statements that show excessive executions together with high physical and logical reads.

Parse Calls

Whenever a statement is issued by a user or process, it undergoes a parse, regardless of whether it is already in the SQL pool. As explained under Parsing, the parse can be a hard parse or a soft parse. Excessive parse calls usually go hand in hand with excessive executions. If the statement is using what are known as unsafe bind variables, then the statement will be re-parsed each time. If the parse ratios in the header section are low, look here and in the version count area.

Sharable Memory

Sharable Memory refers to the Shared Pool memory area in the SGA, so this particular section of the AWR report lists the SQL statement cursors that consumed the largest amount of the Shared Pool for their execution.

In general, high values for Sharable Memory do not necessarily imply there is an issue. They simply mean that:

  1. These SQL statements are big or complex and Oracle has to keep a lot of information about them, or
  2. A large number of child cursors exist for those parent cursors, or
  3. A combination of 1 and 2.

In the case of point 2, the child cursors may be due to poor coding such as bind variable mismatches, security mismatches or overly large SQL statements that join many tables. In a DSS or DW environment large complex statements are normal. In an OLTP database, large or complex statements are usually the result of over-normalization of the database design, attempts to use an OLTP system as a DW, or simply poor coding techniques. Large statements will usually result in excessive parsing, recursion and large CPU usage.
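
To see which cursors are taking the most shared pool memory and how many child cursors each has, something like the following can be used. A minimal sketch, assuming the standard V$SQLAREA view (the 1 MB threshold is an arbitrary example):

-- Cursors consuming the most sharable memory in the shared pool
SELECT *
FROM  (SELECT sql_id,
              sharable_mem,
              version_count,
              executions,
              SUBSTR(sql_text, 1, 60) AS sql_text
       FROM   v$sqlarea
       WHERE  sharable_mem > 1048576      -- more than 1 MB, arbitrary cut-off
       ORDER  BY sharable_mem DESC)
WHERE ROWNUM <= 10;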

Version Count

High version counts (many child cursors for the same parent cursor) are usually due to the same SQL being run against multiple identical schemas, unsafe bind variables, or Oracle bugs.
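
When you find a statement with a high version count, V$SQL_SHARED_CURSOR records the reason each child cursor could not be shared. A minimal sketch ('an_sql_id' is a placeholder, and only two of the many mismatch columns are shown):

-- Why the child cursors of one statement were not shared ('an_sql_id' is a placeholder)
SELECT sql_id, child_number, bind_mismatch, optimizer_mismatch
FROM   v$sql_shared_cursor
WHERE  sql_id = 'an_sql_id';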

Instance Activity Statistics

This section compares the values of the instance activity statistics between the two snapshots. For each statistic the total value is shown, together with the rate per second and per transaction. This is a comprehensive set of statistics; use them in conjunction with the other sections of the report.
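
The same figures can be extracted from the workload repository for any pair of snapshots. A minimal sketch, assuming DBA_HIST_SYSSTAT and placeholder snap ids 100 and 101:

-- Delta of selected instance activity statistics between two snapshots (snap ids are placeholders)
SELECT e.stat_name,
       e.value - b.value AS delta
FROM   dba_hist_sysstat b,
       dba_hist_sysstat e
WHERE  b.snap_id = 100
AND    e.snap_id = 101
AND    b.stat_name = e.stat_name
AND    b.instance_number = e.instance_number
AND    b.dbid = e.dbid
AND    e.stat_name IN ('physical read total IO requests',
                       'physical write total IO requests')
ORDER  BY delta DESC;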

Tablespace IO Stats

This is a useful section in that it shows the IO activity at the tablespace level. Remember the figures here are aggregated per tablespace. A good place to correlate the tablespace IO stats is with the Instance Activity Statistics entries for physical read and write total IO requests, as shown below:

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
physical read total IO requests           2,873,185        2,278.0          76.5
physical write total IO requests          5,636,945        4,469.3         150.1

Going back to the Tablespace IO Stats section we see:


Tablespace IO Stats
-> ordered by IOs (Reads + Writes) desc

Tablespace        Reads Av Reads/s Av Rd(ms) Av Blks/Rd     Writes Av Writes/s Buf Waits Av Buf Wt(ms)
---------- ------------ ---------- --------- ---------- ---------- ----------- --------- -------------
APP1          2,860,666      2,268       0.3        1.1  3,564,644       2,826         0           0.0
UNDOTBS1             90          0       9.8        1.0  1,990,759       1,578         0           0.0
SYSTEM            7,190          6       0.2        1.3        204           0         0           0.0
SYSAUX            1,938          2       0.3        2.1      1,047           1         0           0.0
APP1_HDD            105          0       0.1        1.0        105           0         0           0.0
-------------------------------------------------------------

Let us look at the physical read total IO requests. We have 2,873,185 in total. If we add up the Reads from the tablespaces we see that the lion's share comes from the APP1 tablespace, with a total of 2,860,666 physical IO reads (not block reads). The average physical IO reads per second (Av Reads/s) was 2,268. On average each physical IO read took 0.3 ms and fetched 1.1 blocks.

Now switching to writes, we see activity in both the APP1 and UNDOTBS1 tablespaces, APP1 being the user tablespace and UNDOTBS1 holding the undo data. Again, if we add up the total physical IO writes, 3,564,644 + 1,990,759, we arrive at the figure of 5,555,403, not far off the 5,636,945 shown under physical write total IO requests. The average physical IO writes per second for the APP1 tablespace was 2,826. Obviously the throughput will tell you whether there has been any sub-optimal performance compared to previous records. If that is the case then you will need to examine the underlying disk sub-systems for the tablespace.
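
Outside the report, the same tablespace-level picture is available from the dynamic performance views. A minimal sketch, assuming V$FILESTAT and DBA_DATA_FILES (these are cumulative figures since instance startup, not a snapshot delta; READTIM and WRITETIM are in centiseconds):

-- Cumulative physical reads and writes per tablespace since instance startup
SELECT d.tablespace_name,
       SUM(f.phyrds)   AS phys_reads,
       SUM(f.phywrts)  AS phys_writes,
       SUM(f.readtim)  AS read_time_cs,
       SUM(f.writetim) AS write_time_cs
FROM   v$filestat f,
       dba_data_files d
WHERE  f.file# = d.file_id
GROUP  BY d.tablespace_name
ORDER  BY SUM(f.phyrds) + SUM(f.phywrts) DESC;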

Next we look at the File IO Stats section. This section breaks the IO down to the underlying data files of each tablespace, so with that in mind we will look at the readings for each file.

Tablespace Filename
---------- ----------------------------------------------------
                 Reads Av Reads/s Av Rd(ms) Av Blks/Rd     Writes Av Writes/s Buf Waits Av Buf Wt(ms)
          ------------ ---------- --------- ---------- ---------- ----------- --------- -------------
APP1       /ssddata4/oracle/mydb/APP02.dbf
               695,636        552       0.4        1.1  1,602,099       1,270         0           0.0
APP1       /ssddata4/oracle/mydb/APP03.dbf
               702,494        557       0.2        1.0    849,159         673         0           0.0
APP1       /ssddata6/oracle/mydb/app01.dbf
             1,462,536      1,160       0.3        1.1  1,113,386         883         0           0.0

The files of main interest are those that belong to tablespace APP1 above, namely APP02.dbf, APP03.dbf and app01.dbf. These data files are all built on SSD devices, so they should be pretty fast. Let us look at some of the readings. There were 2,860,666 physical IO reads for tablespace APP1. Breaking this down, we can see that 1,462,536 were attributed to app01.dbf, 695,636 to APP02.dbf and the remaining 702,494 to APP03.dbf. The average physical IO reads per second (Av Reads/s) was 1,160 for app01.dbf, compared with 552 for APP02.dbf and 557 for APP03.dbf. So does this imply faster reads from app01.dbf compared with the other two? To investigate that we need to look at the average read time per IO (Av Rd(ms)). We see figures of 0.3, 0.4 and 0.2 respectively, so the throughput is pretty consistent. The only noticeable point is that most of the data for tablespace APP1 comes from the first data file, app01.dbf, and hence we see a higher Av Reads/s for that file in the report.
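
To keep an eye on the same file-level figures between report runs, the file statistics views can be queried directly. A minimal sketch, assuming V$FILESTAT joined to V$DATAFILE (cumulative since instance startup; DBA_HIST_FILESTATXS holds the per-snapshot history if you need deltas):

-- Cumulative IO statistics per data file since instance startup
SELECT d.name                                          AS file_name,
       f.phyrds                                        AS phys_reads,
       f.phywrts                                       AS phys_writes,
       ROUND(f.readtim * 10 / NULLIF(f.phyrds, 0), 1)  AS avg_read_ms,   -- readtim is in centiseconds
       ROUND(f.phyblkrd / NULLIF(f.phyrds, 0), 1)      AS avg_blks_per_read
FROM   v$filestat f,
       v$datafile d
WHERE  f.file# = d.file#
ORDER  BY f.phyrds + f.phywrts DESC;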
