
Working with Hadoop command lines May 30, 2015

Posted by Mich Talebzadeh in Uncategorized.

In my earlier blog post I covered the installation and configuration of Hadoop. We will now focus on working with Hadoop.

When I say working with Hadoop, I mean working with HDFS and the MapReduce engine. If you are familiar with Unix/Linux commands, it will be easy to find your way around Hadoop commands.

Let us start by looking at the root directory of HDFS. Hadoop file system commands follow this notation:

hdfs dfs

To see the files under “/” we do

hdfs dfs -ls /
Found 5 items
drwxr-xr-x   - hduser supergroup          0 2015-04-26 18:35 /abc
drwxr-xr-x   - hduser supergroup          0 2015-05-03 09:08 /system
drwxrwx---   - hduser supergroup          0 2015-04-14 06:46 /tmp
drwxr-xr-x   - hduser supergroup          0 2015-04-14 09:51 /user
drwxr-xr-x   - hduser supergroup          0 2015-04-24 20:42 /xyz

If you want to go one level deeper, you can run

hdfs dfs -ls /user
Found 2 items
drwxr-xr-x   - hduser supergroup          0 2015-05-16 08:41 /user/hduser
drwxr-xr-x   - hduser supergroup          0 2015-04-22 16:25 /user/hive
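Rather than descending one level at a time, you can list everything under a directory in one go with the -R flag. A minimal sketch, assuming a running cluster laid out as above:

```shell
# recursively list all files and directories under /user
# (requires a running HDFS cluster)
hdfs dfs -ls -R /user
```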

The complete list of Hadoop file system shell commands is given in the Hadoop FileSystem Shell documentation.

You can run the same command from a remote host as long as the Hadoop software is installed on that host. You qualify the target by specifying the hostname and the port number on which Hadoop is listening:

hdfs dfs -ls hdfs://rhes564:9000/user
Found 2 items
drwxr-xr-x - hduser supergroup 0 2015-05-16 08:41 hdfs://rhes564:9000/user/hduser
drwxr-xr-x - hduser supergroup 0 2015-04-22 16:25 hdfs://rhes564:9000/user/hive

You can put a file from a remote host into HDFS as follows. As an example, first create a simple text file on the remote host:

echo hostname > hostname.txt
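Note that echo hostname writes the literal word "hostname" into the file, not the machine's actual name; to capture the real hostname you would run the hostname command itself. A quick local check of the difference (the /tmp paths are just illustrative):

```shell
# "echo hostname" writes the literal string, not the machine's name
echo hostname > /tmp/hostname_literal.txt
cat /tmp/hostname_literal.txt    # prints the word "hostname"

# to store the actual hostname instead, run the command itself
hostname > /tmp/hostname_actual.txt
```

Either file can then be put into HDFS in the same way.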

Now put that file into HDFS on rhes564

hdfs dfs -put hostname.txt hdfs://rhes564:9000/user/hduser

Check that the file has been stored in HDFS correctly

hdfs dfs -ls hdfs://rhes564:9000/user/hduser
-rw-r--r--   2 hduser supergroup          9 2015-05-30 19:49 hdfs://rhes564:9000/user/hduser/hostname.txt
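Besides checking the listing, you can confirm the file's contents: hdfs dfs -cat prints an HDFS file to standard output. A sketch against the same cluster and path:

```shell
# print the file's contents straight from HDFS
# (requires a running HDFS cluster)
hdfs dfs -cat hdfs://rhes564:9000/user/hduser/hostname.txt
```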

Note the full pathname of the file in HDFS. We can now go ahead and delete that text file.

hdfs dfs -rm /user/hduser/hostname.txt
15/05/30 21:10:23 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hduser/hostname.txt
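As the INFO line shows, deletes go through the HDFS trash policy (here the deletion interval is 0 minutes, i.e. files are removed immediately). On a cluster where trash is enabled, -rm accepts a -skipTrash flag to bypass it. A sketch, same cluster as above:

```shell
# delete immediately, bypassing the .Trash directory
# (requires a running HDFS cluster)
hdfs dfs -rm -skipTrash /user/hduser/hostname.txt
```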

To remove an entire directory you can run

hdfs dfs -rm -r /abc
Deleted /abc

To see the sizes of files and directories under / you can run

hdfs dfs -du /
0            /system
4033091893   /tmp
22039558766  /user
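The raw byte counts above are hard to read at a glance; -du also accepts -h for human-readable sizes and -s for a single summary line per path. A sketch against the same cluster:

```shell
# human-readable sizes per top-level directory
# (requires a running HDFS cluster)
hdfs dfs -du -h /

# one summary line for the whole of /user
hdfs dfs -du -s -h /user
```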

Is there a silver bullet IT Strategy June 4, 2012

Posted by Mich Talebzadeh in Uncategorized.

This is about IT strategy, and most of the people I know have had many years of IT experience. This industry is sharply divided between doers and watchers. The leaders and strategists are the ones that make waves, and usually they know what they want. However, the big obstacle seems to be the organizations that we work in.

As a consultant I have been to many sites where the IT managers and directors have little or no inclination to move IT forward in a pragmatic way. For most of them the status quo is just fine; in other words they toe the line and they speak the official language. As a result, however, the enterprise suffers. I believe there is a culture that values loyalty more than capability. This in turn results in a rigid environment where changes are seldom accepted. Moreover, the IT strategist's background is always biased towards a certain product, since, presumably, the manager feels more comfortable with a given product. That is hardly a strategy for progress.

It is also interesting to note that the average employee's loyalty to a given firm is two to three years, and people change jobs more often these days than before. Any long-term vision or strategy is nonexistent, chiefly because the IT director has a different agenda for himself/herself, ending up in a lame-duck strategy.

I have seen too often that organizations bring in someone (say a new director) from outside (usually the director will tell you that he/she was headhunted for the job!). The pattern is always similar: let us cut costs, streamline the processes (meaning get rid of a few people), bring in outsourcing (usually they know a few firms in South Asia that they have worked with before), and we can reduce the projects' lifecycle drastically. There is usually an announcement of some town hall or soap-box meeting, with video conferencing and large participation by staff of all types. Mind you, the staff were told to attend. Then we have the director giving his/her vision of things to come and the IT strategy for the next two years.

The figures shown on their favourite PowerPoint projection are impressive. That is, until such time as you start asking questions about them. The graphs do not add up or are skewed too much. The IT director will have difficulty explaining these figures, or he/she will tell you that a particular graph is not normalized. You shake your head in disbelief, knowing that most of the material is downloaded from the web. The director has difficulty explaining it because he does not believe those figures either.

Let us face it, how long does it take to formulate a project strategy? It does not happen overnight, in a week, a month or a quarter. I am at a loss to understand how these so-called IT gurus are going to make it happen. By the way, they are usually pretty good at making cliché-type statements. A few months on, you hear that the IT director has left the firm without delivering anything, at great cost to the company and without any added value. I am at a loss to see why this industry is following this trend. Is it because the IT industry, unlike disciplines such as civil engineering, is still very much evolving? As strategists, what are the great unknowns in IT? In my view there is no revolutionary solution, only an evolutionary one.
A company's IT strategy can only evolve through sound leadership and decision making. More importantly, the IT strategy has got to be fit for the underlying business.

in-memory databases March 26, 2012

Posted by Mich Talebzadeh in In-Memory databases, Uncategorized.

An in-memory database or IMDB can be a standalone database management system (DBMS) like Oracle's TimesTen, or a specific database that is part of a DBMS, like the in-memory database of Sybase Adaptive Server Enterprise (ASE, aka Sybase classic), known as ASE-IMDB.

These databases rely on computer memory for data storage. This is in contrast to traditional database management systems that rely on disk storage for storing data, even if the database is using Solid State Devices (SSDs). IMDBs are faster than disk-optimized databases since the internal optimization algorithms are simpler and execute fewer CPU instructions, and accessing data in memory provides a much faster response. In applications where response time is critical, such as telecommunications network equipment, defence and certain trading systems, IMDBs are often used. Because of their nature, these databases tend to use more memory than their disk-resident counterparts.

In general one can categorize IMDBs as in-process and out-of-process main-memory databases. Out-of-process database management systems are usually, though not necessarily, high-functionality systems that implement full SQL (possibly with some dialects), security, database administration, and so on. Specifically, administration is provided via a separate application (these days commonly running in a browser). These servers never provide access to data other than via SQL, and are not restricted in their memory and other system resource usage. The server's footprint is usually rather high (on the order of several megabytes). Servers such as ASE-IMDB and Oracle's TimesTen can mimic the functionality of their "disk-resident big brothers", which makes it easier to use these products for caching SQL requests to a SQL back-end that maintains persistent databases. From a performance standpoint it takes time to route database requests either through some sort of IPC mechanism (for local access) or through the network. This basically means:

  • A lot of "out-of-the-box" functionality
  • No need to write applications to access the database

In contrast, in-process database management systems are implemented as libraries that applications link with. This technique is sometimes referred to as an "embedded database". In-process databases can provide a native-language API, such as C/C++, Java or C#, in addition to (or even instead of) the SQL API. In many cases these native APIs can be more efficient than the SQL API, yet they require knowledge of the database layout and internals. IMDBs such as eXtremeDB from McObject provide APIs that facilitate database control functionality (system tables' access, etc.) and often integrate HTTP server functions that allow applications to easily access the database through a browser. The scope of SQL is less than that found in out-of-process IMDBs; specifically, security is never implemented. Most libraries carry much less overhead as they use the application's memory pools and other resources. Library footprints are usually much lower than those of out-of-process IMDBs, usually on the order of hundreds of kilobytes or less. This usually means:

  • Less "out-of-the-box" SQL functionality; applications must be written to access the database, compensated by providing a native-language API
  • Higher performance
  • Lower footprint

Both TimesTen and ASE-IMDB, and almost all current commercial IMDBs, are based on what I call the row-based storage implementation (RBSI) of the relational model. These products are extremely effective for online transaction processing (OLTP) type applications, and in the majority of cases that is what they deal with. This was the era of transaction-based databases, in which the user (for example, a trader) was only interested in his/her own portfolio. Adding a few trades and reading them back with some aggregates was fine and within the capability of these systems. Things have moved on since. Today's systems and users deal much more with non-transactional (read) activity than with transactional (write/update) activity. An average trader today is interested not only in his/her own portfolio but also in other portfolios and analytics, which requires sifting through millions of records.

To this end, there has been another physical implementation of the relational model for some time. In this physical model, data is stored in columns. This is referred to as the Column Based Storage Implementation (CBSI) of the relational model, and it is used for analytics and data warehousing. Sybase IQ is an example of CBSI of the relational model, as is LucidDB, an open-source RDBMS for analytics.

So how about another solution, where you have columnar databases in memory? SAP, which recently acquired Sybase Inc, has already delivered columnar in-memory database technology to market via its Business Warehouse Accelerator hardware-based acceleration engine, so IMDB technology is certainly moving forward. Having said that, there appears to be a roadmap for Sybase IQ to be in-memory as well. Time will tell.

Complex Event Processing March 25, 2012

Posted by Mich Talebzadeh in Uncategorized.