Introduction to Big Data

February 19, 2015

Posted by Mich Talebzadeh in Big Data.

Big Data is nowadays the vogue term of the month, or even the year. I recall that around 10 years ago Linux was in the same category. Anyway, we have got used to these trends. The bottom line is that Big Data is a trendy term these days, and if you work in I.T. and don’t know about it, then you are missing the buzz.

Anyway, let us move on and separate the signal from the pedestal, so to speak.

According to Wikipedia, Big Data is defined as follows, and I quote:

“Big Data is a broad term for data sets so large or complex that they are difficult to process using traditional data processing applications. Challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and information privacy.”

Right, what does “so large or complex” mean in this context? Broadly, it refers to the following characteristics:

Extremely large Volume of data:

– Data coming from message buses (TIBCO, MQ-Series etc)
– Data coming from transactional databases (Oracle, Sybase, Microsoft and others)
– Data coming from legacy sources (ISAM, VSAM etc)
– Data coming from unstructured sources such as emails, Word and Excel documents, etc.
– Data coming from algorithmic trading through Complex Event Processing
– Data coming from Internet (HTML, WML etc)
– Data coming from a variety of applications

OK, I used the term unstructured above. So what does that really mean? First, for me structured data means anything kept in a relational database and accessed via Structured Query Language (SQL). So there we go, the word structured popped up. SQL itself is very structured: misspell or misplace anything and it won’t work!
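To make this concrete, here is a minimal sketch in Python using the built-in sqlite3 module; the trades table and its columns are purely illustrative:

    # A minimal sketch of "structured" data: a relational table with a fixed
    # schema, queried via SQL. The table and columns are illustrative only.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trades (trade_id INTEGER PRIMARY KEY, "
                 "symbol TEXT NOT NULL, quantity INTEGER, price REAL)")
    conn.execute("INSERT INTO trades (symbol, quantity, price) VALUES (?, ?, ?)",
                 ("IBM", 100, 152.50))

    # SQL is rigid: misspell a column and the engine rejects the query outright.
    try:
        conn.execute("SELECT symbl FROM trades")   # typo: 'symbl'
    except sqlite3.OperationalError as e:
        print("SQL error:", e)                     # no such column: symbl

    # A well-formed query returns predictable, typed rows.
    for row in conn.execute("SELECT symbol, quantity * price FROM trades"):
        print(row)                                 # ('IBM', 15250.0)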

Now, what does the term unstructured data mean in this context? It basically means that the data’s structure is not predictable, or that it does not follow a specified format. In other words, it does not fit into a relational context. I am sure some purists will disagree with me on that point.
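By way of contrast, here is a sketch of dealing with unstructured data: a free-form email from which fields can only be extracted opportunistically. The email text and the regex patterns below are invented for illustration; real messages rarely follow one format, which is exactly why this data resists a relational mould:

    # "Unstructured" data: free-form text with no fixed schema. The best we
    # can do is pull fields out opportunistically, e.g. with regexes.
    import re

    # An invented example email; real messages vary wildly in layout.
    email = ("From: trader@example.com\n"
             "Subject: FX position update\n"
             "\n"
             "Sold 2,000,000 EUR/USD at 1.1350 this morning.\n")

    sender = re.search(r"^From:\s*(\S+)", email, re.MULTILINE)
    deal = re.search(r"Sold\s+([\d,]+)\s+(\w+/\w+)\s+at\s+([\d.]+)", email)

    if sender and deal:
        print(sender.group(1))                              # trader@example.com
        print(deal.group(1), deal.group(2), deal.group(3))  # 2,000,000 EUR/USD 1.1350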

Extremely High Velocity:

Data is streaming in at unprecedented speed and must be dealt with in a timely manner. This could be in batch or in real time.
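As a rough illustration of the batch versus real-time distinction, here is a minimal Python sketch that consumes a simulated tick feed in micro-batches; the feed and the batch size are assumptions, not a real market-data API:

    # Consume an unbounded stream in small batches: a common compromise
    # between pure batch and pure real-time processing.
    import itertools
    import random

    def tick_stream():
        """Simulated infinite feed of (symbol, price) ticks."""
        while True:
            yield ("EURUSD", round(random.uniform(1.10, 1.15), 5))

    def micro_batches(stream, batch_size=1000):
        """Group a stream into fixed-size batches for near-real-time work."""
        it = iter(stream)
        while True:
            batch = list(itertools.islice(it, batch_size))
            if not batch:
                return
            yield batch

    # Process the first three batches of the (endless) feed.
    for batch in itertools.islice(micro_batches(tick_stream()), 3):
        avg = sum(price for _, price in batch) / len(batch)
        print(f"processed {len(batch)} ticks, average price {avg:.5f}")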

Extremely wide variety of data:

OK, if you look at the points under “Extremely large Volume of data” above, you will see the variety of structured and unstructured data that Big Data deals with.

Traditionally, over the past 20-30 years, the bulk of the data used in many industries, including the financial sector, has been stored in relational databases in structured format. Yes, the one with tables and columns, primary keys, indexes, third normal form and the other concepts that originated with Dr Codd in 1970. Before that there were hierarchical and network databases, but we won’t go into that. The current estimate is that only around 20-30% of available data is stored in structured format and the rest is unstructured. Unstructured refers to data that either does not have a pre-defined data model or is not organized in a pre-defined manner. You can pick examples out of the list above.

So if you are coming from a relational database management background like myself, you may rightly be asking yourself: why do we need all this when we have reliable, time-tested relational database management systems (RDBMS)? The answer is something you probably already know. In short, not all data fits into a structured mould, and unstructured data accounts for roughly 70% of the total!

You may argue that we have data warehouses these days. However, recall that a data warehouse like SAP Sybase IQ is a columnar implementation of the relational model. In contrast, databases like Oracle and SAP Sybase ASE are row-based implementations of the relational model. Both Oracle and SAP Sybase ASE nowadays offer an in-memory database as well. Oracle, with version 12c, is about to offer a columnar in-memory implementation, but it will still use the same row-based optimizer that has been at the heart of Oracle for many years.
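To illustrate the row versus columnar distinction, here is a simplified Python sketch. The table is invented, and the layout is only a conceptual stand-in for what products like SAP Sybase IQ or the Oracle 12c in-memory column store actually do:

    # Row stores keep whole records together (good for transactional point
    # lookups); column stores keep each column together (good for scans).
    rows = [
        {"trade_id": 1, "symbol": "IBM", "qty": 100, "price": 152.5},
        {"trade_id": 2, "symbol": "SAP", "qty": 200, "price": 118.0},
        {"trade_id": 3, "symbol": "IBM", "qty": 50,  "price": 153.1},
    ]

    # Columnar layout: one array per column.
    columns = {key: [r[key] for r in rows] for key in rows[0]}

    # An analytical query such as SUM(qty * price) touches only two columns;
    # a column store reads just those arrays instead of every full row.
    total = sum(q * p for q, p in zip(columns["qty"], columns["price"]))
    print(total)   # 46505.0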

The bottom line is that the data enterprises hold sits in different silos/databases and comes in different forms and shapes. Even the relational schemas are often local, reflecting legacy data models rather than adhering to a common enterprise-wide data model.

Let me give you an example. Imagine a bank wants to know its total exposure to a certain country today as part of its risk analysis. The position is not clear-cut and is kept in different applications within the bank. Often these applications feed each other one on one. In other words, moving numbers from A to B and back again. A rather simplistic but surprisingly accurate summary!

That does not help to solve the problem. Maybe the bank has lent directly to this country. It may also have dealings with other banks and institutions that have large exposures to the said country. So where is the bank’s aggregate risk position as of now? It cannot go on shifting spreadsheets from one silo to another. Even if it did, that would take ages and would never be real time.
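A back-of-the-envelope sketch of the aggregation the bank actually needs might look like the Python below; all the system names, field names and figures are invented for illustration:

    # Country exposure scattered across silos, each with its own local schema.
    lending_book = [{"country": "XY", "exposure_usd": 120_000_000}]
    counterparty_system = [{"cty": "XY", "amount": 45_000_000},
                           {"cty": "ZZ", "amount": 10_000_000}]

    def total_exposure(country):
        """Aggregate one country's exposure across differently-shaped feeds."""
        direct = sum(r["exposure_usd"] for r in lending_book
                     if r["country"] == country)
        indirect = sum(r["amount"] for r in counterparty_system
                       if r["cty"] == country)
        return direct + indirect

    print(total_exposure("XY"))   # 165000000

The point is not the dozen lines of code; it is that every new silo means another bespoke mapping, which is exactly the problem a golden source of data removes.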

Banks are further under regulatory pressure to abide by numerous accounting and regulatory standards, and this will require IT transformation. One example is compliance with IFRS 9 Financial Instruments (regulation) across Risk and Legal Entity. All of these will (eventually) require a golden source of data from which to get the required information.

Big Data is not a single technology but incorporates many new and existing technologies. It is not just about having something like the Hadoop framework in place to use as one big massive kitchen sink for all data. It is a myriad of technologies all working together. Ignoring the complexity of the middle layers, we should think about the Big Data sources at one end and the so-called Consumption layer at the other.
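To give a flavour of the programming model behind frameworks like Hadoop, here is a toy single-process sketch of MapReduce-style word counting; real frameworks distribute these same map, shuffle and reduce steps across a cluster:

    # Map each record to key/value pairs, then reduce the values per key.
    from collections import defaultdict

    records = ["error timeout", "ok", "error disk", "ok", "error timeout"]

    # Map phase: emit (word, 1) for every word in every record.
    mapped = [(word, 1) for record in records for word in record.split()]

    # Shuffle phase: group values by key.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)

    # Reduce phase: aggregate each key's values.
    counts = {key: sum(values) for key, values in grouped.items()}
    print(counts)   # {'error': 3, 'timeout': 2, 'ok': 2, 'disk': 1}

Of course, the real engineering effort lies in everything between those sources and the consumption layer.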
