
Big Data Architecture

February 19, 2015

Posted by Mich Talebzadeh in Big Data.

Data sources

I mentioned before that there are sources and consumers of big data and a big bridge in between. Let us take a heuristic approach and see what we mean by big data sources and the consumption layer. This is depicted in the diagram below:

[Diagram 1: Big data sources and consumption layer]

This layer provides the foundation and the data sources necessary for a data-driven enterprise. From the figure, let us have a look at the sources of data. Clearly there is variety. We have structured data, mainframe data (ISAM, VSAM and the like) in the form of legacy systems, and we have text, images, spreadsheets and documents, intranet content, XML and message bus data (TIBCO, MQ etc.). While analysing more data is valuable and complex, drawing on more data sources is likely to add value to the business.

The job of analysing the whole of this disparate data is complex. However, companies believe that it will open up new opportunities and revenue streams. In short, they want to convert this enormous heap into information. Remember that data on its own is practically worthless until it can be analysed and turned into information. So we must keep in mind that the ultimate purpose of data is to support business needs.

We are all familiar with the structured data that we keep in relational databases in one shape or form. Unstructured data such as logs, emails, blogs, conference recordings and webinars is everywhere, often hidden below layers of noise, babble, misspelled names and the like that make it difficult to store and turn into information.

Can a relational database be made to deal with unstructured data? Well, not very likely. Recall that we still have capacity and latency issues with image and text fields stored in relational databases. Those who are familiar with relational databases know full well that replicating text and image columns (CLOB and BLOB) creates challenges for Replication Server in Sybase and materialized views in Oracle, notably the added latency and backlog in pushing the data. Simply put, relational databases were not built to process large amounts of unstructured data in the form of text, images and so on.

How about data warehouses based on columnar models of relational databases? Again, they are not built to provide meaningful and efficient analysis of unstructured data.

Data conversion and storage layer

Now that we have identified the sources of big data, we will need to identify the characteristics of the data that must be processed according to each application. Data, regardless of its source, will have common characteristics (a small illustrative profile follows the list):

Data origin – front office, middle office, back office, external providers, machine generated, email, intranet, wiki, spreadsheets, logs, audio, legacy, ftp
Data format – structured, unstructured
Data frequency – end of day, intra-day (on demand), continuous, time series
Data type – company data, customer reference data, historical, transactional
Data size – the volume of data expected from the silos and their daily growth. Your friendly DBAs, along with various other administrators, should be able to provide this information. You will need it to assess the size of the Big Data repository and its growth
Data delivery – how the data needs to be processed at the consumption layer
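
As a purely illustrative aside, the characteristics above could be captured as a per-source profile. Here is a minimal sketch in Python; the field names and the sample feed are assumptions, not a prescribed schema:

from dataclasses import dataclass

@dataclass
class DataSourceProfile:
    origin: str             # e.g. "back office", "logs", "ftp"
    data_format: str        # "structured" or "unstructured"
    frequency: str          # "end of day", "intra-day", "continuous", "time series"
    data_type: str          # "company", "customer reference", "historical", "transactional"
    size_gb: float          # current volume reported by the friendly DBAs
    daily_growth_gb: float  # expected growth, for sizing the Big Data repository
    delivery: str           # how the consumption layer expects to receive it

# a hypothetical back-office log feed
trade_logs = DataSourceProfile("back office", "unstructured", "continuous",
                               "transactional", 1200.0, 15.0, "streaming")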

There is a term for this manipulation of data: it is referred to as data ingestion. Data ingestion is the process of acquiring and processing data for later use or storage. The key point is processing these disparate data sets. Specifically, data ingestion includes the following (a small sketch follows the list):

Data identification – identifying different data formats for structured data and default formats for unstructured data
Data filtration – only interested in data relevant to master data management (MDM) needs, see below
Data validation – continuous validation of data against MDM
Data pruning – removing unwanted data
Data transformation – migrating from local silo schema to MDM
Data integration – integrating the transformed data into Big Data
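
To make the steps concrete, here is a deliberately small sketch of the six ingestion stages as plain Python functions. The record layout, the MDM reference set and the field mappings are all assumptions for illustration, not part of any particular product:

def identify(record):
    # data identification: tag the record as structured or unstructured
    record["format"] = "structured" if isinstance(record.get("payload"), dict) else "unstructured"
    return record

def filter_for_mdm(records):
    # data filtration: keep only entities relevant to the MDM needs
    return [r for r in records if r.get("entity") in ("customer", "trade")]

def validate(record, mdm_reference):
    # data validation: check the record against the MDM reference data
    return record.get("customer_id") in mdm_reference

def prune(record):
    # data pruning: drop fields the master model does not need
    record.pop("debug_info", None)
    return record

def transform(record):
    # data transformation: map the local silo field onto the master (MDM) field
    record["master_customer_id"] = record.pop("customer_id")
    return record

def integrate(records, big_data_store):
    # data integration: stand-in for a write into HDFS or the warehouse
    big_data_store.extend(records)

mdm_reference = {"C001", "C002"}
store = []
raw = [{"entity": "trade", "customer_id": "C001", "payload": {"qty": 10}, "debug_info": "x"}]
cleaned = [transform(prune(identify(r)))
           for r in filter_for_mdm(raw) if validate(r, mdm_reference)]
integrate(cleaned, store)
print(store)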

These are shown in diagram 2 below. The golden deliverable is data that conforms to Master Data Management (MDM). MDM is defined as a comprehensive method of enabling an organisation to link all of its critical data to one file, called a master file, that provides a common point of reference. It is supposed to streamline data sharing among architectures, platforms and applications. For you and me, it is basically the enterprise-wide data model, and what data conversion is ultimately trying to achieve is to organise data of different structures into a common schema. Using this approach, the system can relate a myriad of data, including structured data such as trading and risk data and unstructured data such as product documents, wikis, graphs, PowerPoint presentations and conference recordings.

So, in summary, what are the objectives of an enterprise-wide data model? Well, to run any business, you manipulate the data at hand to make it do something specific for you; to change the way you do business, you take data as-is and determine what new things it can do for you. These two approaches are more powerful combined than on their own.

The challenge is to bring these two schools of thought together. So it does make sense to create a common architecture model that leverages all types of data in the organisation to open up new opportunities.

[Diagram 2: Big data ingestion]

Within the storage layer, you will see that we have the Hadoop Distributed File System (HDFS), which can be used by a variety of NoSQL databases, plus the enterprise data warehouse.
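
As an example of how the storage layer can be reached programmatically, here is a hedged sketch that writes a raw file into HDFS and lists a directory. It assumes the third-party Python "hdfs" (WebHDFS) client package and a reachable NameNode; the host, port and paths are placeholders:

from hdfs import InsecureClient

# placeholder NameNode address; WebHDFS must be enabled on the cluster
client = InsecureClient("http://namenode-host:50070", user="hdfs")

# land a raw log file in the storage layer
client.write("/bigdata/raw/trade_logs/2015-02-19.log",
             data=b"trade,C001,10\n", overwrite=True)

# see what has been ingested so far
print(client.list("/bigdata/raw/trade_logs"))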

For the majority of technologists, I thought it would be useful to say a few words on HDFS and how it can be used. Hadoop is a framework for distributed data processing and storage. It is not a database, so it lacks functionality that is necessary in many analytics scenarios. Fortunately, there are many options available for extending Hadoop to support complex analytics, including real-time predictive models. So what are these options?

These are collectively called NoSQL databases. The "No" does not mean that they cannot use SQL; it stands for Not Only SQL. Why is that? It is because SQL is by far the most widely used language for getting data out of databases. Generations of developers know and use SQL, so it makes sense for tools that deal with data stored in Hadoop to use a dialect of SQL.
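
For instance, with the PyHive package (an assumed dependency) and a HiveServer2 endpoint, a HiveQL query over data held in Hadoop looks just like ordinary SQL; the host and table names here are placeholders:

from pyhive import hive

conn = hive.connect(host="hive-server-host", port=10000, username="analyst")
cursor = conn.cursor()
cursor.execute("SELECT customer_id, COUNT(*) FROM trades GROUP BY customer_id")
for customer_id, trade_count in cursor.fetchall():
    print(customer_id, trade_count)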

In general, a NoSQL database relaxes the rigid requirements of a traditional RDBMS (the ACID properties). Following the CAP theorem (Consistency, Availability, Partition tolerance), it typically compromises consistency in favour of availability and partition tolerance, and can therefore support new usage models that require lower latency and higher levels of scalability. NoSQL databases come in four major types:

Key/Value Object Stores – These are the simplest type of database and can store any kind of digital object. Each object is assigned a key and is retrieved using the same key. Since you can only retrieve an object using its associated key, there is no way to search or analyze data until you have first retrieved it. Data storage and retrieval are extremely fast and scalable, but analytic capabilities are limited. Examples of key/value object stores include MemcacheDB, Redis and Riak.
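
A minimal key/value example, assuming the redis-py client and a locally running Redis server; note that, as described above, retrieval is only possible by key:

import redis

r = redis.Redis(host="localhost", port=6379)
r.set("customer:C001", '{"name": "Acme Corp", "segment": "wholesale"}')
print(r.get("customer:C001"))  # returns the stored bytes, or None if the key is absent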

Document Stores – These are a type of key/value store in which documents are stored in recognizable formats and are accompanied by metadata. Because of the consistent formats and the metadata, you can perform search and analysis without first retrieving the documents. Examples of document stores include MarkLogic, Couchbase and MongoDB.
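
A short document-store sketch, assuming the pymongo driver and a local MongoDB instance; because the document carries structure and metadata, we can query on a field without knowing the key up front:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
docs = client["research"]["product_documents"]
docs.insert_one({"title": "FX options term sheet", "desk": "fx", "year": 2015})
for doc in docs.find({"desk": "fx"}):
    print(doc["title"])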

Columnar Databases – These provide varying degrees of row and column structure, but without the full constraints of a traditional RDBMS. You can use a columnar database to perform more advanced queries on big data. Examples of columnar databases include Cassandra and HBase.
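
An illustrative query against a columnar store, assuming the DataStax cassandra-driver, a local Cassandra node and a pre-created keyspace and table (the names are placeholders):

from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("marketdata")  # assumed keyspace

session.execute(
    "INSERT INTO ticks (symbol, ts, price) VALUES (%s, %s, %s)",
    ("XYZ", datetime(2015, 2, 19, 9, 30), 101.25),
)
rows = session.execute("SELECT symbol, price FROM ticks WHERE symbol = %s", ("XYZ",))
for row in rows:
    print(row.symbol, row.price)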

Graph Databases – These store networks of objects that are linked using relationship attributes. For example, the objects could be people in a social network who are related as friends, colleagues or strangers. You can use a graph database to map and quickly analyze very complex networks. Examples of graph databases include InfiniteGraph, AllegroGraph and Neo4j.
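
A small graph sketch, assuming the official neo4j Python driver and a local Neo4j instance; the credentials, labels and relationship type are illustrative:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # create two people and a relationship between them
    session.run("MERGE (a:Person {name: $a}) MERGE (b:Person {name: $b}) "
                "MERGE (a)-[:COLLEAGUE_OF]->(b)", a="Alice", b="Bob")
    # traverse the network
    for record in session.run("MATCH (a:Person)-[:COLLEAGUE_OF]->(b:Person) "
                              "RETURN a.name AS src, b.name AS dst"):
        print(record["src"], "->", record["dst"])
driver.close()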

Analysis layer

The analysis layer is responsible for transforming data from the Big Data store for the consumption layer. In other words, this is the place where we need to make business sense of the collected data. The analysis layer will have to adopt different approaches to take advantage of Big Data. The approach will have to include both the business-as-usual analytics used in data warehouses and innovative solutions for unstructured data. The fundamental difference between something like a data warehouse and Hadoop is that a typical data warehouse is scaled vertically using a Symmetric Multi-Processing (SMP) architecture in a centralised environment, whereas Hadoop-based tools use the Hadoop Distributed File System (HDFS) to physically distribute unstructured data and process it in a Massively Parallel Processing (MPP) fashion through the MapReduce engine.
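
To show the shape of the MapReduce computation that Hadoop distributes across HDFS, here is a toy word count run locally in plain Python; it illustrates only the map, group-by-key and reduce phases and is not Hadoop's actual API:

from collections import defaultdict

def map_phase(line):
    # map: emit (word, 1) pairs
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # reduce: aggregate all values seen for one key
    return word, sum(counts)

lines = ["risk report risk", "trade report"]
grouped = defaultdict(list)          # the shuffle/group-by-key step
for line in lines:
    for word, count in map_phase(line):
        grouped[word].append(count)

print([reduce_phase(w, c) for w, c in grouped.items()])
# [('risk', 2), ('report', 2), ('trade', 1)]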

Within the analysis layer we need to identify what use we are going to make of the data collected in HDFS and the enterprise data warehouse. The bulk of the analysis tools should be able to answer the following:

What happened – This is the domain of traditional data warehouses. The information can be augmented by using query engines against HDFS for reduced query times.

Why did it happen – What can we learn? Traditional BI engines and Decision Management tools can help.

What will happen in the short, medium and long term – Pretty self-explanatory. Event scoring and Recommendation Engines based on Predictive and Statistical Models can help (a small sketch follows this list).

Added value – How can we make it happen to benefit the business? Decision Management engines and Recommendation Engines can be deployed.
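
As a deliberately small sketch of the "what will happen" category, here is a statistical model fitted to a handful of historical values and scored for the next period; the numbers are made up and scikit-learn is an assumed dependency:

import numpy as np
from sklearn.linear_model import LinearRegression

history = np.array([[1], [2], [3], [4], [5]])            # trading-day index
volumes = np.array([100.0, 110.0, 125.0, 130.0, 145.0])  # observed daily volumes

model = LinearRegression().fit(history, volumes)
print(model.predict(np.array([[6]])))  # projected volume for the next day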

In addition, we need to process real-time notifications and event-driven requests, which are typically stream-based and may require Complex Event Processing (CEP) engines such as SAP Event Stream Processor or Oracle CEP.
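
The kind of rule such a CEP engine evaluates continuously can be pictured with a simple sliding window; the window size, threshold and event stream below are illustrative assumptions, not the API of SAP ESP or Oracle CEP:

from collections import deque

WINDOW = 5          # keep the last five prices
THRESHOLD = 0.02    # alert on a 2% move within the window

window = deque(maxlen=WINDOW)
for price in [100.0, 100.1, 100.3, 102.6, 102.7]:   # incoming price events
    window.append(price)
    if len(window) == WINDOW and (max(window) - min(window)) / min(window) > THRESHOLD:
        print("ALERT: price moved more than 2% within the window", list(window))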

Modelling can deploy pricing/quant models together with statistical models. These are shown in diagram 3 below.

[Diagram 3: Big data analytics]
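
As a toy example of the pricing/quant models mentioned above sitting alongside the statistical ones, here is a Monte Carlo estimate of a European call under geometric Brownian motion; all parameters are illustrative:

import numpy as np

S0, K, r, sigma, T, n_paths = 100.0, 105.0, 0.01, 0.2, 1.0, 100_000
rng = np.random.default_rng(42)

# simulate terminal prices and discount the average payoff
z = rng.standard_normal(n_paths)
ST = S0 * np.exp((r - 0.5 * sigma ** 2) * T + sigma * np.sqrt(T) * z)
payoff = np.maximum(ST - K, 0.0)
price = np.exp(-r * T) * payoff.mean()
print(f"Monte Carlo call price: {price:.2f}")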

Finally, the whole Big Data architecture is shown in diagram 4 below.

[Diagram 4: Big data architecture layers]

Comments

1. Amit - February 21, 2015

Nice article ..thanks Mich

