
In-Memory Data Grid (IMDG): why we need it and how it differs from an In-Memory Database (January 16, 2015)

Posted by Mich Talebzadeh in Data Grid.

Introduction

Relational databases have been the backbone of data storage for the majority of data stored in computers since the 1980s. They are used extensively in many domains, including the financial sector.

My experience is heavily biased towards banking, so I can only bore you with what I know. Throughout my long exposure to systems ranging from front office trading to back office and reconciliation, I have come to the conclusion that banking is all about moving numbers from A to B and back again. A rather simplistic but surprisingly accurate summary! Think of it as copying data from one silo to another and, in between, making sure that the extract, transform and load (call it ETL if you like) works fine.

This has been the pattern for years: one database and a two-tier or three-tier application in a client-server environment. If you had bandwidth and latency issues you would have used some form of replication, the process of sharing data across multiple databases and keeping them in sync. Products like Oracle GoldenGate, Sybase Replication Server or simply materialized views are still commonly used to replicate data among multiple geographical locations or among heterogeneous databases. These were the original distributed data set-ups. Replication primarily addresses bandwidth and latency issues; however, the access path with its associated costs, such as physical IO, logical IO, the LRU-MRU chain and so on, is still there.

The fundamental concern with the traditional data fetch from disk-resident "classic" databases is that a significant portion of time is consumed by data access overhead: a service or application spends the majority of its effort retrieving the relevant data and often transforming it into internal data structures. I call this bucketing. Think of it as the work of a developer who writes a report to extract end-of-day or intra-day figures, gets the data out of the database, stores it in C# or Java objects (i.e. mapping from a relational model to an object model) and finally pushes that data, or rather cooked data (raw data with added bells and whistles), into another database, ready for yet another report. I am sure most of you are familiar with this process. It may involve literally hundreds of reports/extracts running against the primary or reporting databases, multiple polling, and a gradual impact on the throughput and performance of the systems.
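As a rough illustration of that relational-to-object "bucketing" step, here is a minimal sketch in Java that pulls rows over JDBC and maps them into plain objects before they are pushed on elsewhere. The table, column and class names (trades, Trade and so on) and the connection string are hypothetical, not taken from any particular system.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical value object: one row of a relational "trades" table
// mapped into the application's object model.
class Trade {
    final String tradeId;
    final double notional;

    Trade(String tradeId, double notional) {
        this.tradeId = tradeId;
        this.notional = notional;
    }
}

public class TradeExtract {
    public static void main(String[] args) throws Exception {
        List<Trade> trades = new ArrayList<>();
        // Connection string and credentials are placeholders.
        try (Connection con = DriverManager.getConnection("jdbc:placeholder", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT trade_id, notional FROM trades")) {
            while (rs.next()) {
                // Every row fetched here pays the physical/logical IO cost
                // on the database side before it becomes a Java object.
                trades.add(new Trade(rs.getString("trade_id"), rs.getDouble("notional")));
            }
        }
        System.out.println("Extracted " + trades.size() + " trades");
    }
}

Multiply this by hundreds of such extracts polling the same primary or reporting database and the cumulative access overhead becomes the bottleneck described above.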

Another caveat is that individual applications often have their own specific schema or data model, and the data may have to go through a fair bit of transformation to be of value to the target system. For example, you may find dates stored in a cryptic format that need to be converted to a standard one.
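To make that concrete, here is a minimal sketch of the kind of date clean-up involved, assuming a source system that stores dates as ddMMyyyy strings and a target that expects ISO-8601. Both formats are illustrative assumptions.

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DateNormaliser {
    // Assumed source format: e.g. "16012015" meaning 16 January 2015.
    private static final DateTimeFormatter SOURCE = DateTimeFormatter.ofPattern("ddMMyyyy");

    // Convert a source-system date string to the ISO-8601 form yyyy-MM-dd.
    static String toIso(String cryptic) {
        return LocalDate.parse(cryptic, SOURCE).format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    public static void main(String[] args) {
        System.out.println(toIso("16012015")); // prints 2015-01-16
    }
}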

The Approach

Delivering the real-time responsiveness required of modern multi-tier applications requires a new set of data services. These data services must be delivered through a complementary architecture that leverages the data within relational databases but, crucially, optimizes its delivery for real-time application requirements. Remember that we still need a persistent data store for permanent storage (the so-called golden source) and for audit requirements, so the database is not going to go away. It may move from relational to NoSQL, but data still has to be stored one way or another.

We need to consider the following challenges:

  1. Map from the relational model to an object model
  2. Create a schema or model for wider needs
  3. Eliminate the data access path overhead typical of databases
  4. Move from the traditional client/server model to a Service Oriented Architecture
  5. Use some form of in-memory caching. By caching data in memory across many servers instead of within a single database server, distributed caching eliminates performance bottlenecks and minimizes access times (see the sketch after this list).
  6. Provide resiliency by keeping the in-memory data on multiple servers in sync
  7. An in-memory data grid automatically scales its capacity and handles network or server failures. It also provides APIs for coordinating data access and asynchronously notifying clients. To this the data grid adds powerful tools, such as parallel query and map/reduce, to give applications the insights they need for fast, in-depth analysis.
  8. We still have to contend with legacy/local databases. Remember that most applications are legacy and will continue to use legacy/local databases. The cost of rewriting these applications may be prohibitive unless they are decommissioned or replaced.
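As a minimal sketch of points 5 to 7, the snippet below uses Hazelcast purely as one example of an IMDG (this post does not prescribe a specific product): every server that runs this code joins the same cluster, the "trades" map is partitioned and backed up across the members, and any member can read what another has written without touching the underlying database. Class and map names are illustrative.

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

import java.io.Serializable;
import java.util.Map;

public class GridCacheSketch {

    // Entries must be serializable so the grid can move them between nodes.
    static class Trade implements Serializable {
        final String tradeId;
        final double notional;

        Trade(String tradeId, double notional) {
            this.tradeId = tradeId;
            this.notional = notional;
        }
    }

    public static void main(String[] args) {
        // Starting an instance joins (or forms) the cluster; running the same
        // code on several servers gives the distributed, resilient cache.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // A distributed map: entries are partitioned across the members and
        // backed up on other nodes, so a single server failure loses nothing.
        Map<String, Trade> trades = hz.getMap("trades");

        trades.put("T-1001", new Trade("T-1001", 5_000_000d));

        // Any member of the cluster can read this entry from memory.
        Trade t = trades.get("T-1001");
        System.out.println("Cached trade " + t.tradeId + " notional " + t.notional);
    }
}

Other IMDG products expose broadly similar distributed-cache abstractions, so the shape of the code, rather than the particular vendor, is the point here.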
