Someone I know asked me what Big Data is and whether I could explain it in a way they could understand. Now this person understands traditional data architectures but does not deal with technology on a day-to-day basis. Of late, they are more into strategy consulting and business development for large organizations.
The Big Problem
I explained that Big Data is the entire practice of handling large amounts of data that is growing every minute in a fashion never experienced before. I gave the example of the smart meters that electricity distribution companies are installing in our homes. A typical large electricity distribution company supplying power to around a million homes has a million meters sending status updates (consumption, availability, etc.) every 15 minutes. That adds up to (1M × 4) = 4 million new records an hour and (4M × 24) = 96 million data points a day. Multiply that by 365 days and you are looking at (96M × 365) = 35.04 billion data points in a year.
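For the technically inclined, the back-of-the-envelope arithmetic looks like this (a throwaway sketch; the class name and constants are just for illustration):

// Back-of-the-envelope volume calculation for the smart-meter example above.
public class SmartMeterVolume {
    public static void main(String[] args) {
        long meters = 1_000_000L;      // homes served, one meter each
        long readingsPerHour = 4;      // one status update every 15 minutes

        long perHour = meters * readingsPerHour;   // 4,000,000
        long perDay  = perHour * 24;               // 96,000,000
        long perYear = perDay * 365;               // 35,040,000,000

        System.out.printf("%,d per hour, %,d per day, %,d per year%n",
                perHour, perDay, perYear);
    }
}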
The above problem is still finite, since we can predict how much the data will grow over a given period of time. Look at social media, on the other hand, and we cannot even predict the rate at which the data will grow. A single event can trigger a thousand tweets or blog posts, and no one can figure out what they mean as an overall trend or sentiment.
Of course, eventually the question becomes: how do you make sense of this data? Most people are not even able to handle these datasets in traditional data architectures. To see why that is, we first need to understand why traditional database architectures are not able to scale. Then I will describe how the new Big Data architectures resolve these problems.
Limitations of Traditional Database Architectures with Big Data
Out-of-date Indices and Query Plans
Traditional databases were designed and optimized for a certain size and growth rate for each entity. The science was called Volumetrics. Based on the relative sizes of different entities, the distribution and variability of the data, and the type of query to be performed, one query would run more efficiently using a strategy different from another's (these strategies are called query plans). Database indices were then designed to return results really fast, based on the relative sizes of tables, the variability of the data within each entity for queried or joined attributes, and of course the nature of the analysis. In Big Data, the data is churning so fast that it is impossible to keep re-analyzing indices and coming up with new query plans for fast analysis.
Computational overheads in Minimizing Storage on Disk
Another problem is the storage of data. In normalized models, data is made up of primary entities, lookup tables and link tables. Typically, the data entry forms in these applications are designed such that upon insert, the database receives coded values from the data input forms. When this is not the case, the application has to fire multiple database queries to convert user inputs into the coded values held in lookup tables. These were strategies to minimize the storage of data on disk.
From a computational point of view, a record insert in the traditional case would be made up of one insert plus N index-based queries on lookup tables. In cases where the user data forms are populated from pick lists that the user chooses from, these become full table scans on the lookup tables. Ideally, an application can also cache these values on startup. However, where dimensional models are involved, there is the concept of slowly changing dimensions, where the lookup tables themselves get updated and caches eventually need to be refreshed.
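To make this concrete, here is a minimal JDBC sketch of that traditional insert path (the table and column names are hypothetical, purely for illustration):

// A sketch of the traditional, normalized insert path: resolve the
// user-entered value to its coded id via a lookup table, then insert
// the coded record. Table and column names are hypothetical.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class NormalizedInsert {

    // Resolve a display value (e.g. "Excellent") to its lookup-table id.
    static int lookupRatingId(Connection con, String ratingText) throws Exception {
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT rating_id FROM rating_lookup WHERE rating_text = ?")) {
            ps.setString(1, ratingText);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) throw new IllegalArgumentException("Unknown rating: " + ratingText);
                return rs.getInt(1);
            }
        }
    }

    static void insertSurveyResponse(Connection con, long customerId, String ratingText)
            throws Exception {
        int ratingId = lookupRatingId(con, ratingText);  // one extra query per coded field
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO survey_response (customer_id, rating_id) VALUES (?, ?)")) {
            ps.setLong(1, customerId);
            ps.setInt(2, ratingId);
            ps.executeUpdate();
        }
    }
}

Every coded field in the record adds another lookup query like lookupRatingId, which is exactly the per-insert overhead described above.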
In Big Data scenarios, we are faced with two problems. Firstly, when dealing with unstructured data, the concept of lookup tables simply does not apply. Secondly, for structured data, we still need to trade off the computational overhead of performing lookups on insertion against our ability to validate the lookup values, and indeed to come up with a finite list of lookups in the first place. If lookups are something we want to apply to structured and unstructured data, we need some level of control over when the data is parsed, so that we can improve storage, reduce the computational overhead before storage, and improve our chances of efficient retrieval later.
To re-emphasize: the challenge is to keep storing the data efficiently, such that it uses minimal space on disk and, of course, eventually remains available for analysis. The further challenge is doing this when you are not able to use traditional constructs like lookup tables, either for ensuring referential integrity or for index-based searches.
Re-stating the problem
In a nutshell, Big Data is the entire practice around the storage, retrieval, querying and analysis of large-volume datasets that grow so quickly over time that traditional database architectures become inefficient and obsolete.
The Big Data architecture
Storage and Retrieval using Hashcodes
The primary tactic is to look for approaches that allow a dataset to index itself, or at least become more efficient at handling itself. Now, programmers have long dealt with this problem. Most programming languages have native data structures for handling multiple data elements in memory. These include arrays (a structure in which we can store N elements per dimension), lists (single-dimensional arrays that can grow as we add new elements), maps (lists in which elements are accessed through a key rather than an element index) and sets (lists containing only unique values) that grow and sometimes sort themselves. Several of these constructs, maps and sets in particular, rely on the generation of an integer value called a hashcode.
A hashcode is an integer value that is computed for each entry. More importantly, two values that are supposed to be equal should return the same hashcode. So, if we say that the text “Orange” is the same as “orange” and “ORANGE”, these should all return the same hashcode. The computation of hashcodes helps in comparing, grouping and indexing values inside hashcode-based data structures. Just as importantly, hashcode computation is lightweight and lends itself to many algorithmic implementations.
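Here is a minimal Java sketch of the idea (the normalization to lower case is my own illustrative choice; the standard String.hashCode() is case-sensitive):

// "Orange", "orange" and "ORANGE" map to the same hashcode once we
// decide they are equal and normalize them before hashing.
public class HashcodeDemo {
    public static void main(String[] args) {
        String a = "Orange";
        String b = "orange";
        String c = "ORANGE";

        int ha = a.toLowerCase().hashCode();
        int hb = b.toLowerCase().hashCode();
        int hc = c.toLowerCase().hashCode();

        System.out.println(ha == hb && hb == hc);  // prints true
    }
}

A hash-based structure such as java.util.HashMap uses exactly this integer to decide which bucket a value lands in, which is what makes lookups fast without a database-style index.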
Introducing Immutability
Another important benefit of hashcode-based data structures is that they promote the concept of immutability. Immutability essentially means that the system will never discard or overwrite (mutate) a value. If the system encounters a certain value, it fills a position in memory with that value and never overwrites it. If you wrote a function in an application that said let A = 99.17, let B = 0.83, and then computed A = A + B, an immutability-based architecture would not discard the old A, which was 99.17. It would keep that in memory. It would actually create three values in memory, say X = 99.17, Y = 0.83 and Z = 100.0. At the beginning of your little function, it would assign A = X = 99.17, and at the end it would re-assign A to Z, implying A = Z = 100.0. The eventual advantage of such an architecture is that if your application encounters hundreds of millions of rows of data (containing, for example, one field whose value ranges from Excellent to Poor), the actual memory utilization is proportional not to the number of rows but to the number of distinct values (from Excellent to Poor). Compare this to lookup tables in a traditional database, and you will see the benefit.
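A small Java sketch of this effect (the rating values and row count are made up for illustration): ten million rows, but only five distinct string values actually held in memory.

import java.util.HashMap;
import java.util.Map;

public class ImmutabilityDemo {
    public static void main(String[] args) {
        String[] ratings = {"Excellent", "Good", "Average", "Below Average", "Poor"};
        Map<String, Integer> counts = new HashMap<>();

        for (int row = 0; row < 10_000_000; row++) {
            // Simulate reading one field value from a row of incoming data.
            // intern() guarantees a single shared, immutable copy of each distinct string.
            String value = ratings[row % ratings.length].intern();
            counts.merge(value, 1, Integer::sum);
        }

        // Five distinct keys, regardless of how many rows were processed.
        System.out.println(counts);
    }
}

The map never stores ten million strings; it stores five keys and their counts, which is the same economy of storage that immutable, hash-based Big Data stores exploit at much larger scale.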
Big Data architectures are primarily made up of data structures that store simple values or key-value pairs, built on hashcodes and immutability.
Computation and Analysis
Now, to perform computation and analysis on this new data paradigm, users of Big Data needed a new construct, particularly so that computation over large data could leverage large-scale compute clusters where traditional index-based models could not be rolled out. The invention was a Java-based framework that could receive a computation task and distribute it across a large-scale deployment. Of course, it had to make use of the existing constructs of hash-based data structures. Apache Hadoop, inspired by papers published by Google and developed largely at Yahoo! before being donated to Apache, is perhaps the most important implementation: it can take a problem, distribute it among a large number of processing nodes and collect the results in a way that makes sense. The programming model is called MapReduce, where a problem is broken up into as many parallel computation tasks as the size of the computation cluster allows and then distributed over the cluster. Once the results are computed, they are combined and reduced to generate the final result. It is important to note that MapReduce algorithms will only outperform the index-based architectures of yesteryear as long as the data is changing so fast that maintaining index-based data warehouses is not feasible.
The challenge in using MapReduce is that it is a programming API, and one needs to write a program to perform any sort of calculation.
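To give a feel for what such a program looks like, here is a minimal sketch of the classic word-count job against Hadoop's org.apache.hadoop.mapreduce API (the class names and the omitted job-configuration wiring are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // The "map" step: each mapper gets a slice of the input and emits
    // (word, 1) pairs in parallel across the cluster.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken().toLowerCase());
                context.write(word, ONE);
            }
        }
    }

    // The "reduce" step: all counts for the same word (grouped by the
    // key's hashcode) arrive at one reducer and are summed.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

Even this trivial analysis requires a compiled Java program, a job driver and a cluster submission, which is exactly why simpler front-ends to Hadoop appeared.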
Simplified Hadoop Programming Models
Apache Hive, donated to Apache by Facebook, is data-warehousing software built on top of Hadoop. What this means is that users can write SQL-like scripts to declare data structures and analyze data residing on distributed file systems over large-scale clusters. Under the hood, Hive uses Hadoop and Hadoop-compatible distributed file systems.
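As a flavour of what that looks like from an application, here is a minimal sketch of running a Hive (SQL-like) script from Java over JDBC; the host, port, table name and columns are assumptions for illustration only:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {

            // Declare a table over files already sitting on the distributed file system.
            stmt.execute("CREATE TABLE IF NOT EXISTS meter_readings "
                    + "(meter_id STRING, reading DOUBLE, read_time TIMESTAMP) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

            // Analyze it with familiar SQL; Hive compiles this into MapReduce work.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT meter_id, SUM(reading) AS total "
                    + "FROM meter_readings GROUP BY meter_id")) {
                while (rs.next()) {
                    System.out.println(rs.getString("meter_id") + " -> " + rs.getDouble("total"));
                }
            }
        }
    }
}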
And finally, Apache Pig, an ETL-like platform, uses a scripting language called Pig Latin and has built-in operators that can read from multiple formats and perform Hadoop MapReduce computations using ETL-like constructs.
There are many more Hadoop frameworks, and new ones are coming up every day. I have only described two of the most popular ones here.
Summary
To summarize, my take on Big Data is: architectures that allow the storage, retrieval, querying and analysis of large-volume, rapidly changing data using large-scale distributed clusters. In the real world, there is only a limited (though growing) class of problems that can be resolved using Big Data architectures, and we will still need traditional relational architectures for a long time to come.