Someone I know asked me what Big Data is and whether I could explain it in a way they could understand. Now this person understands traditional data architectures but does not deal with technology on a day-to-day basis. Of late, they are more into strategy consulting and business development for large organizations.
The Big Problem
I explained that Big Data is the entire practice of handling large amounts of data that is growing every minute in a fashion never experienced before. I gave the example of the smart meters that electricity distribution companies are installing in our homes. A typical large electricity distribution company supplying power to around a million homes has a million meters sending status updates (consumption, availability, etc.) every 15 minutes. That adds up to (1M × 4) = 4 million new records an hour and (4M × 24) = 96 million data points a day. Multiply that by 365 days and you are looking at (96M × 365) = 35.04 billion data points in a year.
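For the technically inclined, the back-of-the-envelope arithmetic looks like this (a throwaway sketch; the class name and constants are just for illustration):

// Back-of-the-envelope volume calculation for the smart-meter example above.
public class SmartMeterVolume {
    public static void main(String[] args) {
        long meters = 1_000_000L;      // homes served, one meter each
        long readingsPerHour = 4;      // one status update every 15 minutes

        long perHour = meters * readingsPerHour;   // 4,000,000
        long perDay  = perHour * 24;               // 96,000,000
        long perYear = perDay * 365;               // 35,040,000,000

        System.out.printf("%,d per hour, %,d per day, %,d per year%n",
                perHour, perDay, perYear);
    }
}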
The above problem is still finite, since we can predict how much the data will grow over a given period of time. Look at social media, on the other hand, and we cannot even predict the rate at which the data will grow. A single event can trigger a thousand tweets or blog posts, and no one can figure out what they mean as an overall trend or sentiment.
Of course, eventually the question becomes: how do you make sense of this data? Most people are not even able to handle these datasets in traditional data architectures. To see why that is, we first need to understand why traditional database architectures are not able to scale. Then I will describe how the new Big Data architectures resolve these problems.
Limitations of Traditional Database Architectures with Big Data
Out-of-date Indices and Query Plans
Traditional databases were designed and optimized for a certain size and growth rate for each entity. The science was called Volumetrics. Based on the relative sizes of different entities, the distribution and variability of the data, and the type of query to be performed, one query would run more efficiently using a strategy different from another's (these strategies are called query plans). Database indices were then designed to return results really fast, based on the relative sizes of tables, the variability of the data within each entity for queried or joined attributes, and of course the nature of the analysis. In Big Data, the data is churning so fast that it is impossible to keep re-analyzing indices and coming up with new query plans for fast analysis.
Computational overheads in Minimizing Storage on Disk
Another problem is the storage of data. In normalized models, data is made up of primary entities, lookup tables and link tables. Typically, the data entry forms in these applications are designed such that upon insert, the database receives coded values from the data input forms. When this is not the case, the application has to fire multiple database queries to convert user inputs into the coded values held in lookup tables. These were strategies to minimize the storage of data on disk.
From a computational point of view, a record insert in the traditional case would be made up of one insert plus N index-based queries on lookup tables. In cases where the user data forms are populated from pick lists that the user chooses from, these become full table scans on the lookup tables. Ideally, an application can also cache these values on startup. However, where dimensional models are involved, there is the concept of slowly changing dimensions, where the lookup tables themselves get updated and caches eventually need to be refreshed.
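To make this concrete, here is a minimal JDBC sketch of that traditional insert path (the table and column names are hypothetical, purely for illustration):

// A sketch of the traditional, normalized insert path: resolve the
// user-entered value to its coded id via a lookup table, then insert
// the coded record. Table and column names are hypothetical.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class NormalizedInsert {

    // Resolve a display value (e.g. "Excellent") to its lookup-table id.
    static int lookupRatingId(Connection con, String ratingText) throws Exception {
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT rating_id FROM rating_lookup WHERE rating_text = ?")) {
            ps.setString(1, ratingText);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) throw new IllegalArgumentException("Unknown rating: " + ratingText);
                return rs.getInt(1);
            }
        }
    }

    static void insertSurveyResponse(Connection con, long customerId, String ratingText)
            throws Exception {
        int ratingId = lookupRatingId(con, ratingText);  // one extra query per coded field
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO survey_response (customer_id, rating_id) VALUES (?, ?)")) {
            ps.setLong(1, customerId);
            ps.setInt(2, ratingId);
            ps.executeUpdate();
        }
    }
}

Every coded field in the record adds another lookup query like lookupRatingId, which is exactly the per-insert overhead described above.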
In Big Data scenarios, we are faced with two problems. Firstly, when dealing with unstructured data, the concept of lookup tables simply does not apply. Secondly, for structured data, we still need to trade off the computational overhead of performing lookups on insertion against our ability to validate the lookup values, and indeed to come up with a finite list of lookups in the first place. If lookups are something we want to apply to structured and unstructured data, we need some level of control over when the data is parsed, so that we can improve storage, reduce the computational overhead before storage, and improve our chances of efficient retrieval later.
To re-emphasize: the challenge is to keep storing the data efficiently, such that it uses minimal space on disk and, of course, eventually remains available for analysis. The further challenge is doing this when you are not able to use traditional constructs like lookup tables, either for ensuring referential integrity or for index-based searches.
Re-stating the problem
In a nutshell, Big Data is the entire practice around the storage, retrieval, querying and analysis of large-volume datasets that grow so quickly over time that traditional database architectures become inefficient and obsolete.
The Big Data architecture
Storage and Retrieval using Hashcodes
The primary tactic is to look for approaches that allow a dataset to index itself, or at least become more efficient at handling itself. Now, programmers have long dealt with this problem. Most programming languages have native data structures for handling multiple data elements in memory. These include arrays (a structure in which we can store N elements per dimension), lists (single-dimensional arrays that can grow as we add new elements), maps (lists in which elements are accessed through a key rather than an element index) and sets (lists containing only unique values) that grow and sometimes sort themselves. Several of these constructs, maps and sets in particular, rely on the generation of an integer value called a hashcode.
A hashcode is an integer value that is computed for each entry. More importantly, two values that are supposed to be equal should return the same hashcode. So, if we say that the text “Orange” is the same as “orange” and “ORANGE”, these should all return the same hashcode. The computation of hashcodes helps in comparing, grouping and indexing values inside hashcode-based data structures. Just as importantly, hashcode computation is lightweight and lends itself to many algorithmic implementations.
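Here is a minimal Java sketch of the idea (the normalization to lower case is my own illustrative choice; the standard String.hashCode() is case-sensitive):

// "Orange", "orange" and "ORANGE" map to the same hashcode once we
// decide they are equal and normalize them before hashing.
public class HashcodeDemo {
    public static void main(String[] args) {
        String a = "Orange";
        String b = "orange";
        String c = "ORANGE";

        int ha = a.toLowerCase().hashCode();
        int hb = b.toLowerCase().hashCode();
        int hc = c.toLowerCase().hashCode();

        System.out.println(ha == hb && hb == hc);  // prints true
    }
}

A hash-based structure such as java.util.HashMap uses exactly this integer to decide which bucket a value lands in, which is what makes lookups fast without a database-style index.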
Introducing Immutability
Another important benefit of hashcode-based data structures is that they promote the concept of immutability. Immutability essentially means that the system will never discard or overwrite (mutate) a value. If the system encounters a certain value, it fills a position in memory with that value and never overwrites it. If you wrote a function in an application that said let A = 99.17, let B = 0.83, and then computed A = A + B, an immutability-based architecture would not discard the old A, which was 99.17. It would keep that in memory. It would actually create three values in memory, say X = 99.17, Y = 0.83 and Z = 100.0. At the beginning of your little function, it would assign A = X = 99.17, and at the end it would re-assign A to Z, implying A = Z = 100.0. The eventual advantage of such an architecture is that if your application encounters hundreds of millions of rows of data (containing, for example, one field whose value ranges from Excellent to Poor), the actual memory utilization is proportional not to the number of rows but to the number of distinct values (from Excellent to Poor). Compare this to lookup tables in a traditional database, and you will see the benefit.
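A small Java sketch of this effect (the rating values and row count are made up for illustration): ten million rows, but only five distinct string values actually held in memory.

import java.util.HashMap;
import java.util.Map;

public class ImmutabilityDemo {
    public static void main(String[] args) {
        String[] ratings = {"Excellent", "Good", "Average", "Below Average", "Poor"};
        Map<String, Integer> counts = new HashMap<>();

        for (int row = 0; row < 10_000_000; row++) {
            // Simulate reading one field value from a row of incoming data.
            // intern() guarantees a single shared, immutable copy of each distinct string.
            String value = ratings[row % ratings.length].intern();
            counts.merge(value, 1, Integer::sum);
        }

        // Five distinct keys, regardless of how many rows were processed.
        System.out.println(counts);
    }
}

The map never stores ten million strings; it stores five keys and their counts, which is the same economy of storage that immutable, hash-based Big Data stores exploit at much larger scale.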
Big Data architectures are primarily made up of data structures that store simple values or key-value pairs, built on hashcodes and immutability.
Computation and Analysis
Now, to perform computation and analysis on this new data paradigm, users of Big Data needed a new construct, particularly so that computation over large data could leverage large-scale compute clusters where traditional index-based models could not be rolled out. The invention was a Java-based framework that could receive a computation task and distribute it across a large-scale deployment. Of course, it had to make use of the existing constructs of hash-based data structures. Apache Hadoop, inspired by papers published by Google and developed largely at Yahoo! before being donated to Apache, is perhaps the most important implementation: it can take a problem, distribute it among a large number of processing nodes and collect the results in a way that makes sense. The programming model is called MapReduce, where a problem is broken up into as many parallel computation tasks as the size of the computation cluster allows and then distributed over the cluster. Once the results are computed, they are combined and reduced to generate the final result. It is important to note that MapReduce algorithms will only outperform the index-based architectures of yesteryear as long as the data is changing so fast that maintaining index-based data warehouses is not feasible.
The challenge in using MapReduce is that it is a programming API, and one needs to write a program to perform any sort of calculation.
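To give a feel for what such a program looks like, here is a minimal sketch of the classic word-count job against Hadoop's org.apache.hadoop.mapreduce API (the class names and the omitted job-configuration wiring are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // The "map" step: each mapper gets a slice of the input and emits
    // (word, 1) pairs in parallel across the cluster.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken().toLowerCase());
                context.write(word, ONE);
            }
        }
    }

    // The "reduce" step: all counts for the same word (grouped by the
    // key's hashcode) arrive at one reducer and are summed.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

Even this trivial analysis requires a compiled Java program, a job driver and a cluster submission, which is exactly why simpler front-ends to Hadoop appeared.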
Simplified Hadoop Programming Models
Apache Hive, donated to Apache by Facebook, is data-warehousing software built on top of Hadoop. What this means is that users can write SQL-like scripts to declare data structures and analyze data residing on distributed file systems over large-scale clusters. Under the hood, Hive uses Hadoop and Hadoop-compatible distributed file systems.
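As a flavour of what that looks like from an application, here is a minimal sketch of running a Hive (SQL-like) script from Java over JDBC; the host, port, table name and columns are assumptions for illustration only:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {

            // Declare a table over files already sitting on the distributed file system.
            stmt.execute("CREATE TABLE IF NOT EXISTS meter_readings "
                    + "(meter_id STRING, reading DOUBLE, read_time TIMESTAMP) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

            // Analyze it with familiar SQL; Hive compiles this into MapReduce work.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT meter_id, SUM(reading) AS total "
                    + "FROM meter_readings GROUP BY meter_id")) {
                while (rs.next()) {
                    System.out.println(rs.getString("meter_id") + " -> " + rs.getDouble("total"));
                }
            }
        }
    }
}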
And finally, Apache Pig, an ETL-like platform, uses a scripting language called Pig Latin and has built-in operators that can read from multiple formats and perform Hadoop MapReduce computations using ETL-like constructs.
There are many more Hadoop frameworks, and new ones are coming up every day. I have only described two of the most popular ones here.
Summary
To summarize, my take on Big Data is: architectures that allow the storage, retrieval, querying and analysis of large-volume, rapidly changing data using large-scale distributed clusters. In the real world, there is only a limited (though growing) class of problems that can be resolved using Big Data architectures, and we will still need traditional relational architectures for a long time to come.