Saturday, April 27, 2013

Measuring the Green Economy


Measuring how green an economy is may be one of the toughest aspects of understanding the development trends that will affect us in the future. Fortunately, there are several sources from which it can be measured. I recently read an OECD book called Eco-Innovation in Industry: Enabling Green Growth that provided some excellent thoughts on measuring the growth of the Green Economy. It suggested the following indicators as possible KPIs for the growth of the Green Economy.


| Operating Performance Indicator (OPI) | Management Performance Indicator (MPI) | Environmental Condition Indicator (ECI) |
| --- | --- | --- |
| Raw material used per unit of product (kg/unit) | Environmental costs or budget ($/year) | Contaminant concentrations in ambient air (μg/m3) |
| Energy used annually per unit of product (MJ/1000/product) | Percentage of environmental targets achieved (%) | Frequency of photochemical smogs (per year) |
| Energy conserved (MJ) | Number of employees trained (% trained/to be trained) | Contaminant concentration in ground or surface water (mg/L) |
| Number of emergency events or unplanned shutdowns in a year | Number of audit findings | Change in groundwater level (m) |
| Hours of preventive maintenance (hours/year) | Number of audit findings addressed | Number of coliform bacteria per litre of potable water |
| Average fuel consumption of vehicle fleet (l/100 km) | Time spent to correct audit findings (person-hours) | Contaminant concentration in surface soil (mg/kg) |
| Hazardous waste generated per unit of product (kg/unit) | Time spent to respond to environmental incidents (person-hours per year) | Area of contaminated land rehabilitated (hectares/year) |
| Emissions of specific pollutants to air (tonnes CO2/year) | Number of complaints from public or employees (per year) | Population of a specific species of animals within a defined area (per m2) |
| Wastewater discharged per unit of product (1000 litres/unit) | Number of suppliers contacted for environmental management (per year) | Number of hospital admissions for asthma during smog season (per year) |
| Hazardous waste eliminated by pollution prevention (kg/year) | Cost of pollution prevention projects ($/year) | Number of fish deaths in a specific watercourse (per year) |
| Number of days air emission limits were exceeded (days/year) | Number of management-level staff with specific environmental responsibilities | Employee blood lead levels (μg/100 ml) |

In addition, the Economics and Statistics Administration of the US Department of Commerce has done some interesting research on measuring the green economy. Its report uses the North American Industry Classification System (NAICS) codes to identify green industry services and products. The NAICS codes chosen by the study are included in the annexures within the report.

These indicators can be used by government portals to understand trends in the green economy. They can also be used to develop performance metrics dashboards, whether within an organization or for a local, state, or federal government agency.






Thursday, April 18, 2013

Focusing on what matters: the portfolio approach

Recently I was thinking about which of my competing priorities I should focus on to make sure my effort matters in the end. I was reminded of the BCG matrix that I had read about many years ago.

I wanted to think about whether I should focus on the standard stuff or give more priority to new ideas that perhaps had a better future. The BCG (Boston Consulting Group) matrix, as most people already know, is such a tool, meant for organizations to decide which products to fund. The matrix used to be popular, until people came to think its recommendations made no sense and suggested an incorrect strategy. As I found out, that is not quite so.

The matrix defines four project types based on the four cells of a 2x2 grid, with Growth Potential on the vertical axis and Market Share Potential on the horizontal axis. The top left quadrant indicates the initiatives that offer high growth potential as well as high market share potential.

To explain, the quadrants, starting from the top left and proceeding anti-clockwise, are:

1. Stars (top left): These are projects or products that have a high growth potential and a high market share. (I would also add mind share within organizations.)

2. Cash cows (bottom left): These are products in high market share segments but with low future growth potential. Think of these as things that are successful today, but where new investment may not produce new growth.

3. Dogs (bottom right): These are projects and products that are frankly not going anywhere. They are also called cash traps.

4. Problem children or question marks (top right): Finally, there is the category of things that have a lot of growth potential but are relatively nascent. They currently have low market share and low mind share.

The traditional interpretation of this matrix, which was also, surprisingly, echoed by its author, Bruce Henderson, in his original writing, was to kill the dogs and milk the cows to fund the stars.

Essentially, it implies that people should take the benefits (cash flows) from high market share, low growth initiatives to fund initiatives with high growth and high market share potential. This is what people eventually found objectionable, as it meant diverting cash from low growth segments to high growth segments, almost like a parasitic existence of one on the other. That is not sustainable unless the cash flows (read benefits) from the low growth segment are large enough to sustain the growth in the new segment.

However, this needs to be interpreted in the context of other, equally important concepts presented by the same author in his other writings. Even with these in mind, he may not have explained himself fully.

First, the author said that market share, and not margins, is the most important thing to focus on. The rationale, based on his analysis of many companies in the late sixties and seventies, was that margins improve as the entity becomes more experienced in a certain market and product. However, these margins are only sustained and improved as long as the company is able to maintain market share. In the absence of market share, margins become impossible to defend. Put in the perspective of the matrix, the left half of the matrix is where one should focus.

In a separate piece of writing, the author also said that building market share requires new investments to fund growth. In high growth segments, market share cannot be sustained unless continuous investment is made. So there is a propensity for Stars to become Question Marks unless funding can be sustained in proportion to the growth of the market.

On the other hand, projects that are deemed pets (they have neither market share nor growth potential) should either be killed or be invested in to such an extent that they become market leaders in terms of market share. Bottom line, according to the author: entities need to maintain market share at all costs, irrespective of the growth potential of the market, and let margins take care of themselves.

These, in my personal humble opinion, are very important insights that apply equally to companies building product portfolios and to individuals deciding where to focus.

Saturday, April 13, 2013

What is Big Data?


Someone I know asked me what Big Data is and whether I could explain it in a way they could understand. Now, this person understands traditional data architectures but does not deal with technology on a day-to-day basis. Of late, they have been more into strategy consulting and business development for large organizations.

The Big Problem

I explained that Big Data is the entire practice of handling large amounts of data that is growing every minute at a pace never experienced before. I gave the example of the Smart Meters that electricity distribution companies are installing in our homes. A typical large electricity distribution company, supplying power to around a million homes, has a million meters sending status updates (consumption, availability, etc.) every 15 minutes. That adds up to (1 x 4) = 4 million new records an hour and (4 x 24) = 96 million data points a day. Multiply that over a year and you start seeing (365 x 96M) = 35.04 billion data points.

The above problem is still finite, since we can predict by how much the data will grow over a given period of time. Look at social media, on the other hand, and we cannot even predict the rate at which the data will grow. A single event can trigger a thousand tweets or blog posts, and no one can figure out what they mean as an overall trend or sentiment.

Of course, the question eventually becomes: how do you make sense of this data? Most people are not even able to handle these datasets in traditional data architectures. To see why, we first need to understand why traditional database architectures are not able to scale. Then I will describe how the new Big Data architectures resolve these problems.

Limitations of Traditional Database Architectures with Big Data

Out of date Indices and Query Plans

Traditional databases were designed and optimized for a certain size and growth rate for each entity; the science was called volumetrics. Based on the relative sizes of different entities, the distribution and variability of the data, and the type of query to be performed, it was more efficient to execute a given query using one strategy rather than another (these strategies are called query plans). Database indices were then designed to return results really fast, based on the relative sizes of tables, the variability of data within each entity for queried or joined attributes and, of course, the nature of the analysis. With Big Data, the data is churning so fast that it is impossible to keep re-analyzing indices and coming up with different query plans for fast analysis.

Computational overheads in Minimizing Storage on Disk

Another problem is the storage of data. In normalized models, data is made up of primary entities, lookup tables and link tables. Typically, the data entry forms in these applications are designed so that, upon insert, the database receives coded values from the input forms. When this is not the case, the application has to fire multiple database queries to convert user inputs into the coded values held in lookup tables. These were strategies to minimize the storage of data on disk.

From a computational point of view, a record insert in the traditional case is made up of one insert plus N index-based database queries on lookup tables. In cases where the user data forms are populated from pick lists that the user chooses from, these become full table scans on the lookup tables. Ideally, an application can also cache these values on startup. However, where dimensional models are involved, there is the concept of slowly changing dimensions, where the lookup tables themselves are updated and the caches eventually need to be refreshed.
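To make the overhead concrete, here is a rough Java/JDBC sketch of that pattern: one lookup round trip to resolve the coded value, then the actual insert. The JDBC URL, table names and column names (meter_status_lookup, meter_reading) are placeholders I made up for illustration, not any real schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Sketch of a classic normalized insert: resolve the surrogate key from a
// lookup table first, then store only the small coded value in the fact table.
public class NormalizedInsertDemo {
    public static void main(String[] args) throws Exception {
        // args[0] is the JDBC URL of whatever operational database is in use.
        try (Connection conn = DriverManager.getConnection(args[0])) {
            long statusId;
            // Extra round trip: translate the user's pick-list choice into its code.
            try (PreparedStatement lookup = conn.prepareStatement(
                    "SELECT status_id FROM meter_status_lookup WHERE status_name = ?")) {
                lookup.setString(1, "ACTIVE");
                try (ResultSet rs = lookup.executeQuery()) {
                    if (!rs.next()) {
                        throw new IllegalStateException("Unknown status value");
                    }
                    statusId = rs.getLong(1);
                }
            }
            // The actual insert stores the coded value, which keeps the row small on disk.
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO meter_reading (meter_id, status_id, kwh) VALUES (?, ?, ?)")) {
                insert.setLong(1, 42L);
                insert.setLong(2, statusId);
                insert.setDouble(3, 3.7);
                insert.executeUpdate();
            }
        }
    }
}
```

Every insert pays for those extra queries, which is exactly the overhead that becomes painful at Big Data volumes.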

In Big Data scenarios, we are faced with two problems. Firstly, when dealing with unstructured data, the concept of lookup tables is simply not possible. Secondly, for structured data, we still need to trade off the computational overhead of performing lookups on insertion against our ability to validate the lookup values, and indeed to come up with a finite list of lookups in the first place. If lookups are something we want to apply to structured and unstructured data, we need to introduce some level of control over when to parse the data, so that we can improve storage, reduce the computational overhead before storage and improve our chances of efficient retrieval later on.

To re-emphasize: the challenge is to keep storing the data efficiently, so that it uses minimal space on disk and, of course, is eventually available for analysis. The further challenge is how to do this efficiently when you cannot rely on traditional constructs such as lookup tables for referential integrity and index-based searches.

Re-stating the problem

In a nutshell, Big Data is the entire practice around the storage, retrieval, querying and analysis of large-volume datasets that keep growing with time, making traditional database architectures inefficient and obsolete.

The Big Data architecture

Storage and Retrieval using hashcode

The primary tactic is to look for approaches that allow a dataset to index itself, or at least become more efficient in handling itself. Programmers have long dealt with this problem. Most programming languages have native data structures for handling multiple data elements in memory. These include arrays (structures in which we can store N elements per dimension), lists (single-dimensional arrays that can grow as we add new elements), maps (lists in which elements are accessed through a key rather than an element index) and sets (lists containing unique values), which grow and sometimes sort themselves. Most of these constructs rely on the generation of an integer number called a hashcode.
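For readers who have not worked with these structures, here is a quick Java illustration; the variable names and values are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// The in-memory data structures mentioned above. Map and Set are backed by
// hashcodes, which is what makes key lookups and uniqueness checks cheap.
public class CollectionsDemo {
    public static void main(String[] args) {
        int[] readings = {12, 7, 42};                        // Array: fixed number of elements

        List<String> meters = new ArrayList<String>();       // List: grows as elements are added
        meters.add("meter-001");
        meters.add("meter-002");

        Map<String, Integer> kwhByMeter = new HashMap<String, Integer>(); // Map: access by key
        kwhByMeter.put("meter-001", readings[0]);
        kwhByMeter.put("meter-002", readings[1]);

        Set<String> statuses = new HashSet<String>();         // Set: stores unique values only
        statuses.add("ACTIVE");
        statuses.add("ACTIVE");                               // duplicate is silently ignored

        System.out.println(meters.size() + " meters, "
                + kwhByMeter.get("meter-001") + " kWh, "
                + statuses.size() + " distinct status");
    }
}
```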

A hashcode is an integer value that is computed for each entry. More importantly, two values that are supposed to be equal should return the same hashcode. So, if we say that the text "Orange" is the same as "orange" and "ORANGE", all three should return the same hashcode. The computation of hashcodes helps in comparing, ordering, sorting and indexing values inside hashcode-based data structures. Just as importantly, hashcode computation is lightweight and lends itself to many algorithmic implementations.
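To make the "Orange"/"orange"/"ORANGE" point concrete, here is a minimal Java sketch of a key whose hashCode and equals ignore case, so a hash-based map treats all three spellings as the same value. The class name is mine, purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// A key type that declares "Orange", "orange" and "ORANGE" to be equal,
// and therefore returns the same hashcode for all three.
final class CaseInsensitiveKey {
    private final String value;

    CaseInsensitiveKey(String value) {
        this.value = value;
    }

    @Override
    public int hashCode() {
        // Equal keys must produce equal hashcodes, so hash the normalized form.
        return value.toLowerCase().hashCode();
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof CaseInsensitiveKey
                && value.equalsIgnoreCase(((CaseInsensitiveKey) other).value);
    }
}

public class HashcodeDemo {
    public static void main(String[] args) {
        Map<CaseInsensitiveKey, Integer> counts = new HashMap<CaseInsensitiveKey, Integer>();
        for (String s : new String[] {"Orange", "orange", "ORANGE"}) {
            CaseInsensitiveKey key = new CaseInsensitiveKey(s);
            Integer current = counts.get(key);
            counts.put(key, current == null ? 1 : current + 1);
        }
        System.out.println(counts.size());   // prints 1: all three spellings collapse to one entry
        System.out.println(counts.values()); // prints [3]
    }
}
```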

Introducing Immutability

Another important benefit of hashcode-based data structures is that they promote the concept of immutability. Immutability essentially means that the system will never discard or overwrite (mutate) any value. If the system encounters a certain value, it fills a position in memory with that value and never overwrites it. If you wrote a function that said let A = 99.17, let B = 0.83, and then computed A = A + B, an immutability-based architecture would not discard the old A, which was 99.17; it would keep it in memory. It would actually create three values in memory, say X = 99.17, Y = 0.83 and Z = 100.0. At the beginning of your little function it would assign A = X = 99.17, and at the end it would re-assign A to Z, so that A = Z = 100.0. The eventual advantage of such an architecture is that if your application encounters hundreds of millions of rows of data (containing, say, one field whose value ranges from Excellent to Poor), the actual memory used for that field is driven not by the number of rows but by the number of distinct values (Excellent through Poor). Compare this to lookup tables in a traditional database and you will understand the benefits.
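Here is a minimal sketch, in plain Java, of that last point: memory for the field grows with the number of distinct immutable values rather than the number of rows. The hand-rolled canonicalizing map and the rating values are assumptions for illustration, not the API of any particular Big Data product.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A million rows share a handful of immutable rating values. The canonical
// map always hands back the same object for equal values, so the number of
// value objects in memory equals the number of DISTINCT values, not rows.
public class ImmutableValuesDemo {
    private static final Map<String, String> CANONICAL = new HashMap<String, String>();

    // Return the single shared instance for a given value (interning by hand).
    static String canonical(String value) {
        String existing = CANONICAL.get(value);
        if (existing == null) {
            CANONICAL.put(value, value);
            existing = value;
        }
        return existing;
    }

    public static void main(String[] args) {
        String[] ratings = {"Excellent", "Good", "Average", "Below Average", "Poor"};
        List<String> rows = new ArrayList<String>();
        for (int i = 0; i < 1000000; i++) {
            // Each row points at one of only five immutable String objects.
            rows.add(canonical(ratings[i % ratings.length]));
        }
        System.out.println("rows: " + rows.size()
                + ", distinct value objects: " + CANONICAL.size()); // 1000000 vs 5
    }
}
```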

Big Data architectures are primarily made up of data structures that store plain values or key-value pairs based on hashcodes and immutability.

Computation and Analysis

Now, to perform computations and analysis on this new data paradigm, users of Big Data needed a new construct, all the more so because computations over large data had to leverage large-scale computation clusters where traditional index-based models could not be rolled out. The invention was a Java-based framework that could receive a computation task and distribute it across a large-scale deployment, making use of the existing constructs of hashcode-based data structures. Apache Hadoop, inspired by Google's papers on its MapReduce and distributed file system technology, is perhaps the most important implementation that can take a problem, distribute it among a large number of processing nodes and collect the results in a way that makes sense. The programming model is called MapReduce: a problem is broken up into as many parallel computation tasks as the size of the computation cluster allows and distributed over the cluster. Once the individual results are computed, they are combined and reduced to generate the final result. It is important to note that MapReduce algorithms will only outperform the index-based architectures of yesteryear as long as the data is changing so fast that maintaining index-based data warehouses is not feasible.

The challenge in implementing MapReduce is that it is a programming API, and one needs to write a program to perform any sort of calculation.
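To give a flavour of what that means, below is the classic word-count job written against the Hadoop MapReduce Java API, close to the version in the Hadoop tutorials. Even this simplest of analyses needs a full program: a mapper that emits (word, 1) pairs, a reducer that sums them, and a driver that submits the job. The input and output paths are placeholders.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Word count: the map phase emits (word, 1) pairs in parallel across the
// cluster, and the reduce phase sums the counts for each word.
public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);      // emit (word, 1)
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();                    // combine partial counts for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);    // pre-aggregate on each node
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```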

Simplified Hadoop Programming Models

Apache Hive, donated to Apache by Facebook, is data warehouse software built on top of Hadoop. What this means is that users can write SQL-like scripts to declare data structures and analyze data residing on distributed file systems across large-scale clusters. Under the hood, Hive uses Hadoop and Hadoop-compatible distributed file systems.
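As a rough sketch of what those SQL-like scripts look like, here is a HiveQL table declaration and query submitted from Java through the standard HiveServer2 JDBC driver. The host, port, table name and columns are assumptions for illustration, and the table deliberately echoes the smart meter example from earlier.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Running HiveQL over JDBC: declare a table over files already sitting on the
// distributed file system, then run a SQL-like aggregation that Hive compiles
// into MapReduce jobs under the hood.
public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "", "");
             Statement stmt = conn.createStatement()) {

            stmt.execute("CREATE TABLE IF NOT EXISTS meter_readings "
                    + "(meter_id STRING, reading_ts STRING, kwh DOUBLE) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

            try (ResultSet rs = stmt.executeQuery(
                    "SELECT meter_id, SUM(kwh) FROM meter_readings GROUP BY meter_id")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
                }
            }
        }
    }
}
```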

And finally, Pig, an ETL-like platform, uses a programming language called Pig Latin and has built-in functions that can read from multiple formats and perform Hadoop MapReduce computations using an ETL-like construct.

There are many more Hadoop frameworks, and new ones are coming up every day. I have only described the two that are perhaps the most popular.

Summary

To summarize, my take on Big Data is: architectures that allow the storage, retrieval, querying and analysis of large-volume, rapidly changing data using large-scale distributed clusters. In the real world, there is only a limited (though growing) class of problems that can be solved with Big Data architectures, and we will still need traditional relational architectures for a long time to come.