Monday, August 23, 2010

Big Data - Now and the future

Data, data and more data!! The era of big data is upon us. Terabyte data sets are slowly becoming commonplace, and petabyte and exabyte data sets are expected soon.

What are the underlying trends that caused the explosion of big data - or, more aptly, semi-structured big data? On the web, the first is the rise of web search and the second is the rise of social networking.

Search companies like Google needed a way to index the entire web on their machines. Google came up with the concept of MapReduce - a data processing framework that runs on commodity machines - to do this cost-effectively. An open source implementation of MapReduce, named 'Hadoop', soon followed to solve these data processing issues. Social networking also required that the Facebooks and LinkedIns of the world store huge amounts of user-generated data coming in at a very high rate. They then had to index it, analyze it and generate insights from it to drive further user adoption and virality. A lot of this data was semi-structured (it did not fit neatly in a database) and required a lot more computation to generate insights than the traditional BI model.
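
To make the MapReduce idea a bit more concrete, here is a minimal single-process sketch of building an inverted index (word to list of pages) in the map / shuffle / reduce style. It is only an illustration of the programming model: the function names (`map_phase`, `shuffle`, `reduce_phase`, `build_index`) are my own, and this is not Hadoop's API or Google's implementation. A real framework would run the map and reduce steps in parallel across many commodity machines.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit one (word, doc_id) pair per word in a document."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, doc_ids):
    """Reduce step: collapse the doc ids seen for a word into a sorted, de-duplicated list."""
    return word, sorted(set(doc_ids))


def build_index(docs):
    """Run map, shuffle and reduce in-process over a dict of {doc_id: text}."""
    pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())
```

Calling `build_index({"page1": "big data", "page2": "Big web"})` maps each word to the pages that contain it. The point of the model is that the map and reduce functions have no shared state, so the framework is free to spread them over as many machines as needed.
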
This is leading to the rise of the so-called Big Data Stack at consumer internet companies, and it has seven major components:

Big Data Storage: NoSQL databases - Cassandra/Voldemort, HDFS, HBase
Big Data Indexing and index storage: Lucene, Katta or NoSQL stores like the above; Zoie (real-time indexing from LinkedIn); Bobo for faceted search
Big Data Processing and Analytics: Hadoop, Hive, Pig
Big Data Workflows: Oozie (Yahoo), Azkaban (LinkedIn), Cascading (Chris Wensel)
Big Data and Big Log transportation: Chukwa, Flume, Scribe etc.
Big Data Intelligence: Mahout (a machine learning framework that can run on top of Hadoop)
Big Data Sharding: Gizzard (a middleware sharding framework developed by Twitter)

(The exact use cases of the above stack and the variations at various internet companies merit their own discussion and are outside the scope of this article; I will address them in another post.)

Traditional Fortune 500 enterprises have long relied on an enterprise architecture stack consisting of RDBMS and BI software running on high-end servers. However, there was no good way to handle unstructured and semi-structured data until recently. As more ideas like user-generated data percolate from the consumer internet into the enterprise, enterprises are beginning to see the same big data issues that were first experienced in the consumer internet space. There is also a growing realization that data can now be processed cost-effectively to generate hidden insights and drive competitive advantage.

However, today's CIOs lack the tools needed to manage this data. Even though this new stack and its frameworks are maturing, the skill level currently required of IT staff to handle them is very high. And every CIO is pressed on budget and under pressure to deliver value to the business using minimal staff. I think we will see a lot of tools and processes develop around big data to ease the transition to the enterprise.

It should be an interesting space to watch!!

Friday, January 29, 2010

Current perspectives on Scalability - A buffet from various Internet scale companies

Dark Launch
Use servers based on (functional) languages with concurrency support for applications that map more naturally to a parallel environment.
Use straightforward HTTP web servers for request-response style requests.
Use C++ whenever efficiency/logging is required.

Develop/use NoSQL-based approaches (Cassandra) for semi-structured/unstructured data that can tolerate relaxed consistency.

Develop your own storage system for photos (one which does not require all the metadata and inode entries generally required by POSIX file systems) to get rid of expensive CDNs.

Scribe - Their own distributed/reliable Logging System.

Do not use too many fine-grained services - I have seen this problem in companies where too many fine-grained services result in a drop order on deployment day (pretty painful).

No service-private schemas (then how do they make changes to databases in an isolated way?).
Swim Lanes
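
As an illustration of the "Dark Launch" item at the top of this list, here is a minimal sketch of one common way to do it: the new code path runs on a fraction of real traffic, but users only ever see the result of the old path. Everything here is hypothetical (the `handle_request` function, the `dark_fraction` parameter, the injected `rng` and `log` callables); it is not taken from any particular company's implementation.

```python
import random


def handle_request(user_id, old_impl, new_impl, dark_fraction=0.1,
                   rng=random.random, log=print):
    """Serve from old_impl; silently exercise new_impl on a fraction of requests."""
    result = old_impl(user_id)  # this is what the user actually sees
    if rng() < dark_fraction:
        try:
            shadow = new_impl(user_id)  # new path runs, its output is discarded
            if shadow != result:
                log(f"dark launch mismatch for user {user_id}")
        except Exception as exc:  # a failure in the dark path must never reach the user
            log(f"dark launch error for user {user_id}: {exc}")
    return result
```

The design choice worth noting is that the dark path is wrapped so it can fail or disagree without affecting the response. In a real system it would usually run asynchronously as well, so it does not add latency to the request.
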

Thursday, January 21, 2010

Business Skills for Technology executives

I was reading one of the blog posts from AKF Partners the other day, where they talk about the business skills that need to be acquired by Technology Executives.
It is pretty coincidental that I started on this exact quest some time ago.

The approach is pretty simple.
1) Go to the recommended reading list
2) Read one book on each topic - whenever you have time
3) So far I have come to understand Competitive Strategy, Positioning, a very basic way to read financial reports, and a number of other skills too.
4) I felt this was the best approach in Silicon Valley - short of a full time MBA program.

Tuesday, January 19, 2010

An Architecture to learn from: Facebook Chat

Many times a programming language is just a tool. Sometimes it is a differentiator.

At Facebook, they have used Erlang mostly for its lightweight concurrency and its actor-model concurrency (Erlang calls them processes; Scala calls them actors).
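
For readers who have not seen the actor model, here is a minimal sketch of the idea: each actor owns a private mailbox and handles one message at a time, so no state is shared between actors. The `Actor` class and its methods are my own illustration, not Erlang's or Scala's API. Note also that this sketch uses one OS thread per actor, which is far heavier than an Erlang process; it shows the message-passing style only, not the lightweight scheduling that makes Erlang attractive.

```python
import queue
import threading

_STOP = object()  # sentinel message that tells an actor to shut down


class Actor:
    """A tiny actor: a private mailbox drained by a single worker thread."""

    def __init__(self, handler):
        self._mailbox = queue.Queue()
        self._handler = handler
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        """Asynchronously deliver a message; the sender never blocks on the handler."""
        self._mailbox.put(message)

    def stop(self):
        """Ask the actor to finish the messages already queued, then exit."""
        self._mailbox.put(_STOP)
        self._thread.join()

    def _run(self):
        # Messages are processed strictly one at a time, in arrival order.
        while True:
            message = self._mailbox.get()
            if message is _STOP:
                break
            self._handler(message)
```

A chat-like use would be one actor per connected user, with `send` called whenever a message arrives for that user. Because only the actor's own thread touches its state, the handler needs no locks.
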

This has real implications in terms of how many machines Facebook has to buy to support chat. I am sure they cut their hardware requirements by at least half compared to what they would have needed with a traditional request/response model. It shows how a good architecture means real money saved for high-traffic sites.

If you want to learn how Facebook uses Erlang as a differentiator, go through this presentation.
There is also a PDF somewhere that explains the architecture in more detail.

My only question is this: could a Java-based NIO approach have delivered similar or the same results for Facebook? Is the Java threading model so heavy, and are the semantics of shared memory so ill-suited, for Facebook chat?
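
This does not answer the question, but to show what the non-blocking alternative looks like, here is a small sketch of the event-loop style (the same family of technique as Java NIO), written with Python's asyncio. A single thread holds many idle "long poll" waiters, and each one costs only a small task object until a message arrives. The names `long_poll` and `main`, the user count and the message are all made up for the example.

```python
import asyncio


async def long_poll(user_id, inbox, delivered):
    """Park until a message arrives for this user, then record the delivery."""
    message = await inbox.get()
    delivered.append((user_id, message))


async def main(n_users=1000):
    """Hold n_users idle waiters on one thread, then deliver a message to user 42."""
    delivered = []
    inboxes = [asyncio.Queue() for _ in range(n_users)]
    waiters = [asyncio.create_task(long_poll(i, inboxes[i], delivered))
               for i in range(n_users)]
    await asyncio.sleep(0)  # yield once so every waiter starts and parks on its inbox
    inboxes[42].put_nowait("hello")
    await asyncio.wait_for(waiters[42], timeout=1)
    for waiter in waiters:  # clean up the waiters that never received anything
        waiter.cancel()
    await asyncio.gather(*waiters, return_exceptions=True)
    return delivered


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The trade-off compared to the actor style is that all waiters share one thread, so a slow handler stalls everyone; Erlang's scheduler preempts processes, while an event loop relies on each task yielding promptly.
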

Ramping up on Scala

I was getting up to speed on Scala on and off, but never made a concerted effort to get the entire hang of it.

Yesterday I finally got hold of Martin Odersky's Programming in Scala book and am going through it at a good pace.

I would love to use more and more functional programming features in my future career.