Data, data and more data! The era of big data is upon us. Terabyte data sets are becoming commonplace, and petabyte and exabyte data sets are expected soon.
What underlying trends caused the explosion of big data, or more aptly, semi-structured big data? On the web, the first is the rise of web search and the second is the rise of social networking.
Search companies like Google needed a way to index the entire web on their own machines. Google came up with MapReduce, a data processing framework that runs on commodity machines, to do this cost-effectively. Hadoop, an open source implementation of MapReduce, soon followed to solve the same data processing problems. Social networking, in turn, required the Facebooks and LinkedIns of the world to store huge amounts of user-generated data arriving at a very high rate. They then had to index it, analyze it and generate insights from it to drive further user adoption and virality. A lot of this data was semi-structured (it did not fit neatly into a database) and required a lot more computation to generate insights than the traditional BI model.
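To make the MapReduce idea a little more concrete, here is a minimal single-JVM sketch of the map, shuffle and reduce steps, using word counting as the example. It is not the Hadoop API: the class and method names (WordCountSketch, map, shuffle, reduce, wordCount) are made up for illustration, and a real framework would spread these phases across many commodity machines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the map -> shuffle -> reduce flow.
// This is NOT the Hadoop API; every name here is hypothetical.
final class WordCountSketch {

    // A (key, value) pair emitted by the map phase.
    static final class Pair {
        final String key;
        final int value;

        Pair(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    // Map phase: one input line in, one (word, 1) pair out per word.
    static List<Pair> map(String line) {
        List<Pair> out = new ArrayList<Pair>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (word.length() > 0) {
                out.add(new Pair(word, 1));
            }
        }
        return out;
    }

    // Shuffle phase: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Pair> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (Pair p : pairs) {
            List<Integer> values = grouped.get(p.key);
            if (values == null) {
                values = new ArrayList<Integer>();
                grouped.put(p.key, values);
            }
            values.add(p.value);
        }
        return grouped;
    }

    // Reduce phase: collapse all values for one key into a single result.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases one after another on a list of input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Pair> emitted = new ArrayList<Pair>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> e : shuffle(emitted).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {and=1, big=1, data=3, more=2}
        System.out.println(wordCount(Arrays.asList("big data", "more data and more data")));
    }
}
```

The point of the split is that map calls are independent of each other and reduce calls are independent per key, which is what lets a framework run them in parallel on cheap hardware.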
This is leading to the rise of the so-called Big Data Stack at consumer internet companies, and it has seven major components:
Big Data Storage: NoSQL databases - Cassandra/Voldemort, HDFS, HBase
Big Data Indexing and index storage: Lucene, Katta or NoSQL stores like the above; Zoie (real-time indexing from LinkedIn); Bobo for faceted search
Big Data Processing and Analytics: Hadoop, Hive, Pig
Big Data Workflows: Oozie (Yahoo), Azkaban (LinkedIn), Cascading (Chris Wensel)
Big Data and Big Log transportation: Chukwa, Flume, Scribe etc.
Big Data Intelligence: Mahout (a machine learning framework that can run on top of Hadoop)
Big Data Sharding: Gizzard (a middleware sharding framework developed by Twitter)
(The exact use cases of the above stack and its variations at various internet companies merit their own discussion and are outside the scope of this article; I will address them in another post.)
Traditional Fortune 500 enterprises have long relied on an enterprise architecture stack consisting of RDBMS and BI software running on high-end servers. However, there was no good way to handle unstructured and semi-structured data until recently. As ideas like user-generated data percolate from the consumer internet into the enterprise, enterprises are beginning to see the same big data issues that were first experienced in the consumer internet space. There is also a growing realization that data can now be processed cost-effectively to generate hidden insights and drive competitive advantage.
However, today's CIOs lack the tools needed to manage this data. Even though the new stack and its frameworks are maturing, the skill level that IT staff currently need to handle them is very high. And every CIO is pressed on budget and under pressure to deliver value to the business with minimal staff. I think we will see a lot of tools and processes develop around big data to ease the transition to the enterprise.
It should be an interesting space to watch!
Monday, August 23, 2010
Friday, January 29, 2010
Current perspectives on Scalability - A buffet from various Internet scale companies
Facebook:
Dark Launch
Use servers built on (functional) languages with good concurrency support for applications that map well onto a parallel environment.
Use straightforward HTTP web servers for request-response style requests.
Use C++ whenever efficiency/logging is required.
Develop/use NoSQL-based approaches (Cassandra) for semi-structured/unstructured data that can tolerate relaxed consistency.
Develop your own storage system for photos (one that does not need all the metadata and inode entries generally required by POSIX file systems) to get rid of expensive CDNs.
Scribe - their own distributed, reliable logging system.
Do not use too many fine-grained services - I have seen this problem in companies where too many fine-grained services lead to a drop order on deployment day (pretty painful).
No service-private schemas (then how do they make changes to databases in an isolated way?).
Swim Lanes
Thursday, January 21, 2010
Business Skills for Technology executives
I was reading one of the blog posts from AKF Partners the other day, where they talk about the business skills that technology executives need to acquire.
Coincidentally, I started on this exact quest some time ago.
The approach is pretty simple.
1) Go to the personalmba.com recommended reading list: http://personalmba.com/best-business-books/
2) Read one book on each topic - whenever you have time
3) So far I have picked up Competitive Strategy, Positioning, a very basic way to read financial reports, and a number of other skills too.
4) I feel this is the best approach in Silicon Valley - short of a full-time MBA program.
Tuesday, January 19, 2010
An Architecture to learn from: Facebook Chat
Many times a programming language is just a tool. Sometimes it is a differentiator.
At Facebook, they have used Erlang mostly for its lightweight concurrency and its actor-model concurrency (Erlang calls them processes; Scala calls them actors).
This has real implications for how many machines Facebook has to buy to support chat. My guess is that they cut their hardware requirements by at least half compared with what they would have needed with a traditional request/response model. It shows how a good architecture means real money saved for high-traffic sites.
If you want to learn how Facebook uses Erlang as a differentiator, go through this presentation
There is also a PDF somewhere that explains the architecture in more detail.
My only question is this: could a Java-based NIO approach have delivered similar results for Facebook? Is Java's threading model so heavy, and are the semantics of shared memory so ill-suited, for Facebook chat?
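For readers who have not seen the actor style, here is a rough Java sketch of the idea: each chat channel owns its state and only touches it from a single mailbox thread, so callers communicate by sending messages instead of locking shared memory. Everything here (ChannelActorSketch, send, history, stop) is a hypothetical illustration, not Facebook's design, and it deliberately does not answer the NIO question above. Note also that this sketch spends one OS thread per channel, which is exactly the cost that lightweight Erlang processes avoid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical actor-style chat channel: the state below is touched only
// by the mailbox thread, so no locks or synchronized blocks are needed.
final class ChannelActorSketch {

    // The single-threaded executor's internal FIFO queue acts as the mailbox.
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();

    // Private state, read and written only inside mailbox tasks.
    private final List<String> delivered = new ArrayList<String>();

    // Asynchronous send: enqueue the message and return immediately.
    public void send(final String from, final String text) {
        mailbox.execute(new Runnable() {
            public void run() {
                delivered.add(from + ": " + text);
            }
        });
    }

    // "Ask" style request: the copy is made on the mailbox thread, after all
    // messages enqueued earlier, and handed back through a Future.
    public List<String> history() {
        Future<List<String>> reply = mailbox.submit(new Callable<List<String>>() {
            public List<String> call() {
                return new ArrayList<String>(delivered);
            }
        });
        try {
            return reply.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    // Stop the mailbox thread so the JVM can exit.
    public void stop() {
        mailbox.shutdown();
    }

    public static void main(String[] args) {
        ChannelActorSketch channel = new ChannelActorSketch();
        channel.send("alice", "hi");
        channel.send("bob", "hello");
        // Prints [alice: hi, bob: hello]
        System.out.println(channel.history());
        channel.stop();
    }
}
```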
Ramping up on Scala
I had been getting up to speed on Scala on and off, but never made a concerted effort to really get the hang of it.
Yesterday I finally got hold of Martin Odersky's Programming in Scala book and am going through it at a good pace.
I would love to use more and more functional programming features in my future career.
-Radha.
Friday, August 14, 2009
Competitive Strategy: Analyzing your career and any industry
I am in the middle of an in-depth study of Michael Porter's Competitive Strategy book, as part of my effort to understand a number of business concepts.
I have already gone through a number of marketing books such as
Positioning (Ries/Trout), All Marketers Are Liars (Seth Godin) etc.
However, Porter's book is in a class of its own. It gives you a framework to analyze any industry using five forces: suppliers, buyers, threat of new entrants, substitutes and industry rivalry.
Some of the material is common sense, and it seems more applicable to late-stage or mature companies than to startups.
I am doing an analysis of two entities based on what I learned by studying this framework.
The first is that of a software engineer's career in the USA.
The second one is of my current employer's industry.
Will be following up soon with posts on these subjects.
Thursday, April 17, 2008
Improve performance of legacy code using Java 5 language features
Practical Advice for practical programmers
As a senior engineer, you are sometimes thrown into a situation where you have to come up with ways to improve the performance of your server-side multithreaded Java application.
The code was written over a span of 6-7 years and you have only a vague idea of what it does and does not do. More importantly, it uses Java language features that are old. What do you do?
Before firing up a profiler and doing memory/CPU profiling, there is something very easy you can do that does not involve any of that, provided you can move to Java 5.
There are a number of new classes and frameworks in Java 5 that should improve performance and reduce the boilerplate code you need to write for a multithreaded application.
Here are some of them.
- Replace synchronized collections in your old code with the new concurrent collections. For example, if you have a synchronized HashMap, replace it with ConcurrentHashMap. The concurrent classes in Java 5 use fine-grained locking and hence scale better under contention.
- Identify places where you use a Java List with queue semantics and replace it with a Queue implementation (the Queue interface was introduced in Java 5; Deque and ArrayDeque followed in Java 6). A dedicated queue can be much more efficient than a general-purpose List, whose interface also has to support random access.
- Use a BlockingQueue (or a bounded BlockingQueue) whenever possible.
- A common pattern in multithreaded Java applications is the thread pool along with a work queue. See if you can use the Java 5 Executor task execution framework instead.
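The suggestions above can be sketched together in one small example. This is only an illustration under assumed requirements (counting page hits with a few worker threads); the class name Java5ConcurrencySketch, the countHits method and the stop marker are hypothetical and not taken from any real code base. The syntax is kept Java 5 compatible on purpose (anonymous classes, no lambdas).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example: count page hits using java.util.concurrent classes.
final class Java5ConcurrencySketch {

    // Unique marker object (compared by identity) that tells a worker to stop.
    private static final String STOP = new String("<<stop>>");

    static Map<String, Integer> countHits(List<String> pages, int workers) {
        // A bounded BlockingQueue replaces a List used as a hand-rolled work queue.
        final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(16);
        // ConcurrentHashMap replaces a synchronized HashMap.
        final ConcurrentMap<String, AtomicInteger> hits =
                new ConcurrentHashMap<String, AtomicInteger>();
        // The Executor framework replaces a hand-written thread pool.
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<Integer>> done = new ArrayList<Future<Integer>>();
        try {
            for (int i = 0; i < workers; i++) {
                done.add(pool.submit(new Callable<Integer>() {
                    public Integer call() throws InterruptedException {
                        int handled = 0;
                        for (String p = queue.take(); p != STOP; p = queue.take()) {
                            // Atomic "create the counter if missing" without a global lock.
                            AtomicInteger fresh = new AtomicInteger();
                            AtomicInteger existing = hits.putIfAbsent(p, fresh);
                            (existing != null ? existing : fresh).incrementAndGet();
                            handled++;
                        }
                        return handled;
                    }
                }));
            }
            for (String page : pages) {
                queue.put(page); // blocks while the queue is full (back-pressure)
            }
            for (int i = 0; i < workers; i++) {
                queue.put(STOP); // one stop marker per worker
            }
            for (Future<Integer> f : done) {
                f.get(); // wait for every worker to finish
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdownNow();
        }
        Map<String, Integer> result = new TreeMap<String, Integer>();
        for (Map.Entry<String, AtomicInteger> e : hits.entrySet()) {
            result.put(e.getKey(), e.getValue().get());
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {/about=1, /home=3}
        System.out.println(countHits(Arrays.asList("/home", "/about", "/home", "/home"), 2));
    }
}
```

Whether this actually beats the legacy code depends on how contended the old locks were, so it is still worth measuring before and after the change.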
I will keep posting some more patterns to emulate as I come across more of them.