Over the last 12 months I’ve accumulated plenty of conversations with our customers and potential users about big data analytics and BI strategies. The five points below represent some of the key takeaways about the current state of the analytics/BI field, why it is by and large a sore state of affairs, and what some of the telltale signs of the decay are.
Beware: some measure of hyperbole is used below to make the points stand out in sharper contrast...
"Batch"
This is probably getting obvious to most industry insiders but is still worthwhile to mention. If you have a "batch" process in your big data analytics - you are not processing live data and you are not processing it in a real-time context. Period.
That means you are analyzing stale data while your more agile and smarter competitors run circles around you, since they CAN analyze and process live (streaming) data in real time and make the appropriate operational BI decisions based on real-time analytics.
Having "batch" in your system design is like running your database off the tape drive. Would you do that when everyone around you using disk?
"Data Scientist"
A bit controversial. But... if you need one - your analytics/BI is probably not driving your business, since you need a human body between your data and your business. Having humans (who sadly need to eat and sleep) in the loop paints any process with huge latency and non-real-time characteristics.
In most cases it simply means:
- The data you are collecting and the system that is collecting it are so messed up that you need a Data Scientist (i.e. a statistician/engineer under 30) to clean up the mess
- Your process is hopelessly slow and clunky for any real automation
- Your analytics/BI is outdated by definition (i.e. it analyzes stale data with no meaningful BI impact on daily operations)
Now, sometimes you need a domain expert to understand the data and come up with some modeling - but I’ve yet to see a case complex enough that a four-year engineering degree in CS could not solve. Most of the time it is overreaction/over-hiring as a result of not understanding the problem in the first place.
"Overnight"
It’s the little brother of "Batch" and is essentially a built-in failure for any analytics or BI. In the world of hyper-local advertising, geo-location, and up-to-the-second updates on Twitter, Facebook or LinkedIn - you are the proverbial grandma driving a 1966 Buick with the turn signal blinking on the highway while everyone speeds past you...
There’s simply no excuse today for any type of overnight processing (except in some rare legacy financial applications). Overnight processing is not only technical laziness but often a built-in organizational tenet - and that’s what makes it even more appalling.
"ETL"
This is the little brother of "Overnight". ETL is what many people blame for overnight processing... "Look - we've got to move this Oracle data into Hadoop, it takes 6 hours, and we can only do it at night when no one is online".
Well, I can count maybe two or three clients of ours where no one is online during the night. This is 2012, for god’s sake!!! Most businesses (even smallish startups) are 24/7 operations these days.
ETL is the clearest sign of significant technical debt accumulation. It is, for the most part, a manifestation of a defensive and lazy approach to system design. It is especially troubling to see this approach in newer, younger companies that don’t have 25 years of legacy to deal with.
And it is equally invigorating to see it being steadily removed in companies with 50 years of IT history.
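For what it’s worth, the usual alternative to the 6-hour nightly dump is incremental extraction: pull only the rows that changed since the last run, continuously. Below is a rough sketch under assumed names (an "orders" table with an "updated_at" column; sqlite3 stands in for whatever DB-API driver the source database actually uses).

```python
# Hedged sketch of incremental extraction using a high-watermark column,
# as an alternative to dumping the entire table overnight.
# Table/column names (orders, updated_at) are illustrative assumptions.

import sqlite3  # stand-in for any DB-API 2.0 driver (Oracle, Postgres, ...)

def pull_changes(conn, last_watermark):
    """Fetch only the rows modified since the previous pull."""
    cur = conn.execute(
        "SELECT id, region, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    )
    rows = cur.fetchall()
    new_watermark = rows[-1][3] if rows else last_watermark
    return rows, new_watermark

# Run this every few minutes (or feed it from change-data-capture) and the
# downstream store is never more than minutes behind - no overnight window.
```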
"Petabyte"
This is a bit controversial again... But I’m getting tired of hearing the “We must design to process petabytes of data” line from 20-person companies.
Let me break it down:
- 99.99% of companies will NEVER need petabyte scale
- If your business "needs" to process petabytes of data for its operations - you are likely doing something really wrong
- Most of the "working sets" that we've seen, i.e. the data you really need to process, measure in the low teens of terabytes for the absolute majority of use cases
- Given how frequently data changes (in its structure, content, usefulness, freshness, etc.) I don’t expect the “working set” size to grow nearly as fast (if at all) - the overall amount of data will grow, but not the actual "window" that we need to process...
Yes - there are some companies and government organizations that will need to store petabytes and exabytes of data - but in all of those rare cases it’s for historical, archival and backup reasons, and likely never for frequent processing.
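Some back-of-envelope arithmetic makes the point; the figures below are illustrative assumptions, not measurements.

```python
# Illustrative arithmetic only: a low-teens-of-terabytes working set vs. a
# modest 2012-era cluster. Both figures are assumptions, not measured data.

working_set_tb = 12        # "low teens of terabytes" working set
ram_per_node_gb = 64       # memory on one modest commodity server

nodes_in_memory = working_set_tb * 1024 / ram_per_node_gb
print(f"~{nodes_in_memory:.0f} nodes to hold the working set entirely in RAM")
# ~192 nodes - a real cluster, but nowhere near petabyte-scale infrastructure.
```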