Big Data: Into the Trough of Disillusionment

Big Data, like all technologies, looks like it's firmly following the trajectory of the "Hype Cycle". The availability of large amounts of data via the internet, social media and mobile data collection, combined with technologies like Map-Reduce, sparked phase 1 of the hype cycle, the Technology Trigger. Phase 2 is called the Peak of Inflated Expectations; we may not have reached the peak yet, but there are signs that people are starting to pour cold water on the initial excitement and marketing that fanned the technology spark into a wildfire.

One of my favorite posts that has recently criticized the hype is "Data Science is Dead" by Miko Matsumura. I recommend reading it for the hilarious quotes alone. The thesis of the article is not so much that the premise of large-scale analytics is flawed; it's that you can't actually do big data analytics because your data is in no fit state to analyze. Unless you're in one of the lucky few industries where your data is of very high quality (which probably means no humans are involved in its production), you might want to get ready for that very painful tumble into the trough of disillusionment. Your expensive attempt to make your company data driven might become an expensive failure. Many data scientists and informaticians I speak to feel that they spend 90% of their time finding and cleaning data and only 10% analyzing it. It's pretty scary to think that you're paying them a six-figure salary to do analysis only 10% of the time. [Update: The team at Tamr has put together a nice post and graphic about what data scientists really have to do all day.]

So why is this a problem? Well, if big data has three challenges (Volume, Velocity and Variety), it is the problem of Variety that will sink you. Why is variety such a big problem? Probably because it's very hard to solve by simply throwing technology at it. Whereas people have made great progress in developing technologies to deal with data volume (like Hadoop) and velocity (like Kafka), there hasn't been much success on the variety front, especially if you need very high accuracy.

Current approaches to variety at scale include machine categorization and entity recognition (among others). These work really well, but only to a point. Invariably you will have lots of false positives and false negatives, and in many situations your users will be very happy to point out all the places where you got it wrong. Manual curation is much better (assuming you have good people) but can never scale at the rate that data doubles. You can bring the two together, but often there simply isn't enough context for the curator to figure out exactly what they're looking at.
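
To make the combination concrete, here is a minimal sketch of the pattern: an automated categorizer with a confidence gate, where anything the model is unsure about is routed to a curator instead of being accepted blindly. The labels, training data and threshold are illustrative assumptions, not a description of any particular product.

```python
# Minimal sketch: machine categorization with a confidence gate.
# Labels, training data and the 0.8 threshold are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training set: free-text descriptions with known categories.
train_texts = ["acetaminophen 500mg tablets", "stainless steel hex bolt",
               "ibuprofen oral suspension", "galvanized wood screw"]
train_labels = ["pharma", "hardware", "pharma", "hardware"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off; tune against curator workload

def categorize(text):
    """Return (label, confidence); callers send low-confidence items to curation."""
    probs = model.predict_proba([text])[0]
    best = probs.argmax()
    return model.classes_[best], probs[best]

label, conf = categorize("aspirin 81mg chewable")
if conf < CONFIDENCE_THRESHOLD:
    print(f"route to curator (guess: {label}, confidence {conf:.2f})")
else:
    print(f"auto-accept as {label} ({conf:.2f})")
```

The threshold is where the trade-off lives: set it too low and the false positives your users complain about slip through; set it too high and your curators drown.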

So if your users are happy to tell you you're wrong, maybe you can tap into that expert knowledge to improve your automated processes and assist your curators. Crowdsourcing isn't new, but it has serious potential to help resolve the variety problem. I've been lucky enough to be involved in a crowdsourced curation project that started as an MIT research project and has recently been commercialized as Tamr. The approach gives you a critical piece of the workflow, bridging the gap between machine learning/automated data improvement and the curator. When the curator isn't confident in the prediction or in their own expertise, they can distribute tasks to your data producers and consumers to ask their opinions and draw on their expertise and institutional memory, which is not stored in any of your data systems.
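
A rough sketch of that routing idea, under my own assumptions (this is not Tamr's API): low-confidence predictions go to a curator, and when the curator is also unsure, the question is fanned out to the data producers and consumers and resolved by majority vote.

```python
# Hypothetical routing workflow: model -> curator -> crowd of domain experts.
# All names and thresholds are assumptions made for illustration.
from collections import Counter

MODEL_THRESHOLD = 0.8    # assumed: below this, the machine defers to a curator
CURATOR_THRESHOLD = 0.6  # assumed: below this, the curator defers to the crowd

def resolve(record, model_label, model_conf, ask_curator, ask_experts):
    if model_conf >= MODEL_THRESHOLD:
        return model_label                    # trust the automated match

    curator_label, curator_conf = ask_curator(record, model_label)
    if curator_conf >= CURATOR_THRESHOLD:
        return curator_label                  # curator is confident enough

    # Fan out to the people who produce or consume this data and take the
    # majority answer, drawing on institutional memory the systems don't hold.
    votes = ask_experts(record, options=[model_label, curator_label])
    return Counter(votes).most_common(1)[0][0]

# Toy usage with stubbed-in responses.
label = resolve(
    record={"name": "Intl. Ctr. for Earth Simulation"},
    model_label="organization", model_conf=0.55,
    ask_curator=lambda r, guess: ("organization", 0.5),
    ask_experts=lambda r, options: ["organization", "organization", "location"],
)
print(label)  # -> "organization"
```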

The trough of disillusionment is inevitable for any exciting technology, and big data will get there too. The Volume and Velocity problems are already being addressed by plenty of technologies. How far we fall into the trough, and how long it takes to climb out, will depend on technologies and approaches that address the Variety problem. Crowdsourcing is one approach; hopefully we will see the emergence of many more.

These views are my own and may not necessarily reflect those of my current or previous employers.

Nice article, Mark. For sure, not only are the Velocity and Volume of data points increasing exponentially, but also the Variety. As you noted, the Variety is an ominous presence which is frequently overlooked (or, worse yet, decisions are based on relatively inaccurate data points). My interest lies in the use of Big Data in Healthcare, in order to improve access to care, increase the quality of care, reduce the cost of care and, most importantly, deliver value to the market. These are what will be the salvation of our Healthcare system, yet the data have to be actionable in nature and applicable to the desired goal(s).

Matthias Fouquet-Lapar

Supercomputing Technologist at International Centre for Earth Simulation

9y

My view is a bit different. Clearly, finding/filtering data is taking a large amount of resources. Machine learning is progressing at an amazing rate, which will certainly mitigate this problem and allow us to focus on analytics. I don't see any hype cycle; we are only at the very early stage of a true data deluge, which begins when IoT kicks in. At the same time, the requirements for realtime analytics (mainly in BI) will go up. This will be the point when knowledge from the traditional HPC space needs to merge with Big Data to open up new fields of applications. Requirements for scientific research vs. BI are different in the brave new world of big data.

Martin R. Gollery

Technology Marketing and Business Development

9y

I would add 'Converting Data' to the 'finding and cleaning' process. Seems like you always have to convert from one data type to another - in life sciences, anyway...

Muthu Lalapet

Sr. Director Global Pre and Post Sales (Sales Engineering and Solutions Architects)

9y

Very true. Most of the time is spent cleansing the data rather than doing pure analytics on it. Even worse, in some scenarios analytical tools are used for ETL work.

