Two weeks ago I had the opportunity to attend the Economist’s first Information Conference on big data and the disruption it is about to unleash on the world. The event was the brainchild of Kenn Cukier, who almost 18 months ago wrote this Economist article on the ‘big data deluge’. The event brought together some of the best minds on this topic, with speakers drawn from academia, technology, NGOs, and consulting.
The one topic that received significant coverage during the event was the emerging role of the ‘data scientist’ – a term apparently coined by Jeff Hammerbacher, the co-founder and Chief Scientist of Cloudera, while he was at Facebook. The McKinsey Global Institute recently published a study forecasting that the shortage of these skills could be as high as 190,000 people by 2018 in the US alone. The notion of such a discipline bothered me quite a bit, and until now I was not able to put my finger on why. In full disclosure, my educational background includes degrees in Mathematics, Computer Science, and Operations Research, and I have spent most of my career helping companies deal with data and extract insights so they can make better decisions. But I am getting ahead of myself…
What is big data?
The definition of big data was, to my surprise, not a controversial topic. Most speakers agreed that big data is about both the quantity and the quality of the underlying data, i.e., volume measured in petabytes (10^15 bytes, or 1M gigabytes, or more), and data that includes not only structured but also unstructured content (e.g., text, video, social media, etc.). You can read Wikipedia’s definition here.
Incredible innovation at the data management layer
The field of big data has seen an explosion of a new alphabet soup over the past few years (ACID, Cassandra, Hadoop, HBase, Hive, NoSQL, MapReduce, Pig, and many more). Many early-stage (Cloudera, Kognitio, Netezza, ParAccel) and established (EMC through its acquisition of Greenplum, Microsoft through its acquisition of DATAllegro, Oracle, SAP, and Teradata through its acquisition of Aster Data) technology companies are innovating at an unprecedented pace to help their customers deal with the big data deluge.
While this innovation at the data management layer is significant, most discussions around the data scientist in the industry today are focused on the predictive analytics/data visualization level of extracting insights from big data, and this is where my fundamental disagreement lies:
The field is not new – Extracting insights from data (i.e., predictive analytics) gave birth to Operations Research as an inter-disciplinary field during World War II. The field has its roots in the 1840s, in the work Charles Babbage did to optimize the UK’s mail system. During WW II, UK and US scientists from many of the same disciplines people talk about today (mathematics, statistics, sociology, and psychology) were brought together to help the Allied Forces optimize their artillery rounds and air/sea networks and decipher the German cryptographic codes. The field then branched out in the 1960s and 1970s into the telecom and airline industries and has since expanded across most of the business world. The fundamental mathematical techniques, however, have changed very little in the past 70 years.
We are all data scientists – Most of the innovation that is taking place at the data visualization layer today is about putting the information in the hands of those able to make the best decisions, i.e., the elusive business user/information worker. While this may feel self-serving, as it allows technology companies to expand their footprint, my many years of working as a ‘data scientist’ have led me to the very same conclusion:
- The real challenge is about driving adoption: Although this is more relevant in an enterprise context, the challenge is not about squeezing out the last drop of potential benefit, but rather about ensuring recommendations are adopted. If there is one thing my many years in the field have taught me, it is the importance of convincing the decision-makers to adopt your ideas. This Microsoft Windows 7 commercial sums it up best.
- Back-office data geeks do not always know the business challenge: Having been one myself, I can attest to the fact that, despite how smart we think we are, the knowledge that comes from knowing your business while also being able to act on those insights is priceless. The image and title of this post refer exactly to this point. PG&E is my local energy utility company and the data on the graph is my hourly energy consumption based on the smart meter (i.e., big) data they collect from my home. Who better to make decisions about energy consumption than the consumers themselves? Do the people appearing in this PG&E commercial look like data scientists to you?
What do you think? Is this short-sighted ‘old-world’ thinking, or the reality that will emerge over the next few years as we move past the hype?
Your thinking is in line with tech trends that are empowering the consumer. With your credentials and passion for this topic, it will be great to see you playing a more central and active role in addressing SAP’s internal big data opportunities.
Natascha Thomson says
Interesting write-up. I had no idea the UK mail system gave birth to Operational Research.
I certainly agree that there is no point in collecting data for the sake of collecting data. Not only do decision makers need to see it and make decisions based on the data, but in this day and age, the people who execute need to have access to data to be able to fine-tune their actions on a regular basis.
For social media, there is such a gap in knowledge, that it is more important than ever that the people in the trenches can see the data and figure out what it really means and what data is meaningful. Constant evolution in this space does not always make it easier.
Listening on the Internet is currently an imperfect science, to say the least, and reports have to be scrutinized and questioned, or the resulting decisions based on them can be bad. For social media, meaningful data automation is miles away…
Ted Sapountzis says
Thanks for your comments. Analysis of social data is indeed a great use case, since we understand so very little. Who better to decide what analyses to conduct and what the insights are than the people who actually care about their data? There is indeed so much noise in this data right now that no ‘smart’ data scientist (or even worse, algorithm) can come up with any meaningful recommendations…
david k waltz says
I am glad to hear you say a lot of this is not new. Seems the part of this that is new is that there will be a lot more work that needs to be done because of the amount of information available, and new tools that will need to be learned to handle it. But it is all still building off of fundamentals that have been around for centuries.
Ted Sapountzis says
David, thank you for your kind words. As with everything new, we always tend to get over-excited initially, and I hope we soon reach a stage where we get past the hype so we can address the real issues at hand. I still fundamentally believe that we need to look at the past and learn from how people have solved similar challenges. I am quite intrigued by the discussion I started on LinkedIn, and I still fundamentally believe that when the UK scientists were looking at how to crack the German cryptographic codes in WWII, they were asking the same questions we are asking today about the value of a ‘Like’ on Facebook.
Richard Rogers says
Great article and questions raised, Ted. With the explosion of devices coming online (e.g., from smartphones to smart meters), this topic will explode. For example, in 10 years the data will be out there in the US on where every person is and every light that is turned on. How is that for a data boom ; )