The Data Science Pyramid by Atidiv

By Aviral Bhatnagar, Data Science lead at Atidiv

Every day, we create 2.5 quintillion bytes of data. 90% of the world’s data was created in the last two years alone. Vast amounts of data can yield insights that can change the world. From preventing deforestation and predicting the weather to growing more crops and preventing diseases, the potential uses of data science are vast. They range from the commonly encountered recommendation algorithms at companies such as Netflix and Uber to more futuristic technologies, including self-driving cars, speech and image recognition, cropping patterns in agriculture and even adaptive gaming algorithms. Enterprises recognise this: 78% of enterprises believe data science can fundamentally change the way they do business in the next one to three years.

At Atidiv, we believe that Data Science as a field suffers from a misconception. Decision makers, executives and potential talent often tend to believe that Data Science is limited to merely analysing data. While analysis is a key part of data science, it is just one element of a system of interlinking parts that together create an effective data science operation.

A framework by Atidiv to understand the components of Data Science

Three components make up an effective data science operation:

  1. Operations – Collecting Data: Standardised data collection embedded in regular business processes
  2. Engine – Data Platform: Architecting a data warehousing solution that creates a common platform for data across the organisation, enabling analysis
  3. Intel – Analysis: Effectively leveraging data to present actionable insights.

Many organisations focus on the holy grail of analysis, hoping to uncover key insights that can save millions of dollars, while failing to create a sustainable data science operation. As a result, their Data Scientists can end up spending time on activities with a lower ROI.

According to a 2016 report by CrowdFlower, Data Scientists spent just 9% of their time on core analysis, represented as ‘Mining Data for Patterns’. A significant portion of their time was invested in collecting data and in cleaning and organising data.

Source: CrowdFlower – Data Science Report 2016

Our experience at Atidiv has shown that a company that focuses on the Data Science Pyramid – ‘Operations, Engine and Intel’ – can help its Data Scientists invest their time more productively. Organisations recognise these challenges. In a survey by The Economist Intelligence Unit, decision makers identified ‘Solution Approach/Design’ (equivalent to Operations) and ‘Data Integration/preparation’ (equivalent to Engine) as key challenges they faced. By contrast, Analysis and Visualisation was a smaller problem.

Source: The Economist Intelligence Unit

Using the examples of a set of online retailers and a set of offline retailers, we will examine why understanding the pyramid structure of data science is essential for a company to get dividends from its data science division.

Operations – Collecting Data

From the outset, managers and data scientists must align on the key objective – ‘What questions can data science help me answer?’. This helps ensure that the key data is collected in the right manner.

For example, an online retailer may wish to collect data to power custom home pages for each user along with tailored recommendations. Three online retailers take three different approaches to this problem:

Ecommerce A: No planning & no data

A does not involve its data science team in the process. They install a rudimentary out-of-the-box version of Google Analytics on their website. They collect neither customer IDs nor product IDs. Their analytics implementation lets them understand basic metrics such as the volume of pageviews and session duration, but nothing beyond this.

Ecommerce B: Lack of focus on answering key questions

B does not consult its data science team either. Managers are bullish about the role of data in operations, and decide to collect all the data they can. They install the premium version of Google Analytics and top it off with a smattering of other analytics tools to capture even more information. From heatmap trackers to funnel trackers, the team installs a heavy analytics layer on their web and app platforms to ensure that no data point is missed. The result is slower load times on the web and app, as each analytics tool kicks in before the page is rendered. The company has all the raw data it needs, but it loses customers by going overboard with analytics and sacrificing performance.

Ecommerce C: Involves their data science team & answers key questions

C involves its data science team from the outset. Together, they chart a list of potential questions they want the data to answer. Sample questions include: “Which acquisition channel is yielding the most valuable customers?”, “Which page is causing the most drop-offs?” and “What does a customer’s shopping basket look like?”. Keeping these in mind, the product team installs a customised version of Google Analytics Premium that captures the data they need without going overboard on analytics.
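To make this concrete, here is a minimal Python sketch of how two of those questions could be answered once customer IDs, acquisition channels and page paths are captured. The record layout, the field names and the helper functions are hypothetical illustrations; they are not the Google Analytics data model or API.

```python
from collections import defaultdict

# Hypothetical session records. Field names are illustrative only and do
# not come from any particular analytics product.
sessions = [
    {"customer_id": "c1", "channel": "search",
     "pages": ["home", "product", "cart", "checkout"], "order_value": 120.0},
    {"customer_id": "c2", "channel": "social",
     "pages": ["home", "product"], "order_value": 0.0},
    {"customer_id": "c3", "channel": "search",
     "pages": ["home", "product", "cart"], "order_value": 0.0},
    {"customer_id": "c4", "channel": "email",
     "pages": ["home", "product", "cart", "checkout"], "order_value": 45.0},
    {"customer_id": "c5", "channel": "social",
     "pages": ["home", "product"], "order_value": 0.0},
]


def revenue_per_customer_by_channel(sessions):
    """Average order value per distinct customer, grouped by channel."""
    totals = defaultdict(float)
    customers = defaultdict(set)
    for s in sessions:
        totals[s["channel"]] += s["order_value"]
        customers[s["channel"]].add(s["customer_id"])
    return {ch: totals[ch] / len(customers[ch]) for ch in totals}


def drop_offs_by_page(sessions, final_page="checkout"):
    """Count sessions that ended on each page without reaching final_page."""
    exits = defaultdict(int)
    for s in sessions:
        last_page = s["pages"][-1]
        if last_page != final_page:
            exits[last_page] += 1
    return dict(exits)


channel_value = revenue_per_customer_by_channel(sessions)
best_channel = max(channel_value, key=channel_value.get)

exits = drop_offs_by_page(sessions)
worst_page = max(exits, key=exits.get)

print("Most valuable channel:", best_channel)
print("Page with most drop-offs:", worst_page)
```

Neither question can be answered from Ecommerce A's data, because both depend on customer IDs and per-session page paths being collected in the first place.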

Involve data science at the outset. Whether it is the creation of a new process or the implementation of a new feature, ask yourself – what questions will I need to answer and how can I involve my data science team to do it effectively?

‘Companies have progressed on the technology side of analytics, but processes should be embedded more deeply into the fabric of the business’ – The Economist Intelligence Unit.

Engine – Data Platform

Merely collecting data is not good enough. A strong data platform is essential to facilitate analysis. Let’s take another example, this time with two membership based offline retailers, both of whom want to understand the relationship between product discounts, product placements and customer profiles to ensure that products are discounted correctly, placed correctly and that the appropriate customers are targeted for specialised offers.

  • Retailer A: When products are placed on shelves, a manual entry system records the placement of each SKU (stock keeping unit) in a database. Barcodes on each product state the selling price. A third, separate system records the discount on each section and SKU, and a fourth holds the customer profile data. At checkout, the sales system records the customer’s profile, the SKUs purchased and the discount availed, while the placement system holds where each SKU was placed on that particular day. The two datasets are stored separately: sales data sits in MongoDB, a non-relational database, and SKU location data sits in PostgreSQL, an object-relational database. The formats differ and are not directly compatible. Though the data has been collected, the lack of data science planning means there is no coherent data engine to enable analysis of it.
  • Retailer B: This retailer collects the same data as Retailer A, but opts to store it in a consolidated format in a PostgreSQL database, making it easier for their data science team to access, connect and analyse the diverse datasets.

The Data Science operation at Retailer B will be able to focus on identifying insights. At Retailer A, Data Scientists may end up spending 60% to 80% of their time cleaning and organising data.
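As a rough illustration of what a consolidated store makes possible, the sketch below keeps sales, placements and customer profiles in one relational database and answers a cross-cutting question with a single query. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere; the table names, column names and sample rows are all assumptions made for the example, not Retailer B's actual schema.

```python
import sqlite3

# In-memory SQLite database standing in for a consolidated PostgreSQL store.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: every table shares the same SKU and customer keys.
conn.executescript("""
CREATE TABLE customers  (customer_id TEXT PRIMARY KEY, segment TEXT);
CREATE TABLE placements (sku TEXT, placed_on TEXT, shelf TEXT,
                         PRIMARY KEY (sku, placed_on));
CREATE TABLE sales      (customer_id TEXT, sku TEXT, sold_on TEXT,
                         discount_pct REAL);
""")

conn.executemany("INSERT INTO customers VALUES (?, ?)", [
    ("c1", "family"), ("c2", "student"), ("c3", "family"),
])
conn.executemany("INSERT INTO placements VALUES (?, ?, ?)", [
    ("sku-1", "2024-01-01", "entrance"),
    ("sku-1", "2024-01-02", "aisle-4"),
    ("sku-2", "2024-01-01", "aisle-4"),
])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("c1", "sku-1", "2024-01-01", 10.0),
    ("c2", "sku-1", "2024-01-01", 10.0),
    ("c3", "sku-1", "2024-01-01", 20.0),
    ("c1", "sku-1", "2024-01-02", 0.0),
    ("c2", "sku-2", "2024-01-01", 20.0),
])

# Units sold and average discount by shelf placement and customer segment.
# The join on (sku, date) links each sale to where the SKU sat that day.
rows = conn.execute("""
    SELECT p.shelf, c.segment,
           COUNT(*)            AS units,
           AVG(s.discount_pct) AS avg_discount
    FROM sales s
    JOIN placements p ON p.sku = s.sku AND p.placed_on = s.sold_on
    JOIN customers  c ON c.customer_id = s.customer_id
    GROUP BY p.shelf, c.segment
    ORDER BY p.shelf, c.segment
""").fetchall()

for shelf, segment, units, avg_discount in rows:
    print(shelf, segment, units, avg_discount)
```

At Retailer A, the same question would first require exporting data from each system and reconciling keys and date formats by hand, which is where the cleaning and organising time goes.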

‘One barrier is that IT systems in the industry tend to be quite old and fragmented. A lot of the core systems in the travel industry still run on mainframe computers that were first programmed 20 to 30 years ago, so in a lot of cases, you have data fragmented across multiple systems. This requires a huge effort just to pull the data out of these legacy systems and get it cleaned up and aligned to the point where you can even start to make sense of it’ - Glenn Hollister, a Principal at ZS Associates, a leading analytics firm.

Intel – Analysis

Our experience at Atidiv has shown that organisations often identify the interface between technology and business as a key challenge. When data is unavailable or unclean, business leads become overly dependent on ad-hoc analysis by data scientists. Because those data scientists must invest significant effort in cleaning and organising data, integrating data science into day-to-day decision making does not scale, which limits how effectively the business can use data science. A data science operation built on strong data collection and a strong data platform can instead automate the reporting of key metrics, giving business leaders real-time access to data, visualisations and insights.
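A small sketch of what automated reporting of key metrics might look like once clean data is available from the platform. The order records, the metric names and the two helper functions are hypothetical; a real deployment would read from the data platform and feed a dashboard rather than print to the console.

```python
from collections import defaultdict
from datetime import date

# Hypothetical, already-clean order records supplied by the data platform.
orders = [
    {"day": date(2024, 1, 1), "customer_id": "c1", "value": 40.0},
    {"day": date(2024, 1, 1), "customer_id": "c2", "value": 60.0},
    {"day": date(2024, 1, 2), "customer_id": "c1", "value": 30.0},
]


def daily_metrics(orders):
    """Compute a few key metrics for each day, in date order."""
    by_day = defaultdict(list)
    for order in orders:
        by_day[order["day"]].append(order)

    report = {}
    for day in sorted(by_day):
        day_orders = by_day[day]
        revenue = sum(o["value"] for o in day_orders)
        report[day] = {
            "orders": len(day_orders),
            "customers": len({o["customer_id"] for o in day_orders}),
            "revenue": revenue,
            "average_order_value": revenue / len(day_orders),
        }
    return report


def format_report(report):
    """Render the metrics as one plain-text line per day."""
    lines = []
    for day, m in report.items():
        lines.append(
            f"{day.isoformat()}: {m['orders']} orders, "
            f"{m['customers']} customers, revenue {m['revenue']:.2f}, "
            f"AOV {m['average_order_value']:.2f}"
        )
    return "\n".join(lines)


report = daily_metrics(orders)
print(format_report(report))
```

Because the input is already clean and consistently keyed, a job like this can run on a schedule with no data scientist involved, which is the kind of scalability the Operations and Engine layers are meant to enable.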


(This article was written in collaboration with Atidiv, a leading data science company that helps organisations develop their data science operations from the ground up.)