ARE DATA SCIENTISTS FUNGIBLE?
OR IS SPECIALIZATION WITHIN A
SECTOR VERY IMPORTANT?
CRISIL YOUNG THOUGHT LEADER 2014
RITIKA PRAKASH
DELHI SCHOOL OF ECONOMICS
Table of Contents:
Executive Summary
Data Analytics and its Impact across Sectors
Fundamentals of Data Science
Familiarizing the Concept of Data Scientist
Balancing the Structure of a Data Science Team
Dilemma: Gain Domain Expertise or Outsource
Composition of a Data Science Team
Different Data Science Operating Models
Conclusion
References
Executive Summary:
Analytics has helped a variety of sectors overcome analytical challenges in heterogeneous
contexts. A few business participants have voiced growing concern over the de-emphasis on
Domain Expertise (a consequence of long-term association with a particular sector) in
generating value-enhancing analytical solutions.
This paper analyzes that concern. It stresses the impact of analytics on business
productivity, introduces the art of Data Science and familiarizes the reader with the concept
of Data Scientists.
Data Science is always practiced in teams and not as isolated units. This paper outlines the
structure of a Data Science team and the dilemma faced by Data Scientists within it, and
suggests an efficient team composition conditional on the structure of the problem space. It
also illustrates the different operating models of Data Science, which depend on the business
type, level of complexity and analytical maturity.
Finally, the paper criticizes the compartmentalization of Data Scientists by sector, citing
the commonalities in how Data Analytics is used across different sectors. It recommends that
Data Scientists advance in the technical skills they choose to specialize in and then align
those skills with the understanding of business experts in order to discover concealed
business opportunities.
Data Analytics and its Impact across Sectors
Analytics is a data science that supports the detection and communication of meaningful
patterns in data. It has proved to be a key ingredient in the functioning of businesses
across sectors. To improve the shopping experience of its customers, the Retail and
E-Commerce sector applies data science in three priority areas: supply chain, merchandising
and operations. Within the Manufacturing sector, data science supports decisions on
production outlay, product lifecycle management, supplier leverage, customer retention and
usage data, as well as demand forecasting. The Banking, Financial Services and Insurance
(BFSI) sector uses analytics to optimize trading activities and reduce fraud and abuse, while
the Entertainment sector uses it to track media consumption and engagement, advertising,
customer retention, operations and infrastructure. The Information Technology sector uses it
to drive efficiency and process conformity. In the Life Sciences and Healthcare sector, it
helps avoid costly late-stage process failures and optimizes drug therapy by flagging adverse
reactions and predicting project success. The Telecom sector uses it to analyze subscriber
churn and formulate customer-centric plans. Even the Energy and Utilities industry is
monitoring energy utilization patterns to move from a rigid analog system to an automated
energy delivery system.
Table 1: Business Impacts of Data Science
Increment    Incremental Metric      Cause of Increment
17-49%       Productivity            Data usability increased by 10%.
11-42%       Return on Assets        Data access increased by 10%.
241%         Return on Investment    Big data used to improve competitiveness.
1000%        Return on Investment    Analytics deployed in alignment with business management’s goals and big data incorporated.
Source: Booz Allen Hamilton
Table 1 underlines the value awaiting business organizations that embrace data science. To
maximize the value added by Data Analytics, Business Intelligence is best combined with
business domain learning. Restricting Data Scientists to a target domain is productive
because it adds a quintessential skill, Domain Expertise, to their skill set. However,
obtaining this domain-centric information from a third party with a long-term association
with the target domain should be no less productive.
Fundamentals of Data Science:
According to A. McAfee and E. Brynjolfsson, “Data science is an art of turning data into
actions using a blend of domain expertise and machine learning”. It involves the following 4
processes:
1. Description (understanding the content of the data): It includes the techniques of
processing, aggregating and enriching data.
2. Discovery (understanding the key associations in data): It includes the techniques
of clustering and classification of data.
3. Prediction (predicting the probable outcomes): It includes the techniques of
regression and recommendation.
4. Advice (prescribing the action path to take): It includes the techniques of
simulation and optimization.
Data Science is fractal in construction. Several well-defined algorithms and data science
products can be applied to a range of analytical challenges in heterogeneous contexts to
generate value-boosting solutions. Any analytical challenge must go through the 4 steps of
progression defined above, and data science products are available to support every step.
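The four-step progression can be sketched on a toy problem. The example below is only an illustration under stated assumptions: the price and sales figures are hypothetical, and each step uses a deliberately simple stand-in (a mean, a threshold split, a least-squares line, a search over candidate prices) for the richer techniques named in the list above.

```python
from statistics import mean

# Hypothetical toy data: (price, units sold) for one product.
observations = [(10, 121), (12, 103), (14, 89), (16, 71), (18, 57), (20, 39)]
prices = [p for p, _ in observations]
units = [u for _, u in observations]

# 1. Description: aggregate the data into simple summaries.
avg_price, avg_units = mean(prices), mean(units)

# 2. Discovery: a crude classification into high- and low-demand observations.
segments = {"high": [], "low": []}
for price, sold in observations:
    segments["high" if sold >= avg_units else "low"].append(price)

# 3. Prediction: least-squares line, units = intercept + slope * price.
sxy = sum((p - avg_price) * (u - avg_units) for p, u in observations)
sxx = sum((p - avg_price) ** 2 for p in prices)
slope = sxy / sxx
intercept = avg_units - slope * avg_price


def predicted_units(price):
    """Units the fitted line expects to sell at the given price."""
    return intercept + slope * price


# 4. Advice: pick the candidate price with the highest predicted revenue.
candidates = [10 + 0.5 * i for i in range(21)]  # 10.0, 10.5, ..., 20.0
best_price = max(candidates, key=lambda p: p * predicted_units(p))

print(f"mean price {avg_price}, mean units {avg_units}")
print(f"high-demand prices {segments['high']}, low-demand prices {segments['low']}")
print(f"fitted line: units = {intercept:.2f} + ({slope:.2f}) * price")
print(f"suggested price: {best_price}")
```

The same skeleton applies whatever the sector: only the data and the meaning attached to the variables change, which is where domain knowledge enters.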
Familiarizing the Concept of Data Scientists:
Wikipedia defines “Data Scientists as practitioners of Data Science”. A team of skilled Data
Scientists forms a Data Science Team, which carries out the aforementioned processes. As
reflected in “The Data Science Venn Diagram”, the success of a Data Science Team requires
adeptness in 3 foundational technical skills:
1. Computer Science (represented by C) to provide the environment in which data is
manipulated and processed and hypotheses are tested and validated.
2. A strong background in Statistics, Geometry, Linear Algebra and Calculus
(represented by S) to understand the foundations of countless algorithms and tools.
3. Domain Expertise (represented by D) to recognize the class of data that is available
and the problems that need to be addressed, and to instrument and structure the problem
space.
Given these 3 foundational technical skills, infinitely many skill compositions can be crafted:
1. $S_xC_yD_z$: expertise of x units in S, y units in C and z units in D.
2. $S_xC_y$: expertise of x units in S and y units in C.
3. $S_xD_z$: expertise of x units in S and z units in D.
4. $C_yD_z$: expertise of y units in C and z units in D.
5. $S_x$: expertise of x units in S only.
6. $C_y$: expertise of y units in C only.
7. $D_z$: expertise of z units in D only.
Note: The subscripts x, y and z are arbitrary numbers such that x ∈ [0, ∞), y ∈ [0, ∞) and z ∈ [0, ∞); they
represent the units of expertise gained by Data Scientists in a particular skill.
The category $D_z$, where z ≠ 0, represents domain experts with a long-term association with the target domain
and experience of real-world business dealings.
Balancing the Structure of a Data Science Team:
Balancing the structure of a Data Science Team is analogous to balancing a chemical reaction,
which adheres to the Law of Conservation of Mass. Here, the reactants are Data Scientists,
each with an idiosyncratic skill composition. They are not superheroes who can be expert in
every skill, so they face a trade-off in choosing areas of expertise.
Suppose a Data Science project requires 3 parts C, 8 parts S and 2 parts D. Given the mix of
skill sets available, performing value-enhancing Data Analytics requires:
$$2\,S_2C + C + 2\,S_2D \;\rightarrow\; C_3S_8D_2$$
This Data Science Team consists of 5 Scientists: two have 2 units of expertise in S and 1 unit
in C, one has 1 unit of expertise in C only, and two have 2 units of expertise in S and 1 unit
in D.
Throughout the project, the skill constraints of the team will change, and the Data Science
Skill Composition Equation must be rebalanced to ensure that an optimized analytical solution
is reached, conditional upon the staff blend.
Dilemma- Gain Domain Expertise or Outsource:
Because the technical skills C and S are comparable in their analytical character and stand
apart from D, let us assume for simplicity that Data Scientists are of the type $S_xC_y$
(x ≠ 0 and y ≠ 0). The dilemma they face is this: Domain Expertise is a quintessential
skill for the success of any Data Science Team, but must Data Scientists gain domain
expertise themselves, or can outsourcing this knowledge, by collaborating with those who
already have a long-term association with the target domain, yield results that are no less
productive (if not better)?
Let us now reflect upon the following three abstract cases, which deal with the same
statistical measure, i.e. correlation:
Case 1: There exists a positive correlation between Gender Bias against the Girl Child and
Air India Flight Delays.
This is an extremely simple case in which even a layman understands that the correlation is
mere coincidence. However, in complex cases sound domain expertise is essential to
recognize such a coincidence.
Case 2: There exists a positive correlation between Weight and Calorie Consumption.
An understanding of nutrition indicates that Weight depends on Calorie Consumption, i.e. high
calorie intake is the cause of obesity and not the other way round. Without Domain Expertise
it is difficult to tell which variable is dependent and which is independent.
Case 3: The regression estimating Impact of ICDS suggests a positive correlation between
ICDS coverage and Stunting.
Without Domain Expertise, Data Scientists could well predict and advise that an increase in
ICDS (Integrated Child Development Services) coverage leads to an increase in Stunting
(measured by HAZ scores). However, those with Domain Expertise would understand the possible
reasons for this absurd result. ICDS is India’s social welfare programme, which targets
malnourished children below 6 years of age. Hence, the positive relationship is possibly a
consequence of the Problem of Simultaneity, since coverage is itself directed at children who
are already malnourished. It could be eliminated only by including relevant control variables
in the regression measuring the Impact of ICDS.
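The mechanism in Case 3 can be illustrated with a small simulation. This is a sketch on synthetic data, not an analysis of ICDS: the variable names, the coefficients and the "stunting severity" index are all assumptions, and targeting is modelled in the simplest possible way, as coverage that rises with baseline malnutrition. Coverage is built to reduce stunting, yet a regression that omits the control variable reports a positive slope.

```python
import random


def mean(values):
    return sum(values) / len(values)


def slope(x, y):
    """Least-squares slope of y regressed on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Part of y left unexplained by a straight-line fit on x."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical data-generating process; every coefficient is an assumption.
malnutrition = [rng.gauss(0, 1) for _ in range(n)]
# Targeting: coverage is higher where baseline malnutrition is worse.
coverage = [m + rng.gauss(0, 0.5) for m in malnutrition]
# Coverage truly reduces stunting severity (coefficient -0.3).
stunting = [
    m - 0.3 * c + rng.gauss(0, 0.5) for m, c in zip(malnutrition, coverage)
]

# Naive regression of stunting on coverage alone.
naive = slope(coverage, stunting)

# Regression controlling for malnutrition: remove its linear effect from
# both variables, then regress the residuals on each other.
controlled = slope(
    residuals(malnutrition, coverage), residuals(malnutrition, stunting)
)

print(f"slope without control: {naive:.2f}")
print(f"slope with control:    {controlled:.2f}")
```

Choosing baseline malnutrition as the control is exactly the kind of decision that requires domain knowledge: nothing in the data alone indicates that it is the variable driving both coverage and stunting.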
Generalizing from the above cases, it is reasonable to conclude that Domain Expertise ensures
that no meaningless data is collected and that real-world analytical challenges are framed in
the appropriate context, backed by sound domain-centric knowledge. Data learning and
interactions become more meaningful, and business prescriptions become more value-enhancing.
This is because, with Domain Expertise, Scientists can themselves set the analytical goal,
derive insights useful to the business and be confident of using them to improve decision
making.
However, consider the LinkedIn job posting dated October 31, 2013 for a Data Scientist to
design and implement automated models that detect and prevent fraud and abuse across the
LinkedIn ecosystem. It listed the following requirements for interested candidates:
1. Strong background in Machine Learning, Statistics, Information Retrieval and Graph
Analysis.
2. Strong hands-on analytic experience in predictive modelling or automated decisions.
3. Familiarity with Unix/Linux system and shell scripting.
4. Masters or PhD in a quantitative discipline: Computer Science, Statistics, Applied
Mathematics, Operations Research, engineering, etc.
5. Background in Internet Security preferred.
Of the 5 requirements posted, only 1 dealt with domain expertise, and given its optional
character it could be regarded as insignificant. This is in sharp contrast with the
importance of Domain Expertise suggested by the 3 cases considered above.
Composition of a Data Science Team:
Claudia Perlich, a three-time winner of the KDD Cup data mining competition, won contests
in domains as diverse as “Breast Cancer, Movie Prediction, and Sales Performance”
without prior expertise in any of these domains. She obtained the domain-centric information
from business experts, who helped her define and structure the analytical problem. Hence,
even without prior domain knowledge, she was able to craft the best algorithms for generating
value-enhancing analytical solutions. Domain knowledge nevertheless remains very important
for constructing and structuring the problem space.
Figure 1: Composition of a Data Science Team
[Bar chart: share of D and M in the team (vertical axis, 0% to 100%) for a Structured Problem Space and an Unstructured Problem Space]
Note: C- Computer Science, S- Statistics and Mathematics, D- Domain Expertise
*Because C and S are comparable as technical skills and stand apart from D, they are pooled together for
simplicity as M, where M stands for Machine Learning.
Figure 1 illustrates the composition of a Data Science Team in two extreme situations. If the
problem space is well structured, machine learning and data science products can generate
analytical solutions. If it is unstructured, however, domain expertise must be brought in.
Data Scientists are human and are governed by the economic theory of Monotonic Preferences.
Once a problem space is structured, all analytical challenges require competence in Machine
Learning. Binding Data Scientists to a particular domain is impracticable and, even if
possible, could lead to the Principal-Agent Problem. Economics offers different incentive and
asset-ownership approaches to mitigate this moral hazard, but the theory of International
Trade based on Comparative Advantage suggests a simpler solution, “Specialization and Trade”,
to reap benefits beyond the production possibility frontier. In the context of Data Science,
businesses should hire Data Scientists proficient in Model Knowledge and supplement them with
their own Domain Knowledge Experts, although candidates proficient in both should be given
priority. In general, Data Scientists are fungible across sectors, but for value-enhancing
Data Analytics to be performed, businesses must pair them with Business Experts.
Different Data Science Operating Models:
Depending upon the size, sophistication and drivers of the business, an organization must
adopt one of the following three Data Science Operating Models:
Centralized Data Science: Business units bring their problems to a centralized Data Science
Team that serves all the needs of the organization. Domain Experts join this team for brief
rotational stints to solve challenges around the business.
Deployed Data Science: Small Data Science teams are forward deployed to business units to
work with domain experts on the analytical challenge.
Diffused Data Science: Data Scientists become domain experts and are fully embedded within
the business units. This model works best when the nature of the domain or business unit is
similar to that on which the analytics is focused.
Conclusion:
With reference to the structural equation describing the composition of a Data Science team,
D can enter the equation either independently or together with S, C or both. On the logic of
international trade, Data Analytics involves the trade of skills and interaction between Data
Scientists (of the type $S_xC_y$) and Domain Experts (of the type $D_z$). Domain Experts and
Data Scientists are separate entities who experience the world of Big Data in different
traditions and with different comparative advantages: the former in Domain Knowledge and the
latter in Model Knowledge.
The successful completion of an analytics project requires an efficient Data Science Team.
Domain Experts are members of BPM (Business Process Management) who experience real
commercial dealings and accumulate domain knowledge through long-term association and
experience, which allows them to deploy this knowledge tacitly. Data Science is practiced in
teams and not as isolated independent units, which allows Data Scientists to advance in their
chosen skill and then align their technical skills with the understanding of business experts
in order to discover concealed business opportunities. However, the decision to emphasize or
de-emphasize Domain Expertise among Data Scientists still depends on the structure,
complexity and working of the business organization.
References:
1. “Data Scientists Aren’t Domain Experts” by Stijn Viaene, Vlerick Business School &
KU Leuven.
2. “We don’t need more data scientists- Just make big data easier to use” by Scott Brave,
Baynote, December 22, 2012.
3. “Statistics vs Data Science vs BI” by David Smith, Revolutions, May 15, 2013.
4. “A Field Guide to Data Science” by Booz Allen Hamilton.
5. “The Secrets to Managing Business Analytics Projects”, MIT Sloan Management
Review.
6. “Data Science: An Introduction”, retrieved on October 1, 2014 at
http://en.wikibooks.org/wiki/Data_Science:_An_Introduction
7. “Data Science Venn Diagram” by Drew Conway, retrieved on October 4, 2014 at
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram