ARE DATA SCIENTISTS FUNGIBLE? OR IS SPECIALIZATION WITHIN A SECTOR VERY IMPORTANT?

CRISIL YOUNG THOUGHT LEADER 2014
RITIKA PRAKASH
DELHI SCHOOL OF ECONOMICS

Table of Contents:
Executive Summary
Data Analytics and its Impact across Sectors
Fundamentals of Data Science
Familiarizing the Concept of Data Scientists
Balancing the Structure of a Data Science Team
Dilemma: Gain Domain Expertise or Outsource
Composition of a Data Science Team
Different Data Science Operating Models
Conclusion
References

Executive Summary:
Analytics has helped a variety of sectors overcome analytical challenges in heterogeneous contexts. A few business participants have voiced growing concern over the de-emphasis on Domain Expertise (a consequence of long-term association with a particular sector) in generating value-enhancing analytical solutions (see the title question). This paper analyzes that concern. It stresses the impact of analytics on business productivity, introduces the art of Data Science and familiarizes the reader with the concept of Data Scientists.
Data Science is always practiced in teams and not as isolated units. This paper outlines the structure of a Data Science team, describes the dilemma faced by Data Scientists within the team and suggests an efficient team composition conditional on the structure of the problem space. It also illustrates the different operating models of Data Science, which depend on the business type, level of complexity and analytical maturity. Finally, citing the commonalities in how Data Analytics is used across sectors, the paper criticizes the compartmentalization of Data Scientists. It recommends that Data Scientists advance in the technical skills they choose to specialize in and then align those skills with the understanding of business experts in order to discover concealed business opportunities.

Data Analytics and its Impact across Sectors
Analytics is a data science discipline that supports the detection and communication of meaningful patterns in data. It has proved to be one of the key ingredients in the functioning of all businesses. To improve the shopping experience of its customers, the Retail and E-Commerce sector employs data science applications in 3 priority areas: supply chain, merchandising and operations. Within the Manufacturing sector, data science supports decision making on production outlay, product lifecycle management, supplier leverage, customer retention and usage data, as well as demand forecasting. The Banking, Financial Services and Insurance (BFSI) sector brings analytics into play to optimize trading activities and reduce fraud and abuse, while the Entertainment sector uses it to track media consumption and engagement, advertising, customer retention, operations and infrastructure. The Information Technology sector uses it to drive efficiency and process conformity.
In the Life Sciences and Healthcare sector, it helps avoid costly late-stage process failures and optimizes pharmaceutical therapy by flagging adverse reactions and predicting project success. The Telecom sector uses it to analyze subscriber churn and formulate customer-centric plans. Even the Energy and Utilities industry monitors energy utilization patterns to move from rigid analog systems to automated energy delivery.

Table 1: Business Impacts of Data Science

Increment | Metric               | Cause of Increment
17-49%    | Productivity         | Data usability increased by 10%.
11-42%    | Return on Assets     | Data access increased by 10%.
241%      | Return on Investment | Big data used to improve competitiveness.
1000%     | Return on Investment | Analytics deployed in alignment with business management's goals and big data incorporated.

Source: Booz Allen Hamilton

Table 1 underlines the value awaiting business organizations that embrace data science. Data Analytics adds the most value when Business Intelligence is combined with business domain knowledge. Restricting Data Scientists to a target domain is productive because it adds a quintessential skill, Domain Expertise, to their skill set. It should be noted, however, that sourcing this domain-centric knowledge from a third party with a long-term association with the target domain should be no less productive.

Fundamentals of Data Science:
According to A. McAfee and E. Brynjolfsson, “Data science is an art of turning data into actions using a blend of domain expertise and machine learning”. It involves the following 4 processes:
1. Description (understanding the content of the data): It includes the techniques of processing, aggregating and enriching data.
2. Discovery (understanding the key associations in data): It includes the techniques of clustering and classification of data.
3. Prediction (predicting the probable outcomes): It includes the techniques of regression and recommendation.
4. Advice (prescribing the action path to take): It includes the techniques of simulation and optimization.

Data Science is fractal in construction. There exist several well-defined algorithms and data science products that can be applied to a range of analytical challenges in heterogeneous contexts to generate value-boosting solutions. Any analytical challenge must go through the 4 steps of progression defined above, and there are data science products available to support every step.

Familiarizing the Concept of Data Scientists:
Wikipedia defines “Data Scientists as practitioners of Data Science”. A team of skilled Data Scientists forms a Data Science Team, which carries out the aforementioned processes. As reflected in “The Data Science Venn Diagram”, the success of a Data Science Team requires adeptness in 3 foundational technical skills:
1. Computer Science (represented by C) to provide the environment in which data is manipulated and processed and hypotheses are tested and validated.
2. A strong background in Statistics, Geometry, Linear Algebra and Calculus (represented by S) to understand the foundations of countless algorithms and tools.
3. Domain Expertise (represented by D) to recognize the class of data that is available and the problems that need to be addressed, and to instrument and structure the problem space.

Given these 3 foundational technical skills, infinitely many skill compositions can be crafted:
1. $S_xC_yD_z$: expertise of x units in S, y units in C and z units in D.
2. $S_xC_y$: expertise of x units in S and y units in C.
3. $S_xD_z$: expertise of x units in S and z units in D.
4. $C_yD_z$: expertise of y units in C and z units in D.
5. $S_x$: expertise of x units in S only.
6. $C_y$: expertise of y units in C only.
7. $D_z$: expertise of z units in D only.
Note: The subscripts x, y, z are arbitrary numbers such that x ∈ [0, ∞), y ∈ [0, ∞), z ∈ [0, ∞), and they represent the units of expertise gained by Data Scientists in a particular skill. The category $D_z$ with z ≠ 0 represents domain experts with a long-term association with the target domain and experience of real-world business dealings.

Balancing the Structure of a Data Science Team:
Balancing the structure of a Data Science Team is analogous to balancing a chemical reaction, which adheres to the Law of Conservation of Mass. Here, the reactants are Data Scientists, each with an idiosyncratic skill composition. It must be cautioned that they are not superheroes who can be expert in every skill. Hence, they face a trade-off in choosing their areas of expertise. Suppose a Data Science project requires 3 parts C, 8 parts S and 2 parts D. Given the mix of skill sets available, value-enhancing Data Analytics requires:

$$2\,S_2C + C + 2\,S_2D \rightarrow C_3S_8D_2$$

This Data Science Team consists of 5 Scientists: two have 1 unit of expertise in C and 2 units in S, one has 1 unit of expertise in C only, and two have 2 units of expertise in S and 1 unit in D. Throughout the project, the skill constraints on the team will change, and the Data Science Skill Composition Equation needs to be balanced again to ensure that an optimized analytical solution is reached, conditional upon the staff blend.
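The balancing exercise above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original analysis: the helper names `team_total` and `shortfall` are hypothetical, and each team member is represented simply as a count of expertise units in C, S and D.

```python
from collections import Counter

# Hypothetical representation: each Data Scientist is a Counter of
# expertise units keyed by skill letter (C, S, D).
team = [
    Counter(S=2, C=1),  # S2C
    Counter(S=2, C=1),  # S2C
    Counter(C=1),       # C
    Counter(S=2, D=1),  # S2D
    Counter(S=2, D=1),  # S2D
]


def team_total(members):
    """Sum the expertise units of all members, skill by skill."""
    total = Counter()
    for member in members:
        total.update(member)
    return total


def shortfall(members, requirement):
    """Return the units still missing per skill (empty if nothing is missing)."""
    total = team_total(members)
    return {
        skill: needed - total[skill]
        for skill, needed in requirement.items()
        if total[skill] < needed
    }


# Product side of the equation: 3 units of C, 8 of S and 2 of D.
project = {"C": 3, "S": 8, "D": 2}

print(dict(team_total(team)))
print(shortfall(team, project))       # empty dict: every requirement is covered
print(shortfall(team[:-1], project))  # the same check after one S2D scientist leaves
```

Re-running `shortfall` whenever the staff blend or the project requirement changes is one simple way to see which skill has to be rebalanced.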
Dilemma: Gain Domain Expertise or Outsource:
The technical skills C and S share a comparable analytical character that sets them apart from D. To simplify the study, let us therefore assume that Data Scientists are of the type $S_xC_y$ (x ≠ 0 and y ≠ 0). The dilemma they face is this: Domain Expertise is a quintessential skill for the success of any Data Science Team, but must Data Scientists gain domain expertise themselves, or can they outsource this knowledge by collaborating with those who already have a long-term association with the target domain and obtain results that are no less productive (if not better)?

Let us now reflect upon the following three abstract cases, which all deal with the same statistical measure, correlation:

Case 1: There exists a positive correlation between Gender Bias against the Girl Child and Air India Flight Delays. This is an extremely simplistic case in which even a layman understands that the correlation is a mere coincidence. In complex cases, however, sound domain expertise is essential to recognize such a coincidence.

Case 2: There exists a positive correlation between Weight and Calorie Consumption. An understanding of nutrition establishes that Weight depends on Calorie Consumption, i.e. the cause of obesity is high calorie intake and not the other way round. Without Domain Expertise it is difficult to identify the dependent and the independent variables.

Case 3: The regression estimating the Impact of ICDS suggests a positive correlation between ICDS coverage and Stunting. Without Domain Proficiency, Data Scientists could quite possibly predict and advise that an increase in ICDS (Integrated Child Development Services) coverage leads to an increase in Stunting (measured as HAZ scores). However, those with Domain Expertise would understand the possible reasons for this absurd result. ICDS is India’s social welfare programme that targets malnourished children below 6 years of age.
Hence, the positive relationship is possibly a consequence of the Problem of Simultaneity and could be eliminated only by including relevant control variables in the regression measuring the Impact of ICDS.

Generalizing the insights gained from the above cases, it is convenient to assume that Domain Expertise ensures that no meaningless data is collected and, by providing sound domain-centric grounding, allows real-world analytical challenges to be framed in the appropriate context. Data learning and interactions become more meaningful and business prescriptions become more value enhancing. This is because, with Domain Expertise, Scientists can themselves set the analytical goal, derive insights from the data that are useful to the business and be confident of using them to improve decision making.

However, consider the LinkedIn job posting dated October 31, 2013 for a Data Scientist profile to design and implement automated models that facilitate detection and prevention of fraud and abuse across the LinkedIn ecosystem. The following requirements were posted for interested candidates:
1. Strong background in Machine Learning, Statistics, Information Retrieval and Graph Analysis.
2. Strong hands-on analytic experience in predictive modelling or automated decisions.
3. Familiarity with Unix/Linux system and shell scripting.
4. Masters or PhD in a quantitative discipline: Computer Science, Statistics, Applied Mathematics, Operations Research, engineering, etc.
5. Background in Internet Security preferred.

Of the 5 requirements posted, only 1 dealt with domain expertise, and it could be considered insignificant on account of its optional character. This is in sharp contrast with the importance of Domain Expertise suggested by the 3 cases reflected upon previously.
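Returning to Case 3, the effect of a missing control variable can be made concrete with a small simulation. The sketch below is purely illustrative and uses synthetic data, not ICDS data. It assumes a hypothetical `need` variable (baseline malnutrition) that both attracts programme coverage and raises stunting, while coverage itself is assumed to reduce stunting. All names and coefficients are assumptions chosen for the illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the illustration is repeatable
n = 2000

# Synthetic data under assumed relationships (not real ICDS figures):
# need drives coverage (targeting) and stunting; coverage lowers stunting.
need = [rng.gauss(0, 1) for _ in range(n)]
coverage = [nd + 0.5 * rng.gauss(0, 1) for nd in need]
stunting = [2.0 * nd - 0.5 * cv + 0.5 * rng.gauss(0, 1)
            for nd, cv in zip(need, coverage)]


def centered(values):
    """Subtract the mean so that no intercept term is needed."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(a, b):
    """Sum of element-wise products of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def slope_simple(y, x):
    """OLS slope of y on a single regressor x."""
    yc, xc = centered(y), centered(x)
    return dot(xc, yc) / dot(xc, xc)


def slope_with_control(y, x, control):
    """OLS coefficient on x when `control` is also included as a regressor."""
    yc, xc, cc = centered(y), centered(x), centered(control)
    sxx, scc, sxc = dot(xc, xc), dot(cc, cc), dot(xc, cc)
    sxy, scy = dot(xc, yc), dot(cc, yc)
    # Solve the 2x2 normal equations for the coefficient on x.
    return (scc * sxy - sxc * scy) / (sxx * scc - sxc ** 2)


naive = slope_simple(stunting, coverage)
controlled = slope_with_control(stunting, coverage, need)
print(f"without control: {naive:+.2f}")
print(f"with control:    {controlled:+.2f}")
```

Under these assumptions the slope estimated without the control is positive, while the coefficient estimated with `need` included is negative. Knowing that such a control belongs in the regression is exactly the contribution a domain expert makes.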
Composition of a Data Science Team:
Claudia Perlich, three-time winner of the KDD Cup, won the contests in domains as diverse as “Breast Cancer, Movie Prediction, and Sales Performance” without prior expertise in any of those domains. She sourced the domain-centric information from business experts, who helped her define and structure the analytical problem. Hence, even without prior domain knowledge, she was able to craft the best algorithms for generating value-enhancing analytical solutions. Domain knowledge nonetheless remains very important for constructing and structuring the problem space.

Figure 1: Composition of a Data Science Team
[Stacked bar chart: share of D and M (0% to 100%) in a Structured Problem Space and in an Unstructured Problem Space]
Note: C- Computer Science, S- Statistics and Mathematics, D- Domain Expertise
*Because C and S are comparable as technical skills and stand in contrast to D, they are pooled together for simplicity as M, where M stands for Machine Learning.

Figure 1 illustrates the composition of a Data Science Team in two extreme situations. When the problem space is well structured, machine learning and data science products can generate analytical solutions. If the space is unstructured, however, domain expertise has to come to the rescue.

Data Scientists are human and are governed by the economic theory of Monotonic Preferences. Once a problem space is structured, all analytical challenges require competence in Machine Learning. Binding Data Scientists to a particular domain is impracticable and, even if possible, could lead to the Principal-Agent Problem. Economics offers different incentive and asset ownership approaches to mitigate this moral hazard problem, but the theory of international trade based on Comparative Advantage suggests a simple solution, “Specialization and Trade”, to reap benefits beyond the production possibility frontier.
In the context of Data Science, businesses must hire Data Scientists proficient in Model Knowledge and supplement them with their own Domain Knowledge Experts, although candidates proficient in both are preferred. In general, Data Scientists are fungible across sectors, but for value-enhancing Data Analytics businesses must pair them with Business Experts.

Different Data Science Operating Models:
Depending upon the size, sophistication and drivers of the business, an organization must adopt one of the following three Data Science Working Models:

Centralized Data Science: Business units bring their problems to a centralized Data Science Team that serves all the needs of the organization. Domain Experts come to this team for brief rotational stints to solve challenges around the business.

Deployed Data Science: Small Data Science teams are forward deployed to business units to work with domain experts on the analytical challenge.

Diffused Data Science: Data Scientists become domain experts and are fully embedded within the business units. This model works best when the domain or business unit is itself already focused on analytics.

Conclusion:
With reference to the structural equation explaining the composition of a Data Science team, it is suggested that D can enter the equation either independently or together with S, C or both. On the grounds of international trade theory, Data Science Analytics involves a trade of skills and interaction between Data Scientists (of the type $S_xC_y$) and Domain Experts (of the type $D_z$). Domain Experts and Data Scientists are separate entities who experience the world of Big Data through different traditions and with different comparative advantages: the former have a comparative advantage in Domain Knowledge and the latter in Model Knowledge. The successful completion of an analytics project requires an efficient Data Science Team.
Domain Experts are members of BPM (Business Process Management) who experience real commercial dealings and accumulate domain knowledge through long-term association. This experience allows them to deploy that knowledge tacitly. Data Science is practiced in teams and not as isolated independent units, which allows Data Scientists to advance in their chosen skill and then align their technical skills with the understanding of business experts in order to discover concealed business opportunities. However, the decision to emphasize or de-emphasize Domain Expertise among Data Scientists still depends on the structure, complexity and working of the business organization.

References:
1. “Data Scientists Aren’t Domain Experts” by Stijn Viaene, Vlerick Business School & KU Leuven.
2. “We don’t need more data scientists- Just make big data easier to use” by Scott Brave, Baynote, December 22, 2012.
3. “Statistics vs Data Science vs BI” by David Smith, Revolutions, May 15, 2013.
4. “A Field Guide to Data Science” by Booz Allen Hamilton.
5. “The Secrets to Managing Business Analytics Projects”, MIT Sloan Management Review.
6. “Data Science: An Introduction”, retrieved on October 1, 2014 at http://en.wikibooks.org/wiki/Data_Science:_An_Introduction
7. “Data Science Venn Diagram” by Drew Conway, retrieved on October 4, 2014 at http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram