2016 49th Hawaii International Conference on System Sciences

Big Data – A Theory Model

Carsten Felden, TU Bergakademie Freiberg, [email protected]
Marco Pospiech, TU Bergakademie Freiberg, [email protected]

1530-1605/16 $31.00 © 2016 IEEE. DOI 10.1109/HICSS.2016.622

Abstract

Big Data is an emerging research topic. However, the term remains fuzzy and is seen as an umbrella term. Its origin, composition, possible strategies, and outcomes are unclear. This impedes the positioning of publications addressing business administration issues related to Big Data. The missing theoretical foundation of Big Data has been recognized, as has the need for underlying relationships and concepts to be elucidated. While a few publications have directly addressed this need, the analysis has remained methodically weak. Based on an existing qualitative model, we developed a Big Data theory model and tested it through structural equation modeling. All the hypotheses in our research model proved significant. The study makes three principal contributions to the scientific discussion about Big Data. First, it elucidates the current characteristics of Big Data. Second, the addressability of Big Data through strategies is demonstrated. Third, it produced evidence of positive outcomes through Big Data.

1. Introduction

The term Big Data frequently occurs in scientific and practical discussions [1]. One standard definition comes from Gartner [2], where Big Data means "high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making." Bizer et al. [3] added semantic criteria, whereas others have addressed only the amounts of data being processed [4]. It appears likely that an increasing number of fields of study and technologies will be associated with Big Data, while established research areas such as High Performance Computing (HPC) are redefining themselves as Big Data to attract more interest [5]. The positioning of publications addressing business administration applications of Big Data is impeded by the fuzzy nature of the field, which is reflected in the heterogeneous definitions used [5]. From a practitioner's perspective, the benefits of Big Data, or the nature of the obstacles that could be addressed by a Big Data approach, remain unclear. Even the ability to communicate and interpret a value proposition is impeded by the difficulty of identifying a clear goal [6]. The missing theoretical foundation of Big Data has been highlighted because relevant topics and related theories have not been clearly identified [6; 5].

This study aims to introduce a theoretical model of Big Data to describe the current state of its relationships and concepts. Few studies address this research goal. In most cases, the characteristics of Big Data are explored through discursive approaches [6; 7]. From a methodical perspective, the evidence for the Big Data characteristics is vague [8]. In a previous study [9], we conducted interviews with experts to deduce a descriptive model of Big Data that could be used for future empirical verification. It represents, to our knowledge, the only Big Data model based on a solid methodology. The model defines Big Data through causal conditions and context, whereby Big Data itself leads to particular strategies and those strategies to consequences. This follow-on research investigates whether the descriptive model can be confirmed by a structural equation model (SEM). This verification makes three principal contributions to scientific discussions about Big Data. First, it unveils the underlying characteristics of Big Data. Second, it proposes that an understanding of the theoretical background should help answer the question of whether Big Data can be addressed strategically. Third, the study seeks to confirm whether Big Data strategies produce benefits for organizations as effectively as strategies in other IT research areas [10]. Considering the fuzzy benefits of and poorly understood obstacles to Big Data, the latter two aspects are of major practical interest. If Big Data can be clearly described, appropriate strategies can be chosen, and knowledge-sharing between companies becomes possible. The research area will be more precisely defined, and business administration issues can be addressed.

The paper is organized as follows: after a literature review, the initial set-up of the descriptive Big Data model is presented and explained. The model is used to derive our hypotheses, which are confirmed by an SEM. To ensure the research is rigorous and transparent, we describe our research methodology and evaluate the model quality using standard measures. Finally, the results are discussed and implications are highlighted.

2. STATUS QUO

A literature review following Cooper [11] analyzed the scientific databases of the AIS eLibrary, IEEE Xplore, the ACM Digital Library, SpringerLINK, and ScienceDirect up to January 2015. A backward search was also conducted, to avoid missing relevant articles [12]. In the first round, 981 hits were obtained. Duplications were removed and abstracts were manually analyzed. Papers of interest were examined for relevance according to the four-eyes principle. Because the research scope was limited to articles focusing on Big Data theory development, only four articles remained. Other papers, while not focusing exclusively on theory development, might influence theory construction, but the initial study concentrated on the use of existing theoretical developments and not on their derivation. Wu et al.
[7] postulated that Big Data starts with large-volume, heterogeneous, autonomous sources with distributed and decentralized control, and seeks to explore complex and evolving relationships (the HACE theorem). Loshin [6] defined Big Data as "applying innovative and cost-effective techniques for solving existing and future business problems whose resource requirements exceed the capabilities of traditional computing environments." The derivation of both theories is based only on argumentation. This methodically weak approach to theory development [8] was not supported by more rigorous verification in either case. A review of the extant literature explains the diversity of perspectives and the difficulty in deriving an academic description of Big Data.

Our first model [13] explored Big Data through expert interviews. Experts were selected from international companies and surveyed using telephone interviews. The data were coded and conceptualized using Grounded Theory, because the generation and discovery of concepts and inherent relationships is a strength of the method [14]. Consequently, all identified concepts were assigned to a common coding schema with five categories (Fig. 1). The phenomenon (Big Data) belongs to the middle category [14]. The model was applied and tested against existing Big Data publications, and its relevance to the research area was confirmed. Based on discussions in conferences and during presentations, our descriptive model was extended, as shown in Fig. 1 [9]. In addition, the model was used to classify academic publications into practical Big Data implementations. Nevertheless, the study had an exploratory character. A quantitative follow-up is needed to clarify whether all categories and relationships are significant [9]. A deeper understanding of a phenomenon can be achieved through the sequential application of qualitative and quantitative methods, where findings from the qualitative study empirically inform the later quantitative study.
[15]

3. RESEARCH MODEL AND HYPOTHESES

Our model, shown in Fig. 1, served as the basis for the proposed research model. We modified the position of the category strategy, because strategies are applied after the phenomenon (Big Data) emerges. In addition, consequences are not deduced from the phenomenon but rather from the chosen actions (strategies). Grounded Theory defines a phenomenon as a central idea, event, happening, or incident about which a set of actions or interactions is grouped. Big Data can be treated as a phenomenon, because we can observe effects like growing volume and variety. In addition, the term Big Data is often linked, both in practice and in research, to phenomena [16; 17]. According to [14], the phenomenon is characterized by causal conditions and context. The term causal condition refers to events or incidents that lead to the occurrence or development of a phenomenon [14]. Initial reasons led to the phenomenon, because we recognized the existence of a growing volume of data that has to be processed. Others [1] have shown that academic and business requirements are drivers of this growing data. Thus, the stronger the causal conditions of Big Data, the stronger the phenomenon itself.

Hypothesis 1A: Causal conditions are positively related to the phenomenon of Big Data.

The category context describes the circumstances under which Big Data evolves [1]. Here, we have to observe the environment to identify those incidents and locations that pertain to Big Data. One location can be found in research areas like Social Media, which are evolving in parallel with Big Data and pushing the phenomenon [5]. Social Media platforms generate growing volumes of user-specific content as possible sources of analysis. It is our assumption that positive changes in context will lead to a growth in Big Data, as claimed in [1].

Hypothesis 1B: Context is positively related to the phenomenon of Big Data.
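In structural-equation notation, Hypotheses 1A and 1B can be summarized in a single equation for the endogenous phenomenon construct. This is our own sketch, not notation taken from the paper: the gamma terms denote the path coefficients to be estimated and zeta the structural disturbance.

```latex
BD = \gamma_{1}\,CC + \gamma_{2}\,CO + \zeta
```

Positive, significant estimates of the two path coefficients would support Hypotheses 1A and 1B, respectively.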
Alongside the discussion of the origins of Big Data stands the question of whether Big Data has an effect on the category strategy. In IT, the implementation of technologies or concepts has often been used to address the increasing volume of data to be processed and/or stored [4]. Thus, strategy is not a component of the phenomenon Big Data itself (shown by a dashed arrow in Fig. 1), but it is necessary to address its effects [9].

Hypothesis 2: Big Data is positively related to strategy.

[Figure 1. Big Data Descriptive Model]
[Figure 2. Big Data Theory Model]

Consequences refers to the outcomes or results of a strategy. Within the Big Data domain, consequences are often discussed in the form of advantages and issues [18]. It is not the phenomenon but rather the strategies for managing Big Data that determine the positive or negative effects [9]. It is of interest, therefore, whether an increase within the category strategy leads to a positive growth of consequences. A possible example can be seen within business analytics: a growing understanding of customers should lead to higher revenues. Thus, more strategy would lead to more consequences.

Hypothesis 3: Strategy is positively related to consequences.

One advantage of the applied coding schema [14] is its generalizability to the description of a range of phenomena. We are aware that other concepts like Business Intelligence (BI) would fit into the coding schema (Fig. 1) as well. Nevertheless, in the current state of play, an initial basis is needed. Further research can extend the theoretical model. Fig. 2 illustrates the proposed research model. Hypotheses and categories (hereafter described as constructs) are ordered according to the explanations given.

4. RESEARCH METHODOLOGY

The study introduces an initial Big Data theoretical model to describe underlying relationships and concepts.
The research was designed to establish whether relationships between the constructs exist and to evaluate their significance using an SEM. In addition, the relation of the suggested Big Data concepts to the model constructs is tested. We used a partial least squares (PLS) model, as this technique is well suited to the study of associations between latent variables when new theoretical grounds and measures are being explored [20]. The survey was conducted according to the guidelines of Babbie [21]. A pilot test was conducted with six participants to confirm the understandability of our measures. After minor adjustments, an anonymous survey was conducted using an online survey tool. The model was tested under the assumption that the phenomenon Big Data can only be described by human consensus, because it is determined by its linguistic description [9]. This consensus can be achieved by an experienced crowd. We identified suitable participants from the business networking platform Xing [22], which hosts several Big Data interest groups whose members can be assumed to have the appropriate expertise. Only practitioners with at least three years of experience were sampled, because the underlying model (Fig. 1) is based on the input of IT company employees. Suitable participants were contacted and sent a link to the online survey. Of the participants contacted, 385 joined and 123 completed the survey.¹ An introduction was given and the categories explained to ensure consistent responses [21]. All items used a 6-point Likert-type scale in which 1 = "strongly disagree" and 6 = "strongly agree." It has been shown that 6-point Likert-type scales have a higher reliability than 5-point scales [23]. The professional experience of our 123 participants was high: 22.76% had 3–9 years of experience, 40.65% had 10–19 years, and 36.59% had 20 or more years. The firms at which the respondents worked were located in Europe (82.11%), America (13.00%), and Asia (4.88%).
The sizes of the participants' enterprises (measured in employees) were similarly distributed: 27.64% of respondents were from companies with fewer than 50 employees, 26.83% from companies with 50 to 499 employees, and 45.53% from companies with 500 or more employees. The data set is representative of the opinions of several interest groups.

¹ A sample size should exceed, by a factor of ten, both the number of indicators of the construct with the most indicators and the largest number of exogenous variables loading on a single endogenous variable [24]. The required sample size is 120.

4.1 MEASURES

The measures used in our research model were obtained from the descriptive model. We followed the model as closely as possible, to retain the theoretical underpinnings, so no additions were made. Only minor adjustments were made, following suggestions in conferences and presentations, to avoid contextual overlaps. Constructs can be measured in a reflective or a formative approach. The distinction is important, because an appropriate specification of the measurement model is necessary to identify meaningful relationships within the structural model. In the following paragraphs, 41 indicators are identified and discussed. Table 1 gives the questionnaire.

We deduced five indicators within the construct causal conditions (CC) [25]. The construct is formative, because we were aiming at identifying the bases of the causal condition construct. Three of the five indicators belong to the main category requirements [25]. We divided the factor "extensive environment understanding" into market understanding (CC1) and research environment understanding (CC2), because enterprises need to gain insights into their markets and customers, whereas academics seek scientific understanding. Another requirement is seen in the need for timely processing of information (CC3), because traditional computational approaches are usually too slow. There is a debate about whether Big Data is marketing-driven (CC4) or not [1]. If the factor loading is positive, any further theoretical explorations will be questionable, because Big Data would then not be novel. The final indicator is dynamic markets (CC5). Companies have to shorten production cycles, save costs, react more quickly, and maximize profits. The increasing pervasiveness of IT in the techniques enterprises use to address the market may lead to Big Data.

The second construct is the formative context (CO). Context changes will be initiated through the indicators (like multimedia data) and not through the construct itself; in contrast to reflective models, a similar influence of all measures by the latent variable is doubtful. The first indicator belongs to the Knowledge-Based Theory of the Firm (KBT) [25]. Here, knowledge is seen as both unique and the most strategic resource (CO1). Focusing on knowledge integration and combining data sources can help the enterprise to achieve competitive advantage. KBT is used as an indicator rather than a theory and should not be interpreted as a theory mapping. Another context leading to the phenomenon of Big Data is the changing focus of analysis toward transactional data processing on the fly (CO2). Thus, transactions are no longer the fundamental basis of analyses, as understood in BI, but rather an explanatory variable. As a result of increasing IT pervasiveness, a growing amount of machine-generated content (CO3) flows out of log files, the Internet of Things, GPS, and many types of sensor technology. These data can be used to gain an extended environmental understanding. A further shift of data types arises from the growth of multimedia (CO4) sources (images, video, audio, and text), which must be processed and stored on a large scale. These types of data and sources implicate another measure: whereas traditional data were clean and precise, the new ones are often fuzzy. Thus, combining unknown sources raises unexplored issues of data quality (CO5). Fig. 1 gives the context of Internet/Web 2.0/Social Media. We split these into three separate measures because of their different meanings. Internet (CO6) refers to the physical infrastructure that enables global interconnection. Web 2.0 (CO7) comprises interactive platforms on which people collaborate and share information; companies were forced to develop cost-effective and scalable technologies to handle and analyze the growth in content [27]. Since Social Media (CO8) is one of the most important applications within Web 2.0, we analyzed the measure separately [19]. Here, user-specific content creates the possibility of gaining a more nuanced view of customers and the environment than was possible before. Decreasing IT costs (CO9) might also drive Big Data, as hardware costs for processing and storing data have decreased constantly. In addition, frequently mentioned Big Data technologies like Hadoop are usually open source tools. The final context is the legal framework (CO10). Big Data applications raise issues of data privacy, and the combination of sources could be forbidden.

The third construct is the phenomenon Big Data (BD) itself. In our previous work [9], the construct phenomenon was simply described as an "Increasing Volume of Data to be Processed or/and Stored." The description corresponds with Strauss and Corbin's coding schema [14], but provides insufficient information to constitute suitable PLS indicators. However, a literature analysis at an earlier research stage has shown that Big Data is a phenomenon occurring within the IT domain [5]. Big Data is observable in IT tasks. According to Information Processing Theory, IT comprises any mechanism that facilitates the gathering of data, the transformation of data into information, or the communication and storage of information within an organization [27]. Following this analysis, we identified the storing (BD1), transformation (BD2), and accessing/gathering (BD3) of data as areas where Big Data could be observable. Additionally, we assumed that the communication of insights about the fields of discourse within Big Data is done by visualization (BD4) and analysis (BD5). In this way, we were able to examine the growing need for processing in an IT domain. We note that the five indicators are commensurate with the findings of Chen et al. [28]. The operationalization of this construct is reflective, because the phenomenon influences the measures. Any modification within the construct Big Data will affect all items, because the items share a common theme.

The construct strategy (ST) has a formative operationalization [25]: it is not independent but rather is constituted by the measures themselves. Fig. 1 divides strategy into two sub-categories: technology and functional [25]. HPC belongs to the first. In [9], we discussed concepts like cloud, grid, distributed, and parallel computing. A huge overlap between the concepts grid, distributed, and parallel computing was observed, and grid and parallel computing could be seen as part of distributed computing [29; 30]. However, cloud computing combines the concepts of parallel and distributed computing with the service delivery approach [31]. We therefore used cloud computing (ST1) as a single item, to avoid any overlaps. More efficient programming models (ST2), addressing time and computing complexity for faster data processing, were also mentioned. Here, programming models like MapReduce may be applied [32]. Another discussion concerns the database area. NoSQL approaches were covered by key-value- (ST3), document- (ST4), and column-oriented (ST5) databases [33]. A debate was observed over the affiliation of relational databases (ST6). One minor adjustment was made following Fig. 1: after conference discussions, we shifted the indicator streaming applications (ST7) from the construct context to strategy, as streaming applications are a kind of technology and thus more related to strategy. Streaming was developed for specific needs, like analytics in operational BI, to allow the processing of huge amounts of data [34]. Thus, it is more an essential support technology than a driver of Big Data.

The second sub-category belongs to functional strategies. One major concept is Information Management, which is divided into several sub-items. The integration of structured and unstructured data (ST8) is one of the core concepts. Here, various data sources must be integrated in a useful manner, to gain a broader view of the environment. This is associated with the task-oriented provision of information (ST9), which requires the right information, at the right time, in the right amount and place, and of the right quality, to be assigned to the right task. Information lifecycle management (ILM) (ST10) is also seen as a possible approach to Big Data strategies. It comprises the policies, processes, practices, and tools needed to assign time-dependent value to information. This is necessary to allow information to be stored according to its value, including deletion at the appropriate point in time [35]. Fig. 1 included the indicator data reduction through algorithms. After discussion, we concluded that there was a major overlap with ILM: for example, IBM's data deduplication runs within the ILM system [36], and ILM's lifecycle approach always addresses data reduction. Even the identification of relevant information through management methods or machine learning techniques remains fuzzy; overlaps with data reduction run from ILM and task-oriented provision to analytics. We therefore disregarded this indicator, though the items simulation (ST11) and analytics (ST12) were used to communicate similar insights. Analytics may embrace Data and Text Mining; Image, Audio, Video, and Predictive Analytics; and Visualization.

The indicators of consequences (CQ) were separated into competitive advantage, research findings, and issues. One of the most cited issues is the lack of Big Data experts (CQ1). Here, more technical and domain-specific knowledge is needed, as the integration of various data sources and technologies like NoSQL is complex and critical (CQ2). In terms of the multimedia data type, the analysis of this item remains fuzzy and uncertain [26]. As a result, a lack of data quality arises (CQ3). Another issue was assigning relevant information to the task (CQ4). The scope of possible Big Data sources is broad, and not all content contributes to an improved understanding of the domain. One aspect of competitive advantage is the extensive view of customers and markets (CQ5), where strategies like analytics or simulation may bring insights. This is linked to new business ideas and markets (CQ6), and even financial advantage was mentioned in [28]. IT savings (CQ7) can be achievable through open source tools, reduced memory use, and computational capacity. Following KBT, returns (CQ8) will be possible if information is seen as a strategic resource [26]. The final consequence measure addressed the academic application of Big Data. Here, new insights and relationships (CQ9), in particular within natural science, are imaginable. The construct consequences underlies a reflective measurement model. The measures are the final entity in a causal chain and are influenced by the construct rather than vice versa. Such measures are necessarily reflective [26].

Table 1. Questionnaire

Item | Statement
BD1 | An increasing data volume to store is observable
BD2 | An increasing data volume to transform is observable
BD3 | An increasing data volume to access is observable
BD4 | An increasing data volume to visualize is observable
BD5 | An increasing data volume to analyze is observable
CC1 | Big Data occurs through the need of processing data to allow a market understanding
CC2 | Big Data occurs through the need of processing data to allow a research environment understanding
CC3 | Big Data occurs through the need of a timely processing, because traditional approaches compute long
CC4 | Big Data is only driven by marketing departments
CC5 | Big Data occurs through the existence of dynamic markets
CO1 | Big Data emerges since knowledge is seen as unique and strategic resource
CO2 | Big Data emerges since transactional data (e.g. ERP) is analyzed on the fly
CO3 | Big Data emerges since the amount of machine generated content is growing
CO4 | Big Data emerges since the amount of multimedia data is growing
CO5 | Big Data emerges since this topic has to consider unknown data quality within the data itself
CO6 | Big Data emerges since the internet provides an appropriate infrastructure for data processing
CO7 | Big Data emerges since the Web 2.0 forced firms to develop efficient and scalable technologies
CO8 | Big Data emerges since social media platforms provide user specific content
CO9 | Big Data emerges since the IT costs are decreasing
CO10 | Big Data emerges since this topic has to consider legal frameworks
ST1 | Cloud computing is an appropriate method to handle Big Data
ST2 | Efficient programming models (like MapReduce) are an appropriate method to process Big Data
ST3 | Key-value-oriented databases are an appropriate method to store Big Data
ST4 | Document-oriented databases are an appropriate method to store Big Data
ST5 | Column-oriented databases are an appropriate method to store Big Data
ST6 | Relational databases are an appropriate method to store Big Data
ST7 | Streaming is an appropriate method to handle Big Data
ST8 | The integration of (un-)structured data is an appropriate prerequisite to analyze Big Data
ST9 | Task-oriented provision of information is an appropriate method to manage Big Data
ST10 | ILM, as a framework for policies, processes, practices, and tools used to align the business value of information with an effective IT disposition, is an appropriate method to manage Big Data
ST11 | Simulations are an appropriate method to analyze Big Data
ST12 | Analytics (Data-, Text-, and Web Mining, Social-, Image-, Audio-, Video-, and Predictive Analytics, Visualization) are an appropriate method to analyze Big Data
CQ1 | Big Data leads to a remarkable lack of skilled staff
CQ2 | Big Data leads to a remarkable lack of integration possibilities
CQ3 | Big Data leads to a remarkable lack of data quality
CQ4 | Big Data leads to issues regarding the assigning of information to the task
CQ5 | Big Data leads to an extensive customer/market knowledge
CQ6 | Big Data leads to new business models
CQ7 | Big Data leads to increasing cost savings
CQ8 | Big Data leads to increasing investment returns
CQ9 | Big Data leads to new research findings

[Figure 3. Results Big Data Model]

4.2 DATA ANALYSIS

Data analysis was conducted in two steps [37], using PLS with the tool SmartPLS 2.0, following many previous IS studies [38; 39]. Once the measurement model had been analyzed, the relationships within the structural model were evaluated [37]. The adequacy of a PLS measurement model with respect to reflective indicators is ensured by examining the individual item reliabilities and by measuring the convergent and discriminant validity of the construct's measures.
In terms of item reliability, indicator loadings lower than 0.7 should be sequentially eliminated from the model [26], though if the average variance extracted (AVE) of the construct is greater than 0.5, loadings of 0.6 are acceptable. All loadings must be significant at p < 0.05. In this case, all measures on issues (CQ1; CQ2; CQ3; CQ4) were removed, since their loadings were below 0.6 [20]. Convergent validity requires an internal composite reliability (ICR) of at least 0.7 for each construct. Convergent validity was also investigated through the AVE, which should be greater than 0.5 for all reflective constructs. Both constructs, Big Data (ICR = 0.834; AVE = 0.503) and consequences (ICR = 0.878; AVE = 0.592), met the requirements. The final assessment was the discriminant validity of our measures, which was met, as the squared highest correlation (0.592² = 0.350) was below the lowest AVE of 0.503. The second discriminant validity check addressed cross-loadings. As shown in Table 2, all reflective items loaded more strongly on their respective construct.

Table 2. Cross-loadings

Item | BD | CQ | CC | CO | ST
BD1 | 0.63 | 0.28 | 0.31 | 0.31 | 0.32
BD2 | 0.67 | 0.23 | 0.28 | 0.29 | 0.27
BD3 | 0.70 | 0.21 | 0.25 | 0.31 | 0.28
BD4 | 0.71 | 0.23 | 0.40 | 0.27 | 0.31
BD5 | 0.80 | 0.35 | 0.26 | 0.28 | 0.34
CQ5 | 0.29 | 0.76 | 0.38 | 0.40 | 0.41
CQ6 | 0.16 | 0.70 | 0.17 | 0.36 | 0.38
CQ7 | 0.41 | 0.79 | 0.38 | 0.38 | 0.50
CQ8 | 0.32 | 0.86 | 0.37 | 0.45 | 0.48
CQ9 | 0.19 | 0.70 | 0.07 | 0.31 | 0.30

The formative indicators of a model are considered valid if the indicator weights or loadings are significant at p < 0.1. The first test represents the relative and the second the absolute impact. If one weight or loading is significant, empirical support for the indicator's relevance may be assumed [40]. Fig. 3 gives the significance of our measures. The indicators CC2, CC4, CO9, CO10, and ST10 were not significant on either measure. The discriminant validity test for formative constructs is similar to that followed in Hair/Ringle/Sarstedt [20].
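The reflective quality criteria above (ICR, AVE, and the Fornell-Larcker comparison) are simple functions of the standardized loadings. As a minimal sketch (our own illustration using the BD loadings from Table 2; the paper itself used SmartPLS rather than hand computation), the reported values can be reproduced approximately:

```python
# Sketch: internal composite reliability (ICR) and average variance extracted
# (AVE) from standardized indicator loadings, illustrated with the BD loadings
# reported in Table 2. Results differ slightly from the paper's figures
# (ICR = 0.834, AVE = 0.503) because the published loadings are rounded.

def composite_reliability(loadings):
    """ICR = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    squared_sum = sum(loadings) ** 2
    error = sum(1 - l ** 2 for l in loadings)
    return squared_sum / (squared_sum + error)

def average_variance_extracted(loadings):
    """AVE = mean of the squared loadings."""
    return sum(l ** 2 for l in loadings) / len(loadings)

bd_loadings = [0.63, 0.67, 0.70, 0.71, 0.80]  # BD1..BD5 from Table 2

icr = composite_reliability(bd_loadings)       # about 0.830
ave = average_variance_extracted(bd_loadings)  # about 0.496

# Fornell-Larcker check reported in the text: the squared highest construct
# correlation (0.592^2 = 0.350) must stay below the lowest AVE (0.503).
assert 0.592 ** 2 < 0.503
```

The same two functions applied to the consequences loadings in Table 2 approximate the reported ICR of 0.878 and AVE of 0.592.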
In contrast to reflective measures, the correlations between constructs are not squared. The correlation between the formative constructs and the other model constructs should be lower than 0.9. The highest correlation, between strategy and consequences, was 0.554; therefore, all values were below this threshold. In PLS, structural model quality is measured through path significance, the coefficient of determination (R²), and predictive relevance (Q²) [20]. As shown in Fig. 3, all hypotheses were significant at p < 0.01 with positive loadings. All hypotheses were therefore supported. The explanatory power of the endogenous constructs of the structural model was measured by R². Chin [24] defined values of 0.67, 0.33, and 0.19 as substantial, moderate, and weak, and values lower than 0.19 as irrelevant. All values met these criteria. Q² is only measured for endogenous reflective constructs and must be greater than 0 (Fig. 3). The constructs Big Data (Q² = 0.128) and consequences (Q² = 0.175) exceeded this threshold. In summary, all quality requirements were met, suggesting that the model generates meaningful and significant assertions.

5. Discussion

All the hypotheses of the underlying descriptive model proved strongly significant (p < 0.01), and all path loadings exceeded 0.2, indicating that the results were meaningful [41]. Even the predictive Q² and R² values were acceptable. We can therefore assume that Big Data is related to causal conditions (Hypothesis 1A) and context (Hypothesis 1B). An increase in either of these constructs positively affects Big Data and all inherent measures. The two causal condition measures CC1 and CC3 were highly significant and had the greatest impact on the construct. No significance was found for research environment understanding (CC2). This is plausible, because the survey sampled only practitioners, who may have limited research experience.
We found no significant evidence that Big Data is marketing driven (CC4), supporting the claim in [9]. This is important, because it justifies further Big Data research. While the path loading for CC5 was significant, its standard deviation (SD) of 2.05 marks inconsistency in the expert opinions. Eight out of ten measures significantly affected the construct context, whereas the indicators CO4 and CO5 had a less significant impact. Surprisingly, CO5 affected the construct negatively, suggesting that the lower the data quality, the more strongly Big Data will emerge. The finding makes sense when analytics are considered: analysts have to understand and structure data before insights can be obtained, and data quality concerns obstruct the use of analytics. The negative path loading of indicator CO6 is interesting, as it suggests that a greater role of the Internet would reduce the effect of Big Data. Perhaps this can be explained by the high SD of 1.38, which indicates a strong division of opinion among the respondents. CO9 and CO10 proved nonsignificant (see Table 3), but both indicators are part of the descriptive Big Data model and should not be discarded, as doing so would ignore the theoretical underpinnings [20]. The descriptive model (Fig. 1) showed that these measures were supported by only 30 percent of the interviewed experts, which could explain the results. CO10 could reflect the different geographical locations of our participants, because the regulatory environment differs greatly between regions such as Europe and America. The measures of the construct Big Data were highly significant, indicating meaningful path loadings. Considering that IT encompasses any mechanism that facilitates the gathering and transformation of data into information, and the communication and storage of information in an organization, Big Data should be relevant at any level of IT [28].
Visualization (BD4) and analysis of Big Data (BD5) were the major indicators, which reinforces the scope of the research domain. The explanatory variance of R² = 0.253 is still weak, but the predictive relevance of Q² = 0.128 is high. The low variance within the construct Big Data (Table 3) reflects the consensus among the experts. Decision makers should be aware that the dimensions of Big Data will increase as quickly as expectations rise. Thus, gaining an extended understanding of the market (CC1) in a timely manner (CC3) will lead to more Big Data. This effect is even greater in contexts like social media (CO8) or machine-generated content (CO3), where the volume of data is growing rapidly. Logically, diseconomies of scale are possible: as greater understanding is demanded, more data needs to be processed, increasing the cost of one unit of understanding over time. Finding the final piece of information within a huge data set may be very expensive. However, this needs more formal investigation.

Table 3. Descriptive statistics

Item  Mean  SD      Item  Mean  SD
BD1   5.34  0.72    ST1   4.25  1.31
BD2   4.79  1.00    ST2   5.35  0.58
BD3   5.05  0.60    ST3   4.65  0.95
BD4   4.97  1.01    ST4   4.09  1.58
BD5   5.42  0.63    ST5   4.57  1.02
CC1   5.08  0.76    ST6   3.47  1.92
CC2   4.78  0.91    ST7   4.55  1.15
CC3   4.51  1.82    ST8   4.90  0.95
CC4   2.73  2.05    ST9   4.22  1.21
CC5   4.02  1.73    ST10  4.50  0.98
CO1   4.89  1.03    ST11  4.39  1.36
CO2   4.36  1.28    ST12  5.35  0.55
CO3   5.35  0.66    CQ1   4.63  1.82
CO4   5.12  0.98    CQ2   3.66  1.78
CO7   4.58  1.49    CQ3   4.00  1.95
CO8   4.87  1.06    CQ4   3.98  1.25
CO9   3.93  1.63    CQ5   4.95  1.00
CO10  3.93  1.57    CQ6   5.32  0.57
                    CQ7   4.17  1.31
                    CQ8   4.36  1.46
                    CQ9   5.18  0.73

Hypothesis 2 was also supported: more Big Data leads to more strategy. Organizations struggling with at least one of the observed Big Data indicators can address it by applying one of these strategies. The explained variance of R² = 0.195 was weak, suggesting that not all the influencing factors have been identified.
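The R² and Q² figures quoted in this discussion can be interpreted against the thresholds from Chin [24] with a small helper. The function names and threshold encoding below are ours, restating the quality rules already applied above.

```python
# Hypothetical helper (names are ours) restating the structural-model
# quality rules used in this study: Chin's R-squared bands and the
# Q-squared > 0 criterion for predictive relevance.

def classify_r2(r2):
    """Chin: >= 0.67 substantial, >= 0.33 moderate, >= 0.19 weak, else irrelevant."""
    if r2 >= 0.67:
        return "substantial"
    if r2 >= 0.33:
        return "moderate"
    if r2 >= 0.19:
        return "weak"
    return "irrelevant"

def q2_ok(q2):
    """Predictive relevance requires Q-squared above zero."""
    return q2 > 0

# R-squared values reported for the endogenous constructs:
reported = {"Big Data": 0.253, "strategy": 0.195, "consequences": 0.307}
for construct, r2 in reported.items():
    print(construct, "->", classify_r2(r2))

# Q-squared values for the reflective endogenous constructs:
print("Q2 checks pass:", q2_ok(0.128) and q2_ok(0.175))
```

Under these bands, all three constructs fall into the weak-but-relevant range, consistent with the interpretation given in the text.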
Organizational aspects such as Big Data competence centers could be other significant factors. This study confirmed eleven of the twelve strategies for Big Data as significant; only ILM (ST10) was not supported. Thus, our study contributes to practice by identifying strategies to tackle Big Data. Typical NoSQL approaches such as ST3 or ST5 impact the construct most strongly. Surprisingly, ST4 was significant, but its path loading was low. The question of whether relational databases (ST6) are adequate to process Big Data was answered negatively by the respondents [33], as the path loading was negative. As expected, ST2 (e.g., MapReduce) was supported as a way to deal with Big Data. Even the paradigm of cloud computing was supported, and ST7 also proved significant. Thus, all technologies were supported. At present, Big Data is less powerfully addressed by Information Management approaches [5]. Our study provides evidence that two of three Information Management aspects are appropriate to Big Data: both the use of ST8, to gain a broader view of the environment, and ST9 were supported. Our study thus supports future research addressing these areas. However, ST10 was not supported; Big Data does not focus on deleting data as its value decreases over time. A possible explanation could be linked to BI, or to the growth of data warehouses [42]. The indicator analytics was highly significant, and even simulation was considered a possible strategy. We used causal conditions and context to explain the rise of the Big Data phenomenon, allowing organizations to adjust their requirements of Big Data. By applying Information Processing Theory, the study found evidence of Big Data in all IT mechanisms. In addition, it verified the positive relation between Big Data and strategy: the more Big Data exists within an organization, the more strategy the organization will use. Moreover, significant strategy manifestations were demonstrated.
The study suggests possible Big Data strategies suitable for the observed increase of Big Data. Organizations dealing with Big Data can choose between several options to cope with its effects. We found evidence that Big Data strategies lead to positive consequences, so organizations considering investing in Big Data will find support in our results for its potential to generate value. Thus, further research is supported. Nevertheless, the measured R² of all constructs was low; future research should identify further constructs to help explain the rise of Big Data in more detail. The proposed model represents the current state of play and can be used as a foundation; it could be updated to take account of new developments within the area of Big Data. We are aware that our model might not cover all related concepts, because our focus was on the verification of Table 1. We are also aware that all indicators are common to established IT concepts such as BI and are not unique to Big Data. Our model is a preliminary one, and more specific indicators are left to future research. However, the combination of contexts and causal conditions is particular to Big Data, and even the selected strategies are unique in combination and form their own construct. Thus, our model contributes a clearer understanding of Big Data to the scientific discussion, including the awareness that single indicators are common in IT. Our descriptive Big Data model and our data sample were based on discussions with practitioners. It is interesting to ask whether researchers and practitioners understand Big Data differently, as this would also contribute to the scientific discussion of Big Data and increase the body of knowledge about this topic. We established a positive relationship between strategy and consequences (Hypothesis 3). This supports other findings in IT research [10]. The path coefficient was highly significant and impacted consequences with a strong loading of 0.554.
A growing use of Big Data strategies will cause a rise in consequences. The R² = 0.307 was small to moderate [24]. Following the requirements for reflective measures, all issue-related indicators were removed, yielding evidence that only positive outcomes are generated through Big Data strategies. This supports the further development of strategies to increase the outcomes of Big Data. However, deleting all issue-related indicators contradicts findings in the literature; perhaps splitting the construct into positive and negative consequences would have led to other results. Nevertheless, all indicators were significant at p < 0.01. Possible financial outcomes were shown by CQ8 and CQ7, and even the identification of new business models was supported. We also found evidence that Big Data strategies allow the development of CQ5 and CQ9. Besides that, we can give empirical evidence for Gartner's [2] three V's: the increasing amount of data to store (BD1) supports volume, velocity can be partly confirmed through the need for timely processing (CC3), and variety refers to different data types, for which the measures CO1, CO2, CO3, CO4, and CO8 are adequate manifestations. In summary, the study illustrates a significant cause-effect relationship. Thus, if the exogenous variables (causal conditions and context) increase, the observable indicators (store, transform, access, visualize, analyze) of Big Data will grow. This leads to a growing usage of Big Data related strategies and ultimately to positive outcomes for organizations.

6. Conclusions

The study proposed an initial theoretical foundation for Big Data. We developed a research model, based on a descriptive Big Data model, to elucidate the underlying relationships and concepts. All hypotheses of our Big Data theory model tested positively in the PLS model. This theoretical foundation therefore serves as a contribution to the scientific discussion.
Further research is justified, because Big Data is becoming a better-defined research domain. The study contributes to a clearer understanding of Big Data, which also supports academic and practical discussions. Using our model, future applications and research can justify being located within Big Data. Organizations can more easily define Big Data policies and value propositions. Even the transfer of knowledge is simplified, because the model provides a common understanding of the Big Data domain.

7. References

[1] H. Buhl, M. Röglinger, F. Moser, and J. Heidemann, "Big Data: A Fashionable Topic with(out) Sustainable Relevance for Research and Practice?", BISE, 5, 2, 2013, pp. 65-69.
[2] Gartner (2014). IT Glossary: Big Data [Online]. Available: http://www.gartner.com/it-glossary/big-data/
[3] C. Bizer, P. Boncz, and M. Brodie, "The Meaningful Use of Big Data", SIGMOD, 40, 2014, pp. 56-60.
[4] Y. He, R. Lee, and Y. Huai, "RCFile", ICDE, Hannover, 2011, pp. 1199-1208.
[5] M. Pospiech, and C. Felden, "Big Data – A State-of-the-Art", AMCIS, Paper 22, 2012.
[6] D. Loshin, Big Data Analytics. Waltham: Morgan Kaufmann, 2009.
[7] X. Wu, X. Zhu, and G. Wu, "Data Mining with Big Data", IEEE TKDE, 26, 2014, pp. 97-107.
[8] R. D. Galliers, and F. F. Land, "Choosing an Appropriate Information Systems Research Methodology", Comm. ACM, 30, 1987, pp. 900-902.
[9] M. Pospiech, and C. Felden, "Deployment of a Descriptive Big Data Model", in Business Intelligence for New-Generation Managers, J. Mayer and R. Quick (eds.), Heidelberg: Springer, 2015, pp. 75-95.
[10] G. Piccoli, and B. Ives, "IT-Dependent Strategic Initiatives and Sustained Competitive Advantage", MISQ, 29, 2005, pp. 747-776.
[11] H. Cooper, Synthesizing Research. Thousand Oaks: Sage, 1998.
[12] J. Webster, and R. Watson, "Analyzing the past to prepare for the future: writing a literature review", MISQ, 26, 2002, pp. 13-23.
[13] M. Pospiech, and C. Felden, "A Descriptive Big Data Model Using Grounded Theory", BDSE, pp. 878-885.
[14] A. Strauss, and J. Corbin, Basics of Qualitative Research. Thousand Oaks: Sage, 1990.
[15] H. W. Kim, H. Chan, and A. Kankanhalli, "What Motivates People to Purchase Digital Items on Virtual Community Websites?", ISR, 23, 2012, pp. 1232-1245.
[16] Gartner (2014). IT Glossary: Big Data [Online]. Available: http://www.gartner.com/it-glossary/big-data/
[17] D. Boyd, and K. Crawford, "Critical Questions for Big Data", Information, C & S, 15, 5, 2012, pp. 662-679.
[18] A. Cuzzocrea, Y. Song, and K. Davis, "Analytics over Large-Scale Multidimensional Data", DOLAP, UK, 2011, pp. 101-103.
[19] A. Kaplan, and M. Haenlein, "Users of the world, unite!", Business Horizons, 53, 2010, pp. 59-68.
[20] J. Hair, C. Ringle, and M. Sarstedt, "PLS-SEM: Indeed a silver bullet", JMTP, 19, 2011, pp. 139-152.
[21] E. Babbie, Survey Research Methods. Wadsworth: Cengage, 1990.
[22] Xing (2014). Xing [Online]. Available: https://www.xing.com/en
[23] R. Chomeya, "Quality of Psychology Test Between Likert Scale 5 and 6 Points", Journal of Social Sciences, 6, 2010, pp. 399-403.
[24] W. W. Chin, "The partial least squares approach for structural equation modeling", in Modern Methods for Business Research, Erlbaum, 1998, pp. 295-336.
[25] T. Coltman, T. Devinney, and D. Midgley, "Formative versus reflective measurement models", JBR, 61, 2008, pp. 1250-1262.
[26] J. Krumm, N. Davies, and C. Narayanaswami, "User-generated content", Pervasive Computing, 7, 2008, pp. 10-11.
[27] J. Fairbank, G. Labianca, and H. Steensma, "Information Processing Design Choices, Strategy, and Risk Management Performance", MIS, 23, 2006, pp. 293-319.
[28] H. Chen, R. Chiang, and V. Storey, "Business intelligence and analytics: from big data to big impact", MISQ, 36, 4, 2012, pp. 1165-1188.
[29] I. Foster, Y. Zhao, and I. Raicu, "Cloud Computing and Grid Computing 360-Degree Compared", GCE, Austin, 2008, pp. 1-10.
[30] D. Peleg, Distributed Computing. Philadelphia: SIAM, 2000.
[31] K. Hwang, J. Dongarra, and C. Fox, Distributed and Cloud Computing. Waltham: Morgan Kaufmann, 2000.
[32] J. Dean, and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Comm. ACM, 51, 2008, pp. 137-149.
[33] J. Han, E. Haihong, and G. Le, "Survey on NoSQL Database", ICPCA, South Africa, 2011, pp. 363-366.
[34] M. Castellanos, C. Gupta, and S. Wang, "Leveraging web streams for contractual situational awareness in operational BI", EDBT, Lausanne, 2010, pp. 1-8.
[35] SNIA (2014). Information Lifecycle Management [Online]. Available: http://www.snia.org/
[36] M. Ebbers, M. Archibald, and C. da Fonseca (2014). IBM Smarter Data Centers [Online]. Available: http://www.redbooks.ibm.com/
[37] J. Hulland, "Use of partial least squares (PLS) in strategic management research", Strategic Management Journal, 20, 1999, pp. 195-204.
[38] C. Ringle, S. Wende, and A. Will (2014). SmartPLS [Online]. Available: http://www.smartpls.de
[39] S. Smith, R. Johnston, and S. Howard, "Putting Yourself in the Picture", ISR, 22, 2011, pp. 640-659.
[40] R. Cenfetelli, and G. Bassellier, "Interpretation of Formative Measurement in Information Systems Research", MISQ, 33, 2009, pp. 689-708.
[41] W. W. Chin, "Issues and Opinion on Structural Equation Modeling", MISQ, 1998, pp. 7-16.
[42] P. Zikopoulos, D. deRoos, and K. Parasuraman, Harness the Power of Big Data. New York: McGraw-Hill, 2012.