Big Data – A Theory Model

2016 49th Hawaii International Conference on System Sciences

Carsten Felden
TU Bergakademie Freiberg
[email protected]

Marco Pospiech
TU Bergakademie Freiberg
[email protected]

1530-1605/16 $31.00 © 2016 IEEE
DOI 10.1109/HICSS.2016.622

Abstract

Big Data is an emerging research topic. However, the term remains fuzzy and is seen as an umbrella term. Its origin, composition, possible strategies, and outcomes are unclear. This impedes the positioning of publications addressing business administration issues related to Big Data. The missing theoretical foundation of Big Data has been recognized, as has the need for underlying relationships and concepts to be elucidated. While a few publications have directly addressed this need, the analysis has remained methodically weak. Based on an existing qualitative model, we developed a Big Data theory model and tested it through structural equation modeling. All the hypotheses in our research model proved significant. The study makes three principal contributions to the scientific discussion about Big Data. First, it elucidates the current characteristics of Big Data. Second, the addressability of Big Data through strategies is demonstrated. Third, it produced evidence of positive outcomes through Big Data.

1. Introduction

The term Big Data frequently occurs in scientific and practical discussions [1]. One standard definition comes from Gartner [2], where Big Data means "high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making." Bizer et al. [3] added semantic criteria, whereas others have addressed only the amounts of data being processed [4]. It appears likely that an increasing number of fields of study and technologies will be associated with Big Data, while established research areas such as High Performance Computing (HPC) are redefining themselves as Big Data to attract more interest [5]. The positioning of publications addressing business administration applications of Big Data is impeded by the fuzzy nature of the field, which is reflected in the heterogeneous definitions used [5]. From a practitioner's perspective, the benefits of Big Data, or the nature of the obstacles that could be addressed by a Big Data approach, remain unclear. Even the ability to communicate and interpret a value proposition is impeded by the difficulty of identifying a clear goal [6]. The missing theoretical foundation of Big Data has been highlighted because relevant topics and related theories have not been clearly identified [6; 5]. This study aims to introduce a theoretical model of Big Data to describe the current state of its relationships and concepts.

Few studies address this research goal. In most cases, the characteristics of Big Data are explored through discursive approaches [6; 7]. From a methodical perspective, the evidence for the Big Data characteristics is vague [8]. In a previous study [9], we conducted interviews with experts to deduce a descriptive model of Big Data that could be used for future empirical verification. It represents, to our knowledge, the only Big Data model based on a solid methodology. The model defines Big Data through causal conditions and context, whereby Big Data itself leads to particular strategies and those strategies to consequences. This follow-on research investigates whether the descriptive model can be confirmed by a structural equation model (SEM). This verification makes three principal contributions to scientific discussions about Big Data. First, it unveils the underlying characteristics of Big Data. Second, it proposes that an understanding of the theoretical background should help answer the question of whether Big Data can be addressed strategically. Third, the study seeks to confirm whether Big Data strategies produce benefits for organizations as effectively as strategies in other IT research areas [10]. Considering the fuzzy benefits of and poorly understood obstacles to Big Data, the latter two aspects are of major practical interest. If Big Data can be clearly described, appropriate strategies can be chosen, and knowledge-sharing between companies becomes possible. The research area will be more precisely defined, and business administration issues can be addressed.

The paper is organized as follows: after a literature review, the initial set-up of the descriptive Big Data model is presented and explained. The model is used to derive our hypotheses, which are confirmed by an SEM. To ensure the research is rigorous and transparent, we describe our research methodology and evaluate the model quality using standard measures. Finally, the results are discussed and implications are highlighted.
2. STATUS QUO
A literature review by Cooper [11] analyzed the
scientific databases of the AIS eLibrary, IEEE Xplore,
the ACM Digital Library, SpringerLINK, and
ScienceDirect up to January 2015. A backward search
was also conducted, to avoid missing relevant articles
[12]. In the first round, 981 hits were obtained.
Duplications were removed and abstracts were
manually analyzed. Papers of interest were examined
for relevance according to the four-eyes-principle.
Because the research scope was limited to articles
focusing on Big Data theory development, only four
articles remained. Other papers, while not focusing
exclusively on theory development, might influence
theory construction, but the initial study concentrated
on the use of existing theoretical developments and not
on their derivation.
Wu et al. [7] postulated that Big Data starts with
large-volume, heterogeneous, autonomous sources with
distributed and decentralized control, and seeks to
explore complex and evolving relationships (the HACE
theorem). Loshin [6] defined Big Data as “applying
innovative and cost-effective techniques for solving
existing and future business problems whose resource
requirements exceed the capabilities of traditional
computing environments.” The derivation of both
theories is based only on argumentation. This
methodically weak approach to theory development [8]
was not supported by more rigorous verification in
either case. A review of the extant literature explains
the diversity of perspectives and the difficulty in
deriving an academic description of Big Data. Our first
model [13] explored Big Data through expert
interviews. Experts were selected from international
companies and surveyed using telephone interviews.
The data were coded and conceptualized using
Grounded Theory because the generation and discovery
of concepts and inherent relationships is a strength of
the method [14]. Consequently, all identified concepts
were assigned to a common coding schema with five
categories (Fig. 1). The phenomenon (Big Data)
belongs to the middle category [14].
The model was applied and tested against existing
Big Data publications, and its relevance to the research
area was confirmed. Based on discussions in
conferences and during presentations, our descriptive
model was extended, as shown in Fig. 1 [9]. In addition,
the model was used to classify academic publications
into practical Big Data implementations. Nevertheless,
the study had an exploratory character. A quantitative follow-up is needed to clarify whether all categories and relationships are significant [9]. A deeper understanding of a phenomenon can be achieved by sequentially applying qualitative and quantitative methods, where findings from the qualitative study empirically inform the later quantitative study [15].
3. RESEARCH MODEL AND HYPOTHESES
Our model, shown in Fig. 1, served as the basis for
the proposed research model. We modified the position
of the category strategy because strategies are applied
after the phenomenon (Big Data) emerges. In addition,
consequences are not deduced from the phenomenon
but rather from the chosen actions (strategies).
Grounded Theory defines a phenomenon as a
central idea, event, happening, or incident about which
a set of actions or interactions is grouped. Big Data can
be treated as a phenomenon, because we can observe
effects like growing volume and variety. In addition,
the term Big Data is often linked, both in practice and
in research, to phenomena [16; 17]. According to [14],
the phenomenon is characterized by causal conditions
and context.
The term causal condition refers to events or
incidents that lead to the occurrence or development of
a phenomenon [14]. Initial reasons led to the
phenomenon, because we recognized the existence of a
growing volume of data that has to be processed. Others
[1] have shown that academic and business
requirements are drivers of this growing data. Thus, the
stronger the causal conditions of Big Data, the stronger
the phenomenon itself.
Hypothesis 1A: Causal conditions are positively
related to the phenomenon of Big Data.
The category context describes the circumstances
under which Big Data evolves [1]. Here, we have to
observe the environment, to identify those incidents and
locations that pertain to Big Data. One location can be
found in research areas like Social Media, which are
evolving in parallel with Big Data and pushing the
phenomenon [5]. Social Media platforms generate
growing volumes of user specific content as possible
sources of analysis. It is our assumption that positive
changes in context will lead to a growth in Big Data, as
claimed in [1].
Hypothesis 1B: Context is positively related to the
phenomenon Big Data.
Alongside the discussion of the origins of Big Data is the question of whether Big Data has an effect on the category strategy. In IT, the implementation of
technologies or concepts has often been used to address
the increasing volume of data to be processed and/or
stored [4]. Thus, strategy is not a component of the
phenomenon Big Data itself (shown by a punctuated
arrow in Fig. 1), but it is necessary to address its effects
[9].
Hypothesis 2: Big Data is positively related to
strategy.
Figure 1. Big Data Descriptive Model
Figure 2. Big Data Theory Model
Consequences refers to the outcomes or results of a
strategy. Within the Big Data domain, consequences
are often discussed in the form of advantages and issues
[18]. It is not the phenomenon but rather the strategies
for managing Big Data that determine the positive or
negative effects [9]. It is of interest, therefore, whether
an increase within the category strategy leads to a
positive growth of consequences. A possible example
could be seen within business analytics. A growing
understanding of customers should lead to higher
revenues. Thus, more strategy would lead to more
consequences.
Hypothesis 3: Strategy is positively related to
consequences.
One advantage of the applied coding schema [14] is
its generalizability to the description of a range of
phenomena. We are aware that other concepts like
Business Intelligence (BI) would fit into the coding
schema (Fig. 1) as well. Nevertheless, in the current
state of play an initial basis is needed. Further research
can extend the theoretical model. Fig. 2 illustrates the
proposed research model. Hypotheses and categories (hereafter referred to as constructs) are ordered according to the explanations given above.
4. RESEARCH METHODOLOGY
The study introduces an initial Big Data theoretical
model to describe underlying relationships and
concepts. The research was designed to establish
whether relationships between the constructs exist and
to evaluate their significance using an SEM. In addition, the relation of the suggested Big Data concepts to the model constructs is examined.
squares (PLS) model as this technique is well-suited to
the study of associations between latent variables when
new theoretical grounds and measures are being
explored [20]. The survey was conducted according to
the guidelines of Babbie [21]. A pilot test was
conducted with six participants to confirm the
understandability of our measures. After minor
adjustments, an anonymous survey was conducted
using an online survey tool. The model was tested
under the assumption that the phenomenon Big Data
can be only described by human consensus because it is
determined by its linguistic description [9]. This
consensus can be achieved by an experienced crowd.
We identified suitable participants from the business
networking platform Xing [22], which hosts several Big
Data interest groups whose members can be assumed to
have the appropriate expertise. Only practitioners with
at least three years of experience were sampled, because
the underlying model (Fig. 1) is based on the input of
IT company employees. Suitable participants were
contacted and sent a link to the online survey. Of the participants contacted, 385 joined and 123 completed the survey.¹ An introduction was given and the
categories explained to ensure consistent responses
[21]. All items used a 6-point Likert-type scale in which
1 = “strongly disagree” and 6 = “strongly agree.” It has
been shown that 6-point Likert-type scales have a
higher reliability than 5-point scales [23]. The
professional experience of our 123 participants was
high: 22.76% had 3–9 years of experience, 40.65% had
10–19 years, and 36.59% had 20 or more years. The
firms at which the respondents worked were located in
Europe (82.11%), America (13.00%), and Asia
(4.88%). The sizes of the participants' enterprises (measured in employees) were similarly distributed: 27.64% were from companies with fewer than 50 employees, 26.83% from companies with 50 to 499 employees, and 45.53% from companies with 500 or more employees.
The data set is representative of the opinions of several
interest groups.
¹ The sample size should be at least ten times the number of indicators of the construct with the most indicators, and at least ten times the largest number of exogenous variables loading on a single endogenous variable [24]. The required sample size is therefore 120.
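As a rough illustration (not the authors' computation), the ten-times rule from the footnote can be evaluated against the model's 41 indicators; the indicator and path counts below are taken from Table 1 and the three hypotheses:

```python
# 10-times rule of thumb for PLS sample sizes [24]:
# n >= 10 * max(most indicators on one construct,
#               most structural paths into one construct)
indicators = {"CC": 5, "CO": 10, "BD": 5, "ST": 12, "CQ": 9}
incoming_paths = {"BD": 2, "ST": 1, "CQ": 1}  # H1A/H1B, H2, H3

required_n = 10 * max(max(indicators.values()),
                      max(incoming_paths.values()))
print(required_n)  # 120 -- the 123 completed surveys meet the requirement
```

The binding constraint is the strategy construct with its 12 indicators, which yields the sample size of 120 stated in the footnote.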
4.1 MEASURES

The measures used in our research model were obtained from the descriptive model. We followed the model as closely as possible, to retain the theoretical underpinnings, so no additions were made. Only minor adjustments were made, following suggestions in conferences and presentations, to avoid contextual overlaps. Constructs can be measured in a reflective or formative approach. The distinction is important, because an appropriate specification of the measurement model is necessary to identify meaningful relationships within the structural model. In the following paragraphs, 41 indicators are identified and discussed. Table 1 gives the questionnaire.

We deduced five indicators within the construct causal conditions (CC) [25]. The construct is formative, because we were aiming at identifying the bases of the causal condition construct. Three of the five indicators belong to the main category requirements [25]. We divided the factor "extensive environment understanding" into market understanding (CC1) and research environment understanding (CC2), because enterprises need to gain insights into their markets and customers, whereas academics seek scientific understanding. Another requirement is seen in the need for a timely processing of information (CC3), because traditional computational approaches are usually too slow. There is a debate about whether Big Data is marketing-driven (CC4) or not [1]. If this factor loading is positive, any further theoretical explorations will be questionable, because Big Data would then not be novel. The final indicator is dynamic markets (CC5). Companies have to shorten production cycles, save costs, react more quickly, and maximize profits. The increasing pervasiveness of IT in the techniques enterprises use to address the market may lead to Big Data.

The second construct belongs to the formative context (CO). Context changes will be initiated through the indicators (like multimedia data) and not through the construct itself. Compared to reflective models, a similar influence of all measures by the latent variable is doubtful. The first indicator belongs to the Knowledge-Based Theory of the Firm (KBT) [25]. Here, knowledge is seen as both unique and the most strategic resource (CO1). Focusing on knowledge integration and combining data sources can help the enterprise to achieve competitive advantage. KBT is used as an indicator rather than a theory and should not be interpreted as a theory mapping. Another context leading to the phenomenon of Big Data is the changing focus of analysis toward transactional data processing on the fly (CO2). Thus, transactions are no longer the fundamental basis of analyses, as understood in BI, but rather an explanatory variable. As a result of increasing IT pervasiveness, an increasing amount of machine-generated content (CO3) flows out of log files, the Internet of Things, GPS, and many types of sensor technology. These data can be used to gain an extended environmental understanding. A further shift of data types arises from the growth of multimedia (CO4) sources (images, video, audio, and text), which must be processed and stored on a large scale. These types of data and sources implicate another measure: whereas traditional data were clean and precise, the new ones are often fuzzy. Thus, combining unknown sources raises unexplored issues of data quality (CO5). Fig. 1 gives the context of Internet/Web 2.0/Social Media. We split these into three separate measures because of their different meanings. The Internet (CO6) refers to the physical infrastructure which enables global interconnection. Web 2.0 (CO7) refers to interactive platforms on which people collaborate and share information. Companies were forced to develop cost-effective and scalable technologies to handle and analyze the growth in content [27]. Since Social Media (CO8) is one of the most important applications within Web 2.0, we analyzed the measure separately [19]. Here, user-specific content creates the possibility of gaining a more nuanced view of customers and the environment than was possible before. Decreasing IT costs (CO9) might also drive Big Data, as hardware costs for processing and storing data have decreased constantly. In addition, frequently mentioned Big Data technologies like Hadoop are usually open source tools. The final context is the legal framework (CO10). Big Data applications raise issues of data privacy, and the combination of sources could be forbidden.

The third construct is the phenomenon Big Data (BD) itself. In our previous work [9], the construct phenomenon was simply described as an "Increasing Volume of Data to be Processed or/and Stored." The description corresponds with Strauss and Corbin's coding schema [14], but provides insufficient information to constitute suitable PLS indicators. However, a literature analysis at an earlier research stage has shown that Big Data belongs to a phenomenon occurring within the IT domain [5]. Big Data is observable in IT tasks. According to Information Processing Theory, IT is any mechanism that facilitates the gathering of data, the transformation of data into information, or the communication and storage of information within an organization [27]. Following this analysis, we identified storing (BD1), transformation (BD2), and accessing/gathering (BD3) of data as areas where Big Data could be observable. Additionally, we assumed that the communication of insights about the fields of discourse within Big Data is done by visualization (BD4) and analysis (BD5). In this way, we were able to examine the growing need for processing in an IT domain. We note that the five indicators are commensurate with the findings of Chen et al. [28]. The operationalization of this construct is reflective, because the phenomenon influences the measures.
Table 1. Questionnaire

Item  Statement
BD1   An increasing data volume to store is observable
BD2   An increasing data volume to transform is observable
BD3   An increasing data volume to access is observable
BD4   An increasing data volume to visualize is observable
BD5   An increasing data volume to analyze is observable
CC1   Big Data occurs through the need of processing data to allow a market understanding
CC2   Big Data occurs through the need of processing data to allow a research environment understanding
CC3   Big Data occurs through the need of timely processing, because traditional approaches compute too long
CC4   Big Data is only driven by marketing departments
CC5   Big Data occurs through the existence of dynamic markets
CO1   Big Data emerges since knowledge is seen as a unique and strategic resource
CO2   Big Data emerges since transactional data (e.g. ERP) is analyzed on the fly
CO3   Big Data emerges since the amount of machine generated content is growing
CO4   Big Data emerges since the amount of multimedia data is growing
CO5   Big Data emerges since this topic has to consider unknown data quality within the data itself
CO6   Big Data emerges since the internet provides an appropriate infrastructure for data processing
CO7   Big Data emerges since the Web 2.0 forced firms to develop efficient and scalable technologies
CO8   Big Data emerges since social media platforms provide user specific content
CO9   Big Data emerges since the IT costs are decreasing
CO10  Big Data emerges since this topic has to consider legal frameworks
ST1   Cloud computing is an appropriate method to handle Big Data
ST2   Efficient programming models (like MapReduce) are an appropriate method to process Big Data
ST3   Key-value-oriented databases are an appropriate method to store Big Data
ST4   Document-oriented databases are an appropriate method to store Big Data
ST5   Column-oriented databases are an appropriate method to store Big Data
ST6   Relational databases are an appropriate method to store Big Data
ST7   Streaming is an appropriate method to handle Big Data
ST8   The integration of (un-)structured data is an appropriate prerequisite to analyze Big Data
ST9   Task-oriented provision of information is an appropriate method to manage Big Data
ST10  ILM, as a framework for policies, processes, practices, and tools used to align the business value of information with an effective IT disposition, is an appropriate method to manage Big Data
ST11  Simulations are an appropriate method to analyze Big Data
ST12  Analytics (Data-, Text-, and Web Mining; Social-, Image-, Audio-, Video-, and Predictive Analytics; Visualization) are an appropriate method to analyze Big Data
CQ1   Big Data leads to a remarkable lack of skilled staff
CQ2   Big Data leads to a remarkable lack of integration possibilities
CQ3   Big Data leads to a remarkable lack of data quality
CQ4   Big Data leads to issues regarding the assigning of information to the task
CQ5   Big Data leads to an extensive customer/market knowledge
CQ6   Big Data leads to new business models
CQ7   Big Data leads to increasing cost savings
CQ8   Big Data leads to increasing investment returns
CQ9   Big Data leads to new research findings
Any modification within the construct Big Data will affect all items, because the items share a common theme.

The construct strategy (ST) has a formative operationalization [25]: it is not independent but rather is constituted by the measures themselves. Fig. 1 divides strategy into two sub-categories: technology and functional [25]. HPC belongs to the first. In [9], we discussed concepts like cloud, grid, distributed, and parallel computing. A huge overlap between the concepts grid, distributed, and parallel computing was observed, and grid and parallel computing could be seen as part of distributed computing [29; 30]. However, cloud computing combines the concepts of parallel and distributed computing with the service delivery approach [31]. We therefore used cloud computing (ST1) as a single item, to avoid any overlaps. More efficient programming models (ST2), addressing time and computing complexity for faster data processing, were mentioned. Here, programming models like MapReduce may be applied [32]. Another discussion concerns the database area. NoSQL approaches were key-value- (ST3), document- (ST4), or column-oriented (ST5) databases [33].

A debate was observed over the affiliation of relational databases (ST6). One minor adjustment was made following Fig. 1. After conference discussions, we shifted the indicator streaming applications (ST7) from the construct context to strategy, as streaming applications are a kind of technology and thus more related to strategy. Streaming was developed for specific needs like analytics in operational BI, to allow the processing of huge amounts of data [34]. Thus, it is more an essential support technology than a driver of Big Data. The second sub-category belongs to functional strategies. One major concept is Information Management, which is divided into several sub-items. The mapping of structured and unstructured data (ST8) is one of the core concepts. Here, various data sources must be integrated in a useful manner, to gain a broader view of the environment. This is associated with the task-oriented provision of information (ST9), which requires the right information, at the right time, in the right amount and place, and of the right quality, to be assigned to the right task. Information lifecycle management (ILM) (ST10) is also seen as a possible approach to Big Data strategies. It comprises the policies, processes, practices, and tools needed to assign time-dependent value to information. This is necessary to allow information to be stored according to its value, including deletion at the appropriate point in time [35]. Fig. 1 included the indicator data reduction through algorithms. After discussion, we concluded that there was a major overlap with Information Lifecycle Management (ILM); for example, IBM's data deduplication runs within the ILM system [36], and ILM's lifecycle approach always addresses data reduction. Even the identification of relevant information through management methods or machine learning techniques remains fuzzy. Overlaps with data reduction run from ILM and task-oriented provision to analytics. In this context, we disregarded this indicator, though the items simulation (ST11) and analytics (ST12) were used to communicate similar insights. Analytics may embrace Data- and Text Mining, Image-, Audio-, Video-, and Predictive Analytics, and Visualization.

The indicators of consequences (CQ) were separated into competitive advantage, research findings, and issues. One of the most cited issues is that of the missing Big Data experts (CQ1). Here, more technical and domain-specific knowledge is needed, as the integration of various data sources and technologies like NoSQL is complex and critical (CQ2). In terms of the multimedia data type, the analysis of this item remains fuzzy and uncertain [26]. As a result, a lack of data quality arises (CQ3). Another issue was assigning relevant information to the task (CQ4). The scope of possible Big Data sources is broad, and not all content contributes to an improved understanding of the domain. One aspect of competitive advantage is the extensive view of customers and markets (CQ5), where strategies like analytics or simulation may bring insights. This is linked to new business ideas and markets (CQ6), while even financial advantage was mentioned in [28]. IT savings (CQ7) can be achieved through open source tools, reduced memory use, and computational capacity. Following KBT, returns (CQ8) will be possible if information is seen as a strategic resource [26]. The final consequence measure addressed the academic application of Big Data. Here, new insights and relationships (CQ9), in particular within the natural sciences, are imaginable. The construct consequences underlies a reflective measurement model. The measures are the final entity in a causal chain and are influenced through the construct rather than vice versa. Such measures are necessarily reflective [26].

Figure 3. Results Big Data Model
4.2 DATA ANALYSIS
Data analysis was conducted in two steps [37],
using PLS with the tool SmartPLS 2.0, following many
previous IS studies [38; 39]. Once the measurement
model had been analyzed, the relationships within the
structural model were evaluated [37].
The adequacy of a PLS measurement model with respect to reflective indicators is ensured by examining the individual item reliabilities and by measuring the convergent and discriminant validity of the constructs' measures. In terms of item reliability, indicator loadings lower than 0.7 must be sequentially eliminated from the model [26], though if the average variance extracted (AVE) of the construct is greater than 0.5, loadings of 0.6 are acceptable. All loadings must be significant at p < 0.05. In this case, all measures on issues (CQ1; CQ2; CQ3; CQ4) were removed, since their loadings were below 0.6 [20]. Convergent validity requires the measures of each construct to achieve an Internal Composite Reliability (ICR) of at least 0.7. Convergent validity was also investigated through the AVE, which should be greater than 0.5 for all reflective constructs. Both constructs, Big Data (ICR = 0.834; AVE = 0.503) and consequences (ICR = 0.878; AVE = 0.592), met the requirements. The final assessment was the discriminant validity of our measures, which was met, as the highest squared correlation between constructs (0.592² = 0.350) was below the lowest AVE (0.503). The second discriminant validity check
addressed cross-loadings. As shown in Table 2, all
reflective items (bold font) were more strongly loaded
for the respective construct. The formative indicators of
a model are considered valid if the indicator weights or
loadings are significant at p < 0.1. The first test
represents the relative and the second test the absolute
impact. If one weight or loading is significant,
empirical support for the indicator’s relevance may be
assumed [40]. Fig. 3 gives the significance of our
measures. The indicators CC2, CC4, CO9, CO10, and
ST10 were not significant on either measure.
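The reported reliability figures can be approximated from the (rounded) loadings in Table 2 using the standard formulas for AVE and composite reliability; a minimal sketch, not the authors' code:

```python
# AVE and composite reliability (ICR) from standardized indicator loadings
loadings = [0.63, 0.67, 0.70, 0.71, 0.80]  # BD1..BD5, Table 2

ave = sum(l * l for l in loadings) / len(loadings)
cr = sum(loadings) ** 2 / (sum(loadings) ** 2
                           + sum(1 - l * l for l in loadings))

print(round(ave, 3), round(cr, 3))  # 0.496 0.83
# Close to the reported AVE = 0.503 and ICR = 0.834; the small gap comes
# from the two-decimal rounding of the published loadings.
```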
Table 2. Cross-loadings

Item   BD    CQ    CC    CO    ST
BD1   0.63  0.28  0.31  0.31  0.32
BD2   0.67  0.23  0.28  0.29  0.27
BD3   0.70  0.21  0.25  0.31  0.28
BD4   0.71  0.23  0.40  0.27  0.31
BD5   0.80  0.35  0.26  0.28  0.34
CQ5   0.29  0.76  0.38  0.40  0.41
CQ6   0.16  0.70  0.17  0.36  0.38
CQ7   0.41  0.79  0.38  0.38  0.50
CQ8   0.32  0.86  0.37  0.45  0.48
CQ9   0.19  0.70  0.07  0.31  0.30
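The cross-loading criterion can be checked mechanically against the values in Table 2; a minimal sketch (our illustration, not the authors' code):

```python
# Each reflective item should load highest on its own construct (Table 2)
cols = ["BD", "CQ", "CC", "CO", "ST"]
rows = {
    "BD1": [0.63, 0.28, 0.31, 0.31, 0.32],
    "BD2": [0.67, 0.23, 0.28, 0.29, 0.27],
    "BD3": [0.70, 0.21, 0.25, 0.31, 0.28],
    "BD4": [0.71, 0.23, 0.40, 0.27, 0.31],
    "BD5": [0.80, 0.35, 0.26, 0.28, 0.34],
    "CQ5": [0.29, 0.76, 0.38, 0.40, 0.41],
    "CQ6": [0.16, 0.70, 0.17, 0.36, 0.38],
    "CQ7": [0.41, 0.79, 0.38, 0.38, 0.50],
    "CQ8": [0.32, 0.86, 0.37, 0.45, 0.48],
    "CQ9": [0.19, 0.70, 0.07, 0.31, 0.30],
}
for item, vals in rows.items():
    own = item[:2]  # construct prefix, e.g. "BD" for "BD1"
    assert cols[vals.index(max(vals))] == own, item
print("all items load highest on their own construct")
```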
The discriminant validity test is similar to that of Hair, Ringle, and Sarstedt [20]. In contrast to reflective measures, correlations between constructs are not squared. The correlation between the formative constructs and other model constructs should be lower than 0.9. The highest correlation, between strategy and consequences, was 0.554; therefore, all values were below this threshold. In PLS, structural model quality is measured through path significance, the coefficient of determination (R²), and predictive relevance (Q²) [20]. As shown in Fig. 3, all hypotheses were significant at p < 0.01 with positive loadings. All hypotheses were therefore supported.

The explanatory power of the endogenous constructs of the structural model was measured by R². Chin [24] defined values of 0.67, 0.33, and 0.19 as substantial, moderate, and weak, and values lower than 0.19 as irrelevant. All values met these criteria. Q² is only measured for endogenous reflective constructs and must be greater than 0 (Fig. 3). The constructs Big Data
(Q² = 0.128) and consequences (Q² = 0.175) exceeded
these thresholds. In summary, all quality requirements
were met, suggesting that the model generates
meaningful and significant assertions.
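The threshold checks above can be replayed as simple arithmetic; a small sketch using only values reported in the text:

```python
# Values as reported in Sect. 4.2; this only replays the threshold logic.
sq_corr = 0.592 ** 2            # highest squared correlation (reflective side)
assert sq_corr < 0.503          # below the lowest AVE -> discriminant validity

assert 0.554 < 0.9              # highest formative correlation (ST <-> CQ)

assert 0.128 > 0 and 0.175 > 0  # Q² of Big Data and consequences
print("all reported quality thresholds hold")
```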
5. Discussion
All the hypotheses of the underlying descriptive
model proved strongly significant (p < 0.01), and all
path loadings exceeded 0.2, indicating that the results
were meaningful [41]. Even the predictive Q² and R²
were acceptable. We can therefore assume that Big
Data is related to causal conditions (Hypothesis 1A)
and context (Hypothesis 1B). An increase in any one of
these constructs positively affects Big Data and all
inherent measures. The two causal condition measures
CC1 and CC3 were highly significant and had the
greatest impact on the construct. No significance was
found for research environment understanding (CC2).
This is plausible, because the survey sampled only
practitioners, who may have limited research
experience. We found no significant evidence that Big
Data is marketing driven (CC4), supporting the claim in
[9]. This is important, because it justifies further Big
Data research. While the path loadings for CC5 were
significant, the standard deviation (SD) value of 2.05
marks inconsistency in the expert opinions. Eight out of
ten measures significantly affected the construct
context, whereas indicators CO4 and CO5 had a less
significant impact. Surprisingly, CO5 affected the
construct negatively, suggesting that the lower the data
quality, the more strongly Big Data will emerge. The
finding makes sense when analytics are considered.
Analysts have to understand and structure data before
insights can be obtained, and data quality concerns
obstruct the use of analytics. The negative path loading
of indicator CO6 is interesting, as it suggests that more
Internet will reduce the effect of Big Data. Perhaps the
effect can be explained through the high SD value of
1.38, which indicates a strong division of opinion
between the respondents. CO9 and CO10 proved nonsignificant (see Table 3), but both indicators are part of
the descriptive Big Data model, and should not be
discarded, as to do so would be to ignore the theoretical
underpinnings [20]. The descriptive model (Fig. 1) has
shown that these measures were only supported by 30
percent of the interviewed experts, which could explain
the results. CO10 could reflect the different
geographical location of our participations, because the
regulatory environment differs greatly between regions
like Europe and America.
The measures of the construct Big Data were highly
significant, indicating meaningful path loadings.
Considering that IT encompasses any mechanism that
facilitates the gathering and transformation of data into
information, and the communication and storage of
information in an organization, Big Data should be
relevant at any level of IT [28]. Visualization (BD4)
and analysis of Big Data (BD5) were the major
indicators, which reinforces the scope of the research
domain. The explanatory variance of R²=0.253 is still
weak, but the predictive relevance of Q²=0.128 is high.
The low variance (Table 3) within the construct Big
Data represents the consensus between the experts.
Decision makers should be aware that the dimensions
of Big Data will increase as quickly as expectations
rise. Thus, gaining an extended understanding of the
market (CC1) in a timely manner (CC3) will lead to
more Big Data. This effect is still greater in contexts
like Social Media (CO8) or machine generated content
(CO3), where the volume of data is growing rapidly.
Logically, diseconomies of scale are possible: as greater
understanding is demanded, more data needs to be
processed, increasing the cost of one unit of
understanding over time. Finding the final piece of
information within a huge data set may be very
expensive. However, this needs more formal
investigation.
Table 3. Descriptive statistics

Item   Mean   SD      Item   Mean   SD
BD1    5.34   0.72    ST1    4.25   1.31
BD2    4.79   1.00    ST2    5.35   0.58
BD3    5.05   0.60    ST3    4.65   0.95
BD4    4.97   1.01    ST4    4.09   1.58
BD5    5.42   0.63    ST5    4.57   1.02
CC1    5.08   0.76    ST6    3.47   1.92
CC2    4.78   0.91    ST7    4.55   1.15
CC3    4.51   1.82    ST8    4.90   0.95
CC4    2.73   2.05    ST9    4.22   1.21
CC5    4.02   1.73    ST10   4.50   0.98
CO1    4.89   1.03    ST11   4.39   1.36
CO2    4.36   1.28    ST12   5.35   0.55
CO3    5.35   0.66    CQ1    4.63   1.82
CO4    5.12   0.98    CQ2    3.66   1.78
CO5    4.58   1.49    CQ3    4.00   1.95
CO6    4.87   1.06    CQ4    3.98   1.25
CO7    4.58   1.49    CQ5    4.95   1.00
CO8    4.87   1.06    CQ6    5.32   0.57
CO9    3.93   1.63    CQ7    4.17   1.31
CO10   3.93   1.57    CQ8    4.36   1.46
                      CQ9    5.18   0.73
Hypothesis 2 was also supported: more Big Data
will lead to more strategy. Organizations struggling
with at least one of the observed Big Data indicators
can address them by applying one of these strategies.
The explained variance of R² = 0.195 was weak,
suggesting that not all the influencing factors have been
identified. Organizational aspects like Big Data
Competence Centers could be other significant factors.
This study confirmed eleven of the twelve proposed
Big Data strategies as significant. Only ILM (ST10) was not
supported. Thus, our study contributes to practice by
identifying strategies to tackle Big Data. Typical
NoSQL approaches like ST3 or ST5 impact the
construct most strongly. Surprisingly, ST4 was
significant, but the path loading was low. The question
of whether relational databases (ST6) are adequate to
process Big Data was answered negatively by the
respondents [33], as the path loading was negative. As
expected, approaches like ST2 (MapReduce) were supported as a
way to deal with Big Data. Even the paradigm of cloud
computing was supported, and ST7 also proved
significant. Thus, all technologies were supported. At
present, Big Data is less powerfully addressed by
Information Management approaches [5].
Our study provides evidence that two of three
Information Management aspects are appropriate in Big
Data. Both the use of ST8, to gain a broader view of the
environment, and ST9 were supported. Our study
supports future research to address these areas.
However, ST10 was not supported. Thus, Big Data
does not focus on deleting data as its value
decreases over time. A possible explanation could be
linked to BI, or to the growth of data warehouses [42].
The indicator analytics was highly significant, and even
simulation was considered a possible strategy.
We used causal conditions and context to explain
the rise of the Big Data phenomenon, allowing
organizations to adjust their requirements of Big Data.
By applying Information Processing Theory, the study
found evidence of Big Data in all IT mechanisms. In
addition, it verified the positive relation between Big
Data and strategy. Thus, the more Big Data exists
within an organization, the more strategy the
organization will use. Moreover, significant strategy
manifestations were demonstrated. The study suggests
possible Big Data strategies, suitable for the observed
increase of Big Data. Organizations dealing with Big
Data can choose between several options to cope with
its effects. We found evidence that Big Data strategies
lead to positive consequences. Organizations
considering investing in Big Data will find support in
the potential value generation in our results. Thus,
further research is supported.
Nevertheless, the measured R² of all constructs was
low. Future research should identify further constructs
to help explain the rise of Big Data in more detail. The
proposed model represents the current state of play and
can be used as a foundation. It could be updated to take
account of new developments within the area of Big
Data. We are aware that our model might not cover all
related concepts, because our focus was on the
verification of Table 1. In addition, we are aware that
all indicators are equally common for IT concepts like
BI and are not unique to Big Data. Our model is a
preliminary one, and more specific indicators are left to
future research. However, the combination of contexts
and causal conditions is particular to Big Data. Even
the selected strategies are unique in combination and
form their own construct. Thus, our model contributes a
clearer understanding of Big Data to the scientific
discussion, including the awareness that single
indicators are common in IT. Our descriptive Big Data
model and our data sample were based on discussions
with practitioners. It is interesting to ask whether a
different understanding of Big Data exists between
researchers and practitioners, as this would also
contribute to the scientific discussion of Big Data and
increase the body of knowledge about this topic.
We established a positive relationship between
strategy and consequences (Hypothesis 3). This
supports other findings in IT research [10]. The path
coefficient was highly significant and impacted
consequences with a strong loading of 0.554. A
growing use of Big Data strategies will cause a rise in
consequences. The R²=0.307 was small to moderate
[24]. Following the requirements of reflective measures,
all issue-related indicators were removed, yielding
evidence that only positive outcomes are generated
through Big Data strategies. This supports the further
development of strategies to increase the outcomes of
Big Data. However, deleting all issue-related indicators
contradicts the findings in the literature. Perhaps
splitting the construct into positive and negative
consequences would have led to other results.
Nevertheless, all indicators were significant at p < 0.01.
Possible financial outcomes were shown by CQ8 and
CQ7. Even the identification of new business models
was supported. In addition, we found evidence that Big
Data strategies allow the development of CQ5 and
CQ9.
Beyond that, we can give empirical evidence for
Gartner's [2] three V's. The measure of an increasing data
amount to store (BD1) supports volume, while
velocity can be partly confirmed through the need for
timely processing (CC3). Variety refers to different
data types; here, the measures CO1, CO2, CO3, CO4,
and CO8 are adequate manifestations.
In summary, the study illustrates a significant
cause-effect relationship. Thus, if the
exogenous variables (causal conditions and context) increase, the
observable indicators (store, transform, access,
visualize, analyze) of Big Data will grow. This in turn
leads to a growing usage of Big Data related strategies
and ultimately to positive outcomes for organizations.
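This cause-effect chain can be summarized compactly in structural equation notation. The symbols γ, β (path coefficients), and ζ (error terms) are conventional SEM notation rather than symbols used in the study; per Fig. 3, all path estimates were positive and significant at p < 0.01:

```latex
% Structural (inner) model implied by the cause-effect chain:
% causal conditions (CC) and context (CO) drive Big Data (BD),
% which drives strategy (ST), which drives consequences (CQ).
\begin{align*}
  \mathrm{BD} &= \gamma_1\,\mathrm{CC} + \gamma_2\,\mathrm{CO} + \zeta_1 \\
  \mathrm{ST} &= \beta_1\,\mathrm{BD} + \zeta_2 \\
  \mathrm{CQ} &= \beta_2\,\mathrm{ST} + \zeta_3
\end{align*}
```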
6. Conclusions
The study proposed an initial theoretical foundation
for Big Data. We developed a research model, based on
a descriptive Big Data model, to elucidate the
underlying relationships and concepts. All hypotheses
of our Big Data theory model tested positively in the
PLS model. This theoretical foundation therefore serves
as a contribution to the scientific discussion. Further
research is justified, because Big Data is becoming a
better-defined research domain. The study contributes to a
clearer understanding of Big Data, which also supports
academic and practical discussions. Using our model,
future applications and research can justify being
located within Big Data. Organizations can more easily
define Big Data policies and value propositions. Even
the transfer of knowledge is simplified, because the
model provides a common understanding of the Big
Data domain.
7. References

[1] H. Buhl, M. Röglinger, F. Moser, and J. Heidemann, "Big Data: A Fashionable Topic with(out) Sustainable Relevance for Research and Practice?", BISE, 5, 2, 2013, pp. 65-69.
[2] Gartner. (2014). IT Glossary: Big Data [Online]. Available: http://www.gartner.com/it-glossary/big-data/
[3] C. Bizer, P. Boncz, and M. Brodie, "The Meaningful Use of Big Data", SIGMOD, 40, 2014, pp. 56-60.
[4] Y. He, R. Lee, and Y. Huai, "RCFile", ICDE, Hannover, 2011, pp. 1199-1208.
[5] M. Pospiech, and C. Felden, "Big Data – A State-of-the-Art", AMCIS, Paper 22, 2012.
[6] D. Loshin, Big Data Analytics. Waltham: Morgan Kaufmann, 2009.
[7] X. Wu, X. Zhu, and G. Wu, "Data Mining with Big Data", IEEE TKDE, 26, 2014, pp. 97-107.
[8] R. D. Galliers, and F. F. Land, "Choosing an Appropriate Information Systems Research Methodology", Comm. ACM, 30, 1987, pp. 900-902.
[9] M. Pospiech, and C. Felden, "Deployment of a Descriptive Big Data Model", in Business Intelligence for New-Generation Managers, J. Mayer, and R. Quick (eds.), Heidelberg: Springer, 2015, pp. 75-95.
[10] G. Piccoli, and B. Ives, "IT-Dependent Strategic Initiatives and Sustained Competitive Advantage", MISQ, 29, 2005, pp. 747-776.
[11] H. Cooper, Synthesizing Research. Thousand Oaks: Sage, 1998.
[12] J. Webster, and R. Watson, "Analyzing the past to prepare for the future: writing a literature review", MISQ, 26, 2002, pp. 13-23.
[13] M. Pospiech, and C. Felden, "A Descriptive Big Data Model Using Grounded Theory", BDSE, pp. 878-885.
[14] A. Strauss, and J. Corbin, Basics of Qualitative Research. Thousand Oaks: Sage, 1990.
[15] H. W. Kim, H. Chan, and A. Kankanhalli, "What Motivates People to Purchase Digital Items on Virtual Community Websites?", ISR, 23, 2012, pp. 1232-1245.
[16] Gartner. (2014). IT Glossary: Big Data [Online]. Available: http://www.gartner.com/it-glossary/big-data/
[17] D. Boyd, and K. Crawford, "Critical Questions for Big Data", Information, C & S, 15, 5, 2012, pp. 662-679.
[18] A. Cuzzocrea, Y. Song, and K. Davis, "Analytics over Large-Scale Multidimensional Data", DOLAP, UK, 2011, pp. 101-103.
[19] A. Kaplan, and M. Haenlein, "Users of the world, unite!", Business Horizons, 53, 2010, pp. 59-68.
[20] J. Hair, C. Ringle, and M. Sarstedt, "PLS-SEM: Indeed a silver bullet", JMTP, 19, 2011, pp. 139-152.
[21] E. Babbie, Survey Research Methods. Wadsworth: Cengage, 1990.
[22] Xing. (2014). Xing [Online]. Available: https://www.xing.com/en
[23] R. Chomeya, "Quality of Psychology Test Between Likert Scale 5 and 6 Points", Journal of Social Sciences, 6, 2010, pp. 399-403.
[24] W. W. Chin, "The partial least squares approach for structural equation modeling", in Modern Methods for Business Research, Erlbaum, 1998, pp. 295-336.
[25] T. Coltman, T. Devinney, and D. Midgley, "Formative versus reflective measurement models", JBR, 61, 2008, pp. 1250-1262.
[26] J. Krumm, N. Davies, and C. Narayanaswami, "User-generated content", Pervasive Computing, 7, 2008, pp. 10-11.
[27] J. Fairbank, G. Labianca, and H. Steensma, "Information Processing Design Choices, Strategy, and Risk Management Performance", MIS, 23, 2006, pp. 293-319.
[28] H. Chen, R. Chiang, and V. Storey, "Business intelligence and analytics: from big data to big impact", MISQ, 36, 4, 2012, pp. 1165-1188.
[29] I. Foster, Y. Zhao, and I. Raicu, "Cloud Computing and Grid Computing 360-Degree Compared", GCE, Austin, 2008, pp. 1-10.
[30] D. Peleg, Distributed Computing. Philadelphia: SIAM, 2000.
[31] K. Hwang, J. Dongarra, and C. Fox, Distributed and Cloud Computing. Waltham: Morgan Kaufmann, 2000.
[32] J. Dean, and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Comm. ACM, 51, 2008, pp. 137-149.
[33] J. Han, E. Haihong, and G. Le, "Survey on NoSQL Database", ICPCA, South Africa, 2011, pp. 363-366.
[34] M. Castellanos, C. Gupta, and S. Wang, "Leveraging web streams for contractual situational awareness in operational BI", EDBT, Lausanne, 2010, pp. 1-8.
[35] SNIA. (2014). Information Lifecycle Management [Online]. Available: http://www.snia.org/
[36] M. Ebbers, M. Archibald, and C. da Fonseca. (2014). IBM Smarter Data Centers [Online]. Available: http://www.redbooks.ibm.com/
[37] J. Hulland, "Use of partial least squares (PLS) in strategic management research", Strategic Management Journal, 20, 1999, pp. 195-204.
[38] C. Ringle, S. Wende, and A. Will. (2014). SmartPLS [Online]. Available: http://www.smartpls.de
[39] S. Smith, R. Johnston, and S. Howard, "Putting Yourself in the Picture", ISR, 22, 2011, pp. 640-659.
[40] R. Cenfetelli, and G. Bassellier, "Interpretation of Formative Measurement in Information Systems Research", MISQ, 33, 2009, pp. 689-708.
[41] W. W. Chin, "Issues and Opinion on Structural Equation Modeling", MISQ, 1998, pp. 7-16.
[42] P. Zikopoulos, D. deRoos, and K. Parasuraman, Harness the Power of Big Data. New York: McGraw-Hill, 2012.