
The International Centre for Security Analysis
The Policy Institute at King’s
King’s College London
Workshop Series on Open Source Research Methodology in Support of Non-Proliferation
Workshop 1: Exploiting Big Data in Support of Non-Proliferation
Workshop Series Funded by the Carnegie Corporation of New York
Contents
Executive Summary
Introduction
1: Key Concepts
Defining and Understanding the Concept of Big Data
Big Data and Organisational Culture
2: Big Data Methodologies and Applications
Quantitative Text Analysis
Satellite Imagery
Identifying Anomalies using Big Data
3: Legal and Ethical Considerations of Using Big Data
4: Breakout Group Discussions
Breakout Session One: Big Data and Proliferation Networks
Breakout Session Two: Big Data and Non-Proliferation Sentiment
Executive Summary
The workshop discussed a wide range of issues relating to the potential applications of big data analytics in non-proliferation. A summary of the key discussion points, and of the issues that must be addressed in the future, is given below:
1. Big data analytics may provide useful early indicators of anomalies and deviations from normal or expected behaviour with regard to non-proliferation.
• Detecting early indicators of irregularities or illicit activities gives the analyst the opportunity to investigate these further.
• Assessing what constitutes “normal” activity is a significant challenge.
2. Incorporating big data analytics for non-proliferation into existing analytical frameworks often meets organisational resistance.
• Organisational roadblocks remain a significant barrier to the implementation of big data methodologies.
• Demonstrating the value of big data analytics and open source research methodologies is important in overcoming this resistance.
3. The successful application of big data methodologies and approaches to non-proliferation issues rests on integrating a varied range of data sources.
• Collating data sources into a coherent and useful dataset for detailed analysis is vital for the effective use of big data and open source research methodologies.
• Understanding and effectively utilising existing datasets is as important as incorporating new sources of information.
4. New and underused sources of data may be of significant value to the non-proliferation community when combined with big data analytics.
• Satellite imagery is one such source that could be integrated into existing analytical frameworks.
• However, costs and processing capabilities remain substantial barriers to the effective adoption of new data sources and research methods.
5. Legal and ethical concerns must be addressed.
• Legality and ethics must be considered if the use of open source information and big data methodologies in support of non-proliferation is to be fair and lawful.
• The non-proliferation community must recognise the legal and ethical implications of using big data analytics and open source research methodologies.
Introduction
The International Centre for Security Analysis at King’s College London is running a
four-part workshop series on open source research methodologies in support of non-proliferation. The series is funded by the Carnegie Corporation of New York. It is
designed to stimulate discussion on how open source information and research
methodologies can help tackle key issues in non-proliferation, including verification and
monitoring by the nuclear safeguards community. The workshop series is as follows:
1. Workshop One: the exploitation of big data research methodologies and
applications in support of non-proliferation;
2. Workshop Two: the exploitation of financial open source information in support
of non-proliferation;
3. Workshop Three: the challenges and solutions to the effective collection and
analysis of open source information on states’ nuclear-relevant scientific and
technical capabilities;
4. Workshop Four: the challenges and solutions to the effective collection and analysis of open source information on both state and non-state nuclear non-proliferation strategies.
The first workshop, on the exploitation of big data in support of non-proliferation, was held in Vienna on 26–27 January 2015. The workshop aimed to examine the role of big data analytical tools, research techniques and methodologies in non-proliferation efforts. It approached this topic from a multi-disciplinary perspective, drawing on expertise from academia, law enforcement, the legal profession and the private sector. A multi-disciplinary approach was deliberately chosen to reflect the diverse approaches to big data that may be of relevance to non-proliferation activities.
The workshop also aimed to facilitate the discussion of new open source research
methodologies and approaches that may be of use for the non-proliferation community.
To achieve this, the workshop was structured as a series of panel presentations and two
breakout sessions. The panel presentations were designed to give insights into innovative
and interesting applications of big data and open source research methodologies. The
breakout sessions sought to stimulate discussion on the practical applications of big data
methodologies in support of non-proliferation.
The International Centre for Security Analysis would like to thank all the workshop
participants for their contributions.
1: Key Concepts
Defining and Understanding the Concept of Big Data
Big data is a contested concept with no single overarching definition. Often definitions
will refer to the “3Vs”:
• Volume: intuitively, big data as a concept implies that the size of datasets is a key defining characteristic;
• Velocity: the speed of data generation and processing;
• Variety: the context of data and the wide range of forms it can take.
Additional “Vs” have been added periodically in variations on the 3Vs definition, including veracity (data quality) and variability (data inconsistencies). The workshop
did not dwell on defining big data; instead, early discussions focussed on how to
conceptualise big data in a practical and meaningful way. This enabled the workshop to
focus on the more significant questions relating to how big data applications could
support and enhance the activities of the non-proliferation community.
One intuitive way of thinking about big data and big data methodologies, discussed at the workshop, is to conceptualise it as a process for accessing, collecting and analysing data that lies beyond the reach of current tools and technologies. These tools and technologies have improved dramatically over the last 20 years, enabling more sophisticated processes to analyse ever larger and more complex datasets. At the same time, such developments encourage attempts to collect, process and analyse ever larger, still out-of-reach datasets. It is therefore important to recognise that big data methodologies must remain dynamic and flexible as datasets increase in size, scope and complexity.
Furthermore, many of the discussions emphasised the importance of conceptualising big
data not just from the perspective of the developer of capabilities but also from the
perspective of end users. The aim should be to develop a clear and comprehensive
understanding in the non-proliferation community about what we understand big data to
mean and why it has become popular as a concept.
Developing this understanding is critical because we are now in a world of exabyte-scale data, a vast amount of information that is hard to comprehend. An intuitive way to grasp this volume is to consider the thickness of a banknote (0.0043 inches, roughly 0.11 mm). A stack of an “exa” (10^18) of such notes would stretch from the Earth to the moon and back roughly 140,000 times. This clearly demonstrates that big data disciplines are moving far beyond the capacities of human analysis and testing the limits of computer processing.
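The arithmetic behind this banknote illustration can be checked directly; a short sketch, assuming a note thickness of 0.0043 inches and a mean Earth-moon distance of 384,400 km:

```python
# Sanity check: how far would a stack of 10**18 banknotes reach?
# Assumptions: note thickness 0.0043 inches; mean Earth-moon distance 384,400 km.

NOTE_THICKNESS_M = 0.0043 * 0.0254      # inches -> metres
EARTH_MOON_M = 384_400_000              # metres, one way
EXA = 10**18

stack_height_m = EXA * NOTE_THICKNESS_M
round_trips = stack_height_m / (2 * EARTH_MOON_M)

print(f"Stack height: {stack_height_m:.3e} m")
print(f"Earth-moon round trips: {round_trips:,.0f}")
```

The stack works out to roughly 10^14 metres, on the order of 140,000 round trips to the moon.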
Big Data and Organisational Culture
The implementation of big data methodologies within organisations is often a significant
challenge and one that a number of the workshop panellists addressed. Discussions
throughout the workshop often returned to issues relating to organisational culture. Here
a number of the most significant discussions are described, including issues relating to:
• Organisational control of information;
• Understanding existing data and integrating new sources of information;
• Development of big data and open source research methodologies.
Existing datasets often sit in silos, controlled by individuals or analytical teams who
may be reluctant to share information with their colleagues. Integrating disparate datasets
into a single, coherent and intuitive platform is therefore both a technical and
organisational challenge. It requires collaborating closely with technical developers to
design and implement a platform that analysts can use effectively. It also requires
demonstrating to analysts the value of sharing information across departments and
analytical teams.
A related issue is the procurement and development of big data capabilities; a common problem is an overemphasis on procurement rather than innovation and integration. This feeds into the problem of big data analytics and technologies that fail to satisfy the requirements of the end user. The challenge is further exacerbated by institutional resistance to, or a poor understanding of, the benefits that well-developed big data solutions could offer existing processes.
Access to the most interesting or potentially useful datasets is often difficult. Such datasets are often controlled by organisations and withheld for a number of reasons, including commercial sensitivity and privacy. Datasets are also withheld because organisations find it extremely challenging to place a value on the data they control.
Related to this is the idea that organisations should retain effective control of their data while developing the information technology infrastructure and analytical approaches necessary to maximise the utility of big data methods. Understanding which datasets are under an organisation’s control is important in a rapidly evolving data environment. Developing this understanding will enable an organisation to apply open source research and big data methodologies effectively to unique research problems, such as specific issues in non-proliferation.
2: Big Data Methodologies and Applications
Quantitative Text Analysis
Overview
Quantitative text analysis is an expressly quantitative variant of content analysis that uses
a statistical measure to monitor word frequency and relevance in a piece of text. The
development of quantitative text analysis as a methodology has been fundamentally
driven by the large, and growing, volumes of easily accessible and non-copyrighted text
available online. The applications of quantitative text analysis are varied; they include:
• Rapidly annotating large volumes of text;
• Discovering associations between text features to support insight and prediction;
• Performing event detection or sentiment analysis;
• Tracking topics in a large corpus of text to identify changes in behaviour over time;
• A near-limitless range of industry-specific applications.
Applications for Non-Proliferation
Quantitative text analysis offers a number of practical applications that may be relevant to non-proliferation. These include tracking public opinion on social media and measuring latent political positions or issue frames in political texts. In the second breakout session (see below), the groups discussed using sentiment analysis of textual data to predict behaviour, and its possible application in providing indications of decision-maker intent and public opinion regarding nuclear security developments.
Quantitative text analysis also facilitates more complex analyses of document features, such as the frequency with which a word appears as the object or subject of a sentence. Such features can be arranged in a document-feature matrix that can be weighted according to key input features and used as a basis for further analysis.
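As a minimal sketch of the matrix described above, the following Python fragment builds a small matrix of word counts per document using only the standard library (the toy documents and the simple term-frequency weighting are illustrative assumptions, not the workshop's method):

```python
from collections import Counter

# Toy corpus: three short "documents" (illustrative only).
docs = {
    "doc1": "centrifuge exports rose as centrifuge parts were re-labelled",
    "doc2": "officials denied that exports of dual-use parts had increased",
    "doc3": "trade data show a sharp rise in dual-use exports this quarter",
}

# Tokenise on white space and count term frequencies per document.
counts = {name: Counter(text.lower().split()) for name, text in docs.items()}

# Build a document-feature matrix: one row per document, one column per term.
vocabulary = sorted({term for c in counts.values() for term in c})
matrix = {name: [c[term] for term in vocabulary] for name, c in counts.items()}

# Each row can now be weighted (e.g. by inverse document frequency)
# and used as input for clustering, topic tracking or classification.
print(vocabulary)
print(matrix["doc1"])
```

Real toolkits build the same structure at scale with sparse representations, but the underlying object is the same: one row per document, one column per feature.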
Non-proliferation research and analysis often involves collecting, processing and analysing large volumes of unstructured text in a range of file formats. Quantitative text analysis methodologies could be used to create structure from unstructured text and to enable analyses of keywords and latent political positions. They may also provide a framework for further evaluation, acting as an aid for the analyst faced with processing and analysing large volumes of text.
Challenges
The workshop addressed the main barriers to implementing quantitative text analysis as
one big data approach to analysing large volumes of unstructured text. One of these
challenges is the unique and highly technical lexicon of the non-proliferation community;
to overcome this, linguistics experts could work with non-proliferation analysts to
develop a unique dictionary of words relating to non-proliferation.
Another set of challenges relates to the practical means of processing and analysing large volumes of text. Primarily, this is not a problem of data storage but rather one of developing tools and techniques for extracting and structuring typically unstructured texts. A number of techniques have been developed to address this challenge, one notable example being the use of natural language processing to extract and structure useful data from unstructured texts.
Furthermore, assessing proliferation risks requires the analysis of large quantities of text in multiple languages. This poses an additional challenge for quantitative text analysis methodologies, which predominantly focus on languages written in the Roman script. Many techniques use the white space between words to tokenise and parse text, but this approach faces difficulties in languages that construct sentences very differently, do not delimit words with spaces, or use non-Roman scripts.
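The tokenisation problem can be seen in a two-line comparison: splitting on white space works for English, but returns an unsegmented language such as Chinese as a single token (the example phrases are illustrative):

```python
# White-space tokenisation works for Roman-script languages...
english = "enrichment facility inspection report"
print(english.split())   # four tokens

# ...but fails for languages that do not delimit words with spaces.
chinese = "核设施检查报告"      # "nuclear facility inspection report"
print(chinese.split())   # one token: a dedicated word segmenter is needed
```

This is why multilingual text-analysis pipelines typically need language-specific segmentation before any frequency counting can begin.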
Satellite Imagery
Overview
The non-proliferation community already uses satellite imagery in support of its analysis
and evaluation of proliferation issues. There are likely to be growing opportunities to
capitalise on the falling cost, increased resolution and intra-daily monitoring capabilities
of commercial satellite imagery. However, it is important to recognise that the integration
of satellite imagery into existing analyses may pose unique legal and ethical challenges (see
discussion on legal issues below).
Applications for Non-Proliferation
Although the non-proliferation community already uses satellite imagery to complement
other information sources, further integration may be beneficial. In particular, the ability
to monitor a particular area of interest over an extended period of time using satellite
imagery will produce a time-series library of data. This library of information can then be
cross-referenced against other sources of information to develop a more comprehensive
understanding of activities.
One particular area where satellite imagery may complement existing information sources
is through the integration of time-series imagery with trade data collected from a range of
sources. This could provide the analyst with not only a richer picture of activity in a given
location but also a more detailed understanding of trade networks, supply chains and the
movement of goods across the Earth’s surface. Discussions again touched on whether
satellite imagery, especially time-series imagery, may be able to provide indicators of
anomalies or deviations from typical behaviour and activity.
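The kind of time-series comparison described here can be sketched with NumPy on two synthetic "images"; real workflows would use georeferenced commercial imagery and far more robust change-detection methods, so the arrays and threshold below are purely illustrative:

```python
import numpy as np

# Two synthetic grayscale "satellite images" of the same site, months apart.
rng = np.random.default_rng(0)
before = rng.uniform(0.4, 0.6, size=(8, 8))   # baseline scene
after = before.copy()
after[2:4, 5:7] += 0.3                        # simulated new construction

# Pixel-wise change detection: flag cells whose brightness shifted markedly.
diff = np.abs(after - before)
changed = diff > 0.2                          # illustrative threshold

print(f"{changed.sum()} of {changed.size} pixels flagged as changed")
```

Each flagged region becomes a candidate for cross-referencing against other sources, such as trade data, rather than a conclusion in itself.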
Identifying Anomalies using Big Data
Overview
One of the most discussed concepts in the workshop was the potential application of big data in identifying anomalies or early indicators of illicit proliferation activity. By identifying and flagging anomalies to the analyst, big data methodologies may be most useful as a guide for further analytical or investigative work.
Methods to do this effectively rely on a thorough understanding of existing datasets
within an organisation and integrating these across teams. Providing a single portal to
search across all structured and unstructured documents will enable analysts to use
existing data more effectively to identify anomalies, inconsistencies or suspicious
activities.
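The flag-for-the-analyst workflow can be sketched with a simple statistical baseline; the monthly shipment counts and the two-standard-deviation rule below are illustrative assumptions, and real systems would model "normal" behaviour far more carefully:

```python
import statistics

# Toy series: monthly shipment counts on a monitored trade route (illustrative).
shipments = [42, 39, 44, 41, 40, 43, 38, 95, 42, 41]

mean = statistics.fmean(shipments)
stdev = statistics.stdev(shipments)

# Flag months deviating from the baseline by more than two standard deviations.
anomalies = [(month, count) for month, count in enumerate(shipments, start=1)
             if abs(count - mean) > 2 * stdev]

print(anomalies)   # the spike in month 8 is flagged for further investigation
```

The output is not a judgement of illicit activity; it simply directs the analyst's attention to the deviation that merits further work.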
Applications for Non-Proliferation
Using big data analytics to identify anomalies in activity or reporting may be a useful application for non-proliferation. In particular, an experienced analyst may benefit from a system that uses big data techniques to search across datasets and flag possible anomalies for further analysis. This may become especially important as analysts face ever greater volumes of information to process and analyse, most of which is likely to be “noise” rather than “signal”.
Practical applications of big data in identifying anomalies, red flags or proliferation risk
factors were discussed in the first breakout session which centred on analysing state and
non-state actor proliferation networks (see below). The use of big data for anomaly
detection is one avenue the workshop identified as a potentially useful area of further
research in the non-proliferation community.
Challenges
Discussions on the applications of big data in this regard also identified potential
limitations. Using big data analytics to identify anomalies in behaviour or activities may
only be useful in specific contexts. For example, discussions during the workshop
questioned how reliable or useful this approach would be in identifying anomalous
behaviour by individuals or small groups of people, for example small proliferation
networks.
It was suggested that it may be more productive to integrate anomaly-seeking and behavioural approaches in the analysis of state, rather than non-state, activity and behaviour. In both cases, however, a key challenge will be developing a process to accurately understand and articulate what is meant by “normal” behaviour or activity, so that analysts can identify deviations from it.
3: Legal and Ethical Considerations of Using Big Data
Introduction
The use of big data analytics raises a host of legal and ethical concerns that, in recent
years, have centred on the issue of privacy. All of the panels touched on the issue of
privacy with a number of interesting points raised:
• Privacy is a critical issue but one which is difficult to assess. In particular, framing big data analytics in terms of privacy automatically primes privacy concerns in analysts, which may affect the nature and output of research and analysis;
• The most interesting, valuable or useful datasets are often controlled and not released, in part because of concerns over privacy;
• There is a global debate, particularly evident in the European Union, regarding an individual’s right to control how their personally identifiable data is used;
• The process of collecting, transferring and analysing datasets from across the globe is complex, requiring compliance with different laws on data usage across multiple jurisdictions;
• Using insights gleaned from personal data captured by big data analytics could be problematic because of an individual’s right to remain anonymous under EU law.
Big Data and the Existing Non-Proliferation Legal Framework
The non-proliferation legal regime rests on three key areas:
1. Treaties and agreements;
2. Export control laws;
3. Interruption and interdiction.
These three areas are fundamental to the goal of preventing proliferation, but efforts in them tend to be disaggregated and further complicated by differing international and domestic legal approaches. The problems posed by a disaggregated approach and different legal regimes make integrating big data analytics and methodologies particularly challenging.
More practically, it may be very difficult to use big data techniques and methodologies to quantify or measure an individual’s intent to proliferate. Even if big data analytics can be used to identify proliferation activities, the path to prosecution is difficult. The reliance on international agreements and the difficulty of securing extradition are significant barriers to effective prosecution. This has led to alternative legal and political approaches, especially economic sanctions, which have had some success in preventing companies and individuals from accessing the US market.
Implications for Non-Proliferation
Effectively guiding organisations’ use of big data requires the implementation of “big ethics” or “big guidance” principles. These must cover the collection, retention and use of data, as well as the protection of that data from external threats.
The collection, analysis and retention of any data that can be considered “personal” raise a number of legal and ethical challenges for the non-proliferation community to address. The retention of any such data must be fair and lawful; this means it must be up-to-date, accurate and held for no longer than is necessary. This imposes additional costs on organisations, which must institute effective and comprehensive safeguards to mitigate the risk of a data breach. The need to comply with local, national and international laws on data protection through the “anonymisation” or “de-identification” of personal data may, in some cases, be prohibitively time-consuming and expensive. Such costs must be weighed against the anticipated benefits offered by big data methodologies for non-proliferation.
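One common de-identification technique is pseudonymisation: replacing a direct identifier with a keyed hash so records remain linkable without exposing the identity. A minimal sketch follows; the secret key, field names and token length are illustrative assumptions, not a compliance recipe:

```python
import hashlib
import hmac

# Secret key held separately from the dataset (illustrative value only).
SECRET_KEY = b"rotate-and-store-this-securely"

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256).

    The same input always maps to the same token, so records remain
    linkable across datasets without revealing the identity itself.
    """
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()[:16]

record = {"name": "Jane Example", "shipment_id": "XK-4411"}
record["name"] = pseudonymise(record["name"])  # identity removed, linkability kept
print(record)
```

Note that pseudonymised data is generally still treated as personal data in law, since the key-holder can re-identify it; the technique reduces, rather than eliminates, the compliance burden.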
New forms of data that may be of use in non-proliferation analysis also pose new legal
and ethical challenges. For instance, satellite imagery offers non-proliferation analysts the
opportunity to monitor sites and facilities of interest. However, as the capabilities of
commercial satellite companies develop, particularly in higher resolution imagery, legal
concerns regarding surveillance and privacy could arise.
Discussions during the workshop also focused on the dichotomy between technical and
non-technical challenges. Significant time and research effort has centred on
understanding and overcoming technological barriers to the application of big data
methodologies. However, less attention has been paid to the non-material, non-technical
challenges relating to the use of big data in support of non-proliferation, especially:
• Governance: current approaches tend to focus on existing developments rather than anticipating future innovations;
• Regulation: related to governance, regulation of 21st-century innovations and technological developments is typically carried out by 20th-century institutions, laws and norms;
• Fragmentation: big data analytics are seen as a solution to challenges across a multitude of disciplines; if so, big data analytics must be simultaneously highly flexible and highly context-specific.
Addressing these issues involves not only developing new technologies and approaches
but also instituting new legal and organisational structures.
4: Breakout Group Discussions
Breakout Session One: Big Data and Proliferation Networks
Outline
The first breakout session attempted to understand the applications of big data
methodologies and analytics to the challenge of identifying and analysing proliferation
networks. This was a broad starting point for a discussion on a range of topics, including:
• The differences between state and non-state proliferation networks and activities, and how these differences affect the implementation of big data analytics;
• Whether big data analytics may be able to identify anomalies or suspicious proliferation activity that uses the avenues of legitimate commerce;
• The important distinction between using big data to assess the capability to carry out proliferation activities and its use in assessing the intent to proliferate.
Discussion
A key focus of the three breakout groups was the potential applications of big data
analytics in pattern recognition and anomaly detection in large, disparate datasets. This
approach requires not only the implementation of effective analytical approaches but also
a comprehensive assessment of existing datasets and the integration of new sources of
data as they arise. Discussions emphasised the important differences between state and non-state proliferation networks, which affect the analyst’s ability to identify anomalous behaviour or even to characterise what is considered to be “normal” behaviour.
The discussion also identified practical steps to enable a more effective integration of big
data into non-proliferation activities. Discussions focused on integrating a range of big
data analytics where appropriate to analyse various stages of proliferation pathways. It is
important to ensure that the integration of big data into current analytical methods is
both relevant and useful. For example, free text extraction techniques may be useful in
achieving understanding and clarity from large sets of proliferation-relevant scientific and
technical data.
A wealth of information is already available on non-proliferation issues that could yield richer insights through the application of big data analytics. In particular, big data analytics could be applied to trade data, especially the trade mechanisms for the transfer of sensitive or dual-use technologies. However, there are also areas where the non-proliferation community has little insight. Examples include the dark web, which is a largely untapped resource, and visual data, which needs to be geo-tagged or labelled before analysis.
Breakout Session Two: Big Data and Non-Proliferation Sentiment
Outline
The second breakout session attempted to understand how big data may be used to
assess the attitudes of decision-makers and the wider public of a state towards
proliferation. This issue formed the basis of a wide-ranging discussion on a number of
topics, including:
• Developing “conversational maps” using big data analytics to assess changes in attitudes and sentiment over time, and possibly to identify divergences that may indicate covert proliferation-related activities;
• The ability of big data analytics to identify clusters of activity and sentiment towards specific topics within particular geographic locations, providing indicators of localised behaviour;
• Challenges to accurately assessing sentiment and attitudes, including language barriers (especially where machine translation is relied upon), the questionable veracity of some open source information, and the problem of disinformation.
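The sentiment-scoring idea discussed above can be illustrated with a minimal lexicon-based sketch; the word lists and example statements are invented for illustration, and production systems use far larger lexicons or trained models:

```python
# Tiny illustrative sentiment lexicon for statements about a treaty.
POSITIVE = {"welcome", "support", "progress", "constructive", "commit"}
NEGATIVE = {"reject", "threat", "violation", "withdraw", "unacceptable"}

def score(text: str) -> int:
    """Return positive-minus-negative word count as a crude sentiment score."""
    words = text.lower().replace(".", "").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

statements = [
    "We welcome the constructive progress on inspections.",
    "The ministry called the report an unacceptable violation.",
]
for s in statements:
    print(score(s), s)
```

Tracking such scores over time, rather than reading single values, is what would underpin a "conversational map"; the language-barrier and disinformation caveats above apply with full force to any approach this simple.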
Discussion
Discussions within groups covered a range of opportunities and challenges in applying big data methodologies to assess decision-maker and public attitudes towards non-proliferation. A key focus of these discussions was the importance of assessing disparate pieces of information in conjunction with other sources rather than in isolation. In addition, the group discussions recognised that big data analytics are an aid for the analyst but are not yet sophisticated enough to replace human analysis. It is important to train analysts to make sense of, and draw meaningful conclusions from, results derived from big data analytics in the context of non-proliferation.
The discussion also addressed another challenge: assessing the reliability of open source information. A particular point of discussion was the development of an analytical approach to recognise and filter active disinformation out of any analysis of sentiment and trends. This challenge is not unique to big data sentiment analysis, and a developed literature already exists on assessing the reliability, veracity and usefulness of open source information.
It was recognised that big data analytics may not be useful in analysing decision-maker sentiment if the volume of relevant data, such as speeches and official statements, is small or subject to government filtering. This raises the problem of inherent biases within the relevant dataset. Similar challenges may arise when attempting to assess public sentiment, especially in foreign languages, through the application of open source research methodologies and big data analytics.