Paper

5th International Conference on Future-Oriented Technology Analysis (FTA) - Engage today to shape tomorrow
Brussels, 27-28 November 2014
SCIENCE, TECHNOLOGY & INNOVATION TEXTUAL DATA-ORIENTED TOPIC ANALYSIS AND FORECASTING: METHODOLOGY AND A CASE STUDY
Yi Zhang 1,2,*, Guangquan Zhang 2, Alan L. Porter 3, Donghua Zhu 1, Jie Lu 2
1 School of Management and Economics, Beijing Institute of Technology, Beijing, P. R. China
2 Decision Systems & e-Service Intelligence Research Lab, Centre for Quantum Computation & Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology Sydney, Australia
3 Technology Policy and Assessment Centre, Georgia Institute of Technology, Atlanta, USA
* Corresponding Email Address: [email protected]
Abstract
Not only the external quantities but also the underlying topics of current Science, Technology & Innovation (ST&I) are changing all the time, and the cumulative innovation, or even disruptive revolution, they induce is likely to influence the whole of society heavily in the near future. To address and predict these changes, this paper proposes an analytic method (1) to cluster associated terms and phrases into meaningful technological topics and (2) to identify changing topical emphases, the results of which we carry forward via Technology Roadmapping approaches to forecast prospective developments. Furthermore, an empirical case study of award data from the United States National Science Foundation's Division of Computer and Communication Foundations is performed to demonstrate the proposed method; the resulting knowledge could hold interest for R&D management and science policy in practice.
Keywords: Text Mining; Topic Analysis; Text Clustering; Technological Forecasting; Big Data
Introduction
The coming of the Big Data age introduces great opportunities and challenges for modern society, making it possible to extract more latent information from massive and varied data sources. Meanwhile, the dynamic development of Science, Technology & Innovation (ST&I) is considered one of the most important features of today's open innovation systems. In this context, both national R&D management and industry have begun to track these trends in order to compete globally. However, not only the external quantities but also the underlying topics are changing all the time, and the cumulative innovation, or even disruptive revolution, they induce is likely to influence the whole of society heavily in the near future.
As a valuable instrument for addressing these concerns, text mining affords effective approaches to understanding vast textual databases and engages semantic tools to deal with real-world problems. ST&I resources, comprising academic publications, patents, ST&I program proposals, etc., make it possible to describe previous scientific dynamics and efforts, discover innovation capabilities, and forecast probable evolutionary trends in the near future (Porter and Detampel 1995; Zhang et al. 2013). Currently, ST&I text analysis focuses on emerging topics by combining quantitative and qualitative methodologies and emphasizing automatic knowledge-based systems and bibliometric approaches.
THEME 3: CUTTING EDGE FTA APPROACHES
On the one hand, clustering analysis is widely used for topic generation. As described by Jain (2010), the purpose of clustering analysis is to explore the potential grouping of a set of patterns, points, or objects. Analogously, text clustering concentrates on textual data with its statistical properties and the semantic connections of phrases and terms. Normally, clustering algorithms seek to calculate the similarity between documents and to reduce rank by grouping a large number of items into a meaningful small number of factors (Chen et al. 2013; Zhang et al. 2014). On the other hand, understanding topics usually connects with time series and forecasting requirements and is described as an Emerging Trend Detection (ETD) problem (Kontostathis et al. 2004), evolving with (1) Technology Opportunity Analysis (Porter and Detampel 1995; Porter and Cunningham 2004) and continuing studies on automated ST&I intelligence extraction, visualisation, and future-oriented analysis (Zhu and Porter 2002; Zhang et al. 2013; Huang et al. 2014), and (2) Topic Detection and Tracking (Allan et al. 1998) and its related algorithmic research on Topic Identification (Small et al. 2014), Topic Detection (Cataldi et al. 2010; Dai et al. 2010), and Concept Drift (Lu et al. 2014).
However, previous studies lack a macro scope connecting algorithm research with real-world requirements: they either concentrate only on the design and refinement of textual clustering and classification approaches, or focus only on the problem itself and ignore quantitative means of improvement. Addressing these concerns, this paper develops a data-driven, yet adaptive, methodology for topic analysis and forecasting. As the first step, we introduce a K-Means-based clustering approach for semi-supervised learning on semi-labelled ST&I records, which includes several selection models between (1) phrases and words, (2) normal Term Frequency (TF) and Term Frequency Inverse Document Frequency (TFIDF), and (3) feature combinations. Furthermore, we apply Technology Roadmapping approaches for foresight studies, which combine quantitative evidence with expert knowledge, introduce a visual model to present innovation trends in a specified period, and address concerns for forecasting discussions. Moreover, a case study focusing on United States (US) National Science Foundation (NSF) Awards is presented in this paper.
This paper is organized as follows: in the "Methodological Approach" section, we present the detailed research method for ST&I textual data-oriented topic analysis and forecasting. The section "Results, Discussion and Implications" follows, taking the US NSF Awards from 2009 to 2013 in the Division of Computer and Communication Foundations as a case study. This part identifies topics via clustering approaches, illustrates development trends visually, and engages expert knowledge for foresight understanding. Finally, we conclude our current research and put forward possible directions for future work.
Methodological Approach
This study proposes and develops a data pre-processing approach, a K-Means-based clustering analysis approach, and a trend analysis approach, and uses NSF Award data as a case study. As mentioned, our methodology seeks to define an ST&I textual data-driven, yet adaptive, method for topic analysis and forecasting. The general research framework is given in Fig. 1.
[Figure 1 depicts the research framework as a pipeline: Raw Data Retrieval (e.g., Title, Abstract, special fields) → Feature Extraction (Term Clumping processing yields core terms, i.e., words and phrases, and a Term-Record Matrix) → Topic Analysis (a data-driven K-Means-based clustering approach, supported by a Cluster Validation Model with a labelled training set, a Feature Selection and Weighting Model covering words or phrases, TF or TFIDF, and feature combinations, and a K Local Optimum Model, producing topics) → Forecasting (Topic Understanding and Foresight via Technology Roadmapping, engaging expert knowledge and external factors such as policy and technique development).]
Figure 1. Research Framework for ST&I Textual Data Oriented Topic Analysis and Forecasting
Step 1 Raw Data Retrieval
Normally, ST&I textual data has common fields (e.g., Title, Abstract) and special ones (e.g., International Patent Classification in patent data, Program Element/Program Reference in NSF Award data), and in this step our purpose is to remove meaningless data and to retrieve the needed fields from the raw records.
Step 2 Feature Extraction
In our previous study, we developed a Term Clumping process for technical intelligence that aims to retrieve core terms (words and phrases) from ST&I resources by performing term cleaning, consolidation, and clustering steps (Zhang et al. 2014). In this paper, we introduce these effective steps for feature extraction (core term retrieval); however, the purpose of Term Clumping in our approach is to remove common terms rather than to retrieve the exact core terms, which is the usual goal of Term Clumping. Therefore, the Term Clumping process should be modified to fit the specific domain and goals. Moreover, in this step we generate a Term-Record Matrix for the further similarity calculations.
Step 3 Topic Analysis
In this step, we set up a training set of labelled data for machine learning and propose a data-driven K-Means-based clustering approach. Several supporting models are added as described below:
1) Cluster Validation Model
We compose the cluster validation model with Recall, Precision, and F Measure, referring to the
common performance measures in information retrieval, which are defined as follows:
Recall = (Number of Relevant Records Clustered to the Category) / (Total Number of Relevant Records of the Category)

Precision = (Number of Relevant Records Clustered to the Category) / (Total Number of Records Clustered to the Category)

F Measure = (2 × Recall × Precision) / (Recall + Precision)
Generally, Recall denotes the fraction of the relevant records that are successfully retrieved, Precision indicates the fraction of retrieved records that are relevant, and F Measure combines both as their harmonic mean. Since the Recall value for the whole dataset is meaningless (all records have been clustered), we only calculate the Total Precision to evaluate the proportion of correctly grouped records. In addition, we also calculate the Recall, Precision, and F Measure for each category, and use the Average F Measure as another main target value.
Average F Measure = (∑ F Measure of each Category) / (Total Number of Categories)

Total Precision = (Number of Records Clustered to the Correct Category) / (Total Number of Records in Training Set)
Moreover, it is necessary to mention that, in the Cluster Validation Model, we label all records grouped with a centroid using the real category of that centroid. Therefore, the selection of centroids is one of the key issues influencing the cluster validation process.
2) K Local Optimum Model
The traditional K-Means algorithm requires the K value to be set manually, and this value heavily affects the clustering results (Jain 2010). In our approach, aiming to reduce this influence and to find the best K value in a specified interval, we run the cluster validation model in a loop over the interval and choose the best K value based on its F Measure.
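A minimal sketch of this loop, assuming a `cluster(k)` function that returns cluster labels and a `validate(labels)` function that returns the corresponding F Measure (both hypothetical stand-ins for the models described in this section):

```python
def best_k(cluster, validate, k_min, k_max):
    """K Local Optimum Model: try every K in [k_min, k_max] and keep
    the K whose clustering scores the highest F Measure."""
    scores = {k: validate(cluster(k)) for k in range(k_min, k_max + 1)}
    k_best = max(scores, key=scores.get)
    return k_best, scores[k_best]
```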
The main concept of K-Means is described as follows:
A. Initialization: Select the top K records with the highest Euclid Length as the centroids of the K clusters;
Let tf_in be the frequency of term t_i in record D_n
Record-Term Vector: V(D_n) = {tf_1n, tf_2n, …, tf_(i-1)n, tf_in}
Euclid Length of record D_n: ELEN(D_n) = 1 / √(∑ tf_in²)
B. Record Assignment: Assign each record to the centroid with the highest similarity value;
Let V(D_n) and V(D_m) be the Record-Term Vectors of record D_n and centroid D_m
Similarity Value: S(D_n, D_m) = Cos(V(D_n), V(D_m))
C. Centroid Refinement: Calculate the similarity between each record and its cluster, and set the record with the highest similarity value as the centroid of that cluster;
Let Cluster C = {D_1, D_2, …, D_(l-1), D_l}
Similarity Value: S(D_n, C) = (∑_{k=1}^{l} S(D_n, D_k)) / l
D. New & Old Centroid Comparison: If all new centroids are the same as the old ones, the loop ends; otherwise, return to Step B.
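Steps A–D can be sketched as follows (a minimal illustration, not the authors' code); the use of NumPy and the medoid-style bookkeeping are our assumptions, and the initialization reproduces the inverse-norm "Euclid Length" given in Step A:

```python
import numpy as np

def cosine_matrix(X):
    # Pairwise cosine similarity between row vectors.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    U = X / norms
    return U @ U.T

def kmeans_medoid(X, k, max_iter=100):
    """K-Means variant in which centroids are always actual records."""
    S = cosine_matrix(X)
    # A. Initialization: the K records with the highest Euclid Length,
    # computed (as in the formula above) as the inverse Euclidean norm.
    elen = 1.0 / np.linalg.norm(X, axis=1)
    centroids = np.argsort(-elen)[:k]
    for _ in range(max_iter):
        # B. Record assignment: each record joins its most similar centroid.
        labels = np.argmax(S[:, centroids], axis=1)
        # C. Centroid refinement: the record most similar, on average,
        # to its own cluster becomes the new centroid.
        new_centroids = centroids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size:
                mean_sim = S[np.ix_(members, members)].mean(axis=1)
                new_centroids[c] = members[np.argmax(mean_sim)]
        # D. Stop when all centroids are unchanged; otherwise repeat B.
        if np.array_equal(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```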
3) Feature Selection and Weighting Model
In this step, considering the specific fields of the NSF Award data in our case study, we use the NSF Awards as the sample to present our method. Title and Abstract (called Narration in NSF Awards) are the fields most commonly used in text analysis, and for NSF Award data we also introduce the Program Element (PE) Code and the Program Reference (PR) Code into our study. In NSF Awards, one record is assigned at most 2 PE codes and at least 1 PR code, both of which are also composed of semantic terms. However, whereas these codes sometimes help our approach to explore relations between similar records, they sometimes mislead the machine in other directions. For example, in the case of the PR code, one or two codes may describe the setting of an empirical study, which can be entirely different from the main concept. Under these conditions, we use an automatic procedure, aided by the cluster validation model, to assemble the best combination of Title terms, Narration terms, PE codes, and PR codes. Six assembled sets are compared in this model:

- #1 Narration + Title Terms
- #2 Narration + Title Terms + PE Code
- #3 Narration + Title Terms + PE/PR Code
- #4 Narration + Weighted Title Terms
- #5 Narration + Weighted (Title Terms + PE Code)
- #6 Narration + Weighted (Title Terms + PE/PR Code)
We treat these 4 kinds of terms separately and introduce a weighting model into #4, #5, and #6 in order to calculate similarities. In the first 3 assembled sets, we calculate the similarity for Narration terms, Title terms, PE codes, and PR codes respectively, and use the mean as the final similarity value of the assembled set. In the last 3 assembled sets, with the help of the weighting model, the inverse ratio of the term counts is engaged. Taking #4 as a sample, the weights are defined below:
V(D_n) = VN(D_n) + VT(D_n)
where VN(D_n) is the Term-Record Vector with only Narration terms, and VT(D_n) the one with only Title terms.
Let T_N = |VN(D_n) ∩ VN(D_m)| and T_T = |VT(D_n) ∩ VT(D_m)|
ω_N = T_T / (T_N + T_T),  ω_T = T_N / (T_N + T_T)
Weighted Similarity Value: S_w(D_n, D_m) = ω_N × S(VN(D_n), VN(D_m)) + ω_T × S(VT(D_n), VT(D_m))
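A sketch of assembled set #4 under these definitions, using hypothetical dictionaries mapping terms to frequencies; the inverse-ratio weights follow the formulas above:

```python
import math

def cosine(u, v):
    # u, v: dicts mapping term -> frequency.
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def weighted_similarity(narr_n, narr_m, title_n, title_m):
    """Assembled set #4: Narration terms + weighted Title terms."""
    t_n = len(set(narr_n) & set(narr_m))    # shared Narration terms
    t_t = len(set(title_n) & set(title_m))  # shared Title terms
    total = t_n + t_t
    if total == 0:
        return 0.0
    # Inverse ratio: the field with more shared terms gets the smaller weight.
    w_n, w_t = t_t / total, t_n / total
    return w_n * cosine(narr_n, narr_m) + w_t * cosine(title_n, title_m)
```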
At the same time, in this model we also compare two other questions that always attract researchers' interest in text analysis: the clustering accuracy of words versus phrases, and of normal term frequency (TF) versus the TFIDF value.
According to common sense, phrases are more specific and should create more accurate clusters, since they have much stronger relations with each other than individual words have. However, phrases appear much less frequently, leading to less overlap between records, and thus might be detrimental to a similarity measure. It is interesting that Topic Models, which engage hierarchical Bayesian analysis to explore latent semantic groups in a collection of documents (Blei and Lafferty 2006; Blei 2012), handle text clustering tasks with a "Word – Topic – Record" model and pay attention only to words. Yau et al. (2014) applied Topic Models to both words and phrases derived from ST&I publications, and the comparison indicated better results for words. Moreover, Gretarsson et al. (2012) proposed a Topic Model approach, "TopicNets," for visual analysis of large amounts of textual data; they selected NSF grant proposals awarded to University of California campuses as one data sample. Thus, this comparison also represents one of the interests explored in this paper.
We apply normal term frequency and the TFIDF value within the K-Means approach and compare the accuracy on the NSF Award data. Among the various kinds of TFIDF formulations, we only address the classical formula (Boyack et al. 2011), described below:
TFIDF = TF × IDF = (Frequency of Term t_i / Total Instances of Terms in Record D_j) × log(Total Record Number in the Set / Total Number of Records with Term t_i)
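The classical formula can be sketched as below (the base-10 logarithm is our assumption, since the source does not fix the base):

```python
import math

def tfidf(records):
    """records: list of dicts mapping term -> raw frequency.
    Returns a list of dicts with the classical TF × IDF weight."""
    n = len(records)
    # Document frequency: number of records containing each term.
    df = {}
    for rec in records:
        for term in rec:
            df[term] = df.get(term, 0) + 1
    out = []
    for rec in records:
        total = sum(rec.values())  # total term instances in this record
        out.append({t: (f / total) * math.log10(n / df[t])
                    for t, f in rec.items()})
    return out
```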
Our data-driven clustering approach is composed of the models above, and the clusters, identified as topics, are generated at the end of this step.
Step 4 Forecasting
In past years, we have contributed to a semi-automatic Technology Roadmapping composing model (Zhang et al. 2013), but all of those efforts were based on terms. In this paper, we use topics in place of terms and locate them in the time series as topic trends. We also engage expert knowledge and understanding of external factors, e.g., policy and the status of technique development. In particular, in this part the quantitative results are only considered objective evidence for expert engagement, and qualitative judgments take the more important role in the forecasting studies. The general steps of this section are outlined below:
1) Sort the generated topics by year and remove distinct duplicate topics manually;
2) Send the topic list to domain experts and ask for assessments: 1 for "interesting topic at that time," 0 for "not interesting at that time," and 0.5 for "not sure";
3) Calculate the mark of each topic and obtain the ranking list;
4) With the help of experts, remove low-ranked and meaningless topics, consolidate similar topics, and classify topics into their appropriate "technology development levels";
5) Locate the topics on visual maps and address the understanding gained regarding relations, development trends, and foresights.
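As an illustration of steps 2)–4), the scoring might be implemented as follows; the group structure and the inverse-ratio group weighting are assumptions modelled on the case study, not a prescription from the source:

```python
def topic_ranks(assessments, group_sizes):
    """assessments: {topic: {group: [expert scores in {0, 0.5, 1}]}}
    group_sizes: {group: number of experts in that group}.
    Expert groups are weighted by the inverse ratio of their sizes,
    so the smaller group counts more (an assumption modelled on the
    paper's researcher vs. PhD-candidate groups)."""
    total = sum(group_sizes.values())
    weights = {g: (total - n) / total for g, n in group_sizes.items()}
    wsum = sum(weights.values())
    ranks = {}
    for topic, by_group in assessments.items():
        ranks[topic] = sum(
            weights[g] * (sum(scores) / len(scores))
            for g, scores in by_group.items()
        ) / wsum
    # Topics scoring at or below 0.5 lean toward "not interesting".
    return sorted(ranks.items(), key=lambda kv: -kv[1])
```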
Results, Discussion and Implications
Data
In the book Lee Kuan Yew: The Grand Master's Insights on China, the United States, and the World, the founding father of modern Singapore remarked that "America's creativity, resilience and innovative spirit will allow it to confront its core problems, overcome them, and regain competitiveness" (Allison et al. 2013). Researchers and institutions are trying to evaluate the status of global innovation competition, and so far there is no clear conclusion. However, it is commonly accepted that the United States (US) currently is, and for a while will remain, the world's leading country in producing innovation.
As the most important US government agency funding research and education in most fields of science and engineering, the US National Science Foundation accounts for about one-fourth of federal support to academic institutions for basic research. It receives approximately 40,000 proposals each year for research, education, and training projects, of which approximately 11,000 are granted as awards (NSF Website, see References). In other words, an understanding of the NSF Award data, which represents highly innovative basic research that can run several years ahead of other regions, could be considered an express way to reveal how the innovation evolution pathways of the US work. Such a research approach reaches the core of the world's innovation and research forefront, and the resulting knowledge could strongly support R&D management plans and science policy both in the US and in other countries.
The NSF Award database is openly accessible, and all data can be obtained from the NSF's website (NSF Website, see References). These awards are classified by specific Award Type and division. We concentrate on Standard Grants, as these represent the most meaningful and the largest part of the available data. Moreover, most NSF Award records are labelled by Program Type, while a smaller part is unlabelled. However, the Program Type sometimes denotes a really extensive classification (e.g., Collaborative Research, Early Concept Grants for Exploratory Research) and sometimes a very specific one (e.g., Cyber Physical Systems, Information Integration and Informatics). Statistically, less than half of the NSF Award data is labelled in detail or with any kind of "usable" classification, while the rest carries a common or meaningless label or even no label at all. Thus, we treat the NSF Award data as semi-labelled.
As mentioned above, the NSF funds more than 10 thousand proposals per year, and the open-access data online goes back to 1959. Considering our background, social networks, and, especially, the purpose of this paper, we only choose awards related to Computer Science, from which 12,915 records under the Division of Computer and Communication Foundations with an Organization Code between 5010000 and 5090000 are selected. Since one of the main motivations of topic analysis is to address the innovation possibilities in NSF Award data, we remove awards granting travel support, summer school support, and other proposals asking for education support, finally obtaining 9,274 records. We then apply the Term Clumping steps (Zhang et al. 2014) for core term retrieval; the result of each step is given in Table 1. However, we do not apply the clustering steps of Term Clumping, including Term Cluster Analysis and Combine Terms Network, which would reduce the number of similar terms and increase the difficulty of seeking similar pairs.
Table 1. Steps of Term Clumping Processing

Step | Description | # N.* Terms | # T.* Terms
1 | 9274 Records, with 9274 Titles and 8975 Narrations | - | -
2 | Natural Language Processing via VantagePoint (see References) | 254992 | 17859
3 | Basic Cleaning with thesaurus | 214172 | 16208
4 | Fuzzy Matching | 184767 | 15309
5 | Pruning (remove terms appearing in only one record) | 42819 | 2470
6 | Extra Fuzzy Matching | 40179 | 2395
7 | Computer Science based Common Term Cleaning | 38487 | 2311
*N. = Narration, *T. = Title
Before further processing, we first build the training set. As the NSF Award data is semi-labelled, we screen all 1,124 records from 2009, choose 10 labels associated with 588 records, and set up the training set, which includes 135 Title terms and 2,746 Narration terms. We also import the 56 PE codes and 64 PR codes of these 588 records. A Term-Record Matrix is generated for the text clustering calculations.
Topic Analysis
Based on the training set, in the K Local Optimum Model, and considering the trade-off in the number of clusters treated at a time (fewer topics make the results easier to understand, but more topics lead to greater accuracy), we set the interval of the K value to [10, 20]. We list the maximum and mean of the Average F Measure and Total Precision of the 6 assembled sets with Word & TF, Phrase & TF, and Phrase & TFIDF in Table 2, and also compare the accuracy of the TF and TFIDF values over the 6 assembled sets in Figure 2 (1 stands for normal term frequency and 2 for the TFIDF value; e.g., #1-1 means Feature Combination #1 with normal term frequency).
Table 2. Max and Avg. Value of F Measure and Total Precision of 6 Assembled Sets with Word & TF, Phrase & TF, and Phrase & TFIDF

Group | Measure | | #1 | #2 | #3 | #4 | #5 | #6
WORD & TF | Average F Measure | Max | 0.908948 | 0.91547 | 0.91547 | 0.905554 | 0.935272 | 0.935961
 | | Avg. | 0.888425 | 0.888192 | 0.888192 | 0.857989 | 0.896751 | 0.922976
 | Total Precision | Max | 0.865417 | 0.851789 | 0.851789 | 0.863714 | 0.870528 | 0.870528
 | | Avg. | 0.792318 | 0.798204 | 0.798204 | 0.705901 | 0.774353 | 0.838625
PHRASE & TF | Average F Measure | Max | 0.960813 | 0.93294 | 0.928113 | 0.960813 | 0.971387 | 0.971387
 | | Avg. | 0.924594 | 0.902324 | 0.868127 | 0.921359 | 0.952589 | 0.939672
 | Total Precision | Max | 0.957411 | 0.890971 | 0.890971 | 0.957411 | 0.969336 | 0.969336
 | | Avg. | 0.845594 | 0.844045 | 0.76661 | 0.828868 | 0.91064 | 0.899799
PHRASE & TFIDF | Average F Measure | Max | 0.948351 | 0.886208 | 0.909903 | 0.952885 | 0.979774 | 0.979774
 | | Avg. | 0.914352 | 0.855526 | 0.861222 | 0.914686 | 0.955184 | 0.955934
 | Total Precision | Max | 0.943782 | 0.887564 | 0.906303 | 0.948893 | 0.977853 | 0.977853
 | | Avg. | 0.828713 | 0.840948 | 0.840019 | 0.783646 | 0.9221 | 0.931237
[Figure 2 shows two line charts plotting, for K = 10 to 20, the Average F Measure (left, y-axis 0.75–1) and the Total Precision (right, y-axis 0.6–1) of the 6 assembled sets, each with TF (#n-1) and TFIDF (#n-2).]
Figure 2. Average F Measure (L) and Total Precision (R) of 6 Assembles with TF and TFIDF
Generally, the results calculated with the TFIDF value are less accurate than those with TF in #1, #2, #3, and #4, but the inverse holds in #5 and #6, where TFIDF's highest Average F Measure and Total Precision reach the peak of the whole result set. One possible explanation is that TFIDF analysis introduces document frequency into the feature space rather than only term frequency, which helps to reduce the weight of common terms that appear in many records and to increase the weight of special ones. That is to say, TFIDF strengthens relations between genuinely similar records and increases the accuracy of the clustering analysis.
Comparing the efficiency of the 6 assembled sets of feature combinations, as shown in Figure 2, #5 and #6 are the best assembled sets, #1 and #4 are at a passable level, while #2 and #3 obtain the worst results. We try to explore the reasons behind these differences and outline some of our deductions below:
1) PE and PR codes can be treated like the keywords of publications, which are special and meaningful, but many fewer terms originate from them than from the Narration and Title. Thus, there is no obvious difference between #2 and #3, or between #5 and #6;
2) Undoubtedly, the total number of Narration terms is much greater than that of Title terms, but, as we count the shared terms between two records according to the weighting model, T_N shows no significant difference from T_T, which might explain why #1 and #4 have similar results;
3) We have mentioned that the PE code acts as the main keyword of a proposal while the PR code contains much noisy information, which can obfuscate the relations between proposals. For example, it is common to add one or two terms describing the empirical setting, e.g., "Earthquake Engineering," "Gene And Drug Delivery," etc., or to use some "general" terms to emphasize the research purpose, e.g., "Science, Math, Eng & Tech Education," "Science Of Science Policy," etc. Together with the reason described in point 1), this should explain why #2 and #5 are slightly ahead of #3 and #6, respectively;
4) Aiming to explore why #5 (weighted PE code) is better than #4 (without PE code), and #4 is better than #2 (non-weighted PE code), we ran "Feature Combination #5.1," which uses the direct ratio to weight the PE code. The result of #5.1 is a highest F Measure of 0.91436 and a highest Total Precision of 0.873935, both worse than #2. Therefore, a reasonable explanation is that, compared with the PE/PR codes, Narration terms are more misleading in the clustering analysis, and the direct ratio enlarges this negative impact while the inverse ratio weakens it;
5) In particular, although the engagement of TFIDF analysis helps to reach better target values, the comparison of the 6 assembled sets suggests that TFIDF also weakens the distinction between PE/PR codes and Narration terms, since "#2 and #3" and "#5 and #6" come really close to each other when TFIDF is taken into consideration. At the same time, running test samples on the semi-labelled data, we also notice that results derived from the TFIDF value can lead to a "large" cluster, in which 90% of the records are grouped together. Our concern is that, although TFIDF introduces document frequency to balance the influence of term frequency and to highlight the speciality of terms, it also weakens the distance between special terms and common terms. Thus, records with more non-zero values in the Term-Record Matrix will have much more similarity with other records, the function will select these records as centroids more easily, and "large" clusters will be generated. Comparably, normal term frequency seems to avoid the "large" cluster trouble.
In Table 2, we also use normal TF with words to run the clustering approach again. Obviously, the results of Word & TF are not as good as those derived from Phrase with either TF or TFIDF. Therefore, phrases definitely work better than words for record clustering on the NSF Award data.
Based on the results and analysis of the above experiments, we choose "Phrases," "normal TF," "Feature Combination #5 (Narration terms plus weighted Title terms and PE code)," and "K = 18" as the most suitable K-Means clustering configuration for NSF proposal data.
Forecasting
We apply our method to the NSF Award datasets from 2009 to 2013, which comprise approximately 1,000 records each year and 4,847 in total. After the K-Means clustering approach, we obtain 90 topics and retain 83 for further processing (removing 7 duplicate topics here). As mentioned in the Methodology section, we seek to combine quantitative and qualitative methods for the forecasting studies and to treat our auto-generated results as objective evidence for decision-making. Thus, in this case, we engage experts in computer-related subjects for topic confirmation and modification. We invite 9 experts (4 researchers, e.g., Senior Lecturers, Lecturers, or Senior Researchers, who have focused on computer-related studies for more than 10 years, and 5 PhD candidates) from the School of Software, University of Technology Sydney, Australia. Drawing on their research experience and deep academic understanding of the development of computer science, they help us to confirm whether the topics generated by the text clustering analysis are interesting or not, and also to consolidate similar topics.
In particular, considering the scope of knowledge, we use the inverse ratio to weight the 4-researcher group and the 5-PhD-candidate group, and then remove all topics below a rank value of 0.5, since 0.5 is set as "not sure" and values below it incline toward "not interesting." We thereby obtain 54 topics in total. At the same time, referring to the hybrid composing model of Technology Roadmapping (Zhang et al. 2013) and considering the special situation of computer science, 2 experts from the 5-PhD-candidate group also help us to classify these 54 topics into three levels of technology development phases: Basic Research, Assistant Instrument, and System and Product. We list part of the final topics (all 11 topics of 2009 and 8 topics of 2013) in Table 3. It is necessary to mention that the assessment of the technology development phase is subjective; aiming to reduce this kind of bias, we allow a 0.5 intermediate value between adjacent levels, e.g., 2.5 means the topic is based on "Assistant Instrument" but is also close to "System and Product." This adjustment is reflected in the location of the topic on the Y axis. In addition, only in 2013, considering the significance of the topic "Big Data," we consolidate 2 related topics and package them as one big "Big Data" topic.
The visual Technology Roadmapping for trend analysis is shown in Figure 3.
Table 3. Interesting Topics of 2009 and 2013 Selected by Experts

Year | Topic | Topic Description | Rank | Level
2009 | Adaptive Grasping | Adaptive Grasping, Automatic Speech Recognition, Empirical Mechanism Design, Hierarchical Visual Categorization | 0.9111 | 2.0
2009 | Behavior Modeling | Behavior Modeling, Checkpoint, Citizen Science, Dynamic Environments | 0.8667 | 1.5
2009 | Online Social Networks | HCC, Large Scale, Online Social Networks, Applications, Measurement | 0.8417 | 2.0
2009 | High Performance Computer | High Performance Computer, MRI R2, MRI R2 Consortium, Applications, Consortium | 0.8167 | 3.0
2009 | Human Centered Computing | BPC DP, Brain Computer Interface, HCC Small, Mainstream | 0.7278 | 2.0
2009 | Bayesian Model Computation | Bayesian Model Computation, CPS, Graphical Models, High Dimensional Data Sets, III/EAGER | 0.7222 | 1.0
2009 | Cyber Physical System | CPS, Cyber Physical System, MRI R2, Service Attacks, Wireless Network, Large Scale Data Centers | 0.7083 | 2.5
2009 | Medical Device Coordination | Medical Device Coordination, CPS, CSR, Data Centers | 0.6833 | 2.5
2009 | Hidden Web Databases | Hidden Web Databases, Sensitive Aggregates, Urgent Challenge, Suppressing | 0.6389 | 1.0
2009 | ad Hoc Wireless Networks | ad Hoc Wireless Networks, Cooperative Beam Forming, Cross Layer Optimization, Data Centers, Mobile Devices | 0.6139 | 2.0
2009 | Biological Networks | Biological Networks, CIF, Communication Networks, CPS, Cryptography | 0.5889 | 2.0
2013 | Big Data | AF, Big Data, CIF, Mathematical Problems, Joint Source Channel Codes, Large Scale Neural Networks | 1.0000 | 2.5
2013 | Robotic Intelligence | RI, RUI, High End Computer Users, SI2 SSE, Software Needs | 1.0000 | 3.0
2013 | Supporting Knowledge Discovery | GV, Scientific Visualization Language, Supporting Knowledge Discovery, correlated dynamics, cortical visual processing | 0.9111 | 2.0
2013 | NSF Smart Health | Health Influences, Learning Fine, Social Media, Twitter Health, NSF Smart Health | 0.8611 | 3.0
2013 | Large Scale Hydrodynamic Brownian Simulations | CDS&E, Large Scale Hydrodynamic Brownian Simulations, Parallel Structured Adaptive Mesh Refinement Calculations | 0.7278 | 2.5
2013 | Automatic Graphical Analysis | Automatic Graphical Analysis, Intuition Wall, Matrix Free Algorithms, Program New Computer Architectures, SHF, Electrical thermal Co Design | 0.6389 | 2.0
2013 | Virtual Organization | Long Tail Sciences, Virtual Organization, Human Centered Computation, Scientific Software Development, Scientists | 0.5250 | 2.0
2013 | Asynchronous Learning Experiences | Asynchronous Learning Experiences, EXP, RUI, Earthquake, Learning via Architectural Design | 0.5000 | 2.5
Discussion and Implications
Considering the hottest topics at present, it is interesting and promising to explore the "Big
Data" issues in this case in more detail. On March 29, 2012, the Obama Administration
announced the "Big Data Research and Development Initiative" (White House 2012, see
References) to improve the ability to extract knowledge and insights from large and complex
collections of digital data, and to help accelerate the pace of discovery in science and
engineering, strengthen national security, and transform teaching and learning. In this context,
six US federal departments and agencies committed more than $200 million to launch the
initiative; the NSF is one of them. As shown in Figure 3, "Big Data" appears to have been a hot
topic in 2013, having evolved with various kinds of "new" techniques and concepts, yet it is
easy to link these new ones to their original ideas (as marked in the solid box in Figure 3).
Evidently, "Large Scale" related concepts, algorithms, and systems have been generated since
2009, e.g., "L-Scale Data Centre (2009)," "Solving Large System and Parallel Strategy (2011),"
and "Large Asynchronous Multi Channel Audio Corpora (2012)." In other words, "Big Data" is
not a wholesale invention, but an evolution of previous techniques and a solution to real-world
problems; almost all of its components can be traced back. Looking forward, however, is more
valuable. Concerning the general technology development pathway and the leading position of
Big Data, we offer several comments here for the purpose of forecasting studies.
1) Outwardly, "Trustworthy Cyber Space (2010 and 2012)" and "Privacy Preserving Architecture
and Digital Privacy (2011)" seem to have no direct relation to Big Data, especially to its
techniques. However, in May 2014, the White House released another report, "Big Data:
Seizing Opportunities, Preserving Values" (White House 2014, see References), which
addressed the relations between government, citizens, businesses, and consumers and
focused on how the public and private sectors can maximize the benefits of big data while
minimizing its risks. Cyber security is clearly considered such a risk in the Big Data Age. It is
therefore reasonable to expect that, in the near future, "Privacy in Big Data" will be a major
concern for both government and citizens, not only in the policy and legal domains but also in
privacy-protecting techniques.
2) Another set of topics that catches our attention includes "Open Data Repository Intermediary
(2011)," "Supporting Knowledge Discovery (2013)," and "Virtual Organization (2013)," as well
as topics related to real applications, e.g., "Medical Device Coordination (2009)," "Cyber
Infrastructure (2010)," "Wireless Camera Networks (2012)," and "RFID System (2012)." As the
most powerful competitor of the US, China has designated the "Internet of Things" as one of its
top five emerging industries, first announced in the speech "Let Science and Technology Lead
China's Sustainable Development" by then-Premier Wen (2009). Similarly, in the 2014 White
House report mentioned above, the "Internet of Things" is highlighted as the ability of devices to
communicate with each other using embedded sensors that are linked through wired and
wireless networks; this, too, is linked with Big Data. Thus, the "Internet of Things," including its
related techniques and cyber security issues, must be another hot research topic in the coming
decades.
3) People have long imagined intelligent robots. Although these topics are not new and have
appeared several times in Figure 3, e.g., "Next Generation Robotics (2011)," "Robotic
Intelligence (2012)," and "Robotics Engineering (2013)," we still hold a positive outlook for
robotics techniques, which should be able to gain more intelligence from Big Data and upgrade
into a smarter form.
4) As part of the Obama Administration's "Big Data" program, the NSF started its "NSF Smart
Health and Wellbeing" program in 2012, which "seeks improvements in safe, effective, efficient,
equitable, and patient-centered health and wellness services through innovations in computer
and information science and engineering" (NSF Website, see References). As a precursor of
this kind of research, "Medical Device Coordination (2009)" can be considered a foundational
construction, and "NSF Smart Health" rose rapidly to become a hot topic in 2013. With the push
of the NSF program and the pull of enormous wellbeing demands in modern society, the
application of computer techniques in health and wellness services will remain an emerging
industry for a long time to come.
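The informal tracing of "Big Data (2013)" back to earlier "Large Scale" topics, as discussed above, could be made systematic by measuring term overlap between topics from different years. The following is a minimal sketch: the term sets are abbreviated from Table 3, and the word-level tokenisation and Jaccard measure are our illustrative choices, not the paper's method (the links marked in Figure 3 were drawn with expert judgment).

```python
# Sketch: detecting cross-year topic links by shared terms, in the
# spirit of linking "Big Data (2013)" back to "Large Scale" concepts
# of 2009. Term sets are abbreviated from Table 3; the tokenisation
# and similarity measure are illustrative assumptions.

def tokens(terms):
    """Split a topic's term list into a set of lower-case words."""
    return {word.lower() for term in terms for word in term.split()}

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

cps_2009 = tokens({"CPS", "Cyber Physical System", "MRI R2",
                   "Service Attacks", "Wireless Network",
                   "Large Scale Data Centers"})
big_data_2013 = tokens({"AF", "Big Data", "CIF", "Mathematical Problems",
                        "Joint Source Channel Codes",
                        "Large Scale Neural Networks"})

shared = cps_2009 & big_data_2013            # words common to both topics
similarity = jaccard(cps_2009, big_data_2013)
```

A threshold on this similarity (or on the raw count of shared words, here "large," "scale," and "data") would flag candidate predecessor topics for expert review rather than replace the expert judgment used in Figure 3.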
Conclusion and Further Study
In the Big Data Age, it has become common practice to shift from traditional method-driven
research to data-driven empirical study, and this paper can be considered such an attempt. We
focus on NSF Awards, propose a clustering approach for topic retrieval, and then engage
expert knowledge to identify developmental patterns. The combination of quantitative and
qualitative methods provides a promising approach to forecasting potential coming advances.
We anticipate further study along three directions: 1) to continue improving our clustering
algorithm by comparing it with other text clustering approaches, making it more operable,
adaptable, and effective; 2) to extend the empirical study to cover multiple data sources, e.g.,
publications and patents; and 3) to extend our scope to broader innovation processes. We
believe this work should hold interest for government, industry, and researchers.
Figure 3. Technology Roadmapping for Computer Science (based on NSF Award Data)
References
Allan, J., Carbonell, J., Doddington, G., Yamron, J., Yang, Y. Topic Detection and Tracking Pilot Study: Final Report.
Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, 1998.
Allison, G., Blackwill, R. D., Wyne, A. Lee Kuan Yew: The Grand Master’s Insights on China, the United States, and
the World. The MIT Press, USA, 2013.
Big Data is a Big Deal, Office of Science and Technology Policy, White House, March 29, 2012.
http://www.whitehouse.gov/blog/2012/03/29/big-data-big-deal (accessed October 24, 2014)
Big Data: Seizing Opportunities, Preserving Values, White House, May 2014.
http://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf (accessed October 24,
2014)
Blei, D. M. Probabilistic topic models. Communications of the ACM, 2012, 55(4): 77-84.
Blei, D. M., Lafferty, J. D. Dynamic topic models, Proceedings of the 23rd international conference on Machine
learning. ACM, 2006: 113-120.
Boyack, K. W., Newman, D., Duhon, R. J., Klavans, R., Patek, M., et al. Clustering More than Two Million Biomedical
Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches. PLoS ONE, 2011, 6(3): e18029.
doi:10.1371/journal.pone.0018029.
Cataldi, M., Di Caro, L., Schifanella, C. Emerging topic detection on twitter based on temporal and social terms
evaluation. Proceedings of the Tenth International Workshop on Multimedia Data Mining, ACM, 2010: 4.
Chen, H., Zhang, G., Lu, J. A Time-Series-Based Technology Intelligence Framework by Trend Prediction
Functionality, Systems, Man, and Cybernetics (SMC), 2013 IEEE International Conference on. IEEE, 2013: 3477-3482.
Dai, X., Chen, Q., Wang, X., Xu, J. Online topic detection and tracking of financial news based on hierarchical
clustering. Machine Learning and Cybernetics (ICMLC), 2010 International Conference of IEEE, 2010, 6: 3341-3346.
Gretarsson, B., O’donovan, J., Bostandjiev, S., Höllerer, T., Asuncion, A., Newman, D., Smyth, P. Topicnets: Visual
analysis of large text corpora with topic modeling. ACM Transactions on Intelligent Systems and Technology (TIST),
2012, 3(2), 23.
Huang, L., Zhang, Y., Guo, Y., Zhu, D., Porter, A. L. Four dimensional Science and Technology planning: A new
approach based on bibliometrics and technology roadmapping. Technological Forecasting and Social Change, 2014,
81: 39-48.
Jain, A. K. Data Clustering: 50 Years beyond K-Means. Pattern Recognition Letters, 2010, 31(8): 651-666.
Kontostathis, A., Galitsky, L. M., Pottenger, W. M. A survey of emerging trend detection in textual data mining. Survey
of Text Mining. Springer New York, 2004: 185-224.
Lu, N., Zhang, G., Lu, J. Concept drift detection via competence models. Artificial Intelligence, 2014, 209: 11-28.
Porter, A. L., Detampel, M. J. Technology opportunities analysis. Technological Forecasting and Social Change,
1995, 49(3): 237-255.
Porter, A. L., Cunningham, S. W. Tech mining: exploiting new technologies for competitive advantage. John Wiley &
Sons, New York, USA, 2004.
Small, H., Boyack, K. W., Klavans, R. Identifying emerging topics in science and technology. Research Policy, 2014,
43(8): 1450-1467.
United States National Science Foundation, http://www.nsf.gov/ , (accessed October 12, 2014)
VantagePoint, www.theVantagePoint.com, (accessed October 12, 2014)
Wen, J. Let Science and Technology Lead China's Sustainable Development, 2009.
http://www.chinanews.com/gn/news/2009/11-23/1979809.shtml (accessed October 24, 2014)
Yau, C., Porter, A., Newman, N. Clustering scientific documents with topic modelling. Scientometrics, 2014, 100: 767-786.
Zhang, Y., Guo, Y., Wang, X., Zhu, D., Porter, A. L. A hybrid visualisation model for technology roadmapping:
bibliometrics, qualitative methodology and empirical study. Technology Analysis & Strategic Management, 2013,
25(6): 707-724.
Zhang, Y., Porter, A. L., Hu, Z., Guo, Y., Newman, N. "Term clumping" for technical intelligence: A case study on
dye-sensitized solar cells. Technological Forecasting and Social Change, 2014, 85: 26-39.
Zhu, D., Porter, A. L. Automated extraction and visualization of information for technological intelligence and
forecasting. Technological Forecasting and Social Change, 2002, 69: 495-506.