
Fundamentals — from data to visualisation
Big Data Business Academy
Finn Årup Nielsen
DTU Compute
Technical University of Denmark
September 21, 2016
Getting my hands dirty with:
DBC library loan data.
Twitter retweet study.
Library information.
Art depictions data mining.
Danish Business Authority (Erhvervsstyrelsen).
Wikipedia citations mining.
Using tools such as: Python, Perl, R, sklearn, statsmodels, Matplotlib,
D3, command-line, Semantic Web, Wikidata, Wikipedia.
Example: Library loans data
Library loans data
47 million loan records collected from Danish library users by DBC (“Dansk Bibliotekscenter”).
Anonymized structured data as a comma-separated values dataset of 5.8 GB: one loan, one line.
Extraction of title words for each of the 50 library systems (“biblioteksvæsen”, e.g., municipality). Streaming processing over the lines in 5 to 10 minutes to build:
A medium-sized data matrix of size words-by-library-system.
(With help from David Tolnem and Søren Vibjerg at HACK4DK.)
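The streaming step can be sketched in a few lines of Python. The column layout (library system first, title second) and the sample rows are guessed stand-ins for the real DBC format:

```python
import csv
from collections import Counter, defaultdict
from io import StringIO

def count_title_words(lines, stopwords=frozenset()):
    """Stream over loan lines, counting title words per library system.

    Assumes a hypothetical column layout: library_system,title.
    """
    counts = defaultdict(Counter)  # library system -> word counts
    for row in csv.reader(lines):
        system, title = row[0], row[1]
        for word in title.lower().split():
            if word not in stopwords:
                counts[system][word] += 1
    return counts

# Toy sample standing in for the 5.8 GB file: one loan, one line.
sample = StringIO(
    "Odense,Den grimme ælling\n"
    "Odense,Den lille havfrue\n"
    "Aarhus,Pippi Langstrømpe\n")
counts = count_title_words(sample, stopwords={"den"})
```

Because each line is processed and discarded, only the (much smaller) words-by-library-system counts ever sit in memory.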
Interactive version
Summary: Library loan data
Fairly small “big data”: no need for specialized big data tools.
Stream processing on the big data to get manageable medium-sized data.
Simple natural language processing: splitting, stopwords, counting.
Few issues with feature processing: the analyzed data are word counts.
One-shot research analysis with clustering and correlation analysis using standard Python tools: IPython Notebook, Pandas, sklearn, . . .
Visualization with Python’s Matplotlib and JavaScript’s NVD3 and D3.
Example: Twitter retweet analysis
Twitter retweet analysis question
Research question: What determines whether a Twitter post will be retweeted?
“Good Friends, Bad News — Affect and Virality in Twitter” (Hansen et al., 2011).
Collect a lot of tweets, extract features, build a statistical model, and determine feature importance.
Twitter retweet analysis data
Collection of Twitter data in two ways:
1) Attach to the streaming API and store the returned (unstructured) JSON data in the MongoDB NoSQL database. A one-liner!
2) Query the Twitter search API regularly, searching on COP15.
Getting around half a million tweets.
Twitter sentiment through time
Twitter retweet analysis feature extraction
Extracted features:
Occurrence of hashtag
Occurrence of @-mention
Occurrence of link
“Newsiness” from a trained naïve Bayes classifier
Sentiment via the AFINN word list
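The first four features are simple pattern matches, and the sentiment score is a sum of word valences. A minimal sketch, where the AFINN entries are a tiny illustrative subset (the real list scores thousands of words):

```python
import re

# Tiny illustrative subset of AFINN-style word valences.
AFINN_SAMPLE = {"good": 3, "bad": -3, "happy": 3, "disaster": -2}

def extract_features(tweet):
    """Extract simple retweet-analysis features from a tweet's text."""
    words = re.findall(r"\w+", tweet.lower())
    return {
        "has_hashtag": bool(re.search(r"#\w+", tweet)),
        "has_mention": bool(re.search(r"@\w+", tweet)),
        "has_link": bool(re.search(r"https?://\S+", tweet)),
        "sentiment": sum(AFINN_SAMPLE.get(word, 0) for word in words),
    }

features = extract_features("Bad news from #COP15 http://example.org @dtu")
```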
Twitter retweet analysis summary
Stream processing for extraction of features, written to a medium-sized comma-separated values file.
Twitter features analyzed with logistic regression over 100’000s of tweets in R.
Investigated the interaction between newsiness and sentiment, particularly negative sentiment. An R one-liner.
Various statistical tests support that negative newsy tweets are retweeted more (“bad news is good news”), as are positive non-news (“friends”) tweets.
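The original analysis ran in R; the same interaction model can be fitted with Python's statsmodels formula interface, which mirrors R's `glm(retweeted ~ newsy * negative, family=binomial)`. The data below is synthetic with a built-in interaction effect, not the study's data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
# Synthetic stand-in for the tweet feature file: binary indicators for
# newsiness and negative sentiment, with a true interaction effect.
newsy = rng.integers(0, 2, n)
negative = rng.integers(0, 2, n)
logit_p = -1.0 + 1.5 * newsy * negative
retweeted = rng.random(n) < 1 / (1 + np.exp(-logit_p))
data = pd.DataFrame({"retweeted": retweeted.astype(int),
                     "newsy": newsy, "negative": negative})

# Main effects plus interaction in one line, as in the R one-liner.
result = smf.logit("retweeted ~ newsy * negative", data=data).fit(disp=0)
```

The `newsy:negative` coefficient in `result.params` is the interaction term of interest.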
Example: Library information
Library information
DBC (“Dansk Bibliotekscenter”) competition in 2015/2016.
“How can data science be used to provide library users with new and better experiences?”
DBC made loan data available.
Recommendation system based on loan data? The 1st and 3rd prizes did that.
New approach: search library information via geolocation.
Littar
Geolocatable narrative locations from literary works in Wikidata plotted on a map, available at http://fnielsen.github.io/littar.
So where is the data from? Wikidata!
Wikidata = Wikipedia’s sister site with semi-structured data.
Over 20 million items. For instance, over 180’000 literary works.
Each item may be described by one or more of over 2700 properties.
Crowdsourced from over 15’000 “active users” and a total of over 370 million edits.
Semantic Web: Example triples
Subject                   Verb           Object
neuro:Finn                a              foaf:Person
neuro:Finn                foaf:homepage  <http://www.imm.dtu.dk/~fn/>
dbpedia:Charlie_Chaplin   foaf:surname   Chaplin
dbpedia:Charlie_Chaplin   owl:sameAs     fbase:Charlie_Chaplin

Table 1: Triple structure

where the so-called “prefixes” are

PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
PREFIX neuro:   <http://neuro.imm.dtu.dk/resource/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX owl:     <http://www.w3.org/2002/07/owl#>
PREFIX fbase:   <http://rdf.freebase.com/ns/type.object.>
Semantic Web search engine
SPARQL search engines:
BlazeGraph (formerly called “Bigdata”), “supports up to 50 Billion edges on a single machine”
Virtuoso Universal Server from OpenLink Software
Apache Jena
RDF4J/Sesame
The Wikidata Query Service presently uses BlazeGraph. It is available from https://query.wikidata.org and includes, e.g., graph and map visualizations.
Example query: coauthor-journal network
Query on the Wikidata Query Service with graph visualization for data with scientific articles, their authors, and journals over more than 100 million statements.
# defaultView:Graph
SELECT DISTINCT ?journal ?journalLabel
       (concat("7FFF00") as ?rgb)
       ?coauthor ?coauthorLabel
WHERE {
  ?work wdt:P50 wd:Q20980928 .
  ?work wdt:P50 ?coauthor .
  ?work wdt:P1433 ?journal .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" . }
}
Try it! Or try one for drug-disease interactions (by Dario Taraborelli).
Example: Wikidata query on book data
Wikidata SPARQL query with OpenStreetMap and Leaflet map
One step further: Data mining Wikidata
Unsupervised learning (non-negative matrix factorization) on an 896-by-576 matrix of depictions in paintings as described in Wikidata.
Example: Company information
Company information for novelty detection
Extract features from a 43 GB JSONL file from Erhvervsstyrelsen.
Features: antal_penheder, branche_ansvarskode, nyeste_antal_ansatte, nyeste_virksomhedsform, reklamebeskyttet, sammensat_status, sidste_virksomhedsstatus, stiftelsesaar.
Features imputed and scaled.
Novelty here: distance from a company to each cluster center after K-means clustering.
Technical: Python, Pandas, unsupervised learning with MiniBatchKMeans from Scikit-learn (sklearn), implemented in a Python module called cvrminer and an IPython Notebook.
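The novelty score can be sketched as follows: MiniBatchKMeans's `transform` returns each row's distance to every cluster center, and the distance to the assigned (nearest) center serves as the score. The feature matrix here is synthetic, standing in for the imputed and scaled company features:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-in for imputed company features (rows = companies).
X = rng.normal(size=(500, 6))

X_scaled = StandardScaler().fit_transform(X)
kmeans = MiniBatchKMeans(n_clusters=8, n_init=3, random_state=1).fit(X_scaled)

# transform() gives distances to every cluster center; the novelty
# score is the distance to the assigned (nearest) center.
distances = kmeans.transform(X_scaled)
novelty = distances.min(axis=1)
most_unusual = int(np.argmax(novelty))  # index of the most outlying company
```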
Company information novelty
The most unusual company listing in the present analysis (with K = 8 clusters).
“Sammensat status” is unusual: “Underreasummation”. There is only a single instance of this category.
Other examples: “Medarbejderinvesteringsselskab” (one of this kind), SAS DANMARK A/S (large number of employees compared to p-sites?).
Company novelty distances
Histogram of distances from company features to their estimated cluster centers.
Here for the companies assigned to the cluster with the most novel/outlying company.
Company feature distances
Company information for bankruptcy detection
Extract features from a 43 GB JSONL file from Erhvervsstyrelsen.
Features extracted with indexing and regular expressions: antal_penheder, branche_ansvarskode, nyeste_antal_ansatte, (nyeste_virksomhedsform), reklamebeskyttet, sammensat_status, (nyeste_statuskode), stiftelsesaar.
Focus on companies with ’Aktiv’ or ’OPLØSTEFTERKONKURS’ in “sammensat status”.
Technical: Python, Pandas, supervised learning with a generalized linear model from statsmodels, implemented in a Python module called cvrminer and an IPython Notebook.
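Streaming feature extraction from a JSONL file needs only the standard library: parse one line at a time and index into the record. The key names below are hypothetical simplifications; the real Erhvervsstyrelsen records are far more deeply nested:

```python
import json
from io import StringIO

def iter_company_features(jsonl_file):
    """Stream over a JSONL file, one company record per line, and pull
    out a few fields by indexing (hypothetical key layout)."""
    for line in jsonl_file:
        record = json.loads(line)
        company = record["Vrvirksomhed"]
        yield {
            "stiftelsesaar": company.get("stiftelsesaar"),
            "reklamebeskyttet": company.get("reklamebeskyttet"),
            "sammensat_status": company.get("sammensatStatus"),
        }

# One-record toy file standing in for the 43 GB download.
sample = StringIO(json.dumps({"Vrvirksomhed": {
    "stiftelsesaar": 1999, "reklamebeskyttet": False,
    "sammensatStatus": "Aktiv"}}) + "\n")
features = list(iter_company_features(sample))
```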
Initial bankruptcy detection feature results
                                        coef     std err         z    P>|z|
-----------------------------------------------------------------------------
Intercept                            -0.1821       0.187    -0.976    0.329
C(nyeste_antal_ansatte)[T.1.0]        1.3965       0.019    71.879    0.000
C(nyeste_antal_ansatte)[T.2.0]        1.4391       0.019    76.948    0.000
C(nyeste_antal_ansatte)[T.5.0]        1.6605       0.025    67.751    0.000
C(nyeste_antal_ansatte)[T.10.0]       1.9545       0.032    62.028    0.000
C(nyeste_antal_ansatte)[T.20.0]       2.1077       0.043    49.589    0.000
C(nyeste_antal_ansatte)[T.50.0]       1.8773       0.093    20.237    0.000
C(nyeste_antal_ansatte)[T.100.0]      1.2759       0.157     8.126    0.000
C(nyeste_antal_ansatte)[T.200.0]      1.4266       0.274     5.206    0.000
C(nyeste_antal_ansatte)[T.500.0]      1.0133       0.752     1.347    0.178
C(nyeste_antal_ansatte)[T.1000.0]     0.7364       1.051     0.701    0.484
branche_ansvarskode[T.15]            -4.5699       1.034    -4.421    0.000
branche_ansvarskode[T.65]             0.4971       0.209     2.381    0.017
branche_ansvarskode[T.75]           -24.7808    1.42e+04    -0.002    0.999
branche_ansvarskode[T.96]            28.5924    2.16e+05     0.000    1.000
branche_ansvarskode[T.97]             0.5545       0.614     0.903    0.366
branche_ansvarskode[T.99]             0.2416       0.542     0.446    0.656
branche_ansvarskode[T.None]          -1.9593       0.180   -10.896    0.000
reklamebeskyttet[T.True]             -2.6928       0.051   -52.787    0.000
np.log(antal_penheder + 1)           -0.5775       0.072    -8.058    0.000
transform_year(stiftelsesaar)         0.0498       0.001    74.561    0.000
Bankruptcy detection observation
“reklamebeskyttelse” is surprisingly indicative of an “active” company.
The age of the company is important (in our present analysis).
The size of the company is important, cf. “antal_penheder” and “antal ansatte”.
Example: Wikipedia citations mining
Wikipedia citations mining
13 GB compressed XML file with the English Wikipedia dump:

bzcat enwiki-20160701-pages-articles.xml.bz2 | less
Output from command-line streaming decompression:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" ...
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.28.0-wmf.8</generator>
    ...
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="Computer accessibility" />
Wikipedia citations mining
Iterate over the pages and use a regular expression in Perl (it does not match all instances):
$INPUT_RECORD_SEPARATOR = "<page>";
@citejournals = m/({{\s*cite journal.*?}})/sig;
@titles       = m|<title>(.*?)</title>|;
We are after these parts in the wiki text:
<ref name=Dapson2007>{{Cite journal | last1 = Dapson | first1 = R.
 | last2 = Frank | first2 = M. | last3 = Penney | first3 = D. | last4 = Kiernan
 | first4 = J. | title = Revised procedures for the certification of carmine
 (C.I. 75470, Natural red 4) as a biological stain | doi =
 10.1080/10520290701207364 | journal = Biotechnic & Histochemistry
 | volume = 82 | pages = 13 | year = 2007 }}</ref>
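The same extraction can be sketched in Python's `re` module; like the Perl one-liner, it will not catch every variant (e.g., templates with nested braces):

```python
import re

# Pull {{cite journal ...}} templates out of wiki text: case-insensitive,
# non-greedy up to the closing braces, DOTALL so templates may span lines.
CITE_JOURNAL = re.compile(r"({{\s*cite journal.*?}})",
                          re.DOTALL | re.IGNORECASE)

wikitext = """<ref name=Dapson2007>{{Cite journal | last1 = Dapson
 | title = Revised procedures ... | year = 2007 }}</ref>"""
citations = CITE_JOURNAL.findall(wikitext)
```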
Wikipedia citations mining
To help match different variations of journal names, a manually built XML file was set up:
...
<Jou>
  <wojou>7</wojou>
  <name>The Journal of Neuroscience</name>
  <abbreviation>JNeurosci</abbreviation>
  <namePubmed>J Neurosci</namePubmed>
  <type>jou</type>
  <variation>Journal of Neuroscience</variation>
  <variation>j. neurosci.</variation>
  <variation>J Neurosci</variation>
  <wikipedia>Journal of Neuroscience</wikipedia>
</Jou>
...
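Once such a file is parsed, normalization reduces to a dictionary lookup from lowercased variation to canonical name. A sketch with a hand-written dictionary standing in for the parsed XML:

```python
# Variation-to-canonical lookup; a full version would be parsed from
# the XML file above.
VARIATIONS = {
    "the journal of neuroscience": "The Journal of Neuroscience",
    "journal of neuroscience": "The Journal of Neuroscience",
    "j. neurosci.": "The Journal of Neuroscience",
    "j neurosci": "The Journal of Neuroscience",
}

def canonical_journal(name):
    """Map a journal-name variation to its canonical name, or None."""
    return VARIATIONS.get(name.strip().lower())

journal = canonical_journal("J Neurosci")
```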
Wikipedia citations mining
            Science  Nature  JBC  JAMA  AJ  ...
Evolution         3       1    1     0   1  ...
Bacteria          1       3    0     1   0  ...
Sertraline        0       0    4     2   0  ...
Autism            0       0    0     2   0  ...
Uranus            1       0    0     0   3  ...
...             ...     ...  ...   ...  ...
Begin with a (Wikipedia articles × journals) matrix.
Topic mining with non-negative matrix factorization. This algorithm is, e.g., implemented in sklearn.
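The factorization step can be sketched with sklearn's NMF on a small random count matrix standing in for the real articles-by-journals data: the factor W loads articles on topics, and H loads topics on journals.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)
# Synthetic stand-in for the (Wikipedia articles x journals) count matrix.
X = rng.poisson(0.5, size=(40, 10)).astype(float)

# Factorize X ~ W @ H with non-negative factors.
model = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)   # article loadings on topics
H = model.components_        # topic loadings on journals
```

The largest entries per row of H pick out each topic's characteristic journals, and per column of W the articles belonging to the topic.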
Wikipedia citations mining
Example of cluster with Wikipedia articles and scientific journals
Summing up
Structured and unstructured data
Structured data: Data that can be represented in a table with a fixed number of columns and “easily” converted to numerical data. Represented in CSV, SQL databases, spreadsheets. Most machine learning/statistical algorithms need a fixed-size input.
Unstructured data: Data with no fixed number of columns/fields. Free-format text, . . .
Semi-structured data: Data not in column format:
Semi-structured data I: Representation in XML, JSON, JSONL (lines of JSON), NoSQL databases, . . .
Semi-structured data II: Semi-structured data easy to convert to structured data, e.g., the Semantic Web. Represented in triple format, SPARQL engines, . . .
Machine learning
Supervised learning (regression, classification, . . . )
• Python now has a range of off-the-shelf data analysis packages: machine learning (sklearn), statistics (statsmodels), and deep learning.
• Linear models are also available in R.
Unsupervised learning (clustering, topic mining, density modeling, . . . )
• Novelty detection, detection of anomalies
• Topic mining, e.g., of text corpora
Background knowledge from the Semantic Web (Wikidata et al.)
Streaming data processing
Operations that can be performed using streaming processes:
• Counting, mean, . . .
• Feature extraction on large datasets, for conversion to “medium-sized” data for in-memory data analysis.
Operations that are not so efficient with streaming because of data reloads: many machine learning algorithms. Streaming machine learning solutions:
• Batch processing, e.g., partial fit in sklearn in Python, deep learning.
• Spark’s MLlib/ML
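The sklearn batch approach can be sketched with `partial_fit`, which updates a model one batch at a time so the full dataset never needs to sit in memory. The batches here are synthetic:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(3)
clf = SGDClassifier(random_state=3)

# Simulated batches standing in for a dataset too big to load at once.
classes = np.array([0, 1])
for _ in range(20):
    X_batch = rng.normal(size=(100, 5))
    y_batch = (X_batch[:, 0] > 0).astype(int)  # simple learnable rule
    # classes= must list all labels, since no batch is guaranteed to
    # contain every class.
    clf.partial_fit(X_batch, y_batch, classes=classes)

X_test = rng.normal(size=(200, 5))
y_test = (X_test[:, 0] > 0).astype(int)
accuracy = clf.score(X_test, y_test)
```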
References
Hansen, L. K., Arvidsson, A., Nielsen, F. Å., Colleoni, E., and Etter, M. (2011). Good friends, bad news
— affect and virality in Twitter. In Park, J. J., Yang, L. T., and Lee, C., editors, Future Information
Technology, volume 185 of Communications in Computer and Information Science, pages 34–43, Berlin.
Springer.