URLs retrieval

ESSnet on Big Data
WP2 - Webscraping / Enterprise Characteristics
Giulio Barcaroli, Monica Scannapieco, Donato Summa
Issues regarding URLs retrieval
ESSnet on Big Data – Wp2 “Webscraping / Enterprise Characteristics” – Kickoff meeting Rome 23-24 March
Task 1 – Data access
1. This task will produce an inventory of the enterprises that will be the target of the
webscraping: (i) definition of the target population (requirements on enterprises in
terms of economic activities and size), (ii) identification of business registers
containing the information needed to identify the enterprises. All the countries involved
in the work package will participate in this task.
2. For each inventory of enterprises produced in task 1, identification of URLs: (i)
selection of archives containing the URLs of the enterprises
included in the population of interest (activity 1), and estimation of coverage; (ii) in
case of insufficient coverage, development of applications that search
the web for the URL of an enterprise given its identifiers (denomination, fiscal code,
economic activity, etc.), and evaluation of the reliability of the resulting URL.
Definition of the target population
For instance, the population of interest in the “European survey on enterprise use of
Information and Communication Technology (ICT)” is given by all enterprises that have an
economic activity code (NACE) in a specified set and 10 or more
employees.
More specifically, the enterprises are classified in the following economic activities (NACE
Rev. 2):
Manufacturing; Electricity, gas and steam, water supply, sewerage and waste management;
Construction; Wholesale and retail trade, repair of motor vehicles and motorcycles;
Transportation and storage; Accommodation and food service activities; Information and
communication; Real estate activities; Professional, scientific and technical activities;
Administrative and support activities; Repair of computers.
The criteria for the selection of the population of interest should be decided together
with EUROSTAT.
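A minimal sketch of the selection rule described above, assuming the register is available as a list of records. The field names, the made-up sample records, and the encoding of the listed activities as NACE Rev. 2 sections C to J and L to N plus group 95.1 are illustrative assumptions, not the criteria to be agreed with EUROSTAT:

```python
from dataclasses import dataclass

# Assumed encoding of the activities listed above as NACE Rev. 2 sections,
# plus group 95.1 for "Repair of computers". Illustrative only.
TARGET_SECTIONS = set("CDEFGHIJLMN")
TARGET_GROUPS = {"95.1"}
MIN_EMPLOYEES = 10


@dataclass
class Enterprise:
    name: str
    nace_section: str  # hypothetical field: section letter, e.g. "C"
    nace_code: str     # hypothetical field: dotted code, e.g. "95.11"
    employees: int


def in_target_population(e: Enterprise) -> bool:
    """True if the enterprise meets both the activity and the size criterion."""
    activity_ok = (
        e.nace_section in TARGET_SECTIONS
        or any(e.nace_code.startswith(g) for g in TARGET_GROUPS)
    )
    return activity_ok and e.employees >= MIN_EMPLOYEES


# Made-up records, only to exercise the filter.
register = [
    Enterprise("Alpha Costruzioni", "F", "41.20", 25),
    Enterprise("Beta Banca", "K", "64.19", 300),
    Enterprise("Gamma PC Repair", "S", "95.11", 12),
    Enterprise("Delta Software", "J", "62.01", 4),
]
target = [e.name for e in register if in_target_population(e)]
```

Here the second record is excluded by its activity and the fourth by its size, so only the first and third remain in `target`.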
Identification of Business Register and additional information
On the basis of the chosen population of interest, each partner has to
identify the register that contains the relevant information.
For instance, at Istat the register containing information on the enterprises
involved in the ICT survey is the Archive of Enterprises “ASIA” (about
4,000,000 enterprises).
The number of enterprises contained in ASIA that fulfil the selection criteria is
about 200,000.
For each of these enterprises, ASIA offers information on a number of
variables (denomination, economic activity, number of employees,
geographical location, …).
Beyond the information contained in the Register, a search for any other
source containing information of interest has to be carried out.
For instance, in Italy private firms (CONSODATA) offer additional information,
sometimes including website and email addresses and telephone numbers.
Retrieval of URLs
For each enterprise belonging to the population of interest that has a website,
the URL must be retrieved.
If this information is not yet available from existing sources, or is available but
with insufficient coverage, a specific application has to be developed and applied.
Proposed solution (I)
1. Input from existing sources (Register plus other archives): denomination,
   address, telephone number, fiscal code, …
2. For each enterprise in the target population:
   a) Introduce the denomination into a search engine
   b) Obtain a list of the first k resulting web pages
   c) For each one of these results, calculate the value of binary indicators. For
      instance:
      o the URL contains the denomination (Yes/No);
      o the scraped website contains geographical information coinciding with that already
        available in the Register (Yes/No);
      o the scraped website contains the same fiscal code as in the Register (Yes/No);
      o the scraped website contains the same telephone number as in the Register (Yes/No);
      o …
   d) Compute a score on the basis of the values of the above indicators.
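Steps c) and d) could be sketched as follows. This is only an illustration: the record fields, the example URL and page text, the weights, and the choice of a weighted share as the score are all assumptions, since the slides do not specify how the score is computed.

```python
import re


def normalise(text: str) -> str:
    """Lower-case and keep only ASCII letters and digits, to make matching tolerant."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def contains(haystack: str, needle: str) -> int:
    """1 if the normalised needle is non-empty and occurs in the haystack, else 0."""
    needle = normalise(needle)
    return int(bool(needle) and needle in haystack)


def binary_indicators(record: dict, url: str, page_text: str) -> dict:
    """Compute Yes/No indicators (as 1/0) for one search result."""
    page = normalise(page_text)
    return {
        "name_in_url": contains(normalise(url), record["denomination"]),
        "city_in_page": contains(page, record["city"]),
        "fiscal_code_in_page": contains(page, record["fiscal_code"]),
        "phone_in_page": contains(page, record["telephone"]),
    }


def score(indicators: dict, weights: dict) -> float:
    """Weighted share of satisfied indicators, in [0, 1]."""
    return sum(weights[k] * v for k, v in indicators.items()) / sum(weights.values())


# Made-up register record and search result.
record = {
    "denomination": "Rossi Meccanica",
    "city": "Torino",
    "fiscal_code": "01234567890",
    "telephone": "011 555 0100",
}
url = "http://www.rossimeccanica.example/contatti"
page_text = "Rossi Meccanica S.r.l. - Via Roma 1, Torino - P.IVA 01234567890"

# Hypothetical weights: the fiscal code is treated as the strongest evidence.
weights = {"name_in_url": 1, "city_in_page": 1, "fiscal_code_in_page": 2, "phone_in_page": 1}

ind = binary_indicators(record, url, page_text)
s = score(ind, weights)
```

In this example three of the four indicators are satisfied (the telephone number is not on the page), so the weighted score is 4 out of 5.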
Proposed solution (II)
3. On the subset of enterprises for which the URL is known (training set), model the
   relation between the binary indicators plus the score, and the success/failure of
   the found URL
4. Apply the model to the subset of enterprises for which the URL is not known, in
   order to decide if the found URL is acceptable or not.
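Steps 3 and 4 could be sketched as below. The slides do not name a model, so the logistic regression used here is an assumption, fitted with plain gradient descent to keep the example self-contained. The feature layout (two indicators plus the score), the toy training data, and the 0.5 acceptance threshold are also hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that the found URL is the correct one."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy training set: [name_in_url, fiscal_code_in_page, score] for enterprises
# whose true URL is known; label 1 means the found URL was correct.
X_train = [
    [1, 1, 0.9], [1, 1, 0.8], [1, 0, 0.6], [0, 1, 0.7], [1, 0, 0.5],
    [0, 0, 0.3], [0, 0, 0.1], [0, 1, 0.4], [1, 0, 0.4], [0, 0, 0.2],
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]

w, b = fit_logistic(X_train, y_train)

# Step 4: score enterprises whose URL is not known and accept or reject.
p_good = predict_proba(w, b, [1, 1, 0.9])
p_bad = predict_proba(w, b, [0, 0, 0.1])
accept_good = p_good >= 0.5
accept_bad = p_bad >= 0.5
```

In practice a statistical library and a proper validation split would replace this hand-written fit; the point is only that the training set supplies the labels and the fitted model supplies the accept/reject decision.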
Proposed solution (III): an example for Istat case
score_class   true   false   classification_error   group
(0.9,1]       5890     382   0.060905612               10
(0.8,0.9]      825     119   0.126059322                9
(0.7,0.8]      498      68   0.120141343                8
(0.6,0.7]     1297     235   0.153394256                7
(0.5,0.6]     1485     886   0.373681991                6
(0.4,0.5]      482     730   0.602310231                5
(0.3,0.4]      258     729   0.738601824                4
(0.2,0.3]       81     511   0.863175676                3
(0.1,0.2]      111     980   0.898258478                2
[0,0.1]         36     585   0.942028986                1
Total (true + false): 16188

% of URLs found with score > 0.6: 57.5%
% of link errors: 8.6%

Application to 146,000 enterprises with no indications
Of which: expected with URL still to be found: 47300
Expected found websites with score > 0.6: 27215
Of which: erroneous: 2349
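The summary figures can be re-derived from the table rows. The sketch below copies the true/false counts from the table and assumes that "score > 0.6" means the four classes from (0.6,0.7] up to (0.9,1], and that the link error rate is computed within those classes; the variable names are illustrative.

```python
# (score_class, true, false) rows copied from the table.
rows = [
    ("(0.9,1]", 5890, 382),
    ("(0.8,0.9]", 825, 119),
    ("(0.7,0.8]", 498, 68),
    ("(0.6,0.7]", 1297, 235),
    ("(0.5,0.6]", 1485, 886),
    ("(0.4,0.5]", 482, 730),
    ("(0.3,0.4]", 258, 729),
    ("(0.2,0.3]", 81, 511),
    ("(0.1,0.2]", 111, 980),
    ("[0,0.1]", 36, 585),
]
# Classes assumed to correspond to "score > 0.6".
ACCEPTED = {"(0.9,1]", "(0.8,0.9]", "(0.7,0.8]", "(0.6,0.7]"}

total = sum(t + f for _, t, f in rows)
accepted = [(t, f) for c, t, f in rows if c in ACCEPTED]
n_accepted = sum(t + f for t, f in accepted)

share_found = n_accepted / total                         # URLs with score > 0.6
error_rate = sum(f for _, f in accepted) / n_accepted    # wrong links among them

to_find = 47300                                          # enterprises with URL still to be found
expected_found = to_find * share_found
expected_wrong = expected_found * error_rate

print(f"{share_found:.1%} found, {error_rate:.1%} link errors")
print(round(expected_found), round(expected_wrong))
```

Under these assumptions the rounded results agree with the figures reported on the slide.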
Crowdsourcing
It is the process of obtaining needed services, ideas, or content by
soliciting contributions from a large group of people, especially from
an online community. In a nutshell:
Huge list of little problems → Community of workers → Huge list of solutions
A kind of human Hadoop!
CrowdSearcher
A web platform developed by Politecnico di Milano, http://crowdsearcher.searchcomputing.it/home
It is used for designing, deploying, and monitoring crowd-based applications on top of
social systems, including social networks and crowdsourcing platforms.
Task objectives
• Link the most probable official URL from a list of URLs to a given firm name.
• Do this operation for a very long list of firm names in a reasonable amount of time.
Task design features
• A group of (100 – 200) identifiable volunteer workers
• For each firm the user has to select the most probable official URL from a list of
  (2 – 10) proposed URLs
  – For each proposed URL the system shows the webpage to the user
  – The URLs are sorted by a score computed in a previous step
  – The first choice proposed is the default “select an URL”
  – The last choice proposed is “none of the previous”
  – “offline” choice
• Before the final submission the system
  – Checks if there are still “select an URL” selections and invites the user to make a decision
  – Acquires the data
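The ordering of the choices and the check before submission could be sketched as follows. This is not CrowdSearcher code: the function names, firm names, URLs and scores are made up for illustration, and the “offline” choice is left out because the slides do not describe it.

```python
PLACEHOLDER = "select an URL"
NONE_OPTION = "none of the previous"


def build_choices(scored_urls):
    """Placeholder first, URLs by descending score, 'none of the previous' last."""
    ranked = sorted(scored_urls, key=lambda pair: pair[1], reverse=True)
    return [PLACEHOLDER] + [url for url, _ in ranked] + [NONE_OPTION]


def unanswered(selections):
    """Return the firms whose selection is still the default placeholder."""
    return [firm for firm, choice in selections.items() if choice == PLACEHOLDER]


# Made-up candidate URLs with scores from the previous step.
choices = build_choices([
    ("http://b.example", 0.4),
    ("http://a.example", 0.9),
    ("http://c.example", 0.7),
])

# Made-up answers from one worker; the second firm was skipped.
selections = {
    "Firm 1": "http://a.example",
    "Firm 2": PLACEHOLDER,
    "Firm 3": NONE_OPTION,
}
pending = unanswered(selections)
can_submit = not pending
```

When `pending` is not empty, the form would ask the worker to decide on those firms before the data are acquired.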
Next steps for the task
• The Task has been implemented
• In the next few months
– Selection of the appropriate URLs set to submit
– Selection of the workers
– Launch of the Task
Sharing of previous experiences on scraping - Istat’s experience
Thank you for your attention