ESSnet on Big Data WP2 - Webscraping / Enterprise Characteristics
Giulio Barcaroli, Monica Scannapieco, Donato Summa
Issues regarding URLs retrieval
ESSnet on Big Data – Wp2 “Webscraping / Enterprise Characteristics” – Kickoff meeting Rome 23-24 March

URLs retrieval

Task 1 – Data access
1. This task will produce an inventory of enterprises that will be the target for the webscraping: (i) definition of the target population (requirements on enterprises in terms of economic activities and size), (ii) identification of business registers containing the information needed to identify the enterprises. All the countries involved in the work package will participate in this task.
2. For each inventory of enterprises produced in task 1, identification of URLs: (i) selection of archives containing the indication of URLs pertaining to the enterprises included in the population of interest (activity 1), and estimation of coverage; (ii) in case of insufficient coverage, development of applications for searching the web for the URL of an enterprise given its identifiers (denomination, fiscal code, economic activity, etc.), and evaluation of the reliability of the resulting URL.

Definition of the target population
For instance, the population of interest in the “European survey on enterprise use of Information and Communication Technology (ICT)” is given by all enterprises having an economic activity code (NACE) included in a specified set and having 10 or more employees.
More specifically, the enterprises are classified in the following economic activities (NACE Rev. 2): Manufacturing; Electricity, gas and steam, water supply, sewerage and waste management; Construction; Wholesale and retail trade, repair of motor vehicles and motorcycles; Transportation and storage; Accommodation and food service activities; Information and communication; Real estate activities; Professional, scientific and technical activities; Administrative and support activities; Repair of computers.
The criteria for the selection of the population of interest should be decided together with EUROSTAT.

Identification of Business Register and additional information
On the basis of the chosen population of interest, each partner has to identify the register that contains the relevant information.
For instance, in Istat the Register containing information on the enterprises involved in the ICT survey is the Archive of Enterprises “ASIA” (about 4,000,000 enterprises). The number of enterprises contained in ASIA and fulfilling the selection criteria is about 200,000. For each of these enterprises, ASIA offers information on a number of variables (denomination, economic activity, number of employees, geographical location, …).
Beyond the information contained in the Register, a search for any other source containing information of interest has to be carried out. For instance, in Italy private firms (CONSODATA) offer additional information, sometimes including website and email addresses and telephone number.

Retrieval of URLs
For each enterprise belonging to the population of interest, if it owns a website, its URL must be retrieved.
If this information is not yet available from existing sources, or is available but with insufficient coverage, a specific application has to be developed and applied.
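The population selection and the coverage check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Enterprise` record, the sample data and the function names are hypothetical, and the NACE Rev. 2 section letters are one possible reading of the listed activities (repair of computers, which is a group within section S, is left out for brevity). The actual criteria are to be agreed with EUROSTAT.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: section letters standing in for the "specified set" of
# economic activities; the real set is to be decided with EUROSTAT.
TARGET_NACE_SECTIONS = {"C", "D", "E", "F", "G", "H", "I", "J", "L", "M", "N"}
MIN_EMPLOYEES = 10


@dataclass
class Enterprise:
    """Hypothetical register record; field names are illustrative only."""
    enterprise_id: str
    denomination: str
    nace_section: str
    employees: int
    url: Optional[str] = None  # set when an existing archive provides it


def in_target_population(e: Enterprise) -> bool:
    """True if the enterprise meets both the activity and the size criterion."""
    return e.nace_section in TARGET_NACE_SECTIONS and e.employees >= MIN_EMPLOYEES


def url_coverage(population: List[Enterprise]) -> float:
    """Share of enterprises in the population with an already known URL."""
    if not population:
        return 0.0
    known = sum(1 for e in population if e.url)
    return known / len(population)


# Made-up records, not taken from any real register.
register = [
    Enterprise("A1", "Alfa Costruzioni", "F", 25, "http://www.example.com"),
    Enterprise("B2", "Beta Trasporti", "H", 12),
    Enterprise("C3", "Gamma Agricola", "A", 40),  # activity outside the set
    Enterprise("D4", "Delta Software", "J", 3),   # fewer than 10 employees
]

population = [e for e in register if in_target_population(e)]
coverage = url_coverage(population)
print(len(population), coverage)
```

If the coverage computed this way is judged insufficient, the enterprises with `url` still empty are the ones to pass to the URL search application.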
Proposed solution (I)
1. Input from existing sources (Register plus other archives): denomination, address, telephone number, fiscal code, …
2. For each enterprise in the target population:
   a) Introduce the denomination into a search engine
   b) Obtain a list of the first k resulting web pages
   c) For each of these results, calculate the value of binary indicators. For instance:
      o the URL contains the denomination (Yes/No);
      o the scraped website contains geographical information matching that already available in the Register (Yes/No);
      o the scraped website contains the same fiscal code as the Register (Yes/No);
      o the scraped website contains the same telephone number as the Register (Yes/No);
      o …
   d) Compute a score on the basis of the values of the above indicators.

Proposed solution (II)
3. On the subset of enterprises for which the URL is known (training set), model the relation between the binary indicators plus the score, and the success/failure of the found URL.
4. Apply the model to the subset of enterprises for which the URL is not known, in order to decide if the found URL is acceptable or not.
Proposed solution (III): an example for Istat case

score_class   true    false   classification_error   group
(0.9,1]       5890    382     0.060905612            10
(0.8,0.9]     825     119     0.126059322            9
(0.7,0.8]     498     68      0.120141343            8
(0.6,0.7]     1297    235     0.153394256            7
(0.5,0.6]     1485    886     0.373681991            6
(0.4,0.5]     482     730     0.602310231            5
(0.3,0.4]     258     729     0.738601824            4
(0.2,0.3]     81      511     0.863175676            3
(0.1,0.2]     111     980     0.898258478            2
[0,0.1]       36      585     0.942028986            1
Total         16188

% of URLs found with score > 0.6: 57.5%
% of link errors (among URLs with score > 0.6): 8.6%

Application to 146,000 enterprises with no indication of a URL
Of which: expected with URL still to be found: 47300
Expected found websites with score > 0.6: 27215
Of which: erroneous: 2349

Crowdsourcing
Crowdsourcing is the process of obtaining needed services, ideas, or content by soliciting contributions from a large group of people, especially from an online community.
In a nutshell: a huge list of little problems is handed to a community of workers, who return a huge list of solutions. A kind of human Hadoop!

CrowdSearcher
A web platform developed by Politecnico di Milano, http://crowdsearcher.searchcomputing.it/home
It is used for designing, deploying, and monitoring crowd-based applications on top of social systems, including social networks and crowdsourcing platforms.

Task objectives
Link the most probable official URL from a list of URLs to a given firm name.
Do this operation for a very long list of firm names in a reasonable amount of time.

Task design features
• A group of (100 – 200) identifiable volunteer workers
• For each firm the user has to select the most probable official URL from a list of (2 – 10) proposed URLs
  – For each proposed URL the system shows the webpage to the user
  – The URLs are sorted by a score computed in a previous step
  – The first choice proposed is the default “select an URL”
  – The last choice proposed is “none of the previous”
  – “offline” choice
• Before the final submission the system
  – Checks if there are still “select an URL” selections and invites the user to make a decision
  – Acquires the data

Next steps for the task
• The Task has been implemented
• In the next few months
  – Selection of the appropriate URLs set to submit
  – Selection of the workers
  – Launch of the Task

Sharing of previous experiences on scraping - Istat’s experience

Thank you for your attention
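As a supplementary check on the Istat example, the summary figures reported with the score-class table (57.5% of URLs found with score > 0.6, 8.6% link errors, 27215 expected websites and 2349 expected erroneous links) can be reproduced from the per-class counts. The sketch below assumes that the link error rate refers to the classes with score > 0.6 and that the expected counts are obtained by applying the two rates in sequence; the variable names are illustrative.

```python
# (score class, true links, false links) as reported in the Istat example
SCORE_CLASSES = [
    ("(0.9,1]", 5890, 382),
    ("(0.8,0.9]", 825, 119),
    ("(0.7,0.8]", 498, 68),
    ("(0.6,0.7]", 1297, 235),
    ("(0.5,0.6]", 1485, 886),
    ("(0.4,0.5]", 482, 730),
    ("(0.3,0.4]", 258, 729),
    ("(0.2,0.3]", 81, 511),
    ("(0.1,0.2]", 111, 980),
    ("[0,0.1]", 36, 585),
]

# Classes lying entirely above the 0.6 threshold
ACCEPTED = {"(0.9,1]", "(0.8,0.9]", "(0.7,0.8]", "(0.6,0.7]"}

# Classification error per class: false / (true + false)
errors = {name: f / (t + f) for name, t, f in SCORE_CLASSES}

total = sum(t + f for _, t, f in SCORE_CLASSES)
accepted_total = sum(t + f for name, t, f in SCORE_CLASSES if name in ACCEPTED)
accepted_false = sum(f for name, _, f in SCORE_CLASSES if name in ACCEPTED)

share_found = accepted_total / total          # URLs found with score > 0.6
link_error = accepted_false / accepted_total  # wrong links among those

# Enterprises expected to have a website whose URL is still to be found
expected_with_url = 47300
expected_found = round(expected_with_url * share_found)
expected_erroneous = round(expected_found * link_error)

print(total, round(share_found * 100, 1), round(link_error * 100, 1))
print(expected_found, expected_erroneous)
```

Moving the threshold changes both rates in opposite directions: a lower threshold accepts more URLs but with a higher share of wrong links, as the classification_error column shows.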