Links to Sustainable Cities
#LinksSDGs - Natural Language Processing and data visualization challenge entry by Abdulqadir Rashik

Abstract

The solution was developed as an interactive data visualization. The links extracted from the provided documents were classified as one of three types:

Causal Links
Barriers
Recommendations

The visualization initially presents the classified links as "flows", allowing the user to see at a glance how links of each type are divided proportionally among the SDGs. The width of the flow into any SDG is proportional to the number of links between it and SDG 11, so one can easily infer, for example, that the number of links to SDG 14 ("Life below Water") is very small, whereas the largest number of links was found to SDG 8 ("Decent Work and Economic Growth"). The actual number of links and its breakdown by type is shown at the bottom center when hovering over any SDG.

Clicking on an individual SDG opens a drill-down view with a detailed file-by-file breakdown of link types, and also displays the text of the links. The links can be filtered further by type.

Links have not been classified by directionality, so links of the same type are grouped together regardless of link direction.

Approaches

Apache Solr was used for text extraction, processing and classification. While Solr is primarily intended as a search server, it provides various features out of the box that make it a good choice for natural language processing. Three of its features made it particularly useful for the current visualization.

1. Its integration with Apache Tika allows direct processing of PDF files without resorting to separate PDF text extraction tools.
2. Synonyms allow for easy management of search queries.
By adding the different keywords for each SDG as its synonyms, we can run simple queries such as sdg1 AND sdg11 AND causal, and all the keyword substitutions for sdg1, sdg11 and causal are handled by Solr.
3. Stemming reduces words to their stems, so that a search matches every variation sharing the same stem. For example, searching for the keyword "Recommend" would match "Recommends", "Recommendation", "Recommending", etc., thus vastly reducing the number of keywords we need to specify.

The entire document text was tokenized into smaller fragments, which were then used for the actual search. Results can then be extracted and classified by running three queries per SDG, i.e. one for each type of link. Solr internally expands these three queries into a much larger number of keyword matches through synonyms and stemming. The resulting data was then filtered and cleaned up further, using a combination of manual and automated review to fine-tune the results.

Solutions

The solution provides a high-level overview of the links between SDG 11 (Sustainable Cities and Communities) and all other SDGs, with a drill-down capability for exploring the links between any specific SDG and SDG 11. The intention was to give policy makers a clutter-free overview of the links any SDG shares with SDG 11, while still allowing those focusing on individual SDGs to easily view the relevant data.

Where a link was found to be relevant to more than one SDG, the same link may be present under multiple SDGs. As a result, the number of links is much higher than the number of unique text snippets.

The high-level overview lets the user see how links are divided proportionally among the SDGs, as well as the proportion of each type of link within each SDG's overall links.
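The gap between link counts and unique snippets, and the per-type proportions within an SDG, can be illustrated with a minimal sketch. The sample data is invented purely for illustration, and the names `LINKS`, `count_links` and `type_shares` are hypothetical; they are not taken from the entry's own scripts.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: (snippet id, link type, SDGs the snippet relates to).
LINKS = [
    ("s1", "causal", [8, 9]),
    ("s2", "barrier", [8]),
    ("s3", "recommendation", [8, 14]),
    ("s4", "causal", [9]),
]

def count_links(links):
    """Return {sdg: Counter(link type -> count)}; a snippet counts once per SDG."""
    per_sdg = defaultdict(Counter)
    for _snippet_id, link_type, sdgs in links:
        for sdg in sdgs:
            per_sdg[sdg][link_type] += 1
    return dict(per_sdg)

def type_shares(counter):
    """Share of each link type within one SDG's links."""
    total = sum(counter.values())
    return {link_type: n / total for link_type, n in counter.items()}

counts = count_links(LINKS)
# Total links across all SDGs versus the number of distinct snippets.
total_links = sum(sum(c.values()) for c in counts.values())
unique_snippets = len({snippet_id for snippet_id, _, _ in LINKS})
```

In this toy data, four unique snippets produce six links, because two of the snippets are attached to two SDGs each.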
The widths of the incoming flows are scaled relative to the largest number of links found for any SDG. The flow into the SDG with the most links therefore covers the entire width of its SDG box, and the flow widths of all other SDGs are reduced proportionally. The incoming flows are further divided according to the proportions of causal, barrier and recommendation links, with each link type shown in its own color. This allows the user to see the proportion of each type of link for a particular SDG.

Hovering over any SDG shows the number of links to that SDG at the bottom center of the visualization.

Clicking on an individual SDG opens a drill-down view of the links with that specific SDG. The files containing links for that SDG, along with the number of links of each type within each file, are displayed in a sidebar on the right. The link text is displayed at the bottom, along with the file from which the text was extracted. At the top right corner of the links display, radio controls let the user filter the displayed links by link type.

Tools Utilized

For the textual analysis and classification:

Apache Solr was used for search.
The Python Requests library was used internally in custom developed scripts.

The following third party components are used in the final visualization:

D3.js was used as the graphing library.
Underscore.js was used for data processing.
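
As a rough illustration of the "three queries per SDG" scheme described under Approaches, a script might build its Solr requests along the following lines. This is only a sketch under stated assumptions: the host, the core name `sdg_links`, and the synonym tokens `barrier` and `recommendation` are hypothetical (only `sdg1`, `sdg11` and `causal` appear in the example query above), and the entry's actual scripts are not shown in the source.

```python
from urllib.parse import urlencode

# Assumption: a local Solr core named "sdg_links"; the real host and core are not given.
SOLR_SELECT_URL = "http://localhost:8983/solr/sdg_links/select"

# "causal" comes from the example query in the text; the other two tokens are assumed.
LINK_TYPES = ("causal", "barrier", "recommendation")

def build_queries(sdg, target=11):
    """Build one query string per link type for the pair (sdg, target)."""
    return {t: f"sdg{sdg} AND sdg{target} AND {t}" for t in LINK_TYPES}

def select_url(query, rows=100):
    """Full select URL for one query, asking Solr for a JSON response."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return SOLR_SELECT_URL + "?" + urlencode(params)

# Three queries for SDG 1; synonym and stemming expansion would happen inside Solr.
queries = build_queries(1)
urls = {link_type: select_url(q) for link_type, q in queries.items()}
```

Each URL could then be fetched from a script, for instance with `requests.get(url)`, and the matching text fragments read from the JSON response before the manual and automated cleanup step.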