Links to Sustainable Cities

#LinksSDGs - Natural Language Processing and data visualization challenge entry
by Abdulqadir Rashik
Abstract
The solution was developed as an interactive data visualization.
The links extracted from the provided documents were classified as one of:
Causal Links
Barriers
Recommendations
The visualization initially presents the classification of links as "flows", allowing the user to see
at a glance how the differently classified links are divided proportionally among the SDGs.
The width of the flow into any SDG is proportional to the number of links between it and SDG 11,
so one can easily infer, for example, that the number of links to SDG 14 ("Life below Water") is
very small, whereas the largest number of links was found to SDG 8 ("Decent Work and Economic
Growth").
The actual number of links and its breakdown by link type are shown at the bottom center
when hovering over any SDG.
Clicking on any individual SDG then opens a drill-down view with a detailed file-by-file breakdown
of link types, and also displays the text of the links.
Further filtering of the links by type is also possible.
Links have not been classified based on directionality, so links of the same type are grouped
together regardless of link direction.
Approaches
Apache Solr was used for text extraction, processing and classification.
While Solr is primarily intended as a search server, it provides various features out of the box
that make it a good choice for natural language processing.
Three of Apache Solr's features in particular made it useful for the current
visualization.
1. Its integration with Apache Tika allows for direct processing of PDF files without having to
resort to separate PDF text extraction tools.
2. Synonyms allow for easy management of search queries. By adding the different keywords
for each SDG as its synonyms, we can run simple queries such as
sdg1 AND sdg11 AND causal
and all the different keyword substitutions for sdg1, sdg11 and causal are handled by
Solr.
3. Stemming is a feature which reduces words to their stems, thus matching all the different
variations which share the same word stem.
For example, searching for the keyword "Recommend" would match "Recommends",
"Recommendation", "Recommending", etc., thus vastly reducing the number of keywords
we need to specify.
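As a rough illustration of how synonyms and stemming shrink the keyword list, the following Python sketch mimics both steps outside Solr. The synonym table and the suffix-stripping rule are hypothetical simplifications made up for this example; Solr's own synonym and stemming filters are considerably more sophisticated.

```python
# Hypothetical synonym table: each short query term stands in for many keywords.
SYNONYMS = {
    "sdg11": ["city", "urban", "settlement"],
    "causal": ["cause", "lead", "result"],
}

# Crude suffix stripping, used here only as a stand-in for a real stemmer.
SUFFIXES = ("ations", "ation", "ing", "es", "ed", "s")


def stem(word):
    """Lowercase a word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand(term):
    """Return the set of stems that a query term stands for."""
    keywords = SYNONYMS.get(term, [term])
    return {stem(keyword) for keyword in keywords}


def matches(term, text):
    """True if any word of the text shares a stem with the expanded term."""
    stems = expand(term)
    return any(stem(word.strip('.,;"')) in stems for word in text.split())
```

With this toy stemmer, "Recommends", "Recommendation" and "Recommending" all reduce to the stem "recommend", so a single keyword covers all three forms.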
The entire document text was tokenized into smaller fragments, which were then used for the
actual search. Results can then be extracted and classified by running three queries per SDG,
i.e. one for each type of link. These three queries are internally expanded by Solr into a
much larger number of keyword matches due to synonyms and stemming. The resulting data was
then filtered and cleaned up further, using a combination of manual and automated review, to
fine-tune the results.
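The per-SDG queries described above could be generated along these lines. This is a minimal sketch, not the author's actual script: the term names "barrier" and "recommendation", the core name "sdg_links" and the localhost URL are assumptions, and only the query "sdg1 AND sdg11 AND causal" comes from the text. The request goes to Solr's standard select handler using the Requests library.

```python
LINK_TYPES = ("causal", "barrier", "recommendation")  # last two names are assumed


def build_queries(sdg_number, target="sdg11"):
    """Return one query string per link type for the given SDG."""
    return {
        link_type: f"sdg{sdg_number} AND {target} AND {link_type}"
        for link_type in LINK_TYPES
    }


def run_query(query, base_url="http://localhost:8983/solr/sdg_links/select", rows=100):
    """Send one query to Solr's select handler and return the matching documents.

    The default URL and the core name "sdg_links" are hypothetical.
    """
    import requests  # third-party; imported here so build_queries works without it

    response = requests.get(
        base_url,
        params={"q": query, "rows": rows, "wt": "json"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["response"]["docs"]
```

Calling build_queries(1) yields the three query strings for SDG 1; each would then be passed to run_query against a running Solr instance.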
Solutions
The solution provides a high-level overview of the links between SDG 11 (Sustainable Cities and
Communities) and all other SDGs, with drill-down capability for exploring the links between
any specific SDG and SDG 11.
The intention was to allow policy makers to get a clutter-free overview of the shared links of any
SDG with SDG 11, while still allowing those focussing on individual SDGs to easily view relevant
data.
Where a link was found to be relevant to more than one SDG, the same link may appear under
multiple SDGs. As a result, the number of links is much higher than the number of unique
text snippets.
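The difference between the two counts can be seen in a small sketch. The records below are hypothetical placeholders, not data from the challenge documents.

```python
from collections import Counter

# Hypothetical records: (snippet_id, sdg) pairs; one snippet may link to several SDGs.
links = [
    ("snippet-a", 8),
    ("snippet-a", 9),
    ("snippet-b", 8),
    ("snippet-c", 14),
]

links_per_sdg = Counter(sdg for _, sdg in links)           # links counted per SDG
total_links = sum(links_per_sdg.values())                  # counts repeated snippets
unique_snippets = len({snippet for snippet, _ in links})   # counts each snippet once
```

Here "snippet-a" is counted once under SDG 8 and once under SDG 9, so the total number of links exceeds the number of unique snippets.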
The high-level overview lets the user see how links are divided proportionally among the
SDGs, as well as the proportion of each type of link within an SDG's overall links. The width of
each incoming flow is scaled relative to the maximum number of links for any SDG: the SDG
with the most links has a flow covering the entire width of its SDG box, and the flow widths of
all other SDGs are reduced proportionally.
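The width scaling just described amounts to dividing each SDG's link count by the largest count. A minimal sketch, assuming link counts keyed by SDG number and a box width in arbitrary units (both the function name and the sample numbers are hypothetical):

```python
def flow_widths(link_counts, box_width):
    """Scale each SDG's flow width relative to the SDG with the most links."""
    max_count = max(link_counts.values(), default=0)
    if max_count == 0:
        # No links at all: every flow has zero width.
        return {sdg: 0.0 for sdg in link_counts}
    return {
        sdg: box_width * count / max_count
        for sdg, count in link_counts.items()
    }


# Hypothetical counts for three SDGs and a box that is 100 units wide.
widths = flow_widths({8: 40, 3: 20, 14: 4}, box_width=100)
```

The SDG with the highest count receives the full box width, and every other SDG receives a fraction of it.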
The incoming links are further divided according to the proportions of causal, barrier and
recommendation links, with the flow for each link type distinguished by its color.
This allows the user to see the proportion of each type of link for a particular SDG.
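Splitting one SDG's flow by link type can be sketched in the same way. The function name and the sample counts are hypothetical; the actual visualization draws these bands with D3.js.

```python
def type_widths(total_width, type_counts):
    """Divide one SDG's flow width among link types in proportion to their counts."""
    total = sum(type_counts.values())
    if total == 0:
        return {link_type: 0.0 for link_type in type_counts}
    return {
        link_type: total_width * count / total
        for link_type, count in type_counts.items()
    }


# Hypothetical breakdown for one SDG whose flow is 100 units wide.
bands = type_widths(100, {"causal": 5, "barrier": 3, "recommendation": 2})
```

Each band's width is the SDG's total flow width multiplied by that link type's share of the SDG's links.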
Hovering over any SDG shows the number of links to that SDG at the bottom center of the
visualization.
Clicking on any individual SDG opens a drill-down view of the links with that specific SDG.
The files containing links for that SDG, as well as the number of each type of link within each file,
are displayed in a sidebar on the right. The link text is displayed at the bottom along with the file from
which the text was extracted.
At the top right corner of the links display, there are radio controls that let the user further filter
the displayed links by selecting the link type.
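The data behind the drill-down view can be approximated as follows. This is a sketch under assumptions: the record fields ("file", "type", "text"), the file names and the snippet texts are all hypothetical, and in the actual visualization this kind of processing was done in the browser with Underscore.js.

```python
from collections import Counter, defaultdict

# Hypothetical link records for a single SDG.
links = [
    {"file": "report_a.pdf", "type": "causal", "text": "Hypothetical snippet 1"},
    {"file": "report_a.pdf", "type": "barrier", "text": "Hypothetical snippet 2"},
    {"file": "report_b.pdf", "type": "causal", "text": "Hypothetical snippet 3"},
]


def filewise_breakdown(records):
    """Count links of each type per source file (the sidebar data)."""
    breakdown = defaultdict(Counter)
    for record in records:
        breakdown[record["file"]][record["type"]] += 1
    return breakdown


def filter_by_type(records, link_type=None):
    """Return only links of the given type; None means no filter is selected."""
    if link_type is None:
        return list(records)
    return [record for record in records if record["type"] == link_type]


breakdown = filewise_breakdown(links)
causal_only = filter_by_type(links, "causal")
```

The breakdown corresponds to the per-file counts in the sidebar, and filter_by_type corresponds to choosing one of the radio controls.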
Tools Utilized
For the textual analysis and classification:
Apache Solr was used for search.
The Python Requests library was used internally in custom developed scripts.
The following third-party components are used in the final visualization:
D3.js was used as the graphing library.
Underscore.js was used for data processing.