Text mining techniques to support e-democracy systems

Text mining techniques to support e-democracy systems
J. Cardeñosa1, C Gallardo2, J.M. Moreno-Jiménez 3,
Abstract. E-democracy is a cognitive democracy oriented to the extraction and democratization of the knowledge related with
the scientific resolution of public decision making problems associated with the governance of society. This paper presents a
qualitative approach based in text mining tools to identify voting tendencies from the analysis of the messages and comments
elicited by citizens through a collaborative tool (forum). The proposed methodology has been applied to a case study developed
with students of the Faculty of Economics at the University of Zaragoza and related with the potential location at the region of
Aragón (Spain), of the greatest leisure project in Europe (Gran Scala).
Keywords: E-Government, E-democracy, E-voting, knowledge extraction, text mining.
1
Introduction
In this paper, a qualitative approach based on text mining tools [1,2] have been used to identify the arguments embedded in
the messages that support the different positions of the intervenient actors of the exploration of the possible vote trends of a set
of citizens. [3,4,5].
The methodology followed in the qualitative approach is based on the analysis of the different patterns that an expert
defines in order to classify a citizen’s message. These patterns are implemented in a text mining system that tries to emulate the
lines of reasoning of the expert. The implemented text mining system is backed in the linguistic knowledge codified in a tailormade grammar and a specific lexical resource. Its main purpose is to ascribe each comment to the different positions identified
in the resolution of the problem.
The final and main aim of e-democracy is the definition of a general opinion shared by the majority of society by means of
the expressions of their preferences. The individual and social knowledge added value related with the scientific resolution of
problems, the associated citizens’ education and the improvement of transparency, participation and control will be some
aspects that increase the individual and social welfare.
This work is structured as follows: Section 2 presents e-democracy and the stages of its participation process; Section 3
presents the qualitative approach proposed for identifying trends applied to a specific case study; Section 4 presents the results
of the application of this methodology.
2
E-democracy: definition and participation process
E-democracy [6,7] is a new democratic model that comprises the use of electronic communication technologies, such as the
Internet, in enhancing democratic processes within a representative democracy or any other democratic model, keeping the
government closer to the governed thereby increasing its political legitimacy. E-democracy includes within its scope not only
electronic voting but also more recent approaches such as the elicitation of votes from the free expression of preferences
concerning to an specific matter.
To this end, e-democracy identifies the subjacent tendencies existing in the free opinions expressed in a public forum
(messages). Obviously, some of these messages can incorporate strategic opinions that do not correspond with the final
preferences.
The stages we have followed in the e-democracy process are:
Stage 1: Problem Establishment. Using the web, this stage identifies the relevant aspects of the problem: context, actors and
factors (mission, criteria, subcriteria, attributes and alternatives), as well as their interdependencies and relationships.
Stage 2: Discussion. Using any media or collaborative tool (forum in our case), and the citizens themselves (through the
network) give their motives and justify the decisions.
Stage 3: Expert Interpretation: An expert identifies the patterns that support the different trends of the actors through the
revision of a representative sample of the citizens’ messages/opinions.
Stage 4: Definition of the Interpreter System: The expert patterns are implemented in text mining grammars that, when
applied to the citizens’ messages, allows for the interpretation of the intended sense of citizens’ messages.
Stage 5: Intelligent Trend identification: The system automatically classifies each message according to the subjacent vote
tendency or citizen preference.
Notice that stages 1 and 2 are common to any e-democracy process. Stages 3 to 5 are the solution we propose to elicit
citizens’ preferences and intended vote so that they can be incorporated into subsequent decision-making processes. The next
section presents the five stages illustrated with a case study, focusing on stages 3 to 5 mainly.
3
A Qualitative Approach for Searching for Tendencies. A case study
In this section, we will illustrate the whole process (stages 1 to 5 identified in the previous section) applied to a case of
study following step by step the methodology described in the previous section. The context of this experience is about the
installation of Gran Scala in a region of Spain.
Gran Scala will be the largest leisure complex ever built in Europe. The project statistics are staggering: 17,000 million
Euros in investment; 1,000 million in State and 677 million Euros in Aragon Regional Government revenues through taxation.
Gran Scala is expected to receive 25 million visitors per year, create 65,000 direct and indirect jobs at its 70 hotels, 32 casinos,
five theme parks, museums, golf courses and racetracks. In other words, the project would transform the area into a town with a
population of 100,000. There is general controversy among citizens about the installation of a leisure complex in the dessert
area of Los Monegros (near Zaragoza in Spain).
3.1
Stage 1: Problem Establishment
Since its inception, the project has caused much debate and controversy in Aragonese society. For these reasons, the project
was selected for an electronic voting experiment with students from the Multicriteria Decision Making course of the 4th year of
the Bachelor of Science in Business Administration at the University of Zaragoza. The experiment was intended to elicit the
opinion and preferences of the students regarding the implementation of the Gran Scala Project in the “Los Monegros” district
for the identification of arguments that support the decisions by means of text mining techniques.
There is a two-fold objective: to know citizens preferences in general and to take into account their preferences when taking
the final decision (approval or rejection of the project). Citizens can interact in a web forum freely: they can post messages,
open debate threads, etc. They are asked to discuss about benefits, costs, risks, and opportunities of the installation of Gran
Scala.
3.2
Stage 2: Discussion
The discussion stage was performed by means of a debate session was initiated after the end of the first round of voting.
The implementation of the process was performed through a web interface in the form of a discussion forum. The discussion
process was prolonged during ten days (24th February until 4th March 2008) in order to facilitate the sharing of viewpoints,
information and feedback necessary to achieve an appropriate level of maturity in the participants’ judgments.
Once the discussion phase is opened and finished, its output are analysed so that the subjacent trends and preferences of
citizens can be incorporated into the decision-making process of the pertinent public/private administrations. This is done
following a knowledge-based methodology. The design and implementation of the whole process is described in the following
subsections.
3.3
Stage 3: The Expert Interpretation
The expert has at his disposal a set of 332 messages extracted from a web forum about the construction of Gran Scala. He
then classifies each message according to the following categories:
• A1: In favour of the implementation of the project.
• A2: Support for the majority’s position.
• A3: Against the implementation of the project.
• A1_A2: Position is not clear but surely the subject won’t vote against (not A3)
• A2_A3: Vote is not clear but surely the subject won’t vote in favour (not A1)
Messages are decontextualized so that it is avoided any other source of information (even the discourse thread is missed). The
expert then reads the messages, classifies each according to the five postulated categories, and marks up the expressions that
support the proposed classification. Table 1 shows some examples of tagged messages; supporting arguments are underlined.
Table 1. Example of messages and supporting tendencies.
MESSAGE
A1
I think you are right, it is not logical that this year in expo water is treated as a very important problems and that
in the next X years we will like to build an aquatic park in Monegros, where water will be seriously wasted.
A2
A3
9
9
I think that the installation of Gran Scala will bring lot of opportunities that will favour people’s continuity in the
area, specially young people. Besides, it is another way to promote tourism.
3.4
9
Stage 4: Definition of the Interpreter System
Once the expert has classified the messages and identified the supporting expressions, common features and regularities
have to be looked for. In general, simple categories (A1, A2 y A3) show clearer patterns than intermediate ones (A1&A2 and
A2&A3). So it can be stated that:
− Category A1 is characterized by the presence of verb phrases, noun phrases and words with positive connotation (that is,
they refer to realities that are commonly considered as good or positive).
− Category A3 is characterized by the presence of noun phrases and words with negative connotation and verb phrases that
describe a negative reality (for example, doesn’t respect the bases of sustainable development).
− Category A2 is identified by the absence of significant expressions.
− In categories A1_A2 and A2_A3, negative and positive expressions are intermingled along with the so-called hedges.
As can be seen, the key aspect to identify patterns is the presence of expressions with either positive or negative
connotation. Thus, this work requires lexical units (in particular, adjectives, nouns and verbs) to be classified as positive,
negative, or neutral. Hedges are expressions that lessen the degree of assertiveness of the speaker’s intention. Modal verbs like
“should”, “could” or adverbs like “possibly” or “perhaps” will be used for the identification of intermediate categories.
Thus, the system of automatic classification mainly seeks noun and verb phrases and then assigns a global connotation to
the message according to the nature of the identified expressions. The identification of phrases will be done using shallow
parsing [8,9].
3.4.1 Creation of the lexical resource
The set of messages to be mined presents two crucial features: a) they are framed in a closed domain, namely, the settling
Gran Scala; and b) they show a high degree of informality and all sorts of errors. This implies that a robust parser is required,
able to extract maximum information even with insufficient lexical and grammatical information.
The unique information registered in the dictionary is the grammatical category of the word. The postulated categories for
this work are for open categories: verb, noun, adjective, and adverb; for functional categories: determinant, conjunction,
preposition, and pronoun; and finally two ad hoc categories for the domain: hedge, and quantifier.
This classification is a simplification of the categories present in natural languages, but it is sufficient for this work where
linguistic accuracy is not a priority. Words belonging to open categories are marked with either positive or negative
connotation. This is the only semantic feature allowed in the dictionary. Dictionary entries have the following appearance:
mafia: SUS,negative.
[mafia]
puestos de trabajo: SUS,positive.
riesgos: SUS,negative.
[risks]
[job]
In order to avoid the excessive dependence of the dictionary on the corpus of available messages, we started from an
existent resource that was adapted to the requirements of the system. In particular, we consulted the Spanish Wordnet (Spanish
lexical resource developed at the EuroWordNet project) [10] in order to obtain a general purpose dictionary as well as it was the
source to infer the connotation of open categories words from the information codified in the synset (set of synonyms that
define a concept) and in the definitions.
3.4.2 Pattern codification
Once obtained the first version of the lexical resource, it is required to define a grammar that can identify simple phrases
and that assigns a positive/negative connotation to the phrase as a whole. Thus, the grammar is restricted to the identification of
noun, verb, and adjective phrases. As an example, let’s see the treatment of noun phrases.
The following rule implements a pattern for a generic positive noun phrase:
DET? AP? N AP* PP*
Where:
− DET is the class of determiners, AP stands for Adjective Phrase, N stands for Noun, and PP stands for Prepositional Phrase.
− The symbol ? denotes an optional element and * indicates that the element can be repeated zero o more times.
− At least one of these elements must have positive connotation except for the PP that can present a negative connotation, like
in the following examples:
creación de [puestos_de_trabajo]pos
[creation of jobs]
utilizaciónpos [de tierras infertiles]neg [use of waste lands]
3.5
Stage 5: Intelligent trend identification: assignment rules
Category labels are ascribed to a message according to the following rules, which constitute the core of the intelligent
automatic classification:
• If the message does not contain any expression, or the number of positive expressions equals the number of negative one,
then assign A2.
• If only one positive expression has been found, or there are hedges and there are more positive than negative expressions,
then assign A1_A2.
• If only one negative expression has been found, or there are hedges and there are more negative than positive expressions,
then assign A2_A3.
• If there are more positive expressions than negative, then assign A1.
• If there are more negative expressions than positive, then assign A3.
These rules are the intelligent core of the system and their refinement and extension are the basis for a more accurate voting
prediction system.
4
Results and comparative analysis
The output of the text mining system was compared to the manual classification of the expert. Table 2 shows the results of
the manual (based on the expert) and the automatic (based on text mining techniques) approaches. The first column (#Expert)
shows the number of messages that the expert has classified for the different Categories (A1 to A3) whereas the second column
(#Automatic) indicates the number of messages classified by the text-meaning system for each category. Although the figures
may look similar, it has to be remarked that the expert has identified 177 messages out of 332 as A1; whereas the automatic
procedure has assigned category A1 to 172 messages. However, only 128 comments are the same in both sets (column
#Coincide). Thus the degree of coincidence of both approaches is 128/177 (taking the expert as the reference), that is, 72.31%
(column % Coincidence). For each category, it is shown the total number and rate of coincidence between the expert and
automatic classification.
Table 2. Comparative analysis of the expert’s and automatic classification.
Category
A1
A1_A2
A2
A2_A3
A3
Total
#
Expert
177
31
93
21
10
332
#
Automatic
172
48
73
22
17
332
#
Coincide
128
9
33
3
4
%
Coincidence
72.31 %
29.03 %
35.48 %
14.28 %
40.00 %
Coincidence rate is significant for category A1 but rather low for the rest. A possible reason for this is that the positive
tendency is much more marked and salient than the others within the group of participants. It is also possible that both the
expert and the automatic procedure detect the positive expressions more accurately. Summing up, these results suggest that
supporting expressions in favour of the implantation of Gran Scala are more precise than the arguments against it. Besides, they
also point at the necessity of revision of the rules of the automatic procedure as well as the intensity of the arguments against the
implantation of the project, which the expert finds weaker.
As expected, intermediate categories, A1_A2 and A2_A3, present the lowest level of coincidence. Such categories do not
represent an intermediate position between A1 and A2 or between A2 and A2 but instead suggest a tendency towards A1 too
weak or even contradictory, since A2 is an equidistant option between the positive A1 and the negative A3.
The interpretation of these intermediate categories is not straightforward and possible a fuzzy interpretation may better
describe the results. Another option is to ask the expert to classify messages more accurately, since the automatic procedure
reflects the patterns identified by the expert, in the spirit of knowledge-based systems (experts systems).
Other ways of interpretation imply an enrichment of lexical resources (in particular, the semantic characterization of lexical
units as positive or negative predicates) and grammars in order to weight the force of a positive/negative expression taking into
account its position in the discourse thread.
5
Conclusions and future directions
What it seems clear is that this approach for capturing citizens’ opinions can be more reliable than traditional methods
based on opinions polls, since it is much easier to declare a false opinion than to elaborate a short but coherent message
exposing a false opinion. The options that current search systems provide, either based on social networks or forums, can
drastically change the methods for surveying citizens’ trends as well as the accuracy in the prospective evaluation of their
opinions. In this way, the evaluation of vote trends as well as the arguments that support opinion shifts can be of outmost
importance and can be achieved in almost real time and at a not very high cost.
6
References
[1] Feldman, R., Sanger, J.: The Text Mining Handbook. Cambridge University Press (2007)
[2] Kao, A., Poteet, S.R. (eds.): Natural Language Processing and Text Mining. Heidelberg, Springer (2007)
[3] Moreno-Jiménez, J.M., Escobar, M.T., Toncovich, A., Turón, A.: Arguments that support decisions in e-cognocracy: A
quantitative approach based on priorities intensities. In Miltiadyis D. Lytras et al. The Open Knowledge Society.
Communications in Computer and Information Sciences 19, 649--658 (2008).
[4] Moreno-Jiménez, J.M., Piles, J., Ruiz, J., Salazar, J.L., Sanz, A.: Some Notes on e-voting and e-cognocracy. Proceedings
E-Government Interoperability Conference 2007. Paris, France (2007).
[5] Moreno-Jiménez, J.M., Piles, J., Ruiz, J., Salazar, J.L.: E-cognising: the e-voting tool for e-cognocracy. Rio’s Int. Jour. on
Sciences of Industrial and Systems Engineering and Management 2(2), 25--40 (2008).
[6] Moreno-Jiménez, J. M., Polasek, W.: E-democracy and Knowledge. A Multicriteria Framework for the New Democratic
Era. Journal Multicriteria Decision Analysis 12, 163--176 (2003).
[7] Ríos D., Kersten, G.E., Rios, J., Grima, C.: Towards decision support for participatory democracy. Inf. Syst. E-Business
Management 6(2): 161-191 (2008)
[8] Steven Abney, Parsing By Chunks. In: Robert Berwick and Steven Abney and Carol Tenny, "Principle-Based Parsing”,
Kluwer Academic Publishers (1991)
[9] Special Issue on Shallow Parsing. J. Machine Learning Research. vol 2(4) (2002)
http://www.illc.uva.nl/EuroWordNet/.