Dialogue Enhanced, Machine Assisted
Requirements Elicitation
Ke Li
Submitted for the degree of Doctor of Philosophy
Heriot-Watt University
Department of Computer Science
April 2014
The copyright in this thesis is owned by the author. Any quotation from the
thesis or use of any of the information contained in it must acknowledge
this thesis as the source of the quotation or information.
ABSTRACT
The Requirements Elicitation process often involves extracting valuable information
from the wealth of extant domain-specific, natural language (NL) data to form the
requirements for building the future system. It also requires the collaboration of
stakeholders from different domains, working together to identify additional key
information and to clarify any ambiguity in the existing data. However, the highly
ambiguous and complex nature of natural language is often regarded as the main
obstacle preventing effective communication among stakeholders from different
domains and, therefore, success in Requirements Elicitation. Rather than focusing on
what can be gathered and/or extracted, this study introduces the concept that detecting
what is missing or ambiguous in the domain-relevant data represented in natural
language can guide stakeholders to provide additional domain-relevant information and
to clarify any ambiguity.
The research was carried out first through a preliminary experiment, using small case studies
involving undergraduate students. This provided the basis for understanding what
common mistakes might occur during the process of translating NL descriptions to OO
elements. Further investigations were conducted to develop Patterns that assist the
detection of ambiguity in NL domain descriptions, and Question Templates that
support user clarification. Overall, the method of investigation can be summarized as
test-build-test, which proved to be effective and efficient for this study.
This study proves the claim that it is possible to bridge the knowledge gap between non-technical and technical stakeholders by linguistics-based Patterns, and it also
demonstrates that non-technical stakeholders can provide valuable information from a
technical stakeholder’s point of view. However, this research can only be treated as a
preliminary study. For the concept to work effectively in the real world, a more
comprehensive repertoire of Patterns and Question Templates needs to be developed to
generate higher quality outcomes, in terms of both the correctness and completeness of
requirements.
ACKNOWLEDGEMENTS
I would like to thank my supervisors, Rob J. Pooley and Rick G. Dewar, for their
helpful feedback, advice and support. They have been a pleasure to work with.
My thanks go to Heriot-Watt University for the doctoral research studentship and to my
friends and colleagues there, who have supported me through my journey.
I also would like to thank all the individuals who took part in the research, which would
not have been possible without their support.
Finally, my thanks go to my family and friends, and especially my Mum, Wang Fang
and Shangfei; without their understanding and support, this thesis would not have been
possible.
Table of Contents
CHAPTER 1 Introduction ....................................................................................... 1
CHAPTER 2 Literature Review .............................................................................. 4
2.1 Requirements Engineering ................................................................................ 4
2.2 Requirements Elicitation (RE) .......................................................................... 5
2.2.1 Communication contents............................................................................... 6
2.2.2 Communication Mechanisms ........................................................................ 7
2.2.3 Communication Media .................................................................................. 8
2.3 Requirements Analysis (RA) ............................................................................ 9
2.4 Object-oriented modelling .............................................................................. 10
2.5 Linguistic Analysis ......................................................................................... 11
2.6 Natural language processing Tools ................................................................. 12
2.6.1 Mappings between OO and NL................................................................... 13
2.6.2 CASE Tools for automatic class identification ........................................... 15
CHAPTER 3 Hypothesis ....................................................................................... 21
3.1 Context of the problem.................................................................................... 21
3.2 The hypothesis ................................................................................................ 21
3.2.1 The Challenges ............................................................................................ 23
3.3 Methodology ................................................................................................... 24
3.3.1 Preliminary experiment ............................................................................... 25
3.3.2 Development of the Approach .................................................................... 25
3.3.3 1st Experimental study ................................................................................ 26
3.3.4 2nd Experimental study............................................................................... 26
CHAPTER 4 Preliminary Experiment .................................................................. 27
4.1 The choice of participants ............................................................................... 27
4.2 The choice of notation ..................................................................................... 27
4.3 The choice of scenarios ................................................................................... 28
4.3.1 Method ........................................................................................................ 28
4.3.2 Analysis ....................................................................................................... 29
4.3.3 Overall Results ............................................................................................ 31
4.3.4 Discussions.................................................................................................. 32
4.3.5 Conclusions ................................................................................................. 36
CHAPTER 5 Approach Used ................................................................................ 37
5.1 The proposed approach ................................................................................... 37
5.2 Mappings between NL elements and OO concepts ........................................ 38
5.2.1 Mappings between Noun and OO concepts ................................................ 38
5.2.2 Mappings between Verbs and OO concepts ............................................... 41
5.2.3 Mappings between NL and OO relationships ............................................. 42
5.3 Patterns ............................................................................................................ 45
5.3.1 Patterns for Identifying Class ...................................................................... 45
5.3.2 Patterns for Identifying Relationships ......................................................... 47
5.3.3 Patterns for Associating attribute and class ................................................ 49
5.4 Question Generation & User Participation ..................................................... 51
5.4.1 Question formats ......................................................................................... 51
5.4.2 Associating Question with Pattern .............................................................. 52
CHAPTER 6 Implementation ............................................................................... 55
6.1 The Requirements ........................................................................................... 55
6.1.1 Pattern based class identification ................................................................ 57
6.1.2 Question generation .................................................................................... 60
6.2 System Design ................................................................................................. 60
6.2.1 Class identification ...................................................................................... 62
6.2.2 Dialog Interface ........................................................................................... 65
CHAPTER 7 Experimental Studies ...................................................................... 70
7.1 The design of experiments .............................................................................. 70
7.2 Patterns ............................................................................................................ 72
7.2.1 Stage one: Patterns Implementation ............................................................ 72
7.2.2 Stage two: Extendibility, Evaluation of Pattern Extension ......................... 84
7.3 Question Templates ......................................................................................... 86
7.3.1 Stage one: experimental study of question generation ................................ 86
7.4 The Chosen Scenario and Participants ............................................................ 90
CHAPTER 8 Discussions...................................................................................... 92
8.1 Stage one ......................................................................................................... 92
8.2 Stage two ......................................................................................................... 94
8.3 Stage three ....................................................................................................... 95
CHAPTER 9 Conclusion & Future Research ....................................................... 97
9.1 Conclusions ..................................................................................................... 97
9.2 Future research .............................................................................................. 101
Appendix A: list of POS tags ........................................................................................ 102
Appendix B: examples of question generation ............................................................. 103
Appendix C ....................................................................................................... 108
Appendix D ....................................................................................................... 109
Appendix E ....................................................................................................... 110
Bibliography ....................................................................................................... 111
List of Tables and Figures
TABLE 2-1. OO CONCEPTS AND NL ELEMENTS – A ................................................................................................ 14
TABLE 2-2. OO CONCEPTS AND NL ELEMENTS – B ................................................................................................ 15
TABLE 2-3. OO CONCEPTS AND NL ELEMENTS – C ................................................................................................ 15
TABLE 4-1: DOMAIN CONCEPTS IDENTIFIED IN BANK SYSTEM................................................................................... 30
TABLE 4-2: DELIVERED DOMAIN CONCEPTS OF ELEVATOR PROBLEM ......................................................................... 31
TABLE 4-3: DELIVERED DOMAIN CONCEPTS OF FILM MANAGEMENT ......................................................................... 31
TABLE 5-1 MAPPINGS BETWEEN NOUNS AND OO CONCEPTS DEFINED IN THIS STUDY ................................................... 41
TABLE 5-2. MAPPINGS BETWEEN VERBS AND OO CONCEPTS ................................................................................... 41
TABLE 5-3. CHOICES OF COMBINATIONS OF WORDS IN COMPOUND NOUNS ................................................................ 42
TABLE 5-4. MAPPINGS BETWEEN NL AND OO RELATIONSHIPS ................................................................................ 44
TABLE 5-5. LIST OF POS TAGS ........................................................................................................................... 46
TABLE 5-6 NOUN ASSOCIATED PATTERNS ............................................................................................................ 46
TABLE 5-7. VB RELATED PATTERNS, BASE FORM .................................................................................................... 47
TABLE 5-8. VBP RELATED PATTERNS, BASE FORM, = VB ......................................................................................... 47
TABLE 5-9. VBZ RELATED PATTERNS, 3RD PERSON SINGULAR PRESENT ...................................................................... 48
TABLE 5-10. VBD RELATED PATTERNS, VERB PAST TENSE ...................................................................................... 48
TABLE 5-11. VBN RELATED PATTERNS, VERB PAST PARTICIPLE ................................................................................ 48
TABLE 5-12. VBG RELATED PATTERNS, VERB GERUND & VERB PRESENT PARTICIPLE ................................................... 48
TABLE 5-13. QUESTION TEMPLATE AND CHOICES OF ANSWER .................................................................................. 52
TABLE 7-1. ASPECTS CONSIDERED FOR OVERALL EVALUATION .................................................................................. 71
TABLE 7-2. MOC RESULT OF EACH SCENARIO FROM EACH ASPECT ............................................................................ 78
TABLE 7-3. THE COMPARISON OF CLASS IDENTIFICATION AND MOC IDENTIFICATION .................................................... 79
TABLE 7-4. MIC RESULT OF EACH SCENARIO FROM EACH ASPECT.............................................................................. 80
TABLE 7-5. COMPARISON OF CLASS IDENTIFICATION AND MIC IDENTIFICATION ........................................................... 81
TABLE 7-6. APPEARANCE OF DIFFERENT TYPE OF COMPOUND NOUNS IN EACH SCENARIO.............................................. 81
TABLE 7-7. IDENTIFIED REASONS FOR INCORRECT AND MISSING MOC ....................................................................... 82
TABLE 7-8. IDENTIFIED REASONS FOR INCORRECT AND MISSING MIC......................................................................... 82
TABLE 7-9. TAXONOMY OF ERROR TYPES & CAUSES OF ERRORS .............................................................................. 83
TABLE 7-10. EXTENSION OF PATTERNS ................................................................................................................ 84
TABLE 7-11. CLASS IDENTIFICATION OF ELEVATOR PROBLEM ................................................................................... 89
TABLE 7-12. THREE SCENARIOS COMPLEXITY ....................................................................................................... 90
FIGURE 3-1 PLAN OF WORK .............................................................................................................................. 24
FIGURE 4-1 SCENARIO OF EXPERIMENTAL DESIGN .................................................................................................. 29
FIGURE 4-2 TAXONOMY OF ERROR TYPES ............................................................................................................ 32
FIGURE 5-1. APPLIED NOUN ASSOCIATED PATTERNS ............................................................................................... 47
FIGURE 5-2. APPLYING THE SVO PATTERN ........................................................................................................... 49
FIGURE 5-3. APPLIED PATTERNS FOR ASSOCIATING ATTRIBUTES AND CLASS ................................................................ 50
FIGURE 5-4. EXAMPLE OF GENERATING QUESTION FOR MISSING INTERACTIVECLASS .................................................... 53
FIGURE 6-1. AN EXAMPLE OF A PARSED SENTENCE................................................................................................. 58
FIGURE 6-2. EXAMPLE OF GRAMMAR TREE REPRESENTATION ................................................................................... 59
FIGURE 6-3 IDENTIFIED CLASSES IN XML ............................................................................................................. 59
FIGURE 6-4 SYSTEM ARCHITECTURE AND TECHNICAL SOLUTIONS .............................................................................. 62
FIGURE 6-5. DATA FLOW CHART FOR DOMAIN DESCRIPTION PROCESSING ................................................................... 64
FIGURE 6-6 DATA FLOW CHART FOR DIALOG INTERFACE .......................................................................................... 66
FIGURE 7-1. PERFORMANCE ON CLASS IDENTIFICATION .......................................................................................... 75
FIGURE 7-2. PERFORMANCE ON METHOD IDENTIFICATION ...................................................................................... 77
FIGURE 7-3. PERFORMANCE ON MOC IDENTIFICATION .......................................................................................... 78
FIGURE 7-4. PERFORMANCE ON MIC IDENTIFICATION ........................................................................................... 80
FIGURE 7-5. “EACH FLOOR, EXCEPT THE FIRST FLOOR AND TOP FLOOR, HAS TWO BUTTONS, ONE TO REQUEST AN UP-ELEVATOR AND ONE TO REQUEST A DOWN-ELEVATOR” APPENDIX D ............................................................. 83
FIGURE 7-6. IMPROVEMENTS BY THE PATTERN EXTENSION ..................................................................................... 85
CHAPTER 1 - Introduction
Requirements Engineering (REng) is about discovering and specifying requirements
that a software system needs to meet in terms of satisfying customers. The objectives of
REng are mainly concerned with understanding the customers’ real world, including
terminology, concepts, viewpoints and goals, and discovering the data that is essential
for building a system, as well as translating such real world problem domains into
formal specifications that form the basis of design, implementation, testing and
maintenance [Nuseibeh, B.; Easterbrook, S. (2000)].
The nature of REng is claimed to be multi-disciplinary and human-centred. The process
not only involves a large variety of data (e.g. features/functions, performance, hardware
constraints, interface and glossary, etc.); it also involves heterogeneous groups of
stakeholders, who hold different domain knowledge, perspectives, viewpoints and
assumptions [Nuseibeh, B.; Easterbrook, S. (2000)].
Supporting effective communication among stakeholders therefore becomes the key
issue in REng. The consequences of failing to produce quality requirements can be
severe and costly; hence the success of REng is crucial in software development.
This research aims to develop a mechanism that facilitates efficient and quality
communication during requirements elicitation and analysis to form a solid foundation
for system design.
The technique is designed to be used in the early stage of the REng process,
requirements elicitation, to help requirements engineers to understand the problem
domain and as an outcome to identify candidate classes that can be used to form the
object-oriented conceptual model of the problem domain. The generation of a domain
model is not provided in this study, but could be added as an extra feature in further
studies.
The domain description as the initial input to the system is ideally provided directly by
the domain experts, but the approach developed here is not limited to this; any English
language documents that describe some or all of the domain can also be used as part of
the initial inputs.
The typical limits on the size and/or the complexity of the problem domain that can be
handled by the approach described are difficult to judge, and there was no particular
experiment to address them. However, what can be used as an indication is the
number of Object Oriented Concepts involved in the chosen scenario. In the
experimental studies the maximum number of Object Oriented Concepts the technique
was shown to handle was 49 candidate classes and 24 candidate methods with 33
candidate relationships. Although this can be taken as a safe limit within which the
technique is known to work, there is no indication that a domain containing more would fail.
However, the actual limits of the technique in terms of complexity will require further
investigation in future studies.
The main contribution of this study is that this technique does not rely on a perfect
domain description to work. It can be used to find a solution to the ambiguity of
natural language (English). Hence, there are no other particular assumptions about the
domain descriptions, in terms of grammatical quality and well-formed sentences,
except that they must be written in English. As an example, this technique deals with
sub-sentences, where the subject noun, object noun or even both may be missing in
many cases. Such cases are instead further clarified by the answers of domain experts
via the question generation process. For more details please refer to Chapter 5.
The results obtained through the experimental studies prove that it is possible, using NL
processing to search for grammatical structures, to target any partially satisfied
candidates through built-in Object Oriented Concept assessment. Considering these
candidates through question generation, the partially satisfied candidates (which can
be seen as ambiguities in the input) can be clarified by domain experts. Regardless of
whether these are domain relevant or not, an improved overall quality of object-oriented
concept identification resulted. However, the minimum requirements on the grammatical
quality of domain descriptions, and the impact this can have on the overall quality of the
outputs, would require further investigation.
The rest of this thesis is structured as follows:
Chapter 2 presents a detailed survey of relevant literature.
Chapter 3 defines and explains the hypothesis of this study and briefly discusses
the proposed approach.
Chapter 4 presents the preliminary experiment, aiming to understand the
obstacles faced in identifying OO concepts from text.
Chapter 5 presents the approach applied in this study.
Chapter 6 describes how the approach is implemented.
Chapter 7 presents the experimental studies and discusses the performance of
each component.
Chapter 8 presents the evaluation of the methodology applied.
Chapter 9 discusses what has been achieved and therefore how each hypothesis of
this study is evaluated, and what further work is involved to resolve any
limitations of this study.
CHAPTER 2 - Literature Review
To fully understand problems existing in Requirements Engineering, the relevant
literature was studied in detail; this includes previous contributions in requirements
engineering, requirements elicitation, requirements analysis, object-oriented modelling,
linguistic analysis and natural language processing tools. Each one is presented in the
following sections.
2.1 Requirements Engineering
Since Requirements Engineering (REng) emerged as a discipline of its own within
software engineering in the early 1990s, some major progress has been made [Nuseibeh,
B.; Easterbrook, S. (2000)]. The REng process involves Requirements Elicitation (RE),
Requirements Analysis (RA), Requirements Specification (RS) and Requirements
Verification (RV) [Loucopoulos, P. Karakostas, V. (1995)]. As a result of a REng
process, a quality Software/System Requirements Specification (SRS) is constructed,
which is viewed as the blueprint of the software product under development. SRS
identifies not just what needs to be built from a developer’s perspective but also what
can be expected from a customer’s perspective [Kotonya, G. Sommerville, I. (1998)].
Thus, it is important that the SRS is clear and accessible to different stakeholders.
In reality, requirements do not just exist in the textual documentation produced in customers’
working environments to carry out day-to-day jobs, but also in the surrounding physical
reality [Ryan, K. (1992)]. Nuseibeh and Easterbrook point out that REng is largely
about communication [Nuseibeh, B.; Easterbrook, S. (2000)], where system analysts act
as translators between the two specialist worlds of computing and the application
domain [Ryan, K. (1992)]. The success of requirements discovery ultimately depends
on the experience of the system analysts. Despite their experience, how efficiently
people can communicate and cooperate as well as how such relationships are
maintained has a major impact on the quality of the outcome. Communication has been
recognized as the main obstacle in discovering requirements, especially in customer-centred REng [Beyer H.R. and Holtzblatt K. (1995)], [Saiedian, H. Dale, R. (2000)] that
encourages customers’ participation in the REng process at the earliest possible time.
Hence effective communication should support customers in describing their whole
environment, then what they want to change and in what way, and most importantly
should avoid misunderstanding during the whole process.
One major improvement in REng is the use of modelling techniques for describing the
application domain. This has led to different domain modelling languages for particular
types of system, creating a view of its problem domain. Modelling techniques provide
easy access to views of the problem domain, as well as helping developers gather
domain-relevant information from domain experts more accurately.
However, a modelling language cannot help customers to share their domain knowledge
on its own as it is not a natural way for people to communicate.
2.2 Requirements Elicitation (RE)
RE applies techniques such as meetings and interviews (questionnaire interviews,
open-ended interviews) [Goguen, J.A. Linde, C. (1993)] to elicit domain-relevant
information for building the future system. These are discussed in [Li, K. (2004)].
Considering the critical role of communication in REng, the literature is reviewed
from three aspects:
Communication content, referring to what types of information need to be
collected;
Communication mechanisms, referring to the manner by which the
communication is performed; and
Communication media, referring to the linguistics or the languages involved.
2.2.1 Communication contents
A number of approaches to improving communication have been proposed in the
literature.
The Soft Systems approach provides a broader systematic view to the
requirements engineer and helps them observe and model the real world to develop an
understanding of real-world problems [Checkland P. (1981) ][Checkland, P.; Poulter, J.
(2006) ].
Systems Archaeology extracts and elicits requirements from existing
documents [Shlaer, S.; Mellor, S.J. (1988)]. Functional Partitioning, i.e. task analysis,
decomposition methods and Business Events, helps to derive a functional partitioning
solution from functional aspects of a system.
In the 1970s, Ivar Jacobson developed the Use Case technique that partitions functions
based on business events to capture functional requirements, and discovers physical
activity sequences of each use case by Scenario [Jacobson, I.; Christerson, M.; Jonsson,
P.; Overgaard, G. (1992) ]. It is now widely accepted as one of the best ways to link
Standard English based customer requirements to system design for Object-Oriented
system development. Reusing Requirements helps to define consistent requirements
to increase the opportunity of reusing defined knowledge. Models provide reusable
knowledge/information that cover general features shared by target problem domains,
such as, patterns [Coad, P.; North, D.; Mayfiel, M. (1995)], frames [Constantopoulos P.
Jarke M. Mylopoulos J. Vassiliou Y.(1991)], problem frames [Jackson (1995)], card
sorting [Maiden N.A.M., Mistry P., Sutcliffe A.G. (1995) ], and object system models
[Sutcliffe, A. Maiden, N. (1998)] [Maiden N.A.M., Hare M. (1998) ].
Firesmith suggested that a system should be described with a set of objects and
identified interfaces between such objects rather than between functions [Firesmith D.
(1991) ]. However, it is often impossible for customers to identify objects that persist
throughout the system development life-cycle. Thus the issue is that, before object
decomposition can take place, we need to decide how to extend the object-oriented
approach to help customers in describing their problems. Another issue is that object-oriented modelling specifies how the elements of a system interact with each other
internally. Use Case addresses such issues by drawing the boundary of the system and
showing interactions between the system and its environment. However, only a partial
environment is involved, which encompasses stakeholders, other systems and/or
hardware. The question is, “What are the relationships and interactions between the rest
of concrete real objects/entities of the problem domain and the system?”
2.2.2 Communication Mechanisms
Interview and Survey are common techniques adopted for gathering requirements from
customers. A group approach has also been applied to requirements elicitation, such as
Family Therapy [Satir, V. Banmen, J. Gerber, J Gomori, M. (1991)] and Brainstorming
[Bolton R. (1979) ], to avoid misinterpretations as well as to generate good solutions.
Mind Mapping [Buzan T. Buzan B. (1995) ] and Neurolinguistic Programming
[O’Connor, J. Seymour, J. (1990)] techniques facilitate communication (during
observation or interviews) by using some extensive and meaningful words, pictures,
symbols as well as colour and supporting effective thinking and acting. In order to help
requirements engineers derive and reuse requirements and enhance sharing of
customers’ dialogue and behaviour, especially in a large organization, Video has been
adopted to make a record of communications and working processes [Robertson, S.
(2001)] [Holtzblatt K. and Beyer H.R. (1995)].
The main disadvantages of such communication techniques are: 1) it is hard to identify
the most productive data from the vast amount of information gathered; 2) they are
inappropriate for obtaining and discovering the detailed information essential for
building a system; and 3) “Relying exclusively on NL to communicate so that the
weakness of NL can hardly be avoided” [Saiedian, H. Dale,
R. (2000)] [Holtzblatt K. and Beyer H.R. (1995)]. It is argued by Beyer that traditional
communication techniques are inadequate for gathering customer requirements [Beyer
H.R. and Holtzblatt K. (1995)].
In the mid 1990s, REng moved a step further towards a customer-centred approach by
addressing the psychological and sociological issues to enhance communication
[Holtzblatt K. and Beyer H.R. (1995)]. Beyer and Holtzblatt proposed a model of an
apprenticing relationship, and claimed that a better learning and understanding can be
derived from practice.
The underlying theory of Apprenticing is that: 1) from a customer’s point of view,
doing things enhances the ability to describe better than talking about them does;
2) from a requirements engineer’s viewpoint, seeing the work helps to discover
the problem details and structure.
However, this approach is criticised for the
uncertainty of time taken, the difficulty of relationship maintenance and the high
dependence on the ability of designers [Beyer H.R. and Holtzblatt K. (1995)], [Saiedian,
H. Dale, R. (2000)].
2.2.3 Communication Media
As a medium of communication, natural language is familiar to all the stakeholders, but
is often problematic. The issues of cognitive limitations and vocabulary differences,
inherent ambiguity, misunderstanding and ineffective use are very difficult to address
[Alexander I.F. (1997)]. Hence, Linguistic analysis is very important in order to clearly
and unambiguously discern customer needs and system requirements as well as
supporting effective communication [Nuseibeh, B.; Easterbrook, S. (2000)].
For example, the NTMS (Naturalness Theoretical Morpho-Syntax) model is a
generatively oriented grammar model [Fliedl, G.; Mayerthaler, W.; Winkler, C. (1999)];
KCPM (Klagenfurt Conceptual Predesign Model) [Fliedl, G.; Kop, C.; Mayerthaler, W.;
Mayr, H.C.; Winkler, C. (1997)] maps linguistic elements onto a set of modelling
concepts (notions/types) to collect and catalogue data of a problem domain, supporting a
harmonized view of a given problem domain among stakeholders. However, this kind
of approach is based on a condition that customers have to provide a document (user
requirements) before any linguistic analysis can take place. Nevertheless, Robertson
[Robertson, S. (2001)] claimed that “The problem of requirements is that of why did
you not tell me what you want?”
An important effort in requirements elicitation is the adoption of Natural Language
Processing (NLP) to facilitate automation. Built on the foundation of existing mappings
between natural language (NL) elements and Object-Oriented (OO) concepts [Booch G.
(1994)],[Rumbaugh, J. Blaha, M. Premerlani, W. Eddy, F. Lorensen W. (1991)], the
NLP-based approach demonstrates the capability of key information detection from
domain-specific NL descriptions and OO model generation (e.g. Mich, L. (1996),
Delisle S. Barker K. Biskri I. (1999), Harmain H.M. and Gaizauskas R. (2000) and
Borstler J. (1996)). However, inherited from the informal nature of natural language,
the input descriptions often lack precision, completeness and consistency, resulting in
only a first-cut OO model, where further communication with stakeholders is essential.
Without the participation of stakeholders such NLP-based approaches can hardly go
beyond toy examples. Nevertheless, supporting communication still remains an open
issue. In our view, there is an urgent demand for mechanisms to enable stakeholders to
interact with the system in terms of clarifying ambiguity and complementing the
missing, yet essential, domain data that can be highlighted by the output of an NLP
system.
2.3 Requirements Analysis (RA)
The activity of understanding the problem to be solved is called RA (or domain
analysis) in software development. RA primarily applies modelling techniques of either
the structured paradigm or the object-oriented paradigm. Since object-orientation was
popularized by Booch [Booch G. (1994)], Rumbaugh [Rumbaugh, J. Blaha, M.
Premerlani, W. Eddy, F. Lorensen W. (1991)] and others, object-oriented domain
analysis has become widely applied in practice. One of the proposed strengths of object
orientation, in contrast with earlier structured approaches, is the belief that objects can
bridge the communication gap between domain concepts and software system elements
[Korthaus, A. (1998)] [Loos, P. and Fettke, P. (2001)] [Mentzas, G. (1999)] and,
therefore, between end-users and designers.
However the problem needs to be
understood before the objects can be identified. A good knowledge and thorough
understanding of the problem domain is a necessary foundation in order to perform
object-oriented domain analysis.
So, understanding the problem is vital for successful development in the real world. In
terms of object-oriented domain analysis, a number of approaches have been applied to
help technical experts understand the problem, e.g., use case analysis [Jacobson, I.;
Christerson, M.; Jonsson, P.; Overgaard, G. (1992) ], scenarios, and the “what vs. how”
problem approach [Jackson, M. (1994)]. Most of these techniques suggest ways to
assist requirements analysts in communicating with non-technical stakeholders more
effectively, thereby obtaining and organizing valuable information that forms a basis for
both the requirements for the system being developed (e.g. use case models) and the
structuring of its design (e.g. class diagrams).
No matter what methodology is applied and what techniques are used, there is no doubt
that natural language is unavoidable in domain analysis. Hence, analyzing such natural
language based data to develop domain relevant concepts is crucial in both practice and
teaching. Many attempts have been made to address this issue, including the
development of automated natural language processing systems, applying linguistic
techniques to discover domain relevant concepts, and then observing mappings
between object-oriented concepts and natural language elements [Li, K. Dewar, R.G.
Pooley, R.J. (2005)].
2.4 Object-oriented modelling
The popularity of object-oriented programming has led to the investigation of Object-Oriented Modelling (OOM), from early object-oriented system analysis [Shlaer, S.;
Mellor, S.J. (1988)], to object-oriented analysis [Coad, P.; Yourdon, E. (1990)] and
Object-oriented Modelling Technique (OMT) [Rumbaugh, J. Blaha, M. Premerlani, W.
Eddy, F. Lorensen W. (1991)], which aim to model the behaviours and interactions of
software systems.
More recently, four widely used object-oriented modelling
techniques were integrated into a standard modelling language for the object-oriented
paradigm, the Unified Modelling Language (UML) [Booch G. Rumbaugh J. Jacobson I.
(1999)]. The fact that OOM now uses the same building blocks (classes, objects,
methods, messages and inheritance, etc.) in design and implementation as in RA
facilitates further development in REng [Dawson L. Swatman P. (1999)]. Use Case
[Jacobson, I.; Christerson, M.; Jonsson, P.; Overgaard, G. (1992) ] [Jacobson, I. Booch,
G. Rumbaugh J. (1999)] [Jacobson, I. Christerson, M. (1995) ] [Jacobson, I. (1995)]
focuses on functionalities from the customer’s perspective, and is supported by Use
Case Map (UCM). UCM can be integrated with languages and notations such as UML,
GRL [Mylopoulos, J. Chung, L. Yu, E. (1999)] and User Requirements Notation
(URN). Scenarios [Jacobson, I.; Christerson, M.; Jonsson, P.; Overgaard, G. (1992)
[Kotonya, G. Sommerville, I. (1998)] are also used to describe the behaviours of a
system.
Goal-oriented modelling has been proposed to complement object-oriented modelling,
particularly for non-functional requirements (NFR) analysis and modelling. It is
supported by the Goal-oriented Requirements Language (GRL), which emphasizes the
construction of non-functional requirements. Goal-oriented modelling is facilitated by a
graphical user interface, the Organization Modelling Environment (OME), developed at
the University of Toronto [Horkoff, J.; Yu, E. (2011)].
In the object-oriented paradigm, a system is described with a set of objects that
communicate with each other through identified interfaces. The applied concepts are
closer to real-world situations [Juristo, N. Moreno, A.M. Lopez, M. (2000)]. Thus, the
primary task is to discover the essential objects at the earliest possible time in
development. However, the major difficulty is the identification of essential objects
that persist throughout the system development process.
2.5 Linguistic Analysis
Linguistic Analysis can be divided into three levels, Lexical level analysis, Syntactic
level analysis and Semantic level analysis. Each level applies different tagging or
annotation techniques; these are grammatical tagging, syntactic tagging and semantic
tagging. Grammatical tagging - also called POS (Part-Of-Speech) tagging - is the
commonest form of corpus annotation. In the tagging process, each word or word
combination of the input text is assigned a POS tag. Existing tools for assisting
linguistic analysis include CLAWS, developed by Lancaster University, and the
Stanford Parser, developed by the Stanford Natural Language Processing Group at
Stanford University [Klein, D.; Manning, C.D. (2003)].
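To make the tagging step concrete, the following minimal sketch tags a short domain sentence and prints one Penn Treebank style tag per token. NLTK is assumed here purely for illustration; CLAWS and the Stanford Parser mentioned above produce comparable annotations.

    # A minimal POS-tagging sketch (NLTK is assumed for illustration;
    # CLAWS or the Stanford Parser would serve the same purpose).
    import nltk

    # One-off downloads of the tokenizer and tagger models.
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    sentence = "The elevator visits the corresponding floor."
    tokens = nltk.word_tokenize(sentence)

    # Each token is paired with a Penn Treebank POS tag, e.g. DT (determiner),
    # NN (singular noun), VBZ (verb, 3rd person singular present).
    for word, tag in nltk.pos_tag(tokens):
        print(f"{word:15s} {tag}")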
NL is the primary format of information expression in the customer’s world, whether
informal or formal, and it allows customers access to requirements specifications for
validation purposes.
However, NL is very ambiguous. The following example demonstrates this nature of
NL: "Flying planes can be dangerous". The confusions arising from this sentence include:
Who is in danger? Is it the pilots in the planes or the people on the ground?
Should “can” be interpreted as a noun or as a verb?
What is the meaning of "plane"? Is it an airplane, a geometric object, or a
woodworking tool?
This kind of approach requires and relies on an initial input document written by
customers, expressed in either informal or formal NL. Producing such expressions often
makes customers feel stressed, especially those lacking the necessary visualization,
abstraction, description and language skills [Karten, N. (1994)]. Overall, the output of
such approaches depends on the quality of the input document, and the further issue of
discovering non-described requirements can hardly be addressed.
2.6 Natural language processing Tools
Motivated by the performance of linguistic techniques, many efforts have been made to
apply linguistic techniques to assist the extraction of key domain concepts. Such
attempts can be classified into two different types of approach: 1) those involving the use of
POS, exploring the mappings between the concepts of software systems and natural
language elements to extract important domain concepts [Chen P. P.-S. (1983)],
[Halliday M.A.K. (2003)], [Juristo, N. Moreno, A.M. Lopez, M. (2000)]; however,
there are no agreed simple rules to guide the identification of domain relevant concepts;
and 2) those involving the use of special words, structures or fundamental semantic
relationships between words, e.g. automatic term recognition [Kageura, K., Umino, B.
(1996)], using techniques like stemming [Porter, M.F. (1977)] and lemmatization
[Snajder, J., Basic, B.D., Tadic, M. (2008)], the performance of which relies on the
identification of a word as a lexical entity. Another effort, by [Goldin L., and Berry
D.M. (1997)], achieved identification of different word morphologies by setting
the minimum length of string to be matched.
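To make the difference between the two techniques concrete, the short sketch below applies a Porter stemmer and a WordNet lemmatizer to a few word forms. NLTK is assumed as the implementation here for illustration; the cited works use their own implementations and parameters.

    # A brief sketch contrasting stemming and lemmatization (NLTK is assumed).
    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("wordnet", quiet=True)
    nltk.download("omw-1.4", quiet=True)

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for word in ["elevators", "visited", "moving", "libraries"]:
        # Stemming clips suffixes by rule and may yield non-words (e.g. "librari");
        # lemmatization maps each form back to a dictionary head word.
        print(word,
              "| stem:", stemmer.stem(word),
              "| noun lemma:", lemmatizer.lemmatize(word, pos="n"),
              "| verb lemma:", lemmatizer.lemmatize(word, pos="v"))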
Based on the mappings selected by each researcher, natural language processing
techniques have been used in a number of systems that automatically, or semi-automatically, identify domain relevant concepts. Some even output object-oriented
diagrams [Mich, L. (1996)], [Delisle S. Barker K. Biskri I. (1999)]. This kind of
approach is typically based on the assumption that the importance of a concept is
strongly correlated with the frequency of its appearance in the text. In comparison, the
approach of [Goldin L., and Berry D.M. (1997)] is capable of capturing important
concepts that appear both frequently and infrequently in the text. However the setting
of a Word-Threshold parameter is a key to its overall performance.
Mappings between OO concepts and NL elements are the foundation of such approaches.
In the following, existing mappings are reviewed in detail and the performance of
CASE tools that adopt those mappings to automatically identify classes is investigated.
2.6.1 Mappings between OO and NL
The most widely known technique for capturing object oriented model elements is
"Using Nouns”, which suggests that some Nouns can be interpreted as objects. Based
on this technique/approach, many NLP systems were developed to extract objects from
natural language expressions. CM-Builder [Harmain H.M. and Gaizauskas R. (2000)] is
perhaps the most recent project in this area. It contributes by supporting the standard
Unified Modelling Language (UML) that allows easy refinement with a UML CASE
tool. A similar tool, LOLITA [Mich, L. (1996)], [Mich, L. Garigliano, R. (1994)],
supports object-oriented analysis by automatically generating object-oriented analysis
models from NL specifications.
Past research in object-oriented modelling has investigated heuristic-based solutions for
mapping NL elements onto OO concepts. From a review of research on automatic
requirements analysis, such mappings executed by NLP systems (e.g. [Mich, L. (1996)]
[Delisle S. Barker K. Biskri I. (1999)] [Harmain H.M. and Gaizauskas R. (2000)]
[Borstler J. (1996)]) only provide an incomplete coverage of the necessary notations.
For static modelling of systems, some have claimed that nouns and noun clauses can be
used to derive classes and some others that natural language (NL) structures can be used
to derive relationships between these classes [Abbott R.J. (1983)].
However, as
Halliday states “Noun names a class of things that could be concrete objects or persons,
as well as abstracts, processes, relations, and states, attributes” [Halliday M.A.K.
(2003)]. This diverse nature of nouns makes identification of a class and its attributes
extremely difficult.
It is also apparent that most mappings between NL and OO
concepts are based on an open-class word; i.e. “is a type of” signifies inheritance,
“belong to” and “are part of” denote aggregation and “can be” indicates generalization
(e.g. a liquid can be water) [Juristo, N. Moreno, A.M. Lopez, M. (2000)]. Furthermore,
dynamic modelling (i.e. modelling of behaviour) appears even more difficult to
automate. In attempting to provide computer-assisted requirements elicitation, we see
there is a demand for automatic requirements collection, interpretation, analysis and
stakeholder interaction that would help tease out a more comprehensive system model.
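As a simple illustration of how such open-class phrase mappings could be operationalised, the sketch below scans a sentence for a few of the signal phrases mentioned above and reports the candidate relationship each one suggests. The phrase list and the Python regular-expression implementation are illustrative assumptions, not the mechanism used by the cited authors.

    # Illustrative keyword-phrase mapping for candidate OO relationships.
    # The phrase list is an assumption based on the mappings discussed above.
    import re

    RELATION_PHRASES = {
        r"\bis a type of\b": "inheritance",
        r"\b(?:is|are) part of\b": "aggregation",
        r"\bbelongs? to\b": "aggregation",
        r"\bcan be\b": "generalization",
    }

    def candidate_relationships(sentence: str):
        """Return the OO relationships suggested by signal phrases in a sentence."""
        return [relation
                for pattern, relation in RELATION_PHRASES.items()
                if re.search(pattern, sentence, flags=re.IGNORECASE)]

    print(candidate_relationships("Each button is part of a floor panel."))  # ['aggregation']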
Richard C. Lee & William M. Tepfenhart point out that the mappings (in Table 2-1) can
be well understood by technical or non-technical project staff and therefore can be an
effective communication medium for stakeholders. However, it is also mentioned that
this is an indirect approach to finding classes, objects and interfaces, simply because
nouns are not always classes, objects or interfaces. A subject noun could refer to an
attribute, a service, a software component (subassembly) or even a computer software
configuration (entire assembly). The role that a particular noun plays is also a factor
in class identification. Other issues discussed in this book are:
Different words may refer to the same thing (e.g. workplace & office); select one word
for each concept and eliminate the others.
Different words can describe the same thing from different aspects (e.g. Lucy is a
mother & Lucy is a dentist); each concept needs to be specified.
Different concepts can be described by the same noun (e.g. floor as part of a room &
floor in a building); these need to be clarified and described by different words
when all are domain relevant.
Table 2-1. OO concepts and NL elements – A [Lee R. C. and Tepfenhart W. M. (2003)]
OO Concepts                        NL Element
Object                             Singular proper nouns, e.g. Jim, he, she, employee number 5, my workstation
Object                             Nouns of direct reference, e.g. the sixth player, the one millionth purchase
Classes / Interfaces               Plural nouns, e.g. people, customers, vendors, users, employees
Classes / Interfaces               Common nouns, e.g. everyone, a player, a customer, an employee
Services (operation)               Verbs, e.g. pay, collect, read, request
Services (operation)               Predicate phrases, e.g. are all paid, have simultaneously changed
Service                            Subject nouns
Attribute                          Subject nouns
Computer software configuration    Subject nouns
Software component                 Subject nouns
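A naive, mapping-driven extractor in the spirit of Table 2-1 might look like the sketch below: plural and common nouns become candidate classes, proper nouns become candidate objects, and verbs become candidate services. NLTK and its Penn Treebank tags are assumed purely for illustration; as the discussion that follows makes clear, such direct mappings over-generate and still need human clarification.

    # A naive mapping-based extractor in the spirit of Table 2-1
    # (NLTK and Penn Treebank tags are assumed for illustration only).
    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    TAG_TO_CONCEPT = {
        "NN": "candidate class",     # common noun, singular
        "NNS": "candidate class",    # common noun, plural
        "NNP": "candidate object",   # proper noun
        "VB": "candidate service", "VBP": "candidate service",
        "VBZ": "candidate service", "VBD": "candidate service",
    }

    def extract_candidates(text: str):
        """Map each tagged word onto an OO concept, where a direct mapping exists."""
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        return [(word, TAG_TO_CONCEPT[tag]) for word, tag in tagged if tag in TAG_TO_CONCEPT]

    print(extract_candidates("A customer pays the bill at the counter."))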
The alternative approach discussed in their book can help engineers deliver a good
outcome, but is difficult for a CASE tool to execute. In the following, such CASE
tools for class identification are compared and contrasted, and the mappings between
each type of OO concept and NL element in each CASE tool are presented.
2.6.2 CASE Tools for automatic class identification
Based on the existing mapping theory, some attempts at automatic class identification
have demonstrated that the OO concepts can be identified automatically by applying a
mapping-based approach. Moreover, some have produced a more comprehensive
mapping extension.
Table 2-2. OO concepts and NL elements – B [Harmain H.M. and Gaizauskas R. (2000)]
OO Concepts                                    NL Element
Class                                          Object nouns / noun phrases
Relationship                                   Transitive verbs (event) with logical subject and object
Attribute                                      Possessive relationship
Attribute                                      "To have", denoting identity
Attribute                                      Attributive adjective
Aggregation                                    Sentence patterns, e.g. something is made up of something; something is part of something; something contains something
Multiplicity of roles in associations – 1      Singular noun, determiner, e.g. one
Multiplicity of roles in associations – Many   Determiners, e.g. each, all, every, many and some
Multiplicity of roles in associations – N      Presence of numbers, e.g. one, two, seven, etc.
Table 2-3. OO concepts and NL elements – C [Harmain H.M. and Gaizauskas R. (2000)]
OO Concepts                                    NL Element
Multiplicity of roles in associations – 1      Noun (singular), determiner
Class                                          Object noun / noun phrase
Attribute                                      Noun (follows special verb)
Relationship                                   Transitive event with logical subject and object
Aggregation                                    Sentence pattern
Attribute                                      Attributive adjective
Attribute                                      Possessive relationship
Multiplicity of roles in associations – Many   Determiners
Multiplicity of roles in associations – N      Presence of numbers
As shown in the tables above, the principal OO concepts can be identified, including
Class, Attribute, Operation, Object and Relationships (e.g. association, aggregation,
etc.); more NL elements are involved in this process, including nouns, verbs,
adjectives, compound words and prepositions, and some sentence patterns (structures)
have been found useful in discovering relationships.
Discussion: limitations and failures of the existing approaches, and the reasons behind them
In Object Oriented Modelling, the building blocks are objects (or, classes - the
definition of objects) that define their interaction with the outside world through the
methods that they expose. Methods form the object's interface allowing interaction with
the outside world. A method may require parameter(s) or have pre-conditions and post-conditions. Furthermore, objects which are relevant in a domain must interact with each
other in one of the following ways - association, generalization, aggregation,
composition and dependency. According to the earlier findings, the mappings between
OO concepts and NL elements are generally that i) Nouns can indicate classes or
attributes; ii) Verbs can indicate operations. However, attributes and operations do not
stand in isolation but are parts of a class. In other words, a class without its attributes
and operations identified cannot function or interact with other classes in the domain.
Regardless of the efficiency of the mappings, the key issue of associating the operations
and attributes with the corresponding class is not addressed. The interpretation is
affected by the position of a word in the text/sentence and its grammatical relationship
with other key words in the text/sentence.
Such approaches face the following difficulties:
separation of a class from its methods and attributes;
synonyms;
multiple word types.
Through some examples, the following demonstrates how each of the above can lead to
inaccurate interpretation. The text is chosen from the scenarios used for the case studies
in our later experiments, and the whole text of each can be found in Appendix C,
Appendix D and Appendix E.
Examples of difficulty i)
“The illumination is cancelled when the elevator visits the corresponding floor and
moves in the desired direction.”
As illustrated in the class model below, move is chosen as the name of one of the
operations, which is encapsulated in the class Elevator. It is possible to identify the
verbs and nouns with their class by applying the Subject-Verb-Object syntax. However,
it is not that simple; the question is how to identify the synonyms. For example, visits
and moves can both be interpreted as candidate operations of the candidate class
elevator, yet only one is needed as a candidate operation, since both refer to the same
semantics in this problem domain. For the approaches discussed earlier, it is crucial yet
very difficult to discover the synonyms in a text.
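One heuristic way of tackling the synonym problem described above is to compare candidate operation names through a lexical resource such as WordNet and merge those whose senses are sufficiently similar. The sketch below does this with NLTK's WordNet interface and a Wu-Palmer similarity threshold; the threshold, and the assumption that lexical similarity implies the same domain semantics, are illustrative simplifications. In this study such borderline cases are ultimately resolved by asking the domain expert.

    # A heuristic check for near-synonymous candidate operations
    # (WordNet via NLTK; the 0.8 threshold is an illustrative assumption).
    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet", quiet=True)

    def verbs_look_synonymous(verb_a: str, verb_b: str, threshold: float = 0.8) -> bool:
        """True if any pair of verb senses exceeds the Wu-Palmer similarity threshold."""
        for sense_a in wn.synsets(verb_a, pos=wn.VERB):
            for sense_b in wn.synsets(verb_b, pos=wn.VERB):
                score = sense_a.wup_similarity(sense_b)
                if score is not None and score >= threshold:
                    return True
        return False

    # Whether "visit" and "move" are merged depends on the threshold chosen;
    # cases like this would be turned into questions for the domain expert.
    print(verbs_look_synonymous("visit", "move"))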
These difficulties derive from the nature of NL, namely its ambiguity. Consequently,
these characteristics affect the outcome of domain concept identification. This study
suggests that the difficulties in identifying OO concepts automatically via a mapping-based
approach stem from the nature of NL, which includes the following four aspects:
Flexible and complex expression and sentence structure
Required knowledge background
Multiple Word Classification
Compound Nouns / derivative words
Flexible and complex expression and sentence structure
As the example of the word ‘book’ in the next subsection shows, the identities and
functions of book, as well as the relationships between book and other things, are
conveyed by the structure of the sentences in which they are found. Understanding
sentence structure can help the discovery of relationships among domain concepts.
However, a relationship between book and another concrete thing can be described in
many ways. This variety of expression increases the difficulty of finding a better
solution.
Required knowledge background
Taking the simple word ‘book’ as an example, people who understand the word ‘book’
also share the knowledge/information that is encapsulated within it. By this we mean,
i) the functions offered by the word ‘book’ are shared, e.g. a book can be bought, can be
borrowed, can be written, can be published, etc.; and,
ii) the attributes owned by the word ‘book’ are shared, e.g. a book has price, has
author(s), has publisher(s), has contents, has title(s), etc.; and,
iii) the interactions of a ‘book’ with other things indicate many-to-many association
relationships between ‘book’ and other things, e.g.
a book can be bought from the book shop,
a book can be borrowed from the library,
a book is written by its author,
a book is published by the publisher,
etc.
Where a communication involves the word ‘book’, people usually already share all the
knowledge about book that they need in order to communicate with each other. Where
one party does not share that knowledge, questions arise to gather the further information
needed. This example demonstrates that a class cannot be adequately described by a
single noun; all the relevant knowledge needs to be identified in OO concepts, simply
because a machine does not share the knowledge encapsulated in the vocabulary, that is,
from an OO aspect, the functions, attributes and relationships with other things. Simple
mappings (e.g. noun-class) are therefore inadequate and can hardly be executed by a
CASE tool sufficiently and broadly. The question is how to discover the
knowledge/information that hides in a specific word and identify it in OO concepts.
Considering the above example, in i) the functions of book are described in sentences
that link the book (noun) and its functions (verbs) bought, borrowed, written and
published. In ii) and iii) of the above example, sentences are used to describe the
attributes of book and the interactions between book and other concrete things.
Understanding sentence structures is crucial in identifying such relationships between
words, but it is not the only mechanism in English. Joining two words can form a new
word - a compound word - which can help in defining relationships between words
(e.g. a compound noun assigns one word, playing the role of an attribute, to another
word, playing the role of an object; compound prepositions behave similarly).
Multiple Word Classification
“Book” can be used as a noun, a verb or an adjective. Different usages can have
completely different meanings. The word “book” could be the book to read, or the act
of registering, or of arranging for tickets or lodgings (for example) in advance [Oxford
Dictionary of English (1989)].
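This multiplicity of senses and parts of speech can be seen directly in a lexical database. The short sketch below lists the noun and verb senses that WordNet records for 'book' (via NLTK's interface, an implementation assumption made for illustration); a purely mapping-based tool has no principled way of choosing between them without further context or user clarification.

    # Listing the noun and verb senses of "book" recorded in WordNet
    # (NLTK's WordNet interface is assumed for illustration).
    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet", quiet=True)

    for pos, label in [(wn.NOUN, "noun"), (wn.VERB, "verb")]:
        for synset in wn.synsets("book", pos=pos):
            # Each synset is one sense, with its own gloss/definition.
            print(f"{label:5s} {synset.name():20s} {synset.definition()}")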
Compound Nouns / derivative words
“Book” can be classified into different groups, e.g. school book, library book, travel
book, etc., which might inherit some or even all of the functions and identities
mentioned above. A question regarding this can be, “Should ‘school book’ be
interpreted as two candidate classes or one candidate class?”
In terms of identifying OO concepts, this study suggests that mappings between NL
elements and OO concepts should meet the following requirements:
Such mappings must support distinguishing the true interpretation from other potential interpretations;
Such mappings must support discovering the relationships that are embodied in NL expressions;
Such mappings must support user participation.
In this chapter, the related literature was reviewed and discussed from different angles, including requirements engineering, requirements elicitation, requirements analysis, object oriented modelling, and linguistic analysis. In particular, existing mappings used in a variety of NLP tools that claim to identify OO concepts from domain descriptions were reviewed. Furthermore, it was discussed how the performance of such an approach is affected by the nature of NL, namely its flexible and complex expression and/or sentence structure, its required knowledge background, its multiple word classification and its derivative words. In the following chapter, the hypothesis of this study is presented and discussed in detail.
CHAPTER 3 - Hypothesis
This chapter defines and explains the hypothesis of this study, starting from the context of the problem, followed by the definition of the hypothesis. It also looks at the challenges in developing a better solution, and finally the methodology is presented briefly.
3.1 Context of the problem
Given the background of Requirements Engineering, and having discussed different approaches to RE and previous efforts in automatic class identification, the following facts are assumed in this study:
Stakeholders’ participation is crucial to support a thorough and correct view of the problem domain;
A domain description alone is unlikely to meet the criteria (thorough, clear and correct) for generating a quality OO model;
At least some, if not all, NL-based domain descriptions can be translated to an OO model automatically;
The overall quality of the outcome of this automation depends on the quality of the input domain description.
3.2 The hypothesis
The premises listed above are the inspiration for our solution. Understanding that domain experts do not have a complete picture of the requirements, and that a domain description is therefore unlikely to contain all the essential information needed to build the future system, our alternative approach emphasizes highlighting these imperfections, rather than following the existing track of maximizing the amount of relevant information that can be extracted from a domain description.
The motivation of the study is to improve the support for working with NL descriptions of problem domains, and thereby to encourage better system designs. The hypothesis of this study is defined as follows:
It is possible to improve object oriented domain modelling using translation from NL if
we incorporate within the process a step where domain experts have to clarify any
incompleteness or any type of ambiguity in the domain descriptions. Furthermore, if
the missing and/or ambiguous data can be identified in the domain descriptions, this
can be an important indication to developers to further extract domain relevant
information from different resources.
The problem is how to lead customers (who are non-technical in character) to describe their environment and problems in a way that is understandable, transferable and unambiguous, and, as a result, to support developers in using such information in further stages of development. Unlike previous top-down approaches, we suggest a bottom-up approach; eliciting requirements starts from the fundamental building blocks of the problem domain – the concrete objects that compose the customers’ environment. The focus is not only to translate NL descriptions into object oriented domain models but, most importantly, to discover the problems within the domain description. To achieve this goal, we need to know:
1) what information needs to be gathered, and
2) how to gather such information.
3.2.1 The Challenges
Although it might not be achievable to create equivalents in OO models for everything that can be described by NL descriptions/elements, one must be able to model those that are relevant to the problem domain in order to provide the functionality to meet the requirements. The challenge, however, is how to deal with the problems that exist within NL expressions, and how to define the problem itself, namely what counts as ‘missing data’ and/or ‘ambiguous data’. For example, does ‘missing data’ refer to a missing word, a missing sentence, a missing concept, a missing part of speech, or simply all of them?
In OO development, there is no expectation that a model will be perfect at the beginning. It always needs to be revised, and additional concepts can be added to the existing model as development proceeds. Based on this characteristic, the following sub-hypotheses are defined.
I. An NL expression / element can have multiple interpretations (translations) (which can be used for discovering ambiguous data)
II. An NL expression of problems can be transformed into an object oriented (concepts) domain model
III. An object oriented (concepts) domain model can be transformed into an NL description
IV. An object oriented (concepts) domain model can be used to verify an NL description, to discover problems within an NL description (i.e. can be used for discovering missing data)
The second challenge of this study is how and where to gather such information. The majority of additional information is in the form of written and/or spoken NL expressions provided by domain experts. It is crucial that the future approach supports customers in describing their needs without necessarily composing documents. Hence, two further sub-hypotheses are defined as follows:
V. An incomplete object oriented domain model can be interpreted as an NL expression, to allow domain experts to provide additional information to clarify issues
VI. It is possible to support an iterative process so that additional information can be added into the initial model
3.3 Methodology
This research was designed and structured in the following five steps, each linked with the above sub-hypotheses, as illustrated in Figure 3-1 below.
[Figure: the plan of work comprises Step 1 Preliminary Experiment, Step 2 Development of the Approach, Step 3 Implementation of the Approach, Step 4 1st Experimental Study and Step 5 2nd Experimental Study, with the steps linked to hypotheses I-VI.]
Figure 3-1 Plan of Work
3.3.1 Preliminary experiment
Among the six sub-hypotheses above, hypotheses IV and VI rely on I, II, III and V being valid; the latter are therefore considered the fundamental questions. To ensure the future approach is built on a solid foundation, a preliminary experiment was designed and conducted to test those hypotheses before the further development of the approach. Details of the experiment are described in Chapter 4, where the results are discussed.
3.3.2 Development of the Approach
As introduced earlier, this research aims to support customers in describing their needs
accurately and efficiently and on the other hand to assist developers in discovering
hidden requirements that are essential and relevant to the system under development.
To achieve this, the potential approach should be able to offer automatic translation
between NL and UML (as the chosen object oriented modelling language), both
forwards and backwards. The former guides domain experts to provide domain relevant
information and validates the domain description, while the latter helps requirements
engineers to develop a better understanding of the information provided, as well as to
build a clearer view of the problem domain in terms of what information is missing or
which part of the problem domain requires further clarification, etc.
The approach is to investigate the associations between UML and NL, to establish a bi-directional mapping between UML elements and NL elements. The NL elements involved in this study need to be expanded from a focus on Nouns and Verbs to include a wider range; on the other hand, due to the limitation of their appearance in a text, the particular phrase structures introduced in existing approaches should be avoided in the development of the mappings. The UML elements considered are classes, objects, operations, properties and relationships, which are introduced along with the Case Study in Chapter 5. For instance, the investigation of what role each type of word plays in UML asks:
How are classes and class relationships represented in NL?
How are objects and object relationships represented in NL?
How are the properties and behaviour of objects represented in NL?
3.3.3 1st Experimental study
The first challenge, covering hypotheses I, II and III, is to establish a platform for computer assisted interpretation between NL and UML, which is also considered the basis of the approach. It is therefore imperative to verify the correctness of the interpretation between NL and UML elements. Details of the experimental study and the test results for the first three hypotheses can be found in Chapter 7.
3.3.4 2nd Experimental study
Once positive results are gained from the first experiment, the evaluation of hypotheses IV, V and VI can be carried out; this is also given in detail in Chapter 7. The purpose is to observe the improvements to domain modelling brought by users’ clarification / participation.
In this chapter, the hypothesis and its six sub-hypotheses have been specified, and there are five steps involved in finding the answers to those research questions. The methodology provides details of each step. In the next chapter, the first step, the preliminary experiment, is presented and the investigation of the data obtained is discussed in detail.
CHAPTER 4 - Preliminary Experiment
This chapter describes the preliminary experiment as detailed in the methodology. This
experiment is based on looking at how natural language may affect models produced by
applying object oriented modelling approaches. The motivation is to understand the
obstacles faced in identifying OO concepts from text.
4.1 The choice of participants
It was deemed important that each participant possess equivalent, if limited, experience
and knowledge of OO domain analysis. In order to find a group with the number of
suitably experienced students required for significant results, the decision was made to
use a penultimate (3rd) year class of undergraduates in the Computer Science
Department of Heriot-Watt University. All of the participants had studied UML and
been involved in exercises using it with textual, parts of speech based analysis.
4.2 The choice of notation
UML remains the most commonly used OO notation both in practice and education.
Among the various types of diagram in UML, use case diagrams and class diagrams are
generally considered to be the most relevant for domain modelling. Use case diagrams
are often used for communication between different stakeholders, while class diagrams
represent static relationships among the important concepts of a specific domain. Given
that the focus of this study is domain analysis, rather than domain information
gathering, UML class diagrams were chosen. These were also useful, since the students
had been exposed to UML in earlier teaching, minimizing the overhead in preparing
them for participation.
4.3 The choice of scenarios
In choosing the scenarios for this experiment, the following criteria were defined. The
scenarios must:
consist of a domain description (in English) and corresponding UML class
diagram;
be relatively simple, so that they can be tackled in around 10 minutes;
be independent of prior, specialized domain knowledge, so that they can be
understood easily and quickly by all participants;
be published elsewhere previously; it is assumed that the authors of these
scenarios are experts, and their artefacts are of a high quality.
The chosen three scenarios are Bank System (BS) (Pooley & Wilcox (2004)), Elevator
Problem (EP) (Elevator (2006)) and Film Management (FM) (Film (2006)).
4.3.1 Method
The experiment compared how students understand information given in natural
language with how they respond to an OO model.
The inputs were Domain
Descriptions (DDs) and corresponding Domain Models (DMs). The outputs were: a
First Derivative Domain Model (1stDDM), a Second Derivative Domain Model
(2ndDDM) and a Derivative Domain Description (DDD) (see Figure 4-1). Sixty-three
students participated in this study. As there were three scenarios, participants were
asked to form teams of three, with each person in a team working on a different
scenario. Each scenario had a DD (in English) and a DM (a class diagram).
Before the experiment began, the participants were given a five-minute refresher
course by a lecturer on UML Class diagram notation.
Half of the teams (set A) started from the DD of their scenario and produced a
DDM; the other half (set B) started from the DM and produced a DDD.
Students then worked on a second scenario; those who had worked from a DM
worked on a DDD generated by another team member and vice versa.
For each part, students had about 10 minutes. It was vital that participants only saw one
part of a scenario during the experiment to prevent undue bias.
[Figure: experimental design. After the initial briefing and UML refresher, Set A’s Task 1 takes the DD to a 1st DDM and Set B’s Task 1 takes the DM to a 1st DDD; in Task 2, Set A produces a 2nd DDD and Set B produces a 2nd DDM from artefacts generated by teammates.]
Figure 4-1 Diagram of experimental design
4.3.2 Analysis
For each scenario, the following OO concepts were compared with their counterparts in the original class diagram:
C, classes identified by the participants, including methods and attributes;
A, association relationships identified by the participants;
G, generalization relationships identified by the participants.
Results of Scenario One: Bank System
Class Account is equally well identified in both 1stDDM and 2ndDDM and class
Customer has a slightly higher percentage in 1stDDM than 2ndDDM. The remaining
concepts are identified better in the 2ndDDM.
In particular, classes Branch and
Transfer are not identified in the 1stDDM, but by 86% and 71% in the 2ndDDM.
The irrelevant class Employee is added by 40% of students in the 1stDDM, and by 0% in the 2ndDDM.
Table 4-1: Domain concepts identified in Bank System

Domain classes                  | 1stDDM | 2ndDDM
Account (C)                     | 100%   | 100%
Customer (C)                    | 60%    | 57%
Savings Account (C)             | 60%    | 86%
Current Account (C)             | 60%    | 86%
Statement (C)                   | 40%    | 71%
+Employee                       | 40%    | 100%
Branch (C)                      | 0      | 86%
Transfer (C)                    | 0      | 71%
Account - Savings Account (G)   | 60%    | 86%
Account - Current (G)           | 40%    | 86%
Account - Customer (A)          | 20%    | 57%
Account - Statement (A)         | 20%    | 100%
Account - Branch (A)            | 0      | 86%
Account - Transfer (A)          | 0      | 71%
Results of Scenario Two: Elevator Problem
The lowest percentage of 1stDDM is 17% for class ElevatorButton, which has a much
higher percentage in 2ndDDM. The data also shows that two concepts are identified by
100% of students in 2ndDDM but only 83% and 33% in 1stDDM. Overall, the results
for 1stDDM are worse than for 2ndDDM.
Table 4-2: Delivered domain concepts of Elevator Problem

Domain concepts                    | 1stDDM | 2ndDDM
Elevator (C)                       | 83%    | 100%
Button (C)                         | 67%    | 86%
FloorButton (C)                    | 33%    | 71%
ElevatorController (C)             | 33%    | 100%
Door (C)                           | 33%    | 43%
ElevatorButton (C)                 | 17%    | 71%
ElevatorController - Button (A)    | 33%    | 86%
ElevatorController - Door (A)      | 17%    | 29%
Button - ElevatorButton (G)        | 17%    | 57%
Button - FloorButton (G)           | 17%    | 57%
Elevator - ElevatorController (A)  | 0      | 86%
Results of Scenario Three: Film Management
Table 4-3: Delivered domain concepts of Film Management

Domain concepts           | 1stDDM | 2ndDDM
Scene (C)                 | 100%   | 100%
Setup (C)                 | 100%   | 100%
Take (C)                  | 75%    | 100%
Internals (C)             | 75%    | 83%
Externals (C)             | 75%    | 83%
Location (C)              | 75%    | 83%
+Reel                     | 17%    | 100%
Scene - Setup (A)         | 75%    | 83%
Setup - Take (A)          | 50%    | 83%
Scene - Internals (G)     | 50%    | 83%
Scene - Externals (G)     | 50%    | 83%
Location - External (A)   | 0      | 50%
4.3.3 Overall Results
The case studies highlight an interesting result. In all three scenarios, the classes that students seem to have difficulty discovering in the 1stDDM were identified by most students in their 2ndDDM. This may reflect the fact that the 1stDDM was based on a natural language description, while the 2ndDDM was based on something closer to a controlled natural language description (Kuhn et al. (2006)), which only includes information on domain relevant concepts represented in simple English structures (the DDDs shown in Figure 4-1). A typical DDD written by a student, compared to the original DD, includes only the key concepts; the irrelevant information is filtered out, and the relationships between these key concepts are expressed with simple sentence structures.
4.3.4 Discussions
The results obtained from this experiment give us some idea of how far students’
understanding of a problem in OO analysis is influenced by the simplicity of natural
language used in the domain description. This section describes what went wrong and
in cases where students got things “wrong” why they did so. The data used in this
discussion are the results of 1stDDM, as that is based on the original DD.
The errors made by the students are classified into four types, represented by the taxonomy given in Figure 4-2. Based on the analysis in the previous section, the main mistakes can be categorized into two kinds: the added extra concept, which refers to irrelevant concepts that are identified, and the missing concept, which refers to important concepts ignored by the students.
[Figure: taxonomy of error types. Errors in understanding the problem domain description divide into added unimportant concepts (non-existent concept, existing concept) and missing important concepts (not interpreted, incorrect interpretation).]
Figure 4-2 Taxonomy of Error Types
Added Unimportant Concepts: Non-existent concepts
Collected class diagrams show that many students added extra attributes to classes. For
instance, 60% of the 1stDDMs of BS contain class Customer and 40% of these contain
extra attributes that do not appear in the original DD (such as customer name, customer id,
etc.). This might be caused by: i) the use of personal knowledge relevant to the
problem; or ii) expectation that classes often have such attributes and methods, perhaps
mimicking previous text book examples. For the latter point, it is thought that although
the domain description is clear and detailed enough for people to understand the
problem, students may feel that the information given is incomplete and the added
attributes are required for later activities (i.e. system design).
Added Unimportant Concepts: Existing concepts: personal nouns
The results show that students may consider an actor to be an instance of a class.
Students can be divided into three groups, according to how they interpreted nouns
referring to an actor: i) all the actors are classes; ii) actors are not classes; iii) an actor
may or may not be a class. The 40% of students who considered Employee as a candidate class also all considered Customer as a candidate class. Conversely, the majority who did not consider Employee as a candidate class did not take Customer as a candidate class either. Furthermore, the collected outputs show that just 20% of the class diagrams contain class Customer alone. The sentence that ‘employee’ is involved in is “Bank employees may check any account that is held at their branch. They are responsible for invoking the addition of interest and for issuing statements at the correct times.” Clearly, this says that employees communicate with the system. However, it does not specify any static relationship between employee and any other concept. This should have been seen as a sign that employee should be identified not as a class, but as a potential actor.
The results may indicate that, relating to the first example, students need better guidance
on i) under which circumstances actors should be identified as a class; in particular, the
classification of personal nouns, ii) understanding natural language structures in terms
of transforming domain concepts to OO concepts.
Missing concepts: Not interpreted
It was found that a Low Appearance Frequency (LAF) in the domain description does
not necessarily result in missing interpretation of domain relevant concepts.
For instance, in scenario BS class Branch is missing from all the 1stDDMs, while class Statement is identified by only 40% of students. In the original DD, Branch appears three times and Statement appears only once in the text. The results reveal that the importance of a
concept to a domain may be independent of its frequency of occurrence in the text,
questioning the findings of (Lecoeuche (2000)). Yet, students may have missed the
concept, or its importance, as a function of its prevalence in the text. The question
remains of how to distinguish the important concepts from all the others.
Missing concepts: Incorrect interpretation
With the same frequency as class Branch, class Transfer is also missed out in all the 1stDDMs. In scenario BS, Transfer appears as a method in 60% of DDMs. Specifically, 40% of the students interpreted Transfer Money as a method of class Customer and 20% considered Transfer as a method of class Account. Unlike branch, transfer can be used in this context as a verb as well as a noun, which may give rise to confusion. Based upon the results collected from this experiment, one cause of incorrect interpretation might be related to activity nouns, which can refer to behaviours of concrete things as well as to the concrete things themselves, and also to LAF in the text. This prompts the question, “Under what circumstances should such activity nouns be interpreted as a domain relevant class?”
Another example of incorrect interpretation of classes involved generalization
relationships. All students identified class Account, but most of them did not identify
both classes Savings Account and Current Account, recognizing only one of them.
Furthermore, 20% identified class Account, but had both specializations as attributes.
From a linguistic perspective, this indicates that two compound nouns that share a
common part may both have a generalization relationship with this shared part.
Nevertheless, the appearance frequency of a particular noun remains an important
aspect in class identification. In particular, the results demonstrate that students do not
have a problem in identifying the key concepts with Highest Appearance Frequency
(HAF). In the scenario BS, the HAF concept is Account (appears seven times); in the
scenario FM the HAF concept is Scene (appears seven times); and in the scenario EP
the HAF is Elevator (appears eight times). However, at least in this experiment, it is
possible that domain relevant concepts appear infrequently. Therefore, the question is
how to capture important concepts with LAF. Tentatively, this study suggests the
following aspects to be considered as indicators of important concepts which have LAF
in the text.
There appears to be a relationship between LAF nouns that are important domain concepts and the scenarios’ HAF nouns. Looking at the BS scenario, Branch and Transfer both occur three times and were both denoted VD. However, the latter occurs twice in the same sentence as the HAF concept, Account, while the former does so only once.
For nouns associated with roles, and again looking at BS, where there are customer and employee, one can perhaps draw on the sentence structure ‘Subject-Verb-Object’ for insight.
Juxtapose “Accounts are assigned to one or more customers” and “Bank employees may check any account… ”. Only Customer is a qualified class, since the verb assign implies a persistent relationship between the subject noun Accounts and the object noun Customers. In contrast, the verb check indicates a behaviour of Employees towards Account, rather than an association between Employees and Account. Hence, Employees seem to be external users acting on the system. Students should be taught that special verbs (e.g. assign, belong to) or compound prepositions (e.g. is part of) can help to avoid this type of incorrect identification.
4.3.5 Conclusions
The results of this study indicate that there are four typical errors students make, which fall into two categories: added extra concepts and missing concepts. Although these results need further investigation to study the role of student bias and prior learning, they offer a view of the issue from a new angle and suggest possible improvements in teaching object oriented analysis using UML. Due to the crucial role of natural language in the early stages of software development, students will benefit when analysing problem domains if they are taught how to interpret natural language as OO concepts; for instance, how to deal with important nouns which refer to concrete things but appear infrequently in the text, and how to find important concepts indicated by infrequent action nouns or roles.
It appears that generic rules offered in academic textbooks are inadequate to prevent
mistakes by inexperienced analysts. While simplified syntax might appear to remove
some of these difficulties, stakeholders may well be unhappy with the restrictions this
imposes. It is unlikely that all real-life domain descriptions will ever be controlled
natural language descriptions. Instead, students need to be prepared for the variety of
natural language that customers inevitably produce; this is where new teaching methods
involving peer assessment and role playing can be brought to bear.
In summary, this chapter has presented the conduct of a preliminary experiment and the analysis of its results. It is suggested that merely teaching students how to use OO techniques superficially in domain analysis is not enough. The underpinning language usage is subtle, requiring deep understanding, if crucial mistakes are to be avoided. Though the experiment provides us with insight into the common mistakes people make, so that they can be avoided in the proposed approach, it remains a challenge to develop a solution that deals with such problems.
CHAPTER 5 - Approach Used
This chapter starts with a discussion of the proposed approach, including three main
parts, Mappings, Patterns and Question Generation. Each part is then explained in
detail in the following sections.
5.1 The proposed approach
In the earlier chapters, we have discussed the issues in automated requirements
engineering. The main obstacles can be briefly described as follows:
i) the knowledge gap between human beings and machines, and
ii) the ambiguous nature of natural language.
The key to a solution is how a CASE tool can discover and identify the hidden knowledge in order to share the vocabulary and to bridge the knowledge gap. This study suggests a Mapping-Pattern-Question based solution to:
i) associate NL elements and OO concepts by the defined mappings;
ii) bridge the knowledge gap by the patterns; and
iii) clarify the ambiguity via user interaction, facilitated by system generated questions and answer options.
The mapping is built on the mapping theory introduced earlier, but is now extended and further developed. In order to implement the mappings, we use part of speech (POS) tags to define them. The patterns are used to relate words to each other in various relationships; hence attributes and operations can be assigned to the corresponding class, and relationships between classes can be identified. Both mappings and patterns are applied to the grammar, or parse, tree; the grammar tree structure and POS tags are used to define them. We believe that stakeholders’ (domain experts, system users, customers, etc.) participation is crucial to requirements engineering, especially at the early stage. The fact that stakeholders might not share the same vocabulary as requirements engineers generates a communication problem. Others have sought to address this by using ontology and concept mapping [Kaiya, H., Saeki, M. (2006)]. In this work, however, the key to addressing the issue and to maximizing stakeholders’ contribution is to provide clear guidance, so that issues such as misunderstanding and missing information can be avoided as much as possible.
In other words, we try to tell users exactly what information is needed and in what form. To receive sufficient user assistance in clarifying ambiguities, the system generates very specific questions, based on a particular vagueness identified in the domain description, along with a set of candidate answers presented one at a time. If a part is missing when matching a pattern, question templates are used for automatic question generation. These questions are useful for discovering the information that is ambiguous about a specific word and for gathering missing information.
5.2 Mappings between NL elements and OO concepts
5.2.1 Mappings between Nouns and OO concepts
Previous studies, introduced in earlier chapters, indicate that nouns can denote a number
of OO concepts. As typically defined in English grammar, a single noun falls into one of six categories: Common Nouns, Collective Nouns, Concrete Nouns, Proper Nouns, Predicate Nouns and Gerund Nouns [Oxford Dictionary of English (1989)]. Nouns from different categories can play different syntactic roles in a sentence, which could affect the results of transforming nouns into OO concepts. It is, therefore, crucial to study each noun category in order to associate it with different OO concepts. As an outcome, a more thorough mapping between the different noun categories and OO concepts is developed for this approach, which is a key contribution of this study.
Common Nouns, by definition in the grammar, name people (e.g. man, girl, boy), places (e.g. school, city, building), things (e.g. book, table, chair), or ideas and feelings (e.g. love, hate, idea), referring to both concrete things and abstract concepts. In terms of mapping, Common Nouns can be grouped into two kinds: the first three types can indicate classes as well as attributes of a class; there may also be association relationships and aggregation relationships among such common nouns, which are discussed later. Common Nouns that describe ideas and feelings do not refer to concrete things that can be found in reality when constructing a software system.
Collective Nouns, by definition in the grammar, are of singular form but refer to a group of people (e.g. army, audience, band, committee) or things (e.g. bunch, bundle, pair, set), where the former indicates a probable class and the latter describes the quantity (number) of a class when it is involved in a relationship.
Concrete Nouns, by definition in the grammar, name things or people that we experience through our senses, such as sight, hearing, smell, touch or taste, e.g. cats, dogs, tables, chairs, buses, teachers. In comparison with Common Nouns and Collective Nouns, Concrete Nouns normally refer to things, people or animals that have identities. The majority of classes and attributes are indicated by Concrete Nouns, which define instances or objects of a class.
Proper Nouns, by definition in the grammar, are also called proper names, referring to the names of specific people, organizations or places, e.g. each part of a person's name, names of companies, names of animals, etc. Proper Nouns might indicate
objects that are instances of a class.
In OO concepts, where a class can be
represented by more than one object, each object needs a unique name to be
differentiated from others. Proper Nouns can also indicate the value of a name
attribute of a class.
Predicate Nouns, by definition in the grammar, follow a form of the verb "to be" (e.g. He is an idiot, where idiot is a predicate noun because it follows is, a form of the verb "be"). Predicate Nouns specify the value of an attribute of a class, but not the name of the attribute. Because they indicate the value of an attribute, the subject often refers to the name of an object of a class, e.g. Tom is a lawyer, where lawyer is the predicate noun and the subject Tom is the name of a specific person whose job is “lawyer”. This example shows how the predicate noun lawyer specifies that the job/occupation attribute of object Tom is “lawyer”. Sometimes Predicate Nouns also indicate classes, e.g. lawyer may also be a class in its own right.
Gerund Nouns, by definition in the grammar, take the verb's –ing form, indicating that something is happening or a behaviour is taking place. Gerund Nouns can indicate the state of an object associated with a particular operation of that object’s class. They can also indicate an operation of a class.
Overall, Nouns are mainly associated with classes. Nouns can indicate the class itself,
an attribute of a class, an operation of a class, the multiplicity of a role in an Association
Relationship or an object (instance of class). Common nouns, Collective nouns and
Concrete nouns specify the name of a class or the name of an attribute of a class. The
others may not give a direct indication for class identification, but they may indicate that class relevant information/data is hidden behind them.
Though the infinitive form of a verb can be considered as a noun, it is more behaviour related; hence only the six noun categories are considered in establishing mappings between Nouns and OO concepts. As a result, ten basic mappings are defined in this study, as listed in Table 5-1, along with the corresponding translations into the OO model.
Table 5-1 Mappings between Nouns and OO concepts defined in this study

Noun Categories          | OO Concepts         | Translation into OO model
Common Nouns             | Class               | Class Name
Common Nouns             | Attribute           | Attribute Name
Collective Nouns         | Class               | Class Name
Collective Nouns         | Quantity of Class   | Multiplicity of a role in Association Relationship
Concrete Nouns           | Class               | Class Name
Proper Nouns             | Object              | Object Name
Proper Nouns             | Class -> Name       | Value of ‘Name’ Attribute of Class
Predicate Nouns (to be)  | Class -> Attribute  | Value of TBD Attribute of Class
Predicate Nouns (to be)  | Class               | Class Name
Gerund Nouns (-ing)      | Class -> Operation  | TBD Operation of Class
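Viewed from an implementation angle, Table 5-1 amounts to a lookup from noun category to candidate OO interpretations. The sketch below is purely illustrative: the dictionary keys, the function name and the way candidates are listed are assumptions made for the example, not a description of the tool built for this study.

# Sketch: Table 5-1 encoded as a lookup from noun category to the candidate
# OO interpretations that a noun of that category may be given.
NOUN_MAPPINGS = {
    "common":     ["Class name", "Attribute name"],
    "collective": ["Class name", "Multiplicity of a role in an association"],
    "concrete":   ["Class name"],
    "proper":     ["Object name", "Value of 'Name' attribute of a class"],
    "predicate":  ["Value of TBD attribute of a class", "Class name"],
    "gerund":     ["TBD operation of a class"],
}

def candidate_interpretations(noun_category):
    """Return every OO interpretation a noun of the given category may take."""
    return NOUN_MAPPINGS.get(noun_category, [])

# candidate_interpretations("predicate")
# -> ["Value of TBD attribute of a class", "Class name"]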
5.2.2 Mappings between Verbs and OO concepts
Verbs can be divided into modal verbs, copulas (linking verbs, e.g. be/am/is/are, has/have) and action (common) verbs that describe behaviours performed by concrete things. Action verbs can be interpreted as the names of operations that can be assigned to a particular class. The past simple form and gerund form of a verb indicate that there is a state, related to this behaviour, that is happening or has happened; therefore, these forms can be useful in the Statechart diagrams of UML. Behaviours happen in time, which can be split into three periods (Past, Present and Future), and are expressed through the simple, continuous and perfect tenses. Accordingly, as shown in Table 5-2, verbs have four different forms associated with time and tense in English grammar. Each form has a mapping onto OO concepts and a corresponding translation into the OO model.
Table 5-2. Mappings between Verbs and OO concepts

Verb Form / Time & Tense               | OOC             | Translation
Base Form / Present (Future) Simple    | Class-Behaviour | Behaviour Name
Past Simple Form / Past Simple         | State           | Behaviour Name
Past Participle Form / Perfect         | Class-Behaviour | Behaviour Name
Gerund / Continuous                    | State           | Behaviour Name
5.2.3 Mappings between NL and OO relationships
Using verbs and nouns to identify candidate operations and candidate classes might not be difficult. Importantly, we may also identify the class to which a particular operation belongs and the class that holds particular attributes.
Once a class and its operations and attributes are united in a UML class definition, the relationships among candidate classes need to be discovered and defined. In English, two or more particular words can be combined into a Compound Noun. Most compound nouns in English are formed by nouns modified by other nouns or adjectives (see the first two examples in the following table). Though the two parts may be written in a number of ways, typically as two words joined together to form a new word or as two words joined using a hyphen, we focus on those which appear as two separate words.
Table 5-3. Choices of combinations of words in compound nouns

Combination           | Example
Noun + Noun           | toothpaste
Adjective + Noun      | monthly ticket
Verb + Noun           | swimming pool
Preposition + Noun    | underground
Noun + Verb           | haircut
Noun + Preposition    | anger on
Adjective + Verb      | dry-cleaning
Preposition + Verb    | output
Compound Nouns are useful in associating an attribute with the corresponding class.
The commonest way to link individual words is through sentence structure, which
includes Simple Sentence, Compound Sentence and Complex Sentence in English
grammar.
A Simple Sentence has only one independent (main) clause, containing a single subject, verb and predicate; it describes only one thing, idea or question. The Simple Sentence structure S-V-O [Delisle S. Barker K. Biskri I. (1999)] describes a relationship between two classes, Subject (S) and Object (O), through the Verb (V). It also indicates that the verb is an operation of either class S or class O.
Compound Sentences are made up of two or more simple sentences combined using a
conjunction such as “and”, “or” or “but”.
They are made up of more than one
independent clause joined together with a coordinating conjunction. Each clause can
stand alone as a sentence, with a subject and a verb. Compound Sentences describe a
relationship between two behaviours; the structure is (S-V-O)-{and/or/but}-(S-V-O), e.g. 'I speak Chinese and/or/but my friend speaks English'.
Complex Sentences describe more than one thing or idea and have more than one verb. They are made up of an independent clause (which can stand by itself) and a dependent (subordinate) clause (which cannot stand by itself) of the following types. A complex sentence that contains a nominal clause describes a relationship between two behaviours; its structure is {that/if/whether}-SS. A complex sentence that contains an adverbial clause also describes a relationship between two behaviours; its structure is (S-V-O)-{after, although, as, because, before, if, since, that, though, till, unless, until, when, where, while}-(S-V-O), e.g. 'they went to the movies after they finished studying'; 'He forgot the last page when he submitted his writing'.
i) A nominal clause is one that contains a noun with “that” or “if” or “whether”, e.g. 'I wondered whether the homework was necessary'.
ii) An adjectival clause is one that contains an adjective preceded by “who” or “which” or “that”, e.g. 'I went to the show that was very popular'.
iii) An adverbial clause is separated from the other clauses by a subordinating conjunction (after, although, as, because, before, if, since, that, though, till, unless, until, when, where, while). Adverbial clauses can be placed before or after the main clause, e.g. 'They will visit you before they go to the airport' or 'Before they go to the airport they will visit you'. An adverbial clause is a word or expression in the sentence that functions as an adverb; that is, it tells you something about how the action in the verb was done.
iv) An adjectival clause describes an identity of the class denoted by the noun it follows. Its structure is {who/which/that}-(V-O), e.g. 'I went to the show that was very popular'. The adjectival clause 'that was very popular' tells us that the popularity of the show (which can be translated to a property/attribute popularity of the class Show) is considerable.
Both Compound Sentences and Complex Sentences are built from the simple sentence structure, so the mappings are developed based on the simple sentence structure. Table 5-4 presents the mappings between the NL relationship (Simple Sentence Structure) and the OO relationship (Class-Operation). When the verb is transitive and in the base form, the SVO structure indicates that the verb is an operation of the object class; when the verb is in the passive voice, it indicates that the verb is an operation of the subject class. An intransitive verb requires no Object, which indicates that the behaviour happens on the Subject itself, changing its state; it therefore describes the behaviour of the subject class.
Table 5-4. Mappings between NL and OO Relationships

Sentence Structure             | Class   | Attribute | Operation
SVO (Subject-Verb-Object)      | Object  | -         | Verb
SVedO (Subject-Verbed-Object)  | Subject | -         | Verb
SV (Subject-Verb)              | Subject | -         | Verb
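To make the ownership rule of Table 5-4 concrete, a minimal sketch follows; the function, its arguments and the way voice is flagged are assumptions for illustration only, not part of the approach's implementation.

# Sketch of the Table 5-4 rule: which class owns the verb as an operation,
# depending on the simple-sentence form.
def owner_of_operation(subject, verb, obj=None, passive=False):
    """Return (owning_class, operation) for a simple sentence."""
    if obj is None or passive:
        # SV or SVedO: the verb describes behaviour of the subject class
        return subject, verb
    # SVO with a transitive verb in base form: operation of the object class
    return obj, verb

# owner_of_operation("product", "control", obj="elevators")  -> ('elevators', 'control')
# owner_of_operation("product", "installed", passive=True)   -> ('product', 'installed')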
Prepositions are a class of words that indicate relationships between nouns, pronouns or other words in a sentence, as well as joining two sentences. Most often they come before a noun. In grammar, simple prepositions are single-word prepositions (e.g. across, after, at, before, between, by, during, from, in, into, of, on, to, under, with and without) and compound prepositions consist of more than one word (e.g. in between, because of, in front of), for instance: 'The book is in between War and Peace', 'The book is in front of the clock', 'The book is on the table'.
5.3 Patterns
NL structured patterns provide general criteria for identifying classes and relationships, and can be divided into two categories: one for identifying single OO concepts and one for identifying OO relationships. The first includes the patterns derived from the noun based and verb based mappings introduced in section 5.2; the second is more complicated, derived from the diversity of possible sentence structures.
5.3.1 Patterns for Identifying Class
Each pattern associates an OO concept (OOC) with an NL element (NLE). In Table 5-6, the symbol ‘>’ links a child node (on the left) and its parent node (on the right). The POS tag NP (Noun Phrase) is the root of each pattern, since NP is the key node for identifying classes in a grammatical tree-structure. NP can be followed by different noun related POS tags. In the Stanford Parser [Klein, D.; Manning, C.D. (2003)], Compound Nouns and Proper Nouns share the same POS tags, NNP and NNPS. A Gerund Noun is the –ing form of a verb, and the corresponding POS tag is verb related; this is discussed in the section on Operations of a Class. Though verbs do not appear in many different types, they have different forms according to tense or voice. Table 5-5 lists the POS tags of the Stanford Parser that are involved in our patterns. A pattern might link to other patterns, in which case the associated patterns are searched recursively. For instance, the first pattern searches for a singular noun (NN) that has a noun phrase (NP) as its parent node.
As a small example, Figure 5-1 demonstrates that four candidate classes, ‘elevator’, ‘floor’, ‘requests’ and ‘doors’, are captured by applying patterns 1 and 2. Each green oval represents the matching of a noun associated pattern. The leaf of a tree structure (or sub-tree) is called ‘the head’ in the grammar tree, and it is interpreted as a class.
Table 5-5. List of POS tags

POS tag | Part Of Speech (POS)
NP      | Noun Phrase
NN      | Noun, singular or mass
NNS     | Noun, plural
NNP     | Proper noun, singular & Compound noun, singular
NNPS    | Proper noun, plural & Compound noun, plural
PRP     | Personal pronoun
PRP$    | Possessive pronoun
VP      | Verb Phrase
VB      | Verb, base form
VBD     | Verb, past tense
VBG     | Verb, gerund or present participle
VBN     | Verb, past participle
VBP     | Verb, non-3rd person singular present
VBZ     | Verb, 3rd person singular present
Table 5-6 Noun Associated Patterns

ID | Pattern    | OOC   | NLE                   | Associated Pattern
1  | @NN > NP   | Class | Noun, singular        | ID
2  | @NNS > NP  | Class | Noun, plural          | ID
3  | @NNP > NP  | Class | Proper noun, singular | ID
4  | @NNPS > NP | Class | Proper noun, plural   | ID
Figure 5-1. Applied noun associated patterns1
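As an illustration of how patterns 1 to 4 can be evaluated against a parse tree, the sketch below uses NLTK's ParentedTree purely as a stand-in for the Stanford Parser output; the helper name and the toy parse string are assumptions made for the example.

# Minimal sketch of the noun patterns in Table 5-6: "@NN > NP" means a noun
# whose parent node is a noun phrase; the noun is taken as a candidate class.
from nltk.tree import ParentedTree

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}   # patterns 1-4

def candidate_classes(parse):
    tree = ParentedTree.fromstring(parse)
    found = []
    for node in tree.subtrees():
        parent = node.parent()
        if node.label() in NOUN_TAGS and parent is not None and parent.label() == "NP":
            found.append(" ".join(node.leaves()))   # the head noun as candidate class
    return found

print(candidate_classes(
    "(S (NP (DT A) (NN product)) (VP (VBZ controls) (NP (NNS elevators))))"))
# -> ['product', 'elevators']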
5.3.2 Patterns for Identifying Relationships
Each pattern is used to identify two words and the ownership between them. The form of a verb has an impact on associating the verb with its class, indicated by either the subject or the object. The patterns defined for each form of verb are shown in Tables 5-7 to 5-12. Each pattern links OO concepts and NL elements. The POS tags for the different forms of verbs used in the Stanford Parser can be found in [Klein, D.; Manning, C.D. (2003)].
Table 5-7. VB related patterns, base form
(only includes action verbs, not Auxiliary Verbs e.g. 'is, to (be)')

Pattern                           | OOC              | NLE
(VB $+!VP) > VP                   | Operation        | Verb base form
@/NN.?/ > ( NP $+ (VP << VB = ))  | InteractiveClass | Verb base form
@/NN.?/ > ( NP <- (S << VB = ))   | InteractiveClass | singular Object Noun
@/NN.?/ >>( S $- (VB = ) > VP)    | OwnerClass       | singular Object Noun
@/NN.?/ >>( NP $- (VB = ) > VP)   | OwnerClass       | singular Object Noun
@/NN.?/ >>( NP $- (VB = ) > VP )  | OwnerClass       |
Table 5-8. VBP related patterns, base form, = VB
(only includes action verbs, not Auxiliary Verbs e.g. 'is, to (be)' (VP ![< VP | < S]))

Pattern                          | OOC              | NLE
( VBP !$+ VP ) > VP              | Method           | Verb base form
@NNS > ( NP $+ (VP < VBP = ))    | InteractiveClass | Subject Noun
@/NN.?/ > ( NP > (VP < VBP = ))  | OwnerClass       | singular Object Noun
1 ‘When an elevator has no requests, it remains at its current floor with its doors closed’
Table 5-9. VBZ related patterns, 3rd person singular present

Pattern                            | OOC              | NLE
(VBZ![< is | < has ]) > VP         | Operation        | 3rd person singular present
@/NN.?/ > ( NP $+ (VP < VBZ = ))   | InteractiveClass | Subject Noun
@/NN.?/ > ( NP $+ (VP << VBZ = ))  | InteractiveClass | Subject Noun
@/NN.?/ >> ( NP > (VP << VBZ = ))  | OwnerClass       | singular Object Noun
Table 5-10. VBD related patterns, Verb Past Tense

Pattern                       | OOC              | NLE
@VBD > VP                     | Operation        | Verb past tense
@NN > ( NP $+ (VP < VBD = ))  | InteractiveClass | Subject Noun
@NN > ( NP > (VP < VBD = ))   | OwnerClass       | Object Noun
Table 5-11. VBN related patterns, Verb Past Participle

Pattern                            | OOC              | NLE
@VBN > VP                          | Operation        | Verb past participle
@/NN.?/ > ( NP > (VP << VBN = ))   | InteractiveClass | singular Object Noun
@/NN.?/ > ( NP $+ (VP < VBN = ))   | OwnerClass       | Subject Noun
@/NN.?/ > ( NP $+ (VP << VBN = ))  | OwnerClass       | Subject Noun
Table 5-12. VBG related patterns, Verb Gerund & Verb Present Participle

Pattern                       | OOC              | NLE
@VBG > VP                     | Operation        | Verb gerund / present participle
@NN > ( NP $+ (VP < VBG = ))  | InteractiveClass | Subject Noun
@NN > ( NP > (VP < VBG = ))   | OwnerClass       | singular Object Noun
The tree structures for the patterns above are illustrated in Figure 5-2 below. The italic part in each figure demonstrates an example, but only two of them are based on the example sentence given in the previous section. For the first six complete patterns, the name of the operation is denoted by the head of the VP, a verb.
The red oval in the centre of the grammar tree is considered an incomplete match, as the head of the matching sub-tree is not a noun; this identifies a place where a question needs to be generated for clarification.
Figure 5-2. Applying the SVO pattern2
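The sketch below illustrates, under simplifying assumptions, how such an incomplete match might be flagged: each verb is paired with the NP immediately to the left of its verb phrase as a candidate subject, and verbs with no such NP are recorded as needing a clarification question. It is a simplified stand-in for the verb patterns above, not the actual implementation.

# Simplified sketch: flag verbs whose enclosing VP has no subject NP to its left,
# i.e. incomplete matches for which a question must be generated.
from nltk.tree import ParentedTree

def verbs_needing_questions(parse):
    tree = ParentedTree.fromstring(parse)
    flagged = []
    for node in tree.subtrees(filter=lambda t: t.label().startswith("VB")):
        enclosing_vp = node.parent()
        left = enclosing_vp.left_sibling() if enclosing_vp is not None else None
        if left is None or left.label() != "NP":
            flagged.append(node.leaves()[0])   # no subject noun head: ask the user
    return flagged

print(verbs_needing_questions(
    "(S (NP (DT A) (NN product)) (VP (VBZ is) (VP (VBN installed))))"))
# -> ['installed']   ('is' has the subject NP 'A product'; 'installed' does not)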
5.3.3 Patterns for Associating attribute and class
Prepositions are a class of words that indicate relationships between nouns, pronouns
and other words in a sentence. Most often they come before a noun; examples can be found in Figure 5-3.
2 ‘When an elevator has no requests, it remains at its current floor with its doors closed’
[Figure: diagram relating the candidate classes Product, Elevators (with operation install()), Building and MFloors through install, control, 'in' and 'with'; '?' marks a missing element, and the legend distinguishes static relations, dynamic relations, over-write and inherit.]
Figure 5-3. Applied patterns for associating attributes and class
3 “A product is to be installed to control elevators in a building with m floors”
5.4 Question Generation & User Participation
5.4.1 Question formats
For incomplete patterns, questions are generated according to the position of the NP in the sentence tree-structure. If the NP has a sister VP, the question should be a Subject Question, starting with the question words When/What/Which/Who/Whose; if the NP is a child of a VP, the question should be an Object Question, starting with the question words What/Which/Who. Questions should be generated according to the role that the missing word plays in the corresponding sentence.
Each candidate class needs to meet three criteria to be a valid class of the problem domain; these are to have attributes, to have methods and to interact with at least one other class (or have a relationship with another class). Very often such information is unavailable in one sentence, and it is crucial to consider the whole text, rather than a single noun in one sentence, during class identification. In reality, domain relevant concepts might be missing from the domain descriptions. The principle is to generate questions to dig out the relationships among relevant concepts in the domain descriptions and also to assist in discovering any missing concepts. The most challenging job is to determine what information is domain relevant, both among the information stated in the domain description and among the missing information. One potential key to this issue is user participation in this computer assisted domain concept identification.
In order to achieve both targets, each pattern has a number of associated questions to deal with both missing parts of the grammar tree that matches the pattern and the selection of a correct interpretation from alternatives. Questions are also used to determine the validity of classes. For the development of the NL structure patterns, we use closed-class words (e.g. prepositions) in order to reduce the alternatives in interpretation and improve coverage, rather than using open-class words (nouns, verbs, etc.) like those mentioned earlier.
Question templates are established based on the principle/assumption that a method cannot function without being called by a class (either the class it belongs to or another class), namely the InteractiveClass, and a method cannot be created without the class it belongs to, namely the OwnerClass. The defined question templates, with associated answers, can be found in Table 5-13 below. The implications of the question templates and their integration with the patterns are discussed through some examples in the following section.
Table 5-13. Question template and choices of answer

Question
“Which thing from the list above ‘” + Method + “’ ?” [asking for the Subject]
“Is ” + interactiveClass + “ changed by the selected verb ‘” + method + “’ ?” [asking if this is a change of state]
“What/Who ” + method + “ ” + ownerClass + “ ?”
“What is the data type of ‘Attribute’ ?”
Options
class; Yes; No
Answer
Noun; Action / Change; Interactive; Owner Class; Method; Class; Noun; -; Yes; Attribute / State; Method; interactiveClass; Method; ownerClass; Method; No; N/A; Class.Method; Noun; N/A; Noun; String/Integer/Boolean/
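For illustration, filling the question templates of Table 5-13 can be sketched as below; the wording follows the table, while the function and its arguments are invented for the example and are not part of the implementation described in Chapter 6.

# Sketch: choosing and filling a question template for a candidate method.
def generate_question(method, owner_class=None, interactive_class=None):
    if interactive_class is None and owner_class is not None:
        # third template: ask who/what performs the method on the owner class
        return "What/Who " + method + " " + owner_class + " ?"
    if interactive_class is None:
        # first template: ask which listed thing performs the method
        return "Which thing from the list above '" + method + "' ?"
    # second template: ask whether the interactive class changes state
    return "Is " + interactive_class + " changed by the selected verb '" + method + "' ?"

print(generate_question("installed", owner_class="product"))
# -> What/Who installed product ?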
5.4.2 Associating Question with Pattern
As shown in Figure 5-4 below, one InteractiveClass is missing, which is the subject of
‘install’ in the text; a question mark is used to name the missing InteractiveClass, which
also indicates that a question will be generated for further input from users.
Since the OwnerClass is available in this case, the third question template in Table 5-13
is applied to generate a simple question for users, guiding them to clarify the ambiguity. We discussed earlier that missing information might not be domain relevant
and hence users are also given the option to choose not to provide any further
information. Other examples are shown in Appendix B.
[Figure: the same diagram as Figure 5-3, with '?' marking the missing InteractiveClass, the subject of 'install'; Product, Elevators (with operation install()), Building and MFloors are linked through install, control, 'in' and 'with', and the legend distinguishes static relations, dynamic relations, over-write and inherit.]
Figure 5-4. Example of generating question for missing InteractiveClass
4 “A product is to be installed to control elevators in a building with m floors”
In summary, the significance of this approach is not how to improve the performance of the applied linguistic techniques in identifying OO concepts from a domain description, but to complement them and to improve their performance by introducing a mechanism that supports the participation of domain experts in clarifying ambiguities during the process. Furthermore, the use of OO techniques to associate different concepts denoted by individual words makes the system more practical by reducing the amount of communication required. However, there are many open issues in this study, e.g. how different terms for the same concept can be detected, how the plural and singular forms of the same word can be detected, etc. Solutions for such issues could be found by applying other existing linguistic techniques in future studies.
CHAPTER 6 - Implementation
This chapter starts with a discussion of the options for implementing the approach. The first section presents the implementation of the patterns and question generation. The second section presents how the system is designed, including the class identification and the dialog interface.
The implementation consists of two stages. Stage one generates an XML file in which the initial candidate classes and operations are stored. A POS parser is adopted to create the grammar tree, onto which the patterns are applied to capture nouns and verbs that indicate candidate classes and operations. The initial XML file is the input to stage two, user interaction, where the user can interact with the system through system-generated questions in order to clarify the ambiguity of the domain description and supply the missing information. The question templates introduced earlier provide the foundation for question generation. For each generated question, the system also lists candidate answers for the user to choose from; this guidance avoids unnecessary mistakes by limiting the choices of answers. Such answers provide data to the initial XML file that can reduce the ambiguity of the domain description and improve the completeness of the identified domain concepts. As a result, this user interaction process improves the overall quality of requirements elicitation.
6.1 The Requirements
The core of the approach can be summarized as follows:
1. NL concepts/structure interpretation – applying patterns to transform the NL concepts/structure of the domain description into OO concepts/structure
2. OO concepts/structure complementation – generating questions to users, based on the pattern's associated question template, through which further information can be obtained to complement the initial view of the domain
The potential candidate OO concepts introduced in the domain description need to be identified through the interpretation process based on the patterns. With the patterns designed in a way that links NL and OO concepts, the idea is to match the content of the domain description with the patterns, so that the matching parts can be interpreted into OO concepts. However, a domain description is a sequence of characters, whereas the patterns are built as tree structures with POS tags as their nodes; a combination of structure transformation and POS annotation is therefore required.
Though the requirements below are very high level, they are considered essential to the success of the implementation.
1 Pattern based class identification
1.1 NL parsing
1.1.1 Transforming the structure of domain description
1.1.2 Annotating POS tags to domain description
1.2 Pattern identification
1.2.1 Comparing restructured domain description with patterns
1.2.2 Identifying the matching part of a particular pattern
1.3 NL interpretation
1.3.1 Identifying candidate OO concepts from the restructured domain description based on the patterns
1.3.2 Examining the candidate OO concepts, highlighting the invalid ones by applying OO modelling technique
1.3.3 Supporting easy traceability
2 Question generation
2.1 Generating questions in NL expression based on the question templates
2.2 Gathering further information from domain experts
2.3 Modifying the initial OO concepts identification, developing more comprehensive and detailed view of problem domain
6.1.1 Pattern based class identification
The patterns are designed to bridge the gap between natural language structures and object oriented structure, and thereby to assist a translation between NL and OO. The XML parser, as the implementation of the patterns, is one of the key components; it is involved in both the initial class identification and the question generation, providing a bilingual interpretation between NL and OO based on the patterns. How the XML parser works with the other components is explained in the system design later. Below is a scenario of the translation from NL to OO.
(ROOT [97.364]
(S [97.259]
(NP [13.168] (DT [4.563] A) (NN [6.445] product))
(VP [82.949]
(VBZ [0.148] is)
(S [76.657]
(VP [76.392]
(TO [0.010] to)
(VP [76.363]
(VB [0.002] be)
(VP [73.661]
(VBN [7.186] installed)
(S [63.917]
(VP [63.653]
(TO [0.010] to)
(VP [63.623] (VB [6.604] control)
(NP [11.558]
(NNS [9.003] elevators))
(PP [42.471]
(IN [1.563] in)
(NP [39.246]
(NP [9.993]
(DT [1.413] a)
(NN [6.811] building))
(PP [28.711]
(IN [3.427] with)
(NP[24.883]
(NN [12.269] m )
(NNS [8.880] floors))
)))))))))))
(. [0.002] .)))
Figure 6-1. An example of a parsed sentence
In the following, a scenario is given to show how this works. Taking the first sentence of the domain description of the Elevator Problem, "A product is to be installed to control elevators in a building with m floors.", Figure 6-1 shows the parsing result. With the added POS tags, the role of each word is specified by the tag that precedes it; the tag S refers to a Sentence, NP refers to a Noun Phrase, and so on. A full explanation of the POS tags can be found in the Appendix. As the grammatical structure is grouped by the brackets, it can be further transformed into the tree representation shown in Figure 6-2; this tree is then scanned for matching patterns. Each pattern is searched for against the whole tree, in order to maximize the matching of NL concepts to OO concepts of the problem domain.
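As an illustration only, the following Java sketch shows how such a grammar tree could be produced with the Stanford parser. It assumes a recent Stanford Parser release with the standard English PCFG model on the classpath; the class name DomainSentenceParser is hypothetical and not part of the implementation described here.

import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.trees.Tree;

public class DomainSentenceParser {
    public static void main(String[] args) {
        // Load the English PCFG grammar shipped with the Stanford parser (model path is an assumption).
        LexicalizedParser parser =
            LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

        String text = "A product is to be installed to control elevators in a building with m floors.";

        // Split the raw text into tokenized sentences, then parse each one into a grammar tree.
        for (List<HasWord> sentence : new DocumentPreprocessor(new StringReader(text))) {
            Tree tree = parser.apply(sentence);
            // Prints a bracketed tree comparable to Figure 6-1, without the per-node scores.
            tree.pennPrint();
        }
    }
}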
Figure 6-2. Example of grammar tree representation
<class>
  <cName cPattern="Root (@NN > NP )">product</cName>
  <cName cPattern="Root (@NN > NP )">building</cName>
  <cName cPattern="Root (@NN > NP )">m</cName>
  <cName cPattern="Root (@NNS > NP )">elevators</cName>
  <cName cPattern="Root (@NNS > NP )">floors</cName>
</class>
<attribute>
</attribute>
<method>
  <mName mPattern="Root (VB $+ !VP > VP )" InteractiveClass="product" OwnerClass="elevators">control</mName>
  <mName mPattern="Root (@VBN > VP )" InteractiveClass="" OwnerClass="product">installed</mName>
</method>
Figure 6-3. Identified classes in XML
6.1.2 Question generation
As discussed previously, the nature of NL, such as its ambiguity and lack of precision, is considerably problematic for problem domain analysis, and attempts to adopt NLP for automated requirements analysis have shown that conventional NLP cannot address all NL related issues; conducting high quality class identification fully automatically is therefore ambitious. However, NLP has proven useful in translating simple NL descriptions into OO concepts, and this capability can support user participation in an environment of computer assisted domain analysis and requirements elicitation.
The objective of question generation is to examine the initial class identification and to generate questions for any undefined parts, so that these can be further defined according to the answers given by the users. Consider the following example, in which the candidate method 'installed' has no interactive class:
A product is to be installed
<mName mPattern="Root (@VBN > VP )" InteractiveClass="" OwnerClass="product"> installed
</mName>
Below is the question generated from the question format who/what-verb-noun.
Who/What 'installed' 'product'?
By allowing the user to specify who or what installed the product, an actor that interacts with the domain, missing from the initial domain description, is discovered.
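A minimal sketch of how such a question could be assembled from the stored XML is given below. It assumes the candidate concepts of Figure 6-3 are wrapped in a single root element and read with Dom4J (the XML library chosen later in this chapter); the file name candidates.xml and the class name are illustrative.

import java.io.File;

import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;

public class QuestionGenerationSketch {
    public static void main(String[] args) throws Exception {
        // Read the candidate OO concepts produced by pattern matching (file name is an assumption).
        Document doc = new SAXReader().read(new File("candidates.xml"));

        // Walk every candidate method and question any missing interactive class.
        for (Object o : doc.getRootElement().element("method").elements("mName")) {
            Element method = (Element) o;
            String interactive = method.attributeValue("InteractiveClass");
            if (interactive == null || interactive.isEmpty()) {
                // Instantiate the who/what-verb-noun question template.
                System.out.println("Who/What '" + method.getTextTrim() + "' '"
                        + method.attributeValue("OwnerClass") + "'?");
            }
        }
    }
}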
6.2 System Design
This study presents an iterative and incremental approach to class identification, supported by an interactive environment in which users participate to complement incomplete information and to clarify ambiguous information. The pattern based class identification and question generation introduced above are the key requirements for the system. The system architecture illustrated in Figure 6-4 shows how these two aspects are implemented and how the corresponding components collaborate and communicate within the system to satisfy all the requirements.
One of the main components is the dialog interface, shown on the right hand side, which obtains user inputs to complement the initial class identification derived from the domain description. As an initial input, the domain description, provided by customers, supplies domain relevant information that is essential and useful for developing a solution to the problem. However, the domain description is written by people who understand the problem yet are not able to see or express it from the angle of software engineering. The domain description may therefore not contain all the relevant information, which can instead be discovered during user participation.
On the other hand, it may also contain large amounts of irrelevant information that need to be filtered out. As shown at the bottom left of Figure 6-4, Domain Description Processing pulls relevant domain concepts out of the domain description based on the patterns and translates them into classes, methods and attributes. The output candidate classes form the basis for further improvement and completion.
The objective is to achieve effective communication among stakeholders from different backgrounds. The input comes from the users and the output goes to the software engineers. In this interactive environment, users can participate in the domain concept identification, and incomplete and/or ambiguous data can be clarified by their inputs.
The approach encompasses two parts, Domain Description Processing and the Dialogue Interface, which are built on the foundation of the Patterns and Question Templates and comprise three components: Class Identification, the Question Generator and User Interaction. The following two sections describe how each component is implemented, including the selection of techniques and tools, and how these components are integrated into the two parts.
[Figure 6-4 shows the two parts of the system, Domain Description Processing (Class Identification, driven by the mappings and Patterns, turning the incomplete and ambiguous Domain Description into Candidate Classes) and the Dialog Interface (Question Generator and User Interaction, clarifying the class identification), together with the data and processing flows between them. It also maps each requirement to its technical solution: NL parsing – Stanford parser; pattern identification – Stanford pattern matcher; NL interpretation – XML parser; question generation – question generation & option generation.]
Figure 6-4. System architecture and technical solutions
6.2.1 Class identification
POS tagging (NL processing) – The input domain description is a sequence of characters. In order to identify OO concepts, it is crucial that this sequence is restructured to match the structure of the patterns, in other words that a grammar tree is generated for each sentence. Hence the initial domain-specific problem description is processed by POS tagging, adapted from the Stanford parser. Each sentence of the domain description is parsed into a grammar tree, with each word assigned a POS tag, to allow the search for matching patterns. The POS tagging technique was introduced in the previous chapter and the adopted POS tagging system is discussed later.
Pattern Matching – As introduced in the previous chapter, the defined patterns are tree-structured word sequences in which each word has a POS annotation as its parent node, identifying its role. The aim is to find the matching patterns in the grammar tree generated by the NL parser. NL concepts and OO concepts are associated by the patterns, which bridge the gap between them and facilitate the translation from one to the other. Hence the vocabulary of sequences matching the grammar tree can be translated into OO concepts. During Pattern Matching, each grammar tree output by POS tagging is scanned to discover sub-trees that match any of the defined patterns. Each pattern is applied, at least once, to each grammar tree to pull out all the candidate domain-relevant OO concepts. As a result, the vocabulary on a matching sub-tree is stored in an XML file, under the element corresponding to the category of the matching pattern.
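The sketch below illustrates this step, assuming the pattern matcher is the Tregex engine bundled with the Stanford parser; the pattern "@NN > NP" is one of the class patterns used in this chapter, while the method name is illustrative.

import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.tregex.TregexMatcher;
import edu.stanford.nlp.trees.tregex.TregexPattern;

public class PatternMatchingSketch {
    /** Scans one grammar tree for singular nouns dominated by an NP and reports them as candidate classes. */
    public static void findCandidateClasses(Tree grammarTree) {
        // "@NN > NP": a singular noun whose parent node is a noun phrase.
        TregexPattern classPattern = TregexPattern.compile("@NN > NP");
        TregexMatcher matcher = classPattern.matcher(grammarTree);
        while (matcher.find()) {
            // The matched POS node dominates a single leaf, which is the candidate class name.
            Tree match = matcher.getMatch();
            System.out.println("candidate class: " + match.firstChild().value());
        }
    }
}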
In terms of implementation, the key is to assign a POS tag specifying the morphology of each word, and to parse the sentence to construct its grammatical structure. In linguistics, the former is called POS tagging and the latter parsing. Although POS tagging and grammatical parsing play an important role in this study, both belong to a different research domain and, as described earlier, have been well developed in theory and practice during the last decade. It is therefore sensible to adopt an existing tool for sentence parsing and POS tagging, provided it meets all the following requirements:
i) it must be open source and downloadable from the internet;
ii) it must provide POS tagging and grammar tree generation, so that the defined patterns can be applied to the grammar tree to extract the key information from the domain description;
iii) ideally, it should support pattern matching, i.e. searching for all sub-trees of the grammar tree that match a particular input pattern.
After careful investigation, the Stanford parser was the only tool available at the time that satisfied all the requirements. It supports POS tagging as well as parsing of grammatical structure, facilitating class interpretation from the domain description. More importantly, it also meets the additional requirement of supporting pattern matching. Figure 6-5 is a data flow chart demonstrating a scenario of domain description processing. It shows how the initial domain description, in .txt format, is processed towards the initial class identification. The inputs to this stage also include the XML file in which the interpreted OO concepts are stored and the Stanford parser that builds the grammatical structure of each NL expression.
When all the inputs are loaded, the system reads the first sentence of the domain description and passes it to the Tokenizer of the Stanford parser. The tokenized sentence is then passed to the Parser, which builds the grammatical structure of the sentence. An example of a tokenized sentence and of the parser output is also shown in Figure 6-5. Once the sentence's grammatical structure is built, the pattern matcher of the Stanford parser is used to find matching patterns. These are the functionalities adopted in the XML parser from the Stanford parser. Each pattern associates NL concepts with OO concepts, so the NL concepts of a match on the grammar tree can be interpreted into the corresponding OO concepts. Such data is filed as the initial candidate classes. This process repeats until the end of the domain description is reached.
The patterns are stored as a two-dimensional array built into the XML parser. Since the Stanford parser requires a large amount of memory, this design achieves low maintenance and fast, easy access to the patterns.
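As an illustration, such a table could be held as a simple two-dimensional String array, pairing each pattern with the OO concept it identifies; the pattern strings below are taken from the examples in this chapter, while the class name is hypothetical.

public class PatternTable {
    // Each row pairs a tree pattern with the category of OO concept it identifies.
    // Holding the table in memory avoids re-reading any file while the parser is loaded.
    static final String[][] PATTERNS = {
        { "@NN > NP",       "class"  },  // singular noun under a noun phrase
        { "@NNS > NP",      "class"  },  // plural noun under a noun phrase
        { "VB $+ !VP > VP", "method" },  // verb not immediately followed by a VP, under a VP
        { "@VBN > VP",      "method" }   // past participle under a VP
    };
}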
[Figure 6-5 shows the inputs (the Stanford Parser, the OO Concepts .XML file and the Domain Description) and the processing loop: load the inputs, read a sentence, tokenize it, parse it, apply the built-in patterns (Parse2OOconcepts) and update the XML, repeating until the end of the domain description is reached.]
Figure 6-5. Data flow chart for domain description processing
The XML parser is facilitated by the following main packages inherited from the Stanford parser.
For POS tagging:
  edu.stanford.nlp.ling, for linguistic analysis;
  edu.stanford.nlp.process.Tokenizer, preparing for parsing;
  edu.stanford.nlp.process.WordSegmentingTokenizer.
For parsing:
  edu.stanford.nlp.parser;
  edu.stanford.nlp.parser.lexparser.LexicalizedParser, the English parser;
  edu.stanford.nlp.trees, the grammar tree generator.
Within class identification we work with tagged, structured text, aiming to find model
generating rules for all its structures. The use of an XML format for tagged text allows
us to regenerate the initial NL texts into a more structured, less ambiguous form after
each cycle in our iterative process. This provides a valuable source of documentation in
its own right. As already noted in existing related work, it is clear that there is no
simple one-to-one relationship (mapping) between NL and OO concepts. To address
this problem, each NL pattern is associated with a number of questions to be answered
by the user, allowing the correct OO concept to be determined from their answers.
6.2.2 Dialog Interface
The main contribution of this study is the development of a mechanism that allows users to participate in the process of automatic class identification supported by NLP techniques. The dialog interface is designed for user interaction with the system: the questions are displayed and the answers provided by users are obtained to complement the initial candidate class identification. Question generation is therefore critical in supporting the user participation that improves the quality of automatic class identification.
[Figure 6-6 shows the dialog interface loop: load the candidate classes .XML file, read the inputs, represent the classes, apply the built-in Question Templates to the candidate classes, generate each question and its answer options, obtain the user's answer and update the input .XML, repeating until all candidate classes have been processed.]
Figure 6-6. Data flow chart for dialog interface
Figure 6-6 is a data flow chart demonstrating a scenario of the proposed approach, showing how the initial candidate class identification works. The initial input is the domain documents, which are processed by the NLP component employing word tagging techniques. Each word of the input file is assigned a part-of-speech (POS) tag and documented in an Extensible Markup Language (XML) format. The XML file is then scanned to extract structured text that matches the pre-defined NL structure patterns. The structured data that completely matches a pattern is stored, interpreted as OO concepts and represented in the XML Metadata Interchange (XMI) format. For those parts of the text that do not match any pattern completely, the questions corresponding to the closest matching NL structure pattern are used to obtain further information from users to clarify incomplete or ambiguous data.
We believe that the additional information obtained from the answers provided by users can be valuable to NL analysis. A number of questions are associated with each pattern, to deal both with any missing parts of the pattern and with the selection of a correct interpretation from the alternatives. As introduced earlier, class identification considers the whole text rather than a single noun; that is, each candidate class needs to meet three criteria to be valid: it must have attributes, methods and a relationship with at least one other class. Questions are also used to determine the validity of classes. For the development of the NL structure patterns, we use closed-class words (e.g. prepositions) in order to reduce the alternatives in interpretation and improve coverage, rather than open-class words (nouns, verbs, etc.) like those mentioned earlier.
Improved Interface Design
In terms of a user friendly interface, candidate classes are presented in a way that gives users easy access to them. The initial XML file is scanned and restructured into a Java tree, where each candidate class is shown as a node with its associated attributes and operations as child nodes. To avoid mistakes and improve efficiency, the system also generates candidate options for each question, from which the user can choose when answering. The improvements are discussed in the next chapter.
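A minimal Swing sketch of this Java tree representation is shown below; the candidate classes and operations are taken from the Elevator Problem output in Figure 6-3, and everything else is illustrative rather than the actual interface code.

import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.swing.JTree;
import javax.swing.tree.DefaultMutableTreeNode;

public class CandidateClassTreeSketch {
    public static void main(String[] args) {
        // Root node; each candidate class becomes a child with its members as grandchildren.
        DefaultMutableTreeNode root = new DefaultMutableTreeNode("Candidate classes");

        DefaultMutableTreeNode product = new DefaultMutableTreeNode("product");
        product.add(new DefaultMutableTreeNode("operation: installed"));
        root.add(product);

        DefaultMutableTreeNode elevators = new DefaultMutableTreeNode("elevators");
        elevators.add(new DefaultMutableTreeNode("operation: control"));
        root.add(elevators);

        // Display the tree so the user can inspect classes, attributes and operations.
        JFrame frame = new JFrame("Candidate classes");
        frame.add(new JScrollPane(new JTree(root)));
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.pack();
        frame.setVisible(true);
    }
}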
There are two inputs, the domain description and the candidate classes, which are displayed in the interface. Compared with existing studies, this approach supports the participation of stakeholders (domain experts, customers, etc.) in the class identification process, so essential data missing from the initial domain description can be supplied at the earliest possible time.
The selection of the data structure for the output format considers the following main criteria:
– traceability;
– accessibility to modelling CASE tools;
– transferability to a tree structure.
XML Metadata Interchange (XMI), as a standard format for storing class models, is accepted by many UML CASE tools (e.g. ArgoUML); it provides easy access to UML CASE tools for further object-oriented analysis/design and can potentially be carried forward to preliminary code. Choosing XML as our output structure simplifies XMI file construction and therefore supports easy access to modelling CASE tools, allowing engineers to carry out further requirements and design analysis.
Java, as a freely available programming language with good support for XML parsing, is a good choice for the system implementation. There are a number of parsers available for processing XML in Java; we chose Dom4J and SAX.
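The sketch below shows how Dom4J could be used to write an identified class and method in the element layout of Figure 6-3; it wraps the elements in a single root element so the file is well formed, and the root element and file name are assumptions rather than the exact implementation.

import java.io.FileWriter;

import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;
import org.dom4j.io.OutputFormat;
import org.dom4j.io.XMLWriter;

public class XmlOutputSketch {
    public static void main(String[] args) throws Exception {
        // Build a document holding one candidate class and one candidate method.
        Document doc = DocumentHelper.createDocument();
        Element root = doc.addElement("ooConcepts");

        root.addElement("class")
            .addElement("cName")
            .addAttribute("cPattern", "Root (@NN > NP )")
            .addText("product");

        root.addElement("method")
            .addElement("mName")
            .addAttribute("mPattern", "Root (@VBN > VP )")
            .addAttribute("InteractiveClass", "")
            .addAttribute("OwnerClass", "product")
            .addText("installed");

        // Write the candidate concepts to disk for the dialog interface to read back.
        XMLWriter writer = new XMLWriter(new FileWriter("candidates.xml"),
                                         OutputFormat.createPrettyPrint());
        writer.write(doc);
        writer.close();
    }
}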
Briefly, the system contains two main parts: Domain Description Processing, which adopts an existing tool to process the domain description, find matches of the patterns and extract candidate OO concepts; and the Dialog Interface, which generates the questions. The output of both parts is stored in XML format for easy traceability.
CHAPTER 7 - Experimental Studies
This chapter starts with the design of the experimental studies and the considerations for evaluating them. Sections two and three describe how the experiments for the Patterns and the Question Templates were carried out, and the performance of each. The selection of scenarios and participants for all the experimental studies is discussed in the final section.
7.1 The design of experiments
As stated in the previous chapter, this study attempts to assist requirements engineers, as
well as to enable domain experts to take part, since they play a very important role
during the construction of requirements, particularly in the early stage. Overall, this
approach supports both sides - the technical and the non-technical - to maximize the
control and assist in communication among them. The primary contribution of this
study is to investigate how to support the participation of domain experts in the
established environment to achieve a fluent communication among a diversity of
stakeholders.
The results of the preliminary experiment presented in Chapter 4 demonstrate the
feasibility of allowing stakeholders’ participation in Requirements Elicitation by
automated question generation. Encouraged by the outcome of the preliminary test, the
objective of the experiments described in this chapter is to collect data that can provide
evidence for the evaluation of the patterns and question formats, and so to assess the
following aspects:
– The performance of this approach: to prove whether the approach can work in reality;
– The efficiency and quality of this approach;
– The potential of this approach: to prove that the approach has potential for future development rather than being a one-off approach.
Table 7-1. Aspects considered for overall evaluation
(Internal validity: proposition / cause-effect inferences)

If                                              Then                                         Aspects in Measurements
Users participate                               The ambiguity can be clarified               Performance: clarity
Users participate                               The missing information can be completed     Performance: completeness
The ambiguity can be clarified & the
missing information can be completed            A better domain model can be generated       Efficiency
More patterns and question templates            More ambiguity can be clarified              Potential: clarity
More patterns and question templates            More missing information can be completed    Potential: completeness
We consider that there are two components involved in this approach: identifying object oriented concepts by applying the mappings and patterns in domain description processing, and clarifying the domain description by implementing the question formats in question generation to enable users' participation. The evaluation is divided into two parts, each of which is tested in terms of performance, efficiency and potential.
Patterns
o performance – testing the correctness and accuracy
o efficiency – evaluating quality of class identification
o potential – proving the extendibility
Question formats
o performance – testing the correctness and accuracy
o efficiency – in clarifying the ambiguous information and the
communication between users and the application
o potential – proving the extendibility
With such support and guidance, the domain relevant information provided by domain experts can be improved in terms of completeness and clarity. The following factors are considered important for assessing efficiency:
– the percentage of relevant classes identified by the patterns;
– the number of irrelevant classes identified by the patterns;
– how understandable the questions generated from the question formats are.
The patterns and question formats are the key aspects which need to be tested and
evaluated. However, other variables in the hypothesis that can have an impact on the
results are participants (e.g. background of participants, domain knowledge) and
scenarios (e.g. the size and complexity of chosen scenarios). In the following sections,
a detailed explanation of chosen scenarios and participants is given.
7.2 Patterns
7.2.1 Stage one: Patterns Implementation
Stage one tests the performance of our approach, including accuracy and efficiency, via three scenarios, to prove that:
– built on the existing mappings between NL and OO, the pattern based approach is capable of identifying classes from a domain description;
– questions can be used to support customers' participation, helping to clarify any ambiguity of the domain description inherited from the nature of natural language (NL), as well as to discover important missing information.
The evaluation of the accuracy and efficiency of the defined patterns is based on the number of Object Oriented Concepts (OOC), Classes (C), Methods (M) and Attributes (A) which can be identified by applying the patterns in domain description processing. The measure of accuracy is based on the number of OOC which are captured against the number of OOC which could in theory be captured. The following provides the detailed measurements for each type of OOC.
C1. Number of C which should be captured
C2. Number of C correctly captured
C3. Number of C missing
C4. Number of C incorrectly captured
M1. Number of M which should be captured
M2. Number of M correctly captured
M3. Number of M missing
M4. Number of M incorrectly captured
MOC1. Number of M-Owner Class which should be captured
MOC2. Number of M-Owner Class (OwnerC) correctly captured
MOC3. Number of M-Owner Class missing
MOC4. Number of M-Owner Class incorrectly captured
MIC1. Number of M-Interactive Class which should be captured
MIC2. Number of M-Interactive Class (InteractiveC) correctly captured
MIC3. Number of M-Interactive Class missing
MIC4. Number of M-Interactive Class incorrectly captured
A1. Number of A which should be captured
A2. Number of A correctly captured
A3. Number of A missing
A4. Number of A incorrectly captured
The following provides the detail of how the efficiency is measured:
– the frequency of appearance of the patterns;
– the number of patterns applied in a sentence;
– the missing OOC which could be determined vs. the missing OOC which are determined:
  number of missing C determined;
  number of missing M determined;
  number of missing A determined;
  number of missing M-Owner Class determined;
  number of missing M-Interactive Class determined;
– whether the captured concepts are domain relevant, measured by Precision [Oakes, M. (1998)]:
  Precision = number of relevant items obtained / total number of items retrieved
– how many of the domain relevant concepts that could be captured are captured, measured by Recall [Oakes, M. (1998)]:
  Recall = number of relevant items actually obtained / total number that would have been obtained
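To illustrate the two measures with purely hypothetical counts, not taken from the experiments: if pattern matching retrieved 20 candidate classes, of which 15 were relevant to the domain, while 18 relevant classes could in principle have been obtained, then
  Precision = 15 / 20 = 75%
  Recall = 15 / 18 ≈ 83%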
Each chosen scenario was tested and the data collected was analysed against the above criteria. Of the three diagrams in Figure 7-1, the left one represents the class identification for EP (Elevator Problem), the middle one for BS (Bank System) and the right one for FM (Film Management). The results are encouraging in that the correctness of class identification is 78% on average for the three scenarios, and candidate classes are missing only for BS, at just 5%. However, the proportion of incorrectly identified candidate classes is much higher than the other two figures, at 21% for EP and 38% for FM, yet only 5% for BS.
[Figure 7-1 contains, for each scenario (EP, BS and FM), a pie chart of the class identification results (C2 correctly captured, C3 missing, C4 incorrectly captured) and line charts of precision and recall against the number of inputs.]
Figure 7-1. Performance on class identification
The data obtained clearly demonstrates that the major problem in class identification is a relatively high rate of incorrect class identification. The dramatic variation of incorrect class identification across the three scenarios prompted us to look inside the scenarios to find the reason. The major difference among the three domain descriptions is the use of compound nouns (NP -> NN(S) NN(S)). Compound nouns appear 7 times among 33 in EP, 7 among 49 in FM, and only 2 among 40 in BS. This indicates that compound nouns are as important as single nouns in class identification, yet they are not covered by the patterns for class identification.
The diagrams in Figure 7-2 represent captured methods for EP, BS and FM, from left to
right. The results are perfect in that the correctly captured methods equal the number of
methods which can theoretically be captured, contributing 100% correctness for all
three scenarios.
[Figure 7-2 contains, for each scenario (EP, BS and FM), a pie chart of the method identification results (M2 100%, M3 0%, M4 0%) and line charts of precision and recall against the number of inputs.]
Figure 7-2. Performance on method identification
Unlike classes, a method belongs to a particular class and, furthermore, a method can only function when it is triggered (e.g. by a class that interacts with it). It is therefore crucial that the owner class and interactive class of each method are identified as well. In the following, the results of method owner class and method interactive class identification are discussed. Associating a verb with its corresponding nouns, either subject or object, is the most difficult task.
[Figure 7-3 contains, for each scenario (EP, BS and FM), a pie chart of the method owner class identification results (MOC2 correctly captured, MOC3 missing, MOC4 incorrectly captured) and line charts of precision and recall against the number of inputs.]
Figure 7-3. Performance on MOC identification
Based on Figure 7-3 above, Table 7-2 below highlights the percentage for each aspect in the three scenarios. The results for MOC identification in FM are the best among the three scenarios; it has the highest correctly captured MOC at 90%, the lowest incorrectly captured MOC at 10% and the medium result for missing MOC at 5%. BS is the highest in missing MOC at 11% and EP is the highest in incorrectly captured MOC at 20%. The averages over the three scenarios are 80%, 5% and 15% for correct, missing and incorrect respectively. Based on this analysis, it is clear that incorrect identification is the main source of inaccuracy in identifying MOC.
Table 7-2. MOC result of each scenario from each aspect

Type       Correct      Missing      Incorrect
Highest    90% (FM)     11% (BS)     20% (EP)
Medium     80% (EP)      5% (FM)     16% (BS)
Lowest     72% (BS)      0% (EP)     10% (FM)
Average    80%           5%          15%
As mentioned above, verbs indicate the methods of particular classes, so the verb based patterns face the challenge of capturing the method's owner class as well as the class that interacts with it. The results of MOC identification show a connection with class identification. The accuracy of class identification and MOC identification is compared in Table 7-3. The closest result is EP, which has a 1% difference in incorrect identification between class and MOC and no difference in missing identification. BS has the largest differences in both incorrect identification and missing concepts between class and MOC: the former is 11% and the latter is 6%. The last row refers to FM, which has a 4% difference in incorrect and 5% in missing identification. Here the emphasis is on MOC: a negative value is used when the percentage of incorrect class identification is higher than that of incorrect MOC identification, in other words when the accuracy of MOC identification is higher than that of class identification, and a positive value is used in the opposite case. Table 7-3 below indicates that BS has the largest difference between class and MOC in the results for both incorrect and missing identification.
Table 7-3. Comparison of class identification and MOC identification

                      Incorrect                           Missing
Scenarios    Class    MOC     Difference        Class    MOC     Difference
EP           21%      20%     (-) 1%            0%       0%      0%
BS           5%       16%     (+) 11%           5%       11%     (+) 6%
FM           14%      10%     (-) 4%            0%       5%      (+) 5%
The results of FM on MIC identification are the best among the three scenarios, with the highest correctly captured MIC at 88% and the lowest for both missing MIC (0%) and incorrectly captured MIC (12%). EP is the highest in missing MIC at 50% and BS is the highest in incorrectly captured MIC at 26%. In other words, the incorrectly captured MIC is not relatively high, at 18% on average, but the missing MIC is 25% on average.
[Figure 7-4 contains, for each scenario (EP, BS and FM), a pie chart of the method interactive class identification results (MIC2 correctly captured, MIC3 missing, MIC4 incorrectly captured) and line charts of precision and recall against the number of inputs.]
Figure 7-4. Performance on MIC Identification
Table 7-4. MIC result of each scenario from each aspect

Type       Correct      Missing      Incorrect
Highest    88% (FM)     50% (EP)     26% (BS)
Medium     47% (BS)     26% (BS)     17% (EP)
Lowest     33% (EP)      0% (FM)     12% (FM)
Average    56%          25%          18%
The results of MIC identification also show a connection with class identification. The accuracy of class identification and MIC identification is compared in Table 7-5. EP has a better result for incorrect MIC identification at 17%, which is 4% lower than its incorrect class identification and 3% lower than its incorrect MOC identification; although both its missing class and missing MOC identification are 0%, its missing MIC is as high as 50%. BS has worse results, at 26% for both incorrect and missing identification: the former is 21% higher than its incorrect class identification and 10% higher than its incorrect MOC identification, while the latter is 21% higher than its missing class identification and 15% higher than its missing MOC identification. FM has the best results among the three scenarios, at 12% and 0%: the former is 2% lower than its incorrect class identification and 2% higher than its incorrect MOC identification, and the latter is equal to its missing class identification and 5% lower than its missing MOC identification.
Table 7-5. Comparison of class identification and MIC identification

                      Incorrect                           Missing
Scenarios    Class    MIC     Difference        Class    MIC     Difference
EP           21%      17%     (-) 4%            0%       50%     (+) 50%
BS           5%       26%     (+) 21%           5%       26%     (+) 21%
FM           14%      12%     (-) 2%            0%       0%      0%
Identifying the cause of the errors in MOC
The fact that both BS and FM have higher percentages of missing and incorrectly
identified MOC is difficult to understand. To investigate the cause of such differences,
each grammar tree is further inspected.
Table 7-6. Appearance of different types of Compound Nouns in each scenario

Compound Nouns                Compound Nouns in POS    EP    BS    FM
Noun Phrase + Noun Phrase     NNP – NNP                0     0     1
Noun + Noun                   NN(S) – NN(S)            0     4     3
Adjective + Noun              JJ – NN(S)               0     6     11
DT + Noun                     DT – NN(S)                     0     3
Verb + Noun                   VBN – NN(S)                    0     1
Total                                                        10    19
Table 7-6 presents the different types of Compound Nouns, with their corresponding tags, and highlights the number of compound nouns appearing in each scenario, categorized into the types above, with each number representing the frequency of appearance in that scenario. For EP there are fewer compound nouns involved in MOC, which leads to fewer inaccuracies in the result. On the other hand, FM and BS show a large difference in missing and incorrectly identified MOC. Table 7-7 summarizes the reasons for incorrectly identified MOC and missing MOC. Overall, nouns play very important roles in identifying domain relevant concepts.
Table 7-7. Identified reasons for incorrect and missing MOC

Scenarios    Cause for incorrect MOC      Cause for missing MOC
EP           Compound Nouns               PP -> IN/TO
BS           NP -> NP > NN/NNS            NP & PP -> NP -> NP
FM           Compound Nouns               NNP & NNP
Identifying the cause of the errors in MIC
Looking into the domain description of EP, the special verb has/have is the main reason for such a significant difference in the results. Unlike other verbs, which express behaviours, has and have can describe an association relationship between two concrete things, denoted by the two nouns around them. For instance, the sentence in EP "each floor has two buttons" expresses a potential association relationship between the two candidate classes "floor" and "buttons". The reasons for incorrectly identified MIC and missing MIC are listed in Table 7-8.
Table 7-8. Identified reasons for incorrect and missing MIC

Scenarios    Cause for incorrect MIC                           Cause for missing MIC
EP           Multiple NPs before the appearance of the verb    The use of PP -> IN (in/to/for)
FM           Compound Nouns                                    -
To achieve a more accurate outcome, it is necessary to develop a thorough understanding of the errors made when identifying each type of OO concept. The further investigations look into the generated grammar trees that contain incorrect and/or missing identifications, focusing on discovering the cases that differ from the correctly identified OO concepts.
Table 7-9. Taxonomy of Error Types & Causes of Errors
(error types: incompletely identified, incorrectly identified and exceptional, i.e. extra concepts identified)

OO concept                                          Cause of error
Class                                               Insufficient coverage of compound nouns
MIC, mis-identified MOC                             Sentence structures not covered by the patterns, i.e. when a verb is followed immediately by a PP (Preposition Phrase) rather than an NP; the use of NNP
Attribute                                           Insufficient use of compound nouns and special verbs, i.e. be (am/is/are)
Relationships                                       Special words and sentence structure, i.e. has/have; be (am/is/are)
Attributes                                          Compound nouns in the format JJ NN
Methods (exceptional: extra methods identified)     Special verbs, i.e. be (am/is/are)
The errors are classified into three types: i) incorrectly identified concepts, ii) incompletely identified concepts, and iii) unidentified concepts, in which the first two refer to C4/M4/MOC4/MIC4 and the last refers to C3/M3/MOC3/MIC3. As shown in Table 7-9, the main mistakes fall into two kinds: insufficient use of compound nouns and ignorance of special words. The former comprises the two forms of compound nouns introduced earlier: a) the form NN(S) NN(S), leading to an incompletely identified class, and b) the form JJ NN(S), leading to an unidentified attribute. The latter can lead to an unidentified attribute or even an unidentified relationship; an example is the special verb has/have, as shown in Figure 7-5.
[Figure 7-5 sketches the candidate classes Floor, First floor and Top floor, each associated with Buttons, with some of the multiplicities shown as "?" because they cannot be resolved from the description alone.]
Figure 7-5. "Each floor, except the first floor and top floor, has two buttons, one to request an up-elevator and one to request a down-elevator" (Appendix D)
To avoid the above errors, the extra patterns shown in Table 7-10 were developed and added to the existing list, so that more candidate domain concepts can be captured. For instance, a proper noun (NNP) can be identified as a candidate class to reduce the number of incorrectly identified classes, and candidate attributes can be captured by targeting compound nouns that match the sequence JJ-NN, although these need further clarification by users.
Table 7-10. Extension of patterns

Pattern              OOC      NLE
@NNP > NP            Class    Proper noun, singular
@NNPS > NP           Class    Proper noun, plural
@JJ $+ NNS > NP      Class    Adjective
@JJ $+ NN > NP       Class    Adjective

7.2.2 Stage two: Extendibility, Evaluation of Pattern Extension
In this stage the aim is to investigate further whether better results in class identification can be generated by adding more efficient patterns to the system. During this stage, an extension of the patterns is constructed and deployed in the system. The following aspects are tested again so that the results can be compared with those of the stage one experiment:
– the number of captured C;
– the number of captured M;
– the number of captured MOC;
– the number of captured MIC.
[Figure 7-6 contains four bar charts, C, M, MOC and MIC, comparing Result 1 and Result 2 for each scenario (EP, BS and FM), together with the improvement achieved by the pattern extension, expressed as a percentage.]
Figure 7-6. Improvements by the Pattern Extension
This experiment was run in the same manner as the experiment in stage one; the only difference was the number of patterns involved. To clarify understanding, the result generated in the stage one experiment is called Result 1 and the result generated in the stage two experiment is called Result 2. Figure 7-6 illustrates Result 1 and Result 2 and the difference between them as the improvement in percentage terms (1 - Result 1 / Result 2). Chart C refers to the number of captured Classes in all three scenarios, chart M to the number of captured Methods, chart MOC to the number of captured Method Owner Classes and chart MIC to the number of captured Method Interactive Classes. The column in yellow demonstrates how much the outcome improves, as a percentage, when more patterns are added, and hence suggests the potential of the approach in terms of extendibility.
7.3 Question Templates
In this section the emphasis is on testing the performance of the question templates, in two stages. Stage one covers the experimental study of question generation, during which the results are evaluated in terms of the improvements to, and the efficiency of, the identified OO concepts. Stage two follows, in which more question formats are developed and deployed into the system to evaluate the extendibility of the user-centred approach.
7.3.1 Stage one: experimental study of question generation
In order to test the accuracy of the Question Templates, we compare the number of OOC that could be captured with the number of OOC that are captured; in terms of efficiency, we compare the number of missing OOC that need to be clarified against the number of missing OOC that are clarified.
The experiment requires the involvement of real people interacting with the system in order to test the understandability of the pattern-associated question templates. In principle, question templates that are designed in the form of natural language should support easy communication between the participants and the system. The big challenge for this approach is the efficiency of the questioning process; for example, how many questions are generated to deliver the expected results, how long the process takes, and how well human beings can understand the machine generated questions. Facing these challenges, the following aspects were considered when designing the experiment, allowing us to collect valuable data for examining the efficiency of the approach.
A. The total number of questions generated
B. The maximum, minimum and average time taken to complete all the questions
C. The percentage of volunteers who misunderstood a particular question
D. The maximum number and percentage of questions misunderstood by volunteers
E. The minimum number and percentage of questions misunderstood by volunteers
F. The quality of the delivered results in terms of class identification
To compare and contrast the results of this experiment with those of the preliminary experiment, the following aspects are considered:
i. the time spent completing all the questions;
ii. the percentage of mistakes avoided.
For the choice of participants, the ideal candidates are domain experts who have insight into and knowledge of the problem domain. However, this is very difficult to achieve, especially in a short period of time. Alternatively, if we use a group of students, we would want the scenario to describe a problem familiar to most of them, so that anyone from the group has the potential to play the role of domain expert. The positive feedback from the student volunteers in the preliminary experiment gave us the confidence to use the same scenario for this experiment, so that the two sets of results can be investigated together.
In this experiment, first year computer science students volunteered to communicate with the application, answering questions regarding the domain of elevator systems. The experiment was carried out during their Linux laboratory class, where more students could be easily gathered and organized. To minimize the work for participants,
1. extra code was added to create a unique text file and XML file for each participant according to the user logon id, and
2. a shell script file was created, which set up the Java class path and directory, supporting easy access to the application.
To start, participants only needed to change to the specified directory and run the script file. The simplified procedure and familiar problem domain allowed participants to concentrate on answering the questions.
The quality of the class identification was also examined against the class diagrams delivered by the experienced students in the preliminary experiment. 1stDD is the hand-drawn class diagram based on the original domain description; 2ndDD is the hand-drawn class diagram based on the domain description delivered from the original class diagram. The size of the class was 67 and the results were:
A. The total number of questions generated was 12.
B. The maximum, minimum and average time taken to complete the 12 questions were 10 minutes, 3 minutes and 6 minutes respectively.
C. The percentage of volunteers who misunderstood a particular question was 9%.
D. The maximum number of questions misunderstood by volunteers was 6, and the percentage was 8%.
E. The minimum number of questions misunderstood by volunteers was 0, and the percentage was 91%.
Table 7-11. Class identification of Elevator Problem

Domain concepts                        1st DDM    Via question generation    2ndDDM
Elevator (C)                           83%        100%                       100%
Button (C)                             67%        100%                       86%
FloorButton (C)                        33%        0                          71%
ElevatorController (C)                 33%        100%                       100%
Door (C)                               33%        76%                        43%
ElevatorButton (C)                     17%        0                          71%
ElevatorController - Button (A)        33%        0                          33%
ElevatorController - Door (A)          17%        0                          17%
Button - ElevatorButton (G)            17%        0                          17%
Button - FloorButton (G)               17%        0                          17%
Elevator - ElevatorController (A)      0          100%                       0
The results of this experiment are very encouraging when compared with the earlier experiment. The difference in the average time taken is 6 minutes against 35 minutes (25 minutes for 1stDDM and 45 minutes for 2ndDDM). In terms of class identification, Table 7-11 above presents the percentage of students who delivered the correct domain concepts in the preliminary experiment (1stDDM and 2ndDDM) and the percentage of correct domain concepts delivered through the interaction of students with the question generation. On average,
i. the time spent completing all the questions went down from an average of 25 minutes to an average of 5 minutes;
ii. mistakes were avoided completely.
7.4 The Chosen Scenario and Participants
One of the issues in RE that has been discussed is the ambiguity of NL. As the main communication language among stakeholders, the clarity and complexity of the domain description has a significant impact on the outcome. The three scenarios are the Elevator Problem, the Bank System and Film Management. The complexity measures for each scenario are given in Table 7-12.
Table 7-12. Three Scenarios Complexity

       L      S      SL / Branch             Clause                 Depth
                     Min    Max    Ave       Min    Max    Ave      Min    Max    Ave
EP     128    8      5      28     16        2      4      22/8     7      15     10
BS     162    11     10     25     15        1      3      2        7      16     11
FM     206    10     10     40     20        1      5      3        9      20     12
The complexity of a domain is measured according to the size of the text:
L – the length of the text, i.e. the number of words in the text;
S – the number of sentences in the domain description.
The complexity of a sentence is measured according to:
SL – sentence length, the number of words in a sentence, including the maximum (Max), minimum (Min) and average (Ave);
Clause – the number of clauses / sub-sentences in a sentence, including the minimum, maximum and average;
Depth – the depth of a sentence, including the minimum, maximum and average depth of its grammar tree.
The minimum requirements for a useful candidate are that he or she:
– has relevant domain knowledge;
– can communicate well in English; and
– does not necessarily have any background in RE and/or OOA/OOD.
In summary, the experimental studies for both the Patterns and the Question Templates have been presented and their performance evaluated. More importantly, extensions were developed for both components to avoid the mistakes that appeared during the first experiment. Although ideally the approach should be tested by domain experts, encouraging students to participate was the only option available at the time. Overall, the Patterns and Question Templates are now more complete and are implemented more accurately.
CHAPTER 8 - Discussions
In this chapter, the methodology applied in developing the approach is discussed and evaluated. The evaluation is divided into three steps, represented by the following three sections; each one emphasizes the corresponding stage of the methodology:
– evaluation of the preliminary experiment (stage 1);
– evaluation of the 1st experiment (stage 2);
– evaluation of the 2nd experiment (stage 3).
8.1 Stage one
Based on the literature review we see there are problems in requirements identification
and that existing methods have their limitations in dealing with the problems. The
motivation of this study is to develop a solution that can complement other methods and
together improve the performance during requirements identification.
The focus of this approach is how to deal with the difficulties that one might face in
transferring natural language descriptions into object oriented models rather than how to
capture domain relevant information from natural language descriptions.
The
preliminary experiment was an investigation of difficulties which can be caused by the
nature of natural language.
To perform the experiment it was necessary to have a number of volunteers to participate and some domain descriptions written in English (as a type of natural language) to be transferred into object oriented models.
The criteria defined for choosing scenarios proved to be appropriate (see Chapter 4 for details): each scenario is simple, small and describes a common problem. As a result, all the participants had no difficulty understanding the scenarios and managed to complete the tasks within the given time.
The key considerations for choosing the right participants are i) their capability and/or experience in generating object oriented models from domain descriptions and ii) their availability. The right participants need to be available at short notice, to be able to understand the domain descriptions and to be able to create object oriented models from them. However, to maximize the opportunity of discovering the difficulties, the participants do not need to be extremely experienced professionals. The decision was to use penultimate year undergraduates who have studied and practised object oriented modelling.
Before the start of the experiment, instructions were handed out and an initial briefing was given to explain the purpose of the experiment, the tasks to be delivered, and so on. The briefing was followed by a five-minute refresher on class diagrams, which allowed students to start working on the tasks straight away.
The participants we had for this experiment were the right choice at the time. However, their knowledge of object-oriented modelling and their experience may vary. To address this issue, a pre-session might be necessary in order to test participants' knowledge of object-oriented modelling, so that those at the same level of experience in generating class diagrams can be grouped together.
Overall this experiment was very successful; we managed to obtain a good quantity of data, and the investigation of that data enabled us to summarize the common mistakes that people make in transferring domain descriptions to object oriented models (please refer to Figure 4-2), which reflect the ambiguities that can affect the translation between a domain description and an object oriented model.
8.2 Stage two
The proposed approach includes two major parts. One of them is to identify the OO
concepts from domain descriptions automatically. Previous studies have demonstrated
that NL elements can be interpreted into OO concepts by applying the mapping theory;
and NL descriptions can be transformed into POS based sequences. The solution is to
define mappings that connect the NL elements with the OO concepts, based on which
patterns can be developed for identifying the OO concepts.
Some tools are required in order to implement the patterns and to test their performance, such as a parser to transform NL descriptions and a matcher to match the output against the predefined patterns. Based on the literature review we understand that there are existing tools available to achieve these tasks. Considering the time scale, and that the objective of this study is not to build a tool set, the decision was made to adopt existing tools whenever possible. In the end, this proved to be the correct decision because it allowed us to concentrate on developing the patterns.
Once the first set of patterns was specified and implemented, a test took place to verify the correctness of the defined patterns. As a result of the analysis of the collected data, a Taxonomy of Error Types and Causes of Errors was presented in Table 7-9, which gave us the confidence to extend the patterns further. The comparison of the two versions of the defined patterns demonstrated that the more patterns are defined, the better the outcomes that can be obtained in terms of OO concept identification.
However, we are not only aiming to achieve automated OO concept identification; we are also looking for a solution that reduces the ambiguity of, and the missing information in, the domain description, which is the real challenge of this study. Although pattern matching does highlight the problems in the domain description, the patterns alone could not provide an answer on how to reduce the ambiguity in the domain description.
8.3 Stage three
The data obtained from the preliminary experiment demonstrated that the ambiguity of problem domain descriptions can lead to a misunderstanding of the problem and hence affect the outcome of the problem domain analysis. Most importantly, the findings indicate that the ambiguity can be described by the following two scenarios:
– irrelevant information included in the problem domain description;
– relevant information missed out of the problem domain description.
Clearly, our approach must demonstrate the ability to reduce the ambiguity outlined in the above two scenarios. We believe that the key to addressing the issue is the participation of stakeholders who have the relevant domain knowledge and are therefore the best candidates for clarifying such ambiguities. To meet this goal the following was achieved:
– establishing question templates with multiple choices based on the question formats of natural language;
– implementing the defined question templates to generate questions and multiple choices;
– developing an interface for interaction that allows stakeholders to answer the generated questions to clarify the ambiguities identified by the patterns;
– testing the performance of the question templates through experiment.
Efficiency and understandability were major concerns during the development of the question templates. The multiple choices defined for each question template proved effective, not only helping participants to understand the questions better but also enabling them to answer more quickly. Although we tried not to limit participants by providing multiple choices, doing so might still have had some negative effect.
The scenarios chosen for this experiment were the same as those for the preliminary experiment, hence the two sets of results can be compared and contrasted, and the correctness and completeness of OO concept identification for those scenarios can be discussed.
Overall, the methodology chosen for this study can be described as a test-build-test approach. At the beginning of each stage, an experiment was conducted and, based on the results generated from it, decisions could be made for the next stage. The advantage of doing so is that it helped us make the right decisions and therefore avoid any delay caused by rework during development. As a result we were able to develop and test all the essential features within a limited time. Moreover, the features developed and tested during the second stage were tested again during the third stage, which allowed a thorough comparison of the two sets of results.
CHAPTER 9 - Conclusion & Future Research
This chapter shows, in the following sections, how the work conducted reflects the hypothesis stated in Chapter 3, and then looks at the future work of this study.
9.1 Conclusions
This study is about finding a better solution for translating NL descriptions into OO domain models. The core issue is the difficulty of communication, mainly between the domain expert and the requirements engineer/business analyst, caused by:
– the complexity and ambiguous nature of NL;
– the domain expert's lack of technical knowledge;
– the requirements engineer's lack of understanding of the problem domain.
To tackle the problem, we need a mechanism that helps stakeholders to communicate in a way that involves less ambiguity and complication. If bi-directional translation between NL descriptions and object oriented concepts is possible, we see the potential of the mapping theory between NL and object oriented concepts, which can lead to a solution in which the mappings are embedded within an iterative and incremental clarification process where domain experts can correct ambiguous data in domain descriptions and/or provide additional information to complement incomplete data.
In the following sections we go through each hypothesis specified in Chapter 3 to discuss whether and how well we established the truth of each of them, and to conclude whether we have achieved what we set out to do and how successful we have been.
Sub-Hypothesis I:
An NL element can have multiple translations to object oriented
concepts; (which can be used for discovering/clarifying ambiguous data in NL
description)
The preliminary experiment was an investigation of common mistakes people make
when creating OO models from NL descriptions of problem domain. The analysis of
collected data demonstrated that an NL element can be translated to different object
oriented concepts by different people (examples and discussions can be found in
Chapter 4); yet whether this can be used for discovering and/or clarifying ambiguous
data in NL description is left open at this stage.
Sub-Hypothesis II: An NL expression of a problem can be translated to object oriented concepts to form an object oriented domain model.
Sub-Hypothesis III: An object oriented concept in an object oriented domain model can be translated into an NL expression.
Based on the existing mapping theory, we established mappings between NL elements and OO concepts so that one can be translated into the other. The patterns were derived from the mappings, and each pattern was associated with a particular OO concept and a particular NL element to support automatic translation between NL elements and OO concepts; for instance, the part of an NL expression that matches any of the defined patterns can be translated to the corresponding OO concept. However, the overall result is affected by the number of patterns defined, which is of course limited by the mappings established between NL elements and OO concepts.
Sub-Hypothesis IV: An object oriented modelling language (UML) is more accurate
than NL when describing domain model, which can be used as guidance to discover
problems within NL domain descriptions, i.e. missing data.
It is not realistic to expect domain descriptions that contain all the essential information about the problem domain. One of the challenges of this study is to discover what information is missing before it can be obtained. When performing the preliminary experiment, we asked each candidate to describe a class diagram using English, as a natural language. The results demonstrated that all the candidates could describe it in a simple and clear way and, surprisingly, the descriptions were very much the same; in other words, UML is less ambiguous, more standardized and more accurate compared to NL.
Inspired by the outcome of the preliminary experiment, the decision was made to verify each
candidate OO concept translated from NL elements of the domain description against built-in
rules derived from UML. For instance, an attribute or an operation must belong to a
particular class; if an attribute captured by the patterns does not belong to a class, the
problems it might indicate are i) this is not an attribute, or ii) the data identifying the class it
belongs to is missing from the corresponding NL expression. In order to highlight such
incompleteness, the captured candidate OO concepts are first stored in an XML
structure in which related candidate OO concepts are linked; for example, a class is
represented as an element followed by sub-elements that represent its attributes
and/or operations. Secondly, each candidate OO concept is assessed against the built-in
rules. Together, these steps allowed us to highlight potential issues within NL domain
descriptions by applying rules derived from the criteria for identifying classes in UML.
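As an illustration of the structure and rule just described, the following Python sketch stores candidate OO concepts as linked XML elements and checks the rule that an attribute must belong to a class. The element names and the checking routine are assumptions made for the sketch, not the implementation used in this study.

import xml.etree.ElementTree as ET

# Illustrative XML layout: a class element is followed by sub-elements for
# its attributes and operations; concepts captured without an owning class
# remain directly under the root.
candidates = ET.fromstring("""
<candidates>
  <class name="Elevator">
    <attribute name="buttons"/>
    <operation name="move"/>
  </class>
  <attribute name="illumination"/>   <!-- captured, but no owning class -->
</candidates>
""")

def check_attribute_rule(root):
    """Built-in rule sketch: an attribute must belong to a particular class.
    Unattached attributes either are not attributes at all, or the class
    they belong to is missing from the NL description."""
    issues = []
    for attr in root.findall("attribute"):    # attributes directly under the root
        issues.append(f"'{attr.get('name')}' has no owning class")
    return issues

print(check_attribute_rule(candidates))
# ["'illumination' has no owning class"]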
However, we only proved that this can be done for the class diagram; it was not feasible to
tackle all the UML diagrams in this study. Even within class identification, the different
relationships between two classes are not fully covered.
Sub-Hypothesis V: An incomplete object oriented domain model can be translated to
NL expressions, to allow the domain expert to provide additional information to complete
the model.
We applied simple sentence structures to form questions for obtaining further information
from domain experts to clarify ambiguity. The second experiment demonstrated that
this further information can complement the initial class identification, resulting in a better
class identification in terms of completeness. However, UML contains different
diagrams; within the time scale of this study we were unable to prove that this can be
done for all of them, and we did not generate the class diagram graphically.
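A minimal sketch of forming questions from simple sentence structures is given below. The template wording and slot names are illustrative assumptions rather than the actual question templates defined in this work.

# Illustrative question templates: each template is a simple sentence
# structure with slots for the candidate concept it asks about.
TEMPLATES = {
    "unowned_attribute": "What does '{name}' belong to?",
    "unowned_operation": "Who or what can '{name}'?",
    "missing_subject":   "What {verb} {object}?",
}

def ask(kind: str, **slots) -> str:
    """Fill a simple sentence structure to form a clarification question."""
    return TEMPLATES[kind].format(**slots)

print(ask("unowned_attribute", name="illumination"))    # What does 'illumination' belong to?
print(ask("missing_subject", verb="illuminates", object="when pressed"))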
Sub-Hypothesis VI: It is possible to support an iterative process so that additional
information can be added into the initial model
Questions for obtaining further information are generated while scanning the XML file
that stores the candidate classes. This process is repeated until all the candidate OO
concepts satisfy the rules. As a result, the more questions that are generated, the more
information can be gathered from domain experts and added into the XML file. Overall,
we proved that there is a way to allow additional information to be added during an
iterative process, although into class identification rather than into a class diagram.
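The iterative step described above can be sketched as follows, again as an assumption-laden illustration rather than the actual implementation: the XML structure is scanned, a question is generated for each rule violation, and the answer is added back into the structure until no violations remain. The element names and the stubbed domain-expert function are assumptions for the sketch.

import xml.etree.ElementTree as ET

def unattached_attributes(root):
    """Rule sketch: every attribute must belong to a class."""
    return root.findall("attribute")

def clarify(root, ask_domain_expert, max_rounds=10):
    """Repeat question generation until all candidate concepts satisfy the rule,
    adding each answer back into the XML structure."""
    for _ in range(max_rounds):
        loose = unattached_attributes(root)
        if not loose:
            return root                                  # every candidate satisfies the rule
        for attr in loose:
            owner = ask_domain_expert(f"What does '{attr.get('name')}' belong to?")
            root.remove(attr)                            # move the attribute under the named class
            cls = root.find(f"class[@name='{owner}']")
            if cls is None:                              # the answer names a class we did not have yet
                cls = ET.SubElement(root, "class", name=owner)
            cls.append(attr)
    return root

# Usage with a canned "domain expert" answer:
root = ET.fromstring("<candidates><attribute name='illumination'/></candidates>")
clarify(root, lambda question: "Button")
print(ET.tostring(root, encoding="unicode"))
# <candidates><class name="Button"><attribute name="illumination" /></class></candidates>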
Overall Hypothesis:
It is possible to improve object oriented domain
modelling using translation from NL if we incorporate within the process a step where
domain experts have to clarify any incompleteness or any type of ambiguity in the
domain descriptions.
Furthermore, if the missing and/or ambiguous data can be
identified in the domain descriptions, this can be an important indication to developers
to further extract domain relevant information from different resources.
In this study we did not prove the above statement, but we did manage to prove that at
least class identification can be improved by allowing domain experts to participate
in clarifying incompleteness and/or ambiguity in the domain descriptions.
For the second part, missing and/or ambiguous data can indeed be highlighted in the
domain descriptions; whether this can be an important indication to developers to further
extract domain relevant information from different sources is left open in this study.
What is definite is that, by generating questions, we can help developers to extract
domain relevant information from stakeholders.
9.2
Future research
In this study, a mechanism was developed which allows domain experts to participate,
through the dialogue interface built on the patterns and question templates, by providing
additional information to clarify ambiguous data in domain descriptions, hence assisting
the engineers in developing a better understanding of the problem domain.
In the future, if more patterns can be developed, more ambiguous data can be
highlighted; and with more question templates defined, domain experts will be able to
provide more additional information in order to clarify ambiguous data. In addition,
applying more existing linguistic techniques, such as stemming or automatic
term recognition, might help to identify multiple terms used for the same concept, or
singular and plural forms of the same concept, so that the number of questions that need
to be answered by the domain experts can be reduced and hence the overall efficiency of
this technique improved.
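As a sketch of how stemming might reduce the number of questions, surface forms that share a stem could be merged so that only one question is asked per concept. The sketch assumes the NLTK package and its Porter stemmer are available; the grouping strategy is an assumption, not part of this study.

from collections import defaultdict
from nltk.stem import PorterStemmer

def merge_terms(terms):
    """Group surface forms that share a stem, so one question covers one concept."""
    stemmer = PorterStemmer()
    groups = defaultdict(set)
    for term in terms:
        groups[stemmer.stem(term.lower())].add(term)
    return dict(groups)

print(merge_terms(["button", "buttons", "elevator", "elevators", "floor"]))
# e.g. {'button': {'button', 'buttons'}, 'elev': {'elevator', 'elevators'}, 'floor': {'floor'}}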
Furthermore, if UML diagrams, e.g. the class diagram, could be generated graphically,
that would improve the usability of the system and therefore might help domain experts to
understand the questions even better. Of course, testing this will require the participation
of real domain experts to use the system.
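One possible route, not taken in this study, would be to emit a textual diagram description from the stored candidate classes and let an existing renderer such as PlantUML draw it. The sketch below assumes the XML layout used in the earlier sketches; the element names and the choice of PlantUML are illustrative assumptions.

import xml.etree.ElementTree as ET

def to_plantuml(root) -> str:
    """Emit PlantUML class-diagram source from the candidate classes."""
    lines = ["@startuml"]
    for cls in root.findall("class"):
        lines.append(f"class {cls.get('name')} {{")
        for attr in cls.findall("attribute"):
            lines.append(f"  {attr.get('name')}")
        for op in cls.findall("operation"):
            lines.append(f"  {op.get('name')}()")
        lines.append("}")
    lines.append("@enduml")
    return "\n".join(lines)

root = ET.fromstring(
    "<candidates><class name='Elevator'>"
    "<attribute name='buttons'/><operation name='move'/>"
    "</class></candidates>")
print(to_plantuml(root))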
Appendix A: list of POS tags
CC    Coordinating conjunction; e.g. and, but, or...
CD    Cardinal number
DT    Determiner
EX    Existential there
FW    Foreign word
IN    Preposition or subordinating conjunction
JJ    Adjective
JJR   Adjective, comparative
JJS   Adjective, superlative
LS    List item marker
MD    Modal; e.g. can, could, might, may...
NN    Noun, singular or mass
NNP   Proper noun, singular
NNPS  Proper noun, plural
NNS   Noun, plural
PDT   Predeterminer; e.g. all, both... when they precede an article
POS   Possessive ending; e.g. nouns ending in 's
PRP   Personal pronoun; e.g. I, me, you, he...
PRP$  Possessive pronoun; e.g. my, your, mine, yours...
RB    Adverb; most words that end in -ly, as well as degree words like quite, too and very
RBR   Adverb, comparative; adverbs with the comparative ending -er, with a strictly comparative meaning
RBS   Adverb, superlative
RP    Particle
SYM   Symbol; should be used for mathematical, scientific or technical symbols
TO    to
UH    Interjection; e.g. uh, well, yes, my...
VB    Verb, base form; subsumes imperatives, infinitives and subjunctives
VBD   Verb, past tense; includes the conditional form of the verb to be
VBG   Verb, gerund or present participle
VBN   Verb, past participle
VBP   Verb, non-3rd person singular present
VBZ   Verb, 3rd person singular present
WDT   Wh-determiner; e.g. which, and that when it is used as a relative pronoun
WP    Wh-pronoun; e.g. what, who, whom...
WP$   Possessive wh-pronoun
WRB   Wh-adverb; e.g. how, where, why
Appendix B: examples of question generation
“The problem concerns the logic required to move elevators between floors according to the following constraints: Each elevator has a set of m buttons, one for each floor”
[Figure: candidate classes (Elevators, Floors), attributes (mButtons) and operations (concern(), require(), move(), has(?), illuminate(), pressed(), control()) identified from the sentence; unresolved items are marked “?”.]

‘These illuminate when pressed and cause the elevator to visit the corresponding floor’
[Figure: classes Buttons (illuminate(), pressed()), Elevators (control(), move(), visit()) and Floors, with the generated questions “what illuminate?” and “who press what?”.]

‘The illumination is cancelled when the elevator visits the corresponding floor’
[Figure: candidate class Illumination (cancel()) alongside Elevators (control(), move(), visit()) and Floors, with the generated question “what cancel illumination?”.]

‘Each floor, except the first floor and top floor has two buttons, one to request an up-elevator and one to request a down-elevator’
[Figure: complete and incomplete patterns over the existing class Floors and candidate classes Buttons, firstFloor, secondFloor and downElevator, with a missing candidate class marked “?” and the generated questions “what request upElevator?” and “what request upElevator/downElevator?”.]

‘These buttons illuminate when pressed’
[Figure: existing class Buttons (illuminate(), press()) with the generated question “What is pressed?”.]

‘When an elevator has no requests, it remains at its current floor with its doors closed’
[Figure: classes Elevators (control(), move(), doors, visit()) and Floors, with the generated questions “What remain at current floor?” and “doors of What is closed?”.]
Appendix C: Domain description of bank management
You are asked to design a system to handle current and savings accounts for a bank.
Accounts are assigned to one or more customers, who may make deposits or withdraw
money. Each type of account earns interest on the current balance held in it. Current
accounts may have negative balances (overdrafts) and then interest is deducted. Rates of
interest are different for each type of account. On a savings account, there is a
maximum amount that can be withdrawn in one transaction. Bank employees may
check any account that is held at their branch. They are responsible for invoking the
addition of interest and for issuing statements at the correct times. A money transfer is a
short lived record of an amount which has been debited from one account and has to be
credited to another. A customer may create such a transfer from their account to any
other. Transfers within a branch happen immediately, while those between branches take
three days.
Appendix D: Domain description of elevator problem
A product is to be installed to control elevators in a building with m floors. The problem
concerns the logic required to move elevators between floors according to the following
constraints:
Each elevator has a set of m buttons, one for each floor. These illuminate when pressed
and cause the elevator to visit the corresponding floor. The illumination is canceled
when the elevator visits the corresponding floor.
Each floor, except the first floor and top floor, has two buttons, one to request an up-elevator and one to request a down-elevator. These buttons illuminate when pressed.
The illumination is canceled when an elevator visits the floor and then moves in the
desired direction.
When an elevator has no requests, it remains at its current floor with its doors closed.
Appendix E: Domain description of film management
We are interested in building a software application to manage filmed scenes for
realizing a movie, by following the so-called “Hollywood Approach”. Every scene is
identified by a code (a string) and it is described by a text in natural language. Every
scene is filmed from different positions (at least one), each of which is called a setup.
Every setup is characterized by a code (a string) and a text in natural language where the
photographic parameters are noted (e.g., aperture, exposure, focal length, filters, etc.).
Note that a setup is related to a single scene. For every setup, several takes may be
filmed (at least one). Every take is characterized by a (positive) natural number, a real
number representing the number of meters of film that have been used for shooting the
take, and the code (a string) of the reel where the film is stored. Note that a take is
associated to a single setup. Scenes are divided into internals that are filmed in a theatre,
and externals that are filmed in a location and can either be “day scene” or “night
scene”. Locations are characterized by a code (a string) and the address of the location,
and a text describing them in natural language.