Matching Conceptual Graphs as an Aid to Requirements Re-use

Kevin Ryan and Brian Mathews
Dept. of Computer Science & Information Systems
University of Limerick
Ireland
(To Be Presented at Requirements Engineering ’93)
Abstract
The types of knowledge used during requirements
acquisition are identified and a tool to aid in this process,
ReqColl (Requirements Collector) is introduced. The tool
uses conceptual graphs to represent domain concepts and
attempts to recognise new concepts through the use of a
matching facility. The overall approach to requirements
capture is first described and the approach to matching
illustrated informally.
The detailed procedure for
matching conceptual graphs is then given. Finally
ReqColl is compared to similar work elsewhere and some
future research directions indicated.
1. Introduction
In making the transition from the informalities of the
real world into the
unambiguous, clearly-defined
representations which are needed for computer
manipulation, various types of knowledge must be used.
Problems arise from the insufficient knowledge which a
software designer commonly has about the application
domain of the required software. Conversely the end
user, or client, often has limited knowledge about
computer systems. The main aim of the Requirements
Collector system - ReqColl [18] - is to assist this
informal-to-formal transition, and to provide a
common basis for client/analyst communication. In
particular it aims to facilitate the re-use of previously
developed requirements specifications from the same or
similar domains.
2. ReqColl Real-World Knowledge
Two types of knowledge need to merge during
requirements engineering: the client’s knowledge - which
relates to the real-world things he deals with every day,
and the actions which act upon those things; and the
analyst’s knowledge - which relates to how those things
can be represented as data in a computer, and what
functions will have to exist to model the real-world
actions. These two types of knowledge are discussed in
this section and the next.
The ReqColl system is based on the principle that, for the
purposes of software engineering, the real-world is
divisible into (not necessarily distinct) problem domains,
each of which contains a set of related problems to be solved
through the software engineering process. These problems
are related in that they all involve the manipulation of
entities and actions which are particular to the problem
domain. (Applications programming then consists of the
construction of a set of software solutions for the
problems of particular problem domains).
A second premise is that, within any solvable problem
domain (one for which software solutions can be written),
the sets of entities and actions are finite (though
potentially large). Furthermore, the entities and actions of
the domain are the important items to be modelled for the
purposes of Requirements Engineering. In this, ReqColl
shares the ideas which are central to Object-Oriented
Design [7, 19], summarised by Meyer as "Ask not first
what the system does - Ask what it does it to !" [19]. This
motivates the third premise of ReqColl - that a
knowledge-based CASE tool for Requirements
Engineering should be based upon the reuse of knowledge
gained in previous requirements processing.
Object-Oriented Design developers have shown that the
best way to achieve software reuse is through the
modelling of real-world entities as objects, to which
actions are applied in the form of messages. Objects exist
in inheritance hierarchies, where child objects contain the
features of the parent objects, plus additional features
peculiar to the child. ReqColl applies the same principles
to the activity of Requirements Collection, by using an
inheritance framework to define the terms of a problem
domain language as advocated, for example, by Davis
[8], and used in Draco [20]. It is not enough simply to
define a lexicon of words used in the domain - it must also
be possible to express the semantics of the terms, in a
representation suitable for computer manipulation. The
language can then be used as a basis for client-analyst
interaction, since both will have an understanding of the
essentials of the domain.
In ReqColl, the problem domains are organised
hierarchically. For example, the problem domains of
library systems and warehouse inventory systems could be
sub-domains of the more general inventory domain. Since
all real-world problems must exist in the context of the
real-world, all Problem domains exist as sub-domains of
one encompassing Problem domain, representing the
modelled aspects of the real-world.
The entities and actions within sub-domains are then
derived by the specialisation of appropriate items existing
in the super-domains (in the same manner as Object-Oriented Design). The use of abstraction mechanisms
such as specialization has been shown to be appropriate
for suppressing the display of unnecessary detail in real-world modelling [4]. Real-world knowledge is captured in
ReqColl through the use of conceptual graphs (see next
section). In ReqColl, the semantics of each term of the
domain language is defined by a network of nodes, which
express properties of the term, and the context in which it
is normally to be found. The interconnections of nodes are
constrained by conceptual relations, which may be
inherited, or may be specific to the domain. The relations
ensure that the semantic integrity of the definition is
maintained. By using a network matching algorithm (see
section 6) to compare the nodes and relations of two
definitions, the use of conceptual graphs allows us to
identify concepts appearing in a problem statement, not
only by simple word match, but also by semantic match.
This permits the detection of synonyms and close
matches.
One of the principal reasons for choosing conceptual
graphs for Requirements Engineering is that they provide
a language which is intermediate between natural
language (as used by the client), and the formality of first-order logic. Sowa [24] demonstrates how conceptual
graphs can be derived from natural language statements,
and, using the formula operator φ, gives a direct mapping
from conceptual graphs into first-order logic. The
feasibility of constructing conceptual graphs from natural
language has been shown by Velardi et al [26], for the
Italian language. Conceptual graphs also employ the
abstraction principles used successfully by the Taxis
project for real-world modelling [11, 4] - namely,
Classification (conceptual graph individuals), Aggregation
(conceptual graph Joins), and
Generalisation/Specialisation (conceptual graph type
hierarchies and graph restrictions). They are also
consistent with the features proposed for conceptual
modelling languages by Borgida [5].
3. Software Knowledge
A software analyst working in a particular problem
domain gradually acquires a set of generic design plans
which are used to solve recurring problems. Analysis of a
particular problem is guided by the analyst attempting to
fill in missing details for what is felt to be the closest
generic design plan to the current problem (i.e. the
Analyst uses his experience). Such behaviour has been
demonstrated empirically in [1, 23], and is the basis for
the plans used in the Programmer’s Apprentice [22].
The analyst’s knowledge (the solution model) provides a
framework for the structure of the problem. The missing
sections and details of the framework for specific
problems must be obtained from the client, who (if the
analyst has been trying to solve the correct problem) will
provide specific requirements to instantiate the generic
pattern. If the client requests features which do not exist
in the analyst’s current model, then the analyst must choose
between
i) adapting the current model to suit;
ii) discarding the current model and adopting another
which matches the problem better; or
(if no suitable model is available from experience)
iii) creating a new model, thereby gaining experience.
A succession of such actions allows the analyst to create
an instantiated model specific to the problem being
solved, but based upon models derived from prior
experience. A CASE tool which intends to guide an
analyst must be able to capture, express, store and retrieve
such design frameworks for stating the problems of a
problem domain. It is exactly these pattern-filling
activities that ReqColl attempts to mimic.
4. Conceptual Graphs
Conceptual Graphs (CGs) are used to express Sowa’s
conceptual structures [24]. CGs have a close affiliation
to semantic (associative) networks [10], but the
semantic expression is deeper than the simpler
network representations. A CG consists of a network of
nodes, some of which are concept nodes (describing
concepts of the application domain), linked via relation
nodes (which define the allowable roles concepts can
have in relation to each other). Both concepts and
relations have types associated with them, themselves
defined in hierarchical (IS-A) structures. It is therefore
possible to define sub-concepts and sub-relations, which
share the basic properties of their super-types but are
more restricted.
4.1 Notation
ReqColl currently uses the textual form of conceptual
graphs, as follows :
Concept          Relation         Arc
[ <Label> ]      ( <Label> )      --->
Any particular section of a CG can be roughly interpreted
in the following manner. A graph of the form :
[ C1 ]---->( R1 )---->[ C2 ]
may generally be read as "The R1 of C1 is C2".
(Grammatical considerations may make the interpretation
improper, but it serves as a reasonable guide to the
semantics of the graph.) For example we may interpret :
[ Stock_Item ]---->( QTY )---->[ Number : 10 ]
as "The Quantity of the Stock Item is 10".
In the example, Stock_Item, QTY, and Number are
labels which reference type definitions, themselves
expressed as CGs. The "10" is an example of a referent
whose purpose is to define more accurately the concept
which the node is trying to express.
Concept nodes may only be connected to relation nodes,
and vice versa. The domain and range of the connecting
relation node imposes selectional constraints on allowable
combinations of concepts. Thus, in the example above,
the type definition of QTY could be :
Type QTY (x,y) is
[ Thing ]--->( QTY )--->[ Number ].
In this way the range of QTY is limited to concepts (and
sub-Concepts) of Number and constructs such as :
[ Thing ]---->( QTY )--->[ Green ].
can therefore not occur, since Green is not a sub-concept
of Number.
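As an illustration only, the selectional constraint described above can be sketched in Python. The dictionaries PARENT and SIGNATURE, the concept Integer, and the function names are assumptions made for this sketch; they are not part of ReqColl.

```python
# Hypothetical concept hierarchy: child type -> parent type.
# "Integer" is an invented sub-concept of Number, used only for illustration.
PARENT = {
    "Number": "Thing",
    "Green": "Thing",
    "Stock_Item": "Thing",
    "Integer": "Number",
}

# Hypothetical relation signatures: relation -> (domain type, range type),
# mirroring the type definition [ Thing ]--->( QTY )--->[ Number ].
SIGNATURE = {"QTY": ("Thing", "Number")}


def is_subtype(concept, ancestor):
    """Return True if `concept` is `ancestor` or one of its sub-concepts."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def arc_allowed(domain, relation, range_):
    """Check [domain]--->(relation)--->[range_] against the relation signature."""
    domain_type, range_type = SIGNATURE[relation]
    return is_subtype(domain, domain_type) and is_subtype(range_, range_type)


print(arc_allowed("Stock_Item", "QTY", "Number"))  # permitted construct
print(arc_allowed("Thing", "QTY", "Green"))        # rejected construct
```

The second call is rejected because Green is not a sub-concept of Number, which is the case excluded in the text.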
4.2 Type Expansion
The labels which appear in the nodes of CGs
reference type definitions, themselves expressed as CGs.
Any label may be replaced by the complete type definition
which it references, through type expansion. For example,
given the type definitions :
Type Stock_Item is
[Entity](Attr)--->[Stock-Code]
        (Attr)--->[Colour].
Type Stock-Code is
[Identification](Part)--->[Number]
                (Part)--->[Letter].
(A Stock_Item is an entity with two attributes - a Stock-Code, and a Colour; a Stock-Code is a two-part
Identification consisting of a Number and a Letter). An
expansion of Stock-Code in Stock_Item gives :
Type Stock_Item is
[Entity](Attr)--->[Identification](Part)--->[Number]
                                  (Part)--->[Letter],
        (Attr)--->[Colour].
(A Stock_Item has two attributes - a Colour, and a
two-part Identification). In the reverse operation, type
contraction, a sub-graph of a CG which corresponds to
some type definition is replaced by its type label.
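Type expansion can be pictured as a simple substitution. The following Python sketch is illustrative only: the TYPE_DEFS table and the nested-tuple encoding of a graph are assumptions of the sketch, not the ReqColl representation.

```python
# Type definitions: label -> (genus concept, list of (relation, target label)).
# The encoding is hypothetical; the content follows the two definitions above.
TYPE_DEFS = {
    "Stock_Item": ("Entity", [("Attr", "Stock-Code"), ("Attr", "Colour")]),
    "Stock-Code": ("Identification", [("Part", "Number"), ("Part", "Letter")]),
}


def expand(label, target):
    """Return the definition of `label` with the label `target` expanded.

    Each arc becomes (relation, concept, subarcs); an expanded arc carries
    the genus of `target` as its concept and the arcs of `target` as subarcs.
    """
    genus, arcs = TYPE_DEFS[label]
    expanded = []
    for relation, concept in arcs:
        if concept == target and concept in TYPE_DEFS:
            sub_genus, sub_arcs = TYPE_DEFS[concept]
            expanded.append((relation, sub_genus, list(sub_arcs)))
        else:
            expanded.append((relation, concept, []))
    return genus, expanded


print(expand("Stock_Item", "Stock-Code"))
```

Type contraction would be the inverse step: recognising the sub-graph rooted at Identification and replacing it by the label Stock-Code.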
5. ReqColl Method
ReqColl attempts to mimic the pattern-filling approach of
Analysts (as discussed in section 3) by the use of predefined patterns for application domains. Clients’
problem statements are compared against the generic
pattern for the appropriate application domain, and the
clients are queried about any differences. The client
must then explicitly state that some missing part is not
meant to be included - or else add it to his problem
definition. In this case further missing parts may be
detected. Eventually all parts of the pattern are either
instantiated or specifically denied by the client.
Suppose, for example, that a client requests an inventory
system. One concept which an analyst will expect from
his generic pattern is the concept of Stock_Item_Record (a
description of a stock item, divided into data fields). If
the analyst finds no description of the fields to be included
in the Stock_Item_Record in the client documentation,
then the client must be queried further about the nature of
his stock, so that appropriate fields can be defined. (A
library inventory can be expected to contain different
record fields than, say, a hardware inventory). The
information supplied by the client provides an
instantiation for the generic pattern.
ReqColl ensures that the definition of the required
problem is as complete as the accumulated examples
of the system allow. This ’experience’ grows with each
new development as definitions of new problems are
added to the knowledge base. If a client has defined the
need for some extra part, which does not exist in the
generic pattern, then the knowledge base will store the
fact that such a part is sometimes needed in problems of
the domain. Such variations on a concept are stored in
(concept graph) schemas of that concept. If it is found
that the need for this part constantly recurs in later
developments, then it may eventually be included in the
generic pattern - i.e. the system "learns" that this is a
common feature of the application domain.
Comparison with previously defined concepts is carried
out on an entity-by-entity basis. For each entity, one of
three situations will hold :
i) The entity was an exact match with a term of the
Domain Language. In this case, a previously defined
concept may be retrieved and re-used.
ii) The entity found a close match with a term of the
Domain Language. Closeness, in this case, is
measured by the value returned by the CGMatcher
algorithm.
iii) No reasonable match could be found for the entity. No
schema can be found with a difference value which is
sufficiently close to that of the concept being defined.
This is an indication that the client has requested
several features which are unknown in the experience
of the system. In this case, the analyst must accept
that ReqColl can not re-use any previous
requirements and will not be able to provide analysis
guidance for the current problem. However, future
analysis activities will not suffer from the same lack
of experience, since the model which the analyst will
produce (without guidance) will be added to the stock
of schemata available for that problem.
For example, consider the case where a client has
mentioned "shipment" in his initial statement. A word
search will find the following ReqColl definition :
Type Shipment (x) is
[Entity](Object)<---[Ship_Act]
        (Content)--->[Stock_Item : {*}]
        (PTim)--->[Time]
        (Accm)--->[Document : {*}].
which can be translated as "A Shipment is an Entity
whose Contents are a set of Stock_Items, is
accompanied by a set of Documents, is the Object of an
act of Shipping, and occurs at a certain Point-In-Time". The client can now confirm or deny that this is
the correct definition.
When some concepts do not match, or the definition is
not the one intended, then the client is asked to define
his concept, by the aggregation of concepts which already
exist in the KB. (The process is recursive if the
customer uses other unknown concepts while defining his
concept). The new definition (in the form of a conceptual
graph) is compared against stored definitions, looking for
close matches and synonyms.
For example, the client may have the concept
"delivery" in his initial statement. No word match is
found for this, so the client is asked for a definition. The
definition given could be something like : "A delivery
is a set of machine parts sent from a supplier, received
at the warehouse". This would result in the construction
of the following graph, using known concepts of
Machine_Part, Send, Supplier and Warehouse :
Type Delivery is
[Machine_Part : {*}](Object)<---[Send](Source)--->[Supplier]
                                      (Dest)--->[Warehouse].
Given that ReqColl already has a type definition for the
act Ship, as follows :
Type Ship_Act is
[Send](Source)--->[Supplier]
      (Dest)--->[Customer]
      (Object)--->[Shipment].
Using the matching approach described below, this will
give a partial match with the newly defined type Delivery.
(Both are "Sends" by a Supplier, and a warehouse can
have the role of being a Customer). The principal
difference will lie in the match of Machine_Parts against
Shipment, so this area needs to be further investigated. If
Shipment in Ship_Act is type expanded, then
Machine_Parts can merge into the Content relation of
Shipment, giving a closer match, and, quite probably,
providing a definition with which the client is content.
Note that this process has discovered the need for a set of
Documents to accompany a Delivery and also the fact that
a Time of delivery may be relevant. The analyst may now
need to investigate what documents are required, how
time is to be recorded etc.
In the best case, where the client states that Shipment is a
close match to his concept of Delivery, then Delivery
may be installed as a sibling of Shipment so future
developments will have the choice of the Delivery
concept, or the Shipment concept.
6. The CGMatcher
6.1 Overview
The CGMatcher is an integral part of many of the
ReqColl CG manipulations. Expansions, joins, and merges
of conceptual graphs are all based on the data supplied by
matching two CGs, using the CGMatcher. In addition to
its use in these actions, the CGMatcher can also be
invoked directly by the user to view the differences
between any two CGs of interest. In this latter case, the
CGMatcher desktop is used.
In ReqColl, conceptual graph arcs are the principal data
unit. Each arc is stored either as a main arc, or as a
subarc. Main arcs are joined directly to the genus
concept i.e. the concept containing the type label of the
most relevant supertype. Subarcs are attached to the
concepts of main arcs, or the concepts of other subarcs.
Each arc contains a conceptual relation (ConRel), a
concept, and a direction (Arcdir) stating whether the
concept is a range or domain of the ConRel. (The other
concept of the relation is either the genus concept (main
arcs), or the concept of the arc to which the subarc is
attached.) For example, in the expanded version of
Stock_Item given earlier, "Entity" is the genus concept,
the two "Attr" arcs are main arcs, and the two "Part" arcs
are subarcs of "Identification". (Note that in the definition of
Stock-Code, they appear as main arcs).
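The arc-based storage just described might be modelled along the following lines. This Python sketch is an assumption-laden illustration: the class and field names (Arc, ConceptualGraph, direction, and so on) are invented here, and the string encoding of the arc direction is a guess at one workable representation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Arc:
    """One arc: a ConRel, the concept it connects to, and a direction."""
    conrel: str
    concept: str
    direction: str = "range"   # "range" or "domain" (assumed encoding of Arcdir)
    referent: str = ""         # empty string when no referent is given
    marker: int = 0            # 0 when no marker is given
    subarcs: List["Arc"] = field(default_factory=list)


@dataclass
class ConceptualGraph:
    """A genus concept together with the main arcs joined directly to it."""
    genus: str
    main_arcs: List[Arc] = field(default_factory=list)


# The expanded Stock_Item definition: two main "Attr" arcs, with the two
# "Part" arcs stored as subarcs of the Identification arc.
stock_item = ConceptualGraph(
    genus="Entity",
    main_arcs=[
        Arc("Attr", "Identification",
            subarcs=[Arc("Part", "Number"), Arc("Part", "Letter")]),
        Arc("Attr", "Colour"),
    ],
)

print(len(stock_item.main_arcs), len(stock_item.main_arcs[0].subarcs))
```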
To match two CGs, CG1 and CG2, their main arcs are
matched against each other to obtain a numerical match
score. (A low score between two arcs indicates a good
match - a zero indicates a perfect match.) Two arcs with
different directions are never matched. The differences
between the two concepts, and between the two relations
contribute to the match score. In addition, if any subarcs
are present on either or both arcs, there will be an
additional (weighted) score.
From the set of scores which results from these individual
matches, an arc pairing is calculated, consisting of the set
of arc pairs which give the lowest overall pairing score,
calculated by totalling the individual match scores of the
arc pairs. Usually, the numbers of arcs in the two sets are
not equal, so some arcs must be left with no pairing. The
effect of excluding particular Arcs from the set of pairs is
allowed for, and may be a determining factor in deciding
which Arcs are paired, and which are not.
Many of the graph matching ideas employed in ReqColl,
including the weighting of graph branches to reduce their
influence, are based on those of Winston [27]. Winston
used a weighting scheme of $e^{-L(i)}$, where $L(i)$ is the level
(subarc depth, in ReqColl) of the ith difference between
two structures, expressed as graphs. Experimentation with
sample CGs indicated that this attenuated the influence of
lower levels too rapidly. A weighting of 0.5 is used
instead in ReqColl. Thus, the influence of a subarc
is halved on each level of matching.
CGMatcher - Lisp Listen
||| Arc Comparison Information
Total Number of Matches : 835
Best Match Score : 306
Best Match Path : ((1,1)(2,4)(3,2)(4,3))
||| Graph of Component CG        ||| Scores   ||| Graph of Stock_Item.Sch2 CG
Type Component (x) is :          306          Schema Stock_Item.Sch2(x) is :
[Entity *x]...                   2            [Stock.Item *x]...
-->(Attr) -> [Identifier]        81           -->(Attr) -> [Stock_Cost]
-->(Attr) -> [Stock_Level]       0            -->(Attr) -> [Stock_Level]
-->(Attr) -> [Critical_Level]    0            -->(Attr) -> [Critical_Level]
-->(Attr) -> [Cost]              223          -->(Attr) -> [Description]
||| Arc Comparison Matrix
(81 122 94 141)
(215 212 221 0)
(321 0 83 25)
(125 84 225 17)
Figure 1 : CGMatcher desktop
The matching algorithm used by the CGMatcher can be
divided into three phases:
i) the scoring of arc pair matches (Section 6.3);
ii) the finding of the best arc pairings using the
individual pair scores (Section 6.4); and
iii) the scoring of unpaired arcs (Section 6.5).
6.2 The CGMatcher desktop
The CGMatcher desktop (see Figure 1) contains three
main windows - two which display the CGs in their best
match with each other, and a third (middle) window which
gives the scores for each arc pair match. The CGs have the
order of their arcs rearranged in the display, so that if two
arcs have been paired with each other, then they will
appear on the same line of the screen. The score which
they achieve against each other also appears on this line, in the
scores window. If some arc is not part of a pair (i.e. there
were a different number of arcs in the two CGs), then a
blank line appears opposite it in the other display window.
Thus, the user is able to see at a glance which parts of the
two CGs pair well, and which parts differ. Other windows
in the desktop can display the progress of the matching
and pairing functions, if the user wishes.
6.3 Arc pair matches
The score from the match of any two arcs (one from each
CG) is derived by a scoring routine which compares the
two arcs. A weighted score from a (recursive) routine
which returns the best match score of their subarcs is
added to the arc score. The score from comparing the two
arcs is the sum of four different comparison scores - concept:concept, ConRel:ConRel, referent:referent, and
marker:marker. The two arcs to be compared must also
have the same arc direction - this is enforced by the
CGMatcher.
Concept:Concept : The two concepts are checked to see
if one is a direct descendant of the other. If so, then the
value returned is the difference between their respective
depths in the concept hierarchy. This score (which will be
a relatively low number, compared to the score of non-descendant concepts) reflects the fact that one of the two
concepts can be derived from the other. That is, they share
some features, and hence have a degree of similarity. The
score is not intended to reflect the exact degree of
difference between the two.
Rather, it measures how many restrictions of the ancestor
concept exist between it and the descendant concept. In
Figure 2, the score from a match between C1 and C3
would be 2. Note that the restriction of C1 which created
C2 could be minor (say, one arc concept in C1 was
restricted), whereas the restriction which created C3 could
be more complicated (say, several new arcs added). Yet,
both would have the same difference score of 1, with their
respective genus. This is justified in ReqColl by the fact
that ReqColl matches attempt to emphasise the similarity
of two CGs, rather than their differences. Thus,
ancestor/descendant pairs are given high priority in
matches.
[Figure 2 : Descendant matching. Diagram of concepts C1, C2 and C3, with a score of +1 for each hierarchy level.]
[Figure 3 : Non-Descendant matching. Diagrams of concepts C1 to C4, with link scores of +1 and +20. Score : 22; Score : 41; Score Returned : 22.]
If the user requires a more accurate match between the
two concepts, then the user can expand the relevant
concepts in the two CGs being matched. This will result in
the full definitions of those concepts being matched, as a
part of the overall CG match (the definitions of the
concepts will be installed as subarcs at the appropriate
point in the CGs being matched). However, matching the
concepts in their unexpanded form gives the user an initial
indication of the degree of similarity between them, which
assists in deciding if more detailed matching is required.
If neither concept is a direct specialisation of the other,
then it is necessary to determine the closest path between
the concepts in the concept hierarchy. The super-concepts
of one of the concepts are checked, until one is found
which is also an ancestor of the other concept (in the worst
case, the ancestor super-concept will be Thing - the
common ancestor of all ReqColl Concepts). For each
ancestor level traversed, a score of twenty is added to the
comparison score. (Twenty is a relatively high score in
CG matching, to reflect the fact that, if two concepts are
not directly derivable from each other, then they should
have a high difference score. However, it is not so high
that it will not find a sibling, if that is the best match
available). The same process is carried out for the other
concept, and the value returned is the minimum of the two
scores (see Figure 3).
It should be emphasised that no claim is made that the
returned score measures the essential difference between
any two real-world entities - it is not possible to measure
such differences numerically. Instead, the score returned
represents the idea that two concepts are probably more
similar to each other if one is directly derivable from the
other, than if they belong to different branches of a
concept hierarchy.
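The concept:concept scoring can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the ReqColl routine: the hierarchy in PARENT is hypothetical, and the charge of 1 for each level descended from the common ancestor to the other concept is inferred from the scores of 22 and 41 shown in Figure 3 rather than stated in the text.

```python
# Hypothetical hierarchy (child -> parent): C1 is the root, C2 and C4 are
# its children, and C3 is a child of C2.
PARENT = {"C2": "C1", "C3": "C2", "C4": "C1"}

ANCESTOR_STEP = 20    # score for each ancestor level traversed upwards
DESCENDANT_STEP = 1   # score for each level of descent (assumed from Figure 3)


def chain(concept):
    """Return [concept, parent, grandparent, ...] up to the root."""
    result = [concept]
    while result[-1] in PARENT:
        result.append(PARENT[result[-1]])
    return result


def climb_score(start, other):
    """Climb from `start` until an ancestor of `other` is found."""
    other_chain = chain(other)
    for levels_up, node in enumerate(chain(start)):
        if node in other_chain:
            levels_down = other_chain.index(node)
            return ANCESTOR_STEP * levels_up + DESCENDANT_STEP * levels_down
    raise ValueError("concepts share no common ancestor")


def concept_score(a, b):
    """Depth difference for descendants, else the cheaper of the two climbs."""
    chain_a, chain_b = chain(a), chain(b)
    if a in chain_b:
        return chain_b.index(a)
    if b in chain_a:
        return chain_a.index(b)
    return min(climb_score(a, b), climb_score(b, a))


print(concept_score("C1", "C3"))  # descendant case
print(concept_score("C3", "C4"))  # non-descendant case
```

With this hypothetical hierarchy, the two climbs between C3 and C4 cost 41 and 22, and the smaller value is returned.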
ConRel:ConRel : The ConRel:ConRel matching is
performed in an almost identical manner to the
concept:concept matching. The only modification is that
the ConRel level is calculated as the sum of the levels of
the domain and range concepts of the ConRel. Therefore,
the score returned depends upon the domain and range
concepts of the ConRel, and also upon whether the two
ConRels are descendants or not.
Referent:referent : The current version of ReqColl
allows any string to be entered as a referent for a concept,
so it is not possible to make accurate comparisons
between referents. Therefore, a score of zero is returned if
the two referents are identical (often both null), and a
(somewhat arbitrary) score of ten is returned otherwise.
Marker:marker : It is not possible to quantify the
semantic effect of the presence of a marker in one arc, as
compared to the absence of a marker in another, for all
cases. Therefore, a score of zero is returned if both
markers are identical (usually both 0), and a similarly
arbitrary score of five, to reflect the lower importance of
markers, is returned otherwise.
The presence of these arbitrary scores might seem to
introduce a degree of inaccuracy into the matching, and
hence the pairing, processes. In fact, this is not so. It
should be remembered that the ReqColl matcher tries to
emphasise the similarity of CGs, rather than their
differences. Thus, a score of zero is returned if referents
(or markers) are identical, giving that arc pair a large
advantage. It can also be noted that all matches attempted
with a referent (or marker) will score the same value (10
or 5), if they are not identical. Thus, if there were no
identical matches, the scores will cancel each other when
the best overall match to the arc is being decided. The
pairing of that arc will then actually be decided on the
basis of the similarity of the concepts and ConRels of each
arc pair. If one identical match does exist (for example,
for the referent), then the pairing with the arc which
contains the identical referent will very likely be chosen - which is a sensible outcome.
The value returned from each arc match is the sum total
of the values returned from each of the sub-matches given
above, plus a weighted score from the comparison of the
subarcs of the arcs matched. The total values are then used
to decide which set of arc pairs gives the closest match for
the two CGs.
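The way the four sub-scores and the weighted subarc score combine can be shown with a small Python sketch. The function names and the calling convention are assumptions made for illustration; the concept and ConRel scores are simply passed in as numbers, and the best subarc score is assumed to have been computed already by the recursive routine.

```python
SUBARC_WEIGHT = 0.5       # influence of subarcs is halved at each level
REFERENT_MISMATCH = 10    # score when the two referents differ
MARKER_MISMATCH = 5       # score when the two markers differ


def referent_score(ref_a, ref_b):
    """Zero for identical referents (often both empty), ten otherwise."""
    return 0 if ref_a == ref_b else REFERENT_MISMATCH


def marker_score(marker_a, marker_b):
    """Zero for identical markers (usually both 0), five otherwise."""
    return 0 if marker_a == marker_b else MARKER_MISMATCH


def arc_score(concept_s, conrel_s, ref_a, ref_b, marker_a, marker_b,
              best_subarc_score=0):
    """Sum the four comparison scores and add the weighted subarc score."""
    own = (concept_s + conrel_s
           + referent_score(ref_a, ref_b)
           + marker_score(marker_a, marker_b))
    return own + SUBARC_WEIGHT * best_subarc_score


# Identical referents and markers: only the concept difference counts.
print(arc_score(2, 0, "10", "10", 0, 0))
# Differing referent and marker, plus a best subarc pairing score of 8.
print(arc_score(1, 0, "10", "", 0, 1, best_subarc_score=8))
```

Because the same weight is applied again inside each recursive call, a difference found at subarc depth d contributes 0.5 to the power d of its raw score to the top-level total.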
6.4 Selecting the best arc pairing
Assume that there are two sets of CG arcs to be matched
- set A, and set X. Assume also that they are of equal size.
(The case of unequal numbers of arcs will be dealt with in
Section 6.5.) Pairing the two sets of arcs involves matching
all arcs in A against all arcs in X, and finding the optimal
arc pairs - i.e. the set of arc pairs which will produce the
lowest match score. (The score for a perfect pairing is
zero - the best score possible. This would imply that two
identical sets of arcs were matched).
The scores of each individual arc pair match, as
described in Section 6.3, are stored in a match matrix,
such that the (i,j)th element holds the score from the
comparison of arc i from A, against arc j from X. For
example, let A be the set of arcs {a, b, c}, and X be the set
of arcs {x, y, z}. An example match matrix could then be :
        x   y   z
a   (   2   3   1   )
b   (   0   5   4   )
c   (   0   1   2   )
Thus, the score of b against z is 4. Note that each possible
arc pairing must contain all the arcs in both sets, that no
arc can simultaneously be paired with more than one other
arc, and that a low score indicates a good match. We now
define a match path, $P_i$ ($i = 1..n!$), for an $n \times n$ match
matrix, M, to be the set of elements, $\{p_k\}$ ($k = 1..n$), of M,
such that there is exactly one element taken from each row
and each column (since any arc can only be paired with
exactly one other arc). Possible match paths for the
example above are {(1,1), (2,2), (3,3)}; {(1,2), (2,1),
(3,3)}; etc. (Note that the path ordering is irrelevant - {(1,3), (2,1), (3,2)} is exactly equivalent to {(2,1), (3,2),
(1,3)}).
Let the function val(e) return the value of the element e
of a match matrix, M. We then define the path total, $T_i$, to
be the sum of the values, $\{t_k\}$, held in the elements $\{p_k\}$
of $P_i$, where $t_k = val(p_k)$. Thus $T_i = \sum_{k=1}^{n} t_k$.
We further define the best match path, B, for a match
matrix, M, to be a match path, $P_i$, for M, for which $T_i \le T_j$
$\forall P_j$ of M. We define the best match total, T, to be the
value of $T_i$ for B.
The problem, therefore, is to find B and T. B records the
best possible arc pairing, and T is a numerical measure of
how close the two sets of arcs pairs are - i.e. how good the
overall match is. In the example case given above, B is {
(1,3), (2,1), (3,2) }, and T = 1 + 0 + 1 = 2. This indicates
that the best arc pairing is (a and z), (b and x), (c and y).
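The definitions of B and T can be checked directly on the example matrix by enumerating every match path. The Python sketch below does exactly that and is meant only as a check of the definitions; it is not the CGMatcher algorithm, and the function name is an assumption of the sketch.

```python
from itertools import permutations

# Example match matrix: rows are arcs a, b, c; columns are arcs x, y, z.
MATCH_MATRIX = [
    [2, 3, 1],  # a
    [0, 5, 4],  # b
    [0, 1, 2],  # c
]


def best_match_path(matrix):
    """Enumerate all n! match paths and return (B, T) using 1-based indices."""
    n = len(matrix)
    best_path, best_total = None, None
    for columns in permutations(range(n)):
        # One element from each row and each column forms a match path.
        total = sum(matrix[row][col] for row, col in enumerate(columns))
        if best_total is None or total < best_total:
            best_path = [(row + 1, col + 1) for row, col in enumerate(columns)]
            best_total = total
    return best_path, best_total


B, T = best_match_path(MATCH_MATRIX)
print(B, T)  # [(1, 3), (2, 1), (3, 2)] 2
```

The six path totals for this matrix are 9, 7, 5, 7, 2 and 6, so the minimum of 2 is unique.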
The number of possible paths $P_i$ in an $n \times n$ matrix is factorial
n ($n!$). Potentially, every $P_i$ may have to be checked to
ensure that the overall minimum path total has been
found. The time required would therefore be $O(n!)$
(worse than exponential time), which is unacceptable for high values
of n. The CGMatcher algorithm, however, incorporates a
pruning rule which greatly reduces the number of paths
actually generated.
The first path generated, P1, is always the main diagonal
of the match matrix. This yields T1, which is used to
initialise a global variable, best.path.value. P1 is used to
initialise another global variable, best.path, which stores
the best match path found up to the current stage of the
matching process. If, while a particular path is being
investigated, the addition of the next Ti would make the
current path total greater than the current value of
best.path.value, then no further generation of that path
occurs, and alternative values for Ti are tried. If none of
these are suitable either, then the path generation
backtracks to pi-1, and alternatives are sought at that
depth.
Thus, when a path of length n is generated (Pin = Pi) (all
arcs have been paired), Ti (= Tin) must be less than (or
equal to) the current value of best.path.value. If it is less,
then best.path is reset to Pi, and best.path.value is reset to
Ti. Thus, a lower threshold is established for future
pruning.
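The pruned search described above can be sketched as follows. This is a minimal illustration rather than the CGMatcher code itself: the function name, the 0-based column indexing and the sample matrix values are all hypothetical, and the sketch assumes that every score in the match matrix is non-negative (otherwise cutting off a partial path whose total already exceeds the threshold would not be safe).

```python
from typing import List, Tuple


def best_match_path(matrix: List[List[float]]) -> Tuple[List[int], float]:
    """Return (columns, total), where columns[r] is the column paired with row r.

    Assumes a square matrix of non-negative scores (lower is better).
    """
    n = len(matrix)
    # P1 is the main diagonal; its total seeds the pruning threshold.
    best_path = list(range(n))
    best_value = sum(matrix[i][i] for i in range(n))

    path: List[int] = []
    used = [False] * n

    def extend(row: int, total: float) -> None:
        nonlocal best_path, best_value
        if row == n:
            # A complete path: keep it only if it is strictly better.
            if total < best_value:
                best_path, best_value = path.copy(), total
            return
        for col in range(n):
            if used[col]:
                continue
            new_total = total + matrix[row][col]
            if new_total > best_value:
                continue  # prune: this partial path cannot beat the threshold
            used[col] = True
            path.append(col)
            extend(row + 1, new_total)
            path.pop()  # backtrack and try the next alternative at this depth
            used[col] = False

    extend(0, 0)
    return best_path, best_value


# Hypothetical scores for arcs a, b, c (rows) against x, y, z (columns).
scores = [
    [4, 3, 1],
    [0, 2, 5],
    [3, 1, 2],
]
path, total = best_match_path(scores)
print(path, total)
```

With these illustrative values the diagonal gives an initial threshold of 8, which is then lowered to 5 and finally to 2; the returned columns [2, 0, 1] correspond to the pairing (1,3), (2,1), (3,2) in 1-based notation.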
The matcher algorithm is based on an adaptation of the
permutation generation algorithms described by
Topor [25]. The algorithm can cope with pairing
reasonable sizes of arc sets (it has been tested for match
matrices up to 20 x 20). This is reasonable for the
matching of CGs - the sets of arcs which must be paired
are normally small in number. Few CGs in the CG
canonical basis of ReqColl contain groups of arcs (main
arcs, or the level-(n+1) subarcs of a level-n arc) which
number more than ten arcs. (Thus, even pairings of 10 x
10 are rare).
The algorithm will work for larger match matrices,
provided that there is a reasonable range of element values
(as is usual with CG arc matches). However, it will take
exponential time to complete if the values in M are not
spread over a reasonable interval, thus making most of the
path totals in or around the same value.
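The sensitivity to the spread of values can be illustrated by counting how many complete paths a pruned search actually reaches. The sketch below is an assumption-laden illustration (the function name and both test matrices are hypothetical, and scores are assumed non-negative): when every element holds the same value no partial path ever exceeds the threshold, so all n! paths are generated, whereas a matrix with a clearly cheapest diagonal is settled after a single path.

```python
from math import factorial
from typing import List


def count_complete_paths(matrix: List[List[float]]) -> int:
    """Count the full-length paths reached by the pruned search."""
    n = len(matrix)
    best_value = sum(matrix[i][i] for i in range(n))  # diagonal threshold
    used = [False] * n
    completed = 0

    def extend(row: int, total: float) -> None:
        nonlocal best_value, completed
        if row == n:
            completed += 1
            best_value = min(best_value, total)
            return
        for col in range(n):
            # Skip columns already paired, and prune over-threshold totals.
            if used[col] or total + matrix[row][col] > best_value:
                continue
            used[col] = True
            extend(row + 1, total + matrix[row][col])
            used[col] = False

    extend(0, 0)
    return completed


n = 4
uniform = [[1] * n for _ in range(n)]  # all path totals equal
spread = [[0 if i == j else 10 for j in range(n)] for i in range(n)]

print(count_complete_paths(uniform), factorial(n))
print(count_complete_paths(spread))
```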
6.5 Scoring of unpaired arcs
If the two sets of arcs being matched are unequal in
number (say n arcs in one, and m in the other, n<m), then
there must be (m-n) unpaired arcs at the end of the pairing
process. The contribution of these unpaired arcs cannot be
simply ignored, if an accurate match is to be performed. It
is possible that the "cost" of not pairing some particular
arc may be so high that an overall better solution will be
found by including it in one of the n arc pairs, thus leaving
some other arc (with a lower "cost"), unpaired. In the
CGMatcher, if the numbers of arcs to be paired are not
equal, then each of the m arcs of the larger set (call it X)
is matched against a virtual arc to obtain the costs of
omitting each of them from the final pairing. The virtual
arc, Figure 6, is an imaginary ReqColl arc, which is the
most generic arc possible :
---> (Thing->Thing) ---> [Thing]
Figure 6 : Virtual Arc
Thus, the ConRel of each of the m arcs is matched against
Thing->Thing, the concept against Thing, the referent
against "", and the marker against 0 (the default value of
any marker facet). The effect of the virtual arc match is to
yield higher scores for those arcs which are more
restricted. The subarcs of any arc are recursively matched
against virtual arcs, and a weighted score added to the arc
score. This process returns an m-element list of values,
where the ith value represents the cost of omitting the ith
arc of X from any possible pairing. The matching
algorithm given in Section 5.4 is adjusted to account for
the presence of extra arcs. Let M be an n x m match
matrix, representing the matching of n arcs of set A with
m arcs of set X, n < m. Let Pin be a completed match path
of M, let Tin be the path total for Pin, let Bc be the current
best path, and Tc be the current best path total. Let E =
{ej} be the set of cost values for X, and let Ek = {ekj} be
the subset of E containing the cost values of the k (= m-n)
arcs of X omitted in the pairing represented by Pin.
Assume that Tin < Tc. Then, Tc will be reset to Tin (and
Bc to Pin) iff :
(Be =) Tin + Σ_{j=1}^{k} ekj < Tc
i.e. the total of the current path total plus the sum of the
costs of the omitted arcs must be less than the current best
value. There is one exception to this rule : if Tin = 0, then
Tc is automatically reset to Tin (and Bc to Pin). This
represents the case where a perfect match has been found
between the two sets of n arcs. This implies that X is
actually an extension of A, and that the difference
between the two sets should be shown to the user to be the
k unpaired arcs. Note that Tc is reset to Tin, rather than to
Be. This is because the aim of the CGMatcher is to find
the best match paths. Unpaired arcs are not, by definition,
elements of the match path - therefore, they should not
contribute to the path total. However, they do contribute in
calculating whether a better overall solution has been
found. (The value returned by the CGMatcher will
actually be Be, since it represents the best overall match
which could be found. But path totals are only compared
against other path totals in the course of matching).
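One literal reading of the adjusted acceptance rule is sketched below. It is not the CGMatcher implementation: the function name, the choice of the first n columns as the starting path, and all sample values are assumptions made for illustration, and scores and omission costs are assumed non-negative. A completed path replaces the current best only if its path total is lower and either that total is zero or the total plus the costs of the omitted arcs is still below the current best path total.

```python
from typing import List, Tuple


def match_unequal(matrix: List[List[float]],
                  omit_cost: List[float]) -> Tuple[List[int], float, float]:
    """Pair n arcs of A (rows) with n of the m arcs of X (columns), n < m.

    Returns (best path, its path total, path total plus omission costs).
    """
    n, m = len(matrix), len(omit_cost)
    assert n < m and all(len(row) == m for row in matrix)

    def omitted_total(cols: List[int]) -> float:
        # Sum of the costs of the arcs of X left unpaired by this path.
        return sum(c for j, c in enumerate(omit_cost) if j not in cols)

    best_path = list(range(n))  # assumed starting path
    best_total = sum(matrix[i][i] for i in range(n))
    path: List[int] = []

    def extend(row: int, total: float) -> None:
        nonlocal best_path, best_total
        if row == n:
            if total < best_total and (
                total == 0 or total + omitted_total(path) < best_total
            ):
                # The threshold is reset to the path total only.
                best_path, best_total = path.copy(), total
            return
        for col in range(m):
            if col in path or total + matrix[row][col] > best_total:
                continue
            path.append(col)
            extend(row + 1, total + matrix[row][col])
            path.pop()

    extend(0, 0)
    return best_path, best_total, best_total + omitted_total(best_path)


# Hypothetical 2 x 3 example: leaving column 0 unpaired is expensive.
scores = [[3, 1, 4],
          [2, 5, 0]]
print(match_unequal(scores, [5, 1, 2]))
print(match_unequal(scores, [1, 1, 2]))
```

In the first call the pairing with the lowest path total (columns 1 and 2, total 1) is rejected because omitting column 0 costs 5, so the pairing using columns 0 and 2 is kept instead. In the second call the same omission costs only 1 and the lower path total wins.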
7. Related Work
Conceptual graphs have been used in a manner similar to
that proposed for ReqColl by Eklund and Kuczora [9, 14].
Their task was knowledge acquisition in engineering
domains. A Conceptual Graph Tool (CGT) which can
construct and manipulate conceptual graphs has been
built, which works in conjunction with a semantic network
editor, VEGAN [13], and the Rule Based Frame System
(RBFS) [3]. The conceptual graph tool allows the domain
expert to describe conceptual graphs through the use of a
WIMP interface, and allows them to be mapped into an
object-oriented representation. The graphs represent
heuristic and procedural information about objects in the
problem domain.
The Analyst Assist project [16] also uses CGs as the
underlying representation for a set of support tools that are
to support many of the tasks carried out during
requirements capture. It includes a "concept elicitation
and analysis tool", outlined in [15] and detailed in [6], that
embodies some of the features of ReqColl, such as CG
editing and manipulation. However, it does not support
concept or requirements reuse, although it might be
possible to add this feature to the toolset.
Reuse of specifications through analogy is directly
addressed by the prototype Ira (Intelligent Reuse Adviser)
system [17]. The analogy engine performs structure
matching, based on object structures and constraints, as
well as providing tutorial advice and explanation to the
analyst. As with ReqColl, heuristics are used to calculate
differences when there are two or more candidate
matches, and these heuristics have been adjusted to take
account of the level of similarity of "critical and
non-critical" features. It might be worthwhile to add this
distinction to ReqColl, although it may be that the use of
CGs in ReqColl provides a more generic and semantically
sound basis for matching.
The clichés of the Requirements Apprentice [21] have a
similar role to CGs in ReqColl and are also organised into
an inheritance hierarchy. Although the goals of ReqColl
are narrower, it is felt that the CG formalism and the
matching procedure used in ReqColl are clearer and
would be understood more easily by the user or customer.
This would need to be tested empirically. It would also be
interesting to compare the performance of the ReqColl
matching procedure to the "percentage threshold" used by
the RA's classification system.
8. Conclusion and Future work
The CGMatcher, a ReqColl tool for comparing two CGs,
has been described. It matches arcs of the CGs against
each other, and deduces the best arc pairing for each
subarc and main arc level, using the scores obtained from
the individual arc matches. The pairing is deduced by
finding the optimal match path in a match matrix, which
records the scores for each pair of arcs matched. Extra
arcs are catered for by calculating the cost they
would add to the overall match if they were not included.
The matcher is used to identify likely concepts whose
specification might be candidates for re-use. In fact the
ReqColl system also stores software structures (described
using JSD [12] notation) and can therefore support the
early stages of design. In this way, re-use occurs at a
very high level - the problem definition phase. In
theory, if the definition of two problems is close, and one
has already been solved, then it should be possible to
solve the second in a similar manner i.e. in the software
case, the second development should be able to follow the
methods used in the first. However, ReqColl currently
confines itself to recognising that a similar problem has
been solved before. At present it is beyond the scope of
ReqColl to carry the development further.
Future improvements to ReqColl will involve providing a
graphical user interface and the addition of further
heuristics to the matching algorithm. The extension of the
component library to embrace object classes so that the
development process can be carried through to
implementation would be a longer term goal. Even more
crucial in testing the worth of the ideas presented here will
be the conduct of realistic case studies, initially within a
narrow domain, to assess empirically the level of re-use
provided by ReqColl.
Acknowledgement : The ReqColl system was developed
while the authors were at the Department of Computer
Science, Trinity College Dublin, and made use of a Unisys
Explorer and KEE™ software donated by Unisys.
References
[1] Adelson B., Soloway E., "The Role of Domain Experience in
Software Design", IEEE Trans. on Software Engineering,
Vol SE-11, No 11, Nov. 1985, pp. 1351-1360.
[2] Balzer R., Goldman N., Wile D., "Informality in Program
Specifications", IEEE Trans. on Software Engineering, Vol
4, No 2, Mar. 1978, pp. 94-102.
[3] Barber T.J., Marshall G., Boardman J.T., "A Philosophy and
Architecture for a Rule Based Frame System (RBFS)",
International Journal of AI in Engineering, July 1987.
[4] Borgida A., Mylopoulos J., Wong H.K.T., "Generalization/
Specialization as a Basis for Software Specification", in
[Brod84], pp. 87-117.
[5] Borgida A., "Features of Languages for the Development of
Information Systems at the Conceptual Level", IEEE
Software, Vol 2, No 1, Jan. 1985, pp. 63-72.
[6] Champion R E M, "Modelling for Requirements
Engineering using Conceptual Structures", PhD Thesis,
Dept. of Computation, UMIST UK, May 1991.
[7] Cox, B., "Object-Oriented Programming - An Evolutionary
Approach", Addison-Wesley, 1987.
[8] Davis A.M., "The Design of a Family of
Application-Oriented Requirements Languages", IEEE Computer, Vol
15, No 5, May 1982, pp. 21-28.
[9] Eklund P., "On the Use of Conceptual Structures in
Knowledge Based Systems Development", Brighton
Polytechnic IT Research Institute Internal Report No 14,
Sept. 1987.
[10] Findler N.V., "Associative Networks - Representation and
Use of Knowledge by Computers", Academic Press, 1979.
[11] Greenspan S.J., Mylopoulos J., Borgida A., "Capturing
More World Knowledge in the Requirements Specification",
in Proc. 6th Int’l Conf. on Software Engineering, Los
Alamitos, California, IEEE CS Press, 1982, pp. 225-234.
[12] Jackson M., "System Development", Prentice-Hall
International Series in Computer Science, 1983.
[13] Kellet J., Esfahani L., "VEGAN and KET : An Integrated
Graphical Approach to Knowledge Representation and
Acquisition", in Procs. of the Joint ICL and Ergonomics
Society Conf. on Human and Organisational Issues in
Expert Systems, Stratford-upon-Avon, May 1988.
[14] Kuczora P., Eklund P., "A Conceptual Graph
Implementation for Knowledge Engineering", in "Human
and Organisational Issues in Expert Systems", Hart A.,
Berry D. (eds.), Hogan-Pale, April 1989.
[15] Loucopoulos P., Champion R E M, "Concept Definition and
Analysis for Requirements Specification", IEE Software
Engineering Journal, March 1990.
[16] Loucopoulos P., Harthoorn C., "A Knowledge-Based
Requirements Engineering Support Environment", in
[CASE88], Vol 1, pp.13.10-13.14.
[17] Maiden N A., Sutcliffe A G., "Exploiting Reusable
Specifications Through Analogy", Communications of the
ACM, Vol 35, No 4, April 1992
[18] Mathews B., "Requirements Specification Using
Conceptual Graphs", MSc Thesis, Trinity College Dublin,
[19] Meyer B., "Object-Oriented Software Construction",
Prentice-Hall, 1988.
[20] Neighbors J., "The Draco Approach to Constructing
Software from Reusable Components", IEEE Trans. on
Software Engineering, Vol. SE-10, No. 5, Sept. 1984, pp.
564-574.
[21] Reubenstein H B and Waters R C, "The Requirements
Apprentice: Automated Assistance for Requirements
Acquisition", IEEE Trans. on Software Engineering, Vol
17, No 3, March 1991
[22] Rich C, "A Formal Representation for Plans in the
Programmer's Apprentice", in Proc. 7th IJCAI, Aug. 81.
[23] Soloway E., Ehrlich K., "Empirical Studies of
Programming Knowledge", IEEE Trans. on Software
Engineering, Vol 10, No 5, Sept. 1984.
[24] Sowa J.F., "Conceptual Structures - Information Processing
in Mind and Machine", Addison-Wesley, 1984.
[25] Topor R.W., "Functional Programs for Generating
Permutations", The Computer Journal, Vol 25, No 2, 1982,
pp. 257-263.
[26] Velardi P., Pazienza M.T., De Giovanetti M., "Conceptual
Graphs for the Analysis and Generation of Sentences", IBM
Journal of Research and Development, Vol 32, No 2, Mar.
1988.
[27] Winston P H, "Learning Structural Descriptions from
Examples", Ph D Thesis, Massachusetts Institute of
Technology, USA, Jan. 1970.