© 2012 ARMA International • www.arma.org
Drawing a Blueprint for a
Scalable Taxonomy
Drawing on the basic concepts of biological classification most studied in high school,
this article describes how to develop a scalable taxonomy that can migrate to any
content repository – from share drives to enterprise content management systems.
Eugene Stakhov, CRM, CDIA+
Not very long ago, the word
“taxonomy” didn’t really
have a place in the field of information technology. But,
as the ability to govern information has grown more sophisticated, so has the language used
to describe the newfound complexity
of the various interrelationships and
countless moving parts that comprise
the typical enterprise data landscape.
To the records and information
management professional, words like
“system” and “program” don’t seem
quite adequate now to describe the richness or the organic and evolving nature of
this discipline. Today, the
words “platforms” and “ecosystems” are used.
This is an important concept because it underscores the challenge of
effective information governance in
this day and age, and it provides a
glimpse of the growing monster so
many organizations are grappling
with.
MAY/JUNE 2012 INFORMATION MANAGEMENT
Tiger Taxonomies
Kingdom: Animalia
Phylum: Chordata
Subphylum: Vertebrata
Class: Mammalia
Order: Carnivora
Family: Felidae
Genus: Panthera
Species: Tigris
Reviewing Taxonomy Fundamentals
Many will remember first hearing
the word “taxonomy” in high school
science class, where they learned that
the hierarchical categories “Kingdom,”
“Phylum,” “Class,” “Order,” “Family,”
“Genus,” and “Species” are conceptual
buckets within which plants and animals can be classified, such as the
above classification for tiger.
This hierarchical classification
teaches that tigers are carnivores;
that every carnivore is a vertebrate;
and, therefore, that all tigers must be
vertebrates – but not all vertebrates
must be tigers.
Inheritance and Specialization
This relationship of a parent class
(superclass) to its child (subclass) illustrates the important concepts of
inheritance and specialization. In the
case of the tiger example, this subclass would be Carnivora (Order). Carnivora inherit all the characteristics
of Animalia (Kingdom), Chordata (Phylum), Vertebrata (Subphylum), and Mammalia
(Class).
Then, they specialize by defining
their own characteristics that are
unique to all carnivores. This pattern
of inheritance and specialization repeats all the way down to the tigris
(Species) – the lowest category of the
biological taxonomy tree.
The only difference between a biological taxonomy and its content counterpart is that rather than inheriting
limbs and backbones, the latter inherits document characteristics, including metadata and security, and,
in some cases, retention requirements.
In fact, records management professionals have been practicing taxonomy development for as long as the
discipline has been around. There
may be nuances in terminology (e.g.,
“file plan” and “retention schedule”),
but the core concept is the same: the
higher up the bucket, the broader the
classification; the lower the bucket,
the more specialized the classification.
The common denominator among
all these classification practices is the
specialization and inheritance of
characteristics.
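The inheritance and specialization pattern described above can be sketched in a few lines of Python. This is only an illustration: the Category class, its method names, and the sample traits are assumptions made for the example, and the chain is abbreviated (Chordata, Felidae, and Panthera are left out for brevity).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Category:
    """One bucket in a taxonomy tree (hypothetical helper for this sketch)."""
    name: str
    parent: Optional["Category"] = None
    own_traits: set = field(default_factory=set)

    def lineage(self):
        """Names from the root category down to this one."""
        names = []
        node = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names[::-1]

    def traits(self):
        """Everything inherited from ancestors plus this category's own traits."""
        inherited = self.parent.traits() if self.parent else set()
        return inherited | self.own_traits

    def is_a(self, other):
        """True when `other` is this category or one of its ancestors."""
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False


# Placeholder traits, chosen only to show the mechanics.
animalia = Category("Animalia", None, {"multicellular"})
vertebrata = Category("Vertebrata", animalia, {"backbone"})
mammalia = Category("Mammalia", vertebrata, {"fur"})
carnivora = Category("Carnivora", mammalia, {"carnassial teeth"})
tigris = Category("Tigris", carnivora, {"stripes"})

print(tigris.lineage())
print(tigris.is_a(vertebrata), vertebrata.is_a(tigris))
print(sorted(carnivora.traits()))
```

Each category stores only what it adds; everything else is looked up through its parent. That is why every tiger is a vertebrate while a vertebrate is not necessarily a tiger.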
Explaining Technical Concepts
Objects and Classes
It is helpful to think of the relationship between classes and objects
as analogous to cookie cutters and
cookies. Classes are templates that
are used to build the objects (documents and folders) that are managed
by an enterprise content management (ECM) system. Take the following pattern:
• Documents are patterned by
Classes
• Classes are described by Properties
• Classes can pass on their property
definitions to one or more children,
known as Subclasses
This type of design paradigm borrows from a style of computer programming known as object-oriented
programming.
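The cookie-cutter relationship can be made concrete with a small sketch. The DocumentClass helper, its methods, and the "ClaimForm" class with its properties are hypothetical names invented for illustration; an actual ECM system would expose its own equivalents.

```python
class DocumentClass:
    """A template (the cookie cutter) that describes documents of one kind."""

    def __init__(self, name, properties, parent=None):
        self.name = name
        self.parent = parent
        self.own_properties = list(properties)

    def all_properties(self):
        """Property definitions passed down from parents, then this class's own."""
        inherited = self.parent.all_properties() if self.parent else []
        return inherited + self.own_properties

    def new_document(self, **values):
        """Stamp out a document (the cookie); reject properties the class lacks."""
        unknown = set(values) - set(self.all_properties())
        if unknown:
            raise ValueError(f"{self.name} has no properties {sorted(unknown)}")
        return {"class": self.name, **values}


# A parent class and one subclass that adds a property of its own.
document = DocumentClass("Document", ["title"])
claim_form = DocumentClass("ClaimForm", ["claim_number"], parent=document)

doc = claim_form.new_document(title="Water damage claim", claim_number="C-1001")
print(claim_form.all_properties())
print(doc)
```

The subclass never redeclares "title"; it receives that definition from its parent and only adds what is unique to it.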
Inheritance and Polymorphism
Inheritance and polymorphism
are among the core capability requirements of object-oriented design.
The latter refers to the ability of a
property to have more than one intrinsic meaning. To illustrate this,
consider two document classes, one
called “Invoice” and the other “Contract.”
The “Invoice” document class may
have these properties defined:
• Invoice Date
• Invoice Number
The “Contract” document class
may have these properties defined:
• Contract Date
• Contract Number
Rather than define four separate
properties to describe what are essentially only two distinct pieces of information (the date and number), the
taxonomy designer may instead
choose to paint both document classes
with the following properties:
• Document Date
• Document Number
In this scenario, the document’s
class determines whether “Document
Date” refers to an invoice date or a
contract date. Because these two properties
are generic enough to have
more than one intrinsic meaning,
they are polymorphic.
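As a rough sketch of this idea, the two generic properties can be stored once and interpreted according to the document's class. The MEANINGS table, the describe function, and the sample values below are assumptions made for the example, not part of any particular ECM product.

```python
from datetime import date

# Hypothetical lookup: how each class interprets the two shared properties.
MEANINGS = {
    "Invoice": {"document_date": "Invoice Date", "document_number": "Invoice Number"},
    "Contract": {"document_date": "Contract Date", "document_number": "Contract Number"},
}


def describe(doc):
    """Render the generic properties using the meaning implied by the class."""
    labels = MEANINGS[doc["class"]]
    return {labels[key]: value for key, value in doc.items() if key in labels}


# Both documents carry the same two properties; only the class differs.
invoice = {"class": "Invoice", "document_date": date(2012, 5, 1), "document_number": "INV-17"}
contract = {"class": "Contract", "document_date": date(2012, 6, 1), "document_number": "CT-3"}

print(describe(invoice))
print(describe(contract))
```

Two property definitions serve both classes, so a search on "document_date" spans invoices and contracts without the designer maintaining four separate fields.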
Folders
Foldering in an ECM system
works conceptually differently than in
a file system. In the former, folders
are used not so much for storing
documents as for organizing them.
Take the example illustrated in Figure 1.
Figure 1: Foldering
Convention: Container attributes followed
by an asterisk (*) are linked to document
contents.
Here, a sample insurance claim
folder is linked to two constituent documents by virtue of a claim number
property. This type of setup allows a
user to browse many disparate document classes and/or document types.
Foldering also has workflow implications. Routing
folders (as opposed to individual documents) through workflow queues
greatly reduces confusion in case-driven workflow scenarios.
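One way to picture folders as an organizing device rather than a storage location is to build them on the fly from a shared property. In this sketch the property name "claim_number", the build_folders function, and all sample documents are invented for illustration.

```python
from collections import defaultdict

# Sample documents of different classes; every value here is made up.
documents = [
    {"class": "ClaimForm", "claim_number": "C-1001", "title": "First notice of loss"},
    {"class": "Correspondence", "claim_number": "C-1001", "title": "Adjuster letter"},
    {"class": "ClaimForm", "claim_number": "C-2002", "title": "Auto claim"},
]


def build_folders(docs, link_property="claim_number"):
    """Group documents into virtual folders keyed by a shared property value."""
    folders = defaultdict(list)
    for doc in docs:
        folders[doc[link_property]].append(doc)
    return dict(folders)


folders = build_folders(documents)
for claim_number, contents in folders.items():
    print(claim_number, [d["title"] for d in contents])
```

The documents themselves never move. The folder for claim C-1001 is simply the set of documents, of whatever class, that carry that claim number.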
Describing Design Style
Concepts, Choices
Document Class
Every ECM system will expose
some basic properties for the document and folder classes it models.
These out-of-box properties are found
at the top-most, basic level of the document taxonomy (the Kingdom level
in biology). Typical properties that
ECM systems may place on this level
include:
• Date Created – The date/time stamp
at which the content artifact was
created in the ECM system (This is
distinguished from the business-centric “Document Date” property.)
• Document Identifier – The unique
system identifier required for every
document. No two documents
within the ECM system will have
the same identifier, and no identifier will ever be reused.
• Document Title – The readable title
of a document in an ECM system.
This property is typically not required.
In this case, the first level is referred to as the document class.
(Folder objects would ostensibly derive from a folder class. For now, the
focus is on documents.) The document
class is the root pattern, the forebear,
of all documents, and the properties
at this level will be inherited down to
every other document subclass along
the hierarchy. Every document
within the spectrum of the ECM system will contain at least the characteristics of the document class.
Enterprise Class
The next level down is the first
level of specialization. The properties
from the document class have been
inherited and will look like any other
native properties at this level. But, in
addition to the inherited properties,
the taxonomy designer has the choice
of defining new ones.
Typically, the document characteristics found on this level apply to
every document in the organization.
Using the insurance company paradigm as an example, some properties
that might make sense at this level
include:
• Active – A Boolean indicator used
to aid in records management by
marking the content as either an
active or inactive record
• Document Type – A choice listing of
document types that add granularity to the class by further specializing into document sub-categories
(This will be illustrated further
down.)
• Document Date – The polymorphic,
business-centric date of a document (e.g., a contract effective date
or an invoice date). This is distinguished from the technical “Date
Created” property.
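The first two levels of the hierarchy might be modeled as follows. This is a sketch under stated assumptions: the class names Document and EnterpriseDocument, the field names, and the use of a random UUID as the system identifier are illustrative choices, not how any specific ECM system does it.

```python
import uuid
from dataclasses import dataclass, field, fields
from datetime import date, datetime
from typing import Optional


@dataclass
class Document:
    """Root level: out-of-box system properties."""
    document_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    date_created: datetime = field(default_factory=datetime.now)
    document_title: Optional[str] = None  # typically not required


@dataclass
class EnterpriseDocument(Document):
    """First level of specialization: properties for every document in the organization."""
    active: bool = True
    document_type: Optional[str] = None
    document_date: Optional[date] = None  # business-centric, unlike date_created


doc = EnterpriseDocument(
    document_title="Policy renewal",
    document_type="Letter",
    document_date=date(2012, 5, 15),
)

# Inherited properties are listed first and look like native ones.
print([f.name for f in fields(doc)])
```

At the enterprise level the inherited properties and the newly defined ones sit side by side, which is the behavior described for this level of the taxonomy.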
Core Class
The third level down is where
taxonomy design begins to get creative … and fun! How does one determine that next species of
document? The choice made here determines the basis of taxonomy design and is the real meat of this
discussion.
There are three general design
style patterns that are seen in most
organizations:
• Content-Centric
• Organizational
• Functional
Figure 2: Sample Functional Taxonomy
Content-Centric – In this design
style, the classes are modeled around
the meaning behind the underlying
content. In records management parlance, the file plan counterpart style to
this might be the subject-based file
plan. This design marginalizes the
relevance of an organizational unit or
function in the definition of the document, so if inheritance of security policy is a concern, this may not be the
best option.
The focus here is shifted from the
function or business unit to each content element’s purpose and unique
characteristics (e.g., “what does it
mean to be a correspondence document?”).
Organizational – In organizational design, the classes are modeled
around the organization of the enterprise. In this design style, named
lines of business (LOB) classes are
used as parent containers of the document classes that they work with
and govern. The subsequent layers of
the hierarchy then follow the organization down into smaller and smaller
groupings. Here, content is seen as a
direct function of its parent LOB.
Organizational design is a simple
and security-driven model. It is easy
to map security between LOB users
and the documents they have access
to. One drawback of this model is its
rigidity. Since the classes are tightly
coupled to business units, it may not
be a very good fit for organizations
that experience a lot of restructuring
or mergers and acquisitions.
Functional – Functional design
is modeled around the higher-level
abstractions of the functions that an
organization carries out. This differs
from the organizational design paradigm in that the classes capture
what the corporation does rather than how it is structured.
These functions may mirror the
organizational structure, but from a
more abstract perspective, focusing on the functions or processes for
which the content is used. In records
management, this maps to the retention schedule design of the same
name.
Reviewing a Detailed Taxonomy
Figure 2 represents the standard
way of notating subclass derivation
and property arrangement in object-oriented design patterns. At first
glance, there is a lot going on, but
there is one important distinction
to understand here: document classes
versus document types.
Each rectangle represents a distinct document class modeled for the
sample insurance company. Subclasses inherit all the properties and
security policies of their respective
parents. Properties are listed right
under the class name. Required properties are notated in bold.
The items in blue represent the
special document type property that
was defined earlier on the enterprise
level. It is a property just like document title, only it holds values that
correspond to types of documents
within that same class that share the
same characteristics. This concept is
an important one.
When structured this way, document type becomes a polymorphic
property that can be used to further
specialize classes. Taxonomy designers often have the mistaken belief
that since document classes are used
as classification buckets, a document
type property is not necessary. However, this is rarely the case.
A taxonomy can have both, and it
should. When the specialization
process gets to the point where a document class has nothing to subclass
on, but must still have unique document definitions, document type will
serve as the differentiator.
Therefore, Figure 2 illustrates
two ways in which specialization is
achieved down the hierarchical path:
using the document type property in
cases where the base characteristics
are uniform for all like document
types, or subclassing in cases where
the intrinsic meaning of a document
class must be retained but additional
or different characteristics are
needed.
Compare this to the biology example: if a category of animal differs enough
from the class it was derived
from, it’s a subclass (a tiger is derived
from a carnivore); if not, it’s the equivalent of a document type (a Bengal tiger or a Siberian tiger).
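The choice between subclassing and adding a document type value can be reduced to a simple rule of thumb: if the candidate needs properties its parent class does not have, subclass; otherwise, record it as a document type. The function below is a minimal sketch of that rule, and the class and property names in the example are hypothetical.

```python
def place_in_taxonomy(parent_properties, candidate_properties):
    """Suggest a subclass when new properties are needed, else a document type value."""
    extra = set(candidate_properties) - set(parent_properties)
    if extra:
        return ("subclass", sorted(extra))
    return ("document type", [])


# Hypothetical parent class for an insurer's claim documents.
claim_properties = {"claim_number", "document_date", "document_type"}

# Needs a property the parent lacks, so it earns its own class.
auto_claim = place_in_taxonomy(claim_properties, claim_properties | {"vin"})

# Shares every characteristic of the parent, so a type value is enough.
acknowledgement = place_in_taxonomy(claim_properties, claim_properties)

print(auto_claim)
print(acknowledgement)
```

A real design decision would also weigh security and retention differences, which this sketch leaves out.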
This is the blueprint for a scalable taxonomy. As this tree gets wider
and longer, identifying where something belongs becomes easier.
Linking to Records Management
Taking a page out of the utilities
playbook, Figure 3 illustrates a real-life example of enterprise taxonomy
at work. Here, folders and documents
are mapped to retention schedules.
The Public Utilities Regulatory
Policies Act (PURPA) folder is linked
to its retention by virtue of the folder
class. All PURPA studies expire 10
years after they are created. On the
other side, the project folder is a bit
more complicated, because major
projects are retained for 15 years, but
minor ones for only 5 years. In this
case, we must use the project type
property for retention schedule mapping guidance.
The point is that records declaration can be achieved automatically
based on a pre-determined mapping
between the content and records taxonomies. By assigning ordinary metadata to their content, users may not
even realize they’re actually declaring
and classifying records.
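The retention mapping in this example can be sketched as a small lookup. The retention periods come from the text; the dictionary layout, function names, and the assumption that project retention also runs from the creation date are illustrative only.

```python
from datetime import date

# Retention periods in years, taken from the utility example in the text.
PURPA_YEARS = 10
PROJECT_YEARS = {"Major": 15, "Minor": 5}


def retention_years(folder):
    """Look up retention from the folder class, or from project type when needed."""
    if folder["class"] == "PURPA":
        return PURPA_YEARS
    if folder["class"] == "Project":
        return PROJECT_YEARS[folder["project_type"]]
    raise KeyError(f"no retention rule for class {folder['class']!r}")


def add_years(start, years):
    """Shift a date by whole years; Feb 29 falls back to Feb 28 when needed."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)


def expiry_date(folder):
    """Assumes retention is counted from the folder's creation date."""
    return add_years(folder["date_created"], retention_years(folder))


purpa = {"class": "PURPA", "date_created": date(2012, 5, 1)}
major = {"class": "Project", "project_type": "Major", "date_created": date(2012, 5, 1)}
minor = {"class": "Project", "project_type": "Minor", "date_created": date(2012, 5, 1)}

for folder in (purpa, major, minor):
    print(folder["class"], folder.get("project_type", "-"), expiry_date(folder))
```

The user only fills in ordinary metadata such as the project type. The retention rule follows from the class and that one property, with no separate declaration step.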
Designing for the Enterprise
Defining an abstract taxonomy
and actually implementing it in a specific system are two different things,
oftentimes very different. Each ECM
system has slightly different nuances
in how things are named, how they’re
structured, and how they’re connected.
The core concepts presented here
in a general sense can be taken and
implemented in any content repository. A taxonomy shouldn’t be married to its host system; the core
concepts of an abstract taxonomy
should migrate to an ECM system of
choice.
One standardization tactic is to
refer to the Content Management Interoperability Services (CMIS) specification for guidance. CMIS allows
different ECM systems to communicate and exchange information with
each other or with some other program. It acts as a layer of abstraction
so all ECM platforms speak the same
language.
For example, where one platform calls its main data store an “Object Store,” CMIS standardizes it by
calling it a “Repository.” CMIS models the document, folder, and repository objects for the taxonomy
designer to derive from.
Getting Started with Workshops
To begin a real-world implementation, the best strategy is to
start with the front-line warriors,
the end users. Begin with one LOB
and involve it in the taxonomy-building process right away through
interactive workshops and questionnaires.
Figure 3: Content and Records Taxonomy
This usually starts out as very
qualitative work. The idea is to get
the narrative of what drives the organization’s data. Discuss the
process, not the document. Every
process has an input and an output.
Whether these are documents, work
list items, or records (or all three),
they can all have a place in an intelligently designed taxonomy.
The following six pieces of information are crucial to understanding
the full picture in terms of breadth
and scope of content elements in
play at any organization:
1. Document characteristics (volume,
format, input)
2. Organizational structure
3. Process
4. Security
5. Retention
6. Reporting
If building a taxonomy from
scratch, there will invariably be confusion among concepts like document classes, document types,
inheritance, and polymorphism. It is
a good idea to abstract these concepts as much as possible. As these
are identified, cross-reference them
in a matrix that can then be used to
build the taxonomy hierarchy.
The matrix should be a listing of
document types and their properties, so a quick glance can yield the
lay of the land, and document types
can be collapsed into broader
classes. Other useful tools include
data dictionaries for property semantics and nomenclature, a system
requirements specification, and a
system design specification.
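A matrix of this kind can be as simple as a mapping from each document type to its set of properties. The sketch below shows how such a matrix might reveal which types collapse into one class and which properties belong on a shared parent. The document types, property names, and function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical workshop output: document types and the properties each one needs.
matrix = {
    "Invoice": {"document_date", "document_number", "vendor"},
    "Credit Memo": {"document_date", "document_number", "vendor"},
    "Contract": {"document_date", "document_number", "counterparty"},
}


def collapse(type_matrix):
    """Group document types with identical property sets into candidate classes."""
    groups = defaultdict(list)
    for doc_type, props in type_matrix.items():
        groups[frozenset(props)].append(doc_type)
    return [(sorted(props), sorted(types)) for props, types in groups.items()]


def shared_properties(type_matrix):
    """Properties common to every type: candidates for a parent class."""
    return set.intersection(*type_matrix.values())


for props, types in collapse(matrix):
    print(types, "->", props)
print(sorted(shared_properties(matrix)))
```

Here "Invoice" and "Credit Memo" share every property, so they are candidates for document types within one class, while the properties common to all three rows point to what a parent class should define.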
Avoiding Common Pitfalls
Don’t Design Around a Platform
It is important to design the
platform around the taxonomy, not
vice-versa. As stated earlier, the hierarchy from Figure 2 should be deployable anywhere.
Keep in mind the importance of
polymorphic document types. Strive
to keep the taxonomy lean and flexible, and look for warning signs of
rigidity in data architecture.
Reject Unneeded Properties
Occasionally, taxonomy project
stakeholders and others may demand the inclusion of properties
that are out of place or not in concert
with taxonomy design. It is important to stand one’s ground when this
happens. Not all metadata is equal,
and the following set of questions
may help weed out what may be a
“nice to have” property from what is
truly necessary:
1. Is the property used to search on
and/or to display in hit list results?
2. Does it have any use for business
process/workflow?
3. Does it have any use for records
management-related functions?
4. Is it used for reporting?
These four dimensions should cover most legitimate needs. The problem with “nice-to-have” requests is
that they can cause the taxonomy to grow
bulky and unwieldy.
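The four questions can double as a simple screening checklist. In this sketch, each proposed property is recorded with a yes/no answer per question; the key names, the is_needed function, and the sample properties are assumptions made for the example.

```python
# One key per question: search/display, workflow, records management, reporting.
USES = ("search", "workflow", "records", "reporting")


def is_needed(proposed_property):
    """Keep a property only if at least one of the four questions is answered yes."""
    return any(proposed_property.get(use, False) for use in USES)


# Hypothetical requests gathered from stakeholders.
proposed = [
    {"name": "claim_number", "search": True, "workflow": True},
    {"name": "document_date", "records": True, "reporting": True},
    {"name": "favorite_color", "search": False},
]

kept = [p["name"] for p in proposed if is_needed(p)]
rejected = [p["name"] for p in proposed if not is_needed(p)]
print("keep:", kept)
print("reject:", rejected)
```

A property that answers "no" to all four questions has no operational use, which gives the designer a concrete basis for declining the request.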
Growing the Taxonomy
Once a basic blueprint for the taxonomy is established, it should get
conceptually easier to grow it as the
need evolves. If it is not, that is often
a sign that something is wrong at a
higher level.
Eugene Stakhov, CRM, CDIA+, can be
contacted at gstakhov@lighthousecs.com. See his bio on page 47.