Search and Discovery Technology Stack v4

Behind the Curtain: Understanding the Search and
Discovery Technology Stack
Patrick Lambe
The Lion thought it might be as well to frighten the Wizard, so he gave a
large, loud roar, which was so fierce and dreadful that Toto jumped away
from him in alarm and tipped over the screen that stood in a corner. As it fell
with a crash they looked that way, and the next moment all of them were
filled with wonder. For they saw, standing in just the spot the screen had
hidden, a little old man, with a bald head and a wrinkled face, who seemed to
be as much surprised as they were. The Tin Woodman, raising his axe, rushed
toward the little man and cried out, "Who are you?"
"I am Oz, the Great and Terrible," said the little man, in a trembling voice.
"But don't strike me--please don't--and I'll do anything you want me to."
L. Frank Baum. The Wonderful Wizard of Oz. Project Gutenberg (2008)
The Wizard of Oz captures the life trajectory of many technologies, and search is no
exception: the technology starts in the public domain, it is appropriated and made
mysterious (hidden behind the curtain) by commercial interests, then touted and
overblown as magical; eventually the curtain falls, and we can see what has been
there all along, and work with it more intelligently as it is, knowing what it’s good for
and knowing what it’s not good for. This is true of search, as it is true of
auto-classification. Modern search technology has its roots in Vannevar Bush’s seminal
1945 paper “As We May Think” and its foundations were developed by research
groups at MIT, Cornell and Harvard into the 1970s.
One of the consequences of hiding a technology behind the commercial curtain is
that different technologies compete with each other for the same market, and
therefore downplay their perceived competitors, or attempt to duplicate the
functionalities of their perceived competitors. Search technology has competed with
taxonomy work for years – “you don’t need taxonomies,” went the mantra of one
infamous search vendor, “our software will understand your content for you.” The
curtain fell with the unveiling of Google Knowledge Graph, when it became obvious
just how much Google was investing in knowledge organisation structures
(controlled vocabularies, taxonomies, ontologies, knowledge graphs) to make its
search smarter. The superior performance of open source search engines (and their
striking adoption rates) over commercial search engines in solving particular types of
problem is, like Toto’s leap, another manifestation of the dropped curtain.
Machine classification of content has been another example. It is frequently touted
by its commercial vendors as a single technology performing a single “we’ll tag your
content for you” function, when it is actually a bundle of quite distinct technologies
rooted in information retrieval research and search technologies developed in
the 1970s and 1980s. Machine classification vendors have found themselves
competing with search (“we can make your search smarter”) and taxonomies (“we
can derive taxonomies automatically for you”). The curtain is now starting to fall,
with large organisations learning how to use open source toolkits such as the
University of Sheffield’s GATE toolset to solve their search and discovery problems.
The “go it alone” propaganda of the technology vendors is damaging because it locks
buyers into a single toolset designed for common denominators, with prohibitive
customisation and maintenance costs. The “magic black box” propaganda blinds
organisations to the need to build capabilities in a range of tools. And the “we do
everything” propaganda hides the synergies and capabilities that can be attained by
intelligently using search, machine classification and knowledge organisation
technologies as a toolkit, with different toolsets to be used in different combinations
to solve different kinds of search and discovery problem.
The purpose of this article is to explain the uses and interactions of three
technologies that together support search and discovery, so that organisations can
learn which tools to adopt in which combination to solve different problems, and
they can see what kind of internal capabilities they need to build and maintain to
continue solving those problems:
• Enterprise search
• Taxonomy and ontology management
• Text analytics to perform machine classification.
Enterprise search is a technology for delivering useful and relevant results to users
exploring a large content base. As a technology it is relatively simple and not
especially smart, and to deliver superior results in complex environments it needs to
be supported by other technologies. “Complexity” can refer to the diversity of user
communities, and to the diversity of the content itself. Search engines by themselves
don’t understand anything. They crawl and index, process queries and serve up user
analytics. To become smart, they need human-designed curation tools (taxonomy
and ontology tools) and tools for scaling the human tagging of content (text analytics
tools).
Taxonomy and ontology management technologies support designing, delivering
and maintaining knowledge structures and controlled vocabularies to enhance the
search and discovery experience. These structures are based on an analysis of
business and user needs together with an analysis of the target content for search
and discovery, and they help to disambiguate concepts, capture variations in
language and map them as synonyms, and identify relationships between concepts
so that searches can be expanded or narrowed. The constraints of taxonomy and
ontology management most often relate (a) to staying up to date with new and
emerging needs of diverse user sets (which is where search analytics can help), and
(b) to scaling the way that concepts can be identified in content and metadata
enriched (which is where text analytics and machine classification can help).
Text analytics refers to a cluster of technologies that extract meaning from text and
turn it into metadata. Some of the root technologies (crawling and indexing) are
shared with search. These technologies are a response to the problems of human
inconsistency (people are not consistent in how they apply tags to content) and of
the scale of content to be tagged (when it would not be feasible to apply
human-curated tagging to large amounts of content in a short period of time). Different
technologies are used to supply different types of tagging operation, ranging from
identification of known entities such as people, organisations and places, quantities
such as money amounts, to the identification of abstract concepts that characterize
what a document is “about” in ways that make sense to target users. The constraints
of text analytics are (a) that it has a hard time figuring out without intensive human
curation or very well structured training sets which concepts are most salient to the
most people (which is where taxonomy and ontology management helps) and (b)
like search, it requires constant tuning based on emerging user needs (which is
where search analytics on user behaviours can help).
These technologies can be extremely powerful when used in combination in support
of the search and discovery experience. However, it is important to understand what
they are capable of, how they work, and the limitations of what they can do.
Figure A1 shows a detailed view of the three pillars of the search and discovery
technology stack, how they work, and their key interactions. We begin at the top of
the diagram. Since these technologies are supposed to be deployed in the service of
user needs, decisions about their configuration and design need to be made in the
context of known user needs, as well as a thorough understanding of the content to
be covered. Note that this is a simplified conceptual framework, and it does not fully
represent the complexities of the relationships between the pillars, nor all the
overlaps between them.
Figure A1 Detail view of the search and discovery technology stack
“Content” can be structured, semi-structured or unstructured. It can be in the form
of data, documents, web content, audiovisual files, discussions, or questions.
“People” can also be considered a form of content, since knowledgeable people may
also be a useful result in a search on a particular topic. In a content management
system a person profile can be used as a proxy for that person, and can be tagged
with relevant subject tags so as to surface in search alongside other forms of
content.
All three technologies in the stack depend upon a thorough survey of the target
content to be covered, and preparation and structuring of the content stores so that
the search, taxonomy management and text analytics technologies can access them
and operate upon them.
Enterprise Search
Let’s begin with the central pillar, the enterprise search stack. There are six main
components in the search stack.
Crawling
The first component is the crawling module. This is directed to content sources, and
it crawls the content in preparation for indexing. You may want to be able to direct
the crawler to specific parts of your content.
Content Processing
Content is then processed, using tools in common with initial processing of content
by text analytics tools. A text file is created for parsing and indexing, and the content
is tokenized, meaning each element is given a unique identifier so that it can be
manipulated and located later on. This is also the point at which lemmatization
(resolving variant grammatical forms such as tenses and plurals to the same
grammatical root) and stemming (resolving different word endings to the same
word-stem) occur. You may also want to remove “stop words” that will create
meaningless noise (common operators such as if, but, and, the, a) from the indexing.
Your stop words may need to be tuned. Sometimes stop words that are noisy in
search indexes provide semantic clues in some rules-driven applications of text
analytics.
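The processing steps described above can be sketched in a few lines. This is a minimal illustration only: the stop word list and the suffix rules are assumptions made for the example, and a production engine would use a proper stemmer or lemmatizer rather than the crude suffix stripping shown here. The token position stands in for the "unique identifier" that lets a term be located later.

```python
import re

# Hypothetical stop word list; a real deployment would tune this.
STOP_WORDS = {"if", "but", "and", "the", "a", "an", "of", "to", "in", "is", "are"}

# Crude suffix rules standing in for a real stemmer (assumption for illustration).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into lowercase word tokens, each paired with its position."""
    matches = re.finditer(r"[A-Za-z0-9]+", text)
    return [(i, m.group().lower()) for i, m in enumerate(matches)]


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def process(text):
    """Tokenize, drop stop words, and stem what remains."""
    return [(pos, stem(tok)) for pos, tok in tokenize(text) if tok not in STOP_WORDS]
```

Calling `process("The crawler indexed the documents")` keeps three tokens with their original positions, reducing "indexed" to "index" and "documents" to "document". Removing a word from `STOP_WORDS` is all it takes to keep it available for a rules-driven text analytics application.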
Indexing
The indexing module creates term indexes from the parsed text. Each term is linked
to its source content, and tokenization provides the context in which the search term
appears. The pre-processed indexes provide speed in search. Crawling and
pre-processing of large-scale content collections can be very slow and occupy
significant processing capacity. Hence most indexing consists of incremental updates
to the indexes by noticing, crawling and indexing new content whenever it is added.
It is very important to plan the configuration and setup of your search engine
carefully in advance. Any major changes to the search engine configuration, or to the
way it exploits taxonomy or text analytics, may require an expensive and slow
complete re-crawl and re-index.
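To make the idea of a term index with incremental updates concrete, here is a minimal sketch of an inverted index. The class and method names are illustrative assumptions, not the interface of any particular search engine, and the whitespace tokenization is deliberately naive.

```python
from collections import defaultdict


class InvertedIndex:
    """Maps each term to the documents and token positions where it occurs."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: [positions]}
        self.indexed_docs = set()

    def add_document(self, doc_id, text):
        """Index one new document; already-indexed documents are skipped."""
        if doc_id in self.indexed_docs:
            return False
        for position, term in enumerate(text.lower().split()):
            self.postings[term].setdefault(doc_id, []).append(position)
        self.indexed_docs.add(doc_id)
        return True

    def lookup(self, term):
        """Return {doc_id: [positions]} for a term, or an empty dict."""
        return dict(self.postings.get(term.lower(), {}))
```

Each call to `add_document` touches only the new document, which is why incremental updates are cheap. Changing how terms are derived (for example, switching the tokenization) would invalidate every stored posting, which is the reason a configuration change can force a full re-index.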
Auto-categorization
Some search engines also offer simple auto-categorization functions. In the absence
of sophisticated text analytics tools, these are usually entity recognition based on
lookups of reference term lists. These lists can be maintained within the search
engine itself, or can be supplied from a taxonomy management system. This is
referred to as “known entity extraction via lookup”. The entities you are interested
in calling out (for example company names or country names) are compiled into lists,
and are registered as significant entities if the main search indexes come across
these terms or their synonyms. This can provide a very simple filtering capability
based on the entity types you are interested in. However, this is a fairly basic
technology. Just because a document mentions “Google” does not mean that the
document is substantially “about” Google.
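A sketch of known entity extraction via lookup shows both how simple it is and why it is limited. The reference lists below are hypothetical, as is the function name; in practice the lists would come from the search engine or a taxonomy management system.

```python
import re

# Hypothetical reference lists: entity type -> preferred name -> variants.
REFERENCE_LISTS = {
    "company": {
        "Google": ["Google", "Alphabet"],
        "IBM": ["IBM", "International Business Machines"],
    },
    "country": {"Singapore": ["Singapore"], "Malaysia": ["Malaysia"]},
}


def extract_known_entities(text):
    """Return {entity_type: {preferred_name: mention_count}} for listed terms."""
    found = {}
    for entity_type, entries in REFERENCE_LISTS.items():
        for preferred, variants in entries.items():
            count = sum(
                len(re.findall(r"\b" + re.escape(v) + r"\b", text, flags=re.IGNORECASE))
                for v in variants
            )
            if count:
                found.setdefault(entity_type, {})[preferred] = count
    return found
```

A document saying only "We searched on Google for the report" is still tagged with the company Google, which illustrates the point that a mention is not the same as being "about" something. The mention count gives a rough signal, but it is no substitute for the more sophisticated techniques described under text analytics.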
Query Processing
Much of the utility of a search engine lies in the sophistication with which it handles
queries and query interfaces.
Query interfaces can include the ubiquitous search query box, browsing taxonomy
structures where concepts are pre-linked to content through stored searches
(requires integration with a taxonomy management module), and filtering of search
results using metadata elements or taxonomy facets (again requires integration with
taxonomy).
Many search engines offer auto-complete or auto-suggest of search queries,
suggesting common search queries as the user types their query into the search box.
Auto-complete can exploit common search queries, existing taxonomy or thesaurus
terms (linked to results through stored searches), and/or keywords in context from
metadata such as titles of high value content.
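As an illustration of how auto-complete can draw on more than one source, the sketch below merges past queries with taxonomy terms. The ordering policy (popular queries first, then taxonomy terms alphabetically) is an assumption chosen for the example, not a description of how any given engine ranks suggestions.

```python
def suggest(prefix, query_counts, taxonomy_terms, limit=5):
    """Rank suggestions: frequent past queries first, then taxonomy terms."""
    prefix = prefix.lower()
    from_queries = sorted(
        (q for q in query_counts if q.lower().startswith(prefix)),
        key=lambda q: (-query_counts[q], q),
    )
    from_taxonomy = sorted(
        t for t in taxonomy_terms
        if t.lower().startswith(prefix) and t not in from_queries
    )
    return (from_queries + from_taxonomy)[:limit]
```

With query counts of `{"leave policy": 40, "leave form": 25, "laptop request": 10}` and taxonomy terms `["Leave entitlement", "Learning"]`, typing "lea" offers the two popular leave queries first and then the taxonomy terms, so users see controlled vocabulary even for concepts nobody has searched for yet.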
The relevance of search results for common queries can be tuned at the back-end by
a search manager by altering the rules that calculate the likely relevance of a given
piece of content for a given query. Where there is a known “best” piece of content
for a given query, this can be promoted to the top of the results page. Relevance
tuning can also be automated, by looking at click behaviors between query and
clicking on results and promoting content that is frequently clicked on.
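The two tuning mechanisms just described, manual promotion of a known best result and automated promotion based on clicks, can be sketched together. The function name, the data shapes and the `click_weight` parameter are all assumptions for illustration; real engines expose these controls in their own ways.

```python
def rank_results(query, scored_docs, best_bets, click_counts, click_weight=0.1):
    """Order results: promoted best bet first, then score plus a click boost.

    scored_docs:  {doc_id: base relevance score from the engine}
    best_bets:    {query: doc_id} maintained by the search manager
    click_counts: {(query, doc_id): clicks observed for that pairing}
    """
    def tuned(doc_id):
        clicks = click_counts.get((query, doc_id), 0)
        return scored_docs[doc_id] + click_weight * clicks

    # Highest tuned score first; ties broken by document id for stability.
    ranked = sorted(scored_docs, key=lambda d: (-tuned(d), d))
    promoted = best_bets.get(query)
    if promoted in scored_docs:
        ranked.remove(promoted)
        ranked.insert(0, promoted)
    return ranked
```

The click weight is the kind of setting that needs supervision: set too high, early clicks on a mediocre result become self-reinforcing.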
Relevance-tuning can exploit what is known about the users and their contexts.
Knowing that a user is searching from within a given organizational function enables
rules to be written for how queries from those users should be handled, from zoning
the search to known target content for that functional group, to handling relevance
calculation based on metadata associated with those functions and users, and
matching it with the indexed content. Content ranking mechanisms can also be
pulled into the mix.
Search Analytics
Query handling and search tuning require sophistication in the search engine itself as
well as a relatively sophisticated search management capability – i.e. the way that
search is staffed, administered, configured and constantly tuned in relation to known
user needs.
Search analytics is the module that powers this sophistication and tuning. Search
engines have very powerful reporting capabilities, tracking how queries are
conducted and user interactions with results pages, but as in all complex reporting
capabilities, the smarts lie in how the reports are defined and configured. What will
you track, how, and how frequently? How will you back up hunches gathered from
the reports, and validate them with users? The initial search design strategy based
on use cases and content analysis will give you a starting point. This will be refined
by the insights you gain by observing user behaviors through the search roll-out and
subsequent maintenance.
Taxonomy and Ontology Management
Like search, taxonomy and ontology management requires very robust analysis of
how users need to interact with content, and analysis of the content itself to
understand how it is described, organized and used. Taxonomy development and
management is still best conducted as a work of skilled human design, but it can be
substantially enhanced with the appropriate technologies. There are two broad
functions within enterprise taxonomy and ontology management systems.
Taxonomy Building and Maintaining
Enterprise taxonomy management systems support the work of building and
maintaining taxonomies by providing workflows and metadata around taxonomy
concepts. For example, they can house source vocabularies of varying authority and
structure that can be used to generate candidate terms for the taxonomy. There are
approval workflows to track the approval process. There may be roles-based
permissions for complex environments where multiple taxonomy professionals are
involved in maintaining the vocabularies. Sources and usages of taxonomy terms can
be tracked as attributes of the taxonomy concepts. Scope notes can be added to
explicate taxonomy terms, and the revision history of terms and taxonomy facets
can be tracked.
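To show what "metadata around taxonomy concepts" might look like as a data structure, here is a minimal sketch of a concept record with a source, a scope note, an approval status and a revision history. All field and method names are assumptions made for illustration; commercial and open source taxonomy tools each have their own data models.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class TaxonomyConcept:
    """One taxonomy concept with governance metadata (field names are assumptions)."""
    preferred_label: str
    scope_note: str = ""
    source: str = ""
    status: str = "candidate"  # candidate -> approved
    synonyms: list = field(default_factory=list)
    history: list = field(default_factory=list)

    def _log(self, action, user, when):
        self.history.append((when, user, action))

    def approve(self, user, when=None):
        """Move a candidate term to approved and record who did it."""
        if self.status != "candidate":
            raise ValueError("only candidate terms can be approved")
        self.status = "approved"
        self._log("approved", user, when or date.today())

    def add_synonym(self, label, user, when=None):
        """Attach a synonym and record the change in the revision history."""
        if label not in self.synonyms:
            self.synonyms.append(label)
            self._log(f"added synonym '{label}'", user, when or date.today())
```

Even this toy record makes the governance point: every change to a term carries a who and a when, so the taxonomy can be audited and rolled back.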
Reference Store
The second major function of an enterprise taxonomy management system is to act
as a reference store for other systems, supplying taxonomy terms, relationships
between terms, synonyms and controlled vocabularies to other systems. It can
supply:
• controlled vocabulary terms to content management systems to support
manual tagging of content;
• terms and relationships to support auto-categorization via known entity
lookup, whether applied within a search engine or a text analytics engine;
• synonyms and relationships to search so that it can improve relevance and
precision of results, and support expansion of search;
• taxonomy facets to search to support filtering of search results.
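One of the services listed above, supplying synonyms and relationships so that search can be expanded, can be sketched as a simple query expansion step. The thesaurus content and the function name are hypothetical, and the dictionary stands in for whatever interface a taxonomy management system actually exposes.

```python
# Hypothetical thesaurus as a taxonomy management system might supply it.
THESAURUS = {
    "annual leave": {
        "synonyms": ["vacation", "holiday entitlement"],
        "narrower": ["carry-over leave"],
    },
    "carry-over leave": {"synonyms": [], "narrower": []},
}

# Reverse map from any synonym to its preferred concept label.
PREFERRED = {s: c for c, e in THESAURUS.items() for s in e["synonyms"]}


def expand_query(term, include_narrower=False):
    """Return the list of terms to search for, preferred label first."""
    term = term.lower()
    concept = PREFERRED.get(term, term)
    entry = THESAURUS.get(concept)
    if entry is None:
        return [term]  # unknown term: search as typed
    terms = [concept] + entry["synonyms"]
    if include_narrower:
        terms = terms + entry["narrower"]
    return terms
```

A user who types "Vacation" is searched against "annual leave", "vacation" and "holiday entitlement", and switching on `include_narrower` broadens the search along the hierarchy.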
Because they are built to manipulate defined relationships between concepts and
their attributes, taxonomy and ontology management systems can also act as
brokers to other Linked Data or Linked Open Data sources, to call authority terms
and relationships from elsewhere, to support search and/or to support enrichment
of content via text analytics.
Finally, enterprise taxonomy and ontology management systems are built to act as
taxonomy/ontology hubs catering to multiple systems and audiences. They are
capable of interacting differently with different platforms and audiences. For
example, different versions of the same core taxonomy can be presented differently
to different audiences, depending on their needs. Different vocabularies and
taxonomy services can be supplied to different systems based on their content and
uses.
Text Analytics
Text analytics refers to a quite diverse family of computer-assisted techniques for
assigning meaning to content. Different modules can be used to serve different
purposes.
Content Processing
There are some basic content processing functions that text analytics has in common
with enterprise search: preparation of text for parsing, tokenization, lemmatization,
stemming and stop word removal.
Syntactic and Semantic Pre-Processing
Several functional applications of text analytics depend on two foundational
processing techniques.
The first, syntactic analysis, analyzes the language structure of text grammatically,
using very large open source training sets. Sentence splitters break up sentences into
parts of speech, distinguishing nouns and noun phrases from verbs and operators,
and so on, and syntactic analyzers create a logical representation of the sentences.
Semantic analysis attempts to infer meaning from analyzed texts. Like syntactic
analysis it depends on reference training sets. It enables computers to infer
meaningful assertions and facts from bodies of text, and underpins the base
technologies behind machine translation. Syntactic analysis is based on known rules
of grammar. Semantic analysis has a much more challenging task, which is to infer
human-understandable meanings. Not all grammatically correct sentences are
meaningful to humans – "Colorless green ideas sleep furiously" is a famous example
presented by Noam Chomsky of a syntactically correct sentence that is semantically
incoherent.
Syntactic and semantic techniques underpin the identification of “entities” within
content. In the context of text analytics, an “entity” can be a person, a thing, a place,
a time, a topic. It is any distinct concept that can be identified from text.
Identification and extraction of entities can be done most simply by looking
up reference lists, as in auto-categorization of known entities provided by some
search engines.
However, with syntactic and semantic analysis text analytics can also identify
unknown (unpredicted) entities for which you don’t yet have examples in your lists.
For example, in English proper names are capitalized, and personal names have
known forms (that also vary by culture). Hence candidates for persons mentioned in
text could be picked up by a rule that says “look for one or more capitalized subjects
or objects of sentences, and then look for lemmatized verbs from the following list
‘say’… ‘reply’… etc”.
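The rule quoted above can be approximated with a regular expression. This is a deliberately rough sketch: it uses surface word patterns rather than true syntactic analysis, the verb forms are listed by hand instead of being lemmatized, and the names `SPEECH_VERBS` and `candidate_persons` are assumptions for the example.

```python
import re

# Surface forms listed by lemma; a real tool would lemmatize properly.
SPEECH_VERBS = {
    "say": ["say", "says", "said"],
    "reply": ["reply", "replies", "replied"],
}
VERB_FORMS = "|".join(f for forms in SPEECH_VERBS.values() for f in forms)

# One or more capitalized words immediately followed by a speech verb.
PERSON_RULE = re.compile(r"\b((?:[A-Z][a-z]+\s)+)(?:" + VERB_FORMS + r")\b")


def candidate_persons(text):
    """Return capitalized name candidates that directly precede a speech verb."""
    return [m.group(1).strip() for m in PERSON_RULE.finditer(text)]
```

On "After the crash, Dorothy Gale said nothing. Tin Woodman replied at once." the rule proposes "Dorothy Gale" and "Tin Woodman". On "The Lion replied." it proposes "The Lion", sweeping in the sentence-initial article, which is exactly the kind of behaviour that has to be found by testing and then refined.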
Such rules are based on common usage as represented in their training sets and are
typically provided as standard. These rules will typically need to be tested and
refined based on local usage. For example, we found that one standard tool did not
pick up country of origin in content from an immigration and customs authority,
because it was using simple term-phrase lookup. However, customs officers at
checkpoints habitually used the convention “[Country]-registered vehicle”. The
hyphenation and extension of the term disrupted the simple lookup function. Notice
that the solution to this problem is a combination of techniques, likely
supplementing known entity lookup with syntactically driven entity extraction rules.
The practice of using different techniques in combination is characteristic of text
analytics approaches.
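The customs example can be reproduced in miniature. The sketch below assumes one plausible failure mode (a lookup that only matches whole whitespace-delimited tokens); the actual tool's internals are not known, and the country list, pattern and function names are illustrative only.

```python
import re

COUNTRIES = ["Malaysia", "Thailand", "Indonesia"]  # hypothetical lookup list


def lookup_only(text):
    """Naive lookup: match list entries only as whole whitespace-delimited tokens."""
    tokens = text.replace(",", " ").replace(".", " ").split()
    return [c for c in COUNTRIES if c in tokens]


# Pattern rule for the local convention "[Country]-registered vehicle".
REGISTERED = re.compile(r"\b([A-Z][a-z]+)-registered\s+vehicle\b")


def lookup_plus_rule(text):
    """Combine list lookup with the pattern rule, validated against the list."""
    found = set(lookup_only(text))
    found.update(c for c in REGISTERED.findall(text) if c in COUNTRIES)
    return sorted(found)
```

For "Officer stopped a Malaysia-registered vehicle arriving from Thailand." the naive lookup finds only Thailand, because "Malaysia-registered" is a single token. Adding the pattern rule recovers Malaysia, and validating the captured word against the country list keeps the rule from inventing entities.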
With these foundational capabilities, there are many specialized applications of text
analytics, not all of which will be suitable to every purpose.
Auto-categorization
The ability to identify both known and unknown entities/concepts is a significant
improvement in auto-categorization capabilities. Lookup lists for known
entities/concepts can be supported by a taxonomy management system, and
strengthened through the supply of synonyms.
Conversely, through the judicious use of rules, text analytics can discover
entities/concepts mentioned in text that are not yet in the taxonomy or controlled
vocabularies, and can be considered as candidate terms.
Text Mining
Text mining refers to a number of techniques for analyzing text based on statistical
algorithms. In very broad terms, large bodies of text are submitted to iterative
operations using algorithms to “sort” the text into clusters or “buckets” that are
internally statistically similar, and statistically dissimilar to the other clusters or
buckets.
Statistical techniques are also used to determine the words or word-phrases that are
most likely to uniquely represent their bucket compared to other buckets. These
word strings are often called “topic models”. The likelihood of specific terms
occurring in proximity to each other can also be calculated.
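A very simple way to find the words that best distinguish one bucket from the others is to compare each term's share of its own bucket with its share of everything else. The scoring formula below is an assumption chosen for clarity; real text mining tools use a range of more refined statistics.

```python
from collections import Counter


def distinctive_terms(buckets, top_n=2):
    """For each bucket, rank terms by in-bucket share minus share elsewhere."""
    counts = {
        name: Counter(w for doc in docs for w in doc.lower().split())
        for name, docs in buckets.items()
    }
    result = {}
    for name, own in counts.items():
        others = Counter()
        for other_name, c in counts.items():
            if other_name != name:
                others.update(c)
        own_total = sum(own.values())
        other_total = sum(others.values()) or 1
        scores = {t: own[t] / own_total - others[t] / other_total for t in own}
        # Highest score first; ties broken alphabetically for repeatability.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[name] = ranked[:top_n]
    return result
```

Given an "hr" bucket of `["leave policy update", "leave form"]` and an "it" bucket of `["laptop policy", "laptop request form", "laptop repair"]`, the top terms are "leave" and "update" for the first and "laptop" and "repair" for the second. Words such as "policy" and "form" score poorly because they appear on both sides.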
Text mining is sometimes touted as a fully-automated alternative to describing and
tagging content. In practice however, the clustering techniques use fuzzy algorithms,
and the end “described” state is highly influenced by which clustering pathways are
taken at the beginning of the operation. Text mining techniques suffer from
non-reproducibility (the same series of operations on the same content can produce
different results on different occasions) and from non-comprehensibility – the topic
models generated often do not match how users themselves describe the content.
So in practice, applications of text mining are heavily tuned by their human
operators, and it is not always transparent how a given set of results is achieved.
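The sensitivity to starting conditions can be demonstrated with a toy clustering routine. The sketch uses k-means on one-dimensional points, which is an assumption for illustration (the text does not name a specific algorithm), and the starting centroids are passed in explicitly so that the effect is repeatable here. Implementations commonly choose starting points at random, which is one way the same content can yield different buckets on different runs.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Cluster 1-D points around the given starting centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        new = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return [sorted(c) for c in clusters]
```

With the points `[0, 1, 2, 10, 11, 12, 20, 21, 22]`, starting from centroids `(0, 10)` groups the middle values with the high ones, while starting from `(12, 22)` groups them with the low ones. Both outcomes are stable, and neither is "wrong", which is why human tuning and an external reference such as a taxonomy matter.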
Text mining also suffers from endogeneity – meaning that the tags that are extracted
to describe similar buckets of content are wholly derived from that content set (or,
when training sets are used for comparison, from the target content plus the training
set). Endogeneity is a problem when you want to survey diverse collections of
content and map meanings consistently across them. In that case, a taxonomy or
ontology that crosses those content sets and audiences provides an exogenous
reference point that can broker meanings across different content sets and user
audiences.
However, text mining can be a powerful tool used in combination with search and
taxonomy management. Its statistical techniques to cluster based on similarity can
propose new categorizations and potentially relevant relationships between topics
to the taxonomy manager. Its ability to extract topic models can be useful in
automated summarization techniques.
Semantic Analysis
We have already discussed some of the challenges of complexity facing semantic
analysis, not least the heavy influence of context and culture (factors external to the
content) on meaning. In practical text analytics applications, semantic analysis relies
on a number of techniques and not just computational semantic analysis. For
example, one of the main applications of semantic analysis is to extract assertions or
facts from complex documentation. If semantic analysis has access to ontologies via
a taxonomy management system it can then enrich its inferences based on known
and validated relationships between entities.
Sentiment Analysis
Sentiment analysis is a very specialized application of text analytics, often used in
marketing, social media analysis, and qualitative feedback. It looks at the semantics
of positive, negative and neutral expressions, in association with known (extracted)
entities. Again it is highly contextual, and subject to over-simplification – for
example, sarcastic language can be superficially positive, but the linguistic markers
of sarcasm are often not very obvious. As with most text analytics techniques, the
rules and algorithms need to be tested, validated and constantly supervised for
accuracy.
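A lexicon-based sketch shows both the basic mechanism and the sarcasm problem. The polarity lexicon and the function name are assumptions for illustration, and the sentence-level association of sentiment with entities is about the simplest scheme possible.

```python
import re

# Tiny hypothetical polarity lexicon.
LEXICON = {"great": 1, "helpful": 1, "love": 1, "slow": -1, "broken": -1, "useless": -1}


def entity_sentiment(text, entities):
    """Sum lexicon polarity per sentence and credit it to entities named there."""
    scores = {e: 0 for e in entities}
    for sentence in re.split(r"[.!?]+", text):
        lowered = sentence.lower()
        words = re.findall(r"[a-z']+", lowered)
        polarity = sum(LEXICON.get(w, 0) for w in words)
        for e in entities:
            if e.lower() in lowered:
                scores[e] += polarity
    return scores
```

"The intranet search is slow and broken. The helpdesk was helpful." scores search at -2 and helpdesk at +1, as a reader would expect. But "Great, the search is down again. Love it." scores search at +1: the sarcasm is invisible to the lexicon, which is why results need human validation.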
Summarization
Automatic summarization often exploits known document or data structures (it
knows where to look for key content), and it may exploit syntactic analysis (e.g.
removing extraneous “padding” language such as adjectives and adverbs). It may use
semantic analysis to extract key facts from documents.
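As a rough illustration of automated summarization, the sketch below uses a frequency-based extractive approach: it keeps the sentences whose content words are most common across the text. This is a swapped-in technique chosen for brevity, not one of the structural, syntactic or semantic methods named above, and the stop word list and function names are assumptions.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for", "on", "by"}


def content_words(sentence):
    """Lowercase words in a sentence, excluding stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]


def summarize(text, max_sentences=1):
    """Keep the sentences with the highest average content-word frequency."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(i):
        words = content_words(sentences[i])
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    best = sorted(range(len(sentences)), key=lambda i: (-score(i), i))[:max_sentences]
    # Present the chosen sentences in their original order.
    return " ".join(sentences[i] for i in sorted(best))
```

For "Search needs taxonomy. Taxonomy guides search tuning. Lunch was late." a one-sentence summary returns "Search needs taxonomy." because its words recur most often, while the off-topic sentence about lunch is the first to be dropped.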
How the Pieces of the Stack Work Together
Let’s bring this back together to get a better understanding of how the pillars of the
stack work together. Figure A2 illustrates the high level view of how the different
pillars of the search and discovery technology stack interact and work together to
produce superior results for organisations. The focus for how the stack is deployed,
and which particular technology components are used, always springs from user
analysis, and an understanding of the queries that users make, against an
understanding of the target content, how it is structured, how and why it is
produced, how it is currently being used, and how it could potentially be used, if the
search and discovery functionality could be improved. In digital environments,
“target content” can be any combination of documents, web content, data,
dashboards, and people profiles.
The configuration of the elements of the Search and Discovery stack springs from an
understanding of user needs against an understanding of target content.
Determining how to use the pillars of the stack sits within a larger diagnostic and
decision-making framework.
1. What is the business problem (or set of business problems) you are trying to
solve? The answer to this question can often come through a knowledge
management diagnostics activity such as a knowledge audit, combined with
an understanding of the organisation’s business direction and strategy.
2. Who are the key user communities in relation to the business problem, and
how will you understand their working needs and opportunities for
improvement? (This can also come from a knowledge audit).
3. What is the target content (the knowledge resources that your target user
communities need to work with more effectively), where is it, how is it
structured and used currently? (Content analysis and modeling can help).
4. What tools from within the search and discovery stack can help to address
the proposed solution? (The focus of this paper)
5. What needs to change on the behavioural, process and governance levels for
the new solution to work? (For sustainable implementation planning, check
out The Knowledge Manager’s Handbook (Milton and Lambe, 2016) from a
knowledge management perspective or Maish Nichani’s series of articles at
http://olasearch.com/articles from a search design perspective).
Figure A2 Interactions between taxonomy management, search and text analytics
Search and Taxonomy
Search is the user-facing component of the stack. It mediates content to users.
Taxonomy management supplies known salient concepts and relationships to
search, and it tells search which query terms are synonyms of the same taxonomy
concepts. By “salient” we mean that the concepts are front of mind, action-related
concepts in the heads of users as they perform information and knowledge seeking
activities to complete work tasks. Taxonomies make search smarter by telling it
which concepts are important to users, how they are related to other concepts,
whether through hierarchical or other relationships. Hence search can exploit
taxonomies to help users to refine or expand their search, to follow meaningful
pathways and explore content. Taxonomies and ontologies enhance the relevance
and precision of search for users.
Because search is user facing, it gathers a lot of data about user needs and
behaviors. Frequency analysis of search queries tells the taxonomist a lot about how
users think about their content and their search requirements. Click-through activity
on search results pages also provides evidence about user perceptions of what may
or may not be useful. Hence search analytics provides a feedback loop to help the
taxonomy manager constantly refine and improve the taxonomy to user needs. This
is a positive feedback loop, where taxonomy provides relevance and precision to
search, and search provides a constant flow of real-time large-scale evidence on
what is actually relevant and useful to users at the present time. Search analytics can
also pick up emerging terms and concepts that are important to users, and propose
them as candidate terms for taxonomies and their supporting thesauri.
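The feedback loop from search analytics to the taxonomy can be sketched as a frequency analysis of the query log that surfaces popular queries not yet covered by the vocabulary. The function name, the `min_count` threshold and the exact-match comparison are assumptions for illustration; a real process would also compare against synonyms and would put the output in front of a taxonomist rather than adding terms automatically.

```python
from collections import Counter


def candidate_terms(query_log, taxonomy_labels, min_count=3):
    """Return frequent queries that match no existing taxonomy label."""
    known = {label.lower() for label in taxonomy_labels}
    counts = Counter(q.strip().lower() for q in query_log)
    return [
        (q, n) for q, n in counts.most_common()
        if n >= min_count and q not in known
    ]
```

If users search for "remote work" five times and the taxonomy only contains "Leave policy" and "Parking", the function proposes "remote work" as a candidate term, with its query count as supporting evidence.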
Without taxonomy management, search is relatively dumb, whether in terms of
what it can index, how it can help users make their queries, and how it can help
users act upon their results. Without search, taxonomy management lacks a
convincing medium to demonstrate its value, and it lacks a continuous flow of
evidence to maintain its relevance and usefulness to users.
Search and Text Analytics
Text mining can propose new concepts and new associations between content items
for the search manager to build into query processing and relevancy tuning.
Semantic analysis and summarization techniques can be used to provide “smart”
summaries of content items to be previewed in search snippets and in search results
pages, making it easier for users to assess whether or not the content is relevant to
their need, before they open the content item. Text analytics can extend the way in
which content can be pulled together in specialized search based applications.
All three pillars of the stack (search, taxonomy and text analytics) depend on
constant tuning of the infrastructure to deliver good search and discovery results.
Text analytics in particular depends on rules to guide how the content is processed
and tagged. These rules need maintenance and tuning. This tuning is necessary
because of the constant ongoing changes in content, context, priorities, users and
needs. Search analytics provides a large scale, real-time window into user needs and
behaviors, and so, as with taxonomy management, it provides the evidence base
needed to keep the text analytics stack tuned to deliver accurate and useful results
for users.
When there are large and complex knowledge bases, without text analytics, search
finds it difficult to pick out salient concepts related to content. As with taxonomy
management, text analytics depends on deep user and content knowledge to know
how to configure and tune its operations. Without search, text analytics finds it hard
to do this unless the team is prepared to invest in constant (and expensive) user
research and testing.
Text Analytics and Taxonomy Management
Taxonomies are designed, evidence-based artifacts. As such, they consistently lag
the pace of change. They are optimized to exploit known and documented concepts.
Text analytics techniques such as auto-categorization and text mining can supply
candidate terms and relationships to the taxonomy manager without the need for
comprehensive re-surveying of content and user needs. They can pick up emerging
concepts in the content, just as search analytics picks up and proposes emerging
concepts evidenced in search queries. So text analytics can act as a powerful
taxonomy enrichment tool.
Taxonomies and ontologies can guide and strengthen the application of text
analytics techniques such as auto-categorization (known entity extraction through
lookup), fact extraction through semantic analysis, and sentiment analysis (by
helping the sentiment engine resolve different synonyms to the same entity being
referenced in the content).
When there are large and complex knowledge bases, without text analytics,
taxonomy tags cannot easily and consistently be applied to large, diverse bodies of
content. Without taxonomy management, text analytics struggles to describe
content in terms that make sense to key users and in the context of user activities.
Machine tagging techniques alone do not describe content in terms that humans
intuitively understand, without some form of human-curated filter. Taxonomy
management provides this.
Implications
For too long organisations have been listening to the booming voice giving magical
reassurances from behind the curtain, and have neglected their own capabilities, and
the toolsets that exist in the open source arena (that commercial products also
exploit). As the world becomes more complex, solutions to search and discovery
problems will become more diverse. We need to become better journeymen, more
knowledgeable about the capabilities of the different toolsets that are available, so
that we can choose the right tools for any given problem. Some of them will be
commercial tools, acquired for their excellence in a specific set of functions. Some of
them will be open source tools, acquired and tuned to solve specific kinds of problem.
The age of single source “do it all” commercial platforms is over.
This paper has proposed that knowledge of the search and discovery technology
stack is an important capability to acquire. It needs to be at least adequate to the job
of evaluating needs and possibilities, and of asking vendors searching questions
about how their technology works, what it is good at, and what its limitations are. In
some cases, organisations will want to internalize a deeper technical capability in
working with a number of sets of tools.
There are other important search and discovery related capabilities that enterprises
need to acquire beyond a knowledge of the search and discovery technology stack –
for example, how user needs and content can be analysed to determine priorities
and opportunities, how content can be managed and enriched to provide optimal
user experiences, and how business problems can be transformed through design
processes into effective solutions for business needs, problems and opportunities.
I would like to acknowledge the following colleagues who have provided valuable
insights and advice informing this paper: Dave Clarke, Ahren Lehnert, Agnes Molnar,
Maish Nichani and Tom Reamy.
Patrick Lambe October 2016-January 2017