Behind the Curtain: Understanding the Search and Discovery Technology Stack

Patrick Lambe

The Lion thought it might be as well to frighten the Wizard, so he gave a large, loud roar, which was so fierce and dreadful that Toto jumped away from him in alarm and tipped over the screen that stood in a corner. As it fell with a crash they looked that way, and the next moment all of them were filled with wonder. For they saw, standing in just the spot the screen had hidden, a little old man, with a bald head and a wrinkled face, who seemed to be as much surprised as they were. The Tin Woodman, raising his axe, rushed toward the little man and cried out, "Who are you?"

"I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and I'll do anything you want me to."

L. Frank Baum. The Wonderful Wizard of Oz. Project Gutenberg (2008)

The Wizard of Oz captures the life trajectory of many technologies, and search is no exception: the technology starts in the public domain, it is appropriated and made mysterious (hidden behind the curtain) by commercial interests, then touted and overblown as magical. Eventually the curtain falls, and we can see what has been there all along, and work with it more intelligently as it is, knowing what it’s good for and knowing what it’s not good for. This is true of search, as it is true of auto-classification. Modern search technology has its roots in Vannevar Bush’s seminal 1945 paper “As We May Think” and its foundations were developed by research groups at MIT, Cornell and Harvard into the 1970s.
One of the consequences of hiding a technology behind the commercial curtain is that different technologies compete with each other for the same market, and therefore downplay their perceived competitors, or attempt to duplicate the functionalities of their perceived competitors. Search technology has competed with taxonomy work for years: “you don’t need taxonomies,” went the mantra of one infamous search vendor, “our software will understand your content for you.” The curtain fell with the unveiling of Google Knowledge Graph, when it became obvious just how much Google was investing in knowledge organisation structures (controlled vocabularies, taxonomies, ontologies, knowledge graphs) to make its search smarter. The superior performance of open source search engines (and their striking adoption rates) over commercial search engines in solving particular types of problem is, like Toto’s leap, another manifestation of the dropped curtain.

Machine classification of content has been another example. It is frequently touted by its commercial vendors as a single technology performing a single “we’ll tag your content for you” function where it is actually a bundle of quite distinct technologies, rooted in information retrieval research and search technologies developed in the 1970s and 1980s. Machine classification vendors have found themselves competing with search (“we can make your search smarter”) and taxonomies (“we can derive taxonomies automatically for you”). The curtain is now starting to fall with large organisations learning how to use open source toolkits such as the University of Sheffield’s GATE toolset to solve their search and discovery problems.
The “go it alone” propaganda of the technology vendors is damaging because it locks buyers into a single toolset designed for common denominators, with prohibitive customisation and maintenance costs. The “magic black box” propaganda blinds organisations to the need to build capabilities in a range of tools. And the “we do everything” propaganda hides the synergies and capabilities that can be attained by intelligently using search, machine classification and knowledge organisation technologies as a toolkit, with different toolsets to be used in different combinations to solve different kinds of search and discovery problem.

The purpose of this article is to explain the uses and interactions of three technologies that together support search and discovery, so that organisations can learn which tools to adopt in which combination to solve different problems, and they can see what kind of internal capabilities they need to build and maintain to continue solving those problems:

• Enterprise search
• Taxonomy and ontology management
• Text analytics to perform machine classification.

Enterprise search is a technology for delivering useful and relevant results to users exploring a large content base. As a technology it is relatively simple and not especially smart, and to deliver superior results in complex environments it needs to be supported by other technologies. “Complexity” can refer to the diversity of user communities, and to the diversity of the content itself. Search engines by themselves don’t understand anything. They crawl and index, process queries and serve up user analytics. To become smart, they need human designed curation tools (taxonomy and ontology tools) and tools for scaling the human tagging of content (text analytics tools).
Taxonomy and ontology management technologies support designing, delivering and maintaining knowledge structures and controlled vocabularies to enhance the search and discovery experience. These structures are based on an analysis of business and user needs together with an analysis of the target content for search and discovery, and they help to disambiguate concepts, capture variations in language and map them as synonyms, and identify relationships between concepts so that searches can be expanded or narrowed. The constraints of taxonomy and ontology management most often relate (a) to staying up to date with new and emerging needs of diverse user sets (which is where search analytics can help), and (b) to scaling the way that concepts can be identified in content and metadata enriched (which is where text analytics and machine classification can help).

Text analytics refers to a cluster of technologies that extract meaning from text and turn it into metadata. Some of the root technologies (crawling and indexing) are shared with search. These technologies are a response to the problems of human inconsistency (people are not consistent in how they apply tags to content) and of the scale of content to be tagged (when it would not be feasible to apply human-curated tagging to large amounts of content in a short period of time). Different technologies are used to supply different types of tagging operation, ranging from identification of known entities such as people, organisations and places, quantities such as money amounts, to the identification of abstract concepts that characterize what a document is “about” in ways that make sense to target users. The constraints of text analytics are (a) that it has a hard time figuring out without intensive human curation or very well structured training sets which concepts are most salient to the most people (which is where taxonomy and ontology management helps) and (b) like search, it requires constant tuning based on emerging user needs (which is where search analytics on user behaviours can help).

These technologies can be extremely powerful when used in combination in support of the search and discovery experience. However, it is important to understand what they are capable of, how they work, and the limitations of what they can do.
Figure A1 shows a detailed view of the three pillars of the search and discovery technology stack, how they work, and their key interactions. We begin at the top of the diagram. Since these technologies are supposed to be deployed in the service of user needs, decisions about their configuration and design need to be made in the context of known user needs, as well as a thorough understanding of the content to be covered. Note that this is a simplified conceptual framework, and it does not fully represent the complexities of the relationships between the pillars, nor all the overlaps between them.

Figure A1 Detail view of the search and discovery technology stack

“Content” can be structured, semi-structured or unstructured. It can be in the form of data, documents, web content, audiovisual files, discussions, or questions. “People” can also be considered a form of content since knowledgeable people may also be a useful result in a search on a particular topic. In a content management system a person profile can be used as a proxy for that person, and can be tagged with relevant subject tags so as to surface in search alongside other forms of content.

All three technologies in the stack depend upon a thorough survey of the target content to be covered, and preparation and structuring of the content stores so that the search, taxonomy management and text analytics technologies can access them and operate upon them.

Enterprise Search

Let’s begin with the central pillar, the enterprise search stack. There are six main components in the search stack.

Crawling

The first component is the crawling module. This is directed to content sources, and it crawls the content in preparation for indexing. You may want to be able to direct the crawler to specific parts of your content.
Content Processing

Content is then processed, using tools in common with initial processing of content by text analytics tools. A text file is created for parsing and indexing, and the content is tokenized, meaning each element is given a unique identifier so that it can be manipulated and located later on. This is also the point at which lemmatization (resolving variant grammatical forms such as tenses and plurals to the same grammatical root) and stemming (resolving different word endings to the same word-stem) occur. You may also want to remove “stop words” that will create meaningless noise (common operators such as if, but, and, the, a) from the indexing. Your stop words may need to be tuned. Sometimes stop words that are noisy in search indexes provide semantic clues in some rules-driven applications of text analytics.

Indexing

The indexing module creates term indexes from the parsed text. Each term is linked to its source content, and tokenization provides the context in which the search term appears. The pre-processed indexes provide speed in search. Crawling and pre-processing of large-scale content collections can be very slow and occupy significant processing capacity. Hence most indexing consists of incremental updates to the indexes by noticing, crawling and indexing new content whenever it is added.

It is very important to plan the configuration and setup of your search engine carefully in advance. Any major changes to the search engine configuration, or to the way it exploits taxonomy or text analytics, may require an expensive and slow complete re-crawl and re-index.
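The content processing and indexing steps described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular search engine works: the stop word list is just the examples given in the text, `naive_stem` is a hypothetical suffix stripper standing in for a real stemmer or lemmatizer, and all function names are invented for the example.

```python
import re
from collections import defaultdict

# Stop words taken from the examples in the text; a real list would be tuned.
STOP_WORDS = {"if", "but", "and", "the", "a"}


def naive_stem(token):
    """Toy suffix stripper (an assumption for illustration, not a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def process(text):
    """Lowercase, split into word tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def add_document(index, doc_id, text):
    """Incremental update: index one new document without touching the rest."""
    # Positions count the tokens that remain after stop word removal.
    for position, term in enumerate(process(text)):
        index[term].setdefault(doc_id, []).append(position)


def build_index(docs):
    """Map each term to the documents and token positions where it occurs."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        add_document(index, doc_id, text)
    return index


def search(index, query):
    """Return ids of documents containing every processed query term."""
    terms = process(query)
    if not terms:
        return set()
    hits = [set(index.get(term, {})) for term in terms]
    return set.intersection(*hits)
```

Because queries pass through the same `process` step as content, “indexing documents” and “indexes the document” resolve to the same index terms. It also shows why a change to the stop word list or the stemming rules forces a full re-index: the stored terms would no longer match what new queries produce.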
Auto-categorization

Some search engines also offer simple auto-categorization functions. In the absence of sophisticated text analytics tools, these are usually entity recognition based on lookups of reference term lists. These lists can be maintained within the search engine itself, or can be supplied from a taxonomy management system. This is referred to as “known entity extraction via lookup”. The entities you are interested in calling out (for example company names or country names) are compiled into lists, and are registered as significant entities if the main search indexes come across these terms or their synonyms. This can provide a very simple filtering capability based on the entity types you are interested in. However, this is a fairly basic technology. Just because a document mentions “Google” does not mean that the document is substantially “about” Google.

Query Processing

Much of the utility of a search engine lies in the sophistication with which it handles queries and query interfaces.

Query interfaces can include the ubiquitous search query box, browsing taxonomy structures where concepts are pre-linked to content through stored searches (requires integration with a taxonomy management module), and filtering of search results using metadata elements or taxonomy facets (again requires integration with taxonomy).

Many search engines offer auto-complete or auto-suggest of search queries, suggesting common search queries as the user types their query into the search box. Auto-complete can exploit common search queries, existing taxonomy or thesaurus terms (linked to results through stored searches), and/or keywords in context from metadata such as titles of high value content.

The relevance of search results for common queries can be tuned at the back-end by a search manager by altering the rules that calculate the likely relevance of a given piece of content for a given query. Where there is a known “best” piece of content for a given query, this can be promoted to the top of the results page. Relevance tuning can also be automated, by looking at click behaviors between query and clicking on results and promoting content that is frequently clicked on.

Relevance-tuning can exploit what is known about the users and their contexts.
Knowing that a user is searching from within a given organizational function enables rules to be written for how queries from those users should be handled, from zoning the search to known target content for that functional group, to handling relevance calculation based on metadata associated with those functions and users, and matching it with the indexed content. Content ranking mechanisms can also be pulled into the mix.

Search Analytics

Query handling and search tuning require sophistication in the search engine itself as well as a relatively sophisticated search management capability, i.e. the way that search is staffed, administered, configured and constantly tuned in relation to known user needs.

Search analytics is the module that powers this sophistication and tuning. Search engines have very powerful reporting capabilities, tracking how queries are conducted and user interactions with results pages, but as in all complex reporting capabilities, the smarts lie in how the reports are defined and configured. What will you track, how, and how frequently? How will you back up hunches gathered from the reports, and validate them with users? The initial search design strategy based on use cases and content analysis will give you a starting point. This will be refined by the insights you gain by observing user behaviors through the search roll-out and subsequent maintenance.

Taxonomy and Ontology Management

Like search, taxonomy and ontology management requires very robust analysis of how users need to interact with content, and analysis of the content itself to understand how it is described, organized and used. Taxonomy development and management is still best conducted as a work of skilled human design, but it can be substantially enhanced with the appropriate technologies. There are two broad functions within enterprise taxonomy and ontology management systems.
Taxonomy Building and Maintaining

Enterprise taxonomy management systems support the work of building and maintaining taxonomies by providing workflows and metadata around taxonomy concepts. For example, they can house source vocabularies of varying authority and structure that can be used to generate candidate terms for the taxonomy. There are approval workflows to track the approval process. There may be roles-based permissions for complex environments where multiple taxonomy professionals are involved in maintaining the vocabularies. Sources and usages of taxonomy terms can be tracked as attributes of the taxonomy concepts. Scope notes can be added to explicate taxonomy terms, and the revision history of terms and taxonomy facets can be tracked.

Reference Store

The second major function of an enterprise taxonomy management system is to act as a reference store for other systems, supplying taxonomy terms, relationships between terms, synonyms and controlled vocabularies to other systems. It can supply:

• controlled vocabulary terms to content management systems to support manual tagging of content;
• terms and relationships to support auto-categorization via known entity lookup, whether applied within a search engine or a text analytics engine;
• synonyms and relationships to search so that it can improve relevance and precision of results, and support expansion of search;
• taxonomy facets to search to support filtering of search results.

Because they are built to manipulate defined relationships between concepts and their attributes, taxonomy and ontology management systems can also act as brokers to other Linked Data or Linked Open Data sources, to call authority terms and relationships from elsewhere, to support search and/or to support enrichment of content via text analytics.
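As a rough illustration of the reference store supplying synonyms and relationships to search, the sketch below expands a query term into its preferred term, its synonyms and its narrower concepts. The `TAXONOMY` dictionary, its vehicle terms and the function names are all hypothetical; a real taxonomy management system would expose this data through its own export or API, and the sketch assumes a hierarchy without cycles.

```python
# Hypothetical taxonomy extract: preferred term -> synonyms and narrower terms.
TAXONOMY = {
    "vehicle": {"synonyms": ["automobile"], "narrower": ["car", "truck"]},
    "car": {"synonyms": ["motorcar"], "narrower": []},
    "truck": {"synonyms": ["lorry"], "narrower": []},
}


def preferred_term(term, taxonomy):
    """Resolve a query term to its taxonomy concept, or None if unknown."""
    term = term.lower()
    if term in taxonomy:
        return term
    for concept, entry in taxonomy.items():
        if term in entry["synonyms"]:
            return concept
    return None


def expand_query(term, taxonomy, include_narrower=True):
    """Return the term plus its synonyms and, optionally, narrower concepts."""
    concept = preferred_term(term, taxonomy)
    if concept is None:
        return [term.lower()]  # not in the taxonomy: search it as typed
    expanded = [concept] + list(taxonomy[concept]["synonyms"])
    if include_narrower:
        # Assumes the hierarchy has no cycles.
        for child in taxonomy[concept]["narrower"]:
            expanded += expand_query(child, taxonomy, include_narrower)
    return expanded
```

Calling `expand_query("lorry", TAXONOMY)` maps the synonym back to its concept and returns both forms, while turning `include_narrower` off keeps the search narrow. Which direction to expand in is a search design decision, not something the taxonomy decides for you.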
Finally, enterprise taxonomy and ontology management systems are built to act as taxonomy/ontology hubs catering to multiple systems and audiences. They are capable of interacting differently with different platforms and audiences. For example, different versions of the same core taxonomy can be presented differently to different audiences, depending on their needs. Different vocabularies and taxonomy services can be supplied to different systems based on their content and uses.

Text Analytics

Text analytics refers to a quite diverse family of computer-assisted techniques for assigning meaning to content. Different modules can be used to serve different purposes.

Content Processing

There are some basic content processing functions that text analytics has in common with enterprise search: preparation of text for parsing, tokenization, lemmatization, stemming, stop words.

Syntactic and Semantic Pre-Processing

Several functional applications of text analytics depend on two foundational processing techniques.

The first, syntactic analysis, analyzes the language structure of text grammatically, using very large open source training sets. Sentence splitters break up sentences into parts of speech, distinguishing nouns and noun phrases from verbs and operators, and so on, and syntactic analyzers create a logical representation of the sentences.

Semantic analysis attempts to infer meaning from analyzed texts. Like syntactic analysis it depends on reference training sets. It enables computers to infer meaningful assertions and facts from bodies of text, and underpins the base technologies behind machine translation. Syntactic analysis is based on known rules of grammar. Semantic analysis has a much more challenging task, which is to infer human-understandable meanings. Not all grammatically correct sentences are meaningful to humans: "Colorless green ideas sleep furiously" is a famous example presented by Noam Chomsky of a syntactically correct sentence that is semantically incoherent.

Syntactic and semantic techniques underpin the identification of “entities” within content. In the context of text analytics, an “entity” can be a person, a thing, a place, a time, a topic. It is any distinct concept that can be identified from text.
Identification and extraction of entities can be done most simply by looking up reference lists, as in auto-categorization of known entities provided by some search engines.

However, with syntactic and semantic analysis text analytics can also identify unknown (unpredicted) entities for which you don’t yet have examples in your lists. For example in English proper names are capitalized, and personal names have known forms (that also vary by culture). Hence candidates for persons mentioned in text could be picked up by a rule that says “look for one or more capitalized subjects or objects of sentences, and then look for lemmatized verbs from the following list ‘say’ … ‘reply’ … etc”.

Such rules are based on common usage as represented in their training sets and are typically provided as standard. These rules will typically need to be tested and refined based on local usage. For example, we found that one standard tool did not pick up country of origin in content from an immigration and customs authority, because it was using simple term-phrase lookup. However, customs officers at checkpoints habitually used the convention “[Country]-registered vehicle”. The hyphenation and extension of the term disrupted the simple lookup function. Notice that the solution to this problem is a combination of techniques, likely supplementing known entity lookup with syntactically driven entity extraction rules. The practice of using different techniques in combination is characteristic of text analytics approaches.

With these foundational capabilities, there are many specialized applications of text analytics, not all of which will be suitable to every purpose.

Auto-categorization

The ability to identify both known and unknown entities/concepts is a significant improvement in auto-categorization capabilities. Lookup lists for known entities/concepts can be supported by a taxonomy management system, and strengthened through the supply of synonyms.

Conversely, through the judicious use of rules, text analytics can discover entities/concepts mentioned in text that are not yet in the taxonomy or controlled vocabularies, and can be considered as candidate terms.
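The “[Country]-registered vehicle” case can be reproduced in miniature. The sketch below is only a guess at the mechanics: how the standard tool in question actually tokenized text is not known, so the whitespace-based `lookup_entities` is an assumption chosen to show the same kind of failure, and the country list, the regular expression and the function names are all invented for the example. Combining the lookup with a pattern rule recovers the missed entity and also surfaces names that are not yet in the list, as candidate terms.

```python
import re

# Hypothetical reference list, as a taxonomy system might supply it.
KNOWN_COUNTRIES = {"malaysia": "Malaysia", "thailand": "Thailand"}

# Pattern rule for the "[Country]-registered" convention (single-word names only).
REGISTERED_RULE = re.compile(r"\b([A-Z][a-z]+)-registered\b")


def lookup_entities(text):
    """Known entity extraction via lookup on whitespace-delimited tokens."""
    found = []
    for token in text.split():
        key = token.strip(".,;:").lower()
        if key in KNOWN_COUNTRIES:
            found.append(KNOWN_COUNTRIES[key])
    return found


def rule_entities(text):
    """Capitalized words that appear in the '-registered' construction."""
    return [match.group(1) for match in REGISTERED_RULE.finditer(text)]


def extract_countries(text):
    """Combine both techniques; names not in the list become candidates."""
    known, candidates = [], []
    for name in lookup_entities(text) + rule_entities(text):
        target = known if name.lower() in KNOWN_COUNTRIES else candidates
        if name not in target:
            target.append(name)
    return known, candidates
```

On a sentence such as “A Malaysia-registered vehicle arrived from Thailand”, the lookup alone finds only Thailand, because the hyphenated token never matches the list. The rule alone would miss Thailand. Used together they find both, and an unlisted name in the same construction is returned separately for a human to review.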
Text Mining

Text mining refers to a number of techniques for analyzing text based on statistical algorithms. In very broad terms, large bodies of text are submitted to iterative operations using algorithms to “sort” the text into clusters or “buckets” that are internally statistically similar, and statistically dissimilar to the other clusters or buckets.

Statistical techniques are also used to determine the words or word-phrases that are most likely to uniquely represent their bucket compared to other buckets. These word strings are often called “topic models”. The likelihood of specific terms occurring in proximity to each other can also be calculated.

Text mining is sometimes touted as a fully-automated alternative to describing and tagging content. In practice however, the clustering techniques use fuzzy algorithms, and the end “described” state is highly influenced by which clustering pathways are taken at the beginning of the operation. Text mining techniques suffer from non-reproducibility (the same series of operations on the same content can produce different results on different occasions) and from non-comprehensibility (the topic models generated often do not match how users themselves describe the content). So in practice, applications of text mining are heavily tuned by their human operators, and it is not always transparent how a given set of results is achieved.

Text mining also suffers from endogeneity, meaning that the tags that are extracted to describe similar buckets of content are wholly derived from that content set (or when training sets are used for comparison, from the target content + the training set). Endogeneity is a problem when you want to survey diverse collections of content and map meanings consistently across them. In that case, a taxonomy or ontology that crosses those content sets and audiences provides an exogenous reference point that can broker meanings across different content sets and user audiences.
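To make the idea of terms that “uniquely represent their bucket” concrete, here is a deliberately simple sketch. It assumes the documents have already been sorted into buckets, and it scores each term by the difference between its relative frequency inside the bucket and its relative frequency in all the other buckets. That scoring rule, the function name and the tie-breaking by alphabetical order are assumptions made for illustration; real text mining tools use more elaborate statistics.

```python
from collections import Counter


def distinctive_terms(buckets, top_n=2):
    """For each bucket, rank terms by how much more often they occur there."""
    counts = {
        name: Counter(word for doc in docs for word in doc.lower().split())
        for name, docs in buckets.items()
    }
    result = {}
    for name, own in counts.items():
        # Pool the term counts of every other bucket for comparison.
        others = Counter()
        for other_name, other_counts in counts.items():
            if other_name != name:
                others.update(other_counts)
        own_total = sum(own.values()) or 1
        others_total = sum(others.values()) or 1
        scores = {
            term: n / own_total - others[term] / others_total
            for term, n in own.items()
        }
        # Highest score first; ties broken alphabetically so output is stable.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result[name] = ranked[:top_n]
    return result
```

The sketch also makes the endogeneity point visible: every label it can produce is a word that already occurs in the buckets it was given. If users call the same material by a word that never appears in the content, no amount of counting will surface it.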
However, text mining can be a powerful tool used in combination with search and taxonomy management. Its statistical techniques to cluster based on similarity can propose new categorizations and potentially relevant relationships between topics to the taxonomy manager. Its ability to extract topic models can be useful in automated summarization techniques.

Semantic Analysis

We have already discussed some of the challenges of complexity facing semantic analysis, not least the heavy influence of context and culture (factors external to the content) on meaning. In practical text analytics applications, semantic analysis relies on a number of techniques and not just computational semantic analysis. For example, one of the main applications of semantic analysis is to extract assertions or facts from complex documentation. If semantic analysis has access to ontologies via a taxonomy management system it can then enrich its inferences based on known and validated relationships between entities.

Sentiment Analysis

Sentiment analysis is a very specialized application of text analytics, often used in marketing, social media analysis, and qualitative feedback. It looks at the semantics of positive, negative and neutral expressions, in association with known (extracted) entities. Again it is highly contextual, and subject to over-simplification: for example, sarcastic language can be superficially positive, but the linguistic markers of sarcasm are often not very obvious. As with most text analytics techniques, the rules and algorithms need to be tested, validated and constantly supervised for accuracy.

Summarization

Automatic summarization often exploits known document or data structures (it knows where to look for key content), and it may exploit syntactic analysis (e.g. removing extraneous “padding” language such as adjectives and adverbs). It may use semantic analysis to extract key facts from documents.
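The simplest form of structure-driven summarization, “knowing where to look for key content”, can be sketched as taking the opening sentence of each paragraph. This is a minimal sketch under stated assumptions: paragraphs are separated by blank lines, sentences end in a full stop, question mark or exclamation mark followed by whitespace, and the function name is invented. It does no syntactic or semantic analysis at all, and abbreviations such as “e.g.” would trip up the sentence splitter.

```python
import re

# Split after sentence-ending punctuation that is followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def lead_summary(document, max_sentences=3):
    """Return the first sentence of each of the first few paragraphs."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    return [SENTENCE_END.split(p)[0] for p in paragraphs[:max_sentences]]
```

A summary built this way is only as good as the authors’ habit of putting the key point first, which is one reason why summarization rules, like other text analytics rules, need to be tested against the actual content.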
How the Pieces of the Stack Work Together

Let’s bring this back together to get a better understanding of how the pillars of the stack work together. Figure A2 illustrates the high level view of how the different pillars of the search and discovery technology stack interact and work together to produce superior results for organisations. The focus for how the stack is deployed, and which particular technology components are used, always springs from user analysis, and an understanding of the queries that users make, against an understanding of the target content, how it is structured, how and why it is produced, how it is currently being used, and how it could potentially be used, if the search and discovery functionality could be improved. In digital environments, “target content” can be any combination of documents, web content, data, dashboards, and people profiles.

The configuration of the elements of the Search and Discovery stack springs from an understanding of user needs against an understanding of target content. Determining how to use the pillars of the stack sits within a larger diagnostic and decision-making framework.

1. What is the business problem (or set of business problems) you are trying to solve? The answer to this question can often come through a knowledge management diagnostics activity such as a knowledge audit, combined with an understanding of the organisation’s business direction and strategy.
2. Who are the key user communities in relation to the business problem, and how will you understand their working needs and opportunities for improvement? (This can also come from a knowledge audit).
3. What is the target content (the knowledge resources that your target user communities need to work with more effectively), where is it, how is it structured and used currently? (Content analysis and modeling can help).
4. What tools from within the search and discovery stack can help to address the proposed solution? (The focus of this paper)
5. What needs to change on the behavioural, process and governance levels for the new solution to work? (For sustainable implementation planning, check out The Knowledge Manager’s Handbook (Milton and Lambe, 2016) from a knowledge management perspective or Maish Nichani’s series of articles at http://olasearch.com/articles from a search design perspective).

Figure A2 Interactions between taxonomy management, search and text analytics

Search and Taxonomy

Search is the user-facing component of the stack. It mediates content to users. Taxonomy management supplies known salient concepts and relationships to search, and it tells search which query terms are synonyms of the same taxonomy concepts. By “salient” we mean that the concepts are front of mind, action-related concepts in the heads of users as they perform information and knowledge seeking activities to complete work tasks. Taxonomies make search smarter by telling it which concepts are important to users, how they are related to other concepts, whether through hierarchical or other relationships. Hence search can exploit taxonomies to help users to refine or expand their search, to follow meaningful pathways and explore content. Taxonomies and ontologies enhance the relevance and precision of search for users.

Because search is user facing, it gathers a lot of data about user needs and behaviors. Frequency analysis of search queries tells the taxonomist a lot about how users think about their content and their search requirements. Clickthrough activity on search results pages also provides evidence about user perceptions of what may or may not be useful. Hence search analytics provides a feedback loop to help the taxonomy manager constantly refine and improve the taxonomy to user needs. This is a positive feedback loop, where taxonomy provides relevance and precision to search, and search provides a constant flow of real time large-scale evidence on what is actually relevant and useful to users at the present time. Search analytics can also pick up emerging terms and concepts that are important to users, and propose them as candidate terms for taxonomies and their supporting thesauri.
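The feedback loop from search analytics to the taxonomist can be sketched as a frequency analysis of the query log, filtered against the terms the taxonomy already knows. The log format (a plain list of query strings), the `min_count` threshold and the function name are assumptions for illustration; real search engines expose query logs in their own reporting formats.

```python
from collections import Counter


def candidate_terms(query_log, known_terms, min_count=2):
    """Frequent queries that match no existing taxonomy term or synonym."""
    counts = Counter(q.strip().lower() for q in query_log if q.strip())
    known = {term.lower() for term in known_terms}
    return [
        (query, n)
        for query, n in counts.most_common()
        if n >= min_count and query not in known
    ]
```

The output is a list of proposals, not of decisions. Whether a frequent query becomes a new concept, a synonym of an existing one, or nothing at all is still a judgment for the taxonomy manager.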
Without taxonomy management, search is relatively dumb, whether in terms of what it can index, how it can help users make their queries, and how it can help users act upon their results. Without search, taxonomy management lacks a convincing medium to demonstrate its value, and it lacks a continuous flow of evidence to maintain its relevance and usefulness to users.

Search and Text Analytics

Text mining can propose new concepts and new associations between content items for the search manager to build into query processing and relevancy tuning. Semantic analysis and summarization techniques can be used to provide “smart” summaries of content items to be previewed in search snippets and in search results pages, making it easier for users to assess whether or not the content is relevant to their need, before they open the content item. Text analytics can extend the way in which content can be pulled together in specialized search based applications.

All three pillars of the stack (search, taxonomy and text analytics) depend on constant tuning of the infrastructure to deliver good search and discovery results. Text analytics in particular depends on rules to guide how the content is processed and tagged. These rules need maintenance and tuning. This tuning is necessary because of the constant ongoing changes in content, context, priorities, users and needs. Search analytics provides a large scale, real-time window into user needs and behaviors, and so, as with taxonomy management, it provides the evidence base needed to keep the text analytics stack tuned to deliver accurate and useful results for users.

When there are large and complex knowledge bases, without text analytics, search finds it difficult to pick out salient concepts related to content. As with taxonomy management, text analytics depends on deep user and content knowledge to know how to configure and tune its operations. Without search, text analytics finds it hard to do this unless the team is prepared to invest in constant (and expensive) user research and testing.

Text Analytics and Taxonomy Management

Taxonomies are designed, evidence-based artifacts. As such, they consistently lag the pace of change. They are optimized to exploit known and documented concepts.
Text analytics techniques such as auto-categorization and text mining can supply candidate terms and relationships to the taxonomy manager without the need for comprehensive re-surveying of content and user needs. They can pick up emerging concepts in the content, just as search analytics picks up and proposes emerging concepts evidenced in search queries. So text analytics can act as a powerful taxonomy enrichment tool.

Taxonomies and ontologies can guide and strengthen the application of text analytics techniques such as auto-categorization (known entity extraction through lookup), fact extraction through semantic analysis, and sentiment analysis (by helping the sentiment engine resolve different synonyms to the same entity being referenced in the content).

When there are large and complex knowledge bases, without text analytics, taxonomy tags cannot easily and consistently be applied to large, diverse bodies of content. Without taxonomy management, text analytics struggles to describe content in terms that make sense to key users and in the context of user activities. Machine tagging techniques alone do not describe content in terms that humans intuitively understand, without some form of human-curated filter. Taxonomy management provides this.

Implications

For too long organisations have been listening to the booming voice giving magical reassurances from behind the curtain, and have neglected their own capabilities, and the toolsets that exist in the open source arena (that commercial products also exploit). As the world becomes more complex, solutions to search and discovery problems will become more diverse. We need to become better journeymen, more knowledgeable about the capabilities of the different toolsets that are available, so that we can choose the right tools for any given problem. Some of them will be commercial tools, acquired for their excellence in a specific set of functions. Some of them will be open source tools, acquired and tuned to solve specific kinds of problem. The age of single source “do it all” commercial platforms is over.
This paper has proposed that knowledge of the search and discovery technology stack is an important capability to acquire. It needs to be at least adequate to the job of evaluating needs and possibilities, and of asking vendors searching questions about how their technology works, what it is good at, and what its limitations are. In some cases, organisations will want to internalize a deeper technical capability in working with a number of sets of tools.

There are other important search and discovery related capabilities that enterprises need to acquire beyond a knowledge of the search and discovery technology stack: for example, how user needs and content can be analysed to determine priorities and opportunities, how content can be managed and enriched to provide optimal user experiences, and how business problems can be transformed through design processes into effective solutions for business needs, problems and opportunities.

I would like to acknowledge the following colleagues who have provided valuable insights and advice informing this paper: Dave Clarke, Ahren Lehnert, Agnes Molnar, Maish Nichani and Tom Reamy.

Patrick Lambe
October 2016-January 2017