USOO8880389B2 (12) United States Patent (10) Patent N0.: Io?nov (54) (45) Date of Patent: COMPUTER IMPLEMENTED SEMANTIC SEARCH METHODOLOGY, SYSTEM AND 7,769,742 B1* 8,005,841 B1 * 8 020 206 B2 * COMPUTER PROGRAM PRODUCT FOR (76) US 8,880,389 B2 ’ ’ 8/2010 Brawer et a1. ............... .. 707/709 8/2011 Walsh et a1. ..... .. 707/739 9/2011 Hubbard et a1. 726/22 8,245,135 B2* 8/2012 DETERMINING INFORMATION DENSITY IN 2001/0049700 A1* 12/2001 TEXT 2002/0052894 2002/0069100 2003/0182124 2003/0182631 A1* A1* AI* A1* 5/2002 6/2002 9/2003 9/2003 2004/0267730 Inventor: Igor Io?nov, Brooklyn, NY (US) - - ( * ) Notice: - - - Nov. 4, 2014 - Cal et a1. ...... .. 715/254 Ichikura ........... .. 707/512 Bourdoncle et a1. 707/513 Arberman 705/10 Khan ~~~~~ ~~_~~~_~ ~~~~~~~~~~~~ ~~ 704/2701 Tsochantar1d1s et a1. 715/531 A1* 12/2004 Subject to any disclaimer, the term ofthis Zoos/0015357 A1,, 100% Patent 15 extended or adJusted under 35 U.S.C. 154(b) by 564 days. 2005/0027708 A1* 2/2005 Mueller et a1. . 707/10 2005/0278309 A1* 12/2005 Evans et a1. ..................... .. 707/3 C (21) Appl. N0.: 13/315,298 D ' t l. ....... .. 707/3 707/1 t' d ( on “me ) OTHER PUBLICATIONS (22) Filed: Dec. 9, 2011 (65) Alonso, O., Germ., M. and Baeza-Yates, R., Nov. 2009. “Clustering and Exploring Search Results using Timeline Constructions”. Prior Publication Data Us 2013/0151236 A1 Jun. 13, 2013 CIKM’09’ 974°“ (51) Int Cl (Continued) G06F 17/20 (2006-01) Primary Examiner * Edgar Guerra-Erazo i 5‘; (52) (74) Attorney, Agent, or Firm * Steven Rinehart G06F 17/21 (2006.01) G10L 21/00 (2013.01) (57) _ us CL A method, computer program product and system are dis USPC ______________ __ 704/9. 704/1. 7040. 7046. 704/4. closed for determining the semantic density of textualized 704/5. 704/6. 702‘”. 70mg. 702V“). 7b4/270f , (58) ABSTRACT I 704,070 1 7704/27 digital media (a measure of how much information is con 70 4 078’ veyed in a sentence or clause relative to its length). The more Field of Classi?cation Search ’ ’ USPC 704/1i9 270 270 1 275 278 semantically dense text is, the more information it conveys in a given space. Users input atopic, atimeline, and one or more See ap'tlgr C 0mp1 a; seagch hist’ory ’ target web media sources for analysis. Text in the target media ' (56) sources is deconstructed to determine density, and a density References Cited rating assigned to the web media source. Over time, users can track trends in the density of text media relative to a given topic, and determine how much information is being con veyed in connection with the topic, such as a political cam U.S. PATENT DOCUMENTS 6,081,774 A * 6,154,213 A * 6,675,159 paign. Line graphs, pie charts, and other time-elapsed output 6/2000 de Hita et al. ................... .. 704/9 Renni son et a1. .. 11/2000 B1* 1/2004 7,117,207 B1* 7,260,573 B1* 7,707,161 B2* 10/2006 8/2007 4/2010 Lin et a1. . . . . . . . . . . . . . . . . . . . . . . .. graphic representations of the semantic density are generated 1/1 and rendered for the user. Kerschberg et al. . ........ .. 1/1 Jeh et a1. ..................... .. 707/708 Hall et a1. ............ .. 707/999.103 20 Claims, 8 Drawing Sheets 702 704 700 c dependencles exist between 709 clauses? 72° Has Bantam changed from prevlous scan? Crsale one or "WIS ASG graphs linking clauses "0 graphically Yes Parse sentences In textuallzed data on each web media SOUI'CE ldemlfy clauses wllhln the sentences Store semences and clauses In 730 Archive semantic density Ci lculabe semantic densin score (or senlsncs lnlunnellen in memory Calculate semamlc density score ier web medla source Render llmeelepeed graphls) US 8,880,389 B2 Page 2 (56) References Cited OTHER PUBLICATIONS U.S. PATENT DOCUMENTS 2006/0026013 2007/0038610 2007/0073745 2008/0034058 2008/0162498 2008/0263038 2008/0288442 2009/0138565 2010/0268700 2010/0318526 2013/0091090 A1 * 2/2006 A1* 2/2007 Omoigui .. A1* A1* 3/2007 2/2008 A1* A1* Cui, W., Liu, S., Tan, L., Shi, C., Song,Y., Gao, Z., Qu, H., andTong, X., Dec. 2011. “Text?ow: Towards better understanding of evolving topics in text”. I EEE Trans. Vis. Comput. Graph. 17, 12 (2011 ), Kraft ..................... .. Scott et al. ............ .. .... .. Korman et a1. ....... .. .... .. 707/100 2412-2421~* 709/217 Liu, S., Zhou, M., Pan, S., Qian, W., Cai, W., Lian, X., Nov. 2009. “Interactive Topic-based Visual Text Summarization and Analysis”, 7/2008 Omoigui 10/2008 Judge et al. ...... .. A1* 10/2010 Wissner et al 707/2 ‘ 709/206 , 707/706 A1* . 707/749 A1 * 11/2008 Feigenbaum et al. A1 * A1 * 5/2009 Shiff et a1. 12/2010 Nakazawa et al. . 4/ 2013 Spivack et al. ........ .. .... .. 707/608 Wei, F., et al., Jul. 2010. “Tiara: AVisual Exploratory Text Analytic System”. KDD’IO, Jul. 25-28, 2010, pp. 1-10.* * cited by examiner US. Patent Nov. 4, 2014 Sheet 1 0f 8 US 8,880,389 B2 100 K Server 116a Media content\\ Indexing —1183 server @ Internet User 112a m Web server ? DBMS M User 112b Persistent Storage 106 FIG. 1 Keywor S 108 Media content \ 118b Server 116b e 1c;o()ntent \ US. Patent Nov. 4, 2014 Sheet 2 0f8 US 8,880,389 B2 Storage User Keyword(s) 10ia 112g Timeline 202a URL 2 4a URL 4b E Member Structure M User 112g Registration Q Keyword(s) 108b Timeline 2b URL 2 40 Web content m Web media Web media Sentences 21 Oa-z m m Clauses 212a-z Stats 214a-f Density ratings m User M Density m Graph m Stats m Keywords M AST graphs M 200 FIG. 2 US. Patent Nov. 4, 2014 Sheet 3 0f8 US 8,880,389 B2 300\ A -> COMPILING A LIST OF WEB MEDIA SOURCES @ 1 —> QUERYING THE WEB MEDIA SOURCES @ -> —> STORING THE TEXT IN COMPUTER READABLE MEMORY w —> —> IDENTIFYING SENTENCES WITHIN THE WEB MEDIA @ —> —> STORING THE SENTENCES IN AN ARRAY m —> —> IDENTIFYING CLAUSES WITHIN THE SENTENCES m —> —> STORING THE CLAUSES WITHIN AN ARRAY m —> —> QUANTIFYING SEMANTIC ELEMENTS IN THE CLAUSES m —> —> CALCULATING SEMANTIC DENSITY SCORE m —> —> QUERYING SENTENCES FOR KEYWORD(S) i —> -> QUERYING WEB MEDIA SOURCES FOR INSTANCES OF THE KEYWORD(S) ? _’ _> RENDERING A GRAPHIC REPRESENTATION OF SEMANTIC DENSITY g _’ 4 V FIG. 3 US. Patent Nov. 4, 2014 Sheet 4 0f8 US 8,880,389 B2 I System 1 I 400 I | : Web server 102 Indexing server 120 RDBMS 104 | : Sentence Synonym Text Acceptor I I I scanner module 414 determiner module 416 querier module 418 module 420 l l | | | Render Density ASG Depdncy module 422 storer module 424 creator module 426 analyzer module 428 | l l | | | Block storer module |l 420 Reception module 432 l : l l | I l l l | | | | l l | | URL prompter module Calculator module 436 l 44 I ll | l FIG. 4 Timeline 220 US. Patent Nov. 4, 2014 500\ /// / / // sentence US 8,880,389 B2 \\\ Array 502 strlng 210 ?x \ \ \ / / / / / / /’ Sheet 5 0f8 ‘ / / A \ \ \ ‘ K, ’ \ \ \\ \A \IndepenQerTf \ Ereposition Parenthet)‘ Clause Abpqsiti Modi?e?~ .End Code \ ciau‘ée \58621 4 ’ 504 \ \ \ \\‘ \ \ Verb ‘Noun \ 514 91 6 \ \ 510a 508a/ I / / / \ \ \ \ -ca| / \ \ I \ 518a / \ I \ \ ‘512 \ z \\ \ \ \ \ \\ / Adverb / on \ \ \51 00 510b Verb\ \ 520a \\ Subject \ \ 522a \ \ / \ \ FIG. 5 US. Patent Nov. 4, 2014 Sheet 6 0f8 US 8,880,389 B2 600 VVebpage VVebpage 118a 118b VVebpage VVebpage 1189 1180 Search Engme @ VVebpage VVebpage 118f 118d VVebpage 118g VVebpage FIG. 6 118h US. Patent Nov. 4, 2014 Sheet 7 0f8 704 702 Promptand userend for p0|nts timeline start 700 : - . 706 US 8,880,389 B2 Start / + Prompt user for topic 708 \A 710 o dependencies exist between ¢ clauses? Scan web media content Has content 720 Changed from Create one or more ASG \A graphs linking clauses previous scan? graphlcauy Yes 722 712 ¢ \/\ Increase one or more \A Parse sentences in textualized dependency identi?ers data on each web media source 71 4 \A 716 l, 724 Identify clauses within the or topic synonyms sentences found in sentence? l \A Store sentences and clauses in _l\ arrays 726 \/\ 730 \A Archive semantic density information in memory v ‘ Calculate semantic density score for sentence 728 \A + Calculate semantic density score for web 732 r \/\ Render time-elapsed graph(s) media source 734 A End t FIG. 7 US. Patent Nov. 4, 2014 Sheet 8 0f8 US 8,880,389 B2 Politics [density of discussion 41.1%] / 800 High Medium Low 12/12/2009 - 12/12/2010 - 12/12/2011 850 Candidate XYZ [density of discussion 32.4%] / High Medium Low 12/12/2009 - 12/12/2010 FIG. 8B - 12/12/2011 US 8,880,389 B2 1 2 COMPUTER IMPLEMENTED SEMANTIC operations of the computer program product comprising: SEARCH METHODOLOGY, SYSTEM AND compiling a list of web media sources for analysis in com puter readable memory; querying the web media sources for a block of text; storing the block of text in volatile computer readable memory; identifying sentences within the block of text; storing the sentences as strings within an array; scanning COMPUTER PROGRAM PRODUCT FOR DETERMINING INFORMATION DENSITY IN TEXT each sentence to quantify one or more of the following: a BACKGROUND OF THE INVENTION number of words in the sentence; a number of prepositions, postpositions, adjectives, adverbs, verbs, nouns, and gram 1. Field of the Invention This invention relates to information retrieval systems, and more particularly relates to a computer implemented method matical conjunctions, by referencing words within the sen tence with a dictionary in computer readable memory; a num ber of dependent clauses in the sentence; a number of independent clauses in the sentence; a number of ellipsis, a number of dashes (both en dashes and em dashes), and a number of commas, semicolons, and colons; a number of and system for determining the information density of a target body of digital text. 2. Description of the Related Art With the emergence of the Internet has come access to almost limitless amounts of information. Users typically seek subjects and predicates in the sentence; a number of apposi access to this information with computer in logical commu tions in the sentence; a number of syllables in each word of the sentence by cross-referencing each word with the dictio nary in persistent storage; and a number of alphanumeric nication with an information-retrieval system like Google® or other search engines. These search engines typically com prise a graphical user interface for accepting keywords used 20 characters in the sentence; storing each quanti?ed number in a persistent computer readable database with a time-stamp to identify relevant information, typically in the form of online electronic documents. identifying the date the number(s) were quanti?ed; calculat Typical search engines rank the importance of a document by the number of instances that the keyword appears in the body of text forming the document, the number of backlinks ing a semantic density score for each web media source in the list of web media sources, wherein the semantic density score is a function of the quanti?ed numbers for each sentence in 25 the web media source; and storing the semantic density score in a persistent computer readable database, the score exclu sively associated with the web media content from which the to the document, and the importance of the documents com prising the backlinks. This is problematic because many documents included in a given set of search results contain only super?cial references to the keywords, and users are left to manually ?lter through irrelevant results in search for the documents conveying more denser amounts of information. There exists no ef?cient means in the art of valuing the context in which keywords are appearing, or of determining the information density of sentences and clauses comprising the keywords and it synonyms. Methods, computer program products and systems are lacking in the art which determining the semantic density of textualized digital media. Semantic score was derived. 30 a topic for semantic density determination; prompting the user with a webpage to input a timeline for semantic density determination; prompting the user with a webpage to input 35 URLs for one or more web media sources; determining syn onym(s) for the topic by referencing a thesaurus in computer readable memory; querying textualized media on the web media source for instances of the topic; querying textualized density is a measure of how much information is conveyed in a sentence or clause relative to its length in a given block of text. The more semantically dense text is, the more informa tion it conveys in a given space. In some embodiments, the computer program product of further comprises: prompting a user with a webpage to input media on the web media source for instances of synonyms of 40 The semantic density of text with regard to a particular the topic identi?ed by referencing the dictionary; storing semantic density ratings for sentences comprising the topic; storing semantic density ratings for sentences comprising the topic is a useful metric to users desiring to measure trends in synonyms of the topic; and rendering a graph on a computer online chatter, news, such as a political campaign. The present invention bene?cially teaches a unique com display showing elapsed time across one axis, the graph showing a plurality of semantic density ratings for the web 45 media identi?ed by the user. The web media sources may be queried at predetermined intervals of time. The list of web media sources may consist of web media puter implemented methodology for determining the seman tic word density of bodies of text in digitized documents which overcomes prior shortcomings in art. SUMMARY OF THE INVENTION 50 sources manually identi?ed by one or more users to the web site. The sentences may be identi?ed within the block of text as being a string of text preceding a period immediately From the foregoing discussion, it should be apparent that a need exists for a method, system, and computer program 55 followed by a space. Therein sentences may be identi?ed within the block of text as being a string of text following a period (.) immedi 60 ately followed by a space ( ) which string of text also precedes a period (.) immediately followed by a space ( ). The computer program product may further comprise stor ing a dictionary in computer readable memory, the dictionary identifying words of a language, the dictionary characterizing product for determining the semantic word density of a body of text. The present invention has been developed in response to the present state of the art; and, in particular, in response to the problems and needs in the art that have not yet been fully solved by currently available methods, systems, computer program products and apparatii, which invention overcomes many or all of the above-discussed shortcomings in the art. Accordingly, the present invention has been developed to provide a method, computer program product, and system for each word as one or more of a noun, verb, adverb, adjective, determining semantic word density. lables in each word. In some embodiments, a dependent clause in the sentence A computer program product is disclosed, comprising a computer readable medium, having computer usable pro gram code executable to perform operations for determining a semantic density of textualized digital web media, the and preposition, the dictionary identifying a number of syl 65 may be identi?ed as a string of text not comprising a verb within two commas. An independent clause in the sentence may be identi?ed as a string of text comprising a verb within US 8,880,389 B2 3 4 two commas. The semantic density score may further be a ing a semantic density of textualized digital web media, the function of the number of sentences appearing in the block of text. A database query may be generated using the topic and prompting a user with a graphic user interface to input one or synonyms, wherein the sentences are stored in a database in more keyword(s) for semantic density determination; computer readable memory and queried with the database prompting the user with the graphic user interface to input a operations of the computer program product comprising: query by a DBMS. timeline for semantic density determination; prompting the The computer program product may further comprise ren dering a graph on a computer display showing elapsed time across one axis, the graph showing a plurality of semantic word density ratings for the web media identi?ed by the user. user with the graphic user interface to input domain names for one or more web media sources, the domain names compiled into a list of web media sources for analysis in computer readable memory; storing the blocks of text in the web media sources in volatile computer readable memory; and identify ing sentences within the block of text, wherein sentences are identi?ed within the block of text as being a string of text A system to perform operations for relaying commercial offerings from select merchants to select consumers is also disclosed, the system comprising: a webserver; a semantic following a period (.) immediately followed by a space ( ) which string of text also precedes a period (.) immediately followed by a space ( ). The computer program product further comprises storing indexing server; an RDBMS; a topic prompter module con ?gured to prompt a user with a graphic user interface to input a topic for semantic density determination; a timeline prompter module con?gured to prompt the user with a graphic user interface to input a timeline for semantic density the sentences as strings within an array; identifying clauses determination; a URL prompter module con?gured to prompt within the sentences satisfying one or more of the criteria the user with a graphic user interface to input URLs for one or 20 from the group consisting of: a string of text enclosed ellipsis; more web media sources, the URLs compiled into a list of a string of text enclosed by en dashes a string of text enclosed by em dashes; a string of text within the sentence enclosed by web media sources for analysis in computer readable memory; a synonym determiner module con?gured to deter mine synonym(s) for the topic by referencing a thesaurus in computer readable memory; a block storer module con?g ured to store the blocks of text forming the web media sources in volatile computer readable memory; and a sentence iden 25 ti?er module con?gured to identify sentences within the block of text. The system further comprises a sentence parser module con?gured to parse each sentence to quantify one or more of one or more of commas, semicolons, and colons; storing the identi?ed clauses as strings within an array; calculating a semantic density score for each web media source in the list of web media sources, wherein the semantic density score is a function of the number of clauses in each sentence of the web media source; storing the semantic density score in a persis tent computer readable database, the score exclusively asso 30 ciated with the web media content from which the score was derived; determining synonym(s) for the keyword by refer the following: a number of words in the sentence; a number of encing a thesaurus in computer readable memory; querying prepositions, postpositions, adjectives, adverbs, verbs, the sentences for instances of the keyword and synonyms; nouns, and grammatical conjunctions, by referencing words within the sentence with a dictionary in computer readable memory; a number of dependent clauses in the sentence; a number of independent clauses in the sentence; a number of ellipsis, a number of dashes (both en dashes and em dashes), and a number of commas, semicolons, and colons; a number of subjects and predicates in the sentence; a number of appo sitions in the sentence; a number of syllables in each word of storing semantic density ratings for sentences comprising the 35 display showing elapsed time across one axis, the graph showing a plurality of semantic density ratings for the web media identi?ed by the user. Reference throughout this speci?cation to features, advan 40 the sentence by cross-referencing each word with the dictio nary in persistent storage; and a number of alphanumeric characters in the sentence. The system further comprises a value storer module con ?gured to store each quanti?ed number in a persistent com invention should be or are in any single embodiment of the 45 advantages is understood to mean that a speci?c feature, advantage, or characteristic described in connection with an embodiment is included in at least one embodiment of the present invention. Thus, discussion of the features and advan tages, and similar language, throughout this speci?cation may, but do not necessarily, refer to the same embodiment. 50 Furthermore, the described features, advantages, and char acteristics of the invention may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art will recognize that the invention may be practiced persistent computer readable database, the score exclusively without one or more of the speci?c features or advantages of associated with the web media content from which the score was derived; a text querier module con?gured to query tex tualized media on the web media source for instances of the topic; a synonym query module con?gured to query textual ized media on the web media source for instances of syn 55 onyms of the topic identi?ed by referencing the dictionary; 60 a particular embodiment. In other instances, additional fea tures and advantages may be recognized in certain embodi ments that may not be present in all embodiments of the invention. and a render module con?gured to render a graph on a com puter display showing elapsed time across one axis, the graph showing a plurality of semantic density ratings for the web These features and advantages of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth hereinafter. media identi?ed by the user. A second computer program product is disclosed compris ing a computer readable medium having computer usable program code executable to perform operations for determin tages, or similar language does not imply that all of the features and advantages that may be realized with the present invention. Rather, language referring to the features and puter readable database with a time-stamp identifying the date the number(s) were quanti?ed; a calculator module con ?gured to calculate a semantic density score for each web media source in the list of web media sources, wherein the semantic density score is a function of the quanti?ed numbers for each sentence in the web media source; a density storer module con?gured to store the semantic density score in a keyword and synonyms; rendering a graph on a computer BRIEF DESCRIPTION OF THE DRAWINGS 65 In order that the advantages of the invention will be readily understood, a more particular description of the invention will US 8,880,389 B2 6 5 be rendered by reference to speci?c embodiments that are The web server 102 may also comprise a computer pro illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments of the inven gram running on one or more data processing devices tion and are not therefore to be considered to be limiting of its (DPDs), such as a server, computer workstation, router, main frame computer, or the like. In various embodiments, the scope, the invention will be described and explained with additional speci?city and detail through the use of the accom computing device well-known to those in the art and may DPD comprises one or more processors. The processor is a include an application-speci?c integrated circuit (“ASIC”). A legal entity controls the web sever 102. This legal entity may comprise any individual, organization, entity, or combi panying drawings, in which: FIG. 1 is a UML deployment diagram illustrating one embodiment of a system for determining semantic word den nation thereof, tasked with administering the server 102. The legal entity may be a legal entity, such as a partnership, sity in accordance with the present invention; FIG. 2 is a data-entity relationship diagram illustrating the interrelationships and data?ow of the operations and modules of the present invention; limited liability company (LLC), corporation, joint venture, municipality, d/b/a, or the like. Information may be routed from the sever 102 and/ or from the indexing server 120 to any FIG. 3 is a ?owchart illustrating method steps of the present DPD, device, person, merchant, consumer, company, organi invention, including steps which may be performed determin zation, or the like, in logical communication with the server 102. ing a semantic density of textualized digital web media in accordance with the present invention; The server 102 communicates via the Internet 114 with the DPDs of users 112a-b through PCs, smart phones, or other FIG. 4 is a block diagram illustrating one embodiment of a system which determines a semantic density of textualized 20 personal DPDs such as laptops, Palm Pilots®, digital cam 25 eras, and the like. The server 102 may be in logical commu nication with users 112a-b through a networked environment such as the Internet 114. The Internet 114 may comprise a local area network (LAN) or wide area network (WAN). The server 102 may email search results to users 112a-b using digital web media in accordance with the present invention; FIG. 5 is sentence content unit subdivided into clauses in accordance with the present invention; FIG. 6 is a Booch-relationship diagram illustrating the relationships between the search engine, users, and web media pages in accordance with the present invention; variations of the Simple Mail Transfer Protocol (SMTP), Internet Message Access Protocol (IMAP), Post Of?ce Pro FIG. 7 is a ?owchart illustrating method steps of the present invention, including steps which may be performed determin tocol (POP), or other protocols well-known to those of skill in ing a semantic density of textualized digital web media in accordance with the present invention; FIG. 8A is a line graph illustrated a time elapsed display of the art. 30 database management system (RDBMS), such as Oracle, MySQL, SQL, FireBird, IBM DB2®, or the like. semantic density ratings for a general topic in accordance with the present invention; and In the shown embodiment, one or more users 112 login to FIG. 8B is a line graph illustrated a time elapsed display of semantic density ratings for a speci?c topic in accordance The web server 102, in the shown embodiment, comprises a database management system (DBMS) 104 or relational 35 with the present invention. DETAILED DESCRIPTION OF THE INVENTION an account they have created with the server 102, wherein the account information is stored in computer readable database memory 106 accessible by the server 102. In other embodi ments, it is not necessary for the users 112a-b to have accounts with the server 102. Reference throughout this speci?cation to “one embodi 40 The users 112a-b may comprise any individual, organiza ment,” “an embodiment,” or similar language means that a tion, or group of individuals interested in the trends of seman particular feature, structure, or characteristic described in tic word density of a general or speci?c topic over a period of connection with the embodiment is included in at least one time. The users 112a-b access the web server 102 via the embodiment of the present invention. Appearances of the phrases “in one embodiment,” “in an embodiment,” and simi lar language throughout this speci?cation may, but do not Internet 114, interfacing with the server 102 using a graphic 45 necessarily, all refer to the same embodiment. The described features, structures, or characteristics of the The servers 116a-b, in the shown embodiment, comprises DPDs hosting third-party website with digital media content invention may be combined in any suitable manner in one or more embodiments. In the following description, numerous speci?c details are provided. One skilled in the relevant art 50 will recognize, however, that the invention may be practiced without one or more of the speci?c details, or with other methods, components, materials, and so forth. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention. The apparatus modules recited in the claims may be con?gured to impart the recited functionality to the user interface on a DPD local to the users 112a-b (this graphic user interface usually comprising a website shown in an Inter net browser on a computer display). 55 118a-b, which content 118a-b is accessible via the Internet or another network by the server 102 and/or the server 120. The users 112a-b submit topics for processing to the sever 102. These The indexing server 120 scans textualized web media con tent for text blocks, and parses text found on the web media sources into sentences, which sentences are arrayed in per sistent storage 106. Web media content 118a-b gathered from web media sources on the servers 116a-b is transferred to the server 102 and then to the indexing server 120, or altema apparatus. density in accordance with the present invention. The system tively straight to the indexing server 120. In various embodiments of the present invention, the indexing server 120 is con?gured to identify and store in 100 comprises a web server 102, a DBMS 104, persistent storage 106, keywords 108, web content 110, a user 11211, a volatile or nonvolatile memory 106 copies of text on web media sources identi?ed by the users 112a-b; or pointers or FIG. 1 is a UML deployment diagram illustrating one embodiment of a system 100 for determining semantic word 60 user 112b, a network 114, a server 11611, a server 116b, media 65 references to those web media sources on servers 116a-b. In content 118a, media content 118b, and an indexing server 120. various embodiments of the present invention, the sever 102 and/or the indexing server 120 may be con?gured to scan a US 8,880,389 B2 7 8 network 114 and identify and store in volatile or nonvolatile web media sources. These web media sources are queried 304 memory 106 copies of text found on a network, even at sources not identi?ed ahead of time by the users 112a-b. In some embodiments, this scan is done before a topic for for textualized web media content. This textualized web media content is stored 306 in computer readable memory 106. analysis is submitted by the users 112a-b to the server 102. In Sentences 210 within the web media content 118 are iden ti?ed and parsed 308 from the rest of the web media content other embodiments, the scan is done subsequently and only web media sources found to contain references to the topic (or 118. The sentences 210 are stored 310 in an array datastruc its synonyms) are parsed and stored by the indexing server ture. 120 in memory 106. In some embodiments, the display may be a cathode ray Clauses 212 are identi?ed 312 and parsed from the sen tences 210, then stored 314 in an array. tube (“CRT”), a liquid crystal display (“LCD”), a plasma display panel (“PDP”), or the like. FIG. 2 is a data-entity relationship 200 diagram illustrating Dependencies between the clauses are determined using methodology further described below. The types of clauses and dependencies identi?ed are used to increase the values of variables associated with those dependencies, a process the interrelationships and data?ow of the operations and mod ules of the present invention. The relationship 200 comprises through which semantic density is quanti?ed 316. A semantic users 112a-c, web media sources 118a-b, and storage 106. density score for the each sentence, web media source, and The storage 106 comprises computer readable ?les, including member structure 216, web content 110, density rating 218, keywords 108, and AST graphs 220. Various species of data the topic as a whole are calculated 318. are transmitted back and forth between the users 112a-c, the web media 118a-b, and storage 106, including keywords 108a-b, timelines 202a-b, URL (Uniform Resource Loca tors) 204a-c, sentences 21011-2, clauses 21211-2, statistics 214a-g, density data 216, and time elapsed semantic word graphs 218. In some embodiments, the sentences 210 are then queried 322 for instances of the topic, or keyword(s) 108, input by a 20 instances of synonyms of the keyword(s) 108 ascertained by cross-referencing a digital thesaurus. Alternatively, instances of the keyword(s) 108 in the sentences 210 may be replaced with the synonyms, and the sentences 210 re-quanti?ed 316 25 In the shown embodiment, user 11211 is prompted to upload for semantic word density. Finally, a graph of semantic word density for the topic or keyword(s) 108 is displayed 324 for the user 112. via a graphic user interface a number of pieces on information to the server 102, including one or more keyword(s) 10811, a timeline 20211 with a start date (i.e. start point) and an end date, and plurality of URLs 204a-b. user 112. The sentences 210 may subsequently be queried for FIG. 4 is a block diagram illustrating one embodiment of a 30 system which determines a semantic density of textualized digital web media in accordance with the present invention. The system 400 comprises modules largely con?gured to perform functions corresponding to the steps of the method The keyword(s) 108 connote topics of interest to the users 112a for which the user 112a would like semantic word density information in the form of graphs 218 (shown further 300 and computer program product embodiments. These cor in FIGS. 8A-8B below) or statistics 214. The keyword(s) 108 are stored in a database readable ?le 108 in storage 106. A plurality of URLs to web media content 118a-b are identi?ed by the user 11211 which the user 11211 would like searched for textualized media. The URLs 204a-b are stored in storage 106, and the indexing server 120 scans the web media sources 118a-b identi?ed by the URLs 204a-b for textualized web media content. This textualized web media relations are further described in the claims. content is parsed into sentences 21011-2 and clauses 214a-f, which sentences 21011-2 and clauses 214a-f are timestamped and stored in storage 106. Dependencies between the clauses 35 40 45 The system 400 may be said to form a search engine for information analysis and summary. are used to create statistics describing the magnitude of word 50 stored in a database ?le 218 in storage 106. Subsequently, these statistics 214a-f can be transferred to another user 1120 (or the same user 112a) in the form of 55 can perform semantic word searches by uploading personal Certain unique identi?ers are associated with each charac teristic of each sentence 210. For each sentence 210 a unique identi?er is exclusively associated with a number of words in 60 the sentence; a number of prepositions, postpositions, adjec tives, adverbs, verbs, nouns, and grammatical conjunctions, storage 106. by referencing words within the sentence with a dictionary in computer readable memory; a number of dependent clauses FIG. 3 is a ?owchart illustrating method 300 steps of the present invention, including steps which may be performed The method 300 begins 302 with a list of domain names and/ or ULRs from a user 112 being compiled 302 into a list of ing a string of text not preceded and/or not followed by any text. information to the server 102, such as name, email address, determining a semantic density of textualized digital web media in accordance with the present invention. FIG. 5 is sentence content unit 210 subdivided into clauses in accordance with the present invention. Blocks of text in the present invention are parsed into sentences 210. Sentences 210, in some embodiments, are identi?ed as strings of text separated by a period (.) following immediately by a space ( ). In other embodiments, the sentences 210 compris time-elapsed graphs 218. age, and the like. This personal information is used to create a member structure 216, which is stored in the database in Density ratings 218 are stored in computer readable memory 106 by the indexing server 120 and output in graphi cal form to the DPD ofa user 112. 212a-z are determined as further described below in relation In some embodiments of the present invention, users 112a-c are required to register with the server 102 before they identi?ed by users 112. A timeline 220, with start and end points, is uploaded by a user 112 to the acceptor module 420. Keywords 108 and URLs 204a-c are also uploaded to the system 400 by the user 112. The user 112 may comprise a human being, group of human beings, or another DPD con ?gured to use the system 400 for information retrieval pur poses. to FIG. 4 and other subsequent Figures. These dependencies density contained by each sentence in the respective web media 118 (i.e. density ratings). These density ratings are Textualized web media content 118a-c is gathered by the block storer module 430 from web content sources at URLs 65 in the sentence; a number of independent clauses in the sen tence; a number of ellipsis, a number of dashes (both en dashes and em dashes), and a number of commas, semico lons, and colons; a number of subjects and predicates in the US 8,880,389 B2 10 readable memory and only those blocks of text satisfying the sentence; a number of appositions in the sentence; a number of syllables in each word of the sentence by cross-referencing each word with the dictionary in persistent storage; and a query execution plan relayed to the users 112 or modules con?gured to parse and quantify identi?ers in the sentences 210 and clauses. FIG. 6 is a Booch-relationship diagram illustrating the interrelationships 600 between the search engine, users, and web media pages in accordance with the present invention. In the shown embodiment, a search engine 602, comprising the system 400, is given information by users 112a-d who have been made aware of speci?c webpages 118a-h which they number of alphanumeric characters in the sentence. The magnitude, or value assigned to these unique ideni?ers is representative of the number of times that each character istic in the sentence in identi?ed. These unique identi?ers, along with unique identi?ers representing the dependencies between sentence 210 clauses 504-510, are used in determin ing the informative density, or semantic density of sentences 210; and; consequently, entire blocks of text and web source documents. The sentence 210 comprises clauses identi?ed by the sys want processed by the search engine 602. FIG. 7 is a ?owchart illustrating method steps of the present invention, including steps which may be performed determin tem 400. The sentences 210 are stored in computer readable memory as encoded content units 210, which units are sub ing a semantic density of textualized digital web media in accordance with the present invention. jected to analysis by the indexing server 120. The keywords 108, in the shown embodiment, comprise The method 700 begins 702 and progresses as shown. A user 112 is prompted 704 to input a timeline with start and end digitized search terms entered by a user of the search engine. Typically, the keywords 108 are received by the server 102 and relayed to the indexing server 120. The keywords 108 may comprise phrases (one or more words) that relate to content which the user wishes to search, which content typi cally will relate to the subject matter of interest of the user. In response to reception of the keywords 1-8, the sever 120 may locate information archived in a content library under the points (i.e. start and end dates). The users 112 is also prompted 706 for keyword 108 (e.g. a topic). Web media content identi?ed by the user 112 is scanned 708 by the 20 indexing server 120. In the shown embodiment, if content has changed 710 from an earlier scan, information on web media page is scanned 25 control of the sever 120. The content library may be internal or external to the index ing server 120. In some embodiments, the content library is in logical communication with the indexing server 120 via a distributed network, or through other means well-known to those of skill in the art. In the shown embodiment, the content units 210 are stored in a computer readable database con trolled by the database server 101, such as IBM DB2, Oracle, or SQL. The keywords 108 are formulated into a database query and then associated with a query execution plan which iden ti?es content units within the content library that satisfy the 30 lations between words and/or clause(s). In some embodiments of the present invention, blocks of text identi?ed by a user 112 are searched for instances of the 35 of web media pages are searched for instances of the key 40 XPATH, and the like. The keywords 106a-c may also com prise other identi?ers relevant to creating, or identifying, the FIG. 8A is a line graph 800 illustrating a time elapsed display of semantic density ratings for a general topic in accordance with the present invention. The general topic for which time elapsed semantic word density is shown is poli The keywords 106a-c may be formulated into database tics. The graph 800 shows word density, or the informative as needed by a user with a form that is unknown until the 50 is likely to change between requests) or static (meaning the database query is prede?ned does not change form between requests, although the parametric data values of the query may change). The dynamic database queries comprising the keywords still further embodiments, the blocks of text may be searched for instances of synonyms of the keyword(s) 108, while in still further embodiments, instances of the keyword(s) 108 appearing the text may be substituted out and replaced with synonyms of the keyword(s) before performing further semantic word density analysis and calculation. 45 queries which are dynamic (meaning the query is generated query is received by the database server 101 and which form keyword(s) 108 after the semantic word density fo the blocks of text is determined. In other embodiments, large collections word(s) 108 before semantic word density is calculated. In may be generated using various languages including SQL, proper query execution plan such as the title and the names of speakers featured in the content units 108 or other forms of additional information described above. used in calculating the informative density, or semantic den sity of the sentence 210 containing the clause(s). In some embodiments, b-tree and/or ASG graphs are used to deter mine these dependencies. In other embodiments, dictionaries and thesauri are referenced to determine relations and corre database query after searching textualized digital transcripts of the audio portions of the content units 108 stored in asso ciation with the content units 108. These database queries again and sentences 210 are parsed 712. Clauses are identi?ed 714, and sentences are stored in arrays 716. The dependencies between the clauses are identi?ed and value of the discussion of politics, rising in 2009, slumping in 2010, and rising again in 2011. The sentences discussing politics average 41.1% of the maximum achievable density over the given 2009-2011 timeline 220. FIG. 8B is a line graph 850 illustrating a time elapsed 55 display of semantic density ratings for a speci?c topic in 108-c may comprise more than just textualized words or accordance with the present invention. phrases, but also logical bitwise operations that the indexing The speci?c topic for which time elapsed semantic word density is shown is the name of political candidate. Examples might include Barack Obama, Mitt Romney, and the like. The graph 850 shows word density, or the informative value of the server 120 infers were intended by the user 112 such as NOR, OR, XOR, AND, NAND, and the like, as well as other infor mation. The users 112 may be afforded a great deal of ?ex 60 ibility in searching for not just words and phrases, but com binations of words, phrases, and the like within the web media discussion of this candidate, falling in 2009, rising in 2010, and falling again in 2011. The sentences discussing politics content. average 32.4% of the maximum achievable density over the Web media which satis?es a database query comprising the keywords 108 may be analyzed for informative or semantic density only after identifying instances of the keyword(s) 108 given 2009-2011 timeline 220. The present invention may be embodied in other speci?c forms without departing from its spirit or essential character in the text. The web media content may be cached in computer istics. The described embodiments are to be considered in all 65 US 8,880,389 B2 11 12 respects only as illustrative and not restrictive. The scope of rendering a graph on a computer display showing elapsed the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which time across one axis, the graph showing a plurality of come within the meaning and range of equivalency of the the user. claims are to be embraced within their scope. semantic density ratings for the web media identi?ed by 5 What is claimed is: 1. A computer program product comprising a non-transi tory computer readable medium having computer usable pro gram code executable to perform operations for determining 4. The computer program product of claim 1, wherein the list of web media sources consists of web media sources manually identi?ed by one or more users to the website. an informative score of textualized digital web media, the operations of the computer program product comprising: 5. The computer program product of claim 1, further com prising: compiling a list of web media sources for analysis in com puter readable memory; for each sentence, creating one of an abstract semantic graph (ASG) and an abstract syntax tree (AST); and querying the web media sources for a block of text; storing the block of text in volatile computer readable memory; identifying sentences within the block of text; storing the sentences as strings within an array; parsing each sentence to quantify one or more of the fol lowing: 20 a number of words in the sentence; a number of prepositions, postpositions, adjectives, by referencing words within the sentence with a dic characterizing each word as one or more of a noun, verb, 25 text not comprising a verb within two commas. and em dashes), and a number of commas, semico 30 a number of subjects and predicates in the sentence; a number of appositions in the sentence; a number of syllables in each word of the sentence by cross-referencing each word with the dictionary in persistent storage; and adverb, adjective, and preposition, the dictionary identifying a number of syllables in each word. 7. The computer program product of claim 1, wherein a dependent clause in the sentence is identi?ed as a string of a number of dependent clauses in the sentence; a number of independent clauses in the sentence; a number of ellipsis, a number of dashes (both en dashes lons, and colons; calculating a semantic density score for each web media source in the list of web media sources, wherein the semantic density score is a function of the number of clauses in each sentence of the web media source, the dependency identi?er, and one or more of the ASG and the AST. 6. The computer program product of claim 1, further com prising storing a dictionary in computer readable memory, the dictionary identifying words of a language, the dictionary adverbs, verbs, nouns, and grammatical conjunctions, tionary in computer readable memory; 3. The computer program product of claim 1, wherein the web media sources are queried at predetermined intervals of time. 8. The computer program product of claim 1, wherein an independent clause in the sentence is identi?ed as a string of text comprising a verb within two commas. 9. The computer program product of claim 1, where in the 35 semantic density score is further a function of the number of sentences appearing in the block of text. 10. The computer program product of claim 1, wherein a a number of alphanumeric characters in the sentence; storing each quanti?ed number in a persistent computer database query is generated using the topic and synonyms, readable database with a time- stamp identifying the date wherein the sentences are stored in a database in computer readable memory and queried with the database query by a the number(s) were quanti?ed; calculating a semantic density score for each web media source in the list of web media sources, wherein the semantic density score is a function of the quanti?ed numbers for each sentence in the web media source; and storing the semantic density score in a persistent computer readable database, the score exclusively associated with 40 11. The computer program product of claim 1, further comprising rendered a graph on a computer display showing elapsed time across one axis, the graph showing a plurality of semantic word density ratings for the web media identi?ed by 45 the user. 12. A system to perform operations for determining a level of discussion of a topic, the system comprising: the web media content from which the score was derived. 2. The computer program product of claim 1, further com a webserver; a semantic indexing server; prising: prompting a user with a webpage to input a topic for 50 semantic density determination; density determination; semantic density determination; prompting the user with a webpage to input (Uniform Resource Locators) URLs for one or more web media 55 sources; determining synonym(s) for the topic by referencing a thesaurus in computer readable memory; semantic density determination; to input URLs for one or more web media sources, the 60 URLs compiled into a list of web media sources for analysis in computer readable memory; querying textualized media on the web media source for instances of synonyms of the topic identi?ed by refer the synonyms of the topic; and a timeline prompter module con?gured to prompt the user with a graphic user interface to input a timeline for a (Uniform Resource Locator) URL prompter module con ?gured to prompt the user with a graphic user interface querying textualized media on the web media source for encing the dictionary; storing semantic density ratings for sentences comprising the topic; storing semantic density ratings for sentences comprising an (Relational Database Management System) RDBMS; a topic prompter module con?gured to prompt a user with a graphic user interface to input a topic for semantic prompting the user with a webpage to input a timeline for instances of the topic; (Database Management System) DBMS. a synonym determiner module con?gured to determine synonym(s) for the topic by referencing a thesaurus in computer readable memory; 65 a block storer module con?gured to store the blocks of text forming the web media sources in volatile computer readable memory; US 8,880,389 B2 14 13 a sentence identi?er module con?gured to identify sen tences within the block of text; a sentence parser module con?gured to parse each sentence to quantify one or more of the following: a number of words in the sentence; wherein a dependency comprises a clause with one or more a number of prepositions, postpositions, adjectives, noun, adjective, adverb, or preposition in a second clause; and characteristics from the group consisting of: a modi?er in the ?rst clause modifying a verb in a second clause; a modi?er in a ?rst clause modifying one of a noun, verb, and adverb in second clause; a complement in the ?rst clause of one of a adverbs, verbs, nouns, and grammatical conjunctions, an interjection referencing a noun in the second clause. by referencing words within the sentence with a dic 18. The system of claim 12, further comprising rendering a tionary in computer readable memory; graph on a computer display showing elapsed time across one a number of dependent clauses in the sentence; a number of independent clauses in the sentence; a number of ellipsis, a number of dashes (both en dashes axis, the graph showing a plurality of semantic word density ratings for the web media identi?ed by the user. 19. A computer program product comprising a non-transi and em dashes), and a number of commas, semico tory computer readable medium having computer usable pro gram code executable to perform operations for determining a semantic density of textualized digital web media, the lons, and colons; a number of subjects and predicates in the sentence; a number of appositions in the sentence; a number of syllables in each word of the sentence by cross-referencing each word with the dictionary in operations of the computer program product comprising: prompting a user with a graphic user interface to input one persistent storage; and a number of alphanumeric characters in the sentence; a value storer module con?gured to store each quanti?ed number in a persistent computer readable database with a time-stamp identifying the date the number(s) were 20 domain names for one or more web media sources, the quanti?ed; a calculator module con?gured to calculate a semantic density score for each web media source in the list of web media sources, wherein the semantic density score is a function of the quanti?ed numbers for each sentence in the web media source; a density storer module con?gured to store the semantic domain names compiled into a list of web media sources 25 sentences are identi?ed within the block of text as being 30 a text querier module con?gured to query textualized 35 topic; a synonym query module con?gured to query textualized media on the web media source for instances of syn onyms of the topic identi?ed by referencing the dictio 40 a render module con?gured to render a graph on a com puter display showing elapsed time across one axis, the graph showing a plurality of semantic density ratings for the web media identi?ed by the user. 13. The system of claim 12, wherein sentences are identi ?ed within the block of text as being a string of text following 45 50 increased for each dependency identi?ed between a ?rst clause and a second clause in the sentence, wherein a dependency comprises a clause with one or more char 55 acteristics from the group consisting of: a modi?er in the ?rst clause modifying a verb in a second clause; a modi ?er in a ?rst clause modifying one of a noun, verb, and adverb in second clause; a complement in the ?rst clause the sentence is identi?ed as a string of text not comprising a of one of a noun, adjective, adverb, or preposition in a second clause; and an interjection referencing a noun in verb within two commas. 16. The system of claim 12, wherein an independent clause the second clause; 60 for each sentence, creating one of an abstract semantic graph (ASG) and an abstract syntax tree (AST); verb within two commas. 17. The system of claim 12, further comprising a depen dency analyzer module con?gured to analyze dependencies ?edbetween a ?rst clause and a second clause in the sentence, following an agent-object-verb (AOV) order; analyzing dependencies between the clauses and increas ing the magnitude of a dependency identi?er exclusively associated with the sentence, the dependency identi?er and preposition, the dictionary identifying a number of syl between the clauses and increasing the magnitude of a depen dency identi?er exclusively associated with the sentence, the dependency identi?er increased for each dependency identi string of text enclosed ellipsis; string of text enclosed by en dashes string of text enclosed by em dashes; string of text within the sentence enclosed by one or more of commas, semicolons, and colons; a string of text within the sentence comprising words storing the identi?ed clauses as strings within an array; each word as one or more of a noun, verb, adverb, adjective, in the sentence is identi?ed as a string of text comprising a a a a a following a subject-object-verb (SOV) word order; space ( ). lables in each word. 15. The system of claim 12, wherein a dependent clause in storing the sentences as strings within an array; identifying clauses within the sentences satisfying one or more of the criteria from the group consisting of: and string of text within the sentence comprising words a period (.) immediately followed by a space ( ) which string of text also precedes a period (.) immediately followed by a 14. The system of claim 12, further comprising storing a dictionary in computer readable memory, the dictionary iden tifying words of a language, the dictionary characterizing a string of text following a period (.) immediately fol lowed by a space () which string of text also precedes a period (.) immediately followed by a space ( ); content from which the score was derived; nary; and for analysis in computer readable memory; storing the blocks of text in the web media sources in volatile computer readable memory; identifying sentences within the block of text, wherein density score in a persistent computer readable database, the score exclusively associated with the web media media on the web media source for instances of the or more keyword(s) for semantic density determination; prompting the user with the graphic user interface to input a timeline for semantic density determination; prompting the user with the graphic user interface to input 65 calculating a semantic density score for each web media source in the list of web media sources, wherein the semantic density score is a function of the number of clauses in each sentence of the web media source, the dependency identi?er, and one or more of the ASG and the AST; US 8,880,389 B2 15 storing the semantic density score in a persistent computer readable database, the score exclusively associated with the web media content from which the score was derived; determining synonym(s) for the keyword by referencing a 5 thesaurus in computer readable memory; querying the sentences for instances of the keyword and synonyms; storing semantic density ratings for sentences comprising the keyword and synonyms; and 10 rendering a graph on a computer display showing elapsed time across one axis, the graph showing a plurality of semantic density ratings for the web media identi?ed by the user. 20. The computer program product of claim 19, wherein a 15 database query is generated using the keyword(s) and syn onyms of the keyword(s), wherein the sentences are stored in a database in computer readable memory and queried with the database query by a (Database Management System) DBMS. * * * * * 20
© Copyright 2026 Paperzz