IMPORTANT PROBLEMS WITH LESS FOCUS
James Hodson, AI Research (BRAIN) Lab, Bloomberg

Overview: This talk
• News as a clean source of text;
• Defining and detecting novelty/importance;
• Modeling bias and authority;
• Normalized languages;
• Named entity disambiguation under adversity;
• Extracting a propositional structure;
• Beyond n-gram features.

Overview: This talk
How does a human understand natural language? A human must place the words being spoken into the space of shared experiences and common context provided by the speaker. The speaker's job is to evoke the right set of connections so that the central relationships and inferences fit into the mental model of her interlocutor. In other words, we cannot understand each other unless we know we are talking about the same concepts, making the same assumptions, and using the same contextual clues.

News: Well-formed, grammatical text?
Textual news comes in every form imaginable:
• Well-formed prose;
• Bullet-pointed lists;
• ASCII-art, HTML, and other tables;
• All-caps headlines;
• Poorly formed prose (blogs and social media);
• Machine-readable or form-based text;
• Badly encoded PDFs;
• Images.
Every contributor has their own ideas about the right way to do things. We have 100k+ global sources and up to 1.5 million stories each day accessible to users. That is a lot of noise.

Formal News Text: Bloomberg News, NYT, press releases…
Editorially curated content.
Bloomberg News: "Ninety-five, or 57 percent, of the 167 regional equity long-short hedge funds which began trading with less than $50 million still manage less than that amount after an average of 5.3 years in existence, it added, citing data from Singapore-based Eurekahedge Pte."
"To Chevron, the single most valuable component of the settlement is probably Patton Boggs's agreement to allow Gibson Dunn to depose Patton Boggs lawyers who could reveal fresh evidence the oil company could use to persuade courts in other countries not to enforce the Ecuadorean judgment."
This is not an edge case. State-of-the-art NLP performs poorly on sentences like these.

Informal News Text: Social media, blogs…
Everybody has a voice. If only they spoke the same language…
Twitter: "$AAPL - And so it begins."
The Politics Blog: "The great failing of the Democratic party over the past three-and-a-half decades has been the party's failure to take political advantage of the obvious prion disease that has afflicted the Republican party since it first ate all the monkey-brains in the mid-1970's."
Missing context, misspellings, colloquialisms, hidden assumptions, stylistic playfulness. And more.

Headlines: Fast, novel, impactful…
64 characters or fewer, meant to convey what is important, now.
"*BRAZIL OFFERS 20,000 FX SWAPS OFFERED IN ROLLOVER AUCTION"
"*PROSEGUR TO PAY EU0.027/SHR GROSS DIVIDEND JULY 17"
"*SDM GROUP <8363> LISTING ON THE GROWTH ENTERPRISE MARKET OF"
"*GAZPROM CUTS SEPT. TRANSIT VIA UKRAINE TO 34% OF EUROPE EXPORTS"
"*PSP SUSPENSION FROM OFFICIAL QUOTATION COB"
"*PBOC SAYS TO SUPPORT FINANCING FOR PROPERTY DEVELOPERS"
Abbreviation decisions are often inconsistent, named entities lack a defining feature (capitalization), and an enormous number of assumptions must be made to understand a headline fully. Headlines are written for a specific set of experts and are highly domain-specific.

Non-English: Cross-lingual conventions…
Each language has its own conventions:
• German editors refuse to translate;
• Italian sources often lack accented letters;
• Non-English stories about US events tend to provide more background;
• French news stories are longer than English equivalents;
• Japan releases most of its impactful news around market open;
• News coverage is much less dense in non-English locations.
Building NLP pipelines for each cultural frame requires a high level of familiarity with the local context and habits. There is also less parallel training data for machine translation.

Novelty: What counts as new information?
A complication: investors react primarily to novel news content. However, it has been shown (Tetlock, 2011) that investors overreact to duplicated information, especially (Fedyk, 2014) when the duplicated information is an aggregation of multiple previously distinct events.
• Average 4 basis points excess return for duplicated information;
• Average 8 basis points excess return for aggregated information;
• A consistent behavioral effect since at least 2000.
Clearly, there is inefficiency in how new information is absorbed into the market's shared consciousness.

Novelty: What counts as new information?
Newness is not an inherent property of the text. Example:
"Iron Mountain Inc., the Boston-based data storage and information management company, is considering an offer to buy Recall Holdings Ltd. for more than $2 billion, people with knowledge of the matter said."
• Suppose no previous news story mentioned that Iron Mountain is based in Boston. Is this sentence novel?
• At what level of granularity: story, sentence, predicate?
• Over what time span? In what context?
"IBM's shares jumped 3% during market open today."
This could happen every day. But it sounded so intuitive…

New and Important: Definition
Text-based features contribute to an answer:
• Entities, facts, events, relationships;
• Temporal clues and dependencies;
• Topic modeling;
• Entity-narrative consistency.
We care about identifying anomalous events in an otherwise consistent entity narrative. Documents and streams of documents can help us define the context, but we need a more concrete world model to ground the narrative: a rich, semantic, context-driven world model for novelty.

New and Important: Measurement
Building corpus resources for this task:
• A simple sequence labeling task;
• An annotator sequentially reads and tags.
Great, except this breaks down if you want to do more than about 10 documents, and we need thousands at least. Instead, use entity-level annotations and their associated latent topics to constrain the task. Then select pairs of documents at random from a temporally constrained distribution. MTurk? Choose enough pairs that the likelihood of missing annotations falls below some threshold. A minimal sketch of this pair-sampling step follows.
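To make the pair-selection step above concrete, here is a minimal sketch, not the procedure actually used at Bloomberg: the document fields, the hard time window, and the coverage-probability stopping rule are assumptions introduced purely for illustration.

```python
import random
from datetime import timedelta
from itertools import combinations


def sample_annotation_pairs(docs, max_gap_days=7, miss_prob_threshold=0.05, seed=0):
    """Sample document pairs for novelty annotation.

    `docs` is assumed to be a list of dicts with 'id', 'timestamp' (a datetime),
    and 'entities' (a set of entity IDs).  Pairs are constrained to share at
    least one entity and to be published within `max_gap_days` of each other,
    so annotators compare documents that plausibly belong to the same narrative.
    """
    rng = random.Random(seed)
    max_gap = timedelta(days=max_gap_days)

    # Candidate pairs: shared entity, close in time (the "temporally
    # constrained distribution" from the slide, simplified to a hard window).
    candidates = [
        (a, b) for a, b in combinations(docs, 2)
        if a["entities"] & b["entities"]
        and abs(a["timestamp"] - b["timestamp"]) <= max_gap
    ]

    # Keep drawing pairs until the chance that any given candidate pair was
    # never selected drops below the threshold: after n draws with
    # replacement, P(miss) = (1 - 1/|C|)**n for each candidate pair.
    n_candidates = len(candidates)
    sampled, miss_prob, draws = [], 1.0, 0
    while n_candidates and miss_prob > miss_prob_threshold:
        sampled.append(rng.choice(candidates))
        draws += 1
        miss_prob = (1.0 - 1.0 / n_candidates) ** draws
    return sampled
```

The stopping rule is the simplest possible reading of "choose enough pairs until your likelihood of missing annotations is below some threshold"; a real deployment would presumably stratify by entity and topic rather than sample uniformly.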
What to Believe: Incentives and Authority
What does the content from this source usually look like?
• Normalize the input text to reduce dimensionality: @country:IRAN@, @nuclear_disarmament@, @country:USA@, [backs, initial, sanctions, deal], @support@;
• Build a semi-lexicalized, feature-rich, channel-based language model (LM);
• Estimate how likely newly published content is under our expectation for this channel, using a modified Query Likelihood Model:
  log P(news | channel) = Σ_{k=0..n} log P(K | Q)
  where k indexes the semi-lexicalized features, K is the set of n-gram features rooted at k, and Q is the semi-lexicalized channel LM.
Cluster all sources to categorize them. (A minimal scoring sketch appears after the next section.)

Working with Text: Constrained Paraphrasing
What is the simplest way to say it?
• Syntactically more likely;
• Smallest number of unary or binary relations;
• Split complex propositions.
"Deutsche Lufthansa AG Chief Executive Officer Carsten Spohr asked customers to be patient as he grapples with the longest strike in the airline's history, saying the future of all employees is at stake as he seeks to find a compromise with pilots seeking to preserve benefits."
• "Carsten Spohr is the CEO of Deutsche Lufthansa AG."
• "Carsten Spohr asked customers to be patient."
• "Carsten Spohr is dealing with Deutsche Lufthansa AG's longest strike."
• …
Propositions like these are much easier to translate accurately, or to use for a variety of IR tasks. (A rough proposition-splitting sketch also follows below.)
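For the channel-model scoring in "What to Believe" above, here is a minimal sketch, assuming unigram counts over the semi-lexicalized feature tokens and add-one smoothing; the feature representation and smoothing are simplifications chosen for the example, not the production model.

```python
import math
from collections import Counter


def build_channel_lm(feature_sequences):
    """Estimate a unigram channel LM over semi-lexicalized features.

    `feature_sequences` is assumed to be a list of token lists such as
    ["@country:IRAN@", "@support@", "backs", "initial", "sanctions", "deal"].
    """
    counts = Counter(tok for seq in feature_sequences for tok in seq)
    total = sum(counts.values())
    vocab = len(counts)
    return counts, total, vocab


def channel_log_likelihood(features, lm):
    """Score log P(news | channel) = sum_k log P(K | Q), with add-one smoothing."""
    counts, total, vocab = lm
    return sum(
        math.log((counts.get(tok, 0) + 1) / (total + vocab + 1))
        for tok in features
    )


# Stories that score far below their own channel's model look "off-channel"
# and can be flagged, down-weighted, or re-clustered with other sources.
```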
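And for the proposition-splitting step in "Constrained Paraphrasing", here is a rough sketch using spaCy's dependency parse to pull out simple subject-verb-object clauses. It is only a crude approximation of the constrained paraphrasing described in the talk (it will not, for instance, recover the CEO appositive), and it assumes an English spaCy model is installed locally.

```python
import spacy

# Assumes an English pipeline is available, e.g. `python -m spacy download en_core_web_sm`.
nlp = spacy.load("en_core_web_sm")


def split_propositions(text):
    """Extract rough (subject, verb, object) propositions from a complex sentence."""
    doc = nlp(text)
    propositions = []
    for token in doc:
        if token.pos_ != "VERB":
            continue
        subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c for c in token.children if c.dep_ in ("dobj", "attr", "dative")]
        for subj in subjects:
            for obj in objects:
                propositions.append((
                    " ".join(t.text for t in subj.subtree),
                    token.lemma_,
                    " ".join(t.text for t in obj.subtree),
                ))
    return propositions


# split_propositions("Carsten Spohr asked customers to be patient.")
# should yield roughly [("Carsten Spohr", "ask", "customers")].
```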
Working with Text: Along the Garden Path
Linguists have been known to talk about garden-path sentences, but they happen in practice too:
• "The horse raced past the barn fell."
• "The cotton clothing is usually made of grows in Mississippi."
• "Until the police arrest the drug dealers control the street."
• "Fat people eat accumulates."
They usually arise from poorly constructed attributive statements that could easily be reformulated with "that" constructions. We need better normalization.

Working with Text: Concentrate on the Content
It is easy to apply a great algorithm to the wrong problem. Boilerplate language is everywhere; it is how we structure our documents and how we convey certain metadata:
• Authorship;
• Disclaimers;
• Wrappers.
"Lavante is the leading provider of Cloud-based supplier management solutions for Fortune 1000 Companies. Our mission is to connect businesses with their suppliers and we provide value to both by automating and improving the quality, accuracy and cost effectiveness of business interactions."
Learn to identify boilerplate based on position, style, and structure markers (see the first sketch below).

Working with Text: Identify what's important
Sometimes it is not duplicate information; it is a template:
• Auto-generated content;
• Regulatory releases;
• Pro-forma information.
FUND: LYXOR ETF MSCI India Part B GBP
ISIN CODE: FR0010375766
TRADING DATE: 29-Sep-14
NAV PER SHARE: GBP 10.1966
NUMBER OF UNITS: 100000
CODE: INRGBP
Warning contact: 0800 707 6956
Build a library of templates and extract only the relevant information (see the second sketch below).
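First sketch, for the boilerplate problem in "Concentrate on the Content": a minimal heuristic scorer using position and a few style markers. The marker phrases, weights, and thresholds are invented for illustration; a production system would learn them from labeled pages rather than hard-code them.

```python
import re

# Phrases that often signal boilerplate in press releases and web pages.
# Illustrative only, not an exhaustive or production list.
BOILERPLATE_MARKERS = re.compile(
    r"\b(leading provider|all rights reserved|forward-looking statements|"
    r"about (the )?company|our mission is|for more information)\b",
    re.IGNORECASE,
)


def boilerplate_score(paragraph, index, total_paragraphs):
    """Heuristic boilerplate score in [0, 1] from position, style, and structure."""
    score = 0.0
    # Position: boilerplate clusters at the top and bottom of a document.
    relative_pos = index / max(total_paragraphs - 1, 1)
    if relative_pos < 0.1 or relative_pos > 0.85:
        score += 0.4
    # Style: marketing or disclaimer phrasing.
    if BOILERPLATE_MARKERS.search(paragraph):
        score += 0.4
    # Structure: very short paragraphs are often wrappers or bylines.
    if len(paragraph.split()) < 8:
        score += 0.2
    return min(score, 1.0)


def strip_boilerplate(paragraphs, threshold=0.5):
    """Keep only paragraphs that look like real content."""
    return [
        p for i, p in enumerate(paragraphs)
        if boilerplate_score(p, i, len(paragraphs)) < threshold
    ]
```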
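Second sketch, for the template case in "Identify what's important": once a release is matched to a known template, a simple field pattern pulls out only the values that matter. The field names and regexes below are keyed to the fund-NAV example on the slide and are assumptions for illustration, not an actual template library.

```python
import re

# One entry in a hypothetical template library, keyed to the fund-NAV example.
NAV_TEMPLATE = {
    "fund":       re.compile(r"FUND:\s*(.+)"),
    "isin":       re.compile(r"ISIN CODE:\s*(\S+)"),
    "trade_date": re.compile(r"TRADING DATE:\s*(\S+)"),
    "nav":        re.compile(r"NAV PER SHARE:\s*([A-Z]{3})\s*([\d.]+)"),
    "units":      re.compile(r"NUMBER OF UNITS:\s*(\d+)"),
}


def extract_nav_release(text):
    """Extract only the relevant fields from a template-matched NAV release."""
    record = {}
    for field, pattern in NAV_TEMPLATE.items():
        match = pattern.search(text)
        if match:
            record[field] = match.groups() if len(match.groups()) > 1 else match.group(1)
    return record


# extract_nav_release("FUND: LYXOR ETF MSCI India Part B GBP\nISIN CODE: FR0010375766\n...")
# -> {"fund": "LYXOR ETF MSCI India Part B GBP", "isin": "FR0010375766", ...}
```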
What is it about? Disambiguation under adversity
Named entity disambiguation is hard.
• Context can be subtle: "Michael Jordan stats"; "Spain advanced to the next round."
• We need to exploit an understanding of the expected context for each candidate.
What if those creating the content don't care how difficult it is?
• Traders come up with new names to describe securities on an intraday basis, depending on the behavior of the security, their mood, lunar cycles, etc.
• "Good luck trading the cats on a Wednesday."
This is more than meme propagation. (A small candidate-ranking sketch follows after the next section.)

Propositional Structure: What is to be understood?
What if we could understand?
• Facts;
• Relationships;
• Events.
"Futura Venture Partners to buy Quip Ltd for $15m"
Event: Buy
Buyer: Futura [1434]
Object: Quip [3778]
Bid: $15,000,000
Cyc can generate English statements from statements in CycL, so we can train a statistical MT model from this parallel corpus [XLIKE, Tadic, 2014]. The constrained vocabulary produces state-of-the-art results. (A sketch of the event structure also appears below.)
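For the "expected context for each candidate" idea in the disambiguation section: a minimal sketch that ranks candidate entities by lexical similarity between the mention's surrounding text and a short profile of each candidate. Real systems use much richer context models; the profiles below are invented for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_candidates(mention_context, candidate_profiles):
    """Rank candidate entities by similarity of their expected context to the mention context."""
    context_vec = bag_of_words(mention_context)
    scored = [
        (cosine(context_vec, bag_of_words(profile)), entity)
        for entity, profile in candidate_profiles.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical profiles for the "Michael Jordan" example from the slide.
profiles = {
    "Michael Jordan (basketball)": "NBA Chicago Bulls points rebounds stats championship",
    "Michael I. Jordan (scientist)": "machine learning Berkeley professor statistics graphical models",
}
# rank_candidates("career stats and points per game", profiles) should prefer the basketball player.
```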
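And for the propositional structure example: a sketch of how the extracted acquisition event might be represented, with a toy pattern-based extractor for headlines of the form "X to buy Y for $Nm". The dataclass fields mirror the slide; the regex is a stand-in for real event extraction, and the internal entity IDs are not modeled here.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BuyEvent:
    buyer: str               # e.g. "Futura" (internal ID 1434 on the slide)
    target: str              # e.g. "Quip" (internal ID 3778)
    bid_usd: Optional[int]   # e.g. 15_000_000


# Toy pattern for headlines such as "Futura Venture Partners to buy Quip Ltd for $15m".
BUY_PATTERN = re.compile(
    r"^(?P<buyer>.+?) to buy (?P<target>.+?) for \$(?P<amount>[\d.]+)\s*(?P<unit>m|bn|million|billion)",
    re.IGNORECASE,
)


def extract_buy_event(headline):
    """Map a matching headline onto the structured Buy event from the slide."""
    match = BUY_PATTERN.search(headline)
    if not match:
        return None
    multiplier = 1_000_000 if match.group("unit").lower() in ("m", "million") else 1_000_000_000
    return BuyEvent(
        buyer=match.group("buyer").strip(),
        target=match.group("target").strip(),
        bid_usd=int(float(match.group("amount")) * multiplier),
    )


# extract_buy_event("Futura Venture Partners to buy Quip Ltd for $15m")
# -> BuyEvent(buyer='Futura Venture Partners', target='Quip Ltd', bid_usd=15000000)
```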
The Available Information: Non-local Features
Look beyond n-gram context features. Exercise: randomly select 100 sentences from newswire stories in the business and financial domain, and ask a human evaluator to mark ambiguous entities, missing context, and unsubstantiated references.
• 73% of sentences contain some level of ambiguity;
• In roughly half of cases, ambiguities are resolved at the document level;
• In many other cases, previous news or general market knowledge can help guide disambiguation.
Humans use more than local information when processing documents to interpret them and make decisions. Why would our systems not?

The Available Information: More Evidence
Look at the entire document. Topics, entities, and word senses tend to be coherent across a document, so document-level information can better guide extraction, disambiguation, and understanding.
Look at the entire stream. Use collocations and correlations among prior topics, stories, events, and entities to update expectations about the current document or statement being analyzed.
Look at the past history of a concept. Concepts and entities tend to be subject to similar events, to repetition, and to homogeneity in topical context.
We can build richer models based on local and global interactions.

In Conclusion: An Incomplete Assortment of Challenges
Problems are interesting in the abstract, but the real world has many more surprises.

Questions? Answers?
[email protected]
James Hodson, AI Research (BRAIN) Lab, Bloomberg