IMPORTANT PROBLEMS WITH LESS FOCUS
James Hodson, AI Research (BRAIN) Lab
Bloomberg
Overview
Knowledge
This talk
• News as a clean source of text;
• Defining and detecting novelty/importance;
• Modeling bias, authority;
• Normalized languages;
• Named Entity Disambiguation under adversity;
• Extracting a propositional structure;
• Beyond n-gram features;
This talk
How does a human understand natural language?
A human must place the words being spoken into the space of
shared experiences and common context provided by the
speaker. The speaker’s job is to evoke the right set of
connections so that the central relationships and inferences
may fit into the mental model of her interlocutor.
i.e. we cannot understand each other unless we know we are
talking about the same concepts, making the same
assumptions, and using the same contextual clues!
News
Well-formed, grammatical text?
Textual News comes in all forms imaginable.
• Well-formed prose;
• Bullet-pointed lists;
• ASCII-art, HTML, etc., tables;
• All-caps headlines;
• Poorly formed prose (blogs and social media);
• Machine-readable or form-based;
• Badly encoded PDF;
• Images;
Every contributor has their own ideas about the right way to do things. We
have 100k+ global sources, up to 1.5m stories each day accessible to users.
That’s a lot of noise!
Formal News Text
Bloomberg News, NYT, Press Releases…
Editorially Curated Content.
Bloomberg News:
"Ninety-five, or 57 percent, of the 167 regional equity long-short hedge
funds which began trading with less than $50 million still manage less than
that amount after an average of 5.3 years in existence, it added, citing data
from Singapore-based Eurekahedge Pte."
“To Chevron, the single most valuable component of the settlement is
probably Patton Boggs’s agreement to allow Gibson Dunn to depose Patton
Boggs lawyers who could reveal fresh evidence the oil company could use
to persuade courts in other countries not to enforce the Ecuadorean
judgment.”
This is not an edge case! State-of-the-art NLP performs poorly on text like this.
Informal News Text
Social Media, Blogs…
Everybody has a voice. If only they spoke the same language…
Twitter:
"$AAPL - And so it begins."
The Politics Blog:
“The great failing of the Democratic party over the past three-and-a-half
decades has been the party's failure to take political advantage of the
obvious prion disease that has afflicted the Republican party since it first
ate all the monkey-brains in the mid-1970's.”
Missing context, misspellings, colloquialisms, hidden assumptions, stylistic
playfulness. And more.
Headlines
Fast, Novel, Impactful…
64 characters or fewer. Meant to convey what is important. Now.
"*BRAZIL OFFERS 20,000 FX SWAPS OFFERED IN ROLLOVER AUCTION"
“*PROSEGUR TO PAY EU0.027/SHR GROSS DIVIDEND JULY 17”
“*SDM GROUP <8363> LISTING ON THE GROWTH ENTERPRISE MARKET OF”
“*GAZPROM CUTS SEPT. TRANSIT VIA UKRAINE TO 34% OF EUROPE EXPORTS”
“*PSP SUSPENSION FROM OFFICIAL QUOTATION COB”
“*PBOC SAYS TO SUPPORT FINANCING FOR PROPERTY DEVELOPERS”
Abbreviation decisions are often inconsistent, named entities lack a defining
feature (capitalization), and an enormous number of assumptions must be made
to understand these headlines fully.
Written for a specific set of experts. Highly domain-specific.
Non-English
Cross-lingual conventions…
Each language has its own conventions.
• German editors refuse to translate;
• Italian sources often lack accented letters;
• Non-English stories about US events tend to provide more
background;
• French news stories are longer than English equivalents;
• Japan releases most of its impactful news around market open;
• News coverage much less dense in non-English locations;
Building NLP pipelines for each cultural frame requires a high level of
familiarity with the context and habits.
Less parallel training data for MT!
Novelty
What counts as new information?
A complication.
Investors react primarily to novel news content.
However, it has been shown (Tetlock, 2011) that investors overreact to
duplicated information, especially (Fedyk, 2014) when the duplicated
information represents an aggregation of multiple previously distinct
events.
• Average 4 basis points excess return (duplicate);
• Average 8 basis points excess return (aggregate);
• Consistent behavioral effect since at least 2000;
Clearly, there is inefficiency in absorbing new information into the
market’s shared consciousness.
Novelty
What counts as new information?
Newness is not an inherent property of the text.
Example:
“Iron Mountain Inc., the Boston-based data storage and information
management company, is considering an offer to buy Recall Holdings Ltd.
for more than $2 billion, people with knowledge of the matter said.”
• Let’s say no previous news story mentioned that Iron Mountain is
based in Boston. Is it novel?
• What level of granularity: Story, Sentence, Predicate?
• What time span? What context?
“IBM’s shares jumped 3% during market open today.”
• This could happen every day.
But, it sounded so intuitive…
New and Important
Definition
Text-based features contribute to an answer.
• Entities, facts, events, relationships;
• Temporal clues and dependencies;
• Topic modeling;
• Entity-narrative consistency;
We care about identifying anomalous events in an otherwise consistent
entity narrative. Documents and streams of documents can help us define
the context. But we need a more concrete world model to ground our
narrative.
Rich semantic, context-driven, world model for novelty.
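As a bare-bones illustration of the idea only (not the rich semantic world model described here), a minimal sketch that flags a new statement as potentially anomalous when it is dissimilar to everything previously said about the entity; the prior statements and the threshold are invented for the example.

    # Minimal sketch: flag a statement as potentially novel for an entity when it is
    # dissimilar to everything previously said about that entity. Purely illustrative;
    # the prior statements and threshold below are invented for the example.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    entity_narrative = [
        "IBM shares rose after quarterly earnings beat analyst estimates.",
        "IBM announced a partnership to expand its cloud services business.",
    ]
    new_statement = "IBM is said to be exploring the sale of its chip design unit."

    vectorizer = TfidfVectorizer().fit(entity_narrative + [new_statement])
    narrative_vecs = vectorizer.transform(entity_narrative)
    new_vec = vectorizer.transform([new_statement])

    # Novelty = 1 - similarity to the closest prior statement about the entity.
    novelty = 1.0 - cosine_similarity(new_vec, narrative_vecs).max()
    print(f"novelty score: {novelty:.2f}")
    if novelty > 0.7:  # arbitrary illustrative threshold
        print("candidate anomalous event in the entity narrative")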
New and Important
Measurement
Building corpus resources for this task.
• A simple sequence labeling task;
• Annotator sequentially reads and tags;
Great, except this breaks down if you want to do more than 10 documents; we
need thousands at least.
Use entity-level annotations and associated latent topics to constrain the
task. Then select pairs of documents at random from a temporally constrained
distribution. MTurk?
Choose enough pairs until your likelihood of missing annotations is below
some threshold.
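A rough sketch of the pairing step under stated assumptions: each document carries a timestamp and a set of entity annotations, and we only pair documents that share an entity and fall within a fixed time window. The window, documents, and sample size are illustrative, not the actual setup.

    # Illustrative sketch of selecting document pairs for novelty annotation:
    # only pair documents that share an entity and were published within a fixed
    # time window of one another. Timestamps, entities, and the window are invented.
    import random
    from datetime import datetime, timedelta
    from itertools import combinations

    docs = [
        {"id": "d1", "time": datetime(2014, 9, 29, 8, 0), "entities": {"IRON_MOUNTAIN"}},
        {"id": "d2", "time": datetime(2014, 9, 29, 9, 30), "entities": {"IRON_MOUNTAIN", "RECALL"}},
        {"id": "d3", "time": datetime(2014, 10, 2, 11, 0), "entities": {"IBM"}},
    ]

    WINDOW = timedelta(hours=48)

    candidate_pairs = [
        (a, b)
        for a, b in combinations(docs, 2)
        if a["entities"] & b["entities"] and abs(a["time"] - b["time"]) <= WINDOW
    ]

    # Sample as many pairs as the annotation budget allows (e.g. for MTurk batches).
    sample = random.sample(candidate_pairs, k=min(1000, len(candidate_pairs)))
    print([(a["id"], b["id"]) for a, b in sample])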
What to Believe
Incentives and Authority
What does the content from this source usually look like?
• Normalize the input text to reduce dimensionality:
  – @country:IRAN@, @nuclear_disarmament@, @country:USA@, [backs, initial, sanctions, deal], @support@;
• Build a semi-lexicalized, feature-rich channel-based language model (LM);
• Estimate the likelihood that new content published is close to our expectation from this channel, by using a modified Query Likelihood Model:

  $P(\mathit{news} \mid \mathit{channel}) = \sum_{k=0}^{n} \log P(K \mid Q)$

  where k represents the semi-lexicalized feature, K the set of n-gram features rooted at k, and Q the semi-lexicalized channel LM.
Cluster all sources to categorize them.
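A toy sketch of the scoring step, assuming normalization has already replaced entities and topics with @…@ symbols: here the channel LM is just a smoothed unigram model over semi-lexicalized tokens, and the score is the summed log-probability of the new story's tokens under that model. The real model is richer; all tokens and counts below are invented.

    # Toy sketch of scoring new content against a channel's semi-lexicalized LM.
    # The "channel LM" here is a smoothed unigram model over normalized tokens;
    # the model described on the slide is richer, and all data is invented.
    import math
    from collections import Counter

    channel_tokens = (
        "@country:IRAN@ @nuclear_disarmament@ @country:USA@ backs initial sanctions deal @support@ "
        "@country:USA@ @sanctions@ eased after @support@ for deal"
    ).split()
    channel_lm = Counter(channel_tokens)
    vocab_size = len(channel_lm)
    total = sum(channel_lm.values())

    def log_likelihood(news_tokens):
        # Sum of log P(token | channel) with add-one smoothing.
        return sum(
            math.log((channel_lm[tok] + 1) / (total + vocab_size))
            for tok in news_tokens
        )

    new_story = "@country:IRAN@ @sanctions@ deal @support@".split()
    off_channel = "@sports@ @team:REAL_MADRID@ wins @trophy@".split()

    print("expected content:", log_likelihood(new_story))
    print("unexpected content:", log_likelihood(off_channel))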
Working with Text
Constrained Paraphrasing
What is the simplest way to say it?
• Syntactically more likely;
• Smallest number of unary or binary relations;
• Split complex propositions;
“Deutsche Lufthansa AG Chief Executive Officer Carsten Spohr asked customers
to be patient as he grapples with the longest strike in the airline’s history,
saying the future of all employees is at stake as he seeks to find a compromise
with pilots seeking to preserve benefits.”
• “Carsten Spohr is the CEO of Deutsche Lufthansa AG.”
• “Carsten Spohr asked customers to be patient.”
• “Carsten Spohr is dealing with Deutsche Lufthansa AG’s longest strike.”
• …
Much easier to translate accurately, or use for a variety of IR tasks.
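Purely as an illustration of the target representation (not the actual paraphrasing pipeline), a sketch that stores each extracted proposition as a small unary or binary relation and renders it back as a short sentence; the relations below are written by hand from the Lufthansa example.

    # Sketch of the target representation for constrained paraphrasing: each complex
    # sentence becomes a list of unary/binary relations that render as short sentences.
    # The relations below are hand-written from the Lufthansa example; a real system
    # would derive them from a parse.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Proposition:
        subject: str
        predicate: str
        obj: Optional[str] = None  # None for unary relations

        def render(self) -> str:
            if self.obj is None:
                return f"{self.subject} {self.predicate}."
            return f"{self.subject} {self.predicate} {self.obj}."

    propositions = [
        Proposition("Carsten Spohr", "is the CEO of", "Deutsche Lufthansa AG"),
        Proposition("Carsten Spohr", "asked customers to", "be patient"),
        Proposition("Carsten Spohr", "is dealing with", "Deutsche Lufthansa AG's longest strike"),
    ]

    for p in propositions:
        print(p.render())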
Working with Text
Along the Garden Path
Linguists have been known to talk about this.
But, it actually happens in practice too!
• “The horse raced past the barn fell.”
• “The cotton clothing is usually made of grows in Mississippi.”
• “Until the police arrest the drug dealers control the street.”
• “Fat people eat accumulates.”
This usually happens because of poorly constructed attributive statements that
could easily be reformulated with “that” constructions.
We need better normalization!
Working with Text
Concentrate on the Content
Apply a great algorithm to the wrong problem.
Boilerplate language is everywhere. It’s how we structure our documents,
it’s how we convey certain metadata.
• Authorship;
• Disclaimers;
• Wrappers;
“Lavante is the leading provider of Cloud-based supplier management
solutions for Fortune 1000 Companies. Our mission is to connect businesses
with their suppliers and we provide value to both by automating and
improving the quality, accuracy and cost effectiveness of business
interactions.”
Learn to identify boilerplate based on position, style, and structure markers.
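A hedged sketch of that idea: each paragraph becomes a few simple position/style/structure features (relative position in the document, length, all-caps ratio, marketing-style phrases) and a linear classifier scores it as boilerplate or content. The feature choices, training rows, and labels are invented for the illustration.

    # Illustrative sketch: classify paragraphs as boilerplate vs. content using
    # position/style/structure features and logistic regression. The tiny training
    # set and the feature choices are invented for the example.
    from sklearn.linear_model import LogisticRegression

    def features(paragraph: str, index: int, n_paragraphs: int):
        words = paragraph.split()
        return [
            index / max(n_paragraphs - 1, 1),                       # relative position in document
            len(words),                                             # length in words
            sum(w.isupper() for w in words) / max(len(words), 1),   # all-caps word ratio
            int("leading provider" in paragraph.lower()
                or "our mission" in paragraph.lower()),             # marketing-style phrases
        ]

    train_paragraphs = [
        ("Lavante is the leading provider of Cloud-based supplier management solutions.", 5, 6, 1),
        ("This press release contains forward-looking statements.", 6, 7, 1),
        ("Iron Mountain is considering an offer to buy Recall Holdings for more than $2 billion.", 0, 6, 0),
        ("The company reported revenue of $3.1 billion for the quarter.", 1, 6, 0),
    ]

    X = [features(text, idx, n) for text, idx, n, _ in train_paragraphs]
    y = [label for *_, label in train_paragraphs]

    clf = LogisticRegression().fit(X, y)
    print(clf.predict([features("Our mission is to connect businesses with their suppliers.", 5, 6)]))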
Working with Text
Identify what’s important
It’s not duplicate information. It’s a template.
• Auto-generated content;
• Regulatory releases;
• Pro-forma information;
FUND:            LYXOR ETF MSCI India Part B GBP
ISIN CODE:       FR0010375766
TRADING DATE:    29-Sep-14
NAV PER SHARE:   GBP 10.1966
NUMBER OF UNITS: 100000
CODE:            INRGBP
Warning contact: 0800 707 6956
Build a library of templates, extract only the relevant information.
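A small sketch of the template idea, assuming notices of this kind always follow the FIELD:/value layout above: one regular expression per template captures only the fields we care about. The field names come from the NAV example; the template registry itself is illustrative.

    # Sketch of template-based extraction for pro-forma notices: a known template is
    # reduced to a regex that pulls out only the fields of interest. The layout is
    # taken from the NAV example above; a real library would hold many such templates.
    import re

    NAV_TEMPLATE = re.compile(
        r"FUND:\s*(?P<fund>.+?)\s*"
        r"ISIN CODE:\s*(?P<isin>\S+)\s*"
        r"TRADING DATE:\s*(?P<date>\S+)\s*"
        r"NAV PER SHARE:\s*(?P<nav>.+?)\s*"
        r"NUMBER OF UNITS:\s*(?P<units>\d+)"
    )

    notice = """FUND:            LYXOR ETF MSCI India Part B GBP
    ISIN CODE:       FR0010375766
    TRADING DATE:    29-Sep-14
    NAV PER SHARE:   GBP 10.1966
    NUMBER OF UNITS: 100000
    CODE:            INRGBP
    Warning contact: 0800 707 6956"""

    match = NAV_TEMPLATE.search(notice)
    if match:
        print(match.groupdict())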
What is it about?
Disambiguation under adversity
Named Entity Disambiguation is hard.
• Context can be subtle:
  – “Michael Jordan stats”
  – “Spain advanced to the next round.”
• Need to exploit an understanding of the expected context for each candidate.
What if those creating the content don’t care how difficult it is?
• Traders come up with new names to describe securities on an intraday basis, depending on the behavior of the security, their mood, lunar cycles, etc.
• “Good luck trading the cats on a Wednesday.”
More than meme propagation.
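One hedged sketch of the "expected context" idea from the first bullet: each candidate entity keeps a bag of words it usually appears with, and a mention is resolved to the candidate whose profile overlaps most with the surrounding words. The profiles and the example mention are invented; adversarial, trader-invented names would need much richer, constantly updated profiles.

    # Sketch of expected-context disambiguation: resolve a mention to the candidate
    # entity whose context profile overlaps most with the surrounding words.
    # Profiles and the example mention are invented for the illustration.
    candidate_profiles = {
        "Michael Jordan (basketball player)": {"nba", "bulls", "points", "stats", "season"},
        "Michael I. Jordan (ML researcher)":  {"machine", "learning", "berkeley", "graphical", "models"},
    }

    mention_context = "michael jordan stats from the 1996 nba season".split()

    def score(profile, context):
        # Simple overlap count between the candidate's profile and the mention context.
        return len(profile & set(context))

    best = max(candidate_profiles, key=lambda name: score(candidate_profiles[name], mention_context))
    print(best)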
Propositional Structure
What is to be understood?
What if we could understand?
• Facts;
• Relationships;
• Events;
“Futura Venture Partners to buy Quip Ltd for $15m”
Event: Buy
Buyer: Futura [1434]
Object: Quip [3778]
Bid: $15,000,000
Cyc can generate English statements from statements in CycL. So we can
train a statistical MT model from this parallel corpus [XLIKE, Tadic, 2014].
Constrained vocabulary generates state-of-the-art results!
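As a toy illustration of the target structure only (not the Cyc/MT approach cited above), a single hand-written pattern that turns an acquisition headline into a Buy event with buyer, object, and bid; the entity IDs are those shown in the example, and the ID lookup is a stand-in for real entity disambiguation.

    # Toy illustration of extracting a propositional structure from an acquisition
    # headline. One hand-written pattern covers the "X to buy Y for $Nm" shape; the
    # entity-ID lookup is a stand-in for real named-entity disambiguation.
    import re

    ENTITY_IDS = {"Futura Venture Partners": 1434, "Quip Ltd": 3778}

    PATTERN = re.compile(r"(?P<buyer>.+?) to buy (?P<target>.+?) for \$(?P<amount>[\d.]+)m")

    def extract_buy_event(headline: str):
        m = PATTERN.search(headline)
        if not m:
            return None
        return {
            "event": "Buy",
            "buyer": (m.group("buyer"), ENTITY_IDS.get(m.group("buyer"))),
            "object": (m.group("target"), ENTITY_IDS.get(m.group("target"))),
            "bid_usd": float(m.group("amount")) * 1_000_000,
        }

    print(extract_buy_event("Futura Venture Partners to buy Quip Ltd for $15m"))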
The Available Information
Non-local Features
Look beyond n-gram context features.
Exercise: randomly select 100 sentences from news wire stories in the
business and financial domain. Ask a human evaluator to mark ambiguous
entities, missing context, unsubstantiated references.
• 73% of sentences contain some level of ambiguity;
• In roughly half of cases, ambiguities are resolved at the document
level;
• In many other cases, previous news or general market knowledge
can help to guide disambiguation;
Humans use more information when processing documents to make decisions and
to interpret them.
Why would we not?
The Available Information
More Evidence
Look at the entire document.
Topics, entities, word senses, tend to be coherent across an entire
document, so we can use document-level information to better guide our
extraction, disambiguation, and understanding.
Look at the entire stream.
Use collocation and correlations among prior topics, stories, events, and
entities to update the expectations around the current document or
statement being analyzed.
Look at the past history of a concept.
Concepts and entities tend to be subject to similar events, to repetitions,
and to homogeneity in topical context. We can build richer models based
on local and global interactions.
In Conclusion
An Incomplete Assortment of Challenges
Problems are interesting in the abstract.
But the real world has so many more surprises!
Questions
Answers?
?
[email protected]
James Hodson, AI Research (BRAIN) Lab