Sentence Level Classification as a Domain Adaptation Task within the Financial Press
February 28, 2013
Background
About Nick
✦ Physics and German Studies double major.
✦ MBA from University of Chicago in finance and statistics.
✦ Studies in FORTRAN 77, Japanese, Spanish and Latin at the UW.
✦ 16-year Wall Street career in investment banking, quantitative research, sales, derivatives trading and management.
✦ Today, a hedge fund manager using comp ling sentiment to time the markets.
Background
Madison Park Sentiment Trading
✦ Partner Johnny Hom (MIT) developed an eight-state sentiment model for assets during the LTCM Crisis of 1998, based on what traders say.
✦ Newspapers started going online from 1995 onward.
✦ Posited that market sentiment could be extracted from online resources in a kind of indirect sentiment survey.
✦ Since November 2001 Johnny has hand-scored 322,000 sentiment statements (the "MPCP Corpus").
✦ Daily sentiment voices are aggregated into both a sentiment heading and a signal strength.
Background
Market Sentiment Perceptual Map

[Figure: a perceptual map of market sentiment. One axis is Price Expectation (Lower Prices to Higher Prices), the other is Emotional Aspect (Negative Surprise / Selling Impetus to Positive Surprise / Buying Impetus). The eight sentiment states Joy, Contentment, Hope, Fear, Despair, Apathy, Caution and Greed are arranged across the Confident, Topping, Nervous and Bottoming market phases.]

Example statements:
"I feel very bullish about the market," XYZ Analyst from ABC Bank.
"We are cautiously optimistic about the market, but are not recommending more than a 30% weighting," XYZ Analyst from ABC Bank.
"We believe that the earnings outlook for the S&P 500 should hold at the $105 level," XYZ Analyst from ABC Bank.
"We are concerned about the outlook for the market and are not ready to invest," XYZ Analyst from ABC Bank.
Background
Recent Evolution of US Equity Market Sentiment

[Chart: time series of the daily US equity market sentiment reading. Annotations: "Continued bullish sentiment typical of January bull market phases." and "Federal Reserve minutes indicate that quantitative easing may come to an end sooner rather than later. (Market sells off sharply.)"]
Background
Sentiment Analytics
✦ Started trying to automate sentiment scoring using Latent Semantic Analysis in early 2009.
✦ Struggled with "bottom-up" vs. "top-down" approaches.
✦ Started using Perl CPAN and R libraries for NLP tasks.
✦ Automated collection of financial press articles in early 2011, with a database of 500,000+ articles (roughly 1,000 daily from market-related sections of 24 websites).
Background
Daily Article Collection Time Series
✦ We use Vector Space Models to determine relevance (a sketch follows below).
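A minimal sketch of that relevance check in base R, using hypothetical term-count vectors; the vocabulary, counts and 0.8 threshold below are illustrative assumptions, not the production Perl/R pipeline:

  # Cosine similarity between a candidate article and the centroid of
  # known market-relevant articles, both as term-count vectors over a
  # shared toy vocabulary.
  cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

  vocab    <- c("stocks", "bonds", "fed", "earnings", "recipe")
  article  <- c(3, 1, 2, 4, 0)   # term counts for a candidate article
  centroid <- c(5, 2, 3, 6, 1)   # average counts over known relevant articles

  cosine_sim(article, centroid) > 0.8   # keep the article if it clears a tuned threshold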
Background
Why CLMS?
✦ Came across the Agrawal and Yu paper in late 2011.
✦ The paper discussed classifying sentences into four types using supervised learning, an approach that could be applied to our eight sentiment states.
✦ Knew statistics, but did not have a background in machine learning.
✦ Big, long-term goal: use supervised machine learning to build an algorithm that scores new sentiment statements automatically.
Agrawal and Yu Paper
"Automatically classifying sentences in full-text biomedical articles into Introduction, Methods, Results and Discussion"
✦ Relatively recent, published in 2009.
✦ Prior papers had only looked at classifying medical paper abstracts.
✦ Scientific articles use categories within papers, but not all of the sentences may belong to that category.
✦ Applies an IMRAD ("Introduction, Methods, Results and Discussion") framework for classifying sentences.
✦ Sentences are the typical "independent unit" for text mining, information extraction and summarization.
✦ Classification into "rhetorical zones" can aid searches for specific types of results or methods, for example.
✦ Achieved a 91.55% F-score.
Agrawal and Yu Paper
IMRAD Framework

Introduction: PECAM-1 plays an important role in endothelial cell-cell and cell-matrix interactions, which are essential during vasculogenesis and/or angiogenesis (17,22).

Methods: Here, we examined expression of PECAM-1 mRNA in vascular beds of various human tissues and compared it with expression of PECAM-1 in human endothelial and hematopoietic cells.

Results: A short exposure of the blot probed with GAPDH is shown, because poly(A)+ RNA from the cell lines gives a strong signal within several hours compared with the total RNA from human tissue.

Discussion: Therefore, total RNA from various tissues required a much longer exposure to reveal GAPDH mRNA. Human tissue and cell lines expressed multiple RNA bands for PECAM-1, which may represent alternatively spliced PECAM-1 isoforms, the identity of which required further analysis.
Agrawal and Yu Paper
Methods
• Data: MEDLINE random sample of 2,960 sentences.
• Annotation: authors and biologists (high, medium, low rankings).
• Classification:
  • Baseline: assume the section label.
  • Rule based: 603 hand-created rules.
  • Supervised learning: Naïve Bayes, multinomial NB and SVM approaches.
• Features: uni-, bi- and tri-grams, plus verb tense and IMRAD-category features (see Table 1). Used chi-square and mutual information to select the top 2,600 features. (A sketch of this pipeline follows.)
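A sketch of that feature-selection-plus-classifier pipeline in R on made-up data. The e1071 package supplies naiveBayes() and svm(); note that e1071's naiveBayes() is not the paper's multinomial NB, and the data sizes and feature names below are toy stand-ins for the paper's 2,960 sentences and 2,600 selected n-grams:

  library(e1071)   # naiveBayes() and svm()

  set.seed(1)
  X <- matrix(rbinom(200 * 50, 1, 0.2), nrow = 200,
              dimnames = list(NULL, paste0("ngram", 1:50)))   # 200 sentences x 50 binary n-gram features
  y <- factor(sample(c("Intro", "Methods", "Results", "Discussion"), 200, replace = TRUE))

  # Score each n-gram against the class labels with a chi-square test
  # and keep the top k features (the paper keeps the top 2,600).
  chi2 <- apply(X, 2, function(f) suppressWarnings(chisq.test(table(f, y))$statistic))
  top  <- names(sort(chi2, decreasing = TRUE))[1:20]

  nb <- naiveBayes(as.data.frame(X[, top]), y)
  sv <- svm(X[, top], y, kernel = "linear")

  mean(predict(nb, as.data.frame(X[, top])) == y)   # resubstitution accuracy (toy)
  mean(fitted(sv) == y)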
Agrawal and Yu Paper
Methods
Table 1: Performance Across 10-fold Classification.

Classifier  Comment                                   Accuracy (%)  Std. Dev.
Baseline    Sectional Annotation                      77.81         ±4.03
Rule        Human Generated                           58.18         ±4.87
Non1        Full-sentences from test                  74.45         ±5.22
Non2        Iterative trainer                         72.77         ±5.24
Non3        Trained on abstracts                      66.94         ±4.49
Non4        Full-sentences random                     73.92         ±2.39
Man-Terms   Term features                             88.06         ±3.20
Man-IMRAD   Term features plus IMRAD category         91.34         ±3.09
Man-Tense   Term features plus verb tense             88.77         ±3.91
Man-All     Term features, tenses and IMRAD category  91.95         ±2.81
Agrawal and Yu Paper
Results
[Results figures and tables from the paper (three slides).]
Agrawal and Yu Paper
Conclusions
• Very impressive results without any major obvious shortcomings.
• It's surprising that they did not suggest applying the methodology to other domains.
• Provides a great basis for sentence level classification in the financial domain.
• Maximum Entropy may have generated higher accuracies owing to the cross-classification problem highlighted in Table 8.
Agrawal and Yu Paper
Takeaways
• Could have more categories.
• Abstracts are different from normal text.
• Top 2500 features worked best.
• Multinomial Bayes was the best model.

Table 1: Comparison of model accuracy (% ± standard deviation).

Metric              Multinomial Bayes  Naive Bayes  Support Vector Machine
Mutual Information  91.95±2.81         85.95±3.44   89.13±2.30
Chi-Squared         86.03±2.69         86.65±3.07   87.09±2.48
MPCP Corpus Project
Thesis Topic
• Fully automated collection, classification, sentiment detection and scoring for a given asset class.
• Will use MPCP's hand-annotated statements to train our classifier for sentiment scoring.
• Numerous other NLP tasks exist within the corpus, such as NER tracking and topic modeling.
• Could apply the classifier to other markets using a number of the MT papers discussed in Ling 575.
• Used Decision Trees for a number of machine learning tasks initially.
MPCP Corpus Project
The "Golden Essays"
US Equity Sentiment Statement Database - Nov 2001-Feb 2013

Sentiment State   Market Phase  Count    Frequency (%)
Joy               Topping        36,116  11.20
Joy/Contentment   Topping        11,450   3.55
Contentment       Topping        39,972  12.39
Contentment/Hope  Topping        27,075   8.39
Hope              Nervous         7,735   2.40
Hope/Fear         Nervous         9,808   3.04
Fear              Nervous        12,663   3.93
Fear/Despair      Nervous        18,541   5.75
Despair           Bottoming      24,661   7.65
Despair/Apathy    Bottoming      11,535   3.58
Apathy            Bottoming      46,282  14.35
Apathy/Caution    Bottoming      27,010   8.37
Caution           Greedy         13,176   4.08
Caution/Greed     Greedy          6,357   1.97
Greed             Greedy         15,043   4.66
Greed/Joy         Greedy         14,992   4.65
Totals                          322,551
MPCP Corpus Project
Project Topic
• General statistical survey of the financial domain for linguistic differences, similarities and possible sub-domains.
• Improve our article topic classifier using Maximum Entropy and Support Vector Machine (SVM) models vs. our Decision Tree baseline (a comparison sketch follows this slide).
• Build a sentence/paragraph level classifier similar to the Agrawal and Yu IMRAD approach.
• Domain adaptation focus will be on examining how training on the WSJ source adapts to the 23 other target financial press sources.
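A schematic R comparison of the three model families. The iris dataset stands in for our article-level feature vectors, and nnet::multinom() (a multinomial logit) stands in for Maximum Entropy; this is an illustration, not the production Mallet/R setup:

  library(rpart)   # Decision Tree baseline
  library(e1071)   # SVM
  library(nnet)    # multinom(): multinomial logit, i.e. a Maximum Entropy model

  dt <- rpart(Species ~ ., data = iris, method = "class")
  sv <- svm(Species ~ ., data = iris, kernel = "linear")
  me <- multinom(Species ~ ., data = iris, trace = FALSE)

  # Resubstitution accuracy per model (cross-validation would be used in practice).
  sapply(list(tree   = predict(dt, iris, type = "class"),
              svm    = predict(sv, iris),
              maxent = predict(me, iris)),
         function(p) mean(p == iris$Species))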
MPCP Corpus Project
News Article Corpus
US Financial Press Article Database - Mar 2010-Jan 2013

Source               Sub-Domain       Articles  Words       Avg. Length (words)
Bloomberg News       Core               58,650  41,957,672  715
CNBC                 Core               29,736  13,284,684  447
Financial Times      Core               48,379  26,486,352  547
Market Watch         Core               56,527  15,637,149  277
Reuters              Core               52,199  25,326,299  485
Wall Street Journal  Core               34,309  22,936,452  669
AP, AFP etc.         Core/Mainstream     3,750   2,008,477  536
NY Times             Core/Mainstream    32,735  18,162,409  555
CNN                  Mainstream          4,930   2,627,105  533
Fox News             Mainstream         25,404   9,837,877  387
LA Times             Mainstream          6,952   2,387,060  343
MSN                  Mainstream          1,293     480,918  372
Smart Money          Mainstream          3,712   2,968,556  800
USA Today            Mainstream          3,125   1,603,193  513
Yahoo                Mainstream          6,778   3,107,183  458
Barron's             Topical             7,708   3,871,801  502
Business Week        Topical            10,737   4,737,789  441
Fortune              Topical             4,153   1,686,748  406
TheStreet.com        Topical            46,992  27,943,901  595
Business Insider     Gossipy            57,074  17,507,712  307
Daily Finance        Gossipy             7,770   2,975,278  383
Minyanville          Blog                4,719     800,948  170
Seeking Alpha        Blog               28,752  18,288,708  636
WSJ Blogs            Blog                7,406   1,997,891  270
MPCP Corpus Project
Three Corpus Factoids
• The neuro-linguistic space of the financial press is a relatively well-behaved one:
  • Participants tend to use the same language over time, e.g. "P/E ratio, earnings, Federal Reserve, interest rates, stock prices", etc. (a stable distribution).
  • New terms are infrequent, but generally represent an important new meme in the market, e.g. "the New Normal", "Fiscal Cliff" and now "Sequestration".
  • Collocation n-grams carry deep and specific meaning, such as "1929 Crash", "Black Monday", "the 1936 Roosevelt Shock", etc. (a scoring sketch follows this slide).
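One way to put a number on that observation is pointwise mutual information over the corpus n-gram counts; a base-R sketch with hypothetical counts:

  # PMI of a candidate collocation such as "black monday": how much more
  # often the bigram occurs than chance co-occurrence of its parts.
  pmi <- function(bigram_count, w1_count, w2_count, n_tokens) {
    log2((bigram_count / n_tokens) /
         ((w1_count / n_tokens) * (w2_count / n_tokens)))
  }

  # Hypothetical counts; a high score flags a genuine collocation.
  pmi(bigram_count = 950, w1_count = 1200, w2_count = 1100, n_tokens = 5e7)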
MPCP Corpus Project
Financial Sentiment Measures
• Somewhat surprisingly, financial sentiment measures are relatively outdated:
  • Survey-based measures, e.g. AAII and NAIM.
  • Econometric proxy measures, e.g. Citigroup's Panic-Euphoria model, the VIX Index, credit spreads, etc.
  • Computational-linguistics-based measures, e.g. RavenPack's sentiment indices (bottom-up) and the Dow Jones Leading Indicators Index.
• Clearly, there are more applications of comp ling in the trading world, but they are likely protected as proprietary engines of profit, e.g. at Renaissance Technologies, MPCP et al.
MPCP Corpus Project
Article Classification
• Our Decision Tree approach falls short in two ways:
  • An article can contain multiple themes, i.e. combinations of stocks, bonds, commodities, foreign exchange, the economy and stock-specific topics.
  • R's implementation of DTs does not allow for the flexibility of Mallet's feature vectors.
• Although I did figure out how to use structural features, such as the number of $ or % signs in an article (a sketch follows this slide).
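A base-R sketch of deriving those structural features from raw article text; the counts would then be appended to the lexical feature vector (the feature names here are illustrative):

  # Count $ signs, % signs and a rough token length for one article.
  structural_features <- function(text) {
    count <- function(pattern) sum(gregexpr(pattern, text)[[1]] > 0)   # gregexpr() gives -1 when no match
    c(dollar_signs  = count("\\$"),
      percent_signs = count("%"),
      n_tokens      = length(strsplit(text, "\\s+")[[1]]))
  }

  structural_features("The 10-year yield fell to 1.965%, while equities added $12 billion in value.")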
MPCP Corpus Project
Article Zone Classification
• At a minimum, the WSJ follows a relatively standard alternating paragraph structure, much like IMRAD.
• Articles contain an Introductory paragraph.
• The balance of the article alternates between:
  • Market Recap paragraphs.
  • General Market commentary.
  • Specific market/security Pricing information.
  • Market Sentiment statements.
• We term this an IRMPS framework.
MPCP Corpus Project
Treasury Investors Just Can't Let Go of That Haven Feeling
By MIN ZENG

Defying a rally in the U.S. stock market, Treasury bonds rose Friday and chalked up price gains for the week.

Treasury bonds normally lose ground as a haven when stocks strengthen. The fact that both markets advanced on Friday suggests that some investors remain skeptical about the pace of global economic growth, traders said.

In addition, concerns have arisen whether the month's long bull run in the stock market is due for a selloff. The Dow Jones Industrial Average is up 6.8% this year.

"People are not convinced that the stock rally can keep on. That's why some of them buy bonds as insurance," said Gang Hu, co-head of U.S. rates and swaps trading at Credit Suisse Securities USA LLC in New York.

Late Friday in New York, the 10-year Treasury note rose 4/32 point, to 100 10/32, to yield 1.965%, compared with 1.977% late Thursday, as bond yields move inversely to prices. The 10-year yield fell 0.043 percentage point this week.

Treasury bond bears have argued that the pace of global economic growth will accelerate this year, which justifies a flight out of bonds and into stocks. This year, they have won out. The 10-year yield has risen from 1.759% at the end of last year and a record low of 1.404% set in July.

Even so, bond bulls remain skeptical, especially as the U.S. economy confronts fiscal austerity as the prospect of automatic federal spending cuts is set to kick in next month without a fiscal deal. Bond bulls' buying has helped hold the 10-year note's yield at about 2% over the past few weeks. This month, the yield has been kept in a range between 1.92% and 2.06%.

"There is some real buying on the expectation that, with only one more week to cut a deal, the spending cuts could be an actual event, which in turn is likely to send yields lower," said Kevin Giddis, head of fixed income at Raymond James in Memphis, Tenn.

Both the bull and bear camps found supporting evidence Friday. A gauge of business confidence in Germany posted the biggest monthly gain since July 2010, which initially pushed down Treasury prices. Supporting bond prices, European Union economists turned gloomier over the euro zone, expecting a second year of contraction in 2013 for the region.
MPCP Corpus Project
Ling 575 Project
• Examine sub-domain adaptation across the financial domain.
• Train article and sentence zone classifiers on WSJ hand-annotated data.
• Adapt the two models to the other 23 website sources and look for similarities and differences using KL divergence and other measures (a sketch follows this list).
• Attempt to improve the original model using bootstrapping and other domain adaptation methods covered in Ling 575.
• Develop the general basis and approach for my thesis on automatic scoring of sentiment.
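A base-R sketch of the KL-divergence comparison, assuming smoothed unigram distributions per source; the vocabulary and counts below are toy values, while the real comparison runs over the full corpus vocabulary:

  # D_KL(target || source) between two smoothed unigram distributions.
  kl_divergence <- function(target_counts, source_counts, eps = 1e-6) {
    p <- (target_counts + eps) / sum(target_counts + eps)   # add-epsilon smoothing
    q <- (source_counts + eps) / sum(source_counts + eps)
    sum(p * log(p / q))
  }

  vocab     <- c("yield", "treasury", "fed", "stocks", "selloff")
  wsj       <- c(120, 95, 60, 40, 10)   # hypothetical unigram counts (source)
  bloomberg <- c(110, 80, 70, 55, 25)   # hypothetical unigram counts (target)

  kl_divergence(bloomberg, wsj)   # larger value = target deviates more from the WSJ model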
MPCP Corpus Project
Ling 575 Project (Continued)
• Focus solely on the US Treasury bond sub-topic.
• US Treasury bond commentary is a very dry and serious topic.
• Extremely limited vocabulary, i.e. bond prices, inflation rates, credit spreads and interest rates.
• WSJ and Bloomberg exhibit highly structured article formats.
• Every source contains a decent sample of UST articles.
• Given such a "normalized" format, deviation in US Treasury articles should be readily apparent in the target domains.
MPCP Corpus Project
Corpus Article Counts - Baseline Decision Tree
MPCP Corpus Project
Conclusion
• Thanks for your attention.
• Questions?
• Who else still writes in Perl?