Sentence Level Classification as Domain Adaptation Task within the Financial Press
February 28, 2013

Background: About Nick
✦ Physics and German Studies double major.
✦ MBA from University of Chicago in finance and statistics.
✦ Studies in FORTRAN 77, Japanese, Spanish and Latin at the UW.
✦ 16-year Wall Street career in investment banking, quantitative research, sales, derivatives trading and management.
✦ Today, a hedge fund manager using comp ling sentiment to time the markets.

Background: Madison Park Sentiment Trading
✦ Partner Johnny Hom (MIT) developed an eight-state sentiment model for assets during the LTCM Crisis of 1998, based on what traders say.
✦ Newspapers started going online from 1995 onward.
✦ Posited that market sentiment could be extracted from online resources in a kind of indirect sentiment survey.
✦ Since November 2001, Johnny has hand-scored 322,000 sentiment statements (“MPCP Corpus”).
✦ Daily sentiment voices are aggregated into both a sentiment heading and a signal strength.

Background: Market Sentiment Perceptual Map
[Figure: perceptual map of the eight sentiment states. Axes: Price Expectation (Higher Prices, Lower Prices) and Emotional Aspect (Positive Surprise, Negative Surprise; Selling Impetus, Buying Impetus). Market phases: Confident Market, Topping Market, Nervous Market, Bottoming Market. States: Joy, Contentment, Hope, Fear, Despair, Apathy, Caution, Greed.]
Example statements shown on the map:
“I feel very bullish about the market”, XYZ Analyst from ABC Bank.
“We are cautiously optimistic about the market, but are not recommending more than a 30% weighting”, XYZ Analyst from ABC Bank.
“We believe that the earnings outlook for the S&P 500 should hold at the $105 level”, XYZ Analyst from ABC Bank.
“We are concerned about the outlook for the market and are not ready to invest”, XYZ Analyst from ABC Bank.

Background: Recent Evolution of US Equity Market Sentiment
[Figure: sentiment chart with two annotations.]
Continued bullish sentiment typical of January bull market phases.
Federal Reserve minutes indicate that quantitative easing may come to an end sooner rather than later. (Market sells off sharply.)
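
The slides do not say how daily statements are combined into a heading and a signal strength. One possible reading, offered purely as an assumption, is a circular mean: place the eight states evenly around a circle, average the unit vectors of the day's statements, and read off the angle (heading) and the length of the mean vector (strength). The state order, the 45-degree spacing and the function name `aggregate` are all illustrative choices, not MPCP's actual method.

```python
import math

# ASSUMPTION: the eight states sit 45 degrees apart on a circle, in the order
# used by the sentiment statement table. The real model may place them differently.
STATES = ["Joy", "Contentment", "Hope", "Fear",
          "Despair", "Apathy", "Caution", "Greed"]
ANGLE = {s: i * 2 * math.pi / len(STATES) for i, s in enumerate(STATES)}


def aggregate(statements):
    """Return (heading_degrees, strength) for one day's scored statements.

    strength is the length of the mean unit vector: 1.0 when every voice
    names the same state, close to 0.0 when opposing voices cancel out.
    """
    if not statements:
        raise ValueError("need at least one scored statement")
    n = len(statements)
    x = sum(math.cos(ANGLE[s]) for s in statements) / n
    y = sum(math.sin(ANGLE[s]) for s in statements) / n
    heading = math.degrees(math.atan2(y, x)) % 360
    strength = math.hypot(x, y)
    return heading, strength


if __name__ == "__main__":
    # Hypothetical day: mostly Joy with some Contentment and one Fear voice.
    day = ["Joy"] * 5 + ["Contentment"] * 3 + ["Fear"]
    print(aggregate(day))
```

Under this reading, a day of unanimous voices gives the maximum strength, while evenly split opposite voices give a strength near zero, so the heading is only informative when the strength is well above zero.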

Background: Sentiment Analytics
✦ Started trying to automate sentiment scoring using Latent Semantic Analysis in early 2009.
✦ Struggled with “bottom-up” vs “top-down” approaches.
✦ Started using Perl CPAN and R libraries for NLP tasks.
✦ Automated collection of financial press articles in early 2011 with a database of 500,000+ articles (roughly 1,000 daily from market-related sections of 24 websites).

Background: Daily Article Collection Time Series
✦ We use Vector Space Models to determine relevance.

Background: Why CLMS?
✦ Came across the Agrawal and Yu paper in late 2011.
✦ The paper discussed classifying sentences into four types using supervised learning, an approach that could be used to classify our eight sentiment states.
✦ Knew statistics, but did not have a background in machine learning.
✦ Big, long-term goal: use supervised machine learning to build a scoring algorithm for new sentiment statements automatically.

Agrawal and Yu Paper
“Automatically classifying sentences in full-text biomedical articles into Introduction, Methods, Results and Discussion”
✦ Relatively recent, published in 2009.
✦ Prior papers had only looked at classifying medical paper abstracts.
✦ Scientific articles use categories within papers, but not all of the sentences may belong to that category.
✦ Applies an IMRAD (“Introduction, Methods, Results and Discussion”) framework for classifying sentences.
✦ Sentences are the typical “independent unit” for text mining for information extraction and summarization.
✦ Classification into “rhetorical zones” can aid searches for specific types of results or methods, for example.
✦ Achieved a 91.55% F-score.

Agrawal and Yu Paper: IMRAD Framework
Introduction: PECAM-1 plays an important role in endothelial cell-cell and cell-matrix interactions, which are essential during vasculogenesis and/or angiogenesis (17,22).
Methods: Here, we examined expression of PECAM-1 mRNA in vascular beds of various human tissues and compared it with expression of PECAM-1 in human endothelial and hematopoietic cells.
Results: A short exposure of the blot probed with GAPDH is shown, because poly(A)+ RNA from the cell lines gives a strong signal within several hours compared with the total RNA from human tissue.
Discussion: Therefore, total RNA from various tissues required a much longer exposure to reveal GAPDH mRNA. Human tissue and cell lines expressed multiple RNA bands for PECAM-1, which may represent alternatively spliced PECAM-1 isoforms, the identity of which required further analysis.

Agrawal and Yu Paper: Methods
• Data: MEDLINE random sample of 2,960 sentences.
• Annotation: Authors and biologists (high, medium, low rankings).
• Classification:
  • Baseline: Assume section label.
  • Rule Based: 603 hand-created rules.
  • Supervised Learning: Used Naïve Bayes, multinomial NB and SVM approaches.
• Features: Uni-, bi- and tri-grams and features. Used chi-square and mutual information to select top 2,600 features.

Agrawal and Yu Paper: Methods
Table 1: Performance Across 10-fold Classification.
Classifier   Comment                                    Accuracy   Standard Deviation
Baseline     Sectional Annotation                       77.81      ±4.03
Rule         Human Generated                            58.18      ±4.87
Non1         Full-sentences from test                   74.45      ±5.22
Non2         Iterative trainer                          72.77      ±5.24
Non3         Trained on abstracts                       66.94      ±4.49
Non4         Full-sentences random                      73.92      ±2.39
Man-Terms    Term features                              88.06      ±3.20
Man-IMRAD    Term features plus IMRAD category          91.34      ±3.09
Man-Tense    Term features plus verb tense              88.77      ±3.91
Man-All      Term features, tenses and IMRAD category   91.95      ±2.81

Agrawal and Yu Paper: Results
[Three slides of result figures.]

Agrawal and Yu Paper: Conclusions
• Very impressive results without any major obvious shortcomings.
• It’s surprising that they did not suggest applying the methodology to other domains.
• Provides a great basis for sentence-level classification in the financial domain.
• Maximum Entropy may have generated higher accuracies owing to the cross-classification problem highlighted in Table 8.

Agrawal and Yu Paper: Takeaways
• Could have more categories.
• Abstracts are different from normal text.
• Top 2500 features worked best.
• Multinomial Bayes was the best model.
Table 1: Comparison of model accuracy.
Metric   Multinomial Bayes   Naive Bayes   Support Vector Machine
Mutual Information   Chi-Squared
91.95±2.81   86.03±2.69   85.95±3.44   86.65±3.07   89.13±2.30   87.09±2.48

MPCP Corpus Project: Thesis Topic
• Fully automated collection, classification, sentiment detection and scoring for a given asset class.
• Will use MPCP’s hand-annotated statements to train our classifier for sentiment scoring.
• Numerous other NLP tasks exist within the corpus, such as NER tracking and topic modeling.
• Could apply the classifier to other markets using a number of the MT papers discussed in Ling 575.
• Used Decision Trees for a number of machine learning tasks initially.

MPCP Corpus Project: The “Golden Essays”
US Equity Sentiment Statement Database - Nov 2001-Feb 2013
Sentiment State    Market Phase   Count     Frequency (%)
Joy                Topping        36,116    11.20
Joy/Contentment    Topping        11,450     3.55
Contentment        Topping        39,972    12.39
Contentment/Hope   Topping        27,075     8.39
Hope               Nervous         7,735     2.40
Hope/Fear          Nervous         9,808     3.04
Fear               Nervous        12,663     3.93
Fear/Despair       Nervous        18,541     5.75
Despair            Bottoming      24,661     7.65
Despair/Apathy     Bottoming      11,535     3.58
Apathy             Bottoming      46,282    14.35
Apathy/Caution     Bottoming      27,010     8.37
Caution            Greedy         13,176     4.08
Caution/Greed      Greedy          6,357     1.97
Greed              Greedy         15,043     4.66
Greed/Joy          Greedy         14,992     4.65
Totals                           322,551

MPCP Corpus Project: Project Topic
• General statistical survey of the financial domain for linguistic differences, similarities and possible sub-domains.
• Improve our article topic classifier using Maximum Entropy and Support Vector Machine (SVM) vs.
our Decision Tree baseline.
• Build a sentence/paragraph level classifier similar to the Agrawal and Yu IMRAD approach.
• Domain adaptation focus will be on examining how training on the WSJ source adapts to the 23 other target financial press sources.

MPCP Corpus Project: News Article Corpus
US Financial Press Article Database - Mar 2010-Jan 2013
Source                Sub-Domain        Articles   Words        Average Length
Bloomberg News        Core              58,650     41,957,672   715
CNBC                  Core              29,736     13,284,684   447
Financial Times       Core              48,379     26,486,352   547
Market Watch          Core              56,527     15,637,149   277
Reuters               Core              52,199     25,326,299   485
Wall Street Journal   Core              34,309     22,936,452   669
AP, AFP etc           Core/Mainstream    3,750      2,008,477   536
NY Times              Core/Mainstream   32,735     18,162,409   555
CNN                   Mainstream         4,930      2,627,105   533
Fox News              Mainstream        25,404      9,837,877   387
LA Times              Mainstream         6,952      2,387,060   343
MSN                   Mainstream         1,293        480,918   372
Smart Money           Mainstream         3,712      2,968,556   800
USA Today             Mainstream         3,125      1,603,193   513
Yahoo                 Mainstream         6,778      3,107,183   458
Barron’s              Topical            7,708      3,871,801   502
Business Week         Topical           10,737      4,737,789   441
Fortune               Topical            4,153      1,686,748   406
The Street.com        Topical           46,992     27,943,901   595
Business Insider      Gossipy           57,074     17,507,712   307
Daily Finance         Gossipy            7,770      2,975,278   383
Minyanville           Blog               4,719        800,948   170
Seeking Alpha         Blog              28,752     18,288,708   636
WSJ Blogs             Blog               7,406      1,997,891   270

MPCP Corpus Project: Three Corpus Factoids
• The neuro-linguistic space of the financial press is a relatively well-behaved one:
  • Participants tend to use the same language over time, e.g. “P/E ratio, earnings, Federal Reserve, interest rates, stock prices”, etc. (stable distribution).
  • New terms are infrequent, but generally represent an important new meme in the market, e.g. “the New Normal”, “Fiscal Cliff” and now “Sequestration”.
  • These collocation n-grams have deep and specific meaning, such as “1929 Crash”, “Black Monday”, “the 1936 Roosevelt Shock”, etc.

MPCP Corpus Project: Financial Sentiment Measures
• Somewhat surprisingly, financial sentiment measures are relatively outdated:
  • Survey-based measures, e.g. AAII and NAAIM.
  • Econometric proxy measures, e.g. Citigroup’s Panic-Euphoria model, VIX Index, credit spreads, etc.
  • Computational-linguistics-based measures, e.g. RavenPack’s sentiment indices (bottom-up) and Dow Jones Leading Indicators Index.
• Clearly, there are more applications of comp ling in the trading world, but they are likely protected as proprietary engines of profit, e.g. Renaissance Technologies, MPCP et al.

MPCP Corpus Project: Article Classification
• Our Decision Tree approach falls short in two ways:
  • An article can contain multiple themes, i.e. combinations of stocks, bonds, commodities, foreign exchange, the economy and stock-specific topics.
  • R’s implementation of DTs does not allow for the flexibility of Mallet’s feature vectors.
• That said, I did figure out how to use structural features such as the number of $ or % signs in an article.

MPCP Corpus Project: Article Zone Classification
• At a minimum, the WSJ follows a relatively standard alternating paragraph structure much like IMRAD.
• Articles contain an Introductory paragraph.
• The balance of the article alternates between:
  • Market Recap paragraphs.
  • General Market commentary.
  • Specific market/security Pricing information.
  • Market Sentiment statements.
• We term this an IRMPS framework.

MPCP Corpus Project
Treasury Investors Just Can't Let Go of That Haven Feeling
By MIN ZENG
Defying a rally in the U.S. stock market, Treasury bonds rose Friday and chalked up price gains for the week.
Treasury bonds normally lose ground as a haven when stocks strengthen. The fact that both markets advanced on Friday suggests that some investors remain skeptical about the pace of global economic growth, traders said.
In addition, concerns have arisen whether the months-long bull run in the stock market is due for a selloff.
The Dow Jones Industrial Average is up 6.8% this year.
"People are not convinced that the stock rally can keep on. That's why some of them buy bonds as insurance," said Gang Hu, co-head of U.S. rates and swaps trading at Credit Suisse Securities USA LLC in New York.
Late Friday in New York, the 10-year Treasury note rose 4/32 point, to 100 10/32, to yield 1.965%, compared with 1.977% late Thursday, as bond yields move inversely to prices. The 10-year yield fell 0.043 percentage point this week.
Treasury bond bears have argued that the pace of global economic growth will accelerate this year, which justifies a flight out of bonds and into stocks. This year, they have won out. The 10-year yield has risen from 1.759% at the end of last year and a record low of 1.404% set in July.
Even so, bond bulls remain skeptical, especially as the U.S. economy confronts fiscal austerity as the prospect of automatic federal spending cuts is set to kick in next month without a fiscal deal.
Bond bulls' buying has helped hold the 10-year note's yield at about 2% over the past few weeks. This month, the yield has been kept in a range between 1.92% and 2.06%.
"There is some real buying on the expectation that, with only one more week to cut a deal, the spending cuts could be an actual event, which in turn is likely to send yields lower," said Kevin Giddis, head of fixed income at Raymond James in Memphis, Tenn.
Both the bull and bear camps found supporting evidence Friday. A gauge of business confidence in Germany posted the biggest monthly gain since July 2010, which initially pushed down Treasury prices. Supporting bond prices, European Union economists turned gloomier over the euro zone, expecting a second year of contraction in 2013 for the region.

MPCP Corpus Project: Ling 575 Project
• Examine sub-domain adaptation across the financial domain.
• Train article and sentence zone classifiers on WSJ hand-annotated data.
• Adapt the two models to the other 23 website sources and look for similarities and differences using KL divergence and other measures.
• Attempt to improve the original model using bootstrapping and other domain adaptation methods covered in Ling 575.
• Develop the general basis and approach for my thesis on automatic scoring of sentiment.

MPCP Corpus Project: Ling 575 Project (Continued)
• Focus solely on the US Treasury bond sub-topic.
• US Treasury bond commentary is a very dry and serious topic.
• Extremely limited vocabulary, i.e. bond prices, inflation rates, credit spreads and interest rates.
• WSJ and Bloomberg exhibit highly structured article formats.
• Every source contains a decent sample of UST articles.
• Given such a “normalized” format, deviation in US Treasury articles should be readily apparent in the target domains.

MPCP Corpus Project: Corpus Article Counts - Baseline Decision Tree
[Figure: article counts per source from the baseline Decision Tree classifier.]

MPCP Corpus Project: Conclusion
• Thanks for your attention.
• Questions?
• Who else still writes in Perl?
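
As a note on the Ling 575 project slides, which propose KL divergence for comparing the WSJ source against the other 23 sites: a minimal sketch of that comparison on unigram distributions follows. The function names, the add-one smoothing and the two toy token lists are assumptions made for illustration; the slides do not specify the features or the smoothing to be used.

```python
import math
from collections import Counter


def unigram_dist(tokens, vocab, alpha=1.0):
    """Smoothed unigram distribution over a shared vocabulary.

    ASSUMPTION: additive (add-alpha) smoothing, so that no word has zero
    probability and the KL divergence stays finite.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) in bits; p and q must share the same keys."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)


# Hypothetical toy data standing in for a source (WSJ) and a target site.
wsj_tokens = "treasury bond prices rose as bond yields fell".split()
blog_tokens = "stocks soared while bond bears cried foul".split()

vocab = set(wsj_tokens) | set(blog_tokens)
p_wsj = unigram_dist(wsj_tokens, vocab)
p_blog = unigram_dist(blog_tokens, vocab)

if __name__ == "__main__":
    print("KL(WSJ || blog) =", kl_divergence(p_wsj, p_blog))
    print("KL(blog || WSJ) =", kl_divergence(p_blog, p_wsj))
```

KL divergence is asymmetric, so the direction matters: KL(source || target) asks how poorly the target's word distribution explains the source text. A symmetric alternative such as Jensen-Shannon divergence may be easier to compare across 23 targets.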