Timothy Loughran: Textual Analysis in Finance

This comes from a finance survey paper, and I'm going to review three things. One of them is sentiment analysis. Number two is Zipf's law, and I'm excited about that. And number three is financial document readability.

I have problems at home, like everybody else: four teenagers, also called my children. And what I want to talk about is language. What kind of language do teenagers use? You can tell me after twenty minutes if you disagree with me, but a big part of my problem is the language they use. I have become a researcher in this, so I can figure this out. What kind of words do they use? They use modal words. They use weasel words. These are the kind of words they use all the time: "possibly." They're just not saying anything. "I think I have homework." These are weak modal words, and they use them whenever they talk to me: out of a hundred words, 15 are weak modals.

Now what happens if the CEO is talking like a teenager? In that context it's not appropriate. Think about using weak modal words at a high frequency: "Well, this might happen," or "This could happen." At the end of the day you're like, "I've learned nothing from this conference call," or IPO prospectus, or annual report. And not surprisingly, the evidence shows that firms with a high fraction of weak modal words in their annual reports or IPO prospectuses have higher stock return volatility. Investors are having difficulty pricing these things; we're not getting information from management.

Paul talked a little bit about bag of words, but he didn't really define it. What this is, is counting the frequency of each word that appears within the document. The key is that the word sequence is completely ignored. I don't look at the context; I am just counting the number of weak modal words. It sounds simplistic: you just count the words. But it's actually very, very powerful.

And so, as Paul alluded to a little bit, to measure tone in documents, typically it's the proportion of negative words: more negative words in the document means a more negative tone. We just focus on the negative, the negative information.

Now imagine, we have a whole bunch of journalists here. Peter, how would you measure the sentiment of a financial document or a newspaper column? There are four ways. The first one is you survey people, asking what words they think are negative. From the last presentation: "catastrophic." Ooh, that's a great word, a bad word. Number two, we read a couple of newspaper articles, maybe annual reports, and kind of come up with a list. Number three would be off the shelf: we'll grab the Harvard word list, the Harvard General Inquirer, or GI, dictionary. Now this word list was not created with financial documents in mind. This is psychology and sociology. So what Bill and I found is that about 75 percent, that's three-quarters, of all the Harvard negative words actually don't have a pessimistic meaning when used in the context of an annual report. Wow, that's not doing so well.

So here are high-frequency Harvard negative words that appear in annual reports: board, tax, cost, capital, liability, vice, depreciation. None of these words are negative when they appear in an annual report. They're talking about the board of directors. They're talking about the vice presidents running everything. So there's a lot of misclassification when you use an off-the-shelf dictionary that was made for sociology.

What's most fascinating is that several of the Harvard negative words are likely to proxy for specific industries. Take "crude." What industry do you think would use that word a lot? I think it's the oil industry: crude oil.
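
The bag-of-words tone measure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mini word list `NEGATIVE_WORDS`, the sample sentence, and the naive tokenizer are hypothetical and are not the word list or the code used in the research discussed in the talk.

```python
import re
from collections import Counter

# Hypothetical mini word list, for illustration only. A real study would use
# a full word list built for financial text.
NEGATIVE_WORDS = {"loss", "losses", "impairment", "litigation", "adverse"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def negative_tone(text, negative_words=NEGATIVE_WORDS):
    """Return the proportion of tokens that appear on the negative word list.

    Word order is ignored entirely: only per-word counts matter (bag of words).
    """
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    negative = sum(n for word, n in counts.items() if word in negative_words)
    return negative / total


sample = "The company reported a loss and an impairment charge. Litigation costs rose."
print(f"negative tone: {negative_tone(sample):.3f}")
```

Because only the counts are used, shuffling the words of a document leaves the score unchanged, which is exactly the "word sequence is completely ignored" property.
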
Or "cancer." That's bad, isn't it? Cancer is a bad word. Now, in the context of a business document, do you think it's going to be bad? No, they're curing cancer, right? And the one that I think is the funniest is "mine." I don't even know if that's a negative word. How often do you think a mining company will mention this silver mine, this gold mine, this platinum mine? A boatload. So you have these high negative counts that are just proxies for industry. And you look at those words: are those negative words in the context of business? There's no way.

So what Bill and I did is we actually did the fourth method. We didn't survey people, we didn't look at a few documents, we didn't use off-the-shelf. We actually looked at the words that managers use in annual reports, and a word had to appear in at least five percent of the 10-Ks.

Now, I could say, "Jason, here are the ten key words to avoid," and he would write his column, one of his fifty-two columns, and he'd be all set. The problem with that is it's easy to avoid those words. Don't say the word "catastrophic." That's really easy. So Bill and I said, we're going to get every last word, so you can't write around it. And we also looked at the most likely interpretation of a word. Once in a while "cancer" could be a bad word, but generally it's not, so it wouldn't be on our word list. What we ended up with is extensive word lists: 354 positive words and 2,329 negative words. Isn't that kind of fascinating? In the English language there are a lot more ways of saying bad things than good.

So I'm going to show you the top ten words, and this is the power of the bag of words. I can quickly show good old Jason the ten most frequently occurring negative words in annual reports: loss, losses, claims, impairment, against, adverse, restated (you don't like to see that in an annual report), adversely, restructuring, and litigation. So I can show you: look, these are the words that drive my analysis, and you can decide whether you agree or not.

What's amazing is that these are ten words. I just said there are 2,329 negative words, so that's 0.4% of all the words. Guess what: these ten words account for 33% of all the negative words occurring in an annual report. Isn't that amazing when you think about it? I could use all these negative words, but really ten account for a third.

Why do you think that is? This is our second point: Zipf's law, and it's the driving force behind these tripwires of word classification. Word counts tend to follow a power-law distribution. In other words, there are only a few words that will dominate the frequencies in these word dictionaries. And this is important if we mess up. Say we have a bad word in there: we're looking at the oil industry, we use the Harvard word list, and we let "crude" go in. It will drive our analysis, and that's not even a negative word. So if one of these words is misclassified, it will potentially drive the analysis.

You can't really see this too well, but what this is, is the frequencies of the 25 most frequently occurring words in all 10-Ks and 10-Qs over this time period. You basically see the same kind of relation I was just describing: just a handful of words drives the analysis. So the most common words appearing in 10-Ks, are you ready for this? "And," "in," that's it. They have a very, very high frequency, just as with the negative words.

So now we're going to go on to our third point, which is trying to measure financial readability. I had some people sitting next to me, and I kept asking them, "How do you measure financial readability? What is readability?" And we never got it pinned down.
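
Going back to the Zipf's law point, the following sketch shows how a handful of words can dominate the counts of a dictionary. It assumes an idealized power law in which the count of the word at rank r is proportional to 1/r (exponent 1). That exponent, the `scale` value, and the helper names are assumptions made for illustration; real word counts only approximately follow such a distribution, and none of these numbers are data from the talk. The list size of 2,329 is the negative word count quoted in the talk.

```python
def zipf_counts(n_words, scale=1_000_000.0, exponent=1.0):
    """Return idealized counts for ranks 1..n_words: scale / rank**exponent."""
    return [scale / rank ** exponent for rank in range(1, n_words + 1)]


def top_k_share(counts, k):
    """Return the fraction of all occurrences contributed by the k largest counts."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    return sum(ordered[:k]) / total


# 2,329 is the size of the negative word list mentioned in the talk;
# the counts themselves are synthetic.
counts = zipf_counts(2329)
share = top_k_share(counts, 10)
print(f"top 10 of {len(counts)} words: {share:.1%} of all occurrences")
```

The same `top_k_share` function can be applied to actual word-list hit counts from a document collection to check how much a single misclassified high-frequency word would move a tone measure.
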
I think this is really hard, isn't it? We have journalists here: how would you measure readability? Why are some texts easier to understand than others? Regulators and financial researchers have really struggled with this notion of how you measure readability in mandated financial disclosures. So Bill and I, in another paper, proposed defining readability as the effective communication of valuation-relevant information. That's why they're creating an annual report: they're trying to disseminate information. Does everybody agree with that? That's really the purpose of what's going on. I'm trying to tell you about my firm, and the better job I do of telling you about my firm, the lower my stock return volatility should be, the lower the analyst forecast errors should be, and the lower the analyst dispersion should be. I think most of you would say that makes some sense: the better the communication of what's going on, the lower the volatility and the errors.

Now, there's something called the Fog Index. The accounting literature mostly, but also some of the finance literature, has really picked up on it as a measure of readability. And you know what's beautiful about it? It is so simple. That's what sells, isn't it? How many components do you think are in the Fog Index? Two, that's it, you got it right. There are only two components. This is easy. The first one, Harry, is the average sentence length in words, and the other one is complex words. Now, Paul would sit there and say, "Ooh, complex words, these are high-level words." No, no, no. In the Fog Index a complex word is one with more than two syllables. I can't even make that up. More than two syllables, that's complex. And the beauty of this is it tells you the grade level. Look at this high-level formula: we take the average words per sentence, which in financial 10-Ks, annual reports, is about twenty-five; we add the percentage of complex words, which is about twenty; and we multiply by 0.4. You need eighteen years of education to read a typical U.S. annual report. Everybody follow? That's pretty simple.

Now let's focus on this complex word component. Do you think that's a good measure of readability? No, it's actually a horrible measure, and it's 50% of the weight. It's got two components and one of them is horrible. The idea in the Fog Index is that an increase in the number of complex words, words with more than two syllables, decreases readability, and that accounts for half of the index. However, we're talking about business text. We're not talking about Dr. Seuss. Business text is full of complex words. These are the most frequently occurring complex words in U.S. annual reports: financial, company (I think I've got that one down), interest, agreement, including (that's a pretty low bar), operations, period, and related. I could keep going, showing you more and more of these, and you're like, none of these are hard. They're really obvious. "Management" is the next one. I think most readers of a 10-K, when they see the word "management," are not going to be forced to consult their dictionaries.

So in academics, what so often happens is that if I say, "Ooh, this is bad," people say, "OK, the Fog Index is bad. Give me an alternative." That's the way academics work: it's easier to criticize, but it's hard to be positive. So what we came up with is using the document file size, just the number of megabytes required to store the document on the SEC website, as a simple and admittedly imperfect proxy for readability. File size correlates very strongly with the number of words. The problem with the number of words is that I have to actually count them all, which is a little bit problematic: I need a good dictionary to make sure that these are actually words and not typos. And we show that the file size, how big the document is, relates to post-filing return volatility and other measures of the information environment in a manner consistent with the notion of readability. Once again, I'm thinking about the dissemination of information.

Do you think annual reports are becoming more or less readable? What do you think? (We're doing a good job on time; we're almost done.) I think they're becoming less readable myself. What I'm going to show you is a graph of the time series of the number of words in U.S. annual reports. I did this in honor of this conference, just pulled this out. In terms of the number of words in these documents, it's up over a hundred percent. So now the annual report of the average publicly traded company is about fifty thousand words. They are doing data dumps on us. They're getting longer and longer, and when they get longer and longer, it's less likely that people will actually pick one up and read it.

So, just a quick review; we're doing great for time. Word lists designed specifically for business communication should be used to measure sentiment in business text. I don't want to use sociology words in my business text. Words like crude or cancer or mine or board or liability or depreciation are not negative words in a business context. I think you all agree with that. I talked about Zipf's law, which documents the fact that a very small number of words will dominate the frequency counts. So if you're going to create a dictionary, you want to show people: here are all the words, here are the words that drive it. We're talking about full disclosure here. By its very nature, I think most of us would agree, business text has an extremely high percentage of complex words. Once again, a complex word is just one with more than two syllables, according to the Fog Index, and that's one of its two components. And when you look at the words that investors use, they're very easily understood. They're not using SAT words. So when you use the word readability in the context of financial documents, you want to think about what that means. It's really hard to pin down, but I really believe, and Bill agrees, that it deals with the dissemination of information. That's why they're creating these mandated documents: I want to tell you how things are going. And I think that the size of an annual report is an easily calculated proxy for document readability.

Audience: I have a couple of technical questions. One is about the risk factors section. It has a lot of negative words, and I was always wondering, is that just a baseline? Also, when there's a lawsuit, they'll repeat it like 14 times in a 10-K. The other thing is, with the file size, are you excluding exhibits? Because those can be pretty lengthy.

Loughran: Oh, we include exhibits. We look at the whole thing. On the risk factor section, you're right, they tend to repeat these things, and I think the risk factors do tend to be a little bit negative. On a technical note, sometimes it's difficult to parse correctly, because we have like 60,000 of these documents; there's like 12.5 billion words that we look at. So it's hard to actually go in and grab exactly the risk factor section and all these other little attributes. You're right, some sections could be better or worse, but we grab the whole thing.

Audience: Did you correlate with volatility? I mean, did you show that the dispersion...

Loughran: Yeah, I didn't show the regression results. These measures do correlate. I know my audience. I was joking with Dennis and David about whether they were going to show regressions; I'm like, I don't know if you want to do that.

Audience: Can you use your methodology to analyze the statements from the Federal Reserve?

Loughran: Some people are doing that. Andrew is looking at that. But it's one of those things where they know you're looking at it, so it gets a little bit complicated.

Audience: I wonder what's driving this. Is that legalese? Is it companies intentionally trying to fog, to not use plain English?

Loughran: It's what Paul was saying about the lawsuits. If we look at one IPO prospectus, when that company went public it was chock full of legal words, because they were sued by every single entity in the world. Every state was suing them, every vendor was suing them, and it went on and on.

Audience: I think it's a really interesting presentation. I'd just like to go back a little bit to the Fog Index, because I was thinking about this in the context of my dealings with the IRS. I understand all the words individually in an IRS form, but together they make no sense to me, how these words are bound together. You're saying that in practice a lot of these corporate reports are becoming more unreadable, and yet the words presumably are remaining pretty much the same. Is it just the length? Is it also sentence complexity? Is there an added element to that, do you think?

Loughran: I don't know. Once again, going back to the Fog Index, average words per sentence is probably a pretty good proxy. But the complex words, the more-than-two-syllables part... You know what's fascinating about this? Do you think jargon would be positive or negative? Every single readability study says jargon is bad. Bill and I find that jargon is good. I feel that's because I can quickly communicate with good old Hugh here: when I say the word "depreciation," he knows exactly what I mean. Or I could spend all this time talking about it. "Depreciation" is a bit of a jargon word. So what we find is that the more jargon, the better the information content, which is kind of counterintuitive. But once again, it's in the context of business, so the analysts and investors are like, "Yeah, I know that word."

Audience: So you have customized these specifically for annual reports, right?

Loughran: You said it perfectly. Many people actually use our word list in other contexts, even though it was created just for annual reports, looking at the language that they use. Tweets are hard because they use slang, like "wicked." I would think wicked is a bad thing. No, a "wicked stock" is good. And it kind of moves across time: sometimes that's a good word, sometimes that's a bad word.

I think a lot of it is the SEC: the more disclosure the better. And when I talk to managers, any time they have something in their risk factor section, it never gets removed. So these documents are getting pretty big. We have seen the EDGAR server logs, how many people hit any of these filings at all, and it is shocking how low the numbers are. People are not getting these things. This is a high-end group here [inaudible]. They're getting so big.
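
The Fog Index arithmetic discussed in the talk can be sketched as follows. The formula used here is the standard Gunning Fog form, 0.4 × (average words per sentence + percentage of complex words). The 25 words per sentence and the 18 years of education are figures from the talk; the 20 percent complex-word share is an assumption chosen so that the arithmetic reproduces the quoted 18 years. The syllable counter is a crude vowel-group heuristic, included only to make the "more than two syllables" rule concrete, and all function names are hypothetical.

```python
import re


def fog_index(avg_words_per_sentence, pct_complex_words):
    """Fog Index: 0.4 * (average sentence length + percent of complex words)."""
    return 0.4 * (avg_words_per_sentence + pct_complex_words)


def count_syllables(word):
    """Approximate syllables as the number of vowel groups (rough heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def is_complex(word):
    """A word is 'complex' in the Fog Index sense if it has more than two syllables."""
    return count_syllables(word) > 2


def fog_index_of_text(text):
    """Estimate the Fog Index of a text with naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    avg_len = len(words) / len(sentences)
    pct_complex = 100.0 * sum(is_complex(w) for w in words) / len(words)
    return fog_index(avg_len, pct_complex)


# Figures from the talk, with the 20 percent share as the stated assumption.
print(fog_index(25, 20))

# Everyday business words are flagged as complex by the syllable rule.
for word in ["financial", "company", "management", "loss"]:
    print(word, is_complex(word))
```

Note how ordinary business vocabulary such as "company" or "management" counts as complex under the syllable rule, which is the weakness of that component highlighted in the talk.
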

