European Journal of Operational Research 202 (2010) 789–801
On the predictive ability of narrative disclosures in annual reports
Ramji Balakrishnan a,*, Xin Ying Qiu b, Padmini Srinivasan c

a The University of Iowa, Tippie College of Business, Iowa City, IA 52246, USA
b Christopher Newport University, Luter School of Business, Newport News, VA 23606, USA
c The University of Iowa, Computer Science Department and Tippie College of Business, Iowa City, IA 52246, USA

* Corresponding author. Tel.: +1 (319) 335 0958; fax: +1 (319) 335 1956. E-mail addresses: Ramji-Balakrishnan@uiowa.edu, ramji.balakrishnan@gmail.com (R. Balakrishnan).
Article history: Received 6 February 2009; Accepted 18 June 2009; Available online 30 June 2009.

Keywords: Economics; Finance; Text mining; Capital markets
Abstract

We investigate whether narrative disclosures in 10-K and 10-K405 filings contain value-relevant information for predicting market performance. We apply text classification techniques from computer science to machine-code text disclosures in a sample of 4280 filings by 1236 firms over five years. Our methodology develops a model using documents and actual performance for a training sample. This model, when applied to documents from a test set, leads to performance prediction. We find that a portfolio based on model predictions earns significantly positive size-adjusted returns, indicating that narrative disclosures contain value-relevant information. Supplementary analyses show that the text classification model captures information not contained in the document-level features of clarity, tone, and risk sentiment considered in prior research. However, we find that the narrative score does not provide information incremental to traditional predictors such as size, market-to-book, and momentum, but rather affects investors' use of price momentum as a factor that predicts excess returns.

© 2009 Elsevier B.V. All rights reserved.
1. Introduction
A primary use of accounting reports is to help investors evaluate an organization's financial prospects. Narratives are an important information source for analysts and a critical component of annual reports (Rogers and Grant, 1997). A majority of the financial analysts surveyed by the AIMR (2000) indicate that management discussion is a very or extremely important item when evaluating firm value. However, perhaps because of the relative costs of gathering and analyzing numerical versus textual data, most academic research has focused on the quantitative disclosures in annual reports. Moreover, because of the flexibility in framing these disclosures with respect to the choice of words and tone in addition to content, it is likely that the information in narratives is not fully impounded into contemporaneous prices (see Li, 2006 for additional observations in this regard). In this study, we modify and apply techniques from the text classification branch of computer science to the narrative disclosures in 10-K and 10-K405 filings in order to predict market returns.
In a training sample, we pair the narrative disclosure in the 10-K documents with the subsequent performance and use standard text classification techniques to build a predictive model. In particular, we define out- and under-performing firms as the top (bottom) 25% of all firms, and group firms into three classes (out-performing, average, and under-performing) based on their actual performance from period t to t+1. We then use the text disclosed in period t (which relates to performance for the period t−1 to t) and the performance class to build a model that associates the text in a 10-K report for a period with the next period's performance. This automated text-classification exercise, which employs many features such as the number, frequency, and count of words that are similar (dissimilar) across documents, yields a model that can classify the text for an arbitrary firm as to its predicted performance. We test the model's predictive ability by applying it to the documents for period t+1 (this testing sample of documents relates to performance for t to t+1) to predict performance for the period t+1 to t+2. We use these predictions to form and maintain a portfolio. Specifically, for each year, our equally weighted portfolio buys stock in firms we predict to out-perform the market and sells predicted under-performers. The magnitude of the size-adjusted returns for the portfolio is then a joint test of the presence of value-relevant information in narrative disclosures and our ability to systematically extract it. (Of course, like the anomalies literature, our analysis also assumes that the information is not impounded immediately into prices.)
On average, our portfolio yields a size-adjusted return of 12.16% per year. Our classifier is word-based, i.e., it extracts key words from the texts and uses these as features to build predictive models. We conduct additional tests to examine the extent to which
our (word-level) model adds to models built using (document-level) meta-features such as clarity and tone. Motivated by prior research, we add the following three meta-features to our model: the fog index (Li, 2006), risk sentiment (Henry, 2006a), and optimistic tone (Davis et al., 2006). (As a check, we replicated the association between changes in clarity (the fog index) and market returns. A portfolio based on changes in clarity leads to a size-adjusted return of 10%, a magnitude similar to that reported in Li, 2006.) Adding these three document-level meta-features to the text classification model, however, leads to statistically similar returns (11.74% for the augmented model versus 12.16% for the raw model). In contrast, a model that contains only the three textual meta-features does not generate any excess return whatsoever. We also find that while the meta-features distinguish between average and extreme performance (as also predicted by our model), they are less able to distinguish the direction of the difference (i.e., between out- and under-perform), particularly for sub-samples of firms. That is, while the scores on the meta-features of out- and under-performing firms differ from those of the average firm (i.e., their reports are denser and reflect greater risk sentiment), the two groups do not differ between themselves. We conclude that the word-level text classifier model we employ captures more information than is represented in the meta-features of clarity, risk sentiment, and tone, suggesting a fruitful role for word-based text classification methods in accounting and finance research.
Next, to gain insight into the source of the information content, we examine quantitative properties of the firms in the predicted classes. These univariate comparisons indicate that the text-based disclosure score we develop is correlated with attributes such as size and market-to-book. Indeed, sub-sample analyses indicate a greater portfolio return for glamour versus value firms, and for small versus large firms. Thus, one explanation for our results is that our text classification captures firm attributes that may be readily computed using financial data. While the correlation between text disclosure and financial characteristics is an interesting finding in itself, it also is possible that the text disclosures affect the association between numeric dimensions and market performance. We examine these conjectures by regressing excess returns on known factors such as earnings surprise, price momentum, firm size, and the market-to-book ratio. We find no main effect for the score on narrative disclosures, suggesting that the disclosure score provides no new information over that provided by known financial factors. However, we find that the coefficient on the interaction term with price momentum reliably differs from zero. (The interaction with the market-to-book ratio is weakly significant.) We infer that differences in disclosures across firms affect confidence in the numerical estimates, a finding consistent with firms with differing profiles following differing text disclosure strategies.
Our study makes both methodological and economic contributions. Recent literature (e.g., Li, 2006; Davis et al., 2006; Henry, 2006a,b; Tetlock et al., 2006) that conducts large-sample studies of the characteristics of narrative disclosures considers specific dimensions of narrative disclosure.[1] Li (2006) examines how changes in a firm's fog index (a measure of readability) correlate with earnings prospects and persistence, thereby shedding light on managerial incentives to alter readability. In a levels study, Davis et al. (2006) consider the association between the tone (optimistic/pessimistic) of current reports and future ROA. Henry (2006a) conducts an event study that links the tone of press releases with market reactions.[2] In contrast, our text-classification algorithm offers three key methodological advantages. First, it simultaneously considers all aspects of the disclosure, such as length and word choice, thereby avoiding the need to impose an external model to generate meta-features such as optimism or readability. Allowing an unconstrained relation lets the predictive model capture complex interactions among features. This attribute is particularly important for the analysis of text because the relations can occur at the word, sentence, and/or document level. And yet (the second advantage), our approach is open to including meta-features such as the fog index, thereby helping us understand the information captured by meta-features. (We perform such an extension.) Third, our approach can be readily extended to include other text sources, such as analyst forecasts, economic reports, or industry analyses, that also might be relevant for firm valuation and for predicting performance. Indeed, it is possible to differentially weight these sources in terms of their credibility, freshness, and so on, extensions that are not possible with current approaches that rely on features developed from external models.[3]

Economically, we show that current-period disclosure quality is associated with future returns and that the disclosures affect the confidence in estimates.[4] Our results indicate considerable benefits from research that refines such predictive models by enlarging the dataset (e.g., adding economic forecasts) and by conditioning the model on parameters such as industry and product life cycle. Overall, the techniques we explore in this paper point toward a rich set of questions that parallel the use of numeric disclosures and examine the use of narrative disclosures by market participants as well as the management incentives connected with such disclosures.
The rest of this paper proceeds as follows: Section 2 describes our research question and Section 3 provides an overview of the methodology. We discuss the sample selection process and provide sample descriptions in Section 4. We report results in Section 5 and offer concluding remarks in Section 6.
2. Background
Beginning with the seminal work of Ball and Brown (1968), a vast literature examines whether and how market participants employ financial reports to evaluate a firm's future performance and, thus, its value. Fields et al. (2001), Kothari (2001), and Healy and Palepu (2001) provide recent surveys. In contrast to the attention paid to the properties of, and the information contained in, the financial data disclosed by firms, there is a paucity of research examining narrative disclosures. However, such narratives are an important information source for analysts and a critical component of annual reports. Rogers and Grant (1997) found that the management discussion and analysis (MD&A) section of annual reports constituted the largest proportion of information cited by analysts. They state (p. 17), "[I]n total, the narrative portions of the annual report provide almost twice as much information as the basic financial statements."
[1] There is also a research stream (Barron et al., 1999; Clarkson et al., 1999; Subramanian et al., 1993; Smith and Taffler, 2000) that primarily relies on hand-coded classification of a small sample of firms when investigating its research question.
[2] Henry (2006b) considers a partitioning algorithm (CART) and shows that including data about key words and document style improves classification accuracy. She performs a 10-fold analysis on contemporaneous data. That is, the model is trained with 90% of the observations and tested on the remaining 10%. Thus, the model is not implementable because it uses data from the same period to predict returns. That is, she uses actual data from 1998 for 90% of firms to predict returns in 1998 for the remaining 10% of firms. In contrast, our approach and tests lead to an implementable approach in that we use actual data from 1998 to predict returns for 1999.
[3] The disadvantage is that the underlying model is not transparent because it might be non-linear. Although not our focus, with additional structure and analyses, it is possible to determine the relative "weights" of the attributes.
[4] Botosan (1997) and Botosan and Plumlee (2000), who follow the convention of using the AIMR ranking of corporate disclosure as a measure of disclosure quality, are notable exceptions.
Similarly, a survey of financial analysts by the Association for Investment Management and Research found that "86% of surveyed financial analysts indicated that management's discussion of corporate performance was an 'extremely' or 'very' important factor in assessing firm value" (AIMR, 2000).
Research corroborates practitioners' claims that the narrative in an annual report contains value-relevant information. For instance, using quality scores provided by analysts, Clarkson et al. (1999) find that the quality of forward-looking information in the MD&A directly relates to the firm's upcoming performance. Botosan (1997) studies the association between disclosure level and the cost of equity capital, and finds that voluntary disclosures substitute for analyst following in lowering the cost of capital. Bryan (1997) shows that discussions of future operations and planned capital expenditures are associated with one-period-ahead changes in sales, earnings per share, and capital expenditures. Barron et al. (1999) find that high MD&A quality (in terms of compliance with the disclosure requirements) reliably reduces errors in analysts' earnings forecasts. We also interpret the SEC's plain English disclosure rules as acknowledging the importance of narrative disclosures when evaluating earnings and cash flow (Firtel, 1999).
The source of the information content in narrative disclosures is subtle and hard to measure. Subramanian et al. (1993) find that well-performing firms used "strong" writing in their reports, while poor performers' reports contained significantly more jargon and modifiers and were hard to read. Smith and Taffler (2000) identify thematic keywords from chairmen's statements and generate discriminant functions to predict company failure. Kohut and Segars (1992) study presidents' letters in annual reports and suggest that, as a communications stratagem, poorly performing firms tend to emphasize future opportunities over poor past financial performance. Lang and Lundholm (2000) find that "optimistic" pre-announcement disclosures of equity offerings lower the cost of equity capital.
Because of the difficulty in data collection and measurement, early studies that examine the qualitative aspects of disclosure usually employ hand-collected data and examine small samples. They also typically rely on experts to code the quality of disclosure (e.g., AIMR scores). Recognizing these limitations, Core (2001) suggests that measuring disclosure quality could greatly benefit from the techniques of other research areas such as computer science, computational linguistics, and artificial intelligence. There also is interest in developing analyses that test the information content and the predictive ability of narrative disclosures in large-sample studies with automatic coding of data.
Recent research (Li, 2006; Henry, 2006a,b; Davis et al., 2006) has responded to this call. Typical examples include Li (2006), who shows that changes in the readability of the MD&A section are predictive of future returns, and Davis et al. (2006), who show that tone (a count of pessimistic versus optimistic words) is associated with future ROA. Note that, like our study, Li (2006) assumes that the market price does not instantaneously impound the information contained in narrative disclosures. We view these papers as positing a relation between some dimension(s) of textual data and future performance. Thus, these papers construct a measure (e.g., fog index, count of positive words) of the single dimension typically studied (readability, optimism), and use traditional statistical methodology such as OLS regressions to test the association between the measure and performance. The values of, and relations among, the parameter estimates form the basis for inferences about patterns in the data.
Our innovation is the use of an algorithmic approach (see also Henry, 2006b) to develop a predictive model.[5] Our approach, which draws from foundations in computer science, focuses on predictive accuracy and treats the data structure or pattern as an unknown. The goal is to let the algorithm "learn" the underlying model using the most relevant information from the entire set available. Thus, the focus is not on generating model parameters but on fitting the best possible model. Such an approach confers at least three advantages:

• We can simultaneously consider many different aspects of the disclosure, such as length, readability, and word choice, thereby avoiding the need to specify ex ante the meta-features of interest, such as optimism or readability. Such an unconstrained relation lets the predictive model discover and capture complex interactions among features. This attribute is particularly important for the analysis of text because the relations can occur at the word, sentence, and/or document level. Indeed, we can (and, in our extensions, do) include document-level meta-features such as the fog index, thereby helping us understand the information captured by meta-features.
• The approach can be easily extended to include other information sources, such as economic reports. Including such data is particularly useful because market participants parse the annual report in the broader economic context and in light of the other information available to them.[6] Indeed, current developments in computer science allow for models that differentially weight information sources in terms of their credibility, freshness, and so on.
• We can use the model to identify sub-sets of the population that systematically differ in terms of the information content of their disclosure.[7]
Because of these advantages, the use of algorithmic text classification models is widespread in diverse areas such as marketing, biomedicine, music, law, and web crawlers (Dave et al., 2003; Popa et al., 2007; Pérez-Sancho et al., 2005; Thompson, 2001; Pant and Srinivasan, 2005), although their use in finance and accounting is nascent. The primary disadvantage is that the method does not readily yield parameters that we could use to assess the statistical/economic significance of individual dimensions and/or sources. While possible, such analyses require the researcher to impose considerably more structure and are left open for future research.[8]
[5] Our method differs from the CART method in Henry (2006b) in that we do not sequentially add measures of constructs to partition the data. Rather, the entire set of words is used to construct a model.
[6] For instance, Asquith et al. (2006) examine the information content of the qualitative analysis provided by equity analysts.
[7] As an example, consider a model that tests the ability of film reviews to predict box office receipts. We can then identify the reviewers whose reviews consistently out-perform reviews by other reviewers. Studying this sub-sample of reviews can help us understand the features that make a review more predictive of box office success. Similarly, we can use this methodology to find sub-sets of firms whose narrative disclosures are more informative regarding market and/or accounting performance. We can then study these disclosures to glean the reason why.
[8] The two approaches are complements. The algorithmic approach can potentially help identify the constructs and an outline of the model. We can then employ traditional statistical methodology to fit the model and identify parameters.
3. Methodology
We focus on whether we can use narrative disclosures to construct measures that predict firms' performance. Constructing such an algorithm requires that we define (1) a method to quantitatively represent a document's narrative disclosure, (2) measures of a firm's performance, and (3) a model that enables the use of the disclosure measure from step (1) to predict performance as defined in step (2). We address these issues next. (See Appendix A for a non-technical description of the text classification problem; Mahinovs and Tiwari (2007) provide an accessible review of the literature. See http://videolectures.net/mlas06_cohen_tc/ (accessed on 7/8/08) and Sebastiani (2002) for an in-depth review of the area.)
Briefly, text classification is the task of automatically putting documents into predefined categories. (A ready example is assigning news articles to topics such as politics, sports, or culture.) This classification task comprises several steps, the first of which is text representation. For this step, we employ standard text representation techniques used in computer science, with suitable modifications for financial reports. Consistent with the literature (Sebastiani, 2005), we first stem the words in a document to their morphological roots (e.g., running is stemmed to run) and eliminate common words such as a and the. We then represent the document as a vector of stems using the "bag of words" approach. The approach is so named because it uses all the terms (stems) in a document regardless of the order or position of the terms. Loosely, the set of "independent variables" in the model is the set of all stemmed words. We can then map a document into n-space, treating each term as a dimension and using a numerical weight for each stem. This weight is usually a function of the frequency of the stem in the document and in the full collection of documents (Hand, 2001).[9] Naturally, because the method treats each unique term as a separate dimension, this step leads to a large term space. Accordingly, the next step is to reduce the term space and generate a smaller vocabulary (loosely, to identify the words that have the greatest ability to distinguish among documents). This step is particularly important in our study because the term space generated from 10-K reports is of extremely high dimension. We employ the document frequency (DF) method to reduce the term space. This method ranks words by the number of documents that contain the word and uses a threshold level to reduce the number of words considered. Yang and Pedersen (1997) show that the DF method produces an overall efficiency gain by eliminating less informative terms and reducing the vocabulary size without sacrificing classification accuracy. Finally, we use the term frequency * inverse document frequency (TF*IDF) method (Singhal et al., 1996), the most commonly used weighting scheme, to estimate the term weights for the individual terms identified by the DF method. Intuitively, this weighting scheme assumes that the best descriptive terms for a given document are those that occur very often in the given document (term frequency) but not much in other documents (inverse document frequency) (Salton and Buckley, 1988). Note that the document frequency is calculated in the context of our collection of 10-K filing documents. Thus, these words will do well in separating the considered document from other documents. In this way, we represent each document as a point in the n-dimensional term space.
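The representation pipeline described above (stemming, stop-word removal, DF thresholding, and TF*IDF weighting) maps directly onto standard library calls. The following is a minimal sketch in Python using scikit-learn and NLTK, not the authors' actual implementation; the toy corpus and the min_df threshold are illustrative assumptions.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = [
    "The company expanded operations and revenues increased.",
    "Operating losses increased and the company restructured.",
    "Revenues increased while the company reduced expenses.",
]  # toy stand-ins for 10-K narrative sections

stemmer = PorterStemmer()
# Reuse the default tokenization and stop-word removal, then stem each token.
base_analyzer = TfidfVectorizer(stop_words="english").build_analyzer()

def stemmed_analyzer(doc):
    return [stemmer.stem(token) for token in base_analyzer(doc)]

# min_df implements the DF cut: drop stems appearing in too few filings
# (a paper-scale run would use a much higher threshold than this toy value).
vectorizer = TfidfVectorizer(analyzer=stemmed_analyzer, min_df=2)
X_train = vectorizer.fit_transform(train_texts)  # one row per filing, one column per stem
```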
Step 2 in our method is to identify the predictive attribute of interest. We focus our analysis on size-adjusted returns because the market return is the metric of most interest to shareholders, analysts, and other users.[10] This performance measure becomes the (n+1)th dimension associated with a document. In this context, note that predicting a specific value of a certain performance measure is a harder task than predicting a category of a performance measure, because a real-valued prediction is more granular than a category prediction. As an exploratory study, we start with a coarser approach and classify firms into three classes relative to their peers: under-performing, average, and out-performing. Each year, we rank the firms by their actual performance for the next year, and use the 25th and 75th percentiles to define the cutoffs for the three classes.
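Concretely, the three-class labeling can be expressed in a few lines. Here is a sketch (NumPy assumed), where next_year_sar is a hypothetical array holding each firm's realized size-adjusted return for the following year:

```python
import numpy as np

def label_performance(next_year_sar, lo=25, hi=75):
    """Classify firms relative to peers: -1 under-, 0 average, +1 out-performing."""
    next_year_sar = np.asarray(next_year_sar, dtype=float)
    lo_cut, hi_cut = np.percentile(next_year_sar, [lo, hi])
    labels = np.zeros(len(next_year_sar), dtype=int)
    labels[next_year_sar <= lo_cut] = -1   # bottom 25%: under-performing
    labels[next_year_sar >= hi_cut] = +1   # top 25%: out-performing
    return labels
```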
For step 3 in our method, ideally, we would develop a mapping between a firm's disclosure vector (as developed in step 1) and the performance measure (as defined in step 2). The classical statistical approach (which includes studies that examine one or more specified aspects of the text) finds parameters that fit a specified model to the data. Our approach differs in that we do not adopt a model or specify the attributes of interest. Rather, akin to a neural net, we let the data-driven text-classification algorithm "learn" the potentially non-linear and multi-faceted relation between the text attributes and future returns. Essentially, the model seeks to construct a hyperplane in the n-dimensional term space that best separates the data points according to their categories.[11]
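The paper does not name its classifier here, but footnote [11] describes three two-way classifications whose predictions are combined. A linear SVM in a one-vs-rest arrangement matches that description; the following is a sketch under that assumption (scikit-learn assumed; y_train and test_texts are hypothetical, continuing the earlier snippets):

```python
from sklearn.svm import LinearSVC

# LinearSVC fits one separating hyperplane per class against the rest --
# effectively the three two-way classifications described in footnote [11].
classifier = LinearSVC()
classifier.fit(X_train, y_train)           # y_train: labels from label_performance
X_test = vectorizer.transform(test_texts)  # next year's filings, mapped to the same term space
predicted = classifier.predict(X_test)     # -1, 0, or +1 for each firm
```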
Once we "train" the model, we apply it to a hold-out sample (in our case, to the annual reports for the next year). The output from this analysis is a prediction for each firm in the hold-out sample as to its category: out-perform, average, or under-perform. We then construct equally weighted portfolios based on these predictions. That is, we allocate the same dollar amount to two sets of firms: we buy firms predicted to out-perform and sell firms expected to under-perform. The size-adjusted return earned by the portfolio is our measure of incremental value and of the predictive ability of narrative disclosures.
3.1. Design
For our design, a data point represents the results from a particular measure and year. We draw the training document set and the test document set from adjacent years. We use documents that report performance for the period t−1 to t (available at time t) to build the predictive model.[12] We then apply the model to the documents reporting results for year t (available at time t+1) to predict the performance category for the period t+1 to t+2. (Notice that the standard 10-fold validation used in text classification, as in Witten and Frank (2000), is not sensible in this context, as we build a single model for each year in the sample.) Based on the classification, we examine whether an implementable trading strategy based on predictions from our model earns a positive size-adjusted return. Such a test is interesting because predictive accuracy is a relatively coarse performance metric. Further, portfolio returns attach an endogenous cost to prediction successes and failures. Finally, a returns test is the appropriate measure to examine whether there is incremental information content in the narrative disclosures relative to the information impounded into contemporaneous prices.

[9] In accounting, Smith and Taffler (2000) and Hussainey et al. (2003) show that counts of keywords are related to bankruptcy and to the association between current earnings and future stock returns.
[10] Unlike the two accounting metrics, the return metric impounds other information not reflected in the firm's financial statements because market prices are based on forward-looking information (Kothari, 2001). Thus, the market return is the hardest to predict. On the other side, a firm's management exercises greater control over accounting data. Even though there is ongoing debate on whether earnings management is generally opportunistic or strategic (e.g., Arya et al., 1998, 2003), there is broad consensus that firms employ discretionary accruals to manage reported income. Such practices add noise to the accounting measures we consider.
[11] The method is ideally suited to binary classifications. Because we have three classes, we perform three two-way classifications and combine the predictions to generate an overall classification. See the last paragraph of Appendix A for details.
[12] The number of years to consider when building a predictive model is an interesting question. We could use all available data to construct the model, weighting recent years more. We use a conservative approach and employ only the most recently available information. In essence, our approach assumes that the patterns unearthed in last year's annual report hold for the current year's annual report and can help predict performance in the forthcoming year.
We calculate a portfolio return as the average size-adjusted return difference between the out-performing firms and the under-performing firms for each year. We report results for a 25–50–25% cut-off for defining the three classes of out-performing, average, and under-performing firms. (We verified robustness with a 10–80–10 cut-off.) We calculate a portfolio-level return for a buy-and-hold strategy (see Fig. 1). Specifically, consider the model constructed using documents for the year ending 12/31/1999 (data available March 2000) and calendar-year 2000 performance (data available in March 2001). We apply this model to documents available in March 2001, make predictions, and measure the cumulative size-adjusted return for the portfolio from April 1, 2001 to April 1, 2002. We verified that such a strategy is implementable in that the documents are available before 3/31. Further, because the SAR for a random portfolio is zero by construction, this return is the incremental return relative to a random portfolio.
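Putting the design together, the yearly walk-forward and the long–short portfolio return can be sketched as follows. This is a schematic under the assumptions of the earlier snippets; docs[t] and sar[t] are hypothetical per-year containers of filing texts and next-year size-adjusted returns, and fit_year_model is a hypothetical helper wrapping steps 1–3 above.

```python
import numpy as np

def yearly_portfolio_return(sar_next, predicted_class):
    """Equal-weighted long-short return: long predicted out-performers (+1),
    short predicted under-performers (-1)."""
    sar_next = np.asarray(sar_next, dtype=float)
    long_leg = sar_next[predicted_class == +1].mean()
    short_leg = sar_next[predicted_class == -1].mean()
    return long_leg - short_leg

returns = []
for t in sorted(docs)[:-1]:                      # train on year t, test on year t+1
    vec, clf = fit_year_model(docs[t], sar[t])   # hypothetical: steps 1-3 above
    pred = clf.predict(vec.transform(docs[t + 1]))
    returns.append(yearly_portfolio_return(sar[t + 1], pred))
print(np.mean(returns))                          # average annual excess return
```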
We perform two analyses to understand the source of any excess return. Our first approach checks whether our disclosure score is picking up known document-level features such as the fog index or risk sentiment. For each document, we add these features to the term space and construct a new model, and we use the predictions of the augmented model to construct portfolios. If these meta-features are incrementally useful, the predictions of the augmented model should yield greater returns relative to the base model; a sketch of this augmentation follows.
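Mechanically, augmenting the term space amounts to appending three extra columns to the TF*IDF matrix. A sketch (SciPy assumed; fog, risk, and tone are hypothetical per-document arrays; the log transformation follows note 2 of Table 3, with a shift assumed for tone, which can be negative):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Stack the three document-level meta-features as extra columns of the term space.
meta = np.column_stack([
    np.log1p(fog),         # readability
    np.log1p(risk),        # risk sentiment (a count, so log1p is safe)
    np.log1p(tone + 1.0),  # tone lies in [-1, 1]; shift before the log (assumption)
])
X_augmented = hstack([X_train, csr_matrix(meta)])
```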
Our second approach employs cross-sectional regressions. We estimate:

SAR = α + β1·Dummy + β2·Size + β3·MTB + β4·PM + β5·Earnings Surprise + β6·Size×Dummy + β7·MTB×Dummy + β8·PM×Dummy + β9·Earnings Surprise×Dummy + error,

where: SAR = size-adjusted buy-and-hold return for the year; Dummy = 1 if the firm is classified as out-performing and 0 for predicted under-performing firms (average firms are excluded from this analysis); Size = the size of the firm, measured as the natural logarithm of total assets; MTB = market-to-book ratio (a valuation proxy), using the closing market price as of the start of the holding period; PM = price momentum, measured as the SAR for the six months preceding the start of the holding period; Surprise = actual EPS − forecast EPS, where the forecast is the latest available consensus analyst forecast.
Our choice of regressors stems from studies (e.g., Jegadeesh et al., 2004) that examine the incremental information content of analyst forecast revisions after controlling for factors known to affect returns. In the above regression, a positive coefficient β1 is consistent with the narrative disclosures providing incremental value-relevant information to market participants. A non-zero interaction term is consistent with the narrative disclosure altering the confidence market participants place in the numeric estimates.
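The regression, including the interaction terms and the firm-clustered standard errors mentioned in the notes to Table 4, could be estimated as sketched below with statsmodels; the DataFrame df and its column names are hypothetical, not the authors' code.

```python
import statsmodels.formula.api as smf

# df columns (hypothetical names): sar, dummy, size, mtb, pm, surprise, firm_id
model = smf.ols(
    "sar ~ dummy + size + mtb + pm + surprise"
    " + dummy:size + dummy:mtb + dummy:pm + dummy:surprise",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["firm_id"]})
print(model.summary())
```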
4. Data and descriptive statistics
The primary data for our experiments include the firms' financial data, size-adjusted returns, and the firms' annual reports. To increase the homogeneity of firms in the sample, we restrict the sample to firms in the manufacturing industry (SIC codes 2000 to 3999) with December as the fiscal year-end month. The sample period is from 1997 to 2002 (we include return data for 2003 as well).

We ensure data integrity and accuracy by using the values of gvkey from the COMPUSTAT database, permno from the CRSP database, and OFTIC from I/B/E/S to identify a unique firm. We collect financial data for 1236 unique firms.
Fig. 1. Overview of design. (The original figure shows a timeline across years t−1, t, and t+1, pairing each year's annual report, Doc_t−1 and Doc_t, with the size-adjusted returns SAR_t−1, SAR_t, and SAR_t+1.)

where:
SAR_t = size-adjusted return cumulated from April 1 of year t to March 31 of year t+1.
Doc_t−1 = annual report for year t−1, usually available in March of year t.

A. For firms in year t−1, build a predictive model of year t using the firms' SAR in year t (i.e., the size-adjusted return cumulated from April of year t to March of year t+1) and the annual reports for year t−1, which are usually published in March of year t.
B. For firms in year t, apply the predictive model built in step A to the annual reports for year t, which are published in March of year t+1, and predict the class of SAR performance of these firms in year t+1, i.e., the 3-class of SAR (size-adjusted return cumulated from April of year t+1 to March of year t+2).
C. On March 31 of year t+1, given a set of predicted out-performing firms and a set of predicted under-performing firms from step B, sell the under-performing firms' stocks at a total value of (for example) 10 million dollars and buy the out-performing firms' stocks at a total value of 10 million dollars. In both the buying and the selling transactions, allocate equal values of stock among the firms. On March 31 of year t+2, sell the stocks of the out-performing firms and buy back the stocks of the under-performing firms. If our prediction was correct, this transaction should generate a non-negative profit.
Each annual report has an accession code as its unique identifier. We manually download from Mergent Online the accession codes of the annual reports for each firm from 1997 to 2002. We then automatically retrieve the annual reports from the EdgarScan website using the downloaded accession codes.

There are 10 different submission types for annual reports: 10K (10K filings), 10K405 (10K filings where the Regulation S-K Item 405 box on the cover page is checked), 10K405A (amendments to 10K405), 10KA (amendments to 10K filings), 10KSB (10K filings for small businesses), 10KSBA (amendments to 10KSB), 10KSB40 (optional form for small businesses where the Regulation S-B Item 405 box on the cover page is checked), 10KSB40A (amendments to 10KSB40), 10KT (10K transition report), and 10KTA (amendments to 10KT). We focus on the major submission types of 10K and 10K405. Our final usable sample, with matching financial performance measures, comprises 4280 annual reports from 1236 firms published in 1997 to 2002.
Using the CRSP database, we calculate the size-adjusted cumulative return as the size-adjusted buy-and-hold return cumulated for 12 months from April 1 following the fiscal year-end to the next April. We verify that the relevant documents are available and that the strategy is implementable.
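As a sketch, the buy-and-hold size-adjusted return can be computed by compounding monthly firm returns over the April-to-March window and subtracting the compounded return of the firm's size-matched benchmark; the monthly return arrays are assumed inputs.

```python
import numpy as np

def buy_and_hold_sar(firm_monthly_returns, benchmark_monthly_returns):
    """Compound twelve monthly returns (April through March) for the firm and for
    its size-matched benchmark, and return the difference."""
    firm = np.prod(1.0 + np.asarray(firm_monthly_returns)) - 1.0
    bench = np.prod(1.0 + np.asarray(benchmark_monthly_returns)) - 1.0
    return firm - bench
```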
4.1. Sample description
Table 1, Panel A provides the number of observations considered for each analysis. We begin with 4755 documents but make only 3529 predictions because of missing data and because of the lagging nature of the predictive model. We also trim the top and bottom 1% of observations on size-adjusted returns and other classification variables to reduce the influence of outlying observations.[13] We use 3070 observations in the sub-sample analysis because we could not collect the classification data required for 1997.
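The trimming step is a one-liner in pandas; a sketch, with a hypothetical column name:

```python
def trim_extremes(df, col, pct=0.01):
    """Drop observations below the 1st or above the 99th percentile of `col`."""
    lo, hi = df[col].quantile([pct, 1.0 - pct])
    return df[(df[col] > lo) & (df[col] < hi)]

sample = trim_extremes(sample, "size_adjusted_return")
```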
Panel B provides descriptive data for our sample, with each observation representing a firm-year. Over all years, sample firms have mean sales of $2749 million and median sales of $336 million, indicating the presence of several large firms in the sample. The average ROE is 3.48%, while the median is 7.77%. The mean and median values for the market-to-book ratio are 1.95 and 1.18, respectively.

Panel C of Table 1 provides the industry breakdown for sample firms. We do not find any significant clustering of industries specific to our sample. Tests (not reported) do not reveal any systematic difference between the spread of firms in our sample and the distribution of all COMPUSTAT firms from the relevant SIC codes. We also do not find systematic differences in key firm characteristics. Thus, our results appear generalizable, albeit only to manufacturing firms.
Table 1
Descriptive statistics.

Panel A: Sample selection
  Item                                               Firm-years   Notes
  Total number of documents (1997–2002)                   4755
  Do not have SAR data                                    (295)
  Potentially useful in SAR exercise                       4460
  Truncated extreme observations                           (180)  Trimmed top and bottom 1% of observations
  Documents used to develop SAR prediction model           4280   Statistics presented in Table 1
  Loss due to 1-year-ahead prediction                      (751)  We do not have predictions for 1997, the first year with documents
  Documents with portfolio experiment results              3529   Results presented in Tables 2 and 3 and Panel A of Table 4
  No data for sub-sample classification                    (459)
  Available for sub-sample analysis                        3070
  Trimmed for extreme observations                         (124)  Top and bottom 1% of observations removed for each variable
  Net available for sub-sample analysis                    2946   Results presented in Panel A of Table 3
  Used only extreme quintiles in regressions               1473
  Lost due to missing data                                 (406)
  Used in regression                                       1007   Results presented in Panel B of Table 4

Panel B: Sample characteristics
  Item                     N (firm-years)    Mean       Median     25th pctile   75th pctile
  Sales (millions)                   4099    $2749.66   $336.44    $65.93        $1666.91
  Net assets (millions)              4099    $3153.43   $385.90    $95.52        $1735.89
  ROE                                3073    3.48%      7.77%      −10.64%       18.09%
  EPS                                4255    −$0.248    $0.63      −$0.18        $1.39
  Size-adjusted return               4280    −2.23%     −12.91%    −41.11%       18.48%
  Market-to-book ratio               4098    1.95       1.18       0.64          2.36

Panel C: Industry composition
  SIC codes    Number of firms in sample
  20–25        99
  26           38
  27           32
  28           276
  29–32        59
  33           43
  34           38
  35           171
  36           185
  37           55
  38           199
  39           23
  Total        1236

[13] In the accounting and finance literatures, such trimming is standard when dealing with security returns. The average return in the bottom (top) 1% is close to −100% (well over +100%), which is not representative of average returns.
We use the size-adjusted cumulative return as the key metric of firms’ financial performance. As noted earlier, this metric contains the
market response information that is generally not reflected in the financial statements.
5. Results
The dependent variable in our analysis is the portfolio size-adjusted return, rebalanced each year. We construct an equally weighted buy-and-hold portfolio that sells the predicted under-performing firms and buys the predicted out-performing firms. We ensure that we employ an implementable strategy by verifying that all of the documents were available before April. We calculate annual returns for the prediction period (April to April). For robustness, we replicate the analysis both for the 25–50–25 partition (reported) and for the 10–80–10 partition of the sample for identifying out- and under-performing firms.
Panel A of Table 2 presents results on the cumulative size-adjusted return by year. For both partitions, we find a significant return for every year except 1998 and 2000 (when we find a significantly negative portfolio return). One reason for this anomaly might be the considerable turbulence experienced by financial markets during 2000 (see, for example, Barber et al., 2003). On average, we find an annual excess return of 12.16% using the 25–50–25 partition and 6.59% using the 10–80–10 partition for developing the model. These estimates are consistent with earlier research that hints at the considerable information content of narrative disclosures. These results also suggest that the market has difficulty immediately parsing the information content of the disclosures, so that this information shows up in the return for the next year.[14]
In Panel B, we report the number of firms classified as out- and under-performing, by year, for each of our partitions. These data show that, while the predictive model was constructed using a 25–50–25 partition of actual performance, the number of firms predicted to out- or under-perform is not 25% of the hold-out sample. For instance, only 608 firm-years are predicted to out-perform when the naïve expectation is 882 (= 3529 total observations classified × 0.25). (Using a proportions test, this difference is statistically significant.) Thus, as is intuitive, our predictive model is better able to pick up "extreme" differences from the average firm relative to smaller differences.
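The proportions test referenced above compares the observed share of out-perform predictions with the naïve 25% benchmark. A sketch using statsmodels (the paper does not specify the exact test variant; a one-sample z-test is assumed):

```python
from statsmodels.stats.proportion import proportions_ztest

# 608 of 3529 hold-out firm-years predicted to out-perform vs. a naive 25% benchmark.
z_stat, p_value = proportions_ztest(count=608, nobs=3529, value=0.25)
print(z_stat, p_value)
```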
Panel C of Table 2 presents results for sub-samples of firms. We investigate two partitions, based on market-to-book and on firm size. For the first set of results, we determined the median market-to-book value for each year. We then partitioned sample firm-years into the value or glamour categories based on their value relative to the median for the relevant year.
Table 2
Average return difference between predicted out-performing and under-performing firms. Model based on documents for year t and performance for year t+1; tested on documents for year t+1 and performance for year t+2.

Panel A: Portfolio returns
  Year      25–50–25% performance class definition    10–80–10 performance class definition
  1998        −2.46%                                     4.69%
  1999        63.68%                                    42.62%
  2000       −36.84%                                   −45.73%
  2001        19.60%                                    19.11%
  2002        16.82%                                    12.25%
  Average     12.16%                                     6.59%

Panel B: Number of out-performing and under-performing firms predicted, as in Panel A
            25–50–25% class definition                 10–80–10 class definition
  Year      Predicted out   Predicted under            Predicted out   Predicted under
  1998       98              126                        45              54
  1999      110               83                        61              41
  2000      206              110                        78              46
  2001       97              170                        55              54
  2002       97              112                        56              37
  Total     608              601                       295             232

Panel C: Portfolio returns (sub-sample analysis)
  Year      Large size    Small size    Value       Glamour
  1999       18.08%        88.40%        7.76%       83.36%
  2000      −18.78%       −41.08%      −17.49%      −46.66%
  2001       16.61%         0.00%        3.68%       15.04%
  2002       11.08%         9.58%       11.90%       13.85%
  Average     6.75%        14.23%        1.46%       16.39%

Notes:
1. Cell entries represent the portfolio-level buy-and-hold size-adjusted return for a year beginning April 1 and ending March 31. The portfolio is long the predicted out-perform firms and short the predicted under-performers.
2. The performance class specification relates to the performance cutoffs used to define classes in the training sample.
[14] Inspection suggests that general market volatility affects participants' ability to gainfully use narrative disclosures to predict future performance. Systematic analysis of this inference is hampered because we have only 5 observations. Extending the analysis to more years (and/or using quarterly reports) is one way to obtain enough data to test this conjecture.
We then re-estimated the textual model for each of the sub-samples separately. We repeated the exercise for size, using total assets as the measure.
Our textual model indicates greater value relevance in the disclosures made by glamour firms and by small firms. Portfolios based on the model predictions have a size-adjusted return of 14.23% on average for small firms but only 6.75% for large firms. Similarly, expectations about future growth drive the valuations of glamour firms more than those of value firms. Again, we find size-adjusted returns of 16.39% (1.46%) when we form portfolios for glamour (value) firms. In other words, our findings show that firms grouped on readily observable metrics such as size and market-to-book ratio follow detectably different text disclosure strategies. (However, our analysis does not speak to the dimensions in which the disclosures differ, a matter for additional research in this area.)
Table 3
Supplementary analysis.

Panel A: Semantic values by firm partitions (mean values of each attribute). Glamour/value columns partition firms by market-to-book; large/small columns partition firms by size; each partition is analyzed separately.

                                   All firms   Glamour   Value    Large    Small
All firms
  N (firm-years)                       3529       1473     1472     1473     1472
  Fog index                           18.41      18.41    17.97    17.82    18.57
  Risk sentiment                      28.81      29.23    24.37    27.30    26.20
  Tone                                0.401      0.404    0.392    0.401    0.391
Firms predicted to out-perform
  N (firm-years)                        608        249      129      217      192
  Fog index                          18.515      18.59    17.49    18.02    18.65
  Risk sentiment                     31.939      33.76    30.92    37.21    30.92
  Tone                                0.399      0.406    0.384    0.404    0.394
Firms predicted to under-perform
  N (firm-years)                        601        141      181      203      167
  Fog index                          18.740      18.86    17.96    18.30    18.81
  Risk sentiment                     34.600      37.95    32.41    35.77    34.70
  Tone                                0.393      0.400    0.383    0.383    0.387

t-tests of differences
Firms predicted to out-perform versus all firms
  Fog index                           2.34*       1.06     0.68     0.75     0.02
  Risk sentiment                     4.65***    3.01***  3.32***  3.31***   2.84**
  Tone                                1.99        0.34     2.46     0.22     1.42
Firms predicted to under-perform versus all firms
  Fog index                          6.19***    3.44***    1.02    2.75**    1.83
  Risk sentiment                     7.40***    3.69***  4.25***  5.06***  4.38***
  Tone                               4.25***      1.19     1.02    2.73**   2.65**
Predicted out-performers versus predicted under-performers
  Fog index                           2.46*       1.95     1.19     1.53     1.29
  Risk sentiment                      1.57        1.82     0.65     1.30     1.80
  Tone                                2.78**      0.80     0.15     1.47     0.87

Panel B: Average return difference between predicted out-performing and under-performing firms. Model (augmented with three document-level meta-features) based on documents for year t and performance for year t+1; tested on documents for year t+1 and performance for year t+2.

  Year      25–50–25% performance class definition
  1998        −1.85%
  1999        64.65%
  2000       −37.48%
  2001        17.01%
  2002        16.39%
  Average     11.74%

Notes:
1. Variable definitions are as follows:
   Risk = sum of risk term frequencies.
   Tone = (optimism term frequency − pessimism term frequency)/(optimism term frequency + pessimism term frequency).
   Fog index = a measure of readability, calculated as 0.4 × [(words/sentences) + 100 × (words with more than two syllables/words)].
2. Entries in Panel A are raw values. We performed a log transformation when including the three textual features in the model.
3. Cell entries in Panel B represent the portfolio-level buy-and-hold size-adjusted return for a year beginning April 1 and ending March 31. The portfolio is long the predicted out-perform firms and short the predicted under-performers.
4. The performance class specification relates to the performance cutoffs used to define classes in the training sample.
* p < 0.05. ** p < 0.01. *** p < 0.001.
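For concreteness, the meta-features defined in the notes above can be computed as sketched below. The syllable counter is a crude vowel-group heuristic (an assumption; production code would use a pronunciation dictionary), and the optimism/pessimism counts are assumed to come from given word lists.

```python
import re

def count_syllables(word):
    # Crude vowel-group heuristic; a real implementation would use a dictionary.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fog_index(text):
    """Gunning fog index: 0.4 * (words/sentences + 100 * complex_words/words)."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    complex_words = [w for w in words if count_syllables(w) > 2]
    return 0.4 * (len(words) / sentences + 100.0 * len(complex_words) / len(words))

def tone(optimism_count, pessimism_count):
    """(optimism - pessimism) / (optimism + pessimism); 0 when neither occurs."""
    total = optimism_count + pessimism_count
    return 0.0 if total == 0 else (optimism_count - pessimism_count) / total
```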
5.1. Link to meta-features
It is possible that our text classification model is merely replicating the previously known association between document-level meta-features (e.g., clarity, tone) and future performance. Panel A of Table 3 presents descriptive data for the three features we study, both for the full sample and for the sub-samples we consider. (In this table, each firm-year is a separate observation.) We report relevant t-statistics at the bottom of the panel.

The first column of the panel reports data for the entire sample for which we obtain performance predictions. Relative to the average firm, we find that firms predicted to out-perform have denser text (fog index of 18.515 versus 18.41, t = 2.34) and more words expressing risk (risk sentiment of 31.94 versus 28.81, t = 4.65), but a similar tone. We find a similar pattern for firms predicted to under-perform, with even the tone turning more pessimistic. Thus, firms in the "tails" of the distribution of predicted performance differ from the average firm. A first conclusion is that our classification model picks up firms that differ systematically on the meta-features of tone, clarity, and risk sentiment. However, we have weaker evidence for this conclusion when we compare the feature scores of firms predicted to out-perform with those predicted to under-perform. We find that predicted under-performers have a marginally less readable document and slightly greater pessimism. The two groups express similar risk profiles. These comparisons suggest that text disclosures do have information content relating to performance, and that meta-features can help us identify the extremes.
Table 4
Sample characteristics and incremental returns. Model based on documents for year t and performance for year t+1; tested on documents for year t+1 and performance for year t+2.

Panel A: Sample characteristics (firm-years from the SAR implementable experiment; mean with median in parentheses)
  Item                            Predicted under-perform   Predicted average-perform   Predicted out-perform
                                        (N = 601)                 (N = 2320)                 (N = 608)
  Assets ($ million)                $1736.93 (170.88)          $3625.59 (488.60)          $3715.88 (482.53)
  Sales ($ million)                 $1389.62 (91.72)           $3136.83 (473.51)          $3125.95 (376.17)
  EPS                                $0.305 (0.231)             $0.746 (0.74)              $0.699 (0.72)
  Market-to-book                      1.636 (1.042)              1.718 (1.012)              3.124 (1.953)
  Leverage                            0.441 (0.396)              0.493 (0.505)              0.447 (0.445)
  Earnings surprise                   0.153 (0.175)              0.249 (0.1)                0.099 (0)
  Price momentum                      0.332 (0.39)               0.017 (0.067)              0.696 (0.315)
  Size-adjusted return (annual)       0.057 (0.173)              0.009 (0.097)              0.019 (0.126)

SAR = α + β1·Dummy + β2·Size + β3·MTB + β4·PM + β5·Earnings Surprise + β6·Size×Dummy + β7·MTB×Dummy + β8·PM×Dummy + β9·Earnings Surprise×Dummy + error

Panel B: Incremental information content
                                      Regression model 1          Regression model 2
  Item                               Estimate    t-value          Estimate    t-value
  Intercept                            0.068       0.87             0.042       0.36
  Dummy for model prediction           0.054       1.08             0.043       0.26
  Log (total assets)                   0.014       1.17             0.011       0.56
  Log (market-to-book)                 0.003       0.12             0.027       0.75
  Earnings surprise                    0.001       0.36             0.004       0.68
  Price momentum                      −0.083      −3.11***          0.035       1.23
  Dummy × log (total assets)                                        0.002       0.08
  Dummy × log (market-to-book)                                      0.071       1.65
  Dummy × earnings surprise                                         0.011       1.39
  Dummy × price momentum                                           −0.291      −4.00***
  N                                     1007                         1007
  Adjusted R-square                    0.009                        0.023
  F-value                              2.71**                       3.54***

Notes:
1. Variable definitions are as follows:
   SAR = size-adjusted buy-and-hold return for the year.
   Dummy = 1 if the firm is classified as out-performing and 0 for predicted under-performing firms. Average firms are excluded from this analysis.
   Size = the size of the firm, measured as the natural logarithm of total assets.
   MTB = market-to-book ratio (a valuation proxy), using the closing market price as of the start of the holding period.
   PM = price momentum, measured as the SAR for the six months preceding the start of the holding period.
   Surprise = actual EPS − forecast EPS, where the forecast is the latest available consensus analyst forecast.
2. Test statistics employ cluster-adjusted standard errors to control for multiple observations from the same firm.
** p < 0.01. *** p < 0.001.
The comparisons in the columns highlight that we cannot simply use the meta-features to replace the model. This inability arises because the meta-features are of less use in distinguishing the direction of the performance differential, the key attribute of interest. Data in the next four columns (for the sub-samples of glamour, value, large, and small firms) provide additional evidence supporting our inference. For all four sub-samples, the predicted under- and out-performing firms have greater risk-sentiment scores relative to the average firm. However, we do not find differences between the sets of firms predicted to out- and under-perform for every sub-sample and for every measure. We conclude that while the meta-features pick up differences in style and tone that are systematically related to the performance differentials predicted by our model, they seem unable to distinguish the direction of the performance differential.[15]
For an additional test of whether our predictive model captures more than the meta-features, we refit the predictive model including the meta-features in the term space. As shown in Panel B of Table 3, we obtain a similar (11.74%) return from the augmented model. More importantly, we also fit a model using only the meta-features. Such a model should, in theory, produce the same predictive ability if the meta-features contained all of the information in the documents. However, we find that such a parsimonious representation of a document (as three meta-features of clarity, tone, and risk sentiment) has no explanatory power at all (results not tabled). Overall, we conclude that our text classification model captures features not picked up by the selected meta-features.
5.2. Incremental information content
Panel A of Table 4 returns to the full-sample analysis. This panel provides descriptive data on the firms in the three predicted classes, for the 25–50–25 classification. Relative to the average firm in the out-perform sample, the firms in the under-perform sample are reliably smaller, have lower market-to-book ratios, and are less profitable (as measured by EPS). This distinction provides additional evidence about the information content of disclosures, because the classification does not use any numeric item. The text in the annual reports is enough to identify distinct samples of under- and out-performing firms.
Panel B of Table 4 provides results that speak to the relation between the information in the narratives and the information in quantitative disclosures. In particular, it is of interest to know whether the information in the narrative disclosure is subsumed by, or is incremental to, the information in the quantitative disclosure.
The first column reports results for a model that considers main effects only. We find that the coefficient for the model prediction (for "dummy") is not reliably different from zero. Thus, the text disclosure does not appear to provide value-relevant information incremental to that provided by known factors but rather captures known features. We find that large firms earn smaller returns (Reinganum, 1981), and that a high market-to-book ratio presages lower returns as well (Fama and French, 1992). However, once we account for all other factors, our data do not show the expected relation between price momentum and excess returns (Jegadeesh and Titman, 1993). Interestingly,
we note that the univariate comparison in Panel A is significant at the 5% level (the price momentum is 0.019 for predicted out-performers
versus 0.332 for the predicted under-performing firms). The regression estimate, however, indicates that the incremental effect (after
accounting for other factors such as market-to-book, size and earnings momentum) is negative.
The second column in this panel reports results for a complete model that includes interaction terms for the model prediction (a binary variable) with the established predictors. We continue to find an insignificant main effect for our model's prediction. However, as indicated by the significant interaction terms, the disclosure score is weakly informative as to whether the glamour/value partition will continue to yield excess returns in the next period. Moreover, the interaction with price momentum is significant. That is, the disclosure score indicates that the effect of price momentum for firms predicted to out-perform is reliably smaller than for the average firm. Thus, while Jegadeesh and Titman (1993) show abnormal returns to buying winners and selling losers, our results suggest the possibility of finer partitions.16 One interpretation of our results is that narrative disclosures can help identify firms with negative price momentum that reverses over the next year. Stated differently, narrative disclosures could help identify whether the price momentum will sustain into the next period or will reverse.
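To make the structure of these tests concrete, the following sketch fits both the main-effects and the interaction specifications with firm-clustered standard errors (consistent with note 2 of the table). It is a minimal illustration under stated assumptions, not the paper's code: the column names (sar, dummy, size, mtb, pm, surprise, firm_id) and the input file are hypothetical.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical panel of firm-year observations; column names are assumed.
df = pd.read_csv("panel_data.csv")

# Main-effects model, then the complete model interacting the text-based
# prediction (dummy) with the established predictors.
main = smf.ols("sar ~ dummy + size + mtb + pm + surprise", data=df)
full = smf.ols("sar ~ dummy * (size + mtb + pm + surprise)", data=df)

# Cluster-adjusted standard errors control for multiple observations
# from the same firm.
for model in (main, full):
    result = model.fit(cov_type="cluster", cov_kwds={"groups": df["firm_id"]})
    print(result.summary())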
6. Conclusions
This paper is part of a nascent literature that explores the narrative disclosures made by firms and complements the established literature that considers the ability of numeric data to predict market performance (e.g., the accrual or post-earnings-announcement drift anomalies). Most prior studies of textual disclosures have relied on expert classification, thereby limiting sample sizes and the kinds of questions that could be asked. This study demonstrates a methodology for large-scale text mining of the narrative disclosures in annual reports. Even a relatively simple model, when applied to the narrative data alone, successfully predicts future accounting and market performance.
There are several limitations to our approach. Our methodology allows only limited economic insight into which characteristics of the disclosure lead to particular predictions (see, e.g., Li, 2006). We also employ a simple "bag of words" approach, without paying attention to the context in which specific words are used. Further, we limit ourselves to the disclosures in the annual report and thus restrict the information set relative to what market participants would employ. However, these limitations can be addressed using some of the emerging techniques in text mining (see, e.g., Pant and Srinivasan, 2005).
We could expand this study along several dimensions. The first is the use of text mining models that consider attributes such as tone, phrasing and so on. The second avenue is to augment the annual report disclosures with additional disclosures such as press releases. We also could overweight economic predictions such as news releases from the Federal Reserve and sector-specific forecasts by trade associations.
15. We note that, in a changes analysis, Li (2006) shows the predictive ability of the fog index (we replicate this finding as well). The other studies do not focus on market performance in an implementable way. We also note that, considering all firms, the group of glamour firms has greater risk sentiment and a more optimistic tone (p < 0.001 for both comparisons) relative to the scores recorded for value firms. We also find that larger firms tend to have greater readability, but their count of risk-related words is also higher.
16. These results hold even if we discard the data for the year 2000 from the analysis. We also note that portfolios formed on price momentum generate abnormal negative returns after an initial holding period (Jegadeesh and Titman, 1993).
A third avenue is to identify the nature of the differences in disclosures by large versus small firms, and by glamour versus value firms, as our results demonstrate that these sub-sets follow differing strategies. We also could study extreme observations (e.g., a high positive score but a highly negative return) to identify features that diminish the informativeness of text disclosures. Finally, it is of interest to examine the time it takes market participants to impound textual information. While we have focused on annual returns, we conjecture that studies examining shorter time frames might find sharper differences, whereas the additional economic noise would wipe out the effect over longer time frames. However, the relation is not likely monotonic, because quantitative data likely dominate returns over very short (intra-day or a few days) return intervals.
Acknowledgements
We thank Mort Pincus, Cristi Gleason, Paul Hribar, the editor, two anonymous reviewers, and workshop participants at the University of
Iowa and Christopher Newport University for helpful comments. Xin Ying Qiu also acknowledges contributions from members of her thesis
committee.
Appendix A. Building text classifiers for prediction
Text classification is a core activity in information science. The goal is to assign each text to one (or more) of a given set of categories. As
an example, we may be interested in classifying a news article using the categories of sports, health, famous persons, entertainment, gardening, real estate or finance. An article might belong to sports alone, or to both the famous persons and sports categories. Trained individuals may perform text classification manually. Alternatively, classification may be accomplished using computational tools called text classifiers.
The design and evaluation of algorithms for automatic text classification have formed the basis of a highly active field of research for several decades. The field is now mature to the point that text classifiers are used not only to decide conceptual categories (as in the above example) but also to capture more subtle human phenomena such as sentiment; classifiers are being used to identify sentences that are speculative (versus presenting ideas with confidence), to identify sentence tone as positive, negative or neutral, and so on. Developments in these more subtle realms in part motivate our current research on text classifiers to predict market performance.
The automatic methods employed in text classification derive predominantly from research in machine learning, a subfield of artificial intelligence. Major examples include text classification algorithms based on support vector machines (as in this paper), neural nets, decision trees and association rules. Of these, the Support Vector Machine (SVM) based algorithms are amongst the most effective (Sebastiani, 2002, p. 49).
A given classification problem generally (but not always) starts with some training data that has been classified by some reliable mechanism, such as an expert, into one of two classes. Alternatively, we can use a known outcome, such as next period's return, to classify the text. An SVM represents each example in the training data as a vector in an n-dimensional space and proceeds to find an (n-1)-dimensional hyperplane that separates the two classes. This strategy produces a linear classifier. Here, the parameter n represents the number of features considered. Thus, in text classification problems, n can be fairly large, consisting of every nontrivial word in the collection of texts being classified. Because many candidate hyperplanes are likely to exist, SVMs are additionally designed to achieve the best or maximum separation (also called the margin) between the two classes of the training data. That is, the distance between the separating hyperplane and the nearest training points of either class is maximized. The "trained" classifier may then be applied to new data, classifying each new text into one of the two classes.
Several key extensions have been made to the basic linear SVM. For instance, when a clean separation between the two classes of points
is not possible, soft margins allow for some amount of classification error through the use of slack variables. The SVM then aims to maximize the margin while minimizing error. In addition, researchers often employ one of several functions to transform the initial n-dimensional space.
The classifier then looks for a separating hyperplane in this transformed space, a hyperplane that may be non-linear in the original space.
This strategy may be useful in cases where linear classifiers are not sufficiently effective. Several such "kernel functions" to transform the initial space are available in implementations of SVM tools, including polynomial and sigmoid functions. In this paper we use the base linear SVM classifier.17

17. We build our classifiers using the SVM-Light implementation of Support Vector Machines with default parameter settings and a linear kernel function (see http://svmlight.joachims.org/).
SVMs are designed mainly for solving binary or two-class classification problems. Since our research problem is to classify documents
into three classes, we consider some options to extend SVMs to multi-class problems. We perform one-against-rest classification for each
class, and combine the results to make a final decision. The computing time for this option is linear in the number of classes. That is, we produce a total of three binary (one-against-rest) SVM models and use the highest predictive score generated by the three models to assign a class label to the document.
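As a concrete illustration, the following sketch implements the one-against-rest scheme just described. It uses scikit-learn's linear SVM rather than the SVM-Light tool employed in the paper, and the feature matrices and class labels are placeholders.

import numpy as np
from sklearn.svm import LinearSVC

def one_vs_rest_predict(X_train, y_train, X_test, classes):
    # Fit one binary (class-versus-rest) linear SVM per class and assign
    # each test document the class with the highest decision score.
    scores = []
    for c in classes:
        clf = LinearSVC()
        clf.fit(X_train, (y_train == c).astype(int))
        scores.append(clf.decision_function(X_test))
    return np.array(classes)[np.argmax(np.vstack(scores), axis=0)]

# Example call with three performance classes, as in our setting:
# labels = one_vs_rest_predict(X_train, y_train, X_test,
#                              ["under", "middle", "out"])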
A.1. Document representation
In information retrieval and text-classification research, the most common approach to encode (or represent) a text document is to
model a document as a vector of weighted terms. There are generally three aspects to consider when constructing such a document
model:
(1) What are the terms in the vector? Are they all the words from the document set, or phrases, or some transformation of the words or
phrases?
(2) How many terms do we need to construct the document representation? Do we use all the defined terms, or a subset of the terms?
And, if we want only a subset of the terms, how do we select this subset and why?
(3) How do we construct a weighting scheme for the terms in the document vector, to best indicate the terms’ relative informativeness
and importance with respect to representing the document?
In addressing the first question of defining the terms to represent a document, the most widely used "bag of words" approach starts with the complete vocabulary in the training corpus (the set of words used as "independent variables" in the model). Functional or connective words, such as "a, hence, and, the," are considered stop words and are generally removed, since they are assumed to have no information content. Stemming (e.g., treating connecting or connected as the same as connect) is sometimes performed to remove suffixes and to map words to their morphological roots. Researchers have explored other, more complex textual representations (e.g., Peng and Schuurmans, 2003; Dumais et al., 1998; Apte et al., 1994). While each method has its strengths and weaknesses, more complex definitions have not been shown to be superior to the basic "bag of words" approach in solving classification problems. In this study, we use the stemmed words of the document corpus to construct the document vector representation.
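For concreteness, the following is a minimal sketch of this preprocessing (stop-word removal followed by Porter stemming). The sample sentence and the NLTK stop-word list are illustrative choices, not details from the paper.

import re
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def to_terms(text):
    # Tokenize, drop stop words, and map words to their morphological roots.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

print(to_terms("The company is connecting its reported earnings to growth"))
# e.g. ['compani', 'connect', 'report', 'earn', 'growth']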
Since the term space generated from our 10-K report collection is of extremely high dimension, we need to reduce the term space and generate smaller vocabularies. The benefits of such a reduced term space include better generalization ability of the model, savings in computing time, and possibly better interpretation and understanding of the predictive features. Most term selection methods either compute statistical feature scores to select high-scoring terms or apply simpler feature selection algorithms from machine learning research (e.g., Yang and Liu, 1999; Larkey, 1998; Yang and Pedersen, 1997).
We use the document frequency (DF) threshold method to reduce the term space. Relative to other methods, this method (which counts the number of 10-K filing documents in our collection that use a given word) is efficient at eliminating less informative terms and reducing the vocabulary size without sacrificing classification accuracy.
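The following sketch shows the idea; the threshold of five documents is an arbitrary illustrative value, not the cutoff used in the paper.

from collections import Counter

def df_threshold_vocabulary(tokenized_docs, min_df=5):
    # Keep only terms whose document frequency (the number of documents
    # in which the term appears) meets the threshold.
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))  # count each term at most once per document
    return {term for term, count in df.items() if count >= min_df}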
Researchers have used many ways to calculate term weights in document vectors. The term frequency * inverse document frequency (TF*IDF) is the most commonly used weighting scheme for estimating the usefulness of a given term as a descriptor of a document. Its interpretation is that the best descriptive terms of a given document are those that occur very often in the given document (high term frequency, or TF) but rarely in the other documents (high inverse document frequency, or IDF). In our previous study, we explored several constructions of TF*IDF weights. The best performer is the atn weight, formulated as:

atn = (0.5 + 0.5 * tf / max tf) * ln(N / n),

where tf is the raw term frequency of the given term; max tf is the maximum term frequency in the document collection; N is the total number of documents in the collection; and n is the number of documents containing the given term. Therefore, we report results only using atn as our weighting scheme for the terms in the document vector.
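A minimal sketch of this weighting applied to a raw term-frequency matrix follows. The small matrix is an illustrative placeholder and, following the definition above, max tf is taken over the whole collection.

import numpy as np

def atn_weights(tf):
    # tf: (documents x terms) matrix of raw term frequencies.
    N = tf.shape[0]                     # total number of documents
    n = (tf > 0).sum(axis=0)            # documents containing each term
    max_tf = tf.max()                   # maximum term frequency in the collection
    idf = np.log(N / np.maximum(n, 1))  # guard against all-zero columns
    return (0.5 + 0.5 * tf / max_tf) * idf

tf = np.array([[3.0, 0.0, 1.0],
               [1.0, 2.0, 0.0],
               [0.0, 1.0, 4.0]])
print(atn_weights(tf))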
References
Apte, C., Damerau, F.J., Weiss, S.M., 1994. Automated learning of decision rules for text categorization. ACM Transaction on Information Systems 12 (3), 233–251.
Arya, A., Glover, J., Sunder, S., 1998. Earnings management and the revelation principle. Review of Accounting Studies, 7–34.
Arya, A., Glover, J., Sunder, S., 2003. Are unmanaged earnings always better for shareholders? Accounting Horizons 17, 111–116.
Association for Investment Management and Research (AIMR), 2000. AIMR Corporate Disclosure Survey: A Report to AIMR. Fleishman-Hillard Research, St. Louis, MO.
Asquith, P., Mikhail, M., Au, A., 2006. Information content of equity analyst’s reports. Journal of Financial Economics 75, 245–282.
Ball, R., Brown, P., 1968. An empirical evaluation of accounting income numbers. Journal of Accounting Research 6 (2), 159–178.
Barber, B., Lehavy, R., McNichols, M., Trueman, B., 2003. Reassessing the returns to analysts' stock recommendations. Financial Analysts Journal 59 (2), 16–18.
Barron, O., Kile, C., O’Keefe, T., 1999. MD&A quality as measured by the SEC and analysts’ earnings forecasts. Contemporary Accounting Research 16 (Spring), 75–109.
Botosan, C., 1997. Disclosure level and the cost of equity capital. The Accounting Review 72, 323–349.
Botosan, C., Plumlee, M., 2000. Disclosure level and expected cost of equity capital: An examination of analysts’ rankings of corporate disclosures and alternative methods for
estimating the cost of capital. Working paper, The University of Utah.
Bryan, S.H., 1997. Incremental information content of required disclosures contained in management discussion and analysis. The Accounting Review 72 (2), 285–301.
Clarkson, P., Kao, J., Richardson, G., 1999. Evidence that management discussion and analysis (MD&A) is a part of a firm's overall disclosure package. Contemporary Accounting Research 16, 111–134.
Core, J.E., 2001. Firm’s disclosure and their cost of capital: A discussion of a review of the empirical disclosure literature. Journal of Accounting and Economics 31, 441–456.
Dave, K., Lawrence, S., Pennock, D.M., 2003. Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In: Proceedings of the 12th International World Wide Web Conference (WWW 2003), ACM, pp. 519–528.
Davis, A., Piger, J., Sedor, L., 2006. Beyond the numbers: An analysis of optimistic and pessimistic language in earnings press releases. Working paper, Washington University
in St. Louis.
Dumais, S.T., Platt, J., Heckerman, D., Sahami, M., 1998. Inductive learning algorithms and representations for text categorization. In: Proceedings of CIKM-98, Seventh ACM
International Conference on Information and Knowledge Management, pp. 148–155.
Fama, E., French, K., 1992. The cross section of expected stock returns. Journal of Finance 47, 427–465.
Fields, T., Lys, T., Vincent, L., 2001. Empirical research in accounting choice. Journal of Accounting and Economics 31 (1–3).
Firtel, K., 1999. Plain English: A reappraisal of the intended audience of disclosure under the Securities Act of 1933. Southern California Law Review 72, 851–889.
Hand, D., Mannila, H., Smyth, P., 2001. Principles of Data Mining. MIT Press, Cambridge, MA.
Healy, P., Palepu, K.G., 2001. Information asymmetry, corporate disclosure, and the capital markets: A review of the empirical disclosure literature. Journal of Accounting and
Economics 31 (1–3), 405–440.
Henry, E., 2006a. Market reaction to verbal components of earnings press releases: Event study using a predictive algorithm. Journal of Emerging Technologies in Accounting 3, 1–19.
Henry, E., 2006b. Are investors influenced by how earnings releases are written? Working paper, University of Miami.
Hussainey, K., Schleicher, T., Walker, M., 2003. Undertaking large-scale disclosure studies when AIMR-FAF ratings are not available: The case for prices leading earnings.
Accounting and Business Research 33 (4), 275–294.
Jegadeesh, N., Kim, J., Krische, S.D., Lee, C.M.C., 2004. Analyzing the analysts: When do recommendations add value? The Journal of Finance 59 (3), 1083–1124.
Jegadeesh, N., Titman, S., 1993. Returns to buying winners and selling losers: Implications for stock market efficiency. Journal of Finance 48, 65–91.
Kohut, G., Segars, A., 1992. The president's letter to stockholders: An examination of corporate communication strategy. Journal of Business Communication 29 (1), 7–21.
Kothari, S.P., 2001. Capital markets research in accounting. Journal of Accounting and Economics 31 (1–3).
Lang, M., Lundholm, R., 2000. Voluntary disclosure during equity offerings: Reducing information asymmetry or hyping the stock? Contemporary Accounting Research 17,
623–662.
Larkey, L.S., 1998. Automatic essay grading using text categorization techniques. In: Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval, pp. 90–95.
Li, F., 2006. Annual Report readability, current earnings and earnings persistence. Working paper, University of Michigan, Ann Arbor.
Mahinovs, A., Tiwari, A., 2007. Text classification method review. In: Roy, R., Baxter, D. (Eds.), Decision Engineering Report Series. Mimeo, Cranfield University, UK.
Pant, G., Srinivasan, P., 2005. Learning to crawl: Comparing classifier schemes. ACM Transactions on Information Systems 23 (4), 430–462.
Peng, F., Schuurmans, D., 2003. Combining naive Bayes and n-gram language models for text categorization. In: Proceedings of the 25th European Conference on Information Retrieval Research (ECIR 2003).
Pérez-Sancho, C., Iñesta, J.M., Calera-Rubio, J., 2005. A text categorization approach for music style recognition: Pattern recognition and image analysis. Lecture Notes in
Computer Science 3523, 649–657.
Popa, S., Zeitouni, K., Gardarin, G., Nakache, D., Métais, E., 2007. Text categorization for multi-label documents and many categories. In: Proceedings of the 12th IEEE International Symposium on Computer-Based Medical Systems (CBMS'07), IEEE Computer Society, Washington, DC, pp. 421–426.
Reinganum, M., 1981. Misspecification of the capital asset pricing: Empirical anomalies based on earnings’ yield and market values. Journal of Financial Economics 9, 19–46.
Rogers, K., Grant, J., 1997. Content analysis of information cited in reports of sell-side financial analysts. Journal of Financial Statement Analysis 3, 17–30.
Salton, G., Buckley, C., 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management 24 (5), 513–523.
Sebastiani, F., 2002. Machine learning in automated text categorization. ACM Computing Surveys 34 (1), 1–47.
Sebastiani, F., 2005. Text categorization. In: Zanasi, Alessandro (Ed.), Text Mining and its Applications to Intelligence. CRM and Knowledge Management, WIT Press,
Southampton, UK, pp. 109–129.
Singhal, A., Buckley, C., Mitra, M., 1996. Pivoted document length normalization. In: Proceedings of the 1996 ACM SIGIR Conference on Research and Development in
Information Retrieval, pp. 21–29.
Smith, M., Taffler, R.J., 2000. The chairman’s statement: A content analysis of discretionary narrative disclosures. Accounting Auditing & Accountability Journal 13 (5), 624–
646.
Subramanian, R., Insley, R.G., Blackwell, R.D., 1993. Performance and readability: A comparison of annual reports of profitable and unprofitable corporations. Journal of
Business Communication 30, 50–61.
Tetlock, P., Saar-Tsechansky, M., Macskassy, S., 2006. More than words: Quantifying language to measure firms' fundamentals. Working paper, University of Texas at Austin.
Thompson, P., 2001. Automatic categorization of case law. In: Proceedings of the 8th International Conference on Artificial Intelligence and Law, ACM, pp. 70–71.
Witten, I., Frank, E., 2000. Data Mining. Morgan Kaufmann Publishers, San Francisco.
Yang, Y., Liu, X., 1999. A re-examination of text categorization methods. In: Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in
Information Retrieval, pp. 42–49.
Yang, Y., Pedersen, J.O., 1997. A comparative study on feature selection in text categorization. In: Proceedings of the 14th International Conference on Machine Learning, pp. 412–420.