Article

Text Classification for Organizational Researchers: A Tutorial
Vladimer B. Kobayashi¹, Stefan T. Mol¹, Hannah A. Berkers¹, Gábor Kismihók¹, and Deanne N. Den Hartog¹

¹Leadership and Management Group, Amsterdam Business School, University of Amsterdam, Amsterdam, The Netherlands

Corresponding Author: Stefan T. Mol, Leadership and Management Group, Amsterdam Business School, University of Amsterdam, Valckenierstraat 59, 1018 XE Amsterdam, The Netherlands. Email: s.t.mol@uva.nl
Abstract
Organizations are increasingly interested in classifying texts or parts thereof into categories, as
this enables more effective use of their information. Manual procedures for text classification
work well for up to a few hundred documents. However, when the number of documents is larger,
manual procedures become laborious, time-consuming, and potentially unreliable. Techniques
from text mining facilitate the automatic assignment of text strings to categories, making classi-
fication expedient, fast, and reliable, which creates potential for its application in organizational
research. The purpose of this article is to familiarize organizational researchers with text mining
techniques from machine learning and statistics. We describe the text classification process in
several roughly sequential steps, namely training data preparation, preprocessing, transformation,
application of classification techniques, and validation, and provide concrete recommendations at
each step. To help researchers develop their own text classifiers, the R code associated with each
step is presented in a tutorial. The tutorial draws from our own work on job vacancy mining. We
end the article by discussing how researchers can validate a text classification model and the
associated output.
Keywords
text classification, text mining, random forest, support vector machines, naive Bayes
Text data are pervasive in organizations. Digitization (Cardie & Wilkerson, 2008) and the ease of
creating online information (e.g., e-mail messages; Berry & Castellanos, 2008) contribute to the
vast quantities of text generated each day. Embedded in these texts is information that may improve
our understanding of organizational processes. Thus, organizational researchers increasingly seek
ways to organize, classify, label, and extract opinions, experiences, and sentiments from text (Pang
& Lee, 2008; Wiebe, Wilson, & Cardie, 2005). Up until recently, the majority of text analyses in
organizations relied on time-consuming and labor-intensive manual procedures, which are imprac-
tical and less effective for voluminous collections of documents especially when resources are
limited (Kobayashi et al., in press). Hence, automatic (or computer-assisted) strategies are increas-
ingly employed to accelerate the analysis of text (Berry & Castellanos, 2008).
Similar to content analysis (Duriau, Reger, & Pfarrer, 2007; Hsieh & Shannon, 2005; Scharkow,
2013) and template analysis (Brooks, McCluskey, Turley, & King, 2015), a common objective of
text analysis is to assign text to predefined categories. Manually assigning large collections of text to
categories is costly and may become inaccurate and unreliable due to cognitive overload. Further-
more, idiosyncrasies among human coders may creep into the labeling process resulting in coding
errors. One workaround is to code only part of the corpus as opposed to coding all documents.
However, this comes at the expense of possibly omitting relevant information, which may lead to
bias and a degradation of the internal and external validity of the findings. Another option is to hire
multiple human coders, but this adds cost (e.g., cost of hiring and training coders) and effort
pertaining to determining interrater reliability and consensus seeking (Sheng, Provost, & Ipeirotis,
2008). A final (and more affordable) option is to solicit the help of the public to label text, for
instance through the Amazon Mechanical Turk platform (Buhrmester, Kwang, & Gosling, 2011).
However, this may be effective only in labeling objective information (e.g., names of people, events,
etc.) since it is often difficult to establish consistency on subjective labels (e.g., sentiments; Wiebe,
Wilson, Bruce, Bell, & Martin, 2004). Hence, automatic text analysis procedures that reliably,
efficiently, and effectively assign text elements to classes are both necessary and advantageous
especially in dealing with a massive corpus of text.
This article focuses on automatic text classification for several reasons. First, although text
classification (henceforth TC) has been applied in various fields, such as in political science (Atte-
veldt, Kleinnijenhuis, Ruigrok, & Schlobach, 2008; B. Yu, Kaufmann, & Diermeier, 2008), occu-
pational fraud (Holton, 2009), law (Gonçalves & Quaresma, 2005), finance (Chan & Chong, 2017;
Chan & Franklin, 2011; Kloptchenko et al., 2004), and personality research (Shen, Brdiczka, & Liu,
2013), so far its uptake in organizational research is limited. Second, the use of TC is economical
both in terms of time and cost (Duriau et al., 2007). Third, many of the techniques that have been
developed in TC, such as sentiment analysis (Pang & Lee, 2008), genre classification (Finn &
Kushmerick, 2006), and sentence classification (Khoo, Marom, & Albrecht, 2006) seem particularly
well suited to address contemporary organizational research questions. Fourth, the acceptance and
broader use of TC within the organizational research community can stimulate the development of
novel TC techniques.
Tutorials or review-tutorials on TC that have been published so far (Harish, Guru, & Manjunath,
2010; Li & Jain, 1998; Sebastiani, 2002) were targeted mainly toward researchers in the field of
machine learning and data mining. This has resulted in a skewed focus on technical and methodo-
logical details. In this article our goal is to balance the discussion among techniques, theoretical
concepts, and validity concerns to increase the accessibility of TC to organizational researchers.
Below we first discuss the TC process, by pointing out key concerns and providing concrete
recommendations at each step. Previous studies are cited to enrich the discussion and to illustrate
different use cases. The second part is a hands-on tutorial using part of our own work as a running
example. We applied TC to automatically extract nursing job tasks from nursing vacancies to
augment nursing job analysis (Kobayashi, Mol, Kismihók, & Hesterberg, 2016). The findings from
this study were used in the EU-funded Pro-Nursing (http://pro-nursing.eu) project which aimed to
understand, among others, how nursing tasks are embedded in the nursing process. We also address
validity assessment because the ability to demonstrate the validity of TC outcomes will likely be
critical to its uptake by organizational researchers. Thus, we discuss and illustrate how to establish
validity for TC outcomes. Specifically, we address assessing the predictive validity of the classifier
and triangulating the output of the classification with other data sources (e.g., expert input and output
from alternative analyses).
Text Classification
TC is defined as the automatic assignment of text to one or more predefined classes (Li & Jain, 1998; Sebastiani, 2002). Formally, the task of TC is stated as follows. Given a set of texts and a set of categories, construct a model of the form $Y = f(X; \theta) + \epsilon$ from a set of documents with known categories. In the preceding formula, $X$ is a suitably chosen text representation (e.g., a vector), $\theta$ is the set of unknown parameters associated with the function $f$ (also known as the classifier or classification model) that need to be estimated using the training data, and $\epsilon$ is the error of the classification. The error is added to account for the fact that $f$ is just an approximation to the true but unknown function $h$ such that $Y = h(X)$. Hence, the smaller $\epsilon$ is, the more effective the classifier $f$ is. The $Y$ term usually takes numerical values indicating the membership of text in a particular category. For example, when there are only two categories, such as in classifying the polarity of relations between political actors and issues as either positive or negative (Atteveldt et al., 2008), $Y$ can take the values of $+1$ and $-1$, respectively signifying positive and negative sentiment. We further discuss how to deal with each part of the formula, such as how to choose $X$ and $f$, below. Once the classification model has been constructed it is then used to predict the category of new text (Aggarwal & Zhai, 2012).
An ideal classifier would mimic how humans process and deduce meaning from text. However,
there are still many challenges before this becomes reality. Natural languages contain high-level
semantics and abstract concepts (Harish et al., 2010; Popping, 2012) that are difficult to articulate in
computer language. For instance, the meaning of a word may change depending on the context in
which it is used (Landauer, Foltz, & Laham, 1998). Also, lexical, syntactic, and structural ambi-
guities in text are continuing challenges that would need to be addressed (Hindle & Rooth, 1993;
Popping, 2012). Another issue is dealing with typographical errors or misspellings, abbreviations,
and new lexicons. Strategies for dealing with ambiguities all need to be explicated during classifier
development. Before a classifier is deployed it thus needs several rounds of training, testing, fine-
tuning (of parameters), and repeated evaluation until acceptable levels of performance and validity
are reached. The resulting classifier is expected to approximate the performance of human experts in
classification tasks (Cardie & Wilkerson, 2008), but for a large corpus its advantage is that it will be
able to do so in a faster, cheaper, and more reliable manner.
TC: The Process
The TC process consists of six interrelated steps, namely (a) text preprocessing, (b) text represen-
tation or transformation, (c) dimensionality reduction, (d) selection and application of classification
techniques, (e) classifier evaluation, and (f) classifier validation. As with any research activity,
before starting the TC process, we begin by formulating the research question and identifying text
of interest. Here, we assume that classes are predefined and that the researcher has access to, or can
gather, documents with known classes, that is, the training data. For example, in a study about
identifying disgruntled employee communications, researchers used posts from intracompany dis-
cussion groups. Subsequently, using criteria on employee disgruntlement, two people manually
classified 80 messages into either disgruntled or nondisgruntled communication (Holton, 2009).
Another study focused on the detection of personality of users from their email messages. Research-
ers first administered a 120-item questionnaire to 486 users to identify their personalities after which
their email messages over a 12-month period were collected (Shen et al., 2013). Compared to the
study on disgruntlement, it is more straightforward to label the associated text in this latter study
because the labels are based on the questionnaire. Researchers are often faced with the decision of
how many documents to label, an issue we will return to in the “Other TC issues” section below.
Once the training dataset has been compiled, the next step is to preprocess the documents.
Text Preprocessing for Classification
The purpose of preprocessing is to remove irrelevant bits of text as these may obscure meaningful
patterns and lead to poor classification performance and redundancy in the analysis (Uysal & Gunal,
2014). During preprocessing we first apply tokenization to separate individual terms. Terms may be
words, punctuation marks, numbers, tags, and other symbols (e.g., an emoticon). In written English,
terms are usually separated by spaces.
Punctuation marks and numbers, if deemed irrelevant to the classification task at hand, are removed, although in some cases these may be informative and thus retained (exclamation marks or emoticons, for instance, may be indicative of sentiment). Dictionaries or lexicons are used to apply spelling correction and to resolve typos and abbreviations. Words that are known to have low information content, such as conjunctions and prepositions, are typically deleted. These words are called stopwords (Fox, 1992); examples of preidentified stopwords in the English language are "and," "the," and "of" (see http://www.ranks.nl/stopwords for lists of stopwords in various
languages). When the case of the letters is irrelevant it is advisable to transform all upper case letters
into lower case.
During preprocessing stemming, which is defined as the process of obtaining the base or stem
form of words (Frakes, 1992; Porter, 1980), is also commonly applied. A key assumption in stem-
ming is that words that have similar root forms are identical in meaning. Stemming is performed by
removing suffixes that may not correspond to an actual base form of the word (Willett, 2006). For
example, the words calculate, calculating, and calculated will be rewritten to calcul, although the actual base form is calculate (Toman, Tesar, & Jezek, 2006). If one wants to recover the actual base
form then one can use lemmatization instead of stemming. However, lemmatization is more chal-
lenging than stemming (Toman et al., 2006) and the added complexity of applying lemmatization
may offset its benefits. Both lemmatization and stemming leads to a loss of inflection information in
words (e.g., tense, gender, and voice). Inflection information may be important in some applications,
such as in identifying the sentiment of product reviews, since as it turns out, most negative reviews
are written in the past tense (Dave, Lawrence, & Pennock, 2003). Stemming and lemmatization are
part of a broad class of preprocessing techniques called normalization (Dave et al., 2003; Toman
et al., 2006). The aim of normalization is to merge terms that express the same idea or concept under
a single code called a template. For example, another normalization strategy is to use the template
POST_CODE to replace all occurrences of postcodes in a collection of documents. This can be
useful when it is important to consider if a document does or does not contain a postcode (i.e.,
contains an address), but the actual postcode is irrelevant.
A practical question is: what preprocessing techniques to apply for a given text? The answer is
largely determined by the nature of text (e.g., language and genre), the problem that we want to
address, and the application domain (Uysal & Gunal, 2014). Any given preprocessing procedure
may be useful for a specific domain of application or language but not for others. Several empirical
studies demonstrated the effect of preprocessing on classification performance. For example, stem-
ming in the Turkish language does not seem to make a difference in classification performance when
the size of the training data set is large (Torunoğlu, Çakırman, Ganiz, Akyokuş, & Gürbüz, 2011). In
some applications stemming even appears to degrade classification performance, particularly in the
English and Czech languages (Toman et al., 2006). In the classification of English online news, the
impact of both stemming and stopword removal is negligible (Song, Liu, & Yang, 2005). In general,
the classification of English and Czech documents benefits from stopword removal but may suffer
from word normalization (Toman et al., 2006). For the Arabic language, certain classifiers benefit
from stemming (Kanaan, Al-Shalabi, Ghwanmeh, & Al-Ma’adeed, 2009). In spam email filtering,
some words typically seen as stopwords (e.g., “however” or “therefore”) were found to be partic-
ularly rare in spam email, hence these should not be removed for this reason (Méndez, Iglesias, Fdez-Riverola, Díaz, & Corchado, 2006).
Recommendation. For English documents, our general recommendation is to apply word tokenization, convert upper case letters to lower case, and apply stopword removal (except for short text such as email messages and product titles; Méndez et al., 2006; H.-F. Yu, Ho, Arunachalam,
Somaiya, & Lin, 2012). Since the effects of normalization have been mixed, our suggestion is to
apply it only when there is no substantial degradation on classification performance, since it can
increase classification efficiency by reducing the number of terms. When in doubt whether to
remove numbers or punctuations (or other symbols), our advice is to retain them and apply the
dimensionality reduction techniques discussed in the below section on text transformation.
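To make these recommendations concrete, the sketch below chains the preprocessing operations in R using the tm package. This is an illustrative example, not the tutorial's own CL scripts (those are discussed later); the two example sentences are invented for demonstration.

```r
# Minimal preprocessing sketch using the tm package; the example sentences
# are illustrative and not part of the tutorial data.
library(tm)

docs <- c("The nurse administers medication to patients.",
          "Excellent communication skills are required!")

corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))      # lower casing
corpus <- tm_map(corpus, removePunctuation)                 # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                     # drop numbers
corpus <- tm_map(corpus, removeWords, stopwords("english")) # stopword removal
corpus <- tm_map(corpus, stripWhitespace)
# Optional: Porter stemming; apply only if it does not hurt classification
# performance, as discussed above
# corpus <- tm_map(corpus, stemDocument)
```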
Text Transformation (X)
Text transformation is about representing documents so that they form a suitable input to a classi-
fication algorithm. In essence, this comprises imposing structure on a previously unstructured text.
Most classification algorithms accept vectors or matrices as input. Thus the most straightforward
way is to represent a document as a vector and the corpus as a matrix.
The most common way to transform text is to use the so-called vector space model (VSM) where
documents are modeled as elements in a vector space (Raghavan & Wong, 1986; Salton, Wong, &
Yang, 1975). The features in this representation are the individual terms found in the corpus. This
somehow makes sense under the assumption that words are the smallest independently meaningful
units of a language. The size of the vector is therefore equal to the size of the vocabulary (i.e., the set
of unique terms in a corpus). Hence, we can represent document $j$ as $X_j = (x_{1j}, x_{2j}, \ldots, x_{Mj})$, where $M$ is the size of the vocabulary and the element $x_{ij}$ is the weight of term $i$ in document $j$. Weights can be the count of the terms in a document ($x_{ij} = TF(j, i)$) or, when using binary weighting, a 1 (presence of a term) or 0 (absence of a term). Applying the transformation to the entire corpus will lead to a document-by-term matrix (DTM), where the rows are the documents, the columns are the terms, and the entries are the weights of the terms in each document.
Other weighting options can be derived from basic count weighting. One can take the logarithm
of the counts to dampen the effect of highly frequent terms. Here we need to add 1 to the counts so
that we avoid taking the logarithm of zero counts. It is also possible to normalize with respect to
document length by dividing each count by the maximum term count in a given document. This is to
ensure that frequent terms in long documents are not overrepresented. Apart from the weights of the
terms in each document, terms can also be weighted with respect to the corpus. Common corpus-
based weights include the inverse document frequency (IDF), which assesses the specificity of terms
in a corpus (Algarni & Tairan, 2014). Terms that occur in too few (large IDF) or in too many (IDF
close to zero) documents have low discriminatory power and are therefore not useful for classifi-
cation purposes. The formula for IDF is $IDF(i) = \log\left(\frac{N}{df(i)}\right)$, where $df(i)$ stands for the document frequency of term $i$, that is, the number of documents containing term $i$, and $N$ is the number of documents in the corpus. Document- and corpus-based weights may also be combined so that the weights simultaneously reflect the importance of a term in a document and its specificity to the corpus. The most popular combined weight measure is the product of term frequency (TF) and IDF ($x_{ij} = TF(j, i) \times IDF(i)$) (Aizawa, 2003).
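As an illustration, the following sketch builds a DTM with raw counts, binary weights, and TF-IDF weights using the tm package. It assumes a preprocessed tm corpus object such as the `corpus` from the preprocessing sketch above; it is not the tutorial's own code.

```r
# Sketch: document-by-term matrices with different weighting schemes (tm).
# Assumes `corpus` is a preprocessed tm corpus, as in the earlier sketch.
library(tm)

dtm_tf    <- DocumentTermMatrix(corpus)                                    # raw counts
dtm_bin   <- DocumentTermMatrix(corpus,
                                control = list(weighting = weightBin))    # 0/1 weights
dtm_tfidf <- DocumentTermMatrix(corpus,
                                control = list(weighting = weightTfIdf))  # TF-IDF
inspect(dtm_tf)
```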
Although the VSM ignores word order information, it is popular due to its simplicity and effec-
tiveness. Ignoring word order means losing some information regarding the semantic relationships
between words. Also, words alone may not always express true atomic units of meaning. Some
researchers improve the VSM by adding adjacent word pairs or trios (bigrams and trigrams) as
features. For example, “new” followed by “york” becomes “new york” in a bigram. Although this
incorporates some level of word order information, it also leads to feature explosion thereby
increasing noise and redundancy. Also, many bigrams and trigrams do not occur often, thus their
global contributions to the classification are negligible and will only contribute to sparsity and
computational load. A workaround is to use only the most informative phrases (e.g., frequent
phrases; Scott & Matwin, 1999). Strategies for selecting key phrases include the noun phrase (Lewis,
1992) and key phrase (Turney, 1999) extraction algorithms. However, this does add additional
complexity in the analysis, which may again not result in a significant improvement in the classi-
fication. Studies have consistently shown that using bigrams only marginally improved classifica-
tion performance and in some cases degraded it, whereas the use of trigrams typically yielded
improvement (Dave et al., 2003; Ragas & Koster, 1998). Using syntactic phrases typically does
not improve performance much compared to single term features (Moschitti & Basili, 2004; Scott &
Matwin, 1999). Thus, the recommendation is to rely on single term features rather than phrases
unless there is a strong rationale to use phrases.
Text transformation plays a critical role in determining classification performance. Inevitably
some aspects of the text are lost in the transformation phase. Thus, when resulting classification
performance is poor, we recommend that the researcher reexamines this step. For example, while
term-based features are popular, if performance is poor one could also consider developing features
derived from linguistic information (e.g., parts of speech) contained in text (Gonçalves & Quaresma,
2005; Kobayashi et al., 2017; Moschitti & Basili, 2004) or using consecutive characters instead of
whole words (e.g., n-grams; Cavnar & Trenkle, 1994).
Reducing dimensionality. Even after preprocessing, transformation through VSM is still likely to
result in a large feature set. Too large a number of features is undesirable because it may increase
computational time and may degrade classification performance, especially when there are many
redundant and noisy features (Forman, 2003; Guyon & Elisseeff, 2003; Joachims, 1998). The size
of the vector and hence the size of feature set is referred to as the dimensionality of the VSM
representation. When possible, one should reduce dimensionality either by selectively eliminating
features or by creating latent features from existing ones without sacrificing classification per-
formance (Burges, 2010; Fodor, 2002; van der Maaten, Postma, & van den Herik, 2009). A
reduced feature set has advantages such as higher efficiency and in some cases, improved clas-
sification performance.
One way to eliminate features is to first assign scores to each feature and then remove features
by setting a cutoff value. This is called thresholding (Lan, Tan, Su, & Lu, 2009; Salton & Buckley,
1988). Weights from the transformation steps are sometimes used to score features. An example is
to remove rare terms, that is, terms with high IDF or low DF since they are noninformative for
category prediction or not influential in global performance. In some cases, rare terms are noise
terms (e.g., misspellings).
Another group of strategies to score features is to make use of class membership information in
the training data. These methods are called supervised scoring methods. Examples of these methods
are mutual information (MI), chi-squared (CHI), Gini index (GI), and information gain (IG; Yang &
Pedersen, 1997). Supervised scoring methods are expected to be superior to unsupervised ones
(e.g., DF), although in some cases DF thresholding has yielded performance comparable to super-
vised scoring methods such as CHI and GI (Yang, 1999) and even exceeded the performance of MI.
An alternative to scoring methods is to create latent orthogonal features by combining existing
features. Methods that construct new features from existing ones are known as feature transforma-
tion methods. Techniques include principal component analysis (PCA; Sirbu et al., 2016; Zu,
Ohyama, Wakabayashi, & Kimura, 2003), latent semantic analysis (LSA; Landauer et al., 1998),
and nonnegative matrix factorization (Zurada, Ensari, Asl, & Chorowski, 2013). These methods
construct high level features as a (non)linear combination of the original features with the property
that the new features are uncorrelated. They operate on the DTM by applying a matrix factorization
method. The text is scored (or projected) on the new features, or factors, and these new features are
used in the subsequent analysis. LSA improves upon the VSM through its ability to detect synonymy
(Landauer et al., 1998). Words that appear together and load highly on a single factor may be
considered to be synonyms.
Recommendation. Our recommendation is to start with the traditional VSM, that is, transform the
documents into vectors using single terms as features. For the unsupervised scoring, compute the
DF of each term and filter out terms with very low and very high DF, customarily those terms
belonging to the lower 5th and upper 99th percentiles. For the supervised scoring try CHI and IG and
for the feature transformation try LSA and nonnegative matrix factorization. Compare the effect on
classification performance of the different feature sets generated by the methods and choose the
feature set that yields the highest performance (e.g., accuracy). We also suggest to try combining
scoring and transformation methods. For example, one can first run CHI and perform LSA on the
terms selected by CHI. Note that the quality of the feature set (and that of the representation) is
assessed based on its resulting classification performance (Forman, 2003).
For LSA and nonnegative matrix factorization, we need to decide how many dimensions to retain.
For LSA, Fernandes, Artífice, and Fonseca (2017) offered this formula as a rough guide: $K = N^{\frac{1}{1 + \log(N)/10}}$, where $N$ is the size of the corpus, $K$ is the number of dimensions to retain, and the logarithm is base 10. For example, if there are 500 documents, then retain approximately 133 latent dimensions. In the case of nonnegative matrix factorization, an upper bound for choosing $K$ is that it must satisfy the inequality $(N + M)K < NM$, where $M$ is the number of original features (Tsuge, Shishibori, Kuroiwa, & Kita, 2001). Hence, if there are 500 documents and 1,000 terms, $K$ should not be greater than 333. Of course, one has to experiment with different sizes of dimensionality and select the size that yields the maximum performance. For example, the formula gave 133 dimensions for 500 documents, but one may also try experimenting with values within ±30 of 133.
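These rules of thumb are easy to compute. The sketch below implements the LSA guideline and the nonnegative matrix factorization upper bound as small helper functions; the function names are ours, introduced here only for illustration.

```r
# Rough guide for the number of LSA dimensions (Fernandes et al., 2017):
# K = N^(1 / (1 + log10(N) / 10))
lsa_k <- function(N) round(N^(1 / (1 + log10(N) / 10)))
lsa_k(500)    # approximately 133

# Upper bound for K in nonnegative matrix factorization: (N + M) * K < N * M
nmf_k_upper <- function(N, M) floor(N * M / (N + M))
nmf_k_upper(500, 1000)    # 333
```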
Application of TC Algorithms (f)
The transformed text, usually the original DTM or the dimensionality reduced DTM, serves as input
to one or more classification techniques. Most techniques are from the fields of machine learning
and statistics. There are three general types of techniques: (a) geometric, (b) probabilistic, and (c)
logical (Flach, 2012).
Geometric algorithms assume that the documents can be represented as points in a hyperspace,
the dimensions of which are the features. This means that distances between documents and lengths
of the documents can be defined as well. In this representation, nearness implies similarity. An
example of a geometric classifier is K-nearest neighbors in which classification is done by first
finding the closest K documents (using a distance measure) from the training data (Jiang, Pang, Wu, & Kuang, 2012); then the majority class of the K closest documents is the class to which the new document is assigned. The parameter K is chosen to be an odd number to prevent ties from occurring.
Another geometric classifier is support vector machines (Joachims, 1998) in which a hyperplane is
constructed that provides the best separation among the text in each class. The hyperplane is
constructed in such a way that it provides the widest separation between the two nearest observations
of each class.
Probabilistic algorithms compute a joint probability distribution between the observations (e.g.,
documents) and their classes. Each document is assumed to be an independent random draw from
this joint probability distribution. The key point in this case is to estimate the posterior probability
$P(Y_m \mid X)$. Classification is achieved by identifying the class that yields the maximum posterior probability for a given document. The posterior probability is estimated in two ways. Either one can marginalize the joint distribution $P(X, Y_m)$, or one may compute $P(X \mid Y_m)$ and $P(Y_m)$ separately and apply Bayes' theorem. Both naive Bayes (Eyheramendy, Lewis, & Madigan, 2003) and
logistic regression (J. Zhang, Jin, Yang, & Hauptmann, 2003) are examples of probabilistic
algorithms.
The third type of algorithm is the logical classifier, which accomplishes classification by means
of logical rules (Dumais, Platt, Heckerman, & Sahami, 1998; Rokach & Maimon, 2005). An
example of such a rule in online news categorization is: “If an article contains any of the stemmed
terms “vs”, “earn”, “loss” and not the words “money”, “market open”, or “tonn” then classify the
article under category “earn” (Rullo, Cumbo, & Policicchio, 2007). The rules in logical models are
readable and thus facilitate revision, and, if necessary, correction of how the classification works. An
example of a logical classifier is a decision tree (Rokach & Maimon, 2005).
Naive Bayes and support vector machines are popular choices (Ikonomakis, Kotsiantis, &
Tampakas, 2005; Joachims, 1998; Li & Jain, 1998; Sebastiani, 2002). Both can efficiently deal
with high dimensionality and data sparsity, though in naive Bayes appropriate smoothing will need
to be applied to adjust for terms which are rare in the training data. The method of K-nearest
neighbor works well when the amount of training data is large. Both logistic regression and
discriminant analysis yield high performance if the features are transformed using LSA. The
performance of decision trees has been unsatisfactory. A number of researchers therefore recom-
mend the strategy of training and combining several classifiers to increase classification perfor-
mance, which is known as ensemble learning (Breiman, 1996; Dietterich, 1997; Dong & Han,
2004; Polikar, 2012). This kind of classification can be achieved in three ways. The first is using a
singlemethodandtrainingitondifferentsubsets of the data. Examples include bagging and
boosting, which both rely on resampling. Random forest is a combination of bagging and random
selection of features that uses decision trees as base learners. Gradient boosted trees, a technique
that combines several decision trees, has been shown to significantly increase performance as
compared with that of individual decision trees (Ferreira & Figueiredo, 2012). The second is using
a single method but varying the training parameters such as, for example, using different initial
weights in neural networks (Kolen & Pollack, 1990). The third is using different classification
techniques (naive Bayes, decision trees, or SVM; Li & Jain, 1998) and combining their predictions
using, for instance, the majority vote.
Recommendation. Rather than using a single technique, we suggest applying different methods, by
pairing different algorithms and feature sets (including those obtained from feature selection and
transformation) and choosing the pair with the lowest error rate. For example, using the DTM
matrix, apply SVM, naive Bayes, random forest bagging, and gradient boosted trees. When feature
transformation has been applied (e.g., LSA and nonnegative matrix factorization), use logistic
regression or discriminant analysis. When the training data are large (e.g., hundreds of thousands
of cases), use K-nearest neighbors. Rule-based algorithms are seldom used in TC, however, if
readability and efficiency are desired in a classifier, then these can be trialed as well.
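The sketch below illustrates this pairing strategy with three commonly used algorithms on the same feature matrix. The packages (e1071 and randomForest) and the toy term-count data are our choices for illustration; they are not the tutorial's own objects.

```r
# Sketch: trying several classifiers on the same feature matrix.
# `x` (toy term counts) and `y` (toy class labels) are placeholders.
library(e1071)         # svm(), naiveBayes()
library(randomForest)  # randomForest()

set.seed(1)
x <- matrix(rpois(200 * 20, 1), nrow = 200,
            dimnames = list(NULL, paste0("term", 1:20)))
y <- factor(sample(c("task", "nontask"), 200, replace = TRUE))

fit_svm <- svm(x, y, kernel = "linear")
fit_nb  <- naiveBayes(as.data.frame(x), y)
fit_rf  <- randomForest(x, y, ntree = 200)

# Predictions on (here: the same) data; in practice use held-out documents
head(predict(fit_svm, x))
```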
Evaluation Measures
Crucial to any classification task is the assessment of the performance of classifiers using evaluation
measures (Powers, 2011; Yang, 1999). These measures indicate whether a classifier models the
relationship between features and class membership well, and may thus be used to indicate the extent
to which the classifier is able to emulate a human coder. The most straightforward evaluation
measure is the accuracy measure, which is calculated as the proportion of correct classifications.
Accuracy ranges from 0 to 1 (or 0 to 100 when expressed as a percentage). The higher the accuracy
the better the classifier (1 corresponds to perfect classification). However, in case of imbalanced
classification (i.e., when there is one class with only a few documents) and/or unequal costs of
misclassification, accuracy may not be appropriate. An example is detecting career shocks (cf.
Seibert, Kraimer, Holtom & Pierotti, 2013) in job forums. Since it is likely that only a small fraction
of these postings pertain to career shocks (suppose .05), a classifier can still have a high accuracy
(equal to .95) even if that classifier classifies all discussions as containing no career shock content.
Alternative measures to accuracy are precision, recall, F-measure (Powers, 2011), specificity, break-
even point, and balanced accuracy (Ogura, Amano, & Kondo, 2011). In binary classification, classes are
commonly referred to as positive and negative. Classifiers aim to correctly identify observations in the
positive class. A summary table which can be used as a reference for computing these measures is
presented in Figure 1. The entries of the table are as follows: TP stands for true positives, TN for true
negatives, FP for false positives (i.e.,negative cases incorrectly classified into the positive class), and FN
for false negatives (i.e., positive cases incorrectly classified into the negative class). Hence the five
evaluation measures are computed as follows: $precision = \frac{TP}{TP + FP}$, $recall = \frac{TP}{TP + FN}$, $specificity = \frac{TN}{TN + FP}$, $F\text{-}measure = \frac{2 \cdot recall \cdot precision}{recall + precision}$, and $Bal.\,Accu. = \left(\frac{TP}{TP + FP} + \frac{TN}{TN + FN}\right)/2$.
The breakeven point is the value at which precision = recall. F-measure and balanced accuracy
are generally to be preferred in case of imbalanced classification, because they aggregate the more
basic evaluation measures.
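These measures are straightforward to compute from a confusion matrix, as the following sketch shows. The predicted and true labels are illustrative, and balanced accuracy follows the definition given above.

```r
# Sketch: evaluation measures from a confusion matrix; `truth` and `pred`
# are illustrative label vectors.
truth <- factor(c("pos", "pos", "neg", "neg", "pos", "neg"), levels = c("pos", "neg"))
pred  <- factor(c("pos", "neg", "neg", "neg", "pos", "pos"), levels = c("pos", "neg"))

tab <- table(truth, pred)
TP <- tab["pos", "pos"]; FN <- tab["pos", "neg"]
FP <- tab["neg", "pos"]; TN <- tab["neg", "neg"]

precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)
specificity <- TN / (TN + FP)
f_measure   <- 2 * recall * precision / (recall + precision)
bal_accu    <- (TP / (TP + FP) + TN / (TN + FN)) / 2  # as defined in the text
```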
Evaluation measures are useful to compare the performance of several classifiers (Alpaydin,
2014). Thus, one can probe different combinations of feature sets and classification techniques to
determine the best combination (i.e., the one which gives the optimal value for the evaluation
measure). Apart from classification performance, one can also take the parsimony of the trained
classifier into account by examining the relative size of the different feature sets, since they deter-
mine the complexity of the trained classifier. In line with Occam’s razor, when two classifiers have
the same classification performance, the one with the lower number of features is to be preferred
(Shreve, Schneider, & Soysal, 2011).
Evaluation measures are computed from the labeled data. It is not advisable to use all labeled data
to train the classifier since this might result in overfitting which is the case when the classifier is good
at classifying the observations in the training data but performs poorly on new data. Hence, part of
the labeled data should be set aside for evaluation so that we can assess the degree to which the
classifier is able to predict accurately in data that were not used for training.
                            Predicted Classes
                            Positive        Negative
True Classes    Positive    TP              FN
                Negative    FP              TN

Figure 1. Confusion matrix as a reference to compute the evaluation measures. Note: FN = false negative; FP = false positive; TN = true negative; TP = true positive.
Cross-validation can be applied by computing not only one value for the evaluation measure but
several values corresponding to different splits of the data. A systematic strategy to evaluate a
classifier is to use k-fold cross-validation (Kohavi, 1995). This method splits the labeled dataset
into k parts. A classifier is trained using k − 1 parts and evaluated on the remaining part. This is repeated until each of the k parts has been used as test data. Thus, for k equal to 10, there are 10 partitions of the labeled data and 10 corresponding values for a given measure; the final estimate is
just the average of the 10 values. Another strategy is called bootstrapping, which is accomplished by
computing an average of the evaluation measures for Nbootstrap samples of the data (sampling with
replacement).
Recommendation. Since accuracy may give misleading results when classes are imbalanced we
recommend using measures sensitive to this, such as F-measure or balanced accuracy (Powers,
2011). For the systematic evaluation of the classifier we advise using k-fold cross validation and
setting K to 5 or 10 when data are large, as this ensures sufficient data for the training. For smaller
data sets, such as fewer than 100 documents, we suggest bootstrapping or choosing a higher K for
cross-validation.
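A k-fold cross-validation can be coded in a few lines of base R, as sketched below. The feature matrix x, the labels y, and the SVM call are illustrative placeholders reused from the earlier classification sketch, not the tutorial's own code.

```r
# Sketch: 5-fold cross-validation "by hand"; x, y, and the classifier are
# illustrative (see the classification sketch above).
set.seed(123)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(x)))  # random fold assignment

cv_accuracy <- sapply(1:k, function(i) {
  train <- folds != i
  fit   <- e1071::svm(x[train, ], y[train], kernel = "linear")
  pred  <- predict(fit, x[!train, ])
  mean(pred == y[!train])
})
mean(cv_accuracy)   # cross-validated accuracy estimate
```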
Model Validity
Figure 2 illustrates that a classification model consists of features and the generic classification
algorithm (Domingos, 2012). Thus the validity of the classification model depends both on the
choice of features and the algorithm.
Many TC applications use the set of unique words as the feature set (i.e., VSM). For organiza-
tional researchers this way of specifying the initial set of features may seem counterintuitive since
features are constructed in an ad hoc and inductive manner, that is, without reference to theory.
[Figure 2: flow from research questions and identification of documents of interest, through prelabeled documents and features (the w_ij's), to the classification algorithm and classification model, which is then applied to unlabeled documents (possibly with new features) to produce classified documents.]

Figure 2. Diagrammatic depiction of the text classification process.
Indeed, specifying the initial set of features, scoring features, transforming features, evaluating
features, and modifying the set of features in light of the evaluation constitutes a data-driven
approach to feature construction and selection (Guyon, Gunn, Nikravesh, & Zadeh, 2008). The
validity of the features is ultimately judged in terms of the classification performance of the resulting
classification model. But this does not mean that researchers should abandon theory based
approaches. If there is prior knowledge or theory that supports the choice of features then this can
be incorporated (Liu & Motoda, 1998). Theory can also be used as a basis for assigning scores to
features such as using theory to rank features according to importance. Our recommendation,
however, would be to have theory complement, as opposed to restrict, feature construction, because
powerful features (that may even be relevant to subsequent theory building and refinement) may
emerge inductively.
The second component, the classification algorithm, models the relationship between features
and class membership. Similar to the features, the validity of the algorithm is ultimately determined
from the classification performance and is also for the most part data driven. The validity of both the
features and the classification algorithm establishes the validity of the classification model.
A useful strategy to further assess the validity of the classification model is to compare the
classifications made by the model with the classification of an independent (group of) human
expert(s). Usually agreement between the model and the human expert(s) is quantified using mea-
sures of concordance or measures of how close the classification of the two correspond to one
another (such as Cohen’s kappa for interrater agreement where one “rater” is the classifier). Using
expert knowledge, labels can also be checked against standards. For example, in job task extraction
from a specific set of job vacancies one can check with experts or job incumbents to verify whether
the extracted tasks correspond to those tasks actually carried out on the job and whether specific
types of tasks are under or over represented.
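In R, such agreement can be quantified with, for example, the kappa2() function from the irr package, as in the sketch below; the two label vectors are illustrative.

```r
# Sketch: Cohen's kappa between classifier output and an expert's labels,
# using the irr package; the label vectors are illustrative.
library(irr)

model_labels  <- c("task", "nontask", "task", "task",    "nontask", "task")
expert_labels <- c("task", "nontask", "task", "nontask", "nontask", "task")

# kappa2() expects one column of ratings per "rater" (here: model and expert)
kappa2(data.frame(model_labels, expert_labels))
```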
Once model validity is established one may start applying the classification model to unlabeled
data. However, the model will still need to be reevaluated from time to time. When the performance
drops below an acceptability threshold, there are four possible solutions: (a) add more features or
change existing features, (b) try other classification algorithms, (c) do both, and/or (d) collect more
data or label additional observations.
Other Issues in TC
In this section we discuss how to deal with multiclass classification, where there is an increased
likelihood of classes being imbalanced, and provide some suggestions on determining training size
and what to do when obtaining labeled data is both expensive and difficult.
Multiclass classification. Multiclass classification pertains to dealing with more than 2 categories. The
preprocessing and representation parts are the same as in the binary case. The only changes are in the
choices of supervised feature selection techniques, classification techniques and evaluation mea-
sures. Most supervised feature selection techniques can be easily generalized to more than 2 cate-
gories. For example, when calculating CHI, we just need to add an extra column to the two-way
contingency table. Most techniques for classification we discussed previously have been extended to
multiclass classification. For example, techniques suited for binary classification problems (e.g.,
SVM) are extended to the multiclass case by breaking the multiclass problem into several binary
classification problems in either one-against-all or one-against-one approaches. In the former
approach we build binary classifiers by taking each category as the positive class and merging the
others into the negative class. Hence, if there are K categories, then we build K binary classifiers. For the latter approach, we construct a binary classifier for each pair of categories, resulting in $\frac{K(K-1)}{2}$ classifiers. Since several classifiers are built, and thus there are several outputs, final category
membership is obtained by choosing the category with the largest value for the decision function for
the one-against-all case or by a voting approach for the one-against-one case (Hsu & Lin, 2002).
The four evaluation measures can also be extended to classifications with more than two classes
by computing these measures per category, the same as in one-against-all, and averaging the results.
An example is the extension of F-measure called the macro F-measure which is obtained by
computing the F-measure of each category, and then averaging them.
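A macro F-measure can be computed by looping over the categories, as in the following sketch; the helper function is ours, introduced only for illustration.

```r
# Sketch: macro-averaged F-measure. Each class is treated as the positive
# class in turn (one-against-all) and the per-class F-measures are averaged.
macro_f <- function(truth, pred) {
  f_per_class <- sapply(levels(truth), function(cl) {
    tp <- sum(pred == cl & truth == cl)
    fp <- sum(pred == cl & truth != cl)
    fn <- sum(pred != cl & truth == cl)
    p  <- tp / (tp + fp)
    r  <- tp / (tp + fn)
    2 * p * r / (p + r)
  })
  mean(f_per_class, na.rm = TRUE)
}
```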
Imbalanced classification. By and large, in binary classification, when the number of observations in
one class represents less than 20% of the total number of observations, then the data can be seen as
imbalanced. The main danger of imbalanced classification is that we may train a classifier with a
high accuracy even if it fails to correctly classify the observations in the minority class. In some
cases, we are more interested in detecting the observations in the minority class. At the same time
however, we also want to avoid many false detections.
Obvious fixes are to label more observations until the classes are balanced, as was done by Holton (2009), or to disregard some observations in the majority class. In cases where clas-
sification problems are inherently imbalanced and labeling additional data is costly and difficult,
another approach is to oversample the minority class or to undersample the majority class during
classifier training and evaluation. A strategy called the synthetic minority oversampling technique
(SMOTE) is based on oversampling but instead of selecting existing observations in the minority
class it creates synthetic samples to increase the number of observations in the minority class
(Chawla, Bowyer, Hall, & Kegelmeyer, 2002). Preprocessing and representation remain the same
as in balanced classes. The parts that make use of class membership need to be adjusted for
imbalanced data.
There are options for supervised dimensionality reduction for imbalanced classification such as
those provided by Ogura et al. (2011). For the choice of classification techniques, those discussed
previously can be used with minor variations such as adjusting the costs of misclassification, which
is known as cost-sensitive classification (Elkan, 2001). Traditional techniques apply equal costs of
misclassification to all categories, whereas for cost-sensitive classification we can assign a large cost
for incorrect classification of observations in the minority class. For the choice of evaluation
measures, we suggest using the weighted F-measure or balanced accuracy. One last suggestion is
to treat imbalanced classification as an anomaly or outlier detection problem where the observations
in the minority class are the outliers (Chandola, Banerjee, & Kumar, 2009).
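As a concrete example of cost-sensitive classification, the sketch below weights the minority class more heavily via the class.weights argument of e1071::svm(). The toy data and the weight of 19 (roughly the inverse class ratio) are illustrative assumptions.

```r
# Sketch: cost-sensitive SVM for imbalanced classes using class weights.
# The data and the weights are illustrative.
library(e1071)

set.seed(42)
x_imb <- matrix(rnorm(100 * 10), nrow = 100)
y_imb <- factor(c(rep("majority", 95), rep("minority", 5)))

fit_weighted <- svm(x_imb, y_imb, kernel = "linear",
                    class.weights = c(majority = 1, minority = 19))
table(predict(fit_weighted, x_imb), y_imb)   # inspect minority-class hits
```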
Size of the training data. A practical question that often arises is how many documents one should
label to ensure a valid classifier. The size of the training dataset depends on many considerations
such as the cost and limitations associated with acquiring prelabeled documents (e.g., ethical and
legal impediments) and the kind of learning framework we are using. In the probably approximately
correct (PAC) learning framework, which is perhaps the most popular framework for learning
concepts (such as the concept of spam emails or party affiliation) training size is determined by
the type of classification technique, the representation size, the maximum error rate one is willing to
tolerate, and the probability of not exceeding the maximum error rate. Under the PAC learning
framework, formulae have been developed to determine the lower bound for the training size, an
example being the one by Goldman (2010): $O\!\left(\frac{1}{\epsilon}\log_2\frac{1}{\delta} + \frac{VCD(C)}{\epsilon}\right)$, where $\epsilon$ is the maximum error rate, $1 - \delta$ indicates the probability that the error will not exceed $\epsilon$, and $VCD(C)$ is the VC dimension of the classifier $C$. VCD stands for the Vapnik-Chervonenkis dimension of the classifier $C$, which can be interpreted as the expressive power of the classifier and depends on the representation size and the form of the classifier (e.g., axis parallel rectangle, closed sets, or half-spaces). As an illustration, suppose we want to learn the concept of positive sentiment from English text. We then represent each document as a vector of 50,000 dimensions (the number of commonly used English words) and our classification technique constructs a hyperplane that separates positive and negative observations (e.g., SVM using the ordinary dot product as kernel). If we want to ensure with probability 0.99 that the error rate will not exceed 0.01, then the minimum training size is $\frac{1}{0.01}\log_2\frac{1}{0.01} + \frac{50{,}001}{0.01} \approx 5{,}000{,}764$. This means we would need at least 5 million documents. Here we calculated the VCD of $C$ using the formula $d + 1$, where $d$ is the dimensionality of the representation, since we consider classifiers that construct hyperplane boundaries (half-spaces) in 50,000 dimensions. Of course, in practice dimensionality reduction can be applied while still obtaining an adequate representation. If one managed to reduce the dimensionality to 200, then the lower bound for the training size would be dramatically reduced to 20,765.
We can tweak this lower bound by adjusting the other parameters.
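The back-of-the-envelope calculation above can be packaged into a small helper function, as sketched below; the function name is ours, and the use of VCD = d + 1 for half-space classifiers follows the illustration in the text.

```r
# Sketch: PAC lower bound on training size, (1/eps) * log2(1/delta) + VCD/eps,
# with VCD = d + 1 for half-space (hyperplane) classifiers.
pac_lower_bound <- function(eps, delta, d) {
  (1 / eps) * log2(1 / delta) + (d + 1) / eps
}

pac_lower_bound(eps = 0.01, delta = 0.01, d = 50000)  # approximately 5,000,764
pac_lower_bound(eps = 0.01, delta = 0.01, d = 200)    # approximately 20,765
```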
Although formulae provide theoretical guarantees, determining training size is largely empiri-
cally driven and involves a good deal of training, evaluation, and validation. To give readers an idea
of training sizes as typically found in practice, Table 1 provides information about the training data
sizes for some existing TC studies.
Suggestions when labeled data are scarce. In many classification problems, labeled data are costly or
difficult to obtain. Fortunately, even in this case, principled approaches can be applied. In practice,
unlabeled data are plentiful and we can apply techniques to make use of the structure and patterns in
the unlabeled data. This approach of using unlabeled data in classification is called semisupervised
classification (Zhu, 2005). Various assumptions are made to make semisupervised classification
feasible. Examples are the smoothness assumption which says that observations near each other are
likely to share the same label, and the cluster assumption which states that if the observations form
clusters then observations in the same cluster are likely to share the same label (Zhu, 2005).
Another approach is to use classification output to help us determine which observations to label.
In this way, we take a targeted approach to labeling by labeling those observations which are most
likely to generate better classifiers. This is called active learning in the machine learning literature
(Settles, 2010). Active learning is made possible because some classifiers give membership prob-
abilities or confidence rather than a single decision as to whether to assign to one class or not. For
example, if a classifier assigns equal membership probabilities to all categories for a new observa-
tion then we call an expert to label the new observation. For a review of active learning techniques
we refer the reader to Fu, Zhu, and Li (2013).
Tutorial
We developed the following tutorial to provide a concrete treatment of TC. Here we demonstrate TC
using actual data and codes. Our intended audience are researchers who have little or no experience with
TC. This tutorial is a scaled-down version of our work on using TC to automatically extract job tasks from
job vacancies. Our objective is to build a classifier that automatically classifies sentences into task or
nontask categories. The sentences were obtained from German language nursing job vacancies.
We set out to automate the process of classification because one can then deal with huge numbers
(i.e., millions) of vacancies. The output of the text classifier can be used as input to other research or
tasks such as job analysis or the development of tools to facilitate personnel decision making. We
used the R software since it has many ready-to-use facilities that automate most TC procedures. We
provide the R annotated scripts and data to run each procedure. Both codes and data can be down-
loaded as a Zip file from GitHub; the URL is https://github.com/vkobayashi/textclassificationtutorial. R scripts are named in the following format: CodeListing (CL) <number>.R, and in this tutorial we reference them as CL <number>. Thus, CL 1 refers to the script CodeListing_1.R. Note
that the CL files contain detailed descriptions of each command, and that each command should be
run sequentially.
Table 1. Training Sizes, Number of Categories, Evaluation Measures, and Evaluation Procedures Used in Various Text Classification Studies.

| Citations | Subject Matter | Training Size | Number of Categories | Evaluation Measure | Evaluation Procedure |
|---|---|---|---|---|---|
| Phan, Nguyen, & Horiguchi, 2008 | Domain disambiguation for web search results | 12,340 | 8 | Accuracy | Fivefold cross-validation (CV) |
| Moschitti & Basili, 2004; Phan et al., 2008 | Disease classification for medical abstracts | 28,145 | 5 | Accuracy | Fivefold CV |
| Vo & Ock, 2015 | Titles of scientific documents | 8,100 | 6 | Accuracy & F-measure | Fivefold CV |
| Khoo et al., 2006 | Response emails of operators to customers | 1,486 | 14 | Accuracy & F-measure | Tenfold CV |
| Chen, Huang, Tian, & Qu, 2009; Li & Jain, 1998; Moschitti & Basili, 2004; Ogura et al., 2011; Scott & Matwin, 1999; Song et al., 2005; Toman et al., 2006; Torunoğlu et al., 2011; Uysal & Gunal, 2014; Yang & Pedersen, 1997; W. Zhang, Yoshida, & Tang, 2008 | News items | 764-19,997 | 4-93 | Accuracy, microaveraged breakeven points, F-measure, recall, precision, breakeven point | 1 training and 2 test, single train-test, 20 splits with intercorpus evaluation, fourfold CV, tenfold CV, 20 splits |
| Atteveldt et al., 2008 | Sentiment analysis of actor-issue relationship | 5,348 | 2 | F-measure | Did not mention |
| Dave et al., 2003 | Product review sentiments | 31,574 | 7 | Accuracy | 2 test |
| Scott & Matwin, 1999 | Song lyrics | 6,499 | 33 | Microaveraged breakeven points | Single train-test |
| Chen et al., 2009 | Chinese news texts | 2,816 | 10 | Accuracy & F-measure | Single train-test |
| Kanaan et al., 2009 | Arabic news documents | 1,445 | 9 | F-measure | Fourfold CV |
| Ragas & Koster, 1998 | Dutch documents | 1,436 | 4 | Accuracy | Single train-test |
| Méndez et al., 2006; Panigrahi, 2012; Uysal & Gunal, 2014; Youn & McLeod, 2007 | Emails | 400-9,332 | 2 | F-measure | Single train-test |
| Torunoğlu et al., 2011; Uysal & Gunal, 2014 | Turkish news items | 1,150-99,021 | 5-10 | F-measure | |
| Moschitti & Basili, 2004 | Italian news items | 16,000 | 8 | F-measure & breakeven point | 20 splits with intercorpus validation |
| Toman et al., 2006 | Czech news items | 8,000 | 5 | F-measure | Fourfold CV |
| Thelwall, Buckley, Paltoglou, Cai, & Kappas, 2010 | Cyberspace comment sentiment analysis | 1,041 | 5 | Accuracy | Tenfold CV |
| Holton, 2009 | Disgruntled employee communications | 80 | 2 | Accuracy | Single train-test, varying proportion |
| Shen et al., 2013 | Personality from emails (a multilabel classification problem) | 114,907 | 3 categories per personality | Accuracy | Tenfold CV and single train-test |
All the scripts were tested and are expected to work on any computer (PC or Mac) with R,
RStudio, and the required libraries installed. However, basic knowledge including how to start R,
open R projects, run R commands, and install packages in RStudio are needed to run and understand
the codes. For those new to R we recommend following an introductory R tutorial (see, for example,
DataCamp [www.datacamp.com/courses/free-introduction-to-r] or tutorialspoint [www.tutorialspoint.com/index.htm] for free R tutorials).
This tutorial covers each of the previously enumerated TC steps in sequence. For each step we
first explain the input, elaborate the process, and provide the output, which is often the input for the
subsequent step. Table 2 provides a summary of the input, process, and output for each step in this
tutorial. Finally, after downloading the codes and data, open the text_classification_tutorial.Rproj
file. The reader should then run the codes for every step as we go along, so as to be able to examine
the input and the corresponding output.
Preparing Text
The input for this step consists of the raw German job vacancies. These vacancies were obtained
from Monsterboard (www.monsterboard.nl). Since the vacancies are webpages, they are in hyper-
text markup language (HTML), the standard markup language for representing content in web
documents (Graham, 1995). Apart from the relevant text (i.e., content), raw HTML pages also
contain elements used for layout. Therefore, a technique known as HTML parsing is used to separate
the content from the layout.
In R, parsing HTML pages can be done using the XML package. This package provides two functions that we use here, namely, htmlTreeParse(), which parses HTML documents, and xpathSApply(), which extracts specific content from parsed HTML documents. CL 1 (see the annotations in the file for further details as to what each command does) installs and loads the XML package and applies the htmlTreeParse() and xpathSApply() functions. In addition, the contents of the HTML file
sample_nursing_vacancy.html in the folder data are imported as a string object and stored in
the variable htmlfile. Subsequently, this variable is provided as an argument to the
htmlTreeParse() function. The parsed content is then stored in the variable rawpagehtml,
which in turn is the doc argument to the xpathSApply() function which searches for the tags in
the text that we are interested in. In our case this text can be found in the div tag of the class
content. Tags are keywords surrounded by angle brackets (e.g., <div> and </div>). The
xmlValue in the xpathSApply() function means that we are obtaining the content of the HTML
element between the corresponding tags. Finally the writeLines() function writes the text
content to a text file named sample_nursing_vacancy.txt (in the folder parsed).
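To give a flavor of what CL 1 does, a minimal sketch of the same steps is shown below; the file paths and the exact XPath query ("//div[@class='content']") are illustrative and should be adapted to the structure of your own HTML files.

install.packages("XML")   # only needed once
library(XML)

# read the raw page as a single string
htmlfile <- paste(readLines("data/sample_nursing_vacancy.html", encoding = "UTF-8"),
                  collapse = "\n")

# parse the HTML and pull out the text inside <div class="content">
rawpagehtml <- htmlTreeParse(htmlfile, asText = TRUE, useInternalNodes = TRUE)
content     <- xpathSApply(rawpagehtml, "//div[@class='content']", xmlValue)

# write the extracted content to a plain text file
writeLines(content, "parsed/sample_nursing_vacancy.txt")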
To extract text from several HTML files, the codes in CL 1 are put in a loop in CL 2. The function
htmlfileparser() in CL 2 accepts two arguments and applies the procedures in CL 1 to each
HTML file in a particular folder. The first argument is the name of the folder and the second argument
is the name of the destination folder where the extracted text content is to be written. Supposing these HTML files are in the folder vacancypages and the extracted text content is to be saved in the folder parsedvacancies, these are the arguments we provide to htmlfileparser(). As expected, the number of text files generated corresponds to the number of HTML files, provided that all HTML files are well-formed (e.g., correctly formatted). The text files comprise the output for this step.
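A possible shape for such a wrapper, a sketch in the spirit of CL 2 rather than the tutorial code itself, is:

library(XML)

htmlfileparser <- function(sourcefolder, destfolder) {
  # apply the CL 1 steps to every HTML file in sourcefolder
  files <- list.files(sourcefolder, pattern = "\\.html$", full.names = TRUE)
  for (f in files) {
    doc     <- htmlTreeParse(f, useInternalNodes = TRUE)
    content <- xpathSApply(doc, "//div[@class='content']", xmlValue)
    outfile <- file.path(destfolder, sub("\\.html$", ".txt", basename(f)))
    writeLines(content, outfile)
  }
}

htmlfileparser("vacancypages", "parsedvacancies")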
Preprocessing Text
The preprocessing step consists of two stages. The first identifies sentences in the vacancies, since the sentence is our unit of analysis, and the second applies text preprocessing operations on the sentences.
Table 2. Text Classification Based on the Input-Process-Output Approach.

Text Preparation (Text Preprocessing)
Input: Raw HTML files
Process: Parsing, sentence segmentation
Output: Raw text file (one sentence per line)

Text Cleaning (Text Preprocessing)
Input: Output from text preparation
Process: Punctuation, number, and stopword removal; lower case transformation
Output: Raw text file sentences where all letters are in lower case and without punctuation, numbers, and stopwords

Text Transformation
Input: Output from text cleaning
Process: Word tokenization; constructing the document-by-term matrix where the words are the features and the entries are the raw frequencies of the words in each document
Output: Document-by-term matrix

Dimensionality Reduction
Input: Document-by-term matrix
Process: Latent semantic analysis and/or supervised scoring methods for feature selection
Output: Matrix where the columns are the new set of features, or the reduced document-by-term matrix

Classification
Input: Output from dimensionality reduction
Process: Apply classification algorithms such as naive Bayes, support vector machines, or random forest
Output: Classification model

Evaluation
Input: Classification model, test data, and an evaluation measure
Process: Classify the documents in the test data and compare with the actual labels; calculate the value of the evaluation measure
Output: Value for the evaluation measure

Validation
Input: Classification from the model
Process: Compute classification performance using an independent validation data set or compare the classification to the classification of domain experts
Output: Measure of agreement (one can quantify the agreement through the use of an existing evaluation measure)
We used sentences as our unit of analysis since our assumption is that the sentence is the right resolution level to detect job task information. We did not use the vacancy as our unit of
analysis since a vacancy may contain more than one task. In fact if we chose to treat the
vacancy as the unit of analysis it would still be important to identify which of the sentences
contain task information. Another reason to select sentence as the unit of analysis is to mini-
mize variance in document length. Input for the first stage are the text files generated from the
previous step, and the output sentences from this stage serve as input to the second stage. CL 3
contains functions that can detect sentences from the parsed HTML file in the previous section
(i.e., sample_nursing_vacancy.html).
The code loads the openNLP package. This package contains functions that run many popular
natural language processing (NLP) routines including a sentence segmentation algorithm for the
German language. Although the German sentence segmenter in openNLP generally works well, at
times it may fail. Examining such failures in the output can provide ideas for the inclusion of new
arguments in the algorithm. For example, if the segmenter encounters the words bzw. and inkl.
(which are abbreviations of “or” and “including” respectively in German) then the algorithm will
treat the next word as the start of a new sentence. This is because the algorithm has a rule that when
there is a space after a period the next word is the start of a new sentence. To adjust for these and
other similar cases, we created a wrapper function named sent_tokens(). Another function
sent_split() searches for matches of the provided pattern within the string and when a match
is found it separates the two sentences at this match. For example, some vacancies use bullet points
or symbols such as “|” to enumerate tasks or responsibilities. To separate these items we supply the
symbols as arguments to the function. Finally, once the sentences are identified the code writes the
sentences to a text file where one line corresponds to one sentence.
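The wrappers themselves are in CL 3; the underlying openNLP call is sketched below. The sketch assumes that the German model package openNLPmodels.de has been installed (e.g., from the datacube repository) and omits the corrections implemented in sent_tokens() and sent_split().

library(NLP)
library(openNLP)

# read the parsed vacancy and collapse it into a single string
text <- as.String(paste(readLines("parsed/sample_nursing_vacancy.txt",
                                  encoding = "UTF-8"), collapse = " "))

# detect sentence boundaries with the German maximum entropy model
sent_annotator <- Maxent_Sent_Token_Annotator(language = "de")
boundaries     <- annotate(text, sent_annotator)
sentences      <- text[boundaries]   # one character string per detected sentence

writeLines(sentences, "sentences_from_sample_vacancy/sentencelines_nursing_vacancy.txt")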
For multiple text files, the codes should again be run in a loop. One large text file will then be
generated containing the sentences from all parsed vacancies. Since we put all sentences from all
vacancies in a single file, we attached the names of the corresponding text vacancy files to the
sentences to facilitate tracing back the source vacancy of each sentence. Thus, the resulting text file
containing the sentences has two columns: the first column contains the file names of the vacancies
from which the sentences in the second column were derived.
After applying sentence segmentation on the parsed vacancy in sample_nursing_vacancy.txt, the
sentences are written to the file sentencelines_nursing_vacancy.txt located in the folder
sentences_from_sample_vacancy. The next task is to import the sentences into R so that additional
preprocessing (e.g., text cleaning) can be performed. Other preprocessing steps that may be applied
are lower case transformation, punctuation removal, number removal, stopword removal, and
stemming. For this we use the tm package in R. This package automatically applies word
tokenization, so we do not need to create separate commands for that.
The sentences are imported as a data frame in R (see CL 4). Since the sentence is our unit of
analysis, hereafter we refer to these sentences as documents. The first column is temporarily ignored
since it contains only the names of the vacancy files. Since the sentences are now stored in a vector
(in the second column of the data frame), the VectorSource() function is used. The source determines
where to find the documents. In this case the documents are in mysentences[,2]. If the documents are
stored in another source, for example in a directory rather than in a vector, one can use DirSource().
For a list of supported sources, invoke the function getSources(). Once the source has been set, the
next step is to create a corpus from this source using the VCorpus() function. In the tm package,
corpus is the main structure for managing documents. Several preprocessing procedures can be
applied to the documents once collected in the corpus. Many popular preprocessing procedures
are available in this package. Apart from the existing procedures, users can also specify their
own via user-defined functions. The procedures we applied are encapsulated in the
transformCorpus() function. They include number, punctuation, and extra whitespace removal,
and lower case conversion. We did not apply stemming since previous work recommends not to use
stemming for short documents (H.-F. Yu et al., 2012). The output consists of the cleaned sentences
in the corpus with numbers, punctuation, and superfluous whitespaces removed.
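In code, the cleaning stage of CL 4 amounts to roughly the following sketch; the file name, delimiter, and two-column layout follow the description above and may differ from the tutorial files.

library(tm)

# column 1: vacancy file name, column 2: sentence text (assumed tab-separated)
mysentences <- read.delim("sentences_from_vacancies/sentenceVacancies.txt",
                          header = FALSE, sep = "\t", stringsAsFactors = FALSE)

corpus <- VCorpus(VectorSource(mysentences[, 2]))

transformCorpus <- function(corpus) {
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  tm_map(corpus, stripWhitespace)
}
corpus <- transformCorpus(corpus)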
Text Transformation
CL 5 details the structural transformation of the documents. The input in this step is the
output from the preceding step (i.e., the cleaned sentences in the training data). To quantify text
characteristics, we use the VSM because this is the simplest and perhaps most straightforward
approach to quantify text and thus forms an appropriate starting point in the application of TC (Frakes
& Baeza-Yates, 1992; Salton et al., 1975). For this transformation, the tm package provides the DocumentTermMatrix() function, which may be used to build features based on the individual words in the corpus.
The DocumentTermMatrix() function transforms a corpus into a matrix where the rows are
the documents, the columns are features, and the entries are the weights of the features in each
document. The default behavior of the DocumentTermMatrix() function is to ignore terms with fewer than 3 characters. Hence, it is possible that some rows consist entirely of 0s because, after preprocessing, it may be the case that in some sentences all remaining terms have fewer than 3 characters. The output in this step is the constructed DTM. This matrix is then used as a basis for
further analysis. We can further manipulate the DTM, for instance by adjusting the weights.
We mentioned previously that for word features one can use raw counts as weights. The idea of
using raw counts is that the higher the count of a term in a document the more important it is in that
document. The DocumentTermMatrix() function uses the raw count as the default weighting option. One can specify other weights through the weighting option of the control argument. To take into account document sizes, for example, we can apply a normalization to the weights, although in this case it is not an issue because sentences are short.
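As a sketch (assuming the cleaned corpus from the previous step), the default and an alternative weighting can be requested as follows:

library(tm)

# default: raw term frequencies; terms shorter than 3 characters are dropped
dtm <- DocumentTermMatrix(corpus)

# alternative: length-normalized tf-idf weights via the control argument
dtm_tfidf <- DocumentTermMatrix(corpus,
                                control = list(weighting =
                                  function(x) weightTfIdf(x, normalize = TRUE)))

inspect(dtm)   # peek at the resulting matrix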
Let us assign a “weight” to a feature that reflects its importance with respect to the entire corpus
using the DF. Another useful feature of DF is that it provides us with an idea of what the corpus is
about. For our example the word with the highest DF (excluding stopwords) is pflege (which translates to "care"), which makes sense because nursing is about the provision of care. Terms that
are extremely common are not useful for classification.
Another common text analysis strategy is to find keywords in documents. The keywords may be
used as a heuristic to determine the most likely topic in each document. For this we can use the
TF-IDF measure. The keyword for each document is the word with the maximum TF-IDF weight
(ties are resolved through random selection). The codes in CL 6 compute the keyword for each
document. For example, the German keyword for Document 4 is aufgabenschwerpunkte which
translates in English to “task focal points.”
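A rough equivalent of the CL 6 computations (with ties broken by the first maximum rather than at random) is:

m <- as.matrix(dtm)

# document frequency: in how many sentences does each term occur?
df <- colSums(m > 0)
head(sort(df, decreasing = TRUE))

# keyword per document: the term with the largest tf-idf weight
tfidf    <- as.matrix(weightTfIdf(dtm))
keywords <- colnames(tfidf)[apply(tfidf, 1, which.max)]
keywords[4]   # keyword of Document 4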
The final DTM can be used as input to dimensionality reduction techniques or directly to the
classification algorithms. The process from text preprocessing to text transformation culminated in
the DTM that is depicted in Figure 3.

Figure 3. Illustration of text preprocessing from raw HTML file to document-by-term matrix (panels show a sample vacancy and the associated HTML file, the text content after HTML parsing, the sentences after sentence segmentation, and the resulting document-by-term matrix).
Dimensionality Reduction
Before running classification algorithms on the data, we first investigate which among the features
are likely most useful for classification. Since the initial features were selected in an ad hoc manner,
that is, without reference to specific background knowledge or theory, it may be possible that some
of the features are irrelevant. In this case, we applied dimensionality reduction to the DTM.
LSA is commonly applied to reduce the size of the feature set (Landauer et al., 1998). The output of LSA yields new dimensions which reveal underlying patterns in the original features. The new features can be interpreted as new terms that summarize the contextual similarity of the original terms. Thus, LSA partly addresses issues of synonymy and in some circumstances, polysemy (i.e.,
when a single meaning of a word is used predominantly in a corpus). In R, the lsa package contains
a function that runs LSA.
To illustrate LSA we need additional vacancies. For illustrative purposes we used 11 job vacan-
cies (see the parsedvacancies folder). We applied sentence segmentation to all the vacancies and
obtained a text file containing 425 sentences that were extracted from the 11 vacancies (see the
sentences_from_vacancies folder). After applying preprocessing and transforming the sentences in
sentenceVacancies.txt into a DTM, we obtained 1,079 features and retained 422 sentences. We selected all sentences and ran LSA on the transposed DTM (i.e., the term-by-document matrix; see CL 7). We applied normalization to the term frequencies to minimize the effect of longer sentences.
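A minimal sketch of this step is shown below; the normalization shown and the automatic choice of the number of dimensions are illustrative and may differ from the settings in CL 7.

library(lsa)

tdm <- t(as.matrix(dtm))   # term-by-document matrix

# one way to normalize term frequencies so longer sentences do not dominate
tdm <- sweep(tdm, 2, pmax(colSums(tdm), 1), "/")

# run LSA; dimcalc_share() lets the package pick the number of dimensions
lsa_space <- lsa(tdm, dims = dimcalc_share())

# readjusted term-by-document matrix in the LSA space
projdocterms <- lsa_space$tk %*% diag(lsa_space$sk) %*% t(lsa_space$dk)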
Documents and terms are projected onto the constructed LSA space in the projdocterms
matrix. The entries in this matrix are readjustments of the original entries in the term-by-
document matrix. The readjustments take into account patterns of co-occurrence between terms.
Hence, terms which often occur together will roughly have the same values in documents where they
are expected to appear. We can apply the cosine measure to identify similar terms. Similarity is
interpreted in terms of having the same pattern of occurrence. For example, terms which have the
same pattern of occurrence with sicherstellung can be found by running the corresponding
commands in CL 8.
The German word sicherstellung (which means “to guarantee” or “to make sure” in
English) is found to be contextually similar to patientenversorgung (patient care) and
reibungslosen (smooth or trouble-free) because these two words appeared together with
sicherstellung (to guarantee) in the selected documents. Another interesting property of LSA
is that it can uncover similarity between two terms even though the two terms may never be found to
co-occur in a single document. Consider for example the word koordinierung (coordination), we
find that kooperative (cooperative) is a term with which it is associated even though there is not
one document in the corpus in which the two terms co-occur. This happens because both terms are
found to co-occur with zusammenarbeit (collaboration), thus when either one of the terms occurs
then LSA expects that the other should also be present. This is the way LSA addresses the issue of
synonymy and polysemy. One can also find the correlation among documents and among terms by
running the corresponding commands in CL 8.
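In the spirit of CL 8, contextually similar terms can be retrieved with the cosine measure from the lsa package, for example:

library(lsa)

# cosine similarity of one term (row of projdocterms) against all other terms
similar_terms <- function(term, space, top = 5) {
  sims <- apply(space, 1, cosine, y = space[term, ])
  sort(sims, decreasing = TRUE)[2:(top + 1)]   # drop the term itself
}

similar_terms("sicherstellung", projdocterms)
similar_terms("koordinierung", projdocterms)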
Since our aim is to reduce dimensionality, we project the documents to the new dimensions.
This is accomplished through the corresponding codes in CL 8. From the LSA, we obtain a total of
107 new dimensions from the original 1,079 features. It is usually not easy to attach natural
language interpretations to the new dimensions. In some scenarios, we can interpret the new
dimension by examining the scaled coefficients of the terms on the new dimensions (much like
in PCA). Terms with higher loadings on a dimension have a greater impact on that dimension.
Figure 4 visualizes the terms with high numerical coefficients on the first 6 LSA dimensions (see
CL 8 for the relevant code). Here we distinguish between terms found to occur in a task sentence
(red) or not (blue). In this way, an indication is provided of which dimensions are indicative for
each class (note that distinguishing between tasks and nontasks requires the training data, which is
discussed in greater detail below).

Figure 4. Loadings of the terms on the first 6 LSA dimensions using 422 sentences from 11 vacancies.
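A sketch of the projection of the documents onto the new dimensions, which serves as the reduced feature matrix in the remainder of this tutorial, is:

# one row per sentence, one column per LSA dimension
doc_lsa <- lsa_space$dk %*% diag(lsa_space$sk)
dim(doc_lsa)   # 422 x 107 for the tutorial corpus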
Another approach that we could try is to downsize the feature set by eliminating those features
that are not (or less) relevant. Such techniques are collectively called filter methods (Guyon &
Elisseeff, 2003). They work by assigning scores to features and setting a threshold whereby features
having scores below the threshold are filtered out. Both the DF and IDF can be used as scoring
methods. However, one main disadvantage of DF and IDF is that they do not use class membership
information in the training data. Including class membership (i.e., through supervised scoring
methods) ought to be preferred, as it capitalizes on the discriminatory potential of features (Lan
et al., 2009).
For supervised scoring methods, we need to rely on the labels of the training data. In this example,
the labels are whether a sentence expresses task information (1) or not (0). These labels were
obtained by having experts manually label each sentence. For our example, experts manually
assigned labels to the 425 sentences. We applied three scoring methods, namely, Information Gain,
Gain ratio, and Symmetric Uncertainty (see CL 12). Due to the limited number of labeled docu-
ments, these scoring methods yielded less than optimal results. However, they still managed to
detect one feature that may be useful for identifying the class of task sentences, that is, the word
zusammenarbeit (collaboration), as this word most often occurred in task sentences. The output
from this step is a column-reduced matrix that is either the reduced version of the DTM or the matrix
with the new dimensions. In our example we applied LSA and the output is a matrix in which the
columns are the LSA dimensions.
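The FSelector package offers one way to compute such supervised scores (the tutorial's CL 12 may be implemented differently); the vector labels below is assumed to hold the 0/1 expert annotations aligned with the rows of the DTM.

library(FSelector)

# data frame with one column per term and the class label as last column
traindata <- data.frame(as.matrix(dtm), label = factor(labels))

ig <- information.gain(label ~ ., traindata)
gr <- gain.ratio(label ~ ., traindata)
su <- symmetrical.uncertainty(label ~ ., traindata)

# e.g., keep the 50 highest-scoring terms according to information gain
selected_terms <- cutoff.k(ig, 50)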
Classification
The reduced matrix from the preceding section can be used as input for classification algorithms.
The output from this step is a classification model which we can then use to automatically classify
sentences in new vacancies. We have mentioned earlier that reducing dimensionality is an
empirically driven decision rather than one which is guided by specific rules of thumb. Thus, we will
test whether the new dimensions lead to improvement in performance as compared to the original set
by running separate classification algorithms, namely support vector machines (SVMs), naive
Bayes, and random forest, on each set. These three have been shown to work well on text data
(Dong & Han, 2004; Eyheramendy et al., 2003; Joachims, 1998).
Accuracy is not a good performance metric in this case since the proportion of task sentences in
our example data is low (less than 10%). The baseline accuracy (computed from the model which assigns all sentences to the dominant class) would be 90%, which is high and thus difficult to
improve upon. More suitable performance metrics are the F-measure (Ogura et al., 2011; Powers,
2011) and balanced accuracy (Brodersen, Ong, Stephan, & Buhmann, 2010). We use these two
measures here since the main focus is on the correct classification of task sentences and we also want
to control for misclassifications (nontask sentences put into the task class or task sentences put into
the nontask class).
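Both measures are easily computed from a 2 x 2 confusion matrix in which the task class is the positive class; a small helper (predicted and actual are hypothetical 0/1 vectors) could look like this:

evaluate <- function(predicted, actual) {
  tp <- sum(predicted == 1 & actual == 1)   # tasks correctly found
  fp <- sum(predicted == 1 & actual == 0)   # nontasks labeled as tasks
  fn <- sum(predicted == 0 & actual == 1)   # tasks that were missed
  tn <- sum(predicted == 0 & actual == 0)   # nontasks correctly rejected

  precision   <- tp / (tp + fp)
  recall      <- tp / (tp + fn)
  specificity <- tn / (tn + fp)

  c(F_measure         = 2 * precision * recall / (precision + recall),
    balanced_accuracy = (recall + specificity) / 2)
}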
In assessing the generalizability of the classifiers, we employed 10 times repeated 10-fold cross-validation. We repeated 10-fold cross-validation 10 times because of the limited training data. We use one part of the data to train a classifier and test its performance by applying the classifier on the remaining part and computing the F-measure and balanced accuracy. For the 10 times repeated 10-fold cross-validation, we performed 100 runs for each classifier using the reduced and original feature sets. Hence, for the example we ran about 600 training runs, since we trained 6 classifier and feature set combinations in total.
reported are computed using the test sets (see CL 10).
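CL 10 contains the full evaluation code; one common way to set up such repeated cross-validation is with the caret package, sketched below. Here doc_lsa and labels are the LSA document projection and the 0/1 annotations from the sketches above, the tutorial code may differ, and optimizing the F-measure or balanced accuracy would additionally require a custom summaryFunction in trainControl().

library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

traindata <- data.frame(doc_lsa,
                        label = factor(labels, labels = c("nontask", "task")))

rf_fit  <- train(label ~ ., data = traindata, method = "rf",        trControl = ctrl)
svm_fit <- train(label ~ ., data = traindata, method = "svmRadial", trControl = ctrl)
nb_fit  <- train(label ~ ., data = traindata, method = "nb",        trControl = ctrl)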
From the results we see how classification performance varies across the choice of features,
classification algorithms, and evaluation measures. Figure 5 presents the results of the cross-
validation. Based on the F-measure, random forest yielded the best performance using the LSA
reduced feature set. The highest F-measure obtained is 1.00 and the highest average F-measure is
0.40 both from random forest. SVM and naive Bayes have roughly the same performance. This
suggests that among the three classifiers random forest is the best classifier using the LSA reduced
feature set, and F-measure as the evaluation metric. If we favor the correct detection of task
sentences and we want a relatively small dimensionality then random forest should thus be favored
over the other methods. For the case of using the original features, SVM and random forest exhibit
comparable performance. Hence, when using F-measure and the original feature set, either SVM or
random forest would be the preferred classifier. The low values of the F-measures can be accounted
for by the limited amount of training data. For each fold, we found that there are about 3-4 task
sentences, thus a single misclassification of a task sentence leads to sizeable reduction in precision
and recall, which in turn results in a low F-measure value.

Figure 5. Comparison of classification performance among three classifiers and between the term-based and LSA-based features.
When balanced accuracy is the evaluation measure, SVM and random forest consistently yield
similar performance when using either the LSA reduced feature set or the original feature set,
although random forest yielded slightly higher performance compared to SVM using the LSA reduced feature set. This seems to suggest that for balanced accuracy and employing the original
features, one can choose between SVM and random forest, and if one decides to use the LSA feature
set then random forest is to be preferred. Moreover, notice that the numerical value for balanced
accuracy is higher than F-measure. Balanced accuracy can be increased by the accuracy of the
dominant class, in this case the nontask class.
This classification example reveals the many issues that one may face in building a suitable
classification model. First is the central role of features in classification. Second is how to model the
relationship between the features and the class membership. Third is the crucial role of choosing an
appropriate evaluation measure or performance metric. This choice should be guided by the nature
of the problem, the objectives of the study, and the amount of error we are willing to tolerate. In our
example, we assign equal importance to both classes, and we therefore have a slight preference for
balanced accuracy. In applications where the misclassification cost for the positive class is greater
than that for the other class, the F-measure may be preferred. For a discussion of alternative
evaluation measures see Powers (2011).
Other issues include the question of how to set a cutoff value for the evaluation measure to judge
whether a model is good enough. A related question is how much training data are needed for the
classification model to generalize well (i.e., how to avoid overfitting). These questions are best
answered empirically through systematic model evaluation, such as by trying different training sizes
and varying the threshold, and then observing the effect on classifier performance. One strategy is to
treat this as a factorial experiment where the choices of training size and evaluation measures are
considered as factor combinations. In addition, one has to perform repeated evaluation (e.g., cross-
validation) and validation. Aside from modeling issues there are also practical concerns such as the
cost of acquiring training data and the interpretability of the resulting model. Classification models
with high predictive performance are not always the ones that yield the greatest insight. Insofar as
the algorithm is to be used to support decision making, the onus is on the researcher to be able to
explain and justify its workings.
Classification for Job Information Extraction
For our work on job task information extraction, three people hand labeled a total of 2,072 out of 60,000 sentences. It took a total of 3 days to label, verify, and relabel these sentences. From this total,
132 sentences were identified as task sentences (note that the task sentences were not unique). The
proportion of task sentences in vacancy texts was only 6%. This means that the resulting training
data are imbalanced. This is because not all tasks that are part of a particular job will be written in the
vacancies, likely only the essential and more general ones. This partly explains their low proportion.
Since labeling additional sentences would be costly and time-consuming, we employed a semisupervised learning approach called label propagation (Zhu & Ghahramani, 2002). For the transfor-
mation and dimensionality reduction we respectively constructed the DTM and applied LSA. Once
additional task sentences were obtained via semisupervised learning we ran three classification
algorithms, namely, SVM, random forest, and naive Bayes. Instead of choosing a single classifier
we combined the predictions of the three in a simple majority vote. For the evaluation measure we
used the Recall measure since we wanted to obtain as many task sentences as possible. Cross-
validation was used to assess the generalization property of the model. The application of classifi-
cation resulted to identification of 1,179 new task sentences. We further clustered these sentences to
obtain unique nursing tasks since some sentences pointed to the same tasks.
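The majority vote itself is straightforward; with three hypothetical 0/1 prediction vectors it reduces to:

# a sentence is labeled a task when at least two of the three models agree
votes         <- svm_pred + nb_pred + rf_pred
majority_pred <- as.integer(votes >= 2)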
Model Reliability and Validity
We set out to build a classification model that can extract sentences containing nursing tasks from
job vacancies. Naturally, a subsequent step is to determine whether the extracted task sentences
correspond to real tasks performed by nurses. An approach to establish construct validity is to use an
independent source to examine the validity of the classification. Independent means that the source
should be blind to the data collection activity, initial labeling procedure, and model building process. Moreover, in case ratings are obtained, these should be provided by subject matter experts (SMEs), that is, individuals who have specialist knowledge about the application domain. If found to be sufficiently valid, the extracted sentences containing job tasks may then be used for other purposes such as job analysis, identifying training needs, or developing selection instruments.
We enlisted the help of SMEs, presented them the task sentences predicted by the text classifier,
and asked them to check whether the sentences are actual nursing tasks or not so as to be able to
compute the precision measure. Specifically, we compute precision as the ratio of the number of task sentences confirmed as actual nursing tasks to the total number of task sentences predicted by
the model. We reran the classification algorithm in light of the input from the experts. The input is
data containing the correct label of sentences which were misclassified by the classifier. We per-
formed this in several iterations until there was no significant improvement in precision. This is
necessarily an asymmetric approach since we use the expert knowledge as the “ground truth.”
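With a hypothetical logical vector expert_confirmed (one entry per predicted task sentence, TRUE when the SME confirmed it as an actual nursing task), this precision is simply:

precision <- sum(expert_confirmed) / length(expert_confirmed)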
A more elaborate approach would be to compare the extracted tasks from vacancies to tasks
collected using a more traditional job analysis method, namely a task inventory. The task inventory
would consist of interviews and observations with SMEs to collect a list of tasks performed by
nurses. Based on this comparison, a percentage of tasks would be found in both lists, a percentage of
unique tasks would only be found in the task inventory, and a percentage of unique tasks would only
be found in the online vacancies. A high correspondence between the list of tasks collected by text
mining and the list of tasks collected in the task inventory (which would be considered to accurately
reflect the nursing job) could be taken as evidence for convergent validity. Conversely, one could
establish discriminant validity through a very low correspondence with so-called bogus tasks that are
completely unrelated to the nursing job.
We apply the less elaborate approach by first training a classification model, making predictions
using the model, and presenting the task sentences to an SME. The expert judged whether the
sentences are actual nursing tasks or not. The precision measure was used to give an indication
of the validity of the model. The first round of validation resulted in a precision of 65% (precision range: 0% to 100%), and we found out that some of the initial labels we assigned did not match the labels provided by the independent expert (that is, some of the initial labels were judged to be erroneous by the expert). In light of this, we adjusted the labels and conducted a second round of validation in which precision increased to 89%. This indicates that the classification model gained validity. A total of 91 core tasks were validated. Table 3 contains
validated tasks under the basic care and medical care clusters. In practice, it is difficult to obtain
100% precision since forcing a model to give high precision comes at the expense of sacrificing its recall. High precision and low recall imply the possibility that many task sentences will be dismissed, though we can put more confidence in the sentences that are labeled as a task. As a last note, TC
models are seldom static, that is, as new documents arrive, we have to continually assess the
performance of the model on new observations and adjust our model if there is significant degrada-
tion in performance.

Table 3. Basic Care and Medical Care Core Nursing Tasks Extracted From Nursing Vacancies by Applying Text Classification.

Task | German Translation | Task Cluster
Monitoring the patients' therapy | Überwachung der Therapie des Patienten | Basic care
Caring for the elderly | Pflege von älteren Menschen | Basic care
Providing basic or general care | Durchführung der Allgemeinen Pflege | Basic care
Providing palliative care | Durchführung von Palliativpflege | Basic care
Caring for mentally ill patients | Pflege von psychisch kranken Menschen | Basic care
Caring for children | Pflege von Kindern | Basic care
Assisting at intake of food | Hilfe bei der Nahrungsaufnahme | Basic care
Supporting of rehabilitation | Unterstützung der Rehabilitation | Basic care
Providing holistic care | Durchführung ganzheitlicher Pflege | Basic care
Accompanying patients | Begleitung von Patienten | Basic care
Assisting at surgical interventions | Assistenz bei operativen Eingriffen | Medical care
Doing laboratory tests | Durchführung von Labortests | Medical care
Participating in resuscitations | Beteiligung an Reanimationsmaßnahmen | Medical care
Conducting ECG | Durchführung von EKG | Medical care
Collecting blood | Durchführung der Blutabnahme | Medical care
Preparing and administering intravenous drugs | Vorbereitung und Verabreichung von intravenösen Medikamenten | Medical care
Assisting at diagnostic interventions | Assistenz bei diagnostischen Maßnahmen | Medical care
Operating the technical equipment | Bedienung der technischen Gerätschaften | Medical care
Assisting at endoscopic tests | Assistenz bei endoskopischen Maßnahmen | Medical care
Assisting at examination | Assistenz bei Untersuchungen | Medical care
Conclusion
This article provided an overview of TC and a tutorial on how to conduct actual TC on the problem
of job task information extraction from vacancies. We discussed and demonstrated the different
steps in TC and highlighted issues surrounding the choices of features, classification algorithms, and
evaluation metrics. We also outlined ways to evaluate and validate the resulting classification
models and prediction from these models. TC is an empirical enterprise where experimentation
with choices of representation, dimensionality reduction, and classification techniques is standard
practice. By building several classifiers and comparing them, the final classifier is chosen based on
repeated evaluation and validation. Thus, TC is not a linear process; one has to revisit each step iteratively to examine how choices in each step affect succeeding steps. Moreover, classifiers
evolve in the presence of new data. TC is a wide research field and there are many other techniques
that were not covered here. An exciting new area is the application of deep learning techniques for
text understanding (for more on this we refer the reader to Maas et al., 2011; Mikolov, Chen,
Corrado, & Dean, 2013; X. Zhang & LeCun, 2015).
TC models are often descriptive as opposed to explanatory in nature, in the sense that they
capture the pattern of features and inductively relate these to class membership (Bird, Klein, &
Loper, 2009). This contrasts with explanatory models whose aim is to explain why the pattern in
features leads to the prediction of a class. Nevertheless, the descriptive work can be of use for
further theory building too as the knowledge of patterns can be used as a basis for the development
of explanatory models. For example, in the part about feature selection we found out that the word
sicherstellung (to guarantee or to make sure) is useful in detecting sentences containing nursing
tasks. Based on this we can define the concept of “task verb,” that is, a verb that is indicative of a
task in the context of job vacancy. We could then compile a list of verbs that are “task verbs” and
postulate that task verbs pair with noun or verb phrases to form task sentences. Further trials could
then be designed to validate this concept and establish the relationship between features and
patterns. In this way, we are not only detecting patterns but we also attempt to infer their properties
and their relationship to class membership.
Whether a descriptive model suffices or whether an explanatory model is needed depends on the
objectives of a specific study. If the objective is accurate and reliable categorization (e.g., when one
is interested in using the categorized text as input to other systems) then a descriptive model will
suffice although the outcomes still need to be validated. On the other hand, if the objective is to
explain how patterns lead to categorization or how structure and form lead to meaning then an
explanatory model is required.
In this article we tried to present TC in such a manner that organizational researchers can
understand the underlying process. However, in practice, organizational researchers will often
work with technical experts to make choices on the algorithms and assist in tweaking and tuning
the parameters of the resulting model. The role of organizational researchers then is to provide the
research questions, help select the relevant features, and provide insights in light of the classifi-
cation output. These insights might lead to further investigation and ultimately to theory devel-
opment and testing.
Finally, we conclude that TC offers great potential to make the conduct of text-based organizational research fast, reliable, and effective. The utility of TC is most evident when there is a need to analyze massive text data; in fact, in some cases TC is able to recover patterns that are difficult for humans to detect. Otherwise, manual qualitative text analysis procedures may suffice. As noted, the increased use of TC in organizational research will likely not only contribute to organizational research, but also to the advancement of TC research, because real problems and existing theory can further stimulate the development of new techniques.
Acknowledgments
An earlier version of this article was presented as Kobayashi, V. B., Berkers, H. A., Mol, S. T., Kismihók, G., & Den Hartog, D. N. (2015, August). Augmenting organizational research with the text mining toolkit: All aboard! In J. M. LeBreton (Chair), Big Data: Implications for Organizational Research. Showcase symposium
at the 75th annual meeting of the Academy of Management, Vancouver, BC, Canada. The authors would like to
acknowledge the anonymous reviewers at Organizational Research Methods and the Academy of Management
for their constructive comments on earlier drafts of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or pub-
lication of this article.
Funding
This work was supported by the European Commission through the Marie-Curie Initial Training Network
EDUWORKS (Grant PITN-GA-2013-608311) and by the Society of Industrial and Organizational Psychology
Sidney A. Fine Grant for Research on Job Analysis, for the Big Data Based Job Analytics Project.
References
Aggarwal, C. C., & Zhai, C. (2012). A survey of text classification algorithms. In C. C. Aggarwal & C. Zhai
(Eds.), Mining text data (pp. 163-222). New York, NY: Springer. Retrieved from http://link.springer.com/
chapter/10.1007/978-1-4614-3223-4_6
Aizawa, A. (2003). An information-theoretic perspective of tf-idf measures. Information Processing &
Management,39(1), 45-65. https://doi.org/10.1016/S0306-4573(02)00021-3
Algarni, A., & Tairan, N. (2014). Feature selection and term weighting. In Proceedings of the 2014 IEEE/WIC/
ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)
(Vol. 1, pp. 336-339). Washington, DC: IEEE. https://doi.org/10.1109/WI-IAT.2014.53
Alpaydin, E. (2014). Introduction to machine learning. Cambridge, MA: MIT Press.
Atteveldt, W. van, Kleinnijenhuis, J., Ruigrok, N., & Schlobach, S. (2008). Good news or bad news?
Conducting sentiment analysis on Dutch text to distinguish between positive and negative relations.
Journal of Information Technology & Politics,5(1), 73-94. https://doi.org/10.1080/19331680802154145
Berry, M. W., & Castellanos, M. (2008). Survey of text mining II—Clustering, classification, and retrieval.
Retrieved from http://www.springer.com/gp/book/9781848000452
Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python. Sebastopol, CA: O’Reilly
Media. Retrieved from https://books.google.nl
Breiman, L. (1996). Bagging predictors. Machine Learning,24(2), 123-140.
Brodersen, K. H., Ong, C. S., Stephan, K. E., & Buhmann, J. M. (2010). The balanced accuracy and its posterior
distribution. In 20th International Conference on Pattern Recognition (ICPR) (pp. 3121-3124). Washington,
DC: IEEE. Retrieved from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5597285
Brooks, J., McCluskey, S., Turley, E., & King, N. (2015). The utility of template analysis in qualitative
psychology research. Qualitative Research in Psychology,12(2), 202-222. https://doi.org/10.1080/
14780887.2014.955224
Buhrmester, M., Kwang, T., & Gosling, S. D. (2011). Amazon’s Mechanical Turk: A new source of inexpen-
sive, yet high-quality, data? Perspectives on Psychological Science,6(1), 3-5. https://doi.org/10.1177/
1745691610393980
Burges, C. J. (2010). Dimension reduction: A guided tour. Redmond, WA: Now Publishers. Retrieved from
https://books.google.nl
Cardie, C., & Wilkerson, J. (2008). Text annotation for political science research. Journal of Information
Technology & Politics,5(1), 1-6. https://doi.org/10.1080/19331680802149590
Cavnar, W. B., & Trenkle, J. M. (1994). N-gram-based text categorization. Ann Arbor, MI: Environmental
Research Institute of Michigan.
Chan, S. W. K., & Chong, M. W. C. (2017). Sentiment analysis in financial texts. Decision Support Systems,94,
53-64. https://doi.org/10.1016/j.dss.2016.10.006
Chan, S. W. K., & Franklin, J. (2011). A text-based decision support system for financial sequence prediction.
Decision Support Systems,52(1), 189-198. https://doi.org/10.1016/j.dss.2011.07.003
Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys,
41(3), 15:1-15:58. https://doi.org/10.1145/1541880.1541882
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-
sampling technique. Journal of Artificial Intelligence Research,16, 321-357.
Chen, J., Huang, H., Tian, S., & Qu, Y. (2009). Feature selection for text classification with naïve Bayes. Expert
Systems with Applications,36(3, pt. 1), 5432-5435. https://doi.org/10.1016/j.eswa.2008.06.054
Dave, K., Lawrence, S., & Pennock, D. M. (2003). Mining the peanut gallery: Opinion extraction and semantic
classification of product reviews. In Proceedings of the 12th International Conference on World Wide Web
(pp. 519-528). New York, NY: ACM. https://doi.org/10.1145/775152.775226
Dietterich, T. G. (1997). Machine-learning research. AI Magazine,18(4), 97.
Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM,
55(10), 78-87.
Dong, Y.-S., & Han, K.-S. (2004). A comparison of several ensemble methods for text categorization.
In 2004 IEEE International Conference on Services Computing 2004 (SCC 2004) (pp. 419-422).
Washington, DC: IEEE. Retrieved from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1358033
Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations
for text categorization. In Proceedings of the Seventh International Conference on Information and
Knowledge Management (pp. 148-155). New York, NY: ACM. Retrieved from http://dl.acm.org/citation.
cfm?id=288651
Duriau, V. J., Reger, R. K., & Pfarrer, M. D. (2007). A content analysis of the content analysis literature in
organization studies: Research themes, data sources, and methodological refinements. Organizational
Research Methods,10(1), 5-34. https://doi.org/10.1177/1094428106289252
Elkan, C. (2001). The foundations of cost-sensitive learning. In Proceedings of the 17th International Joint
Conference on Artificial Intelligence (Vol. 2, pp. 973-978). San Francisco, CA: Morgan Kaufmann.
Retrieved from http://dl.acm.org/citation.cfm?id=1642194.1642224
Eyheramendy, S., Lewis, D. D., & Madigan, D. (2003). On the naive Bayes model for text categorization.
Retrieved from http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.4949
Fernandes, J., Artífice, A., & Fonseca, M. J. (2017). Automatic estimation of the LSA dimension. Paper presented at the International Conference of Knowledge Discovery and Information Retrieval. Paris,
France, October 2011. Retrieved from http://www.di.fc.ul.pt/~mjf/publications/2014-2010/pdf/kdir11.pdf.
Ferreira, A. J., & Figueiredo, M. A. T. (2012). Boosting algorithms: A review of methods, theory, and
applications. In C. Zhang & Y. Ma (Eds.), Ensemble machine learning (pp. 35-85). New York, NY:
Springer. https://doi.org/10.1007/978-1-4419-9326-7_2
Finn, A., & Kushmerick, N. (2006). Learning to classify documents according to genre. Journal of the
American Society for Information Science and Technology,57(11), 1506-1518. https://doi.org/10.1002/
asi.20427
Flach, P. (2012). Machine learning: The art and science of algorithms that make sense of data. New York, NY:
Cambridge University Press.
Fodor, I. K. (2002). A survey of dimension reduction techniques (Tech. Rep. UCRL-ID-148494). Livermore,
CA: Lawrence Livermore National Laboratory. Retrieved from https://e-reports-ext.llnl.gov/pdf/240921.
pdf
Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of
Machine Learning Research,3, 1289-1305.
Fox, C. (1992). Lexical analysis and stoplists. In W. B. Frakes & R. Baeza-Yates (Eds.), Information retrieval:
Data structures and algorithms (pp. 102-130). Upper Saddle River, NJ: Prentice Hall. Retrieved from http://
dl.acm.org/citation.cfm?id=129687.129694
Frakes, W. B. (1992). Stemming algorithms. In W. B. Frakes & R. Baeza-Yates (Eds.), Information retrieval:
Data structures and Algorithms (pp. 131-160). Upper Saddle River, NJ: Prentice Hall. Retrieved from http://
dl.acm.org/citation.cfm?id=129687.129695
Frakes, W. B., & Baeza-Yates, R. (1992). Information retrieval: Data structures and algorithms. Retrieved
from http://www.citeulike.org/group/328/article/308697
Fu, Y., Zhu, X., & Li, B. (2013). A survey on instance selection for active learning. Knowledge and Information
Systems,35(2), 249-283. https://doi.org/10.1007/s10115-012-0507-8
Goldman, S. A. (2010). Computational learning theory. In M. J. Atallah & M. Blanton (Eds.), Algorithms and
theory of computation handbook (2nd ed., Vol. 1, pp. 26-26). London, UK: Chapman & Hall/CRC. Retrieved from http://dl.acm.org/citation.cfm?id=1882757.1882783
Gonçalves, T., & Quaresma, P. (2005). Is linguistic information relevant for the classification of legal texts? In
Proceedings of the 10th International Conference on Artificial Intelligence and Law (pp. 168-176). New
York, NY: ACM. https://doi.org/10.1145/1165485.1165512
Graham, I. S. (1995). The HTML sourcebook. New York, NY: John Wiley. Retrieved from http://dl.acm.org/
citation.cfm?id=526978
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine
Learning Research,3, 1157-1182.
Guyon, I., Gunn, S., Nikravesh, M., & Zadeh, L. (2008). Feature extraction: Foundations and applications.
New York, NY: Springer.
Harish, B. S., Guru, D. S., & Manjunath, S. (2010). Representation and classification of text documents: A brief
review. International Journal of Computer Applications,2, 110-119.
Hindle, D., & Rooth, M. (1993). Structural ambiguity and lexical relations. Computational Linguistics,19(1),
103-120.
Holton, C. (2009). Identifying disgruntled employee systems fraud risk through text mining: A simple solution
for a multi-billion dollar problem. Decision Support Systems,46(4), 853-864. https://doi.org/10.1016/j.dss.
2008.11.013
Hsieh, H.-F., & Shannon, S. E. (2005). Three approaches to qualitative content analysis. Qualitative Health
Research,15(9), 1277-1288. https://doi.org/10.1177/1049732305276687
Hsu, C.-W., & Lin, C.-J. (2002). A comparison of methods for multiclass support vector machines. IEEE
Transactions on Neural Networks,13(2), 415-425. https://doi.org/10.1109/72.991427
Ikonomakis, M., Kotsiantis, S., & Tampakas, V. (2005). Text classification using machine learning techniques.
WSEAS Transactions on Computers,4(8), 966-974.
Jiang, S., Pang, G., Wu, M., & Kuang, L. (2012). An improved K-nearest-neighbor algorithm for text categor-
ization. Expert Systems with Applications,39(1), 1503-1509.
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features.
New York, NY: Springer. Retrieved from http://link.springer.com/chapter/10.1007/BFb0026683
Kanaan, G., Al-Shalabi, R., Ghwanmeh, S., & Al-Ma’adeed, H. (2009). A comparison of text-classification
techniques applied to Arabic text. Journal of the American Society for Information Science and Technology,
60(9), 1836-1844. https://doi.org/10.1002/asi.20832
Khoo, A., Marom, Y., & Albrecht, D. (2006). Experiments with sentence classification. In Proceedings of the
2006 Australasian language technology workshop (pp. 18-25). Retrieved from http://www.aclweb.org/
anthology/U06-1#page=26
Kloptchenko, A., Eklund, T., Karlsson, J., Back, B., Vanharanta, H., & Visa, A. (2004). Combining data and
text mining techniques for analysing financial reports. Intelligent Systems in Accounting, Finance &
Management,12(1), 29-41. https://doi.org/10.1002/isaf.239
Kobayashi, V. B., Mol, S. T., Berkers, H. A., Kismihók, G., & Den Hartog, D. N. (in press). Text mining in
organizational research. Organizational Research Methods.
Kobayashi, V. B., Mol, S. T., Kismihók, G., & Hesterberg, M. (2016). Automatic extraction of nursing tasks
from online job vacancies. In M. Fathi, M. Khobreh, & F. Ansari (Eds.), Professional education and training
through knowledge, technology and innovation (pp. 51-56). Siegen, Germany: University of Siegen.
Retrieved from http://dokumentix.ub.uni-siegen.de/opus/volltexte/2016/1057/pdf/Professional_education_
and_training.pdf#page=58
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In
Ijcai (Vol. 14, pp. 1137-1145). Retrieved from http://frostiebek.free.fr/docs/Machine%20Learning/valida
tion-1.pdf
Kolen, J. F., & Pollack, J. B. (1990). Back propagation is sensitive to initial conditions. In Proceedings of the
1990 Conference on Advances in Neural Information Processing Systems 3 (pp. 860-867). San Francisco,
CA: Morgan Kaufmann. Retrieved from http://dl.acm.org/citation.cfm?id=118850.119960
Lan, M., Tan, C. L., Su, J., & Lu, Y. (2009). Supervised and traditional term weighting methods for automatic
text categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence,31(4), 721-735.
https://doi.org/10.1109/TPAMI.2008.110
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse
Processes,25(2-3), 259-284. https://doi.org/10.1080/01638539809545028
Lewis, D. D. (1992). Representation and learning in information retrieval (Doctoral dissertation, University of
Massachusetts Amherst). Retrieved from http://ciir.cs.umass.edu/pubfiles/UM-CS-1991-093.pdf
Li, Y. H., & Jain, A. K. (1998). Classification of text documents. Computer Journal,41(8), 537-546. https://
doi.org/10.1093/comjnl/41.8.537
Liu, H., & Motoda, H. (1998). Feature extraction, construction and selection: A data mining perspective. New
York, NY: Springer.
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning word vectors for
sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human language technologies (Vol. 1, pp. 142-150). Association for Computational
Linguistics. Retrieved from http://dl.acm.org/citation.cfm?id=2002491
Méndez, J. R., Iglesias, E. L., Fdez-Riverola, F., Díaz, F., & Corchado, J. M. (2006). Tokenising, stemming and stopword removal on anti-spam filtering domain. In R. Marín, E. Onaindía, A. Bugarín, & J. Santos (Eds.),
Current topics in artificial intelligence (pp. 449-458). New York, NY: Springer. Retrieved from http://link.
springer.com/chapter/10.1007/11881216_47
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector
space (ArXiv Preprint ArXiv:1301.3781). Retrieved from http://arxiv.org/abs/1301.3781
Moschitti, A., & Basili, R. (2004). Complex linguistic features for text classification: A comprehensive study.
In S. McDonald & J. Tait (Eds.), Advances in information retrieval (pp. 181-196). New York, NY: Springer.
https://doi.org/10.1007/978-3-540-24752-4_14
Ogura, H., Amano, H., & Kondo, M. (2011). Comparison of metrics for feature selection in imbalanced text
classification. Expert Systems with Applications,38(5), 4978-4989. https://doi.org/10.1016/j.eswa.2010.09.
153
Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information
Retrieval,2(1-2), 1-135. https://doi.org/10.1561/1500000011
Panigrahi, P. K. (2012). A comparative study of supervised machine learning techniques for spam e-mail
filtering. In 2012 Fourth International Conference on Computational Intelligence and Communication
Networks (pp. 506-512). Washington, DC: IEEE. https://doi.org/10.1109/CICN.2012.14
Phan, X.-H., Nguyen, L.-M., & Horiguchi, S. (2008). Learning to classify short and sparse text & web with
hidden topics from large-scale data collections. In Proceedings of the 17th International Conference on
World Wide Web (pp. 91-100). New York, NY: ACM. https://doi.org/10.1145/1367497.1367510
Polikar, R. (2012). Ensemble learning. In C. Zhang & Y. Ma (Eds.), Ensemble machine learning (pp. 1-34).
New York, NY: Springer. Retrieved from http://link.springer.com/chapter/10.1007/978-1-4419-9326-7_1
Popping, R. (2012). Qualitative decisions in quantitative text analysis research. Sociological Methodology,42,
88-90.
Porter, M. F. (1980). An algorithm for suffix stripping. Program,14(3), 130-137. https://doi.org/10.1108/
eb046814
Powers, D. M. (2011). Evaluation: From precision, recall and F-measure to ROC, informedness, markedness
and correlation. Retrieved from http://dspace2.flinders.edu.au/xmlui/handle/2328/27165
Ragas, H., & Koster, C. H. (1998). Four text classification algorithms compared on a Dutch corpus. In
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in
information retrieval (pp. 369-370). New York, NY: ACM. Retrieved from http://dl.acm.org/citation.cfm?
id=291059
Raghavan, V. V., & Wong, S. M. (1986). A critical analysis of vector space model for information retrieval.
Journal of the American Society for Information Science,37(5), 279-287.
Rokach, L., & Maimon, O. (2005). Top-down induction of decision trees classifiers—A survey. IEEE
Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews,35(4), 476-487.
https://doi.org/10.1109/TSMCC.2004.843247
Rullo, P., Cumbo, C., & Policicchio, V. L. (2007). Learning rules with negation for text categorization. In
Proceedings of the 2007 ACM Symposium on Applied Computing (pp. 409-416). New York, NY: ACM.
https://doi.org/10.1145/1244002.1244098
Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information
Processing & Management,24(5), 513-523.
Salton, G., Wong, A., & Yang, C.-S. (1975). A vector space model for automatic indexing. Communications of
the ACM,18(11), 613-620.
Scharkow, M. (2013). Thematic content analysis using supervised machine learning: An empirical evaluation
using German online news. Quality & Quantity,47(2), 761-773. https://doi.org/10.1007/s11135-011-9545-7
Scott, S., & Matwin, S. (1999). Feature engineering for text classification. In ICML (Vol. 99, pp. 379-388).
Retrieved from http://comp.mq.edu.au/units/comp348/reading/scott99feature.pdf
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys,34(1),
1-47.
Seibert, S. E., Kraimer, M. L., Holtom, B. C., & Pierotti, A. J. (2013). Even the best laid plans sometimes go
askew: Career self-management processes, career shocks, and the decision to pursue graduate education.
Journal of Applied Psychology,98(1), 169.
Settles, B. (2010). Active learning literature survey (Tech. Rep. 1648). Madison: University of Wisconsin–
Madison.
Shen, J., Brdiczka, O., & Liu, J. (2013). Understanding email writers: Personality prediction from email
messages. In S. Carberry, S. Weibelzahl, A. Micarelli, & G. Semeraro (Eds.), User modeling, adapta-
tion, and personalization (pp. 318-330). New York, NY: Springer. https://doi.org/10.1007/978-3-642-
38844-6_29
Sheng, V. S., Provost, F., & Ipeirotis, P. G. (2008). Get another label? Improving data quality and data mining
using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (pp. 614-622). New York, NY: ACM. https://doi.org/10.1145/
1401890.1401965
Shreve, J., Schneider, H., & Soysal, O. (2011). A methodology for comparing classification methods through
the assessment of model stability and validity in variable selection. Decision Support Systems,52(1),
247-257. https://doi.org/10.1016/j.dss.2011.08.001
Sirbu, D., Secui, A., Dascalu, M., Crossley, S. A., Ruseti, S., & Trausan-Matu, S. (2016). Extracting gamers’
opinions from reviews. In 2016 18th International Symposium on Symbolic and Numeric Algorithms for
Scientific Computing (SYNASC) (pp. 227-232). https://doi.org/10.1109/SYNASC.2016.044
Song, F., Liu, S., & Yang, J. (2005). A comparative study on text representation schemes in text categorization.
Pattern Analysis and Applications,8(1-2), 199-209. https://doi.org/10.1007/s10044-005-0256-3
Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., & Kappas, A. (2010). Sentiment strength detection in short
informal text. Journal of the American Society for Information Science and Technology,61(12), 2544-2558.
Toman, M., Tesar, R., & Jezek, K. (2006). Influence of word normalization on text classification. In
Proceedings of InSciT (pp. 354-358). Merida, Spain. Retrieved from http://www.kiv.zcu.cz/research/
groups/text/publications/inscit20060710.pdf
Torunoğlu, D., Çakırman, E., Ganiz, M. C., Akyokuş, S., & Gürbüz, M. Z. (2011). Analysis of preprocessing
methods on classification of Turkish texts. In 2011 International Symposium on Innovations in Intelligent
Systems and Applications (INISTA) (pp. 112-117). Washington, DC: IEEE. Retrieved from http://ieeexplore.
ieee.org/xpls/abs_all.jsp?arnumber=5946084
Tsuge, S., Shishibori, M., Kuroiwa, S., & Kita, K. (2001). Dimensionality reduction using non-negative matrix
factorization for information retrieval. In 2001 IEEE International Conference on Systems, Man and
Cybernetics. e-Systems and e-Man for Cybernetics in Cyberspace (Cat. No. 01CH37236) (Vol. 2, pp.
960-965). Washington, DC: IEEE. https://doi.org/10.1109/ICSMC.2001.973042
Turney, P. (1999). Learning to extract keyphrases from text. Retrieved from http://nparc.cisti-icist.nrc-cnrc.gc.
ca/npsi/ctrl?action=rtdoc&an=8913245
Uysal, A. K., & Gunal, S. (2014). The impact of preprocessing on text classification. Information Processing &
Management,50(1), 104-112. https://doi.org/10.1016/j.ipm.2013.08.006
van der Maaten, L. J., Postma, E. O., & van den Herik, H. J. (2009). Dimensionality reduction: A comparative
review. Journal of Machine Learning Research,10(1-41), 66-71.
Vo, D.-T., & Ock, C.-Y. (2015). Learning to classify short text from scientific documents using topic models
with various types of knowledge. Expert Systems with Applications,42(3), 1684-1698. https://doi.org/10.
1016/j.eswa.2014.09.031
Wiebe, J., Wilson, T., Bruce, R., Bell, M., & Martin, M. (2004). Learning subjective language. Computational
Linguistics,30(3), 277-308. https://doi.org/10.1162/0891201041850885
Wiebe, J., Wilson, T., & Cardie, C. (2005). Annotating expressions of opinions and emotions in language.
Language Resources and Evaluation,39(2-3), 165-210. https://doi.org/10.1007/s10579-005-7880-9
Willett, P. (2006). The Porter stemming algorithm: Then and now. Program,40(3), 219-223. https://doi.org/
10.1108/00330330610681295
Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Information Retrieval,1(1-2),
69-90.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In ICML
(Vol. 97, pp. 412-420). Retrieved from http://www.surdeanu.info/mihai/teaching/ista555-spring15/readings/
yang97comparative.pdf
Youn, S., & McLeod, D. (2007). A comparative study for email classification. In K. Elleithy (Ed.), Advances
and innovations in systems, computing sciences and software engineering (pp. 387-391). New York, NY:
Springer. Retrieved from http://link.springer.com/chapter/10.1007/978-1-4020-6264-3_67
Yu, B., Kaufmann, S., & Diermeier, D. (2008). Classifying party affiliation from political speech. Journal of
Information Technology & Politics,5(1), 33-48. https://doi.org/10.1080/19331680802149608
Yu, H.-F., Ho, C.-H., Arunachalam, P., Somaiya, M., & Lin, C.-J. (2012). Product title classification versus text
classification. Retrieved from http://www.csie.ntu.edu.tw/~cjlin/papers/title.pdf
Zhang, J., Jin, R., Yang, Y., & Hauptmann, A. (2003). Modified logistic regression: An approximation to SVM
and its applications in large-scale text categorization. In ICML (pp. 888-895). Retrieved from http://www.
aaai.org/Papers/ICML/2003/ICML03-115.pdf
Zhang, W., Yoshida, T., & Tang, X. (2008). Text classification based on multi-word with support vector
machine. Knowledge-Based Systems,21(8), 879-886. https://doi.org/10.1016/j.knosys.2008.03.044
Zhang, X., & LeCun, Y. (2015). Text understanding from scratch (ArXiv Preprint ArXiv:1502.01710).
Retrieved from http://arxiv.org/abs/1502.01710
Zhu, X. (2005). Semi-supervised learning literature survey. Retrieved from http://citeseerx.ist.psu.edu/view
doc/download?doi=10.1.1.99.9681&rep=rep1&type=pdf
Zhu, X., & Ghahramani, Z. (2002). Learning from labeled and unlabeled data with label propagation.
Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.3864&rep=rep1&type=pdf
Zu, G., Ohyama, W., Wakabayashi, T., & Kimura, F. (2003). Accuracy improvement of automatic text
classification based on feature transformation. In Proceedings of the 2003 ACM Symposium on
Document Engineering (pp. 118-120). New York, NY: ACM. https://doi.org/10.1145/958220.
958242
Zurada, J. M., Ensari, T., Asl, E. H., & Chorowski, J. (2013). Nonnegative matrix factorization and its
application to pattern analysis and text mining. In 2013 Federated Conference on Computer Science and
Information Systems (FedCSIS) (pp. 11-16). Washington, DC: IEEE.
Author Biographies
Vladimer B. Kobayashi is a PhD student at the University of Amsterdam and on study leave from the
University of the Philippines Mindanao. His current research interest is in labor market driven learning
analytics. Specifically, he uses text mining techniques and machine learning to automatically extract informa-
tion from job vacancies, to understand education-to-labor market transition, to study job mobility, to determine
success in the labor market, and to match education and labor market needs. He plans to continue this line of
research by applying his training and knowledge in an industry context.
Stefan T. Mol is an assistant professor of organizational behavior at the Leadership and Management Section
of the Amsterdam Business School of the University of Amsterdam, the Netherlands. His research interests
center on a variety of topics relating to the broad fields of organizational behavior and research methods,
including but not limited to person-environment fit, refugee integration in the labor market, employability, job
crafting, calling, work identity, psychological contracts, expatriate management, text mining (with a focus on
vacancies), learning analytics, and meta-analysis.
Hannah A. Berkers is a PhD candidate in organizational behavior at the Leadership and Management Section
of the Amsterdam Business School of the University of Amsterdam, the Netherlands. Her research interests
include work and professional identity, calling, employee well-being, meaningfulness, job crafting, and a
task-level perspective on work.
Gábor Kismihók is a postdoc in knowledge management at the Leadership and Management Section of the
Amsterdam Business School of the University of Amsterdam, the Netherlands. His research focuses on the
bridge between education and the labor market and covers topics such as learning analytics, refugee integration
in the labor market, and employability.
Deanne N. Den Hartog is a professor of organizational behavior, head of the Leadership and Management
Section, and director of the Amsterdam Business School Research Institute at the University of Amsterdam, the
Netherlands and a visiting research professor at the University of South Australia. Her research interests
include leadership, HRM, proactive work behavior, and trust. She has published widely on these topics and
serves on several editorial boards.