
Using Internet search engines to estimate word frequency


The present research investigated Internet search engines as a rapid, cost-effective alternative for estimating word frequencies. Frequency estimates for 382 words were obtained and compared across four methods: (1) Internet search engines, (2) the Kucera and Francis (1967) analysis of a traditional linguistic corpus, (3) the CELEX English linguistic database (Baayen, Piepenbrock, & Gulikers, 1995), and (4) participant ratings of familiarity. The results showed that Internet search engines produced frequency estimates that were highly consistent with those reported by Kucera and Francis and those calculated from CELEX, highly consistent across search engines, and very reliable over a 6-month period of time. Additional results suggested that Internet search engines are an excellent option when traditional word frequency analyses do not contain the necessary data (e.g., estimates for forenames and slang). In contrast, participants' familiarity judgments did not correspond well with the more objective estimates of word frequency. Researchers are advised to use search engines with large databases (e.g., AltaVista) to ensure the greatest representativeness of the frequency estimates.
Copyright 2002 Psychonomic Society, Inc.
Behavior Research Methods, Instruments, & Computers
2002, 34 (2), 286-290
Word frequency is an important experimental variable,
with the potential to influence any cognitive phenome-
non that involves language. Domains that have been
shown to benefit from a consideration of word frequency
include word recognition (e.g., Balota & Rayner, 1991),
lexical decision latency (e.g., Rubenstein, Garfield, &
Millikan, 1970), memory (e.g., Chalmers, Humphreys,
& Dennis, 1997), language acquisition (e.g., Brysbaert,
Lange, & Wijnendaele, 2000), and unitization in reading
(e.g., Peterzell, Sinclair, Healy, & Bourne, 1990). Even
when word frequency is not the focus of attention, re-
searchers routinely control for it in their experiments to
eliminate a potentially powerful confound.
One of the most popular methods for determining
word frequency has been to analyze a linguistic corpus,
which consists of a sample of texts intended to be a rep-
resentative reference for that language (McEnery & Wil-
son, 1996). Some of the most commonly used frequency
analyses are Thorndike and Lorge (1944), Kučera and
Francis (1967; see also the extended 1982 analysis of the
same corpus by Francis & Kučera), and those obtained
from the CELEX linguistic database (Baayen, Piepen-
brock, & Gulikers, 1995). The popularity of these analy-
ses, however, belies several potential problems in their
usage. Because each corpus is composed of written texts,
they may not fully represent the frequency with which
some words are used in spoken English (Chomsky,
1965). In addition, a traditional corpus is unlikely to contain
certain types of words, such as slang and forenames,
that are not common in formal written texts but are used
with some frequency in everyday life and are central for
some experiments (e.g., Blair & Banaji, 1996; Devine,
1989; Greenwald, McGhee, & Schwartz, 1998).
Another potential problem concerns the age of the
popular corpora analyses. Kučera and Francis (1967) is
over 30 years old, and Thorndike and Lorge (1944) is al-
most 60 years old. In light of rapid changes in vocabu-
lary, researchers may question whether those frequency
counts are still valid (Gernsbacher, 1984). The CELEX
database (Baayen et al., 1995) contains contemporary
samples of written English, but its cost is more than some
researchers may be willing to spend (currently $150),
and without constant updating it too will go out of date.
Finally, access to even the older frequency analyses is restricted
because they are out of print and difficult to obtain.
Many research institutions have only one or two
copies that must be shared among all researchers.
IRENE V. BLAIR and GEOFFREY R. URLAND
University of Colorado, Boulder, Colorado
and
JENNIFER E. MA
University of Kansas, Lawrence, Kansas
This work was supported by NIH Grant MH63372 to I.V.B., an NSF
Graduate Research Fellowship to G.R.U., and an NIH postdoctoral
fellowship to J.E.M. We thank Alice Healy and the University of Colorado
Stereotyping and Prejudice (CUSP) Lab for their helpful comments
on an earlier draft of the paper. We also thank Gary McClelland
and Lou McClelland for their assistance with the sampling analyses.
Correspondence should be addressed to I. V. Blair, Department of
Psychology, University of Colorado, Boulder, CO 80309-0345 (e-mail:
irene.blair@colorado.edu).
In light of the limitations of traditional word frequency
analyses, the goal of the present research was to examine
the potential of using Internet search engines to provide
valid and reliable information on word frequency. Search
engines operate by sending automated agents (known as
“spiders”) out on the Internet. These spiders, in turn,
send information about a site’s content back to a database,
which can be searched by multiple users. For example,
one may ask the engine to search for the word desk.
When the search is completed, the user is provided
with a report on the number of times the word was found
in the database, commonly known as the “hit” count.¹
We argue that this count is analogous to a conventional
word frequency estimate, and it can be compared with
the hit count for other words of interest.
Internet search engines solve many of the drawbacks of
traditional frequency analyses. The Internet is ubiquitous
and search engines are generally free sites, making issues
of availability and cost nearly irrelevant. Moreover, the Internet
is relatively comprehensive, including academic
texts, commercial and personal information, and records
from newsgroup postings. The latter source of information
is similar to spoken language and gives the Internet an ad-
ditional advantage over corpora that rely on more formal
written language. Because anyone can post information on
the Internet, it may also be more representative of everyone’s
language. Likewise, the fast-paced nature of the Internet
and the fact that search engines constantly update
their databases provide a way for the search engines to reflect
contemporary language usage. Thus, the Internet provides
a linguistic database that is relatively comprehensive,
representative, contemporary, and easily searched.
However, it is necessary to determine empirically
whether search engines provide valid estimates of word
frequency. In addition, the fluid nature of the Internet
may undermine the reliability of estimates based on In-
ternet databases. These questions were addressed in the
following study by obtaining word frequency estimates
for a large sample of words from four popular Internet
search engines and comparing them with the estimates
obtained from Kučera and Francis (1967) and the CELEX
linguistic database (Baayen et al., 1995), as well as with
participant ratings of the words’ familiarity. The test–
retest reliability of the search engines was also examined
by conducting a second set of Internet searches, 6 months
later.
METHOD
Test Sample
The test sample was composed of 382 words. The majority of the
words (n = 250) were standard English words that included a selec-
tion of nouns, verbs, and adjectives (e.g., attain, dishonest, nurse,
welfare). The frequency of these words, according to Kučera and
Francis (1967), ranged from 0 to 1,303 (M = 102.41). In addition,
a subsample of 132 nonstandard words was added. This subsample
contained 36 slang terms (e.g., bimbo, rad, reefer, yuppie) and 96
forenames (48 male and female European American names and 48
male and female African American names, taken from Greenwald
et al., 1998). According to Kučera and Francis, these words ranged
in frequency from 0 to 92 (M = 5.82).
Frequency Estimates
Four methods were used to estimate the frequency of the 382
words in the sample: (1) Kučera and Francis (1967), (2) CELEX
linguistic database (Baayen et al., 1995), (3) participant ratings of
familiarity, and (4) Internet search engines. These methods are de-
scribed below.
Kučera and Francis analysis. Frequency estimates for the words
in the test sample were obtained from Kučera and Francis’s (1967)
Computational Analysis of Present-Day American English. If a word
did not appear in the database, the frequency was recorded as zero.
CELEX linguistic database. Frequency estimates for the words
in the test sample were obtained by electronically searching the
CELEX linguistic database (Baayen et al., 1995). This CD-ROM
database contains 160,594 words from 284 written texts. If a word
did not appear in the database, the frequency was recorded as zero.
Participant ratings. Because pretesting is often used to select
stimuli that are matched on a particular dimension (e.g., imageabil-
ity), some researchers may use participants’ subjective familiarity
with words as an alternative to obtaining objective word frequency
estimates. To explore the validity of this method, 33 undergraduates
at the University of Kansas were asked to rate the familiarity of
each word in the sample, using a 5-point scale with labeled endpoints
(1 = very unfamiliar, 5 = very familiar). The participants
were asked to consider “how common or frequently you have encountered
the word, or how well you know the word.” The 382
words were divided into two lists of equal length, with the words on
each list presented in a single fixed order. The participants rated the
familiarity of the words on one list and then, following a 5-min
break, they rated the familiarity of the second list of words. The
order of the two word lists was counterbalanced across the partici-
pants. Cronbach’s alpha for interrater reliability was .97.
Internet search engines. Four search engines were selected for
the study: AltaVista, Northern Light, Excite, and Yahoo! These search
engines were chosen primarily for their popularity among Internet
users. Database and search technique were also considered. AltaVista
(www.altavista.com) and Northern Light (www.northernlight.com)
were included because they are two of the largest and most
comprehensive Internet search engines. When the analyses were
conducted, AltaVista had a database of more than 250 million
webpages and Northern Light had over 200 million webpages
(AllSearchEngines.Com, 2000; Kansas City Public Library, 2000;
Leita, 2000). During a search with these engines, the full text of
webpages and articles in their databases is searched for word
matches. In contrast, Excite (www.excite.com) and Yahoo! (www.yahoo.com)
have relatively smaller databases (150 million and 2
million, respectively). In addition, Yahoo! conducts its searches in
a very different manner from the other three engines. Specifically,
it is an Internet subject directory that searches for general topics as
opposed to keyword matches. As a consequence, its frequency es-
timates may not be as valid or reliable.
Each word in the sample was entered into the search function of
each of the four search engines. The number of hits returned was
then recorded as the frequency estimate for that word. To examine
the reliability of frequency estimates obtained from the search en-
gines, the same search process was repeated 6 months later.
RESULTS
As expected for frequency data, the word counts from
the search engines, Kučera and Francis (1967), and CELEX
(Baayen et al., 1995) were positively skewed. Thus, a
standard square-root transformation was applied before
further analyses (Judd & McClelland, 1989). In contrast,
the participants’ ratings of familiarity were negatively
skewed. Because this skew was relatively minor and we
believed that it reflected an important psychological re-
ality for the participants (see Discussion), those data
were left untransformed.
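The effect of the square-root transformation on positively skewed counts can be illustrated with a minimal sketch. The hit counts below are hypothetical values chosen only for illustration (they are not the study's data), and the `skewness` helper is an assumed, simple population-moment definition rather than the statistic the authors necessarily used.

```python
import math


def skewness(values):
    """Population skewness: third central moment over the cubed SD."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical hit counts (illustrative only): a few very frequent words
# produce a long right tail, as is typical for frequency data.
hits = [1_200, 4_500, 9_800, 25_000, 60_000, 463_830, 4_860_810]

# Square-root transform, the transformation named in the text.
sqrt_hits = [math.sqrt(h) for h in hits]

raw_skew = skewness(hits)
sqrt_skew = skewness(sqrt_hits)
print(f"skew before: {raw_skew:.2f}, after sqrt: {sqrt_skew:.2f}")
```

In this toy sample the transformation reduces the positive skew without removing it entirely, which is why the choice of transformation is usually checked against the data at hand.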
Due to differences in database size, the two larger
search engines, AltaVista and Northern Light, returned
significantly higher estimates, on average, than the two
smaller search engines (M = 2.6 million vs. 0.7 million).
And all of the search engines produced higher estimates
than Kučera and Francis (M = 69.03) or CELEX (M = 925.64).
Of greater importance, however, was the validity
and reliability of the search engines as determined by comparisons
of word frequency estimates among the words
in the test sample. The following tests were conducted.
First, correlations were calculated among the frequency
estimates obtained using each of the four methods.
Table 1 shows that the estimates obtained from the
four search engines were highly correlated with those
obtained from Kučera and Francis (mean r = .79) and
with the estimates provided by CELEX (mean r = .72).
In contrast, the participants’ word ratings were only moderately
correlated with the other frequency counts.
Second, correlations were calculated among the four
search engines. As shown in Table 1, the search engines
provided highly consistent estimates of word frequency
on the Internet (mean r = .82).
Finally, the test–retest reliabilities of the search engines
were examined by calculating correlations between
the word estimates obtained at Time 1 and at Time 2.
These correlations, provided in Table 1, showed that the
search engines produced highly reliable estimates over
the 6-month period of time (mean r = .92).
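The mean correlations quoted in the three tests above can be recovered, to rounding, from the entries of Table 1. The sketch below assumes a simple arithmetic mean of the rounded coefficients; the text does not say whether the means were computed this way or on unrounded (or Fisher-transformed) values, so small discrepancies in the second decimal are possible.

```python
from statistics import mean

# Coefficients transcribed from Table 1 (full word sample, N = 382).
table1 = {
    # Kučera & Francis vs. AltaVista, Northern Light, Excite, Yahoo!
    "kucera_francis_vs_engines": [.81, .89, .78, .69],
    # CELEX vs. the same four engines
    "celex_vs_engines": [.76, .81, .71, .62],
    # The six pairwise correlations among the four engines
    "between_engines": [.91, .81, .73, .84, .76, .88],
    # Diagonal entries: 6-month test-retest reliabilities
    "test_retest": [.94, .96, .94, .84],
}

# Assumption: plain arithmetic mean of the rounded table entries.
summary = {name: mean(values) for name, values in table1.items()}

for name, value in summary.items():
    print(f"{name}: mean r = {value:.3f}")
```

The four means land on or within rounding distance of the values reported in the text (.79, .72, .82, and .92).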
Frequency Estimates for Subsamples of Words
As noted, the full test sample was composed of 250
standard words and 132 nonstandard words. To examine
the congruence among the frequency methods for the
two subsamples, the analyses were repeated within each
sample. These analyses showed that for both subsamples,
the Internet search engines produced frequency estimates
that were very reliable in terms of their consistency
with one another (mean r = .79 and .89, for the
standard and nonstandard samples, respectively) and
their consistency across time (mean r = .91 and .89, for
the standard and nonstandard samples, respectively).
However, the congruence between the search engines
and the other three methods was different for the two
subsamples of words (Table 2).
First, the congruence between the search engines and
the two traditional linguistic databases was higher for the
standard than the nonstandard words (mean r = .78 vs.
.57, z = 3.65, p < .001 for Kučera & Francis; mean r = .70
vs. .44, z = 3.64, p < .001 for CELEX). One explanation
for this discrepancy is that the linguistic databases did
not contain estimates for many of the nonstandard words.
Specifically, the Kučera and Francis database was missing
51% of the nonstandard words (54% of the forenames
and 42% of the slang terms), compared with only
3% missing for the standard words; CELEX was missing
66% of the nonstandard words (84% of the forenames
and 17% of the slang terms), compared with only 2%
missing for the standard words. A very large number of
zero estimates could have attenuated the correlation for
the nonstandard words. However, even when all words
with a zero estimate were eliminated from the analysis,
the correlation between the search engines and the standard
databases continued to be higher for the standard
than the nonstandard words (mean r = .77 vs. .56, z =
2.72, p < .01 for Kučera & Francis; mean r = .70 vs. .50,
z = 1.90, p = .058 for CELEX). We cannot be certain why
these differences exist. However, it doesn’t seem so surprising
that the type of “common” English used on the
Web and the more formal writing contained in the traditional
linguistic corpora may differ in the frequency with
which various slang terms and forenames are used.
Table 1
Correlation Coefficients Among Frequency Estimates
for the Full Word Sample (N = 382)
                           CELEX PR    AV    NL    EX    YH
Kučera & Francis (1967)    .92   .48   .81   .89   .78   .69
CELEX                            .46   .76   .81   .71   .62
Participant ratings (PR)               .45   .49   .46   .47
Search engines
  AltaVista (AV)                       .94   .91   .81   .73
  Northern Light (NL)                        .96   .84   .76
  Excite (EX)                                      .94   .88
  Yahoo! (YH)                                            .84
Note. Coefficients on the diagonal for the search engines are the test–retest reliability
estimates. All correlations are significant at p < .0001.
Table 2
Correlation Coefficients Between the Internet Search Engines and the Other Three Methods of
Obtaining Frequency Estimates, for the Standard and Nonstandard Word Samples
                           Standard Word Sample              Nonstandard Word Sample
Method                     PR    AV    NL    EX    YH        PR    AV    NL    EX    YH
Kučera & Francis (1967)    .40   .85   .88   .76   .64       .46   .46   .61   .60   .59
CELEX                      .36   .78   .79   .68   .56       .46   .44   .45   .44   .43
Participant ratings              .40   .43   .39   .40             .59   .61   .60   .63
Note. All correlations are significant at p < .0001. PR, participant ratings; AV, AltaVista; NL, Northern
Light; EX, Excite; YH, Yahoo!
The second difference observed between the two word
samples was that the congruence between the search engines
and the participants’ ratings was lower for the standard
than for the nonstandard words (mean r = .40 vs.
.61, z = 2.62, p < .01). This finding suggests that the participants
may have had an easier time making familiarity
distinctions among relatively unfamiliar words. Nonetheless,
in neither case was the correlation very high.
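The z statistics reported for the standard versus nonstandard comparisons are consistent with a test of the difference between two independent correlations. The sketch below assumes Fisher's r-to-z transformation with subsample sizes of 250 and 132; the article does not name the test it used, and treating a mean r across four engines as a single correlation is a simplification, so this is an illustrative reconstruction rather than the authors' procedure. The function name is hypothetical.

```python
import math


def compare_independent_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for two correlations from independent samples.

    Returns the z statistic and a two-tailed p value.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher transform
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))      # SE of the difference
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2))             # two-tailed normal p
    return z, p


# CELEX vs. search engines: standard (n = 250) vs. nonstandard (n = 132).
z_celex, p_celex = compare_independent_correlations(.70, 250, .44, 132)

# Participant ratings vs. search engines, same subsamples.
z_pr, p_pr = compare_independent_correlations(.40, 250, .61, 132)

print(f"CELEX: z = {z_celex:.2f}, p = {p_celex:.4f}")
print(f"Ratings: z = {z_pr:.2f}, p = {p_pr:.4f}")
```

Under these assumptions the statistics fall close to the reported values (z = 3.64 for CELEX and z = 2.62 in absolute value for the participant ratings); the small differences are what one would expect from working with rounded mean correlations.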
Because researchers may question whether they can
rely on Internet search engines for small samples of
words, correlational analyses were conducted on 100
randomly selected samples of 30 standard words. Me-
dian correlations and their interquartile ranges based on
these analyses are presented in Table 3. These numbers
showed that even for relatively small samples of words,
the Internet search engines produced word frequency estimates
that were highly consistent with the two standard
databases, highly consistent with one another, and highly
consistent across time. The two smaller search engines
(EX and YH), however, returned results that were a little
less consistent with the standard databases than those
obtained from the two larger search engines (AV and
NL).
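The small-sample analysis can be sketched as repeated random subsampling: draw 30 words, correlate two frequency measures, repeat 100 times, and summarize the resulting coefficients by their median and interquartile range. The data below are synthetic stand-ins generated for illustration (not the study's word list or counts), and the helper names and seeds are hypothetical.

```python
import math
import random
from statistics import median, quantiles


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def subsample_correlations(x, y, sample_size=30, n_samples=100, seed=0):
    """Median and interquartile range of r over random word subsamples."""
    rng = random.Random(seed)
    rs = []
    for _ in range(n_samples):
        idx = rng.sample(range(len(x)), sample_size)
        rs.append(pearson_r([x[i] for i in idx], [y[i] for i in idx]))
    q1, _, q3 = quantiles(rs, n=4)
    return median(rs), q3 - q1


# Synthetic stand-in data (NOT the study's data): 250 "words" whose hit
# counts are a noisy multiple of their corpus counts.
rng = random.Random(42)
corpus_counts = [rng.expovariate(1 / 100) for _ in range(250)]
hit_counts = [c * 25_000 * rng.lognormvariate(0, 0.5) for c in corpus_counts]

# Square-root transform both measures before correlating, as in the Results.
x = [math.sqrt(c) for c in corpus_counts]
y = [math.sqrt(h) for h in hit_counts]

mdn, iqr = subsample_correlations(x, y)
print(f"median r = {mdn:.2f}, interquartile range = {iqr:.2f}")
```

A wide interquartile range signals that a single small sample may give a noticeably different correlation from the full-sample value, which is the pattern Table 3 shows for the two smaller search engines.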
DISCUSSION
The present research demonstrates that Internet search
engines provide word frequency estimates that are both
valid and reliable. The estimates obtained from the four
search engines showed good convergent validity with
both Kučera and Francis (1967) and CELEX (Baayen
et al., 1995), were highly consistent with one another, and
showed excellent test–retest reliability over a 6-month period
of time. These results ought to encourage researchers
to take advantage of this highly accessible and easy-to-use
method.
The high convergence between the search engines and
Kučera and Francis (1967) also suggests that despite its
age, Kučera and Francis is still a valid source for word
frequencies. We have shown, however, that one of the
greatest drawbacks of that method (and of similar databases,
such as CELEX) is missing data. The lack of
data for forenames is especially problematic for social
psychologists who frequently employ forenames as stimuli
and have few available means to estimate their frequency
(for discussions of this problem, see Dasgupta,
McGhee, Greenwald, & Banaji, 2000; Kasof, 1993). The
lack of data for slang terms suggests that Kučera and
Francis and CELEX may also not be a good source of
frequency data when the words are relatively new to the
lexicon or appear more often in speech than writing. The
Internet search engines, in contrast, produced highly
consistent and reliable word frequency estimates for
both the standard and nonstandard words, suggesting
that they can be used where other methods fail.
In addition to being an easy-to-use, cost-effective
method of obtaining word frequencies, search engines
may also open up other avenues for research. For example,
by treating the Web as a linguistic database, researchers
can conduct analyses of the contexts surrounding certain
words. Such analyses could be informative in regard to
the typical user of a word (e.g., age, education, culture)
and the objects and social roles that are most often asso-
ciated with it. An analysis of the surrounding context
may also provide the researcher with a better sense of
how familiar people really are with a particular word. If
a word is most often listed in technical or otherwise spe-
cialized webpages, then it may not be as familiar to the
average person as a word that is found on more mainstream
webpages. Another advantage of using search engines
for frequency analyses is the potential to search for
phrases as well as for single words. For example, one
might wonder if baseball bat or hockey stick occurs with
greater frequency, or whether people are more familiar
with “To be or not to be” or “I think therefore I am.” (In
both cases, the former phrase occurs with much greater
frequency than the latter.) Finally, it is important to note
that search engines can be used to estimate word frequencies
for languages other than English (see New, Pallier,
Ferrand, & Matos, in press). Researchers who use
words from more than one language may find it useful to
conduct word frequency analyses with the same basic
method. However, we caution that the validity of such
searches would depend on the extent to which speakers
of the language use the Web.
Table 3
Median Correlation Coefficients (Mdns) and Interquartile Ranges (IRs)
Based on Analyses of 100 Random Samples of 30 Standard Words
                           CELEX       PR          AV          NL          EX          YH
Method                     Mdn   IR    Mdn   IR    Mdn   IR    Mdn   IR    Mdn   IR    Mdn   IR
Kučera & Francis (1967)    .91  .06    .42  .14    .85  .09    .89  .07    .77  .26    .65  .36
CELEX                                  .41  .11    .78  .10    .80  .10    .72  .29    .63  .36
Participant ratings (PR)                           .44  .11    .47  .13    .43  .12    .43  .12
Search engines
  AltaVista (AV)                                   .98  .06    .93  .06    .85  .27    .71  .36
  Northern Light (NL)                                          .99  .03    .82  .26    .72  .39
  Excite (EX)                                                              .98  .09    .97  .23
  Yahoo! (YH)                                                                          .92  .30
Note. Coefficients on the diagonal for the search engines are the test–retest reliability estimates.
The present research also provided evidence on the validity
of participants’ subjective ratings of familiarity as
an alternative measure of word frequency. The inconsis-
tencies between such ratings and the other methods sug-
gest that subjective familiarity is not equivalent to more
objective measures of word frequency (see also Gerns-
bacher, 1984). In particular, the present data showed that
the familiarity ratings were negatively skewed, whereas
the other estimates were positively skewed. That nega-
tive skew reveals that the raters did not make distinctions
among words that are relatively familiar but have very
different objective frequencies in the language (e.g., lazy
vs. school). This discrepancy was especially pronounced
in the standard word sample, where the negative skew
was greater than in the nonstandard word sample (−2.34
vs. −0.56, respectively). Other researchers have cautioned,
however, that subjective familiarity should not be
discounted as an important variable in cognition despite
its differences from objective word frequency (Gerns-
bacher, 1984).
Although the present data provided strong evidence in
favor of Internet search engines as a method of estimat-
ing word frequency, two caveats are in order. First, the
two smaller search engines (Excite and Yahoo!) pro-
duced somewhat less consistent estimates with a rela-
tively small sample of words. Thus, it is recommended
that the larger search engines be used as a general rule
because they have more representative databases. Sec-
ond, Internet search engines are best used when relative
word frequency estimates are satisfactory. With 463,830
hits in AltaVista for brush and 4,860,810 hits for earth,
we know that the latter word occurs more frequently than
the former word. However, with only a rough estimate of
the total database (approximately 250 million) and the
fact that it is always changing, the absolute frequencies
of those words cannot be determined with any certainty.
For many research purposes, relative word frequencies
are the only estimates of interest, and it is for those studies
that Internet search engines provide an excellent option.
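The distinction between relative and absolute frequency can be made concrete with the two hit counts quoted above. The sketch below computes only the ratio and rank order, which do not depend on knowing the size of the search engine's database; the function name is hypothetical.

```python
def relative_frequency(hits_a, hits_b):
    """How many times more often word A is reported than word B."""
    return hits_a / hits_b


# AltaVista hit counts quoted in the text.
hits = {"brush": 463_830, "earth": 4_860_810}

ratio = relative_frequency(hits["earth"], hits["brush"])

# Rank order from most to least frequent, the comparison the text
# describes as safe to draw from hit counts.
ranked = sorted(hits, key=hits.get, reverse=True)

print(f"earth is reported about {ratio:.1f} times as often as brush")
print("rank order:", ranked)
```

Converting either count to an absolute rate (for example, occurrences per million pages) would require dividing by the total database size, which the text notes is known only approximately and changes over time.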
REFERENCES
AllSearchEngines.Com homepage (May, 2000). Available: http://
www.allsearchengines.com.
Baayen, R. H., Piepenbrock, R., & Gulikers, L. (1995). The CELEX
lexical database [CD-ROM]. Philadelphia: University of Pennsylvania,
Linguistic Data Consortium.
Balota, D. A., & Rayner, K. (1991). Word recognition processes in
foveal and parafoveal vision: The range of influence of lexical variables.
In D. Besner & G. W. Humphreys (Eds.), Basic processes in reading:
Visual word recognition (pp. 198-232). Hillsdale, NJ: Erlbaum.
Blair, I. V., & Banaji, M. R. (1996). Automatic and controlled pro-
cesses in stereotype priming. Journal of Personality & Social Psy-
chology, 70, 1142-1163.
Brysbaert, M., Lange, M., & Wijnendaele, I. V. (2000). The effects
of age-of-acquisition and frequency-of-occurrence in visual word
recognition: Further evidence from the Dutch language. European
Journal of Cognitive Psychology, 12, 65-85.
Chalmers, K. A., Humphreys, M. S., & Dennis, S. (1997). A naturalistic
study of the word frequency effect in episodic recognition. Memory
& Cognition, 25, 780-784.
Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA:
MIT Press.
Dasgupta, N., McGhee, D. E., Greenwald, A. G., & Banaji, M. R.
(2000). Automatic preference for White Americans: Eliminating the
familiarity explanation. Journal of Experimental Social Psychology,
36, 316-328.
Devine, P. G. (1989). Stereotypes and prejudice: Their automatic and
controlled components. Journal of Personality & Social Psychology,
56, 680-690.
Francis, W. N., & Kučera, H. (1982). Frequency analysis of English
usage: Lexicon and grammar. Boston: Houghton Mifflin.
Gernsbacher, M. A. (1984). Resolving 20 years of inconsistent interactions
between lexical familiarity and orthography, concreteness,
and polysemy. Journal of Experimental Psychology: General, 113,
256-281.
Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998).
Measuring individual differences in implicit cognition: The implicit
association test. Journal of Personality & Social Psychology, 74,
1464-1480.
Judd, C. M., & McClelland, G. H. (1989). Data analysis: A model
comparison approach. San Diego: Harcourt Brace Jovanovich.
Kansas City Public Library (2000, March). Introduction to search
engines. Available: http://www.kcpl.lib.mo.us/search.
Kasof, J. (1993). Sex bias in the naming of stimulus persons. Psycho-
logical Bulletin, 113, 140-163.
Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day
American English. Providence, RI: Brown University Press.
Leita, C. (2000, May). InfoPeople Search Tools Chart. Available: 2000
InFoPeople Project at http://infopeople.org/src/chart.html.
McEnery, T., & Wilson, A. (1996). Corpus linguistics. Edinburgh:
Edinburgh University Press.
New, B., Pallier, C., Ferrand, L., & Matos, R. (in press). Une base
de données lexicales du français contemporain sur internet: LEX-
IQUE [A lexical database of contemporary French on the Internet:
LEXIQUE]. L’Année Psychologique.
Peterzell, D. H., Sinclair, G. P., Healy, A. F., & Bourne, L. E.
(1990). Identification of letters in the predesignated target paradigm:
A word superiority effect for the common word the. American Jour-
nal of Psychology, 103, 299-315.
Rubenstein, H., Garfield, L., & Millikan, J. A. (1970). Homo-
graphic entries in the internal lexicon. Journal of Verbal Learning &
Verbal Behavior, 9, 487-494.
Thorndike, E. L., & Lorge, I. (1944). The teacher’s word book of
30,000 words (3rd ed.). New York: Columbia University, Teachers
College Press.
NOTE
1. Some search engines report both the number of exact word
matches and the number of websites that contain the word. It is the for-
mer hit count that provides the more accurate word frequency count.
(Manuscript received February 15, 2001;
revision accepted for publication December 15, 2001.)
... We used two Spanish corpora, El Corpus del Español (Mark Davies, http://www.corpusdelespanol.org) and CORPES (Real Academia Española). As the results proved insufficient, we also examined the potential of using the Internet as corpus, following current proposals in corpus linguistics (Blair et al., 2002;Hundt et al., 2007;Gatto, 2014). ...
... Hundt et al. (2007) underlines the benefits of the Internet as a corpus comparing it with really big reference corpora, such as BNC. They see the Internet as a valuable resource for the enormous and varied amount of data it can provide concerning several fields such as morphological productivity, the study of word frequency (Blair et al., 2002) and a broad range of research questions. The use of the Internet as a corpus has also been postulated to better explore lexical uses (Álvarez, in press). ...
... In this section we present the data obtained when using the Internet as a corpus. As can be observed, the figures vary considerably from the ones seen previously, contrarily to other studies where the results obtained using the Internet were rather similar than those obtained using corpora (Blair et al., 2002). Table 4 shows that we found at least one example for all the verbs in our analysis (100%) combining with both expressions entre… and [(DET) uno* PREP (DET) otro*]. ...
Article
Full-text available
In this paper we present a descriptive study on the compatibility of emphatic reciprocal expressions with Spanish lexical reciprocal (or symmetric) verbs. Since lexical reciprocal verbs express reciprocity intrinsically, they should not require the use of an emphatic reciprocal expression to denote reciprocal meaning. Some scholars even claim that some emphatic reciprocal expressions, such as mutuamente, are incompatible with these verbs. The aim of this paper is to describe to what extent symmetric verbs can be used with four of these expressions, using an empirical, corpus-based approach. The results obtained shed light on the questions raised: we have been able to verify that these expressions are more frequent with non-reciprocal verbs, and we have shown that the combination of symmetric verbs with all these expressions is possible, even in the case of mutuamente. KEYWORDS: lexical reciprocal verbs, symmetric verbs, emphatic reciprocal expressions, reciprocity, corpus. Spanish title: Expresiones recíprocas enfáticas y verbos simétricos en español: un análisis empírico. DOI: 10.20420/PhilCan.2016.107
... Using Internet browsing for such a purpose is further justified by studies showing correlations between the number of Google hits and normative word frequency estimates. For example, Blair et al. (2002) showed that Internet searches were a very useful instrument and an adequate indicator of word frequency. An archival approach that relies on this methodology thus seems reasonable and empirically supported by previous research, but solely in the English language. ...
Article
Values refer to stable beliefs and principles held by individuals, which guide their attitudes, behaviours, and judgments, and play a crucial role in shaping their identities and interactions with others. Studying values in social psychology is important as it provides insights into the motivational forces that drive individuals' behaviour and decision-making, shaping the dynamics of interpersonal relationships and societal interactions. The aim of this paper is to test the possibility of measuring basic values in archival and text materials. Based on Schwartz's theory of values and earlier work on the value lexicon in English, the Serbian lexicon of values was developed and preliminarily validated in a large-scale Internet-based survey. The lexical co-occurrence of words in natural language use on the Internet was analysed in order to assess the convergent, discriminant and predictive validity of the lexicon. Lexical co-occurrence analysis showed that words representing the same values co-occurred significantly more often than words denoting different values. The pattern of correlations between the values measured in archival material on the Internet using the value lexicon showed high convergence with the pattern of correlations between the values assessed by the self-report measures used in the European Social Survey in 2018. The relative prominence of specific values on the official websites of exemplar societal institutions and organizations, as identified by the value lexicon, was in line with expectations and preliminarily confirmed the criterion validity of the lexicon of values. Possible applications of the lexicon of values, as well as some methodological issues pertaining to its future use, are discussed in the final part.
... We used the search engine "Bing", which is the most frequently used by the participants of our study (as emerged from the answers to our questions). This is a procedure typically used in the event that frequency data are not available, and which has been found to be reliable (Blair et al., 2002). Abstract characters are more frequent than concrete ones but only when they have a simple morphological structure (i.e. ...
Article
This study extends the examination of the difference between abstract concepts to the Chinese language and its peculiar characteristics in word formation, where components with different semantic content might be aggregated within a word. Native Chinese speakers categorised abstract and concrete words by moving the computer mouse towards their choice. Stimuli with a “semantically simple structure” (i.e. abstract-abstract/concrete-concrete) were compared with those with a “mixed structure” (i.e. abstract-concrete/concrete-abstract) to test for an effect of the conceptual content of the stimulus’s components on its overall processing. Response time and kinematic parameters revealed that: a) the semantic content of the components affected the processing of abstract but not concrete concepts, b) concepts differed when they have a semantically mixed structure, not a simple one. We extend the concreteness effect to logographic script and provide evidence that the presence of a concrete component within an abstract concept is elaborated and affects its processing.
... The bias of occurring with OCNs or OMNs may be caused by these classifiers' co-occurrence frequency with OMNs versus OCNs. Using two Mandarin corpora (National Language Resources Monitoring and Research Centre) and the number of Baidu (Mandarin Chinese version of Google) hits (May 2019) (Blair et al., 2002; Pollatsek et al., 2010), the co-occurrence frequency counts of the OCNs and OMNs used in the current research with the three classifiers were calculated, and are illustrated in Table 3. Table 3 shows that when the classifier was ba, the difference in co-occurrence frequency between count and mass nominals was not significant, p = 0.35, whereas when the classifier was gen or kuai, mass nominals had a significantly higher co-occurrence frequency than count nominals, ps < 0.01, represented by *. This classifier-noun co-occurrence frequency pattern does not offer any obvious explanation for our finding that participants expect ba to be followed by count nominals but have no biases for gen and kuai. ...
Article
Using the Visual World Paradigm, the current study aimed to explore whether the mass/count distinction is determined by syntax in Mandarin Chinese, focusing on classified nouns in nominal phrases. By using dual-role classifiers, ontological count and mass nouns, and phrase structures with and without biased syntactic cues we found that the mass/count distinction is initially computed using phrase structure but can be overridden in cases where the syntax is incompatible with nouns’ ontological meanings. The results indicate that in Mandarin Chinese, syntactic cues can be rapidly used to make predictions about upcoming information in real time processing.
... In the study two simple tasks were used with 28 test stimuli and 28 distractors with repeated measures (with a final total of 224 stimuli), selected from a previous study with University students from Brazil, Spain, and the USA [34]. Google frequency searches were also employed as a measure of frequency of the stimuli, as suggested in previous literature [40,43,44]. ...
Article
The face is a fundamental feature of our identity. In humans, the existence of specialized processing modules for faces is now widely accepted. However, identifying the processes involved for proper names is more problematic. The aim of the present study is to examine which of the two kinds of processing occurs earlier and whether social abilities play a role. We selected 100 university students divided into two groups: Spanish and USA students. They had to recognize famous faces or names in a masked priming task. An analysis of variance of the reaction times (RT) was used to determine whether significant differences could be observed between word and face recognition and between the Spanish and USA groups. Additionally, to examine the role of outliers, the Gaussian distribution was modified exponentially. Famous faces were recognized faster than names, and differences were observed between Spanish and North American participants, but not for unknown distractor faces. The current results suggest that face processing might be faster than name recognition, which supports the idea that the two differ in the nature of their processing.
... Because numerous scholars (e.g. Blair et al. 2002; Hundt et al. 2007; or Lindquist 2009, to name but a few) claim that the quality of the data found on the Internet is comparable to that of standard corpora or, in some cases, Internet data are more telling than corpus data (e.g. Mondorf 2007), the World Wide Web was selected. ...
Article
The article deals with two of the long-standing problems in English linguistics: whether it is possible that each noun can have both count and mass senses, and the problem of determining a complete list of the regularities of count-to-mass and mass-to-count changes. While there have been numerous attempts to solve each of these problems, this article shows the results of applying Cognitive Grammar to them. The analysis covers a set of concrete nouns representative of English – sixty nouns with different ontological properties and all frequencies of occurrence. These are nouns that are classified by dictionaries as solely count and solely mass. Because of its usage-based character, the analysis scrutinises over 1,700 real-life utterances produced by native speakers of English. The analysis shows that even such nouns possess senses whose properties are the reverse of the properties of the nouns’ basic senses. A thorough examination of the nouns’ basic and extended senses leads to certain grammatical regularities of count-to-mass and mass-to-count changes. The analysis not only systematises the grammatical regularities determined so far and solves many problems that can be noticed about them, but also proposes novel regularities.
Article
Corpora are ubiquitous in linguistic research, yet to date, there has been no consensus on how to conceptualize corpus representativeness and collect corpus samples. This pioneering book bridges this gap by introducing a conceptual and methodological framework for corpus design and representativeness. Written by experts in the field, it shows how corpora can be designed and built in a way that is both optimally suited to specific research agendas, and adequately representative of the types of language use in question. It considers questions such as 'what types of texts should be included in the corpus?', and 'how many texts are required?' – highlighting that the degree of representativeness rests on the dual pillars of domain considerations and distribution considerations. The authors introduce, explain, and illustrate all aspects of this corpus representativeness framework in a step-by-step fashion, using examples and activities to help readers develop practical skills in corpus design and evaluation.
Article
Three studies tested basic assumptions derived from a theoretical model based on the dissociation of automatic and controlled processes involved in prejudice. Study 1 supported the model's assumption that high- and low-prejudice persons are equally knowledgeable of the cultural stereotype. The model suggests that the stereotype is automatically activated in the presence of a member (or some symbolic equivalent) of the stereotyped group and that low-prejudice responses require controlled inhibition of the automatically activated stereotype. Study 2, which examined the effects of automatic stereotype activation on the evaluation of ambiguous stereotype-relevant behaviors performed by a race-unspecified person, suggested that when subjects' ability to consciously monitor stereotype activation is precluded, both high- and low-prejudice subjects produce stereotype-congruent evaluations of ambiguous behaviors. Study 3 examined high- and low-prejudice subjects' responses in a consciously directed thought-listing task. Consistent with the model, only low-prejudice subjects inhibited the automatically activated stereotype-congruent thoughts and replaced them with thoughts reflecting equality and negations of the stereotype. The relation between stereotypes and prejudice and implications for prejudice reduction are discussed.
Article
We present a new lexical database of French, named Lexique. Based on a corpus of texts written since 1950 which contained 31 million words, Lexique yields 130 000 entries including the inflected forms of verbs, nouns and adjectives. Each entry provides several kinds of information including frequency, gender, number, phonological form, graphemic and phonemic unicity points. Several tables give additional statistics such as the frequencies of various units: letters, bigrams, trigrams, phonemes and syllables. The database is available for free on the Internet.
Article
The experiments in this article were conducted to observe the automatic activation of gender stereotypes and to assess theoretically specified conditions under which such stereotype priming may be moderated. Across 4 experiments, 3 patterns of data were observed: (1) evidence of stereotype priming under baseline conditions of intention and high cognitive constraints; (2) significant reduction of stereotype priming when a counterstereotype intention was formed even though cognitive constraints were high; and (3) complete reversal of stereotype priming when a counterstereotype intention was formed and cognitive constraints were low. These data support proposals that stereotypes may be automatically activated, as well as proposals that perceivers can control and even eliminate such effects. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Examined the role of anticipated-interaction instructions on memory for and organization of social information. In Study 1, Ss read and recalled information about a prospective partner (i.e., target) on a problem-solving task and about 4 other stimulus people. The results indicated that (a) Ss recalled more items about the target than the others, (b) the target was individuated from the others in memory, and (c) Ss were more accurate on a name–item matching task for the target than for the others. Study 2 compared anticipated interaction with several other processing goals (i.e., memory, impression formation, self-comparison, friend-comparison). Only anticipated-interaction and impression formation instructions led to higher levels of recall and more accurate matching performance for the target than for the others. However, the conditional probability data suggest that anticipated interaction led to higher levels of organization of target information than did any of the other conditions. Discussion considers information processing strategies that are possibly instigated by anticipated-interaction instructions.
Article
An alphabetical list of 10,000 words which are found to occur most widely in a count of about 625,000 words from literature for children. 41 different sources were used.
Book
This completely rewritten classic text features many new examples, insights, and topics including mediational, categorical, and multilevel models. Substantially reorganized, this edition provides a briefer, more streamlined examination of data analysis. Noted for its model comparison approach and unified framework based on the general linear model, the book provides readers with a greater understanding of a variety of statistical procedures. This consistent framework, including consistent vocabulary and notation, is used throughout to develop fewer but more powerful model building techniques. The authors show how all analysis of variance and multiple regression can be accomplished within this framework. The model comparison approach provides several benefits: it strengthens the intuitive understanding of the material, thereby increasing the ability to successfully analyze data in the future; it provides more control in the analysis of data so that readers can apply the techniques to a broader spectrum of questions; it reduces the number of statistical techniques that must be memorized; and it teaches readers how to become data analysts instead of statisticians. The book opens with an overview of data analysis. All the necessary concepts for statistical inference used throughout the book are introduced in Chapters 2 through 4. The remainder of the book builds on these models. Chapters 5-7 focus on regression analysis, followed by analysis of variance (ANOVA), mediational analyses, nonindependent or correlated errors, including multilevel modeling, and outliers and error violations. The book is appreciated by all for its detailed treatment of ANOVA, multiple regression, nonindependent observations, interactive and nonlinear models of data, and its guidance for treating outliers and other problematic aspects of data analysis.
Intended for advanced undergraduate or graduate courses on data analysis, statistics, and/or quantitative methods taught in psychology, education, or other behavioral and social science departments, this book also appeals to researchers who analyze data. A protected website featuring additional examples and problems with datasets, lecture notes, PowerPoint presentations, and class-tested exam questions is available to adopters. This material uses SAS but can easily be adapted to other programs. A working knowledge of basic algebra and any multiple regression program is assumed.
Article
The task was to distinguish between English and nonsense words, which were displayed singly. The display persisted until S pressed the yes-key if he thought the stimulus was English or the no-key if he thought it was nonsense. The response times were faster for English than nonsense, faster for English words of higher frequency than lower frequency, and faster for homographs than nonhomographs. It is hypothesized that word recognition in general requires consulting the internal lexicon. A model of the underlying processes is sketched which proposes that words of higher frequency are recognized sooner because their lexical entries are marked earlier for comparison against the stimulus information. It is also proposed that homographs are recognized sooner than nonhomographs since homographs have more lexical entries available for comparison against the stimulus information.