Using Internet search engines to estimate word frequency

May 2002
Behavior Research Methods, Instruments, & Computers 34(2):286-90

DOI:10.3758/BF03195456

Authors:

Geoffrey Urland

University of Colorado Boulder

Jennifer E. Ma

Scripps College

The present research investigated Internet search engines as a rapid, cost-effective alternative for estimating word frequencies. Frequency estimates for 382 words were obtained and compared across four methods: (1) Internet search engines, (2) the Kucera and Francis (1967) analysis of a traditional linguistic corpus, (3) the CELEX English linguistic database (Baayen, Piepenbrock, & Gulikers, 1995), and (4) participant ratings of familiarity. The results showed that Internet search engines produced frequency estimates that were highly consistent with those reported by Kucera and Francis and those calculated from CELEX, highly consistent across search engines, and very reliable over a 6-month period of time. Additional results suggested that Internet search engines are an excellent option when traditional word frequency analyses do not contain the necessary data (e.g., estimates for forenames and slang). In contrast, participants' familiarity judgments did not correspond well with the more objective estimates of word frequency. Researchers are advised to use search engines with large databases (e.g., AltaVista) to ensure the greatest representativeness of the frequency estimates.

Behavior Research Methods, Instruments, & Computers

2002, 34 (2), 286-290

Using Internet search engines to estimate word frequency

IRENE V. BLAIR and GEOFFREY R. URLAND

University of Colorado, Boulder, Colorado

and

JENNIFER E. MA

University of Kansas, Lawrence, Kansas

This work was supported by NIH Grant MH63372 to I.V.B., an NSF Graduate Research Fellowship to G.R.U., and an NIH postdoctoral fellowship to J.E.M. We thank Alice Healy and the University of Colorado Stereotyping and Prejudice (CUSP) Lab for their helpful comments on an earlier draft of the paper. We also thank Gary McClelland and Lou McClelland for their assistance with the sampling analyses. Correspondence should be addressed to I. V. Blair, Department of Psychology, University of Colorado, Boulder, CO 80309-0345 (e-mail: irene.blair@colorado.edu).

Word frequency is an important experimental variable, with the potential to influence any cognitive phenomenon that involves language. Domains that have been shown to benefit from a consideration of word frequency include word recognition (e.g., Balota & Rayner, 1991), lexical decision latency (e.g., Rubenstein, Garfield, & Millikan, 1970), memory (e.g., Chalmers, Humphreys, & Dennis, 1997), language acquisition (e.g., Brysbaert, Lange, & Wijnendaele, 2000), and unitization in reading (e.g., Peterzell, Sinclair, Healy, & Bourne, 1990). Even when word frequency is not the focus of attention, researchers routinely control for it in their experiments to eliminate a potentially powerful confound.

One of the most popular methods for determining word frequency has been to analyze a linguistic corpus, which consists of a sample of texts intended to be a representative reference for that language (McEnery & Wilson, 1996). Some of the most commonly used frequency analyses are Thorndike and Lorge (1944), Kučera and Francis (1967; see also the extended 1982 analysis of the same corpus by Francis & Kučera), and those obtained from the CELEX linguistic database (Baayen, Piepenbrock, & Gulikers, 1995). The popularity of these analyses, however, belies several potential problems in their usage. Because each corpus is composed of written texts, it may not fully represent the frequency with which some words are used in spoken English (Chomsky, 1965). In addition, a traditional corpus is unlikely to contain certain types of words, such as slang and forenames, that are not common in formal written texts but are used with some frequency in everyday life and are central for some experiments (e.g., Blair & Banaji, 1996; Devine, 1989; Greenwald, McGhee, & Schwartz, 1998).

Another potential problem concerns the age of the popular corpora analyses. Kučera and Francis (1967) is over 30 years old, and Thorndike and Lorge (1944) is almost 60 years old. In light of rapid changes in vocabulary, researchers may question whether those frequency counts are still valid (Gernsbacher, 1984). The CELEX database (Baayen et al., 1995) contains contemporary samples of written English, but its cost is more than some researchers may be willing to spend (currently $150), and without constant updating it too will go out of date. Finally, access to even the older frequency analyses is restricted because they are out of print and difficult to obtain. Many research institutions have only one or two copies that must be shared among all researchers.

In light of the limitations of traditional word frequency analyses, the goal of the present research was to examine the potential of using Internet search engines to provide valid and reliable information on word frequency. Search engines operate by sending automated agents (known as “spiders”) out on the Internet. These spiders, in turn, send information about a site’s content back to a database, which can be searched by multiple users. For example, one may ask the engine to search for the word desk. When the search is completed, the user is provided with a report on the number of times the word was found in the database, commonly known as the “hit” count. We argue that this count is analogous to a conventional word frequency estimate, and it can be compared with the hit count for other words of interest.

Internet search engines solve many of the drawbacks of traditional frequency analyses. The Internet is ubiquitous and search engines are generally free sites, making issues of availability and cost nearly irrelevant. Moreover, the Internet is relatively comprehensive, including academic texts, commercial and personal information, and records from newsgroup postings. The latter source of information is similar to spoken language and gives the Internet an additional advantage over corpora that rely on more formal written language. Because anyone can post information on the Internet, it may also be more representative of “everyone’s language.” Likewise, the fast-paced nature of the Internet and the fact that search engines constantly update their databases provide a way for the search engines to reflect contemporary language usage. Thus, the Internet provides a linguistic database that is relatively comprehensive, representative, contemporary, and easily searched.

However, it is necessary to determine empirically whether search engines provide valid estimates of word frequency. In addition, the fluid nature of the Internet may undermine the reliability of estimates based on Internet databases. These questions were addressed in the following study by obtaining word frequency estimates for a large sample of words from four popular Internet search engines and comparing them with the estimates obtained from Kučera and Francis (1967) and the CELEX linguistic database (Baayen et al., 1995), as well as with participant ratings of the words’ familiarity. The test–retest reliability of the search engines was also examined by conducting a second set of Internet searches 6 months later.

METHOD

Test Sample

The test sample was composed of 382 words. The majority of the words (n = 250) were standard English words that included a selection of nouns, verbs, and adjectives (e.g., attain, dishonest, nurse, welfare). The frequency of these words, according to Kučera and Francis (1967), ranged from 0 to 1,303 (M = 102.41). In addition, a subsample of 132 nonstandard words was added. This subsample contained 36 slang terms (e.g., bimbo, rad, reefer, yuppie) and 96 forenames (48 male and female European American names and 48 male and female African American names, taken from Greenwald et al., 1998). According to Kučera and Francis, these words ranged in frequency from 0 to 92 (M = 5.82).

Frequency Estimates

Four methods were used to estimate the frequency of the 382 words in the sample: (1) Kučera and Francis (1967), (2) CELEX linguistic database (Baayen et al., 1995), (3) participant ratings of familiarity, and (4) Internet search engines. These methods are described below.

Kučera and Francis analysis. Frequency estimates for the words in the test sample were obtained from Kučera and Francis’s (1967) Computational Analysis of Present-Day American English. If a word did not appear in the database, the frequency was recorded as zero.

CELEX linguistic database. Frequency estimates for the words in the test sample were obtained by electronically searching the CELEX linguistic database (Baayen et al., 1995). This CD-ROM database contains 160,594 words from 284 written texts. If a word did not appear in the database, the frequency was recorded as zero.

Participant ratings. Because pretesting is often used to select stimuli that are matched on a particular dimension (e.g., imageability), some researchers may use participants’ subjective familiarity with words as an alternative to obtaining objective word frequency estimates. To explore the validity of this method, 33 undergraduates at the University of Kansas were asked to rate the familiarity of each word in the sample, using a 5-point scale with labeled endpoints (1 = very unfamiliar, 5 = very familiar). The participants were asked to consider “how common or frequently you have encountered the word, or how well you know the word.” The 382 words were divided into two lists of equal length, with the words on each list presented in a single fixed order. The participants rated the familiarity of the words on one list and then, following a 5-min break, they rated the familiarity of the second list of words. The order of the two word lists was counterbalanced across the participants. Cronbach’s alpha for interrater reliability was .97.
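As an illustration of the interrater reliability statistic, the following minimal sketch computes Cronbach's alpha with raters treated as the "items" and words as the cases. The function name `cronbach_alpha` and the toy ratings are hypothetical and are not taken from the study's data; the source does not describe how its alpha was computed beyond naming the statistic.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a words-by-raters table.

    ratings: list of rows, one row per word, one rating per rater in each row.
    """
    n_raters = len(ratings[0])
    # Sample variance of each rater's ratings across all words.
    rater_vars = [variance([row[j] for row in ratings]) for j in range(n_raters)]
    # Sample variance of the per-word sum of ratings.
    total_var = variance([sum(row) for row in ratings])
    return (n_raters / (n_raters - 1)) * (1 - sum(rater_vars) / total_var)


# Hypothetical ratings of 5 words by 3 raters (1 = very unfamiliar, 5 = very familiar).
toy_ratings = [
    [5, 5, 4],
    [4, 4, 4],
    [2, 3, 2],
    [1, 1, 2],
    [3, 3, 3],
]
alpha = cronbach_alpha(toy_ratings)
print(round(alpha, 2))
```

Values close to 1 indicate that the raters ordered the words in nearly the same way.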

Internet search engines. Four search engines were selected for the study: AltaVista, Northern Light, Excite, and Yahoo! These search engines were chosen primarily for their popularity among Internet users. Database and search technique were also considered. AltaVista (www.altavista.com) and Northern Light (www.northernlight.com) were included because they are two of the largest and most comprehensive Internet search engines. When the analyses were conducted, AltaVista had a database of more than 250 million webpages and Northern Light had over 200 million webpages (AllSearchEngines.Com, 2000; Kansas City Public Library, 2000; Leita, 2000). During a search with these engines, the full text of webpages and articles in their databases is searched for word matches. In contrast, Excite (www.excite.com) and Yahoo! (www.yahoo.com) have relatively smaller databases (150 million and 2 million, respectively). In addition, Yahoo! conducts its searches in a very different manner from the other three engines. Specifically, it is an Internet subject directory that searches for general topics as opposed to keyword matches. As a consequence, its frequency estimates may not be as valid or reliable.

Each word in the sample was entered into the search function of each of the four search engines. The number of hits returned was then recorded as the frequency estimate for that word. To examine the reliability of frequency estimates obtained from the search engines, the same search process was repeated 6 months later.

RESULTS

As expected for frequency data, the word counts from the search engines, Kučera and Francis (1967), and CELEX (Baayen et al., 1995) were positively skewed. Thus, a standard square-root transformation was applied before further analyses (Judd & McClelland, 1989). In contrast, the participants’ ratings of familiarity were negatively skewed. Because this skew was relatively minor and we believed that it reflected an important psychological reality for the participants (see Discussion), those data were left untransformed.
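The effect of the square-root transformation on positively skewed counts can be sketched as follows. The hit counts below are hypothetical, and the moment-based `skewness` helper is only one common definition of skew; the source does not say which formula was used.

```python
import math


def skewness(values):
    """Moment-based skewness: mean cubed deviation over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical hit counts with a long right tail, as is typical for frequency data.
raw_counts = [120, 450, 900, 2_300, 5_100, 18_000, 260_000, 4_800_000]
sqrt_counts = [math.sqrt(c) for c in raw_counts]

skew_raw = skewness(raw_counts)
skew_sqrt = skewness(sqrt_counts)
print(f"raw skew = {skew_raw:.2f}, square-root skew = {skew_sqrt:.2f}")
```

With these toy values the transformed counts remain positively skewed, but less so than the raw counts, which is the intended direction of the correction.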

Due to differences in database size, the two larger search engines, AltaVista and Northern Light, returned significantly higher estimates, on average, than the two smaller search engines (M = 2.6 million vs. 0.7 million). And all of the search engines produced higher estimates than Kučera and Francis (M = 69.03) or CELEX (M = 925.64). Of greater importance, however, were the validity and reliability of the search engines as determined by comparisons of word frequency estimates among the words in the test sample. The following tests were conducted.

First, correlations were calculated among the frequency estimates obtained using each of the four methods. Table 1 shows that the estimates obtained from the four search engines were highly correlated with those obtained from Kučera and Francis (mean r = .79) and with the estimates provided by CELEX (mean r = .72). In contrast, the participants’ word ratings were only moderately correlated with the other frequency counts.

Second, correlations were calculated among the four search engines. As shown in Table 1, the search engines provided highly consistent estimates of word frequency on the Internet (mean r = .82).

Finally, the test–retest reliabilities of the search engines were examined by calculating correlations between the word estimates obtained at Time 1 and at Time 2. These correlations, provided in Table 1, showed that the search engines produced highly reliable estimates over the 6-month period of time (mean r = .92).
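The correlational analyses above can be sketched as a Pearson correlation computed on square-root-transformed counts. The word counts below are hypothetical and the helper `pearson_r` is an illustrative implementation, not the authors' analysis code; the same function would apply to corpus versus engine, engine versus engine, or Time 1 versus Time 2 comparisons.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical corpus counts and search-engine hit counts for the same six words.
corpus_counts = [2, 15, 40, 110, 300, 1200]
hit_counts = [90_000, 310_000, 1_200_000, 2_500_000, 4_100_000, 21_000_000]

# Apply the square-root transformation to both sets of counts before correlating.
r = pearson_r(
    [math.sqrt(c) for c in corpus_counts],
    [math.sqrt(h) for h in hit_counts],
)
print(f"r = {r:.2f}")
```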

Frequency Estimates for Subsamples of Words

As noted, the full test sample was composed of 250 standard words and 132 nonstandard words. To examine the congruence among the frequency methods for the two subsamples, the analyses were repeated within each sample. These analyses showed that for both subsamples, the Internet search engines produced frequency estimates that were very reliable in terms of their consistency with one another (mean r = .79 and .89, for the standard and nonstandard samples, respectively) and their consistency across time (mean r = .91 and .89, for the standard and nonstandard samples, respectively). However, the congruence between the search engines and the other three methods was different for the two subsamples of words (Table 2).

First, the congruence between the search engines and the two traditional linguistic databases was higher for the standard than the nonstandard words (mean r = .78 vs. .57, z = 3.65, p < .001 for Kučera & Francis; mean r = .70 vs. .44, z = 3.64, p < .001 for CELEX). One explanation for this discrepancy is that the linguistic databases did not contain estimates for many of the nonstandard words. Specifically, the Kučera and Francis database was missing 51% of the nonstandard words (54% of the forenames and 42% of the slang terms), compared with only 3% missing for the standard words; CELEX was missing 66% of the nonstandard words (84% of the forenames and 17% of the slang terms), compared with only 2% missing for the standard words. A very large number of zero estimates could have attenuated the correlation for the nonstandard words. However, even when all words with a zero estimate were eliminated from the analysis, the correlation between the search engines and the standard databases continued to be higher for the standard than the nonstandard words (mean r = .77 vs. .56, z = 2.72, p < .01 for Kučera & Francis; mean r = .70 vs. .50, z = 1.90, p = .058 for CELEX). We cannot be certain why these differences exist. However, it does not seem surprising that the type of “common” English used on the Web and the more formal writing contained in the traditional linguistic corpora may differ in the frequency with which various slang terms and forenames are used.

Table 1

Correlation Coefficients Among Frequency Estimates for the Full Word Sample (N = 382)

Method                      CELEX  PR     AV     NL     EX     YH
Kučera & Francis (1967)     .92    .48    .81    .89    .78    .69
CELEX                              .46    .76    .81    .71    .62
Participant ratings (PR)                  .45    .49    .46    .47
Search engines
  AltaVista (AV)                          .94    .91    .81    .73
  Northern Light (NL)                            .96    .84    .76
  Excite (EX)                                           .94    .88
  Yahoo! (YH)                                                  .84

Note. Coefficients on the diagonal for the search engines are the test–retest reliability estimates. All correlations are significant at p < .0001.

Table 2

Correlation Coefficients Between the Internet Search Engines and the Other Three Methods of Obtaining Frequency Estimates, for the Standard and Nonstandard Word Samples

                            Standard Word Sample            Nonstandard Word Sample
Method                      PR    AV    NL    EX    YH      PR    AV    NL    EX    YH
Kučera & Francis (1967)     .40   .85   .88   .76   .64     .46   .46   .61   .60   .59
CELEX                       .36   .78   .79   .68   .56     .46   .44   .45   .44   .43
Participant ratings               .40   .43   .39   .40           .59   .61   .60   .63

Note. All correlations are significant at p < .0001. PR, participant ratings; AV, AltaVista; NL, Northern Light; EX, Excite; YH, Yahoo!

The second difference observed between the two word samples was that the congruence between the search engines and the participants’ ratings was lower for the standard than for the nonstandard words (mean r = .40 vs. .61, z = 2.62, p < .01). This finding suggests that the participants may have had an easier time making familiarity distinctions among relatively unfamiliar words. Nonetheless, in neither case was the correlation very high.
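The source does not name the test used to compare correlations between the standard and nonstandard word samples. The sketch below assumes a Fisher r-to-z test for two independent correlations, with sample sizes of 250 and 132 words; under that assumption it comes close to the reported statistics, with small differences expected because the inputs are rounded mean correlations. The function names are hypothetical.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher r-to-z transformation
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / standard_error


def two_tailed_p(z):
    """Two-tailed p value of a z statistic under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Search engines vs. Kučera & Francis: standard (n = 250) vs. nonstandard (n = 132) words.
z_kf = fisher_z_difference(0.78, 250, 0.57, 132)
# Search engines vs. participant ratings: nonstandard vs. standard words.
z_pr = fisher_z_difference(0.61, 132, 0.40, 250)

print(f"z = {z_kf:.2f}, p = {two_tailed_p(z_kf):.4f}")
print(f"z = {z_pr:.2f}, p = {two_tailed_p(z_pr):.4f}")
```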

Because researchers may question whether they can rely on Internet search engines for small samples of words, correlational analyses were conducted on 100 randomly selected samples of 30 standard words. Median correlations and their interquartile ranges based on these analyses are presented in Table 3. These numbers showed that even for relatively small samples of words, the Internet search engines produced word frequency estimates that were highly consistent with the two standard databases, highly consistent with one another, and highly consistent across time. The two smaller search engines (EX and YH), however, returned results that were a little less consistent with the standard databases than those obtained from the two larger search engines (AV and NL).
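The small-sample analysis described above can be sketched as repeated random subsampling: draw 100 samples of 30 words, correlate the two frequency measures within each sample, and summarize the resulting correlations by their median and interquartile range. The data are simulated and the function names are hypothetical; the source gives no further detail on the sampling procedure (for example, whether samples were drawn with or without replacement across samples).

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def subsample_correlations(x, y, sample_size=30, n_samples=100, seed=0):
    """Correlate x and y within each of n_samples random subsets of the words."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_samples):
        chosen = rng.sample(range(len(x)), sample_size)  # words drawn without replacement
        results.append(pearson_r([x[i] for i in chosen], [y[i] for i in chosen]))
    return results


# Simulated square-root-scale frequencies for 250 hypothetical standard words.
data_rng = random.Random(1)
corpus = [data_rng.uniform(0, 36) for _ in range(250)]
engine = [50 * c + data_rng.gauss(0, 200) for c in corpus]

correlations = subsample_correlations(corpus, engine)
median_r = statistics.median(correlations)
q1, _, q3 = statistics.quantiles(correlations, n=4)
print(f"Mdn = {median_r:.2f}, IR = {q3 - q1:.2f}")
```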

DISCUSSION

The present research demonstrates that Internet search engines provide word frequency estimates that are both valid and reliable. The estimates obtained from the four search engines showed good convergent validity with both Kučera and Francis (1967) and CELEX (Baayen et al., 1995), were highly consistent with one another, and showed excellent test–retest reliability over a 6-month period of time. These results ought to encourage researchers to take advantage of this highly accessible and easy-to-use method.

The high convergence between the search engines and Kučera and Francis (1967) also suggests that despite its age, Kučera and Francis is still a valid source for word frequencies. We have shown, however, that one of the greatest drawbacks of that method (and of similar databases, such as CELEX) is missing data. The lack of data for forenames is especially problematic for social psychologists who frequently employ forenames as stimuli and have few available means to estimate their frequency (for discussions of this problem, see Dasgupta, McGhee, Greenwald, & Banaji, 2000; Kasof, 1993). The lack of data for slang terms suggests that Kučera and Francis and CELEX may also not be a good source of frequency data when the words are relatively new to the lexicon or appear more often in speech than writing. The Internet search engines, in contrast, produced highly consistent and reliable word frequency estimates for both the standard and nonstandard words, suggesting that they can be used where other methods fail.

In addition to being an easy-to-use, cost-effective method of obtaining word frequencies, search engines may also open up other avenues for research. For example, by treating the Web as a linguistic database, researchers can conduct analyses of the contexts surrounding certain words. Such analyses could be informative in regard to the typical user of a word (e.g., age, education, culture) and the objects and social roles that are most often associated with it. An analysis of the surrounding context may also provide the researcher with a better sense of how familiar people really are with a particular word. If a word is most often listed in technical or otherwise specialized webpages, then it may not be as familiar to the average person as a word that is found on more mainstream webpages. Another advantage of using search engines for frequency analyses is the potential to search for phrases as well as for single words. For example, one might wonder if baseball bat or hockey stick occurs with greater frequency, or whether people are more familiar with “To be or not to be” or “I think therefore I am.” (In both cases, the former phrase occurs with much greater frequency than the latter.) Finally, it is important to note that search engines can be used to estimate word frequencies for languages other than English (see New, Pallier, Ferrand, & Matos, in press). Researchers who use words from more than one language may find it useful to conduct word frequency analyses with the same basic method. However, we caution that the validity of such searches would depend on the extent to which speakers of the language use the Web.

Table 3

Median Correlation Coefficients (Mdns) and Interquartile Ranges (IRs) Based on Analyses of 100 Random Samples of 30 Standard Words

                            CELEX       PR          AV          NL          EX          YH
Method                      Mdn   IR    Mdn   IR    Mdn   IR    Mdn   IR    Mdn   IR    Mdn   IR
Kučera & Francis (1967)     .91   .06   .42   .14   .85   .09   .89   .07   .77   .26   .65   .36
CELEX                                   .41   .11   .78   .10   .80   .10   .72   .29   .63   .36
Participant ratings (PR)                            .44   .11   .47   .13   .43   .12   .43   .12
Search engines
  AltaVista (AV)                                    .98   .06   .93   .06   .85   .27   .71   .36
  Northern Light (NL)                                           .99   .03   .82   .26   .72   .39
  Excite (EX)                                                               .98   .09   .97   .23
  Yahoo! (YH)                                                                           .92   .30

Note. Coefficients on the diagonal for the search engines are the test–retest reliability estimates.

The present research also provided evidence on the validity of participants’ subjective ratings of familiarity as an alternative measure of word frequency. The inconsistencies between such ratings and the other methods suggest that subjective familiarity is not equivalent to more objective measures of word frequency (see also Gernsbacher, 1984). In particular, the present data showed that the familiarity ratings were negatively skewed, whereas the other estimates were positively skewed. That negative skew reveals that the raters did not make distinctions among words that are relatively familiar but have very different objective frequencies in the language (e.g., lazy vs. school). This discrepancy was especially pronounced in the standard word sample, where the negative skew was greater than in the nonstandard word sample (−2.34 vs. −0.56, respectively). Other researchers have cautioned, however, that subjective familiarity should not be discounted as an important variable in cognition despite its differences from objective word frequency (Gernsbacher, 1984).

Although the present data provided strong evidence in favor of Internet search engines as a method of estimating word frequency, two caveats are in order. First, the two smaller search engines (Excite and Yahoo!) produced somewhat less consistent estimates with a relatively small sample of words. Thus, it is recommended that the larger search engines be used as a general rule because they have more representative databases. Second, Internet search engines are best used when relative word frequency estimates are satisfactory. With 463,830 hits in AltaVista for brush and 4,860,810 hits for earth, we know that the latter word occurs more frequently than the former word. However, with only a rough estimate of the total database (approximately 250 million) and the fact that it is always changing, the absolute frequencies of those words cannot be determined with any certainty. For many research purposes, relative word frequencies are the only estimates of interest, and it is for those studies that Internet search engines provide an excellent option.
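The relative comparison in the second caveat amounts to a ratio of hit counts. The following short sketch uses the two AltaVista counts quoted above; the dictionary name `hits` is illustrative only.

```python
# AltaVista hit counts quoted in the text for two example words.
hits = {"brush": 463_830, "earth": 4_860_810}

# A relative frequency estimate needs only the ratio of the two counts,
# not the (uncertain and changing) total size of the database.
ratio = hits["earth"] / hits["brush"]
print(f"earth returns about {ratio:.1f} times as many hits as brush")
```

Because both counts come from the same database at the same time, the unknown database size cancels out of the ratio, which is why relative estimates remain usable when absolute ones are not.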

REFERENCES

AllSearchEngines.Com homepage (2000, May). Available: http://www.allsearchengines.com.

Baayen, R. H., Piepenbrock, R., & Gulikers, L. (1995). The CELEX lexical database [CD-ROM]. Philadelphia: University of Pennsylvania, Linguistic Data Consortium.

Balota, D. A., & Rayner, K. (1991). Word recognition processes in foveal and parafoveal vision: The range of influence of lexical variables. In D. Besner & G. W. Humphreys (Eds.), Basic processes in reading: Visual word recognition (pp. 198-232). Hillsdale, NJ: Erlbaum.

Blair, I. V., & Banaji, M. R. (1996). Automatic and controlled processes in stereotype priming. Journal of Personality & Social Psychology, 70, 1142-1163.

Brysbaert, M., Lange, M., & Wijnendaele, I. V. (2000). The effects of age-of-acquisition and frequency-of-occurrence in visual word recognition: Further evidence from the Dutch language. European Journal of Cognitive Psychology, 12, 65-85.

Chalmers, K. A., Humphreys, M. S., & Dennis, S. (1997). A naturalistic study of the word frequency effect in episodic recognition. Memory & Cognition, 25, 780-784.

Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.

Dasgupta, N., McGhee, D. E., Greenwald, A. G., & Banaji, M. R. (2000). Automatic preference for White Americans: Eliminating the familiarity explanation. Journal of Experimental Social Psychology, 36, 316-328.

Devine, P. G. (1989). Stereotypes and prejudice: Their automatic and controlled components. Journal of Personality & Social Psychology, 56, 680-690.

Francis, W. N., & Kučera, H. (1982). Frequency analysis of English usage: Lexicon and grammar. Boston: Houghton Mifflin.

Gernsbacher, M. A. (1984). Resolving 20 years of inconsistent interactions between lexical familiarity and orthography, concreteness, and polysemy. Journal of Experimental Psychology: General, 113, 256-281.

Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring individual differences in implicit cognition: The implicit association test. Journal of Personality & Social Psychology, 74, 1464-1480.

Judd, C. M., & McClelland, G. H. (1989). Data analysis: A model comparison approach. San Diego: Harcourt Brace Jovanovich.

Kansas City Public Library (2000, March). Introduction to search engines. Available: http://www.kcpl.lib.mo.us/search.

Kasof, J. (1993). Sex bias in the naming of stimulus persons. Psychological Bulletin, 113, 140-163.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Leita, C. (2000, May). InfoPeople Search Tools Chart. Available: 2000 InFoPeople Project at http://infopeople.org/src/chart.html.

McEnery, T., & Wilson, A. (1996). Corpus linguistics. Edinburgh: Edinburgh University Press.

New, B., Pallier, C., Ferrand, L., & Matos, R. (in press). Une base de données lexicales du français contemporain sur internet: LEXIQUE [A lexical database of contemporary French on the Internet: LEXIQUE]. L’Année Psychologique.

Peterzell, D. H., Sinclair, G. P., Healy, A. F., & Bourne, L. E. (1990). Identification of letters in the predesignated target paradigm: A word superiority effect for the common word the. American Journal of Psychology, 103, 299-315.

Rubenstein, H., Garfield, L., & Millikan, J. A. (1970). Homographic entries in the internal lexicon. Journal of Verbal Learning & Verbal Behavior, 9, 487-494.

Thorndike, E. L., & Lorge, I. (1944). The teacher’s word book of 30,000 words (3rd ed.). New York: Columbia University, Teachers College Press.

NOTE

1. Some search engines report both the number of exact word matches and the number of websites that contain the word. It is the former hit count that provides the more accurate word frequency count.

(Manuscript received February 15, 2001; revision accepted for publication December 15, 2001.)


Article

Mar 2022

This study extends the examination of the difference between abstract concepts to the Chinese language and its peculiar characteristics in word formation, where components with different semantic content might be aggregated within a word. Native Chinese speakers categorised abstract and concrete words by moving the computer mouse towards their choice. Stimuli with a “semantically simple structure” (i.e. abstract-abstract/concrete-concrete) were compared with those with a “mixed structure” (i.e. abstract-concrete/concrete-abstract) to test for an effect of the conceptual content of the stimulus’s components on its overall processing. Response time and kinematic parameters revealed that: a) the semantic content of the components affected the processing of abstract but not concrete concepts, b) concepts differed when they have a semantically mixed structure, not a simple one. We extend the concreteness effect to logographic script and provide evidence that the presence of a concrete component within an abstract concept is elaborated and affects its processing.

Processing Evidence for the Grammatical Encoding of the Mass/Count Distinction in Mandarin Chinese

Article

Full-text available

Apr 2022
J PSYCHOLINGUIST RES

Using the Visual World Paradigm, the current study aimed to explore whether the mass/count distinction is determined by syntax in Mandarin Chinese, focusing on classified nouns in nominal phrases. By using dual-role classifiers, ontological count and mass nouns, and phrase structures with and without biased syntactic cues we found that the mass/count distinction is initially computed using phrase structure but can be overridden in cases where the syntax is incompatible with nouns’ ontological meanings. The results indicate that in Mandarin Chinese, syntactic cues can be rapidly used to make predictions about upcoming information in real time processing.

Word and Face Recognition Processing Based on Response Times and Ex-Gaussian Components

Article

Full-text available

May 2021
Entropy

The face is a fundamental feature of our identity. In humans, the existence of specialized processing modules for faces is now widely accepted. However, identifying the processes involved for proper names is more problematic. The aim of the present study is to examine which of the two is processed earlier and whether social abilities have an influence. We selected 100 university students divided into two groups: Spanish and USA students. They had to recognize famous faces or names in a masked priming task. An analysis of variance of the reaction times (RT) was used to determine whether significant differences could be observed between word and face recognition and between the Spanish and USA groups. Additionally, to examine the role of outliers, an exponentially modified Gaussian (ex-Gaussian) distribution was used. Famous faces were recognized faster than names, and differences were observed between Spanish and North American participants, but not for unknown distracting faces. The current results suggest that face processing might be faster than name recognition, which supports the idea that the two differ in the nature of their processing.

New insights into English count and mass nouns - the Cognitive Grammar perspective

Article

Full-text available

Oct 2020

Grzegorz Drożdż

The article deals with two of the long-standing problems in English linguistics: whether it is possible that each noun can have both count and mass senses, and the problem of determining a complete list of the regularities of count-to-mass and mass-to-count changes. While there have been numerous attempts to solve each of these problems, this article shows the results of applying Cognitive Grammar to them. The analysis covers a set of concrete nouns representative of English – sixty nouns with different ontological properties and all frequencies of occurrence. These are nouns that are classified by dictionaries as solely count and solely mass. Because of its usage-based character, the analysis scrutinises over 1,700 real-life utterances produced by native speakers of English. The analysis shows that even such nouns possess senses whose properties are the reverse of the properties of the nouns’ basic senses. A thorough examination of the nouns’ basic and extended senses leads to certain grammatical regularities of count-to-mass and mass-to-count changes. The analysis not only systematises the grammatical regularities determined so far and solves many problems that can be noticed about them, but also proposes novel regularities.

The Development and Preliminary Validation of the Serbian Value Lexicon - An Archival Approach to Value Measurement

Article

Aug 2023

Systematically mapping innovations in electricity using startups: A comprehensive database analysis

Article

Jun 2023
Tech Soc

References

Article

Apr 2022

Corpora are ubiquitous in linguistic research, yet to date, there has been no consensus on how to conceptualize corpus representativeness and collect corpus samples. This pioneering book bridges this gap by introducing a conceptual and methodological framework for corpus design and representativeness. Written by experts in the field, it shows how corpora can be designed and built in a way that is both optimally suited to specific research agendas, and adequately representative of the types of language use in question. It considers questions such as 'what types of texts should be included in the corpus?', and 'how many texts are required?' – highlighting that the degree of representativeness rests on the dual pillars of domain considerations and distribution considerations. The authors introduce, explain, and illustrate all aspects of this corpus representativeness framework in a step-by-step fashion, using examples and activities to help readers develop practical skills in corpus design and evaluation.

Research History and Trend of Scientific Research Management : Big Data Analysis Based on Bibliometrix Software

Conference Paper

Nov 2020

Stereotypes and Prejudice: Their Automatic and Controlled Components

Article

Full-text available

Jan 1989

Patricia G. Devine

Three studies tested basic assumptions derived from a theoretical model based on the dissociation of automatic and controlled processes involved in prejudice. Study 1 supported the model's assumption that high- and low-prejudice persons are equally knowledgeable of the cultural stereotype. The model suggests that the stereotype is automatically activated in the presence of a member (or some symbolic equivalent) of the stereotyped group and that low-prejudice responses require controlled inhibition of the automatically activated stereotype. Study 2, which examined the effects of automatic stereotype activation on the evaluation of ambiguous stereotype-relevant behaviors performed by a race-unspecified person, suggested that when subjects' ability to consciously monitor stereotype activation is precluded, both high- and low-prejudice subjects produce stereotype-congruent evaluations of ambiguous behaviors. Study 3 examined high- and low-prejudice subjects' responses in a consciously directed thought-listing task. Consistent with the model, only low-prejudice subjects inhibited the automatically activated stereotype-congruent thoughts and replaced them with thoughts reflecting equality and negations of the stereotype. The relation between stereotypes and prejudice and implications for prejudice reduction are discussed.

Une base de données lexicales du français contemporain sur internet: LEXIQUE™ / A lexical database for contemporary French: LEXIQUE™

Article

Full-text available

Jan 2001
ANN PSYCHOL

We present a new lexical database of French, named Lexique. Based on a corpus of texts written since 1950 which contained 31 million words, Lexique yields 130 000 entries including the inflected forms of verbs, nouns and adjectives. Each entry provides several kinds of information including frequency, gender, number, phonological form, graphemic and phonemic unicity points. Several tables give additional statistics such as the frequencies of various units: letters, bigrams, trigrams, phonemes and syllables. The database is available for free on the Internet.

Automatic and Controlled Processes in Stereotype Priming

Article

Full-text available

Jun 1996

The experiments in this article were conducted to observe the automatic activation of gender stereotypes and to assess theoretically specified conditions under which such stereotype priming may be moderated. Across 4 experiments, 3 patterns of data were observed: (1) evidence of stereotype priming under baseline conditions of intention and high cognitive constraints; (2) significant reduction of stereotype priming when a counterstereotype intention was formed even though cognitive constraints were high; and (3) complete reversal of stereotype priming when a counterstereotype intention was formed and cognitive constraints were low. These data support proposals that stereotypes may be automatically activated, as well as proposals that perceivers can control and even eliminate such effects.

Goals in Social Information Processing: The Case of Anticipated Interaction

Article

Full-text available

May 1989

Examined the role of anticipated-interaction instructions on memory for and organization of social information. In Study 1, Ss read and recalled information about a prospective partner (i.e., target) on a problem-solving task and about 4 other stimulus people. The results indicated that (a) Ss recalled more items about the target than the others, (b) the target was individuated from the others in memory, and (c) Ss were more accurate on a name–item matching task for the target than for the others. Study 2 compared anticipated interaction with several other processing goals (i.e., memory, impression formation, self-comparison, friend-comparison). Only anticipated-interaction and impression formation instructions led to higher levels of recall and more accurate matching performance for the target than for the others. However, the conditional probability data suggest that anticipated interaction led to higher levels of organization of target information than did any of the other conditions. Discussion considers information processing strategies that are possibly instigated by anticipated-interaction instructions.

Data Analysis: A Model-Comparison Approach

Article

Feb 1992

Aspects of The Theory of Syntax

Article

Feb 1970

Noam Chomsky

The Teacher's Word Book

Article

Jan 1921

Edward Lee Thorndike

An alphabetical list of 10,000 words which are found to occur most widely in a count of about 625,000 words from literature for children. 41 different sources were used.

Data analysis: A Model Comparison Approach (2nd ed.).

Book

Jan 2009

This completely rewritten classic text features many new examples, insights, and topics including mediational, categorical, and multilevel models. Substantially reorganized, this edition provides a briefer, more streamlined examination of data analysis. Noted for its model comparison approach and unified framework based on the general linear model, the book provides readers with a greater understanding of a variety of statistical procedures. This consistent framework, including consistent vocabulary and notation, is used throughout to develop fewer but more powerful model building techniques. The authors show how all analysis of variance and multiple regression can be accomplished within this framework. The model comparison approach provides several benefits: It strengthens the intuitive understanding of the material, thereby increasing the ability to successfully analyze data in the future; It provides more control in the analysis of data so that readers can apply the techniques to a broader spectrum of questions; It reduces the number of statistical techniques that must be memorized; It teaches readers how to become data analysts instead of statisticians. The book opens with an overview of data analysis. All the necessary concepts for statistical inference used throughout the book are introduced in Chapters 2 through 4. The remainder of the book builds on these models. Chapters 5-7 focus on regression analysis, followed by analysis of variance (ANOVA), mediational analyses, nonindependent or correlated errors, including multilevel modeling, and outliers and error violations. The book is appreciated by all for its detailed treatment of ANOVA, multiple regression, nonindependent observations, interactive and nonlinear models of data, and its guidance for treating outliers and other problematic aspects of data analysis. 
Intended for advanced undergraduate or graduate courses on data analysis, statistics, and/or quantitative methods taught in psychology, education, or other behavioral and social science departments, this book also appeals to researchers who analyze data. A protected website featuring additional examples and problems with datasets, lecture notes, PowerPoint presentations, and class-tested exam questions is available to adopters. This material uses SAS but can easily be adapted to other programs. A working knowledge of basic algebra and any multiple regression program is assumed.

Homographic entries in the internal lexicon

Article

Oct 1970

The task was to distinguish between English and nonsense words, which were displayed singly. The display persisted until S pressed the yes-key if he thought the stimulus was English or the no-key if he thought it was nonsense. The response times were faster for English than nonsense, faster for English words of higher frequency than lower frequency, and faster for homographs than nonhomographs. It is hypothesized that word recognition in general requires consulting the internal lexicon. A model of the underlying processes is sketched which proposes that words of higher frequency are recognized sooner because their lexical entries are marked earlier for comparison against the stimulus information. It is also proposed that homographs are recognized sooner than nonhomographs since homographs have more lexical entries available for comparison against the stimulus information.
