ChapterPDF Available

Tracking linguistic features underlying lexical variation patterns: A case study on Tuscan dialects

January 2016

January 2016

In book: The Future of Dialects (pp.117-134)
Edition: Language Variation
Publisher: Language Science Press
Editors: Marie-Hélène Côté, Remco Knooihuizen, John Nerbonne

Authors:

Simonetta Montemagni

Italian National Research Council

Martijn Wieling

University of Groningen

In this paper, we illustrate the application of hierarchical spectral partitioning of bipartite graphs in the study of lexical variation in Tuscany based on the data from a regional linguistic atlas. This method makes it possible not only to identify existing patterns of lexical variation in Tuscany, but also to uncover the underlying lexical features in terms of the most characteristic concept-lexicalization pairs. The results are promising, demonstrating the potential of the method for tracking the linguistic features underlying identified patterns of lexical variation and change across generations.

Geographic visualization of the clustering of Tuscan varieties into seven groups.

…

Value maps of the first (row 1) and second (row 2) topmost CL pairs for the A, C and E dialectal clusters.

…

Geographic visualization of the clustering of Tuscan varieties into two, four and seven groups.

…

Geographic visualization of a four-way clustering of Tuscan varieties on the basis of data from young vs. old speakers.

…

Figures - uploaded by Martijn Wieling

Content may be subject to copyright.

Content uploaded by Martijn Wieling

Content may be subject to copyright.

Tracking linguistic features underlying lexical

variation patterns: A case study on Tuscan

dialects

Simonetta Montemagni

Istituto di Linguistica Computazionale

“Antonio Zampolli”, ILC-CNR

Martijn Wieling

University of Groningen, CLCG

In this paper, we illustrate the application of hierarchical spectral partitioning of

bipartite graphs in the study of lexical variation in Tuscany based on the data from a

regional linguistic atlas. This method makes it possible not only to identify existing

patterns of lexical variation in Tuscany, but also to uncover the underlying lexical

features in terms of the most characteristic concept-lexicalization pairs. The results

are promising, demonstrating the potential of the method for tracking the linguistic

features underlying identified patterns of lexical variation and change across

generations.

Introduction

In dialectometry (Séguy 1971) the focus lies on the aggregate analysis of dialect variation. In

contrast to “cherry-picking” a few linguistic items confirming the analysis one wishes to settle on

(Nerbonne 2009), the advantage of the aggregate approach is that it offers a more objective view of

dialect variation. Unfortunately, many studies focusing on the aggregate pattern of dialect variation

have disregarded the underlying linguistic basis. As a consequence, linguists have remained critical

of the dialectometric approach (Schneider 1988, Woolhiser 2005, Loporcaro 2009).

To counter this criticism, various new dialectometric methods have been developed aimed at

identifying the linguistic basis of dialectal variation (as reviewed in Wieling and Nerbonne 2015).

For example, Nerbonne (2006) and Pröll et al. (in press) use an approach based on factor analysis,

whereas Shackleton (2005) uses principal component analysis. Grieve et al. (2011) follow the

workflow of traditional dialectology (i.e. identifying isoglosses, bundling isoglosses and cluster

analysis) by using multivariate spatial analysis.

The method we will apply here, hierarchical bipartite spectral graph partitioning (HBSGP), has

been developed by Wieling and Nerbonne (2009, 2010, 2011), who adopted it from information

retrieval (Dhillon 2001) and applied it to dialectology. HBSGP results in a clustering of

geographical varieties while simultaneously providing a linguistic basis for each of the identified

clusters. The approach of Wieling and Nerbonne (2011) has been successfully applied to study

phonetic variation in Dutch dialects (Wieling and Nerbonne 2011), English dialects (Wieling et al.

2013) and Tuscan dialects (Montemagni et al. 2012, 2013). More recently, the method has also been

applied to investigate lexical variation in contemporary English dialects on the basis of the BBC

Voices data (Wieling et al. 2014a).

In this study, we focus on lexical variation. Our dataset, a regional lexical atlas of Tuscan dialects

whose data have a diatopic and diachronic characterization, allows us to explore the potential of the

HBSGP method in the study of lexical variation. In particular, it enables us to identify lexical

features and their relationships on the one hand and to reconstruct the dynamics of lexical change

across generations on the other hand. Technically, a new measure is proposed for determining the

most important lexical features associated with the identified dialectal areas.

Data

We investigate Tuscan lexical variation on the basis of a linguistic atlas of Tuscany, the Atlante

Lessicale Toscano (ALT, Giacomelli et al. 2000), now available as an online resource

(http://serverdbt.ilc.cnr.it/ALTWEB). ALT is a regional Italian lexical atlas focusing on dialectal

variation throughout Tuscany, where both Tuscan and non-Tuscan dialects are spoken. In this paper

we focus on Tuscan dialects only, recorded in 213 localities by a total of 2060 informants who were

selected with respect to various socio-demographic parameters (such as age, education and gender).

ALT interviews were carried out on the basis of a questionnaire of 745 target items, designed to

elicit mainly lexical, but also semantic and phonetic variation. This study is based on the results of

onomasiological questions, i.e. starting from concepts and looking for their lexicalizations. A

typical onomasiological question asks how a given concept is designated or named, e.g. “what is the

name for flat and crispy bread, seasoned with salt and oil?”. To avoid interference with non-

lexicalized answers, we excluded questions prompting 50 or more distinct lexical items.

Furthermore, we only considered nouns (the large majority of items of ALT questionnaire) in this

study. The resulting subset consists of 170 questionnaire items for which a total of 5,174 distinct

normalized answers were given (on average 30 lexical variants per concept) distributed into 61,496

geo-referenced responses (i.e. associated with locations). The total number of speaker-responses

was 384,454.

To abstract away from phonetic variation, we used the most abstract representation level present in

ALT (Cucurullo et al. 2006). This normalized representation was meant to abstract away from

phonetic variation (caused by productive phonetic processes), but did not remove morphological

variation or variation caused by unproductive phonetic processes. In this study we used the

normalized lexical answers to the selected subset of 170 onomasiological questions. The same set of

questions has also been used by Wieling et al. (2014b) in a study of lexical differences between

Tuscan dialects and standard Italian.

The representativeness of the selected sample with respect to the whole set of ALT onomasiological

questions (i.e. a total of 460 questionnaire items) was assayed using the correlation between overall

lexical distances and lexical distances obtained from the selected sample (Wieling et al. 2014b). The

Pearson’s correlation coefficient was r = 0.94, showing the representativeness of the selected

sample with respect to the whole set of onomasiological questions.

Methods

In this study, we use hierarchical bipartite spectral graph partitioning as our method of choice

(Wieling and Nerbonne 2011). As mentioned before, this approach simultaneously clusters the

geographic locations together with the linguistic features characterizing them. In this case, a cluster

of locations is characterized by a linguistic basis expressed in terms of the most salient lexical

features. These lexical features can be seen as a proxy of the traditional notion of lexical isoglosses,

establishing the boundaries of dialectal areas.

Every variety attested in a given location is described in terms of concept-lexicalization (CL) pairs

linking each of the 170 selected concepts with its lexicalization(s) (reported in the normalized form)

in the specific location. CL frequencies are normalized by dividing the number of recorded answers

by the number of informants in a given location, with their value ranging between 0 and 1. Since for

each location there was a socio-demographically differentiated group of informants potentially

giving rise to multiple responses to denote the same concept, the sum of normalized frequencies of

lexical variants associated with the same concept in a certain location can be greater than 1.

The input for the HBSGP method is a bipartite graph which contains two sets of vertices, locations

and CL pairs, connected by lines. There exists a line between a location and a CL pair whenever at

least one of the speakers in the location uses the lexical variant. The lines are weighted between 0

and 1. A value of 0 indicates that no speakers in the location use the lexical variant (and thus equals

the absence of a line), whereas a value of 1 indicates that all speakers in the location use the lexical

variant to denote the concept being investigated. Table 1 gives an example of (a tabular

representation of) the bipartite graph, with the rows corresponding to the locations and the columns

to the CL pairs. About 80% of the speakers in Caprese Michelangelo use the form aràncio to denote

an ORANGE. A similar number of speakers also uses melàngola to denote the same (speakers

frequently provided multiple lexicalizations to denote a certain concept).

The input matrix is then subjected to singular value decomposition (SVD), and the k-means

clustering algorithm (with k equals 2) is applied to the results of the SVD resulting in a two-way

clustering. The k-means clustering was repeated 1000 times for robustness. As the output of the

SVD combines the locations with the CL pairs, the clustering likewise groups locations and CL

pairs. Consequently, lexical variants grouped with locations can be seen as characteristic elements

of those locations. For more mathematical details, we refer the interested reader to Wieling and

Nerbonne (2011).

Location

ORANGE-arància

ORANGE-aràncio

ORANGE-melàngola

Caprese Michelangelo

0.1379

0.7931

Pieve Santo Stefano

0.4000

0.7333

0.2000

Anghiari

0.0000

0.7059

1.0000

Sansepolcro

0.0000

1.0000

Table 1: Tabular representation of a bipartite graph. The numbers represent the normalized

frequency (obtained by dividing by the number of speakers) of the lexical variant associated with a

given concept in the different locations which ranges between 0 and 1. As the speakers may use

multiple variants to denote a concept, the normalized frequencies associated with a concept in a

certain location do not have to sum to 1.

In order to identify the most characteristic linguistic features for a group of locations, Wieling and

Nerbonne (2011) combined two different criteria which were implemented in two different and

complementary measures: representativeness and distinctiveness. Representativeness measures the

relative frequency of the lexicalization of a given concept in the locations in the cluster. For

example, if the cluster contains ten locations and all speakers in seven locations use the lexical

variant, the representativeness is 0.7. Distinctiveness measures how frequently the lexical variant

occurs within as opposed to outside of the cluster (corrected for the relative size of the cluster,

which is calculated by dividing the number of locations in the cluster by the total number of

locations in the dataset). A distinctiveness of 1 indicates that the lexical variant is only used inside

the cluster. The distinctiveness equals 0 when the relative frequency of the lexical variant in the

cluster is equal to the relative size of the cluster (i.e. it is not distinctive). Interestingly, the measures

of representativeness and distinctiveness are reminiscent of the Consistency and Homogeneity

measures introduced by Labov and colleagues for the construction of isoglosses in the Atlas of

North American English (Labov et al. 2006). Homogeneity measures how much variation exists

within the region defined by the isogloss (i.e. corresponding to a non-chance corrected variant of

distinctiveness) and Consistency (i.e. corresponding to representativeness) measures how strongly

the variable is concentrated within a given region.

The two measures capture two different equally important desiderata of isoglosses: to put it in the

words of Labov et al. (2006), “First, we want the area defined to be as uniform as possible […].

Second, we want as high a proportion of hits as possible to be located within the isogloss”. For this

reason they need to be combined. Wieling and Nerbonne (2011) combined representativeness and

distinctiveness measures by averaging them, yielding the importance score. Here, we propose that

to determine the relevance of CL pairs in the characterization of identified lexical areas it is better

to multiply the two values. The advantage of this approach is that it is not possible to assign high

importance values to lexical variants which score high on a single measure only. For example,

lexical variants occurring in all locations are highly representative, but not distinctive. Similarly, a

lexical variant only occurring in a single location is highly distinctive, but not representative (unless

the cluster contains a single location). Note that constraints on isogloss construction were also

foreseen by Labov et al. (2006) by enforcing frequency thresholds. However, the advantage of the

approach proposed by Wieling and Nerbonne (2011) and its evolution presented here consists in the

fact that no a priori constraints on the values of individual measures are defined.

Results

In this section, we report the results of applying the HBSGP method to the selected ALT dataset.

The results obtained are based on 5,174 CL pairs and 213 locations, which correspond to all lexical

data gathered through fieldwork (as opposed to a dataset in which infrequent lexical variants are

filtered out) for the 170 selected concepts. See Wieling and Montemagni (2015) for a discussion of

the advantages connected with this dataset.

The map in Figure 1 shows the geographic visualization of the clustering of Tuscan varieties into

seven groups designated as follows: the Florence area (A), the western Tuscan area (C) and the

dialects from Arezzo, Siena, Grosseto and Mount Amiata (E) which represent the three main

groupings, together with the dialects from Elba island (D), Chiana Valley (F), Capraia Island (G)

and Apuan Alps (B) which are minor but clearly distinct dialectal areas.

Figure 1: Geographic visualization of the clustering of Tuscan varieties into seven groups.

It is interesting to note that this result is in line with the classifications of Tuscan dialects proposed

by Giacomelli (1975) for what concerns the lexicon, and by Giannelli (1976, 2000) which is based

instead on phonetic, phonemic, morpho-syntactic and lexical features. It is also in line with the

subdivision of Tuscan dialects by Pellegrini (1977), in spite of it being mainly based on the

distribution of phonetic phenomena.

Linguistic features underlying identified lexical areas

For what concerns the underlying lexical features, we first focus on the three main dialectal clusters

(A, C and E). Table 2 reports for each cluster the five most important CL pairs with associated

values of representativeness, distinctiveness and importance.

The relevance of the lexical features with respect to the dialectal subdivision emerges clearly from

the value maps in Figure 2, which show the geographic distribution of the first and second topmost

lexical features of each of the three main identified clusters (A, C and E). The topmost lexical

features associated with each identified cluster can be assimilated with the traditional notion of

bundle of isoglosses, which have long been considered a major criterion for the definition of dialect

areas: as Chambers and Trudgill (1998) put it, “the significance of a dialect area increases as more

and more isoglosses are found which separate it from adjoining areas”.

Cluster

Concept-Lexicalization pair

Representativeness

Distinctiveness

Importance

TURKEY-bìllo

0.863

0.700

0.604

CORNER OF TISSUE-pìnzo

0.724

0.795

0.576

EYE GUM-cipìcchia

0.624

0.920

0.574

OIL JAR-zìro

0.879

0.609

0.535

VAT-bigónzo

0.649

0.821

0.533

ORANGE-arància

0.779

0.675

0.526

LADLE-romaiòlo

0.788

0.536

0.423

OIL JAR-órcio

0.671

0.590

0.396

TURKEY-tàcco

0.390

1.000

0.390

BRAWN-capofréddo

0.432

0.900

0.389

OIL JAR-cóppo

0.749

0.696

0.522

EYE GUM-cìspia

0.702

0.676

0.474

BREAST-pùppa

0.649

0.717

0.466

FLEA-pùce

0.602

0.686

0.413

CLUSTER OF GRAPES-pìgna

0.570

0.701

0.400

Table 2: The five topmost lexical variants for the three main clusters of Tuscan dialects.

By comparing the maps of Figure 2, we can observe that the geographic distribution of the topmost

CL pairs of the E, A and C clusters does not cover all and only the locations in the cluster. Each of

them can be seen as a quantitative visualization of individual isoglosses, where darkness of color

denotes the frequency of occurrence of the represented lexical variant (dark colors denote a greater

frequency, lighter colors lower frequency, and no coloring indicates the absence of the variant). As

can be observed, lexical variants shown in Table 2 may occur beyond the border of the cluster area,

thus lowering the distinctiveness value of the CL pair, or they may not occur in the whole cluster

area resulting in a lower representativeness. For instance, in cluster A comparable

representativeness values are observed for the two topmost CL pairs (0.77 - 0.78), whereas the CL

ranked in second place, i.e. LADLE-romaiòlo, has a lower distinctiveness value (0.53) than the

topmost CL (i.e. whose distinctiveness value is 0.67). Different patterns can be observed in clusters

E and C, with decreasing representativeness and increasing distinctiveness in the former case, and

with both of them decreasing in the latter case. Despite these slight differences, in all cases

representativeness and distinctiveness show relatively high values which never reach the value of 1

(with the only exception of the CL pair TURKEY-tàcco in cluster A whose distinctiveness is equal to

1). The average values of the five topmost lexical features for representativeness and distinctiveness

range between 0.61 and 0.74, and 0.69 and 0.77 respectively, demonstrating that the corresponding

dialect areas are not marked by very clear and strong dialect borders.

Figure 2: Value maps of the first (row 1) and second (row 2) topmost CL pairs for the A, C and E dialectal clusters.

Areas with darker (blue) color denote a greater frequency of occurrence of the selected lexical variant; lighter colors

denote a lower frequency, while no coloring (white) denotes the absence of the variant.

Different distinctiveness-representativeness patterns are observed in the case of the smaller

peripheral areas B, D, F and G (see Table 3). Here, the most salient CL pairs are highly distinctive

(their average values ranges from 0.84 to 1), with the average representativeness ranging from 0.49

to 1. Thus smaller dialect areas are characterized by much more distinctive features than the larger

areas.

Besides the strength of dialectal borders, granularity of the identified dialectal areas is another open

issue in the study of dialectal variation. Consider, for instance, the traditional dialectal subdivision

of Tuscan dialects by Pellegrini (1977) and Giannelli (1976, 2000). In his Carta dei Dialetti

d’Italia, Pellegrini (1977) identifies a western variety of Tuscan which is further subdivided into

Pisano-Livornese-Elbano, and Pistoiese and Lucchese. On the other hand, Giannelli (1976, 2000)

identifies Pisano-Livornese, Lucchese, Elbano and Pistoiese as independent dialectal varieties in his

seminal work Toscana. The two subdivisions are compatible with each other but adopt different

levels of granularity, i.e. they are seen through lenses differing in their magnifying power.

Depending on the specific goals of a study, different levels of granularity of the dialectal landscape

may be appropriate. By exploiting the hierarchical clustering results, the HBSGP method can also

be used to identify increasingly smaller dialectal areas associated with progressively more specific

lexical features. These nested dialect areas are characterized by nested isoglosses (i.e. the spatial

distribution of one feature is entirely contained within that of another). To assess these nested

isoglosses, we compare the geographical and linguistic results obtained by clustering the selected

dataset into two, four and seven groups (with the latter representing the clustering discussed so far).

Cluster

Concept-Lexicalization pair

Representativeness

Distinctiveness

Importance

FINCH-frenguéllo

1.000

CUCUMBER-citróne

1.000

0.973

HAIL-granìschia

0.667

1.000

0.667

GOOSE-ciucióne

0.667

1.000

0.667

LIZARD-racanàccio

0.667

1.000

0.667

SNOW-gnéva

0.429

1.000

0.429

ROLLING PIN-canèlla

0.429

1.000

0.429

STYE-orzaiolo

0.653

0.633

0.414

GARBAGE-rùsco

0.531

0.734

0.389

LIZARD-ciortellóne

0.430

0.853

0.367

HORNET-buffóne

0.950

1.000

0.950

KHAKIS-cicàchi

0.500

1.000

0.500

KHAKIS-cicàco

0.500

1.000

0.500

PINE CONE-pignòcca

0.500

1.000

0.500

TROUGH-tròlego

0.500

1.000

0.500

WATERMELON-patècca

1.000

MELON-melòne

1.000

CLUSTER-raspòllo

1.000

SQUIRREL-miseràngolo

1.000

LIZARD-bìscia

1.000

Table 3: The five topmost lexical variants for the smaller peripheral areas B, D, F and G.

Figure 3 reports the geographic visualization of clustering the Tuscan varieties into two, four and

seven groups. In the map with two clusters (Figure 3, left), the large red cluster corresponds to the

composite set of Tuscan dialects, excluding only the Chiana Valley dialects (cyan cluster). The map

with four clusters (Figure 3, middle) shows the main subdivision of Tuscan dialects between

Northern dialects (cyan and green clusters), covering (from east to west) Fiorentino, Pistoiese,

Lucchese and Pisano-Livornese, and Southern dialects (violet and red clusters), i.e. (from east to

west) the dialect from Arezzo, Siena and Grosseto (violet cluster) and from the Chiana valley (red

cluster). The map containing seven clusters (Figure 3, right) has already been discussed above.

Figure 3: Geographic visualization of the clustering of Tuscan varieties into two, four and seven groups.

Table 4 shows the lexical features characterizing the red, cyan and blue clusters in the first, second

and third map, respectively. These clusters cover a progressively restricted area including the

province of Florence (in the map with seven clusters it almost coincides with the province). Table 4

reports, for each of these clusters, the five topmost lexical variants with their associated scores. The

most salient CL pairs characterizing the red cluster of the two-clusters map coincide with pan-

Tuscan words well known from the literature (Giacomelli and Poggi Salani 1984): they show a

distinctiveness value equal to 1 and very high representativeness values (≥ 0.79). Similar

observations hold for the cluster corresponding to the set of Northern Tuscan dialects (the cyan

cluster in Figure 3, middle) with one main difference: all values are considerably lower, with a

general reduction observed at the level of representativeness. This illustrates that the cyan cluster is

a heterogeneous area. However, by comparing the CL pairs underlying the cyan cluster in the

second map and the blue cluster in the third map, we can also see there are two shared lexical

variants, namely OIL JAR-cóppo and EYE GUM-cìspia, which appear among the topmost features

whose importance values in the smaller blue cluster are higher (determining a higher ranking),

despite their unavoidably lower distinctiveness. In this case, these CL pairs are more characteristic

of the smaller cluster, whereas a word such as THIMBLE-anèllo is more characteristic of the larger

cluster. This suggests that whenever the same features appear to qualify nested clusters, they should

be taken as relevant features for the cluster in which they play a more prominent role (i.e. having a

higher importance value). Consequently, OIL JAR-cóppo and EYE GUM-cìspia should be removed

from the most salient features of the cyan cluster due to the lower importance (0.424 against 0.522

for the former, and 0.397 against 0.474 for the latter) with respect to the nested blue cluster.

In sum, these results show that hierarchical spectral partitioning can be usefully exploited to

identify dialectal areas at different levels of granularity with their associated lexical features. In

particular, the method may help in the selection of the most appropriate isoglosses for each dialectal

area and in the reconstruction of nested isoglosses.

Cluster

Concept-Lexicalization pair

Representativeness

Distinctiveness

Importance

Two-

cluster

map:

Red

SINK-acquàio

0.909

1.000

0.909

CELERY-sèdano

0.853

1.000

0.853

MELON-popóne

0.844

1.000

0.844

LAUREL-allòro

0.801

1.000

0.801

WATERMELON-cocómero

0.794

1.000

0.794

Four-

cluster

map:

Cyan

THIMBLE-anèllo

0.495

0.857

0.424

OIL JAR-cóppo

0.525

0.808

0.424

CATERPILLAR-brùcio

0.448

0.928

0.416

EYE GUM-cìspia

0.498

0.798

0.397

TURKEY-lùcio

0.445

0.872

0.388

Seven-

cluster

map:

Blue

OIL JAR-cóppo

0.749

0.696

0.522

EYE GUM-cìspia

0.702

0.676

0.474

BREAST-pùppa

0.649

0.717

0.466

FLEA-pùce

0.602

0.686

0.413

CLUSTER OF GRAPES-pìgna

0.570

0.701

0.400

Table 4: The five topmost lexical variants of the red, cyan and blue areas in the two, four and seven-

cluster maps of Tuscan dialects.

Reconstructing the dynamics of lexical change

The hierarchical spectral partitioning method can also be used for studying the dynamics of lexical

change across generations. For this purpose, ALT speakers were grouped in an old age group (born

in 1930 or earlier – 1930 was the median year of birth) and a young age group (born after 1930). To

guarantee comparability of results, we focused on two maps each having four clusters. As Figure 4

shows, the analysis of the two datasets results in slightly different, partially overlapping lexical

areas, with the area corresponding to the southeastern (cyan) cluster being more restricted for the

older speakers. Major differences, however, are explicitly clear at the level of the underlying lexical

features. In particular, the central blue area is more restricted (and also linked with fewer CL pairs:

881 vs. 1193) in the map built on the basis of the answers by the young speakers.

Besides the different size of the set of associated linguistic features (i.e. more reduced in the case of

young speakers), it is interesting to note that 424 salient lexical features underlying the old speakers

map do not appear among the features underlying the young speakers map. These CL pairs

emerging from old speakers correspond typically to old-fashioned and traditional notions as well as

less common plants and animals. Examples include STRUCTURE FOR BED WARMER-prète, POPPY-

ròsolo, MUTTON-bìrro, SET OF POPLARS-alborellàia. These CL pairs can be seen as lexical variants

which are no longer being used by younger speakers, and these are likely to disappear altogether.

The number of CL pairs restricted to young speakers is much lower (112) than the number of CL

pairs restricted to the old speakers. In this case, the CL pairs correspond to standard Italian words

(e.g., CLOSET-ripostìglio, WEEPING WILLOW-sàlice piangènte, HARVEST-mietitùra), generic terms

(e.g., AFTERNOON-dòpo mangiàto, SLUG-lumàca ignùda) or “distorted” (i.e. deviant with respect to

traditional pronunciation) variants of dialectal terms (e.g., TUSCAN COLD CUT FROM PORK

SHOULDER-capricòllo). The typology of these lexical variants shows the dynamics of lexical change

ongoing in younger Tuscan generations, characterized by the loss of local features in favor of

generic or standard terms, and by the creative distortion of dialectal words.

In both cases, however, these CL pairs are not highly ranked (i.e. not the most important) for the

associated old and young clusters. Instead, the CL pairs underlying both maps (a total of 769) show

clear differences with respect to their ranking. For example, the 1st, 10th, 20th and 50th lexical

variants in the ranked list of CL pairs underlying the old speakers map correspond to the 60th,

809th, 59th and 818th position in the young CL pairs list, respectively. Similarly, the 1st, 10th, 20th

and 50th ranked lexical variants of the young speakers are ranked (respectively) in the 100th, 13th,

17th and 69th position in the old speakers list. The asymmetry between the old-young vs. young-old

correspondences can be seen as the result of a dialect leveling process, causing the lower

importance of old-fashioned lexical variants for the young speakers (which are top-ranked for the

old speaker). Seen from the perspective of young speakers, the disalignment of the ranking is more

reduced, reflecting an additional shared set of dialectal lexical items.

Table 5 reports the five topmost CL pairs underlying the blue cluster in the two maps. Clearly, the

importance values associated with the blue cluster of the old speakers are higher than those

associated with the blue cluster of the young speakers. This pattern is confirmed by comparing the

average importance scores of the top-10 and top-100 CL pairs in the two lists, which are much

higher for the old speakers (0.42 vs. 0.34 for the top-10 and 0.26 vs. 0.17 for the top-100). This may

also be seen as evidence in support of dialect leveling: lexical areas inferred from young speakers

data are characterized by less distinctive and/or representative features.

Figure 4: Geographic visualization of a four-way clustering of Tuscan varieties on the basis of data from young vs. old

speakers.

Cluster

Concept-Lexicalization pair

Representativeness

Distinctiveness

Importance

Old

speakers:

Blue

cluster

GRAPE-chìcco

0.721

0.828

0.597

CHESTNUT HUSK-rìccio

0.706

0.661

0.467

EMBERS-bràce

0.673

0.632

0.425

BRAZIER-bracière

0.596

0.680

0.405

HAZELNUT-nocciòla

0.794

0.507

0.403

Young

speakers:

Blue

cluster

BAT-pipistrèllo

0.736

0.538

0.396

BREAST-pùppa

0.428

0.900

0.385

THIMBLE-anèllo

0.394

0.893

0.352

OIL JAR-cóppo

0.437

0.772

0.337

EYE GUM-cìspia

0.431

0.779

0.335

Table 5: The five topmost lexical variants of the blue cluster in the young vs. old speakers maps of

Tuscan dialects.

Conclusion

In this paper, we illustrated the application of hierarchical spectral partitioning of bipartite graphs in

the study of lexical variation in Tuscany based on the dialectal corpus of the Atlante Lessicale

Toscano. Our results demonstrate the potential of the method in bridging the gap between models of

linguistic variation based on aggregate analyses and more traditional analyses based on individual

linguistic features.

By using the HBSGP method, we not only identified existing patterns of lexical variation in

Tuscany on the basis of the whole dialectal corpus, but also uncovered the underlying lexical

features in terms of the characterizing concept-lexicalization pairs. The most relevant CL pairs

represent the features used to classify and define each identified lexical area. To put it in more

traditional terms, they can be seen as a proxy of lexical isoglosses marking both the qualitative and

quantitative distribution of the lexical variants identified as discriminating features of a given

lexical dialect area. This entails that the set of the topmost CL pairs associated with each identified

lexical dialect area acts as a proxy of bundles of isoglosses, where the grading of individual

isoglosses within the bundle is determined on the basis of the combination of representativeness and

distinctiveness. If the representativeness score associated with identified isoglosses (CL pairs) can

help to shed light on how much variation exists within the area defined by a given isogloss, the

distinctiveness score reflects how strongly the lexical variant is concentrated within that area. By

comparing the results obtained for different dialect areas, we have seen that different stages of the

process of dialect differentiation can be inferred from the different values of these two measures:

dialectal subdivisions range from clearly defined areas to areas characterized by fuzzy borders.

We also investigated whether and to what extent patterns of lexical variation and their associated

features varied with respect to the granularity of the identified dialectal areas and with the age of

informants, revealing interesting results. The possibility of exploring linguistic variation at different

levels of granularity makes it possible to customize the analysis with respect to the user’s needs.

The linguistic features associated with increasingly smaller areas can be seen as nested isoglosses,

occurring when the spatial distribution of one feature is contained entirely within that of another

and establishing an implicational relationship between the two.

The analysis and comparison of lexical variation patterns and associated features across generations

showed that the method can also be usefully exploited to track the change in the typology of

features in young vs. old informants and to monitor the vitality of a dialect in a given area. In

particular, the HBSGP method turned out to effectively capture the dynamics of lexical change in

Tuscany, by highlighting the emergence of lexical innovations and the obsolescence of old-

fashioned traditional dialectal words.

Current directions of research include testing the robustness of these results by noisy clustering and

the analysis of lexical variation patterns across semantic domains.

References

Chambers, J.K. & Peter Trudgill. 1998. Dialectology. Second edition. Cambridge University Press,

Cambridge.

Cucurullo, Sebastiana, Simonetta Montemagni, Matilde Paoli, Eugenio Picchi & Eva Sassolini.

2006. Dialectal resources on-line: the ALT-Web experience. In Proceedings of the 5th International

Conference on Language Resources and Evaluation (LREC-2006), Genova, Italy, 24-26 May 2006,

pp. 1846-1851.

Dhillon, Inderjit S. 2001. Co-clustering documents and words using bipartite spectral graph

partitioning. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge

Discovery. ACM, New York, pp. 269-274.

Giacomelli Gabriella. 1975. Aree lessicali toscane. La ricerca dialettale, I, Pacini Editore, Pisa, pp.

115-152.

Giacomelli, Gabriella & Teresa Poggi Salani. 1984. Parole toscane. Quaderni dell’Atlante Lessicale

Toscano, 2(3), Leo S. Olschki Editore, Firenze, pp. 123-229.

Giacomelli, Gabriella, Luciano Agostiniani, Patrizia Bellucci, Luciano Giannelli, Simonetta

Montemagni, Annalisa Nesi, Matilde Paoli, Eugenio Picchi & Teresa Poggi Salani. 2000. Atlante

Lessicale Toscano. Lexis Progetti, Roma.

Giannelli Luciano. 2000. Toscana. Pacini Editore, Pisa (1976, first edition).

Grieve Jack, Dirk Speelman & Dirk Geeraerts. 2011. A statistical method for the identification and

aggregation of regional linguistic variation. Language Variation and Change, 23, Cambridge

University Press, pp. 193-221.

Labov, William, Sharon Ash & Charles Boberg. 2006. The Atlas of North American English.

Phonetics, Phonology and Sound Change. Mouton de Gruyter, Berlin-New York.

Loporcaro, Michele. 2009. Profilo linguistico dei dialetti italiani, Laterza, Roma-Bari.

Montemagni, Simonetta, Martijn Wieling, Bob de Jonge & John Nerbonne. 2012. Patterns of

language variation and underlying linguistic features: A new dialectometric approach. In Patricia

Bianchi, Nicola De Blasi, Chiara De Caprio and Francesco Montuori (Eds.), La Variazione

nell’italiano e nella sua Storia. Varietà e Varianti Linguistiche e Testuali. Atti dell’XI Congresso

SILFI (Società Internazionale di Linguistica e Filologia Italiana). Franco Cesati Editore, Firenze.

Vol. II, pp. 879-889.

Montemagni, Simonetta, Martijn Wieling, Bob de Jonge & John Nerbonne. 2013. Synchronic

patterns of Tuscan phonetic variation and diachronic change: Evidence from a dialectometric study.

Literary and Linguistic Computing, 28(1), Oxford University Press, pp. 157-172.

Nerbonne, John. 2006. Identifying linguistic structure in aggregate comparison. Literary and

Linguistic Computing, 21, Oxford University Press, pp. 463-476.

Nerbonne, John. 2009. Data-driven dialectology. Language and Linguistics Compass 3(1), pp. 175-

198.

Pellegrini, Giovan Battista. 1977. Carta dei Dialetti d'Italia, Pacini Editore, Pisa.

Pröll, Simon, Simon Pickl, Aaron Spettl. In press. Latente Strukturen in geolinguistischen Korpora.

In Michael Elmentaler, Markus Hundt, Jürgen E. Schmidt (Eds.), Deutsche Dialekte - Konzepte,

Probleme, Handlungsfelder. Kongress der Internationalen Gesellschaft für Dialektologie des

Deutschen (IGDD), Kiel vom 13.–15. September 2012. Stuttgart: Steiner.

Séguy, Jean. 1971. La relation entre la distance spatiale et la distance lexicale. Revue de

Linguistique Romane, 35(138), pp. 335-357.

Schneider, Edgar. 1988. Qualitative vs. quantitative methods of area delimitation in dialectology: A

comparison based on lexical data from Georgia and Alabama. Journal of English Linguistics, 21,

pp. 175-212.

Shackleton, Robert G., Jr. 2005. English-American speech relationships: A quantitative approach.

Journal of English Linguistics 33, pp. 99-159.

Wieling Martijn & John Nerbonne. 2009. Bipartite spectral graph partitioning to co-cluster varieties

and sound correspondences in dialectology. In Proceedings of the 2009 Workshop on Graph-Based

Methods for Natural Language Processes. ACL, Stroudsburg, PA, pp. 14-22.

Wieling Martijn & John Nerbonne. 2010. Hierarchical spectral partitioning of bipartite graphs to

cluster dialects and identify distinguishing features. In Proceedings of the 2010 Workshop on

Graph-Based Methods for Natural Language Processing, ACL, Stroudsburg, PA, pp. 33-41.

Wieling, Martijn & John Nerbonne. 2011. Bipartite spectral graph partitioning for clustering dialect

varieties and detecting their linguistic features. Computer Speech and Language 25, pp. 700-715.

Wieling, Wieling, Robert G. Shackleton, Jr. & John Nerbonne. 2013. Analyzing phonetic variation

in the traditional English dialects: Simultaneously clustering dialects and phonetic features. LLC

28(1). 31-41.

Wieling, Martijn, Clive Upton & Ann Thompson. 2014a. Analyzing the BBC Voices data:

Contemporary English dialect areas and their characteristic lexical variants. Literary and Linguistic

Computing, 29(1), Oxford University Press, pp. 107-117.

Wieling, Martijn, Simonetta Montemagni, John Nerbonne & R. Harald Baayen. 2014b. Lexical

differences between Tuscan dialects and standard Italian: Accounting for geographic and

sociodemographic variation using generalized additive mixed modeling. Language 90(3), pp. 669-

692.

Wieling, Martijn & John Nerbonne. 2015. Advances in dialectometry. Annual Review of Linguistics

Wieling, Martijn & Simonetta Montemagni. 2015. Infrequent forms: noise or not? In this volume.

Woolhiser, Curt. 2005. Political borders and dialect divergence/convergence in Europe. In Peter

Auer, Frans Hinskens & Paul Kerswill (Eds.), Dialect Change. Convergence and Divergence in

European Languages, Cambridge University Press, New York, pp. 236-262.

Infrequent forms: Noise or not?

Chapter

Full-text available

Jan 2016

In this study we ask the question whether simplifying the data in dialectometrical studies by removing infrequent forms is advantageous to uncover the geographical structure in dialect data. By investigating lexical variation in a large corpus of Tuscan dialect data via hierarchical bipartite spectral graph partitioning, we are able to identify the main geographical areas together with their linguistic basis. In order to assess the influence of infrequent forms, we conduct two analyses: one which includes only lexical variants used by at least 0.5% of the informants, and another which includes all lexical variants in the data. Using this approach we show that using all data enables us to find a geographical characterization with a more adequate linguistic basis than by using the trimmed data.

Dialectometry-based classification of the Central–Southern Italian dialects

Article

Full-text available

Apr 2024

Antonio Sciarretta

This paper provides a new classification of Central–Southern Italian dialects using dialectometric methods. All varieties considered are analyzed and cast in a data set where homogeneous areas are evaluated according to a selected list of phonetic features. Using numerical evaluation of these features and the Manhattan distance, a linguistic distance rule is defined. On this basis, the classification problem is formulated as a clustering problem, and a k-means algorithm is used. Additionally, an ad-hoc rule is set to identify transitional areas, and silhouette analysis is used to select the most appropriate number of clusters. While meaningful results are obtained for each number of clusters, a nine-group classification emerges as the most appropriate. As the results suggest, this classification is less subjective, more precise, and more comprehensive than traditional ones based on selected isoglosses.

Quantifying folk perceptions of dialect boundaries. A case study from Tuscany (Italy)

Article

Full-text available

Aug 2022

This paper aims to understand the contribution of geographical information in the perception of linguistic variation. A total of 813 mental maps collected among young speakers from different cities in Tuscany have been analyzed via an open-access web dialectometric tool (Gabmap). In particular, the study seeks to verify the role of geographic distance and the place of residence of the respondents in modeling perceived variation. The relationship between dialect grouping as made by linguists and perceived taxonomies of sublinguistic areas is also investigated. Results show that geographical proximity between mapped areas significantly predicts the perception of dialect similarity. Our participants made their decisions looking at (1) a keen sense of spatial contiguity, and (2) the synchronic presence of linguistic differences between the Tuscan subregions. Moreover, classification uncertainty grows when the mapped areas are very close to, or very distant from, the participants’ places of residence. Methodological and linguistic perspectives of mental maps in folk linguistics are finally discussed.

Learned ontology guided opinions analysis of extracted aspects from online product reviews

Article

Full-text available

Jun 2018

The opinions expressed by the consumers on online product reviews in e-commerce websites play major role in judging the evaluative character of the product aspect. These expressed opinions lack conceptual preciseness allowing consumers to use them in both syntactically and semantically different ways (lexical variations) on various aspects in the reviews. Also some section of consumers present their opinions in the implicit manner. The evaluation of these types of opinions for opinion orientations raises the semantic gap between the human language and the actual opinionated knowledge. Thus, extracting all these types of opinions on the product aspects may bridge the semantic gap and thereby improving the accuracy of the opinion orientation. In this paper, iterative ontology learning approach is carried out in order to solve the aforementioned problems. In the proposed method, first the pre-processed product reviews are analyzed for extracting opinionated lexical variations. Then, the reviews are further analyzed to extract the implicit opinions. Further, these opinionated lexical variations and implicit opinions with the reviews are formalized for ontology learning. The aspect, opinion pair is formed by reasoning the learned ontology. Finally, the aspect’s opinion orientation is ascertained by using the sentiwordnet scores in the improved geodesic distance metric. The evaluation of semantic orientation of opinions using the learned ontology guidance against the state-of-the-art approaches shows the effectiveness of the proposed method.

Lexical differences between Tuscan dialects and standard Italian: Accounting for geographic and socio-demographic variation using generalized additive mixed modeling

Article

Full-text available

Sep 2014

This study uses a generalized additive mixed-effects regression model to predict lexical differences in Tuscan dialects with respect to standard Italian. We used lexical information for 170 concepts used by 2,060 speakers in 213 locations in Tuscany. In our model, geographical position was found to be an important predictor, with locations more distant from Florence having lexical forms more likely to differ from standard Italian. In addition, the geographical pattern varied significantly for low- versus high-frequency concepts and older versus younger speakers. Younger speakers generally used variants more likely to match the standard language. Several other factors emerged as significant. Male speakers as well as farmers were more likely to use lexical forms different from standard Italian. In contrast, higher-educated speakers used lexical forms more likely to match the standard. The model also indicates that lexical variants used in smaller communities are more likely to differ from standard Italian. The impact of community size, however, varied from concept to concept. For a majority of concepts, lexical variants used in smaller communities are more likely to differ from the standard Italian form. For a minority of concepts, however, lexical variants used in larger communities are more likely to differ from standard Italian. Similarly, the effect of the other community- and speaker-related predictors varied per concept. These results clearly show that the model succeeds in teasing apart different forces influencing the dialect landscape and helps us to shed light on the complex interaction between the standard Italian language and the Tuscan dialectal varieties. In addition, this study illustrates the potential of generalized additive mixed-effects regression modeling applied to dialect data.*

Infrequent forms: Noise or not?

Chapter

Full-text available

Jan 2016

Patterns of language variation and underlying linguistic features: A new dialectometric approach

Conference Paper

Full-text available

Jan 2012

Analyzing the BBC Voices data: Contemporary English dialect areas and their characteristic lexical variants

Article

Full-text available

Apr 2014

This study investigates data from the BBC Voices project, which contains a large amount of vernacular data collected by the BBC between 2004 and 2005. The project was designed primarily to collect information on vernacular speech around the UK for broadcasting purposes. As part of the project, a web-based questionnaire was created, to which tens of thousands of people supplied their way of denoting thirty-eight variables that were known to exhibit marked lexical variation. Along with their variants, those responding to the online prompts provided information on their age, gender, and-significantly for this study-their location, this being recorded by means of their postcode. In this study, we focus on the relative frequency of the top ten variants for all variables in every postcode area. By using hierarchical spectral partitioning of bipartite graphs, we are able to identify four contemporary geographical dialect areas together with their characteristic lexical variants. Even though these variants can be said to characterize their respective geographical area, they also occur in other areas, and not all people in a certain region use the characteristic variant. This supports the view that dialect regions are not clearly defined by strict borders, but are fuzzy at best.© The Author 2013. Published by Oxford University Press on behalf of ALLC. All rights reserved.

Latente Strukturen in geolinguistischen Korpora

Chapter

Jan 2015

Ziel dieser Fallstudie ist es zum einen, die Faktorenanalyse als Instrument zur effektiven Auswertung und Interpretation geolinguistischer Daten zu propagieren. Es zeigt sich, dass die Faktorenanalyse tiefgreifende, latente Strömungen in der Gesamtvariation aufdecken kann, die sowohl beim Blick auf Einzelvarianten als auch in der bislang üblichen quantitativen Dialektologie verborgen bleiben. Dabei bleibt auch in großen Kartenkorpora der Zugriff auf die Rolle der Einzelvariante unverstellt. Zum anderen kann dargestellt werden, wie die unterschiedlichen Ebenen des Sprachsystems auch unterschiedliche geografische Konfigurationen zeigen. Das führt zur Einsicht, dass a) auf die einzelnen sprachlichen Systemebenen jeweils individuelle Faktoren unterschiedlich stark einwirken sowie b) das Ergebnis von Dialekteinteilungen durch die Wahl des zugrunde gelegten Materials klar vordeterminiert wird – und im Umkehrschluss, dass Einteilungen, die nur auf Ausschnitten der Daten oder einzelnen Systemebenen beruhen, nicht oder nur sehr eingeschränkt für andere Systemebenen sprechen können.

Political borders and dialect divergence/convergence in Europe

Chapter

Jan 2005

Curt Woolhiser

Introduction Political borders have long been a central concern of geographers, students of international relations, and legal scholars. Since the 1960s, a growing body of sociological and anthropological research has, in addition, provided valuable new insights concerning the sociocultural aspects of border regions. Dialectologists, on the other hand, have given scant attention to the role of modern political borders in the spatial distribution and diffusion of linguistic features, generally viewing such factors as physical geography, earlier migration and settlement patterns, patterns of trade, and the influence of urban centres as linguistically far more significant. However, with the rise of the modern nation state in the nineteenth century, accompanied in the twentieth century by the emergence of modern communications, improved transportation networks, greater geographical and social mobility of populations, and universal education, political borders have become a far more potent factor in dialect divergence and convergence. In many parts of the developed world, and particularly on the European Continent, dialect areas or dialect continua that are divided by international borders are, in many cases, beginning to show signs of divergence, either as a consequence of cross-border differences in the degree of cross-dialectal levelling or dialect maintenance, or as the result of convergence towards different superposed standard languages. Obviously, the mere existence of a political border is insufficient to cause dialect divergence, just as the existence of social differentiation does not necessarily entail linguistic divergence within a speech community.

La relation entre la distance spatiale et la distance lexicale

Article

Jan 1971

J. Séguy

Profilo linguistico dei dialetti italiani

Article

Jan 2013

Michele Loporcaro

The Atlas of North American English: Phonetics, Phonology and Sound Change

Book

Jan 2006

The Atlas of North American English: Phonetics, Phonology and Sound Change

Book

Jan 2006

The Atlas of North American English provides the first overall view of the pronunciation and vowel systems of the dialects of the U.S. and Canada. The Atlas re-defines the regional dialects of American English on the basis of sound changes active in the 1990s and draws new boundaries reflecting those changes. It is based on a telephone survey of 762 local speakers, representing all the urbanized areas of North America. It has been developed by William Labov, one of the leading sociolinguists of the world, together with his colleagues Sharon Ash and Charles Boberg. The Atlas consists of a printed volume accompanied by an interactive CD-ROM. The print and multimedia content is also available online. © 2006 by Walter de Gruyter GmbH & Co. KG, D-10785 Berlin. All rights reserved.

Tracking linguistic features underlying lexical variation patterns: A case study on Tuscan dialects

Abstract and Figures

Recommended publications

Energy keep us going

University of Groningen in 83rd place on THE ranking list

Origins Center to open at Fundamentals of Life in the Universe symposium

A grand future with small particles. Nanotechnology affects many aspects of our lives

Application of the linguistic atlas method to dialect study in the South‐Central Area

Support for end-weight as a determinant of linguistic variation and change

Non-standard dialects and linguistic data

Politeness, Time Constraints, and Collaboration in Decision-Making Meetings: A Case Study