Article

A 61 Million Word Corpus of Brazilian Portuguese Film Subtitles as a Resource for Linguistic Research

Abstract

This work documents the motivation and development of a subtitle-based corpus for Brazilian Portuguese, SUBTLEX-PT-BR, available at http://crr.ugent.be/subtlex-pt-br/. While the target language was Brazilian Portuguese, the methodology can be extended to any other languages with subtitles. A preliminary corpus comparison with a large conversational and written corpus was conducted to evaluate the validity of the corpus, and suggested that the subtitle corpus is more similar to the conversational than the written language. Future work on the methodology and the corpus itself is outlined. Its diverse use as a resource for linguistic research is discussed.
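To make the validity check described above concrete, here is a minimal sketch of one way a frequency-based corpus comparison can be run, assuming tab-separated word-frequency lists; the file names and format are illustrative, not the paper's actual pipeline.

```python
# Minimal sketch: correlate subtitle word frequencies with conversational
# and written frequencies over the shared vocabulary. File names and the
# "word<TAB>count" format are assumptions.
from scipy.stats import spearmanr

def load_freqs(path):
    """Read a word-frequency list into a dict of word -> count."""
    freqs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, count = line.rstrip("\n").split("\t")
            freqs[word] = int(count)
    return freqs

def rank_correlation(freqs_a, freqs_b):
    """Spearman rank correlation over the shared vocabulary of two lists."""
    shared = sorted(set(freqs_a) & set(freqs_b))
    rho, _ = spearmanr([freqs_a[w] for w in shared],
                       [freqs_b[w] for w in shared])
    return rho

subtitles = load_freqs("subtlex_pt_br.tsv")          # hypothetical paths
spoken = load_freqs("conversational_corpus.tsv")
written = load_freqs("written_corpus.tsv")
# On the paper's account, the subtitle list should track the
# conversational list more closely than the written one.
print(rank_correlation(subtitles, spoken), rank_correlation(subtitles, written))
```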
... are involved in tone sandhi, and the sandhi patterns are both various and complex, affecting either the first or the second tone or both tones of a T1+T2 sequence (Li 1965, Luo 1982, 2012, Rao 1986, 1987, Hsu 1995, Chen 2004, Chen et al. 2008). Previous studies have tried different approaches to explain its sandhi patterns, including generative approaches and classic Optimality Theory (OT; Prince & Smolensky 2004 [1993], Kager 1999, McCarthy 2002, 2008). ...
... Using the SUBTLEX corpus (Tang 2012) and past literature (Araújo 2002; Gonçalves 2004; Sempere 2006), a total of 49 reduplicants were compiled. The properties of the reduplicated verbs in this list were compared to the properties of Brazilian Portuguese verbs overall in the SUBTLEX corpus, and there was a difference in size and shape between the two sets of data (henceforth referred to as the reduplicant corpus and SUBTLEX corpus). ...
... Only nouns and adjectives were considered, while compounds and derived words were left out of the selection. After selecting the first version of 90 real word items, we turned to a homologous lexical database of Brazilian Portuguese, SUBTLEX_PT_BR (Tang, 2012), to check whether the selected items belonged to the same frequency interval (items in SUBTLEX_PT_BR were likewise subdivided into six intervals). Items that did not fit into the same frequency level in both corpora were excluded and we continued to evaluate novel candidates. ...
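The frequency-interval matching described in this excerpt can be sketched as follows; the equal-width log-frequency binning and all names are assumptions for illustration, not the authors' exact procedure.

```python
# Sketch: bin each corpus's words into six log-frequency intervals and
# keep only items falling in the same interval in both corpora.
import math

def freq_intervals(freqs, n_bins=6):
    """Map word -> interval index (0..n_bins-1) via equal-width log-frequency bins."""
    logs = {w: math.log10(c) for w, c in freqs.items()}
    lo, hi = min(logs.values()), max(logs.values())
    width = (hi - lo) / n_bins or 1.0  # guard against a degenerate range
    return {w: min(int((v - lo) / width), n_bins - 1) for w, v in logs.items()}

def same_interval_items(freqs_a, freqs_b):
    """Words assigned to the same frequency interval in both corpora."""
    bins_a, bins_b = freq_intervals(freqs_a), freq_intervals(freqs_b)
    return [w for w in bins_a if w in bins_b and bins_a[w] == bins_b[w]]
```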
... Although both orthographic forms, oxigénio and oxigênio, can be found in SUBTLEX-PT-BR (Tang, 2012), 14 out of 61 L1-BP speakers rejected oxigénio as a real Portuguese word. ...
Article
Vocabulary size has been repeatedly shown to be a good indicator of second language (L2) proficiency. Among the many existing vocabulary tests, the LexTALE test and its equivalents are growing in popularity since they provide a rapid (within 5 minutes) and objective way to assess the L2 proficiency of several languages (English, French, Spanish, Chinese, and Italian) in experimental research. In this study, expanding on the standard procedure of test construction in previous LexTALE tests, we develop a vocabulary size test for L2 Portuguese proficiency: LextPT. The selected lexical items fall in the same frequency interval in European and Brazilian Portuguese, so that LextPT accommodates both varieties. A large-scale validation study with 452 L2 learners of Portuguese shows that LextPT is not only a sound and effective instrument to measure L2 lexical knowledge and indicate the proficiency of both European and Brazilian Portuguese, but is also appropriate for learners with different L1 backgrounds (e.g. Chinese, Germanic, Romance, Slavic). The construction of LextPT, apart from joining the effort to provide a standardised assessment of L2 proficiency across languages, shows that the LexTALE tests can be extended to cover different varieties of a language, and that they are applicable to bilinguals with different linguistic experience.
... The corpus contains approximately 200 million word tokens and over 200,000 word types. The use of a subtitle corpus was motivated by the fact that lexical frequencies derived from subtitle texts have consistently been shown to outperform those from other genres in capturing behavioural responses in psycholinguistic tasks across languages (Brysbaert & New, 2009; Keuleers, Brysbaert, & New, 2010; Tang, 2012; Tang & de Chene, 2014). The expectation is that the higher the transitional phonotactic probability (estimated using token frequency) of a juncture sequence, the higher the acceptability rating (Albright, 2007; Bailey & Hahn, 2001; Goldrick, 2011). ...
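The token-frequency-weighted transitional probability mentioned in this excerpt can be estimated roughly as below; the segmented-lexicon input format is an assumption made for illustration.

```python
# Sketch: P(b | a) estimated by counting each segment bigram once per
# corpus token (weighted by word frequency) rather than once per type.
from collections import defaultdict

def transitional_probs(lexicon):
    """lexicon: iterable of (segment_list, token_frequency) pairs."""
    bigram = defaultdict(float)
    unigram = defaultdict(float)
    for segments, freq in lexicon:
        for a, b in zip(segments, segments[1:]):
            bigram[(a, b)] += freq   # weight each bigram by token frequency
            unigram[a] += freq
    return {(a, b): c / unigram[a] for (a, b), c in bigram.items()}

# e.g. transitional_probs([(["k","a","z","a"], 1200), (["k","a","f","e"], 800)])
```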
Article
Full-text available
This study investigates the Turkish partial reduplication phenomenon, in which the reduplicant is derived by prefixing a C₁VC₂ syllable, where C₁V is identical to the word-initial CV of the base and C₂ is one of the four linking consonants: -p, -m, -s, -r. This study reexamines the factors conditioning the choice of the linking consonant, focusing on the nature of the (dis)similarity (feature specificity) and the proximity (locality) between the consonants in the base and the linking consonant, using an acceptability rating task with over 200 participants and a diverse set of stimuli in terms of length and word shape. Results indicate a gradient identity avoidance effect that extends over all consonants in the base. Crucially, the effect of all consonants is not uniform, with the strength of the effect decreasing further into the base. The study also uncovers an elusive interplay between the distance-based decay effect and the syllable position effect, both of which turn out to play a role in these non-categorical patterns with multiple features. Furthermore, results indicate that identity avoidance operates over both individual features and whole segments. Overall, the study argues that locality-sensitive, feature-specific identity avoidance constraints are part of the grammar.
... The total number of responses was 12,949 (excluding outliers at ±2.5 standard deviations), of which 11,423 were accurate (~88%). In addition to stress and weight, three other key variables were considered: frequency, which was extracted from Tang's (2012) word corpus of Brazilian Portuguese film subtitles; phonotactic probability (bigram), calculated based on the Portuguese Stress Lexicon (Garcia, 2014); and phonological neighborhood density, which counts the number of words that differ from a given target word by a single phoneme. ...
Article
Full-text available
Categorical approaches to lexical stress typically assume that words have either regular or irregular stress, and imply that only the latter needs to be stored in the lexicon, while the former can be derived by rule. In this paper, we compare these two groups of words in a lexical decision task in Portuguese to examine whether the dichotomy in question affects lexical retrieval latencies in native speakers, which could indirectly reveal different processing patterns. Our results show no statistically credible effect of stress regularity on reaction times, even when lexical frequency, neighborhood density, and phonotactic probability are taken into consideration. The lack of an effect is consistent with a probabilistic approach to stress, not with a categorical (traditional) approach where syllables are either light or heavy and stress is either regular or irregular. We show that the posterior distribution of credible effect sizes of regularity is almost entirely (96.28%) within the region of practical equivalence, which provides strong evidence that no effect of regularity exists in the lexical decision data modelled. Frequency and phonotactic probability, in contrast, showed statistically credible effects given the experimental data modelled, which is consistent with the literature.
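The phonological neighborhood density covariate used in this line of work (the number of words differing from the target by a single phoneme) can be computed along these lines; the phoneme-tuple representation of the lexicon is an assumption.

```python
# Sketch: neighborhood density as the count of lexicon words reachable
# from the target by one phoneme substitution, insertion, or deletion.
def one_phoneme_apart(a, b):
    """True if b differs from a by exactly one substitution, insertion, or deletion."""
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    # deleting one phoneme from the longer form must yield the shorter one
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))

def neighborhood_density(target, lexicon):
    """lexicon: iterable of phoneme tuples, e.g. ("k", "a", "z", "a")."""
    return sum(one_phoneme_apart(target, w) for w in lexicon if w != target)
```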
Article
The objective of this paper is to present and make publicly available the NILC-Metrix, a computational system comprising 200 metrics proposed in studies on discourse, psycholinguistics, cognitive and computational linguistics, to assess textual complexity in Brazilian Portuguese (BP). The metrics are relevant for descriptive analysis and the creation of computational models and can be used to extract information from various linguistic levels of written and spoken language. The metrics were developed over the last 13 years, starting at the end of 2007, within the scope of the PorSimples project. Once PorSimples finished, new metrics were added to the initial 48 metrics of the Coh-Metrix-Port tool. Coh-Metrix-Port adapted to BP some metrics from the Coh-Metrix tool, which computes metrics related to the cohesion and coherence of texts in English. Given the large number of metrics, we present them following an organisation similar to the metrics of Coh-Metrix v3.0, to facilitate comparisons between metrics in Portuguese and English in future studies using both tools. In this paper, we illustrate the potential of the NILC-Metrix by presenting three applications: (i) a descriptive analysis of the differences between children's film subtitles and texts written for Elementary School I (1st to 5th grade) and II (Final Years; 6th to 9th grade, an age group corresponding to the transition between childhood and adolescence); (ii) a new predictor of textual complexity for the corpus of original and simplified texts of the PorSimples project; (iii) a complexity prediction model for school grades, using transcripts of children's story narratives told by teenagers. For each application, we evaluate which groups of metrics are more discriminative, showing their contribution to each task.
Article
Most sociolinguistic work on variation focuses on how rates of occurrence or mean measurements differ between speech communities and speakers. However, speakers and communities also differ in variability – that is, in dispersion around the mean. The current study investigates the effects of speech style and multilingualism on variation and variability, by measuring the degree of intervocalic /bdɡ/ spirantization in spontaneous and careful speech. Data come from two varieties of Uruguayan Spanish, one monolingual (Montevideo) and one in contact with Brazilian Portuguese (Rivera). The results from a variation analysis confirm expected linguistic and social effects on gradient spirantization. An analysis of variability shows that, at the group level, careful speech is more variable than spontaneous speech, and the data from Rivera is more variable than that from Montevideo. Variability at the individual level differs slightly, suggesting that the group-level variability arises from between-speaker variability and within-speaker variability in different contexts. I propose that multilingualism in Rivera may heighten variability because contact with Portuguese provides a wider range of available pronunciations, and that careful speech may increase variability because the available pronunciations are subject to conflicting standards that are most active in this style.
Article
As a probe into the degree of integration of the bilingual lexicon, a series of lexical-decision tasks was carried out in two bilingual speech communities with greatly differing linguistic, cultural, and socio-historical characteristics: Misiones province in northeastern Argentina (Portuguese-Spanish), and three indigenous communities in northern Ecuador (Quichua and the mixed language known as Media Lengua). In both cases the results suggest a tightly integrated bilingual lexicon, but the pattern of responses was qualitatively and quantitatively different for each group, to such an extent as to potentially challenge the assumption of universal validity for lexical decision tasks.
Article
Full-text available
Artificial language learning research has become a popular tool for investigating universal mechanisms in language learning. However, it is often unclear whether the effects found are due to learning or to artefacts of the native language or the artificial language, and whether findings in only one language will generalise to speakers of other languages. The present study offers a new approach to modelling the influence of both the L1 and the target artificial language on language learning. The idea is to control for linguistic factors of the artificial and the native language by incorporating measures of wordlikeness into the statistical analysis as covariates. To demonstrate the approach, we extend Linzen and Gallagher's (2017) study on consonant identity patterns to evaluate whether speakers of German and Mandarin rapidly learn the pattern when influences of the L1 and the artificial language are accounted for by incorporating measures assessed by analogical and discriminative learning models over the L1 and artificial lexicons. Results show that nonwords are more likely to be accepted as grammatical if they are more similar to the trained artificial lexicon and more different from the L1 and, crucially, the identity effect is still present. The proposed approach is helpful for designing cross-linguistic studies.
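A minimal sketch of the covariate approach the abstract describes: wordlikeness scores enter the model as predictors alongside the pattern of interest, so the identity effect is estimated over and above L1 and artificial-language influences. The data file and column names are hypothetical.

```python
# Sketch: logistic regression of acceptance judgments with wordlikeness
# covariates. Assumes a CSV with columns: accepted (0/1),
# identity_pattern (0/1), l1_wordlikeness, artificial_wordlikeness.
import pandas as pd
import statsmodels.formula.api as smf

trials = pd.read_csv("judgments.csv")  # hypothetical file
model = smf.logit(
    "accepted ~ identity_pattern + l1_wordlikeness + artificial_wordlikeness",
    data=trials,
).fit()
# If the identity_pattern coefficient stays credible with the covariates
# in the model, the effect is not reducible to L1 or artificial-lexicon
# similarity.
print(model.summary())
```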
Article
Full-text available
This paper presents and makes publicly available the NILC-Metrix, a computational system comprising 200 metrics proposed in studies on discourse, psycholinguistics, cognitive and computational linguistics, to assess textual complexity in Brazilian Portuguese (BP). These metrics are relevant for descriptive analysis and the creation of computational models and can be used to extract information from various linguistic levels of written and spoken language. The metrics in NILC-Metrix were developed during the last 13 years, starting in 2008 with Coh-Metrix-Port, a tool developed within the scope of the PorSimples project. Coh-Metrix-Port adapted some metrics to BP from the Coh-Metrix tool that computes metrics related to cohesion and coherence of texts in English. After the end of PorSimples in 2010, new metrics were added to the initial 48 metrics of Coh-Metrix-Port. Given the large number of metrics, we present them following an organisation similar to the metrics of Coh-Metrix v3.0 to facilitate comparisons made with metrics in Portuguese and English. In this paper, we illustrate the potential of NILC-Metrix by presenting three applications: (i) a descriptive analysis of the differences between children's film subtitles and texts written for Elementary School I and II (Final Years); (ii) a new predictor of textual complexity for the corpus of original and simplified texts of the PorSimples project; (iii) a complexity prediction model for school grades, using transcripts of children's story narratives told by teenagers. For each application, we evaluate which groups of metrics are more discriminative, showing their contribution for each task.
Article
Full-text available
In the past two decades, variation has received a lot of attention in mainstream generative phonology, and several different models have been developed to account for variable phonological phenomena. However, all existing generative models of phonological variation account for the overall rate at which some process applies in a corpus, and therefore implicitly assume that all words are affected equally by a variable process. In this paper, we show that this is not the case. Many variable phenomena are more likely to apply to frequent than to infrequent words. A model that accounts perfectly for the overall rate of application of some variable process therefore does not necessarily account very well for the actual application of the process to individual words. We illustrate this with two examples, English t/d-deletion and Japanese geminate devoicing. We then augment one existing generative model (noisy Harmonic Grammar) to incorporate the contribution of usage frequency to the application of variable processes. In this model, the influence of frequency is incorporated by scaling the weights of faithfulness constraints up or down for words of different frequencies. This augmented model accounts significantly better for variation than existing generative models.
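The frequency-scaling mechanism this abstract describes can be sketched as follows under a standard noisy Harmonic Grammar evaluation; the linear scaling of faithfulness weights by log frequency is an illustrative assumption, not the authors' exact formulation.

```python
# Sketch: harmony as a negative weighted sum of constraint violations,
# with Gaussian noise added to each weight at evaluation time (noisy HG)
# and faithfulness weights scaled down for high-frequency words, so
# reductive candidates win more often for frequent words.
import random

def harmony(violations, weights, faithfulness, log_freq,
            scale=0.1, noise_sd=1.0):
    """violations, weights: dicts keyed by constraint name;
    faithfulness: set of faithfulness-constraint names;
    log_freq: log usage frequency of the word being evaluated."""
    h = 0.0
    for constraint, v in violations.items():
        w = weights[constraint] + random.gauss(0.0, noise_sd)  # noisy evaluation
        if constraint in faithfulness:
            w = max(0.0, w - scale * log_freq)  # hypothetical frequency scaling
        h -= w * v  # each violation lowers harmony
    return h

# The candidate with the highest harmony on a given evaluation wins, so
# repeated noisy evaluations yield variable outputs at word-specific rates.
```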
Article
Full-text available
We examine the use of film subtitles as an approximation of word frequencies in human interactions. Because subtitle files are widely available on the Internet, they may present a fast and easy way to obtain word frequency measures in language registers other than text writing. We compiled a corpus of 52 million French words, coming from a variety of films. Frequency measures based on this corpus compared well to other spoken and written frequency measures, and explained variance in lexical decision times in addition to what is accounted for by the available French written frequency measures.

The availability of digitally stored texts on the Internet has opened a completely new avenue for linguists and psycholinguists to gain access to large corpora of written language. For instance, Blair, Urland, and Ma (2002) and New, Pallier, Brysbaert, and Ferrand (2004) showed that word frequency estimates obtained with Internet search engines correlate highly with those from well-established sources such as Celex for English (Baayen, Piepenbrock, & Gulikers, 1995) and Lexique for French (New, Pallier, Ferrand, & Brysbaert, 2004). This opens the possibility of obtaining frequency estimates for words in languages without an existing frequency list. Similarly, Grondelaers, Deygers, Van Aken, Van Den Heede, and Speelman (2000) showed how Internet sources can be used to gain access to texts from different language registers. They downloaded materials from newspapers, discussion groups, and chat channels, and showed how the presence of a particular word ("er" in Dutch, a word meaning something like "there") and in many instances …
Book
Referencing new developments in cognitive and functional linguistics, phonetics, and connectionist modeling, this book investigates various ways in which a speaker/hearer's experience with language affects the representation of phonology. Rather than assuming phonological representations in terms of phonemes, Joan Bybee adopts an exemplar model, in which specific tokens of use are stored and categorized phonetically with reference to variables in the context. This model allows an account of phonetically gradual sound change that produces lexical variation, and provides an explanatory account of the fact that many reductive sound changes affect high frequency items first.
Conference Paper
The C-ORAL-BRASIL is a Brazilian Portuguese spontaneous speech corpus, representative of the diatopy of the state of Minas Gerais (primarily the metropolitan area of the capital city, Belo Horizonte). The corpus was compiled following the same architecture and segmentation criteria adopted by the C-ORAL-ROM [1], as well as its alignment software, WinPitch [2]. The corpus comprises 139 informal speech texts, 208,130 words, and 21:08:52 hours of recording (6.1 GB of wav files); the mean number of words per text is 1,500. The recordings were carried out with high-resolution, non-invasive wireless equipment, generally with clip-on monodirectional microphones, plus a mixer whenever there were more than two interactants; on a few occasions omnidirectional microphones were used. The texts are transcribed following the CHAT format [3], implemented for prosodic annotation [4]. The main goal of the corpus architecture is the documentation of diaphasic and diastratic variation in Brazilian Portuguese speech.
Article
This paper focuses on the strategic function of implicit compliments, aiming to evaluate their contribution to positive and negative politeness and their translation in interlingual subtitles (from English into Italian).
Article
Three models of morphological storage and processing are compared: the dual-processing model of Pinker, Marcus and colleagues, the connectionist model of Marchman, Plunkett, Seidenberg and others, and the network model of Bybee and Langacker. In line with predictions made in the latter two frameworks, type frequency of a morphological pattern is shown to be important in determining productivity. In addition, the paper considers the nature of lexical schemas in the network model, which are of two types: source-oriented and product-oriented. The interaction of phonological properties of lexical patterns with frequency and the interaction of type and token frequency are shown to influence degree of productivity. Data are drawn from English, German, Arabic and Hausa.
Article
The paper proposes a methodology for collecting "open-source" corpora, i.e. corpora that are automatically collected from the Internet and distributed in the form of a list of links with open-source software for recreating their full text. The result is a random snapshot of Internet pages which contain stretches of connected text in a given language. The paper discusses a methodology for acquiring such corpora, two ways of documenting them (using a set of metatextual categories and by comparison to frequency lists from existing corpora), and their function as benchmarks for comparing results of linguistic inquiry. Experiments with a variety of languages show that Internet-derived corpora can be successfully used in the absence of large representative corpora, which are rare and expensive to build. Lexicographic studies using corpora can be reliable only if the corpora providing the basis for the study are sufficiently large and diverse. The famous example with collocations of powerful and strong, such as strong tea (Halliday 1966:150), can only be studied computationally on a corpus of at least the size of the British National Corpus (BNC). In 100 million words of the BNC, the expression strong tea occurs 28 times, which makes it a reasonably strong collocation along with strong {candidate, contrast, leadership, reason}, all of which have roughly the same frequency and statistical significance according to the log-likelihood score. However, the chances of detecting these collocations in a smaller corpus are minuscule: strong tea occurs only once in the Brown corpus, and it contains no instances of strong candidate, leadership or reason.
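The log-likelihood collocation score mentioned in this abstract is commonly computed as Dunning's G² over a 2×2 contingency table; below is a minimal sketch, with placeholder counts rather than actual BNC figures.

```python
# Sketch: Dunning's G2 log-likelihood statistic for a bigram such as
# "strong tea", from a 2x2 contingency table of co-occurrence counts.
import math

def g2(k11, k12, k21, k22):
    """k11: bigram count; k12: word1 without word2;
    k21: word2 without word1; k22: neither word."""
    total = k11 + k12 + k21 + k22

    def term(obs, exp):
        return obs * math.log(obs / exp) if obs > 0 else 0.0

    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    return 2 * (term(k11, row1 * col1 / total)
                + term(k12, row1 * col2 / total)
                + term(k21, row2 * col1 / total)
                + term(k22, row2 * col2 / total))

# Illustrative call for a "strong tea"-like bigram in a 100M-word corpus:
print(g2(28, 5000, 3000, 100_000_000))
```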