Article

A 61 Million Word Corpus of Brazilian Portuguese Film Subtitles as a Resource for Linguistic Research

Abstract

This work documents the motivation and development of a subtitle-based corpus for Brazilian Portuguese, SUBTLEX-PT-BR, available at http://crr.ugent.be/subtlex-pt-br/. While the target language was Brazilian Portuguese, the methodology can be extended to any other languages with subtitles. A preliminary corpus comparison with a large conversational and written corpus was conducted to evaluate the validity of the corpus, and suggested that the subtitle corpus is more similar to the conversational than the written language. Future work on the methodology and the corpus itself is outlined. Its diverse use as a resource for linguistic research is discussed.
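To make the validity check described above concrete, here is a minimal sketch of one way a frequency-based corpus comparison can be run, assuming tab-separated word-frequency lists; the file names and format are illustrative, not the paper's actual pipeline.

```python
# Minimal sketch: correlate subtitle word frequencies with conversational
# and written frequencies over the shared vocabulary. File names and the
# "word<TAB>count" format are assumptions.
from scipy.stats import spearmanr

def load_freqs(path):
    """Read a word-frequency list into a dict of word -> count."""
    freqs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, count = line.rstrip("\n").split("\t")
            freqs[word] = int(count)
    return freqs

def rank_correlation(freqs_a, freqs_b):
    """Spearman rank correlation over the shared vocabulary of two lists."""
    shared = sorted(set(freqs_a) & set(freqs_b))
    rho, _ = spearmanr([freqs_a[w] for w in shared],
                       [freqs_b[w] for w in shared])
    return rho

subtitles = load_freqs("subtlex_pt_br.tsv")          # hypothetical paths
spoken = load_freqs("conversational_corpus.tsv")
written = load_freqs("written_corpus.tsv")
# On the paper's account, the subtitle list should track the
# conversational list more closely than the written one.
print(rank_correlation(subtitles, spoken), rank_correlation(subtitles, written))
```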
... are involved in tone sandhi, and the sandhi patterns are both various and complex, affecting either the first or the second tone or both tones of a T1+T2 sequence (Li 1965, Luo 1982, 2012, Rao 1986, 1987, Hsu 1995, Chen 2004, Chen et al. 2008). Previous studies have tried different approaches to explain its sandhi patterns, including generative approaches and classic Optimality Theory (OT; Prince & Smolensky 2004 [1993], Kager 1999, McCarthy 2002, 2008). ...
... Using the SUBTLEX corpus (Tang 2012) and past literature (Araújo 2002; Gonçalves 2004; Sempere 2006), a total of 49 reduplicants were compiled. The properties of the reduplicated verbs in this list were compared to the properties of Brazilian Portuguese verbs overall in the SUBTLEX corpus, and there was a difference in size and shape between the two sets of data (henceforth referred to as the reduplicant corpus and SUBTLEX corpus). ...
... Only nouns and adjectives were considered, while compounds and derived words were left out of the selection. After selecting the first version of 90 real word items, we turned to a homologous lexical database of Brazilian Portuguese, SUBTLEX_PT_BR (Tang, 2012), to check whether the selected items belonged to the same frequency interval (items in SUBTLEX_PT_BR were likewise subdivided into six intervals). Items that did not fit into the same frequency level in both corpora were excluded and we continued to evaluate novel candidates. ...
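The frequency-interval matching described in this excerpt can be sketched as follows; the equal-width log-frequency binning and all names are assumptions for illustration, not the authors' exact procedure.

```python
# Sketch: bin each corpus's words into six log-frequency intervals and
# keep only items falling in the same interval in both corpora.
import math

def freq_intervals(freqs, n_bins=6):
    """Map word -> interval index (0..n_bins-1) via equal-width log-frequency bins."""
    logs = {w: math.log10(c) for w, c in freqs.items()}
    lo, hi = min(logs.values()), max(logs.values())
    width = (hi - lo) / n_bins or 1.0  # guard against a degenerate range
    return {w: min(int((v - lo) / width), n_bins - 1) for w, v in logs.items()}

def same_interval_items(freqs_a, freqs_b):
    """Words assigned to the same frequency interval in both corpora."""
    bins_a, bins_b = freq_intervals(freqs_a), freq_intervals(freqs_b)
    return [w for w in bins_a if w in bins_b and bins_a[w] == bins_b[w]]
```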
... Although both orthographic forms, oxigénio and oxigênio, can be found in SUBTLEX-PT-BR (Tang, 2012), 14 out of 61 L1-BP speakers rejected oxigénio as a real Portuguese word. ...
Article
Vocabulary size has been repeatedly shown to be a good indicator of second language (L2) proficiency. Among the many existing vocabulary tests, the LexTALE test and its equivalents are growing in popularity since they provide a rapid (within 5 minutes) and objective way to assess the L2 proficiency of several languages (English, French, Spanish, Chinese, and Italian) in experimental research. In this study, expanding on the standard procedure of test construction in previous LexTALE tests, we develop a vocabulary size test for L2 Portuguese proficiency: LextPT. The selected lexical items fall in the same frequency interval in European and Brazilian Portuguese, so that LextPT accommodates both varieties. A large-scale validation study with 452 L2 learners of Portuguese shows that LextPT is not only a sound and effective instrument to measure L2 lexical knowledge and indicate the proficiency of both European and Brazilian Portuguese, but is also appropriate for learners with different L1 backgrounds (e.g. Chinese, Germanic, Romance, Slavic). The construction of LextPT, apart from joining the effort to provide a standardised assessment of L2 proficiency across languages, shows that the LexTALE tests can be extended to cover different varieties of a language, and that they are applicable to bilinguals with different linguistic experience.
... The corpus contains approximately 200 million word tokens and over 200,000 word types. The use of a subtitle corpus was motivated by the fact that lexical frequencies derived from subtitle texts have consistently been shown to outperform those from other genres in capturing behavioural responses in psycholinguistic tasks across languages (Brysbaert & New, 2009; Keuleers, Brysbaert, & New, 2010; Tang, 2012; Tang & de Chene, 2014). The expectation is that the higher the transitional phonotactic probability (estimated using token frequency) of a juncture sequence, the higher the acceptability rating (Albright, 2007; Bailey & Hahn, 2001; Goldrick, 2011). ...
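The token-frequency-weighted transitional probability mentioned in this excerpt can be estimated roughly as below; the segmented-lexicon input format is an assumption made for illustration.

```python
# Sketch: P(b | a) estimated by counting each segment bigram once per
# corpus token (weighted by word frequency) rather than once per type.
from collections import defaultdict

def transitional_probs(lexicon):
    """lexicon: iterable of (segment_list, token_frequency) pairs."""
    bigram = defaultdict(float)
    unigram = defaultdict(float)
    for segments, freq in lexicon:
        for a, b in zip(segments, segments[1:]):
            bigram[(a, b)] += freq   # weight each bigram by token frequency
            unigram[a] += freq
    return {(a, b): c / unigram[a] for (a, b), c in bigram.items()}

# e.g. transitional_probs([(["k","a","z","a"], 1200), (["k","a","f","e"], 800)])
```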
Article
Full-text available
This study investigates the Turkish partial reduplication phenomenon, in which the reduplicant is derived by prefixing a C₁VC₂ syllable, where C₁V is identical to the word-initial CV of the base and C₂ is one of the four linking consonants: -p, -m, -s, -r. This study reexamines the factors conditioning the choice of the linking consonant, focusing on the nature of the (dis)similarity (feature specificity) and the proximity (locality) between the consonants in the base and the linking consonant, using an acceptability rating task with over 200 participants and a diverse set of stimuli in terms of length and word shape. Results indicate a gradient identity avoidance effect that extends over all consonants in the base. Crucially, the effect of all consonants is not uniform, with the strength of the effect decreasing further into the base. The study also uncovers an elusive interplay between the distance-based decay effect and the syllable position effect, both of which turn out to play a role in these non-categorical patterns with multiple features. Furthermore, results indicate that identity avoidance operates over both individual features and whole segments. Overall, the study argues that locality-sensitive, feature-specific identity avoidance constraints are part of the grammar.
... The total number of responses was 12,949 (excluding outliers at ±2.5 standard deviations), of which 11,423 were accurate (~88%). In addition to stress and weight, three other key variables were considered: frequency, which was extracted from Tang's (2012) word corpus of Brazilian Portuguese film subtitles; phonotactic probability (bigram), calculated based on the Portuguese Stress Lexicon (Garcia, 2014); and phonological neighborhood density, which counts the number of words that differ from a given target word by a single phoneme. ...
Article
Full-text available
Categorical approaches to lexical stress typically assume that words have either regular or irregular stress, and imply that only the latter needs to be stored in the lexicon, while the former can be derived by rule. In this paper, we compare these two groups of words in a lexical decision task in Portuguese to examine whether the dichotomy in question affects lexical retrieval latencies in native speakers, which could indirectly reveal different processing patterns. Our results show no statistically credible effect of stress regularity on reaction times, even when lexical frequency, neighborhood density, and phonotactic probability are taken into consideration. The lack of an effect is consistent with a probabilistic approach to stress, not with a categorical (traditional) approach where syllables are either light or heavy and stress is either regular or irregular. We show that the posterior distribution of credible effect sizes of regularity is almost entirely (96.28%) within the region of practical equivalence, which provides strong evidence that no effect of regularity exists in the lexical decision data modelled. Frequency and phonotactic probability, in contrast, showed statistically credible effects given the experimental data modelled, which is consistent with the literature.
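The phonological neighborhood density covariate used in this line of work (the number of words differing from the target by a single phoneme) can be computed along these lines; the phoneme-tuple representation of the lexicon is an assumption.

```python
# Sketch: neighborhood density as the count of lexicon words reachable
# from the target by one phoneme substitution, insertion, or deletion.
def one_phoneme_apart(a, b):
    """True if b differs from a by exactly one substitution, insertion, or deletion."""
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    # deleting one phoneme from the longer form must yield the shorter one
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))

def neighborhood_density(target, lexicon):
    """lexicon: iterable of phoneme tuples, e.g. ("k", "a", "z", "a")."""
    return sum(one_phoneme_apart(target, w) for w in lexicon if w != target)
```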
Article
The objective of this paper is to present and make publicly available the NILC-Metrix, a computational system comprising 200 metrics proposed in studies on discourse, psycholinguistics, cognitive and computational linguistics, to assess textual complexity in Brazilian Portuguese (BP). The metrics are relevant for descriptive analysis and the creation of computational models and can be used to extract information from various linguistic levels of written and spoken language. The metrics were developed over the last 13 years, starting at the end of 2007, within the scope of the PorSimples project. Once PorSimples finished, new metrics were added to the initial 48 metrics of the Coh-Metrix-Port tool. Coh-Metrix-Port adapted to BP some metrics from the Coh-Metrix tool, which computes metrics related to the cohesion and coherence of texts in English. Given the large number of metrics, we present them following an organisation similar to the metrics of Coh-Metrix v3.0, to facilitate comparisons between metrics in Portuguese and English in future studies using both tools. In this paper, we illustrate the potential of the NILC-Metrix by presenting three applications: (i) a descriptive analysis of the differences between children's film subtitles and texts written for Elementary School I (1st to 5th grade) and II (Final Years; 6th to 9th grade, an age group corresponding to the transition between childhood and adolescence); (ii) a new predictor of textual complexity for the corpus of original and simplified texts of the PorSimples project; (iii) a complexity prediction model for school grades, using transcripts of children's story narratives told by teenagers. For each application, we evaluate which groups of metrics are more discriminative, showing their contribution to each task.
Article
Most sociolinguistic work on variation focuses on how rates of occurrence or mean measurements differ between speech communities and speakers. However, speakers and communities also differ in variability – that is, in dispersion around the mean. The current study investigates the effects of speech style and multilingualism on variation and variability, by measuring the degree of intervocalic /bdɡ/ spirantization in spontaneous and careful speech. Data come from two varieties of Uruguayan Spanish, one monolingual (Montevideo) and one in contact with Brazilian Portuguese (Rivera). The results from a variation analysis confirm expected linguistic and social effects on gradient spirantization. An analysis of variability shows that, at the group level, careful speech is more variable than spontaneous speech, and the data from Rivera is more variable than that from Montevideo. Variability at the individual level differs slightly, suggesting that the group-level variability arises from between-speaker variability and within-speaker variability in different contexts. I propose that multilingualism in Rivera may heighten variability because contact with Portuguese provides a wider range of available pronunciations, and that careful speech may increase variability because the available pronunciations are subject to conflicting standards that are most active in this style.
Article
As a probe into the degree of integration of the bilingual lexicon, a series of lexical-decision tasks was carried out in two bilingual speech communities with greatly differing linguistic, cultural, and socio-historical characteristics: Misiones province in northeastern Argentina (Portuguese-Spanish), and three indigenous communities in northern Ecuador (Quichua and the mixed language known as Media Lengua). In both cases the results suggest a tightly integrated bilingual lexicon, but the pattern of responses was qualitatively and quantitatively different for each group, to such an extent as to potentially challenge the assumption of universal validity for lexical decision tasks.
Article
Full-text available
Artificial language learning research has become a popular tool for investigating universal mechanisms in language learning. However, it is often unclear whether the effects found are due to learning or to artefacts of the native language or the artificial language, and whether findings in only one language will generalise to speakers of other languages. The present study offers a new approach to modelling the influence of both the L1 and the target artificial language on language learning. The idea is to control for linguistic factors of the artificial and the native language by incorporating measures of wordlikeness into the statistical analysis as covariates. To demonstrate the approach, we extend Linzen and Gallagher's (2017) study on consonant identity patterns to evaluate whether speakers of German and Mandarin rapidly learn the pattern when influences of the L1 and the artificial language are accounted for by incorporating measures assessed by analogical and discriminative learning models over the L1 and artificial lexicons. Results show that nonwords are more likely to be accepted as grammatical if they are more similar to the trained artificial lexicon and more different from the L1 and, crucially, the identity effect is still present. The proposed approach is helpful for designing cross-linguistic studies.
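A minimal sketch of the covariate approach the abstract describes: wordlikeness scores enter the model as predictors alongside the pattern of interest, so the identity effect is estimated over and above L1 and artificial-language influences. The data file and column names are hypothetical.

```python
# Sketch: logistic regression of acceptance judgments with wordlikeness
# covariates. Assumes a CSV with columns: accepted (0/1),
# identity_pattern (0/1), l1_wordlikeness, artificial_wordlikeness.
import pandas as pd
import statsmodels.formula.api as smf

trials = pd.read_csv("judgments.csv")  # hypothetical file
model = smf.logit(
    "accepted ~ identity_pattern + l1_wordlikeness + artificial_wordlikeness",
    data=trials,
).fit()
# If the identity_pattern coefficient stays credible with the covariates
# in the model, the effect is not reducible to L1 or artificial-lexicon
# similarity.
print(model.summary())
```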
Article
Full-text available
This paper presents and makes publicly available the NILC-Metrix, a computational system comprising 200 metrics proposed in studies on discourse, psycholinguistics, cognitive and computational linguistics, to assess textual complexity in Brazilian Portuguese (BP). These metrics are relevant for descriptive analysis and the creation of computational models and can be used to extract information from various linguistic levels of written and spoken language. The metrics in NILC-Metrix were developed during the last 13 years, starting in 2008 with Coh-Metrix-Port, a tool developed within the scope of the PorSimples project. Coh-Metrix-Port adapted some metrics to BP from the Coh-Metrix tool that computes metrics related to cohesion and coherence of texts in English. After the end of PorSimples in 2010, new metrics were added to the initial 48 metrics of Coh-Metrix-Port. Given the large number of metrics, we present them following an organisation similar to the metrics of Coh-Metrix v3.0 to facilitate comparisons made with metrics in Portuguese and English. In this paper, we illustrate the potential of NILC-Metrix by presenting three applications: (i) a descriptive analysis of the differences between children's film subtitles and texts written for Elementary School I and II (Final Years); (ii) a new predictor of textual complexity for the corpus of original and simplified texts of the PorSimples project; (iii) a complexity prediction model for school grades, using transcripts of children's story narratives told by teenagers. For each application, we evaluate which groups of metrics are more discriminative, showing their contribution for each task.
Article
Full-text available
In the past two decades, variation has received a lot of attention in mainstream generative phonology, and several different models have been developed to account for variable phonological phenomena. However, all existing generative models of phonological variation account for the overall rate at which some process applies in a corpus, and therefore implicitly assume that all words are affected equally by a variable process. In this paper, we show that this is not the case. Many variable phenomena are more likely to apply to frequent than to infrequent words. A model that accounts perfectly for the overall rate of application of some variable process therefore does not necessarily account very well for the actual application of the process to individual words. We illustrate this with two examples, English t/d-deletion and Japanese geminate devoicing. We then augment one existing generative model (noisy Harmonic Grammar) to incorporate the contribution of usage frequency to the application of variable processes. In this model, the influence of frequency is incorporated by scaling the weights of faithfulness constraints up or down for words of different frequencies. This augmented model accounts significantly better for variation than existing generative models.
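The frequency-scaling mechanism this abstract describes can be sketched as follows under a standard noisy Harmonic Grammar evaluation; the linear scaling of faithfulness weights by log frequency is an illustrative assumption, not the authors' exact formulation.

```python
# Sketch: harmony as a negative weighted sum of constraint violations,
# with Gaussian noise added to each weight at evaluation time (noisy HG)
# and faithfulness weights scaled down for high-frequency words, so
# reductive candidates win more often for frequent words.
import random

def harmony(violations, weights, faithfulness, log_freq,
            scale=0.1, noise_sd=1.0):
    """violations, weights: dicts keyed by constraint name;
    faithfulness: set of faithfulness-constraint names;
    log_freq: log usage frequency of the word being evaluated."""
    h = 0.0
    for constraint, v in violations.items():
        w = weights[constraint] + random.gauss(0.0, noise_sd)  # noisy evaluation
        if constraint in faithfulness:
            w = max(0.0, w - scale * log_freq)  # hypothetical frequency scaling
        h -= w * v  # each violation lowers harmony
    return h

# The candidate with the highest harmony on a given evaluation wins, so
# repeated noisy evaluations yield variable outputs at word-specific rates.
```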
Article
Full-text available
We examine the use of film subtitles as an approximation of word frequencies in human interactions. Because subtitle files are widely available on the Internet, they may present a fast and easy way to obtain word frequency measures in language registers other than text writing. We compiled a corpus of 52 million French words, coming from a variety of films. Frequency measures based on this corpus compared well to other spoken and written frequency measures, and explained variance in lexical decision times in addition to what is accounted for by the available French written frequency measures.

The availability of digitally stored texts on the Internet has opened a completely new avenue for linguists and psycholinguists to gain access to large corpora of written language. For instance, Blair, Urland, and Ma (2002) and New, Pallier, Brysbaert, and Ferrand (2004) showed that word frequency estimates obtained with Internet search engines correlate highly with those from well-established sources such as Celex for English (Baayen, Piepenbrock, & Gulikers, 1995) and Lexique for French (New, Pallier, Ferrand, & Brysbaert, 2004). This opens the possibility of obtaining frequency estimates for words in languages without an existing frequency list. Similarly, Grondelaers, Deygers, Van Aken, Van Den Heede, and Speelman (2000) showed how Internet sources can be used to gain access to texts from different language registers. They downloaded materials from newspapers, discussion groups, and chat channels, and showed how the presence of a particular word ("er" in Dutch, a word meaning something like "there") and in many instances …
Book
Referencing new developments in cognitive and functional linguistics, phonetics, and connectionist modeling, this book investigates various ways in which a speaker/hearer's experience with language affects the representation of phonology. Rather than assuming phonological representations in terms of phonemes, Joan Bybee adopts an exemplar model, in which specific tokens of use are stored and categorized phonetically with reference to variables in the context. This model allows an account of phonetically gradual sound change that produces lexical variation, and provides an explanatory account of the fact that many reductive sound changes affect high frequency items first.
Conference Paper
The C-ORAL-BRASIL is a Brazilian Portuguese spontaneous speech corpus, representative of the diatopy of the state of Minas Gerais (primarily the metropolitan area of the capital city, Belo Horizonte). The corpus was compiled following the same architecture and segmentation criteria adopted by the C-ORAL-ROM [1], as well as its alignment software, WinPitch [2]. The corpus comprises 139 informal speech texts, 208,130 words, and 21:08:52 hours of recording (6.1 GB of wav files); the mean number of words per text is 1,500. The recordings were carried out with high-resolution, non-invasive wireless equipment, generally with clip-on monodirectional microphones, plus a mixer whenever there were more than two interactants; on a few occasions omnidirectional microphones were used. The texts are transcribed following the CHAT format [3], implemented for prosodic annotation [4]. The main goal of the corpus architecture is the documentation of diaphasic and diastratic variation in Brazilian Portuguese speech.
Article
This paper focuses on the strategic function of implicit compliments, aiming to evaluate their contribution to positive and negative politeness and their translation in interlingual subtitles (from English into Italian).
Article
Three models of morphological storage and processing are compared: the dual-processing model of Pinker, Marcus and colleagues, the connectionist model of Marchman, Plunkett, Seidenberg and others, and the network model of Bybee and Langacker. In line with predictions made in the latter two frameworks, type frequency of a morphological pattern is shown to be important in determining productivity. In addition, the paper considers the nature of lexical schemas in the network model, which are of two types: source-oriented and product-oriented. The interaction of phonological properties of lexical patterns with frequency and the interaction of type and token frequency are shown to influence degree of productivity. Data are drawn from English, German, Arabic and Hausa.
Article
The paper proposes a methodology for collecting "open-source" corpora, i.e. corpora that are automatically collected from the Internet and distributed in the form of a list of links with open-source software for recreating their full text. The result is a random snapshot of Internet pages which contain stretches of connected text in a given language. The paper discusses a methodology for acquiring such corpora, two ways of documenting them (using a set of metatextual categories and by comparison to frequency lists from existing corpora), and their function as benchmarks for comparing results of linguistic inquiry. Experiments with a variety of languages show that Internet-derived corpora can be successfully used in the absence of large representative corpora, which are rare and expensive to build. Lexicographic studies using corpora can be reliable only if the corpora providing the basis for the study are sufficiently large and diverse. The famous example with collocations of powerful and strong, such as strong tea (Halliday 1966:150), can only be studied computationally on a corpus of at least the size of the British National Corpus (BNC). In 100 million words of the BNC, the expression strong tea occurs 28 times, which makes it a reasonably strong collocation along with strong {candidate, contrast, leadership, reason}, all of which have roughly the same frequency and statistical significance according to the log-likelihood score. However, the chances of detecting these collocations in a smaller corpus are minuscule: strong tea occurs only once in the Brown corpus, and it contains no instances of strong candidate, leadership or reason.
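The log-likelihood collocation score mentioned in this abstract is commonly computed as Dunning's G² over a 2×2 contingency table; below is a minimal sketch, with placeholder counts rather than actual BNC figures.

```python
# Sketch: Dunning's G2 log-likelihood statistic for a bigram such as
# "strong tea", from a 2x2 contingency table of co-occurrence counts.
import math

def g2(k11, k12, k21, k22):
    """k11: bigram count; k12: word1 without word2;
    k21: word2 without word1; k22: neither word."""
    total = k11 + k12 + k21 + k22

    def term(obs, exp):
        return obs * math.log(obs / exp) if obs > 0 else 0.0

    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    return 2 * (term(k11, row1 * col1 / total)
                + term(k12, row1 * col2 / total)
                + term(k21, row2 * col1 / total)
                + term(k22, row2 * col2 / total))

# Illustrative call for a "strong tea"-like bigram in a 100M-word corpus:
print(g2(28, 5000, 3000, 100_000_000))
```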