This work documents the motivation for and development of SUBTLEX-PT-BR, a subtitle-based corpus for Brazilian Portuguese, available at http://crr.ugent.be/subtlex-pt-br/. Although the target language is Brazilian Portuguese, the methodology can be extended to any other language for which subtitles are available. To evaluate the validity of the corpus, a preliminary comparison with a large conversational and written corpus was conducted; it suggested that the subtitle corpus is more similar to conversational language than to written language. Future work on the methodology and on the corpus itself is outlined, and the corpus's diverse uses as a resource for linguistic research are discussed.