Table 1 - uploaded by Yutaka Sasaki
Subcategorization frame examples

Source publication
Conference Paper
Full-text available
The extraction of information from texts requires resources that contain both syntactic and semantic properties of lexical units. As the use of language in specialized domains, such as biology, can be very different from the general domain, there is a need for domain-specific resources to ensure that the information extracted is as accurate as possible...

Citations

... This is why high-quality, rich computational lexicons comprising information about the meaning and combinatorial properties of words in biomedical texts can significantly boost the performance of NLP systems in problems ranging from information retrieval to relation and event extraction and entailment detection. As in the general language domain, lexicographic efforts in biomedicine have primarily focused on nouns (e.g., UMLS Metathesaurus [1]), while the demand for rich, large-coverage verb-specific biomedical resources has not yet been satisfied [2][3][4][5][6]. ...
Article
Full-text available
Background Recent advances in representation learning have enabled large strides in natural language understanding; however, verbal reasoning remains a challenge for state-of-the-art systems. External sources of structured, expert-curated verb-related knowledge have been shown to boost model performance in different Natural Language Processing (NLP) tasks where accurate handling of verb meaning and behaviour is critical. The costliness and time required for manual lexicon construction have been a major obstacle to porting the benefits of such resources to NLP in specialised domains, such as biomedicine. To address this issue, we combine a neural classification method with expert annotation to create BioVerbNet. This new resource comprises 693 verbs assigned to 22 top-level and 117 fine-grained semantic-syntactic verb classes. We make this resource available complete with semantic roles and VerbNet-style syntactic frames. Results We demonstrate the utility of the new resource in boosting model performance in document- and sentence-level classification in biomedicine. We apply an established retrofitting method to harness the verb class membership knowledge from BioVerbNet and transform a pretrained word embedding space by pulling together verbs belonging to the same semantic-syntactic class. The BioVerbNet knowledge-aware embeddings surpass the non-specialised baseline by a significant margin on both tasks. Conclusion This work introduces the first large, annotated semantic-syntactic classification of biomedical verbs, providing a detailed account of the annotation process, the key differences in verb behaviour between the general and biomedical domain, and the design choices made to accurately capture the meaning and properties of verbs used in biomedical texts. The demonstrated benefits of leveraging BioVerbNet in text classification suggest the resource could help systems better tackle challenging NLP tasks in biomedicine.
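The retrofitting step described in the abstract above can be sketched as follows. This is a minimal illustration in the spirit of standard (Faruqui-style) retrofitting, not the paper's actual implementation; the function name, the `alpha`/`beta` weights, and the toy verb classes are assumptions for illustration.

```python
import numpy as np

def retrofit(vectors, classes, alpha=1.0, beta=1.0, iters=10):
    """Pull embeddings of verbs sharing a class towards each other.

    vectors: dict verb -> np.ndarray (pretrained embedding)
    classes: dict class_name -> list of member verbs
    """
    new = {w: v.copy() for w, v in vectors.items()}
    # neighbours of a verb = the other members of its class(es)
    neighbours = {w: [] for w in vectors}
    for members in classes.values():
        for w in members:
            if w in vectors:
                neighbours[w].extend(m for m in members if m != w and m in vectors)
    for _ in range(iters):
        for w, ns in neighbours.items():
            if not ns:
                continue  # verbs outside any class keep their original vector
            # weighted average of the original vector and the class neighbours
            new[w] = (alpha * vectors[w] + beta * sum(new[n] for n in ns)) / (
                alpha + beta * len(ns))
    return new
```

After retrofitting, same-class verbs (say, a hypothetical class grouping "activate" and "induce") end up closer in the space, while out-of-class verbs are untouched, which is what allows the transformed embeddings to outperform the non-specialised baseline.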
... UMLS Metathesaurus, [1]), verb-related resources are still lacking in both depth and coverage [2][3][4][5][6]. One particularly useful verb resource for general domain NLP is VerbNet [7]. ...
... The current version of VerbNet (v3.3) consists of 9344 verbs organised in 329 main classes [29]. Although it has a wide coverage for general domain NLP applications, it is not designed for specialized domains, such as biomedicine, where verbs tend to have a very different meaning and behaviour than in general English [2,3]. Hence, there is a need to develop domain-specific resources to support biomedical NLP. ...
Article
Full-text available
Background VerbNet, an extensive computational verb lexicon for English, has proved useful for supporting a wide range of Natural Language Processing tasks requiring information about the behaviour and meaning of verbs. Biomedical text processing and mining could benefit from a similar resource. We take the first step towards the development of BioVerbNet: A VerbNet specifically aimed at describing verbs in the area of biomedicine. Because VerbNet-style classification is extremely time consuming, we start from a small manual classification of biomedical verbs and apply a state-of-the-art neural representation model, specifically developed for class-based optimization, to expand the classification with new verbs, using all the PubMed abstracts and the full articles in the PubMed Central Open Access subset as data. Results Direct evaluation of the resulting classification against BioSimVerb (verb similarity judgement data in biomedicine) shows promising results when representation learning is performed using verb class-based contexts. Human validation by linguists and biologists reveals that the automatically expanded classification is highly accurate. Including novel, valid member verbs and classes, our method can be used to facilitate cost-effective development of BioVerbNet. Conclusion This work constitutes the first effort on applying a state-of-the-art architecture for neural representation learning to biomedical verb classification. While we discuss future optimization of the method, our promising results suggest that the automatic classification released with this article can be used to readily support application tasks in biomedicine. Electronic supplementary material The online version of this article (10.1186/s13326-018-0193-x) contains supplementary material, which is available to authorized users.
... Semantic Verb Classification Semantic verb classifications are of great interest to NLP, specifically regarding the pervasive problem of data sparseness in the processing of natural language. Such classifications have been used in applications such as word sense disambiguation (Dorr and Jones, 1996;Kohomban and Lee, 2005;McCarthy et al., 2007), parsing (Carroll et al., 1998;Carroll and Fang, 2004), machine translation (Prescher et al., 2000;Koehn and Hoang, 2007;Weller et al., 2014), and information extraction (Surdeanu et al., 2003;Venturi et al., 2009). We target the semantic classification of German complex verbs by applying hard clustering to multi-sense embeddings, rather than using soft clustering. ...
Conference Paper
Full-text available
To date, the majority of computational models still determine the semantic relatedness between words (or larger linguistic units) on the type level. In this paper, we compare and extend multi-sense embeddings, in order to model and utilise word senses on the token level. We focus on the challenging class of complex verbs, and evaluate the model variants on various semantic tasks: semantic classification; predicting compositionality; and detecting non-literal language usage. While there is no overall best model, all models significantly outperform a word2vec single-sense skip-gram baseline, thus demonstrating the need to distinguish between word senses in a distributional semantic model.
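A token-level treatment like the one evaluated above can be sketched with a tiny sense-induction loop: cluster the context embeddings of one verb's occurrences and treat each centroid as a sense vector. This is a generic hard-clustering (k-means) sketch over assumed inputs, not the authors' model:

```python
import numpy as np

def induce_senses(context_vecs, k=2, iters=20, seed=0):
    """Cluster the context embeddings of one verb's token occurrences;
    each centroid then serves as one sense vector at the token level."""
    rng = np.random.default_rng(seed)
    X = np.asarray(context_vecs, dtype=float)
    # initialise centroids from k distinct occurrences
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # hard-assign every occurrence to its nearest sense (hard clustering)
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels
```

Each occurrence of the verb is then represented by the centroid it was assigned to, rather than by a single type-level vector.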
... Semantic classifications are of great interest to computational linguistics, specifically regarding the pervasive problem of data sparseness in the processing of natural language. Such classifications have been used in applications such as word sense disambiguation (Dorr and Jones, 1996;Kohomban and Lee, 2005;McCarthy et al., 2007), parsing (Carroll et al., 1998;Carroll and Fang, 2004), machine translation (Prescher et al., 2000;Koehn and Hoang, 2007;Weller et al., 2014), and information extraction (Surdeanu et al., 2003;Venturi et al., 2009), among many others. ...
Conference Paper
Full-text available
In this paper we explore the role of verb frequencies and the number of clusters in soft-clustering approaches as a tool for automatic semantic classification. Relying on a large-scale setup including 4,871 base verb types and 3,173 complex verb types, and focusing on synonymy as a task-independent goal in semantic classification, we demonstrate that low-frequency German verbs are clustered significantly worse than mid- or high-frequency German verbs, and that German complex verbs are in general more difficult to cluster than German base verbs.
... Semantic classifications are of great interest to computational linguistics, specifically regarding the pervasive problem of data sparseness in the processing of natural language. Such classifications have been used in applications such as word sense disambiguation (Dorr and Jones, 1996;Kohomban and Lee, 2005;McCarthy et al., 2007), parsing (Carroll et al., 1998;Carroll and Fang, 2004), machine translation (Prescher et al., 2000;Koehn and Hoang, 2007;Weller et al., 2014), and information extraction (Surdeanu et al., 2003;Venturi et al., 2009). ...
... (Briscoe et al., 2006)) and can be extracted from the output of others (de Marneffe et al., 2006). Two representative systems for English are the Cambridge system (Preiss et al., 2007) and the BioLexicon system, which was used to acquire a substantial lexicon for biomedicine (Venturi et al., 2009). These systems extract GRs at the verb instance level from the output of a parser: the RASP general-language unlexicalized parser (Briscoe et al., 2006) and the lexicalized Enju parser tuned to the biomedical domain (Miyao and Tsujii, 2005), respectively. ...
... Briscoe and Carroll (1997) for English; Sarkar and Zeman (2000) for Czech; Schulte im Walde (2002a) for German; Messiant (2008) for French. This basic kind of verb knowledge has been shown to be useful in many NLP tasks such as information extraction (Surdeanu et al., 2003;Venturi et al., 2009), parsing (Carroll et al., 1998;Carroll and Fang, 2004) ...
Conference Paper
Full-text available
This paper demonstrates the need and impact of subcategorization information for SMT. We combine (i) features on sourceside syntactic subcategorization and (ii) an external knowledge base with quantitative, dependency-based information about target-side subcategorization frames. A manual evaluation of an English-to-German translation task shows that the subcategorization information has a positive impact on translation quality through better prediction of case.
... Our results show a respectable level of accuracy considering that no adaptations were made to the SCF acquisition system besides using a large biomedical corpus as input. We then compare the accuracy of BioCat to that of the BioLexicon [4, 5, 6, 7], the only previously existing automatically-acquired SCF lexicon for biomedicine, which was extracted from corpus data in the E. coli subdomain using NLP technology adapted to the subdomain of molecular biology, but which has not previously been evaluated. Our results show that the BioLexicon has greater precision while BioCat has better coverage of SCFs. ...
... The purpose of this experiment was to perform a comparative evaluation of BioCat against another SCF resource, the BioLexicon [4, 6, 7]. BioCat was built using unadapted, general language tools applied to a multi-subdomain biomedical corpus, while the BioLexicon was built using tools adapted to a single biomedical subdomain, applied to data also drawn from that subdomain. ...
... The BioLexicon [4, 6, 7] is currently the only biomedical NLP resource containing an automatically constructed SCF lexicon. It is built on data from the E. coli subdomain, and each component used in acquisition of the lexicon (for example, the part-of-speech tagger, named entity recognizer, and parser) was manually adapted to the subdomain of molecular biology. ...
Article
Background: Biomedical natural language processing (NLP) applications that have access to detailed resources about the linguistic characteristics of biomedical language demonstrate improved performance on tasks such as relation extraction and syntactic or semantic parsing. Such applications are important for transforming the growing unstructured information buried in the biomedical literature into structured, actionable information. In this paper, we address the creation of linguistic resources that capture how individual biomedical verbs behave. We specifically consider verb subcategorization, or the tendency of verbs to "select" co-occurrence with particular phrase types, which influences the interpretation of verbs and identification of verbal arguments in context. There are currently a limited number of biomedical resources containing information about subcategorization frames (SCFs), and these are the result of either labor-intensive manual collation, or automatic methods that use tools adapted to a single biomedical subdomain. Either method may result in resources that lack coverage. Moreover, the quality of existing verb SCF resources for biomedicine is unknown, due to a lack of available gold standards for evaluation. Results: This paper presents three new resources related to verb subcategorization frames in biomedicine, and four experiments making use of the new resources. We present the first biomedical SCF gold standards, capturing two different but widely-used definitions of subcategorization, and a new SCF lexicon, BioCat, covering a large number of biomedical sub-domains. We evaluate the SCF acquisition methodologies for BioCat with respect to the gold standards, and compare the results with the accuracy of the only previously existing automatically-acquired SCF lexicon for biomedicine, the BioLexicon. Our results show that the BioLexicon has greater precision while BioCat has better coverage of SCFs. 
Finally, we explore the definition of subcategorization using these resources and its implications for biomedical NLP. All resources are made publicly available. Conclusion: The SCF resources we have evaluated still show considerably lower accuracy than that reported with general English lexicons, demonstrating the need for domain- and subdomain-specific SCF acquisition tools for biomedicine. Our new gold standards reveal major differences when annotators use the different definitions. Moreover, evaluation of BioCat yields major differences in accuracy depending on the gold standard, demonstrating that the definition of subcategorization adopted will have a direct impact on perceived system accuracy for specific tasks.
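The precision/coverage comparison described in this abstract amounts to type-level set overlap between an acquired lexicon and a gold standard. A minimal sketch (the SCF labels and the function name are illustrative assumptions, not the paper's actual frame inventory or tooling):

```python
def evaluate_scf_lexicon(acquired, gold):
    """Type-level precision/recall/F1 of an acquired SCF lexicon.

    Both arguments map verb -> set of SCF labels; precision reflects how many
    acquired frames are correct, recall how many gold frames are covered.
    """
    tp = fp = fn = 0
    for verb, gold_frames in gold.items():
        got = acquired.get(verb, set())
        tp += len(got & gold_frames)   # frames the lexicon got right
        fp += len(got - gold_frames)   # spurious acquired frames
        fn += len(gold_frames - got)   # gold frames the lexicon missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Under this scheme, a high-precision, low-coverage lexicon (the BioLexicon pattern reported above) scores well on the first number and poorly on the second, and vice versa for a high-coverage lexicon like BioCat; changing the gold standard's definition of subcategorization changes both numbers, as the conclusion notes.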
... On the other hand, we find automatic approaches to the induction of verb subcategorisation information at the syntax-semantics interface for a large number of languages, including Briscoe and Carroll (1997) for English; Sarkar and Zeman (2000) for Czech; Schulte im Walde (2002) for German; and Messiant (2008) for French. This basic kind of verb knowledge has been shown to be useful in many NLP tasks such as information extraction (Surdeanu et al., 2003;Venturi et al., 2009), parsing (Carroll et al., 1998;Carroll and Fang, 2004) and word sense disambiguation (Kohomban and Lee, 2005;McCarthy et al., 2007). ...
Conference Paper
Full-text available
This paper describes the SubCat-Extractor as a novel tool to obtain verb subcategorisation data from parsed German web corpora. The SubCat-Extractor is based on a set of detailed rules that go beyond what is directly accessible in the parses. The extracted subcategorisation database is represented in a compact but linguistically detailed and flexible format, comprising various aspects of verb information, complement information and sentence information, within a one-line-per-clause style. We describe the tool, the extraction rules and the obtained resource database, as well as actual and potential uses in computational linguistics.
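Rule-based extraction of this kind can be sketched as mapping one clause's dependency triples onto a flat frame label in a one-line-per-clause style. The relation names and their canonical ordering below are invented for illustration; SubCat-Extractor's actual rule set is far more detailed:

```python
def clause_to_scf(deps):
    """Collapse one clause's dependency triples (relation, head, dependent)
    into a flat SCF label, one label per clause."""
    # keep only complement-bearing relations, in a canonical order;
    # adjuncts (e.g. adverbials) are dropped
    order = {"subj": 0, "obj": 1, "iobj": 2, "pp": 3, "comp": 4}
    slots = sorted({rel for rel, _, _ in deps if rel in order}, key=order.get)
    return "-".join(slots) if slots else "intrans"
```

For a transitive clause with an adverbial modifier, this yields "subj-obj": the adjunct relation is filtered out, and the remaining complements are serialised in canonical order, so identical frames always receive identical labels.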
... Two state-of-the-art data-driven systems for English verbs are those that produced VALEX, Preiss et al. (2007), and the BioLexicon (Venturi et al., 2009). ...
Conference Paper
Full-text available
We present a novel approach for building verb subcategorization lexicons using a simple graphical model. In contrast to previous methods, we show how the model can be trained without parsed input or a predefined subcategorization frame inventory. Our method outperforms the state-of-the-art on a verb clustering task, and is easily trained on arbitrary domains. This quantitative evaluation is complemented by a qualitative discussion of verbs and their frames. We discuss the advantages of graphical models for this task, in particular the ease of integrating semantic information about verbs and arguments in a principled fashion. We conclude with future work to augment the approach.