Article

Computer Intensive Methods for Testing Hypotheses: An Introduction

Author: Eric W. Noreen

... Like Pareti (2015), we exclude single-token content spans from the evaluation. To test for statistical significance of differences, we use the approximate randomization test (Noreen, 1989) at a significance level of α = 0.05. ...
... Computer-intensive methods use the observed data to generate a reference distribution, which is then used for confidence interval estimation and significance testing (Manly 1997, Mooney & Duval 1993, Noreen 1989). Programs to compute confidence limits of the mediated effect for bootstrap methods are available at www.annualreviews.org ...
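The snippet above states the common core of these methods: a reference distribution is generated from the observed data itself and then used for confidence limits and significance testing. A minimal sketch of that idea in Python (the statistic, the toy data, and all function names are illustrative assumptions, not code from any cited work):

```python
import random

def bootstrap_reference(data, statistic, n_boot=5000, seed=0):
    """Build a bootstrap reference distribution for an arbitrary statistic
    by resampling the observed data with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return sorted(
        statistic([rng.choice(data) for _ in range(n)])
        for _ in range(n_boot)
    )

def percentile_limits(reference, alpha=0.05):
    """Read percentile confidence limits directly off the sorted reference
    distribution (alpha = 0.05 gives 95% limits)."""
    lower = reference[int(alpha / 2 * len(reference))]
    upper = reference[int((1 - alpha / 2) * len(reference)) - 1]
    return lower, upper

# Illustrative use: 95% confidence limits for a sample mean.
sample = [2.1, 3.4, 1.9, 2.8, 3.1, 2.5, 2.2, 3.0]
reference = bootstrap_reference(sample, lambda xs: sum(xs) / len(xs))
print(percentile_limits(reference))
```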
Chapter
Mediation analysis is a statistical method used to quantify the causal sequence by which an antecedent variable causes a mediating variable that causes a dependent variable. Although mediation analysis is useful for observational studies, it is perhaps most compelling for answering questions of cause and effect in randomized treatment and prevention programs. Information about mediating mechanisms improves programs by providing information about the critical ingredients of successful programs. Sewall Wright’s work in the 1920s first applied the notion of mediated or indirect effects in path models for the inheritance of skin color in guinea pigs. Methodological developments and applications of mediation analysis have dramatically increased since ideas about mediation in the social sciences were first formalized by Herbert Hyman and Paul Lazarsfeld in 1955. Innovations in mediation analyses have been rapid in the last forty years, and recently progress in understanding the causal basis of mediation analysis has been a major breakthrough. Link to website: https://www.oxfordbibliographies.com/view/document/obo-9780199828340/obo-9780199828340-0245.xml
... The learning rate η was selected in the same way, whereas the possible values were 1e−4, 1e−5, 1e−6 or, alternatively, Adadelta (Zeiler, 2012), which sets the learning rate on a per-feature basis. The results on both validation and test set are reported in Table 5. Statistical significance of the out-of-domain system compared to all other systems is measured using Approximate Randomization testing (Noreen, 1989). ...
Article
The goal of counterfactual learning for statistical machine translation (SMT) is to optimize a target SMT system from logged data that consist of user feedback to translations that were predicted by another, historic SMT system. A challenge arises by the fact that risk-averse commercial SMT systems deterministically log the most probable translation. The lack of sufficient exploration of the SMT output space seemingly contradicts the theoretical requirements for counterfactual learning. We show that counterfactual learning from deterministic bandit logs is possible nevertheless by smoothing out deterministic components in learning. This can be achieved by additive and multiplicative control variates that avoid degenerate behavior in empirical risk minimization. Our simulation experiments show improvements of up to 2 BLEU points by counterfactual learning from deterministic bandit feedback.
... All models are evaluated with Neural Monkey's mteval. For statistical significance tests we used Approximate Randomization testing (Noreen, 1989). ...
... Multi-Task with Entailment Generation (M-to-1) Here, the video captioning and entailment generation tasks share their language decoder LSTM-RNN weights and word embeddings in a many-to-one multi-task setting. We observe that a mixing ratio of 100:50 alternating mini-batches (between the captioning and entailment tasks) works well here. Statistical significance is p < 0.01 for CIDEr-D and ROUGE-L, p < 0.02 for BLEU-4, and p < 0.03 for METEOR, based on the bootstrap test (Noreen, 1989; Efron and Tibshirani, 1994) with 100K samples. ...
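The snippet above applies a bootstrap test with 100K samples to compare systems on several metrics. A hedged sketch of one common paired-bootstrap recipe for such comparisons (per-item scores and all names are assumptions; corpus-level metrics such as BLEU would need the metric recomputed on each resample rather than summed):

```python
import random

def paired_bootstrap_pvalue(scores_a, scores_b, n_samples=100_000, seed=0):
    """Sketch of a paired bootstrap comparison of two systems scored on the
    same test items (in the spirit of the bootstrap test cited above).

    Resample test-item indices with replacement and count how often the
    resampled score difference fails to favour system A; that fraction
    estimates a one-sided p-value for "A outperforms B".
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    not_better = 0
    for _ in range(n_samples):
        resampled_diff = sum(diffs[rng.randrange(n)] for _ in range(n))
        if resampled_diff <= 0:
            not_better += 1
    return (not_better + 1) / (n_samples + 1)
```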
... When the alignment quality is low (e.g., as for Japanese and Korean) and hence the projection-labeled NER data are quite noisy, the proposed data selection scheme is very effective in selecting good-quality projection-labeled data and the improvement is big: +12.2 F1 score for Japanese and +13.7 F1 score for Korean. Using a stratified shuffling test (Noreen, 1989), for a significance level of 0.05, data-selection is statistically significantly better than no-selection for Japanese, Korean and Portuguese. ...
Article
Full-text available
The state-of-the-art named entity recognition (NER) systems are supervised machine learning models that require large amounts of manually annotated data to achieve high accuracy. However, annotating NER data by human is expensive and time-consuming, and can be quite difficult for a new language. In this paper, we present two weakly supervised approaches for cross-lingual NER with no human annotation in a target language. The first approach is to create automatically labeled NER data for a target language via annotation projection on comparable corpora, where we develop a heuristic scheme that effectively selects good-quality projection-labeled data from noisy data. The second approach is to project distributed representations of words (word embeddings) from a target language to a source language, so that the source-language NER system can be applied to the target language without re-training. We also design two co-decoding schemes that effectively combine the outputs of the two projection-based approaches. We evaluate the performance of the proposed approaches on both in-house and open NER data for several target languages. The results show that the combined systems outperform three other weakly supervised approaches on the CoNLL data.
... Results are reported in terms of case-insensitive BLEU-4 (Papineni et al., 2002). Approximate randomization (Noreen, 1989; Riezler and Maxwell, 2005) is used to detect statistically significant differences. ...
Article
Full-text available
Neural machine translation is a recently proposed approach which has shown competitive results to traditional MT approaches. Standard neural MT is an end-to-end neural network where the source sentence is encoded by a recurrent neural network (RNN) called encoder and the target words are predicted using another RNN known as decoder. Recently, various models have been proposed which replace the RNN encoder with a convolutional neural network (CNN). In this paper, we propose to augment the standard RNN encoder in NMT with additional convolutional layers in order to capture wider context in the encoder output. Experiments on English to German translation demonstrate that our approach can achieve significant improvements over a standard RNN-based baseline.
... Consequently, special procedures known as quadratic assignment procedure (QAP) and multiple regression quadratic assignment procedure (MRQAP) (Baker and Hubert 1981, Krackhardt 1988) were used to run the correlations and multiple regressions, respectively. QAP and MRQAP are identical to their nonnetwork counterparts with respect to parameter estimates, but use a randomization/permutation technique (Edgington 1969, Noreen 1989) to construct significance tests. Significance levels for correlations and regressions are based on distributions generated from 10,000 random permutations. ...
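QAP and MRQAP obtain their significance levels by recomputing the statistic under many random permutations. As a simplified illustration of that randomization idea, here is a hedged sketch of a permutation test for an ordinary correlation coefficient (real QAP permutes the rows and columns of dyadic matrices jointly; the vector version below only shows the mechanics):

```python
import random

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def permutation_test_correlation(x, y, n_perm=10_000, seed=0):
    """Two-sided permutation test: hold x fixed, permute y, and compare
    |r| against the permutation distribution (10,000 permutations, as in
    the snippet above)."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(pearson_r(x, y_perm)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```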
Article
Full-text available
Research in organizational learning has demonstrated processes and occasionally performance implications of acquisition of declarative (know what) and procedural (know how) knowledge. However, there has been considerably less attention paid to characteristics of relationships that affect the decision to seek information from specific others when faced with a new problem or opportunity (know who). Based on a review of the social network, information processing and organizational learning literatures, along with the results of a previous qualitative study, we propose a formal model of information seeking in which the probability of seeking information from a specific other is modeled as a function of: 1) knowing what that person knows; 2) valuing what that person knows; 3) being able to gain timely access to that person's thinking and 4) perceiving that seeking information from that person would not be too costly. We also hypothesize that these relational variables mediate the relationship between physical proximity and information seeking. The model is tested using two separate research sites (to provide replication). The results indicate strong support for the model (with the exception of the cost variable), and partial support for the mediation hypothesis. Implications are drawn for the study of transactive memory and organizational learning, as well as for management practice.
... This shows that all dictionaries have comparable AUC scores, and that each dictionary outperforms the unigram baseline. To obtain additional evidence, we computed the statistical significance of performance differences between the models based on the dictionaries and the unigram baseline model using approximate randomization testing (ART) (Noreen, 1989). An ART test between dictionary models reveals that none of the models had performance differences that were statistically significant. ...
Article
Full-text available
We present a dictionary-based approach to racism detection in Dutch social media comments, which were retrieved from two public Belgian social media sites likely to attract racist reactions. These comments were labeled as racist or non-racist by multiple annotators. For our approach, three discourse dictionaries were created: first, we created a dictionary by retrieving possibly racist and more neutral terms from the training data, and then augmenting these with more general words to remove some bias. A second dictionary was created through automatic expansion using a word2vec model trained on a large corpus of general Dutch text. Finally, a third dictionary was created by manually filtering out incorrect expansions. We trained multiple Support Vector Machines, using the distribution of words over the different categories in the dictionaries as features. The best-performing model used the manually cleaned dictionary and obtained an F-score of 0.46 for the racist class on a test set consisting of unseen Dutch comments, retrieved from the same sites used for the training set. The automated expansion of the dictionary only slightly boosted the model's performance, and this increase in performance was not statistically significant. The fact that the coverage of the expanded dictionaries did increase indicates that the words that were automatically added did occur in the corpus, but were not able to meaningfully impact performance. The dictionaries, code, and the procedure for requesting the corpus are available at: https://github.com/clips/hades
... The organizers of PAN provided us the output of the participating systems. We used the approximate randomization testing [60] implemented by Vincent Van Asch (http://www.clips.uantwerpen.be/scripts/art). We did a pairwise comparison of the accuracies of our results against the best results of PAN for the corresponding datasets. ...
Article
Full-text available
We apply the integrated syntactic graph feature extraction methodology to the task of automatic authorship detection. This graph-based representation allows integrating different levels of language description into a single structure. We extract textual patterns based on features obtained from shortest path walks over integrated syntactic graphs and apply them to determine the authors of documents. On average, our method outperforms the state of the art approaches and gives consistently high results across different corpora, unlike existing methods. Our results show that our textual patterns are useful for the task of authorship attribution.
... For SemEval-2010 Task 8, we also omitted the entity detection and label embeddings since only target nominals are annotated and the task defines no entity types. Our statistical significance results are based on the Approximate Randomization (AR) test (Noreen, 1989). ...
Conference Paper
Full-text available
We present a novel end-to-end neural model to extract entities and relations between them. Our recurrent neural network based model stacks bidirectional sequential LSTM-RNNs and bidirectional tree-structured LSTM-RNNs to capture both word sequence and dependency tree substructure information. This allows our model to jointly represent both entities and relations with shared parameters. We further encourage detection of entities during training and use of entity information in relation extraction via curriculum learning and scheduled sampling. Our model improves over the state-of-the-art feature-based model on end-to-end relation extraction, achieving 3.5% and 4.8% relative error reductions in F-score on ACE2004 and ACE2005, respectively. We also show improvements over the state-of-the-art convolutional neural network based model on nominal relation classification (SemEval-2010 Task 8), with 2.5% relative error reduction in F-score.
... Our method also outperforms DSPROTO+, which used a small amount of the labeled data, while our method is fully unsupervised. We calculated confidence intervals (P < 0.05) using bootstrap resampling (Noreen, 1989). For example, for the results using the BNC-Wikipedia data, the intervals on MC'07 and VJ'05 are (0.455, 0.574) and (0.475, 0.579), respectively. ...
Conference Paper
We present a novel method for jointly learning compositional and non-compositional phrase embeddings by adaptively weighting both types of embeddings using a compositionality scoring function. The scoring function is used to quantify the level of compositionality of each phrase, and the parameters of the function are jointly optimized with the objective for learning phrase embeddings. In experiments, we apply the adaptive joint learning method to the task of learning embeddings of transitive verb phrases, and show that the compositionality scores have strong correlation with human ratings for verb-object compositionality, substantially outperforming the previous state of the art. Moreover, our embeddings improve upon the previous best model on a transitive verb disambiguation task. We also show that a simple ensemble technique further improves the results for both tasks.
... Considering the average score of MUC, B3, and CEAFe, Martschat and Peng perform equally. However, according to LEA, Martschat performs significantly better based on an approximate randomization test (Noreen, 1989). CEAFe also agrees with LEA for this ranking. ...
Conference Paper
Interpretability and discriminative power are the two most basic requirements for an evaluation metric. In this paper, we report the mention identification effect in the B3, CEAF, and BLANC coreference evaluation metrics that makes it impossible to interpret their results properly. The only metric which is insensitive to this flaw is MUC, which, however, is known to be the least discriminative metric. It is a known fact that none of the current metrics are reliable. The common practice for ranking coreference resolvers is to use the average of three different metrics. However, one cannot expect to obtain a reliable score by averaging three unreliable metrics. We propose LEA, a Link-based Entity-Aware evaluation metric that is designed to overcome the shortcomings of the current evaluation metrics. LEA is available as branch LEA-scorer in the reference implementation of the official CoNLL scorer.
... For bandit-type algorithms, final results are averaged over 3 runs with different random seeds. For statistical significance testing of results against baselines we use Approximate Randomization testing (Noreen, 1989). Multiclass classification. ...
... Case-insensitive 4-gram BLEU (Papineni et al., 2002) is used as evaluation metric. Approximate randomization (Noreen, 1989; Riezler and Maxwell, 2005) is used to detect statistically significant differences. ...
... We evaluate the systems on the CoNLL 2012 English test set using the MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998), and CEAFe (Luo, 2005) measures as provided by the CoNLL coreference scorer version 8.01 (Pradhan et al., 2014). According to the approximate randomization test (Noreen, 1989), all of the improvements made by our singleton detection module are statistically significant (p < 0.05). Baseline shows the result of the Stanford system without using singleton detection. ...
Conference Paper
There is a significant gap between the performance of a coreference resolution system on gold mentions and on system mentions. This gap is due to the large and unbalanced search space in coreference resolution when using system mentions. In this paper we show that search space pruning is a simple but efficient way of improving coreference resolvers. By incorporating our pruning method in one of the state-of-the-art coreference resolution systems, we achieve the best reported overall score on the CoNLL 2012 English test set. A version of our pruning method is available with the Cort coreference resolution source code.
... Table 2 shows experimental results for different settings of semantic parsing on NLMAPS. Statistical significance of system differences in terms of F1 was assessed by an Approximate Randomization test (Noreen, 1989). For the word-alignment step, we found that the choice of the strategy for combining word alignments from both translation directions is crucial to the semantic parser's performance. ...
... Sample size is presented below means, standard deviations, and Pearson's correlations in parentheses. Therefore, the use of permutation tests is required (Borgatti, Everett, & Johnson, 2013; Noreen, 1989). With permutation tests, the standard t-test is computed to compare the means of the two groups but the significance level is generated with permutations. ...
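As the snippet explains, the permutation version keeps the standard t-statistic but generates its significance level from permutations of the group labels. A minimal sketch under that description (toy interface; not the procedure of the cited study):

```python
import random

def t_statistic(group1, group2):
    """Standard two-sample t-statistic (Welch form)."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    return (m1 - m2) / (v1 / n1 + v2 / n2) ** 0.5

def permutation_t_test(group1, group2, n_perm=10_000, seed=0):
    """Compute the t-statistic as usual, but obtain its significance level
    by reshuffling the pooled observations between the two groups."""
    rng = random.Random(seed)
    observed = abs(t_statistic(group1, group2))
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(t_statistic(pooled[:n1], pooled[n1:])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```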
Article
Which factors contribute to effective meetings? The interaction among participants plays a key role. Interaction is a relational, interdependent process that constitutes social structure. Applying a network perspective to meeting interactions allows us to take account of the social structure. The aim of this study was to use social network analysis to distinguish functional and dysfunctional interaction structures and gain insight into the facilitation of meetings by analyzing antecedents and consequences of functional interaction structures. Data were based on a field study in which 51 regular meetings were videotaped and coded with act4teams. Analyses revealed that compared with dysfunctional networks, functional interaction is less centralized and has a positive effect on team performance. Social similarity has a crucial effect on functional interaction because participants significantly interact with others who are similar in personal initiative and self-efficacy. Our results provide important information about how to assist the interaction process and promote team success.
... The second baseline is n-grams, the commonly used baseline in prior work. We compute statistical significance using the Approximate Randomization test (Noreen, 1989; Yeh, 2000), a suitable significance metric for F-score. ...
Conference Paper
Full-text available
Determining when conversational participants agree or disagree is instrumental for broader conversational analysis; it is necessary, for example, in deciding when a group has reached consensus. In this paper, we describe three main contributions. We show how different aspects of conversational structure can be used to detect agreement and disagreement in discussion forums. In particular, we exploit information about meta-thread structure and accommodation between participants. Second, we demonstrate the impact of the features using 3-way classification, including sentences expressing disagreement, agreement or neither. Finally, we show how to use a naturally occurring data set with labels derived from the sides that participants choose in debates on createdebate.com. The resulting new agreement corpus, Agreement by Create Debaters (ABCD) is 25 times larger than any prior corpus. We demonstrate that using this data enables us to outperform the same system trained on prior existing in-domain smaller annotated datasets.
... Translation quality of all experiments is measured with case-insensitive BLEU (Papineni et al., 2002) using the closest-reference brevity penalty. We use approximate randomization (Noreen, 1989) for significance testing (Riezler and Maxwell, 2005). Statistically significant differences are marked at the p ≤ 0.05 and p ≤ 0.01 levels. ...
Conference Paper
Research in domain adaptation for statistical machine translation (SMT) has resulted in various approaches that adapt system components to specific translation tasks. The concept of a domain, however, is not precisely defined, and most approaches rely on provenance information or manual subcorpus labels, while genre differences have not been addressed explicitly. Motivated by the large translation quality gap that is commonly observed between different genres in a test corpus, we explore the use of document-level genre-revealing text features for the task of translation model adaptation. Results show that automatic indicators of genre can replace manual subcorpus labels, yielding significant improvements across two test sets of up to 0.9 BLEU. In addition, we find that our genre-adapted translation models encourage document-level translation consistency.
... We use case-insensitive BLEU (Papineni et al., 2002) as evaluation metric. Approximate randomization (Noreen, 1989; Riezler and Maxwell, 2005) is used to detect statistically significant differences. ...
... For testing, we used MT08 and MT09 for Arabic, and MT06 and MT08 for Chinese. We use approximate randomization (Noreen, 1989; Riezler and Maxwell, 2005) to test for statistically significant differences. In the next two subsections we discuss the general results for Arabic and Chinese, where we use case-insensitive BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) as evaluation metrics. ...
... The wmt13 set (Bojar et al., 2013) was used for testing. We use approximate randomization (Noreen, 1989) to test for statistically significant differences between runs (Riezler and Maxwell, 2005). Translation quality is measured with case-insensitive BLEU [%] using one reference translation. ...
... L is negative if the classifier surpasses the baseline. As a statistical significance test (indicated by † in the text), we use approximate randomization (Noreen, 1989) with 10,000 iterations at p < 0.05. ...
... We use two complementary evaluation methods: crossvalidation within the training data, and learning curves against the test set. We calculate significance using the approximate randomization test (Noreen, 1989) with 10k iterations. ...
... The systems are applied to translated queries, but evaluated in terms of standard parsing metrics. Statistical significance is measured using an Approximate Randomization test (Noreen, 1989; Riezler and Maxwell, 2005). The baseline system is CDEC as described above. ...
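Several snippets in this list use approximate randomization (stratified shuffling) with around 10,000 iterations to compare two systems scored on the same test items. A hedged sketch of the usual recipe (the aggregate statistic is a plain mean here purely for illustration; MT papers typically plug in BLEU recomputed over the shuffled outputs):

```python
import random

def approximate_randomization(scores_a, scores_b, aggregate=None,
                              n_shuffles=10_000, seed=0):
    """Approximate randomization test for paired system outputs.

    Under the null hypothesis the two systems are interchangeable, so for
    each test item the two scores can be swapped with probability 0.5.  The
    p-value is the fraction of shuffles whose absolute difference in the
    aggregate score is at least as large as the observed difference.
    """
    if aggregate is None:
        aggregate = lambda xs: sum(xs) / len(xs)  # mean, for illustration
    rng = random.Random(seed)
    observed = abs(aggregate(scores_a) - aggregate(scores_b))
    hits = 0
    for _ in range(n_shuffles):
        shuf_a, shuf_b = [], []
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a
            shuf_a.append(a)
            shuf_b.append(b)
        if abs(aggregate(shuf_a) - aggregate(shuf_b)) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)
```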
Article
Full-text available
Many modern entity recognition systems, including the current state-of-the-art de-identification systems, are based on bidirectional long short-term memory (biLSTM) units augmented by a conditional random field (CRF) sequence optimizer. These systems process the input sentence by sentence. This approach prevents the systems from capturing dependencies over sentence boundaries and makes accurate sentence boundary detection a prerequisite. Since sentence boundary detection can be problematic especially in clinical reports, where dependencies and co-references across sentence boundaries are abundant, these systems have clear limitations. In this study, we built a new system on the framework of one of the current state-of-the-art de-identification systems, NeuroNER, to overcome these limitations. This new system incorporates context embeddings through forward and backward n-grams without using sentence boundaries. Our context-enhanced de-identification (CEDI) system captures dependencies over sentence boundaries and bypasses the sentence boundary detection problem altogether. We enhanced this system with deep affix features and an attention mechanism to capture the pertinent parts of the input. The CEDI system outperforms NeuroNER on the 2006 i2b2 de-identification challenge dataset, the 2014 i2b2 shared task de-identification dataset, and the 2016 CEGS N-GRID de-identification dataset (p < 0.01). All datasets comprise narrative clinical reports in English but contain different note types varying from discharge summaries to psychiatric notes. Enhancing CEDI with deep affix features and the attention mechanism further increased performance.
Chapter
We briefly report on the four shared tasks organized as part of the PAN 2020 evaluation lab on digital text forensics and authorship analysis. Each task is introduced and motivated, and the results obtained are presented. Altogether, the four tasks attracted 230 registrations, yielding 83 successful submissions. This, and the fact that we continue to invite the submission of software rather than its run output using the TIRA experimentation platform, marks a good start to the second decade of PAN evaluation labs.
Chapter
We describe the fundamental issues that long-horizon event studies face in choosing the proper research methodology and summarize findings from existing simulation studies about the performance of commonly used methods. We document in detail how to implement a simulation study and report our own findings on large-size samples. The findings have important implications for future research. We examine the performance of more than 20 different testing procedures that fall into two categories. First, the buy-and-hold benchmark approach uses a benchmark to measure the abnormal buy-and-hold return for every event firm and tests the null hypothesis that the average abnormal return is zero. Second, the calendar-time portfolio approach forms a portfolio in each calendar month consisting of firms that have had an event within a certain time period prior to the month and tests the null hypothesis that the intercept is zero in the regression of monthly portfolio returns against the factors in an asset-pricing model. We find that using the sign test with the single most correlated firm as the benchmark provides the best overall performance for various sample sizes and long horizons. In addition, the Fama-French three-factor model performs better in our simulation study than the four-factor model, as the latter leads to serious over-rejection of the null hypothesis. We evaluate the performance of the bootstrapped Johnson's skewness-adjusted t-test. This computation-intensive procedure is considered because the distribution of long-horizon abnormal returns tends to be highly skewed to the right. The bootstrapping method uses repeated random sampling to measure the significance of relevant test statistics. Due to the nature of random sampling, the resultant measurement of significance varies each time such a procedure is used. We also evaluate simple nonparametric tests, such as the Wilcoxon signed-rank test or Fisher's sign test, which are free from random sampling variation.
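The chapter above evaluates a bootstrapped version of Johnson's skewness-adjusted t-test, where repeated random sampling measures the significance of the test statistic. A hedged sketch of one way such a procedure can be set up (the adjustment formula is the commonly quoted Johnson (1978) form; the centring-and-resampling scheme and all names are illustrative assumptions rather than the chapter's exact method):

```python
import random

def skew_adjusted_t(returns, mu0=0.0):
    """Johnson-style skewness-adjusted t-statistic for mean abnormal returns
    (assumed form: sqrt(n) * (S + skew*S**2/3 + skew/(6n)))."""
    n = len(returns)
    mean = sum(returns) / n
    sd = (sum((r - mean) ** 2 for r in returns) / (n - 1)) ** 0.5
    skew = sum((r - mean) ** 3 for r in returns) / (n * sd ** 3)
    s = (mean - mu0) / sd
    return n ** 0.5 * (s + skew * s ** 2 / 3 + skew / (6 * n))

def bootstrapped_skew_t_pvalue(returns, n_boot=10_000, seed=0):
    """Measure significance by repeated random sampling: resample the
    mean-centred returns (imposing the null of zero mean), recompute the
    adjusted t each time, and compare against the observed statistic."""
    rng = random.Random(seed)
    observed = skew_adjusted_t(returns)
    mean = sum(returns) / len(returns)
    centred = [r - mean for r in returns]
    hits = 0
    for _ in range(n_boot):
        sample = [rng.choice(centred) for _ in range(len(returns))]
        if abs(skew_adjusted_t(sample)) >= abs(observed):
            hits += 1
    return (hits + 1) / (n_boot + 1)
```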
Chapter
Full-text available
Author verification is a fundamental task in authorship analysis and associated with significant applications in humanities, cyber-security, and social media analytics. In some of the relevant studies, there is evidence that heterogeneous ensembles can provide very reliable solutions, better than any individual verification model. However, there is no systematic study of examining the application of ensemble methods in this task. In this paper, we start from a large set of base verification models covering the main paradigms in this area and study how they can be combined to build an accurate ensemble. We propose a simple stacking ensemble as well as a dynamic ensemble selection approach that can use the most reliable base models for each verification case separately. The experimental results in ten benchmark corpora covering multiple languages and genres verify the suitability of ensembles for this task and demonstrate the effectiveness of our method, in some cases improving the best reported results by more than 10%.
Article
Full-text available
State-of-the-art global coupled models used in seasonal prediction systems and climate projections still have important deficiencies in representing the boreal summer tropical rainfall climatology. These errors include prominently a severe dry bias over all the Northern Hemisphere monsoon regions, excessive rainfall over the ocean and an unrealistic double inter-tropical convergence zone (ITCZ) structure in the tropical Pacific. While these systematic errors can be partly reduced by increasing the horizontal atmospheric resolution of the models, they also illustrate our incomplete understanding of the key mechanisms controlling the position of the ITCZ during boreal summer. Using a large collection of coupled models and dedicated coupled experiments, we show that these tropical rainfall errors are partly associated with insufficient surface thermal forcing and incorrect representation of the surface albedo over the Northern Hemisphere continents. Improving the parameterization of the land albedo in two global coupled models leads to a large reduction of these systematic errors and further demonstrates that the Northern Hemisphere subtropical deserts play a seminal role in these improvements through a heat low mechanism.
Article
The estimation of the relationship between phenotype and fitness in natural populations is constrained by the distribution of phenotypes available for selection to act on. Because selection is blind to the underlying genotype, a more variable phenotypic distribution created by using environmental effects can be used to enhance the power of a selection study. I measured selection on a population of adult damselflies (Enallagma boreale) whose phenotype had been modified by raising the larvae under various levels of food availability and density. Selection on body size (combination of skeletal and mass at emergence) and date of emergence was estimated in two consecutive episodes. The first episode was survival from emergence to sexual maturity and the second was reproductive success after attaining sexual maturity. Female survival to sexual maturity was lower, and therefore opportunity for selection greater, than males in both years. Opportunity for selection due to reproductive success was greater for males. The total opportunity for selection was greater for males one year and for females the other. Survival to sexual maturity was related to mass gain between emergence and sexual maturity. Females gained more mass and survived less well than males in both years but there was no linear relationship between size at emergence and survival for females in either year. However, females in the tails of the phenotype distribution were less likely to survive than those near the mean. In contrast, small males consistently gained more mass than large males and survived less well in one year. There was significant selection on timing of emergence in both years, but the direction of selection changed due to differences in weather; early emerging females were more successful one year and late emerging males and females the other. The number of clutches laid by females was independent of body size. Because the resources used to produce eggs are acquired after emergence and this was independent of size at emergence, female fitness did not increase with size. Small males may have had lower survival to sexual maturity but they had higher mating success than large males. Resources acquired prior to sexual maturity are essential for reproductive success and may in some species alter their success in inter- and intrasexual competition. Therefore, ignoring the mortality associated with resource acquisition will give an incomplete and potentially misleading picture of selection on the phenotype.
Chapter
Natural language processing (NLP) emerged in the 1900s to support the wartime efforts. Its dubious performance, however, slowed research initiatives until the 1960s, when advances in machine learning provided novel approaches to text analysis. Increased processing speed and widespread availability of digital text accelerated this trend in the late 1990s. At the present time, there are extensive efforts to use NLP on clinical text and to incorporate this technology into software applications that support clinical care. In this chapter, the first of two about NLP, we will present: basic principles of NLP, the lexical resources required to produce high-quality output from clinical text, the process (called annotation) of creating an NLP gold standard, the statistical methods used for evaluation, and the role of shared tasks in evaluating and facilitating standardization in the field. Subsequent chapters will discuss ongoing research dedicated to improving the quality and utility of NLP in the clinical setting.
Article
Full-text available
Social media has become very popular and mainstream, leading to an abundance of content. This wealth of content contains many interactions and conversations that can be analyzed for a variety of information. One such type of information is analyzing the roles people take in a conversation. Detecting influencers, one such role, can be useful for political campaigning, successful advertisement strategies, and detecting terrorist leaders. We explore influence in discussion forums, weblogs, and micro-blogs through the development of learned language analysis components to recognize known indicators of influence. Our components are author traits, agreement, claims, argumentation, persuasion, credibility, and certain dialog patterns. Each of these components is motivated by social science through Robert Cialdini's "Weapons of Influence" [Cialdini 2007]. We classify influencers across five online genres and analyze which features are most indicative of influencers in each genre. First, we describe a rich suite of features that were generated using each of the system components. Then, we describe our experiments and results, including using domain adaptation to exploit the data from multiple online genres.
Article
Full-text available
We address the problem of automatically cleaning a translation memory (TM) by identifying problematic translation units (TUs). In this context, we treat as “problematic TUs” those containing useless translations from the point of view of the user of a computer-assisted translation tool. We approach TM cleaning both as a supervised and as an unsupervised learning problem. In both cases, we take advantage of Translation Memory open-source purifier, an open-source TM cleaning tool also presented in this paper. The two learning paradigms are evaluated on different benchmarks extracted from MyMemory, the world’s largest public TM. Our results indicate the effectiveness of the supervised approach in the ideal condition in which labelled training data is available, and the viability of the unsupervised solution for challenging situations in which training data is not accessible.
Article
The limited scale and genre coverage of labeled data greatly hinders the effectiveness of supervised models, especially when analyzing spoken languages, such as texts transcribed from speech and informal text including tweets and product comments on the Internet. In order to effectively utilize multiple labeled datasets with heterogeneous annotations for the same task, this paper proposes a coupled sequence labeling model that can directly learn and infer two heterogeneous annotations simultaneously, using Chinese part-of-speech (POS) tagging as our case study. The key idea is to bundle two sets of POS tags together (e.g., “[NN, n]”), and build a conditional random field (CRF) based tagging model in the enlarged space of bundled tags with the help of ambiguous labeling. To train our model on two nonoverlapping datasets that each has only one-side tags, we transform a one-side tag into a set of bundled tags by concatenating the tag with every possible tag at the missing side according to a predefined context-free tag-to-tag mapping function, thus producing ambiguous labeling as weak supervision. We design and investigate four different context-free tag-to-tag mapping functions, and find out that the coupled model achieves its best performance when each one-side tag is mapped to all tags at the other side (namely complete mapping), indicating that the model can effectively learn the loose mapping between the two heterogeneous annotations, without the need of manually designed mapping rules. Moreover, we propose a context-aware online pruning strategy that can more accurately capture mapping relationships between annotations based on contextual evidences and thus effectively solve the severe inefficiency problem with our coupled model under complete mapping, making it comparable with the baseline CRF model. Experiments on benchmark datasets show that our coupled model significantly outperforms the state-of-the-art baselines on both one-side POS tagging and annotation conversion tasks. The codes and newly annotated data are released for research usage.
Conference Paper
People naturally anthropomorphize the movement of nonliving objects, as social psychologists Fritz Heider and Marianne Simmel demonstrated in their influential 1944 research study. When they asked participants to narrate an animated film of two triangles and a circle moving in and around a box, participants described the shapes' movement in terms of human actions. Using a framework for authoring and annotating animations in the style of Heider and Simmel, we established new crowdsourced datasets where the motion trajectories of animated shapes are labeled according to the actions they depict. We applied two machine learning approaches, a spatial-temporal bag-of-words model and a recurrent neural network, to the task of automatically recognizing actions in these datasets. Our best results outperformed a majority baseline and showed similarity to human performance, which encourages further use of these datasets for modeling perception from motion trajectories. Future progress on simulating human-like motion perception will require models that integrate motion information with top-down contextual knowledge.
Article
The empirical financial literature reports evidence of mean reversion in stock prices and the absence of out-of-sample return predictability over horizons shorter than 10 years. Anecdotal evidence suggests the presence of mean reversion in stock prices and return predictability over horizons longer than 10 years, but thus far, there is no empirical evidence confirming such anecdotal evidence. The goal of this paper is to fill this gap in the literature. Specifically, using 141 years of data, this paper begins by performing formal tests of the random walk hypothesis in the prices of the real S&P Composite Index over increasing time horizons of up to 40 years. Although our results cannot support the conventional wisdom that the stock market is safer for long-term investors, our findings speak in favor of the mean reversion hypothesis. In particular, we find statistically significant in-sample evidence that past 15-17 year returns are able to predict the future 15-17 year returns. This finding is robust to the choice of data source, deflator, and test statistic. The paper continues by investigating the out-of-sample performance of long-horizon return forecasting based on the mean-reverting model. These latter tests demonstrate that the forecast accuracy provided by the mean-reverting model is statistically significantly better than the forecast accuracy provided by the naive historical-mean model. Moreover, we show that the predictive ability of the mean-reverting model is economically significant and translates into substantial performance gains.
Conference Paper
This paper provides a didactic example of how to conduct multi-group invariance testing with a distribution-free multi-group permutation procedure used in conjunction with Partial Least Squares (PLS). To address the likelihood that methods such as covariance-based SEM (CBSEM) with chi-square difference testing can enable group effects that mask noninvariance at lower levels of analysis, a variant of CBSEM invariance testing that focuses the evaluation on one parameter at a time (i.e., single parameter invariance testing) is proposed. Using a theoretical model from the field of Information Systems, with three exogenous constructs (routinization, infusion, and faithfulness of appropriation) predicting the endogenous construct of deep usage, the results show both techniques yield similar outcomes for the measurement and structural paths. The results enable greater confidence in the permutation-based procedure with PLS. The pros and cons of both techniques are also discussed.
Article
Full-text available
The eastern Pacific Ocean saw its highest number of sub-tropical convective activities in four decades during the boreal summer (June–September) of 2015. The associated rainfall distribution was also atypical, with anomalously enhanced rainfall extending from the equator to the sub-tropical central-eastern Pacific. The present analysis reveals a pronounced meridional sea surface temperature (SST) gradient across the central-eastern Pacific, with the mean SST exceeding 28 °C over the sub-tropical north Pacific, setting up favorable conditions for these enhanced convective activities. It is found that these anomalous features promoted northward spanning of westerly anomalies and drastically modified the east–west circulation over the sub-tropical north Pacific. This seems to induce large-scale subsidence over the off-equatorial monsoon regions of south and south-east Asia, thus constituting an east–west asymmetry over the sub-tropical Indo-Pacific region. Based on our observational study, it can be concluded that the sub-tropical convective activities over the east Pacific may play a pivotal role in mediating the Pacific–monsoon teleconnection through the unexplored meridional SST gradient across the Pacific.
Chapter
Despite being the most influential learning-based coreference model, the mention-pair model is unsatisfactory from both a linguistic perspective and a modeling perspective: its focus on making local coreference decisions involving only two mentions and their contexts makes it even less expressive than the coreference systems developed in the pre-statistical NLP era. Realizing its weaknesses, researchers have developed many advanced coreference models over the years. In particular, there is a gradual shift from local models towards global models, which seek to address the weaknesses of local models by exploiting additional information beyond that of the local context. In this chapter, we will discuss these advanced models for coreference resolution.
Chapter
In this chapter, the analysis performed and the results obtained are explained. Based on the results, the conclusions of the hypothesis testing are presented: six hypotheses were tested and all six were supported. In addition, personality was found to have a moderating effect on brain drain.
Conference Paper
Document retrieval is the task of returning relevant textual resources for a given user query. In this paper, we investigate whether the semantic analysis of the query and the documents, obtained exploiting state-of-the-art Natural Language Processing techniques (e.g., Entity Linking, Frame Detection) and Semantic Web resources (e.g., YAGO, DBpedia), can improve the performance of the traditional term-based similarity approach. Our experiments, conducted on a recently released document collection, show that Mean Average Precision (MAP) increases by 3.5 percentage points when combining textual and semantic analysis, thus suggesting that semantic content can effectively improve the performance of Information Retrieval systems.
Chapter
The prevalence and availability of efficient computing machinery have had a profound, if not inevitable, effect on modern statistical practices. Not only has the surge in efficient numerical methods and the greater general interest in computational problems provided statisticians and probabilists with tools necessary to compute what would otherwise be “uncomputable”, but this surge has also impacted statistical theory as well. The best example of this impact on theoretical aspects of statistical practice is, without question, the development of bootstrap methodology [Efron, 1979] — a body of ideas so well received and innovative that they have been outlined in the workhorse of popular scientific periodicals, Scientific American [Diaconis and Efron, 1984].