Figure 3 - uploaded by Paul Prasse
Structure of a syntax tree for an element of Ŷ_{x,D}.

Source publication
Article
Full-text available
This paper addresses the problem of inferring a regular expression from a given set of strings that resembles, as closely as possible, the regular expression that a human expert would have written to identify the language. This is motivated by our goal of automating the task of postmasters of an email service who use regular expressions to describe...

Contexts in source publication

Context 1
... vector Ψ(x, y) decomposes linearly into a sum over the nodes and a sum over pairs of adjacent nodes (see Equation 7). The syntax tree of an instantiation y = a_0 y_1 a_1 ... y_n a_n of the alignment a_x consists of a root node labeled as an alternating concatenation of constant strings a_j and subexpressions y_j (see Figure 3). This root node is connected to a layer on which the constant strings a_j = a_{j,1} ... a_{j,|a_j|} and the subtrees T_syn^{y_j} alternate (blue area in Figure 3). However, the terms in Equation 10 that correspond to the root node y and
Algorithm 1: Constructing the decoding space
Input: subexpressions Y_D and alignment a_x = a_0 (.*) a_1 ... (.*) a_n of the strings in x.
1: let T_syn^{a_x} be the syntax tree of the alignment and v_1, ..., v_n be the nodes labeled Γ_syn^{a_x}(v_j) = "(.*)".
2: for j = 1 ... n do
3: let M_j = M_{a_x,x}(v_j) ...
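To make the construction concrete, the following is a minimal Python sketch, not the authors' code, of the first step of Algorithm 1: an alignment of the form a_0(.*)a_1...(.*)a_n is split into a flat root whose children alternate between constant strings a_j and wildcard slots v_j, which are later instantiated with subexpressions from Y_D. The names `AlignmentNode` and `parse_alignment` are illustrative and do not come from the paper.

```python
import re
from dataclasses import dataclass, field

@dataclass
class AlignmentNode:
    """Root of the syntax tree of an alignment a_x = a_0 (.*) a_1 ... (.*) a_n.

    Children alternate between constant strings a_j and wildcard slots "(.*)"
    that will later be instantiated with subexpressions y_j.
    """
    children: list = field(default_factory=list)  # strings or the marker "(.*)"

def parse_alignment(alignment: str) -> AlignmentNode:
    """Split an alignment string on its "(.*)" wildcards (cf. Algorithm 1, step 1)."""
    parts = re.split(r"\(\.\*\)", alignment)       # constant strings a_0 ... a_n
    root = AlignmentNode()
    for j, const in enumerate(parts):
        root.children.append(const)                # a_j (may be the empty string)
        if j < len(parts) - 1:
            root.children.append("(.*)")           # wildcard node v_{j+1}
    return root

if __name__ == "__main__":
    root = parse_alignment("Dear (.*), your account (.*) was suspended")
    wildcards = [i for i, c in enumerate(root.children) if c == "(.*)"]
    print(root.children)
    print("wildcard node positions:", wildcards)
```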
Context 2
... the a_j are constant for all values of the y_j (red area in Figure 3). Since no edges connect multiple wildcards, the feature representation of these subtrees can be decomposed into n independent summands as in Equation 11. ...
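The practical consequence of this decomposition is that decoding can be done per wildcard: since the joint score is a sum of n independent summands, the best subexpression for each slot can be chosen by a separate argmax. Below is a hedged sketch of that per-slot decoding; the function `decode_independently`, the generic `score(slot, candidate)` interface, and the toy candidates and scores are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: if the compatibility score decomposes into independent
# per-wildcard summands (Equation 11 in the snippet above), the jointly
# optimal instantiation is obtained by maximizing each slot on its own.

def decode_independently(candidates_per_slot, score):
    """Pick, for every wildcard slot j, the candidate y_j with the best summand."""
    return [max(candidates, key=lambda y, j=j: score(j, y))
            for j, candidates in enumerate(candidates_per_slot)]

if __name__ == "__main__":
    candidates_per_slot = [["[a-z]+", "\\d+"], ["\\d{4}", "[A-Z]+"]]
    toy_scores = {(0, "[a-z]+"): 1.5, (0, "\\d+"): 0.2,
                  (1, "\\d{4}"): 2.0, (1, "[A-Z]+"): -0.3}
    best = decode_independently(candidates_per_slot, lambda j, y: toy_scores[(j, y)])
    print(best)  # ['[a-z]+', '\\d{4}']
```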

Similar publications

Conference Paper
Full-text available
" Tagging email is an important tactic for managing information overload. Machine learning methods can help the user with this task by predicting tags for incoming email messages. The natural user interface displays the predicted tags on the email message, and the user doesn't need to do anything unless those predictions are wrong (in which case, t...
Article
Full-text available
We consider the correlated multiarmed bandit (MAB) problem in which the rewards associated with each arm are modeled by a multivariate Gaussian random variable, and we investigate the influence of the assumptions in the Bayesian prior on the performance of the upper credible limit (UCL) algorithm and a new correlated UCL algorithm. We rigorously ch...
Article
Full-text available
In the context of few-shot learning, one cannot measure the generalization ability of a trained classifier using validation sets, due to the small number of labeled samples. In this paper, we are interested in finding alternatives to answer the question: is my classifier generalizing well to new data? We investigate the case of transfer-based few-s...
Article
Full-text available
Algorithm UCB1 for multi-armed bandit problem has already been extended to Algorithm UCT (Upper bound Confidence for Tree) which works for minimax tree search. We have developed a Monte-Carlo Go program, MoGo, which is the first computer Go program using UCT. We explain our modification of UCT for Go application and also the intelligent random simu...

Citations

... In this respect, the extraction of regexes is not only helpful for annotating datasets but also very appropriate for semiautomated web data extraction. Generating regexes from positive and negative samples has been studied in depth [5, 25-28]. These approaches use common repetitive patterns to build regexes either from positive samples [25, 29, 30] or from both positive and negative samples [5, 31, 32]. ...
Article
Traditional approaches for extracting relevant images automatically from web pages are error-prone and time-consuming. To improve this task, operations such as preparing a larger dataset and finding new features are used in web data extraction approaches. However, these operations are difficult and laborious. In this study, we propose a fully-automated approach based on alignment of regular expressions to automatically extract the relevant images from web pages. The automatically constructed regular expressions have been applied to a classification task for the first time. In this respect, a multi-stage inference approach is developed for generating regular expressions from the attribute values of relevant and irrelevant image elements in web pages. The proposed approach reduces the complexity of the alignment of two regular expressions by applying a constraint on a version of the Levenshtein distance algorithm. The classification accuracy of regular expression approaches is compared with the naive Bayes, logistic regression, J48, and multilayer perceptron classifiers on a balanced relevant image retrieval dataset consisting of 360 image element samples for 10 shopping websites. According to the cross-validation results, the regular expression inference-based classification achieved a 0.98 f-measure with only 5 frequent n-grams, and it outperformed other classifiers on the same set of features. The classification efficiency of the proposed approach is measured at 0.108 ms, which is very competitive with other classifiers.
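The abstract does not spell out the exact constraint on the Levenshtein algorithm, but a common way to bound the cost of such an alignment is to restrict the dynamic program to a diagonal band. The sketch below shows that general idea only; it is not the paper's implementation, and the band width parameter is an assumption.

```python
def banded_levenshtein(a: str, b: str, band: int = 3) -> float:
    """Levenshtein distance restricted to a diagonal band of width `band`.

    Cells outside the band are treated as unreachable, which bounds the cost
    of aligning two sequences at the price of exactness for very dissimilar inputs.
    """
    INF = float("inf")
    if abs(len(a) - len(b)) > band:
        return INF
    prev = [j if j <= band else INF for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [INF] * (len(b) + 1)
        if i <= band:
            curr[0] = i
        lo, hi = max(1, i - band), min(len(b), i + band)
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution / match
        prev = curr
    return prev[len(b)]

if __name__ == "__main__":
    print(banded_levenshtein("img_src", "img-src", band=2))  # 1
```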
... Hence, such a well-interpretable technology can effectively reduce dependence on data [35]. However, it is technically difficult to create and maintain high-precision regular expressions [1, 2, 25]. Although numerous scholars have begun to study the automatic construction of regular expressions [3, 6], these methods still require a certain amount of manual labor. ...
Article
Full-text available
Chinese logistics address segmentation is a specific domain of address resolution, which is very challenging due to language, culture, user privacy, business value, etc. Although deep learning can effectively solve problems where traditional segmentation methods are overly dependent on domain knowledge, it faces the dilemma of costly manual labeling. In this context, a decision tree model based on regular expression boundaries is proposed, which requires no additional data or manual labeling. First, different from traditional methods that describe the entire address elements, a regular expression rule library (RERL) is constructed, which only describes the boundaries of address elements. Second, the binary split attribute is defined according to the boundary matching algorithm based on RERL. A decision tree model is then constructed concerning the distribution law of address element types to segment an address and to evaluate its effect. The final experimental results demonstrate the improvement of our model and further substantiate that our proposal can provide a high-quality labeled training set for deep learning models without any professional domain knowledge, even in low-resource scenarios.
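As a rough illustration of boundary-driven segmentation (the actual RERL is far richer and learned rather than listed), the following sketch cuts an address string after every match of a small set of assumed boundary patterns; the patterns and the sample address are invented for illustration.

```python
import re

# Illustrative boundary patterns (not the paper's RERL): each regex marks the
# end of an address element rather than describing the whole element.
BOUNDARY_PATTERNS = [r"省", r"市", r"区", r"街道", r"路", r"号"]

def split_on_boundaries(address: str) -> list:
    """Cut the address immediately after every boundary match."""
    cut_points = sorted({m.end() for p in BOUNDARY_PATTERNS
                         for m in re.finditer(p, address)})
    segments, start = [], 0
    for cut in cut_points + [len(address)]:
        if cut > start:
            segments.append(address[start:cut])
            start = cut
    return segments

if __name__ == "__main__":
    print(split_on_boundaries("浙江省杭州市西湖区文三路100号"))
    # ['浙江省', '杭州市', '西湖区', '文三路', '100号']
```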
... We also added German equivalents of such adverbs as "no", "not", i.e. "kein(e)", "nicht", for the case of English-German ticket texts. Next, the following handcrafted rules and historical data were selected to identify the real BP complexity: (1) the presence of the mentioned one- and bi-grams in the IT ticketing system fields "Impact description" and "Brief description" of the ticket (RegEx-based free-text search (Prasse et al., 2015)), (2) the number of tasks per ticket (count of tasks, integer data type), (3) the number of configuration items, specifically applications, involved in the ticket (count of applications, integer data type) and (4) the risk type of the ticket (enumeration, ordinal scale of "low", "medium", "high"). ...
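A minimal sketch of how such handcrafted features could be computed for one ticket follows; the field names, the negation pattern, and the risk-scale encoding are assumptions for illustration, not taken from the study.

```python
import re

# Illustrative feature extraction for a single ticket, mirroring the four
# rules listed in the snippet above.
NEGATION = re.compile(r"\b(no|not|kein(e)?|nicht)\b", re.IGNORECASE)
RISK_SCALE = {"low": 0, "medium": 1, "high": 2}

def ticket_features(ticket: dict) -> dict:
    text = f"{ticket.get('impact_description', '')} {ticket.get('brief_description', '')}"
    return {
        "has_negation": bool(NEGATION.search(text)),            # rule (1)
        "n_tasks": len(ticket.get("tasks", [])),                 # rule (2)
        "n_applications": len(ticket.get("applications", [])),   # rule (3)
        "risk": RISK_SCALE.get(ticket.get("risk", "low"), 0),    # rule (4)
    }

if __name__ == "__main__":
    print(ticket_features({
        "impact_description": "Service nicht erreichbar",
        "brief_description": "No access to CRM",
        "tasks": ["restart", "verify"],
        "applications": ["CRM"],
        "risk": "high",
    }))
```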
Article
Purpose - This study aims to draw the attention of business process management (BPM) research and practice to the textual data generated in the processes and the potential of extracting meaningful insights from it. The authors apply standard natural language processing (NLP) approaches to gain valuable knowledge in the form of the business process (BP) complexity concept suggested in the study. It is built on the objective, subjective and meta-knowledge extracted from the BP textual data and encompasses semantics, syntax and stylistics. As a result, the authors aim to create awareness about the cognitive, attention and reading efforts forming the textual data-based BP complexity. The concept serves as a basis for the development of various decision-support solutions for BP workers.
Design/methodology/approach - The starting point is an investigation of the complexity concept in the BPM literature to develop an understanding of the related complexity research and to put the textual data-based BP complexity in its context. Afterward, utilizing the linguistic foundations and the theory of situation awareness (SA), the concept is empirically developed and evaluated in a real-world application case using qualitative interview-based and quantitative data-based methods.
Findings - In the practical, real-world application, the authors confirmed that BP textual data could be used to predict BP complexity from the semantic, syntactic and stylistic viewpoints. The authors were able to prove the value of this knowledge about BP complexity, formed on the basis of (1) the professional contextual experience of the BP worker enriched by awareness of the cognitive efforts required for BP execution (objective knowledge), (2) business emotions enriched by attention efforts (subjective knowledge) and (3) the quality of the text, i.e. the professionalism, expertise and stress level of the text author, enriched by reading efforts (meta-knowledge). In particular, the BP complexity concept has been applied to an industrial example of Information Technology Infrastructure Library (ITIL) change management (CHM) Information Technology (IT) ticket processing. The authors used IT ticket texts from two samples of 28,157 and 4,625 tickets as the basis for the analysis. The authors evaluated the concept with the help of manually labeled tickets and a rule-based approach using historical ticket execution data. Having a recommendation character, the results proved useful in creating awareness regarding cognitive, attention and reading efforts for ITIL CHM BP workers coordinating the IT ticket processing.
Originality/value - While aiming to draw attention to the valuable insights inherent in BP textual data, the authors propose an unconventional approach to defining BP complexity through the lens of textual data. Hereby, the authors address the challenges specified by BPM researchers, i.e. the focus on semantics in the development of vocabularies and the organization- and sector-specific adaptation of standard NLP techniques.
... The second category directly learns regular expressions from training samples, which can be easily obtained. [13] developed a method to generate regular expressions for recognizing email. However, their method cannot be easily modified for other kinds of data. ...
Preprint
Full-text available
Regular expressions are important for many natural language processing tasks, especially when used to deal with unstructured and semi-structured data. This work focuses on automatically generating regular expressions and proposes a novel genetic algorithm to deal with this problem. Different from methods that generate regular expressions at the character level, we first utilize a byte pair encoder (BPE) to extract frequent items, which are then used to construct regular expressions. The fitness function of our genetic algorithm contains multiple objectives and is optimized by an evolutionary procedure including crossover and mutation operations. The fitness function takes into consideration the length of the generated regular expression, the maximum number of matched characters and samples for positive training samples, and the minimum number of matched characters and samples for negative training samples. In addition, to accelerate the training process, we apply exponential decay to the population size of the genetic algorithm. Our method, together with a strong baseline, is tested on 13 kinds of challenging datasets. The results demonstrate the effectiveness of our method, which outperforms the baseline on 10 kinds of data and achieves nearly 50 percent improvement on average. With exponential decay, training is approximately 100 times faster than without it. In summary, our method possesses both effectiveness and efficiency and can be implemented for industrial applications.
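The following toy sketch captures the overall loop described in the abstract: candidates assembled from frequent tokens, a fitness that rewards matching positives and penalizes matching negatives and length, and a population size that decays exponentially across generations. It is a simplified stand-in, not the authors' algorithm; the fixed token pool replaces the BPE-derived items, and all names and constants are illustrative.

```python
import random
import re

# Frequent items stand-in (the paper derives these from BPE).
TOKENS = [r"\d+", r"[a-z]+", r"@", r"\.", r"com", r"-"]

def random_regex(max_len=4):
    return "".join(random.choice(TOKENS) for _ in range(random.randint(1, max_len)))

def fitness(pattern, positives, negatives):
    """Simplified single-number fitness: reward positives, penalize negatives and length."""
    try:
        rx = re.compile(pattern)
    except re.error:
        return float("-inf")
    tp = sum(bool(rx.fullmatch(s)) for s in positives)
    fp = sum(bool(rx.fullmatch(s)) for s in negatives)
    return tp - fp - 0.01 * len(pattern)

def evolve(positives, negatives, pop_size=64, generations=20, decay=0.8):
    population = [random_regex() for _ in range(pop_size)]
    for g in range(generations):
        population.sort(key=lambda p: fitness(p, positives, negatives), reverse=True)
        size = max(4, int(pop_size * decay ** g))           # exponential decay of population
        parents = population[: size // 2]
        children = [a[: len(a) // 2] + b[len(b) // 2:]      # crossover on pattern strings
                    for a, b in zip(parents, reversed(parents))]
        mutants = [p + random.choice(TOKENS) for p in parents]  # simple mutation
        population = (parents + children + mutants)[:size]
    return max(population, key=lambda p: fitness(p, positives, negatives))

if __name__ == "__main__":
    random.seed(0)
    best = evolve(positives=["abc-123", "xy-7"], negatives=["abc", "123"])
    print(best)
```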
... In addition, most of these works consider theoretical problems that are not inspired by any real-world applications [28], and the applicability of the corresponding methods is still largely unexplored. Attempts at learning regular expressions over real text were later introduced to detect Hyper Text Markup Language (HTML) lines [29] and spam emails [30], [31]. In the medical domain, many regular expression based approaches have been adopted in various tasks including symptom classification [32] and extraction of medical information such as blood pressure [33], ejection fraction [34], and bodyweight values [35] from clinical notes. ...
Article
Full-text available
Medical text classification assigns medical-related text into different categories such as topics or disease types. Machine learning based techniques have been widely used to perform such tasks despite the obvious drawback of such a “black box” approach, which leaves no easy way to fine-tune the resultant model for better performance. We propose a novel constructive heuristic approach to generate a set of regular expressions that can be used as effective text classifiers. The main innovation of our approach is that we develop a novel regular expression based text classifier with both satisfactory classification performance and excellent interpretability. We evaluate our framework on real-world medical data provided by our collaborator, one of the largest online healthcare providers in the market, and observe the high performance and consistency of this approach. Experimental results show that the machine-generated regular expressions can be effectively used in conjunction with machine learning techniques to perform medical text classification tasks. The proposed methodology improves the performance of baseline methods (Naive Bayes and Support Vector Machines) by 9% in precision and 4.5% in recall. We also evaluate the performance of regular expressions modified by human experts and demonstrate the potential of practical applications using the proposed method.
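In the spirit of the regex-based classifier described above, here is a minimal sketch in which each class owns a set of regexes and a note is assigned to the class whose patterns match most often; the class labels and patterns are invented for illustration and are not the machine-generated expressions from the paper.

```python
import re

# Each class owns a small set of (hand-written, illustrative) regexes; a note
# is assigned to the class whose regexes fire most often.
CLASS_PATTERNS = {
    "cardiology": [r"\bchest pain\b", r"\bpalpitations?\b", r"\bECG\b"],
    "dermatology": [r"\brash\b", r"\bitch(ing|y)?\b", r"\beczema\b"],
}

def classify(text: str) -> str:
    scores = {label: sum(bool(re.search(p, text, re.IGNORECASE)) for p in patterns)
              for label, patterns in CLASS_PATTERNS.items()}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    print(classify("Patient reports chest pain and palpitations, ECG ordered."))
    # cardiology
```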
... In this paper, we target the problem of detecting the presence of entity mentions that follow or closely resemble patterns that can be described by REs. Unlike much of the previous body of work on this topic, we do not focus on learning/inferring highly accurate REs for entity identification (Prasse et al., 2012; Bui and Zeng-Treitler, 2014; Li et al., 2008; Banko et al., 2007; Brauer et al., 2011; Bartoli et al., 2016). We aim instead to show that deep learning can leverage imperfect REs and achieve very high accuracy while requiring only modest human involvement. ...
Conference Paper
Full-text available
Many important entity types in web documents, such as dates, times, email addresses, and course numbers, follow or closely resemble patterns that can be described by Regular Expressions (REs). Due to the vast diversity of web documents and the ways in which they are generated, even seemingly straightforward tasks such as identifying mentions of a date in a document become very challenging. It is reasonable to claim that it is impossible to create a RE that is capable of identifying such entities from web documents with perfect precision and recall. Rather than abandoning REs as a go-to approach for entity detection, this paper explores ways to combine the expressive power of REs, the ability of deep learning to learn from large data, and a human-in-the-loop approach into a new integrated framework for entity identification from web data. The framework starts by creating or collecting existing REs for a particular type of entity. Those REs are then used over a large document corpus to collect weak labels for the entity mentions, and a neural network is trained to predict those RE-generated weak labels. Finally, a human expert is asked to label a small set of documents and the neural network is fine-tuned on those documents. The experimental evaluation on several entity identification problems shows that the proposed framework achieves impressive accuracy, while requiring very modest human effort.
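A hedged sketch of the weak-labeling step of this framework follows: a handful of imperfect date REs tag a document, producing the noisy spans on which a neural tagger would then be trained before expert fine-tuning. The date patterns, function name, and sample text are illustrative assumptions, not the paper's RE collection.

```python
import re

# Imperfect, illustrative REs for one entity type (dates). In the framework,
# the spans they produce serve as weak labels for training a neural tagger.
DATE_RES = [
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    re.compile(r"\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2}, \d{4}\b"),
]

def weak_label(document: str):
    """Return (start, end, text) spans that any of the REs marks as a date."""
    spans = []
    for rx in DATE_RES:
        for m in rx.finditer(document):
            spans.append((m.start(), m.end(), m.group()))
    return sorted(spans)

if __name__ == "__main__":
    doc = "The course starts on 09/01/2024 and ends on December 15, 2024."
    print(weak_label(doc))
```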
... The bottom-up approach starts by finding similarities or patterns among textual data that are then generalized to build regular expressions. For example, Prasse et al. (Prasse, 2012) used this approach to discover campaign message templates from predefined batches of emails. ...
Conference Paper
Full-text available
Background: Clinical measurements are commonly embedded in free-text clinical notes. These can be extracted using natural language processing, but this can be resource intensive with limited generalizability. We demonstrate a new approach using regular expression discovery for extraction (REDEx), a supervised machine learning algorithm that we have developed that automatically generates regular expressions to extract measurements with reduced effort. Results: We compare this approach to that of a support vector machine (SVM) in the task of body weight extraction. 968 weight values were annotated in 300 clinical notes and used for training of the REDEx and SVM models. 98 regular expressions were automatically generated by REDEx. In 10-fold cross validation the REDEx model consistently outperformed the SVM model, with precision .99 vs. .85, recall .98 vs. .87, f1-score .99 vs. .86, and accuracy .98 vs. .82.
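REDEx generates such patterns automatically from annotated snippets; the single hand-written regex below merely illustrates the shape of the extraction target (a weight value with its unit) and is not one of the 98 generated expressions. The pattern and example note are assumptions for illustration.

```python
import re

# Illustrative stand-in for one weight-extraction pattern.
WEIGHT_RE = re.compile(
    r"\b(?:weight|wt)\.?\s*(?:is|of|:)?\s*(\d{2,3}(?:\.\d+)?)\s*(lbs?|kg)\b",
    re.IGNORECASE,
)

def extract_weights(note: str):
    """Return (value, unit) pairs found in a clinical note."""
    return [(float(value), unit.lower()) for value, unit in WEIGHT_RE.findall(note)]

if __name__ == "__main__":
    note = "Vitals stable. Weight: 82.5 kg today; prior wt 185 lbs last year."
    print(extract_weights(note))  # [(82.5, 'kg'), (185.0, 'lbs')]
```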
... They applied their method to the problem of identifying spam email messages. Their technique frequently predicted the exact regular expression a human expert would have created or the predicted regular expression was accepted by the expert with little modification [12]. Becchi et al. showed how finite automata can be extended to accommodate Perl-Compatible Regular Expressions (PCRE) [3]. ...
Article
A forensics investigation after a breach often uncovers network and host indicators of compromise (IOCs) that can be deployed to sensors to allow early detection of the adversary in the future. Over time, the adversary will change tactics, techniques, and procedures (TTPs), which will also change the data generated. If the IOCs are not kept up-to-date with the adversary's new TTPs, the adversary will no longer be detected once all of the IOCs become invalid. Tracking the Known (TTK) is the problem of keeping IOCs, in this case regular expressions (regexes), up-to-date with a dynamic adversary. Our framework solves the TTK problem in an automated, cyclic fashion to bracket a previously discovered adversary. This tracking is accomplished through a data-driven approach of self-adapting a given model based on its own detection capabilities. In our initial experiments, we found that the true positive rate (TPR) of the adaptive solution degrades much less significantly over time than the naive solution, suggesting that self-updating the model allows the continued detection of positives (i.e., adversaries). The cost for this performance is in the false positive rate (FPR), which increases over time for the adaptive solution, but remains constant for the naive solution. However, the difference in overall detection performance, as measured by the area under the curve (AUC), between the two methods is negligible. This result suggests that self-updating the model over time should be done in practice to continue to detect known, evolving adversaries.
... In our classification task, however, inclusion of regular expressions did not consistently improve accuracy over that obtained using "bag of token" features alone. Consistent with previous studies, regular expressions provided only a small performance benefit over the use of simple word vectors in classification tasks, bearing a weak but correlative trend to the training sample size [42, 45]. Accordingly, methods that aggregate text fragments (as in induction of regular expressions), although generating features with better sensitivity (recall), provide little overall additional information when used in conjunction with a multivariate learner for classification and prediction. ...
... Both problems arise from the algorithm failing to fully examine the underlying semantic structure, resulting in only partial observations. Such misdiscovery represents the ceiling of capability for semantic-free NLP methods, but could be amenable to a richer knowledge representation by incorporating a comprehensive semantic analysis on platforms such as MedLEE [45] and cTAKES [46] during the pre-processing step. A trend was evident from our analysis which suggested that a more sophisticated representation (e.g., regular expressions) confers better descriptive power (e.g., versus n-grams). ...
Article
Full-text available
Vast amounts of clinically relevant text-based variables lie undiscovered and unexploited in electronic medical records (EMR). To exploit this untapped resource, and thus facilitate the discovery of informative covariates from unstructured clinical narratives, we have built a novel computational pipeline termed Text-based Exploratory Pattern Analyser for Prognosticator and Associator discovery (TEPAPA). This pipeline combines semantic-free natural language processing (NLP), regular expression induction, and statistical association testing to identify conserved text patterns associated with outcome variables of clinical interest. When we applied TEPAPA to a cohort of head and neck squamous cell carcinoma patients, plausible concepts known to be correlated with human papilloma virus (HPV) status were identified from the EMR text, including site of primary disease, tumour stage, pathologic characteristics, and treatment modalities. Similarly, correlates of other variables (including gender, nodal status, recurrent disease, smoking and alcohol status) were also reliably recovered. Using highly-associated patterns as covariates, a patient’s HPV status was classifiable using a bootstrap analysis with a mean area under the ROC curve of 0.861, suggesting its predictive utility in supporting EMR-based phenotyping tasks. These data support using this integrative approach to efficiently identify disease-associated factors from unstructured EMR narratives, and thus to efficiently generate testable hypotheses.
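The association-testing stage of such a pipeline can be pictured as a 2x2 contingency test between pattern presence and a binary outcome. The sketch below uses Fisher's exact test from SciPy on fabricated records; it is only a schematic of TEPAPA's statistical step, and the pattern, notes, and outcomes are made up for illustration (SciPy is assumed to be installed).

```python
import re
from scipy.stats import fisher_exact

def pattern_outcome_table(pattern, notes, outcomes):
    """Build the 2x2 contingency table of pattern presence vs. binary outcome."""
    table = [[0, 0], [0, 0]]
    rx = re.compile(pattern, re.IGNORECASE)
    for note, outcome in zip(notes, outcomes):
        present = 1 if rx.search(note) else 0
        table[present][outcome] += 1
    return table

if __name__ == "__main__":
    notes = ["base of tongue lesion", "larynx tumour", "tonsil mass", "glottic lesion"]
    outcomes = [1, 0, 1, 0]  # e.g. outcome of interest present = 1
    table = pattern_outcome_table(r"tongue|tonsil", notes, outcomes)
    odds_ratio, p_value = fisher_exact(table)
    print(table, p_value)
```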
... Unlike Whisk, which works well only when hundreds of examples are provided, SEER works with only a small number of examples. Also related are systems for inducing regular expressions [25,4,5,21]. Regular expression learners are limited to syntactic features of text; SEER capitalizes on semantic features of the text whenever possible and beneficial to data extraction. ...
Conference Paper
Full-text available
Time-consuming and complicated best describe the current state of the Information Extraction (IE) field. Machine learning approaches to IE require large collections of labeled datasets that are difficult to create, and they use obscure mathematical models that occasionally return unwanted, unexplainable results. Rule-based approaches, while resulting in easy-to-understand IE rules, are still time-consuming and labor-intensive. SEER combines the best of these two approaches: a learning model for IE rules based on a small number of user-specified examples. In this paper, we explain the design behind SEER and present a user study comparing our system against a commercially available tool in which users create IE rules manually. Our results show that SEER helps users complete text extraction tasks more quickly, as well as more accurately.