Figure 3 - uploaded by Paul Prasse
Structure of a syntax tree for an element of Ŷ_{x,D}.

Source publication
Article
Full-text available
This paper addresses the problem of inferring a regular expression from a given set of strings that resembles, as closely as possible, the regular expression that a human expert would have written to identify the language. This is motivated by our goal of automating the task of postmasters of an email service who use regular expressions to describe...

Contexts in source publication

Context 1
... vector Ψ(x, y) decomposes linearly into a sum over the nodes and a sum over pairs of adjacent nodes (see Equation 7). The syntax tree of an instantiation y = a_0 y_1 a_1 ... y_n a_n of the alignment a_x consists of a root node labeled as an alternating concatenation of constant strings a_j and subexpressions y_j (see Figure 3). This root node is connected to a layer on which the constant strings a_j = a_{j,1} ... a_{j,|a_j|} and the subtrees T_syn^{y_j} alternate (blue area in Figure 3). However, the terms in Equation 10 that correspond to the root node y and
Algorithm 1: Constructing the decoding space
Input: subexpressions Y_D and alignment a_x = a_0 (.*) a_1 ... (.*) a_n of the strings in x.
1: let T_syn^{a_x} be the syntax tree of the alignment and v_1, ..., v_n be the nodes labeled Γ_syn^{a_x}(v_j) = "(.*)".
2: for j = 1 ... n do
3: let M_j = M_{a_x,x}(v_j) ...
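To make the construction concrete, the following is a minimal Python sketch, not the authors' code, of the first step of Algorithm 1: an alignment of the form a_0(.*)a_1...(.*)a_n is split into a flat root whose children alternate between constant strings a_j and wildcard slots v_j, which are later instantiated with subexpressions from Y_D. The names `AlignmentNode` and `parse_alignment` are illustrative and do not come from the paper.

```python
import re
from dataclasses import dataclass, field

@dataclass
class AlignmentNode:
    """Root of the syntax tree of an alignment a_x = a_0 (.*) a_1 ... (.*) a_n.

    Children alternate between constant strings a_j and wildcard slots "(.*)"
    that will later be instantiated with subexpressions y_j.
    """
    children: list = field(default_factory=list)  # strings or the marker "(.*)"

def parse_alignment(alignment: str) -> AlignmentNode:
    """Split an alignment string on its "(.*)" wildcards (cf. Algorithm 1, step 1)."""
    parts = re.split(r"\(\.\*\)", alignment)       # constant strings a_0 ... a_n
    root = AlignmentNode()
    for j, const in enumerate(parts):
        root.children.append(const)                # a_j (may be the empty string)
        if j < len(parts) - 1:
            root.children.append("(.*)")           # wildcard node v_{j+1}
    return root

if __name__ == "__main__":
    root = parse_alignment("Dear (.*), your account (.*) was suspended")
    wildcards = [i for i, c in enumerate(root.children) if c == "(.*)"]
    print(root.children)
    print("wildcard node positions:", wildcards)
```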
Context 2
... the a_j are constant for all values of the y_j (red area in Figure 3). Since no edges connect multiple wildcards, the feature representation of these subtrees can be decomposed into n independent summands as in Equation 11. ...
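The practical consequence of this decomposition is that decoding can be done per wildcard: since the joint score is a sum of n independent summands, the best subexpression for each slot can be chosen by a separate argmax. Below is a hedged sketch of that per-slot decoding; the function `decode_independently`, the generic `score(slot, candidate)` interface, and the toy candidates and scores are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: if the compatibility score decomposes into independent
# per-wildcard summands (Equation 11 in the snippet above), the jointly
# optimal instantiation is obtained by maximizing each slot on its own.

def decode_independently(candidates_per_slot, score):
    """Pick, for every wildcard slot j, the candidate y_j with the best summand."""
    return [max(candidates, key=lambda y, j=j: score(j, y))
            for j, candidates in enumerate(candidates_per_slot)]

if __name__ == "__main__":
    candidates_per_slot = [["[a-z]+", "\\d+"], ["\\d{4}", "[A-Z]+"]]
    toy_scores = {(0, "[a-z]+"): 1.5, (0, "\\d+"): 0.2,
                  (1, "\\d{4}"): 2.0, (1, "[A-Z]+"): -0.3}
    best = decode_independently(candidates_per_slot, lambda j, y: toy_scores[(j, y)])
    print(best)  # ['[a-z]+', '\\d{4}']
```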

Similar publications

Conference Paper
Full-text available
" Tagging email is an important tactic for managing information overload. Machine learning methods can help the user with this task by predicting tags for incoming email messages. The natural user interface displays the predicted tags on the email message, and the user doesn't need to do anything unless those predictions are wrong (in which case, t...
Article
Full-text available
We consider the correlated multiarmed bandit (MAB) problem in which the rewards associated with each arm are modeled by a multivariate Gaussian random variable, and we investigate the influence of the assumptions in the Bayesian prior on the performance of the upper credible limit (UCL) algorithm and a new correlated UCL algorithm. We rigorously ch...
Article
Full-text available
In the context of few-shot learning, one cannot measure the generalization ability of a trained classifier using validation sets, due to the small number of labeled samples. In this paper, we are interested in finding alternatives to answer the question: is my classifier generalizing well to new data? We investigate the case of transfer-based few-s...
Article
Full-text available
Algorithm UCB1 for multi-armed bandit problem has already been extended to Algorithm UCT (Upper bound Confidence for Tree) which works for minimax tree search. We have developed a Monte-Carlo Go program, MoGo, which is the first computer Go program using UCT. We explain our modification of UCT for Go application and also the intelligent random simu...

Citations

... In this respect, the extraction of regexes is not only helpful for annotating datasets but also very appropriate for semiautomated web data extraction. Generating regexes from positive and negative samples has been studied in depth [5, 25-28]. These approaches use common repetitive patterns to build regexes either from positive samples [25, 29, 30] or from both positive and negative samples [5, 31, 32]. ...
Article
Traditional approaches for extracting relevant images automatically from web pages are error-prone and time-consuming. To improve this task, operations such as preparing a larger dataset and finding new features are used in web data extraction approaches. However, these operations are difficult and laborious. In this study, we propose a fully-automated approach based on alignment of regular expressions to automatically extract the relevant images from web pages. The automatically constructed regular expressions have been applied to a classification task for the first time. In this respect, a multi-stage inference approach is developed for generating regular expressions from the attribute values of relevant and irrelevant image elements in web pages. The proposed approach reduces the complexity of the alignment of two regular expressions by applying a constraint on a version of the Levenshtein distance algorithm. The classification accuracy of regular expression approaches is compared with the naive Bayes, logistic regression, J48, and multilayer perceptron classifiers on a balanced relevant image retrieval dataset consisting of 360 image element samples for 10 shopping websites. According to the cross-validation results, the regular expression inference-based classification achieved a 0.98 f-measure with only 5 frequent n-grams, and it outperformed other classifiers on the same set of features. The classification efficiency of the proposed approach is measured at 0.108 ms, which is very competitive with other classifiers.
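The abstract does not spell out the exact constraint on the Levenshtein algorithm, but a common way to bound the cost of such an alignment is to restrict the dynamic program to a diagonal band. The sketch below shows that general idea only; it is not the paper's implementation, and the band width parameter is an assumption.

```python
def banded_levenshtein(a: str, b: str, band: int = 3) -> float:
    """Levenshtein distance restricted to a diagonal band of width `band`.

    Cells outside the band are treated as unreachable, which bounds the cost
    of aligning two sequences at the price of exactness for very dissimilar inputs.
    """
    INF = float("inf")
    if abs(len(a) - len(b)) > band:
        return INF
    prev = [j if j <= band else INF for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [INF] * (len(b) + 1)
        if i <= band:
            curr[0] = i
        lo, hi = max(1, i - band), min(len(b), i + band)
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution / match
        prev = curr
    return prev[len(b)]

if __name__ == "__main__":
    print(banded_levenshtein("img_src", "img-src", band=2))  # 1
```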
... Hence, such a well-interpretable technology can effectively reduce dependence on data [35]. However, it is technically difficult to create and maintain high-precision regular expressions [1, 2, 25]. Although numerous scholars have begun to study the automatic construction of regular expressions [3, 6], these methods still require a certain amount of manual labor. ...
Article
Full-text available
Chinese logistics address segmentation is a specific domain of address resolution, which is very challenging due to language, culture, user privacy, business value, etc. Although deep learning can effectively solve problems where traditional segmentation methods are overly dependent on domain knowledge, it faces the dilemma of costly manual labeling. In this context, a decision tree model based on regular expression boundaries is proposed, which requires no additional data or manual labeling. First, different from traditional methods that describe the entire address elements, a regular expression rule library (RERL) is constructed, which only describes the boundaries of address elements. Second, the binary split attribute is defined according to the boundary matching algorithm based on RERL. A decision tree model is then constructed concerning the distribution law of address element types to segment an address and to evaluate its effect. The final experimental results demonstrate the improvement of our model and further substantiate that our proposal can provide a high-quality labeled training set for deep learning models without any professional domain knowledge, even in low-resource scenarios.
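As a rough illustration of boundary-driven segmentation (the actual RERL is far richer and learned rather than listed), the following sketch cuts an address string after every match of a small set of assumed boundary patterns; the patterns and the sample address are invented for illustration.

```python
import re

# Illustrative boundary patterns (not the paper's RERL): each regex marks the
# end of an address element rather than describing the whole element.
BOUNDARY_PATTERNS = [r"省", r"市", r"区", r"街道", r"路", r"号"]

def split_on_boundaries(address: str) -> list:
    """Cut the address immediately after every boundary match."""
    cut_points = sorted({m.end() for p in BOUNDARY_PATTERNS
                         for m in re.finditer(p, address)})
    segments, start = [], 0
    for cut in cut_points + [len(address)]:
        if cut > start:
            segments.append(address[start:cut])
            start = cut
    return segments

if __name__ == "__main__":
    print(split_on_boundaries("浙江省杭州市西湖区文三路100号"))
    # ['浙江省', '杭州市', '西湖区', '文三路', '100号']
```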
... We also added German equivalents of such adverbs as "no", "not", i.e. "kein(e)", "nicht", for the case of English-German ticket texts. Next, the following handcrafted rules and historical data were selected to identify the real BP complexity: (1) the presence of the mentioned one- and bi-grams in the IT ticketing system fields "Impact description" and "Brief description" of the ticket (RegEx-based free-text search (Prasse et al., 2015)), (2) the number of tasks per ticket (count of tasks, integer data type), (3) the number of configuration items, specifically applications, involved in the ticket (count of applications, integer data type) and (4) the risk type of the ticket (enumeration, ordinal scale of "low", "medium", "high"). ...
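A minimal sketch of how such handcrafted features could be computed for one ticket follows; the field names, the negation pattern, and the risk-scale encoding are assumptions for illustration, not taken from the study.

```python
import re

# Illustrative feature extraction for a single ticket, mirroring the four
# rules listed in the snippet above.
NEGATION = re.compile(r"\b(no|not|kein(e)?|nicht)\b", re.IGNORECASE)
RISK_SCALE = {"low": 0, "medium": 1, "high": 2}

def ticket_features(ticket: dict) -> dict:
    text = f"{ticket.get('impact_description', '')} {ticket.get('brief_description', '')}"
    return {
        "has_negation": bool(NEGATION.search(text)),            # rule (1)
        "n_tasks": len(ticket.get("tasks", [])),                 # rule (2)
        "n_applications": len(ticket.get("applications", [])),   # rule (3)
        "risk": RISK_SCALE.get(ticket.get("risk", "low"), 0),    # rule (4)
    }

if __name__ == "__main__":
    print(ticket_features({
        "impact_description": "Service nicht erreichbar",
        "brief_description": "No access to CRM",
        "tasks": ["restart", "verify"],
        "applications": ["CRM"],
        "risk": "high",
    }))
```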
Article
Purpose - This study aims to draw the attention of business process management (BPM) research and practice to the textual data generated in the processes and the potential of extracting meaningful insights from it. The authors apply standard natural language processing (NLP) approaches to gain valuable knowledge in the form of the business process (BP) complexity concept suggested in the study. It is built on the objective, subjective and meta-knowledge extracted from the BP textual data and encompasses semantics, syntax and stylistics. As a result, the authors aim to create awareness about the cognitive, attention and reading efforts forming the textual data-based BP complexity. The concept serves as a basis for the development of various decision-support solutions for BP workers.
Design/methodology/approach - The starting point is an investigation of the complexity concept in the BPM literature to develop an understanding of the related complexity research and to put the textual data-based BP complexity in its context. Afterward, utilizing the linguistic foundations and the theory of situation awareness (SA), the concept is empirically developed and evaluated in a real-world application case using qualitative interview-based and quantitative data-based methods.
Findings - In the practical, real-world application, the authors confirmed that BP textual data could be used to predict BP complexity from the semantic, syntactic and stylistic viewpoints. The authors were able to prove the value of this knowledge about BP complexity, formed on the basis of (1) the professional contextual experience of the BP worker enriched by awareness of the cognitive efforts required for BP execution (objective knowledge), (2) business emotions enriched by attention efforts (subjective knowledge) and (3) the quality of the text, i.e. the professionalism, expertise and stress level of the text author, enriched by reading efforts (meta-knowledge). In particular, the BP complexity concept has been applied to an industrial example of Information Technology Infrastructure Library (ITIL) change management (CHM) Information Technology (IT) ticket processing. The authors used IT ticket texts from two samples of 28,157 and 4,625 tickets as the basis for the analysis. The authors evaluated the concept with the help of manually labeled tickets and a rule-based approach using historical ticket execution data. Having a recommendation character, the results proved useful in creating awareness regarding cognitive, attention and reading efforts for ITIL CHM BP workers coordinating the IT ticket processing.
Originality/value - While aiming to draw attention to the valuable insights inherent in BP textual data, the authors propose an unconventional approach to defining BP complexity through the lens of textual data. Hereby, the authors address the challenges specified by BPM researchers, i.e. the focus on semantics in the development of vocabularies and the organization- and sector-specific adaptation of standard NLP techniques.
... The second category directly learns regular expressions from training samples, which can be easily obtained. [13] developed a method to generate regular expressions for recognizing email. However, their method cannot be easily modified for other kinds of data. ...
Preprint
Full-text available
Regular expressions are important for many natural language processing tasks, especially when used to deal with unstructured and semi-structured data. This work focuses on automatically generating regular expressions and proposes a novel genetic algorithm to deal with this problem. Different from methods that generate regular expressions at the character level, we first utilize a byte pair encoder (BPE) to extract frequent items, which are then used to construct regular expressions. The fitness function of our genetic algorithm contains multiple objectives and is optimized by an evolutionary procedure including crossover and mutation operations. The fitness function takes into consideration the length of the generated regular expression, the maximum number of matched characters and samples for positive training samples, and the minimum number of matched characters and samples for negative training samples. In addition, to accelerate the training process, we apply exponential decay to the population size of the genetic algorithm. Our method, together with a strong baseline, is tested on 13 kinds of challenging datasets. The results demonstrate the effectiveness of our method, which outperforms the baseline on 10 kinds of data and achieves nearly 50 percent improvement on average. With exponential decay, training is approximately 100 times faster than without it. In summary, our method possesses both effectiveness and efficiency and can be implemented for industrial applications.
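The following toy sketch captures the overall loop described in the abstract: candidates assembled from frequent tokens, a fitness that rewards matching positives and penalizes matching negatives and length, and a population size that decays exponentially across generations. It is a simplified stand-in, not the authors' algorithm; the fixed token pool replaces the BPE-derived items, and all names and constants are illustrative.

```python
import random
import re

# Frequent items stand-in (the paper derives these from BPE).
TOKENS = [r"\d+", r"[a-z]+", r"@", r"\.", r"com", r"-"]

def random_regex(max_len=4):
    return "".join(random.choice(TOKENS) for _ in range(random.randint(1, max_len)))

def fitness(pattern, positives, negatives):
    """Simplified single-number fitness: reward positives, penalize negatives and length."""
    try:
        rx = re.compile(pattern)
    except re.error:
        return float("-inf")
    tp = sum(bool(rx.fullmatch(s)) for s in positives)
    fp = sum(bool(rx.fullmatch(s)) for s in negatives)
    return tp - fp - 0.01 * len(pattern)

def evolve(positives, negatives, pop_size=64, generations=20, decay=0.8):
    population = [random_regex() for _ in range(pop_size)]
    for g in range(generations):
        population.sort(key=lambda p: fitness(p, positives, negatives), reverse=True)
        size = max(4, int(pop_size * decay ** g))           # exponential decay of population
        parents = population[: size // 2]
        children = [a[: len(a) // 2] + b[len(b) // 2:]      # crossover on pattern strings
                    for a, b in zip(parents, reversed(parents))]
        mutants = [p + random.choice(TOKENS) for p in parents]  # simple mutation
        population = (parents + children + mutants)[:size]
    return max(population, key=lambda p: fitness(p, positives, negatives))

if __name__ == "__main__":
    random.seed(0)
    best = evolve(positives=["abc-123", "xy-7"], negatives=["abc", "123"])
    print(best)
```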
... In addition, most of these works consider theoretical problems that are not inspired by any real-world applications [28], and the applicability of the corresponding methods is still largely unexplored. Attempts at learning regular expressions over real text were later introduced to detect Hyper Text Markup Language (HTML) lines [29] and spam emails [30], [31]. In the medical domain, many regular expression based approaches have been adopted in various tasks including symptom classification [32] and extraction of medical information such as blood pressure [33], ejection fraction [34], and bodyweight values [35] from clinical notes. ...
Article
Full-text available
Medical text classification assigns medical-related text into different categories such as topics or disease types. Machine learning based techniques have been widely used to perform such tasks despite the obvious drawback of such a “black box” approach, which leaves no easy way to fine-tune the resultant model for better performance. We propose a novel constructive heuristic approach to generate a set of regular expressions that can be used as effective text classifiers. The main innovation of our approach is that we develop a novel regular expression based text classifier with both satisfactory classification performance and excellent interpretability. We evaluate our framework on real-world medical data provided by our collaborator, one of the largest online healthcare providers in the market, and observe the high performance and consistency of this approach. Experimental results show that the machine-generated regular expressions can be effectively used in conjunction with machine learning techniques to perform medical text classification tasks. The proposed methodology improves the performance of baseline methods (Naive Bayes and Support Vector Machines) by 9% in precision and 4.5% in recall. We also evaluate the performance of regular expressions modified by human experts and demonstrate the potential of practical applications using the proposed method.
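In the spirit of the regex-based classifier described above, here is a minimal sketch in which each class owns a set of regexes and a note is assigned to the class whose patterns match most often; the class labels and patterns are invented for illustration and are not the machine-generated expressions from the paper.

```python
import re

# Each class owns a small set of (hand-written, illustrative) regexes; a note
# is assigned to the class whose regexes fire most often.
CLASS_PATTERNS = {
    "cardiology": [r"\bchest pain\b", r"\bpalpitations?\b", r"\bECG\b"],
    "dermatology": [r"\brash\b", r"\bitch(ing|y)?\b", r"\beczema\b"],
}

def classify(text: str) -> str:
    scores = {label: sum(bool(re.search(p, text, re.IGNORECASE)) for p in patterns)
              for label, patterns in CLASS_PATTERNS.items()}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    print(classify("Patient reports chest pain and palpitations, ECG ordered."))
    # cardiology
```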
... In this paper, we target the problem of detecting the presence of entity mentions that follow or closely resemble patterns that can be described by REs. Unlike much of the previous body of work on this topic, we do not focus on learning/inferring highly accurate REs for entity identification (Prasse et al., 2012; Bui and Zeng-Treitler, 2014; Li et al., 2008; Banko et al., 2007; Brauer et al., 2011; Bartoli et al., 2016). We aim instead to show that deep learning can leverage imperfect REs and achieve very high accuracy while requiring only modest human involvement. ...
Conference Paper
Full-text available
Many important entity types in web documents, such as dates, times, email addresses, and course numbers, follow or closely resemble patterns that can be described by Regular Expressions (REs). Due to the vast diversity of web documents and the ways in which they are generated, even seemingly straightforward tasks such as identifying mentions of a date in a document become very challenging. It is reasonable to claim that it is impossible to create a RE that is capable of identifying such entities from web documents with perfect precision and recall. Rather than abandoning REs as a go-to approach for entity detection, this paper explores ways to combine the expressive power of REs, the ability of deep learning to learn from large data, and a human-in-the-loop approach into a new integrated framework for entity identification from web data. The framework starts by creating or collecting existing REs for a particular type of entity. Those REs are then used over a large document corpus to collect weak labels for the entity mentions, and a neural network is trained to predict those RE-generated weak labels. Finally, a human expert is asked to label a small set of documents and the neural network is fine-tuned on those documents. The experimental evaluation on several entity identification problems shows that the proposed framework achieves impressive accuracy, while requiring very modest human effort.
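A hedged sketch of the weak-labeling step of this framework follows: a handful of imperfect date REs tag a document, producing the noisy spans on which a neural tagger would then be trained before expert fine-tuning. The date patterns, function name, and sample text are illustrative assumptions, not the paper's RE collection.

```python
import re

# Imperfect, illustrative REs for one entity type (dates). In the framework,
# the spans they produce serve as weak labels for training a neural tagger.
DATE_RES = [
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    re.compile(r"\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2}, \d{4}\b"),
]

def weak_label(document: str):
    """Return (start, end, text) spans that any of the REs marks as a date."""
    spans = []
    for rx in DATE_RES:
        for m in rx.finditer(document):
            spans.append((m.start(), m.end(), m.group()))
    return sorted(spans)

if __name__ == "__main__":
    doc = "The course starts on 09/01/2024 and ends on December 15, 2024."
    print(weak_label(doc))
```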
... The bottom-up approach starts by finding similarities or patterns among textual data that are then generalized to build regular expressions. For example, Prasse et al. (Prasse, 2012) used this approach to discover campaign message templates from predefined batches of emails. ...
Conference Paper
Full-text available
Background: Clinical measurements are commonly embedded in free-text clinical notes. These can be extracted using natural language processing, but this can be resource intensive with limited generalizability. We demonstrate a new approach using regular expression discovery for extraction (REDEx), a supervised machine learning algorithm that we have developed that automatically generates regular expressions to extract measurements with reduced effort. Results: We compare this approach to that of a support vector machine (SVM) in the task of body weight extraction. 968 weight values were annotated in 300 clinical notes and used for training of the REDEx and SVM models. 98 regular expressions were automatically generated by REDEx. In 10-fold cross validation the REDEx model consistently outperformed the SVM model, with precision .99 vs. .85, recall .98 vs. .87, f1-score .99 vs. .86, and accuracy .98 vs. .82.
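REDEx generates such patterns automatically from annotated snippets; the single hand-written regex below merely illustrates the shape of the extraction target (a weight value with its unit) and is not one of the 98 generated expressions. The pattern and example note are assumptions for illustration.

```python
import re

# Illustrative stand-in for one weight-extraction pattern.
WEIGHT_RE = re.compile(
    r"\b(?:weight|wt)\.?\s*(?:is|of|:)?\s*(\d{2,3}(?:\.\d+)?)\s*(lbs?|kg)\b",
    re.IGNORECASE,
)

def extract_weights(note: str):
    """Return (value, unit) pairs found in a clinical note."""
    return [(float(value), unit.lower()) for value, unit in WEIGHT_RE.findall(note)]

if __name__ == "__main__":
    note = "Vitals stable. Weight: 82.5 kg today; prior wt 185 lbs last year."
    print(extract_weights(note))  # [(82.5, 'kg'), (185.0, 'lbs')]
```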
... They applied their method to the problem of identifying spam email messages. Their technique frequently predicted the exact regular expression a human expert would have created or the predicted regular expression was accepted by the expert with little modification [12]. Becchi et al. showed how finite automata can be extended to accommodate Perl-Compatible Regular Expressions (PCRE) [3]. ...
Article
A forensics investigation after a breach often uncovers network and host indicators of compromise (IOCs) that can be deployed to sensors to allow early detection of the adversary in the future. Over time, the adversary will change tactics, techniques, and procedures (TTPs), which will also change the data generated. If the IOCs are not kept up-to-date with the adversary's new TTPs, the adversary will no longer be detected once all of the IOCs become invalid. Tracking the Known (TTK) is the problem of keeping IOCs, in this case regular expressions (regexes), up-to-date with a dynamic adversary. Our framework solves the TTK problem in an automated, cyclic fashion to bracket a previously discovered adversary. This tracking is accomplished through a data-driven approach of self-adapting a given model based on its own detection capabilities. In our initial experiments, we found that the true positive rate (TPR) of the adaptive solution degrades much less significantly over time than the naive solution, suggesting that self-updating the model allows the continued detection of positives (i.e., adversaries). The cost for this performance is in the false positive rate (FPR), which increases over time for the adaptive solution, but remains constant for the naive solution. However, the difference in overall detection performance, as measured by the area under the curve (AUC), between the two methods is negligible. This result suggests that self-updating the model over time should be done in practice to continue to detect known, evolving adversaries.
... In our classification task, however, inclusion of regular expressions did not consistently improve accuracy over that obtained using "bag of token" features alone. Consistent with previous studies, regular expressions provided only a small performance benefit over the use of simple word vectors in classification tasks, bearing a weak but correlative trend to the training sample size [42, 45]. Accordingly, methods that aggregate text fragments (as in induction of regular expressions), although generating features with better sensitivity (recall), provide little overall additional information when used in conjunction with a multivariate learner for classification and prediction. ...
... Both problems arise from the algorithm failing to fully examine the underlying semantic structure, resulting in only partial observations. Such misdiscovery represents the ceiling of capability for semantic-free NLP methods, but could be amenable to a richer knowledge representation by incorporating a comprehensive semantic analysis on platforms such as MedLEE [45] and cTAKES [46] during the pre-processing step. A trend was evident from our analysis which suggested that a more sophisticated representation (e.g., regular expressions) confers better descriptive power (e.g., versus n-grams). ...
Article
Full-text available
Vast amounts of clinically relevant text-based variables lie undiscovered and unexploited in electronic medical records (EMR). To exploit this untapped resource, and thus facilitate the discovery of informative covariates from unstructured clinical narratives, we have built a novel computational pipeline termed Text-based Exploratory Pattern Analyser for Prognosticator and Associator discovery (TEPAPA). This pipeline combines semantic-free natural language processing (NLP), regular expression induction, and statistical association testing to identify conserved text patterns associated with outcome variables of clinical interest. When we applied TEPAPA to a cohort of head and neck squamous cell carcinoma patients, plausible concepts known to be correlated with human papilloma virus (HPV) status were identified from the EMR text, including site of primary disease, tumour stage, pathologic characteristics, and treatment modalities. Similarly, correlates of other variables (including gender, nodal status, recurrent disease, smoking and alcohol status) were also reliably recovered. Using highly-associated patterns as covariates, a patient’s HPV status was classifiable using a bootstrap analysis with a mean area under the ROC curve of 0.861, suggesting its predictive utility in supporting EMR-based phenotyping tasks. These data support using this integrative approach to efficiently identify disease-associated factors from unstructured EMR narratives, and thus to efficiently generate testable hypotheses.
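The association-testing stage of such a pipeline can be pictured as a 2x2 contingency test between pattern presence and a binary outcome. The sketch below uses Fisher's exact test from SciPy on fabricated records; it is only a schematic of TEPAPA's statistical step, and the pattern, notes, and outcomes are made up for illustration (SciPy is assumed to be installed).

```python
import re
from scipy.stats import fisher_exact

def pattern_outcome_table(pattern, notes, outcomes):
    """Build the 2x2 contingency table of pattern presence vs. binary outcome."""
    table = [[0, 0], [0, 0]]
    rx = re.compile(pattern, re.IGNORECASE)
    for note, outcome in zip(notes, outcomes):
        present = 1 if rx.search(note) else 0
        table[present][outcome] += 1
    return table

if __name__ == "__main__":
    notes = ["base of tongue lesion", "larynx tumour", "tonsil mass", "glottic lesion"]
    outcomes = [1, 0, 1, 0]  # e.g. outcome of interest present = 1
    table = pattern_outcome_table(r"tongue|tonsil", notes, outcomes)
    odds_ratio, p_value = fisher_exact(table)
    print(table, p_value)
```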
... Unlike Whisk, which works well only when hundreds of examples are provided, SEER works with only a small number of examples. Also related are systems for inducing regular expressions [25,4,5,21]. Regular expression learners are limited to syntactic features of text; SEER capitalizes on semantic features of the text whenever possible and beneficial to data extraction. ...
Conference Paper
Full-text available
Time-consuming and complicated best describe the current state of the Information Extraction (IE) field. Machine learning approaches to IE require large collections of labeled datasets that are difficult to create, and they use obscure mathematical models that occasionally return unwanted, unexplainable results. Rule-based approaches, while resulting in easy-to-understand IE rules, are still time-consuming and labor-intensive. SEER combines the best of these two approaches: a learning model for IE rules based on a small number of user-specified examples. In this paper, we explain the design behind SEER and present a user study comparing our system against a commercially available tool in which users create IE rules manually. Our results show that SEER helps users complete text extraction tasks more quickly, as well as more accurately.