Figure 2: A Parse Tree Representation
Source publication
Conference Paper
Full-text available
With the continuing exponential growth of the Internet and the more recent growth of business Intranets, the commercial world is becoming increasingly aware of the problem of electronic information overload. This has encouraged interest in developing agents/softbots that can act as electronic personal assistants and can develop and adapt representa...

Context in source publication

Context 1
... any part of a tree can be interchanged with another part and the tree remains valid, which is perfect for flexible evolution and mutation. An example parse tree is shown in Figure 2. ...
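To see why that interchangeability (the closure property) matters for evolution, here is a minimal Python sketch, entirely our own illustration rather than the cited system's code, of a parse tree in which any subtree can be swapped for any other and the result is still a valid, evaluable expression:

import copy
import random

class Node:
    """A parse-tree node: an operator with children, or a terminal."""
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []

    def evaluate(self, env):
        if not self.children:
            # Terminals are variables (looked up in env) or constants.
            return env.get(self.value, self.value)
        ops = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
        args = [c.evaluate(env) for c in self.children]
        return ops[self.value](*args)

def all_nodes(tree):
    """Every node is a legal crossover point."""
    nodes = [tree]
    for child in tree.children:
        nodes.extend(all_nodes(child))
    return nodes

def subtree_crossover(parent_a, parent_b):
    """Graft a random subtree of parent_b onto a random point in a copy
    of parent_a. Because every subtree is itself a valid expression,
    the offspring is always a valid tree (closure)."""
    child = copy.deepcopy(parent_a)
    target = random.choice(all_nodes(child))
    donor = copy.deepcopy(random.choice(all_nodes(parent_b)))
    target.value, target.children = donor.value, donor.children
    return child

# (x + 3) * y evaluated with x = 2, y = 4 gives 20.
tree = Node("*", [Node("+", [Node("x"), Node(3)]), Node("y")])
print(tree.evaluate({"x": 2, "y": 4}))

other = Node("+", [Node("y"), Node(1)])
offspring = subtree_crossover(tree, other)
print(offspring.evaluate({"x": 2, "y": 4}))  # still a valid expression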

Similar publications

Article
Full-text available
Advanced Technology Laboratories, in conjunction with Lockheed Martin Maritime Systems and Sensors and Lockheed Martin Aeronautics Advanced Development Programs, performed a set of experiments in cooperation with the U.S. Navy involving collaborative, unmanned surface and air vehicle mission execution. Our multi-domain, collaborative research foc...
Article
Full-text available
The increasing complexity of production logistic systems has led to an emergence of new decentralized control concepts. The Collaborative Research Center 637 (CRC 637) investigates the advantages and limitations of autonomous control as one of these concepts. This research mainly focuses on control strategies consisting of precise descriptions o...
Article
Full-text available
To realise autonomous control for transport networks, attempts are made to transfer well-known and proven routing protocols from data communication to transport problems. Here, structural differences between data and transportation networks prevent a direct transfer of the protocols. In transportation networks not one but several diverse and parti...
Conference Paper
Full-text available
The German Collaborative Research Centre 637 'Autonomous Cooperating Logistic Processes' tries to make a paradigm shift from central planning to autonomous control in the field of logistics. Among other things, autonomous routing algorithms based on internet routing protocols are developed. The Distributed Logistics Routing Protocol (DLRP) was orig...

Citations

... GAs have also been in use for some time to generate rules for text classification [21,22,23] and clustering [11,24,25], which have the advantage of being explainable. The simple disjunctive search queries produced by the eSQ system are easy to understand and are potentially modifiable by a human analyst. ...
Preprint
Full-text available
We present a novel, hybrid approach for clustering text databases. We use a genetic algorithm to generate and evolve a set of search queries in Apache Lucene format. Clusters are formed as the set of documents matched by a search query. The queries are optimized to maximize the number of documents returned and to minimize the overlap between clusters (documents returned by more than one query). Where queries contain more than one word, we have found it useful to assign one word to be the root and constrain the query construction such that the set of documents returned by any additional query word intersects with the set returned by the root word. Multiword queries are interpreted disjunctively. We also describe how a gene can be used to determine the number of clusters (k). Not all documents in a collection are returned by any of the search queries in a set, so once the search query evolution is completed a second stage is performed whereby a KNN algorithm is applied to assign all unassigned documents to their nearest cluster. We describe the method and present results using 8 text datasets, comparing effectiveness with well-known existing algorithms. We note that the search query format has the qualitative benefits of being interpretable and providing an explanation of cluster construction.
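As a rough illustration of the two objectives described in the abstract (maximising documents returned while minimising overlap), a fitness based on unique query hits might look like the following sketch. The function name and the structure of its inputs are our own assumptions; matched stands in for the result of running each evolved Lucene query against the index.

from collections import Counter

def fitness(matched, num_docs):
    """Score a set of search queries used as clusters. matched holds one
    set of returned document ids per query; num_docs is the collection
    size. Documents hit by exactly one query count in favour; documents
    hit by several queries (cluster overlap) count against."""
    hits = Counter(doc for docs in matched for doc in docs)
    unique = sum(1 for n in hits.values() if n == 1)
    overlapping = sum(1 for n in hits.values() if n > 1)
    return (unique - overlapping) / num_docs

# Three queries over a 10-document collection; document 2 overlaps.
print(fitness([{0, 1, 2}, {2, 3, 4}, {5, 6}], 10))  # 0.5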
... GAs have also been used to generate rules for text classification [16][17][18][19] and clustering [20], which have the advantage of being transparent and explainable. The eSQ system presented here has a novel fitness test based entirely on the count of unique query hits. ...
Conference Paper
Full-text available
We present a novel, hybrid approach for clustering text databases. We use a genetic algorithm to generate and evolve a set of single word search queries in Apache Lucene format. Clusters are formed as the set of documents matching a search query. The queries are optimized to maximize the number of documents returned and to minimize the overlap between clusters (documents returned by more than one query in a set). Optionally, the number of clusters can be specified in advance, which will normally result in an improvement in performance. Not all documents in a collection are returned by any of the search queries in a set, so once the search query evolution is completed a second stage is performed whereby a KNN algorithm is applied to assign all unassigned documents to their nearest cluster. We describe the method and compare effectiveness with other well-known existing systems on 8 different text datasets. We note that search query format has the qualitative benefits of being interpretable and providing an explanation of cluster construction.
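The second-stage KNN step described above can be pictured as a simple nearest-neighbour assignment. This is a minimal sketch under our own assumptions (cosine similarity over hypothetical document vectors), not the authors' implementation.

import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def assign_unmatched(doc_vec, clustered, k=3):
    """Assign a document that no query matched to the cluster holding the
    majority of its k nearest neighbours. clustered is a list of
    (vector, cluster_id) pairs for already-clustered documents."""
    neighbours = sorted(clustered, key=lambda p: cosine(doc_vec, p[0]),
                        reverse=True)[:k]
    labels = [cid for _, cid in neighbours]
    return max(set(labels), key=labels.count)

clustered = [([1.0, 0.0], "A"), ([0.9, 0.1], "A"), ([0.0, 1.0], "B")]
print(assign_unmatched([1.0, 0.05], clustered))  # "A"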
... Genetic methods have been in use for some time in the area of document classification [5], [17]. We have previously described a system whereby Apache Lucene search queries were evolved from a set of training documents in order to classify documents in a collection [9]. ...
... This is a novel method for classifying documents, in which agents develop a parse-tree representation of a user's specific information need. Another remarkable property of the work is its continual training process: the user's feedback helps the agent adapt to the user's long-term information requirements [23,25]. ...
... The paper deals with spam detection based on a Reverse Polish Notation (RPN) [1], [2] expression-based Linear Genetic Programming (LGP) [3]-[5] approach, compared against a Naïve Bayesian classifier [6]-[8]. ...
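For readers unfamiliar with RPN, the sketch below shows a generic stack-based evaluator for postfix expressions in Python. It is our own illustration of the notation, not the cited LGP system, and the feature names f1 and f2 are hypothetical.

def eval_rpn(tokens, env):
    """Evaluate a Reverse Polish Notation token list with a stack:
    operands are pushed; each operator pops its two arguments."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b}
    stack = []
    for tok in tokens:
        if tok in ops:
            b, a = stack.pop(), stack.pop()
            stack.append(ops[tok](a, b))
        elif tok in env:
            stack.append(env[tok])
        else:
            stack.append(float(tok))
    return stack.pop()

# "(f1 + f2) * 3" in postfix; f1 and f2 are hypothetical message features.
print(eval_rpn(["f1", "f2", "+", "3", "*"], {"f1": 0.2, "f2": 0.3}))  # 1.5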
... Earlier systems include Magi [25], which used decision trees to automatically route new messages to the relevant folders; RIPPER [1], which automatically learnt rules to classify email into categories (spam was not mentioned); and Genetic Document Classifier [2], which used a classical GP to route inbound documents to interested research groups within a large organisation. ...
... The Naïve Bayesian Classifier assumes that f1, f2, ..., fn are conditionally independent given the class, which gives (2): P(C | f1, ..., fn) ∝ P(C) ∏i P(fi | C) (2), where both P(fi | C) and P(C) are relative frequencies which can be calculated from the training corpus. ...
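Equation (2) is straightforward to operationalise. The following sketch, with our own hypothetical names and a crude probability floor in place of proper smoothing, estimates P(C) and P(fi | C) as relative frequencies from a training corpus and classifies by the product rule:

from collections import Counter, defaultdict

def train(docs):
    """docs is a list of (features, label) pairs. Returns the
    relative-frequency estimates for P(C) and P(f | C) of equation (2)."""
    class_counts = Counter(label for _, label in docs)
    feature_counts = defaultdict(Counter)
    for features, label in docs:
        feature_counts[label].update(features)
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    likelihoods = {c: {f: n / sum(counts.values())
                       for f, n in counts.items()}
                   for c, counts in feature_counts.items()}
    return priors, likelihoods

def classify(features, priors, likelihoods, floor=1e-6):
    """Return the class maximising P(C) * prod_i P(f_i | C); the floor
    stands in for unseen features (a real system would smooth properly)."""
    def score(c):
        p = priors[c]
        for f in features:
            p *= likelihoods[c].get(f, floor)
        return p
    return max(priors, key=score)

training = [(["cheap", "pills"], "spam"), (["meeting", "agenda"], "ham")]
priors, likelihoods = train(training)
print(classify(["cheap", "pills"], priors, likelihoods))  # spam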
Conference Paper
Full-text available
We investigate a machine learning algorithm and the Bayesian classifier in the spam-filtering context. The paper shows the advantage of using Reverse Polish Notation (RPN) expressions with feature extraction compared to the traditional Naïve Bayesian classifier used for spam detection, assuming the same features. The performance of the two is investigated using a public corpus and a recent private spam collection, concluding that the system based on RPN LGP (Linear Genetic Programming) gave better results than two popularly used open-source Bayesian spam filters.
... A filter or a query is applied to information, with the result being the specific documents or files related to the filter's purpose. This is useful because classification and routing techniques are what compose filtering [6]. Clack creates an automatic document classification system for businesses by using filtering. ...
... While this thesis does not seek to build a full-fledged semantic or "smart" desktop, it does seek to create a useful part of that goal. Elimination of tedious work is often a helpful and useful goal [6]. ...
... 6 shows that the results for the large set of the Reuters corpus are quite different. The simple classifier performs best at 25%, followed by the NB and HMM at 25% and 18%. ...
Article
Document classification is used to sort and label documents. This gives users quicker access to relevant data. Users that work with a large inflow of documents spend time filing and categorizing them to allow for easier procurement. The Automatic Classification and Document Filing (ACDF) system proposed here is designed to allow users working with files or documents to rely on the system to classify and store them with little manual attention. By using a system built on Hidden Markov Models, the documents in a smaller desktop environment are categorized with better results than the traditional Naive Bayes implementation of classification.
... The example clearly indicates that readability and modifiability have recognized value to commercial classification products and that the production of readable rules with high accuracy is a worthwhile objective in text classification research. Generally, attempts to produce classification systems that are human-understandable have involved the production of a set of rules which are used for classification purposes [6], [7], [8], [9], [10], [11]. Often, the set of rules is quite large, which reduces some of the qualitative advantages because it will be harder for a human to comprehend or modify the classifier. ...
... Both are stochastic search methods inspired by biological evolution. The evolution will require a fitness test based on some measure of classification accuracy [6], [7], [9], [11], [12], [13]. The basic idea we introduce here is that each individual will encode a candidate solution in a search query format. ...
Conference Paper
Full-text available
Human readable text classifiers have a number of advantages over classifiers based on complex and opaque mathematical models. For some time now search queries or rules have been used for classification purposes, either constructed manually or automatically. We have performed experiments using genetic algorithms to evolve text classifiers in search query format with the combined objective of classifier accuracy and classifier readability. We have found that a small set of disjunct Lucene SpanFirst queries effectively meet both goals. This kind of query evaluates to true for a document if a particular word occurs within the first N words of a document. Previously researched classifiers based on queries using combinations of words connected with OR, AND and NOT were found to be generally less accurate and (arguably) less readable. The approach is evaluated using standard test sets Reuters-21578 and Ohsumed and compared against several classification algorithms.
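The SpanFirst semantics are easy to state directly: a clause matches when a given word occurs within the first N tokens of a document, and the evolved classifiers combine such clauses disjunctively. The sketch below models that behaviour in plain Python; it is illustrative only (a real system would issue a Lucene SpanFirst query against an index), and all names are our own.

def span_first(doc_text, word, n):
    """True iff word occurs within the first n tokens of the document,
    mirroring the behaviour described for a SpanFirst query."""
    return word.lower() in doc_text.lower().split()[:n]

def classify(doc_text, clauses):
    """clauses is a list of (word, n) pairs; the classifier fires when
    any clause matches (the disjunctive combination described above)."""
    return any(span_first(doc_text, word, n) for word, n in clauses)

# Hypothetical classifier for a 'grain' category:
print(classify("Wheat prices rose sharply today in Chicago trading",
               [("wheat", 5), ("grain", 10)]))  # True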
... Although GP has been used in a textual environment [4][5] it has not previously been used to evolve search query classifiers for large text datasets. ...
Conference Paper
Full-text available
We describe a method for generating accurate, compact, human understandable text classifiers. Text datasets are indexed using Apache Lucene and Genetic Programs are used to construct Lucene search queries. Genetic programs acquire fitness by producing queries that are effective binary classifiers for a particular category when evaluated against a set of training documents. We describe a set of functions and terminals and provide results from classification tasks.
... Genetic Programming has also been applied to text classification through the use of a parse-tree. In [6], Clack et al. used GP to route inbound documents to a central classifier which autonomously sent documents to interested research groups within a large organization. The central classifier used a parse tree to match the aspects of a document to nodes of the tree, which ultimately leads to a single numerical value, the classification or "confidence value", during evaluation. ...
Conference Paper
Full-text available
This paper shows how citation-based information and structural content (e.g., title, abstract) can be combined to improve classification of text documents into predefined categories. We evaluate different measures of similarity -- five derived from the citation information of the collection, and three derived from the structural content -- and determine how they can be fused to improve classification effectiveness. To discover the best fusion framework, we apply Genetic Programming (GP) techniques. Our experiments with the ACM Computing Classification Scheme, using documents from the ACM Digital Library, indicate that GP can discover similarity functions superior to those based solely on a single type of evidence. Effectiveness of the similarity functions discovered through simple majority voting is better than that of content-based as well as combination-based Support Vector Machine classifiers. Experiments also were conducted to compare the performance between GP techniques and other fusion techniques such as Genetic Algorithms (GA) and linear fusion. Empirical results show that GP was able to discover better similarity functions than GA or other fusion techniques.
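Simple majority voting over several discovered similarity functions, as used above, can be sketched as follows; the similarity functions here are hypothetical stand-ins for GP-discovered ones.

def majority_vote(doc, category, similarity_fns, threshold=0.5):
    """Each discovered similarity function casts a vote on whether doc
    belongs to category; the majority decides."""
    votes = sum(1 for f in similarity_fns if f(doc, category) >= threshold)
    return votes > len(similarity_fns) / 2

# Hypothetical stand-ins for three GP-discovered similarity functions:
fns = [lambda d, c: 0.8, lambda d, c: 0.4, lambda d, c: 0.9]
print(majority_vote("doc-1", "H.3", fns))  # True: two of three vote yes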
... We then provide information concerning the implementation of our application and the initial results we have obtained on a text classification task. Although GP has been used in a textual environment ([8]; [9]) it has not previously been used to evolve compressed classifiers based on evolving N-Gram patterns. ...
... • Functions for identifying words that are ADJACENT in the text or NEAR one another. • New functions together with numeric terminals for identifying frequency information may be introduced [8]. Functions such as '>' return a Boolean value based on the frequency of a particular N-Gram in comparison to an integer terminal. ...
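The '>'-style frequency function mentioned in that context can be pictured with a minimal sketch; the counter below is our own illustration, not the cited function set.

def ngram_count(text, ngram):
    """Count (possibly overlapping) occurrences of a character N-gram."""
    n = len(ngram)
    return sum(1 for i in range(len(text) - n + 1)
               if text[i:i + n] == ngram)

def gt(text, ngram, threshold):
    """A '>'-style GP function: true when the N-gram's frequency in the
    document exceeds an integer terminal."""
    return ngram_count(text, ngram) > threshold

print(gt("programming in programs", "gram", 1))  # True: occurs twice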
Article
Full-text available
We describe a novel method for using genetic programming to create compact classification rules using combinations of N-grams (character strings). Genetic programs acquire fitness by producing rules that are effective classifiers in terms of precision and recall when evaluated against a set of training documents. We describe a set of functions and terminals and provide results from a classification task using the Reuters 21578 dataset. We also suggest that the rules may have a number of other uses beyond classification and provide a basis for text mining applications.
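A precision/recall-based fitness of the kind described can be sketched as an F1 score over a training set; predicted and relevant are our own hypothetical names for the documents a rule matches and the documents labelled as in-category.

def fitness(predicted, relevant):
    """F1-style fitness for an evolved rule: predicted is the set of
    training documents the rule matches; relevant is the set labelled
    as belonging to the category."""
    if not predicted or not relevant:
        return 0.0
    tp = len(predicted & relevant)
    precision = tp / len(predicted)
    recall = tp / len(relevant)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(fitness({1, 2, 3}, {2, 3, 4}))  # 0.667: precision 2/3, recall 2/3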