Framework on Extracting Personal Name Pseudonyms from the Web

Mr. M. Mohamed Iqbal
Assistant Professor, Department of Computer Science, Aalim Muhamed Salegh College of Engineering,
Chennai-600055, Tamil Nadu, India. E-mail: iqbalmecse@gmail.com
Dr. K. Latha
Assistant Professor, Department of Computer Science and Engineering, Anna University,
Thiruchirappalli-620024, Tamil Nadu, India. E-mail: erklatha@gmail.com
Abstract
A person may have multiple personal name aliases on the web. Identifying aliases of a name is useful
in information retrieval and knowledge management, sentiment analysis, relation extraction and
name disambiguation. The objective of detecting aliases from the web is to retrieve all the information
pertaining to a person whose name appears under different nicknames in different
web documents. At present, the web contains aliases of popular personalities in domains such as
sports, politics, medicine, music and cinema, and does not contain alias information about ordinary
people. Recently, methods for extracting aliases through lexical-pattern-based retrieval have been
tested using real-world name-alias pairs in Japanese and English as training data, restricted to limited
domains. In this paper, we discuss various personal name disambiguation methods used in
web-related tasks.
Keywords: Information Retrieval, Mnemonic Name, Name Disambiguation, Word Sense
Disambiguation, Lexico-syntactic Pattern, Natural Language Processing, Semantic Similarity
I. Introduction
Finding information about people on the web is one of the day-to-day activities of
Internet users. Thirty percent of search engine queries include person names
[1]. Nevertheless, extracting information about people from web search engines is a
difficult task when a person is referred to by different nicknames.
For instance, a popular cinema artiste whose original name is Shivaji Rao Gaekwad is referred to by
aliases such as “Super Star”, “Badsha”, “Muthu”, “Robot”, “Chitti”, “Dancing
Maharaja”, and many more. We cannot retrieve all information about the
artiste from the web unless we extract his top-ranked aliases. Different entities
can also share the same name, which is called lexical ambiguity.
On the other hand, a single entity can be designated by multiple names (i.e., referential
ambiguity). A real-world example of a shared name is the alias “Badsha”, which also refers to Shah Rukh Khan,
Advances in Innovative Engineering and Technologies
ISBN: 978-0-9948937-1-0 [223]
another actor in the same domain of expertise. This problem can be addressed with semantic
metadata for entities, and automatic extraction of metadata [2] can accelerate the process of
semantic annotation. For named entities, automatically extracted aliases can serve as a
useful source of metadata, thereby providing a means to disambiguate an entity.
Identifying aliases of a name is important for extracting relations among entities. For
example, Matsuo et al. [3] propose a social network extraction algorithm in which they
compute the strength of the relation between two individuals X and Y by the web hits for the
conjunctive query “X” AND “Y”. However, both persons X and Y might also appear under their
aliases in web content. Consequently, by expanding the conjunctive query with
aliases for the names, a social network extraction algorithm can compute the
strength of a relationship between two persons more accurately.
II. Methods for the Name Ambiguity Problem
Our research is directed towards building a web extraction system that extracts efficient
patterns for Indian name aliases and that can later be adapted to other fields.
Alias extraction is basically an information retrieval (IR) task, which looks for similar,
preceding, succeeding, adjacent, lexico-syntactic and supervised co-occurring text in a
large collection of documents. The main function of information retrieval is to build a
term-weighting system [4] which will enhance retrieval effectiveness. Two measures are
normally used to assess the ability of a system to retrieve the relevant items and reject the
non-relevant items of a collection; they are known as recall and precision, respectively.
Recall and precision are the principal accuracy measures of any information
retrieval task on the web and hold for alias extraction too. Below we discuss
the following techniques: word association norms and lexicography, collocation
extraction in natural language processing, cross-document co-reference resolution,
duplicate detection using learnable string similarity measures, unsupervised clustering
to identify the referents of personal names, disambiguating namesakes, approximate
string matching, mnemonic name extraction, approximate name matching using
finite-state graphs, measuring semantic similarity between words, mining the web for alias
extraction, and user name alias extraction in e-mails.
2.1 Word Association Norms and Lexicography
In linguistics, it is general practice to classify words not only on the basis of their
meaning but also on the basis of their co-occurrence with other words. The word ‘bank’
has two meanings, which are distinguished by the adjacent words and expressions. For
instance, words such as currency, cheque, loan, account and interest are related to
financial institutions. On the other hand, ‘bank’ co-occurring with water, boat, etc. is
related to a river. Word association norms are well known to be an important factor in
psycholinguistic research, specifically in the area of lexical retrieval. People respond
more quickly than normal to the word ‘nurse’ if it follows a frequently associated word such as
‘doctor’.
Psycholinguistic research has found that the word ‘doctor’ is most often associated
with ‘nurse’, followed by sick, health, medicine and hospital. In [18], the association ratio
was proposed for measuring word association norms from computer-readable
corpora, based on the information-theoretic concept of mutual information.
2.1.1 Mutual Information
If two points (words) x and y have probabilities P(x) and
P(y), then their mutual information I(x,y) is defined to be
$$I(x,y) = \log \frac{P(x,y)}{P(x)\,P(y)}$$
Informally, mutual information compares the probability of observing x and y together
with the probabilities of observing x and y independently (chance). If there is a genuine
association between x and y, then the joint probability P(x,y) will be much larger than
chance P(x) P(y), and consequently I(x,y) > 0. If there is no interesting relationship
between x and y, then P(x,y) = P(x) P(y), and thus I(x,y) = 0. If x and y are in
complementary distribution, then P(x,y) will be much less than P(x) P(y), forcing I(x,y) <
0. Word probabilities P(x) and P(y) are estimated by counting the number of
observations of x and y in a corpus, f(x) and f(y), and normalizing by N, the size of the corpus.
For experimentation, corpora of different sizes were used. Joint probabilities P(x,y) are
estimated by counting the number of times that x is followed by y in a window of w words,
fw(x,y), and normalizing by N. The window size parameter is used to look at different
scales. Smaller windows identify fixed expressions (idioms such as bread and butter), while
larger windows highlight semantic concepts and other relationships.
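The estimation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation from [18]: the function name `association_ratios`, the choice of a base-2 logarithm, and the convention that a window of w words includes x itself (so y is searched among the next w-1 words) are assumptions made here.

```python
import math
from collections import Counter

def association_ratios(tokens, window=5):
    """Estimate I(x, y) = log2(P(x, y) / (P(x) P(y))) for ordered word pairs."""
    n = len(tokens)                            # N: size of the corpus
    f = Counter(tokens)                        # f(x): single-word counts
    fw = Counter()                             # fw(x, y): x followed by y inside the window
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + window]:   # assumption: the next window-1 words
            fw[(x, y)] += 1
    return {
        (x, y): math.log2((c / n) / ((f[x] / n) * (f[y] / n)))
        for (x, y), c in fw.items()
    }
```

Because only pairs in which x precedes y are counted, the score for (x, y) generally differs from the score for (y, x).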
Table: Mean and Variance of the Separation between Word X and Word Y
Relation  | Word X     | Word Y    | Separation Mean | Separation Variance
Fixed     | bread      | butter    | 2.00            |
          | drink      | drive     | 2.00            | 0.00
Compound  | computer   | scientist | 1.12            | 0.10
          | United     | States    | 0.98            | 0.14
Semantic  | man        | woman     | 1.46            | 8.07
          | men        | women     | -0.12           |
Lexical   | refraining | from      | 1.11            |
          | coming     | from      | 0.83            |
          | keeping    | from      | 2.14            |
From the above table, it was inferred that in fixed expressions such as ‘bread and butter’ or
‘drink and drive’, the words are separated by only a few words. They are often found
very close to each other, within five words. Here the mean separation is two and the variance
is zero. Compound expressions also appear close to each other. In contrast, semantically related
words such as man/woman have a larger variance in their separation. Lexical relations come
in several varieties. Some, like ‘refraining from’, are fairly fixed; others, like ‘coming
from’, may be separated by an argument; and still others, like ‘keeping from’, are almost certain to
be separated by an argument. Technically, the association ratio differs from mutual
information in two respects. First, joint probabilities are supposed to be symmetric, P(x,y)
= P(y,x), and thus mutual information is also symmetric: I(x,y) = I(y,x). However, the
association ratio is not symmetric, since f(x,y) encodes linear precedence: f(x,y) denotes
the number of times that word x appears before y in a window of w words, not the
number of times that the two words appear in either order. This work provides a precise
statistical calculation that can be applied to a large corpus of text to produce a table of
associations for tens of thousands of words. The association ratio could be an important
tool to aid the lexicographer: it can help decide what to look for, and it provides a quick
summary of which words are associated in a machine-readable corpus.
2.2 Extracting Collocations
Corpus analysis has been widely used by researchers since the rapid growth of the
web. It extracts collocations by using automatic techniques for retrieving
lexical information from textual corpora. Collocations are sequences of words that
co-occur in a document. Natural languages are full of collocations: recurrent
combinations of words that co-occur more often than expected by chance and that
correspond to arbitrary word usages. Research in lexicography indicates that
collocations are common in all types of writing, including both technical and
non-technical modes of communication. The Xtract software [5] consists of a set of tools to locate
words in context and make statistical observations to identify collocations in
documents. Xtract uses straightforward statistical measures to retrieve from a corpus pair-wise
lexical relations whose common appearances within a single sentence are correlated. An
advantage of Xtract is that it can also produce collocations involving more than two
words (n-grams). Retrieval systems are usually evaluated with two
parameters, precision and recall [Salton 1989], which assess the
quality of the retrieved material. However, Xtract cannot be directly applied to extract
aliases, since the nicknames of celebrities need not be natural language collocations.
2.3 Cross-Document Co-Reference Problem
Cross-document co-reference (CDC) is another problem in alias extraction from natural
language text. Bagga and Baldwin [6] proposed a cross-document co-reference
resolution algorithm which uses the vector space model (VSM) to resolve ambiguities
between people having the same name. Initially, they developed a co-reference algorithm
which works in two steps: (1) extract co-reference chains within each document; (2)
cluster the co-reference chains under the VSM to identify all mentions of a name in the
document set. However, given the enormous number of documents on the web, it is impractical to
perform within-document co-reference resolution on each document separately and then
cluster the documents to find aliases. Moreover, the noise and different writing styles
found in web documents make it difficult to perform within-document co-reference
resolution. Later, they devised a cross-document co-reference resolution method which works
across documents with better accuracy.
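The clustering step can be illustrated with a small vector space sketch. It assumes that each document has already been reduced to a short text summary of the sentences in one co-reference chain, uses raw term frequencies rather than any particular weighting scheme, and applies a greedy single-link rule with a hypothetical similarity threshold; none of these choices is taken from [6].

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster_summaries(summaries, threshold=0.5):
    """Greedily group summaries; each cluster is meant to stand for one real person."""
    vectors = [Counter(s.lower().split()) for s in summaries]
    clusters = []                               # each cluster is a list of summary indices
    for i, v in enumerate(vectors):
        for c in clusters:
            if any(cosine(v, vectors[j]) >= threshold for j in c):
                c.append(i)                     # similar enough to an existing cluster member
                break
        else:
            clusters.append([i])                # no match: start a new cluster
    return clusters
```

Summaries that share enough context words end up in the same cluster, while a namesake described in a different context starts a new one.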
2.4 Duplicate Detection Using Learnable String Similarity Measures
The problem of identifying approximate duplicate records in databases is an essential
step in data cleaning and data integration processes. Duplicates can prevent data-mining
algorithms from discovering important regularities. This problem is typically handled
during a tedious manual data cleaning, or “de-duping”, process.
Previously, manually tuned distance metrics were used for estimating
potential duplicates. The authors of [16] presented two learnable text similarity measures
suitable for this task: the first uses the Expectation Maximisation (EM) algorithm for
estimating the parameters of a generative model based on learnable string edit distance,
and the second is a novel vector-space-based measure that employs a Support Vector Machine (SVM) to
obtain a similarity estimate based on the vector-space model of text. The character-based
distance is best suited for shorter strings with minor variations, while the vector-space
representation is more appropriate for fields that contain longer strings with more variations.
The overall duplicate detection system, MARLIN (Multiply Adaptive Record Linkage with
INduction), employs a two-level learning approach. First, string similarity measures are
trained for every database field so that they can provide accurate estimates of string
distance between values for that field. Next, a final predicate for detecting duplicate
records is learned from the similarity metrics applied to each of the individual fields.
Support Vector Machines were used for this step and were shown to outperform previous
methods such as decision trees and other classifiers [56,57]. MARLIN has been shown to
improve duplicate detection accuracy over traditional techniques.
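To make the notion of a parameterised string edit distance concrete, the sketch below computes a standard dynamic-programming edit distance in which the substitution and gap costs are explicit parameters. It is not MARLIN's learned model: in [16] such parameters are estimated with EM, whereas here `sub_cost` and `gap_cost` are hypothetical fixed values supplied by the caller.

```python
def weighted_edit_distance(s, t, sub_cost=1.0, gap_cost=1.0):
    """Edit distance between s and t with configurable operation costs."""
    prev = [j * gap_cost for j in range(len(t) + 1)]    # distance from "" to t[:j]
    for i, cs in enumerate(s, start=1):
        curr = [i * gap_cost]                           # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + gap_cost,                             # delete cs
                curr[j - 1] + gap_cost,                         # insert ct
                prev[j - 1] + (0.0 if cs == ct else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]
```

With both costs set to 1.0 this reduces to the ordinary Levenshtein distance; tuning the costs per database field is the part that a learnable measure would automate.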
Fig. 1 shows the framework for improving duplicate detection using trainable measures of
textual similarity, which can add significant value to alias extraction.
Figure 1: Duplicate Detection Framework [MARLIN]
2.5 Unsupervised Clustering to Identify the Referents of Personal Names
In [7], a set of algorithms is described for disambiguating personal names
with multiple real referents in text, based on little or no supervision. The approach
uses an unsupervised clustering technique over a rich feature space of biographic facts,
which are automatically extracted via a language-independent bootstrapping process.
The induced clusters of named entities are then partitioned and linked to their
referents via the extracted biographic data.
One open problem in natural language ambiguity resolution is the task of proper noun
disambiguation. While word senses and translation ambiguities may typically have 2-20
alternative meanings that must be resolved through context, a personal name such as
“Bill Clinton” may potentially refer to hundreds or thousands of distinct individuals.
Suppose a Google search returns 30 web pages mentioning “Bill Clinton”, of which
the top 5 unique referents are: a professor at the University of Malaysia, the former
President of the USA, a film producer in Hyderabad, a gun dealer in New Delhi, and a computer
science student in Japan.
Each different referent typically has some distinct contextual characteristics. These
characteristics can help distinguish, resolve and trace the referents when the names
appear in online documents.
2.6 Social Network System
Social networks play an important role in the semantic web, e.g. in information retrieval,
knowledge management and ubiquitous computing. People
communicate and share information through social relations with others such as
friends, family, classmates, colleagues, collaborators and business partners. Social
networking services (SNSs) have gained popularity in recent years. SNSs are used to
register personal information, including a user’s friends and acquaintances; the
systems promote information exchange such as sending messages and
reading weblogs. Friendster, Orkut, Facebook and Twitter are successful SNSs. In the
context of the semantic web, social networks are crucial to realizing a web of trust, which
enables the estimation of information credibility and trustworthiness [40]. Because
anyone can say anything on the web, the web of trust helps humans and machines to
discern which contents are credible and to determine which information can be used
reliably. Ontology construction is also related to social networks. Kautz, Selman et al.
developed a system for extracting social networks from the web called Referral Web [41]. The
system focuses on co-occurrences of names on web pages using a search engine. It
estimates the strength of relevance of two persons X and Y by putting the query “X and Y” to a
search engine. If X and Y have a strong relation, much evidence can be found in their
homepages, lists of co-authors in technical papers, and organizational charts. A path from
person to person is obtained automatically. Later, with the development of the WWW and
semantic web technology, more information on our daily activities has become available online,
and given this greater potential and demand the method now seems outdated. P. Mika et al.
developed a system for the extraction, aggregation and visualization of online social
networks for a semantic web community, called Flink [42]. Social networks are obtained
by analysing web pages, e-mail messages, publications and self-created profiles
(FOAF files). The web mining component of Flink also employs co-occurrence analysis.
Given a set of names as input, the component uses a search engine to obtain hit counts for
individual names as well as for the co-occurrence of two names. Because the system targets
the semantic web community, the term “semantic web OR ontology” is added to
the query for disambiguation.
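The hit-count analysis can be sketched as follows. The text above does not state how the three hit counts are combined, so the Jaccard-style ratio used here is an assumption, as are the function name `tie_strength`, the query syntax, and the `hit_count` callable, which stands in for whatever search engine interface is available.

```python
def tie_strength(x, y, hit_count, context="semantic web OR ontology"):
    """Estimate the relation strength of persons x and y from search hit counts."""
    hits_x = hit_count(f'"{x}" ({context})')          # pages mentioning x in the target domain
    hits_y = hit_count(f'"{y}" ({context})')          # pages mentioning y in the target domain
    both = hit_count(f'"{x}" "{y}" ({context})')      # pages mentioning both names
    either = hits_x + hits_y - both                   # pages mentioning at least one name
    return both / either if either > 0 else 0.0
```

The `context` string plays the role of the disambiguating term added to every query; passing a different term would adapt the sketch to another community.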
McCallum and his group [43] present an end-to-end system that extracts a user’s
social network. The system identifies unique people in e-mail messages, finds their
homepages, and fills the fields of a contact address book as well as the other person’s
name. Links are placed in the social network between the owner of the web page and
persons discovered on that page. Harada et al. [44] develop a system to extract names and
also person-to-person relations from the web. Faloutsos et al. obtain a social network of
15 million persons from 500 million web pages using their co-occurrence within a
window of 10 words. Knees et al. [45] classify artists into genres using co-occurrence of
names and music keywords in the top 50 pages retrieved by a search engine. L. Adamic et
al. classified the social network of Stanford University students, and collected relations
among students from web link structure and text information. In [3], an advanced
social network extraction system called PolyPhonet was introduced, which employs
advanced techniques to extract relations of persons, detect groups of persons, and obtain
keywords for a person. It is a web-based system for an academic community to facilitate
communication and mutual understanding based on a social network extracted from the
web. The system has been used at JSAI annual conferences for three years and at
UbiComp2005. Person names co-occur with many words on the web; a particular
researcher’s name will co-occur with many words that are related to that person’s major
research topic. The system uses a person-to-person matrix, called the adjacency matrix, and a
person-word co-occurrence matrix, used as the affiliation matrix. Multi-faceted retrieval is
possible on the social network: researchers can be sought by name, affiliation, keyword
and research field; researchers related to a retrieved researcher are listed; and a search for
the shortest path between two researchers can be made. The similarity of
two researchers’ contexts can also be measured, i.e. how mutually
relevant their research topics are: if two persons are researchers of very
similar topics, the distribution of word co-occurrences will also be similar. Even more
complicated tasks are possible, such as searching for the researcher who is nearest to a user on the social
network among researchers in a certain field. PolyPhonet is integrated with a
scheduling support system [46] and a location information display system [47] in a
ubiquitous computing environment. Google is used to measure co-occurrence of
information and to obtain web documents.
2.7 Mining the Web for Mnemonic Name Extraction
The web is a source from which we can collect and summarize information about a
particular real-world object. The proliferation of tools like bulletin boards and weblogs
created the need for information extraction and knowledge dissemination, and much work
has been done in this area. A major problem is that the same object is referred to in different
ways in different documents. For example, a person may be referred to by full name, first
name, affiliation and title, or nicknames. The term mnemonic name refers to an unofficial
name of an object. Generally, people use nicknames or mnemonic names when they
complain about or evaluate an object unfavourably. Here, the full name of a person is considered the
official name.
The first method for extracting mnemonic names from the web was proposed by
Hokama and Kitagawa [10] (Fig. 2). Evaluative expressions about an object (e.g. a
business organization, product or person) are extracted from the text surrounding the string
that represents that object. The ability to collect web pages describing the target object is
needed first in order to extract representative information. Existing research extracts evaluative
expressions from text surrounding the official name of the target object, such as a product name.
Specialized topic detection for a particular object will become important, as will
reputation information extraction, which analyses text around the object’s name and
extracts “local information”. Personal information sources such as personal databases,
public home pages and Wikipedia exist, but these are static or official, so they cannot yield dynamic
and unofficial information that includes recent popular topics about a person. For the larger
web space, information sources such as bulletin boards and blogs must be tapped to
collect dynamic and unofficial information. We also need to know how much attention these topics
attract from the public. In this method, short strings adjacent to the full name of the target
person are used to extract mnemonic names. The method is applicable only to Japanese
texts, because it uses a Japanese linguistic pattern. Object identification or named entity
recognition aims to discover the official names of entities, whereas the purpose here is to extract
“non-official” names of people.
Figure 2: Extracting Mnemonic names of people from the web
This method comprises three components: 1) extracting candidate mnemonic names of the target
person from the web; 2) extracting strings adjacent to the full name of the target person as
prefix and suffix patterns; 3) evaluating the candidate mnemonic names extracted in step 1 using the
adjacent patterns extracted in step 2, then selecting the top k candidates as mnemonic names of the target
person. For a given name p, they search for the query “* koto p” and extract the content
that matches the asterisk. Koto (“<<string in Japanese>>”) is roughly equivalent to
“be called” in English, but it is a vague term: it can be a clue for searching but not the decisive
factor. The Japanese word koto has multiple meanings: “also known as”,
“incident”, “thing”, “matter”, “experience”, and “task”.
Candidate Mnemonic Name Extraction
The pattern “alias<<Japanese string>>full name” is used in Japanese to describe
the alias or nickname of a person. Therefore, a string that occurs right before
“<<Japanese string>>full name” is a candidate mnemonic name. If this string is commonly used as a
mnemonic name, it occurs on the web repeatedly; based on this, candidate
mnemonic names are extracted as follows.
1. Perform the query “<<Japanese>>full name” on a web search engine, then get the
URL list.
2. Get the web pages in the URL list, analyse them, and extract the string <t1 t2
t3 ... tn> that occurs right before the string “<<Japanese string>>full name”,
where t1, t2, ..., tn are morphemes.
3. Extract sub-strings of <t1 t2 ... tn>, then select those sub-strings whose first
morpheme has the POS tag “general noun” as candidate mnemonic names. Then
count the frequency of occurrence of each candidate.
4. Eliminate candidate mnemonic names that occur only once in the analysed web
pages.
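Steps 3 and 4 of the procedure above can be sketched as follows, under several assumptions that are not part of the method in [10]: the pages are taken to be already fetched and morphologically analysed, each occurrence is given as a list of hypothetical (surface, POS tag) pairs for the morphemes preceding the pattern, and "sub-strings" is read as the morpheme sequences that end directly before the pattern.

```python
from collections import Counter

def candidate_mnemonics(occurrences, min_count=2):
    """Return candidate mnemonic names with their frequencies.

    occurrences: one list of (surface, pos) morphemes per match, holding the
    morphemes t1 ... tn found right before the pattern.
    """
    counts = Counter()
    for morphemes in occurrences:
        for start in range(len(morphemes)):
            _, pos = morphemes[start]
            if pos == "general noun":            # step 3: first morpheme must be a general noun
                counts["".join(s for s, _ in morphemes[start:])] += 1
    # step 4: drop candidates that were seen only once
    return {cand: n for cand, n in counts.items() if n >= min_count}
```

The morphemes are joined without spaces, which suits Japanese text; for other languages the joining rule would need to change.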
Adjacent Pattern Extraction
We get web pages that include the full name of the target person by performing a web search,
and extract strings adjacent to the full name, i.e. prefix and suffix patterns. Because there
can be several people with the same name, we add to the search query the name of an object that
is relevant to the person (e.g. the parent organization). After extracting all prefix and suffix
patterns, we calculate weights for all patterns by considering the co-occurrence relation
between the full name and the patterns.
The procedure for adjacent pattern extraction is as follows.
1. Let the name of the object that has great relevance to the target person be rel-object.
Perform the query “Full Name AND rel-object” on a web search engine, then get the
URL list.
2. Analyse the web pages in the URL list and extract the strings <t1 t2 t3 ... tm> adjacent to
the full name. Extract sub-strings of <t1 t2 t3 ... tm> in the same way as in candidate
mnemonic name extraction, and add these sub-strings to the lists of prefix and
suffix patterns.
3. Calculate weights for all extracted prefix patterns. The weight of a
prefix pattern, W(prefix), is calculated by the formula
R = searchResults(prefix)
R1 = searchResults(“prefixFullName”)
W(prefix) = R1/R
where R1 is the number of web pages including “prefixFullName” and
R is the number of web pages including the prefix.
searchResults(query) is a function that returns the total number of web pages
including query. The exact total number of such pages cannot be known, so the authors
use the totalResultsAvailable field of the Yahoo! API as an estimate.
4. Calculate weights for all extracted suffix patterns. The weight of a suffix
pattern, W(suffix), is calculated by the same formula:
R = searchResults(suffix)
R1 = searchResults(“FullNamesuffix”)
W(suffix) = R1/R
5. Add the prefix and suffix patterns whose weights exceed a given threshold to the
“adjacent patterns list”.
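Steps 3 to 5 amount to a ratio of two hit counts followed by a threshold test, which the sketch below illustrates. The `search_results` callable is a stand-in for any function returning an estimated page count for a query, and the threshold value and the `(kind, pattern)` dictionary keys are hypothetical choices made for this example.

```python
def adjacent_patterns(prefixes, suffixes, full_name, search_results, threshold=0.01):
    """Weight prefix and suffix patterns by W = R1 / R and keep those above the threshold."""
    weights = {}
    for kind, patterns in (("prefix", prefixes), ("suffix", suffixes)):
        for pattern in patterns:
            # prefix patterns precede the full name, suffix patterns follow it
            query = pattern + full_name if kind == "prefix" else full_name + pattern
            r = search_results(pattern)      # R: pages containing the pattern alone
            r1 = search_results(query)       # R1: pages containing pattern and name together
            if r > 0 and r1 / r > threshold:
                weights[(kind, pattern)] = r1 / r
    return weights
```

A pattern that appears on many pages but rarely next to the full name receives a low weight and is filtered out.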
Candidate Mnemonic Name Evaluation
The final step evaluates the extracted candidate mnemonic names using a second heuristic: a
candidate mnemonic name cand is likely to be an actual mnemonic name of the target person if cand
occurs just before or just after an adjacent pattern.
The evaluation procedure is as follows:
1. Set the initial score of cand to 0.
2. For all adjacent patterns, apply this procedure:
(a) If the adjacent pattern is a prefix pattern, generate the string “prefix cand”.
If the adjacent pattern is a suffix pattern, generate the string “cand suffix”.
(b) Obtain from a search engine the total number of web pages including the generated string, total.
Add the product of total and the pattern’s weight (W(prefix) or
W(suffix)) to the score.
This calculation is used because it is highly likely that cand is an actual mnemonic name when
there are many web pages including “prefix cand” or “cand suffix” and
the pattern’s weight is large. After calculating the scores of all candidate mnemonic names, we
select the top k candidates as mnemonic names of the target person. This method
works well for extracting mnemonic names from Japanese websites. The results returned by
the method for six different people were reported to be largely correct. However, a few
inappropriate mnemonic names were returned and a few appropriate mnemonic names were missing.
Despite two post-processing heuristics specific to the Japanese language that filter out
incorrect mnemonic names, the method seems to produce incorrect results in some
cases. Moreover, due to the multiple meanings of the pattern, many noisy and incorrect
aliases were extracted. Future directions given by the authors are improving the calculation of
pattern weights and candidate scores, investigating the generality and robustness of the
method, and extending the work to other objects and languages.
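The scoring procedure can be sketched as a weighted sum of hit counts. In this sketch the adjacent patterns are assumed to be supplied as a dictionary mapping a hypothetical `(kind, pattern)` pair, with kind equal to "prefix" or "suffix", to the pattern weight, and `search_results` is again a stand-in for a function that returns an estimated page count.

```python
def top_mnemonics(candidates, patterns, search_results, k=3):
    """Score each candidate against all adjacent patterns and return the top k."""
    scores = {}
    for cand in candidates:
        score = 0.0                                    # step 1: initial score
        for (kind, pattern), weight in patterns.items():
            # step 2(a): build "prefix cand" or "cand suffix"
            query = pattern + cand if kind == "prefix" else cand + pattern
            # step 2(b): add page count times pattern weight
            score += search_results(query) * weight
        scores[cand] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A candidate ranks highly only when it appears often next to patterns that themselves co-occur strongly with the full name.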
2.8 Approximate Name Matching Using Finite State Graphs
Some of the string matching algorithms [9] used for extracting variants or abbreviations
of personal names. For instance, matching “Ram Kumar” with the first name initialized
variant ‘R. Kumar’. Approximate string matching is an interrelated area of natural
language processing, Information Extraction, and Information Retrieval. Personal name
can be considered as object tags that may appear in many different forms called as
variants. A personal variant can be described as a text occurrence that is conceptually
well related with the correct form or canonical form of a name. The recognition [Thomson
and Dozier 1999] of variant of these sequences belongs to three categories: name-
recognition, name-matching and name searching. Another method for common name
extraction and there problems, Name Recognition is the process by which a string of
characters is identified as a name. It is widely used to extract names from texts as
described in Message Understanding Conferences [MUC-4,1992;MUC-6,1995] as a part of
information extraction. In [MUC6,1995], the recognition of the named entity is considered
as a key element of extraction systems. Entities include names of persons, organizations,
or places as well as expressions of time or monetary expressions. In [MUC-7,1997], the
named entity recognition implies identifying and categorizing three subareas which are
the recognition of time expressions, numerical expressions, and entity names-person,
organizations, and places. Name matching corresponds to determining whether two
strings of characters previously recognized as names actually designate the same person.
Name matching does not address the case of different individuals who share an identical name
label. When two names are compared, two possibilities arise: (1) the matching is exact, and there is no problem;
(2) the matching is not exact, making it necessary to determine the origin of the variants
and apply approximate string matching. Name searching designates the process through
which a name is used as part of a query to retrieve information associated with that
sequence in a database. Here, two problems can appear: (1) the names are not identified
as such in the database records or in the syntax of the query, so name recognition
techniques are needed; (2) the names are recognized in the database records and in the
query, but it is not certain that the recognized names designate the same person, so
name matching techniques are needed. In bibliographic databases and citation index
systems, variant forms create problems of inaccuracy that affect information retrieval.
Ultimately, this affects the quality of the information drawn from databases and the citation
statistics used to evaluate scientists' work. A number of string matching
techniques have been developed to validate variant forms, based on similarity and
equivalence relations. One such variant identification procedure [9], based on binary matrices and finite-state
graphs, was tested on samples of author names from bibliographic
records in the Library and Information Science Abstracts and Science Citation Index
Expanded databases. The evaluation reports precision and recall as measures of
accuracy and completeness, respectively. However, an inherent limitation of such string matching
methods is that they cannot identify aliases.
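As an illustration of approximate string matching between name variants, the following minimal sketch uses Python's standard `difflib` module. It is not the finite-state method described above; the normalization step and the threshold of 0.8 are assumptions chosen only for this example.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a name, replace punctuation with spaces and sort its tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def name_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized name strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_variant(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two strings as variants when their similarity reaches the threshold.

    The threshold is an illustrative assumption, not a value from the literature.
    """
    return name_similarity(a, b) >= threshold
```

With this sketch, "Smith, John" and "John Smith" are treated as variants, whereas a name and an alias with little surface overlap, such as "Will Smith" and "The Fresh Prince", are not. This is the limitation noted above.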
Feature Selection for Activity Recognition using Eye Movements
ISBN: 978-0-9948937-1-0 [234]
2.9 Measuring Semantic Similarity
Semantic similarity measures play a significant role in information retrieval and Natural
Language Processing. Semantic similarity measures [19] have been used in automatic query
suggestion and expansion. Previous work has also used them for community mining, relation
extraction, and automatic metadata extraction. Semantic similarity between entities changes
over time and across domains. For example, a user may be interested in retrieving
information about “apple” in the sense of Apple computers and not apple the fruit. New
words are constantly being created, and new senses are assigned to existing words.
The work in [19] proposes an automatic method to measure semantic similarity between words
or entities using web search engines. Page counts and snippets are two useful information
sources provided by most search engines. The page count for the query “P AND Q” can be
considered a global measure of the co-occurrence of words P and Q. For example, the page
count of the query “apple” AND “computer” is higher than that of “banana” AND “computer”,
which indicates that “apple” is more
semantically similar to “computer” than “banana” is. However, there are some drawbacks.
Page count need not equal word frequency, because the queried word might
appear many times on one page. Moreover, the page count of a polysemous word (one with multiple
senses) might combine all of its senses. For these reasons, page counts alone
are an unreliable measure of semantic similarity. The method therefore considers both
page counts and lexico-syntactic patterns extracted from snippets to overcome these
problems.
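A page-count-based co-occurrence score can be sketched as follows. The Jaccard-style formula, the noise cutoff of 5 and all page counts below are assumptions made for illustration; no real search engine is queried.

```python
def web_jaccard(count_p: int, count_q: int, count_pq: int, min_cooccurrence: int = 5) -> float:
    """Jaccard-style similarity computed from search-engine page counts.

    count_p, count_q: page counts for the single-word queries P and Q.
    count_pq: page count for the conjunctive query "P AND Q".
    min_cooccurrence: very small joint counts are treated as noise (illustrative cutoff).
    """
    if count_pq <= min_cooccurrence:
        return 0.0
    return count_pq / (count_p + count_q - count_pq)


# Hypothetical page counts, invented purely for illustration.
page_counts = {
    "apple": 1_000_000,
    "banana": 200_000,
    "computer": 2_000_000,
    ("apple", "computer"): 300_000,
    ("banana", "computer"): 10_000,
}

apple_score = web_jaccard(
    page_counts["apple"], page_counts["computer"], page_counts[("apple", "computer")]
)
banana_score = web_jaccard(
    page_counts["banana"], page_counts["computer"], page_counts[("banana", "computer")]
)
```

Under these invented counts, `apple_score` exceeds `banana_score`, mirroring the example in the text. The score still inherits the drawbacks of page counts discussed above.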
2.9.1 Lexico-syntactic patterns
Consider the following snippet from Google for the query ‘Jaguar and Cat’: “The Jaguar is
the largest Cat in western hemisphere usually found in forests”. Here, the phrase “is the
largest” indicates a hyponym relationship between the Jaguar and the Cat. Phrases such
as “also known as”, “is a part of”, and “is an example of” indicate semantic
relations of different types. Such indicative phrases have been applied successfully to various tasks,
such as hyponym extraction [25] and fact extraction [Pasca et al.
2006]. From this example, we can form the pattern “X is the largest Y” by replacing the
two words Jaguar and Cat with two wildcards, X and Y. Accordingly, an approach based on automatically
extracted lexico-syntactic patterns was proposed, and semantic similarity was
measured using text snippets obtained from a web search engine. The patterns are
integrated with web-based similarity measures using WordNet synsets and support
vector machines to create a robust semantic similarity measure. This integrated method
was evaluated on an existing benchmark dataset. It was also the first attempt
to combine WordNet synsets and web content to produce a robust semantic similarity
measure. The method was tested in a community mining task, where it captured similarity
between real-world entities, and it also proved useful in word sense
disambiguation (WSD) [McCarthy et al. 2004].
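The replacement of the two query words by the wildcards X and Y can be sketched with a regular expression. The function name `extract_pattern` is hypothetical, and the sketch handles only the case where the first word precedes the second in the snippet.

```python
import re
from typing import Optional


def extract_pattern(snippet: str, first: str, second: str) -> Optional[str]:
    """Return the pattern 'X <words> Y' if `first` precedes `second` in the snippet."""
    regex = rf"\b{re.escape(first)}\b(.*?)\b{re.escape(second)}\b"
    match = re.search(regex, snippet, flags=re.IGNORECASE)
    if match is None:
        return None
    # Collapse whitespace in the words found between the two query terms.
    middle = " ".join(match.group(1).split())
    return f"X {middle} Y" if middle else "X Y"


snippet = "The Jaguar is the largest Cat in western hemisphere usually found in forests"
pattern = extract_pattern(snippet, "Jaguar", "Cat")
```

For the snippet quoted above, `pattern` is the string "X is the largest Y".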
Advances in Innovative Engineering and Technologies
ISBN: 978-0-9948937-1-0 [235]
2.9.2 Word Sense Disambiguation
The Contextual Hypothesis for Sense [Schutze 1998] states that the context in which a word
appears can be used to determine its sense. For instance, a web page discussing ‘Jaguar as
a car’ is likely to talk about other types of cars, parts of cars, etc., whereas a web page
on ‘Jaguar the cat’ is likely to contain information about other types of cats and animals.
The paper computed precision, recall and F-score for each cluster created from the top 1000
snippets returned by Google for two ambiguous entities, Jaguar and Java. Depending on the context, Jaguar can have
any one of three senses: cat, car, or operating system.
The ambiguous word Java can likewise have three senses: programming language, island, and
coffee.
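The per-cluster evaluation can be sketched as below. The sketch assumes that every snippet carries a gold sense label and that a cluster is assigned its majority sense; both assumptions are made for illustration, and the labels in the example are invented.

```python
from collections import Counter
from typing import Dict, List


def cluster_scores(cluster: List[str], all_labels: List[str]) -> Dict[str, float]:
    """Precision, recall and F-score of one non-empty cluster against gold sense labels.

    cluster: gold sense label of each snippet placed in this cluster.
    all_labels: gold sense label of every snippet in the collection.
    The cluster is assigned its majority sense (an assumption of this sketch).
    """
    sense, hits = Counter(cluster).most_common(1)[0]
    precision = hits / len(cluster)
    recall = hits / all_labels.count(sense)
    f_score = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f_score": f_score}


# Invented labels: ten snippets for "Jaguar", one cluster holding four of them.
all_labels = ["cat"] * 4 + ["car"] * 4 + ["os"] * 2
scores = cluster_scores(["cat", "cat", "cat", "car"], all_labels)
```

Here three of the four clustered snippets share the majority sense "cat", and three of the four "cat" snippets are in the cluster, so precision, recall and F-score are all 0.75.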
This work can be applied to automatic synonym extraction, query
suggestion, and, finally, name alias recognition.
2.10 Mining the Web for Alias Extraction
In the second alias extraction method, Danushka Bollegala et al. [11] [35] proposed a
novel approach to find aliases of a given name from the web. The method (Fig 3) comprises
two components: lexical pattern extraction, and candidate alias extraction and ranking.
It exploits a set of known names and their aliases (name-alias pairs) as training data to
generate lexical patterns that convey information related to aliases of names from text
snippets returned by a web search engine. Since web contents are dynamic in nature,
the search engine retrieves, for the initial seed, text snippets as currently available in the web
documents.
The training data is updated with every new search for a lexical pattern.
Different combinations of patterns are given to the search engine to
maximize the retrieval of lexical patterns relevant to the given personal name. The patterns
are then used to find candidate aliases of a given name. Anchor texts and hyperlinks are
used to design a word co-occurrence model and define numerous ranking scores to
evaluate the association between a name and its candidate aliases.
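Association scores of this kind are typically computed from co-occurrence counts. The sketch below uses the standard definitions of pointwise mutual information, the Dice coefficient, the overlap coefficient and the cosine measure; the function name and all counts are hypothetical.

```python
import math
from typing import Dict


def association_scores(n_name: int, n_alias: int, n_both: int, n_total: int) -> Dict[str, float]:
    """Association scores between a name and a candidate alias.

    n_name, n_alias: number of contexts containing the name / the candidate.
    n_both: number of contexts containing both.
    n_total: total number of contexts considered.
    """
    if n_both == 0:
        return {"pmi": float("-inf"), "dice": 0.0, "overlap": 0.0, "cosine": 0.0}
    return {
        "pmi": math.log2(n_both * n_total / (n_name * n_alias)),
        "dice": 2 * n_both / (n_name + n_alias),
        "overlap": n_both / min(n_name, n_alias),
        "cosine": n_both / math.sqrt(n_name * n_alias),
    }


# Hypothetical counts for two candidate aliases of one name: (n_alias, n_both).
candidates = {"fresh prince": (25, 20), "movie": (5000, 30)}
ranked = sorted(
    candidates,
    key=lambda c: association_scores(100, candidates[c][0], candidates[c][1], 10_000)["dice"],
    reverse=True,
)
```

With these invented counts the candidate "fresh prince" is ranked above the generic word "movie", because it co-occurs with the name in a much larger share of its contexts.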
2.10.1 Lexical Pattern Extraction
The search engine serves as the source in name alias extraction. It provides a brief snippet for
each search result by selecting a portion of the text that appears in the web page in the
proximity of the query. Such snippets provide valuable information related to the local
context of the query. For example, consider the snippet returned by Google for the query
“Will Smith * The Fresh Prince”.
Here, the wildcard operator * is used to perform a NEAR query, and it matches one or
more words in a snippet. The snippets are parsed with various patterns. This
lexico-syntactic pattern approach has been used in numerous related tasks, such as extracting
synonyms, hyponyms [25] and meronyms [26].
Figure 3: Personal Name Alias Extraction of Celebrities from web
Figure 3 illustrates the alias extraction approach. First, a set of snippets containing both the name
and the alias is retrieved from the web. Then, from each snippet, the Create Pattern
function extracts the sequence of words that appear between the name and the alias. We
repeat the process described above for the reversed query, “ALIAS * NAME”, to extract
patterns in which the alias precedes the name. Likewise, eight other patterns [Fig
4] are given as queries to extract snippets from web documents, and each input pattern is
evaluated for performance with a measure called F-score.
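The pattern creation step can be sketched as follows. The function names `create_pattern` and `pattern_frequencies`, the `[NAME]` and `[ALIAS]` placeholders and the example snippets are assumptions made for this illustration; the sketch does not reproduce the published algorithm in detail.

```python
import re
from collections import Counter
from typing import Iterable, Optional


def create_pattern(snippet: str, name: str, alias: str) -> Optional[str]:
    """Return '[NAME] <words> [ALIAS]' or '[ALIAS] <words> [NAME]' for one snippet."""
    for first, second, left, right in (
        (name, alias, "[NAME]", "[ALIAS]"),
        (alias, name, "[ALIAS]", "[NAME]"),
    ):
        match = re.search(
            re.escape(first) + r"(.*?)" + re.escape(second), snippet, flags=re.IGNORECASE
        )
        if match:
            # Keep the words between the two terms, with whitespace collapsed.
            middle = " ".join(match.group(1).split())
            return " ".join(part for part in (left, middle, right) if part)
    return None


def pattern_frequencies(snippets: Iterable[str], name: str, alias: str) -> Counter:
    """Count how often each extracted pattern occurs across the snippets."""
    patterns = (create_pattern(s, name, alias) for s in snippets)
    return Counter(p for p in patterns if p is not None)


# Made-up snippets for the name-alias pair used in the text.
snippets = [
    "Will Smith aka The Fresh Prince stars in the film",
    "The Fresh Prince is a nickname of Will Smith",
    "Will Smith aka The Fresh Prince returns",
]
freq = pattern_frequencies(snippets, "Will Smith", "The Fresh Prince")
```

Counting pattern occurrences across snippets gives a simple pattern frequency, which can then feed the evaluation of each pattern.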
Figure 4: A snippet returned by Google for the query “Rajnikanth, aka the Dancing
Maharajah”
Muthu(1995)[Actor Rajnikanth] ... aka "The
Dancing Maharaja" - India (English title)
2.10.2 Candidates Ranking
Considering the noise in web snippets, candidates extracted by shallow lexical patterns
might include some invalid aliases. Alias ranking is therefore
necessary to identify which of the candidates are most likely to be correct aliases of
a given name. Candidate ranking is done with ranking scores that measure the
association between a name and a candidate alias using three different approaches: (1)
lexical pattern frequency, (2) word co-occurrences in an anchor text graph, and (3) page
counts on the web. The ExtractPattern algorithm extracts over 8000 patterns for the 50 English
personal names in the data set used. However, not all patterns are equally informative
about aliases of a real name. Input query patterns are ranked according to their F-score to
identify the top k most effective patterns (Fig. 5). The F-score of a pattern s is computed as the
harmonic mean of the precision and recall of the pattern. The following figure
shows the ranking of different query patterns for generating lexical pattern output
using the English personal name dataset.
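Ranking patterns by F-score can be sketched as below. The pattern strings and their precision and recall values are invented for illustration and are not results from the paper.

```python
from typing import Dict, List, Tuple


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def top_k_patterns(stats: Dict[str, Tuple[float, float]], k: int) -> List[str]:
    """Return the k patterns with the highest F-score.

    stats maps each pattern to its (precision, recall) pair.
    """
    return sorted(stats, key=lambda p: f_score(*stats[p]), reverse=True)[:k]


# Hypothetical (precision, recall) values for three patterns.
stats = {
    "* aka [NAME]": (0.5, 0.5),
    "[NAME] aka *": (0.4, 0.2),
    "* and [NAME]": (0.05, 0.6),
}
best = top_k_patterns(stats, k=2)
```

The third pattern has high recall but very low precision, so its harmonic mean is small and it drops out of the top two.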
Figure 5: Top Lexical patterns with F-score measure
F-score improves as a result of the improvement in recall. This algorithm is also well suited
to extracting patterns written in languages other than English. Results showed that
it was quite successful for Italian and French, and patterns that contain
punctuation symbols also appear among the top 200 patterns used for measuring the
completeness and accuracy of this work. To measure the strength of association between
a name and a candidate alias, the following nine popular statistical measures were used to
rank candidate aliases: co-occurrence
frequency (CF), term frequency-inverse document frequency (tfidf), chi-squared
measure (CS), log-likelihood ratio (LLR), pointwise mutual information (PMI),
hypergeometric distribution (HG), cosine measure (cosine), overlap, and Dice coefficient. The
nine statistical ranking scores are integrated into a single ranking function using ranking
support vector machines (SVM). The paper used linear, quadratic, and radial basis function
(RBF) kernels for ranking SVM. Mean Reciprocal Rank (MRR)
and Average Precision (AP) [22] are used to evaluate the different approaches. The MRR of
this method, 0.67, is better than that of the previous method of Hokama and
Kitagawa [10].
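Mean Reciprocal Rank can be computed as in the following minimal sketch, which uses the standard definition: the average over names of the reciprocal rank of the first correct alias. The placeholder names and aliases are invented for illustration.

```python
from typing import Dict, List, Set


def mean_reciprocal_rank(rankings: Dict[str, List[str]], gold: Dict[str, Set[str]]) -> float:
    """Average of 1/rank of the first correct alias per name.

    A name whose ranked list contains no correct alias contributes 0.
    """
    total = 0.0
    for name, candidates in rankings.items():
        for rank, candidate in enumerate(candidates, start=1):
            if candidate in gold.get(name, set()):
                total += 1.0 / rank
                break
    return total / len(rankings) if rankings else 0.0


# Placeholder data: ranked candidate lists and the correct aliases for three names.
rankings = {"name_a": ["x", "a1"], "name_b": ["b1", "y"], "name_c": ["z"]}
gold = {"name_a": {"a1"}, "name_b": {"b1"}, "name_c": {"c1"}}
mrr = mean_reciprocal_rank(rankings, gold)
```

In this example the first correct alias appears at rank 2, at rank 1, and not at all, giving an MRR of (0.5 + 1 + 0) / 3 = 0.5.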
2.11 User Name Alias Extraction in Emails
Extracting user identity information from emails is one of the important research topics
in email mining. Most approaches extract an email user’s name only from the header of an
email, but much name information often appears in the body of emails, and
those names are usually more suitable for representing the sender’s or recipient’s
identity. Meijuan Yin and Junyong Luo [64] focus on the problem of extracting
email users’ name aliases from the body of plain-text emails. After locating and extracting
salutation and signature blocks from email bodies, the method identifies the potential aliases in the
salutation and signature lines, which can be directly associated with the corresponding
email addresses in email headers, by using named entity recognition (NER) tools. However,
the identified aliases may be incomplete, and some potential aliases may not
be correctly identified. The method therefore extracts aliases in the salutation and
signature lines, efficiently and accurately, based on a name boundary word template built on the characteristics of
the words neighboring aliases.
2.11.1 Salutation and Signature Blocks Locating Algorithm
The locating algorithm is based on statistical and rule-based restriction methods. The basic idea is to use the
statistical method to roughly estimate the number of lines in the salutation and signature
blocks, and then introduce restriction rules to refine the lines located by the
statistical method and extract the lines that actually belong to the salutation and signature
blocks.
The locating algorithm can extract salutation and signature lines simultaneously. Because it
combines the statistical method with the rule-based restriction method, the
algorithm can greatly improve locating efficiency and
achieve relatively high accuracy for the extracted blocks.
2.11.2 Definition of Name Boundary Word Templates
In the alias extraction system, after the salutation and signature
blocks have been located and extracted from email bodies, part-of-speech tagging tools are used to label the block texts and identify
candidate aliases. Relatively mature part-of-speech tagging tools exist for
different languages. For English emails, the method uses a well-known named entity recognition
tool in the English natural language processing field, the Named Entity Recognizer System
Version 1.1.1 of Stanford University (abbreviated to Stanford NER) [1]. Names
tagged by NER are enclosed in a pair of labels, “<PERSON>” and “</PERSON>”; the text between the pair of
labels is a person name. For example, a result tagged by NER is “<PERSON>Jim
Jarmusch</PERSON>”, and the string “Jim Jarmusch” between the label pair “<PERSON>”
and “</PERSON>” is an English person name. For Chinese text, the part-of-speech tagging label for a
person name is “/nr”, and it includes four sub-labels: “/nr1” for a Chinese family name,
“/nr2” for a Chinese given name, “/nrj” for a Japanese name, and “/nrf” for a
transliterated name. The Chinese characters before the label “/nr” are a potential person
name. By using the above named entity recognition tools, this method can identify most of
the formal names in salutation and signature blocks. However, names appearing in email
bodies are usually informal names such as anonyms, nicknames, short names, honorific
names and so on. As a result, candidate aliases identified using only named
entity recognition tools may be incomplete, and some aliases may be omitted altogether in the
tagging process. To effectively identify the aliases, this
method defines a general-purpose name boundary word template based on the features of
the words around names in email salutation and signature blocks, and then uses the template
to amend aliases identified by named entity recognition tools or discover new aliases
omitted by them.
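Reading names out of text that has already been tagged with the “<PERSON>” label pair can be sketched with a regular expression. The sketch does not call any NER tool; it only parses tagged output of the form shown above, and the example line is made up.

```python
import re
from typing import List

# Matches the text enclosed by one <PERSON>...</PERSON> label pair.
PERSON_TAG = re.compile(r"<PERSON>(.*?)</PERSON>")


def tagged_person_names(tagged_text: str) -> List[str]:
    """Return the strings enclosed in <PERSON>...</PERSON> label pairs."""
    return [name.strip() for name in PERSON_TAG.findall(tagged_text)]


# Made-up tagged salutation and signature text.
tagged = "Dear <PERSON>Jim Jarmusch</PERSON>, thank you. Regards, <PERSON>Ann</PERSON>"
names = tagged_person_names(tagged)
```

For the made-up line, `names` holds "Jim Jarmusch" and "Ann", which would then serve as candidate aliases.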
Figure 6: Process flow of alias extraction system
2.11.3 Alias Extraction Algorithm
The basic idea of the Name Boundary Word Template based Alias Extraction
Algorithm is as follows. If a name has been identified by NER tools in the email salutation
and signature blocks, the name boundary word template FNR1 is used directly to amend the
front and rear of the name and obtain the corresponding alias to be extracted. Otherwise,
when no name has been identified by NER tools, the name
boundary word template FNR2 is used to locate a word sequence n whose front and rear
boundary words can both be confirmed; the word sequence n is then the alias to be
extracted.
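The two cases can be sketched as follows. The contents of the templates FNR1 and FNR2 are not given here, so the boundary word sets, the tokenizer and the function `extract_alias` below are hypothetical stand-ins intended only to show the control flow.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical boundary words; the real templates FNR1 and FNR2 are not specified here.
FRONT_BOUNDARY = {"dear", "hi", "hello"}
REAR_BOUNDARY = {",", ":", "!"}


def tokenize(line: str) -> List[str]:
    """Split a line into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", line)


def extract_alias(line: str, ner_span: Optional[Tuple[int, int]] = None) -> Optional[str]:
    """Return the token sequence enclosed by boundary words.

    ner_span: (start, end) token indices of a name found by an NER tool, end exclusive.
    With a span, the name is widened to the nearest boundaries (the FNR1 case);
    without one, the first sequence between a front and a rear boundary is used
    (the FNR2 case).
    """
    tokens = tokenize(line)
    if ner_span is not None:
        start, end = ner_span
        stops = FRONT_BOUNDARY | REAR_BOUNDARY
        while start > 0 and tokens[start - 1].lower() not in stops:
            start -= 1
        while end < len(tokens) and tokens[end] not in REAR_BOUNDARY:
            end += 1
        return " ".join(tokens[start:end])
    for i, token in enumerate(tokens):
        if token.lower() in FRONT_BOUNDARY:
            for j in range(i + 1, len(tokens)):
                if tokens[j] in REAR_BOUNDARY:
                    return " ".join(tokens[i + 1:j]) or None
            return None
    return None
```

For example, if an NER tool tagged only "Jim" (token span (2, 3)) in the made-up line "Dear Uncle Jim, thanks", the first case widens the name to "Uncle Jim". For "Hi Jimbo, see you", where no name is tagged, the second case returns "Jimbo".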
III. Conclusion
In this paper, we have reviewed existing algorithms, their challenges, and the various methods
that were proposed earlier.
The proposed work is expected to have significant advantages in web extraction. The
extent to which these algorithms can be used is not limited. In future, alias extraction could
become a common tool for ‘non-celebrities’ and could be extended to support
different languages. It could also be used in interdisciplinary fields for retrieving
need-based information from readable corpora.
References
[1] R.Guha and Garg, “Disambiguating People in Search”, Technical Report,
Stanford University., 2004
[2] P.Cimiano, S.Handschuh, and S.Staab, “Towards the Self Annotating Web,” Proc.
Int’l World Wide Web Conf. (WWW’2004)
[3] Y.Matsuo, J.Mori, M.Hamasaki, K.Ishida, T.Nishimura,H.Takeda,K.Hasida, and
M.Ishizuka, “Polyphonet:An advanced Social Network Extraction
System,”Proc.WWW’ 2006
[4] G.Salton and C.Buckley, “Term-Weighting Approaches in Automatic Text
Retrieval,”Information Processing and Management” , 1988
[5] F.Smadja, “Retrieving Collocations from Text:Xtract” Computational
Linguistics, 1993
[6] A.Bagga and B.Baldwin, “Entity-Based Cross-Document Co-referencing using
the Vector Space Model,” Proc. Int’l conf. Computational Linguistics, 1998
[7] G.Mann and D.Yarowsky, “Unsupervised Personal Name Disambiguation,”
Proc.Conf. Computational Natural Language Learning, 2003
[8] R.Bekkerman and A.McCallum, “Disambiguating Web Appearances of People in
a Social Network,” Proc. Int’l World Wide Web Conf, 2005
[9] C.Galvez and F.Moya Anegon, “Approximate Personal Name-Matching through
Finite State Graphs,”Journal of American Society for Information Science and
Technology, 2007
[10] T.Hokama and Kitagawa , “Extracting Mnemonic names of People from the
Web,” Proc. Ninth Int’l conf.Asian Digital Libraries, 2006
[11] Danushka Bollegala, Taiki Honma, Yutaka Matsuo and Mitsuru Ishizuka,
“Mining for personal Name Aliases on the Web”, In proc. of WWW ‘2008
[12] J.Artiles, J.Gonzalo, and F.Verdejo, “A Testbed for People Searching Strategies in
the WWW,” Proc.SIGIR’05, 2005
[13] S.Sekine and J.Artiles, “Weps 2 Evaluation Campaign:Overview of the Web
People Search Attribute:Extraction Task,” Proc. second Web People Search
Evaluation Workshop (WePs ’09) at 18th Int’l World Wide Web conf,. 2009
[14] G.Salton and M.McGill, “Introduction to Modern Information Retrieval”,
McGraw-Hill, 1986
[15] M.Mitra, A.Singhal, and C.Buckley, “Improving Automatic Query Expansion,”
Proc. SIGIR ’98 1998
[16] M.Bilenko and R.Mooney, “Adaptive Duplicate Detection using Learnable String
Similarity Measures,” Proc. SIGKDD’03, 2003
[17] C.Manning and H.Schütze, “Foundations of Statistical Natural Language
Processing”, MIT Press, 1999
[18] K.Church and P.Hanks, “Word Association Norms, Mutual Information and
Lexicography,” Computational Linguistics 1991
[19] D.Bollegala, Y.Matsuo and M.Ishizuka, “Measuring Semantic Similarity between
Words using Web Search Engines,” Proc. Int’l World Wide Web Conf’, 2007
[20] T.Joachims, “Optimizing Search Engines using Clickthrough Data,” Proc. ACM
SIGKDD’02, 2002
[21] T.Kudo, K.Yamamoto, and Y.Matsumoto, “Applying Conditional Random Fields
to Japanese Morphological Analysis,” Proc. Conf.Empirical Methods in Natural
Language (EMNLP ’04), 2004
[22] R.Baeza-Yates and B.Ribeiro-Neto, “Modern Information Retrieval”, ACM
Press, 1999
[23] P.Mika, “Ontologies Are Us:A Unified Model of Social Networks and Semantics,”
Proc. Int’l Semantic Web Conf.(ISWC’05), 2005
[24] T.Dunning, “Accurate Methods for the Statistics of Surprise and Coincidence,”
Computational Linguistics, 1993
[25] M.Hearst, “Automatic Acquisition of Hyponyms from large Text Corpora,”
Proc.Int’l conf. Computational Linguistics,1992
[26] M.Berland and E.Charniak, “Finding Parts in Very Large Corpora,” Proc.Annual
Meeting of the Association for Computational Linguistics, 1999
[27] S.Chakrabarti, “Mining the Web: Discovering Knowledge from Hypertext Data”,
Morgan Kaufmann, 2003
[28] Danushka Bollegala, Yutaka Matsuo and Mitsuru Ishizuka, “Identifying People
on the Web through Automatically Extracted Key phrases”, In proc.of WWW
[29] G.Salton A.Wong and C.S.Yang, “A vector space model for Information
Retrieval”, Communications of the ACM, 1975
[30] J.D.M.Rennie and T.Jaakkola, “Using term informativeness for named entity
detection,” In Proc. of ACM SIGIR’05, 2005
[31] D.Bollegala, Y.Matsuo, and M.Ishizuka, “Extracting key phrases to disambiguate
personal names on the web”, In proc. CICLing 2006.
[32] Y.Li, Z.A.Bandar, and D.McLean, “An approach for measuring semantic similarity
between words using multiple information sources”, IEEE Transactions on
Knowledge and Data Engineering, 2003
[33] D.Beeferman and A.Berger, “Agglomerative clustering of a search engine Query
log”, In ACM SIGKDD, International conference on Knowledge Discovery and
Data Mining (KDD), 2000
[34] Dekang Lin, “Automatic retrieval and clustering of similar words”, 17th Int’l
conf. On computational Linguistics, 1998
[35] Danushka Bollegala , Yutaka Matsuo , Mitsuru Ishizuka, “Automatic Discovery
of Personal Name Aliases from the Web”, IEEE Transactions On Knowledge
and Data Engineering, 2011
[36] Gonenc Ercan, Ilyas Cicekli, “Using Lexical Chains for Keyword Extraction”.
[37] Tarique Anwar, Muhammed Abulaish and Khaled Alghathbar, “Web Mining for
Alias Identification: a First Step towards Suspect Tracking”
[38] S.Handschuh and S.Staab, “Authoring and annotation of web pages in CREAM”,
Proceedings of the 11th International World Wide Web Conference, WWW
2002, Honolulu, Hawaii, ACM 2002
[39] E.Agirre, O.Ansa, E.Hovy, and D.Martinez, “Enriching Very Large Ontologies
using the WWW. In. Proc. Of the first workshop on Ontology Learning 2000
[40] J.Golbeck and J.Hendler, “Accuracy of metrics for inferring trust and reputation
in semantic web-based social networks”, In Proc. EKAW 2004
[41] H.Kautz , B.Selman, and M.Shah, “The Hidden Web”, AI Magazine, 1997
[42] P.Mika, “Flink: Semantic Web technology for the extraction and analysis of
social networks”, Journal of Web Semantics, 2005
[43] A.Culotta, R.Bekkerman, and A.McCallum,” Extracting social networks and
contact information from email and the web”. In. CEAS- 1, 2004
[44] M.Harada, Sh.Sato, and K.Kazama, “Finding authoritative people from the Web”,
In.Proc. Joint Conference Digital Libraries(JCDL2004), 2004
[45] P.Knees, E.Pampalk, and G.Widmer, “Artist Classification with Web-based
data”, In. 5th International Conference on Music Information Retrieval(ISMIR),
2004
[46] M.Hamasaki, H.Takeda, I.Ohmukai, and R.Ichise, “Scheduling support system
for academic conferences based on interpersonal networks. In.Proc.ACM
Hypertext 2004.
[47] T.Nishimura, Y.Nakamura, H.Itoh, and H.Nakamura, “System design of event
space information support utilizing” In.Proc.IEEE ICDCS2004
[48] P.Turney, “Thumbs Up or Thumbs Down? Semantic Orientation Applied to
Unsupervised Classification of Reviews”, Proc. Assoc. for Computational
Linguistics (ACL) 2002
[49] J.Artiles, J.Gonzalo, and S.Sekine, “The SemEval-2007 WePS
evaluation: Establishing a benchmark for the Web people search task”, In
Proceedings of the Fourth International Workshop on Semantic Evaluations,
ACL 2007
[50] D.Kalashnikov, R.Nuray-Turan, and S.Mehrotra, “Towards breaking the quality
curse. A Web querying approach to web people search”, In Proc. Of Annual
International ACM SIGIR Conference, Singapore, July 2008.
[51] Z.Kozareva, R.Moraliyski, and G.Dias, “Web people search with Domain
Ranking”, In TSD ’08: Proceedings of the 11th International Conference on
Text, Speech and Dialogue, 2008.
[52] H.Saggion, “Experiments on semantic-based clustering for cross-document
coreference”, In International Joint Conference on Natural Language Processing,
2008
[53] M.Sanderson, “Ambiguous Queries: Test collections need more sense”, In SIGIR
’08: Proceedings of the 31st Conference on Research and Development in
Information Retrieval, USA 2008, ACM
[54] T.Hisamitsu and Y.Niwa, “Topic-Word Selection Based on Combinatorial
Probability”, Proc. Natural Language Processing (NLPRS ’01) 2001
[55] C.H.Gooi and J.Allan, “Cross-Document Co-reference on a Large scale Corpus”,
Technical report, Centre for Intelligent Information Retrieval, Department of
Computer Science, University of Massachusetts, 2004
[56] S.Sarawagi and A.Bhamidipaty, “Interactive de-duplication using active
learning”, In Proc. Of the Eighth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining(KDD-2002), Edmonton, Alberta 2002.
[57] S.Tejada, C.A.Knoblock, and S.Minton, “Learning domain-independent string
transformation weights for high accuracy object identification”, In Proc. of the
Eighth ACM SIGKDD International conference on Knowledge Discovery and
Data Mining, 2002
[58] Huang Xuanjing, Wu Lide, “Language-independent Text Categorization,” 2000
International Conference on Multilingual Information Processing, pp 37-43,
2000
[59] Yiming Yang, Xin Liu, “A re-examination of text categorization methods,”
Proceedings of ACM SIGIR Conference on Research and Development in
Information Retrieval (SIGIR), 1999, pp 42-49
[60] Tao Jiang, Ke Wang, Ah-Hwee Tan, “Mining Generalized Associations of
Semantic Relations from Textual Web Content”, IEEE Transactions on
Knowledge and Data Engineering, Volume 19 Issue 2, February 2007
[61] Jian Zhang, Yiming Yang, “Robustness of regularized linear classification
methods in text categorization,” In Proceedings of SIGIR 2003. The Twenty-
Sixth Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval.
[62] Y.Yao, “Measuring retrieval effectiveness based on user preference of
documents”, Journal of the American Society for Information science, 1995
[63] T.Joachims, “Learning to Classify Text Using Support Vector Machines:
Methods, Theory, and Algorithms”, 2002
[64] Meijuan Yin, Junyong Luo, “User Name Alias Extraction in Emails”, Journal of
MECS, 2011