Optimizing Enterprise Search by Automatically Relating
User Context to Textual Document Content
Matthias Reichhold
Universität Klagenfurt
Universitätsstraße 65-67
A-9020 Klagenfurt
matreich@edu.uni-klu.ac.at
Jörg Kerschbaumer
Universität Klagenfurt
Universitätsstraße 65-67
A-9020 Klagenfurt
joerg.kerschbaumer@edu.uni-klu.ac.at
Ao.Univ.-Prof. Mag. Dr.
Günther Fliedl
Universität Klagenfurt
Universitätsstraße 65-67
A-9020 Klagenfurt
guenther.fliedl@aau.at
ABSTRACT
It is widely agreed that information retrieval (IR) systems benefit
enormously from considering not only the user’s query but also
contextual data. In enterprise IR systems corporate knowledge
bases and additional manually triggered information about users
are normally taken to obtain such contextual data.
In this paper we propose a solution for role-specific search in
enterprise environments without the need of manual
administration of mappings between roles and documents. We
include collaboratively constructed knowledge engineering
systems for computing similarity measures between user role
attributes and relevant information snippets in enterprise
documents.
Our approach suggests optimizing such enterprise search systems
by a role-sensitive ranking algorithm that relates contextually-
derived information needs of users to unstructured (textual) data
in documents. Hence we introduce a linguistic concept for
generating role-describing word vectors based on query (search)
histories and corporate knowledge base generation.
The Introduction outlines some basic ideas concerning the major
areas of enterprise search, some relevant differences between web
search and enterprise search. Subsequently we sketch our
optimized enterprise search model.
In Chapter 2 some theoretical background and Related Work is
briefly discussed. Chapter 3 depicts some linguistically relevant
details of our proposed model. We discuss our concept of User
Roles, Role Term Vectors, some approaches for Role Term
Extraction and Term Extraction incorporating knowledge bases
and query histories. In Chapter 4 we describe our ranking
mechanism, the re-ranking strategy and the method for Role
Relevance Scoring. Chapter 5 gives a conclusion of the work as
well as an outlook on future work.
Categories and Subject Descriptors
H.3.3 [Information storage and retrieval]: Information Search
and Retrieval – retrieval models, search process. H.3.1
[Information storage and retrieval]: Content Analysis and
Indexing – Linguistic processing.
General Terms
Algorithms
Keywords
Enterprise search, Enterprise search ranking, Enterprise search
optimization, user context, user role, role-sensitive ranking,
context-sensitive search
1. INTRODUCTION
The amount and complexity of data employees in companies are
faced with nowadays is increasing rapidly. In addition, the
majority of this data is unstructured (textual data) making search
even harder as shown by Huang [1]. Hence, information retrieval
systems meeting these special requirements (enterprise search
engines) are becoming more and more important (see Dmitriev et
al [2]). Furthermore, [2] also state that, in contrast to web search, only
very limited attention has been paid to this research area so far.
But there are many differences between these types of systems. As
also stated by Demartini [3], Hawking [4] identifies three major
areas an enterprise search system covers:
(1) search of the organisation’s external website
(2) search of the organisation’s internal website (its
intranet)
(3) search of other electronic text held by the organisation
in the form of email, database records, documents on
file shares etc.
According to Demartini [3], one important difference between
information retrieval systems for companies (Enterprise Search
Systems) and for web search is that much more information about
the searching user is available to the former one due to the fact
that in enterprise environments a user is a known employee who
has a specific role. Roles can be derived from certain job-related
user properties (e.g. job title, function, department, etc.) or are
already managed in IT systems like directory services, HR
systems, etc.
Demartini [3] also points out that current search systems do not
consider this role context although “different roles (like
manager, IT, software developers) with the same query have
different information needs […] and a ES system should exploit
this information”.
According to Shen et al [5], most existing information retrieval
systems still use only the actual query and document data to find
relevant information and do not consider any contextual
information.
Moreover, [5] note on page 1 that “from a single query, however,
the retrieval system can only have very limited clue about the
user’s information need. An optimal retrieval system thus should
try to exploit as much additional context information as possible
to improve retrieval accuracy, whenever it is available.” The
importance of user context is also stated by e.g.
Hawking [4], Navrat et al [6] and Schmidt et al [7].
Besides that, we know from [2, 8] that users with similar roles in
corporate environments often search for similar documents,
because they are interested in information belonging to the same
domain or on related topics, and thus their information needs are
more comparable than those of other users. The work of Rosen-Zvi
et al [9, 10] also shows that IR systems benefit significantly from
considering contextual information about enterprise users.
Our enterprise search approach includes user related context
information and combines it with linguistically enhanced
document analysis.
Figure 1 provides an overview of our approach for optimizing
enterprise search: every user is assigned to a user role which has
one Role Term Vector RTr related to it. When a user sends a query
to the search engine, it creates a ranked result set (the Original
Rank Od). Our Role-sensitive Ranking algorithm merges Od with
the so-called Role Rank Rd and thus obtains a role-sensitive
Merged Rank Md which is presented to the user as the optimized
search result.
The Role Rank in turn is computed by a special Role Relevance
Scoring module based on the document contents on the one hand
and the RTr on the other hand. The relevance score is calculated
using the cosine similarity measure [27], measuring the similarity
between a document d and RTr. A high similarity between d and
RTr indicates a high relevancy of d for all users with the role
related to RTr, while a low similarity value
indicates low relevancy. Documents with higher relevancy scores get
a better role rank Rd, leading to a better overall rank Md in the end.
Accordingly, documents with lower relevancy scores will end up
with a worse Md.
2. BACKGROUND & RELATED WORK
Information retrieval systems were first developed more
than 50 years ago and, with the rise of the World Wide Web,
research efforts have (not very surprisingly) focused very much
on web/internet search (Dignum et al [11]). But as is argued
in [11, 12], retrieval methods delivering good performance for
internet search do not necessarily deliver equally good results in
enterprise environments, which is largely due to the different
structure of intranets compared to the public internet [2, 4].
Ranking algorithms successful on the web like HITS [13] or
PageRank [14] suffer from poor or missing linkage structure [15,
16] in enterprise document repositories and therefore perform less
well in corporate environments [11, 17]. Additional challenges for
enterprise search systems according to [17] are “high redundancy
(many versions of the same document)” and “notational
heterogeneity (synonyms) distort[ing] the search results”.
Figure 1: The architecture of our role-sensitive enterprise
search model and its components
Another characteristic about enterprise search is the fact that users
first have to spend a lot of time and effort to get familiar with the
domain specific concepts and terminology used in the enterprise
environment in order to be able to submit relevant query strings
for a search system [11]. Due to space limitations in this work we
refer to the paper of Mukherjee et al [18] for further description of
challenges and differences regarding enterprise search.
In recent years, however, there has been a lot of research
on using contextual information like explicit feedback (e.g.
relevance feedback, tagging, labelling), implicit feedback (such as
query history and clickthrough history), user profiles, etc. to
personalize and thereby improve retrieval systems. While the
focus of the above-mentioned research was clearly on web (or
internet) search, only a few studies deal with
contextual information for enterprise search systems
[2]. This is rather surprising to us since existing work shows
promising results: [5] have achieved significant improvements
on enterprise search using implicit feedback, and the approach of
Kohn et al [17, 19] is proposed to be superior to standard
search engines in the company environment due to the
introduction of some simple principles like personalized ranking
based on a user’s role and organizational embedding, automatic
classification of documents by using domain knowledge and
learning from search history.
Despite the promising results, Kohn et al also note the main
deficit of their system: role ontologies and mapping rules
between ontologies and document metadata have to be managed
manually and are therefore very costly to maintain.
Another approach to optimize IR systems by relating user profiles
and document data is presented by Rosen-Zvi et al [9, 10]. They
introduce an “author-topic model” which is an extension of the
well-known Latent Dirichlet Allocation [20], which derives author
interests from document data based on probability distributions
and thus can exploit relations between users, documents and
topics. Our proposal presupposes these ideas about role-reflecting
ranking improvements but uses a different approach to relate user
context and document data.
As mentioned before, the integration of user context and the
personalization of enterprise search are current key research areas
[4, 7, 21], and ontology-based approaches in particular have
drawn a lot of attention recently. For example, the work of
Solskinnsbakk et al [22] introduces an “ontology profile” representing
a weighted vector-based description for each ontology concept. [22] use
these ontology profiles to expand queries submitted to a
search engine. Their experiments show promising results and “a
generally better performance than the baseline”.
3. USER ROLES
As mentioned above, considering user context plays a very
important role for further improvement of enterprise search
systems. But current systems often present the same search results
for a certain query to all users not respecting that the information
needs may differ considerably for different people [23]. Moreover
enterprise search systems have to cope with the fact that most of
the submitted queries are very short and ambiguous [23] making it
more or less impossible for the search system to derive the user’s
information need.
Every employee has different information needs depending on
certain properties (function, job description, department, location,
etc.). Similar properties can be consolidated into roles. Thus we
propose the use of explicit user roles which are defined company-
wide and are assigned to each employee. These roles represent the
long-term user context (e.g. “Controlling”, “Procurement”, etc.)
and therefore indicate the differing user information needs (e.g.
role “Engineering” vs. role “Marketing”). The definition of the
roles to be used in the company as well as the mapping between
certain users and roles is handled by a role expert.
3.1 Role Term Vectors
User context can be represented by the concept of “user roles”. Role
Term Vectors can be used to reflect the information needs of
employees and to obtain Role Relevance Scores indicating the
relative importance of documents for different employees.
We attach a Role Term Vector RTr to each role, which contains
weighted words (terms) that describe the role and which is used to
relate the role to the content of documents.
The examples stated below show (1) a Role Term Vector assigned
to the role “Marketing” and (2) a vector assigned to the role
“Engineer”, with weight 1 for all terms.
(1) RTMarketing =
{(“marketing”,1), (“revenue”,1), (“intake”,1),
(“engine”,1)}
(2) RTEngineer = {(“engine”,1), (“combustion”,1),
(“composite”,1)}
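The two example vectors can be written down directly as term-to-weight mappings. The following is only a minimal sketch in Python; the type alias `RoleTermVector` and the helper `shared_terms` are illustrative names that do not appear in the paper.

```python
from typing import Dict, List

# Assumption: a Role Term Vector is modelled as a mapping from term to weight.
RoleTermVector = Dict[str, float]

# The two example vectors from the text, with weight 1 for all terms.
rt_marketing: RoleTermVector = {
    "marketing": 1.0, "revenue": 1.0, "intake": 1.0, "engine": 1.0,
}
rt_engineer: RoleTermVector = {
    "engine": 1.0, "combustion": 1.0, "composite": 1.0,
}


def shared_terms(a: RoleTermVector, b: RoleTermVector) -> List[str]:
    """Return the terms that occur in both vectors (hypothetical helper)."""
    return sorted(set(a) & set(b))
```

In this example only the term "engine" is shared, so the two roles can still be told apart by the remaining terms and, once weights differ, by the weights themselves.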
The use of Role Term Vectors enables our model to find relations
between documents and user roles and evaluate the relevance of a
document for a certain role. Every Role Term Vector consists of a
number of weighted terms that influence the relevancy scoring
heavily. Therefore extraction and weighting of the relevant terms
is a very crucial task.
3.2 Approaches to Role Term Extraction
In the following, we describe two approaches to role term
extraction, one manual and one semi-automatic, and argue for
adopting them partially in our model.
A rather simple but cumbersome possibility for defining role
terms is a centralized and completely manual task in which a role
expert assigns relevant terms to roles. Such a manual task is of
course very time-consuming and inflexible. It also requires one (or
more) role experts with domain- and company-specific knowledge
about roles and relevant terms. If, on the other hand,
such resources are available in a company, they can create very
valuable input. Hence we propose to use manual term extraction
in the form of black lists (terms that have to be excluded) and
white lists (terms that the vector must include) to extend the
automatic processing step.
Secondly, we introduce a semi-automatic approach in which every
user in the company maintains a personal list of keywords
relevant for his/her work. We then collect the keywords entered by
the users, group them by user role and use those keywords to build
up the Role Term Vector. The advantage of this approach
compared to the first one is that we do not depend on role experts
and their personal knowledge any longer. Instead, we get
immediate and direct feedback about what is relevant since the
people actually responsible for the role terms are also the ones
using the search engine. Still, manual work has to be done in order
to be able to get the relevant terms. Thus we present a third
approach using corporate knowledge bases (enterprise wikis) and
the query (search) history of the users in order to minimize the
manual efforts needed for role term extraction.
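The keyword-based approach, combined with the black and white lists proposed above, could be sketched roughly as follows. This is an illustration under assumptions the paper does not state: a term's weight is taken to be the number of users of the role who listed it, white-listed terms missing from the keyword lists get weight 1, and the black list takes precedence over the white list. The function name, the user names and the sample keywords are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable


def build_role_term_vector(
    user_keywords: Dict[str, Iterable[str]],
    user_roles: Dict[str, str],
    role: str,
    white_list: FrozenSet[str] = frozenset(),
    black_list: FrozenSet[str] = frozenset(),
) -> Dict[str, int]:
    """Group user-maintained keywords by role and weight them by user count."""
    counts: Counter = Counter()
    for user, keywords in user_keywords.items():
        if user_roles.get(user) != role:
            continue
        # Assumption: each user contributes at most 1 to a term's weight.
        counts.update({keyword.lower() for keyword in keywords})
    for term in white_list:
        # White-listed terms must be present; assumed default weight is 1.
        if term not in counts:
            counts[term] = 1
    for term in black_list:
        # Black-listed terms are always removed (assumed to win over the white list).
        counts.pop(term, None)
    return dict(counts)


# Hypothetical sample data.
user_keywords = {
    "anna": ["Engine", "combustion"],
    "ben": ["engine", "canteen"],
    "carl": ["revenue"],
}
user_roles = {"anna": "Engineer", "ben": "Engineer", "carl": "Marketing"}

rt_engineer = build_role_term_vector(
    user_keywords,
    user_roles,
    "Engineer",
    white_list=frozenset({"composite"}),
    black_list=frozenset({"canteen"}),
)
```

Other weighting schemes (for example normalizing by the number of users per role) would fit the description equally well; the paper leaves this open.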
3.3 Term Extraction incorporating
Knowledge Bases and Query Histories
Wikis are a popular form of knowledge management systems in
public (e.g. Wikipedia) as well as within companies (“Enterprise
Wikis”). They can be seen as semantic graphs consisting of two
different types of nodes:
(1) Concept nodes containing the actual content (e.g.
description of domain or company specific
abbreviations) as well as links to other nodes and
(2) Category nodes building up a hierarchical system of
overlapping trees in which every category can have one
or more sub categories and also one or more parent
categories.
Every concept node can be assigned to one or more category
nodes.
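The graph structure described above can be sketched with two small node types. This is a minimal illustration, not the data model of any particular wiki engine; the class names, the helper functions and the category names "finance" and "sales" are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class CategoryNode:
    name: str
    parents: Set[str] = field(default_factory=set)   # a category may have several parents
    children: Set[str] = field(default_factory=set)  # and several sub categories


@dataclass
class ConceptNode:
    title: str
    content: str
    categories: Set[str] = field(default_factory=set)  # assigned category nodes
    links: Set[str] = field(default_factory=set)        # links to other nodes


def add_subcategory(categories: Dict[str, CategoryNode], parent: str, child: str) -> None:
    """Register child as a sub category of parent, creating nodes as needed."""
    categories.setdefault(parent, CategoryNode(parent)).children.add(child)
    categories.setdefault(child, CategoryNode(child)).parents.add(parent)


def subgraph(categories: Dict[str, CategoryNode], root: str) -> Set[str]:
    """Return root together with all categories reachable below it."""
    seen: Set[str] = set()
    stack = [root]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(categories[name].children)
    return seen


# Hypothetical example: one category with two parents (overlapping trees).
cats: Dict[str, CategoryNode] = {}
add_subcategory(cats, "finance", "revenue forecasts")
add_subcategory(cats, "sales", "revenue forecasts")
sales_pipe = ConceptNode("sales pipe", "Expected deals ...", {"revenue forecasts"})
```

Because a category may have several parents, the traversal tracks visited nodes so that shared sub categories are only expanded once.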
For our approach we additionally assign each of the user roles to at
least one category within an available enterprise wiki.
Furthermore we use the query (search) history of the users to
identify term candidates. For a query q from a user u we first need
to perform some linguistic pre-processing steps (tokenization,
chunking, stemming and lemmatization and collocation finding)
in order to get an appropriate term candidate c. Linguistic
pre-processing is a non-trivial task that plays a rather important
role and thus needs a lot of attention.
Still, this issue is out of scope for this paper. For further
information we refer to Hassler & Fliedl [24].
Next, the system searches in the enterprise wiki for a concept or
category node corresponding to c. If no such node is found, c is
rejected, but if a node exists the system checks whether it is in
one of the sub graphs of the categories mapped to the role
assigned to u. Only if c is found in one of the sub graphs is it
added as a new term to the Role Term Vector RT. If an entry for c
already exists in RT, the weight of this entry is increased.
This mechanism ensures that only terms relevant for the user’s
work are included in the role term vector of that user. For
example: user u searches for “sales pipe”. Furthermore u is
mapped to the role “mechanical engineer”. In the enterprise wiki a
concept node for “sales pipe” exists which is assigned to the
category “revenue forecasts”. No node for “sales pipe” is found in
the sub graphs of any of the categories assigned to role
“mechanical engineer”, since “mechanical engineer” is not
mapped to the category “revenue forecasts” or any of its parent
categories. Consequently “sales pipe” is considered not relevant
for role “mechanical engineer” and thus not included in its role
term vector. If u were mapped to the role “account manager”
instead and if the role “account manager” were assigned to
the category “revenue forecasts”, the term “sales pipe” would be
added to the role term vector of the role “account manager”.
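The acceptance check and the “sales pipe” example can be sketched as follows. The sketch makes several simplifying assumptions that are not taken from the paper: the wiki is reduced to plain dictionaries, a term candidate is matched to a node by its exact title, a node lies in the sub graph of a category if that category is one of the node's categories or one of their transitive parents, and the weight is increased by 1 per accepted query. The category names "finance" and "engineering" are hypothetical.

```python
from typing import Dict, Set


def with_ancestors(category: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the category together with all of its transitive parent categories."""
    seen: Set[str] = set()
    stack = [category]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def update_role_term_vector(
    candidate: str,
    role: str,
    role_term_vector: Dict[str, int],
    node_categories: Dict[str, Set[str]],
    parents: Dict[str, Set[str]],
    role_categories: Dict[str, Set[str]],
) -> bool:
    """Add the candidate (or raise its weight) if its node lies in a role sub graph."""
    if candidate not in node_categories:
        return False  # no concept or category node for the candidate: reject
    reachable: Set[str] = set()
    for category in node_categories[candidate]:
        reachable |= with_ancestors(category, parents)
    if not reachable & role_categories.get(role, set()):
        return False  # node exists, but outside the sub graphs of the role
    # Assumption: the weight grows by 1 each time the term is accepted.
    role_term_vector[candidate] = role_term_vector.get(candidate, 0) + 1
    return True


# The example from the text, with hypothetical extra categories.
node_categories = {"sales pipe": {"revenue forecasts"}}
parents = {"revenue forecasts": {"finance"}}
role_categories = {
    "mechanical engineer": {"engineering"},
    "account manager": {"revenue forecasts"},
}
rt_mechanical_engineer: Dict[str, int] = {}
rt_account_manager: Dict[str, int] = {}

accepted_for_engineer = update_role_term_vector(
    "sales pipe", "mechanical engineer", rt_mechanical_engineer,
    node_categories, parents, role_categories,
)
accepted_for_manager = update_role_term_vector(
    "sales pipe", "account manager", rt_account_manager,
    node_categories, parents, role_categories,
)
```

With this data the candidate is rejected for the role “mechanical engineer” and accepted for the role “account manager”, mirroring the example above.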
Using this approach, the manual effort for role term extraction can
be reduced significantly compared to the two methods discussed
earlier in this section. Still, this approach is also not yet fully
automated since the mapping between user roles and wiki
categories has to be done by hand. Automating and further
optimizing this procedure is an interesting area for future work.
4. ROLE-SENSITIVE RANKING
In order to optimize the results of an enterprise search
engine based on user roles, we introduce a role-sensitive ranking
algorithm that re-ranks the original result set returned by the
enterprise search engine according to the role relevance, which
reflects the relevance of a result to a searching user with a specific
role. The actual re-ranking function is derived from the work of
Agichtein et al [25]¹ and adapted to our requirements as follows:
$$S_M(d) = w_1 \cdot \frac{1}{R_d + 1} + \frac{1}{O_d + 1}$$
¹ Agichtein et al evaluated many different approaches and found
that “a simple rank merging heuristic combination works well
and is robust to variations in score values from original
rankers”.
For every document d within the original result set a merged score
SM is computed based on the document’s original rank Od and the
role rank Rd obtained from the document’s Role Relevance Vector.
A Role Relevance Vector exists for every document and specifies
the relevance of that document to every role defined in
the company. The specific characteristics of Role Relevance
Vectors are described in the section on Role Relevance Scoring.
As proposed by [25] we also use the weight w1 as a factor for
scaling the “relative importance” of the role relevance compared
to the original rank.
4.1 Re-ranking Search Results using the
Merged Rank
Table 1 shows examples for the computation of the merged score
SM and the merged rank RM obtained from it; RM is used to
re-rank the results presented to the searching user. In the first case
(w1 = 1) the original rank and the role rank are equally important,
leading to a completely new order of the result documents.
Increasing w1 favors the role rank over the original
rank; above a certain value, only the role rank is decisive. Case 2 (w1
= 100) in Table 1 shows that RM equals Rd as a
result of a very high value for w1. Conversely, a very small
value for w1 causes the role rank to be ignored (RM in case 3,
w1 = 0,01, equals Od).
Table 1: Example of role-sensitive ranking using different
weights

                w1 = 1        w1 = 100       w1 = 0,01
d    Od   Rd    SM      RM    SM       RM    SM      RM
d1   1    4     0,700   2     20,500   4     0,502   1
d2   2    3     0,583   3     25,333   3     0,336   2
d3   3    1     0,750   1     50,250   1     0,255   3
d4   4    5     0,367   5     16,867   5     0,202   4
d5   5    2     0,500   4     33,500   2     0,170   5
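Every value in Table 1 is consistent with a merged score of the form SM = w1 / (Rd + 1) + 1 / (Od + 1), i.e. the rank merging heuristic of [25] with the role rank in place of the implicit-feedback rank. The following sketch assumes exactly this form and recomputes the merged ranks; the function names are illustrative, and ties (which do not occur in the example) would be broken by input order.

```python
from typing import Dict


def merged_score(original_rank: int, role_rank: int, w1: float) -> float:
    """Merged score SM, assuming SM = w1 / (Rd + 1) + 1 / (Od + 1)."""
    return w1 / (role_rank + 1) + 1 / (original_rank + 1)


def merged_ranks(original: Dict[str, int], role: Dict[str, int], w1: float) -> Dict[str, int]:
    """Order documents by descending merged score and return their merged rank RM."""
    scores = {doc: merged_score(original[doc], role[doc], w1) for doc in original}
    ordered = sorted(scores, key=lambda doc: scores[doc], reverse=True)
    return {doc: position for position, doc in enumerate(ordered, start=1)}


# Original ranks Od and role ranks Rd from Table 1.
original = {"d1": 1, "d2": 2, "d3": 3, "d4": 4, "d5": 5}
role = {"d1": 4, "d2": 3, "d3": 1, "d4": 5, "d5": 2}

rm_balanced = merged_ranks(original, role, w1=1)      # new order
rm_role_only = merged_ranks(original, role, w1=100)   # follows the role rank
rm_original = merged_ranks(original, role, w1=0.01)   # follows the original rank
```

The three calls correspond to the three cases of Table 1: with w1 = 100 the merged rank coincides with Rd, with w1 = 0,01 it coincides with Od.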
4.2 Role Relevance Scoring
As already mentioned, our approach uses the information
about the specific role a user (employee) plays in a company and
generates a Role Term Vector for each role describing it in the
form of a weighted term list. In this section we describe our
approach to relate a document to a role using Role Term Vectors.
For every single document in the company’s document collection
we create a Role Relevance Vector
RRd = {RSr1, RSr2, …, RSrn}
containing a Role Relevance Score RSr for each role r defined in the
company, where RSr is calculated as the cosine similarity
between the vector representation Td of a document d and a Role
Term Vector RTr of a role r:
$$RS_r = \cos(T_d, RT_r) = \frac{T_d \cdot RT_r}{\lVert T_d \rVert \, \lVert RT_r \rVert}$$
Cosine similarity is a widely used measure to determine the
similarity between two vectors. A result equal to 1 indicates that
the angle between the two vectors is 0 and that they therefore
point in the same direction. On the other hand, a result equal to
-1 means that the vectors are pointing in opposite directions.
The length of the vector does not influence the similarity value.
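The properties just listed can be checked with a small implementation of cosine similarity over sparse term vectors. This is a generic sketch; representing vectors as term-to-weight dictionaries and returning 0 for an all-zero vector are conventions chosen for this example, not something the paper prescribes.

```python
import math
from typing import Dict

Vector = Dict[str, float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two sparse vectors given as term -> weight."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


v = {"x": 1.0, "y": 2.0}
same_direction = cosine_similarity(v, {"x": 3.0, "y": 6.0})      # scaled copy of v
opposite_direction = cosine_similarity(v, {"x": -1.0, "y": -2.0})
no_shared_terms = cosine_similarity(v, {"z": 5.0})
```

Scaling a vector leaves the result unchanged, which is why only the relative term weights of a document or role matter, not the absolute vector length.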
In order to be able to use this similarity measure for comparing a
role term vector with a textual document we also need to represent
the textual content of a document as a weighted term
vector, where the weight is the well-known tf-idf
(term frequency–inverse document frequency) score. Words and
multiwords are filtered out with respect to their weight in a
certain domain. For managing this task we also use linguistic
strategies like co-occurrence determination and dependency
parsing [26]. Weighting keyword collocations is one of the most
important tasks in the workflow triggered by our model. A more
detailed description of cosine similarity and the tf-idf score
can be found e.g. in Chim [27]. The example in
Table 2 shows tf-idf scores for the documents d1 and d2 and the
Role Term Vectors RTMarketing and RTEngineer which were already
introduced in section 3.1. tf-idf values are obtained from the term
frequency tf and the inverse document frequency idf:
$$\text{tf-idf}_{t,d} = tf_{t,d} \cdot idf_t, \qquad idf_t = \log_{10} \frac{N}{df_t}$$
where N is the total number of documents in the collection and
df_t is the number of documents that contain the term t.
The value increases with the number of times a term occurs in a
vector (tf) and decreases with the number of documents in the
company’s document collection that contain the term (df), since a
higher df lowers idf. For this example we used a total number of
documents of 10.
Table 2: tf-idf scores for different documents and role terms

                         d1            d2            RTMarketing   RTEngineer
Terms        df   idf    tf   tf-idf   tf   tf-idf   tf   tf-idf   tf   tf-idf
marketing    3    0,52   9    4,71     0    0,00     1    0,52     0    0,00
revenue      4    0,40   5    1,99     0    0,00     1    0,52     0    0,00
intake       2    0,70   7    4,89     1    0,52     1    0,52     0    0,00
engine       6    0,22   1    0,22     6    3,14     1    0,52     1    0,52
combustion   2    0,70   0    0,00     4    2,09     0    0,00     1    0,52
composite    3    0,52   2    1,05     8    4,18     0    0,00     1    0,52
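The idf column of Table 2 is consistent with a base-10 logarithm of N / df for N = 10 documents. Assuming this form, the short sketch below recomputes the idf column and the tf-idf values of the d1 column; the function names are illustrative.

```python
import math


def idf(document_frequency: int, total_documents: int) -> float:
    """Inverse document frequency, assuming idf = log10(N / df)."""
    return math.log10(total_documents / document_frequency)


def tf_idf(term_frequency: int, document_frequency: int, total_documents: int) -> float:
    """tf-idf weight as the product of term frequency and idf."""
    return term_frequency * idf(document_frequency, total_documents)


N = 10  # total number of documents used in the example

# Document frequencies (df) from Table 2 and term frequencies (tf) of d1.
df = {"marketing": 3, "revenue": 4, "intake": 2, "engine": 6, "combustion": 2, "composite": 3}
tf_d1 = {"marketing": 9, "revenue": 5, "intake": 7, "engine": 1, "combustion": 0, "composite": 2}

idf_column = {term: round(idf(df[term], N), 2) for term in df}
d1_column = {term: round(tf_idf(tf_d1[term], df[term], N), 2) for term in df}
```

A term that occurs in every document would get idf = log10(1) = 0 and therefore contribute nothing to any similarity score, which is the intended effect of the idf factor.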
Calculated as cosine similarity over tf-idf vectors, RSr ranges from 0
(indicating no similarity/relevancy) to 1 (maximum
similarity/relevancy). Values smaller than 0 are not possible since
the tf-idf score cannot be negative.
Table 3 shows that RTMarketing has a higher score for d1 than for d2,
leading us to the conclusion that d1 is more relevant for users
assigned to the role “Marketing” than d2 and should therefore get
a higher role rank. On the other hand, d2 is more relevant for
employees with the role “Engineer” than for those with
“Marketing”.
Table 3: Role relevance scores RS for documents d and role
term vectors RT

RS    d2       RTMarketing   RTEngineer
d1    0,1885   0,8254        0,1023
d2    1        0,3236        0,9608
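The scores in Table 3 can be recomputed from the tf-idf columns of Table 2. The sketch below takes those columns exactly as printed (rounded to two decimals), so its results agree with Table 3 only up to small rounding differences. The function names and the dictionary representation are assumptions made for this example.

```python
import math
from typing import Dict

Vector = Dict[str, float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse term -> weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


def role_relevance_vector(document: Vector, role_term_vectors: Dict[str, Vector]) -> Dict[str, float]:
    """Role Relevance Vector RRd: one Role Relevance Score per role."""
    return {role: cosine_similarity(document, rt) for role, rt in role_term_vectors.items()}


# tf-idf columns of Table 2 as printed (zero entries omitted).
d1 = {"marketing": 4.71, "revenue": 1.99, "intake": 4.89, "engine": 0.22, "composite": 1.05}
d2 = {"intake": 0.52, "engine": 3.14, "combustion": 2.09, "composite": 4.18}
role_term_vectors = {
    "Marketing": {"marketing": 0.52, "revenue": 0.52, "intake": 0.52, "engine": 0.52},
    "Engineer": {"engine": 0.52, "combustion": 0.52, "composite": 0.52},
}

rr_d1 = role_relevance_vector(d1, role_term_vectors)
rr_d2 = role_relevance_vector(d2, role_term_vectors)
d1_vs_d2 = cosine_similarity(d1, d2)
```

Since all role terms carry the same weight here, each score effectively measures how much of a document's tf-idf mass falls on the terms of the role.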
Based on the Role Relevance Scores we can build up the Role
Relevance Vectors RRd for each document in the entire collection.
For our example those vectors would look as follows:
(1) RRd1 = {0,8254; 0,1023}
(2) RRd2 = {0,3236; 0,9608}
On the basis of these values we can obtain the role rank Rd of each
document for a given Role Term Vector RTr and its related role as
shown in the example in Table 4.
Table 4: Obtaining role rank Rd from Role Relevance Score
RSd

      RTMarketing    RTEngineer
d     RSd      Rd    RSd      Rd
d1    0,8254   1     0,1023   5
d2    0,3236   5     0,9608   1
d3    0,7502   2     0,7009   2
d4    0,3671   4     0,5832   3
d5    0,5008   3     0,2551   4
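Deriving the role rank from the scores amounts to sorting the documents by descending Role Relevance Score. A minimal sketch, using the values of Table 4; the function name is illustrative, and breaking ties by input order is an assumption since the example contains no ties.

```python
from typing import Dict


def role_ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Assign rank 1 to the document with the highest Role Relevance Score."""
    ordered = sorted(scores, key=lambda doc: scores[doc], reverse=True)
    return {doc: rank for rank, doc in enumerate(ordered, start=1)}


# Role Relevance Scores RSd from Table 4.
rs_marketing = {"d1": 0.8254, "d2": 0.3236, "d3": 0.7502, "d4": 0.3671, "d5": 0.5008}
rs_engineer = {"d1": 0.1023, "d2": 0.9608, "d3": 0.7009, "d4": 0.5832, "d5": 0.2551}

rd_marketing = role_ranks(rs_marketing)
rd_engineer = role_ranks(rs_engineer)
```

The resulting ranks are the Rd columns of Table 4 and serve as the role rank input of the merged score.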
The role rank is then incorporated by the role-sensitive ranking
algorithm and combined with the original rank of the enterprise
search engine.
5. CONCLUSION & FUTURE WORK
In this paper we described a solution for role-specific search in
enterprise environments based on some computational linguistics
methods for term vector preparation and generation. In our
approach we propose to optimize such enterprise search systems
by a role-sensitive ranking algorithm that relates contextually-
derived information needs of users to unstructured (textual) data
in documents.
We have presented a model that incorporates
(1) contextual information of enterprise users like user roles
and search history as well as
(2) collaboratively constructed enterprise knowledge
management systems
to automatically identify role-based relationships between users
and unstructured enterprise content. We also claimed that such
relationships can lead to a significant improvement of enterprise
search when utilized by a role-sensitive ranking algorithm such as
the one described in this paper.
Hence we introduced a linguistic concept for generating
role-describing word vectors based on query (search) histories and
corporate knowledge generation.
We also described in detail how different information needs can
be represented as weighted term lists (role term vectors) which
enable us to identify role-based relationships.
Our future research activities will focus on the evaluation of this
system. The goal is of course to demonstrate the relevance of the search
results returned by our system. Additional promising areas of work
are the automation of user-role mapping as well as the further
optimization of our rank merging algorithm.
6. REFERENCES
[1] Huang, Y., X. Ma, and D. Li, Research and Application of
Enterprise Search Based on Database Security Services, in
Proceedings of the Second International Symposium on
Networking and Network Security (ISNNS ’10). 2010.
[2] Dmitriev, P., P. Serdyukov, and S. Chernov, Enterprise and
desktop search, in Proceedings of the 19th international
conference on World wide web. 2010, ACM: Raleigh, North
Carolina, USA. p. 1345-1346.
[3] Demartini, G., Leveraging semantic technologies for
enterprise search, in Proceedings of the ACM first Ph.D.
workshop in CIKM. 2007, ACM: Lisbon, Portugal. p. 25-32.
[4] Hawking, D., Challenges in enterprise search, in
Proceedings of the 15th Australasian database conference -
Volume 27. 2004: Dunedin, New Zealand. p. 15-24.
[5] Shen, X., B. Tan, and C. Zhai, Context-sensitive information
retrieval using implicit feedback, in Proceedings of the 28th
annual international ACM SIGIR conference on Research
and development in information retrieval. 2005, ACM:
Salvador, Brazil. p. 43-50.
[6] Navrat, P. and T. Taraba, Context Search, in Proceedings of
the 2007 IEEE/WIC/ACM International Conferences on
Web Intelligence and Intelligent Agent Technology -
Workshops. 2007, IEEE Computer Society. p. 99-102.
[7] Schmidt, K.-U., D. Oberle, and K. Deissner (2009) Taking
Enterprise Search to the Next Level.
[8] Hertzum, M. and A.M. Pejtersen, The information-seeking
practices of engineers: searching for documents as well as
for people. Information Processing & Management, 2000.
36(5): p. 761-778.
[9] Rosen-Zvi, M., et al., Learning author-topic models from
text corpora. ACM Trans. Inf. Syst., 2010. 28(1): p. 1-38.
[10] Rosen-Zvi, M., et al., The author-topic model for authors
and documents, in Proceedings of the 20th conference on
Uncertainty in artificial intelligence. 2004, AUAI Press:
Banff, Canada. p. 487-494.
[11] Dignum, S., et al., Moving towards adaptive search, in
Workshop on Advanced Technologies for Digital Libraries.
2009: Trento, Italy.
[12] White, M., Making Search Work: Implementing Web,
Intranet and Enterprise Search. 2007, London: Facet
Publishing.
[13] Kleinberg, J.M., Authoritative sources in a hyperlinked
environment. J. ACM, 1999. 46(5): p. 604-632.
[14] Brin, S. and L. Page, The Anatomy of a Large-Scale
Hypertextual Web Search Engine, in Seventh International
World Wide Web Conference (WWW7). 1998: Brisbane. p.
107–117.
[15] Fagin, R., et al., Searching the workplace web, in
Proceedings of the 12th international conference on World
Wide Web. 2003, ACM: Budapest, Hungary. p. 366-375.
[16] Xue, G.-R., et al., Implicit link analysis for small web
search, in Proceedings of the 26th annual international
ACM SIGIR conference on Research and development in
information retrieval. 2003, ACM: Toronto, Canada. p. 56-63.
[17] Kohn, A. and F. Bry, Exploiting a Company’s Knowledge:
The Adaptive Search Agent YASE. 2008.
[18] Mukherjee, R. and J. Mao, Enterprise Search: Tough Stuff.
Queue, 2004. 2(2): p. 36-46.
[19] Kohn, A. and F. Bry, PROFESSIONAL SEARCH:
REQUIREMENTS, PROTOTYPE AND PRELIMINARY
EXPERIENCE REPORT, in IADIS International
Conference WWW/Internet. 2008.
[20] Blei, D.M., A.Y. Ng, and M.I. Jordan, Latent dirichlet
allocation. J. Mach. Learn. Res., 2003. 3: p. 993-1022.
[21] Hawking, D., et al., Context in Enterprise Search and
Delivery, in IRiX Workshop, ACM SIGIR. 2005.
[22] Solskinnsbakk, G. and J.A. Gulla, Combining ontological
profiles with context in information retrieval. Data &
Knowledge Engineering, 2010. 69(s3): p. 251-260.
[23] Dou, Z., R. Song, and J.-R. Wen, A large-scale evaluation
and analysis of personalized search strategies, in
Proceedings of the 16th international conference on World
Wide Web. 2007, ACM: Banff, Alberta, Canada. p. 581-590.
[24] Hassler, M. and G. Fliedl, Text preparation through
extended tokenization, in Data Mining VII; Data, Text and
Web Mining, and their Business Applications, C.A. Zanasi,
N.F.F. Brebbia, and E. A., Editors. 2006, WIT Press. p. 13-
21.
[25] Agichtein, E., E. Brill, and S. Dumais, Improving web
search ranking by incorporating user behavior information,
in Proceedings of the 29th annual international ACM SIGIR
conference on Research and development in information
retrieval. 2006: Seattle, Washington, USA. p. 19-26.
[26] Manning, C.D. and H. Schutze, Foundations of Statistical
Natural Language Processing. 1999: MIT Press. 620.
[27] Chim, H. and X. Deng, A new suffix tree similarity measure
for document clustering, in Proceedings of the 16th
international conference on World Wide Web. 2007, ACM:
Banff, Alberta, Canada. p. 121-130.