Optimizing enterprise search by automatically relating user context to textual document content

September 2011

DOI:10.1145/2024288.2024316

Source
DBLP

Conference: Proceedings of the 11th International Conference on Knowledge Management and Knowledge Technologies

Authors:

Mathias Reichhold

Alpen-Adria-Universität Klagenfurt

Günther Fliedl

Alpen-Adria-Universität Klagenfurt

Optimizing Enterprise Search by Automatically Relating

User Context to Textual Document Content

Matthias Reichhold

Universität Klagenfurt

Universitätsstraße 65-67

A-9020 Klagenfurt

matreich@edu.uni-klu.ac.at

Jörg Kerschbaumer

Universität Klagenfurt

Universitätsstraße 65-67

A-9020 Klagenfurt

joerg.kerschbaumer@edu.uni-klu.ac.at

Ao.Univ.-Prof. Mag. Dr.

Günther Fliedl

Universität Klagenfurt

Universitätsstraße 65-67

A-9020 Klagenfurt

guenther.fliedl@aau.at

ABSTRACT

It is widely agreed that information retrieval (IR) systems benefit

enormously from considering not only the user’s query but also

contextual data. In enterprise IR systems corporate knowledge

bases and additional manually triggered information about users

are normally taken to obtain such contextual data.

In this paper we propose a solution for role-specific search in

enterprise environments without the need of manual

administration of mappings between roles and documents. We

include collaboratively constructed knowledge engineering

systems for computing similarity measures between user role

attributes and relevant information snippets in enterprise

documents.

Our approach suggests optimizing such enterprise search systems

by a role-sensitive ranking algorithm that relates

contextually-derived information needs of users to unstructured (textual) data

in documents. Hence we introduce a linguistic concept for

generating role-describing word vectors based on query (search)

histories and corporate knowledge base generation.

The Introduction outlines some basic ideas concerning the major

areas of enterprise search and some relevant differences between web

search and enterprise search. Subsequently we sketch our

optimized enterprise search model.

In Chapter 2 some theoretical background and Related Work are

briefly discussed. Chapter 3 depicts some linguistically relevant

details of our proposed model. We discuss our concept of User

Roles, Role Term Vectors, some approaches for Role Term

Extraction and Term Extraction incorporating knowledge bases

and query histories. In Chapter 4 we describe our ranking

mechanism, the re-ranking strategy and the method for Role

Relevance Scoring. Chapter 5 gives a conclusion of the work as

well as an outlook on future work.

Categories and Subject Descriptors

H.3.3 [Information storage and retrieval]: Information Search

and Retrieval – retrieval models, search process. H.3.1

[Information storage and retrieval]: Content Analysis and

Indexing – Linguistic processing.

General Terms

Algorithms

Keywords

Enterprise search, Enterprise search ranking, Enterprise search

optimization, user context, user role, role-sensitive ranking,

context-sensitive search

1. INTRODUCTION

The amount and complexity of data employees in companies are

faced with nowadays is increasing rapidly. In addition, the

majority of this data is unstructured (textual data) making search

even harder as shown by Huang [1]. Hence, information retrieval

systems meeting these special requirements (enterprise search

engines) are becoming more and more important (see Dmitriev et

al [2]). Furthermore, [2] also state that, in contrast to web search, only

very limited attention has been paid to this research area so far.

But there are many differences between these types of systems. As

also stated by Demartini [3], Hawking [4] identifies three major

areas an enterprise search system covers:

(1) search of the organisation’s external website

(2) search of the organisation’s internal website (its

intranet)

(3) search of other electronic text held by the organisation

in the form of email, database records, documents on

file shares etc.

According to Demartini [3], one important difference between

information retrieval systems for companies (Enterprise Search

Systems) and for web search is that much more information about

the searching user is available to the former one due to the fact

that in enterprise environments a user is a known employee who

has a specific role. Roles can be derived from certain job-related

user properties (e.g. job title, function, department, etc.) or are

already managed in IT systems like directory services, HR

systems, etc.

Demartini [3] also points out that current search systems do not

consider this role context although “different roles (like

manager, IT, software developers) with the same query have

different information needs […] and a ES system should exploit

this information”.

According to Shen et al [5], most existing systems,

which are currently available for information retrieval, are still

only using the actual query and document data in order to find

relevant information, but do not consider any contextual

information.

Moreover, [5] note on page 1 that “from a single query, however,

the retrieval system can only have very limited clue about the

user’s information need. An optimal retrieval system thus should

try to exploit as much additional context information as possible

to improve retrieval accuracy, whenever it is available.” The

significant importance of user context is also stated by e.g.

Hawking [4], Navrat et al [6] and Schmidt et al [7].

Besides that we know from [2, 8] that users with similar roles in

corporate environments are often searching for similar documents,

because they are interested in information belonging to the same

domain or on related topics and thus their information needs are

more comparable than others. The work of Rosen-Zvi et al

[9, 10] also shows that IR systems benefit significantly from

considering contextual information about enterprise users.

Our enterprise search approach includes user related context

information and combines it with linguistically enhanced

document analysis.

Figure 1 provides an overview of our approach for optimizing

enterprise search: every user is assigned to a user role which has

one Role Term Vector RTr related to it. When a user sends a query

to the search engine, the engine creates a ranked result set (the Original

Rank Od). Our Role-sensitive Ranking algorithm merges Od with

the so called Role Rank Rd and thus obtains a role-sensitive

Merged Rank Md which is presented to the user as optimized

search result.

The Role Rank again is computed by a special Role Relevance

Scoring module based on the document contents on the one hand

and the RTr on the other hand. The relevance score is calculated

using the cosine similarity measure [27], measuring the similarity

between a document d and RTr. A high similarity between d and

RTr indicates a high relevancy of d for all users with the role

related to RTr while a low similarity value on the other hand

shows low relevancy. Documents with higher relevancy scores get

better role ranks Rd, leading to a better overall rank Md at the end.

Accordingly, documents with lower relevancy scores will end up

with a lower Md.

2. BACKGROUND & RELATED WORK

Information retrieval systems have been developed for more

than 50 years, and with the rise of the World Wide Web,

research efforts (not very surprisingly) have focused very much

on web/internet search (Dignum et al [11]). But as is argued

in [11, 12], retrieval methods delivering good performance for

internet search do not necessarily deliver equally good results in

enterprise environments which is very much due to the different

structure of intranets compared to the public internet [2, 4].

Ranking algorithms successful in the web like HITS [13] or

PageRank [14] suffer from poor or missing linkage structure [15,

16] in enterprise document repositories and therefore perform less

well in corporate environments [11, 17]. Additional challenges for

enterprise search systems according to [17] are “high redundancy

(many versions of the same document)” and “notational

heterogeneity (synonyms) distort[ing] the search results”.

Figure 1: The architecture of our role-sensitive enterprise

search model and its components

Another characteristic about enterprise search is the fact that users

first have to spend a lot of time and effort to get familiar with the

domain specific concepts and terminology used in the enterprise

environment in order to be able to submit relevant query strings

for a search system [11]. Due to space limitations in this work we

refer to the paper of Mukherjee et al [18] for further description of

challenges and differences regarding enterprise search.

In recent years, however, there has been a lot of research going on

about using contextual information like explicit feedback (e.g.

relevance feedback, tagging, labelling), implicit feedback (such as

query history and clickthrough history), user profiles, etc. to

personalize and therefore improve retrieval systems. While the

focus on above mentioned research topics was clearly on web (or

internet) search, there are only a few studies dealing with

considering contextual information for enterprise search systems

[2]. This is rather surprising to us, since existing work shows

promising results: [5] have achieved significant improvements

on enterprise search using implicit feedback, and the approach of

Kohn et al [17, 19] is proposed to be superior to standard

search engines in the company environment due to the

introduction of some simple principles like personalized ranking

based on a user’s role and organizational embedding, automatic

classification of documents by using domain knowledge and

learning from search history.

Despite the promising results, Kohn et al also note the main

deficit regarding their system: role ontologies and mapping rules

between ontologies and document meta data have to be managed

manually and are therefore very costly to maintain. 

Another approach to optimize IR systems by relating user profiles

and document data is presented by Rosen-Zvi et al [9, 10]. They

introduce an “author-topic model” which is an extension of the

well-known Latent Dirichlet Allocation [20], which derives author

interests from document data based on probability distributions

and thus can exploit relations between users, documents and

topics. Our proposal presupposes these ideas about role-reflecting

ranking improvements but uses a different approach to relate user

context and document data.

As mentioned before, integration of user context and the

personalization of enterprise search are current key research areas

[4, 7, 21], and ontology-based approaches in particular have

drawn a lot of attention recently. For example, the work of Solskinnsbakk et

al [22] introduces an “ontology profile” representing a weighted

vector-based description for each ontology concept. [22] use these

powerful ontology profiles to expand queries submitted to a

search engine. Their experiments show promising results and “a

generally better performance than the baseline”.

3. USER ROLES

As mentioned above, considering user context plays a very

important role for further improvement of enterprise search

systems. But current systems often present the same search results

for a certain query to all users not respecting that the information

needs may differ considerably for different people [23]. Moreover

enterprise search systems have to cope with the fact that most of

the submitted queries are very short and ambiguous [23] making it

more or less impossible for the search system to derive the user’s

information need.

Every employee has different information needs depending on

certain properties (function, job description, department, location,

etc.). Similar properties can be consolidated into roles. Thus we

propose the use of explicit user roles which are defined

company-wide and are assigned to each employee. These roles represent the

long term user context (e.g. “Controlling”, “Procurement”, etc.)

and therefore indicate the differing user information needs (e.g.

role “Engineering” vs. role “Marketing”). The definition of the

roles to be used in the company as well as the mapping between

certain users and roles is handled by a role expert.

3.1 Role Term Vectors

User context can be represented as the concept “user roles”. Role

Term Vectors can be used to reflect the information needs of

employees and to obtain Role Relevance Scores indicating the

relative importance of documents for different employees.

We attach a Role Term Vector RTr to each role which contains

weighted words (terms) that describe the role and which is used to

relate the role to the content of documents.

The examples stated below show (1) a Role Term Vector assigned

to the role “Marketing” and (2) a vector assigned to the role

“Engineer”, with weight 1 for all terms.

(1) RTMarketing = {(“marketing”,1), (“revenue”,1), (“intake”,1), (“engine”,1)}

(2) RTEngineer = {(“engine”,1), (“combustion”,1), (“composite”,1)}
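As a minimal sketch, the two example vectors can be held as plain term-to-weight mappings. The paper does not prescribe a data structure; the type alias `RoleTermVector` and the variable names are assumptions made for illustration only.

```python
# Role Term Vectors as term -> weight mappings (illustrative representation;
# the type alias and variable names are hypothetical, not from the paper).
RoleTermVector = dict[str, float]

rt_marketing: RoleTermVector = {
    "marketing": 1.0, "revenue": 1.0, "intake": 1.0, "engine": 1.0,
}
rt_engineer: RoleTermVector = {
    "engine": 1.0, "combustion": 1.0, "composite": 1.0,
}

# Terms that occur in both role vectors; such shared terms alone cannot
# separate the two roles, so the remaining terms carry the distinction.
shared_terms = sorted(rt_marketing.keys() & rt_engineer.keys())
```

In this example only “engine” is shared, so a document dominated by “combustion” and “composite” would lean towards the role “Engineer”.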

The use of Role Term Vectors enables our model to find relations

between documents and user roles and evaluate the relevance of a

document for a certain role. Every Role Term Vector consists of a

number of weighted terms that influence the relevancy scoring

heavily. Therefore extraction and weighting of the relevant terms

is a very crucial task.

3.2 Approaches to Role Term Extraction

In the following, we describe a manual and a semi-automatic approach for

role term extraction and argue for adopting them partially in our

model.

A rather simple but cumbersome possibility for defining role

terms is a centralized and completely manual task where a role

expert assigns relevant terms to roles. Such a manual task is of

course very time-consuming and inflexible. It also requires one (or

more) role experts with domain and company specific knowledge

about roles and relevant terms. If on the other hand

such resources are available in a company they can create very

valuable inputs. Hence we propose to use manual term extraction

in the form of black lists (terms that have to be excluded) and

white lists (terms that the vector must include) for extending the

automatic processing step.
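The use of black lists and white lists on top of an automatically built vector might look as follows. This is a sketch under assumptions: the function name `apply_term_lists`, the default weight for white-listed terms and the rule that the white list wins when a term is on both lists are not specified in the paper.

```python
def apply_term_lists(vector, white_list, black_list, default_weight=1.0):
    """Return a copy of `vector` adjusted by the expert's term lists.

    Black-listed terms are dropped first; white-listed terms are then
    guaranteed to be present (assumption: they get `default_weight` if
    they were missing, and they win if a term is on both lists).
    """
    result = {term: w for term, w in vector.items() if term not in black_list}
    for term in white_list:
        result.setdefault(term, default_weight)
    return result


automatic = {"engine": 2.0, "lunch": 1.0}
adjusted = apply_term_lists(automatic, white_list={"turbine"}, black_list={"lunch"})
```

The input vector is left untouched, so the automatically extracted weights remain available if the lists change later.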

Secondly, we introduce a semi-automatic approach in which every

user in the company maintains a personal list of keywords

relevant for his/her work. We then collect the keywords entered by

the users, group them by user role and use those keywords to build

up the Role Term Vector. The advantage of this approach

compared to the first one is that we do not depend on role experts

and their personal knowledge any longer. Instead, we get

immediate and direct feedback about what is relevant since the

people actually responsible for the role terms are also the ones

using the search engine. Still, manual work has to be done in order

to be able to get the relevant terms. Thus we present a third

approach using corporate knowledge bases (enterprise wikis) and

the query (search) history of the users in order to minimize the

manual efforts needed for role term extraction.
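The keyword-based (second) approach can be sketched as a simple grouping step. The weighting used here, the number of users in a role who listed a keyword, is an assumption; the paper does not state how keyword weights are derived. All names are hypothetical.

```python
from collections import Counter, defaultdict


def build_role_vectors(user_roles, user_keywords):
    """Group personal keyword lists by role.

    Assumed weighting: a term's weight is the number of users with that
    role who listed it (each user counts once per term).
    """
    vectors = defaultdict(Counter)
    for user, keywords in user_keywords.items():
        role = user_roles.get(user)
        if role is None:  # users without a role assignment are skipped
            continue
        vectors[role].update({k.lower() for k in keywords})
    return {role: dict(counts) for role, counts in vectors.items()}


user_roles = {"anna": "Engineer", "ben": "Engineer", "cara": "Marketing"}
user_keywords = {
    "anna": ["engine", "Combustion"],
    "ben": ["engine"],
    "cara": ["revenue"],
}
vectors = build_role_vectors(user_roles, user_keywords)
```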

3.3 Term Extraction incorporating

Knowledge Bases and Query Histories

Wikis are a popular form of knowledge management systems in

public (e.g. Wikipedia) as well as within companies (“Enterprise

Wikis”). They can be seen as semantic graphs consisting of two

different types of nodes:

(1) Concept nodes containing the actual content (e.g.

description of domain or company specific

abbreviations) as well as links to other nodes and

(2) Category nodes building up a hierarchical system of

overlapping trees in which every category can have one

or more sub categories and also one or more parent

categories.

Every concept node can be assigned to one or more category

nodes.

For our approach we additionally assign each of the user roles to at

least one category within an available enterprise wiki.

Furthermore we use the query (search) history of the users to

identify term candidates. For a query q from a user u we first need

to do some linguistic pre-processing steps (tokenization,

chunking, stemming and lemmatization and collocation finding)

in order to get an appropriate term candidate c. Linguistic

pre-processing is a non-trivial task that plays a rather important role

and thus needs a lot of attention.

Still, this issue is out of scope for this paper. For further

information we refer to Hassler & Fliedl [24].

Next, the system searches in the enterprise wiki for a concept or

category node corresponding to c. If no such node is found, c is

rejected but if a node exists the system checks whether it is in

one of the sub graphs of the categories mapped to the role

assigned to u. Only if c is found in one of the sub graphs it is

added as new term in the Role Term Vector RT. If an entry for c

already exists in RT the weight of this entry is increased.

This mechanism ensures that only terms relevant for the user’s

work are included in the role term vector of that user. For

example: user u searches for “sales pipe”. Furthermore u is

mapped to the role “mechanical engineer”. In the enterprise wiki a

concept node for “sales pipe” exists which is assigned to the

category “revenue forecasts”. No node for “sales pipe” is found in

the sub graphs of any of the categories assigned to role

“mechanical engineer”, since “mechanical engineer” is not

mapped to the category “revenue forecasts” or any of its parent

categories. Consequently “sales pipe” is considered not relevant

for role “mechanical engineer” and thus not included in its role

term vector. If u were mapped to the role “account manager”

instead and if the role “account manager” were assigned to

the category “revenue forecasts”, the term “sales pipe” would be

added to the role term vector of the role “account manager”.
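The filtering step and the “sales pipe” example can be sketched as below. This is an illustration under assumptions: the wiki is modelled as three plain dictionaries, the extra categories “sales”, “engineering” and “engines” are invented for the example, and the weight is increased by 1 per accepted query because the paper does not state the increment.

```python
def category_subgraph(root, subcategories):
    """Return `root` plus every category reachable via sub-category links."""
    seen, stack = set(), [root]
    while stack:
        category = stack.pop()
        if category not in seen:
            seen.add(category)
            stack.extend(subcategories.get(category, ()))
    return seen


def add_term_candidate(candidate, role, wiki, role_vectors):
    """Add `candidate` to the role's term vector if the wiki places it in a
    sub graph of one of the role's categories. Returns True if accepted."""
    node_categories = wiki["node_categories"].get(candidate)
    if node_categories is None:  # no concept or category node: reject
        return False
    allowed = set()
    for root in wiki["role_categories"].get(role, ()):
        allowed |= category_subgraph(root, wiki["subcategories"])
    if not allowed & set(node_categories):
        return False
    vector = role_vectors.setdefault(role, {})
    vector[candidate] = vector.get(candidate, 0) + 1  # assumed step size of 1
    return True


# Hypothetical miniature wiki; only "sales pipe" / "revenue forecasts" and the
# two roles come from the example in the text.
wiki = {
    "subcategories": {"sales": ["revenue forecasts"], "engineering": ["engines"]},
    "node_categories": {"sales pipe": ["revenue forecasts"], "combustion": ["engines"]},
    "role_categories": {
        "mechanical engineer": ["engineering"],
        "account manager": ["revenue forecasts"],
    },
}
role_vectors = {}
rejected = add_term_candidate("sales pipe", "mechanical engineer", wiki, role_vectors)
accepted = add_term_candidate("sales pipe", "account manager", wiki, role_vectors)
```

After these two calls only the role “account manager” has an entry for “sales pipe”; repeating the accepted call would raise its weight.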

Using this approach, the manual effort for role term extraction can

be reduced significantly compared to the two methods discussed

earlier in this section. Still, this approach is also not yet fully

automated since the mapping between user roles and wiki

categories has to be done by hand. Automating and further

optimizing this procedure is an interesting area for future work.

4. ROLE-SENSITIVE RANKING

In order to be able to optimize the results of an enterprise search

engine based on user roles, we introduce a role-sensitive ranking

algorithm that re-ranks the original result set as returned by the

enterprise search engine according to the role relevance, which

reflects the relevance of a result to a searching user with a specific

role. The actual re-ranking function is derived from the work of

Agichtein [25]¹ and adapted to our requirements as follows:

$$S_M(d) = w_1 \cdot \frac{1}{R_d + 1} + \frac{1}{O_d + 1}$$

¹ Agichtein et al evaluated many different approaches and found

that “a simple rank merging heuristic combination works well

and is robust to variations in score values from original

rankers”.

For every document d within the original result set a merged score

SM is computed based on the document’s original rank Od and the

role rank Rd obtained from the document’s Role Relevance Vector.

A Role Relevance Vector exists for every document and specifies

the relevance of that document to every role defined in

the company. The specific characteristics of Role Relevance

Vectors are described in the section on Role Relevance Scoring. As proposed by [25] we also

use weight w1 as a factor for scaling the “relative importance” of

the role relevance compared to the original rank.

4.1 Re-ranking Search Results using the

Merged Rank

Table 1 shows examples for the computation of the merged score

SM and the merged rank RM obtained from it; RM is used to

re-rank the results presented to the searching user. In the first case

(w1 = 1) the importance of the original rank and the role rank is

equal, leading to a completely new order of the result documents.

Increasing w1 to a higher value favors the role rank over the original

rank; at a certain value, only the role rank is decisive. Case 2 (w1

= 100) in the example below shows that RM equals Rd as a

result of a very high value for w1. Conversely, too small a

value for w1 causes the role rank to be ignored (RM in case 3

equals Od).

Table 1: Example for role-sensitive ranking using different weights

                w1 = 1         w1 = 100        w1 = 0.01
d    Od   Rd    SM      RM     SM       RM     SM      RM
d1   1    4     0.700   2      20.500   4      0.502   1
d2   2    3     0.583   3      25.333   3      0.336   2
d3   3    1     0.750   1      50.250   1      0.255   3
d4   4    5     0.367   5      16.867   5      0.202   4
d5   5    2     0.500   4      33.500   2      0.170   5
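The values in Table 1 can be reproduced with a short sketch. It assumes the merged score has the rank-merging form of [25], SM = w1 / (Rd + 1) + 1 / (Od + 1), which matches every SM entry of the table; the function and variable names are hypothetical.

```python
def merged_score(original_rank, role_rank, w1):
    """Merged score S_M, assuming S_M = w1 / (R_d + 1) + 1 / (O_d + 1)."""
    return w1 / (role_rank + 1) + 1 / (original_rank + 1)


def merged_ranks(original, role, w1):
    """Merged rank R_M per document: position after sorting by S_M, descending."""
    scores = {d: merged_score(original[d], role[d], w1) for d in original}
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {d: position for position, d in enumerate(ordered, start=1)}


# Original ranks O_d and role ranks R_d as given in Table 1.
original_rank = {"d1": 1, "d2": 2, "d3": 3, "d4": 4, "d5": 5}
role_rank = {"d1": 4, "d2": 3, "d3": 1, "d4": 5, "d5": 2}

balanced = merged_ranks(original_rank, role_rank, w1=1)
role_dominated = merged_ranks(original_rank, role_rank, w1=100)
role_ignored = merged_ranks(original_rank, role_rank, w1=0.01)
```

With w1 = 100 the merged order coincides with the role rank, and with w1 = 0.01 it coincides with the original rank, as described for cases 2 and 3.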

4.2 Role Relevance Scoring

As already mentioned before, our approach uses the information

about the specific role a user (employee) plays in a company and

generates a Role Term Vector for each role describing it in the

form of a weighted term list. In this section we describe our

approach to relate a document to a role using Role Term Vectors.

For every single document in the company’s document collection

we create a Role Relevance Vector

RRd = {RSr1, RSr2, …, RSrn}

containing a Role Relevance Score RSr for each role r defined in the

company, where RSr is calculated as the cosine similarity

between the vector representation Td of a document d and a Role

Term Vector RTr of a role r:

$$RS_r = \cos(T_d, RT_r) = \frac{T_d \cdot RT_r}{\lVert T_d \rVert \, \lVert RT_r \rVert}$$

Cosine similarity is a widely used measure to determine the

similarity between two vectors. A result equal to 1 indicates that

the angle between the two vectors is 0 and that they therefore

point into the same direction. On the other hand, a result equal to

-1 means that the vectors are pointing in the opposite direction.

The length of the vector does not influence the similarity value.

In order to be able to use this similarity measure for comparing a

role term vector with a textual document we also need to represent

the textual content of a document as a weighted term

vector, where the weight is the well-known tf-idf

(term frequency–inverse document frequency) score. Words and

multiwords are filtered out with respect to their weight in a

certain domain. For managing this task we also use linguistic

strategies like co-occurrence determination and dependency

parsing [26]. Weighting keyword collocations is one of the most

important tasks in the workflow triggered by our model. A more

detailed description regarding cosine similarity and tf-idf score

can be found e.g. in Chim [27]. The example in

Table 2 shows tf-idf scores for the documents d1 and d2 and the

Role Term Vectors RTMarketing and RTEngineer, which were already

introduced in section 3.1. tf-idf values are obtained from the term

frequency tf, and the inverse document frequency idf:

The value increases with the number of times a term occurs in a

vector (tf) and decreases with the number of times a term occurs

in different documents throughout the company’s document

collection (idf). For this example we used a total number of

documents of 10.
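The tf-idf formula itself is not reproduced in this text. A common definition that agrees with the idf column and the d1 column of Table 2 (for N = 10 documents) is sketched below; the base-10 logarithm is an assumption inferred from those values rather than something the text states.

```latex
\[
\mathrm{idf}(t) = \log_{10}\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
\]
% Worked example with N = 10:
%   idf(marketing) = log10(10 / 3) \approx 0.52
%   tf-idf(marketing, d1) = 9 \cdot 0.523 \approx 4.71
```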

Table 2: tf-idf scores for different documents and role terms

                           d1            d2            RTMarketing   RTEngineer
Terms        df   idf      tf   tf-idf   tf   tf-idf   tf   tf-idf   tf   tf-idf
marketing    3    0.52     9    4.71     0    0.00     1    0.52     0    0.00
revenue      4    0.40     5    1.99     0    0.00     1    0.52     0    0.00
intake       2    0.70     7    4.89     1    0.52     1    0.52     0    0.00
engine       6    0.22     1    0.22     6    3.14     1    0.52     1    0.52
combustion   2    0.70     0    0.00     4    2.09     0    0.00     1    0.52
composite    3    0.52     2    1.05     8    4.18     0    0.00     1    0.52

Calculated with the above formula, RSr ranges from 0

(indicating no similarity/relevancy) to 1 (maximum

similarity/relevancy). Values smaller than 0 are not possible since

the tf-idf score cannot be negative.

Table 3 shows that RTMarketing has a higher score for d1 than for d2,

leading us to the conclusion that d1 is more relevant for users

assigned to the role “Marketing” than d2 and should therefore get

a higher role rank. On the other hand d2 is more relevant for

employees with the role “Engineer” than for those with

“Marketing”.

Table 3: Role relevance scores RS for documents d and role term vectors RT

RS    d2       RTMarketing   RTEngineer
d1    0.1885   0.8254        0.1023
d2    1        0.3236        0.9608
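As a check, the scores in Table 3 can be recomputed from the tf-idf values printed in Table 2 with a plain cosine similarity. This is a sketch: vectors are held as sparse term-to-weight dictionaries, which is an assumed representation, and small deviations in the last digit are expected because the inputs are rounded to two decimals.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse term -> weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:  # an empty vector is similar to nothing
        return 0.0
    return dot / (norm_a * norm_b)


# tf-idf values as printed in Table 2 (zero entries omitted).
d1 = {"marketing": 4.71, "revenue": 1.99, "intake": 4.89, "engine": 0.22, "composite": 1.05}
d2 = {"intake": 0.52, "engine": 3.14, "combustion": 2.09, "composite": 4.18}
rt_marketing = {t: 0.52 for t in ("marketing", "revenue", "intake", "engine")}
rt_engineer = {t: 0.52 for t in ("engine", "combustion", "composite")}

# Role Relevance Vectors: one score per role, in the order (Marketing, Engineer).
rr_d1 = [cosine_similarity(d1, rt_marketing), cosine_similarity(d1, rt_engineer)]
rr_d2 = [cosine_similarity(d2, rt_marketing), cosine_similarity(d2, rt_engineer)]
```

Because all weights are non-negative, every score falls between 0 and 1, in line with the range stated for RSr.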

Based on the Role Relevance Scores we can build up the Role

Relevance Vectors RRd for each document in the entire collection.

For our example those vectors would look as follows:

(1) RRd1 = {0.8254; 0.1023}

(2) RRd2 = {0.3236; 0.9608}

On the basis of these values we can obtain the role rank Rd of each

document for a given Role Term Vector RTr and its related role as

shown in the example in the table below.

Table 4: Obtaining role rank Rd from Role Relevance Score RSd

      RTMarketing     RTEngineer
d     RSd      Rd     RSd      Rd
d1    0.8254   1      0.1023   5
d2    0.3236   5      0.9608   1
d3    0.7502   2      0.7009   2
d4    0.3671   4      0.5832   3
d5    0.5008   3      0.2551   4
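Deriving the role rank from the scores amounts to sorting the documents by descending Role Relevance Score. The sketch below reproduces Table 4; the function name is hypothetical, and ties are broken by input order, which the paper does not address.

```python
def role_ranks(scores):
    """Map each document to its role rank R_d (1 = highest relevance score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {doc: rank for rank, doc in enumerate(ordered, start=1)}


# Role Relevance Scores RS_d per role, as given in Table 4.
rs_marketing = {"d1": 0.8254, "d2": 0.3236, "d3": 0.7502, "d4": 0.3671, "d5": 0.5008}
rs_engineer = {"d1": 0.1023, "d2": 0.9608, "d3": 0.7009, "d4": 0.5832, "d5": 0.2551}

ranks_marketing = role_ranks(rs_marketing)
ranks_engineer = role_ranks(rs_engineer)
```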

The role rank is then incorporated by the role-sensitive ranking

algorithm and combined with the original rank of the enterprise

search engine.

5. CONCLUSION & FUTURE WORK

In this paper we described a solution for role-specific search in

enterprise environments based on some computational linguistics

methods for term vector preparation and generation. In our

approach we propose to optimize such enterprise search systems

by a role-sensitive ranking algorithm that relates contextually-

derived information needs of users to unstructured (textual) data

in documents.

We have presented a model that incorporates

(1) contextual information of enterprise users like user roles

and search history as well as

(2) collaboratively constructed enterprise knowledge

management systems

to automatically identify role-based relationships between users

and unstructured enterprise content. We also claimed that such

relationships can lead to a significant improvement of enterprise

search when utilized by a role-sensitive ranking algorithm such as

described in this paper.

Hence we introduced a linguistic concept for generating role

describing word vectors based on query (search) histories and

corporate knowledge generation.

We also described in detail how different information needs can

be represented as weighted term lists (role term vectors) which

enable us to identify role-based relationships.

Our future research activities will focus on the evaluation of this

system. The goal is of course to prove the relevance of the search

results returned by our system. Additional promising areas of work

are the automation of user-role-mapping as well as the further

optimization of our rank merging algorithm.

6. REFERENCES

[1] Huang, Y., X. Ma, and D. Li, Research and Application of

Enterprise Search Based on Database Security Services, in

Proceedings of the Second International Symposium on

Networking and Network Security (ISNNS ’10). 2010.

[2] Dmitriev, P., P. Serdyukov, and S. Chernov, Enterprise and

desktop search, in Proceedings of the 19th international

conference on World wide web. 2010, ACM: Raleigh, North

Carolina, USA. p. 1345-1346.

[3] Demartini, G., Leveraging semantic technologies for

enterprise search, in Proceedings of the ACM first Ph.D.

workshop in CIKM. 2007, ACM: Lisbon, Portugal. p. 25-32.

[4] Hawking, D., Challenges in enterprise search, in

Proceedings of the 15th Australasian database conference -

Volume 27. 2004: Dunedin, New Zealand. p. 15-24.

[5] Shen, X., B. Tan, and C. Zhai, Context-sensitive information

retrieval using implicit feedback, in Proceedings of the 28th

annual international ACM SIGIR conference on Research

and development in information retrieval. 2005, ACM:

Salvador, Brazil. p. 43-50.

[6] Navrat, P. and T. Taraba, Context Search, in Proceedings of

the 2007 IEEE/WIC/ACM International Conferences on

Web Intelligence and Intelligent Agent Technology -

Workshops. 2007, IEEE Computer Society. p. 99-102.

[7] Schmidt, K.-U., D. Oberle, and K. Deissner (2009) Taking

Enterprise Search to the Next Level.

[8] Hertzum, M. and A.M. Pejtersen, The information-seeking

practices of engineers: searching for documents as well as

for people. Information Processing & Management, 2000.

36(5): p. 761-778.

[9] Rosen-Zvi, M., et al., Learning author-topic models from

text corpora. ACM Trans. Inf. Syst., 2010. 28(1): p. 1-38.

[10] Rosen-Zvi, M., et al., The author-topic model for authors

and documents, in Proceedings of the 20th conference on

Uncertainty in artificial intelligence. 2004, AUAI Press:

Banff, Canada. p. 487-494.

[11] Dignum, S., et al., Moving towards adaptive search, in

Workshop on Advanced Technologies for Digital Libraries.

2009: Trento, Italy.

[12] White, M., Making Search Work: Implementing Web,

Intranet and Enterprise Search. 2007, London: Facet

Publishing.

[13] Kleinberg, J.M., Authoritative sources in a hyperlinked

environment. J. ACM, 1999. 46(5): p. 604-632.

[14] Brin, S. and L. Page, The Anatomy of a Large-Scale

Hypertextual Web Search Engine, in Seventh International

World Wide Web Conference (WWW7). 1998: Brisbane. p.

107–117.

[15] Fagin, R., et al., Searching the workplace web, in

Proceedings of the 12th international conference on World

Wide Web. 2003, ACM: Budapest, Hungary. p. 366-375.

[16] Xue, G.-R., et al., Implicit link analysis for small web

search, in Proceedings of the 26th annual international

ACM SIGIR conference on Research and development in

information retrieval. 2003, ACM: Toronto, Canada. p. 56-63.

[17] Kohn, A. and F. Bry, Exploiting a Company’s Knowledge:

The Adaptive Search Agent YASE. 2008.

[18] Mukherjee, R. and J. Mao, Enterprise Search: Tough Stuff.

Queue, 2004. 2(2): p. 36-46.

[19] Kohn, A. and F. Bry, PROFESSIONAL SEARCH:

REQUIREMENTS, PROTOTYPE AND PRELIMINARY

EXPERIENCE REPORT, in IADIS International

Conference WWW/Internet. 2008.

[20] Blei, D.M., A.Y. Ng, and M.I. Jordan, Latent dirichlet

allocation. J. Mach. Learn. Res., 2003. 3: p. 993-1022.

[21] Hawking, D., et al., Context in Enterprise Search and

Delivery, in IRiX Workshop, ACM SIGIR. 2005.

[22] Solskinnsbakk, G. and J.A. Gulla, Combining ontological

profiles with context in information retrieval. Data &

Knowledge Engineering, 2010. 69(s3): p. 251-260.

[23] Dou, Z., R. Song, and J.-R. Wen, A large-scale evaluation

and analysis of personalized search strategies, in

Proceedings of the 16th international conference on World

Wide Web. 2007, ACM: Banff, Alberta, Canada. p. 581-590.

[24] Hassler, M. and G. Fliedl, Text preparation through

extended tokenization, in Data Mining VII; Data, Text and

Web Mining, and their Business Applications, A. Zanasi,

C.A. Brebbia, and N.F.F. Ebecken, Editors. 2006, WIT Press. p. 13-21.

[25] Agichtein, E., E. Brill, and S. Dumais, Improving web

search ranking by incorporating user behavior information,

in Proceedings of the 29th annual international ACM SIGIR

conference on Research and development in information

retrieval. 2006: Seattle, Washington, USA. p. 19-26.

[26] Manning, C.D. and H. Schütze, Foundations of Statistical

Natural Language Processing. 1999: MIT Press. 620.

[27] Chim, H. and X. Deng, A new suffix tree similarity measure

for document clustering, in Proceedings of the 16th

international conference on World Wide Web. 2007, ACM:

Banff, Alberta, Canada. p. 121-130.
