Article

Network-based filtering for large email collections in E-Discovery

Author: Hans Henseler

Abstract

The information overload in E-Discovery proceedings makes reviewing expensive and increases the risk of failing to produce results on time and consistently. New interactive techniques have been introduced to increase reviewer productivity. In contrast, the research discussed in this paper proposes an alternative method that tries to reduce information during culling so that less information needs to be reviewed. The proposed method first focuses on mapping the email collection universe using straightforward statistical methods based on keyword filtering combined with date/time and custodian identities. Subsequently, a social network is constructed from the email collection and analyzed by filtering on date/time and keywords. By using the network context, we expect to provide a better understanding of the keyword hits and the ability to discard certain parts of the collection.
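To make the two-stage culling concrete, the sketch below shows how a keyword/date/custodian filter and a subsequent sender-recipient network could be implemented. It is a minimal illustration in Python, not the paper's actual implementation; the field names, keywords, date range and custodian identifiers are assumptions.

```python
# Minimal sketch of the two-stage culling idea, assuming each email is a
# dict with 'sender', 'recipients', 'date', 'custodian' and 'body' fields.
from datetime import datetime
import networkx as nx  # pip install networkx

KEYWORDS = {"contract", "settlement"}        # illustrative search terms
DATE_FROM = datetime(2001, 1, 1)             # illustrative date range
DATE_TO = datetime(2001, 12, 31)
CUSTODIANS = {"lay-k", "skilling-j"}         # illustrative custodians


def is_hit(email):
    """Stage 1: keyword filter combined with date/time and custodian identity."""
    in_range = DATE_FROM <= email["date"] <= DATE_TO
    from_custodian = email["custodian"] in CUSTODIANS
    has_keyword = any(k in email["body"].lower() for k in KEYWORDS)
    return in_range and from_custodian and has_keyword


def build_network(emails):
    """Stage 2: weighted sender -> recipient graph over the filtered hits."""
    g = nx.DiGraph()
    for email in filter(is_hit, emails):
        for recipient in email["recipients"]:
            weight = g.get_edge_data(email["sender"], recipient, default={}).get("weight", 0)
            g.add_edge(email["sender"], recipient, weight=weight + 1)
    return g
```

Components of such a graph that contain no further keyword hits are candidates for being set aside, which is the kind of network-context culling the abstract argues for.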

... Jason Baron has documented the magnitude of the problem in the context of the National Archives' responsibilities (Losey 2009). Henseler has subsequently shown through a series of experiments that it is possible to identify the most relevant portions of e-mail chains through the use of statistical techniques (Henseler 2010). In so doing, he has given technologists a potent tool for culling entire sets of e-mail documents determined to be nonresponsive. ...
... In addition to Ashley and Bridewell's work describing the nascent contributions that Social Network Analysis (SNA) can make to theories of relevance, Henseler has also extended his e-mail analysis into the field of Social Networks (Scott 2000), showing that e-mail investigations can reveal and effectively probe Social Networks as well (Henseler 2010). The author argues that a number of new techniques in E-Discovery address critical problems in the field, in particular, technologies such as concept search, visualization and fuzzy duplicate detection, and enhanced culling strategies using keyword-based filtering. ...
Article
Full-text available
In this work, we provide a broad overview of the distinct stages of E-Discovery. We portray them as an interconnected, often complex workflow process, while relating them to the general Electronic Discovery Reference Model (EDRM). We start with the definition of E-Discovery. We then describe the very positive role that NIST's Text REtrieval Conference (TREC) has played in the science of E-Discovery, in terms of the tasks involved and the evaluation of the legal discovery work performed. Given the critical role that data analysis plays at various stages of the process, we present a pyramid model, which complements the EDRM model: for gathering and hosting; indexing; searching and navigating; and finally consolidating and summarizing E-Discovery findings. Next we discuss where the current areas of need and areas of growth appear to be, using one of the field's most authoritative surveys of providers and consumers of E-Discovery products and services. We subsequently address some areas of Artificial Intelligence, both Information Retrieval-related and not, which promise to make future contributions to the E-Discovery discipline. Some of these areas include data mining applied to e-mail and social networks, classification and machine learning, and the technologies that will enable next-generation E-Discovery. The lesson we convey is that the more IR researchers and others understand the broader context of E-Discovery, including the stages that occur before and after primary search, the greater will be the prospects for broader solutions, creative optimizations and synergies yet to be tapped.
... Furthermore, enterprise email databases are common objects of study in the E-Discovery scenario, as can be seen by the TREC Legal track which aims to closely resemble a real-world E-Discovery task [46,117]. Henseler [118] likewise studied email in an E-Discovery setting. He found that combining keyword search with communication pattern analysis reduces the amount of information reviewers need to process. ...
Preprint
In the era of big data, we continuously - and at times unknowingly - leave behind digital traces, by browsing, sharing, posting, liking, searching, watching, and listening to online content. When aggregated, these digital traces can provide powerful insights into the behavior, preferences, activities, and traits of people. While many have raised privacy concerns around the use of aggregated digital traces, it has undisputedly brought us many advances, from the search engines that learn from their users and enable our access to unforeseen amounts of data, knowledge, and information, to, e.g., the discovery of previously unknown adverse drug reactions from search engine logs. Whether in online services, journalism, digital forensics, law, or research, we increasingly set out to explore large amounts of digital traces to discover new information. Consider, for instance, the Enron scandal, Hillary Clinton's email controversy, or the Panama papers: cases that revolve around analyzing, searching, investigating, exploring, and turning upside down large amounts of digital traces to gain new insights, knowledge, and information. This discovery task is at its core about "finding evidence of activity in the real world." This dissertation revolves around discovery in digital traces, and sits at the intersection of Information Retrieval, Natural Language Processing, and applied Machine Learning. We propose computational methods that aim to support the exploration and sense-making process of large collections of digital traces. We focus on textual traces, e.g., emails and social media streams, and address two aspects that are central to discovery in digital traces.
... Furthermore, enterprise email databases are common objects of study in the E-Discovery scenario, as can be seen by the TREC Legal track which aims to closely resemble a real-world E-Discovery task [46,117]. Henseler [118] likewise studied email in an E-Discovery setting. He found that combining keyword search with communication pattern analysis reduces the amount of information reviewers need to process. ...
Thesis
Full-text available
In the era of big data, we continuously — and at times unknowingly — leave behind digital traces, by browsing, sharing, posting, liking, searching, watching, and listening to online content. Aggregated, these digital traces can provide powerful insights into the behavior, preferences, activities, and traits of people. While many have raised privacy concerns around the use of aggregated digital traces, it has undisputedly brought us many advances, from the search engines that enable our access to unforeseen amounts of data, knowledge, and information, to, e.g., the discovery of previously unknown adverse drug reactions from search engine logs. Whether in online services, journalism, digital forensics, law, or research, we increasingly set out to explore large amounts of digital traces to discover new information. Consider, for instance, the Enron scandal, Hillary Clinton's email controversy, or the Panama Papers: cases that revolve around analyzing, searching, investigating, exploring, and turning upside down large amounts of digital traces to gain new insights, knowledge, and information. This discovery task is at its core about "finding evidence of activity in the real world." This dissertation revolves around discovery in digital traces. We propose computational methods that aim to support the exploration and sense-making process of large collections of textual digital traces, e.g., emails and social media streams. We address methods for analyzing the textual content of digital traces, and the contexts in which they are created, with the goal of predicting people's future activity, by leveraging their historic digital traces.
... The need for better search tools and methods is reflected in the growth of the eDiscovery market (Gartner 2009) and in the growing research interest (Ashley and Engers 2011). New techniques are being introduced that allow a faster review of ESI by using tools such as conceptual search (Chaplin 2008), detection of near duplicates and visual analysis (Görg, Stasko and Liu 2008), predictive coding, which is used to automatically identify ESI using statistical pattern recognition (Peck 2011), and social network analysis (Henseler 2010b). ...
Conference Paper
Full-text available
Within eGovernment, trust in electronically stored information (ESI) is a necessity, not only when communicating with citizens, but also for organizational transparency and accountability. In the last decades, most organizations underwent substantial reorganization. The integration of structured data in relational databases has improved documentation of business transactions and increased data quality. That integration has improved accountability as well. Almost 90% of the information that organizations manage is unstructured (e.g., e-mail, documents, multimedia files, etc.). Those files cannot be integrated into a traditional database in an easy way. Like structured data, unstructured ESI in organizations can be denoted as records, when it is meant to be (and used as) evidence for organizational policies, decisions, products, actions and transactions. Stakeholders in eGovernment, like citizens, governments and courts, are making increasing demands for the trustworthiness of this ESI for privacy, evidential and transparency reasons. A theoretical analysis of the literature of information, organization and archival science illustrates that for delivering evidence, reconstruction of the past is essential, even in this age of information overload. We want to analyse how Digital Archiving and eDiscovery contribute to the realization of trusted ESI, to the reconstruction of the past and to delivering evidence. Digital Archiving ensures (by implementing and managing the 'information value chain') that: [1] ESI can be trusted, that it meets the necessary three dimensions of information: quality, context and relevance, and that [2] trusted ESI meets the remaining fourth dimension of information: survival, so that it is preserved for as long as is necessary (even indefinitely) to comply with privacy, accountability and transparency regulations. EDiscovery is any process (or series of processes) in which (trusted) ESI is sought, located, secured and searched with the intent of using it as evidence in a civil or criminal legal case. A difference between the two mechanisms is that Digital Archiving is implemented ex ante and eDiscovery ex post legal proceedings. The combination of both mechanisms ensures that organizations have a documented understanding of [1] the processing of policies, decisions, products, actions and transactions within (inter-)organizational processes; [2] the way organizations account for those policies, decisions, products, actions and transactions within their business processes; and [3] the reconstruction of policies, decisions, products, actions and transactions from business processes over time. This understanding is extremely important for the realization of eGovernment, for which reconstruction of the past is an essential functionality. Both mechanisms are illustrated with references to practical examples.
... This measure is an indicator of popularity that tends to identify centers of large cliques [25]. Henseler [26] calculates eigenvector centrality to rank people based on the number of emails sent and received in an email network. Kayes et al. [27] use the eigenvector centrality measure to find influential bloggers in a blogging community. ...
Article
Full-text available
Tracing criminal ties and mining evidence from a large network to begin a crime case analysis has been difficult for criminal investigators due to large numbers of nodes and their complex relationships. In this paper, trust networks were formed using blind carbon copy (BCC) emails. We show that our new shortest-paths network search algorithm, which combines shortest paths and network centrality measures, can isolate and identify criminals' connections within a trust network. A group of BCC emails out of 1,887,305 Enron email transactions was isolated for this purpose. The algorithm uses two central nodes, the most influential and the middle man, to extract a shortest-paths trust network.
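The sketch below shows one way such a shortest-paths-plus-centrality extraction could look in networkx. The concrete centrality choices (eigenvector for "most influential", betweenness for "middle man") are assumptions made for illustration, not necessarily the authors' definitions.

```python
# Hypothetical sketch: build a trust network from (sender, BCC-recipient)
# pairs, pick two central nodes, and keep only nodes on shortest paths
# between them. Assumes both nodes lie in the same connected component.
import networkx as nx

def trust_subnetwork(bcc_edges):
    g = nx.Graph()
    g.add_edges_from(bcc_edges)

    eigen = nx.eigenvector_centrality(g, max_iter=1000)   # "most influential"
    betw = nx.betweenness_centrality(g)                   # "middle man"
    influential = max(eigen, key=eigen.get)
    middle_man = max(betw, key=betw.get)

    keep = set()
    for path in nx.all_shortest_paths(g, influential, middle_man):
        keep.update(path)
    return g.subgraph(keep).copy()
```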
... In this paper, we focus on an enterprise setting, allowing us to leverage the full content and structure of the communication network, as opposed to taking a strictly local (ego network) approach. Combining signals from email content and the communication graph has been studied, e.g., in e-discovery, where combining keyword search with communication pattern analysis of e-mail corpora reduces the amount of information reviewers need to process [3]. Recipient recommendation similarly allows us to gain a better understanding of communication patterns in enterprises, potentially revealing underlying structures of enterprises. ...
Conference Paper
Full-text available
We address the task of recipient recommendation for emailing in enterprises. We propose an intuitive and elegant way of modeling the task of recipient recommendation, which uses both the communication graph (i.e., who are most closely connected to the sender) and the content of the email. Additionally, the model can incorporate evidence as prior probabilities. Experiments on two enterprise email collections show that our model achieves very high scores, and that it outperforms two variants that use either the communication graph or the content in isolation.
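A toy version of such a combined scorer is sketched below: a prior multiplied by a mixture of graph closeness and content similarity. This only illustrates the idea of combining the two evidence sources; it does not reproduce the paper's model, and the weights and features are assumptions.

```python
# Toy recipient scorer combining a prior, communication-graph closeness and
# content similarity. Weights and features are illustrative assumptions.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def rank_recipients(email_terms, sent_counts, term_profiles, priors):
    """
    email_terms:   Counter of terms in the message being composed
    sent_counts:   {candidate: number of past emails from this sender}
    term_profiles: {candidate: Counter of terms in their past mail}
    priors:        {candidate: prior probability, e.g. overall activity}
    """
    total_sent = sum(sent_counts.values()) or 1
    scores = {}
    for cand, profile in term_profiles.items():
        graph_score = sent_counts.get(cand, 0) / total_sent
        content_score = cosine(email_terms, profile)
        scores[cand] = priors.get(cand, 1e-6) * (0.5 * graph_score + 0.5 * content_score)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)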
... Henseler [4] proposes the use of social network analysis for network-based filtering in large email collections in E-Discovery. This research formalizes the use of networks in E-Discovery using identities extracted from email headers in the Enron email data set [5]. ...
Conference Paper
With the pervasiveness of computers and mobile devices, digital forensics becomes more important in law enforcement. Detectives increasingly depend on the scarce support of digital specialists, which impedes the efficiency of criminal investigations. This paper proposes an algorithm to extract, merge and rank identities that are encountered in the electronic evidence during processing. Two experiments are described demonstrating that our approach can assist with the identification of frequently occurring identities so that investigators can prioritize the investigation of evidence units accordingly.
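A much simplified sketch of such an extract-merge-rank pipeline is shown below; the extraction regex and the alias-merging rule (shared local part) are illustrative assumptions, not the paper's algorithm.

```python
# Simplified identity extraction and ranking from raw email headers.
import re
from collections import Counter

ADDR_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_identities(headers):
    """headers: iterable of raw From/To/Cc header strings."""
    counts = Counter()
    for header in headers:
        counts.update(ADDR_RE.findall(header.lower()))
    return counts

def merge_aliases(counts):
    """Assumption: addresses with the same local part belong to one identity."""
    merged = Counter()
    for addr, n in counts.items():
        merged[addr.split("@")[0]] += n
    return merged

def rank_identities(headers, top=20):
    """Most frequently occurring identities, to prioritize evidence review."""
    return merge_aliases(extract_identities(headers)).most_common(top)
```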
... However, this approach does not identify the activities and relationships of a suspect under investigation. Henseler (2010) suggests an approach for filtering large email collections during an investigation based on statistical and visualisation techniques. This method uses the Enron email corpus as the basis for its results. ...
Article
Full-text available
The continued reliance on email communications ensures that it remains a major source of evidence during a digital investigation. Emails comprise both structured and unstructured data. Structured data provides qualitative information to the forensics examiner and is typically viewed through existing tools. Unstructured data is more complex as it comprises information associated with social networks, such as relationships within the network, identification of key actors and power relations, and there are currently no standardised tools for its forensic analysis. This paper posits a framework for the forensic investigation of email data. In particular, it focuses on the triage and analysis of unstructured data to identify key actors and relationships within an email network. This paper demonstrates the applicability of the approach by applying relevant stages of the framework to the Enron email corpus. The paper illustrates the advantage of triaging this data to identify (and discount) actors and potential sources of further evidence. It then applies social network analysis techniques to key actors within the data set. This paper posits that visualisation of unstructured data can greatly aid the examiner in their analysis of evidence discovered during an investigation.
Chapter
The increasing use of social media, applications or platforms that allow users to interact online, ensures that this environment will provide a useful source of evidence for the forensics examiner. Current tools for the examination of digital evidence find this data problematic as they are not designed for the collection and analysis of online data. Therefore, this paper presents a framework for the forensic analysis of user interaction with social media. In particular, it presents an inter-disciplinary approach for the quantitative analysis of user engagement to identify relational and temporal dimensions of evidence relevant to an investigation. This framework enables the analysis of large data sets from which a (much smaller) group of individuals of interest can be identified. In this way, it may be used to support the identification of individuals who might be ‘instigators’ of a criminal event orchestrated via social media, or a means of potentially identifying those who might be involved in the ‘peaks’ of activity. In order to demonstrate the applicability of the framework, this paper applies it to a case study of actors posting to a social media Web site.
Article
Email remains a key source of evidence during a digital investigation. The forensics examiner may be required to triage and analyse large email data sets for evidence. Current practice utilises tools and techniques that require a manual trawl through such data, which is a time-consuming process. Recent research has focused on speeding up analysis through the use of data visualization and the quantitative analysis of emails, for example, by analysing actor relationships identified through this medium. However, these approaches are unable to analyse the qualitative content, or narrative, of the emails themselves to provide a much richer picture of the evidence. This paper posits a novel approach which combines both quantitative and qualitative analysis of emails using data visualization to elucidate qualitative information for the forensics examiner. In this way, the examiner is able to triage large volumes of emails to identify actor relationships as well as their network narrative. In order to demonstrate the applicability of this methodology, this paper applies it to a case study of email data.
Article
Purpose – The purpose of this paper is to propose a novel approach that automates the visualisation of both quantitative data (the network) and qualitative data (the content) within emails to aid the triage of evidence during a forensics investigation. Email remains a key source of evidence during a digital investigation, and a forensics examiner may be required to triage and analyse large email data sets for evidence. Current practice utilises tools and techniques that require a manual trawl through such data, which is a time-consuming process. Design/methodology/approach – This paper applies the methodology to the Enron email corpus, and in particular one key suspect, to demonstrate the applicability of the approach. Resulting visualisations of network narratives are discussed to show how network narratives may be used to triage large evidence data sets. Findings – Using the network narrative approach enables a forensics examiner to quickly identify relevant evidence within large email data sets. Within the case study presented in this paper, the results identify key witnesses, other actors of interest to the investigation and potential sources of further evidence. Practical implications – The implications are for digital forensics examiners or for security investigations that involve email data. The approach posited in this paper demonstrates the triage and visualisation of email network narratives to aid an investigation and identify potential sources of electronic evidence. Originality/value – There are a number of network visualisation applications in use. However, none of these enable the combined visualisation of quantitative and qualitative data to provide a view of what the actors are discussing and how this shapes the network in email data sets.
Article
The increasing use of social media, applications or platforms that allow users to interact online, ensures that this environment will provide a useful source of evidence for the forensics examiner. Current tools for the examination of digital evidence find this data problematic as they are not designed for the collection and analysis of online data. Therefore, this paper presents a framework for the forensic analysis of user interaction with social media. In particular, it presents an inter-disciplinary approach for the quantitative analysis of user engagement to identify relational and temporal dimensions of evidence relevant to an investigation. This framework enables the analysis of large data sets from which a much smaller group of individuals of interest can be identified. In this way, it may be used to support the identification of individuals who might be 'instigators' of a criminal event orchestrated via social media, or a means of potentially identifying those who might be involved in the 'peaks' of activity. In order to demonstrate the applicability of the framework, this paper applies it to a case study of actors posting to a social media Web site.
Article
Full-text available
This paper focuses on three emerging techniques for improving the process of automating analysis and retrieval of electronically stored information in discovery proceedings: (1) machine learning to extend and apply users' hypotheses (theories) of document relevance; (2) a hypothesis ontology to generalize user modeling regarding relevance theories; and (3) social network analysis to supplement and apply user modeling regarding relevance theories. Since all three pertain to representing and reasoning with litigators' hypotheses about relevance, and since a central theme of AI & Law involves computationally modeling legal knowledge, reasoning and decision making, all three techniques can be described as characteristic of that field's potentially unique contribution to e-Discovery.
Conference Paper
Full-text available
We shall start with a presentation of some typical and well-known real-life networks. After introducing the fundamental concepts of network analysis, the following topics will be presented: 2-mode to 1-mode networks; clustering and blockmodeling. In the examples we shall use the program Pajek.
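The 2-mode to 1-mode conversion mentioned in the tutorial corresponds to a bipartite projection; a small networkx-based sketch (an illustration only, not Pajek itself) is shown below.

```python
# Sketch: project a 2-mode (person x event) network onto a 1-mode
# person-person network; edge weights count shared events.
import networkx as nx
from networkx.algorithms import bipartite

def to_one_mode(memberships):
    """memberships: iterable of (person, event) pairs."""
    memberships = list(memberships)
    people = {p for p, _ in memberships}
    events = {e for _, e in memberships}
    b = nx.Graph()
    b.add_nodes_from(people, bipartite=0)
    b.add_nodes_from(events, bipartite=1)
    b.add_edges_from(memberships)
    return bipartite.weighted_projected_graph(b, people)
```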
Conference Paper
Full-text available
As part of a long-term investigation into visualizing email, we have created two visualizations of email archives. One highlights social networks while the other depicts the temporal rhythms of interactions with individuals. While interviewing users of these systems, it became clear that the applications triggered recall of many personal events. One of the most striking and not entirely expected outcomes was that the visualizations motivated retelling stories from the users' pasts to others. In this paper, we discuss the motivation and design of these projects and analyze their use as catalysts for personal narrative and recall.
Conference Paper
Full-text available
We propose an analysis of the codified Law of France as a structured system. Fifty-two legal codes are selected on the basis of explicit legal criteria and considered as vertices, with their mutual quotations forming the edges of a network whose properties are analyzed relying on graph theory. We find that a group of 10 codes are simultaneously the most citing and the most cited by other codes, and are also strongly connected together, so forming a "rich club" sub-graph. Three other code communities are also found that somewhat partition the legal field into distinct thematic sub-domains. The legal interpretation of this partition opens new, untraditional lines of research. We also conjecture that many legal systems form this new kind of network, which shares some properties with small worlds but is far denser; we propose to call these "concentrated worlds".
Conference Paper
Full-text available
We present Themail, a visualization that portrays relationships using the interaction histories preserved in email archives. Using the content of exchanged messages, it shows the words that characterize one's correspondence with an individual and how they change over the period of the relationship. This paper describes the interface and content-parsing algorithms in Themail. It also presents the results from a user study where two main interaction modes with the visualization emerged: exploration of "big picture" trends and themes in email (haystack mode) and more detail-oriented exploration (needle mode). Finally, the paper discusses the limitations of the content parsing approach in Themail and the implications for further research on email content visualization.
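Themail's "characterizing words" can be approximated by scoring how much more frequent a term is in one correspondent's mail for a given period than in the archive as a whole; the sketch below is a rough illustration of that idea, not Themail's actual parsing algorithm.

```python
# Rough approximation: for each (contact, month) slice, score terms by their
# relative frequency in the slice divided by their frequency in the archive.
from collections import Counter, defaultdict

def characterizing_words(messages, top=10):
    """messages: iterable of (contact, 'YYYY-MM', [tokens]) tuples."""
    overall = Counter()
    per_slice = defaultdict(Counter)
    for contact, month, tokens in messages:
        overall.update(tokens)
        per_slice[(contact, month)].update(tokens)

    total = sum(overall.values())
    result = {}
    for key, counts in per_slice.items():
        n = sum(counts.values())
        scored = {t: (c / n) / (overall[t] / total) for t, c in counts.items()}
        result[key] = sorted(scored, key=scored.get, reverse=True)[:top]
    return result
```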
Conference Paper
Full-text available
In this paper we address the task of finding topically relevant email messages in public discussion lists. We make two important observations. First, email messages are not isolated, but are part of a larger online environment. This context, existing on different levels, can be incorporated into the retrieval model. We explore the use of thread, mailing list, and community content levels, by expanding our original query with terms from these sources. We find that query models based on contextual information improve retrieval effectiveness. Second, email is a relatively informal genre, and therefore offers scope for incorporating techniques previously shown useful in searching user-generated content. Indeed, our experiments show that using query-independent features (email length, thread size, and text quality), implemented as priors, results in further improvements.
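The contextual query models described here boil down to expanding the original query with salient terms from the thread, list, or community context. The sketch below illustrates the simplest frequency-based variant of that idea; it is not the paper's actual model.

```python
# Minimal sketch of context-based query expansion: add the most frequent
# non-stopword terms from the surrounding thread to the original query.
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def expand_query(query_terms, thread_messages, k=5):
    """query_terms: list of terms; thread_messages: list of token lists."""
    counts = Counter(
        token
        for message in thread_messages
        for token in message
        if token not in STOPWORDS and token not in query_terms
    )
    expansion = [term for term, _ in counts.most_common(k)]
    return list(query_terms) + expansion
```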
Conference Paper
Full-text available
We profile a system for search and analysis of large-scale email archives. The system is built around four facets: a content-based search engine, a statistical topic model, automatically inferred social networks, and time-series analysis. The facets correspond to the types of information available in email data. The presented system allows chaining or combining the facets flexibly. Results of one facet may be used as input to another, yielding remarkable combinatorial power. From an information retrieval point of view, the system provides support for exploration, approximate textual searches and data visualization. We present some experimental results based on a large real-world email corpus.
Conference Paper
Full-text available
Introduction The goal of the enterprise track is to conduct experiments with enterprise data --- intranet pages, email archives, document repositories --- that reflect the experiences of users in real organisations, such that for example, an email ranking technique that is effective here would be a good choice for deployment in a real multi-user email search application. This involves both understanding user needs in enterprise search and development of appropriate IR techniques. The enterprise track began this year as the successor to the web track, and this is reflected in the tasks and measures. While the track takes much of its inspiration from the web track, the foci are on search at the enterprise scale, incorporating non-web data and discovering relationships between entities in the organisation. Obviously, it's hard to imagine that any organisation would be willing to open its intranet to public distribution, even for research, so for the initial document collection we looke
Article
This paper reports on the development of social network analysis, tracing its origins in classical sociology and its more recent formulation in social scientific and mathematical work. It is argued that the concept of social network provides a powerful model for social structure, and that a number of important formal methods of social network analysis can be discerned. Social network analysis has been used in studies of kinship structure, social mobility, science citations, contacts among members of deviant groups, corporate power, international trade exploitation, class structure, and many other areas. A review of the formal models proposed in graph theory, multidimensional scaling, and algebraic topology is followed by extended illustrations of social network analysis in the study of community structure and interlocking directorships.
Article
The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.
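The hub/authority formulation described here is the HITS algorithm; its mutually reinforcing updates can be written as a short power iteration. The sketch below is a textbook rendering, not code from the paper.

```python
# Textbook HITS power iteration: a page's authority score sums the hub
# scores of pages linking to it; its hub score sums the authority scores
# of pages it links to. Scores are L2-normalized each round.
import math

def hits(links, iterations=50):
    """links: dict mapping page -> set of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth
```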
Conference Paper
Citation networks are a cornerstone of network research and have been important to the general development of network theory. Citation data have the advantage of constituting a well-defined set where the nature of nodes and edges is reasonably well specified. Much interesting and important work has been done in this vein, with respect to not only academic but also judicial citation networks. For example, previous scholarship focuses upon broad citation patterns, the evolution of precedent, and time-varying change in the likelihood that communities of cases will be cited. As research on judicial citation and semantic networks transitions from a strict focus on the structural characteristics of these networks to the evolutionary dynamics behind their growth, it becomes even more important to develop theoretically coherent and empirically grounded ideas about the nature of edges and nodes. In this paper, we move in this direction on several fronts. We compare several network representations of the corpus of United States Supreme Court decisions (1791-2005). This corpus is not only of seminal importance, but also represents a highly structured and largely self-contained body of case law. As constructed herein, nodes represent whole cases or individual 'opinion units' within cases. Edges represent either citations or semantic connections. As our broader goal is to better understand American common law development, we are particularly interested in the union, intersection and complement of these various citation networks as they offer potential insight into the long-standing question of whether 'law is a seamless web'. We believe the characterization of law's interconnectedness is an empirical question well suited to the tools of computer science and applied graph theory. While much work still remains, the analysis provided herein is designed to advance the broader cause.
Article
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/ To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want. Keywords: World Wide Web, Search Engines, Information Retrieval, PageRank, Google.
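The ranking signal at the heart of this system is PageRank, usually written as PR(p) = (1 - d)/N + d * sum over pages q linking to p of PR(q)/outdegree(q). The sketch below is a standard power-iteration rendering of that formula, not Google's implementation.

```python
# Standard PageRank power iteration with damping factor d.
# Note: pages with no outlinks ("dangling" pages) simply leak rank here,
# which a production implementation would redistribute.
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - d) / n for p in pages}
        for page, targets in links.items():
            if not targets:
                continue
            share = d * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank
```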
Conference Paper
A large set of email messages, the Enron corpus, was made public during the legal investigation concerning the Enron corporation. This dataset, along with a thorough explanation of its origin, is available at http://www-2.cs.cmu.edu/~enron/. This paper provides a brief introduction and analysis of the dataset. The raw Enron corpus contains 619,446 messages belonging to 158 users. We cleaned the corpus before this analysis by removing certain folders from each user, such as "discussion_threads". These folders were present for most users, and did not appear to be used directly by the users, but rather were computer generated. Many, such as "all_documents", also contained large numbers of duplicate emails, which were already present in the users' other folders. Our goal in this paper is to analyze the suitability of this corpus for exploring how to classify messages as organized by a human, so these folders would have likely been misleading. In our cleaned Enron corpus, there are a total of 200,399 messages belonging to 158 users with an average of 757 messages per user. Figure 1 shows the distribution of emails per user. The users in the corpus are sorted by ascending number of messages along the x-axis. The number of messages is represented in log scale on the y-axis. The horizontal line represents the average number of messages per user (757).
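The cleaning and per-user counts described above amount to straightforward directory bookkeeping; a sketch of how such counts could be reproduced from a maildir-style layout is shown below. The directory structure is an assumption; only the skipped folder names come from the abstract.

```python
# Sketch: count messages per user in an Enron-style layout
# (root/<user>/<folder>/<message files>), skipping computer-generated folders.
import os

SKIP_FOLDERS = {"discussion_threads", "all_documents"}

def messages_per_user(root):
    counts = {}
    for user in sorted(os.listdir(root)):
        user_dir = os.path.join(root, user)
        if not os.path.isdir(user_dir):
            continue
        total = 0
        for folder, subdirs, files in os.walk(user_dir):
            subdirs[:] = [d for d in subdirs if d not in SKIP_FOLDERS]
            total += len(files)
        counts[user] = total
    return counts

# average = sum(counts.values()) / len(counts)   # the paper reports ~757
```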
Article
We present an end-to-end system that extracts a user's social network and its members' contact information given the user's email inbox. The system identifies unique people in email, finds their Web presence, and automatically fills the fields of a contact address book using conditional random fields, a type of probabilistic model well-suited for such information extraction tasks. By recursively calling itself on new people discovered on the Web, the system builds a social network with multiple degrees of separation from the user. Additionally, a set of expertise-describing keywords are extracted and associated with each person. We outline the collection of statistical and learning components that enable this system, and present experimental results on the real email of two users; we also present results with a simple method of learning transfer, and discuss the capabilities of the system for address book population, expert-finding, and social network analysis.
References

  • Weerkamp W, Balog K, de Rijke M. Using contextual information to improve search in email archives. In: 31st European Conference on Information Retrieval. staff.science.uva.nl/~mdr/Publications/Files
  • Kleinberg J. Authoritative sources in a hyperlinked environment. http://www.cs.cornell.edu/home/kleinber/auth
  • Ashley KD, Bridewell W. Emerging AI+law approaches to automating analysis and retrieval of ESI in discovery proceedings. DESI III Global E-Discovery/E-Disclosure Workshop.
  • Klimt B, Yang Y. Introducing the Enron corpus. In: Proceedings of the Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference.
  • Bobrow D, King T, Lee L. Enhancing legal discovery with linguistic processing. DESI I. Second International Workshop on Supporting Search and Sensemaking for Electronically Stored Information in Discovery Proceedings.
  • Craswell N, De Vries A, Soboroff I. Overview of the TREC-2005 enterprise track. In: The Fourteenth Text REtrieval Conference Proceedings.
  • Görg C, Stasko J. Jigsaw: investigative analysis on text document collections through visualization. DESI II. Second International Workshop on Supporting Search and Sensemaking for Electronically Stored Information in Discovery Proceedings.
  • Heer J. Exploring Enron: visual data mining of email.
  • Paul GL, Baron JR (2007). Information inflation: can the legal system adapt?
  • Socha-Gelbmann. EDRM E-Discovery Reference Model.
  • Chaplin D. Conceptual search: ESI, litigation and the issue of language. DESI II. Second International Workshop on Supporting Search and Sensemaking for Electronically Stored Information in Discovery Proceedings.