Using Conditional Random Fields to Detect Different Functional Types of Content in Decisions of United States Courts with Example Application to Sentence Boundary Detection
Jaromír Šavelka
Intelligent Systems Program
University of Pittsburgh
210 South Bouquet Street
Pittsburgh, PA 15260
jas438@pitt.edu

Kevin D. Ashley
Learning Research and Development Center
University of Pittsburgh
3939 O'Hara Street
Pittsburgh, PA 15260
ashley@pitt.edu
ABSTRACT
We detect different functional types of content in decisions of the United States courts using conditional random fields models. The work suggests a possible approach to deal with the obstacles in processing the court decisions automatically. Even basic natural language processing tasks such as sentence boundary detection are challenging when performed on the decisions. It is our assumption that one of the main causes of this difficulty is the presence of heterogeneous content where different categories of content require different treatments. Thus, the goal of this work is to automatically distinguish among the different functional types of content. The trained models find full and incomplete sentences as well as non-sentential sequences. In addition, the models detect external and internal references, quotations, editorial marks, headings, metadata fields and numbering sequences such as lists. We show the utility of this approach on an example of the improved performance in the sentence boundary detection task as compared to existing general approaches that do not take the heterogeneity of the content into account.
KEYWORDS
Sequence labeling, conditional random fields, court decision, sentence boundary detection
ACM Reference format:
Jaromír Šavelka and Kevin D. Ashley. 2017. Using Conditional Random Fields to Detect Different Functional Types of Content in Decisions of United States Courts with Example Application to Sentence Boundary Detection. In Proceedings of 2nd Workshop on Automated Semantic Analysis of Information in Legal Texts, London, Great Britain, June 2017 (ASAIL 2017), 10 pages.
DOI: 10.475/123 4
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ASAIL 2017, London, Great Britain
© 2016 Copyright held by the owner/author(s). 123-4567-24-567/08/06. . . $15.00
DOI: 10.475/123 4
1 INTRODUCTION
In this paper we present some early results from our effort to automatically recognize different functional types of content in US court decisions. The decisions are complex documents where different types of content appear side by side. Sentences may be interleaved with citations and editorial marks (e.g., a page number or correction). The whole decision as well as the individual sentences may be organized into list-like structures. Excerpts from other documents as well as speech transcripts are often present. The decisions may be very long, organized into sections and subsections.
The heterogeneity of content is one of the reasons why it is difficult to apply most of the well established NLP techniques to court decisions. Even basic NLP tasks such as sentence boundary detection are challenging when performed on the decisions. Wyner and Peters observe that lists (using punctuation, including enumerations, colons) and references (containing a mix of punctuation and alpha-numeric characters) confound tokenization and sentence splitting. [20] Although they talk about regulatory texts, these observations apply to the decisions as well. At the same time the use of the mentioned techniques is foundational to many valuable applications from information retrieval to computer assisted reasoning. Less than optimal performance in the lower level processing often introduces mistakes into the pipeline that are impossible to correct in later stages. It is our assumption that the ability to distinguish different types of content may dramatically improve the quality of the processing. It appears that de Maat and Winkels express similar assumptions in [2]. They talk about splitting the sentences into the principle and the auxiliary. In addition, they observe that lists cause degradation of sentence classification performance.
We developed a simple labeling scheme that captures some of the functional types of content. The scheme is described in Section 2. It is a set of types such as sentential and non-sentential content or references and quotations. We applied the scheme to a data set of 19 selected court decisions (mostly from the domain of cyber crime), producing more than 20,000 annotations. The annotation effort and the resulting data set are the subject of Section 3. We train sequence labeling models (specifically CRF) capable of producing the annotations automatically. The training and evaluation of the models are described in Section 4. Finally, we use some of the models in the task of sentence boundary detection (Section 6). We show improved performance over the methods that do not take the heterogeneity of the content into account.
There is a large body of research in AI & Law that focuses on extraction of various types of content from legal documents. The most closely related to our work is the consistent effort in extraction of references and citations. Reference recognition focuses on finding references to other documents (or their specific parts) such as statutory provisions, court decisions, or expert literature. Tran et al. present a framework for recognition and resolution of references in Japanese law. [17, 18] De Maat et al. developed a system for detection of references in Dutch law. [3] Van Opijnen et al. focus not only on the references to Dutch legal documents but also recognize the references to EU documents. [19]
Dell'Orletta et al. present a law specific account of sentence boundaries which is in many respects similar to the one we use here. [4] Grabmair et al. developed a dedicated sentence boundary detection module to operate on court decisions by extending LingPipe.1 [5]
1 http://alias-i.com/lingpipe/
2 LABELING SCHEME
At the most fundamental level we distinguish between sentential and non-sentential content. The distinction is drawn at a semantic level, not purely a grammatical one. This may appear quite counterintuitive on certain occasions. The sentential content comprises unstructured data in the form of natural language text that plays the role of a substantive statement in a decision. The rest of the content, which mostly plays a supportive role, such as structured meta data or standardized references, is considered non-sentential. The distinction appears to be similar to the one suggested in [2] (principle and auxiliary sentences). We further differentiate the sentential content into sentences and incomplete sentences.
Sentence is a sequence of tokens (most often words) that together form a full stand-alone grammatical sentence. It usually starts with a capital letter and ends with a period, exclamation mark, question mark, or (rarely) a different symbol. Consider the following examples:
(1) We have recognized that even a limited search of the person is a substantial invasion of privacy.
(2) In other contexts, however, we have held that although "some quantum of individualized suspicion is usually a prerequisite to a constitutional search or seizure[,] . . . the Fourth Amendment imposes no irreducible requirement of such suspicion."
(3) Justice Jackson rejected the view that the search could be supported by exigent circumstances:
Example 1 shows a typical sentence starting with a capital letter and ending with a period. Example 2 shows a more complex sentence with a quotation and some editorial marks (e.g., ". . . ") embedded within it. Example 3 presents a sentence ending with a colon, which is traditionally not used to terminate a sentence.
Incomplete sentence is a sequence of tokens (most often words) that do not form a full grammatical sentence. These sequences usually appear in headings or parentheses. They may also appear in front of lists or enumerations. Examples:
(1) (student with bloodshot eyes wandering halls in violation of school rule requiring students to remain in examination room or at home during midterm examinations)
(2) The nature of the writ.
Example 1 is an expression in parentheses which does not form a full grammatical sentence. Example 2 looks like a full sentence but it lacks a verb phrase. This may often be the case with headings.
Non-sentential piece of content is a sequence of tokens which is not unstructured text in natural language. The sequence is often not formed of words or has very few words. Typically, it would make no sense to assign the tokens of such a sequence part of speech tags (e.g., noun, verb). These sequences usually appear as references or as meta data in front matter. Some examples are shown in the following list:
(1) See Warden v. Hayden, 387 U.S. 294, 306-307 (1967).
(2) November 14, 2012.
(3) [ Footnote 1 ]
Example 1 is a reference in a standardized format. Note that on a purely functional level this is an imperative sentence, yet within our labeling scheme it is considered a non-sentential sequence. This is because it is not a substantive statement—its role is rather supportive. Example 2 shows a date expression. Example 3 presents a numbering token from the footnotes list.
Citations play an important role in court decisions. To account for this phenomenon we detect references to other documents (external references) and references to other parts of the same document (internal references). In addition, we look for sequences of text that come from other documents or other parts of the same document (quotations). It is worth mentioning that these types are not mutually exclusive with the types presented earlier (i.e., sentence, non-sentential sequence). Quite the contrary, almost any reference is at the same time a non-sentential sequence (but not necessarily the other way around).
External reference is a sequence of tokens the role of which is to point to a document (or its part) other than the one in which it is embedded. External references often have a standardized format (see the examples) and they could be easily recognized:
(1) [469 U.S. 325, 329]
(2) West Virginia State Bd. of Ed. v. Barnette, 319 U.S. 624, 637 (1943)
(3) Parent-Student Handbook of Piscataway [N. J.] H. S. (1979), Record Doc. S-1, p. 7.
Example 1 shows a reference to a statute. Example 2 is a reference to a court decision. Example 3 presents a reference to a book.
Internal reference is a sequence of tokens the role of which is to point to another place in the same document. Internal references often have the form of footnote pointers, pointers to figures or tables, or pointers to preceding or following sections of the document. Consider the following examples:
(1) Ante, at 337
(2) Figure 1
Example 1 shows a reference to a preceding part of the text (e.g., a page or paragraph). Example 2 shows a reference to a figure within the decision.
Detecting Dierent Types of Content in US Court Decisions ASAIL 2017, June 2017, London, Great Britain
Quotation is a sequence of tokens from a different document (or from a different part of the same document) that is copied into the document of interest. Sometimes the sequence could be modified (still preserving the meaning). Quite often a quotation is surrounded with quotation marks.
(1) "a school official may properly conduct a search of a student's person if the official has a reasonable suspicion that a crime has been or is in the process of being committed, or reasonable cause to believe that the search is necessary to maintain school discipline or enforce school policies."
(2) "[A]s a general rule notice and hearing should precede removal of the student from school. We agree . . . , however, that there are recurring situations in which prior notice and hearing cannot be insisted upon. Students whose presence poses a continuing danger to persons or property or an ongoing threat of disrupting the academic process may be immediately removed from school. In such cases the necessary notice and rudimentary hearing should follow as soon as practicable . . . ."
Example 1 shows an in-line quotation which is a part of a surrounding sentence. Example 2 is a standalone quotation consisting of a number of sentences. It also contains a couple of editorial marks (". . . ", square brackets).
We detect a number of other constituents in a decision because these could be useful in more advanced processing stages. In particular, we are interested in constituents playing a specific role in a decision (headings). We also detect elements suggestive of the structure of the presented content (numbering tokens). Finally, we are interested in the elements that are inserted in the document to make it more readable or useful (editorial marks and meta data fields).
Heading is a sequence of tokens, either a full sentence or an incomplete sentence, the role of which is to open (and concisely describe) a document or a section of a document. Examples:
(1) JUSTICE WHITE delivered the opinion of the Court.
(2) MEMORANDUM AND ORDER ON A DEFENDANTS' MOTION TO DISMISS
Example 1 is a typical template sentence opening the opinion part of the decision. Strictly speaking one would not consider this sentence a heading within the general meaning of the word. However, we consider it as such in our labeling scheme. Example 2 is a heading opening a specific part of an opinion.
Numbering token (or a sequence of tokens) defines a position of the associated piece of text within a structure such as a list or a tree. Numbering can usually be found at the beginning of a line but it can also appear within a paragraph.
(1) 1st.
(2) III
(3) [ Footnote 1 ]
(4) (viii)
Examples 1 through 4 show different types of numbering tokens that are commonly found in the decisions.
Editorial mark is any token or sequence of tokens which is introduced in a text by somebody other than its author in order to make the text more readable or useful. This could be certain symbols introduced into a citation by the author of a document or symbols introduced into a document by a publisher.
(1) *153
(2) . . .
(3) [A]
Example 1 shows an editorial mark which is commonly used to indicate the beginning of a page of the specified number. This type of mark is often used in documents that were transformed from a printed form into an electronic form. Example 2 is a mark indicating that a piece of original content is left out. The square brackets shown in Example 3 indicate that "A" was inserted in the original text by an editor.
Meta data eld is a sequence of tokens representing a structured
information, usually in a form of a label–value pair. Consider the
following examples:
(1) Filed: April 16th, 2009
(2)
Panel: Frank Hoover Easterbrook, Michael Stephen Kanne,
Ann Claire Williams
Example 1 is a meta data eld providing an information about
the day on which the motion was led. Example 2 shows a meta
data eld that provides an information about the members of the
deciding panel.
3 DATA SET
We downloaded 19 court decisions from the online CourtListener service.2 13 of these decisions are from the area of cyber crime (cyber bullying, credit card fraud, possession of electronic child pornography), 3 are landmark SCOTUS decisions, and the remaining 3 are cases involving intellectual property. We use 10 cyber crime decisions as a training and development set (cyber-train) and the remaining 3 cyber crime decisions as a hold-out test set (cyber-test). We use the 3 SCOTUS decisions and the 3 intellectual property decisions as additional test sets (scotus-test, ip-test). Although we focus on the area of cyber crime, we use the additional test sets to measure how well the trained models generalize to other domains.
The two human annotators (the authors of this paper) annotated the decisions with the labels described in Section 2. Each decision was annotated by only one of the annotators. We did not measure inter-annotator agreement. We simply assume the labels provided by each annotator to be correct. We are well aware that this assumption does not hold. Despite this we are convinced that, given the goal of this work, it is a reasonable assumption anyway. The annotators were provided with a codebook that contains information similar to what is presented in Section 2 (i.e., a description of each label with examples). In addition, for each label there was a set of if–then rules guiding the annotators, such as the following one:
If a sequence of tokens represents what is written in another document or at a different place of the document of interest then it is a quotation.
The basic characteristics of the data set are summarized in Table 1. Summary statistics about the annotations are provided in Table 2. The relatively small number of decisions may suggest a very small data set. However, some of the annotated decisions are very long.
2 www.courtlistener.com
Table 1: Data set summary statistics. Length of longest,
shortest and average documents is reported in number of
characters.
cyber-train cyber-test scotus-test ip-test
# of docs 10 3 3 3
# of tokens 219484 75158 130065 42382
longest doc 181009 121910 190352 54430
average doc 58237 67820 118749 38549
shortest doc 16859 29278 47450 16546
very long. e longest decision with 190,352 characters roughly
corresponds to 80–100 standard pages. Even the shortest decision
(16,546 characters) corresponds to approximately 7 standard pages.
e total number of annotations (21,294) clearly shows that the
data set is sucient for far more than toy experiments.
4 LABELING THE SEQUENCES AUTOMATICALLY
We use the cyber-train data set to train a CRF model for each of the 10 labels. Although this is certainly suboptimal, we use the same training strategy and features for all the models. It should be evident that different types (such as a sentence or an editorial mark) could most likely benefit from a custom-tailored model and contextual features. We reserve fine-tuning of the individual models for future work. A CRF is a random field model that is globally conditioned on an observation sequence O. The states of the model correspond to event labels E. We use a first-order CRF in our experiments (observation O_i is associated with E_i). We use the CRFsuite3 implementation of first-order CRF. [7, 8, 11]
3 www.chokkan.org/software/crfsuite/
We use a very aggressive tokenization strategy that segments text into a greater number of tokens than usual. We consider an individual token to be any consecutive sequence of either:
(1) letters
(2) numbers
(3) whitespace
Each character that does not belong to any of the above constitutes a single token. For example, the following sequence is tokenized as shown below:
Call me at 9am on my phone (123)456-7890.
["Call", " ", "me", " ", "at", " ", "9", "am", " ", "on", " ", "my", " ", "phone", " ", "(", "123", ")", "456", "-", "7890", "."]
Each of the tokens is then a data point in the sequence a CRF model operates on.
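The paper does not include code for this step; the following is a minimal Python sketch of the aggressive tokenization described above, assuming a simple regular-expression implementation (ASCII letters only) rather than the authors' actual code.

import re

# A sketch of the aggressive tokenization described above (not the authors'
# implementation): runs of letters, runs of digits, and runs of whitespace
# each become one token; every other character becomes a token on its own.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|\s+|.", re.DOTALL)

def tokenize(text):
    # No characters are discarded; whitespace runs are kept as tokens.
    return TOKEN_RE.findall(text)

print(tokenize("Call me at 9am on my phone (123)456-7890."))
# ['Call', ' ', 'me', ' ', 'at', ' ', '9', 'am', ' ', 'on', ' ', 'my', ' ',
#  'phone', ' ', '(', '123', ')', '456', '-', '7890', '.']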
Each token is represented by a small set of relatively simple features. Specifically, the set includes:
(1) atfront: a binary feature which is set to true if the token is located among the first 20% of tokens of the whole document.
(2) atback: a binary feature which is set to true if the token is located among the last 20% of tokens of the whole document.
(3) lower: the token in lower case.
(4) sig: a feature representing a signature of the token. This feature corresponds to the token with the following transformations applied:
(a) each lower case letter is rewritten to "c"
(b) each upper case letter is rewritten to "C"
(c) each digit is rewritten to "D"
(5) length: a number corresponding to the length of the token in characters if the length is smaller than 4. If the length is between 4 and 6 the feature is set to "normal". If it is greater than 6 it is set to "long".
(6) islower: a binary feature which is set to true if all the token characters are in lower case.
(7) isupper: a binary feature which is set to true if all the token characters are in upper case.
(8) istitle: a binary feature which is set to true if the first of the token characters is in upper case and the rest are in lower case.
(9) isdigit: a binary feature which is set to true if all the token characters are digits.
(10) isspace: a binary feature which is set to true if all the token characters are whitespace.
In addition, for each token we also include the lower, sig, islower, isupper, istitle, isdigit, and isspace features from the five preceding tokens and five following tokens. If one of these tokens falls beyond the document boundaries we signal this by including BOS (beginning of sequence) and EOS (end of sequence) features.
Consider the "Call me at 9am . . ." sequence from the example presented above. The third token of this sequence ("me") would be represented along the following lines:
{bias, atfront=true, atback=false, 0:lower=me,
0:sig=cc, 0:length=2, 0:islower=true,
0:isupper=false, 0:istitle=false, 0:isdigit=false,
0:isspace=false, -3:BOS, -2:lower=call, -2:sig=Ccc,
-2:length=normal, -2:islower=false, -2:isupper=false,
-2:istitle=true, -2:isdigit=false, -2:isspace=false,
-1:lower=" ", -1:sig=" ", -1:length=1,
-1:islower=false, -1:isupper=false, -1:istitle=false,
-1:isdigit=false, -1:isspace=true
...}
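The following Python sketch shows one way to build such feature dictionaries; it follows the feature descriptions above but is an illustration rather than the authors' implementation (the window size and feature names come from the text, the function names and everything else are assumptions).

def token_features(tok):
    # Per-token features, reused for the current token and its neighbours.
    sig = "".join("C" if c.isupper() else "c" if c.islower() else
                  "D" if c.isdigit() else c for c in tok)
    length = len(tok) if len(tok) < 4 else ("normal" if len(tok) <= 6 else "long")
    return {"lower": tok.lower(), "sig": sig, "length": length,
            "islower": tok.islower(), "isupper": tok.isupper(),
            "istitle": tok.istitle(), "isdigit": tok.isdigit(),
            "isspace": tok.isspace()}

def featurize(tokens, i, window=5):
    # Feature dictionary for tokens[i] with five tokens of context on each side.
    n = len(tokens)
    feats = {"bias": True,
             "atfront": i < 0.2 * n,   # among the first 20% of tokens
             "atback": i >= 0.8 * n}   # among the last 20% of tokens
    for off in range(-window, window + 1):
        j = i + off
        if j < 0:
            feats[f"{off}:BOS"] = True
        elif j >= n:
            feats[f"{off}:EOS"] = True
        else:
            for name, value in token_features(tokens[j]).items():
                feats[f"{off}:{name}"] = value
    return feats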
As labels we use the annotation types projected into the BILOU4 scheme. Considering an example of annotating time (TIM) and phone number (TEL) expressions, the "Call me at 9am . . ." sequence would be labeled as follows:
[O, O, O, O, O, B-TIM, L-TIM, O, O, O, O, O, O, B-TEL, I-TEL, I-TEL, I-TEL, I-TEL, L-TEL, O]
For demonstration purposes we show multiple annotation types at once. In our work we used one annotation type for each model. In addition, instead of the TIM and TEL types from this example we worked with the types presented in Section 2. Thus, each of the models was trained on a tag set with 5 labels such as the following one:
B-SENT, I-SENT, L-SENT, O, U-SENT
4 B: beginning of sequence, I: inside sequence, L: last in sequence, O: outside of sequence, U: unit-length sequence
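As a concrete illustration of this setup, the sketch below trains one such per-type model with the python-crfsuite bindings for CRFsuite, reusing the featurize function sketched earlier; the span representation, helper names, and model file name are assumptions made for the example, not details taken from the paper.

import pycrfsuite

def bilou_labels(n_tokens, spans, tag):
    # Project (start, end) token spans of one annotation type onto BILOU labels;
    # end is exclusive.
    labels = ["O"] * n_tokens
    for start, end in spans:
        if end - start == 1:
            labels[start] = "U-" + tag
        else:
            labels[start] = "B-" + tag
            labels[end - 1] = "L-" + tag
            for k in range(start + 1, end - 1):
                labels[k] = "I-" + tag
    return labels

def train_one_type(documents, tag, model_path):
    # documents: list of (tokens, spans) pairs for a single annotation type.
    trainer = pycrfsuite.Trainer(verbose=False)
    for tokens, spans in documents:
        xseq = [featurize(tokens, i) for i in range(len(tokens))]
        yseq = bilou_labels(len(tokens), spans, tag)
        trainer.append(xseq, yseq)
    trainer.train(model_path)  # first-order (linear-chain) CRF

# Tagging a new document with a trained model:
# tagger = pycrfsuite.Tagger(); tagger.open("sent.crfsuite")
# labels = tagger.tag([featurize(tokens, i) for i in range(len(tokens))])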
Detecting Dierent Types of Content in US Court Decisions ASAIL 2017, June 2017, London, Great Britain
Table 2: Summary statistics of annotations for the cyber-train, cyber-test, scotus-test, and ip-test data sets. Average sequence length is reported in number of characters. Annotation types: Sentence (SENT), Incomplete Sentence (ISENT), Non-sentential Sequence (NSENT), External Reference (EREF), Internal Reference (IREF), Quotation (QUOT), Editorial Mark (EMARK), Heading (HEAD), Numbering Token (NMB), Meta Data Field (MDF).
SENT ISENT NSENT EREF IREF QUOT EMARK HEAD NMB MDF
cyber-train
# of seq 3493 495 1945 1574 189 959 818 108 360 43
avg # of seq per doc 349 50 195 157 19 96 82 11 36 4
avg seq length 145 50 31 37 12 143 6 23 3 40
cyber-test
# of seq 1060 125 728 753 46 325 245 21 109 19
avg # of seq per doc 353 42 243 254 15 108 82 7 36 6
avg seq length 163 84 29 30 7 112 4 33 3 63
scotus-test
# of seq 1984 158 1316 1256 185 375 164 34 182 2
avg # of seq per doc 661 53 439 419 62 125 55 11 61 1
avg seq length 153 54 34 38 8 123 6 29 3 26
ip-test
# of seq 608 119 474 506 37 224 116 38 89 2
avg # of seq per doc 203 40 158 169 12 75 39 13 30 1
avg seq length 160 58 26 27 3 111 6 28 3 91
total
# of seq 7145 897 4463 4099 457 1883 1343 201 740 66
avg # of seq per doc 376 47 235 216 24 99 71 11 39 3
avg seq length 151 57 31 34 9 130 6 26 3 48
5 RESULTS
The results of applying the CRF models trained on the cyber-train data set to the cyber-test, scotus-test, and ip-test data sets are summarized in Table 3. In the table each column represents a type and each row represents a token type (from the BILOU scheme) for one of the three test sets. We report the F1 value as well as the support (the number of true labels of the respective type within the BILOU scheme, e.g., the number of B-SENT) for each token type. The feature set we have selected works well for the Sentence, Non-sentential Sequence, and External Reference types across all three data sets. In the case of the Internal Reference, Quotation, Editorial Mark, and Numbering types the performance is somewhat lower although quite promising. The features apparently do not work well for the Incomplete Sentence and Meta Data Field types. We reserve fine-tuning of the models (and thus improvement of performance) for future work. As expected, the performance in almost all the categories is slightly higher on the cyber-test data set when compared to the performance on the scotus-test and ip-test data sets.
6 EXAMPLE APPLICATION ON SENTENCE BOUNDARY DETECTION
The goal in sentence boundary detection (SBD) is to split a natural language text into individual sentences (i.e., identify each sentence's boundaries). Typically, SBD is operationalized as a binary classification of a fixed number of candidate boundary points (e.g., ".", "!", "?"). SBD can be a critical task in many applications such as machine translation, summarization, or information retrieval.
Approaches to SBD roughly fall into three categories:
(1) Rules. A battery of hand-crafted matching rules is applied. The rules may look like the following:
IF "!" OR "?" MATCHED MARK AS BOUND
IF <EOL> <EOL> MATCHED MARK AS BOUND
The first rule states that every time there is a "!" or "?" the system should consider it a boundary. The second rule can be understood in such a way that a boundary should be predicted every time the system encounters two consecutive line breaks (a minimal code sketch of this approach follows the list below).
(2) Supervised Machine Learning (ML). Given that a triggering event occurs, decide whether it is an instance of a sentence boundary. Each event is represented in terms of selected features such as the following:
x_i = <0:token=".", 0:isTrigger=1, -1:token="Mr", -1:isAbbr=1, 1:token="Lange", 1:isName=1>
Given the labels y_i ∈ {0, 1}, the supervised classifier is a function f(x_i) → y_i.
(3) Unsupervised ML. Similar to the supervised ML approach, but the system is trained on unlabeled data.
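As promised in item (1), here is a minimal Python sketch of a rule-based splitter implementing just the two example rules above; it is purely illustrative and not one of the systems evaluated later.

import re

# The two example rules: "!" or "?" marks a boundary, and so do two
# consecutive line breaks.
BOUNDARY_RE = re.compile(r"[!?]|\n[ \t]*\n")

def rule_based_boundaries(text):
    # Character offsets immediately after each predicted boundary.
    return [m.end() for m in BOUNDARY_RE.finditer(text)]

print(rule_based_boundaries("Is it a boundary? Yes!\n\nNew paragraph."))
# [17, 22, 24]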
Multiple SBD systems have been reported as having excellent performance: [14]
- 99.8% accuracy of a tree-based classifier in predicting whether a "." ends a sentence, evaluated on the Brown corpus [16]
- 99.5% accuracy of a combination of an original system based on neural nets and decision trees with an existing system [1], evaluated on the WSJ corpus [12]
Table 3: Annotation types: Sentence (SENT), Incomplete Sentence (ISENT), Non-sentential Sequence (NSENT), External Reference (EREF), Internal Reference (IREF), Quotation (QUOT), Editorial Mark (EMARK), Heading (HEAD), Numbering Token (NMB), Meta Data Field (MDF). The table shows F1 and support for all the types on the three data sets.
SENT ISENT NSENT EREF IREF QUOT EMARK HEAD NMB MDF
F1supp F1supp F1supp F1supp F1supp F1supp F1supp F1supp F1supp F1supp
cyber-test
B .93 1058 .40 121 .88 645 .85 564 .78 45 .73 325 .83 98 .62 16 .90 109 .89 19
I .96 57630 .42 3400 .94 10483 .95 565 .43 117 .69 325 .76 142 .69 180 .89 66 .87 370
L .93 1058 .44 121 .88 646 .60 565 .80 45 .75 11807 .84 107 .29 16 .90 109 .83 19
O .86 15412 .98 71512 .99 63307 .99 10496 1.0 74951 .94 62701 1.0 74718 1.0 74941 1.0 74874 1.0 74750
U 0.0 4 .93 77 0.0 194 .16 93 .75 5
AVG .94 75158 .95 75158 .98 75158 .98 75158 1.0 75158 .90 75158 1.0 75158 1.0 75158 1.0 75158 1.0 75158
scotus-test
B .95 1976 .25 144 .80 1165 .74 1107 .44 139 .49 374 .66 102 .32 23 .50 2 .24 90
I .98 99514 .27 2625 .92 20377 .90 19923 .51 540 .41 15496 .58 242 .43 221 .60 14 .30 68
L .95 1977 .39 144 .81 1157 .47 1102 .13 139 .51 373 .69 102 .47 21 .50 2 .24 90
O .94 26598 .99 127139 .98 107215 .98 107785 1.0 129201 .91 113822 1.0 129603 1.0 129789 1.0 130047 1.0 129725
U 0.0 13 .70 151 0.0 148 .29 46 .54 16 .71 11 .87 92
AVG .97 130065 .97 130065 .97 130065 .96 130065 .99 130065 .85 130065 1.0 130065 1.0 130065 1.0 130065 1.0 130065
ip-test
B .93 600 .38 105 .81 468 .81 445 .91 32 .73 224 .70 56 .09 31 .88 89 0.0 2
I .97 31675 .32 2094 .87 6026 .92 6297 .71 50 .69 8208 .80 133 .17 290 .84 60 0.0 69
L .93 603 .47 105 .86 472 .51 447 .85 37 .73 224 .71 57 .47 30 .86 88 0.0 2
O .90 9504 .98 40067 .98 35416 .98 35135 1.0 42263 .91 33726 1.0 42113 1.0 42027 1.0 42145 1.0 42382
U 0.0 11 .32 58 .82 23 .22 4
AVG .96 42382 .94 42382 .96 42382 .97 42382 1.0 42382 .87 42382 1.0 42382 .99 42382 42382 1.0 42382
- 99.75% accuracy (WSJ) and 99.64% (Brown) of a maximum entropy model in assessing ".", "!", and "?" [15]
- 99.69% (WSJ) and 99.8% (Brown) of a rule-based sentence splitter combined with a supervised POS-tagger [10]
- 98.35% (WSJ) and 98.98% (Brown) of an unsupervised system based on identification of abbreviations [6]
Read et al. [14] conducted a study of SBD system performance across different corpora and report more modest results ranging from 95.0% to 97.6% for different systems. They also tested the systems on corpora of user generated web content. The performance of the SBD systems deteriorated on these corpora, where the accuracy often falls in the lower nineties. [14]
6.1 SBD on Court Decisions
Court decisions are more challenging for SBD than news articles—the traditional subject of interest. Whereas news articles are generally short texts, a decision may be short but it may also be as long as a book (consider the 80–100 page long decision in Table 1). A decision may be structured into sections and subsections preceded by a heading (possibly numbered). A decision may contain specific constituents such as a header and a footer, footnotes, or lists. Sentences are interleaved with citations. The sentences themselves may be extremely long, even organized as lists. In decisions there is heavy use of sentence organizers such as ";" or "—" and brackets (multiple types). Quotes (possibly nested) are frequent.
Let us consider the following example of a very long sentence coming from a decision:
As used in the statute, "'act in furtherance of a person's right of petition or free speech under the United States or California Constitution in connection with a public issue' includes: (1) any written or oral statement or writing made before a legislative, executive, or judicial proceeding, or any other official proceeding authorized by law; (2) any written or oral statement or writing made in connection with an issue under consideration or review by a legislative, executive, or judicial body, or any other official proceeding authorized by law; (3) any written or oral statement or writing made in a place open to the public or a public forum in connection with an issue of public interest; (4) or any other conduct in furtherance of the exercise of the constitutional right of petition or the constitutional right of free speech in connection with a public issue or an issue of public interest. (§ 425.16, subd. (e), italics added; see Briggs v. Eden Council for Hope & Opportunity (1999) 19 Cal. 4th 1106, 1117-1118, 1123 [81 Cal.Rptr.2d 471, 969 P.2d 564] [discussing types of statements covered by anti-SLAPP statute].)
The example sentence contains a quotation (with a nested quotation) organized as a list, followed by citations and their captions. This text is very challenging for an SBD system because it contains many triggering events that are not sentence boundaries.
The sentences in decisions tend to be very long (like the example sentence shown above). This may cause trouble for various components in the processing pipeline (e.g., syntactic parsing might be difficult). Therefore, it makes sense to adopt an aggressive segmentation strategy and predict boundaries wherever possible. One such opportunity is semicolons, which are sometimes used to separate items in a list (as shown in Example 1 below) as well as independent clauses (Example 2). The authors of [4] also decided to go beyond the traditional triggering events (e.g., they use ";" and ":" as well).
(1) [O]ur family suffered: emotional distress; anxiety; sleeplessness; physical pain; insecurity; fear; pain and suffering; payment of attorneys' fees; payment of medical expenses;
Detecting Dierent Types of Content in US Court Decisions ASAIL 2017, June 2017, London, Great Britain
Table 4: SBD Data set summary statistics. Length of longest,
shortest and average sentences is reported in number of
characters.
cyber-train cyber-test scotus-test ip-test
# of docs 10 3 3 3
# of sentences 5423 1757 3214 1107
longest sentence 1065 1182 670 1145
average sentence 106 114 110 103
shortest sentence 1 1 1 2
payment of moving expenses; payment of *1204 traveling and housing expenses to and from Los Angeles to support our business endeavors; [and] [D.C.]'s lost income. . . .
(2) It takes RAPCO a year or two to design, and obtain approval for, a complex part; the dynamometer testing alone can cost $75,000. . . . Drawings and other manufacturing information contain warnings of RAPCO's intellectual-property rights; every employee receives a notice that the information with which he works is confidential.
Additional complexity is caused by the presence of informal, poorly edited text:
The next post, from "DAN JUSTICE," is the first to raise the rhetoric to a level that could, when considered out of context, be construed as a threat. It says "HEY [D.C.], I KNOW A GOOD *** WHEN I SEE ONE. I LIKE WHAT I SEE, LET'S GO GET SOME COFFEE. ****** im gonna kill you" and is signed "H-W student."5
A sentence may span a double line break:
Section 1(4) of the Uniform Act provides that:
"Trade secret" means information, including a formula, pattern, compilation, program, device, method, technique, or process . . .
Headings are examples of sequences that may not have a triggering event at all:
FACTS AND PROCEDURAL HISTORY
6.2 SBD Data Set
We use the cyber-train, cyber-test, scotus-test, and ip-test data sets for the SBD experiments. Based on the phenomena presented above we concluded that a definition of SBD as a binary classification of a limited number of triggering events is not adequate. Furthermore, we observe that segmentation of a decision into sentences (or sentence-like units) may often be done in multiple different ways. Therefore, we do not adopt the limited view of triggering events (allowing possibly any token to be a boundary). We also apply a consistent policy of aggressive segmenting (i.e., if doubts exist, there is a boundary). The basic characteristics of the SBD data set are summarized in Table 4.
5 The asterisks are ours.
6.3 SBD Experiments
We conduct a series of experiments to test the hypothesis that taking the heterogeneity of a decision's content into account leads to improved performance in SBD. First, we measure the performance of existing vanilla SBD systems (i.e., using the pre-trained general models) on the cyber-test, scotus-test, and ip-test data sets. Second, we test the performance of the same models trained on the cyber-train data set. Finally, we measure the performance of a custom CRF solution that uses automatically predicted information about the different types of content. We use the sentence, incomplete sentence, and non-sentential sequence types as features. Note that these types, although quite related, do not map directly to the sentence boundaries within the meaning of SBD. Also note that we use as features the automatic predictions of these types—not the gold standard.
For evaluation we use traditional information retrieval measures—precision (P), recall (R), and F1-measure (F1). We evaluate the performance from two different perspectives:
(1) boundaries: each boundary counts on its own
(2) segments: both boundaries need to match
For each perspective we use two approaches to determine if a boundary was predicted correctly:
(1) strict: boundary offsets match exactly
(2) lenient: the difference between boundary offsets does not contain an alphanumeric character
Let us consider the following example where § stands for a true boundary and & for a predicted boundary:
§&Accordingly, we find that the circuit court did not abuse its discretion when it denied Mr.& &Renfrow's motion for a JNOV.§ §**&We find no merit to this issue.§&
Two of the predicted boundaries match the true boundaries. The remaining three differ. In the case of one of the three, the difference consists only of the two asterisks (non-alphanumeric). From the strict boundaries perspective (strict-B) the P is 0.4 and R is 0.5. Using the lenient boundaries perspective (lenient-B) the P is 0.6 and R is 0.75. From the strict segments perspective (strict-S) both P and R are 0 (no segment is predicted correctly). Using the lenient segments perspective (lenient-S) the P is 0.33 and R is 0.5. Using the different perspectives allows a more detailed analysis of a model's performance. As shown above, decent performance in predicting boundaries correctly does not necessarily imply that whole segments are predicted correctly as well.
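The boundary-level part of this evaluation can be made concrete with a short Python sketch; representing boundaries as character offsets and the function names below are assumptions made for illustration, not the authors' code.

def is_lenient_match(p, t, text):
    # A predicted boundary p and a true boundary t (character offsets) match
    # leniently if the text between them contains no alphanumeric character.
    lo, hi = sorted((p, t))
    return not any(ch.isalnum() for ch in text[lo:hi])

def boundary_scores(pred, true, text, lenient=False):
    # Precision and recall from the "boundaries" perspective; the "segments"
    # perspective would additionally require both ends of a segment to match.
    match = (lambda p, t: is_lenient_match(p, t, text)) if lenient \
            else (lambda p, t: p == t)
    matched_pred = sum(any(match(p, t) for t in true) for p in pred)
    matched_true = sum(any(match(p, t) for p in pred) for t in true)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_true / len(true) if true else 0.0
    return precision, recall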
6.4 Performance of Vanilla SBD Systems
For the evaluation of SBD system performance on the corpus of court decisions we use one system from each category:
(1) We work with the SBD module from the Stanford CoreNLP toolkit [9] as an example of a system based on rules.6
(2) To test a system based on a supervised ML classifier we employ the SBD component from openNLP.7
6 nlp.stanford.edu/software/corenlp.shtml
7 opennlp.apache.org
(3) As an example of an unsupervised system we use the punkt [6] module from the NLTK toolkit.8
The criterion for selection of the SBD systems was the assumed wide adoption of the general toolkits the respective SBD systems are part of.
8 nltk.org/api/nltk.tokenize.html
e rule based sentence splier from Stanford CoreNLP requires
a text to be already segmented into tokens. e system is based
on triggering events the presence of which is a prerequisite for a
boundary to be predicted. e default events are a single “. or a
sequence of “?” and “!”. e system may use information about
paragraph boundaries which can be congured as either a single
EOL or two consecutive EOLs. e system may also exploit HTML
or XML markup if present. Certain paerns that may appear aer
a boundary are treated as parts of the preceding sentence (e.g.,
parenthesized expression).
e supervised sentence splier from OpenNLP is based on a
maximum entropy model which requires a corpus annotated with
sentence boundaries. e triggering events are “.”, “?”, and “!”. As
features the system uses information about the token containing
the potential boundary and about its immediate neighbours:
the prex
the sux
the presence of particular chars in the prex and sux
whether the candidate is an honoric or corporate desig-
nator
features of the words le and right of the candidate [13]
e unsupervised sentence splier (punkt) from NLTK does not
depend on any additional resources besides the corpus it is supposed
to segment into sentences. e leading idea behind the system is
that the chief source of wrongly predicted boundaries are periods
aer abbreviations. e system discovers abbreviations by testing
the hypothesis
P(·|w)=
0
.
99 against the corpus. Additionally,
token length (abbreviations are short) and the presence of internal
periods are taken into account. For prediction the system uses:
orthographic features
collocation heuristic (collocation is evidence against split)
frequent sentence starter heuristic (split aer abbreviation)
[6]
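Because punkt learns from raw text, it can be retrained on court decisions with a few lines of NLTK code (this is what Section 6.5 does with cyber-train); the sketch below shows the general pattern, with the file paths being placeholders.

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Train punkt on raw (unlabeled) text from the target domain.
raw_text = open("cyber_train_concatenated.txt", encoding="utf-8").read()
trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True        # learn collocations such as "v. Lim"
trainer.train(raw_text, finalize=True)

# Apply the learned parameters to a new decision.
tokenizer = PunktSentenceTokenizer(trainer.get_params())
decision = open("decision.txt", encoding="utf-8").read()
for start, end in tokenizer.span_tokenize(decision):
    print(decision[start:end])            # one predicted sentence per span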
The results of applying the three SBD systems to the SBD data set described in Section 6.2 are summarized in Table 5. The results clearly show that the performance of the general SBD systems is drastically lower when compared to the performance on news article data sets. It is also much below the reported performance on user generated web content. A certain portion of this gap could be explained by the particular definition of the SBD task we adopt (i.e., the aggressive segmentation). The remaining portion is due to the decisions being particularly challenging for SBD.
The most common source of errors is wrongly predicted sentence boundaries in citations, as shown in the example:
see United States v. X-Citement Video, Inc., 513 U.S. 64, 76-78, 115 S. Ct. 464, 130 L. Ed.& 2d 372 (1994)
Table 5: Vanilla SBD systems performance
cyber-test scotus-test ip-test
P R F1 P R F1 P R F1
CoreNLP
strict-B .81 .78 .79 .87 .76 .82 .75 .79 .77
lenient-B .82 .78 .80 .88 .77 .82 .75 .79 .77
strict-S .56 .54 .55 .70 .60 .64 .50 .53 .52
lenient-S .57 .55 .56 .70 .61 .65 .51 .54 .52
openNLP
strict-B .88 .77 .82 .84 .74 .78 .79 .78 .79
lenient-B .88 .77 .82 .84 .74 .78 .80 .79 .79
strict-S .64 .56 .60 .65 .57 .61 .57 .55 .56
lenient-S .64 .56 .60 .66 .58 .61 .57 .56 .56
punkt
strict-B .72 .79 .75 .77 .72 .74 .67 .80 .73
lenient-B .72 .79 .75 .78 .73 .76 .69 .83 .75
strict-S .41 .46 .44 .55 .52 .54 .42 .50 .46
lenient-S .42 .47 .44 .56 .53 .55 .44 .53 .48
As in the previous example, the predicted boundary is marked with &. This type of error is very serious because it causes broken sentences to be passed for further processing within the pipeline. These sentences may eventually even show up in the output presented to a user (e.g., in a summary).
Another commonly occurring type of error is a missed boundary following a unit that lacks a triggering event:
(1) Deliberate avoidance is not a standard less than knowledge;§ it is simply another way that knowledge may be proven.
(2) B. Response to Jury Question§
(3) Kolender v. Lawson, 461 U.S. 352, 357, 103 S. Ct. 1855, 75 L. Ed. 2d 903 (1983);§ United States v. Lim, 444 F.3d 910, 915 (7th Cir.2006)
As in the previous example, the true boundary is marked with §. This type of error is partly caused by our specific definition of SBD (Example 1). Strictly speaking, the missed boundary in Example 1 does not necessarily have to be considered an error. And indeed, in the traditional definition of SBD it would not be. Example 2 (a missed boundary after a heading) and Example 3 (a missed boundary between two citations) are certainly errors. This type of mistake is less serious than the previous one. It may still negatively affect the performance of the processing pipeline but it does not introduce broken sentences that may eventually make it to the output for a user.
6.5 Performance of Trained SBD Systems
OpenNLP and punkt may be trained on a custom data set, which is encouraged. It can be expected that such training will improve the performance of these two systems. We use the data from cyber-train with labeled sentence boundaries to train both openNLP and punkt. It should be noted that punkt is an unsupervised system and as such it does not use the labels in its training. Therefore, training punkt is very cheap and one could use a training set of much greater size.
Detecting Dierent Types of Content in US Court Decisions ASAIL 2017, June 2017, London, Great Britain
Table 6: Trained SBD systems performance
cyber-test scotus-test ip-test
P R F1 P R F1 P R F1
CoreNLP++
strict-B .81 .90 .85 .86 .93 .89 .75 .88 .81
lenient-B .81 .91 .86 .86 .94 .90 .75 .88 .81
strict-S .62 .70 .66 .71 .77 .74 .52 .62 .57
lenient-S .63 .70 .66 .72 .78 .75 .53 .62 .57
openNLP++
strict-B .93 .79 .85 .91 .76 .83 .94 .81 .87
lenient-B .93 .79 .85 .91 .76 .83 .94 .82 .87
strict-S .71 .59 .64 .74 .61 .67 .72 .62 .67
lenient-S .71 .60 .65 .74 .62 .68 .73 .62 .67
punkt++
strict-B .77 .79 .78 .82 .71 .77 .76 .80 .78
lenient-B .77 .79 .78 .84 .73 .78 .79 .83 .81
strict-S .47 .49 .48 .61 .53 .57 .52 .55 .53
lenient-S .47 .49 .48 .62 .54 .58 .54 .57 .55
Indeed, we expect that if we used a larger data set to train punkt, its performance would increase beyond what we observe in our experiments. The same does not hold for openNLP, which is trained in a supervised fashion (requiring the gold labels). Training openNLP is quite expensive. If we wished to use more documents in training (increasing the performance further) we would have to manually label additional documents.
The CoreNLP SBD module is rule based and therefore it is not possible to train a custom model. To approximate training one could use its configuration options and tune the system to perform well on our data set. We configured the CoreNLP SBD module to perform well on the cyber-train data set and we evaluated it alongside openNLP and punkt on the cyber-test, scotus-test, and ip-test data sets. It should be emphasized that this kind of comparison is very problematic and as such it should be taken with a grain of salt. Specifically, the authors were familiar with the documents, even those from the test sets. In light of this familiarity it appears that CoreNLP could have an unfair advantage in this experiment.
The performance of the trained (or configured) systems is summarized in Table 6. We observe that all the systems perform better when compared to the vanilla versions. The performance of some of the systems on some of the data sets is getting close to the low nineties (similar to the performance of the models on user-generated web content). For example, CoreNLP++ works very well on the scotus-test data set and openNLP++ has reasonable performance on the ip-test data set.
Even though the performance of CoreNLP improved, the wrongly predicted boundaries in citations remain a problem. Below are two examples of boundaries that were incorrectly predicted by CoreNLP++:
(1) Entick v. Carrington, 95 Eng.& Rep. 807 (C. P. 1765)
(2) 451 F. Supp.& 2d 71, 88 (2006).
A new class of errors appeared as well. By allowing the system to predict boundaries on characters such as ";" the system sometimes commits an error which would not have occurred in its unconfigured version:
Knotts noted the "limited use which the government made of the signals from this particular beeper," 460 U. S., at 284;& and reserved the question whether "different constitutional principles may be applicable" to "dragnet-type law enforcement practices" of the type that GPS tracking made possible here, ibid.
This is manifested in a dramatic improvement of recall for CoreNLP++ whereas precision remains frozen at levels similar to the unconfigured CoreNLP SBD module.
The training dramatically improved the performance of the general openNLP SBD module when it comes to precision. The performance in terms of recall remained about the same. This is a better type of improvement when compared to CoreNLP++. Although some of the boundaries are missed, it is quite rare for openNLP++ to predict an incorrect boundary. Systematic errors are mostly missed boundaries such as those in the following examples:
(1) 5. The Government's Hybrid Theory§
(2) This device delivers many different types of communication: live conversations, voice mail, pages, text messages, e-mail, alarms, internet, video, photos, dialing, signaling, etc.§ The legal standard for government access depends entirely upon the type of communication involved.
In the first example the system missed a boundary because it is not associated with a triggering event (a heading). Example 2 is interesting because the system apparently learned that "etc." is an abbreviation which often does not end a sentence.
The trained punkt++ performs better than the general one. It still commits quite a lot of errors when compared to the other two trained/configured systems. One would probably need to train punkt on a considerably larger data set in order to match the performance of the other two systems. The previously identified typical errors occur:
(1) II. ANALYSIS§
(2) "[T]he district court retains broad discretion in deciding how to respond to a question propounded from the jury and . . . & the court has an obligation to dispel any confusion quickly and with concrete accuracy."
(3) United States v. Leahy, 464 F.3d 773, 796 (7th Cir.2006);§ United States v. Carrillo, 435 F.3d 767, 780 (7th Cir.2006)
The examples show missed boundaries after a heading (Example 1) and after a citation (Example 3). Example 2 shows a wrongly predicted boundary after three dots in a quotation.
6.6 Performance of Custom CRF Model
To improve the SBD performance on court decisions even further we train a custom CRF model—the same as the model described in Section 4. We use cyber-train as the training set. In addition to the 10 features described in Section 4 we add the automatically predicted labels corresponding to the following types:
(1) Sentence
(2) Incomplete Sentence
(3) Non-sentential Sequence
This means we use a two-phase SBD system based on CRF.
Table 7: Custom CRF SBD system performance
cyber-test scotus-test ip-test
P R F1 P R F1 P R F1
Custom CRF
strict-B .94 .96 .95 .90 .95 .92 .95 .94 .95
lenient-B .94 .96 .95 .91 .95 .93 .96 .95 .95
strict-S .86 .86 .86 .81 .85 .83 .86 .84 .85
lenient-S .86 .87 .86 .82 .85 .83 .86 .84 .85
In the first pass, the three CRF models attempt to label the sequence of tokens in terms of the three types mentioned above. In the second pass, a single CRF model uses the information from the first pass to predict the sentence boundaries. During the first pass we focus the system on the heterogeneity of the decisions' content. It is our assumption that by doing this we could outperform the general purpose models that do not take the heterogeneity of the content into account. In addition, the CRF has the advantage that it does not make any assumptions about the SBD task, allowing it to fit more easily with our particular definition.
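A minimal sketch of this two-pass arrangement, again using python-crfsuite and the featurize function from the Section 4 sketch, is shown below; the model file names and the way the first-pass labels are injected as features are assumptions made for illustration.

import pycrfsuite

FIRST_PASS = {"sent": "sent.crfsuite",
              "isent": "isent.crfsuite",
              "nsent": "nsent.crfsuite"}

def second_pass_features(tokens):
    # Base features as in Section 4.
    base = [featurize(tokens, i) for i in range(len(tokens))]
    # First pass: three independent CRF taggers predict BILOU labels.
    predictions = {}
    for name, path in FIRST_PASS.items():
        tagger = pycrfsuite.Tagger()
        tagger.open(path)
        predictions[name] = tagger.tag(base)      # e.g. B-SENT, I-SENT, ...
    # Second pass input: base features plus the predicted labels as features.
    feats = []
    for i, f in enumerate(base):
        f = dict(f)
        for name, labels in predictions.items():
            f[name] = labels[i]
        feats.append(f)
    return feats

# boundary_tagger = pycrfsuite.Tagger(); boundary_tagger.open("sbd.crfsuite")
# boundary_labels = boundary_tagger.tag(second_pass_features(tokens))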
The results of our CRF SBD system as applied to the cyber-test, scotus-test, and ip-test data sets are summarized in Table 7. The custom CRF system clearly outperforms both the vanilla and the trained general systems on all three data sets, suggesting that our general hypothesis holds. Indeed, it appears that focusing the system on the heterogeneity of the content results in better performance.
Although the system performs quite well, there is certainly a lot of room for improvement. Particularly annoying classes of errors are those shown in the examples below (§ stands for a true boundary and & stands for a predicted boundary):
(1) The judgment of the Court of Appeals for the D. C.& Circuit is affirmed.
(2) . . . a case we have described as a 'monument of English freedom' 'undoubtedly familiar' to 'every& American statesman' at the time the Constitution was adopted . . .
(3) . . . search is not involved and resort must be had to Katz analysis;§ but& there is no reason for rushing forward . . .
All these errors (except Example 1) are quite silly and no traditional SBD system would commit similar mistakes. This is the tax that is being paid for relaxing all the traditional SBD assumptions. We expect that these kinds of mistakes would eventually go away with a sufficiently large training set.
7 CONCLUSIONS
We have presented preliminary results from our ongoing effort to mine US court decisions for valuable knowledge. We tested the hypothesis that one of the reasons why the decisions are challenging for NLP processing is the heterogeneity of the decisions' content. We classified the content of selected decisions in terms of 10 different types (e.g., sentence, quotation, reference). On an example application of sentence boundary detection we have shown how the information about the content type improves the performance of an SBD system. Note that we do not claim to have come up with a better SBD system than the ones that were used as baselines. Quite the contrary, we are certain that our system would fail miserably if tested on traditional corpora of news articles. We simply claim that we have created a system that outperforms the general solutions on a specifically defined SBD task (US court decisions). The key advantage of the system is that it focuses on the heterogeneity of the decisions' content.
REFERENCES
[1] John Aberdeen, John Burger, David Day, Lynette Hirschman, Patricia Robinson, and Marc Vilain. 1995. MITRE: description of the Alembic system used for MUC-6. In Proceedings of the 6th conference on Message understanding. Association for Computational Linguistics, 141–155.
[2] Emile de Maat and Radboud Winkels. 2009. A next step towards automated modelling of sources of law. In Proceedings of the 12th International Conference on Artificial Intelligence and Law. ACM, 31–39.
[3] Emile De Maat, Radboud Winkels, and Tom Van Engers. 2006. Automated detection of reference structures in law. Frontiers in Artificial Intelligence and Applications (2006), 41.
[4] Felice Dell'Orletta, Simone Marchi, Simonetta Montemagni, Barbara Plank, and Giulia Venturi. 2012. The SPLeT-2012 shared task on dependency parsing of legal texts. In Proceedings of the 4th Workshop on Semantic Processing of Legal Texts.
[5] Matthias Grabmair, Kevin D. Ashley, Ran Chen, Preethi Sureshkumar, Chen Wang, Eric Nyberg, and Vern R. Walker. 2015. Introducing LUIMA: an experiment in legal conceptual retrieval of vaccine injury decisions using a UIMA type system and tools. In Proceedings of the 15th International Conference on Artificial Intelligence and Law. ACM, 69–78.
[6] Tibor Kiss and Jan Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics 32, 4 (2006), 485–525.
[7] John Lafferty, Andrew McCallum, Fernando Pereira, and others. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the eighteenth international conference on machine learning, ICML, Vol. 1. 282–289.
[8] Yang Liu, Andreas Stolcke, Elizabeth Shriberg, and Mary Harper. 2005. Using conditional random fields for sentence boundary detection in speech. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 451–458.
[9] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations). 55–60.
[10] Andrei Mikheev. 2002. Periods, capitalized words, etc. Computational Linguistics 28, 3 (2002), 289–318.
[11] Naoaki Okazaki. 2007. CRFsuite: a fast implementation of Conditional Random Fields. (2007).
[12] David D. Palmer and Marti A. Hearst. 1997. Adaptive multilingual sentence boundary disambiguation. Computational Linguistics 23, 2 (1997), 241–267.
[13] Adwait Ratnaparkhi. 1998. Maximum entropy models for natural language ambiguity resolution. Ph.D. Dissertation. University of Pennsylvania.
[14] Jonathon Read, Rebecca Dridan, Stephan Oepen, and Lars Jørgen Solberg. 2012. Sentence boundary detection: A long solved problem? COLING (Posters) 12 (2012), 985–994.
[15] Jeffrey C. Reynar and Adwait Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the fifth conference on Applied natural language processing. Association for Computational Linguistics, 16–19.
[16] Michael D. Riley. 1989. Some applications of tree-based modelling to speech and language. In Proceedings of the workshop on Speech and Natural Language. Association for Computational Linguistics, 339–352.
[17] Oanh Thi Tran, Minh Le Nguyen, and Akira Shimazu. 2013. Reference resolution in legal texts. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Law. ACM, 101–110.
[18] Oanh Thi Tran, Bach Xuan Ngo, Minh Le Nguyen, and Akira Shimazu. 2014. Automated reference resolution in legal texts. Artificial Intelligence and Law 22, 1 (2014), 29–60.
[19] M. Van Opijnen, N. Verwer, and J. Meijer. 2015. Beyond the Experiment: the eXtendable Legal Link eXtractor. In Workshop on Automated Detection, Extraction and Analysis of Semantic Information in Legal Texts, held in conjunction with the 2015 International Conference on Artificial Intelligence and Law.
[20] Adam Wyner and Wim Peters. 2011. On Rule Extraction from Regulations. In JURIX, Vol. 11. 113–122.