Using Conditional Random Fields to Detect Different Functional
Types of Content in Decisions of United States Courts with
Example Application to Sentence Boundary Detection∗

Jaromír Šavelka
Intelligent Systems Program
University of Pittsburgh
210 South Bouquet Street
Pittsburgh, PA 15260
jas438@pitt.edu

Kevin D. Ashley
Learning Research and Development Center
University of Pittsburgh
3939 O'Hara Street
Pittsburgh, PA 15260
ashley@pitt.edu
ABSTRACT
We detect different functional types of content in decisions of the United States courts using conditional random fields models. The work suggests a possible approach to dealing with the obstacles to processing court decisions automatically. Even basic natural language processing tasks such as sentence boundary detection are challenging when performed on the decisions. It is our assumption that one of the main causes of this difficulty is the presence of heterogeneous content, where different categories of content require different treatments. Thus, the goal of this work is to automatically distinguish among the different functional types of content. The trained models find full and incomplete sentences as well as non-sentential sequences. In addition, the models detect external and internal references, quotations, editorial marks, headings, meta data fields, and numbering sequences such as lists. We show the utility of this approach through improved performance on the sentence boundary detection task as compared to existing general approaches that do not take the heterogeneity of the content into account.
KEYWORDS
Sequence labeling, conditional random fields, court decision, sentence boundary detection
ACM Reference format:
Jaromír Šavelka and Kevin D. Ashley. 2017. Using Conditional Random Fields to Detect Different Functional Types of Content in Decisions of United States Courts with Example Application to Sentence Boundary Detection. In Proceedings of 2nd Workshop on Automated Semantic Analysis of Information in Legal Texts, London, Great Britain, June 2017 (ASAIL 2017), 10 pages.
DOI: 10.475/123 4
∗Produces the permission block, and copyright information
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ASAIL 2017, London, Great Britain
©2016 Copyright held by the owner/author(s). 123-4567-24-567/08/06...$15.00
DOI: 10.475/123 4
1 INTRODUCTION
In this paper we present some early results from our effort to automatically recognize different functional types of content in US court decisions. The decisions are complex documents where different types of content appear side by side. Sentences may be interleaved with citations and editorial marks (e.g., a page number or correction). The whole decision as well as the individual sentences may be organized into list-like structures. Excerpts from other documents as well as speech transcripts are often present. The decisions may be very long, organized into sections and subsections.
The heterogeneity of content is one of the reasons why it is difficult to apply most of the well-established NLP techniques to court decisions. Even basic NLP tasks such as sentence boundary detection are challenging when performed on the decisions. Wyner and Peters observe that lists (using punctuation, including enumerations, colons) and references (containing a mix of punctuation and alpha-numeric characters) confound tokenization and sentence splitting [20]. Although they talk about regulatory texts, these observations apply to the decisions as well. At the same time, the use of the mentioned techniques is foundational to many valuable applications, from information retrieval to computer-assisted reasoning. Less than optimal performance in the lower-level processing often introduces mistakes into the pipeline that are impossible to correct in later stages. It is our assumption that the ability to distinguish different types of content may dramatically improve the quality of the processing. It appears that de Maat and Winkels express similar assumptions in [2]. They talk about splitting the sentences into the principle and the auxiliary. In addition, they observe that lists cause degradation of sentence classification performance.
We developed a simple labeling scheme that captures some of the functional types of content. The scheme is described in Section 2. It is a set of types such as sentential and non-sentential content or references and quotations. We applied the scheme to a data set of 19 selected court decisions (mostly from the domain of cyber crime), producing more than 20,000 annotations. The annotation effort and the resulting data set are the subject of Section 3. We train sequence labeling models (specifically CRF) capable of producing the annotations automatically. The training and evaluation of the models are described in Section 4. Finally, we use some of the models in a sentence boundary detection task (Section 6). We show improved performance over the methods that do not take the heterogeneity of the content into account.
ASAIL 2017, June 2017, London, Great Britain J. Šavelka and K. D. Ashley
There is a large body of research in AI & Law that focuses on extraction of various types of content from legal documents. The most closely related to our work is the consistent effort in extraction of references and citations. Reference recognition focuses on finding references to other documents (or their specific parts) such as statutory provisions, court decisions, or expert literature. Tran et al. present a framework for recognition and resolution of references in Japanese law [17, 18]. De Maat et al. developed a system for detection of references in Dutch law [3]. Opijnen et al. focus not only on the references to Dutch legal documents but also recognize the references to EU documents [19].
Dell'Orletta et al. present a law-specific account of sentence boundaries which is in many respects similar to the one we use here [4]. Grabmair et al. developed a dedicated sentence boundary detection module to operate on court decisions by extending LingPipe.1 [5]
2 LABELING SCHEME
At the most fundamental level we distinguish between sentential and non-sentential content. The distinction is drawn at a semantic level, not a purely grammatical one. This may appear quite counter-intuitive on certain occasions. The sentential content comprises unstructured data in the form of natural language text that plays the role of a substantive statement in a decision. The rest of the content, which mostly plays a supportive role, such as structured meta data or standardized references, is considered non-sentential. The distinction appears to be similar to the one suggested in [2] (principle and auxiliary sentences). We further differentiate the sentential content into sentences and incomplete sentences.
A sentence is a sequence of tokens (most often words) that together form a full stand-alone grammatical sentence. It usually starts with a capital letter and ends with a period, exclamation mark, question mark, or (rarely) a different symbol. Consider the following examples:
(1) We have recognized that even a limited search of the person is a substantial invasion of privacy.
(2) In other contexts, however, we have held that although "some quantum of individualized suspicion is usually a prerequisite to a constitutional search or seizure[,] . . . the Fourth Amendment imposes no irreducible requirement of such suspicion."
(3) Justice Jackson rejected the view that the search could be supported by exigent circumstances:
Example 1 shows a typical sentence starting with a capital letter and ending with a period. Example 2 shows a more complex sentence with a quotation and some editorial marks (e.g., ". . . ") embedded within it. Example 3 presents a sentence ending with a colon, which is traditionally not used to terminate a sentence.
An incomplete sentence is a sequence of tokens (most often words) that do not form a full grammatical sentence. These sequences usually appear in headings or parentheses. They may also appear in front of lists or enumerations. Examples:
1 http://alias-i.com/lingpipe/
(1) (student with bloodshot eyes wandering halls in violation of school rule requiring students to remain in examination room or at home during midterm examinations)
(2) The nature of the writ.
Example 1 is an expression in parentheses which does not form a full grammatical sentence. Example 2 looks like a full sentence but it lacks a verb phrase. This may often be the case with headings.
A non-sentential piece of content is a sequence of tokens which is not unstructured text in natural language. The sequence is often not formed of words or has very few words. Typically, it would make no sense to assign the tokens of such a sequence part-of-speech tags (e.g., noun, verb). These sequences usually appear as references or as meta data in front matter. Some examples are shown in the following list:
(1) See Warden v. Hayden, 387 U.S. 294, 306-307 (1967).
(2) November 14, 2012.
(3) [ Footnote 1 ]
Example 1 is a reference in a standardized format. Note that on a purely functional level this is an imperative sentence, yet within our labeling scheme it is considered a non-sentential sequence. This is because it is not a substantive statement; its role is rather supportive. Example 2 shows a date expression. Example 3 presents a numbering token from the footnotes list.
Citations play an important role in court decisions. To account for this phenomenon we detect references to other documents (external references) and references to other parts of the same document (internal references). In addition, we look for sequences of text that come from other documents or other parts of the same document (quotations). It is worth mentioning that these types are not mutually exclusive with the types presented earlier (i.e., sentence, non-sentential sequence). Quite the contrary: almost any reference is at the same time a non-sentential sequence (but not necessarily the other way around).
An external reference is a sequence of tokens whose role is to point to a document (or its part) other than the one in which it is embedded. External references often have a standardized format (see the examples) and can be recognized easily:
(1) [469 U.S. 325, 329]
(2) West Virginia State Bd. of Ed. v. Barnette, 319 U.S. 624, 637 (1943)
(3) Parent-Student Handbook of Piscataway [N. J.] H. S. (1979), Record Doc. S-1, p. 7.
Example 1 shows a reference to a statute. Example 2 is a reference to a court decision. Example 3 presents a reference to a book.
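Because standardized citation formats like Example 2 follow a fixed pattern (volume, reporter, page), even a simple regular expression can pick many of them out. The sketch below is an illustration only, limited to the "U.S." reporter; it is not the approach or the coverage of the CRF models described in this paper:

```python
import re

# Volume, the "U.S." reporter, first page, optional pin cite and year,
# e.g. "319 U.S. 624, 637 (1943)". Other reporters (F.2d, Cal.Rptr., ...)
# would need additional alternatives.
US_CITE_RE = re.compile(
    r"\d+ U\.S\. \d+(?:, \d+(?:\s?-\s?\d+)?)?(?: \(\d{4}\))?"
)

def find_us_citations(text):
    """Return all U.S.-reporter citation strings found in the text."""
    return US_CITE_RE.findall(text)

matches = find_us_citations(
    "See Warden v. Hayden, 387 U.S. 294, 306-307 (1967)."
)
# → ["387 U.S. 294, 306-307 (1967)"]
```

A pattern-based recognizer like this breaks down exactly where the paper's sequence models are meant to help: nonstandard, truncated, or embedded citations.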
An internal reference is a sequence of tokens whose role is to point to another place in the same document. Internal references often have the form of footnote pointers, pointers to figures or tables, or pointers to preceding or following sections of the document. Consider the following examples:
(1) Ante, at 337
(2) Figure 1
Example 1 shows a reference to a preceding part of the text (e.g., a page or paragraph). Example 2 shows a reference to a figure within the decision.
A quotation is a sequence of tokens from a different document (or from a different part of the same document) that is copied into the document of interest. Sometimes the sequence may be modified (still preserving the meaning). Quite often a quotation is surrounded with quotation marks.
(1) "a school official may properly conduct a search of a student's person if the official has a reasonable suspicion that a crime has been or is in the process of being committed, or reasonable cause to believe that the search is necessary to maintain school discipline or enforce school policies."
(2) "[A]s a general rule notice and hearing should precede removal of the student from school. We agree . . . , however, that there are recurring situations in which prior notice and hearing cannot be insisted upon. Students whose presence poses a continuing danger to persons or property or an ongoing threat of disrupting the academic process may be immediately removed from school. In such cases the necessary notice and rudimentary hearing should follow as soon as practicable . . .."
Example 1 shows an in-line quotation which is part of a surrounding sentence. Example 2 is a standalone quotation consisting of a number of sentences. It also contains a couple of editorial marks (". . . ", square brackets).
We detect a number of other constituents in a decision because these could be useful in more advanced processing stages. In particular, we are interested in constituents playing a specific role in a decision (headings). We also detect elements suggestive of the structure of the presented content (numbering tokens). Finally, we are interested in the elements that are inserted in the document to make it more readable or useful (editorial marks and meta data fields).
A heading is a sequence of tokens, either a full sentence or an incomplete sentence, whose role is to open (and concisely describe) a document or a section of a document. Examples:
(1) JUSTICE WHITE delivered the opinion of the Court.
(2) MEMORANDUM AND ORDER ON A DEFENDANTS' MOTION TO DISMISS
Example 1 is a typical template sentence opening the opinion part of the decision. Strictly speaking, one would not consider this sentence a heading within the general meaning of the word. However, we consider it as such in our labeling scheme. Example 2 is a heading opening a specific part of an opinion.
A numbering token (or a sequence of tokens) defines the position of the associated piece of text within a structure such as a list or a tree. Numbering can usually be found at the beginning of a line but it can also appear within a paragraph.
(1) 1st.
(2) III
(3) [ Footnote 1 ]
(4) (viii)
Examples 1 through 4 show different types of numbering tokens that are commonly found in the decisions.
An editorial mark is any token or sequence of tokens introduced into a text by somebody other than its author in order to make the text more readable or useful. This could be certain symbols introduced into a citation by the author of a document, or symbols introduced into a document by a publisher.
(1) *153
(2) . . .
(3) [A]
Example 1 shows an editorial mark which is commonly used to indicate the beginning of a page with the specified number. This type of mark is often used in documents that were transformed from a printed form into an electronic form. Example 2 is a mark indicating that a piece of the original content is left out. The square brackets shown in Example 3 indicate that "A" was inserted into the original text by an editor.
A meta data field is a sequence of tokens representing structured information, usually in the form of a label–value pair. Consider the following examples:
(1) Filed: April 16th, 2009
(2) Panel: Frank Hoover Easterbrook, Michael Stephen Kanne, Ann Claire Williams
Example 1 is a meta data field providing information about the day on which the motion was filed. Example 2 shows a meta data field that provides information about the members of the deciding panel.
3 DATA SET
We downloaded 19 court decisions from the online Court Listener service.2 Thirteen of these decisions are from the area of cyber crime (cyber bullying, credit card fraud, possession of electronic child pornography), 3 are landmark SCOTUS decisions, and the remaining 3 are cases involving intellectual property. We use 10 cyber crime decisions as a training and development set (cyber-train) and the remaining 3 cyber crime decisions as a hold-out test set (cyber-test). We use the 3 SCOTUS decisions and the 3 intellectual property decisions as additional test sets (scotus-test, ip-test). Although we focus on the area of cyber crime, we use the additional test sets to measure how well the trained models generalize to other domains.
The two human annotators (the authors of this paper) annotated the decisions with the labels described in Section 2. Each decision was annotated by only one of the annotators. We did not measure inter-annotator agreement. We simply assume the labels provided by each annotator to be correct. We are well aware that this assumption does not hold. Despite this, we are convinced that, given the goal of this work, it is a reasonable assumption anyway. The annotators were provided with a codebook that contains information similar to what is presented in Section 2 (i.e., a description of each label with examples). In addition, for each label there was a set of if–then rules guiding the annotators, such as the following one:
If a sequence of tokens represents what is written in another document or at a different place of the document of interest then it is a quotation.
The basic characteristics of the data set are summarized in Table 1. Summary statistics about the annotations are provided in Table 2. A relatively small number of decisions may suggest a very small size of the data set. However, some of the annotated decisions are
2 www.courtlistener.com
Table 1: Data set summary statistics. Length of longest,
shortest and average documents is reported in number of
characters.
cyber-train cyber-test scotus-test ip-test
# of docs 10 3 3 3
# of tokens 219484 75158 130065 42382
longest doc 181009 121910 190352 54430
average doc 58237 67820 118749 38549
shortest doc 16859 29278 47450 16546
very long. The longest decision, with 190,352 characters, roughly corresponds to 80–100 standard pages. Even the shortest decision (16,546 characters) corresponds to approximately 7 standard pages. The total number of annotations (21,294) clearly shows that the data set is sufficient for far more than toy experiments.
4 LABELING THE SEQUENCES AUTOMATICALLY
We use the cyber-train data set to train a CRF model for each of the 10 labels. Although this is certainly suboptimal, we use the same training strategy and features for all the models. It should be evident that different types (such as a sentence or an editorial mark) could most likely benefit from a custom-tailored model and contextual features. We reserve fine-tuning of the individual models for future work. A CRF is a random field model that is globally conditioned on an observation sequence O. The states of the model correspond to event labels E. We use a first-order CRF in our experiments (observation Oi is associated with Ei). We use the CRFsuite3 implementation of first-order CRF [7, 8, 11].
We use a very aggressive tokenization strategy that segments text into a greater number of tokens than usual. We consider an individual token to be any consecutive sequence of either:
(1) letters
(2) numbers
(3) whitespace
Each character that does not belong to any of the above constitutes a single token. For example, the following sequence is tokenized as shown below:
Call me at 9am on my phone (123)456-7890.
["Call", " ", "me", " ", "at", "9", "am", " ", "on", " ", "my", "phone", " ", "(", "123", ")", "456", "-", "7890", "."]
Each of the tokens is then a data point in the sequence a CRF model operates on.
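The three-class rule above can be sketched with a single regular expression. This is a minimal illustration under the stated rule, not the authors' implementation; note that whitespace runs come out as tokens of their own:

```python
import re

# One token is a maximal run of letters, digits, or whitespace;
# any other character stands alone as a single-character token.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|\s+|.", re.DOTALL)

def tokenize(text):
    """Aggressively segment text into letter runs, digit runs,
    whitespace runs, and single punctuation characters."""
    return TOKEN_RE.findall(text)

tokens = tokenize("Call me at 9am on my phone (123)456-7890.")
# "9am" splits into "9" and "am"; "(123)456-7890" splits at every
# punctuation character.
```

Python's alternation tries branches left to right, so letter and digit runs are consumed greedily before the catch-all single-character branch applies.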
Each token is represented by a small set of relatively simple features. Specifically, the set includes:
(1) atfront – a binary feature which is set to true if the token is located among the first 20% of the tokens of the whole document.
(2) atback – a binary feature which is set to true if the token is located among the last 20% of the tokens of the whole document.
(3) lower – the token in lower case.
3 www.chokkan.org/software/crfsuite/
(4) sig – a feature representing a signature of the token. This feature corresponds to the token with the following transformations applied:
(a) each lower case letter is rewritten to "c"
(b) each upper case letter is rewritten to "C"
(c) each digit is rewritten to "D"
(5) length – a number corresponding to the length of the token in characters if the length is smaller than 4. If the length is between 4 and 6 the feature is set to "normal." If it is greater than 6 it is set to "long."
(6) islower – a binary feature which is set to true if all the token characters are in lower case.
(7) isupper – a binary feature which is set to true if all the token characters are in upper case.
(8) istitle – a binary feature which is set to true if the first of the token characters is in upper case and the rest are in lower case.
(9) isdigit – a binary feature which is set to true if all the token characters are digits.
(10) isspace – a binary feature which is set to true if all the token characters are whitespace.
In addition, for each token we also include the lower, sig, islower, isupper, istitle, isdigit, and isspace features from the five preceding tokens and the five following tokens. If one of these tokens falls beyond the document boundaries we signal this by including BOS (beginning of sequence) and EOS (end of sequence) features.
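The per-token features can be sketched as follows. This is illustrative Python, not the authors' code; the names follow the list above, and the ±5-token context window is omitted for brevity:

```python
def signature(token):
    """Rewrite lower-case letters to 'c', upper-case to 'C', digits to 'D';
    other characters are kept as-is."""
    return "".join(
        "c" if ch.islower() else "C" if ch.isupper() else
        "D" if ch.isdigit() else ch
        for ch in token
    )

def length_feature(token):
    """Lengths below 4 are kept as numbers; 4-6 is 'normal', above 6 'long'."""
    n = len(token)
    if n < 4:
        return str(n)
    return "normal" if n <= 6 else "long"

def token_features(tokens, i):
    """Feature dict for tokens[i]; atfront/atback use the 20% thresholds."""
    t = tokens[i]
    return {
        "atfront": i < 0.2 * len(tokens),
        "atback": i >= 0.8 * len(tokens),
        "0:lower": t.lower(),
        "0:sig": signature(t),
        "0:length": length_feature(t),
        "0:islower": t.islower(),
        "0:isupper": t.isupper(),
        "0:istitle": t.istitle(),
        "0:isdigit": t.isdigit(),
        "0:isspace": t.isspace(),
    }
```

In the full feature set, the same dictionary entries would be repeated with -5: … +5: prefixes for the surrounding tokens, with BOS/EOS markers at the document edges.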
Consider again the "Call me at 9am . . ." sequence from the example presented above. The third token of this sequence ("me") would be represented along the following lines:
{bias, atfront=true, atback=false, 0:lower=me,
0:sig=cc, 0:length=2, 0:islower=true,
0:isupper=false, 0:istitle=false, 0:isdigit=false,
0:isspace=false, -3:BOS, -2:lower=call, -2:sig=Cccc,
-2:length=normal, -2:islower=false, -2:isupper=false,
-2:istitle=true, -2:isdigit=false, -2:isspace=false,
-1:lower=" ", -1:sig=" ", -1:length=1,
-1:islower=false, -1:isupper=false, -1:istitle=false,
-1:isdigit=false, -1:isspace=true
...}
As labels we use the annotation types projected into the BILOU4 scheme. Considering an example of annotating time (TIM) and phone number (TEL) expressions, the "Call me at 9am . . ." sequence would be labeled as follows:
[O, O, O, O, O, B-TIM, L-TIM, O, O, O, O, O, O,
B-TEL, I-TEL, I-TEL, I-TEL, I-TEL, L-TEL, O]
For demonstration purposes we show multiple annotation types at once. In our work we used one annotation type for each model. In addition, instead of the TIM and TEL types from this example we worked with the types presented in Section 2. Thus, each of the models was trained on a tag set with 5 labels, such as the following one:
B-SENT, I-SENT, L-SENT, O, U-SENT
4 B: beginning of sequence, I: inside sequence, L: last in sequence, O: outside of sequence, U: unit-length sequence
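Projecting annotation spans into BILOU tags can be sketched as follows. This is an illustrative helper, not the authors' code; spans are assumed to be (start, end) token-index pairs with the end exclusive:

```python
def bilou_tags(n_tokens, spans, label):
    """Assign B/I/L/U tags over the given token spans for one annotation
    type; every token outside a span gets O."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        if end - start == 1:          # unit-length sequence
            tags[start] = "U-" + label
        else:
            tags[start] = "B-" + label
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + label
            tags[end - 1] = "L-" + label
    return tags

# A 5-token example with one 3-token SENT span (tokens 1-3):
tags = bilou_tags(5, [(1, 4)], "SENT")
# → ["O", "B-SENT", "I-SENT", "L-SENT", "O"]
```

Running this once per annotation type reproduces the one-type-per-model setup described above: each model sees only its own five-label tag set.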
Table 2: Summary statistics of annotations for the cyber-train, cyber-test, scotus-test, and ip-test data sets. Average sequence length is reported in number of characters. Annotation types: Sentence (SENT), Incomplete Sentence (ISENT), Non-sentential Sequence (NSENT), External Reference (EREF), Internal Reference (IREF), Quotation (QUOT), Editorial Mark (EMARK), Heading (HEAD), Numbering Token (NMB), Meta Data Field (MDF).
SENT ISENT NSENT EREF IREF QUOT EMARK HEAD NMB MDF
cyber-train
# of seq 3493 495 1945 1574 189 959 818 108 360 43
avg # of seq per doc 349 50 195 157 19 96 82 11 36 4
avg seq length 145 50 31 37 12 143 6 23 3 40
cyber-test
# of seq 1060 125 728 753 46 325 245 21 109 19
avg # of seq per doc 353 42 243 254 15 108 82 7 36 6
avg seq length 163 84 29 30 7 112 4 33 3 63
scotus-test
# of seq 1984 158 1316 1256 185 375 164 34 182 2
avg # of seq per doc 661 53 439 419 62 125 55 11 61 1
avg seq length 153 54 34 38 8 123 6 29 3 26
ip-test
# of seq 608 119 474 506 37 224 116 38 89 2
avg # of seq per doc 203 40 158 169 12 75 39 13 30 1
avg seq length 160 58 26 27 3 111 6 28 3 91
total
# of seq 7145 897 4463 4099 457 1883 1343 201 740 66
avg # of seq per doc 376 47 235 216 24 99 71 11 39 3
avg seq length 151 57 31 34 9 130 6 26 3 48
5 RESULTS
The results of applying the CRF models trained on the cyber-train data set to the cyber-test, scotus-test, and ip-test data sets are summarized in Table 3. In the table, each column represents a type and each row represents a token type (from the BILOU scheme) for one of the three test sets. We report the F1 value as well as the support (the number of true labels of the respective type within the BILOU scheme, e.g., the number of B-SENT) for each token type. The feature set we have selected works well for the Sentence, Non-sentential Sequence, and External Reference types across all three data sets. In the case of the Internal Reference, Quotation, Editorial Mark, and Numbering types the performance is somewhat lower although quite promising. The features apparently do not work well for the Incomplete Sentence and Meta Data Field types. We reserve fine-tuning of the models (and thus improvement of performance) for future work. As expected, the performance in almost all the categories is slightly higher on the cyber-test data set when compared to the performance on the scotus-test and ip-test data sets.
6 EXAMPLE APPLICATION ON SENTENCE BOUNDARY DETECTION
The goal in sentence boundary detection (SBD) is to split a natural language text into individual sentences (i.e., identify each sentence's boundaries). Typically, SBD is operationalized as a binary classification of a fixed number of candidate boundary points (e.g., ".", "!", "?"). SBD can be a critical task in many applications such as machine translation, summarization, or information retrieval.
Approaches to SBD roughly fall into three categories:
(1) Rules – A battery of hand-crafted matching rules is applied. The rules may look like the following:
IF "!" OR "?" MATCHED → MARK AS BOUND
IF "<EOL> <EOL>" MATCHED → MARK AS BOUND
The first rule states that every time there is a "!" or "?" the system should consider it a boundary. The second rule can be understood in such a way that a boundary should be predicted every time the system encounters two consecutive line breaks.
(2) Supervised Machine Learning (ML) – Given that a triggering event occurs, decide whether it is an instance of a sentence boundary. Each event is represented in terms of selected features such as the following:
xi = <0:token=".", 0:isTrigger=1, -1:token="Mr", -1:isAbbr=1, 1:token="Lange", 1:isName=1>
Given the labels:
yi ∈ {0, 1}
the supervised classifier is a function:
f(xi) → yi
(3) Unsupervised ML – Similar to the supervised ML approach but the system is trained on unlabeled data.
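The rule-based approach in item (1) can be sketched in a few lines. This is a toy illustration of the two example rules only; real rule-based systems add abbreviation lists and many more patterns:

```python
import re

# Two toy rules: "!" or "?" always ends a sentence,
# and so does a blank line (two consecutive line breaks).
BOUNDARY_RE = re.compile(r"[!?]|\n\s*\n")

def rule_based_boundaries(text):
    """Return the offsets just past each predicted sentence boundary."""
    return [m.end() for m in BOUNDARY_RE.finditer(text)]

offsets = rule_based_boundaries("Is it so? Yes!\n\nNext paragraph.")
```

Supervised approaches (item 2) keep the same trigger points but replace the unconditional MARK AS BOUND with a learned yes/no decision per trigger.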
Multiple SBD systems have been reported as having excellent performance [14]:
• 99.8% accuracy of a tree-based classifier in predicting "." as ending (or not) a sentence, evaluated on the Brown corpus [16]
• 99.5% accuracy of a combination of an original system based on neural nets and decision trees with an existing system [1], evaluated on the WSJ corpus [12]
Table 3: Annotation types: Sentence (SENT), Incomplete Sentence (ISENT), Non-sentential Sequence (NSENT), External Reference (EREF), Internal Reference (IREF), Quotation (QUOT), Editorial Mark (EMARK), Heading (HEAD), Numbering Token (NMB), Meta Data Field (MDF). The table shows F1 and support for all the types on the three data sets.
SENT ISENT NSENT EREF IREF QUOT EMARK HEAD NMB MDF
F1 supp F1 supp F1 supp F1 supp F1 supp F1 supp F1 supp F1 supp F1 supp F1 supp
cyber-test
B .93 1058 .40 121 .88 645 .85 564 .78 45 .73 325 .83 98 .62 16 .90 109 .89 19
I .96 57630 .42 3400 .94 10483 .95 565 .43 117 .69 325 .76 142 .69 180 .89 66 .87 370
L .93 1058 .44 121 .88 646 .60 565 .80 45 .75 11807 .84 107 .29 16 .90 109 .83 19
O .86 15412 .98 71512 .99 63307 .99 10496 1.0 74951 .94 62701 1.0 74718 1.0 74941 1.0 74874 1.0 74750
U – – 0.0 4 .93 77 0.0 194 – – – – .16 93 .75 5 – – – –
AVG .94 75158 .95 75158 .98 75158 .98 75158 1.0 75158 .90 75158 1.0 75158 1.0 75158 1.0 75158 1.0 75158
scotus-test
B .95 1976 .25 144 .80 1165 .74 1107 .44 139 .49 374 .66 102 .32 23 .50 2 .24 90
I .98 99514 .27 2625 .92 20377 .90 19923 .51 540 .41 15496 .58 242 .43 221 .60 14 .30 68
L .95 1977 .39 144 .81 1157 .47 1102 .13 139 .51 373 .69 102 .47 21 .50 2 .24 90
O .94 26598 .99 127139 .98 107215 .98 107785 1.0 129201 .91 113822 1.0 129603 1.0 129789 1.0 130047 1.0 129725
U – – 0.0 13 .70 151 0.0 148 .29 46 – – .54 16 .71 11 – – .87 92
AVG .97 130065 .97 130065 .97 130065 .96 130065 .99 130065 .85 130065 1.0 130065 1.0 130065 1.0 130065 1.0 130065
ip-test
B .93 600 .38 105 .81 468 .81 445 .91 32 .73 224 .70 56 .09 31 .88 89 0.0 2
I .97 31675 .32 2094 .87 6026 .92 6297 .71 50 .69 8208 .80 133 .17 290 .84 60 0.0 69
L .93 603 .47 105 .86 472 .51 447 .85 37 .73 224 .71 57 .47 30 .86 88 0.0 2
O .90 9504 .98 40067 .98 35416 .98 35135 1.0 42263 .91 33726 1.0 42113 1.0 42027 1.0 42145 1.0 42382
U – – 0.0 11 – – .32 58 – – – – .82 23 .22 4 – – – –
AVG .96 42382 .94 42382 .96 42382 .97 42382 1.0 42382 .87 42382 1.0 42382 .99 42382 42382 1.0 42382
• 99.75% accuracy (WSJ) and 99.64% (Brown) of a maximum entropy model in assessing ".", "!", and "?" [15]
• 99.69% (WSJ) and 99.8% (Brown) of a rule-based sentence splitter combined with a supervised POS-tagger [10]
• 98.35% (WSJ) and 98.98% (Brown) of an unsupervised system based on identification of abbreviations [6]
Read et al. [14] conducted a study of the performance of SBD systems across different corpora and report more modest results, ranging from 95.0% to 97.6% for different systems. They also tested the systems on corpora of user-generated web content. The performance of the SBD systems deteriorated on these corpora, where the accuracy often falls into the lower nineties [14].
6.1 SBD on Court Decisions
Court decisions are more challenging for SBD than news articles, the traditional subject of interest. Whereas news articles are generally short texts, a decision may be short but it may also be as long as a book (consider the 80–100 page long decision in Table 1). A decision may be structured into sections and subsections preceded by a heading (possibly numbered). A decision may contain specific constituents such as a header and a footer, footnotes, or lists. Sentences are interleaved with citations. The sentences themselves may be extremely long, even organized as lists. In decisions there is a high usage of sentence organizers such as ";" or "—" and brackets (multiple types). Quotes (possibly nested) are frequent.
Let us consider the following example of a very long sentence coming from a decision:
As used in the statute, "'act in furtherance of a person's right of petition or free speech under the United States or California Constitution in connection with a public issue' includes: (1) any written or oral statement or writing made before a legislative, executive, or judicial proceeding, or any other official proceeding authorized by law; (2) any written or oral statement or writing made in connection with an issue under consideration or review by a legislative, executive, or judicial body, or any other official proceeding authorized by law; (3) any written or oral statement or writing made in a place open to the public or a public forum in connection with an issue of public interest; (4) or any other conduct in furtherance of the exercise of the constitutional right of petition or the constitutional right of free speech in connection with a public issue or an issue of public interest." (§ 425.16, subd. (e), italics added; see Briggs v. Eden Council for Hope & Opportunity (1999) 19 Cal. 4th 1106, 1117-1118, 1123 [81 Cal.Rptr.2d 471, 969 P.2d 564] [discussing types of statements covered by anti-SLAPP statute].)
The example sentence contains a quotation (with a nested quotation) organized as a list, followed by citations and their captions. This text is very challenging for an SBD system because it contains many triggering events that are not sentence boundaries.
Sentences in decisions tend to be very long (like the example
sentence shown above). This may cause trouble for various components
in the processing pipeline (e.g., syntactic parsing might be
difficult). Therefore, it makes sense to adopt an aggressive segmentation
strategy and predict boundaries wherever possible. Semicolons are one such
opportunity; they are sometimes used to separate items in a list
(as shown in Example 1 below) as well as independent clauses
(Example 2). The authors of [4] also decided to go beyond
the traditional triggering events (e.g., they use “;” and “:” as well).
(1) [O]ur family suffered: emotional distress; anxiety; sleeplessness; physical pain; insecurity; fear; pain and suffering; payment of attorneys’ fees; payment of medical expenses; payment of moving expenses; payment of *1204 traveling and housing expenses to and from Los Angeles to support our business endeavors; [and] [D.C.]’s lost income. .. .

Detecting Different Types of Content in US Court Decisions. ASAIL 2017, June 2017, London, Great Britain

Table 4: SBD data set summary statistics. Length of longest,
shortest and average sentences is reported in number of
characters.

                    cyber-train  cyber-test  scotus-test  ip-test
# of docs                    10           3            3        3
# of sentences             5423        1757         3214     1107
longest sentence           1065        1182          670     1145
average sentence            106         114          110      103
shortest sentence             1           1            1        2
(2) It takes RAPCO a year or two to design, and obtain approval for, a complex part; the dynamometer testing alone can cost $75,000. . .. Drawings and other manufacturing information contain warnings of RAPCO’s intellectual-property rights; every employee receives a notice that the information with which he works is confidential.
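The aggressive strategy of treating semicolons and colons as candidate boundaries can be sketched with a simple rule. This is a minimal illustration of the idea only; the regular expression and function name are our own, not the implementation evaluated in this paper:

```python
import re

# Candidate boundaries after ".", "?", "!", ";" and ":" followed by
# whitespace -- an aggressive policy that splits wherever possible.
BOUNDARY = re.compile(r'(?<=[.?!;:])\s+')

def aggressive_split(text):
    """Split text at every candidate boundary; when in doubt, split."""
    return [seg for seg in BOUNDARY.split(text) if seg.strip()]

print(aggressive_split("[O]ur family suffered: emotional distress; anxiety; fear."))
# → ['[O]ur family suffered:', 'emotional distress;', 'anxiety;', 'fear.']
```

A real system would additionally guard against splitting inside citations and abbreviations, which is exactly where such a naive rule over-segments.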
Additional complexity is caused by the presence of informal,
poorly edited text:

The next post, from “DAN JUSTICE,” is the first to raise the rhetoric to a level that could, when considered out of context, be construed as a threat. It says “HEY [D.C.], I KNOW A GOOD *** WHEN I SEE ONE. I LIKE WHAT I SEE, LET’S GO GET SOME COFFEE. ****** im gonna kill you” and is signed “H-W student.”5
A sentence may span over a double line break:
Section 1(4) of the Uniform Act provides that:
“Trade secret” means information, including a formula, pattern, compilation, program, device, method, technique, or process .. .
Headings are examples of sequences that possibly do not have a
triggering event:
FACTS AND PROCEDURAL HISTORY
6.2 SBD Data Set
We use the cyber-train, cyber-test, scotus-test, and ip-test data sets
for the SBD experiments. Based on the phenomena presented above,
we concluded that a definition of SBD as a binary classification of
a limited number of triggering events is not adequate. Furthermore,
we observe that segmentation of a decision into sentences (or
sentence-like units) may often be done in multiple different ways.
Therefore, we do not adopt the limited view of triggering events
(allowing possibly any token to be a boundary). We also apply
a consistent policy of aggressive segmentation (i.e., if doubts exist,
there is a boundary). The basic characteristics of the SBD data set
are summarized in Table 4.
5 The asterisks are ours.
6.3 SBD Experiments
We conduct a series of experiments to test the hypothesis that
taking the heterogeneity of a decision’s content into account leads to
improved performance in SBD. First, we measure the performance
of existing vanilla SBD systems (i.e., using the pre-trained general
models) on the cyber-test, scotus-test, and ip-test data sets. Second,
we test the performance of the same models trained on the cyber-train
data set. Finally, we measure the performance of a custom CRF
solution that uses automatically predicted information about the
different types of content. We use the sentence, incomplete sentence,
and non-sentential sequence types as features. Note that these
types, although closely related, do not map directly to sentence
boundaries within the meaning of SBD. Also note that we use
as features the automatic predictions of these types, not the gold
standard.
For evaluation we use traditional information retrieval measures:
precision (P), recall (R), and F1-measure (F1). We evaluate the
performance from two different perspectives:
(1) boundaries – each boundary counts on its own
(2) segments – both boundaries need to match
For each perspective we use two approaches to determine if a
boundary was predicted correctly:
(1) strict – boundary offsets match exactly
(2) lenient – the difference between boundary offsets does not contain an alphanumeric character
Let us consider the following example, where § stands for a true
boundary and & for a predicted boundary:

§&Accordingly, we find that the circuit court did not abuse its discretion when it denied Mr.& &Renfrow’s motion for a JNOV.§ §**&We find no merit to this issue.§&
Two of the predicted boundaries match the true boundaries. The
remaining three differ. In one of the three cases the difference
consists of the two asterisks (non-alphanumeric). From the strict
boundaries perspective (strict-B) the P is 0.4 and R is 0.5. Using
the lenient boundaries perspective (lenient-B) the P is 0.6 and R
is 0.75. From the strict segments perspective (strict-S) both P and
R are 0 (no segment is predicted correctly). Using the lenient segments
perspective (lenient-S) the P is 0.33 and R is 0.5. Using the
different perspectives allows a more detailed analysis of a model’s
performance. As shown above, decent performance in predicting
boundaries correctly does not necessarily imply that whole
segments are predicted correctly as well.
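The strict and lenient boundary matching can be sketched as follows. This is a simplified illustration with our own function names; segment-level scoring is analogous but requires both boundaries of a segment to match:

```python
def boundary_scores(true, pred, text, lenient=False):
    """Precision/recall over boundary character offsets.

    A predicted boundary matches a true one if the offsets are equal
    (strict), or if the span between them contains no alphanumeric
    character (lenient).
    """
    def match(p, t):
        if p == t:
            return True
        if lenient:
            lo, hi = sorted((p, t))
            return not any(c.isalnum() for c in text[lo:hi])
        return False

    tp_pred = sum(any(match(p, t) for t in true) for p in pred)
    tp_true = sum(any(match(p, t) for p in pred) for t in true)
    return tp_pred / len(pred), tp_true / len(true)

text = "ab **cd"
true = {0, 3}          # true boundary sits before the asterisks
pred = {0, 5}          # predicted boundary sits after them
print(boundary_scores(true, pred, text))                # → (0.5, 0.5)
print(boundary_scores(true, pred, text, lenient=True))  # → (1.0, 1.0)
```

In the toy example the second predicted boundary is off by the two asterisks only, so it is rejected under strict matching but accepted under lenient matching.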
6.4 Performance of Vanilla SBD Systems
For evaluation of SBD systems’ performance on the corpus of court
decisions we use one system from each category:
(1) We work with the SBD module from the Stanford CoreNLP toolkit [9] as an example of a system based on rules.6
(2) To test a system based on a supervised ML classifier we employ the SBD component from openNLP.7
(3) As an example of an unsupervised system we use the punkt [6] module from the NLTK toolkit.8
The criterion for selecting the SBD systems was the assumed
wide adoption of the general toolkits the respective SBD systems are
part of.
6 nlp.stanford.edu/software/corenlp.shtml
7 opennlp.apache.org
The rule-based sentence splitter from Stanford CoreNLP requires
a text that is already segmented into tokens. The system is based
on triggering events whose presence is a prerequisite for a
boundary to be predicted. The default events are a single “.” or a
sequence of “?” and “!”. The system may use information about
paragraph boundaries, which can be configured as either a single
EOL or two consecutive EOLs. The system may also exploit HTML
or XML markup if present. Certain patterns that may appear after
a boundary are treated as parts of the preceding sentence (e.g., a
parenthesized expression).
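For illustration, this behaviour is controlled through CoreNLP properties similar to the following sketch; the values shown are examples only, not the configuration used in our experiments:

```properties
# Illustrative CoreNLP sentence-splitting properties (example values only)
annotators = tokenize, ssplit
# treat two consecutive EOLs as a paragraph boundary
ssplit.newlineIsSentenceBreak = two
# widen the default set of boundary tokens beyond "." and "?"/"!" runs
ssplit.boundaryTokenRegex = \.|[!?]+|;
```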
The supervised sentence splitter from openNLP is based on a
maximum entropy model, which requires a corpus annotated with
sentence boundaries. The triggering events are “.”, “?”, and “!”. As
features the system uses information about the token containing
the potential boundary and about its immediate neighbours:
• the prefix
• the suffix
• the presence of particular characters in the prefix and suffix
• whether the candidate is an honorific or corporate designator
• features of the words left and right of the candidate [13]
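This feature set can be sketched as follows. The snippet is a simplified Python re-implementation for illustration only; openNLP itself is a Java library with a richer feature generator, and the designator lists here are hypothetical:

```python
HONORIFICS = {"Mr.", "Mrs.", "Dr.", "Prof."}   # illustrative lists only
CORPORATE = {"Corp.", "Inc.", "Ltd."}

def candidate_features(tokens, i):
    """Features for the token at index i containing a potential boundary."""
    tok = tokens[i]
    prefix, _, suffix = tok.partition(".")
    return {
        "prefix": prefix,
        "suffix": suffix,
        "prefix_has_digit": any(c.isdigit() for c in prefix),
        "suffix_has_digit": any(c.isdigit() for c in suffix),
        "is_designator": tok in HONORIFICS | CORPORATE,
        "prev": tokens[i - 1] if i > 0 else "<s>",
        "next": tokens[i + 1] if i + 1 < len(tokens) else "</s>",
    }

print(candidate_features(["denied", "Mr.", "Renfrow"], 1)["is_designator"])
# → True
```

A maximum entropy classifier over such features then decides, for each triggering event, whether it is a real boundary.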
The unsupervised sentence splitter (punkt) from NLTK does not
depend on any additional resources besides the corpus it is supposed
to segment into sentences. The leading idea behind the system is
that the chief source of wrongly predicted boundaries is periods
after abbreviations. The system discovers abbreviations by testing
the hypothesis P(·|w) = 0.99 against the corpus. Additionally,
token length (abbreviations are short) and the presence of internal
periods are taken into account. For prediction the system uses:
• orthographic features
• a collocation heuristic (a collocation is evidence against a split)
• a frequent sentence starter heuristic (split after an abbreviation) [6]
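The abbreviation-discovery idea can be conveyed with a much simplified frequency test. Punkt itself tests the hypothesis P(·|w) = 0.99 with a log-likelihood ratio plus length and internal-period factors; this sketch, with our own thresholds, only conveys the intuition:

```python
from collections import Counter

def likely_abbreviations(tokens, min_ratio=0.99, max_len=5):
    """Short tokens that are (almost) always followed by a period."""
    with_period = Counter(t[:-1] for t in tokens if t.endswith("."))
    total = Counter(t.rstrip(".") for t in tokens)
    return {
        w for w, n in with_period.items()
        if 0 < len(w) <= max_len and n / total[w] >= min_ratio
    }

print(sorted(likely_abbreviations(["v.", "U.S.", "court", "v.", "ruled", "court."])))
# → ['U.S', 'v']
```

Here “v.” and “U.S.” are short and always carry a trailing period, so their periods are treated as part of the token rather than as sentence boundaries; “court.” is not, because “court” mostly occurs without a period.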
The results of applying the three SBD systems to the SBD
data set described in Section 6.2 are summarized in Table 5. The
results clearly show that the performance of the general SBD systems
is drastically lower when compared to their performance on news
article data sets. It is also much below the reported performance
on user-generated web content. A certain portion of this gap
could be explained by the particular definition of the SBD task we
adopt (i.e., the aggressive segmentation). The remaining portion is
due to the decisions being particularly challenging for SBD.
The most common source of errors is wrongly predicted
sentence boundaries in citations, as shown in the example:

see United States v. X-Citement Video, Inc., 513 U.S. 64, 76-78, 115 S. Ct. 464, 130 L. Ed.& 2d 372 (1994)
8 nltk.org/api/nltk.tokenize.html
Table 5: Vanilla SBD systems performance

            cyber-test      scotus-test     ip-test
            P   R   F1      P   R   F1      P   R   F1
CoreNLP
strict-B   .81 .78 .79     .87 .76 .82     .75 .79 .77
lenient-B  .82 .78 .80     .88 .77 .82     .75 .79 .77
strict-S   .56 .54 .55     .70 .60 .64     .50 .53 .52
lenient-S  .57 .55 .56     .70 .61 .65     .51 .54 .52
openNLP
strict-B   .88 .77 .82     .84 .74 .78     .79 .78 .79
lenient-B  .88 .77 .82     .84 .74 .78     .80 .79 .79
strict-S   .64 .56 .60     .65 .57 .61     .57 .55 .56
lenient-S  .64 .56 .60     .66 .58 .61     .57 .56 .56
punkt
strict-B   .72 .79 .75     .77 .72 .74     .67 .80 .73
lenient-B  .72 .79 .75     .78 .73 .76     .69 .83 .75
strict-S   .41 .46 .44     .55 .52 .54     .42 .50 .46
lenient-S  .42 .47 .44     .56 .53 .55     .44 .53 .48
As in the previous example, the predicted boundary is marked with
&. This type of error is very serious because it causes broken sentences
to be passed on for further processing within the pipeline. These
sentences may eventually even show up in the output presented to
a user (e.g., in a summary).
Another type of commonly occurring error is a missed boundary
that follows a unit when a triggering event is absent:
(1) Deliberate avoidance is not a standard less than knowledge;§ it is simply another way that knowledge may be proven.
(2) B. Response to Jury Question§
(3) Kolender v. Lawson, 461 U.S. 352, 357, 103 S. Ct. 1855, 75 L. Ed. 2d 903 (1983);§ United States v. Lim, 444 F.3d 910, 915 (7th Cir.2006)
As in the previous example, the true boundary is marked with §.
This type of error is partly caused by our specific definition of SBD
(Example 1). Strictly speaking, the missed boundary in Example
1 does not necessarily have to be considered an error. And
indeed, under the traditional definition of SBD it would not be. Example
2 (a missed boundary after a heading) and Example 3 (a missed
boundary between two citations) are certainly errors. This type of
mistake is less serious than the previous one. It may still negatively
affect the performance of the processing pipeline, but it does not
introduce broken sentences that may eventually make it to the
output for a user.
6.5 Performance of Trained SBD Systems
OpenNLP and punkt may be trained on a custom data set, which
is encouraged. It can be expected that such training will improve
the performance of these two systems. We use the data from cyber-train
with labeled sentence boundaries to train both openNLP and
punkt. It should be noted that punkt is an unsupervised system
and as such it does not use the labels in its training. Therefore,
training punkt is very cheap and one could use a training set of
much greater size. Indeed, we expect that if we used a larger data
set to train punkt, its performance would increase beyond what we
observe in our experiments. The same does not hold for openNLP,
which is trained in a supervised fashion (requiring the gold labels).
Training openNLP is quite expensive. If we wished to use
more documents in training (further increasing the performance),
we would have to label additional documents manually.

Table 6: Trained SBD systems performance

            cyber-test      scotus-test     ip-test
            P   R   F1      P   R   F1      P   R   F1
CoreNLP++
strict-B   .81 .90 .85     .86 .93 .89     .75 .88 .81
lenient-B  .81 .91 .86     .86 .94 .90     .75 .88 .81
strict-S   .62 .70 .66     .71 .77 .74     .52 .62 .57
lenient-S  .63 .70 .66     .72 .78 .75     .53 .62 .57
openNLP++
strict-B   .93 .79 .85     .91 .76 .83     .94 .81 .87
lenient-B  .93 .79 .85     .91 .76 .83     .94 .82 .87
strict-S   .71 .59 .64     .74 .61 .67     .72 .62 .67
lenient-S  .71 .60 .65     .74 .62 .68     .73 .62 .67
punkt++
strict-B   .77 .79 .78     .82 .71 .77     .76 .80 .78
lenient-B  .77 .79 .78     .84 .73 .78     .79 .83 .81
strict-S   .47 .49 .48     .61 .53 .57     .52 .55 .53
lenient-S  .47 .49 .48     .62 .54 .58     .54 .57 .55
The CoreNLP SBD module is rule-based and therefore it is not possible
to train a custom model. To approximate training, one could
use its configuration options and tune the system to perform well
on our data set. We configured the CoreNLP SBD module to perform
well on the cyber-train data set and evaluated it alongside
openNLP and punkt on the cyber-test, scotus-test, and ip-test data
sets. It should be emphasized that this kind of comparison is very
problematic and as such should be taken with a grain of salt.
Specifically, the authors were familiar with the documents, even
those from the test sets. In light of this familiarity, CoreNLP
could have an unfair advantage in this experiment.
The performance of the trained (or configured) systems is summarized
in Table 6. We observe that all the systems perform better
when compared to the vanilla versions. The performance of some of
the systems on some of the data sets is getting close to the low nineties
(similar to the performance of the models on user-generated web
content). For example, CoreNLP++ works very well on the scotus-test
data set and openNLP++ performs reasonably on the ip-test
data set.
Even though the performance of CoreNLP improved, the
wrongly predicted boundaries in citations remain a problem. Below
are two examples of boundaries that were incorrectly predicted
by CoreNLP++:
(1) Entick v. Carrington, 95 Eng.& Rep. 807 (C. P. 1765)
(2) 451 F. Supp.& 2d 71, 88 (2006).
A new class of errors appeared as well. By allowing the system to
predict boundaries on characters such as “;”, the system sometimes
commits an error that would not have occurred in its unconfigured
version:

Knotts noted the “limited use which the government made of the signals from this particular beeper,” 460 U. S., at 284;& and reserved the question whether “different constitutional principles may be applicable to “dragnet-type law enforcement practices” of the type that GPS tracking made possible here, ibid.

This is manifested in a dramatic improvement in recall for CoreNLP++,
whereas precision remains frozen at levels similar to the unconfigured
CoreNLP SBD module.
Training improved the precision of the general openNLP
SBD module dramatically. The performance in terms of recall
remained about the same. This is a better type of improvement
compared to CoreNLP++. Although some boundaries are missed,
it is quite rare for openNLP++ to predict an incorrect boundary.
Systematic errors are mostly missed boundaries such as those in
the following examples:
(1) 5. The Government’s Hybrid Theory§
(2) This device delivers many different types of communication: live conversations, voice mail, pages, text messages, e-mail, alarms, internet, video, photos, dialing, signaling, etc.§ The legal standard for government access depends entirely upon the type of communication involved.
In the first example the system missed a boundary because it is
not associated with a triggering event (a heading). Example 2 is
interesting because the system has obviously learned that “etc.” is
an abbreviation that often does not end a sentence.
The trained punkt++ performs better than the general one. It
still commits quite a lot of errors when compared to the other two
trained/configured systems. One would probably need to train
punkt on a considerably larger data set in order to match the performance
of the other two systems. The previously identified typical
errors occur:
(1) II. ANALYSIS§
(2) “[T]he district court retains broad discretion in deciding how to respond to a question propounded from the jury and . .. & the court has an obligation to dispel any confusion quickly and with concrete accuracy.”
(3) United States v. Leahy, 464 F.3d 773, 796 (7th Cir.2006);§ United States v. Carrillo, 435 F.3d 767, 780 (7th Cir.2006)
The examples show missed boundaries after a heading (Example
1) and after a citation (Example 3). Example 2 shows a wrongly
predicted boundary after an ellipsis in a quotation.
6.6 Performance of Custom CRF Model
To improve SBD performance on court decisions even further
we train a custom CRF model, the same as the model described
in Section 4. We use cyber-train as the training set. In addition
to the 10 features described in Section 4 we add the automatically
predicted labels corresponding to the following types:
(1) Sentence
(2) Incomplete Sentence
(3) Non-sentential Sequence
This means we use a two-phase SBD system based on CRFs. In the
first pass the three CRF models attempt to label the sequence of
tokens in terms of the three types mentioned above. In the second
pass a single CRF model uses the information from the first pass
to predict the sentence boundaries. During the first pass we focus
the system on the heterogeneity of the decisions’ content. It is our
assumption that by doing this we could outperform the general-purpose
models that do not take the heterogeneity of the content
into account. In addition, the CRF has the advantage that it does not
make any assumptions about the SBD task, allowing it to fit more
easily with our particular definition.

Table 7: Custom CRF SBD system performance

            cyber-test      scotus-test     ip-test
            P   R   F1      P   R   F1      P   R   F1
Custom CRF
strict-B   .94 .96 .95     .90 .95 .92     .95 .94 .95
lenient-B  .94 .96 .95     .91 .95 .93     .96 .95 .95
strict-S   .86 .86 .86     .81 .85 .83     .86 .84 .85
lenient-S  .86 .87 .86     .82 .85 .83     .86 .84 .85
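In terms of the token-level feature dictionaries commonly used with CRF toolkits such as CRFsuite [11], the second pass can be sketched as follows. The base features and the label encoding are our own illustrative stand-ins, not the exact 10 features of Section 4:

```python
def second_pass_features(tokens, type_labels):
    """Combine simple token features with first-pass content-type labels.

    type_labels[i] holds the automatically predicted labels of token i
    for the three types, e.g. {"sent": "B-Sentence", "incomplete": "O",
    "nonsent": "O"}.
    """
    feats = []
    for i, tok in enumerate(tokens):
        f = {
            "token": tok.lower(),
            "is_upper": tok.isupper(),
            "is_punct": not any(c.isalnum() for c in tok),
        }
        f.update(type_labels[i])  # inject the first-pass CRF predictions
        feats.append(f)
    return feats
```

The second-pass CRF then labels each token as boundary or non-boundary over these enriched feature vectors, so a token inside a non-sentential sequence can be treated differently from the same token inside a full sentence.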
The results of our CRF SBD system as applied to the cyber-test,
scotus-test, and ip-test data sets are summarized in Table 7. The
custom CRF system clearly outperforms both the vanilla and the trained
general systems on all three data sets, suggesting that our general
hypothesis holds. Indeed, it appears that focusing the system on
the heterogeneity of the content results in better performance.
Although the system performs quite well, there is certainly a lot
of room for improvement. Particularly annoying classes of errors
are those shown in the examples below (§ stands for a true boundary
and & for a predicted boundary):
(1) The judgment of the Court of Appeals for the D. C.& Circuit is affirmed.
(2) . . . a “case we have described as a ‘monument of English freedom’ ‘undoubtedly familiar’ to ‘every& American statesman’ at the time the Constitution was adopted . . .
(3) . . . search is not involved and resort must be had to Katz analysis;§ but& there is no reason for rushing forward .. .
All these errors (except Example 1) are quite silly, and no traditional
SBD system would commit similar mistakes. This is the tax that
is being paid for relaxing all the traditional SBD assumptions. We
expect that these kinds of mistakes would eventually go away with
a sufficiently large training set.
7 CONCLUSIONS
We have presented preliminary results from our ongoing eort to
mine US court decisions for valuable knowledge. We tested the hy-
pothesis that one of the reasons why the decisions are challenging
for NLP processing is the heterogeneity of the decisions’ content.
We classied the content of selected decisions in terms of 10 dif-
ferent types (e.g, sentence, quotation, reference). On an example
application of sentence boundary detection we have shown how
the information about the content type improves the performance
of an SBD system. Please, notice that we do not claim that we have
come up with a beer SBD system than are the ones that were
used as baselines. ite contrary, we are certain that our system
would fail miserably if tested on traditional corpora of news articles.
We simply claim that we have created a system that outperforms
the general solutions on a specically dened SBD task (US court
decisions). e key advantage of the system is that it focuses on
the heterogeneity of the decisions’ content.
REFERENCES
[1] John Aberdeen, John Burger, David Day, Lynette Hirschman, Patricia Robinson, and Marc Vilain. 1995. MITRE: description of the Alembic system used for MUC-6. In Proceedings of the 6th Conference on Message Understanding. Association for Computational Linguistics, 141–155.
[2] Emile de Maat and Radboud Winkels. 2009. A next step towards automated modelling of sources of law. In Proceedings of the 12th International Conference on Artificial Intelligence and Law. ACM, 31–39.
[3] Emile De Maat, Radboud Winkels, and Tom Van Engers. 2006. Automated detection of reference structures in law. Frontiers in Artificial Intelligence and Applications (2006), 41.
[4] Felice Dell’Orletta, Simone Marchi, Simonetta Montemagni, Barbara Plank, and Giulia Venturi. 2012. The SPLeT-2012 shared task on dependency parsing of legal texts. In Proceedings of the 4th Workshop on Semantic Processing of Legal Texts.
[5] Matthias Grabmair, Kevin D. Ashley, Ran Chen, Preethi Sureshkumar, Chen Wang, Eric Nyberg, and Vern R. Walker. 2015. Introducing LUIMA: an experiment in legal conceptual retrieval of vaccine injury decisions using a UIMA type system and tools. In Proceedings of the 15th International Conference on Artificial Intelligence and Law. ACM, 69–78.
[6] Tibor Kiss and Jan Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics 32, 4 (2006), 485–525.
[7] John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), Vol. 1. 282–289.
[8] Yang Liu, Andreas Stolcke, Elizabeth Shriberg, and Mary Harper. 2005. Using conditional random fields for sentence boundary detection in speech. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 451–458.
[9] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations). 55–60.
[10] Andrei Mikheev. 2002. Periods, capitalized words, etc. Computational Linguistics 28, 3 (2002), 289–318.
[11] Naoaki Okazaki. 2007. CRFsuite: a fast implementation of Conditional Random Fields. (2007).
[12] David D. Palmer and Marti A. Hearst. 1997. Adaptive multilingual sentence boundary disambiguation. Computational Linguistics 23, 2 (1997), 241–267.
[13] Adwait Ratnaparkhi. 1998. Maximum entropy models for natural language ambiguity resolution. Ph.D. Dissertation. University of Pennsylvania.
[14] Jonathon Read, Rebecca Dridan, Stephan Oepen, and Lars Jørgen Solberg. 2012. Sentence boundary detection: A long solved problem? COLING (Posters) 12 (2012), 985–994.
[15] Jeffrey C. Reynar and Adwait Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing. Association for Computational Linguistics, 16–19.
[16] Michael D. Riley. 1989. Some applications of tree-based modelling to speech and language. In Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, 339–352.
[17] Oanh Thi Tran, Minh Le Nguyen, and Akira Shimazu. 2013. Reference resolution in legal texts. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Law. ACM, 101–110.
[18] Oanh Thi Tran, Bach Xuan Ngo, Minh Le Nguyen, and Akira Shimazu. 2014. Automated reference resolution in legal texts. Artificial Intelligence and Law 22, 1 (2014), 29–60.
[19] M. Van Opijnen, N. Verwer, and J. Meijer. 2015. Beyond the Experiment: the eXtendable Legal Link eXtractor. In Workshop on Automated Detection, Extraction and Analysis of Semantic Information in Legal Texts, held in conjunction with the 2015 International Conference on Artificial Intelligence and Law.
[20] Adam Wyner and Wim Peters. 2011. On Rule Extraction from Regulations. In JURIX, Vol. 11. 113–122.