Using Conditional Random Fields to Detect Different Functional Types of Content in Decisions of United States Courts with Example Application to Sentence Boundary Detection
Jaromír Šavelka
Intelligent Systems Program
University of Pittsburgh
210 South Bouquet Street
Pittsburgh, PA 15260
jas438@pitt.edu

Kevin D. Ashley
Learning Research and Development Center
University of Pittsburgh
3939 O'Hara Street
Pittsburgh, PA 15260
ashley@pitt.edu
ABSTRACT
We detect different functional types of content in decisions of the United States courts using conditional random fields models. The work suggests a possible approach to deal with the obstacles in processing the court decisions automatically. Even basic natural language processing tasks such as sentence boundary detection are challenging when performed on the decisions. It is our assumption that one of the main causes of this difficulty is the presence of heterogeneous content where different categories of content require different treatments. Thus, the goal of this work is to automatically distinguish among the different functional types of content. The trained models find full and incomplete sentences as well as non-sentential sequences. In addition, the models detect external and internal references, quotations, editorial marks, headings, metadata fields and numbering sequences such as lists. We show the utility of this approach on an example of the improved performance in the sentence boundary detection task as compared to existing general approaches that do not take the heterogeneity of the content into account.
KEYWORDS
Sequence labeling, conditional random fields, court decision, sentence boundary detection
ACM Reference format:
Jaromír Šavelka and Kevin D. Ashley. 2017. Using Conditional Random Fields to Detect Different Functional Types of Content in Decisions of United States Courts with Example Application to Sentence Boundary Detection. In Proceedings of 2nd Workshop on Automated Semantic Analysis of Information in Legal Texts, London, Great Britain, June 2017 (ASAIL 2017), 10 pages.
DOI: 10.475/123 4
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ASAIL 2017, London, Great Britain
© 2016 Copyright held by the owner/author(s). 123-4567-24-567/08/06. . . $15.00
DOI: 10.475/123 4
1 INTRODUCTION
In this paper we present some early results from our effort to automatically recognize different functional types of content in US court decisions. The decisions are complex documents where different types of content appear side by side. Sentences may be interleaved with citations and editorial marks (e.g., a page number or correction). The whole decision as well as the individual sentences may be organized into list-like structures. Excerpts from other documents as well as speech transcripts are often present. The decisions may be very long, organized into sections and subsections.
The heterogeneity of content is one of the reasons why it is difficult to apply most of the well established NLP techniques to court decisions. Even basic NLP tasks such as sentence boundary detection are challenging when performed on the decisions. Wyner and Peters observe that lists (using punctuation, including enumerations, colons) and references (containing a mix of punctuation and alpha-numeric characters) confound tokenization and sentence splitting. [20] Although they talk about regulatory texts, these observations apply to the decisions as well. At the same time the use of the mentioned techniques is foundational to many valuable applications from information retrieval to computer assisted reasoning. Less than optimal performance in the lower level processing often introduces mistakes into the pipeline that are impossible to correct in later stages. It is our assumption that the ability to distinguish different types of content may dramatically improve the quality of the processing. It appears that de Maat and Winkels express similar assumptions in [2]. They talk about splitting the sentences into the principle and the auxiliary. In addition, they observe that lists cause degradation of sentence classification performance.
We developed a simple labeling scheme that captures some of the functional types of content. The scheme is described in Section 2. It is a set of types such as sentential and non-sentential content or references and quotations. We applied the scheme to a data set of 19 selected court decisions (mostly from the domain of cyber crime), producing more than 20,000 annotations. The annotation effort and the resulting data set are the subject of Section 3. We train sequence labeling models (specifically CRF) capable of producing the annotations automatically. The training and evaluation of the models are described in Section 4. Finally, we use some of the models in the task of sentence boundary detection (Section 6). We show improved performance over the methods that do not take the heterogeneity of the content into account.
There is a large body of research in AI & Law that focuses on extraction of various types of content from legal documents. The most closely related to our work is the consistent effort in extraction of references and citations. Reference recognition focuses on finding references to other documents (or their specific parts) such as statutory provisions, court decisions, or expert literature. Tran et al. present a framework for recognition and resolution of references in Japanese law. [17, 18] De Maat et al. developed a system for detection of references in Dutch law. [3] Van Opijnen et al. focus not only on the references to Dutch legal documents but also recognize the references to EU documents. [19]
Dell'Orletta et al. present a law specific account of sentence boundaries which is in many respects similar to the one we use here. [4] Grabmair et al. developed a dedicated sentence boundary detection module to operate on court decisions by extending LingPipe.1 [5]
1 http://alias-i.com/lingpipe/
2 LABELING SCHEME
At the most fundamental level we distinguish between sentential and non-sentential content. The distinction is drawn at a semantic level, not purely a grammatical one. This may appear quite counterintuitive on certain occasions. The sentential content comprises unstructured data in the form of natural language text that plays the role of a substantive statement in a decision. The rest of the content, which mostly plays a supportive role, such as structured meta data or standardized references, is considered non-sentential. The distinction appears to be similar to the one suggested in [2] (principle and auxiliary sentences). We further differentiate the sentential content into sentences and incomplete sentences.
Sentence is a sequence of tokens (most often words) that together form a full stand-alone grammatical sentence. It usually starts with a capital letter and ends with a period, exclamation mark, question mark, or (rarely) a different symbol. Consider the following examples:
(1) We have recognized that even a limited search of the person is a substantial invasion of privacy.
(2) In other contexts, however, we have held that although "some quantum of individualized suspicion is usually a prerequisite to a constitutional search or seizure[,] . . . the Fourth Amendment imposes no irreducible requirement of such suspicion."
(3) Justice Jackson rejected the view that the search could be supported by exigent circumstances:
Example 1 shows a typical sentence starting with a capital letter and ending with a period. Example 2 shows a more complex sentence with a quotation and some editorial marks (e.g., ". . . ") embedded within it. Example 3 presents a sentence ending with a colon, which is traditionally not used to terminate a sentence.
Incomplete sentence is a sequence of tokens (most often words) that do not form a full grammatical sentence. These sequences usually appear in headings or parentheses. They may also appear in front of lists or enumerations. Examples:
(1) (student with bloodshot eyes wandering halls in violation of school rule requiring students to remain in examination room or at home during midterm examinations)
(2) The nature of the writ.
Example 1 is an expression in parentheses which does not form a full grammatical sentence. Example 2 looks like a full sentence but it lacks a verb phrase. This may often be the case with headings.
Non-sentential piece of content is a sequence of tokens which is not unstructured text in natural language. The sequence is often not formed of words or has very few words. Typically, it would make no sense to assign the tokens of such a sequence part of speech tags (e.g., noun, verb). These sequences usually appear as references or as meta data in front matter. Some examples are shown in the following list:
(1) See Warden v. Hayden, 387 U.S. 294, 306-307 (1967).
(2) November 14, 2012.
(3) [ Footnote 1 ]
Example 1 is a reference in a standardized format. Note that on a purely functional level this is an imperative sentence, yet within our labeling scheme it is considered a non-sentential sequence. This is because it is not a substantive statement—its role is rather supportive. Example 2 shows a date expression. Example 3 presents a numbering token from the footnotes list.
Citations play an important role in court decisions. To account for this phenomenon we detect references to other documents (external references) and references to other parts of the same document (internal references). In addition, we look for sequences of text that come from other documents or other parts of the same document (quotations). It is worth mentioning that these types are not mutually exclusive with the types presented earlier (i.e., sentence, non-sentential sequence). Quite the contrary, almost any reference is at the same time a non-sentential sequence (but not necessarily the other way around).
External reference is a sequence of tokens the role of which is to point to a document (or its part) other than the one in which it is embedded. External references often have a standardized format (see the examples) and they could be easily recognized:
(1) [469 U.S. 325, 329]
(2) West Virginia State Bd. of Ed. v. Barnette, 319 U.S. 624, 637 (1943)
(3) Parent-Student Handbook of Piscataway [N. J.] H. S. (1979), Record Doc. S-1, p. 7.
Example 1 shows a reference to a statute. Example 2 is a reference to a court decision. Example 3 presents a reference to a book.
Internal reference is a sequence of tokens the role of which is to point to another place in the same document. Internal references often have the form of footnote pointers, pointers to figures or tables, or pointers to preceding or following sections of the document. Consider the following examples:
(1) Ante, at 337
(2) Figure 1
Example 1 shows a reference to a preceding part of the text (e.g., a page or paragraph). Example 2 shows a reference to a figure within the decision.
Detecting Dierent Types of Content in US Court Decisions ASAIL 2017, June 2017, London, Great Britain
Quotation is a sequence of tokens from a different document (or from a different part of the same document) that is copied into the document of interest. Sometimes the sequence could be modified (still preserving the meaning). Quite often a quotation is surrounded with quotation marks.
(1) "a school official may properly conduct a search of a student's person if the official has a reasonable suspicion that a crime has been or is in the process of being committed, or reasonable cause to believe that the search is necessary to maintain school discipline or enforce school policies."
(2) "[A]s a general rule notice and hearing should precede removal of the student from school. We agree . . . , however, that there are recurring situations in which prior notice and hearing cannot be insisted upon. Students whose presence poses a continuing danger to persons or property or an ongoing threat of disrupting the academic process may be immediately removed from school. In such cases the necessary notice and rudimentary hearing should follow as soon as practicable . . . ."
Example 1 shows an in-line quotation which is a part of a surrounding sentence. Example 2 is a standalone quotation consisting of a number of sentences. It also contains a couple of editorial marks (". . . ", square brackets).
We detect a number of other constituents in a decision because these could be useful in more advanced processing stages. In particular, we are interested in constituents playing a specific role in a decision (headings). We also detect elements suggestive of the structure of the presented content (numbering tokens). Finally, we are interested in the elements that are inserted in the document to make it more readable or useful (editorial marks and meta data fields).
Heading is a sequence of tokens, either a full sentence or an incomplete sentence, the role of which is to open (and concisely describe) a document or a section of a document. Examples:
(1) JUSTICE WHITE delivered the opinion of the Court.
(2) MEMORANDUM AND ORDER ON A DEFENDANTS' MOTION TO DISMISS
Example 1 is a typical template sentence opening the opinion part of the decision. Strictly speaking one would not consider this sentence a heading within the general meaning of the word. However, we consider it as such in our labeling scheme. Example 2 is a heading opening a specific part of an opinion.
Numbering token (or a sequence of tokens) defines a position of the associated piece of text within a structure such as a list or a tree. Numbering can usually be found at the beginning of a line but it can also appear within a paragraph.
(1) 1st.
(2) III
(3) [ Footnote 1 ]
(4) (viii)
Examples 1 through 4 show different types of numbering tokens that are commonly found in the decisions.
Editorial mark is any token or sequence of tokens which is introduced in a text by somebody other than its author in order to make the text more readable or useful. This could be certain symbols introduced into a citation by the author of a document or symbols introduced into a document by a publisher.
(1) *153
(2) . . .
(3) [A]
Example 1 shows an editorial mark which is commonly used to indicate the beginning of a page of the specified number. This type of mark is often used in documents that were transformed from a printed form into an electronic form. Example 2 is a mark indicating that a piece of original content is left out. The square brackets shown in Example 3 indicate that "A" was inserted in the original text by an editor.
Meta data eld is a sequence of tokens representing a structured
information, usually in a form of a label–value pair. Consider the
following examples:
(1) Filed: April 16th, 2009
(2)
Panel: Frank Hoover Easterbrook, Michael Stephen Kanne,
Ann Claire Williams
Example 1 is a meta data eld providing an information about
the day on which the motion was led. Example 2 shows a meta
data eld that provides an information about the members of the
deciding panel.
3 DATA SET
We downloaded 19 court decisions from the online CourtListener service.2 13 of these decisions are from the area of cyber crime (cyber bullying, credit card fraud, possession of electronic child pornography), 3 are landmark SCOTUS decisions, and the remaining 3 are cases involving intellectual property. We use 10 cyber crime decisions as a training and development set (cyber-train) and the remaining 3 cyber crime decisions as a hold-out test set (cyber-test). We use the 3 SCOTUS decisions and the 3 intellectual property decisions as additional test sets (scotus-test, ip-test). Although we focus on the area of cyber crime, we use the additional test sets to measure how well the trained models generalize to other domains.
The two human annotators (the authors of this paper) annotated the decisions with the labels described in Section 2. Each decision was annotated by only one of the annotators. We did not measure inter-annotator agreement. We simply assume the labels provided by each annotator to be correct. We are well aware that this assumption does not hold. Despite this we are convinced that, given the goal of this work, it is a reasonable assumption anyway. The annotators were provided with a codebook that contains information similar to what is presented in Section 2 (i.e., a description of each label with examples). In addition, for each label there was a set of if–then rules guiding the annotators, such as the following one:
If a sequence of tokens represents what is written in another document or at a different place of the document of interest then it is a quotation.
The basic characteristics of the data set are summarized in Table 1. Summary statistics about the annotations are provided in Table 2. The relatively small number of decisions may suggest a very small data set. However, some of the annotated decisions are very long.
2 www.courtlistener.com
Table 1: Data set summary statistics. Length of longest,
shortest and average documents is reported in number of
characters.
cyber-train cyber-test scotus-test ip-test
# of docs 10 3 3 3
# of tokens 219484 75158 130065 42382
longest doc 181009 121910 190352 54430
average doc 58237 67820 118749 38549
shortest doc 16859 29278 47450 16546
very long. e longest decision with 190,352 characters roughly
corresponds to 80–100 standard pages. Even the shortest decision
(16,546 characters) corresponds to approximately 7 standard pages.
e total number of annotations (21,294) clearly shows that the
data set is sucient for far more than toy experiments.
4 LABELING THE SEQUENCES AUTOMATICALLY
We use the cyber-train data set to train a CRF model for each of the 10 labels. Although this is certainly suboptimal, we use the same training strategy and features for all the models. It should be evident that different types (such as a sentence or an editorial mark) could most likely benefit from a custom-tailored model and contextual features. We reserve fine-tuning of the individual models for future work. A CRF is a random field model that is globally conditioned on an observation sequence O. The states of the model correspond to event labels E. We use a first-order CRF in our experiments (observation O_i is associated with E_i). We use the CRFsuite3 implementation of first-order CRF. [7, 8, 11]
3 www.chokkan.org/software/crfsuite/
We use a very aggressive tokenization strategy that segments text into a greater number of tokens than usual. We consider an individual token to be any consecutive sequence of either:
(1) letters
(2) numbers
(3) whitespace
Each character that does not belong to any of the above constitutes a single token. For example, the following sequence is tokenized as shown below:
Call me at 9am on my phone (123)456-7890.
["Call", " ", "me", " ", "at", " ", "9", "am", " ", "on", " ", "my", " ", "phone", " ", "(", "123", ")", "456", "-", "7890", "."]
Each of the tokens is then a data point in the sequence a CRF model operates on.
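The paper does not include code for this step; the following is a minimal Python sketch of the aggressive tokenization described above, assuming a simple regular-expression implementation (ASCII letters only) rather than the authors' actual code.

import re

# A sketch of the aggressive tokenization described above (not the authors'
# implementation): runs of letters, runs of digits, and runs of whitespace
# each become one token; every other character becomes a token on its own.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|\s+|.", re.DOTALL)

def tokenize(text):
    # No characters are discarded; whitespace runs are kept as tokens.
    return TOKEN_RE.findall(text)

print(tokenize("Call me at 9am on my phone (123)456-7890."))
# ['Call', ' ', 'me', ' ', 'at', ' ', '9', 'am', ' ', 'on', ' ', 'my', ' ',
#  'phone', ' ', '(', '123', ')', '456', '-', '7890', '.']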
Each token is represented by a small set of relatively simple features. Specifically, the set includes:
(1) atfront: a binary feature which is set to true if the token is located among the first 20% of tokens of the whole document.
(2) atback: a binary feature which is set to true if the token is located among the last 20% of tokens of the whole document.
(3) lower: the token in lower case.
(4) sig: a feature representing a signature of the token. This feature corresponds to the token with the following transformations applied:
(a) each lower case letter is rewritten to "c"
(b) each upper case letter is rewritten to "C"
(c) each digit is rewritten to "D"
(5) length: a number corresponding to the length of the token in characters if the length is smaller than 4. If the length is between 4 and 6 the feature is set to "normal". If it is greater than 6 it is set to "long".
(6) islower: a binary feature which is set to true if all the token characters are in lower case.
(7) isupper: a binary feature which is set to true if all the token characters are in upper case.
(8) istitle: a binary feature which is set to true if the first of the token characters is in upper case and the rest are in lower case.
(9) isdigit: a binary feature which is set to true if all the token characters are digits.
(10) isspace: a binary feature which is set to true if all the token characters are whitespace.
In addition, for each token we also include the lower, sig, islower, isupper, istitle, isdigit, and isspace features from the five preceding tokens and five following tokens. If one of these tokens falls beyond the document boundaries we signal this by including BOS (beginning of sequence) and EOS (end of sequence) features.
Consider the "Call me at 9am . . ." sequence from the example presented above. The third token of this sequence ("me") would be represented along the following lines:
{bias, atfront=true, atback=false, 0:lower=me,
0:sig=cc, 0:length=2, 0:islower=true,
0:isupper=false, 0:istitle=false, 0:isdigit=false,
0:isspace=false, -3:BOS, -2:lower=call, -2:sig=Ccc,
-2:length=normal, -2:islower=false, -2:isupper=false,
-2:istitle=true, -2:isdigit=false, -2:isspace=false,
-1:lower=" ", -1:sig=" ", -1:length=1,
-1:islower=false, -1:isupper=false, -1:istitle=false,
-1:isdigit=false, -1:isspace=true
...}
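The following Python sketch shows one way to build such feature dictionaries; it follows the feature descriptions above but is an illustration rather than the authors' implementation (the window size and feature names come from the text, the function names and everything else are assumptions).

def token_features(tok):
    # Per-token features, reused for the current token and its neighbours.
    sig = "".join("C" if c.isupper() else "c" if c.islower() else
                  "D" if c.isdigit() else c for c in tok)
    length = len(tok) if len(tok) < 4 else ("normal" if len(tok) <= 6 else "long")
    return {"lower": tok.lower(), "sig": sig, "length": length,
            "islower": tok.islower(), "isupper": tok.isupper(),
            "istitle": tok.istitle(), "isdigit": tok.isdigit(),
            "isspace": tok.isspace()}

def featurize(tokens, i, window=5):
    # Feature dictionary for tokens[i] with five tokens of context on each side.
    n = len(tokens)
    feats = {"bias": True,
             "atfront": i < 0.2 * n,   # among the first 20% of tokens
             "atback": i >= 0.8 * n}   # among the last 20% of tokens
    for off in range(-window, window + 1):
        j = i + off
        if j < 0:
            feats[f"{off}:BOS"] = True
        elif j >= n:
            feats[f"{off}:EOS"] = True
        else:
            for name, value in token_features(tokens[j]).items():
                feats[f"{off}:{name}"] = value
    return feats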
As labels we use the annotation types projected into the BILOU4 scheme. Considering an example of annotating time (TIM) and phone number (TEL) expressions, the "Call me at 9am . . ." sequence would be labeled as follows:
[O, O, O, O, O, B-TIM, L-TIM, O, O, O, O, O, O, B-TEL, I-TEL, I-TEL, I-TEL, I-TEL, L-TEL, O]
For demonstration purposes we show multiple annotation types at once. In our work we used one annotation type for each model. In addition, instead of the TIM and TEL types from this example we worked with the types presented in Section 2. Thus, each of the models was trained on a tag set with 5 labels such as the following one:
B-SENT, I-SENT, L-SENT, O, U-SENT
4 B: beginning of sequence, I: inside sequence, L: last in sequence, O: outside of sequence, U: unit-length sequence
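As a concrete illustration of this setup, the sketch below trains one such per-type model with the python-crfsuite bindings for CRFsuite, reusing the featurize function sketched earlier; the span representation, helper names, and model file name are assumptions made for the example, not details taken from the paper.

import pycrfsuite

def bilou_labels(n_tokens, spans, tag):
    # Project (start, end) token spans of one annotation type onto BILOU labels;
    # end is exclusive.
    labels = ["O"] * n_tokens
    for start, end in spans:
        if end - start == 1:
            labels[start] = "U-" + tag
        else:
            labels[start] = "B-" + tag
            labels[end - 1] = "L-" + tag
            for k in range(start + 1, end - 1):
                labels[k] = "I-" + tag
    return labels

def train_one_type(documents, tag, model_path):
    # documents: list of (tokens, spans) pairs for a single annotation type.
    trainer = pycrfsuite.Trainer(verbose=False)
    for tokens, spans in documents:
        xseq = [featurize(tokens, i) for i in range(len(tokens))]
        yseq = bilou_labels(len(tokens), spans, tag)
        trainer.append(xseq, yseq)
    trainer.train(model_path)  # first-order (linear-chain) CRF

# Tagging a new document with a trained model:
# tagger = pycrfsuite.Tagger(); tagger.open("sent.crfsuite")
# labels = tagger.tag([featurize(tokens, i) for i in range(len(tokens))])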
Detecting Dierent Types of Content in US Court Decisions ASAIL 2017, June 2017, London, Great Britain
Table 2: Summary statistics of annotations for the cyber-train, cyber-test, scotus-test, and ip-test data sets. Average sequence length is reported in number of characters. Annotation types: Sentence (SENT), Incomplete Sentence (ISENT), Non-sentential Sequence (NSENT), External Reference (EREF), Internal Reference (IREF), Quotation (QUOT), Editorial Mark (EMARK), Heading (HEAD), Numbering Token (NMB), Meta Data Field (MDF).
SENT ISENT NSENT EREF IREF QUOT EMARK HEAD NMB MDF
cyber-train
# of seq 3493 495 1945 1574 189 959 818 108 360 43
avg # of seq per doc 349 50 195 157 19 96 82 11 36 4
avg seq length 145 50 31 37 12 143 6 23 3 40
cyber-test
# of seq 1060 125 728 753 46 325 245 21 109 19
avg # of seq per doc 353 42 243 254 15 108 82 7 36 6
avg seq length 163 84 29 30 7 112 4 33 3 63
scotus-test
# of seq 1984 158 1316 1256 185 375 164 34 182 2
avg # of seq per doc 661 53 439 419 62 125 55 11 61 1
avg seq length 153 54 34 38 8 123 6 29 3 26
ip-test
# of seq 608 119 474 506 37 224 116 38 89 2
avg # of seq per doc 203 40 158 169 12 75 39 13 30 1
avg seq length 160 58 26 27 3 111 6 28 3 91
total
# of seq 7145 897 4463 4099 457 1883 1343 201 740 66
avg # of seq per doc 376 47 235 216 24 99 71 11 39 3
avg seq length 151 57 31 34 9 130 6 26 3 48
5 RESULTS
The results of applying the CRF models trained on the cyber-train data set to the cyber-test, scotus-test, and ip-test data sets are summarized in Table 3. In the table each column represents a type and each row represents a token type (from the BILOU scheme) for one of the three test sets. We report the F1 value as well as the support (the number of true labels of the respective type within the BILOU scheme, e.g., the number of B-SENT) for each token type. The feature set we have selected works well for the Sentence, Non-sentential Sequence, and External Reference types across all three data sets. In the case of the Internal Reference, Quotation, Editorial Mark, and Numbering types the performance is somewhat lower although quite promising. The features apparently do not work well for the Incomplete Sentence and Meta Data Field types. We reserve fine-tuning of the models (and thus improvement of performance) for future work. As expected, the performance in almost all the categories is slightly higher on the cyber-test data set when compared to the performance on the scotus-test and ip-test data sets.
6 EXAMPLE APPLICATION ON SENTENCE BOUNDARY DETECTION
The goal in sentence boundary detection (SBD) is to split a natural language text into individual sentences (i.e., identify each sentence's boundaries). Typically, SBD is operationalized as a binary classification of a fixed number of candidate boundary points (e.g., ".", "!", "?"). SBD can be a critical task in many applications such as machine translation, summarization, or information retrieval.
Approaches to SBD roughly fall into three categories:
(1) Rules. A battery of hand-crafted matching rules is applied. The rules may look like the following:
IF "!" OR "?" MATCHED MARK AS BOUND
IF <EOL> <EOL> MATCHED MARK AS BOUND
The first rule states that every time there is a "!" or "?" the system should consider it a boundary. The second rule can be understood in such a way that a boundary should be predicted every time the system encounters two consecutive line breaks (a minimal code sketch of this approach follows the list below).
(2) Supervised Machine Learning (ML). Given that a triggering event occurs, decide whether it is an instance of a sentence boundary. Each event is represented in terms of selected features such as the following:
x_i = <0:token=".", 0:isTrigger=1, -1:token="Mr", -1:isAbbr=1, 1:token="Lange", 1:isName=1>
Given the labels y_i ∈ {0, 1}, the supervised classifier is a function f(x_i) → y_i.
(3) Unsupervised ML. Similar to the supervised ML approach, but the system is trained on unlabeled data.
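As promised in item (1), here is a minimal Python sketch of a rule-based splitter implementing just the two example rules above; it is purely illustrative and not one of the systems evaluated later.

import re

# The two example rules: "!" or "?" marks a boundary, and so do two
# consecutive line breaks.
BOUNDARY_RE = re.compile(r"[!?]|\n[ \t]*\n")

def rule_based_boundaries(text):
    # Character offsets immediately after each predicted boundary.
    return [m.end() for m in BOUNDARY_RE.finditer(text)]

print(rule_based_boundaries("Is it a boundary? Yes!\n\nNew paragraph."))
# [17, 22, 24]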
Multiple SBD systems have been reported as having excellent performance: [14]
- 99.8% accuracy of a tree-based classifier in predicting whether a "." ends a sentence, evaluated on the Brown corpus [16]
- 99.5% accuracy of a combination of an original system based on neural nets and decision trees with an existing system [1], evaluated on the WSJ corpus [12]
Table 3: Annotation types: Sentence (SENT), Incomplete Sentence (ISENT), Non-sentential Sequence (NSENT), External Reference (EREF), Internal Reference (IREF), Quotation (QUOT), Editorial Mark (EMARK), Heading (HEAD), Numbering Token (NMB), Meta Data Field (MDF). The table shows F1 and support for all the types on the three data sets.
SENT ISENT NSENT EREF IREF QUOT EMARK HEAD NMB MDF
F1supp F1supp F1supp F1supp F1supp F1supp F1supp F1supp F1supp F1supp
cyber-test
B .93 1058 .40 121 .88 645 .85 564 .78 45 .73 325 .83 98 .62 16 .90 109 .89 19
I .96 57630 .42 3400 .94 10483 .95 565 .43 117 .69 325 .76 142 .69 180 .89 66 .87 370
L .93 1058 .44 121 .88 646 .60 565 .80 45 .75 11807 .84 107 .29 16 .90 109 .83 19
O .86 15412 .98 71512 .99 63307 .99 10496 1.0 74951 .94 62701 1.0 74718 1.0 74941 1.0 74874 1.0 74750
U 0.0 4 .93 77 0.0 194 .16 93 .75 5
AVG .94 75158 .95 75158 .98 75158 .98 75158 1.0 75158 .90 75158 1.0 75158 1.0 75158 1.0 75158 1.0 75158
scotus-test
B .95 1976 .25 144 .80 1165 .74 1107 .44 139 .49 374 .66 102 .32 23 .50 2 .24 90
I .98 99514 .27 2625 .92 20377 .90 19923 .51 540 .41 15496 .58 242 .43 221 .60 14 .30 68
L .95 1977 .39 144 .81 1157 .47 1102 .13 139 .51 373 .69 102 .47 21 .50 2 .24 90
O .94 26598 .99 127139 .98 107215 .98 107785 1.0 129201 .91 113822 1.0 129603 1.0 129789 1.0 130047 1.0 129725
U 0.0 13 .70 151 0.0 148 .29 46 .54 16 .71 11 .87 92
AVG .97 130065 .97 130065 .97 130065 .96 130065 .99 130065 .85 130065 1.0 130065 1.0 130065 1.0 130065 1.0 130065
ip-test
B .93 600 .38 105 .81 468 .81 445 .91 32 .73 224 .70 56 .09 31 .88 89 0.0 2
I .97 31675 .32 2094 .87 6026 .92 6297 .71 50 .69 8208 .80 133 .17 290 .84 60 0.0 69
L .93 603 .47 105 .86 472 .51 447 .85 37 .73 224 .71 57 .47 30 .86 88 0.0 2
O .90 9504 .98 40067 .98 35416 .98 35135 1.0 42263 .91 33726 1.0 42113 1.0 42027 1.0 42145 1.0 42382
U 0.0 11 .32 58 .82 23 .22 4
AVG .96 42382 .94 42382 .96 42382 .97 42382 1.0 42382 .87 42382 1.0 42382 .99 42382 42382 1.0 42382
- 99.75% accuracy (WSJ) and 99.64% (Brown) of a maximum entropy model in assessing ".", "!", and "?" [15]
- 99.69% (WSJ) and 99.8% (Brown) of a rule-based sentence splitter combined with a supervised POS-tagger [10]
- 98.35% (WSJ) and 98.98% (Brown) of an unsupervised system based on identification of abbreviations [6]
Read et al. [14] conducted a study of SBD system performance across different corpora and report more modest results ranging from 95.0% to 97.6% for different systems. They also tested the systems on corpora of user generated web content. The performance of the SBD systems deteriorated on these corpora, where the accuracy often falls in the lower nineties. [14]
6.1 SBD on Court Decisions
Court decisions are more challenging for SBD than news articles—the traditional subject of interest. Whereas news articles are generally short texts, a decision may be short but it may also be as long as a book (consider the 80–100 page long decision in Table 1). A decision may be structured into sections and subsections preceded by a heading (possibly numbered). A decision may contain specific constituents such as a header and a footer, footnotes, or lists. Sentences are interleaved with citations. The sentences themselves may be extremely long, even organized as lists. In decisions there is heavy use of sentence organizers such as ";" or "—" and brackets (multiple types). Quotes (possibly nested) are frequent.
Let us consider the following example of a very long sentence coming from a decision:
As used in the statute, "'act in furtherance of a person's right of petition or free speech under the United States or California Constitution in connection with a public issue' includes: (1) any written or oral statement or writing made before a legislative, executive, or judicial proceeding, or any other official proceeding authorized by law; (2) any written or oral statement or writing made in connection with an issue under consideration or review by a legislative, executive, or judicial body, or any other official proceeding authorized by law; (3) any written or oral statement or writing made in a place open to the public or a public forum in connection with an issue of public interest; (4) or any other conduct in furtherance of the exercise of the constitutional right of petition or the constitutional right of free speech in connection with a public issue or an issue of public interest. (§ 425.16, subd. (e), italics added; see Briggs v. Eden Council for Hope & Opportunity (1999) 19 Cal. 4th 1106, 1117-1118, 1123 [81 Cal.Rptr.2d 471, 969 P.2d 564] [discussing types of statements covered by anti-SLAPP statute].)
The example sentence contains a quotation (with a nested quotation) organized as a list, followed by citations and their captions. This text is very challenging for an SBD system because it contains many triggering events that are not sentence boundaries.
The sentences in decisions tend to be very long (like the example sentence shown above). This may cause trouble for various components in the processing pipeline (e.g., syntactic parsing might be difficult). Therefore, it makes sense to adopt an aggressive segmentation strategy and predict boundaries wherever possible. One such opportunity is semicolons, which are sometimes used to separate items in a list (as shown in Example 1 below) as well as independent clauses (Example 2). The authors of [4] also decided to go beyond the traditional triggering events (e.g., they use ";" and ":" as well).
(1) [O]ur family suffered: emotional distress; anxiety; sleeplessness; physical pain; insecurity; fear; pain and suffering; payment of attorneys' fees; payment of medical expenses;
Detecting Dierent Types of Content in US Court Decisions ASAIL 2017, June 2017, London, Great Britain
Table 4: SBD Data set summary statistics. Length of longest,
shortest and average sentences is reported in number of
characters.
cyber-train cyber-test scotus-test ip-test
# of docs 10 3 3 3
# of sentences 5423 1757 3214 1107
longest sentence 1065 1182 670 1145
average sentence 106 114 110 103
shortest sentence 1 1 1 2
payment of moving expenses; payment of *1204 traveling and housing expenses to and from Los Angeles to support our business endeavors; [and] [D.C.]'s lost income. . . .
(2) It takes RAPCO a year or two to design, and obtain approval for, a complex part; the dynamometer testing alone can cost $75,000. . . . Drawings and other manufacturing information contain warnings of RAPCO's intellectual-property rights; every employee receives a notice that the information with which he works is confidential.
Additional complexity is caused by the presence of informal, poorly edited text:
The next post, from "DAN JUSTICE," is the first to raise the rhetoric to a level that could, when considered out of context, be construed as a threat. It says "HEY [D.C.], I KNOW A GOOD *** WHEN I SEE ONE. I LIKE WHAT I SEE, LET'S GO GET SOME COFFEE. ****** im gonna kill you" and is signed "H-W student."5
A sentence may span a double line break:
Section 1(4) of the Uniform Act provides that:
"Trade secret" means information, including a formula, pattern, compilation, program, device, method, technique, or process . . .
Headings are examples of sequences that may not have a triggering event at all:
FACTS AND PROCEDURAL HISTORY
6.2 SBD Data Set
We use the cyber-train, cyber-test, scotus-test, and ip-test data sets for the SBD experiments. Based on the phenomena presented above we concluded that a definition of SBD as a binary classification of a limited number of triggering events is not adequate. Furthermore, we observe that segmentation of a decision into sentences (or sentence-like units) may often be done in multiple different ways. Therefore, we do not adopt the limited view of triggering events (allowing possibly any token to be a boundary). We also apply a consistent policy of aggressive segmenting (i.e., if doubts exist, there is a boundary). The basic characteristics of the SBD data set are summarized in Table 4.
5 The asterisks are ours.
6.3 SBD Experiments
We conduct a series of experiments to test the hypothesis that taking the heterogeneity of a decision's content into account leads to improved performance in SBD. First, we measure the performance of existing vanilla SBD systems (i.e., using the pre-trained general models) on the cyber-test, scotus-test, and ip-test data sets. Second, we test the performance of the same models trained on the cyber-train data set. Finally, we measure the performance of a custom CRF solution that uses automatically predicted information about the different types of content. We use the sentence, incomplete sentence, and non-sentential sequence types as features. Note that these types, although quite related, do not map directly to the sentence boundaries within the meaning of SBD. Also note that we use as features the automatic predictions of these types—not the gold standard.
For evaluation we use traditional information retrieval measures—precision (P), recall (R), and F1-measure (F1). We evaluate the performance from two different perspectives:
(1) boundaries: each boundary counts on its own
(2) segments: both boundaries need to match
For each perspective we use two approaches to determine if a boundary was predicted correctly:
(1) strict: boundary offsets match exactly
(2) lenient: the difference between boundary offsets does not contain an alphanumeric character
Let us consider the following example where § stands for a true boundary and & for a predicted boundary:
§&Accordingly, we find that the circuit court did not abuse its discretion when it denied Mr.& &Renfrow's motion for a JNOV.§ §**&We find no merit to this issue.§&
Two of the predicted boundaries match the true boundaries. The remaining three differ. In the case of one of the three, the difference consists only of the two asterisks (non-alphanumeric). From the strict boundaries perspective (strict-B) the P is 0.4 and R is 0.5. Using the lenient boundaries perspective (lenient-B) the P is 0.6 and R is 0.75. From the strict segments perspective (strict-S) both P and R are 0 (no segment is predicted correctly). Using the lenient segments perspective (lenient-S) the P is 0.33 and R is 0.5. Using the different perspectives allows a more detailed analysis of a model's performance. As shown above, decent performance in predicting boundaries correctly does not necessarily imply that whole segments are predicted correctly as well.
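The boundary-level part of this evaluation can be made concrete with a short Python sketch; representing boundaries as character offsets and the function names below are assumptions made for illustration, not the authors' code.

def is_lenient_match(p, t, text):
    # A predicted boundary p and a true boundary t (character offsets) match
    # leniently if the text between them contains no alphanumeric character.
    lo, hi = sorted((p, t))
    return not any(ch.isalnum() for ch in text[lo:hi])

def boundary_scores(pred, true, text, lenient=False):
    # Precision and recall from the "boundaries" perspective; the "segments"
    # perspective would additionally require both ends of a segment to match.
    match = (lambda p, t: is_lenient_match(p, t, text)) if lenient \
            else (lambda p, t: p == t)
    matched_pred = sum(any(match(p, t) for t in true) for p in pred)
    matched_true = sum(any(match(p, t) for p in pred) for t in true)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_true / len(true) if true else 0.0
    return precision, recall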
6.4 Performance of Vanilla SBD Systems
For the evaluation of SBD system performance on the corpus of court decisions we use one system from each category:
(1) We work with the SBD module from the Stanford CoreNLP toolkit [9] as an example of a system based on rules.6
(2) To test a system based on a supervised ML classifier we employ the SBD component from openNLP.7
6 nlp.stanford.edu/software/corenlp.shtml
7 opennlp.apache.org
(3) As an example of an unsupervised system we use the punkt [6] module from the NLTK toolkit.8
The criterion for selection of the SBD systems was the assumed wide adoption of the general toolkits the respective SBD systems are part of.
8 nltk.org/api/nltk.tokenize.html
e rule based sentence splier from Stanford CoreNLP requires
a text to be already segmented into tokens. e system is based
on triggering events the presence of which is a prerequisite for a
boundary to be predicted. e default events are a single “. or a
sequence of “?” and “!”. e system may use information about
paragraph boundaries which can be congured as either a single
EOL or two consecutive EOLs. e system may also exploit HTML
or XML markup if present. Certain paerns that may appear aer
a boundary are treated as parts of the preceding sentence (e.g.,
parenthesized expression).
e supervised sentence splier from OpenNLP is based on a
maximum entropy model which requires a corpus annotated with
sentence boundaries. e triggering events are “.”, “?”, and “!”. As
features the system uses information about the token containing
the potential boundary and about its immediate neighbours:
the prex
the sux
the presence of particular chars in the prex and sux
whether the candidate is an honoric or corporate desig-
nator
features of the words le and right of the candidate [13]
e unsupervised sentence splier (punkt) from NLTK does not
depend on any additional resources besides the corpus it is supposed
to segment into sentences. e leading idea behind the system is
that the chief source of wrongly predicted boundaries are periods
aer abbreviations. e system discovers abbreviations by testing
the hypothesis
P(·|w)=
0
.
99 against the corpus. Additionally,
token length (abbreviations are short) and the presence of internal
periods are taken into account. For prediction the system uses:
orthographic features
collocation heuristic (collocation is evidence against split)
frequent sentence starter heuristic (split aer abbreviation)
[6]
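Because punkt learns from raw text, it can be retrained on court decisions with a few lines of NLTK code (this is what Section 6.5 does with cyber-train); the sketch below shows the general pattern, with the file paths being placeholders.

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Train punkt on raw (unlabeled) text from the target domain.
raw_text = open("cyber_train_concatenated.txt", encoding="utf-8").read()
trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True        # learn collocations such as "v. Lim"
trainer.train(raw_text, finalize=True)

# Apply the learned parameters to a new decision.
tokenizer = PunktSentenceTokenizer(trainer.get_params())
decision = open("decision.txt", encoding="utf-8").read()
for start, end in tokenizer.span_tokenize(decision):
    print(decision[start:end])            # one predicted sentence per span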
The results of applying the three SBD systems to the SBD data set described in Section 6.2 are summarized in Table 5. The results clearly show that the performance of the general SBD systems is drastically lower when compared to the performance on news article data sets. It is also much below the reported performance on user generated web content. A certain portion of this gap could be explained by the particular definition of the SBD task we adopt (i.e., the aggressive segmentation). The remaining portion is due to the decisions being particularly challenging for SBD.
The most common source of errors is wrongly predicted sentence boundaries in citations, as shown in the example:
see United States v. X-Citement Video, Inc., 513 U.S. 64, 76-78, 115 S. Ct. 464, 130 L. Ed.& 2d 372 (1994)
Table 5: Vanilla SBD systems performance
cyber-test scotus-test ip-test
P R F1 P R F1 P R F1
CoreNLP
strict-B .81 .78 .79 .87 .76 .82 .75 .79 .77
lenient-B .82 .78 .80 .88 .77 .82 .75 .79 .77
strict-S .56 .54 .55 .70 .60 .64 .50 .53 .52
lenient-S .57 .55 .56 .70 .61 .65 .51 .54 .52
openNLP
strict-B .88 .77 .82 .84 .74 .78 .79 .78 .79
lenient-B .88 .77 .82 .84 .74 .78 .80 .79 .79
strict-S .64 .56 .60 .65 .57 .61 .57 .55 .56
lenient-S .64 .56 .60 .66 .58 .61 .57 .56 .56
punkt
strict-B .72 .79 .75 .77 .72 .74 .67 .80 .73
lenient-B .72 .79 .75 .78 .73 .76 .69 .83 .75
strict-S .41 .46 .44 .55 .52 .54 .42 .50 .46
lenient-S .42 .47 .44 .56 .53 .55 .44 .53 .48
As in the previous example, the predicted boundary is marked with &. This type of error is very serious because it causes broken sentences to be passed for further processing within the pipeline. These sentences may eventually even show up in the output presented to a user (e.g., in a summary).
Another commonly occurring type of error is a missed boundary following a unit that lacks a triggering event:
(1) Deliberate avoidance is not a standard less than knowledge;§ it is simply another way that knowledge may be proven.
(2) B. Response to Jury Question§
(3) Kolender v. Lawson, 461 U.S. 352, 357, 103 S. Ct. 1855, 75 L. Ed. 2d 903 (1983);§ United States v. Lim, 444 F.3d 910, 915 (7th Cir.2006)
As in the previous example, the true boundary is marked with §. This type of error is partly caused by our specific definition of SBD (Example 1). Strictly speaking, the missed boundary in Example 1 does not necessarily have to be considered an error. And indeed, in the traditional definition of SBD it would not be. Example 2 (a missed boundary after a heading) and Example 3 (a missed boundary between two citations) are certainly errors. This type of mistake is less serious than the previous one. It may still negatively affect the performance of the processing pipeline but it does not introduce broken sentences that may eventually make it to the output for a user.
6.5 Performance of Trained SBD Systems
OpenNLP and punkt may be trained on a custom data set, which is encouraged. It can be expected that such training will improve the performance of these two systems. We use the data from cyber-train with labeled sentence boundaries to train both openNLP and punkt. It should be noted that punkt is an unsupervised system and as such it does not use the labels in its training. Therefore, training punkt is very cheap and one could use a training set of much greater size.
Detecting Dierent Types of Content in US Court Decisions ASAIL 2017, June 2017, London, Great Britain
Table 6: Trained SBD systems performance
cyber-test scotus-test ip-test
P R F1 P R F1 P R F1
CoreNLP++
strict-B .81 .90 .85 .86 .93 .89 .75 .88 .81
lenient-B .81 .91 .86 .86 .94 .90 .75 .88 .81
strict-S .62 .70 .66 .71 .77 .74 .52 .62 .57
lenient-S .63 .70 .66 .72 .78 .75 .53 .62 .57
openNLP++
strict-B .93 .79 .85 .91 .76 .83 .94 .81 .87
lenient-B .93 .79 .85 .91 .76 .83 .94 .82 .87
strict-S .71 .59 .64 .74 .61 .67 .72 .62 .67
lenient-S .71 .60 .65 .74 .62 .68 .73 .62 .67
punkt++
strict-B .77 .79 .78 .82 .71 .77 .76 .80 .78
lenient-B .77 .79 .78 .84 .73 .78 .79 .83 .81
strict-S .47 .49 .48 .61 .53 .57 .52 .55 .53
lenient-S .47 .49 .48 .62 .54 .58 .54 .57 .55
Indeed, we expect that if we used a larger data set to train punkt, its performance would increase beyond what we observe in our experiments. The same does not hold for openNLP, which is trained in a supervised fashion (requiring the gold labels). Training openNLP is quite expensive. If we wished to use more documents in training (increasing the performance further) we would have to manually label additional documents.
The CoreNLP SBD module is rule based and therefore it is not possible to train a custom model. To approximate training one could use its configuration options and tune the system to perform well on our data set. We configured the CoreNLP SBD module to perform well on the cyber-train data set and we evaluated it alongside openNLP and punkt on the cyber-test, scotus-test, and ip-test data sets. It should be emphasized that this kind of comparison is very problematic and as such it should be taken with a grain of salt. Specifically, the authors were familiar with the documents, even those from the test sets. In light of this familiarity it appears that CoreNLP could have an unfair advantage in this experiment.
The performance of the trained (or configured) systems is summarized in Table 6. We observe that all the systems perform better when compared to the vanilla versions. The performance of some of the systems on some of the data sets is getting close to the low nineties (similar to the performance of the models on user-generated web content). For example, CoreNLP++ works very well on the scotus-test data set and openNLP++ has reasonable performance on the ip-test data set.
Even though the performance of CoreNLP improved, the wrongly predicted boundaries in citations remain a problem. Below are two examples of boundaries that were incorrectly predicted by CoreNLP++:
(1) Entick v. Carrington, 95 Eng.& Rep. 807 (C. P. 1765)
(2) 451 F. Supp.& 2d 71, 88 (2006).
A new class of errors appeared as well. By allowing the system to predict boundaries on characters such as ";" the system sometimes commits an error which would not have occurred in its unconfigured version:
Knotts noted the "limited use which the government made of the signals from this particular beeper," 460 U. S., at 284;& and reserved the question whether "different constitutional principles may be applicable" to "dragnet-type law enforcement practices" of the type that GPS tracking made possible here, ibid.
This is manifested in a dramatic improvement of recall for CoreNLP++ whereas precision remains frozen at levels similar to the unconfigured CoreNLP SBD module.
The training dramatically improved the performance of the general openNLP SBD module when it comes to precision. The performance in terms of recall remained about the same. This is a better type of improvement when compared to CoreNLP++. Although some of the boundaries are missed, it is quite rare for openNLP++ to predict an incorrect boundary. Systematic errors are mostly missed boundaries such as those in the following examples:
(1) 5. The Government's Hybrid Theory§
(2) This device delivers many different types of communication: live conversations, voice mail, pages, text messages, e-mail, alarms, internet, video, photos, dialing, signaling, etc.§ The legal standard for government access depends entirely upon the type of communication involved.
In the first example the system missed a boundary because it is not associated with a triggering event (a heading). Example 2 is interesting because the system apparently learned that "etc." is an abbreviation which often does not end a sentence.
The trained punkt++ performs better than the general one. It still commits quite a lot of errors when compared to the other two trained/configured systems. One would probably need to train punkt on a considerably larger data set in order to match the performance of the other two systems. The previously identified typical errors occur:
(1) II. ANALYSIS§
(2) "[T]he district court retains broad discretion in deciding how to respond to a question propounded from the jury and . . . & the court has an obligation to dispel any confusion quickly and with concrete accuracy."
(3) United States v. Leahy, 464 F.3d 773, 796 (7th Cir.2006);§ United States v. Carrillo, 435 F.3d 767, 780 (7th Cir.2006)
The examples show missed boundaries after a heading (Example 1) and after a citation (Example 3). Example 2 shows a wrongly predicted boundary after three dots in a quotation.
6.6 Performance of Custom CRF Model
To improve the SBD performance on court decisions even further we train a custom CRF model—the same as the model described in Section 4. We use cyber-train as the training set. In addition to the 10 features described in Section 4 we add the automatically predicted labels corresponding to the following types:
(1) Sentence
(2) Incomplete Sentence
(3) Non-sentential Sequence
This means we use a two-phase SBD system based on CRF.
Table 7: Custom CRF SBD system performance
cyber-test scotus-test ip-test
P R F1 P R F1 P R F1
Custom CRF
strict-B .94 .96 .95 .90 .95 .92 .95 .94 .95
lenient-B .94 .96 .95 .91 .95 .93 .96 .95 .95
strict-S .86 .86 .86 .81 .85 .83 .86 .84 .85
lenient-S .86 .87 .86 .82 .85 .83 .86 .84 .85
In the first pass, the three CRF models attempt to label the sequence of tokens in terms of the three types mentioned above. In the second pass, a single CRF model uses the information from the first pass to predict the sentence boundaries. During the first pass we focus the system on the heterogeneity of the decisions' content. It is our assumption that by doing this we could outperform the general purpose models that do not take the heterogeneity of the content into account. In addition, the CRF has the advantage that it does not make any assumptions about the SBD task, allowing it to fit more easily with our particular definition.
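A minimal sketch of this two-pass arrangement, again using python-crfsuite and the featurize function from the Section 4 sketch, is shown below; the model file names and the way the first-pass labels are injected as features are assumptions made for illustration.

import pycrfsuite

FIRST_PASS = {"sent": "sent.crfsuite",
              "isent": "isent.crfsuite",
              "nsent": "nsent.crfsuite"}

def second_pass_features(tokens):
    # Base features as in Section 4.
    base = [featurize(tokens, i) for i in range(len(tokens))]
    # First pass: three independent CRF taggers predict BILOU labels.
    predictions = {}
    for name, path in FIRST_PASS.items():
        tagger = pycrfsuite.Tagger()
        tagger.open(path)
        predictions[name] = tagger.tag(base)      # e.g. B-SENT, I-SENT, ...
    # Second pass input: base features plus the predicted labels as features.
    feats = []
    for i, f in enumerate(base):
        f = dict(f)
        for name, labels in predictions.items():
            f[name] = labels[i]
        feats.append(f)
    return feats

# boundary_tagger = pycrfsuite.Tagger(); boundary_tagger.open("sbd.crfsuite")
# boundary_labels = boundary_tagger.tag(second_pass_features(tokens))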
The results of our CRF SBD system as applied to the cyber-test, scotus-test, and ip-test data sets are summarized in Table 7. The custom CRF system clearly outperforms both the vanilla and the trained general systems on all three data sets, suggesting that our general hypothesis holds. Indeed, it appears that focusing the system on the heterogeneity of the content results in better performance.
Although the system performs quite well, there is certainly a lot of room for improvement. Particularly annoying classes of errors are those shown in the examples below (§ stands for a true boundary and & stands for a predicted boundary):
(1) The judgment of the Court of Appeals for the D. C.& Circuit is affirmed.
(2) . . . a case we have described as a 'monument of English freedom' 'undoubtedly familiar' to 'every& American statesman' at the time the Constitution was adopted . . .
(3) . . . search is not involved and resort must be had to Katz analysis;§ but& there is no reason for rushing forward . . .
All these errors (except Example 1) are quite silly and no traditional SBD system would commit similar mistakes. This is the tax that is being paid for relaxing all the traditional SBD assumptions. We expect that these kinds of mistakes would eventually go away with a sufficiently large training set.
7 CONCLUSIONS
We have presented preliminary results from our ongoing effort to mine US court decisions for valuable knowledge. We tested the hypothesis that one of the reasons why the decisions are challenging for NLP processing is the heterogeneity of the decisions' content. We classified the content of selected decisions in terms of 10 different types (e.g., sentence, quotation, reference). On an example application of sentence boundary detection we have shown how the information about the content type improves the performance of an SBD system. Note that we do not claim to have come up with a better SBD system than the ones that were used as baselines. Quite the contrary, we are certain that our system would fail miserably if tested on traditional corpora of news articles. We simply claim that we have created a system that outperforms the general solutions on a specifically defined SBD task (US court decisions). The key advantage of the system is that it focuses on the heterogeneity of the decisions' content.
REFERENCES
[1] John Aberdeen, John Burger, David Day, Lynette Hirschman, Patricia Robinson, and Marc Vilain. 1995. MITRE: description of the Alembic system used for MUC-6. In Proceedings of the 6th conference on Message understanding. Association for Computational Linguistics, 141–155.
[2] Emile de Maat and Radboud Winkels. 2009. A next step towards automated modelling of sources of law. In Proceedings of the 12th International Conference on Artificial Intelligence and Law. ACM, 31–39.
[3] Emile De Maat, Radboud Winkels, and Tom Van Engers. 2006. Automated detection of reference structures in law. Frontiers in Artificial Intelligence and Applications (2006), 41.
[4] Felice Dell'Orletta, Simone Marchi, Simonetta Montemagni, Barbara Plank, and Giulia Venturi. 2012. The SPLeT-2012 shared task on dependency parsing of legal texts. In Proceedings of the 4th Workshop on Semantic Processing of Legal Texts.
[5] Matthias Grabmair, Kevin D. Ashley, Ran Chen, Preethi Sureshkumar, Chen Wang, Eric Nyberg, and Vern R. Walker. 2015. Introducing LUIMA: an experiment in legal conceptual retrieval of vaccine injury decisions using a UIMA type system and tools. In Proceedings of the 15th International Conference on Artificial Intelligence and Law. ACM, 69–78.
[6] Tibor Kiss and Jan Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics 32, 4 (2006), 485–525.
[7] John Lafferty, Andrew McCallum, Fernando Pereira, and others. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the eighteenth international conference on machine learning, ICML, Vol. 1. 282–289.
[8] Yang Liu, Andreas Stolcke, Elizabeth Shriberg, and Mary Harper. 2005. Using conditional random fields for sentence boundary detection in speech. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 451–458.
[9] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations). 55–60.
[10] Andrei Mikheev. 2002. Periods, capitalized words, etc. Computational Linguistics 28, 3 (2002), 289–318.
[11] Naoaki Okazaki. 2007. CRFsuite: a fast implementation of Conditional Random Fields. (2007).
[12] David D. Palmer and Marti A. Hearst. 1997. Adaptive multilingual sentence boundary disambiguation. Computational Linguistics 23, 2 (1997), 241–267.
[13] Adwait Ratnaparkhi. 1998. Maximum entropy models for natural language ambiguity resolution. Ph.D. Dissertation. University of Pennsylvania.
[14] Jonathon Read, Rebecca Dridan, Stephan Oepen, and Lars Jørgen Solberg. 2012. Sentence boundary detection: A long solved problem? COLING (Posters) 12 (2012), 985–994.
[15] Jeffrey C. Reynar and Adwait Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the fifth conference on Applied natural language processing. Association for Computational Linguistics, 16–19.
[16] Michael D. Riley. 1989. Some applications of tree-based modelling to speech and language. In Proceedings of the workshop on Speech and Natural Language. Association for Computational Linguistics, 339–352.
[17] Oanh Thi Tran, Minh Le Nguyen, and Akira Shimazu. 2013. Reference resolution in legal texts. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Law. ACM, 101–110.
[18] Oanh Thi Tran, Bach Xuan Ngo, Minh Le Nguyen, and Akira Shimazu. 2014. Automated reference resolution in legal texts. Artificial Intelligence and Law 22, 1 (2014), 29–60.
[19] M. Van Opijnen, N. Verwer, and J. Meijer. 2015. Beyond the Experiment: the eXtendable Legal Link eXtractor. In Workshop on Automated Detection, Extraction and Analysis of Semantic Information in Legal Texts, held in conjunction with the 2015 International Conference on Artificial Intelligence and Law.
[20] Adam Wyner and Wim Peters. 2011. On Rule Extraction from Regulations. In JURIX, Vol. 11. 113–122.