Validate your white matter tractography algorithms with a reappraised ISMRM 2015 Tractography Challenge scoring system

December 2022

DOI:10.21203/rs.3.rs-2411825/v1

License
CC BY 4.0

Emmanuelle Renauld (  emmanuelle.renauld@usherbrooke.ca )

Université de Sherbrooke

Antoine Théberge

Université de Sherbrooke

Laurent Petit

Université Bordeaux, CNRS, CEA, IMN, UMR 5293

Jean-Christophe Houde

Imeka Solutions Inc

Maxime Descoteaux

Université de Sherbrooke

Article

Keywords:

Posted Date: January 3rd, 2023

DOI: https://doi.org/10.21203/rs.3.rs-2411825/v1

License:   This work is licensed under a Creative Commons Attribution 4.0 International License. 

Abstract

Since 2015, research groups seek to produce the nec-plus-ultra tractography algorithms using the ISMRM 2015 Tractography Challenge as evaluation. In particular, since 2017, machine learning has made its entrance into the tractography world. The ISMRM 2015 Tractography Challenge is the most used phantom during tractography validation, although it contains limitations. We offer, here, a new Tractometer scoring system for this phantom, where segmentation of the bundles is now based on manually-defined regions of interest rather than on bundle recognition. Bundles are now more reliably segmented, offering more stable metrics with higher precision for future users. New code is available online. Scores of the initial 96 submissions to the challenge are updated. Overall, conclusions from the 2015 challenge are confirmed with the new scoring, but individual tractograms scores have changed, and the data is much improved at the bundle- and streamline-level. This work also led to the production of a ground truth tractogram with less noisy streamlines and an example of processed data, all available on the Tractometer website. This enhanced Tractometer scoring system and new data should continue to help researchers develop and evaluate the next generation of tractography techniques.

Introduction

Tractography allows the in-vivo non-invasive recovery of white-matter fiber trajectories in the brain. In this context, a good tractography algorithm builds a tractogram representing the ground truth (GT) of the brain anatomy. But such a GT still does not exist for verification of algorithmic results [1,2]. To alleviate this limitation and allow the evaluation of the tractography algorithm output quality, one typically relies on phantoms: simulated diffusion-weighted images (DWI) associated with GT tractograms [1]. The level of similarity between the tractogram and GT can be scored based on various metrics, such as false positive / false negative rates, or coverage metrics, such as overlap or overreach, amongst others [3]. Generally, they are calculated for each bundle present in the dataset rather than on the whole tractogram. A phantom must thus be associated with a scoring system of its own, including a process for bundle segmentation and metrics that quantify the quality of these bundles.

The ISMRM 2015 Tractography Challenge [4] has become the most widely used phantom for tractography validation [1]. In fact, it is nearly the only tractography dataset with human brain geometries offering a GT. The article, published in 2017, has been cited approximately 1000 times (as of December 2022). It has also provided important insights into the challenges of tractography, particularly regarding the strong presence of false positives and the poor overlap of true positives. Now, the development of new algorithms for tractography often includes a tractography validation step using this phantom.

Tractography has come a long way since its beginnings, and, generally, the most recent algorithms all achieve similar scores. Even small differences in scoring may lead to big conclusions on the choice of optimal model parameters. This is particularly true in the field of machine learning in tractography [5–9], where the validation phase often relies on final scores for fine-tuning hyper-parameters. A robust, stable scoring system of high precision is therefore important. Also, bundle-specific tractography has become increasingly investigated [10,11]; scores must therefore be reliable for every bundle individually, not only on average.

In this work, we verified the quality, precision, and robustness of the challenge data and its official bundle segmentation process. We discovered that the segmentation of the bundles led, sometimes, to poor results. When looking visually at the segmented data from tractograms submitted to the challenge in 2015, some bundles seemed recurrently poorly segmented, such as the OR and the CST (Fig. 1, Fig. 2; see below for the list of acronyms). Even scoring the GT tractogram itself led to non-perfect results, with 95% overlap, 9% overreach, and a Dice score of 92%. Segmentation was based on Recobundles [12], a bundle segmentation method based on clustering of streamlines, which is influenced by the quality of the reference bundles, relies on manually defined thresholds, and whose results depend on the ordering sequence of bundles during the processing.

Here, we propose a more stable scoring system using carefully positioned regions of interest (ROIs). We present the consequences of the new process on the published scores of the 96 tractograms submitted during the challenge in 2015. Overall, general conclusions drawn in the original article [4] still hold: most teams recovered most bundles correctly, but with lots of false positives and a poor overlap of true positives. However, individual scores for some bundles or some teams are now strongly reappraised. In particular, CA and CP are better recovered than shown in the previous analysis, and coverage scores are more stable.

Our work also led to the production of a new ground truth tractogram with less noisy streamlines, revisions of the previously published scores, revisions of the initial code, and preparation of an example of well processed data. All updated data and scoring information (ROIs, code) are available on the Tractometer website: www.tractometer.org.

Results

A. Confirmation of the original scores. We first verified that we could reproduce the original results [4] using the updated python3 version and original data. All 2015 submissions were scored again with reviewed and updated code, with 100% reproducibility with original scores.

B. Curation of the tractogram. The quality of the GT prevented the creation of ROIs. Analysis of the GT tractogram revealed short/long, looping, and broken streamlines (Fig. 3) that we filtered. Streamline rejection was kept as small as possible to ensure good compatibility between the tractogram and the associated simulated DWI.

We found long or looping streamlines in 12 bundles (out of 25). The biggest changes included 8% rejection in the CC, 24% and 23% rejection in the two ILF, and 12% and 6% in the two OR. CC and right ILF included a substantial number of looping streamlines. CC had many half-streamlines stopping mid-line. ILF and OR were too similar to allow a good segmentation; some streamlines were rejected manually. In other bundles, less than 1% of streamlines were discarded. The final clean tractogram contains 190,065 streamlines (5% rejection).

C. Creation of an ROI-based segmentation system. The new segmentation relies on endpoint ROI masks, “all” masks, and in some cases, on other criteria such as maximum length, maximum total displacement per orientation, or “any” masks.

- Endpoint masks: head and tail of the bundle. Segmented streamlines must have one endpoint in each of the two masks. Masks were created large enough to ensure they covered most variation in streamline shape of any scored tractogram (Fig. 4). This prevented an adequate segmentation of IB, which would be defined as bundles connecting ROIs that should not be connected.

- “All” masks: bundle envelope. Streamlines must be entirely included inside the mask. This avoids wrong-path connections, where streamlines connect the right regions but with a wrong path. Again, these masks were created as large as possible to include overreaching streamlines from most submissions.

- “Any” masks: masks of mandatory passage. Streamlines must traverse it (at “any” point of the streamline).

Mask names and other criteria are included in a scoring configuration file formatted as a json file.
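The three mask rules above can be sketched as follows. This is only an illustrative sketch, not the Tractometer code: masks are represented as plain Python sets of voxel indices, streamlines are assumed to be lists of points already expressed in voxel space with a corner origin, and the function names `to_voxel` and `belongs_to_bundle` are hypothetical. Only the streamline points are tested, not the segments between them, so densely sampled streamlines are assumed.

```python
import math
from typing import Optional, Sequence, Set, Tuple

Voxel = Tuple[int, int, int]
Point = Tuple[float, float, float]


def to_voxel(point: Point) -> Voxel:
    """Voxel index of a point given in voxel space (corner origin assumed)."""
    return tuple(int(math.floor(c)) for c in point)


def belongs_to_bundle(
    streamline: Sequence[Point],
    head: Set[Voxel],
    tail: Set[Voxel],
    all_mask: Optional[Set[Voxel]] = None,
    any_mask: Optional[Set[Voxel]] = None,
) -> bool:
    """Apply the endpoint, "all" and "any" rules to one streamline."""
    voxels = [to_voxel(p) for p in streamline]
    if not voxels:
        return False
    first, last = voxels[0], voxels[-1]

    # Endpoint rule: one endpoint in each mask, in either direction.
    if not ((first in head and last in tail) or (first in tail and last in head)):
        return False

    # "All" rule: every point must stay inside the bundle envelope.
    if all_mask is not None and not all(v in all_mask for v in voxels):
        return False

    # "Any" rule: at least one point must cross the mandatory-passage mask.
    if any_mask is not None and not any(v in any_mask for v in voxels):
        return False

    return True
```

With this representation, a streamline that connects the right endpoint regions through a wrong path is accepted by the endpoint rule alone but rejected as soon as an “all” mask is supplied.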

We verified the quality of ROIs by scoring the new curated GT data. We obtained 100% OL and 0% ORgt for all bundles, as expected. When scoring the initial (non-curated) tractogram, mean OL was also 100%, with a 1% overreach, showing that modifications during curation were kept minimal. Running the new scoring system on all 96 submissions took 2h57m, vs 8h57m using the initial Recobundles-based system.
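As an illustration of the coverage metrics used in this check, the sketch below computes overlap, overreach and Dice from two voxel sets. It assumes the usual voxel-based definitions, with both OL and ORgt normalized by the GT volume (which is how we read the "gt" suffix); the function name `coverage_scores` is hypothetical and this is not the Tractometer implementation.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def coverage_scores(bundle: Set[Voxel], gt: Set[Voxel]) -> Tuple[float, float, float]:
    """Return (OL, ORgt, Dice) for a recovered bundle mask and a GT mask.

    Assumed definitions:
      OL   = |bundle AND gt| / |gt|
      ORgt = |bundle NOT in gt| / |gt|
      Dice = 2 |bundle AND gt| / (|bundle| + |gt|)
    """
    if not gt:
        raise ValueError("The GT mask must contain at least one voxel.")
    common = len(bundle & gt)
    overlap = common / len(gt)
    overreach = len(bundle - gt) / len(gt)
    dice = 2 * common / (len(bundle) + len(gt))
    return overlap, overreach, dice
```

Scoring a mask against itself returns an overlap of 1, an overreach of 0 and a Dice score of 1, which is the behaviour expected from the curated GT.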

D. Influence of the bundle masks on previous scores. To compare new and initial scores, we ensured that the two sets of results were indeed comparable. We noted that differences in results could be influenced by the difference in computation of the GT masks, which are called bundle masks in the original scoring data. Our new scoring was thus compared to the 2015 Recobundles system but with new bundle masks, computed with the recent definition [13]. We verified the influence of this change on the original results. Updated bundle masks led to a decrease in both OL and ORgt (see Table 1), but to nearly unchanged f1 scores (p-value > 0.1). To allow comparison, these results were computed over 21 bundles using the mean value of FPT/POPT/CST.

E. Influence of the new scoring system on scores. Visually, new scoring of the initial 2015 submissions led to better segmentation (see Fig. 2). On average, Dice scores were significantly different (p < 0.001) (see Table 2), but with an average change of only 2%, offering similar rankings (average absolute difference: 2 positions out of 96), thus leading to similar conclusions as in the original analysis. However, some bundles showed major differences (see Table 3) in scores and in ranking. The detailed score tables for each team and each bundle are provided on the website.
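The average absolute difference in ranking mentioned above can be computed as in the following sketch. The submission names and scores used with it are placeholders, and the function names `ranks` and `mean_abs_rank_change` are hypothetical; ties are broken by submission name, which is an arbitrary choice made only to keep the example deterministic.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank submissions by decreasing score (1 = best)."""
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return {name: position + 1 for position, name in enumerate(ordered)}


def mean_abs_rank_change(old: Dict[str, float], new: Dict[str, float]) -> float:
    """Average absolute change in rank between two scoring systems."""
    if old.keys() != new.keys():
        raise ValueError("Both score tables must list the same submissions.")
    old_ranks, new_ranks = ranks(old), ranks(new)
    return sum(abs(old_ranks[n] - new_ranks[n]) for n in old) / len(old)
```

For instance, if two of three submissions swap places between the two scoring systems, the mean absolute rank change is 2/3 of a position.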

- VB: As seen in Table 3, CP and CA were recovered more often than estimated in the original analysis. They still are the two most difficult bundles to recover but to a lesser extent.

- VS: The biggest change in VS is seen in the CC, partly because it is by far the biggest bundle. When observing the VS in raw numbers rather than as percentages, a comparison between the two scoring systems reveals drastic changes, as seen in Table 3.

- Bundle coverage: Fig. 5 (top section) compares the bundle dispersion in OL and ORgt between the two scoring systems. Main changes are reported in Table 3. Overall, the f1 score was improved, particularly for the two BPS bundles and left OR, bundles for which modifications have been brought in the GT data. The bottom section in Fig. 5 compares the submissions dispersion for these metrics. Overall, previous conclusions still hold: probabilistic tracking may help generate the highest OL, but with the highest ORgt. Submissions 9.1 and 9.2 (best OL) only obtain Dice scores of 45% and 46%, placing them in 43rd and 38th rank. When relying on the Dice score for a final ranking of the submissions, the biggest variations included an upgrade of 9 places for submission 17.0 and a drop of 13 places for submission 1.4. The top 8 submissions stayed the same but in a different order, as did the bottom 8 submissions.

Table 1. Impact of updated bundle masks on the scoring, using the original Tractometer scoring system.

| Mean | Original 2015 scores (old bundle masks) | Updated 2015 scores (new bundle masks) |
|---|---|---|
| OL (%) | 35.6 ± 16.5 [1.1 to 76.6] | 34.7 ± 16.2 [1.1 to 75.4] |
| ORgt (%) | 29.0 ± 25.9 [1.0 to 152.5] | 25.5 ± 23.3 [0.9 to 137.7] |
| Dice / f1 (%) | 37.8 ± 12.6 [2.0 to 56.1] | 37.8 ± 12.8 [2.0 to 58.0] |



Table 2. Effect of the new segmentation on average scores. Nb: Number of submissions that recovered the bundle.

| Mean | Updated 2015 scores (21 bundles) | New scores |
|---|---|---|
| VB | 18.0 ± 2.7 [5 to 20] | 18.5 ± 2.3 [9 to 21] |
| Nb | 82.1 ± 25.4 [2 to 96] | 84.5 ± 20.8 [22 to 96] |
| VS (%) | 53.6 ± 23.5 [3.7 to 92.5] | 52.5 ± 22.1 [4.3 to 88.6] |
| OL (%) | 35.7 ± 16.0 [1.3 to 74.3] | 37.8 ± 16.4 [1.8 to 80.0] |
| ORgt (%) | 26.7 ± 23.7 [1.1 to 141.4] | 29.1 ± 26.7 [2.4 to 161.1] |
| Dice / f1 (%) | 38.4 ± 12.1 [2.4 to 54.9] | 40.7 ± 12.2 [3.1 to 57.9] |

F. Usage on new data. We successfully used the Tractoflow pipeline [14] with the noisy data using both PFT-tracking and local-tracking to obtain two full tractograms that were scored with the new system. The PFT version led to the best Dice score (64%; the previous best was 58%), with an average OL and ORgt of 76% and 60%. The local tracking version, which used a dilated white matter (WM) mask, obtained the best overlap (91%; the previous best was 80%), but with more ORgt, explaining its lower, yet high, Dice score (57%).
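The link between these three numbers can be made explicit with a short derivation. It assumes that OL and ORgt are both voxel counts normalized by the GT volume, with $G$ the set of GT voxels and $B$ the set of voxels of the recovered bundle; the identity holds exactly for a single bundle and only approximately for values averaged over bundles.

```latex
\mathrm{OL} = \frac{|B \cap G|}{|G|}, \qquad
\mathrm{OR_{gt}} = \frac{|B \setminus G|}{|G|}, \qquad
|B| = |B \cap G| + |B \setminus G| = (\mathrm{OL} + \mathrm{OR_{gt}})\,|G|

\mathrm{Dice} = \frac{2\,|B \cap G|}{|B| + |G|}
             = \frac{2\,\mathrm{OL}}{1 + \mathrm{OL} + \mathrm{OR_{gt}}}

\text{With } \mathrm{OL} = 0.76,\ \mathrm{OR_{gt}} = 0.60: \quad
\mathrm{Dice} \approx \frac{1.52}{2.36} \approx 0.64
```

Under these assumptions, the averaged OL and ORgt of the PFT version are consistent with its reported 64% Dice score, and the relation shows why a higher overlap obtained at the cost of much more overreach can still lower the Dice score.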

Discussion

We have developed an enhanced Tractometer scoring system for the ISMRM 2015 Tractography Challenge data. It uses carefully determined regions of interest. It offers more reliable results because the segmentation now depends only on the quality of the ROIs. It does not depend on other aspects that were important in the Recobundles segmentation, such as the ordering of the bundles, quality of the reference tractogram, and threshold values for the mean direct-flip (MDF) metric [15]. Recovered bundles could hide streamlines with noisy shapes because the GT data itself contained noisy shapes. In short, our new segmentation was strict enough to prevent the inclusion of noisy streamlines but flexible enough to allow scoring submissions of varied streamline lengths, curvature, fanning, and tracking mask.

Overall, the new segmentation offers similar rankings as before when using averaged values over all bundles and all teams, but scores for some bundles were strongly modified, and the final ordering of the teams based on Dice scores varied.

Table 3. Effect of the new scoring: some of the main changes in specific bundles (average over teams). L/R = left / right.

| Mean | Bundle | Tractometer 2015 (21 bundles) | Tractometer 2022 | Difference |
|---|---|---|---|---|
| Nb submissions recovering the bundle | CP | | | +23 |
| | CA | | | +10 |
| | SCP L/R | 86 / 83 | 88 / 88 | +2 / +5 |
| | Others | | | Differences in less than 4 submissions |
| VS (total number of streamlines recovered amongst all teams) | CP | | 172 | +8500% |
| | CA | | 2,011 | +3400% |
| | SCP L/R | 38,193 / 23,607 | 59,109 / 36,171 | +55% / +53% |
| | Cg L/R | 278,422 / 238,027 | 374,647 / 375,725 | +35% / +58% |
| | BPS L/R | 322,645 / 520,016 | 437,459 / 636,523 | +36% / +22% |
| | OR L | 49,883 | 65,161 | +31% |
| | Others | | | Less than 20% variation |
| OL (%) | BPS L/R | 28.8 / 29.7 | 37.1 / 39.4 | +8.3 / +9.7 |
| | OR L | 21.4 | 30.6 | +9.1 |
| | SCP L/R | 33.9 / 27.9 | 40.0 / 33.2 | +6 / +5 |
| | Others | | | Less than 5% variation |
| ORgt (%) | SCP L/R | 26.1 / 18.8 | 44.3 / 31.2 | +18.2 / +12.4 |
| | SLF L/R | 50.4 / 57.3 | 49.0 / 47.5 | -5.0 / -9.8 |
| | ICP L/R | 37.8 / 25.1 | 45.5 / 30.5 | +7.7 / +5.4 |
| | CA | 0.7 | 7.7 | +6.9 |
| | ILF R | 41.8 | 54.0 | +6.2 |
| | Others | | | Less than 5% variation |
| Dice / f1 (%) | BPS L/R | 34 / 36 | 44 / 47 | +10 / +11 |
| | OR L | | | +12 |
| | CA | | | |
| | Others | | | Less than a 3% variation |

Verification of the original code. No error was found in the original code. Importantly, however, the management of tractogram formats and headers has evolved significantly since 2015. Users should verify that their tractograms are correctly interpreted when using the updated python3 code.

Verification of the original scores. The scores published in the 2017 article were good, but the detailed scores published on the website contained errors which are now corrected. Please also note that some wrong numbers tend to be relayed amongst publications citing the ISMRM challenge results. We urge readers to rely on the up-to-date scores currently published on the official website.

We also modified the terminology of the metrics to avoid confusion:

1) VS/IS: In the original analysis, the term “connection” was used in the terms valid/invalid/no connections (VC, IC, NC). However, VC was defined as the number of streamlines belonging to a valid bundle and could actually include broken or prematurely stopped streamlines that do not reach any gray matter region as long as they were classified as belonging to the bundle by the chosen segmentation process. The word may encourage wrong interpretation of the results, suggesting they can allow connectivity analysis between brain regions. We renamed VC as VS (valid streamlines). We regrouped IC and NC under the term IS (invalid streamlines).

2) IB: Segmenting invalid streamlines into invalid bundles gives insights on the erroneous streamlines typically produced, but their number (IB) may be misleading as a scoring metric because it depends on the definition of these bundles. Recobundles segmentation of spurious streamlines with varied shapes and distribution offers scores that are difficult to interpret. This score should only be used with great care. IB scores are therefore no longer used in our work.

Curation of the data. Curation of the data was kept as minimal as possible. We removed streamlines that prevented the creation of a good scoring system. Because of this work, the new GT now corresponds less perfectly with the associated DWI. Creating a new simulated DWI with Fiberfox [16] would be possible, but future work using this new data could not be compared with the scores presented here from teams who participated in the challenge.

One note to the reader should be made here. The phantom was created with knowledge available at the time. Although the bundles have names that correspond to known anatomical tracts, users should keep in mind that they might not present exact characteristics and features compared to the real tracts [17]. These bundles should be used as phantom parts, not as anatomical references. Here is a short list of differences that were noticed between the GT bundles and known anatomical landmarks.

- CC: The corpus callosum is known to contain a majority of homotopic connections [18]. Heterotopic connections do exist, but are less documented [19]. Many heterotopic connections are found in this GT (e.g., ventral-striatal).

- Cg: The Cg consists of 5 sub-bundles [20]. The GT bundle lacks the posterior part (named CB-V in the paper).

- ICP: This bundle should end in the brainstem, but the GT bundle contains two sub-bundles; one is anatomically correct but the other, looping back into the cerebellar cortex, does not correspond to any known path in the human anatomy.

- OR: The current bundle would be better named as thalamo-occipital connections. The OR is typically defined as the streamlines from the peri-calcarine fissure to the thalamus [21], but in this GT, the bundle extends to a larger section of the occipital lobe. Note also that the Meyer's loop [22] is absent from the current GT.

- ILF: The ILF should reach the anterior temporal lobe [23]. However, in the initial version of the phantom, it reached a larger region, extending posteriorly close to the (expected) Meyer’s loop region. This was modified in the new curated data and therefore the ILF is now more anatomically reliable.

- UF: As of 2018 [24], the uncinate fasciculus is now considered with a larger fanning both anteriorly in the frontal cortex and posteriorly in the temporal cortex.

- CST / FPT / POPT: These three bundles appear intricate, but should be more different. The cortical terminations of the CST should be constrained to the precentral and postcentral gyri [25]. Both FPT and POPT should end in the pons, but the bundles go further down, nearly to the medulla (see Fig. 1).

Due to these differences, the ROIs defined here do not represent perfect anatomical features either, but are only the necessary tool to segment bundles before the scoring.

Creating new bundles with better anatomical features would require developing new simulated DWI data, i.e., a new phantom which, as stated above, was not the objective of this work. We encourage the community to produce new and varied phantoms as there is a lack of validation data in the field of tractography. However, here, the goal was essentially to improve the existing one and, in particular, to allow the machine learning community to adequately compare their results with previous state-of-the-art tractography tools. We present in a section below conclusions and suggestions drawn from our analysis for readers interested in proposing a new phantom.

Preparation of the new segmentation technique. To allow for a good bundle segmentation in the submitted data of most teams, the endpoint ROIs had to be created very large, sometimes up to a 16-pass dilation of the GT bundles’ endpoint ROIs, and up to an 11-pass dilation of the bundles’ “all” masks. This could reveal that the stopping criterion was not well defined in many processing pipelines. It generally depends on a WM mask, which may come either from a thresholded FA map (typically ~0.1 to 0.2) or from a segmentation from the T1. In the first case, the simulated DWI may have acted differently than usual and provided FA values that would require a different threshold. In the second case, the T1 is also simulated. Segmentation algorithms were not created to deal with “fake” images and may have resulted in WM masks of lesser quality. We consider that the goal of this challenge was to evaluate the ability of tractography algorithms to understand diffusion information and to follow diffusion anisotropy information through challenging paths such as fiber crossing and bottlenecks. We have decided not to penalize submissions with streamlines going further than expected. For instance, some submissions had streamlines from the OR going out of the thalamus without stopping, or streamlines from the Fornix looping very far off the mamillary bodies, or even streamlines going out of the brain. Our ROIs thus spill out of realistic anatomical regions in an attempt to include the biggest part of every submission’s bundles. We can still segment bundles correctly by combining the endpoint ROIs with the “all” masks.

Analysis of the score differences. Compared to the initial analysis [4], it is still true that teams were able to recover most bundles. It is also still true that, on average, only half of the streamlines in the submitted tractograms are valid streamlines. Finally, we still find that probabilistic tracking may help generate the highest OL, but with the highest ORgt when compared to deterministic tracking, resulting in small changes on the Dice score.

VB:CA and CP are still the two most dicult bundles to reconstruct, but although they are still a well-

dened category inFig5, it is to a lesser extent. Using Recobundles, CP was scored after CC; these

streamlines were often associated to the CC and thus ignored when segmenting the CP. Other changes in

recovered bundles are explained by the fact that newly found bundles generally contained only a few very

small streamlines that may be harder to compare with reference streamlines using the MDF metric (in

Recobundles). The hard-to-track and medium-diculty bundles (Fig5) are now less separate categories.

IB: Invalid bundles can no longer be scored due to the large size of the ROIs. It would be possible to add an analysis step and segment the invalid streamlines (IS) into invalid bundles (IB) using Quickbundles, similarly as before. We chose not to include this here as it is prone to the same instability as Recobundles that we so firmly seek to avoid. The number of invalid bundles obtained with Quickbundles depends strongly on the type of invalid streamlines. Even a few misplaced streamlines may lead to a rapid increase in IB, which should not be used to infer the quality of the scored tractogram. We do recognize that the IB analysis was useful in the original article to visualize the typical errors recovered recurrently over multiple submissions, but the IB score itself should be used carefully.

VS/IS: Often, the additional recovered streamlines were of very poor quality, and other metrics were not improved much. The total percentage of VS, averaged over all teams and all bundles, varied by less than 1%. Yet, it represents an average of 1000 streamlines per submission. In the future, with algorithms becoming ever better and researchers trying to push the limits of tractography, these small differences in scoring could impact researcher choices in implementation.

Bundle coverage: Despite the big changes in the total number of recovered streamlines in individual bundles throughout the 96 submitted tractograms, general scoring metrics stayed similar, but ranking amongst teams was modified.

Suggestions for the creation of a new phantom. The final comparison of “winners” based on the Dice score, either in the original analysis or here, did not allow a clear definition of the best tractography parameters. This can be explained by the large influence of preprocessing steps such as the choice of tracking space, the tracking masks, the registration quality, and so on. Future phantoms should limit these possibilities so that they specifically assess our ability to follow diffusion information in the brain, in other words, the “tracking” aspect, rather than the quality of the whole pipeline. We present here some afterthoughts.

1. The level of complexity in the challenge data was good. It presented human-like geometries with multiple bundle crossings or bottlenecks. Its number of bundles was good and allowed a scoring system.

2. The associated simulated T1 data, however, was not realistic enough to allow good results in segmentation software such as Freesurfer [26] or FSL FAST [27] for instance. We suggest that future work should include a list of potentially interesting masks, particularly a WM mask that could be used as a tracking mask.

3. The quality of individual streamlines, not only of bundles as whole entities, should be verified in the GT and during scoring.

4. Developers should specify a way that users may verify their tractogram format to prevent shifts (e.g., ±0.5 when the origin of a voxel coordinate is considered at the center or the corner of the voxel) or swapping of axes during interpretation (e.g., by specifying the orientation).

5. Developers should specify in which space the final scoring will be performed. Users applying a substandard registration between T1 and DWI spaces could be strongly disadvantaged, even if their tracking algorithm itself was perfect.

6. OL, ORgt, and Dice scores offer good insights, but the literature still lacks metrics comparing the shape of individual streamlines; this should be addressed.

Analysis of the Tractoflow-processed data. The data was processed using state-of-the-art tools and presented very good scores.

One general comment that was seen in machine learning studies was that differences in scores may arise from the choice of model, but also from the training data. Therefore, we also offer the Tractoflow-processed data in open access on the website. It could be used as common training data.

Conclusion

We proposed a new and enhanced Tractometer scoring system based on manually-defined regions of interest rather than on bundle recognition. Bundles are now more reliably segmented, offering more stable metrics with higher precision for future users of this phantom and its scoring system. We provide on the Tractometer website all tools necessary for a robust scoring of any new tractogram with our new scoring system: the ROIs and configuration files necessary to run the code, the tables of detailed results and the Tractoflow-processed data.

This should help researchers better develop and evaluate the next generation of tractography algorithms.

Methods

A. Verification of the original scores. The original code was converted to python3, proof-read and reviewed to ensure it was still compatible with today’s standards. Metrics terminology was revised. All 2015 submissions were scored again.

The original code included forced shifting (adding 0.5 values) of .trk (trackvis) files. In the updated code, tractograms are simply loaded through dipy’s load_tractogram method. No further verification is performed on the validity of space attributes.
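The effect of such a 0.5 shift can be illustrated with a small sketch. It assumes coordinates expressed in voxel space and contrasts the two common conventions (voxel origin at the center or at the corner of the voxel); the helper names are hypothetical and the code does not reproduce the Tractometer or dipy implementation.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def center_to_corner(points: List[Point]) -> List[Point]:
    """Shift voxel-space points from a center origin to a corner origin."""
    return [tuple(c + 0.5 for c in p) for p in points]


def voxel_index_center_origin(point: Point) -> Voxel:
    """Center origin: voxel i spans [i - 0.5, i + 0.5) along each axis."""
    return tuple(int(math.floor(c + 0.5)) for c in point)


def voxel_index_corner_origin(point: Point) -> Voxel:
    """Corner origin: voxel i spans [i, i + 1) along each axis."""
    return tuple(int(math.floor(c)) for c in point)
```

A point saved with a center origin but read as if it had a corner origin can land in a neighbouring voxel, which is enough to move a streamline endpoint in or out of an ROI. This is why checking how a tractogram is interpreted matters before scoring.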

B. Curation of the GT tractogram. The GT bundles were modified to allow creation of the ROIs. Streamlines with a length outside the range 20-200 mm (generally streamlines presenting looping shapes) or detected as loops using scilpy were discarded (see https://scilpy.readthedocs.io/). Others were discarded based on visual analysis of the bundles. CST, POPT, and FPT were too similar and difficult to segment adequately (Fig. 1) and were gathered into a new bundle called Brainstem Projection System (BPS). The ILF and OR were also too similar, preventing a good segmentation (Fig. 1), either with Recobundles or with ROIs. In this case, we chose to filter out some streamlines to better separate the two bundles.
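The length criterion can be sketched as below. This is a simplified illustration, assuming streamlines are lists of 3D points expressed in millimetres; the function names are hypothetical, and the loop detection and visual rejection steps described above are not covered.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def streamline_length(streamline: Sequence[Point]) -> float:
    """Sum of the Euclidean lengths of consecutive segments (in mm)."""
    return sum(math.dist(a, b) for a, b in zip(streamline, streamline[1:]))


def filter_by_length(
    streamlines: Sequence[Sequence[Point]],
    min_mm: float = 20.0,
    max_mm: float = 200.0,
) -> Tuple[List[Sequence[Point]], List[Sequence[Point]]]:
    """Split streamlines into (kept, rejected) using the 20-200 mm range."""
    kept, rejected = [], []
    for streamline in streamlines:
        if min_mm <= streamline_length(streamline) <= max_mm:
            kept.append(streamline)
        else:
            rejected.append(streamline)
    return kept, rejected
```

Returning the rejected streamlines as well makes it easy to inspect them visually and to report a rejection percentage per bundle.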

C. Creation of a ROI-based segmentation system.All of the masks were created by looking carefully at

both the GT data and the general distribution of results from the tractograms submitted to the Challenge

in 2015.

Endpoint ROIs: GT streamlines' endpoints were saved as head and tail masks. We then dilated these two masks (11-pass on average, see Fig. 4). Some endpoint ROIs were modified manually based on visual inspection of results. Examples of modifications were: dilation to reach the end of the cortex in some regions, manual dilation of the OR's ROI to include more of the thalamus without spilling into the ILF, manual separation between hemispheres, and careful separation of anterior/posterior ROIs in the case of the cingulum and of the fornix. The CC was separated into sub-bundles for segmentation purposes (CC_u_shaped, CC_ventro_striatal1, CC_ventro_striatal2, CC_temporal), allowing for a better delimitation of endpoint ROIs. However, only the total CC, composed of the re-merged sub-bundles, is used during scoring. Similarly, the ICP was segmented into ICP_part1 (similar to its anatomical definition) and ICP_part2 (looping back into the cerebellar cortex).
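The first, automatic part of this procedure (collecting endpoint voxels, then dilating) can be sketched as follows. This is a simplified illustration, not the code used for the paper: masks are represented as sets of (i, j, k) voxel indices, streamlines are assumed to be in voxel space and consistently oriented (head first), the 6-connected structuring element is an assumption, and no clipping to the image volume is done.

```python
import math

# Face neighbours only (6-connectivity); an assumption of this sketch.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
              (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def to_voxel(point):
    """Index of the voxel containing a point given in voxel coordinates."""
    return tuple(int(math.floor(c)) for c in point)


def endpoint_masks(streamlines):
    """Voxels holding the first (head) and last (tail) point of each streamline."""
    head = {to_voxel(sl[0]) for sl in streamlines}
    tail = {to_voxel(sl[-1]) for sl in streamlines}
    return head, tail


def dilate(voxels, passes=1):
    """Grow a voxel set by `passes` rounds of 6-connected dilation."""
    mask = set(voxels)
    for _ in range(passes):
        grown = set(mask)
        for (i, j, k) in mask:
            for (di, dj, dk) in NEIGHBOURS:
                grown.add((i + di, j + dj, k + dk))
        mask = grown
    return mask


streamlines = [[(0.2, 0.2, 0.2), (2.5, 0.4, 0.1), (5.7, 0.1, 0.3)]]
head, tail = endpoint_masks(streamlines)
head_roi = dilate(head, passes=2)
```

The manual corrections described above (hemisphere separation, anterior/posterior splits) have no automatic equivalent in this sketch.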

“All” masks: GT streamline paths were saved as binary masks and dilated (by default, the number of passes was 3, but some bundles required different parameters, up to 11 passes for the CC). These GT masks were combined with both endpoint ROIs for each bundle. Manual modifications were also applied, generally additional manual dilation.

“Any” masks: These were defined using manually positioned boxes of interest.
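To make the role of the three mask types concrete, here is a minimal sketch of how a single streamline could be tested against them. It assumes that an “All” mask must contain every point of the streamline and that an “Any” mask must be touched by at least one point; this reading, the function name, and the set-of-voxels representation are assumptions of the sketch, not the scilpy scoring code.

```python
import math


def to_voxel(point):
    """Index of the voxel containing a point given in voxel coordinates."""
    return tuple(int(math.floor(c)) for c in point)


def belongs_to_bundle(streamline, head, tail, all_mask, any_mask=None):
    """Hypothetical ROI-based membership test for one streamline."""
    voxels = [to_voxel(p) for p in streamline]
    first, last = voxels[0], voxels[-1]
    # Endpoints must land in the two endpoint ROIs, in either orientation.
    endpoints_ok = ((first in head and last in tail)
                    or (first in tail and last in head))
    # Every point must stay inside the "All" mask.
    inside_all = all(v in all_mask for v in voxels)
    # At least one point must touch the "Any" mask, when one is defined.
    touches_any = any_mask is None or any(v in any_mask for v in voxels)
    return endpoints_ok and inside_all and touches_any


head, tail = {(0, 0, 0)}, {(3, 0, 0)}
all_mask = {(i, 0, 0) for i in range(4)}
any_mask = {(2, 0, 0)}

good = [(0.5, 0.5, 0.5), (1.5, 0.5, 0.5), (2.5, 0.5, 0.5), (3.5, 0.5, 0.5)]
stray = [(0.5, 0.5, 0.5), (1.5, 1.5, 0.5), (3.5, 0.5, 0.5)]  # leaves all_mask

results = [belongs_to_bundle(sl, head, tail, all_mask, any_mask)
           for sl in (good, good[::-1], stray)]
```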

D. Influence of the bundle masks on scores. To allow comparison of new and old scores, the original bundle masks were computed again using a more recent method. As suggested in 2017 by Rheault et al. [13], bundle masks should not recover only the voxels containing streamline points (even after resampling), but should rather account for the whole segment between two points. We computed the new masks with scilpy. Bundles segmented using the Recobundles-based system were scored again using the same metrics but with the new GT bundle masks. Final Dice scores, averaged over all bundles, were compared to previous scores using a Student's t-test.
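The difference between a point-based mask and a segment-aware mask can be illustrated as follows. The sketch approximates the voxels crossed by each segment through dense sampling; this is a stand-in for an exact voxel traversal and is not the scilpy implementation. Streamlines are assumed to be in voxel coordinates, and `step` is a hypothetical parameter.

```python
import math


def to_voxel(point):
    """Index of the voxel containing a point given in voxel coordinates."""
    return tuple(int(math.floor(c)) for c in point)


def voxels_points_only(streamline):
    """Mask built only from the voxels that contain a streamline point."""
    return {to_voxel(p) for p in streamline}


def voxels_along_segments(streamline, step=0.1):
    """Approximate mask of all voxels crossed by the segments.

    Each segment is sampled every `step` voxel units at most; voxels that
    a segment only clips near a corner may be missed.
    """
    voxels = voxels_points_only(streamline)
    for a, b in zip(streamline, streamline[1:]):
        n = max(1, math.ceil(math.dist(a, b) / step))
        for s in range(n + 1):
            t = s / n
            voxels.add(to_voxel(tuple(ac + t * (bc - ac)
                                      for ac, bc in zip(a, b))))
    return voxels


# A single long segment: its two points skip the voxels in between.
sl = [(0.5, 0.5, 0.5), (3.5, 0.5, 0.5)]
point_mask = voxels_points_only(sl)
segment_mask = voxels_along_segments(sl)
```

With sparse points, the point-based mask has holes along the path, which lowers overlap scores for reasons unrelated to tracking quality.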

E. Influence of the new scoring system on scores. Newly segmented bundles of the 96 submissions were scored using the same metrics as before. Again, final Dice scores, averaged over all bundles, were compared to previous scores using a Student's t-test.
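As a rough illustration of such a comparison, the sketch below computes a paired t statistic on per-submission Dice scores. The paired form is an assumption (the text does not state which variant of the test was used), and the score values are toy numbers chosen for the example, not results from this study.

```python
import math
import statistics


def paired_t_statistic(new_scores, old_scores):
    """Return (t, degrees of freedom) for a paired Student's t-test."""
    diffs = [n - o for n, o in zip(new_scores, old_scores)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Toy Dice scores for four hypothetical submissions (illustration only).
old = [0.50, 0.60, 0.55, 0.65]
new = [0.55, 0.70, 0.60, 0.75]
t_value, dof = paired_t_statistic(new, old)
```

The p-value would then be read from a t distribution with `dof` degrees of freedom, for example with a statistics package.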

F. Usage on new data. We prepared a new tractogram to be scored using recent state-of-the-art techniques. The tractogram was prepared by running the TractoFlow pipeline [14] on the noisy DWI, using the version with an additional reversed b0 to allow topup correction. The pipeline was modified to skip the N4 denoising step on the T1 data, which produced irregular results, probably because the T1 is in fact a simulated dataset. Two tracking algorithms were tested: first, PFT tracking on WM maps; second, local tracking on a WM mask that was first modified to pass a visual quality check: it was eroded (1-pass) and dilated again (2-pass). Both versions were scored using the new system.
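The mask clean-up used before local tracking (one erosion pass followed by two dilation passes) can be sketched as below. The 6-connected neighbourhood and the set-of-voxels representation are assumptions of this sketch; the actual tool and structuring element used for the paper are not specified here.

```python
# Face neighbours only (6-connectivity); an assumption of this sketch.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
              (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def dilate(mask):
    """One pass of 6-connected binary dilation on a set of voxels."""
    return mask | {(i + di, j + dj, k + dk)
                   for (i, j, k) in mask for (di, dj, dk) in NEIGHBOURS}


def erode(mask):
    """One pass of 6-connected binary erosion on a set of voxels."""
    return {(i, j, k) for (i, j, k) in mask
            if all((i + di, j + dj, k + dk) in mask
                   for (di, dj, dk) in NEIGHBOURS)}


def clean_wm_mask(mask):
    """Erode once, then dilate twice."""
    out = erode(mask)
    for _ in range(2):
        out = dilate(out)
    return out


# Toy mask: a 3x3x3 cube plus one isolated voxel standing in for noise.
cube = {(i, j, k) for i in range(3) for j in range(3) for k in range(3)}
cleaned = clean_wm_mask(cube | {(10, 10, 10)})
```

The erosion removes isolated voxels and thin protrusions; the extra dilation pass leaves the surviving regions slightly larger than before.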

Abbreviations

List of acronyms for bundles

BPS: Brainstem Projection System, CA: Anterior commissure, CC: Corpus callosum, Cg: Cingulum, CP:

Posterior commissure, CST: Cortico-spinal tract, Fornix, FPT: Fronto-pontine tract, ICP: Inferior cerebellar

peduncle, ILF: Inferior longitudinal fasciculus, MCP: Middle cerebellar peduncle, OR: Optic radiation,

POPT: Parieto-occipital pontine tract, SCP: Superior cerebellar peduncle, SLF: Superior longitudinal

fasciculus, UF: Uncinate fasciculus

List of acronyms for metrics

OL: Overlap (percentage of GT voxels recovered), ORgt: Overreach (number of false positive voxels, normalized by the volume of the GT bundle), f1: Equivalent to the Dice score, VB: Valid bundles (number of recovered bundles), VS: Valid streamlines (number of streamlines in these VB), IS: Invalid streamlines (number of remaining streamlines).
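The voxel-based metrics follow directly from these definitions. A minimal sketch, assuming bundles and GT masks are given as sets of voxel indices (the function name is hypothetical, and values are returned as fractions rather than percentages):

```python
def bundle_metrics(bundle_voxels, gt_voxels):
    """Compute OL, ORgt and f1 (Dice) from two sets of voxels."""
    true_pos = len(bundle_voxels & gt_voxels)    # GT voxels recovered
    false_pos = len(bundle_voxels - gt_voxels)   # voxels outside the GT
    return {
        "OL": true_pos / len(gt_voxels),
        "ORgt": false_pos / len(gt_voxels),
        "f1": 2 * true_pos / (len(bundle_voxels) + len(gt_voxels)),
    }


gt = {(i, 0, 0) for i in range(10)}          # 10 GT voxels
bundle = {(i, 0, 0) for i in range(2, 14)}   # 12 voxels: 8 inside, 4 outside
metrics = bundle_metrics(bundle, gt)
```

Because ORgt is normalized by the GT volume rather than by the recovered volume, it can exceed 1 for bundles that spill far beyond the GT.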



Declarations

Data and code availability

The datasets generated during and/or analysed during the current study are available on the Tractometer

website: www.tractometer.org.

Acknowledgement

The authors are grateful to the Fonds de recherche du Québec - Nature et technologies (FRQNT) and the

Natural Sciences and Engineering Research Council of Canada (NSERC) programs for funding this

research.

Author contributions

ER and AT proofread the original code and prepared the scripts in scilpy for the new scoring. They also verified the format of submitted tractograms and the scoring. ER prepared the ROIs and other necessary masks for the new segmentation process and compared scores between versions. ER wrote the manuscript, and MD, LP and AT provided feedback. JCH was the project leader in the previous version and answered our questions concerning the original code and data.

Competing interests

The author(s) declare no competing interests.

References

1. Drobnjak, I., Neher, P., Poupon, C. & Sarwar, T. Physical and digital phantoms for validating tractography and assessing artifacts. NeuroImage 245, (2021).
2. Rheault, F., Poulin, P., Valcourt Caron, A., St-Onge, E. & Descoteaux, M. Common misconceptions, hidden biases and modern challenges of dMRI tractography. J. Neural Eng. 17, (2020).
3. Côté, M. A. et al. Tractometer: Towards validation of tractography pipelines. Med. Image Anal. 17, 844–857 (2013).
4. Maier-Hein, K. H. et al. The challenge of mapping the human connectome based on diffusion tractography. Nat. Commun. 8, (2017).
5. Neher, P., Côté, M.-A., Houde, J.-C., Descoteaux, M. & Maier-Hein, K. Fiber tractography using machine learning. NeuroImage 158, 417–429 (2017).
6. Benou, I. & Riklin Raviv, T. DeepTract: A probabilistic deep learning framework for white matter fiber tractography. Lect. Notes Comput. Sci. Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinforma. 11766 LNCS, 626–635 (2019).
7. Poulin, P. et al. Learn to track: Deep learning for tractography. Lect. Notes Comput. Sci. Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinforma. 10433 LNCS, 540–547 (2017).
8. Wegmayr, V. & Buhmann, J. M. Entrack: Probabilistic spherical regression with entropy regularization for fiber tractography. Int. J. Comput. Vis. 129, 656–680 (2020).
9. Théberge, A., Descoteaux, M., Desrosiers, C. & Jodoin, P. M. Track-to-learn: A general framework for tractography with deep reinforcement learning. Med. Image Anal. 102093 (2021) doi:10.1101/2020.11.16.385229.
10. Rheault, F. et al. Bundle-specific tractography. in 129–139 (2018). doi:10.1007/978-3-319-73839-0_10.
11. Wasserthal, J., Neher, P. & Maier-Hein, K. H. TractSeg - Fast and accurate white matter tract segmentation. NeuroImage 183, 239–253 (2018).
12. Garyfallidis, E. et al. Recognition of white matter bundles using local and global streamline-based registration and clustering. NeuroImage 170, 283–295 (2018).
13. Rheault, F., Houde, J.-C. & Descoteaux, M. Visualization, interaction and tractometry: dealing with millions of streamlines from diffusion MRI tractography. Front. Neuroinformatics 11, (2017).
14. Theaud, G., Houde, J., Bor, A., Morency, F. & Descoteaux, M. TractoFlow: A robust, efficient and reproducible diffusion MRI pipeline leveraging Nextflow & Singularity. NeuroImage 218, (2020).
15. Garyfallidis, E., Brett, M., Correia, M. M., Williams, G. B. & Nimmo-Smith, I. QuickBundles, a method for tractography simplification. Front. Neurosci. 6, (2012).
16. Neher, P. F., Laun, F. B., Stieltjes, B. & Maier-Hein, K. H. Fiberfox: Facilitating the creation of realistic white matter software phantoms. Magn. Reson. Med. 72, 1460–1470 (2014).
17. Bullock, D. N. et al. A taxonomy of the brain’s white matter: twenty-one major tracts for the 21st century. Cereb. Cortex (2022) doi:10.1093/cercor/bhab500.
18. Francisco, A. & Montiel, J. One hundred million years of interhemispheric communication: the history of the corpus callosum. Brazilian Journal of Medical and Biological Research 409–420 (2003).
19. De Benedictis, A. et al. New insights in the homotopic and heterotopic connectivity of the frontal portion of the human corpus callosum revealed by microdissection and diffusion tractography. Hum. Brain Mapp. 37, 4718–4735 (2016).
20. Wu, Y., Sun, D., Wang, Y., Wang, Y. & Ou, S. Segmentation of the cingulum bundle in the human brain: A new perspective based on DSI tractography and fiber dissection study. Front. Neuroanat. 10, (2016).
21. Sarubbo, S. et al. The course and the anatomo-functional relationships of the optic radiation: a combined study with ‘post mortem’ dissections and ‘in vivo’ direct electrical mapping. J. Anat. 226, 47–59 (2015).
22. Falconer, M. A. & Wilson, J. L. Visual field changes following anterior temporal lobectomy: Their significance in relation to ‘Meyer’s loop’ of the optic radiation. Brain 81, part 1, (1958).
23. Panesar, S. S., Yeh, F.-C., Jacquesson, T., Hula, W. & Fernandez-Miranda, J. C. A quantitative tractography study into the connectivity, segmentation and laterality of the human inferior longitudinal fasciculus. Front. Neuroanat. 12, (2018).
24. Hau, J. et al. Revisiting the human uncinate fasciculus, its subcomponents and asymmetries with stem-based tractography and microdissection validation. Brain Struct. Funct. 222, 1645–1662 (2017).
25. Chenot, Q. et al. A population-based atlas of the human pyramidal tract in 410 healthy participants. Brain Struct. Funct. 224, 599–612 (2019).
26. Dale, A. M., Fischl, B. & Sereno, M. I. Cortical surface-based analysis: I. Segmentation and surface reconstruction. NeuroImage 9, 179–194 (1999).
27. Zhang, Y., Brady, M. & Smith, S. Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Trans. Med. Imaging 20, 45–57 (2001).

Figures

Figure 1

Erroneous bundle segmentation examples. A) ILF (red) and OR (blue) with B) an example of sub-optimal bundle segmentation in submission 1.3 (using Recobundles). C) FPT (pink), CST (orange), and POPT (blue), with D) streamlines recovered for these bundles from all 2015 submissions. The GT bundle mask borders are shown as a darker contour. We can see that streamlines were sometimes assigned arbitrarily to one bundle or the other, particularly in the center.

Figure 2

Recobundles led to poor results on some bundles. The top row shows the MCP in sagittal view. A) 2015’s

GT. B) Streamlines recovered by Recobundles from all submissions. They include vertical streamlines

that should not belong to the MCP. C) Streamlines recovered using our new ROI-based segmentation. D, E,

and F present similar patterns for the SLF.

Figure 3

Examples of looping fibers that were hidden in the original GT tractogram.

Figure 4

Examples of possible endpoint ROIs. A) OR, B) MCP, C and D) SLF. Various degrees of dilation were tested. Bigger ROIs such as in B and D were necessary to adequately score all submitted tractograms.

Figure 5

Overlap (OL) vs Overreach (ORgt) scores in 2022 vs 2015 (with updated masks). Best results should have high overlap (top) and low overreach (left). Top graphs: scores per bundle (averaged over all teams). Colors reflect the differences between easy (blue), average (green) and hard-to-track (pink) bundles [4]. Bottom graphs: scores per submission (averaged over all bundles). Colors reflect the algorithm choice: deterministic (blue), probabilistic (orange) or others (gray).


