Available via license: CC BY 4.0
Validate your white matter tractography algorithms
with a reappraised ISMRM 2015 Tractography
Challenge scoring system
Emmanuelle Renauld ( emmanuelle.renauld@usherbrooke.ca )
Université de Sherbrooke
Antoine Théberge
Université de Sherbrooke
Laurent Petit
Université Bordeaux, CNRS, CEA, IMN, UMR 5293
Jean-Christophe Houde
Imeka Solutions Inc
Maxime Descoteaux
Université de Sherbrooke
Article
Keywords:
Posted Date: January 3rd, 2023
DOI: https://doi.org/10.21203/rs.3.rs-2411825/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
Since 2015, research groups have sought to produce the nec plus ultra of tractography algorithms, using the ISMRM
2015 Tractography Challenge for evaluation. In particular, since 2017, machine learning has made its
entrance into the tractography world. The ISMRM 2015 Tractography Challenge is the most used
phantom for tractography validation, although it has limitations. We offer, here, a new
Tractometer scoring system for this phantom, where segmentation of the bundles is now based on
manually-defined regions of interest rather than on bundle recognition. Bundles are now more reliably
segmented, offering more stable metrics with higher precision for future users. New code is available
online. Scores of the initial 96 submissions to the challenge are updated. Overall, conclusions from the
2015 challenge are confirmed with the new scoring, but individual tractogram scores have changed, and
the data is much improved at the bundle and streamline levels. This work also led to the production of a
ground truth tractogram with less noisy streamlines and an example of processed data, all available on
the Tractometer website. This enhanced Tractometer scoring system and new data should continue to
help researchers develop and evaluate the next generation of tractography techniques.
Introduction
Tractography allows the in-vivo, non-invasive recovery of white-matter fiber trajectories in the brain. In this
context, a good tractography algorithm builds a tractogram representing the ground truth (GT) of the
brain anatomy. But such a GT still does not exist for verification of algorithmic results [1,2]. To alleviate
this limitation and allow the evaluation of the tractography algorithm output quality, one typically relies
on phantoms: simulated diffusion-weighted images (DWI) associated with GT tractograms [1]. The level
of similarity between the tractogram and GT can be scored based on various metrics, such as false
positive / false negative rates, or coverage metrics, such as overlap or overreach, amongst others [3].
Generally, they are calculated for each bundle present in the dataset rather than on the whole tractogram.
A phantom must thus be associated with a scoring system of its own, including a process for bundle
segmentation and metrics that quantify the quality of these bundles.
The ISMRM 2015 Tractography Challenge [4] has become the most widely used phantom for
tractography validation [1]. In fact, it is nearly the only tractography dataset with human brain geometries
offering a GT. The article, published in 2017, has been cited approximately 1000 times (as of December
2022). It has also provided important insights into the challenges of tractography, particularly regarding
the strong presence of false positives and the poor overlap of true positives. Now, the development of
new algorithms for tractography often includes a tractography validation step using this phantom.
Tractography has come a long way since its beginnings, and, generally, the most recent algorithms all
achieve similar scores. Even small differences in scoring may lead to big conclusions on the choice of
optimal model parameters. This is particularly true in the field of machine learning in tractography [5–9],
where the validation phase often relies on final scores for fine-tuning hyper-parameters. A robust, stable
scoring system of high precision is therefore important. Also, bundle-specific tractography has become
increasingly investigated [10,11]; scores must therefore be of good quality for every bundle individually, not only
on average.
In this work, we verified the quality, precision, and robustness of the challenge data and its official bundle
segmentation process. We discovered that the segmentation of the bundles sometimes led to poor
results. When looking visually at the segmented data from tractograms submitted to the challenge in
2015, some bundles seemed recurrently poorly segmented, such as the OR and the CST (Fig. 1, Fig. 2; see
below for the list of acronyms). Even scoring the GT tractogram itself led to non-perfect results, with 95%
overlap, 9% overreach, and a Dice score of 92%. Segmentation was based on Recobundles [12], a bundle
segmentation method based on clustering of streamlines, which is influenced by the quality of the
reference bundles, relies on manually defined thresholds, and whose results depend on the order in which
bundles are processed.
Here, we propose a more stable scoring system using carefully positioned regions of interest (ROIs). We
present the consequences of the new process on the published scores of the 96 tractograms submitted
during the challenge in 2015. Overall, general conclusions drawn in the original article [4] still hold: most
teams recovered most bundles correctly, but with lots of false positives and a poor overlap of true
positives. However, individual scores for some bundles or some teams are now strongly reappraised. In
particular, CA and CP are better recovered than shown in the previous analysis, and coverage scores are
more stable.
Our work also led to the production of a new ground truth tractogram with less noisy streamlines,
revisions of the previously published scores, revisions of the initial code, and preparation of an example
of well processed data. All updated data and scoring information (ROIs, code) are available on the
Tractometer website: www.tractometer.org.
Results
A. Confirmation of the original scores. We first verified that we could reproduce the original results [4]
using the updated python3 version and original data. All 2015 submissions were scored again with the
reviewed and updated code, with 100% reproducibility of the original scores.
B. Curation of the tractogram. The quality of the GT prevented the creation of ROIs. Analysis of the GT
tractogram revealed short/long, looping, and broken streamlines (Fig. 3), which we filtered out. Streamline
rejection was kept as small as possible to ensure good compatibility between the tractogram and the
associated simulated DWI.
We found long or looping streamlines in 12 bundles (out of 25). The biggest changes included 8%
rejection in the CC, 24% and 23% rejection in the two ILF, and 12% and 6% in the two OR. The CC and right ILF
included a substantial number of looping streamlines. The CC had many half-streamlines stopping at the midline.
The ILF and OR were too similar to allow a good segmentation; some streamlines were rejected manually. In
other bundles, less than 1% of streamlines were discarded. The final clean tractogram contains 190,065
streamlines (5% rejection).
C. Creation of an ROI-based segmentation system. The new segmentation relies on endpoint ROI masks,
“all” masks, and in some cases, on other criteria such as maximum length, maximum total displacement
per orientation, or “any” masks.
- Endpoint masks: head and tail of the bundle. Segmented streamlines must have one endpoint in each of
the two masks. Masks were created large enough to ensure they covered most variations in streamline
shape in any scored tractogram (Fig. 4). This prevented an adequate segmentation of invalid bundles (IB), which would be
defined as bundles connecting ROIs that should not be connected.
- “All” masks: bundle envelope. Streamlines must be entirely included inside the mask. This avoids wrong-path
connections, where streamlines connect the right regions but through a wrong path. Again, these masks
were created as large as possible to include overreaching streamlines from most submissions.
- “Any” masks: masks of mandatory passage. Streamlines must traverse the mask (at “any” point of the
streamline).
Mask names and other criteria are included in a scoring configuration file in JSON format.
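The way the three mask types combine can be illustrated with a minimal sketch. This is not the Tractometer code: masks are represented here as sets of integer voxel indices rather than 3D binary images, streamline points are assumed to be in voxel coordinates with the origin at the voxel corner, and the function names (`voxel_of`, `in_bundle`) are hypothetical. Only streamline points are tested, not the segments between them.

```python
from math import floor


def voxel_of(point):
    """Voxel index containing a point (voxel space, origin at the voxel corner)."""
    return tuple(int(floor(c)) for c in point)


def in_bundle(streamline, head, tail, all_mask=None, any_mask=None):
    """Return True if a streamline satisfies the ROI criteria of one bundle.

    head, tail, all_mask and any_mask are sets of voxel indices
    (hypothetical representation chosen for this sketch).
    """
    voxels = [voxel_of(p) for p in streamline]
    first, last = voxels[0], voxels[-1]
    # Endpoint masks: one endpoint in each mask, whatever the streamline direction.
    endpoints_ok = (first in head and last in tail) or (first in tail and last in head)
    if not endpoints_ok:
        return False
    # "All" mask: every point must stay inside the bundle envelope.
    if all_mask is not None and not all(v in all_mask for v in voxels):
        return False
    # "Any" mask: at least one point must lie in the mandatory passage.
    if any_mask is not None and not any(v in any_mask for v in voxels):
        return False
    return True


# Toy example: a straight path along x, and a detour that leaves the envelope.
head = {(0, 0, 0)}
tail = {(3, 0, 0)}
envelope = {(x, 0, 0) for x in range(4)}
straight = [(0.5, 0.5, 0.5), (1.5, 0.5, 0.5), (2.5, 0.5, 0.5), (3.5, 0.5, 0.5)]
detour = [(0.5, 0.5, 0.5), (1.5, 1.5, 0.5), (3.5, 0.5, 0.5)]

print(in_bundle(straight, head, tail, envelope))  # accepted
print(in_bundle(detour, head, tail, envelope))    # rejected: wrong path
```

In this toy case the detour connects the right endpoint regions and is only rejected once the “all” mask is applied, which is the wrong-path situation described above.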
We verified the quality of the ROIs by scoring the new curated GT data. We obtained 100% OL and 0% ORgt
for all bundles, as expected. When scoring the initial (non-curated) tractogram, mean OL was also 100%,
with a 1% overreach, showing that modifications during curation were kept minimal. Running the new
scoring system on all 96 submissions took 2 h 57 min, versus 8 h 57 min using the initial Recobundles-based
system.
D. Influence of the bundle masks on previous scores. To compare new and initial scores, we ensured that
the two sets of results were indeed comparable. We noted that differences in results could be influenced
by the difference in computation of the GT masks, which are called bundle masks in the original scoring
data. Our new scoring was thus compared to the 2015 Recobundles system but with new bundle masks,
computed with the recent definition [13]. We verified the influence of this change on the original results.
Updated bundle masks led to a decrease in both OL and ORgt (see Table 1), but to nearly unchanged f1
scores (p > 0.1). To allow comparison, these results were computed over 21 bundles using the mean
value of FPT/POPT/CST.
E. Influence of the new scoring system on scores. Visually, the new scoring of the initial 2015 submissions
led to better segmentation (see Fig. 2). On average, Dice scores were significantly different (p < 0.001) (see
Table 2), but with an average change of only 2%, offering similar rankings (average absolute difference: 2
positions out of 96), thus leading to similar conclusions as in the original analysis. However, some
bundles showed major differences (see Table 3) in scores and in ranking. Detailed score tables for
each team and each bundle are provided on the website.
- VB: As seen in Table 3, CP and CA were recovered more often than estimated in the original analysis.
They are still the two most difficult bundles to recover, but to a lesser extent.
- VS: The biggest change in VS is seen in the CC, partly because it is by far the biggest bundle. When
observing the VS in raw numbers rather than as percentages, a comparison between the two scoring
systems reveals drastic changes, as seen in Table 3.
- Bundle coverage: Fig. 5 (top section) compares the bundle dispersion in OL and ORgt between the two
scoring systems. The main changes are reported in Table 3. Overall, the f1 score improved, particularly for
the two BPS bundles and the left OR, bundles for which modifications were made to the GT data.
The bottom section of Fig. 5 compares the dispersion of submissions for these metrics. Overall, previous
conclusions still hold: probabilistic tracking may help generate the highest OL, but with the highest ORgt.
Submissions 9.1 and 9.2 (best OL) only obtain Dice scores of 45% and 46%, placing them in 43rd and
38th rank. When relying on the Dice score for a final ranking of the submissions, the biggest variations
included a gain of 9 places for submission 17.0 and a drop of 13 places for submission 1.4. The top 8
submissions stayed the same but in a different order, as did the bottom 8 submissions.
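The ranking comparison used above (average absolute difference in rank between the two scoring systems) can be sketched as follows. This is a minimal illustration, assuming one mean Dice score per submission under each system; the submission labels and values are invented for the example and are not the published scores. Ties are broken by input order, which is an arbitrary choice of this sketch.

```python
def ranks(scores):
    """Map each submission to its rank (1 = highest Dice score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def mean_abs_rank_change(old_scores, new_scores):
    """Average absolute change in rank between two scoring systems."""
    old_rank, new_rank = ranks(old_scores), ranks(new_scores)
    changes = [abs(new_rank[name] - old_rank[name]) for name in old_scores]
    return sum(changes) / len(changes)


# Illustrative values only (hypothetical submissions A to D).
old = {"A": 0.50, "B": 0.45, "C": 0.40, "D": 0.30}
new = {"A": 0.48, "B": 0.52, "C": 0.41, "D": 0.35}

print(ranks(old))
print(ranks(new))
print(mean_abs_rank_change(old, new))
```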
Table 1. Impact of updated bundle masks on the scoring, using the original Tractometer scoring system.

| Mean | Original 2015 scores (old bundle masks) | Updated 2015 scores (new bundle masks) |
|---|---|---|
| OL (%) | 35.6 ± 16.5 [1.1 to 76.6] | 34.7 ± 16.2 [1.1 to 75.4] |
| ORgt (%) | 29.0 ± 25.9 [1.0 to 152.5] | 25.5 ± 23.3 [0.9 to 137.7] |
| Dice / f1 (%) | 37.8 ± 12.6 [2.0 to 56.1] | 37.8 ± 12.8 [2.0 to 58.0] |

Table 2. Effect of the new segmentation on average scores. Nb: number of submissions that recovered
the bundle.

| Mean | Updated 2015 scores (21 bundles) | New scores |
|---|---|---|
| VB | 18.0 ± 2.7 [5 to 20] | 18.5 ± 2.3 [9 to 21] |
| Nb | 82.1 ± 25.4 [2 to 96] | 84.5 ± 20.8 [22 to 96] |
| VS (%) | 53.6 ± 23.5 [3.7 to 92.5] | 52.5 ± 22.1 [4.3 to 88.6] |
| OL (%) | 35.7 ± 16.0 [1.3 to 74.3] | 37.8 ± 16.4 [1.8 to 80.0] |
| ORgt (%) | 26.7 ± 23.7 [1.1 to 141.4] | 29.1 ± 26.7 [2.4 to 161.1] |
| Dice / f1 (%) | 38.4 ± 12.1 [2.4 to 54.9] | 40.7 ± 12.2 [3.1 to 57.9] |
F. Usage on new data. We successfully used the Tractoflow pipeline [14] with the noisy data using both
PFT tracking and local tracking to obtain two full tractograms that were scored with the new system. The
PFT version led to the best Dice score (64%; the previous best was 58%), with an average OL and ORgt of 76%
and 60%. The local tracking version, which used a dilated white matter (WM) mask, obtained the best
overlap (91%; the previous best was 80%), but with more ORgt, explaining its lower, yet still high, Dice score (57%).
Discussion
We have developed an enhanced Tractometer scoring system for the ISMRM 2015 Tractography
Challenge data. It uses carefully determined regions of interest. It offers more reliable results because the
segmentation now depends only on the quality of the ROIs. It does not depend on other aspects that were
important in the Recobundles segmentation, such as the ordering of the bundles, the quality of the reference
tractogram, and threshold values for the mean direct-flip (MDF) metric [15]. Recovered bundles could hide
streamlines with noisy shapes because the GT data itself contained noisy shapes. In short, our new
segmentation was strict enough to prevent the inclusion of noisy streamlines but flexible enough to allow
scoring submissions of varied streamline lengths, curvature, fanning, and tracking masks.
Overall, the new segmentation offers similar rankings as before when using averaged values over all
bundles and all teams, but scores for some bundles were strongly modified, and the final ordering of the
teams based on Dice scores varied.
Table 3. Effect of the new scoring: some of the main changes in specific bundles (average over teams). L/R = left / right.

| Mean | Bundle | Tractometer 2015 (21 bundles) | Tractometer 2022 | Difference |
|---|---|---|---|---|
| Nb submissions recovering the bundle | CP | 2 | 25 | +23 |
| | CA | 12 | 22 | +10 |
| | SCP L/R | 86 / 83 | 88 / 88 | +2 / +5 |
| | Others | | | Differences in less than 4 submissions |
| VS (total number of streamlines recovered amongst all teams) | CP | 2 | 172 | +8500% |
| | CA | 64 | 2011 | +3400% |
| | SCP L/R | 38,193 / 23,607 | 59,109 / 36,171 | +55% / +53% |
| | Cg L/R | 278,422 / 238,027 | 374,647 / 375,725 | +35% / +58% |
| | BPS L/R | 322,645 / 520,016 | 437,459 / 636,523 | +36% / +22% |
| | OR L | 49,883 | 65,161 | +31% |
| | Others | | | Less than 20% variation |
| OL (%) | BPS L/R | 28.8 / 29.7 | 37.1 / 39.4 | +8.3% / +9.7% |
| | OR L | 21.4 | 30.6 | +9.1% |
| | SCP L/R | 33.9 / 27.9 | 40.0 / 33.2 | +6 / +5 |
| | Others | | | Less than 5% variation |
| ORgt (%) | SCP L/R | 26.1 / 18.8 | 44.3 / 31.2 | +18.2 / +12.4 |
| | SLF L/R | 50.4 / 57.3 | 49.0 / 47.5 | -5.0 / -9.8 |
| | ICP L/R | 37.8 / 25.1 | 45.5 / 30.5 | +7.7 / +5.4 |
| | CA | 0.7 | 7.7 | +6.9 |
| | ILF R | 41.8 | 54.0 | +6.2 |
| | Others | | | Less than 5% variation |
| Dice / f1 (%) | BPS L/R | 34 / 36 | 44 / 47 | +10 / +11 |
| | OR L | 25 | 37 | +12 |
| | CA | 2 | 7 | +5 |
| | Others | | | Less than a 3% variation |
Verification of the original code. No error was found in the original code. Importantly, however, the management of
tractogram formats and headers has evolved significantly since 2015. Users should verify that their
tractograms are correctly interpreted when using the updated python3 code.
Verification of the original scores. The scores published in the 2017 article were correct, but the detailed
scores published on the website contained errors, which are now corrected. Please also note that some
wrong numbers tend to be relayed amongst publications citing the ISMRM challenge results. We urge
readers to rely on the up-to-date scores currently published on the official website.
We also modified the metrics terminology to avoid confusion:
1) VS/IS: In the original analysis, the term “connection” was used in the terms valid/invalid/no
connections (VC, IC, NC). However, VC was defined as the number of streamlines belonging to a valid
bundle and could actually include broken or prematurely stopped streamlines that do not reach any gray
matter region, as long as they were classified as belonging to the bundle by the chosen segmentation
process. The word “connection” may encourage a wrong interpretation of the results, suggesting that they allow
connectivity analysis between brain regions. We renamed VC as VS (valid streamlines). We regrouped IC
and NC under the term IS (invalid streamlines).
2) IB: Segmenting invalid streamlines into invalid bundles gives insights into the erroneous streamlines
typically produced, but their number (IB) may be misleading as a scoring metric, as it
depends on the definition of these bundles. Recobundles segmentation of spurious streamlines with
varied shapes and distribution offers scores that are difficult to interpret. This score should only be used
with great care. IB scores are therefore no longer used in our work.
Curation of the data. Curation of the data was kept as minimal as possible. We removed streamlines that
prevented the creation of a good scoring system. Because of this work, the new GT now corresponds less
perfectly with the associated DWI. Creating a new simulated DWI with Fiberfox [16] would be possible, but
future work using this new data could not be compared with the scores presented here from teams who
participated in the challenge.
One note to the reader should be made here. The phantom was created with knowledge available at the
time. Although the bundles have names that correspond to known anatomical tracts, users should keep in
mind that they might not present exact characteristics and features compared to the real tracts [17].
These bundles should be used as phantom parts, not as anatomical references. Here is a short list of
differences that were noticed between the GT bundles and known anatomical landmarks.
- CC: The corpus callosum is known to contain a majority of homotopic connections [18]. Heterotopic
connections do exist, but are less documented [19]. Many heterotopic connections are found in this GT
(e.g., ventral-striatal).
- Cg: The Cg consists of 5 sub-bundles [20]. The GT bundle lacks the posterior part (named CB-V in the
paper).
- ICP: This bundle should end in the brainstem, but the GT bundle contains two sub-bundles; one is
anatomically correct, but the other, looping back into the cerebellar cortex, does not correspond to any
known path in the human anatomy.
- OR: The current bundle would be better named thalamo-occipital connections. The OR is typically
defined as the streamlines from the peri-calcarine fissure to the thalamus [21], but in this GT, the bundle
extends to a larger section of the occipital lobe. Note also that Meyer's loop [22] is absent from the
current GT.
- ILF: The ILF should reach the anterior temporal lobe [23]. However, in the initial version of the phantom, it
reached a larger region, extending posteriorly close to the (expected) Meyer’s loop region. This was
modified in the new curated data, and the ILF is therefore now more anatomically reliable.
- UF: As of 2018 [24], the uncinate fasciculus is now considered with a larger fanning both anteriorly in
the frontal cortex and posteriorly in the temporal cortex.
- CST / FPT / POPT: These three bundles appear intertwined, but should be more distinct. The cortical
terminations of the CST should be constrained to the precentral and postcentral gyri [25]. Both FPT and
POPT should end in the pons, but the bundles go further down, nearly to the medulla (see Fig. 1).
Due to these differences, the ROIs defined here do not represent perfect anatomical features either, but are
only the necessary tool to segment bundles before the scoring.
Creating new bundles with better anatomical features would require developing new simulated DWI
data, i.e., a new phantom, which, as stated above, was not the objective of this work. We encourage the
community to produce new and varied phantoms, as there is a lack of validation data in the field of
tractography. However, here, the goal was essentially to improve the existing one and allow, particularly,
the machine learning community to adequately compare their results with previous state-of-the-art
tractography tools. We present in a section below conclusions and suggestions drawn from our analysis
for readers interested in proposing a new phantom.
Preparation of the new segmentation technique. To allow for a good bundle segmentation in the
submitted data of most teams, the endpoint ROIs had to be made very large, sometimes up to a 16-pass
dilation of the GT bundles’ endpoint ROIs, and up to an 11-pass dilation of the bundles’ “all” masks. This could
reveal that the stopping criterion was not well defined in many processing pipelines. It generally depends
on a WM mask, which may come either from a thresholded FA map (typically ~0.1 to 0.2) or from a
segmentation of the T1. In the first case, the simulated DWI may have behaved differently than usual and
provided FA values that would require a different threshold. In the second case, the T1 is also simulated.
Segmentation algorithms were not designed to deal with “fake” images and may have produced WM
masks of lesser quality. We consider that the goal of this challenge was to evaluate the ability of
tractography algorithms to understand diffusion information and to follow diffusion anisotropy
information through challenging paths such as fiber crossings and bottlenecks. We therefore decided not to
penalize submissions with streamlines going further than expected. For instance, some submissions had
streamlines from the OR going out of the thalamus without stopping, or streamlines from the Fornix
looping very far off the mamillary bodies, or even streamlines going out of the brain. Our ROIs thus spill
out of realistic anatomical regions in an attempt to include the biggest part of every submission’s
bundles. We can still segment bundles correctly by combining the endpoint ROIs with the “all” masks.
Analysis of the score differences. Compared to the initial analysis [4], it is still true that teams were able
to recover most bundles. It is also still true that, on average, only half of the streamlines in the submitted
tractograms are valid streamlines. Finally, we still find that probabilistic tracking may help generate the
highest OL, but with the highest ORgt when compared to deterministic tracking, resulting in small changes
in the Dice score.
VB: CA and CP are still the two most difficult bundles to reconstruct and still form a well-defined
category in Fig. 5, but to a lesser extent. Using Recobundles, CP was scored after CC; these
streamlines were often associated with the CC and thus ignored when segmenting the CP. Other changes in
recovered bundles are explained by the fact that newly found bundles generally contained only a few very
small streamlines that may be harder to compare with reference streamlines using the MDF metric (in
Recobundles). The hard-to-track and medium-difficulty bundles (Fig. 5) are now less separate categories.
IB: Invalid bundles can no longer be scored due to the large size of the ROIs. It would be possible to add
an additional analysis step and segment the invalid streamlines (IS) into invalid bundles (IB) using
Quickbundles, as before. We chose not to include this here, as it is prone to the same instability
as Recobundles that we so firmly seek to avoid. The number of invalid bundles obtained with
Quickbundles depends strongly on the type of invalid streamlines. Even a few misplaced streamlines may
lead to a rapid increase in IB, which should not be used to infer the quality of the scored tractogram. We
do recognize that the IB analysis was useful in the original article to visualize the typical errors recovered
recurrently over multiple submissions, but the IB score itself should be used carefully.
VS/IS: Often, the additional recovered streamlines were of very poor quality, and other metrics were not
improved much. The total percentage of VS, averaged over all teams, all bundles, only varied by less than
1%. Yet, it represents an average of 1000 streamlines per submission. In the future, with algorithms
becoming ever better and researchers trying to push the limits of tractography, these small differences in
scoring could impact researcher choices in implementation.
Bundle coverage: Despite the big changes in the total number of recovered streamlines in individual
bundles throughout the 96 submitted tractograms, general scoring metrics stayed similar, but ranking
amongst teams was modified.
Suggestions for the creation of a new phantom. The final comparison of “winners” based on the Dice
score, either in the original analysis or here, did not allow a clear definition of the best tractography
parameters. This can be explained by the large influence of preprocessing steps such as the choice of
tracking space, the tracking masks, the registration quality, and so on. Future phantoms should limit these
possibilities so that they specifically assess our ability to follow diffusion information in
the brain, in other words, the “tracking” aspect, rather than the quality of the whole pipeline. We present
here some afterthoughts.
1. The level of complexity in the challenge data was good. It presented human-like geometries with
multiple bundle crossings or bottlenecks. Its number of bundles was good and allowed a scoring
system.
2. The associated simulated T1 data, however, was not realistic enough to allow good results in
segmentation software such as Freesurfer [26] or FSL FAST [27] for instance. We suggest that future
work should include a list of potentially interesting masks, particularly a WM mask that could be
used as a tracking mask.
3. The quality of individual streamlines, not only of bundles as whole entities, should be verified in the
GT and during scoring.
4. Developers should specify a way for users to verify their tractogram format to prevent shifts (e.g.,
±0.5 depending on whether the origin of a voxel coordinate is considered at the center or the corner of the voxel) or
swapping of axes during interpretation (e.g., by specifying the orientation).
5. Developers should specify in which space the final scoring will be performed. Users applying a
substandard registration between T1 and DWI spaces could be strongly disadvantaged, even if their
tracking algorithm itself was perfect.
6. OL, ORgt, and Dice scores offer good insights, but there is still a lack of metrics comparing the shape of
individual streamlines in the literature, which should be addressed.
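The ±0.5 ambiguity mentioned in point 4 can be made concrete with a small sketch. It assumes coordinates expressed in voxel units; the function names are hypothetical and do not belong to any tractography library. It shows how a point stored with the origin at the voxel center lands in the wrong voxel if it is read as if the origin were at the voxel corner.

```python
from math import floor


def center_to_corner(point):
    """Convert a point from 'origin at voxel center' to 'origin at voxel corner'."""
    return tuple(c + 0.5 for c in point)


def corner_to_center(point):
    """Inverse conversion: 'origin at voxel corner' to 'origin at voxel center'."""
    return tuple(c - 0.5 for c in point)


def voxel_index_corner(point):
    """Voxel containing a point, assuming the origin is at the voxel corner."""
    return tuple(int(floor(c)) for c in point)


# A point stored with the origin at the voxel center.
p_center = (1.6, 0.2, 0.2)

# Correct interpretation: convert to the corner convention before indexing.
correct = voxel_index_corner(center_to_corner(p_center))
# Misinterpretation: index directly, as if the origin were already at the corner.
misread = voxel_index_corner(p_center)

print(correct, misread)
```

Here the two interpretations disagree by one voxel along x, which is enough to move streamline endpoints in or out of a small ROI.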
Analysis of the Tractoflow-processed data. The data was processed using state-of-the-art tools and
obtained very good scores.
One general comment seen in machine learning studies is that differences in scores may
arise from the choice of model, but also from the training data. Therefore, we also offer the Tractoflow-processed
data in open access on the website. It could be used as common training data.
Conclusion
We proposed a new and enhanced Tractometer scoring system based on manually-defined regions of
interest rather than on bundle recognition. Bundles are now more reliably segmented, offering more stable
metrics with higher precision for future users of this phantom and its scoring system. We provide on the
Tractometer website all tools necessary for a robust scoring of any new tractogram with our new scoring
system: the ROIs and configuration files necessary to run the code, the tables of detailed results, and the
Tractoflow-processed data.
This should help researchers better develop and evaluate the next generation of tractography algorithms.
Methods
A. Verification of the original scores. The original code was converted to python3, proofread, and
reviewed to ensure it still meets today’s standards. Metrics terminology was revised. All 2015
submissions were scored again.
The original code included forced shifting (adding 0.5) of .trk (TrackVis) files. In the updated code,
tractograms are simply loaded through dipy’s load_tractogram method. No further verification is
performed on the validity of space attributes.
B. Curation of the GT tractogram. The GT bundles were modified to allow creation of the ROIs.
Streamlines from the GT bundles were filtered to keep only those with a length in the range 20-200 mm;
streamlines outside this range (generally streamlines presenting looping shapes) or detected as loops using scilpy were discarded (see
https://scilpy.readthedocs.io/). Others were discarded based on visual analysis of the bundles. CST,
POPT, and FPT were too similar and difficult to segment adequately (Fig. 1) and were gathered into a new
bundle called Brainstem Projection System (BPS). The ILF and OR were also too similar, preventing a
good segmentation (Fig. 1), either with Recobundles or with ROIs. In this case, we chose to filter out some
streamlines to better separate the two bundles.
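The length criterion can be sketched in a few lines. This is an illustration only, not the scilpy implementation: streamlines are assumed to be lists of points in millimeters, the 20 to 200 mm bounds are taken as inclusive (an assumption), and loop detection, which the authors performed with scilpy, is not reproduced here.

```python
from math import dist


def streamline_length(points):
    """Length of a streamline as the sum of its segment lengths (same unit as the points)."""
    return sum(dist(a, b) for a, b in zip(points, points[1:]))


def filter_by_length(streamlines, min_mm=20.0, max_mm=200.0):
    """Split streamlines into (kept, rejected) according to their length."""
    kept, rejected = [], []
    for s in streamlines:
        if min_mm <= streamline_length(s) <= max_mm:
            kept.append(s)
        else:
            rejected.append(s)
    return kept, rejected


# Toy streamlines with points in mm.
too_short = [(0, 0, 0), (10, 0, 0)]              # 10 mm
valid = [(0, 0, 0), (30, 0, 0), (30, 40, 0)]     # 30 + 40 = 70 mm
too_long = [(0, 0, 0), (250, 0, 0)]              # 250 mm

kept, rejected = filter_by_length([too_short, valid, too_long])
print(len(kept), len(rejected))
```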
C. Creation of an ROI-based segmentation system. All of the masks were created by looking carefully at
both the GT data and the general distribution of results from the tractograms submitted to the Challenge
in 2015.
Endpoint ROIs: GT streamlines’ endpoints were saved as head and tail masks. We then dilated these two
masks (11-pass on average, see Fig. 4). Some endpoint ROIs were modified manually based on visual
inspection of results. Examples of modifications were: dilation to reach the end of the cortex in some
regions, manual dilation of the OR’s ROI to include more of the thalamus without spilling into the ILF,
manual separation between hemispheres, and careful separation of anterior/posterior ROIs in the case of the
cingulum and of the fornix. The CC was separated into sub-bundles for segmentation purposes
(CC_u_shaped, CC_ventro_striatal1, CC_ventro_striatal2, CC_temporal), allowing for a better delimitation
of endpoint ROIs. However, only the total CC, composed of the re-merged sub-bundles, is used during
scoring. Similarly, the ICP was segmented into ICP_part1 (similar to its anatomical definition) and
ICP_part2 (looping back into the cerebellar cortex).
“All” masks: GT streamline paths were saved as binary masks and dilated (by default, the number of
passes was 3, but some bundles required varied parameters, up to an 11-pass dilation for the CC). These GT
masks were combined with both endpoint ROIs for each bundle. Manual modifications were also applied,
generally more manual dilation.
“Any” masks: They were defined using manually positioned boxes of interest.
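An n-pass dilation, as used for the endpoint and “all” masks, can be sketched as follows. The structuring element is not specified in the text; this sketch assumes a 6-connected (face) neighborhood, represents a mask as a set of voxel indices, and ignores image boundaries. The function name `dilate` is hypothetical.

```python
def dilate(mask, passes=1):
    """Binary dilation of a set of voxel indices, repeated `passes` times.

    Assumes a 6-connected neighborhood (an assumption of this sketch);
    image boundaries are not enforced.
    """
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    current = set(mask)
    for _ in range(passes):
        grown = set(current)
        for (x, y, z) in current:
            for dx, dy, dz in offsets:
                grown.add((x + dx, y + dy, z + dz))
        current = grown
    return current


# A single seed voxel grows into an octahedron whose size depends on the passes.
seed = {(0, 0, 0)}
print(len(dilate(seed, 1)), len(dilate(seed, 2)), len(dilate(seed, 3)))
```

With this neighborhood, each pass extends the mask by one voxel in every axis direction, so an 11-pass or 16-pass dilation produces ROIs far larger than the original endpoint regions, consistent with the large masks described above.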
D. Influence of the bundle masks on scores. To allow comparison of new and old scores, the original bundle
masks were computed again using a more recent method. As suggested in 2017 by Rheault et al. [13],
bundle masks should not include only the voxels containing streamline points (even after resampling), but
should rather account for the whole segment between two points. We computed the new masks with
scilpy. Bundles segmented using the Recobundles-based system were scored again using the same
metrics but with the new GT bundle masks. Final Dice scores, averaged over all bundles, were compared
to previous scores using a Student t-test.
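The difference between a points-only mask and a segment-aware mask can be illustrated with a short sketch. This is not the method of [13] nor the scilpy implementation: it approximates segment traversal by oversampling each segment at a small step (0.1 voxel by default, an arbitrary choice), which can still miss voxels that a segment only clips at a corner. Points are assumed to be in voxel coordinates with the origin at the voxel corner.

```python
from math import ceil, dist, floor


def voxel_of(point):
    """Voxel index containing a point (origin at the voxel corner)."""
    return tuple(int(floor(c)) for c in point)


def points_only_mask(streamline):
    """Voxels that contain at least one streamline point."""
    return {voxel_of(p) for p in streamline}


def segment_mask(streamline, step=0.1):
    """Voxels visited by the segments, approximated by dense sampling."""
    mask = {voxel_of(p) for p in streamline}
    for a, b in zip(streamline, streamline[1:]):
        n = max(1, ceil(dist(a, b) / step))
        for i in range(n + 1):
            t = i / n
            sample = tuple(ai + t * (bi - ai) for ai, bi in zip(a, b))
            mask.add(voxel_of(sample))
    return mask


# A two-point streamline crossing four voxels along x.
sparse = [(0.5, 0.5, 0.5), (3.5, 0.5, 0.5)]
print(sorted(points_only_mask(sparse)))
print(sorted(segment_mask(sparse)))
```

In this example the points-only mask contains two voxels and leaves a gap, whereas the segment-aware mask contains the four voxels actually traversed.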
E. Influence of the new scoring system on scores. Newly segmented bundles of the 96 submissions were
scored using the same metrics as before. Again, final Dice scores, averaged over all bundles, were
compared to previous scores using a Student t-test.
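The text does not state whether the t-test was paired. Since the same 96 submissions are scored under both systems, a paired test is a natural reading, and the sketch below assumes it; this is our assumption, not a statement of the authors' procedure. Only the t statistic is computed (the p-value would additionally require the Student distribution with n - 1 degrees of freedom), and the values are illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(old, new):
    """Paired Student t statistic for two equally long lists of scores."""
    diffs = [n - o for o, n in zip(old, new)]
    # Mean difference divided by its standard error (sample standard deviation).
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Illustrative values only, not the published scores.
old_dice = [1.0, 2.0, 3.0, 4.0]
new_dice = [2.0, 2.0, 5.0, 5.0]

t = paired_t_statistic(old_dice, new_dice)
print(t, len(old_dice) - 1)  # t statistic and degrees of freedom
```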
F. Usage on new data. We prepared a new tractogram to be scored using recent state-of-the-art
techniques. The tractogram was prepared by running the Tractoflow pipeline [14] on the noisy DWI, using
the version with an additional reversed b0 to allow topup correction. The pipeline was modified to skip the
N4 denoising step on the T1 data, which produced irregular results, probably because the T1 is in
fact a simulated dataset. Two tracking algorithms were tested: first, PFT tracking on WM maps; second,
local tracking on a WM mask that was first modified to pass a visual quality check: it was eroded (1-pass)
and dilated again (2-pass). Both versions were scored using the new system.
Abbreviations
List of acronyms for bundles
BPS: Brainstem Projection System, CA: Anterior commissure, CC: Corpus callosum, Cg: Cingulum, CP:
Posterior commissure, CST: Cortico-spinal tract, Fornix, FPT: Fronto-pontine tract, ICP: Inferior cerebellar
peduncle, ILF: Inferior longitudinal fasciculus, MCP: Middle cerebellar peduncle, OR: Optic radiation,
POPT: Parieto-occipital pontine tract, SCP: Superior cerebellar peduncle, SLF: Superior longitudinal
fasciculus, UF: Uncinate fasciculus
List of acronyms for metrics
OL: Overlap (percentage of GT voxels recovered), ORgt: Overreach (number of false positive voxels,
normalized by the volume of the GT bundle), f1: Equivalent to the Dice score, VB: valid bundles (number
of recovered bundles), VS: valid streamlines (number of streamlines in these VB), IS: invalid streamlines
(number of remaining streamlines).
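The voxel-based metrics defined above can be written out explicitly. The sketch below follows the definitions in this list and represents masks as sets of voxel coordinates. The function name and the toy masks are hypothetical, and this is not the Tractometer code itself.

```python
def bundle_scores(bundle_voxels, gt_voxels):
    """Compute OL, ORgt and f1 (Dice) from two sets of voxel coordinates."""
    true_positives = len(bundle_voxels & gt_voxels)
    false_positives = len(bundle_voxels - gt_voxels)
    return {
        # Fraction of GT voxels recovered by the bundle.
        "OL": true_positives / len(gt_voxels),
        # False positive voxels, normalized by the GT bundle volume.
        "ORgt": false_positives / len(gt_voxels),
        # Dice score between the two masks.
        "f1": 2 * true_positives / (len(bundle_voxels) + len(gt_voxels)),
    }


# Toy example: a 10-voxel GT bundle and an 8-voxel recovered bundle,
# of which 6 voxels overlap with the GT.
gt = {(i, 0, 0) for i in range(10)}
recovered = {(i, 0, 0) for i in range(4, 12)}
scores = bundle_scores(recovered, gt)
```

Because ORgt is normalized by the GT volume rather than by the recovered volume, it can exceed 1 for a bundle that spreads far outside the GT mask.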
Declarations
Data and code availability
The datasets generated during and/or analysed during the current study are available on the Tractometer
website: www.tractometer.org.
Acknowledgement
The authors are grateful to the Fonds de recherche du Québec - Nature et technologies (FRQNT) and the
Natural Sciences and Engineering Research Council of Canada (NSERC) programs for funding this
research.
Author contributions
ER and AT proofread the original code and prepared the scripts in scilpy for the new scoring. They also
verified the format of submitted tractograms and the scoring. ER prepared the ROIs and other necessary
masks for the new segmentation process and compared scores between versions. ER wrote the
manuscript, and MD, LP and AT provided feedback. JCH was the project leader in the previous version
and answered our questions concerning the original code and data.
Competing interests
The author(s) declare no competing interests.
References
1. Drobnjak, I., Neher, P., Poupon, C. & Sarwar, T. Physical and digital phantoms for validating tractography and assessing artifacts. NeuroImage 245, (2021).
2. Rheault, F., Poulin, P., Valcourt Caron, A., St-Onge, E. & Descoteaux, M. Common misconceptions, hidden biases and modern challenges of dMRI tractography. J. Neural Eng. 17, (2020).
3. Côté, M. A. et al. Tractometer: Towards validation of tractography pipelines. Med. Image Anal. 17, 844–857 (2013).
4. Maier-Hein, K. H. et al. The challenge of mapping the human connectome based on diffusion tractography. Nat. Commun. 8, (2017).
5. Neher, P., Côté, M.-A., Houde, J.-C., Descoteaux, M. & Maier-Hein, K. Fiber tractography using machine learning. NeuroImage 158, 417–429 (2017).
6. Benou, I. & Riklin Raviv, T. DeepTract: A probabilistic deep learning framework for white matter fiber tractography. Lect. Notes Comput. Sci. Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinforma. 11766 LNCS, 626–635 (2019).
7. Poulin, P. et al. Learn to track: Deep learning for tractography. Lect. Notes Comput. Sci. Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinforma. 10433 LNCS, 540–547 (2017).
8. Wegmayr, V. & Buhmann, J. M. Entrack: Probabilistic spherical regression with entropy regularization for fiber tractography. Int. J. Comput. Vis. 129, 656–680 (2020).
9. Théberge, A., Descoteaux, M., Desrosiers, C. & Jodoin, P. M. Track-to-learn: A general framework for tractography with deep reinforcement learning. Med. Image Anal. 102093 (2021) doi:10.1101/2020.11.16.385229.
10. Rheault, F. et al. Bundle-specific tractography. in 129–139 (2018). doi:10.1007/978-3-319-73839-0_10.
11. Wasserthal, J., Neher, P. & Maier-Hein, K. H. TractSeg - Fast and accurate white matter tract segmentation. NeuroImage 183, 239–253 (2018).
12. Garyfallidis, E. et al. Recognition of white matter bundles using local and global streamline-based registration and clustering. NeuroImage 170, 283–295 (2018).
13. Rheault, F., Houde, J.-C. & Descoteaux, M. Visualization, interaction and tractometry: dealing with millions of streamlines from diffusion MRI tractography. Front. Neuroinformatics 11, (2017).
14. Theaud, G., Houde, J., Bor, A., Morency, F. & Descoteaux, M. TractoFlow: A robust, efficient and reproducible diffusion MRI pipeline leveraging Nextflow & Singularity. NeuroImage 218, (2020).
15. Garyfallidis, E., Brett, M., Correia, M. M., Williams, G. B. & Nimmo-Smith, I. QuickBundles, a method for tractography simplification. Front. Neurosci. 6, (2012).
16. Neher, P. F., Laun, F. B., Stieltjes, B. & Maier-Hein, K. H. Fiberfox: Facilitating the creation of realistic white matter software phantoms. Magn. Reson. Med. 72, 1460–1470 (2014).
17. Bullock, D. N. et al. A taxonomy of the brain’s white matter: twenty-one major tracts for the 21st century. Cereb. Cortex (2022) doi:10.1093/cercor/bhab500.
18. Francisco, A. & Montiel, J. One hundred million years of interhemispheric communication: the history of the corpus callosum. Brazilian Journal of Medical and Biological Research 409–420 (2003).
19. De Benedictis, A. et al. New insights in the homotopic and heterotopic connectivity of the frontal portion of the human corpus callosum revealed by microdissection and diffusion tractography. Hum. Brain Mapp. 37, 4718–4735 (2016).
20. Wu, Y., Sun, D., Wang, Y., Wang, Y. & Ou, S. Segmentation of the cingulum bundle in the human brain: A new perspective based on DSI tractography and fiber dissection study. Front. Neuroanat. 10, (2016).
21. Sarubbo, S. et al. The course and the anatomo-functional relationships of the optic radiation: a combined study with ‘post mortem’ dissections and ‘in vivo’ direct electrical mapping. J. Anat. 226, 47–59 (2015).
22. Falconer, M. A. & Wilson, J. L. Visual field changes following anterior temporal lobectomy: Their significance in relation to ‘Meyer’s loop’ of the optic radiation. Brain 81, part 1, (1958).
23. Panesar, S. S., Yeh, F.-C., Jacquesson, T., Hula, W. & Fernandez-Miranda, J. C. A quantitative tractography study into the connectivity, segmentation and laterality of the human inferior longitudinal fasciculus. Front. Neuroanat. 12, (2018).
24. Hau, J. et al. Revisiting the human uncinate fasciculus, its subcomponents and asymmetries with stem-based tractography and microdissection validation. Brain Struct. Funct. 222, 1645–1662 (2017).
25. Chenot, Q. et al. A population-based atlas of the human pyramidal tract in 410 healthy participants. Brain Struct. Funct. 224, 599–612 (2019).
26. Dale, A. M., Fischl, B. & Sereno, M. I. Cortical surface-based analysis: I. Segmentation and surface reconstruction. NeuroImage 9, 179–194 (1999).
27. Zhang, Y., Brady, M. & Smith, S. Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Trans. Med. Imaging 20, 45–57 (2001).
Figures
Figure 1
Erroneous bundle segmentation examples. A) ILF (red) and OR (blue) with B) an example of sub-optimal
bundle segmentation in submission 1.3 (using Recobundles). C) FPT (pink), CST (orange), and POPT
(blue), with D) streamlines recovered for these bundles from all 2015 submissions. The GT bundle mask
borders are shown in a darker contour. We can see that streamlines were sometimes assigned arbitrarily
to one bundle or the other, particularly in the center.
Figure 2
Recobundles led to poor results on some bundles. The top row shows the MCP in sagittal view. A) 2015’s
GT. B) Streamlines recovered by Recobundles from all submissions. They include vertical streamlines
that should not belong to the MCP. C) Streamlines recovered using our new ROI-based segmentation. D, E,
and F present similar patterns for the SLF.
Figure 3
Examples of looping fibers that were hidden in the original GT tractogram.
Figure 4
Examples of possible endpoint ROIs. A) OR, B) MCP, C and D) SLF. Various degrees of dilation were
tested. Bigger ROIs such as in B and D were necessary to adequately score all submitted tractograms.
Figure 5
Overlap (OL) vs Overreach (ORgt) scores in 2022 vs 2015 (with updated masks). Best results should have
high overlap (top) and low overreach (left). Top graphs: scores per bundle (averaged over all teams).