Validate your white matter tractography algorithms
with a reappraised ISMRM 2015 Tractography
Challenge scoring system
Emmanuelle Renauld ( emmanuelle.renauld@usherbrooke.ca )
Université de Sherbrooke
Antoine Théberge
Université de Sherbrooke
Laurent Petit
Université Bordeaux, CNRS, CEA, IMN, UMR 5293
Jean-Christophe Houde
Imeka Solutions Inc
Maxime Descoteaux
Université de Sherbrooke
Article
Keywords:
Posted Date: January 3rd, 2023
DOI: https://doi.org/10.21203/rs.3.rs-2411825/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License. 
Abstract
Since 2015, research groups have sought to produce the nec plus ultra of tractography algorithms, using the ISMRM
2015 Tractography Challenge as evaluation. In particular, since 2017, machine learning has made its
entrance into the tractography world. The ISMRM 2015 Tractography Challenge is the most used
phantom for tractography validation, although it has limitations. We offer here a new
Tractometer scoring system for this phantom, where segmentation of the bundles is now based on
manually defined regions of interest rather than on bundle recognition. Bundles are now more reliably
segmented, offering more stable metrics with higher precision for future users. New code is available
online. Scores of the initial 96 submissions to the challenge are updated. Overall, conclusions from the
2015 challenge are confirmed with the new scoring, but individual tractogram scores have changed, and
the data is much improved at the bundle and streamline levels. This work also led to the production of a
ground truth tractogram with less noisy streamlines and an example of processed data, all available on
the Tractometer website. This enhanced Tractometer scoring system and new data should continue to
help researchers develop and evaluate the next generation of tractography techniques.
Introduction
Tractography allows the in vivo, non-invasive recovery of white-matter fiber trajectories in the brain. In this
context, a good tractography algorithm builds a tractogram representing the ground truth (GT) of the
brain anatomy. But such a GT still does not exist for verification of algorithmic results [1,2]. To alleviate
this limitation and allow the evaluation of the tractography algorithm output quality, one typically relies
on phantoms: simulated diffusion-weighted images (DWI) associated with GT tractograms [1]. The level
of similarity between the tractogram and GT can be scored based on various metrics, such as false
positive / false negative rates, or coverage metrics, such as overlap or overreach, amongst others [3].
Generally, they are calculated for each bundle present in the dataset rather than on the whole tractogram.
A phantom must thus be associated with a scoring system of its own, including a process for bundle
segmentation and metrics that quantify the quality of these bundles.
The ISMRM 2015 Tractography Challenge [4] has become the most widely used phantom for
tractography validation [1]. In fact, it is nearly the only tractography dataset with human brain geometries
offering a GT. The article, published in 2017, has been cited approximately 1000 times (as of December
2022). It has also provided important insights into the challenges of tractography, particularly regarding
the strong presence of false positives and the poor overlap of true positives. Now, the development of
new algorithms for tractography often includes a tractography validation step using this phantom.
Tractography has come a long way since its beginnings, and, generally, the most recent algorithms all
achieve similar scores. Even small differences in scoring may lead to big conclusions on the choice of
optimal model parameters. This is particularly true in the field of machine learning in tractography [5–9],
where the validation phase often relies on final scores for fine-tuning hyper-parameters. A robust, stable
scoring system of high precision is therefore important. Also, bundle-specific tractography has become
increasingly investigated [10,11], so scores must be reliable for every bundle, not only on
average.
In this work, we veried the quality, precision, and robustness of the challenge data and its ocial bundle
segmentation process. We discovered that the segmentation of the bundles led, sometimes, to poor
results. When looking visually at the segmented data from tractograms submitted to the challenge in
2015, some bundles seemed recurrently poorly segmented, such as the OR and the CST (Fig1,Fig2, see
below for the list of acronyms). Even scoring the GT tractogram itself led to non-perfect results, with 95%
overlap, 9% overreach, and a Dice score of 92%. Segmentation was based on Recobundles [12], a bundle
segmentation method based on clustering of streamlines, which is inuenced by the quality of the
reference bundles, relies on manually dened thresholds, and whose results depend on the ordering
sequence of bundles during the processing.
Here, we propose a more stable scoring system using carefully positioned regions of interest (ROIs). We
present the consequences of the new process on the published scores of the 96 tractograms submitted
during the challenge in 2015. Overall, general conclusions drawn in the original article [4] still hold: most
teams recovered most bundles correctly, but with lots of false positives and a poor overlap of true
positives. However, individual scores for some bundles or some teams are now strongly reappraised. In
particular, CA and CP are better recovered than shown in the previous analysis, and coverage scores are
more stable.
Our work also led to the production of a new ground truth tractogram with less noisy streamlines,
revisions of the previously published scores, revisions of the initial code, and preparation of an example
of well processed data. All updated data and scoring information (ROIs, code) are available on the
Tractometer website: www.tractometer.org.
Results
A. Confirmation of the original scores. We first verified that we could reproduce the original results [4]
using the updated python3 version and original data. All 2015 submissions were scored again with the
reviewed and updated code, with 100% reproducibility of the original scores.
B. Curation of the tractogram. The quality of the GT prevented the creation of ROIs. Analysis of the GT
tractogram revealed short/long, looping, and broken streamlines (Fig. 3) that we filtered. Streamline
rejection was kept as small as possible to ensure good compatibility between the tractogram and the
associated simulated DWI.
We found long or looping streamlines in 12 bundles (out of 25). The biggest changes included 8%
rejection in the CC, 24% and 23% rejection in the two ILF, and 12% and 6% in the two OR. The CC and right ILF
included a substantial number of looping streamlines. The CC had many half-streamlines stopping at the midline.
The ILF and OR were too similar to allow a good segmentation; some streamlines were rejected manually. In
other bundles, less than 1% of streamlines were discarded. The final clean tractogram contains 190,065
streamlines (5% rejection).
C. Creation of an ROI-based segmentation system. The new segmentation relies on endpoint ROI masks,
“all” masks, and in some cases, on other criteria such as maximum length, maximum total displacement
per orientation, or “any” masks.
- Endpoint masks: head and tail of the bundle. Segmented streamlines must have one endpoint in each of
the two masks. Masks were created large enough to ensure they covered most variation in streamline
shape of any scored tractogram (Fig. 4). Such large masks prevented an adequate segmentation of invalid
bundles (IB), which would be defined as bundles connecting ROIs that should not be connected.
- “All” masks: bundle envelope. Streamlines must be entirely included inside the mask. This avoids wrong-
path connections, where streamlines connect the right regions but with a wrong path. Again, these masks
were created as large as possible to include overreaching streamlines from most submissions.
- “Any” masks: masks of mandatory passage. Streamlines must traverse it (at “any” point of the
streamline).
Mask names and other criteria are stored in a scoring configuration file in JSON format.
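The way the three mask types combine can be illustrated with a minimal sketch. This is not the Tractometer code: the function names, the dictionary keys (`head`, `tail`, `all`, `any`), and the representation of masks as sets of integer voxel indices are assumptions made for illustration. Streamline points are assumed to be expressed in voxel coordinates with the origin at the voxel corner, and only the points (not the segments between them) are tested. The other criteria mentioned above (maximum length, maximum displacement) are left out.

```python
from math import floor


def voxel_of(point):
    """Voxel index containing a point (voxel space, origin at the voxel corner)."""
    return tuple(int(floor(c)) for c in point)


def belongs_to_bundle(streamline, rois):
    """Return True if a streamline satisfies the ROI criteria of one bundle.

    `rois` is a hypothetical dict of voxel-index sets:
      'head', 'tail': endpoint masks (required)
      'all': envelope mask (optional)
      'any': list of mandatory-passage masks (optional)
    """
    voxels = [voxel_of(p) for p in streamline]
    first, last = voxels[0], voxels[-1]
    # One endpoint in each endpoint mask, in either orientation.
    ends_ok = ((first in rois["head"] and last in rois["tail"]) or
               (first in rois["tail"] and last in rois["head"]))
    if not ends_ok:
        return False
    # "All" mask: every point must stay inside the envelope.
    if "all" in rois and not all(v in rois["all"] for v in voxels):
        return False
    # "Any" masks: the streamline must cross each of them at least once.
    for mask in rois.get("any", []):
        if not any(v in mask for v in voxels):
            return False
    return True


# Toy bundle running along the x axis (made-up data).
rois = {
    "head": {(0, 0, 0)},
    "tail": {(3, 0, 0)},
    "all": {(x, 0, 0) for x in range(4)},
}
valid = [(0.5, 0.5, 0.5), (1.5, 0.5, 0.5), (2.5, 0.5, 0.5), (3.5, 0.5, 0.5)]
wrong_path = [(0.5, 0.5, 0.5), (1.5, 1.5, 0.5), (3.5, 0.5, 0.5)]
```

In this toy case, `wrong_path` connects the right endpoint regions but leaves the envelope, so it is rejected by the “all” mask, which is the wrong-path situation described above.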
We veried the quality of ROIs by scoring the new curated GT data. We obtained 100% OL and 0% ORgt
for all bundles, as expected. When scoring the initial (non-curated) tractogram, mean OL was also 100%,
with a 1% overreach, showing that modications during curation were kept minimal. Running the new
scoring system on all 96 submissions took 2h 57m, vs 8h57m using the initial Recobundles-based
system.
D. Influence of the bundle masks on previous scores. To compare new and initial scores, we first ensured that
the two sets of results were indeed comparable. We noted that differences in results could be influenced
by the difference in computation of the GT masks, which are called bundle masks in the original scoring
data. Our new scoring was thus compared to the 2015 Recobundles system but with new bundle masks,
computed with the recent definition [13]. We verified the influence of this change on the original results.
Updated bundle masks led to a decrease in both OL and ORgt (see Table 1), but to nearly unchanged f1
scores (p-value > 0.1). To allow comparison, these results were computed over 21 bundles using the mean
value of FPT/POPT/CST.
E. Influence of the new scoring system on scores. Visually, the new scoring of the initial 2015 submissions
led to better segmentation (see Fig. 2). On average, Dice scores were significantly different (p < 0.001) (see
Table 2), but with an average change of only 2%, offering similar rankings (average absolute difference: 2
positions out of 96), thus leading to similar conclusions as in the original analysis. However, some
bundles showed major differences (see Table 3) in scores and in ranking. The detailed score tables for
each team and each bundle are provided on the website.
- VB: As seen in Table 3, CP and CA were recovered more often than estimated in the original analysis.
They are still the two most difficult bundles to recover, but to a lesser extent.
- VS: The biggest change in VS is seen in the CC, partly because it is by far the biggest bundle. When
observing the VS in raw numbers rather than as percentages, a comparison between the two scoring
systems reveals drastic changes, as seen in Table 3.
- Bundle coverage: Fig. 5 (top section) compares the bundle dispersion in OL and ORgt between the two
scoring systems. The main changes are reported in Table 3. Overall, the f1 score improved, particularly for
the two BPS bundles and the left OR, bundles for which modifications were made in the GT data.
The bottom section of Fig. 5 compares the dispersion of submissions for these metrics. Overall, previous
conclusions still hold: probabilistic tracking may help generate the highest OL, but with the highest ORgt.
Submissions 9.1 and 9.2 (best OL) only obtain Dice scores of 45% and 46%, placing them in 43rd and
38th rank. When relying on the Dice score for a final ranking of the submissions, the biggest variations
included a gain of 9 places for submission 17.0 and a drop of 13 places for submission 1.4. The top 8
submissions stayed the same but in a different order, as did the bottom 8 submissions.
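The rank comparison discussed above (rank shifts per submission and the average absolute difference in position) can be sketched as follows. This is only an illustration of the computation, under the assumption that submissions are ranked by decreasing mean Dice score. The function names and the three scores are made up and are not challenge data.

```python
def ranks(scores):
    """Map submission id -> rank (1 = highest Dice). Ties keep input order."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {sub: i + 1 for i, sub in enumerate(ordered)}


def rank_shifts(old_scores, new_scores):
    """Per-submission rank change (positive = moved up) and mean absolute change."""
    old_r, new_r = ranks(old_scores), ranks(new_scores)
    shifts = {sub: old_r[sub] - new_r[sub] for sub in old_scores}
    mean_abs = sum(abs(v) for v in shifts.values()) / len(shifts)
    return shifts, mean_abs


# Made-up Dice scores for three hypothetical submissions.
old = {"A": 0.50, "B": 0.45, "C": 0.40}
new = {"A": 0.48, "B": 0.52, "C": 0.41}
shifts, mean_abs = rank_shifts(old, new)
```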
Table 1. Impact of updated bundle masks on the scoring, using the original Tractometer scoring system. Values are mean ± standard deviation [range].

| Mean | Original 2015 scores (old bundle masks) | Updated 2015 scores (new bundle masks) |
|---|---|---|
| OL (%) | 35.6 ± 16.5 [1.1 to 76.6] | 34.7 ± 16.2 [1.1 to 75.4] |
| ORgt (%) | 29.0 ± 25.9 [1.0 to 152.5] | 25.5 ± 23.3 [0.9 to 137.7] |
| Dice / f1 (%) | 37.8 ± 12.6 [2.0 to 56.1] | 37.8 ± 12.8 [2.0 to 58.0] |
Table 2. Effect of the new segmentation on average scores. Nb: Number of submissions that recovered
the bundle.

| Mean | Updated 2015 scores (21 bundles) | New scores |
|---|---|---|
| VB | 18.0 ± 2.7 [5 to 20] | 18.5 ± 2.3 [9 to 21] |
| Nb | 82.1 ± 25.4 [2 to 96] | 84.5 ± 20.8 [22 to 96] |
| VS (%) | 53.6 ± 23.5 [3.7 to 92.5] | 52.5 ± 22.1 [4.3 to 88.6] |
| OL (%) | 35.7 ± 16.0 [1.3 to 74.3] | 37.8 ± 16.4 [1.8 to 80.0] |
| ORgt (%) | 26.7 ± 23.7 [1.1 to 141.4] | 29.1 ± 26.7 [2.4 to 161.1] |
| Dice / f1 (%) | 38.4 ± 12.1 [2.4 to 54.9] | 40.7 ± 12.2 [3.1 to 57.9] |
F. Usage on new data. We successfully used the Tractoflow pipeline [14] with the noisy data using both
PFT tracking and local tracking to obtain two full tractograms that were scored with the new system. The
PFT version led to the best Dice score (64%; the previous best was 58%), with an average OL and ORgt of 76%
and 60%. The local tracking version, which used a dilated white matter (WM) mask, obtained the best
overlap (91%; the previous best was 80%), but with more ORgt, explaining its lower, yet high, Dice score (57%).
Discussion
We have developed an enhanced Tractometer scoring system for the ISMRM 2015 Tractography
Challenge data. It uses carefully determined regions of interest. It offers more reliable results because the
segmentation now depends only on the quality of the ROIs. It does not depend on other aspects that were
important in the Recobundles segmentation, such as the ordering of the bundles, quality of the reference
tractogram, and threshold values for the mean direct-flip (MDF) metric [15]. Recovered bundles could hide
streamlines with noisy shapes because the GT data itself contained noisy shapes. In short, our new
segmentation was strict enough to prevent the inclusion of noisy streamlines but flexible enough to allow
scoring submissions of varied streamline lengths, curvature, fanning, and tracking mask.
Overall, the new segmentation offers similar rankings as before when using averaged values over all
bundles and all teams, but scores for some bundles were strongly modified, and the final ordering of the
teams based on Dice scores varied.
Table 3. Effect of the new scoring: some of the main changes in specific bundles (average over teams). L/R = left / right.

| Mean | Bundle | Tractometer 2015 (21 bundles) | Tractometer 2022 | Difference |
|---|---|---|---|---|
| Nb submissions recovering the bundle | CP | 2 | 25 | +23 |
| | CA | 12 | 22 | +10 |
| | SCP L/R | 86 / 83 | 88 / 88 | +2 / +5 |
| | Others | | | Differences in less than 4 submissions |
| VS (total number of streamlines recovered amongst all teams) | CP | 2 | 172 | +8500% |
| | CA | 64 | 2011 | +3400% |
| | SCP L/R | 38,193 / 23,607 | 59,109 / 36,171 | +55% / +53% |
| | Cg L/R | 278,422 / 238,027 | 374,647 / 375,725 | +35% / +58% |
| | BPS L/R | 322,645 / 520,016 | 437,459 / 636,523 | +36% / +22% |
| | OR L | 49,883 | 65,161 | +31% |
| | Others | | | Less than 20% variation |
| OL (%) | BPS L/R | 28.8 / 29.7 | 37.1 / 39.4 | +8.3% / +9.7% |
| | OR L | 21.4 | 30.6 | +9.1% |
| | SCP L/R | 33.9 / 27.9 | 40.0 / 33.2 | +6 / +5 |
| | Others | | | Less than 5% variation |
| ORgt (%) | SCP L/R | 26.1 / 18.8 | 44.3 / 31.2 | +18.2 / +12.4 |
| | SLF L/R | 50.4 / 57.3 | 49.0 / 47.5 | -5.0 / -9.8 |
| | ICP L/R | 37.8 / 25.1 | 45.5 / 30.5 | +7.7 / +5.4 |
| | CA | 0.7 | 7.7 | +6.9 |
| | ILF R | 41.8 | 54.0 | +6.2 |
| | Others | | | Less than 5% variation |
| Dice / f1 (%) | BPS L/R | 34 / 36 | 44 / 47 | +10 / +11 |
| | OR L | 25 | 37 | +12 |
| | CA | 2 | 7 | +5 |
| | Others | | | Less than a 3% variation |
Verification of the original code. No error was found in the original code. Importantly, however, the management of
tractogram formats and headers has evolved significantly since 2015. Users should verify that their
tractograms are correctly interpreted when using the updated python3 code.
Verification of the original scores. The scores published in the 2017 article were good, but the detailed
scores published on the website contained errors, which are now corrected. Please also note that some
wrong numbers tend to be relayed amongst publications citing the ISMRM challenge results. We urge
readers to rely on the up-to-date scores currently published on the official website.
We also brought modications to metrics terminology to avoid confusion:
1) VS/IS: In the original analysis, the term “connection” was used in the terms valid/invalid/no
connections (VC, IC, NC). However, VC was dened as the number of streamlines belonging to a valid
bundle and could actually include broken or prematurely stopped streamlines that do not reach any gray
matter region as long as they were classied as belonging to the bundle by the chosen segmentation
process. The word may encourage wrong interpretation of the results, suggesting they can allow
connectivity analysis between brain regions. We renamed VC as VS (valid streamlines). We regrouped IC
and NC under the term IS (invalid streamlines).
2) IB: Segmenting invalid streamlines into invalid bundles gives insights on erroneous streamlines
typically produced, but scoring their number (IB) may however be misleading as a scoring metric as it
depends on the denition of these bundles. Recobundles segmentation of spurious streamlines with
varied shapes and distribution offers scores that are dicult to interpret. This score should only be used
with great care. IB scores are therefore not used anymore in our work.
Curation of the data. Curation of the data was kept as minimal as possible. We removed streamlines that
prevented the creation of a good scoring system. Because of this work, the new GT now corresponds less
perfectly with the associated DWI. Creating a new simulated DWI with Fiberfox [16] would be possible, but
future work using this new data could then not be compared with the scores presented here from teams who
participated in the challenge.
One note to the reader should be made here. The phantom was created with knowledge available at the
time. Although the bundles have names that correspond to known anatomical tracts, users should keep in
mind that they might not present exact characteristics and features compared to the real tracts [17].
These bundles should be used as phantom parts, not as anatomical references. Here is a short list of
differences that were noticed between the GT bundles and known anatomical landmarks.
- CC: The corpus callosum is known to contain a majority of homotopic connections [18]. Heterotopic
connections do exist, but are less documented [19]. Many heterotopic connections are found in this GT
(e.g., ventral-striatal).
- Cg: The Cg consists of 5 sub-bundles [20]. The GT bundle lacks the posterior part (named CB-V in the
paper).
- ICP: This bundle should end in the brainstem, but the GT bundle contains two sub-bundles; one is
anatomically correct but the other, looping back into the cerebellar cortex, does not correspond to any
known path in the human anatomy.
- OR: The current bundle would be better named as thalamo-occipital connections. The OR is typically
defined as the streamlines from the peri-calcarine fissure to the thalamus [21], but in this GT, the bundle
extends to a larger section of the occipital lobe. Note also that the Meyer's loop [22] is absent from the
current GT.
- ILF: The ILF should reach the anterior temporal lobe [23]. However, in the initial version of the phantom, it
reached a larger region, extending posteriorly close to the (expected) Meyer’s loop region. This was
modified in the new curated data and therefore the ILF is now more anatomically reliable.
- UF: As of 2018 [24], the uncinate fasciculus is now considered with a larger fanning both anteriorly in
the frontal cortex and posteriorly in the temporal cortex.
- CST / FPT / POPT: These three bundles appear intertwined, but should be more distinct. The cortical
terminations of the CST should be constrained to the precentral and postcentral gyri [25]. Both FPT and
POPT should end in the pons, but the bundles go further down, nearly to the medulla (see Fig. 1).
Due to these differences, the ROIs defined here do not represent perfect anatomical features either, but are
only the necessary tool to segment bundles before the scoring.
Creating new bundles with better anatomical features would require developing new simulated DWI
data, i.e., a new phantom, which, as stated above, was not the objective of this work. We encourage the
community to produce new and varied phantoms as there is a lack of validation data in the field of
tractography. However, here, the goal was essentially to improve the existing one and allow, particularly,
the machine learning community to adequately compare their results with previous state-of-the-art
tractography tools. We present in a section below conclusions and suggestions drawn from our analysis
for readers interested in proposing a new phantom.
Preparation of the new segmentation technique. To allow for a good bundle segmentation in the
submitted data of most teams, the endpoint ROIs had to be created very large, sometimes up to a 16-pass
dilation of the GT bundles’ endpoint ROIs, and up to an 11-pass dilation of the bundles’ “all” masks. This could
reveal that the stopping criterion was not well defined in many processing pipelines. It generally depends
on a WM mask, which may come either from a thresholded FA map (typically ~0.1 to 0.2) or from a
segmentation from the T1. In the first case, the simulated DWI may have behaved differently than usual and
provided FA values that would require a different threshold. In the second case, the T1 is also simulated.
Segmentation algorithms were not created to deal with “fake” images and may have resulted in WM
masks of lesser quality. We consider that the goal of this challenge was to evaluate the ability of
tractography algorithms to understand diffusion information and to follow diffusion anisotropy
information through challenging paths such as fiber crossings and bottlenecks. We have decided not to
penalize submissions with streamlines going further than expected. For instance, some submissions had
streamlines from the OR going out of the thalamus without stopping, or streamlines from the Fornix
looping very far off the mamillary bodies, or even streamlines going out of the brain. Our ROIs thus spill
out of realistic anatomical regions in an attempt to include the biggest part of every submission's
bundles. We can still segment bundles correctly by combining the endpoint ROIs with the “all” masks.
Analysis of the score differences. Compared to the initial analysis [4], it is still true that teams were able
to recover most bundles. It is also still true that, on average, only half of the streamlines in the submitted
tractograms are valid streamlines. Finally, we still find that probabilistic tracking may help generate the
highest OL, but with the highest ORgt when compared to deterministic tracking, resulting in small changes
in the Dice score.
VB:CA and CP are still the two most dicult bundles to reconstruct, but although they are still a well-
dened category inFig5, it is to a lesser extent. Using Recobundles, CP was scored after CC; these
streamlines were often associated to the CC and thus ignored when segmenting the CP. Other changes in
recovered bundles are explained by the fact that newly found bundles generally contained only a few very
small streamlines that may be harder to compare with reference streamlines using the MDF metric (in
Recobundles). The hard-to-track and medium-diculty bundles (Fig5) are now less separate categories.
IB: Invalid bundles can no longer be scored due to the large size of the ROIs. It would be possible to add
an additional analysis step and segment the invalid streamlines (IS) into invalid bundles (IB) using
Quickbundles, similarly as before. We chose not to include this here as it is prone to the same instability
as Recobundles that we so firmly seek to avoid. The number of invalid bundles obtained with
Quickbundles depends strongly on the type of invalid streamlines. Even a few misplaced streamlines may
lead to a rapid increase in IB, which should not be used to infer the quality of the scored tractogram. We
do recognize that the IB analysis was useful in the original article to visualize the typical errors recovered
recurrently over multiple submissions, but the IB score itself should be used carefully.
VS/IS: Often, the additional recovered streamlines were of very poor quality, and other metrics were not
improved much. The total percentage of VS, averaged over all teams, all bundles, only varied by less than
1%. Yet, it represents an average of 1000 streamlines per submission. In the future, with algorithms
becoming ever better and researchers trying to push the limits of tractography, these small differences in
scoring could impact researcher choices in implementation.
Bundle coverage: Despite the big changes in the total number of recovered streamlines in individual
bundles throughout the 96 submitted tractograms, general scoring metrics stayed similar, but ranking
amongst teams was modied.
Suggestions for the creation of a new phantom. The final comparison of “winners” based on the Dice
score, either in the original analysis or here, did not allow a clear definition of the best tractography
parameters. This can be explained by the large influence of preprocessing steps such as the choice of
tracking space, the tracking masks, the registration quality, and so on. Future phantoms should limit these
possibilities so that they assess specifically our ability to follow diffusion information in
the brain, in other words, the “tracking” aspect, rather than the quality of the whole pipeline. We present
here some afterthoughts.
1. The level of complexity in the challenge data was good. It presented human-like geometries with
multiple bundle crossings or bottlenecks. Its number of bundles was good and allowed a scoring
system.
2. The associated simulated T1 data, however, was not realistic enough to allow good results in
segmentation software such as Freesurfer [26] or FSL FAST [27] for instance. We suggest that future
work should include a list of potentially interesting masks, particularly a WM mask that could be
used as a tracking mask.
3. The quality of individual streamlines, not only of bundles as whole entities, should be verified in the
GT and during scoring.
4. Developers should specify a way for users to verify their tractogram format to prevent shifts (e.g.,
±0.5 when the origin of a voxel coordinate is considered at the center or the corner of the voxel) or
swapping of axes during interpretation (e.g., by specifying the orientation).
5. Developers should specify in which space the final scoring will be performed. Users applying a
substandard registration between T1 and DWI spaces could be strongly disadvantaged, even if their
tracking algorithm itself was perfect.
6. OL, ORgt, and Dice scores offer good insights, but the literature still lacks metrics comparing the shape of
individual streamlines; this gap should be addressed.
Analysis of the Tractoflow-processed data. The data was processed using state-of-the-art tools and
presented very good scores.
One general comment seen in machine learning studies was that differences in scores may
arise from the choice of model, but also from the training data. Therefore, we also offer the Tractoflow-
processed data in open access on the website. It could be used as common training data.
Conclusion
We proposed a new and enhanced Tractometer scoring system based on manually defined regions of
interest rather than on bundle recognition. Bundles are now more reliably segmented, offering more stable
metrics with higher precision for future users of this phantom and its scoring system. We provide on the
Tractometer website all tools necessary for a robust scoring of any new tractogram with our new scoring
system: the ROIs and configuration files necessary to run the code, the tables of detailed results, and the
Tractoflow-processed data.
This should help researchers better develop and evaluate the next generation of tractography algorithms.
Methods
A. Verification of the original scores. The original code was converted to python3, proofread, and
reviewed to ensure it still met today’s standards. Metrics terminology was revised. All 2015
submissions were scored again.
The original code included forced shifting (adding 0.5 values) of .trk (trackvis) files. In the updated code,
tractograms are simply loaded through dipy’s load_tractogram method. No further verification is
performed on the validity of space attributes.
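The effect of a 0.5 shift can be illustrated with a small sketch of the two common voxel-origin conventions (center or corner of the voxel). It does not reproduce the original shifting code or dipy’s loading logic; the function names and the example point are hypothetical. It only shows why a tractogram read with the wrong convention can be assigned to neighboring voxels.

```python
from math import floor


def center_to_corner(point):
    """Convert voxel coordinates from a center origin to a corner origin.

    With a center origin, voxel (0, 0, 0) spans [-0.5, 0.5) on each axis;
    with a corner origin it spans [0, 1). The conversion is a +0.5 shift.
    """
    return tuple(c + 0.5 for c in point)


def voxel_index_corner(point):
    """Voxel index of a point expressed with a corner origin."""
    return tuple(int(floor(c)) for c in point)


# A point stored with a center origin; it lies in voxel (2, 0, 0).
p_center = (1.6, 0.2, -0.3)

# Misreading it as corner-origin data assigns it to a different voxel.
misread = voxel_index_corner(p_center)
correct = voxel_index_corner(center_to_corner(p_center))
```

Here `misread` differs from `correct` on two axes, which is the kind of silent shift users are asked to check for when loading their own tractograms.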
B. Curation of the GT tractogram. The GT bundles were modified to allow creation of the ROIs.
Streamlines from the GT bundles were filtered to keep only those with a length in the range 20-200 mm;
those outside this range (generally streamlines presenting looping shapes) or detected as loops using scilpy
were discarded (see https://scilpy.readthedocs.io/). Others were discarded based on visual analysis of the bundles. CST,
POPT, and FPT were too similar and difficult to segment adequately (Fig. 1) and were gathered into a new
bundle called Brainstem Projection System (BPS). The ILF and OR were also too similar, preventing a
good segmentation (Fig. 1), either with Recobundles or with ROIs. In this case, we chose to filter out some
streamlines to better separate the two bundles.
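The length criterion can be sketched as below. This is a minimal illustration, assuming streamlines are lists of 3D points expressed in millimeters; the function names are hypothetical, and the loop detection performed with scilpy and the manual rejections are not reproduced.

```python
from math import dist


def streamline_length(points):
    """Length of a polyline: sum of the Euclidean lengths of its segments (mm)."""
    return sum(dist(a, b) for a, b in zip(points, points[1:]))


def filter_by_length(streamlines, min_mm=20.0, max_mm=200.0):
    """Split streamlines into (kept, rejected) using the 20-200 mm range."""
    kept, rejected = [], []
    for s in streamlines:
        if min_mm <= streamline_length(s) <= max_mm:
            kept.append(s)
        else:
            rejected.append(s)
    return kept, rejected


# Made-up streamlines: one of 10 mm (too short) and one of 70 mm.
short = [(0, 0, 0), (10, 0, 0)]
ok = [(0, 0, 0), (30, 0, 0), (30, 40, 0)]
kept, rejected = filter_by_length([short, ok])
```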
C. Creation of an ROI-based segmentation system. All of the masks were created by looking carefully at
both the GT data and the general distribution of results from the tractograms submitted to the Challenge
in 2015.
Endpoint ROIs: GT streamlines’ endpoints were saved as head and tail masks. We then dilated these two
masks (11-pass on average, see Fig. 4). Some endpoint ROIs were modified manually based on visual
inspection of results. Examples of modifications were: dilation to reach the end of the cortex in some
regions, manual dilation of the OR’s ROI to include more of the thalamus without spilling into the ILF,
manual separation between hemispheres, and careful separation of anterior/posterior ROIs in the case of the
cingulum and of the fornix. The CC was separated into sub-bundles for segmentation purposes
(CC_u_shaped, CC_ventro_striatal1, CC_ventro_striatal2, CC_temporal), allowing for a better delimitation
of endpoint ROIs. However, only the total CC, composed of the re-merged sub-bundles, is used during
scoring. Similarly, the ICP was segmented into ICP_part1 (similar to its anatomical definition) and
ICP_part2 (looping back into the cerebellar cortex).
“All” masks: GT streamline paths were saved as binary masks and dilated (by default, the number of
passes was 3 but some bundles required varied parameters, up to an 11-pass dilation for the CC). These GT
masks were combined with both endpoint ROIs for each bundle. Manual modifications were also applied,
generally more manual dilation.
“Any” masks: They were defined using manually positioned boxes of interest.
D. Influence of the bundle masks on scores. To allow comparing new and old scores, original bundle
masks were computed again using more recent technology. As suggested in 2017 by Rheault et al. [13],
bundle masks should not recover only voxels containing streamline points (even after resampling), but
should rather account for the whole segment between two points. We computed the new masks with
scilpy. Bundles segmented using the Recobundles-based system were scored again using the same
metrics but with the new GT bundle masks. Final Dice scores, averaged over all bundles, were compared
to previous scores using a Student's t-test.
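The difference between a points-only mask and a segment-aware mask can be illustrated with the following sketch. It approximates the segment between two consecutive points by dense linear sampling, which is an assumption made for brevity: it is not the scilpy implementation, and a small enough step is needed to avoid skipping voxels that a segment only clips. Coordinates are assumed to be in voxel space, and the function names are hypothetical.

```python
import math
from typing import Sequence, Set, Tuple

Voxel = Tuple[int, int, int]
Point = Tuple[float, float, float]


def to_voxel(point: Sequence[float]) -> Voxel:
    """Map a point in voxel space (corner origin) to the index of its voxel."""
    return (math.floor(point[0]), math.floor(point[1]), math.floor(point[2]))


def points_only_mask(streamlines: Sequence[Sequence[Point]]) -> Set[Voxel]:
    """Mask containing only the voxels that hold a streamline point."""
    return {to_voxel(p) for s in streamlines for p in s}


def segment_aware_mask(
    streamlines: Sequence[Sequence[Point]], step: float = 0.1
) -> Set[Voxel]:
    """Mask that also covers voxels crossed between consecutive points."""
    mask: Set[Voxel] = set()
    for s in streamlines:
        if len(s) == 1:
            mask.add(to_voxel(s[0]))
        for a, b in zip(s[:-1], s[1:]):
            # Number of sub-intervals so that samples are at most `step` apart.
            n = max(1, math.ceil(math.dist(a, b) / step))
            for i in range(n + 1):
                t = i / n
                mask.add(to_voxel([a[d] + t * (b[d] - a[d]) for d in range(3)]))
    return mask


# A single long segment spanning four voxels along x (hypothetical values).
toy = [[(0.5, 0.5, 0.5), (3.5, 0.5, 0.5)]]
sparse = points_only_mask(toy)
dense = segment_aware_mask(toy)
```

With this toy streamline, the points-only mask contains just the two end voxels, whereas the segment-aware mask also contains the two voxels in between.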
E. Influence of the new scoring system on scores. Newly segmented bundles of the 96 submissions were
scored using the same metrics as before. Again, final Dice scores, averaged over all bundles, were
compared to previous scores using a Student's t-test.
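The comparisons in sections D and E can be illustrated with a t statistic computed using only the Python standard library. The paired design (each submission scored under both versions) is an assumption, as the text only mentions a Student's t-test, and the Dice values below are placeholders, not results from this study.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(
    old: Sequence[float], new: Sequence[float]
) -> Tuple[float, int]:
    """Return the paired t statistic and its degrees of freedom."""
    if len(old) != len(new) or len(old) < 2:
        raise ValueError("Need two samples of equal length (at least 2 pairs).")
    diffs = [n - o for o, n in zip(old, new)]
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("All differences are identical; t is undefined.")
    t = statistics.mean(diffs) / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder mean Dice scores per submission (NOT data from the paper).
old_dice = [0.40, 0.50, 0.60]
new_dice = [0.50, 0.70, 0.90]
t_value, dof = paired_t_statistic(old_dice, new_dice)
```

The p-value would then be read from the t distribution with `dof` degrees of freedom; when SciPy is available, `scipy.stats.ttest_rel` returns both the statistic and the p-value for this paired design.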
F. Usage on new data. We prepared a new tractogram to be scored using recent state-of-the-art
techniques. The tractogram was prepared by running the TractoFlow pipeline [14] on the noisy DWI, using
the version with additional reversed b0 to allow topup correction. The pipeline was modified to skip the
N4 denoising step on the T1 data, which produced irregular results, probably because the T1 is in
fact a simulated dataset. Two tracking algorithms were tested: first, PFT tracking on WM maps; second,
local tracking on a WM mask that was first eroded (1-pass) and then dilated (2-pass) so that it passed
a visual quality check. Both versions were scored using the new system.
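The erosion followed by dilation of the WM mask can be sketched as below. As elsewhere, this is only an illustration under stated assumptions: the mask is a set of voxel indices, the structuring element is 6-connected (the text does not specify it), and the function names are hypothetical.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]

# Assumption: 6-connected (face-sharing) neighbourhood for both operations.
FACE_NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                   (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def dilate(mask: Set[Voxel], passes: int) -> Set[Voxel]:
    """Each pass adds the face neighbours of every voxel in the mask."""
    out = set(mask)
    for _ in range(passes):
        out |= {(i + di, j + dj, k + dk)
                for (i, j, k) in out
                for (di, dj, dk) in FACE_NEIGHBOURS}
    return out


def erode(mask: Set[Voxel], passes: int) -> Set[Voxel]:
    """Each pass keeps only voxels whose six face neighbours are in the mask."""
    out = set(mask)
    for _ in range(passes):
        out = {(i, j, k) for (i, j, k) in out
               if all((i + di, j + dj, k + dk) in out
                      for (di, dj, dk) in FACE_NEIGHBOURS)}
    return out


# Toy WM mask: a 3x3x3 block plus one isolated voxel (hypothetical values).
wm = {(i, j, k) for i in range(3) for j in range(3) for k in range(3)}
wm.add((10, 10, 10))

# 1-pass erosion followed by 2-pass dilation, as described in the text.
cleaned = dilate(erode(wm, passes=1), passes=2)
```

The erosion removes isolated or thin parts of the mask, and the larger dilation then grows the remaining core slightly beyond its original extent.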
Abbreviations
List of acronyms for bundles
BPS: Brainstem Projection System, CA: Anterior commissure, CC: Corpus callosum, Cg: Cingulum, CP:
Posterior commissure, CST: Cortico-spinal tract, Fornix, FPT: Fronto-pontine tract, ICP: Inferior cerebellar
peduncle, ILF: Inferior longitudinal fasciculus, MCP: Middle cerebellar peduncle, OR: Optic radiation,
POPT: Parieto-occipital pontine tract, SCP: Superior cerebellar peduncle, SLF: Superior longitudinal
fasciculus, UF: Uncinate fasciculus
List of acronyms for metrics
OL: Overlap (percentage of GT voxels recovered), ORgt: Overreach (number of false positive voxels,
normalized by the volume of the GT bundle), f1: Equivalent to the Dice score, VB: valid bundles (number
of recovered bundles), VS: valid streamlines (number of streamlines in these VB), IS: invalid streamlines
(number of remaining streamlines).
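The three voxel-based metrics can be written directly from these definitions. The sketch below assumes that a bundle mask and its GT mask are given as sets of voxel indices; the function names are hypothetical and the toy masks are not taken from the phantom.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def overlap(bundle: Set[Voxel], gt: Set[Voxel]) -> float:
    """OL: fraction of GT voxels recovered by the bundle."""
    if not gt:
        raise ValueError("The GT mask is empty.")
    return len(bundle & gt) / len(gt)


def overreach_gt(bundle: Set[Voxel], gt: Set[Voxel]) -> float:
    """ORgt: false positive voxels, normalized by the GT volume."""
    if not gt:
        raise ValueError("The GT mask is empty.")
    return len(bundle - gt) / len(gt)


def dice(bundle: Set[Voxel], gt: Set[Voxel]) -> float:
    """f1 / Dice: 2 |B ∩ GT| / (|B| + |GT|)."""
    if not bundle and not gt:
        raise ValueError("Both masks are empty.")
    return 2 * len(bundle & gt) / (len(bundle) + len(gt))


# Toy masks: 4 GT voxels; the bundle recovers 3 of them and adds 1 outside.
gt_mask = {(i, 0, 0) for i in range(4)}
bundle_mask = {(1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0)}

ol = overlap(bundle_mask, gt_mask)          # 3 / 4 = 0.75
orgt = overreach_gt(bundle_mask, gt_mask)   # 1 / 4 = 0.25
f1 = dice(bundle_mask, gt_mask)             # 2 * 3 / (4 + 4) = 0.75
```

Note that ORgt, being normalized by the GT volume rather than by the bundle volume, can exceed 1 for a bundle that spills far outside its GT mask.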
Declarations
Data and code availability
The datasets generated during and/or analysed during the current study are available on the Tractometer
website: www.tractometer.org.
Acknowledgement
The authors are grateful to the Fonds de recherche du Québec - Nature et technologies (FRQNT) and the
Natural Sciences and Engineering Research Council of Canada (NSERC) programs for funding this
research.
Author contributions
ER and AT proof-read the original code and prepared the scripts in scilpy for the new scoring. They also
verified the format of submitted tractograms and the scoring. ER prepared the ROIs and other necessary
masks for the new segmentation process and compared scores between versions. ER wrote the
manuscript, and MD, LP and AT provided feedback. JCH was the project leader in the previous version
and answered our questions concerning the original code and data.
Competing interests
The author(s) declare no competing interests.
References
1. Drobnjak, I., Neher, P., Poupon, C. & Sarwar, T. Physical and digital phantoms for validating tractography and assessing artifacts. NeuroImage 245, (2021).
2. Rheault, F., Poulin, P., Valcourt Caron, A., St-Onge, E. & Descoteaux, M. Common misconceptions, hidden biases and modern challenges of dMRI tractography. J. Neural Eng. 17, (2020).
3. Côté, M. A. et al. Tractometer: Towards validation of tractography pipelines. Med. Image Anal. 17, 844–857 (2013).
4. Maier-Hein, K. H. et al. The challenge of mapping the human connectome based on diffusion tractography. Nat. Commun. 8, (2017).
5. Neher, P., Côté, M.-A., Houde, J.-C., Descoteaux, M. & Maier-Hein, K. Fiber tractography using machine learning. NeuroImage 158, 417–429 (2017).
6. Benou, I. & Riklin Raviv, T. DeepTract: A probabilistic deep learning framework for white matter fiber tractography. Lect. Notes Comput. Sci. Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinforma. 11766 LNCS, 626–635 (2019).
7. Poulin, P. et al. Learn to track: Deep learning for tractography. Lect. Notes Comput. Sci. Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinforma. 10433 LNCS, 540–547 (2017).
8. Wegmayr, V. & Buhmann, J. M. Entrack: Probabilistic spherical regression with entropy regularization for fiber tractography. Int. J. Comput. Vis. 129, 656–680 (2020).
9. Théberge, A., Descoteaux, M., Desrosiers, C. & Jodoin, P. M. Track-to-learn: A general framework for tractography with deep reinforcement learning. Med. Image Anal. 102093 (2021) doi:10.1101/2020.11.16.385229.
10. Rheault, F. et al. Bundle-specific tractography. in 129–139 (2018). doi:10.1007/978-3-319-73839-0_10.
11. Wasserthal, J., Neher, P. & Maier-Hein, K. H. TractSeg - Fast and accurate white matter tract segmentation. NeuroImage 183, 239–253 (2018).
12. Garyfallidis, E. et al. Recognition of white matter bundles using local and global streamline-based registration and clustering. NeuroImage 170, 283–295 (2018).
13. Rheault, F., Houde, J.-C. & Descoteaux, M. Visualization, interaction and tractometry: dealing with millions of streamlines from diffusion MRI tractography. Front. Neuroinformatics 11, (2017).
14. Theaud, G., Houde, J., Bor, A., Morency, F. & Descoteaux, M. TractoFlow: A robust, efficient and reproducible diffusion MRI pipeline leveraging Nextflow & Singularity. NeuroImage 218, (2020).
15. Garyfallidis, E., Brett, M., Correia, M. M., Williams, G. B. & Nimmo-Smith, I. QuickBundles, a method for tractography simplification. Front. Neurosci. 6, (2012).
16. Neher, P. F., Laun, F. B., Stieltjes, B. & Maier-Hein, K. H. Fiberfox: Facilitating the creation of realistic white matter software phantoms. Magn. Reson. Med. 72, 1460–1470 (2014).
17. Bullock, D. N. et al. A taxonomy of the brain's white matter: twenty-one major tracts for the 21st century. Cereb. Cortex (2022) doi:10.1093/cercor/bhab500.
18. Francisco, A. & Montiel, J. One hundred million years of interhemispheric communication: the history of the corpus callosum. Brazilian journal of medical and biological research 409–420 (2003).
19. De Benedictis, A. et al. New insights in the homotopic and heterotopic connectivity of the frontal portion of the human corpus callosum revealed by microdissection and diffusion tractography. Hum. Brain Mapp. 37, 4718–4735 (2016).
20. Wu, Y., Sun, D., Wang, Y., Wang, Y. & Ou, S. Segmentation of the cingulum bundle in the human brain: A new perspective based on DSI tractography and fiber dissection study. Front. Neuroanat. 10, (2016).
21. Sarubbo, S. et al. The course and the anatomo-functional relationships of the optic radiation: a combined study with ‘post mortem’ dissections and ‘in vivo’ direct electrical mapping. J. Anat. 226, 47–59 (2015).
22. Falconer, M. A. & Wilson, J. L. Visual field changes following anterior temporal lobectomy: Their significance in relation to ‘Meyer’s loop’ of the optic radiation. Brain 81, part 1, (1958).
23. Panesar, S. S., Yeh, F.-C., Jacquesson, T., Hula, W. & Fernandez-Miranda, J. C. A quantitative tractography study into the connectivity, segmentation and laterality of the human inferior longitudinal fasciculus. Front. Neuroanat. 12, (2018).
24. Hau, J. et al. Revisiting the human uncinate fasciculus, its subcomponents and asymmetries with stem-based tractography and microdissection validation. Brain Struct. Funct. 222, 1645–1662 (2017).
25. Chenot, Q. et al. A population-based atlas of the human pyramidal tract in 410 healthy participants. Brain Struct. Funct. 224, 599–612 (2019).
26. Dale, A. M., Fischl, B. & Sereno, M. I. Cortical surface-based Analysis: I. segmentation and surface reconstruction. NeuroImage 9, 179–194 (1999).
27. Zhang, Y., Brady, M. & Smith, S. Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Trans. Med. Imaging 20, 45–57 (2001).
Figures
Figure 1
Erroneous bundle segmentation examples. A) ILF (red) and OR (blue) with B) an example of sub-optimal
bundle segmentation in submission 1.3 (using Recobundles). C) FPT (pink), CST (orange), and POPT
(blue), with D) streamlines recovered for these bundles from all 2015 submissions. The GT bundle mask
borders are shown in a darker contour. We can see that streamlines were sometimes assigned arbitrarily to one or
the other bundle, particularly in the center.
Figure 2
Recobundles led to poor results on some bundles. The top row shows the MCP in sagittal view. A) 2015’s
GT. B) Streamlines recovered by Recobundles from all submissions. They include vertical streamlines
that should not belong to the MCP. C) Streamlines recovered using our new ROI-based segmentation. D, E,
and F present similar patterns for the SLF.
Figure 3
Examples of looping bers that were hidden in the original GT tractogram.
Figure 4
Examples of possible endpoint ROIs. A) OR, B) MCP, C and D) SLF. Various degrees of dilation were
tested. Bigger ROIs such as in B and D were necessary to adequately score all submitted tractograms.
Figure 5
Overlap (OL) vs Overreach (ORgt) scores in 2022 vs 2015 (with updated masks). Best results should have
high overlap (top) and low overreach (left). Top graphs: scores per bundle (averaged over all teams).
Colors reflect the differences between easy (blue), average (green) and hard-to-track (pink) bundles [4].
Bottom graphs: scores per submission (averaged over all bundles). Colors reflect the algorithm choice:
deterministic (blue), probabilistic (orange) or others (gray).