ClairS: a deep-learning method for long-read somatic small variant calling

Preprints and early-stage research may not have been peer reviewed yet.

Abstract

Identifying somatic variants in tumor samples is a crucial task, which is often performed using statistical methods and heuristic filters applied to short-read data. However, with the increasing demand for long-read somatic variant calling, existing methods have fallen short. To address this gap, we present ClairS, the first deep-learning-based, long-read somatic small variant caller. ClairS was trained on a massive number of synthetic somatic variants with diverse coverages and variant allele frequencies (VAF), enabling it to accurately detect a wide range of somatic variants from paired tumor and normal samples. We evaluated ClairS using the latest Nanopore Q20+ HCC1395-HCC1395BL dataset. With 50-fold/25-fold tumor/normal coverage, ClairS achieved 93.01%/86.86% precision/recall for single-nucleotide variants (SNVs), and 66.54%/66.89% for somatic insertions and deletions (Indels). Applying ClairS to short-read datasets from multiple sources showed comparable or better performance than Strelka2 and Mutect2. Our findings suggest that improved read phasing enabled by long-read sequencing is key to accurate long-read SNV calling, especially for variants with low VAF. Through experiments across various coverage, purity, and contamination settings, we demonstrated that ClairS is a reliable somatic variant caller. ClairS is open-source at https://github.com/HKU-BAL/ClairS.

Zhenxian Zheng#, Junhao Su#, Lei Chen#, Yan-Lam Lee, Tak-Wah Lam, Ruibang Luo*

Department of Computer Science, The University of Hong Kong, Hong Kong, China

# These authors contributed equally to this work
* To whom correspondence should be addressed. Email: rbluo@cs.hku.hk

bioRxiv preprint doi: https://doi.org/10.1101/2023.08.17.553778; this version posted August 21, 2023. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.

Introduction

Analysis of cancer genomes that identifies and characterizes somatic variants has enabled a better understanding of tumor progression [1] and led to precision oncology [2]. Identifying somatic variants, however, remains challenging due to intra- and inter-tumor heterogeneity, which often leads to low VAF, and confounding factors, including sequencing artifacts, inadequate sequencing coverage, and normal contamination [3]. Endeavors have been made to address these challenges and maximize sensitivity and accuracy in identifying somatic variants using next-generation sequencing (NGS) short reads [4-13]. However, constrained by read length, short reads have limited variant discovery capability in hard-to-map genomic regions, such as homopolymers and segmental duplications. This problem is expected to be alleviated through long-read sequencing [14]. Oxford Nanopore Technologies (ONT) is among the leading long-read sequencing technologies and offers miniaturized sequencing devices and fast sample-to-data turnaround, which is a credible step towards democratizing sequencing by drastically reducing the cost of carrying out sequencing experiments. ONT raw reads were reported to have an error rate of 3-15% in the past [15]. This was reduced to 1% or lower using ONT's latest Q20+ chemistry [16]. The gap is still significant, however, compared with NGS short reads, which have an average error rate of 0.1% [17], making the somatic variant callers once designed for short reads practically unworkable for ONT long reads.

Germline variants are often considered easier to identify correctly than somatic variants. The first attempt to call germline small variants using noisy ONT long reads was made by Clairvoyante in 2018 [18]. The work was enabled by 1) a deep neural network, which was first used for variant calling by DeepVariant [19]; and 2) the high-quality known truth variants in GIAB reference samples for neural network model training [20]. Successors to Clairvoyante, including Clair [21] and Clair3 [22], introduced optimized network input, network output, network architecture, and workflow designs to make the best of noisy ONT data for germline small variant calling. Both Clair3 and a pipeline named PEPPER-Margin-DeepVariant [23] (DeepVariant), which is also designed for ONT long-read germline small-variant calling, have demonstrated better single nucleotide polymorphism (SNP)-calling performance than the same coverage of Illumina short reads. However, while solutions are ready for ONT long-read germline small-variant calling, there has been no caller available for ONT long-read somatic small-variant calling. We note that the ONT long-read somatic SV (structural variant) callers developed in the past year, including Sniffles2 [24] and Nanomonsv [25], called for the development of a small variant caller to complete the ONT long-read somatic variant-calling workflow.

Unfortunately, some designs critical to ONT long-read germline variant calling are not applicable to somatic variant calling. First, in their network output, both Clair3 and DeepVariant apply a strong diploid genome assumption. Clair3 uses a 21-genotype output, which is a two-combination of A, C, G, T, insertion, and deletion [22]. DeepVariant uses a three-category output that includes hom-ref (homozygous reference), het (heterozygous), and hom-alt (homozygous alternative) [23]. Both Clair3 and DeepVariant are classification models that use the observed allele frequency of alternative alleles as network input, and output the category that represents the expected allele frequency of a variant (e.g., allele frequency 0, 0.5, 1, and 0.5/0.5 for genotype 0/0, 0/1, 1/1, and 1/2, respectively). However, somatic variants have VAF ranging continuously from 0 to 1. Without a certain ploidy, somatic variant candidates have no expected allele frequency for a model to test against. Thus, a new design is required. As an example, a new design could use a regression model to derive VAF directly, or a classification model to simply determine whether a candidate is a somatic variant or not and infer VAF subsequently. Second, the seven standard GIAB reference samples HG001–HG007 provide approximately 25 million truth germline variants [20], which are critical for the model training of any state-of-the-art, deep-learning-based germline variant caller. However, in terms of known truth somatic variants, only the HCC1395/HCC1395BL tumor-normal pair (a human triple-negative breast cancer cell line and a normal cell line derived from the B lymphocytes of the same donor, hereafter referred to as HCC1395/BL) was published by the Somatic Mutation Working Group of the SEQC2 (Sequencing Quality Control Phase II) consortium [3]. It contains only 39,560 SNVs and 1,922 Indels, which is orders of magnitude fewer than the available truth germline variants, and far from enough for deep neural-network model training. In the absence of adequate real tumor-normal samples, one could think of synthetic data as a solution. Bamsurgeon spiked somatic variants into sequencing reads to mimic a tumor [26]. This was a successful method for short reads, but it does not work for long reads for two reasons: 1) Illumina short reads are read one base at a time, whereas ONT long reads are read as the signals of a sliding 5-mer or 6-mer window. That is, a spike-in variant also changes the signals of the adjacent bases. Bases can be base-called from signals but cannot be authentically turned back into signals, so the spike-in method cannot be applied to long reads; and 2) Bamsurgeon does not take into account that a somatic variant is usually found in only one haplotype (either maternal or paternal) and is missing in the other. The use of long reads over short reads for somatic variant calling makes sense only if the long-read advantage of haplotyping (also called phasing) is utilized. All things considered, a new method for synthesizing long-read data to contain abundant plausible somatic variants is needed.
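
The mismatch between a diploid genotype output and continuous somatic VAF can be made concrete with a minimal Python sketch. It is an illustration only and is not how Clair3 or DeepVariant classify variants: the helper `nearest_diploid_genotype` is hypothetical, and it simply snaps an observed allele frequency to the closest expected frequency from the genotype list given above.

```python
# Expected alternative-allele frequencies under a diploid genotype model,
# as listed in the text above.
EXPECTED_ALT_AF = {
    "0/0": (0.0,),       # homozygous reference
    "0/1": (0.5,),       # heterozygous
    "1/1": (1.0,),       # homozygous alternative
    "1/2": (0.5, 0.5),   # two different alternative alleles
}


def nearest_diploid_genotype(observed_af: float) -> str:
    """Hypothetical helper: return the single-alt genotype whose expected
    allele frequency is closest to the observed allele frequency."""
    single_alt = {gt: afs[0] for gt, afs in EXPECTED_ALT_AF.items() if len(afs) == 1}
    return min(single_alt, key=lambda gt: abs(single_alt[gt] - observed_af))


if __name__ == "__main__":
    for vaf in (0.05, 0.2, 0.5, 0.9):
        print(f"observed AF {vaf:.2f} -> nearest genotype {nearest_diploid_genotype(vaf)}")
```

Under this toy rule, a somatic variant at VAF 0.05 or 0.2 lands on 0/0 and would be read as reference, which illustrates why a model with no expected allele frequency to test against needs a different output design.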

In this study, we present Clair-Somatic (ClairS), the first somatic small variant caller for ONT long reads, which is named after its germline variant caller predecessors. ClairS draws on the successful experience of the Clair series, and uses a new network output and a new workflow design to address the continuous VAF space of somatic variants. By considering two different samples, A and B, as tumor and normal, respectively, and deeming a germline variant specific to A as a somatic variant against B, we devised a data synthesis strategy that uses only the real long reads of GIAB reference samples with known germline variants, but can simulate somatic variants of any tumor purity, sequencing coverage (lower than the data source), and level of normal contamination. The strategy can theoretically produce an infinite number of somatic variants for model training. We show results to highlight how phasing improves somatic variant-calling performance on long reads. To leverage more remote alignment information that is computationally impractical to include in the network input, we devised a post-processing step that searches for ancestral haplotype support for any somatic variant candidates. This step removed a considerable amount of false positive calls in our experiments. For benchmarking, we sequenced in total 75-fold HCC1395 and 45-fold HCC1395BL ONT Q20+ long reads (data deposited to NCBI SRA), and used the truth somatic variants provided by the SEQC2 consortium [3]. With 50-fold/25-fold tumor/normal, ClairS achieved 86.86%/93.01% recall/precision for SNVs, and 66.89%/66.54% for somatic Indels when targeting VAF ≥0.05. For variants with VAF ≥0.2, the numbers go up to 94.65%/96.63% for SNVs, and 73.22%/77.35% for somatic Indels. We also show the performance of ClairS at different tumor/normal coverages, tumor purity and normal contamination. ClairS is designed for ONT long reads, but the whole method is also applicable to Illumina short reads. This versatility allowed us to benchmark ClairS against state-of-the-art short-read somatic variant callers, considering that there is no other long-read somatic variant caller to benchmark against. The results show that ClairS performed comparably or slightly better than the current heuristic-based and deep-learning-based callers on short reads.
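
As a rough illustration of the A-versus-B idea described above, the sketch below computes the expected VAF of a germline variant that is present in sample A and absent in sample B when reads from both are mixed. This is a simplified assumption made for illustration and is not the procedure from the Method section; the function names and the simple proportional mixing model are hypothetical.

```python
def expected_synthetic_vaf(alt_af_in_a: float, fraction_from_a: float) -> float:
    """Expected VAF, in a mixed read set, of a germline variant that is present
    in sample A (at allele frequency alt_af_in_a) and absent in sample B.
    Assumes reads are mixed in simple proportion (an assumption of this sketch)."""
    if not 0.0 <= fraction_from_a <= 1.0:
        raise ValueError("fraction_from_a must be between 0 and 1")
    return alt_af_in_a * fraction_from_a


def coverage_to_sample(target_coverage: float, fraction_from_a: float) -> tuple:
    """Coverage to draw from A and from B to reach target_coverage."""
    from_a = target_coverage * fraction_from_a
    return from_a, target_coverage - from_a


if __name__ == "__main__":
    # A heterozygous variant in A (allele frequency 0.5), with 40% of reads drawn from A.
    print(expected_synthetic_vaf(0.5, 0.4))
    print(coverage_to_sample(50, 0.4))
```

Varying the mixing fraction sweeps the expected VAF continuously between 0 and the germline allele frequency, which is one way a small set of germline truth variants could cover the continuous VAF space.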

Results

The ClairS method

The ClairS method is elaborated in the Method section. Phrases in quotation marks in this and the following two paragraphs are subsection titles in the Method section. In the use of deep learning for long-read somatic variant calling, ClairS has made breakthroughs in "Training data synthesis" and the "ClairS workflow and design".

Regarding training data synthesis, Figure 1a shows the workflow for "Generating synthetic tumor and synthetic normal". As we required sampling without replacement, and in view of the common practice of a tumor sample having higher coverage than its matching normal, we gave "Coverage advice for the source data". Details of how homozygous and heterozygous germline variants in different samples are converted into somatic variants are given in "Deriving multiple categories of variants from a synthetic tumor/normal pair". Setups to avoid practical concerns and limitations of the data synthesis method are given in "Other details about the variants selected for model training". Our data synthesis method is based on the observation that an authentic somatic variant is usually found in the reads of a single haplotype (depicted in Extended Data Figure 1a). Also, we found that "Phasing information enhances somatic variant calling performance" in ClairS.

Regarding the ClairS workflow and design, Figure 2a shows an "Overview" of the ClairS workflow. Figure 2b shows "Step 1: Germline variant calling, phasing and read haplotagging", and Figure 2c shows "Step 2: Pileup-based and full-alignment based variant calling". For each sample, only a fraction of genome positions has alternative allele support and, among them, only a few have the potential to be called somatic variants, so we used heuristics for "Selecting variant candidates". Details of "The design of pileup input and full-alignment input" and "The design of neural networks" are shown in Figure 3. The ways in which ClairS differs from its predecessor, Clair3, are discussed in Method. Figure 2d shows "Step 3: Search for ancestral haplotype support", which is a post-processing step that leverages more remote alignment information to search for ancestral haplotype support for the somatic variants called in step 2. To adapt to different usage scenarios, multiple "Output" options are provided.

ClairS performance on ONT data

A summary of the ONT data used for model training and benchmarking is shown in Supplementary Table 1. We trained the ClairS ONT model using synthetic data generated from two GIAB samples: HG001 and HG002 [20]. We used both HG001/HG002 and HG002/HG001 as tumor/normal samples for data synthesis. The HG002 sample has 76.29-fold coverage and was made available by Nanopore through EPI2ME Labs. The HG001 sample has 48.44-fold coverage and was sequenced at HKU, with details given in the Method section (ONT library preparation and sequencing). Following the convention of the state-of-the-art deep-learning-based variant callers, we excluded reads and variants from Chromosome 20 from model training.
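
The chromosome hold-out described above amounts to a simple partition by contig name. The sketch below is illustrative only: the record layout (tuples starting with a contig name), the helper name `split_holdout`, and the accepted spellings "chr20" and "20" are assumptions, not details taken from ClairS.

```python
# Contig names treated as the held-out chromosome (assumed spellings).
HELD_OUT_CONTIGS = {"chr20", "20"}


def split_holdout(records):
    """Split (contig, position, ...) records into a training list and a
    held-out list, keeping every chromosome 20 record out of training."""
    train, held_out = [], []
    for record in records:
        (held_out if record[0] in HELD_OUT_CONTIGS else train).append(record)
    return train, held_out


if __name__ == "__main__":
    variants = [("chr1", 12345), ("chr20", 5000), ("chr19", 777), ("chr20", 9000)]
    train, held_out = split_holdout(variants)
    print(len(train), "training records;", len(held_out), "held-out records")
```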

For benchmarking, we used the HCC1395/BL tumor-normal pair, with known truth variants provided by the SEQC2 consortium [3]. The HCC1395 sample has 75.97-fold coverage. The HCC1395BL sample has 45.55-fold coverage. Both samples were sequenced at two sequencing centers (HKU and Novogene) for quality control purposes. The yield of the two centers is shown in Supplementary Table 2. Only SEQC2 truth variants labeled "HighConf" (high confidence) and "MedConf" (medium confidence) were used for benchmarking. There are ~40k SNVs but only ~2k Indels in the truth set, and this big difference in the number of truth variants results in different analytical power between the two variant types. So we first benchmarked multiple coverages, tumor purity and normal contamination with only SNVs, and then benchmarked and discussed somatic Indel-calling performance in a separate section. We used Illumina's Haplotype Comparison Tools [27] to generate the performance figures, including F1-score, precision, and recall, and cross-validated them with the "compare_vcf" submodule in ClairS. More clarifications and parameters are shown in the Method section (Benchmarking).

For both model training and benchmarking, we used GRCh38, which is the newest reference genome version on which both GIAB and SEQC2 truth variants are based. All ONT sequencing data were base-called using Guppy version 6.1.5 and aligned to GRCh38 using minimap2 version 2.17-r941. The command line used is given in the Supplementary Notes (Command lines used). All data mentioned above, including ONT sequencing data, GIAB truth variants, SEQC2 truth variants, and reference genomes, are publicly accessible via links or SRA accession IDs listed in the Supplementary Notes (Data availability).

Performance at different combinations of tumor and normal coverage. We assessed the ClairS performance with different combinations of tumor and normal coverage. We tested three tumor coverages: 25-, 50-, and 75-fold. We started at 25-fold, as it represents a conservative throughput estimate of an R10.4.1 PromethION flowcell. We also tested three normal coverages: 20-, 25-, and 30-fold. The step between these coverages resembles the throughput variance of a single flowcell. Our experiments aimed to imitate a practical setting for clinical cancer diagnosis, with the tumor sample coverage increased one flowcell at a time to seek higher discovery power, but the normal sample coverage fixed at a single flowcell for cost-effectiveness.

The results are shown as Precision-Recall curves in Figure 4a. The evaluation metrics at two variant quality cutoffs (8 and 15) are shown in Supplementary Table 3. Quality cutoff 15 (hereafter referred to as "prioritize-f1 mode") filters more variants and aims for balanced precision and recall. Quality cutoff 8 ("prioritize-recall mode") retains more variants and aims for higher recall. In the prioritize-f1 mode, with normal coverage fixed at 25-fold, ClairS achieved 95.03%/78.71%/86.11%, 93.01%/86.86%/89.83%, and 92.94%/86.92%/89.83% precision/recall/f1-score at 25-, 50-, and 75-fold tumor coverage, respectively. From 25- to 50-fold, the recall increased from 78.11% to 86.86% (+8.75%), with a 2.02% precision drop (from 95.03% to 93.01%). From 50- to 75-fold, however, no improvement was observed. In the prioritize-recall mode, also with normal coverage fixed at 25-fold, ClairS achieved 82.66%/91.55%/86.88%, 70.21%/96.10%/81.14%, and 63.80%/96.60%/76.85% precision/recall/f1-score at 25-, 50-, and 75-fold tumor coverage, respectively. Compared to prioritize-f1, the prioritize-recall mode reached 91.55% recall at 25-fold (against 78.71%, +12.84%), and 96.10% at 50-fold (against 86.86%, +9.24%). Although Clair3 is not designed for somatic variants, to give a reference point, we conducted experiments with it using the same datasets by considering the germline variants found in tumor but not in normal as somatic variants. At 25-fold normal, Clair3 had a 19.27%/72.10%/30.42%, 29.82%/68.32%/41.52%, and 36.21%/64.56%/46.40% precision/recall/f1-score at 25-, 50-, and 75-fold tumor coverage. The results highlight the inappropriateness of using a germline variant caller for somatic variant calling.
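
For readers who want to check the reported triplets, the following Python sketch shows the standard definitions of precision, recall, and F1-score. It is a generic illustration and is not the hap.py or "compare_vcf" implementation; the function names are hypothetical.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F1-score from true positive, false positive,
    and false negative counts (standard definitions)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def f1_from_percent(precision_pct: float, recall_pct: float) -> float:
    """F1-score (in percent) as the harmonic mean of precision and recall."""
    return 2 * precision_pct * recall_pct / (precision_pct + recall_pct)


if __name__ == "__main__":
    # Precision/recall pairs reported above at 50-fold/25-fold tumor/normal coverage.
    print(round(f1_from_percent(93.01, 86.86), 2))  # prioritize-f1 mode
    print(round(f1_from_percent(70.21, 96.10), 2))  # prioritize-recall mode
```

Because F1 is a harmonic mean, the prioritize-recall mode's gain in recall does not offset its larger loss in precision at this coverage, which is why its F1-score is lower.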

In both prioritize-f1 and prioritize-recall modes, raising the normal coverage consistently increased somatic variant-calling performance. With tumor coverage fixed at 50-fold and using the prioritize-f1 mode, ClairS achieved a 91.86%/83.77%/87.63%, 93.01%/86.86%/89.83%, and 92.06%/88.51%/90.25% precision/recall/f1-score at 20-, 25-, and 30-fold normal coverage. The numbers were 68.36%/95.69%/79.75%, 70.21%/96.10%/81.14%, and 70.56%/96.43%/81.49% in the prioritize-recall mode.

Figure 4b shows the performance of ClairS in prioritize-recall mode, broken down into four VAF ranges: 0.5-1, 0.2-0.5, 0.1-0.2, and 0.05-0.1. At different coverages, the performance of ClairS at range 0.2-0.5 (low-mid) was found to be as good as at 0.5-1 (mid-high). For example, at 50/25-fold tumor/normal coverage, ClairS achieved a 94.7%/99.3%/96.9% precision/recall/f1-score at 0.5-1, and 95.2%/98.1%/96.7% at 0.2-0.5. At range 0.1-0.2, precision was reduced to ~60%, while recall plateaued at ~90%. At range 0.05-0.1, precision was further reduced to roughly 10% or lower (11.8%, 5.7%, and 4.6% at 25-, 50-, and 75-fold tumor coverage), while recall rose with increasing tumor coverage (16.5%, 32.8%, and 46.1%). The reason for the drop in precision was that the higher coverage led to a drastic increase in the number of variant candidates at very low VAF. At range 0.05-0.1, the number of candidates was about 131k, 310k, and 419k at 25-, 50-, and 75-fold tumor coverage.

Performance at different tumor purities and normal contamination. We assessed the performance of ClairS at different combinations of tumor purity (1.0, 0.8, 0.6, 0.4, and 0.2) and normal purity (1.0, 0.95, and 0.90). All the experiments in this section used 50-fold tumor and 25-fold normal coverage. The results are shown in Figure 4c and Supplementary Table 4. The two modes (prioritize-f1 and prioritize-recall) behaved differently with varying purity. With normal purity fixed at 1.0, in prioritize-f1 mode, precision remained above 90% (93.01%, 96.25%, 97.79%, 98.68%, and 98.99% at tumor purity 1.0, 0.8, 0.6, 0.4, and 0.2), while recall dropped (86.86%, 81.63%, 71.08%, 52.94%, and 22.43%) with decreasing tumor purity. In the prioritize-recall mode, precision varied (70.21%, 80.11%, 88.14%, 93.81%, and 97.54%), while recall was boosted, especially at lower tumor purity (96.10%, 94.45%, 90.51%, 80.81%, and 53.06%). According to these results, generally, we suggest using prioritize-f1 mode at higher tumor purity and prioritize-recall mode at lower tumor purity.

Our results also showed that lower normal purity harmed somatic variant-calling performance, especially on recall. With tumor purity fixed at 1.0, in prioritize-f1 mode, the recalls were 86.86%, 70.35%, and 51.74% at normal purity 1.0, 0.95, and 0.90. In prioritize-recall mode, the recalls were 96.10%, 85.36%, and 69.42%. Accordingly, we suggest using a high-purity normal sample with ClairS for somatic variant discovery. Lastly, we showed the results using 0.8/0.95 tumor/normal purity. In prioritize-recall mode, ClairS achieved a 79.68%/81.13%/80.40% precision/recall/f1-score, demonstrating ClairS' reliability in challenging sample conditions.
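
One way to build intuition for the purity results above is a simple mixing model, sketched below. The model is an assumption of this sketch and is not taken from the Method section: it treats tumor purity as the fraction of tumor cells in the tumor sample, treats (1 - normal purity) as the fraction of tumor cells contaminating the normal sample, and assumes a clonal heterozygous somatic variant in a copy-neutral diploid region (allele frequency 0.5 in tumor cells). The function name `expected_vafs` is hypothetical.

```python
def expected_vafs(tumor_purity: float, normal_purity: float,
                  af_in_tumor_cells: float = 0.5) -> tuple:
    """Expected variant allele frequency in the tumor and normal samples
    under a simple proportional mixing model (an illustrative assumption)."""
    tumor_vaf = af_in_tumor_cells * tumor_purity
    normal_vaf = af_in_tumor_cells * (1.0 - normal_purity)
    return tumor_vaf, normal_vaf


if __name__ == "__main__":
    for tumor_purity, normal_purity in [(1.0, 1.0), (0.8, 0.95), (0.2, 1.0), (1.0, 0.90)]:
        t, n = expected_vafs(tumor_purity, normal_purity)
        print(f"tumor purity {tumor_purity}, normal purity {normal_purity}: "
              f"tumor VAF {t:.3f}, normal VAF {n:.3f}")
```

Under these assumptions, lowering tumor purity pushes the tumor VAF toward the low-VAF ranges where calling is hardest, and lowering normal purity puts variant-supporting reads into the normal sample, both of which are consistent with the recall losses reported above.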

Analysis of False Positive and False Negative calls. Using 50-fold tumor and 25-fold normal coverage, we manually analyzed 300 false positive (FP) and 300 false negative (FN) calls randomly picked from all variant calls. Each FP and FN was assigned the most obvious limitation as the reason why the call was false. The reasons for the 600 false calls are listed in Supplementary Table 5. A distribution of the reasons is given in Figure 5 as a pie chart.

Among the false positive calls, 39% had no matching truth but had tumor 0.05≤VAF<0.1, 22% had tumor 0.1≤VAF<0.15, and 9% had tumor VAF≥0.15. As the Method section elaborates, these calls have tumor and normal coverage ≥4, and normal VAF below 0.05. One possible explanation is that ClairS was unable to pick up more hints to tell these calls from the true ones. Some of these cases might be called correctly with higher tumor coverage, which reduces the statistical bias in VAF. It is also possible that, since the SEQC2 truth set is still under active development, its incompleteness caused a few variants that are actually true to be misclassified as false positives. Another major category of false positive calls is likely to be caused by alignment artifacts because of a repetitive or imperfect genome reference sequence. It includes 6% "One or more deletions in flanking 50bp", 5% "In low complexity region", 4% "Excessive mismatches in alignment", 3% "One or more insertions in flanking 50bp", and in total, another 10% in repetitive regions of different types. Also, we found 1% false positive calls that were possibly caused by insufficient normal coverage.

Among the false negative calls, 40% of the truth variants that were not called had tumor VAF<0.1, 10% had "Normal VAF ≥0.05, but tumor VAF <6 times larger than normal VAF", and 3% had <3 reads supporting the variant allele in the tumor. These missed variants might be called if higher tumor coverage is given. False negative calls are more likely to be caused by alignment artifacts, as we observed 10% false negative calls in the homopolymer region, 10% in the low complexity region, 7% in the tandem repeat region, and in total, another 9% that were also likely to have been caused by alignment. We also observed 4% false negative calls caused by extreme strand bias (i.e., reads observed in only one strand), and 1% with base quality of all supporting bases <20.

Somatic Indel-calling performance. The SEQC2 truth set provides ~40k SNVs but only ~2k Indels. Owing to the scarcity of truth somatic Indels that could educe statistical biases, we benchmarked somatic Indel-calling separately. The results of different combinations of tumor coverage (25-, 50-, and 75-fold) and normal coverage (20-, 25-, and 30-fold) are shown in Supplementary Table 6. With normal coverage fixed at 25-fold, in prioritize-f1 mode (variant quality cutoff at 12), ClairS achieved 76.91%/49.36%/60.13%, 66.54%/66.89%/66.72%, and 57.79%/71.48%/63.91% precision/recall/f1-score at 25-, 50-, and 75-fold tumor coverage. In prioritize-recall mode (cutoff at 8), ClairS achieved 53.05%/65.46%/58.61%, 38.63%/76.15%/51.25%, and 37.55%/76.60%/50.40%. As in our conclusion for SNV calling, generally, we suggest using prioritize-recall mode for calling somatic Indels at lower tumor coverage.

Performance after adding phasing information to the input in step 2. The reconstruction of haplotypes, also known as haplotype-resolved assembly or phasing, greatly improved the performance of long-read germline variant calling in previous practice [22, 23]. In the Method section, we elaborate how phasing information can be used to enhance somatic variant-calling performance. Using 50/25-fold HCC1395/BL and prioritize-recall mode for benchmarking, we show in Extended Figure 1b that phasable somatic variants performed better than unphasable ones, especially at lower VAF. At VAF 0.1-0.15, phasable somatic variants had a 43.8%/85.6%/57.9% precision/recall/f1-score, but unphasable somatic variants had only 16.4%/81.2%/27.3%. At VAF 0.05-0.1, phasable somatic variants had 5.8%/46.6%/10.3%, but unphasable somatic variants had only 2.3%/27.2%/4.3%. We tried disabling phasing in ClairS and fed no phasing information to the calling networks. As shown in Supplementary Table 7, the overall F1-score dropped from 81.14% to 78.66% (-2.48%).

Performance of the two respective networks in step 2. In contrast to Clair3, in which the pileup network handles all the variant candidates, and the full-alignment network processes only the candidates left undecided by the pileup network, ClairS uses both networks equally to make collective decisions. The rationale and details are elaborated in the Method section. We tested the performance using the pileup network only and the full-alignment network only, with 50/25-fold HCC1395/BL and prioritize-recall mode. The results are shown in Supplementary Table 7. When we used only the pileup network, the F1-score dropped from 81.14% to 73.27% (-7.87%). When we used only the full-alignment network, the F1-score dropped from 81.14% to 79.14% (-2.00%).

Performance of Step 3: Searching for ancestral haplotype support. As elaborated in the Method section, step 3 utilizes remote alignment signals that could not be included in the network inputs due to computational limitations to improve the precision of the called somatic variants. As shown in Supplementary Table 7, without this step, the precision dropped from 70.21% to 67.14% (-3.07%, using 50/25-fold HCC1395/BL and prioritize-recall mode).
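
The three ablation results above can be collected and re-derived in a few lines. The sketch below only restates figures quoted in the text (50/25-fold HCC1395/BL, prioritize-recall mode); the dictionary layout and variable names are illustrative.

```python
# Full-pipeline figures quoted above (percent).
BASELINE_F1 = 81.14
BASELINE_PRECISION = 70.21

# F1-score (percent) after removing one component, as quoted above.
ABLATION_F1 = {
    "no phasing information": 78.66,
    "pileup network only": 73.27,
    "full-alignment network only": 79.14,
}

# Drop in F1-score, in percentage points, relative to the full pipeline.
f1_drops = {name: round(BASELINE_F1 - f1, 2) for name, f1 in ABLATION_F1.items()}

# Precision (percent) without step 3, and the corresponding drop.
PRECISION_WITHOUT_STEP3 = 67.14
precision_drop = round(BASELINE_PRECISION - PRECISION_WITHOUT_STEP3, 2)

if __name__ == "__main__":
    for name, drop in sorted(f1_drops.items(), key=lambda item: -item[1]):
        print(f"{name}: -{drop:.2f} F1")
    print(f"without step 3: -{precision_drop:.2f} precision")
```

Sorted this way, dropping the full-alignment network (pileup only) costs the most F1, while removing phasing or the pileup network costs roughly 2 to 2.5 points each.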

ClairS performance on Illumina data

Short-read somatic small-variant calling has been intensively studied. A non-exhaustive list of state-of-the-art methods includes Strelka2 [10], Mutect2 [8], Lancet [6], Neusomatic [5], Octopus [12], SomaticSniper [9], and Varnet [4]. ClairS was designed primarily for long-read somatic small-variant calling. However, it poses no limitations on short reads, and we expect a variant-calling method that works for long reads to perform as well as or even better than existing short-read methods. Also, benchmarking against other short-read somatic small-variant callers provides insights on how ClairS performs against existing methods, since no other long-read callers are available for comparison.
$=!
"
"
$=#
"
A summary of the Illumina data used for model training and benchmarking is shown in
Supplementary Table 1. We used both HG003/HG004 and HG004/HG003 as tumor/normal
samples for data synthesis. The HG003 and HG004 samples had 91.15- and 88.49-fold coverage,
and were publicly shared by Google Health Center. For benchmarking, we also used the
HCC1395/BL tumor-normal pair, but with data from six sequencing centers, made available by
the SEQC2 consortium (NS: NovaSeq at Illumina; NC: HiSeq at the National Cancer Institute; IL:
HiSeq at Illumina; EA: HiSeq at European Infrastructure for Translational Medicine; FD: HiSeq at
Fudan University; NV: HiSeq at Novartis), with coverage ranging from 37.93- to 87.54-fold. The
six multi-center replicates enabled us to verify the consistency of ClairS' performance. If not
specifically mentioned, other training and benchmarking details are the same as those of the
ONT data experiments, and we benchmarked only SNVs for all callers. Like the ONT data
experiments, the command lines and links to the data used for both model training and
benchmarking are listed in the Supplementary Notes.

The evaluation metrics of the eight callers on the six datasets are shown in Supplementary Table
8. Figure 6a shows the precision-recall curves, and Figure 6b shows a histogram of the F1-scores.
ClairS consistently performed comparably or slightly better than the two top-performing callers,
Strelka2 and Mutect2. On the six datasets, ClairS achieved a 97.88%, 97.82%, 97.46%, 97.53%,
97.03%, and 96.41% F1-score. Strelka2 achieved a 96.16%, 96.18%, 97.03%, 97.35%, 96.32%,
and 95.47%. Mutect2 achieved 95.21%, 95.75%, 95.33%, 96.24%, 96.46%, and 94.35%. Broken
down into different VAF ranges, the ClairS performance was also consistently comparable to or
better than that of other callers, as shown in Figure 6c. Figure 6d shows Venn diagrams of the
overlaps of false positive calls between Strelka2, Mutect2, and ClairS. The diagrams show that
although rare, there are 30 to 41 false positive calls in the six datasets that were called by all
three callers. Owing to the possible incompleteness of the truth set, a false positive variant can
be either a wrong call or a true variant missing from the truth set. These false positive variant
calls called by all three callers are worth conducting further verification and might further
contribute to the completeness of the truth set.

Discussion

In this study, we present ClairS, the first somatic small variant caller for ONT long-reads. In our
benchmarks, we showed that it is reliable at different sample coverages, tumor purities and
normal contaminations. With the training data synthesis method we devised, ClairS can be
trained for somatic small-variant calling for any sequencing platform. We demonstrated that
ClairS performed as well as or even better than the top-performing somatic variant callers for
Illumina short reads. ClairS draws on its germline variant caller predecessors' experience, while
using a redesigned workflow, network architecture, network output, and post-processing
procedure for the more challenging somatic variant-calling tasks.

The use of long reads for somatic SV discovery has unraveled complex somatic SVs that were
previously hampered by short reads [28]. With the unprecedented power of long reads to cover
repetitive genome regions, we expect the use of long reads for somatic small variant calling to
reveal more somatic variants that were previously inaccessible by short reads, and lead to a
better understanding of the mutational processes and functional consequences of the somatic
variants in different cancer types. To allow more researchers to achieve these goals, ClairS was
included as the small variant caller in ONT's somatic variant-calling workflow [29].

Despite using ONT's latest Q20+ data, the F1-score for somatic indels was only ~60%. Although
the performance looks better when considering only the somatic indels in the coding
sequences, the promise of better whole genome somatic indel-calling performance lies in the
continuous advancement of ONT's sequencing chemistry and base-calling algorithm.

Method

Training data synthesis

Generating synthetic tumors and synthetic normal. A tumor comprises normal cells and tumor
cells; the latter are regarded as foreign. Similarly, the normal cells of an individual are
considered foreign to the normal cells of another individual, and vice versa. The germline
variants unique to an individual can mimic a somatic variant when mixed with another
individual. With insufficient known truth somatic variants and standard tumor-normal sample
pairs available, this observation lays the foundation for generating ample synthetic somatic
variants from known truth germline variants in the GIAB reference samples with real
sequencing data for deep-learning model training. The detailed workflow is shown in Figure 1a.
Using 80-fold GIAB HG002 of ONT WGS alignments as the source of the tumor (hereafter
referred to as A) and 50-fold of HG001 as the source of normal (B) as an example, we first split
the alignments of both samples into smaller chunks, each with 4-fold coverage. The smaller
chunks from both samples can be combined to simulate 1) different allele frequencies, such as
combining 40-fold A, i.e., ten 4-fold chunks of A, with 20-fold B, i.e., five 4-fold chunks of B, to
simulate a synthetic tumor sample with an ideal 67% allele frequency, i.e., 40-fold of A against
60-fold of A+B; 2) different coverage of both tumor and normal, e.g., increasing or decreasing the
number of chunks as needed; and 3) different levels of contamination in normal, e.g., instead of
using 100% B as normal, adding one or more chunks of A to the normal.
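
The chunk arithmetic in this example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `mix_chunks` and the returned field names are hypothetical and are not part of ClairS; the 4-fold chunk size and the 40-fold A plus 20-fold B mixture are taken from the example above.

```python
from fractions import Fraction

CHUNK_FOLD = 4  # coverage carried by one chunk, as in the example in the text


def mix_chunks(chunks_from_a: int, chunks_from_b: int) -> dict:
    """Summarise a synthetic tumor built from chunks of A and chunks of B.

    A is the tumor source and B is the normal source. The returned
    'fraction_from_a' is the share of reads that come from A, which is the
    "ideal" allele frequency quoted in the text for a variant unique to A.
    """
    if chunks_from_a < 0 or chunks_from_b < 0:
        raise ValueError("chunk counts must be non-negative")
    cov_a = chunks_from_a * CHUNK_FOLD
    cov_b = chunks_from_b * CHUNK_FOLD
    total = cov_a + cov_b
    if total == 0:
        raise ValueError("at least one chunk is required")
    return {
        "coverage_a": cov_a,
        "coverage_b": cov_b,
        "total_coverage": total,
        "fraction_from_a": Fraction(cov_a, total),
    }


# Ten chunks of A (40-fold) mixed with five chunks of B (20-fold).
mix = mix_chunks(10, 5)
print(mix["total_coverage"], round(float(mix["fraction_from_a"]), 2))
```

Changing the two chunk counts reproduces the other two uses listed above: scaling both counts changes coverage at a fixed ratio, and the same helper applied to a mostly-B mixture describes a contaminated normal.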
Coverage advice for the source data. We applied two restrictions to tumor and normal
synthesis. First, to avoid any biases caused by reusing individual reads, we used sampling
without replacement in tumor and normal synthesis. That is, a chunk that was used could not
be used again. Second, we required the synthetic tumor to have equal or higher coverage than
the synthetic normal to align with common practice. Using the previous example, i.e., 80-fold A
as the source of a tumor, and 50-fold B as the source of normal, if 20-fold B is reserved to mix
with A for the tumor, then 30-fold B is left for normal, but we can achieve tumor purity only
between 33% (10-fold A + 20-fold B) and 100% (80-fold A + no B). However, if 30-fold B is
reserved to mix with A for the tumor, tumor purity between 0% (no A + 30-fold B) and 100%
(80-fold A + no B) can be achieved, but only 20-fold B is left for normal. Thus, we suggest using
the sample with higher coverage as the source of normal to achieve both higher normal
coverage and full-spectrum tumor purity for model training. This is counterintuitive because
higher tumor coverage is common in cancer studies.
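
The trade-off between normal coverage and the reachable purity range can be reproduced numerically. The sketch below is an assumption-laden illustration, not ClairS code: the helper name `purity_range` is hypothetical, and it assumes that all of the reserved B is mixed into the lowest-purity tumor and that the tumor must be at least as deep as the normal, as required above.

```python
def purity_range(tumor_source_cov: float,
                 normal_source_cov: float,
                 reserved_for_tumor: float):
    """Return (normal_cov, min_purity, max_purity) for one reservation choice.

    tumor_source_cov   -- coverage of A, the tumor source (80 in the text)
    normal_source_cov  -- coverage of B, the normal source (50 in the text)
    reserved_for_tumor -- coverage of B set aside for mixing into the tumor
    """
    if normal_source_cov <= 0 or not 0 <= reserved_for_tumor <= normal_source_cov:
        raise ValueError("reserved coverage must lie within the normal source")
    # Sampling without replacement: what is reserved cannot be used as normal.
    normal_cov = normal_source_cov - reserved_for_tumor
    if tumor_source_cov < normal_cov:
        raise ValueError("a pure tumor could not reach the normal coverage")
    # Tumor coverage must be >= normal coverage, so at the lowest purity we
    # still need enough A on top of the reserved B to reach normal_cov.
    min_a = max(0.0, normal_cov - reserved_for_tumor)
    min_purity = min_a / (min_a + reserved_for_tumor)
    return normal_cov, min_purity, 1.0


for reserved in (20, 30):
    normal_cov, low, high = purity_range(80, 50, reserved)
    print(f"reserve {reserved}-fold B: normal {normal_cov:.0f}-fold, "
          f"purity {low:.0%} to {high:.0%}")
```

Under these assumptions the two cases in the text come out as 30-fold normal with purity from 33% and 20-fold normal with purity from 0%, which is the tension the coverage advice resolves by giving the normal role to the deeper sample.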
Deriving multiple categories of variants from a synthetic tumor/normal pair. Four categories of
variants, namely "Somatic", "Germline", "Artifact", and "Normal-only", were derived from a
synthetic tumor/normal pair, as explained in detail in Figure 1a. We used GIAB truth variants in
the synthetic tumor and synthetic normal, and we defined any truth variants (including both
homozygous and heterozygous) as "callable-variant". Basically, somatic variants are callable
variants in the synthetic tumor but not in the synthetic normal, which are also known as truth
germline variants in the tumor source, but not in the normal source. Germline variants are
callable variants in both the synthetic tumor and synthetic normal, which are also known as
truth germline variants in both the tumor source and normal source. Artifact variants are
callable variants only in the synthetic tumor. They are not known as truth in either the tumor
source or the normal source. Normal-only variants are callable-variants found only in the
synthetic normal. Three of the four categories, Somatic, Germline and Artifact, are used for
model training. Only the variants located in the overlapping GIAB-defined high-confidence
regions of both the tumor and normal sources were used for training to ensure the quality of
the training variants. Because of the sub-sampling process, some variants might have few
supporting reads in the synthetic tumor, especially for low AF somatic variants. These variants
were excluded from model training to avoid confusing the neural network. More exclusion
details are given in the following paragraph. Our sample synthesis method supports generating
synthetic tumors at any purity level, so we can use as many purities as possible to achieve fine
coverage of VAF from 0 to 1, but for practicality, we used three tumor purities (25%, 50%, and
100%), and applied subsampling to all variants from the three purities to achieve acceptable
VAF distribution. This is feasible because 1) the innate variance of the AF of the germline
variants from the tumor and normal sources enables a pool of somatic variants fully covering
VAF from 0 to 1, even with just three purities, and 2) applying subsampling to the pool enables
us to enrich difficult somatic variants and reduce the number of less common somatic variants
for model training. In terms of subsampling, the VAFs of chosen somatic variants were randomly
selected from a beta distribution with shape parameters α=2 and β=5. The same distribution
was used by the SEQC2 consortium for spike-in somatic variants in Sahraeian et al. [30]. Our
experiment showed that subsampling itself resulted in a ~1.8% increase in the F1-score using
50x/25x of HCC1395/BL. The resulting VAF distribution of SNVs is shown in Figure 1b. We tried
adding one more tumor purity at 12.5%, but apart from longer model training time, no
performance gain was observed.
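
A minimal sketch of the labelling logic and of drawing target VAFs from Beta(2, 5) follows. It is an interpretation for illustration only: the function names, the boolean inputs, and the reading of "Normal-only" as "truth in the normal source but not in the tumor source" are assumptions, and the real ClairS pipeline works on alignments and truth VCFs rather than on three flags.

```python
import random
from typing import List, Optional

TRAINING_CATEGORIES = {"Somatic", "Germline", "Artifact"}


def categorize(truth_in_tumor_source: bool,
               truth_in_normal_source: bool,
               alt_seen_in_synthetic_tumor: bool) -> Optional[str]:
    """Assign one of the four categories described in the text, or None."""
    if truth_in_tumor_source and truth_in_normal_source:
        return "Germline"
    if truth_in_tumor_source:
        # Truth germline variant unique to the tumor source mimics a somatic one.
        return "Somatic"
    if truth_in_normal_source:
        return "Normal-only"
    if alt_seen_in_synthetic_tumor:
        # Non-reference signal that is not a truth variant in either source.
        return "Artifact"
    return None


def sample_target_vafs(n: int, seed: int = 0,
                       alpha: float = 2.0, beta: float = 5.0) -> List[float]:
    """Draw n target VAFs from a beta distribution (alpha=2, beta=5 in the text)."""
    rng = random.Random(seed)
    return [rng.betavariate(alpha, beta) for _ in range(n)]


labels = [categorize(True, False, True), categorize(True, True, True),
          categorize(False, False, True), categorize(False, True, False)]
used_for_training = [label for label in labels if label in TRAINING_CATEGORIES]
vafs = sample_target_vafs(20000)
print(labels, used_for_training, round(sum(vafs) / len(vafs), 2))
```

Beta(2, 5) has mean 2/7 and is skewed toward low values, which is consistent with the stated aim of enriching difficult, low-VAF somatic variants in the training pool.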
Other details about the variants selected for model training. For the somatic variants used for
model training, a minimum coverage of four, and a minimum of three reads supporting the
somatic variant allele are required. Somatic variants with VAF > 0.03 in the synthetic normal
were excluded from training to avoid confusing the model with a very noisy normal. For the
artifacts, the non-reference AF was capped at 0.05 to avoid using a large number of obvious
artifacts for training. For germline variants, a minimum coverage of four reads and a minimum of
three reads supporting the germline variant allele were required in both synthetic tumor and
normal. Germline variants with a difference in AF larger than 0.1 between the synthetic tumor
and normal were excluded from training. To prevent the model from inferring a somatic variant
from its co-existence with two or more adjacent germline variants, which is a confounding
factor that can be easily learned by the model, germline variants that were less than 33 bp (the
window size of our model design) from each other were excluded from training. Our experiment
showed that this exclusion alone increased somatic variant calling precision by ~1%. With three
tumor purities and the exclusions explained above, 12,489,342 training samples were left. The
breakdown is shown in Figure 1c.
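
The exclusion rules for somatic and germline training variants can be written as small predicates. This is a hedged sketch rather than the ClairS implementation: the `SiteCounts` record and all function names are hypothetical, the 0.03 normal-VAF cutoff is read here as "exclude when greater than 0.03" (an assumption about the comparison operator), and the artifact rule is left out because the text does not give enough detail to code it unambiguously.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class SiteCounts:
    tumor_cov: int    # reads covering the site in the synthetic tumor
    tumor_alt: int    # reads supporting the variant allele in the tumor
    normal_cov: int   # reads covering the site in the synthetic normal
    normal_alt: int   # reads supporting the variant allele in the normal


def vaf(alt: int, cov: int) -> float:
    return alt / cov if cov else 0.0


def keep_somatic(s: SiteCounts, max_normal_vaf: float = 0.03) -> bool:
    """Coverage >= 4, >= 3 supporting reads, and a clean enough normal."""
    return (s.tumor_cov >= 4 and s.tumor_alt >= 3
            and vaf(s.normal_alt, s.normal_cov) <= max_normal_vaf)


def keep_germline(s: SiteCounts, max_af_gap: float = 0.1) -> bool:
    """Both samples covered and supported, with similar AF in tumor and normal."""
    covered = s.tumor_cov >= 4 and s.normal_cov >= 4
    supported = s.tumor_alt >= 3 and s.normal_alt >= 3
    gap = abs(vaf(s.tumor_alt, s.tumor_cov) - vaf(s.normal_alt, s.normal_cov))
    return covered and supported and gap <= max_af_gap


def drop_clustered(positions: Iterable[int], window: int = 33) -> List[int]:
    """Drop germline positions that lie less than `window` bp from a neighbour."""
    pos = sorted(positions)
    kept = []
    for i, p in enumerate(pos):
        near_prev = i > 0 and p - pos[i - 1] < window
        near_next = i + 1 < len(pos) and pos[i + 1] - p < window
        if not (near_prev or near_next):
            kept.append(p)
    return kept


print(keep_somatic(SiteCounts(40, 6, 30, 0)),
      keep_germline(SiteCounts(40, 20, 30, 15)),
      drop_clustered([100, 120, 200, 300, 332, 400]))
```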
Phasing information enhances somatic variant-calling performance. An authentic somatic
variant usually originates from either the maternal or paternal haplotypes, while a random error
usually has a fair chance of happening in both (Extended Data Figure 1a). Thus, somatic variants
that have a single ancestral haplotype (either maternal or paternal) should be considered more
reliable than those with two ancestral haplotypes, except for somatic variants with high VAF
that might be a result of copy number alteration or clonal duplication [31]. ClairS uses phasing
information for both model training and inference. Clair3 and LongPhase are used for phasing
and read haplotagging. More details are given in the "ClairS input and output" section. ClairS
uses phasing information during full-alignment-based variant calling, in which a channel named
"Tumor/Normal/Phasing Info" is used. In this channel, the alignments are grouped into
haplotype-unknown, haplotype 1, and haplotype 2, each using the read order of the
alignments. Although long-read sequencing enables outstanding phasing performance, some
somatic variants in difficult genomic regions or without a heterozygous germline variant in their
vicinity still cannot be covered by any phased reads. Thus, during model training, for each variant
that has a heterozygous origin from the tumor source, if one or more reads can be phased, both a
version of input with reads after phasing and a version before phasing were used.

ClairS workflow and design

Overview. Figure 2 shows an overview of the ClairS somatic variant-calling workflow. Starting
from the alignments in the BAM/CRAM format of a tumor/normal sample pair, ClairS follows
three steps to derive the somatic variants in a tumor and outputs them to a VCF file. In step 1,
ClairS uses Clair3 and LongPhase for germline variant calling, phasing and read haplotagging.
The processed alignments are then used for both pileup and full-alignment-based somatic
variant calling in step 2. Step 3 involves post-processing filters that eliminate somatic variant
calls if an ancestral haplotype (either maternal or paternal) from which the somatic variant
could originate cannot be found.

Step 1: Germline variant calling, phasing and read haplotagging. Step 1 is depicted in Figure 2b.
Clair3 [22] is integrated into ClairS for calling high-quality heterozygous germline variants in both
tumor and normal to maximize the performance of the subsequent phasing task. Unlike Clair3's
default, AF≥0.2 and coverage≥10 were applied to ensure the quality of the called variants and
reduce computational overhead. Only the heterozygous germline variants found in both tumor
and normal were chosen for phasing. For phasing and haplotagging the tumor alignments, both
LongPhase [32] and WhatsHap [33] are allowed in ClairS. We chose LongPhase over WhatsHap as the
default because LongPhase runs ~15 times faster while delivering similar or longer phase sets
on human samples. Notably, ClairS does not phase and haplotag the normal alignments. Our
experiment showed that phasing the normal alignments doubled the processing time but did not
result in any improvement in calling performance.

Step 2: Pileup-based and full-alignment-based variant calling. Step 2 is depicted in Figure 2c. For
a variant candidate (explained in "Selecting variant candidates"), a pileup input and a
full-alignment input are generated (explained in "The design of pileup input and full-alignment
input"). Then the inputs are sent to a Bi-GRU-based pileup-calling neural network and a
ResNet-based full-alignment-calling neural network (explained in "The design of neural networks") for
inference. Both networks have the same output: a single task with three categories, "Somatic",
"Germline", and "Artifact", which match exactly the three categories defined in the synthetic
training data. In contrast to Clair3, in which the faster pileup-based calling cleans up most
variant candidates that are obvious variants, and the more computationally demanding
full-alignment-based calling handles the tricky and less obvious candidates, ClairS considers the
power of the two neural networks equal. We observed that full-alignment-based calling is
performant at mid-range VAFs. However, pileup-based calling requires less evidence than
full-alignment calling to draw the same conclusion. When VAF goes under 0.1, pileup-based calling
becomes increasingly more sensitive and usually outperforms full-alignment-based calling. This
observation makes pileup-based calling more important for somatic variant calling than its role
in Clair3 for germline variant calling, especially in multiple clinical usage scenarios when
sensitivity is emphasized. In ClairS, a somatic variant is called when both networks give somatic
the highest probability. The variant quality (QUAL) is Phred-like and is calculated as
!"#$%&'()*+!"
,
!#$
$
-
./0(1
-"P&'A'"
23%!"#$%&'
(&)*+, &%!"#$%&'
-+)).$)&/0#*0%
'
Also in contrast to Clair3,
which uses the same network for both SNP and Indel calling, ClairS uses two different networks
respectively trained for SNV and Indel calling. This means that ClairS runs on four networks in
total: pileup for SNV, pileup for Indel, full-alignment for SNV, and full-alignment for Indel. The
rationale behind the new design is that unlike germline variants that are commonly diploid,
somatic variants have no ploidy assumption, meaning that the existence of SNVs and Indels in
the same position are independent events. Our tests found that using separated networks led to
a 1.5% increase in SNV recall. The use of separated networks also allowed the use of different
variant quality cutoffs for SNV and Indel, which is useful for somatic variant calling, especially
when the sample condition is not ideal.
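
The agreement rule between the two networks, and the choice among the four networks by variant type, can be sketched as follows. This is an illustration under stated assumptions and not ClairS code: the probability dictionaries, the function names, and the network identifiers returned by `pick_networks` are all hypothetical, ties are treated as "not somatic", and the QUAL computation is deliberately left out.

```python
from typing import Dict, Tuple

LABELS = ("Somatic", "Germline", "Artifact")


def favours_somatic(probs: Dict[str, float]) -> bool:
    """True when 'Somatic' has strictly the highest probability of the three."""
    return probs["Somatic"] > max(probs["Germline"], probs["Artifact"])


def call_somatic(pileup_probs: Dict[str, float],
                 full_alignment_probs: Dict[str, float]) -> bool:
    """A somatic call requires both networks to rank 'Somatic' first."""
    return favours_somatic(pileup_probs) and favours_somatic(full_alignment_probs)


def pick_networks(variant_type: str) -> Tuple[str, str]:
    """Return hypothetical identifiers of the two networks for SNV or Indel."""
    if variant_type not in ("SNV", "Indel"):
        raise ValueError("variant_type must be 'SNV' or 'Indel'")
    return (f"pileup_{variant_type}", f"full_alignment_{variant_type}")


pileup = {"Somatic": 0.70, "Germline": 0.10, "Artifact": 0.20}
full_alignment = {"Somatic": 0.45, "Germline": 0.05, "Artifact": 0.50}
print(pick_networks("SNV"), call_somatic(pileup, full_alignment))
```

In this example the pileup network favours "Somatic" but the full-alignment network favours "Artifact", so no somatic call is made, which is the behaviour implied by requiring both networks to agree.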

Selecting variant candidates. Sending every genome position as a variant candidate to the
neural networks guarantees maximum sensitivity. However, it is not only computationally
infeasible, but also unreasonable to work on nonstarter positions, such as those without any
non-reference allele support. A good variant candidate selection strategy is essential to achieve
a balance between sensitivity and running time. In ClairS, the selection criteria are as follows.
Let $r \in \{A, C, G, T\}$ be the reference base of a genome position, and $a$ (with $a \neq r$) be the
alternative bases. $C_a^s$ denotes the coverage of $a$ at the position in sample $s \in \{T, N\}$, where
$T$ and $N$ represent the tumor and normal sample. Writing $\mathrm{VAF}_a^s$ for the allele frequency of
$a$ in sample $s$, that is, $C_a^s$ divided by the total coverage at the position in $s$, $S_a$ defines the
selection criteria of each alternative base in $a$, as:

$$
S_a =
\begin{cases}
\text{False}, & \text{if } C_a^T < 3 \text{ or } \mathrm{VAF}_a^T < \alpha \\
\text{True}, & \text{if } \mathrm{VAF}_a^T \ge \alpha \text{ and } \mathrm{VAF}_a^N < \alpha \\
\text{True}, & \text{if } \mathrm{VAF}_a^T / \mathrm{VAF}_a^N \ge \beta \text{ and } \mathrm{VAF}_a^N \ge \alpha
\end{cases}
$$

where α sets the minimum VAF, and β sets the minimum tumor VAF to normal VAF ratio for a
candidate to be selected. Intuitively, the first equation means disregarding variant candidates
with < 3 reads in tumor supporting the variant allele, or with VAF in tumor < α. The second
equation means selecting a variant candidate if its VAF is < α in normal, but ≥ α in tumor. The
third equation means selecting a variant candidate if, even though its VAF in normal is ≥ α, the
VAF in tumor is ≥ β times larger than the VAF in normal. In ClairS, α and β are configurable and
default to 0.05 and 6, respectively. Like model training data preparation, coverage ≥ 4 is required
in both tumor and normal for a candidate to be selected for variant calling.
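
The three selection criteria and the coverage requirement can be expressed as one predicate. The sketch below is an illustrative reading of the description above, not the ClairS source: the function name and argument names are hypothetical, VAF is taken as supporting reads divided by total coverage, and comparisons are cross-multiplied so that boundary cases are not affected by floating-point division.

```python
def select_candidate(tumor_alt: int, tumor_cov: int,
                     normal_alt: int, normal_cov: int,
                     alpha: float = 0.05, beta: float = 6.0,
                     min_alt_reads: int = 3, min_cov: int = 4) -> bool:
    """Decide whether an alternative base is sent to the networks.

    alpha -- minimum VAF (default 0.05 in the text)
    beta  -- minimum tumor-to-normal VAF ratio (default 6 in the text)
    """
    # Coverage >= 4 is required in both tumor and normal.
    if tumor_cov < min_cov or normal_cov < min_cov:
        return False
    # First criterion: too few supporting reads, or tumor VAF below alpha.
    if tumor_alt < min_alt_reads or tumor_alt < alpha * tumor_cov:
        return False
    # Second criterion: tumor VAF >= alpha (checked above), normal VAF < alpha.
    if normal_alt < alpha * normal_cov:
        return True
    # Third criterion: normal VAF >= alpha, but the tumor VAF is at least
    # beta times the normal VAF (written without division).
    return tumor_alt * normal_cov >= beta * normal_alt * tumor_cov


examples = {
    "clean normal": select_candidate(10, 50, 0, 30),
    "too few alt reads": select_candidate(2, 50, 0, 30),
    "noisy normal, strong tumor": select_candidate(40, 50, 3, 30),
    "noisy normal, weak tumor": select_candidate(20, 50, 6, 30),
}
for name, selected in examples.items():
    print(f"{name}: {selected}")
```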
The design of pileup input and full-alignment input. ClairS pileup input comprises 1,122 integers:
33 positions wide with 34 features at each position (an example is given in Extended Data
Figure 2a). A detailed explanation of each feature is given in Supplementary Methods under
"Description of pileup input features". ClairS has 18 pileup features in common with Clair3, and
16 additional features. The 16 new features are read counts $X_{LMQ+}$, $X_{LMQ-}$, $X_{LBQ+}$, and $X_{LBQ-}$,
where $X$ is either of the nucleotides A, C, G, and T, the LMQ subscript means mapping quality lower
than 20 (MQ<20), LBQ means base quality lower than 30 (BQ<30), and + and - mean the
forward and reverse strand, respectively. The rationale behind the new features is that in ClairS,
the results of pileup-based calling and full-alignment-based calling are trusted equally, so the
mapping quality and base quality information that used to be exclusive to full-alignment calling
need to be added to pileup-based calling. Our experiment showed that removing the 16 new
features reduced precision by ~2% using 50x/25x of HCC1395/BL. ClairS full-alignment input
comprises 30,030 integers: seven channels, each with 33 positions and 130 rows to support at
most 76 tumor reads, 52 normal reads, and 2 empty rows as space between tumor and normal
(an example is given in Extended Data Figure 2b). Like Clair3, random subsampling down to the
maximum supported coverage is used at excessive coverages. A detailed explanation of each
channel is given in Supplementary Methods under "Description of full-alignment input
channels". In both inputs, the candidate variant is centered at the 16th position. Positions
uncovered by any base in full-alignment input are filled with zero.
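
The sizes quoted above follow directly from the stated dimensions, which the short sketch below verifies. The constant names, the feature-name strings, and the (channel, row, position) nesting order of the zero-filled tensor are illustrative assumptions; only the numbers come from the text.

```python
from itertools import product

PILEUP_POSITIONS = 33
PILEUP_FEATURES = 34          # 18 shared with Clair3 + 16 new features

FA_CHANNELS = 7
FA_POSITIONS = 33
MAX_TUMOR_READS = 76
MAX_NORMAL_READS = 52
SPACER_ROWS = 2               # empty rows between tumor and normal reads
FA_ROWS = MAX_TUMOR_READS + SPACER_ROWS + MAX_NORMAL_READS

pileup_size = PILEUP_POSITIONS * PILEUP_FEATURES
full_alignment_size = FA_CHANNELS * FA_POSITIONS * FA_ROWS

# Flanking bases on each side of the candidate in a 33-position window.
flank = (PILEUP_POSITIONS - 1) // 2

# The 16 new pileup features: one read count per base, quality class, strand.
new_features = [f"{base}_{qual}{strand}"
                for base, qual, strand in product("ACGT", ("LMQ", "LBQ"), "+-")]


def empty_full_alignment():
    """Zero-filled input; uncovered positions simply stay at zero."""
    return [[[0] * FA_POSITIONS for _ in range(FA_ROWS)]
            for _ in range(FA_CHANNELS)]


tensor = empty_full_alignment()
print(pileup_size, full_alignment_size, FA_ROWS, flank, len(new_features))
```

The flank of 16 bases on each side matches the 33 bp window size mentioned for the model design and the "flanking 16bp" that the networks are said to consider.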
Design of neural networks. The pileup and full-alignment network architecture and important
parameters are shown in Figure 3. The pileup network uses two bidirectional gated recurrent unit
(Bi-GRU) layers, with 128 and 192 units, respectively. Compared to the Clair3 pileup network, the use of
Bi-GRU instead of bidirectional long short-term memory (Bi-LSTM) architecture reduced
trainable parameters from 2,532,995 to 2,309,507 and matrix computations from 3.11 to 2.38
billion, but improved performance in our experiment. The full-alignment network is a residual
neural network (ResNet) comprising three standard residual blocks. A convolutional layer is
added immediately before each residual block to expand the number of channels. In both
networks, a dropout rate of 0.3 is set for the flattened layer and dense layer to prevent
overfitting.

Step 3: Search for ancestral haplotype support. Step 3 is depicted in Figure 2d. The neural
networks exhibited good power in distinguishing real variants from false positive candidates.
However, useful signals remote to a variant candidate are not covered by the current neural
network designs in ClairS, which consider only the flanking 16 bp of a candidate. Notably, even
if the flanking window is extended to 50 bp, it is still too short for an accurate inference of which
haplotype a variant candidate belongs to using only the networks, but the networks would
already be computationally infeasible for somatic variant calling. In ClairS, post-processing step
3 is designed to reduce false positive calling mistakes made by the networks by leveraging
relatively remote germline variants to find the correct ancestral haplotype for a somatic variant.
Any somatic variant calls that cannot be found with ancestral haplotype support are switched to
an artifact and are excluded from the output. The haplotagged reads produced in step 1 are
used in this step. For a somatic variant that covers any haplotagged reads, we required the
somatic variant to coexist with the heterozygous germline variants less than 100 bp away on its
left and right in the reads in the haplotype group the somatic variant supporting reads were in.
An example of a false positive somatic variant filtered by this rule is given in Extended Data
Figure 3a. A somatic variant at chr4:38,012,942 was called by the two networks. A phased
heterozygous germline variant was found 61 bp left of the somatic call. Three reads in haplotype
2 that supported the somatic variant were found not to have the heterozygous germline variant.
Thus, the somatic variant was considered unsupported by an ancestral haplotype. For a somatic
variant that covers no haplotagged read, it is probably because there are no germline variants
or only homozygous variants in the vicinity. In this case, we required the somatic variant to be
coexisting with the homozygous germline variants less than 100 bp away on its left and right in
all somatic variant supporting reads. An example of this is given in Extended Data Figure 3b. A
somatic variant at chr1:100,632,158 was called by the two networks. A homozygous germline
variant was found 39 bp left of the somatic call. Multiple reads that support the somatic variant
were found not to have the homozygous germline variant. Thus, the somatic variant was not
considered to be supported by an ancestral haplotype. Somatic variants that do not have any
germline variants less than 100 bp away on their left or right are not applicable in this step.
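
The step 3 rule can be sketched on a toy data model. Everything in this block is a simplified illustration and not the ClairS implementation: the two record types and the function name are hypothetical, each read is reduced to a haplotype tag plus a map of which nearby germline variant alleles it carries, and the sketch assumes that every checked supporting read must carry every relevant germline variant within 100 bp. The two examples in the tests use the coordinates quoted in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GermlineVariant:
    pos: int
    heterozygous: bool


@dataclass
class SupportingRead:
    haplotype: Optional[int]   # 1, 2, or None when the read is not haplotagged
    # germline position -> whether this read carries the germline variant allele
    germline_alt_at: Dict[int, bool] = field(default_factory=dict)


def ancestral_support(somatic_pos: int,
                      reads: List[SupportingRead],
                      germline: List[GermlineVariant],
                      max_dist: int = 100) -> Optional[bool]:
    """Return True/False for supported/unsupported, or None if not applicable."""
    nearby = [g for g in germline if abs(g.pos - somatic_pos) < max_dist]
    if not nearby:
        return None  # no germline variant within 100 bp: the step does not apply
    tagged = [r for r in reads if r.haplotype is not None]
    if tagged:
        # Haplotagged supporting reads are checked against heterozygous variants.
        anchors = [g for g in nearby if g.heterozygous]
        checked = tagged
    else:
        # Otherwise all supporting reads are checked against homozygous variants.
        anchors = [g for g in nearby if not g.heterozygous]
        checked = list(reads)
    if not anchors or not checked:
        return None
    return all(r.germline_alt_at.get(g.pos, False)
               for r in checked for g in anchors)


# Heterozygous germline variant 61 bp left of the somatic call at chr4:38,012,942.
het_site = [GermlineVariant(38_012_881, heterozygous=True)]
hap2_reads = [SupportingRead(2, {38_012_881: False}) for _ in range(3)]
print(ancestral_support(38_012_942, hap2_reads, het_site))
```

A call for which the function returns False would be switched to an artifact under the rule described above, while None marks candidates that this step leaves untouched.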
Output. ClairS supports VCF format output. Somatic variants are marked "PASS", or "LowQual" if
the variant quality is low (i.e., QUAL<8, configurable by option) or if they are filtered in step 3. For
each variant, the allele frequency and supporting coverage of the reference allele and all
alternative alleles are shown. The options "--print_germline_calls" and "--print_ref_calls"
enable outputting germline variants and artifacts, respectively.

ONT library preparation and sequencing

Genomic DNA (gDNA) of a triple-negative breast cancer (TNBC) cell line (HCC1395) and a B
lymphocyte-derived normal cell line (HCC1395BL) from the same donor were purchased from
the American Type Culture Collection (ATCC). Genomic DNA (gDNA) of HG001 was purchased
from the Coriell Institute. The high-molecular-weight gDNA was examined by Nanodrop, Qubit,
and 0.35% agarose electrophoresis for its concentration, purity, and integrity. The gDNA was
then fragmented with gTube to generate DNA fragments approximately 20 kb in length. These
fragments were then sequenced at two sequencing centers: HKU and Novogene. At HKU,
the fragments of HCC1395, HCC1395BL, and HG001 were prepared and ligated with a
sequencing adapter using ONT's ligation sequencing kit V14 SQK-LSK114. The ligated samples
were sequenced on R10.4.1 PromethION flowcells using a PromethION 2 Solo device and
MinKNOW software version 1.18.02, for 96 h. At Novogene, the fragments of HCC1395 and
HCC1395BL were prepared and ligated with a sequencing adapter using ONT's ligation
sequencing kit V12 SQK-LSK112. The ligated samples were sequenced on R10.4 PromethION
flowcells using PromethION 48, for 96 h.

Benchmarking
We used the truth set of somatic variants in HCC1395/BL generated and maintained by the SEQC2 consortium. The truth set was orthogonally validated with multiple sequencing replicates from multiple sequencing centers that comprise over 1,500-fold sequencing data in total. We used only the somatic variants labeled "HighConf" (High Confidence) or "MedConf" (Medium Confidence) as truth. Somatic variants labeled "LowConf" (Low Confidence, VAF ≤0.05, not a part of the truth set as defined by SEQC2) were not used for benchmarking. In total, there were 39,560 truth SNVs and 1,922 truth Indels; 39,447 of the SNVs and 1,602 of the Indels were within the high-confidence regions defined in a BED file provided by SEQC2. A variant call was considered correct only if it matched both the genome position and variant allele of the truth.

For both the ONT and Illumina benchmarks, some truth variants were excluded for the following reasons. First, even with the high sequencing coverage, such as 75.97-fold HCC1395 we generated for the ONT benchmarks, some truth variants still had very low or no coverage, or had no read supporting the variant allele. These truth variants would fail all the benchmarks, so they should be excluded. Second, some benchmarks tested multiple sequencing coverages and required sequencing read subsampling from the full dataset. The subsampling process might remove reads supporting a truth variant to an extent that few or no supporting reads are left. This affects especially the somatic variants that already have a low VAF. For example, a VAF 0.05 somatic variant with 20-fold coverage and one read supporting the variant allele can be reduced to VAF 0 by removing just one read during subsampling. This reduces the quality of the benchmarking results, especially for low VAF truth variants when subsampled datasets are used. To alleviate the problem, any truth variants that have very low VAF (<0.05) observed in the full dataset before subsampling should be excluded. Summing up the two reasons above, for each of the full datasets we used in both the ONT and Illumina benchmarks, we excluded truth variants that matched any of the following criteria from benchmarking: 1) VAF ≤0.05, 2) reads supporting the variant allele <3, 3) tumor coverage <4, and 4) normal coverage <4.

For standardization, we used som.py, provided in Illumina's Haplotype Comparison Tools27 (version v0.3.12) to generate evaluation metrics, including F1-Score, Precision, and Recall against the truth variants. The "compare_vcf" submodule in ClairS produces identical results to som.py, but automates the exclusion of unqualifying truth variants. The truth set materials are publicly available to the community. All tools, their version, and command lines used are given in the "Command lines used" section in Supplementary Notes.
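The exclusion criteria and the matching rule above can be illustrated with a minimal Python sketch. It is not the som.py or "compare_vcf" implementation: the `TruthVariant` record, the `is_excluded` and `evaluate` helpers, and the choice to leave calls that hit an excluded truth variant unscored are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TruthVariant:
    """Hypothetical record for one truth variant measured in the full dataset."""
    chrom: str
    pos: int
    alt: str
    alt_reads: int   # reads supporting the variant allele in tumor
    tumor_cov: int   # tumor coverage at the site
    normal_cov: int  # normal coverage at the site

    @property
    def vaf(self) -> float:
        return self.alt_reads / self.tumor_cov if self.tumor_cov else 0.0


def is_excluded(v: TruthVariant) -> bool:
    """True if the variant meets any of the four exclusion criteria in the text."""
    return (
        v.vaf <= 0.05
        or v.alt_reads < 3
        or v.tumor_cov < 4
        or v.normal_cov < 4
    )


def evaluate(calls, truths):
    """Return (precision, recall, f1) for calls given as (chrom, pos, alt) tuples.

    A call is correct only if position and variant allele both match a
    qualifying truth variant. Assumption: calls matching an excluded truth
    variant are ignored rather than counted as false positives.
    """
    kept = {(t.chrom, t.pos, t.alt) for t in truths if not is_excluded(t)}
    dropped = {(t.chrom, t.pos, t.alt) for t in truths if is_excluded(t)}
    scored = {c for c in set(calls) if c not in dropped}
    tp = len(scored & kept)
    fp = len(scored - kept)
    fn = len(kept - scored)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # The example from the text: one supporting read at 20-fold coverage.
    low = TruthVariant("chr1", 200, "G", alt_reads=1, tumor_cov=20, normal_cov=30)
    print(low.vaf, is_excluded(low))
```

Filtering on the full dataset before subsampling, as the text recommends, keeps the same set of truth variants across all coverage experiments.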
Computational performance
ClairS was written in Python and C++. The Python parts leverage PyPy for speedup. The neural network implementations use PyTorch. Training ClairS neural networks requires a high-end GPU, but using ClairS for somatic variant calling requires only a CPU. For the 50x/25x HCC1395/BL pair, ClairS finished running in ~5 hours for ONT data and ~2 hours for Illumina data (30% slower than Strelka2, but faster than all other short-read somatic variant callers), using two 12-core Intel Xeon Silver 4116 processors. The memory footprint is low and is controlled at lower than 1GB per CPU. For model training, we tested Nvidia GeForce RTX 2080 Ti, 3090, and 4090, and found that each new generation provided a ~35% speed increase over the previous one.

Code availability
ClairS is open source and available at https://github.com/HKU-BAL/ClairS under the BSD 3-Clause license. The results in this paper were based on the ClairS initial release (version 0.0.1). Multiple installation options are available for ClairS, including Docker and Singularity. ClairS has also been included as the small variant caller in ONT's somatic variant calling workflow29 since version 0.1.0.

Data availability
The links to the reference genomes, truth somatic variants, benchmarking materials, ONT, and Illumina data are given in the "Data availability" section in Supplementary Notes. All analysis output, including the VCFs and running logs, is available at http://www.bio8.cs.hku.hk/clairs/analysis_result. The HCC1395/BL sequencing data generated in this study was deposited in the NCBI short-read archive with accession ID PRJNA986292.

Acknowledgements
R.L. was supported by Hong Kong Research Grants Council grants GRF (17113721) and TRS (T21-705/20-N), the Shenzhen Municipal Government General Program (JCYJ20210324134405015), the URC fund from HKU, and Oxford Nanopore Technologies.

Author contributions
R.L. conceived the study. Z.Z. and R.L. designed the algorithms, implemented ClairS, and wrote the paper. Y.L. and T.-W.L. evaluated the benchmarking results. All authors revised the manuscript.

Competing interests
R.L. receives research funding from ONT. The other authors declare no competing interests.

References
1. Weinstein, J.N. et al. The cancer genome atlas pan-cancer analysis project. Nature Genetics 45, 1113-1120 (2013).
2. Perera-Bel, J. et al. From somatic variants towards precision oncology: evidence-driven reporting of treatment options in molecular tumor boards. Genome Medicine 10, 1-15 (2018).
3. Fang, L.T. et al. Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing. Nature Biotechnology 39, 1151-1160 (2021).
4. Krishnamachari, K. et al. Accurate somatic variant detection using weakly supervised deep learning. Nature Communications 13, 4248 (2022).
5. Sahraeian, S.M.E. et al. Deep convolutional neural networks for accurate somatic mutation detection. Nature Communications 10, 1041 (2019).
6. Narzisi, G. et al. Genome-wide somatic variant calling using localized colored de Bruijn graphs. Communications Biology 1, 20 (2018).
7. Fan, Y. et al. MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data. Genome Biology 17, 1-11 (2016).
8. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature Biotechnology 31, 213-219 (2013).
9. Larson, D.E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28, 311-317 (2012).
10. Kim, S. et al. Strelka2: fast and accurate calling of germline and somatic variants. Nature Methods 15, 591-594 (2018).
11. Freed, D., Pan, R. & Aldana, R. TNscope: accurate detection of somatic mutations with haplotype-based variant candidate detection and machine learning filtering. bioRxiv, 250647 (2018).
12. Cooke, D.P., Wedge, D.C. & Lunter, G. A unified haplotype-based method for accurate and comprehensive variant calling. Nature Biotechnology 39, 885-892 (2021).
13. Lai, Z. et al. VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Research 44, e108-e108 (2016).
14. Kovaka, S., Ou, S., Jenike, K.M. & Schatz, M.C. Approaching complete genomes, transcriptomes and epi-omes with accurate long-read sequencing. Nature Methods 20, 12-16 (2023).
15. Ameur, A., Kloosterman, W.P. & Hestand, M.S. Single-molecule sequencing: towards clinical applications. Trends in Biotechnology 37, 72-85 (2019).
16. Nanopore Q20+ chemistry, https://nanoporetech.com/q20plus-chemistry. (2019).
17. Fox, E.J., Reid-Bayliss, K.S., Emond, M.J. & Loeb, L.A. Accuracy of next generation sequencing platforms. Next Generation, Sequencing & Applications 1 (2014).
18. Luo, R., Sedlazeck, F.J., Lam, T.-W. & Schatz, M.C. A multi-task convolutional deep neural network for variant calling in single molecule sequencing. Nature Communications 10, 998 (2019).
19. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology 36, 983-987 (2018).
20. Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Cell Genomics 2, 100128 (2022).
21. Luo, R. et al. Exploring the limit of using a deep neural network on pileup data for germline variant calling. Nature Machine Intelligence 2, 220-227 (2020).
22. Zheng, Z. et al. Symphonizing pileup and full-alignment for deep learning-based long-read variant calling. Nature Computational Science 2, 797-803 (2022).
23. Shafin, K. et al. Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads. Nature Methods 18, 1322-1332 (2021).
24. Smolka, M. et al. Comprehensive structural variant detection: from mosaic to population-level. bioRxiv, 2022.2004. 2004.487055 (2022).
25. Shiraishi, Y. et al. Precise characterization of somatic complex structural variations from tumor/control paired long-read sequencing data with nanomonsv. Nucleic Acids Research, gkad526 (2023).
26. Ewing, A.D. et al. Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nature Methods 12, 623-630 (2015).
27. Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nature Biotechnology 37, 555-560 (2019).
28. Shiraishi, Y. et al. Precise characterization of somatic complex structural variations from paired long-read sequencing data with nanomonsv. bioRxiv, 2020.2007. 2022.214262 (2020).
29. Nanopore EPI2ME Labs, https://github.com/epi2me-labs/wf-somatic-variation. (2023).
30. Sahraeian, S.M.E. et al. Achieving robust somatic mutation detection with deep learning models derived from reference data sets of a cancer sample. Genome Biology 23, 12 (2022).
31. Tarabichi, M. et al. A practical guide to cancer subclonal reconstruction from DNA sequencing. Nature Methods 18, 144-155 (2021).
32. Lin, J.-H., Chen, L.-C., Yu, S.-C. & Huang, Y.-T. LongPhase: an ultra-fast chromosome-scale phasing algorithm for small and large variants. Bioinformatics 38, 1816-1822 (2022).
33. Patterson, M. et al. WhatsHap: weighted haplotype assembly for future-generation sequencing reads. Journal of Computational Biology 22, 498-509 (2015).

Figures
Figure 1. Overview of ClairS training data synthesis workflow.
(a) The workflow demonstrates how to produce synthetic somatic variants using two biologically unrelated samples with known truth germline variants for ClairS model training. In this study, specifically, we used 80x ONT WGS data of GIAB HG002 as sample A, and 50x HG001 as sample B. First, germline variants G_A and G_B were defined as known truth germline variants in sample A and B given by GIAB. G_A and G_B include both homozygous and heterozygous germline variants of a sample. To generate synthetic tumor variants T and synthetic normal variants N_A/N_B for each sample, the alignments were split into smaller chunks with 4x coverage each. Then, the chunks from both samples were combined and the variants called from them were defined as T. With the flexibility of combining any number of chunks from both samples, T effectively covered variants called at different coverages and VAF. Similarly, the chunks from a sample were combined at multiple coverages for calling synthetic normal variants N_A and N_B. With a small number of chunks from another sample combined into a synthetic normal, N_A and N_B effectively covered different contamination levels. The variants G_A, G_B, T, N_A and N_B were then used to generate four categories of variants ("Somatic", "Germline", "Artifact", and "Normal-only") with different rules. Somatic, Germline, and Artifact match the three categories in the inference task of the ClairS network architecture. The variants of the three categories were used for model training. When using sample B as tumor and A as normal, Somatic is defined as "(T − N_A) ∩ (G_B − G_A)", i.e., variants that were 1) found in synthetic tumor T; 2) not found in synthetic normal N_A; 3) found as a germline variant in G_B; and 4) not found in G_A. Germline is defined as "T ∩ N_A ∩ G_A ∩ G_B", i.e., variants that were found in all of T, N_A, G_A, and G_B. Artifact is defined as "T − N_A − G_A − G_B", which signifies the variants found only in T and not in the germlines or synthetic normal. When using sample A as tumor and B as normal, the definitions remain the same except for switching the subscripts.

(b) The VAF distribution of the synthetic somatic SNVs at three different simulated tumor purities (100%, 50%, and 25%), using either HG001/HG002 or HG002/HG001 as tumor/normal. Since both heterozygous and homozygous variants were used in synthesis, at 100% tumor purity, the variants were gathered at 0.5 and 1.0 VAF. The distribution showed good coverage of typical somatic SNV VAF by the synthetic SNVs.

(c) The breakdown of the number of synthetic variants for training. The numbers 1) using either HG002/HG001 or HG001/HG002 as tumor/normal, and 2) of the three categories Somatic, Germline, and Artifact, as defined in subfigure a, are shown. The number of Somatic categories is further divided into those synthesized from either homozygous SNPs or heterozygous SNPs. These numbers explain why including heterozygous SNPs in the synthesis is essential to ensure a sufficient number of synthetic somatic variants for model training.
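The category rules in subfigure a are plain set operations, and the chunk mixing amounts to simple arithmetic. The Python sketch below illustrates both for the case of sample B as tumor and sample A as normal. The function names, the representation of a variant as a hashable key, and the reading of "purity" as the fraction of chunks drawn from the tumor-role sample are assumptions for illustration and are not taken from the ClairS code.

```python
def classify(t, n_a, g_a, g_b):
    """Apply the subfigure-a rules with sample B as tumor and sample A as normal.

    t   : variants called from the synthetic tumor (set of hashable keys)
    n_a : variants called from the synthetic normal built from sample A
    g_a : known truth germline variants of sample A
    g_b : known truth germline variants of sample B
    """
    somatic = (t - n_a) & (g_b - g_a)
    germline = t & n_a & g_a & g_b
    artifact = t - n_a - g_a - g_b
    return {"Somatic": somatic, "Germline": germline, "Artifact": artifact}


def mix(chunks_from_tumor, chunks_from_normal, chunk_cov=4):
    """Coverage and simulated purity of a synthetic tumor built from 4x chunks.

    Assumption: purity is the share of chunks taken from the tumor-role sample.
    """
    total = chunks_from_tumor + chunks_from_normal
    if total <= 0:
        raise ValueError("at least one chunk is required")
    return chunk_cov * total, chunks_from_tumor / total


if __name__ == "__main__":
    t = {"v1", "v2", "v3", "v4"}
    labels = classify(t, n_a={"v2"}, g_a={"v2"}, g_b={"v1", "v2"})
    for name, members in labels.items():
        print(name, sorted(members))
    # Twelve chunks in total, three of them from the tumor-role sample.
    print(mix(3, 9))
```

Swapping the roles of the two samples only requires passing the sample-B synthetic normal and exchanging the two germline sets, which mirrors the "switching the subscripts" remark in the caption.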
Figure 2. Overview of the ClairS somatic variant calling workflow.
(a) The workflow illustrates the three steps of ClairS. In step 1, ClairS uses Clair3 and LongPhase for germline variant calling, phasing and read haplotagging. The processed alignments are then used for both pileup- and full-alignment-based somatic variant calling in step 2. Step 3 involves post-processing filters that eliminate somatic variant calls if an ancestral haplotype (maternal or paternal) from which the somatic variant could originate cannot be found. The details of steps 1, 2, and 3 are shown in subfigures b, c, and d.

(b) Step 1 details. Clair3 is applied to both tumor and normal samples for germline variant calling. High-quality heterozygous germline variants shared by both samples are selected and used by LongPhase to phase the germline variants found in the tumor sample. Using the phased germline variants, the tumor reads are then haplotagged to belong to either haplotype 1, 2, or unknown.

(c) Step 2 details. The processed alignments from step 1 are fed into both the pileup-based variant-calling neural network and the full-alignment-based variant-calling neural network. On a single somatic variant candidate, both networks give respective predictions on the probability of three categories: "Somatic", "Germline", and "Artifact". The predictions are then merged according to a set of rules introduced in the Method section.

(d) Step 3 details. The somatic variants called in step 2 are examined to determine if they are supported by an ancestral haplotype. Ancestral haplotypes, which can be either maternal or paternal, are derived using germline variants. A somatic variant is considered supported by an ancestral haplotype if the haplotype containing the somatic variant is believed to originate from one of the ancestral haplotypes.

Figure 3. The ClairS neural network architecture.
Both (a) the pileup network and (b) full-alignment network use alignments of both the tumor and normal samples as input. Tensors are created from both samples using methods detailed in the Method section, and are then concatenated. The tensors are then processed by their respective neural network for inference. Both networks output the probability of three categories: "Germline", "Somatic", and "Artifact". The sequence of layers and layer configurations are shown. The letters c, s, and k represent channel, stride, and kernel, respectively.

Figure 4. ONT HCC1395/BL dataset benchmarking results.
(a) The precision-recall curve of different combinations of tumor and normal coverage. The dot on each dashed line shows where the best F1-score was achieved. (b) The performance of ClairS at multiple VAF ranges benchmarked on the ONT HCC1395/BL dataset. In the first row, 25x, 50x, and 75x tumor were tested, with the normal coverage fixed at 25x. In the second row, 20x, 25x, and 30x of normal were tested, with tumor coverage fixed at 50x. Variant quality cutoff 8 (prioritize-recall mode) was used. (c) The precision-recall curve of different tumor/normal purity combinations with tumor coverage fixed at 50x and normal coverage fixed at 25x.

Figure 5. Categorizing the FPs and FNs in ClairS
The pie charts show the distribution of reasons for the FPs and FNs in ClairS. A 50x/25x ONT HCC1395/BL dataset was used for calling. 300 FPs and 300 FNs were randomly chosen from the results and analyzed. The tandem repeat, low complexity, homopolymer, and segmental duplication regions were defined using GIAB v3.0 Genome Stratification. Among the categories, "excessive mismatches in alignment" and "insufficient normal coverage" were decided manually, i.e., without fixed cut-offs. "Excessive mismatches in alignment" was given if an eye check of the alignments revealed more inconsistent mismatches than is usual for alignments with a true somatic variant. "Insufficient normal coverage" was given when a germline variant signal existed in both tumor and normal, but the coverage of normal was low, so the germline variant signal in normal was obviously weaker than in tumor.

Figure 6. Illumina HCC1395/BL dataset benchmarking results.
(a) The precision-recall curve of HCC1395/BL short-read datasets from six SEQC2 sources (NS: NovaSeq at Illumina, NC: HiSeq at National Cancer Institute, IL: HiSeq at Illumina, EA: HiSeq at European Infrastructure for Translational Medicine, FD: HiSeq at Fudan University, NV: HiSeq at Novartis) using eight tools (Strelka2, Lancet, Mutect2, NeuSomatic, Octopus, SomaticSniper, VarNet, ClairS). Variants were ranked by SomaticEVS for Strelka2, TLOD for Mutect2, Score for VarNet, SSC for SomaticSniper, and QUAL for the other callers. The dot on each line shows where the best F1-score was achieved. (b) The overall F1-score of the experiments shown in subfigure a. (c) The F1-score at different VAFs of the experiments shown in subfigure a. (d) Venn diagrams showing the overlap of false positive variant calls between Strelka2, Mutect2, and ClairS.

Extended Data Figures
Extended Data Figure 1. Performance differences between phasable and not-phasable SNVs
(a) The figure shows that a somatic variant usually originates in a single somatic cell and then spreads to more cells through cell division, resulting in a clone carrying the same variant. It also shows how the mismatches in the tumor sample and normal sample are different from each other. A somatic variant is more likely to be assigned to a haplotype through phasing, while a variant caused by random sequencing errors is less likely to be successfully phased. (b) A performance comparison of somatic variants, where "P" marks variants that can be phased and "NP" marks variants that cannot. The figure shows a higher performance in somatic variants that can be phased, especially at lower VAFs. We used 50/25-fold HCC1395/BL and prioritize-recall mode.

Extended Data Figure 2. Visualization of neural network inputs.
(a) Pileup-based calling input visualization. The candidate site is centered and marked by two dashed lines. (b) Full-alignment-based calling input visualization. In b, the top and bottom are padded with zero when the total coverage of tumor and normal samples does not reach the input limit. The normal read alignments and tumor read alignments in all channels are separated by two rows filled with zeros. The two demonstrations involved truth variants randomly picked from the HCC1395/BL dataset.
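The layout described for subfigure b can be sketched for a single channel as follows. This is only an illustration of the padding and separator idea: the function name `stack_reads`, placing normal reads above tumor reads, splitting the zero padding evenly between top and bottom, and counting the two separator rows toward the input limit are all assumptions, not details confirmed by the caption.

```python
def stack_reads(normal_rows, tumor_rows, max_depth, width):
    """Build one channel of a full-alignment style input matrix.

    normal_rows, tumor_rows : lists of rows, each a list of `width` numbers
    max_depth               : the input limit on the number of rows
    Returns a list of exactly `max_depth` rows.
    """
    for row in list(normal_rows) + list(tumor_rows):
        if len(row) != width:
            raise ValueError("every read row must have `width` columns")

    separator = [[0] * width for _ in range(2)]  # two rows filled with zeros
    body = [list(r) for r in normal_rows] + separator + [list(r) for r in tumor_rows]
    if len(body) > max_depth:
        raise ValueError("reads exceed the input limit; subsample them first")

    # Pad top and bottom with zero rows until the input limit is reached.
    pad = max_depth - len(body)
    top = pad // 2
    bottom = pad - top
    return (
        [[0] * width for _ in range(top)]
        + body
        + [[0] * width for _ in range(bottom)]
    )


if __name__ == "__main__":
    matrix = stack_reads([[1, 1, 1]], [[2, 2, 2], [3, 3, 3]], max_depth=9, width=3)
    for row in matrix:
        print(row)
```

A fixed number of rows lets candidates with different coverages share one tensor shape, which is what a network with a fixed input size requires.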
Extended Data Figure 3. Two examples of haplotype inconsistency that signifies a false somatic call.
(a) Example of a false somatic call with a haplotype inconsistent with the haplotype derived from a heterozygous germline variant nearby. (b) Example of a false somatic call with a haplotype inconsistent with the haplotypes derived from a homozygous germline variant nearby. The bases A, C, G, and T are depicted in green, blue, yellow, and red, respectively. The background in gray, purple, and pink represents an unknown haplotype, haplotype 1, and haplotype 2, respectively. GT is an abbreviation of genotype.
Article
Full-text available
We present our novel software, nanomonsv, for detecting somatic structural variations (SVs) using tumor and matched control long-read sequencing data with a single-base resolution. The current version of nanomonsv includes two detection modules, Canonical SV module, and Single breakend SV module. Using tumor/control paired long-read sequencing data from three cancer and their matched lymphoblastoid lines, we demonstrate that Canonical SV module can identify somatic SVs that can be captured by short-read technologies with higher precision and recall than existing methods. In addition, we have developed a workflow to classify mobile element insertions while elucidating their in-depth properties, such as 5' truncations, internal inversions, as well as source sites for 3' transductions. Furthermore, Single breakend SV module enables the detection of complex SVs that can only be identified by long-reads, such as SVs involving highly-repetitive centromeric sequences, and LINE1- and virus-mediated rearrangements. In summary, our approaches applied to cancer long-read sequencing data can reveal various features of somatic SVs and will lead to a better understanding of mutational processes and functional consequences of somatic SVs.
Article
Full-text available
Deep learning-based variant callers are becoming the standard and have achieved superior single nucleotide polymorphisms calling performance using long reads. Here we present Clair3, which leverages two major method categories: pileup calling handles most variant candidates with speed, and full-alignment tackles complicated candidates to maximize precision and recall. Clair3 runs faster than any of the other state-of-the-art variant callers and demonstrates improved performance, especially at lower coverage.
Article
Full-text available
Identification of somatic mutations in tumor samples is commonly based on statistical methods in combination with heuristic filters. Here we develop VarNet, an end-to-end deep learning approach for identification of somatic variants from aligned tumor and matched normal DNA reads. VarNet is trained using image representations of 4.6 million high-confidence somatic variants annotated in 356 tumor whole genomes. We benchmark VarNet across a range of publicly available datasets, demonstrating performance often exceeding current state-of-the-art methods. Overall, our results demonstrate how a scalable deep learning approach could augment and potentially supplant human engineered features and heuristic filters in somatic variant calling. Deep learning could be applied to the challenge of somatic variant calling in cancer by making use of large-scale genomic data. Here, the authors develop VarNet, a weakly supervised deep learning model for somatic variant calling in cancer with robust performance across multiple cancer genomics datasets.
Article
Full-text available
Genome in a Bottle benchmarks are widely used to help validate clinical sequencing pipelines and develop variant calling and sequencing methods. Here we use accurate linked and long reads to expand benchmarks in 7 samples to include difficult-to-map regions and segmental duplications that are challenging for short reads. These benchmarks add more than 300,000 SNVs and 50,000 insertions or deletions (indels) and include 16% more exonic variants, many in challenging, clinically relevant genes not covered previously, such as PMS2. For HG002, we include 92% of the autosomal GRCh38 assembly while excluding regions problematic for benchmarking small variants, such as copy number variants, that should not have been in the previous version, which included 85% of GRCh38. It identifies eight times more false negatives in a short read variant call set relative to our previous benchmark. We demonstrate that this benchmark reliably identifies false positives and false negatives across technologies, enabling ongoing methods development.
Article
Full-text available
Background Accurate detection of somatic mutations is challenging but critical in understanding cancer formation, progression, and treatment. We recently proposed NeuSomatic, the first deep convolutional neural network-based somatic mutation detection approach, and demonstrated performance advantages on in silico data. Results In this study, we use the first comprehensive and well-characterized somatic reference data sets from the SEQC2 consortium to investigate best practices for using a deep learning framework in cancer mutation detection. Using the high-confidence somatic mutations established for a cancer cell line by the consortium, we identify the best strategy for building robust models on multiple data sets derived from samples representing real scenarios, for example, a model trained on a combination of real and spike-in mutations had the highest average performance. Conclusions The strategy identified in our study achieved high robustness across multiple sequencing technologies for fresh and FFPE DNA input, varying tumor/normal purities, and different coverages, with significant superiority over conventional detection approaches in general, as well as in challenging situations such as low coverage, low variant allele frequency, DNA damage, and difficult genomic regions
Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read-based phasing. Third-generation nanopore sequence data have demonstrated a long read length, but current interpretation methods for their novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single-nucleotide-variant identification method at the whole-genome scale and produces high-quality single-nucleotide variants in segmental duplications and low-mappability regions where short-read-based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution with superior performance over the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio HiFi-polished).
The lack of samples for generating standardized DNA datasets for setting up a sequencing pipeline or benchmarking the performance of different algorithms limits the implementation and uptake of cancer genomics. Here, we describe reference call sets obtained from paired tumor–normal genomic DNA (gDNA) samples derived from a breast cancer cell line—which is highly heterogeneous, with an aneuploid genome, and enriched in somatic alterations—and a matched lymphoblastoid cell line. We partially validated both somatic mutations and germline variants in these call sets via whole-exome sequencing (WES) with different sequencing platforms and targeted sequencing with >2,000-fold coverage, spanning 82% of genomic regions with high confidence. Although the gDNA reference samples are not representative of primary cancer cells from a clinical sample, when setting up a sequencing pipeline, they not only minimize potential biases from technologies, assays and informatics but also provide a unique resource for benchmarking ‘tumor-only’ or ‘matched tumor–normal’ analyses.
Almost all haplotype-based variant callers were designed specifically for detecting common germline variation in diploid populations, and give suboptimal results in other scenarios. Here we present Octopus, a variant caller that uses a polymorphic Bayesian genotyping model capable of modeling sequencing data from a range of experimental designs within a unified haplotype-aware framework. Octopus combines sequencing reads and prior information to phase called genotypes of arbitrary ploidy, including those with somatic mutations. We show that Octopus accurately calls germline variants in individuals, including single nucleotide variants, indels and small complex replacements such as microinversions. Using a synthetic tumor data set derived from clean sequencing data from a sample with known germline haplotypes and observed mutations in a large cohort of tumor samples, we show that Octopus is more sensitive to low-frequency somatic variation, yet calls considerably fewer false positives than other methods. Octopus also outputs realigned evidence BAM files to aid validation and interpretation.
The year 2022 will be remembered as the turning point for accurate long-read sequencing, which now establishes the gold standard for speed and accuracy at competitive costs. We discuss the key bioinformatics techniques needed to power long reads across application areas and close with our vision for long-read sequencing over the coming years.
Motivation: Long-read phasing has been used for reconstructing diploid genomes, improving variant calling, and resolving microbial strains in metagenomics. However, the phasing blocks of existing methods are broken by large Structural Variations (SVs), and the efficiency is unsatisfactory for population-scale phasing. Results: This paper presents a novel algorithm, LongPhase, which can simultaneously phase single nucleotide polymorphisms (SNPs) and SVs of a human genome in 10-20 minutes, 10x faster than the state-of-the-art WhatsHap, HapCUT2 and Margin. In particular, co-phasing SNPs and SVs produces much larger haplotype blocks (N50 = 25Mbp) than those of existing methods (N50 = 10-15Mbp). We show that LongPhase combined with Nanopore ultra-long reads is a cost-effective and highly contiguous solution, which can produce between one and 26 blocks per chromosome arm without the need for additional trios, chromosome-conformation, and strand-seq data. Availability: LongPhase is freely available at https://github.com/twolinin/LongPhase/. Supplementary information: Supplementary figures and tables are available online.