Base-resolution models of transcription-factor binding reveal soft motif syntax

Authors: Žiga Avsec, Melanie Weilert, Avanti Shrikumar, Sabrina Krueger, Amr Alexandari, Khyati Dalal, Robin Fropf, Charles McAnany, Julien Gagneur, Anshul Kundaje, Julia Zeitlinger
Corresponding authors: Anshul Kundaje (Stanford University, Stanford, CA, USA); Julia Zeitlinger (Stowers Institute for Medical Research, Kansas City, MO, USA)
Journal: Nature Genetics
Year: 2021
Epub date: 2021-02-18
Article type: Original Article
PMID: 33603233

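Background

The arrangement (syntax) of transcription factor (TF) binding motifs is a key component of the cis-regulatory code, but it had not been fully characterized. Positional relationships between motifs (relative distance and orientation) that influence TF binding are called "soft motif syntax"; unlike "composite motifs" with strict spacing constraints (e.g., the Oct4-Sox2 motif), they were predicted to follow more flexible rules of cooperativity. In prior work, computational tools such as HOMER (Heinz et al. Mol Cell 2010) and the MEME Suite (Bailey et al. NAR 2009) enabled motif discovery at the ChIP-seq peak level (100-200 bp resolution), but this was insufficient for base-resolution syntax analysis. Since Mitsui et al. (Cell 2003) identified Nanog as a factor essential for maintaining pluripotency in ES cells, cooperative binding of the pluripotency TFs (Oct4 / Sox2 / Klf4 / Nanog) has been shown by Chen et al. (Cell 2008) and others, but the details of the binding syntax remained unresolved. Base-resolution footprints from ChIP-exo / ChIP-nexus can distinguish direct from indirect binding, but no large-scale syntax analysis had exploited this resolution. Earlier deep learning models of TF binding mainly predicted binary binding events (peak present/absent), and none modeled base-resolution binding profiles directly.

What was missing: (1) a deep learning model that predicts TF binding profiles at base-pair resolution; (2) systematic identification of soft motif syntax rules for the pluripotency TFs, including Nanog; and (3) concrete, experimentally testable syntax hypotheses concerning directionality and helical periodicity.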

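Objective

The aims were to develop BPNet (Base-Pair Network), a deep learning model that predicts base-resolution TF binding profiles directly from DNA sequence; to systematically identify the soft motif syntax rules that govern cooperative TF binding for four pluripotency TFs (Oct4, Sox2, Nanog, Klf4) in mouse embryonic stem cells (mESCs); and to validate these rules functionally with CRISPR-induced point mutations.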

Results

Finding 1: BPNet prediction performance and learning of long-range context. BPNet predicted the positions of ChIP-nexus signal at base-pair resolution (1-10 bp) on held-out chromosomes with accuracy comparable to replicate experiments (AUPRC ≈ 0.7-0.8 vs. inter-replicate correlation of 0.75-0.85; Figure 1; n=3 cross-validation folds). At the known Lefty1, Zfp281, and Sall1 enhancers, predicted and observed footprints agreed closely (Pearson r = 0.85-0.92; n=3 enhancers). Ablation analysis showed that the 9 dilated convolutional layers are important (removing them lowered AUPRC for Nanog prediction by about 30%; Figure 2), indicating that the model needs long sequence context, up to its roughly 1-kb receptive field. In motif instance identification, BPNet outperformed MEME and HOMER by 1.5-2 fold in agreement with ATAC-seq changes after Oct4 / Sox2 depletion (Figure 2; p<0.001; n=2 biological replicates).
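
The roughly 1-kb receptive field can be checked with simple arithmetic on the layer stack. The following is a minimal sketch, assuming the 25-bp first-layer filter and the dilation rates 1, 2, 4, …, 256 given in the Methods, plus a kernel width of 3 for each dilated layer and stride 1 throughout. The last two values and the helper name `receptive_field` are illustrative assumptions, not details taken from this summary.

```python
def receptive_field(first_kernel, dilations, dilated_kernel=3):
    """Receptive field (in bp) of one ordinary convolution followed by a
    stack of dilated convolutions, all with stride 1."""
    rf = first_kernel
    for d in dilations:
        # A kernel of width k with dilation d widens the field by (k - 1) * d.
        rf += (dilated_kernel - 1) * d
    return rf


# Dilation rates 1, 2, 4, ..., 256 (9 layers), as listed in the Methods.
dilations = [2 ** i for i in range(9)]

# dilated_kernel=3 is an assumed value, not stated in this summary.
rf = receptive_field(first_kernel=25, dilations=dilations)
print(dilations)
print(rf)  # 1047
```

Under these assumptions the stack sees 25 + 2 × 511 = 1047 bp, which is consistent with the "1-kb receptive field" quoted above. A different kernel width or dilation schedule would change the figure accordingly.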

Finding 2: Discovery of 51 motifs and identification of composite motifs. Of the 51 motifs found by TF-MoDISco, 18 had long PFMs (>40 bp) and derived from repeat sequences, including endogenous retroviruses (ERVs); the remaining 33 were consolidated into 11 representative TF binding motifs (Oct4, Oct4-Oct4, Sox2, Oct4-Sox2, Nanog [3 variants], Klf4, Zic3, Esrrb, and others) (Figure 3). This study provided the first demonstration of in vivo binding to the Oct4-Oct4 homodimer motif (MORE/PORE-like) (about 2-fold enrichment vs. random; n=2 biological replicates). Nanog has three binding motifs (Nanog, Nanog-alt, Nanog-mix), all of which show their main footprint at a TCA core sequence (contribution scores about 3-5-fold higher than in the flanks; n=2 ChIP-nexus replicates).
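
To make the two motif representations concrete, the sketch below builds a position frequency matrix (PFM) and a contribution-weighted matrix (CWM) from a handful of aligned seqlets. It is an illustration only: the seqlets and scores are made-up toy values, the helper names are hypothetical, and averaging per-base contribution scores over aligned seqlets is one common way to summarize a motif, not necessarily the exact procedure used in the study.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """One-hot encode a DNA string as an array of shape (len(seq), 4)."""
    idx = np.array([BASES.index(b) for b in seq])
    return np.eye(4)[idx]


def pfm_and_cwm(seqlets, contribs):
    """Summarize aligned seqlets as a PFM and a CWM.

    seqlets:  list of equal-length, already aligned DNA strings
    contribs: array (n_seqlets, length) holding the contribution score of
              the base observed at each position
    Returns (pfm, cwm), each of shape (length, 4).
    """
    onehot = np.stack([one_hot(s) for s in seqlets])   # (n, L, 4)
    scores = np.asarray(contribs, dtype=float)
    pfm = onehot.mean(axis=0)                          # base frequencies
    cwm = (onehot * scores[:, :, None]).mean(axis=0)   # mean contribution per base
    return pfm, cwm


# Toy data (invented for illustration): four seqlets around a TCA-like core.
seqlets = ["TCA", "TCA", "TGA", "ACA"]
contribs = [[0.2, 0.9, 0.8],
            [0.1, 1.1, 0.6],
            [0.3, 0.1, 0.7],
            [0.0, 1.0, 0.5]]

pfm, cwm = pfm_and_cwm(seqlets, contribs)
print(pfm)
print(cwm)
```

The PFM only records how often each base occurs, whereas the CWM also reflects how much the model relies on it: in the toy data the G at position 2 appears in a quarter of the seqlets but carries almost no contribution.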

Finding 3: Nanog's helical periodic binding preference of about 10.5 bp. As the study's central finding, Nanog was found to co-occur preferentially with another Nanog motif or with other TF motifs at distances that are integer multiples of about 10.5 bp (one helical turn of DNA) (Figure 4). In the in silico oracle, Nanog-Nanog distances of 5-50 bp were simulated, and binding probability showed a sinusoid-like pattern with peaks at a period of about 10.5 bp (Pearson r = 0.7 with the helical fit; p<0.001; ~100,000 simulated pairs). This reflects a structural preference for Nanog molecules to be positioned on the same face of the DNA helix. CRISPR-Cas9 point mutations (4-6 sgRNAs/site; Nanog-Nanog spacers in the Lefty1 / Zfp281 / Sall1 enhancers) were used to introduce 1-3 bp shifts, and ChIP-nexus confirmed that in vivo binding changed as predicted (n=2 biological replicates × 4-6 mutants; Figure 5).
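
A periodicity of this kind can be tested by fitting a sinusoid to a binding-versus-distance curve and asking which period fits best. The sketch below does this with ordinary least squares on synthetic data. The signal, noise level, candidate grid, and function names are all assumptions made for illustration; they are not the study's data or its fitting procedure.

```python
import numpy as np


def fit_sinusoid(distances, signal, period):
    """Least-squares fit of signal ~ c + a*sin(2*pi*d/T) + b*cos(2*pi*d/T)
    for a fixed period T. Returns (coefficients, Pearson r of fit vs signal)."""
    d = np.asarray(distances, dtype=float)
    y = np.asarray(signal, dtype=float)
    w = 2 * np.pi * d / period
    X = np.column_stack([np.ones_like(d), np.sin(w), np.cos(w)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    r = np.corrcoef(fitted, y)[0, 1]
    return coef, r


def best_period(distances, signal, candidates):
    """Return the candidate period with the highest fit correlation."""
    scores = [fit_sinusoid(distances, signal, T)[1] for T in candidates]
    best = int(np.argmax(scores))
    return float(candidates[best]), float(scores[best])


# Synthetic example: a 10.5-bp periodic signal over motif distances of 5-50 bp.
rng = np.random.default_rng(0)
distances = np.arange(5, 51)
signal = (1.0
          + 0.3 * np.cos(2 * np.pi * distances / 10.5)
          + rng.normal(0, 0.02, distances.size))

candidates = np.arange(8.0, 13.01, 0.25)   # candidate periods in bp
period, r = best_period(distances, signal, candidates)
print(period, round(r, 3))
```

Fixing the period and fitting only the offset, sine, and cosine terms keeps the problem linear, so the phase does not need to be searched separately. Scanning a grid of periods then shows whether the helical value stands out from its neighbors.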

Finding 4: Directional cooperative binding between TFs (directional indirect binding footprints). The study found an asymmetry (directionality): Nanog binds indirectly at Sox2 motifs, whereas Sox2 does not bind at Nanog motifs (Figure 6; n=2 biological replicates). ChIP-nexus footprints distinguished direct binding (sharp ±20 bp profile; about 4-fold signal concentration) from indirect binding (diffuse ±100 bp profile; about 1.5-fold spread), and indirect binding correlated with BPNet contribution scores at Pearson r ≈ 0.65. The Oct4-Sox2 composite motif (bound directly by Oct4) showed distance-dependent cooperative interactions with surrounding Sox2 / Nanog / Klf4 motifs at nucleosome scale (~70-150 bp), with distance-dependent fold changes of 1.5-3 (Figure 6).

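Finding 5: ERV-derived strict motif spacing and the biological significance of soft syntax. At least 83% of the over-representation of strict inter-motif distances in the genome (strict spacing, < 5 bp variance) derived from ERVs (endogenous retroviruses), so this over-representation is not evidence of functional syntax (Figure 7; n=51 motifs analyzed). After these were excluded, functional soft syntax was reduced to about 8-10 motif pairs, all of which BPNet predicted to bind cooperatively. This finding emphasizes that the true syntax rules are functional, soft spatial preferences rather than statistical over-representation.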

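Discussion/Conclusions

This study integrated base-resolution ChIP-nexus data with the deep learning model BPNet to systematically identify, for the first time, soft motif syntax rules that had been predicted theoretically but never demonstrated. In particular, two unexpected findings, Nanog's helical periodic binding preference of about 10.5 bp and directional indirect cooperative binding, substantially advanced understanding of the molecular mechanisms of TF binding.

Differences from prior work: ChIP-seq-based computational approaches such as the MEME Suite (Bailey et al. NAR 2009) and HOMER (Heinz et al. Mol Cell 2010) had found almost no statistically over-represented motif syntax genome-wide, and its very existence was in doubt. In contrast, this study used the high resolution of ChIP-nexus and the learning capacity of BPNet to uncover soft syntax that existing methods could not detect. Unlike conventional two-class (bound/unbound) deep learning models (DeepSEA, Zhou and Troyanskaya Nat Methods 2015; Basset, Kelley et al. Genome Res 2016), BPNet is the first deep learning architecture to model base-pair-resolution profiles directly.

Novelty: (1) developed BPNet, a deep learning model for base-pair-resolution profile prediction; (2) demonstrated Nanog's helical periodicity of about 10.5 bp as a previously unreported soft motif syntax rule; (3) discovered directional, asymmetric cooperative binding (Nanog at Sox2 motifs but not vice versa) as a new principle of TF cooperativity; (4) quantified ERV-derived strict spacing as a statistical syntax artifact and separated it from true syntax; (5) established a contribution-score-based motif discovery pipeline that integrates DeepLIFT, TF-MoDISco, and the in silico oracle.

Clinical applications: (1) BPNet could serve as an in silico tool for assessing the functional impact of cis-regulatory variants in cancer genomes (pathogenic variants and somatic mutations), contributing to the interpretation of enhancer hijacking and super-enhancer mutations; (2) transfer learning to human cell lines and patient samples is under way at Stanford / Stowers; (3) clinical utility is expected for non-coding variant prioritization in precision oncology, particularly for non-coding driver discovery in PCAWG (Pan-Cancer Analysis of Whole Genomes).

Remaining issues: (1) extending BPNet to other cell types and other TF families (NF-κB / hormone receptors / homeobox); (2) phylogenetic analysis of the evolutionary conservation of soft motif syntax (cross-species ChIP-nexus in vertebrates); (3) clarifying the relationship between syntax changes and disease variants (cancer, congenital malformations); (4) the difficulty of developing compounds (small-molecule modulators) that disrupt or preserve DNA helical periodicity, which limits drug discovery; (5) modeling the combinatorial syntax of 10 or more TFs, beyond the four studied here, for which future Transformer-based architectures (such as Enformer) could be applied.

By integrating deep learning with base-pair-resolution experiments, this study marks a major advance in understanding the regulatory code of TF binding.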

Methods

Cells (cell line): Mouse embryonic stem cells (mESCs; E14 line, ATCC, 129/Ola strain background) were cultured on mouse embryonic fibroblast (MEF) feeders with leukemia inhibitory factor (LIF).

ChIP-nexus experiments: For the four pluripotency TFs (Oct4 / Sox2 / Nanog / Klf4), chromatin immunoprecipitation with antibodies (Santa Cruz / Cell Signaling) was followed by lambda exonuclease digestion (ChIP-nexus, a base-pair-resolution footprinting method, following He et al. Nat Biotechnol 2015). Each TF had n=2 biological replicates, sequenced on an Illumina HiSeq 2500 with 50 bp paired-end reads (~50M reads/sample).

BPNet model architecture: 147,974 genomic regions (1 kb each) served as the training set. The model uses a convolutional filter of width 25 bp followed by 9 dilated convolutional layers (exponentially increasing dilation: 1, 2, 4, 8, …, 256), giving a 1-kb receptive field. It was trained with a two-part multi-task loss: a profile-shape loss (multinomial cross-entropy over positional probabilities) and a total-count loss (Poisson on total reads). The model was implemented in TensorFlow v1.14 / Keras v2.2 and trained with 3 cross-validation folds (split by chromosome), the Adam optimizer (learning rate 0.004), and early stopping at around epoch 30.

Contribution scores: Motif importance was assessed with base-pair-resolution contribution scores from an extension of DeepLIFT (Deep Learning Important FeaTures) (DeepLIFT-Rescale / DeepSHAP).

TF-MoDISco: TF-MoDISco (Transcription Factor Motif Discovery from Importance Scores) performed unsupervised motif clustering on the contribution scores, yielding CWMs (contribution-weighted matrices).

In silico oracle (motif interaction simulation): Synthetic or genomic motif pairs were placed at distances of 10-1000 bp, and BPNet predicted the resulting change in binding (about 100,000 paired simulations).

CRISPR validation: CRISPR-Cas9 point mutations were introduced at Nanog motif sites within the Lefty1 / Zfp281 / Sall1 enhancers (4-6 sgRNAs/site), and changes in in vivo binding were measured by ChIP-nexus (n=2 biological replicates × 4-6 mutants).

Statistics: Pearson r, AUPRC (area under the precision-recall curve), Spearman ρ, the hypergeometric test, and Benjamini-Hochberg FDR were used; p<0.05 was considered significant.
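
The profile-shape term of the training loss treats the reads in a region as a multinomial draw over base positions. The sketch below computes that negative log-likelihood in plain NumPy. It is a minimal illustration under stated assumptions: position probabilities are taken as a softmax over per-base logits, the function names are hypothetical, the example counts are invented, and the total-count term and any weighting between the two terms are left out.

```python
from math import lgamma

import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    return z - np.log(np.exp(z).sum())


def multinomial_nll(counts, logits):
    """Negative log-likelihood of per-base read counts under a multinomial
    whose position probabilities are softmax(logits)."""
    counts = np.asarray(counts, dtype=float)
    logp = log_softmax(logits)
    n = counts.sum()
    # log of the multinomial coefficient n! / (k_1! ... k_L!)
    log_coef = lgamma(n + 1) - sum(lgamma(k + 1) for k in counts)
    return -(log_coef + float(np.sum(counts * logp)))


# Toy footprint (invented): most reads pile up at the central position.
observed = [0, 1, 6, 1, 0]
peaked_logits = [-2.0, 0.0, 2.0, 0.0, -2.0]   # prediction that matches the shape
flat_logits = [0.0, 0.0, 0.0, 0.0, 0.0]       # uninformative prediction

loss_peaked = multinomial_nll(observed, peaked_logits)
loss_flat = multinomial_nll(observed, flat_logits)
print(loss_peaked, loss_flat)
```

Because the multinomial conditions on the total number of reads, this term scores only the shape of the predicted profile; the overall read depth is handled by the separate total-count loss.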
