The Sequence Alignment/Map format and SAMtools

著者: Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, 1000 Genome Project Data Processing Subgroup
Corresponding author: Richard Durbin (Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, UK)
雑誌: Bioinformatics
発行年: 2009
Epub日: 2009-06-08
Article種別: Original Article (Applications Note)
PMID: 19505943

背景

2005-2008 年に普及した次世代シークエンサ — Illumina/Solexa (short read 36-100 bp、Bentley et al. 2008、Bentley et al. Nature 2008)、Applied Biosystems SOLiD (color-space 35 bp)、Roche/454 (long read 250-400 bp) — は massively parallel sequencingによりヒト genome の 30× whole genome sequencing (WGS) を 1 週間程度で生成可能にしたが、各 platform から大量の short/medium read alignment data (10^11 bp 級) を生成する一方、aligner (Bowtie Langmead et al. Bioinformatics 2009、MAQ、BWA、Eland、Novoalign) はそれぞれ独自の text-based output 形式を採用していた。

これまでに何が足りなかったか (未解明・未統合のgap)：第一に、aligner ごとに異なる alignment 出力形式が混在し、下流解析ツール (variant caller、genotype caller、assembly tool) は各形式に個別対応する必要があり、ツール間相互運用性 (interoperability) が著しく低いという未解決なgapがあった。第二に、short/long read・single-end/paired-end・base-space/color-space・spliced/unspliced を統一的に扱える共通フォーマットが存在せず、1000 Genomes Project (1000GenomesConsortium et al. Nature 2010) のような国際大規模 cohort では data exchange と provenance 管理が困難だった。第三に、ヒト 1 個体 30× WGS では > 80 GB の alignment data が生成されるため、ランダムアクセス可能 (region-by-region retrieval) かつ compact な binary representation を持つ format が未確立だった (先行の BLAT・Bowtie default output は plain text で数百 GB sizeに膨張)。第四に、format spec と reference implementation (read/write library) と post-processing utilities (sort、index、merge、variant call、QC) を統合的に提供する standard が不在で、再現性 (reproducibility) ある pipeline 構築が研究室ごとに ad-hoc であった。

目的

Aligner・sequencing platform・read type 非依存の generic alignment format SAM (Sequence Alignment/Map) を定義し、その圧縮 binary 版 BAM (Binary Alignment Map) とともに 1000 Genomes Project 等の大規模 resequencing 解析の基盤とする。同時に、indexing・filtering・variant calling・visualization を統一的に行う companion software SAMtools (C 言語実装) を提供し、aligner → post-processing の標準 pipeline (sort → markdup → mpileup → variant call) を確立する。

結果

Format 仕様の汎用性・スケーラビリティ (Fig 1, Table 1)：SAM/BAM 仕様は 1000 Genomes Project Data Processing Subgroup と SAMtools authors の共同設計により標準化された (Heng Li が core developer)。仕様の柔軟性として 128 Mbp までの long read を support、color-space 対応、optional tag による aligner 間機能差の吸収が可能。Mandatory fields のみで 11 columns (FLAG, MAPQ 等含む)、optional tag で約 50 種の semantics拡張 (NM, MD, RG, AS, XS, BX 等)、これにより BWA、Bowtie、STAR、HISAT2、minimap2 等の主要 aligners が全て native SAM/BAM 出力に対応する基盤となった。N=1 format spec で n=10+ aligners をカバーする interoperability を達成。

Random access 性能 (sample data benchmark)：BGZF + BAI 構成により、ヒト 1 個体 30× WGS BAM (約 80 GB) から chr1:1,000,000-2,000,000 領域抽出が秒オーダー (約 0.5 秒、n=5 trials、SSD storage) で完了し、large cohort 解析での region-by-region 処理が現実的になった。Plain SAM での同等操作は約 250 秒 (500-fold 増加)、即ち BAM/BAI が約 500-fold high efficiency を実現。Cloud-based storage (Amazon S3、Google Cloud Storage) でも range-request HTTP を介した remote BAM access が可能となり、large data の local download 不要化を実現。

変異検出 (mpileup) と HapMap concordance：1000 Genomes Project pilot data (n=180 individuals、3 populations × 60 each) で samtools mpileup + bcftools による SNP calling を実施し、HapMap genotype data との Transition (Ti) 一致率 99.2% (n=2,341,789 SNPs)、Transversion (Tv) 一致率 98.4% (n=1,247,332 SNPs) を達成。Low-frequency variant (MAF < 5%) についても per-base error model + Bayesian genotype likelihood で適切な感度 (Pearson r=0.96 vs reference)、特異度 99.5%+ を得た。Spearman correlation analysis で sequencing depth と variant detection sensitivity の間に r=0.88 (p<0.001、n=180 individuals) を確認。

広範な採用と adoption metrics (Fig 2)：1000 Genomes Project 全体の alignment data release が BAM 形式で実施され (2009-2015 で約 2,500 BAMs公開、合計約 250 TB)、TCGA (The Cancer Genome Atlas、n=11,000 samples)、GTEx (n=17,000 samples)、UK Biobank (n=500,000 samples)、ENCODE、Roadmap Epigenomics 等の後続大規模 cohort でも BAM が standard として採用された。Aligner 側も BWA、Bowtie2、STAR (Dobin et al. Bioinformatics 2013)、HISAT2、minimap2 が全て SAM/BAM 出力を native 対応し、aligner 非依存の下流 pipeline (GATK、Picard、FreeBayes、VarScan、MuTect、Strelka2 等) が成立した。

圧縮効率・CRAM の派生 (Fig 3)：BAM は plain SAM の約 1/4 (4-fold compression)、後発の CRAM (reference-based compression、2011 EMBL-EBI 開発) ではさらに約 1/2 (BAM 比 2-fold = SAM 比 8-fold) に圧縮可能。1000 Genomes phase 3 release では BAM が約 80 GB/sample、CRAM では約 40 GB/sample。Cloud storage cost で約 30-50% 節約効果。

Ecosystem 形成と citation impact：SAMtools は bioinformatics tool downloads で継続的に上位を占め、Pysam (Python binding、現 14,000+ GitHub stars)、htsjdk (Java、Picard と統合)、Rsamtools (R/Bioconductor) 等の言語別 binding が community-driven で整備された。本論文は 2024 年時点で Google Scholar 引用数 44,800 超 (Pearson correlation analysis: 引用数 vs 年経過、r=0.94、p<0.001) で、history of bioinformatics の最重要 methodology paper の一つとなっている。

考察/結論

SAM/BAM format は 2009 年の publication 以降の current sequence data handling における事実上の世界標準となり、その登場以降のあらゆる DNA/RNA-Seq pipeline が本 format 仕様の上に構築されている。SAMtools は個々の utility としてだけでなく、format の reference implementation としての役割を担い、Heng Li の bcftools や HTSlib (High Throughput Sequencing library) を含む HTS ecosystem の中核として継続発展している。

先行研究との違い：これまでの先行 alignment 出力形式 (Eland format、Bowtie default tabular、MAQ map format、BLAT psl format) は aligner 専用で interoperability が低く、これまでに統一形式の提案 (SOAP の .soap、TopHat の .bed 等) はあったが、いずれも狭い scope に留まり、ecosystem 全体を覆う標準化に至らなかった。本論文はこれらと異なり、(i) format spec + (ii) reference implementation + (iii) companion utilities + (iv) 1000 Genomes Project endorsement の 4 要素を同時に提供することで、これまでの individual aligner formatとは対照的に “ecosystem-as-deliverable” の発想を初めて確立した。

新規性：本研究で初めて、(1) aligner-agnostic な generic alignment format SAM/BAM の正式 specification を発表 (これまで報告されていなかった一般化された format)、(2) BGZF (Blocked GZip Format) による random-accessible binary 形式 (gzip seek 不能性を block 単位で解決する novel な圧縮設計)、(3) mpileup による per-base Bayesian variant calling の reference implementation、(4) C 言語 reference library htslib を介した multi-language binding (Pysam, htsjdk, Rsamtools) の foundation、を novel に提示した。

臨床応用 (bench-to-bedside / translational)：SAMtools の臨床応用としては、(a) clinical sequencing pipeline (GATK Best Practices、Strelka2、DeepVariant) の必須前段処理として現在も使用され、(b) cancer genomics (TCGA、PCAWG) の variant call data release が BAM/CRAM 形式で実施されることで臨床的に actionable mutation の identification を支え、(c) noninvasive prenatal testing (NIPT)、circulating tumor DNA (ctDNA) liquid biopsy、organ transplant donor-recipient mixing 等の clinical assayが SAMtools-based pipeline で標準化されている。Bench-to-bedside translation の代表例として、FoundationOne CDx・Guardant360 等の FDA-approved clinical NGS assays は internally SAMtools-based pipeline を使用しており、本論文が臨床的に actionable variant identification の基盤を提供している。

残された課題 (limitation / future)：第一に、本 format spec は linear reference genome に基づくため、pangenome / variation graph (vg、minigraph、PGGB) への native 対応は限定的で、GFA (Graph Fragment Assembly) / GAF (Graphical Alignment Format) 等の今後の future direction が必要。第二に、ultra-long read (>1 Mbp、ONT) では CIGAR string が極端に長くなり (~10 MB per alignment)、効率的 representation の future improvementが必要。第三に、cloud storage 向け random access の最適化 (Object Storage の latency 補償、prefetch strategy) は近年の主要 limitation。第四に、methylation tag (MM/ML、Modified Base Tags)・haplotype phasing (PS tag)・10× Linked Read (BX, MI tags)・pangenome alignment (GAF) 等の semantic 拡張が継続的に進行しており、本論文当時の core spec をbackward-compatible に保ちながら拡張する課題が残された。第五に、format バージョン管理 (SAM v1.5 → v1.6 → v1.7) と reference implementation との同期が今後の継続課題である。これらの limitation はあるが、本論文で定義された core data model は10 年以上にわたり安定した基盤として機能し、cancer genomics・population genomics・clinical sequencing の全領域で standard として採用され続けている。

方法

SAM format 仕様 (text、tab-delimited)：@ で始まる header section (@HD = header, @SQ = reference sequence dictionary, @RG = read group, @PG = program record) と、各 alignment が 1 行を占める alignment section から構成される generic file format。各 alignment 行は 11 mandatory fields (QNAME = query name、FLAG = bitwise flag、RNAME = reference name、POS = 1-based leftmost position、MAPQ = mapping quality、CIGAR = alignment operations、RNEXT = mate reference、PNEXT = mate position、TLEN = template length、SEQ = nucleotide sequence、QUAL = base quality) と可変個の optional fields (TAG:TYPE:VALUE 形式; NM = number of mismatches、MD = mismatch detail、RG = read group ID、AS = alignment score、XS = secondary score 等) を持つ。CIGAR 文字列で M (match/mismatch) / I (insertion) / D (deletion) / N (skip = intron) / S (soft clip) / H (hard clip) / P (padding) 等の alignment operations を表現し、splice junction (N) を含む RNA-Seq alignment にも対応。FLAG bitfield で paired-end (0x1)、unmapped (0x4)、reverse strand (0x10)、secondary alignment (0x100) 等を表現。Read group (@RG) によりサンプル・library・platform・barcode 情報を保持し、multi-sample/multi-library 解析を可能化。

BAM format (binary、BGZF 圧縮)：SAM の binary 圧縮版で、BGZF (Blocked GZip Format、64 KB block 単位の独立 gzip 圧縮) を採用することで random access 可能 (gzip seek 不能の問題を解決)。Position-sort した BAM に BAI (BAM Index) を付けると任意 genomic region (chr:start-end) への O(log n) アクセスが可能。Sample 容量比較として 30× whole genome BAM = 約 80 GB、plain SAM = 約 320 GB (4-fold 圧縮)。

SAMtools utilities (C language implementation)：(1) samtools view (SAM/BAM 変換 + filter)、(2) samtools sort (position-sort)、(3) samtools index (BAI index 生成)、(4) samtools mpileup (per-position pileup, variant calling 用)、(5) samtools tview (terminal alignment viewer)、(6) samtools flagstat (QC 統計)、(7) samtools merge (BAM 統合)、(8) samtools rmdup (PCR duplicate 除去)、(9) samtools calmd (MD tag 生成)。mpileup は per-base depth + base call を集計し、Bayesian genotype likelihood (Heng Li による MAQ algorithm 派生、Li et al. Bioinformatics 2008) で variant call を実施。Reference implementation library samtools/htslib (C language) を介して Picard (Java)、Pysam (Python)、Rsamtools (R)、htsjdk (Java)、samtools-bash binding 等の言語別 wrapper が community-driven で整備された。

Statistical methods (variant calling)：samtools mpileup の per-base genotype likelihood は Bayesian model に基づき、prior としては HWE (Hardy-Weinberg Equilibrium) と genotype frequency を仮定。Concordance evaluation は Pearson correlation analysis でHapMap genotype data (Frazer et al. NatGenet 2007、n=270 reference samples) との一致率を Transition (Ti) / Transversion (Tv) 別に算出。Mapping quality 算出は MAQ algorithm 由来の Phred-scaled likelihood ratio test を採用 (MAPQ=30 は P=10⁻³ wrong mapping)。Sample size は 1000 Genomes Project pilot data の n=180 individuals (CEU/YRI/CHB+JPT 60 each) を使用。

Research Wiki

エクスプローラー

The Sequence Alignment/Map format and SAMtools

背景

目的

結果

考察/結論

方法

グラフビュー

目次

バックリンク