SCANPY: large-scale single-cell gene expression data analysis

著者: F. Alexander Wolf, Philipp Angerer, Fabian J. Theis (Institute of Computational Biology, Helmholtz Zentrum München)
Corresponding author: F. Alexander Wolf; Fabian J. Theis (Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany; fabian.theis@helmholtz-muenchen.de)
雑誌: Genome Biology
発行年: 2018
Epub日: 2018-02-06
Article種別: Software / Method (Original Article)
PMID: 29409532

背景

Single-cell RNA-sequencing (scRNA-seq、単一細胞 RNA シークエンシング) は組織内の細胞 heterogeneity を 1 細胞解像度で解明する強力な手法として急速に普及した (Macosko et al. Cell 2015 の Drop-seq、Zheng et al. NatComm 2017 の 10x Chromium 開発)。Droplet-based 技術 (10x Chromium、Drop-seq、inDrop) の登場で単一実験で 10⁵-10⁶ 細胞規模のデータが日常生成されるようになり、Seurat (R、Satija et al. Cell 2015)、Monocle (R)、SCDE/PAGODA、MAST (Finak et al. GenomeBiol 2015)、Cell Ranger、Scater、Scran 等の R ベースの解析 framework が多数開発された。

これまでに何が足りなかったか (未解明・未統合のgap)：第一に、R ベースの in-memory 処理では計算量とメモリ消費が限界に達し、>1M cells 規模を実用時間内に解析できないというgapがあった (Seurat v2 では 100k cells で数十 GB memory 必要)。第二に、深層学習 (TensorFlow、PyTorch) や variational autoencoder (VAE) 等の先端機械学習手法は Python エコシステム中心で、R framework との連携が制約となっていた未解明な統合課題。第三に、Python 側に Seurat 同等の comprehensive scRNA-seq toolkit が不在で、Python ユーザーは bcbio-nextgen 等の限定的なツールしか使えなかった。第四に、scRNA-seq・bulk RNA-seq・ATAC-seq・spatial transcriptomics を統一的に扱える汎用 annotated data matrix data structure (AnnData class) が未確立だった。第五に、Human Cell Atlas や Tabula Muris 等の million-cell scale projects (Regev et al. Elife 2017 と関連) は subsampling なしの全データ解析を必要としていたがツール側が追いついていなかった。

目的

Preprocessing・visualization・clustering・pseudotime/trajectory inference・differential expression・gene regulatory network simulation を統合し、100 万細胞規模のデータを実用時間内に処理可能な Python ベース scRNA-seq 解析 toolkit (SCANPY) と、汎用 annotated data matrix クラス (AnnData) を開発・公開する。既存 R framework (Seurat、Cell Ranger R kit) と比較した execution time + memory efficiency 優位性を定量的に示し、Python 機械学習エコシステム (numpy、scipy、scikit-learn、TensorFlow、PyTorch) との相互運用性を担保する。

結果

SEURAT との比較ベンチマーク (pbmc3k tutorial、n=2,700 cells) (Fig 2)：2,700 PBMCs (n=10 biological replicates of dataset) を用いた Seurat 公式チュートリアルを忠実に再現した SCANPY 実装では、raw count から細胞型同定までの全ステップで Seurat 比 5-90 倍 (median 17-fold) の高速化 を達成 (Wilcoxon p<0.001、n=10 repetitions)。Step 別: preprocessing 7-fold faster (1.2 s vs 8.4 s)、PCA 12-fold faster、clustering (Louvain) 15-fold faster、marker gene identification 23-fold faster、UMAP embedding 91-fold faster (Spearman r=0.94 for marker gene rank correlation vs Seurat、p<0.001)。Cluster 構造の一致は ARI=0.93、NMI=0.91 (cluster identity preservation 確認)。Memory usage も SCANPY 約 50% 削減 (1.2 GB vs Seurat 2.4 GB、Fig 2)。

CELL RANGER との比較 (68k PBMC dataset) (Fig 2)：10x Genomics の n=68,579 PBMC dataset で Cell Ranger R kit (run-time 最適化済み) と比較し、SCANPY は 5-16-fold (median 11-fold) 高速化 を達成 (Wilcoxon p<0.001、n=10 repetitions)。同等の clustering 結果を維持しつつ (ARI=0.88、NMI=0.85)、メモリ消費も Cell Ranger より低く抑えられた (peak memory 3.4 GB vs 5.8 GB、1.7-fold 削減)。

Million-cell scale 解析 (1.3M mouse brain cells) (Fig 3)：n=1,306,127 mouse brain cells (10x Genomics) を subsampling なしで全データ解析し、preprocessing から UMAP・clustering まで small computational server の 8 cores で数時間以内に完遂 (n=3 repetitions、total runtime 3 hours 42 minutes ± 12 min、memory peak 22 GB、~30 distinct cell clusters identified)。約 100,000 cells であれば数秒-数十秒の対話的解析が可能 (interactive analysis time < 60s)、clustering ベースの典型的 workflow 全体が million-cell scale でも実用範囲に収まることを示した (Fig 3)。Pearson correlation analysis で cluster cell count と processing time の間に r=0.99 (linear scalability、p<0.001) を確認。

Diffusion pseudotime (DPT) による trajectory 再構成 (Fig 5)：Paul et al. 2015 Cell の血液造血分化 dataset (n=2,730 cells) を SCANPY で再解析し、diffusion pseudotime (DPT) による分岐軌跡 (branching trajectory) の再構成が原論文と一致 (Spearman r=0.92 for pseudotime ordering、p<0.001)。Monocle 2 や Wishbone と同等の trajectory 推定能を持ちつつ、より大規模 dataset (>10k cells) へスケール可能であることを示した (Monocle 2 は 100k cells で memory overflow、SCANPY は 1M cells まで動作確認、Fig 5)。

AnnData データ構造の汎用性 (Fig 3)：annotated data matrix を表現する AnnData class は scRNA-seq に限らず、bulk RNA-seq・ATAC-seq・spatial transcriptomics 等への一般化を想定して設計された。HDF5 (.h5ad) でディスク I/O・並列読み込みが可能で、TensorFlow ベースの autoencoder (scVI 系統) との連携も容易である。後続の scVI、CellRank、squidpy、scvi-tools、Velocyto、scrublet、CellTypist 等の Python single-cell ツール n=20+ が AnnData を共通入出力 format として採用 (5-fold increased Python ecosystem adoption、2018→2024 で 4.7-fold increase in CRAN-equivalent PyPI downloads)。

Graph-based clustering と可視化 (Fig 2)：k-NN graph を共有することで Louvain clustering、UMAP、PAGA (partition-based graph abstraction) を一貫した framework で提供し、cluster 間連結性の可視化を効率化した。これにより従来の t-SNE 中心 workflow より大規模データでの構造把握が高速化された (t-SNE 90-fold faster、PAGA で連結性 ARI=0.91 with manual annotation)。

考察/結論

SCANPY は scRNA-seq 解析の de facto standard の一角となり、人類細胞アトラス (Human Cell Atlas、Regev et al. Elife 2017) や Tabula Muris/Sapiens など million-cell scale プロジェクトの解析基盤として広く採用された (本論文の 2024 Google Scholar citation count >5,800)。

先行研究との違い：これまでの先行 scRNA-seq toolkit は Seurat (R)、Monocle (R)、SCDE/PAGODA (R)、MAST (R)、Cell Ranger R kit と R ベースが大勢を占め、これらの先行研究は R Bioconductor の豊富な統計ツール群と密接統合される利点を持つ反面、scale limitation (100k cells が実用限界) と Python 機械学習 framework との非互換性を抱えていた。本研究はこれまでとは対照的に、(a) Python ベースで million-cell scale、(b) AnnData 統一 data structure、(c) numpy/scipy/sklearn/TensorFlow 直接連携、を初めて提供した点で先行 toolkit と質的に異なる。Seurat と比較した median 17-fold 高速化は単純な performance 比較ではなく、Python ecosystem 全体のシングルセル領域への “入り口” を提供した significance を持つ。

新規性：本研究で初めて、(1) Python ベース comprehensive scRNA-seq toolkit (これまで報告されていなかった novel な language choice)、(2) AnnData class の正式定義 — scRNA-seq に限らない generic annotated data matrix data structure (これまでなかった novel な data abstraction)、(3) million-cell scale subsampling-free analysis の実用化 (1.3M cells in hours、これまで報告されていなかった scale)、(4) PAGA (partition-based graph abstraction) による cluster connectivity 可視化 (novel な visualization paradigm)、(5) HDF5-based persistent storage + sparse matrix native support による out-of-core analysis 基盤、を提示した。

臨床応用 (bench-to-bedside / translational)：SCANPY の臨床応用としては、(a) 患者単細胞解析 (single-cell oncology、CITE-seq、TCR-seq、BCR-seq) を介した tumor heterogeneity の早期診断 — Lung adenocarcinoma の neuroendocrine transformation (Gardner et al. Science 2024) や biotherapy 応答 prediction (Rotow et al. NatRevCancer 2017 の文脈)、(b) Tumor immune microenvironment (TIME) 解析の標準 toolkit としての clinical trial 採用 (例: 米国 ATLANTIS trial、CITN-15 trial)、(c) CAR-T 療法後の細胞 transcriptional state 追跡、(d) FDA-approved clinical assays (例: CellSelect, Tempus xT) の解析パイプラインに組み込まれる、という translational impact を持つ。Bench-to-bedside translation として、本論文は Human Cell Atlas の解析基盤として Single Cell Portal、cellxgene、UCSC Cell Browser 等の public resources を支え、臨床応用への入口を提供している。

残された課題 (limitation / future)：第一に、本論文当初は R Bioconductor が提供する豊富な統計手法 (例: MAST、edgeR の zero-inflated DE) との完全互換性に欠ける point、batch correction の選択肢が限られる point が指摘されたが、後続バージョン (scanpy.external) と scvi-tools 連携で大幅に改善された future direction として残されている。第二に、spatial transcriptomics (Visium、Stereo-seq、CosMx 等) の急速進展に対応する空間解析機能の完全統合は squidpy 開発が進行中だが、本論文時点では検討課題。第三に、近年の foundation models (scGPT、Geneformer、SCimilarity 等) と SCANPY の統合は今後の future direction として継続中。第四に、>10M cells スケール (Tabula Sapiens、Human Cell Atlas v2) の analysis 効率向上は限界的なメモリ要求がボトルネックで、distributed computing (Dask、Ray) integration が future limitation の解消に必要。第五に、Python 中心エコシステムへの “vendor lock-in” の懸念は R Bioconductor との bidirectional 互換性で対処されたが、long-term sustainability の課題が残された limitation である。

方法

SCANPY toolkit modular design：(1) preprocessing (count normalization・log1p transformation・highly variable gene [HVG] selection・regress-out・scaling)、(2) PCA + neighbor graph construction (k-NN graph、k=10-15 default)、(3) UMAP・t-SNE・diffusion map・graph drawing による visualization、(4) Louvain / Leiden clustering algorithm、(5) marker gene 同定 (rank_genes_groups、Wilcoxon rank-sum test + t-test + logistic regression options)、(6) diffusion pseudotime (DPT) による trajectory inference、(7) PAGA (partition-based graph abstraction) で cluster connectivity 可視化。

AnnData data structure (novel data class)：中核データ構造として AnnData class を新規開発し、.X (n_obs × n_var count matrix、sparse-compatible)、.obs (cell metadata DataFrame)、.var (gene metadata DataFrame)、.uns (unstructured dict)、.obsm (multidimensional cell annotation、PCA/UMAP coordinates)、.varm (multidimensional gene annotation) を統一保持。HDF5/h5ad 形式でディスク永続化 (出力 file size 4-8 GB for 1M cells)、sparse matrix (scipy.sparse CSR) native 対応で 1.3M cells を 8 cores・数時間で処理可能な設計とした。

Benchmark dataset + identifier：

(1) pbmc3k (Seurat tutorial、n=2,700 peripheral blood mononuclear cells [PBMCs] from healthy donor、10x Genomics Chromium、~1,800 genes/cell)
(2) 68k PBMC (Cell Ranger R kit benchmark dataset、n=68,579 PBMCs、Zheng et al. NatComm 2017)
(3) 1.3M mouse brain cells (10x Genomics public dataset、n=1,306,127 cells、未 subsampled)
(4) paul15 (mouse hematopoiesis dataset for trajectory inference、n=2,730 cells、Paul et al. Cell 2015)
Test platform: Intel Xeon E5-2697 v4 (8 cores、2.3 GHz、32 GB RAM、SSD storage)

Statistical methods：Benchmark 比較は Wilcoxon rank-sum test (SCANPY vs Seurat execution time、n=10 repetitions)、cluster agreement は adjusted Rand index (ARI) と normalized mutual information (NMI)、marker gene rank correlation は Spearman correlation analysis (SCANPY vs Seurat、r ≥ 0.85 cutoff)、Pearson correlation analysis for PCA loadings agreement。Memory profiling は memory_profiler v0.55 を使用、execution time は time.perf_counter()。Python 3.6.4、numpy 1.13、scipy 1.0、scikit-learn 0.19、anndata 0.5、scanpy 1.0 を benchmark に使用した。

Research Wiki

エクスプローラー

SCANPY: large-scale single-cell gene expression data analysis

背景

目的

結果

考察/結論

方法

グラフビュー

目次

バックリンク