
scArches (PyTorch) - single-cell architecture surgery

scArches is a package for integrating newly produced single-cell datasets into integrated reference atlases. Our method can facilitate large collaborative projects with decentralized training and integration of multiple datasets by different groups. scArches is compatible with scanpy and hosts efficient implementations of all conditional generative models for single-cell data.


expiMap has been added to the scArches code base. It allows interpretable representation learning from scRNA-seq data as well as reference mapping. Try it in the tutorial section.

What can you do with scArches?

  • Construct single or multi-modal (CITE-seq) reference atlases and share the trained model and the data (if possible).

  • Download a pre-trained model for your atlas of interest, update it with new datasets and share with your collaborators.

  • Project and integrate query datasets on top of a reference and use the latent representation for downstream tasks, e.g., differential testing, clustering, classification.
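
As an illustration of the last point, the sketch below shows one simple way a shared latent representation can be used for classification: each query cell receives the majority label of its nearest reference cells. This is a dependency-free toy example, not part of the scArches API; the function name `knn_label_transfer`, the 2-D latent coordinates, and the labels are all hypothetical.

```python
import math
from collections import Counter


def knn_label_transfer(ref_latent, ref_labels, query_latent, k=3):
    """Give each query cell the majority label of its k nearest reference cells.

    Hypothetical helper for illustration only; it is not a scArches function.
    """
    predictions = []
    for q in query_latent:
        # Euclidean distance from this query cell to every reference cell.
        neighbours = sorted(
            (math.dist(q, r), label) for r, label in zip(ref_latent, ref_labels)
        )
        # Majority vote among the k closest reference cells.
        votes = Counter(label for _, label in neighbours[:k])
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Toy latent coordinates (assumed, 2-D for readability).
ref_latent = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
ref_labels = ["T cell", "T cell", "T cell", "B cell", "B cell", "B cell"]
query_latent = [(0.1, 0.1), (5.1, 5.0)]

predicted = knn_label_transfer(ref_latent, ref_labels, query_latent, k=3)
print(predicted)
```

In practice the latent coordinates would come from a trained model, and a library implementation of nearest-neighbour search would replace the explicit loop.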

What are the different models?

scArches is itself an algorithm for projecting query datasets on top of reference datasets, and it applies to different models. Here we provide a short explanation and hints on when to use which model. Our models are:

  • scVI (Lopez et al., 2018): Requires access to raw count values for data integration and assumes a count distribution for the data (NB, ZINB, Poisson).

  • trVAE (Lotfollahi et al., 2020): Supports either normalized log-transformed data or count data as input and applies an additional MMD loss to achieve better merging in the latent space.

  • scANVI (Xu et al., 2019): It needs cell type labels for reference data. Your query data can be either unlabeled or labeled. In the case of unlabeled query data, you can use this method also to classify your query cells using reference labels.

  • scGen (Lotfollahi et al., 2019): This method requires cell-type labels for both reference building and mapping. The reference mapping for this method relies solely on the integrated reference and requires no fine-tuning.

  • expiMap (Lotfollahi*, Rybakov* et al., 2023): This method takes prior knowledge from gene set databases or from users, allowing you to analyze your query data in the context of known gene programs.

  • totalVI (Gayoso et al., 2019): This model can be used to build multi-modal CITE-seq reference atlases.

  • treeArches (Michielsen*, Lotfollahi* et al., 2022): This model builds a hierarchical tree of cell types in the reference atlas. When mapping query data, it can annotate cells and also identify novel cell states and populations present in the query data.

  • SageNet (Heidari et al., 2022): This model allows construction of a spatial atlas by mapping dissociated query single cells/spots (e.g., from scRNA-seq or Visium datasets) into a common coordinate framework using one or more spatially resolved reference datasets.

  • mvTCR (Drost et al., 2022): Using this model, you can integrate T-cell receptor (TCR, treated as a sequence) and scRNA-seq datasets across multiple donors into a joint representation that captures information from both modalities.

  • scPoli (De Donno et al., 2022): This model allows integration of scRNA-seq datasets, prototype-based label transfer, and reference mapping. scPoli learns both sample embeddings and integrated cell embeddings, thus providing the user with a multi-scale view of the data, which is especially useful when there are many samples to integrate.
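
To make the MMD loss mentioned for trVAE more concrete, here is a minimal sketch of a (biased) squared maximum mean discrepancy estimate between two sets of points, using an RBF kernel. It is a plain-Python illustration under assumed toy data, not the trVAE implementation; the function names, the kernel choice, and the `gamma` value are assumptions. The quantity is small when two batches occupy the same region and large when they are separated, which is why penalizing it encourages batches to merge.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - y||^2) between two points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between samples xs and ys."""

    def mean_kernel(a, b):
        # Average kernel value over all pairs (u, v) with u in a, v in b.
        return sum(rbf_kernel(u, v, gamma) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy 2-D "latent" points for two batches (assumed values).
batch_a = [(0.0, 0.0), (0.1, -0.1), (-0.1, 0.1)]
batch_b_shifted = [(2.0, 2.0), (2.1, 1.9), (1.9, 2.1)]      # poorly merged
batch_b_aligned = [(0.05, 0.0), (0.0, 0.05), (-0.05, -0.05)]  # well merged

mmd_shifted = mmd_squared(batch_a, batch_b_shifted)
mmd_aligned = mmd_squared(batch_a, batch_b_aligned)
print(f"shifted: {mmd_shifted:.4f}, aligned: {mmd_aligned:.4f}")
```

In a training setting, a term of this kind would be computed on mini-batches with a differentiable tensor library and added to the reconstruction and KL terms of the loss.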

Where to start?

To get a sense of how the model works, please go through this tutorial. To find out how to construct and share or use pre-trained models, see the example sections.


If scArches is useful in your research, please consider citing the paper.