CELLxGENE Schema - Loom Path Translations

Version 7.1.0

H5AD/AnnData to Loom file path mapping documentation

Overview

This document provides a comprehensive mapping between the canonical H5AD/AnnData file format used by CELLxGENE Discover and the Loom file format. The mappings follow the CELLxGENE Schema v7.1.0 specification.

Key Differences:
  • Loom stores cell metadata in /col_attrs/ (AnnData uses /obs/), because cells are columns in ASAP's genes x cells matrix
  • Loom stores gene metadata in /row_attrs/ (AnnData uses /var/)
  • Loom stores dataset-level metadata in /attrs/ (AnnData uses /uns/)
  • Loom has no native categorical type; categorical fields are stored as string arrays

ASAP-specific Note: In ASAP, /matrix can contain either raw count data or already processed/normalized data, depending on what the user uploads. Additional processed matrices are stored in /layers/. This differs from CELLxGENE's expectation, where /X is normalized and /raw/X contains raw counts. See the Matrix Layers section for details.

Ontology Versions: ASAP applies the structural rules and field requirements from the CELLxGENE schema, but uses its own ontology and reference database versions, which are tied to each ASAP version. The specific ontology versions pinned by CELLxGENE are not enforced. See the Validation Requirements section for details.

Location Mapping Summary

H5AD Path | Purpose | Loom Path | Notes
/X | Primary matrix | /matrix | H5AD shape: (n_cells, n_genes); ASAP Loom shape: (n_genes, n_cells). In ASAP, this can be either raw counts or normalized data depending on user input.
/raw/X or /layers/{name} | Additional matrices | /layers/{name} | In ASAP, additional processed/normalized matrices are stored in /layers/. Same shape as /matrix.
/obs/{field} | Cell metadata columns | /col_attrs/{field} | 1-D array, length n_cells. In ASAP: cells are columns.
/var/{field} | Gene metadata columns | /row_attrs/{field} | 1-D array, length n_genes. In ASAP: genes are rows.
/raw/var/{field} | Raw gene metadata | /row_attrs/{field} | Duplicate or use layer attrs
/obsm/{key} | Embeddings (UMAP, tSNE, etc.) | /col_attrs/{key} | 2-D arrays (n_cells, m)
/obsp/{key} | Pairwise cell matrices | /pairwise/{key} | No direct support
/varm/{key} | Multi-dim gene annotations | /row_attrs/{key} | 2-D arrays (n_genes, m)
/varp/{key} | Pairwise gene matrices | /pairwise/{key} | No direct support
/uns/{key} | Dataset-level metadata | /attrs/{key} | Flatten nested dicts
/obs/index | Cell identifiers | /col_attrs/CellID | Unique identifiers. In ASAP: cells are columns.
/var/index | Gene identifiers | /row_attrs/_StableID | ENSEMBL IDs (no version). In ASAP: genes are rows.
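
The prefix rules in the summary table can be sketched as a small lookup. This is a minimal illustration, not part of ASAP or CELLxGENE: the names EXACT, PREFIXES, and h5ad_to_loom_path are hypothetical, and only the unambiguous rows of the table are covered.

```python
# Illustrative only: EXACT, PREFIXES and h5ad_to_loom_path are hypothetical names.

# Paths that map to a specific Loom attribute are checked first.
EXACT = {
    "/X": "/matrix",
    "/obs/index": "/col_attrs/CellID",
    "/var/index": "/row_attrs/_StableID",
}

# (H5AD prefix, Loom prefix) pairs taken from the summary table.
PREFIXES = [
    ("/obsm/", "/col_attrs/"),
    ("/obsp/", "/pairwise/"),
    ("/varp/", "/pairwise/"),
    ("/obs/", "/col_attrs/"),
    ("/var/", "/row_attrs/"),
    ("/uns/", "/attrs/"),
    ("/layers/", "/layers/"),
]


def h5ad_to_loom_path(path: str) -> str:
    """Translate an H5AD path to its Loom counterpart, or raise KeyError."""
    if path in EXACT:
        return EXACT[path]
    for src, dst in PREFIXES:
        if path.startswith(src):
            return dst + path[len(src):]
    raise KeyError(f"no Loom mapping known for {path!r}")
```

For example, h5ad_to_loom_path("/obs/donor_id") yields "/col_attrs/donor_id". Unknown paths raise rather than guess, since several H5AD locations have no direct Loom equivalent.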

Cell Metadata (/obs/ to /col_attrs/)

All cell-level metadata from /obs/ maps to /col_attrs/ in ASAP Loom files (because ASAP uses cells as columns).

Multi-Value Fields: Some fields can contain multiple ontology terms separated by " || " (space-pipe-pipe-space). These are marked with MULTI-VALUE. For example: "MONDO:0005015 || MONDO:0004975" for multiple diseases. Fields marked with MULTI-VALUE * are ASAP extensions to fields that are single-value in the standard CELLxGENE schema. See the ASAP Schema Extensions section for details.
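
A minimal sketch of reading and writing multi-value fields with this separator. The helper names split_terms and join_terms are hypothetical and not part of any ASAP or CELLxGENE API.

```python
SEPARATOR = " || "  # space-pipe-pipe-space, as described above


def split_terms(value: str) -> list[str]:
    """Split a multi-value field into individual terms.

    Splitting on the bare "||" and stripping whitespace also tolerates
    input where the surrounding spaces were lost.
    """
    return [term.strip() for term in value.split("||") if term.strip()]


def join_terms(terms: list[str]) -> str:
    """Join terms back into the canonical multi-value representation."""
    return SEPARATOR.join(terms)
```

A single-value field passes through unchanged: split_terms("PATO:0000461") returns a one-element list.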

Required Fields (Curator Must Annotate)

H5AD Path | Loom Path | Type | Description
/obs/index | /col_attrs/CellID | str | REQUIRED. Unique cell identifiers
/obs/assay_ontology_term_id | /col_attrs/assay_ontology_term_id | categorical str | REQUIRED. EFO assay term
/obs/cell_type_ontology_term_id | /col_attrs/cell_type_ontology_term_id | categorical str | REQUIRED (if not pre-analysis). CL term or "unknown". MULTI-VALUE *
/obs/development_stage_ontology_term_id | /col_attrs/development_stage_ontology_term_id | categorical str | REQUIRED. Development stage term
/obs/disease_ontology_term_id | /col_attrs/disease_ontology_term_id | categorical str | REQUIRED. MONDO term(s) or "PATO:0000461". MULTI-VALUE: multiple terms separated by " || " (e.g., "MONDO:0005015 || MONDO:0004975")
/obs/donor_id | /col_attrs/donor_id | categorical str | REQUIRED. Unique donor identifier
/obs/is_primary_data | /col_attrs/is_primary_data | bool | REQUIRED. True if canonical instance
/obs/self_reported_ethnicity_ontology_term_id | /col_attrs/self_reported_ethnicity_ontology_term_id | categorical str | REQUIRED. HANCESTRO term(s). MULTI-VALUE: multiple terms separated by " || " (e.g., "HANCESTRO:0005 || HANCESTRO:0014")
/obs/sex_ontology_term_id | /col_attrs/sex_ontology_term_id | categorical str | REQUIRED. PATO term
/obs/suspension_type | /col_attrs/suspension_type | categorical str | REQUIRED. "cell", "nucleus", or "na"
/obs/tissue_ontology_term_id | /col_attrs/tissue_ontology_term_id | categorical str | REQUIRED. UBERON or Cellosaurus term. MULTI-VALUE *
/obs/tissue_type | /col_attrs/tissue_type | categorical str | REQUIRED. "tissue", "organoid", "cell line", or "primary cell culture"

* ASAP extension: these fields are single-value in the standard CELLxGENE schema but support multiple ontology terms in ASAP. See ASAP Schema Extensions for details.
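
As a rough sketch, the presence of the required cell metadata fields listed above could be checked like this. The function name missing_required and the is_pre_analysis flag are illustrative assumptions; the check covers only field presence, not value validity.

```python
# Attribute names under /col_attrs/ taken from the Required Fields table.
REQUIRED_COL_ATTRS = [
    "CellID",
    "assay_ontology_term_id",
    "cell_type_ontology_term_id",
    "development_stage_ontology_term_id",
    "disease_ontology_term_id",
    "donor_id",
    "is_primary_data",
    "self_reported_ethnicity_ontology_term_id",
    "sex_ontology_term_id",
    "suspension_type",
    "tissue_ontology_term_id",
    "tissue_type",
]


def missing_required(col_attr_names, is_pre_analysis=False):
    """Return required /col_attrs/ names that are absent, in table order."""
    present = set(col_attr_names)
    missing = []
    for name in REQUIRED_COL_ATTRS:
        # The cell type term is only required when the dataset is not pre-analysis.
        if is_pre_analysis and name == "cell_type_ontology_term_id":
            continue
        if name not in present:
            missing.append(name)
    return missing
```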

Auto-Generated Fields (CELLxGENE Annotates)

Ontology Term Name Mapping: These fields contain the human-readable ontology term names corresponding to the ontology term IDs in their associated *_ontology_term_id fields. For example, assay contains the name for the term ID in assay_ontology_term_id, cell_type contains the name for cell_type_ontology_term_id, etc.

H5AD Path | Loom Path | Source Field | Description
/obs/assay | /col_attrs/assay | assay_ontology_term_id | Ontology term name for the assay
/obs/cell_type | /col_attrs/cell_type | cell_type_ontology_term_id | Ontology term name(s) for the cell type. MULTI-VALUE *
/obs/development_stage | /col_attrs/development_stage | development_stage_ontology_term_id | Ontology term name for the development stage
/obs/disease | /col_attrs/disease | disease_ontology_term_id | Ontology term name(s) for the disease. MULTI-VALUE
/obs/self_reported_ethnicity | /col_attrs/self_reported_ethnicity | self_reported_ethnicity_ontology_term_id | Ontology term name(s) for ethnicity. MULTI-VALUE
/obs/sex | /col_attrs/sex | sex_ontology_term_id | Ontology term name for sex
/obs/tissue | /col_attrs/tissue | tissue_ontology_term_id | Ontology term name(s) for the tissue. MULTI-VALUE *
/obs/observation_joinid | /col_attrs/observation_joinid | - | Unique observation identifier (not from ontology)

* ASAP extension: these fields are single-value in the standard CELLxGENE schema but support multiple ontology term names in ASAP. See ASAP Schema Extensions for details.

Conditional Fields (Visium/Spatial)

H5AD Path | Loom Path | Condition
/obs/array_col | /col_attrs/array_col | Visium with /uns/spatial/is_single=True
/obs/array_row | /col_attrs/array_row | Visium with /uns/spatial/is_single=True
/obs/in_tissue | /col_attrs/in_tissue | Visium with /uns/spatial/is_single=True

Genetic Perturbation Fields (v7.1.0)

H5AD Path | Loom Path | Condition
/obs/genetic_perturbation_id | /col_attrs/genetic_perturbation_id | If /uns/genetic_perturbations present
/obs/genetic_perturbation_strategy | /col_attrs/genetic_perturbation_strategy | If /obs/genetic_perturbation_id present
/obs/experimental_condition_ontology_term_id | /col_attrs/experimental_condition_ontology_term_id | Optional for perturbation experiments
/obs/experimental_condition | /col_attrs/experimental_condition | Auto-generated if ontology term present
/obs/perturbation_types | /col_attrs/perturbation_types | Auto-generated from perturbation data

Gene Metadata (/var/ to /row_attrs/)

All gene-level metadata from /var/ maps to /row_attrs/ in ASAP Loom files.

ASAP Matrix Orientation: ASAP uses a genes x cells matrix orientation (genes as rows, cells as columns). This is the opposite of CELLxGENE/AnnData which uses cells x genes. Therefore:
  • Gene metadata is in /row_attrs/ (not /col_attrs/)
  • Cell metadata is in /col_attrs/ (not /row_attrs/)
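
In practice the orientation difference amounts to a transpose. The sketch below assumes a dense NumPy array for simplicity (real matrices are usually sparse) and uses a hypothetical helper name, to_loom_orientation.

```python
import numpy as np


def to_loom_orientation(x: np.ndarray) -> np.ndarray:
    """Convert an AnnData-style (n_cells, n_genes) matrix to (n_genes, n_cells)."""
    # .T is a view; ascontiguousarray makes an independent row-major copy.
    return np.ascontiguousarray(x.T)


# Toy data: 3 cells x 4 genes.
x = np.arange(12, dtype=np.float32).reshape(3, 4)
matrix = to_loom_orientation(x)

# The value for cell c and gene g moves from x[c, g] to matrix[g, c].
assert matrix.shape == (4, 3)
assert matrix[2, 1] == x[1, 2]
```

After the transpose, per-cell arrays (length n_cells) line up with columns and per-gene arrays (length n_genes) line up with rows, which is why they belong in /col_attrs/ and /row_attrs/ respectively.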

Field Name Mapping: ASAP to CELLxGENE

H5AD Path | ASAP Loom Path | Status | Description
/var/index | /row_attrs/_StableID | REQUIRED | ENSEMBL gene IDs (no version suffix)
/var/feature_name | /row_attrs/Gene | Available | Gene symbol/name
/var/feature_biotype | /row_attrs/_Biotypes | Available | Gene biotype (e.g., "protein_coding", "lncRNA")
/var/feature_length | /row_attrs/_SumExonLength | Available | Sum of exon lengths (bp)
/var/feature_type | - | CELLxGENE auto | Gene type from GENCODE/ENSEMBL. Added by CELLxGENE on upload.
/var/feature_is_filtered | - | REQUIRED* | REQUIRED for CELLxGENE. Must be added manually before submission.
/var/feature_reference | - | CELLxGENE auto | NCBITaxon term for reference organism. ASAP does not have multi-species projects.
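
The field name mapping above can be applied mechanically. The following is a sketch under the assumption that row attributes have already been read into a plain dict; the names ASAP_TO_CELLXGENE_VAR, rename_row_attrs, and strip_ensembl_version are hypothetical.

```python
# ASAP /row_attrs/ name -> CELLxGENE /var/ name, from the table above.
ASAP_TO_CELLXGENE_VAR = {
    "_StableID": "index",
    "Gene": "feature_name",
    "_Biotypes": "feature_biotype",
    "_SumExonLength": "feature_length",
}


def rename_row_attrs(row_attrs: dict) -> dict:
    """Keep only the attributes with a CELLxGENE equivalent, under their /var/ names."""
    return {
        ASAP_TO_CELLXGENE_VAR[name]: values
        for name, values in row_attrs.items()
        if name in ASAP_TO_CELLXGENE_VAR
    }


def strip_ensembl_version(gene_id: str) -> str:
    """Drop a trailing version suffix, e.g. 'ENSG00000139618.15' -> 'ENSG00000139618'."""
    return gene_id.split(".", 1)[0]
```

Fields without an ASAP counterpart (feature_is_filtered, feature_type, feature_reference) are deliberately left out, since they are added manually or by CELLxGENE on upload.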

ASAP-Specific Gene Metadata

These fields are generated by ASAP during parsing and have no equivalent in H5AD files:

Loom Path | Description
/row_attrs/_StableID | ENSEMBL gene identifiers (stable IDs, no version suffix). Used as the gene index in ASAP.
/row_attrs/Accession | Alternative gene accession identifiers
/row_attrs/Gene | Gene symbols/names
/row_attrs/_Biotypes | Gene biotype (e.g., "protein_coding", "lncRNA", "miRNA")
/row_attrs/_SumExonLength | Sum of exon lengths in base pairs
/row_attrs/_Sum | Total expression per gene (QC metric)

For CELLxGENE Compliance: The following fields are REQUIRED by CELLxGENE but not automatically generated by ASAP:
  • feature_is_filtered: Boolean indicating if gene is filtered from normalized matrix. Must be added manually.
  • feature_reference: NCBITaxon term for reference organism. Auto-generated by CELLxGENE on upload (ASAP does not support multi-species projects).
  • feature_type: Gene type from GENCODE/ENSEMBL. Auto-generated by CELLxGENE on upload.

Embeddings (/obsm/)

Embeddings from /obsm/ are stored as 2-D arrays in /col_attrs/ in ASAP (since cells are columns).

H5AD Path | Loom Path | Shape | Notes
/obsm/spatial | /col_attrs/spatial | (n_cells, 2+) | Required for Visium/Slide-seq when /uns/spatial/is_single=True
/obsm/X_{suffix} | /col_attrs/X_{suffix} | (n_cells, 2+) | UMAP, tSNE, PCA embeddings. {suffix} cannot be "spatial"

Embedding Requirements:
  • At least one X_{suffix} embedding is required for non-spatial assays
  • Values must have a numeric NumPy dtype: float32, float64, int32, int64, or an unsigned integer type
  • Values must NOT include infinity, and an embedding must NOT consist entirely of NaN values
  • {suffix} pattern: starts with a letter, followed by any number of letters, digits, underscores, dashes, or dots
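
The naming rule for X_{suffix} keys can be expressed as a regular expression. This is a sketch of the rule as worded above; the pattern and the helper is_valid_embedding_key are illustrative and not copied from a validator.

```python
import re

# "X_" followed by a suffix: a letter, then letters, digits, "_", "-", or ".".
EMBEDDING_KEY = re.compile(r"X_([A-Za-z][A-Za-z0-9_.-]*)")


def is_valid_embedding_key(key: str) -> bool:
    """Check an /obsm/ (or /col_attrs/) embedding key against the naming rule."""
    match = EMBEDDING_KEY.fullmatch(key)
    # The suffix "spatial" is reserved for the /obsm/spatial coordinates.
    return match is not None and match.group(1) != "spatial"
```

Using fullmatch rather than match avoids accepting keys with trailing characters such as a newline.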

Global Attributes (/uns/ to /attrs/)

Dataset-level metadata from /uns/ maps to /attrs/ in ASAP Loom files.

Required Global Attributes

H5AD Path | Loom Path | Annotator | Description
/uns/title | /attrs/title | Curator | Dataset title
/uns/organism_ontology_term_id | /attrs/organism_ontology_term_id | Curator | NCBITaxon term for organism
/uns/schema_version | /attrs/schema_version | CELLxGENE | Must be "7.1.0"
/uns/schema_reference | /attrs/schema_reference | CELLxGENE | URL to schema document
/uns/organism | /attrs/organism | CELLxGENE | Human-readable organism name
/uns/citation | /attrs/citation | CELLxGENE | Citation string
/uns/is_pre_analysis | /attrs/is_pre_analysis | CELLxGENE | True for pre-analysis collections

Optional Global Attributes

H5AD Path | Loom Path | Description
/uns/batch_condition | /attrs/batch_condition | JSON array of batch keys
/uns/default_embedding | /attrs/default_embedding | Key of default embedding to display
/uns/X_approximate_distribution | /attrs/X_approximate_distribution | "count" or "normal"
/uns/{column}_colors | /attrs/{column}_colors | Color palette array (hex or named colors)

Spatial Metadata (Visium)

Nested spatial metadata must be flattened using slash notation:

H5AD Path | Loom Path
/uns/spatial/is_single | /attrs/spatial/is_single
/uns/spatial/{library_id}/images/hires | /attrs/spatial/{library_id}/images/hires (as base64 or separate file)
/uns/spatial/{library_id}/scalefactors/spot_diameter_fullres | /attrs/spatial/{library_id}/scalefactors/spot_diameter_fullres
/uns/spatial/{library_id}/scalefactors/tissue_hires_scalef | /attrs/spatial/{library_id}/scalefactors/tissue_hires_scalef
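
Flattening with slash notation can be sketched as a short recursive function. The function name flatten, the library id "lib1", and the numeric values are made up for illustration.

```python
def flatten(nested: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into a single dict keyed by slash-joined paths."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}/{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))  # recurse into sub-dictionaries
        else:
            flat[path] = value
    return flat


# Hypothetical /uns/ content.
uns = {
    "spatial": {
        "is_single": True,
        "lib1": {"scalefactors": {"spot_diameter_fullres": 89.4,
                                  "tissue_hires_scalef": 0.17}},
    }
}
attrs = flatten(uns)
# Keys are relative to /attrs/, e.g. "spatial/lib1/scalefactors/tissue_hires_scalef".
```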

Genetic Perturbations (v7.1.0)

Complex perturbation metadata should be stored as JSON:

H5AD Path | Loom Path | Format
/uns/genetic_perturbations | /attrs/genetic_perturbations | JSON string
/uns/genetic_perturbations/{id}/role | /attrs/genetic_perturbations/{id}/role | Or flatten keys
/uns/genetic_perturbations/{id}/protospacer_sequence | /attrs/genetic_perturbations/{id}/protospacer_sequence | DNA sequence (A/C/G/T)
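
A minimal sketch of the JSON-string option, assuming the perturbation metadata is available as a nested dict. The perturbation id "gp_1", the role value, and the function name encode_perturbations are hypothetical; only the A/C/G/T constraint comes from the table above.

```python
import json
import re


def encode_perturbations(perturbations: dict) -> str:
    """Serialize perturbation metadata to a JSON string for /attrs/genetic_perturbations."""
    for pid, entry in perturbations.items():
        seq = entry.get("protospacer_sequence")
        # Only the four DNA letters are allowed in a protospacer sequence.
        if seq is not None and not re.fullmatch(r"[ACGT]+", seq):
            raise ValueError(f"{pid}: protospacer_sequence must contain only A/C/G/T")
    return json.dumps(perturbations, sort_keys=True)


# Hypothetical example entry.
example = {"gp_1": {"role": "targeting", "protospacer_sequence": "ACGTACGTACGTACGTACGT"}}
encoded = encode_perturbations(example)
decoded = json.loads(encoded)  # readers recover the nested structure
```

Storing one JSON string keeps the nesting intact, at the cost of readers having to parse it; flattening keys is the alternative noted in the table.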

Matrix Layers

ASAP Behavior: In ASAP, /matrix contains the primary data matrix as provided by the user. This can be either raw counts or already normalized/processed data, depending on what the user uploads. Additional processed matrices (e.g., normalized, scaled, log-transformed) are stored in /layers/.

H5AD Path | CELLxGENE Loom | ASAP Loom | Purpose
/X | /matrix | /matrix | Primary matrix. In ASAP: contains raw counts OR normalized data depending on user input.
/raw/X | /layers/raw | N/A - see note | CELLxGENE expects raw counts here. ASAP does not create /layers/raw.
/layers/{name} | /layers/{name} | /layers/{name} | Additional processed matrices (normalized, scaled, log-transformed, etc.)

CELLxGENE Expectation vs ASAP Reality:
  • CELLxGENE expects: /matrix = normalized data, /layers/raw = raw counts.
  • ASAP behavior: /matrix contains what the user uploads (typically raw counts). /layers/raw does NOT exist in ASAP.
  • If the user submits raw counts, they are stored directly in /matrix.
  • If the user submits already-normalized data, it is also stored in /matrix (with no separate raw layer).
  • For CELLxGENE compliance: If /matrix already contains raw counts, no additional layer is needed. If /matrix contains normalized data, raw counts should be added to /layers/raw before submission.

Sparse Matrix Encoding: If 50% or more of the values are zeros, the matrix MUST be encoded as scipy.sparse.csr_matrix with implicit zeros. In Loom, use chunked compression (LZF) to minimize file size.
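
The 50% threshold can be checked with a few lines of NumPy. This sketch assumes a dense in-memory array, and the helper names zero_fraction and must_be_sparse are illustrative.

```python
import numpy as np


def zero_fraction(matrix: np.ndarray) -> float:
    """Fraction of entries in the matrix that are exactly zero."""
    return 1.0 - np.count_nonzero(matrix) / matrix.size


def must_be_sparse(matrix: np.ndarray) -> bool:
    """True when the sparse-encoding rule applies (50% or more zeros)."""
    return zero_fraction(matrix) >= 0.5


mostly_zero = np.array([[0, 0], [0, 1]])
mostly_full = np.array([[1, 2], [0, 3]])
```

Here must_be_sparse(mostly_zero) is True (75% zeros) and must_be_sparse(mostly_full) is False (25% zeros); a matrix with exactly half zeros also meets the threshold.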

Features Requiring Special Handling

Feature | Challenge | Workaround
/obsp/ and /varp/ (pairwise matrices) | No native Loom support | Store in /pairwise/{key} or as external .npz file referenced by global attribute
Hierarchical /uns/ dictionaries | Loom attributes are flat | Flatten with slash keys (e.g., /attrs/spatial/is_single) or store as JSON string
pandas CategoricalDtype | Loom stores raw arrays only | Store {field}_categories and optionally {field}_category_codes arrays
Mixed dtypes (strings + NaN) | Loom requires homogeneous dtypes | Cast to string, use sentinel values ("na", "unknown")
/raw/var/ distinct from /var/ | Loom has no second gene table | Duplicate columns in /row_attrs/ and document in global attribute
Neighbor graphs (/uns/neighbors) | No canonical storage | Store as /col_attrs/neighbors_indices + /col_attrs/neighbors_distances
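
The categorical and mixed-dtype workarounds can be combined in one step. The sketch below uses plain Python lists instead of pandas; the function name encode_categorical and the sorted category order are assumptions, not ASAP behavior.

```python
def encode_categorical(values, missing="unknown"):
    """Return (string values, categories, codes) for a column with possible gaps.

    None and NaN are replaced by the sentinel so the column has a single dtype.
    """
    # v != v is True only for NaN, so this catches float("nan") without imports.
    as_strings = [missing if v is None or v != v else str(v) for v in values]
    categories = sorted(set(as_strings))
    index = {category: code for code, category in enumerate(categories)}
    codes = [index[v] for v in as_strings]
    return as_strings, categories, codes


strings, categories, codes = encode_categorical(
    ["cell", None, "nucleus", "cell"], missing="na"
)
# strings    -> stored as {field}
# categories -> stored as {field}_categories
# codes      -> optionally stored as {field}_category_codes
```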

ASAP Schema Extensions and Processes

ASAP extends the standard CELLxGENE schema in specific ways to better support annotation workflows. These extensions are backward-compatible: fields that follow the standard schema remain valid. The extensions are documented here so that downstream tools processing ASAP Loom files are aware of them.

Multi-Value Extension for Cell Type and Tissue

In the standard CELLxGENE schema, the cell_type_ontology_term_id and tissue_ontology_term_id fields accept a single ontology term per cell. ASAP extends these fields to support multiple ontology terms for cases where a cell is associated with more than one cell type or tissue annotation.

Field Pair | Loom Paths | Standard Schema | ASAP Extension
Cell Type | /col_attrs/cell_type_ontology_term_id, /col_attrs/cell_type | Single CL term (e.g., CL:0000540) | Multiple CL terms separated by " || " (e.g., CL:0000540 || CL:0000127)
Tissue | /col_attrs/tissue_ontology_term_id, /col_attrs/tissue | Single UBERON term (e.g., UBERON:0002371) | Multiple UBERON terms separated by " || " (e.g., UBERON:0002371 || UBERON:0001264)

Multi-Value Separator: Multi-value fields use " || " (space-pipe-pipe-space) as the separator, consistent with the standard CELLxGENE convention used for fields like disease_ontology_term_id and self_reported_ethnicity_ontology_term_id. The corresponding name fields (cell_type, tissue) also contain multiple names separated by " || ", in the same order as their ontology term IDs.

Compatibility Note: When submitting to CELLxGENE Discover, multi-value cell type and tissue fields may need to be resolved to single values, as CELLxGENE does not currently support multi-value for these fields. Tools reading ASAP Loom files should be prepared to handle both single-value and multi-value formats for these fields.

Automatic Paired Metadata Generation

The CELLxGENE schema defines paired cell metadata fields: an ontology term identifier field (e.g., cell_type_ontology_term_id) and its corresponding ontology term name field (e.g., cell_type). In ASAP, users can populate either side of the pair from existing metadata, and the other side is generated automatically.

Values that cannot be resolved against the ontology (mismatches, typos, or terms not present in the ontology version used by ASAP) are flagged as unresolved and can be corrected manually through the compliance fix interface.
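
Conceptually, paired generation is a two-way lookup with a flag for unresolved values. The sketch below is not ASAP's implementation: the two-entry TOY_ONTOLOGY stands in for a real ontology release, and fill_pair is a hypothetical name.

```python
# Toy stand-in for an ontology: term ID -> term name.
TOY_ONTOLOGY = {
    "CL:0000540": "neuron",
    "CL:0000127": "astrocyte",
}


def fill_pair(term_id=None, name=None, ontology=TOY_ONTOLOGY):
    """Complete an (ID, name) pair from whichever side is given.

    Returns (term_id, name, resolved); resolved is False when the given
    value is not found in the ontology and needs manual correction.
    """
    if term_id is not None:
        label = ontology.get(term_id)
        return term_id, label, label is not None
    if name is not None:
        by_name = {label: tid for tid, label in ontology.items()}
        found = by_name.get(name)
        return found, name, found is not None
    return None, None, False
```

A misspelled name such as "astrocyt" comes back with resolved set to False, mirroring the unresolved flag described above.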

Validation Requirements

Required Validations

Organism-Specific Requirements

Ontology and Reference Versions

ASAP Uses Its Own Ontology Versions: ASAP applies the structural rules and field requirements from the CELLxGENE schema, but uses ontology and reference database versions that are associated with each ASAP version, not the specific versions pinned by CELLxGENE. This allows ASAP to maintain consistency with its own annotation pipeline and organism databases.

CELLxGENE 7.1.0 Pinned Versions (for reference only, not enforced by ASAP):

Note: See your ASAP version's documentation for the actual ontology versions in use.