anndata.experimental.concat_on_disk

anndata.experimental.concat_on_disk(in_files, out_file, *, max_loaded_elems=100000000, axis=0, join='inner', merge=None, uns_merge=None, label=None, keys=None, index_unique=None, fill_value=None, pairwise=False)

Concatenates multiple AnnData objects along a specified axis using their corresponding stores or paths, and writes the resulting AnnData object to a target location on disk.

Unlike anndata.concat(), this method does not require loading the input AnnData objects into memory, making it a memory-efficient alternative for large datasets. The resulting object written to disk should be equivalent to the concatenation of the loaded AnnData objects using anndata.concat().

To adjust the maximum amount of data loaded into memory: for sparse arrays, use the max_loaded_elems argument; for dense arrays, see the Dask documentation, since this function uses the Dask concatenation function to concatenate dense arrays.

Parameters:
in_files Collection[str | PathLike] | Mapping[str, str | PathLike]

The corresponding stores or paths of AnnData objects to be concatenated. If a Mapping is passed, keys are used for the keys argument and values are concatenated.

out_file str | PathLike

The target path or store to write the result in.

max_loaded_elems int (default: 100000000)

The maximum number of elements to load in memory when concatenating sparse arrays. Note that this number also includes the empty entries. The default of 100 million elements means that roughly 400 MB will be loaded into memory at once.
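
As a rough sanity check on the 400 MB figure, the sketch below estimates the memory taken by the loaded elements. It assumes 4 bytes per element (for example float32 values). The helper name estimate_loaded_bytes is hypothetical and not part of anndata, and the estimate ignores sparse index arrays and other overhead.

```python
def estimate_loaded_bytes(max_loaded_elems: int, bytes_per_elem: int = 4) -> int:
    """Rough size in bytes of the loaded elements themselves.

    Assumption: every element occupies `bytes_per_elem` bytes (4 for float32).
    Sparse index arrays and interpreter overhead are not counted.
    """
    return max_loaded_elems * bytes_per_elem


# Default setting of max_loaded_elems (100 million elements).
default_bytes = estimate_loaded_bytes(100_000_000)
print(f"{default_bytes / 1e6:.0f} MB")  # prints "400 MB"
```

Halving max_loaded_elems halves this estimate, which may help on machines with little free memory.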

axis Literal[0, 1] (default: 0)

Which axis to concatenate along.

join Literal['inner', 'outer'] (default: 'inner')

How to align values when concatenating. If "outer", the union of the other axis is taken. If "inner", the intersection. See concatenation for more.
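
The difference between the two join modes can be pictured with plain lists of names on the other axis. This is only an illustration: aligned_names is a hypothetical helper, anndata itself aligns pandas indices, and the ordering shown here is an artifact of the sketch rather than a guarantee of the library.

```python
def aligned_names(name_lists, join="inner"):
    """Return the names kept on the axis that is NOT being concatenated.

    Illustration only, using plain lists of unique names.
    """
    if join not in ("inner", "outer"):
        raise ValueError(f"join must be 'inner' or 'outer', got {join!r}")
    kept = list(name_lists[0])
    for names in name_lists[1:]:
        if join == "inner":
            # Intersection: keep only names present in every object so far.
            present = set(names)
            kept = [n for n in kept if n in present]
        else:
            # Union: append names that have not been seen yet.
            seen = set(kept)
            kept += [n for n in names if n not in seen]
    return kept


batch_a = ["gene1", "gene2", "gene3"]
batch_b = ["gene2", "gene3", "gene4"]

inner = aligned_names([batch_a, batch_b], join="inner")
outer = aligned_names([batch_a, batch_b], join="outer")
print(inner)  # ['gene2', 'gene3']
print(outer)  # ['gene1', 'gene2', 'gene3', 'gene4']
```

With "outer", gene1 has no values in batch_b and gene4 has none in batch_a, which is where the fill_value argument comes into play.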

merge Union[Literal['same', 'unique', 'first', 'only'], Callable[[Collection[Mapping]], Mapping], None] (default: None)

How elements not aligned to the axis being concatenated along are selected. Currently implemented strategies include:

  • None: No elements are kept.

  • "same": Elements that are the same in each of the objects.

  • "unique": Elements for which there is only one possible value.

  • "first": The first element seen at each position.

  • "only": Elements that show up in only one of the objects.
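
The strategies listed above can be sketched on flat dictionaries of plain values. This is a simplified illustration under stated assumptions, not the anndata implementation: merge_flat is a hypothetical helper, values are compared with ==, and the real function works on array-like elements.

```python
def merge_flat(mappings, strategy=None):
    """Apply one of the merge strategies to a list of flat dicts.

    Illustration only: plain values compared with ==.
    """
    mappings = list(mappings)
    if strategy is None:
        return {}  # None: no elements are kept
    if strategy not in ("same", "unique", "first", "only"):
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Collect, per key, the values in the order the objects were given.
    values_by_key = {}
    for mapping in mappings:
        for key, value in mapping.items():
            values_by_key.setdefault(key, []).append(value)

    result = {}
    for key, values in values_by_key.items():
        all_equal = all(v == values[0] for v in values)
        if strategy == "same":
            # Present in every object and identical everywhere.
            keep = len(values) == len(mappings) and all_equal
        elif strategy == "unique":
            # Only one possible value among the objects that have the key.
            keep = all_equal
        elif strategy == "first":
            keep = True
        else:  # "only"
            keep = len(values) == 1
        if keep:
            result[key] = values[0]
    return result


a = {"species": "human", "assay": "10x", "lab": "A"}
b = {"species": "human", "assay": "smart-seq"}

for strategy in (None, "same", "unique", "first", "only"):
    print(strategy, merge_flat([a, b], strategy))
```

In this toy input, "species" agrees in both objects, "assay" conflicts, and "lab" appears in only one object, so each strategy keeps a different subset.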

uns_merge Union[Literal['same', 'unique', 'first', 'only'], Callable[[Collection[Mapping]], Mapping], None] (default: None)

How the elements of .uns are selected. Uses the same set of strategies as the merge argument, except applied recursively.

label str | None (default: None)

Column in axis annotation (i.e. .obs or .var) to place batch information in. If it’s None, no column is added.

keys Collection[str] | None (default: None)

Names for each object being added. These values are used for column values for label or appended to the index if index_unique is not None. Defaults to incrementing integer labels.

index_unique str | None (default: None)

Whether to make the index unique by using the keys. If provided, this is the delimiter between "{orig_idx}{index_unique}{key}". When None, the original indices are kept.
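
The "{orig_idx}{index_unique}{key}" pattern can be illustrated with plain strings. The helper make_unique_index and the cell names below are hypothetical and exist only for this sketch; anndata applies the pattern to the pandas index of the concatenated axis.

```python
def make_unique_index(indices_by_key, index_unique=None):
    """Combine per-object indices, optionally suffixing each entry with its key.

    Illustration only: `indices_by_key` maps each key to a list of index labels.
    """
    combined = []
    for key, index in indices_by_key.items():
        for orig_idx in index:
            if index_unique is None:
                # Original labels are kept, so duplicates may remain.
                combined.append(orig_idx)
            else:
                combined.append(f"{orig_idx}{index_unique}{key}")
    return combined


indices = {"b_cells": ["cell0", "cell1"], "fetal": ["cell0"]}

kept = make_unique_index(indices)
unique = make_unique_index(indices, index_unique="-")
print(kept)    # ['cell0', 'cell1', 'cell0']
print(unique)  # ['cell0-b_cells', 'cell1-b_cells', 'cell0-fetal']
```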

fill_value Any | None (default: None)

When join="outer", this is the value that will be used to fill the introduced indices. By default, sparse arrays are padded with zeros, while dense arrays and DataFrames are padded with missing values.

pairwise bool (default: False)

Whether pairwise elements along the concatenated dimension should be included. This is False by default, since the resulting arrays are often not meaningful.

Return type:

None

Notes

Warning

If you use join='outer', sparse data is filled with 0s where variables are absent in a batch, while dense data is filled with NaN. Use this with care.

Examples

See anndata.concat() for the semantics. The following examples highlight the differences this function has.

First, let’s get some “big” datasets with a compatible var axis:

>>> import httpx
>>> import scanpy as sc
>>> base_url = "https://datasets.cellxgene.cziscience.com"
>>> def get_cellxgene_data(id_: str):
...     out_path = sc.settings.datasetdir / f'{id_}.h5ad'
...     if out_path.exists():
...         return out_path
...     file_url = f"{base_url}/{id_}.h5ad"
...     sc.settings.datasetdir.mkdir(parents=True, exist_ok=True)
...     with httpx.stream('GET', file_url) as r, out_path.open('wb') as f:
...         r.raise_for_status()
...         for data in r.iter_bytes():
...             f.write(data)
...     return out_path
>>> path_b_cells = get_cellxgene_data('a93eab58-3d82-4b61-8a2f-d7666dcdb7c4')
>>> path_fetal = get_cellxgene_data('d170ff04-6da0-4156-a719-f8e1bbefbf53')

Now we can concatenate them on-disk:

>>> import anndata as ad
>>> ad.experimental.concat_on_disk(
...     dict(b_cells=path_b_cells, fetal=path_fetal),
...     'merged.h5ad',
...     label='dataset',
... )
>>> adata = ad.read_h5ad('merged.h5ad', backed=True)
>>> adata.X
CSRDataset: backend hdf5, shape (490, 15585), data_dtype float32
>>> adata.obs['dataset'].value_counts()  
dataset
fetal      344
b_cells    146
Name: count, dtype: int64