anndata.experimental.concat_on_disk
- anndata.experimental.concat_on_disk(in_files, out_file, *, max_loaded_elems=100000000, axis=0, join='inner', merge=None, uns_merge=None, label=None, keys=None, index_unique=None, fill_value=None, pairwise=False)
- Concatenates multiple AnnData objects along a specified axis using their corresponding stores or paths, and writes the resulting AnnData object to a target location on disk.
- Unlike anndata.concat(), this method does not require loading the input AnnData objects into memory, making it a memory-efficient alternative for large datasets. The resulting object written to disk should be equivalent to the concatenation of the loaded AnnData objects using anndata.concat().
- To adjust the maximum amount of data loaded in memory: for sparse arrays, use the max_loaded_elems argument; for dense arrays, see the Dask documentation, as the Dask concatenation function is used to concatenate dense arrays in this function.
- Parameters:
- in_files Collection[str|PathLike] |Mapping[str,str|PathLike]
- The corresponding stores or paths of AnnData objects to be concatenated. If a Mapping is passed, keys are used for the keys argument and values are concatenated.
- out_file str|PathLike
- The target path or store to write the result in. 
- max_loaded_elems int (default:100000000)
- The maximum number of elements to load in memory when concatenating sparse arrays. Note that this number also includes the empty entries. Defaults to 100 million, meaning roughly 400 MB will be loaded into memory simultaneously.
- axis Literal[0,1] (default:0)
- Which axis to concatenate along. 
- join Literal['inner','outer'] (default:'inner')
- How to align values when concatenating. If "outer", the union of the other axis is taken. If "inner", the intersection. See concatenation for more.
- merge Union[Literal['same','unique','first','only'],Callable[[Collection[Mapping]],Mapping],None] (default:None)
- How elements not aligned to the axis being concatenated along are selected. Currently implemented strategies include:
- None: No elements are kept.
- "same": Elements that are the same in each of the objects.
- "unique": Elements for which there is only one possible value.
- "first": The first element seen at each position.
- "only": Elements that show up in only one of the objects.
 
- uns_merge Union[Literal['same','unique','first','only'],Callable[[Collection[Mapping]],Mapping],None] (default:None)
- How the elements of .uns are selected. Uses the same set of strategies as the merge argument, except applied recursively.
- label str|None (default:None)
- Column in axis annotation (i.e. .obs or .var) to place batch information in. If it is None, no column is added.
- keys Collection[str] |None (default:None)
- Names for each object being added. These values are used for column values for label or appended to the index if index_unique is not None. Defaults to incrementing integer labels.
- index_unique str|None (default:None)
- Whether to make the index unique by using the keys. If provided, this is the delimiter placed between the original index and the key, as in "{orig_idx}{index_unique}{key}". When None, the original indices are kept.
- fill_value Any|None (default:None)
- When join="outer", this is the value that will be used to fill the introduced indices. By default, sparse arrays are padded with zeros, while dense arrays and DataFrames are padded with missing values.
- pairwise bool (default:False)
- Whether pairwise elements along the concatenated dimension should be included. This is False by default, since the resulting arrays are often not meaningful. 
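The four named strategies accepted by merge and uns_merge can be illustrated with a plain-Python sketch over flat dictionaries. This is only an illustration of the selection rules as described in the parameter list, not anndata's implementation: the function names (merge_same, merge_unique, merge_first, merge_only) and the sample dictionaries are hypothetical, values are compared with ==, and the recursive handling that uns_merge applies to nested mappings is left out.

```python
from collections.abc import Mapping, Sequence
from typing import Any


def _union_keys(dicts: Sequence[Mapping[str, Any]]) -> list[str]:
    """Return every key that appears in any mapping, in first-seen order."""
    seen: dict[str, None] = {}
    for d in dicts:
        for key in d:
            seen.setdefault(key, None)
    return list(seen)


def merge_same(dicts: Sequence[Mapping[str, Any]]) -> dict[str, Any]:
    """Keep keys present in every mapping with an identical value."""
    if not dicts:
        return {}
    first, *rest = dicts
    return {
        key: value
        for key, value in first.items()
        if all(key in d and d[key] == value for d in rest)
    }


def merge_unique(dicts: Sequence[Mapping[str, Any]]) -> dict[str, Any]:
    """Keep keys for which only one distinct value exists among the mappings that have them."""
    out: dict[str, Any] = {}
    for key in _union_keys(dicts):
        values = [d[key] for d in dicts if key in d]
        if all(v == values[0] for v in values):
            out[key] = values[0]
    return out


def merge_first(dicts: Sequence[Mapping[str, Any]]) -> dict[str, Any]:
    """Keep the first value seen for each key."""
    out: dict[str, Any] = {}
    for d in dicts:
        for key, value in d.items():
            out.setdefault(key, value)
    return out


def merge_only(dicts: Sequence[Mapping[str, Any]]) -> dict[str, Any]:
    """Keep keys that show up in exactly one mapping."""
    out: dict[str, Any] = {}
    for key in _union_keys(dicts):
        values = [d[key] for d in dicts if key in d]
        if len(values) == 1:
            out[key] = values[0]
    return out


# Hypothetical per-batch metadata, standing in for elements such as .uns entries.
batch_a = {"genome": "GRCh38", "highly_variable": True, "note": "pilot"}
batch_b = {"genome": "GRCh38", "highly_variable": False}

for name, strategy in [
    ("same", merge_same),
    ("unique", merge_unique),
    ("first", merge_first),
    ("only", merge_only),
]:
    print(name, strategy([batch_a, batch_b]))
```

In this toy input, "genome" agrees in both batches, "highly_variable" conflicts, and "note" exists in only one batch, so each strategy keeps a different subset. Passing None instead of a strategy name keeps none of these elements.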
 
- Return type: None
- Notes
- Warning
- If you use join='outer', this fills 0s for sparse data when variables are absent in a batch. Use this with care. Dense data is filled with NaN.
- Examples
- See anndata.concat() for the semantics. The following examples highlight the differences this function has.
- First, let’s get some “big” datasets with a compatible var axis:

>>> import httpx
>>> import scanpy as sc
>>> base_url = "https://datasets.cellxgene.cziscience.com"
>>> def get_cellxgene_data(id_: str):
...     out_path = sc.settings.datasetdir / f'{id_}.h5ad'
...     if out_path.exists():
...         return out_path
...     file_url = f"{base_url}/{id_}.h5ad"
...     sc.settings.datasetdir.mkdir(parents=True, exist_ok=True)
...     with httpx.stream('GET', file_url) as r, out_path.open('wb') as f:
...         r.raise_for_status()
...         for data in r.iter_bytes():
...             f.write(data)
...     return out_path
>>> path_b_cells = get_cellxgene_data('a93eab58-3d82-4b61-8a2f-d7666dcdb7c4')
>>> path_fetal = get_cellxgene_data('d170ff04-6da0-4156-a719-f8e1bbefbf53')

- Now we can concatenate them on-disk:

>>> import anndata as ad
>>> ad.experimental.concat_on_disk(
...     dict(b_cells=path_b_cells, fetal=path_fetal),
...     'merged.h5ad',
...     label='dataset',
... )
>>> adata = ad.read_h5ad('merged.h5ad', backed=True)
>>> adata.X
CSRDataset: backend hdf5, shape (490, 15585), data_dtype float32
>>> adata.obs['dataset'].value_counts()
dataset
fetal      344
b_cells    146
Name: count, dtype: int64