anndata.experimental.concat_on_disk#
- anndata.experimental.concat_on_disk(in_files, out_file, *, max_loaded_elems=100000000, axis=0, join='inner', merge=None, uns_merge=None, label=None, keys=None, index_unique=None, fill_value=None, pairwise=False)[source]#
Concatenates multiple AnnData objects along a specified axis using their corresponding stores or paths, and writes the resulting AnnData object to a target location on disk.
Unlike
anndata.concat()
, this method does not require loading the input AnnData objects into memory, making it a memory-efficient alternative for large datasets. The resulting object written to disk should be equivalent to the concatenation of the loaded AnnData objects usinganndata.concat()
.To adjust the maximum amount of data loaded in memory; for sparse arrays use the max_loaded_elems argument; for dense arrays see the Dask documentation, as the Dask concatenation function is used to concatenate dense arrays in this function
- Parameters:
- in_files
Collection
[str
|PathLike
] |Mapping
[str
,str
|PathLike
] The corresponding stores or paths of AnnData objects to be concatenated. If a Mapping is passed, keys are used for the
keys
argument and values are concatenated.- out_file
str
|PathLike
The target path or store to write the result in.
- max_loaded_elems
int
(default:100000000
) The maximum number of elements to load in memory when concatenating sparse arrays. Note that this number also includes the empty entries. Set to 100m by default meaning roughly 400mb will be loaded to memory simultaneously.
- axis
Literal
[0
,1
] (default:0
) Which axis to concatenate along.
- join
Literal
['inner'
,'outer'
] (default:'inner'
) How to align values when concatenating. If
"outer"
, the union of the other axis is taken. If"inner"
, the intersection. See concatenation for more.- merge
Union
[Literal
['same'
,'unique'
,'first'
,'only'
],Callable
[[Collection
[Mapping
]],Mapping
],None
] (default:None
) How elements not aligned to the axis being concatenated along are selected. Currently implemented strategies include:
None
: No elements are kept."same"
: Elements that are the same in each of the objects."unique"
: Elements for which there is only one possible value."first"
: The first element seen at each from each position."only"
: Elements that show up in only one of the objects.
- uns_merge
Union
[Literal
['same'
,'unique'
,'first'
,'only'
],Callable
[[Collection
[Mapping
]],Mapping
],None
] (default:None
) How the elements of
.uns
are selected. Uses the same set of strategies as themerge
argument, except applied recursively.- label
str
|None
(default:None
) Column in axis annotation (i.e.
.obs
or.var
) to place batch information in. If it’s None, no column is added.- keys
Collection
[str
] |None
(default:None
) Names for each object being added. These values are used for column values for
label
or appended to the index ifindex_unique
is notNone
. Defaults to incrementing integer labels.- index_unique
str
|None
(default:None
) Whether to make the index unique by using the keys. If provided, this is the delimiter between
"{orig_idx}{index_unique}{key}"
. WhenNone
, the original indices are kept.- fill_value
Any
|None
(default:None
) When
join="outer"
, this is the value that will be used to fill the introduced indices. By default, sparse arrays are padded with zeros, while dense arrays and DataFrames are padded with missing values.- pairwise
bool
(default:False
) Whether pairwise elements along the concatenated dimension should be included. This is False by default, since the resulting arrays are often not meaningful.
- in_files
- Return type:
Notes
Warning
If you use
join='outer'
this fills 0s for sparse data when variables are absent in a batch. Use this with care. Dense data is filled withNaN
.Examples
See
anndata.concat()
for the semantics. The following examples highlight the differences this function has.First, let’s get some “big” datasets with a compatible
var
axis:>>> import httpx >>> import scanpy as sc >>> base_url = "https://datasets.cellxgene.cziscience.com" >>> def get_cellxgene_data(id_: str): ... out_path = sc.settings.datasetdir / f'{id_}.h5ad' ... if out_path.exists(): ... return out_path ... file_url = f"{base_url}/{id_}.h5ad" ... sc.settings.datasetdir.mkdir(parents=True, exist_ok=True) ... with httpx.stream('GET', file_url) as r, out_path.open('wb') as f: ... r.raise_for_status() ... for data in r.iter_bytes(): ... f.write(data) ... return out_path >>> path_b_cells = get_cellxgene_data('a93eab58-3d82-4b61-8a2f-d7666dcdb7c4') >>> path_fetal = get_cellxgene_data('d170ff04-6da0-4156-a719-f8e1bbefbf53')
Now we can concatenate them on-disk:
>>> import anndata as ad >>> ad.experimental.concat_on_disk( ... dict(b_cells=path_b_cells, fetal=path_fetal), ... 'merged.h5ad', ... label='dataset', ... ) >>> adata = ad.read_h5ad('merged.h5ad', backed=True) >>> adata.X CSRDataset: backend hdf5, shape (490, 15585), data_dtype float32 >>> adata.obs['dataset'].value_counts() dataset fetal 344 b_cells 146 Name: count, dtype: int64