anndata.AnnData.concatenate

AnnData.concatenate(*adatas, join='inner', batch_key='batch', batch_categories=None, index_unique='-')

Concatenate along the observations axis.

The .uns, .varm and .obsm attributes are ignored.

Currently, this works only in 'memory' mode.

Parameters:
adatas : AnnData

AnnData matrices to concatenate with. Each matrix is referred to as a “batch”.

join : str

Use intersection ('inner') or union ('outer') of variables.

batch_key : str

Add the batch annotation to .obs using this key.

batch_categories : Optional[Sequence[Any]]

Use these as categories for the batch annotation. By default, use increasing numbers.

index_unique : Optional[str]

Make the index unique by joining each existing index name with its batch category, using index_unique as the separator (for instance, index_unique='-'). Provide None to keep existing indices.
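The interplay of batch_categories and index_unique can be sketched in plain Python. This is only an illustration of the naming rule described above, not the library's implementation: the helper name unique_obs_names and the category labels 'ctrl', 'stim' and 'rep' are hypothetical.

```python
def unique_obs_names(batches, batch_categories=None, index_unique="-"):
    """Hypothetical helper mimicking the documented naming rule.

    batches: one list of observation names per batch.
    """
    if batch_categories is None:
        # Default described above: increasing numbers as categories.
        batch_categories = [str(i) for i in range(len(batches))]
    if len(batch_categories) != len(batches):
        raise ValueError("need exactly one category per batch")

    names = []
    for category, batch in zip(batch_categories, batches):
        for name in batch:
            if index_unique is None:
                # Keep the existing index, even if it contains duplicates.
                names.append(name)
            else:
                names.append(f"{name}{index_unique}{category}")
    return names


# Observation names of adata1, adata2 and adata3 from the first example.
batches = [["s1", "s2"], ["s3", "s4"], ["s1", "s2"]]

default_names = unique_obs_names(batches)
kept_names = unique_obs_names(batches, index_unique=None)
labeled_names = unique_obs_names(
    batches, batch_categories=["ctrl", "stim", "rep"], index_unique="_"
)
```

With the defaults, the duplicated names 's1' and 's2' become 's1-0', 's2-0', 's1-2' and 's2-2'. With index_unique=None they stay duplicated.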

Return type:

AnnData

Returns:

AnnData – The concatenated AnnData, where adata.obs[batch_key] stores a categorical variable labeling the batch.

Notes

Warning

If you use join='outer', variables that are absent in a batch are filled with 0 for sparse data and with NaN for dense data. Use this with care. See the examples.
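The effect of the two join modes and of the fill value can be illustrated without anndata. The following is a minimal sketch under stated assumptions: align_rows is a hypothetical helper, variables are sorted by name to mirror the example outputs, and the fill argument stands in for the NaN (dense) or 0 (sparse) behaviour described in the warning.

```python
import math


def align_rows(batches, join="inner", fill=math.nan):
    """Hypothetical helper: align rows of several batches on shared variables.

    batches: list of (var_names, rows) pairs, one pair per batch.
    Returns the joined variable names and the aligned rows.
    """
    name_sets = [set(names) for names, _ in batches]
    if join == "inner":
        selected = set.intersection(*name_sets)
    elif join == "outer":
        selected = set.union(*name_sets)
    else:
        raise ValueError("join must be 'inner' or 'outer'")

    # Assumption: sorted variable order, as in the example outputs.
    columns = sorted(selected)
    aligned = []
    for names, rows in batches:
        position = {name: i for i, name in enumerate(names)}
        for row in rows:
            # Variables missing from this batch receive the fill value.
            aligned.append(
                [row[position[c]] if c in position else fill for c in columns]
            )
    return columns, aligned


# Variables and values of adata1 and adata2 from the examples.
batches = [
    (["a", "b", "c"], [[1, 2, 3], [4, 5, 6]]),
    (["d", "c", "b"], [[1, 2, 3], [4, 5, 6]]),
]

inner_vars, inner_rows = align_rows(batches, join="inner")
outer_vars, outer_rows = align_rows(batches, join="outer")  # dense-like: NaN
zero_vars, zero_rows = align_rows(batches, join="outer", fill=0)  # sparse-like: 0
```

In the zero-filled result, a filled-in 0 cannot be told apart from a measured 0, which is why the warning asks for care with sparse data.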

Examples

Joining on intersection of variables.

>>> import numpy as np
>>> from anndata import AnnData
>>> adata1 = AnnData(np.array([[1, 2, 3], [4, 5, 6]]),
...                  {'obs_names': ['s1', 's2'],
...                   'anno1': ['c1', 'c2']},
...                  {'var_names': ['a', 'b', 'c'],
...                   'annoA': [0, 1, 2]})
>>> adata2 = AnnData(np.array([[1, 2, 3], [4, 5, 6]]),
...                  {'obs_names': ['s3', 's4'],
...                   'anno1': ['c3', 'c4']},
...                  {'var_names': ['d', 'c', 'b'],
...                   'annoA': [0, 1, 2]})
>>> adata3 = AnnData(np.array([[1, 2, 3], [4, 5, 6]]),
...                  {'obs_names': ['s1', 's2'],
...                   'anno2': ['d3', 'd4']},
...                  {'var_names': ['d', 'c', 'b'],
...                   'annoA': [0, 2, 3],
...                   'annoB': [0, 1, 2]})
>>>
>>> adata = adata1.concatenate(adata2, adata3)
>>> adata
AnnData object with n_obs × n_vars = 6 × 2
    obs_keys = ['anno1', 'anno2', 'batch']
    var_keys = ['annoA-0', 'annoA-1', 'annoB-2', 'annoA-2']
>>> adata.X
array([[2., 3.],
       [5., 6.],
       [3., 2.],
       [6., 5.],
       [3., 2.],
       [6., 5.]], dtype=float32)
>>> adata.obs
     anno1 anno2 batch
s1-0    c1   NaN     0
s2-0    c2   NaN     0
s3-1    c3   NaN     1
s4-1    c4   NaN     1
s1-2   NaN    d3     2
s2-2   NaN    d4     2
>>> adata.var.T
         b  c
annoA-0  1  2
annoA-1  2  1
annoB-2  2  1
annoA-2  3  2

Joining on the union of variables.

>>> adata = adata1.concatenate(adata2, adata3, join='outer')
>>> adata
AnnData object with n_obs × n_vars = 6 × 4
    obs_keys = ['anno1', 'anno2', 'batch']
    var_keys = ['annoA-0', 'annoA-1', 'annoB-2', 'annoA-2']
>>> adata.var.T
index      a    b    c    d
annoA-0  0.0  1.0  2.0  NaN
annoA-1  NaN  2.0  1.0  0.0
annoB-2  NaN  2.0  1.0  0.0
annoA-2  NaN  3.0  2.0  0.0
>>> adata.var_names
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> adata.X
array([[ 1.,  2.,  3., nan],
       [ 4.,  5.,  6., nan],
       [nan,  3.,  2.,  1.],
       [nan,  6.,  5.,  4.],
       [nan,  3.,  2.,  1.],
       [nan,  6.,  5.,  4.]], dtype=float32)
>>> adata.X.sum(axis=0)
array([nan, 25., 23., nan], dtype=float32)
>>> import pandas as pd
>>> Xdf = pd.DataFrame(adata.X, columns=adata.var_names)
>>> Xdf
index    a    b    c    d
0      1.0  2.0  3.0  NaN
1      4.0  5.0  6.0  NaN
2      NaN  3.0  2.0  1.0
3      NaN  6.0  5.0  4.0
4      NaN  3.0  2.0  1.0
5      NaN  6.0  5.0  4.0
>>> Xdf.sum()
index
a     5.0
b    25.0
c    23.0
d    10.0
dtype: float32
>>> from numpy import ma
>>> adata.X = ma.masked_invalid(adata.X)
>>> adata.X
masked_array(
  data=[[1.0, 2.0, 3.0, --],
        [4.0, 5.0, 6.0, --],
        [--, 3.0, 2.0, 1.0],
        [--, 6.0, 5.0, 4.0],
        [--, 3.0, 2.0, 1.0],
        [--, 6.0, 5.0, 4.0]],
  mask=[[False, False, False,  True],
        [False, False, False,  True],
        [ True, False, False, False],
        [ True, False, False, False],
        [ True, False, False, False],
        [ True, False, False, False]],
  fill_value=1e+20,
  dtype=float32)
>>> adata.X.sum(axis=0).data
array([ 5., 25., 23., 10.], dtype=float32)

The mask is not written to disk, so the masked array has to be reinstantiated after the file is read back.

>>> adata.write('./test.h5ad')
>>> from anndata import read_h5ad
>>> adata = read_h5ad('./test.h5ad')
>>> adata.X
array([[ 1.,  2.,  3., nan],
       [ 4.,  5.,  6., nan],
       [nan,  3.,  2.,  1.],
       [nan,  6.,  5.,  4.],
       [nan,  3.,  2.,  1.],
       [nan,  6.,  5.,  4.]], dtype=float32)

For sparse data, everything behaves similarly, except that with join='outer', missing variables are filled with zeros instead of NaN.

>>> from scipy.sparse import csr_matrix
>>> adata1 = AnnData(csr_matrix([[0, 2, 3], [0, 5, 6]]),
...                  {'obs_names': ['s1', 's2'],
...                   'anno1': ['c1', 'c2']},
...                  {'var_names': ['a', 'b', 'c']})
>>> adata2 = AnnData(csr_matrix([[0, 2, 3], [0, 5, 6]]),
...                  {'obs_names': ['s3', 's4'],
...                   'anno1': ['c3', 'c4']},
...                  {'var_names': ['d', 'c', 'b']})
>>> adata3 = AnnData(csr_matrix([[1, 2, 0], [0, 5, 6]]),
...                  {'obs_names': ['s5', 's6'],
...                   'anno2': ['d3', 'd4']},
...                  {'var_names': ['d', 'c', 'b']})
>>>
>>> adata = adata1.concatenate(adata2, adata3, join='outer')
>>> adata.var_names
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> adata.X.toarray()
array([[0., 2., 3., 0.],
       [0., 5., 6., 0.],
       [0., 3., 2., 0.],
       [0., 6., 5., 0.],
       [0., 0., 2., 1.],
       [0., 6., 5., 0.]], dtype=float32)