Awkward Arrays in AnnData
Author: Gregor Sturm
Warning
Support for awkward arrays in AnnData is experimental. Behavior, in particular of concat(), may change in the future. Please report any issues using the issue tracker.
[1]:
import awkward as ak
import scanpy as sc
from biothings_client import get_client as get_biothings_client
Awkward Array is a library for working with nested, variable-sized data using NumPy-like idioms. It is considerably faster than working with lists-of-lists or lists-of-dicts in Python.
Here are two simple examples of what an awkward array can look like:
ragged array:
[2]:
ragged = ak.Array(
    [
        None,
        [1, 2, 3],
        [3, 4],
    ]
)
ragged[:, 1]
[2]:
[None, 2, 4]
----------------
type: 3 * ?int64
list of records:
[3]:
records = ak.Array(
    [
        {"a": 1, "b": 2},
        {"a": 3, "c": 4},
        {"d": 5},
    ]
)
records["a"]
[3]:
[1, 3, None]
----------------
type: 3 * ?int64
Please refer to the Awkward Array documentation for more information.
Since v0.9, awkward arrays are supported in AnnData in the .layers, .obsm, .varm and .uns slots.
In the following, we will explore how awkward arrays can be useful when working with single-cell data.
Storing transcripts in .varm
Every gene can have one or many transcripts. Using awkward arrays, we can store a ragged list of transcripts for each gene in adata.varm.
[4]:
adata = sc.datasets.pbmc3k()
[5]:
adata.var.head()
[5]:
index | gene_ids |
---|---|
MIR1302-10 | ENSG00000243485 |
FAM138A | ENSG00000237613 |
OR4F5 | ENSG00000186092 |
RP11-34P13.7 | ENSG00000238009 |
RP11-34P13.8 | ENSG00000239945 |
Let’s retrieve a list of transcripts for each gene using the MyGene.info API.
[6]:
mygene = get_biothings_client("gene")
[7]:
%%capture
mygene_res = mygene.querymany(
    adata.var["gene_ids"],
    scopes=["ensembl.gene"],
    fields=["ensembl.transcript"],
    species="human",
    as_dataframe=True,
)
# remove duplicated results
mygene_res = mygene_res.loc[~mygene_res.index.duplicated()]
assert (
    adata.var["gene_ids"].tolist() == mygene_res.index.tolist()
), "Order of genes does not match"
The API call returns a data frame with the transcripts in the ensembl.transcript column:
[8]:
mygene_res.head()
[8]:
query | _id | _score | ensembl.transcript | notfound | ensembl |
---|---|---|---|---|---|
ENSG00000243485 | ENSG00000243485 | 25.719067 | [ENST00000469289, ENST00000473358] | NaN | NaN |
ENSG00000237613 | 645520 | 25.719007 | [ENST00000417324, ENST00000461467] | NaN | NaN |
ENSG00000186092 | 79501 | 24.912605 | ENST00000641515 | NaN | NaN |
ENSG00000238009 | ENSG00000238009 | 25.719582 | [ENST00000453576, ENST00000466430, ENST0000047... | NaN | NaN |
ENSG00000239945 | ENSG00000239945 | 25.719145 | ENST00000495576 | NaN | NaN |
Let’s construct an awkward array from the ensembl.transcript column and assign the resulting ragged list of transcripts to adata.varm:
[9]:
adata.varm["transcripts"] = ak.Array(mygene_res["ensembl.transcript"])
/home/sturm/projects/2022/anndata/anndata/_core/aligned_mapping.py:54: ExperimentalFeatureWarning: Support for Awkward Arrays is currently experimental. Behavior may change in the future. Please report any issues you may encounter!
warnings.warn(
We can now access transcripts of individual genes by slicing the AnnData object:
[10]:
adata[:, ["CD8A", "CXCL13"]].varm["transcripts"]
[10]:
[['ENST00000283635', 'ENST00000352580', ..., 'ENST00000699439'],
 ['ENST00000286758', 'ENST00000506590', 'ENST00000682537']]
----------------------------------------------------------------------------------------------------------------
type: 2 * union[
    var * string,
    string,
    float64,
    parameters={"_view_args": ["target-140242149270096", "varm", ["transcripts"]], "__array__": "AwkwardArrayView"}
]