# Awkward Arrays in AnnData

**Author**: [Gregor Sturm](https://github.com/grst)

:::{warning}
Support for awkward arrays in AnnData is **experimental**.

Behavior, in particular of {func}`~anndata.concat`, may change in the future. 
Please report any issues using the [issue tracker](https://github.com/scverse/anndata).
:::

In [1]:
import awkward as ak
import scanpy as sc
from biothings_client import get_client as get_biothings_client

[Awkward Array](https://awkward-array.org/doc/main/) is a library for working with **nested, variable-sized data** using **NumPy-like idioms**.
Because its operations are vectorized instead of looping over individual Python objects, it is typically much faster than working with lists-of-lists or lists-of-dicts in Python.

Here are two simple examples of what an awkward array can look like:

**ragged array:**

In [2]:
ragged = ak.Array(
    [
        None,
        [1, 2, 3],
        [3, 4],
    ]
)
ragged[:, 1]

**list of records:**

In [3]:
records = ak.Array(
    [
        {"a": 1, "b": 2},
        {"a": 3, "c": 4},
        {"d": 5},
    ]
)
records["a"]

Please refer to the [Awkward Array documentation](https://awkward-array.org) for more information.

Since v0.9, awkward arrays are supported in AnnData in the `.layers`, `.obsm`, `.varm` and `.uns` slots. 

In the following, we will explore how awkward arrays can be useful when working with single-cell data.

## Storing transcripts in `.varm`

Every gene can have one or many transcripts. Using awkward arrays, we can store a ragged list of transcripts for each gene in `adata.varm`. 

In [4]:
adata = sc.datasets.pbmc3k()

In [5]:
adata.var.head()

| index | gene_ids |
|---|---|
| MIR1302-10 | ENSG00000243485 |
| FAM138A | ENSG00000237613 |
| OR4F5 | ENSG00000186092 |
| RP11-34P13.7 | ENSG00000238009 |
| RP11-34P13.8 | ENSG00000239945 |


Let's retrieve a list of transcripts for each gene using the [MyGene.info API](https://docs.mygene.info/en/latest/). 

In [6]:
mygene = get_biothings_client("gene")

In [7]:
%%capture
mygene_res = mygene.querymany(
    adata.var["gene_ids"],
    scopes=["ensembl.gene"],
    fields=["ensembl.transcript"],
    species="human",
    as_dataframe=True,
)
# remove duplicated results
mygene_res = mygene_res.loc[~mygene_res.index.duplicated()]
assert (
    adata.var["gene_ids"].tolist() == mygene_res.index.tolist()
), "Order of genes does not match"
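
The deduplication step keeps only the first row for every queried ID, which is what allows the result to line up one-to-one with `adata.var`. As a minimal sketch of what `~index.duplicated()` does, consider a made-up data frame in which one query returned two hits. The labels `geneA` to `geneC` and the `_id` values are placeholders, not real identifiers.

```python
import pandas as pd

# Toy query result; the identifiers are placeholders, not real genes.
res = pd.DataFrame(
    {"_id": ["1", "2", "3", "4"]},
    index=pd.Index(["geneA", "geneB", "geneB", "geneC"], name="query"),
)

# `Index.duplicated()` marks every repeat of a label after its first occurrence.
mask = res.index.duplicated()

# Negating the mask keeps exactly one row (the first) per label.
deduplicated = res.loc[~mask]

print(mask.tolist())                 # [False, False, True, False]
print(deduplicated.index.tolist())   # ['geneA', 'geneB', 'geneC']
print(deduplicated["_id"].tolist())  # ['1', '2', '4']
```

Because the first occurrence is kept, the row order of the remaining entries is unchanged, which is why the order check against `adata.var["gene_ids"]` can follow directly.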

The API call returns a data frame with transcripts in the `ensembl.transcript` key:

In [8]:
mygene_res.head()

| query | _id | _score | ensembl.transcript | notfound | ensembl |
|---|---|---|---|---|---|
| ENSG00000243485 | ENSG00000243485 | 25.719067 | [ENST00000469289, ENST00000473358] | | |
| ENSG00000237613 | 645520 | 25.719007 | [ENST00000417324, ENST00000461467] | | |
| ENSG00000186092 | 79501 | 24.912605 | ENST00000641515 | | |
| ENSG00000238009 | ENSG00000238009 | 25.719582 | [ENST00000453576, ENST00000466430, ENST0000047... | | |
| ENSG00000239945 | ENSG00000239945 | 25.719145 | ENST00000495576 | | |


Let's construct an awkward array from the `ensembl.transcript` column and
assign the resulting ragged list of transcripts to `adata.varm`:

In [9]:
adata.varm["transcripts"] = ak.Array(mygene_res["ensembl.transcript"])
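
In the query result, `ensembl.transcript` holds a bare string for genes with a single transcript and a list otherwise. If a uniformly list-typed array is preferred, the column could be normalized before building the array. The helper `as_transcript_list` below is hypothetical (it is not part of AnnData, Awkward Array, or the MyGene client), and the sketch assumes that genes without a result appear as `None` or `NaN`.

```python
import math


def as_transcript_list(value):
    """Return `value` as a list of transcript IDs (hypothetical helper)."""
    # Assumption: genes without a result show up as None or a float NaN.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    # A single transcript is reported as a bare string.
    if isinstance(value, str):
        return [value]
    # Anything else is treated as an iterable of transcript IDs.
    return list(value)


# Values taken from the first rows of the query result, plus one missing entry.
raw = [
    ["ENST00000469289", "ENST00000473358"],
    "ENST00000641515",
    float("nan"),
]
normalized = [as_transcript_list(v) for v in raw]

print(normalized)
# [['ENST00000469289', 'ENST00000473358'], ['ENST00000641515'], []]
```

The resulting list of lists can be passed to `ak.Array` in place of the raw column, for example `ak.Array([as_transcript_list(v) for v in mygene_res["ensembl.transcript"]])`.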



We can now access transcripts of individual genes by slicing the AnnData object:

In [10]:
adata[:, ["CD8A", "CXCL13"]].varm["transcripts"]