Curate & link scRNA-seq datasets#
This illustrates how to manage scRNA-seq datasets in the absence of a custom schema.
!lamin init --storage ./test-scrna --schema bionty
# avoids download bars
import bionty as bt
bt.Gene(species="human")
bt.Gene(species="mouse")
bt.Gene(species="saccharomyces cerevisiae")
bt.CellMarker(species="human")
💡 creating schemas: core==0.45.5 bionty==0.29.6
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-17 17:32:30)
✅ saved: Storage(id='6ObKC1Wo', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-08-17 17:32:30, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/test-scrna
💡 did not register local instance on hub (if you want, call `lamin register`)
CellMarker
Species: human
Source: cellmarker, 2.0
#terms: 15466
📖 CellMarker.df(): ontology reference table
🔎 CellMarker.lookup(): autocompletion of terms
🎯 CellMarker.search(): free text search of terms
✅ CellMarker.validate(): strictly validate values
🧐 CellMarker.inspect(): full inspection of values
👽 CellMarker.standardize(): convert to standardized names
🪜 CellMarker.diff(): difference between two versions
🔗 CellMarker.ontology: Pronto.Ontology object
import lamindb as ln
import lnschema_bionty as lb
# don't recurse through ontology hierarchies, to speed up CI
# (setting this to True is recommended otherwise)
lb.settings.auto_save_parents = False
✅ loaded instance: testuser1/test-scrna (lamindb 0.50.7)
ln.track()
✅ saved: Transform(id='Nv48yAceNSh8z8', name='Curate & link scRNA-seq datasets', short_name='scrna', stem_id='Nv48yAceNSh8', version='0', type=notebook, updated_at=2023-08-17 17:32:36, created_by_id='DzTjkKse')
✅ saved: Run(id='UcskU0EgGleETR6Oao82', run_at=2023-08-17 17:32:36, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
Preparation: registries#
Let's assume that this is not the first time we work with experimental entities, and hence our registries are already pre-populated:
# assume prepared registries
# strain
lb.ExperimentalFactor.from_bionty(ontology_id="EFO:0004472").save()
record = lb.ExperimentalFactor.filter(ontology_id="EFO:0004472").one()
record.add_synonym("C57BL/6N")
# developmental stage
lb.ExperimentalFactor.from_bionty(ontology_id="EFO:0001272").save()
# tissue
lb.Tissue.from_bionty(ontology_id="UBERON:0001542").save()
# cell types
ln.save(lb.CellType.from_values(["CL:0000115", "CL:0000738"], "ontology_id"))
# genes
adata = ln.dev.datasets.anndata_mouse_sc_lymph_node()
validated = lb.Gene.bionty(species="mouse").validate(
    adata.var.index, field="ensembl_gene_id"
)
ln.save(
    lb.Gene.from_values(
        adata.var.index[validated], field="ensembl_gene_id", species="mouse"
    )
)
✅ validated 1 ExperimentalFactor record from Bionty on ontology_id: EFO:0004472
✅ validated 1 ExperimentalFactor record from Bionty on ontology_id: EFO:0001272
✅ validated 1 Tissue record from Bionty on ontology_id: UBERON:0001542
✅ validated 2 CellType records from Bionty on ontology_id: CL:0000115, CL:0000738
✅ Downloading Species source file from: https://ftp.ensembl.org/pub/release-110/species_EnsemblVertebrates.txt
✅ validated 1 Species record from Bionty on name: mouse
✅ 9976 terms (99.80%) are validated
❗ 24 terms (0.20%) are not validated
✅ validated 9976 Gene records from Bionty on ensembl_gene_id: ENSMUSG00000104923, ENSMUSG00000079038, ENSMUSG00000069755, ENSMUSG00000040648, ENSMUSG00000113486, ENSMUSG00000034854, ENSMUSG00000053153, ENSMUSG00000020074, ENSMUSG00000022090, ENSMUSG00000111483, ENSMUSG00000083382, ENSMUSG00000022814, ENSMUSG00000030589, ENSMUSG00000016386, ENSMUSG00000022906, ENSMUSG00000072214, ENSMUSG00000019773, ENSMUSG00000109378, ENSMUSG00000039233, ENSMUSG00000015093, ...
ln.view(schema="bionty", registries=["CellType", "ExperimentalFactor", "Tissue"])
CellType
id | name | ontology_id | abbr | synonyms | description | bionty_source_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|
4fOuOYtl | endothelial cell | CL:0000115 | None | endotheliocyte | An Endothelial Cell Comprises The Outermost La... | u0GC | 2023-08-17 17:32:40 | DzTjkKse |
MkrH0gsX | leukocyte | CL:0000738 | None | white blood cell\|leucocyte | An Achromatic Cell Of The Myeloid Or Lymphoid ... | u0GC | 2023-08-17 17:32:40 | DzTjkKse |
ExperimentalFactor
id | name | ontology_id | abbr | synonyms | description | molecule | instrument | measurement | bionty_source_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|
EfXUa6F0 | adult | EFO:0001272 | None | adult stage | A Maturity Quality Inhering In An Individual B... | None | None | None | RFxi | 2023-08-17 17:32:38 | DzTjkKse |
eXg039cd | obsolete_C57BL/6 | EFO:0004472 | None | C57Bl\6\|C57/BL6\|C57BL6\|C57Black\|C57BL/6N\|C57/B... | C57Bl/6 Is A Mouse Strain As Described In Jack... | None | None | None | RFxi | 2023-08-17 17:32:37 | DzTjkKse |
Tissue
id | name | ontology_id | abbr | synonyms | description | bionty_source_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|
Na0qNZ1o | inguinal lymph node | UBERON:0001542 | None | None | The Lymph Nodes Located In The Groin Area. | ktcN | 2023-08-17 17:32:39 | DzTjkKse |
Mouse lymph node cells: Detmar22#
We're working with mouse data:
lb.settings.species = "mouse"
✅ set species: Species(id='vado', name='mouse', taxon_id=10090, scientific_name='mus_musculus', updated_at=2023-08-17 17:32:44, bionty_source_id='WiPo', created_by_id='DzTjkKse')
Extract #
Let's look at an scRNA-seq count matrix in the form of an AnnData object:
adata = ln.dev.datasets.anndata_mouse_sc_lymph_node()
Format the metadata columns:
# The column names are a bit lengthy, let's abbreviate them.
# Raw strings avoid invalid-escape warnings for the regex patterns.
adata.obs.columns = (
    adata.obs.columns.str.replace("Sample Characteristic", "")
    .str.replace("Factor Value ", "Factor Value:", regex=True)
    .str.replace(r"Factor Value\[", "Factor Value:", regex=True)
    .str.replace(r" Ontology Term\[", "ontology_id:", regex=True)
    .str.strip("[]")
    .str.replace("organism part", "tissue")
    .str.replace("organism", "species")
    .str.replace("developmental stage", "developmental_stage")
    .str.replace("cell type", "cell_type")
    # the last one could be interesting, too
    # .str.replace("Factor Value:Ontology Term[inferred cell_type - authors labels", "cell_type_authors")
)
adata
AnnData object with n_obs × n_vars = 1135 × 10000
obs: 'species', 'ontology_id:species', 'strain', 'ontology_id:strain', 'age', 'ontology_id:age', 'developmental_stage', 'ontology_id:developmental_stage', 'sex', 'ontology_id:sex', 'genotype', 'ontology_id:genotype', 'tissue', 'ontology_id:tissue', 'cell_type', 'ontology_id:cell_type', 'immunophenotype', 'ontology_id:immunophenotype', 'post analysis well quality', 'ontology_id:post analysis well quality', 'Factor Value:single cell identifier', 'Factor Value:Ontology Term[single cell identifier', 'Factor Value:cluster', 'Factor Value:Ontology Term[cluster', 'Factor Value:inferred cell_type - authors labels', 'Factor Value:Ontology Term[inferred cell_type - authors labels'
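The order of the replacements above matters, and it also explains why a few column names keep a stray `Ontology Term[` fragment. The following is a minimal pure-Python sketch of the same chain applied to single strings. The raw header names are an assumption inferred from the patterns being replaced, and `abbreviate` is a hypothetical helper, not part of any library.

```python
import re


def abbreviate(col: str) -> str:
    """Apply the same replacements, in the same order, to one column name."""
    col = col.replace("Sample Characteristic", "")
    col = re.sub(r"Factor Value ", "Factor Value:", col)
    col = re.sub(r"Factor Value\[", "Factor Value:", col)
    # only matches when a space precedes "Ontology Term["
    col = re.sub(r" Ontology Term\[", "ontology_id:", col)
    col = col.strip("[]")
    # "organism part" must be handled before the shorter "organism"
    col = col.replace("organism part", "tissue")
    col = col.replace("organism", "species")
    col = col.replace("developmental stage", "developmental_stage")
    col = col.replace("cell type", "cell_type")
    return col


# Hypothetical raw headers, inferred from the replacement patterns
raw_columns = [
    "Sample Characteristic[organism part]",
    "Sample Characteristic Ontology Term[organism part]",
    "Factor Value[cluster]",
    "Factor Value Ontology Term[cluster]",
]
renamed = [abbreviate(c) for c in raw_columns]
for raw, new in zip(raw_columns, renamed):
    print(f"{raw!r} -> {new!r}")
```

In the last case the space before `Ontology Term[` has already been turned into a colon, so the `ontology_id:` pattern no longer matches. This is consistent with names such as `Factor Value:Ontology Term[cluster` in the output above.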
Curate #
Validate genes in .var:
validated = lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id)
💡 using global setting species = mouse
✅ 9976 terms (99.80%) are validated
❗ 24 terms (0.20%) are not validated
We see that 24 gene identifiers can't be validated through Bionty. Since we'd like to validate all features in this dataset, let's register them:
non_validated = adata.var.index[~validated]
records = lb.Gene.from_values(non_validated, lb.Gene.ensembl_gene_id)
ln.save(records)
💡 using global setting species = mouse
❗ did not validate 24 Gene records for ensembl_gene_ids: ENSMUSG00000117732, ENSMUSG00000094030, ENSMUSG00000091604, ENSMUSG00000074735, ENSMUSG00000116275, ENSMUSG00000096385, ENSMUSG00000097078, ENSMUSG00000074210, ENSMUSG00000116184, ENSMUSG00000105204, ENSMUSG00000075015, ENSMUSG00000095547, ENSMUSG00000066378, ENSMUSG00000094958, ENSMUSG00000096923, ENSMUSG00000090625, ENSMUSG00000114046, ENSMUSG00000096201, ENSMUSG00000092345, ENSMUSG00000022591, ...
Now all genes pass validation:
lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id);
💡 using global setting species = mouse
✅ 10000 terms (100.00%) are validated
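Conceptually, the validate, register, re-validate cycle above is a membership check against a registry. The sketch below illustrates that idea with plain Python sets. It is not how lamindb implements `validate`, and the identifiers are hypothetical placeholders.

```python
def validate(values, registry):
    """Return one boolean per value: True if the value is in the registry."""
    return [value in registry for value in values]


# Hypothetical identifiers; a real registry would hold Ensembl gene IDs
registry = {"gene_a", "gene_b"}
var_index = ["gene_a", "gene_x", "gene_b"]

mask = validate(var_index, registry)
non_validated = [v for v, ok in zip(var_index, mask) if not ok]

# "Registering" the non-validated values makes them pass next time
registry.update(non_validated)
mask_after = validate(var_index, registry)
print(mask, non_validated, mask_after)
```

The boolean mask has the same length and order as the input, which is why it can be used to index `adata.var.index` directly in the cells above.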
Similarly, for the metadata, we'd like to validate the names as features:
columns = [
    col
    for col in adata.obs.columns
    if not col.startswith("ontology_id") and not col.startswith("Factor Value")
]
ln.save(ln.Feature.from_df(adata.obs[columns]))
❗ did not validate 10 Feature records for names: species, strain, age, developmental_stage, sex, genotype, tissue, cell_type, immunophenotype, post analysis well quality
ln.Feature.validate(columns, field="name");
✅ 10 terms (100.00%) are validated
Some of the metadata can be typed using dedicated registries:
species = lb.Species.from_bionty(name="mouse")
strains = lb.ExperimentalFactor.from_values(adata.obs["strain"], "name")
dev_stages = lb.ExperimentalFactor.from_values(adata.obs["developmental_stage"], "name")
cell_types = lb.CellType.from_values(adata.obs["cell_type"], "name")
tissues = lb.Tissue.from_values(adata.obs["tissue"], "name")
✅ validated 1 Species record on name: mouse
❗ did not validate 1 ExperimentalFactor record for name: C57BL/6N
✅ validated 1 ExperimentalFactor record on name: adult
✅ validated 1 CellType record on name: endothelial cell
✅ validated 1 Tissue record on name: inguinal lymph node
The strain did not validate, so let's try to map synonyms:
lb.ExperimentalFactor.standardize(adata.obs["strain"], return_mapper=True)
{'C57BL/6N': 'obsolete_C57BL/6'}
Indeed, there is a synonym:
adata.obs["strain"] = adata.obs["strain"].map(
    lb.ExperimentalFactor.standardize(adata.obs["strain"], return_mapper=True)
)
Now we can validate:
strains = lb.ExperimentalFactor.from_values(adata.obs["strain"], "name")
✅ validated 1 ExperimentalFactor record on name: obsolete_C57BL/6
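The synonym mapping used above can be pictured as a lookup from each registered synonym to its record name. The sketch below assumes pipe-separated synonym strings, as displayed in the registry table earlier. It uses only a subset of the displayed synonyms, and the helper names are hypothetical rather than lamindb API.

```python
def build_synonym_mapper(records):
    """Map each pipe-separated synonym to the name of its record."""
    mapper = {}
    for name, synonyms in records.items():
        for synonym in synonyms.split("|"):
            if synonym:
                mapper[synonym] = name
    return mapper


def standardize(values, mapper):
    """Replace known synonyms by standardized names, keep other values as-is."""
    return [mapper.get(value, value) for value in values]


# Subset of the name -> synonyms pairs shown in the ExperimentalFactor table
records = {
    "obsolete_C57BL/6": "C57/BL6|C57BL6|C57Black|C57BL/6N",
    "adult": "adult stage",
}
mapper = build_synonym_mapper(records)
standardized = standardize(["C57BL/6N", "adult"], mapper)
print(standardized)
```

Values that are already standardized names, such as `adult`, pass through unchanged, so the mapping can be applied to a whole column.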
Metadata that can't be typed with dedicated registries is tracked with Label records:
labels = ln.Label.from_values(adata.obs["sex"])
labels += ln.Label.from_values(adata.obs["age"])
labels += ln.Label.from_values(adata.obs["genotype"])
labels += ln.Label.from_values(adata.obs["immunophenotype"])
ln.save(labels)
❗ did not validate 1 Label record for name: female
❗ did not validate 1 Label record for name: 8 to 10 week
❗ did not validate 1 Label record for name: wild type genotype
❗ did not validate 2 Label records for names: CD45 negative, CD45 positive
Register #
When we create a File object from an AnnData, we'll automatically link its feature sets and get information about unmapped categories:
file = ln.File.from_anndata(
    adata, description="Detmar22", var_ref=lb.Gene.ensembl_gene_id
)
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/9SaWYjVX2DwR0dCC2l1B.h5ad')
💡 parsing feature names of X stored in slot 'var'
💡 using global setting species = mouse
✅ validated 10000 Gene records on ensembl_gene_id: ENSMUSG00000049488, ENSMUSG00000042472, ENSMUSG00000038268, ENSMUSG00000020598, ENSMUSG00000026303, ENSMUSG00000049797, ENSMUSG00000049044, ENSMUSG00000045410, ENSMUSG00000028948, ENSMUSG00000061295, ENSMUSG00000001053, ENSMUSG00000081439, ENSMUSG00000035790, ENSMUSG00000040127, ENSMUSG00000095794, ENSMUSG00000071647, ENSMUSG00000031608, ENSMUSG00000078249, ENSMUSG00000082103, ENSMUSG00000030275, ...
✅ linked: FeatureSet(id='C2hSi0gNZb05Ep0WoIVY', n=10000, type='float', registry='bionty.Gene', hash='T5FFYU5aZpuTyA3bXPfI', created_by_id='DzTjkKse')
💡 parsing feature names of slot 'obs'
✅ validated 10 Feature records on name: age, cell_type, developmental_stage, genotype, immunophenotype, post analysis well quality, sex, species, strain, tissue
❗ did not validate 16 Feature records for names: Factor Value:Ontology Term[cluster, Factor Value:Ontology Term[inferred cell_type - authors labels, Factor Value:Ontology Term[single cell identifier, Factor Value:cluster, Factor Value:inferred cell_type - authors labels, Factor Value:single cell identifier, ontology_id:age, ontology_id:cell_type, ontology_id:developmental_stage, ontology_id:genotype, ontology_id:immunophenotype, ontology_id:post analysis well quality, ontology_id:sex, ontology_id:species, ontology_id:strain, ontology_id:tissue
❗ ignoring non-validated features: Factor Value:Ontology Term[cluster,Factor Value:Ontology Term[inferred cell_type - authors labels,Factor Value:Ontology Term[single cell identifier,Factor Value:cluster,Factor Value:inferred cell_type - authors labels,Factor Value:single cell identifier,ontology_id:age,ontology_id:cell_type,ontology_id:developmental_stage,ontology_id:genotype,ontology_id:immunophenotype,ontology_id:post analysis well quality,ontology_id:sex,ontology_id:species,ontology_id:strain,ontology_id:tissue
✅ linked: FeatureSet(id='pNw2lrfCKq8CRfiBC8oa', n=10, registry='core.Feature', hash='xp0klzB1xrIvoFUodvli', created_by_id='DzTjkKse')
file.save()
✅ saved 2 feature sets for slots: ['var', 'obs']
✅ storing file '9SaWYjVX2DwR0dCC2l1B' at '.lamindb/9SaWYjVX2DwR0dCC2l1B.h5ad'
The file now has two linked feature sets:
file.features
'var': FeatureSet(id='C2hSi0gNZb05Ep0WoIVY', n=10000, type='float', registry='bionty.Gene', hash='T5FFYU5aZpuTyA3bXPfI', updated_at=2023-08-17 17:32:53, created_by_id='DzTjkKse')
'obs': FeatureSet(id='pNw2lrfCKq8CRfiBC8oa', n=10, registry='core.Feature', hash='xp0klzB1xrIvoFUodvli', updated_at=2023-08-17 17:32:54, created_by_id='DzTjkKse')
file.add_labels(species, feature="species")
file.add_labels(strains + dev_stages + tissues + cell_types)
✅ linked feature 'species' to registry 'bionty.Species'
✅ linked labels 'obsolete_C57BL/6' to feature 'strain', linked feature 'strain' to registry 'bionty.ExperimentalFactor'
✅ linked labels 'adult' to feature 'developmental_stage', linked feature 'developmental_stage' to registry 'bionty.ExperimentalFactor'
✅ linked labels 'inguinal lymph node' to feature 'tissue', linked feature 'tissue' to registry 'bionty.Tissue'
✅ linked labels 'endothelial cell' to feature 'cell_type', linked feature 'cell_type' to registry 'bionty.CellType'
file.add_labels(labels)
✅ linked labels 'female' to feature 'sex', linked feature 'sex' to registry 'core.Label'
✅ linked labels '8 to 10 week' to feature 'age', linked feature 'age' to registry 'core.Label'
✅ linked labels 'wild type genotype' to feature 'genotype', linked feature 'genotype' to registry 'core.Label'
✅ linked labels 'CD45 negative', 'CD45 positive' to feature 'immunophenotype', linked feature 'immunophenotype' to registry 'core.Label'
The file is now queryable by everything we linked:
file.describe()
💡 File(id=9SaWYjVX2DwR0dCC2l1B, key=None, suffix=.h5ad, accessor=AnnData, description=Detmar22, version=None, size=17342743, hash=lC0PrTQici8k6yWBFsaJzg, hash_type=md5, created_at=2023-08-17 17:32:54.663207+00:00, updated_at=2023-08-17 17:32:54.663230+00:00)
Provenance:
  🗃️ storage: Storage(id='6ObKC1Wo', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-08-17 17:32:30, created_by_id='DzTjkKse')
  📔 transform: Transform(id='Nv48yAceNSh8z8', name='Curate & link scRNA-seq datasets', short_name='scrna', stem_id='Nv48yAceNSh8', version='0', type='notebook', updated_at=2023-08-17 17:32:53, created_by_id='DzTjkKse')
  👣 run: Run(id='UcskU0EgGleETR6Oao82', run_at=2023-08-17 17:32:36, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-17 17:32:30)
Features:
  var (X):
    🔗 index (10000, bionty.Gene.id): ['M1FVY5laHAth', 'B0mmBd1BIj73', 'WqZ5QZ4HkbO8', 'wFKdeJV97lNv', 'QrWPtoLL0HTX'...]
  obs (metadata):
    🔗 cell_type (1, bionty.CellType): ['endothelial cell']
    🔗 strain (2, bionty.ExperimentalFactor): ['obsolete_C57BL/6', 'adult']
    🔗 developmental_stage (2, bionty.ExperimentalFactor): ['obsolete_C57BL/6', 'adult']
    🔗 species (1, bionty.Species): ['mouse']
    🔗 tissue (1, bionty.Tissue): ['inguinal lymph node']
    🔗 genotype (1, core.Label): ['wild type genotype']
    🔗 age (1, core.Label): ['8 to 10 week']
    🔗 sex (1, core.Label): ['female']
    🔗 immunophenotype (2, core.Label): ['CD45 positive', 'CD45 negative']
Human immune cells: Conde22#
lb.settings.species = "human"
✅ set species: Species(id='uHJU', name='human', taxon_id=9606, scientific_name='homo_sapiens', updated_at=2023-08-17 17:32:54, bionty_source_id='WiPo', created_by_id='DzTjkKse')
conde22 = ln.dev.datasets.anndata_human_immune_cells()
ln.save(lb.Gene.from_values(conde22.var.index, lb.Gene.ensembl_gene_id))
💡 using global setting species = human
✅ validated 36390 Gene records from Bionty on ensembl_gene_id: ENSG00000243485, ENSG00000237613, ENSG00000186092, ENSG00000238009, ENSG00000239945, ENSG00000239906, ENSG00000241860, ENSG00000241599, ENSG00000286448, ENSG00000236601, ENSG00000284733, ENSG00000235146, ENSG00000284662, ENSG00000229905, ENSG00000237491, ENSG00000177757, ENSG00000228794, ENSG00000225880, ENSG00000230368, ENSG00000272438, ...
❗ did not validate 113 Gene records for ensembl_gene_ids: ENSG00000112096, ENSG00000182230, ENSG00000203812, ENSG00000204092, ENSG00000215271, ENSG00000221995, ENSG00000224739, ENSG00000224745, ENSG00000225932, ENSG00000226377, ENSG00000226380, ENSG00000226403, ENSG00000227021, ENSG00000227220, ENSG00000227902, ENSG00000228139, ENSG00000228906, ENSG00000229352, ENSG00000231575, ENSG00000232196, ...
conde22.obs.columns = conde22.obs.columns.str.replace("donor_id", "donor")
columns = [col for col in conde22.obs.columns if "ontology_term" not in col]
ln.save(ln.Feature.from_df(conde22.obs[columns]))
✅ validated 2 Feature records on name: cell_type, tissue
❗ did not validate 2 Feature records for names: assay, donor
file = ln.File.from_anndata(
    conde22, description="Conde22", var_ref=lb.Gene.ensembl_gene_id
)
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/P8fI3RY9UtTAEQfY7wx1.h5ad')
💡 parsing feature names of X stored in slot 'var'
💡 using global setting species = human
✅ validated 36503 Gene records on ensembl_gene_id: ENSG00000115539, ENSG00000224359, ENSG00000253399, ENSG00000283247, ENSG00000083828, ENSG00000231505, ENSG00000267278, ENSG00000100028, ENSG00000187833, ENSG00000284710, ENSG00000105672, ENSG00000267670, ENSG00000126787, ENSG00000125352, ENSG00000212122, ENSG00000185730, ENSG00000254526, ENSG00000231265, ENSG00000287355, ENSG00000251055, ...
✅ linked: FeatureSet(id='zUz2mA7XBIB20eZr9Oiq', n=36503, type='float', registry='bionty.Gene', hash='dnRexHCtxtmOU81_EpoJ', created_by_id='DzTjkKse')
💡 parsing feature names of slot 'obs'
✅ validated 4 Feature records on name: assay, cell_type, donor, tissue
❗ did not validate 3 Feature records for names: assay_ontology_term_id, cell_type_ontology_term_id, tissue_ontology_term_id
❗ ignoring non-validated features: assay_ontology_term_id,cell_type_ontology_term_id,tissue_ontology_term_id
✅ linked: FeatureSet(id='n6WJW6D4Kf8VxORkqwrR', n=4, registry='core.Feature', hash='dVn-3OuJyOEjmXLMT-cd', created_by_id='DzTjkKse')
file.save()
✅ saved 2 feature sets for slots: ['var', 'obs']
✅ storing file 'P8fI3RY9UtTAEQfY7wx1' at '.lamindb/P8fI3RY9UtTAEQfY7wx1.h5ad'
The file has the following linked feature sets:
file.features
'var': FeatureSet(id='zUz2mA7XBIB20eZr9Oiq', n=36503, type='float', registry='bionty.Gene', hash='dnRexHCtxtmOU81_EpoJ', updated_at=2023-08-17 17:33:08, created_by_id='DzTjkKse')
'obs': FeatureSet(id='n6WJW6D4Kf8VxORkqwrR', n=4, registry='core.Feature', hash='dVn-3OuJyOEjmXLMT-cd', updated_at=2023-08-17 17:33:12, created_by_id='DzTjkKse')
Let's now link observational metadata.
cell_types = lb.CellType.from_values(conde22.obs.cell_type, field="name")
efs = lb.ExperimentalFactor.from_values(conde22.obs.assay, field="name")
tissues = lb.Tissue.from_values(conde22.obs.tissue, field="name")
✅ validated 32 CellType records from Bionty on name: classical monocyte, T follicular helper cell, memory B cell, alveolar macrophage, naive thymus-derived CD4-positive, alpha-beta T cell, effector memory CD8-positive, alpha-beta T cell, terminally differentiated, alpha-beta T cell, CD4-positive helper T cell, naive thymus-derived CD8-positive, alpha-beta T cell, macrophage, mucosal invariant T cell, group 3 innate lymphoid cell, naive B cell, animal cell, CD16-negative, CD56-bright natural killer cell, human, plasma cell, CD8-positive, alpha-beta memory T cell, CD16-positive, CD56-dim natural killer cell, human, gamma-delta T cell, conventional dendritic cell, ...
✅ validated 3 ExperimentalFactor records from Bionty on name: 10x 3' v3, 10x 5' v2, 10x 5' v1
✅ validated 17 Tissue records from Bionty on name: blood, thoracic lymph node, spleen, lung, mesenteric lymph node, lamina propria, liver, jejunal epithelium, omentum, bone marrow, ileum, caecum, thymus, skeletal muscle tissue, duodenum, sigmoid colon, transverse colon
file.add_labels([lb.settings.species], feature="species")
ln.save(cell_types + efs + tissues)
file.add_labels(cell_types + efs + tissues)
✅ created feature set for slot 'external'
✅ linked labels 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', 'mucosal invariant T cell', 'group 3 innate lymphoid cell', 'naive B cell', 'animal cell', 'CD16-negative, CD56-bright natural killer cell, human', 'plasma cell', 'CD8-positive, alpha-beta memory T cell', 'CD16-positive, CD56-dim natural killer cell, human', 'gamma-delta T cell', 'conventional dendritic cell', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', 'effector memory CD4-positive, alpha-beta T cell', 'non-classical monocyte', 'mast cell', 'regulatory T cell', 'progenitor cell', 'dendritic cell, human', 'plasmablast', 'plasmacytoid dendritic cell', 'lymphocyte', 'germinal center B cell', 'megakaryocyte' to feature 'cell_type'
✅ linked labels '10x 3' v3', '10x 5' v2', '10x 5' v1' to feature 'assay', linked feature 'assay' to registry 'bionty.ExperimentalFactor'
✅ linked labels 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', 'ileum', 'caecum', 'thymus', 'skeletal muscle tissue', 'duodenum', 'sigmoid colon', 'transverse colon' to feature 'tissue'
As neither the core schema nor lnschema_bionty has a Donor table, we're using Label to track donor IDs:
donors = ln.Label.from_values(conde22.obs["donor"])
ln.save(donors)
file.add_labels(donors)
❗ did not validate 12 Label records for names: D496, 621B, A29, A36, A35, 637C, A52, A37, D503, 640C, A31, 582C
✅ linked labels 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', 'A31', '582C' to feature 'donor', linked feature 'donor' to registry 'core.Label'
file.describe()
💡 File(id=P8fI3RY9UtTAEQfY7wx1, key=None, suffix=.h5ad, accessor=AnnData, description=Conde22, version=None, size=28061905, hash=3cIcmoqp1MxjX8NlRkKGlQ, hash_type=md5, created_at=2023-08-17 17:33:12.867793+00:00, updated_at=2023-08-17 17:33:12.867819+00:00)
Provenance:
  🗃️ storage: Storage(id='6ObKC1Wo', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-08-17 17:32:30, created_by_id='DzTjkKse')
  📔 transform: Transform(id='Nv48yAceNSh8z8', name='Curate & link scRNA-seq datasets', short_name='scrna', stem_id='Nv48yAceNSh8', version='0', type='notebook', updated_at=2023-08-17 17:33:08, created_by_id='DzTjkKse')
  👣 run: Run(id='UcskU0EgGleETR6Oao82', run_at=2023-08-17 17:32:36, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-17 17:32:30)
Features:
  var (X):
    🔗 index (36503, bionty.Gene.id): ['a1AGqGk4ywu1', 'dNgfeMyI0d1p', 'br4ZjTdWVCNF', '3Ucw70JttwA8', '9bqHaU58Tjf1'...]
  external:
    🔗 species (1, bionty.Species): ['human']
  obs (metadata):
    🔗 cell_type (32, bionty.CellType): ['macrophage', 'T follicular helper cell', 'plasma cell', 'regulatory T cell', 'animal cell']
    🔗 assay (3, bionty.ExperimentalFactor): ["10x 5' v2", "10x 5' v1", "10x 3' v3"]
    🔗 tissue (17, bionty.Tissue): ['liver', 'lamina propria', 'mesenteric lymph node', 'omentum', 'spleen']
    🔗 donor (12, core.Label): ['A37', 'A31', 'A52', '637C', 'A35']
A less well curated dataset#
Let's now consider a dataset with less well-curated features:
pbcm68k = ln.dev.datasets.anndata_pbmc68k_reduced()
We see that this dataset is indexed by gene symbols:
pbcm68k.var.index
Index(['HES4', 'TNFRSF4', 'SSU72', 'PARK7', 'RBP7', 'SRM', 'MAD2L2', 'AGTRAP',
'TNFRSF1B', 'EFHD2',
...
'ATP5O', 'MRPS6', 'TTC3', 'U2AF1', 'CSTB', 'SUMO3', 'ITGB2', 'S100B',
'PRMT2', 'MT-ND3'],
dtype='object', name='index', length=765)
Because a gene symbol doesn't uniquely identify an Ensembl ID, we're linking more feature records to this file than there are columns in the AnnData.
Tip
Use Ensembl gene IDs rather than gene symbols to index genes.
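To see why symbols are a weaker index than Ensembl IDs, consider a registry in which one symbol is shared by several gene records. The sketch below uses made-up symbols and placeholder IDs, so it illustrates the counting argument only and does not reflect real gene records.

```python
from collections import defaultdict

# Hypothetical (symbol, Ensembl ID) pairs; the IDs are placeholders
gene_records = [
    ("GENE1", "ENSG_PLACEHOLDER_1"),
    ("GENE1", "ENSG_PLACEHOLDER_2"),  # same symbol, different record
    ("GENE2", "ENSG_PLACEHOLDER_3"),
]

symbol_to_ids = defaultdict(list)
for symbol, ensembl_id in gene_records:
    symbol_to_ids[symbol].append(ensembl_id)

# A dataset indexed by symbols links to every record that shares a symbol
var_index = ["GENE1", "GENE2"]
linked = [eid for symbol in var_index for eid in symbol_to_ids[symbol]]
print(f"{len(var_index)} columns, {len(linked)} linked gene records")
```

Indexing by Ensembl ID avoids this because each ID refers to exactly one record.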
validated = lb.Gene.validate(pbcm68k.var.index, lb.Gene.symbol)
💡 using global setting species = human
✅ 695 terms (90.80%) are validated
❗ 70 terms (9.20%) are not validated
pbcm68k_validated = pbcm68k[:, validated]
Link cell types:
# inspect shows none of the terms are mappable
lb.CellType.inspect(pbcm68k_validated.obs["cell_type"], "name")
# here we search the cell type names from the public ontology and grab the top match
# then add the cell type names from the pbcm68k as synonyms
celltype_bt = lb.CellType.bionty()
ontology_ids = []
mapper = {}
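for ct in pbcm68k_validated.obs["cell_type"].unique():
    # take the top search hit from the public ontology
    ontology_id = celltype_bt.search(ct).iloc[0].ontology_id
    record = lb.CellType.from_bionty(ontology_id=ontology_id)
    mapper[ct] = record.name
    record.save()
    # remember the dataset's own label as a synonym of the ontology term
    record.add_synonym(ct)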
pbcm68k_validated.obs["cell_type"] = pbcm68k_validated.obs["cell_type"].map(mapper)
❗ received 9 unique terms, 61 empty/duplicated terms are ignored
✅ 0 terms (0.00%) are validated
❗ 9 terms (100.00%) are not validated
✅ validated 1 CellType record from Bionty on ontology_id: CL:0000451
✅ validated 1 CellType record from Bionty on ontology_id: CL:0001201
✅ validated 1 CellType record from Bionty on ontology_id: CL:0001087
✅ validated 1 CellType record from Bionty on ontology_id: CL:0000910
✅ validated 1 CellType record from Bionty on ontology_id: CL:0000919
✅ validated 1 CellType record from Bionty on ontology_id: CL:0002057
✅ validated 1 CellType record on ontology_id: CL:0000939
✅ validated 1 CellType record from Bionty on ontology_id: CL:0002102
✅ validated 1 CellType record on ontology_id: CL:0000990
/tmp/ipykernel_2309/593905118.py:16: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
pbcm68k_validated.obs["cell_type"] = pbcm68k_validated.obs["cell_type"].map(mapper)
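The loop above relies on taking the top search hit for each free-text label. As a rough illustration of that idea, the sketch below uses the standard library's `difflib` as a stand-in for the ontology search. This is an assumption for illustration only: lamindb's `search` uses its own ranking, and the dataset labels here are hypothetical.

```python
import difflib

# Ontology names taken from the validated cell types in this notebook
ontology_names = ["dendritic cell", "conventional dendritic cell", "cytotoxic T cell"]
# Hypothetical free-text labels as they might appear in a dataset
dataset_labels = ["Dendritic cells", "Cytotoxic T cells"]

# Compare case-insensitively, but return the original ontology spelling
lower_to_name = {name.lower(): name for name in ontology_names}


def top_match(label: str) -> str:
    """Return the ontology name that is most similar to the label."""
    hits = difflib.get_close_matches(
        label.lower(), list(lower_to_name), n=1, cutoff=0.0
    )
    return lower_to_name[hits[0]]


mapper = {label: top_match(label) for label in dataset_labels}
print(mapper)
```

A top hit is only a candidate. With ambiguous labels it is worth reviewing the mapper by hand before overwriting the column, since a wrong synonym would be persisted in the registry.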
Now, all cell types should be validated:
lb.CellType.validate(pbcm68k_validated.obs["cell_type"], "name");
❗ received 9 unique terms, 61 empty/duplicated terms are ignored
✅ 9 terms (100.00%) are validated
file_pbcm68k = ln.File.from_anndata(
    pbcm68k_validated, description="10x reference pbmc68k", var_ref=lb.Gene.symbol
)
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/oVfV4tXeKpGM6X7UFMk7.h5ad')
💡 parsing feature names of X stored in slot 'var'
💡 using global setting species = human
✅ validated 695 Gene records on symbol: PPP1R14A, RSL1D1, KLF6, CD47, BBX, OTUB1, GATA3, DCXR, SRRM2, IL2RB, BANK1, G0S2, ATP6V1E1, IMPDH2, AGTRAP, SLCO3A1, HLA-DPB1, ARPC5L, ACADS, ATP6V1G1, ...
✅ linked: FeatureSet(id='fgqRxj5dmLeN5groV2zW', n=695, type='float', registry='bionty.Gene', hash='W4ps_86b5dxk2Wd1gWTo', created_by_id='DzTjkKse')
💡 parsing feature names of slot 'obs'
✅ validated 1 Feature record on name: cell_type
❗ did not validate 3 Feature records for names: louvain, n_genes, percent_mito
❗ ignoring non-validated features: louvain,n_genes,percent_mito
✅ linked: FeatureSet(id='gGvGY3rlFIqEN7LXUlAG', n=1, registry='core.Feature', hash='Fhzyzfg9Cd5MnYB_uSjG', created_by_id='DzTjkKse')
file_pbcm68k.save()
✅ saved 2 feature sets for slots: ['var', 'obs']
✅ storing file 'oVfV4tXeKpGM6X7UFMk7' at '.lamindb/oVfV4tXeKpGM6X7UFMk7.h5ad'
cell_types = lb.CellType.from_values(pbcm68k_validated.obs["cell_type"], "name")
file_pbcm68k.add_labels(cell_types)
✅ validated 9 CellType records on name: B cell, CD19-positive, CD14-positive, CD16-negative classical monocyte, CD16-positive, CD56-dim natural killer cell, human, CD38-negative naive B cell, CD8-positive, CD25-positive, alpha-beta regulatory T cell, conventional dendritic cell, cytotoxic T cell, dendritic cell, effector memory CD4-positive, alpha-beta T cell, terminally differentiated
✅ linked labels 'B cell, CD19-positive', 'CD14-positive, CD16-negative classical monocyte', 'CD16-positive, CD56-dim natural killer cell, human', 'CD38-negative naive B cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'conventional dendritic cell', 'cytotoxic T cell', 'dendritic cell', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated' to feature 'cell_type'
file_pbcm68k.describe()
💡 File(id=oVfV4tXeKpGM6X7UFMk7, key=None, suffix=.h5ad, accessor=AnnData, description=10x reference pbmc68k, version=None, size=589484, hash=eKVXV5okt5YRYjySMTKGEw, hash_type=md5, created_at=2023-08-17 17:33:21.798553+00:00, updated_at=2023-08-17 17:33:21.798576+00:00)
Provenance:
  🗃️ storage: Storage(id='6ObKC1Wo', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-08-17 17:32:30, created_by_id='DzTjkKse')
  📔 transform: Transform(id='Nv48yAceNSh8z8', name='Curate & link scRNA-seq datasets', short_name='scrna', stem_id='Nv48yAceNSh8', version='0', type='notebook', updated_at=2023-08-17 17:33:21, created_by_id='DzTjkKse')
  👣 run: Run(id='UcskU0EgGleETR6Oao82', run_at=2023-08-17 17:32:36, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-17 17:32:30)
Features:
  var (X):
    🔗 index (695, bionty.Gene.id): ['S40ynPlt1FR6', 'NlRk1RxzXPJS', '10ogBGiTBuv4', 'E231nvehewhz', 'Z2SBxpiWYpPs'...]
  obs (metadata):
    🔗 cell_type (9, bionty.CellType): ['CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'cytotoxic T cell', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'CD14-positive, CD16-negative classical monocyte', 'CD38-negative naive B cell']
file_pbcm68k.view_lineage()
Now let's continue with data integration: Integrate scRNA-seq datasets