Curate datasets of any format¶
Our previous guide explained how to validate, standardize & annotate DataFrame and AnnData. In this guide, we’ll walk through the basic API that lets you work with any format of data.
How do I validate based on a public ontology?
LaminDB makes it easy to validate categorical variables based on registries that inherit from CanCurate.
CanCurate methods validate against the registries in your LaminDB instance.
In Manage biological registries, you’ll see how to extend standard validation to validation against public references using a ReferenceTable ontology object: public = Record.public().
By default, from_values() considers a match in a public reference a validated value for any bionty entity.
# !pip install 'lamindb[bionty,zarr]'
!lamin init --storage ./test-curate-any --modules bionty
→ initialized lamindb: testuser1/test-curate-any
import lamindb as ln
import bionty as bt
import zarr
import numpy as np
data = zarr.create(
    (10,),
    dtype=[("value", "f8"), ("gene", "U15"), ("disease", "U16")],
    store="data.zarr",
)
data["gene"] = [
    "ENSG00000139618",
    "ENSG00000141510",
    "ENSG00000133703",
    "ENSG00000157764",
    "ENSG00000171862",
    "ENSG00000091831",
    "ENSG00000141736",
    "ENSG00000133056",
    "ENSG00000146648",
    "ENSG00000118523",
]
data["disease"] = np.random.default_rng().choice(["MONDO:0004975", "MONDO:0004980"], 10)
→ connected lamindb: testuser1/test-curate-any
Define validation criteria¶
Entities that don’t have a dedicated registry (“are not typed”) can be validated & registered using ULabel:
criteria = {
    "disease": bt.Disease.ontology_id,
    "project": ln.ULabel.name,
    "gene": bt.Gene.ensembl_gene_id,
}
Validate and standardize metadata¶
validate() validates passed values against reference values in a registry.
It returns a boolean vector indicating whether a value has an exact match in the reference values.
bt.Disease.validate(data["disease"], field=bt.Disease.ontology_id)
! Your Disease registry is empty, consider populating it first!
→ use `.import_source()` to import records from a source, e.g. a public ontology
array([False, False, False, False, False, False, False, False, False,
False])
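As a rough mental model (not LaminDB's actual implementation), exact-match validation can be sketched with NumPy alone. The validate_exact helper and the in-memory registry lists below are hypothetical stand-ins for a registry field. They only illustrate why an empty registry yields all False and why matching is exact:

```python
import numpy as np


def validate_exact(values, registry):
    """Return a boolean vector: True where a value exactly matches a registry entry.

    Hypothetical helper for illustration; not part of lamindb or bionty.
    """
    values = np.asarray(values, dtype=str)
    registry = np.asarray(list(registry), dtype=str)
    return np.isin(values, registry)


observed = ["MONDO:0004975", "MONDO:0004980", "mondo:0004975"]

# An empty registry validates nothing, mirroring the all-False result above.
empty_result = validate_exact(observed, [])

# Once the registry holds the reference values, exact matches validate;
# the lowercase variant still fails because matching is exact.
populated_result = validate_exact(observed, ["MONDO:0004975", "MONDO:0004980"])
```

In this sketch, populating the registry is what flips values from False to True, which is the role the import step suggested in the log plays in a real instance.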
When validation fails, you can call inspect() to figure out what to do.
inspect() applies the same definition of validation as validate(), but returns a rich return value InspectResult. Most importantly, it logs recommended curation steps that would render the data validated.
Note: you can use standardize() to standardize synonyms.
bt.Disease.inspect(data["disease"], field=bt.Disease.ontology_id);
! received 2 unique terms, 8 empty/duplicated terms are ignored
! 2 unique terms (100.00%) are not validated for ontology_id: 'MONDO:0004980', 'MONDO:0004975'
detected 2 Disease terms in Bionty for ontology_id: 'MONDO:0004980', 'MONDO:0004975'
→ add records from Bionty to your Disease registry via .from_values()
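The log above can be read as a three-way grouping of the unique terms: validated locally, not validated, and not validated but found in the public reference. A minimal sketch of that grouping follows, assuming plain Python sets in place of the registry and of the public ontology. The inspect_terms helper and its return keys are hypothetical and do not reproduce InspectResult:

```python
def inspect_terms(values, registry, public_reference):
    """Group unique terms by where they can be found.

    Hypothetical helper for illustration; it does not reproduce InspectResult.
    """
    unique_terms = list(dict.fromkeys(values))  # de-duplicate, keep first-seen order
    validated = [t for t in unique_terms if t in registry]
    non_validated = [t for t in unique_terms if t not in registry]
    # Terms missing locally but present in the public reference can be imported.
    importable = [t for t in non_validated if t in public_reference]
    return {
        "n_unique": len(unique_terms),
        "n_ignored": len(values) - len(unique_terms),
        "validated": validated,
        "non_validated": non_validated,
        "importable": importable,
    }


values = ["MONDO:0004980"] * 6 + ["MONDO:0004975"] * 4
report = inspect_terms(
    values,
    registry=set(),  # empty local registry, as in this guide
    public_reference={"MONDO:0004975", "MONDO:0004980"},
)
```

With ten input values and two distinct terms, the sketch counts 2 unique and 8 ignored terms, and both terms end up in the importable group.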
Follow the suggestions to register new labels.
Bulk-creating records with from_values() only returns validated records.
Note: terms validated against a public reference are also created by from_values(); see Manage biological registries for details.
diseases = bt.Disease.from_values(data["disease"], field=bt.Disease.ontology_id)
ln.save(diseases)
Repeat the process for more labels:
projects = ln.ULabel.from_values(
    ["Project A", "Project B"],
    field=ln.ULabel.name,
    create=True,  # create non-existing labels rather than attempting to load them from the database
)
ln.save(projects)
genes = bt.Gene.from_values(data["gene"], field=bt.Gene.ensembl_gene_id)
ln.save(genes)
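The effect of create=True can be pictured with a small stand-alone sketch. The Label class and the records_from_values function are hypothetical and only mimic the behavior described above. Unlike the real from_values(), this sketch does not consult any public reference:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    """Hypothetical stand-in for a registry record."""

    name: str


def records_from_values(values, known_names, create=False):
    """Return one Label per unique value.

    With create=False, values absent from `known_names` are skipped;
    with create=True, a new Label is made for them as well.
    """
    records = []
    for name in dict.fromkeys(values):  # unique values, first-seen order
        if name in known_names or create:
            records.append(Label(name))
    return records


known = {"Project A"}
only_validated = records_from_values(["Project A", "Project B"], known)
with_created = records_from_values(["Project A", "Project B"], known, create=True)
```

Here only_validated holds a single record for "Project A", whereas with_created also contains "Project B".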
Annotate and save dataset with validated metadata¶
Register the dataset as an artifact:
artifact = ln.Artifact("data.zarr", description="a zarr object").save()
! no run & transform got linked, call `ln.track()` & re-run
Link the artifact to validated labels. You could do this directly, e.g., via artifact.ulabels.add(projects) or artifact.diseases.add(diseases).
However, you often want to track which features the labels are values of. Hence, let’s try to associate our labels with features:
from lamindb.core.exceptions import ValidationError
try:
    artifact.features.add_values({"project": projects, "disease": diseases})
except ValidationError as e:
    print(e)
! cannot infer feature type of: [ULabel(uid='4wMhHlUX', name='Project A', created_by_id=1, space_id=1, created_at=2025-01-17 14:22:04 UTC), ULabel(uid='vf5q6xMq', name='Project B', created_by_id=1, space_id=1, created_at=2025-01-17 14:22:04 UTC)], returning '?
! cannot infer feature type of: [Disease(uid='4JmTj6Sn', name='atopic eczema', ontology_id='MONDO:0004980', synonyms='allergic form of dermatitis|Atopic dermatitis|Besnier's prurigo|Atopic neurodermatitis|atopic eczema|eczema|eczematous dermatitis|allergic dermatitis', description='A Chronic Inflammatory Genetically Determined Disease Of The Skin Marked By Increased Ability To Form Reagin (Ige), With Increased Susceptibility To Allergic Rhinitis And Asthma, And Hereditary Disposition To A Lowered Threshold For Pruritus. It Is Manifested By Lichenification, Excoriation, And Crusting, Mainly On The Flexural Surfaces Of The Elbow And Knee. In Infants It Is Known As Infantile Eczema.', created_by_id=1, space_id=1, source_id=50, created_at=2025-01-17 14:22:04 UTC), Disease(uid='4F2HPJ3w', name='Alzheimer disease', ontology_id='MONDO:0004975', synonyms='Alzheimer dementia|Alzheimer disease|Alzheimers disease|Alzheimers dementia|Alzheimer's disease|presenile and senile dementia|Alzheimer's dementia|AD', description='A Progressive, Neurodegenerative Disease Characterized By Loss Of Function And Death Of Nerve Cells In Several Areas Of The Brain Leading To Loss Of Cognitive Function Such As Memory And Language.', created_by_id=1, space_id=1, source_id=50, created_at=2025-01-17 14:22:04 UTC)], returning '?
These keys could not be validated: ['project', 'disease']
Here is how to create a feature:
ln.Feature(name='project', dtype='?').save()
ln.Feature(name='disease', dtype='?').save()
This errored because we hadn’t yet registered the features. After copying the suggested code from the error message and replacing the placeholder dtype '?' with the matching categorical types, things work out:
ln.Feature(name="project", dtype="cat[ULabel]").save()
ln.Feature(name="disease", dtype="cat[bionty.Disease]").save()
artifact.features.add_values({"project": projects, "disease": diseases})
artifact.features
Artifact .zarr
└── Linked features
    └── disease    cat[bionty.Disease]    Alzheimer disease, atopic eczema
        project    cat[ULabel]            Project A, Project B
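The fail-then-succeed sequence above amounts to checking annotation keys against registered features before linking any values. A minimal sketch of that check follows, assuming a plain dict that maps feature names to dtype strings. The add_values function and the local ValidationError class are hypothetical stand-ins, not the lamindb objects used above:

```python
class ValidationError(Exception):
    """Local stand-in for the exception used above; not imported from lamindb."""


def add_values(annotations, feature_registry):
    """Accept annotations only for keys that are registered features."""
    missing = [key for key in annotations if key not in feature_registry]
    if missing:
        raise ValidationError(f"These keys could not be validated: {missing}")
    # Pair each value with the dtype declared for its feature.
    return {key: (feature_registry[key], value) for key, value in annotations.items()}


annotations = {"project": ["Project A", "Project B"], "disease": ["MONDO:0004975"]}

# First attempt: no features registered yet, so every key is rejected.
try:
    add_values(annotations, feature_registry={})
    error_message = None
except ValidationError as e:
    error_message = str(e)

# Second attempt: both features are registered with a categorical dtype.
registry = {"project": "cat[ULabel]", "disease": "cat[bionty.Disease]"}
linked = add_values(annotations, registry)
```

Checking the keys up front means that either all values are linked or none are, which keeps a partially annotated artifact from slipping through in this sketch.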
Since genes are the measurements, we register them as features:
feature_set = ln.FeatureSet(genes).save()
artifact.features.add_feature_set(feature_set, slot="genes")
artifact.describe()
Artifact .zarr
├── General
│   ├── .uid = 'f85TfajZUb6HtGTd0000'
│   ├── .size = 974
│   ├── .hash = 'JJDeBeRu0_4uSm0J7tjtKg'
│   ├── .n_files = 2
│   ├── .path = /home/runner/work/lamindb/lamindb/docs/test-curate-any/.lamindb/f85TfajZUb6HtGTd.zarr
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-01-17 14:22:07
├── Dataset features/._schemas_m2m
│   └── genes • 10    [bionty.Gene]
│       BRCA2         num
│       TP53          num
│       KRAS          num
│       BRAF          num
│       PTEN          num
│       ESR1          num
│       ERBB2         num
│       PIK3C2B       num
│       EGFR          num
│       CCN2          num
├── Linked features
│   └── disease    cat[bionty.Disease]    Alzheimer disease, atopic eczema
│       project    cat[ULabel]            Project A, Project B
└── Labels
    └── .diseases    bionty.Disease    atopic eczema, Alzheimer disease
        .ulabels     ULabel            Project A, Project B
# clean up test instance
!lamin delete --force test-curate-any
!rm -r data.zarr
╭─ Error ──────────────────────────────────────────────────────────────────────╮
│ Storage '/home/runner/work/lamindb/lamindb/docs/test-curate-any/.lamindb' │
│ contains 2 objects - delete them prior to deleting the instance │
╰──────────────────────────────────────────────────────────────────────────────╯