What happens if I save the same artifacts & records twice?

LaminDB’s operations are idempotent in the sense defined in this document.

This allows you to re-run a notebook or script without erroring or duplicating data. Similar behavior holds for human data entry.

Summary

Metadata records

If you try to create any metadata record (Record) and search_names is True (the default):

  1. LaminDB will warn you if a record with a similar name exists and display a table of similar existing records.

  2. You can then decide whether you’d like to save a record to the database or rather query an existing one from the table.

  3. If the name exactly matches an existing record in the registry, LaminDB returns that record instead of creating a new one. For versioned entities, you must also pass the version.

If you set search_names to False, LaminDB skips this search and you populate the database directly.
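
The lookup rules above can be sketched with a toy in-memory registry. This is not LaminDB's implementation: `ToyRegistry`, its `create` and `save` methods, and the `difflib`-based similarity check are all assumptions made for illustration, and LaminDB's actual name search may rank matches differently.

```python
import difflib


class ToyRegistry:
    """Toy stand-in for a registry; not LaminDB's implementation."""

    def __init__(self, search_names=True):
        self.search_names = search_names
        self.records = []  # saved records, each a dict with a "name" key

    def create(self, name):
        """Return (record, similar_names) following the rules above."""
        if not self.search_names:
            # No search: the caller gets a fresh record right away.
            return {"name": name}, []
        names = [r["name"] for r in self.records]
        if name in names:
            # Exact match: hand back the existing record.
            return self.records[names.index(name)], []
        # No exact match: report similar names so the caller can decide.
        similar = difflib.get_close_matches(name, names, n=5, cutoff=0.6)
        return {"name": name}, similar

    def save(self, record):
        # Saving a record that is already stored changes nothing.
        if not any(r is record for r in self.records):
            self.records.append(record)
        return record


registry = ToyRegistry()
first = registry.save(registry.create("My project 1")[0])
again, _ = registry.create("My project 1")  # exact match
registry.save(again)  # no second entry is created
candidate, similar = registry.create("My project 1a")  # similar name
```

In this sketch, `again` is the very object stored as `first`, so saving it a second time leaves the registry with a single entry, while `similar` lists the existing name that resembles the new one.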

Data: artifacts & collections

If you try to create an Artifact object from content that was already ingested, the outcome depends on creation.artifact_if_hash_exists:

  • you’ll get an existing object, if creation.artifact_if_hash_exists = "warn_return_existing" (the default)

  • you’ll get an error, if creation.artifact_if_hash_exists = "error"

  • you’ll get a warning and a new object, if creation.artifact_if_hash_exists = "warn_create_new"
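
The three policies listed above can be modeled with a small hash-keyed store. This is a toy sketch, not LaminDB's code: `content_hash`, `create_artifact`, and the dictionary-based `store` are hypothetical names, and the hash encoding (an MD5 digest as unpadded URL-safe base64) is an assumption.

```python
import base64
import hashlib
import logging

logger = logging.getLogger("toy_artifacts")


def content_hash(data: bytes) -> str:
    # Assumption: MD5 digest encoded as URL-safe base64 without padding.
    digest = hashlib.md5(data).digest()
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")


def create_artifact(store: dict, data: bytes,
                    if_hash_exists: str = "warn_return_existing") -> dict:
    """Toy model of the three policies; `store` maps hash -> artifacts."""
    key = content_hash(data)
    existing = store.setdefault(key, [])
    if existing:
        if if_hash_exists == "warn_return_existing":
            logger.warning("returning existing artifact with same hash")
            return existing[0]
        if if_hash_exists == "error":
            raise FileExistsError(f"artifact with hash {key} already exists")
        if if_hash_exists != "warn_create_new":
            raise ValueError(f"unknown policy: {if_hash_exists}")
        logger.warning("creating new artifact despite existing hash")
    # Either the content is new or the policy asks for a new object.
    artifact = {"id": sum(len(v) for v in store.values()) + 1, "hash": key}
    existing.append(artifact)
    return artifact


store = {}
a1 = create_artifact(store, b"same bytes")
a2 = create_artifact(store, b"same bytes")  # default policy: existing object
a3 = create_artifact(store, b"same bytes", "warn_create_new")  # new object
```

Under the default policy the second call hands back `a1` itself, which is what makes re-running the same ingestion code safe in this model.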

Examples

# !pip install 'lamindb[jupyter]'
!lamin init --storage ./test-idempotency
 initialized lamindb: testuser1/test-idempotency
import lamindb as ln
import pytest

ln.track("ANW20Fr4eZgM0000")
 connected lamindb: testuser1/test-idempotency
 created Transform('ANW20Fr4eZgM0000'), started new Run('eMMlJfa0...') at 2025-01-17 14:20:02 UTC
 notebook imports: lamindb==1.0rc1 pytest==8.3.4

Metadata records

assert ln.settings.creation.search_names

Let us add a first record to the ULabel registry:

label = ln.ULabel(name="My project 1")
label.save()
ULabel(uid='E8WLXmas', name='My project 1', created_by_id=1, run_id=1, space_id=1, created_at=2025-01-17 14:20:03 UTC)

If we create a record with a similar name, we automatically get search results that hint at whether we are about to duplicate an entry:

label = ln.ULabel(name="My project 1a")
! record with similar name exists! did you mean to load it?
uid name is_type description reference reference_type space_id type_id run_id created_at created_by_id _aux _branch_code
id
1 E8WLXmas My project 1 None None None None 1 None 1 2025-01-17 14:20:03.678000+00:00 1 None 1
label.save()
ULabel(uid='iPyAaK1h', name='My project 1a', created_by_id=1, run_id=1, space_id=1, created_at=2025-01-17 14:20:03 UTC)

In case we match an existing name directly, we’ll get the existing object:

label = ln.ULabel(name="My project 1")
 returning existing ULabel record with same name: 'My project 1'

If we save it again, it will not create a new entry in the registry:

label.save()
ULabel(uid='E8WLXmas', name='My project 1', created_by_id=1, run_id=1, space_id=1, created_at=2025-01-17 14:20:03 UTC)

Now, if we create a third record with a similar name, we are shown both existing records as alternatives:

label = ln.ULabel(name="My project 1b")
! records with similar names exist! did you mean to load one of them?
uid name is_type description reference reference_type space_id type_id run_id created_at created_by_id _aux _branch_code
id
1 E8WLXmas My project 1 None None None None 1 None 1 2025-01-17 14:20:03.678000+00:00 1 None 1
2 iPyAaK1h My project 1a None None None None 1 None 1 2025-01-17 14:20:03.744000+00:00 1 None 1

If we prefer not to perform the search, e.g. for performance reasons or because the logging is too noisy, we can switch it off:

ln.settings.creation.search_names = False
label = ln.ULabel(name="My project 1c")

In this walkthrough, switch it back on:

ln.settings.creation.search_names = True
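
Switching a setting off and back on by hand is easy to forget, especially if the code in between raises. One possible pattern is a small context manager that restores the previous value on exit. The helper `temporary_setting` below is hypothetical and not part of LaminDB; the sketch uses a `SimpleNamespace` stand-in so that it runs without a LaminDB instance.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_setting(settings, name, value):
    """Set `settings.<name>` to `value` and restore the old value on exit."""
    previous = getattr(settings, name)
    setattr(settings, name, value)
    try:
        yield settings
    finally:
        # Runs even if the body raises, so the setting is always restored.
        setattr(settings, name, previous)


# Stand-in for a settings object; with LaminDB you would presumably
# pass ln.settings.creation here (an assumption, not tested in this sketch).
creation = SimpleNamespace(search_names=True)

with temporary_setting(creation, "search_names", False):
    inside = creation.search_names  # value while the block runs
after = creation.search_names  # value once the block has exited
```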

Data: artifacts & collections

Warn upon trying to re-ingest an existing artifact

assert ln.settings.creation.artifact_if_hash_exists == "warn_return_existing"
filepath = ln.core.datasets.file_fcs()

Create an Artifact:

artifact = ln.Artifact(filepath, description="My fcs artifact").save()
assert artifact.hash == "KCEXRahJ-Ui9Y6nksQ8z1A"
assert artifact.run == ln.context.run
assert len(artifact._previous_runs.all()) == 0

Create an Artifact from the same path:

artifact2 = ln.Artifact(filepath, description="My fcs artifact")
 returning existing artifact with same hash: Artifact(uid='jg5R486XIVAQnoaA0000', is_latest=True, description='My fcs artifact', suffix='.fcs', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-17 14:20:04 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()

It gives us the existing object:

assert artifact.id == artifact2.id
assert artifact.run == artifact2.run
assert len(artifact._previous_runs.all()) == 0

If you save it again, nothing will happen (the operation is idempotent):

artifact2.save()
Artifact(uid='jg5R486XIVAQnoaA0000', is_latest=True, description='My fcs artifact', suffix='.fcs', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-17 14:20:04 UTC)

The cell below shows how this interacts with data lineage.

ln.context.track(new_run=True)
artifact3 = ln.Artifact(filepath, description="My fcs artifact")
assert artifact3.id == artifact2.id
assert artifact3.run != artifact2.run
assert artifact3._previous_runs.first() == artifact2.run
 loaded Transform('ANW20Fr4eZgM0000'), started new Run('WJOr918T...') at 2025-01-17 14:20:04 UTC
 notebook imports: lamindb==1.0rc1 pytest==8.3.4
 returning existing artifact with same hash: Artifact(uid='jg5R486XIVAQnoaA0000', is_latest=True, description='My fcs artifact', suffix='.fcs', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-17 14:20:04 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
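
The assertions above suggest the following bookkeeping: when existing content is re-created in a new run, the artifact points at the new run and the earlier run is kept for lineage. The sketch below is a toy model of that reading, not LaminDB's implementation; `return_existing` and the dictionary fields are hypothetical.

```python
def return_existing(artifact: dict, current_run: str) -> dict:
    """Toy model of re-creating an existing artifact inside `current_run`."""
    if artifact["run"] != current_run:
        # Keep the earlier run for lineage, then point at the current run.
        artifact["previous_runs"].append(artifact["run"])
        artifact["run"] = current_run
    return artifact


artifact = {"id": 1, "run": "run-1", "previous_runs": []}

return_existing(artifact, "run-1")  # same run: nothing changes
runs_after_same_run = list(artifact["previous_runs"])

return_existing(artifact, "run-2")  # new run: earlier run is recorded
```

Calling `return_existing` again with `"run-2"` leaves the record unchanged, which mirrors the idempotent behavior described in this section.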

Error upon trying to re-ingest an existing artifact

ln.settings.creation.artifact_if_hash_exists = "error"

In this case, you won’t be able to create an object from the same content:

with pytest.raises(FileExistsError):
    artifact3 = ln.Artifact(filepath, description="My new fcs artifact")

Warn and create a new artifact

Lastly, let us discuss the following setting:

ln.settings.creation.artifact_if_hash_exists = "warn_create_new"

In this case, you’ll create a new object:

artifact4 = ln.Artifact(filepath, description="My new fcs artifact").save()
! creating new Artifact object despite existing artifact with same hash: Artifact(uid='jg5R486XIVAQnoaA0000', is_latest=True, description='My fcs artifact', suffix='.fcs', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-17 14:20:04 UTC)

You can verify that it’s a new entry by comparing the ids:

assert artifact4.id != artifact.id
ln.Artifact.filter(hash="KCEXRahJ-Ui9Y6nksQ8z1A").df()
uid key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest run_id created_at created_by_id _aux _branch_code
id
1 jg5R486XIVAQnoaA0000 None My fcs artifact .fcs None None 6785467 KCEXRahJ-Ui9Y6nksQ8z1A None None md5 True False 1 1 None None True 1 2025-01-17 14:20:04.144000+00:00 1 None 1
2 OaCOi85me397dDut0000 None My new fcs artifact .fcs None None 6785467 KCEXRahJ-Ui9Y6nksQ8z1A None None md5 True False 1 1 None None True 2 2025-01-17 14:20:05.688000+00:00 1 None 1
assert len(ln.Artifact.filter(hash="KCEXRahJ-Ui9Y6nksQ8z1A").all()) == 2
!rm -rf ./test-idempotency
!lamin delete --force test-idempotency
 deleting instance testuser1/test-idempotency