Standalone Extractor Usage#
The NexusLIMS extractor system can be used as a standalone Python library
without a full NexusLIMS deployment. No .env file, database, NEMO instance,
or CDCS server is required.
This notebook demonstrates:

- Downloading example microscopy files from the project repository
- Extracting metadata with the high-level parse_metadata() API
- Using the low-level registry API (ExtractionContext + get_registry())
- What degrades gracefully when no NexusLIMS config is present
Note
Download this notebook to run it locally.
Installation#
pip install nexusLIMS
# or, if you use uv:
uv add nexusLIMS
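To confirm the install succeeded, you can query the installed distribution with the standard-library importlib.metadata (this is generic Python tooling, not a NexusLIMS API):

```python
from importlib.metadata import PackageNotFoundError, version

try:
    # Reports the installed distribution version, e.g. "2.4.2.dev0"
    print(version("nexusLIMS"))
except PackageNotFoundError:
    print("nexusLIMS is not installed")
```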
1. Download example data files#
We pull a handful of microscopy files straight from the NexusLIMS test suite
on GitHub so this notebook is self-contained. Some larger files are stored as
.tar.gz archives in the repository; those are downloaded and extracted here.
import tarfile
import urllib.request
from pathlib import Path
# Raw-content base URL for test files in the NexusLIMS repository
_BASE = "https://raw.githubusercontent.com/datasophos/NexusLIMS/main/tests/unit/files/"
data_dir = Path("example_data")
data_dir.mkdir(exist_ok=True)
def fetch(remote_name: str, extract_to: Path | None = None) -> Path:
    """Download a file from the NexusLIMS test suite.

    If *extract_to* is given the file is treated as a .tar.gz archive:
    it is downloaded to a temp path and extracted into *extract_to*.
    Returns the path of the (extracted) file.
    """
    url = _BASE + remote_name
    dest = data_dir / remote_name
    if extract_to is None:
        # Plain file download
        if dest.exists():
            print(f"  already present : {remote_name}")
        else:
            print(f"  downloading {remote_name} …", end=" ", flush=True)
            urllib.request.urlretrieve(url, dest)
            print(f"done ({dest.stat().st_size:,} bytes)")
        return dest
    else:
        # Archive download + extraction
        # The member file has the same stem as the archive (e.g., foo.dm3.tar.gz → foo.dm3)
        stem = remote_name.replace(".tar.gz", "")
        extracted = extract_to / stem
        if extracted.exists():
            print(f"  already present : {stem}")
        else:
            archive_path = data_dir / remote_name
            if not archive_path.exists():
                print(f"  downloading {remote_name} …", end=" ", flush=True)
                urllib.request.urlretrieve(url, archive_path)
                print(f"done ({archive_path.stat().st_size:,} bytes)")
            print(f"  extracting {stem} …", end=" ", flush=True)
            with tarfile.open(archive_path) as tf:
                tf.extract(stem, path=extract_to, filter="data")
            print(f"done ({extracted.stat().st_size:,} bytes)")
        return extracted
# --- Files stored directly in the repo ---
dm3_file = fetch("test_STEM_image.dm3")
orion_tif = fetch("orion-zeiss_dataZeroed.tif")
msa_file = fetch("leo_edax_test.msa")
spc_file = fetch("leo_edax_test.spc")
# --- Files stored as archives ---
multi_dm3 = fetch("TEM_list_signal_dataZeroed.dm3.tar.gz", extract_to=data_dir)
neoarm_dm4 = fetch("neoarm-gatan_SI_dataZeroed.dm4.tar.gz", extract_to=data_dir)
print("\nAll files ready.")
already present : test_STEM_image.dm3
already present : orion-zeiss_dataZeroed.tif
already present : leo_edax_test.msa
already present : leo_edax_test.spc
already present : TEM_list_signal_dataZeroed.dm3
already present : neoarm-gatan_SI_dataZeroed.dm4
All files ready.
2. High-level API — parse_metadata()#
parse_metadata() is the main entry point. Pass
write_output=False, generate_preview=False to skip the NexusLIMS
deployment-specific steps (writing JSON sidecars and generating thumbnails).
2a. Gatan DM3 — single STEM image#
from nexusLIMS.extractors import parse_metadata
metadata_list, _previews = parse_metadata(
    dm3_file,
    write_output=False,
    generate_preview=False,
)
print(f"Signals found : {len(metadata_list)}")
nx = metadata_list[0]["nx_meta"]
print(f"DatasetType : {nx['DatasetType']}")
print(f"Data Type : {nx['Data Type']}")
print(f"Creation Time : {nx['Creation Time']}")
Signals found : 1
DatasetType : Image
Data Type : STEM_Imaging
Creation Time : 2026-02-18T14:54:03.330955-07:00
The full nx_meta dictionary contains everything the extractor could pull from the file.
A small helper makes nested dicts and Pint Quantity objects readable:
from decimal import Decimal
import numpy as np
def _to_display(obj):
    """Convert quantities and numpy scalars to printable strings."""
    if hasattr(obj, "magnitude"):  # pint Quantity
        return f"{obj.magnitude} {obj.units}"
    if isinstance(obj, (np.integer, np.floating)):
        return obj.item()
    if isinstance(obj, Decimal):
        return float(obj)
    return obj

def pretty(d, indent=0):
    for k, v in d.items():
        if isinstance(v, dict):
            print(" " * indent + f"{k}:")
            pretty(v, indent + 2)
        else:
            print(" " * indent + f"{k}: {_to_display(v)}")
pretty(nx)
acceleration_voltage: 200000.0 volt
acquisition_device: DigiScan
Creation Time: 2026-02-18T14:54:03.330955-07:00
Data Dimensions: (68, 68)
Data Type: STEM_Imaging
DatasetType: Image
dwell_time: 3.5 microsecond
extensions:
  Cs: 0.0 millimeter
  GMS Version: 2.31.734.0
  Illumination Mode: STEM NANOPROBE
  Imaging Mode: DIFFRACTION
  Microscope: FEI Titan
  Name: TEST Titan Remote
  Operation Mode: SCANNING
  STEM Camera Length: 135.0 millimeter
horizontal_field_width: 0.5090058644612631 micrometer
magnification: 225000.0
stage_position:
  tilt_alpha: 24.950478513002935 degree
  x: -461.276 micrometer
  y: 52.0039 micrometer
  z: 35.033899999999996 millimeter
warnings: []
NexusLIMS Extraction:
  Date: 2026-02-18T15:13:06.884944-07:00
  Module: nexusLIMS.extractors.plugins.dm3_extractor
  Version: 2.4.2.dev0
2b. Multi-signal DM3 file#
Some DM3/DM4 files contain multiple signals (e.g., a survey image alongside
an EELS spectrum). parse_metadata() returns one metadata dict per signal.
multi_list, _ = parse_metadata(
    multi_dm3,
    write_output=False,
    generate_preview=False,
)
print(f"Signals in file: {len(multi_list)}")
for i, m in enumerate(multi_list):
    nx = m["nx_meta"]
    dims = nx.get("Data Dimensions", "?")
    print(f" [{i}] {nx['DatasetType']:15s} {nx['Data Type']:25s} dims={dims}")
Signals in file: 2
 [0] Image           STEM_Imaging              dims=(512, 512)
 [1] Image           TEM_Imaging               dims=(512,)
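Because parse_metadata() returns a flat list, a small helper (our own, not part of the NexusLIMS API) can group the signals by DatasetType when you only care about one kind, e.g. the spectra:

```python
def signals_by_type(metadata_list):
    """Group parse_metadata() results by their nx_meta 'DatasetType'."""
    groups = {}
    for m in metadata_list:
        groups.setdefault(m["nx_meta"]["DatasetType"], []).append(m)
    return groups

# e.g., signals_by_type(multi_list).keys() for the file above
```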
2c. Gatan DM4 — multi-signal spectrum image (EELS/EDS SI)#
dm4_list, _ = parse_metadata(
    neoarm_dm4,
    write_output=False,
    generate_preview=False,
)
print(f"Signals in file: {len(dm4_list)}")
for i, m in enumerate(dm4_list):
    nx = m["nx_meta"]
    dims = nx.get("Data Dimensions", "?")
    print(f" [{i}] {nx['DatasetType']:15s} {nx['Data Type']:25s} dims={dims}")
Signals in file: 4
 [0] Image           STEM_Imaging              dims=(512, 512)
 [1] Image           STEM_Imaging              dims=(52, 106)
 [2] Image           STEM_Imaging              dims=(52, 106)
 [3] SpectrumImage   EDS_Spectrum_Imaging      dims=(52, 106, 2048)
2d. Zeiss Orion HIM TIFF#
orion_list, _ = parse_metadata(
    orion_tif,
    write_output=False,
    generate_preview=False,
)
nx = orion_list[0]["nx_meta"]
print(f"DatasetType : {nx['DatasetType']}")
print(f"Data Type : {nx['Data Type']}")
print(f"Creation Time : {nx['Creation Time']}")
DatasetType : Image
Data Type : HIM_Imaging
Creation Time : 2026-02-18T21:54:03.444135+00:00
2e. EDAX spectrum files (.spc / .msa)#
for path in [spc_file, msa_file]:
    result, _ = parse_metadata(path, write_output=False, generate_preview=False)
    nx = result[0]["nx_meta"]
    print(f"{path.name}")
    print(f" DatasetType : {nx['DatasetType']}")
    print(f" Data Type : {nx['Data Type']}")
    print(f" Creation Time: {nx['Creation Time']}")
    print()
leo_edax_test.spc
DatasetType : Spectrum
Data Type : EDS_Spectrum
Creation Time: 2026-02-18T21:54:03.966308+00:00
WARNING | Hyperspy | `signal_type='EDS'` not understood. See `hs.print_known_signal_types()` for a list of installed signal types or https://github.com/hyperspy/hyperspy-extensions-list for the list of all hyperspy extensions providing signals. (hyperspy.io:745)
leo_edax_test.msa
DatasetType : Spectrum
Data Type : EDS_Spectrum
Creation Time: 2026-02-18T21:54:03.851536+00:00
3. Low-level API — ExtractionContext + get_registry()#
For more control you can work directly with the extractor registry. This lets
you inspect which extractor was selected, or call extract() yourself without
going through parse_metadata().
from nexusLIMS.extractors.base import ExtractionContext
from nexusLIMS.extractors.registry import get_registry
# instrument=None is fine when there is no database to look up
context = ExtractionContext(file_path=dm3_file, instrument=None)
registry = get_registry()
extractor = registry.get_extractor(context)
print(f"Selected extractor : {extractor.name}")
print(f"Supported extensions: {extractor.supported_extensions}")
Selected extractor : dm3_extractor
Supported extensions: {'dm4', 'dm3'}
# Call extract() directly — returns list[dict], one dict per signal
raw_list = extractor.extract(context)
print(f"Signals returned: {len(raw_list)}")
nx = raw_list[0]["nx_meta"]
print(f" DatasetType : {nx['DatasetType']}")
print(f" Data Type : {nx['Data Type']}")
Signals returned: 1
DatasetType : Image
Data Type : STEM_Imaging
Listing all registered extractors#
print(f"{'Name':<35} {'Priority':>8} Extensions")
print("-" * 70)
for ext in registry.all_extractors:
    exts = (
        ", ".join(sorted(ext.supported_extensions))
        if ext.supported_extensions
        else "(any)"
    )
    print(f"{ext.name:<35} {ext.priority:>8} {exts}")
Name                                Priority Extensions
----------------------------------------------------------------------
orion_HIM_tif_extractor                  150 tif, tiff
tescan_tif_extractor                     150 tif, tiff
dm3_extractor                            100 dm3, dm4
msa_extractor                            100 msa
spc_extractor                            100 spc
ser_emi_extractor                        100 ser
quanta_tif_extractor                     100 tif, tiff
basic_file_info_extractor                  0 (any)
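The listing hints at how selection works: among the extractors whose extensions match the file, the highest priority wins, with basic_file_info_extractor (priority 0, any extension) as the catch-all. Below is a toy illustration of that rule, not the actual registry code — the real get_extractor() must also distinguish extractors at equal priority (note the two TIFF extractors both at 150), presumably by inspecting file contents:

```python
def pick_by_priority(extractors, suffix):
    """Toy selection rule: highest-priority extractor whose extension
    set contains *suffix* (an empty set matches any file)."""
    candidates = [
        e for e in extractors
        if not e["extensions"] or suffix in e["extensions"]
    ]
    return max(candidates, key=lambda e: e["priority"], default=None)

toy_registry = [
    {"name": "dm3_extractor", "priority": 100, "extensions": {"dm3", "dm4"}},
    {"name": "basic_file_info_extractor", "priority": 0, "extensions": set()},
]
print(pick_by_priority(toy_registry, "dm3")["name"])  # dm3_extractor
print(pick_by_priority(toy_registry, "xyz")["name"])  # basic_file_info_extractor
```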
4. Graceful degradation without NexusLIMS configuration#
| Feature | Without config | With config |
|---|---|---|
| Metadata extraction | ✅ Full | ✅ Full |
| Schema validation | ✅ Full | ✅ Full |
| Instrument ID from database | ⚠️ Returns None | ✅ Looks up DB |
| JSON sidecar write (write_output=True) | ⚠️ Skipped + warning | ✅ Written |
| Preview thumbnail (generate_preview=True) | ⚠️ Skipped + warning | ✅ Generated |
| Local instrument profiles | ⚠️ Skipped (built-ins active) | ✅ Loaded |
If you call parse_metadata() with the defaults (write_output=True, generate_preview=True) and no config is present, NexusLIMS logs warnings but
still returns the extracted metadata. The cell below demonstrates that:
import logging
import nexusLIMS.extractors as _ext
# Show warnings inline
logging.basicConfig(level=logging.WARNING)
# Temporarily pretend config is unavailable
_real = _ext._config_available
_ext._config_available = lambda: False
try:
    result, previews = parse_metadata(
        dm3_file,
        write_output=True,  # normally writes JSON — skipped with a warning
        generate_preview=True,  # normally makes thumbnail — skipped with a warning
    )
    print(f"Metadata returned despite missing config: {result is not None}")
    print(f"Previews list (all None when config missing): {previews}")
finally:
    _ext._config_available = _real  # restore
2026-02-18 15:13:07,281 nexusLIMS.extractors WARNING: NexusLIMS config unavailable; skipping metadata file write (pass write_output=False to suppress this warning)
2026-02-18 15:13:07,281 nexusLIMS.extractors WARNING: NexusLIMS config unavailable; skipping preview generation (pass generate_preview=False to suppress this warning)
Metadata returned despite missing config: True
Previews list (all None when config missing): [None]