Standalone Extractor Usage#
The NexusLIMS extractor system can be used as a standalone Python library
without a full NexusLIMS deployment. No .env file, database, NEMO instance,
or CDCS server is required.
This notebook demonstrates:

- Downloading example microscopy files from the project repository
- Extracting metadata with the high-level parse_metadata() API
- Using the low-level registry API (ExtractionContext + get_registry())
- What degrades gracefully when no NexusLIMS config is present
Note
Download this notebook to run it locally.
Installation#
pip install nexusLIMS
# or, if you use uv:
uv add nexusLIMS
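To confirm the install succeeded, you can query the installed distribution with the standard-library importlib.metadata (this is generic Python tooling, not a NexusLIMS API):

```python
from importlib.metadata import PackageNotFoundError, version

try:
    # Reports the installed distribution version, e.g. "2.4.2.dev0"
    print(version("nexusLIMS"))
except PackageNotFoundError:
    print("nexusLIMS is not installed")
```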
1. Download example data files#
We pull a handful of microscopy files straight from the NexusLIMS test suite
on GitHub so this notebook is self-contained. Some larger files are stored as
.tar.gz archives in the repository; those are downloaded and extracted here.
import tarfile
import urllib.request
from pathlib import Path
# Raw-content base URL for test files in the NexusLIMS repository
_BASE = "https://raw.githubusercontent.com/datasophos/NexusLIMS/main/tests/unit/files/"
data_dir = Path("example_data")
data_dir.mkdir(exist_ok=True)
def fetch(remote_name: str, extract_to: Path | None = None) -> Path:
    """Download a file from the NexusLIMS test suite.

    If *extract_to* is given the file is treated as a .tar.gz archive:
    it is downloaded to a temp path and extracted into *extract_to*.
    Returns the path of the (extracted) file.
    """
    url = _BASE + remote_name
    dest = data_dir / remote_name
    if extract_to is None:
        # Plain file download
        if dest.exists():
            print(f"  already present : {remote_name}")
        else:
            print(f"  downloading {remote_name} …", end=" ", flush=True)
            urllib.request.urlretrieve(url, dest)
            print(f"done ({dest.stat().st_size:,} bytes)")
        return dest
    else:
        # Archive download + extraction
        # The member file has the same stem as the archive (e.g., foo.dm3.tar.gz → foo.dm3)
        stem = remote_name.replace(".tar.gz", "")
        extracted = extract_to / stem
        if extracted.exists():
            print(f"  already present : {stem}")
        else:
            archive_path = data_dir / remote_name
            if not archive_path.exists():
                print(f"  downloading {remote_name} …", end=" ", flush=True)
                urllib.request.urlretrieve(url, archive_path)
                print(f"done ({archive_path.stat().st_size:,} bytes)")
            print(f"  extracting {stem} …", end=" ", flush=True)
            with tarfile.open(archive_path) as tf:
                tf.extract(stem, path=extract_to, filter="data")
            print(f"done ({extracted.stat().st_size:,} bytes)")
        return extracted
# --- Files stored directly in the repo ---
dm3_file = fetch("test_STEM_image.dm3")
orion_tif = fetch("orion-zeiss_dataZeroed.tif")
msa_file = fetch("leo_edax_test.msa")
spc_file = fetch("leo_edax_test.spc")
# --- Files stored as archives ---
multi_dm3 = fetch("TEM_list_signal_dataZeroed.dm3.tar.gz", extract_to=data_dir)
neoarm_dm4 = fetch("neoarm-gatan_SI_dataZeroed.dm4.tar.gz", extract_to=data_dir)
print("\nAll files ready.")
already present : test_STEM_image.dm3
already present : orion-zeiss_dataZeroed.tif
already present : leo_edax_test.msa
already present : leo_edax_test.spc
already present : TEM_list_signal_dataZeroed.dm3
already present : neoarm-gatan_SI_dataZeroed.dm4
All files ready.
2. High-level API — parse_metadata()#
parse_metadata() is the main entry point. Pass
write_output=False, generate_preview=False to skip the NexusLIMS
deployment-specific steps (writing JSON sidecars and generating thumbnails).
2a. Gatan DM3 — single STEM image#
from nexusLIMS.extractors import parse_metadata
metadata_list, _previews = parse_metadata(
    dm3_file,
    write_output=False,
    generate_preview=False,
)
print(f"Signals found : {len(metadata_list)}")
nx = metadata_list[0]["nx_meta"]
print(f"DatasetType : {nx['DatasetType']}")
print(f"Data Type : {nx['Data Type']}")
print(f"Creation Time : {nx['Creation Time']}")
Signals found : 1
DatasetType : Image
Data Type : STEM_Imaging
Creation Time : 2026-02-18T14:54:03.330955-07:00
The full nx_meta dictionary contains everything the extractor could pull from the file.
A small helper makes nested dicts and Pint Quantity objects readable:
from decimal import Decimal
import numpy as np
def _to_display(obj):
    """Convert quantities and numpy scalars to printable strings."""
    if hasattr(obj, "magnitude"):  # pint Quantity
        return f"{obj.magnitude} {obj.units}"
    if isinstance(obj, (np.integer, np.floating)):
        return obj.item()
    if isinstance(obj, Decimal):
        return float(obj)
    return obj

def pretty(d, indent=0):
    for k, v in d.items():
        if isinstance(v, dict):
            print(" " * indent + f"{k}:")
            pretty(v, indent + 2)
        else:
            print(" " * indent + f"{k}: {_to_display(v)}")
pretty(nx)
acceleration_voltage: 200000.0 volt
acquisition_device: DigiScan
Creation Time: 2026-02-18T14:54:03.330955-07:00
Data Dimensions: (68, 68)
Data Type: STEM_Imaging
DatasetType: Image
dwell_time: 3.5 microsecond
extensions:
  Cs: 0.0 millimeter
  GMS Version: 2.31.734.0
  Illumination Mode: STEM NANOPROBE
  Imaging Mode: DIFFRACTION
  Microscope: FEI Titan
  Name: TEST Titan Remote
  Operation Mode: SCANNING
  STEM Camera Length: 135.0 millimeter
horizontal_field_width: 0.5090058644612631 micrometer
magnification: 225000.0
stage_position:
  tilt_alpha: 24.950478513002935 degree
  x: -461.276 micrometer
  y: 52.0039 micrometer
  z: 35.033899999999996 millimeter
warnings: []
NexusLIMS Extraction:
  Date: 2026-02-18T15:13:06.884944-07:00
  Module: nexusLIMS.extractors.plugins.dm3_extractor
  Version: 2.4.2.dev0
2b. Multi-signal DM3 file#
Some DM3/DM4 files contain multiple signals (e.g., a survey image alongside
an EELS spectrum). parse_metadata() returns one metadata dict per signal.
multi_list, _ = parse_metadata(
    multi_dm3,
    write_output=False,
    generate_preview=False,
)
print(f"Signals in file: {len(multi_list)}")
for i, m in enumerate(multi_list):
    nx = m["nx_meta"]
    dims = nx.get("Data Dimensions", "?")
    print(f" [{i}] {nx['DatasetType']:15s} {nx['Data Type']:25s} dims={dims}")
Signals in file: 2
 [0] Image           STEM_Imaging              dims=(512, 512)
 [1] Image           TEM_Imaging               dims=(512,)
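Because parse_metadata() returns a flat list, a small helper (our own, not part of the NexusLIMS API) can group the signals by DatasetType when you only care about one kind, e.g. the spectra:

```python
def signals_by_type(metadata_list):
    """Group parse_metadata() results by their nx_meta 'DatasetType'."""
    groups = {}
    for m in metadata_list:
        groups.setdefault(m["nx_meta"]["DatasetType"], []).append(m)
    return groups

# e.g., signals_by_type(multi_list).keys() for the file above
```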
2c. Gatan DM4 — multi-signal spectrum image (EELS/EDS SI)#
dm4_list, _ = parse_metadata(
    neoarm_dm4,
    write_output=False,
    generate_preview=False,
)
print(f"Signals in file: {len(dm4_list)}")
for i, m in enumerate(dm4_list):
    nx = m["nx_meta"]
    dims = nx.get("Data Dimensions", "?")
    print(f" [{i}] {nx['DatasetType']:15s} {nx['Data Type']:25s} dims={dims}")
Signals in file: 4
 [0] Image           STEM_Imaging              dims=(512, 512)
 [1] Image           STEM_Imaging              dims=(52, 106)
 [2] Image           STEM_Imaging              dims=(52, 106)
 [3] SpectrumImage   EDS_Spectrum_Imaging      dims=(52, 106, 2048)
2d. Zeiss Orion HIM TIFF#
orion_list, _ = parse_metadata(
    orion_tif,
    write_output=False,
    generate_preview=False,
)
nx = orion_list[0]["nx_meta"]
print(f"DatasetType : {nx['DatasetType']}")
print(f"Data Type : {nx['Data Type']}")
print(f"Creation Time : {nx['Creation Time']}")
DatasetType : Image
Data Type : HIM_Imaging
Creation Time : 2026-02-18T21:54:03.444135+00:00
2e. EDAX spectrum files (.spc / .msa)#
for path in [spc_file, msa_file]:
    result, _ = parse_metadata(path, write_output=False, generate_preview=False)
    nx = result[0]["nx_meta"]
    print(f"{path.name}")
    print(f" DatasetType : {nx['DatasetType']}")
    print(f" Data Type : {nx['Data Type']}")
    print(f" Creation Time: {nx['Creation Time']}")
    print()
leo_edax_test.spc
DatasetType : Spectrum
Data Type : EDS_Spectrum
Creation Time: 2026-02-18T21:54:03.966308+00:00
WARNING | Hyperspy | `signal_type='EDS'` not understood. See `hs.print_known_signal_types()` for a list of installed signal types or https://github.com/hyperspy/hyperspy-extensions-list for the list of all hyperspy extensions providing signals. (hyperspy.io:745)
leo_edax_test.msa
DatasetType : Spectrum
Data Type : EDS_Spectrum
Creation Time: 2026-02-18T21:54:03.851536+00:00
3. Low-level API — ExtractionContext + get_registry()#
For more control you can work directly with the extractor registry. This lets
you inspect which extractor was selected, or call extract() yourself without
going through parse_metadata().
from nexusLIMS.extractors.base import ExtractionContext
from nexusLIMS.extractors.registry import get_registry
# instrument=None is fine when there is no database to look up
context = ExtractionContext(file_path=dm3_file, instrument=None)
registry = get_registry()
extractor = registry.get_extractor(context)
print(f"Selected extractor : {extractor.name}")
print(f"Supported extensions: {extractor.supported_extensions}")
Selected extractor : dm3_extractor
Supported extensions: {'dm4', 'dm3'}
# Call extract() directly — returns list[dict], one dict per signal
raw_list = extractor.extract(context)
print(f"Signals returned: {len(raw_list)}")
nx = raw_list[0]["nx_meta"]
print(f" DatasetType : {nx['DatasetType']}")
print(f" Data Type : {nx['Data Type']}")
Signals returned: 1
DatasetType : Image
Data Type : STEM_Imaging
Listing all registered extractors#
print(f"{'Name':<35} {'Priority':>8} Extensions")
print("-" * 70)
for ext in registry.all_extractors:
    exts = (
        ", ".join(sorted(ext.supported_extensions))
        if ext.supported_extensions
        else "(any)"
    )
    print(f"{ext.name:<35} {ext.priority:>8} {exts}")
Name                                Priority Extensions
----------------------------------------------------------------------
orion_HIM_tif_extractor                  150 tif, tiff
tescan_tif_extractor                     150 tif, tiff
dm3_extractor                            100 dm3, dm4
msa_extractor                            100 msa
spc_extractor                            100 spc
ser_emi_extractor                        100 ser
quanta_tif_extractor                     100 tif, tiff
basic_file_info_extractor                  0 (any)
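The listing hints at how selection works: among the extractors whose extensions match the file, the highest priority wins, with basic_file_info_extractor (priority 0, any extension) as the catch-all. Below is a toy illustration of that rule, not the actual registry code — the real get_extractor() must also distinguish extractors at equal priority (note the two TIFF extractors both at 150), presumably by inspecting file contents:

```python
def pick_by_priority(extractors, suffix):
    """Toy selection rule: highest-priority extractor whose extension
    set contains *suffix* (an empty set matches any file)."""
    candidates = [
        e for e in extractors
        if not e["extensions"] or suffix in e["extensions"]
    ]
    return max(candidates, key=lambda e: e["priority"], default=None)

toy_registry = [
    {"name": "dm3_extractor", "priority": 100, "extensions": {"dm3", "dm4"}},
    {"name": "basic_file_info_extractor", "priority": 0, "extensions": set()},
]
print(pick_by_priority(toy_registry, "dm3")["name"])  # dm3_extractor
print(pick_by_priority(toy_registry, "xyz")["name"])  # basic_file_info_extractor
```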
4. Graceful degradation without NexusLIMS configuration#
| Feature | Without config | With config |
|---|---|---|
| Metadata extraction | ✅ Full | ✅ Full |
| Schema validation | ✅ Full | ✅ Full |
| Instrument ID from database | ⚠️ Returns None | ✅ Looks up DB |
| JSON sidecar write (write_output=True) | ⚠️ Skipped + warning | ✅ Written |
| Preview thumbnail (generate_preview=True) | ⚠️ Skipped + warning | ✅ Generated |
| Local instrument profiles | ⚠️ Skipped (built-ins active) | ✅ Loaded |
If you call parse_metadata() with the defaults (write_output=True, generate_preview=True) and no config is present, NexusLIMS logs warnings but
still returns the extracted metadata. The cell below demonstrates that:
import logging
import nexusLIMS.extractors as _ext
# Show warnings inline
logging.basicConfig(level=logging.WARNING)
# Temporarily pretend config is unavailable
_real = _ext._config_available
_ext._config_available = lambda: False
try:
    result, previews = parse_metadata(
        dm3_file,
        write_output=True,  # normally writes JSON — skipped with a warning
        generate_preview=True,  # normally makes thumbnail — skipped with a warning
    )
    print(f"Metadata returned despite missing config: {result is not None}")
    print(f"Previews list (all None when config missing): {previews}")
finally:
    _ext._config_available = _real  # restore
2026-02-18 15:13:07,281 nexusLIMS.extractors WARNING: NexusLIMS config unavailable; skipping metadata file write (pass write_output=False to suppress this warning)
2026-02-18 15:13:07,281 nexusLIMS.extractors WARNING: NexusLIMS config unavailable; skipping preview generation (pass generate_preview=False to suppress this warning)
Metadata returned despite missing config: True
Previews list (all None when config missing): [None]