Writing Extractor Plugins#
Added in version 2.1.0.
This guide explains how to create custom metadata extractors for NexusLIMS using the plugin-based system introduced in v2.1.0.
Overview#
NexusLIMS uses a plugin-based architecture for metadata extraction. Extractors are automatically discovered from the nexusLIMS/extractors/plugins/ directory and registered based on their file type support and priority.
Quick Start#
To create a new extractor plugin:
1. Create a `.py` file in `nexusLIMS/extractors/plugins/`
2. Define a class with the required interface (see below)

That's it! The registry will automatically discover and use your extractor.
Minimal Example#
Here’s a minimal extractor for a hypothetical .xyz file format:
"""XYZ file format extractor plugin."""
import logging
from typing import Any
from pathlib import Path
from nexusLIMS.extractors.base import ExtractionContext
logger = logging.getLogger(__name__)
class XYZExtractor:
"""Extractor for XYZ format files."""
# Required class attributes
name = "xyz_extractor" # Unique identifier
priority = 100 # Higher = preferred (0-1000)
supported_extensions = {"xyz"} # Extensions this extractor supports
def supports(self, context: ExtractionContext) -> bool:
"""
Check if this extractor supports the given file.
Can check file extension and/or file contents (content-sniffing).
Parameters
----------
context : ExtractionContext
Contains file_path and instrument information
Returns
-------
bool
True if this extractor can handle the file
"""
extension = context.file_path.suffix.lower().lstrip(".")
if extension != "xyz":
return False
# Optional: Check file contents for format signature
# with open(context.file_path, "rb") as f:
# header = f.read(8)
# return header.startswith(b"XYZDATA")
return True
def extract(self, context: ExtractionContext) -> list[dict[str, Any]]:
"""
Extract metadata from an XYZ file.
Parameters
----------
context : ExtractionContext
Contains file_path and instrument information
Returns
-------
list[dict]
List of metadata dictionaries, one per signal.
Each dict has an 'nx_meta' key with NexusLIMS metadata.
"""
logger.debug("Extracting metadata from XYZ file: %s", context.file_path)
# Your extraction logic here
metadata = {"nx_meta": {}}
# Add required fields
metadata["nx_meta"]["DatasetType"] = "Image" # or "Spectrum", "SpectrumImage", etc.
metadata["nx_meta"]["Data Type"] = "SEM_Imaging"
metadata["nx_meta"]["Creation Time"] = self._get_creation_time(context.file_path)
# Add format-specific metadata
# ...
# Always return a list (even for single-signal files)
return [metadata]
def _get_creation_time(self, file_path: Path) -> str:
"""Helper to get ISO-formatted creation time."""
from datetime import datetime as dt
from nexusLIMS.instruments import get_instr_from_filepath
mtime = file_path.stat().st_mtime
instr = get_instr_from_filepath(file_path)
return dt.fromtimestamp(
mtime,
tz=instr.timezone if instr else None,
).isoformat()
Required Interface#
Every extractor plugin must define a class with these attributes and methods:
Class Attributes#
name: str#
Unique identifier for this extractor. Use lowercase with underscores (e.g., "dm3_extractor").
priority: int#
Priority for extractor selection (0-1000). Higher values are preferred. Guidelines:
- `100`: Standard format-specific extractors
- `50`: Generic extractors with content sniffing
- `0`: Fallback extractors (like `BasicFileInfoExtractor`)
supported_extensions: set[str] | None#
File extensions this extractor supports (without leading dot). Required attribute.
Behavior:
- If a `set` of extensions (e.g., `{"dm3", "dm4"}`): the extractor is registered only for those extensions, and the registry will prioritize it when files with those extensions are encountered.
- If `None`: the extractor becomes a "wildcard" extractor, tried only after all extension-specific extractors fail.
Note: The registry uses this attribute to determine registration, so supports() should match the declared extensions. Mismatches can lead to unexpected behavior.
Examples:
```python
# Single extension
supported_extensions = {"xyz"}

# Multiple extensions
supported_extensions = {"dm3", "dm4"}

# Wildcard (fallback only)
supported_extensions = None
```
Methods#
supports(context: ExtractionContext) -> bool#
Determine if this extractor can handle a given file.
Parameters:
- `context`: contains `file_path` (`Path`) and `instrument` (`Instrument` or `None`)

Returns: `True` if this extractor supports the file.
Note
Content Sniffing: While extension checks are fast, you can also inspect file contents to verify the format. This is useful when multiple formats share the same extension or when you need to distinguish file variants. See Content-Based Detection for advanced examples.
Example:
```python
def supports(self, context: ExtractionContext) -> bool:
    # Simple extension check
    ext = context.file_path.suffix.lower().lstrip(".")
    return ext in {"dm3", "dm4"}
```
extract(context: ExtractionContext) -> list[dict[str, Any]]#
Extract metadata from the file.
Parameters:
- `context`: contains `file_path` (`Path`) and `instrument` (`Instrument` or `None`)

Returns: a list of metadata dictionaries. Each dict must contain an `"nx_meta"` key with NexusLIMS metadata.
Format:
- Single-signal files: return `[{...}]`, a list with one element
- Multi-signal files: return `[{...}, {...}]`, a list with multiple elements
Required nx_meta fields (in each dict):
"DatasetType": One of “Image”, “Spectrum”, “SpectrumImage”, “Diffraction”, “Misc”"Data Type": Descriptive string (e.g., “STEM_Imaging”, “TEM_EDS”)"Creation Time": ISO-8601 formatted timestamp with timezone (e.g.,"2024-01-15T10:30:00-05:00"or"2024-01-15T10:30:00Z")
Example:
```python
def extract(self, context: ExtractionContext) -> list[dict[str, Any]]:
    metadata = {"nx_meta": {}}
    metadata["nx_meta"]["DatasetType"] = "Image"
    metadata["nx_meta"]["Data Type"] = "SEM_Imaging"
    metadata["nx_meta"]["Creation Time"] = "2024-01-15T10:30:00-05:00"
    # ... extraction logic

    # Always return a list
    return [metadata]
```
Metadata Validation#
The nx_meta section is automatically validated using type-specific Pydantic schemas based on the DatasetType field. NexusLIMS provides schema classes for each dataset type:
- `nexusLIMS.schemas.metadata.ImageMetadata` - for Image datasets. Validates imaging-specific fields (acceleration_voltage, working_distance, beam_current, etc.). Used when `DatasetType == "Image"`.
- `nexusLIMS.schemas.metadata.SpectrumMetadata` - for Spectrum datasets. Validates spectrum-specific fields (acquisition_time, live_time, channel_size, etc.). Used when `DatasetType == "Spectrum"`.
- `nexusLIMS.schemas.metadata.SpectrumImageMetadata` - for SpectrumImage datasets. Combines both Image and Spectrum field validation. Used when `DatasetType == "SpectrumImage"`.
- `nexusLIMS.schemas.metadata.DiffractionMetadata` - for Diffraction datasets. Validates diffraction-specific fields (camera_length, convergence_angle, etc.). Used when `DatasetType == "Diffraction"`.
- `nexusLIMS.schemas.metadata.NexusMetadata` - base schema. Used for `DatasetType == "Misc"` or `"Unknown"`. Validates only the common required fields.
Validation Process:
1. The extractor sets the `DatasetType` field in `nx_meta`
2. `validate_nx_meta()` selects the appropriate schema based on `DatasetType`
3. The schema validates both required fields and type-specific optional fields
4. Validation errors include detailed field-level diagnostics
5. Validation is strict: invalid metadata causes extraction to fail
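The selection step above can be sketched as a plain lookup from `DatasetType` to schema class. This is an illustrative sketch only; the actual `validate_nx_meta()` implementation inside NexusLIMS may differ, but the mapping follows the class list above:

```python
# Illustrative sketch: schema selection by DatasetType, mirroring the
# class list above. Not the actual NexusLIMS implementation.
SCHEMA_BY_TYPE = {
    "Image": "ImageMetadata",
    "Spectrum": "SpectrumMetadata",
    "SpectrumImage": "SpectrumImageMetadata",
    "Diffraction": "DiffractionMetadata",
    "Misc": "NexusMetadata",
    "Unknown": "NexusMetadata",
}


def select_schema(nx_meta: dict) -> str:
    """Return the schema name for an nx_meta dict, defaulting to the base schema."""
    return SCHEMA_BY_TYPE.get(nx_meta.get("DatasetType", "Unknown"), "NexusMetadata")
```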
All schemas support the extensions section for instrument-specific metadata that doesn’t fit the core schema (see “Core Fields vs. Extensions” below).
Schema Selection Logic
When writing an extractor, choose the appropriate DatasetType based on the data content:
| DatasetType | When to Use | Example Data | Schema Used |
|---|---|---|---|
| `"Image"` | 2D imaging data (SEM, TEM, STEM images) | Micrographs, detector images | `ImageMetadata` |
| `"Spectrum"` | 1D spectral data (EDS, EELS point spectra) | Single-point spectrum | `SpectrumMetadata` |
| `"SpectrumImage"` | 2D image with spectrum at each pixel | EDS maps, EELS spectrum images | `SpectrumImageMetadata` |
| `"Diffraction"` | Diffraction patterns (SAED, CBED, 4D-STEM) | Diffraction patterns, 4D datasets | `DiffractionMetadata` |
| `"Misc"` | Other data types | Proprietary formats, unusual data | `NexusMetadata` |
| `"Unknown"` | Unable to determine type | Unreadable files, fallback | `NexusMetadata` |
Common extraction patterns:
```python
from nexusLIMS.schemas.units import ureg

# SEM image
nx_meta = {
    "DatasetType": "Image",  # → validates with ImageMetadata
    "Data Type": "SEM_Imaging",
    "acceleration_voltage": ureg.Quantity(15, "kilovolt"),
    "working_distance": ureg.Quantity(5.2, "millimeter"),
    # ... imaging-specific fields
}

# EDS spectrum
nx_meta = {
    "DatasetType": "Spectrum",  # → validates with SpectrumMetadata
    "Data Type": "SEM_EDS",
    "acquisition_time": ureg.Quantity(60, "second"),
    "live_time": ureg.Quantity(58.5, "second"),
    "channel_size": ureg.Quantity(10, "electronvolt"),
    # ... spectrum-specific fields
}

# EDS map (spectrum at each pixel)
nx_meta = {
    "DatasetType": "SpectrumImage",  # → validates with SpectrumImageMetadata
    "Data Type": "STEM_EDS",
    # Validates BOTH imaging and spectrum fields
    "acceleration_voltage": ureg.Quantity(200, "kilovolt"),
    "acquisition_time": ureg.Quantity(300, "second"),
    # ... combined fields
}
```
Tip: If unsure about DatasetType, examine the data shape:
- Single 2D array → likely `"Image"`
- Single 1D array with an energy axis → likely `"Spectrum"`
- 3D array (x, y, spectrum) → likely `"SpectrumImage"`
- 2D array with reciprocal-space metadata → likely `"Diffraction"`
For more details on validation and schema structure, see NexusLIMS Internal Schema.
Using Pint Quantities for Physical Values#
Since v2.2.0, NexusLIMS uses Pint Quantities for all fields with physical units. This provides:
Type safety and automatic unit validation
Programmatic unit conversion
Machine-readable unit information
Standardized field names using EM Glossary terminology
Example with Pint Quantities:
```python
# Import the NexusLIMS unit registry
from nexusLIMS.schemas.units import ureg

# Create Pint Quantity objects for fields with units
nx_meta = {
    "DatasetType": "Image",
    "Data Type": "SEM_Imaging",
    "Creation Time": "2024-01-15T10:30:00-05:00",
    # Physical quantities with units (using Pint)
    "acceleration_voltage": ureg.Quantity(15, "kilovolt"),  # or ureg("15 kV")
    "working_distance": ureg.Quantity(5.2, "millimeter"),   # or ureg("5.2 mm")
    "beam_current": ureg.Quantity(100, "picoampere"),       # or ureg("100 pA")
    "dwell_time": ureg.Quantity(10, "microsecond"),         # or ureg("10 us")
}
```
Tip: For a complete reference of standardized field names and their preferred units, see EM Glossary Quick Reference. When defining Pydantic schemas (not extractors), you can use emg_field() to automatically populate field metadata from EM Glossary.
Converting from raw values:
```python
from nexusLIMS.schemas.units import ureg

# If you have voltage in volts, create a Quantity directly
voltage_v = 15000  # Volts from file
nx_meta["acceleration_voltage"] = ureg.Quantity(voltage_v, "volt")
# Pint will automatically handle unit conversion when needed

# Alternative: parse from string
nx_meta["working_distance"] = ureg("5.2 mm")
```
Using FieldDefinition for bulk metadata extraction:
The FieldDefinition pattern provides a declarative way to extract many metadata fields with minimal code repetition. This works for any file format with key-value metadata (TIFF tags, HDF5 attributes, JSON/XML metadata, etc.):
```python
from nexusLIMS.extractors.base import FieldDefinition

# Define fields with their target units
FIELD_DEFINITIONS = [
    FieldDefinition(
        source_key="HV",                    # Key in source metadata
        target_key="acceleration_voltage",  # EM Glossary field name
        conversion_factor=1e-3,             # Convert from V to kV
        target_unit="kilovolt",             # Target unit as Pint unit string
    ),
    FieldDefinition(
        source_key="WD",
        target_key="working_distance",
        target_unit="millimeter",           # No conversion if already in target units
    ),
]

# In your extract() method, iterate over the field definitions:
from decimal import Decimal

for field in FIELD_DEFINITIONS:
    if field.source_key in source_metadata:
        raw_value = source_metadata[field.source_key]
        # Apply the conversion factor if specified
        if field.conversion_factor:
            value = Decimal(str(raw_value)) * Decimal(str(field.conversion_factor))
        else:
            value = Decimal(str(raw_value))
        # IMPORTANT: Quantities MUST use Decimal, not float.
        # This ensures precision for unit conversions.
        if field.target_unit:
            nx_meta[field.target_key] = ureg.Quantity(value, field.target_unit)
        else:
            nx_meta[field.target_key] = value
```
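The loop above can be exercised end-to-end with a stand-in for `FieldDefinition` (a plain dataclass here, since the real class lives in `nexusLIMS.extractors.base`) and `(value, unit)` tuples in place of Pint Quantities:

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class StubFieldDefinition:
    """Stand-in for nexusLIMS.extractors.base.FieldDefinition (illustrative)."""

    source_key: str
    target_key: str
    target_unit: Optional[str] = None
    conversion_factor: Optional[float] = None


FIELDS = [
    StubFieldDefinition("HV", "acceleration_voltage", "kilovolt", 1e-3),
    StubFieldDefinition("WD", "working_distance", "millimeter"),
]


def map_fields(source_metadata: dict) -> dict:
    """Apply the FieldDefinition pattern, storing (value, unit) tuples."""
    nx_meta = {}
    for field in FIELDS:
        if field.source_key not in source_metadata:
            continue
        value = Decimal(str(source_metadata[field.source_key]))
        if field.conversion_factor:
            value *= Decimal(str(field.conversion_factor))
        nx_meta[field.target_key] = (
            (value, field.target_unit) if field.target_unit else value
        )
    return nx_meta


result = map_fields({"HV": "15000.0", "WD": "5.2"})
# result["acceleration_voltage"] → (Decimal('15.0000'), 'kilovolt')
```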
Real-world examples:
- Quanta TIF extractor - uses `FieldDefinition` for extracting TIFF metadata tags
- Tescan TIF extractor - uses `FieldDefinition` for SEM metadata extraction
- Orion TIF extractor - uses `FieldDefinition` for HIM metadata extraction
Benefits of Pint Quantities:
Type safety: Invalid units are caught immediately
Automatic conversion: Units are normalized during XML generation
Machine-readable: Units are separate from values in XML output
EM Glossary alignment: Field names match community standards
Important: If a field has units, always create a Pint Quantity. If a field is dimensionless (like magnification, brightness, or gain), use a plain numeric value (int/float).
XML Output:
```xml
<!-- Pint Quantities serialize to clean XML with unit attributes -->
<meta name="Voltage" unit="kV">15</meta>
<meta name="Working Distance" unit="mm">5.2</meta>
<meta name="Beam Current" unit="pA">100</meta>
```
EM Glossary Field Names#
NexusLIMS uses standardized field names from the Electron Microscopy Glossary (EM Glossary) for core metadata fields. This improves interoperability and aligns with community standards.
For the complete field mapping (50+ fields), see the EM Glossary Reference.
When to use EM Glossary names:
- Use EM Glossary names for core metadata fields that have standardized meanings.
- For vendor- or instrument-specific fields, use EM Glossary names where possible; for fields without an EM Glossary equivalent, use descriptive names.
Core Fields vs. Extensions#
NexusLIMS metadata follows a two-tier structure:
Core fields - Standardized EM Glossary fields validated by Pydantic schemas
Extensions - Vendor/instrument-specific fields not in the core schema
Decision guide:
| Use Core Field When… | Use Extension When… |
|---|---|
| Field has an EM Glossary equivalent | Field is vendor-specific (e.g., "fei_detector_mode") |
| Field is validated by a schema | Field doesn't fit any schema |
| Field is common across instruments | Field is instrument-specific metadata |
| Field should be searchable/standardized | Field is informational only |
Example with core fields and extensions:
```python
from nexusLIMS.schemas.units import ureg
from nexusLIMS.schemas.utils import add_to_extensions

nx_meta = {
    # Core fields (EM Glossary names) - validated by schema
    "acceleration_voltage": ureg.Quantity(15, "kilovolt"),
    "working_distance": ureg.Quantity(5.2, "millimeter"),
    "beam_current": ureg.Quantity(100, "picoampere"),
}

# Add vendor-specific fields to extensions using the helper
add_to_extensions(nx_meta, "detector_brightness", 50.0)
add_to_extensions(nx_meta, "facility", "Nexus Facility")
add_to_extensions(nx_meta, "quanta_spot_size", 3.5)

# Result:
# nx_meta["extensions"] = {
#     "detector_brightness": 50.0,
#     "facility": "Nexus Facility",
#     "quanta_spot_size": 3.5,
# }
```
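For intuition, the helper's effect can be mimicked in a few lines of plain Python. This is a sketch of the behavior shown in the comments above, not the actual `add_to_extensions` implementation:

```python
def add_to_extensions_sketch(nx_meta: dict, key: str, value) -> None:
    """Place a vendor-specific field under 'extensions', creating the sub-dict on first use."""
    nx_meta.setdefault("extensions", {})[key] = value


meta = {}
add_to_extensions_sketch(meta, "detector_brightness", 50.0)
add_to_extensions_sketch(meta, "facility", "Nexus Facility")
# meta == {"extensions": {"detector_brightness": 50.0, "facility": "Nexus Facility"}}
```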
For detailed documentation, see Helper Functions.
Naming conventions for extensions:
- Use `snake_case` for consistency with core fields
- Add a vendor prefix for vendor-specific fields (e.g., `fei_`, `zeiss_`, `quanta_`)
- Use descriptive names (avoid abbreviations)
- Document units in comments if applicable
Advanced Patterns#
Content-Based Detection#
For formats where extension alone isn’t sufficient:
```python
class MyFormatExtractor:
    name = "my_format_extractor"
    priority = 100
    supported_extensions = {"dat"}  # Register for .dat files

    def supports(self, context: ExtractionContext) -> bool:
        """Check file extension and validate file signature."""
        ext = context.file_path.suffix.lower().lstrip(".")
        if ext != "dat":
            return False
        # Check file signature (magic bytes)
        try:
            with context.file_path.open("rb") as f:
                header = f.read(4)
            return header == b"MYFT"  # Your format's signature
        except Exception:
            return False
```
Important: Keep supported_extensions synchronized with supports(). If your extractor is registered for .dat but supports() returns False for all .dat files, the registry will try other extractors.
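The magic-bytes check can be tested without a real `.dat` file by writing small temporary files (pure stdlib; `b"MYFT"` is the hypothetical signature from the example above):

```python
import tempfile
from pathlib import Path


def has_signature(path: Path, magic: bytes = b"MYFT") -> bool:
    """Return True if the file starts with the expected magic bytes."""
    try:
        with path.open("rb") as f:
            return f.read(len(magic)) == magic
    except OSError:
        return False


with tempfile.TemporaryDirectory() as d:
    good = Path(d) / "good.dat"
    good.write_bytes(b"MYFT" + b"\x00" * 16)
    bad = Path(d) / "bad.dat"
    bad.write_bytes(b"OTHR" + b"\x00" * 16)
    results = (has_signature(good), has_signature(bad))
# results == (True, False)
```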
Instrument-Specific Extractors#
Use the instrument information for instrument-specific handling:
```python
class QuantaTifExtractor:
    name = "quanta_tif_extractor"
    priority = 150  # Higher priority for a specific instrument
    supported_extensions = {"tif"}  # Only .tif files

    def supports(self, context: ExtractionContext) -> bool:
        """Only support files from specific instruments."""
        ext = context.file_path.suffix.lower().lstrip(".")
        if ext != "tif":
            return False
        # Check the instrument
        if context.instrument is None:
            return False
        # Only handle files from Quanta SEMs
        return "Quanta" in context.instrument.name
```
Priority Guidelines#
Set appropriate priorities for your extractor:
```python
class SpecificFormatExtractor:
    # High priority - handles a specific format well
    priority = 150


class GenericFormatExtractor:
    # Medium priority - handles many formats adequately
    priority = 75


class FallbackExtractor:
    # Low/zero priority - only used when nothing else works
    priority = 0
```
Testing Your Extractor#
Test Fixtures for Real File Formats#
For testing with actual file formats (especially binary formats like microscopy data), NexusLIMS uses compressed tar archives to store sanitized test files efficiently.
Setting up test fixtures:
1. Create sanitized test data - start with a real microscopy file from your target format, then remove/zero sensitive image data while preserving the metadata structure. See Creating Sanitized Test Files in the Developer Guide for detailed instructions.
2. Add to the test archive registry in `tests/unit/utils.py`:

   ```python
   tars = {
       # ... existing entries ...
       "XYZ_FORMAT": "xyz_format_test_dataZeroed.tar.gz",
   }
   ```

3. Place the archive in `tests/unit/files/`
The fixture system extracts files from *.tar.gz archives to a temporary test directory, makes them available during test execution, and automatically cleans up after tests complete. This keeps the repository small (compressed archives) while enabling tests with realistic file structures and metadata.
Example Test File#
Create a test file in tests/unit/test_extractors/:
"""Tests for XYZ extractor."""
import pytest
from pathlib import Path
from nexusLIMS.extractors.plugins.xyz import XYZExtractor
from nexusLIMS.extractors.base import ExtractionContext
from tests.unit.utils import extract_files, delete_files
@pytest.fixture(scope="module")
def xyz_test_file():
"""Extract test XYZ file from archive."""
files = extract_files("XYZ_FORMAT")
yield files[0] # Yield the extracted test file
delete_files("XYZ_FORMAT") # Clean up after tests
class TestXYZExtractor:
"""Test cases for XYZ format extractor."""
def test_supports_xyz_files(self):
"""Test that extractor supports .xyz files."""
extractor = XYZExtractor()
context = ExtractionContext(Path("test.xyz"), instrument=None)
assert extractor.supports(context) is True
def test_rejects_other_files(self):
"""Test that extractor rejects non-.xyz files."""
extractor = XYZExtractor()
context = ExtractionContext(Path("test.dm3"), instrument=None)
assert extractor.supports(context) is False
def test_extraction_real_file(self, xyz_test_file):
"""
Test metadata extraction from real XYZ file.
Uses the "xyz_test_file" fixture to automatically extract,
read, and then delete the real test file from the compressed
archive.
"""
from nexusLIMS.extractors import get_full_meta
metadata = get_full_meta(xyz_test_file, instrument=None)
# Verify extracted metadata
assert metadata["nx_meta"]["DatasetType"] == "Image"
assert metadata["nx_meta"]["acceleration_voltage"].magnitude == 15.0
# ... additional assertions
Best Practices#
v2.2.0 Schema Patterns#
Use Pint Quantities consistently:
```python
from decimal import Decimal

from nexusLIMS.schemas.units import ureg

# ✅ GOOD - Create a Quantity for fields with units
nx_meta["acceleration_voltage"] = ureg.Quantity(15000, "volt")  # Auto-converts to kV

# For calculations/conversions, use Decimal explicitly:
nx_meta["acceleration_voltage"] = ureg.Quantity(Decimal("15000"), "volt")

# ❌ BAD - Don't use raw numbers for fields with units
nx_meta["acceleration_voltage"] = 15  # Missing unit information

# ✅ GOOD - Dimensionless values as plain numbers
nx_meta["magnification"] = 50000

# ❌ BAD - Don't create Quantities for dimensionless values
nx_meta["magnification"] = ureg.Quantity(50000, "dimensionless")
```
Note on Decimal values: The unit registry auto-converts numeric values to Decimal to avoid floating-point precision issues. While ureg.Quantity(15, "kilovolt") works, using Decimal explicitly is recommended when performing calculations or conversions (e.g., Decimal(metadata["HV"]) * Decimal("1e-3")) for clarity and to ensure precision.
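The floating-point pitfall that note warns about is the standard one, visible with the stdlib alone:

```python
from decimal import Decimal

# Binary floats accumulate representation error...
float_drift = (0.1 + 0.2 == 0.3)  # False: 0.1 + 0.2 is 0.30000000000000004

# ...while Decimal arithmetic on string inputs stays exact
decimal_exact = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")  # True
```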
Use add_to_extensions() for vendor fields:
```python
from nexusLIMS.schemas.utils import add_to_extensions

# ✅ GOOD - Use the helper function
add_to_extensions(nx_meta, "fei_detector_mode", "CBS")
add_to_extensions(nx_meta, "facility", "Nexus Lab")

# ❌ BAD - Manual dict manipulation (doesn't keep the implementation isolated)
if "extensions" not in nx_meta:
    nx_meta["extensions"] = {}
nx_meta["extensions"]["fei_detector_mode"] = "CBS"
```
Set DatasetType early:
```python
# ✅ GOOD - Set DatasetType first, then add type-specific fields
nx_meta = {
    "DatasetType": "Image",  # Schema selection happens based on this
    "Data Type": "SEM_Imaging",
    "Creation Time": creation_time_iso,
}
# Now add imaging-specific fields
nx_meta["acceleration_voltage"] = ureg.Quantity(15, "kilovolt")

# ❌ BAD - Adding fields before DatasetType
nx_meta = {}
nx_meta["acceleration_voltage"] = ureg.Quantity(15, "kilovolt")  # What schema validates this?
nx_meta["DatasetType"] = "Image"  # Too late
```
Handle timezone in Creation Time:
```python
from datetime import datetime as dt

from nexusLIMS.instruments import get_instr_from_filepath

mtime = context.file_path.stat().st_mtime

# ✅ GOOD - Include timezone
instr = get_instr_from_filepath(context.file_path)
creation_time = dt.fromtimestamp(mtime, tz=instr.timezone if instr else None)
nx_meta["Creation Time"] = creation_time.isoformat()  # "2024-01-15T10:30:00-05:00"

# ❌ BAD - Naive datetime (no timezone)
creation_time = dt.fromtimestamp(mtime)
nx_meta["Creation Time"] = creation_time.isoformat()  # "2024-01-15T10:30:00" - FAILS validation
```
Error Handling#
Always handle errors gracefully:
```python
def extract(self, context: ExtractionContext) -> list[dict[str, Any]]:
    """Extract metadata with defensive error handling."""
    try:
        # Primary extraction logic (returns a list)
        return self._extract_full_metadata(context)
    except Exception as e:
        logger.warning(
            "Error extracting full metadata from %s: %s",
            context.file_path,
            e,
            exc_info=True,
        )
        # Return basic metadata as a fallback (also as a list)
        return self._extract_basic_metadata(context)
```
Logging#
Use appropriate log levels:
```python
logger.debug("Extracting metadata from %s", context.file_path)              # Routine operations
logger.info("Discovered unusual format variant in %s", context.file_path)   # Notable events
logger.warning("Missing expected metadata field in %s", context.file_path)  # Recoverable issues
logger.error("Failed to parse %s", context.file_path, exc_info=True)        # Serious errors
```
Performance#
For expensive operations, consider lazy evaluation:
```python
def extract(self, context: ExtractionContext) -> list[dict[str, Any]]:
    """Extract metadata with lazy loading."""
    # Only load what's needed
    metadata = self._extract_header_metadata(context)

    # If using HyperSpy, load the signal lazily:
    #   metadata = hs.load(context.file_path, lazy=True).original_metadata

    # Don't load full data unless necessary
    if self._needs_full_data(metadata):
        metadata.update(self._extract_full_data(context))

    return [metadata]  # Always return a list
```
Migration from Legacy Extractors#
If you have an existing extraction function (pre-v2.1.0), create a simple wrapper:
Before (legacy):
```python
# In nexusLIMS/extractors/my_format.py
def get_my_format_metadata(filename: Path) -> dict:
    # ... extraction logic
    return metadata
```
After (plugin):
```python
# In nexusLIMS/extractors/plugins/my_format.py
from nexusLIMS.extractors.base import ExtractionContext
from nexusLIMS.extractors.my_format import get_my_format_metadata


class MyFormatExtractor:
    name = "my_format_extractor"
    priority = 100
    supported_extensions = {"myformat"}  # Declare supported extensions

    def supports(self, context: ExtractionContext) -> bool:
        ext = context.file_path.suffix.lower().lstrip(".")
        return ext == "myformat"

    def extract(self, context: ExtractionContext) -> list[dict]:
        # The legacy function returns a dict, so wrap it in a list
        metadata = get_my_format_metadata(context.file_path)
        return [metadata]  # Always return a list
```
Registry Behavior#
The registry automatically:
1. Discovers plugins on first use by walking `nexusLIMS/extractors/plugins/`
2. Sorts extractors by priority within each file extension
3. Calls `supports()` on each extractor in priority order
4. Returns the first match where `supports()` returns `True`
5. Falls back to `BasicFileInfoExtractor` if nothing matches
You don’t need to manually register your plugin - just create the file and it will be discovered automatically.
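The selection behavior can be sketched with stub extractors (illustrative; the real registry lives in `nexusLIMS.extractors` and also handles extension grouping and the fallback):

```python
class StubExtractor:
    """Minimal stand-in with just the attributes the registry consults."""

    def __init__(self, name: str, priority: int, matches: bool):
        self.name = name
        self.priority = priority
        self._matches = matches

    def supports(self, context) -> bool:
        return self._matches


def select_extractor(extractors, context):
    """Try extractors in descending priority; return the first that supports the file."""
    for ext in sorted(extractors, key=lambda e: e.priority, reverse=True):
        if ext.supports(context):
            return ext
    return None  # The real registry falls back to BasicFileInfoExtractor here


chosen = select_extractor(
    [StubExtractor("generic", 50, True), StubExtractor("specific", 150, True)],
    context=None,
)
# chosen.name == "specific" (higher priority wins)
```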
Examples#
See the built-in extractors for real-world examples:
- `nexusLIMS.extractors.plugins.digital_micrograph` - simple extension-based matching
- `nexusLIMS.extractors.plugins.quanta_tif` - TIFF format for specific instruments
- `nexusLIMS.extractors.plugins.tescan_tif` - TIFF format with metadata content sniffing and sidecar header file parsing
- `nexusLIMS.extractors.plugins.basic_metadata` - fallback extractor with priority 0
- `nexusLIMS.extractors.plugins.edax` - multiple extractors in one file
Troubleshooting#
My extractor isn’t being discovered#
Check that:
- The file is in `nexusLIMS/extractors/plugins/` (or a subdirectory)
- The class has all required attributes (`name`, `priority`, `supported_extensions`) and methods (`supports`, `extract`)
- The class name doesn't start with an underscore
- There are no import errors (check the logs)
- `supported_extensions` is properly defined as a `set` or `None`
My extractor isn’t being selected#
Check that:
- `supported_extensions` includes the file's extension (without the dot)
- `supports()` returns `True` for your test file
- The priority is high enough (higher-priority extractors are tried first)
- No higher-priority extractor is matching first
- `supported_extensions` and `supports()` are synchronized (if registered for `.xyz`, `supports()` should return `True` for `.xyz` files)
Enable debug logging to see selection process:
```python
import logging

logging.getLogger("nexusLIMS.extractors.registry").setLevel(logging.DEBUG)
```
Tests are failing#
Ensure your extractor:
- Returns a list of metadata dictionaries, each with an `"nx_meta"` key
- Includes the required fields in `nx_meta`
- Handles missing/corrupted files gracefully
- Uses the appropriate timezone for timestamps
Metadata Reference Files#
When developing extractor plugins, it’s helpful to see examples of real metadata structures from different file formats. NexusLIMS maintains a collection of reference metadata files extracted from test data.
Location#
Reference metadata is stored in tests/unit/files/metadata_references/ and includes:
- 127 JSON files containing `original_metadata` from HyperSpy-readable test files
- 2 XML files with raw TIFF metadata from Orion HIM instruments
- A README.md documenting the reference files
- An extract_original_metadata.py script to regenerate the references
Contents#
Each JSON file contains the complete original_metadata dictionary extracted from a test file using HyperSpy’s signal.original_metadata.as_dictionary(). The files are named after their source files:
- Single-signal files: `(unknown)_original_metadata.json`
- Multi-signal files: `(unknown)_signal_{index}_original_metadata.json`
Format examples covered:
- Digital Micrograph (`.dm3`, `.dm4`) - various TEM/STEM imaging and spectroscopy modes
- FEI TIA (`.emi`/`.ser`) - TEM/STEM images, spectra, and spectrum images
- FEI/Thermo Quanta (`.tif`) - SEM imaging with embedded metadata
- Zeiss Orion (`.tif`) - HIM imaging with Fibics or Zeiss metadata
- Tescan (`.tif`) - SEM/FIB imaging
- HyperSpy (`.hspy`) - 4D-STEM data
Usage for Extractor Development#
1. Understanding metadata structure
Before writing an extractor, examine the reference metadata to understand:
- What fields are available in the source format
- How the data is structured (nested dictionaries, lists, etc.)
- Which fields map to NexusLIMS required fields
- What vendor-specific fields exist
Example workflow:
```shell
# View metadata structure for a Quanta TIFF file
less tests/unit/files/metadata_references/quanta_image_original_metadata.json

# Search for specific fields
grep -r "acceleration" tests/unit/files/metadata_references/
```
2. Mapping source fields to nx_meta
Use the reference files to identify which source metadata keys should map to NexusLIMS fields:
```python
from decimal import Decimal

from nexusLIMS.schemas.units import ureg

# Example: Mapping Quanta TIFF metadata to nx_meta
# Based on quanta_image_original_metadata.json:
# {
#     "User": {"HV": "15000.0", "WD": "5.2", ...},
#     "EScan": {"DwellTime": "1e-05", ...}
# }

# The NexusLIMS ureg requires the use of Decimal to ensure the source
# precision is retained when converting units
nx_meta["acceleration_voltage"] = ureg.Quantity(
    Decimal(metadata["User"]["HV"]) * Decimal("1e-3"),  # Convert V to kV
    "kilovolt",
)
nx_meta["working_distance"] = ureg.Quantity(
    Decimal(metadata["User"]["WD"]),
    "millimeter",
)
nx_meta["dwell_time"] = ureg.Quantity(
    Decimal(metadata["EScan"]["DwellTime"]) * Decimal("1e6"),  # Convert s to µs
    "microsecond",
)
```
3. Testing against real metadata
When writing tests, compare your extractor’s output against the reference metadata to ensure you’re handling all available fields correctly.
Regenerating Reference Files#
If test files are added or modified, regenerate the reference metadata:
```shell
cd tests/unit/files/metadata_references
uv run python extract_original_metadata.py
```
The script:
1. Extracts all files from the test archives (`tests/unit/files/*.tar.gz`)
2. Loads each file with HyperSpy
3. Exports `original_metadata` as formatted JSON
4. Handles both single- and multi-signal files automatically
This is particularly useful after:
- Adding new test files for a format
- Updating data-zeroing procedures
- Verifying metadata preservation after file modifications
Benefits#
Using metadata reference files helps you:
- Write better extractors - see real-world metadata structure before coding
- Improve field coverage - identify fields you might have missed
- Debug extraction issues - compare expected vs. actual metadata
- Document metadata evolution - track changes in file format metadata over time
- Onboard new developers - provide concrete examples of metadata structure