nexusLIMS.extractors#

Extract metadata from various electron microscopy file types.

Extractors should return a list of dictionaries, where each dictionary contains the extracted metadata under the key nx_meta. The nx_meta structure is validated against the NexusMetadata Pydantic schema to ensure consistency across all extractors.

Required Fields#

All extractors must include these fields in nx_meta:

  • 'Creation Time' - ISO-8601 timestamp string with timezone (e.g., "2024-01-15T10:30:00-05:00" or "2024-01-15T15:30:00Z")

  • 'Data Type' - Human-readable description using underscores (e.g., "STEM_Imaging", "TEM_EDS", "SEM_Imaging")

  • 'DatasetType' - Schema-defined category, must be one of: "Image", "Spectrum", "SpectrumImage", "Diffraction", "Misc", or "Unknown"
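
As a minimal sketch of the required fields listed above, the snippet below builds a return value in the documented shape and checks it using only the standard library. The helper `missing_required_fields` is hypothetical and is not part of nexusLIMS; the real checks are performed by the Pydantic schemas.

```python
from datetime import datetime

# Field names taken from the "Required Fields" list
REQUIRED_FIELDS = ("Creation Time", "Data Type", "DatasetType")


def missing_required_fields(nx_meta: dict) -> list[str]:
    """Return the required keys absent from nx_meta (hypothetical helper)."""
    return [key for key in REQUIRED_FIELDS if key not in nx_meta]


# An extractor returns a list of dicts, each holding metadata under "nx_meta"
extracted = [
    {
        "nx_meta": {
            "Creation Time": "2024-01-15T10:30:00-05:00",
            "Data Type": "STEM_Imaging",
            "DatasetType": "Image",
        }
    }
]

nx_meta = extracted[0]["nx_meta"]
missing = missing_required_fields(nx_meta)

# A timestamp with a UTC offset parses to a datetime whose tzinfo is set
created = datetime.fromisoformat(nx_meta["Creation Time"])
has_timezone = created.tzinfo is not None
```

Here `missing` is an empty list and `has_timezone` is true, because the timestamp carries an explicit UTC offset.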

Optional Fields#

Common optional fields include:

  • 'Data Dimensions' - Dataset shape as string (e.g., "(1024, 1024)")

  • 'Instrument ID' - Instrument PID from database (e.g., "FEI-Titan-TEM-635816")

  • 'warnings' - List of warning messages or [message, context] pairs

Additional instrument-specific fields are allowed beyond these standard fields.

Schema Validation#

The nx_meta structure is validated using Pydantic strict mode. Validation occurs after default values are set (e.g., missing DatasetType defaults to "Misc"). If validation fails, a pydantic.ValidationError is raised with detailed information about which fields are invalid.
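
The ordering described above (defaults first, validation second) can be sketched as follows. This is an illustration only: `apply_defaults` is a hypothetical name, and the sketch assumes the default is applied only when the key is absent.

```python
def apply_defaults(nx_meta: dict) -> dict:
    """Return a copy of nx_meta with a missing DatasetType set to "Misc".

    Hypothetical helper; the actual defaulting logic lives inside nexusLIMS.
    """
    with_defaults = dict(nx_meta)  # copy so the caller's dict is untouched
    with_defaults.setdefault("DatasetType", "Misc")
    return with_defaults


partial = {
    "Creation Time": "2024-01-15T10:30:00-05:00",
    "Data Type": "Unknown_File",
}
ready_for_validation = apply_defaults(partial)
```

After this step `ready_for_validation["DatasetType"]` is `"Misc"`, and an existing value would have been left alone.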

For complete schema details, see NexusMetadata.

Subpackages#

Submodules#

Package Contents#

Functions#

get_schema_for_dataset_type

Select the appropriate schema class based on DatasetType.

validate_nx_meta

Validate the nx_meta structure against type-specific metadata schemas.

parse_metadata

Parse metadata from a file and optionally generate a preview image.

create_preview

Generate a preview image for a given file using the plugin system.

flatten_dict

Flatten a nested dictionary into a single level.

Data#

PLACEHOLDER_PREVIEW

Path to placeholder preview image used when preview generation fails.

unextracted_preview_map

File types that have only basic metadata extracted but still get a custom preview image generated.

API#

nexusLIMS.extractors.PLACEHOLDER_PREVIEW#

Path to placeholder preview image used when preview generation fails.

nexusLIMS.extractors.unextracted_preview_map#

File types that have only basic metadata extracted but still get a custom preview image generated.

nexusLIMS.extractors.get_schema_for_dataset_type(dataset_type: str) → type[NexusMetadata]#

Select the appropriate schema class based on DatasetType.

This function maps dataset types to their corresponding type-specific metadata schemas. Type-specific schemas (ImageMetadata, SpectrumMetadata, etc.) provide stricter validation of fields appropriate for each data type.

Parameters:

dataset_type (str) – The value of the ‘DatasetType’ field. Must be one of: ‘Image’, ‘Spectrum’, ‘SpectrumImage’, ‘Diffraction’, ‘Misc’, or ‘Unknown’.

Returns:

The schema class to use for validation. Returns a type-specific schema (ImageMetadata, SpectrumMetadata, etc.) for known dataset types, or the base NexusMetadata schema for ‘Misc’ and ‘Unknown’ types.

Return type:

type[NexusMetadata]

Notes:

Schema mapping:

  • ‘Image’ → ImageMetadata (SEM/TEM/STEM images)

  • ‘Spectrum’ → SpectrumMetadata (EDS/EELS spectra)

  • ‘SpectrumImage’ → SpectrumImageMetadata (hyperspectral data)

  • ‘Diffraction’ → DiffractionMetadata (diffraction patterns)

  • ‘Misc’ → NexusMetadata (base schema)

  • ‘Unknown’ → NexusMetadata (base schema)

  • Other values → NexusMetadata (fallback)
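
The mapping above amounts to a dictionary lookup with a fallback. The sketch below uses empty stand-in classes rather than the real schemas, and it assumes the type-specific schemas subclass NexusMetadata (suggested by the return type, but not stated here). `select_schema` is a hypothetical name.

```python
# Stand-in classes for illustration only; the real schemas are Pydantic models
class NexusMetadata: ...
class ImageMetadata(NexusMetadata): ...
class SpectrumMetadata(NexusMetadata): ...
class SpectrumImageMetadata(NexusMetadata): ...
class DiffractionMetadata(NexusMetadata): ...


SCHEMA_BY_TYPE = {
    "Image": ImageMetadata,
    "Spectrum": SpectrumMetadata,
    "SpectrumImage": SpectrumImageMetadata,
    "Diffraction": DiffractionMetadata,
    "Misc": NexusMetadata,
    "Unknown": NexusMetadata,
}


def select_schema(dataset_type: str) -> type[NexusMetadata]:
    """Look up the schema class, falling back to the base schema."""
    return SCHEMA_BY_TYPE.get(dataset_type, NexusMetadata)
```

Using `dict.get` with a default keeps the fallback for unrecognized values in one place.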

Examples:

>>> schema = get_schema_for_dataset_type("Image")
>>> schema.__name__
'ImageMetadata'
>>> schema = get_schema_for_dataset_type("Unknown")
>>> schema.__name__
'NexusMetadata'

nexusLIMS.extractors.validate_nx_meta(metadata_dict: dict[str, Any], *, filename: Path | None = None) → dict[str, Any]#

Validate the nx_meta structure against type-specific metadata schemas.

This function ensures that metadata returned by extractor plugins conforms to the required structure defined in the type-specific metadata schemas (ImageMetadata, SpectrumMetadata, etc.). The appropriate schema is selected based on the ‘DatasetType’ field. Validation is performed strictly - any schema violations will raise a ValidationError with detailed information about the failure.

Parameters:
  • metadata_dict (dict[str, Any]) – Dictionary containing an ‘nx_meta’ key with the metadata to validate. This is the format returned by all extractor plugins.

  • filename (Path or None, optional) – The file path being processed. Used only for error message context. If None, error messages will not include file path information.

Returns:

The original metadata_dict, unchanged. Validation does not modify the data; it only checks conformance to the schema.

Return type:

dict[str, Any]

Raises:

pydantic.ValidationError – If the nx_meta structure fails validation. The error message will include detailed information about which fields are invalid and why.

Notes:

This function validates:

  • Required fields: ‘Creation Time’, ‘Data Type’, ‘DatasetType’ must be present

  • ISO-8601 timestamps: ‘Creation Time’ must be valid ISO-8601 with timezone

  • Controlled vocabularies: ‘DatasetType’ must be one of the allowed values

  • Type-specific fields: Fields appropriate for the dataset type (e.g., ‘acceleration_voltage’ for Image, ‘acquisition_time’ for Spectrum)

  • Type constraints: All fields must match their expected types

  • Pint Quantities: Physical measurements must use Pint Quantity objects

The validation system uses type-specific schemas:

  • Image → ImageMetadata (SEM/TEM/STEM imaging)

  • Spectrum → SpectrumMetadata (EDS/EELS spectra)

  • SpectrumImage → SpectrumImageMetadata (hyperspectral)

  • Diffraction → DiffractionMetadata (TEM diffraction)

  • Misc/Unknown → NexusMetadata (base schema)

All schemas support the ‘extensions’ section for instrument-specific metadata that doesn’t fit the core schema.

Examples:

Valid metadata passes without modification:

>>> metadata = {
...     "nx_meta": {
...         "Creation Time": "2024-01-15T10:30:00-05:00",
...         "Data Type": "STEM_Imaging",
...         "DatasetType": "Image",
...     }
... }
>>> result = validate_nx_meta(metadata)
>>> result == metadata
True

Invalid metadata raises ValidationError:

>>> bad_metadata = {
...     "nx_meta": {
...         "Creation Time": "invalid-timestamp",
...         "Data Type": "STEM_Imaging",
...         "DatasetType": "Image",
...     }
... }
>>> validate_nx_meta(bad_metadata)
Traceback (most recent call last):
    ...
pydantic.ValidationError: ...

See Also:

  • nexusLIMS.schemas.metadata.NexusMetadata - The base Pydantic schema model for nx_meta validation

  • nexusLIMS.schemas.metadata.ImageMetadata - Schema for Image dataset types

  • nexusLIMS.schemas.metadata.SpectrumMetadata - Schema for Spectrum dataset types

  • get_schema_for_dataset_type - Helper function that selects the appropriate schema

  • parse_metadata - Main extraction function that uses this validator

nexusLIMS.extractors.parse_metadata(fname: Path, *, write_output: bool = True, generate_preview: bool = True, overwrite: bool = True) → Tuple[Dict[str, Any] | None, Path | list[Path] | None]#

Parse metadata from a file and optionally generate a preview image.

Given an input filename, read the file, determine its “type” (i.e., which instrument it came from), filter the metadata (if necessary) down to the fields of interest, and return the result as a dictionary. By default, the metadata is also written to the NexusLIMS directory as JSON. The preview generation method is called as well, if requested.

For files containing multiple signals (e.g., multi-signal DM3/DM4 files), generates one preview per signal and returns a list of preview paths.

Parameters:
  • fname – The filename from which to read data

  • write_output – Whether to write the metadata dictionary as a json file in the NexusLIMS folder structure

  • generate_preview – Whether to generate the thumbnail preview of this dataset. The preview is not produced by this function itself; it is only triggered from here so that both steps happen in one call.

  • overwrite – Whether to overwrite the .json metadata file and thumbnail image if either exists

Returns:

  • nx_meta (list[dict] or None) – A list of metadata dicts, one per signal in the file. If None, the file could not be opened. Single-signal files return a list with one dict, multi-signal files return a list with multiple dicts.

  • preview_fname (list[Path] or None) – A list of file paths for the generated preview images, one per signal. For single-signal files, returns a list with one path. Returns None if preview generation was not requested.
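
A minimal sketch of consuming the two return values follows. It does not call parse_metadata; it works on stand-in values shaped as described above. `pair_signals_with_previews` is a hypothetical helper, and the file names are made up. Because the signature also permits a single Path for the preview value, the sketch normalizes that case to a one-element list.

```python
from pathlib import Path
from typing import Any, Optional, Union


def pair_signals_with_previews(
    nx_meta: Optional[list[dict[str, Any]]],
    preview_fname: Union[Path, list[Path], None],
) -> list[tuple[dict[str, Any], Optional[Path]]]:
    """Pair each signal's metadata with its preview path (or None)."""
    if nx_meta is None:
        return []  # the file could not be opened
    if preview_fname is None:
        previews: list[Optional[Path]] = [None] * len(nx_meta)  # no previews requested
    elif isinstance(preview_fname, Path):
        previews = [preview_fname]  # single path, as the signature allows
    else:
        previews = list(preview_fname)
    return list(zip(nx_meta, previews))


# Stand-in values shaped like the result for a two-signal file
example_meta = [
    {"nx_meta": {"DatasetType": "Image"}},
    {"nx_meta": {"DatasetType": "Spectrum"}},
]
example_previews = [Path("scan_signal0.thumb.png"), Path("scan_signal1.thumb.png")]
pairs = pair_signals_with_previews(example_meta, example_previews)
```

Checking for None before iterating covers both the unreadable-file case and the case where no preview was requested.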

nexusLIMS.extractors.create_preview(fname: Path, *, overwrite: bool, signal_index: int | None = None) → Path | None#

Generate a preview image for a given file using the plugin system.

This method uses the preview generator plugin system to create thumbnail previews. It first tries to find a suitable preview generator plugin, and falls back to legacy methods if no plugin is found.

Parameters:
  • fname – The filename from which to read data

  • overwrite – Whether to overwrite the .json metadata file and thumbnail image if either exists

  • signal_index – For files with multiple signals, the index of the signal to preview. If None, generates a single preview (legacy behavior). If an int, generates preview with _signalN suffix in filename.

Returns:

preview_fname – The filename of the generated preview image; if None, a preview could not be successfully generated.

Return type:

Optional[Path]

nexusLIMS.extractors.flatten_dict(_dict, parent_key='', separator=' ')#

Flatten a nested dictionary into a single level.

Utility method to take a nested dictionary structure and flatten it into a single level, separating the levels by a string as specified by separator.

Uses python-benedict for robust nested dictionary operations.

Parameters:
  • _dict (dict) – The dictionary to flatten

  • parent_key (str) – The “root” key to add to the existing keys (unused in current implementation)

  • separator (str) – The string used to join key levels in the flattened keys (e.g., {'a': {'b': 'c'}} becomes {'a' + separator + 'b': 'c'})

Returns:

flattened_dict – The dictionary with depth one, with nested dictionaries flattened into root-level keys

Return type:

dict
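
To illustrate the expected output shape, here is a plain recursive stand-in written with the standard library only. It is not the nexusLIMS implementation (which uses python-benedict), and edge cases such as empty nested dictionaries may be handled differently there. The name `flatten` is hypothetical.

```python
def flatten(nested: dict, separator: str = " ") -> dict:
    """Flatten nested dicts, joining key levels with the separator."""
    flat = {}
    for key, value in nested.items():
        if isinstance(value, dict):
            # Recurse, then prefix each child key with the parent key
            for sub_key, sub_value in flatten(value, separator).items():
                flat[f"{key}{separator}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


single_level = flatten({"a": {"b": "c"}})
dotted = flatten({"a": {"b": {"c": 1}}, "d": 2}, separator=".")
```

With the default separator, `single_level` is `{"a b": "c"}`; with `separator="."`, `dotted` is `{"a.b.c": 1, "d": 2}`.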