NexusLIMS Taxonomy

This page defines the key terms and concepts used throughout NexusLIMS. This taxonomy is essential background for working with the codebase and for understanding how microscopy data is organized, classified, and transformed into experimental records.

Data Organization Philosophy

NexusLIMS organizes microscopy data hierarchically:

Sessions - User reservations defining when an instrument was used
Acquisition Activities - Groups of files created in temporal proximity during a session
Datasets - Individual signals (images, spectra, etc.) with associated metadata. A dataset typically corresponds to a single file on disk, but some files contain multiple signals and therefore produce multiple datasets in a record.
Records - XML documents describing complete experimental sessions

This structure mirrors typical microscopy workflows: users reserve instrument time (session), collect related data in bursts (activities), and each file represents a discrete measurement (one or more datasets).
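
The hierarchy can be pictured as nested containers. The sketch below is only an illustration of that nesting, assuming Python 3.9 or later: the class names `Session`, `Activity`, and `Dataset`, their attributes, and the instrument name are hypothetical and are not the classes NexusLIMS itself defines.

```python
from dataclasses import dataclass, field
from datetime import datetime
from pathlib import Path


@dataclass
class Dataset:
    """One signal (image, spectrum, ...) together with its metadata."""
    source_file: Path
    metadata: dict = field(default_factory=dict)


@dataclass
class Activity:
    """Datasets acquired close together in time."""
    datasets: list[Dataset] = field(default_factory=list)


@dataclass
class Session:
    """One instrument reservation, bounded by a start and an end time."""
    instrument: str
    start: datetime
    end: datetime
    activities: list[Activity] = field(default_factory=list)

    def dataset_count(self) -> int:
        """Total number of datasets across all activities."""
        return sum(len(activity.datasets) for activity in self.activities)


session = Session(
    instrument="example-sem",  # hypothetical instrument name
    start=datetime(2024, 1, 1, 9, 0),
    end=datetime(2024, 1, 1, 11, 0),
    activities=[
        Activity([Dataset(Path("image001.tif")), Dataset(Path("image002.tif"))]),
        # A single file may contribute several datasets (one per signal).
        Activity([Dataset(Path("map001.dm4")), Dataset(Path("map001.dm4"))]),
    ],
)
print(session.dataset_count())
```

A record would then be the XML serialization of one such session; that step is omitted here.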

System Components

The following terms describe major architectural components:

Harvester:
- The harvesters (implemented in the nexusLIMS.harvesters package) are the portions of the code that connect to external data sources, such as the NEMO laboratory management system. The only harvester currently implemented is NEMO (nemo). A SharePoint calendar harvester was used in previous versions of NexusLIMS, but was removed in version 2.0.0.
Extractor:
- The extractors (implemented in the nexusLIMS.extractors package) are the modules that inspect the data files collected during an Experiment and pull out the relevant metadata for inclusion in the record. Preview image generation is also considered an extractor.
Record Builder:
- The record builder (implemented in the nexusLIMS.builder.record_builder module) is the heart of the NexusLIMS back-end. It orchestrates the complete record creation workflow:
  1. Trigger harvesters to look for new sessions and add them to the session database
  2. Query database for sessions ready to build
  3. Find all data files modified during the session time window
  4. Cluster files into Acquisition Activities based on temporal gaps
  5. Extract and validate metadata from each file
  6. Build XML record conforming to Nexus Experiment schema
  7. Export record to configured destinations via the export framework
  Further details are provided on the record building documentation page.
Exporters:
- The exporters (implemented in the nexusLIMS.exporters package) form a plugin-based framework for publishing experimental records to multiple repository systems. After building XML records, the system can export them to one or more configured destinations:
  - CDCS (cdcs) - Primary destination for web-based record viewing, search, and access control
  - eLabFTW (elabftw) - Electronic lab notebook integration with structured metadata and cross-linking
  - Custom destinations - Extensible plugin system for additional repositories
  The framework supports multiple destinations running in parallel, configurable failure handling strategies, inter-destination dependencies for cross-linking, and per-destination tracking in the database.
  
  Further details are provided on the exporters documentation page.
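
Step 3 of the record builder workflow listed above (finding files modified during the session time window) can be illustrated with the standard library alone. This is a minimal sketch, not the NexusLIMS implementation: the function name `files_modified_between` and the demo file names are hypothetical.

```python
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def files_modified_between(root: Path, start: datetime, end: datetime) -> list[Path]:
    """Return files under `root` whose mtime lies in [start, end], oldest first."""
    lo, hi = start.timestamp(), end.timestamp()
    hits = [
        (p.stat().st_mtime, p)
        for p in root.rglob("*")
        if p.is_file() and lo <= p.stat().st_mtime <= hi
    ]
    return [p for _, p in sorted(hits)]


# Demo with throwaway files whose modification times are set explicitly.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    session_start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
    session_end = datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)
    for name, hour in [("before.tif", 8), ("during.tif", 10), ("after.tif", 12)]:
        path = root / name
        path.touch()
        stamp = datetime(2024, 1, 1, hour, 0, tzinfo=timezone.utc).timestamp()
        os.utime(path, (stamp, stamp))
    found = [p.name for p in files_modified_between(root, session_start, session_end)]

print(found)
```

Only the file stamped inside the two-hour window is returned; the files stamped before and after it are ignored.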

Dataset Types

Dataset Type (or DatasetType) classifies the fundamental nature of the data in a file. This is a required field in all NexusLIMS metadata and determines which Pydantic schema is used for validation.

Available Dataset Types

DatasetType	Description	Example Instruments	Dimensions
`"Image"`	2D imaging data	SEM, TEM, STEM, Optical	2D array (x, y)
`"Spectrum"`	1D spectral data	EDS, EELS, CL	1D array (energy/wavelength)
`"SpectrumImage"`	2D image with spectrum at each pixel	EDS maps, EELS-SI	3D array (x, y, spectrum)
`"Diffraction"`	Diffraction patterns	SAED, CBED, 4D-STEM	2D or 4D array
`"Misc"`	Other structured data	Custom formats, line scans	Variable
`"Unknown"`	Unable to determine	Corrupted/unreadable files	N/A

Note:

A “line scan” is considered to be a special case of "SpectrumImage" with a y dimension of 1.

Schema Mapping

Each DatasetType maps to a specific Pydantic schema for metadata validation:

DatasetType	Schema Class
`"Image"`	`ImageMetadata`
`"Spectrum"`	`SpectrumMetadata`
`"SpectrumImage"`	`SpectrumImageMetadata`
`"Diffraction"`	`DiffractionMetadata`
`"Misc"`	`NexusMetadata` (base schema)
`"Unknown"`	`NexusMetadata` (base schema)

The schema selection happens automatically in validate_nx_meta() based on the DatasetType field. Different schemas validate different sets of optional fields:

ImageMetadata - Validates imaging fields (acceleration_voltage, working_distance, magnification)
SpectrumMetadata - Validates spectroscopy fields (acquisition_time, live_time, channel_size)
SpectrumImageMetadata - Validates both imaging and spectroscopy fields
DiffractionMetadata - Validates diffraction fields (camera_length, convergence_angle)
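
The dispatch described above amounts to a lookup keyed on DatasetType. The sketch below mimics it with plain dictionaries and sets instead of Pydantic models, so it runs without NexusLIMS installed. The schema names and field names come from the lists above; the helper `select_schema` and the constant names are hypothetical and do not show how validate_nx_meta() is written.

```python
IMAGE_FIELDS = {"acceleration_voltage", "working_distance", "magnification"}
SPECTRUM_FIELDS = {"acquisition_time", "live_time", "channel_size"}
DIFFRACTION_FIELDS = {"camera_length", "convergence_angle"}

# DatasetType -> (schema class name from the table, fields that schema checks)
SCHEMA_FOR_TYPE = {
    "Image": ("ImageMetadata", IMAGE_FIELDS),
    "Spectrum": ("SpectrumMetadata", SPECTRUM_FIELDS),
    "SpectrumImage": ("SpectrumImageMetadata", IMAGE_FIELDS | SPECTRUM_FIELDS),
    "Diffraction": ("DiffractionMetadata", DIFFRACTION_FIELDS),
    "Misc": ("NexusMetadata", set()),
    "Unknown": ("NexusMetadata", set()),
}


def select_schema(nx_meta: dict) -> tuple[str, set[str]]:
    """Pick the schema name and its type-specific fields from DatasetType."""
    dataset_type = nx_meta.get("DatasetType")
    if dataset_type not in SCHEMA_FOR_TYPE:
        raise ValueError(f"unsupported DatasetType: {dataset_type!r}")
    return SCHEMA_FOR_TYPE[dataset_type]


name, fields = select_schema({"DatasetType": "SpectrumImage", "Data Type": "STEM_EDS"})
print(name, sorted(fields))
```

Because "SpectrumImage" takes the union of the imaging and spectroscopy field sets, an EDS map may carry both magnification and live_time in this model.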

Usage in Extractors

Extractors must set DatasetType based on file content:

# Example: SEM image
nx_meta = {
    "DatasetType": "Image",
    "Data Type": "SEM_Imaging",
    # ... imaging-specific fields
}

# Example: EDS spectrum
nx_meta = {
    "DatasetType": "Spectrum",
    "Data Type": "SEM_EDS",
    # ... spectrum-specific fields
}

# Example: EDS map (spectrum at each pixel)
nx_meta = {
    "DatasetType": "SpectrumImage",
    "Data Type": "STEM_EDS",
    # ... both imaging AND spectrum fields
}

For more details on choosing the appropriate DatasetType, see Schema Selection Logic.

Data Type Classification

Data Type is a descriptive string that provides more specific information about the acquisition technique and modality. While DatasetType is constrained to six values, Data Type can be any descriptive string.

Naming Convention

NexusLIMS uses a structured format for Data Type, in which the Technique component is optional:

Category_Modality_Technique

Examples:

SEM_Imaging - Scanning Electron Microscopy imaging
TEM_Imaging - Transmission Electron Microscopy imaging
STEM_Imaging - Scanning Transmission Electron Microscopy imaging
STEM_EDS - STEM with Energy Dispersive X-ray Spectroscopy
TEM_EELS - TEM with Electron Energy Loss Spectroscopy
TEM_Diffraction - TEM diffraction pattern
SEM_BSE - SEM with Backscattered Electron detector
STEM_HAADF - STEM with High-Angle Annular Dark Field detector

Components

Category - Instrument family (SEM, TEM, STEM, FIB, Optical, etc.)
Modality - Detection/operation mode (Imaging, EDS, EELS, Diffraction, BSE, SE, etc.)
Technique - Specific variant (HAADF, BF, DF, SAED, CBED, etc.) - optional
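
Splitting a Data Type string into these components takes only a few lines. The parser below is a sketch: `DataTypeParts` and `parse_data_type` are hypothetical names, and the three-part value in the last line is an invented example of the optional Technique component.

```python
from typing import NamedTuple, Optional


class DataTypeParts(NamedTuple):
    category: str
    modality: str
    technique: Optional[str]


def parse_data_type(data_type: str) -> DataTypeParts:
    """Split 'Category_Modality[_Technique]' into its components."""
    parts = data_type.split("_", 2)
    if len(parts) < 2 or not all(parts):
        raise ValueError(f"expected Category_Modality[_Technique], got {data_type!r}")
    technique = parts[2] if len(parts) == 3 else None
    return DataTypeParts(parts[0], parts[1], technique)


print(parse_data_type("STEM_EDS"))
print(parse_data_type("TEM_Diffraction_SAED"))  # hypothetical three-part value
```

Since Data Type may be any descriptive string, a parser like this is better suited to reporting or grouping than to validation: a value that does not follow the convention is not necessarily invalid.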

How Extractors Determine Data Type

Extractors typically determine Data Type by analyzing:

File format - Some formats are technique-specific (e.g., .spc files are always EDS spectra)
Metadata fields - Imaging mode, detector type, operation mode
Data dimensions - 2D = likely imaging, 1D = likely spectrum
Detector information - HAADF detector → STEM_HAADF
Filename patterns - Sometimes contains hints (Diff, SAED, etc.)

Relationship to DatasetType

Data Type	Typical DatasetType
SEM_Imaging, TEM_Imaging, STEM_Imaging	`"Image"`
SEM_EDS, TEM_EELS (point spectra)	`"Spectrum"`
STEM_EDS, STEM_EELS (maps)	`"SpectrumImage"`
TEM_Diffraction, TEM_SAED, TEM_CBED	`"Diffraction"`

The combination of DatasetType (for schema validation) and Data Type (for human description) provides both machine-readable classification and human-readable context.
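
The table above can be read as a small decision rule. The function below encodes only the rows of that table. The name `typical_dataset_type`, the `is_map` flag, and the choice to return "Unknown" for values the table does not list are assumptions made for this sketch.

```python
IMAGING = {"Imaging"}
SPECTROSCOPY = {"EDS", "EELS"}
DIFFRACTION = {"Diffraction", "SAED", "CBED"}


def typical_dataset_type(data_type: str, is_map: bool = False) -> str:
    """Suggest a DatasetType for a Data Type string, following the table rows."""
    tokens = set(data_type.split("_")[1:])  # drop the instrument category
    if tokens & DIFFRACTION:
        return "Diffraction"
    if tokens & SPECTROSCOPY:
        # Point spectra and maps share a Data Type; dimensionality decides.
        return "SpectrumImage" if is_map else "Spectrum"
    if tokens & IMAGING:
        return "Image"
    return "Unknown"  # fallback chosen for this sketch, not taken from the table


print(typical_dataset_type("SEM_Imaging"))
print(typical_dataset_type("STEM_EDS", is_map=True))
```

The `is_map` flag is needed because the Data Type alone does not say whether an EDS or EELS acquisition is a point spectrum or a map; an extractor would usually infer that from the data dimensions.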

Metadata Schema Structure

New in v2.2.0: NexusLIMS uses a three-tier architecture for organizing metadata, balancing standardization with flexibility.

Three-Tier Architecture

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#e3f2fd', 'primaryTextColor': '#000', 'primaryBorderColor': '#1976d2', 'lineColor': '#1976d2', 'secondaryColor': '#fff3e0', 'tertiaryColor': '#f3e5f5'}}}%%
flowchart TD
    Tier1["<b>Tier 1: Core Fields</b><br/>(Required + Common EM Glossary fields)<br/><br/>• DatasetType, Data Type, Creation Time<br/>• acceleration_voltage, beam_current<br/>• magnification, working_distance"]

    Tier2["<b>Tier 2: Type-Specific Fields</b><br/>(Additional fields per DatasetType)<br/><br/>• ImageMetadata: detector_type, etc.<br/>• SpectrumMetadata: acquisition_time<br/>• DiffractionMetadata: camera_length"]

    Tier3["<b>Tier 3: Extensions</b><br/>(Vendor/instrument-specific fields)<br/><br/>• facility, building, room<br/>• vendor_metadata<br/>• instrument_serial"]

    Tier1 --> Tier2
    Tier2 --> Tier3

    classDef tier1Style fill:#e3f2fd,stroke:#1976d2,stroke-width:3px,color:#000
    classDef tier2Style fill:#fff3e0,stroke:#f57c00,stroke-width:3px,color:#000
    classDef tier3Style fill:#f3e5f5,stroke:#7b1fa2,stroke-width:3px,color:#000

    class Tier1 tier1Style
    class Tier2 tier2Style
    class Tier3 tier3Style

Tier 1: Core Fields

Always required regardless of DatasetType:

DatasetType - Classification for schema selection
Data Type - Human-readable technique description
Creation Time - ISO-8601 timestamp with timezone
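
A quick pre-check for these three required fields might look like the sketch below. It is not the Pydantic validation NexusLIMS performs; `missing_core_fields` and `has_timezone` are hypothetical helper names.

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = ("DatasetType", "Data Type", "Creation Time")


def missing_core_fields(nx_meta: dict) -> list[str]:
    """List the required Tier 1 keys that are absent from nx_meta."""
    return [name for name in REQUIRED_FIELDS if name not in nx_meta]


def has_timezone(timestamp: str) -> bool:
    """True when an ISO-8601 string parses and carries a UTC offset."""
    try:
        parsed = datetime.fromisoformat(timestamp)
    except ValueError:
        return False
    return parsed.utcoffset() is not None


nx_meta = {
    "DatasetType": "Image",
    "Data Type": "SEM_Imaging",
    # isoformat() on an aware datetime includes the offset, e.g. +00:00
    "Creation Time": datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc).isoformat(),
}
print(missing_core_fields(nx_meta), has_timezone(nx_meta["Creation Time"]))
```

Note that `datetime.fromisoformat` accepts a trailing "Z" only from Python 3.11 onward, so an explicit offset such as +00:00 is the safer form for this check.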

Common EM Glossary fields (optional but standardized):

acceleration_voltage - Electron beam energy
beam_current - Probe current
working_distance - Pole piece to sample distance
magnification - Imaging magnification
… (50+ standardized fields available)

These fields use EM Glossary standardized names and are validated by all schema classes.

Tier 2: Type-Specific Fields

Additional optional fields validated based on DatasetType:

ImageMetadata (for DatasetType == "Image"):

detector_type - Detector name (SE, BSE, HAADF, etc.)
field_of_view_width / field_of_view_height - Image dimensions in physical units
pixel_width / pixel_height - Pixel size in physical units
dwell_time - Pixel dwell time for scanning

SpectrumMetadata (for DatasetType == "Spectrum"):

acquisition_time - Total acquisition duration
live_time - Detector live time (EDS)
channel_size - Energy/wavelength per channel
starting_energy - Spectrum start energy
takeoff_angle - EDS detector geometry

SpectrumImageMetadata (for DatasetType == "SpectrumImage"):

Validates both ImageMetadata AND SpectrumMetadata fields
Allows both imaging parameters and spectroscopy parameters

DiffractionMetadata (for DatasetType == "Diffraction"):

camera_length - Diffraction camera length
convergence_angle - Probe convergence angle (CBED)

Tier 3: Extensions

The extensions dictionary holds fields that don’t fit the core schema:

# `ureg` is assumed here to be a Pint unit registry that is already in scope
nx_meta = {
    # Tier 1 & 2: Core and type-specific fields
    "DatasetType": "Image",
    "acceleration_voltage": ureg.Quantity(15, "kilovolt"),

    # Tier 3: Extensions
    "extensions": {
        "facility": "NIST Center for Nanoscale Science",
        "building": "Bldg 217",
        "room": "A206",
        "quanta_spot_size": 3.5,  # Vendor-specific
        "fei_detector_mode": "CBS",
        "detector_serial": "12345-ABC",
    }
}

When to use extensions:

Vendor-specific fields without EM Glossary equivalents
Site/facility metadata (building, room, PI name)
Instrument calibration data
Custom workflow parameters
Anything that shouldn’t be validated by the schema
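
One way to picture the split between schema fields and extensions is a function that routes every unrecognized key into the extensions dictionary. This sketch is not the add_to_extensions() helper: the function `move_unknown_to_extensions` and the abbreviated `CORE_KEYS` set are hypothetical, and a real extractor would decide per field rather than by a blanket rule.

```python
# Abbreviated stand-in for the standardized field names (hypothetical subset).
CORE_KEYS = {
    "DatasetType", "Data Type", "Creation Time",
    "acceleration_voltage", "beam_current", "working_distance", "magnification",
}


def move_unknown_to_extensions(flat_meta: dict, core_keys: set = CORE_KEYS) -> dict:
    """Keep recognized keys at the top level; move everything else to 'extensions'."""
    nx_meta = {"extensions": dict(flat_meta.get("extensions", {}))}
    for key, value in flat_meta.items():
        if key == "extensions":
            continue  # already copied above
        if key in core_keys:
            nx_meta[key] = value
        else:
            nx_meta["extensions"][key] = value
    return nx_meta


result = move_unknown_to_extensions({
    "DatasetType": "Image",
    "magnification": 5000,
    "quanta_spot_size": 3.5,          # vendor-specific, no standardized name
    "extensions": {"room": "A206"},   # existing extensions are preserved
})
print(result)
```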

Populating extensions:

In extractors: Use add_to_extensions() helper
In instrument profiles: Use extension_fields parameter
See Helper Functions for details

Benefits of Three-Tier Structure

Standardization - Core fields ensure interoperability across instruments
Validation - Type-specific fields are validated by appropriate schemas
Flexibility - Extensions accommodate vendor-specific or site-specific metadata
Extensibility - New fields can be added without breaking existing code
Semantic clarity - Clear separation between standardized and custom metadata

Acquisition Activities

An Acquisition Activity is a group of files created in temporal proximity during an experimental session. Activities represent discrete experimental tasks (e.g., “Survey of sample region 1”, “EDS line scan across interface”).

Motivation

Microscopy experiments typically consist of multiple distinct activities:

Initial survey imaging at low magnification
Higher magnification imaging of regions of interest
Spectroscopy (EDS, EELS) at specific points
Diffraction pattern collection
Return to imaging at different locations

Each activity generates a burst of files. Temporal gaps between these bursts indicate transitions between activities.

File Clustering Algorithm

The record builder uses Kernel Density Estimation (KDE) to identify temporal gaps in file modification times:

from pathlib import Path

from nexusLIMS.schemas.activity import cluster_filelist_mtimes

# Input: List of file paths with modification times
files = [Path("file1.dm3"), Path("file2.dm3"), ...]

# Cluster into activities based on temporal gaps
activities = cluster_filelist_mtimes(files)
# → [[file1, file2], [file3, file4, file5], [file6]]

Algorithm steps:

Extract modification times from all files
Convert to minutes since session start
Apply Gaussian KDE to estimate the probability density of file modification times
Identify local minima in density → temporal gaps
Split file list at gap boundaries
Return list of file groups (activities)
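
The steps above can be sketched without any third-party dependency. The code below is a pure-Python stand-in for the scikit-learn KernelDensity approach, not the body of cluster_filelist_mtimes(): the function names, the fixed `bandwidth` of 2 minutes, and the grid `step` are assumptions made for illustration, whereas the real bandwidth is described as adapting to the data.

```python
import math


def gaussian_kde(samples: list[float], grid: list[float], bandwidth: float) -> list[float]:
    """Evaluate a Gaussian kernel density estimate of `samples` at each grid point."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm
        for x in grid
    ]


def cluster_by_mtime(mtimes: list[float], bandwidth: float = 2.0,
                     step: float = 0.25) -> list[list[float]]:
    """Group times (minutes since session start) by splitting at density minima."""
    times = sorted(mtimes)
    if not times:
        return []
    n_points = int((times[-1] - times[0]) / step) + 1
    grid = [times[0] + i * step for i in range(n_points)]
    density = gaussian_kde(times, grid, bandwidth)
    # A local minimum of the density marks a gap between bursts of files.
    # "<=" on the right side keeps exactly one boundary on a flat stretch.
    boundaries = [
        grid[i]
        for i in range(1, len(grid) - 1)
        if density[i] < density[i - 1] and density[i] <= density[i + 1]
    ]
    groups: list[list[float]] = []
    current: list[float] = []
    pending = iter(boundaries + [math.inf])
    boundary = next(pending)
    for t in times:
        while t > boundary:  # crossed a gap: close the current group
            if current:
                groups.append(current)
                current = []
            boundary = next(pending)
        current.append(t)
    groups.append(current)
    return groups


minutes = [0, 1, 2, 3, 4, 15, 16, 17, 18, 35, 36, 37]
clusters = cluster_by_mtime(minutes)
print(clusters)
```

With three well-separated bursts, the density has two minima, so the twelve timestamps fall into three groups. A smaller bandwidth would split bursts more aggressively, and a larger one would merge neighboring bursts.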

Example timeline:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#e3f2fd', 'primaryTextColor': '#000', 'primaryBorderColor': '#1976d2'}}}%%
gantt
    title File Clustering into Acquisition Activities
    dateFormat mm
    axisFormat %M min

    section Activity 1
    Survey images (7 files)           :active, a1, 00, 5m

    section Gap
    10 min gap                        :crit, gap1, 05, 10m

    section Activity 2
    High-mag imaging (4 files)        :active, a2, 15, 5m

    section Gap
    15 min gap                        :crit, gap2, 20, 15m

    section Activity 3
    EDS point spectra (6 files)       :active, a3, 35, 5m

    section Gap
    5 min gap                         :crit, gap3, 40, 5m

    section Activity 4
    EDS map (8 files)                 :active, a4, 45, 5m

Parameters:

Minimum gap duration to split activities (configurable)
Bandwidth for KDE smoothing (adapts to data density)

Activity Metadata

Each activity in a record includes:

Start time (earliest file in group)
End time (latest file in group)
List of datasets (files) in the activity
Activity-level summary information
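
The start time, end time, and dataset list follow directly from the modification times of the files in a group. The sketch below shows one way to derive them; `summarize_activity` and its dictionary layout are hypothetical, and the activity-level summary information is left out because its contents are not specified here.

```python
from datetime import datetime, timezone


def summarize_activity(file_mtimes: dict[str, float]) -> dict:
    """Derive start, end, and a time-ordered file list for one activity.

    `file_mtimes` maps a file name to its modification time (POSIX seconds).
    """
    if not file_mtimes:
        raise ValueError("an activity needs at least one file")
    ordered = sorted(file_mtimes, key=file_mtimes.get)

    def as_iso(stamp: float) -> str:
        return datetime.fromtimestamp(stamp, tz=timezone.utc).isoformat()

    return {
        "start": as_iso(file_mtimes[ordered[0]]),   # earliest file in group
        "end": as_iso(file_mtimes[ordered[-1]]),    # latest file in group
        "datasets": ordered,
    }


base = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc).timestamp()
summary = summarize_activity({
    "image002.tif": base + 90,
    "image001.tif": base,
    "image003.tif": base + 240,
})
print(summary)
```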

Relationship to Sessions

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#e3f2fd', 'primaryTextColor': '#000', 'primaryBorderColor': '#1976d2'}}}%%
graph TD
    Session["Session<br/>(Microscopy Usage Event)"]

    Session --> Activity1["Activity 1<br/>(Survey Images)"]
    Session --> Activity2["Activity 2<br/>(High-Mag Imaging)"]
    Session --> Activity3["Activity 3<br/>(EDS Analysis)"]

    Activity1 --> Dataset1["Dataset 1<br/>(image001.tif)"]
    Activity1 --> Dataset2["Dataset 2<br/>(image002.tif)"]
    Activity1 --> Dataset3["Dataset 3<br/>(image003.tif)"]

    Activity2 --> Dataset4["Dataset 4<br/>(image004.tif)"]
    Activity2 --> Dataset5["Dataset 5<br/>(image005.tif)"]

    Activity3 --> Dataset6["Dataset 6<br/>(spectrum001.spc)"]

    classDef sessionStyle fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    classDef activityStyle fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef datasetStyle fill:#f3e5f5,stroke:#7b1fa2,stroke-width:1px

    class Session sessionStyle
    class Activity1,Activity2,Activity3 activityStyle
    class Dataset1,Dataset2,Dataset3,Dataset4,Dataset5,Dataset6 datasetStyle

A session can have one or more activities. The clustering algorithm automatically determines activity boundaries based on temporal gaps in file modification times.

Implementation Details

Module: nexusLIMS.schemas.activity
Class: AcquisitionActivity
Function: cluster_filelist_mtimes()
Algorithm: Scikit-learn KernelDensity with Gaussian kernel

For more details, see Separating Acquisition Activities.