--- tocdepth: 2 --- (record-building)= # Record building workflow NexusLIMS automatically builds XML records from electron microscopy data by combining session information from the database with metadata extracted from data files. This page explains the workflow from session detection to record upload. See the [activity diagram](activity-diagram) at the bottom for a visual overview. Understanding these details isn't required for basic operation, but this documentation may help you troubleshoot issues, customize behavior, and understand what the system does behind the scenes. Links to API reference documentation are provided throughout. (general-approach)= ## General Approach Since instruments cannot communicate directly with NexusLIMS, the system uses periodic polling (via `systemd` or `cron`) to detect sessions requiring record generation. **Execution:** Run `nexuslims build-records` to invoke {py:func}`~nexusLIMS.builder.record_builder.process_new_records`. See the {doc}`CLI reference ` for all command-line options including date filtering, dry-run mode, and verbosity controls. The command: 1. Queries NEMO APIs for recent usage events 2. Finds database sessions awaiting record generation 3. Searches the centralized file system for matching files 4. Extracts metadata and groups files into `Acquisition Activities` 5. Builds and validates XML records against the Nexus Microscopy Schema 6. Uploads valid records to the CDCS instance (and any other export destinations configured) **Error handling:** Failed builds are logged to the database with `ERROR` status, triggering email notifications. ## Finding New Sessions {py:func}`~nexusLIMS.builder.record_builder.process_new_records` discovers sessions requiring record generation through two steps: 1. **Harvest NEMO events:** {py:func}`~nexusLIMS.harvesters.nemo.utils.add_all_usage_events_to_db` queries configured NEMO APIs for recent usage events (default: 7 days lookback) and adds them to the session log database. 2. **Query pending sessions:** {py:func}`~nexusLIMS.db.session_handler.get_sessions_to_build` finds sessions with `TO_BE_BUILT` status: ```sql SELECT (session_identifier, instrument, timestamp, event_type, user) FROM session_log WHERE record_status == 'TO_BE_BUILT'; ``` Results are converted to {py:class}`~nexusLIMS.db.models.SessionLog` objects, then paired into {py:class}`~nexusLIMS.db.session_handler.Session` objects by matching `START` and `END` logs with the same `session_identifier`. ### Session Attributes Each {py:class}`~nexusLIMS.db.session_handler.Session` contains: (session-contents)= - **`session_identifier`** ({py:class}`str`): Session ID ( resolvable URL for NEMO events) - **`instrument`** ({py:class}`~nexusLIMS.db.models.Instrument`): Associated instrument object - **`dt_from`** ({py:class}`~datetime.datetime`): Session start time - **`dt_to`** ({py:class}`~datetime.datetime`): Session end time - **`user`** ({py:class}`str`): Username (read from NEMO usage event) Sessions are processed sequentially by the record builder. ## Building a Single Record For each {py:class}`~nexusLIMS.db.session_handler.Session`, the record builder executes these steps: (overview)= ### Process Overview 1. **[(link)](starting-record-builder) Initiate build**: Execute {py:func}`~nexusLIMS.builder.record_builder.build_record` 2. **[(link)](harvesting-calendar) Fetch usage event/reservation data**: Query harvesters (e.g., {py:mod}`~nexusLIMS.harvesters.nemo`) 3. **[(link)](identifying-files) Find files**: Search for parseable files within session timespan - If none found → mark as `NO_FILES_FOUND` and skip to next session 4. **[(link)](build-activities) Cluster activities**: Group files into {py:class}`~nexusLIMS.schemas.activity.AcquisitionActivity` objects using temporal analysis 5. **[(link)](parse-metadata) Extract metadata**: Parse each file's metadata and generate preview images 6. **[(link)](separate-setup-parameters) Organize metadata**: Separate common metadata (setup parameters) from file-specific values 7. **[(link)](validating-the-record) Validate**: Check XML against Nexus Microscopy Schema 8. **[(link)](upload-records) Upload**: Push valid records to CDCS/other export destinations and update database (starting-record-builder)= ### 1. Initiating the Build Before building each record: 1. {py:meth}`~nexusLIMS.db.session_handler.Session.insert_record_generation_event` logs the build attempt to the database for audit purposes 2. {py:func}`~nexusLIMS.builder.record_builder.build_record` is invoked with the session's instrument and timestamp information 3. Basic XML header is written before querying the reservation system [(go to top)](overview) (harvesting-calendar)= ### 2. Querying the Reservation/Tool Usage Calendars Instrument reservation and usage tracking systems (like NEMO) provide experiment metadata (user, sample, motivation, etc.) for the record's `` element. The record builder queries the harvester specified in the instrument's database entry (typically {py:mod}`~nexusLIMS.harvesters.nemo`). Each harvester implements `res_event_from_session()`, which returns a {py:class}`~nexusLIMS.harvesters.reservation_event.ReservationEvent` object representing the reservation with maximum time overlap (or a generic event if none found). ```{versionadded} 2.3.0 While the class is named `ReservationEvent` for historical reasons, it can contain metadata harvested from either *Reservations* or *Usage Events* (actual instrument usage sessions). This capability prioritizes Usage Event data since it reflects what was actually performed, falling back to Reservation data when usage information is unavailable. ``` This adapter layer provides uniform reservation data regardless of the source system. The event is then serialized to schema-compliant XML. #### Three-Tier Fallback Strategy for NEMO For NEMO-based systems, the harvester uses a **three-tier fallback strategy** to obtain experiment metadata, prioritizing the most recent and accurate information: **Priority 1: Post-Run Questions (`run_data`)** The highest priority source is the `run_data` field from NEMO usage events. This field is populated when users answer questions **at the end** of their instrument session (when they "enable" the tool to end their usage). Since these questions are answered after the experiment completes, they represent the most accurate description of what was actually performed. **Priority 2: Pre-Run Questions (`pre_run_data`)** If `run_data` is empty or invalid, the harvester falls back to the `pre_run_data` field. This field is populated when users answer questions **at the start** of their instrument session (when they initially "enable" the tool). While potentially less accurate than post-run data (since the experiment plan may change during execution), this provides valuable metadata when users don't complete post-run questions. **Priority 3: Reservation Matching** If neither usage event question data source is available, the harvester falls back to traditional reservation matching. The system finds the reservation with maximum time overlap with the usage event and extracts metadata from the reservation's question data. This is the least reliable method since reservations may be created well in advance and might not reflect the actual experiment performed. ```{admonition} Question Data Structure :class: tip All three sources (`run_data`, `pre_run_data`, and reservation `question_data`) use identical JSON structure, enabling consistent metadata extraction. Both usage event fields are JSON-encoded strings that are parsed and validated during harvesting. See {py:func}`~nexusLIMS.harvesters.nemo.res_event_from_session` documentation for the required question structure and examples. ``` #### Data Consent Validation All metadata sources require affirmative user consent via a `data_consent` question. The harvester validates consent for each source in priority order: - **Missing consent field**: Falls back to next priority level or raises {py:exc}`~nexusLIMS.harvesters.nemo.exceptions.NoDataConsentError` - **Declined consent** ("Disagree", "No", etc.): Falls back to next priority level or raises {py:exc}`~nexusLIMS.harvesters.nemo.exceptions.NoDataConsentError` - **Invalid JSON or missing question data**: Falls back to next priority level If all three tiers fail (no valid question data with consent), the session is marked `NO_CONSENT` or `NO_RESERVATION` in the database and no record is generated. #### Graceful Degradation The three-tier system provides graceful degradation: 1. Empty strings, malformed JSON, or missing fields in higher-priority sources automatically trigger fallback to the next tier 2. Each tier is validated independently for data completeness and consent 3. Logging tracks which data source was ultimately used for each session 4. The system prioritizes data quality over data availability This approach maximizes record generation success while preferring the most accurate metadata source available for each session. [(go to top)](overview) (identifying-files)= ### 3. Identifying Files to Include Record metadata is primarily extracted from session files. The {py:class}`~nexusLIMS.db.session_handler.Session` provides the instrument, timespan, and identifier. Instrument configuration (persistent ID, location, file storage path relative to {ref}`NX_INSTRUMENT_DATA_PATH `) is read from the database `instruments` table, centralizing configuration management. **File search strategy:** {py:meth}`~nexusLIMS.utils.files.gnu_find_files_by_mtime` uses GNU [`find`](https://www.gnu.org/software/findutils/) to find files modified within the session timespan. **No files found:** Sessions without matching files (accidental session start, no data generated) are marked `NO_FILES_FOUND` and the builder moves to the next session. [(go to top)](overview) (build-activities)= ### 4. Separating Acquisition Activities Files are grouped into logical {py:class}`~nexusLIMS.schemas.activity.AcquisitionActivity` objects to approximate conceptual experiment boundaries using statistical analysis of file creation times. ```{versionadded} 2.3.0 The {ref}`NX_CLUSTERING_SENSITIVITY ` configuration variable allows you to adjust how files are clustered into activities. Higher values (e.g., 2.0) create more activities by detecting smaller time gaps, while lower values (e.g., 0.5) create fewer activities by requiring larger gaps. Set to 0 to disable clustering entirely and group all files into a single activity. ``` The figure below illustrates the clustering process for an EELS spectrum image experiment. Panel (a) shows creation time differences revealing 13 distinct groups. Panel (b) histograms time gaps between consecutive files—small gaps indicate related files, large gaps mark activity boundaries. (cluster-fig)= ```{figure} ../_static/file_clustering.png :scale: 80 % :figwidth: 80% :alt: How groups of files are separated into Acquisition Activities An example of determining the {py:class}`~nexusLIMS.schemas.activity.AcquisitionActivity` time boundaries for a group of files collected during an experiment. See the surrounding text for a full explanation of these plots. ``` **Clustering algorithm:** {py:meth}`~nexusLIMS.schemas.activity.cluster_filelist_mtimes` uses [Kernel Density Estimation](https://scikit-learn.org/stable/modules/density.html#kernel-density) (KDE) to detect temporal gaps. The KDE peaks where files cluster in time and minimizes at gaps. Grid search cross-validation optimizes the KDE bandwidth for each session. Local minima become activity boundaries (panel c). Panel (d) overlays boundaries on the original time plot, successfully identifying all 13 groups as a human would. This approach generalizes well across file types and experiment patterns. [(go to top)](overview) (parse-metadata)= ### 5. Parsing Individual Files' Metadata {py:func}`~nexusLIMS.extractors.parse_metadata` dispatches to format-specific extractors in {py:mod}`nexusLIMS.extractors`. Each extractor returns a {py:class}`dict` with native metadata plus a top-level `'nx_meta'` key containing normalized metadata for the XML record. **Metadata Validation (v2.2.0+):** Pydantic schemas ({py:mod}`nexusLIMS.schemas.metadata`) validate metadata quality: 1. **Schema selection** by `DatasetType` ({py:class}`~nexusLIMS.schemas.metadata.ImageMetadata`, {py:class}`~nexusLIMS.schemas.metadata.SpectrumMetadata`, etc.) 2. **Required fields**: `creation_time` (with timezone), `data_type`, `dataset_type` 3. **Normalization**: EM Glossary terms (where available), unit standardization (voltages → kV, distances → nm/mm) 4. **Type safety**: Invalid types, missing timezones, incorrect enums trigger logged errors This evolving system provides foundational standardization. See {doc}`../dev_guide/nexuslims_internal_schema` for details. **XML serialization:** Nested `'nx_meta'` structures are flattened (space-separated keys). Most key-value pairs become `Setup Parameters` or `Dataset Metadata` in the CDCS front-end. ```{admonition} Special metadata keys :class: tip Reserved `'nx_meta'` keys with special handling: - **`DatasetType`**: Controlled vocabulary mapped to `` (Image, Spectrum, etc.) - **`Data Type`**: Human-readable descriptor (`TEM_Imaging`, `SEM_EDS`, `STEM_EELS`) for activity contents summary - **`Creation Time`**: [ISO-8601 timestamp](https://en.wikipedia.org/wiki/ISO_8601#Combined_date_and_time_representations) displayed in dataset table - **`Extraction Details`**: Metadata extraction provenance information - **`warnings`**: List of unreliable metadata keys (front-end displays warnings) See {doc}`../dev_guide/nexuslims_internal_schema` for complete metadata schema documentation. ``` **Extractor implementation:** Extractors leverage [HyperSpy](https://hyperspy.org/) for most formats. Unsupported formats require manual binary decoding and custom extractor/preview generator implementation. **Output artifacts:** {py:func}`~nexusLIMS.extractors.parse_metadata` generates (saved to {ref}`NX_DATA_PATH `, mirroring {ref}`NX_INSTRUMENT_DATA_PATH ` structure): - **JSON metadata file**: Full metadata in text format (linked from XML record) - **PNG preview image**: Generated via {py:mod}`nexusLIMS.extractors.plugins.preview_generators` (HyperSpy-based for complex formats, simple downsampling for regular image files) Metadata and preview paths are stored at the {py:class}`~nexusLIMS.schemas.activity.AcquisitionActivity` level. [(go to top)](overview) (separate-setup-parameters)= ### 6. Determining Setup Parameters After file processing, metadata is organized hierarchically: - **Common values** (identical across all activity files) → `` Setup Parameters - **Unique values** (varying per file) → `` metadata {py:meth}`~nexusLIMS.schemas.activity.AcquisitionActivity.store_setup_params` identifies common metadata by comparing values across files. {py:meth}`~nexusLIMS.schemas.activity.AcquisitionActivity.store_unique_metadata` then stores only file-specific values at the dataset level. This organization helps users distinguish constant experimental conditions from per-file variations. [(go to top)](overview) (validating-the-record)= ### 7. Validating the Built Records Each completed {py:class}`~nexusLIMS.schemas.activity.AcquisitionActivity` is converted to XML via {py:meth}`~nexusLIMS.schemas.activity.AcquisitionActivity.as_xml`. Once all activities are assembled, {py:func}`~nexusLIMS.builder.record_builder.validate_record` validates the complete record against the NexusLIMS schema. **Validation failure:** Session marked `ERROR` in database for investigation. **Validation success:** Record written to {ref}`NX_RECORDS_PATH ` (or {ref}`NX_DATA_PATH ` subdirectory if unspecified) before uploading to configured exporters. The builder then processes the next session, repeating until all are complete. [(go to top)](overview) (upload-records)= ### 8. Uploading Completed Records and Updating Database After all sessions are processed, {py:func}`~nexusLIMS.exporters.export_records` uploads generated XML records to configured export destinations. **Upload process:** 1. CDCS re-validates each record against the NexusLIMS schema 2. Records are assigned to the `Global Public Workspace` (viewable without login) 3. Database is updated with upload status ```{note} Future versions will implement single-sign-on, assigning records to their creating users instead of the public workspace. ``` The complete workflow repeats periodically (see [General Approach](general-approach)) to process new sessions as they occur. [(go to top)](overview) (activity-diagram)= ## Record Generation Diagram The following diagram illustrates the logic (described above) that is used to generate `Experiment` records and upload them to the NexusLIMS CDCS instance. The diagram shows the flow through different modules/sub-packages within the `nexusLIMS` package. Decision points determine the path through the workflow, and database updates occur at key stages to track session status. ```{tip} 💡 **Interactive Diagram**: You can zoom in/out (scroll or pinch) and pan (click and drag) to explore different parts of the workflow in detail. ``` ```{mermaid} %%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#eee', 'primaryTextColor': '#000', 'primaryBorderColor': '#333', 'lineColor': '#333', 'secondaryColor': '#eeeeff', 'tertiaryColor': '#eeffee', 'noteBkgColor': '#fff9e6', 'noteBorderColor': '#d4a017'}}}%% flowchart TD Start([Start]) --> ProcessNewSessions["record_builder.process_new_sessions()"] ProcessNewSessions --> AddUsageEvents["harvesters.nemo.add_all_usage_events_to_db()"] AddUsageEvents --> GetSessions["session_handler.get_sessions_to_build()"] GetSessions --> CheckSessions{New sessions
found?} CheckSessions -->|No| End1([End]) CheckSessions -->|Yes| ExtractSession["Extract single Session"] ExtractSession --> ReturnSession["Return session_handler.Session object"] ReturnSession --> InsertEvent["Insert RECORD_GENERATION event
in database"] InsertEvent --> BuildRecord["record_builder.build_record(Session)"] BuildRecord --> CreateXML["Create new .xml document
for this Experiment"] CreateXML --> FindReservation["harvesters.XXXX.res_event_from_session(s)"] FindReservation --> CheckReservation{ReservationEvent
found?} CheckReservation -->|Yes| AddDetailedInfo["Add detailed event info
to record summary XML"] CheckReservation -->|No| AddBasicInfo["Add basic information
(instrument and date)"] CheckReservation -->|No consent| UpdateNoConsent["Update Session status
as NO_CONSENT"] CheckReservation -->|No reservation| UpdateNoReservation["Update Session status
as NO_RESERVATION"] AddDetailedInfo --> IdentifyFiles AddBasicInfo --> IdentifyFiles["Identify files created on this
instrument within timespan"] IdentifyFiles --> CheckFiles{Files found?} CheckFiles -->|No| UpdateNoFiles["Update Session status
as NO_FILES_FOUND"] CheckFiles -->|Yes| ClusterFiles["Separate files into activities
by clustering file creation times"] ClusterFiles --> ProcessActivity["Build one AcquisitionActivity"] ProcessActivity --> ParseMetadata["Parse metadata for all files"] ParseMetadata --> ExtractLoop["extractors.parse_metadata(filename)"] ExtractLoop --> ExtractMetadata["Extract metadata from file"] ExtractMetadata --> DetermineType["Determine dataset type"] DetermineType --> ValidateMetadata["Validate metadata using
Pydantic schemas"] ValidateMetadata --> GeneratePreview["Generate .png preview image"] GeneratePreview --> SaveMetadata["Save metadata as .json"] SaveMetadata --> CheckMoreFiles{More files
to process?} CheckMoreFiles -->|Yes| ExtractLoop CheckMoreFiles -->|No| AssignMetadata["Assign common metadata as
Setup parameters and
unique metadata to datasets"] AssignMetadata --> AddActivity["Add AcquisitionActivity
to .xml record"] AddActivity --> CheckMoreActivities{More activities
to process?} CheckMoreActivities -->|Yes| ProcessActivity CheckMoreActivities -->|No| ValidateRecord{Validate
record?} ValidateRecord -->|Valid| WriteXML["Write XML record
to NexusLIMS folder"] ValidateRecord -->|Invalid| UpdateError["Update Session status
as ERROR"] WriteXML --> UpdateCompleted["Update Session status
as COMPLETED"] UpdateCompleted --> VerifyUpdate["Verify Session was updated
in database"] UpdateError --> VerifyUpdate UpdateNoFiles --> VerifyUpdate UpdateNoConsent --> VerifyUpdate UpdateNoReservation --> VerifyUpdate VerifyUpdate --> CheckMoreSessions{More sessions
to process?} CheckMoreSessions -->|Yes| ExtractSession CheckMoreSessions -->|No| CheckRecordsCreated{Any records
created?} CheckRecordsCreated -->|No| End2([End]) CheckRecordsCreated -->|Yes| UploadRecord["Upload single .xml record"] UploadRecord --> AssignWorkspace["Assign to public
CDCS workspace"] AssignWorkspace --> CheckMoreRecords{More records
to upload?} CheckMoreRecords -->|Yes| UploadRecord CheckMoreRecords -->|No| End3([End]) %% Styling classDef recordBuilder fill:#eee,stroke:#333,stroke-width:2px classDef sessionHandler fill:#eeeeff,stroke:#333,stroke-width:2px classDef extractors fill:#eeffee,stroke:#333,stroke-width:2px classDef cdcs fill:#ffefe1,stroke:#333,stroke-width:2px classDef dbUpdate fill:#ffcc99,stroke:#333,stroke-width:3px class ProcessNewSessions,BuildRecord,CreateXML,IdentifyFiles,ClusterFiles,ProcessActivity,AssignMetadata,AddActivity,ValidateRecord,WriteXML recordBuilder class AddUsageEvents,GetSessions,ReturnSession,InsertEvent,VerifyUpdate sessionHandler class ParseMetadata,ExtractLoop,ExtractMetadata,DetermineType,ValidateMetadata,GeneratePreview,SaveMetadata extractors class UploadRecord,AssignWorkspace cdcs class UpdateCompleted,UpdateError,UpdateNoFiles,UpdateNoConsent,UpdateNoReservation dbUpdate ``` **Diagram Key:** | Module | Light Mode | Dark Mode | |--------|------------|-----------| | Record Builder | Gray boxes | Dark gray boxes | | Session Handler | Light blue boxes | Dark yellow boxes | | Metadata Extractors | Light green boxes | Red boxes | | CDCS Connector | Peach boxes | Slate gray boxes | | Database Updates | Orange boxes | Light blue boxes |