{ "cells": [ { "cell_type": "markdown", "id": "ae6d6280", "metadata": {}, "source": "# Standalone Extractor Usage\n\nThe NexusLIMS extractor system can be used as a **standalone Python library**\nwithout a full NexusLIMS deployment. No `.env` file, database, NEMO instance,\nor CDCS server is required.\n\nThis notebook demonstrates:\n\n1. Downloading example microscopy files from the project repository\n2. Extracting metadata with the high-level `parse_metadata()` API\n3. Using the low-level registry API (`ExtractionContext` + `get_registry()`)\n4. What degrades gracefully when no NexusLIMS config is present\n\n:::{note}\n{download}`Download this notebook ` to run it locally.\n:::\n\n## Installation\n\n```bash\npip install nexusLIMS\n# or, if you use uv:\nuv add nexusLIMS\n```" }, { "cell_type": "markdown", "id": "0158ab63", "metadata": {}, "source": [ "## 1. Download example data files\n", "\n", "We pull a handful of microscopy files straight from the NexusLIMS test suite\n", "on GitHub so this notebook is self-contained. Some larger files are stored as\n", "`.tar.gz` archives in the repository; those are downloaded and extracted here." 
] }, { "cell_type": "code", "execution_count": 1, "id": "4a9f07ca", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " already present : test_STEM_image.dm3\n", " already present : orion-zeiss_dataZeroed.tif\n", " already present : leo_edax_test.msa\n", " already present : leo_edax_test.spc\n", " already present : TEM_list_signal_dataZeroed.dm3\n", " already present : neoarm-gatan_SI_dataZeroed.dm4\n", "\n", "All files ready.\n" ] } ], "source": [
"import tarfile\n",
"import urllib.request\n",
"from pathlib import Path\n",
"\n",
"# Raw-content base URL for test files in the NexusLIMS repository\n",
"_BASE = \"https://raw.githubusercontent.com/datasophos/NexusLIMS/main/tests/unit/files/\"\n",
"\n",
"data_dir = Path(\"example_data\")\n",
"data_dir.mkdir(exist_ok=True)\n",
"\n",
"\n",
"def fetch(remote_name: str, extract_to: Path | None = None) -> Path:\n",
"    \"\"\"Download a file from the NexusLIMS test suite.\n",
"\n",
"    If *extract_to* is given the file is treated as a .tar.gz archive:\n",
"    it is downloaded into *data_dir* and extracted into *extract_to*.\n",
"    Returns the path of the (extracted) file.\n",
"    \"\"\"\n",
"    url = _BASE + remote_name\n",
"    dest = data_dir / remote_name\n",
"\n",
"    if extract_to is None:\n",
"        # Plain file download\n",
"        if dest.exists():\n",
"            print(f\" already present : {remote_name}\")\n",
"        else:\n",
"            print(f\" downloading {remote_name} …\", end=\" \", flush=True)\n",
"            urllib.request.urlretrieve(url, dest)\n",
"            print(f\"done ({dest.stat().st_size:,} bytes)\")\n",
"        return dest\n",
"    else:\n",
"        # Archive download + extraction\n",
"        # The member file has the same stem as the archive (e.g., foo.dm3.tar.gz → foo.dm3)\n",
"        stem = remote_name.replace(\".tar.gz\", \"\")\n",
"        extracted = extract_to / stem\n",
"        if extracted.exists():\n",
"            print(f\" already present : {stem}\")\n",
"        else:\n",
"            archive_path = data_dir / remote_name\n",
"            if not archive_path.exists():\n",
"                print(f\" downloading {remote_name} …\", end=\" \", flush=True)\n",
"                urllib.request.urlretrieve(url, archive_path)\n",
"                print(f\"done ({archive_path.stat().st_size:,} bytes)\")\n",
"            print(f\" extracting {stem} …\", end=\" \", flush=True)\n",
"            with tarfile.open(archive_path) as tf:\n",
"                tf.extract(stem, path=extract_to, filter=\"data\")\n",
"            print(f\"done ({extracted.stat().st_size:,} bytes)\")\n",
"        return extracted\n",
"\n",
"\n",
"# --- Files stored directly in the repo ---\n",
"dm3_file = fetch(\"test_STEM_image.dm3\")\n",
"orion_tif = fetch(\"orion-zeiss_dataZeroed.tif\")\n",
"msa_file = fetch(\"leo_edax_test.msa\")\n",
"spc_file = fetch(\"leo_edax_test.spc\")\n",
"\n",
"# --- Files stored as archives ---\n",
"multi_dm3 = fetch(\"TEM_list_signal_dataZeroed.dm3.tar.gz\", extract_to=data_dir)\n",
"neoarm_dm4 = fetch(\"neoarm-gatan_SI_dataZeroed.dm4.tar.gz\", extract_to=data_dir)\n",
"\n",
"print(\"\\nAll files ready.\")" ] }, { "cell_type": "markdown", "id": "0fa9d627", "metadata": {}, "source": [ "## 2. High-level API — `parse_metadata()`\n", "\n", "`parse_metadata()` is the main entry point. Pass\n", "`write_output=False, generate_preview=False` to skip the NexusLIMS\n", "deployment-specific steps (writing JSON sidecars and generating thumbnails).\n", "\n", "### 2a.
Gatan DM3 — single STEM image" ] }, { "cell_type": "code", "execution_count": 2, "id": "fee97546", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Signals found : 1\n", "DatasetType : Image\n", "Data Type : STEM_Imaging\n", "Creation Time : 2026-02-18T14:54:03.330955-07:00\n" ] } ], "source": [
"from nexusLIMS.extractors import parse_metadata\n",
"\n",
"metadata_list, _previews = parse_metadata(\n",
"    dm3_file,\n",
"    write_output=False,\n",
"    generate_preview=False,\n",
")\n",
"\n",
"print(f\"Signals found : {len(metadata_list)}\")\n",
"nx = metadata_list[0][\"nx_meta\"]\n",
"print(f\"DatasetType : {nx['DatasetType']}\")\n",
"print(f\"Data Type : {nx['Data Type']}\")\n",
"print(f\"Creation Time : {nx['Creation Time']}\")" ] }, { "cell_type": "markdown", "id": "ee2b1391", "metadata": {}, "source": [ "The full `nx_meta` dictionary contains everything the extractor could pull from the file.\n", "A small helper makes nested dicts and Pint Quantity objects readable:" ] }, { "cell_type": "code", "execution_count": 3, "id": "d4707375", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "acceleration_voltage: 200000.0 volt\n", "acquisition_device: DigiScan\n", "Creation Time: 2026-02-18T14:54:03.330955-07:00\n", "Data Dimensions: (68, 68)\n", "Data Type: STEM_Imaging\n", "DatasetType: Image\n", "dwell_time: 3.5 microsecond\n", "extensions:\n", " Cs: 0.0 millimeter\n", " GMS Version: 2.31.734.0\n", " Illumination Mode: STEM NANOPROBE\n", " Imaging Mode: DIFFRACTION\n", " Microscope: FEI Titan\n", " Name: TEST Titan Remote\n", " Operation Mode: SCANNING\n", " STEM Camera Length: 135.0 millimeter\n", "horizontal_field_width: 0.5090058644612631 micrometer\n", "magnification: 225000.0\n", "stage_position:\n", " tilt_alpha: 24.950478513002935 degree\n", " x: -461.276 micrometer\n", " y: 52.0039 micrometer\n", " z: 35.033899999999996 millimeter\n", "warnings: []\n", "NexusLIMS Extraction:\n", " Date:
2026-02-18T15:13:06.884944-07:00\n", " Module: nexusLIMS.extractors.plugins.dm3_extractor\n", " Version: 2.4.2.dev0\n" ] } ], "source": [
"from decimal import Decimal\n",
"\n",
"import numpy as np\n",
"\n",
"\n",
"def _to_display(obj):\n",
"    \"\"\"Convert quantities and numpy scalars to printable strings.\"\"\"\n",
"    if hasattr(obj, \"magnitude\"):  # pint Quantity\n",
"        return f\"{obj.magnitude} {obj.units}\"\n",
"    if isinstance(obj, (np.integer, np.floating)):\n",
"        return obj.item()\n",
"    if isinstance(obj, Decimal):\n",
"        return float(obj)\n",
"    return obj\n",
"\n",
"\n",
"def pretty(d, indent=0):\n",
"    for k, v in d.items():\n",
"        if isinstance(v, dict):\n",
"            print(\" \" * indent + f\"{k}:\")\n",
"            pretty(v, indent + 2)\n",
"        else:\n",
"            print(\" \" * indent + f\"{k}: {_to_display(v)}\")\n",
"\n",
"\n",
"pretty(nx)" ] }, { "cell_type": "markdown", "id": "954b3588", "metadata": {}, "source": [ "### 2b. Multi-signal DM3 file\n", "\n", "Some DM3/DM4 files contain multiple signals (e.g., a survey image alongside\n", "an EELS spectrum). `parse_metadata()` returns one metadata dict per signal." ] }, { "cell_type": "code", "execution_count": 4, "id": "6c0662b8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Signals in file: 2\n", " [0] Image STEM_Imaging dims=(512, 512)\n", " [1] Image TEM_Imaging dims=(512,)\n" ] } ], "source": [
"multi_list, _ = parse_metadata(\n",
"    multi_dm3,\n",
"    write_output=False,\n",
"    generate_preview=False,\n",
")\n",
"\n",
"print(f\"Signals in file: {len(multi_list)}\")\n",
"for i, m in enumerate(multi_list):\n",
"    nx = m[\"nx_meta\"]\n",
"    dims = nx.get(\"Data Dimensions\", \"?\")\n",
"    print(f\" [{i}] {nx['DatasetType']:15s} {nx['Data Type']:25s} dims={dims}\")" ] }, { "cell_type": "markdown", "id": "cccbd9b9", "metadata": {}, "source": [ "### 2c.
Gatan DM4 — multi-signal spectrum image (EELS/EDS SI)" ] }, { "cell_type": "code", "execution_count": 5, "id": "5ad2519d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Signals in file: 4\n", " [0] Image STEM_Imaging dims=(512, 512)\n", " [1] Image STEM_Imaging dims=(52, 106)\n", " [2] Image STEM_Imaging dims=(52, 106)\n", " [3] SpectrumImage EDS_Spectrum_Imaging dims=(52, 106, 2048)\n" ] } ], "source": [
"dm4_list, _ = parse_metadata(\n",
"    neoarm_dm4,\n",
"    write_output=False,\n",
"    generate_preview=False,\n",
")\n",
"\n",
"print(f\"Signals in file: {len(dm4_list)}\")\n",
"for i, m in enumerate(dm4_list):\n",
"    nx = m[\"nx_meta\"]\n",
"    dims = nx.get(\"Data Dimensions\", \"?\")\n",
"    print(f\" [{i}] {nx['DatasetType']:15s} {nx['Data Type']:25s} dims={dims}\")" ] }, { "cell_type": "markdown", "id": "302e8bb6", "metadata": {}, "source": [ "### 2d. Zeiss Orion HIM TIFF" ] }, { "cell_type": "code", "execution_count": 6, "id": "de0b3370", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DatasetType : Image\n", "Data Type : HIM_Imaging\n", "Creation Time : 2026-02-18T21:54:03.444135+00:00\n" ] } ], "source": [
"orion_list, _ = parse_metadata(\n",
"    orion_tif,\n",
"    write_output=False,\n",
"    generate_preview=False,\n",
")\n",
"\n",
"nx = orion_list[0][\"nx_meta\"]\n",
"print(f\"DatasetType : {nx['DatasetType']}\")\n",
"print(f\"Data Type : {nx['Data Type']}\")\n",
"print(f\"Creation Time : {nx['Creation Time']}\")" ] }, { "cell_type": "markdown", "id": "c851b19e", "metadata": {}, "source": [ "### 2e.
EDAX spectrum files (`.spc` / `.msa`)" ] }, { "cell_type": "code", "execution_count": 7, "id": "cf375417", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "leo_edax_test.spc\n", " DatasetType : Spectrum\n", " Data Type : EDS_Spectrum\n", " Creation Time: 2026-02-18T21:54:03.966308+00:00\n", "\n", "\u001b[33;20mWARNING | Hyperspy | `signal_type='EDS'` not understood. See `hs.print_known_signal_types()` for a list of installed signal types or https://github.com/hyperspy/hyperspy-extensions-list for the list of all hyperspy extensions providing signals. (hyperspy.io:745)\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2026-02-18 15:13:07,245 hyperspy.io WARNING: `signal_type='EDS'` not understood. See `hs.print_known_signal_types()` for a list of installed signal types or https://github.com/hyperspy/hyperspy-extensions-list for the list of all hyperspy extensions providing signals.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "leo_edax_test.msa\n", " DatasetType : Spectrum\n", " Data Type : EDS_Spectrum\n", " Creation Time: 2026-02-18T21:54:03.851536+00:00\n", "\n" ] } ], "source": [
"for path in [spc_file, msa_file]:\n",
"    result, _ = parse_metadata(path, write_output=False, generate_preview=False)\n",
"    nx = result[0][\"nx_meta\"]\n",
"    print(f\"{path.name}\")\n",
"    print(f\" DatasetType : {nx['DatasetType']}\")\n",
"    print(f\" Data Type : {nx['Data Type']}\")\n",
"    print(f\" Creation Time: {nx['Creation Time']}\")\n",
"    print()" ] }, { "cell_type": "markdown", "id": "99a70a42", "metadata": {}, "source": [ "## 3. Low-level API — `ExtractionContext` + `get_registry()`\n", "\n", "For more control you can work directly with the extractor registry. This lets\n", "you inspect which extractor was selected, or call `extract()` yourself without\n", "going through `parse_metadata()`."
] }, { "cell_type": "code", "execution_count": 8, "id": "b9196560", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Selected extractor : dm3_extractor\n", "Supported extensions: {'dm4', 'dm3'}\n" ] } ], "source": [ "from nexusLIMS.extractors.base import ExtractionContext\n", "from nexusLIMS.extractors.registry import get_registry\n", "\n", "# instrument=None is fine when there is no database to look up\n", "context = ExtractionContext(file_path=dm3_file, instrument=None)\n", "\n", "registry = get_registry()\n", "extractor = registry.get_extractor(context)\n", "\n", "print(f\"Selected extractor : {extractor.name}\")\n", "print(f\"Supported extensions: {extractor.supported_extensions}\")" ] }, { "cell_type": "code", "execution_count": 9, "id": "607b0a0d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Signals returned: 1\n", " DatasetType : Image\n", " Data Type : STEM_Imaging\n" ] } ], "source": [ "# Call extract() directly — returns list[dict], one dict per signal\n", "raw_list = extractor.extract(context)\n", "\n", "print(f\"Signals returned: {len(raw_list)}\")\n", "nx = raw_list[0][\"nx_meta\"]\n", "print(f\" DatasetType : {nx['DatasetType']}\")\n", "print(f\" Data Type : {nx['Data Type']}\")" ] }, { "cell_type": "markdown", "id": "b5bed6a7", "metadata": {}, "source": [ "### Listing all registered extractors" ] }, { "cell_type": "code", "execution_count": 10, "id": "8dd8c0ef", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name Priority Extensions\n", "----------------------------------------------------------------------\n", "orion_HIM_tif_extractor 150 tif, tiff\n", "tescan_tif_extractor 150 tif, tiff\n", "dm3_extractor 100 dm3, dm4\n", "msa_extractor 100 msa\n", "spc_extractor 100 spc\n", "ser_emi_extractor 100 ser\n", "quanta_tif_extractor 100 tif, tiff\n", "basic_file_info_extractor 0 (any)\n" ] } ], "source": [ "print(f\"{'Name':<35} 
{'Priority':>8} Extensions\")\n",
"print(\"-\" * 70)\n",
"for ext in registry.all_extractors:\n",
"    exts = (\n",
"        \", \".join(sorted(ext.supported_extensions))\n",
"        if ext.supported_extensions\n",
"        else \"(any)\"\n",
"    )\n",
"    print(f\"{ext.name:<35} {ext.priority:>8} {exts}\")" ] }, { "cell_type": "markdown", "id": "3395e321", "metadata": {}, "source": [ "## 4. Graceful degradation without NexusLIMS configuration\n", "\n", "| Feature | Without config | With config |\n", "|---------|:--------------:|:-----------:|\n", "| Metadata extraction | ✅ Full | ✅ Full |\n", "| Schema validation | ✅ Full | ✅ Full |\n", "| Instrument ID from database | ⚠️ Returns `None` | ✅ Looks up DB |\n", "| JSON sidecar write (`write_output`) | ⚠️ Skipped + warning | ✅ Written |\n", "| Preview thumbnail (`generate_preview`) | ⚠️ Skipped + warning | ✅ Generated |\n", "| Local instrument profiles | ⚠️ Skipped (built-ins active) | ✅ Loaded |\n", "\n", "If you call `parse_metadata()` with the defaults (`write_output=True,\n", "generate_preview=True`) and no config is present, NexusLIMS logs a warning for\n", "each skipped step but still returns the metadata list.
The cell below demonstrates that:" ] }, { "cell_type": "code", "execution_count": 11, "id": "6255e6f9", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2026-02-18 15:13:07,281 nexusLIMS.extractors WARNING: NexusLIMS config unavailable; skipping metadata file write (pass write_output=False to suppress this warning)\n", "2026-02-18 15:13:07,281 nexusLIMS.extractors WARNING: NexusLIMS config unavailable; skipping preview generation (pass generate_preview=False to suppress this warning)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Metadata returned despite missing config: True\n", "Previews list (all None when config missing): [None]\n" ] } ], "source": [
"import logging\n",
"\n",
"import nexusLIMS.extractors as _ext\n",
"\n",
"# Show warnings inline\n",
"logging.basicConfig(level=logging.WARNING)\n",
"\n",
"# Temporarily pretend config is unavailable\n",
"_real = _ext._config_available\n",
"_ext._config_available = lambda: False\n",
"\n",
"try:\n",
"    result, previews = parse_metadata(\n",
"        dm3_file,\n",
"        write_output=True,  # normally writes JSON — skipped with a warning\n",
"        generate_preview=True,  # normally makes thumbnail — skipped with a warning\n",
"    )\n",
"    print(f\"Metadata returned despite missing config: {result is not None}\")\n",
"    print(f\"Previews list (all None when config missing): {previews}\")\n",
"finally:\n",
"    _ext._config_available = _real  # restore" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.14" } }, "nbformat": 4, "nbformat_minor": 5 }