{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ae6d6280",
   "metadata": {},
   "source": "# Standalone Extractor Usage\n\nThe NexusLIMS extractor system can be used as a **standalone Python library**\nwithout a full NexusLIMS deployment.  No `.env` file, database, NEMO instance,\nor CDCS server is required.\n\nThis notebook demonstrates:\n\n1. Downloading example microscopy files from the project repository\n2. Extracting metadata with the high-level `parse_metadata()` API\n3. Using the low-level registry API (`ExtractionContext` + `get_registry()`)\n4. What degrades gracefully when no NexusLIMS config is present\n\n:::{note}\n{download}`Download this notebook <standalone_extractor_usage.ipynb>` to run it locally.\n:::\n\n## Installation\n\n```bash\npip install nexusLIMS\n# or, if you use uv:\nuv add nexusLIMS\n```"
  },
  {
   "cell_type": "markdown",
   "id": "0158ab63",
   "metadata": {},
   "source": [
    "## 1. Download example data files\n",
    "\n",
    "We pull a handful of microscopy files straight from the NexusLIMS test suite\n",
    "on GitHub so this notebook is self-contained.  Some larger files are stored as\n",
    "`.tar.gz` archives in the repository; those are downloaded and extracted here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "4a9f07ca",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  already present : test_STEM_image.dm3\n",
      "  already present : orion-zeiss_dataZeroed.tif\n",
      "  already present : leo_edax_test.msa\n",
      "  already present : leo_edax_test.spc\n",
      "  already present : TEM_list_signal_dataZeroed.dm3\n",
      "  already present : neoarm-gatan_SI_dataZeroed.dm4\n",
      "\n",
      "All files ready.\n"
     ]
    }
   ],
   "source": [
    "import tarfile\n",
    "import urllib.request\n",
    "from pathlib import Path\n",
    "\n",
    "# Raw-content base URL for test files in the NexusLIMS repository\n",
    "_BASE = \"https://raw.githubusercontent.com/datasophos/NexusLIMS/main/tests/unit/files/\"\n",
    "\n",
    "data_dir = Path(\"example_data\")\n",
    "data_dir.mkdir(exist_ok=True)\n",
    "\n",
    "\n",
    "def fetch(remote_name: str, extract_to: Path | None = None) -> Path:\n",
    "    \"\"\"Download a file from the NexusLIMS test suite.\n",
    "\n",
    "    If *extract_to* is given the file is treated as a .tar.gz archive:\n",
    "    it is downloaded to a temp path and extracted into *extract_to*.\n",
    "    Returns the path of the (extracted) file.\n",
    "    \"\"\"\n",
    "    url = _BASE + remote_name\n",
    "    dest = data_dir / remote_name\n",
    "\n",
    "    if extract_to is None:\n",
    "        # Plain file download\n",
    "        if dest.exists():\n",
    "            print(f\"  already present : {remote_name}\")\n",
    "        else:\n",
    "            print(f\"  downloading {remote_name} …\", end=\" \", flush=True)\n",
    "            urllib.request.urlretrieve(url, dest)\n",
    "            print(f\"done ({dest.stat().st_size:,} bytes)\")\n",
    "        return dest\n",
    "    else:\n",
    "        # Archive download + extraction\n",
    "        # The member file has the same stem as the archive (e.g., foo.dm3.tar.gz → foo.dm3)\n",
    "        stem = remote_name.replace(\".tar.gz\", \"\")\n",
    "        extracted = extract_to / stem\n",
    "        if extracted.exists():\n",
    "            print(f\"  already present : {stem}\")\n",
    "        else:\n",
    "            archive_path = data_dir / remote_name\n",
    "            if not archive_path.exists():\n",
    "                print(f\"  downloading {remote_name} …\", end=\" \", flush=True)\n",
    "                urllib.request.urlretrieve(url, archive_path)\n",
    "                print(f\"done ({archive_path.stat().st_size:,} bytes)\")\n",
    "            print(f\"  extracting  {stem} …\", end=\" \", flush=True)\n",
    "            with tarfile.open(archive_path) as tf:\n",
    "                tf.extract(stem, path=extract_to, filter=\"data\")\n",
    "            print(f\"done ({extracted.stat().st_size:,} bytes)\")\n",
    "        return extracted\n",
    "\n",
    "\n",
    "# --- Files stored directly in the repo ---\n",
    "dm3_file = fetch(\"test_STEM_image.dm3\")\n",
    "orion_tif = fetch(\"orion-zeiss_dataZeroed.tif\")\n",
    "msa_file = fetch(\"leo_edax_test.msa\")\n",
    "spc_file = fetch(\"leo_edax_test.spc\")\n",
    "\n",
    "# --- Files stored as archives ---\n",
    "multi_dm3 = fetch(\"TEM_list_signal_dataZeroed.dm3.tar.gz\", extract_to=data_dir)\n",
    "neoarm_dm4 = fetch(\"neoarm-gatan_SI_dataZeroed.dm4.tar.gz\", extract_to=data_dir)\n",
    "\n",
    "print(\"\\nAll files ready.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0fa9d627",
   "metadata": {},
   "source": [
    "## 2. High-level API — `parse_metadata()`\n",
    "\n",
    "`parse_metadata()` is the main entry point.  Pass\n",
    "`write_output=False, generate_preview=False` to skip the NexusLIMS\n",
    "deployment-specific steps (writing JSON sidecars and generating thumbnails).\n",
    "\n",
    "### 2a. Gatan DM3 — single STEM image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "fee97546",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Signals found : 1\n",
      "DatasetType   : Image\n",
      "Data Type     : STEM_Imaging\n",
      "Creation Time : 2026-02-18T14:54:03.330955-07:00\n"
     ]
    }
   ],
   "source": [
    "from nexusLIMS.extractors import parse_metadata\n",
    "\n",
    "metadata_list, _previews = parse_metadata(\n",
    "    dm3_file,\n",
    "    write_output=False,\n",
    "    generate_preview=False,\n",
    ")\n",
    "\n",
    "print(f\"Signals found : {len(metadata_list)}\")\n",
    "nx = metadata_list[0][\"nx_meta\"]\n",
    "print(f\"DatasetType   : {nx['DatasetType']}\")\n",
    "print(f\"Data Type     : {nx['Data Type']}\")\n",
    "print(f\"Creation Time : {nx['Creation Time']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee2b1391",
   "metadata": {},
   "source": [
    "The full `nx_meta` dictionary contains everything the extractor could pull from the file.\n",
    "A small helper makes nested dicts and Pint Quantity objects readable:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "d4707375",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "acceleration_voltage: 200000.0 volt\n",
      "acquisition_device: DigiScan\n",
      "Creation Time: 2026-02-18T14:54:03.330955-07:00\n",
      "Data Dimensions: (68, 68)\n",
      "Data Type: STEM_Imaging\n",
      "DatasetType: Image\n",
      "dwell_time: 3.5 microsecond\n",
      "extensions:\n",
      "  Cs: 0.0 millimeter\n",
      "  GMS Version: 2.31.734.0\n",
      "  Illumination Mode: STEM NANOPROBE\n",
      "  Imaging Mode: DIFFRACTION\n",
      "  Microscope: FEI Titan\n",
      "  Name: TEST Titan Remote\n",
      "  Operation Mode: SCANNING\n",
      "  STEM Camera Length: 135.0 millimeter\n",
      "horizontal_field_width: 0.5090058644612631 micrometer\n",
      "magnification: 225000.0\n",
      "stage_position:\n",
      "  tilt_alpha: 24.950478513002935 degree\n",
      "  x: -461.276 micrometer\n",
      "  y: 52.0039 micrometer\n",
      "  z: 35.033899999999996 millimeter\n",
      "warnings: []\n",
      "NexusLIMS Extraction:\n",
      "  Date: 2026-02-18T15:13:06.884944-07:00\n",
      "  Module: nexusLIMS.extractors.plugins.dm3_extractor\n",
      "  Version: 2.4.2.dev0\n"
     ]
    }
   ],
   "source": [
    "from decimal import Decimal\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def _to_display(obj):\n",
    "    \"\"\"Convert quantities and numpy scalars to printable strings.\"\"\"\n",
    "    if hasattr(obj, \"magnitude\"):  # pint Quantity\n",
    "        return f\"{obj.magnitude} {obj.units}\"\n",
    "    if isinstance(obj, (np.integer, np.floating)):\n",
    "        return obj.item()\n",
    "    if isinstance(obj, Decimal):\n",
    "        return float(obj)\n",
    "    return obj\n",
    "\n",
    "\n",
    "def pretty(d, indent=0):\n",
    "    for k, v in d.items():\n",
    "        if isinstance(v, dict):\n",
    "            print(\" \" * indent + f\"{k}:\")\n",
    "            pretty(v, indent + 2)\n",
    "        else:\n",
    "            print(\" \" * indent + f\"{k}: {_to_display(v)}\")\n",
    "\n",
    "\n",
    "pretty(nx)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "954b3588",
   "metadata": {},
   "source": [
    "### 2b. Multi-signal DM3 file\n",
    "\n",
    "Some DM3/DM4 files contain multiple signals (e.g., a survey image alongside\n",
    "an EELS spectrum).  `parse_metadata()` returns one metadata dict per signal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "6c0662b8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Signals in file: 2\n",
      "  [0] Image            STEM_Imaging               dims=(512, 512)\n",
      "  [1] Image            TEM_Imaging                dims=(512,)\n"
     ]
    }
   ],
   "source": [
    "multi_list, _ = parse_metadata(\n",
    "    multi_dm3,\n",
    "    write_output=False,\n",
    "    generate_preview=False,\n",
    ")\n",
    "\n",
    "print(f\"Signals in file: {len(multi_list)}\")\n",
    "for i, m in enumerate(multi_list):\n",
    "    nx = m[\"nx_meta\"]\n",
    "    dims = nx.get(\"Data Dimensions\", \"?\")\n",
    "    print(f\"  [{i}] {nx['DatasetType']:15s}  {nx['Data Type']:25s}  dims={dims}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cccbd9b9",
   "metadata": {},
   "source": [
    "### 2c. Gatan DM4 — multi-signal spectrum image (EELS/EDS SI)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "5ad2519d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Signals in file: 4\n",
      "  [0] Image            STEM_Imaging               dims=(512, 512)\n",
      "  [1] Image            STEM_Imaging               dims=(52, 106)\n",
      "  [2] Image            STEM_Imaging               dims=(52, 106)\n",
      "  [3] SpectrumImage    EDS_Spectrum_Imaging       dims=(52, 106, 2048)\n"
     ]
    }
   ],
   "source": [
    "dm4_list, _ = parse_metadata(\n",
    "    neoarm_dm4,\n",
    "    write_output=False,\n",
    "    generate_preview=False,\n",
    ")\n",
    "\n",
    "print(f\"Signals in file: {len(dm4_list)}\")\n",
    "for i, m in enumerate(dm4_list):\n",
    "    nx = m[\"nx_meta\"]\n",
    "    dims = nx.get(\"Data Dimensions\", \"?\")\n",
    "    print(f\"  [{i}] {nx['DatasetType']:15s}  {nx['Data Type']:25s}  dims={dims}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "302e8bb6",
   "metadata": {},
   "source": [
    "### 2d. Zeiss Orion HIM TIFF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "de0b3370",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DatasetType   : Image\n",
      "Data Type     : HIM_Imaging\n",
      "Creation Time : 2026-02-18T21:54:03.444135+00:00\n"
     ]
    }
   ],
   "source": [
    "orion_list, _ = parse_metadata(\n",
    "    orion_tif,\n",
    "    write_output=False,\n",
    "    generate_preview=False,\n",
    ")\n",
    "\n",
    "nx = orion_list[0][\"nx_meta\"]\n",
    "print(f\"DatasetType   : {nx['DatasetType']}\")\n",
    "print(f\"Data Type     : {nx['Data Type']}\")\n",
    "print(f\"Creation Time : {nx['Creation Time']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c851b19e",
   "metadata": {},
   "source": [
    "### 2e. EDAX spectrum files (`.spc` / `.msa`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "cf375417",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "leo_edax_test.spc\n",
      "  DatasetType  : Spectrum\n",
      "  Data Type    : EDS_Spectrum\n",
      "  Creation Time: 2026-02-18T21:54:03.966308+00:00\n",
      "\n",
      "\u001b[33;20mWARNING | Hyperspy | `signal_type='EDS'` not understood. See `hs.print_known_signal_types()` for a list of installed signal types or https://github.com/hyperspy/hyperspy-extensions-list for the list of all hyperspy extensions providing signals. (hyperspy.io:745)\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2026-02-18 15:13:07,245 hyperspy.io WARNING: `signal_type='EDS'` not understood. See `hs.print_known_signal_types()` for a list of installed signal types or https://github.com/hyperspy/hyperspy-extensions-list for the list of all hyperspy extensions providing signals.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "leo_edax_test.msa\n",
      "  DatasetType  : Spectrum\n",
      "  Data Type    : EDS_Spectrum\n",
      "  Creation Time: 2026-02-18T21:54:03.851536+00:00\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for path in [spc_file, msa_file]:\n",
    "    result, _ = parse_metadata(path, write_output=False, generate_preview=False)\n",
    "    nx = result[0][\"nx_meta\"]\n",
    "    print(f\"{path.name}\")\n",
    "    print(f\"  DatasetType  : {nx['DatasetType']}\")\n",
    "    print(f\"  Data Type    : {nx['Data Type']}\")\n",
    "    print(f\"  Creation Time: {nx['Creation Time']}\")\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99a70a42",
   "metadata": {},
   "source": [
    "## 3. Low-level API — `ExtractionContext` + `get_registry()`\n",
    "\n",
    "For more control you can work directly with the extractor registry.  This lets\n",
    "you inspect which extractor was selected, or call `extract()` yourself without\n",
    "going through `parse_metadata()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "b9196560",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Selected extractor  : dm3_extractor\n",
      "Supported extensions: {'dm4', 'dm3'}\n"
     ]
    }
   ],
   "source": [
    "from nexusLIMS.extractors.base import ExtractionContext\n",
    "from nexusLIMS.extractors.registry import get_registry\n",
    "\n",
    "# instrument=None is fine when there is no database to look up\n",
    "context = ExtractionContext(file_path=dm3_file, instrument=None)\n",
    "\n",
    "registry = get_registry()\n",
    "extractor = registry.get_extractor(context)\n",
    "\n",
    "print(f\"Selected extractor  : {extractor.name}\")\n",
    "print(f\"Supported extensions: {extractor.supported_extensions}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "607b0a0d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Signals returned: 1\n",
      "  DatasetType : Image\n",
      "  Data Type   : STEM_Imaging\n"
     ]
    }
   ],
   "source": [
    "# Call extract() directly — returns list[dict], one dict per signal\n",
    "raw_list = extractor.extract(context)\n",
    "\n",
    "print(f\"Signals returned: {len(raw_list)}\")\n",
    "nx = raw_list[0][\"nx_meta\"]\n",
    "print(f\"  DatasetType : {nx['DatasetType']}\")\n",
    "print(f\"  Data Type   : {nx['Data Type']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5bed6a7",
   "metadata": {},
   "source": [
    "### Listing all registered extractors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "8dd8c0ef",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Name                                Priority  Extensions\n",
      "----------------------------------------------------------------------\n",
      "orion_HIM_tif_extractor                  150  tif, tiff\n",
      "tescan_tif_extractor                     150  tif, tiff\n",
      "dm3_extractor                            100  dm3, dm4\n",
      "msa_extractor                            100  msa\n",
      "spc_extractor                            100  spc\n",
      "ser_emi_extractor                        100  ser\n",
      "quanta_tif_extractor                     100  tif, tiff\n",
      "basic_file_info_extractor                  0  (any)\n"
     ]
    }
   ],
   "source": [
    "print(f\"{'Name':<35} {'Priority':>8}  Extensions\")\n",
    "print(\"-\" * 70)\n",
    "for ext in registry.all_extractors:\n",
    "    exts = (\n",
    "        \", \".join(sorted(ext.supported_extensions))\n",
    "        if ext.supported_extensions\n",
    "        else \"(any)\"\n",
    "    )\n",
    "    print(f\"{ext.name:<35} {ext.priority:>8}  {exts}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3395e321",
   "metadata": {},
   "source": [
    "## 4. Graceful degradation without NexusLIMS configuration\n",
    "\n",
    "| Feature | Without config | With config |\n",
    "|---------|:--------------:|:-----------:|\n",
    "| Metadata extraction | ✅ Full | ✅ Full |\n",
    "| Schema validation | ✅ Full | ✅ Full |\n",
    "| Instrument ID from database | ⚠️ Returns `None` | ✅ Looks up DB |\n",
    "| JSON sidecar write (`write_output`) | ⚠️ Skipped + warning | ✅ Written |\n",
    "| Preview thumbnail (`generate_preview`) | ⚠️ Skipped + warning | ✅ Generated |\n",
    "| Local instrument profiles | ⚠️ Skipped (built-ins active) | ✅ Loaded |\n",
    "\n",
    "If you call `parse_metadata()` with the defaults (`write_output=True,\n",
    "generate_preview=True`) and no config is present, NexusLIMS logs a warning but\n",
    "still returns the metadata dict.  The cell below demonstrates that:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "6255e6f9",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2026-02-18 15:13:07,281 nexusLIMS.extractors WARNING: NexusLIMS config unavailable; skipping metadata file write (pass write_output=False to suppress this warning)\n",
      "2026-02-18 15:13:07,281 nexusLIMS.extractors WARNING: NexusLIMS config unavailable; skipping preview generation (pass generate_preview=False to suppress this warning)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Metadata returned despite missing config: True\n",
      "Previews list (all None when config missing): [None]\n"
     ]
    }
   ],
   "source": [
    "import logging\n",
    "\n",
    "import nexusLIMS.extractors as _ext\n",
    "\n",
    "# Show warnings inline\n",
    "logging.basicConfig(level=logging.WARNING)\n",
    "\n",
    "# Temporarily pretend config is unavailable\n",
    "_real = _ext._config_available\n",
    "_ext._config_available = lambda: False\n",
    "\n",
    "try:\n",
    "    result, previews = parse_metadata(\n",
    "        dm3_file,\n",
    "        write_output=True,  # normally writes JSON — skipped with a warning\n",
    "        generate_preview=True,  # normally makes thumbnail — skipped with a warning\n",
    "    )\n",
    "    print(f\"Metadata returned despite missing config: {result is not None}\")\n",
    "    print(f\"Previews list (all None when config missing): {previews}\")\n",
    "finally:\n",
    "    _ext._config_available = _real  # restore"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}