Use Cases
Thirteen scenarios showing how ROAD discovers, classifies, and governs sensitive unstructured data across industries.
A health system has accumulated decades of scanned patient forms, referral letters, and discharge summaries across on-premises NAS drives. None have been classified. Most are image-only PDFs, invisible to keyword-based scanners. ROAD runs OCR-enabled scanning across the full estate, classifies each document by type (clinical note, intake form, billing record), and pinpoints the exact location of PHI within each file, identifying which page and paragraph contains a patient name, date of birth, or medical record number. The output is a verified inventory of sensitive content, classified and indexed, with no files moved and no source environment disrupted.
A bank has years of loan closing packages, audit exports, and customer correspondence sitting on unmanaged network drives across multiple regional offices. No unified inventory exists. ROAD indexes all five source environments (NAS, S3, Azure Blob, SFTP, and Google Storage) into a single searchable index. LLM-powered classification distinguishes Social Security numbers from employee IDs and vendor reference codes that match the same nine-digit pattern. Each file is classified by document type (loan application, closing statement, correspondence) and flagged with the exact fields containing sensitive data, giving the compliance team a defensible, auditable view of what is actually in the file estate.
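To make the nine-digit ambiguity concrete, here is a minimal Python sketch. It is not ROAD's implementation: the regular expression, the `CONTEXT_CUES` table, and the function name are all hypothetical, and the keyword lookup is only a crude stand-in for LLM-powered classification. It shows that a single pattern matches all three kinds of value, so the label has to come from the surrounding text rather than from the digits.

```python
import re

# Nine digits with optional SSN-style separators. This one pattern matches
# SSNs, employee IDs, and vendor codes alike.
NINE_DIGITS = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")

# Hypothetical context cues: a crude stand-in for LLM classification,
# not ROAD's actual method.
CONTEXT_CUES = {
    "ssn": ("ssn", "social security"),
    "employee_id": ("employee id", "emp id"),
    "vendor_ref": ("vendor ref", "vendor code"),
}


def label_nine_digit_values(text):
    """Return (value, label) pairs, labelling each match from the text
    that precedes it on the same line."""
    results = []
    for line in text.splitlines():
        for match in NINE_DIGITS.finditer(line):
            before = line[:match.start()].lower()
            label = "unknown"
            for name, cues in CONTEXT_CUES.items():
                if any(cue in before for cue in cues):
                    label = name
                    break
            results.append((match.group(), label))
    return results


# Invented sample values for illustration only.
sample = (
    "Borrower SSN: 123-45-6789\n"
    "Employee ID: 987654321\n"
    "Vendor ref: 555443333\n"
    "Ticket 111223333\n"
)

for value, label in label_nine_digit_values(sample):
    print(value, label)
```

The last sample line has no recognizable cue and comes back as `unknown`, which is the kind of case where fixed keyword lists run out and context-aware classification is needed.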
An insurer has millions of claims documents spread across legacy file servers and an S3 archive. Files arrived through automated processes with no metadata beyond filename and creation date. ROAD applies document classification to distinguish adjuster notes, settlement agreements, medical records, and policy documents — without relying on filenames or folder structure. Custom extractors identify claimant identifiers, diagnosis codes, and financial settlement amounts in plain-language terms, with no ML training required. The output is a structured, classified index of the full claims file estate — searchable by document type, sensitive data category, and geographic territory.
A law firm is conducting a data risk audit across its document repositories. A significant portion of the estate consists of PNG and JPEG files — scanned court filings, signed agreements, and exhibits submitted as images. These files are invisible to the firm's existing scanner. ROAD applies OCR during indexing, extracts text from image content, and classifies each file by matter type. Entity extraction identifies person names, company names, and identification numbers embedded in image-based documents that have never been text-indexed. The audit surface now includes file types the previous tool never read.
A pharma company needs to verify that all GxP-relevant documents in its file estate are accounted for ahead of an FDA inspection. Study protocols, validation records, batch records, and deviation reports are distributed across multiple cloud storage environments with no central classification. ROAD runs across AWS S3 and Azure Blob simultaneously in File System Collection mode, classifying documents by regulatory type using LLM-based document classification. Natural language search — “find every document referencing a specific clinical trial number” — surfaces files that no keyword pattern would have reached. The result is a verified, searchable regulatory document inventory built without moving a single file.
A manufacturer needs to understand what proprietary and export-controlled content exists in its file estate before a cloud migration. Engineering drawings, CAD-adjacent image files, and technical specifications are stored in formats most scanners skip. ROAD indexes the full estate — including images and non-standard file types processed through Apache Tika — and applies custom extractors to identify part numbers, project codes, and specification references defined in plain language by the governance team. Document classification separates supplier contracts, internal engineering specs, and marketing materials so that each category can be evaluated against the appropriate control requirements before any files are moved.
A federal agency has accumulated files across on-premises NAS drives and SFTP servers used for inter-agency data exchange. No classification has ever been applied. ROAD runs File System Analysis (indexing, classifying, and scanning files without moving them) and applies PII detection with jurisdiction-level scoping for U.S. federal requirements. Scanned PDFs from paper-based workflows that were digitized but never text-indexed are processed via OCR on the first scan run. The output includes file path, document type, sensitive data category, and exact location within each document, giving the agency a complete, auditable record of what the file estate contains and where risk is concentrated.
A retailer's compliance team knows customer data lives in the CRM. What they don't know is how much of it has been exported into spreadsheets, CSV files, and PDFs sitting on shared drives and in cloud object storage — outside any governed system. ROAD scans across Google Storage and S3, classifies files by type, and applies entity extraction to surface files containing customer names, email addresses, purchase history, and loyalty identifiers. LLM-powered classification distinguishes actual customer records from marketing templates and product reference sheets that contain similar terms — reducing false positives that have made previous scan outputs unusable. The team gets a credible inventory of out-of-system customer data for the first time.
A utility company needs to produce a verified inventory of environmental permits, inspection reports, and regulatory correspondence ahead of a compliance review. These documents are scattered across NAS drives, Azure Blob, and SFTP servers used for regulatory submission. Many are scanned PDFs from field inspections — image-only documents never indexed by the company's existing tools. ROAD classifies each document by type (permit, inspection report, incident record, regulatory correspondence), extracts relevant identifiers (facility codes, permit numbers, inspection dates) using custom plain-language extractors, and produces a searchable index organized by classification and metadata. The compliance team can query the full estate in natural language rather than navigating folder structures.
A university has no unified view of what files exist across departmental file shares, Google Storage instances, and SFTP servers used for research data transfer. Student records, grant documentation, research datasets, and faculty correspondence are intermixed with no classification applied. ROAD runs File System Collection mode across all environments simultaneously, classifying documents by category — student record, research dataset, grant correspondence, administrative file — and applying PII detection scoped to FERPA-relevant data types. Natural language search lets the compliance office query the full estate without knowing in advance which directories or file types to look in. The output is the first accurate, classified picture of what the university's unstructured file estate actually contains.
A security operations team suspects that application logs ingested into Splunk contain sensitive data: customer identifiers, session tokens, and credentials logged by developers during debugging that were never removed from production pipelines. Traditional scanners cannot read log files at the line level. ROAD scans Splunk log exports, applies LLM-powered classification to distinguish actual sensitive data from matching patterns in system noise, and returns results that identify the exact line number where each sensitive item appears, with the item highlighted directly in the output. No analyst has to open a two-million-line log file and search it by hand. The team gets a precise, reviewable inventory of what is in the logs, where it is, and what category of sensitive data it represents, actionable without further triage.
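As an illustration of what a line-level result can look like, the following Python sketch streams a log export one line at a time and reports the line number, category, and highlighted item for each hit. It is a sketch under stated assumptions, not ROAD's output format: the `DETECTORS` patterns, the `Finding` fields, and the `>>` `<<` highlight markers are hypothetical, and plain regular expressions stand in for context-aware classification.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Hypothetical detectors for illustration; real detection would be
# context-aware rather than purely pattern-based.
DETECTORS = {
    "credential": re.compile(r"(?i)\bpassword=(\S+)"),
    "session_token": re.compile(r"(?i)\bsession_token=(\S+)"),
}


class Finding(NamedTuple):
    line_number: int   # 1-based position in the log export
    category: str      # which detector matched
    value: str         # the sensitive item itself
    highlighted: str   # the full line with the item wrapped in >> <<


def scan_log_lines(lines: Iterable[str]) -> Iterator[Finding]:
    """Yield one Finding per sensitive item, reading lines lazily so a
    multi-million-line export never has to be held in memory."""
    for line_number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        for category, pattern in DETECTORS.items():
            for match in pattern.finditer(text):
                start, end = match.span(1)
                highlighted = (
                    text[:start] + ">>" + text[start:end] + "<<" + text[end:]
                )
                yield Finding(line_number, category, match.group(1), highlighted)


# Invented log lines for illustration only.
sample_lines = [
    "INFO service started",
    "DEBUG login user=alice password=hunter2",
    "DEBUG session_token=abc123 refreshed",
]

for finding in scan_log_lines(sample_lines):
    print(finding.line_number, finding.category, finding.highlighted)
```

Because `scan_log_lines` accepts any iterable of strings, an open file handle can be passed in directly in place of the sample list.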
A security team needs to determine whether source code repositories contain hardcoded credentials, API keys, PII embedded in test fixtures, or sensitive data committed to version history. Keyword searches generate too much noise to be useful at scale. ROAD scans GitHub repositories, applies LLM-powered classification to understand context — distinguishing a real API key from a placeholder string in documentation — and returns results that identify the exact file, the exact line number, and the specific item flagged, highlighted in the output for immediate review. Custom extractors can be defined in plain language to target organization-specific token formats, internal identifier conventions, or proprietary reference codes that no pre-built pattern library would cover. The result is a defensible, reviewable inventory of sensitive content across the codebase — built without guesswork and without burying real findings under thousands of false positives.
An organization has decades of contracts stored in Content Manager. When each contract was ingested, key metadata fields — contract type, counterparty names, effective dates, expiration dates, and governing terms — were captured manually by staff. Over time, the process produced what manual processes always produce: incomplete records where fields were left blank, inconsistent entries where the same party was named differently across records, and outright errors where the wrong value was entered. The organization knows the metadata is unreliable but has no efficient way to verify or correct it at scale. ROAD rescans the full contract repository — reading PDFs, Word documents, and scanned image-based contracts via OCR — and applies LLM-powered entity extraction to pull contract type, party names, execution dates, expiration dates, and key commercial terms directly from the document content. Each extracted value is compared against the existing metadata record. Discrepancies are flagged: a contract classified as a vendor agreement that the document itself shows is a master services agreement; a counterparty name that does not match the signatory block; an expiration date that was never captured at all. The output is a structured correction report that gives the legal or operations team a precise, document-grounded basis for updating Content Manager records — without reopening a single file manually. The result is a contract repository where the metadata reflects what the documents actually say, not what someone typed years ago under time pressure.
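The comparison step described above can be sketched in a few lines of Python. This is an illustrative sketch only: the field names, the sample values, and the `find_discrepancies` helper are hypothetical and do not reflect Content Manager's schema or ROAD's report format. The extracted values are assumed to have already been produced by some upstream extraction step.

```python
from typing import Dict, List, Optional, Tuple

Discrepancy = Tuple[str, str, Optional[str], Optional[str]]


def normalize(value: Optional[str]) -> str:
    """Lower-case and collapse whitespace so trivial formatting
    differences are not reported as discrepancies."""
    return " ".join((value or "").lower().split())


def find_discrepancies(
    stored: Dict[str, Optional[str]],
    extracted: Dict[str, Optional[str]],
) -> List[Discrepancy]:
    """Compare stored metadata with values extracted from the document.

    Returns (field, kind, stored_value, extracted_value) tuples, where
    kind is "missing" (never captured) or "mismatch" (captured wrongly).
    """
    report: List[Discrepancy] = []
    for field, extracted_value in extracted.items():
        stored_value = stored.get(field)
        if not normalize(extracted_value):
            continue  # nothing was extracted, so there is nothing to compare
        if not normalize(stored_value):
            report.append((field, "missing", stored_value, extracted_value))
        elif normalize(stored_value) != normalize(extracted_value):
            report.append((field, "mismatch", stored_value, extracted_value))
    return report


# Invented records for illustration only.
stored_record = {
    "contract_type": "Vendor Agreement",
    "counterparty": "Acme Corp",
    "expiration_date": None,
}
extracted_record = {
    "contract_type": "Master Services Agreement",
    "counterparty": "acme  corp",
    "expiration_date": "2027-06-30",
}

for row in find_discrepancies(stored_record, extracted_record):
    print(row)
```

Exact matching after normalization is a deliberately simple choice. It would not recognize that two differently worded names refer to the same party, which is where extraction grounded in the document text does the heavier work.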