Specialized File Format Scanning: DICOM, Tableau, Pickle, and the “We Don’t Scan That” Problem
Most security programs are pretty comfortable talking about PDFs, Office documents, and maybe CSVs. But when I ask, “What are you doing about DICOM, EDI, Tableau extracts, pickle files, OneNote notebooks, Draw.io diagrams, and Java KeyStores?” the room usually goes quiet.
The truth is that some of the highest‑risk data stores in your environment live in specialized file formats that traditional DLP and DSPM tools were never designed to understand. If your platform shrugs and treats them as opaque blobs, you’re ignoring exactly the data regulators and attackers care about most.
This blog post looks at why specialized file format scanning matters for DICOM, EDI, Tableau extracts, pickle/joblib, OneNote, Draw.io, Java KeyStores, and LST catalogs, and how making them first‑class citizens in your DSPM program closes a huge visibility gap.
DICOM PHI Scanning: Medical Images That Aren’t “Just Images”
Let’s start with healthcare. In modern environments, nearly every CT, MRI, and X‑ray is stored as DICOM.
To many teams, that’s “just imaging,” but DICOM is actually a rich container: it carries patient names, dates of birth, medical record numbers, referring physicians, institution IDs, sometimes even Social Security numbers and insurance details, all in structured metadata alongside the image.
When those files get exported from tightly controlled PACS systems to research shares, cloud buckets, or AI training pipelines, that PHI comes along for the ride, often without any visibility from security.
Sentra’s DICOM reader pulls those metadata fields into tabular form so we can classify PHI wherever it shows up, not just in EHR databases. Instead of “DICOM = image, ignore,” you get structured visibility into the actual identifiers inside each file.
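To make "metadata into tabular form" concrete, here is a minimal sketch of the idea using only the Python standard library. It is not Sentra's reader. It assumes a DICOM Part 10 file whose elements are all encoded as Explicit VR Little Endian, it stops at the first undefined-length element, and the `PHI_TAGS` selection and the `read_phi_row` and `build_element` names are hypothetical. In practice you would more likely reach for a full library such as pydicom.

```python
import struct

# A few PHI-bearing attributes: (group, element) -> column name.
PHI_TAGS = {
    (0x0008, 0x0080): "InstitutionName",
    (0x0008, 0x0090): "ReferringPhysicianName",
    (0x0010, 0x0010): "PatientName",
    (0x0010, 0x0020): "PatientID",
    (0x0010, 0x0030): "PatientBirthDate",
}
# VRs encoded with 2 reserved bytes followed by a 4-byte length.
LONG_VRS = {b"OB", b"OD", b"OF", b"OL", b"OV", b"OW", b"SQ",
            b"SV", b"UC", b"UN", b"UR", b"UT", b"UV"}


def read_phi_row(data):
    """Return one row (dict) of PHI fields found in the DICOM header."""
    if data[128:132] != b"DICM":
        raise ValueError("not a DICOM Part 10 file")
    row, pos = {}, 132
    while pos + 8 <= len(data):
        group, elem = struct.unpack_from("<HH", data, pos)
        vr = data[pos + 4:pos + 6]
        if vr in LONG_VRS:
            if pos + 12 > len(data):
                break
            (length,) = struct.unpack_from("<I", data, pos + 8)
            pos += 12
        else:
            (length,) = struct.unpack_from("<H", data, pos + 6)
            pos += 8
        if length == 0xFFFFFFFF:
            break  # undefined length (e.g. sequences): out of scope here
        name = PHI_TAGS.get((group, elem))
        if name:
            raw = data[pos:pos + length]
            row[name] = raw.decode("ascii", "replace").rstrip(" \x00")
        pos += length
    return row


def build_element(group, elem, vr, value):
    """Demo helper: encode one short-VR element, padded to even length."""
    if len(value) % 2:
        value += b" "
    return struct.pack("<HH2sH", group, elem, vr, len(value)) + value


# Synthetic header with made-up values, for illustration only.
sample = (
    b"\x00" * 128 + b"DICM"
    + build_element(0x0010, 0x0010, b"PN", b"DOE^JANE")
    + build_element(0x0010, 0x0020, b"LO", b"MRN1234")
    + build_element(0x0010, 0x0030, b"DA", b"19800101")
)
row = read_phi_row(sample)
```

Each file becomes one row keyed by attribute name, which is the shape a classifier needs to flag names, record numbers, and birth dates.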
EDI File Scanning: Healthcare Transactions You Can Finally See
The same story plays out in EDI healthcare transactions. EDI 837s, 835s, and related formats are packed with patient demographics, diagnosis and procedure codes, insurance identifiers, and payment details. These files routinely move between providers, payers, and vendors, land in staging buckets, get archived, and quietly drift out of scope. They’re plain text but hard for a human to read, so they’re also not on most security teams’ radar.
We built an EDI parser specifically to turn those streams into structured data we can classify, so “EDI” stops being shorthand for “we hope that system is locked down.” With specialized EDI scanning in place, you can actually answer:
- Where do our 837/835 files live across cloud storage and file shares?
- Which of them contain regulated PHI and payment data?
- Who has access, and are they stored in the right geography?
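As a rough illustration of turning an EDI stream into rows, the sketch below splits an X12-style payload into segments and pulls subscriber details out of `NM1*IL` and `DMG` segments. It is a simplification, not Sentra's parser: it assumes `*` as the element separator and `~` as the segment terminator (real interchanges declare their delimiters in the ISA header), and the `parse_segments` and `subscriber_rows` names and the sample values are hypothetical.

```python
def parse_segments(raw, element_sep="*", segment_term="~"):
    """Split an X12-style payload into a list of element lists."""
    segments = []
    for chunk in raw.split(segment_term):
        chunk = chunk.strip()
        if chunk:
            segments.append(chunk.split(element_sep))
    return segments


def subscriber_rows(segments):
    """Build one row per NM1*IL (insured/subscriber) segment."""
    rows = []
    for seg in segments:
        if seg[0] == "NM1" and len(seg) > 1 and seg[1] == "IL":
            padded = seg + [""] * (10 - len(seg))  # tolerate short segments
            rows.append({
                "last_name": padded[3],
                "first_name": padded[4],
                "id_qualifier": padded[8],
                "member_id": padded[9],
            })
        elif seg[0] == "DMG" and rows and len(seg) > 2:
            # Demographics segment that follows the subscriber name.
            rows[-1]["birth_date"] = seg[2]
    return rows


# Made-up fragment of a claim, for illustration only.
sample = "NM1*IL*1*DOE*JANE****MI*W123456789~DMG*D8*19800101*F~CLM*A1*150~"
rows = subscriber_rows(parse_segments(sample))
```

Once the segments are rows with named fields, names, member IDs, and birth dates can be classified like any other table.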
Tableau Extract Scanning: Shadow Data in TDE and Hyper
In analytics, Tableau extracts (TDE/Hyper) are the poster child for shadow data. When an analyst pulls a subset of a production database into a local extract, they’ve just created a new, often uncontrolled copy of that data. Customer records, transaction histories, compensation data - whatever they could query is now sitting in a file that can be emailed, synced, uploaded, and forgotten.
Sentra’s Tableau readers crack open TDE and Hyper, extract the tables, and run the same classification we use on your core data stores. For SOX, financial data governance, and general cloud data security, that’s the only way to have an honest inventory of where your financial and customer data actually lives.
Instead of “Tableau extracts somewhere on that EC2 instance or in that S3 bucket,” you get:
- A clear map of which extracts exist
- Exactly which columns carry PII, PCI, or sensitive business data
- Visibility into who can access those shadow datasets
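To show what column-level findings can look like once an extract's tables are readable, here is a small sketch. It assumes the rows have already been pulled out of the extract (the reader itself is not shown), and the two regex detectors, the 80% threshold, and the `classify_columns` name are illustrative assumptions rather than Sentra's classifiers.

```python
import re

# Deliberately simple detectors; real classification is far broader.
DETECTORS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify_columns(columns, rows, threshold=0.8):
    """Label a column when most of its non-empty values match a detector."""
    findings = {}
    for idx, name in enumerate(columns):
        values = [str(r[idx]) for r in rows if r[idx] not in (None, "")]
        if not values:
            continue
        for label, pattern in DETECTORS.items():
            hits = sum(1 for v in values if pattern.match(v))
            if hits / len(values) >= threshold:
                findings.setdefault(name, []).append(label)
    return findings


# Made-up rows standing in for a table read from an extract.
columns = ["customer_id", "contact", "tax_id"]
rows = [
    (1, "jane.doe@example.com", "123-45-6789"),
    (2, "sam@example.org", "987-65-4321"),
]
findings = classify_columns(columns, rows)
```

Working per column rather than per file is what lets you say which fields in an extract are sensitive, not just that the file is.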
Pickle and Joblib Scanning: Seeing Inside ML and AI Artifacts
In modern ML and AI pipelines, formats like Python’s pickle and scikit‑learn’s joblib are everywhere.
They’re not just “model files”; they frequently contain:
- Serialized DataFrames
- Cached training samples
- Feature stores
Any of these can embed PII, financial data, or PHI from the datasets you used to build your models.
As AI governance and model transparency requirements tighten, having zero visibility into what’s baked into those artifacts isn’t tenable. You need to be able to answer questions like:
- What real data did we use to train this model?
- Did any regulated data sneak into training samples or feature stores?
Sentra extracts both tabular and textual content from pickle and joblib so you can finally treat ML artifacts as governed data stores, not opaque byproducts. That’s the basis for answering, with evidence, what data you actually trained on.
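One practical wrinkle: loading an untrusted pickle can execute arbitrary code, so a scanner should not simply call `pickle.load`. The sketch below shows one way to pull text out of a pickle without unpickling it, by walking the opcode stream with the standard library's `pickletools.genops`. It is an illustration under stated assumptions, not Sentra's extractor: the `pickle_strings` name is hypothetical, and it only sees string constants stored directly in the stream, so compressed joblib files or array buffers would need extra handling.

```python
import pickle
import pickletools

# Opcodes whose argument is a string constant embedded in the pickle.
STRING_OPCODES = {
    "UNICODE", "BINUNICODE", "SHORT_BINUNICODE", "BINUNICODE8",
    "STRING", "BINSTRING", "SHORT_BINSTRING",
}


def pickle_strings(blob):
    """List string constants in a pickle without executing it."""
    found = []
    for opcode, arg, _pos in pickletools.genops(blob):
        if opcode.name in STRING_OPCODES and isinstance(arg, str):
            found.append(arg)
    return found


# Made-up cached training sample, for illustration only.
cached = {"email": "jane.doe@example.com", "rows": [{"ssn": "123-45-6789"}]}
strings = pickle_strings(pickle.dumps(cached))
```

The resulting strings can then flow through the same text classification as any other unstructured content, which is how a "model file" turns out to contain an email address or an SSN.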
OneNote, Draw.io, Java KeyStores, and LST: Everyday Tools, High Impact Risk
Even day‑to‑day productivity tools become risk multipliers when you can’t see inside them.
OneNote Notebook Scanning
OneNote notebooks are used for:
- Meeting notes
- Project docs
- Onboarding checklists
- Internal knowledge bases
As a result, they tend to accumulate customer details, credentials, financial numbers, and strategy discussions in an unstructured, nested hierarchy. Without specialized OneNote scanning, those notebooks become an ungoverned archive of PII, secrets, and sensitive business context living in SharePoint, OneDrive, or exported file shares.
Draw.io Diagram Scanning
Draw.io diagrams are full of labels that reference:
- Server names and IP ranges
- Database identifiers
- Customer names and environments
Treating .drawio files as “just diagrams” misses the fact that they often encode both network topology and customer context in plain text. With a dedicated reader, those labels flow through the same classification as any other unstructured text.
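For a sense of how little it takes to get at those labels, here is a minimal sketch using the standard library's XML parser. It assumes an uncompressed .drawio file, where shapes are `mxCell` elements with a `value` attribute (or `object`/`UserObject` wrappers with a `label` attribute). Files that store the diagram payload compressed would need decoding first. The `drawio_labels` name and the sample diagram are hypothetical.

```python
import xml.etree.ElementTree as ET


def drawio_labels(xml_text):
    """Collect the visible text labels from an uncompressed .drawio file."""
    root = ET.fromstring(xml_text)
    labels = []
    for node in root.iter():
        # mxCell keeps its label in "value"; wrapper elements use "label".
        text = node.get("value") if node.tag == "mxCell" else node.get("label")
        if text and text.strip():
            labels.append(text.strip())
    return labels


# Made-up diagram, for illustration only.
sample = """
<mxfile><diagram name="Page-1"><mxGraphModel><root>
  <mxCell id="0"/>
  <mxCell id="1" parent="0"/>
  <mxCell id="2" value="db-prod-01 (10.20.0.0/16)" vertex="1" parent="1"/>
  <mxCell id="3" value="Acme Corp tenant" vertex="1" parent="1"/>
</root></mxGraphModel></diagram></mxfile>
"""
labels = drawio_labels(sample)
```

Labels can contain HTML markup when rich text formatting is used, so a real reader would also strip tags before classification.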
Java KeyStore (JKS) Scanning
Java KeyStore (JKS) files hold keys and certificates - the crown jewels of many Java and Spring applications.
You might already inventory them for crypto hygiene, but they also matter for data security posture:
- Where are private keys stored?
- Are keystores sitting in publicly reachable locations or over‑permissive buckets?
- Which identities and apps are effectively protected by (or exposed through) those keystores?
Bringing JKS into your DSPM coverage means you can correlate where keys live with where your most sensitive data lives and moves.
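Answering "where are private keys stored?" does not require the keystore password, because aliases and entry types sit unencrypted in the file. The sketch below lists them from the classic JKS layout (magic number `0xFEEDFEED`, then version, entry count, and tagged entries). It is an illustration, not Sentra's reader: it handles only the JKS format (not PKCS#12 or JCEKS), skips integrity verification, and `list_jks_entries` and the synthetic sample are hypothetical.

```python
import struct

JKS_MAGIC = 0xFEEDFEED


def list_jks_entries(data):
    """Return alias and entry kind for each entry in a JKS keystore."""
    magic, version, count = struct.unpack_from(">III", data, 0)
    if magic != JKS_MAGIC:
        raise ValueError("not a JKS keystore")
    pos = 12

    def read_utf():
        nonlocal pos
        (n,) = struct.unpack_from(">H", data, pos)
        text = data[pos + 2:pos + 2 + n].decode("utf-8", "replace")
        pos += 2 + n
        return text

    def skip_blob():
        nonlocal pos
        (n,) = struct.unpack_from(">I", data, pos)
        pos += 4 + n

    def skip_cert():
        if version == 2:
            read_utf()  # certificate type, e.g. "X.509"
        skip_blob()

    entries = []
    for _ in range(count):
        (tag,) = struct.unpack_from(">I", data, pos)
        pos += 4
        alias = read_utf()
        pos += 8  # creation timestamp
        if tag == 1:
            skip_blob()  # encrypted private key
            (chain_len,) = struct.unpack_from(">I", data, pos)
            pos += 4
            for _ in range(chain_len):
                skip_cert()
            kind = "private_key"
        elif tag == 2:
            skip_cert()
            kind = "trusted_cert"
        else:
            raise ValueError("unknown entry tag")
        entries.append({"alias": alias, "kind": kind})
    return entries


def _utf(text):
    raw = text.encode("utf-8")
    return struct.pack(">H", len(raw)) + raw


def _blob(raw):
    return struct.pack(">I", len(raw)) + raw


# Synthetic keystore with placeholder key and cert bytes, for illustration.
sample = (
    struct.pack(">III", JKS_MAGIC, 2, 2)
    + struct.pack(">I", 1) + _utf("app-signing") + struct.pack(">Q", 0)
    + _blob(b"encrypted-key") + struct.pack(">I", 1)
    + _utf("X.509") + _blob(b"cert")
    + struct.pack(">I", 2) + _utf("root-ca") + struct.pack(">Q", 0)
    + _utf("X.509") + _blob(b"cert")
    + b"\x00" * 20  # placeholder for the trailing integrity hash
)
entries = list_jks_entries(sample)
```

An inventory of aliases and entry kinds per file is enough to flag keystores holding private keys in over-permissive locations.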
LST Catalog Scanning
LST catalogs are tabular, cross‑system indexes of important IDs, records, or objects, so they quietly gather sensitive entities from many systems into one file.
Scanning LST files as structured tables, rather than raw text, lets you:
- Identify when sensitive IDs or mappings are being replicated into uncontrolled locations
- Tie those catalog entries back to regulated source systems
Why Specialized File Format Scanning Is Not an Edge Case
None of these formats are edge cases. For healthcare, financial services, and AI‑heavy organizations, they sit squarely in the blast radius of your biggest risks:
- DICOM & EDI: PHI and claims data well inside HIPAA and regional healthcare regulations
- Tableau extracts: Financial, customer, and HR data copied into BI workflows, which makes them critical for SOX and privacy regimes
- Pickle/joblib: Training data and features embedded in ML artifacts, a central concern of emerging AI regulations
- OneNote, Draw.io, JKS, LST: The connective tissue of how your infrastructure and customer data are actually used day‑to‑day
That’s why Sentra’s extraction engine supports 150+ file types and treats specialized formats as first‑class citizens in your DSPM program, not as “we’ll get to that later” backlog items.
From Opaque Blobs to Governed Data: How Sentra Helps
Sensitive data doesn’t respect format boundaries, and neither can your visibility. With Sentra’s specialized file format scanning, you can discover formats like DICOM, EDI, Tableau extracts, pickle/joblib, OneNote, Draw.io, JKS, LST, and more across S3, Azure Blob, GCS, file shares, and SaaS environments. Sentra goes beyond surface metadata by parsing and extracting the true structure and content - both tabular and unstructured - so you can accurately classify PHI, PCI, PII, secrets, and sensitive business data at the level where it actually lives, such as fields, columns, and labels.
All of this is integrated into the same DSPM policies you already apply to databases, data lakes, and email archives. If you want to understand how this specialized format coverage fits into Sentra’s broader AI-ready data security and governance approach, you can explore the data security platform overview at sentra.io or connect with us to discuss your specific stack and file formats. After all, the most dangerous data is often hiding in the files your tools still ignore.