
Safeguarding Data Integrity and Privacy in the Age of AI-Powered Large Language Models (LLMs)

December 6, 2023 | 4 Min Read | Data Security

In the burgeoning realm of artificial intelligence (AI), Large Language Models (LLMs) have emerged as transformative tools, enabling the development of applications that revolutionize customer experiences and streamline business operations. These sophisticated AI models, trained on massive amounts of text data, can generate human-quality text, translate languages, write different kinds of creative content, and answer questions in an informative way.

Unfortunately, the extensive data consumption and rapid adoption of LLMs have also brought to light critical challenges surrounding the protection of data integrity and privacy during the training process. As organizations strive to harness the power of LLMs responsibly, it is imperative to address these vulnerabilities and ensure that sensitive information remains secure.

Challenges: Navigating the Risks of LLM Training

Training LLMs typically involves vast amounts of data, often containing sensitive information such as personally identifiable information (PII), intellectual property, and financial records. This wealth of data presents a tempting target for malicious actors seeking to exploit vulnerabilities and gain unauthorized access.

One of the primary challenges is preventing data leakage or public disclosure. LLMs can inadvertently disclose sensitive information if not properly configured or protected. This disclosure can occur through various means, such as unauthorized access to training data, vulnerabilities in the LLM itself, or improper handling of user inputs.

Another critical concern is avoiding overly permissive configurations. LLMs can be configured to allow users to provide inputs that may contain sensitive information. If these inputs are not adequately filtered or sanitized, they can be incorporated into the LLM's training data, potentially leading to the disclosure of sensitive information.
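
As a concrete illustration of such filtering, here is a minimal Python sketch that redacts likely PII from user inputs before they are retained or considered for training. The patterns are illustrative assumptions only; production pipelines layer on far more robust detection, such as NER models, checksum validation, and context analysis.

```python
import re

# Illustrative patterns only - real detection combines many more
# formats with NER models, checksums, and context analysis.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def sanitize(text: str) -> str:
    """Redact likely PII from a user input before it is logged
    or considered for inclusion in a training corpus."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED-{label.upper()}]", text)
    return text

print(sanitize("Reach me at jane@example.com, card 4111 1111 1111 1111"))
# -> Reach me at [REDACTED-EMAIL], card [REDACTED-CREDIT_CARD]
```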

Finally, organizations must be mindful of the potential for bias or error in LLM training data. Biased or erroneous data can lead to biased or erroneous outputs from the LLM, which can have detrimental consequences for individuals and organizations.

OWASP Top 10 for LLM Applications

The OWASP Top 10 for LLM Applications identifies and prioritizes critical vulnerabilities that can arise in LLM applications. Among these, LLM03 Training Data Poisoning, LLM06 Sensitive Information Disclosure, LLM08 Excessive Agency, and LLM10 Model Theft pose significant risks that cybersecurity professionals must address. Let's dive into these:


LLM03: Training Data Poisoning

LLM03 addresses the vulnerability of LLMs to training data poisoning, a malicious attack where carefully crafted data is injected into the training dataset to manipulate the model's behavior. This can lead to biased or erroneous outputs, undermining the model's reliability and trustworthiness.

The consequences of LLM03 can be severe. Poisoned models can generate biased or discriminatory content, perpetuating societal prejudices and causing harm to individuals or groups. Moreover, erroneous outputs can lead to flawed decision-making, resulting in financial losses, operational disruptions, or even safety hazards.


LLM06: Sensitive Information Disclosure

LLM06 highlights the vulnerability of LLMs to inadvertently disclosing sensitive information present in their training data. This can occur when the model is prompted to generate text or code that includes personally identifiable information (PII), trade secrets, or other confidential data.

The potential consequences of LLM06 are far-reaching. Data breaches can lead to financial losses, reputational damage, and regulatory penalties. Moreover, the disclosure of sensitive information can have severe implications for individuals, potentially compromising their privacy and security.

LLM08: Excessive Agency

LLM08 focuses on the risk of LLMs exhibiting excessive agency, meaning they may perform actions beyond their intended scope or generate outputs that cause harm or offense. This can manifest in various ways, such as the model generating discriminatory or biased content, engaging in unauthorized financial transactions, or even spreading misinformation.

Excessive agency poses a significant threat to organizations and society as a whole. Supply chain compromises and excessive permissions to AI-powered apps can erode trust, damage reputations, and even lead to legal or regulatory repercussions. Moreover, the spread of harmful or offensive content can have detrimental social impacts.

LLM10: Model Theft

LLM10 highlights the risk of model theft, where an adversary gains unauthorized access to a trained LLM or its underlying intellectual property. This can enable the adversary to replicate the model's capabilities for malicious purposes, such as generating misleading content, impersonating legitimate users, or conducting cyberattacks.

Model theft poses significant threats to organizations. The loss of intellectual property can lead to financial losses and competitive disadvantages. Moreover, stolen models can be used to spread misinformation, manipulate markets, or launch targeted attacks on individuals or organizations.

Recommendations: Adopting Responsible Data Protection Practices

To mitigate the risks associated with LLM training data, organizations must adopt a comprehensive approach to data protection. This approach should encompass data hygiene, policy enforcement, access controls, and continuous monitoring.

Data hygiene is essential for ensuring the integrity and privacy of LLM training data. Organizations should implement stringent data cleaning and sanitization procedures to remove sensitive information and identify potential biases or errors.

Policy enforcement is crucial for establishing clear guidelines for the handling of LLM training data. These policies should outline acceptable data sources, permissible data types, and restrictions on data access and usage.

Access controls should be implemented to restrict access to LLM training data to authorized personnel and identities only, including any third-party apps that may connect. This can be achieved through role-based access control (RBAC), zero-trust IAM, and multi-factor authentication (MFA).
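
The deny-by-default posture these mechanisms enforce can be illustrated in a few lines. Below is a minimal, hypothetical RBAC sketch in Python; the role names and permissions are invented for illustration and are not tied to any particular IAM product.

```python
from enum import Enum

class Permission(Enum):
    READ_TRAINING_DATA = "read_training_data"
    WRITE_TRAINING_DATA = "write_training_data"
    MANAGE_ACCESS = "manage_access"

# Hypothetical role-to-permission mapping for a training-data store.
ROLE_PERMISSIONS = {
    "ml_engineer": {Permission.READ_TRAINING_DATA},
    "data_steward": {Permission.READ_TRAINING_DATA, Permission.WRITE_TRAINING_DATA},
    "security_admin": {Permission.MANAGE_ACCESS},
}

def is_allowed(role: str, permission: Permission) -> bool:
    """Deny by default: a role holds only the permissions explicitly granted."""
    return permission in ROLE_PERMISSIONS.get(role, set())

assert is_allowed("data_steward", Permission.WRITE_TRAINING_DATA)
assert not is_allowed("ml_engineer", Permission.WRITE_TRAINING_DATA)
assert not is_allowed("unknown_role", Permission.READ_TRAINING_DATA)
```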

Continuous monitoring is essential for detecting and responding to potential threats and vulnerabilities. Organizations should implement real-time monitoring tools to identify suspicious activity and take timely action to prevent data breaches.

Solutions: Leveraging Technology to Safeguard Data

In the rush to innovate, developers must remain keenly aware of the risks inherent in training LLMs if they wish to deliver responsible, effective AI that does not jeopardize their customers' data. Specifically, it is a foremost duty to protect the integrity and privacy of LLM training datasets, which often contain sensitive information.

Preventing data leakage or public disclosure, avoiding overly permissive configurations, and negating bias or error that can contaminate such models should be top priorities.

Technological solutions play a pivotal role in safeguarding data integrity and privacy during LLM training. Data security posture management (DSPM) solutions can automate data security processes, enabling organizations to maintain a comprehensive data protection posture.

DSPM solutions provide a range of capabilities, including data discovery, data classification, data access governance (DAG), and data detection and response (DDR). These capabilities help organizations identify sensitive data, enforce access controls, detect data breaches, and respond to security incidents.

Cloud-native DSPM solutions offer enhanced agility and scalability, enabling organizations to adapt to evolving data security needs and protect data across diverse cloud environments.

Sentra: Automating LLM Data Security Processes

Having to worry about securing yet another threat vector should give overburdened security teams pause. But help is available.

Sentra has developed a data privacy and posture management solution that can automatically secure LLM training data in support of rapid AI application development.

The solution works in tandem with AWS SageMaker, GCP Vertex AI, and other AI development environments to support secure data usage within ML training activities. It combines key capabilities, including DSPM, DAG, and DDR, to deliver comprehensive data security and privacy.

Its cloud-native design discovers all of your data and ensures good data hygiene and security posture through policy enforcement, least-privilege access to sensitive data, and monitoring with near real-time alerting on suspicious identity (user/app/machine) activity, such as data exfiltration, so that attacks or malicious behavior can be thwarted early. The solution frees developers to innovate quickly and organizations to operate with agility, confident that their customer data and proprietary information will remain protected.

LLMs are now also built into Sentra’s classification engine and data security platform to provide unprecedented classification accuracy for unstructured data. Learn more about Large Language Models (LLMs) here.

Conclusion: Securing the Future of AI with Data Privacy

AI holds immense potential to transform our world, but its development and deployment must be accompanied by a steadfast commitment to data integrity and privacy. Protecting the integrity and privacy of data in LLMs is essential for building responsible and ethical AI applications. By implementing data protection best practices, organizations can mitigate the risks associated with data leakage, unauthorized access, and bias. Sentra's DSPM solution provides a comprehensive approach to data security and privacy, enabling organizations to develop and deploy LLMs with speed and confidence.

If you want to learn more about Sentra's Data Security Platform and how LLMs are now integrated into our classification engine to deliver unmatched accuracy for unstructured data, request a demo today.


David Stuart is Senior Director of Product Marketing for Sentra, a leading cloud-native data security platform provider, where he is responsible for product and launch planning, content creation, and analyst relations. Dave is a 20+ year security industry veteran having held product and marketing management positions at industry luminary companies such as Symantec, Sourcefire, Cisco, Tenable, and ZeroFox. Dave holds a BSEE/CS from University of Illinois, and an MBA from Northwestern Kellogg Graduate School of Management.


Latest Blog Posts

David Stuart and Nikki Ralston | November 24, 2025 | 3 Min Read

Third-Party OAuth Apps Are the New Shadow Data Risk: Lessons from the Gainsight/Salesforce Incident

The recent exposure of customer data through a compromised Gainsight integration within Salesforce environments is more than an isolated event - it’s a sign of a rapidly evolving class of SaaS supply-chain threats. Even trusted AppExchange partners can inadvertently create access pathways that attackers exploit, especially when OAuth tokens and machine-to-machine connections are involved. This post explores what happened, why today’s security tooling cannot fully address this scenario, and how data-centric visibility and identity governance can meaningfully reduce the blast radius of similar breaches.

A Recap of the Incident

In this case, attackers obtained sensitive credentials tied to a Gainsight integration used by multiple enterprises. Those credentials allowed adversaries to generate valid OAuth tokens and access customer Salesforce orgs, in some cases with extensive read capabilities. Neither Salesforce nor Gainsight intentionally misconfigured their systems. This was not a product flaw in either platform. Instead, the incident illustrates how deeply interconnected SaaS environments have become and how the security of one integration can impact many downstream customers.

Understanding the Kill Chain: From Stolen Secrets to Salesforce Lateral Movement

The attackers’ pathway followed a pattern increasingly common in SaaS-based attacks. It began with the theft of secrets: likely API keys, OAuth client secrets, or other credentials that often end up buried in repositories, CI/CD logs, or overlooked storage locations. Once in hand, these secrets enabled the attackers to generate long-lived OAuth tokens, which are designed for application-level access and operate outside MFA or user-based access controls.

What makes OAuth tokens particularly powerful is that they inherit whatever permissions the connected app holds. If an integration has broad read access, which many do for convenience or legacy reasons, an attacker who compromises its token suddenly gains the same level of visibility. Inside Salesforce, this enabled lateral movement across objects, records, and reporting surfaces far beyond the intended scope of the original integration. The entire kill chain was essentially a progression from a single weakly-protected secret to high-value data access across multiple Salesforce tenants.
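
To make those mechanics concrete, here is a hedged Python sketch of the OAuth 2.0 client-credentials flow; the endpoint, client ID, and secret below are placeholders, not real values. Notice that no human user or MFA challenge appears anywhere in the exchange: possession of the client secret alone is enough to mint a token that carries the connected app's full permissions.

```python
import requests

# Placeholder endpoint and credentials for illustration only.
TOKEN_URL = "https://login.example-tenant.com/services/oauth2/token"

resp = requests.post(TOKEN_URL, data={
    "grant_type": "client_credentials",      # app-to-app flow: no user, no MFA
    "client_id": "HYPOTHETICAL_CLIENT_ID",   # e.g., stolen from a repo
    "client_secret": "HYPOTHETICAL_SECRET",  # or from a CI/CD log
})
token = resp.json()["access_token"]

# The token inherits every scope granted to the connected app - often
# broad read access - so any API call it authorizes will succeed.
records = requests.get(
    "https://api.example-tenant.com/services/data/v60.0/query",
    params={"q": "SELECT Name, Email FROM Contact"},
    headers={"Authorization": f"Bearer {token}"},
)
```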

Why Traditional SaaS Security Tools Missed This

Incident response teams quickly learned what many organizations are now realizing: traditional CASBs and CSPMs don’t provide the level of identity-to-data context necessary to detect or prevent OAuth-driven supply-chain attacks.

CASBs primarily analyze user behavior and endpoint connections, but OAuth apps are “non-human identities” - they don’t log in through browsers or trigger interactive events. CSPMs, in contrast, focus on cloud misconfigurations and posture, but they don’t understand the fine-grained data models of SaaS platforms like Salesforce. What was missing in this incident was visibility into how much sensitive data the Gainsight connector could access and whether the privileges it held were appropriate or excessive. Without that context, organizations had no meaningful way to spot the risk until the compromise became public.

Sentra Helps Prevent and Contain This Attack Pattern

Sentra’s approach is fundamentally different because it starts with data: what exists, where it resides, who or what can access it, and whether that access is appropriate. Rather than treating Salesforce or other SaaS platforms as black boxes, Sentra maps the data structures inside them, identifies sensitive records, and correlates that information with identity permissions including third-party apps, machine identities, and OAuth sessions.

One key pillar of Sentra’s value lies in its DSPM capabilities. The platform identifies sensitive data across all repositories, including cloud storage, SaaS environments, data warehouses, code repositories, collaboration platforms, and even on-prem file systems. Because Sentra also detects secrets such as API keys, OAuth credentials, private keys, and authentication tokens across these environments, it becomes possible to catch compromised or improperly stored secrets before an attacker ever uses them to access a SaaS platform.
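
The core idea behind secret detection can be sketched simply; the Python below shows a few well-known token formats as an illustration, while real scanners combine many more patterns with entropy analysis and live validation.

```python
import re

# A small sample of well-known secret formats (illustrative subset).
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def scan_text(path: str, text: str) -> list[tuple[str, str]]:
    """Return (path, secret_type) findings for any pattern that matches."""
    return [(path, name) for name, rx in SECRET_PATTERNS.items() if rx.search(text)]

# AWS's documented example key, buried in a CI log:
findings = scan_text("ci/deploy.log", "export KEY=AKIAIOSFODNN7EXAMPLE")
print(findings)  # [('ci/deploy.log', 'aws_access_key_id')]
```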


Another area where this becomes critical is the detection of over-privileged connected apps. Sentra continuously evaluates the scopes and permissions granted to integrations like Gainsight, identifying when either an app or an identity holds more access than its business purpose requires. This type of analysis would have revealed that a compromised integrated app could see far more data than necessary, providing early signals of elevated risk long before an attacker exploited it.

Sentra further tracks the health and behavior of non-human identities. Service accounts and connectors often rely on long-lived credentials that are rarely rotated and may remain active long after the responsible team has changed. Sentra identifies these stale or overly permissive identities and highlights when their behavior deviates from historical norms. In the context of this incident type, that means detecting when a connector suddenly begins accessing objects it never touched before or when large volumes of data begin flowing to unexpected locations or IP ranges.
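
A minimal Python sketch of that baseline comparison follows; the identity name, event shape, and thresholds are assumptions for illustration, not Sentra's actual detection logic.

```python
# Hypothetical baseline for a non-human identity (a SaaS connector).
HISTORICAL_BASELINE = {
    "crm-connector": {
        "objects": {"Account", "Opportunity"},
        "avg_daily_rows": 50_000,
    },
}

def check_activity(identity: str, objects_touched: set[str], rows_read: int) -> list[str]:
    """Flag first-time object access and abnormal read volume."""
    baseline = HISTORICAL_BASELINE.get(identity)
    if baseline is None:
        return [f"{identity}: unknown non-human identity"]
    alerts = []
    novel = objects_touched - baseline["objects"]
    if novel:
        alerts.append(f"{identity}: first-time access to {sorted(novel)}")
    if rows_read > 10 * baseline["avg_daily_rows"]:
        alerts.append(f"{identity}: read volume {rows_read:,} far above baseline")
    return alerts

print(check_activity("crm-connector", {"Account", "Contact", "Lead"}, 2_000_000))
# Flags first-time access to ['Contact', 'Lead'] and the 2,000,000-row read.
```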

Finally, Sentra’s behavior analytics (part of DDR) help surface early signs of misuse. Even if an attacker obtains valid OAuth tokens, their data access patterns, query behavior, or geography often diverge from the legitimate integration. By correlating anomalous activity with the sensitivity of the data being accessed, Sentra can detect exfiltration patterns in real time—something traditional tools simply aren’t designed to do.

The 2026 Outlook: More Incidents Are Coming

The Gainsight/Salesforce incident is unlikely to be the last of its kind. The speed at which enterprises adopt SaaS integrations far exceeds the rate at which they assess the data exposure those integrations create. OAuth-based supply-chain attacks are growing quickly because they allow adversaries to compromise one provider and gain access to dozens or hundreds of downstream environments. Given the proliferation of partner ecosystems, machine identities, and unmonitored secrets, this attack vector will continue to scale.

Prediction:
Unless enterprises add data-centric SaaS visibility and identity-aware DSPM, we should expect three to five more incidents of similar magnitude before summer 2026.

Conclusion

The real lesson from the Gainsight/Salesforce breach is not to reduce reliance on third-party SaaS providers; modern business would grind to a halt without them. The lesson is that enterprises must know where their sensitive data lives, understand exactly which identities and integrations can access it, and ensure those privileges are continuously validated. Sentra provides that visibility and contextual intelligence, making it possible to identify the risks that made this breach possible and to help prevent the next one.


David Stuart | November 24, 2025 | 3 Min Read

Securing Unstructured Data in Microsoft 365: The Case for Petabyte-Scale, AI-Driven Classification

The modern enterprise runs on collaboration and nothing powers that more than Microsoft 365. From Exchange Online and OneDrive to SharePoint, Teams, and Copilot workflows, M365 hosts a massive and ever-growing volume of unstructured content: documents, presentations, spreadsheets, image files, chats, attachments, and more.

Yet unstructured data is harder to govern. Unlike tidy database tables with defined schemas, unstructured repositories fill up with ambiguous content types, buried duplicates, and unused legacy files. It’s in these stacks that sensitive IP, model training data, or derivative work can quietly accumulate, and then leak.

Consider this: one recent study found that more than 81% of IT professionals report data-loss events in M365 environments. To make matters worse, according to the International Data Corporation (IDC), 60% of organizations do not have a strategy for protecting the critical business data that resides in Microsoft 365.

Why Traditional Tools Struggle

  • Built-in classification tools (e.g., M365’s native capabilities) often rely on pattern matching or simple keywords, and therefore struggle with accuracy, context, scale, and derivative content (see the sketch after this list).

  • Many solutions only surface that a file exists and carries a type label - but stop short of mapping who or what can access it, its purpose, and what its downstream exposure might be.

  • GenAI workflows now pump massive volumes of unstructured data into copilots, knowledge bases, training sets - creating new blast radii that legacy DLP or labeling tools weren’t designed to catch.
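
A tiny Python illustration of the accuracy problem (using hypothetical strings): even a regex matcher hardened with a Luhn checksum flags a benign 16-digit tracking ID just as confidently as a real test card number, because pattern matching has no notion of context.

```python
import re

CARD_RX = re.compile(r"\b\d{16}\b")

def luhn_valid(number: str) -> bool:
    """Luhn checksum: filters some noise, but passing it still doesn't
    prove the digits are a card number in a sensitive context."""
    digits = [int(d) for d in reversed(number)]
    total = sum(digits[0::2]) + sum(sum(divmod(2 * d, 10)) for d in digits[1::2])
    return total % 10 == 0

samples = [
    "Card on file: 4111111111111111",  # a well-known test card number
    "Tracking ID 9999999999999995",    # benign ID that also passes Luhn
]
for text in samples:
    for m in CARD_RX.finditer(text):
        print(text, "->", "flagged" if luhn_valid(m.group()) else "ignored")
# Both lines print "flagged": the tracking ID is a false positive that
# only context-aware (e.g., AI-based) classification can rule out.
```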

What a Modern Platform Must Deliver

  1. High-accuracy, petabyte-scale classification of unstructured data (so you know what you have, where it sits, and how sensitive it is). And it must keep pace with explosive data growth and do so cost efficiently.

  2. Unified Data Access Governance (DAG) - mapping identities (users, service principals, agents), permissions, implicit shares, federated/cloud-native paths across M365 and beyond.
  3. Data Detection & Response (DDR) - continuous monitoring of data movement, copies, derivative creation, AI agent interactions, and automated response/remediation.

How Sentra Addresses This in M365


At Sentra, we’ve built a cloud-native data-security platform specifically to address this triad of capabilities - and we extend that deeply into M365 (OneDrive, SharePoint, Teams, Exchange Online) and other SaaS platforms.

  • A newly announced AI Classifier for Unstructured Data accelerates and improves classification across M365’s unstructured repositories (see: Sentra launches breakthrough unstructured-data AI classification capabilities).

  • Petabyte-scale processing: our architecture supports classification and monitoring of massive file estates without astronomical cost or time-to-value.

  • Seamless support for M365 services: read/write access, ingestion, classification, access-graph correlation, detection of shadow/unmanaged copies across OneDrive and SharePoint—plus integration into our DAG and DDR layers (see our guide: How to Secure Regulated Data in Microsoft 365 + Copilot).

  • Cost-efficient deployment: designed for high scale without breaking the budget or massive manual effort.

The Bottom Line

In today’s cloud/AI era, saying “we discovered the PII in our M365 tenant” isn’t enough.

The real question is: Do I know who or what (user/agent/app) can access that content, what its business purpose is, and whether it’s already been copied or transformed into a risk vector?


If your solution can’t answer that, your unstructured data remains a silent, high-stakes liability, and resolving concerns becomes a costly, resource-draining burden. By embracing a platform that combines classification accuracy, petabyte-scale processing, unified DSPM + DAG + DDR, and deep M365 support, you move from “hope I’m secure” to “I know I’m secure.”

Want to see how it works in a real M365 setup? Check out our video or book a demo.


Ofir Yehoshua | November 17, 2025 | 4 Min Read

How to Gain Visibility and Control in Petabyte-Scale Data Scanning

Every organization today is drowning in data - millions of assets spread across cloud platforms, on-premises systems, and an ever-expanding landscape of SaaS tools. Each asset carries value, but also risk. For security and compliance teams, the mandate is clear: sensitive data must be inventoried, managed and protected.

Scanning every asset for security and compliance is no longer optional; it’s the line between trust and exposure, between resilience and chaos.

Many data security tools promise to scan and classify sensitive information across environments. In practice, doing this effectively at scale demands more than raw ‘brute force’ scanning power. It requires robust visibility and management capabilities: a cockpit view that lets teams monitor coverage, prioritize intelligently, and strike the right balance between scan speed, cost, and accuracy.

Why Scan Tracking Is Crucial

Scanning is not instantaneous. Depending on the size and complexity of your environment, it can take days, sometimes even weeks, to complete. Meanwhile, new data is constantly being created or modified, adding to the challenge.

Without clear visibility into the scanning process, organizations face several critical obstacles:

  • Unclear progress: It’s often difficult to know what has already been scanned, what is currently in progress, and what remains pending. This lack of clarity creates blind spots that undermine confidence in coverage.

  • Time estimation gaps: In large environments, it’s hard to know how long scans will take because so many factors come into play: the number of assets, their size, the type of data (structured, semi-structured, or unstructured), and how much scanner capacity is available. As a result, predicting when you’ll reach full coverage is tricky, especially when scans must finish before a fixed deadline such as a compliance audit (a rough estimation sketch follows this list).

    "With Sentra’s Scan Dashboard, we were able to quickly scale up our scanners to meet a tight audit deadline, finish on time, and then scale back down to save costs. The visibility and control it gave us made the whole process seamless”, said CISO of Large Retailer.
  • Poor prioritization: Not all environments or assets carry the same importance. Yet without visibility into scan status, teams struggle to balance historical scans of existing assets with the ongoing influx of newly created data, making it nearly impossible to prioritize effectively based on risk or business value.
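
For the time-estimation gap above, a back-of-the-envelope calculation helps set expectations. The Python sketch below assumes throughput scales roughly linearly with scanner count and that the data mix stays stable; real estimates must also account for asset-size skew, data type, and scanner contention.

```python
def estimate_days_to_full_coverage(
    remaining_gb: float,
    gb_per_scanner_per_day: float,
    scanner_count: int,
    daily_new_data_gb: float = 0.0,
) -> float:
    """Naive ETA: net daily progress is throughput minus new data created."""
    daily_throughput = gb_per_scanner_per_day * scanner_count
    net_progress = daily_throughput - daily_new_data_gb
    if net_progress <= 0:
        raise ValueError("Scanners cannot outpace new data; add capacity.")
    return remaining_gb / net_progress

# 500 TB backlog, 2 TB/day per scanner, 20 scanners, 5 TB of new data daily:
print(estimate_days_to_full_coverage(500_000, 2_000, 20, 5_000))  # ~14.3 days
```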

Sentra’s End-to-End Scanning Workflow

Managing scans at petabyte scale is complex. Sentra streamlines the process with a workflow built for scale, clarity, and control that features:

1. Comprehensive Asset Discovery

Before scanning even begins, Sentra automatically discovers assets across cloud platforms, on-premises systems, and SaaS applications. This ensures teams have a complete, up-to-date inventory and visual map of their data landscape, so no environment or data store is overlooked.

Example: New S3 buckets, a freshly deployed BigQuery dataset, or a newly connected SharePoint site are automatically identified and added to the inventory.


2. Configurable Scan Management

Administrators can fine-tune how scans are executed to meet their organization’s needs. With flexible configuration options such as the number of scanners, sampling rates, and prioritization rules, teams can strike the right balance between scan speed, coverage, and cost control.

For instance, compliance-critical assets can be scanned at full depth immediately, while less critical environments can run at reduced sampling to save on compute consumption and costs.
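
Here is a hypothetical configuration shape, not Sentra's actual schema, showing how scan depth and sampling might be tiered by asset criticality.

```python
# Invented policy structure for illustration.
SCAN_POLICY = {
    "scanner_count": 20,
    "tiers": [
        {   # compliance-critical stores: full depth, scanned first
            "match_tags": ["pci", "phi", "finance"],
            "sampling_rate": 1.0,
            "priority": "high",
        },
        {   # everything else: sampled to control compute cost
            "match_tags": ["*"],
            "sampling_rate": 0.10,
            "priority": "normal",
        },
    ],
}

def sampling_rate_for(asset_tags: set[str]) -> float:
    """Return the sampling rate of the first tier an asset matches."""
    for tier in SCAN_POLICY["tiers"]:
        if "*" in tier["match_tags"] or asset_tags & set(tier["match_tags"]):
            return tier["sampling_rate"]
    return 0.0

print(sampling_rate_for({"pci"}))       # 1.0 - full depth
print(sampling_rate_for({"dev-logs"}))  # 0.1 - sampled
```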

3. Real-Time Scan Dashboard

Sentra’s unified Scan Dashboard provides a cockpit view into scanning operations, so teams always know where they stand. Key features include:

  • Daily scan throughput correlated with the number of active scanners, helping teams understand efficiency and predict completion times.
  • Coverage tracking that visualizes overall progress and highlights which assets remain unscanned.
  • Decision-making tools that allow teams to dynamically adjust, whether by adding scanner capacity, changing sampling rates, or reordering priorities when new high-risk assets appear.

Handling Data Changes

The challenge doesn’t end once the initial scans are complete. Data is dynamic: new files are added daily, existing records are updated, and sensitive information shifts locations. Sentra’s activity feeds give teams the visibility they need to understand how their data landscape is evolving and adapt their data security strategies in real time.


Conclusion

Tracking scan status at scale is complex but critical to any data security strategy. Sentra provides an end-to-end view and unmatched scan control, helping organizations move from uncertainty to confidence with clear prediction of scan timelines, faster troubleshooting, audit-ready compliance, and smarter, cost-efficient decisions for securing data.

