
Use Redshift Data Scrambling for Additional Data Protection

May 3, 2023
8 Min Read

According to IBM, a data breach in the United States cost companies an average of 9.44 million dollars in 2022. It is now more important than ever for organizations to protect confidential information. Data scrambling, which can add an extra layer of security to data, is one approach to accomplish this. 

In this post, we'll analyze the value of data protection, look at the potential financial consequences of data breaches, and talk about how Redshift Data Scrambling may help protect private information.

The Importance of Data Protection

Data protection is essential to safeguard sensitive data from unauthorized access. A data breach can lead to identity theft, financial fraud, and other serious consequences. Data protection is also crucial for compliance reasons: in several sectors, including government, banking, and healthcare, sensitive data must be protected by law. Failure to abide by these regulations may result in heavy fines, legal problems, and loss of business.

Hackers employ many techniques, including phishing, malware, insider threats, and direct system intrusion, to gain access to confidential information. For example, a phishing attack may lead to the theft of login credentials, and malware may infect a system, opening the door for further attacks and data theft. 

So how can you protect yourself against these attacks and minimize your data attack surface?

What is Redshift Data Masking?

Redshift data masking is a technique used to protect sensitive data in Amazon Redshift, a cloud-based data warehousing and analytics service. It involves replacing sensitive data with fictitious but realistic values to protect it from unauthorized access or exposure. Used in conjunction with other security measures, such as access control and encryption, Redshift data masking can form part of a comprehensive data protection plan.


What is Redshift Data Scrambling?

Redshift data scrambling protects confidential information in a Redshift database by altering original data values using algorithms or formulas, creating unrecognizable data sets. This method is beneficial when sharing sensitive data with third parties or using it for testing, development, or analysis, ensuring privacy and security while enhancing usability. 

The technique is highly customizable, allowing organizations to select the desired level of protection while maintaining data usability. Redshift data scrambling is cost-effective, requiring no additional hardware or software investments, providing an attractive, low-cost solution for organizations aiming to improve cloud data security.

Data Masking vs. Data Scrambling

Data masking involves replacing sensitive data with fictitious but realistic values. Data scrambling, on the other hand, changes the original data values using an algorithm or formula to generate a new set of values.

In some cases, data scrambling can be used as part of data masking techniques. For instance, sensitive data such as credit card numbers can be scrambled before being masked to enhance data protection further.
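
To make the distinction concrete, here is a minimal SQL sketch (the sample card number, the mask format, and the use of MD5 are purely illustrative, not part of any Redshift masking feature): masking preserves the familiar format and reveals only the last four digits, while scrambling derives entirely new, unrecognizable characters from the original value.


-- Masking: keep the format, hide all but the last four digits
SELECT '****-****-****-' || RIGHT('1234-5678-9012-3456', 4) AS masked_card;

-- Scrambling: derive an unrecognizable value from the original and a key
SELECT MD5('MyScramblingKey' || '1234-5678-9012-3456') AS scrambled_card;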

Setting up Redshift Data Scrambling

Having gained an understanding of Redshift and data scrambling, we can now proceed to learn how to set it up for implementation. Enabling data scrambling in Redshift requires several steps.

To achieve data scrambling in Redshift, SQL queries are utilized to invoke built-in or user-defined functions. These functions utilize a blend of cryptographic techniques and randomization to scramble the data.

The following steps use example code to show how to set it up:

Step 1: Create a new Redshift cluster

Create a new Redshift cluster or use an existing cluster if available. 

[Image: Redshift create cluster]

Step 2: Define a scrambling key

Define a scrambling key that will be used to scramble the sensitive data.

 
SET session my_scrambling_key = 'MyScramblingKey';

In this code snippet, we are defining a scrambling key by setting a session-level parameter named `my_scrambling_key` to the value `MyScramblingKey`. This key will be used by the user-defined function to scramble the sensitive data.

Step 3: Create a user-defined function (UDF)

Create a user-defined function in Redshift that will be used to scramble the sensitive data. 


CREATE FUNCTION scramble(input_string VARCHAR)
RETURNS VARCHAR
STABLE
AS $$
    import hashlib
    # Example scrambling logic only - replace with your own algorithm.
    # Each character is replaced by one derived from the key and its position.
    key = 'MyScramblingKey'
    scrambled = ''
    for i, ch in enumerate(input_string or ''):
        scrambled += hashlib.sha256((key + str(i) + ch).encode('utf-8')).hexdigest()[0]
    return scrambled
$$ LANGUAGE plpythonu;

Here, we are creating a UDF named `scramble` that takes a string input and returns the scrambled output. Redshift UDFs can be written in SQL or Python; this example uses Python. The function is defined as `STABLE`, which means that it will always return the same result for the same input, which is important for data scrambling. The body above is placeholder logic only - you will need to supply your own scrambling algorithm.

Step 4: Apply the UDF to sensitive columns

Apply the UDF to the sensitive columns in the database that need to be scrambled.


UPDATE employee SET ssn = scramble(ssn);

For example, here we apply the `scramble` UDF to a column named `ssn` in a table named `employee`. The `UPDATE` statement calls the `scramble` UDF and replaces the values in the `ssn` column with their scrambled versions.

Step 5: Test and validate the scrambled data

Test and validate the scrambled data to ensure that it is unreadable and unusable by unauthorized parties.


SELECT ssn, scramble(ssn) AS scrambled_ssn
FROM employee;

In this snippet, we run a `SELECT` statement to retrieve the `ssn` column and the corresponding scrambled value produced by the `scramble` UDF. Run this comparison against a copy of the original data (or before applying the `UPDATE` in Step 4), so you can verify that the original and scrambled values differ and that the scrambling is working as expected. 

Step 6: Monitor and maintain the scrambled data

To monitor and maintain the scrambled data, regularly check the sensitive columns to ensure that they are still scrambled and that there are no vulnerabilities or breaches. You should also maintain the scrambling key and UDF to ensure that they remain up to date and effective.
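
As a minimal sketch of such a check (assuming the employee.ssn example above, and that unscrambled values still follow the standard SSN pattern), a query like this flags rows that may have been missed:


-- Count rows whose values still match the original SSN format
SELECT COUNT(*) AS possibly_unscrambled
FROM employee
WHERE ssn SIMILAR TO '[0-9]{3}-[0-9]{2}-[0-9]{4}';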

Different Options for Scrambling Data in Redshift

Selecting a data scrambling technique involves balancing security levels, data sensitivity, and application requirements. Various general algorithms exist, each with unique pros and cons. To scramble data in Amazon Redshift, you can use the following Python code samples in conjunction with a library like psycopg2 to interact with your Redshift cluster. Before executing the code samples, you will need to install the psycopg2 library:


pip install psycopg2

Random

Utilizing a random number generator, the Random option quickly secures data, although its susceptibility to reverse engineering limits its robustness for long-term protection.


import random
import string
import psycopg2

def random_scramble(data):
    scrambled = ""
    for char in data:
        scrambled += random.choice(string.ascii_letters + string.digits)
    return scrambled

# Connect to your Redshift cluster
conn = psycopg2.connect(host='your_host', port='your_port', dbname='your_dbname', user='your_user', password='your_password')
cursor = conn.cursor()
# Fetch data from your table
cursor.execute("SELECT sensitive_column FROM your_table;")
rows = cursor.fetchall()

# Scramble the data (one scrambled string per original value)
scrambled_rows = [random_scramble(row[0]) for row in rows]

# Update the table, pairing each scrambled value with the original it replaces
cursor.executemany("UPDATE your_table SET sensitive_column = %s WHERE sensitive_column = %s;", [(scrambled, original[0]) for scrambled, original in zip(scrambled_rows, rows)])
conn.commit()

# Close the connection
cursor.close()
conn.close()

Shuffle

The Shuffle option enhances security by rearranging data characters. However, it remains prone to brute-force attacks, despite being harder to reverse-engineer.


import random
import psycopg2

def shuffle_scramble(data):
    data_list = list(data)
    random.shuffle(data_list)
    return ''.join(data_list)

conn = psycopg2.connect(host='your_host', port='your_port', dbname='your_dbname', user='your_user', password='your_password')
cursor = conn.cursor()

cursor.execute("SELECT sensitive_column FROM your_table;")
rows = cursor.fetchall()

scrambled_rows = [shuffle_scramble(row[0]) for row in rows]

cursor.executemany("UPDATE your_table SET sensitive_column = %s WHERE sensitive_column = %s;", [(scrambled, original[0]) for scrambled, original in zip(scrambled_rows, rows)])
conn.commit()

cursor.close()
conn.close()

Reversible

The Reversible method scrambles characters in a way that can be undone with a decryption key, which poses a greater challenge to attackers but is still vulnerable to brute-force attacks. We’ll use the Caesar cipher as an example.


import psycopg2

def caesar_cipher(data, key):
    encrypted = ""
    for char in data:
        if char.isalpha():
            shift = key % 26
            if char.islower():
                encrypted += chr((ord(char) - 97 + shift) % 26 + 97)
            else:
                encrypted += chr((ord(char) - 65 + shift) % 26 + 65)
        else:
            encrypted += char
    return encrypted

conn = psycopg2.connect(host='your_host', port='your_port', dbname='your_dbname', user='your_user', password='your_password')
cursor = conn.cursor()

cursor.execute("SELECT sensitive_column FROM your_table;")
rows = cursor.fetchall()

key = 5
encrypted_rows = [caesar_cipher(row[0], key) for row in rows]
cursor.executemany("UPDATE your_table SET sensitive_column = %s WHERE sensitive_column = %s;", [(encrypted, original[0]) for encrypted, original in zip(encrypted_rows, rows)])
conn.commit()

cursor.close()
conn.close()

Custom

The Custom option enables users to create tailor-made algorithms to resist specific attack types, potentially offering superior security. However, the development and implementation of custom algorithms demand greater time and expertise.
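
As a minimal sketch of the idea (assuming the employee table from the setup steps; the digit mapping is arbitrary and purely illustrative), a custom rule can be composed from Redshift's built-in string functions:


-- Custom scrambling rule: remap digits with TRANSLATE, then reverse the result
SELECT REVERSE(TRANSLATE(ssn, '0123456789', '7392048156')) AS scrambled_ssn
FROM employee;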

Best Practices for Using Redshift Data Scrambling

There are several best practices that should be followed when using Redshift Data Scrambling to ensure maximum protection:

Use Unique Keys for Each Table

To ensure that the data is not compromised if one key is leaked, each table should use its own scrambling key. Following the session-parameter approach from the setup steps, for example:


SET session employee_scrambling_key = 'EmployeeScramblingKey';
SET session customer_scrambling_key = 'CustomerScramblingKey';

Encrypt Sensitive Data Fields 

Sensitive data fields such as credit card numbers and social security numbers should be encrypted to provide an additional layer of security. Redshift does not ship a built-in column-level ENCRYPT function, so field-level encryption is typically performed at the application layer before loading, or inside Redshift with a user-defined function. For example, assuming you have implemented a UDF named encrypt_field (along the same lines as the scramble UDF above), you could encrypt a credit card number like this:


SELECT encrypt_field('1234-5678-9012-3456', 'your_encryption_key_here');

Use Strong Encryption Algorithms

Strong encryption algorithms such as AES-256 should be used to provide the strongest protection. Redshift supports AES-256 encryption for data at rest, applied at the cluster level through AWS KMS, as well as SSL/TLS for data in transit. For example, you can create an encrypted cluster with the AWS CLI:


aws redshift create-cluster \
    --cluster-identifier my-encrypted-cluster \
    --node-type ra3.xlplus \
    --number-of-nodes 2 \
    --master-username admin \
    --master-user-password 'YourStrongPassword1' \
    --encrypted \
    --kms-key-id your_kms_key_id

Control Access to Encryption Keys 

Access to encryption keys should be restricted to authorized personnel to prevent unauthorized access to sensitive data. You can achieve this by setting up an AWS KMS (Key Management Service) to manage your encryption keys. Here's an example of how to restrict access to an encryption key using KMS in Python:


import boto3

kms = boto3.client('kms')

key_id = 'your_key_id_here'
grantee_principal = 'arn:aws:iam::123456789012:user/jane'

response = kms.create_grant(
    KeyId=key_id,
    GranteePrincipal=grantee_principal,
    Operations=['Decrypt']
)

print(response)

Regularly Rotate Encryption Keys 

Regular rotation of encryption keys ensures that any compromised keys do not provide long-term access to sensitive data. AWS KMS can rotate customer managed keys automatically once a year. Here's how to enable annual key rotation in KMS using the AWS CLI:


aws kms enable-key-rotation --key-id your_key_id_here

Turn on logging 

To track user access to sensitive data and identify any unwanted access, logging must be enabled. When you activate audit logging with user activity logging in Amazon Redshift, all SQL commands executed on your cluster are logged. This covers queries that access sensitive data as well as data-scrambling operations. Afterwards, you can examine these logs to look for unusual access patterns or suspicious activity.

User activity logging is controlled by the enable_user_activity_logging parameter in your cluster's parameter group (audit logging must also be enabled for the cluster). You can turn it on with the AWS CLI:

aws redshift modify-cluster-parameter-group \
    --parameter-group-name your_parameter_group \
    --parameters ParameterName=enable_user_activity_logging,ParameterValue=true

Once logging is enabled, the logs can be retrieved through the STL_QUERY system table. For instance, a query like the one below (assuming the employee table from earlier) will display recent queries that touched a certain table:
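

-- Recent queries whose text references the employee table
SELECT query, userid, starttime, endtime, TRIM(querytxt) AS querytxt
FROM stl_query
WHERE querytxt ILIKE '%employee%'
ORDER BY starttime DESC
LIMIT 20;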

Monitor Performance 

Data scrambling is often a resource-intensive practice, so it’s good to monitor CPU usage, memory usage, and disk I/O to ensure your cluster isn’t being overloaded. In Redshift, you can use the `svl_query_summary` and `svl_query_report` system views to monitor query performance. You can also use Amazon CloudWatch to monitor metrics such as CPU usage and disk space.
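
As a minimal sketch (the columns shown are illustrative; adjust to the metrics you care about), this pulls the slowest query steps from `svl_query_summary` and shows whether they spilled to disk:


-- Longest-running query steps and whether they were executed on disk
SELECT query, maxtime, rows, is_diskbased
FROM svl_query_summary
ORDER BY maxtime DESC
LIMIT 10;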


Establishing Backup and Disaster Recovery

In order to prevent data loss in the case of a disaster, backup and disaster recovery mechanisms should be put in place. Amazon Redshift offers several backup and recovery methods, including automated backups and manual snapshots. Automated backups are taken approximately every eight hours by default. 

Moreover, you can always take a manual snapshot of your cluster. In the case of a failure or disaster, your cluster can be restored from these backups and snapshots. Snapshots are managed through the console, API, or AWS CLI rather than SQL; for example, to take a manual snapshot of your cluster:

aws redshift create-cluster-snapshot \
    --cluster-identifier your_cluster_identifier \
    --snapshot-identifier my-manual-snapshot

To restore a snapshot, you can use the `restore-from-cluster-snapshot` AWS CLI command. For example:


aws redshift restore-from-cluster-snapshot \
    --cluster-identifier new_cluster_name \
    --snapshot-identifier snapshot_name

Frequent Review and Updates

To ensure that data scrambling procedures remain effective and up-to-date with the latest security requirements, it is crucial to consistently review and update them. This process should include examining backup and recovery procedures, encryption techniques, and access controls.

In Amazon Redshift, you can assess access controls by inspecting all roles and their associated permissions in the `pg_roles` system catalog view. It is essential to confirm that only authorized individuals have access to sensitive information.
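
As a minimal sketch of such a review (assuming the employee table from earlier), you can list which database users are able to read the table holding sensitive data:


-- Which users can SELECT from the employee table?
SELECT usename,
       has_table_privilege(usename, 'employee', 'select') AS can_read_employee
FROM pg_user
ORDER BY usename;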

To analyze encryption techniques, use the `pg_catalog.pg_attribute` system catalog table, which allows you to inspect data types and encryption settings for each column in your tables. Ensure that sensitive data fields are protected with robust encryption methods, such as AES-256.

The AWS CLI commands `aws backup list-backup-plans` and `aws backup list-backup-vaults` enable you to review your backup plans and vaults, as well as evaluate backup and recovery procedures. Make sure your backup and recovery procedures are properly configured and up to date.

Decrypting Data in Redshift

There are different options for decrypting data, depending on the encryption method used and the tools available. The decryption process mirrors encryption: usually a custom UDF is used to decrypt the data. Let’s look at one example of reversing data scrambling done with a substitution cipher.

Step 1: Create a UDF with decryption logic for substitution


CREATE FUNCTION decrypt_substitution(ciphertext varchar) RETURNS varchar
IMMUTABLE AS $$
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    substitution = 'ijklmnopqrstuvwxyzabcdefgh'
    plaintext = ''
    for ch in ciphertext:
        index = substitution.find(ch)
        if index == -1:
            # Characters outside the substitution alphabet are left unchanged
            plaintext += ch
        else:
            # Map the substituted character back to its original letter
            plaintext += alphabet[index]
    return plaintext
$$ LANGUAGE plpythonu;

Step 2: Move the data back after truncating and applying the decryption function


TRUNCATE original_table;
INSERT INTO original_table (column1, decrypted_column2, column3)
SELECT column1, decrypt_substitution(encrypted_column2), column3
FROM temp_table;

In this example, encrypted_column2 is the encrypted version of column2 in the temp_table. The decrypt_substitution function is applied to encrypted_column2, and the result is inserted into the decrypted_column2 in the original_table. Make sure to replace column1, column2, and column3 with the appropriate column names, and adjust the INSERT INTO statement accordingly if you have more or fewer columns in your table.

Conclusion

Redshift data scrambling is an effective tool for additional data protection and should be considered as part of an organization's overall data security strategy. In this blog post, we looked at the importance of data protection and how it can be integrated effectively into the data warehouse. We then covered the difference between data scrambling and data masking before diving into how to set up Redshift data scrambling.

Once you become accustomed to Redshift data scrambling, you can strengthen your security with the different scrambling techniques and best practices covered here, including encryption practices, logging, and performance monitoring. By following these recommendations and adopting an efficient strategy, organizations can improve their data security posture management (DSPM) and reduce the risk of potential breaches.


Veronica is a security researcher at Sentra. She brings a wealth of knowledge and experience as a cybersecurity researcher. Her main focus is researching major cloud provider services and AI infrastructures for data-related threats and techniques.


Latest Blog Posts

David Stuart, Nikki Ralston
November 24, 2025
3 Min Read

Third-Party OAuth Apps Are the New Shadow Data Risk: Lessons from the Gainsight/Salesforce Incident

The recent exposure of customer data through a compromised Gainsight integration within Salesforce environments is more than an isolated event - it’s a sign of a rapidly evolving class of SaaS supply-chain threats. Even trusted AppExchange partners can inadvertently create access pathways that attackers exploit, especially when OAuth tokens and machine-to-machine connections are involved. This post explores what happened, why today’s security tooling cannot fully address this scenario, and how data-centric visibility and identity governance can meaningfully reduce the blast radius of similar breaches.

A Recap of the Incident

In this case, attackers obtained sensitive credentials tied to a Gainsight integration used by multiple enterprises. Those credentials allowed adversaries to generate valid OAuth tokens and access customer Salesforce orgs, in some cases with extensive read capabilities. Neither Salesforce nor Gainsight intentionally misconfigured their systems. This was not a product flaw in either platform. Instead, the incident illustrates how deeply interconnected SaaS environments have become and how the security of one integration can impact many downstream customers.

Understanding the Kill Chain: From Stolen Secrets to Salesforce Lateral Movement

The attackers’ pathway followed a pattern increasingly common in SaaS-based attacks. It began with the theft of secrets - likely API keys, OAuth client secrets, or other credentials that often end up buried in repositories, CI/CD logs, or overlooked storage locations. Once in hand, these secrets enabled the attackers to generate long-lived OAuth tokens, which are designed for application-level access and operate outside MFA or user-based access controls.

What makes OAuth tokens particularly powerful is that they inherit whatever permissions the connected app holds. If an integration has broad read access, which many do for convenience or legacy reasons, an attacker who compromises its token suddenly gains the same level of visibility. Inside Salesforce, this enabled lateral movement across objects, records, and reporting surfaces far beyond the intended scope of the original integration. The entire kill chain was essentially a progression from a single weakly-protected secret to high-value data access across multiple Salesforce tenants.

Why Traditional SaaS Security Tools Missed This

Incident response teams quickly learned what many organizations are now realizing: traditional CASBs and CSPMs don’t provide the level of identity-to-data context necessary to detect or prevent OAuth-driven supply-chain attacks.

CASBs primarily analyze user behavior and endpoint connections, but OAuth apps are “non-human identities” - they don’t log in through browsers or trigger interactive events. CSPMs, in contrast, focus on cloud misconfigurations and posture, but they don’t understand the fine-grained data models of SaaS platforms like Salesforce. What was missing in this incident was visibility into how much sensitive data the Gainsight connector could access and whether the privileges it held were appropriate or excessive. Without that context, organizations had no meaningful way to spot the risk until the compromise became public.

Sentra Helps Prevent and Contain This Attack Pattern

Sentra’s approach is fundamentally different because it starts with data: what exists, where it resides, who or what can access it, and whether that access is appropriate. Rather than treating Salesforce or other SaaS platforms as black boxes, Sentra maps the data structures inside them, identifies sensitive records, and correlates that information with identity permissions including third-party apps, machine identities, and OAuth sessions.

One key pillar of Sentra’s value lies in its DSPM capabilities. The platform identifies sensitive data across all repositories, including cloud storage, SaaS environments, data warehouses, code repositories, collaboration platforms, and even on-prem file systems. Because Sentra also detects secrets such as API keys, OAuth credentials, private keys, and authentication tokens across these environments, it becomes possible to catch compromised or improperly stored secrets before an attacker ever uses them to access a SaaS platform.

[Image: OAuth 2.0 Access Token]

Another area where this becomes critical is the detection of over-privileged connected apps. Sentra continuously evaluates the scopes and permissions granted to integrations like Gainsight, identifying when either an app or an identity holds more access than its business purpose requires. This type of analysis would have revealed that a compromised integrated app could see far more data than necessary, providing early signals of elevated risk long before an attacker exploited it.

Sentra further tracks the health and behavior of non-human identities. Service accounts and connectors often rely on long-lived credentials that are rarely rotated and may remain active long after the responsible team has changed. Sentra identifies these stale or overly permissive identities and highlights when their behavior deviates from historical norms. In the context of this incident type, that means detecting when a connector suddenly begins accessing objects it never touched before or when large volumes of data begin flowing to unexpected locations or IP ranges.

Finally, Sentra’s behavior analytics (part of DDR) help surface early signs of misuse. Even if an attacker obtains valid OAuth tokens, their data access patterns, query behavior, or geography often diverge from the legitimate integration. By correlating anomalous activity with the sensitivity of the data being accessed, Sentra can detect exfiltration patterns in real time—something traditional tools simply aren’t designed to do.

The 2026 Outlook: More Incidents Are Coming

The Gainsight/Salesforce incident is unlikely to be the last of its kind. The speed at which enterprises adopt SaaS integrations far exceeds the rate at which they assess the data exposure those integrations create. OAuth-based supply-chain attacks are growing quickly because they allow adversaries to compromise one provider and gain access to dozens or hundreds of downstream environments. Given the proliferation of partner ecosystems, machine identities, and unmonitored secrets, this attack vector will continue to scale.

Prediction:
Unless enterprises add data-centric SaaS visibility and identity-aware DSPM, we should expect three to five more incidents of similar magnitude before summer 2026.

Conclusion

The real lesson from the Gainsight/Salesforce breach is not to reduce reliance on third-party SaaS providers as modern business would grind to a halt without them. The lesson is that enterprises must know where their sensitive data lives, understand exactly which identities and integrations can access it, and ensure those privileges are continuously validated. Sentra provides that visibility and contextual intelligence, making it possible to identify the risks that made this breach possible and help to prevent the next one.


David Stuart
November 24, 2025
3 Min Read

Securing Unstructured Data in Microsoft 365: The Case for Petabyte-Scale, AI-Driven Classification

The modern enterprise runs on collaboration and nothing powers that more than Microsoft 365. From Exchange Online and OneDrive to SharePoint, Teams, and Copilot workflows, M365 hosts a massive and ever-growing volume of unstructured content: documents, presentations, spreadsheets, image files, chats, attachments, and more.

Yet unstructured data is harder to govern. Unlike tidy database tables with defined schemas, unstructured repositories flood in with ambiguous content types, buried duplicates, and unused legacy files. It’s in these stacks that sensitive IP, model training data, or derivative work can quietly accumulate, and then leak.

Consider this: one recent study found that more than 81% of IT professionals report data-loss events in M365 environments. And to make matters worse, according to the International Data Corporation (IDC), 60% of organizations do not have a strategy for protecting their critical business data that resides in Microsoft 365.

Why Traditional Tools Struggle

  • Built-in classification tools (e.g., M365’s native capabilities) often rely on pattern matching or simple keywords, and therefore struggle with accuracy, context, scale and derivative content.

  • Many solutions only surface that a file exists and carries a type label - but stop short of mapping who or what can access it, its purpose, and what its downstream exposure might be.

  • GenAI workflows now pump massive volumes of unstructured data into copilots, knowledge bases, training sets - creating new blast radii that legacy DLP or labeling tools weren’t designed to catch.

What a Modern Platform Must Deliver

  1. High-accuracy, petabyte-scale classification of unstructured data (so you know what you have, where it sits, and how sensitive it is). And it must keep pace with explosive data growth and do so cost efficiently.

  2. Unified Data Access Governance (DAG) - mapping identities (users, service principals, agents), permissions, implicit shares, federated/cloud-native paths across M365 and beyond.
  3. Data Detection & Response (DDR) - continuous monitoring of data movement, copies, derivative creation, AI agent interactions, and automated response/remediation.

How Sentra addresses this in M365

[Image: Assets contain plain text credit card numbers]

At Sentra, we’ve built a cloud-native data-security platform specifically to address this triad of capabilities - and we extend that deeply into M365 (OneDrive, SharePoint, Teams, Exchange Online) and other SaaS platforms.

  • A newly announced AI Classifier for Unstructured Data accelerates and improves classification across M365’s unstructured repositories (see: Sentra launches breakthrough unstructured-data AI classification capabilities).

  • Petabyte-scale processing: our architecture supports classification and monitoring of massive file estates without astronomical cost or time-to-value.

  • Seamless support for M365 services: read/write access, ingestion, classification, access-graph correlation, detection of shadow/unmanaged copies across OneDrive and SharePoint—plus integration into our DAG and DDR layers (see our guide: How to Secure Regulated Data in Microsoft 365 + Copilot).

  • Cost-efficient deployment: designed for high scale without breaking the budget or massive manual effort.

The Bottom Line

In today’s cloud/AI era, saying "we discovered the PII in our M365 tenant" isn’t enough.

The real question is: Do I know who or what (user/agent/app) can access that content, what its business purpose is, and whether it’s already been copied or transformed into a risk vector?


If your solution can’t answer that, your unstructured data remains a silent, high-stakes liability and resolving concerns becomes a very costly, resource-draining burden. By embracing a platform that combines classification accuracy, petabyte-scale processing, unified DSPM + DAG + DDR, and deep M365 support, you move from “hope I’m secure” to “I know I’m secure.”

Want to see how it works in a real M365 setup? Check out our video or book a demo.


Ofir Yehoshua
November 17, 2025
4 Min Read

How to Gain Visibility and Control in Petabyte-Scale Data Scanning

Every organization today is drowning in data - millions of assets spread across cloud platforms, on-premises systems, and an ever-expanding landscape of SaaS tools. Each asset carries value, but also risk. For security and compliance teams, the mandate is clear: sensitive data must be inventoried, managed and protected.

Scanning every asset for security and compliance is no longer optional, it’s the line between trust and exposure, between resilience and chaos.

Many data security tools promise to scan and classify sensitive information across environments. In practice, doing this effectively and at scale demands more than raw ‘brute force’ scanning power. It requires robust visibility and management capabilities: a cockpit view that lets teams monitor coverage, prioritize intelligently, and strike the right balance between scan speed, cost, and accuracy.

Why Scan Tracking Is Crucial

Scanning is not instantaneous. Depending on the size and complexity of your environment, it can take days - sometimes even weeks - to complete. Meanwhile, new data is constantly being created or modified, adding to the challenge.

Without clear visibility into the scanning process, organizations face several critical obstacles:

  • Unclear progress: It’s often difficult to know what has already been scanned, what is currently in progress, and what remains pending. This lack of clarity creates blind spots that undermine confidence in coverage.

  • Time estimation gaps: In large environments, it’s hard to know how long scans will take because so many factors come into play: the number of assets, their size, the type of data (structured, semi-structured, or unstructured), and how much scanner capacity is available. As a result, predicting when you’ll reach full coverage is tricky. This becomes especially stressful when scans need to be completed before a fixed deadline, like a compliance audit. 

    "With Sentra’s Scan Dashboard, we were able to quickly scale up our scanners to meet a tight audit deadline, finish on time, and then scale back down to save costs. The visibility and control it gave us made the whole process seamless”, said CISO of Large Retailer.
  • Poor prioritization: Not all environments or assets carry the same importance. Yet without visibility into scan status, teams struggle to balance historical scans of existing assets with the ongoing influx of newly created data, making it nearly impossible to prioritize effectively based on risk or business value.

Sentra’s End-to-End Scanning Workflow

Managing scans at petabyte scale is complex. Sentra streamlines the process with a workflow built for scale, clarity, and control that features:

1. Comprehensive Asset Discovery

Before scanning even begins, Sentra automatically discovers assets across cloud platforms, on-premises systems, and SaaS applications. This ensures teams have a complete, up-to-date inventory and visual map of their data landscape, so no environment or data store is overlooked.

Example: New S3 buckets, a freshly deployed BigQuery dataset, or a newly connected SharePoint site are automatically identified and added to the inventory.

[Image: Comprehensive Asset Discovery with Sentra]

2. Configurable Scan Management

Administrators can fine-tune how scans are executed to meet their organization’s needs. With flexible configuration options - such as the number of scanners, sampling rates, and prioritization rules - teams can strike the right balance between scan speed, coverage, and cost control.

For instance, compliance-critical assets can be scanned at full depth immediately, while less critical environments can run at reduced sampling to save on compute consumption and costs.

3. Real-Time Scan Dashboard

Sentra’s unified Scan Dashboard provides a cockpit view into scanning operations, so teams always know where they stand. Key features include:

  • Daily scan throughput correlated with the number of active scanners, helping teams understand efficiency and predict completion times.
  • Coverage tracking that visualizes overall progress and highlights which assets remain unscanned.
  • Decision-making tools that allow teams to dynamically adjust, whether by adding scanner capacity, changing sampling rates, or reordering priorities when new high-risk assets appear.
[Image: Real-Time Scan Dashboard with Sentra]

Handling Data Changes

The challenge doesn’t end once the initial scans are complete. Data is dynamic, new files are added daily, existing records are updated, and sensitive information shifts locations. Sentra’s activity feeds give teams the visibility they need to understand how their data landscape is evolving and adapt their data security strategies in real time.


Conclusion

Tracking scan status at scale is complex but critical to any data security strategy. Sentra provides an end-to-end view and unmatched scan control, helping organizations move from uncertainty to confidence with clear prediction of scan timelines, faster troubleshooting, audit-ready compliance, and smarter, cost-efficient decisions for securing data.

