All Resources
In this article:
minus iconplus icon
Share the Blog

Use Redshift Data Scrambling for Additional Data Protection

May 3, 2023
8
Min Read

According to IBM, a data breach in the United States cost companies an average of 9.44 million dollars in 2022. It is now more important than ever for organizations to place high importance on protecting confidential information. Data scrambling, which can add an extra layer of security to data, is one approach to accomplish this. 

In this post, we'll analyze the value of data protection, look at the potential financial consequences of data breaches, and talk about how Redshift Data Scrambling may help protect private information.

The Importance of Data Protection

Data protection is essential to safeguard sensitive data from unauthorized access. Identity theft, financial fraud,and other serious consequences are all possible as a result of a data breach. Data protection is also crucial for compliance reasons. Sensitive data must be protected by law in several sectors, including government, banking, and healthcare. Heavy fines, legal problems, and business loss may result from failure to abide by these regulations.

Hackers employ many techniques, including phishing, malware, insider threats, and hacking, to get access to confidential information. For example, a phishing assault may lead to the theft of login information, and malware may infect a system, opening the door for additional attacks and data theft. 

So how to protect yourself against these attacks and minimize your data attack surface?

What is Redshift Data Masking?

Redshift data masking is a technique used to protect sensitive data in Amazon Redshift; a cloud-based data warehousing and analytics service. Redshift data masking involves replacing sensitive data with fictitious, realistic values to protect it from unauthorized access or exposure. It is possible to enhance data security by utilizing Redshift data masking in conjunction with other security measures, such as access control and encryption, in order to create a comprehensive data protection plan.

What is Redshift Data Masking

What is Redshift Data Scrambling?

Redshift data scrambling protects confidential information in a Redshift database by altering original data values using algorithms or formulas, creating unrecognizable data sets. This method is beneficial when sharing sensitive data with third parties or using it for testing, development, or analysis, ensuring privacy and security while enhancing usability. 

The technique is highly customizable, allowing organizations to select the desired level of protection while maintaining data usability. Redshift data scrambling is cost-effective, requiring no additional hardware or software investments, providing an attractive, low-cost solution for organizations aiming to improve cloud data security.

Data Masking vs. Data Scrambling

Data masking involves replacing sensitive data with a fictitious but realistic value. However, data scrambling, on the other hand, involves changing the original data values using an algorithm or a formula to generate a new set of values.

In some cases, data scrambling can be used as part of data masking techniques. For instance, sensitive data such as credit card numbers can be scrambled before being masked to enhance data protection further.

Setting up Redshift Data Scrambling

Having gained an understanding of Redshift and data scrambling, we can now proceed to learn how to set it up for implementation. Enabling data scrambling in Redshift requires several steps.

To achieve data scrambling in Redshift, SQL queries are utilized to invoke built-in or user-defined functions. These functions utilize a blend of cryptographic techniques and randomization to scramble the data.

The following steps are explained using an example code just for a better understanding of how to set it up:

Step 1: Create a new Redshift cluster

Create a new Redshift cluster or use an existing cluster if available. 

Redshift create cluster

Step 2: Define a scrambling key

Define a scrambling key that will be used to scramble the sensitive data.

 
SET session my_scrambling_key = 'MyScramblingKey';

In this code snippet, we are defining a scrambling key by setting a session-level parameter named <inlineCode>my_scrambling_key<inlineCode> to the value <inlineCode>MyScramblingKey<inlineCode>. This key will be used by the user-defined function to scramble the sensitive data.

Step 3: Create a user-defined function (UDF)

Create a user-defined function in Redshift that will be used to scramble the sensitive data. 


CREATE FUNCTION scramble(input_string VARCHAR)
RETURNS VARCHAR
STABLE
AS $$
DECLARE
scramble_key VARCHAR := 'MyScramblingKey';
BEGIN
-- Scramble the input string using the key
-- and return the scrambled output
RETURN ;
END;
$$ LANGUAGE plpgsql;

Here, we are creating a UDF named <inlineCode>scramble<inlineCode> that takes a string input and returns the scrambled output. The function is defined as <inlineCode>STABLE<inlineCode>, which means that it will always return the same result for the same input, which is important for data scrambling. You will need to input your own scrambling logic.

Step 4: Apply the UDF to sensitive columns

Apply the UDF to the sensitive columns in the database that need to be scrambled.


UPDATE employee SET ssn = scramble(ssn);

For example, applying the <inlineCode>scramble<inlineCode> UDF to a column saying, <inlineCode>ssn<inlineCode> in a table named <inlineCode>employee<inlineCode>. The <inlineCode>UPDATE<inlineCode> statement calls the <inlineCode>scramble<inlineCode> UDF and updates the values in the <inlineCode>ssn<inlineCode> column with the scrambled values.

Step 5: Test and validate the scrambled data

Test and validate the scrambled data to ensure that it is unreadable and unusable by unauthorized parties.


SELECT ssn, scramble(ssn) AS scrambled_ssn
FROM employee;

In this snippet, we are running a <inlineCode>SELECT<inlineCode> statement to retrieve the <inlineCode>ssn<inlineCode> column and the corresponding scrambled value using the <inlineCode>scramble<inlineCode> UDF. We can compare the original and scrambled values to ensure that the scrambling is working as expected. 

Step 6: Monitor and maintain the scrambled data

To monitor and maintain the scrambled data, we can regularly check the sensitive columns to ensure that they are still rearranged and that there are no vulnerabilities or breaches. We should also maintain the scrambling key and UDF to ensure that they are up-to-date and effective.

Different Options for Scrambling Data in Redshift

Selecting a data scrambling technique involves balancing security levels, data sensitivity, and application requirements. Various general algorithms exist, each with unique pros and cons. To scramble data in Amazon Redshift, you can use the following Python code samples in conjunction with a library like psycopg2 to interact with your Redshift cluster. Before executing the code samples, you will need to install the psycopg2 library:


pip install psycopg2

Random

Utilizing a random number generator, the Random option quickly secures data, although its susceptibility to reverse engineering limits its robustness for long-term protection.


import random
import string
import psycopg2

def random_scramble(data):
    scrambled = ""
    for char in data:
        scrambled += random.choice(string.ascii_letters + string.digits)
    return scrambled

# Connect to your Redshift cluster
conn = psycopg2.connect(host='your_host', port='your_port', dbname='your_dbname', user='your_user', password='your_password')
cursor = conn.cursor()
# Fetch data from your table
cursor.execute("SELECT sensitive_column FROM your_table;")
rows = cursor.fetchall()

# Scramble the data
scrambled_rows = [(random_scramble(row[0]),) for row in rows]

# Update the data in the table
cursor.executemany("UPDATE your_table SET sensitive_column = %s WHERE sensitive_column = %s;", [(scrambled, original) for scrambled, original in zip(scrambled_rows, rows)])
conn.commit()

# Close the connection
cursor.close()
conn.close()

Shuffle

The Shuffle option enhances security by rearranging data characters. However, it remains prone to brute-force attacks, despite being harder to reverse-engineer.


import random
import psycopg2

def shuffle_scramble(data):
    data_list = list(data)
    random.shuffle(data_list)
    return ''.join(data_list)

conn = psycopg2.connect(host='your_host', port='your_port', dbname='your_dbname', user='your_user', password='your_password')
cursor = conn.cursor()

cursor.execute("SELECT sensitive_column FROM your_table;")
rows = cursor.fetchall()

scrambled_rows = [(shuffle_scramble(row[0]),) for row in rows]

cursor.executemany("UPDATE your_table SET sensitive_column = %s WHERE sensitive_column = %s;", [(scrambled, original) for scrambled, original in zip(scrambled_rows, rows)])
conn.commit()

cursor.close()
conn.close()

Reversible

By scrambling characters in a decryption key-reversible manner, the Reversible method poses a greater challenge to attackers but is still vulnerable to brute-force attacks. We’ll use the Caesar cipher as an example.


def caesar_cipher(data, key):
    encrypted = ""
    for char in data:
        if char.isalpha():
            shift = key % 26
            if char.islower():
                encrypted += chr((ord(char) - 97 + shift) % 26 + 97)
            else:
                encrypted += chr((ord(char) - 65 + shift) % 26 + 65)
        else:
            encrypted += char
    return encrypted

conn = psycopg2.connect(host='your_host', port='your_port', dbname='your_dbname', user='your_user', password='your_password')
cursor = conn.cursor()

cursor.execute("SELECT sensitive_column FROM your_table;")
rows = cursor.fetchall()

key = 5
encrypted_rows = [(caesar_cipher(row[0], key),) for row in rows]
cursor.executemany("UPDATE your_table SET sensitive_column = %s WHERE sensitive_column = %s;", [(encrypted, original) for encrypted, original in zip(encrypted_rows, rows)])
conn.commit()

cursor.close()
conn.close()

Custom

The Custom option enables users to create tailor-made algorithms to resist specific attack types, potentially offering superior security. However, the development and implementation of custom algorithms demand greater time and expertise.

Best Practices for Using Redshift Data Scrambling

There are several best practices that should be followed when using Redshift Data Scrambling to ensure maximum protection:

Use Unique Keys for Each Table

To ensure that the data is not compromised if one key is compromised, each table should have its own unique key pair. This can be achieved by creating a unique index on the table.


CREATE UNIQUE INDEX idx_unique_key ON table_name (column_name);

Encrypt Sensitive Data Fields 

Sensitive data fields such as credit card numbers and social security numbers should be encrypted to provide an additional layer of security. You can encrypt data fields in Redshift using the ENCRYPT function. Here's an example of how to encrypt a credit card number field:


SELECT ENCRYPT('1234-5678-9012-3456', 'your_encryption_key_here');

Use Strong Encryption Algorithms

Strong encryption algorithms such as AES-256 should be used to provide the strongest protection. Redshift supports AES-256 encryption for data at rest and in transit.


CREATE TABLE encrypted_table (  sensitive_data VARCHAR(255) ENCODE ZSTD ENCRYPT 'aes256' KEY 'my_key');

Control Access to Encryption Keys 

Access to encryption keys should be restricted to authorized personnel to prevent unauthorized access to sensitive data. You can achieve this by setting up an AWS KMS (Key Management Service) to manage your encryption keys. Here's an example of how to restrict access to an encryption key using KMS in Python:


import boto3

kms = boto3.client('kms')

key_id = 'your_key_id_here'
grantee_principal = 'arn:aws:iam::123456789012:user/jane'

response = kms.create_grant(
    KeyId=key_id,
    GranteePrincipal=grantee_principal,
    Operations=['Decrypt']
)

print(response)

Regularly Rotate Encryption Keys 

Regular rotation of encryption keys ensures that any compromised keys do not provide unauthorized access to sensitive data. You can schedule regular key rotation in AWS KMS by setting a key policy that specifies a rotation schedule. Here's an example of how to schedule annual key rotation in KMS using the AWS CLI:

 
aws kms put-key-policy \\
    --key-id your_key_id_here \\
    --policy-name default \\
    --policy
    "{\\"Version\\":\\"2012-10-17\\",\\"Statement\\":[{\\"Effect\\":\\"Allow\\"
    "{\\"Version\\":\\"2012-10-17\\",\\"Statement\\":[{\\"Effect\\":\\"Allow\\"
    \\":\\"kms:RotateKey\\",\\"Resource\\":\\"*\\"},{\\"Effect\\":\\"Allow\\",\
    \"Principal\\":{\\"AWS\\":\\"arn:aws:iam::123456789012:root\\"},\\"Action\\
    ":\\"kms:CreateGrant\\",\\"Resource\\":\\"*\\",\\"Condition\\":{\\"Bool\\":
    {\\"kms:GrantIsForAWSResource\\":\\"true\\"}}}]}"

Turn on logging 

To track user access to sensitive data and identify any unwanted access, logging must be enabled. All SQL commands that are executed on your cluster are logged when you activate query logging in Amazon Redshift. This applies to queries that access sensitive data as well as data-scrambling operations. Afterwards, you may examine these logs to look for any strange access patterns or suspect activities.

You may use the following SQL statement to make query logging available in Amazon Redshift:

ALTER DATABASE  SET enable_user_activity_logging=true;

The stl query system table may be used to retrieve the logs once query logging has been enabled. For instance, the SQL query shown below will display all queries that reached a certain table:

Monitor Performance 

Data scrambling is often a resource-intensive practice, so it’s good to monitor CPU usage, memory usage, and disk I/O to ensure your cluster isn’t being overloaded. In Redshift, you can use the <inlineCode>svl_query_summary<inlineCode> and <inlineCode>svl_query_report<inlineCode> system views to monitor query performance. You can also use Amazon CloudWatch to monitor metrics such as CPU usage and disk space.

Amazon CloudWatch

Establishing Backup and Disaster Recovery

In order to prevent data loss in the case of a disaster, backup and disaster recovery mechanisms should be put in place. Automated backups and manual snapshots are only two of the backup and recovery methods offered by Amazon Redshift. Automatic backups are taken once every eight hours by default. 

Moreover, you may always manually take a snapshot of your cluster. In the case of a breakdown or disaster, your cluster may be restored using these backups and snapshots. Use this SQL query to manually take a snapshot of your cluster in Amazon Redshift:

CREATE SNAPSHOT ; 

To restore a snapshot, you can use the <inlineCode>RESTORE<inlineCode> command. For example:


RESTORE 'snapshot_name' TO 'new_cluster_name';

Frequent Review and Updates

To ensure that data scrambling procedures remain effective and up-to-date with the latest security requirements, it is crucial to consistently review and update them. This process should include examining backup and recovery procedures, encryption techniques, and access controls.

In Amazon Redshift, you can assess access controls by inspecting all roles and their associated permissions in the <inlineCode>pg_roles<inlineCode> system catalog database. It is essential to confirm that only authorized individuals have access to sensitive information.

To analyze encryption techniques, use the <inlineCode>pg_catalog.pg_attribute<inlineCode> system catalog table, which allows you to inspect data types and encryption settings for each column in your tables. Ensure that sensitive data fields are protected with robust encryption methods, such as AES-256.

The AWS CLI commands <inlineCode>aws backup plan<inlineCode> and <inlineCode>aws backup vault<inlineCode> enable you to review your backup plans and vaults, as well as evaluate backup and recovery procedures. Make sure your backup and recovery procedures are properly configured and up-to-date.

Decrypting Data in Redshift

There are different options for decrypting data, depending on the encryption method used and the tools available; the decryption process is similar to of encryption, usually a custom UDF is used to decrypt the data, let’s look at one example of decrypting data scrambling with a substitution cipher.

Step 1: Create a UDF with decryption logic for substitution


CREATE FUNCTION decrypt_substitution(ciphertext varchar) RETURNS varchar
IMMUTABLE AS $$
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    substitution = 'ijklmnopqrstuvwxyzabcdefgh'
    reverse_substitution = ''.join(sorted(substitution, key=lambda c: substitution.index(c)))
    plaintext = ''
    for i in range(len(ciphertext)):
        index = substitution.find(ciphertext[i])
        if index == -1:
            plaintext += ciphertext[i]
        else:
            plaintext += reverse_substitution[index]
    return plaintext
$$ LANGUAGE plpythonu;

Step 2: Move the data back after truncating and applying the decryption function


TRUNCATE original_table;
INSERT INTO original_table (column1, decrypted_column2, column3)
SELECT column1, decrypt_substitution(encrypted_column2), column3
FROM temp_table;

In this example, encrypted_column2 is the encrypted version of column2 in the temp_table. The decrypt_substitution function is applied to encrypted_column2, and the result is inserted into the decrypted_column2 in the original_table. Make sure to replace column1, column2, and column3 with the appropriate column names, and adjust the INSERT INTO statement accordingly if you have more or fewer columns in your table.

Conclusion

Redshift data scrambling is an effective tool for additional data protection and should be considered as part of an organization's overall data security strategy. In this blog post, we looked into the importance of data protection and how this can be integrated effectively into the  data warehouse. Then, we covered the difference between data scrambling and data masking before diving into how one can set up Redshift data scrambling.

Once you begin to accustom to Redshift data scrambling, you can upgrade your security techniques with different techniques for scrambling data and best practices including encryption practices, logging, and performance monitoring. Organizations may improve their data security posture management (DSPM) and reduce the risk of possible breaches by adhering to these recommendations and using an efficient strategy.

<blogcta-big>

Veronica is the security researcher at Sentra. She brings a wealth of knowledge and experience as a cybersecurity researcher. Her main focuses are researching the main cloud provider services and AI infrastructures for Data related threats and techniques.

Subscribe

Latest Blog Posts

David Stuart
David Stuart
April 22, 2026
4
Min Read
AI and ML

What Breaks in Production AI When It Doesn’t Have Data Security Context?

What Breaks in Production AI When It Doesn’t Have Data Security Context?

Everyone’s talking about the context layer for AI – the semantic glue between raw data and intelligent behavior. Atlan’s Activate is showing how the industry is moving to make that layer real: demonstrating the Enterprise Data Graph, Context Engineering Studio, and a shared fabric in real time. Capabilities like these let AI agents finally understand what data means in production, not just where it lives.

But there’s a blind spot that keeps showing up when we walk into real enterprises:

Your AI doesn’t just need business and analytical context. It needs data security context – or it will quietly break in production in ways that are hard, expensive, and sometimes impossible to fix after the fact.

In this post, I’ll focus on what goes wrong when AI runs without that data security context, why it’s harder to bolt on later than most teams assume, and how Sentra’s category – cloud-native DSPM with deep unstructured data coverage – is built to feed the “context layer” with the one dimension it can’t infer from SQL patterns alone: risk.

What Actually Breaks Without Data Security Context?

When we say “it breaks,” we don’t mean “the model returns a bad joke.” We mean systemic failures that show up only once you’re in production with real users, real data, and real regulators.

Here’s what we see over and over:

1. AI picks the right answer from the wrong data

Your context layer tells the agent which tables and documents look relevant. Great. But if it doesn’t know:

  • Which of those assets contain regulated data (PII, PHI, PCI, secrets)
  • Where outdated copies and derivatives live across OneDrive, SharePoint, Gmail, Google Drive, S3, etc.
  • Which identities, apps, and agents are allowed to touch them

…then the agent will happily answer the question from a dataset that never should have been exposed to that user or workflow in the first place.

Semantically correct. Security-wise catastrophic.

2. “Context aware” copilots still hallucinate permissions

We see this in Microsoft 365 Copilot and Google Workspace with Gemini:

  • Copilot can understand SharePoint sites and OneDrives, but not whether a document is overshared to “anyone with the link” or inherited via a stale group.
  • Gemini Chat can retrieve from Drive, but doesn’t know if that spreadsheet became sensitive when someone added a new column of health data last week.

Without a live data access graph – identities, apps, agents, and their effective permissions to sensitive content – your AI believes the IAM story, not the reality on the ground.

3. Governance teams lose the plot on blast radius

Security, risk, and compliance teams ask a simple question:

“If this AI workflow is compromised tomorrow, what sensitive data could realistically be exposed?”

If your context layer has no notion of:

  • Where regulated data sits across SaaS, cloud data warehouses, collaboration platforms, and object storage
  • How that data flows into retrieval indexes, vector stores, and training sets
  • Which non-human identities (connectors, OAuth apps, service principals, copilots) can query it

…then you can’t answer that blast-radius question in a credible way. You’re back to spreadsheets and manual inventories – which is exactly what the context layer was supposed to fix.

4. Incident response becomes guesswork

The first time a GenAI workflow mishandles data, everyone scrambles:

  • “Which prompts touched PCI data?”
  • “Did that model training run include EU citizen data that violates residency?”
  • “Which users received responses that included that contract template or source-code snippet?”

If your AI stack was never wired to data security posture – sensitivity, ownership, access, data movement, and misconfigurations – you can’t reconstruct what actually happened. You’re stuck with log-diving and hope.

Why This Is Much Harder to “Patch” Than It Sounds

On paper, the fix seems straightforward:

  • “We’ll just add some DLP policies.”
  • “We’ll tune the retrieval layer to avoid certain tables.”
  • “We’ll label the sensitive stuff and call it a day.”

In production, those tactics collapse for three reasons.

1. Labels are not context

Most organizations still rely on static labels – “Confidential,” “PII,” etc. These break at AI scale because:

  • They’re missing or wrong for huge swaths of unstructured data: docs, slides, PDFs, images, chat attachments, code, logs.
  • They don’t encode why the data is sensitive (contract vs. credentials vs. design IP vs. health record).
  • They say nothing about who can access it today or how that has drifted over time.

A context layer that only sees labels can’t distinguish “safe to use in this RAG workflow” from “lawsuit waiting to happen.”

2. Security context is cross-system and constantly changing

AI teams often underestimate the dynamics involved:

  • Data sets move between warehouses, object stores, SaaS apps, and M365/Workspace tenants weekly.
  • New data is created at petabyte scale – especially unstructured content in M365, Google Drive, Slack, etc.
  • Identities and apps are created, granted permissions, and forgotten (especially third‑party integrations and copilots).

Trying to “hard-code” allowed sources, or maintain a static allowlist of safe collections, is equivalent to freezing your organization on the day you launch your first AI pilot. It doesn’t survive the next quarter.

3. You can’t bolt on trust after you ship

The most painful pattern we see:

  1. Team launches a pilot RAG or copilot.
  2. It lands well, usage explodes.
  3. Only then does security get brought in to review.

At that point:

  • Indexes are already built on top of unknown data.
  • Training sets have been created from snapshots no one can fully reconstruct.
  • Business stakeholders are used to the AI “just working.”

Retrofitting data security context into that mess is like trying to retrofit access governance onto a SaaS estate ten years after everyone integrated everything with everything. It’s not an integration project; it’s a re‑architecture project.

Sentra’s Point of View: Data Security Context Is a First-Class Citizen of the Context Layer

Atlan is right: the context layer will be the most important enterprise asset of the AI era. But our conviction at Sentra is:

A context layer that doesn’t understand data security posture is fundamentally incomplete.

For AI to be both useful and safe, your context graph has to know, for every relevant asset:

  • What it is (content- and schema-aware classification both at the entity and file level)
  • How sensitive it is (regulatory, contractual, IP, secrets)
  • Who or what can access it (users, groups, apps, agents, OAuth connectors)
  • How it moves and mutates (copies, derivatives, AI workflows, exports)

That’s exactly the slice of context Sentra provides.

How Sentra enhances the context layer

From our deployments with enterprises running M365, Google Workspace, cloud data platforms, and SaaS, we’ve built Sentra around three pillars that plug directly into a modern context layer:

  1. AI-grade, petabyte-scale classification for unstructured data

  • We classify documents, emails, files, code, and other unstructured content across M365, Google Workspace, cloud object stores, and SaaS with high accuracy and at petabyte scale – not just database rows.
  • This includes contextual understanding (contracts vs. HR docs vs. financials vs. source code) so the context layer isn’t guessing from filenames.

  1. Data Access Governance (DAG) that understands humans and non-human identities

  • We map which users, groups, service principals, OAuth apps, and copilots can reach which sensitive assets, across clouds and SaaS.
  • That access graph becomes a critical input into any context layer deciding what is safe to retrieve or train on for a given agent.

  1. Data Detection & Response (DDR) that follows data into AI workflows

  • We track how sensitive data moves: copies, derivatives, exports, and AI interactions – not just who touched a file once.
  • That telemetry feeds back into risk scoring and guardrails, so AI workflows can be shut down or tuned when they start creating new exposure patterns.

Put differently: Atlan is building the infrastructure for context – Enterprise Data Graph, Context Engineering Studio, Context Lakehouse. Sentra brings the security brain that tells that infrastructure which data is safe to use, under what conditions, and for whom. The enriched security context that Sentra provides flows into Atlan’s Enterprise Context Layer so that AI systems act accurately, reliably, and safely.

Read More
Yair Cohen
Yair Cohen
David Stuart
David Stuart
April 15, 2026
3
Min Read
Data Sprawl

Fiverr Data Breach: Beyond Misconfigured Buckets and the Data Sprawl That Made It Inevitable

Fiverr Data Breach: Beyond Misconfigured Buckets and the Data Sprawl That Made It Inevitable

Fiverr’s recent data breach/data exposure left tax forms, IDs, contracts, and even credentials publicly accessible and indexed by Google via misconfigured Cloudinary URLs.

This post explains what happened, why data sprawl across third-party services made it inevitable, and how to prevent the next Fiverr-style leak.

The Fiverr data breach is a textbook case of sensitive data sprawl and misconfigured third‑party infrastructure: highly sensitive documents (including tax returns, IDs, health records, and even admin credentials) were stored on Cloudinary behind unauthenticated, non‑expiring URLs, then surfaced via public HTML so Google could index them—remaining accessible for weeks after initial disclosure and hours after public reporting. This isn’t a zero‑day exploit; it’s a failure to understand where regulated data lives, how it rapidly proliferates and is shared across services, and whether controls like signed URLs, authentication, and proper indexing rules are actually in place.

In practical terms, what happened in the Fiverr data breach?

– Sensitive documents (tax returns, IDs, contracts, even credentials) were stored on Cloudinary behind unauthenticated, non-expiring URLs.

– Some of those URLs were linked from public HTML, allowing Google and other search engines to index them.

– As a result, private Fiverr user data became publicly searchable, long before regulators or affected users were notified.

What the Fiverr Data Breach Reveals About Third-Party Data Sprawl

What makes this kind of data exposure - like the Fiverr data leak - so damaging is that it collapses the boundary between “internal work product” and “public web content.” The same files that power everyday workflows—tax filings, medical notes, penetration test reports, admin credentials—suddenly become discoverable to anyone with a search engine, long before regulators or affected users even know there’s a problem. As enterprises lean on third‑party processors, media platforms, and SaaS for collaboration, the real risk isn’t a single misconfigured bucket; it’s the absence of continuous visibility into where sensitive data actually resides and who—human or machine—can reach it.

Sentra is built to restore that visibility and hygiene baseline across the entire data estate, including cloud storage, SaaS platforms, AI data lakes, and media services like the one at the center of this incident. By running discovery and classification in‑environment—without copying customer data out—Sentra builds a live inventory of sensitive assets, from tax forms and IDs to health and financial records, even in unstructured PDFs and images brought into scope via OCR and transcription. On top of that, Sentra continuously identifies redundant, obsolete, and toxic (ROT) data, so organizations can eliminate unnecessary copies that amplify the blast radius when something does go wrong, and set enforceable policies like “no GLBA‑covered data on unauthenticated public endpoints” before the next Cloudinary‑style exposure ever materializes.

If you’re asking “How do we avoid a Fiverr-style data breach on our own SaaS and media stack?”, the starting point is continuous visibility into where sensitive data lives, how it moves into services like Cloudinary, and who or what (including AI agents) can access it.

How to Prevent a Fiverr-Style Data Leak Across SaaS, Storage, and Media Services

Where traditional controls stop at the perimeter, Sentra ties data to identities and access paths, including AI agents, copilots, and service principals. Lineage‑driven maps show how data moves—from a storage bucket into a search index, from a document library into a media processor—so entitlements can follow data automatically and public or over‑privileged links can be revoked in a targeted way, rather than taking an entire service offline. On that foundation, Sentra orchestrates automated actions and remediation: quarantining exposed files, tombstoning toxic copies, removing public links, and routing rich, contextual tickets to owners when human judgment is required—all through existing tools like DLP, IAM, ServiceNow, Jira, Slack, and SOAR instead of standing up a parallel enforcement stack.

Doing this at “Fiverr scale” requires more than point tools; it demands a platform that is accurate, scalable, and cost‑efficient enough to run continuously and scale across multi-hundred petabyte environments. Sentra’s in‑environment architecture and small‑model approach have already scanned 8–9 petabytes in under 4–5 days at 95–98% accuracy—an order‑of‑magnitude faster and cheaper than extraction‑based alternatives—while keeping customer data inside their own accounts. That efficiency means enterprises can maintain continuous scanning, labeling, and remediation across hundreds of petabytes and multiple clouds without turning governance into a budget‑breaking project, and can generate audit‑grade evidence that sensitive data was governed properly over time—not just at the last assessment.

Incidents like the Fiverr data breach are a warning shot for the AI era, where copilots, internal agents, and search experiences will happily surface whatever the underlying permissions and data quality allow. As AI adoption accelerates, the only sustainable defense is a baseline of automated, continuous data protection: accurate classification, durable hygiene, identity‑aware access, automated remediation, and economically viable, always‑on governance that keeps pace with rapidly expanding and evolving data estates. You can’t secure AI—or avoid the next “public and searchable” headline—without first understanding and continuously governing the data that AI and its surrounding services can see. As AI pushes boundaries (and challenges security teams!), there is no time like now to ensure data remains protected.


Fiverr data breach FAQ

  • Was my Fiverr data exposed in the breach?
    Fiverr and independent researchers have confirmed that some user documents—including tax forms, IDs, invoices, and credentials—were publicly accessible and indexed by Google via misconfigured Cloudinary URLs. Whether your specific files were exposed depends on what you shared and how Fiverr stored it, but the safest assumption is that any sensitive document shared on the platform may have been at risk.

  • What made the Fiverr data breach possible?
    The root cause wasn’t a zero-day exploit; it was data sprawl across third-party infrastructure plus weak controls: public, non-expiring Cloudinary URLs, public HTML linking to those URLs, and no continuous visibility into where regulated data lived or who could reach it.

  • How can enterprises prevent similar leaks?
    By continuously discovering and classifying sensitive data across cloud storage, SaaS, and media services; cleaning up ROT; enforcing policies like “no GLBA-covered data on unauthenticated public endpoints”; and tying access to identities so public links and over-privileged routes can be revoked automatically. 

Read more about the Fiverr Data Breach

Detailed news coverage of the Fiverr data breach and Cloudinary misconfiguration (Cybernews)

Independent analysis of the Fiverr data exposure via public Cloudinary URLs (CyberInsider)

Read More
Ariel Rimon
Ariel Rimon
March 30, 2026
3
Min Read

Web Archive Scanning: WARC, ARC, and the Forgotten PII in Your Compliance Crawls

Web Archive Scanning: WARC, ARC, and the Forgotten PII in Your Compliance Crawls

One of the most interesting blind spots I see in mature security programs isn’t a database or a SaaS app. It’s web archives.

If you’re in financial services, you may be required to archive every version of your public website for years. Legal teams preserve web content under hold. Marketing and product teams crawl competitors for competitive intel. Security teams capture phishing pages and breach sites for analysis. All of that activity produces WARC and ARC files - standard formats for storing captured web content.

Now ask yourself: what’s in those archives?

Where Web Archives Come From and Why They Get Ignored

In most enterprises, web archives are created in predictable ways, but rarely treated as data stores that need to be actively managed. Compliance teams crawl and preserve marketing pages, disclosures, and rate sheets to meet record-keeping requirements. Legal teams snapshot websites for e-discovery and retain those captures for years. Product and growth teams scrape competitor sites, pricing pages, and documentation, while security teams collect phishing kits, fake login pages, and breach sites for analysis.

All of this content ends up stored as WARC or ARC files in object storage or file shares. Once the initial crawl is complete and the compliance requirement is satisfied, these archives are typically dumped into an S3 bucket or on-prem share, referenced in a ticket or spreadsheet, and then quietly forgotten.

That’s where the risk begins. What started as a compliance or research activity turns into a growing, unmonitored data store - one that may contain sensitive and regulated information, but sits outside the scope of most security and privacy programs.

What’s Really Inside a WARC or ARC File?

A single WARC from a routine compliance crawl of your own site can contain thousands of pages. Many of those pages will have:

  • Customer names and emails
  • Account IDs and usernames
  • Phone numbers and mailing addresses
  • Perhaps even partial transaction details in page content, forms, or query strings

If you’re scraping external sites, those files can hold third‑party PII: profiles, contact details, and public record data. Threat intel archives may include:

  • Captured credentials from phishing kits
  • Breach data and exposed account information
  • Screenshots or HTML copies of login pages and portals

Meanwhile, the archives themselves grow quietly in S3 buckets and on‑prem file shares, rarely revisited and almost never scanned with the same rigor you apply to “primary” systems.

From a privacy perspective, this is a real problem. Under GDPR and similar laws, individuals have the right to request access to and deletion of their personal data. If that data lives inside a 3‑year‑old WARC file you can’t even parse, you have no practical way or scalable way to honor that request. Multiply that across years of compliance archiving, legal holds, scraping campaigns, and threat intel crawls, and you’re sitting on terabytes of unmanaged web content containing PII and regulated data.

Why Traditional DLP and Discovery Can’t Handle WARC and ARC

Most traditional DLP (Data Loss Prevention) and data discovery tools were designed for a simpler data landscape, focused on emails, attachments, PDFs, Office documents, and flat text logs or CSV files. When these tools encounter formats like WARC or ARC files, they typically treat them as opaque blobs of data, relying on basic text extraction and regex-based pattern matching to identify sensitive information.

This approach breaks down with web archives. WARC and ARC files are complex container formats that store full HTTP interactions, including requests, responses, headers, and payloads. A single web archive can contain thousands of captured pages and resources: HTML, JavaScript, CSS, JSON APIs, images, and PDFs, often compressed or encoded in ways that require reconstructing the original HTTP responses to interpret correctly.

As a result, legacy DLP tools cannot reliably parse or analyze WARC and ARC files. Instead, they surface only fragmented data such as headers, binary content, or partial HTML, without reconstructing the full user-visible context. This means they miss critical elements like complete web pages, DOM structures, form inputs, query strings, request bodies, and embedded assets where sensitive data such as PII, credentials, or financial information may exist.

The result is a significant compliance and security gap. Web archives stored in WARC and ARC formats often contain regulated data but remain unscanned and unmanaged, creating a persistent blind spot for traditional DLP and DSPM programs.

How Sentra Scans Web Archives at Scale

We built web archive scanning into Sentra to make this tractable.

Sentra’s WarcReader understands both WARC and ARC formats. It:

  • Processes captured HTTP responses, not just headers
  • Extracts the actual HTML page content and associated resources from each record
  • Normalizes those payloads so they can be scanned just like any other web‑delivered content

Once we’ve pulled out the page content and resources, we run them through the same classification engine we apply to your other data stores, looking for:

  • PII (names, emails, addresses, national IDs, phone numbers, etc.)
  • Financial data (account numbers, card numbers, bank details)
  • Healthcare information and PHI indicators
  • Credentials and other secrets
  • Business‑sensitive data (internal IDs, case numbers, etc.)

Because WARC files can be huge, we do all of this in memory, without unpacking archives to disk. That matters for two reasons:

  1. Performance and scale: We can stream through large archives without creating temporary, unmanaged copies.
  2. Security: We avoid writing decrypted or reconstructed content to local disks, which would create new artifacts you now have to protect.

We also handle embedded resources - images, documents, and other files captured as part of the original pages — so you’re not only seeing what was in the HTML but also what was linked or rendered alongside it. Sentra’s existing file parsers and OCR engine can inspect those nested assets for sensitive content just as they would in any other data store.

Bringing Web Archives into Your DSPM Program

Once you can actually see inside web archives, you can bring them into your data security program instead of pretending they’re “just logs.”

With Sentra, teams can:

  • Discover where web archives live across cloud and on‑prem (S3, Azure Blob, GCS, NFS/SMB shares, and more).
  • Classify the captured content for PII, PCI, PHI, credentials, and business‑sensitive information.
  • Assess regulatory exposure from long‑running archiving programs and legal holds that have accumulated unmanaged PII over time.
  • Support DSAR and deletion workflows that touch archived content, so you can respond to GDPR/CCPA requests with an honest inventory that includes historical web captures.
  • Evaluate scraping and threat‑intel collections to identify sensitive data they were never supposed to capture in the first place (for example, credentials, breach records, or third‑party PII).

In practice, this often leads to concrete actions like:

  • Tightening retention policies on specific archive sets
  • Segmenting or encrypting archives that contain regulated data
  • Updating crawler configurations to avoid collecting sensitive content going forward
  • Aligning privacy teams, legal, and security around a shared understanding of what’s actually in years’ worth of WARC/ARC content

Web Archives Are Data Stores - Treat Them That Way

Web archives aren’t just compliance artifacts, they’re data stores, often holding sensitive and regulated information. Yet in most organizations, WARC and ARC files sit outside the scope of DSPM and data discovery, creating a blind spot between what’s stored and what’s actually secured.

Sentra removes that tradeoff. You can keep the archives you’re required to maintain and gain full visibility into the data inside them. By bringing WARC and ARC files into your DSPM program, you extend coverage to web archives and other hard-to-reach data—without changing how you store or manage them.

Want to see what’s hiding in your web archives? Explore how Sentra scans WARC and ARC files and uncovers sensitive data at scale.

<blogcta-big>

Read More
Expert Data Security Insights Straight to Your Inbox
What Should I Do Now:
1

Get the latest GigaOm DSPM Radar report - see why Sentra was named a Leader and Fast Mover in data security. Download now and stay ahead on securing sensitive data.

2

Sign up for a demo and learn how Sentra’s data security platform can uncover hidden risks, simplify compliance, and safeguard your sensitive data.

3

Follow us on LinkedIn, X (Twitter), and YouTube for actionable expert insights on how to strengthen your data security, build a successful DSPM program, and more!

Before you go...

Get the Gartner Customers' Choice for DSPM Report

Read why 98% of users recommend Sentra.

White Gartner Peer Insights Customers' Choice 2025 badge with laurel leaves inside a speech bubble.