All Resources
In this article:
minus iconplus icon
Share the Blog

Use Redshift Data Scrambling for Additional Data Protection

May 3, 2023
8
Min Read

According to IBM, a data breach in the United States cost companies an average of 9.44 million dollars in 2022. It is now more important than ever for organizations to place high importance on protecting confidential information. Data scrambling, which can add an extra layer of security to data, is one approach to accomplish this. 

In this post, we'll analyze the value of data protection, look at the potential financial consequences of data breaches, and talk about how Redshift Data Scrambling may help protect private information.

The Importance of Data Protection

Data protection is essential to safeguard sensitive data from unauthorized access. Identity theft, financial fraud,and other serious consequences are all possible as a result of a data breach. Data protection is also crucial for compliance reasons. Sensitive data must be protected by law in several sectors, including government, banking, and healthcare. Heavy fines, legal problems, and business loss may result from failure to abide by these regulations.

Hackers employ many techniques, including phishing, malware, insider threats, and hacking, to get access to confidential information. For example, a phishing assault may lead to the theft of login information, and malware may infect a system, opening the door for additional attacks and data theft. 

So how to protect yourself against these attacks and minimize your data attack surface?

What is Redshift Data Masking?

Redshift data masking is a technique used to protect sensitive data in Amazon Redshift; a cloud-based data warehousing and analytics service. Redshift data masking involves replacing sensitive data with fictitious, realistic values to protect it from unauthorized access or exposure. It is possible to enhance data security by utilizing Redshift data masking in conjunction with other security measures, such as access control and encryption, in order to create a comprehensive data protection plan.

What is Redshift Data Masking

What is Redshift Data Scrambling?

Redshift data scrambling protects confidential information in a Redshift database by altering original data values using algorithms or formulas, creating unrecognizable data sets. This method is beneficial when sharing sensitive data with third parties or using it for testing, development, or analysis, ensuring privacy and security while enhancing usability. 

The technique is highly customizable, allowing organizations to select the desired level of protection while maintaining data usability. Redshift data scrambling is cost-effective, requiring no additional hardware or software investments, providing an attractive, low-cost solution for organizations aiming to improve cloud data security.

Data Masking vs. Data Scrambling

Data masking involves replacing sensitive data with a fictitious but realistic value. However, data scrambling, on the other hand, involves changing the original data values using an algorithm or a formula to generate a new set of values.

In some cases, data scrambling can be used as part of data masking techniques. For instance, sensitive data such as credit card numbers can be scrambled before being masked to enhance data protection further.

Setting up Redshift Data Scrambling

Having gained an understanding of Redshift and data scrambling, we can now proceed to learn how to set it up for implementation. Enabling data scrambling in Redshift requires several steps.

To achieve data scrambling in Redshift, SQL queries are utilized to invoke built-in or user-defined functions. These functions utilize a blend of cryptographic techniques and randomization to scramble the data.

The following steps are explained using an example code just for a better understanding of how to set it up:

Step 1: Create a new Redshift cluster

Create a new Redshift cluster or use an existing cluster if available. 

Redshift create cluster

Step 2: Define a scrambling key

Define a scrambling key that will be used to scramble the sensitive data.

 
SET session my_scrambling_key = 'MyScramblingKey';

In this code snippet, we are defining a scrambling key by setting a session-level parameter named <inlineCode>my_scrambling_key<inlineCode> to the value <inlineCode>MyScramblingKey<inlineCode>. This key will be used by the user-defined function to scramble the sensitive data.

Step 3: Create a user-defined function (UDF)

Create a user-defined function in Redshift that will be used to scramble the sensitive data. 


CREATE FUNCTION scramble(input_string VARCHAR)
RETURNS VARCHAR
STABLE
AS $$
DECLARE
scramble_key VARCHAR := 'MyScramblingKey';
BEGIN
-- Scramble the input string using the key
-- and return the scrambled output
RETURN ;
END;
$$ LANGUAGE plpgsql;

Here, we are creating a UDF named <inlineCode>scramble<inlineCode> that takes a string input and returns the scrambled output. The function is defined as <inlineCode>STABLE<inlineCode>, which means that it will always return the same result for the same input, which is important for data scrambling. You will need to input your own scrambling logic.

Step 4: Apply the UDF to sensitive columns

Apply the UDF to the sensitive columns in the database that need to be scrambled.


UPDATE employee SET ssn = scramble(ssn);

For example, applying the <inlineCode>scramble<inlineCode> UDF to a column saying, <inlineCode>ssn<inlineCode> in a table named <inlineCode>employee<inlineCode>. The <inlineCode>UPDATE<inlineCode> statement calls the <inlineCode>scramble<inlineCode> UDF and updates the values in the <inlineCode>ssn<inlineCode> column with the scrambled values.

Step 5: Test and validate the scrambled data

Test and validate the scrambled data to ensure that it is unreadable and unusable by unauthorized parties.


SELECT ssn, scramble(ssn) AS scrambled_ssn
FROM employee;

In this snippet, we are running a <inlineCode>SELECT<inlineCode> statement to retrieve the <inlineCode>ssn<inlineCode> column and the corresponding scrambled value using the <inlineCode>scramble<inlineCode> UDF. We can compare the original and scrambled values to ensure that the scrambling is working as expected. 

Step 6: Monitor and maintain the scrambled data

To monitor and maintain the scrambled data, we can regularly check the sensitive columns to ensure that they are still rearranged and that there are no vulnerabilities or breaches. We should also maintain the scrambling key and UDF to ensure that they are up-to-date and effective.

Different Options for Scrambling Data in Redshift

Selecting a data scrambling technique involves balancing security levels, data sensitivity, and application requirements. Various general algorithms exist, each with unique pros and cons. To scramble data in Amazon Redshift, you can use the following Python code samples in conjunction with a library like psycopg2 to interact with your Redshift cluster. Before executing the code samples, you will need to install the psycopg2 library:


pip install psycopg2

Random

Utilizing a random number generator, the Random option quickly secures data, although its susceptibility to reverse engineering limits its robustness for long-term protection.


import random
import string
import psycopg2

def random_scramble(data):
    scrambled = ""
    for char in data:
        scrambled += random.choice(string.ascii_letters + string.digits)
    return scrambled

# Connect to your Redshift cluster
conn = psycopg2.connect(host='your_host', port='your_port', dbname='your_dbname', user='your_user', password='your_password')
cursor = conn.cursor()
# Fetch data from your table
cursor.execute("SELECT sensitive_column FROM your_table;")
rows = cursor.fetchall()

# Scramble the data
scrambled_rows = [(random_scramble(row[0]),) for row in rows]

# Update the data in the table
cursor.executemany("UPDATE your_table SET sensitive_column = %s WHERE sensitive_column = %s;", [(scrambled, original) for scrambled, original in zip(scrambled_rows, rows)])
conn.commit()

# Close the connection
cursor.close()
conn.close()

Shuffle

The Shuffle option enhances security by rearranging data characters. However, it remains prone to brute-force attacks, despite being harder to reverse-engineer.


import random
import psycopg2

def shuffle_scramble(data):
    data_list = list(data)
    random.shuffle(data_list)
    return ''.join(data_list)

conn = psycopg2.connect(host='your_host', port='your_port', dbname='your_dbname', user='your_user', password='your_password')
cursor = conn.cursor()

cursor.execute("SELECT sensitive_column FROM your_table;")
rows = cursor.fetchall()

scrambled_rows = [(shuffle_scramble(row[0]),) for row in rows]

cursor.executemany("UPDATE your_table SET sensitive_column = %s WHERE sensitive_column = %s;", [(scrambled, original) for scrambled, original in zip(scrambled_rows, rows)])
conn.commit()

cursor.close()
conn.close()

Reversible

By scrambling characters in a decryption key-reversible manner, the Reversible method poses a greater challenge to attackers but is still vulnerable to brute-force attacks. We’ll use the Caesar cipher as an example.


def caesar_cipher(data, key):
    encrypted = ""
    for char in data:
        if char.isalpha():
            shift = key % 26
            if char.islower():
                encrypted += chr((ord(char) - 97 + shift) % 26 + 97)
            else:
                encrypted += chr((ord(char) - 65 + shift) % 26 + 65)
        else:
            encrypted += char
    return encrypted

conn = psycopg2.connect(host='your_host', port='your_port', dbname='your_dbname', user='your_user', password='your_password')
cursor = conn.cursor()

cursor.execute("SELECT sensitive_column FROM your_table;")
rows = cursor.fetchall()

key = 5
encrypted_rows = [(caesar_cipher(row[0], key),) for row in rows]
cursor.executemany("UPDATE your_table SET sensitive_column = %s WHERE sensitive_column = %s;", [(encrypted, original) for encrypted, original in zip(encrypted_rows, rows)])
conn.commit()

cursor.close()
conn.close()

Custom

The Custom option enables users to create tailor-made algorithms to resist specific attack types, potentially offering superior security. However, the development and implementation of custom algorithms demand greater time and expertise.

Best Practices for Using Redshift Data Scrambling

There are several best practices that should be followed when using Redshift Data Scrambling to ensure maximum protection:

Use Unique Keys for Each Table

To ensure that the data is not compromised if one key is compromised, each table should have its own unique key pair. This can be achieved by creating a unique index on the table.


CREATE UNIQUE INDEX idx_unique_key ON table_name (column_name);

Encrypt Sensitive Data Fields 

Sensitive data fields such as credit card numbers and social security numbers should be encrypted to provide an additional layer of security. You can encrypt data fields in Redshift using the ENCRYPT function. Here's an example of how to encrypt a credit card number field:


SELECT ENCRYPT('1234-5678-9012-3456', 'your_encryption_key_here');

Use Strong Encryption Algorithms

Strong encryption algorithms such as AES-256 should be used to provide the strongest protection. Redshift supports AES-256 encryption for data at rest and in transit.


CREATE TABLE encrypted_table (  sensitive_data VARCHAR(255) ENCODE ZSTD ENCRYPT 'aes256' KEY 'my_key');

Control Access to Encryption Keys 

Access to encryption keys should be restricted to authorized personnel to prevent unauthorized access to sensitive data. You can achieve this by setting up an AWS KMS (Key Management Service) to manage your encryption keys. Here's an example of how to restrict access to an encryption key using KMS in Python:


import boto3

kms = boto3.client('kms')

key_id = 'your_key_id_here'
grantee_principal = 'arn:aws:iam::123456789012:user/jane'

response = kms.create_grant(
    KeyId=key_id,
    GranteePrincipal=grantee_principal,
    Operations=['Decrypt']
)

print(response)

Regularly Rotate Encryption Keys 

Regular rotation of encryption keys ensures that any compromised keys do not provide unauthorized access to sensitive data. You can schedule regular key rotation in AWS KMS by setting a key policy that specifies a rotation schedule. Here's an example of how to schedule annual key rotation in KMS using the AWS CLI:

 
aws kms put-key-policy \\
    --key-id your_key_id_here \\
    --policy-name default \\
    --policy
    "{\\"Version\\":\\"2012-10-17\\",\\"Statement\\":[{\\"Effect\\":\\"Allow\\"
    "{\\"Version\\":\\"2012-10-17\\",\\"Statement\\":[{\\"Effect\\":\\"Allow\\"
    \\":\\"kms:RotateKey\\",\\"Resource\\":\\"*\\"},{\\"Effect\\":\\"Allow\\",\
    \"Principal\\":{\\"AWS\\":\\"arn:aws:iam::123456789012:root\\"},\\"Action\\
    ":\\"kms:CreateGrant\\",\\"Resource\\":\\"*\\",\\"Condition\\":{\\"Bool\\":
    {\\"kms:GrantIsForAWSResource\\":\\"true\\"}}}]}"

Turn on logging 

To track user access to sensitive data and identify any unwanted access, logging must be enabled. All SQL commands that are executed on your cluster are logged when you activate query logging in Amazon Redshift. This applies to queries that access sensitive data as well as data-scrambling operations. Afterwards, you may examine these logs to look for any strange access patterns or suspect activities.

You may use the following SQL statement to make query logging available in Amazon Redshift:

ALTER DATABASE  SET enable_user_activity_logging=true;

The stl query system table may be used to retrieve the logs once query logging has been enabled. For instance, the SQL query shown below will display all queries that reached a certain table:

Monitor Performance 

Data scrambling is often a resource-intensive practice, so it’s good to monitor CPU usage, memory usage, and disk I/O to ensure your cluster isn’t being overloaded. In Redshift, you can use the <inlineCode>svl_query_summary<inlineCode> and <inlineCode>svl_query_report<inlineCode> system views to monitor query performance. You can also use Amazon CloudWatch to monitor metrics such as CPU usage and disk space.

Amazon CloudWatch

Establishing Backup and Disaster Recovery

In order to prevent data loss in the case of a disaster, backup and disaster recovery mechanisms should be put in place. Automated backups and manual snapshots are only two of the backup and recovery methods offered by Amazon Redshift. Automatic backups are taken once every eight hours by default. 

Moreover, you may always manually take a snapshot of your cluster. In the case of a breakdown or disaster, your cluster may be restored using these backups and snapshots. Use this SQL query to manually take a snapshot of your cluster in Amazon Redshift:

CREATE SNAPSHOT ; 

To restore a snapshot, you can use the <inlineCode>RESTORE<inlineCode> command. For example:


RESTORE 'snapshot_name' TO 'new_cluster_name';

Frequent Review and Updates

To ensure that data scrambling procedures remain effective and up-to-date with the latest security requirements, it is crucial to consistently review and update them. This process should include examining backup and recovery procedures, encryption techniques, and access controls.

In Amazon Redshift, you can assess access controls by inspecting all roles and their associated permissions in the <inlineCode>pg_roles<inlineCode> system catalog database. It is essential to confirm that only authorized individuals have access to sensitive information.

To analyze encryption techniques, use the <inlineCode>pg_catalog.pg_attribute<inlineCode> system catalog table, which allows you to inspect data types and encryption settings for each column in your tables. Ensure that sensitive data fields are protected with robust encryption methods, such as AES-256.

The AWS CLI commands <inlineCode>aws backup plan<inlineCode> and <inlineCode>aws backup vault<inlineCode> enable you to review your backup plans and vaults, as well as evaluate backup and recovery procedures. Make sure your backup and recovery procedures are properly configured and up-to-date.

Decrypting Data in Redshift

There are different options for decrypting data, depending on the encryption method used and the tools available; the decryption process is similar to of encryption, usually a custom UDF is used to decrypt the data, let’s look at one example of decrypting data scrambling with a substitution cipher.

Step 1: Create a UDF with decryption logic for substitution


CREATE FUNCTION decrypt_substitution(ciphertext varchar) RETURNS varchar
IMMUTABLE AS $$
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    substitution = 'ijklmnopqrstuvwxyzabcdefgh'
    reverse_substitution = ''.join(sorted(substitution, key=lambda c: substitution.index(c)))
    plaintext = ''
    for i in range(len(ciphertext)):
        index = substitution.find(ciphertext[i])
        if index == -1:
            plaintext += ciphertext[i]
        else:
            plaintext += reverse_substitution[index]
    return plaintext
$$ LANGUAGE plpythonu;

Step 2: Move the data back after truncating and applying the decryption function


TRUNCATE original_table;
INSERT INTO original_table (column1, decrypted_column2, column3)
SELECT column1, decrypt_substitution(encrypted_column2), column3
FROM temp_table;

In this example, encrypted_column2 is the encrypted version of column2 in the temp_table. The decrypt_substitution function is applied to encrypted_column2, and the result is inserted into the decrypted_column2 in the original_table. Make sure to replace column1, column2, and column3 with the appropriate column names, and adjust the INSERT INTO statement accordingly if you have more or fewer columns in your table.

Conclusion

Redshift data scrambling is an effective tool for additional data protection and should be considered as part of an organization's overall data security strategy. In this blog post, we looked into the importance of data protection and how this can be integrated effectively into the  data warehouse. Then, we covered the difference between data scrambling and data masking before diving into how one can set up Redshift data scrambling.

Once you begin to accustom to Redshift data scrambling, you can upgrade your security techniques with different techniques for scrambling data and best practices including encryption practices, logging, and performance monitoring. Organizations may improve their data security posture management (DSPM) and reduce the risk of possible breaches by adhering to these recommendations and using an efficient strategy.

<blogcta-big>

Veronica is the security researcher at Sentra. She brings a wealth of knowledge and experience as a cybersecurity researcher. Her main focuses are researching the main cloud provider services and AI infrastructures for Data related threats and techniques.

Subscribe

Latest Blog Posts

Daniel Suissa
Daniel Suissa
March 15, 2026
4
Min Read

The Blind Spot in Your Data Lake: Why Big Data Format Scanning Is the Next Frontier of Data Security

The Blind Spot in Your Data Lake: Why Big Data Format Scanning Is the Next Frontier of Data Security

Data lakes were supposed to be the great democratizer of enterprise analytics. Centralized, scalable, and cost-effective, they promised to put data in the hands of every team that needed it. And they delivered -- perhaps too well. Today, petabytes of sensitive data sit in Apache Parquet files, Avro containers, and ORC stores across S3 buckets, Azure Data Lake Storage, and Google Cloud Storage, often with little to no visibility into what those files actually contain.

Traditional Data Loss Prevention (DLP) tools were built for a world of emails, PDFs, and spreadsheets. They have no understanding of columnar storage formats, embedded schemas, or the sheer scale of modern data lake architectures. That gap is where sensitive data hides in plain sight -- and where Sentra's data lake format scanning changes the equation entirely.

The Shadow Data Problem in Data Lakes

Every modern enterprise runs some version of the same playbook: production databases feed into ETL pipelines, which land data in object storage as Parquet, Avro, or ORC files. Data engineers, analysts, and machine learning teams then consume that data downstream.

The security problem is straightforward but pervasive. When data engineering teams copy production data into data lakes for analytics, the PII that was supposed to be masked or anonymized often arrives intact. A full copy of customer records -- Social Security numbers, credit card numbers, health information -- ends up in a Parquet file in a shared S3 bucket, accessible to anyone with the right IAM role.

This is not a hypothetical scenario. It is the default state of most enterprise data lakes. And with data democratization initiatives actively expanding access to these stores, the blast radius of unprotected data lake files grows with every new user who gets read permissions.

Why Traditional DLP Falls Short

Conventional DLP solutions treat files as opaque blobs of text. They can scan a CSV or a Word document, but hand them an Apache Parquet file and they see nothing. This is a fundamental architectural limitation, not a feature gap that can be patched.

Big data formats are structurally different from traditional file types. Parquet and ORC use columnar storage, meaning data is organized by column rather than by row. Avro embeds its schema directly in the file. Arrow IPC (Feather) uses an in-memory format optimized for zero-copy reads. Scanning these formats requires purpose-built readers that understand their internal structure -- readers that traditional DLP simply does not have.

The result is a compliance blind spot that grows larger every quarter as more data moves into lakehouse architectures powered by Databricks, Snowflake external tables, and similar platforms.

How Sentra Scans Big Data Formats

Sentra provides native, schema-aware scanning for the full spectrum of data lake file formats. This is not a bolt-on capability -- it is core to how our platform understands modern data infrastructure.

Apache Parquet

Parquet is the lingua franca of the modern data lake. Sentra's tabular reader processes Parquet files with full awareness of their columnar structure, performing intelligent column-level classification. Rather than brute-forcing through every byte, Sentra leverages the columnar layout to efficiently scan individual columns for sensitive data patterns. Batch processing support means even large Parquet datasets are handled without requiring the entire file to be loaded into memory at once. Sentra also recognizes Spark checkpoint files (the `c000` convention) and processes them via Parquet or JSON fallback, ensuring that intermediate pipeline outputs do not escape scrutiny. Sentra also goes beyond the parquet schema and detects nested schemas like a json column that hides behind a “string” data type, adding meaningful context to the classification engine.

Apache Avro

Avro files carry their schema with them, and Sentra takes full advantage of that. Our tabular reader parses the embedded schema to understand field names, types, and structure before scanning the data itself. This schema-aware approach enables more accurate classification -- a field named `ssn` containing nine-digit numbers is treated differently than a field named `zip_code` with the same pattern.

Apache ORC

The Optimized Row Columnar format is a staple of Hive-based data warehouses and remains widely used across Hadoop-era data infrastructure. Sentra's tabular reader handles ORC files natively, applying the same column-level classification intelligence used for Parquet and Avro.

Apache Feather / Arrow IPC

Arrow's IPC format (commonly known as Feather) is increasingly used for fast data interchange between Python, R, and other analytics tools. Sentra scans these files through its textual reader, ensuring that even ephemeral interchange formats do not become a vector for untracked sensitive data.

Column-Level Intelligence

Across all of these formats, Sentra performs column-level scanning and classification. This is critical at data lake scale. A single column in a petabyte Parquet dataset could contain millions of Social Security numbers, while every other column holds benign operational metrics. Column-level granularity means Sentra can pinpoint exactly where sensitive data lives, rather than simply flagging an entire file as "contains PII."

The Compliance Imperative

Regulatory frameworks do not carve out exceptions for big data formats. GDPR's right of access and right to erasure apply regardless of whether personal data is stored in a PostgreSQL table or a Parquet file in S3. CCPA's disclosure requirements extend to every copy of consumer data, including the one sitting in your analytics data lake.

Data Subject Access Requests (DSARs) are particularly challenging when sensitive data is spread across thousands of Parquet files in a data lake. Without automated scanning that understands these formats, responding to a DSAR becomes a manual archaeology project -- expensive, slow, and error-prone.

The AI governance dimension adds another layer of urgency. Machine learning training datasets are frequently stored in Parquet format. If those datasets contain PII that was used to train models, organizations face regulatory exposure under emerging AI governance frameworks. Knowing what personal data exists in your ML training pipelines is no longer optional -- it is a compliance requirement that is rapidly taking shape across jurisdictions.

From Blind Spot to Full Visibility

The shift to data lakehouse architectures is accelerating. Databricks, Snowflake, and the broader modern data stack have made it easier than ever to store and process massive volumes of data in open file formats. That is a net positive for analytics and engineering teams. But without security tooling that speaks the same language as the data infrastructure, sensitive data will continue to accumulate in places where no one is looking.

Sentra closes that gap. By providing native, schema-aware scanning for Parquet, Avro, ORC, Feather, and related formats -- combined with intelligent column-level classification and efficient batch processing -- Sentra gives security and compliance teams the visibility they need into the fastest-growing data stores in the enterprise.

Data lakes are not going away. The question is whether your security posture can keep up with the data engineering teams that feed them. With Sentra, the answer is yes.

*Sentra is a Data Security Posture Management (DSPM) platform that automatically discovers, classifies, and monitors sensitive data across your entire cloud environment. To learn more about how Sentra handles data lake scanning and 150+ other file formats, book a demo with our data security experts.

<blogcta-big>

Read More
Nikki Ralston
Nikki Ralston
David Stuart
David Stuart
March 12, 2026
4
Min Read

How to Protect Sensitive Data in AWS

How to Protect Sensitive Data in AWS

Storing and processing sensitive data in the cloud introduces real risks, misconfigured buckets, over-permissive IAM roles, unencrypted databases, and logs that inadvertently capture PII. As cloud environments grow more complex in 2026, knowing how to protect sensitive data in AWS is a foundational requirement for any organization operating at scale. This guide breaks down the key AWS services, encryption strategies, and operational controls you need to build a layered defense around your most critical data assets.

How to Protect Sensitive Data in AWS (With Practical Examples)

Effective protection requires a layered, lifecycle-aware strategy. Here are the core controls to implement:

Field-Level and End-to-End Encryption

Rather than encrypting all data uniformly, use field-level encryption to target only sensitive fields, Social Security numbers, credit card details, while leaving non-sensitive data in plaintext. A practical approach: deploy Amazon CloudFront with a Lambda@Edge function that intercepts origin requests and encrypts designated JSON fields using RSA. AWS KMS manages the underlying keys, ensuring private keys stay secure and decryption is restricted to authorized services.

Encryption at Rest and in Transit

Enable default encryption on all storage assets, S3 buckets, EBS volumes, RDS databases. Use customer-managed keys (CMKs) in AWS KMS for granular control over key rotation and access policies. Enforce TLS across all service endpoints. Place databases in private subnets and restrict access through security groups, network ACLs, and VPC endpoints.

Strict IAM and Access Controls

Apply least privilege across all IAM roles. Use AWS IAM Access Analyzer to audit permissions and identify overly broad access. Where appropriate, integrate the AWS Encryption SDK with KMS for client-side encryption before data reaches any storage service.

Automated Compliance Enforcement

Use CloudFormation or Systems Manager to enforce encryption and access policies consistently. Centralize logging through CloudTrail and route findings to AWS Security Hub. This reduces the risk of shadow data and configuration drift that often leads to exposure.

What Is AWS Macie and How Does It Help Protect Sensitive Data?

AWS Macie is a managed security service that uses machine learning and pattern matching to discover, classify, and monitor sensitive data in Amazon S3. It continuously evaluates objects across your S3 inventory, detecting PII, financial data, PHI, and other regulated content without manual configuration per bucket.

Key capabilities:

  • Generates findings with sensitivity scores and contextual labels for risk-based prioritization
  • Integrates with AWS Security Hub and Amazon EventBridge for automated response workflows
  • Can trigger Lambda functions to restrict public access the moment sensitive data is detected
  • Provides continuous, auditable evidence of data discovery for GDPR, HIPAA, and PCI-DSS compliance

Understanding what sensitive data exposure looks like is the first step toward preventing it. Classifying data by sensitivity level lets you apply proportionate controls and limit blast radius if a breach occurs.

AWS Macie Pricing Breakdown

Macie offers a 30-day free trial covering up to 150 GB of automated discovery and bucket inventory. After that:

Component Cost
S3 bucket monitoring $0.10 per bucket/month (prorated daily), up to 10,000 buckets
Automated discovery $0.01 per 100,000 S3 objects/month + $1 per GB inspected beyond the first 1 GB
Targeted discovery jobs $1 per GB inspected; standard S3 GET/LIST request costs apply separately

For large environments, scope automated discovery to your highest-risk buckets first and use targeted jobs for periodic deep scans of lower-priority storage. This balances coverage with cost efficiency.

What Is AWS GuardDuty and How Does It Enhance Data Protection?

AWS GuardDuty is a managed threat detection service that continuously monitors CloudTrail events, VPC flow logs, and DNS logs. It uses machine learning, anomaly detection, and integrated threat intelligence to surface indicators of compromise.

What GuardDuty detects:

  • Unusual API calls and atypical S3 access patterns
  • Abnormal data exfiltration attempts
  • Compromised credentials
  • Multi-stage attack sequences correlated from isolated events

Findings and underlying log data are encrypted at rest using KMS and in transit via HTTPS. GuardDuty findings route to Security Hub or EventBridge for automated remediation, making it a key component of real-time data protection.

Using CloudWatch Data Protection Policies to Safeguard Sensitive Information

Applications frequently log more than intended, request payloads, error messages, and debug output can all contain sensitive data. CloudWatch Logs data protection policies automatically detect and mask sensitive information as log events are ingested, before storage.

How to Configure a Policy

  • Create a JSON-formatted data protection policy for a specific log group or at the account level
  • Specify data types to protect using over 100 managed data identifiers (SSNs, credit cards, emails, PHI)
  • The policy applies pattern matching and ML in real time to audit or mask detected data

Important Operational Considerations

  • Only users with the logs:Unmask IAM permission can view unmasked data
  • Encrypt log groups containing sensitive data using AWS KMS for an additional layer
  • Masking only applies to data ingested after a policy is active, existing log data remains unmasked
  • Set up alarms on the LogEventsWithFindings metric and route findings to S3 or Kinesis Data Firehose for audit trails

Implement data protection policies at the point of log group creation rather than retroactively, this is the single most common mistake teams make with CloudWatch masking.

How Sentra Extends AWS Data Protection with Full Visibility

Native AWS tools like Macie, GuardDuty, and CloudWatch provide strong point-in-time controls, but they don't give you a unified view of how sensitive data moves across accounts, services, and regions. This is where minimizing your data attack surface requires a purpose-built platform.

What Sentra adds:

  • Discovers and governs sensitive data at petabyte scale inside your own environment, data never leaves your control
  • Maps how sensitive data moves across AWS services and identifies shadow and redundant/obsolete/trivial (ROT) data
  • Enforces data-driven guardrails to prevent unauthorized AI access
  • Typically reduces cloud storage costs by ~20% by eliminating data sprawl

Knowing how to protect sensitive data in AWS means combining the right services, KMS for key management, Macie for S3 discovery, GuardDuty for threat detection, CloudWatch policies for log masking, with consistent access controls, encryption at every layer, and continuous monitoring. No single tool is sufficient. The organizations that get this right treat data protection as an ongoing operational discipline: audit IAM policies regularly, enforce encryption by default, classify data before it proliferates, and ensure your logging pipeline never exposes what it was meant to record.

<blogcta-big>

Read More
Dean Taler
Dean Taler
March 11, 2026
3
Min Read

Archive Scanning for Cloud Data Security: Stop Ignoring Compressed Files

Archive Scanning for Cloud Data Security: Stop Ignoring Compressed Files

If you care about cloud data security, you cannot afford to treat compressed files as opaque blobs. Archive scanning for cloud data security is no longer a nice‑to‑have — it’s a prerequisite for any credible data security posture.

Every environment I’ve seen at scale looks the same: thousands of ZIP files in S3 buckets, TAR.GZ backups in Azure Blob, JARs and DEBs in artifact repositories, and old GZIP‑compressed database dumps nobody remembers creating. These archives are the digital equivalent of sealed boxes in a warehouse. Most tools walk right past them.

Attackers don’t.

Archives: Where Sensitive Data Goes to Disappear

Think about how your teams actually use compressed files:

  • An engineer zips up a project directory — complete with .env files and API keys — and uploads it to shared storage.
  • A DBA compresses a production database backup holding millions of customer records and drops it into an internal bucket.
  • A departing employee packs a folder of financial reports into a RAR file and moves it to a personal account.

None of this is hypothetical. It happens every day, and it creates a perfect hiding place for:

  • Bulk data exfiltration – a single ZIP can contain thousands of PII‑rich documents, financial reports, or IP.
  • Nested archives – ZIP‑inside‑ZIP‑inside‑TAR.GZ is normal in automated build and backup pipelines. One‑layer scanners never see what’s inside.
  • Password‑protected archives – if your tool silently skips encrypted ZIPs, you’re ignoring what could be the highest‑risk file in your environment.
  • Software artifacts with secrets – JARs and DEBs often carry config files with embedded credentials and tokens.
  • Old backups – that three‑year‑old compressed backup may contain an unmasked database nobody has reviewed since it was created.

If your data security platform cannot see inside compressed files, you don’t actually have end‑to‑end data visibility. Full stop.

Why Archive Scanning for Cloud Data Security Is Hard

The problem isn’t just volume — it’s structure and diversity.

Real cloud environments contain:

  • ZIP / JAR / CSZ
  • RAR (including multi‑part R00/R01 sets)
  • 7Z
  • TAR and TAR.GZ / TAR.BZ2 / TAR.XZ
  • Standalone compression formats like GZIP, BZ2, XZ/LZMA, LZ4, ZLIB
  • Package formats like DEB that are themselves layered archives

Most legacy tools treat all of this as “a file with an unknown blob of bytes.” At best, they record that the archive exists. They don’t recursively extract layers, don’t traverse internal structures, and don’t feed the inner files back into the same classification engine they use for documents or databases.

That gap becomes larger every quarter, as more data gets compressed to save money and speed up transfer.

How Sentra Does Archive Scanning All the Way Down

In Sentra, we treat archives and compressed files as first‑class citizens in the parsing and classification pipeline.

Full Archive and Compression Format Coverage

Our archive scanning engine supports the full range of formats we see in real‑world cloud workloads:

  • ZIP (including JAR and CSZ)
  • RAR (including multi‑part sets)
  • 7Z
  • TAR
  • GZ / GZIP
  • BZ2
  • XZ / LZMA
  • LZ4
  • ZLIB / ZZ
  • DEB and other layered package formats

Each reader is implemented as a composite reader. When Sentra encounters an archive, we don’t just log its presence. We:

  1. Open the archive.
  2. Iterate every entry.
  3. Hand each inner file back into the global parsing pipeline.
  4. If the inner file is itself an archive, we repeat the process until there are no more layers.

A TAR.GZ containing a ZIP containing a CSV with customer records is not an edge case. It’s Tuesday. Sentra will find the CSV and classify the records correctly.

Encryption Detection Without Decryption

Password‑protected archives are dangerous precisely because they’re opaque.

When Sentra hits an encrypted ZIP or RAR, we don’t shrug and move on. We detect encryption by inspecting archive metadata and entry‑level flags, then surface:

  • That the archive is encrypted
  • Where it lives
  • How large it is

We don’t attempt to brute‑force passwords or exfiltrate content. But we do make encrypted archives visible so they can be governed: flagged as high‑risk, pulled into investigations, or subject to separate key‑management policies.

Intelligent File Prioritization Inside Archives

Not every file inside an archive has the same risk profile. A tarball full of binaries and images is very different from one full of CSVs and PDFs.

Sentra implements file‑type–aware prioritization inside archives. We scan high‑value targets first — formats associated with PII, PCI, PHI, or sensitive business data — before we get to low‑risk assets.

This matters when you’re scanning multi‑gigabyte archives under time or budget constraints. You want the most important findings first, not after you’ve chewed through 40,000 icons and object files.

In‑Memory Processing for Security and Speed

All archive processing in Sentra happens in memory. We don’t unpack archives to temporary disk locations or leave extracted debris lying around in scratch directories.

That gives you two benefits:

  • Performance – we avoid disk I/O overhead when dealing with massive archives.
  • Security – we don’t create yet another copy of the sensitive data you’re trying to control.

For a data security platform, that design choice is non‑negotiable.

Compliance: Auditors Don’t Accept “We Skipped the Zips”

Regulations like GDPR, CCPA, HIPAA, and PCI DSS don’t carve out exceptions for compressed files. If personal health information is sitting in a GZIP’d database dump in S3, or cardholder data is archived in a ZIP on a shared drive, you are still accountable.

Auditors won’t accept “we scanned everything except the compressed files” as a defensible position.

Sentra’s archive scanning closes this gap. Across major cloud providers and archive formats, we give you end‑to‑end visibility into compressed and archived data — recursively, intelligently, and without blind spots.

Because the most dangerous data exposure in your cloud is often the one hiding a single ZIP file deep.

<blogcta-big>

Read More
Expert Data Security Insights Straight to Your Inbox
What Should I Do Now:
1

Get the latest GigaOm DSPM Radar report - see why Sentra was named a Leader and Fast Mover in data security. Download now and stay ahead on securing sensitive data.

2

Sign up for a demo and learn how Sentra’s data security platform can uncover hidden risks, simplify compliance, and safeguard your sensitive data.

3

Follow us on LinkedIn, X (Twitter), and YouTube for actionable expert insights on how to strengthen your data security, build a successful DSPM program, and more!

Before you go...

Get the Gartner Customers' Choice for DSPM Report

Read why 98% of users recommend Sentra.

White Gartner Peer Insights Customers' Choice 2025 badge with laurel leaves inside a speech bubble.