Securing Sensitive Data: A Deep Dive into PII Protection with OpenSearch

Data Modernization & Analytics

Learn how organizations can protect sensitive data using Amazon OpenSearch's security features like fine-grained access controls, encryption, authentication, and audit logging.

There are some sobering statistics on stolen data from 2023: nearly 6.5 million records were stolen, a little over half of breach incidents involved PII, and each incident cost the affected company an average of $4.45 million US dollars (Statista, 2024). These statistics reinforce the value that data has today. A customer once said to me, “it’s not a matter of if you’ll be hacked, it’s when you’ll be hacked … and now that we’ve been hacked, I fully believe this.” This doesn’t mean we should throw in the towel and shred all of our data; rather, we should follow well-established guidelines and best practices so that, in the event of a breach, valuable data is unlikely to be exfiltrated.

If you use, or plan to use, Amazon OpenSearch Service, how do you ensure your sensitive data is secure? And which compliance regulations, such as HIPAA and CCPA, must you follow?

Amazon OpenSearch

OpenSearch is an open source search and analytics engine that is based on Elasticsearch and Kibana. It is used to power search experiences and analytics for various types of data. Like Elasticsearch, it centrally stores, searches, and analyzes large volumes of data quickly. It enables full-text search, real-time analytics, and visualized data analytics. Common use cases include storing and searching log data, metrics, traces, security events, business data, IoT telemetry, and vector embeddings.

Amazon OpenSearch Service is a managed service that reduces the administrative overhead of running OpenSearch clusters. It bundles the OpenSearch engine (search and analytics) with OpenSearch Dashboards (data visualization). It is API-compatible with Elasticsearch, allowing an easy migration for those who want an open source alternative.

Best practices in AWS

Before we look at the specifics of securing your Amazon OpenSearch Service implementation, let’s zoom out a bit and focus on overall best practices within the AWS ecosystem.

Shared-Responsibility Model

The AWS shared responsibility model details the important distinction between what AWS is responsible for and what you are responsible for. In short, AWS is responsible for the security of the cloud (i.e., protecting the infrastructure that runs all of the offered services) while customers are responsible for securing their own workloads in the cloud (i.e., applications, data, and config). It’s critical that customers be familiar with this model and understand their role in securing their own workloads.

Fine-grained access control

Fine-grained access controls allow businesses to control access to their data at the index level, document level, and field level. Fine-grained access controls enable greater efficiency and centralize management by limiting each user to only the permissions needed to perform specific tasks. This supports a Principle of Least Privilege (POLP) security model.

Additionally, it offers three forms of authentication and authorization: a built-in user database for configuring usernames and passwords within OpenSearch, AWS Identity and Access Management (IAM) integration to map IAM principals to data permissions, and single sign-on with native SAML integration.

This comprehensive approach ensures that sensitive data is protected behind multiple layers of defense and provides effective protection against unauthorized access.
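To make this concrete, here is a minimal sketch of a role definition of the kind the OpenSearch Security plugin accepts, combining index-, document-, and field-level restrictions. The role name, index pattern, and field names (`region`, `ssn`, `credit_card`) are illustrative assumptions, not part of any real deployment.

```python
# Sketch: a role payload for the OpenSearch Security REST API that
# enforces index-level, document-level (dls), and field-level (fls)
# access control. All names here are hypothetical examples.
import json

def build_pii_analyst_role() -> dict:
    """Role limited to one index pattern, a document subset, and non-PII fields."""
    return {
        "cluster_permissions": ["cluster_composite_ops_ro"],
        "index_permissions": [
            {
                "index_patterns": ["customers-*"],   # index-level control
                "dls": json.dumps(                   # document-level: only US records
                    {"term": {"region": "us"}}
                ),
                "fls": ["~ssn", "~credit_card"],     # field-level: "~" excludes PII fields
                "allowed_actions": ["read"],
            }
        ],
    }

role = build_pii_analyst_role()
# A PUT to _plugins/_security/api/roles/<role-name> with this body would
# create the role on a cluster running the security plugin.
print(json.dumps(role, indent=2))
```

Note the `~` prefix in `fls`, which excludes the named fields rather than allowing them, so a newly added sensitive field is not exposed by accident.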

Encryption and Masking

OpenSearch addresses encryption in two layers: in transit and at rest.

Encryption In Transit

Encryption in transit is responsible for encrypting data moving to, from, and within the cluster using the TLS protocol, covering both client-to-node encryption (the REST layer) and node-to-node encryption (the transport layer).

Using these security controls ensures that requests to OpenSearch and data movement are appropriately protected in the event of unauthorized transmission-level access.

Encryption at Rest

On the other hand, encryption at rest protects data stored in the cluster, including indexes, logs, swap files, automated snapshots, and all data in the application directory. This type of encryption is managed by the operating system on each OpenSearch node. To enable encryption at rest, businesses can utilize features like AWS Key Management Service (AWS KMS) to store and manage encryption keys for securing data at rest.

Encryption at rest helps prevent unauthorized access to sensitive data stored within OpenSearch clusters in the event of unauthorized disk-level access.
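Both encryption layers can be enabled when provisioning an Amazon OpenSearch Service domain. The sketch below shows the relevant settings as you would pass them to the AWS SDK (boto3 `opensearch` client, `create_domain`); the domain name and KMS key ARN are placeholders, not real resources.

```python
# Sketch: encryption-related domain settings for Amazon OpenSearch Service.
# The DomainName and KmsKeyId values below are illustrative placeholders.
domain_params = {
    "DomainName": "example-pii-domain",
    "EncryptionAtRestOptions": {
        "Enabled": True,
        # Customer-managed key in AWS KMS for data at rest
        "KmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",
    },
    "NodeToNodeEncryptionOptions": {"Enabled": True},   # transport-layer TLS
    "DomainEndpointOptions": {
        "EnforceHTTPS": True,                           # REST-layer TLS only
        "TLSSecurityPolicy": "Policy-Min-TLS-1-2-2019-07",
    },
}
# boto3.client("opensearch").create_domain(**domain_params)
```

Enforcing HTTPS and a minimum TLS policy covers the client-to-node path, while node-to-node encryption covers traffic inside the cluster.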

Data Masking

OpenSearch manages the obfuscation of data, including PII, by offering field masking capabilities. Field masking allows businesses to obscure the values of specific fields in documents by replacing them with cryptographic one-way hashes (i.e., hashes that cannot be converted back into the original value).

This feature is currently compatible with string-based fields and works alongside field-level security on a per-role, per-index basis. Businesses can configure field masking to allow certain roles to view sensitive fields in plain text while masking them for others. The process involves setting a salt (a random string used to hash data) in the opensearch.yml file. Field masking can be configured using OpenSearch Dashboards, roles.yml, or the REST API.

In OpenSearch Dashboards, businesses can choose a role, select an index permission, specify fields for anonymization, and set up field masking accordingly. Additionally, OpenSearch provides the flexibility to use alternative hash algorithms and pattern-based field masking using regular expressions and replacement strings. The masked fields are excluded from read history tracking to maintain data privacy and security.
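As a sketch of how masking attaches to a role, the payload below adds `masked_fields` to an index permission. The index pattern and field names are hypothetical, and the `::SHA-512` suffix illustrates the alternative-hash-algorithm option mentioned above.

```python
# Sketch: a role fragment that masks PII fields for whoever holds this role.
# Field names and the index pattern are illustrative assumptions.
masking_role = {
    "index_permissions": [
        {
            "index_patterns": ["customers-*"],
            "allowed_actions": ["read"],
            # "email" uses the default one-way hash; "ssn" overrides the
            # algorithm with a SHA-512 suffix.
            "masked_fields": ["email", "ssn::SHA-512"],
        }
    ],
}
```

A second role without `masked_fields` on the same index would see those fields in plain text, which is how per-role visibility is achieved.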

Authentication

OpenSearch manages authentication by validating user credentials against various backends, such as the internal user database, Lightweight Directory Access Protocol (LDAP), Active Directory, Kerberos, or JSON web tokens. The authentication process involves extracting user credentials based on the configured plugin, like basic authentication, JSON web tokens, or TLS certificates.

The plugin supports chaining multiple backends in a sequential manner until successful authentication occurs. Commonly, organizations combine the internal user database with LDAP/Active Directory for enhanced security.
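A chained setup like that is declared in the security plugin's configuration. The sketch below expresses the idea as a Python dict mirroring the YAML structure: the internal database is tried first (`order: 0`), then LDAP. The hostnames and search filter are illustrative assumptions.

```python
# Sketch: an authentication chain mirroring the security plugin's config
# structure -- internal user database first, then LDAP. Hostnames and the
# user search filter are hypothetical examples.
authc_chain = {
    "basic_internal_auth": {
        "order": 0,
        "http_authenticator": {"type": "basic", "challenge": False},
        "authentication_backend": {"type": "internal"},
    },
    "ldap_auth": {
        "order": 1,
        "http_authenticator": {"type": "basic", "challenge": True},
        "authentication_backend": {
            "type": "ldap",
            "config": {
                "hosts": ["ldap.example.com:636"],
                "usersearch": "(uid={0})",
            },
        },
    },
}
```

The `order` values determine the sequence: if the internal database rejects the credentials, the same request is retried against LDAP before authentication finally fails.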

After verifying user credentials, the plugin collects backend roles, which can be arbitrary strings or retrieved from LDAP/Active Directory.

Subsequently, the security plugin utilizes role mapping to assign security roles to users. If a user is authenticated but lacks permissions based on role mapping, they will not be able to perform actions within OpenSearch. This robust authentication mechanism ensures secure access control and authorization for users based on their roles and permissions.

Audit logging

OpenSearch implements audit logging to track access to the OpenSearch cluster and is useful for compliance purposes as well as for researching security incidents during or after a suspected breach. Immutable audit logs can be configured to track access to the cluster, including authentication success and failures, requests to OpenSearch, index changes, and incoming search queries. The audit logs can be stored on the current cluster or on other storage options, such as Amazon S3 or Elasticsearch.

After enabling audit logging, businesses can manage audit log categories and other settings using OpenSearch Dashboards. Audit logs are highly customizable and can be tailored to meet specific needs. They can be configured to log events in two ways: HTTP requests (REST) and the transport layer.

By default, the security plugin logs events from all users, but excludes the internal OpenSearch Dashboards server user.

Businesses can also exclude specific users from being logged. The default configuration tracks a popular set of user actions, but it is recommended to tailor the settings to specific needs.
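Putting the points above together, here is a sketch of an audit configuration body of the shape the security plugin's audit API accepts, enabling both REST- and transport-layer logging and excluding specific users. The excluded usernames and disabled category are illustrative assumptions.

```python
# Sketch: an audit configuration enabling REST and transport logging
# while excluding noisy internal users. Usernames are hypothetical.
audit_config = {
    "enabled": True,
    "audit": {
        "enable_rest": True,        # log HTTP (REST) layer events
        "enable_transport": True,   # log transport-layer events
        # Skip routine grants to keep log volume manageable
        "disabled_rest_categories": ["GRANTED_PRIVILEGES"],
        # Internal service accounts excluded from tracking
        "ignore_users": ["kibanaserver", "reporting-batch-user"],
    },
    "compliance": {
        "write_log_diffs": False,   # don't record document diffs on writes
    },
}
# A PUT to _plugins/_security/api/audit/config with this body would
# apply the settings on a cluster running the security plugin.
```

Disabling high-volume categories such as successful privilege grants is a common tuning step, since audit storage itself can become a cost and retention concern.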

OpenSearch Best Practices

As you can see, OpenSearch provides a robust suite of options for securing and protecting data. However, data is only as secure as you make it. It’s important to be familiar with these options, but more important to have a holistic data strategy for your organization. Rather than focusing on securing OpenSearch, for example, define corporate-wide policies that govern how OpenSearch is to be secured. Most organizations require that data be encrypted at rest and in transit; that business requirement can then be directly translated into a technical requirement for OpenSearch. In addition to these OpenSearch security features, there are a few common best practices to consider when building your corporate data security posture:

  • Be extremely familiar with the AWS shared responsibility model.

  • Before loading sensitive data in any system, ask the question “do I really need this data here?” If you don’t, a great way to protect data is to limit the number of systems it appears in. For example, you can keep sensitive customer data in a highly secured system and let only the customer ID persist in other systems. That ID can then be used to link back to the controlled single source in the use cases that truly need the sensitive data.

  • Consider ways of keeping PII out of non-production environments, such as sanitizing data, masking data, or never loading it at all. It is common practice for businesses to fabricate sets of fictitious test data and seed non-production environments only with that data.

  • Encrypting data at rest goes beyond files in the file system and records in a database. Consider other areas where information may reside such as URLs (query string), API logs, etc.
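The "customer ID only" pattern from the list above can be sketched in a few lines: PII lives in one controlled store, and downstream systems carry only an opaque surrogate ID. The store, function, and field names here are illustrative, not a prescribed design.

```python
# Sketch of data minimization via surrogate IDs: sensitive attributes
# live only in a single controlled store, while other systems persist
# an opaque customer ID. All names are hypothetical examples.
import uuid

secure_store: dict[str, dict] = {}  # stand-in for the controlled system of record

def register_customer(name: str, ssn: str) -> str:
    """Persist PII centrally; return only an opaque ID for other systems."""
    customer_id = str(uuid.uuid4())
    secure_store[customer_id] = {"name": name, "ssn": ssn}
    return customer_id

cid = register_customer("Ada Lovelace", "123-45-6789")

# A downstream system (e.g. an analytics index) stores no PII at all:
downstream_record = {"customer_id": cid, "order_total": 42.50}
```

If the downstream system is breached, the attacker obtains only opaque identifiers; resolving them requires access to the separately secured store.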

At Caylent, we have extensive knowledge and expertise in deploying secure and performant data infrastructures on AWS. If you're looking to augment your data capabilities while ensuring enterprise-grade security and compliance, get in touch with our team.

Jorge Goldman

Jorge Goldman is a Sr. Big Data Architect with over 12 years of experience in diverse areas from SRE to Data Science. Jorge is passionate about Big Data problems in the real world. He graduated with a Bachelor's degree in Software Engineering and a Master's degree in Petroleum Engineering and Data Science. He is always looking for opportunities to improve existing architectures with new technologies. His mission is to deliver sophisticated technical solutions without compromising quality or security. He enjoys contributing to the community through open-source projects, articles, and lectures, and loves to guide Caylent's customers through challenging problems.

Kenneth Henrichs

Kenneth is fueled with a passion for transforming businesses through the power of technology. With over 20 years of industry expertise, he has helped startups thrive, empowered small businesses to scale, and collaborated with Fortune 500 clients to drive innovation. Throughout his career, Kenneth has done everything from bare-metal-to-browser and has gained an affinity for data. He has already helped several customers create value from generative AI and is energized by the wealth of possibilities that this technology ushers in.
