How to Secure Non-Production Data: A Guide


INTRODUCTION

Production data and non-production data are both important to organizations of every size. And sometimes real production data finds its way into non-production databases. That is one of the reasons why securing this data is so crucial. In this post, we'll explain what production data and non-production data are. Then we'll show you how to secure non-production data.

PRODUCTION DATA VS. NON-PRODUCTION DATA

Before we talk about securing non-production data, let’s discuss what production data and non-production data are and how they differ. 

PRODUCTION DATA

Production data is the data the business runs on. Every organization, whether a startup or a big multinational company, has critical data. For a bank, the customer data and the transactional data are production data. For an e-commerce giant, the production data is the product catalog, the user information, and the transactions. This kind of data is secured with the best systems available. But if a hacker gets hold of any of it, the result can be both reputational and financial losses.

NON-PRODUCTION DATA

Non-production data is generally used for testing and development purposes. In an ideal scenario, it should be fake data that nonetheless emulates real data. Suppose your production database contains 10 million records. Then the test database should also contain 10 million records, because only at that scale can realistic load and performance testing be done. But sometimes developers and testers require real production data. In such cases, they are given a subset of the data, which is generally replicated from production. This is why securing non-production data is so important: even if only this subset of production data is stolen by a hacker, it can cause havoc in the organization.
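
To make this concrete, here's a minimal Python sketch of generating realistic fake records. It assumes the third-party faker package, and the field names are hypothetical; in practice you would mirror your production schema and volumes.

```python
# A minimal sketch of generating realistic fake records with the third-party
# "faker" package (pip install faker). The fields here are hypothetical; in
# practice you would mirror your production schema.
from faker import Faker

fake = Faker()
records = [
    {"name": fake.name(), "email": fake.email(), "zip": fake.postcode()}
    for _ in range(1000)  # scale this up to match production record counts
]
print(records[0])
```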

DIFFERENCES BETWEEN PRODUCTION AND NON-PRODUCTION DATA

Every organization uses databases. The data is a core part of the business, and in many cases, it is the business itself. Whether this data is stored for internal purposes (like data on all the employees) or for external purposes (like the catalog of an e-commerce site), it is all considered production data. Every developer needs to work with databases to develop applications. But they cannot work on production data, as they could corrupt it or, in the worst case, delete it. So all developers work on non-production data from non-production databases. This data generally consists of fake records that replicate the shape of the original production database, though in some cases it contains real data, as developers need to check the real structure of the records. Testers work on both production and non-production data: they need to test the application before it goes live to real-world users, and they also test against production data the way an end user will experience it once the release goes through to production.

SECURING NON-PRODUCTION DATA

As discussed earlier, the non-production data used by developers can also contain sensitive production data, such as credit card details, bank details, and even Social Security numbers. Developers don't need the exact data, but they do need the structure of the database and the schema of the records. So, before giving the data to developers or testers, it's important to obscure anything sensitive through data masking.

DATA MASKING

As the name suggests, we mask the original data before handing it to the developers or testers. In this process, the company first decides which sensitive data cannot go to the non-production database. Proper masking ensures the original data never reaches the developer, while the masked data still has meaning: a zip code should still be a valid zip code. Two common masking methods are shuffling and the multiplier. In shuffling, names are swapped among records, so John becomes David and vice versa. In the multiplier method, a random amount is added to numeric data like dates, so 12/31/2010 might become 04/18/2014.

Data masking is generally done with the help of tools, which we will look into in the next section. These tools mask data in two ways: static masking and dynamic masking. In static masking, the production database is used to create a static database containing masked data, which developers and testers then use. In dynamic masking, whenever a developer or tester queries the production database, a proxy service receives the request, fetches the real data from the production database, masks it into dummy data, and returns the masked data to the developer or tester.
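
To illustrate the two methods, here is a minimal Python sketch, with hypothetical records and field names, that shuffles names across records and shifts dates by a random offset so they remain valid dates:

```python
import random
from datetime import datetime, timedelta

# Hypothetical records, invented for this example.
records = [
    {"name": "John",  "dob": "12/31/2010"},
    {"name": "David", "dob": "06/15/1988"},
    {"name": "Maria", "dob": "01/02/1975"},
]

def shuffle_names(rows):
    """Shuffling: randomly reassign the real names among the records."""
    names = [r["name"] for r in rows]
    random.shuffle(names)
    return [{**r, "name": n} for r, n in zip(rows, names)]

def offset_dates(rows, max_days=3650):
    """Multiplier: shift each date by a random number of days so the
    masked value is still a valid date."""
    out = []
    for r in rows:
        d = datetime.strptime(r["dob"], "%m/%d/%Y")
        d += timedelta(days=random.randint(1, max_days))
        out.append({**r, "dob": d.strftime("%m/%d/%Y")})
    return out

masked = offset_dates(shuffle_names(records))
print(masked)  # same shape, valid values, original pairings broken
```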

DATA MASKING TOOLS

Here are some of the top data masking tools available. 

ENOV8 TEST DATA MANAGER

The Enov8 Test Data Management platform speeds up your development and testing process by identifying where data security vulnerabilities reside inside your databases, rapidly remediating those risks through masking to avoid breaches, and automatically validating PII compliance. It also comes with IT delivery accelerators, for example data provisioning (DataOps) automation, data mining, and test data booking features. Enov8, geared toward the larger enterprise, is probably the most holistic and feature-rich solution.

ORACLE DATA MASKING AND SUBSETTING

Oracle Data Masking and Subsetting is a solution from a top provider that also runs on non-Oracle databases. It completes masking quickly, and besides masking, it helps remove duplicate data from testing and development databases. The only drawback is that, coming from a top vendor, it's costly. For pricing details, you need to contact Oracle directly.

INFORMATICA PERSISTENT DATA MASKING

Informatica's Persistent Data Masking tool is again a solution from a top vendor. It is built with big enterprises in mind: the administrator can configure masking for the whole organization from a single location. It also supports masking huge volumes of data, which is not possible with smaller solutions. It is, again, costly because it is an enterprise product, but Informatica offers a 30-day trial period.

K2VIEW DATA PRODUCT PLATFORM

K2View's Data Product Platform is one of the top data masking products on the market, and it does both static and dynamic masking. K2View masks not only traditional data but also PDFs and images; in fact, it can mask an original image by blurring it. Because of the cost, it is most suitable for large organizations.

DATPROF

DATPROF's data masking tool has a state-of-the-art algorithm that not only masks the data but can also generate large amounts of dummy data from it. Besides traditional data, it also supports XML and CSV files. It has an easy-to-use interface and can create reusable templates, though these templates can be created on a Windows machine only. It does, however, support a large number of records.

ACCUTIVE DATA DISCOVERY AND MASKING

Accutive Data Discovery and Masking is a top tool that also performs discovery of sensitive data. This is done automatically and can use preconfigured keywords, or the administrator can add keywords like "credit card" or "Social Security number." Beyond this feature, the masked data is consistent across multiple destinations: if Rohit is masked to John in the development database, he is masked to John in the testing database too. Data can also be moved between multiple kinds of databases, for example from an Oracle database to a MySQL database, or from a flat file to a MySQL database. The UI is very easy to use, and it is one of the most cost-effective products.

CONCLUSION

In this post, we first discussed production data and non-production data, as well as the differences between them. Then we reviewed how to secure non-production data through the process of data masking. This process masks sensitive data from the users of non-production data. We also looked into the top tools available for data masking. 

AUTHOR

This post was written by Nabendu Biswas. Nabendu has been working in the software industry for the past 15 years, starting as a C++ developer and then moving on to databases. For the past six years he's worked as a web developer in the JavaScript ecosystem, building web apps in ReactJS, NodeJS, and GraphQL. He loves to blog about what he learns and what he's up to.

A Coder's Guide to Data Science


Data Science is an interdisciplinary field that utilizes mathematics, statistics, and computer science to extract meaningful insights from large datasets. It can be used to uncover patterns and solve complex problems in a variety of industries such as healthcare, finance, marketing, and engineering.

 Choosing the right language for a data science project is essential, and there are a variety of languages to choose from. Python, R, SQL, MATLAB, and Scala are some of the best languages for data science, each offering unique features and capabilities that make them suitable for different tasks.

Let's talk about the top five languages in more detail.

The What & When of:

  1. Python
  2. R
  3. SQL
  4. MATLAB
  5. Scala

Python

What is Python?

Python is a high-level, general-purpose programming language that is popular among data scientists for its flexibility, wide range of libraries, and ease of use. Python is used for data analysis, machine learning, web development, and more. It is a great language for beginners as it has a simple syntax and provides a wide range of libraries and modules to help with data manipulation and analysis.

When to choose Python?

Python is a great choice for data science projects that require a lot of data manipulation and analysis. It also suits projects with large and diverse datasets, as its wide range of libraries and modules makes it easier to process and visualize the data. And because it is easy to learn and well documented, it's a natural starting point for beginners.
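
As a small illustration of the kind of data manipulation Python's libraries make easy, here is a brief sketch using pandas; the dataset and column names are invented for the example:

```python
import pandas as pd

# Hypothetical sales data, invented for this example.
df = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "revenue": [1200, 950, 1430, 1100],
})

# Group, aggregate, and sort in a few lines: the kind of manipulation
# that makes Python so popular for data analysis.
summary = (
    df.groupby("region")["revenue"]
      .agg(["sum", "mean"])
      .sort_values("sum", ascending=False)
)
print(summary)
```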

R

What is R?

R is a programming language and software environment for statistical computing and graphics. It is popular among data scientists for its powerful statistical analysis capabilities and its wide range of libraries for data manipulation and visualization. R is particularly popular among academics and researchers, who use it to analyze data and build predictive models.

When to use R?

R is a great choice for data science projects that require a lot of statistical analysis, and for projects that need powerful data manipulation and visualization capabilities. Because R is popular among academics and researchers, it is also a natural fit for research-heavy projects.

SQL

What is SQL?

SQL (Structured Query Language) is a domain-specific language used to interact with databases. It is used to store, retrieve, manipulate, and analyze data stored in a relational database. SQL is popular among data scientists to access and analyze data stored in relational databases, as it is easy to learn and offers powerful features for data analysis.

When to use SQL?

SQL is a great choice for data science projects that involve accessing and analyzing data stored in a relational database. It also suits projects that require a lot of data manipulation, since the language offers powerful aggregation and filtering features. And SQL is easy to learn, making it approachable for beginners.
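
Here's a minimal sketch of this workflow using Python's built-in sqlite3 module; the table and rows are invented for the example, but the query pattern is the same against any relational database:

```python
import sqlite3

# An in-memory SQLite database; the table and rows are invented for
# this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Alice", 120.0), ("Bob", 75.5), ("Alice", 42.0)],
)

# A typical analytical query: total spend per customer.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
"""
for row in conn.execute(query):
    print(row)
conn.close()
```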

MATLAB

What is MATLAB?

MATLAB (Matrix Laboratory) is a high-level programming language and environment used for technical computing and data analysis. It is popular among data scientists for its powerful numerical computing and visualization capabilities. MATLAB also has a wide range of libraries for data analysis and machine learning, making it a great choice for data scientists.

When to use MATLAB?

MATLAB is a great choice for data science projects that require a lot of technical computing and visualization. Its wide range of libraries for data analysis and machine learning also makes it suitable for manipulation-heavy projects, and its powerful numerical computing capabilities make it a strong option for numerically intensive work.

Scala

What is Scala?

Scala is a general-purpose programming language that is often used for data science projects. It is a combination of object-oriented and functional programming, and is popular for its powerful features and scalability. Scala is a great choice for data science projects, as it is easy to learn and offers a wide range of libraries for data manipulation and analysis.

When to use Scala?

Scala is a great choice for data science projects that require a lot of data manipulation and analysis. It also suits projects that demand scalability, thanks to its powerful language features. And because it blends object-oriented and functional programming, it works well for teams that want both paradigms in one language.

One Size Doesn't Fit All

In many cases, a hybrid approach is best for data science projects. This involves combining the best features of different languages and tools to create a powerful and flexible data science solution. For example, combining Python and R can provide the best of both worlds, with Python providing powerful data manipulation and visualization capabilities, and R providing powerful statistical analysis capabilities.
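
As one possible illustration of that hybrid, here is a hedged sketch using the rpy2 bridge, which lets Python code call into R; it assumes R and rpy2 are installed, and the data is invented for the example:

```python
# Assumes R plus the rpy2 bridge package are installed (pip install rpy2).
import rpy2.robjects as robjects

# Prepare the data in Python, then hand it to R for statistical analysis.
values = [5.1, 4.9, 6.2, 5.8, 5.5, 6.0]
robjects.globalenv["x"] = robjects.FloatVector(values)

# Run R's Shapiro-Wilk normality test on the Python-supplied data.
result = robjects.r("shapiro.test(x)")
print(result)
```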

No matter what language or tools you use, the most important thing is to choose the right ones for your particular project. Finding the right combination of languages and tools to best suit your project can take some experimentation, but it is well worth the effort.

Author

Jane Temov is an IT Environments Evangelist at Enov8, specializing in IT and Test Environment Management, Release and Data Management product design & solutions.

What is Data Cloning? A Beginner's Guide

What is Data Cloning?

Data Cloning, sometimes called Database Virtualization, is a method of snapshotting real data and creating tiny “fully functional” copies for the purpose of rapid provisioning into your Development & Test Environments.

The Cloning Workflow

There are four primary steps:

  1. Load / Ingest the Source Data
  2. Snapshot the Data
  3. Clone / Replicate the Data
  4. Provision the Data to DevTest Environments

Under the Hood

Cloning is typically built using ZFS or Hyper-V technology and lets you move away from traditional backup-and-restore methods, which can take hours.

By using ZFS or Hyper-V, you can provision databases 100x faster with a storage footprint 10x smaller.
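
As a rough illustration, here is a minimal sketch of the snapshot and clone steps driven from Python with the standard ZFS command line; the pool and dataset names are hypothetical, and it assumes a host with ZFS installed and sufficient privileges:

```python
import subprocess

# A minimal sketch of the snapshot-and-clone steps using the standard ZFS
# command line. The pool/dataset names are hypothetical, and this must run
# on a host with ZFS installed and sufficient privileges.

DATASET = "tank/proddb"            # source dataset holding the ingested data
SNAPSHOT = f"{DATASET}@baseline"   # point-in-time, copy-on-write snapshot
CLONE = "tank/devtest1"            # writable clone for one dev/test environment

subprocess.run(["zfs", "snapshot", SNAPSHOT], check=True)
subprocess.run(["zfs", "clone", SNAPSHOT, CLONE], check=True)

# The clone is usable immediately and consumes almost no extra space until
# it diverges from the snapshot, which is why provisioning is so fast.
```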

What is ZFS?

  • ZFS is a file system that provides data integrity and snapshotting. It is available for most, if not all, major OS platforms.

What is Hyper-V?

  • Hyper-V is a Microsoft virtualization platform that can be used to create and manage virtual machines. It supports snapshotting as well.

Problem Statement

Backups are often taken manually and can take hours or days to complete. This means that the data isn’t available for use during this time period, which can be problematic if you need access to your data immediately.

There is also a secondary issue: storage. A backup and restore is, by its nature, a 100% copy of the original source. So if you start with a 5 TB database and want three restores, you are up for another 15 TB of disk space.

What are the Benefits of Data Cloning?

Data cloning is the process of creating a copy, or snapshot, of data for backup, analysis, or engineering purposes. This can be done in real-time or as part of a scheduled routine. Data clones can be used to provision new databases and test changes to production systems without affecting the live dataset.

Advantages

– Clones can be used for development and testing without affecting production data

– Clones use little storage, on average about 40 MB, even if the source was 1 TB

– The Snapshot & Cloning process takes seconds, not hours

– You can restore a Clone to any point in time by bookmarking

– Simplifies your End to End Data Management

Disadvantages

– The underlying technology to achieve cloning can be complex.

However, there are various cool tools on the market that remove this complexity.

What Tools are available to support Data Cloning?

In addition to building your own from scratch, there are several commercial cloning solutions on the market.

Each is powerful and has its own set of features and benefits. The key is to understand your data environment and what you’re trying to achieve before making that final decision.

Common Use Cases for Data Cloning

  • DevOps: Data cloning creates an exact copy of a dataset. This is useful for several reasons, such as creating backups or replicating test data into test environments for development and testing purposes.
  • Cloud Migration: Data cloning provides a secure and efficient way to move TB-size datasets from on-premises to the cloud. This technology can create space-efficient data environments needed for testing and cutover rehearsal.
  • Platform Upgrades: A large majority of projects go over schedule and budget, primarily because setting up and refreshing project environments is slow and complicated. Database virtualization can cut down on complexity, lower the total cost of ownership, and accelerate projects by delivering virtual data copies to platform teams more efficiently than legacy processes allow.
  • Analytics: Data clones can provide a space for designing queries and reports, as well as on-demand access to data across sources for BI projects that require data integration. This makes it easier to work with large amounts of data without damaging the original dataset.
  • Production Support: Data cloning can help teams identify and resolve production issues by providing complete virtual data environments. This allows for root cause analysis and validation of changes to ensure that they do not cause further problems.

To Conclude

Data cloning is the process of creating an exact copy of a dataset (database). This can be useful for many reasons, such as creating backups or replicating data for development and testing purposes. Data clones can be used to quickly provision new databases and test changes to production systems without affecting the live dataset.

This article provides a brief overview of data cloning, including its advantages, disadvantages, common use cases, and available tools. It is intended as a starting point for those who are new to the topic. Further research is recommended to identify the best solution for your specific needs. Thanks for reading!