SQL Versus NoSQL: What Is the Difference?


SQL vs. NoSQL? Which database architecture should you use for your next project? Which one is the “best”? Some argue that one is always better than the other. But they are very different technologies that solve different problems.

Let’s take a look at them and see where they differ and where they overlap.

SQL Databases

SQL databases support Structured Query Language (SQL), a language for working with data in relational databases. Broadly speaking, “SQL database” and “relational database” refer to the same technology.

A relational database stores data in tables. These tables have columns and rows. The columns define the attributes that each entry in a table can have. Each column has a name and a datatype. The rows are the records in the table.

For example, a table that holds customers might have columns that define the first name, last name, street address, city, state, postal code, and a unique identification code (ID). You could define the first six columns as strings. Or, the postal code could be an integer if all the clients are in the United States. The ID could be a string or an integer.

The relationships between the tables give SQL its power. Suppose you want to track your customer’s vehicles. Add a second table with vehicle ID, brand, model, and type. Then, create a third table that stores two columns: vehicle ID and customer ID. When you add a new vehicle, store its ID with the customer that owns it in this third table. Now, you can query the database for vehicles, for customers, for customers that own certain vehicles, and vehicles owned by customers. You can also easily have more than one vehicle per customer or more than one customer per vehicle.
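The customer/vehicle design above can be sketched in a few lines of Python using the standard-library sqlite3 module. The table and column names are illustrative, not a production schema:

```python
import sqlite3

# In-memory database; a minimal sketch of the three-table design described
# above: customers, vehicles, and a link table joining them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE vehicles  (id INTEGER PRIMARY KEY, brand TEXT, model TEXT);
    CREATE TABLE customer_vehicles (customer_id INTEGER REFERENCES customers(id),
                                    vehicle_id  INTEGER REFERENCES vehicles(id));
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'Lovelace')")
conn.execute("INSERT INTO vehicles VALUES (10, 'Toyota', 'Corolla')")
conn.execute("INSERT INTO customer_vehicles VALUES (1, 10)")

# Join through the link table: which vehicles does each customer own?
rows = conn.execute("""
    SELECT c.first_name, v.brand, v.model
    FROM customers c
    JOIN customer_vehicles cv ON cv.customer_id = c.id
    JOIN vehicles v           ON v.id = cv.vehicle_id
""").fetchall()
print(rows)  # [('Ada', 'Toyota', 'Corolla')]
```

Because the link table allows any number of rows per customer or per vehicle, the many-to-many relationship comes for free.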

Three common examples of SQL databases are SQLite, Oracle, and MySQL.

NoSQL Databases

“NoSQL database” means many things. NoSQL databases are databases that, well, don’t support SQL, or that support only a special dialect of it. Here’s a non-exhaustive list of the more popular kinds of NoSQL databases.

Key-Value Databases

Key-Value (KV) databases store data in dictionaries. They can store huge amounts of data for fast insertion and retrieval.

In a KV database, all keys are unique. While the keys are often defined as strings, the values can be any datatype. They can even be different types in the same database. Common examples of values are JSON strings and Binary Large Objects (BLOBs).
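The core KV contract is easy to sketch with a plain Python dictionary. The function names here are illustrative, not any particular store's API:

```python
import json

# A dict captures the essentials of a KV database: unique string keys,
# values of any type, fast insertion and lookup.
store = {}

def kv_set(key, value):
    store[key] = value

def kv_get(key, default=None):
    return store.get(key, default)

kv_set("user:42", json.dumps({"name": "Ada"}))  # a JSON string value
kv_set("avatar:42", b"\x89PNG...")              # a binary (BLOB) value

print(json.loads(kv_get("user:42"))["name"])  # Ada
```

Note that the two values have different types, which is perfectly fine in a KV store.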

Two popular examples of KV databases are Redis and Memcached.

Document Stores

A document store operates like a KV database but contains extra capabilities for manipulating values as documents rather than opaque types.

The structures of the documents in a store are independent of each other. In other words, there is no schema. But, document stores support operations that allow you to query based on the contents.
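A content-based query over schemaless documents can be sketched with plain Python dicts. The `find` helper below loosely mimics a MongoDB-style filter; the documents and field names are made up for illustration:

```python
# Schemaless documents: each one has its own shape.
documents = [
    {"_id": 1, "name": "Ada", "city": "London"},
    {"_id": 2, "name": "Grace", "rank": "Rear Admiral"},  # different fields
    {"_id": 3, "name": "Alan", "city": "London"},
]

def find(filter_doc):
    """Return documents whose fields match every key/value in filter_doc."""
    return [d for d in documents
            if all(d.get(k) == v for k, v in filter_doc.items())]

print([d["name"] for d in find({"city": "London"})])  # ['Ada', 'Alan']
```

The point is that the query inspects the document contents, something an opaque-value KV store can't do.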

MongoDB and Couchbase are common examples of document stores.

Column-Oriented Databases

Relational databases use rows to store their data in tables. Column-oriented databases, as the name suggests, store their information in columns instead. These databases support an SQL-like query language, but they store records and relations in columns of the same datatype. This makes for a scalable architecture. Column-oriented databases have very fast insertion and query times, and they are well suited for huge datasets.
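The difference between the two layouts can be sketched in a few lines of Python (the data is illustrative):

```python
# Row-oriented: each record is kept together.
rows = [
    {"id": 1, "brand": "Toyota", "year": 2020},
    {"id": 2, "brand": "Ford",   "year": 2021},
]

# Column-oriented: one homogeneous array per column.
columns = {
    "id":    [1, 2],
    "brand": ["Toyota", "Ford"],
    "year":  [2020, 2021],
}

# An aggregate only touches the columns it needs. This is one reason the
# columnar layout scans and compresses well on huge datasets.
avg_year = sum(columns["year"]) / len(columns["year"])
print(avg_year)  # 2020.5
```

A real column store adds partitioning and compression on top of this idea, but the layout is the heart of it.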

Apache Cassandra and Apache HBase are column-oriented databases.

Graph Databases

Graph databases work on the relationships between values. The values are free form, like the values in a document database. But you can connect them with user-defined links. This creates a graph of nodes and edges.

Queries operate on the graph. You can query on the keys, the values, or the relationships between the nodes.

Neo4j and FlockDB are popular graph databases.

SQL vs. NoSQL Databases: Which One?

So, when you compare SQL and NoSQL databases, you’re comparing one database technology with several others. Deciding which one is better depends on your data and how you need to access it.

Your Data Should Guide Your Decision

Is there a perfect fit for every data set? Probably not. But if you look at your data and how you use it, the best database becomes apparent.

Relational Data Problems

Can you break your data down into entities with logical relationships? A relational database is what you need, especially when you need to perform operations with the relationships.

Relational databases are best when you need data integrity. Properly designed, the constraints that relational databases place on datatypes and relations help guarantee integrity. NoSQL databases tend to be designed without explicit support for constraints, placing the onus on you.

Caching Data Problems

Caching is storing data for repeated access. You usually identify cached data with a single key. NoSQL databases excel at solving caching problems, while relational databases tend to be overkill.

Key-Value stores are an obvious choice for caching problems. Many websites use Redis and Memcached for data and session information.

But a document store that saves documents for historical purposes or reuse is an example of a caching solution, too.
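A minimal cache with per-entry expiry can be sketched in Python. This mimics what stores like Redis and Memcached provide natively; the function names and keys are illustrative:

```python
import time

# A tiny KV cache with time-to-live (TTL) eviction.
cache = {}

def cache_set(key, value, ttl_seconds):
    # Store the value along with the time it stops being valid.
    cache[key] = (value, time.monotonic() + ttl_seconds)

def cache_get(key):
    entry = cache.get(key)
    if entry is None:
        return None
    value, expires_at = entry
    if time.monotonic() > expires_at:
        del cache[key]  # expired: evict and report a miss
        return None
    return value

cache_set("session:abc", {"user_id": 42}, ttl_seconds=60)
print(cache_get("session:abc"))  # {'user_id': 42}
```

The single-key lookup pattern is exactly why a full relational database is overkill here.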

Graph Data Problems

If a graph database stores data with relationships between data, why isn’t it a relational database? It’s because in a graph database relationships are just as important as the data. The relations have fields, names, and directions. Graph queries may include relationships and their names, types, or fields. Relation queries also use wildcards, which account for indirect relationships.

Suppose a database represents rooms in several hosting facilities. It stores buildings, rooms, racks, computers, and networking equipment. This is a relational problem since you have entities with specific relationships.

There could be a table for each entity in a relational database and then join tables representing the relationships between them. But now imagine a query for all the networking equipment in a given building. It has to look in the buildings, find the rooms, look in the rooms for racks, and finally collect all the equipment.

In a graph database, you could create a relation called “contains.” It would be a one-way relation reflecting that one node contains another. Each item in each facility is a node contained by another, except for the buildings. When you query the database for networking gear, a wildcard could combine relationships between the buildings, room, and racks. This query models real life, since you say “Give me all of the gear in building X.”
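The "contains" traversal above can be sketched as an adjacency list plus a transitive walk. The node names are made up; a real graph database would express this as a variable-length path query rather than hand-written traversal code:

```python
# The one-way "contains" relation as an adjacency list.
contains = {
    "building-X": ["room-1"],
    "room-1":     ["rack-A"],
    "rack-A":     ["switch-1", "server-1"],
}

def all_contained(node):
    """Everything reachable from node via any number of 'contains' hops,
    i.e., the wildcard-depth query described above."""
    found = []
    stack = [node]
    while stack:
        for child in contains.get(stack.pop(), []):
            found.append(child)
            stack.append(child)
    return found

print(all_contained("building-X"))  # room, rack, and all the gear inside
```

Filtering the result by node type ("networking gear only") completes the "give me all of the gear in building X" query.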

Scalability: SQL vs. NoSQL

Which technology scales better? NoSQL may have a slight edge here.

Relational databases scale vertically. In other words, data can’t extend across different servers. So, for large datasets, you need a bigger server. As your data increases in size, you need more drive space and more memory. You can share the load across clusters, but not data.

Column-oriented databases were created to solve this problem. They provide horizontal scalability with a relational model.

Key-Value, document, and graph databases also scale horizontally since it’s easier to distribute their datasets across a cluster of servers.

SQL vs. NoSQL: Which One?

SQL and NoSQL are effective technologies. SQL has been around for decades and has proven its worth in countless applications. NoSQL is a set of technologies that solve a variety of different problems. Each of them has its own advantages and tradeoffs.

The question is, which one is best suited for your application? Take the first step by carefully modeling your data and defining use-cases to learn how you need to store and retrieve it. Then, pick the right technology for your application.

Author – Eric Goebelbecker

Eric has worked in the financial markets in New York City for 25 years, developing infrastructure for market data and financial information exchange (FIX) protocol networks. He loves to talk about what makes teams effective (or not so effective!).

What is Data Subsetting in TDM?


The foundation of a comprehensive and well-implemented QA strategy is a sound testing approach. And a sound testing approach, in its turn, depends on having a proper test data management (TDM) process in place. TDM’s responsibilities include obtaining high-quality and high-volume test data in a timely manner. However, obtaining such data isn’t an easy process and might create prohibitive infrastructure costs and unforeseen challenges.

This post is all about the solution: data subsetting.

We begin by defining data subsetting in a more general sense and then explain why it’s so important in the context of TDM. We then talk about the main challenges involved in data subsetting and cover the main methods in which it can be performed.

Let’s get started.

What Is Data Subsetting?

Data subsetting isn’t a hard concept to grasp. To put it simply, it consists of getting a subset or a slice of a complete dataset and moving it somewhere else.

The next step is to understand how this concept works in the context of TDM.

Why Is Data Subsetting Needed in TDM? Knowing the Pain

Data subsetting is a medicine that’s supposed to alleviate a specific pain. So if you want to understand what subsetting is and why it’s needed in the context of TDM, you first need to understand the pain we’re talking about.

A couple of sections ago, we defined test data management. You learned that this process is in charge of coming up with reliable data for the consumption of automated test cases. And even though there are alternatives, one of the most popular solutions for this problem is merely copying the data from the production servers, which is often called production cloning.

Copying from production is a handy way of obtaining realistic datasets for testing, since nothing is more realistic than real data. However, this approach presents some severe downsides. The security-related challenges can be solved with approaches like data masking. This post focuses on the challenges related to infrastructure.

The Pains of Production Cloning: The Infrastructure Edition

You could probably summarize the infrastructure-related challenges of production cloning in two words: high costs. If you want to copy 100 percent of your production data into your test environments, you’ll incur incredibly high costs for storage and infrastructure.

That’s not to mention the fact that you could potentially have not only one test environment but several. So we’re talking about multiplying this astronomical cost three, four, or even five times.

Besides the direct financial hit, you’d also incur indirect costs in the form of slow test suites. If you have gigantic amounts of test data, then it’ll necessarily take a long time for you to load it when it’s time for test execution.

Data Subsetting to the Rescue

Applying data subsetting in the context of TDM can solve or alleviate the difficulties of copying data from production. When you create test data by copying not the whole production database but a relatively small portion of it, you don’t incur the exorbitant infrastructure costs that you would otherwise. In other words, it’s like a partial production cloning.

What are the main benefits of using data subsetting in TDM?

The first distinct advantage is the decrease in storage costs for the test data. In the same vein, you’ll also incur fewer costs in overall infrastructure. These cost savings quickly add up if you factor in multiple QA or testing environments, which you most likely have.

But the benefits of data subsetting aren’t all about cost. Test time is also impacted positively. Since there’s less data to load and refresh, it takes less time to do it. That way, the adoption of data subsetting can also reduce the total execution time of your test suites.

The Challenges Involved in Data Subsetting

There isn’t such a thing as medicine without side effects. So data subsetting, despite being able to cure or alleviate some pains in the TDM process, also comes with some pains of its own. Let’s look at them.

The first roadblock is referential integrity. Let’s say you work for a social network site and you’re implementing data subsetting. The site has one million users, and you’ve got just a hundred thousand of them for your test database, slicing from the users table. When you’re getting data from the other tables in the database, you have to make sure to fetch the posts, friendships, and pictures from just those hundred thousand users to keep the existing foreign-key relationships intact.
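Integrity-preserving subsetting can be sketched with the standard-library sqlite3 module: take a slice of the parent table, then pull only the child rows that reference it. The schema and slice size are illustrative (a real run would slice, say, a hundred thousand users):

```python
import sqlite3

# A toy "production" database: 10 users, 30 posts referencing them.
src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY);
    CREATE TABLE posts (id INTEGER PRIMARY KEY,
                        user_id INTEGER REFERENCES users(id));
""")
src.executemany("INSERT INTO users VALUES (?)", [(i,) for i in range(1, 11)])
src.executemany("INSERT INTO posts VALUES (?, ?)",
                [(i, (i % 10) + 1) for i in range(1, 31)])

SUBSET = 3  # slice size for the parent table
user_ids = [r[0] for r in
            src.execute("SELECT id FROM users ORDER BY id LIMIT ?", (SUBSET,))]

# Fetch only the child rows whose foreign key points into the slice.
marks = ",".join("?" * len(user_ids))
posts = src.execute(
    f"SELECT id, user_id FROM posts WHERE user_id IN ({marks})",
    user_ids).fetchall()

# Every fetched post references a user we actually copied.
assert all(uid in user_ids for _, uid in posts)
print(len(posts))  # 9
```

Real schemas have many levels of child tables, and each level has to be sliced the same way, which is where the complexity (and the tooling discussed below) comes in.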

This roadblock becomes even more relentless when you factor in the possibility of relationships spanning multiple databases. There’s nothing that prohibits this organization from storing user profiles in a PostgreSQL database and posts in an Oracle database.

These types of relationships can be even harder to track and protect when they span not only multiple databases but multiple data sources. You could use relational databases for some types of data while others reside in .csv files, and a third type might be stored in some document-based NoSQL database. Such a variety of possible data sources certainly poses a challenge for maintaining referential integrity when doing data subsetting.

Data Subsetting Methods

Let’s now cover the main methods in which you can implement data subsetting.

Using SQL Queries

We start with the most straightforward approach. In the case of data subsetting, this translates to using plain old SQL queries.

For instance, in the social network example we used before, let’s say you have a total of ten thousand users (it’s a small social network, mind you) and you want to get a hundred users. This is how you’d do it, for instance, in PostgreSQL:

SELECT * FROM users ORDER BY id LIMIT 100

Now let’s say you want to fetch the posts from those hundred users. How would you do it? Here’s a solution:

SELECT * FROM posts WHERE user_id IN (SELECT id FROM users ORDER BY id LIMIT 100)

These are just simple examples. For a more realistic approach, you would have more complex queries stored in script files. Ideally, you’d check those into version control to track any changes.

The advantage of this approach is that it’s easy to get started, and that’s pretty much it. You most likely have people with SQL knowledge in your organization, so the learning curve here should be close to nonexistent.

On the other hand, this approach is very limited. While it might work well in the beginning, as soon as your subsetting needs start to get serious, it falls apart. It’s going to become increasingly harder to understand, change, and maintain the scripts. You have to bear in mind that changes to the “root” query (in our example, the query that fetches the hundred users) cascade down to its children, which further complicates updating the scripts.

Also, knowledge of query performance optimization techniques might be necessary. Otherwise, you might get into situations where poorly written queries become unusable due to outrageously poor performance.

Developing a Custom Solution

The second approach on our list consists of developing a custom application to perform the subsetting. The programming stack you use for creating this application is of little consequence. So just adopt the programming languages and frameworks the software engineers in the organization are already comfortable using.

This approach can be effective for small and medium-size teams, especially when it comes to cost. But at the end of the day, it has more downsides than benefits.

First, building such a tool takes development time that could be used elsewhere, so you’re incurring an opportunity cost here. Database knowledge would also be necessary for building the tool, which would partially defeat the purpose of having this abstraction layer above the database in the first place.

Also, the opportunity cost of building the tool isn’t the only one you would incur. After the solution is ready, you’d have to maintain it for as long as it’s in use.

Adopting Commercial Tools

Why build when you can buy? That’s the reasoning behind the third approach on our list: adopting an existing commercial tool.

There are plenty of benefits in adopting a commercial tool, among which the most important are probably the robustness and scalability of the tool and the high number of databases and data sources it supports.

The downside associated with buying isn’t that surprising, and it boils down to one thing: high costs. These costs aren’t just what you pay when buying or subscribing to the tool. You have to factor in the total cost of ownership, including the learning curve. And that might be steeper than you expect.

Using Open Source Tools

Why buy when you can get it for free, with access to the source code as a bonus?

Adopting open source subsetting tools is like getting the best of both worlds: you don’t have to build a tool from scratch, while at the same time you can see the source code and change it if you ever need to expand the capabilities.

The downside is the total cost of ownership, which might still be high, depending on the learning curve the tool presents.

TDM: Subset Your Way to Success!

Copying data from production to use in tests is an old technique. It’s a handy way of feeding test cases with realistic data. However, it’s not without its problems. Besides security and privacy concerns, production cloning can also generate substantial infrastructure costs and high testing times.

In order to solve those problems, many organizations adopt data subsetting. Using subsetting and techniques that deal with security and privacy concerns (e.g., data masking) allows companies to leverage production cloning safely and affordably.

Thanks for reading, and until next time.

Author

This post was written by Carlos Schults. Carlos is a .NET software developer with experience in both desktop and web development, and he’s now trying his hand at mobile. He has a passion for writing clean and concise code, and he’s interested in practices that help you improve app health, such as code review, automated testing, and continuous build.

What is the Consumer Data Right (CDR)?


A DataOps Article.

You may have heard it mentioned, particularly if you’re in “Open Banking”. But CDR is the future of how we access and ultimately share our data with “trusted” third parties.

It will be introduced into the Australian banking sector initially from the middle of 2020, with scope and functionality evolving in phases, and will ultimately roll out across other sectors of the economy, including superannuation, energy, and telecommunications.

Vendor Benefits

The Consumer Data Right is a competition and consumer reform first!

  • Reduced sector “monopolization” (increased competition).
  • CDR encourages innovation and competition between service providers.
  • Access to new digital products & channels.
  • New, to be innovated, customer experiences.

Consumer Benefits

  • Immediate access to your information for quicker decision making.
  • Better transparency of vendor(s) pricing and offers.
  • Increase in products to support your lifestyle.
  • Consumer power e.g. ease of switching when dissatisfied with providers.

Vendor Risks

  • CDR compliance is mandatory for Data Holders.
  • Implementing CDR (on top of legacy platforms) is non-trivial.
  • Non-compliance penalties may be severe (fines and trading restrictions).
  • CDR is rapidly evolving and continually changing, so continuous conformance validation and upkeep are required.
  • Increased access to data means an increased “attack footprint”.

Be warned! Although the CDR is expected to create exciting new opportunities, there are also clearly defined conformance requirements. In a nutshell, breaches of the CDR Rules can attract severe penalties ranging from $10M to 10% of the organization’s annual revenue.

Who is responsible for CDR?

Ultimately CDR may evolve to a point where it is self-regulating. However, at present at least, the accreditation of who can be part of the ecosystem (i.e. Data Holders & Data Recipients) will be controlled by the relevant industry regulators*.

*In Australia the ACCC is responsible for implementing the CDR system. Only an organisation which has been accredited can provide services under the CDR system. An accredited provider must comply with a set of privacy safeguards, rules, and IT system requirements that ensure your privacy is protected and your data is transferred and managed securely.

How do consumers keep their data safe?

The CDR system is designed to ensure your data is only made available to service providers after you have authenticated and given consent.

Note: The diagram below, based on the Australian OAuth2/OIDC security guidelines for CDR, shows the key interactions between the Consumer, the Data Recipient (e.g. a Retailer App on a Phone), and a Data Holder (a Bank).

Australian CDR uses the OAuth2/OIDC Hybrid Flow

Consumers can control what data is shared, what it can be used for, and for how long. Consumers will also have the ability to revoke consent and have their information deleted at any time.

CDR is the beginning of an interesting new information era. Learn more about the Consumer Data Right and accreditation on the CDR website.