DynamoDB Data Transformation Safety: From Manual Toil to Automated and Open Source



Data transformation remains a continuous challenge in engineering, one that is still largely built on manual toil. The open source utility Dynamo Data Transform was built to simplify data transformation for DynamoDB-based systems and to add safety and guardrails; it grew out of a robust manual framework that was then automated and open sourced. This article discusses the challenges of data transformation and the process that led to the tool.
Nov 23, 2022 15 min read
by
Guy Braunstain
reviewed by
Srini Penchikala
 
When designing a product to be a self-serve developer tool, you often face constraints, and likely one of the most common is scale. Ensuring our product, Jit, a security-as-code SaaS platform, was built for scale was not something we could embed as an afterthought; it needed to be designed and handled from the very first line of code.
We wanted to focus on developing our application and its user experience, without scale issues becoming a constant struggle for our engineers. After researching the infrastructure that would enable this for our team, we decided to build on AWS with a serverless-based architecture.
AWS Lambda is becoming an ever-popular choice for fast-growing SaaS systems, as it provides many scale and performance benefits out of the box through its suite of tools, and notably through the database that supports these systems, AWS's DynamoDB.
One of its key benefits is that it is already part of the AWS ecosystem, which abstracts away many of the operational tasks of management and maintenance, such as maintaining connections with the database, and it requires minimal setup to get started in AWS environments.
As a fast-growing SaaS operation, we need to evolve quickly based on user and customer feedback and embed this within our product. Many of these changes in application design have a direct impact on data structures and schemas.
With rapid and oftentimes significant changes in the application design and architecture, we found ourselves needing to make data transformations in DynamoDB very often, and of course, with existing users, it was a priority that this be achieved with zero downtime. (In the context of this article Data Transformation will refer to modifying data from state A to state B).
In the spirit of Brendon Moreno from the UFC:
Maybe not today, maybe not tomorrow, and maybe not next month, but only one thing is true, you will need to make data transformations one day, I promise.
Yet, while data transformation is a known constant in engineering and data engineering, it remains a pain point and challenge to do seamlessly. Currently, in DynamoDB, there is no easy way to do it programmatically in a managed way, surprisingly enough.
While there are many forms of data transformation, from replacing an existing item's primary key, to adding or removing attributes, to updating existing indexes, and the list goes on, there remains no simple way to perform any of these in a managed and reproducible manner without resorting to breakable, one-off scripting.
Below, we are going to dive into a real-world example of a data transformation process with production data.
Let's take the example of splitting a "full name" field into its components, "first name" and "last name". As you can see in the example below, the data aggregation currently writes names to the table with a FullName attribute, but let's say we want to split this field into FirstName and LastName fields.
Before

Id  | FullName
123 | Guy Br

After

Id  | FirstName | LastName
123 | Guy       | Br
Looks easy, right? Not so fast: even this simple change requires a series of steps on the business logic side in order to transform the data successfully.
But let's discuss some of the issues you would need to take into account before you even get started. How do you run and manage these transformations in different application environments, particularly when it is not considered a security best practice to have direct access to each environment? You also need to think about service dependencies: what should you do when another service depends on this specific data format? Your service needs to remain backward compatible and still provide the same interface to the external services relying on it.
When you have production clients, possibly the most critical question to ask before you modify a single line of code is how you will ensure zero downtime.
Some of the things you need to plan for to avoid downtime revolve around testing and verification. How do you even test your data transformation script? What are good practices for running a reliable dry run of a data transformation on production data?
There are so many things to consider before transforming data.
Now consider that this is usually, for the most part, done manually. What an error-prone, tedious process! It was clear we needed a fine-grained process that would prevent mistakes and help us manage all of these steps, so we set out to define one that would help us tackle the challenges above.

Figure 1: Rewrite Process Flow Chart
First, we adjusted the backend code to write the new data format to the database while still keeping the old one, writing the FullName together with the new FirstName and LastName attributes to give us some reassurance of backward compatibility. This would allow us to revert to the previous format if something went terribly wrong.
Link to GitHub
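The linked code shows the actual change; as a rough illustration, a dual-write in the service code might look like the sketch below. This is a minimal sketch only: the table name UsersExample, the saveUser function, and the naive name-splitting logic are assumptions, not the project's real code.

// Hypothetical dual-write: store both the old and the new name format,
// so readers of either format keep working during the transition.
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, PutCommand } = require('@aws-sdk/lib-dynamodb');

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));

const saveUser = async (id, fullName) => {
  // Naive split on the first space; real splitting logic may differ.
  const [firstName, ...rest] = fullName.split(' ');
  await docClient.send(new PutCommand({
    TableName: 'UsersExample', // hypothetical table name
    Item: {
      Id: id,
      FullName: fullName,   // old format, kept for backward compatibility
      FirstName: firstName, // new format
      LastName: rest.join(' '),
    },
  }));
};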
Next, we wrote a data transformation script that scans the old records and appends the FirstName and LastName attributes to each of them; see the example below:
Link to GitHub
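The script itself is in the linked repository; conceptually it is a paginated Scan followed by an update per record, roughly like the sketch below (table and attribute names follow the example above; everything else is an assumption).

// Hypothetical transformation sketch: scan all records and append
// FirstName/LastName derived from the existing FullName attribute.
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, ScanCommand, UpdateCommand } = require('@aws-sdk/lib-dynamodb');

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE_NAME = 'UsersExample'; // hypothetical

const transformUp = async () => {
  let lastEvaluatedKey;
  do {
    const { Items, LastEvaluatedKey } = await docClient.send(new ScanCommand({
      TableName: TABLE_NAME,
      ExclusiveStartKey: lastEvaluatedKey,
    }));
    for (const item of Items ?? []) {
      if (!item.FullName) continue; // nothing to split
      const [firstName, ...rest] = item.FullName.split(' ');
      await docClient.send(new UpdateCommand({
        TableName: TABLE_NAME,
        Key: { Id: item.Id },
        UpdateExpression: 'SET FirstName = :fn, LastName = :ln',
        ExpressionAttributeValues: { ':fn': firstName, ':ln': rest.join(' ') },
      }));
    }
    lastEvaluatedKey = LastEvaluatedKey;
  } while (lastEvaluatedKey);
};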
After writing the actual script (which is the easy part), we needed to verify that it actually does what it is supposed to. The next step was to run the script in a test environment and make sure it works as expected; only after the script's correctness was confirmed could it be run on the application environments.
The last phase is the cleanup. This includes taking the plunge and deleting the FullName attribute entirely from our records, purging the old data format that is no longer used, reducing clutter, and preventing any future misuse of the old format.
Link to GitHub
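The cleanup script follows the same scan-and-update pattern; only the update itself changes. A compact sketch, with the same assumed names as above and scan pagination omitted for brevity:

// Hypothetical cleanup: remove the old FullName attribute from every
// record once nothing depends on it anymore.
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, ScanCommand, UpdateCommand } = require('@aws-sdk/lib-dynamodb');

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE_NAME = 'UsersExample'; // hypothetical

const removeFullName = async () => {
  const { Items } = await docClient.send(new ScanCommand({ TableName: TABLE_NAME }));
  for (const item of Items ?? []) {
    await docClient.send(new UpdateCommand({
      TableName: TABLE_NAME,
      Key: { Id: item.Id },
      UpdateExpression: 'REMOVE FullName', // drop the old attribute
    }));
  }
};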
Let's quickly recap what we have done in this process: we adjusted the backend to write both formats, ran a transformation script over the existing records, verified it in a test environment, and only then cleaned up the old attribute.
This well-defined process helped us build much-needed safety and guardrails into our data transformation process. As mentioned before, it allowed us to avoid downtime by keeping the old format of the records until we no longer needed it, and it provided a good basis and framework for more complex data transformations.
Now that we have a process, let's be honest: real-world data transformations are hardly so simple. Let's assume a more likely scenario, where the data is actually ingested from an external resource, such as the GitHub API, and our more advanced data transformation requires us to ingest data from multiple sources.
Let’s take a look at the example below for how this could work.
In the following table, a GSI (global secondary index) uses GithubUserId as its partition key.
For the sake of this data transformation example, we want to add a “GithubUsername” column to our existing table.
Before

Id  | GithubUserId
123 | 7654321

After

Id  | GithubUserId | GithubUsername
123 | 7654321      | Guy7B
This data transformation looks as straightforward as the full name example, but there is a little twist: how can we get the GitHub username if we don't have this information? We have to use an external resource, in this case the GitHub API.
GitHub has a simple API for extracting this data (you can read the documentation here). We pass the GithubUserId and get back information about the user, which contains the username field we want:
https://api.github.com/user/:id
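For illustration, resolving a username from a numeric user id could look like the sketch below. It assumes the global fetch available in recent Node.js versions; the login field of the response is the GitHub username, and unauthenticated calls are heavily rate limited.

// Hypothetical helper: resolve a GitHub username from a numeric user id.
const getGithubUsername = async (githubUserId) => {
  const response = await fetch(`https://api.github.com/user/${githubUserId}`);
  if (!response.ok) {
    throw new Error(`GitHub API returned ${response.status} for user ${githubUserId}`);
  }
  const user = await response.json();
  return user.login; // "login" is the username field
};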
The naive flow is similar to the full name example above:
However, in contrast to our previous flow, there is an issue here: the naive flow is not safe enough. What happens if something goes wrong while the data transformation is calling the external resource? Perhaps the external resource crashes, blocks your IP, or is simply unavailable for any other reason. In that case, you might end up with production errors, a partial transformation, or other issues with your production data.
What can we do on our end to make this process safer?
While you can always resume the script if an error occurs, or try to handle errors in the script itself, it is important to have the ability to perform a dry run with the prepared data from the external resource before running the script on production. A good way to provide greater safety is by preparing the data in advance.
Below is the design of the safer flow. Only after preparing the data do we touch any records: we scan the user records, get the GithubUsername for each of them using the GitHub API, append each pair to a JSON object `{ [GithubUserId]: GithubUsername }`, and then write that JSON to a file.
This is what such a flow would look like:
Link to GitHub
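The linked script is the real one; the sketch below only conveys the shape of the preparation step, with the same assumed table name as before and scan pagination omitted for brevity.

// Hypothetical preparation step: build a { [GithubUserId]: GithubUsername }
// map and write it to a file before any record is modified.
const fs = require('fs');
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, ScanCommand } = require('@aws-sdk/lib-dynamodb');

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE_NAME = 'UsersExample'; // hypothetical

const prepareGithubUsernames = async () => {
  const usernames = {};
  const { Items } = await docClient.send(new ScanCommand({ TableName: TABLE_NAME }));
  for (const item of Items ?? []) {
    const res = await fetch(`https://api.github.com/user/${item.GithubUserId}`);
    const { login } = await res.json(); // "login" is the GitHub username
    usernames[item.GithubUserId] = login;
  }
  fs.writeFileSync('github-usernames.json', JSON.stringify(usernames, null, 2));
};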
Next, we scan the user records, look up the GithubUsername for each record by its GithubUserId in the preparation data, and update the record:
Link to GitHub
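Again, the linked script is authoritative; below is a sketch of the idea, reading the prepared file so that no external calls happen while the table is being updated (file and table names are assumptions).

// Hypothetical transformation step: read the prepared file and append
// GithubUsername to each record.
const fs = require('fs');
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, ScanCommand, UpdateCommand } = require('@aws-sdk/lib-dynamodb');

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE_NAME = 'UsersExample'; // hypothetical
const usernames = JSON.parse(fs.readFileSync('github-usernames.json', 'utf8'));

const transformUp = async () => {
  const { Items } = await docClient.send(new ScanCommand({ TableName: TABLE_NAME }));
  for (const item of Items ?? []) {
    const username = usernames[item.GithubUserId];
    if (!username) continue; // no prepared value, leave the record untouched
    await docClient.send(new UpdateCommand({
      TableName: TABLE_NAME,
      Key: { Id: item.Id },
      UpdateExpression: 'SET GithubUsername = :u',
      ExpressionAttributeValues: { ':u': username },
    }));
  }
};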
And finally, like the previous process, we wrap up by running the script on the testing environment, and then the application environments.
Once we built a robust process that we could trust for data transformation, we understood that to do away with human toil and ultimately error, the best bet would be to automate it.
We realized that even if this works for us today at our smaller scale, manual processes will not grow with us. This isn’t a practical long-term solution and would eventually break as our organization scales. That is why we decided to build a tool that would help us automate and simplify this process so that data transformation would no longer be a scary and painful process in the growth and evolution of our product. 
Every data transformation is just a piece of code that performs a specific change in our database, and these scripts should eventually live in your codebase. Keeping them there enables a few important things.
By enabling automation for data transformation processes, you essentially make it possible for every developer to be a data transformer. You likely should not give production access to every developer in your organization, but applying changes is only the last mile. When only a handful of people have access to production, automation leaves them with validating the scripts and running them on production, rather than also doing all of the heavy lifting of writing the scripts. Concentrating everything in those few hands consumes more time than necessary and is not safe.
When the scripts live in your codebase and their execution is automated via CI/CD pipelines, other developers can review them, and essentially anyone can perform data transformations on all environments, alleviating bottlenecks.
Now that we understand the importance of having the scripts managed in our codebase, we want to create the best experience for every data-transforming developer.
Every developer prefers to focus on their business logic, with as few context disruptions and switches as possible. This tool helps keep them focused on that business logic, so they don't have to start from scratch every time they need to perform a data transformation to support their current task.
For example, dynamo-data-transform provides the benefits of:
Quick Installation for serverless:
The package can also be used as a standalone npm package (see here).
To get started with DynamoDT, first install the package through npm (you can also install it via…):
npm install dynamo-data-transform --save-dev
Next, add the tool to your serverless.yml. Run:
npx sls plugin install -n dynamo-data-transform
You also have the option of adding it manually to your serverless.yml:
plugins:
  - dynamo-data-transform
You can also run the following command to see all of the capabilities that DynamoDT supports:
sls dynamodt --help
Let's get started with running an example with DynamoDT. We'll start by selecting an example from the code samples in the repo; for the sake of this walkthrough we're going to use `v3_insert_users.js`, but you are welcome to test it out with any of the examples you'll find here.
We'll initialize the data transformation folder with the relevant tables by running the command:
npx sls dynamodt init --stage local
For serverless, this generates the folders using the resources section of serverless.yml, so that section should already be defined there. The data-transformations folder is generated with a template script, which can be found here.
We will start by replacing the code in the template file v1_script-name.js with:
Link to GitHub
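The real script is in the linked repository. For orientation only, a DynamoDT script generally exports a versioned transformUp function; the sketch below is an assumption based on the package's published examples, so the helper name utils.transformItems and the exact export shape may differ from the actual template.

// Hypothetical DynamoDT script; consult the linked repository for the
// authoritative template and helper signatures.
const { utils } = require('dynamo-data-transform'); // assumed export

const TABLE_NAME = 'UsersExample'; // hypothetical

const transformUp = async ({ ddb, isDryRun }) => {
  // Add FirstName/LastName to every item, derived from FullName.
  const addFirstAndLastName = (item) => {
    const [firstName, ...rest] = (item.FullName || '').split(' ');
    return { ...item, FirstName: firstName, LastName: rest.join(' ') };
  };
  // Assumed helper that scans the table and writes back the transformed
  // items, honoring the dry-run flag.
  return utils.transformItems(ddb, TABLE_NAME, addFirstAndLastName, isDryRun);
};

module.exports = {
  transformUp,
  transformationNumber: 1, // scripts are versioned and ordered by this number
};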
For most regular data transformations, you can use the util functions from the dynamo-data-transform package; this means you don't need to manage the versions of the data transformation scripts yourself, as the package does this work for you. Once you've customized the data you want to transform, you can test the script using the dry run option by running:
npx sls dynamodt up --stage local --dry
The dry run option prints the records in your console so you can immediately see the results of the script, and ensure there is no data breakage or any other issues.

Once you're happy with the test results, you can remove the --dry flag and run the command again. This time it will run the script on your production data, so make sure to validate the results and outcome.
Once you have created your data transformation files, the next logical step is to add them to your CI/CD. To do so, add the command to your workflow/CI file for production environments, so that it runs immediately after the `sls deploy` command, which is useful for serverless applications.
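For instance, a deployment step might run the two commands back to back; a sketch, with the stage name as an assumption:

sls deploy --stage production
npx sls dynamodt up --stage production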
Finally, all of this is saved, as noted above, so if you want to see the history of your data transformations, you can run:
`npx sls dynamodt history --table UserExample --stage local`

The tool also provides an interactive CLI for those who prefer to work that way; all of the commands above are supported through it as well.

With Dynamo Data Transform, you get the added benefits of being able to version and order your data transformation operations and manage them in a single place. You also have the history of your data transformation operations if you would like to roll back an operation. And last but not least, you can reuse and review your previous data transformations.
We have open-sourced the Dynamo Data Transform tool that we built for internal use to perform data transformations on DynamoDB and serverless-based environments and manage these formerly manual processes in a safe way.
The tool can be used as a Serverless Plugin and as a standalone NPM package.
Feel free to provide feedback and contribute to the project if you find it useful.

Figure 2: Data Transformation Flow Chart 
