A Data Engineer's Guide to Data Anonymization Pipelines
An overview of how Data Engineers can use Data Anonymization in their Pipelines
November 7th, 2024
We recently published support for using Neosync base Transformers inside of Custom Transformers and were faced with the task of writing a LOT of documentation. So, like the good engineers that we are, we started looking at ways that we could auto-generate our documentation.
After some research, we came across Go Templates and realized that we could use them to automate the entire document generation process.
In this blog, we'll dive into the process and how it all works.
Let's jump in.
Go templates are a powerful tool for generating text output from Go data structures. They support conditional logic and looping, and a template can include sections defined in other templates.
Here's a quick example:
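package main

import (
	"html/template" // escapes values for HTML; text/template exposes the same API
	"log"
	"os"
)

// Entry is one row in the rendered email list.
type Entry struct {
	Name  string
	Email string
}

func main() {
	data := []Entry{
		{"Alice", "alice@example.com"},
		{"Bob", "bob@example.com"},
	}
	// template.Must panics if the template text fails to parse.
	t := template.Must(template.New("example").Parse(`
<html>
<head><title>Email List</title></head>
<body>
<h1>Email List</h1>
<ul>
{{range .}}
<li>{{.Name}} - {{.Email}}</li>
{{end}}
</ul>
</body>
</html>
`))
	// Render the template to stdout with data as the root value (the dot).
	if err := t.Execute(os.Stdout, data); err != nil {
		log.Fatal(err)
	}
}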
We define our template inside the Parse() call and then insert actions, delimited by {{ and }}, wherever we want to inject values from a Go data structure. It's also worth noting that the {{range .}} action lets us iterate over a slice, as we do in the example.
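The conditional logic and template reuse mentioned earlier can be sketched in the same style. This is a minimal illustration, not code from Neosync: the Verified field and the "row" template name are made up for the example, and it uses text/template so the angle brackets are written as-is.

```go
package main

import (
	"os"
	"strings"
	"text/template"
)

// Entry mirrors the struct from the example above. Verified is a
// hypothetical field added only to demonstrate a conditional.
type Entry struct {
	Name     string
	Email    string
	Verified bool
}

// listTmpl defines a reusable "row" template and then invokes it once per
// entry from the main template body.
const listTmpl = `{{define "row"}}- {{.Name}} <{{.Email}}>{{if .Verified}} (verified){{end}}
{{end}}{{range .}}{{template "row" .}}{{end}}`

// renderList executes the template against the entries and returns the text.
func renderList(entries []Entry) (string, error) {
	t, err := template.New("list").Parse(listTmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, entries); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := renderList([]Entry{
		{Name: "Alice", Email: "alice@example.com", Verified: true},
		{Name: "Bob", Email: "bob@example.com"},
	})
	if err != nil {
		panic(err)
	}
	os.Stdout.WriteString(out)
}
```

Here {{define}} names a block, {{template "row" .}} renders it with the current element as the dot, and {{if}} only emits its body when the field is true.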
Neosync has Transformers that customers use to anonymize and generate synthetic data. Customers can use any of the 40+ Transformers that Neosync comes with out of the box, or they can create their own using a custom JavaScript implementation.
Under the covers, these base transformers are Go functions that take different parameters depending on the transformer. Each Go function is wrapped as a Bloblang custom function so that we can expose it through Bento, which we use as our streaming platform.
Here is an example:
// Bloblang spec that defines the function params
spec := bloblang.NewPluginSpec().
	Description("Generates a new randomized email address.").
	Param(bloblang.NewInt64Param("max_length").Default(100000).Description("Specifies the maximum length for the generated data. This field ensures that the output does not exceed a certain number of characters.")).
	Param(bloblang.NewStringParam("email_type").Default(GenerateEmailType_UuidV4.String()).Description("Specifies the type of email type to generate, with options including `uuidv4`, `fullname`, or `any`.")).
	Param(bloblang.NewInt64Param("seed").Optional().Description("An optional seed value used to generate deterministic outputs."))

// Bloblang function registration
err := bloblang.RegisterFunctionV2("generate_email", spec, func(args *bloblang.ParsedParams) (bloblang.Function, error) {
	// ... other code here that implements the generate_email logic
	return func() (any, error) {
		output, err := generateRandomEmail(randomizer, maxLength, emailType, excludedDomains)
		if err != nil {
			return nil, fmt.Errorf("unable to run generate_email: %w", err)
		}
		return output, nil
	}, nil
})
There are two sections to the code above. The first section is where we build the spec. The spec defines the parameters that the transformer, or function, takes in. You can see that we define 3 parameters (max_length, email_type, seed), each with a description and, where applicable, a default.
The second section is where we register the function with Bloblang and then call the underlying generateRandomEmail function.
So the goal was to auto-generate documentation that would define the function signature along with the parameters and their definitions, and then an example of how to use the transformer. Ultimately, we were able to generate something like this:
We first defined an entrypoint file that, when run, triggers all of the generation. This lets us run it during a build and update all of the auto-generated docs whenever a transformer has been added or changed.
We created a generators.go file to serve as our entrypoint. It's pretty straightforward and looks like this:
package transformers
//go:generate go run neosync_transformer_generator.go $GOPACKAGE
//go:generate go run neosync_transformer_list_generator.go $GOPACKAGE
//go:generate go run neosync_js_transformer_docs_generator.go ../../../../docs/docs/transformers/gen-javascript-transformer.md
This file runs three generators. The first two generate the transformer code that allows developers to use the base transformers. Those don't contribute to auto-generating the documentation, so we're going to ignore them for now.
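As a rough sketch of how a generator program can pick up the argument from a go:generate line like the third one above: everything after the file name in `go run file.go <args>` reaches the program as os.Args[1:]. The outputPath helper and the placeholder content below are hypothetical and are not taken from the Neosync source.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// outputPath returns the first argument after the program name, which is
// where the go:generate directive places the destination file path.
func outputPath(args []string) (string, error) {
	if len(args) < 2 || args[1] == "" {
		return "", errors.New("usage: go run <generator>.go <output-file>")
	}
	return args[1], nil
}

func main() {
	path, err := outputPath(os.Args)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Placeholder content; a real generator would render a template here.
	content := []byte("<!-- generated file -->\n")
	if err := os.WriteFile(path, content, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Running `go generate ./...` from the repository root executes each directive in the directory of the file that contains it, which is why the output path in the directive is relative.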
The one that we care about is the last file, neosync_js_transformer_docs_generator.go.
Let's walk through the code for that file, starting with the main function.
err := filepath.WalkDir(".", func(path string, d fs.DirEntry, err error) error {
	// Stop on traversal errors; d may be nil when err is set.
	if err != nil {
		return err
	}
	if !d.IsDir() && filepath.Ext(path) == ".go" {
		node, err := parser.ParseFile(fileSet, path, nil, parser.ParseComments)
		if err != nil {
			// Skip files that fail to parse instead of aborting the walk.
			log.Printf("Failed to parse file %s: %v", path, err)
			return nil
		}
		for _, cgroup := range node.Comments {
			for _, comment := range cgroup.List {
				if strings.HasPrefix(comment.Text, "// +neosyncTransformerBuilder:") {
					// Directive format: +neosyncTransformerBuilder:<type>:<name>
					parts := strings.Split(comment.Text, ":")
					if len(parts) < 3 {
						continue
					}
					transformerFuncs = append(transformerFuncs, &transformers.BenthosSpec{
						SourceFile: path,
						Name:       parts[2],
						Type:       parts[1],
					})
				}
			}
		}
	}
	return nil
})
This code walks through the directory where all of the transformers live, parses each transformer file, and then looks for the +neosyncTransformerBuilder:<transform/generate>:<transformer_name> directive. This is set as a comment at the top of a transformer file to tell neosync_js_transformer_docs_generator.go that this is a file that should be included in the documentation auto-generation.
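To see the directive lookup in isolation, here is a self-contained sketch that parses Go source from a string instead of walking a directory. The directive type and the findDirectives helper are hypothetical names for this example; only the comment prefix comes from the code above.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strings"
)

// directive holds the two fields encoded in the builder comment.
type directive struct {
	Type string // "transform" or "generate"
	Name string
}

const directivePrefix = "// +neosyncTransformerBuilder:"

// findDirectives parses Go source text and returns every builder directive
// found in its comments.
func findDirectives(src string) ([]directive, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	var found []directive
	for _, group := range file.Comments {
		for _, c := range group.List {
			if !strings.HasPrefix(c.Text, directivePrefix) {
				continue
			}
			parts := strings.Split(c.Text, ":")
			if len(parts) < 3 {
				continue // malformed directive, skip it
			}
			found = append(found, directive{
				Type: parts[1],
				Name: strings.TrimSpace(parts[2]),
			})
		}
	}
	return found, nil
}

func main() {
	src := "// +neosyncTransformerBuilder:generate:generateEmail\n\npackage transformers\n"
	ds, err := findDirectives(src)
	if err != nil {
		panic(err)
	}
	for _, d := range ds {
		fmt.Printf("%s %s\n", d.Type, d.Name)
	}
}
```

Because the parser is asked for parser.ParseComments, every comment in the file shows up in file.Comments, so the directive is found even though it sits above the package clause.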
A little further down in our main function, we call the generateCode() function, which is where we encounter our first Go template. Let's take a look:
---
title: Javascript Transformer
slug: /transformers/javascript
hide_title: false
id: javascript
description: Learn about Neosync's javascript transformer
---
<!-- prettier-ignore-start -->
<!--
Code generated by Neosync neosync_js_transformer_docs_generator.go. DO NOT EDIT.
-->
# Neosync Javascript Transformer Functions
Learn about Neosync's Javascript transformer and generator functions, which provide a wide range of capabilities for data transformation and
generation within the Javascript Transformer and Generator. Explore detailed descriptions and examples to effectively utilize these functions in your jobs.
## Transformers
Neosync's transformer functions allow you to manipulate and transform data values with ease.
These functions are designed to provide powerful and flexible data transformation capabilities within your jobs.
Each transformer function accepts a value and a configuration object as arguments.
The source column value is accessible via the ` + "`value`" + ` keyword, while additional columns can be referenced using ` + "`input.{column_name}`" + `.
<br/>
{{range $i, $bs := .TransformerSpecs }}
<!--
source: {{$bs.SourceFile}}
-->
### {{$bs.Name}}
{{$bs.Description}}
**Parameters**
**Value**
Type: Any
Description: Value that will be transformed
**Config**
| Field | Type | Default | Required | Description |
| -------- | ---- | ------- | -------- | ----------- |
{{- range $i, $param := $bs.Params}}
{{- if eq $param.Name "value" }}{{ continue }}{{- end }}
| {{$param.Name}} | {{$param.TypeStr}} | {{$param.Default}} | {{ if $param.IsOptional -}} false {{- else -}} true {{- end }} | {{$param.Description}}
{{- end -}}
<br/>
**Example**
` + "```javascript" + `
{{$bs.Example}}
` + "```" + `
<br/>
{{end }}
## Generators
Neosync's generator functions enable the creation of various data values, facilitating the generation of realistic and diverse data for
testing and development purposes. These functions are designed to provide robust and versatile data generation capabilities within your jobs.
Each generator function accepts a configuration object as an argument.
<br/>
{{range $i, $bs := .GeneratorSpecs }}
<!--
source: {{$bs.SourceFile}}
-->
### {{$bs.Name}}
{{$bs.Description}}
**Parameters**
**Config**
| Field | Type | Default | Required | Description |
| -------- | ---- | ------- | -------- | ----------- |
{{range $i, $param := $bs.Params -}}
| {{$param.Name}} | {{$param.TypeStr}} | {{$param.Default}} | {{ if $param.IsOptional -}} false {{- else -}} true {{- end }} | {{ $param.Description }}
{{ end -}}
<br/>
**Example**
` + "```javascript" + `
{{$bs.Example}}
` + "```" + `
<br/>
{{end }}
<!-- prettier-ignore-end -->
This Go template generates the overall Transformer documentation page.
It's a pretty big template, so let's break it down:
{{range $i, $bs := .TransformerSpecs }}
<!--
source: {{$bs.SourceFile}}
-->
### {{$bs.Name}}
{{$bs.Description}}
**Parameters**
**Value**
Type: Any
Description: Value that will be transformed
**Config**
| Field | Type | Default | Required | Description |
| -------- | ---- | ------- | -------- | ----------- |
{{- range $i, $param := $bs.Params}}
{{- if eq $param.Name "value" }}{{ continue }}{{- end }}
| {{$param.Name}} | {{$param.TypeStr}} | {{$param.Default}} | {{ if $param.IsOptional -}} false {{- else -}} true {{- end }} | {{$param.Description}}
{{- end -}}
<br/>
**Example**
` + "```javascript" + `
{{$bs.Example}}
` + "```" + `
<br/>
{{end }}
We start with a range action that iterates over the transformers in .TransformerSpecs and then lay out the sections for each one: Parameters, Value, Config, and Example. As we range over the transformers, we fill in those sections. The inner range over $bs.Params writes one table row per parameter and uses {{ continue }} to skip the value parameter, since that one is documented separately in the Value section.
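The Config table portion of that template can be sketched on its own. The Spec and ParamInfo types below are hypothetical stand-ins that only carry the fields the template reads; the real Neosync types are not shown in this post. Note that {{continue}} requires Go 1.18 or later.

```go
package main

import (
	"os"
	"strings"
	"text/template"
)

// ParamInfo and Spec are hypothetical, trimmed-down stand-ins for the data
// the docs template ranges over.
type ParamInfo struct {
	Name        string
	TypeStr     string
	Default     string
	Description string
	IsOptional  bool
}

type Spec struct {
	Name   string
	Params []ParamInfo
}

// The "{{-" markers trim the whitespace before an action so that the table
// rows stay contiguous, which Markdown tables require.
const tableTmpl = `### {{.Name}}

| Field | Type | Default | Required | Description |
| ----- | ---- | ------- | -------- | ----------- |
{{- range .Params}}
{{- if eq .Name "value"}}{{continue}}{{end}}
| {{.Name}} | {{.TypeStr}} | {{.Default}} | {{if .IsOptional}}false{{else}}true{{end}} | {{.Description}} |
{{- end}}
`

// renderTable renders the parameter table for a single spec.
func renderTable(s Spec) (string, error) {
	t, err := template.New("table").Parse(tableTmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, s); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := renderTable(Spec{
		Name: "transformEmail",
		Params: []ParamInfo{
			{Name: "value", TypeStr: "any", Description: "Value that will be transformed"},
			{Name: "max_length", TypeStr: "int64", Default: "100000", Description: "Maximum output length."},
			{Name: "seed", TypeStr: "int64", Default: "1", Description: "Optional seed.", IsOptional: true},
		},
	})
	if err != nil {
		panic(err)
	}
	os.Stdout.WriteString(out)
}
```

The value parameter is skipped by the {{continue}}, so only max_length and seed end up as table rows.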
You'll also notice that for the Example section, we actually use another template that we inherit into this one.
{{- if eq .BenthosSpec.Type "transform" -}}
{{if eq (len .BenthosSpec.Params) 0}}
const newValue = neosync.{{.BenthosSpec.Name}}(value, {});
{{- else }}
const newValue = neosync.{{.BenthosSpec.Name}}(value, {
{{- range $i, $param := .BenthosSpec.Params -}}
{{- if eq $param.Name "value" -}}{{ continue }}{{- end -}}
{{ if $param.HasDefault }}
{{ if eq $param.Name "seed" -}}
{{$param.Name}}: 1,
{{- else -}}
{{$param.Name}}: {{$param.Default}},
{{- end }}
{{- else }}
{{ if eq $param.TypeStr "string"}}{{$param.Name}}: "", {{ end -}}
{{ if eq $param.TypeStr "int64"}}{{$param.Name}}: 1, {{ end -}}
{{ if eq $param.TypeStr "float64"}}{{$param.Name}}: 1.12, {{ end -}}
{{ if eq $param.TypeStr "bool"}}{{$param.Name}}: false, {{ end -}}
{{ if eq $param.TypeStr "any"}}{{$param.Name}}: "", {{ end -}}
{{ end }}
{{- end }}
});
{{- end }}
{{- else if eq .BenthosSpec.Type "generate" -}}
{{if eq (len .BenthosSpec.Params) 0}}
const newValue = neosync.{{.BenthosSpec.Name}}({});
{{- else }}
const newValue = neosync.{{.BenthosSpec.Name}}({
{{- range $i, $param := .BenthosSpec.Params -}}
{{ if $param.HasDefault }}
{{ if eq $param.Name "seed" -}}
{{$param.Name}}: 1,
{{- else -}}
{{$param.Name}}: {{$param.Default}},
{{- end }}
{{- else }}
{{ if eq $param.TypeStr "string"}}{{$param.Name}}: "", {{ end -}}
{{ if eq $param.TypeStr "int64"}}{{$param.Name}}: 1, {{ end -}}
{{ if eq $param.TypeStr "float64"}}{{$param.Name}}: 1.12, {{ end -}}
{{ if eq $param.TypeStr "bool"}}{{$param.Name}}: false, {{ end -}}
{{ if eq $param.TypeStr "any"}}{{$param.Name}}: "", {{ end -}}
{{ end }}
{{- end }}
});
{{- end -}}
{{ end }}
This allows us to break up the template into smaller chunks and easily define what we want our examples section to look like. Once generated, it produces something like this:
We then repeat this for the generators:
{{range $i, $bs := .GeneratorSpecs }}
<!--
source: {{$bs.SourceFile}}
-->
### {{$bs.Name}}
{{$bs.Description}}
**Parameters**
**Config**
| Field | Type | Default | Required | Description |
| -------- | ---- | ------- | -------- | ----------- |
{{range $i, $param := $bs.Params -}}
| {{$param.Name}} | {{$param.TypeStr}} | {{$param.Default}} | {{ if $param.IsOptional -}} false {{- else -}} true {{- end }} | {{ $param.Description }}
{{ end -}}
<br/>
**Example**
` + "```javascript" + `
{{$bs.Example}}
` + "```" + `
<br/>
{{end }}
<!-- prettier-ignore-end -->
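The post doesn't show how the Example field gets populated, so the following is only one plausible approach: render the smaller example template into a string first, store it on the spec, and then execute the page template. DocSpec, renderExample, and renderPage are hypothetical names, and both templates are heavily simplified.

```go
package main

import (
	"os"
	"strings"
	"text/template"
)

// DocSpec is a hypothetical, trimmed-down spec used only for this sketch.
type DocSpec struct {
	Name    string
	Type    string // "transform" or "generate"
	Example string // filled in by renderPage before the page is rendered
}

// exampleTmpl picks the call shape based on the spec type.
var exampleTmpl = template.Must(template.New("example").Parse(
	`{{if eq .Type "transform"}}const newValue = neosync.{{.Name}}(value, {});{{else}}const newValue = neosync.{{.Name}}({});{{end}}`))

// pageTmpl embeds the pre-rendered example as a plain string.
var pageTmpl = template.Must(template.New("page").Parse(
	"{{range .}}### {{.Name}}\n\n{{.Example}}\n\n{{end}}"))

// renderExample executes the example template for one spec.
func renderExample(s DocSpec) (string, error) {
	var sb strings.Builder
	if err := exampleTmpl.Execute(&sb, s); err != nil {
		return "", err
	}
	return sb.String(), nil
}

// renderPage fills in each spec's Example (mutating the slice elements) and
// then renders the full page.
func renderPage(specs []DocSpec) (string, error) {
	for i := range specs {
		ex, err := renderExample(specs[i])
		if err != nil {
			return "", err
		}
		specs[i].Example = ex
	}
	var sb strings.Builder
	if err := pageTmpl.Execute(&sb, specs); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := renderPage([]DocSpec{
		{Name: "transformEmail", Type: "transform"},
		{Name: "generateEmail", Type: "generate"},
	})
	if err != nil {
		panic(err)
	}
	os.Stdout.WriteString(out)
}
```

Rendering in two passes keeps the page template small, and the example template can be changed without touching the page layout.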
Go templates' interpolation, range, and conditional actions make it easy to automatically generate documentation that has many sections and subsections.
Go templates are a powerful way to auto-generate documentation, and we use them to generate the documentation for our Transformers. This saves us a ton of time on updates as well, since we can just change the function parameters and the documentation automatically gets updated. If you haven't tried Go templates, check them out!