Why we built a CLI & Terraform Provider

David Mattia

September 20th, 2022 · 12 min read

 

From a leaked internal document at Facebook, we see the clear struggle with governing personal data:

We do not have an adequate level of control and explainability over how our systems use data, and thus we can’t confidently make controlled policy changes or external commitments such as “we will not use X data for Y purpose.” And yet, this is exactly what regulators expect us to do, increasing our risk of mistakes and misrepresentation.

As Facebook faces years of legal proceedings stemming from its use of personal data, it becomes increasingly important for all companies to be able to accurately identify and govern all personal data in their systems.

At Transcend, we’ve recently open sourced two infrastructure as code tools that make both identification and governance of personal data in your engineering systems much easier.

About the new tools

The Transcend Terraform Provider

We recently announced the release of our official Terraform Provider. This provider lets you declaratively create and update: 

  • Data silos (integrations with third parties like Stripe/Datadog/Salesforce or internal databases), 

  • Datapoints (classifications of a set of personal data that exists under some data silo),

  • API keys (which can be scoped to individual data silos if needed; see the sketch after this list),

  • Enrichers (which link related user identifiers such as phone numbers, user IDs, and email addresses),

  • And more!
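
For example, here is a minimal sketch of an API key scoped to a single data silo. The resource and field names follow the provider's documented schema, but the integration type string and scope name below are illustrative assumptions; check them against the Transcend docs before copying.

resource "transcend_data_silo" "datadog" {
  # The integration type string here is an assumption; see the Transcend docs for supported types
  type        = "datadog"
  description = "Log management"
}

# An API key that can only operate on the data silo defined above
resource "transcend_api_key" "datadog_key" {
  title      = "datadog-scoped-key"
  # Scope names are illustrative assumptions; grant only the scopes your workflow needs
  scopes     = ["connectDataSilos"]
  data_silos = [transcend_data_silo.datadog.id]
}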

One of the great things about the Terraform provider is that it allows you to integrate Transcend alongside any other tools that have Terraform providers.

Here’s a snippet that uses the Transcend provider to create an AWS data silo, creates an IAM Role that gives Transcend access to scan the account for the personal data it might contain, and then connects the data silo to that account.

resource "transcend_data_silo" "aws" {
  type        = "amazonWebServices"
  description = "Amazon Web Services (AWS) provides information technology infrastructure services to businesses in the form of web services."

  # Normally, Data Silos are connected in this resource. But for AWS, we want to delay connecting until after
  # we create the IAM Role, which must use the `aws_external_id` output from this resource. So instead, we set
  # `skip_connecting` to `true` here and use a `transcend_data_silo_connection` resource below
  skip_connecting = true
  lifecycle { ignore_changes = [plaintext_context] }
}

resource "aws_iam_role" "iam_role" {
  name        = "TranscendAWSIntegrationRole2"
  description = "Policy to allow Transcend access to this AWS Account"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        // 829095311197 is the AWS Organization for Transcend that will try to assume role into your organization
        Principal = { AWS = "arn:aws:iam::829095311197:root" }
        Condition = { StringEquals = { "sts:ExternalId" : transcend_data_silo.aws.aws_external_id } }
      },
    ]
  })

  inline_policy {
    name = "TranscendPermissions"
    policy = jsonencode({
      Version = "2012-10-17"
      Statement = [
        {
          Action = [
            "dynamodb:ListTables",
            "dynamodb:DescribeTable",
            "rds:DescribeDBInstances",
            "s3:ListAllMyBuckets"
          ]
          Effect   = "Allow"
          Resource = "*"
        },
      ]
    })
  }
}

# Give AWS time to become consistent with the new IAM Role permissions
resource "time_sleep" "pause" {
  depends_on      = [aws_iam_role.iam_role]
  create_duration = "10s"
}

data "aws_caller_identity" "current" {}

resource "transcend_data_silo_connection" "connection" {
  data_silo_id = transcend_data_silo.aws.id

  plaintext_context {
    name  = "role"
    value = aws_iam_role.iam_role.name
  }

  plaintext_context {
    name  = "accountId"
    value = data.aws_caller_identity.current.account_id
  }

  depends_on = [time_sleep.pause]
}

This enables all sorts of useful integrations, such as connecting Transcend and Datadog to remove logs relating to a particular user, securely connecting to clouds like AWS, Google Cloud, or Azure, or creating a database and then immediately connecting it to Transcend.

The CLI tool

Our second infrastructure as code tool is the `@transcend-io/cli` npm package, which can also be used as a standalone binary.

Like the Terraform provider, the Command Line Interface (CLI) is an infrastructure as code tool aimed at making data discovery and governance easier. Its schema is very similar to the Terraform provider's, but there are three major reasons why you might want to use it instead:

  1. If your organization does not have established practices around Terraform and how to deploy it in CI, the CLI provides a lower barrier to entry.

  2. If you are planning on auto-generating the config to upload to Transcend (which we’ll show an example of later), then it may be more natural to output YAML (which the CLI ingests) than HCL from Terraform.

  3. The CLI comes with options for generating configuration from your existing Transcend account that are not present in the Terraform provider, which would instead require `terraform import` to bring existing infrastructure into your code. A sketch of this workflow follows below.
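
As a sketch of that third point, the CLI's `tr-pull` command can export your existing Transcend configuration into a local YAML file that you keep in version control (double-check the flags against the current `@transcend-io/cli` README):

# Export the current state of your Transcend account into a local transcend.yml
yarn tr-pull --auth=$TRANSCEND_API_KEY --file=./transcend.yml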

You are also welcome to use both tools in conjunction with one another. We have seen success using Terraform to hand-configure systems and securely supply API credentials, while using the CLI to upload auto-generated datapoint schemas. But feel free to use whichever tool or combination fits your business needs best.

Data Mapping with Transcend

You can’t manage what you can’t see, so step one in setting up a privacy program is often determining where personal data lives inside your systems. If, like at most companies, the personal data you collect changes over time, data mapping is not an exercise you can complete once and then forget about.

Robust data mapping systems must enable you to retroactively find personal data in your legacy and existing systems, proactively manage new sources of personal data as you add them, and continuously scan so that personal data added in the future isn’t missed.

Retroactively finding and classifying personal data

At Transcend, we understand that not every software project starts out designing for privacy. Companies that have been around for a while likely have systems containing personal data that predate the current laws and regulations around how personal data must be handled. And as new laws come into effect in the future, those systems may need updates again.

It’s often infeasible to ask a company to go back and hand label where all of their personal data lives.

  • Did the person who created a central system leave your company and nobody is quite sure how “that server over there” works or what data it contains?

  • Do you have thousands of databases with millions of tables and tens of engineers?

  • Is there a disconnect between the people you want to be responsible for labeling data (such as legal) and the people who know how to find that data (such as engineers)?

Enter our Data Silo Plugins. These come in a variety of forms to help you sort through your old systems.

Silo Discovery Plugins

We have Silo Discovery Plugins that can find and suggest data silos your organization uses. Examples include scanning an AWS account for the databases it contains, your SSO tool (such as Okta) for all the applications your employees can access, or Salesforce for places you might keep personal data on prospects and leads.

Using our Terraform provider, adding a data silo plugin is as easy as defining when you want the scans to start and how often you want them to occur going forward.

resource "transcend_data_silo" "aws" {
  type        = "amazonWebServices"
  description = "Amazon Web Services (AWS) provides information technology infrastructure services to businesses in the form of web services."

  plugin_configuration {
    enabled                    = true
    type                       = "DATA_SILO_DISCOVERY"
    schedule_frequency_minutes = 1440 # 1 day
    schedule_start_at          = "2022-09-06T17:51:13.000Z"
    schedule_now               = false
  }

  # ...other fields...
}

# ...other resources...

In this example, we set up an AWS data silo to scan for databases, S3 buckets, and other resources that often contain personal data.

Silo Discovery via dependency files

Another way to discover data silos to connect is by scanning your codebase for external SDKs. Transcend can then map those SDKs to data silos and suggest them to you to add. Currently we support scanning for new data silos in JavaScript, Python, Gradle, and CocoaPods projects.

To get started, you'll need to add a data silo for the corresponding project type with the Silo Discovery Plugin enabled. For example, if you want to scan a JavaScript project, add a “JavaScript package.json” data silo. You can do this in the Transcend admin dashboard (or via the CLI or Terraform).

Then, you'll need to grab that dataSiloId and a Transcend API key and pass them to the CLI. Using JavaScript package.json as an example:

# Scan a JavaScript project (package.json files) to look for new data silos
yarn tr-discover-silos --scanPath=./myJavascriptProject --auth={{api_key}} --dataSiloId={{dataSiloId}}

This call will look for all the package.json files in the scan path ./myJavascriptProject, parse the dependencies into individual package names, and send them to the Transcend backend for classification.

These classifications can then be viewed in the data silo recommendations triage tab, the same place you’d look with other Silo Discovery mechanisms. The process is the same for scanning requirements.txt, Podfile, and build.gradle files.
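
For instance, scanning a Python project uses the same command, pointed at a directory containing requirements.txt files and at the dataSiloId of the corresponding data silo (the project path below is illustrative):

# Scan a Python project (requirements.txt files) to look for new data silos
yarn tr-discover-silos --scanPath=./myPythonProject --auth={{api_key}} --dataSiloId={{dataSiloId}}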

Datapoint Discovery Plugins

We also have Datapoint Discovery Plugins that can go into your data stores and extract your schemas. This supports databases like BigQuery, MongoDB, DynamoDB, Snowflake, PostgreSQL, MySQL, Redshift, and many more, as well as data stores such as Google Forms, Amazon S3, and Salesforce.

Adding a Datapoint Discovery Plugin is very similar to adding a Silo Plugin, just using a type of `DATA_POINT_DISCOVERY` instead:

resource "transcend_data_silo" "aws" {
  type = "amazonS3"

  plugin_configuration {
    enabled                    = true
    type                       = "DATA_POINT_DISCOVERY"
    schedule_frequency_minutes = 1440 # 1 day
    schedule_start_at          = "2022-09-06T17:51:13.000Z"
    schedule_now               = false
  }

  # ...other fields...
}

# ...other resources...

In this example, we set up an Amazon S3 data silo to scan for personal data stored in your S3 buckets.

Datapoint Classification Plugins

Lastly, we support Datapoint Classification Plugins that can sample the data in your datastores. This is especially powerful when combined with the Datapoint Discovery Plugins that find the schemas of your internal systems.

The results of the classification of datapoints in a Redshift database

In the example above, a Redshift database is being scanned. Each column in each table is sampled, and our classifier attempts to determine the data category each column belongs to. Each classification comes with a confidence rating to make triaging the findings easier.
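
Enabling classification in Terraform looks much like the other plugins. Here's a rough sketch; note that the plugin `type` string used for classification below is an assumption on our part, so confirm it against the provider documentation:

resource "transcend_data_silo" "database" {
  type = "database"

  plugin_configuration {
    enabled                    = true
    # NOTE: this type string is an assumption; check the provider docs for the exact value
    type                       = "CONTENT_CLASSIFICATION"
    schedule_frequency_minutes = 1440 # 1 day
    schedule_start_at          = "2022-09-06T17:51:13.000Z"
    schedule_now               = false
  }

  # ...connection details, as shown in the database example later in this post...
}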

Classification with complete security

One great part of this classification process is the security model. By using our end-to-end encryption gateway, Sombra, Transcend never needs to see the sample data in any of your systems. Likewise, Transcend never has direct access to your databases, nor any means of connecting to them.

All communication from Transcend to your database or other internal systems happens through Sombra, and the data that flows from Sombra back to Transcend never contains personal data that we can read. If personal data must be returned, the encryption gateway encrypts it with keys from your Key Management System, which Transcend does not have access to.

Here’s a complete example, using Terraform, of setting up a PostgreSQL database using Amazon RDS in a private subnet of a VPC and connecting it to Transcend:

locals {
  subdomain = "https-test"
  # You should pick a hosted zone that is in your AWS Account
  parent_domain = "sombra.dev.trancsend.com"
  # Org URI found on https://app.transcend.io/infrastructure/sombra
  organization_uri = "wizard"
}

######################################################################################
# Create a private network to put our database in with the sombra encryption gateway #
######################################################################################

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 2.18.0"

  name = "sombra-example-https-test-vpc"
  cidr = "10.0.0.0/16"
  azs  = ["us-east-1a", "us-east-1b"]

  private_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]
  public_subnets   = ["10.0.201.0/24", "10.0.202.0/24"]
  database_subnets = ["10.0.103.0/24", "10.0.104.0/24"]

  enable_nat_gateway                 = true
  enable_dns_hostnames               = true
  enable_dns_support                 = true
  create_database_subnet_group       = true
  create_database_subnet_route_table = true
}

#######################################################################
# Deploy a Sombra encryption gateway and register it to a domain name #
#######################################################################

data "aws_route53_zone" "this" {
  name = local.parent_domain
}

module "acm" {
  source      = "terraform-aws-modules/acm/aws"
  version     = "~> 2.0"
  zone_id     = data.aws_route53_zone.this.id
  domain_name = "${local.subdomain}.${local.parent_domain}"
}

variable "tls_cert" {}
variable "tls_key" {}
variable "jwt_ecdsa_key" {}
variable "internal_key_hash" {}

module "sombra" {
  source  = "transcend-io/sombra/aws"
  version = "1.4.1"

  # General Settings
  deploy_env       = "example"
  project_id       = "example-https"
  organization_uri = local.organization_uri

  # This should not be done in production, but allows testing the external endpoints during development
  transcend_backend_ips = ["0.0.0.0/0"]

  # VPC settings
  vpc_id                      = module.vpc.vpc_id
  public_subnet_ids           = module.vpc.public_subnets
  private_subnet_ids          = module.vpc.private_subnets
  private_subnets_cidr_blocks = module.vpc.private_subnets_cidr_blocks
  aws_region                  = "us-east-1"
  use_private_load_balancer   = false

  # DNS Settings
  subdomain       = local.subdomain
  root_domain     = local.parent_domain
  zone_id         = data.aws_route53_zone.this.id
  certificate_arn = module.acm.this_acm_certificate_arn

  # App settings
  data_subject_auth_methods = ["transcend", "session"]
  employee_auth_methods     = ["transcend", "session"]

  # HTTPS Configuration
  desired_count = 1
  tls_config = {
    passphrase = "unsecurePasswordAsAnExample"
    cert       = var.tls_cert
    key        = var.tls_key
  }
  transcend_backend_url = "https://api.dev.trancsend.com:443"

  # The root secrets that you should generate yourself and keep secret
  # See https://docs.transcend.io/docs/security/end-to-end-encryption/deploying-sombra#6.-cycle-your-keys for information on how to generate these values
  jwt_ecdsa_key     = var.jwt_ecdsa_key
  internal_key_hash = var.internal_key_hash

  tags = {}
}

######################################################################
# Create a security group that allows Sombra to talk to the database #
######################################################################

module "security_group" {
  source  = "terraform-aws-modules/security-group/aws"
  version = "~> 4.0"

  name   = "database-ingress"
  vpc_id = module.vpc.vpc_id

  # ingress
  ingress_with_cidr_blocks = [
    {
      from_port   = 5432
      to_port     = 5432
      protocol    = "tcp"
      description = "PostgreSQL access from private subnets within VPC (which includes sombra)"
      cidr_blocks = join(",", module.vpc.private_subnets_cidr_blocks)
    },
  ]
}

###################################################
# Create a sample postgres database using AWS RDS #
###################################################

module "postgresDb" {
  source  = "terraform-aws-modules/rds/aws"
  version = "~> 5.0"

  allocated_storage    = 5
  engine               = "postgres"
  engine_version       = "11.14"
  family               = "postgres11"
  major_engine_version = "11"
  instance_class       = "db.t3.micro"

  multi_az               = true
  db_subnet_group_name   = module.vpc.database_subnet_group
  vpc_security_group_ids = [module.security_group.security_group_id]
  skip_final_snapshot    = true
  deletion_protection    = false
  apply_immediately      = true

  identifier = "some-postgres-db"
  username   = "someUsername"
  db_name    = "somePostgresDb"
}

#######################################################
# As Sombra can talk to the database, we can create a #
# data silo using the private connection information. #
#######################################################

resource "transcend_data_silo" "database" {
  type = "database"

  plugin_configuration {
    enabled                    = true
    type                       = "DATA_POINT_DISCOVERY"
    schedule_frequency_minutes = 1440 # 1 day
    schedule_start_at          = "2022-09-06T17:51:13.000Z"
    schedule_now               = false
  }

  secret_context {
    name  = "driver"
    value = "PostgreSQL Unicode"
  }

  secret_context {
    name = "connectionString"
    value = join(";", [
      "Server=${module.postgresDb.db_instance_address}",
      "Database=${module.postgresDb.db_instance_name}",
      "UID=${module.postgresDb.db_instance_username}",
      "PWD=${module.postgresDb.db_instance_password}",
      "Port=${module.postgresDb.db_instance_port}"
    ])
  }
}

Notice that the database's security group is set up such that it can only be reached from within the Virtual Private Cloud. Also note that the Sombra encryption gateway is given permission to talk to the database, but no external Transcend system is.

Proactively managing new sources of personal data

Finding user data in existing systems is cool. But do you know what’s even cooler? Proactively labeling your data classifications and purposes as you add new features, and syncing those labels to Transcend. You can eliminate the need to triage our classifications by simply telling us what the classifications are.

This can be done with both Terraform and the CLI, but this is where the CLI really shines. Our customers like Clubhouse have even created database client libraries where they can encode privacy information directly into their schema definitions. During their deploys, they extract this data into a YAML file that the CLI syncs to Transcend.

As an example, in our own codebase we use an extension of Sequelize to define fields on an email-related model. Because the `from` and `to` fields of an email may contain personal email addresses, we label this data directly in our schema.

During a deploy, we extract the metadata about each database model and create a `transcend.yml` file containing a data point declared like:

- title: Communication
  key: dsr-models.communication
  description: A communication message sent between the organization and data subject
  fields:
    # ...other fields...
    - key: from
      title: from
      categories:
        - name: CONTACT
          category: CONTACT
    - key: to
      title: to
      categories:
        - name: CONTACT
          category: CONTACT

The CI job then syncs this data to Transcend, where there will be a data silo for the database with datapoints for the `from` and `to` columns listed as contact information.
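
That sync step is just a CLI invocation. Here's a minimal sketch of the CI job, assuming the deploy has already written `transcend.yml` and a `TRANSCEND_API_KEY` secret is available to the job:

# Sync the generated transcend.yml up to Transcend
yarn tr-push --auth=$TRANSCEND_API_KEY --file=./transcend.yml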

Continuously scanning for personal data sources

In the “Retroactively Finding and Classifying Personal Data” section above, we showed how Transcend makes it easy to find and classify personal data in your existing systems. But what about going into the future?

  • What about when your go-to-market team adds a new tracking tool “just to test it out” and forgets to let your security owners know?

  • Or what about if an engineering project slips through the cracks and doesn’t proactively note all personal data it stores?

  • Or what if your ideal process is that engineering builds the tooling and that legal uses the scanning/classification tooling to label the data before the project fully launches (as opposed to engineering labeling the data as described in the previous section)?

Because our scanners and classifiers can run on a schedule, it’s easy to stay continuously compliant. Let the Okta plugin discover that new tracking tool. Let the AWS plugin notify you when a new database is created. Let a database plugin scan the databases for any new personal data that might appear.

Labeling data shouldn’t be a once-a-year affair, and it definitely shouldn’t be an “I’ll do it when we’re getting audited for privacy violations” affair. With Transcend, you can rest easy knowing that your data map will always be up to date.


About Transcend

Transcend is the company that makes it easy to encode privacy across your entire tech stack. Our mission is to make it simple for companies to give users control of their data.

Automate data subject request workflows with Privacy Requests, ensure nothing is tracked without user consent using Transcend Consent, or discover data silos and auto-generate reports with Data Mapping.

