Set Up Segment Data Lakes
Segment Data Lakes provides a way to collect large quantities of data in a format that’s optimized for targeted data science and data analytics workflows. You can read more about Data Lakes and how they differ from Warehouses in our documentation.
Segment Data Lakes is available to Business tier customers only.
Prerequisites
Before you set up Segment Data Lakes, you need the following resources:
- An AWS account
- An Amazon S3 bucket to receive data and store logs
- A subnet within a VPC for the EMR cluster to run in
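If you are creating these resources from scratch, a minimal Terraform sketch might look like the following. The bucket name, region, and CIDR ranges are illustrative assumptions, not required values.

```hcl
provider "aws" {
  version = "~> 2.0" # the Data Lakes setup expects the v2 aws provider
  region  = "us-west-2"
}

# S3 bucket that receives Segment data and stores EMR logs.
resource "aws_s3_bucket" "segment_data_lake" {
  bucket = "segment-data-lake" # example name from this guide
}

# VPC and subnet for the EMR cluster to run in.
resource "aws_vpc" "data_lake" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "emr" {
  vpc_id     = "${aws_vpc.data_lake.id}"
  cidr_block = "10.0.1.0/24"
}
```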
Step 1 - Set Up AWS Resources
You can use the open source Terraform module to automate much of the setup work needed to get Data Lakes up and running. If you’re familiar with Terraform, you can modify the module to meet your organization’s needs; however, Segment guarantees support only for the template as provided. The Data Lakes setup uses Terraform v0.11+. To support more versions of Terraform, the aws provider must use v2, which is included in our example main.tf.
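For orientation, invoking the module from your own configuration looks roughly like the sketch below. The module source path and the variable names (`s3_bucket`, `external_ids`) are assumptions for illustration; use the interface documented in the module’s README.

```hcl
# Illustrative only: the source path and variable names are assumptions;
# consult the Terraform module's README for its real interface.
module "segment_data_lake" {
  source = "github.com/segmentio/terraform-aws-data-lake//modules/iam" # pin a release with ?ref= if desired

  # Hypothetical variables: the bucket Data Lakes writes to, and the
  # Segment workspace IDs allowed as external IDs on the IAM role.
  s3_bucket    = "segment-data-lake"
  external_ids = ["YOUR_WORKSPACE_ID"]
}
```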
If you prefer, you can instead use the manual setup instructions to configure these AWS resources.
The Terraform module and the manual setup instructions both provide a base level of permissions to Segment (for example, the correct IAM role to allow Segment to create Glue databases on your behalf). If you want stricter permissions, or other custom configurations, you can customize these manually.
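As an example of what that base level of trust looks like, the sketch below hand-rolls the cross-account role with an external ID condition, which is what the warnings about workspace IDs in this guide refer to. `SEGMENT_AWS_ACCOUNT_ID` and `YOUR_WORKSPACE_ID` are placeholders, and the permission policies the role needs are not shown.

```hcl
# Sketch: a role Segment can only assume when it presents your workspace
# ID as the external ID. SEGMENT_AWS_ACCOUNT_ID and YOUR_WORKSPACE_ID are
# placeholders; attach the Glue/S3/EMR permission policies separately.
resource "aws_iam_role" "segment_data_lake" {
  name = "SegmentDataLakeRole"

  assume_role_policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::SEGMENT_AWS_ACCOUNT_ID:root" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": ["YOUR_WORKSPACE_ID"] }
      }
    }
  ]
}
POLICY
}
```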
Step 2 - Enable Data Lakes Destination
After you set up the necessary AWS resources, the next step is to set up the Data Lakes destination within Segment:
- In the Segment App, click Add Destination, then search for and select Data Lakes.
- Click Configure Data Lakes and select the source to connect to the Data Lakes destination.
  Warning: You must add the Workspace ID to the external ID list in the IAM policy, or else the source data cannot be synced to S3.
- In the Settings tab, enter and save the following connection settings:
  - AWS Region: The AWS Region where your EMR cluster, S3 bucket, and Glue DB reside. Ex: `us-west-2`
  - EMR Cluster ID: The ID of the EMR cluster where the Data Lakes jobs will run.
  - Glue Catalog ID: The Glue Catalog ID (this must be the same as your AWS account ID).
  - IAM Role ARN: The ARN of the IAM role that Segment will use to connect to Data Lakes. Ex: `arn:aws:iam::000000000000:role/SegmentDataLakeRole`
  - S3 Bucket: Name of the S3 bucket used by Data Lakes. The EMR cluster will store logs in this bucket. Ex: `segment-data-lake`

  You must connect each source to the Data Lakes destination individually. However, you can copy the settings from another source by clicking `…` (“more”) next to the Set up Guide button.

  - (Optional) Date Partition: Advanced setting to change the date partition structure. The default structure is `day=<YYYY-MM-DD>/hr=<HH>`; to use it, leave this setting unchanged. To partition the data by a different date structure, choose one of the following options:
    - Day/Hour [YYYY-MM-DD/HH] (default)
    - Year/Month/Day/Hour [YYYY/MM/DD/HH]
    - Year/Month/Day [YYYY/MM/DD]
    - Day [YYYY-MM-DD]
  - (Optional) Glue Database Name: Advanced setting to change the name of the Glue database, which is set to the source slug by default. Each source connected to Data Lakes must have a different Glue database name; otherwise, data from different sources will collide in the same database.
- Enable the Data Lakes destination by clicking the toggle near the Set up Guide button.
Once the Data Lakes destination is enabled, the first sync begins approximately two hours later.
Step 3 - Verify Data is Synced to S3 and Glue
After the first sync completes successfully, you will see event data and sync reports populated in S3 and Glue. However, if an insufficient permission or an invalid setting is provided during setup, the first Data Lakes sync will fail.
To be alerted of sync failures by email, subscribe to the `Storage Destination Sync Failed` activity email notification within App Settings > User Preferences > Notification Settings. `Sync Failed` emails are sent on the 1st, 5th, and 20th sync failure. Learn more about the types of errors which can cause sync failures here.
(Optional) Step 4 - Replay Historical Data
If you want to add historical data to your data set, contact the Segment Support team to request a replay of historical data into Data Lakes.
The time needed to process a replay varies with the volume of data and the number of events in each source. If you decide to run a replay, we recommend starting with data from the last six months, and then replaying additional data if you find you need more.
Segment creates a separate EMR cluster to run replays, then destroys it when the replay finishes. This ensures that regular Data Lakes syncs are not interrupted, and helps the replay finish faster.
FAQ
Data Lakes Setup
Do I need to create a Glue database for each source?
No, Data Lakes automatically creates one Glue database per source. This database uses the source slug as its name.
Which IAM role do I use in the Data Lakes settings?
Four roles are created when you set up Data Lakes using Terraform. You add the `arn:aws:iam::$ACCOUNT_ID:role/segment-data-lake-iam-role` role to the Data Lakes Settings page in the Segment web app.
The roles which Data Lakes assigns during setup are:
- `segment-datalake-iam-role` - The role that Segment assumes to access S3, Glue, and the EMR cluster. It allows Segment access to:
  - Get, create, and delete access to the Glue catalog. Note that this does not provide access to Glue ETL or Glue crawlers.
  - Access only to the specific S3 bucket used for Data Lakes.
  - EMR access only to the clusters having the `vendor=segment` tag.
- `segment_emr_service_role` - Restricted role that can only be assumed by the EMR service. This is set up based on AWS best practices.
- `segment_emr_instance_profile_role` - The role assumed by the applications running on the EMR cluster. Based on AWS best practices, it allows Segment access to:
  - Get, create, and delete access to the Glue catalog. Note that this does not provide access to Glue ETL or Glue crawlers.
  - Access only to the specific S3 bucket used for Data Lakes.
- `segment_emr_autoscaling_role` - Restricted role that can only be assumed by EMR and EC2. This is set up based on AWS best practices.
Why doesn’t the Terraform module create an S3 bucket?
The module doesn’t create a new S3 bucket, so that you can reuse an existing bucket for your Data Lakes.
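In practice this means your Terraform configuration can reference the existing bucket instead of declaring a new one; a minimal sketch (the bucket name is an example):

```hcl
# Look up an existing bucket rather than creating one.
data "aws_s3_bucket" "data_lake" {
  bucket = "segment-data-lake"
}
```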
Do the S3 bucket and the EMR cluster need to be in the same region?
Yes, the S3 bucket and the EMR cluster must be in the same region.
How do I connect a new source to Data Lakes?
- Ensure that the `workspace_id` of the Segment workspace is in the list of external IDs in the IAM policy. You can either update this from the AWS console, or re-run the Terraform job.
- From your Segment workspace, connect the source to the Data Lakes destination.
Can multiple sources use the same EMR cluster?
Yes, you can configure multiple sources to use the same EMR cluster. We recommend that the EMR cluster be used only for Data Lakes, to ensure there are no interruptions from non-Data Lakes jobs.
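Remember that Segment’s role only has EMR access to clusters tagged `vendor=segment` (see the role descriptions above), so a shared cluster needs that tag. If you manage the cluster in Terraform, a minimal sketch might look like this; every value below is a placeholder except the tag:

```hcl
# Minimal EMR cluster sketch; the release label, instance types, subnet,
# and role names are placeholders. The vendor=segment tag is what lets
# Segment's IAM role reach the cluster.
resource "aws_emr_cluster" "data_lakes" {
  name          = "segment-data-lakes"
  release_label = "emr-5.33.0"
  applications  = ["Hadoop", "Hive", "Spark"]
  service_role  = "segment_emr_service_role"

  ec2_attributes {
    subnet_id        = "subnet-XXXXXXXX" # placeholder: your VPC subnet
    instance_profile = "segment_emr_instance_profile_role"
  }

  master_instance_group {
    instance_type = "m5.xlarge"
  }

  core_instance_group {
    instance_type  = "m5.xlarge"
    instance_count = 2
  }

  tags = {
    vendor = "segment"
  }
}
```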
Post-Setup
If you don’t see data after enabling a source, check the following:
- Does the IAM role have the Segment account ID and workspace ID as the external ID?
- Is the EMR cluster running?
- Are the correct IAM role and S3 bucket configured in the settings?
If all of these look correct and you’re still not seeing any data, please contact the Support team.
What are the `output` tables?
The `output` tables are temporary tables Segment creates when loading data. They are deleted after each sync.
Can I create additional directories in the S3 bucket?
Yes, you can create new directories in S3 without interfering with Segment data. However, do not modify the following directories, or create additional directories with these names:
- `logs/`
- `segment-stage/`
- `segment-data/`
- `segment-logs/`
What does “Partitioned” in a table name mean?
`Partitioned` just means that the table has partition columns (day and hour). All tables are partitioned, so you should see this on all table names.
How do I query Data Lakes tables from Redshift?
You can create external tables in Spectrum to access the tables in Glue and join the data with Redshift. Run the `CREATE EXTERNAL SCHEMA` command:
create external schema [spectrum_schema_name]
from data catalog
database '[glue_db_name]'
iam_role 'arn:aws:iam::[account_id]:role/MySpectrumRole'
create external database if not exists;
Replace:
- [spectrum_schema_name] = The schema name in Redshift you want to map to
- [glue_db_name] = The Glue database created by Data Lakes, which is named after the source slug
- [account_id] = Your AWS account ID; MySpectrumRole is the IAM role that Redshift assumes to access the Glue catalog and S3