
How does Segment handle duplicate data?


Segment has a dedicated de-duplication service that sits just behind the api.segment.com endpoint and attempts to drop duplicate data. However, that service has to keep the full set of events it compares against in memory to know whether it has already seen an event, so Segment stores only 24 hours' worth of event message_ids. This means Segment can de-duplicate any data that appears within a 24-hour rolling window.

An important point to remember is that Segment de-duplicates on the event’s message_id, not on the contents of the event payload. If you aren’t generating a message_id for each event, or you need to de-duplicate data over a period longer than 24 hours, Segment does not have a built-in way to de-duplicate that data.
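The behavior described above can be pictured with a small in-memory sketch. This is not Segment's implementation: the class name `RollingDeduplicator`, the `accept` method, and the choice to measure the window from the first time an ID was seen are all assumptions made for illustration. It also assumes timestamps are passed in non-decreasing order.

```python
from collections import OrderedDict

WINDOW_SECONDS = 24 * 60 * 60  # 24-hour rolling window, as described in the text


class RollingDeduplicator:
    """Hypothetical in-memory de-duplicator keyed on message_id only."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        # message_id -> first-seen timestamp (seconds), oldest entry first
        self._seen = OrderedDict()

    def _evict(self, now):
        # Forget IDs that were first seen a full window (or more) ago.
        while self._seen:
            _, first_seen = next(iter(self._seen.items()))
            if now - first_seen < self.window_seconds:
                break
            self._seen.popitem(last=False)

    def accept(self, message_id, now):
        """Return True to keep the event, False if it is a duplicate."""
        self._evict(now)
        if message_id in self._seen:
            return False
        self._seen[message_id] = now
        return True
```

Because the lookup uses only the ID, two events with identical payloads but different message_ids are both kept, and a repeated message_id that arrives after the window has passed is treated as new.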

Since the API layer de-duplicates only within this window, duplicate events that arrive more than 24 hours apart from one another must be de-duplicated in the Warehouse. Segment also de-duplicates messages going into a Warehouse based on the message_id, which is the id column in a Segment Warehouse. Note that in these cases you will still see duplicates in end tools, because there is no additional de-duplication layer before the event is sent to downstream tools.
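If you export rows yourself and need the same guarantee outside the Warehouse, de-duplicating on the id column can be sketched as follows. The function name `dedupe_by_id` is hypothetical, and the sketch assumes each row is a dict with an `id` key and a `received_at` ISO 8601 timestamp string in a single, consistent format; adjust the column names to match your own schema.

```python
def dedupe_by_id(rows):
    """Keep one row per id, preferring the earliest received_at (assumed column)."""
    kept = {}
    for row in rows:
        current = kept.get(row["id"])
        # ISO 8601 strings in one consistent format compare chronologically.
        if current is None or row["received_at"] < current["received_at"]:
            kept[row["id"]] = row
    return sorted(kept.values(), key=lambda r: r["received_at"])
```

Keeping the earliest copy is a design choice for this sketch; keeping the latest works equally well as long as the rule is applied consistently.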

Keep in mind that all of Segment’s libraries generate a message_id for each event payload. The exception is the Segment HTTP API, which assigns each event a unique message_id when the message is ingested. You can override these default generated IDs and manually assign a message_id if necessary.
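One reason to assign the ID yourself is to make retries safe: if the same logical event always produces the same message_id, a resend inside the 24-hour window can be dropped. The sketch below builds a payload only and does not send it. The helper `build_track_payload` and the namespace value are hypothetical, and the field names (`userId`, `event`, `properties`, `timestamp`, `messageId`) are assumed to match the HTTP API's common fields, so verify them against the current spec before use.

```python
import json
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")


def build_track_payload(user_id, event, properties, timestamp):
    """Build a track payload whose messageId is derived from its contents."""
    # Serialize deterministically so identical inputs give identical IDs.
    basis = json.dumps([user_id, event, properties, timestamp], sort_keys=True)
    return {
        "userId": user_id,
        "event": event,
        "properties": properties,
        "timestamp": timestamp,
        "messageId": str(uuid.uuid5(NAMESPACE, basis)),
    }
```

A content-derived ID also means two genuinely separate events with identical fields and timestamps would collide, so include something that distinguishes them (for example, an order ID in `properties`) when that can happen.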

This page was last modified: 14 May 2021


