Dedupe events
Configuration Options
Required Options
inputs(required)
A list of upstream source or transform
IDs. Wildcards (*
) are supported.
See configuration for more info.
Type | Syntax | Default | Example |
---|---|---|---|
array | literal | ["my-source-or-transform-id","prefix-*"] |
fields(required)
Options controlling what fields to match against.
Type | Syntax | Default | Example |
---|---|---|---|
hash | literal | [] |
type(required)
The component type. This is a required field for all components and tells Vector which component to use.
Type | Syntax | Default | Example |
---|---|---|---|
string | literal | ["dedupe"] |
Advanced Options
cache(optional)
Options controlling how we cache recent Events for future duplicate checking.
Type | Syntax | Default | Example |
---|---|---|---|
hash | [] |
How it Works
Cache Behavior
This transform is backed by an LRU cache of size cache.num_events
.
That means that this transform will cache information in memory for
the last cache.num_events
Events that it has processed. Entries
will be removed from the cache in the order they were inserted. If
an Event is received that is considered a duplicate of an Event
already in the cache that will put that event back to the head of
the cache and reset its place in line, making it once again last
entry in line to be evicted.
Memory Usage Details
Each entry in the cache corresponds to an incoming Event and
contains a copy of the 'value' data for all fields in the Event
being considered for matching. When using fields.match
this will
be the list of fields specified in that configuration option. When
using fields.ignore
that will include all fields present in the
incoming event except those specified in fields.ignore
. Each entry
also uses a single byte per field to store the type information of
that field. When using fields.ignore
each cache entry additionally
stores a copy of each field name being considered for matching. When
using fields.match
storing the field names is not necessary.
Memory Utilization Estimation
If you want to estimate the memory requirements of this transform for your dataset, you can do so with these formulas:
When using fields.match
:
Sum(the average size of the *data* (but not including the field name) for each field in `fields.match`) * `cache.num_events`
When using fields.ignore
:
(Sum(the average size of each incoming Event) - (the average size of the field name *and* value for each field in `fields.ignore`)) * `cache.num_events`
State
This component is stateful, meaning its behavior changes based on previous inputs (events). State is not preserved across restarts, therefore state-dependent behavior will reset between restarts and depend on the inputs (events) received since the most recent restart.
Missing Fields
Fields with explicit null values will always be considered different
than if that field was omitted entirely. For example, if you run
this transform with fields.match = ["a"]
, the event "{a: null,
b:5}" will be considered different to the event "{b:5}".