Tokenizer

CAUTION

This transform has been deprecated in favor of the remap transform, which enables you to use Vector Remap Language (VRL for short) to create transform logic of any degree of complexity. The examples below show how you can use VRL to replace this transform's functionality.

.message = parse_tokens(.message)

Example Configuration

Loosely Structured

Config

    [transforms.my_transform_id]
    type = "tokenizer"
    field_names = [
      "remote_addr",
      "ident",
      "user_id",
      "timestamp",
      "message",
      "status",
      "bytes"
    ]
    field = "message"

      [transforms.my_transform_id.types]
      timestamp = "timestamp"
      status = "int"
      bytes = "int"

Input

    {
      "log": {
        "message": "5.86.210.12 - zieme4647 [19/06/2019:17:20:49 -0400] \"GET /embrace/supply-chains/dynamic/vertical\" 201 20574"
      }
    }

Output

    {
      "log": {
        "remote_addr": "5.86.210.12",
        "user_id": "zieme4647",
        "timestamp": "19/06/2019:17:20:49 -0400",
        "message": "GET /embrace/supply-chains/dynamic/vertical",
        "status": 201,
        "bytes": 20574
      }
    }

Configuration Options

Required Options

field_names (required)

The log field names assigned to the resulting tokens, in order.

Type: array
Syntax: literal
Default: (none)
Example: ["timestamp","level","message","parent.child"]

inputs (required)

A list of upstream source or transform IDs. Wildcards (*) are supported.

See configuration for more info.

Type: array
Syntax: literal
Default: (none)
Example: ["my-source-or-transform-id","prefix-*"]

type (required)

The component type. This is a required field for all components and tells Vector which component to use.

Type: string
Syntax: literal
Default: (none)
Example: ["tokenizer"]

Advanced Options

drop_field (optional)

If true, the field is dropped after parsing.

Type: bool

field (optional)

The log field to tokenize.

Type: string
Syntax: literal
Default: message
Example: ["message","parent.child"]

timezone (optional)

The name of the time zone to apply to timestamp conversions that do not contain an explicit time zone. This overrides the global timezone option. The time zone name may be any name in the TZ database, or local to indicate system local time.

Type: string
Syntax: literal
Default: local
Example: ["local","America/New_York","EST5EDT"]
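As a rough illustration of the rule described for this option, the Python sketch below attaches a default time zone only when a parsed timestamp carries no zone of its own. It is not Vector's implementation. The helper name `apply_default_timezone` and the fixed-offset zone are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def apply_default_timezone(parsed: datetime, default_tz: tzinfo) -> datetime:
    """Attach default_tz only when the timestamp has no explicit time zone.

    Hypothetical helper for illustration; not part of Vector.
    """
    if parsed.tzinfo is None:
        return parsed.replace(tzinfo=default_tz)
    return parsed  # an explicit offset in the input is left untouched


# Assumption: a fixed UTC-5 offset stands in for a named TZ database zone.
fixed_offset = timezone(timedelta(hours=-5))

naive = datetime.strptime("2020-12-01 02:37:54", "%Y-%m-%d %H:%M:%S")
aware = datetime.strptime("2020-12-01T02:37:54-0700", "%Y-%m-%dT%H:%M:%S%z")

if __name__ == "__main__":
    print(apply_default_timezone(naive, fixed_offset).isoformat())
    print(apply_default_timezone(aware, fixed_offset).isoformat())
```

In Python 3.9 and later, a TZ database name can be resolved with `zoneinfo.ZoneInfo`, for example `ZoneInfo("America/New_York")`, and passed as `default_tz`, provided time zone data is available on the system.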

types (optional)

Key/value pairs representing mapped log field names and types. This is used to coerce log fields from strings into their proper types. The available types are listed in the Types list below.

Timestamp coercions need to be prefaced with timestamp|, for example "timestamp|%F". Timestamp specifiers can use either of the following:

  1. One of the built-in formats listed in the Timestamp Formats table below.
  2. The time format specifiers from Rust's chrono library.

Types

  • array
  • bool
  • bytes
  • float
  • int
  • map
  • null
  • timestamp (see the table below for formats)

Timestamp Formats

| Format | Description | Example |
|---|---|---|
| %F %T | YYYY-MM-DD HH:MM:SS | 2020-12-01 02:37:54 |
| %v %T | DD-Mmm-YYYY HH:MM:SS | 01-Dec-2020 02:37:54 |
| %FT%T | ISO 8601/[RFC 3339](https://tools.ietf.org/html/rfc3339) format without time zone | 2020-12-01T02:37:54 |
| %a, %d %b %Y %T | RFC 822/2822 without time zone | Tue, 01 Dec 2020 02:37:54 |
| %a %d %b %T %Y | date command output without time zone | Tue 01 Dec 02:37:54 2020 |
| %a %b %e %T %Y | ctime format | Tue Dec 1 02:37:54 2020 |
| %s | UNIX timestamp | 1606790274 |
| %FT%TZ | ISO 8601/RFC 3339 UTC | 2020-12-01T09:37:54Z |
| %+ | ISO 8601/RFC 3339 UTC with time zone | 2020-12-01T02:37:54-07:00 |
| %a %d %b %T %Z %Y | date command output with time zone | Tue 01 Dec 02:37:54 PST 2020 |
| %a %d %b %T %z %Y | date command output with numeric time zone | Tue 01 Dec 02:37:54 -0700 2020 |
| %a %d %b %T %#z %Y | date command output with numeric time zone (minutes can be missing or present) | Tue 01 Dec 02:37:54 -07 2020 |

Note: the examples in this table are for 54 seconds after 2:37 am on December 1st, 2020 in Pacific Standard Time.

Type: hash
Example: [{"status":"int","duration":"float","success":"bool","timestamp_iso8601":"timestamp|%F","timestamp_custom":"timestamp|%a %b %e %T %Y","timestamp_unix":"timestamp|%F %T","parent":{"child":"int"}}]
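To make the coercion rules for `types` more concrete, here is a minimal Python sketch of how a `"timestamp|<format>"` spec might be split and applied alongside the simpler types. It is an illustration under stated assumptions, not Vector's code. The function names `coerce` and `coerce_fields` are hypothetical, only a few types are handled, and the accepted boolean spellings are an assumption. Python's `strptime` may not accept chrono shorthands such as `%F` and `%T`, so the example spells the format out in full.

```python
from datetime import datetime


def coerce(value: str, type_spec: str):
    """Convert a string token according to a spec such as "int" or "timestamp|<fmt>"."""
    kind, _, fmt = type_spec.partition("|")  # "timestamp|%Y" -> ("timestamp", "|", "%Y")
    if kind == "int":
        return int(value)
    if kind == "float":
        return float(value)
    if kind == "bool":
        lowered = value.strip().lower()  # assumption: only these two spellings
        if lowered in ("true", "false"):
            return lowered == "true"
        raise ValueError(f"not a boolean: {value!r}")
    if kind == "timestamp":
        if not fmt:
            # This sketch does not try to auto-detect a format.
            raise ValueError("this sketch needs an explicit timestamp format")
        return datetime.strptime(value, fmt)
    return value  # other types are passed through unchanged in this sketch


def coerce_fields(event: dict, types: dict) -> dict:
    """Apply coerce() to every non-null field that has an entry in types."""
    return {
        name: coerce(value, types[name]) if name in types and value is not None else value
        for name, value in event.items()
    }


if __name__ == "__main__":
    event = {"status": "201", "bytes": "20574", "timestamp": "19/06/2019:17:20:49 -0400"}
    types = {"status": "int", "bytes": "int", "timestamp": "timestamp|%d/%m/%Y:%H:%M:%S %z"}
    print(coerce_fields(event, types))
```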

How it Works

Blank Values

Both " " and "-" are considered blank values and their mapped fields will be set to null.
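The blank-value rule can be sketched in a few lines of Python. This is only an approximation of the behavior described above. The helper name `assign_fields` is hypothetical, and treating any whitespace-only token as blank is an assumption.

```python
def assign_fields(field_names: list, tokens: list) -> dict:
    """Pair tokens with field names in order, mapping blank tokens to None."""
    event = {}
    for name, token in zip(field_names, tokens):
        # Assumption: whitespace-only tokens and a lone "-" both count as blank.
        is_blank = token.strip() == "" or token == "-"
        event[name] = None if is_blank else token
    return event


if __name__ == "__main__":
    names = ["remote_addr", "ident", "user_id"]
    tokens = ["5.86.210.12", "-", "zieme4647"]
    print(assign_fields(names, tokens))
```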

State

This component is stateless, meaning its behavior is consistent across each input.

Special Characters

In order to extract raw values and remove wrapping characters, we must treat certain characters as special. These characters will be discarded:

  • "..." - Quotes are used to wrap phrases. Spaces are preserved, but the wrapping quotes will be discarded.
  • [...] - Brackets are used to wrap phrases. Spaces are preserved, but the wrapping brackets will be discarded.
  • \ - Can be used to escape the above characters; Vector will treat the escaped characters as literals.
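The rules above can be approximated with a small hand-written scanner. The Python sketch below is a simplified illustration, not Vector's parser, and the function name `tokenize` is hypothetical. It splits on spaces, keeps quoted and bracketed phrases intact while discarding the wrappers, and treats a backslash as an escape. Dropping the backslash from the output is an assumption, since the text above does not say whether it is kept.

```python
def tokenize(line: str) -> list[str]:
    """Split a line on spaces, keeping "..." and [...] phrases as single tokens."""
    closers = {'"': '"', "[": "]"}  # opening wrapper -> matching closing wrapper
    tokens: list[str] = []
    i, n = 0, len(line)
    while i < n:
        if line[i] == " ":  # skip separators between tokens
            i += 1
            continue
        closer = closers.get(line[i])
        if closer is not None:
            i += 1  # discard the opening wrapper
        current: list[str] = []
        while i < n:
            ch = line[i]
            if ch == "\\" and i + 1 < n:
                # Escaped character is kept literally; the backslash is dropped
                # (an assumption of this sketch).
                current.append(line[i + 1])
                i += 2
                continue
            if closer is not None and ch == closer:
                i += 1  # discard the closing wrapper
                break
            if closer is None and ch == " ":
                break  # an unwrapped token ends at the next space
            current.append(ch)
            i += 1
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    sample = ('5.86.210.12 - zieme4647 [19/06/2019:17:20:49 -0400] '
              '"GET /embrace/supply-chains/dynamic/vertical" 201 20574')
    for token in tokenize(sample):
        print(repr(token))
```

Run against the input from the example configuration, the sketch yields seven tokens, one for each entry in `field_names`.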