Elasticsearch Tokenizers – Structured Text Tokenizers

In this tutorial, we'll look at Structured Text Tokenizers, which are typically used with structured text such as identifiers, email addresses, zip codes, and paths.

I. Keyword Tokenizer

The keyword tokenizer is the simplest tokenizer: it accepts whatever text it is given and outputs the exact same text as a single term.

For example:


POST _analyze
{
  "tokenizer": "keyword",
  "text": "Java Sample Approach"
}

Term:


[ Java Sample Approach ]
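Conceptually, the keyword tokenizer is a no-op: the entire input becomes one token. A minimal Python sketch of that behavior (illustrative only, not the actual Lucene implementation):

```python
def keyword_tokenize(text):
    # The keyword tokenizer emits the whole input as a single term,
    # with no splitting at all.
    return [text]

print(keyword_tokenize("Java Sample Approach"))
# ['Java Sample Approach']
```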

II. Pattern Tokenizer

The pattern tokenizer uses a regular expression to either split text into terms whenever it matches a word separator, or to capture matching text as terms.

The default pattern is \W+, which splits text whenever it encounters non-word characters.

For example:


POST _analyze
{
  "tokenizer": "pattern",
  "text": "Java_Sample_Approach's tutorials are helpful."
}

Terms:


[ "Java_Sample_Approach", "s", "tutorials", "are", "helpful" ]
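The default behavior can be approximated in Python with re.split on the same \W+ pattern (a sketch, not the Lucene tokenizer itself). Note that _ is a word character, so Java_Sample_Approach stays intact, while the apostrophe splits off the trailing s:

```python
import re

def pattern_tokenize(text, pattern=r"\W+"):
    # Split on runs of non-word characters; drop empty strings
    # produced by leading/trailing separators (e.g. the final period).
    return [t for t in re.split(pattern, text) if t]

print(pattern_tokenize("Java_Sample_Approach's tutorials are helpful."))
# ['Java_Sample_Approach', 's', 'tutorials', 'are', 'helpful']
```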

Configuration

pattern: a Java regular expression. Defaults to \W+.
flags: Java regular expression flags, combined with | (for example: "CASE_INSENSITIVE|COMMENTS"). See the Java Pattern documentation for the full list.
group: the capture group to extract as tokens. Defaults to -1 (split).

For example, to split text into tokens on commas:


PUT jsa_idx_pattern
{
  "settings": {
    "analysis": {
      "analyzer": {
        "jsa_analyzer": {
          "tokenizer": "jsa_tokenizer"
        }
      },
      "tokenizer": {
        "jsa_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}

POST jsa_idx_pattern/_analyze
{
  "analyzer": "jsa_analyzer",
  "text": "Java Sample Approach,Java Technology,Spring Framework"
}

Terms:


[ "Java Sample Approach","Java Technology","Spring Framework" ]
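The comma-splitting behavior of jsa_tokenizer above can be mimicked in plain Python (a sketch for illustration, not the Lucene implementation):

```python
import re

# Split on a comma pattern, mirroring the custom pattern tokenizer above.
text = "Java Sample Approach,Java Technology,Spring Framework"
print(re.split(",", text))
# ['Java Sample Approach', 'Java Technology', 'Spring Framework']
```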

III. Path Tokenizer

The path_hierarchy tokenizer takes a hierarchical value such as a filesystem path, splits on the path separator, and emits a term for each component in the tree.

For example:


POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/java-integration/elasticsearch/tokenizer"
}

Terms:


[ "/java-integration", "/java-integration/elasticsearch", "/java-integration/elasticsearch/tokenizer" ]
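The hierarchy expansion can be sketched in plain Python (an approximation of the tokenizer's output, assuming the default / delimiter):

```python
def path_hierarchy_tokenize(text, delimiter="/"):
    # Emit one term per level of the hierarchy, each term including
    # its full ancestor path.
    parts = [p for p in text.split(delimiter) if p]
    prefix = delimiter if text.startswith(delimiter) else ""
    return [prefix + delimiter.join(parts[: i + 1]) for i in range(len(parts))]

print(path_hierarchy_tokenize("/java-integration/elasticsearch/tokenizer"))
# ['/java-integration', '/java-integration/elasticsearch',
#  '/java-integration/elasticsearch/tokenizer']
```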

Configuration

delimiter: character to use as the path separator. Defaults to /.
replacement: optional replacement character to use for the delimiter. Defaults to the delimiter.
buffer_size: number of characters read into the term buffer in a single pass. Defaults to 1024. The term buffer will grow by this size until all the text has been consumed. It is advisable not to change this setting.
reverse: If true, emits the tokens in reverse order. Defaults to false.
skip: number of initial tokens to skip. Defaults to 0.

For example, we configure the tokenizer to split on - characters, replace them with /, and skip the first 3 tokens:


PUT jsa_idx_pathhierarchy
{
  "settings": {
    "analysis": {
      "analyzer": {
        "jsa_analyzer": {
          "tokenizer": "jsa_tokenizer"
        }
      },
      "tokenizer": {
        "jsa_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-",
          "replacement": "/",
          "skip": 3
        }
      }
    }
  }
}

POST jsa_idx_pathhierarchy/_analyze
{
  "analyzer": "jsa_analyzer",
  "text": "one-two-three-four-five"
}

Terms:


[ "/four", "/four/five" ]

If reverse is true:
Terms:


[ "/four/five", "/four" ]
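The forward (non-reversed) case above can be sketched in Python, assuming skip simply drops that many leading path components before the hierarchy terms are emitted (an illustrative approximation of this specific configuration, not the Lucene implementation):

```python
def path_hierarchy_tokens(text, delimiter="/", replacement=None, skip=0):
    # Split on the delimiter, drop the first `skip` components, then
    # emit one term per remaining level, joined with the replacement
    # character (which also leads each term, as in the example above).
    replacement = replacement or delimiter
    parts = text.split(delimiter)[skip:]
    return [replacement + replacement.join(parts[: i + 1])
            for i in range(len(parts))]

print(path_hierarchy_tokens("one-two-three-four-five",
                            delimiter="-", replacement="/", skip=3))
# ['/four', '/four/five']
```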
