Elasticsearch Tokenizers – Partial Word Tokenizers

In this tutorial, we’re going to look at 2 tokenizers that break text or words into small fragments for partial word matching: the N-Gram Tokenizer and the Edge N-Gram Tokenizer.

I. N-Gram Tokenizer

The ngram tokenizer does 2 things:
– breaks up text into words when it encounters characters from a specified list (whitespace, punctuation…)
– emits N-grams of each word, of the specified lengths ("quick" with length = 2 -> [qu, ui, ic, ck])

=> N-grams are like a sliding window of contiguous letters that moves across the word.

For example:


POST _analyze
{
  "tokenizer": "ngram",
  "text": "Spring 5"
}

It will generate terms with a sliding window from 1 character (the default min-width) to 2 characters (the default max-width):


[ "S", "Sp", "p", "pr", "r", "ri", "i", "in", "n", "ng", "g", "g ", " ", " 5", "5" ]

Configuration

min_gram: minimum length of characters in a gram (min-width of the sliding window). Defaults to 1.
max_gram: maximum length of characters in a gram (max-width of the sliding window). Defaults to 2 (see the note after this list).
token_chars: character classes that will be included in a token. Elasticsearch will split on characters that don’t belong to:
+ letter (a, b, …)
+ digit (1, 2, …)
+ whitespace (" ", "\n", …)
+ punctuation (!, ", …)
+ symbol ($, %, …)

Defaults to [] (keep all characters).
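
Note: in Elasticsearch 7.x and later, the difference between max_gram and min_gram for the ngram tokenizer is limited by the index-level setting index.max_ngram_diff (default: 1). If you need a wider window, raise that setting when creating the index. A minimal sketch (the index, tokenizer, and analyzer names below are just placeholders, not part of the examples in this tutorial):


PUT jsa_index_wide_ngram
{
  "settings": {
    "index": {
      "max_ngram_diff": 4
    },
    "analysis": {
      "tokenizer": {
        "jsa_wide_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 6
        }
      },
      "analyzer": {
        "jsa_wide_analyzer": {
          "tokenizer": "jsa_wide_tokenizer"
        }
      }
    }
  }
}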

For example, we will create a tokenizer with a fixed sliding window (width = 3) and only the letter & digit character classes:


PUT jsa_index_n-gram
{
  "settings": {
    "analysis": {
      "analyzer": {
        "jsa_analyzer": {
          "tokenizer": "jsa_tokenizer"
        }
      },
      "tokenizer": {
        "jsa_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

POST jsa_index_n-gram/_analyze
{
  "analyzer": "jsa_analyzer",
  "text": "Tut101: Spring 5"
}

Terms:


[ "Tut", "ut1", "t10", "101", "Spr", "pri", "rin", "ing" ]

We can see that ":" (punctuation) and "5" (digit but "g 5" contains whitespace) were not contained in the terms.

II. Edge N-Gram Tokenizer

The edge_ngram tokenizer does 2 things:
– breaks up text into words when it encounters characters from a specified list (whitespace, punctuation…)
– emits N-grams of each word, where the start of each N-gram is anchored to the beginning of the word (quick -> [q, qu, qui, quic, quick])

For example:


POST _analyze
{
  "tokenizer": "edge_ngram",
  "text": "Spring 5"
}

It will generate terms with maximum length = 2:


[ "S", "Sp" ]

We can see that the default maximum length of 2 is not very useful on its own; we need to configure it.

Configuration

min_gram: minimum length of characters in a gram. Defaults to 1.
max_gram: maximum length of characters in a gram. Defaults to 2.
token_chars: character classes that will be included in a token. Elasticsearch will split on characters that don’t belong to:
+ letter (a, b, …)
+ digit (1, 2, …)
+ whitespace (" ", "\n", …)
+ punctuation (!, ", …)
+ symbol ($, %, …)

Defaults to [] (keep all characters).

For example, we will create a tokenizer that treats only letters and digits as token characters, and produces grams with minimum length 2 and maximum length 8:


PUT jsa_index_edge_ngram
{
  "settings": {
    "analysis": {
      "analyzer": {
        "jsa_analyzer": {
          "tokenizer": "jsa_tokenizer"
        }
      },
      "tokenizer": {
        "jsa_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 8,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

POST jsa_index_edge_ngram/_analyze
{
  "analyzer": "jsa_analyzer",
  "text": "Tut101: Framework 5"
}

Terms:


[ "Tu, "Tut", "Tut1", "Tut10", "Tut101", "Fr", "Fra", "Fram", "Frame", "Framew", "Framewo", "Framewor" ]

We can see that ":" (punctuation), "5" (digit but "g 5" contains whitespace), and Framework (length > 8) were not contained in the terms.
