Elasticsearch Analyzers – Basic Analyzers

In this tutorial, we'll look at some of the basic analyzers that Elasticsearch supports.

1. Keyword Analyzer

The keyword analyzer is a "noop" analyzer: it returns the entire input string as a single token.


POST _analyze
{
  "analyzer": "keyword",
  "text": "Java Sample Approach"
}

Terms:


[ Java Sample Approach ]

2. Whitespace Analyzer

The whitespace analyzer breaks text into terms whenever it encounters a whitespace character. It does not lowercase the terms or strip punctuation.


POST _analyze
{
  "analyzer": "whitespace",
  "text": "The Java Sample Approach's tutorials."
}

Terms:


[ The, Java, Sample, Approach's, tutorials. ]
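To build intuition for the difference between the first two analyzers, here is a rough Python sketch. It only approximates the behavior locally and is not Elasticsearch code; the function names `keyword_analyze` and `whitespace_analyze` are made up for illustration.

```python
def keyword_analyze(text):
    # Hypothetical helper: the whole input becomes one token, unchanged.
    return [text]


def whitespace_analyze(text):
    # Hypothetical helper: str.split() with no arguments splits on runs of
    # whitespace; case and punctuation are left untouched.
    return text.split()


print(keyword_analyze("Java Sample Approach"))
print(whitespace_analyze("The Java Sample Approach's tutorials."))
```

Note that the trailing period stays attached to `tutorials.` because only whitespace separates terms.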

3. Simple Analyzer

The simple analyzer breaks text into terms whenever it encounters a character that is not a letter, and lowercases every term.


POST _analyze
{
  "analyzer": "simple",
  "text": "The Java Sample Approach's tutorials."
}

Terms:


[ the, java, sample, approach, s, tutorials ]
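The "split on anything that is not a letter, then lowercase" rule can be approximated with a regular expression. This is a minimal sketch, assuming that Python's `[^\W\d_]` character class is a close enough stand-in for "letter"; `simple_analyze` is a hypothetical name, not part of any Elasticsearch client.

```python
import re

# [^\W\d_] matches word characters that are neither digits nor underscore,
# which is roughly "letters". This is an approximation of the real tokenizer.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def simple_analyze(text):
    # Lowercase first, then keep every maximal run of letters as a term.
    return LETTER_RUN.findall(text.lower())


print(simple_analyze("The Java Sample Approach's tutorials."))
```

The apostrophe is not a letter, so `Approach's` is split into `approach` and `s`, which matches the terms listed above.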

4. Stop Analyzer

The stop analyzer works like the simple analyzer, but also removes stop words. It uses the _english_ stop word list by default.


POST _analyze
{
  "analyzer": "stop",
  "text": "The Java Sample Approach's tutorials over years."
}

Terms:


[ java, sample, approach, s, tutorials, over, years ]

Configuration

stopwords: a predefined stop word list (such as _english_) or an array of stop words ([“the”, “over”] for example). Defaults to _english_.
stopwords_path: path to a file containing stop words (relative to the Elasticsearch config directory).

For example, we configure a stop analyzer with a custom array of stop words (“the”, “over”):


PUT jsa_index_analyzer_stop
{
  "settings": {
    "analysis": {
      "analyzer": {
        "jsa_stop_analyzer": {
          "type": "stop",
          "stopwords": [ "the", "over" ]
        }
      }
    }
  }
}

POST jsa_index_analyzer_stop/_analyze
{
  "analyzer": "jsa_stop_analyzer",
  "text": "The Java Sample Approach's tutorials over years."
}

Terms:


[ java, sample, approach, s, tutorials, years ]
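The effect of the stopwords setting can be sketched in Python as a filter applied after simple-analyzer tokenization. This is only an approximation under stated assumptions: `stop_analyze` is a hypothetical helper, and `ENGLISH_STOPWORDS` is assumed to mirror the default _english_ list.

```python
import re

# Assumption: this set mirrors the default _english_ stop word list.
ENGLISH_STOPWORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def stop_analyze(text, stopwords=ENGLISH_STOPWORDS):
    # Step 1: simple-analyzer style tokenization (lowercased letter runs).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Step 2: drop every token that appears in the stop word set.
    return [t for t in tokens if t not in stopwords]


text = "The Java Sample Approach's tutorials over years."
print(stop_analyze(text))                   # default list: "over" is kept
print(stop_analyze(text, {"the", "over"}))  # custom list: "over" is removed
```

With the default list only `the` is removed, because `over` is not in it; passing the custom set removes both, as in the two results shown above.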

Elasticsearch provides predefined stop word lists for the following languages:
_arabic_, _armenian_, _basque_, _brazilian_, _bulgarian_, _catalan_, _czech_, _danish_, _dutch_, _english_, _finnish_, _french_, _galician_, _german_, _greek_, _hindi_, _hungarian_, _indonesian_, _irish_, _italian_, _latvian_, _norwegian_, _persian_, _portuguese_, _romanian_, _russian_, _sorani_, _spanish_, _swedish_, _thai_, _turkish_.

To disable stop words, use _none_.

5. Standard Analyzer

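The standard analyzer is the default analyzer, used when no analyzer is specified. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29), lowercases terms, and works well for most languages.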


POST _analyze
{
  "analyzer": "standard",
  "text": "The Java-Sample-Approach's tutorials over years."
}

Terms:


[ the, java, sample, approach's, tutorials, over, years ]

Configuration

stopwords: a predefined stop word list (such as _english_) or an array of stop words ([“the”, “over”] for example). Defaults to _none_.
stopwords_path: path to a file containing stop words (relative to the Elasticsearch config directory).
max_token_length: maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255.

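For example, we configure a standard analyzer with a max_token_length of 5 (an artificially small value, for demonstration) and the _english_ stop word list: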


PUT jsa_index_analyzer_standard
{
  "settings": {
    "analysis": {
      "analyzer": {
        "jsa_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST jsa_index_analyzer_standard/_analyze
{
  "analyzer": "jsa_analyzer",
  "text": "The Java-Sample-Approach's tutorials over years."
}

Terms:


[ java, sampl, e, appro, ach's, tutor, ials, over, years ]
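To see where terms like `sampl`, `e` and `ach's` come from, the following Python sketch reproduces the three steps by hand: split long tokens at fixed intervals, lowercase, then drop stop words. It does not implement Unicode Text Segmentation; the input tokens are written out manually, the helper names are hypothetical, and `ENGLISH_STOPWORDS` is assumed to mirror the _english_ list.

```python
# Assumption: this set mirrors the default _english_ stop word list.
ENGLISH_STOPWORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def split_long_token(token, max_len):
    # Cut the token into consecutive chunks of at most max_len characters.
    return [token[i:i + max_len] for i in range(0, len(token), max_len)]


def analyze_tokens(tokens, max_len=255, stopwords=frozenset()):
    terms = []
    for token in tokens:
        for piece in split_long_token(token, max_len):
            piece = piece.lower()
            if piece not in stopwords:
                terms.append(piece)
    return terms


# Tokens written out by hand, before lowercasing (hyphens separate words,
# the apostrophe inside a word does not).
tokens = ["The", "Java", "Sample", "Approach's", "tutorials", "over", "years"]
print(analyze_tokens(tokens, max_len=5, stopwords=ENGLISH_STOPWORDS))
```

`Approach's` has ten characters, so it is cut into `appro` and `ach's`; `the` is dropped as a stop word, while `over` survives because it is not in the _english_ list.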
