Word Break Rewriter

What it does

The Word Break Rewriter deals with compound words in queries. It works in two directions: it will split compound words found in queries and it will create compound words from adjacent query tokens.

Let’s use wall mount vs. wallmount as an example. Some documents in our index will have the form wall mount, others will have the compound form wallmount. If a user submits a query wall mount, she would only find the documents for wall mount and users searching for wallmount would only see the wallmount documents - both missing out on recall because of the variation in word compounding.

The Word Break Rewriter will split the query wallmount and add wall mount as a synonym, and vice-versa, it will add the synonym wallmount to the query wall mount.

As this rewriter manipulates the query before it is applied to fields, the parts of the broken-up compound word can match in different fields. This makes the rewriter very powerful for languages with very productive word compounding. For example, splitting the German query ‘lederhut’ (leather hat) into ‘leder hut’ will allow to search for ‘leder’ in the field ‘material’ and ‘hut’ in the category and product type fields.

Setting up a Word Break Rewriter

PUT  /_querqy/rewriter/word_break

 1{
 2    "class": "querqy.elasticsearch.rewriter.WordBreakCompoundRewriterFactory",
 3    "config": {
 4          "dictionaryField" :  "dictionary",
 5          "lowerCaseInput": true,
 6          "decompound": {
 7              "maxExpansions": 5,
 8              "verifyCollation": true
 9          },
10          "reverseCompoundTriggerWords": ["for"],
11          "morphology": "GERMAN"
12    }
13}

Note

OpenSearch users: Simply replace package name elasticsearch with opensearch in rewriter configurations.

The Word Break Rewriter is backed by a dictionary of known words. The dictionary is just a field in the index - the dictionaryField (line #4). This field is normally a ‘copy field’ to which contents from high-quality content fields is copied - usually fields like ‘title’, ‘product type’, ‘category’, ‘brand’, ‘colour’ or other textual attributes. The Analyzer of this field should not apply any stemming. Using a Standard Tokenizer and a Lowercase Token Filter will be a good start.

Setting lowerCaseInput to true (#5) assures that the query token lookup in the dictionary will match the lowercased words in the dictionary field, making the word break handling case-insensitive.

When the rewriter splits a compound word, it could find more than one position in the word at which a split is possible. The decompound.maxExpansions (#7) setting specifies how many of these split variants should be added as synonyms to the query.

Setting decompound.verifyCollation to true (#8) assures that only those variants will be added that co-occur in the dictionaryField value of at least one document. This can prevent many unwanted word splits. For example, the word ‘action’ will not be split into ‘act + ion’ as long as the ‘act’ and ‘ion’ do not co-occur in the dictionaryField of a document.

Hint

When using Solr, words provided on the list of protectedWords will be exempt from decompounding.

By default, it is assumed that words that together form compound word were just joined together without changing their form. But in some languages words can take a specific form when they are used in compounds. For example, when combining ‘baumwolle’ (cotton) with ‘jacke’ (jacket) to form ‘baumwolljacke’ (cotton jacket) in German, the final -e is removed from the modifier word ‘baumwolle’. [S. Langer (1998)]. mentions 68 different forms that modifier words can take in German compounds. The morphology setting enables language-specific word forms in compounds. Currently, the only available morphology implementation is GERMAN, which applies the 20 most popular compound forms listed on page 6 in [S. Langer (1998)]. The rewriter currently applies the morphology only when splitting compounds but not when creating them.

reverseCompoundTriggerWords (#10) support an additional compound creation strategy that drops the trigger word and creates a compound from the left and right tokens in reverse order. In languages with very productive compounding, like Dutch and German, this strategy helps to map phrases with prepositions like ‘for’ and ‘of’ to compounds. For example, let’s assume we have configured ‘voor’ (Dutch ‘for’) in ‘reverseCompoundTriggerWords’. The query ‘voer voor honden’ (‘food for dogs’) will then trigger a synonym ‘hondenvoer’ (‘dog food’) to be added to the query.

The rewriter will create compounds of adjacent query tokens without any further configuration.


Hint

See Configuring and applying a rewriter for instructions how to add a rewriter to the rewrite chain.

Reference

Configuration

PUT  /_querqy/rewriter/word_break

 1{
 2    "class": "querqy.elasticsearch.rewriter.WordBreakCompoundRewriterFactory",
 3    "config": {
 4          "dictionaryField" :  "dictionary",
 5          "minSuggestionFreq": 3,
 6          "minBreakLength": 4,
 7          "maxCombineLength": 30,
 8          "lowerCaseInput": true,
 9          "decompound": {
10              "maxExpansions": 5,
11              "verifyCollation": true
12          },
13          "morphology": "GERMAN",
14          "reverseCompoundTriggerWords": ["for", "from", "of"],
15          "alwaysAddReverseCompounds": true
16
17    }
18}
dictionaryField

The field containing the words for splitting compounds. Should be lowercased. Punctuation should be removed. No stemming should be applied. Typically a ‘copy field’.

Required.

minSuggestionFreqminSuggestionFrequency

How many documents must contain a term in the dictionaryField before it is added as a synonym?

Default: 1

minBreakLength

The minimum number of characters a word part must have when splitting a compound.

Default: 3

maxCombineLength

The maximum number of characters a word can have when generating compounds.

Default: 30

lowerCaseInput

Lowercase the input query?

Default: false

compound.morphology (Querqy >= 5.3)

Apply language-specific morphology when creating compound words. Available morphologies: DEFAULT, GERMAN.

Default: DEFAULT

decompound.maxExpansions

If a compound can be split at more than one position, how many variants at maximum should be added as a synonym?

Default: 3

decompound.verifyCollation

Verify that parts of a split-up compound co-occur in the dictionaryField of a document?

Default: false

decompound.morphology (Querqy >= 5.3)

Apply language-specific morphology when splitting compound words. Available morphologies: DEFAULT, GERMAN.

Default: DEFAULT

morphology (Querqy < 5.3)

Apply language-specific morphology in compound words. Only used for compound splitting. Available morphologies: DEFAULT, GERMAN.

Default: DEFAULT

reverseCompoundTriggerWords

List of words for which the compound of the left and right neighbour tokens should be added as a synonym in reverse order (add synonym ‘YX’ to ‘X <triggerWord> Y’).

Default: (empty list)

protectedWords

List of words that should never be split or be the result of a combination. These can be words that are no compounds grammatically but produce splits because their parts are terms in your dictionary (e.g. German “Slipper” if your vocabulary contains the words “Slip” and “per”) or simply terms that you don’t want to split as they produce too many irrelevant results when appearing on their own (e.g. German “Wissenschaft” if the dictionary contains “Wissen” and “Schaft”, or “Löwenzahn” which you might not want to split into “Löwe” and “Zahn”).

Default: (empty list)

alwaysAddReverseCompounds

Create an additional compound from adjacent tokens after reversing their order? (also add synonym ‘YX’ to query ‘X Y’ in addition to synonym ‘XY’)

Default: false

Request

No request parameters for the Word Break Rewriter.