Replace Rewriter
What it does
The Replace Rewriter acts as a preprocessor for other rewriters. In contrast to the Common Rules Rewriter, its main purpose is to handle different variants of terms rather than to enhance the query with business logic.
For instance, the term smartphone might need to be defined as a synonym for the term mobile in a subsequent rewriter (the Common Rules Rewriter in this case). Let’s assume that both terms exist in the index and have slightly different meanings, so it is required not only to apply a synonym for these terms, but also a down-boost rule, which is configured in the Common Rules Rewriter as well.
Let’s now assume that this rule is expected to apply not only if the user input is mobile, but also if it is mobiles, ombile or mo bile. One possibility is to define the rule multiple times in the Common Rules Rewriter, but this might lead to a configuration that is cluttered with repetitive rules that are all supposed to do exactly the same thing. Another approach is to use the Replace Rewriter to standardize all of these variants to mobile. Furthermore, the rewriter supports handling term variations in a more generic way using prefix and suffix wildcards.
Setup
As a first step, the Replace Rewriter is configured:
Elasticsearch/OpenSearch
PUT /_querqy/rewriter/replace
{
  "class": "querqy.elasticsearch.rewriter.ReplaceRewriterFactory",
  "config": {
    "rules": "mobiles => mobile",
    "ignoreCase": true,
    "inputDelimiter": ";",
    "querqyParser": "querqy.rewrite.commonrules.WhiteSpaceQuerqyParserFactory"
  }
}
Note
OpenSearch users: Simply replace the package name elasticsearch with opensearch in rewriter configurations.
Querqy 5 (Solr)
POST /solr/mycollection/querqy/rewriter/replace?action=save
Content-Type: application/json
{
  "class": "querqy.solr.rewriter.replace.ReplaceRewriterFactory",
  "config": {
    "rules": "mobiles => mobile",
    "ignoreCase": true,
    "inputDelimiter": ";",
    "querqyParser": "querqy.rewrite.commonrules.WhiteSpaceQuerqyParserFactory"
  }
}
Querqy 4 (Solr)
<lst name="rewriter">
  <str name="class">querqy.solr.contrib.ReplaceRewriterFactory</str>
  <str name="rules">replace-rules.txt</str>
  <str name="ignoreCase">true</str>
  <str name="inputDelimiter">;</str>
  <str name="querqyParser">querqy.rewrite.commonrules.WhiteSpaceQuerqyParserFactory</str>
</lst>
The replace rules must be specified in the property rules (Elasticsearch, Querqy 5 for Solr). Remember to JSON-escape the value of this property. In Querqy 4 for Solr, rules references a file in ZooKeeper that contains the rules. The property ignoreCase defines whether the rewriter ignores differences between upper- and lowercase when matching query terms to rule inputs (default is true). The property inputDelimiter lets you configure several inputs for the same output, separated by the configured delimiter (default is tab).
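Because the rules property holds the whole rule set as one JSON string, line breaks between rules have to be escaped. The following is a minimal sketch, assuming the request body is built in Python; the class name and config keys are taken from the Elasticsearch example above, while the variable names and the two-line rule set are only illustrative. Sending the request is left out.

```python
import json

# Illustrative two-line rule set; each rule sits on its own line.
rules = "mobiles; ombile; mo bile => mobile\ncheapest smartphones => cheap smartphone"

body = {
    "class": "querqy.elasticsearch.rewriter.ReplaceRewriterFactory",
    "config": {
        "rules": rules,
        "ignoreCase": True,
        "inputDelimiter": ";",
        "querqyParser": "querqy.rewrite.commonrules.WhiteSpaceQuerqyParserFactory",
    },
}

# json.dumps escapes the newline (and any quotes) inside the rules string,
# so the result can be used as the request body as-is.
payload = json.dumps(body, indent=2)
print(payload)
```

Building the body from a data structure and serializing it once avoids hand-escaping the rules string.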
Configuring simple replace rules
Each line contains one rule definition, except for empty lines and lines starting with #. The input and the output of a rule must be separated by =>. For simple replace rules, which map input terms directly to output terms, multiple inputs can be configured for the same output. The inputs must be delimited using the configured delimiter. Both the input and the output can comprise multiple terms.
# comments
mobiles; ombile; mo bile => mobile
cheapest smartphones => cheap smartphone
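To make the rule format concrete, here is a minimal sketch of how such a rules text could be read into a lookup table. It is not Querqy's parser: the function name is hypothetical, ";" is assumed as the configured inputDelimiter, and inputs are lowercased to mimic ignoreCase set to true.

```python
def parse_simple_rules(text, delimiter=";"):
    """Map each input variant (a tuple of terms) to its list of output terms."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip empty lines and comment lines
        inputs, _, output = line.partition("=>")
        for variant in inputs.split(delimiter):
            key = tuple(variant.lower().split())
            if key:
                mapping[key] = output.split()
    return mapping


rules = """# comments
mobiles; ombile; mo bile => mobile
cheapest smartphones => cheap smartphone"""

mapping = parse_simple_rules(rules)
print(mapping[("mo", "bile")])
```

Each of the three variants in the first rule becomes its own key, so a multi-term input such as mo bile is looked up as a sequence of two terms.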
Deleting terms
Terms can be deleted simply by not defining an output. This is helpful, e.g., for handling terms in the query that carry no semantic meaning (outside of Lucene analyzers). In combination with replacements, deleting terms is also useful for handling standalone special characters on a granular level.
# comments
the =>
/; , =>
+ => plus
The above rules will remove the term the from queries. Furthermore, standalone / or , characters in the query will be deleted, whereas a standalone + character will be mapped to plus.
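The effect of these rules on a query can be sketched as follows. This is only an illustration of the described behaviour, assuming the query is split on whitespace and matched case-insensitively; the table and function names are hypothetical.

```python
# Single-term rules from the example above; an empty list means "delete the term".
RULES = {"the": [], "/": [], ",": [], "+": ["plus"]}


def rewrite(query):
    out = []
    for term in query.split():
        # Terms without a rule pass through unchanged.
        out.extend(RULES.get(term.lower(), [term]))
    return " ".join(out)


print(rewrite("the galaxy s8 +"))       # galaxy s8 plus
print(rewrite("case / cover , black"))  # case cover black
```

Only standalone characters are affected here; a + that is attached to a term (as in s8+) is not a separate term after whitespace splitting and needs a wildcard rule instead.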
Configuring prefix replace rules
In several cases, it is helpful not to map terms to terms directly, but to use wildcards. The above rule for cheapest smartphones could be required to work in a more generic manner. This can be achieved by using a prefix wildcard for the term cheapest.
cheap* => cheap
This rule will map the terms cheaper, cheapest, cheaply and all other terms starting with cheap to the term cheap. In contrast to the Common Rules Rewriter, an input with a wildcard also matches a term that is exactly equal to the prefix. Note that the output of the rule does not need to match the prefix part of the input. Any output could be defined here (e.g. inexpensive).
Additionally, the rewriter supports reusing the part of the term matched by the wildcard. This is helpful, e.g., for handling typical misspellings in a more generic manner or for splitting terms. The wildcard match can be added to the output using $1.
samrt* => smart$1
computer* => computer $1
The above rules will, e.g., map samrtwatch to smartwatch or samrtphone to smartphone. Furthermore, terms like computerdesk will be mapped to computer desk.
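The prefix matching and the $1 substitution described above can be sketched in a few lines. This is not Querqy's implementation; the function name and signature are hypothetical, and the rule is passed as a prefix (the input without the trailing *) plus an output template.

```python
def apply_prefix_rule(term, prefix, output):
    """Apply a rule of the form '<prefix>* => <output>' to a single term."""
    if not term.startswith(prefix):
        return term  # the rule does not match, keep the term
    wildcard = term[len(prefix):]  # may be empty: the bare prefix also matches
    # Substitute the wildcard match and normalize whitespace.
    return " ".join(output.replace("$1", wildcard).split())


print(apply_prefix_rule("samrtwatch", "samrt", "smart$1"))          # smartwatch
print(apply_prefix_rule("computerdesk", "computer", "computer $1"))  # computer desk
print(apply_prefix_rule("cheapest", "cheap", "cheap"))               # cheap
```

Because the wildcard may match the empty string, apply_prefix_rule("cheap", "cheap", "cheap") also matches and returns cheap.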
Configuring suffix replace rules
The rewriter furthermore supports using wildcards at the beginning of terms for suffix matches. This is helpful for handling typical variations of term endings (e.g. singular-plural). The suffix wildcard is used in the same way as the prefix wildcard.
*phones => $1phone
*hpone => $1phone
*hpones => $1phone
The above rules will map iphones to iphone, smarthpones to smartphone, and smarthpone to smartphone.
The suffix wildcard is also helpful to handle special characters at the end of terms in a generic way.
*+ => $1 plus
*. => $1
*) => $1
(* => $1
The above rules will, e.g., map terms like s8+ to s8 plus or remove dots at the end of terms. The combination of a prefix and a suffix rule for brackets will map terms like (2018) to 2018.
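The suffix examples, including the bracket case, can be traced with a small sketch. It is not Querqy's implementation: the function names are hypothetical, and applying the suffix rule before the prefix rule to the same term is an assumption that mirrors the (2018) example.

```python
def apply_suffix_rule(term, suffix, output):
    """Apply a rule of the form '*<suffix> => <output>' to a single term."""
    if not term.endswith(suffix):
        return term  # the rule does not match, keep the term
    wildcard = term[: len(term) - len(suffix)]  # part matched by the leading *
    return " ".join(output.replace("$1", wildcard).split())


def apply_prefix_rule(term, prefix, output):
    """Apply a rule of the form '<prefix>* => <output>' to a single term."""
    if not term.startswith(prefix):
        return term
    wildcard = term[len(prefix):]  # part matched by the trailing *
    return " ".join(output.replace("$1", wildcard).split())


print(apply_suffix_rule("iphones", "phones", "$1phone"))  # iphone
print(apply_suffix_rule("s8+", "+", "$1 plus"))           # s8 plus

# Brackets: the suffix rule '*) => $1' runs first, then the prefix rule '(* => $1'.
term = apply_suffix_rule("(2018)", ")", "$1")  # "(2018"
term = apply_prefix_rule(term, "(", "$1")      # "2018"
print(term)
```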
Order of rules
The three types of replace rules are applied in the following order:
1. Simple mappings
2. Suffix mappings
3. Prefix mappings
Applying the simple mappings before the wildcard mappings helps to apply edge case mappings before the more generic wildcard mappings are applied.
(Current) Limitations
Using multiple wildcards in the same input is not supported (e.g. *input*).
The rewriter does not support defining multiple input terms for a wildcard rule (e.g. term1 term2*).
Using delimiters to configure multiple inputs for the same output is only supported for simple replace rules not containing a wildcard.