Logging and debugging rewriters

It can become difficult to understand the output of query pre-processing. For example, if multiple rewriters interact as part of the rewrite chain, or if a Common Rules Rewriter is configured with tens of thousands of rules, it can be a challenge to figure out how a given query was produced. But even with a very simple setup of only a few rules, you will probably want to know whether a given rewriter was applied at all, for example, to analyse in your production environment which queries and which fraction of the search traffic were manipulated by Querqy.

In Querqy you can use two main approaches to understand how a query is rewritten. The first approach uses the standard facilities that your search engine offers - debug=true in Solr and the /_validate/query endpoint in Elasticsearch/OpenSearch. We will discuss this further in the section Debugging below. The second approach provides information about query rewriting that is available in the non-debug mode of your search engine and can be used for analysis and tracking. We call this approach Info Logging.

Debugging

Standard tools

The Querqy plugin ultimately produces a Lucene query. You can inspect this query using the standard means that your search engine provides for inspecting the Lucene queries it derives from its query DSL.

In Elasticsearch/OpenSearch this means that you can call https://<MYHOST>:<PORT>/<INDEX>/_validate/query?explain=true and submit the usual Querqy search query in the request body. For example:

GET https://<MYHOST>:<PORT>/<INDEX>/_validate/query?explain=true
Content-Type: application/json
{
  "query":{
    "querqy":{
      "matching_query":{
        "query":"laptop"
      },
      "query_fields":[
        "title^23",
        "name",
        "shortSummary"
      ],
      "rewriters":[
        "common_rules"
      ]
    }
  }
}

If, in the above example, a Common Rules Rewriter common_rules was defined with the rules

laptop =>
  SYNONYM: notebook
  UP(100): AMD
  DOWN(50): sleeve

the output of _validate/query?explain=true will look like this:

 1{
 2  "_shards":{
 3    "total":1,
 4    "successful":1,
 5    "failed":0
 6  },
 7  "valid":true,
 8  "explanations":[
 9    {
10      "index":"myindex",
11      "valid":true,
12      "explanation":"+(+(name:notebook | shortSummary:laptop | title:laptop^23.0 | title:notebook^23.0 | shortSummary:notebook | name:laptop) AdditiveBoostFunction(100.0,query(+(name:amd | shortSummary:amd | title:amd^23.0),def=0.0)) AdditiveBoostFunction(-50.0,query(+(shortSummary:sleeve name:sleeve title:sleeve^23.0),def=0.0)))"
13    }
14  ]
15}

Line 12 contains the string representation of the parsed Lucene query, and you will probably recognise the notebook / laptop synonyms. It also shows AdditiveBoostFunction sub-queries. AdditiveBoostFunction is a custom Lucene query that Querqy provides to deal with UP/DOWN boosting. In particular, it avoids producing negative document scores, which Lucene does not allow, and it guarantees that documents matching both UP(100) and DOWN(100) receive the same score as documents that match neither UP(100) nor DOWN(100).
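The score symmetry can be illustrated with a small arithmetic sketch. Note this is illustrative only and not Lucene's or Querqy's actual scoring code:

```python
# Illustrative sketch of additive boosting, NOT Querqy's real implementation:
# UP boosts add to the score, DOWN boosts subtract from it.
def additive_boost_score(base_score, up_boosts, down_boosts):
    # up_boosts / down_boosts: boost values of the UP/DOWN rules a doc matched
    return base_score + sum(up_boosts) - sum(down_boosts)

# A document matching both UP(100) and DOWN(100) ends up with the same score
# as a document that matches neither of them:
matches_both = additive_boost_score(1.0, [100.0], [100.0])
matches_neither = additive_boost_score(1.0, [], [])
assert matches_both == matches_neither
```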

Info Logging

Info logging is a powerful tool when you want to understand how a specific rewriter manipulated a query.

Concepts

Querqy rewriters can produce information about whether they were triggered at all and about how they changed the query. Within the framework of Info Logging, we call this information the info logging output.

Depending on the required granularity of the logging output, we distinguish between two types: one that contains all details that a rewriter can possibly produce as logging output, and one that contains only the rewriter ID.

The info logging output needs to be sent somewhere where we can pick it up and analyse or collect it for later analysis. We call the place to which we send the output a sink. The same output could, at least in theory, be sent to more than one sink.

Setting up Info Logging

Setting up Info Logging requires two steps: Configuring sinks and enabling the logging per request.

Warning

With Solr, note that the configuration of info logging and the required parameters have changed in an incompatible way with the release of ‘Querqy for Solr’ version 5.5.lucene900.0. The documentation for info logging in older Querqy versions can be found here. When migrating to a newer Querqy for Solr version, you might, depending on your use case, also consider using the more detailed rewriter output that Querqy provides when Solr is called with debugQuery=true, instead of Info Logging.

Sinks

To use Info Logging we need a mapping between each rewriter and the sink(s) to which this rewriter should send its log output. This mapping is configured within the rewriter definition:

PUT  /_querqy/rewriter/common_rules

1{
2    "class": "querqy.elasticsearch.rewriter.SimpleCommonRulesRewriterFactory",
3    "config": {
4        "rules" : "notebook =>\nSYNONYM: laptop"
5    },
6    "info_logging": {
7      "sinks": ["log4j"]
8    }
9}

Note

OpenSearch users: simply replace the package name elasticsearch with opensearch in rewriter configurations.

As you probably recognise at this stage, the example shows the configuration for a Common Rules Rewriter. Lines 6-8 are new. They contain the configuration for Info Logging. The sinks property is a list of named sinks to which this rewriter should send its log messages.

In this case, the list contains only one element, log4j, a predefined sink that routes the output to the Log4j logging framework, which is used by Elasticsearch and OpenSearch and can be configured further. Currently, log4j is the only available sink for Info Logging under Elasticsearch/OpenSearch, and it is not (yet) possible to provide a custom sink implementation.

The output in Log4j will look like this (using a file appender):

1[2021-03-26T13:23:43,006][INFO ][q.e.i.Log4jSink ] [node_s_0]DETAIL[ QUERQY ] {"id":"id-1001","msg":{"common_rules1":[{"APPLIED_RULES":["msg1"]}],"common_rules2":[{"APPLIED_RULES":["msg2"]}]}}
2[2021-03-26T13:28:47,454][INFO ][q.e.i.Log4jSink ] [node_s_0]REWRITER_ID[ QUERQY ] {"id":"id-1002","msg":["common_rules"]}

Let’s decompose this. DETAIL[ QUERQY ] (line 1) and REWRITER_ID[ QUERQY ] (line 2) are Log4j markers that Querqy provides and that you can use to filter log messages. The DETAIL and REWRITER_ID markers correspond to the output types that you can set per request and that are described below. They are both children of a common parent marker QUERQY.
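Because DETAIL and REWRITER_ID are children of the QUERQY marker, a Log4j 2 configuration can filter on them with a MarkerFilter. The following is only a sketch of a log4j2.xml fragment; the logger name and the appender reference are assumptions, not prescribed by Querqy:

```xml
<!-- Sketch only: route QUERQY-marked messages to a dedicated appender.
     The logger name and appender name are placeholders. -->
<Logger name="querqy.elasticsearch.infologging.Log4jSink"
        level="info" additivity="false">
  <AppenderRef ref="QuerqyInfoLog">
    <!-- MarkerFilter also accepts child markers such as DETAIL -->
    <MarkerFilter marker="QUERQY" onMatch="ACCEPT" onMismatch="DENY"/>
  </AppenderRef>
</Logger>
```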

The log message itself is a small JSON document. The msg element contains the messages as they were produced by the rewriters with the rewriter IDs (such as common_rules1) as keys and further rewriter-specific information as values.

The id element is an ID that can be passed per request to help trace requests across nodes (see below).
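For downstream analysis you will typically want to split the JSON payload off the Log4j line prefix. A minimal Python sketch, assuming the payload always starts at the first { of the line (as in the example output above):

```python
import json

def parse_querqy_log_line(line):
    """Return (id, msg) from an info-logging line written by the Log4j sink."""
    payload = json.loads(line[line.index("{"):])  # JSON starts at the first '{'
    return payload.get("id"), payload["msg"]

# DETAIL line copied from the example output above:
line = ('[2021-03-26T13:23:43,006][INFO ][q.e.i.Log4jSink ] [node_s_0]'
        'DETAIL[ QUERQY ] {"id":"id-1001","msg":{"common_rules1":'
        '[{"APPLIED_RULES":["msg1"]}],"common_rules2":[{"APPLIED_RULES":["msg2"]}]}}')

request_id, msg = parse_querqy_log_line(line)
assert request_id == "id-1001"
assert msg["common_rules1"][0]["APPLIED_RULES"] == ["msg1"]
```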

Enabling info logging per request

Once you have mapped rewriters to sinks, you can start using Info Logging. To trigger the rewriters to send logging output to the sinks, you need to enable Info Logging in your search requests:

POST /myindex/_search

 1{
 2    "query": {
 3        "querqy": {
 4            "matching_query": {
 5                "query": "notebook"
 6            },
 7            "query_fields": [ "title^3.0", "brand^2.1", "shortSummary"],
 8            "rewriters": [
 9              "word_break",
10              "common_rules"
11            ],
12            "info_logging": {
13              "id":"REQ-ID-0043",
14              "type": "DETAIL"
15            }
16        }
17    }
18}

Info Logging is controlled by the properties specified under info_logging (lines 12-15). You can set the properties as follows:

type

Values: DETAIL REWRITER_ID NONE

Controls whether logging output is generated at all, as well as the granularity of the output. It can take the values:

  • DETAIL - Logs all details that the rewriter produces as logging output.

  • REWRITER_ID - Only logs the IDs of the rewriters.

  • NONE - Logs nothing at all.

Default: NONE

id

An identifier that can be used to trace search requests. For example, if your index has more than one shard, the same search request will be executed on each of these shards and create a log message on each of them. You can use this ID to trace and aggregate the messages across shards. It is up to the client making the search request to supply the ID.

Default: not set

For examples of the output format for types DETAIL and REWRITER_ID, see the Log4j sink output above. It is up to the individual rewriter what log messages it emits for type DETAIL.
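To keep the info_logging parameters consistent across requests, the request body can be assembled programmatically. The helper below is a hypothetical convenience function, not part of the Querqy API; it only reproduces the request body shape shown above:

```python
# Hypothetical helper (not part of Querqy) that assembles a search request
# body with Info Logging enabled.
def querqy_search_body(query, query_fields, rewriters,
                       log_type="NONE", log_id=None):
    info_logging = {"type": log_type}
    if log_id is not None:
        info_logging["id"] = log_id
    return {
        "query": {
            "querqy": {
                "matching_query": {"query": query},
                "query_fields": query_fields,
                "rewriters": rewriters,
                "info_logging": info_logging,
            }
        }
    }

body = querqy_search_body("notebook", ["title^3.0", "shortSummary"],
                          ["common_rules"], log_type="DETAIL",
                          log_id="REQ-ID-0043")
```

The resulting dict can be serialised with json.dumps and sent as the body of a _search request.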