Elasticsearch index field with wildcard and search for it

I have a document with a field "serial number". That serial number is ABC.XXX.DEF, where XXX indicates a wildcard portion: three characters matching [a-zA-Z0-9].
So users can search for:
ABC.123.DEF
ABC.234.DEF
ABC.XYZ.DEF
while the document only includes
ABC.XXX.DEF
When a user queries ABC.123.DEF, I need a hit on the document containing ABC.XXX.DEF. Since other documents might contain ABC.DEF.XXX and must not be hit, I am running out of ideas with my basic Elasticsearch knowledge.
Do I have to attack the problem from the query side or when analyzing/tokenizing the pattern?
Can anyone give me an example of how to approach this problem?

As long as the serial number is well defined, the first solution that comes to my mind is to split the serial number into three parts ("part1", "part2" and "part3", for example) and index them as three separate fields. Parts consisting of wildcards should have a special value or may not be indexed at all. Then at query time I would split the serial number provided by the user in the same way. Assuming that parts consisting of wildcards are not indexed, my query would look like this:
"query": {
"bool": {
"must":[
{
"bool": {
"should": [
{
"match": {
"part1": "ABC"
}
},
{
"bool": {
"must_not": {
"exists": {
"field": "part1"
}
}
}
}
]
}
},
... // Similar code for other parts
]
}
}
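For illustration, a rough sketch of how the mapping and the wildcard document might look under this approach (the index name serials and the keyword field types are assumptions, not part of the original answer):

PUT serials
{
  "mappings": {
    "properties": {
      "part1": { "type": "keyword" },
      "part2": { "type": "keyword" },
      "part3": { "type": "keyword" }
    }
  }
}

PUT serials/_doc/1
{
  "part1": "ABC",
  "part3": "DEF"
}

The document ABC.XXX.DEF is stored with part2 omitted, so the must_not/exists clause above matches it regardless of what the user supplies in that position; a fully concrete serial like ABC.123.DEF would simply be indexed with all three parts present.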

Related

Find same text within time range

I'm storing blog articles in Elasticsearch in this format:
{
  blog_id: keyword,
  blog_article_id: keyword,
  timestamp: date,
  article_text: text
}
Suppose I want to find all blogs with articles that mention X at least twice within the last 30 days. Is there a simple query to find all blog_ids that have articles with the same word at least n times within a date range?
Is this the right way to model the problem or should I use nested objects for an easier query?
Can this be made into a report in Kibana?
The simplest query that comes to mind is
{
  "_source": "blog_id",
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "article_text": "xyz"
          }
        },
        {
          "range": {
            "timestamp": {
              "gte": "now-30d"
            }
          }
        }
      ]
    }
  }
}
Nested objects are most probably not going to simplify anything -- on the contrary.
Can it be made into a Kibana report?
Sure. Just apply the filters either in KQL (Kibana Query Language) or using the dropdowns, and choose a metric that you want to track (total blog_id count, time-series frequency, etc.).
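As a rough illustration with the field names above, the KQL filter can be as simple as the line below, with the last-30-days window set via Kibana's time picker:

article_text : "xyz"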
EDIT re # of occurrences:
I know of 2 ways:
There's the term vectors API, which gives you word frequency information, but it's a standalone API and cannot be used at query time (an example request follows the script below).
Then there's the scripted approach, whereby you look at the whole article text, treat it as a case-sensitive keyword, and count the number of occurrences of the substring, thereby eliminating the articles with insufficient word frequency. Note that you don't have to use function_score as I did -- a simple script query will do. It may take a non-trivial amount of time to resolve if you have a non-trivial number of docs.
In your case it could look like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "script": {
            "script": {
              "source": """
                def word = 'xyz';
                def docval = doc['article_text.keyword'].value;
                String temp = docval.replace(word, "");
                def no_of_occurences = ((docval.length() - temp.length()) / word.length());
                return no_of_occurences >= 2;
              """
            }
          }
        }
      ]
    }
  }
}
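For reference, the term vectors API mentioned above is called per document, along these lines (the index name blogs and the document id 1 are placeholders):

GET blogs/_termvectors/1?fields=article_text&term_statistics=true

It returns the frequency of each term in that one document's article_text, which is why it cannot be combined with a search query directly.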

Find one result based on a term query or a list of results based on a match query

I have an index of documents, each containing an id and name field. Each document name happens to be unique.
I want to perform a query on the name field that returns one exact result if possible, or falls back to returning a list of similar results. For example, if the search term is Acme Incorporated and there is an exact result, return that only. Otherwise return similar matches, e.g. ACME Inc., acme, Ace, etc.
I assumed that I need to somehow combine a keyword-based term query for an exact match, and a text-based match query for the similar matches. I am still getting to grips with compound queries so my first attempt was pretty naive:
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "name.exact": "Acme Incorporated"
          }
        },
        {
          "match": {
            "name": "Acme Incorporated"
          }
        }
      ]
    }
  }
}
This returns a list of similar matches AND an exact match if present, because at least one query should succeed. This is obviously not correct.
In order to facilitate the keyword-based term query above, I added name.exact to my document mapping:
{
  "mappings": {
    "properties": {
      "id": {
        "type": "integer"
      },
      "name": {
        "type": "text",
        "fields": {
          "exact": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
I suppose another approach is to use the Multi Search API to perform the above queries separately. This allows me to look at the responses and decide to use the match query results if the term query result set is empty. This will work for my use case but I suspect that this is not an optimal approach.
I assume this is a common use-case but I am not sure what the solution is.
Edit
My current thinking is to go with a Multi Search query as described above: the first query is the same keyword-based term query, to attempt to find an exact result, and the second is the following compound bool query, which excludes exact results.
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "name": "Acme Incorporated"
        }
      },
      "must_not": {
        "term": {
          "name.keyword": "Acme Incorporated"
        }
      }
    }
  }
}
In the end, the Multi Search API suited my use case:
The multi search API executes several searches from a single API request. The format of the request is similar to the bulk API format and makes use of the newline delimited JSON (NDJSON) format.
I used this to perform two queries in one request:
Find any exact results with a keyword-based term query on the document name field.
Find any similar results with a bool query, comprising a match query on the document name field, and a must_not of the first query to filter out any exact results.
A Multi Search body is constructed of one or more pairs of an (optionally) empty header and a body (a single query), delimited by newlines; e.g.:
GET /myindex/_msearch
{}
{"query": {"constant_score": {"filter": {"term": {"name.keyword": "Acme Incorporated"}}}}}
{}
{"query": {"bool": {"must": {"match": {"name": "Acme Incorporated"}}, "must_not": {"term": {"name.keyword": "Acme Incorporated"}}}}}
The request body is in NDJSON format, whose specification states that "Each Line is a Valid JSON Value". This requires that each query be compressed to one line, which is not very readable, but that is not an issue if you're using a library to construct queries.

Advanced Kibana / Elasticsearch Dev Tools queries

I'm querying my index in the following way:
GET INDEX/_count?q=KEY:VALUE
I want to get data on a list of values, so I run it multiple times:
GET INDEX/_count?q=KEY:VALUE0
GET INDEX/_count?q=KEY:VALUE1
GET INDEX/_count?q=KEY:VALUE2
Additionally, I want to check if the key exists. These options are available in the Discover window, but here I don't know how to access them...
What I eventually want: query a specific index [I] and count (and, if possible, get advanced stats such as the total size of the docs searched) all docs where a specific key [K] exists, or where it has a value out of a list of values (and, if possible, do that with regex). On top of that, I want the search / count to be limited to a range between specific dates. I know how to do all of this in the Discover window, but Discover has 2 problems:
Gives the actual doc (too heavy, I only want size and count)
Involves GUI
To summarize, I have a few difficulties:
How to add size to the Dev Tools' count
How to count / search up to one month past
How to find if a key exists (e.g. GET I/_count?K:exists ?)
How to find if a value is one of a list of allowed values (e.g. GET I/_count?K=x || K=y || K2=z)
How to describe value in regex (e.g. GET I/_count?K=abc*)
After count / search is done, how to delete said docs (Just replace GET with DELETE?)
This should get you started:
GET INDEX/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "dateField": {
              "gte": "now-1M"
            }
          }
        },
        {
          "bool": {
            "filter": {
              "exists": {
                "field": "K"
              }
            }
          }
        },
        {
          "query_string": {
            "query": "K:(x OR y) OR K2:z"
          }
        },
        {
          "regexp": {
            "K": "abc*"
          }
        }
      ]
    }
  }
}
Alternatively, you can switch must to should, thereby matching either of those subqueries.
After this, replace _search with _delete_by_query (sent with POST instead of GET) and you're good to go.
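For instance, a minimal sketch reusing part of the query above (INDEX, dateField and K are the same placeholders):

POST INDEX/_delete_by_query
{
  "query": {
    "bool": {
      "must": [
        { "range": { "dateField": { "gte": "now-1M" } } },
        { "exists": { "field": "K" } }
      ]
    }
  }
}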

Elastic search wildcard query crashes cluster

I run the query below on a large Elasticsearch cluster, and the cluster becomes unresponsive:
{
  "size": 10000,
  "query": {
    "bool": {
      "must": [
        {
          "regexp": {
            "message": {
              "value": ".*exception.*"
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "term": {
                  "beat.hostname": "ip-xxx-xx-xx-xx"
                }
              }
            ]
          }
        },
        {
          "range": {
            "@timestamp": {
              "lt": 1518459660000,
              "format": "epoch_millis",
              "gte": 1518459600000
            }
          }
        }
      ]
    }
  }
}
When I remove the wildcarded .*exception.* and replace it with any non-wildcarded string like xyz, it returns fast. Though the query uses a wildcarded expression, it also looks for a small time range and a specific host. I would think this is a very simple query. Any reason why Elasticsearch can't handle this query? The cluster has 10 nodes and 20 TB of data.
See the documentation for Regexp Query. It clearly states the following:
Note: The performance of a regexp query heavily depends on the regular
expression chosen. Matching everything like .* is very slow
What would be ideal is to change the text analysis on the message field with a word_delimiter token filter and set split_on_case_change to true. Then something like NullPointerException will get indexed as three separate tokens [Null, Pointer, Exception]. This can help you search for exception without using a regex. The caveat is that you need to reindex all your documents.
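A rough sketch of what that analysis change could look like (the index, filter and analyzer names are placeholders, and word_delimiter_graph is used here as the current variant of that filter):

PUT logs-reindexed
{
  "settings": {
    "analysis": {
      "filter": {
        "case_split": {
          "type": "word_delimiter_graph",
          "split_on_case_change": true
        }
      },
      "analyzer": {
        "message_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["case_split", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "message_analyzer"
      }
    }
  }
}

With that mapping, a plain match query on message for exception would find NullPointerException without any regexp.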
Another quick thing to try might be to keep your filter conditions on the hostname and timestamp in a filter context, which will prefilter documents before running your regexp query. This may be a short-term solution for you until you fix the text analysis.
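For example, the same query with the host and time conditions moved into the filter clause might look roughly like this:

{
  "size": 10000,
  "query": {
    "bool": {
      "must": [
        {
          "regexp": {
            "message": {
              "value": ".*exception.*"
            }
          }
        }
      ],
      "filter": [
        {
          "term": {
            "beat.hostname": "ip-xxx-xx-xx-xx"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": 1518459600000,
              "lt": 1518459660000,
              "format": "epoch_millis"
            }
          }
        }
      ]
    }
  }
}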

How to use multi-field search in Elasticsearch combining should and must clauses

This may be a repeated question but I'm not finding a good solution.
I'm trying to search Elasticsearch in order to get documents that contain:
- "event":"myevent1"
- "event":"myevent2"
- "event":"myevent3"
A document does not have to contain all of them, but the result should only contain documents with those types of events.
And this is simple, because Elasticsearch helps me with the should clause, which returns exactly what I want.
But then I also want every document to satisfy another condition: the field result.example.example = 200 must hold in every single document, in addition to the document matching one of the previously described "event" values.
So, for example, one document has "event":"myevent1" and result.example.example = 200, another one has "event":"myevent2" and result.example.example = 200, etc.
I've tried this configuration:
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "operation.result.http_status": 200
        }
      },
      "should": [
        {
          "match": {
            "event": "bank.account.patch"
          }
        },
        {
          "match": {
            "event": "bank.account.add"
          }
        },
        {
          "match": {
            "event": "bank.user.patch"
          }
        }
      ]
    }
  }
}
but it is not working, because I also get documents that do not match any of the should events.
Hope I explained well,
Thanks in advance!
As is, your query tells ES to look for documents that must have "operation.result.http_status":200 and to boost those that have a matching event type.
You're looking to combine two must queries:
one that matches one of your event types,
one for your other condition.
The event clause accepts multiple values and those values are exact matches: you're looking for a terms query.
Try
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "operation.result.http_status": 200
          }
        },
        {
          "terms": {
            "event": [
              "bank.account.patch",
              "bank.account.add",
              "bank.user.patch"
            ]
          }
        }
      ]
    }
  }
}
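As a side note, not part of the original answer: if you would rather keep the original should clauses, another common fix is to require at least one of them explicitly with minimum_should_match, roughly like this:

{
  "query": {
    "bool": {
      "must": [
        { "match": { "operation.result.http_status": 200 } }
      ],
      "should": [
        { "match": { "event": "bank.account.patch" } },
        { "match": { "event": "bank.account.add" } },
        { "match": { "event": "bank.user.patch" } }
      ],
      "minimum_should_match": 1
    }
  }
}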
