Find same text within time range - elasticsearch

I'm storing blog articles in Elasticsearch in this format:
{
blog_id: keyword,
blog_article_id: keyword,
timestamp: date,
article_text: text
}
Suppose I want to find all blogs with articles that mention X at least twice within the last 30 days. Is there a simple query to find all blog_ids that have articles with the same word at least n times within a date range?
Is this the right way to model the problem, or should I use nested objects for an easier query?
Can this be made into a report in Kibana?

The simplest query that comes to mind is
{
"_source": "blog_id",
"query": {
"bool": {
"must": [
{
"match": {
"article_text": "xyz"
}
},
{
"range": {
"timestamp": {
"gte": "now-30d"
}
}
}
]
}
}
}
Nested objects will most probably not simplify anything; on the contrary.
Can it be made into a Kibana report?
Sure. Just apply the filters either in KQL (Kibana Query Language) or using the dropdowns, and choose a metric that you want to track (total blog_id count, time series frequency, etc.).
EDIT re number of occurrences:
I know of 2 ways:
There's the term vectors API (_termvectors), which gives you word frequency information, but it's a standalone API and cannot be used at query time.
Then there's the scripted approach, whereby you look at the whole article text, treat it as a case-sensitive keyword, and count the number of substring occurrences, thereby eliminating the articles with insufficient word frequency. This relies on article_text having a keyword sub-field, which the mapping above would need to add. Note that you don't have to use function_score as I did; a simple script query will do. It may take a non-trivial amount of time to resolve if you have a non-trivial number of docs.
In your case it could look like this:
{
"query": {
"bool": {
"must": [
{
"script": {
"script": {
"source": """
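// Word to look for (case-sensitive substring match)
def word = 'xyz';
def docval = doc['article_text.keyword'].value;
// Remove every occurrence; the length difference gives the count
String temp = docval.replace(word, "");
def no_of_occurrences = ((docval.length() - temp.length()) / word.length());
return no_of_occurrences >= 2;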
"""
}
}
}
]
}
}
}
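The counting trick in the script can be checked outside Elasticsearch. Below is a minimal Python sketch of the same arithmetic; the function names count_occurrences and mentions_at_least are hypothetical and only mirror what the Painless script does. Like the script, it counts case-sensitive, non-overlapping substrings, so "xyzzy" would also count as a mention of "xyz".

```python
def count_occurrences(text: str, word: str) -> int:
    """Count non-overlapping, case-sensitive occurrences of `word` in `text`.

    Mirrors the Painless script: remove every occurrence of the word and
    divide the length difference by the word length.
    """
    if not word:
        raise ValueError("word must be non-empty")
    stripped = text.replace(word, "")
    return (len(text) - len(stripped)) // len(word)


def mentions_at_least(text: str, word: str, n: int) -> bool:
    """Return True if `word` occurs at least `n` times in `text`."""
    return count_occurrences(text, word) >= n
```

For example, mentions_at_least("xyz and xyz again", "xyz", 2) holds, while a text that contains only "XYZ" in upper case would not match.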

Related

Advanced kibana / elasticsearch devtools queries

I'm querying my index in the following way:
GET INDEX/_count?q=KEY:VALUE
I want to get data on a list of values, so I run it multiple times:
GET INDEX/_count?q=KEY:VALUE0
GET INDEX/_count?q=KEY:VALUE1
GET INDEX/_count?q=KEY:VALUE2
Additionally, I want to check whether the key exists. These options are available in the Discover window, but here I don't know how to access them.
What I eventually want: query a specific index [I] and count (and, if possible, get advanced stats such as the total size of the docs searched) all docs where a specific key [K] exists or has a value from a list of values (and, if possible, do that with a regex). On top of that, I want the search / count to be restricted to specific dates. I know how to do so in the Discover window, but Discover has 2 problems:
Gives the actual doc (too heavy, I only want size and count)
Involves GUI
To summarize, I have a few difficulties:
How to add size to the Dev Tools count
How to count / search up to one month in the past
How to find if a key exists (e.g. GET I/_count?K:exists ?)
How to find if a value is one of a list of allowed values (e.g. GET I/_count?K=x || K=y || K2=z)
How to describe a value with a regex (e.g. GET I/_count?K=abc*)
After count / search is done, how to delete said docs (Just replace GET with DELETE?)
This should get you started:
GET INDEX/_search
{
"query": {
"bool": {
"must": [
{
"range": {
"dateField": {
"gte": "now-1M"
}
}
},
{
"bool": {
"filter": {
"exists": {
"field": "K"
}
}
}
},
{
"query_string": {
"query": "K:(x OR y) OR K2:z"
}
},
{
"regexp": {
"K": "abc.*"
}
}
]
}
}
}
Alternatively, you can switch must to should, thereby matching any one of those subqueries instead of all of them.
After this, replace GET INDEX/_search with POST INDEX/_delete_by_query (it takes the same query body) and you're good to go.
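To make the must/should switch concrete, here is a small Python sketch that assembles the same request body as a dict. The helper name build_query and its parameters are hypothetical, and the values in allowed_values are assumed not to need query_string escaping. The pattern is passed in regexp syntax, where "abc.*" means "abc" followed by anything.

```python
def build_query(date_field, key, allowed_values, pattern, match_all=True):
    """Build a bool query body; `match_all` picks `must` (all) or `should` (any)."""
    clauses = [
        # Only documents from the last month
        {"range": {date_field: {"gte": "now-1M"}}},
        # The key must be present
        {"exists": {"field": key}},
        # The key holds one of the allowed values
        {"query_string": {"query": f"{key}:({' OR '.join(allowed_values)})"}},
        # The key matches a regular expression
        {"regexp": {key: pattern}},
    ]
    occur = "must" if match_all else "should"
    return {"query": {"bool": {occur: clauses}}}
```

The resulting dict can be serialized with json.dumps and pasted into Dev Tools as the body of a _search, _count, or _delete_by_query request.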

Elasticsearch index field with wildcard and search for it

I have a document with a field "serial number". That serial number is ABC.XXX.DEF where XXX indicates wildcards. XXX can be \d{3}[a-zA-Z0-9].
So users can search for:
ABC.123.DEF
ABC.234.DEF
ABC.XYZ.DEF
while the document only includes
ABC.XXX.DEF
When a user queries ABC.123.DEF, I need a hit on the document containing ABC.XXX.DEF. Other documents might contain ABC.DEF.XXX and must not be hit, and I am running out of ideas with my basic Elasticsearch knowledge.
Do I have to attack the problem from the query side or when analyzing/tokenizing the pattern?
Can anyone give me an example how to approach that problem?
As long as the serial number format is well defined, the first solution that comes to my mind is to split the serial number into three parts ("part1", "part2" and "part3", for example) and index them as three separate fields. Parts consisting of wildcards should either get a special value or not be indexed at all. Then, at query time, I would split the serial number provided by the user in the same way. Assuming that parts consisting of wildcards are not indexed, my query would look like this:
"query": {
"bool": {
"must":[
{
"bool": {
"should": [
{
"match": {
"part1": "ABC"
}
},
{
"bool": {
"must_not": {
"exists": {
"field": "part1"
}
}
}
}
]
}
},
... // Similar code for other parts
]
}
}
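A rough Python sketch of both halves of this idea follows: splitting the stored pattern at index time and building the per-part query at search time. The function names, the "XXX" wildcard marker, and the "." separator are assumptions taken from the example in the question, not a fixed API.

```python
def index_parts(pattern, wildcard="XXX"):
    """Split a stored pattern into part fields, leaving wildcard parts unindexed."""
    return {
        f"part{i}": part
        for i, part in enumerate(pattern.split("."), start=1)
        if part != wildcard
    }


def part_clause(field, value):
    """Match the value, or accept documents where the field is absent (wildcard)."""
    return {
        "bool": {
            "should": [
                {"match": {field: value}},
                {"bool": {"must_not": {"exists": {"field": field}}}},
            ]
        }
    }


def serial_query(serial):
    """Build the bool query for a user-supplied serial such as ABC.123.DEF."""
    parts = serial.split(".")
    if len(parts) != 3:
        raise ValueError("expected a serial number with three dot-separated parts")
    clauses = [part_clause(f"part{i}", p) for i, p in enumerate(parts, start=1)]
    return {"query": {"bool": {"must": clauses}}}
```

With this layout, the document for ABC.XXX.DEF is indexed with only part1 and part3, so a search for ABC.123.DEF matches it, while a document for ABC.DEF.XXX fails on part2.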

ElasticSearch how to get docs with 10 or more fields in them?

I want to get all docs that have 10 or more fields in them. I'm guessing something like this:
{
"query": {
"range": {
"fields": {
"gt": 1000
}
}
}
}
What you can do is to run a script query like this
{
"query": {
"script": {
"script": {
"source": "params._source.size() >= 10"
}
}
}
}
However, be advised that depending on the number of documents you have and the hardware that supports your cluster, this can negatively impact the performance of your cluster.
A better idea would be to add another integer field that contains the number of fields that the document contains, so you can simply run a range query on it, like in your question.
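As a sketch of that second suggestion, the count can be computed on the client before indexing. The field name field_count and both helper names are hypothetical, and this version counts top-level fields only.

```python
def with_field_count(doc, count_field="field_count"):
    """Return a copy of `doc` with the number of its top-level fields added."""
    enriched = dict(doc)
    # Count the original fields, not the counter field itself
    enriched[count_field] = len(doc)
    return enriched


def min_fields_query(n, count_field="field_count"):
    """Range query for documents that have at least `n` fields."""
    return {"query": {"range": {count_field: {"gte": n}}}}
```

Each document would be passed through with_field_count before being sent to the index, and min_fields_query(10) then gives the body for the search.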
According to the documentation of the _source field, you can either do it like that or you cannot get results based on the field count at all.
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html

How to use multifield search in elasticsearch combining should and must clause

This may be a repeated question, but I haven't found a good solution.
I'm trying to search Elasticsearch in order to get documents that contain:
- "event":"myevent1"
- "event":"myevent2"
- "event":"myevent3"
A document does not have to contain all of them, but the result should contain only documents with one of those event types.
This is simple, because Elasticsearch helps me with the should clause, which returns exactly what I want.
But then I want every document to satisfy another condition as well: the field result.example.example must equal 200 in every single document, and the document must also have one of the events described above.
So, for example, one document has "event":"myevent1" and result.example.example = 200, another has "event":"myevent2" and result.example.example = 200, and so on.
I've tried this configuration:
{
"query": {
"bool": {
"must":{"match":{"operation.result.http_status":200}},
"should": [
{
"match": {
"event": "bank.account.patch"
}
},
{
"match": {
"event": "bank.account.add"
}
},
{
"match": {
"event": "bank.user.patch"
}
}
]
}
}
}
but it's not working, because I also get documents that do not match any of the should clauses.
Hope I explained well,
Thanks in advance!
As is, your query tells ES to look for documents that must have "operation.result.http_status":200 and to boost those that have a matching event type.
You're looking to combine two must queries:
one that matches one of your event types,
one for your other condition.
The event clause accepts multiple values and those values are exact matches: you're looking for a terms query.
Try:
{
"query": {
"bool": {
"must": [
{"match":{"operation.result.http_status":200}},
{
"terms" : {
"event" : [
"bank.account.patch",
"bank.account.add",
"bank.user.patch"
]
}
}
]
}
}
}

Elasticsearch date range intersection

I'm storing something like the following information in Elasticsearch:
{ "timeslot_start_at" : "2013-02-01", "timeslot_end_at" : "2013-02-03" }
Given another date range (from user input, for example), I want to search for intersecting time ranges. This is similar to "Determine Whether Two Date Ranges Overlap", which outlines the logic I'm after:
(StartDate1 <= EndDate2) and (StartDate2 <= EndDate1)
But I'm unsure how to fit this into an Elasticsearch query. Would I use a range filter and only set the 'to' values, leaving the 'from' blank? Or is there a more efficient way of doing this?
Update: It is now possible to use the date_range data type, which was added in Elasticsearch v5.2. For earlier versions of Elasticsearch the following solution still applies.
To test for intersection, you should combine two range queries into a single query using a bool query:
{
"bool": {
"must": [
{
"range": {
"timeslot_start_at": {
"lte": "2013-02-28"
}
}
},
{
"range": {
"timeslot_end_at": {
"gte": "2013-02-03"
}
}
}
]
}
}
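The two range clauses are the overlap condition from the question with the user's range substituted in (here 2013-02-03 to 2013-02-28). A small Python sketch of that mapping is given below; the helper names are hypothetical and the field names are the ones from the question.

```python
from datetime import date


def ranges_overlap(start1, end1, start2, end2):
    """(StartDate1 <= EndDate2) and (StartDate2 <= EndDate1)."""
    return start1 <= end2 and start2 <= end1


def overlap_query(search_start, search_end):
    """Build the bool query matching timeslots that intersect the search range."""
    return {
        "bool": {
            "must": [
                # Timeslot starts on or before the end of the search range
                {"range": {"timeslot_start_at": {"lte": search_end.isoformat()}}},
                # Timeslot ends on or after the start of the search range
                {"range": {"timeslot_end_at": {"gte": search_start.isoformat()}}},
            ]
        }
    }
```

For the stored timeslot 2013-02-01 to 2013-02-03, a search range starting on 2013-02-03 still overlaps (the end points touch), whereas one starting on 2013-02-04 does not.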
