Buckets of documents grouped by term frequency - elasticsearch

I want to segment Elasticsearch results into buckets, such that similar documents (those with the most matching terms on an analyzed field) are grouped together in the results. I'm not sure how to go about producing aggregation buckets of individual documents this way.
Here's the basic mapping:
PUT movies
{
  "mappings": {
    "movie": {
      "properties": {
        "id": { "type": "long" },
        "title": { "type": "text" }
      }
    }
  }
}
Now, for example, if a query is made for hunger, then the results should be grouped into buckets of matching documents with the most terms in common:
{
  "buckets": {
    "1": [
      { "title": "The Hunger Games" },
      { "title": "The Hunger Games: Mockingjay" },
      { "title": "The Hunger Games: Catching Fire" }
    ],
    "2": [
      { "title": "Aqua Teen Hunger Force" },
      { "title": "Force of Hunger" }
    ],
    "3": [
      { "title": "Hunger Pain" }
    ],
    ...
  }
}
In the above example, similar documents are grouped into separate buckets based on at least two matching terms. Matching titles without similar terms are still included in the results as separate buckets (e.g. bucket #3).
Any suggestions are appreciated.
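There's no single aggregation that produces exactly this shape, but the grouping itself can be approximated client-side. Below is a minimal sketch in plain Python (a hypothetical greedy clustering over the returned titles, not an Elasticsearch feature): each title joins the first bucket with which it shares at least two terms.

```python
def bucket_titles(titles, min_shared=2):
    """Greedy grouping: a title joins the first existing bucket whose
    common terms overlap it by at least `min_shared` terms; otherwise
    it starts a new bucket. Order-dependent; illustration only."""
    buckets = []
    for title in titles:
        terms = set(title.lower().replace(":", "").split())
        for bucket in buckets:
            if len(bucket["terms"] & terms) >= min_shared:
                bucket["titles"].append(title)
                bucket["terms"] &= terms  # keep only the shared terms
                break
        else:
            buckets.append({"terms": terms, "titles": [title]})
    return [b["titles"] for b in buckets]

hits = [
    "The Hunger Games",
    "The Hunger Games: Mockingjay",
    "The Hunger Games: Catching Fire",
    "Aqua Teen Hunger Force",
    "Force of Hunger",
    "Hunger Pain",
]
print(bucket_titles(hits))
# → [['The Hunger Games', 'The Hunger Games: Mockingjay',
#     'The Hunger Games: Catching Fire'],
#    ['Aqua Teen Hunger Force', 'Force of Hunger'],
#    ['Hunger Pain']]
```

Because the matching is greedy and order-dependent, this is only a starting point; a server-side equivalent would need scripting or a dedicated clustering approach.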

Related

How to correctly query inside of terms aggregate values in elasticsearch, using include and regex?

How do you filter out/search in aggregate results efficiently?
Imagine you have 1 million documents in elastic search. In those documents, you have a multi_field (keyword, text) tags:
{
  ...
  "tags": ["Race", "Racing", "Mountain Bike", "Horizontal"],
  ...
},
{
  ...
  "tags": ["Tracey Chapman", "Silverfish", "Blue"],
  ...
},
{
  ...
  "tags": ["Surfing", "Race", "Disgrace"],
  ...
},
You can use these values as filters, (facets), against a query to pull only the documents that contain this tag:
...
"filter": [
  {
    "terms": {
      "tags": ["Race"]
    }
  },
  ...
]
But you want the user to be able to query for possible tag filters. So if the user types race, the results should show (from the previous example) ['Race', 'Tracey Chapman', 'Disgrace'], so that the user can pick a filter to apply. To accomplish this, I had to use aggregations:
{
  "aggs": {
    "topics": {
      "terms": {
        "field": "tags",
        "include": ".*[Rr][Aa][Cc][Ee].*", // I have to dynamically form this
        "size": 6
      }
    }
  },
  "size": 0
}
This gives me exactly what I need! But it is slow, very slow. I've tried adding the execution_hint, it does not help me.
You may think, "Just use a query before the aggregate!" But the issue is that it'll pull all values for all documents in that query. Meaning, you can be displaying tags that are completely unrelated. If I queried for race before the aggregate, and did not use the include regex, I would end up with all those other values, like 'Horizontal', etc...
How can I rewrite this aggregation to work faster? Is there a better way to write it? Do I really have to make a separate index just for the values? (sad face) This seems like it would be a common issue, but I have found no answers in the documentation or through googling.
You certainly don't need a separate index just for the values...
Here's my take on it:
What you're doing with the regex is essentially what should've been done by a tokenizer -- i.e. constructing substrings (or N-grams) such that they can be targeted later.
This means that the keyword Race will need to be tokenized into the n-grams ["rac", "race", "ace"]. (It doesn't really make sense to go any lower than 3 characters -- most autocomplete libraries choose to ignore fewer than 3 characters because the possible matches balloon too quickly.)
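For intuition, the n-grams of a single term can be sketched in plain Python (mirroring min_gram/max_gram plus the lowercase filter; the real tokenizer is configured below):

```python
def ngrams(term, min_gram=3, max_gram=10):
    """Every substring of length min_gram..max_gram, lowercased."""
    term = term.lower()
    return sorted({
        term[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(term) - n + 1)
    })

print(ngrams("Race"))  # → ['ace', 'rac', 'race']
```

Note how the 4-gram "race" is also a substring of both "Tracey" and "Disgrace", which is exactly why those tags match the user's input later on.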
Elasticsearch offers the N-gram tokenizer but we'll need to increase the default index-level setting called max_ngram_diff from 1 to (arbitrarily) 10 because we want to catch as many ngrams as is reasonable:
PUT tagindex
{
  "settings": {
    "index": {
      "max_ngram_diff": 10
    },
    "analysis": {
      "analyzer": {
        "my_ngrams_analyzer": {
          "tokenizer": "my_ngrams",
          "filter": [ "lowercase" ]
        }
      },
      "tokenizer": {
        "my_ngrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  },
  "mappings": { ... } // see below
}
When your tags field is a list of keywords, it's simply not possible to aggregate on that field without resorting to the include option, which takes either exact matches or a regex (which you're already using). Now, we cannot guarantee exact matches, but we also don't want the regex! That's why we need a nested list, which will treat each tag separately.
Now, nested lists are expected to contain objects so
{
"tags": ["Race", "Racing", "Mountain Bike", "Horizontal"]
}
will need to be converted to
{
"tags": [
{ "tag": "Race" },
{ "tag": "Racing" },
{ "tag": "Mountain Bike" },
{ "tag": "Horizontal" }
]
}
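If you're reindexing from the flat shape, the conversion is a one-liner per document (a sketch; field names as in the example above):

```python
def to_nested(doc):
    """Rewrite the flat keyword list into the nested-object shape."""
    return {**doc, "tags": [{"tag": t} for t in doc["tags"]]}

print(to_nested({"tags": ["Race", "Racing", "Mountain Bike", "Horizontal"]}))
# → {'tags': [{'tag': 'Race'}, {'tag': 'Racing'},
#    {'tag': 'Mountain Bike'}, {'tag': 'Horizontal'}]}
```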
After that we'll proceed with the multi field mapping, keeping the original tags intact but also adding a .tokenized field to search on and a .keyword field to aggregate on:
{
  "settings": {
    "index": { ... },
    "analysis": { ... }
  },
  "mappings": {
    "properties": {
      "tags": {
        "type": "nested",
        "properties": {
          "tag": {
            "type": "text",
            "fields": {
              "tokenized": {
                "type": "text",
                "analyzer": "my_ngrams_analyzer"
              },
              "keyword": {
                "type": "keyword"
              }
            }
          }
        }
      }
    }
  }
}
We'll then add our adjusted tags docs:
POST tagindex/_doc
{"tags":[{"tag":"Race"},{"tag":"Racing"},{"tag":"Mountain Bike"},{"tag":"Horizontal"}]}
POST tagindex/_doc
{"tags":[{"tag":"Tracey Chapman"},{"tag":"Silverfish"},{"tag":"Blue"}]}
POST tagindex/_doc
{"tags":[{"tag":"Surfing"},{"tag":"Race"},{"tag":"Disgrace"}]}
and apply a nested filter terms aggregation:
GET tagindex/_search
{
  "aggs": {
    "topics_parent": {
      "nested": {
        "path": "tags"
      },
      "aggs": {
        "topics": {
          "filter": {
            "term": {
              "tags.tag.tokenized": "race"
            }
          },
          "aggs": {
            "topics": {
              "terms": {
                "field": "tags.tag.keyword",
                "size": 100
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}
yielding
{
  ...
  "topics_parent" : {
    ...
    "topics" : {
      ...
      "topics" : {
        ...
        "buckets" : [
          { "key" : "Race", "doc_count" : 2 },
          { "key" : "Disgrace", "doc_count" : 1 },
          { "key" : "Tracey Chapman", "doc_count" : 1 }
        ]
      }
    }
  }
}
Caveats
in order for this to work, you'll have to reindex
ngrams will increase the storage footprint -- depending on how many tags-per-doc you have, it may become a concern
nested fields are internally treated as "separate documents" so this affects the disk space too
P.S.: This is an interesting use case. Let me know how the implementation went!

Elastic Search Query for Multi-valued Data

ES data is indexed like this:
{
  "addresses": [
    { "id": 69, "location": "New Delhi" },
    { "id": 69, "location": "Mumbai" }
  ],
  "goods": [
    { "id": 396, "name": "abc", "price": 12500 },
    { "id": 167, "name": "XYz", "price": 12000 },
    { "id": 168, "name": "XYz1", "price": 11000 },
    { "id": 169, "name": "XYz2", "price": 13000 }
  ]
}
In my query I want to fetch records that match at least one of the addresses and that have a good with a price between 11000 and 13000 and the name XYz.
When your data contains arrays of complex objects, like a list of addresses or a list of goods, you probably want to have a look at Elasticsearch's nested objects to avoid running into problems when your queries return more items than you would expect.
The issue here is the way Elasticsearch (and, in effect, Lucene) stores the data. Since there is no direct concept of lists of nested objects, the data is flattened, and the connection between e.g. XYz and 12000 is lost. So you would also get this document as a result when you query for XYz and 12500, as the price 12500 is also in the list of values for goods.price. To avoid this, you can use Elasticsearch's nested objects feature, which essentially extracts all inner objects into a hidden index and allows querying for several fields occurring in one specific object, instead of "in any of the objects". For more details, have a look at the docs on nested objects, which explain this pretty well.
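To make the flattening concrete, here's a toy illustration in plain Python (this only mimics what Lucene effectively sees; it is not how Elasticsearch is implemented):

```python
# Flattened storage pools field values across all objects in the array,
# so the name/price pairing within each good is lost.
flattened = {
    "goods.name": ["abc", "XYz", "XYz1", "XYz2"],
    "goods.price": [12500, 12000, 11000, 13000],
}

# name == "XYz" AND price == 12500 "matches", although no single good
# has that combination ("abc" costs 12500, "XYz" costs 12000):
false_positive = ("XYz" in flattened["goods.name"]
                  and 12500 in flattened["goods.price"])

# Nested objects instead check each good on its own:
goods = [{"name": "abc", "price": 12500}, {"name": "XYz", "price": 12000}]
nested_match = any(g["name"] == "XYz" and g["price"] == 12500 for g in goods)

print(false_positive, nested_match)  # → True False
```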
In your case a mapping could look like the following. I assume you only want to query for the addresses.location text without providing the id, so that this list can remain the simple object type instead of also being a nested type. I also assume you query for exact matches. If this is not the case, you need to switch from keyword to text and adapt the term query to a match query.
PUT nesting-sample
{
  "mappings": {
    "item": {
      "properties": {
        "addresses": {
          "properties": {
            "id": { "type": "integer" },
            "location": { "type": "keyword" }
          }
        },
        "goods": {
          "type": "nested",
          "properties": {
            "id": { "type": "integer" },
            "name": { "type": "keyword" },
            "price": { "type": "integer" }
          }
        }
      }
    }
  }
}
You can then use a bool query on the location and a nested query to match the inner documents of your goods list.
GET nesting-sample/item/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "addresses.location": "New Delhi"
          }
        },
        {
          "nested": {
            "path": "goods",
            "query": {
              "bool": {
                "must": [
                  {
                    "range": {
                      "goods.price": {
                        "gte": 12200,
                        "lt": 12999
                      }
                    }
                  },
                  {
                    "term": {
                      "goods.name": {
                        "value": "XYz"
                      }
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
This query will not match the document, because the price range is not satisfied by the same nested object that contains the exact name of the good. If you change the lower bound to 12000, it will match.
Please check your use case and be aware of the warning on the bottom of the docs regarding the mapping explosion when using nested fields.

How can I find documents whose child documents contain fields totalling a value?

A book series has many books. A book has a number of pages.
{
  "mappings": {
    "series": {
      "properties": {
        "name": { "type": "string" },
        "books": {
          "type": "nested",
          "properties": {
            "title": { "type": "string" },
            "page_count": { "type": "integer" }
          }
        }
      }
    }
  }
}
I index some book series.
{
  "name": "Harry Potter",
  "books": [
    { "title": "Harry Potter and the Philosopher's Stone", "page_count": 100 },
    { "title": "Harry Potter and the Chamber of Secrets", "page_count": 100 }
  ]
}
{
  "name": "The Long Earth",
  "books": [
    { "title": "The Long Earth", "page_count": 150 },
    { "title": "The Long Mars", "page_count": 150 }
  ]
}
I want to find all book series where the total number of pages of the books in that series is greater than or equal to 250. This query should return the "The Long Earth" series, which has 300 total pages, but not the "Harry Potter" series, which only has 200 total pages.
How can I structure such a query?
As far as I can tell, nested queries only allow you to look into individual nested docs (e.g. find a series that has at least one book with more than 120 pages). I've never worked with scoring before (only used binary queries), and I'm starting to think maybe that holds the answer...
Take a look at nested aggregations. You'll want to apply nested sum aggregation and filter on that.
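Spelled out, that could look like the following sketch (untested against the question's index): one bucket per series via a terms aggregation, a nested sum of page_count inside it, and a bucket_selector pipeline that prunes buckets below 250 pages. The `name.keyword` field is an assumption; the question's mapping shows no such subfield, so adjust it to whatever keyword field identifies a series. The second half applies the same selection logic client-side to the example data.

```python
# Hypothetical aggregation body (bucket per series -> nested sum ->
# bucket_selector). "name.keyword" is an assumption, not in the
# question's mapping.
body = {
    "size": 0,
    "aggs": {
        "per_series": {
            "terms": {"field": "name.keyword"},
            "aggs": {
                "books": {
                    "nested": {"path": "books"},
                    "aggs": {
                        "total_pages": {"sum": {"field": "books.page_count"}}
                    },
                },
                "enough_pages": {
                    "bucket_selector": {
                        "buckets_path": {"total": "books>total_pages"},
                        "script": "params.total >= 250",
                    }
                },
            },
        }
    },
}

# The same selection logic client-side, on the two example series:
series = [
    {"name": "Harry Potter", "books": [{"page_count": 100}, {"page_count": 100}]},
    {"name": "The Long Earth", "books": [{"page_count": 150}, {"page_count": 150}]},
]
kept = [s["name"] for s in series
        if sum(b["page_count"] for b in s["books"]) >= 250]
print(kept)  # → ['The Long Earth']
```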

Elastic Search - Sort By Doc Type

I have an Elasticsearch index with 2 different doc types: 'a' and 'b'. I would like to sort my results by type, giving preference to type='b' (even if it has a low score). I had been consuming the results of the search below at the client end and sorting them there, but I've realized that this approach does not work well, since I only inspect the first 10 results, which often contain no b's. Increasing the number of returned results is not ideal; I'd like Elasticsearch to do the work.
http://<server>:9200/my_index/_search?q=london
You would need to play with function_score and, depending on how you already score your documents, test some weight values, boost_modes and score_modes for each type. For example:
GET /some_index/a,b/_search
{
  "query": {
    "function_score": {
      "query": {
        # your query here
      },
      "functions": [
        {
          "filter": { "type": { "value": "b" } },
          "weight": 3
        },
        {
          "filter": { "type": { "value": "a" } },
          "weight": 1
        }
      ],
      "score_mode": "first",
      "boost_mode": "multiply"
    }
  }
}
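For intuition: with "score_mode": "first", only the first matching function applies, and with "boost_mode": "multiply", its weight is multiplied into the query score. A toy re-implementation of just that arithmetic (not Elasticsearch's actual scoring code):

```python
def rescore(query_score, doc_type, functions=(("b", 3), ("a", 1))):
    """score_mode 'first': use the first function whose filter matches;
    boost_mode 'multiply': multiply its weight into the query score."""
    for matching_type, weight in functions:
        if doc_type == matching_type:
            return query_score * weight
    return query_score  # no function matched

# A lower-scoring type-b doc now outranks a higher-scoring type-a doc:
print(rescore(0.8, "a"), rescore(0.5, "b"))  # → 0.8 1.5
```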
It's working for me. You can execute the command below at the command prompt:
curl -XGET 'localhost:9200/index_v1,index_v2/_search?pretty' -d @boost.json
boost.json
{
"indices_boost" : {
"index_v2" : 1.4,
"index_v1" : 1.3
}
}

Elasticsearch shuffle index sorting

Thanks in advance. I'll describe the situation first and give the solution at the end.
I have a collection of 2M documents with the following mapping:
{
  "image": {
    "properties": {
      "timestamp": {
        "type": "date",
        "format": "dateOptionalTime"
      },
      "title": { "type": "string" },
      "url": { "type": "string" }
    }
  }
}
I have a webpage which paginates through all the documents with the following search:
{
  "from": STARTING_POSITION_NUMBER,
  "size": 15,
  "sort": [
    { "_id": { "order": "desc" } }
  ],
  "query": {
    "match_all": {}
  }
}
And a hit looks like this (note that the _id value is a hash of the url, to prevent duplicated documents):
{
  "_index": "images",
  "_type": "image",
  "_id": "2a750a4817bd1600",
  "_score": null,
  "_source": {
    "url": "http://test.test/test.jpg",
    "timestamp": "2014-02-13T17:01:40.442307",
    "title": "Test image!"
  },
  "sort": [ null ]
}
This works pretty well. The only problem I have is that the documents appear sorted chronologically (the oldest documents appear on the first page, and the ones indexed more recently on the last page), but I want them to appear in a random order. For example, page 10 should always show the same N documents, just not sorted by date.
I thought of something like sorting all the documents by their hash, which is kind of random and deterministic. How could I do it?
I've searched the docs, and the sorting API only sorts the results, not the full index. If I don't find a solution, I will pick documents randomly and index them in a separate collection.
Thank you.
I solved it using the following search:
{
  "from": STARTING_POSITION_NUMBER,
  "size": 15,
  "query": {
    "function_score": {
      "random_score": {
        "seed": 1
      }
    }
  }
}
Thanks to David from the Elasticsearch mailing list for pointing out the function score with random scoring.
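The reason this fixes pagination is that a constant seed makes the pseudo-random scores deterministic: every request computes the same score per document, so page 10 always shows the same 15 documents. The asker's hash idea captures the same property; here is a sketch (illustrative only -- Elasticsearch's random_score uses its own internal hashing, not MD5):

```python
import hashlib

def random_sort_key(doc_id, seed=1):
    """Stable pseudo-random key derived from the seed and the _id."""
    digest = hashlib.md5(f"{seed}:{doc_id}".encode()).hexdigest()
    return int(digest, 16)

ids = ["2a750a4817bd1600", "3b861c", "4c972d"]
order_1 = sorted(ids, key=random_sort_key)
order_2 = sorted(ids, key=random_sort_key)
assert order_1 == order_2  # identical on every request
```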
