Sort results by term frequency count

Suppose there are two documents that contain the word "world" 5 times and 2 times respectively.
I want the document that contains "world" 5 times to be listed first, followed by the document that contains it 2 times.
How do I sort this?
Thanks.

I don't think you need to sort explicitly. If you have documents like the ones you describe and you search for a word that appears once, twice, or three times in them, Elasticsearch calculates a relevance score for each document automatically and returns the documents sorted by that score. Term frequency is one of the factors in that score, so documents that contain the word more often tend to rank higher.
To try this, ingest some documents:
curl -XPUT "http://localhost:9200/movies/movie/1" -d'
{
"title": "The Godfather",
"director": "Francis Ford Coppola",
"year": 1972,
"genres": [
"Crime",
"Drama"
]
}'
curl -XPUT "http://localhost:9200/movies/movie/2" -d'
{
"title": "The Godfather Godfather",
"director": "Francis Ford Coppola",
"year": 1972,
"genres": [
"Crime",
"Drama"
]
}'
curl -XPUT "http://localhost:9200/movies/movie/3" -d'
{
"title": "The Godfather Godfather Godfather",
"director": "Francis Ford Coppola",
"year": 1972,
"genres": [
"Crime",
"Drama"
]
}'
After ingestion run this query and see the result:
curl -XPOST "http://localhost:9200/movies/_search" -d'
{
"explain": true,
"query": {
"filtered": {
"query": {
"query_string": {
"query": "godfather"
}
}
}
}
}'
This returns document 3 first, because its title contains "godfather" the most times.
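To see why the repeated term pushes document 3 to the top, here is a minimal Python sketch of the term-frequency and field-length parts of Lucene's classic TF-IDF similarity. It is a simplified model, not what Elasticsearch actually runs: it assumes only the title field is scored, leaves out idf and query normalization (they are the same for all three documents here), and ignores the lossy encoding of length norms. The function name classic_score is made up for this example, and newer Elasticsearch versions use BM25 by default instead.

```python
import math

# Simplified model of classic TF-IDF scoring for a single-term query.
# Assumption: only "title" is scored; idf and query norm are omitted because
# they are identical for every document in this tiny example.
def classic_score(title: str, term: str) -> float:
    tokens = title.lower().split()               # crude stand-in for the analyzer
    freq = tokens.count(term)
    if freq == 0:
        return 0.0
    tf = math.sqrt(freq)                         # more occurrences, higher score
    length_norm = 1.0 / math.sqrt(len(tokens))   # longer fields are penalized
    return tf * length_norm

titles = {
    1: "The Godfather",
    2: "The Godfather Godfather",
    3: "The Godfather Godfather Godfather",
}

# Rank document ids by descending score, as a search would.
ranked = sorted(titles, key=lambda doc_id: classic_score(titles[doc_id], "godfather"), reverse=True)
for doc_id in ranked:
    print(doc_id, round(classic_score(titles[doc_id], "godfather"), 3))
```

Note that the length penalty works against the longer titles, so the ranking is driven by term frequency only as long as the extra occurrences outweigh the extra length.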

Search for documents by minimum value of field

I'm trying to filter products by their price, and I'm completely stumped as to how to proceed.
Hoping someone can shed some light on this, and maybe point me in the right direction.
Concept
Each product has multiple prices.
These prices are valid during a certain date-range.
The actual price of the product at a certain date is the lowest price that is valid on that date.
Goal
I want to be able to:
get the lowest and highest price for a certain date
filter the products by a max/min price on a certain date
Caveat: I have simplified the restrictions on the prices for this example, but I'm not able to consolidate the dates so that there is only one valid price per date range.
Example
Mapping:
curl -XPUT 'http://localhost:9200/price-filter-test'
curl -XPUT 'http://localhost:9200/price-filter-test/_mapping/_doc' -H 'Content-Type: application/json' -d '{
"properties": {
"id": {"type": "integer"},
"name": {"type": "text"},
"prices": {
"type": "nested",
"properties": {
"price": {"type": "integer"},
"from": {"type": "date"},
"untill": {"type": "date"}
}
}
}
}'
Test entries:
curl -XPUT 'http://localhost:9200/price-filter-test/_doc/1' -H 'Content-Type: application/json' -d '{
"id": 1,
"name": "Product A",
"prices": [
{
"price": 10,
"from": "2020-02-01",
"untill": "2020-03-01"
},
{
"price": 8,
"from": "2020-02-20",
"untill": "2020-02-21"
},
{
"price": 12,
"from": "2020-02-22",
"untill": "2020-02-23"
}
]
}'
curl -XPUT 'http://localhost:9200/price-filter-test/_doc/2' -H 'Content-Type: application/json' -d '{
"id": 2,
"name": "Product B",
"prices": [
{
"price": 20,
"from": "2020-02-01",
"untill": "2020-03-01"
},
{
"price": 18,
"from": "2020-02-20",
"untill": "2020-02-21"
},
{
"price": 22,
"from": "2020-02-22",
"untill": "2020-02-23"
}
]
}'
On 2020-02-20 the following prices are valid; the correct (lowest) price for each product is marked:
Product A:
10
8 (correct)
Product B:
20
18 (correct)
Solution
Min/Max
I have figured out how to get the min and max values of the applicable prices.
This was pretty doable using aggregations:
curl -XGET 'http://localhost:9200/price-filter-test/_search?pretty=true' -H 'Content-Type: application/json' -d '{
"query": {"match_all": {}},
"size": 0,
"aggs": {
"product_ids": {
"terms": {"field": "id"},
"aggs": {
"nested_prices": {
"nested": {"path": "prices"},
"aggs": {
"applicable_prices": {
"filter": {
"bool": {
"must": [
{"range": {"prices.from": {"lte": "2020-02-20"}}},
{"range": {"prices.untill": {"gte": "2020-02-20"}}}
]
}
},
"aggs": {
"min_price": {
"min": {"field": "prices.price"}
}
}
}
}
}
}
},
"stats_min_prices": {
"stats_bucket": {
"buckets_path": "product_ids>nested_prices>applicable_prices>min_price"
}
}
}
}'
Here I first aggregate over the different ids, to ensure prices are checked per product, then I filter by applicable dates, and then get the min prices for each.
Using the stats_bucket aggregation, I'm then able to get the min and max values of these minimum prices.
{
// ...
"aggregations" : {
// ...
"stats_min_prices" : {
"count" : 2,
"min" : 8.0,
"max" : 18.0,
"avg" : 13.0,
"sum" : 26.0
}
}
}
Here we see the correct min (8 for Product A) and max (18 for Product B)
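For reference, the logic of this aggregation can be sketched in plain Python. This is only a local emulation of the steps described above (filter by applicable date, take the minimum per product, then compute stats over the minimums), using the two test entries; the helper name min_applicable_price is made up for this example.

```python
from datetime import date

def d(s):
    # Parse an ISO date string such as "2020-02-20".
    return date.fromisoformat(s)

products = [
    {"id": 1, "name": "Product A", "prices": [
        {"price": 10, "from": d("2020-02-01"), "untill": d("2020-03-01")},
        {"price": 8, "from": d("2020-02-20"), "untill": d("2020-02-21")},
        {"price": 12, "from": d("2020-02-22"), "untill": d("2020-02-23")},
    ]},
    {"id": 2, "name": "Product B", "prices": [
        {"price": 20, "from": d("2020-02-01"), "untill": d("2020-03-01")},
        {"price": 18, "from": d("2020-02-20"), "untill": d("2020-02-21")},
        {"price": 22, "from": d("2020-02-22"), "untill": d("2020-02-23")},
    ]},
]

def min_applicable_price(product, on):
    # Same role as the "applicable_prices" filter followed by the "min_price" agg.
    applicable = [p["price"] for p in product["prices"] if p["from"] <= on <= p["untill"]]
    return min(applicable) if applicable else None

on = d("2020-02-20")
mins = [m for m in (min_applicable_price(p, on) for p in products) if m is not None]

# Same role as the "stats_bucket" pipeline aggregation over the per-product minimums.
stats = {"count": len(mins), "min": min(mins), "max": max(mins),
         "avg": sum(mins) / len(mins), "sum": sum(mins)}
print(stats)
```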
Filtering
For filtering, I need to be able to exclude products based on their lowest price.
e.g. If I search for products that cost at least 19, I shouldn't find any as Product B's lowest price is 18
curl -X GET "localhost:9200/price-filter-test/_search?pretty" -H 'Content-Type: application/json' -d '{
"query": {
"nested": {
"path": "prices",
"query": {
"bool": {
"must": [
{
"range" : {
"prices.price" : {"gte" : 19}
}
},
{"range": {"prices.from": {"lte": "2020-02-20"}}},
{"range": {"prices.untill": {"gte": "2020-02-20"}}}
]
}
}
}
}
}'
This attempt, however, still yields "Product B" as a match, as one of the prices in this date range is higher than 19. However, as it is not the lowest price in this date range, it is not the "correct" price.
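The difference between what the nested query does and what I actually want can be sketched in plain Python. This is only an illustration of the two semantics on the test data, not Elasticsearch code; the function names are made up for this example.

```python
# (price, from, untill) tuples copied from the two test entries.
PRICES = {
    "Product A": [(10, "2020-02-01", "2020-03-01"),
                  (8, "2020-02-20", "2020-02-21"),
                  (12, "2020-02-22", "2020-02-23")],
    "Product B": [(20, "2020-02-01", "2020-03-01"),
                  (18, "2020-02-20", "2020-02-21"),
                  (22, "2020-02-22", "2020-02-23")],
}

def applicable(prices, day):
    # ISO-formatted dates compare correctly as plain strings.
    return [price for price, start, end in prices if start <= day <= end]

def any_price_matches(prices, day, minimum):
    # Nested query semantics: one nested entry satisfying all clauses is enough.
    return any(price >= minimum for price in applicable(prices, day))

def lowest_price_matches(prices, day, minimum):
    # Desired semantics: only the lowest applicable price counts.
    valid = applicable(prices, day)
    return bool(valid) and min(valid) >= minimum

day = "2020-02-20"
nested_hits = [name for name, prices in PRICES.items() if any_price_matches(prices, day, 19)]
wanted_hits = [name for name, prices in PRICES.items() if lowest_price_matches(prices, day, 19)]
print(nested_hits)  # what the query above returns
print(wanted_hits)  # what I would like to get
```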
I'm completely stumped as to how to do this.
I've thought about using scripted fields, but I think I'd need to combine 2 (1 for calculated applicable prices, 1 for getting the lowest), and this doesn't appear to be an option.
Hope you can point me in the right direction
If I understand correctly, you are looking for inner_hits:
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-inner-hits.html
I was not sure about the aggregation part (you can't use inner_hits inside an aggregation), which is why I did not post this at first.
I hope it's what you need.
{
"query": {
"nested": {
"path": "prices",
"query": {
"range": {
"prices.price": {
"gte": 10,
"lte": 20
}
}
},
"inner_hits": {}
}
}
}
=> This keeps only the nested documents matching the range in the inner_hits part of the response:
"inner_hits":{
"prices":{
"hits":{
"total":2,
"max_score":1,
"hits":[
{
"_nested":{
"field":"prices",
"offset":1
},
"_score":1,
"_source":{
"price":18,
"from":"2020-02-20",
"untill":"2020-02-21"
}
},
{
"_nested":{
"field":"prices",
"offset":0
},
"_score":1,
"_source":{
"price":20,
"from":"2020-02-01",
"untill":"2020-03-01"
}
}
]
}
}
}
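As a rough illustration of what inner_hits gives you, here is a small Python emulation for Product B. It is a sketch under assumptions, not Elasticsearch code: the function inner_hits is made up for this example, it only handles the price range, and it returns entries in array order, whereas the real response above is ordered differently.

```python
def inner_hits(prices, gte, lte):
    # Keep only the nested entries whose price falls inside the range,
    # remembering each entry's position in the original "prices" array.
    return [
        {"_nested": {"field": "prices", "offset": offset}, "_source": entry}
        for offset, entry in enumerate(prices)
        if gte <= entry["price"] <= lte
    ]

product_b_prices = [
    {"price": 20, "from": "2020-02-01", "untill": "2020-03-01"},
    {"price": 18, "from": "2020-02-20", "untill": "2020-02-21"},
    {"price": 22, "from": "2020-02-22", "untill": "2020-02-23"},
]

hits = inner_hits(product_b_prices, gte=10, lte=20)
for hit in hits:
    print(hit["_nested"]["offset"], hit["_source"]["price"])
```

The entry with price 22 is dropped, which matches the two inner hits (offsets 0 and 1) in the response shown above.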

Elasticsearch search like "distinct count" in SQL?

I put the following data in elasticsearch.
POST movies/movie
{
"title": "Apocalypse Now",
"director": "Francis Ford Coppola",
"year": 1979,
"genres": ["Drama", "War", "Foo"]
}
POST movies/movie
{
"title": "Apocalypse Now",
"director": "Francis Ford Coppola",
"year": 1979,
"genres": ["Drama", "War", "Foo", "Bar"]
}
POST movies/movie
{
"title": "Apocalypse Now",
"director": "Francis Ford Coppola",
"year": 1979,
"genres": ["Drama", "Comic", "Bar"]
}
And I want to get the following results.
"Drama" : 3
"War" : 2
"Foo" : 2
"Bar" : 2
"Comic" : 1
How do I get these results?
Thank you for your help in solving this problem.
Thanks in advance.
You can use a terms aggregation, like this:
POST movies/_search
{
"size": 0,
"aggs": {
"counts": {
"terms": {
"field": "genres.keyword",
"size": 20
}
}
}
}
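To check what to expect, the counting that the terms aggregation performs can be emulated locally in Python. This is only a sketch of the idea (one count per document for each distinct genre value) on the three example documents, not Elasticsearch code.

```python
from collections import Counter

# Only the "genres" field of the three example documents matters here.
movies = [
    {"genres": ["Drama", "War", "Foo"]},
    {"genres": ["Drama", "War", "Foo", "Bar"]},
    {"genres": ["Drama", "Comic", "Bar"]},
]

# A terms aggregation reports a document count per distinct value, so each
# genre is counted at most once per document (hence the set()).
counts = Counter(genre for movie in movies for genre in set(movie["genres"]))

for genre, doc_count in counts.most_common():
    print(genre, doc_count)
```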

Elasticsearch partial multi_match minimum 2 letters? Want to change to 1 letter min

I'm using Elasticsearch to [partial] lookup in a number of words. I split the search query by space, and create a "multi_match" node per word.
This is a sample of the full list of words:
Hill road
High garden road
H & M oxford road
Hammersmith road
This is a sample generated search query, when I search for "hi road"
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "hi",
"fields": [
"full_text"
],
"type": "phrase_prefix"
}
},
{
"multi_match": {
"query": "road",
"fields": [
"full_text"
],
"type": "phrase_prefix"
}
}
]
}
},
"size": 200
}
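For context, the query generation described above can be sketched like this in Python. The function name build_query and its parameters are made up for this example; it simply splits the input on whitespace and emits one phrase_prefix multi_match clause per word, reproducing the query shown above.

```python
import json

def build_query(search_text, field="full_text", size=200):
    # One "multi_match" clause of type "phrase_prefix" per whitespace-separated word.
    clauses = [
        {"multi_match": {"query": word, "fields": [field], "type": "phrase_prefix"}}
        for word in search_text.split()
    ]
    # All clauses go into "must", so every word has to match.
    return {"query": {"bool": {"must": clauses}}, "size": size}

print(json.dumps(build_query("hi road"), indent=2))
```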
I expect it to return "Hill road" and "High garden road", which it does.
Now if I search for "h road", it only returns "H & M oxford road" but I expect it to return all 4 items. Why is that? Is there a minimum of two letters per multi_match query? If so, how can I overcome it?
Thank you

Unexpected result of Elastic term query

I have Elastic 2.4 running on http://localhost:9200 only for test.
Setup
As fresh start, I created 1 and only 1 item in the index.
$ curl -s -XPUT "http://localhost:9200/movies/movie/1" -d'
{
"title": "The Godfather",
"director": "Francis Ford Coppola",
"year": 1972,
"genres": ["Crime", "Drama"]
}'
Returns
{"_index":"movies","_type":"movie","_id":"1","_version":3,"_shards":{"total":2,"successful":1,"failed":0},"created":false}
I then run this command to confirm the index works:
$ curl -s -XPOST "http://localhost:9200/movies/_search" -d'
{
"query": {
"query_string": {
"query": "Godfather"
}
}
}'
Returns
{"took":8,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.095891505,"hits":[{"_index":"movies","_type":"movie","_id":"1","_score":0.095891505,"_source":
{
"title": "The Godfather",
"director": "Francis Ford Coppola",
"year": 1972,
"genres": ["Crime", "Drama"]
}}]}}
The Problem
I tried to run term query like this:
$ curl -s -XPOST "http://localhost:9200/movies/_search" -d'
{
"query": {
"term": {"title": "The Godfather"}
}
}'
I expected to get 1 result, but instead I got this:
{"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
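What did I get wrong?
A term query is not analyzed, so it looks for the exact term "The Godfather", while the analyzed title field only contains the lowercased tokens "the" and "godfather". Either use match_phrase, as jay suggested, or create a not_analyzed sub-field (e.g. title.raw), like this: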
$ curl -s -XPUT "http://localhost:9200/movies/_mapping/movie" -d'
{
"properties": {
"title": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}'
Then you can reindex your document to populate the title.raw:
$ curl -s -XPUT "http://localhost:9200/movies/movie/1" -d'
{
"title": "The Godfather",
"director": "Francis Ford Coppola",
"year": 1972,
"genres": ["Crime", "Drama"]
}'
And finally, your term query will work on the title.raw sub-field:
$ curl -s -XPOST "http://localhost:9200/movies/_search" -d'
{
"query": {
"term": {"title.raw": "The Godfather"}
}
}'
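To illustrate why the term query fails on title but works on title.raw, here is a minimal Python sketch. The analyze function is only a crude stand-in for the standard analyzer (it splits on non-word characters and lowercases), and term_query is a made-up helper that models a lookup of the exact value among the indexed terms.

```python
import re

def analyze(text):
    # Crude stand-in for the standard analyzer: tokenize and lowercase.
    return [token.lower() for token in re.findall(r"\w+", text)]

title = "The Godfather"
analyzed_terms = set(analyze(title))  # terms indexed for the analyzed "title" field
raw_terms = {title}                   # single term indexed for not_analyzed "title.raw"

def term_query(indexed_terms, value):
    # A term query looks the value up as-is; it is not analyzed first.
    return value in indexed_terms

print(sorted(analyzed_terms))
print(term_query(analyzed_terms, "The Godfather"))  # the failing query on "title"
print(term_query(raw_terms, "The Godfather"))       # the same query on "title.raw"
```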

How to filter/sort properly with ElasticSearch?

I've just created a very simple database (index) of "movies" using this tutorial: http://joelabrahamsson.com/elasticsearch-101/
Now I am trying to copy/paste the instructions to create a multi-field mapping for the "director" field:
curl -XPUT "http://localhost:9200/movies/movie/_mapping" -d'
{
"movie": {
"properties": {
"director": {
"type": "multi_field",
"fields": {
"director": {"type": "string"},
"original": {"type" : "string", "index" : "not_analyzed"}
}
}
}
}
}'
But after this, if I post this query, I get no result :
curl -XPOST "http://localhost:9200/_search" -d'
{
"query": {
"constant_score": {
"filter": {
"term": { "director.original": "Francis Ford Coppola" }
}
}
}
}'
result :
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
And if I try to sort using this :
http://localhost:9200/movies/movie/_search?sort=title.original:asc
I get the whole table (type) in random order (same order as with no "sort" instruction) :
{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":6,"max_score":null,"hits":[{"_index":"movies","_type":"movie","_id":"4","_score":null,"_source":
{
"title": "Apocalypse Now",
"director": "Francis Ford Coppola",
"year": 1979,
"genres": ["Drama", "War"]
},"sort":[null]},{"_index":"movies","_type":"movie","_id":"5","_score":null,"_source":
{
"title": "Kill Bill: Vol. 1",
"director": "Quentin Tarantino",
"year": 2003,
"genres": ["Action", "Crime", "Thriller"]
},"sort":[null]},{"_index":"movies","_type":"movie","_id":"1","_score":null,"_source":
{
"title": "The Godfather",
"director": "Francis Ford Coppola",
"year": 1972,
"genres": ["Crime", "Drama"]
},"sort":[null]},{"_index":"movies","_type":"movie","_id":"6","_score":null,"_source":
{
"title": "The Assassination of Jesse James by the Coward Robert Ford",
"director": "Andrew Dominik",
"year": 2007,
"genres": ["Biography", "Crime", "Drama"]
},"sort":[null]},{"_index":"movies","_type":"movie","_id":"2","_score":null,"_source":
{
"title": "Lawrence of Arabia",
"director": "David Lean",
"year": 1962,
"genres": ["Adventure", "Biography", "Drama"]
},"sort":[null]},{"_index":"movies","_type":"movie","_id":"3","_score":null,"_source":
{
"title": "To Kill a Mockingbird",
"director": "Robert Mulligan",
"year": 1962,
"genres": ["Crime", "Drama", "Mystery"]
},"sort":[null]}]}}
So could you tell me what I am missing in this basic use of Elasticsearch? Why is there no filtering or sorting on my custom "director" field?
You're not creating the multi-field properly. You should do it like this:
curl -XPOST "http://localhost:9200/movies/movie/_mapping" -d '{
"movie": {
"properties": {
"director": {
"type": "string",
"fields": {
"original": {"type" : "string", "index" : "not_analyzed"}
}
}
}
}
}'
Also note that in that tutorial they are using a deprecated way of declaring multi-fields, i.e. with "type": "multi_field". Now we do it the way I've shown above.
EDIT (from the comment below): After changing the mapping to a multi-field, you need to re-run the six indexing requests to re-index the six movies so that the director.original field gets populated.
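The need to re-index can be illustrated with a toy in-memory model in Python. This is purely a sketch under the assumption that sub-field values are computed when a document is indexed; the class TinyIndex and its methods are made up for this example and are not part of any Elasticsearch client.

```python
class TinyIndex:
    """Toy model: sub-field values are only computed when a document is indexed."""

    def __init__(self):
        self.has_original_subfield = False
        self.original_values = {}  # doc id -> exact value stored for director.original

    def put_mapping_with_original(self):
        # Changing the mapping affects only documents indexed afterwards.
        self.has_original_subfield = True

    def index(self, doc_id, doc):
        if self.has_original_subfield:
            self.original_values[doc_id] = doc["director"]

    def term_filter(self, value):
        # Exact match on the stored director.original values.
        return sorted(i for i, v in self.original_values.items() if v == value)

godfather = {"title": "The Godfather", "director": "Francis Ford Coppola"}

idx = TinyIndex()
idx.index(1, godfather)              # indexed before the mapping change
idx.put_mapping_with_original()
before = idx.term_filter("Francis Ford Coppola")

idx.index(1, godfather)              # re-index the same document
after = idx.term_filter("Francis Ford Coppola")
print(before, after)
```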
