Elasticsearch search like "distinct count" in SQL? - elasticsearch

I put the following data in elasticsearch.
POST movies/movie
{
"title": "Apocalypse Now",
"director": "Francis Ford Coppola",
"year": 1979,
"genres": ["Drama", "War", "Foo"]
}
POST movies/movie
{
"title": "Apocalypse Now",
"director": "Francis Ford Coppola",
"year": 1979,
"genres": ["Drama", "War", "Foo", "Bar"]
}
POST movies/movie
{
"title": "Apocalypse Now",
"director": "Francis Ford Coppola",
"year": 1979,
"genres": ["Drama", "Comic", "Bar"]
}
And I want to get the following results.
"Drama" : 3
"War" : 2
"Foo" : 2
"Bar" : 2
"Comic" : 1
How do I get these results?
Thank you for your help in solving this problem.
Thanks in advance.

You can use a terms aggregation, like this:
POST movies/_search
{
"size": 0,
"aggs": {
"counts": {
"terms": {
"field": "genres.keyword",
"size": 20
}
}
}
}
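To see what this aggregation counts, here is a minimal Python sketch that reproduces the same buckets client-side from the three sample documents. It only illustrates the counting logic and is not how Elasticsearch executes the request; the variable names are made up for this example. The `genres.keyword` field in the request above assumes the default dynamic mapping, which adds a `keyword` sub-field to string fields.

```python
from collections import Counter

# The three sample documents from the question (only the aggregated field).
movies = [
    {"genres": ["Drama", "War", "Foo"]},
    {"genres": ["Drama", "War", "Foo", "Bar"]},
    {"genres": ["Drama", "Comic", "Bar"]},
]

# A terms aggregation counts documents per distinct value, so a value that
# is repeated inside one document's array still counts once for that document.
genre_counts = Counter()
for movie in movies:
    genre_counts.update(set(movie["genres"]))

# Print the buckets, highest document count first.
for genre, count in genre_counts.most_common():
    print(f"{genre}: {count}")
```

The counts match the result asked for in the question: Drama 3, War 2, Foo 2, Bar 2, Comic 1.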

Related

Aggregator of type top_hits cannot accept sub-aggregations with Percentiles

I have the following documents:
{"id": 1, "type": "bags", "brand": "Louis Vuitton", "condition": "new", "price": 500}
{"id": 2, "type": "bags", "brand": "Louis Vuitton", "condition": "new", "price": 450}
{"id": 3, "type": "bags", "brand": "Louis Vuitton", "condition": "new", "price": 420}
{"id": 4, "type": "bags", "brand": "Louis Vuitton", "condition": "like new", "price": 150}
{"id": 5, "type": "bags", "brand": "Louis Vuitton", "condition": "like new", "price": 150}
{"id": 6, "type": "bags", "brand": "Louis Vuitton", "condition": "like new", "price": 100}
{"id": 7, "type": "bags", "brand": "Louis Vuitton", "condition": "used", "price": 400}
{"id": 8, "type": "bags", "brand": "Louis Vuitton", "condition": "used", "price": 350}
{"id": 9, "type": "bags", "brand": "Louis Vuitton", "condition": "used", "price": 300}
I want to write a query that returns the percentiles of prices for the top 2 documents for each condition. In other words, I want to perform some calculation after getting the top 2 best-scoring documents for each item condition (new, like new, used). I have tried the following, but I get the error "Aggregator of type top_hits cannot accept sub-aggregations":
{
"query": {
"match": {
"brand": "Louis Vuitton"
}
},
"aggs": {
"item_conditions": {
"terms": {
"field": "condition"
},
"aggs": {
"top_two": {
"top_hits": {
"size": 2
},
"aggs": {
"top_two_percentiles": {
"percentiles": {
"field": "price"
}
}
}
}
}
}
}
}
Is there another way to achieve this, or do I have to do some post-processing myself after getting the results back from ES? The end result I want is to be able to supply this data to charts to make it look like this: https://ibb.co/y5FpV80
"... the percentiles of prices for the top two documents ..." is somewhat arbitrary. What's the metric that determines the score? A terms aggregation would score the buckets equally. The only differentiating factor would be the bucket count... What I'm saying is, you'll need to first determine what puts a given bucket in the top 2 and go from there.
In any event, you can:
Order any terms aggregation by the result of one of its numeric child aggregations.
After that, you can limit it to 2 buckets.
When that's done, you can use a percentiles bucket aggregation to calculate the percentiles of the two top prices.
In concrete terms:
POST your-index/_search?filter_path=aggregations.*.buckets.key,aggregations.*.buckets.doc_count,aggregations.*.buckets.percentiles_top_two_prices
{
"size": 0,
"query": {
"match": {
"brand": "Louis Vuitton"
}
},
"aggs": {
"item_conditions": {
"terms": {
"field": "condition"
},
"aggs": {
"top_two": {
"terms": {
"field": "price",
"size": 2,
"order": {
"max_score": "desc" <-- here's how you enforce the top 2 docs
}
},
"aggs": {
"max_score": {
"max": {
"script": "_score" <-- how you determine what happens here is up to you. _score will be equal across all buckets (I believe) so pick some other metric.
}
},
"just_the_price": {
"min": {
"field": "price" <-- there's no "identity" agg in ES so I'm using min. Each bucket holds only one distinct price because the parent terms aggregation already groups by price.
}
}
}
},
"percentiles_top_two_prices": {
"percentiles_bucket": {
"buckets_path": "top_two>just_the_price"
}
}
}
}
}
}
yielding something along the lines of:
{
"aggregations" : {
"item_conditions" : {
"buckets" : [
{
"key" : "like new",
"doc_count" : 3,
"percentiles_top_two_prices" : {
"values" : {
"1.0" : 100.0,
"5.0" : 100.0,
"25.0" : 100.0,
"50.0" : 150.0,
"75.0" : 150.0,
"95.0" : 150.0,
"99.0" : 150.0
}
}
},
{
"key" : "new",
"doc_count" : 3,
"percentiles_top_two_prices" : {
"values" : {
"1.0" : 420.0,
"5.0" : 420.0,
"25.0" : 420.0,
"50.0" : 450.0,
"75.0" : 450.0,
"95.0" : 450.0,
"99.0" : 450.0
}
}
},
{
"key" : "used",
"doc_count" : 3,
"percentiles_top_two_prices" : {
"values" : {
"1.0" : 300.0,
"5.0" : 300.0,
"25.0" : 300.0,
"50.0" : 350.0,
"75.0" : 350.0,
"95.0" : 350.0,
"99.0" : 350.0
}
}
}
]
}
}
}
I'm frankly not sure what these stats would bring you (when based on only two values) but this is how it could be done 😉
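If the selection rule is simple, the post-processing mentioned in the question can also be done client-side. A minimal Python sketch, assuming the hits arrive already sorted by relevance and that "top two" means the first two hits per condition. The `hits` list (data copied from the question) and the helper name `top_two_prices` are hypothetical. Unlike the query above, this keeps duplicate prices instead of bucketing by distinct price.

```python
from collections import defaultdict
from statistics import median

# Hits as they might come back from a search, in relevance order (assumed).
hits = [
    {"id": 1, "condition": "new", "price": 500},
    {"id": 2, "condition": "new", "price": 450},
    {"id": 3, "condition": "new", "price": 420},
    {"id": 4, "condition": "like new", "price": 150},
    {"id": 5, "condition": "like new", "price": 150},
    {"id": 6, "condition": "like new", "price": 100},
    {"id": 7, "condition": "used", "price": 400},
    {"id": 8, "condition": "used", "price": 350},
    {"id": 9, "condition": "used", "price": 300},
]

def top_two_prices(hits):
    """Keep the prices of the first two hits seen for each condition."""
    grouped = defaultdict(list)
    for hit in hits:
        prices = grouped[hit["condition"]]
        if len(prices) < 2:
            prices.append(hit["price"])
    return dict(grouped)

# Summarise each group; with two values, min/median/max cover the range.
for condition, prices in top_two_prices(hits).items():
    print(condition, min(prices), median(prices), max(prices))
```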

Query elasticsearch nested field by index(order of insert)

I have an Elasticsearch document with some nested objects (mapped as a nested field),
for example:
{
"FirstName": "Test",
"LastName": "Test",
"Cost": 322.54,
"Email": "test#test.com",
"Vehicles": [
{
"Year": 2000,
"Make": "Mazda",
"Model": "6"
},
{
"Year": 2012,
"Make": "Ford",
"Model": "F150"
}
]
}
I am trying to do aggregations on a specific index of the array. For example, I want to sum the cost of the documents that have Ford as the make, but only for the first vehicle.
Is that possible at all? There is almost no information on the internet about Elasticsearch nested fields and nothing about their index/order.
It is possible to achieve what you want, but you also need to add the index order as a field inside your nested documents:
{
"FirstName": "Test",
"LastName": "Test",
"Cost": 322.54,
"Email": "test#test.com",
"Vehicles": [
{
"Year": 2000,
"Make": "Mazda",
"Model": "6",
"Index": 0
},
{
"Year": 2012,
"Make": "Ford",
"Model": "F150",
"Index": 1
}
]
}
And then you can query your index using the two conditions on Index and the Make like this:
{
"query": {
"nested": {
"path": "Vehicles",
"query": {
"bool": {
"filter": [
{
"match": {
"Vehicles.Index": 0
}
},
{
"match": {
"Vehicles.Make": "Ford"
}
}
]
}
}
}
}
}
In this specific case, the query is not going to yield any results, as you would expect, because the first vehicle (Index 0) of the sample document is a Mazda, not a Ford.
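The matching logic, plus the sum the question asks for, can be sketched in plain Python. This is only an illustration of the semantics, not Elasticsearch code; the second customer and the helper name `has_vehicle` are hypothetical additions for the example.

```python
customers = [
    {   # the document from the question, with the added Index field
        "Cost": 322.54,
        "Vehicles": [
            {"Year": 2000, "Make": "Mazda", "Model": "6", "Index": 0},
            {"Year": 2012, "Make": "Ford", "Model": "F150", "Index": 1},
        ],
    },
    {   # hypothetical second document whose first vehicle is a Ford
        "Cost": 100.0,
        "Vehicles": [
            {"Year": 2015, "Make": "Ford", "Model": "Focus", "Index": 0},
        ],
    },
]

def has_vehicle(doc, make, index):
    """Nested semantics: both conditions must hold on the SAME nested object."""
    return any(
        vehicle["Index"] == index and vehicle["Make"] == make
        for vehicle in doc["Vehicles"]
    )

# Sum Cost over the documents whose first vehicle is a Ford.
total_cost = sum(doc["Cost"] for doc in customers if has_vehicle(doc, "Ford", 0))
print(total_cost)  # 100.0
```

In Elasticsearch itself, a `sum` aggregation on `Cost` sent alongside the nested query should give the same total over the matching documents.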

sort result by term frequency count

Suppose there are 2 documents that contain the word "world" 5 times and 2 times respectively.
I want the document that contains "world" 5 times to be listed first, followed by the document that contains it 2 times.
How do I sort this?
Thanks.
I don't think there is any need to sort explicitly. By default, Elasticsearch sorts results by relevance score, and how often the search term occurs in a document is one of the factors in that score. So for documents like the ones you describe, the one containing the word more often will generally rank first.
To try this, ingest some documents:
curl -XPUT "http://localhost:9200/movies/movie/1" -d'
{
"title": "The Godfather",
"director": "Francis Ford Coppola",
"year": 1972,
"genres": [
"Crime",
"Drama"
]
}'
curl -XPUT "http://localhost:9200/movies/movie/2" -d'
{
"title": "The Godfather Godfather",
"director": "Francis Ford Coppola",
"year": 1972,
"genres": [
"Crime",
"Drama"
]
}'
curl -XPUT "http://localhost:9200/movies/movie/3" -d'
{
"title": "The Godfather Godfather Godfather",
"director": "Francis Ford Coppola",
"year": 1972,
"genres": [
"Crime",
"Drama"
]
}'
After ingestion run this query and see the result:
curl -XPOST "http://localhost:9200/movies/_search" -d'
{
"explain": true,
"query": {
"filtered": {
"query": {
"query_string": {
"query": "godfather"
}
}
}
}
}'
This will return document three on top because it contains "godfather" multiple times.
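The term-frequency part of that ranking can be illustrated with a short Python sketch. It counts raw occurrences only; the real relevance score also takes other factors into account, such as inverse document frequency and field length, so this is a simplification. The helper name `term_frequency` is made up for this example.

```python
import re

# Titles of the three documents ingested above, keyed by document id.
titles = {
    1: "The Godfather",
    2: "The Godfather Godfather",
    3: "The Godfather Godfather Godfather",
}

def term_frequency(text, term):
    """Count how often a term occurs among the lowercased word tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return tokens.count(term.lower())

# Rank document ids by how often "godfather" occurs, most frequent first.
ranked = sorted(
    titles,
    key=lambda doc_id: term_frequency(titles[doc_id], "godfather"),
    reverse=True,
)
print(ranked)  # [3, 2, 1]
```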

How to filter/sort properly with ElasticSearch?

I've just created some very simple database (index) of "movies" using this tutorial : http://joelabrahamsson.com/elasticsearch-101/
Now, I try to copy/paste the instruction to create a multifield mapping for the "director" field :
curl -XPUT "http://localhost:9200/movies/movie/_mapping" -d'
{
"movie": {
"properties": {
"director": {
"type": "multi_field",
"fields": {
"director": {"type": "string"},
"original": {"type" : "string", "index" : "not_analyzed"}
}
}
}
}
}'
But after this, if I post this query, I get no result :
curl -XPOST "http://localhost:9200/_search" -d'
{
"query": {
"constant_score": {
"filter": {
"term": { "director.original": "Francis Ford Coppola" }
}
}
}
}'
result :
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
And if I try to sort using this :
http://localhost:9200/movies/movie/_search?sort=title.original:asc
I get the whole table (type) in random order (same order as with no "sort" instruction) :
{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":6,"max_score":null,"hits":[{"_index":"movies","_type":"movie","_id":"4","_score":null,"_source":
{
"title": "Apocalypse Now",
"director": "Francis Ford Coppola",
"year": 1979,
"genres": ["Drama", "War"]
},"sort":[null]},{"_index":"movies","_type":"movie","_id":"5","_score":null,"_source":
{
"title": "Kill Bill: Vol. 1",
"director": "Quentin Tarantino",
"year": 2003,
"genres": ["Action", "Crime", "Thriller"]
},"sort":[null]},{"_index":"movies","_type":"movie","_id":"1","_score":null,"_source":
{
"title": "The Godfather",
"director": "Francis Ford Coppola",
"year": 1972,
"genres": ["Crime", "Drama"]
},"sort":[null]},{"_index":"movies","_type":"movie","_id":"6","_score":null,"_source":
{
"title": "The Assassination of Jesse James by the Coward Robert Ford",
"director": "Andrew Dominik",
"year": 2007,
"genres": ["Biography", "Crime", "Drama"]
},"sort":[null]},{"_index":"movies","_type":"movie","_id":"2","_score":null,"_source":
{
"title": "Lawrence of Arabia",
"director": "David Lean",
"year": 1962,
"genres": ["Adventure", "Biography", "Drama"]
},"sort":[null]},{"_index":"movies","_type":"movie","_id":"3","_score":null,"_source":
{
"title": "To Kill a Mockingbird",
"director": "Robert Mulligan",
"year": 1962,
"genres": ["Crime", "Drama", "Mystery"]
},"sort":[null]}]}}
So what am I missing in this basic use of Elasticsearch? Why is there no filtering or sorting on my custom "director" field?
You're not creating the multi-field properly. You should do it like this:
curl -XPOST "http://localhost:9200/movies/movie/_mapping" -d '{
"movie": {
"properties": {
"director": {
"type": "string",
"fields": {
"original": {"type" : "string", "index" : "not_analyzed"}
}
}
}
}
}'
Also note that in that tutorial they are using a deprecated way of declaring multi-fields, i.e. with "type": "multi_field". Now we do it the way I've shown above.
EDIT from comment below: After changing the mapping to multi-field, you need to re-run the 6 indexing queries to re-index the six movies so the director.original field gets populated.
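Why the term filter needs the not_analyzed sub-field can be sketched in Python. The `analyze` function below is a very rough stand-in for the default analyzer (lowercased word tokens), and both helper names are hypothetical; the point is only that a term filter looks up its value exactly as given.

```python
import re

def analyze(text):
    """Rough stand-in for the default analyzer: lowercased word tokens."""
    return re.findall(r"\w+", text.lower())

director = "Francis Ford Coppola"

# Terms stored for the analyzed "director" field.
analyzed_terms = set(analyze(director))
# Terms stored for the not_analyzed "director.original" field: one exact value.
not_analyzed_terms = {director}

def term_filter(indexed_terms, value):
    """A term filter does not analyze its input; it needs an exact term match."""
    return value in indexed_terms

print(term_filter(analyzed_terms, "Francis Ford Coppola"))      # False
print(term_filter(not_analyzed_terms, "Francis Ford Coppola"))  # True
print(term_filter(analyzed_terms, "coppola"))                   # True
```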

Elasticsearch query on inner list and get only matching objects from list instead of entire list in result document

In the following Elasticsearch document, I need to find the comments from a specific name, e.g. "Mary Brown". Basically, I want to query an inner list and get back only the matching objects from that list instead of the entire list in the result document. Is this possible? I have mapped 'comments' as nested.
{
"title": "Investment secrets",
"body": "What they don't tell you ...",
"tags": [ "shares", "equities" ],
"comments": [
{
"name": "Mary Brown",
"comment": "Lies, lies, lies",
"age": 42,
"stars": 1,
"date": "2014-10-18"
},
{
"name": "John Smith",
"comment": "You're making it up!",
"age": 28,
"stars": 2,
"date": "2014-10-16"
},
{
"name": "Mary Brown",
"comment": "making it!!!",
"age": 42,
"stars": 3,
"date": "2014-10-20"
}
]
}
Since you have properly mapped your comments field as nested, then yes this is possible using inner_hits, like this:
{
"_source": false,
"query": {
"nested": {
"path": "comments",
"inner_hits": { <---- use inner_hits here
"_source": [
"comment", "date"
]
},
"query": {
"bool": {
"must": [
{
"term": {
"comments.name": "Mary Brown"
}
}
]
}
}
}
}
}
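What the inner hits contain can be illustrated with a small Python sketch that applies the same selection to the sample document. It mimics the outcome only, not the response format Elasticsearch returns; the helper name `matching_comments` is hypothetical.

```python
document = {
    "title": "Investment secrets",
    "comments": [
        {"name": "Mary Brown", "comment": "Lies, lies, lies",
         "age": 42, "stars": 1, "date": "2014-10-18"},
        {"name": "John Smith", "comment": "You're making it up!",
         "age": 28, "stars": 2, "date": "2014-10-16"},
        {"name": "Mary Brown", "comment": "making it!!!",
         "age": 42, "stars": 3, "date": "2014-10-20"},
    ],
}

def matching_comments(doc, name, fields):
    """Return only the nested comments by `name`, limited to `fields`."""
    return [
        {field: comment[field] for field in fields}
        for comment in doc["comments"]
        if comment["name"] == name
    ]

# Same selection as the query above: Mary Brown's comments, two fields each.
for hit in matching_comments(document, "Mary Brown", ["comment", "date"]):
    print(hit)
```

Only the two comments by Mary Brown come back, each reduced to `comment` and `date`; John Smith's comment is left out.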
