Total count of all the tokens - elasticsearch

Is it possible to get the ttf (total term frequency) for all the tokens from a field in all the shards for a given index?
e.g. I have:
PUT /index/type/1
{
"sentence": "delicious cake"
}
PUT /index/type/2
{
"sentence": "horrible cake"
}
I want to get:
cake 2
horrible 1
delicious 1
Also is it possible to do it for multiple fields (let's say I'd have sentence1 and sentence2 and I'd like to run such a count on the concatenation of them)?
I know termvectors give the ttf and that mtermvectors can do it for multiple documents but then I'd have to go through all the documents and handle the results myself somehow.
Actually only the top K terms would be sufficient for me if I can control K.
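For reference, the client-side handling mentioned above could look roughly like this. It is a minimal Python sketch, assuming the _mtermvectors response has already been fetched and parsed into a dict. The function name top_k_terms and the sample data are made up for illustration, and the sketch sums the per-document term_freq values instead of reading ttf from term_statistics.

```python
from collections import Counter


def top_k_terms(mtermvectors_response, fields, k):
    """Sum per-document term frequencies over the given fields, return the top k."""
    totals = Counter()
    for doc in mtermvectors_response.get("docs", []):
        term_vectors = doc.get("term_vectors", {})
        for field in fields:
            terms = term_vectors.get(field, {}).get("terms", {})
            for term, stats in terms.items():
                totals[term] += stats.get("term_freq", 0)
    # Sort by count (descending), then by term, so ties come out in a stable order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))[:k]


# Hand-written fragment shaped like a multi term vectors response (not real server output).
sample = {
    "docs": [
        {"_id": "1", "term_vectors": {"sentence": {"terms": {
            "delicious": {"term_freq": 1}, "cake": {"term_freq": 1}}}}},
        {"_id": "2", "term_vectors": {"sentence": {"terms": {
            "horrible": {"term_freq": 1}, "cake": {"term_freq": 1}}}}},
    ]
}

print(top_k_terms(sample, ["sentence"], 3))
```

Passing several field names, e.g. ["sentence1", "sentence2"], counts over the concatenation of those fields, but every document still has to be pulled through the client.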

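If your field 'sentence' is analyzed, you can get per-term counts with a terms facet. Note that the facet counts the documents containing each term, so it matches the TTF only when a term occurs at most once per document, as in your example: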
POST /index/type/_search
{
"query": {
"match_all": {}
},
"facets" : {
"sentence" : {
"terms" : {
"field" : "sentence",
"size" : 10
}
}
}
}
The counts are returned in the facets part of the response.
You can also pass an array of fields, ["sentence", "sentence2"], to count across multiple fields:
POST /index/type/_search
{
"query" : {
"match_all" : { }
},
"facets" : {
"multiple_sentence" : {
"terms" : {
"fields" : ["sentence", "sentence2"],
"size" : 10
}
}
}
}
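Reading the counts back out of the response could look like the sketch below. It assumes the search response has been parsed into a Python dict and that the facet section has the usual terms facet shape, a "terms" list of objects with "term" and "count". The helper name facet_term_counts and the sample response are hypothetical.

```python
def facet_term_counts(response, facet_name):
    """Return [(term, count), ...] for one terms facet of a parsed search response."""
    facet = response.get("facets", {}).get(facet_name, {})
    return [(entry["term"], entry["count"]) for entry in facet.get("terms", [])]


# Hand-written response fragment for the two example documents (not real server output).
sample = {
    "facets": {
        "sentence": {
            "_type": "terms",
            "terms": [
                {"term": "cake", "count": 2},
                {"term": "horrible", "count": 1},
                {"term": "delicious", "count": 1},
            ],
        }
    }
}

for term, count in facet_term_counts(sample, "sentence"):
    print(term, count)
```

The "size" value in the request is what controls K here.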

Related

How to find similar documents in Elasticsearch

My documents are made up of various fields. Given an input document, I want to find similar documents using the input document's fields. How can I achieve this?
{
"query": {
"more_like_this" : {
"ids" : ["12345"],
"fields" : ["field_1", "field_2"],
"min_term_freq" : 1,
"max_query_terms" : 12
}
}
}
This returns documents similar to the document with id 12345. You only need to specify the ids and the field names (title, category, name, etc.), not their values.
Here is another way that works without ids, but you need to supply the field values yourself. Example: get documents whose title is similar to:
elasticsearch is fast
{
"query": {
"more_like_this" : {
"fields" : ["title"],
"like" : "elasticsearch is fast",
"min_term_freq" : 1,
"max_query_terms" : 12
}
}
}
You can add more fields and their values
You haven't mentioned the types of your fields. A general approach is to use a catch all field (using copy_to) with the more like this query.
{
"query": {
"more_like_this" : {
"fields" : ["first name", "last name", "address", "etc"],
"like" : "your_query",
"min_term_freq" : 1,
"max_query_terms" : 12
}
}
}
Put all the text from the input document in your_query. You can increase or decrease min_term_freq and max_query_terms to tune the results.
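If the request bodies are built in code, both variants above (by ids or by text) can come from one small helper. This is only a sketch: build_more_like_this is a made-up name, and the body simply mirrors the JSON shown in the answers, so it inherits whatever the installed Elasticsearch version accepts for more_like_this.

```python
import json


def build_more_like_this(fields, like_text=None, ids=None,
                         min_term_freq=1, max_query_terms=12):
    """Build a more_like_this search body from either free text or document ids."""
    if (like_text is None) == (ids is None):
        raise ValueError("pass exactly one of like_text or ids")
    mlt = {
        "fields": list(fields),
        "min_term_freq": min_term_freq,
        "max_query_terms": max_query_terms,
    }
    if like_text is not None:
        mlt["like"] = like_text   # compare against the given text
    else:
        mlt["ids"] = list(ids)    # compare against already indexed documents
    return {"query": {"more_like_this": mlt}}


print(json.dumps(build_more_like_this(["title"], like_text="elasticsearch is fast"), indent=2))
```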

Elasticsearch: filter or get value out of aggregation bucket

In Elasticsearch I have an index containing documents with a timestamp and the number of observed requests to a webservice.
I would like to perform an aggregation to get, for each day, the hour where the maximum number of requests were observed (peak hour).
I managed to get the result with the following request:
{
"aggregations" : {
"week_summary" : {
"filter" : {"range": {"#timestamp": {"gte": "2015-01-20||-7d","lte": "2015-01-20"}}},
"aggregations" : {
"oneday_interval" : {
"date_histogram" : {"field" : "#timestamp", "interval" : "1d","order" : { "_key" : "desc" }},
"aggregations" : {
"peak_hour_histogram" : {
"date_histogram" : {"field" : "#timestamp", "interval" : "1h","order" : { "peak_request_count.value" : "desc" }},
"aggregations" : {
"peak_request_count" : {
"sum" : { "field" : "request_count"}
}
}
}
}
}
}
}
},
"size" : 0
}
This works in a sense: because a date histogram can be sorted on a sub-aggregation value, the first item in the peak_hour_histogram buckets array is indeed the peak hour.
However, I don't need the other bucket items (i.e. the other 23 hours of the day); I'd like to receive only the first one. I tried top_hits without any success.
Do you know a way to perform this filtering?
NB: In the real use case my aggregation is returning about 3MB of data. So filtering all those useless values becomes important.
Thanks for your answers.
I think the feature requested here should cover your requirement: https://github.com/elasticsearch/elasticsearch/issues/6704. It started from this issue: https://github.com/elasticsearch/elasticsearch/issues/7103
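In the meantime, the extra buckets can only be dropped on the client. Here is a rough Python sketch, assuming the response of the request above has been parsed into a dict. The function name peak_hours and the sample values are invented, and this does not reduce the amount of data the server sends.

```python
def peak_hours(response):
    """Keep only the first (peak) hour bucket of every day bucket."""
    days = response["aggregations"]["week_summary"]["oneday_interval"]["buckets"]
    result = []
    for day in days:
        hours = day["peak_hour_histogram"]["buckets"]
        if not hours:
            continue  # no data for that day
        peak = hours[0]  # already sorted by peak_request_count.value, descending
        result.append({
            "day": day.get("key_as_string", day.get("key")),
            "peak_hour": peak.get("key_as_string", peak.get("key")),
            "request_count": peak["peak_request_count"]["value"],
        })
    return result


# Hand-written fragment with made-up numbers, shaped like the aggregation response.
sample = {"aggregations": {"week_summary": {"oneday_interval": {"buckets": [
    {"key_as_string": "2015-01-20T00:00:00.000Z",
     "peak_hour_histogram": {"buckets": [
         {"key_as_string": "2015-01-20T14:00:00.000Z", "peak_request_count": {"value": 120.0}},
         {"key_as_string": "2015-01-20T09:00:00.000Z", "peak_request_count": {"value": 80.0}},
     ]}},
    {"key_as_string": "2015-01-19T00:00:00.000Z",
     "peak_hour_histogram": {"buckets": []}},
]}}}}

print(peak_hours(sample))
```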

Elasticsearch: Aggregate results of query

I have an elasticsearch index containing products, which I can query for different search terms. Every product contains a field shop_id that references the shop it belongs to. Now I am trying to display a list of all shops holding products that match my query (so the user can filter by shop).
As far as I read on similar questions, I have to use an aggregation. Finally I built this query:
curl -XGET 'http://localhost:9200/searchindex/_search?search_type=count&pretty=true' -d '{
"query" : {
"match" : {
"_all" : "playstation"
}
},
"aggregations": {
"shops_count": {
"terms": {
"field": "shop_id"
}
}
}
}'
This should search for playstation and aggregate the results based on shop_id. Sadly it only returns
Data too large, data would be larger than limit of [8534150348] bytes
I also tried it with queries returning only 2 results.
The index contains more than 90,000,000 products.
I would suggest that's a job for a filter aggregation:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-filter-aggregation.html
Note: I don't know the product mapping in your index, so if the filter below doesn't work, try another filter from http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filters.html
{
"aggs" : {
"in_stock_playstation" : {
"filter" : { "term" : { "change_me_to_field_for_product" : "playstation" } },
"aggs" : {
"shop_count" : { "terms" : { "field" : "shop_id" } }
}
}
}
}
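The nesting is easy to get wrong by hand, so building the body in code and serializing it can help. A small Python sketch under the same assumptions as the JSON above: the placeholder field name still has to be replaced with a real field from your mapping, and build_shop_count_request is a made-up helper name.

```python
import json


def build_shop_count_request(product_field, search_term,
                             shop_field="shop_id",
                             agg_name="in_stock_playstation"):
    """Filter aggregation on one term, with a terms sub-aggregation on the shop field."""
    return {
        "aggs": {
            agg_name: {
                "filter": {"term": {product_field: search_term}},
                # The sub-aggregation sits next to "filter", inside the named aggregation.
                "aggs": {
                    "shop_count": {"terms": {"field": shop_field}}
                },
            }
        }
    }


body = build_shop_count_request("change_me_to_field_for_product", "playstation")
print(json.dumps(body, indent=2))
```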

Elastic Search NEST - How to have multiple levels of filters in search

I would like to have multiple levels of filters to derive a result set using NEST API in Elastic Search. Is it possible to query the results of another filter...? If yes can I do that in multiple levels?
My requirement is that a user can select and unselect options for various fields.
Example: There are 1000 documents in total in my index 'people'. There may be 3 ListBoxes: 1) City 2) Favourite Food 3) Favourite Colour. If the user selects a city, that narrows the set down to 600 documents. Out of those 600 documents I would like to filter by favourite food, which may leave some 300 documents. Then I would like to filter further by favourite colour to retrieve 50 documents out of the 300 derived previously.
You don't need to query within filters to achieve what you want. Just use a filtered query (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html) and provide several filters. In your case I assume you would do something like this for the first query:
{
"filtered" : {
"query" : {
"match_all" : { }
},
"filter" : {
"and" : [
{
"term" : {
"city" : "some city"
}
}
]
}
}
}
You would then return the results from that and display them. You'd then let them select the next filter and do the following:
{
"filtered" : {
"query" : {
"match_all" : { }
},
"filter" : {
"and" : [
{
"term" : {
"city" : "some city"
}
},
{
"term" : {
"food" : "some food"
}
}
]
}
}
}
You'd then rinse and repeat for the third filter parameter:
{
"filtered" : {
"query" : {
"match_all" : { }
},
"filter" : {
"and" : [
{
"term" : {
"city" : "some city"
}
},
{
"term" : {
"food" : "some food"
}
},
{
"term" : {
"colour" : "some colour"
}
}
]
}
}
}
I haven't tested this, but the principle is sound and will work.
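The pattern in the three queries is the same each time: one term filter per ListBox that currently has a selection. Sketched in Python rather than NEST, with a hypothetical helper build_filtered_query and the field names from the examples above:

```python
def build_filtered_query(selections):
    """Build a filtered query with one term filter per selected option.

    selections maps field name -> selected value; None means "nothing selected".
    """
    filters = [{"term": {field: value}}
               for field, value in selections.items()
               if value is not None]
    if not filters:
        # Nothing selected yet: match everything, without a filter clause.
        return {"match_all": {}}
    return {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {"and": filters},
        }
    }


# The user has picked a city and a food, but no colour yet.
print(build_filtered_query({"city": "some city", "food": "some food", "colour": None}))
```

Each time the user changes a selection, the whole query is rebuilt from the current selections and sent again, so there is no need to query the results of a previous filter.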

elasticsearch offset and limit facets

I'm trying to make a search that both limits and "offsets" (the "from" keyword in elasticsearch) the facet result set, so something like:
'{
"query" : {
"nested" : {
"_scope" : "my_scope",
"path" : "related_award_vendors",
"score_mode" : "avg",
"query" : {
"bool" : {
"must" : {
"text" : {"related_award_vendors.title" : "inc"}
}
}
}
}
},
"facets" : {
"facet1" : {
"terms_stats" : {
"key_field" : "related_award_vendors.django_id",
"value_field" : "related_award_vendors.award_amount",
"order":"term",
"size": 5,
"from":2
},
"scope" : "my_scope" }
}
}'
In the above, it returns ids 1,2,3,4,5, and if I remove "from" it still returns 1,2,3,4,5 in the result set.
The "size" is working correctly. In this case, it's returning five items in the result set.
My understanding is that solr can do this. Can this be done in elasticsearch?
The terms stats facet doesn't support the from parameter. The only way to achieve what you want is to set size to size + offset and ignore the first offset entries on the client side. In your example that means requesting 7 entries and ignoring the first 2.
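The client-side part is just a slice. A minimal Python sketch, where both function names are made up and the entries stand in for whatever list the facet returns:

```python
def facet_request_size(size, offset):
    """Number of entries to request so that a page of `size` survives skipping `offset`."""
    return size + offset


def page_of_entries(entries, size, offset):
    """Drop the first `offset` entries and keep at most `size` of the rest."""
    return entries[offset:offset + size]


# With size 5 and offset 2, request 7 entries and skip the first 2.
requested = facet_request_size(5, 2)
entries = [{"term": i} for i in range(1, requested + 1)]  # stand-in for the facet entries
print([e["term"] for e in page_of_entries(entries, 5, 2)])
```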
