Is it possible to eliminate "empty" facets with Elastic Search? - elasticsearch

I've finally managed to get Elastic Search indexing to work the way I want it to work, indexing the raw values of certain fields using subfields and not_analyzed. The facets are what I expect, however, in some cases, due to the source data having null/empty values for those fields, I get results like this in the facets section:
"things": {
"_type": "terms",
"missing": 187,
"total": 12214,
"other": 10608,
"terms": [
{
"term": "foo",
"count": 912
},
{
"term": "",
"count": 532
},
{
"term": "bar",
"count": 37
}
]
}
Note the "" in the second item. I can see why ElasticSearch wouldn't automatically exclude this, as one might want to know how many documents don't have the field. But for my purposes I'd like to just not have this returned.
Is there some way that I can configure ElasticSearch to ignore these, either in the indexing or in the query?

Try putting
"exclude" : ""
in the terms section of your aggregation.
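A minimal sketch of what that request body could look like, built in Python so it can be printed and inspected without a cluster. The field name things.raw and the aggregation name things are assumptions for illustration; the exclude option is given here as a list of exact values, which the terms aggregation accepts alongside the regex string form shown above. I have not checked this against every Elasticsearch version.

```python
import json


def terms_agg_without_empty(field, agg_name="things", size=10):
    """Build a search body whose terms aggregation skips the empty-string term."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "aggs": {
            agg_name: {
                "terms": {
                    "field": field,
                    "size": size,
                    "exclude": [""],  # exact-value form of exclude
                }
            }
        },
    }


# "things.raw" is a hypothetical not_analyzed subfield name.
body = terms_agg_without_empty("things.raw")
print(json.dumps(body, indent=2))
```

Documents that lack the field entirely are a separate case; those are what the "missing" count in the facet output refers to.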

Related

Elasticsearch Rank based on rarity of a field value

I'd like to know how can I rank lower items, which have fields that are frequently appearing among the results.
Say, we have a similar result set:
"name": "Red T-Shirt"
"store": "Zara"
"name": "Yellow T-Shirt"
"store": "Zara"
"name": "Red T-Shirt"
"store": "Bershka"
"name": "Green T-Shirt"
"store": "Benetton"
I'd like to rank the documents in such a manner that the documents containing frequently found fields,
"store" in this case, are deboosted to appear lower in the results.
This is to achieve a bit of variety, so that the search doesn't yield top results from the same store.
In the example above, if I search for "T-Shirt", I want to see one Zara T-Shirt at the top and the rest
of Zara T-Shirts should be appearing lower, after all other unique stores.
So far I have looked into using aggregation buckets for sorting, and into script sorting, but without success.
Is it possible to achieve this inside of the search engine?
Many thanks in advance!
This is possible with a combination of the diversified sampler aggregation and the top hits aggregation, as I learned from the Elastic forum. I don't know what the performance implications are if this is used on a high-load production system. Here is a code example; use at your own risk:
{
"query": {}, // whatever query
"size": 0, // since we don't use hits
"aggs": {
"my_unbiased_sample": {
"diversified_sampler": {
"shard_size": 100,
"field": "store"
},
"aggs": {
"keywords": {
"top_hits": {
"_source": {
"includes": [ "name", "store" ]
},
"size": 100
}
}
}
}
}
}
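To make the target ordering concrete, here is a small client-side sketch in Python. It is not the diversified sampler and does not run inside Elasticsearch; it only illustrates the ranking the question asks for, by taking one hit per store in each pass over the original order. The function name diversify is made up for this example.

```python
from collections import defaultdict


def diversify(hits, key="store"):
    """Reorder hits so that the n-th hit of any store comes after
    the (n-1)-th hit of every store, keeping the original rank otherwise."""
    seen = defaultdict(int)
    ranked = []
    for position, hit in enumerate(hits):
        # occurrence = how many hits from this store were already ranked higher
        occurrence = seen[hit[key]]
        seen[hit[key]] += 1
        ranked.append((occurrence, position, hit))
    ranked.sort(key=lambda item: (item[0], item[1]))
    return [hit for _, _, hit in ranked]


hits = [
    {"name": "Red T-Shirt", "store": "Zara"},
    {"name": "Yellow T-Shirt", "store": "Zara"},
    {"name": "Red T-Shirt", "store": "Bershka"},
    {"name": "Green T-Shirt", "store": "Benetton"},
]

for hit in diversify(hits):
    print(hit["store"], "-", hit["name"])
```

With the sample data from the question, the second Zara item moves below the Bershka and Benetton items.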

ElasticSearch. How can I get one document without counting all documents by filter?

I want to get any document by a filter if it exists and I do not want ElasticSearch to count how many documents fit this filter.
Example, there are docs:
{"name": "dima", "age": 15},
{"name": "amid", "age": 15}
I want one document (size=1) where age is 15, and I don't want ElasticSearch to waste time counting all the matches.
I do not need this:
"hits": {
"total": {
"value": 2,
...
},
You can add a size field to tell Elasticsearch how many documents to return. If you only want one, set size to 1. See https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html
For example:
GET my-index/_search
{
"size": 1,
"query": {
"bool": {
"filter": [
{
"term": {
"age": "15"
}
}
]
}
}
}
Alternatively, you can specify the size in the query string parameters:
my-index/_search?size=1
I understand that this is not exactly what you want, because you'd rather Elasticsearch just stop looking as soon as it finds the first match, but I don't think that is a possibility. Check out the conversation on the Elastic forum here: https://discuss.elastic.co/t/how-to-stop-searching-on-first-match/15507/3
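As a possible follow-up: Elasticsearch 7.x, which matches the "total": {"value": ...} response format shown in the question, has two request-body options that get close to this. track_total_hits set to false skips the accurate hit count, and terminate_after limits how many documents each shard collects before stopping. The sketch below only builds the body in Python; the helper name first_match_body is made up, and I have not measured how much time this saves in practice.

```python
import json


def first_match_body(field, value):
    """Build a search body that asks for one hit and no total hit count."""
    return {
        "size": 1,
        "track_total_hits": False,  # do not count all matching documents
        "terminate_after": 1,       # per shard: stop collecting after one document
        "query": {"bool": {"filter": [{"term": {field: value}}]}},
    }


body = first_match_body("age", 15)
print(json.dumps(body, indent=2))
```

Note that terminate_after applies per shard, so several shards may each find a document before the single requested hit is returned.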

Searching for a field in AWS ElasticSearch

After indexing ddb records into ElasticSearch, when doing a simple search /_search?q=test, I see the hits shown like this
"hits": [
{
// ignore other fields ...
"_id": "z0YdS3I",
"_source": {
"M": {
"name": {
"S": "test name"
},
"age": {
"N": "18"
},
// ignore other fields ...
}
}
},
....
]
However, when I search for a specific field, e.g. /_search?q=name:test, I get zero hits. This happens with every field.
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
So instead I have to search like this: _search?q=M.name.S:test, which is a bit cumbersome. I'm wondering if there's a cleaner way to search for a field. Maybe I'm missing some configuration during the indexing step?
You could try this:
First define explicit mappings for your index, according to your requirements. For example:
"properties": {
"name": { "type": "text" },
"age": { "type": "integer" }
...
}
Then check that the mappings were applied properly using the /_mapping API. Once you see the data types you want, start indexing data.
Details of mappings: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
I found out I could use the DynamoDB Converter provided by the AWS SDK to convert back and forth between a JavaScript object and its equivalent DDB AttributeValue type. That way I can index a document with the right mapping and access it through the normal field names.
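For illustration, here is a hand-rolled Python sketch of the same idea: turning a DynamoDB AttributeValue structure into a plain document before indexing. This is not the AWS SDK converter, it only handles the S, N, BOOL, NULL, M and L type tags, and the function name unmarshal is an assumption for this example.

```python
def unmarshal(attr):
    """Convert one DynamoDB AttributeValue dict into a plain Python value."""
    tag, value = next(iter(attr.items()))
    if tag == "S":
        return value
    if tag == "N":
        # DynamoDB sends numbers as strings
        return int(value) if value.lstrip("-").isdigit() else float(value)
    if tag == "BOOL":
        return value
    if tag == "NULL":
        return None
    if tag == "M":
        return {key: unmarshal(item) for key, item in value.items()}
    if tag == "L":
        return [unmarshal(item) for item in value]
    raise ValueError(f"unsupported AttributeValue type: {tag}")


# Same shape as the _source shown in the question.
source = {"M": {"name": {"S": "test name"}, "age": {"N": "18"}}}
doc = unmarshal(source)
print(doc)
```

Indexing the flattened document is what makes a query such as q=name:test address the field directly, instead of M.name.S.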

Change the structure of ElasticSearch response json

In some cases, I don't need all of the fields in response json.
For example,
// request json
{
"_source": false,
"aggs": { ... },
"query": { ... }
}
// response json
{
"took": 123,
"timed_out": false,
"_shards": { ... },
"hits": {
"total": 123,
"max_score": 123,
"hits": [
{
"_index": "foo",
"_type": "bar",
"_id": "123",
"_score": 123
}
],
...
},
"aggregations": {
"foo": {
"buckets": [
{
"key": 123,
"doc_count": 123
},
...
]
}
}
}
Actually I don't need the _index/_type every time. When I do aggregations, I don't need hits block.
"_source" : false or "_source": { "exclude": [ "foobar" ] } can help ignore/exclude the _source fields in hits block.
But can I change the structure of ES response json in a more common way? Thanks.
I recently needed to "slim down" the Elasticsearch response, as it was well over 1 MB of JSON, and I started using the filter_path request parameter.
It lets you include or exclude specific fields and supports different types of wildcards. Do read the docs for filter_path, as there is quite a lot of information there.
eg.
_search?filter_path=aggregations.**.hits._source,aggregations.**.key,aggregations.**.doc_count
This reduced the response size by half (in my case) without significantly increasing the search duration, so it was well worth the effort.
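A small sketch of building such a URL in Python, so the comma-separated filter list stays readable in code. The index name my-index and the helper name search_url are assumptions for illustration.

```python
from urllib.parse import urlencode


def search_url(index, filters):
    """Build a _search path with a filter_path query parameter."""
    params = {"filter_path": ",".join(filters)}
    # keep '*' and ',' unescaped so the URL stays readable
    return f"/{index}/_search?" + urlencode(params, safe="*,")


url = search_url(
    "my-index",
    ["aggregations.**.key", "aggregations.**.doc_count"],
)
print(url)
```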
In the hits section, you will always have the _index, _type and _id fields. If you want to retrieve only some specific fields in your search results, you can use the fields parameter in the root object:
{
"query": { ... },
"aggs": { ... },
"fields":["fieldName1","fieldName2", etc...]
}
When doing aggregations, you can use the search_type parameter with the value count, like this:
GET index/type/_search?search_type=count
It won't return any document but only the result count, and your aggregations will be computed in the exact same way.
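One caveat for newer clusters: search_type=count was deprecated in Elasticsearch 2.0 and removed in 5.0, and setting "size": 0 in the request body gives the same effect. A minimal sketch of such a body, with a made-up helper name and a made-up aggregation on a field called category:

```python
import json


def aggs_only_body(aggs, query=None):
    """Build a search body that returns aggregations but no hits."""
    body = {"size": 0, "aggs": aggs}
    if query is not None:
        body["query"] = query
    return body


# "by_category" and "category" are hypothetical names.
body = aggs_only_body({"by_category": {"terms": {"field": "category"}}})
print(json.dumps(body, indent=2))
```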

Can I use ElasticSearch Facets as an equivalent to GROUP BY and how?

I'm wondering if I can use the ElasticSearch Facets feature to replace the GROUP BY feature used in relational databases, or even in a Sphinx client?
If so, beside the official documentation, can someone point out a good tutorial to do so?
EDIT :
Let's consider an SQL table products in which I have the following fields :
id
title
description
price
etc.
I omitted the other fields in the table because I don't want to put them into my ES index.
I've indexed my database with ElasticSearch.
A product is not unique in the index. We can have the same product with different price offers and I wish to group them by price range.
Facets give you the number of documents in which a particular term is present for a particular field.
Now let's suppose you have an index named tweets, with type tweet and field "name".
A facet query for the field "name" would be:
curl -XPOST "http://localhost:9200/tweets/tweet/_search?search_type=count" -d'
{
"facets": {
"name": {
"terms": {
"field": "name"
}
}
}
}'
The response you get looks like this:
"hits": {
"total": 3475368,
"max_score": 0,
"hits": []
},
"facets": {
"name": {
"_type": "terms",
"total": 3539206,
"other": 3460406,
"terms": [
{
"term": "brickeyee",
"count": 9205
},
{
"term": "ken_adrian",
"count": 9160
},
{
"term": "rhizo_1",
"count": 9143
},
{
"term": "purpleinopp",
"count": 8747
}
....
....
This is called a terms facet, since the counts are term-based. There are other facet types as well, covered in the documentation.
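For the price-range grouping the question asks about, here is a sketch using the range aggregation, since facets were removed in Elasticsearch 2.0 in favour of aggregations. The cut points 50 and 100 and the helper names are assumptions for illustration; the body is only built and printed, not sent to a cluster.

```python
import json


def range_buckets(bounds):
    """Turn sorted cut points into open-ended range buckets."""
    buckets = [{"to": bounds[0]}]
    for low, high in zip(bounds, bounds[1:]):
        buckets.append({"from": low, "to": high})
    buckets.append({"from": bounds[-1]})
    return buckets


def price_range_body(bounds, field="price"):
    """Build an aggregation-only body that groups documents by price range."""
    return {
        "size": 0,
        "aggs": {
            "price_ranges": {
                "range": {"field": field, "ranges": range_buckets(bounds)}
            }
        },
    }


body = price_range_body([50, 100])
print(json.dumps(body, indent=2))
```

Each bucket in the response carries a doc_count, which plays the role of COUNT(*) per group in the SQL analogy.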
