Does an empty field in a document take up space in elasticsearch?
For example, in the case below, is the total amount of space used to store the document the same in Case A as in Case B (assuming the field "colors" is defined in the mapping).
Case A
{
  "price": 1,
  "colors": []
}
Case B
{
  "price": 1
}

If you keep the default settings, the original document is stored in the _source field, so there will be a difference: the original document of case A is bigger than that of case B.
Otherwise, there should be no difference: for case A, no term is added to the index for the colors field, as it is empty.
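The _source difference is easy to see outside Elasticsearch by comparing the serialized sizes of the two documents (a minimal Python sketch; the actual on-disk size of _source also depends on compression):

```python
import json

# Case A keeps the empty "colors" array in the stored _source.
case_a = {"price": 1, "colors": []}
# Case B omits the field entirely.
case_b = {"price": 1}

size_a = len(json.dumps(case_a, separators=(",", ":")).encode("utf-8"))
size_b = len(json.dumps(case_b, separators=(",", ":")).encode("utf-8"))

print(size_a, size_b)  # 23 11: the empty array still costs _source bytes
```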
You can use the _size field to see the size of the original document as indexed, which is the size of the _source field:
POST stack
{
  "mappings": {
    "features": {
      "_size": { "enabled": true, "store": true },
      "properties": {
        "price": {
          "type": "byte"
        },
        "colors": {
          "type": "string"
        }
      }
    }
  }
}
PUT stack/features/1
{
  "price": 1
}
PUT stack/features/2
{
  "price": 1,
  "colors": []
}
POST stack/features/_search
{
  "fields": [
    "_size"
  ]
}
The last query outputs the following result, which shows that document 2 takes more space than document 1:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "stack",
        "_type": "features",
        "_id": "1",
        "_score": 1,
        "fields": {
          "_size": 16
        }
      },
      {
        "_index": "stack",
        "_type": "features",
        "_id": "2",
        "_score": 1,
        "fields": {
          "_size": 32
        }
      }
    ]
  }
}

Related

Searching within a percentile in elastic [duplicate]

Say I want to filter documents by some field falling within the 10th to 20th percentile. I'm wondering if that's possible with some simple query, something like {"fieldName": {"percentile": [0.1, 0.2]}}.
Say I have these documents:
[{"a":1,"b":101},{"a":2,"b":102},{"a":3,"b":103}, ..., {"a":100,"b":200}]
I need to filter the first 10% of them by a (in ascending order), which would be a from 1 to 10. Then I need to sort those results by b in descending order and take the paginated result (e.g. page No. 2, with 10 items per page).
One solution in mind would be:
get the total count of the documents,
sort the documents by a and take the corresponding _id values, with a limit of 0.1 * total_count,
write the final query, something like id in (...) order by b.
But the shortcomings are pretty obvious too:
it doesn't seem efficient if we're talking about sub-second latency,
the second query might not work if the first query returns too many _id values (ES only allows 1000 by default; I can change the config of course, but there's always a limit).
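The plan above can be sketched in plain Python against hypothetical in-memory data standing in for the index (an illustration of the logic, not an efficient query):

```python
# Hypothetical data set: a = 1..100, b = 101..200.
docs = [{"a": i, "b": 100 + i} for i in range(1, 101)]

# Steps 1-2: sort by "a" and keep the 10th-20th percentile window.
by_a = sorted(docs, key=lambda d: d["a"])
n = len(by_a)
window = by_a[int(n * 0.10):int(n * 0.20)]  # here: a from 11 to 20

# Step 3: sort that window by "b" descending, then paginate.
page_size, page_no = 10, 0
by_b = sorted(window, key=lambda d: d["b"], reverse=True)
page = by_b[page_no * page_size:(page_no + 1) * page_size]

print([d["a"] for d in page])  # [20, 19, 18, 17, 16, 15, 14, 13, 12, 11]
```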
I doubt there is a way to do this in one query if the exact values of a are not known beforehand, but I think a pretty efficient approach is feasible.
I would suggest running a percentiles aggregation as the first query and a range query as the second.
My sample index has only 14 documents, so for explanatory purposes I will find the documents that fall between the 30th and 60th percentiles of field a and sort them by field b in descending order (so we can be sure the sort worked).
Here are the docs I inserted:
{"a":1,"b":101}
{"a":5,"b":105}
{"a":10,"b":110}
{"a":2,"b":102}
{"a":6,"b":106}
{"a":7,"b":107}
{"a":9,"b":109}
{"a":4,"b":104}
{"a":8,"b":108}
{"a":12,"b":256}
{"a":13,"b":230}
{"a":14,"b":215}
{"a":3,"b":103}
{"a":11,"b":205}
Let's find out the bounds for field a between the 30th and 60th percentiles:
POST my_percent/doc/_search
{
  "size": 0,
  "aggs": {
    "percentiles": {
      "percentiles": {
        "field": "a",
        "percents": [ 30, 60, 90 ]
      }
    }
  }
}
With my sample index it looks like this:
{
  ...
  "hits": {
    "total": 14,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "percentiles": {
      "values": {
        "30.0": 4.9,
        "60.0": 8.8,
        "90.0": 12.700000000000001
      }
    }
  }
}
Now we can use the boundaries to do the range query:
POST my_percent/doc/_search
{
  "query": {
    "range": {
      "a": {
        "gte": 4.9,
        "lte": 8.8
      }
    }
  },
  "sort": {
    "b": "desc"
  }
}
And the result is:
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": null,
    "hits": [
      {
        "_index": "my_percent",
        "_type": "doc",
        "_id": "vkFvYGMB_zM1P5OLcYkS",
        "_score": null,
        "_source": {
          "a": 8,
          "b": 108
        },
        "sort": [
          108
        ]
      },
      {
        "_index": "my_percent",
        "_type": "doc",
        "_id": "vUFvYGMB_zM1P5OLWYkM",
        "_score": null,
        "_source": {
          "a": 7,
          "b": 107
        },
        "sort": [
          107
        ]
      },
      {
        "_index": "my_percent",
        "_type": "doc",
        "_id": "vEFvYGMB_zM1P5OLRok1",
        "_score": null,
        "_source": {
          "a": 6,
          "b": 106
        },
        "sort": [
          106
        ]
      },
      {
        "_index": "my_percent",
        "_type": "doc",
        "_id": "u0FvYGMB_zM1P5OLJImy",
        "_score": null,
        "_source": {
          "a": 5,
          "b": 105
        },
        "sort": [
          105
        ]
      }
    ]
  }
}
Note that the results of percentiles aggregation are approximate.
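For reference, the bounds returned above can be reproduced with a plain linear-interpolation percentile over the 14 sample values of a (a sketch; Elasticsearch uses an approximate t-digest, which just happens to agree exactly on a data set this small):

```python
def percentile(values, p):
    """Percentile (0 <= p <= 100) via linear interpolation between closest ranks."""
    xs = sorted(values)
    rank = (len(xs) - 1) * p / 100.0
    lo = int(rank)
    frac = rank - lo
    if lo + 1 < len(xs):
        return xs[lo] + (xs[lo + 1] - xs[lo]) * frac
    return xs[lo]

a_values = list(range(1, 15))  # the a values 1..14 from the sample docs

print(percentile(a_values, 30))  # about 4.9
print(percentile(a_values, 60))  # about 8.8
print(percentile(a_values, 90))  # about 12.7
```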
In general, this looks like a task better solved by pandas or a Spark job.
Hope that helps!

Elasticsearch query with fuzziness AUTO not working as expected

From the Elasticsearch documentation regarding fuzziness:
AUTO
Generates an edit distance based on the length of the term. Low and high distance arguments may be optionally provided as AUTO:[low],[high]. If not specified, the default values are 3 and 6, equivalent to AUTO:3,6, which makes for term lengths:
0..2: must match exactly
3..5: one edit allowed
>5: two edits allowed
However, when I try to specify low and high distance arguments in the search query, the result is not what I expect.
I am using Elasticsearch 6.6.0 with the following index mapping:
{
  "fuzzy_test": {
    "mappings": {
      "_doc": {
        "properties": {
          "description": {
            "type": "text"
          },
          "id": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
Inserting a simple document:
{
  "id": "1",
  "description": "hello world"
}
And the following search query:
{
  "size": 10,
  "timeout": "30s",
  "query": {
    "match": {
      "description": {
        "query": "helqo",
        "fuzziness": "AUTO:7,10"
      }
    }
  }
}
I assumed that fuzziness AUTO:7,10 would mean that for an input term of length <= 6, only documents with an exact match would be returned. However, here is the result of my query:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.23014566,
    "hits": [
      {
        "_index": "fuzzy_test",
        "_type": "_doc",
        "_id": "OQtUu2oBABnEwrgM3Ejr",
        "_score": 0.23014566,
        "_source": {
          "id": "1",
          "description": "hello world"
        }
      }
    ]
  }
}
This is strange, but it seems that this bug exists only in Elasticsearch 6.6.0. I've tried 6.4.2 and 6.6.2, and both of them work just fine.
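For reference, the documented AUTO rule can be written down as a tiny function (an illustration of the quoted docs, not Elasticsearch code). It shows why, under AUTO:7,10, the five-character term helqo should allow zero edits, so matching hello world really does look like a 6.6.0 bug:

```python
def auto_fuzziness(term, low=3, high=6):
    """Allowed edit distance for a term under fuzziness AUTO:[low],[high]."""
    n = len(term)
    if n < low:
        return 0  # must match exactly
    if n < high:
        return 1  # one edit allowed
    return 2      # two edits allowed

print(auto_fuzziness("helqo"))                  # 1: default AUTO:3,6 allows one edit
print(auto_fuzziness("helqo", low=7, high=10))  # 0: AUTO:7,10 requires an exact match
```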

Custom scoring function in Elasticsearch does not return expected field value

I created a custom scoring function for my documents that just returns the value of the field a for each document. But for some reason, in the example below, the last digits of the _score in the results differ from the last digits of the value of a for each document. What is happening here?
PUT test/doc/1
{
  "a": 851459198
}
PUT test/doc/2
{
  "a": 984968088
}
GET test/_search
{
  "query": {
    "function_score": {
      "script_score": {
        "script": {
          "inline": "doc[\"a\"].value"
        }
      }
    }
  }
}
That will return the following:
{
  "took": 16,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 984968060,
    "hits": [
      {
        "_index": "test",
        "_type": "doc",
        "_id": "2",
        "_score": 984968060,
        "_source": {
          "a": 984968088
        }
      },
      {
        "_index": "test",
        "_type": "doc",
        "_id": "1",
        "_score": 851459200,
        "_source": {
          "a": 851459198
        }
      }
    ]
  }
}
Why is the _score different from the value of the field a?
I'm using Elasticsearch 2.1.1.
The _score value is internally a 32-bit float, which can exactly represent integers only up to 2^24 = 16,777,216. If the scoring function uses a field whose value is a larger integer, the score is rounded to the nearest value the float can represent, which is why the last digits differ. See this github issue
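You can reproduce the rounding with a 32-bit float round trip (a Python sketch using struct; the values are the ones from the example above). The response shows 984968060 rather than 984968064, presumably because Elasticsearch prints the shortest decimal string that maps back to the same 32-bit float:

```python
import struct

def to_float32(x):
    """Round-trip a number through a 32-bit float, as happens to the _score."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

print(to_float32(851459198))  # 851459200.0, matching the _score above
print(to_float32(984968088))  # 984968064.0
print(to_float32(16777216))   # 16777216.0: 2**24 is still exact
print(to_float32(16777217))   # 16777216.0: 2**24 + 1 is not representable
```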

How to filter out elements from an array that don't match the query?

Stack Overflow won't let me write that much example code, so I put it on a gist.
So I have this index,
with this mapping.
Here is a sample document I insert into the newly created mapping,
and this is my query:
GET products/paramSuggestions/_search
{
  "size": 10,
  "query": {
    "filtered": {
      "query": {
        "match": {
          "paramName": {
            "query": "col",
            "operator": "and"
          }
        }
      }
    }
  }
}
This is the unwanted result I get from the previous query:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.33217794,
    "hits": [
      {
        "_index": "products",
        "_type": "paramSuggestions",
        "_id": "1",
        "_score": 0.33217794,
        "_source": {
          "productName": "iphone 6",
          "params": [
            {
              "paramName": "color",
              "value": "white"
            },
            {
              "paramName": "capacity",
              "value": "32GB"
            }
          ]
        }
      }
    ]
  }
}
And finally the wanted result, i.e. how I want the query result to look:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.33217794,
    "hits": [
      {
        "_index": "products",
        "_type": "paramSuggestions",
        "_id": "1",
        "_score": 0.33217794,
        "_source": {
          "productName": "iphone 6",
          "params": [
            {
              "paramName": "color",
              "value": "white"
            }
          ]
        }
      }
    ]
  }
}
What should the query look like to achieve the wanted result, with the array field filtered down to the items matching the query? In other words, all non-matching array items should not appear in the final result.
The final result is the _source document that you indexed; there is no feature that lets you mask field elements of your document out of the Elasticsearch response.
That said, depending on your goal, you can look into how highlighters and suggesters identify result terms matching the query, or possibly roll your own client-side masking using the info returned by setting "explain": true in your query.

Elasticsearch: update a source field without re-indexing the entire document

I have a document {customerID: 111, name: bob, approved: yes}.
The field "approved" is not indexed; its mapping is set as "approved": { "type": "string", "index": "no" }.
So only the fields "customerID" and "name" are indexed.
How can I update just the approved field in the _source without re-indexing the entire document? I can pass a partial document to update, such as {approved: no}.
Is this possible?
What you're looking for is a partial update. The catch is that this still implicitly performs a get + merge + re-index of the whole document; you just leave that work to Elasticsearch and avoid the extra network round trips of doing it yourself. In principle ES could skip re-indexing when only unindexed fields change, but as far as I know it doesn't do that optimization today.
POST so/t3/1
{
  "name": "Bob",
  "id": 1,
  "approved": "no"
}
GET so/t3/_search
POST so/t3/1/_update
{
  "doc": {
    "approved": "yes"
  }
}
Searching again returns the updated document:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "so",
        "_type": "t3",
        "_id": "1",
        "_score": 1,
        "_source": {
          "name": "Bob",
          "id": 1,
          "approved": "yes"
        }
      }
    ]
  }
}
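The merge that _update performs can be illustrated with an ordinary dict merge (a sketch of the semantics, not client code; Elasticsearch merges objects recursively and then re-indexes the whole result):

```python
def partial_update(source, doc):
    """Merge a partial doc into the stored _source, like POST .../_update {"doc": ...}.
    Elasticsearch then re-indexes the merged document in full."""
    merged = dict(source)
    for key, value in doc.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = partial_update(merged[key], value)  # merge nested objects
        else:
            merged[key] = value
    return merged

source = {"name": "Bob", "id": 1, "approved": "no"}
print(partial_update(source, {"approved": "yes"}))
# {'name': 'Bob', 'id': 1, 'approved': 'yes'}
```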
