Reduce data returned by ElasticSearch

I have the following query.
GET sales/_search
{
  "query": {
    "terms": {
      "ean": ["8719092410766", "8719092444716"]
    }
  },
  "_source": ["ean"],
  "size": 10000
}
Which gives me the following result.
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "sales",
"_type": "doc",
"_id": "CuDvcGIBmw7bqEEVBvZq",
"_score": 1,
"_source": {
"ean": "8719092444716"
}
},
{
"_index": "sales",
"_type": "doc",
"_id": "DeDvcGIBmw7bqEEVBvZq",
"_score": 1,
"_source": {
"ean": "8719092410766"
}
},
{
"_index": "sales",
"_type": "doc",
"_id": "9yHvcGIBbx4s3M8zD9_u",
"_score": 1,
"_source": {
"ean": "8719092410766"
}
}
]
}
}
This is a lot of data, and I am only interested in the sources. What I would like it to return is this:
["8719092444716", "8719092410766"]
Or something as close to it as possible. Is there any trick I can use to reduce the amount of data fetched from the database? I read about filter_path, but ElasticSearch 6.0 doesn't seem to recognize this keyword.

As you mentioned, you can use filter_path (docs). It is a parameter you add to the request's URL rather than a keyword in the request body, which may be why it was not recognized. It takes a comma-separated list of the response components you want to include. For example, if you are interested only in the hits and none of the ES metrics, you could run (curl example):
curl 'http://localhost:9200/index01/type01/_search?filter_path=hits.hits'
and get the following response:
{
"hits" : {
"hits" : [
{
"_index" : "index01",
"_id" : "6PHE_WIBts_g9zk4nzM5",
"_type" : "type01",
"_source" : {
"title" : "Radioactive Honeycomb"
},
"_score" : 1
}
]
}
}
Hope that helps (I'm using ES 6.0 btw).
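As far as I know, filter_path only trims the response and does not reshape it, so a bare array like the one in the question still needs a small client-side step. Below is a minimal Python sketch of that step. The helper name unique_eans is made up for illustration, and the response dict is typed in by hand to mimic what a path such as filter_path=hits.hits._source would leave of the response in the question.

```python
# Hypothetical helper: flatten a (trimmed) search response into unique "ean" values.
def unique_eans(response):
    seen = []
    for hit in response.get("hits", {}).get("hits", []):
        ean = hit.get("_source", {}).get("ean")
        # Keep first-seen order and skip duplicates.
        if ean is not None and ean not in seen:
            seen.append(ean)
    return seen


# Hand-written stand-in for the response from the question after trimming.
response = {
    "hits": {
        "hits": [
            {"_source": {"ean": "8719092444716"}},
            {"_source": {"ean": "8719092410766"}},
            {"_source": {"ean": "8719092410766"}},
        ]
    }
}

print(unique_eans(response))  # ['8719092444716', '8719092410766']
```

The same function also works on the untrimmed response, since it only reads hits.hits[]._source.ean and ignores everything else.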

Related

Function score ignored

I have two nearly identical documents, one of which has the fields CONSTRUCTION: 1 and EDUCATION: 0.1, the other with CONSTRUCTION: 0.1 and EDUCATION: 1. I want to be able to sort results by the value of either the CONSTRUCTION or the EDUCATION field. The following query:
GET /objects/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "name": {
            "query": "Monkeys"
          }
        }
      },
      "field_value_factor": {
        "field": "CONSTRUCTION",
        "missing": 1
      }
    }
  },
  "_source": ["name", "CONSTRUCTION", "EDUCATION"]
}
returns incorrect results:
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1.7622693,
"hits": [
{
"_index": "objects__feed_id_key_pages__date_2019-12-10__timestamp_1575988952__batch_id_3gpnz7fc__",
"_type": "_doc",
"_id": "dit:greatDomesticUi:KeyPages:12",
"_score": 1.7622693,
"_source": {
"CONSTRUCTION": 0.1,
"name": "Space Monkeys - education",
"EDUCATION": 1
}
},
{
"_index": "objects__feed_id_key_pages__date_2019-12-10__timestamp_1575988952__batch_id_3gpnz7fc__",
"_type": "_doc",
"_id": "dit:greatDomesticUi:KeyPages:11",
"_score": 1.0226655,
"_source": {
"CONSTRUCTION": 1,
"name": "Space Monkeys - construction",
"EDUCATION": 0.1
}
}
]
}
}
This always returns the same results. Indeed, if you misspell the field name in field_value_factor, you get the same score: "field_value_factor": { "field" : "WHATEVER",... }. This suggests the field simply isn't being read.
Dynamic mapping was turned off, so the EDUCATION and CONSTRUCTION fields were not mapped and could not be read by the function. Mystery solved!
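To make the symptom concrete, here is a toy Python model of it. It is not Elasticsearch code: it assumes the function multiplies the query score by the field value and falls back to the "missing" value when the field cannot be read, which is what the identical scores for a misspelled field suggest. The function name, the mapped_fields parameter and the base score of 1.0 are made up for illustration.

```python
def field_value_factor(query_score, doc, field, missing, mapped_fields):
    # Toy assumption: an unmapped field cannot be read, so "missing" is used.
    if field in mapped_fields:
        value = doc.get(field, missing)
    else:
        value = missing
    return query_score * value


docs = [
    {"name": "Space Monkeys - education", "CONSTRUCTION": 0.1, "EDUCATION": 1},
    {"name": "Space Monkeys - construction", "CONSTRUCTION": 1, "EDUCATION": 0.1},
]
base = 1.0  # hypothetical query score, identical for both documents

# Field not mapped: both documents keep the plain query score.
unmapped = [field_value_factor(base, d, "CONSTRUCTION", 1, set()) for d in docs]

# Field mapped: the score now follows the CONSTRUCTION value.
mapped = [
    field_value_factor(base, d, "CONSTRUCTION", 1, {"CONSTRUCTION", "EDUCATION"})
    for d in docs
]

print(unmapped)  # [1.0, 1.0]
print(mapped)    # [0.1, 1.0]
```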

How do I get an accurate sum in Elasticsearch based on source hits?

How do I get an exact sum aggregation in Elasticsearch? For reference, I am currently using Elasticsearch 5.6 and my index mapping looks like this:
{
"my-index":{
"mappings":{
"my-type":{
"properties":{
"id":{
"type":"keyword"
},
"fieldA":{
"type":"double"
},
"fieldB":{
"type":"double"
},
"fieldC":{
"type":"double"
},
"version":{
"type":"long"
}
}
}
}
}
}
The search query generated (using java client) is:
{
/// ... some filters here
"aggregations" : {
"fieldA" : {
"sum" : {
"field" : "fieldA"
}
},
"fieldB" : {
"sum" : {
"field" : "fieldB"
}
},
"fieldC" : {
"sum" : {
"field" : "fieldC"
}
}
}
}
However, the response I get is the following:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 3.8466966,
"hits": [
{
"_index": "my-index",
"_type": "my-type",
"_id": "25a203b63e264fd2be13db006684b06d",
"_score": 3.8466966,
"_source": {
"fieldC": 108,
"fieldA": 108,
"fieldB": 0
}
},
{
"_index": "my-index",
"_type": "my-type",
"_id": "25a203b63e264fd2be13db006684b06d",
"_score": 3.8466966,
"_source": {
"fieldC": -36,
"fieldA": 108,
"fieldB": 144
}
},
{
"_index": "my-index",
"_type": "my-type",
"_id": "25a203b63e264fd2be13db006684b06d",
"_score": 3.8466966,
"_source": {
"fieldC": -7.2,
"fieldA": 1.8,
"fieldB": 9
}
},
{
"_index": "my-index",
"_type": "my-type",
"_id": "25a203b63e264fd2be13db006684b06d",
"_score": 3.8466966,
"_source": {
"fieldC": 14.85,
"fieldA": 18.9,
"fieldB": 4.05
}
},
{
"_index": "my-index",
"_type": "my-type",
"_id": "25a203b63e264fd2be13db006684b06d",
"_score": 3.8466966,
"_source": {
"fieldC": 36,
"fieldA": 36,
"fieldB": 0
}
}
]
},
"aggregations": {
"fieldA": {
"value": 272.70000000000005
},
"fieldB": {
"value": 157.05
},
"fieldC": {
"value": 115.64999999999999
}
}
}
Why do I get:
115.64999999999999 instead of 115.65 in fieldC
272.70000000000005 instead of 272.7 in fieldA
Should I use float instead of double? Or is there a way I can change the query without using a Painless script and Java's BigDecimal with a specified precision and rounding mode?
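It has to do with floating-point precision. A double is a binary floating-point number and cannot represent most decimal fractions (such as 7.2 or 14.85) exactly, so small rounding errors show up in the sum. This is not specific to Elasticsearch; you can reproduce it in JavaScript, for example.
Here are two ways to check this:
A. If you have node.js installed, just type node at the prompt and then enter the sum of all fieldC values: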
$ node
108 - 36 - 7.2 + 14.85 + 36
115.64999999999999 <--- this is the answer
B. Open the Developer tools of your browser and pick the Console view. Then type the same sum as above:
> 108-36-7.2+14.85+36
< 115.64999999999999
As you can see, both results are consistent with what you're seeing in your ES response.
One way to circumvent this is to store your numbers either as plain integers (i.e. 1485 instead of 14.85, 3600 instead of 36, etc.) or as scaled_float with a scaling factor of 100 (or bigger, depending on the precision you need).
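The same check can be done in Python, which uses the same double-precision arithmetic. The sketch below compares the plain double sum of the fieldC values with two workarounds: summing integers scaled by 100 (the idea behind the integer / scaled_float suggestion) and decimal arithmetic (comparable in spirit to Java's BigDecimal). The variable names are illustrative only.

```python
from decimal import Decimal

# fieldC values from the hits in the question, kept as strings so that
# the Decimal version sees the exact decimal digits.
field_c = ["108", "-36", "-7.2", "14.85", "36"]

# Plain doubles, written out in the same order as the node example above,
# which reports 115.64999999999999 for this expression.
float_sum = 108 - 36 - 7.2 + 14.85 + 36

# Scaled integers (factor 100): every term becomes an exact integer,
# so the sum carries no rounding error.
scaled_sum = sum(round(float(v) * 100) for v in field_c)

# Decimal arithmetic keeps the decimal fractions exact.
decimal_sum = sum(Decimal(v) for v in field_c)

print(float_sum)
print(scaled_sum / 100)  # 115.65
print(decimal_sum)       # 115.65
```

Scaling once at indexing time keeps the aggregation itself in integer arithmetic; the division by 100 happens only when the value is displayed.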

ElasticSearch query with conditions on multiple documents

I have data in this format in Elasticsearch; each item is in a separate document:
{ 'pid': 1, 'nm' : 'tom'}, { 'pid': 1, 'nm' : 'dick'}, { 'pid': 1, 'nm' : 'harry'}, { 'pid': 2, 'nm' : 'tom'}, { 'pid': 2, 'nm' : 'harry'}, { 'pid': 3, 'nm' : 'dick'}, { 'pid': 3, 'nm' : 'harry'}, { 'pid': 4, 'nm' : 'harry'}
{
"took": 137,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 8,
"max_score": null,
"hits": [
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KS86AaDUbQTYUmwY",
"_score": null,
"_source": {
"pid": 1,
"nm": "Harry"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KJ9BAaDUbQTYUmwW",
"_score": null,
"_source": {
"pid": 1,
"nm": "Tom"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KRlbAaDUbQTYUmwX",
"_score": null,
"_source": {
"pid": 1,
"nm": "Dick"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KYnKAaDUbQTYUmwa",
"_score": null,
"_source": {
"pid": 2,
"nm": "Harry"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KXL5AaDUbQTYUmwZ",
"_score": null,
"_source": {
"pid": 2,
"nm": "Tom"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KbcpAaDUbQTYUmwb",
"_score": null,
"_source": {
"pid": 3,
"nm": "Dick"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9Kdy5AaDUbQTYUmwc",
"_score": null,
"_source": {
"pid": 3,
"nm": "Harry"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KetLAaDUbQTYUmwd",
"_score": null,
"_source": {
"pid": 4,
"nm": "Harry"
}
}
]
}
}
I need to find the pids which have 'harry' and do not have 'tom', which in the above example are 3 and 4. This essentially means: look for groups of documents sharing the same pid where none of them has nm with the value 'tom' but at least one of them has nm with the value 'harry'.
How do I query that?
EDIT: Using Elasticsearch version 5
What about a POST request with a body like the one below, where you use a bool query?
POST _search
{
  "query": {
    "bool": {
      "must": {
        "term": { "nm": "harry" }
      },
      "must_not": {
        "term": { "nm": "tom" }
      }
    }
  }
}
I am relatively new to Elasticsearch, so I might be wrong, but I have never seen such a query. Simple filters cannot be used here, as they are applied to individual documents (and not to aggregations), which is not what you want. What you want is a "group by" query with a "having" clause (in SQL terms). But group-by queries involve some aggregation (such as avg, max or min of a field), which is then used in the "having" clause. Basically, you use a reducer to post-process the aggregation results. For queries like this, the Bucket Selector Aggregation can be used.
But your case is different. You do not want to apply the "having" clause to a metric aggregation; you want to check whether some value is present in a field (or column) of your "group by" data. In SQL terms, you want to run a "where" query inside a "group by". This is what I have never seen.
However, at the application level, you can easily do this by splitting your query in two. First, find the unique pids where nm = harry using a terms aggregation. Then query those pids for docs with nm = tom and drop every pid that has one; the pids that remain have harry and no tom.
P.S. I am very new to ES, and I will be very happy if anyone contradicts me and shows a way to do this in one query. I will learn from that too.
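The two-step idea can be sketched in plain Python, with the second step written as an exclusion: pids that also have a 'tom' document are dropped. The in-memory list and the pids_with helper are stand-ins made up for illustration; in a real application each call would be a separate Elasticsearch request (for instance a terms aggregation on pid with a term filter on nm).

```python
# Documents from the question, held in memory instead of an index.
docs = [
    {"pid": 1, "nm": "tom"}, {"pid": 1, "nm": "dick"}, {"pid": 1, "nm": "harry"},
    {"pid": 2, "nm": "tom"}, {"pid": 2, "nm": "harry"},
    {"pid": 3, "nm": "dick"}, {"pid": 3, "nm": "harry"},
    {"pid": 4, "nm": "harry"},
]


def pids_with(name, candidates=None):
    # Stand-in for "unique pids of docs where nm == name",
    # optionally restricted to a set of candidate pids.
    return {
        d["pid"]
        for d in docs
        if d["nm"] == name and (candidates is None or d["pid"] in candidates)
    }


# Step 1: every pid that has at least one 'harry' document.
with_harry = pids_with("harry")

# Step 2: among those, the pids that also have a 'tom' document.
with_tom = pids_with("tom", with_harry)

# What remains has 'harry' and no 'tom'.
result = sorted(with_harry - with_tom)
print(result)  # [3, 4]
```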

Delete Indexes by index name and type using elasticSearch 2.3.3 in java

I have a Java project where I index data using Elasticsearch 2.3.3. The indexes are of two types.
My index doc looks like:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "movies",
"_id": "uReb0g9KSLKS18sTATdr3A",
"_score": 1,
"_source": {
"genre": "Thriller"
}
},
{
"_index": "test_index",
"_type": "drama",
"_id": "cReb0g9KSKLS18sTATdr3B",
"_score": 1,
"_source": {
"genre": "SuperNatural"
}
},
{
"_index": "index1",
"_type": "drama",
"_id": "cReb0g9KSKLS18sT76ng3B",
"_score": 1,
"_source": {
"genre": "Romance"
}
}
]
}
}
I need to delete indexes of a particular name and type only.
For example, from the above doc, I want to delete indexes with the name "test_index" and the type "drama".
So the result should look like:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "movies",
"_id": "uReb0g9KSLKS18sTATdr3A",
"_score": 1,
"_source": {
"genre": "Thriller"
}
},
{
"_index": "index1",
"_type": "drama",
"_id": "cReb0g9KSKLS18sT76ng3B",
"_score": 1,
"_source": {
"genre": "Romance"
}
}
]
}
}
Solutions tried:
client.admin().indices().delete(new DeleteIndexRequest("test_index")).actionGet();
But it deletes both indexes with the name "test_index".
I have also tried various queries in the Sense (beta) plugin, like:
DELETE /test_index/drama
It gives the error: No handler found for uri [/test_index/drama] and method [DELETE]
DELETE /test_index/drama/_query?q=_id:*&analyze_wildcard=true
It also doesn't work.
When I fire the delete request, the ids of the indexes are unknown to us, so I have to delete the indexes by name and type only.
How can I delete the required indexes using the Java API?
This was possible before ES 2.0 using the Delete Mapping API; however, since 2.0 the Delete Mapping API no longer exists.
To do this you will have to install the Delete by Query plugin. Then you can simply do a match all query on your index and type and then delete all of them.
The query will look something like this:
DELETE /test_index/drama/_query
{
  "query": {
    "match_all": {}
  }
}
Also keep in mind that this will delete the documents in the mapping and not the mapping itself. If you want to remove the mapping too, you'll have to reindex without the mapping.
This might be able to help you with the Java implementation.

MLT (More Like This) elasticsearch query

I'm trying to use elasticsearch MLT (More Like This) query.
Only one doc in store:
{
"_index": "monitors",
"_type": "monitor",
"_id": "AVTnvJ8SancUpEdFLMiq",
"_score": 1,
"_source": {
"ProcessGroup": "test",
"ProcessName": "test",
"OpName": "test",
"Domain": "test",
"LogLevel": "Info",
"StartDateTime": "2016-05-04 04:46:47",
"EndDateTime": "2016-05-04 04:47:47",
"MessageDateTime": "2016-05-04 04:46:47",
"ApplicationCode": "test",
"Status": "10"
}
}
Query:
POST /_search
{
  "query": {
    "more_like_this": {
      "fields": ["ProcessName"],
      "like": "test",
      "min_term_freq": 1,
      "max_query_terms": 12
    }
  }
}
ProcessName is a not_analyzed field.
I expected to get this document as a response, but instead I got nothing:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
Why is that?
Another question:
Suppose I have search engine docs and I search for "stph". I expect to get a "Stephan Curry" suggestion because it's commonly searched. Fuzzy search doesn't fit because the edit distance is greater than 2, so is an MLT query a good option for this scenario?
