ElasticSearch not sorting results

I'm trying to sort the results based on a numeric field. Here is my mapping:
{
  "elasticie": {
    "mappings": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "number": {
          "type": "long"
        }
      }
    }
  }
}
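For completeness, this mapping can also be created explicitly instead of being inferred dynamically when the first document is indexed. A minimal sketch with the same elasticsearch-py client used below (the explicit create call is an addition for illustration, not part of the original setup):
from elasticsearch import Elasticsearch

es = Elasticsearch('http://127.0.0.1:9200')
# Hedged sketch: create the index up front with the mapping shown above.
es.indices.create(index='elasticie', body={
    'mappings': {
        'properties': {
            'name': {
                'type': 'text',
                'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}},
            },
            'number': {'type': 'long'},
        }
    }
})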
I'm using Python, and this is my testing data:
data = [
    {'name': 'sElwUYiLXGHaQCKbdxtnvVzqIehfFWkJcPSTurgNoRD', 'number': 8583},
    {'name': 'XJEtNsIFfcwHTMhqAvRkiygjbUGzZQPdS', 'number': 8127},
    {'name': 'ZIeAGosUKJbjOdylM', 'number': 5862},
    {'name': 'HYvcafoXkC', 'number': 7458},
    {'name': 'tATJCjNuizOlGckXBpyVqSQL', 'number': 530},
    {'name': 'TFYixotjhXzNZPvHnkraRDpAMEImJfqdcVGLC', 'number': 7052},
    {'name': 'JCEGfoKDHRrcIkPQSqiVgNshZOBaMdXjAlxwUzmeWLy', 'number': 6168},
    {'name': 'IpCTwUAQynSizJtcsuDmbX', 'number': 6492},
    {'name': 'fTrcoXSBJNFhAkzWpDMxsEiLmZRvgnC', 'number': 382},
    {'name': 'ulVNmqKTpPXfEIdiykhDjMrUGOYazLBFvgnWwsRtJoQbxSe', 'number': 2061}
]
Using the following code, I create the index and insert the data:
from elasticsearch import Elasticsearch
from data import data  # the data shown above

INDEX = 'elasticie'
es = Elasticsearch('http://127.0.0.1:9200')

for doc in data:
    es.index(index=INDEX, body=doc)
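One aside worth knowing (it is not the cause of the sorting problem below): newly indexed documents only become visible to search after a refresh, so a quick test script may want to force one:
# Make the freshly indexed documents searchable immediately.
es.indices.refresh(index=INDEX)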
I'm trying to sort the data based on number, ascending or descending. Here is what I tried so far:
es.search(index=INDEX, params={'sort': {'number': {'order': 'asc'}}})
es.search(index=INDEX, params={'sort': {'number': 'asc'}})
es.search(index=INDEX, params={'sort': [('number', 'asc')]})
es.search(index=INDEX, params={'sort': {'number': {'order': 'asc', 'ignore_unmapped': True}}})
es.search(index=INDEX, params={'sort': {'number': {'order': 'asc', 'unmapped_type': 'integer'}}})
es.search(index=INDEX, params={'sort': {'number': {'order': 'asc', 'unmapped_type': 'long'}}})
es.search(index=INDEX, params={'sort': {'number.raw': 'asc'}})
None of the above methods worked for me; the results come back in the same order the data was inserted.
If I assign any of the above calls to a variable named search_result and print the result using the following code:
for index, result in enumerate(search_result['hits']['hits']):
    print(f'{index}. {result["_source"]["number"]}')
I'll get the following result:
0. 8583
1. 8127
2. 5862
3. 7458
4. 530
5. 7052
6. 6168
7. 6492
8. 382
9. 2061
This is obviously not sorted by the number field!
I don't know what I'm doing wrong. I'm using Elasticsearch 7.6 and Python 3.8.
How can I get sorting to work?
Update
Based on debugging logs, with the first method Python sends a GET request to the following URL:
http://127.0.0.1:9200/elasticie/_search?sort={%27number%27%3A+{%27order%27%3A+%27asc%27}}
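In hindsight, that URL shows the problem: everything in params is serialized onto the query string, so the dict arrives as URL-encoded text that Elasticsearch never interprets as a sort clause. Query-string sorting does exist, but it uses Elasticsearch's field:direction syntax, so presumably something like this would work (a sketch, same client as above):
# A query-string sort must be 'field:direction', not a serialized dict.
es.search(index=INDEX, params={'sort': 'number:asc'})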

I am not familiar with Python, but here is the Elasticsearch JSON query that sorts your documents by number in descending order. I tried it with your data set and it gives proper results.
Sort search query
{
  "sort": [
    {
      "number": {
        "order": "desc"
      }
    }
  ]
}
Results
"hits": [
{
"_index": "so-60598395-sort",
"_type": "_doc",
"_id": "1",
"_score": null,
"_source": {
"name": "sElwUYiLXGHaQCKbdxtnvVzqIehfFWkJcPSTurgNoRD",
"number": 8583
},
"sort": [
8583
]
},
{
"_index": "so-60598395-sort",
"_type": "_doc",
"_id": "2",
"_score": null,
"_source": {
"name": "XJEtNsIFfcwHTMhqAvRkiygjbUGzZQPdS",
"number": 8127
},
"sort": [
8127
]
},
{
"_index": "so-60598395-sort",
"_type": "_doc",
"_id": "4",
"_score": null,
"_source": {
"name": "HYvcafoXkC",
"number": 7862
},
"sort": [
7862
]
},
{
"_index": "so-60598395-sort",
"_type": "_doc",
"_id": "3",
"_score": null,
"_source": {
"name": "ZIeAGosUKJbjOdylM",
"number": 5862
},
"sort": [
5862
]
}
Edit: Based on the OP's comments, the Python library they are using supports sending the search as a POST request body, which is how they solved the issue. Refer to the comments on the question for more details.
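In elasticsearch-py terms, that means putting the sort clause in the request body instead of in params; a minimal sketch of what that could look like (body keys taken from the JSON query above):
# Send the sort clause in the request body, which the client POSTs.
search_result = es.search(index=INDEX, body={
    'query': {'match_all': {}},
    'sort': [{'number': {'order': 'asc'}}],
})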

My mistake. I had read the documentation and explored the code with the help and dir functions.
There is no parameter named sort defined on the Elasticsearch.search method; that's why I thought I should pass it as a key inside the params dict the method takes.
Thanks to @OpsterElasticSearchNinja and his comment, I realized there was something wrong with either the library or how I was using it.
Sending a POST request with the sort key in the request body worked well,
so I decided to read the library's source code and find out what was going on.
@query_params(
    # ...
    "size",
    "sort",
    # ...
)
def search(self, body=None, index=None, doc_type=None, params=None):
    # ...
This is how the sort parameter is defined: through a decorator, at runtime.
That's when I tried this code, and somehow it worked:
es.search(index=INDEX, sort=['number:asc'])
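Printing the hits the same way as before should now come out in ascending order (same ten documents as above):
search_result = es.search(index=INDEX, sort=['number:asc'])
for index, result in enumerate(search_result['hits']['hits']):
    print(f'{index}. {result["_source"]["number"]}')  # 0. 382, 1. 530, 2. 2061, ...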

Related

How to get inner hits field values in the Nest or Elastic.Net library? Alternatively, how to specify the output type in the Nest or Elastic.Net library?

I am new to Elasticsearch and I am having trouble with the Nest/Elastic.Net library.
I would like to retrieve not the entire document but just part of it. I am able to do it in Postman, but I cannot do it via the Elastic.Net or Nest library.
The document structure looks like the following:
{
  "Doc_id": "id_for_cross_refference_with_othersystem",
  "Ocr": [
    {
      "word": "example_word1",
      "box": [],
      "cord": "some_number"
    },
    {
      "word": "example_word2",
      "box": [],
      "cord": "some_number2"
    }
  ]
}
The document has a huge number of properties, but I am interested only in Doc_id, ocr.word, ocr.box and ocr.cord.
The following Postman request fully satisfies my needs:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "doc_id": "2a558865-7dc2-4e4d-ad02-3f683159984e"
          }
        },
        {
          "nested": {
            "path": "ocr",
            "query": {
              "match": {
                "ocr.word": "signing"
              }
            },
            "inner_hits": {
              "_source": {
                "includes": [
                  "ocr.word",
                  "ocr.box",
                  "ocr.conf"
                ]
              }
            }
          }
        }
      ]
    }
  },
  "_source": "false"
}
The result of that request is the following:
{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 18.99095,
    "hits": [
      {
        "_index": "irrelevant",
        "_type": "irrelevant",
        "_id": "irrelevant",
        "_score": 18.99095,
        "_source": {},
        "inner_hits": {
          "ocr": {
            "hits": {
              "total": {
                "value": 1,
                "relation": "eq"
              },
              "max_score": 7.9260864,
              "hits": [
                {
                  "_index": "irrelevant",
                  "_type": "irrelevant",
                  "_id": "irrelevant",
                  "_nested": {
                    "field": "ocr",
                    "offset": 11
                  },
                  "_score": 7.9260864,
                  "_source": {
                    "box": [],
                    "conf": "96.452858",
                    "word": "signing"
                  }
                }
              ]
            }
          }
        }
      },
      {
        "_index": "the_rest_of_the_object_is_omitted"
      },
      {
        "_index": "the_rest_of_the_object_is_omitted"
      }
    ]
  }
}
However, when I try to convert that request to the NEST query DSL, I am not able to achieve the same result.
When I use the NEST library I don't see any way to provide an output result model/type; it looks like the document type is expected to match the output type, which is not my case.
The query I am using:
var searchResponse = client2.Search<Model>(s => s
    .Query(q1 => q1.Bool(b1 => b1.Must(
        s1 => s1.Match(m => m.Field(f => f.doc_id).Query("2a558865-7dc2-4e4d-ad02-3f683159984e")),
        s2 => s2.Nested(n => n
            .Path("ocr")
            .Query(q2 => q2.Bool(b => b.Must(mu => mu.Match(m => m.Field(f => f.ocr.First().word).Query("signing")))))
            .InnerHits(ih => ih.Source(src => src.Includes(i => i
                .Field(f => f.ocr.First().word)
                .Field(f => f.ocr.First().conf))))))))
    .Source(false)
);
Because the Model type is created for the document and doesn't match the output type, I am getting [null, null, null] as the output.
There are properties such as Hits on ISearchResponse, but when I look into them I cannot see the field values.
I also tried using the low-level client (Elastic.Net) and providing the JSON request as a string, but there seems to be no way of specifying the output type there either; running my code with the low-level library returns three Model objects with empty fields.
My questions are:
Is it possible to specify an output type different from the document type with the NEST query DSL or the Elastic.Net library?
Is it possible to get the values of the fields I specified in the inner_hits request with the help of the NEST or Elastic.Net libraries?
How would you solve such a problem? We have huge documents and we don't want to pass unnecessary information back and forth. The inner hits approach looks like a neat solution for us, but it doesn't seem to work with the recommended libraries, unless I am making some silly mistake.
NOTE: I can achieve the desired result by using HttpClient and doing what I need manually, but I hope to leverage a library written for this purpose (NEST or Elastic.Net).

Elasticsearch record upsert with a complex _id field

I have to upsert bulk records into an Elasticsearch index, with _id being a combination of more than one field from the message. Can I do that? If so, please give me a sample JSON for it.
Regards
A sample _id field I am looking for would be something like below:
{
  "_index": "kpi_aggr",
  "_type": "KPIBackChannel",
  "_id": "<<<combination of name, period_type>>>",
  "_score": 1,
  "_source": {
    "name": "kpi-v1",
    "period_type": "w",
    "country": "AL",
    "pg_name": "DENTAL CARE",
    "panel_type": "retail",
    "number_of_records_with_proposal": 10000,
    "number_of_proposals": 80000,
    "overall_number_of_records": 2000,
    "@timestamp": 1442162810
  }
}
Naturally, you can specify your own Elasticsearch document ids during a call to the Index API:
PUT kpi_aggr/KPIBackChannel/kpi-v1,w
{
  "name": "kpi-v1",
  "period_type": "w",
  "country": "AL",
  "pg_name": "DENTAL CARE",
  "panel_type": "retail",
  "number_of_records_with_proposal": 10000,
  "number_of_proposals": 80000,
  "overall_number_of_records": 2000,
  "@timestamp": 1442162810
}
You can also do so during a _bulk API call:
POST _bulk
{ "index" : { "_index" : "kpi_aggr", "_type" : "KPIBackChannel", "_id" : "kpi-v1,w" } }
{"name":"kpi-v1","period_type":"w","country":"AL","pg_name":"DENTAL CARE","panel_type":"retail","number_of_records_with_proposal":10000,"number_of_proposals":80000,"overall_number_of_records":2000,"@timestamp":1442162810}
Notice that Elasticsearch will replace the document with the new version.
If you execute these two requests on an empty index, then querying by document id:
GET kpi_aggr/KPIBackChannel/kpi-v1,w
will give you the following:
{
  "_index": "kpi_aggr",
  "_type": "KPIBackChannel",
  "_id": "kpi-v1,w",
  "_version": 2,
  "found": true,
  "_source": {
    "name": "kpi-v1",
    "period_type": "w",
    "country": "AL",
    "pg_name": "DENTAL CARE",
    "panel_type": "retail",
    "number_of_records_with_proposal": 10000,
    "number_of_proposals": 80000,
    "overall_number_of_records": 2000,
    "@timestamp": 1442162810
  }
}
Notice "_version": 2, which in our case indicates that a document has been indexed twice, hence performed an "upsert" (but in general is meant to be used for Optimistic Concurrency Control).
Hope that helps!

ElasticSearch - Get key of searched value

I search for the keyword machine4 in my Elasticsearch. My Python client call is simply:
result = es.search('machine4', index='machines')
The result looks like this:
[
  {
    "_score": 0.13424811,
    "_type": "person",
    "_id": "2",
    "_source": {
      "date": "20180601",
      "deleted": [],
      "changed": [
        "machine1",
        "machine2",
        "machine3"
      ],
      "added": [
        "machine4",
        "machine5"
      ]
    },
    "_index": "contacts"
  },
  {
    "_score": 0.13424811,
    "_type": "person",
    "_id": "3",
    "_source": {
      "date": "20180701",
      "deleted": [
        "machine2"
      ],
      "changed": [
        "machine1",
        "machine4",
        "machine3"
      ],
      "added": [
        "machine7"
      ]
    },
    "_index": "contacts"
  }
]
So we can easily see:
On 20180601, machine4 appeared under added.
On 20180701, machine4 appeared under changed.
I can write another function to analyze the result: basically, loop through every key/value of each item and check whether the searched keyword belongs to it, like this:
for result in search_results['hits']['hits']:
    source_result = result['_source']
    for key, value in source_result.items():
        if 'machine4' in value:
            print(key)
However, I wonder whether ES has an API to detect which key/mapping/field the searched keyword belongs to. In this case it is added in the 1st result and changed in the 2nd result.
Thank you so much
Alex
The simple answer seems to be that no, Elasticsearch doesn't have a way to do this out of the box, because Lucene doesn't have it, as per this thread.
Elasticsearch has the concept of highlights, however. These could be useful, but they do require you to have some idea about which fields the match may be in.
The ES Python search documentation suggests there's no way to request this as a plain search parameter, but you can send a custom query with a highlight section in the request body. Given that the documents above have the fields added, changed and deleted, it would look something like this (a sketch; adjust the query clause to taste):
q = {
    'query': {'query_string': {'query': 'machine4'}},
    'highlight': {'fields': {'added': {}, 'changed': {}, 'deleted': {}}},
}
result = es.search(index='machines', body=q)
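The keys of each hit's highlight object then name the fields that actually matched, which is as close as this gets to the original question; a small sketch of reading them (field names assumed as above):
for hit in result['hits']['hits']:
    # 'highlight' maps each matching field name to its highlighted fragments
    matched_fields = list(hit.get('highlight', {}).keys())
    print(hit['_id'], matched_fields)  # e.g. 2 ['added'] and 3 ['changed']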
Hope this is helpful!

How do I query a subdocument in Mongoid using Ruby?

I have this document, and I only want part of it, but I'm not sure how to do this with a Mongoid query.
{
  "_id": {
    "$oid": "5297d6773865640002000000"
  },
  "saved_tweets": [
    {
      "_id": {
        "$oid": "52b0856b6535380002000000"
      },
      "saved_id": "123456",
      "tweet_ids": [
        "1",
        "2"
      ]
    },
    {
      "_id": {
        "$oid": "52b0856b6535380002000001"
      },
      "saved_id": "78901",
      "tweet_ids": [
        "3",
        "4"
      ]
    }
  ]
}
What I want is all the tweet_ids for a given saved_id. This is what I'm doing right now, which I think is very inefficient:
existing_user = User.find_by(:social_id => social_id)
existing_user.saved_tweets.each do |saved_tweet|
  if saved_id == saved_tweet.saved_id
    @saved_tweet_ids = saved_tweet.tweet_ids
  end
end
Did you try something like this?
user.saved_tweets.where(saved_id: saved_id).map(&:tweet_ids)

Leave out default Logstash fields in ElasticSearch

After processing data with input | filter | output > Elasticsearch, the format it gets stored in looks something like this:
"_index": "logstash-2012.07.02",
"_type": "stdin",
"_id": "JdRaI5R6RT2do_WhCYM-qg",
"_score": 0.30685282,
"_source": {
"#source": "stdin://dist/",
"#type": "stdin",
"#tags": [
"tag1",
"tag2"
],
"#fields": {},
"#timestamp": "2012-07-02T06:17:48.533000Z",
"#source_host": "dist",
"#source_path": "/",
"#message": "test"
}
I filter/store most of the important information in specific fields. Is it possible to leave out default fields like @source_path and @source_host? In the near future this will store 8 billion logs/month, and I would like to run some performance tests with these default fields excluded (I just don't use them).
This removes the fields from the output:
filter {
  mutate {
    # remove duplicate fields
    # this leaves @timestamp from the message and @source_path for the source
    remove => ["@timestamp", "@source"]
  }
}
Some of that will depend on what web interface you are using to view your logs. I'm using Kibana, and a custom logger (C#) that indexes the following:
{
  "_index": "logstash-2013.03.13",
  "_type": "logs",
  "_id": "n3GzIC68R1mcdj6Wte6jWw",
  "_version": 1,
  "_score": 1,
  "_source": {
    "@source": "File",
    "@message": "Shalom",
    "@fields": {
      "tempor": "hit"
    },
    "@tags": [
      "tag1"
    ],
    "level": "Info",
    "@timestamp": "2013-03-13T21:47:51.9838974Z"
  }
}
This shows up in Kibana, and the source fields are not there.
To exclude certain fields you can use the prune filter plugin.
filter {
  prune {
    blacklist_names => ["@timestamp", "@source"]
  }
}
The prune filter is not a default Logstash plugin and must be installed first:
bin/logstash-plugin install logstash-filter-prune
