Does Elasticsearch execute operations in a specific order?

I read that ES is near real-time, and therefore all index/create/update/delete etc. operations are not executed immediately.
Let's say I index 3 documents with the same id, in this order, with 1 millisecond between each, and then force a refresh:
{
  "_id": "A",
  "_source": { "text": "a" }
}
{
  "_id": "A",
  "_source": { "text": "b" }
}
{
  "_id": "A",
  "_source": { "text": "c" }
}
Then, if I search for a document with id "A", I will get 1 result, but which one?
When Elasticsearch performs a refresh, does it execute operations sequentially in the order in which they arrive?

In this instance it comes down to which indexing approach you take.
A bulk request does not guarantee that operations are applied in the order you submitted them. They may happen to be applied in that order in (some of) your tests, but Elasticsearch provides no guarantee of it.
You can manage this by specifying a version in your document, so that the document with the highest version is always the one that ends up indexed.
Indexing with 3 individual POSTs will be ordered, because you are making 3 separate, sequential requests one after the other. Each request has the same _id, so all of them are routed to the same shard and applied in the order in which they are received.
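
To illustrate the versioning point, here is a minimal Python sketch. It is not Elasticsearch code: it only models the rule used by external versioning (version_type=external), where a write is accepted only if its version is strictly greater than the stored one. The function name and the version numbers 1 to 3 are made up for the example.

```python
from itertools import permutations


def apply_external_versioned(ops):
    """Apply (version, source) writes for a single _id in arrival order.

    Models the external-versioning rule: a write is kept only when its
    version is strictly greater than the stored version; otherwise it is
    skipped (Elasticsearch would reject it with a version conflict).
    """
    stored_version, stored_source = None, None
    for version, source in ops:
        if stored_version is None or version > stored_version:
            stored_version, stored_source = version, source
    return stored_version, stored_source


# Hypothetical versions assigned by the client to the three writes.
writes = [(1, {"text": "a"}), (2, {"text": "b"}), (3, {"text": "c"})]

# Try every possible arrival order and collect the surviving text.
results = {
    apply_external_versioned(order)[1]["text"] for order in permutations(writes)
}
print(results)  # {'c'}
```

Whatever order the three writes arrive in, the one carrying the highest version is what remains, which is why versioning removes the dependency on arrival order.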

Related

Calculate field data size and store to other field at indexing time ElasticSearch 7.17

I am looking for a way to store the size of a field (bytes) in a new field of a document.
I.e. when a document is created with a field message that contains the value hello, I want another field message_size_bytes written that in this example has the value 5.
I am aware of the possibilities using _update_by_query and _search using scripting fields, but I have so much data that I do not want to calculate the sizes while querying but at index time.
Is there a possibility to do this using Elasticsearch 7.17 only? I do not have access to the data before it's passed to elasticsearch.

You can use an ingest pipeline with a script processor.
You can create the pipeline using the command below:
PUT _ingest/pipeline/calculate_bytes
{
  "processors": [
    {
      "script": {
        "description": "Calculate bytes of message field",
        "lang": "painless",
        "source": """
          ctx['message_size_bytes'] = ctx['message'].length();
        """
      }
    }
  ]
}
After creating the pipeline, you can pass its name while indexing data, as shown below (you can do the same from Logstash, Java or any other client):
POST 74906877/_doc/1?pipeline=calculate_bytes
{
  "message": "hello"
}
Result:
"hits": [
  {
    "_index": "74906877",
    "_id": "1",
    "_score": 1,
    "_source": {
      "message": "hello",
      "message_size_bytes": 5
    }
  }
]
Note that length() returns the number of characters (UTF-16 code units), which equals the number of bytes only for ASCII text such as hello.
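
One caveat worth checking against your data: the script uses length(), which on a Java string counts UTF-16 code units, not bytes. The small Python sketch below (helper names are made up, and it runs outside Elasticsearch) shows where the two numbers agree and where they diverge.

```python
def utf16_units(text):
    """Number of UTF-16 code units, i.e. what Java's String.length() reports."""
    return len(text.encode("utf-16-le")) // 2


def utf8_bytes(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Compare both measures for ASCII and non-ASCII samples.
for sample in ["hello", "héllo", "日本語"]:
    print(sample, utf16_units(sample), utf8_bytes(sample))
```

For plain ASCII messages such as hello both values are 5, so the pipeline result is a byte count. If your messages can contain non-ASCII characters, the stored value is a character count rather than a size in bytes.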

Elasticsearch query to get results irrespective of spaces in search text

I am trying to fetch data from Elasticsearch by matching on the field name. I have the following two records:
{
  "_index": "sam_index",
  "_type": "doc",
  "_id": "key",
  "_version": 1,
  "_score": 2,
  "_source": {
    "name": "Sample Name"
  }
}
and
{
  "_index": "sam_index",
  "_type": "doc",
  "_id": "key1",
  "_version": 1,
  "_score": 2,
  "_source": {
    "name": "Sample Name"
  }
}
When I search using texts like sam, sample, Sa, etc., I am able to fetch both records with a match_phrase_prefix query. The query I tried is:
GET sam_index/doc/_search
{
  "query": {
    "match_phrase_prefix": {
      "name": "sample"
    }
  }
}
I am not able to fetch the records when I search with the string samplen. I need to search and get results irrespective of spaces in the text. How can I achieve this in Elasticsearch?

First, you need to understand how Elasticsearch works and why it does or doesn't return a given result.
ES works on token matching. Documents that you index go through an analysis process, and the tokens generated by that process are stored in the inverted index, which is what is used for searching.
When you run a query, the query also generates search tokens. These can be the search text exactly as written, in the case of a term query, or tokens produced by the analyzer defined on the search field, in the case of a match query. Hence it's very important to understand the internals of your search query.
It's also very important to understand the mapping of your index. ES uses the standard analyzer by default on text fields.
You can use the Explain API to understand the internals of the query: which search tokens are generated by your search query, how documents matched them, and on what basis the score is calculated.
In your case, I created the name field as text using the word joined analyzer explained in Ignore spaces in Elasticsearch, and I was able to get the document containing sample name when searching for samplen.
Let us know if this is what you want to achieve and whether it solves your issue.
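
To make the token reasoning concrete, here is a rough Python model of it. It is a simplification and not the real analyzers: the "standard-like" function only lowercases and splits on whitespace, and the "joined" function only lowercases and removes whitespace. All function names are made up for the illustration.

```python
def standard_like_tokens(text):
    """Rough stand-in for the standard analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def joined_token(text):
    """Rough stand-in for a 'joined' analyzer: lowercase, drop all whitespace."""
    return "".join(text.lower().split())


def prefix_hits_any(tokens, query):
    """True if the lowercased query is a prefix of at least one indexed token."""
    return any(token.startswith(query.lower()) for token in tokens)


name = "Sample Name"
print(standard_like_tokens(name))                              # ['sample', 'name']
print(prefix_hits_any(standard_like_tokens(name), "sample"))   # True
print(prefix_hits_any(standard_like_tokens(name), "samplen"))  # False
print(prefix_hits_any([joined_token(name)], "samplen"))        # True
```

With separate tokens sample and name, nothing in the index starts with samplen, so the prefix query finds nothing. Once the field is indexed as the single token samplename, the same prefix matches.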

Elasticsearch: When just filtering, why use the filtered query type

What's the difference between
{
  "query": {
    "filtered": {
      "filter": { "term": { "folder": "inbox" } }
    }
  }
}
and
{
  "query": {
    "term": { "folder": "inbox" }
  }
}
It seems they both filter the index on the folder field by the inbox value.

A query can run in two types of context in Elasticsearch: query context and filter context. Query context tells how well a document matches the query, i.e. it calculates a score, whereas filter context only tells whether a document matches the query, and no scoring is done.
A query in query context tells you which documents match the query better. The higher the score, the more relevant the document.
A query in filter context behaves like a conditional operator, i.e. true if the document matches the query and false if it doesn't.
To answer your question: both queries will match the same documents, but the first query will not calculate a score (and will be faster than the second one because score calculation is skipped), whereas the second one will calculate a score and will be comparatively slower. So if you just want to filter, it is better to tell Elasticsearch that no score needs to be calculated by putting the query in filter context. This way you save the computational cost of scoring, which is pure overhead when only filtering is required. That is why there are two types of context.
Sample output for 1st query (filter context):
{
  "_index": "test",
  "_type": "_doc",
  "_id": "3",
  "_score": 0, <-------- no scoring done
}
Sample output for 2nd query (query context):
{
  "_index": "test",
  "_type": "_doc",
  "_id": "2",
  "_score": 0.9808292 <-------- score calculated
}
So use query context to get relevant matches and filter context to filter out documents. You can use the combination of both as well.
You can read more on query and filter context here.
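
As a side note for newer clusters: the filtered query shown in the question was deprecated in Elasticsearch 2.0 and removed in 5.0, where the usual way to get filter context is a bool query with a filter clause. The Python sketch below only builds such a request body as a dictionary; the helper name is made up and nothing is sent to a cluster.

```python
import json


def filter_only_query(field, value):
    """Build a search body that matches field == value in filter context only."""
    return {"query": {"bool": {"filter": [{"term": {field: value}}]}}}


# Same condition as in the question: folder must be "inbox".
body = filter_only_query("folder", "inbox")
print(json.dumps(body, indent=2))
```

The printed JSON can be used as the body of a _search request. Because the term clause sits under filter, no relevance score is computed for it.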

I agree with the answer above, but there is one thing to add: query and filter clauses can be used together in one query to reduce execution time.

Truncate and Index String values in Elasticsearch 2.3.x

I am running ES 2.3.3. I want to index a non-analyzed String but truncate it to a certain number of characters. The ignore_above property, according to the documentation, will NOT index a field above the provided value. I don't want that. I want to take say a field that could potentially be 30K long and truncate it to 10K long, but still be able to filter and sort on the 10K that is retained.
Is this possible in ES 2.3.3, or do I need to do this in Java prior to indexing a document?

I want to index a non-analyzed String but truncate it to a certain number of characters.
Technically it's possible with Update API and Upsert option, but, depending on your exact needs, it may not be very handy.
Let's say you want to index this document:
{
  "name": "foofoofoofoo",
  "age": 29
}
but you need to truncate the name field so that it has only 5 characters. Using the Update API, you'd have to execute a script:
POST http://localhost:9200/insert/test/1/_update
{
  "script": "ctx._source.name = ctx._source.name.substring(0,5);",
  "scripted_upsert": true,
  "upsert": {
    "name": "foofoofoofoo",
    "age": 29
  }
}
This means that, if ES does not find a document with the given id (here id=1), it should index the document inside the upsert element and run the given script on it. As you can see, this is rather inconvenient if you want automatically generated ids, because you have to provide the id in the URI.
Result:
GET http://localhost:9200/insert/test/1
{
  "_index": "insert",
  "_type": "test",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "name": "foofo",
    "age": 29
  }
}
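
If you can touch the document before it is sent (the question mentions doing it in Java prior to indexing), the truncation itself is a one-liner. Here is a minimal client-side sketch in Python; the helper name is made up. It also covers a case the script above does not: substring(0,5) throws on strings shorter than 5 characters, while a slice simply returns the shorter string.

```python
def truncate_field(doc, field, max_chars):
    """Return a copy of doc with doc[field] cut to at most max_chars characters."""
    truncated = dict(doc)  # shallow copy so the caller's document is untouched
    value = truncated.get(field)
    if isinstance(value, str):
        truncated[field] = value[:max_chars]
    return truncated


doc = {"name": "foofoofoofoo", "age": 29}
print(truncate_field(doc, "name", 5))              # {'name': 'foofo', 'age': 29}
print(truncate_field({"name": "foo"}, "name", 5))  # {'name': 'foo'}
```

The truncated copy is what you would then send in the index request, so automatically generated ids keep working.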

How to define document ordering based on filter parameter

Hi Elasticsearch experts.
I have a problem which might be related to the fact that I am indexing relational DB data.
My scenario is the following:
I have two entities:
documents and meetings.
Documents and meetings are independent entities, although it is possible to assign documents to meetings in a given order.
We are using a join table for this in the DB.
meetings(id,name,date)
document(id,title,author)
meeting_document(doc_id,meeting_id,order)
In Elasticsearch I am indexing the document ids as a NESTED property of the meeting.
Meeting example:
{
  id: 25,
  name: "test",
  documents: [22, 12, 24, 55]
}
I will fetch the meeting first. After this I would like to send a request for the documents, filtering on document.id, and ask Elasticsearch to return the results in the same order as the list of ids I passed to the filter.
What is the best way to implement this?
Thanks

Nice question.
I've spent some time figuring this out and came up with a solution. It might be a tricky one, but it works.
Let's have a look at my query.
I've used script_score to sort by a user-defined list.
POST index/type/_search
{
  "query": {
    "function_score": {
      "functions": [
        {
          "script_score": {
            "script": "ar.size()-ar.indexOf(doc['docid'].value)",
            "params": {
              "ar": ["1", "2", "4", "3"]
            }
          }
        }
      ]
    }
  },
  "filter": {
    "terms": {
      "docid": ["1", "2", "4", "3"]
    }
  }
}
The thing you have to take care of is to send the same values in the filter and in params, as in the query above.
This returns hits with doc ids 1, 2, 4, 3.
You have to change the field name inside the script and in the filter, and you can use a termQuery inside the query object.
I've tested the code. Hope this helps!
Thanks
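
To see why the script produces the requested order, here is the same formula replayed in plain Python (outside Elasticsearch; the hit order and names are made up for the example). Earlier positions in the list get higher scores, and hits are returned by descending score.

```python
def order_score(wanted_ids, doc_id):
    """Same formula as the script: list size minus the position of the id."""
    return len(wanted_ids) - wanted_ids.index(doc_id)


wanted = ["1", "2", "4", "3"]
hits = ["3", "1", "4", "2"]  # hypothetical order in which documents are found

# Elasticsearch sorts by descending score, so do the same here.
ranked = sorted(hits, key=lambda doc_id: order_score(wanted, doc_id), reverse=True)
print(ranked)  # ['1', '2', '4', '3']

# Alternative: fetch with a plain terms filter and reorder on the client.
position = {doc_id: i for i, doc_id in enumerate(wanted)}
reordered = sorted(hits, key=position.__getitem__)
print(reordered)  # ['1', '2', '4', '3']
```

The client-side variant avoids scripting entirely, at the cost of doing the sort in your application. For a handful of ids per meeting that is usually cheap.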
