How to retrieve all documents with version n in Elasticsearch

ElasticSearch comes with versioning https://www.elastic.co/blog/versioning
Maybe I misunderstood the meaning of the versioning here.
I want to find all the documents that are in version 1 so I can update them.
An obvious way is to go through all the documents one by one and select those that are in version 1.
Question:
Is it possible to retrieve all the Documents that are in version 1 with ONE query?

Because of Elasticsearch's distributed nature, it needs a way to ensure that changes are applied in the correct order. This is where _version comes into play. It's an internal way of making sure that an older version of a document never overwrites a newer version.
You can also use _version as a way to make sure that the document you want to delete / update hasn't been modified in the meantime - this is done by specifying the version number in the URL; for example PUT /website/blog/1?version=5 will succeed only if the current _version of the document stored in the index is 5.
You can read more about it here: Optimistic Concurrency Control
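As a small illustration of that pattern (the document body here is a made-up placeholder; only the URL shape comes from the answer above):

# Succeeds only if the stored document is still at _version 5;
# otherwise Elasticsearch returns a 409 version-conflict error.
# (Recent Elasticsearch releases prefer if_seq_no / if_primary_term for this check.)
PUT /website/blog/1?version=5
{
  "title": "My first blog entry",
  "text": "Updated text..."
}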
To answer your question,
Is it possible to retrieve all the Documents that are in version 1 with ONE query?
No.

You can use scripted _reindex into an empty temporary index. The target index will then contain just those documents that have _version=1.
You can also add a query stanza to limit the raw input (using the inverted index, faster), as well as further Painless conditions (per document, more flexible); a variant with a query stanza is sketched after the example below.
# ES5; use "source" instead of "inline" for later ES versions.
http POST :9200/_reindex <<END
{
  "conflicts": "proceed",
  "source": {
    "index": "your-source-index"
  },
  "dest": {
    "index": "some-temp-index",
    "version_type": "external"
  },
  "script": {
    "lang": "painless",
    "inline": "if (ctx._version != 1) { ctx.op = 'delete' }"
  }
}
END
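For instance, a hedged variant with a query stanza added, sketched in Kibana console syntax with the later-ES "source" key; the status field and its value are made-up placeholders, not part of the original answer:

POST _reindex
{
  "conflicts": "proceed",
  "source": {
    "index": "your-source-index",
    "query": {
      "term": { "status": "published" }
    }
  },
  "dest": {
    "index": "some-temp-index",
    "version_type": "external"
  },
  "script": {
    "lang": "painless",
    "source": "if (ctx._version != 1) { ctx.op = 'delete' }"
  }
}

Afterwards, some-temp-index should contain only the matching documents whose _version is 1.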

Related

Possible to provide an entire document to Update By Query?

I would like to search for a document that is stored in Elasticsearch based on its fields and overwrite that entire document with a new version. I am new to ES, but from what I can tell, I can only use Update if I am searching for a document by its ES-assigned _id, so I was hoping to use Update By Query to do this. Unfortunately, it appears that if I use Update By Query, then I need to provide a script to update the fields I care about. Something like below:
POST my-index-000001/_update_by_query
{
  "script": {
    "source": "ctx._source.count++",
    "lang": "painless"
  },
  "query": {
    "term": {
      "user.id": "kimchy"
    }
  }
}
My problem is that my document has dozens of fields and I don't know which of them will have changed. I could loop through them and build the script, but I'm hoping there is a way to simply provide the document that you want and have anything that matches your query be overwritten by that document. Is this possible with Update By Query? Or is there another way to match on something other than _id and perform an update?
Your question is not entirely clear; are you trying to update the whole document for a given id? If yes, you can simply overwrite the existing document with a PUT call:
PUT index-name/_doc/<document-id>
This will overwrite the existing document, so make sure that you are sending the complete document in your PUT call and not just the fields that have changed.
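For instance, a minimal sketch (index name, id, and fields are placeholders, not taken from the question):

PUT my-index-000001/_doc/1
{
  "user": { "id": "kimchy" },
  "count": 5,
  "status": "active"
}

If you only know the matching fields rather than the _id, one option is to run a search first to recover the _id of the matching document and then issue the PUT above.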

Elasticsearch - Test new analyzers against an existing data set

New to Elasticsearch.
I need to update an index to treat both plurals & singulars as matches. So green apple should match green apples as well (and vice versa).
Through my research, I understand I need to recreate the index with a stemmer filter.
So:
"analysis": {
"analyzer": {
"std_analyzer": {
"tokenizer": "whitespace",
"filter": [ "stemmer" ]
}
}
}
Can anyone confirm if the above is correct? If not, what will I need to use?
I also understand that I cannot modify the existing index, but rather I will need to create a new one with this analyzer, and then re-add all the documents to the new index. Is that correct? If so, is there a shortcut or easy way to tell it to "add all documents from index X to new index Y?"
Thank you for your help
Answers inline below.
In most cases it should work, but it's really difficult to cover all future use-cases, and here we don't even know your current ones. You can use the Analyze API to test some of your use-cases before pushing these analyzer-related changes to production.
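For example, a quick check with the Analyze API, using the tokenizer and filter from the settings in the question (no index is required for this form of the call):

POST _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "stemmer" ],
  "text": "green apples"
}

With the default stemmer, apple and apples should reduce to the same stemmed token, so the two phrases will match each other once the field is analyzed this way.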
Adding or changing an analyzer is a breaking change, as it controls how tokens are generated and stored in Elasticsearch's inverted index. Hence you have to create a new index with the updated analyzer setting and reindex all the documents into it; you can use the Reindex API together with an alias to do this with zero downtime.
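One possible zero-downtime sequence, sketched with made-up index, alias, and field names (an assumption about the setup, not something from the question):

PUT products_v2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "stemmer" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "std_analyzer" }
    }
  }
}

POST _reindex
{
  "source": { "index": "products_v1" },
  "dest": { "index": "products_v2" }
}

POST _aliases
{
  "actions": [
    { "remove": { "index": "products_v1", "alias": "products" } },
    { "add": { "index": "products_v2", "alias": "products" } }
  ]
}

Applications keep querying the products alias throughout, so the switch is atomic from their point of view.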

Is there any way to restrict the number of child docs in my parent-child mapping in Elasticsearch?

I am using ES v7.3 with a parent-child mapping, and I wanted to know whether there is any way to restrict the number of child docs for a parent document. Suppose I have a parent 'p1'; I want this parent to have no more than 100 child docs associated with it. If more docs are indexed, the old child docs should be deleted and the new ones indexed, keeping the limit at 100 child docs.
PUT test/
{
  "mappings": {
    "properties": {
      "data": {
        "type": "join",
        "relations": {
          "parent": ["child1", "child2", "child3"]
        }
      }
    }
  }
}
I am not aware of a way to set such a maximum size with automatic deletion via the mappings.
What you could do, however, is implement a Logstash filter that checks the current number of child documents and executes some REST calls to the cluster if the number is already equal to 100.
I've never faced such a use case, but I want to give you some possibilities for that workaround:
1.) execute a parent_id query via Logstash's elasticsearch filter plugin
As stated in the parent_id documentation, this query "Returns child documents joined to a specific parent document".
So with the id of the parent document you should be able to get all child documents in your filter implementation. Refer to the elasticsearch filter plugin documentation on how to use it. With that, you can determine the number of child documents via a ruby code plugin (raw queries for this and for step 3 are sketched after this list).
2.) check if the number of the current child documents is equal to 100
3.) if 2.) evaluates to true, call the delete_by_query REST API
To index new child documents without stepping over that maximum threshold of 100 child documents, you have to delete previously indexed child documents. You could therefore use Logstash's http filter plugin to call the delete_by_query API with the exact query that will delete the previously indexed documents.
4.) index the new document via the elasticsearch output plugin
Refer to the Elasticsearch output plugin on how to index events from logstash.
So, as I stated at the beginning, I am not fully aware whether this approach will lead to the desired result or not. But I would give it a try, since the Logstash plugins I mentioned are able to perform the particular steps in the workflow.
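As a rough sketch of the raw calls behind steps 1.) and 3.), using the index and relation names from the mapping above (the parent id, child ids, and the choice of child1 are assumptions for illustration):

# 1.) count the children of type child1 joined to parent "p1"
GET test/_count
{
  "query": {
    "parent_id": {
      "type": "child1",
      "id": "p1"
    }
  }
}

# 3.) once the count reaches 100, delete old children by id
# (delete_by_query gives no control over ordering, so one option is to
#  search for the oldest child ids first and then delete exactly those)
POST test/_delete_by_query
{
  "query": {
    "ids": {
      "values": ["old-child-id-1", "old-child-id-2"]
    }
  }
}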

Elasticsearch: Set type of field while reindexing? (can it be done with _reindex alone)

Question: Can the Elasticsearch _reindex API be used to set/reset the "field datatypes" of fields that are copied through it?
This question comes from looking at Elastics docs for reindex: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/docs-reindex.html
Those docs show the _reindex API can modify things while they are being copied. They give the example of changing a field name:
POST _reindex
{
  "source": {
    "index": "from-index"
  },
  "dest": {
    "index": "new-index"
  },
  "script": {
    "source": "ctx._source.New-field-name = ctx._source.remove(\"field-to-change-name-of\")"
  }
}
The script clause will cause "new-index" to have a field called New-field-name instead of the field named field-to-change-name-of from "from-index".
The documentation implies there is a great deal of flexibility available in the "script" functionality, but it's not clear to me whether that includes converting datatypes (for instance, quoting data to turn it into strings/text/keywords, and/or treating strings as literals to attempt to turn string data into non-strings (obviously fraught with danger)).
If setting the datatypes in a _reindex is possible, I'm not assuming it will be efficient and/or be without (perhaps harsh) limits - I just want to better understand the limits of the _reindex functionality (and figure out if I can force a datatype in just one interaction, instead of setting the mapping on the new index before I run the reindex command).
(P.S. I happen to be working on Elasticsearch 6.2, but I think my question holds for all versions that have had the _reindex api (sounds like everything 2.3.0 and greater))
Maybe you are confusing some terms. The part of the documentation you are pointing to refers to the metadata associated with a document; in this case the _type meta field just tells Elasticsearch that a particular document belongs to a specific type (e.g. a user type). It is not related to the datatype of a field (e.g. integer or boolean).
If you want to set/reset the mapping of particular fields, you don't even need to use scripting, depending on your case. You just have to create the destination index with the new mapping and then execute the _reindex API.
But if you want to convert between incompatible values (e.g. a non-numerical string into a field with an "integer" datatype), you would need to do some transformation through scripting or through an ingest node.
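A minimal sketch of the no-scripting path, in ES 7+ syntax (on 6.x the "properties" block must be wrapped in a mapping type such as "_doc"); the quantity field is a made-up example of a numeric string being reindexed into an integer field:

PUT new-index
{
  "mappings": {
    "properties": {
      "quantity": { "type": "integer" }
    }
  }
}

POST _reindex
{
  "source": { "index": "from-index" },
  "dest": { "index": "new-index" }
}

Clean numeric strings like "42" will be coerced into the integer field during the reindex; values that cannot be parsed will fail with mapping errors, which is exactly where a script or an ingest pipeline becomes necessary.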

elasticsearch query for newest index

Elasticsearch newbie.
I would like to query for the newest index.
Every day logstash creates new indices with a naming convention something like our_sales_data-%{dd-mm-yyyy}%, or something very close. So I end up with lots of indices like:
our-sales-data-14-09-2015
our-sales-data-15-09-2015
our-sales-data-16-09-2015
and so on.
I need to be able to query for the newest index. Obviously I can query for and retrieve all the indices with 'our-sales-data*' in the name... but I only want to return the very newest one and no other.
Possible?
Well, the preferred method would be to compute the latest index name on the client side by resolving the date in our_sales_data-%{dd-mm-yyyy}%.
Another solution would be to run a sort query and get one of the latest documents. You can infer the index from the index name of the document (returned in the _index field of each hit).
{
  "size": 1,
  "sort": {
    "@timestamp": {
      "order": "desc"
    }
  }
}
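In console form that could be run against the wildcard pattern (assuming standard Logstash documents with an @timestamp field):

GET our-sales-data-*/_search
{
  "size": 1,
  "sort": { "@timestamp": { "order": "desc" } }
}

The _index field of the returned hit is then the name of the newest index.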
We have a search alias and a write alias. The write alias technically always points at the latest index until we roll over and add a new one to this alias.
Our search alias contains all the previous indices plus the latest index (which is also behind the write alias).
Could you do something like this and then just query the write alias?
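A rough sketch of such an alias setup, with made-up alias names (an assumption about how it could look, not the original poster's exact configuration):

POST _aliases
{
  "actions": [
    { "add": { "index": "our-sales-data-16-09-2015", "alias": "sales-write" } },
    { "add": { "index": "our-sales-data-16-09-2015", "alias": "sales-search" } }
  ]
}

# When the next daily index is created, move sales-write to it
# and add the new index to sales-search as well.

GET sales-write/_search

Querying sales-write always hits only the most recent index, while sales-search spans all of them.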
