How to sort data by _version in Elasticsearch

I am able to sort data by score like this:
{
"version":true,
"_source":false,
"sort": [
{
"_score": {
"order": "desc"
}
}
],
"query": {
"match_all": {}
}
}
Please let me know how I can do the same with _version. By default, fielddata is not supported on the _version field, so maybe I am missing something.
Is there any specific setting to query by version?
Please help!

You can't do this, and usually you don't have to.
See this thread:
https://discuss.elastic.co/t/filter-by--version-and-show--version-in-elasticsearch-query/22024/2
While using the _version might seem to work in certain cases, I would
recommend to never use it for anything else than optimistic locking of
updates. In particular, versions do not carry any meaning: they might look
like the number of times a document has been modified but it is not always
the case (for instance if you create a new document which has the same ID
as a document that you just deleted, the version number of the new document
will not be 1), and more importantly it is an implementation detail, this
behaviour might change in the future.
The _version field is not indexed, so you can't use it in queries or sort on it.
You can create your own custom version field and maintain it manually.
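A minimal sketch of the custom-version idea, in Python. The field name `doc_version` and both helper names are assumptions made for illustration, not part of Elasticsearch; the functions only build the JSON bodies you would send, and the field is assumed to be mapped as a numeric type.

```python
from typing import Any, Dict


def with_bumped_version(doc: Dict[str, Any], field: str = "doc_version") -> Dict[str, Any]:
    """Return a copy of doc with the application-managed version incremented."""
    updated = dict(doc)
    # A document that has never been versioned starts at 0, so its first write is 1.
    updated[field] = int(doc.get(field, 0)) + 1
    return updated


def sort_by_version_body(field: str = "doc_version", order: str = "desc") -> Dict[str, Any]:
    """Build a search body that sorts on the custom numeric field (hypothetical name)."""
    return {
        "sort": [{field: {"order": order}}],
        "query": {"match_all": {}},
    }
```

Because the field is an ordinary numeric field, it can be sorted and filtered like any other, which _version cannot.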

Related

How can I let ES support mixed type of a field?

I am saving logs to Elasticsearch for analysis, but a particular field has mixed types, which causes an error when indexing the document.
For example, I may save below log to the index where uuid is an object.
POST /index-000001/_doc
{
"uuid": {"S": "001"}
}
but from another event, the log would be:
POST /index-000001/_doc
{
"uuid": "001"
}
The second POST fails because the type of uuid is not an object, so I get this error: object mapping for [uuid] tried to parse field [uuid] as object, but found a concrete value
What is the best solution for this? I can't change the logs because they come from different applications: the first log is data from DynamoDB, while the second is data from the application. How can I save both types of logs into ES?
If I disable dynamic mapping, I will have to specify all fields in the index mapping, and I will not be able to search any new fields. So I do need dynamic mapping.
There will be many cases like this, so I am looking for a solution that can cover all conflicting fields.
It's perfectly possible using ingest pipelines which are run before the indexing process.
The following would be a solution for your particular use case, albeit somewhat onerous:
create a pipeline
PUT _ingest/pipeline/uuid_normalize
{
"description" : "Makes sure uuid is a hash map",
"processors" : [
{
"script": {
"source": """
if (ctx.uuid != null && !(ctx.uuid instanceof java.util.HashMap)) {
ctx.uuid = ['S': ctx.uuid]; // hash map init
}
"""
}
}
]
}
run the pipeline when ingesting a new doc
POST /index-000001/_doc
{
"uuid": {"S": "001"}
}
POST /index-000001/_doc?pipeline=uuid_normalize <------
{
"uuid": "001"
}
You could now extend this to be as generic as you like but it is assumed that you know what you expect as input in each and every doc. In other words, unlike dynamic templates, you need to know what you want to safeguard against.
You can read more about Painless script operators in the Painless language reference.
You simply cannot.
You should either normalize all your fields one way or another,
or use two separate fields.
I suggest using a field like this:
"uuid": {"key": "S", "value": "001"}
and skipping the key when it is not necessary.
But you will have to preprocess your values before ingestion.
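The preprocessing step could look roughly like the sketch below. It assumes the two input shapes from the question (a DynamoDB-style single-key object and a plain value); the function name `normalize_uuid` is made up for this example.

```python
from typing import Any, Dict


def normalize_uuid(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of doc whose uuid always has the {"key": ..., "value": ...} shape."""
    out = dict(doc)
    raw = out.get("uuid")
    if raw is None:
        return out  # nothing to normalize
    if isinstance(raw, dict):
        if "value" in raw:
            return out  # already in the target shape
        if len(raw) == 1:
            # DynamoDB-style {"S": "001"}: one type key mapping to the value
            (key, value), = raw.items()
            out["uuid"] = {"key": key, "value": value}
        # Other object shapes are left untouched; they are outside this sketch.
        return out
    # Plain value such as "001": the key is skipped because it is not needed
    out["uuid"] = {"value": raw}
    return out
```

Run every document through this function before sending it to Elasticsearch, so that uuid is always mapped as an object.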

Using numerics as type in Elasticsearch

I am going to store transaction logs in Elasticsearch. I am new to the ELK stack and not sure how I should implement this. My transaction prints log lines sequentially (upserts), and instead of logging these to a file I want to store them in Elasticsearch and later query the logs by the transactionId I have created.
Normally the URI for querying will be
/bookstore/books/_search
but in my case it must be like
/transactions/transactionId/_search
because I don't want to store the lines as an array attached to a single transaction record. However, I am not sure whether it is good practice to create a new type at the beginning of every transaction, or whether it is even possible.
Can you give me advice about storing this transaction data in Elasticsearch?
If you want to query with a URI like /transactions/transactionId/_search, you are planning to create a new type every time a new transactionId arrives. Apart from being a bad design, this is not even possible: indices created in 6.0 or later can hold only one type, and types were deprecated in 7.0 and removed in 8.0.
One workaround is to use the transactionId itself as the document ID at creation time. You can then get the log associated with a transactionId by querying GET transactions/transactionId (read about the length restrictions of the document ID, though). This causes another issue, however: there can be multiple log entries for the same transaction, so each entry with the same ID would simply overwrite the previous one.
The best solution here is to change how you query those records.
Put transactionId in the JSON body as one of the fields, along with a created timestamp set at insertion time (let ES create the documents with auto-generated IDs), and then query all logs associated with a transaction like this:
POST transactions/_search
{
"sort": [
{
"createdDate": {
"order": "asc"
}
}
],
"query":{
"bool":{
"must":[
{
"term":{
"transactionId.keyword":"<transaction id>"
}
}
]
}
}
}
Hope this helps.
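As a rough illustration of the approach above, the Python sketch below builds one log document per line and the matching search body. The `message` field and both function names are assumptions for this example; `transactionId.keyword` assumes the default dynamic mapping, which adds a keyword sub-field to strings.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def log_entry(transaction_id: str, message: str,
              now: Optional[datetime] = None) -> Dict[str, Any]:
    """Build one log document; index it without an explicit ID so nothing is overwritten."""
    created = now or datetime.now(timezone.utc)
    return {
        "transactionId": transaction_id,
        "message": message,  # hypothetical field holding the log line
        "createdDate": created.isoformat(),
    }


def logs_for_transaction(transaction_id: str) -> Dict[str, Any]:
    """Build the search body that returns a transaction's log lines, oldest first."""
    return {
        "sort": [{"createdDate": {"order": "asc"}}],
        "query": {
            "bool": {
                "must": [{"term": {"transactionId.keyword": transaction_id}}]
            }
        },
    }
```

Each call to `log_entry` yields a separate document, so several lines of the same transaction coexist in the one transactions index.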

How to exclude large number of IDs from an Elastic Search query

I'm working on an app similar to Tinder. In Elasticsearch I have a collection of about half a million users and their locations. Whenever a user opens the app to search for nearby users, I run an Elasticsearch query over that collection. The query is fairly complex: it takes into consideration not only the location but also how active the user is and how many photos they have.
What I struggle with is how to exclude from the query those users whom the current user has already swiped through. A naive way to implement this would probably be to maintain a nested array of user IDs as part of every user document in the index and exclude based on that. But as every user makes tens of thousands of swipes, that array could grow very large, so it's not a scalable solution.
Is there a way to exclude a large number of entities from an Elasticsearch query based on their IDs without hurting performance?
Use the lookup feature of the Terms query: Terms lookup mechanism
When it’s needed to specify a terms filter with a lot of terms it can be beneficial to fetch those term values from a document in an index. A concrete example would be to filter tweets tweeted by your followers. Potentially the amount of user ids specified in the terms filter can be a lot. In this scenario it makes sense to use the terms filter’s terms lookup mechanism.
You can try adding an ids filter to a bool/must_not clause of your complex query and see how it behaves.
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
... <--- your other "must" constraints
],
"must_not": [
{
"ids": {
"values": [ "id1", "id2", "id3" ] <--- your list of ids to exclude
}
}
]
}
}
}
}
}
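The filtered query used above was removed in Elasticsearch 5.0; on later versions the bool query goes directly under query. The Python sketch below builds both variants discussed here: an explicit ids list, and a terms lookup that reads the IDs from another document. The index name `swipes`, the field `swiped_ids`, and the function names are hypothetical.

```python
from typing import Any, Dict, List


def exclude_ids(must: List[Dict[str, Any]], ids: List[str]) -> Dict[str, Any]:
    """bool query that excludes an explicit list of document IDs."""
    return {
        "query": {
            "bool": {
                "must": must,
                "must_not": [{"ids": {"values": ids}}],
            }
        }
    }


def exclude_ids_via_lookup(must: List[Dict[str, Any]], index: str,
                           doc_id: str, path: str) -> Dict[str, Any]:
    """Same exclusion, but the IDs are read from a field of another document."""
    lookup = {"index": index, "id": doc_id, "path": path}
    return {
        "query": {
            "bool": {
                "must": must,
                "must_not": [{"terms": {"_id": lookup}}],
            }
        }
    }
```

For example, `exclude_ids_via_lookup(must, "swipes", "user-42", "swiped_ids")` excludes every ID stored in the `swiped_ids` field of document `user-42`. Note that terms queries are capped by the index.max_terms_count setting (65,536 by default), which matters at tens of thousands of swipes per user.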

Exclude setting on integer field in term query

My documents contain an integer array field that stores the IDs of the tags describing them. Given a specific tag ID, I want to extract a list of the top tags that occur most frequently together with the provided one.
I can solve this problem by combining a terms aggregation over the tag ID field with a term filter over the same field, but the list I get back obviously always starts with the tag ID I provide: all documents matching my filter have that tag, so it is the first in the list.
I thought of using the exclude field to avoid creating the problematic bucket, but as I'm dealing with an integer field, that seems not to be possible. This query
{
"size": 0,
"query": {
"term": {
"tag_ids": "00001"
}
},
"aggs": {
"tags": {
"terms": {
"size": 3,
"field": "tag_ids",
"exclude": "00001"
}
}
}
}
returns an error saying that Aggregation [tags] cannot support the include/exclude settings as it can only be applied to string values.
Is it possible to avoid getting back this bucket?
This is, as of Elasticsearch 1.4, a shortcoming of ES itself.
After the community proposed this change, the functionality has been added and will be included in Elasticsearch 1.5.0.
It's supposed to be fixed since version 1.5.0.
Look at this: https://github.com/elasticsearch/elasticsearch/pull/7727
While it is en route to being fixed, my workaround is to have the aggregation use a script instead of accessing the field directly, and let that script return the value as a string.
Works well and without measurable performance loss.
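A different, version-independent workaround (not the script approach described above) is to request one extra bucket and drop the known tag on the client. The Python sketch below assumes the `tag_ids` field from the question; the function names are made up for this example.

```python
from typing import Any, Dict, List


def co_occurring_tags_body(tag_id: int, top_n: int = 3,
                           field: str = "tag_ids") -> Dict[str, Any]:
    """Build the request, asking for one extra bucket to make room for tag_id itself."""
    return {
        "size": 0,
        "query": {"term": {field: tag_id}},
        "aggs": {"tags": {"terms": {"field": field, "size": top_n + 1}}},
    }


def drop_tag(buckets: List[Dict[str, Any]], tag_id: int,
             top_n: int = 3) -> List[Dict[str, Any]]:
    """Remove the bucket for tag_id and keep the top_n remaining buckets."""
    return [b for b in buckets if b["key"] != tag_id][:top_n]
```

Pass the `buckets` list from the aggregation response to `drop_tag`; bucket order is preserved, so the result is still sorted by document count.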

Can elasticsearch return multiple value fields in a single facet?

I am looking for a way to create a facet such that I can essentially return two values for one key.
For instance, I am attempting to retrieve both an amount and schedule properties of an object. I attempted to use a computed value script, but the calculations that have to be done using the two objects are date based, and require an external library to perform them.
Basically, something along the lines of:
"theFacet": {
"terms_stats": {
"key_field": "someKeyProbablyADate",
"value_field": "amount",
"value_field": "simpleSchedule"
}
}
Workarounds are also appreciated. Perhaps some way to return a new dynamic object with both fields?
Sounds like you want to pre-process your data before you index it into a single field, then facet on that.
Something along the lines of a single string containing key#amount#schedule
Then when you get the faceting results back you can split it up again and run whatever logic you want.
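A small Python sketch of that pre-processing and the later split. The `#` separator comes from the suggestion above; the function names and the choice to store amount as a float are assumptions for illustration.

```python
from typing import Tuple

SEP = "#"


def make_facet_key(key: str, amount: float, schedule: str) -> str:
    """Join the three values into the single string that gets indexed and faceted on."""
    parts = (key, str(amount), schedule)
    if any(SEP in part for part in parts):
        # The separator must not occur in the data, or splitting becomes ambiguous.
        raise ValueError("values must not contain the separator")
    return SEP.join(parts)


def split_facet_key(term: str) -> Tuple[str, float, str]:
    """Split a facet term back into (key, amount, schedule)."""
    key, amount, schedule = term.split(SEP)
    return key, float(amount), schedule
```

After splitting, any date-based calculation that needs an external library can run in the application instead of inside Elasticsearch.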
Try combining different fields with a script element. For example:
"facets": {
"facet-name": {
"terms": {
"field": "some-field",
"script": "_source['another-field'] + '/' + term"
}
}
}
