Elasticsearch upsert based on query

Two years ago someone asked how to do upserts when you don't know a document's id. The (unaccepted) answer referenced the feature request that resulted in the _update_by_query API.
However, _update_by_query does not insert anything when the query has no hits, so it is not really an upsert, just another way to do an update.
Is there a way yet to do an upsert without an _id? I know that my query will always return one or zero results. Or am I forced to do multiple requests (and maintain the uniqueness myself)?

This doesn't seem to be possible right now. _update provides an upsert attribute, but unfortunately it doesn't work with _update_by_query. The following just gives you an error along the lines of Unknown key for a START_OBJECT in [upsert]:
POST website/doc/_update_by_query?conflicts=proceed
{
  "query": {
    "term": {
      "url": "http://foo.com"
    }
  },
  "script": {
    "inline": "ctx._source.views += 1",
    "lang": "painless"
  },
  "upsert": {
    "views": 1,
    "url": "http://foo.com"
  }
}
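One workaround, given that the query key (url here) is unique, is to make the _id deterministic yourself, for instance by using the URL or a hash of it as the document _id, and then use the plain _update endpoint with scripted_upsert. A sketch, assuming documents are indexed with the URL as their _id (the foo.com id below is just that assumption):
POST website/doc/foo.com/_update
{
  "scripted_upsert": true,
  "script": {
    "inline": "ctx._source.views += 1",
    "lang": "painless"
  },
  "upsert": {
    "views": 0,
    "url": "http://foo.com"
  }
}
With scripted_upsert set to true, the script also runs when the document does not exist yet, starting from the upsert document, so a fresh document ends up with views = 1.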

Without knowing the current in_stock value in each document, you can reduce its count by 1 across all documents:
POST products/_update_by_query
{
  "script": {
    "source": "ctx._source.in_stock--"
  },
  "query": {
    "match_all": {}
  }
}
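If in_stock must never drop below zero, a variant of the same call (a sketch against the same products index) can skip documents that are already at zero by setting ctx.op to noop:
POST products/_update_by_query
{
  "script": {
    "source": "if (ctx._source.in_stock > 0) { ctx._source.in_stock-- } else { ctx.op = 'noop' }"
  },
  "query": {
    "match_all": {}
  }
}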

Elasticsearch - how to combine results from two indexes

I have CDR log entries in Elasticsearch in the format below. When this document is created, I won't yet have info for the delivery_status field.
{
  "msgId": "384573847",
  "msgText": "Message text to be delivered",
  "submit_status": true,
  ...
  "delivery_status": // comes later
}
Later, when the delivery status becomes available, I can update this record.
But I have seen that update queries bring down the rate of ingestion. With pure inserts using bulk operations, I can reach up to 3,000 or more transactions/sec, but if I combine them with updates, the ingestion rate becomes very slow and crawls at 100 or fewer txns/sec.
So I am thinking that I could create another index like the one below, where I store the delivery status along with the msgId:
{
  "msgId": "384573847",
  "delivery_status": 0
}
With this approach I end up with 2 indices (similar to master-detail tables in an RDBMS). Is there a way to query the records by joining these indices? I have heard of aliases, but could not fully understand the concept and whether it can be applied to my use case.
Thanks to anyone helping me out with suggestions.
As you mentioned, you can index the documents in separate indices and use the collapse functionality of Elasticsearch to retrieve both documents.
Let's say you have indexed documents in index2 and index3, and both share a common msgId field; then you can use the query below:
POST index2,index3/_search
{
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "msgId",
    "inner_hits": {
      "name": "most_recent",
      "size": 5
    }
  }
}
But again, you need to consider query performance with a large data set. You can do some benchmarking to evaluate query performance and decide whether handling this at index time or at query time works better for you.
Regarding aliases: in the query above we pass index2,index3 (comma-separated) as the index names. If you use an alias instead, you can use a single unified name to query both indices.
You can add both indices to a single alias using the command below:
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "index3",
        "alias": "order"
      }
    },
    {
      "add": {
        "index": "index2",
        "alias": "order"
      }
    }
  ]
}
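To verify the result, you can list which indices the alias resolves to:
GET _alias/order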
Now you can use the query below with the alias name instead of the index names:
POST order/_search
{
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "msgId",
    "inner_hits": {
      "name": "most_recent",
      "size": 5
    }
  }
}

Elasticsearch inline string replace seems to do nothing

We have some legacy fields in an Elasticsearch index which cause us some trouble, and we would like to perform a string replace over the whole index.
For instance, some old timestamps are stored in the format 2000-01-01T00:00:00.000+0100 but should be stored as 2000-01-01T00:00:00.000+01:00.
I tried to run the following query:
POST /my_index/_update_by_query
{
  "script": {
    "lang": "painless",
    "inline": "ctx._source.timestamp = ctx._source.timestamp.replace('+0100', '+01:00')"
  }
}
I ran the query from Kibana, but I always get a query timeout - I guess that is not necessarily bad considering the index is huge, but I never see the fields updated.
Is there a way to see the status of such a query?
I also tried to create a search query for the update, but with no luck:
GET /my_index/_search
{
  "query": {
    "query_string": {
      "query": "*0100",
      "allow_leading_wildcard": true,
      "analyze_wildcard": true,
      "fields": ["timestamp"]
    }
  }
}
Which unfortunately always returns an empty set - I'm not sure what might be wrong.
What would be a correct way to achieve such update?
I would solve this using an ingest pipeline that you'll use to update your whole index.
First, create an ingest pipeline like the one below. It detects documents whose timestamp field ends with +0100 and rewrites the timestamp so the timezone uses the correct format.
PUT _ingest/pipeline/fix-tz
{
  "processors": [
    {
      "dissect": {
        "if": "ctx.timestamp.endsWith('+0100')",
        "field": "timestamp",
        "pattern": "%{timestamp}+%{tz}"
      }
    },
    {
      "set": {
        "if": "ctx.tz != null",
        "field": "timestamp",
        "value": "{{timestamp}}+01:00"
      }
    },
    {
      "remove": {
        "if": "ctx.tz != null",
        "field": "tz"
      }
    }
  ]
}
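Before running it over the whole index, you can dry-run the pipeline with the simulate API, feeding it the sample value from the question:
POST _ingest/pipeline/fix-tz/_simulate
{
  "docs": [
    {
      "_source": {
        "timestamp": "2000-01-01T00:00:00.000+0100"
      }
    }
  ]
}
The response should show the document with timestamp rewritten to 2000-01-01T00:00:00.000+01:00 and no leftover tz field.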
Then, when the pipeline is created, you just have to update your index with it, like this:
POST my_index/_update_by_query?pipeline=fix-tz&wait_for_completion=false
Once this has run completely, your index should be properly updated.
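As for seeing the status of such a query: because the call above uses wait_for_completion=false, Elasticsearch returns a task ID immediately, and you can poll the Tasks API to watch progress (the <task_id> placeholder stands for the ID returned by the update call):
GET _tasks?detailed=true&actions=*byquery
GET _tasks/<task_id>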

How do I view synonyms indexed in a document?

I have added a synonym token filter to my index and I think it is working as planned, but I want a way to confirm the exact values that are being stored for each document (some queries aren't using the synonym values as I expect, and I need to verify whether the right values were stored at the time of indexing).
Is there a standard way to figure this out?
Example:
At some point I configured a synonym for NICE and PLEASANT.
At some point I indexed a document that has the word NICE in it.
Givens
_termvectors shows my document has the term NICE in it.
_analyze for my analyzer shows NICE and PLEASANT are synonyms.
Question:
How can I tell if the indexed document is using PLEASANT as a term/synonym?
Update
Adapting the answer from user3775217 (I had to update the syntax to work on Elasticsearch 5.2):
{
  "query": {
    "term": { "{someFieldToFilterOn}": "{SomeFieldValue}" }
  },
  "script_fields": {
    "terms": {
      "script": {
        "lang": "groovy",
        "inline": "doc[field].values",
        "params": {
          "field": "{TheFieldIwantIndexedTermsFrom}"
        }
      }
    }
  }
}
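Groovy was deprecated in 5.0 and removed in 6.0; on later versions, a roughly equivalent Painless query (a sketch, assuming the field has doc values, or fielddata enabled for text fields) would be:
GET test-idx/_search
{
  "query": {
    "term": { "_id": "1770" }
  },
  "script_fields": {
    "terms": {
      "script": {
        "lang": "painless",
        "source": "doc[params.field]",
        "params": {
          "field": "input"
        }
      }
    }
  }
}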
I prepared this query a couple of years back to find the indexed values for a document. You can use it to learn about the values indexed in a field for each document.
You will need the _id of each document and the document field you want to check.
curl 'http://localhost:9200/test-idx/_search?pretty=true' -d '{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "_id": "1770"
              }
            }
          ]
        }
      }
    }
  },
  "script_fields": {
    "terms": {
      "script": "doc[field].values",
      "params": {
        "field": "input"
      }
    }
  }
}'
Hope this helps
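Another way to confirm what was actually indexed: the _termvectors API (already used in the question) lists every term stored for a field in a given document, so if your synonym filter ran at index time, PLEASANT should show up next to NICE. A sketch against the same example document (on versions with mapping types, the type goes in the path, e.g. test-idx/doc/1770/_termvectors):
GET test-idx/_termvectors/1770?fields=input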

Hide a single record in Elastic Search on a per user basis

As a logged-in user, I want to be able to hide a single record that I never want to see again if I perform the same search. Is this possible with Elasticsearch?
I've read about multitenancy and filters, but I'm not quite sure what a top-level implementation might look like.
One of my ideas is to store a reference to the unwanted record in an RDB and then add those references into a filter query, but I'm not sure what reference to use, since Elasticsearch generates its own IDs, which may not stay the same when a re-index happens.
It depends. If you don't have many users and your documents aren't too big, you can go with a field on the document: add a dismissedBy field, and when a user dismisses a record, write an update to the document (the script below also initializes the field on first use):
POST test/type1/1/_update
{
  "script": {
    "inline": "if (ctx._source.dismissedBy == null) { ctx._source.dismissedBy = [] } ctx._source.dismissedBy.add(params.userId)",
    "lang": "painless",
    "params": {
      "userId": "1"
    }
  }
}
And the query:
POST /index/documents/_search
{
  "query": {
    "bool": {
      "must_not": {
        "term": {
          "dismissedBy": "1"
        }
      }
    }
  }
}
The problem with this approach is that if you re-index the document, this information will be overwritten, so you must keep a copy of it in some other place too.
The other option, if documents are large or you have lots of users, is the parent/child approach.
If a user hits dismiss, then you index a child document:
PUT /indexname/dismiss/1?parent=dismissforid
{
  "userId": 1
}
Then when you search, you do:
POST /index/documents/_search
{
  "query": {
    "bool": {
      "must_not": {
        "has_child": {
          "type": "dismiss",
          "query": {
            "term": {
              "userId": 1
            }
          }
        }
      }
    }
  }
}
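One caveat: the parent/child example above uses the pre-6.x syntax with mapping types. On recent versions, parent/child is modeled with a join field instead; a rough sketch of the equivalent setup (the field and relation names are illustrative, and <parent_doc_id> stands for the _id of the document being dismissed):
PUT indexname
{
  "mappings": {
    "properties": {
      "doc_relation": {
        "type": "join",
        "relations": {
          "document": "dismiss"
        }
      }
    }
  }
}

PUT indexname/_doc/dismiss-1?routing=<parent_doc_id>
{
  "userId": 1,
  "doc_relation": {
    "name": "dismiss",
    "parent": "<parent_doc_id>"
  }
}
The has_child query above then works essentially unchanged with "type": "dismiss".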

Sort documents by size of a field

I have documents like the ones below indexed:
1.
{
  "name": "Gilly",
  "hobbyName": "coin collection",
  "countries": ["US", "France", "Georgia"]
}
2.
{
  "name": "Billy",
  "hobbyName": "coin collection",
  "countries": ["UK", "Ghana", "China", "France"]
}
Now I need to sort these documents based on the array length of the countries field, so that the result after sorting is in the order document2, document1. How can I achieve this using Elasticsearch?
You can use script based sorting to achieve this.
{
  "query": {
    "match_all": {}
  },
  "sort": {
    "_script": {
      "type": "number",
      "script": "doc['countries'].values.size()",
      "order": "desc"
    }
  }
}
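On recent versions you would write the same sort in Painless, and since sorting reads doc values, the script needs a keyword field (a sketch, assuming the default dynamic mapping created a countries.keyword sub-field):
{
  "query": {
    "match_all": {}
  },
  "sort": {
    "_script": {
      "type": "number",
      "script": {
        "lang": "painless",
        "source": "doc['countries.keyword'].size()"
      },
      "order": "desc"
    }
  }
}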
I would suggest using the token_count type in Elasticsearch.
It can be done using scripts (see the script-based answer above), but the results won't be perfect: scripts mostly use the field data cache, in which duplicates are removed, so arrays with repeated values may be undercounted.
You can read more on how to use the token_count type in the Elasticsearch documentation.
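A sketch of what that could look like (the hobbies index name is illustrative): map a countries.length sub-field of type token_count with the keyword analyzer, so each array element counts as exactly one token, then sort on it with mode sum, which adds the per-element counts up to the array length:
PUT hobbies
{
  "mappings": {
    "properties": {
      "countries": {
        "type": "text",
        "fields": {
          "length": {
            "type": "token_count",
            "analyzer": "keyword"
          }
        }
      }
    }
  }
}

GET hobbies/_search
{
  "sort": [
    {
      "countries.length": {
        "order": "desc",
        "mode": "sum"
      }
    }
  ]
}
Note that existing documents need to be reindexed before the new sub-field is populated.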
