Bulk insert documents but if it exists only update provided fields - elasticsearch

I have an index which contains data as follows:
{
"some_field": string, -- exists in my database
"some_other_field": string, -- exists in my database
"another_field": string -- does NOT exist in my database
}
I have a script which grabs data from a database and performs a bulk insert. However, only some of the fields above come from the database as shown above.
If a document already exists, I still want to update the fields that come from the database, but without overwriting/deleting the field that does not come from the database.
I am using the bulk API to do this, however, I lose all data relating to another_field when running the script. Looking at bulk docs, I can't find any options to simply update an existing doc.
I am unable to share the script, but hope this might be enough information to shine some light on possible solutions.

TLDR;
Yes it is use index, as the doc explain:
(Optional, string) Indexes the specified document. If the document exists, replaces the document and increments the version. The following line must contain the source data to be indexed.
But make sure to provide the _id of the document in case of an update.
To understand
I created a toy project to replay and understand:
# post a single document
POST /71177773/_doc
{
"some_field": "data",
"some_other_field": "data"
}
GET /71177773/_search
# try to "update" with out providing an id
POST /_bulk
{"index":{"_index":"71177773"}}
{"some_field":"data","some_other_field":"data","another_field":"data"}
# 2 Documents exist now
GET /71177773/_search
# Try the same command but provide using the Id on the first documents
POST /_bulk
{"index":{"_index":"71177773", "_id": "<Id of the document>"}}
{"some_field":"data","some_other_field":"data","another_field":"data"}
# It seems it worked
GET /71177773/_search
If your question was:
Is Elasticsearch smart enough to recognise I want to update an existing document without providing the Id ?
I am afraid it is not possible.

Related

Add _id to the source as a separate field to all exist docs in index

I'm new to Elastic Search. I need go through all the documents, take the _id and add it to the _source as a separate field by script. Is it possible? If yes, сan I have an example of something similar or a link to similar scripts? I haven't seen anything like that on the docks. Why i need it? - Because after that i will do SELECT with Opendistro and SQL. This frame cannot return me fields witch not in source. If anyone can suggest I would be very grateful.
There are two options:
First option: Add this new field in your existing index and populate it and build the new index again.
Second option: Simply define a new field in a new index mapping(keep rest all field same) and than use reindex API with below script.
"script": {
"source": "ctx._source.<your-field-name> = ctx._id"
}

How to project a new field in response in ElasticSearch?

I am using Elasticsearch 6.2.
I have an index products with index_type productA having data with following structure:
{
"id": 1,
"parts": ["part1", "part2",...]
.....
.....
}
Now during the query time, I want to add or project a field parts_count to the response which simply represents the number of parts i.e the length of parts array. Also, if possible, I would also like to sort the documents of productA based on the generated field parts_count.
I have gone through most of the docs but haven't found a way to achieve this.
Note:
I don't want to update the mapping and add dynamic fields. I am not sure if Elasticsearch allows it. I just wanted to mention it.
Did you read about Script Fields and on Script Based Sorting?
I think you should be able to achieve both things and this not require any mapping updates.

Elasticsearch Jest update a whole document

I have an elasticsearch server which i'm accessing via a java server using the Jest client and i was looking for the best way to update multiple fields of a document each time.
I have looked to the documentation so far, and i have found that there are two way for doing it :
Partial update via a script : i don't think it is suitable for multiple field update (because i don't know the modified fields).
Whole document update: via re-indexing the whole document.
My question is how could i update the whole document knowing that Jest provide only update via a script?
Is it the best way to delete a document and indexing the updated version?
Already answered this in the github issue you also opened but again:
You should use the second way you linked (Whole document update) and there is no special API for it, it's just a regular index request. So you can do it simply by sending your Index request against the id of the document you want to update.
For example assuming you have below document already indexed in Elasticsearch within index people, type food, id 9:
{"user": "kramer", "fav_food": "jello"}
Then you would do:
String source = "{\"user\": \"kramer\", \"fav_food\": \"pizza\"}";
JestResult result = client.execute(
new Index.Builder(source)
.index("people")
.type("food")
.id(9)
.build()
);

How to retrieve all document ids matching a search, in elastic search?

I'm working on a simple side project, and have a tech stack that involves both a SQL database and ElasticSearch. I only have ElasticSearch because I assumed that as my project grows, my full text searching would be most efficiently performed by ES. My ES schema is very simple - documents that I insert into ES have 2 fields, one being the id and the other being the field with the body of text to search. The id being inserted into ES corresponds to that document's primary key id from the SQL database.
insert record into SQL -> insert record into ES using PK from SQL
Searching would be the reverse of that. Query ES and grab all the matching ids, and then turn around and use those ids to get records from SQL.
search ES can get all PK ids -> use those ids to get documents from SQL
The problem that I am facing though, is that ES can only return documents in a paginated manner. This is a problem because I also have a WHERE clause on my SQL query, beyond just the ids. My SQL query might look like this ...
SELECT * FROM foo WHERE id IN (1,2,3,4,5) AND bar != 'baz'
Well, with ES paginating the results, my WHERE clause will always only be querying a subset of the full results from ES. Even if I utilize ES' skip and take, I'm still only querying SQL using a subset of document ids.
Is there a way to get Elastic Search to only return the entire list of matching document ids? I realize this is here to not allow me to shoot myself in the foot, because doing this across all shards and many many documents is not efficient. Is there no way, though?
After putting in some hours on this project, I've only now realized that I've poorly engineered this, unless I can get all of these ids from ES. Some alternative implementations that I've thought of would be to store the things that I'm filtering on, in SQL, in ES as well. A problem there is that I'd have to update the ES document every time I update the document in SQL. This would require a pretty big rewrite to some of my data access code. I could scrap ElasticSearch all together and just perform searching in Postgres, for now, until I can think of a better way to structure this.
The elasticsearch not support return each and every doc match to you queries. Because it Ll overload the system. Instead of this.. Use scroll concept in elasticsearch.. It's lik cursor concept in db's..
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scan-scroll.html
For more examples refer the Github repo. https://github.com/sidharthancr/elasticsearch-java-client
Hope it helps..
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-fields.html
please have a look into the elastic search document where you can specify only particular fields that return from the match documents
hope this resolves your problem
{
"fields" : ["user", "postDate"],
"query" : {
"term" : { "user" : "kimchy" }
}
}

How to create unique constraint in Elasticsearch database?

I am using elasticsearch as a document database and each record I create has a guid id that the system uses for the record id. Business people want to offer a feature to let the user have their own auto file name convention based on date and how many records were created so far this day/month.
What I need is to prevent duplicate user file names. Is there a way to setup an indexed field to be unique? Like a sql unique constraint?
You'd need to use the field that is supposed to be unique as id for your documents. By default a new document with existing id would override the existing document with same id, but you can switch to op_type=create in order to get back an error if a document with same id already exists.
There's no way to have the same behaviour with arbitrary fields though, only the _id field works that way. I would probably consider handling this logic in the application layer instead of within elasticsearch.
One solution will be to use uniqueId field value for specifying document ID and use op_type=create while storing the documents in ES. With this you can make sure your uniqueId field will have unique value and will not be overridden by another same valued document.
For this, the elasticsearch document says:
The index operation also accepts an op_type that can be used to force a create operation, allowing for "put-if-absent" behavior. When create is used, the index operation will fail if a document by that id already exists in the index.
Here is an example of using the op_type parameter:
$ curl -XPUT 'http://localhost:9200/es_index/es_type/unique_a?op_type=create' -d '{
"user" : "kimchy",
"uniqueId" : "unique_a"
}'
If you run the above request it is ok, but running it the next time will give you an error.
You can use the _id in the column you want to have unique contraint on.
Here is the sample river that uses postgresql. Yo can change the Database Driver/DB-URL according to your usage.
curl -XPUT localhost:9200/_river/simple_jdbc_river/_meta -d "{\"type\":\"jdbc\",\"jdbc\":{\"strategy\":\"simple\",\"poll\":\"1s\",\"driver\":\"org.postgresql.Driver\",\"url\":\"jdbc:postgresql://DB-URL/DB-INSTANCE\",\"user\":\"USERNAME\",\"password\":\"PASSWORD\",\"sql\":\"select t.id as _id,t.name from topic as t \",\"digesting\" : true},\"index\":{\"index\":\"jdbc\",\"type\":\"topic_jdbc_river1\"}}"
So far as to ES 7.5, there is no such extra "constraint" to ensure uniqueness using a custom field in the mapping.
But you still can walk around it via your own application UUID, which could be used directly explicitly as the _id (which is unique implictly) to achieve your goals.
PUT <your_index_name>/_doc/<your_app_uuid>
{
"a_field": "a_value"
}
Another approach might be to generate the string you store in a field that should be unique by integrating an auto-incrementing integer. This way you ensure from the start that your field values are unique.
You would put your file name together like this:
<current day/month>_<auto-incremented integer>
Auto-incrementing integers are not supported by Elasticsearch per se but you could mimic them using this approach. If you happen to use node.js you can use the es-sequence module.

Resources