elasticsearch copy field when indexing - elasticsearch

I would like to create a one-to-many relationship for the purpose of aggregations.
The "join" will be according to a field called "common_id":
When I create the first document belonging to a group, I would like to use its flakeId (its _id) as the common_id.
When adding other document belonging to the same group I would like to explicitly set the common_id to have the same value as the first document I added. This can be done by my app since my application will know the common_id of the first element.
My problem is with the first document:
How can I tell Elasticsearch to copy the _id into common_id in a single call? (I know I can do it using an update script, or using two calls, one for index and one for update, but this requires two requests instead of one.)
I would like a simple syntax for this.
thanks

Related

Updating a record by query in ElasticSearch using olivere/elastic in google go

I am using the olivere/elastic library for Elasticsearch in my Go app. I have a list of values for a particular field (say fieldA) of an Elasticsearch document. I want to update a particular field of all documents by searching on field fieldA.
The question "Updating a record in ElasticSearch using olivere/elastic in google go" explains the update part. But in my case I don't have the IDs of the documents to be updated. So either I make a search call to retrieve the document IDs and then update them, or is there another way I am missing? Thanks in advance.
If you need to update a list of documents, you can use the Update By Query API. The unit tests give you a hint about what the syntax looks like. However, if you have individual values for individual documents, I guess there's no other way than updating them one by one. The fastest way to achieve that is by using the Bulk API.
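As a rough illustration of the Update By Query route, here is a minimal sketch in Python that only builds the JSON request body (it does not use olivere/elastic and does not send anything). The helper name, the target field `status` and the value `archived` are hypothetical, and the script section assumes a version that uses Painless with the `source` key; older releases use a different script language and key name.

```python
import json


def build_update_by_query_body(field, values, target_field, new_value):
    """Build a body for POST /<index>/_update_by_query (hypothetical helper).

    Matches every document whose `field` holds one of `values` and sets
    `target_field` to `new_value` through a script.
    """
    return {
        "query": {"terms": {field: list(values)}},
        "script": {
            # Assumption: Painless and the "source" key; adjust for your version.
            "lang": "painless",
            "source": "ctx._source[params.field] = params.value",
            "params": {"field": target_field, "value": new_value},
        },
    }


ubq_body = build_update_by_query_body("fieldA", ["a1", "a2"], "status", "archived")
print(json.dumps(ubq_body, indent=2))
```

This only covers the case where all matched documents receive the same value; individual values per document still call for the Bulk API.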

How to detect when a new unique term has been inserted into an index on a specific field in a specific index in Elasticsearch?

I currently have a cron job that is looking at a field called "ex.set" and performs these tasks:
For every index, run a terms aggregation on the field "ex.set"
For every index, get every existing alias
For every unique term appearing in an index in "ex.set", if it does not have an existing alias, create a filtered alias
The job runs every ten minutes but most of the time does not find anything. Is there a way or a plugin (compatible with 2.3.x), that will automatically detect when a new unique term has been inserted into an index on a specific field in a specific index? And then if there is a unique item trigger the creation of a filtered alias on that index? Thank you in advance for any ideas or solutions.
Yes, I believe you can use the Watcher plugin to do this. It has a default license valid for 30 days; after that some features are disabled, and you need a valid license to have it fully working again.
The basic idea is that your first two steps can be put in a chain input as search inputs which will collect the data.
Then, the additional step which compares the existent aliases with the terms from that aggregation can be considered as a script condition where you do your magic of comparing the two sets. If your condition establishes that a new alias needs to be created then, in the action part of the watch you can use a webhook action to call the create alias REST command on the index.
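The comparison step described above amounts to a set difference followed by one alias action per missing term. The sketch below shows that logic in plain Python rather than as a Watcher script; the helper name is hypothetical, and it assumes each alias is named exactly after its term. The returned dictionary has the shape of a body for the `_aliases` endpoint, which the webhook action would post.

```python
def missing_alias_actions(index, terms, existing_aliases, field="ex.set"):
    """Return an _aliases request body that adds a filtered alias for every
    term that does not yet have an alias on `index` (hypothetical helper)."""
    # Assumption: an alias carries the same name as the term it filters on.
    missing = sorted(set(terms) - set(existing_aliases))
    return {
        "actions": [
            {
                "add": {
                    "index": index,
                    "alias": term,
                    "filter": {"term": {field: term}},
                }
            }
            for term in missing
        ]
    }


alias_body = missing_alias_actions("logs-1", ["red", "blue", "green"], ["blue"])
print([action["add"]["alias"] for action in alias_body["actions"]])
```

When the `actions` list comes back empty, there is nothing to create and the request can be skipped.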

Avoid duplicate documents in Elasticsearch

I parse documents from a JSON, which will be added as children of a parent document. I just post the items to the index, without taking care about the id.
Sometimes there will be updates to the JSON and items will be added to it. So e.g. I parsed 2 documents from the JSON and after a week or two I parse the same JSON again. This time the JSON contains 3 documents.
I found answers like: 'remove all children and insert all items again.', but I doubt this is the solution I'm looking for.
I could compare each item to the children of my target-parent and add new documents, if there is no equal child.
I wondered if there is a way, to let elasticsearch handle duplicates.
Duplication needs to be handled in ID handling itself.
Choose a key that is unique for a document and use it as the _id. If the key is too large or consists of multiple keys, create a SHA checksum out of it and use that as the _id.
If you already have duplicates in the database, you can use a terms aggregation nested with a top_hits aggregation to detect them.
You can read more about this approach here.
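For the detection step, a search body along the following lines may help. It is a sketch that only builds the JSON: the helper name, the aggregation names and the field `item_key` are hypothetical, and the field is assumed to be one that identifies a logical document and supports terms aggregations.

```python
import json


def duplicate_detection_body(key_field, max_groups=100, docs_per_group=5):
    """Build a search body listing values of `key_field` shared by two or
    more documents, with a few sample hits per value (hypothetical helper)."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "aggs": {
            "duplicates": {
                "terms": {
                    "field": key_field,
                    "min_doc_count": 2,  # keep only values seen at least twice
                    "size": max_groups,
                },
                "aggs": {"docs": {"top_hits": {"size": docs_per_group}}},
            }
        },
    }


dup_body = duplicate_detection_body("item_key")
print(json.dumps(dup_body, indent=2))
```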
When adding a new document to Elasticsearch, it first checks whether a document with the same ID already exists. If one does, that document is updated instead of a duplicate being added (the version field is incremented at the same time to track the number of updates that have occurred). You will therefore need to keep track of your document IDs somehow and use the same ID for matching documents to eliminate the possibility of duplicates.
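One way to keep IDs stable across repeated parses is to derive the _id from the fields that make an item unique, as suggested above. The following is a minimal sketch using SHA-1 from the Python standard library; the helper name and the field names `parent` and `title` are hypothetical, and the choice of key fields depends on your data.

```python
import hashlib
import json


def document_id(doc, key_fields):
    """Derive a deterministic _id from the values of `key_fields`
    (hypothetical helper). Identical key values always give the same ID."""
    # Serialise the key values in a fixed order so the hash is reproducible.
    key = json.dumps([doc[f] for f in key_fields], separators=(",", ":"))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


first_parse = {"parent": "p1", "title": "Item A", "seen": "week 1"}
second_parse = {"parent": "p1", "title": "Item A", "seen": "week 3"}

id_one = document_id(first_parse, ["parent", "title"])
id_two = document_id(second_parse, ["parent", "title"])
print(id_one == id_two)
```

Indexing both parses with this _id overwrites the existing child instead of creating a second copy, so re-parsing the same JSON only adds the items that are actually new.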

Bulk add new field to ALL documents in an elasticsearch index

I need to add a new field to ALL documents in an index without pulling each document down and pushing it back up (this would take about a day). Is it possible to use the _bulk API to achieve this?
I have also researched the update_by_query plugin, and it seems it would take just as long as pulling them down and pushing them back myself.
Yes, the bulk API supports updates which can add a new field using a partial document or script. To iterate through your document ids do a scan and scroll with the fields parameter set to an empty array.
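To make the bulk update format concrete, here is a sketch that turns a list of document IDs (for example, collected from the scroll) into a newline-delimited bulk body with one partial-document update per ID. The helper name, the index name and the new field are hypothetical; older versions also expect a `_type` entry in each action line, which is left out here.

```python
import json


def bulk_update_body(index, doc_ids, partial_doc):
    """Build a newline-delimited body for POST /_bulk that merges
    `partial_doc` into every document in `doc_ids` (hypothetical helper)."""
    lines = []
    for doc_id in doc_ids:
        # Action line: which document to update.
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        # Payload line: the fields to merge into the existing document.
        lines.append(json.dumps({"doc": partial_doc}))
    # The bulk format requires every line, including the last, to end in "\n".
    return "".join(line + "\n" for line in lines)


bulk_body = bulk_update_body("my_index", ["1", "2"], {"new_field": "default"})
print(bulk_body, end="")
```

Sending the IDs in batches of a few thousand per request keeps each bulk body at a manageable size.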

Lucene filter with docIds

I'm trying to do the following: I want to create a set of candidates by querying each field separately and then adding the top k matches to this set. After I'm done with that, I need to run another query on this candidate set.
The way I have implemented it right now is using a QueryWrapperFilter with a BooleanQuery that matches the unique id field of each candidate document. However, this means I have to call IndexSearcher.doc().get("docId") for each candidate document before I can add it to my BooleanQuery, which is the major bottleneck. I'm only loading the docId field via MapFieldSelector("docId").
I wanted to create my own Filter class, but I can't use the internal Lucene doc ids directly, because they are specified per segment. Any thoughts on how to approach this?
Instead of reading the stored docId, index the field (it probably already is) and use the FieldCache to retrieve docIds much faster. Then instead of using the docIds in a BooleanQuery, try using a TermsFilter or FieldCacheTermsFilter. The latter documentation describes the performance trade-offs.
