Prevent data duplication over multiple indices in Elasticsearch

Data duplication prevention is handled at the index level with the field "_id".
However, to avoid having huge indices, I work with several small indices linked under an alias. Is there a mechanism in place to check existing _ids at the alias level (over multiple indices) when a document is inserted, or should it be handled at the application level?

Not natively, no. You'd need to handle this in your own code.

Before inserting your document, you first need to find out which concrete index contains it by searching through the alias:
GET alias/_search?q=_id:123456&filter_path=hits.hits._index
The response gives you the concrete index name, which you can then use to index/update the new version of your document.
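For example, a minimal sketch of that two-step flow in Kibana console syntax (assuming a recent Elasticsearch; the alias my-alias, the concrete index my-index-000002, and the id 123456 are placeholders):

# Step 1: ask the alias which backing index (if any) already holds the id;
# filter_path trims the response down to the concrete index name.
GET my-alias/_search?filter_path=hits.hits._index
{
  "query": { "ids": { "values": ["123456"] } }
}

# Step 2: if a hit came back, write the new version to the concrete index
# from the response; otherwise index into the alias's write index.
PUT my-index-000002/_doc/123456
{
  "field": "updated value"
}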

Related

I have 2 indices, which were 2 tables in SQL; how can I perform an inner join query?

I have 2 indices, one called assignment and the other called user. In SQL there was a foreign-key field linking them, but I do not know how to perform an inner join in Elasticsearch. Can someone help me?
You have a couple of options which might be useful. Without knowing your specific use case, I'm going to list some potentially useful links.
1)
Parent-child mapping is really useful when you want to return all documents associated with a specific document. To make the mapping process a bit easier, I typically index the data, retrieve the mapping using the /_mapping endpoint, modify the mapping, delete the index, then reingest the data. Sometimes that isn't an option, as in the case of short-lived data.
https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child-mapping.html
After updating the current mapping it's possible to use one of the joining queries.
https://www.elastic.co/guide/en/elasticsearch/reference/current/joining-queries.html
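As a hedged sketch, a join-field mapping (the modern replacement for separate parent/child mapping types) and a joining query could look like this; the index, field, and relation names are illustrative:

PUT assignments
{
  "mappings": {
    "properties": {
      "relation": {
        "type": "join",
        "relations": { "user": "assignment" }
      }
    }
  }
}

# Parent and child live in the same index; children are routed to the parent.
PUT assignments/_doc/1
{
  "name": "alice",
  "relation": { "name": "user" }
}
PUT assignments/_doc/2?routing=1
{
  "title": "week 3 homework",
  "relation": { "name": "assignment", "parent": "1" }
}

# Inner-join flavour: assignments whose parent user matches a query.
GET assignments/_search
{
  "query": {
    "has_parent": {
      "parent_type": "user",
      "query": { "match": { "name": "alice" } }
    }
  }
}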
2)
When deleting the index and re-ingesting the data isn't an option: create a new index, modify the mapping as described above, and instead of deleting the old index, use the reindex API to copy the information into the new index.
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
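A minimal _reindex call might look like this (the source and destination names are placeholders):

POST _reindex
{
  "source": { "index": "assignment" },
  "dest": { "index": "assignment-v2" }
}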
3)
It might also be possible to use an ingest processor to join the tables
https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest-processors.html
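On versions that support it (7.5+), the enrich processor is the ingest processor designed for lookup-style joins; a hedged sketch with illustrative names:

# Define a policy that looks up user documents by user_id.
PUT _enrich/policy/user-lookup
{
  "match": {
    "indices": "user",
    "match_field": "user_id",
    "enrich_fields": ["name", "email"]
  }
}

# Build the policy's internal lookup index.
POST _enrich/policy/user-lookup/_execute

# Pipeline that copies the matching user fields onto incoming documents.
PUT _ingest/pipeline/join-user-fields
{
  "processors": [
    {
      "enrich": {
        "policy_name": "user-lookup",
        "field": "user_id",
        "target_field": "user"
      }
    }
  ]
}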
4)
Possibly the quickest option, until you get your head wrapped around how Elasticsearch works, is to either join the information prior to ingesting it or to write a script that joins the tables using one of the SDKs:
https://elasticsearch-py.readthedocs.io/en/master/
https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/index.html
plus many more built by the community.

Elasticsearch index alias

I am trying to use Elasticsearch to filter through millions of documents. All the data is in one index and I want to access it in a 'direct' way.
What do I mean by a direct way?
Direct access means, for example, fetching the 700000th element of this index (not by id). Is this possible somehow?
What I tried already:
from + size works, but doesn't seem fast once the number of elements exceeds 10,000.
Scrolling I haven't tried, but it doesn't seem like the right thing for my use case.
So any other ideas?
Scrolling will not work; that will fetch all the data.
I think Elasticsearch is not the right tool for what you want to do.
It would be better to keep a separate list of the ids; that lets you fetch an id by position and then query Elasticsearch to get the data.
If your data is such that it never gets modified or deleted, you can add an extra field to the mapping that acts like an auto-increment column in a database, and fetch documents by that field, as in the sketch below.
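A hedged sketch of that idea: if the application assigns every document a monotonically increasing counter field (seq here is a made-up name), fetching the 700000th element becomes a cheap term lookup:

GET my-index/_search
{
  "query": { "term": { "seq": 700000 } }
}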

Bulk read of all documents in an elasticsearch alias

I have the following elasticsearch setup:
4 to 6 small-ish indices (<5 million docs, <5 GB each)
they are unioned through an alias
they all contain the same doc type
they change very infrequently (i.e. >99% of the indexing happens when the index is created)
One of the use cases for my app requires reading all documents for the alias, ordered by a field, doing some magic and serving the result.
I understand deep pagination will most likely bring down my cluster, or at the very least have dismal performance, so I'm wondering if the scroll API could be the solution. I know the documentation says it is not intended for use in real-time user queries, but what are the actual reasons for that?
Generally, how are people dealing with having to read through all the documents in an index? Should I look for another way to chunk the data?
When you use the scroll API, Elasticsearch creates a sort of a cursor for the current state of the index, so the reason for it not being recommended for real time search is because you will not see any new documents that were inserted after you created the scroll token.
Since your use case indicates that you rarely update or insert new documents into your indices, that may not be an issue for you.
When generating the scroll token you can specify a query with a sort, so if your documents have some sort of timestamp you could create one scroll context for all documents with timestamp: { lte: "now" }, and another scroll (or even a simple query) for the rest of the documents that were not included in the first search context, by specifying a certain date range filter.
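A sketch of that pattern, assuming a timestamp field and the alias name my-alias:

# Open a scroll over everything up to "now", sorted as required.
GET my-alias/_search?scroll=5m
{
  "size": 1000,
  "sort": [{ "timestamp": "asc" }],
  "query": { "range": { "timestamp": { "lte": "now" } } }
}

# Pull subsequent pages with the scroll id from each previous response.
GET _search/scroll
{
  "scroll": "5m",
  "scroll_id": "<scroll_id from the previous response>"
}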

Set up Elasticsearch suggesters that can return suggestions from different data types

We're in the process of setting up Amazon Elasticsearch Service (running Elasticsearch version 2.3).
We have different types of data (that I'm currently thinking of as different document types within the same index).
We have a generic search in an app where we want an inline autocomplete function, that is, a completion suggester returning hits from all different data (document) types. How can that be set up?
When querying suggesters you have to specify an index, so that's why I wanted to keep all the data in the same index. According to the documentation, the completion suggester considers all documents in the index.
Setting up the completion suggester for the first document type was pretty straightforward and is working great. However, as far as I can see, you have to specify a suggest field when querying. That would be all good had it not been for the error message we get when setting up the mapping for the second document type:
Type: illegal_argument_exception Reason: "[suggest] is defined as an object in mapping [name_of_document_type] but this name is already used for a field in other types"
Writing this question, I see that it's possible to specify more than one suggester in a single suggest query. Maybe that is how we have to solve it? (I.e. get X results from Y suggesters and compare the scores to get the one suggestion we want to present to the user.)
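For illustration, a multi-suggester request in the 2.x _suggest syntax might look roughly like this, assuming each document type got its own suggest field (suggest_a and suggest_b are made-up names):

POST my-index/_suggest
{
  "type-a-suggest": { "text": "foo", "completion": { "field": "suggest_a" } },
  "type-b-suggest": { "text": "foo", "completion": { "field": "suggest_b" } }
}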
One of the core principles of good data design for Elasticsearch (as with many data stores) is to optimise your data storage for ease of reading. Usually, this means embracing duplication.
With this in mind, I'd suggest having a separate autocomplete index with a mapping that's designed specifically for the suggester queries.
Whenever you insert or update one of your other documents, map it to your autocomplete type and add or update it in your autocomplete index at the same time (or, depending on how up to date it needs to be, create an offline process to update the autocomplete index, e.g. every day).
Then, when you do your suggest query, you can just use your autocomplete index and not worry about dealing with different types of documents with different fields.
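As a sketch, that dedicated index could be as simple as a single completion field that every source type is mapped into (names are illustrative; the query uses the 2.x _suggest syntax to match the version in the question):

PUT autocomplete
{
  "mappings": {
    "item": {
      "properties": {
        "suggest": { "type": "completion" }
      }
    }
  }
}

POST autocomplete/_suggest
{
  "my-suggestion": {
    "text": "elas",
    "completion": { "field": "suggest" }
  }
}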

Is there a way to update a specific entry in an ElasticSearch index through Titan?

Hi, I am using Elasticsearch along with Titan.
Right now I have a mixed index on a specific property for a specific type of vertex:
// Build a mixed (Elasticsearch-backed) index named "EntityTextFull"
// on the "text" property, restricted to vertices labelled "Entity".
PropertyKey textProp = mgmt.getPropertyKey(EntityProps.text);
VertexLabel entityClass = mgmt.getVertexLabel(VertexLabels.Entity);
mgmt.buildIndex("EntityTextFull", Vertex.class)
    .indexOnly(entityClass)
    .addKey(textProp)
    .buildMixedIndex("search");
The indexed key values are not unique. I wonder if there is a way to update some properties, including the indexed property, for a specific vertex, and then somehow reindex that specific vertex against this specific index.
Thanks,
Michail
You could potentially use the _update endpoint and do partial updates:
Externally, it appears as though we are partially updating a document in place. Internally, however, the update API simply manages the same retrieve-change-reindex process that we have already described. The difference is that this process happens within a shard, thus avoiding the network overhead of multiple requests. By reducing the time between the retrieve and reindex steps, we also reduce the likelihood of there being conflicting changes from other processes.
I've no experience with Titan, but I suppose you could translate raw queries into it.
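A minimal sketch of such a partial update (index, type, id, and field are placeholders; the index/type/id/_update form matches Titan-era Elasticsearch versions):

POST my-index/my-type/123456/_update
{
  "doc": { "text": "new value" }
}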
