Is it possible to know when some data is available for being searched in Elasticsearch? - elasticsearch

I'm implementing software in which data is sent to a web server, stored in Elasticsearch, and then queried right away. I know that Elasticsearch is a NoSQL store following the BASE (Basically Available, Soft state, Eventual consistency) principles, which means there is no guarantee about when your data will become available for searching.
That's why, when I query for data that has just been added to Elasticsearch, I have to wait for some time before it is found. Right now all I can do is implement a polling mechanism to detect when the data has been completely applied. It is worth mentioning that if I use _id to retrieve a document, it is found right away. But if I search for it using some type of Elasticsearch query (like term or query_string), it takes a while before the document is found.
So my question is: is there a cheaper way to detect when data is completely indexed in Elasticsearch?

This part is handled by the Refresh API, but that API does not provide a way to know when the indexed data becomes available for search. The folks at Elastic are working on a hack to let the request wait for a refresh.
I think it would be better to take a look here: https://www.elastic.co/blog/refreshing_news
That post gives a good overview of the issues and of the things they are working on to improve.
Hope it helps :D
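For readers on a version where waiting for a refresh is available, here is a minimal sketch of what such an index request could look like. It only builds the HTTP request with the standard library and does not send it. The host, the index name "events", the document, and the helper name build_index_request are all made up for illustration, and the /_doc/ path assumes a typeless (7.x or later) cluster.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_id, document, refresh="wait_for"):
    """Build (but do not send) an index request that asks Elasticsearch
    to respond only after a refresh has made the document searchable."""
    url = f"{base_url}/{index}/_doc/{doc_id}?refresh={refresh}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host, index name, and document, for illustration only.
request = build_index_request(
    "http://localhost:9200", "events", "42", {"user": "alice", "status": "new"}
)
# Sending it would be: urllib.request.urlopen(request)
```

With refresh=wait_for the write call itself blocks until the next refresh, so the caller needs no polling loop, at the cost of slower individual writes.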

Related

Query ElasticSearch after the index operation

I have service A, which executes some text processing. After it, service B has to execute a set of Elasticsearch queries on the document. The connectivity between the services is provided by Kafka. The solution is tightly coupled to ES free-text search capabilities, so I can't query in another way.
Possible solution:
Store the document in ES and query it. The problem is that ES is eventually consistent and I don't know whether the document is already indexed or not.
Is there some API to ensure that the document is already indexed?
Another option is to publish a message from service A with delay X+5 seconds, where X is the refresh interval of the index, where the document should be stored. Seems to me an unreliable solution. What do you think?
Another direction I thought about is some way to run ES queries against the document while it is in memory. For example, if I had some magic way to convert the ES query to the Lucene DSL, I wouldn't need to deal with the eventually consistent behavior of Elasticsearch and could query Lucene directly.
Maybe there are some other solutions?
Take a look at the ?refresh flag, so that an indexing request will only return once a refresh has happened. Otherwise you can use the GET API to see whether the document exists or not.
However, there are no magic options here; Elasticsearch is eventually consistent and you need to factor that in.
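If you do end up checking for the document yourself, a bounded polling loop is one way to do it. The sketch below is generic: probe is a hypothetical callable that you would implement as a GET or search request against your cluster, and wait_until_visible is a made-up helper name, not part of any Elasticsearch client.

```python
import time


def wait_until_visible(probe, timeout=5.0, interval=0.1,
                       sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or `timeout` seconds have passed.

    Returns True if the probe succeeded, False if the deadline was reached.
    """
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Stand-in probe that succeeds on the third call; a real probe would
# query Elasticsearch for the document instead.
attempts = []


def fake_probe():
    attempts.append(1)
    return len(attempts) >= 3


found = wait_until_visible(fake_probe, timeout=1.0, interval=0.01)
```

The timeout keeps service B from waiting forever if the document never shows up, which is the main weakness of a fixed "X+5 seconds" delay.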

Both ElasticSearch and Redis, overkill usecase?

I'm currently designing the architecture of my project, or at least trying to figure out what will be useful in my case.
Simple use case
I will have several thousand profiles in a backend and I need to implement a fast search engine, so Elasticsearch looks perfect in that case. Every time a profile is updated, the index will be updated by an asynchronous task.
My question now is: if I want to implement a cache system for the details of a profile, should I stick with Elasticsearch and put this data in my index? Or use Redis and do something like profil_id => data?
I think both sound good. The problem is that whenever a profile is updated, I will have to flush the cached entry after the reindexing in Elasticsearch if I want to see the change in my backend.
So what can I do? Thank you so much!
You should consider using RediSearch. RediSearch can provide a solution for your needs, giving you both Redis performance and full-text search support.
Elasticsearch and Redis are basically meant to solve two different problems: one does indexing while the other does caching.
Redis is meant to return already requested data as fast as possible, whereas Elasticsearch is a search and analytics engine. It would perfectly fit a use case where you have to implement a fast search engine, and it will be more performant than any in-memory data structure store or cache such as Redis (assuming your searches are complex and involve some aggregations/filters).
The problem comes with profile updates. Since your profile updates are not that frequent, you could do partial updates to the ES index rather than a full reindex. So whenever a person updates their profile, get the change set (the changed data) and do a partial update to the particular document in the ES index. You can see how it's done here: partial update.
This particular Stack Overflow answer will help you: cache vs indexing
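A rough sketch of the change-set idea, assuming profiles are flat dictionaries. The helper names change_set and partial_update_body and the sample profile are made up for illustration. The {"doc": ...} envelope is what the Update API expects for a partial document, and on recent versions the body would be sent with POST to /<index>/_update/<id>.

```python
import json


def change_set(old, new):
    """Return only the top-level fields of `new` that are missing from
    `old` or have a different value there."""
    return {
        key: value
        for key, value in new.items()
        if key not in old or old[key] != value
    }


def partial_update_body(changes):
    """Wrap the changed fields in the {"doc": ...} envelope used for
    partial document updates."""
    return json.dumps({"doc": changes})


# Hypothetical profile before and after an edit.
old_profile = {"name": "Ann", "city": "Oslo", "age": 30}
new_profile = {"name": "Ann", "city": "Bergen", "age": 31}

changes = change_set(old_profile, new_profile)
body = partial_update_body(changes)
```

Note that this sketch ignores fields that were removed from the profile and does not look inside nested objects; both would need extra handling.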

how to have elasticsearch search (API) deal with a time for which it would be consistent with

The Elasticsearch search API is eventually consistent, in favour of query response times.
Eventual consistency is not an issue as long as one can be sure of the date the results are consistent with.
Can any Elasticsearch search (API) result return the time with which it is consistent?
or
Is there any way to have Elasticsearch search (API) results include the time with which they are consistent?
or
Would it be possible to provide a date at search query time, against which the API could either respond «inconsistent with the provided date» or return the result consistent with that date?
The ultimate desired function is to be able to provide a (possibly functional) date for each bulk update/indexing step and have the search query deal with it. If that is not possible, the real technical update/indexing date could be enough.
Is it possible to know when a given update/synchronization among all nodes is over?
This question came to me in the process of evaluating Elasticsearch for an industrial project, after reading this
Oh, that's a lot of questions )
First of all, there's a big chance that if you're facing consistency requirements/issues, you need to switch to a consistent data store. ES is great and all, but sometimes consistency is a must.
Talking about "time since last sync"/"time consistent with": as far as I know, there's the synced flush machinery, which gives you the ability to check the state of "inactive" indices via /_stats?level=shards. Not sure if it's suitable for your use case.
The thing you can do if you need a consistent state is to index data with wait_for_active_shards=all, which kind of turns your index from AP to CP mode, or [occasionally] read data from the primary shard only with _search?preference=_primary, to make the index kind of CA.
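To make the parameters mentioned above concrete, here is a small sketch that only assembles the request URLs. The host, the index name "events", and the helper es_url are invented for the example. I believe newer major versions no longer accept the _primary preference value and have dropped synced flush, so treat both as version-dependent and check them against the release you run.

```python
from urllib.parse import urlencode


def es_url(base_url, path, **params):
    """Join a base URL, a path, and optional query parameters."""
    query = urlencode(params)
    return f"{base_url}/{path}?{query}" if query else f"{base_url}/{path}"


BASE = "http://localhost:9200"  # hypothetical host

# Write that waits until all shard copies are active before proceeding.
write_url = es_url(BASE, "events/_doc/42", wait_for_active_shards="all")

# Shard-level statistics for the index.
stats_url = es_url(BASE, "events/_stats", level="shards")

# Search routed to primary shards only (version-dependent, see above).
search_url = es_url(BASE, "events/_search", preference="_primary")
```

Keep in mind that wait_for_active_shards only checks that the shard copies are available before the write starts; it does not by itself make the document searchable any sooner.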

bidirectional xdcr elasticsearch and couchbase

I am pretty new to Elasticsearch and Couchbase and I want to understand a few concepts in order to make decisions on how to use them in my application.
My overall question: is it possible to configure bidirectional XDCR between Couchbase and Elasticsearch using the transport-couchbase plugin?
Since it is possible to update documents that are in an Elasticsearch index, I want to be able to propagate that update back to the Couchbase server. Is there any way to do this? I've searched online for anything that can do this, with no luck so far...
As far as I know, it's currently not possible to sync changes from Elasticsearch to Couchbase.
From a semantic perspective, you probably don't want to change an index, and forward a change from the index to the DB backing that index.
I maintain the Couchbase transport plugin and as Laurent answered earlier, it doesn't allow replicating from Elastic back to Couchbase. While it's technically possible to implement this functionality, it wouldn't make sense to do so in practice. The whole point of replicating to Elastic is to use it as an index on top of your source of truth, which is Couchbase. That way you get the high performance reads and writes of Couchbase and the query and search functionality of Elastic. Replicating changes back to Couchbase would mean the data there is no longer authoritative, which in turn opens the door to all sorts of concurrency and data integrity problems.

Internal data storage mechanism of elasticsearch

I have been working with elasticsearch for the past 2 months. I have used both REST approach and API support in different languages to index, get and search data. I also read a lot about elasticsearch and found out it is not a good option to use it as a data store. Why is this? And I'm also curious about how elasticsearch internally stores the indexed data. Any good link or explanation??
Elasticsearch is built on top of Apache Lucene. Here's a reference doc on the Lucene index file structure:
http://lucene.apache.org/core/4_7_2/core/org/apache/lucene/codecs/lucene46/package-summary.html#package_description
Regarding whether or not it's a good option as a data store I think that's more individual opinion and specific use cases than a fact that can be proved. It does not have the transaction support that something like MySQL does if that's what you are looking for. In that case it's somewhat on a par with other NoSQL solutions. This is a pretty decent writeup on the trade-offs and issues: https://www.found.no/foundation/elasticsearch-as-nosql/
In the end it depends on what you are doing with your data and what level of robustness you require.

Resources