How to override elastic-search routing by a plugin? - hadoop

Since 2.x, Elasticsearch has disabled the routing field in documents.
Because of this, it has become really difficult to specify the routing parameter via the HTTP call, especially when the code that makes the call is embedded deep in a third-party library such as the elasticsearch-hadoop plugin (see CommonsHttpTransport.execute(), for example).
I see a couple of old posts that talk about overriding the "routing" parameter with a plugin (see github/elasticsearch-direct-routing-plugin and the hashing-algo-for-routing post).
But I cannot find the setting cluster.routing.operation.hash.type anywhere in the Elasticsearch code.
Does someone know whether this option is supported in the latest version of ES, or whether the latest version of Elasticsearch supports any way to override routing other than the URL parameter?
I want to completely disable routing so that whichever node receives a batch of documents indexes it locally, without distributing it to any other primary shards (its own replicas are allowed, of course). This would greatly improve our ingestion from Storm to Elasticsearch, where the number of ES bolts is much larger than the number of ES primary shards and those bolts distribute the load equally among all primaries.
We will never search by ID.
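For context, the Elasticsearch documentation describes shard selection as a hash of the routing value (the document ID by default) taken modulo the number of primary shards. The sketch below only illustrates that formula. `pick_shard` is a hypothetical name, and `zlib.crc32` is a stand-in assumption for the Murmur3 hash that Elasticsearch uses internally, so the shard numbers it prints will not match a real cluster.

```python
import zlib
from collections import Counter


def pick_shard(routing: str, num_primary_shards: int) -> int:
    """shard = hash(routing) % number_of_primary_shards.

    zlib.crc32 is only a stand-in; Elasticsearch uses Murmur3 internally.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


# Default behaviour: routing falls back to the document ID, so a batch
# received by one node is spread over the primaries.
doc_ids = [f"doc-{i}" for i in range(1000)]
by_id = Counter(pick_shard(doc_id, 5) for doc_id in doc_ids)

# One shared routing value per batch: every document in the batch
# maps to the same primary shard.
by_batch = Counter(pick_shard("bolt-7", 5) for _ in doc_ids)

print(sorted(by_id.items()))
print(sorted(by_batch.items()))
```

A shared routing value keeps a batch on one shard, but it does not guarantee that the shard lives on the node that received the request.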

Related

Is using Elastic Search as authoritative datastore for applications advisable?

I'm new to Elasticsearch, and I'm trying to find a datastore for our application to which we can also add a front end for analytics, in this case Kibana. I'm planning to use it as a datastore for dr/cr transactions in our billing system.
Most use cases I have read about relate to data analytics and search. I don't see a use case where it is used as a regular datastore for an application, so I'm worried I might be using it for the wrong purpose.
I was hoping someone could share their insights on why or why not to use Elasticsearch as the authoritative/primary datastore for an application.
You should read the official Elasticsearch blog, which clearly states that a database must be robust and should not stop working unless you tell it to.
From the robustness section of the same blog:
A database should be robust, especially if it is your authoritative
system of record. Ideally, a costly query should be possible to
cancel, and you certainly don't want the database to stop working
unless you tell it to.
Unfortunately, Elasticsearch (and the components it's made of) does
not currently handle OutOfMemory-errors very well. We cover this in
more depth in Elasticsearch in Production, OutOfMemory-Caused Crashes.
It is very important to provide Elasticsearch with enough memory and
be careful before running searches with unknown memory requirements on
a production cluster.
In short, you shouldn't use Elasticsearch as a primary data store where you can't afford to lose the data.

How to build parent child relationship search in Elastic-search using Liferay API's?

We have a custom entity in Liferay called 'Publication'. It is indexed in Elastic-search and contains a field named 'journalArticleId'.
Our search requirement is this: when a user searches for a keyword that occurs in a journal article, we have to return the publication document whose 'journalArticleId' refers to that journal article.
I found a solution that implements this using the Java API, but I'm looking for a Liferay API to solve this.
Elastic Search Parent-Child Data Search Java API
Thanks in advance for any response.
To extend existing indexers, you should try to implement an indexer post processor hook instead of actually overriding them with an ext plugin.
Link for 6.2:
https://dev.liferay.com/de/develop/tutorials/-/knowledge_base/6-2/extending-the-indexer-post-processor-using-a-hook
Link for 7 (aka DXP):
https://dev.liferay.com/de/develop/reference/-/knowledge_base/7-0/indexer-post-processor
You should be able to find documentation for overriding an indexer. It sounds like you could just extend the existing Journal Indexer: Just add the additional Publication data to the full text index for the existing Journal article and it will be found automatically.
Edit (after your comment): Without looking it up, I assume that Liferay's encapsulation of the API does not really cater for parent-child relationships (but I might be wrong; it might be built in or easy to add). Liferay also allows you to exchange Elasticsearch for Solr (and potentially others), so its API naturally doesn't use all of the features of the underlying search engines. However, you should always be able to make the extra calls yourself, probably not in the indexer but closer to the ES adapter.
The solution might be: Prepare the content in the Indexer and separate it into parent and child later, in the Elasticsearch adapter.
Elasticsearch provides parent-child mapping, and the solution for the situation above can be implemented using the Java API.
Elastic Search Parent-Child Data Search Java API
We have contacted the Liferay support team and they responded that the Liferay-elasticsearch adapter doesn't support this feature yet.
version : liferay-dxp-digital-enterprise-7.0-sp3
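As a rough illustration of what the underlying Elasticsearch feature looks like outside Liferay, here is a sketch of the request bodies for a `_parent` mapping and a `has_parent` query. It assumes an Elasticsearch 2.x cluster (later major versions replaced `_parent` with a join field) and assumes the journal article is the parent and the publication is the child. The type names `journal_article` and `publication` and the field `content` are hypothetical, and nothing here is Liferay API.

```python
import json


def publication_mapping(parent_type: str = "journal_article") -> dict:
    """Mapping for the child type; `_parent` names the parent type."""
    return {"publication": {"_parent": {"type": parent_type}}}


def publications_for_keyword(keyword: str,
                             parent_type: str = "journal_article",
                             field: str = "content") -> dict:
    """has_parent query: return child documents (publications) whose
    parent (journal article) matches the keyword."""
    return {
        "query": {
            "has_parent": {
                "parent_type": parent_type,
                "query": {"match": {field: keyword}},
            }
        }
    }


# Print the bodies that would be sent to the mapping and search endpoints.
print(json.dumps(publication_mapping(), indent=2))
print(json.dumps(publications_for_keyword("elasticsearch"), indent=2))
```

With this layout each publication would have to be indexed with its journal article as parent, which is the part the Liferay adapter reportedly does not do.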

Elasticsearch: security concerns

We are using Elasticsearch as the back end for our in-house logging and monitoring system. Multiple sites send data to one ES cluster, each to its own index; e.g. abc-us holds data from the US site and abc-india holds data from the India site.
We now need some security checks before data is pushed to the cluster:
Data written to an index must come from the right IP address.
An incoming JSON request must insert new data, not delete or update existing data.
For reads, a given IP must not be able to read the data of another site's index.
Kindly let me know if it's possible to achieve this using Elasticsearch.
The elasticsearch-jetty plugin brings the full power of Jetty and adds several new features to Elasticsearch. With this plugin, Elasticsearch can handle SSL connections, support basic authentication, and log all or some incoming requests in plain-text or JSON format.
The idea is to add a Jetty wrapper to Elasticsearch, as a plugin.
What remains is only to restrict certain URLs and some methods (e.g. DELETE) to certain users.
You can find elasticsearch-jetty on GitHub, with a detailed specification of its usage, configuration and, of course, limitations.
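The three checks from the question can be written down as a small allow/deny rule. The sketch below is only a model of the policy, of the kind a proxy or wrapper in front of Elasticsearch would enforce; it is not elasticsearch-jetty configuration. The IP addresses, the `SITE_INDEX` table and the `is_allowed` function are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical policy table: source IP -> the single index it may use.
SITE_INDEX = {
    "10.0.1.5": "abc-us",
    "10.0.2.5": "abc-india",
}


def is_allowed(client_ip: str, method: str, path: str) -> bool:
    """Allow only inserts and reads on the caller's own index."""
    index = SITE_INDEX.get(client_ip)
    if index is None:
        return False  # unknown source address
    parts = [p for p in urlsplit(path).path.split("/") if p]
    if not parts or parts[0] != index:
        return False  # request targets another site's index
    if method == "GET":
        return True  # reads on the caller's own index
    if method == "POST":
        # POST /index/type creates a document with a generated ID;
        # _update, _bulk and similar endpoints are rejected.
        return len(parts) == 2 and not parts[1].startswith("_")
    return False  # PUT (may overwrite) and DELETE are rejected


print(is_allowed("10.0.1.5", "POST", "/abc-us/logs"))
print(is_allowed("10.0.1.5", "DELETE", "/abc-us/logs/1"))
```

Rejecting PUT is a deliberately strict choice here, since a PUT with an explicit ID can overwrite an existing document.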

Using elasticsearch as central data repository

We are currently using elasticsearch to index and perform searches on about 10M documents. It works fine and we are happy with its performance. My colleague who initiated the use of elasticsearch is convinced that it can be used as the central data repository and other data systems (e.g. SQL Server, Hadoop/Hive) can have data pushed to them. I didn't have any arguments against it because my knowledge of both is too limited. However, I am concerned.
I do know that data in Elasticsearch is stored in a manner that is efficient for text searching. Hadoop stores data just as a file system would, but in a manner that makes it efficient to scale and replicate blocks over multiple data nodes. Therefore, in my mind it seems more beneficial to use Hadoop (as it is more agnostic w.r.t. its view of data) as the central data repository, and then push data from Hadoop to SQL, Elasticsearch, etc.
I've read a few articles on Hadoop and elasticsearch use cases and it seems conventional to use Hadoop as the central data repository. However, I can't find anything that would suggest that elasticsearch wouldn't be a decent alternative.
Please Help!
As is the case with all database deployments, it really depends on your specific application.
Elasticsearch is a great open source search engine built on top of Apache Lucene. Its features and upgrades allow it to basically function just like a schema-less JSON datastore that can be accessed using both search-specific methods and regular database CRUD-like commands.
Despite all the advantages that Elasticsearch brings, there are still some main disadvantages:
Security - Elasticsearch did not originally provide any authentication or access control functionality. This has been supported since Shield was introduced.
Transactions - There is no support for transactions or processing on data manipulation. Data manipulation can now be handled with Logstash.
Durability - ES is distributed and fairly stable, but backups and durability are not as high a priority as in other data stores.
Maturity of tools - ES was relatively new and had not had time to develop mature client libraries and third-party tools, which could make development much harder. It can now be considered quite mature, with a variety of connectors and tools around it, like Kibana.
Large computations - ES is still not suited to large computations: commands for searching data are not suited to "large" scans of data and advanced computation on the DB side.
Data availability - ES makes data available in "near real-time", which may require additional considerations in your application (e.g. on a comments page where a user adds a new comment, refreshing the page might not actually show the new post because the index is still updating).
If you can deal with these issues then there's certainly no reason why you can't use Elasticsearch as your primary data store. It can actually lower complexity and improve performance by not having to duplicate your data but again this depends on your specific use case.
As always, weigh the benefits, do some experimentation and see what works best for you.
DISCLAIMER: This answer was written a while ago for the Elasticsearch 1.x series. These criticisms still stand to some extent for the 2.x series. But Elastic is working on them, as the 2.x series comes with more mature tools, APIs and plugins, for example Shield for security, or transport clients like Logstash or Beats.
I'd highly discourage most users from using Elasticsearch as their primary datastore. It will work great until your cluster melts down due to a network partition. Even settings such as minimum_master_nodes, which the ES pros always set, won't save you. See this excellent analysis by Aphyr in his Call Me Maybe series:
http://aphyr.com/posts/317-call-me-maybe-elasticsearch
eliasah is right that it depends on your use case, but if your data (and job) is important to you, stay away.
Keep the golden record of your data in something really focused on persistence, and sync your data out to search from there. It adds extra complexity and resources, but will result in a better night's rest :)
There are plenty of ways to go about this and if elasticsearch does everything you need, you can look into Kafka for persisting all the events going into a cluster which would allow replaying if things go wrong. I like this approach as it provides an async ingestion pipeline into elasticsearch that also does the persistence.
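A minimal sketch of that replay idea, using plain in-memory stand-ins rather than real Kafka or Elasticsearch clients: `EventLog` plays the role of a durable, append-only topic and a dict plays the role of the search index. All names here are hypothetical.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, dict]  # (document id, document body)


class EventLog:
    """In-memory stand-in for a durable, append-only log such as a Kafka topic."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, doc_id: str, body: dict) -> int:
        self._events.append((doc_id, body))
        return len(self._events) - 1  # offset of the new event

    def read_from(self, offset: int = 0) -> List[Event]:
        return self._events[offset:]


def replay(log: EventLog, index: Dict[str, dict], offset: int = 0) -> None:
    """Rebuild (or catch up) a search index by re-applying logged events."""
    for doc_id, body in log.read_from(offset):
        index[doc_id] = body  # idempotent: last write per id wins


log = EventLog()
log.append("1", {"amount": 10})
log.append("2", {"amount": 25})
log.append("1", {"amount": 12})

search_index: Dict[str, dict] = {}
replay(log, search_index)   # normal ingestion
search_index.clear()        # simulate losing the cluster
replay(log, search_index)   # rebuild from the log
print(search_index)
```

Because each event overwrites the document with the same ID, replaying the log twice leaves the index in the same state, which is what makes recovery by replay safe.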

Elasticsearch: Disable the Versioning of index document

I can see that Elasticsearch versions indexed documents on every update. I would like to know if there is a way to disable these versions. I understand it's a good functionality to have, but I'm just exploring the tool. Also, are there any performance concerns with a large number of versions?
Thanks
Versioning doesn't mean that elasticsearch keeps every version of the document. It keeps only the latest one. Every document has by default an implicit _version field, that gets incremented whenever the document changes. That allows for optimistic locking if you need it. You can also have your own external versioning system if you want. Have a look at this article that is all about versioning with elasticsearch.
This doesn't hurt performance at all, so I wouldn't look into disabling it.
You cannot disable versioning.
But the ES server performs a version check only if you specify
?version=[version_number] or ?retry_on_conflict=[retry_times]
If you don't use either of these two parameters, the ES server will not do a version check or a conflict check, which is another way of "disabling versioning".
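The behaviour described in these answers (only the latest document is kept, the version counter increments on every change, and a check happens only when a version is supplied) can be modelled with a toy in-memory store. This is a sketch of the concept, not Elasticsearch's implementation; `VersionedStore` and `VersionConflict` are hypothetical names.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedStore:
    """Toy model: keeps only the latest document plus a version counter."""

    def __init__(self) -> None:
        self._docs = {}  # doc_id -> (version, body)

    def index(self, doc_id, body, version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # The check runs only when the caller supplies a version.
        if version is not None and version != current_version:
            raise VersionConflict(f"expected {version}, found {current_version}")
        self._docs[doc_id] = (current_version + 1, body)
        return current_version + 1

    def get(self, doc_id):
        return self._docs[doc_id]


store = VersionedStore()
v1 = store.index("1", {"status": "new"})                # first write
v2 = store.index("1", {"status": "paid"}, version=v1)   # check passes
try:
    store.index("1", {"status": "void"}, version=v1)    # stale version
    conflict = False
except VersionConflict:
    conflict = True
print(store.get("1"), conflict)
```

Writes that omit `version` always succeed in this model, which mirrors the "no check unless you ask for one" behaviour described above.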
