How to sync ArangoDB with Elasticsearch?

Is there a Logstash plugin or some other mechanism to sync ArangoDB data with Elasticsearch?
I'm creating a social network platform and want my user and post data to be searchable with ES.
My initial thought was to use MongoDB as the data store, because MongoDB offers ways to sync data to ES via a Logstash plugin.
However, ArangoDB provides graph database capabilities, which are very helpful for storing and querying social relations and fit my requirements best. This makes me more inclined towards ArangoDB.
I didn't find any ES/Logstash plugin for syncing ArangoDB data with my ES cluster. There are a few mentions of a river plugin, which no longer exists. Two questions:
Is there an efficient async way to sync ArangoDB data with ES?
If not, should I drop ArangoDB and go with MongoDB, which does not provide graph capabilities?

Of course, there is a way to sync ArangoDB with ES. I don't remember exactly how to do it, but take a look at this: https://www.arangodb.com/2013/11/new-plugin-elasticsearch-arangodb/
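If no ready-made plugin fits, one generic fallback is to poll for changed documents and push them through the Elasticsearch bulk API. The sketch below is not the linked plugin's method. It assumes each document carries an application-maintained `updated_at` field (hypothetical, not something ArangoDB adds for you), and it only builds the newline-delimited body for a `_bulk` request; fetching from ArangoDB and the HTTP call itself are left out.

```python
import json

# ArangoDB system attributes; stripped so they do not end up in the ES source.
ARANGO_SYSTEM_FIELDS = ("_key", "_id", "_rev")


def changed_since(docs, watermark, field="updated_at"):
    """Return the documents modified after the last synced watermark.

    `field` is an assumed, application-maintained timestamp or counter.
    """
    return [d for d in docs if d[field] > watermark]


def to_bulk_body(docs, index):
    """Build the NDJSON body for an Elasticsearch _bulk request.

    Each document becomes an action line followed by a source line.
    The ArangoDB _key is reused as the ES _id, so re-sending a document
    overwrites the previous copy instead of creating a duplicate.
    """
    lines = []
    for doc in docs:
        action = {"index": {"_index": index, "_id": doc["_key"]}}
        source = {k: v for k, v in doc.items() if k not in ARANGO_SYSTEM_FIELDS}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

A scheduled job could call `changed_since` with the highest `updated_at` value it has already synced, POST the result of `to_bulk_body` to `/_bulk`, and then advance the watermark. Deletions are not covered by this approach and would need separate handling.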

Related

Should I use Java or Logstash to index db content in Elastic Search?

I'm building a simple web application that will list/search retail items for sale.
The design is like this:
MySQL database -> Elasticsearch deployment -> Spring Boot REST service -> Web UI (JSP/Bootstrap or Angular)
I am planning to write Java client code to read the database and post records to Elasticsearch for indexing.
From Googling, it looks like Logstash is used for this sort of thing. I'm not familiar with Logstash, but I am very familiar with Java.
QUESTION: Is the Java client considered a "deprecated" or "legacy" way to submit data to Elasticsearch for indexing?
Given that I'm very familiar with Java, should I use Java or Logstash?
Adding to @chris's answer: Logstash will add complexity and another piece of infrastructure to maintain in your stack, and Logstash is known for getting stuck and is not as resilient as Elasticsearch is.
You are already using Java for your application code, and by the way, Elasticsearch now officially has a Java client known as the Java High Level REST Client (JHLRC), which is very popular and provides an exhaustive list of APIs for indexing, searching, and building a modern search system.
IMHO you should use the JHLRC, because:
it will save you from the pain points of Logstash
you don't have to learn another tool
simple infrastructure
simple deployment
last but not least, a simple and easy-to-maintain codebase.
Logstash is a good tool for migrating data from many sources to Elasticsearch. It also runs on the JVM.
You can use Logstash. It also has options to mutate or filter the data. It's a ready-to-use tool that will save a lot of development time and effort.
But if you need a lot of customisation and a lot of control over your data before pushing it to Elasticsearch, then you can build your own application for that.
Coming back to your question: Java is not deprecated for indexing data into Elasticsearch. It is still a preferred option.

Is using Elastic Search as authoritative datastore for applications advisable?

I'm new to Elasticsearch, and I'm trying to find a datastore for our application to which we can also add a front end for analytics, in this case Kibana. I'm planning to use it as a datastore for debit/credit (dr/cr) transactions in our billing system.
Most use cases I have read about are related to data analytics and search. I don't see a use case in which it is used as a regular datastore for an application, so I'm worried I might be using it for the wrong use case.
I was hoping someone could add their insights on this: why or why not use Elasticsearch as the authoritative/primary datastore for applications?
You should read the official Elasticsearch blog, where they clearly mention that a database must be robust and should not stop working unless you tell it to.
From the robustness section of the same blog:
A database should be robust, especially if it is your authoritative
system of record. Ideally, a costly query should be possible to
cancel, and you certainly don't want the database to stop working
unless you tell it to.
Unfortunately, Elasticsearch (and the components it's made of) does
not currently handle OutOfMemory-errors very well. We cover this in
more depth in Elasticsearch in Production, OutOfMemory-Caused Crashes.
It is very important to provide Elasticsearch with enough memory and
be careful before running searches with unknown memory requirements on
a production cluster.
In short, you shouldn't use Elasticsearch as a primary datastore where you can't afford to lose the data.

Full text search in Neo4j vs Elasticsearch

Both Neo4j 4.0 and Elasticsearch have full-text search and an inverted index built on Apache Lucene.
So how is Elasticsearch better than Neo4j full-text search?
Consider that we are dealing with a knowledge graph as the data storage model, developed in Neo4j.
Apart from that, why should we use Elasticsearch with Neo4j 4.0? What does Elasticsearch offer that Neo4j 4.0 does not?
So how is Elasticsearch better than Neo4j full-text search?
"Better" is largely dependent on your use case. But the tools (Neo4j and ElasticSearch) were built for drastically different purposes.
Neo4j is best when used as a graph-traversal engine, returning data from edge (relationship) based queries. It might have similar capabilities, but it just wasn't meant to be used as a search engine.
Want things like "fuzzy" matching and relevance ranking? Neo4j is not going to do any of that. Also, ElasticSearch is a true out-of-the-box distributed datastore. Neo4j can't distribute without an enterprise license.
Basically, it comes down to business requirements. If a datastore mainly needs to execute graph traversals and serve some simple search-like requests, Neo4j might be enough on its own. Need a full-featured search engine to serve that same data? Elasticsearch is better suited to handle that.

Using Elasticsearch as an alternative data store with applications updating both the DB and ES (with the help of Kafka). Is this a good idea?

The architecture is like this: there are several applications that access a set of relational databases. Some applications require large joins, which increases query time. To solve this problem, we made an Elasticsearch copy of the relational DBs. But even real-time indexing of data from the DB into ES takes a lot of time.
This is where Kafka comes in: we introduce a Kafka pipeline connecting the applications directly to ES. The applications are the Kafka producers, and Logstash is the consumer that feeds ES. Alongside this, the normal flow that updates the DB stays intact (so if an ES index crashes or the ES cluster loses data in any way, we can restore it from the DB).
Is this kind of architecture a good idea?
That's a good idea, yes, for reasons that you mention yourself. In fact, I also have a setup where docs are fed into ES through Kafka and can't really imagine going back to the setup I had before introducing Kafka.
If you're going to need a finer grain control over Kafka consumption process, take a look here. That's a recent project that unfortunately became usable after I implemented my own low-level consumers :)
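One thing to watch if you write your own consumer: with typical at-least-once delivery, the same event can arrive twice, and events for the same document can arrive out of order. A common remedy is to attach a monotonically increasing version to each event and only apply newer ones, which is the behaviour Elasticsearch offers with external versioning. The sketch below models that rule against a plain dict standing in for the index; the event shape (`id`, `version`, `op`, `doc`) is made up for illustration.

```python
def apply_event(index, event):
    """Apply one change event to an in-memory stand-in for an ES index.

    Returns True if the event was applied, False if it was a duplicate
    or older than what is already stored (and therefore skipped).
    """
    doc_id, version = event["id"], event["version"]
    current = index.get(doc_id)
    # Only strictly newer versions win, so redelivered or late events are no-ops.
    if current is not None and current["version"] >= version:
        return False
    if event["op"] == "delete":
        # Keep a tombstone so a late, older upsert cannot resurrect the document.
        index[doc_id] = {"version": version, "doc": None}
    else:
        index[doc_id] = {"version": version, "doc": event["doc"]}
    return True
```

In a real pipeline the version could come from a database sequence or row version, so the DB stays the authority on ordering.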

Using elasticsearch as central data repository

We are currently using elasticsearch to index and perform searches on about 10M documents. It works fine and we are happy with its performance. My colleague who initiated the use of elasticsearch is convinced that it can be used as the central data repository and other data systems (e.g. SQL Server, Hadoop/Hive) can have data pushed to them. I didn't have any arguments against it because my knowledge of both is too limited. However, I am concerned.
I do know that data in elasticsearch is stored in a manner that is efficient for text searching. Hadoop stores data just as a file system would, but in a manner that makes it efficient to scale/replicate blocks over multiple data nodes. Therefore, in my mind it seems more beneficial to use Hadoop (as it is more agnostic in its view of data) as a central data repository, and then push data from Hadoop to SQL, elasticsearch, etc.
I've read a few articles on Hadoop and elasticsearch use cases and it seems conventional to use Hadoop as the central data repository. However, I can't find anything that would suggest that elasticsearch wouldn't be a decent alternative.
Please Help!
As is the case with all database deployments, it really depends on your specific application.
Elasticsearch is a great open source search engine built on top of Apache Lucene. Its features and upgrades allow it to basically function just like a schema-less JSON datastore that can be accessed using both search-specific methods and regular database CRUD-like commands.
Despite all the advantages that Elasticsearch brings, there are still some major disadvantages:
Security - Elasticsearch originally did not provide any authentication or access control functionality. This has been supported since Shield was introduced.
Transactions - There is no support for transactions or processing on data manipulation. Data manipulation can now be handled with Logstash.
Durability - ES is distributed and fairly stable, but backups and durability are not as high a priority as in other data stores.
Maturity of tools - ES was relatively new and had not had time to develop mature client libraries and third-party tools, which could make development much harder. It can be considered quite mature now, with a variety of connectors and tools around it, such as Kibana.
Large computations - Commands for searching data are not suited to "large" scans of data and advanced computation on the database side.
Data Availability - ES makes data available in "near real-time", which may require additional considerations in your application (e.g. on a comments page where a user adds a new comment, refreshing the page might not actually show the new post because the index is still updating).
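For the near-real-time point, more recent Elasticsearch versions (not the 1.x series this answer targets) accept a `refresh=wait_for` query parameter on write requests, which makes the call return only once the change is visible to search. A minimal sketch of building such a request URL follows; the host, index name and the `/_doc/` path layout are assumptions that depend on your version and setup.

```python
from urllib.parse import quote, urlencode


def index_url(base, index, doc_id, wait_for_refresh=False):
    """Build the URL for indexing a single document.

    With wait_for_refresh=True the request blocks until the next index
    refresh has made the document searchable, at the cost of a slower write.
    """
    path = f"{base.rstrip('/')}/{quote(index, safe='')}/_doc/{quote(doc_id, safe='')}"
    if wait_for_refresh:
        return path + "?" + urlencode({"refresh": "wait_for"})
    return path
```

For the comments-page example, the write that stores the new comment would use `wait_for_refresh=True`, while background bulk loads would leave it off to keep throughput high.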
If you can deal with these issues then there's certainly no reason why you can't use Elasticsearch as your primary data store. It can actually lower complexity and improve performance by not having to duplicate your data but again this depends on your specific use case.
As always, weigh the benefits, do some experimentation and see what works best for you.
DISCLAIMER: This answer was written a while ago for the Elasticsearch 1.x series. These criticisms still stand to some extent for the 2.x series, but Elastic is working on them: the 2.x series comes with more mature tools, APIs and plugins, for example Shield for security, or tools like Logstash and Beats for transporting data.
I'd highly discourage most users from using elasticsearch as your primary datastore. It will work great until your cluster melts down due to a network partition. Even settings such as minimum_master_nodes that the ES pros always set won't save you. See this excellent analysis by Aphyr with his Call Me Maybe series:
http://aphyr.com/posts/317-call-me-maybe-elasticsearch
eliasah is right, it depends on your use case, but if your data (and job) is important to you, stay away.
Keep the golden record of your data in something really focused on persistence, and sync your data out to search from there. It adds extra complexity and resources, but will result in a better night's rest :)
There are plenty of ways to go about this and if elasticsearch does everything you need, you can look into Kafka for persisting all the events going into a cluster which would allow replaying if things go wrong. I like this approach as it provides an async ingestion pipeline into elasticsearch that also does the persistence.
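To make the replay idea concrete, here is a toy model of it: the event log is a plain list standing in for a Kafka topic, and the index is a dict standing in for Elasticsearch. The event shape (`op`, `id`, `doc`) is invented for the example. The point is only that the index can always be rebuilt from the log, either from scratch or from a known offset.

```python
def replay(log, index=None, start_offset=0):
    """Rebuild (or catch up) an index by re-applying logged events in order.

    `log` is an ordered list of events, `index` an optional existing state
    to continue from, and `start_offset` the first log position to apply.
    """
    index = {} if index is None else index
    for offset in range(start_offset, len(log)):
        event = log[offset]
        if event["op"] == "delete":
            index.pop(event["id"], None)
        else:
            index[event["id"]] = event["doc"]
    return index
```

After a cluster meltdown you would create an empty index and replay from offset 0. During normal operation the consumer just keeps applying from its last committed offset.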

Resources