How to sync Cassandra and Elasticsearch

I am looking for a way to keep Cassandra in sync with Elasticsearch, i.e. whatever data is written to Cassandra should also be written to Elasticsearch.

One way would be to develop a capture process that runs on a schedule or in a loop and reads "new" data from Cassandra and sends it to Elasticsearch. The definition of "new" data depends on your Cassandra schema and your applications' insert/update/delete patterns.
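A minimal sketch of such a capture loop, assuming every row carries an `updated_at` timestamp that the application sets on insert and update. The column names, the `events` index name, and the `fetch_rows` / `send_actions` callables are hypothetical stand-ins for a Cassandra query and an Elasticsearch bulk call. Deletes are not captured by this approach and would need a tombstone column or a separate mechanism.

```python
from datetime import datetime, timezone

def select_new_rows(rows, watermark):
    """Keep rows strictly newer than the watermark, oldest first."""
    fresh = [r for r in rows if r["updated_at"] > watermark]
    return sorted(fresh, key=lambda r: r["updated_at"])

def to_es_actions(rows, index):
    """Build (action, document) pairs keyed by the row id, so that
    re-sending a row overwrites the document instead of duplicating it."""
    actions = []
    for r in rows:
        doc = {k: (v.isoformat() if isinstance(v, datetime) else v)
               for k, v in r.items()}
        actions.append(({"index": {"_index": index, "_id": str(r["id"])}}, doc))
    return actions

def run_once(fetch_rows, send_actions, watermark, index="events"):
    """One polling pass: fetch, filter, send, and return the new watermark.

    fetch_rows(watermark) is assumed to query Cassandra;
    send_actions(actions) is assumed to push them to Elasticsearch.
    """
    rows = select_new_rows(fetch_rows(watermark), watermark)
    if not rows:
        return watermark
    send_actions(to_es_actions(rows, index))
    return rows[-1]["updated_at"]
```

Calling `run_once` on a schedule and persisting the returned watermark between runs gives the "new data" definition described above. Rows written later with a timestamp equal to or older than the watermark would be missed, so the timestamp source matters.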

This can be done either in real time or as a separate batch process.
Index the data into Elasticsearch as it arrives in Cassandra, shaping the index around your retrieval/query patterns.
For real time, use a stream-processing layer (such as Storm) or a distributed message broker (such as Kafka) to keep the two stores in sync.
Alternatively, periodically query the data from Cassandra and ingest it into Elasticsearch as a separate batch job.
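Whichever route is chosen, the changes eventually have to be sent to Elasticsearch, typically through its bulk API, which takes newline-delimited JSON. The sketch below turns a batch of change events into such a request body. The event shape (`op`, `key`, `value`) is an assumption made for illustration, not a format defined by Kafka, Storm, or Cassandra.

```python
import json

def bulk_body(events, index):
    """Serialize change events into an Elasticsearch _bulk request body.

    Each upsert becomes an action line followed by a source line;
    each delete is a single action line. The body must end with a newline.
    """
    lines = []
    for ev in events:
        meta = {"_index": index, "_id": str(ev["key"])}
        if ev["op"] == "delete":
            lines.append(json.dumps({"delete": meta}))
        else:
            lines.append(json.dumps({"index": meta}))
            lines.append(json.dumps(ev["value"]))
    return "\n".join(lines) + "\n" if lines else ""
```

Using the Cassandra primary key as the document `_id` keeps replays idempotent: the same event delivered twice produces the same document.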

Related

Elasticsearch connector for Spark

Use Case:
An application uses Spark to process data for 5 minutes; the data to be processed could be several hundred thousand records in data storage.
The choice for data storage is Elasticsearch.
Issue:
Is there a Spark connector for Elasticsearch, similar to the one for MongoDB?
https://www.mongodb.com/products/spark-connector
Investigation:
I spent a lot of time on this, but the best I could find was a solution using the search API with scroll (which fetches a limited number of records per given interval), and this does not fit my use case.
Please note that my Elasticsearch will hold JSON data and we do not want to save RDDs.
As mentioned here:
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html
you can use the Spark connector for Elasticsearch. The data is not saved in any binary form; the RDD/DataFrame is serialized as JSON, and that is what goes into Elasticsearch.
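A short PySpark sketch of what using the connector might look like, assuming the elasticsearch-hadoop (elasticsearch-spark) jar is on the Spark classpath. The helper names, the host, and the index name are hypothetical; `es.nodes`, `es.port`, and `es.mapping.id` are connector configuration keys, and `org.elasticsearch.spark.sql` is the data source name it registers.

```python
def es_options(nodes, port=9200, id_field=None):
    """Build the connector options as a plain dict of strings."""
    opts = {"es.nodes": nodes, "es.port": str(port)}
    if id_field is not None:
        # Use this DataFrame column as the Elasticsearch document _id.
        opts["es.mapping.id"] = id_field
    return opts

def read_index(spark, index, opts):
    """Load an Elasticsearch index as a DataFrame."""
    return (spark.read
                 .format("org.elasticsearch.spark.sql")
                 .options(**opts)
                 .load(index))

def write_dataframe(df, index, opts):
    """Append a DataFrame to an Elasticsearch index; rows are sent as JSON."""
    (df.write
       .format("org.elasticsearch.spark.sql")
       .options(**opts)
       .mode("append")
       .save(index))
```

For example, `read_index(spark, "logs", es_options("localhost"))` would return a DataFrame over the hypothetical `logs` index, which can then be processed like any other Spark source.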

How to keep HBase and Elasticsearch data in sync? (Transactions?) It needs to be real-time

My company is in the Internet of Things industry. The data flow I am responsible for is EMQTT => Kafka => HBase => Phoenix => Spring Cloud REST => HTML view. The problem is that HBase queries on non-rowkey fields are very slow, so I want to combine HBase with Elasticsearch to get fast multi-condition queries. The biggest obstacle is how to keep the data in HBase and Elasticsearch in sync (with transactions?), and it needs to be real-time.
I'd use the Logstash Kafka input plugin to read the stream as early as possible.
In my opinion, it's better to be as close as possible to the source of the event.

Indexing logs with es-hadoop

I am new to Elasticsearch and want to index my website logs, which are stored on HDFS, for fast querying.
I have a well-structured pipeline that runs a script every 20 minutes to ingest the data into HDFS.
I want to integrate Elasticsearch with it so that it also indexes these logs on particular field(s), giving faster query results through Spark SQL.
So, my question is: can I index my data on particular field(s) only?
Also, my logs are saved in Avro file format. Does ES provide a way to index Avro-serialized data directly, or do I need to convert it into some other format?
Thank you in advance.
I would suggest looking at the Elasticsearch, Logstash and Kibana stack, which should be good enough to fulfill your requirement. Putting the data on HDFS and then using ES would be additional overhead.
Instead, you can use Logstash to pump data into ES, index whatever fields you wish to query, and build simple dashboards in less than 10 minutes. Take a look at this tutorial for a step-by-step guide:
http://hadooptutorials.co.in/tutorials/elasticsearch/log-analytics-using-elasticsearch-logstash-kibana.html

ElasticSearch on Cassandra data vs moving Cassandra data to ElasticSearch for Indexing

I'm new to Elasticsearch and am trying to figure out the best way to index 1 terabyte of data in Cassandra.
The two options that I understand right now are:
1. Move the data periodically to Elasticsearch using the Cassandra-River plugin and then index it there.
   Advantage: Search queries have no impact on Cassandra load
   Disadvantage: The data has to be synced periodically
2. Without moving the data, run Elasticsearch on top of Cassandra to index it (not sure how this would be done).
   Advantage: Data is always in sync
   Disadvantage: Impacts Cassandra performance?
Any thoughts would be appreciated.
Perhaps, in the context of Elasticsearch 1.4 and above, just using Elasticsearch as both the datastore and the search engine might be a simpler and more elegant option.
Add more nodes to scale.

Use Elasticsearch as backup store

My application receives and parses thousands of small JSON snippets, about 1 KB each, every hour. I want to create a backup of all incoming JSON snippets.
Is it a good idea to use Elasticsearch to back up these snippets in an index with, e.g., "number_of_replicas": 4? I have never read of anyone using Elasticsearch for this.
Is my data safe in Elasticsearch when I use a cluster of servers with replicas, or should I use another storage system for this use case?
(Writing to the local file system isn't safe, as our hard disks crash often. At first I thought about using HDFS, but it isn't made for small files.)
First, you need to understand the difference between replicas and backups.
A replica is an additional copy of the data kept at run time. It improves availability and supports failover, but it won't protect against accidental deletion of data.
A backup is a copy of the whole data set taken at backup time. It is used to restore the data after the system crashes.
Using Elasticsearch for backup is not a good idea. Elasticsearch is a search engine, not a database. If you have not configured the ES cluster carefully, you can end up losing data.
So, in my opinion:
There are plenty of databases for storing JSON objects. For example, MongoDB is a NoSQL database that is easy to configure with more replicas, which gives you high availability and failover support. It is also open source and, in my opinion, more reliable.
For more information about MongoDB, see https://www.mongodb.org/
Update:
In Elasticsearch, if you create an index with more shards, they are distributed among the nodes. If a node fails and its shards have no replica on another node, that data is lost. In MongoDB, each member of a replica set holds its own copy of the data, so if one MongoDB node fails we can retrieve our data from the other replicas. We need to be more careful about replica setup and shard allocation in Elasticsearch; in MongoDB it's easier, and the architecture is good too.
Note: I didn't say that storing data in Elasticsearch is unsafe. I mean that, compared to MongoDB, replicas are harder to configure and maintain in Elasticsearch.
Hope it helps!
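To make the replica/backup distinction concrete for Elasticsearch, here is a small sketch that only builds the request bodies involved: replicas are an index setting, while backups go through the snapshot API. The repository name, location, and snapshot name are hypothetical, and the shared-filesystem (`fs`) repository type assumes the location is allowed via `path.repo` on every node.

```python
def index_settings(shards, replicas):
    """Body for creating an index with a given shard and replica count.

    Replicas are live copies kept on other nodes; they help with node
    failure but also replicate an accidental delete.
    """
    return {"settings": {"index": {"number_of_shards": shards,
                                   "number_of_replicas": replicas}}}

def snapshot_requests(repo, location, snapshot):
    """(method, path, body) tuples to register a repository and take a snapshot.

    A snapshot is a point-in-time copy stored outside the cluster,
    which is what a backup in the sense above requires.
    """
    return [
        ("PUT", f"/_snapshot/{repo}",
         {"type": "fs", "settings": {"location": location}}),
        ("PUT", f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true", None),
    ]
```

Sending these with any HTTP client against the cluster would set up both mechanisms; the two are complementary rather than interchangeable.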
