NoSQL (Mongo, DynamoDB) with Elasticsearch vs single Elasticsearch

Recently I started to use DynamoDB to store events with a structure like this:
{start_date: '2016-04-01 15:00:00', end_date: '2016-04-01 15:30:00', from_id: 320, to_id: 360, type: 'yourtype', duration: 1800}
But when I started to analyze the data I ran into the fact that DynamoDB has no aggregations, and has read/write limits, response size limits, etc. Then I installed a plugin to index the data into ES. As a result, I see that I do not need DynamoDB anymore.
So my question is: when do you definitely need to have a NoSQL (in my case DynamoDB) instance along with Elasticsearch?
Will it degrade ES performance when you store there not only indexes but full documents? (Yes, I know ES is just an index, but in some cases such an approach could be more cost-effective than having a MySQL cluster.)

The reason you would write data to DynamoDB and then have it automatically indexed in Elasticsearch using DynamoDB Streams is that DynamoDB, or MySQL for that matter, is considered a reliable data store. Elasticsearch is an index and, generally speaking, isn't considered an appropriate place to store data that you really can't afford to lose.
DynamoDB by itself has issues with storing time series event data and aggregating is impossible as you have stated. However, you can use DynamoDB Streams in conjunction with AWS Lambda and a separate DynamoDB table to materialize views for aggregations depending on what you are trying to compute. Depending on your use case and required flexibility this may be something to consider.
Using Elasticsearch as the only destination for things such as logs is generally considered acceptable if you are willing to accept the possibility of data loss. If the records you want to store and analyze are really too valuable to lose, you should store them somewhere else and have Elasticsearch be the copy that you query. Elasticsearch allows for very flexible aggregations, so it is an excellent tool for this type of use case.
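As a rough illustration of that flexibility, an aggregation over the event documents from the question could group by type and average the duration. A minimal sketch, assuming an index named events and a type field mapped as keyword (both are placeholders):
GET /events/_search
{
  "size": 0,
  "aggs": {
    "by_type": {
      "terms": { "field": "type" },
      "aggs": {
        "avg_duration": { "avg": { "field": "duration" } }
      }
    }
  }
}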
As a total alternative you can use AWS Kinesis Firehose to ingest the events and persistently store them in S3. You can then use an S3 Event to trigger an AWS Lambda function to send the data to Elasticsearch where you can aggregate it. This is an affordable solution, with the only major downside being the 60-second delay that Firehose imposes. With this approach, if you lose data in your Elasticsearch cluster it is still possible to reload it from the files stored in S3.

Related

How to design a system for search queries and CSV/PDF export for 500GB of data per day?

Problem Statement
One device is sending 500GB of text data (logs) per day to my central server.
I want to design a system with which a user can:
Apply exact-match filters and go through the data using pagination
Export PDF/CSV reports for the same query as above
Data can be stored for a maximum of 6 months. It's an on-premises solution. Some delay on queries is acceptable. If we can do data compression it would be awesome. I have 512GB of RAM, an 80-core system and TBs of storage (these are upgradable).
What I have tried/found out:
The tech stack I am planning to use: MEAN stack for application development. For the core data part I am planning to use the ELK stack. A single Elasticsearch index ideally should stay under 40-50GB, per the usual size recommendation.
So, my plan is to create 100 indexes per day, each of 5GB, for each device. During a query I can sort these indices by name (e.g. 12_dec_2012_part_1 ...) and search each index linearly, continuing until the range the user asked for is covered. (I think this will hold up for ad-hoc requests by the user, but for reports, if I do this and write to a CSV file by going through the indices sequentially one by one, it will take a long time.) For reports, I think the best thing I can do is create a PDF/CSV per index (5GB in size), because most file viewers cannot open very large CSV/PDF files.
I am new to big data problems. I am not sure which approach is right for this: ELK or the Hadoop ecosystem. (I would like to go with ELK.)
Can someone point me in the right direction, explain how to proceed, or share experience with this type of problem statement? Unconventional solutions to these problems are also welcome.
Thanks!
exact-match filters
You can use a term query or a match_phrase query.
Returns documents that contain an exact term in a provided field.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
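For example, an exact-match filter could look like this (the index name logs and the field device_id are placeholders for whatever your documents contain):
GET /logs/_search
{
  "query": {
    "term": {
      "device_id": "device-42"
    }
  }
}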
pagination
You can use the from and size parameters for pagination.
GET /_search
{
  "from": 5,
  "size": 20,
  "query": {
    "match": {
      "user.id": "kimchy"
    }
  }
}
https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html
Export PDF/CSV
You can use Kibana
Kibana provides you with several options to share Discover saved
searches, dashboards, Visualize Library visualizations, and Canvas
workpads.
https://www.elastic.co/guide/en/kibana/current/reporting-getting-started.html
Data can be stored for max 6 months
You can use an ILM policy
You can configure index lifecycle management (ILM) policies to automatically manage indices according to your performance, resiliency, and retention requirements.
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html
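A minimal sketch of such a policy for the 6-month retention mentioned in the question (the policy name and thresholds are placeholders):
PUT _ilm/policy/logs-retention
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "1d" }
        }
      },
      "delete": {
        "min_age": "180d",
        "actions": { "delete": {} }
      }
    }
  }
}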
optimal shard size
For log indices you can use data stream indices.
A data stream lets you store append-only time series data across
multiple indices while giving you a single named resource for
requests. Data streams are well-suited for logs, events, metrics, and
other continuously generated data.
https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html
When you use data stream indices you don't have to think about shard size; they will roll over automatically. :)
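A minimal index template that enables a data stream for these logs might look like this (the template name and pattern are placeholders; index.lifecycle.name refers to an ILM policy such as the one sketched above):
PUT _index_template/device-logs-template
{
  "index_patterns": ["device-logs-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "number_of_shards": 1,
      "index.lifecycle.name": "logs-retention"
    }
  }
}
Indexing into any name matching device-logs-* then creates the data stream and its backing indices automatically.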
For compression you should update the index settings:
index.codec: best_compression
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html
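Note that index.codec is a static setting, so it has to be applied at index creation time (or on a closed index); for example (the index name is a placeholder):
PUT /device-logs-000001
{
  "settings": {
    "index.codec": "best_compression"
  }
}
With the data stream template above you would put this setting into the template instead, so every new backing index picks it up.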

OpenSearch Indexes and Data streams

I have recently started to use OpenSearch and have a few newbie questions.
What is the difference between Index, Index Pattern and Index template? (Some examples would be really helpful to visualize and differentiate these terminologies).
I have seen some indexes with data streams and some without. What exactly are data streams, and why do some indexes have them while others do not?
I tried reading a few docs and watching a few YouTube videos, but it's getting a little confusing as I do not have much hands-on experience with OpenSearch.
(1)
An index is a collection of JSON documents that you want to make searchable. To maximise your ability to search and analyse documents, you can define how documents and their fields are stored and indexed (i.e., mappings and settings).
An index template is a way to initialize with predefined mappings and settings new indices that match a given name pattern - e.g., any new index with a name starting with "java-" (docs).
An index pattern is a concept associated with Dashboards, the OpenSearch UI. It provides Dashboards with a way to identify which indices you want to analyse, based on their name (again, usually based on prefixes).
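To make the distinction concrete, here is a hypothetical index template for indices whose names start with "java-" (all names and fields are made up):
PUT _index_template/java-template
{
  "index_patterns": ["java-*"],
  "template": {
    "settings": { "number_of_shards": 1 },
    "mappings": {
      "properties": {
        "message": { "type": "text" },
        "level": { "type": "keyword" }
      }
    }
  }
}
Any new index named, say, java-2023.01.01 picks up these settings and mappings automatically, whereas an index pattern like java-* defined in Dashboards only tells the UI which existing indices to query.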
(2)
Data streams are managed indices highly optimised for time-series and append-only data, typically, observability data. Under the hood, they work like any other index, but OpenSearch simplifies some management operations (e.g., rollovers) and stores in a more efficient way the continuous stream of data that characterises this scenario.
In general, if you have a continuous stream of data that is append-only and has a timestamp attached (e.g., logs, metrics, traces, ...), then data streams are advertised as the most efficient way to model this data in OpenSearch.

Elasticsearch vs Cassandra vs Elasticsearch with Cassandra

I am learning NoSQL and looking at different options for one of my client's requirements. I have gone through various resources before putting up this question (as a person with little knowledge of NoSQL).
I need to store data at a fast rate and read it back.
Fully fail-safe and easily scalable.
Able to search through the data for analytics.
I ended up with a shortlist: Cassandra and Elasticsearch.
What I do understand is that Cassandra is a perfect NoSQL storage solution for me, as I can write and read data using indexes. Where it fails, or could fail, is analytics: in the future I may want to get data from from_date to to_date, or query it in other ways for analytics, and that will be hard if I don't design the data model properly and with long-term sight, which is quite difficult in an ever-changing world.
Elasticsearch, on the other hand, is best at indexing (backed by Lucene) and can search the data by throwing some random text at it. But does it work the same way if I want to retrieve data from from_date to to_date (I expect it might)? The real question is: is it a search engine, or a proper NoSQL data store like Cassandra? If yes, why do we still need Cassandra?
If these two live in different worlds, please explain how! How do we combine them to get a more effective solution?
One of our applications uses data that is stored into both Cassandra and ElasticSearch. We use Cassandra to access those records whenever we can, and have data duplicated into query tables designed to adhere to specific application-side requests. For a more liberal search than our query tables can allow, ElasticSearch performs that functionality nicely.
We have asked that same question (of ourselves)..."Why don't we just get everything from ElasticSearch?"
The answer is that ElasticSearch was designed to be a search engine, and not a persistent data store. Sometimes ElasticSearch loses writes. Schema changes are difficult to do in ElasticSearch without blowing everything away and reloading. For that purpose, I have written jobs that are designed to keep ElasticSearch in-sync with our Cassandra cluster. There was also a fairly recent discussion on Quora about this topic, that yielded similar points.
That being said, ElasticSearch works great as a search engine. And Cassandra works great as a scalable, high-performance datastore. But querying data is different from searching for data. There are times that we need one or the other, and a combination of the two works well for our application. It may (or it may not) work well for yours.
As for analytics, I have had some success in using the Cassandra Spark connector, to serve more complex OLAP queries.
Edit 20200421
I've written a newer answer to a similar question:
ElasticSearch vs. ElasticSearch+Cassandra
Cassandra + Lucene is a great option. There are different initiatives for this, for example:
Stratio's Cassandra Lucene Index - derived from Stratio Cassandra, a plugin for Apache Cassandra that extends its index functionality. (https://github.com/Stratio/cassandra-lucene-index)
Stratio Cassandra - a native integration with Apache Lucene; it is very interesting. (https://github.com/Stratio/stratio-cassandra) - THIS PROJECT HAS BEEN DISCONTINUED IN FAVOUR OF Stratio's Cassandra Lucene Index
Tuplejump Calliope - like Stratio Cassandra, but less active. (https://github.com/tuplejump/stargate-core)
DSE Search by DataStax - allows using Cassandra with Apache Solr, but it's a proprietary option. (http://www.datastax.com/what-we-offer/products-services/datastax-enterprise)
After working on this problem myself, I have realized that NoSQL databases like Cassandra are good when you want to preserve your data schema with reliable write operations and don't need the indexing operations that Elasticsearch offers. Elasticsearch, in turn, is good if you want to keep indexed data, trust your schema, and are going to do far more reads than writes.
My case was data analytics, so I kept a lot of my lattices in Elasticsearch, since later I wanted to traverse the data a lot to see what my next step should be. I would have used Cassandra if I had wanted a lot of changes in the schema of the data in my analytics pipelines.
Also, there are many nice presentation tools like Kibana that you can use to present your data with good graphics. Maybe I am lazy, but they are very good looking and they helped me.
Storing data in a combination of Cassandra and ElasticSearch gives you the most functionality. It allows you to look up key-value tables, and also to search data in indexes.
The combination gives you a lot of flexibility, ideal for your application.
Elassandra is a combined solution of Cassandra + Elasticsearch. It uses Elasticsearch to index the data and Cassandra as the data store. I'm not sure about the performance, but as per this article, its performance is good.
If your application needs a search feature, then Elassandra is the best open-source option. DSE Search is available but it's expensive.
We had developed an application where we used Elasticsearch and Cassandra.
Similar data was stored into Cassandra and indexed into Elasticsearch.
Our application's UI had features like searches, aggregations, data export, etc.
The back-end microservices were continuously getting huge data (on Kafka topics) and storing it into Cassandra. Once the data is stored into Cassandra, the services would make sure the data is indexed into Elasticsearch.
Cassandra was acting as the "source of truth" for Elasticsearch. In cases where reindexing of the ES index was required, we queried Cassandra and reindexed the data into ES.
This solution helped us, as this was very easy to scale and the searches and aggregations were much faster.
Cassandra is great at retrieving data by ID. I don't know much about secondary index performance, but I doubt it's as fast as Elasticsearch. Certainly Elasticsearch wins when it comes to full text search functionality (text analysis, relevancy scoring, etc).
Cassandra wins on update performance, too. Elasticsearch supports updates, but an update is really a reindex + soft delete in an atomic operation.
Cassandra has a very nice replication model (if you need to be extra-fail-safe). Elasticsearch is OK, too, I'm not in the camp that says ES is particularly unreliable (it has issues sometimes, like all software).
Elasticsearch also has aggregations for real-time analytics. And because searches are so fast, analytics on a subset of data will be fast, too.
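For instance, the from_date/to_date analytics asked about above can be expressed as a range filter combined with a date_histogram aggregation; a rough sketch with placeholder index and field names:
GET /events/_search
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": { "gte": "2016-04-01", "lte": "2016-04-30" }
    }
  },
  "aggs": {
    "per_day": {
      "date_histogram": { "field": "timestamp", "calendar_interval": "day" }
    }
  }
}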
If your requirements are satisfied well enough by one of them (like here it seems like ES would work well), I would just use one. If you have requirements from both worlds, then you can either:
use one of them and work around the downsides. For example, you may be able to handle many updates with Elasticsearch, but with more shards and more hardware
use both and make sure they're in sync
Since Elasticsearch is built on the Lucene index, storing your indexed data in Elasticsearch performs better for retrieval than indexing in Cassandra itself.
If your requirements are not related to real-time retrieval, you can also use Elasticsearch as a NoSQL database. There are concerns that Elasticsearch loses writes and that schema changes are difficult, but if your volume of data is not too big you can use Elasticsearch both as a search engine with excellent indexing and as a NoSQL database, and there are several ways to mitigate those problems. I have worked on schema changes in Elasticsearch; if your data structure is consistent, they won't create any issues.
Being a supporter of both Elasticsearch and Solr, I have worked with both search engines, and my experience is that both can be used fluently if you configure them correctly.
The only con I can think of is if you are targeting real-time results and cannot compromise even milliseconds of delay in your response; then it's better to rely on other NoSQL databases like Cassandra or Couchbase.
Cassandra with Solr works better than Cassandra with Elasticsearch.

Use Elasticsearch as backup store

My application receives and parses thousands of small JSON snippets, each about ~1KB, every hour. I want to create a backup of all incoming JSON snippets.
Is it a good idea to use Elasticsearch to back up these snippets in an index with, for example, "number_of_replicas": 4? I have never read of anyone using Elasticsearch for this.
Is my data safe in Elasticsearch when I use a cluster of servers and replicas, or should I use another store for this use case?
(Writing it to the local file system isn't safe, as our hard disks crash often. At first I thought about using HDFS, but it isn't made for small files.)
First you need to understand the difference between replicas and backups.
A replica is an additional copy of the data at run time. It increases availability and failover support; it won't protect you against accidental deletion of data.
A backup is a copy of the whole data set at backup time. It is used to restore the data when the system has crashed.
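To make that distinction concrete in Elasticsearch terms: replicas are just an index setting, while an actual backup is a snapshot into a registered repository. A rough sketch, with the index name, repository name and path made up (an fs repository also requires its location to be listed under path.repo in elasticsearch.yml):
PUT /json-snippets
{
  "settings": { "number_of_replicas": 4 }
}

PUT _snapshot/my_backup
{
  "type": "fs",
  "settings": { "location": "/mount/backups/my_backup" }
}

PUT _snapshot/my_backup/snapshot_1?wait_for_completion=true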
Using Elasticsearch itself for backup is not a good idea. Elasticsearch is a search engine, not a database. If you have not configured your ES cluster carefully, you can end up with data loss.
So, in my opinion:
To store JSON objects, we have plenty of databases. For example, MongoDB is a NoSQL database. We can easily configure it with more replicas, which means high availability of data and failover support. As you asked, it is also open source and more reliable.
For more info about MongoDB refer to https://www.mongodb.org/
Update:
In Elasticsearch, if you create an index with more shards, they will be distributed among the nodes; if a node fails, that data will be lost. But in MongoDB, more nodes means each MongoDB node contains its own copy of the data; if one MongoDB node fails, we can retrieve our data from the replica nodes. We need to be more conscious about replica setup and shard allocation in Elasticsearch, while in MongoDB it's easier and a good architecture too.
Note: I didn't say that storing data in Elasticsearch is not safe. I mean that, compared to MongoDB, it's more difficult to configure and maintain replicas in Elasticsearch.
Hope it helps..!

Elasticsearch index strategies under high traffic

We use ElasticSearch for the real-time metrics and analytics part of our tool. ElasticSearch is very cool and fast when we query our data (statistical facets and the terms facet).
But we have a problem when we try to index our hourly data. We collect all our metric data from other services: first we collect the data and save it via a RabbitMQ process. But when the queue worker runs, not all of our hourly data gets indexed into ES. Usually only about 40% of the data ends up indexed in ES and the rest is lost.
So what are your ideas on indexing into ES under high traffic?
I've posted answers to other similar questions:
Ways to improve first time indexing in ElasticSearch
Performance issues using Elasticsearch as a time window storage (latter part of my answer applies)
Additionally, instead of a custom 'queue worker' have you considered using a 'river'? For more information see:
http://www.elasticsearch.org/blog/the-river/
http://www.elasticsearch.org/guide/reference/river/
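One more thing worth checking, besides the links above: if the queue worker sends documents one at a time, batching them through the bulk API usually helps a lot under this kind of load. A minimal sketch against a recent Elasticsearch version, with made-up index name and fields:
POST /_bulk
{ "index": { "_index": "metrics-2016.04.01" } }
{ "service": "billing", "duration": 1800, "timestamp": "2016-04-01T15:00:00" }
{ "index": { "_index": "metrics-2016.04.01" } }
{ "service": "billing", "duration": 600, "timestamp": "2016-04-01T15:05:00" }
Also check the per-item statuses in the bulk response rather than only the HTTP status; items rejected under load (for example when the bulk queue is full) otherwise look exactly like silent data loss.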

Resources