Can a time series database do everything that a streaming analytics system (like Spark Streaming / Flink / Kinesis Analytics) can?
Does one subsume the other? I am not looking for which one is better, just trying to understand the different use cases that they support.
Time series databases are focused on the storage and retrieval of time-based entries in more performant ways than common relational databases. They have recently become a hot topic again, given the industry's interest in high-performance event processing. Nowadays, most of them rely on specific indexing techniques over NoSQL databases, e.g. OpenTSDB (HBase), InfluxDB (BoltDB) and so on.
On the other hand, distributed stream processing frameworks like Spark Streaming are based on research into Data Stream Management Systems and provide more flexible ways of analysing events. They are usually applied to other types of data analysis, such as machine learning over streams, sketches and windowing, and to many other techniques that are not the focus of time series databases.
Both originated from research in the 2000s on Time Series Databases and Data Stream Management Systems, so many of the features and architectural ideas from one are applied to the other and vice versa. An example of this is that the seminal stream processing paper "Continuous Queries over Data Streams" (S. Babu, 2001) cites time series databases as an example of related work.
I'm developing a parcel tracking system and thinking about how to improve its performance.
Right now we have one table in Postgres named parcels containing things like id, last known position, etc.
Every day about 300,000 new parcels are added to this table. The parcel data is taken from an external API. We need to track all parcel positions as accurately as possible and reduce the time between API calls about a specific parcel.
Given such requirements, what could you suggest about the project architecture?
Right now the only solution I can think of is the producer-consumer pattern: having one process select all records from the parcels table in an infinite loop and then distribute the data-fetching tasks with something like Celery (a sketch follows the list below).
Major downsides of this solution are:
possible deadlocks, as fetching data about the same parcel can be executed at the same time on different machines
the need to control the queue size
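For illustration, a minimal sketch of that producer-consumer idea, assuming Celery with a Redis broker; the names (fetch_parcel_position, tracking-api.example.com) are placeholders, not an existing codebase:

import time
import requests
from celery import Celery

app = Celery("parcels", broker="redis://localhost:6379/0")

@app.task(acks_late=True)
def fetch_parcel_position(parcel_id):
    # Worker side: one API call per parcel, executed on any available machine.
    resp = requests.get(f"https://tracking-api.example.com/parcels/{parcel_id}")
    resp.raise_for_status()
    position = resp.json()["position"]
    # ... update the parcels table with the new position here ...

def producer_loop(get_parcel_ids):
    # Producer side: periodically enqueue one fetch task per parcel.
    while True:
        for parcel_id in get_parcel_ids():
            fetch_parcel_position.delay(parcel_id)
        time.sleep(60)  # pause between full sweeps to bound queue growth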
This is a very broad topic, but I can give you a few pointers. Once you reach the limits of vertical scaling (scaling by picking more powerful machines) you have to scale horizontally (scaling by adding more machines to the same task). So to be able to design scalable architectures you have to learn about distributed systems. Here are some topics to look into:
Infrastructure & processes for hosting distributed systems, such as Kubernetes, Containers, CI/CD.
Scalable forms of persistence. For example different forms of distributed NoSQL like key-value stores, wide-column stores, in-memory databases and novel scalable SQL stores.
Scalable forms of data flows and processing. For example event driven architectures using distributed message- / event-queues.
For your specific problem with parcels I would recommend considering a key-value store for your position data. Those can scale to billions of insertions and retrievals per day (when querying by key).
It also sounds like your data is somewhat temporary and could be kept in in-memory hot storage while the parcel is not yet delivered (and archived afterwards). A distributed in-memory DB could scale even further in terms of insertions and queries.
Also, you probably want to decouple data extraction (through your API) from processing and persistence. For that you could consider introducing stream processing systems.
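As a concrete illustration of the key-value idea, here is a minimal sketch assuming Redis as the in-memory store; the key layout and field names are only illustrative:

import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def save_position(parcel_id, lat, lon, ts):
    # One key per parcel; O(1) writes and reads by key.
    r.set(f"parcel:{parcel_id}:position",
          json.dumps({"lat": lat, "lon": lon, "ts": ts}))

def load_position(parcel_id):
    raw = r.get(f"parcel:{parcel_id}:position")
    return json.loads(raw) if raw else None

def archive_and_expire(parcel_id, ttl_seconds=3600):
    # Once a parcel is delivered, copy the value to an archive store elsewhere
    # and let the hot key expire.
    r.expire(f"parcel:{parcel_id}:position", ttl_seconds)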
I'm gathering event logs every time a property of some device is changed. For this purpose I decided to use:
Logstash - where my agent IoT application sends logs to in JSON format,
Elasticsearch - for storing data (logs),
Kibana - for data visualisation.
The JSON with logs is sent at regular intervals and its form is as follows:
{"deviceEventLogs":[{"date":"16:16:39 31-08-2016","locationName":"default","property":"on","device":"Lamp 1","value":"
false","roomName":"LivingRoom"}, ... ,]}
An example of a single event entry in Elasticsearch looks as follows:
{
  "_index": "logstash-2016.08.25",
  "_type": "on",
  "_id": "AVbDYQPq54WlAl_UD_yg",
  "_score": 1,
  "_source": {
    "@version": "1",
    "@timestamp": "2016-08-25T20:25:28.750Z",
    "host": "127.0.0.1",
    "headers": {
      "request_method": "PUT",
      "request_path": "/deviceEventLogs",
      "request_uri": "/deviceEventLogs",
      "http_version": "HTTP/1.1",
      "content_type": "application/json",
      "http_user_agent": "Java/1.8.0_91",
      "http_host": "127.0.0.1:31311",
      "http_accept": "text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2",
      "http_connection": "keep-alive",
      "content_length": "34861"
    },
    "date": "2016-08-08T14:48:11.000Z",
    "device": "Lamp 1",
    "property": "on",
    "locationName": "default",
    "roomName": "LivingRoom",
    "value_boolean": true
  }
}
My goal is to create a website with some kind of dashboard showing analyzed data in reasonable time (several minutes should be acceptable), i.e.:
showing the history of energy consumption and predicting the consumption in the future
detecting anomalies in energy consumption or other factors like lights or heating usage
showing recommendations based on some kind of unsophisticated statistics, e.g. "you can move a given device from location1 to location2 because it's more needed there (it's used more intensively than in the other place)", etc.
While the last point is quite trivial - I can use a simple query or aggregation in Elasticsearch and then compare it to some threshold value - the first two points require in-depth analysis like machine learning or data mining.
For now the system is equipped with around 50 devices updating their status every 10 seconds on average. In the future the number of devices can increase up to 50,000. Assuming 100 bytes for one event log, this leads to an approximation of around 15 terabytes of data in Elasticsearch per year.
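For reference, a quick back-of-the-envelope check of that estimate (plain arithmetic, nothing assumed beyond the numbers above, counting raw event bytes only):

devices = 50_000
events_per_device_per_day = 24 * 60 * 60 / 10   # one event every 10 seconds
bytes_per_event = 100

per_day = devices * events_per_device_per_day * bytes_per_event
per_year = per_day * 365
print(per_day / 1e9, "GB/day")     # ~43.2 GB/day
print(per_year / 1e12, "TB/year")  # ~15.8 TB/year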
The general question is - what could be a reasonable solution / technology / architecture for such a system?
Is it a reasonable start to store all my logs in Elasticsearch?
I'm considering the es-hadoop library to use Elasticsearch along with Apache Spark, so that I can process my data using MLlib in Spark - is this a reasonable direction to go?
Can I use only Elasticsearch to store all my data and just use Spark and MLlib to provide in-depth analysis, or should I consider implementing the so-called "Lambda Architecture", treating Elasticsearch as a speed layer? I've read a bit about various configurations where Kafka and Apache Storm were used, but I'm not really sure I need them. Since the project should be done within one month and I'm a beginner, I'm worried about the complexity and hence the time needed for such an implementation.
What if the data load were 10x smaller (around 1.5 terabytes per year) - would your answer be the same?
This is a very elaborate question, let me try to break it down:
Questions that you should think about
What is the end-to-end latency for your data to be available for queries? Do you need it in real time or are you okay with delays?
What is the data loss that you are willing to tolerate?
What accuracy do you need from the analytics/ML algorithms you are looking at? Do you need highly accurate results or are you okay with some inaccuracy?
Do you need results only when they are complete or do you need some kind of speculative results?
These questions, along with the usual ones like space constraints and latency as the data load increases, should help you determine the right solution.
Generally, these problems can be viewed as Ingestion -> Processing -> Presentation.
Ingestion - Need for a Message Bus
Generally, people opt for a message bus like Kafka to handle back-pressure from slow downstream consumers and also to provide reliability (by persisting to disk) to prevent data loss. Kafka also has good community support in terms of integrations like Spark Streaming, Druid firehose support, ES plugins, etc.
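For example, a sketch of how the device agent (or a small relay in front of Logstash) could publish events to Kafka using kafka-python; the topic name "device-events" and the fields shown are assumptions:

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",  # wait until the broker has persisted the event
)

event = {
    "date": "2016-08-08T14:48:11.000Z",
    "device": "Lamp 1",
    "property": "on",
    "roomName": "LivingRoom",
    "value": False,
}
# Keying by device keeps each device's events ordered within one partition.
producer.send("device-events", key=b"Lamp 1", value=event)
producer.flush()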
Processing - Need for a scalable compute layer
This is where you need to decide on things like real-time vs. batch processing, applicable data-loss, accurate vs speculative results, etc. Read Tyler Akidau's article on streaming at https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 for a detailed explanation.
People choose Spark Streaming for real-time use cases, and a simple M/R job should do the trick for batch jobs. If you are planning on streaming jobs, then windowing and sessionization of events can complicate things further.
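As an illustration of the streaming/windowing part, here is a sketch using Spark Structured Streaming to compute per-device windowed averages from the hypothetical "device-events" Kafka topic above; the schema and all names are assumptions (and the spark-sql-kafka package must be available):

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("device-event-windows").getOrCreate()

# "value" is parsed as a number here, assuming numeric properties such as
# power readings; boolean properties would need a different schema.
schema = StructType([
    StructField("date", TimestampType()),
    StructField("device", StringType()),
    StructField("property", StringType()),
    StructField("roomName", StringType()),
    StructField("value", DoubleType()),
])

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "device-events")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*"))

# 10-minute tumbling windows per device, tolerating events up to 1 hour late.
windowed = (events
    .withWatermark("date", "1 hour")
    .groupBy(window(col("date"), "10 minutes"), col("device"))
    .agg(avg("value").alias("avg_value")))

query = (windowed.writeStream
    .outputMode("append")
    .format("console")
    .start())
query.awaitTermination()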
Presentation - Need for interactive queries and fast responses
This is where the front-facing app is going to integrate and it makes sense to pick a tool that is ideally suited for the kind of queries expected and the accuracy of responses needed.
Tools like ES perform extremely well for searching, filtering and faceting, but fall short when there is a need for complex mathematical aggregations. AFAIK ES doesn't support probabilistic structures such as HyperLogLog the way Druid does.
Retrofit
Now you have to map the requirements you have with each of the layers above.
showing the history of energy consumption and predicting the consumption in the future
detecting anomalies in energy consumption or other factors like lights or heating usage
You clearly need machine learning libraries, as you have mentioned. Spark with its MLlib support is super-awesome.
showing recommendations based on some kind of unsophisticated statistics, e.g. "you can move a given device from location1 to location2 because it's more needed there (it's used more intensively than in the other place)", etc.
You could even do this using MLlib on Spark and have the recommendations pumped to a separate index in ES or even a Kafka topic, which you can then push down to HDFS or ES. You should be careful with cleanup here, as this can lead to data explosion, and you need to be aggressive about retention. Also, computing recommendations beforehand helps you do reactive things like alerting and push notifications, and even queries from a UI will be faster.
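For instance, once the recommendations are computed in Spark, they could be written to a separate ES index through the es-hadoop connector (which must be on the Spark classpath); the index name "recommendations" and the columns here are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("push-recommendations").getOrCreate()

# Stand-in for the output of an MLlib job.
recs = spark.createDataFrame(
    [("Lamp 1", "LivingRoom", "Bedroom", 0.82)],
    ["device", "from_room", "to_room", "score"],
)

(recs.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "127.0.0.1")
    .option("es.port", "9200")
    .option("es.resource", "recommendations/doc")  # index/type
    .mode("append")
    .save())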
Assuming 100 bytes for one event log, this leads to an approximation of around 15 terabytes of data in Elasticsearch per year.
These are normal provisioning problems with any storage system. You can optimise here by calculating materialised views for historical data, but you can leave that decision until a bit later, as this can lead to premature optimisation. You would be better off measuring the storage and the latency of queries to begin with and then doing a retroactive analysis of capacity.
Is it a reasonable start to store all my logs in Elasticsearch?
Very much so, considering your use case. But if you use Spark Streaming/MLlib or a batch MR job, then you could even use dumb data stores, as most of the computation happens beforehand.
I'm considering the es-hadoop library to use Elasticsearch along with Apache Spark, so that I can process my data using MLlib in Spark - is this a reasonable direction to go?
Looks like you have decided on batch processing, in which case you can use standard MR or Spark batch along with MLlib. If you need real time, you need something like Kafka and Spark Streaming. If you are okay with data loss, you can be aggressive about retention, and also within Spark when you decide on windowing/sliding intervals, etc. If you are okay with results being inaccurate, you can use probabilistic data structures (like Bloom filters and HyperLogLog - Druid supports these) to represent the results.
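To make the probabilistic-structures point concrete, here is a tiny HyperLogLog illustration, using the datasketch library as one possible implementation (the library choice and the numbers are just for illustration):

from datasketch import HyperLogLog

hll = HyperLogLog(p=12)  # ~1.6% relative error, a few KB of memory
for i in range(100_000):
    # Count distinct devices without storing every device id.
    hll.update(f"device-{i % 37_000}".encode("utf-8"))

print("approximate distinct devices:", int(hll.count()))  # close to 37,000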
Can I use only Elasticsearch to store all my data and just use Spark and MLlib to provide in-depth analysis, or should I consider implementing the so-called "Lambda Architecture", treating Elasticsearch as a speed layer?
I am not sure whether you can stream data from ES into Spark jobs. And the Lambda Architecture is over-hyped; it only helps if you know for sure that your real-time layer is inaccurate and you cannot handle data loss/inaccuracies. Otherwise a simple Spark Streaming job reading data from Kafka and pumping into ES should be more than enough. Please consider measuring data loss before you decide on elaborate architectures like Lambda, since the operational costs (duplicate code, more infrastructure to maintain, etc.) are likely high.
What if the data load were 10x smaller (around 1.5 terabytes per year) - would your answer be the same?
I would still prefer the same architecture - Kafka + Spark Streaming (MLlib) + ES/Druid - as it is easier to implement and easier to maintain.
Per the title - I have seen that many companies - especially in ad tech - use a data warehouse solution like Redshift, where they store all their transactional data to do aggregations and analytics, and also pump their data into Elasticsearch, possibly for the same reason (not for search anyway).
Apologies if this question looks daft, but I wanted to understand the reasons behind this.
Is it to get real-time queries out of one and do historical data analysis on the other?
Thanks
Indeed, I've worked with a few companies (as a consultant) who were considering a combination of these two for exactly the reasons you described:
Redshift: for historical analysis, large complex queries, joins, trends, pre-aggregations
ElasticSearch (usually with Kibana): for near real-time operational monitoring and analytics, leveraging its indexing capabilities and free-form searches, dashboards, JSON indexing, real-time metric queries
Redshift is great for handling massive amounts of time-series data (billions of rows in seconds). But it's not ideal for frequent queries on real-time streamed data, and that's where ElasticSearch comes in.
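As a rough sketch of how that split looks in code, a reporting service might run heavy historical aggregations against Redshift (here via psycopg2) and near real-time lookups against Elasticsearch (here via elasticsearch-py); every table, index, field and host name below is hypothetical:

import psycopg2
from elasticsearch import Elasticsearch

# Historical trend: daily event counts over the last 90 days from Redshift.
rs = psycopg2.connect(host="my-cluster.example.com", port=5439,
                      dbname="analytics", user="reporter", password="...")
with rs.cursor() as cur:
    cur.execute("""
        SELECT date_trunc('day', event_time) AS day, count(*)
        FROM impressions
        WHERE event_time > getdate() - interval '90 days'
        GROUP BY 1
        ORDER BY 1
    """)
    daily_counts = cur.fetchall()

# Operational view: events for one campaign in the last 15 minutes from ES.
es = Elasticsearch(["http://127.0.0.1:9200"])
recent = es.search(index="events-*", body={
    "query": {"bool": {"must": [
        {"term": {"campaign_id": "12345"}},
        {"range": {"@timestamp": {"gte": "now-15m"}}},
    ]}}
})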
We are currently interested in evaluating Datameer and have a few questions. Are there any Datameer users who can answer these questions:
Since Datameer works off HDFS, are the querying speeds similar to those of Hive? How does the querying speed compare with columnar databases?
Since Hadoop is known for high latency, is it advisable to use Datameer for real-time querying?
Thank you.
Ravi
Regarding 1:
Query speeds are comparable to Hive.
But Datameer is a lot faster in the design phase of your "query". Datameer provides a real-time preview of what the results of your "query" would look like, which happens in memory and not on the cluster. The preview is based on a representative sample of your data. It's only a preview, not the final result, but it gives you constant feedback on whether your analytics make sense while you are designing.
To test a Hive query you would have to execute it, which makes the design process very slow.
Datameer's big advantages over Hive are:
Loading data into Hadoop is much easier. No static schema creation, no ETL, etc. Just use a wizard to download data from your database, log files, social media, etc.
Designing analytics or making changes is a lot faster and can even be done by non-technical users.
No need to install anything else, since Datameer includes all you need for importing, analytics, scheduling, security, visualization, etc. in one product.
If you have real-time requirements you should not pull data directly out of Datameer, Hive, Impala, etc. Columnar storage makes some processing faster but will still not be low latency. But you can use those tools together with a low-latency database. Use Datameer/Hive/Impala for the heavy lifting to filter and pre-aggregate big data into smaller data and then export that into a database. In Datameer you could set this up very easily using one of its wizards.
Hope this helps,
Peter Voß (Datameer)
I have a large amount of data I need to store, and be able to generate reports on - each one representing an event on a website (we're talking over 50 per second, so clearly older data will need to be aggregated).
I'm evaluating approaches to implementing this, obviously it needs to be reliable, and should be as easy to scale as possible. It should also be possible to generate reports from the data in a flexible and efficient way.
I'm hoping that some SOers have experience of such software and can make a recommendation, and/or point out the pitfalls.
Ideally I'd like to deploy this on EC2.
Wow. You are opening up a huge topic.
A few things right off the top of my head...
think carefully about your schema for inserts in the transactional part and reads in the reporting part; you may be best off keeping them separate if you have really large data volumes
look carefully at the latency that you can tolerate between real-time reporting on your transactions and aggregated reporting on your historical data. Maybe you should have a process which runs periodically and aggregates your transactions.
look carefully at any requirement which sees you reporting across your transactional and aggregated data, either in the same report or as a drill-down from one to the other
prototype with some meaningful queries and some realistic data volumes
get yourself a real production-quality, enterprise-ready database, e.g. Oracle / MSSQL
think about using someone else's code/product for the reporting e.g. Crystal/BO / Cognos
as I say, huge topic. As I think of more I'll continue adding to my list.
HTH and good luck
@Simon made a lot of excellent points; I'll just add a few and reiterate/emphasize some others:
Use the right datatype for the Timestamps - make sure the DBMS has the appropriate precision.
Consider queueing for the capture of events, allowing for multiple threads/processes to handle the actual storage of the events.
Separate the schemas for your transactional database and your data warehouse.
Seriously consider a periodic ETL from the transactional db to the data warehouse.
Remember that you probably won't have 50 transactions/second 24x7x365 - peak transactions vs. average transactions
Investigate partitioning tables in the DBMS. Oracle and MSSQL will both partition on a value (like date/time).
Have an archiving/data retention policy from the outset. Too many projects just start recording data with no plans in place to remove/archive it.
I'm surprised none of the answers here cover Hadoop and HDFS - I would suggest that is because SO is a programmers' Q&A site and your question is in fact a data science question.
If you're dealing with a large number of queries and long processing times, you would use HDFS (a distributed storage format on EC2) to store your data and run batch queries (i.e. analytics) on commodity hardware.
You would then provision as many EC2 instances as needed (hundreds or thousands depending on how big your data crunching requirements are) and run MapReduce queries against your data to produce reports.
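For a sense of what such a batch report job can look like, here is a sketch using PySpark over event logs stored in HDFS as one way to express a MapReduce-style aggregation; the path, schema and column names are illustrative only:

from pyspark.sql import SparkSession
from pyspark.sql.functions import count, to_date

spark = SparkSession.builder.appName("daily-event-report").getOrCreate()

# Raw site events, e.g. one JSON document per event, written to HDFS.
events = spark.read.json("hdfs:///data/site-events/*.json")

report = (events
    .withColumn("day", to_date("timestamp"))
    .groupBy("day", "event_type")
    .agg(count("*").alias("events")))

report.write.mode("overwrite").parquet("hdfs:///reports/daily_event_counts")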
Wow.. This is a huge topic.
Let me begin with databases. First, get something good if you are going to have crazy amounts of data. I like Oracle and Teradata.
Second, there is a definitive difference between recording transactional data and reporting/analytics. Put your transactional data in one area and then roll it up on a regular schedule into a reporting area (schema).
I believe you can approach this in two ways:
Throw money at the problem: Buy best in class software (databases, reporting software) and hire a few slick tech people to help
Take the homegrown approach: Build only what you need right now and grow the whole thing organically. Start with a simple database and build a web reporting framework. There are a lot of decent open-source tools and inexpensive agencies that do this work.
As far as the EC2 approach goes... I'm not sure how it would fit into a data storage strategy. The processing needs are limited, which is where EC2 is strong. Your primary goal is efficient storage and retrieval.