I'm gathering event logs every time a property of some device is changed. For this purpose I decided to use:
Logstash - where my agent IoT application sends logs to in JSON format,
Elasticsearch - for storing data (logs),
Kibana - for data visualisation.
The JSON with the logs is sent at regular intervals and has the following form:
{"deviceEventLogs":[{"date":"16:16:39 31-08-2016","locationName":"default","property":"on","device":"Lamp 1","value":"
false","roomName":"LivingRoom"}, ... ,]}
An example of a single event entry in Elasticsearch looks as follows:
{
  "_index": "logstash-2016.08.25",
  "_type": "on",
  "_id": "AVbDYQPq54WlAl_UD_yg",
  "_score": 1,
  "_source": {
    "@version": "1",
    "@timestamp": "2016-08-25T20:25:28.750Z",
    "host": "127.0.0.1",
    "headers": {
      "request_method": "PUT",
      "request_path": "/deviceEventLogs",
      "request_uri": "/deviceEventLogs",
      "http_version": "HTTP/1.1",
      "content_type": "application/json",
      "http_user_agent": "Java/1.8.0_91",
      "http_host": "127.0.0.1:31311",
      "http_accept": "text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2",
      "http_connection": "keep-alive",
      "content_length": "34861"
    },
    "date": "2016-08-08T14:48:11.000Z",
    "device": "Lamp 1",
    "property": "on",
    "locationName": "default",
    "roomName": "LivingRoom",
    "value_boolean": true
  }
}
My goal is to create a website with some kind of dashboard showing the analyzed data within a reasonable time (a delay of several minutes is acceptable), i.e.:
showing the history of energy consumption and predicting future consumption
detecting anomalies in energy consumption or in other factors like lights or heating usage
showing recommendations based on some kind of unsophisticated statistics, e.g. "you can move a given device from location1 to location2 because it is needed more there (used more intensively than in its current place)", etc.
While the last point is quite trivial - I can use a simple query or aggregation in Elasticsearch and compare the result with some threshold value - the first two points require in-depth analysis such as machine learning or data mining.
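For illustration, here is a minimal sketch of the "simple aggregation plus threshold" idea using the Python Elasticsearch client; the index pattern, time range and threshold are assumptions, and the field names are taken from the example document above (with the default Logstash template you may need the not_analyzed .raw sub-fields for the terms aggregations).

from elasticsearch import Elasticsearch

es = Elasticsearch()  # defaults to localhost:9200

# Count state changes per device, broken down by location, over the last week.
body = {
    "size": 0,
    "query": {"range": {"date": {"gte": "now-7d"}}},
    "aggs": {
        "per_device": {
            "terms": {"field": "device"},
            "aggs": {"per_location": {"terms": {"field": "locationName"}}}
        }
    }
}
result = es.search(index="logstash-*", body=body)

USAGE_THRESHOLD = 1000  # arbitrary value, purely for illustration
for device in result["aggregations"]["per_device"]["buckets"]:
    for location in device["per_location"]["buckets"]:
        if location["doc_count"] > USAGE_THRESHOLD:
            print(device["key"], "is used heavily in", location["key"])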
For now the system is equipped with around 50 devices, each updating its status every 10 seconds on average. In the future the number of devices can increase up to 50,000. Assuming 100 bytes per event log, that adds up to roughly 15 terabytes of data in Elasticsearch per year.
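A quick back-of-the-envelope check of that figure, under the stated assumptions (50,000 devices, one ~100-byte event every 10 seconds):

devices = 50000
events_per_device_per_day = 86400 / 10   # one event every 10 seconds
bytes_per_event = 100

bytes_per_year = devices * events_per_device_per_day * bytes_per_event * 365
print(bytes_per_year / 1e12)  # ~15.8 TB per year, before replication and index overhead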
The general question is - what would be a reasonable solution / technology / architecture for such a system?
Is it a reasonable start to store all my logs in Elasticsearch?
I am considering the es-hadoop library to use Elasticsearch along with Apache Spark, so that I can process my data using MLlib in Spark - is that a reasonable direction to go?
Can I use only Elasticsearch to store all my data and just use Spark and MLlib to provide in-depth analysis, or should I consider implementing the so-called "Lambda Architecture", treating Elasticsearch as a Speed Layer? I've read a bit about various configurations where Kafka and Apache Storm were used, but I'm not really sure I need them. Since the project should be done within one month and I'm a beginner, I'm worried about the complexity and hence the time needed for such an implementation.
What if the data load were 10x smaller (around 1.5 terabytes per year) - would your answer be the same?
This is a very elaborate question, let me try to break it down:
Questions that you should think about
What is the end-to-end latency for your data to be available for queries? Do you need it in real time, or are you okay with delays?
How much data loss are you willing to tolerate?
What accuracy do you expect from the analytics/ML algorithms? Do you need highly accurate results, or are you okay with some inaccuracies?
Do you need results only when they are complete or do you need some kind of speculative results?
These questions, along with the usual ones like space constraints and latency as the data load increases, should help you determine the right solution.
Generally, these problems can be viewed as Ingestion -> Processing -> Presentation.
Ingestion - Need for a Message Bus
Generally, people opt for a message bus like Kafka to handle back-pressure from slow downstream consumers and also to provide reliability (by persisting to disk) to prevent data loss. Kafka also has good community support in terms of integrations like Spark Streaming, Druid firehose support, ES plugins, etc.
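A minimal producer sketch with the kafka-python client, to show what the ingestion side could look like; the broker address and the "device-events" topic name are just examples:

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# One event in the same shape as the deviceEventLogs entries above.
event = {
    "date": "16:16:39 31-08-2016",
    "locationName": "default",
    "property": "on",
    "device": "Lamp 1",
    "value": "false",
    "roomName": "LivingRoom",
}
producer.send("device-events", value=event)
producer.flush()  # block until the broker has acknowledged the message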
Processing - Need for a scalable compute layer
This is where you need to decide on things like real-time vs. batch processing, acceptable data loss, accurate vs. speculative results, etc. Read Tyler Akidau's article on streaming at https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 for a detailed explanation.
People choose Spark Streaming for real-time use cases, and a simple M/R job should do the trick for batch jobs. If you are planning on streaming jobs, then windowing and sessions of events can complicate things further; see the sketch below.
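As a sketch of what a windowed streaming job could look like, using the DStream API and the spark-streaming-kafka integration that were current at the time (the topic name and broker address are assumptions):

import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="device-event-stream")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# Events arrive as JSON strings on a hypothetical "device-events" topic.
stream = KafkaUtils.createDirectStream(
    ssc, ["device-events"], {"metadata.broker.list": "localhost:9092"}
)

# Count state changes per device over a 5-minute window, sliding every minute.
counts = (
    stream.map(lambda kv: json.loads(kv[1]))
          .map(lambda event: (event["device"], 1))
          .reduceByKeyAndWindow(lambda a, b: a + b, None, 300, 60)
)
counts.pprint()

ssc.start()
ssc.awaitTermination()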
Presentation - Need for interactive queries and fast responses
This is where the front-facing app is going to integrate and it makes sense to pick a tool that is ideally suited for the kind of queries expected and the accuracy of responses needed.
Tools like ES perform extremely well for searching, filtering and faceting, but fall short when there is a need for complex mathematical aggregations. AFAIK ES doesn't support probabilistic structures such as HyperLogLog in the way Druid does.
Retrofit
Now you have to map your requirements onto each of the layers above.
showing the history of energy consumption and predicting future consumption
detecting anomalies in energy consumption or in other factors like lights or heating usage
You clearly need machine learning libraries, as you have mentioned. Spark with its MLlib support is super-awesome.
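As a toy illustration of the forecasting part with the RDD-based MLlib API (the numbers are made up; a real job would build the training set from the historical events, e.g. hourly consumption aggregates):

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

sc = SparkContext(appName="consumption-forecast")

# Hypothetical (hour_of_day, kWh) aggregates used as training data.
hourly_kwh = [(0, 0.4), (6, 0.9), (12, 1.6), (18, 2.1), (23, 0.7)]
training = sc.parallelize(
    [LabeledPoint(kwh, [hour]) for hour, kwh in hourly_kwh]
)

model = LinearRegressionWithSGD.train(training, iterations=200, step=0.01)
print(model.predict([20]))  # rough forecast for 20:00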
showing recommendations based on some kind of unsophisticated statistics, e.g. "you can move a given device from location1 to location2 because it is needed more there (used more intensively than in its current place)", etc.
You could even do this using MLlib on Spark and have the recommendations pumped to a separate index in ES, or even to a Kafka topic from which you can get them down to HDFS or ES. You should be careful with garbage collection here, as this can lead to data explosion, and you need to be aggressive about retention. Also, computing recommendations beforehand helps you do reactive stuff like alerting and push notifications, and even a query from a UI will be faster.
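A sketch of pushing precomputed recommendations into a dedicated ES index with the es-hadoop (elasticsearch-spark) connector; the index/type name, fields and score are illustrative only:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("recommendations").getOrCreate()

recommendations = spark.createDataFrame([
    Row(device="Lamp 1", moveFrom="LivingRoom", moveTo="Kitchen", score=0.83),
])

(recommendations.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "127.0.0.1:9200")
    .mode("append")
    .save("recommendations/doc"))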
Assuming 100 bytes per event log, that adds up to roughly 15 terabytes of data in Elasticsearch per year.
These are normal provisioning problems with any storage system. You can optimise here by computing materialised views for historical data, but you can leave that decision until a bit later, as it can lead to premature optimisation. You would be better off measuring the storage and query latency to begin with and then doing a retroactive analysis of capacity.
Is it a reasonable start to store all my logs in Elasticsearch?
Very much so, considering your use case. But if you use Spark Streaming/MLlib or a batch MR job, then you could even use dumb data stores, as most of the computation happens beforehand.
I am considering the es-hadoop library to use Elasticsearch along with Apache Spark, so that I can process my data using MLlib in Spark - is that a reasonable direction to go?
Looks like you have decided on batch processing, in which case you can use standard MR or Spark batch along with MLlib. If you need real time, you need something like Kafka and Spark Streaming. If you are okay with data loss, you can be aggressive about retention, and also in Spark when you decide on windowing/sliding intervals, etc. If you are okay with inaccurate results, you can use probabilistic data structures (like Bloom filters and HyperLogLog - Druid supports these) to represent the results.
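For the es-hadoop direction specifically, a sketch of loading the indexed events into a Spark DataFrame before handing them to MLlib (the index pattern and node address follow the example above; adjust to your cluster):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("event-analysis").getOrCreate()

events = (spark.read
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "127.0.0.1:9200")
    .load("logstash-*"))

# From here the DataFrame can be aggregated or turned into feature vectors
# for MLlib, e.g. events.groupBy("device").count().show()
events.printSchema()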
Can I use only Elasticsearch to store all my data and just use Spark and MLlib to provide in-depth analysis, or should I consider implementing the so-called "Lambda Architecture", treating Elasticsearch as a Speed Layer?
I am not sure whether you can stream data from ES to Spark jobs. And the Lambda Architecture is over-hyped; it only helps if you know for sure that your real-time layer is inaccurate and that you cannot handle data loss/inaccuracies. Otherwise a simple Spark Streaming job reading data from Kafka and pumping it into ES should be more than enough. Please consider measuring data loss before you decide on elaborate architectures like Lambda, since the operational costs (duplicate code, more infrastructure to maintain, etc.) are likely high.
What if the data load were 10x smaller (around 1.5 terabytes per year) - would your answer be the same?
I would still prefer the same architecture - Kafka + Spark Streaming (MLlib) + ES/Druid - it is easier to implement and easier to maintain.
Related
Can a time series database do everything that a streaming analytics system (like Spark Streaming / Flink / Kinesis Analytics) can?
Does one subsume the other? I am not looking for which one is better, just an understanding of the different use cases they support.
Time series databases focus on storing and retrieving time-based entries in more performant ways than common relational databases. They have recently become a hot topic again, given the industry's interest in high-performance event processing. Nowadays most of them rely on specific indexing techniques over NoSQL databases, e.g. OpenTSDB (HBase), InfluxDB (BoltDB) and so on.
On the other hand, distributed stream processing frameworks like Spark Streaming are based on the research on Data Stream Management Systems and provide more flexible ways of analysing events. They are usually applied to other types of data analysis, such as machine learning over streams, sketches and windowing, and to many other techniques that are not the focus of time series databases.
Both originated from the research of the 2000s on Time Series Databases and Data Stream Management Systems, so many of the features and architectural ideas from one are applied in the other and vice versa. An example of this is that the seminal stream processing paper "Continuous Queries over Data Streams" (S. Babu, 2001) cites time series databases as an example of related work.
I am evaluating a few different options for powering an analytics application using open-source technology. One of the options is Elasticsearch, though I haven't been able to find any examples of companies using it for large-scale analytics implementations, hence my question here.
For datasets of 1B-10B points, what limitations (if any) would Elasticsearch have, or would it even be possible? For example, for building a feature set like Google Analytics with it.
Here's one user who seems to do analytics on largish amounts of data - https://digitalgov.gov/2015/01/07/elk - plus a description of what they do, including the downsides.
With Elasticsearch there is no black-and-white answer to a question as open-ended as yours. The number of records is not everything: how much disk space are we talking about, how many nodes, how many indices, how many shards for each, what kind of analytics you need, hardware specs, etc. Two things are certain from the data you mentioned: you need dedicated master nodes and, more importantly, good client nodes, and depending on the queries and the concurrent search count you will need more or fewer of them.
In Elasticsearch 5 the client node is called a coordinating node, but it has the same role. One limitation I can think of is the heap/RAM memory of such a coordinating node. The heap of an Elasticsearch node shouldn't be set to values larger than ~30 GB because of the longer garbage collection cycles of the JVM (the larger the memory to clean, the more time it takes and the longer the node is unusable). During GC nothing else runs on that JVM, so you could be limited by the size of the memory.
I said that you will most likely need coordinating nodes because heavy aggregations (probably the most used feature in an analytics platform) consume CPU and memory in the final phase of a query, where the node gathers the results from all shards involved and performs the final sorting and aggregation. Thus it will need more memory than a normal data node would need for aggregations alone.
I doubt, though, that a single aggregation will use that many GBs of memory, but it could theoretically do so if the query/aggregation is built in a reckless way. Depending on how many concurrent searches there are and how much memory they use, you might need more or fewer coordinating nodes so that the GC cycles are not too frequent.
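To make the "reckless aggregation" point concrete, here is a bounded version of a typical analytics query with the Python client: an unbounded terms aggregation over a high-cardinality field forces the coordinating node to hold huge bucket sets in memory, while an explicit small size keeps the final reduce phase cheap (index and field names are examples only):

from elasticsearch import Elasticsearch

es = Elasticsearch()

bounded = {
    "size": 0,
    "aggs": {
        "top_devices": {
            "terms": {"field": "device", "size": 20}  # top 20 instead of every bucket
        }
    }
}
es.search(index="logstash-*", body=bounded)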
Bottom line: I think this is possible, but some common sense is needed (see my comment about reckless aggregations), along with estimations of the load that are as close to reality as possible.
Google Analytics Pros:
Easy to Install
Can be used in multiple environments (e.g. web, mobile, other)
Customized data collection
Google Analytics Cons:
Custom reporting is limited
Upgrading to Premium is expensive
Requires continual training
Slices data into smaller samples to deal with large sampling issues
ElasticSearch Pros:
Distributed by design
Easier to scale horizontally
Good at full text search
Fast indexing & querying
ElasticSearch Cons:
Not a relational database, therefore it does not benefit from things like foreign-key constraints
Data consistency can be affected
No built-in authentication or authorization system
I'm currently investigating how to store and analyze enriched time-based data with up to 1000 columns per row. At the moment Cassandra together with either Solr, Hadoop or Spark, as offered by DataStax Enterprise, seems to roughly fulfill my requirements. But the devil is in the details.
Out of the 1000 columns, about 60 are used for real-time-like queries (web frontend; the user sends a form and expects a quick response). These queries are more or less GROUP BY statements in which the number of occurrences is counted.
As Cassandra itself does not provide the required analytical capabilities (no GROUP BY), I'm left with these alternatives:
Roughly query via Cassandra and filter the resultset within self-written code
Index the data with Solr and run facet.pivot queries
Use either Hadoop or Spark and run the queries
The first approach seems cumbersome and prone to errors… Solr does have some analytic features, but without multi-field grouping I'm stuck with pivots. I don't know whether this is a good or performant approach though… Last but not least there are Hadoop and Spark, the former known not to be the best for real-time queries, the latter pretty new and maybe not production-ready.
So which way to go? There is no one-size-fits-all here, but before I commit to one path I'd like to get some feedback. Maybe I'm overcomplicating things or my expectations are too high :S
Thanks in advance,
Arman
Where I work now we have a similar set of tech requirements, and the solution is Cassandra-Solr-Spark, exactly in that order.
So if a query can be "covered" by Cassandra indices - good; if not, it's covered by Solr. For testing and less frequent queries - Spark (Scala, no Spark SQL due to an old version of it -- it's a bank, everything should be tested and mature, from cognac to software, argh).
Generally I agree with the solution, though sometimes I have a feeling that some client requests should NOT be taken seriously at all, which would save us from loads of weird queries :)
I would recommend Spark; if you take a look at the list of companies using it, you'll see names such as Amazon, eBay and Yahoo!. Also, as you noted in the comment, it's becoming a mature tool.
You've given arguments against Cassandra and Solr already, so I'll focus on explaining why Hadoop MapReduce wouldn't do as well as Spark for real-time queries.
Hadoop and MapReduce were designed to leverage hard disks under the assumption that for big data IO is negligible. As a result, data is read and written at least twice - in the map stage and in the reduce stage. This allows you to recover from failures, because partial results are secured, but it is not what you want when aiming for real-time queries.
Spark not only aims to fix MapReduce's shortcomings, it also focuses on interactive data analysis, which is exactly what you want. This goal is achieved mainly by utilizing RAM, and the results are astonishing: Spark jobs are often 10-100 times faster than their MapReduce equivalents.
The only caveat is the amount of memory you have. Most probably your data is going to fit in the RAM you can provide, or you can rely on sampling. Usually, when working with data interactively, there is no real need for MapReduce, and that seems to be the case here.
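As a sketch of what the GROUP BY-style counts could look like in Spark over Cassandra data, using the spark-cassandra-connector (keyspace, table and column names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("groupby-counts").getOrCreate()

rows = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="events", table="measurements")
    .load())

# Equivalent of SELECT col_a, col_b, COUNT(*) ... GROUP BY col_a, col_b
rows.groupBy("col_a", "col_b").count().show()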
For a business use case where we have to deal with a minimum of 2-3 terabytes of data per day, I was doing an analysis of Hadoop and Storm.
Needless to say, Storm looks impressive because of its efficiency in processing incoming big data, but I am not sure whether Storm will be capable of processing terabytes of data and at the same time providing real-time results.
Can anyone explain please?
Thanks,
Gajendra
Storm was developed at Twitter; they process more than 8 TB per day with it. That sounds like it should be enough for your case. AFAIK Storm is the best streaming/real-time system for distributed computing. Hadoop is not suitable for this because of its job start-up times and its lack of native handling of streaming data.
The fact is, both can handle the amount of data per day that you expect, provided you have enough server power, storage, etc.
I've been considering the following options.
senseidb [http://www.senseidb.com] - this needs a fixed schema and data gateways, so there is no simple way to push data; you have to provide data streams. My data is unstructured and there are very few common attributes across all kinds of logs
riak[http://wiki.basho.com/Riak-Search.html]
vertica - cost factor?
HBase (+ Hadoop ecosystem + Lucene) - the main cons here are that on a single machine this won't make much sense, and I am not sure about the free-text search capability that would have to be built around it
The main requirements are:
1. it has to sustain thousands of incoming requests for archival and at the same time build a real-time index that allows end users to do free-text search
2. storage (log archives + index) has to be optimal
There are a number of specialized log storage and indexing tools; I don't know that I'd necessarily cram logs into a normal data store.
If you have lots of money, it's tough to beat Splunk.
If you'd prefer an open-source option, check out the ServerFault discussion. logstash + ElasticSearch seems to be a really strong choice, and should grow pretty well as your logs do.
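Once the logs are in Elasticsearch, the free-text side is straightforward; a small sketch with the Python client (the index pattern and the "message" field are the usual logstash defaults; adjust to your mapping):

from elasticsearch import Elasticsearch

es = Elasticsearch()

hits = es.search(
    index="logstash-*",
    body={"query": {"match": {"message": "connection timed out"}}},
)
for hit in hits["hits"]["hits"]:
    print(hit["_source"].get("message"))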
Have you given any thought to implementations along these lines? It might be helpful to integrate Lucene and Hadoop for your problem.
http://www.cloudera.com/blog/2011/09/hadoop-for-archiving-email/
http://www.cloudera.com/blog/2012/01/hadoop-for-archiving-email-part-2/
So instead of email, your use case could use the log files and the parameters to index.
For 2-3 TB of data this sounds like an "in the middle" case. If that is all the data, I would not suggest going on a Big Data / NoSQL venture.
I think an RDBMS with full-text search capability should do on good hardware. I would suggest some aggressive partitioning by time to be able to work with 2-3 TB of data; without partitioning it would be too much. At the same time, if your data is partitioned by day, I think the data size will be fine for MySQL.
Taking into account the comment below that the data size is about 10-15 TB, and that the need for replication will multiply this number by 2-3x, we should also consider the size of the indexes, which I would estimate at dozens of percent of the data size. An efficient single-node solution would probably be more expensive than clustering, mostly because of licensing costs.
To the best of my understanding, existing Hadoop/NoSQL solutions cannot answer your requirements out of the box, mostly because of the number of documents to be indexed. In our case, each log is a document. (http://blog.mgm-tp.com/2010/06/hadoop-log-management-part3/)
So I think the solution will be to aggregate logs for some period of time together and treat them as one document.
For the storage of these log packages, HDFS or Swift could be a good solution.