What would be the advantages of using ELK for log management over a simple python logging + existing database log table combo? - elasticsearch

Assuming I have many Python processes running on an automation server such as Jenkins, let's say I want to use Python's native logging module and, other than writing to the Jenkins console or to a log file, I want to store & centralize the logs somewhere.
I thought of using ELK for that, but then I realized that I can just as well create a dedicated log table in an existing database (I'm using Redshift), use something like Grafana for log dashboards/visualization and save myself the trouble of deploying a new system (most of the people in my team are familiar with Redshift but not with ElasticSearch).
Although it sounds straightforward, I feel like I'm not looking at the big picture and that I would be missing some powerful capabilities that components like Logstash were written for in the first place. What would these capabilities be, and how would it be advantageous to use ELK instead of my solution?
Thank you!
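For reference, the "logging module + log table" approach described above could be sketched roughly as follows. This is only an illustration under assumptions: sqlite3 stands in for Redshift so the snippet runs anywhere, and the handler class, table name and columns (TableLogHandler, job_logs) are hypothetical.

```python
import logging
import sqlite3


class TableLogHandler(logging.Handler):
    """Write each log record as one row in a database table.

    Assumption: sqlite3 stands in for Redshift, and the table and
    column names are made up for this sketch.
    """

    def __init__(self, connection):
        super().__init__()
        self.connection = connection
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS job_logs ("
            "created REAL, logger TEXT, level TEXT, message TEXT)"
        )

    def emit(self, record):
        try:
            self.connection.execute(
                "INSERT INTO job_logs VALUES (?, ?, ?, ?)",
                (record.created, record.name, record.levelname, record.getMessage()),
            )
            self.connection.commit()
        except Exception:
            # Standard logging convention: never let a logging failure crash the job.
            self.handleError(record)


conn = sqlite3.connect(":memory:")
logger = logging.getLogger("nightly_build")
logger.setLevel(logging.INFO)
logger.addHandler(TableLogHandler(conn))
logger.info("build %s finished", 42)
```

A real Redshift setup would probably want to buffer records and load them in batches rather than insert row by row, since single-row inserts tend to be slow on columnar warehouses.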

I have implemented a full ELK stack in my company in the past year.
The project was huge and took a lot of time to properly implement. The advantages of using ELK and not implementing our own centralized logging solution would be:
Not needing to reinvent the wheel: there is already a product that does just that (and the installation part is extremely easy).
It is battle-tested and can withstand a huge volume of logs in a short time.
As your business and product grow and shift, you will need to parse more logs with different structures, which would mean DB changes on a self-built system. Logstash will give you endless possibilities for filtering and parsing those newly formatted logs.
It has Cluster and HA capabilities, and you can scale your logging system vertically and horizontally.
Very easy to maintain and change over time.
It can send the needed output to a variety of products including Zabbix, Grafana, Elasticsearch and many more.
Kibana will give you the ability to view the logs, build graphs and dashboards, set up alerts and more.
The options with ELK are really endless, and the more I work with it, the more I find new ways it can help me: not just viewing logs from distributed remote server systems, but also security alerts, SLA graphs and many other insights.
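To make the point about changing log structures concrete, here is a minimal sketch of emitting one JSON object per log record, which a JSON-aware pipeline can ingest without a schema change when new fields appear. The JsonFormatter class and the optional "job" field are assumptions for illustration, not part of any ELK component.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical extra field, set via logger.info(..., extra={"job": "..."}).
        job = getattr(record, "job", None)
        if job is not None:
            payload["job"] = job
        return json.dumps(payload)


def make_json_logger(name):
    """Return a logger that writes JSON lines to stderr (StreamHandler's default)."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Adding a new field later only changes the JSON payload; with a fixed log table, the same change would typically need a new column.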

Related

Graphite vs Elastic Metrics Beat for Windows Performance Counters

I work with a web api, that makes heavy use of Windows Performance Counters. Until now this has not been collected in a good tool.
I would like to start making this data available in a place where we can create dashboards etc.
We already have an Elasticsearch cluster. I am only an end user when it comes to Elastic; I do not have administrator knowledge. But I have heard about Metricbeat, which as far as I understand is intended for exactly this kind of data, including Windows Performance Counters.
But I have also worked with Graphite and Grafana for these types of data in the past.
I have also heard that you can use Grafana as a dashboard tool on top of Metricbeat data. Is that correct?
I don't know what the best choice is, and I haven't been able to find comparisons on this on the web. So I am hoping someone here can enlighten me.
I also have a sneaky suspicion that I might have misunderstood something since I cannot find comparisons out there.
Thanks
This is quite a subjective question; you'll get different answers depending on who you ask.
Anyway, there are three parts:
1) collection of metrics
2) storage of metrics
3) display of metrics
Metricbeat is a collector. I do not know which collectors are suitable for Windows; popular collectors are Collectd, Telegraf, Beats and Diamond. You basically need to find one that collects the data you are interested in. If you are interested in application metrics, you can also plug a library into your application. A popular choice for Java is Dropwizard.
Then you'll need some database to store those metrics in. For data storage you can use Graphite, InfluxDB, Elastic, etc, whatever suits your requirements.
And then for displaying the metrics you can basically choose between Grafana and Kibana, and I think InfluxData has something as well.
If you don't have any specific requirements most of the mentioned tools will do fine.
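As an illustration of the collection-to-storage step, here is a small sketch that formats a metric in Graphite's plaintext protocol ("path value timestamp", one metric per line) and sends it over TCP. The metric path and host name are placeholders; 2003 is Carbon's usual plaintext port.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric as a Graphite plaintext line: path, value, Unix timestamp."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="graphite.example.internal", port=2003):
    """Send one metric to a Carbon listener.

    The host is a placeholder for this sketch; adjust host and port
    to match your own Graphite installation.
    """
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(graphite_line(path, value).encode("ascii"))
```

For example, graphite_line("webapi.requests_per_sec", 12.5, 1700000000) yields the line "webapi.requests_per_sec 12.5 1700000000". A collector such as Metricbeat or Telegraf does the equivalent of this for you on a schedule.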

How do I get from "Big Data" to a webpage?

I've spent a lot of time reading and watching videos of people talking about how they use tools designed for handling huge datasets and real-time processing in their architectures. And while I understand what it is that tools like Hadoop/Cassandra/Kafka etc do, no one seems to explain how the data gets from these large processing tools to rendering something on a client/webpage.
From what I understand of big data tools, you can't build your application the same way you would build a standard web app querying MySQL, which makes sense given the size of the data that flows through these tools. However, for all this talk of "realtime data analytics", I cannot find any explanation of how the actual analytics get put in front of someone as a chart, table, etc.
explain how the data gets from these large processing tools to rendering something on a client/webpage.
With respect to this, one way would be to process the big data using Spark or Hadoop and store the results in an RDBMS. Then have your web app pull data from the RDBMS to render charts, tables, etc. I can provide examples that I have done myself if you need more information.
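A minimal sketch of that pattern, with assumptions stated up front: a plain Python loop stands in for the Spark/Hadoop job, sqlite3 stands in for the RDBMS, and the page_views table and function names are hypothetical.

```python
import json
import sqlite3


def store_summary(conn, events):
    """Batch step: aggregate raw events and store only the small result."""
    counts = {}
    for event in events:
        counts[event["page"]] = counts.get(event["page"], 0) + 1
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_views (page TEXT PRIMARY KEY, views INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO page_views VALUES (?, ?)", list(counts.items())
    )
    conn.commit()


def chart_data(conn):
    """Web step: the endpoint reads the precomputed rows, never the raw data."""
    rows = conn.execute(
        "SELECT page, views FROM page_views ORDER BY views DESC, page"
    ).fetchall()
    return json.dumps([{"page": page, "views": views} for page, views in rows])
```

The web app only ever touches the summary table, so it can be built like any ordinary database-backed application; the heavy processing happens upstream on whatever schedule the batch job runs.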
Impala supports ODBC/JDBC interfaces. So, you actually could hook up a web app to it the same way you do with MySQL.
Other stuff you might want to check out is HBase, Kudu or Solr. In some realtime architectures data ends up in one of those. And all of them have some sort of an API that you can use in your web app to access their data.
If you want a simple solution for realtime data processing and analytics, check out the new Stride API, which enables developers to collect, process, and analyze streaming data and then either visualize summary data in Stride or push processed data out to applications in realtime. This is a very easy way to build the kind of realtime reporting dashboards and monitoring / alerting systems you described above.
Take a look at the Stride API technical docs for examples and more info on how to implement this.

NoSQL for multi-site archival logging with full-text search

I'm looking at building a somewhat complex log handling system to replace an old ad-hoc setup and could use a bit of advice. I'm pretty familiar with SQL databases and networking, but am very new to NoSQL stores, which seem to be the key to solving this mess. Note that we have a very good team, but a limited licensing budget, so free/open-source options are vastly preferred. (That said, availability of support if something goes pear-shaped would be nice.)
Requirements:
Archive (test) logs generated in the several GB/day range at multiple sites around the world.
Provide full-text search of those logs at each site, fairly instantaneously, for debugging purposes.
Push that archived data back to a central location (though a replica at each site would be absolutely okay).
Provide for analytics of that data back at the central location.
Constraints:
The sites have fairly crap Internet connections for the moment (high latency and fairly low bandwidth). Much of the data is generated during the day and a good portion of the sync would have to lag behind and finish overnight each day.
Sites MUST be able to function if the WAN goes completely off-line.
Extras
The log data is (as usual) highly compressible. Any solution that compresses data transacting from node to node across the WAN is preferred.
Many log files are related to each other in multi-level hierarchies, and that relationship is very important and must be maintained!
Sites will generally not modify the same data or modify it again once stored. This is all archival for the most part.
We can either stream as the logs are generated or push blocks of logs. Streaming is preferred, as it would simplify things considerably.
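As a rough illustration of the "push blocks of logs" option combined with compression across the WAN, a batch of log lines could be packed and unpacked like this. It is only a sketch: the function names are made up, and it assumes individual log lines contain no embedded newlines.

```python
import gzip


def pack_batch(lines):
    """Compress a batch of log lines into one blob for transfer across the WAN."""
    raw = "\n".join(lines).encode("utf-8")
    return gzip.compress(raw)


def unpack_batch(blob):
    """Reverse pack_batch at the receiving site."""
    text = gzip.decompress(blob).decode("utf-8")
    return text.split("\n") if text else []
```

Blobs like these can be written to a local spool directory while the WAN is down and shipped later, which fits the requirement that sites keep working offline.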
Options I'm aware of:
Local MySQL and folder structure for logging and local configuration management.
This is what we have now and it's running, but not a long-term solution by any means.
Elasticsearch
I've read that ElasticSearch would probably be really good for this, though from what I understand that doesn't support multi-site.
Cassandra
This seems to have built-in multi-site support, but I'm not exactly familiar with the data-model. Is this a good choice for something like this, or will I hate myself if I give it a try?
CouchDB
This is a document store that seems(?) like a good match for log data, but again doesn't appear to have multi-site support.
Apache Kafka
I read up on this, but I haven't quite wrapped my head around it yet...
Questions:
Do any of these actually let you stream-append logs or are they best suited to dumping completed files in?
Is there a solution I'm missing that might be better?
Any recommendations on multi-site with some of the options that don't support multi-site by themselves?
Interesting links:
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
http://blog.cloudera.com/blog/2015/07/deploying-apache-kafka-a-practical-faq/
https://www.elastic.co/blog/scaling_elasticsearch_across_data_centers_with_kafka
https://kafka.apache.org/08/ops.html
https://github.com/Stratio/cassandra-lucene-index
I may be a bit biased, since Couchbase is my employer, but this sounds like the kind of problem that XDCR (Cross Datacenter Replication) was made to solve.
You could stand up a cluster on multiple geographical sites (Couchbase calls these "datacenters") and then XDCR would automatically replicate (bidirectionally) the data between sites. If I understand your requirements correctly, this sounds like just what you need.

Using elasticsearch as central data repository

We are currently using elasticsearch to index and perform searches on about 10M documents. It works fine and we are happy with its performance. My colleague who initiated the use of elasticsearch is convinced that it can be used as the central data repository and other data systems (e.g. SQL Server, Hadoop/Hive) can have data pushed to them. I didn't have any arguments against it because my knowledge of both is too limited. However, I am concerned.
I do know that data in elasticsearch is stored in a manner that is efficient for text searching. Hadoop stores data just as a file system would, but in a manner that makes it efficient to scale and replicate blocks over multiple data nodes. Therefore, in my mind it seems more beneficial to use Hadoop (as it is more agnostic w.r.t. its view on data) as a central data repository, and then push data from Hadoop to SQL, elasticsearch, etc.
I've read a few articles on Hadoop and elasticsearch use cases and it seems conventional to use Hadoop as the central data repository. However, I can't find anything that would suggest that elasticsearch wouldn't be a decent alternative.
Please Help!
As is the case with all database deployments, it really depends on your specific application.
Elasticsearch is a great open source search engine built on top of Apache Lucene. Its features and upgrades allow it to basically function just like a schema-less JSON datastore that can be accessed using both search-specific methods and regular database CRUD-like commands.
Despite all the advantages that Elasticsearch brings, there are still some main disadvantages:
Security - Elasticsearch did not originally provide any authentication or access control functionality. This has been supported since Shield was introduced.
Transactions - There is no support for transactions or processing on data manipulation. Data manipulation is now handled with Logstash.
Durability - ES is distributed and fairly stable but backups and durability are not as high priority as in other data stores.
Maturity of tools - ES is still relatively new and has not had time to develop mature client libraries and 3rd party tools, which can make development much harder. We can consider that it's quite mature now, with a variety of connectors and tools around it like Kibana.
Large computations - ES is still not suited for large computations: commands for searching data are not suited to "large" scans of data and advanced computation on the DB side.
Data Availability - ES makes data available in "near real-time" which may require additional considerations in your application (ie: comments page where a user adds new comment, refreshing the page might not actually show the new post because the index is still updating).
If you can deal with these issues then there's certainly no reason why you can't use Elasticsearch as your primary data store. It can actually lower complexity and improve performance by not having to duplicate your data but again this depends on your specific use case.
As always, weigh the benefits, do some experimentation and see what works best for you.
DISCLAIMER: This answer was written a while ago for the Elasticsearch 1.x series. These criticisms still stand to some extent with the 2.x series, but Elastic is working on them: the 2.x series comes with more mature tools, APIs and plugins, for example Shield for security, or shippers like Logstash and Beats.
I'd highly discourage most users from using elasticsearch as their primary datastore. It will work great until your cluster melts down due to a network partition. Even settings such as minimum_master_nodes that the ES pros always set won't save you. See this excellent analysis by Aphyr in his Call Me Maybe series:
http://aphyr.com/posts/317-call-me-maybe-elasticsearch
eliasah is right, it depends on your use case, but if your data (and job) is important to you, stay away.
Keep the golden record of your data stored in something really focused on persistence, and sync your data out to search from there. It adds extra complexity and resources, but will result in a better night's rest :)
There are plenty of ways to go about this and if elasticsearch does everything you need, you can look into Kafka for persisting all the events going into a cluster which would allow replaying if things go wrong. I like this approach as it provides an async ingestion pipeline into elasticsearch that also does the persistence.
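The replay idea can be sketched without any real infrastructure. In this illustration an in-memory list stands in for a durable log such as a Kafka topic and a dict stands in for the search index; the EventLog class and the event shape are assumptions, not Kafka's or Elasticsearch's API.

```python
class EventLog:
    """Append-only in-memory stand-in for a durable log such as a Kafka topic."""

    def __init__(self):
        self._events = []

    def append(self, event):
        """Store the event and return its offset in the log."""
        self._events.append(event)
        return len(self._events) - 1

    def read_from(self, offset=0):
        """Return all events at or after the given offset."""
        return list(self._events[offset:])


def rebuild_index(log):
    """Replay every event in order to rebuild the index (a dict here) from scratch."""
    index = {}
    for event in log.read_from(0):
        if event["op"] == "delete":
            index.pop(event["id"], None)
        else:
            index[event["id"]] = event["doc"]
    return index
```

Because the log, not the index, is the source of truth, a damaged or lost index can be discarded and rebuilt by replaying from offset 0.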

Is there any visualization tool to see the processing chain (bolt and spouts) in a storm cluster?

Context:
I'm seeing a null pointer exception in one of the integration tests, which runs in a locally spawned Storm cluster. I increased the log level and still could not figure out what is really happening. Any help would be appreciated.
Your question doesn't quite match your title. If you're looking for better access to logs for scalable apps (whether on Hadoop or Storm), then check out tools that collect and aggregate logs from multiple nodes and systems. I'm familiar with Papertrail and Graylog, but I'm sure there are others. These tools, in conjunction with judicious use of log levels, can help you quickly find errors in your scalable apps.
If you're looking to get a better idea of how your system is performing (this is what I think of when I hear "visualization"), then check out distributed monitoring tools. We've had very good success with both the visualization of Storm bolt/spout performance and alert processing with CopperEgg, for example.

Resources