What is the best component stack for building a distributed log aggregator (like Splunk)? - ruby

I'm trying to find the best components for building something similar to Splunk, in order to aggregate logs from a large number of servers in a computing grid. It should also be distributed, because I produce gigabytes of logs every day and no single machine will be able to store them.
I'm particularly interested in something that works with Ruby and runs on Windows and the latest Solaris (yeah, I've got a zoo).
I see the architecture as:
Log crawler (Ruby script).
Distributed log storage.
Distributed search engine.
Lightweight front end.
The log crawler and the distributed search engine are not in question: logs will be parsed by a Ruby script, and ElasticSearch will be used to index log messages. The front end is also very easy to choose: Sinatra.
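As a rough illustration of the crawler-to-index step described above, here is a minimal sketch that parses raw lines into documents and builds a newline-delimited body for ElasticSearch's _bulk endpoint. It is written in Python for brevity; the same shape carries over to a Ruby script. The line format, the `logs` index name and the `host` field are assumptions made for the example, and depending on the ElasticSearch version the action line may also need a `_type`.

```python
import json
import re

# Hypothetical line format: "2011-03-01 12:00:05 ERROR disk full on /var"
LINE_RE = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<level>[A-Z]+) (?P<message>.*)$"
)


def parse_line(line, host):
    """Turn one raw log line into a dict, or None if it does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    doc = match.groupdict()
    doc["host"] = host  # assumed extra field: which server produced the line
    return doc


def bulk_body(docs, index_name="logs"):
    """Build a newline-delimited JSON body for the _bulk endpoint."""
    lines = []
    for doc in docs:
        # Each document is preceded by an action line naming the target index.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The resulting string would be POSTed to the cluster's `/_bulk` URL; batching many lines per request keeps the crawler from issuing one HTTP call per log message.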
My main problem is distributed log storage. I looked at MongoDB, CouchDB, HDFS, Cassandra and HBase.
MongoDB was rejected because it doesn't work on Solaris.
CouchDB doesn't support sharding (smartproxy is required to make it work, but this is something I don't even want to try).
Cassandra works great, but it's a disk space hog and it requires running autobalance every day to spread the load between Cassandra nodes.
HDFS looked promising, but the FileSystem API is Java only and JRuby was a pain.
HBase looked like the best solution around, but deploying and monitoring it is a disaster: in order to start HBase I need to start HDFS first, check that it started without problems, then start HBase and check it too, and then start the REST service and check that as well.
So I'm stuck. Something tells me HDFS or HBase is the best thing to use as log storage, but HDFS only works smoothly with Java and HBase is a deployment/monitoring nightmare.
Can anyone share their thoughts or experience building similar systems, using the components I described above or something completely different?

I'd recommend using Flume to aggregate your data into HBase. You could also use the ElasticSearch sink for Flume to keep a search index up to date in real time.
For more, see my answer to a similar question on Quora.

With regard to Java and HDFS: using a scripting tool like BeanShell, you can interact with the HDFS store from loosely typed, Java-syntax scripts instead of compiled Java.


Ingest log files from edge nodes to Hadoop

I am looking for a way to stream entire log files from edge nodes to Hadoop. To sum up the use case:
We have applications that produce log files ranging from a few MB to hundreds of MB per file.
We do not want to stream all the log events as they occur.
Pushing the log files in their entirety after they have been written completely is what we are looking for (written completely = moved into another folder, for example; this is not a problem for us).
This should be handled by some kind of lightweight agent on the edge nodes that writes to HDFS directly or, if necessary, to an intermediate "sink" that pushes the data to HDFS afterwards.
Centralized pipeline management (= configuring all edge nodes in a centralized manner) would be great.
I came up with the following evaluation:
Elastic's Logstash and Filebeat
Centralized pipeline management for edge nodes is available, e.g. one centralized configuration for all edge nodes (requires a license)
Configuration is easy; a WebHDFS output sink exists for Logstash (using Filebeat would require an intermediate setup of Filebeat + Logstash that outputs to WebHDFS)
Both tools are proven to be stable in production-level environments
Both tools are made for tailing logs and streaming these single events as they occur rather than ingesting a complete file
Apache NiFi w/ MiNiFi
The use case of collecting logs and sending the entire file to another location with a broad number of edge nodes that all run the same "jobs" looks predestined for NiFi and MiNiFi
MiNiFi running on the edge node is lightweight (Logstash on the other hand is not so lightweight)
Logs can be streamed from MiNiFi agents to a NiFi cluster and then ingested into HDFS
Centralized pipeline management within the NiFi UI
Writing to an HDFS sink is available out of the box
The community looks active; development is led by Hortonworks (?)
We have had good experiences with NiFi in the past
Apache Flume
Writing to an HDFS sink is available out of the box
Looks like Flume is more of an event-based solution than a solution for streaming entire log files
No centralized pipeline management?
Apache Gobblin
Writing to an HDFS sink is available out of the box
No centralized pipeline management?
No lightweight edge node "agents"?
Maybe another tool to look at? Looking for your comments on this one...
I'd love to get some comments about which of the options to choose. The NiFi/MiNiFi option looks the most promising to me - and is free to use as well.
Have I forgotten any broadly used tool that is able to solve this use case?
I experience similar pain when choosing open source big data solutions, simply because there are so many roads to Rome. Although "asking for technology recommendations is off topic for Stackoverflow", I still want to share my opinion.
I assume you already have a Hadoop cluster to land the log files on. If you are using an enterprise-ready distribution, e.g. HDP, stay with its choice of data ingestion solution. This approach always saves you a lot of effort in installation, setting up central management and monitoring, and implementing security and system integration whenever there is a new release.
You didn't mention how you would like to use the log files once they land in HDFS. I assume you just want to make an exact copy, i.e. data cleansing or transformation to a normalized format is NOT required during ingestion. So I wonder why you didn't mention the simplest approach: using scheduled hdfs commands to put the log files into HDFS from the edge nodes.
Now I can share one production setup I was involved in. In this setup, log files are pushed to or pulled by a commercial mediation system that performs data cleansing, normalization, enrichment, etc. Data volume is above 100 billion log records every day. There is a setup of 6 edge nodes behind a load balancer. Logs first land on one of the edge nodes and are then put into HDFS with the hdfs put command. Flume was used initially but was replaced by this approach due to performance issues (it may very well be that the engineer lacked experience in optimizing Flume). It is worth mentioning, though, that the mediation system has a management UI for scheduling ingestion scripts. In your case, I would start with a cron job for the PoC and then use e.g. Airflow.
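To make the "scheduled hdfs put" idea concrete, here is a minimal sketch of a script a cron job could call. It is only an illustration under stated assumptions, not the production setup described above: the directory names and the `ship_finished_logs` helper are hypothetical, and the `hdfs` client is assumed to be installed and on the PATH of the edge node.

```python
import shutil
import subprocess
from pathlib import Path


def put_command(local_file, hdfs_dir):
    """Build the `hdfs dfs -put` command line for one finished log file."""
    return ["hdfs", "dfs", "-put", str(local_file), hdfs_dir]


def ship_finished_logs(done_dir, shipped_dir, hdfs_dir, run=subprocess.run):
    """Upload every file in done_dir, then move it aside so it is not re-sent.

    `run` is injectable so the logic can be exercised without a cluster.
    """
    shipped = []
    Path(shipped_dir).mkdir(parents=True, exist_ok=True)
    for local_file in sorted(Path(done_dir).iterdir()):
        if not local_file.is_file():
            continue
        result = run(put_command(local_file, hdfs_dir))
        if result.returncode != 0:
            # Leave the file in place so the next scheduled run retries it.
            continue
        shutil.move(str(local_file), str(Path(shipped_dir) / local_file.name))
        shipped.append(local_file.name)
    return shipped
```

A cron entry would then simply invoke a small wrapper around `ship_finished_logs` every few minutes. Because files are only moved after a successful put, a failed upload is retried on the next run rather than lost.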
Hope it helps! And would be glad to know your final choice and your implementation.

Newbie: Hadoop IIS Logs - Reasonable approach?

I am a total beginner with Hadoop, so sorry if this is a stupid question.
My fictional scenario is that I have several web servers (IIS) with several log locations. I want to centralize these log files and, based on the data, analyze the health of the applications and the web servers.
Since the Hadoop ecosystem offers a variety of tools, I am not sure if my solution is a valid one.
So I thought I would move the log files to HDFS, create an external table on the directory and an internal table, and copy the data via Hive (insert into ... select from) from the external table to the internal table (with some filtering because of the comment lines beginning with #).
Once the data is stored in the internal table, I delete the previously moved files from HDFS.
Technically it works, I have tried it already, but is this a reasonable approach?
And if yes, how would I automate these steps, since so far I did everything manually via Ambari?
Thanks for your input.
Yes, this is a perfectly fine approach.
Outside of setting up the Hive table ahead of time, what's left to automate?
You want to run things on a schedule? Use Oozie, Luigi, Airflow, or Azkaban.
Ingesting logs from other Windows servers because you have a highly available web service? Use Puppet, for example, to configure your log collection agents (not Hadoop related).
Note: if it's only log file collection that you care about, I would probably have used Elasticsearch instead of Hadoop to store the data, Filebeat to continuously watch the log files, Logstash to apply per-message filtering, and Kibana to do visualizations. If you combine Elasticsearch for fast indexing/searching with Hadoop for archival, you can insert Kafka between the log message ingestion and the message writers/consumers.
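For the filtering step mentioned in the question (dropping the IIS lines that begin with #), the same logic can also be applied before the files ever reach Hive. The sketch below assumes the standard W3C extended log layout, where a `#Fields:` directive names the space-separated columns of the data lines that follow; the `parse_iis_log` helper is a hypothetical name used only for this example.

```python
def parse_iis_log(lines):
    """Yield one dict per data line of an IIS W3C extended log.

    Lines starting with '#' are directives; '#Fields:' names the columns
    for the data lines that follow it.
    """
    fields = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue  # directives are never emitted as data
        values = line.split(" ")
        if len(values) != len(fields):
            continue  # skip lines that do not match the current header
        yield dict(zip(fields, values))
```

Because the header can reappear in the middle of a file (for example after a service restart), the field list is refreshed whenever a new `#Fields:` line is seen instead of being read only once.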

GlusterFS as the backend for Hadoop

I've seen that Red Hat has come up with a possible solution in which GlusterFS works as the backend for Hadoop. In this case, you can get rid of the namenode/datanode architecture and replace it with GlusterFS, while still keeping Hadoop MapReduce API compatibility.
Just wondering how the performance compares against native HDFS? Is it really production ready? Does it support the whole Hadoop ecosystem as well, e.g. Solr Cloud, Spark, Impala, etc.?
Disclaimer: I work for a storage vendor.
Well, I don't know much about GlusterFS in particular, but I can speak about Lustre, as it's POSIX at the end of the day. It's a parallel filesystem, and the benchmarks I looked into recently showed it does outperform HDFS. It's definitely a production-ready alternative that offers a single namespace for your data (no more HDFS ingestion).
What works from the Hadoop ecosystem today?
What I've seen in production today is Spark, Hive and HBase. Impala looks to me like it requires certain parts of HDFS, which is why it doesn't work with a POSIX FS, and it's not HCFS. I did a quick test and I was able to create the database and everything, but I wasn't able to fetch any rows.
Let me know if you need further help.

hadoop logging facility?

If I am to use ZooKeeper as a work queue and connect individual consumers/workers to it, what would you recommend as a good distributed setup for logging these workers' activities?
Assume the following:
1) At any time we could be down to a single computer housing the Hadoop cluster. The system will autoscale up and down as needed, but has a lot of downtime where only a single computer is needed.
2) I just need the ability to access all of the workers' logs without accessing the individual machine each worker is located on. Bear in mind that by the time I get to read one of these logs, that machine might very well be terminated and long gone.
3) We'll need easy access to the logs, i.e. being able to cat/grep and tail, or alternatively to query in a more SQL-ish manner. We'll need the ability to both query and monitor output in real time for short periods (e.g. tail -f /var/log/mylog.1).
I appreciate your expert ideas here!
Have you looked at using Flume, Chukwa or Scribe? Make sure that your Flume (or other) process has access to the log files that you are trying to aggregate onto a centralized server.
flume reference:
hope it helps.
The Fluentd log collector just released its WebHDFS plugin, which allows users to stream data into HDFS instantly. It's really easy to install and to manage.
Fluentd + Hadoop: Instant Big Data Collection
Of course you can also import data directly from your applications. Here's a Java example of posting logs to Fluentd. Fluentd's Java library is clever enough to buffer locally when the Fluentd daemon is down, which lessens the possibility of data loss.
Fluentd: Data Import from Java Applications
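The local buffering behaviour mentioned above can be pictured with a small, library-independent sketch. This is not Fluentd's actual client code; `BufferedSender`, its `send` callable and the `max_buffered` limit are hypothetical names chosen for the illustration, and a real client would also persist or time-limit its buffer.

```python
from collections import deque


class BufferedSender:
    """Queue records locally and flush them once the collector is reachable."""

    def __init__(self, send, max_buffered=10000):
        # `send` is any callable(record) that raises OSError on failure.
        self._send = send
        # A bounded deque drops the oldest records first when it is full.
        self._pending = deque(maxlen=max_buffered)

    def emit(self, record):
        """Queue a record and try to deliver everything queued so far."""
        self._pending.append(record)
        return self.flush()

    def flush(self):
        """Send pending records in order; stop at the first failure."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except OSError:
                return False  # collector is down; keep records for later
            self._pending.popleft()
        return True

    def pending(self):
        """Number of records still waiting to be delivered."""
        return len(self._pending)
```

A worker would call `emit` for each log record; while the daemon is unreachable the records simply accumulate, and the next successful call delivers them in their original order.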
A high availability configuration is also available, which basically enables you to have a centralized log aggregation system.
Fluentd: High Availability Configuration

Do you know batch log processing tools for hadoop (zohmg alternatives)?

Since the zohmg project seems to be dead (no new commits since Nov 2009), I would like to know if any of you have used it (with successful results), or if you know anything about the future of this project.
And if not, is there any alternative to this project? I'm looking for a tool that will help to extract data from (Apache) logs (using Hadoop as a batch processing system), store it in HBase, and help with querying this data.
Cascading is very often used for this. It also provides adapters for HBase.
Examples can be found here
HBase integration
