Elasticsearch and Splunk connection via Hadoop

I am currently doing some experimentation for a company that has its log data in Elasticsearch (they currently use the entire ELK stack).
Splunk has a product called Hunk that lets you query HDFS/Hadoop data from Splunk's interface. I have been able to get this working.
My question is: is there a way, using es-hadoop, to somehow 'bridge' the two together, so that when Hunk queries my HDFS it also ends up pulling in the Elasticsearch data?
(The company wants to see if it's feasible to use Splunk without having to duplicate the data.)
Thank you.

Splunk deprecated Hunk in 2016 and now offers Hadoop Connect, which lets you read, write, explore, and export data from HDFS. There is no Splunk license fee associated with reading, writing, or exploring HDFS clusters, so you can work with HDFS data to build statistical solutions without incurring a fee. Splunk is betting that you'll want to put summarized data, as well as refined data sets, into Splunk to make them more available to enterprise audiences. https://www.splunk.com/blog/2012/12/20/connecting-splunk-and-hadoop/
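A different angle on the 'bridge' question: the es-hadoop connector can expose an Elasticsearch index to Spark/Hive jobs running on the cluster, so Hadoop-side tooling can query the ES data without copying it into HDFS. Whether Hunk or Hadoop Connect will transparently pick that up is something you'd have to verify; this is only a minimal sketch, with placeholder host and index names and the elasticsearch-spark jar assumed on the classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("es-hadoop-bridge").getOrCreate()

    # Read an Elasticsearch index as a DataFrame via the es-hadoop connector
    logs = (spark.read
            .format("org.elasticsearch.spark.sql")
            .option("es.nodes", "es-host:9200")      # placeholder ES node
            .load("logstash-2017.01.01"))            # placeholder index name

    # Register it so SQL-on-Hadoop tooling in this session can query it like a table
    logs.createOrReplaceTempView("es_logs")
    spark.sql("SELECT COUNT(*) FROM es_logs").show()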

Related

Hadoop Integration with Document Capture Software

We have a requirement to send documents to Hadoop (Hortonworks) from our image capture software: the capture software releases PDF documents along with metadata.
I don't have much experience with HDP. Is there any REST service or tool that can add documents to Hadoop, given the documents and their metadata?
Please help
Hadoop HDFS offers both WebHDFS (a REST API) and an NFS Gateway.
However, it's generally recommended not to drop raw data straight onto HDFS; routing it through an ingestion layer gives you better control over auditing where and how data gets written.
For example, you could use Apache NiFi: start a ListenHTTP processor, read the document data, parse, filter, and enrich it, and then optionally write to HDFS or many other destinations.
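If the capture software (or a small script next to it) can make HTTP calls, the WebHDFS route is straightforward. A minimal sketch, with hypothetical host, user, and paths, and the metadata written as a sidecar JSON file next to each PDF:

    import requests

    NAMENODE = "http://namenode-host:50070"   # WebHDFS endpoint; adjust for your cluster
    USER = "capture-user"                     # placeholder user name

    def put_file(local_path, hdfs_path):
        # Step 1: ask the NameNode where to write; it replies with a redirect to a DataNode
        url = f"{NAMENODE}/webhdfs/v1{hdfs_path}?op=CREATE&user.name={USER}&overwrite=true"
        r = requests.put(url, allow_redirects=False)
        datanode_url = r.headers["Location"]

        # Step 2: stream the file body to the DataNode URL
        with open(local_path, "rb") as f:
            requests.put(datanode_url, data=f).raise_for_status()

    put_file("invoice-0001.pdf", "/landing/capture/invoice-0001.pdf")
    put_file("invoice-0001.json", "/landing/capture/invoice-0001.json")  # metadata sidecar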

Ingest log files from edge nodes to Hadoop

I am looking for a way to stream entire log files from edge nodes to Hadoop. To sum up the use case:
We have applications that produce log files ranging from a few MB to hundreds of MB per file.
We do not want to stream all the log events as they occur.
Pushing the log files in their entirety after they have written completely is what we are looking for (written completely = got moved into another folder for example... this is not a problem for us).
This should be handled by some kind of lightweight agent on the edge nodes that writes to HDFS directly or - if necessary - to an intermediate "sink" that pushes the data to HDFS afterwards.
Centralized Pipeline Management (= configuring all edge nodes in a centralized manner) would be great
I came up with the following evaluation:
Elastic's Logstash and Filebeat
Centralized pipeline management for edge nodes is available, e.g. one centralized configuration for all edge nodes (requires a license)
Configuration is easy, and a WebHDFS output plugin exists for Logstash (using Filebeat would require an intermediate setup of Filebeat + Logstash outputting to WebHDFS)
Both tools are proven to be stable in production-level environments
Both tools are made for tailing logs and streaming single events as they occur rather than ingesting complete files
Apache NiFi w/ MiNiFi
The use case of collecting logs and sending the entire file to another location with a broad number of edge nodes that all run the same "jobs" looks predestined for NiFi and MiNiFi
MiNiFi running on the edge node is lightweight (Logstash on the other hand is not so lightweight)
Logs can be streamed from MiNiFi agents to a NiFi cluster and then ingested into HDFS
Centralized pipeline management within the NiFi UI
writing to an HDFS sink is available out-of-the-box
Community looks active; development is led by Hortonworks (?)
We have had good experiences with NiFi in the past
Apache Flume
writing to an HDFS sink is available out-of-the-box
Flume looks like more of an event-based solution than a solution for streaming entire log files
No centralized pipeline management?
Apache Gobblin
writing to an HDFS sink is available out-of-the-box
No centralized pipeline management?
No lightweight edge node "agents"?
Fluentd
Maybe another tool to look at? Looking for your comments on this one...
I'd love to get some comments about which of the options to choose. The NiFi/MiNiFi option looks the most promising to me - and is free to use as well.
Have I forgotten any broadly used tool that is able to solve this use case?
I have experienced similar pain when choosing open source big data solutions, simply because there are so many roads to Rome. Though asking for technology recommendations is off topic for Stack Overflow, I still want to share my opinions.
I assume you already have a Hadoop cluster to land the log files on. If you are using an enterprise-ready distribution, e.g. HDP, stay with its selection of data ingestion tools. This approach saves you a lot of effort in installation, central management and monitoring setup, security implementation, and system integration whenever there is a new release.
You didn't mention how you would like to use the log files once they land in HDFS. I assume you just want to make an exact copy, i.e. data cleansing or transformation to a normalized format is NOT required during ingestion. That makes me wonder why you didn't mention the simplest approach: a scheduled hdfs dfs -put that copies log files into HDFS from the edge node.
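A rough sketch of that simplest approach, with hypothetical directory names and the hadoop client assumed to be installed on the edge node:

    # Push every log file that has landed in the "completed" folder to HDFS,
    # then move it to an archive folder so it is not pushed twice.
    import pathlib
    import shutil
    import subprocess

    COMPLETED = pathlib.Path("/var/log/myapp/completed")   # files moved here when fully written
    ARCHIVE = pathlib.Path("/var/log/myapp/archived")
    HDFS_TARGET = "/data/raw/myapp/"                        # placeholder HDFS directory

    def push_completed_files():
        ARCHIVE.mkdir(parents=True, exist_ok=True)
        for log_file in COMPLETED.glob("*.log"):
            # -put fails if the target already exists, which is a cheap safety net
            subprocess.run(
                ["hdfs", "dfs", "-put", str(log_file), HDFS_TARGET],
                check=True,
            )
            shutil.move(str(log_file), str(ARCHIVE / log_file.name))

    if __name__ == "__main__":
        push_completed_files()   # run this from cron on each edge node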
Now I can share one production setup I was involved in. In that setup, log files are pushed to, or pulled by, a commercial mediation system that does data cleansing, normalization, enrichment, etc. Data volume is above 100 billion log records every day. There is a setup of 6 edge nodes behind a load balancer. Logs first land on one of the edge nodes and are then put into HDFS with an hdfs command. Flume was used initially but was replaced by this approach due to performance issues (it may very well be that the engineers lacked experience in optimizing Flume). Worth mentioning, though: the mediation system has a management UI for scheduling the ingestion scripts. In your case, I would start with a cron job for the PoC and then move to e.g. Airflow.
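And for the Airflow stage, the DAG can stay very small (Airflow 2.x syntax; DAG id, schedule, and script path are placeholders):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="edge_node_log_ingest",
        start_date=datetime(2021, 1, 1),
        schedule_interval="*/15 * * * *",   # every 15 minutes
        catchup=False,
    ) as dag:
        push_logs = BashOperator(
            task_id="push_completed_logs",
            bash_command="python /opt/ingest/push_completed_files.py",
        )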
Hope this helps! I would be glad to hear your final choice and how you implement it.

Newbie: Hadoop IIS Logs - Reasonable approach?

I am a total beginner with Hadoop, so sorry if this is a stupid question.
My fictional scenario is that I have several web servers (IIS) with several log locations. I want to centralize these log files and, based on the data, analyze the health of the applications and the web servers.
Since the Hadoop ecosystem offers a variety of tools, I am not sure whether my solution is a valid one.
So my idea is to move the log files to HDFS, create an external table on that directory and an internal table, and copy the data from the external table to the internal table via Hive (INSERT INTO ... SELECT FROM), with some filtering because of the comment lines beginning with #.
Once the data is stored in the internal table, I delete the previously moved files from HDFS.
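(For reference, the filter-and-copy step I do in Hive would roughly correspond to this PySpark sketch; the paths and table name are just placeholders.)

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("iis-log-load").enableHiveSupport().getOrCreate()

    raw = spark.read.text("/landing/iis/")                 # one column named "value"
    cleaned = raw.filter(~F.col("value").startswith("#"))  # drop IIS comment/header lines

    cleaned.write.mode("append").saveAsTable("logs.iis_access")  # internal (managed) table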
Technically it works, I tried it already - but is this a reasonable approach?
And if yes - how would I automate these steps? For now I have done everything manually via Ambari.
Thanks for your input
BW
Yes, this is a perfectly fine approach.
Outside of setting up the Hive table ahead of time, what's left to automate?
You want to run things on a schedule? Use Oozie, Luigi, Airflow, or Azkaban.
Ingesting logs from other Windows servers because you have a highly available web service? Use Puppet, for example, to configure your log collection agents (not Hadoop related)
Note: if it's only log file collection that you care about, I would probably use Elasticsearch instead of Hadoop to store the data, Filebeat to continuously watch the log files, Logstash to apply per-message filtering, and Kibana for visualizations. If you are combining Elasticsearch for fast indexing/searching with Hadoop for archival, you can insert Kafka between the log message ingestion and the message writers/consumers.

How to implement search functionalities?

Currently we have an in-memory cache implementation backing our search functionality.
Now the data is getting very large and we can no longer handle it in memory. We are also getting more input sources from different systems (Oracle, flat files, and Git).
Can you please share how we can achieve this?
We thought ES would help with this. But how can we feed it input when changes happen in the end sources? (Batch processing will NOT help.)
Hadoop - we are NOT handling that level of data.
Please also share your thoughts.
we are getting more input sources from different systems (Oracle, flat files, and Git)
I assume that's why you tagged Kafka? It'll work, but you bring up a valid point:
But how can we feed it input when changes happen...?
For plain text or Git events, you'll obviously need to alter some parsing code and restart the job to get extra data into the message schema.
For Oracle, the GoldenGate product will publish table column changes, and Kafka Connect can recognize those events and update the payload accordingly.
If all you care about is searching things, plenty of tools exist, but since you mention Elasticsearch: Filebeat works for plain text, and Logstash can work with various other types of input sources. If you have Kafka, then feed events into Kafka and let Logstash or Kafka Connect update ES.
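A minimal sketch of the Kafka side using the kafka-python client; the broker address, topic name, and event fields are placeholders, and a Logstash pipeline or a Kafka Connect Elasticsearch sink would then keep ES up to date:

    import json

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="kafka-broker:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Example event emitted whenever a source record changes
    producer.send("search-updates", {
        "source": "git",
        "action": "upsert",
        "id": "repo/readme.md",
        "body": "updated file contents ...",
    })
    producer.flush()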

What is the best components stack for building distributed log aggregator (like Splunk)?

I'm trying to find the best components to build something similar to Splunk in order to aggregate logs from a large number of servers in a computing grid. It should also be distributed, because I have gigabytes of logs every day and no single machine will be able to store them all.
I'm particularly interested in something that will work with Ruby and will run on Windows and the latest Solaris (yeah, I've got a zoo).
I see architecture as:
Log crawler (Ruby script).
Distributed log storage.
Distributed search engine.
Lightweight front end.
The log crawler and the distributed search engine are already settled - logs will be parsed by a Ruby script and Elasticsearch will be used to index log messages. The front end is also very easy to choose - Sinatra.
My main problem is distributed log storage. I looked at MongoDB, CouchDB, HDFS, Cassandra and HBase.
MongoDB was rejected because it doesn't work on Solaris.
CouchDB doesn't support sharding (smartproxy is required to make it work but this is something I don't want to even try).
Cassandra works great, but it's just a disk space hog and it requires running autobalance every day to spread the load between Cassandra nodes.
HDFS looked promising but FileSystem API is Java only and JRuby was a pain.
HBase looked like the best solution around, but deploying and monitoring it is just a disaster - in order to start HBase I need to start HDFS first, check that it started without problems, then start HBase and check it as well, and then start the REST service and check that too.
So I'm stuck. Something tells me HDFS or HBase is the best thing to use as log storage, but HDFS only works smoothly with Java, and HBase is just a deployment/monitoring nightmare.
Can anyone share their thoughts or experience building similar systems, either with the components I described above or with something completely different?
I'd recommend using Flume to aggregate your data into HBase. You could also use the Elastic Search Sink for Flume to keep a search index up to date in real time.
For more, see my answer to a similar question on Quora.
With regard to Java and HDFS: a tool like BeanShell lets you script against the HDFS Java API interactively, without writing and compiling full Java classes.
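On the "HDFS is Java-only" pain specifically: HDFS also exposes WebHDFS, a REST API, so non-JVM languages can talk to it directly. A minimal sketch with the Python hdfs client library (host, user, and paths are placeholders; the same REST calls can be made from Ruby just as easily):

    from hdfs import InsecureClient

    client = InsecureClient("http://namenode-host:50070", user="logwriter")

    # One file per host per day, for example
    client.write("/logs/web01/2013-05-01.log",
                 data=b"GET /index.html 200\n",
                 overwrite=True)

    print(client.list("/logs/web01"))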
