Integration of Hadoop (specifically HDFS files) with the ELK stack (Elasticsearch)

I am trying to integrate Hadoop with the ELK stack.
My use case is: I have to get data from a file in an HDFS path and show its contents on a Kibana dashboard.
Hive is not working there, so I can't use Hive.
Are there any other ways to do that?
Does anybody have an article with a step-by-step process?
I have tried to pull logs from a Linux location on a Hadoop server through Logstash and Filebeat, but that is not working either.

I'm doing this for some OSINT work. It is quite easy to do once you can get the content out of HDFS onto a local filesystem. That's done by setting up an HDFS NFS Gateway. Once that's done, use Filebeat and Logstash to import your content into Elasticsearch. After that, just configure your Kibana dashboard for the index you're using.
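A minimal sketch of that flow, assuming a stock Hadoop NFS gateway and Filebeat shipping to Logstash; the mount point, log path, and Logstash host below are placeholders:

# start the HDFS NFS gateway services on a gateway node
hdfs portmap &
hdfs nfs3 &

# mount the HDFS root onto the local filesystem (options per the Hadoop NFS gateway docs)
mkdir -p /mnt/hdfs
mount -t nfs -o vers=3,proto=tcp,nolock,noacl,sync <gateway-host>:/ /mnt/hdfs

# point Filebeat at the mounted files and ship them to Logstash
cat >> filebeat.yml <<'EOF'
filebeat.inputs:
  - type: log
    paths:
      - /mnt/hdfs/data/*.log
output.logstash:
  hosts: ["localhost:5044"]
EOF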

Related

How to import data from HDFS (Hadoop) into ElasticSearch?

We have a big Hadoop cluster and recently installed Elasticsearch for evaluation.
Now we want to bring data from HDFS into Elasticsearch.
Elasticsearch is installed in a different cluster, and so far we could run a Beeline or HDFS script to extract data from Hadoop into a file and then bulk-load it into Elasticsearch from that local file (a sketch of this is shown below).
Wondering if there is a direct connection from HDFS to Elasticsearch.
I started reading about it here:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/install.html
But since our team is not DevOps (we neither configure nor manage the Hadoop cluster) and can only access Hadoop via Kerberos/user/pass, is it possible to configure this (and how) without involving the whole DevOps team that manages the Hadoop cluster to install/set up all these libraries before connecting directly?
How can this be done from the client side?
Thanks.
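For reference, a minimal client-side sketch of the extract-then-bulk-load approach described above; the principal, paths, host, and index are placeholders:

# authenticate against the Kerberized cluster, then pull the extracted file locally
kinit user@EXAMPLE.COM
hdfs dfs -get /user/me/export/data.json ./data.json

# bulk-load it into Elasticsearch (data.json must already be in _bulk format:
# an action/metadata line followed by a document line, newline-delimited)
curl -s -H 'Content-Type: application/x-ndjson' \
  -XPOST 'http://es-host:9200/my-index/_bulk' \
  --data-binary @data.json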

NiFi PutHDFS writes to the local filesystem

Challenge
I currently have two Hortonworks clusters, a NiFi cluster and an HDFS cluster, and I want to write to HDFS using NiFi.
On the NIFI cluster I use a simple GetFile connected to a PutHDFS.
When pushing a file through this, the PutHDFS terminates in success. However, rather than seeing a file dropped on my HDFS (on the HDFS cluster), I just see a file being dropped onto the local filesystem where I run NiFi.
This confuses me, hence my question:
How to make sure PutHDFS writes to HDFS, rather than to the local filesystem?
Possibly relevant context:
In the PutHDFS I have linked to the hive-site and core-site of the HDFS cluster (I tried updating all server references to the HDFS namenode, but with no effect)
I don't use Kerberos on the HDFS cluster (I do use it on the NIFI cluster)
I did not see anything looking like an error in the NiFi app log (which makes sense, as it successfully writes, just to the wrong place)
Both clusters are newly generated on Amazon AWS with CloudBreak, and opening all nodes to all traffic did not help
Can you make sure that you are able to move a file from the NiFi node to Hadoop using the command below:
hadoop fs -put <local-file> <hdfs-path>
If you are able to move your file using the above command, then check the Hadoop config files you are passing to your PutHDFS processor.
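For example, a quick check of the config handed to PutHDFS (fs.defaultFS is the standard Hadoop property; the path below is a placeholder):

# if fs.defaultFS resolves to file:///, PutHDFS will write to the local filesystem
grep -A 2 'fs.defaultFS' /path/to/core-site.xml
# expect something like: <value>hdfs://<namenode-host>:8020</value>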
Also, check that you don't have any other flow running that might be processing that file.

HDP 2.4: how to collect Hadoop MapReduce logs into one file using Flume, and what is the best practice

We are using HDP 2.4 and have many MapReduce jobs written in various ways (Java MR / Hive / etc.). The logs are collected in the Hadoop file system under the application ID. I want to collect all the logs of an application and append them into a single file (on HDFS or on the OS filesystem of one machine) so that I can analyze my application logs in a single location without hassle. Please also advise the best way to achieve this in HDP 2.4 (stack version info: HDFS 2.7.1.2.4 / YARN 2.7.1.2.4 / MapReduce2 2.7.1.2.4 / Log Search 0.5.0 / Flume 1.5.2.2.4).
Flume cannot collect the logs after they are already on HDFS.
In order to do this, you need a Flume agent running on all NodeManagers, pointed at the configured yarn.log.dir, and some way to parse out the application/container/attempt/file information from the local OS file path.
I'm not sure how well collecting into a "single file" would work, as each container generates at least 5 files of different information, but YARN log aggregation already does this. It's just not in a readable file format in HDFS unless you are using Splunk/Hunk, as far as I know.
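For example, with log aggregation enabled, the aggregated container logs of one application can be dumped into a single local file from any client node (the application ID is a placeholder):

# collect all aggregated container logs for one application into one file
yarn logs -applicationId application_1490000000000_0001 > app_logs.txt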
Alternative solutions include indexing these files into actual search services like Solr or Elasticsearch, which I would recommend over HDFS for storing and searching logs.

Can you send AWS RDS Postgres logs to an AWS Hadoop cluster easily?

In particular, I'd like to push all of the INSERT, UPDATE, and DELETE statements from my Postgres logs to an AWS Hadoop cluster and have a nice way to search them to see the history of a row or rows.
I'm not a Hadoop expert in any way, so let me know if this is a red herring.
Thanks!
Use Flume to send logs from your RDS instance to the Hadoop cluster. With Flume you can use the regex filtering interceptor to filter events and send just the INSERT, UPDATE, and DELETE statements. Hadoop does not make your data searchable, so you have to use something like Solr.
You could either get the data into Hadoop first and then run a bunch of MapReduce jobs to insert it into Solr, or you could configure Flume to write data directly to Solr; see the links and the sketch below.
Links:
Using flume solr sink
Flume Regex Filtering Interceptor
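A sketch of such an agent, using the regex filtering interceptor to keep only DML lines and an HDFS sink; the agent name, paths, regex, and namenode address are assumptions to adapt:

# write out a Flume properties file for the agent
cat > dml-agent.conf <<'EOF'
agent.sources = pglog
agent.channels = mem
agent.sinks = hdfs-sink

agent.sources.pglog.type = exec
agent.sources.pglog.command = tail -F /var/log/rds/postgresql.log
agent.sources.pglog.channels = mem
agent.sources.pglog.interceptors = dml
agent.sources.pglog.interceptors.dml.type = regex_filter
agent.sources.pglog.interceptors.dml.regex = .*(INSERT|UPDATE|DELETE).*
agent.sources.pglog.interceptors.dml.excludeEvents = false

agent.channels.mem.type = memory

agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.channel = mem
agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/logs/postgres
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
EOF

# run the agent
flume-ng agent --conf conf --conf-file dml-agent.conf --name agent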
EDIT:
It seems that RDS instances don't have SSH access, which means you cannot run Flume natively on the RDS instance itself; instead, you have to periodically pull the RDS instance's logs to a machine (this could be an EC2 instance) that has Flume configured.
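A minimal sketch of that periodic pull, assuming the AWS CLI is configured on the Flume machine; the instance identifier and log file name are placeholders:

# list the available Postgres log files for the RDS instance
aws rds describe-db-log-files --db-instance-identifier my-postgres-db

# download one log file to the machine running Flume
aws rds download-db-log-file-portion \
  --db-instance-identifier my-postgres-db \
  --log-file-name error/postgresql.log.2017-01-01-00 \
  --output text > /var/log/rds/postgresql.log.2017-01-01-00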

Elasticsearch Hadoop

I have set up a Hadoop cluster with 3 DataNodes and 1 NameNode. I have also installed Elasticsearch on one of the DataNodes, but I'm not able to access HDFS using Elasticsearch (the Hadoop cluster and Elasticsearch are working fine independently). Now I want to integrate my Hadoop cluster with Elasticsearch. I found there is a separate plugin for that, but I'm not able to download it (the bin/plugin -i elasticsearch/elasticsearch-repository-hdfs/1.3.0.M3 command is not working; it fails every time I execute it). Can anyone suggest which plugin I should download, the path to place that plugin in, and how to access it using the URL?
Thanks in advance.
I suggest you try to use this repo.
It's Elasticsearch real-time search and analytics natively integrated with Hadoop, and you can follow the documentation provided here to use it.
The repo is provided by Elasticsearch.
Try this:
1) Download the jars from this link
2) Unzip it and place the jars in the plugins folder of Elasticsearch
3) Restart the server and start using it!
The elasticsearch-hadoop library is not a plugin. You need to download or build it and put it on the classpath of the Hadoop/Spark application you will use.
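For example, a sketch of wiring the connector into a Spark job via spark-submit; the jar version, Elasticsearch host, and script name are assumptions:

# put the elasticsearch-hadoop connector jar on the application classpath
# and point it at the Elasticsearch cluster
spark-submit \
  --jars /path/to/elasticsearch-hadoop-<version>.jar \
  --conf spark.es.nodes=es-host \
  --conf spark.es.port=9200 \
  my_job.py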
