NIFI PutHDFS fails to write the contents - apache-nifi

I'm a newbie with NiFi.
I'm trying to get data from a database and put it into Hadoop.
It seems I succeeded in connecting to Hadoop from NiFi using the PutHDFS processor.
After running PutHDFS, a file is created successfully.
But the problem is that the file is empty: no contents.
I tried getting a file from the local NiFi server using GetFile instead, but the result is the same, so the source is not the problem.
I have no idea why NiFi fails to write the contents into the Hadoop file. There is not even an error.
Please help me.

Related

Storing small files in hdfs and archiving them in Nifi Flow

I have an issue with small files and HDFS.
Scenario: I am using NiFi to read messages from a Kafka topic; these are all really small.
Requirement: store these raw messages in HDFS (for replay capability) before doing further processing on them.
I was thinking of running Hadoop Archive (HAR) on them periodically. Is that something I can do through NiFi? The har command seems like a command-line thing rather than something I could execute through NiFi. I would love to know a solution that meets my requirement without bringing down HDFS due to the small files.
Ginil
You can execute a command line inside NiFi with the ExecuteProcess processor:
http://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.ExecuteProcess/
You can also take a look at Kafka Connect HDFS for putting Kafka records into HDFS.
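For example, ExecuteProcess could run the archive command on a schedule. A minimal sketch, with placeholder HDFS paths and archive name:

hadoop archive -archiveName raw-messages.har -p /data/raw/kafka /data/archive

This bundles everything under /data/raw/kafka into /data/archive/raw-messages.har, which keeps the NameNode's file count down while preserving the raw messages for replay.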

Integration of hadoop (specifically HDFS files) with ELK stack

I am trying to integrate Hadoop with the ELK stack.
My use case is: "I have to get data from a file present in an HDFS path and show the contents on a Kibana dashboard."
Hive is not working there, so I can't use Hive.
Are there any other ways to do that?
Does anybody have an article with a step-by-step process?
I have tried to get logs from a Linux location on a Hadoop server through Logstash and Filebeat, but that is also not working.
I'm doing this for some OSINT work; it is quite easy once you can get the content out of HDFS onto a local filesystem. That's done by setting up an HDFS NFS Gateway. Once that's done, use Filebeat and Logstash to import your content into Elasticsearch. After that, just configure your Kibana dashboard for the index you're using.
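A rough sketch of the gateway-and-mount step (the gateway is assumed to run on the NameNode host, and /mnt/hdfs is just a placeholder mount point):

# on the gateway host
hdfs portmap
hdfs nfs3

# on the machine running Filebeat/Logstash
mkdir -p /mnt/hdfs
mount -t nfs -o vers=3,proto=tcp,nolock,noacl namenode-host:/ /mnt/hdfs

Filebeat can then be pointed at paths under /mnt/hdfs like any other local directory.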

nifi putHDFS writes to local filesystem

Challenge
I currently have two Hortonworks clusters, a NiFi cluster and an HDFS cluster, and want to write to HDFS using NiFi.
On the NIFI cluster I use a simple GetFile connected to a PutHDFS.
When pushing a file through this, the PutHDFS terminates in success. However, rather than seeing a file dropped on my HDFS (on the HDFS cluster), I just see a file being dropped onto the local filesystem where I run NiFi.
This confuses me, hence my question:
How can I make sure PutHDFS writes to HDFS rather than to the local filesystem?
Possibly relevant context:
In the PutHDFS I have linked to the hive-site and core-site of the HDFS cluster (I tried updating all server references to the HDFS namenode, but with no effect).
I don't use Kerberos on the HDFS cluster (I do use it on the NiFi cluster).
I did not see anything looking like an error in the NiFi app log (which makes sense, as it successfully writes, just in the wrong place).
Both clusters were newly created on Amazon AWS with Cloudbreak, and opening all nodes to all traffic did not help.
Can you make sure that you are able to move a file from a NiFi node to Hadoop using the command below:
hadoop fs -put
If you are able to move your file using the above command, then you must check the Hadoop config file you are passing to your PutHDFS processor.
Also, check that you don't have any other flow running, to make sure no other flow is processing that file.
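One thing worth checking along those lines (a sketch, with a placeholder namenode hostname): PutHDFS reads Hadoop configuration files (core-site.xml and hdfs-site.xml; hive-site.xml is not used here), and if those resolve fs.defaultFS to the local filesystem (file:///), the processor will write locally and still report success. The core-site.xml handed to PutHDFS should point at the remote cluster's namenode, roughly:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.hdfs-cluster.internal:8020</value>
  </property>
</configuration>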

Spring-xd Stream is writing empty files to my HDFS

So I am following this book, Machine Learning Hands-On for Developers, written by Jason Bell. I got very far in this book until I had to connect my Spring XD streams to Hadoop. I am running Spring XD 1.2.1, and I am running Hadoop (1.2.1 and 2.6.0, I have tried both) on port 9000. In this tutorial we are supposed to take a Twitter stream and pipe it to a file in Hadoop, but when I created and deployed this stream, the file it created was not getting populated with tweets. So now, to make things simpler, I am just trying to get a stream connected to HDFS by creating this stream:
stream create --name ticktock --definition "time | hdfs" --deploy
which should be piping the date to a file at /xd/ticktock/ticktock-0.txt.tmp. However, when I try to use the command
hadoop fs -cat /xd/ticktock/ticktock-0.txt.tmp
it produces nothing, leaving me to assume that no data is reaching it. I did place a tap on this stream and ran it to a local file. In that file it was recording the times correctly, so I know that my stream is doing the correct thing and producing output; it's just not reaching Hadoop for some reason.
It does create the file in Hadoop, so it's not as if Hadoop is completely ignoring the stream; there's just nothing inside the file that it creates.
I did find someone who was having the same problem as me, and they changed their VM networking to NAT or something, but I am not using a VM.
I have tried chmod-ing the folder /xd to 777,
I have made sure that I can ssh to my local machine without a password,
I have made sure that there is a DataNode running in my Hadoop cluster,
and I have made sure that cat works by placing a file that I created into HDFS and then running the cat command on it from both the Spring XD shell and a regular terminal.
I am unfortunately at a loss; could someone help me out with this scenario?
If you need any information about my Hadoop cluster or Spring XD setup, let me know; I am still a newbie with these technologies.
1. You can see the files in the HDFS sink once you destroy the stream.
2. Rollover: even while the stream is alive, once the stored data size exceeds 1G (the default value), Spring XD will roll that 1G of content over to an HDFS file, create a new tmp file, and store the current ticktock values in it.
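A quick way to verify this behaviour (a sketch, assuming the ticktock stream above and the default /xd path):

# in the Spring XD shell: destroying the stream closes the sink and renames the tmp file
stream destroy --name ticktock

# from a regular terminal: the finished file no longer carries the .tmp suffix
hadoop fs -ls /xd/ticktock
hadoop fs -cat /xd/ticktock/ticktock-0.txt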
Thanks
S.Satish
Okay, I fixed it. For some reason I re-read that error message and saw that there were no DataNodes running again. I restarted Hadoop, this time on 2.6.0, ran that test stream for a couple of seconds, and then destroyed it. Sure enough, that did the trick. Thanks Satish Srinivasan, I had no idea the stream had to be destroyed before it could be read.

PIG cannot understand hbase table data

I'm running HBase (0.94.13) on a single node for my academic project. After loading data into HBase tables, I'm trying to run Pig (0.11.1) scripts on the data using HBaseStorage. However, this throws an error saying
IllegalArgumentException: Not a host:port pair: �\00\00\00
Here is the load command I'm using in Pig:
books = LOAD 'hbase://booksdb'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('details:title', '-loadKey true')
        AS (ID:chararray, title:chararray);
I thought this might be because the HBase version bundled with Pig is different from the one my machine has, but I can't seem to make it work without downgrading my HBase. Any help?
It seems you are trying to submit a Pig job remotely.
If so, you'd need to add a few settings to the pig.properties file (or set them inside your script, as sketched after the list):
hbase.zookeeper.quorum=<node>
hadoop.job.ugi=username,groupname
fs.default.name=hdfs://<node>:port
mapred.job.tracker=<node>:port
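The same settings can also be applied from inside the Pig script with SET statements; a sketch with placeholder hosts and ports to replace with your cluster's values:

SET hbase.zookeeper.quorum 'zk-node';
SET hadoop.job.ugi 'username,groupname';
SET fs.default.name 'hdfs://namenode:9000';
SET mapred.job.tracker 'jobtracker:9001';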
