Can any NiFi processor catch HDFS directory changes? - hadoop

Is there any way I can manage adding, deleting, and updating flowfiles in my first HDFS directory after I delete or update them in my second HDFS directory? I mean, I want the file in directory 1 to be changed or deleted appropriately when the file with the same name is changed in directory 2.
I tried using ListHDFS, FetchHDFS, and PutHDFS to add files to the second directory, but I can't manage the update and delete operations.
What can I do?
Should I use Hadoop tools for this, or is it possible to do with NiFi?

Related

HDFS or Hadoop command to sync files or folders from local to HDFS

I have local files which get added daily, so I want to sync these newly added files to HDFS.
I tried the command below, but it always does a complete copy; I want a command which copies only the newly added files:
$ hdfs dfs -cp /home/user/files/* /data/files/*
You can use hsync: https://github.com/alexholmes/hsync
It's Alex Holmes's custom package and is perhaps useful on a dev box, but it could be hard to deploy in a production environment. I am looking for a similar solution, but for now this seems to be the closest. The other option is to write your own script that compares source/target file times and then overwrites only the newer files, as sketched below.
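A minimal sketch of that scripted approach using the Hadoop FileSystem Java API, assuming the /home/user/files and /data/files paths from the question; the NameNode URI and everything else not shown in the question are assumptions. Each local file is copied when it is missing from HDFS or when the local copy is newer:

import java.io.File;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSync {
    public static void main(String[] args) throws Exception {
        File localDir = new File("/home/user/files");   // source directory from the question
        String hdfsDir = "/data/files";                 // target directory from the question

        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

        for (File local : localDir.listFiles()) {
            if (!local.isFile()) {
                continue;
            }
            Path target = new Path(hdfsDir, local.getName());
            boolean copy = !fs.exists(target)
                    || local.lastModified() > fs.getFileStatus(target).getModificationTime();
            if (copy) {
                // Copy the file only when it is new or newer than the HDFS copy.
                fs.copyFromLocalFile(false, true, new Path(local.getAbsolutePath()), target);
                System.out.println("Copied " + local.getName());
            }
        }
        fs.close();
    }
}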

How to save data in HDFS with Spark?

I want to use Spark Streaming to retrieve data from Kafka. Now I want to save my data to a remote HDFS. I know that I have to use the saveAsTextFile function; however, I don't know precisely how to specify the path.
Is it correct if I write this:
myDStream.foreachRDD(frm->{
frm.saveAsTextFile("hdfs://ip_addr:9000//home/hadoop/datanode/myNewFolder");
});
where ip_addr is the IP address of my remote HDFS server,
/home/hadoop/datanode/ is the DataNode HDFS directory created when I installed Hadoop (I don't know if I have to specify this directory), and
myNewFolder is the folder where I want to save my data.
Thanks in advance.
Yassir
The path has to be a directory in HDFS. For example, if you want to save the files inside a folder named myNewFolder under the root path / in HDFS, the path to use would be hdfs://namenode_ip:port/myNewFolder/. On execution of the Spark job, this myNewFolder directory will be created.
The datanode data directory, which is given by dfs.datanode.data.dir in hdfs-site.xml, is used to store the blocks of the files you store in HDFS and should not be referenced as an HDFS directory path.
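For instance, the snippet from the question could be adjusted along these lines (namenode_ip:8020 is a placeholder for your NameNode's RPC address; writing each micro-batch to its own subdirectory is an extra assumption, since saveAsTextFile fails if the output directory already exists):

myDStream.foreachRDD((rdd, time) -> {
    // Save under /myNewFolder in HDFS, not under the DataNode's local data directory.
    // Each micro-batch gets its own subdirectory so the output path does not already exist.
    rdd.saveAsTextFile("hdfs://namenode_ip:8020/myNewFolder/batch-" + time.milliseconds());
});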

How to copy HDFS files from one cluster to another while preserving the modification time

I have to move some HDFS files from my production cluster to the dev cluster. After the move I have to test some operations on those HDFS files based on their modification time, so I need files with different dates in dev.
I tried DistCp, but the modification time gets updated to the current time. I tried DistCp with many of the parameters I found in the distcp version2 guide.
Is there any other way to get the files without changing the modification time? Or can I change the modification time manually after getting the files into HDFS?
Thanks in advance.
Use the -pt flag with the hadoop distcp command. This will preserve the timestamp (modification time) of the files that are copied.
hadoop distcp -pt hdfs://src_cluster/file hdfs://dest_cluster/file
Tested with Hadoop 2.7.3.
Refer to the latest DistCp Guide.
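If you do end up needing to set the modification time manually after copying, the HDFS Java API has FileSystem.setTimes. A small sketch, where the NameNode URI, file path, and timestamp are all placeholders:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetModificationTime {
    public static void main(String[] args) throws Exception {
        // Assumed dev-cluster NameNode address; replace with your own.
        FileSystem fs = FileSystem.get(URI.create("hdfs://dest_cluster:8020"), new Configuration());

        Path file = new Path("/data/myfile.txt");   // hypothetical file on the dev cluster
        long mtime = 1451606400000L;                // desired modification time, in ms since the epoch

        // Set the modification time; passing -1 leaves the access time unchanged.
        fs.setTimes(file, mtime, -1);
        fs.close();
    }
}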

Is the pig.temp.dir property mandatory?

Pig execution mode = local.
In that case, do we need to set the pig.temp.dir=/temp property, and does this /temp folder need to be present in HDFS?
Note:
Storing Intermediate Results
Pig stores the intermediate data generated between MapReduce jobs in a temporary location on HDFS. This location must already exist on HDFS prior to use. This location can be configured using the pig.temp.dir property. The property's default value is "/tmp", which is the same as the hardcoded location in Pig 0.7.0 and earlier versions.
As per http://pig.apache.org/docs/r0.14.0/start.html#req, under the "Storing Intermediate Results" heading.
You'll still need to have some temp directory, but it needs to be present on your local file system. In local mode, Pig (and MapReduce) does all operations on the local filesystem by default.
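As a small illustration of the local-mode case, here is a sketch that embeds Pig via PigServer with pig.temp.dir pointed at a local directory; the temp path, input, and output names are all hypothetical:

import java.util.Properties;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class LocalPigTempDir {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // In local mode this should be a directory on the local filesystem, not HDFS.
        props.setProperty("pig.temp.dir", "/tmp/pig-temp");   // hypothetical local path

        PigServer pig = new PigServer(ExecType.LOCAL, props);
        pig.registerQuery("A = LOAD 'input.txt' AS (line:chararray);");   // hypothetical input
        pig.store("A", "output");                                         // hypothetical output dir
        pig.shutdown();
    }
}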

Restarting datanodes after reformatting the namenode in a Hadoop cluster

Using the basic configuration provided in the official Hadoop setup documentation, I can run a Hadoop cluster and submit MapReduce jobs.
The problem is that whenever I stop all the daemons and reformat the namenode, the datanode does not start when I subsequently start all the daemons again.
I've been looking around for a solution, and it appears this is because formatting only formats the namenode, and the disk space used by the datanode needs to be erased as well.
How can I do this? What changes do I need to make to my config files? After those changes are made, how do I delete the correct files when formatting the namenode again?
Specifically, check whether you have provided the configuration for the two parameters below, which can be defined in hdfs-site.xml:
dfs.name.dir: Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.
dfs.data.dir: Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
If you have provided specific directory locations for these two parameters, then you need to delete those directories as well before formatting the namenode.
If you have not provided them, by default they are created under the location given by hadoop.tmp.dir, which can be configured in core-site.xml. Again, if you have specified this parameter, then you need to remove that directory before formatting the namenode.
If you have not defined it either, by default everything is created under /tmp/hadoop-$username (for the hadoop user), so you need to remove that directory.
Summary: you have to delete the name node and data node directories before formatting the namenode. By default they are created under /tmp/.
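If you want to double-check which directories are actually in effect before deleting anything, here is a small sketch using the Hadoop Configuration API; the /etc/hadoop/conf location is an assumption, and dfs.namenode.name.dir / dfs.datanode.data.dir are the current names of dfs.name.dir / dfs.data.dir:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class PrintStorageDirs {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Assumed config location; adjust to wherever your cluster keeps its *-site.xml files.
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));

        // hadoop.tmp.dir is the fallback root for both storage directories below.
        System.out.println("hadoop.tmp.dir     = " + conf.get("hadoop.tmp.dir"));
        System.out.println("namenode name dirs = " + conf.get("dfs.namenode.name.dir"));
        System.out.println("datanode data dirs = " + conf.get("dfs.datanode.data.dir"));
    }
}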
