nifi - fetchsftp - subfolders - ftp

I am using NiFi to transfer files between FTP locations.
I have to transfer files from an SFTP location to an FTP directory.
I have the folder structure below in the remote SFTP location.
/rootfolder/
    /subfolder1
    /subfolder2
    /subfolder3
I need to download the respective files from each subfolder to a local directory that has a similar structure.
My workflow includes
ListSFTP -> FetchSFTP (3) -> PutFTP
In ListSFTP
Remote path: /rootfolder
In FetchSFTP1
Remote path: /rootfolder/subfolder1
In FetchSFTP2
Remote path: /rootfolder/subfolder2
In FetchSFTP3
Remote path: /rootfolder/subfolder3
But this does not seem to work. Can someone help me with how I can transfer files from remote SFTP sub-folder(s)?
Thanks,
Aadil

You should be able to set ListSFTP to do a recursive search; each flow file coming out of ListSFTP will then have attributes for "path" and "filename".
Let's say you had one file under each directory in your example; you would get three flow files like the following:
ff 1
path = /rootfolder/subfolder1
filename = file1
ff 2
path = /rootfolder/subfolder2
filename = file2
ff 3
path = /rootfolder/subfolder3
filename = file3
You should only need one FetchSFTP processor with Remote Filename set to ${path}/${filename}.
If you have the same structure on your destination system, just set PutFTP's Remote Path to ${path}.
If you have a slightly different structure, use UpdateAttribute to modify "path" right before PutFTP.
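To make this concrete, here is a minimal sketch of the three processor configurations (exact property names can vary slightly between NiFi versions, so treat this as an outline rather than a copy-paste setup):
ListSFTP
Remote Path: /rootfolder
Search Recursively: true
FetchSFTP
Remote Filename: ${path}/${filename}
PutFTP
Remote Path: ${path}
With the recursive listing enabled, this single FetchSFTP/PutFTP pair replaces the three per-subfolder FetchSFTP processors in the original flow.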

Related

Flume-ng: source path and type for copying log file from local to HDFS

I am trying to copy some log files from local disk to HDFS using flume-ng. The source is /home/cloudera/flume/weblogs/ and the sink is hdfs://localhost:8020/flume/dump/. A cron job copies the logs from the Tomcat server to /home/cloudera/flume/weblogs/, and I want the log files to be copied to HDFS as they become available in /home/cloudera/flume/weblogs/ using flume-ng. Below is the conf file I created:
agent1.sources= local
agent1.channels= MemChannel
agent1.sinks=HDFS
agent1.sources.local.type = ???
agent1.sources.local.channels=MemChannel
agent1.sinks.HDFS.channel=MemChannel
agent1.sinks.HDFS.type=hdfs
agent1.sinks.HDFS.hdfs.path=hdfs://localhost:8020/flume/dump/
agent1.sinks.HDFS.hdfs.fileType=DataStream
agent1.sinks.HDFS.hdfs.writeformat=Text
agent1.sinks.HDFS.hdfs.batchSize=1000
agent1.sinks.HDFS.hdfs.rollSize=0
agent1.sinks.HDFS.hdfs.rollCount=10000
agent1.sinks.HDFS.hdfs.rollInterval=600
agent1.channels.MemChannel.type=memory
agent1.channels.MemChannel.capacity=10000
agent1.channels.MemChannel.transactionCapacity=100
I am not able to understand:
1) what will be the value of agent1.sources.local.type = ???
2) where to mention the source path /home/cloudera/flume/weblogs/ in the above conf file ?
3) Is there anything I am missing in the above conf file?
Please let me know about these.
You can use either:
An Exec Source with a command (e.g. cat or tail on GNU/Linux) run against your files,
Or a Spooling Directory Source to read all the files in a directory.
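For the Spooling Directory Source, a minimal sketch of the lines the question is missing might look like this (spooldir is the standard source type name; the spoolDir value is the directory from the question):
agent1.sources.local.type = spooldir
agent1.sources.local.spoolDir = /home/cloudera/flume/weblogs/
Note that the spooling directory source expects files to be complete and immutable once they land in the directory, which matches the cron-copy setup described in the question.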

Move files to HDFS using Spring XD

How to move the files from local disk to HDFS using Spring XD.
I do not want just the contents, but to move the whole file for archival, so that the file is saved with its original name and content.
Here is what I have tried:
stream create --name fileapple --definition "file --mode=ref --dir=/Users/dev/code/open/learnspringxd/input --pattern=apple*.txt | WHATTODOHERE"
I can see now that with mode=ref the file names with their full paths are made available; how do I move them to HDFS?
You might want to check this, which imports data from files to HDFS as a batch job, and see if that fits your requirement. You can also check file | hdfs as a stream if that works for you.
An example like the one below will load the file from the data folder to HDFS and save it into date-based folders (if there are multiple records with different dates), partitioned by the record column named LastModified; the data file is a JSON file with one record per line.
file --mode=ref --dir=/Users/dev/code/open/learnspringxd/input --pattern=apple*.txt | hdfs --directory=/user/file_folder --partitionPath=path(dateFormat('yyyy-MM-dd',#jsonPath(payload,'$.LastModified'),'yyyy-MM-dd')) --fileName=output_file_name_prefix --fsUri=hdfs://HDFShostname.company.com:8020 --idleTimeout=30000

Talend - Get all files (in several directories) from FTP

There are many FTP components to extract files. What should I use if I have a root Directory, with some sub-directories and several files in all of them, and I want to extract all files?
For example:
rootDirectory
- file1.txt
- file2.txt
- file3.txt
- subDirectory1
  - file4.txt
  - file5.txt
- subDirectory2
  - file6.txt
- subDirectory2
  - file7.txt
  - file8.txt
How can I get files 1 to 8, just by giving the component the path to the rootDirectory?
I've not used the FTP components yet but typically you'd use a tFileList connected to a tFileCopy to move files around. So in your case I would expect you should use a tFTPFileList connected to your FTP server with a filemask of "*.txt" and then connect that to a tFTPGET. Set this component to the local directory of your choice, a remote directory of "/" and then use ((String)globalMap.get("tFTPFileList_1_CURRENT_FILEPATH")) in your Filemask.
This approach seems to be the one I've just found now in the Talend documentation although it might require logging in (free account sign up and probably worth doing if you're using Talend much at all).
It's probably equally fair to say that, unless you're planning on doing something more complicated with the data than just grabbing it, most FTP tools should comfortably be able to GET everything from an FTP server, and Talend might not be the best approach here.
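As one illustration of that last point (the host, credentials and local path here are hypothetical placeholders), a command-line client such as lftp can mirror a whole remote tree, sub-directories included, in a single command:
lftp -u user,password ftp.example.com -e "mirror /rootDirectory /local/target; quit"
This is only a sketch of the non-Talend route; the tFTPFileList/tFTPGet approach above keeps everything inside the Talend job itself.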

how to prevent hadoop corrupted .gz file

I'm using the following simple code to upload files to HDFS.
FileSystem hdfs = FileSystem.get(config);
hdfs.copyFromLocalFile(src, dst);
The files are generated by webserver java component and rotated and closed by logback in .gz format. I've noticed that sometimes the .gz file is corrupted.
> gunzip logfile.log_2013_02_20_07.close.gz
gzip: logfile.log_2013_02_20_07.close.gz: unexpected end of file
But the following command does show me the content of the file
> hadoop fs -text /input/2013/02/20/logfile.log_2013_02_20_07.close.gz
The impact of having such files is quite disastrous, since the aggregation for the whole day fails, and several slave nodes also get marked as blacklisted in such cases.
What can I do in such a case?
Can Hadoop's copyFromLocalFile() utility corrupt the file?
Has anyone met a similar problem?
It shouldn't; this error is normally associated with gzip files which haven't been closed out when originally written to local disk, or which are being copied to HDFS before they have finished being written.
You should be able to check by running an md5sum on the original file and that in HDFS - if they match then the original file is corrupt:
hadoop fs -cat /input/2013/02/20/logfile.log_2013_02_20_07.close.gz | md5sum
md5sum /path/to/local/logfile.log_2013_02_20_07.close.gz
If they don't match, then check the timestamps on the two files - the one in HDFS should have been modified after the local file system one.
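If the problem turns out to be files being picked up before logback has finished closing them, one defensive option is to verify that the local .gz file decompresses cleanly before uploading it. This is a minimal sketch, not the original uploader code: the class name, arguments and the pre-check itself are assumptions layered on top of the copyFromLocalFile() call from the question.
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SafeGzUpload {
    // Returns true only if the whole .gz stream decompresses to EOF,
    // i.e. the producer (logback) has fully written and closed the file.
    static boolean isCompleteGzip(String localPath) {
        byte[] buf = new byte[8192];
        try (GZIPInputStream in = new GZIPInputStream(new FileInputStream(localPath))) {
            while (in.read(buf) != -1) {
                // drain the stream; a truncated file throws "unexpected end of file" here
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        String local = args[0];      // e.g. a rotated logfile.log_....close.gz
        String remoteDir = args[1];  // e.g. an /input/yyyy/MM/dd/ directory in HDFS
        if (!isCompleteGzip(local)) {
            System.err.println("Skipping incomplete gzip file: " + local);
            return;
        }
        FileSystem hdfs = FileSystem.get(new Configuration());
        hdfs.copyFromLocalFile(new Path(local), new Path(remoteDir)); // same call as in the question
    }
}
The md5sum comparison above remains the quickest way to confirm whether a specific file was already corrupt locally before the copy.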

Copying file security permissions

I'm copying a file from folder A to folder B and then trying to copy the file permissions. Here are the basic steps I'm using:
1. CopyFile(source, target)
2. GetNamedSecurityInfo(source, GROUP_SECURITY_INFORMATION | DACL_SECURITY_INFORMATION)
3. Print source SD using ConvertSecurityDescriptorToStringSecurityDescriptor
4. SetNamedSecurityInfo(target, GROUP_SECURITY_INFORMATION | DACL_SECURITY_INFORMATION)
5. GetNamedSecurityInfo(target, GROUP_SECURITY_INFORMATION | DACL_SECURITY_INFORMATION)
6. Print target SD using ConvertSecurityDescriptorToStringSecurityDescriptor
At #3 I get this SD:
G:S-1-5-21-1454471165-1482476501-839522115-513D:AI(A;ID;0x1200a9;;;BU)(A;ID;0x1301bf;;;PU)(A;ID;FA;;;BA)(A;ID;FA;;;SY)(A;ID;FA;;;S-1-5-21-1454471165-1482476501-839522115-1004)
At #6 I get
G:S-1-5-21-1454471165-1482476501-839522115-513D:AI(A;ID;0x1301bf;;;PU)(A;ID;FA;;;BA)(A;ID;FA;;;SY)
The call to SetNamedSecurityInfo returns ERROR_SUCCESS, yet the source and target files do not end up with the same SDs. Why is that? What am I doing wrong here?
SHFileOperation can copy files together with their security attributes, but from your other question I see you're concerned that this won't work within a service. Maybe the following newsgroup discussions will provide some useful information for you:
Copy NTFS files with security
How to copy a disk file or directory with ALL attributes?
Copying files with security attributes
Robocopy from the server tools kit (http://www.microsoft.com/downloads/details.aspx?familyid=9d467a69-57ff-4ae7-96ee-b18c4790cffd&displaylang=en) will copy all NTFS settings and ACLs; it's also more robust and reliable than copy/xcopy.
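As an illustration (the folder paths are hypothetical), copying a tree along with its NTFS security information using Robocopy looks roughly like this:
robocopy C:\source\folderA D:\target\folderB /E /COPY:DATS
/E copies sub-directories (including empty ones), and /COPY:DATS copies Data, Attributes, Timestamps and Security (the NTFS ACLs); /SEC is shorthand for /COPY:DATS.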
