Talend - Get all files (in several directories) from FTP - ftp

There are many FTP components to extract files. What should I use if I have a root Directory, with some sub-directories and several files in all of them, and I want to extract all files?
For example:
rootDirectory
- file1.txt
- file2.txt
- file3.txt
- subDirectory1
- file4.txt
- file5.txt
- subDirectory2
- file6.txt
- subDirectory2
- file7.txt
- file8.txt
How can I get files 1 to 8, just by giving the component the path to the rootDirectory?

I've not used the FTP components yet but typically you'd use a tFileList connected to a tFileCopy to move files around. So in your case I would expect you should use a tFTPFileList connected to your FTP server with a filemask of "*.txt" and then connect that to a tFTPGET. Set this component to the local directory of your choice, a remote directory of "/" and then use ((String)globalMap.get("tFTPFileList_1_CURRENT_FILEPATH")) in your Filemask.
This approach seems to be the one I've just found now in the Talend documentation although it might require logging in (free account sign up and probably worth doing if you're using Talend much at all).
It's probably equally fair to say that unless you're planning on doing something complicated with the data rather than just grabbing it most FTP tools should comfortably be able to GET everything from an FTP server and Talend might not be the best approach here.

Related

Merge fastq.gz files with same name in different localizations in Google-Cloud

I would like to merge several fastq.gz files with the same name in different folders in the Google-Cloud. I have a total of 15 patients. Each patient has paired-end data "R1" and "R2". Each R1 and R2 are divided into 4 files. The size of each file is approximately 28 GB.
My goal is to merge the 4 files to obtain the complete fastq.gz R1 and R2 files for each patient.
I have never worked with Google-Cloud before.
Here is how the folders and the files are in the bucket (example with 2 patients):
gs://bucketID
/folder1
/folder001
Patient1_R1.fastq.gz
Patient1_R2.fastq.gz
/folder002
Patient2_R1.fastq.gz
Patient2_R2.fastq.gz
etc.
/folder2
/folder003
Patient1_R1.fastq.gz
Patient1_R2.fastq.gz
/folder004
Patient2_R1.fastq.gz
Patient2_R2.fastq.gz
etc.
/folder3
/folder005
Patient1_R1.fastq.gz
Patient1_R2.fastq.gz
/folder006
Patient2_R1.fastq.gz
Patient2_R2.fastq.gz
etc.
/folder4
/folder007
Patient1_R1.fastq.gz
Patient1_R2.fastq.gz
/folder008
Patient2_R1.fastq.gz
Patient2_R2.fastq.gz
etc.
I want to make a script that targets fastq.gz files with the same name in different folders, then merge them. However, I have no idea how to do this on Google-Cloud.
Here is the same example with colors (I want to concatenate files with the same color):
Example with colors
Here's how I see the bash script:
bucket="bucketID"
dir1=$bucket/"folder1"
dir2=$bucket/"folder2"
dir3=$bucket/"folder3"
dir4=$bucket/"folder4"
destdir=$bucket/"destdir"
participants = (Patient1
Patient2
)
for i in ${participants[*]};
do
zcat dir1/.../$i/_R1.fastq.gz dir2/.../$i/_R1.fastq.gz dir3/.../$i/_R1.fastq.gz dir4/.../$i/_R1.fastq.gz | gzip >$destdir/"merged_"$i/_R1.fastq.gz
zcat dir1/.../$i/_R2.fastq.gz dir2/.../$i/_R2.fastq.gz dir3/.../$i/_R2.fastq.gz dir4/.../$i/_R2.fastq.gz | gzip >$destdir/"merged_"$i/_R2.fastq.gz
done
Should I use "gsutil compose" instead to merge?
At the end, I would like to have only two files R1 and R2 for each patient: merged_patient#_R1.fastq.gz and merged_patient#_R2.fastq.gz.
In the example I gave above, it would give 4 files:
merged_Patient1_R1.fastq.gz
merged_Patient1_R2.fastq.gz
merged_Patient2_R1.fastq.gz
merged_Patient2_R2.fastq.gz
Thank you!
I would recommend you to use the following command in order to concatenate your files:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
You can check the documentation in this link.
I've tried to do a simple bash script by using the "gsutil compose" command with fastq.gz files, and it was working fine for me.
The compose command creates a new object whose content is the concatenation of a given sequence of source objects under the same bucket.
Hope this helps!
Ok I found the solution with gsutil compose :
declare -a participantsArray=("Patient1"
"Patient2"
)
bucket="bucketID"
dir1=$bucket/"folder1"
dir2=$bucket/"folder2"
dir3=$bucket/"folder3"
dir4=$bucket/"folder4"
destdir=$bucket/"destdir"
for i in ${participantsArray[#]};
do
fileR1="${i}_R1.fastq.gz"
fileR2="${i}_R2.fastq.gz"
gsutil compose "${dir1}/*/${fileR1}" "${dir2}/*/${fileR1}" "${dir3}/*/${fileR1}" "${dir4}/*/${fileR1}" "${destdir}/merged_${fileR1}"
gsutil compose "${dir1}/*/${fileR2}" "${dir2}/*/${fileR2}" "${dir3}/*/${fileR2}" "${dir4}/*/${fileR2}" "${destdir}/merged_${fileR2}"
done
As you said the solution was not difficult to find.
Thank you again!

s3 awk bash pipeline

Following this question Splitting out a large file.
I would like to pipe calls from an Amazon s3:// bucket containing large gzipped files, process them with an awk command.
Sample file to process
...
{"captureTime": "1534303617.738","ua": "..."}
...
Script to optimize
aws s3 cp s3://path/to/file.gz - \
| gzip -d \
| awk -F'"' '{date=strftime("%Y%m%d%H",$4); print > "splitted."date }'
gzip splitted.*
# make some visual checks here before copying to S3
aws s3 cp splitted.*.gz s3://path/to/splitted/
Do you think I can wrap everything in the same pipeline to avoid writing files locally?
I can use Using gzip to compress files to transfer with aws command to be able to gzip and copy on the fly, but gzipping inside awk would be great.
Thank you.
Took me a bit to understand that your pipeline creates one "splitted.date file for each line in the source file. Since shell pipelines operate on byte streams and not files, while S3 operates on files (objects), you must turn your byte stream into a set of files on local storage before sending them back to S3. So, a pipeline by itself won't suffice.
But I'll ask: what's the larger purpose you trying to accomplish?
You're on the path to generating lots of S3 objects, one for each line of your "large gzipped files". Is this using S3 as a key value store? I'll ask if this is the best design for the goal of your effort? In other words, is S3 the best repository for this information or is here some other store (DynamoDB, or another NoSQL) that would be a better solution?
All the best
Two possible optimizations :
On large and multiple files it will help to use all the cores to gzip the files, use xargs, pigz or gnu parallel
Gzip with all cores
parallelize S3 upload :
https://github.com/aws-samples/aws-training-demo/tree/master/course/architecting/s3_parallel_upload

How can I compare the file sizes match between duplicate directories?

I need to compare two directories to validate a backup.
Say my directory looks like the following:
Filename Filesize Filename Filesize
user#main_server:~/mydir/ user#backup_server:~/mydir/
file1000.txt 4182410737 file1000.txt 4182410737
file1001.txt 8241410737 - <-- missing on backup_server!
... ...
file9999.txt 2410418737 file9999.txt 1111111111 <-- size != main_server
Is there a quick one liner that would get me close to output like:
Invalid Backup Files:
file1001.txt
file9999.txt
(with the goal to instruct the backup script to refetch these files)
I've tried to get variations of the following to no avail.
[main_server] $ rsync -n ~/mydir/ user#backup_server:~/mydir
I cannot do rsync to backup the directories itself because it takes way too long (8-24hrs). Instead I run multiple threads of scp to fetch files in batches. This completes regularly <1hr. However, occasionally I find a few files that were somehow missed (perhaps dropped connection).
Speed is a priority, so file sizes should be sufficient. But I'm open to including a checksum, provided it doesn't slow the process down like I find with rsync.
Here's my test process:
# Generate Large Files (1GB)
for i in {1..100}; do head -c 1073741824 </dev/urandom >foo-$i ; done
# SCP them from src to dest
for i in {1..100}; do ( scp ~/mydir/foo-$i user#backup_server:~/mydir/ & ) ; sleep 0.1 ; done
# Confirm destination has everything from source
# This is the point of the question. I've tried:
rsync -Sa ~/mydir/ user#backup_server:~/mydir
# Way too slow
What do you recommend?
By default, rsync uses the quick check method which only transfers files that differ in size or last-modified time. As you report that the sizes are unchanged, that would seem to indicate that the timestamps differ. Two options to handlel this are:
Use -p to preserve timestamps when transferring files.
Use --size-only to ignore timestamps and transfer only files that differ in size.

file's date changes after zip in and out again, according to XCOPY

So, here's the problem: I have files which are regular files, and they are put into a ZIP file (see below for details on ZIP). Then I unzip them (see below for details on the tool used), and the files are restored. The date of the file is restored, as in standard in the ZIP/UNZIP tools used. When querying using DIR, or in Windows Explorer, the files involved have the same date as they had, before being handled by the ZIP/UNZIP process.
So, all OK.
But then, I'm using the XCOPY /D command, to further manipulate different copies of those files on the disk ... and, XCOPY says : one file is newer than the other one. Given the fact the date, hour, up until minutes is the same .. the difference would be regarding a smaller entity, like seconds ?
All involved disks have NTFS file system.
Example:
C:\my>dir C:\windows\Background_mycomputer.cmd C:\my\directory\Background_mycomputer.cmd
Volume in drive C is mycomputerC
Volume Serial Number is 1234-5678
Directory of C:\windows
31/12/2014 19:50 51 Background_mycomputer.cmd
1 File(s) 51 bytes
Directory of C:\my\directory
31/12/2014 19:50 51 Background_mycomputer.cmd
1 File(s) 51 bytes
0 Dir(s) 33.655.316.480 bytes free
C:\my>xcopy C:\windows\Background_mycomputer.cmd C:\my\directory\Background_mycomputer.cmd /D
Overwrite C:\my\directory\Background_mycomputer.cmd (Yes/No/All)? y
C:\windows\Background_mycomputer.cmd
1 File(s) copied
C:\my>xcopy C:\my\directory\Background_mycomputer.cmd C:\windows\Background_mycomputer.cmd /D
0 File(s) copied
C:\my>xcopy C:\windows\Background_mycomputer.cmd C:\my\directory\Background_mycomputer.cmd /D
0 File(s) copied
C:\my>unzip -v
UnZip 6.00 of 20 April 2009, by Info-ZIP. Maintained by C. Spieler. Send
bug reports using http://www.info-zip.org/zip-bug.html; see README for details.
Latest sources and executables are at ftp://ftp.info-zip.org/pub/infozip/ ;
see ftp://ftp.info-zip.org/pub/infozip/UnZip.html for other sites.
Compiled with Microsoft C 13.10 (Visual C++ 7.1) for
Windows 9x / Windows NT/2K/XP/2K3 (32-bit) on Apr 20 2009.
UnZip special compilation options:
ASM_CRC
COPYRIGHT_CLEAN (PKZIP 0.9x unreducing method not supported)
NTSD_EAS
SET_DIR_ATTRIB
TIMESTAMP
UNIXBACKUP
USE_EF_UT_TIME
USE_UNSHRINK (PKZIP/Zip 1.x unshrinking method supported)
USE_DEFLATE64 (PKZIP 4.x Deflate64(tm) supported)
UNICODE_SUPPORT [wide-chars] (handle UTF-8 paths)
MBCS-support (multibyte character support, MB_CUR_MAX = 1)
LARGE_FILE_SUPPORT (large files over 2 GiB supported)
ZIP64_SUPPORT (archives using Zip64 for large files supported)
USE_BZIP2 (PKZIP 4.6+, using bzip2 lib version 1.0.5, 10-Dec-2007)
VMS_TEXT_CONV
[decryption, version 2.11 of 05 Jan 2007]
UnZip and ZipInfo environment options:
UNZIP: [none]
UNZIPOPT: [none]
ZIPINFO: [none]
ZIPINFOOPT: [none]
C:\my>ver
Microsoft Windows [Version 6.1.7601]
C:\my>zip -?
Copyright (c) 1990-2006 Info-ZIP - Type 'zip "-L"' for software license.
Zip 2.32 (June 19th 2006). Usage:
zip [-options] [-b path] [-t mmddyyyy] [-n suffixes] [zipfile list] [-xi list]
The default action is to add or replace zipfile entries from list, which
can include the special name - to compress standard input.
If zipfile and list are omitted, zip compresses stdin to stdout.
-f freshen: only changed files -u update: only changed or new files
-d delete entries in zipfile -m move into zipfile (delete files)
-r recurse into directories -j junk (don't record) directory names
-0 store only -l convert LF to CR LF (-ll CR LF to LF)
-1 compress faster -9 compress better
-q quiet operation -v verbose operation/print version info
-c add one-line comments -z add zipfile comment
-# read names from stdin -o make zipfile as old as latest entry
-x exclude the following names -i include only the following names
-F fix zipfile (-FF try harder) -D do not add directory entries
-A adjust self-extracting exe -J junk zipfile prefix (unzipsfx)
-T test zipfile integrity -X eXclude eXtra file attributes
-! use privileges (if granted) to obtain all aspects of WinNT security
-R PKZIP recursion (see manual)
-$ include volume label -S include system and hidden files
-e encrypt -n don't compress these suffixes
C:\my>
Question: I do not want XCOPY to make updates where I know they are invalid cause the time format is doing something wrong. How do I prevent that ?
From how I see, there's different things involved, being XCOPY, very specific ZIP and UNZIP, and NTFS file system. Which one is doing something wrong ?
I must stress that apart from ZIP and UNZIP, there are no other changes done to the file, like changing 1 file, then making a change to another one, in less than 60 seconds time.
At moment of test, the time shown was NOT the current time, and not close to it either. No file is adjusting to the current time, the times refer to last changes of the file in question, which may be any time in the past. In this case, it's one day later, but it can be anything.
I noticed the peculiar behavior Raymond Chen describes when writing a Powershell script (GitHub link) to freshen a zip archive using the System.IO.Compression and System.IO.Compression.FileSystem libraries.
Interestingly, Zip archives can store multiple copies of the same file with identical metadata (name, relative path, modification dates). Extracting the second copy of the file will fail in Windows Explorer because the file already exists.
When trying to prevent re-zipping a file was already archived, I checked the relative path and date, and noticed that there was a discrepancy of up to two seconds in the LastWriteTime. This workaround compensates for the loss of precision:
$AlreadyArchivedFile = ($WriteArchive.Entries | Where-Object {#zip will store multiple copies of the exact same file - prevent this by checking if already archived.
(($_.FullName -eq $RelativePath) -and ($_.Length -eq $File.Length) ) -and
([math]::Abs(($_.LastWriteTime.UtcDateTime - $File.LastWriteTimeUtc).Seconds) -le 2) #ZipFileExtensions timestamps are only precise within 2 seconds.
})
Also, the IsDaylightSavingTime flag is not stored in the Zip archive. As a result I was surprised when extracted files became an hour newer than the original archived file. I tried this several times and saw the extracted file's timestamp incremented by an hour every time it was compressed and extracted.
Here's a very ugly workaround that decreases the archived file time by one hour to make the original source file and extracted file timestamps consistent:
If($File.LastWriteTime.IsDaylightSavingTime() -and $ArchivedFile){#HACK: fix for buggy date - adds an hour inside archive when the zipped file was created during PDT (files created during PST are not affected).
$entry = $WriteArchive.GetEntry($RelativePath)
$entry.LastWriteTime = ($File.LastWriteTime.ToLocalTime() - (New-TimeSpan -Hours 1))
}
There's probably a better way to handle this. Unfortunately I'm not aware of any way to store a Daylight Savings indicator for a file in a .Zip archive, and that information is lost.

Copying file security permissions

I'm copying a file from folder A to folder B and then trying to copy the file permissions. Here are the basic steps I'm using:
CopyFile(source, target)
GetNamedSecurityInfo(source, GROUP_SECURITY_INFORMATION | DACL_SECURITY_INFORMATION)
Print source SD using ConvertSecurityDescriptorToStringSecurityDescriptor
SetNamedSecurityInfo(target, GROUP_SECURITY_INFORMATION | DACL_SECURITY_INFORMATION)
GetNamedSecurityInfo(target, GROUP_SECURITY_INFORMATION | DACL_SECURITY_INFORMATION)
Print target SD using ConvertSecurityDescriptorToStringSecurityDescriptor
At #3 I get this SD:
G:S-1-5-21-1454471165-1482476501-839522115-513D:AI(A;ID;0x1200a9;;;BU)(A;ID;0x1301bf;;;PU)(A;ID;FA;;;BA)(A;ID;FA;;;SY)(A;ID;FA;;;S-1-5-21-1454471165-1482476501-839522115-1004)
At #6 I get
G:S-1-5-21-1454471165-1482476501-839522115-513D:AI(A;ID;0x1301bf;;;PU)(A;ID;FA;;;BA)(A;ID;FA;;;SY)
The call to SetNamedSecurityInfo returns ERROR_SUCCESS, yet the results are the source and target file do not have the same SDs. Why is that? What am I doing wrong here?
SHFileOperation can copy files together with their security attributes, but from your other question I see you're concerned that this won't work within a service. Maybe the following newsgroup discussions will provide some useful information for you:
Copy NTFS files with security
How to copy a disk file or directory with ALL attributes?
Copying files with security attributes
Robocopy from the server tools kit http://www.microsoft.com/downloads/details.aspx?familyid=9d467a69-57ff-4ae7-96ee-b18c4790cffd&displaylang=en
Will copy all NTFS settigs and ACLs, it's also more robust and reliable than copy/xcopy

Resources