Merge fastq.gz files with the same name in different locations in Google Cloud - bash

I would like to merge several fastq.gz files that have the same name but sit in different folders in Google Cloud Storage. I have a total of 15 patients. Each patient has paired-end data "R1" and "R2", and each R1 and R2 is split into 4 files. The size of each file is approximately 28 GB.
My goal is to merge the 4 files to obtain the complete fastq.gz R1 and R2 files for each patient.
I have never worked with Google Cloud before.
Here is how the folders and files are laid out in the bucket (example with 2 patients):
gs://bucketID
    /folder1
        /folder001
            Patient1_R1.fastq.gz
            Patient1_R2.fastq.gz
        /folder002
            Patient2_R1.fastq.gz
            Patient2_R2.fastq.gz
        etc.
    /folder2
        /folder003
            Patient1_R1.fastq.gz
            Patient1_R2.fastq.gz
        /folder004
            Patient2_R1.fastq.gz
            Patient2_R2.fastq.gz
        etc.
    /folder3
        /folder005
            Patient1_R1.fastq.gz
            Patient1_R2.fastq.gz
        /folder006
            Patient2_R1.fastq.gz
            Patient2_R2.fastq.gz
        etc.
    /folder4
        /folder007
            Patient1_R1.fastq.gz
            Patient1_R2.fastq.gz
        /folder008
            Patient2_R1.fastq.gz
            Patient2_R2.fastq.gz
        etc.
I want to make a script that targets the fastq.gz files with the same name in the different folders and then merges them. However, I have no idea how to do this on Google Cloud.
Here is the same example with colors (I want to concatenate files with the same color):
[image: the same layout with the files to be merged shown in the same color]
Here's how I see the bash script:
bucket="bucketID"
dir1=$bucket/"folder1"
dir2=$bucket/"folder2"
dir3=$bucket/"folder3"
dir4=$bucket/"folder4"
destdir=$bucket/"destdir"
participants = (Patient1
Patient2
)
for i in ${participants[*]};
do
zcat dir1/.../$i/_R1.fastq.gz dir2/.../$i/_R1.fastq.gz dir3/.../$i/_R1.fastq.gz dir4/.../$i/_R1.fastq.gz | gzip >$destdir/"merged_"$i/_R1.fastq.gz
zcat dir1/.../$i/_R2.fastq.gz dir2/.../$i/_R2.fastq.gz dir3/.../$i/_R2.fastq.gz dir4/.../$i/_R2.fastq.gz | gzip >$destdir/"merged_"$i/_R2.fastq.gz
done
Should I use "gsutil compose" instead to merge?
In the end, I would like to have only two files, R1 and R2, for each patient: merged_patient#_R1.fastq.gz and merged_patient#_R2.fastq.gz.
In the example I gave above, it would give 4 files:
merged_Patient1_R1.fastq.gz
merged_Patient1_R2.fastq.gz
merged_Patient2_R1.fastq.gz
merged_Patient2_R2.fastq.gz
Thank you!

I would recommend using the following command to concatenate your files:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
You can check the documentation for the compose command for details.
I tried a simple bash script using the "gsutil compose" command with fastq.gz files, and it worked fine for me.
The compose command creates a new object whose content is the concatenation of a given sequence of source objects under the same bucket.
Hope this helps!
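For illustration, here is what that could look like for one patient in the layout from the question (a minimal sketch; the destination folder destdir is taken from the question's draft script). Because concatenated gzip streams form a valid gzip stream, the composed object is still a readable fastq.gz:
# compose the four R1 parts of Patient1 server-side into one object, no download needed
gsutil compose \
  gs://bucketID/folder1/folder001/Patient1_R1.fastq.gz \
  gs://bucketID/folder2/folder003/Patient1_R1.fastq.gz \
  gs://bucketID/folder3/folder005/Patient1_R1.fastq.gz \
  gs://bucketID/folder4/folder007/Patient1_R1.fastq.gz \
  gs://bucketID/destdir/merged_Patient1_R1.fastq.gz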

OK, I found the solution with gsutil compose:
declare -a participantsArray=("Patient1"
"Patient2"
)
bucket="gs://bucketID"
dir1="$bucket/folder1"
dir2="$bucket/folder2"
dir3="$bucket/folder3"
dir4="$bucket/folder4"
destdir="$bucket/destdir"
for i in "${participantsArray[@]}"
do
fileR1="${i}_R1.fastq.gz"
fileR2="${i}_R2.fastq.gz"
gsutil compose "${dir1}/*/${fileR1}" "${dir2}/*/${fileR1}" "${dir3}/*/${fileR1}" "${dir4}/*/${fileR1}" "${destdir}/merged_${fileR1}"
gsutil compose "${dir1}/*/${fileR2}" "${dir2}/*/${fileR2}" "${dir3}/*/${fileR2}" "${dir4}/*/${fileR2}" "${destdir}/merged_${fileR2}"
done
As you said, the solution was not difficult to find.
Thank you again!

Related

How to export the output of multiple gcloud queries into adjacent sheets within one CSV file, using Bash?

I have the following 3 gcloud queries:
Query 1 - To enumerate users of a project:
gcloud projects get-iam-policy MyProject --format="csv(bindings.members)" >> output1.csv
Query 2 - To enumerate users of a folder:
gcloud resource-manager folders get-iam-policy MyFolder --format="csv(bindings.members)" >> output2.csv
Query 3 - To enumerate users of the organization:
gcloud organizations get-iam-policy MyOrg --format="csv(bindings.members)" >> output3.csv
My goal is to run all 3 queries together and export the output to multiple adjacent sheets within one CSV file, instead of 3 separate CSV files. Is that possible?
Please advise. Thanks.
It is not possible: comma-delimited files (CSVs) do not support multiple 'sheets' or tables within a single file.
You must create one file per table.
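If it helps, the three queries can still be driven from a single script that writes one CSV per scope (a minimal sketch; the output file names are my own choice, and MyProject, MyFolder and MyOrg are the placeholders from the question):
#!/usr/bin/env bash
set -euo pipefail

# One CSV per IAM scope, since a single CSV cannot hold multiple sheets.
gcloud projects get-iam-policy MyProject --format="csv(bindings.members)" > project_members.csv
gcloud resource-manager folders get-iam-policy MyFolder --format="csv(bindings.members)" > folder_members.csv
gcloud organizations get-iam-policy MyOrg --format="csv(bindings.members)" > org_members.csv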

Faster way of Appending/combining thousands (42000) of netCDF files in NCO

I seem to be having trouble properly combining thousands of netCDF files (42000+; about 3 GB in total for this particular folder/variable). The main variable that I want to combine has a structure of (6, 127, 118), i.e. (time, lat, lon).
I'm appending the files one by one since the number of files is too large.
I have tried:
for i in input_source/**/**/*.nc; do ncrcat -A -h append_output.nc $i append_output.nc ; done
but this method is really slow (on the order of KB/s, and it seems to get slower as more files are appended) and also gives a warning:
ncrcat: WARNING Intra-file non-monotonicity. Record coordinate "forecast_period" does not monotonically increase between (input file file1.nc record indices: 17, 18) (output file file1.nc record indices 17, 18) record coordinate values 6.000000, 1.000000
which basically means the variable "forecast_period" just repeats 1-6 n times, where n = 42000 files, i.e. [1,2,3,4,5,6,1,2,3,4,5,6,...].
Despite this warning I can still open the file and ncrcat does what it's supposed to; it is just slow, at least with this particular method.
I have also tried adding in the option:
--no_tmp_fl
but this gives an error:
ERROR: nco__open() unable to open file "append_output.nc"
full error attached below
If it helps, I'm using WSL with Ubuntu on Windows 10.
I'm new to bash and any comments would be much appreciated.
Either of these commands should work:
ncrcat --no_tmp_fl -h *.nc
or
ls input_source/**/**/*.nc | ncrcat --no_tmp_fl -h append_output.nc
Your original command is slow because you open and close the output file N times. These commands open it once, fill it up, then close it.
I would use CDO for this task. Given the huge number of files, it is recommended to first sort them by time (assuming you want to merge them along the time axis). After that, you can use
cdo cat *.nc outfile
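A minimal sketch of that two-step approach, assuming the file names sort in chronological order (otherwise they would need to be sorted on a timestamp taken from each name):
# pass the files to cdo in sorted (chronological) order in a single call;
# with ~42000 paths this can hit the shell's argument-length limit, in which
# case the list may need to be concatenated in a few batches first
cdo cat $(ls input_source/**/**/*.nc | sort) append_output.nc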

s3 awk bash pipeline

Following this question Splitting out a large file.
I would like to pipe files from an Amazon s3:// bucket containing large gzipped files and process them with an awk command.
Sample file to process
...
{"captureTime": "1534303617.738","ua": "..."}
...
Script to optimize
aws s3 cp s3://path/to/file.gz - \
| gzip -d \
| awk -F'"' '{date=strftime("%Y%m%d%H",$4); print > "splitted."date }'
gzip splitted.*
# make some visual checks here before copying to S3
aws s3 cp splitted.*.gz s3://path/to/splitted/
Do you think I can wrap everything in the same pipeline to avoid writing files locally?
I can use the approach from "Using gzip to compress files to transfer with aws command" to gzip and copy on the fly, but gzipping inside awk would be great.
Thank you.
It took me a bit to understand that your pipeline creates one "splitted.<date>" file for each distinct capture hour in the source file. Since shell pipelines operate on byte streams and not files, while S3 operates on files (objects), you must turn your byte stream into a set of files on local storage before sending them back to S3. So, a pipeline by itself won't suffice.
But I'll ask: what's the larger purpose you're trying to accomplish?
You're on the path to generating lots of S3 objects, one for each hour covered by your "large gzipped files". Is this using S3 as a key-value store? Is that the best design for the goal of your effort? In other words, is S3 the best repository for this information, or is there some other store (DynamoDB, or another NoSQL database) that would be a better solution?
All the best
Two possible optimizations:
With large and multiple files, it will help to use all the cores to gzip the files; use xargs, pigz or GNU parallel (see "Gzip with all cores"). A sketch follows below.
Parallelize the S3 upload:
https://github.com/aws-samples/aws-training-demo/tree/master/course/architecting/s3_parallel_upload
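For the first point, a minimal sketch that reuses the splitted.* files produced by the awk step in the question (assuming pigz is installed):
# compress the per-hour split files on all cores instead of a single-threaded gzip
pigz splitted.*

# alternative without pigz: one gzip process per file, run in parallel via xargs
# printf '%s\n' splitted.* | xargs -P "$(nproc)" -n 1 gzip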

How can I check that the file sizes match between duplicate directories?

I need to compare two directories to validate a backup.
Say my directory looks like the following:
user@main_server:~/mydir/              user@backup_server:~/mydir/
Filename        Filesize               Filename        Filesize
file1000.txt    4182410737             file1000.txt    4182410737
file1001.txt    8241410737             -                            <-- missing on backup_server!
...                                    ...
file9999.txt    2410418737             file9999.txt    1111111111   <-- size != main_server
Is there a quick one-liner that would get me close to output like:
Invalid Backup Files:
file1001.txt
file9999.txt
(with the goal of instructing the backup script to re-fetch these files)
I've tried to get variations of the following to no avail.
[main_server] $ rsync -n ~/mydir/ user@backup_server:~/mydir
I cannot use rsync to back up the directories themselves because it takes way too long (8-24 hrs). Instead I run multiple scp processes to fetch files in batches, which regularly completes in under 1 hr. However, occasionally I find a few files that were somehow missed (perhaps a dropped connection).
Speed is a priority, so file sizes should be sufficient. But I'm open to including a checksum, provided it doesn't slow the process down like I find with rsync.
Here's my test process:
# Generate Large Files (1GB)
for i in {1..100}; do head -c 1073741824 </dev/urandom >foo-$i ; done
# SCP them from src to dest
for i in {1..100}; do ( scp ~/mydir/foo-$i user@backup_server:~/mydir/ & ) ; sleep 0.1 ; done
# Confirm destination has everything from source
# This is the point of the question. I've tried:
rsync -Sa ~/mydir/ user@backup_server:~/mydir
# Way too slow
What do you recommend?
By default, rsync uses the quick-check method, which only transfers files that differ in size or last-modified time. As you report that the sizes are unchanged, that would seem to indicate that the timestamps differ. Two options to handle this are:
Use -t to preserve timestamps when transferring files.
Use --size-only to ignore timestamps and transfer only files that differ in size.
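A minimal sketch combining --size-only with the dry run the question already uses (the host name is the question's placeholder); the listed file names can then be fed back to the scp-based refetch:
# dry run: list, but do not transfer, every file that is missing on the backup
# or whose size differs; timestamps are ignored entirely
rsync -rvn --size-only ~/mydir/ user@backup_server:~/mydir/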

Concatenating multiple text files into one very large file in HDFS

I have multiple text files.
Their total size exceeds the largest disk size available to me (~1.5 TB).
A Spark program reads a single input text file from HDFS, so I need to combine those files into one. (I cannot rewrite the program code; I am given only the *.jar file for execution.)
Does HDFS have such a capability? How can I achieve this?
What I understood from your question is that you want to concatenate multiple files into one. Here is a solution which might not be the most efficient way of doing it, but it works. Suppose you have two files, file1 and file2, and you want to get a combined file called ConcatenatedFile. Here is the script for that:
hadoop fs -cat /hadoop/path/to/file/file1.txt /hadoop/path/to/file/file2.txt | hadoop fs -put - /hadoop/path/to/file/Concatenate_file_Folder/ConcatenateFile.txt
Hope this helps.
HDFS by itself does not provide such a capability. All out-of-the-box features (like hdfs dfs -text * with pipes or FileUtil's copy methods) use your client machine to transfer all the data.
In my experience we always used our own MapReduce jobs to merge many small files in HDFS in a distributed way.
So you have two solutions:
Write your own simple MapReduce/Spark job to combine text files with your format.
Find an already implemented solution for this kind of purpose.
About solution #2: there is a simple project, FileCrush, for combining text or sequence files in HDFS. It might be suitable for you; check it out.
Example of usage:
hadoop jar filecrush-2.0-SNAPSHOT.jar crush.Crush -Ddfs.block.size=134217728 \
--input-format=text \
--output-format=text \
--compress=none \
/input/dir /output/dir 20161228161647
I had problems running it without these options (especially -Ddfs.block.size and the output file date prefix 20161228161647), so make sure you run it with them.
You can do a Pig job:
A = LOAD '/path/to/inputFiles' as (SCHEMA);
STORE A into '/path/to/outputFile';
Doing an hdfs cat and then putting the result back to HDFS means that all this data is processed on the client node and will degrade your network.
