How to download specific files matching a keyword from different directories using wget?

I am trying to download data from TRMM satellite data archive using the following command
wget -r --no-parent ftp://arthurhou.pps.eosdis.nasa.gov/pub/trmmdata/ByDate/V07/2008/01/01 --user=<username> --password=<password>
Here 2008 is the year, the first 01 is the month (January), and the second 01 is the day. Within this date folder, there are plenty of data files
(e.g 1A01.20080101.57701.7.gz, 2A21.20080101.57711.7.HDF.gz, 2A23.20080101.57702.7.HDF.gz).
I want to download only the files in the "2A23" category from every folder (i.e. every year, month, and date), but with my wget command all the files get downloaded. Is there a way to specify a pattern so that only those files are downloaded?
Thank you in advance for your help.

Here is the solution, in case someone gets stuck on the same question later:
wget -r --no-parent -A 'pattern' 'URL' --user=<username> --password=<password>
In my case the pattern was 2a23*.gz.
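For example, the full command for the TRMM case above would look something like this (a sketch only; <username> and <password> stand in for the real PPS credentials, and the pattern uses the 2A23 prefix from the file names above):
wget -r --no-parent -A '2A23*.gz' --user=<username> --password=<password> ftp://arthurhou.pps.eosdis.nasa.gov/pub/trmmdata/ByDate/V07/2008/01/01/
wget still walks every year/month/day directory, but only the files matching the -A pattern are kept on disk.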

Related

How to download a file via wget when the target path includes wildcards

Here is an example of how to download a file and copy it into the /etc/yum.repos.d folder:
REPOSITORY_SERVER=master_machine01
wget -nd -r -P /etc/yum.repos.d/ -A ".repo" "http://$REPOSITORY_SERVER/ambari/centos7/2.6.2.2-1/ambari.repo"
After the above command, the ambari.repo file will be copied to /etc/yum.repos.d/.
Note: on the web server, the ambari.repo file's path is
ls -ltr /var/www/html/ambari/centos7/2.6.2.2-1/ambari.repo
-rw-r--r-- 1 root users 304 Jun 11 2018 /var/www/html/ambari/centos7/2.6.2.2-1/ambari.repo
So this is the simple case.
Now, what if the path can vary (with different version directories), for example:
$REPOSITORY_SERVER/ambari/centos7/2.6.2.3-1/ambari.repo
or
$REPOSITORY_SERVER/ambari/centos7/2.6.2.2-4/ambari.repo
How can we use the CLI with wildcards?
We tried the following:
wget -nd -r -P /etc/yum.repos.d/ -A ".repo" "http://$REPOSITORY_SERVER/ambari/centos7/*/ambari.repo"
but we get
HTTP request sent, awaiting response... 404 Not Found
2021-11-28 18:40:07 ERROR 404: Not Found.
or even with the asterisk escaped by a backslash:
wget -nd -r -P /etc/yum.repos.d/ -A ".repo" "http://$REPOSITORY_SERVER/ambari/centos7/\*/ambari.repo"
but we get the same error.
Any idea how to resolve this issue?
how to use the cli with Wildcards
It is not possible to perform glob expansion over the HTTP protocol; wildcard expansion is a shell feature, and HTTP knows nothing about it.
how to resolve this issue?
Devise a way to obtain the list of available files under a given path from the HTTP server. For example, contact the server administrator and ask; or, if the HTTP server serves directory listings, recursively fetch the listings and filter them for matching paths; or find some other page that links to all the files, and extract and filter the links from it.
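If the server does serve directory listings, a sketch along the following lines may work (this assumes the listing under /ambari/centos7/ is browsable, which many servers disable; -l 2 limits the recursion depth and -A keeps only ambari.repo):
wget -nd -r -l 2 -P /etc/yum.repos.d/ -A "ambari.repo" "http://$REPOSITORY_SERVER/ambari/centos7/"
Otherwise, build the list of candidate URLs yourself (for example from the version directories you know about) and fetch them with wget one by one.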

Merge fastq.gz files with the same name in different locations in Google Cloud

I would like to merge several fastq.gz files that have the same name but sit in different folders in Google Cloud Storage. I have a total of 15 patients. Each patient has paired-end data, "R1" and "R2". Each R1 and R2 is split into 4 files. The size of each file is approximately 28 GB.
My goal is to merge the 4 files to obtain the complete fastq.gz R1 and R2 files for each patient.
I have never worked with Google Cloud before.
Here is how the folders and the files are in the bucket (example with 2 patients):
gs://bucketID
/folder1
/folder001
Patient1_R1.fastq.gz
Patient1_R2.fastq.gz
/folder002
Patient2_R1.fastq.gz
Patient2_R2.fastq.gz
etc.
/folder2
/folder003
Patient1_R1.fastq.gz
Patient1_R2.fastq.gz
/folder004
Patient2_R1.fastq.gz
Patient2_R2.fastq.gz
etc.
/folder3
/folder005
Patient1_R1.fastq.gz
Patient1_R2.fastq.gz
/folder006
Patient2_R1.fastq.gz
Patient2_R2.fastq.gz
etc.
/folder4
/folder007
Patient1_R1.fastq.gz
Patient1_R2.fastq.gz
/folder008
Patient2_R1.fastq.gz
Patient2_R2.fastq.gz
etc.
I want to write a script that targets the fastq.gz files with the same name in the different folders and merges them. However, I have no idea how to do this on Google Cloud.
Here's how I see the bash script:
bucket="bucketID"
dir1=$bucket/"folder1"
dir2=$bucket/"folder2"
dir3=$bucket/"folder3"
dir4=$bucket/"folder4"
destdir=$bucket/"destdir"
participants=(Patient1
Patient2
)
for i in "${participants[@]}";
do
zcat $dir1/.../${i}_R1.fastq.gz $dir2/.../${i}_R1.fastq.gz $dir3/.../${i}_R1.fastq.gz $dir4/.../${i}_R1.fastq.gz | gzip > $destdir/merged_${i}_R1.fastq.gz
zcat $dir1/.../${i}_R2.fastq.gz $dir2/.../${i}_R2.fastq.gz $dir3/.../${i}_R2.fastq.gz $dir4/.../${i}_R2.fastq.gz | gzip > $destdir/merged_${i}_R2.fastq.gz
done
Should I use "gsutil compose" instead to merge?
In the end, I would like to have only two files, R1 and R2, for each patient: merged_patient#_R1.fastq.gz and merged_patient#_R2.fastq.gz.
In the example I gave above, it would give 4 files:
merged_Patient1_R1.fastq.gz
merged_Patient1_R2.fastq.gz
merged_Patient2_R1.fastq.gz
merged_Patient2_R2.fastq.gz
Thank you!
I would recommend using the following command to concatenate your files:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
You can check the gsutil compose documentation for details.
I tried a simple bash script using the "gsutil compose" command with fastq.gz files, and it worked fine for me.
The compose command creates a new object whose content is the concatenation of a given sequence of source objects under the same bucket.
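A minimal sketch for one patient, using the bucket and folder names from the question (compose accepts up to 32 source objects per call, all in the same bucket; since concatenated gzip members form a valid gzip stream, the composite can still be read with zcat):
gsutil compose \
  gs://bucketID/folder1/folder001/Patient1_R1.fastq.gz \
  gs://bucketID/folder2/folder003/Patient1_R1.fastq.gz \
  gs://bucketID/folder3/folder005/Patient1_R1.fastq.gz \
  gs://bucketID/folder4/folder007/Patient1_R1.fastq.gz \
  gs://bucketID/destdir/merged_Patient1_R1.fastq.gz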
Hope this helps!
OK, I found the solution with gsutil compose:
declare -a participantsArray=("Patient1"
"Patient2"
)
bucket="bucketID"
dir1=$bucket/"folder1"
dir2=$bucket/"folder2"
dir3=$bucket/"folder3"
dir4=$bucket/"folder4"
destdir=$bucket/"destdir"
for i in "${participantsArray[@]}";
do
fileR1="${i}_R1.fastq.gz"
fileR2="${i}_R2.fastq.gz"
gsutil compose "${dir1}/*/${fileR1}" "${dir2}/*/${fileR1}" "${dir3}/*/${fileR1}" "${dir4}/*/${fileR1}" "${destdir}/merged_${fileR1}"
gsutil compose "${dir1}/*/${fileR2}" "${dir2}/*/${fileR2}" "${dir3}/*/${fileR2}" "${dir4}/*/${fileR2}" "${destdir}/merged_${fileR2}"
done
As you said the solution was not difficult to find.
Thank you again!

Grep files loaded after a particular minute

Can anyone please help me with the command to find files loaded after a particular minute?
Example:
I want the files loaded after the time 2017/02/15.11.
That means all the files loaded after 11:00 on that day.
First, you need to convert your custom date into a recognizable date format:
DATE="2017/02/15.11"
TIME="${DATE##*.}:00"
DATE="${DATE%.*}"
echo "DATE: ${DATE} TIME: ${TIME}"
Then just pass it to find:
find /dir -newermt "${DATE} ${TIME}"
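If the goal is to grep inside only those newer files (as the title suggests), the find output can be handed straight to grep; PATTERN here is just a placeholder for whatever you are searching for:
find /dir -type f -newermt "${DATE} ${TIME}" -exec grep -H "PATTERN" {} +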

file's date changes after zip in and out again, according to XCOPY

So, here's the problem: I have regular files, and they are put into a ZIP file (see below for details on the ZIP tool). Then I unzip them (see below for details on the tool used), and the files are restored. The date of each file is restored, as is standard with the ZIP/UNZIP tools used. When querying with DIR, or in Windows Explorer, the files involved have the same date as they had before being handled by the ZIP/UNZIP process.
So, all OK.
But then I use the XCOPY /D command to further manipulate different copies of those files on the disk ... and XCOPY says one file is newer than the other. Given that the date, hour, and minutes are the same, the difference must lie in a smaller unit, like seconds?
All involved disks have NTFS file system.
Example:
C:\my>dir C:\windows\Background_mycomputer.cmd C:\my\directory\Background_mycomputer.cmd
Volume in drive C is mycomputerC
Volume Serial Number is 1234-5678
Directory of C:\windows
31/12/2014 19:50 51 Background_mycomputer.cmd
1 File(s) 51 bytes
Directory of C:\my\directory
31/12/2014 19:50 51 Background_mycomputer.cmd
1 File(s) 51 bytes
0 Dir(s) 33.655.316.480 bytes free
C:\my>xcopy C:\windows\Background_mycomputer.cmd C:\my\directory\Background_mycomputer.cmd /D
Overwrite C:\my\directory\Background_mycomputer.cmd (Yes/No/All)? y
C:\windows\Background_mycomputer.cmd
1 File(s) copied
C:\my>xcopy C:\my\directory\Background_mycomputer.cmd C:\windows\Background_mycomputer.cmd /D
0 File(s) copied
C:\my>xcopy C:\windows\Background_mycomputer.cmd C:\my\directory\Background_mycomputer.cmd /D
0 File(s) copied
C:\my>unzip -v
UnZip 6.00 of 20 April 2009, by Info-ZIP. Maintained by C. Spieler. Send
bug reports using http://www.info-zip.org/zip-bug.html; see README for details.
Latest sources and executables are at ftp://ftp.info-zip.org/pub/infozip/ ;
see ftp://ftp.info-zip.org/pub/infozip/UnZip.html for other sites.
Compiled with Microsoft C 13.10 (Visual C++ 7.1) for
Windows 9x / Windows NT/2K/XP/2K3 (32-bit) on Apr 20 2009.
UnZip special compilation options:
ASM_CRC
COPYRIGHT_CLEAN (PKZIP 0.9x unreducing method not supported)
NTSD_EAS
SET_DIR_ATTRIB
TIMESTAMP
UNIXBACKUP
USE_EF_UT_TIME
USE_UNSHRINK (PKZIP/Zip 1.x unshrinking method supported)
USE_DEFLATE64 (PKZIP 4.x Deflate64(tm) supported)
UNICODE_SUPPORT [wide-chars] (handle UTF-8 paths)
MBCS-support (multibyte character support, MB_CUR_MAX = 1)
LARGE_FILE_SUPPORT (large files over 2 GiB supported)
ZIP64_SUPPORT (archives using Zip64 for large files supported)
USE_BZIP2 (PKZIP 4.6+, using bzip2 lib version 1.0.5, 10-Dec-2007)
VMS_TEXT_CONV
[decryption, version 2.11 of 05 Jan 2007]
UnZip and ZipInfo environment options:
UNZIP: [none]
UNZIPOPT: [none]
ZIPINFO: [none]
ZIPINFOOPT: [none]
C:\my>ver
Microsoft Windows [Version 6.1.7601]
C:\my>zip -?
Copyright (c) 1990-2006 Info-ZIP - Type 'zip "-L"' for software license.
Zip 2.32 (June 19th 2006). Usage:
zip [-options] [-b path] [-t mmddyyyy] [-n suffixes] [zipfile list] [-xi list]
The default action is to add or replace zipfile entries from list, which
can include the special name - to compress standard input.
If zipfile and list are omitted, zip compresses stdin to stdout.
-f freshen: only changed files -u update: only changed or new files
-d delete entries in zipfile -m move into zipfile (delete files)
-r recurse into directories -j junk (don't record) directory names
-0 store only -l convert LF to CR LF (-ll CR LF to LF)
-1 compress faster -9 compress better
-q quiet operation -v verbose operation/print version info
-c add one-line comments -z add zipfile comment
-# read names from stdin -o make zipfile as old as latest entry
-x exclude the following names -i include only the following names
-F fix zipfile (-FF try harder) -D do not add directory entries
-A adjust self-extracting exe -J junk zipfile prefix (unzipsfx)
-T test zipfile integrity -X eXclude eXtra file attributes
-! use privileges (if granted) to obtain all aspects of WinNT security
-R PKZIP recursion (see manual)
-$ include volume label -S include system and hidden files
-e encrypt -n don't compress these suffixes
C:\my>
Question: I do not want XCOPY to perform updates that I know are invalid because the time format is doing something wrong. How do I prevent that?
As I see it, there are several things involved: XCOPY, the specific ZIP and UNZIP tools, and the NTFS file system. Which one is doing something wrong?
I must stress that apart from ZIP and UNZIP, no other changes are made to the files, such as changing one file and then changing another less than 60 seconds later.
At the moment of the test, the time shown was NOT the current time, nor close to it. No file is being adjusted to the current time; the times refer to the last change of the file in question, which may be any time in the past. In this case it is one day later, but it can be anything.
I noticed the peculiar behavior Raymond Chen describes when writing a Powershell script (GitHub link) to freshen a zip archive using the System.IO.Compression and System.IO.Compression.FileSystem libraries.
Interestingly, Zip archives can store multiple copies of the same file with identical metadata (name, relative path, modification dates). Extracting the second copy of the file will fail in Windows Explorer because the file already exists.
When trying to prevent re-zipping a file that was already archived, I checked the relative path and date, and noticed a discrepancy of up to two seconds in the LastWriteTime. This workaround compensates for the loss of precision:
$AlreadyArchivedFile = ($WriteArchive.Entries | Where-Object {#zip will store multiple copies of the exact same file - prevent this by checking if already archived.
(($_.FullName -eq $RelativePath) -and ($_.Length -eq $File.Length) ) -and
([math]::Abs(($_.LastWriteTime.UtcDateTime - $File.LastWriteTimeUtc).Seconds) -le 2) #ZipFileExtensions timestamps are only precise within 2 seconds.
})
Also, the IsDaylightSavingTime flag is not stored in the Zip archive. As a result I was surprised when extracted files became an hour newer than the original archived file. I tried this several times and saw the extracted file's timestamp incremented by an hour every time it was compressed and extracted.
Here's a very ugly workaround that decreases the archived file time by one hour to make the original source file and extracted file timestamps consistent:
If($File.LastWriteTime.IsDaylightSavingTime() -and $ArchivedFile){#HACK: fix for buggy date - adds an hour inside archive when the zipped file was created during PDT (files created during PST are not affected).
$entry = $WriteArchive.GetEntry($RelativePath)
$entry.LastWriteTime = ($File.LastWriteTime.ToLocalTime() - (New-TimeSpan -Hours 1))
}
There's probably a better way to handle this. Unfortunately I'm not aware of any way to store a Daylight Savings indicator for a file in a .Zip archive, and that information is lost.
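Coming back to the original XCOPY /D question: XCOPY has no switch I am aware of to tolerate this, but if switching tools is an option, robocopy's /FFT flag (assume FAT file times) compares timestamps with 2-second granularity, which absorbs exactly the kind of rounding the ZIP format introduces, and /XO skips files whose destination copy is not older. A sketch using the paths from the question:
robocopy C:\windows C:\my\directory Background_mycomputer.cmd /XO /FFT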

How to fix the error in the bash shell script?

I am trying to write a shell script. While converting the code from a batch script to a shell script I am getting an error.
BATCH FILE CODE
:: Create a file with all latest snapshots
FOR /F "tokens=5" %%a in (' ec2-describe-snapshots ^|find "SNAPSHOT" ^|sort /+64') do set "var=%%a"
set "latestdate=%var:~0,10%"
call ec2-describe-snapshots |find "SNAPSHOT"|sort /+64 |find "%latestdate%">"%EC2_HOME%\Working\SnapshotsLatest_%date-today%.txt"
CODE IN SHELL SCRIPT
#Create a file with all latest snapshots
FOR snapshot_date in $(' ec2-describe-snapshots | grep -i "SNAPSHOT" |sort /+64') do set "var=$snapshot_date"
set "latestdate=$var:~0,10"
ec2-describe-snapshots |grep -i "SNAPSHOT" |sort /+64 | grep "$latestdate">"$EC2_HOME%/SnapshotsLatest_$today_date"
I want to sort the snapshots by date and save the snapshots created on the latest date to a file.
SAMPLE OUTPUT OF ec2-describe-snapshots:
SNAPSHOT snap-5e20 vol-f660 completed 2013-12-10T08:00:30+0000 100% 109030037527 10 2013-12-10: Daily Backup for i-2111 (VolID:vol-f9a0 InstID:i-2601)
It will contain records like this.
I got this code:
latestdate=$(ec2-describe-snapshots | grep ^SNAPSHOT | sort -k 5 | awk '{print $5}')
ec2-describe-snapshots | grep SNAPSHOT.*$latestdate | > "$EC2_HOME/SnapshotsLatest_$today_date"
but I am getting this error:
grep: 2013-12-10T09:55:34+0000: No such file or directory
grep: 2013-12-11T04:16:49+0000: No such file or directory
grep: 2013-12-11T04:17:57+0000: No such file or directory
I have some snapshots on Amazon; I want to find the latest snapshots made on a date and store them in a file. For example, for the date 2013-12-10, the snapshots made on that date should be stored in the file. The contents of the SnapshotsLatest file should be:
SNAPSHOT snap-c17f3 vol-f69a0 completed 2013-12-04T09:24:50+0000 100% 109030037527 10 2013-12-04: Daily Backup for Sanjay_Test_Machine (VolID:vol-f66409a0 InstID:i-26048111)
SNAPSHOT snap-c7d617f9 vol-3d335f6b completed 2013-12-04T09:24:54+0000 100% 109030037527 10 2013-12-04: Daily Backup for sacht_VPC (VolID:vol-3db InstID:i-ed6)
Please note that if there are snapshots created on 2013-12-10, 2013-12-11, and 2013-12-12, then the latest date is 2013-12-12 and all the snapshots created on 2013-12-12 should be saved in the file.
Any suggestion or lead is appreciated.
Neither the batch script nor the shell script you posted is a good starting point, so let's start from scratch. Sorry, this is too big for a comment.
You want to find the latest snapshots made on a date and then want to store them in a file.
What does that mean?
Do the snapshot files have a timestamp in their name or in their content?
If not: UNIX does not store file creation timestamps, so is a last-modified timestamp adequate?
Do you literally want to concatenate all of your snapshot files into one single file, or do you want to create a file that lists the snapshot file names?
Post some sample input (e.g. some snapshot file names and contents if that's where the timestamp is stored) and the expected output given that input.
Update your question to address all of the above, do not try to reply in a comment.
Minor issue: you don't need a pipe when redirecting output, so your line to save the output should be
ec2-describe-snapshots | grep SNAPSHOT.*$latestdate > "$EC2_HOME/SnapshotsLatest_$today_date"
Now, the main issue here is that the grep is messed up. I haven't worked with Amazon snapshots, but judging by your example output, you should be doing something like:
latestdate=$(ec2-describe-snapshots | grep -oP "\d+-\d+-\d+" | sort -r | head -1)
This will extract all dates of the form dddd-dd-dd from the output (I'm assuming the two dates in each snapshot line always match), sort them in reverse order (latest first), and take the head, which is the latest date, storing it in $latestdate.
Then to store all snapshots with the given date do something like
ec2-describe-snapshots | grep -oP "SNAPSHOT(.*?)${latestdate}T(.*?)\)" > "$EC2_HOME/SnapshotsLatest_$today_date"
This will get all text starting with SNAPSHOT, containing the given date, and ending in a closing ")" and save it. Note, you may have to mess around with it a bit, if ")" can be present elsewhere.
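Putting the two pieces together, a minimal sketch of the whole step (assuming $EC2_HOME and $today_date are already set, as in the original script):
latestdate=$(ec2-describe-snapshots | grep -oP "\d+-\d+-\d+" | sort -r | head -1)
ec2-describe-snapshots | grep -oP "SNAPSHOT(.*?)${latestdate}T(.*?)\)" > "$EC2_HOME/SnapshotsLatest_$today_date"
Because grep -oP prints each matching SNAPSHOT record on its own line, the output file ends up with one snapshot per line, as in the desired sample above.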
