A question like this was asked here, but never answered.
I need to pass fastq files as options to a tool that does not accept gzipped inputs. Is there really no option other than to unzip every one of them?
It fails when I pass the gzipped version:
mytool -s sample.fastq.gz -f fastq -o output
I've tried this, out of desperation (obviously doesn't work):
mytool -s <(zcat | sample.fastq.gz) -f fastq -o output
I can't pipe it directly, but is there an easier/more straightforward way than what I settled on below?
zcat sample.fastq.gz > sample.fastq
mytool -s sample.fastq -f fastq -o output
rm sample.fastq
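When there are many files I end up wrapping those same three steps in a loop, roughly like this (the output naming below is just a made-up placeholder):
for fq in *.fastq.gz; do
    zcat "$fq" > "${fq%.gz}"            # temporary uncompressed copy
    mytool -s "${fq%.gz}" -f fastq -o "out_${fq%.fastq.gz}"
    rm "${fq%.gz}"                      # remove the temporary copy
done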
It's slow and seems like a lot of unnecessary hassle for tons of files. I'm not the best at coding and always up for learning new tricks.
Thanks!
I have a script that was kicking off ~200 jobs for each sub-analysis. I realized that a job array would probably be much better for this, for several reasons. It seems simple enough but is not quite working for me. My input files are not numbered, so, following examples I've seen, I do this first:
INFILE=`sed -n ${SGE_TASK_ID}p <pathto/listOfFiles.txt`
My qsub command takes in quite a few variables, as it is both pulling from and outputting to different directories. $res does not change; $INFILE, however, is what I am looping through.
qsub -q test.q -t 1-200 -V -sync y -wd ${res} -b y perl -I /master/lib/ myanalysis.pl -c ${res}/${INFILE}/configFile-${INFILE}.txt -o ${res}/${INFILE}/
Since this was not working, I was curious what exactly was being passed, so I echoed the command and saw that it only seems to expand up to the first time $INFILE is used. I get:
perl -I /master/lib/ myanalysis.pl -c mydirectory/fileABC/
instead of:
perl -I /master/lib/ myanalysis.pl -c mydirectory/fileABC/configFile-fileABC.txt -o mydirectory/fileABC/
Hoping for some clarity on this and welcome all suggestions. Thanks in advance!
UPDATE: It doesn't look like $SGE_TASK_ID is set on the cluster. I looked for any variable that could be used for an array ID and couldn't find anything. If I see anything else I will update again.
Assuming you are using a grid engine variant, SGE_TASK_ID should be set within the job. It looks like you are expecting it to be set to some useful value before you run qsub. Submitting a script like this would do roughly what you appear to be trying to do:
#!/bin/bash
INFILE=$(sed -n ${SGE_TASK_ID}p <pathto/listOfFiles.txt)
exec perl -I /master/lib/ myanalysis.pl -c ${res}/${INFILE}/configFile-${INFILE}.txt -o ${res}/${INFILE}/
Then submit this script with
res=${res} qsub -q test.q -t 1-200 -V -sync y -wd ${res} myscript.sh
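Alternatively, if you would rather not export your whole environment with -V, Grid Engine's qsub can pass just the one variable with its -v option; a sketch of the same submission:
qsub -q test.q -t 1-200 -v res=${res} -sync y -wd ${res} myscript.sh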
I am using a piece of software on a Unix computing cluster that needs hundreds of input files for a combinative analysis. It wants each file specified with its own -f flag, like this:
software -f file1.ext -f file2.ext -f file3.ext
A simple bash loop doesn't work, because it runs the software separately for each file instead of handing them all over together:
for i in *ext; do software -f ${i}; done
An even simpler way doesn't work either:
software -f *ext
Specifying the folder where the files are doesn't work, either:
software -f .
Even a cooler bash script doesn't work (actually not really different from the simple loop):
#!/bin/bash
for i in $(ls *.ext | rev | cut -c 5- | rev | uniq)
do
software -f ${i}.ext
done
So what I need is a way to make the software see all of my files in a single invocation, repeating the -f flag for each one. Something like:
for i in *ext; for each -f; do software -f ${i}; done
Any help is much appreciated!
You can iteratively build an array with the appropriate options.
for f in *ext; do
    opts+=(-f "$f")
done
software "${opts[@]}"
I have a large number of photos on my machine where I'd like to parse the standard naming convention I have created for each file, and then pipe it to the touch command.
For example, I have these files:
2016-08-06-00h28m34.jpg
2016-08-06-00h28m35.jpg
2016-08-06-00h28m36.jpg
I would like to generate (and then run) the following commands:
touch -t 201608060028.34 2016-08-06-00h28m34.jpg
touch -t 201608060028.35 2016-08-06-00h28m35.jpg
touch -t 201608060028.36 2016-08-06-00h28m36.jpg
I can do this manually in a text editor, but it's extremely time-consuming, due to the number of files in each directory. I could also do it in C# and run it over my LAN, but that seems like overkill. Heck, I can even do this in SQL Server, but ... it's OS X and I'm sure there's a simple command-line thing I'm missing.
I've looked at Windows version of Unix touch command to modify a file's modification date from filename part, and Split Filename Up to Define Variables, but I can't seem to figure out how to add in the period for the seconds portion of the script, plus I don't want to add the batch script to each of the hundreds of folders I have.
Any assistance will be greatly appreciated.
Simple Option
I presume you are trying to set the filesystem time to match the EXIF capture time of thousands of photos. There is a tool for that, which runs on OSX, Linux (and Windows if you must). It is called jhead and I installed it on OSX using homebrew with:
brew install jhead
There may be other ways to install it - jhead website.
Please make a back up before trying this, or try it out on a small subset of your files, as I may have misunderstood your needs!
Basically the command to set the filesystem timestamp to match the EXIF timestamp on a single file is:
jhead -ft SomeFile.jpg
So, if you wanted to set the timestamps for all files in $HOME/photos/tmp and all subdirectories, you would do:
find $HOME/photos/tmp -iname \*.jpg -exec jhead -ft {} \;
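If your find supports it (GNU and BSD find both do), ending -exec with + instead of \; passes a batch of filenames to each jhead invocation, which should be noticeably faster across thousands of photos, assuming jhead is happy taking multiple files per call:
find $HOME/photos/tmp -iname \*.jpg -exec jhead -ft {} +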
Option not requiring any extra software
Failing that, you could do it with Perl, which is installed on OSX by default anyway:
find . -name \*.jpg | perl -lne 'my $a=$_; s/.*(\d{4})-(\d+)-(\d+)-(\d+)h(\d+)m(\d+).*/$1$2$3$4$5.$6/ && print "touch -t $_\ \"$a\"" '
which gives this sort of output on my machine:
touch -t 201608060028.34 "./2016-08-06-00h28m34.jpg"
touch -t 201608060028.35 "./2016-08-06-00h28m35.jpg"
touch -t 201501060028.35 "./tmp/2015-01-06-00h28m35.jpg"
and if that looks good on your machine, you could send those commands into bash to be executed like this:
find . -name \*.jpg | perl -lne 'my $a=$_;s/.*(\d{4})-(\d+)-(\d+)-(\d+)h(\d+)m(\d+).*/$1$2$3$4$5.$6/ && print "touch -t $_\ \"$a\"" ' | bash -x
And, for the Perl purists out there, yes, I know Perl could do the touch itself and save invoking a whole touch process per file, but that would require modules and explanation and a heap of other extraneous stuff that is not really necessary for a one-off, or occasional operation.
In a bash script, I want to redirect a file called test to gzip -f, then redirect STDOUT to tar -xfzv, and possibly STDERR to echo. This is what I tried:
$ gzip -f <> ./test | tar -xfzv
Here, tar just complains that no file was given. How would I do what I'm attempting?
EDIT: the test file IS NOT a .tar.gz file, sorry about that
EDIT: I should be unzipping, then zipping, not like I had it written here
tar's -f switch tells it that it will be given a filename to read from. Use - as the filename to make tar read from stdin, or omit the -f switch entirely. Please read man tar for further information.
I'm not really sure what you're trying to achieve in general, to be honest. The purpose of the read-write redirection (<>) and gzip's -f switch here is unclear. If the task is to unpack a .tar.gz, it's better to use tar xvzf ./test.tar.gz.
As a side note, you cannot 'redirect stderr to echo'; echo is just a shell built-in, and if we're talking about an interactive terminal session, stderr will end up visible on your terminal anyway. You can redirect it to a file with the 2>$filename construct.
EDIT: So for the clarified version of the question, if you want to decompress a gzipped file, run it through bcrypt, then compress it back, you may use something like
gzip -dc $orig_file.tar.gz | bcrypt [your-switches-here] | gzip -c > $modified_file.tar.gz
where gzip's -d stands for decompression, and -c stands for 'output to stdout'.
If you want to encrypt each file individually instead of encrypting the whole tar archive, things get funnier because tar won't read input from stdin. So you'll need to extract your files somewhere, encrypt them and then tgz them back. This is not the shortest way to do that, but in general it works like this:
mkdir tmp ; cd tmp
tar xzf ../$orig_file.tar.gz
bcrypt [your-switches-here] *
tar czf ../$modified_file.tar.gz *
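and once the new archive has been written you would presumably clean up the scratch directory:
cd ..
rm -rf tmp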
Please note that I'm not familiar with bcrypt switches and workflow at all.
Current Process:
I have a tar.gz file. (Actually, I have about 2000 of them, but that's another story).
I make a temporary directory, extract the tar.gz file, revealing 100,000 tiny files (around 600 bytes each).
For each file, I cat it into a processing program, pipe that loop into another analysis program, and save the result.
The temporary space on the machines I'm using can barely handle one of these processes at once, never mind the 16 (hyperthreaded dual quad core) that they get sent by default.
I'm looking for a way to do this process without saving to disk. I believe the performance penalty for individually pulling files using tar -xf $file -O <targetname> would be prohibitive, but it might be what I'm stuck with.
Is there any way of doing this?
EDIT: Since two people have already made this mistake, I'm going to clarify:
Each file represents one point in time.
Each file is processed separately.
Once processed (in this case a variant on Fourier analysis), each gives one line of output.
This output can be combined to do things like autocorrelation across time.
EDIT2: Actual code:
for f in posns/*; do
~/data_analysis/intermediate_scattering_function < "$f"
done | ~/data_analysis/complex_autocorrelation.awk limit=1000 > inter_autocorr.txt
If you do not care about the boundaries between files, then tar --to-stdout -xf $file will do what you want; it will send the contents of each file in the archive to stdout one after the other.
This assumes you are using GNU tar, which is reasonably likely if you are using bash.
[Update]
Given the constraint that you do want to process each file separately, I agree with Charles Duffy that a shell script is the wrong tool.
You could try his Python suggestion, or you could try the Archive::Tar Perl module. Either of these would allow you to iterate through the tar file's contents in memory.
This sounds like a case where the right tool for the job is probably not a shell script. Python has a tarfile module which can operate in streaming mode, letting you make only a single pass through the large archive and process its files, while still being able to distinguish the individual files (which the tar --to-stdout approach will not).
You can use tar's --to-command=cmd option to execute a command for each file. Tar redirects the file's content to the command's standard input and sets some environment variables with details about the file, such as TAR_FILENAME. More details are in the Tar Documentation.
e.g.
tar zxf file.tar.gz --to-command='./process.sh'
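A minimal process.sh might look something like this (the wc -c is just a stand-in for whatever per-file processing you actually need):
#!/bin/bash
# tar feeds each member's contents on stdin and exports its name in TAR_FILENAME
echo "processing $TAR_FILENAME" >&2
wc -c    # placeholder: just count the bytes of this member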
Note that OSX uses bsdtar by default, which does not have this option. You can explicitly call gnutar instead.
You could use a ramdisk ( http://www.vanemery.com/Linux/Ramdisk/ramdisk.html ) to load the data into and process it from. (I'm boldly assuming you use Linux, but other UNIX systems should have the same type of provisions.)
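On most modern Linux systems a tmpfs mount gives you the same effect without any extra setup; for example (the size and mount point are only placeholders):
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=2g tmpfs /mnt/ramdisk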
tar zxvf <file.tar.gz> <path_to_extract> --to-command=cat
The above command will show the content of the extracted file on the shell only; there will be no changes on disk. The tar command should be GNU tar.
Sample logs:
$ cat file_a
aaaa
$ cat file_b
bbbb
$ cat file_c
cccc
$ tar zcvf file.tar.gz file_a file_b file_c
file_a
file_b
file_c
$ cd temp
$ ls <== no files in directory
$ tar zxvf ../file.tar.gz file_b --to-command=cat
file_b
bbbb
$ tar zxvf ../file.tar.gz file_a --to-command=cat
file_a
aaaa
$ ls <== Even after tar extract - no files in directory. So, no changes to disk
$ tar --version
tar (GNU tar) 1.25
...
$