Output filename from input in bash

I have this script:
#!/bin/bash
FASTQFILES=~/Programs/ncbi-blast-2.2.29+/DB_files/*.fastq
FASTAFILES=~/Programs/ncbi-blast-2.2.29+/DB_files/*.fasta
clear
for file in $FASTQFILES
do cat $FASTQFILES | perl -e '$i=0;while(<>){if(/^\#/&&$i==0){s/^\#/\>/;print;}elsif($i==1){print;$i=-3}$i++;}' > ~/Programs/ncbi-blast-2.2.29+/DB_files/"${FASTQFILES%.*}.fasta"
mv $FASTAFILES ~/Programs/ncbi-blast-2.2.29+/db/
done
I'm trying to get it to grab the files matched by $FASTQFILES, do the .fastq-to-.fasta conversion, name the output after the input file, and move the result to a new folder. E.g., ~/./DB_files/HELLO.fastq should give a converted ~/./db/HELLO.fasta
The problem is that the output of the conversion is a properly formatted but hidden file called .fasta in the first folder, instead of the expected HELLO.fasta, so there is nothing to mv. I think I'm messing up the "${FASTQFILES%.*}.fasta" argument, but I can't seem to fix it.

I see three problems:
One part of your trouble is that you use cat $FASTQFILES instead of cat $file.
You also need to fix the I/O redirection at the end of that line to > ~/Programs/ncbi-blast-2.2.29+/DB_files/"${file%.fastq}.fasta".
The mv command needs to be executed outside the loop.
In fact, when processing a single file at a time, you don't need to use cat at all (UUOC — Useless Use Of Cat). Simply provide "$file" as an argument to the Perl script.
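Putting those fixes together, the loop might look like this (a sketch that keeps the original Perl one-liner and directory layout; it assumes the db/ target directory already exists, and writes each output file there directly so the separate mv of a glob is no longer needed):
#!/bin/bash
src=~/Programs/ncbi-blast-2.2.29+/DB_files
dst=~/Programs/ncbi-blast-2.2.29+/db
clear
for file in "$src"/*.fastq
do
    base=$(basename "$file" .fastq)
    # Convert one .fastq file and write the .fasta result straight into db/
    perl -e '$i=0;while(<>){if(/^\#/&&$i==0){s/^\#/\>/;print;}elsif($i==1){print;$i=-3}$i++;}' \
        "$file" > "$dst/$base.fasta"
done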

How to delay `redirection operator` of BASH `>`

First I create 3 files:
$ touch alpha bravo carlos
Then I want to save the list to a file:
$ ls > info.txt
However, info.txt always shows up inside the list itself:
$ cat info.txt
alpha
bravo
carlos
info.txt
It looks like the redirection operator creates info.txt before ls runs.
My question is: how can I save the list of files without the newly created info.txt appearing in it?
The main question is about the redirection operator: why does it act first, and how can I delay it so the command on the left completes first? Please use the example above in your answer.
When you redirect a command's output to a file, the shell opens a file handle to the destination file, then runs the command in a child process whose standard output is connected to this file handle. There is no way to change this order, but you can redirect to a file in a different directory if you don't want the ls output to include the new file.
ls >/tmp/info.txt
mv /tmp/info.txt ./
In a production script, you should make sure that the file name is unique and unpredictable.
t=$(mktemp -t lstemp.XXXXXXXXXX) || exit
trap 'rm -f "$t"' INT HUP
ls >"$t"
mv "$t" ./info.txt
Alternatively, capture the output into a variable, and then write that variable to a file.
files=$(ls)
echo "$files" >info.txt
As an aside, probably don't use ls in scripts. If you want a list of files in the current directory
printf '%s\n' *
does that.
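Incidentally, the glob approach also sidesteps the original problem, provided info.txt does not already exist: the shell expands * before it performs the redirection, so the file created by > is not in the list. Repeating the original experiment:
$ touch alpha bravo carlos
$ printf '%s\n' * > info.txt
$ cat info.txt
alpha
bravo
carlos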
One simple approach is to save your command output to a variable, like this:
ls_output="$(ls)"
and then write the value of that variable to the file, using any of these commands:
printf '%s\n' "$ls_output" > info.txt
cat <<< "$ls_output" > info.txt
echo "$ls_output" > info.txt
Some caveats with this approach:
Bash variables can't contain null bytes. If the output of the command includes a null byte, that byte and everything after it will be discarded.
In the specific case of ls, though, this shouldn't be an issue, because the output of ls should never contain a null byte.
$(...) removes trailing newlines. The above compensates for this by adding a newline while creating info.txt, but if the command output ends with multiple newlines, then the above will effectively collapse them into a single newline.
In the specific case of ls, this could happen if a filename ends with a newline — very unusual, and unlikely to be intentional, but nonetheless possible.
Since the above adds a newline while creating info.txt, it will put a newline there even if the command output doesn't end with a newline.
In the specific case of ls, this shouldn't be an issue, because the output of ls should always end with a newline.
If you want to avoid the above issues, another approach is to save your command output to a temporary file in a different directory, and then move it to the right place; for example:
tmpfile="$(mktemp)"
ls > "$tmpfile"
mv -- "$tmpfile" info.txt
. . . which obviously has different caveats (e.g., it requires access to write to a different directory), but should work on most systems.
One way to do what you want is to exclude the info.txt file from the ls output.
If you can rename the list file to .info.txt then it's as simple as:
ls >.info.txt
ls doesn't list files whose names start with . by default.
If you can't rename the list file but you've got GNU ls then you can use:
ls --ignore=info.txt >info.txt
Failing that, you can use:
ls | grep -v '^info\.txt$' >info.txt
All of the above options have the advantage that you can safely run them after the list file has been created.
Another general approach is to capture the output of ls with one command and save it to the list file with a second command. As others have pointed out, temporary files and shell variables are two specific ways to capture the output. Another way, if you've got the moreutils package installed, is to use the sponge utility:
ls | sponge info.txt
Finally, note that you may not be able to reliably extract the list of files from info.txt if it contains plain ls output. See ParsingLs - Greg's Wiki for more information.

Bash - extremely simple script redirecting output to file

Disclaimer: I'm very new to bash and for some reason I'm having a very hard time learning this one. The syntax seems very different depending on the website I visit.
I have a simple wrapper script that I want to test if a file is gzipped or not, and if so, to zcat the file to a new temporary file and open it in an editor. Here's part of the script:
if file $FILE | grep -q gzip
then
timestamp=$(date +"%D_%T")
$( zcat $FILE > tmp-$timestamp )
fi
I'm getting an error: "tmp-10/19/15_15:16:41: No such file or directory"
I tried removing the command substitution syntax and also putting tmp-$timestamp in double quotes, and I get the same error either way. If I remove the -$timestamp part, it seems to work fine. Can someone tell me what's going on here? I'm clearly missing something very simple.
The name tmp-10/19/15_15:16:41 refers to a file named 15_15:16:41 inside a directory 19, which is itself a subdirectory of tmp-10: %D expands to a slash-separated date. Since those directories do not exist, the redirection cannot create the file.
Replace:
timestamp=$(date +"%D_%T")
With:
timestamp=$(date +"%F_%T")
This gives the date without the /.
As an example of this format:
$ date +"%F_%T"
2015-10-19_12:37:05
With %F, the year comes before the month which comes before the day. This means that your files will sort properly. For most people, that is an important advantage over %D.
Revised script
Your script can be simplified to:
if file "$file" | grep -q gzip
then
zcat "$file" > "tmp-$(date +"%F_%T")"
fi
Notes:
It is best practice not to use all caps for your shell variables. The system uses all caps for its variables and you don't want to accidentally overwrite one. Use lower case or mixed case and you'll be safe.
File names, such as $file, should always be in double-quotes. Some day, someone will give you a file name with a space in it and you don't want that to cause your script to fail.
The command substitution $(...) does not belong here. It has been removed.
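Since the stated goal was to open the decompressed copy in an editor, the wrapper might continue along these lines (a sketch only; the $EDITOR-with-vi-fallback convention and the else branch are assumptions, not part of the original script):
#!/bin/bash
file=$1
if file "$file" | grep -q gzip
then
    tmp="tmp-$(date +"%F_%T")"
    zcat "$file" > "$tmp"       # decompress to a uniquely named temporary copy
    "${EDITOR:-vi}" "$tmp"      # open the copy in the user's preferred editor
else
    "${EDITOR:-vi}" "$file"     # not gzipped: open the file as-is
fi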

How to redirect and replace the input file with the output (don't erase myfile when doing "cat myfile > myfile")

What is the right way to redirect to a file used as input and replace the file instead of erasing the file?
I want to do something like:
cat myfile > myfile
where the program (i.e., cat) reads the file and then redirects to the file, so in the example above myfile would be unchanged.
I know that the example above will erase the file, as explained here.
What is the proper syntax so the file will be replaced instead of erased? Is stdout and redirection possible, or should the program handle opening and writing the new data to the file? I would like to let my program send output to stdout and then the user can redirect or pipe or whatever.
BTW, I don't want to >> (concatenate). I want the function on the left to write "new" data to the file -- to replace the file contents.
Perhaps a better way to state it is that I want the left side of the redirection to fully occur before the streaming occurs -- is this possible? Do I have a fundamental misunderstanding of bash?
You have to either read all the data to memory first, or write to a temporary file and swap it out.
To read it into memory (like vim, ed and other editors do):
contents=$(<myfile)
cat <<< "$contents" > myfile
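Note that $(<myfile) is subject to the caveats discussed in an earlier answer above: command substitution strips trailing newlines and cannot hold null bytes, so this in-memory round trip is only safe for ordinary text files.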
To create a temporary file (like sed -i, rsync and other updating tools do):
tmpfile=$(mktemp fooXXXXXX)
cat myfile > "$tmpfile" && mv "$tmpfile" myfile
Presumably you're interested in doing something more complex than cat myfile.
Some commands have command-line arguments that let you do this (typically -i) -- but those options work by writing to a temporary file and then renaming the temporary.
You can do the same thing yourself:
cat myfile > myfile.$$ && mv myfile.$$ myfile
I often use $$, which expands to my shell's process ID, as a suffix for a temporary file; it makes it fairly unlikely that the name will collide with some other file.
Using && means that it won't clobber your input file unless the cat command succeeds.
One possible problem with this approach is that your new file myfile.$$ is a newly created file, so it won't keep the permissions of the original input file. If you have the GNU Coreutils version of the chmod command, you can avoid that problem:
cat myfile > myfile.$$ && \
chmod --reference=myfile myfile.$$ && \
mv myfile.$$ myfile
I'm taking your question literally here. By introducing a pipe to read the output of cat, the second pipe process reads the output from stdin and redirects it to the original file name, resulting in an unchanged file:
cat myfile | cat - > myfile
DANGER: This is a race condition. There is no guarantee that the first process can read the entire contents of myfile before the second process truncates it with the output redirection.

How to evaluate a stream line by line

I am trying to avoid creating any new files to store output, to minimize the risk of overwriting something in the directory that has the same name. Instead of writing to a file and then using a while read line; do ... done < file loop, I want to evaluate each line of the stream directly from a pipe. Something like:
echo -e "1\n2\n3\n4\n5" | #evaluate current line separately#
Could I somehow read each line into an array and then evaluate the elements in the array? or is there a better way to avoid accidentally overwriting files?
In bash, the common way is to use the Process Substitution:
while IFS= read -r line ; do
    ...
done < <( commands producing the input )
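For example, a counter survives the loop because the loop runs in the current shell rather than in a pipeline subshell (IFS= and -r keep whitespace and backslashes in each line intact); a minimal sketch:
count=0
while IFS= read -r line ; do
    count=$((count + 1))
    echo "line $count: $line"
done < <( printf '%s\n' 1 2 3 4 5 )
echo "total: $count"    # prints 5; after a pipe-fed loop this would print 0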
You were halfway there...
echo -e "1\n2\n3\n4\n5" | while read line; do
...
done
Note that bash runs each part of the pipeline in a separate process, and any variables defined there will not persist after that block. (ksh93 will preserve them, as the loop will run in the current shell process.)
You can avoid overwriting files by using mktemp or tempfile to create temporary files with unique names. However, I would use process substitution as in choroba's answer.
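As for reading each line into an array, bash 4+ has mapfile (also spelled readarray), which fits that part of the question directly; a minimal sketch:
mapfile -t lines < <( printf '%s\n' 1 2 3 4 5 )
for line in "${lines[@]}" ; do
    echo "got: $line"
done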

bash script get files in subfolders which contains a special line

I need your help with a short bash script. I have a folder which contains about 150,000(!) XML files. I need a script which extracts all those files that contain a specified line. The script should run as fast as possible, because it has to be used very often.
My first approach was the following, using grep:
for f in temp/*
do
if grep "^.*the line which should be equal.*$" "$f"
then
echo "use this file"
else
echo "this file does not contain the line"
fi
done
This approach works, but it takes too much time. Does somebody know a faster approach? If another scripting language is a better choice, that is also OK.
Best regards,
Michael
You can use grep by itself, without any bash loop. From the grep manual:
-l, --files-with-matches
    Suppress normal output; instead print the name of each input file from which output would normally have been printed. The scanning will stop on the first match. (-l is specified by POSIX.)
So, try this:
grep "the line which should be equal" --files-with-matches temp/*
