Bash (split) file name comparison fails - bash

In my directory I have files (*fastq.gz.fasta) and directories, whose names contain the filenames (*fastq.gz.fasta-blastdb):
IVC6_Meino.clust.gz.fasta-blastdb
IVC5_Mehiv.clust.gz.fasta-blastdb
....
IVC6_Meino.clust.gz.fasta
IVC5_Mehiv.clust.gz.fasta
....
In a bash script I want to compare the filenames with the direcories using the cut option on the latter to extract only the filename part. If those two names match I want to do further stuff (for now echo match or no match respectively).
I have written the following piece of code:
#!/bin/bash
for file in *.fasta
do
for db in *-blastdb
do
echo $file, $db | cut -d '-' -f 1
if [[ $file = "$db | cut -d '-' -f 1" ]]; then
echo "match"
else
echo "no match"
fi
done
done
But it does not detect matches. The output looks like this:
...
IVC6_Meino.clust.gz.fasta, IIIA11_Meova.clust.gz.fasta
no match
IVC6_Meino.clust.gz.fasta, IVC5_Mehiv.clust.gz.fasta
no match
IVC6_Meino.clust.gz.fasta, IVC6_Meino.clust.gz.fasta
no match
The last line should read match as you can see, the strings look the same.
What am i missing?

You can use parameter expansion to do this more easily:
for file in *.fasta
do
for db in *-blastdb
do
echo "$file", "$db"
if [[ "${file%%.fasta}" = "${db%%.fasta-blastdb}" ]]; then
echo "match"
else
echo "no match"
fi
done
done
If you want to fix yours, the problem is the use of $db | cut -d '-' -f 1 With echo it appears that echo is printing the pipe. It isn't. cut is printing. When you do [[ $file = "$db | cut -d '-' -f 1" ]] it is equivalent to [[ $file = [return code from last pipe component] ]]
You need to use the $(..) shell construct to capture the output of the pipe and you need to echo to get the contents of $db to start the pipe. You should quote "$db" so you do not have word splitting or globbing from the contents of the variable.
Like so:
for file in *.fasta
do
for db in *-blastdb
do
ts=$(echo "$db" | cut -d '-' -f 1)
echo "$file", "$ts"
if [[ "$file" = "$ts" ]]; then
echo "match"
else
echo "no match"
fi
done
done # this works I think -- not tested...
Please be careful with your quoting with Bash and liberally use ShellCheck.
The structure you have is also not the most efficient. You will loop over the *-blastdb glob once for every file in *-blastdb. If you have a lot of files, that could get really slow.
To solve that, you could rewrite this loop with Bash arrays (best if you have Bash 4+) or use awk:
ext1=.fasta
ext2=.fasta-blastdb
awk 'FNR==NR{
s=$0
sub("\\"ext1"$","",s)
seen[s]=$0
next}
{
s=$0
sub("\\"ext2"$","",s)
if (s in seen)
print seen[s], $0
}
' ext1="$ext1" ext2="$ext2" <(for fn in *$ext1; do echo "$fn"; done) <(for fn in *$ext2; do echo "$fn"; done)
Each glob is only executing once and awk is using an array to test if the basenames are the same.
Best

Related

Counting number of lines in file and saving it in a bash file

I am trying to loop through all the files in a folder and add the file name of those files with 10 lines to a txt file but I don't know how to write the if statement.
As of right now, what I have is:
for FILE in *.txt do if wc $FILE == 10; then "$FILE" >> saved_names.txt fi done
I am getting stuck in how to format the statement that will evaluate to a boolean for the if statement.
I have already tried the if statement as:
if [ wc $FILE != 10 ]
if "wc $FILE" != 10
if "wc $FILE != 10"
as well as other ways but I don't seem to get it right. I know I am new to Bash but I can't seem to find a solution to this question.
There are a few problems in your code.
To count the number of lines in the file you should run "wc -l" command. However, that command will result in the number of lines and the name of the file (so for example - 10 a.txt - you can test it by running the command on a file in your terminal). To receive only the number of lines you need to pass the file's name to the standard input of that command
"==" is used in bash to compare strings. To compare integers as in that case, you should use "-eq" (take a look here https://tldp.org/LDP/abs/html/comparison-ops.html)
In terms of brackets: To get the wc command result you need to run it in a terminal and switch the command in the code to the result. To do that, you need correct brackets - $(wc -l). To receive a result of the comparison as a bool, you need to use square brackets with spaces [ 1 -eq 1 ].
To save the name of the file in another file using >> you need to first put the name to the standard output (as >> redirect the standard output to the chosen place). To do that you can just use the echo command.
The code should look like this:
#!/bin/bash
for FILE in *.txt
do
if [ "$(wc -l < "$FILE")" -eq 10 ]
then
echo "$FILE" >> saved_names.txt
fi
done
Try:
for file in *.txt; do
if [[ $(wc -l < "$file") -eq 10 ]]; then
printf '%s\n' "$file"
fi
done > saved_names.txt
Change > to >> if you want to append the filenames.
Related docs:
Command Substitution
Conditional Constructs
Extract the actual number of lines from a file with wc -l $FILE | cut -f1 -d' ' and use -eq operator:
for FILE in *.txt; do if [ "$(wc -l $FILE | cut -f1 -d' ')" -eq 10 ]; then "$FILE" >> saved_names.txt; fi; done

How to cut variables which are beteween quotes from a string

I had problem with cut variables from string in " quotes. I have some scripts to write for my sys classes, I had a problem with a script in which I had to read input from the user in the form of (a="var1", b="var2")
I tried the code below
#!/bin/bash
read input
a=$($input | cut -d '"' -f3)
echo $a
it returns me a error "not found a command" on line 3 I tried to double brackets like
a=$(($input | cut -d '"' -f3)
but it's still wrong.
In a comment the OP gave a working answer (should post it as an answer):
#!/bin/bash
read input
a=$(echo $input | cut -d '"' -f2)
b=$(echo $input | cut -d '"' -f4)
echo sum: $(( a + b))
echo difference: $(( a - b))
This will work for user input that is exactly like a="8", b="5".
Never trust input.
You might want to add the check
if [[ ${input} =~ ^[a-z]+=\"[0-9]+\",\ [a-z]+=\"[0-9]+\"$ ]]; then
echo "Use your code"
else
echo "Incorrect input"
fi
And when you add a check, you might want to execute the input (after replacing the comma with a semicolon).
input='testa="8", testb="5"'
if [[ ${input} =~ ^[a-z]+=\"[0-9]+\",\ [a-z]+=\"[0-9]+\"$ ]];
then
eval $(tr "," ";" <<< ${input})
set | grep -E "^test[ab]="
else
echo no
fi
EDIT:
#PesaThe commented correctly about BASH_REMATCH:
When you use bash and a test on the input you can use
if [[ ${input} =~ ^[a-z]+=\"([0-9]+)\",\ [a-z]+=\"([0-9])+\"$ ]];
then
a="${BASH_REMATCH[1]}"
b="${BASH_REMATCH[2]}"
fi
To extract the digit 1 from a string "var1" you would use a Bash substring replacement most likely:
$ s="var1"
$ echo "${s//[^0-9]/}"
1
Or,
$ a="${s//[^0-9]/}"
$ echo "$a"
1
This works by replacing any non digits in a string with nothing. Which works in your example with a single number field in the string but may not be what you need if you have multiple number fields:
$ s2="1 and a 2 and 3"
$ echo "${s2//[^0-9]/}"
123
In this case, you would use sed or grep awk or a Bash regex to capture the individual number fields and keep them distinct:
$ echo "$s2" | grep -o -E '[[:digit:]]+'
1
2
3

Expand shell glob in variable into array

In a bash script I have a variable containing a shell glob expression that I want to expand into an array of matching file names (nullglob turned on), like in
pat='dir/*.config'
files=($pat)
This works nicely, even for multiple patterns in $pat (e.g., pat="dir/*.config dir/*.conf), however, I cannot use escape characters in the pattern. Ideally, I would like to able to do
pat='"dir/*" dir/*.config "dir/file with spaces"'
to include the file *, all files ending in .config and file with spaces.
Is there an easy way to do this? (Without eval if possible.)
As the pattern is read from a file, I cannot place it in the array expression directly, as proposed in this answer (and various other places).
Edit:
To put things into context: What I am trying to do is to read a template file line-wise and process all lines like #include pattern. The includes are then resolved using the shell glob. As this tool is meant to be universal, I want to be able to include files with spaces and weird characters (like *).
The "main" loop reads like this:
template_include_pat='^#include (.*)$'
while IFS='' read -r line || [[ -n "$line" ]]; do
if printf '%s' "$line" | grep -qE "$template_include_pat"; then
glob=$(printf '%s' "$line" | sed -nrE "s/$template_include_pat/\\1/p")
cwd=$(pwd -P)
cd "$targetdir"
files=($glob)
for f in "${files[#]}"; do
printf "\n\n%s\n" "# FILE $f" >> "$tempfile"
cat "$f" >> "$tempfile" ||
die "Cannot read '$f'."
done
cd "$cwd"
else
echo "$line" >> "$tempfile"
fi
done < "$template"
Using the Python glob module:
#!/usr/bin/env bash
# Takes literal glob expressions on as argv; emits NUL-delimited match list on output
expand_globs() {
python -c '
import sys, glob
for arg in sys.argv[1:]:
for result in glob.iglob(arg):
sys.stdout.write("%s\0" % (result,))
' _ "$#"
}
template_include_pat='^#include (.*)$'
template=${1:-/dev/stdin}
# record the patterns we were looking for
patterns=( )
while read -r line; do
if [[ $line =~ $template_include_pat ]]; then
patterns+=( "${BASH_REMATCH[1]}" )
fi
done <"$template"
results=( )
while IFS= read -r -d '' name; do
results+=( "$name" )
done < <(expand_globs "${patterns[#]}")
# Let's display our results:
{
printf 'Searched for the following patterns, from template %q:\n' "$template"
(( ${#patterns[#]} )) && printf ' - %q\n' "${patterns[#]}"
echo
echo "Found the following files:"
(( ${#results[#]} )) && printf ' - %q\n' "${results[#]}"
} >&2

Need help for string manipulation in a bash script

I'm not use to the syntax of bash script. I'm trying to read a file. For each line I want to keep only the part of the string before the delimiter '/' and put it back into a new file if the word respect a perticular length. I've download a dictionary, but the format does not meet my expectation. Since there is 84000 words, I don't really want to manualy remove what after the '/' for each word. I though it would be an easy thing and I follow couple of idea in other similar question on this site, but it seem that I'm missing something somewhere because it still doesn't work. I can't get the length right. The file Test_Input contains one word per line. Here's the code:
#!/usr/bin/bash
filename="Test_Input.txt"
while read -r line
do
sub= echo $line | cut -d '/' -f1
length= echo ${#sub}
if $length >= 4 && $length <= 10;
then echo $sub >> Test_Output.txt
fi
done < "$filename"
Several items:
I assume that you have been using single back-quotes in the assignments, and not literally sub= echo $line | cut -d '/' -f1, as this would have certainly failed. Alternatively, you can also use sub=$(), as in $(echo $line | cut -d '/' -f1)
The conditions in an if clause need to be encompassed by single or double [], like this: if [[ $length -ge 4 ]] && [[ $length -le 10 ]];
Which brings me to the next point: <= doesn't reliably work in bash. Just use -ge for "greater or equal" and -le for "less or equal".
If your line does not contain any / characters, in your version sub will contain the whole line. This might not be what you want, so I'd advise to also add the -s flag to cut.
You don't need somevar=$(echo $someothervar). Just use somevar=$someothervar
Here's a version that works:
#!/usr/bin/env bash
filename="Test_Input.txt"
while read -r line
do
sub=$(echo $line | cut -s -d '/' -f 1)
length=${#sub}
if [[ $length -ge 4 ]] && [[ $length -le 10 ]];
then echo $sub >> Test_Output.txt
fi
done < "$filename"
Of course, you could also just use sed:
sed -n -r '/^[^/]{4,10}\// s;/.*$;;p' Test_Input.txt > Test_Output.txt
Explanation:
-n Don't print anything unless explicitly marked for printing.
-r Use the extended regex
/<searchterm>/ <operation> Search for lines that match a certain criteria, and perform this operation:
Searchterm is: ^[^/]{4,10}\/ From the beginning of the line, there should be between 4 and 10 non-slash characters, followed by the slash
Operation is: s;/.*$;;p replace everything between the first slash and the end of the line with nothing, then print.
awk is the best tool for this
awk -F/ 'length($1) >= 4 && length($1) <= 10 {print $1} > newfile

Extract a certain part of a string in bash with different patterns

I have this file:
CLUSTERS=SP1,SP2,SP3
FNAME_SP1="REWARDS_BTS_SP1_<GTS>.dat"
FNAME_SP2="DUMP_LOG_SP2_<GTS>.dat"
FNAME_SP3="TEST_CASE_TABLE_SP3_<GTS>.dat"
What I want to get from these are:
REWARDS_BTS_SP1_
DUMP_LOG_SP2_
TEST_CASE_TABLE_SP3_
I loop through the CLUSTERS field, get the values, and use it to find the appropriate FNAME_<CLUSTERNAME> value. Basically, the CLUSTERS value are ALWAYS before the _<GTS> part of the string. Any string pattern will do, provided that the CLUSTERS value come before the _<GTS> at the end of the string.
Any suggestions? Here's a part of the script.
function loadClusters() {
for i in `echo ${!CLUSTER*}`
do
CLUSTER=`echo ${i} | grep $1`
if [[ -n ${CLUSTER} ]]; then
CLUSTER=${!i}
break;
fi
done
echo -e ${CLUSTER}
}
function loadClustersCampaign() {
for i in `echo ${!BPOINTS*}`
do
BPOINTS=`echo ${i} | grep $1`
if [[ -n ${BPOINTS} ]]; then
BPOINTS=${!i}
break;
fi
done
for i in `echo ${!FNAME*}`
do
FNAME=`echo ${i} | grep $1`
if [[ -n ${FNAME} ]]; then
FNAME=${!i}
break;
fi
done
echo -e ${BPOINTS}"|"${FNAME}
}
#get clusters
clusters=$(loadClusters $1)
for i in `echo $clusters | sed 's/,/ /g'`
do
file=$(loadClustersCampaign ${i/-/_} | awk -F"|" '{print $2}') ;
echo $file;
#then get the part of the $file variable
done
Fun with Shell Parameter Expansions
You can use matching-prefix notation and indirect expansion to get at the variables you want, and use the "remove suffix" expansion on each result to collect just the portions of the filename that you want. For example:
FNAME_SP1='REWARDS_BTS_SP1_<GTS>.dat'
FNAME_SP2='DUMP_LOG_SP2_<GTS>.dat'
FNAME_SP3='TEST_CASE_TABLE_SP3_<GTS>.dat'
for cluster in "${!FNAME_SP#}"; do
echo ${!cluster%%<GTS>*}
done
This will print out the following:
REWARDS_BTS_SP1_
DUMP_LOG_SP2_
TEST_CASE_TABLE_SP3_
but you could issue any valid shell command inside the loop instead of using echo.
See Also
http://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html
If you like an awk solution for this ,may be below will be useful.
> echo 'FNAME_SP1="REWARDS_BTS_SP1_<GTS>.dat"' | awk -F"<GTS>" '{split($1,a,"=\"");print substr(a[2],2)}'
REWARDS_BTS_SP1_
Furthur more detail below:
> cat temp
LUSTERS=SP1,SP2,SP3
FNAME_SP1="REWARDS_BTS_SP1_<GTS>.dat"
FNAME_SP2="DUMP_LOG_SP2_<GTS>.dat"
FNAME_SP3="TEST_CASE_TABLE_SP3_<GTS>.dat"
> awk -F"<GTS>" '/FNAME_SP/{split($1,a,"=");print substr(a[2],2)}' temp
REWARDS_BTS_SP1_
DUMP_LOG_SP2_
TEST_CASE_TABLE_SP3_
>

Resources