Using sed to extract the middle of a line/filename - bash

I have multiple files named:
Genus_species_strain.fasta
I want to use sed to print out:
Genus
species
strain
I want to use the "printed" words in a command like this (prokka is a tool for genome annotation):
prokka $file --outdir `echo $file | sed s/\.fasta//` --genus `echo $file | sed s/_.*\.fasta//` --species `echo $file | sed <something here>` --strain `echo $file | sed <something here>`
I would appreciate the help. I am very new to all of this, and as you see above, I only know how to print out Genus.
Below I have some additional questions (no need to answer these if it only complicates things further). This is one of my attempts to print species, and the questions are the following:
sed s/.*_//1 | sed s/_.*\.fasta//
I know the second command isn't correct. I assume it needs to start from the second _, but I don't know how to do that, since the continuation (that is .fasta) is unique.
When used alone, sed s/.*_//1 returns strain.fasta. How to make it not skip the first _?
Combining commands (either as you see above, or with ;) doesn't seem to work for me.

You can use string splitting with string manipulation:
file='Genus_species_strain.fasta'
IFS='_.' read -r genus species strain _ <<< "$file"
outdir="${file%.*}"
Then you can use the variables in the command:
prokka "$file" --outdir "$outdir" --genus "$genus" --species "$species" --strain "$strain"
See this online demo:
#!/bin/bash
file='Genus_species_strain.fasta'
IFS='_.' read -r genus species strain _ <<< "$file"
echo "${file%.*}" # outdir
echo "$genus"
echo "$species"
echo "$strain"
Output:
Genus_species_strain
Genus
species
strain
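Since the question mentions multiple files, here is a minimal sketch of how the same split could be applied in a loop. It uses only parameter expansion instead of read, and it prints the prokka command rather than running it. The function name build_prokka_cmd is made up for this example, and it assumes every name has the form Genus_species_strain.fasta.

```shell
# Split a name of the form Genus_species_strain.fasta and print
# (not run) the prokka command that would be built from it.
# build_prokka_cmd is a hypothetical helper name.
build_prokka_cmd() {
    local file="$1" base genus rest species strain
    base=${file%.*}        # strip the extension: Genus_species_strain
    genus=${base%%_*}      # everything before the first _
    rest=${base#*_}        # everything after the first _
    species=${rest%%_*}    # first field of what is left
    strain=${rest#*_}      # the remainder (may itself contain _)
    printf 'prokka %s --outdir %s --genus %s --species %s --strain %s\n' \
        "$file" "$base" "$genus" "$species" "$strain"
}

# Dry run over every .fasta file in the current directory.
for f in *.fasta; do
    [ -e "$f" ] || continue    # skip the literal pattern if nothing matches
    build_prokka_cmd "$f"
done
```

Once the printed commands look right, the printf line can be swapped for the real prokka call.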

One-liners without setting multiple variables
Using sed capture groups:
One liner
file='Genus_species_strain.fasta'
$(echo "$file" | sed 's/^\(\([^_]*\)_\([^_]*\)_\([^_.]*\)\)\..*$/prokka & --outdir \1 --genus \2 --species \3 --strain \4/')
Using Bash string manipulation:
One liner
file='Genus_species_strain.fasta'
prokka "$file" --outdir "${file%.*}" --genus "${file%%_*}" --species "$(f=${file#*_}; echo "${f%%_*}")" --strain "$(f=${file#*_*_}; echo "${f%.*}")"
Awk one liner
file='Genus_species_strain.fasta'
$(echo "$file" | awk -F'[_.]' -v var="$file" '{print "prokka " var " --outdir " $1 "_" $2 "_" $3 " --genus " $1 " --species " $2 " --strain " $3}')
You can now use any of the commands above inside a loop, or with xargs, with the file variable set to each filename in turn.
Each one builds a prokka command and executes it directly.
Hope it works for you. Accept the answer if it is more efficient.

Using sed
$ file=path_to_file
$ sed "s/\(\([^_]*\)_\([^_]*\)_\([^.]*\)\).*/prokka $file --outdir \1 --genus \2 --species \3 --strain \4/e" <(echo *.fasta)
Output of command executed
prokka path_to_file --outdir Genus_species_strain --genus Genus --species species --strain strain
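On the sub-questions about the original attempt: sed s/.*_// returns strain.fasta because .* is greedy, so it swallows everything up to the last underscore. Using [^_]* instead stops at the next underscore. A minimal sketch, assuming the three-part Genus_species_strain.fasta naming from the question (the variable names are only for illustration):

```shell
file='Genus_species_strain.fasta'

# [^_]* cannot cross an underscore, so each group stays within one field.
genus=$(printf '%s\n' "$file"   | sed 's/^\([^_]*\)_.*/\1/')
species=$(printf '%s\n' "$file" | sed 's/^[^_]*_\([^_]*\)_.*/\1/')
strain=$(printf '%s\n' "$file"  | sed 's/^[^_]*_[^_]*_\(.*\)\.fasta$/\1/')

# Two commands can be combined with ; as long as the script is quoted:
# drop everything up to the first _, then everything from the next _ on.
species2=$(printf '%s\n' "$file" | sed 's/^[^_]*_//; s/_.*//')

printf '%s\n' "$genus" "$species" "$strain" "$species2"
```

Quoting the sed script also matters: unquoted, the shell may interpret the backslashes, the semicolon, or the asterisk before sed ever sees them.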

Related

find and replace multiple lines of string using sed

I have an input file containing the following numbers
-45.0005
-43.0022
-41.002
.
.
.
I have a target txt file
line:12 Angle=30
line:42 Angle=60
line:72 Angle=90
.
.
.
Using sed, I want to replace the first Angle entry in the target file with the first entry from the input file, the second Angle entry with the second entry from the input file, and so on.
Expected output:
line:12 Angle=-45.0005
line:42 Angle=-43.0022
line:72 Angle=-41.002
.
.
.
This is what I have managed to write but I am not getting the expected output
a=`head -1 temp.txt`
#echo $a
sed -i "12s/Angle = .*/Angle = $a/g" $procfile
for i in {2..41..1}; do
for j in {42..1212..30}; do
c=$(( $i - 1 ))
#echo "this is the value of c: $c"
b=`head -$i temp.txt | tail -$c`
#echo "This is the value of b: $b"
sed -i "$js/Angle = .*/Angle = $b/g" $procfile 2> /dev/null
done
done
Could you help me improve the script?
Thanks!
You can keep a counter i and use it as the line address in sed, so that the i-th value replaces the Angle entry on line i of the target file.
i=0
while read -r line; do
i=$((i+1))
sed -i "${i}s/Angle=.*/Angle=${line}/" "$procfile"
done < temp.txt
So I guess you want to paste the files, i.e. merge them line by line, and then replace the field with a regex, for example:
paste target_file input_file | sed 's/\(Angle=\)[^\t]*\t/\1/'
This might work for you (GNU sed):
sed '/Angle=/R inputFile' targetFile | sed '/Angle=/{N;s/=.*\n/=/}'
In the first sed invocation append the input line.
In the second sed invocation remove the original angle and the newline delimiter.
pr might help here, please try this:
pr -m -t target input | sed -r 's/(Angle=)[^[:space:]]+[[:space:]]+/\1/'
Please note: this works for the first two files you showed; your code assumes a different input, e.g. spaces around "=".
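If the Angle lines are not on consecutive lines, a single awk pass can count them instead of relying on line numbers. This is only a sketch under assumptions: the function name replace_angles is made up, the entries are written as Angle=value with no spaces around "=", and the result goes to standard output rather than being edited in place.

```shell
# replace_angles VALUES TARGET   (hypothetical helper)
# VALUES: one replacement value per line.
# TARGET: file containing Angle=... lines, possibly mixed with other lines.
replace_angles() {
    awk '
        NR == FNR { val[FNR] = $0; next }   # first file: remember the values
        /Angle=/ {
            n++                             # n-th Angle line seen so far
            if (n in val) sub(/Angle=.*/, "Angle=" val[n])
        }
        { print }
    ' "$1" "$2"
}

# Small demonstration with throwaway files.
tmp=$(mktemp -d)
printf '%s\n' -45.0005 -43.0022 -41.002 > "$tmp/input.txt"
printf '%s\n' 'header' 'Angle=30' 'other' 'Angle=60' 'Angle=90' > "$tmp/target.txt"
replace_angles "$tmp/input.txt" "$tmp/target.txt"
rm -r "$tmp"
```

Angle lines beyond the last available value are left unchanged. To update the file itself, redirect the output to a temporary file and move it over the original.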
So I was able to come up with this solution
#!/bin/bash
infile=$1
cp $infile ORIG_${infile}
grep "Angle = " $infile | sed 's/Angle = //g' | sort -n > temp.txt
iMax=`cat temp.txt | wc -l`
jMax=`grep -n "Angle = " $infile | tail -1 | sed 's/:.*//g'`
for ((i=1,j=12; i<=${iMax} && j<=${jMax};i+=1,j+=30));do
a=`head -$i temp.txt | tail -1`
sed -i "${j}s/Angle = .*/Angle = $a/g" $infile
done
rm temp.txt
Many thanks to william pursell for clarifying the syntax for incrementing var counts in bash.

Sed replace substring only if expression exist

In a bash script, I am trying to remove the directory name in filenames :
documents/file.txt
direc/file5.txt
file2.txt
file3.txt
So I try to first see if there is a "/" and if yes delete everything before :
for i in **/*.scss *.scss; do
echo "$i" | sed -n '^/.*\// s/^.*\///p'
done
But it doesn't work for files in the current directory, it gives me a blank string.
I get :
file.txt
file5.txt
When you only want the filename, use basename instead of sed.
$ basename /path/to/file
returns file.
See the basename man page for details.
Your sed attempt is basically fine, but you should print regardless of whether you performed a substitution; take out the -n and the p at the end. (Also there was an unrelated syntax error.)
Also, don't needlessly loop over all files.
printf '%s\n' **/*.scss *.scss |
sed 's%^.*/%%'
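To make the difference concrete, here is a small sketch using the sample names from the question. With -n and the p flag, sed prints only lines where a substitution happened, which is why names without a slash came out blank. The helper name strip_dir is made up for illustration.

```shell
# Print the last path component; names without a slash pass through unchanged.
# strip_dir is a hypothetical helper name.
strip_dir() {
    printf '%s\n' "$1" | sed 's%^.*/%%'
}

for p in documents/file.txt direc/file5.txt file2.txt; do
    strip_dir "$p"              # sed, without -n and without the p flag
    printf '%s\n' "${p##*/}"    # same result with parameter expansion only
done

# For comparison, this prints nothing, because no substitution takes place:
printf '%s\n' 'file2.txt' | sed -n 's%^.*/%%p'
```

The parameter expansion form avoids starting a sed process for every name.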
This can also be done with awk.
Example:
echo "1/2/i.py" | awk 'BEGIN {FS="/"} {print $NF}'
output: i.py
Eventually, I did :
for i in **/*.scss *.scss; do
# for i in *.scss; do
# for i in _hm-globals.scss; do
name=${i##*/} # remove dir name
name=${name%.scss} # remove extension
name=`echo "$name" | sed "s/^_hm-//"` # remove the _hm- prefix, if any
if [[ $name = *"."* ]]; then
name=`echo "$name" | sed 's/\./-/'` # replace the first . with -
fi
echo "$name" >&2
done

Unix file pattern issue: append changing value of variable pattern to copies of matching line

I have a file with contents:
abc|r=1,f=2,c=2
abc|r=1,f=2,c=2;r=3,f=4,c=8
I want a result like below:
abc|r=1,f=2,c=2|1
abc|r=1,f=2,c=2;r=3,f=4,c=8|1
abc|r=1,f=2,c=2;r=3,f=4,c=8|3
The third column value is r value. A new line would be inserted for each occurrence.
I have tried with:
for i in `cat $xxxx.txt`
do
#echo $i
live=$(echo $i | awk -F " " '{print $1}')
home=$(echo $i | awk -F " " '{print $2}')
echo $live
done
but it is not working properly. I am a beginner with sed/awk and not sure how to use them. Can someone please help with this?
awk to the rescue!
$ awk -F'[,;|]' '{c=0;
for(i=2;i<=NF;i++)
if(match($i,/^r=/)) a[c++]=substr($i,RSTART+2);
delim=substr($0,length($0))=="|"?"":"|";
for(i=0;i<c;i++) print $0 delim a[i]}' file
abc|r=1,f=2,c=2|1
abc|r=1,f=2,c=2;r=3,f=4,c=8|1
abc|r=1,f=2,c=2;r=3,f=4,c=8|3
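A variant of the same idea, as a sketch only: instead of splitting into fields, it walks along each line with match() and prints one copy of the line per r=<digits> occurrence. The function name expand_r is made up, and the sketch assumes that "r=" never appears as the tail of a longer key (something like bar=1 would also match).

```shell
# expand_r [FILE...]   (hypothetical helper)
# For every r=<digits> on a line, print the line followed by |<digits>.
expand_r() {
    awk '{
        rest = $0
        while (match(rest, /r=[0-9]+/)) {
            # RSTART + 2 skips the "r=" part of the match
            print $0 "|" substr(rest, RSTART + 2, RLENGTH - 2)
            rest = substr(rest, RSTART + RLENGTH)
        }
    }' "$@"
}

printf '%s\n' 'abc|r=1,f=2,c=2' 'abc|r=1,f=2,c=2;r=3,f=4,c=8' | expand_r
```

Lines without any r= value produce no output, which matches the behaviour of the field-based version.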
Use an inner routine (made up of GNU grep, sed, and tr) to compile a second more elaborate sed command, the output of which needs further cleanup with more sed. Call the input file "foo".
sed -n $(grep -no 'r=[0-9]*' foo | \
sed 's/^[0-9]*/&s#.*#\&/;s/:r=/|/;s/.*/&#p;/' | \
tr -d '\n') foo | \
sed 's/|[0-9|]*|/|/'
Output:
abc|r=1,f=2,c=2|1
abc|r=1,f=2,c=2;r=3,f=4,c=8|1
abc|r=1,f=2,c=2;r=3,f=4,c=8|3
Looking at the inner sed code:
grep -no 'r=[0-9]*' foo | \
sed 's/^[0-9]*/&s#.*#\&/;s/:r=/|/;s/.*/&#p;/' | \
tr -d '\n'
Its purpose is to parse foo on-the-fly (when foo changes, so will the output), and in this instance come up with:
1s#.*#&|1#p;2s#.*#&|1#p;2s#.*#&|3#p;
Which is almost perfect, but it leaves in old data on the last line:
sed -n '1s#.*#&|1#p;2s#.*#&|1#p;2s#.*#&|3#p;' foo
abc|r=1,f=2,c=2|1
abc|r=1,f=2,c=2;r=3,f=4,c=8|1
abc|r=1,f=2,c=2;r=3,f=4,c=8|1|3
The leftover |1 on the last line is what the final sed 's/|[0-9|]*|/|/' removes.
Here is a pure bash solution. I wouldn't recommend actually using this, but it might help you understand better how to work with files in bash.
# Iterate over each line, splitting into three fields
# using | as the delimiter. (f3 is only there to make
# sure a trailing | is not included in the value of f2)
while IFS="|" read -r f1 f2 f3; do
# Create an array of variable groups from $f2, using ;
# as the delimiter
IFS=";" read -a groups <<< "$f2"
for group in "${groups[@]}"; do
# Get each variable from the group separately
# by splitting on ,
IFS=, read -a vars <<< "$group"
for var in "${vars[@]}"; do
# Split each assignment on =, create
# the variable for real, and quit once we
# have found r
IFS== read name value <<< "$var"
declare "$name=$value"
[[ $name == r ]] && break
done
# Output the desired line for the current value of r
printf '%s|%s|%s\n' "$f1" "$f2" "$r"
done
done < $xxxx.txt
Changes for ksh:
read -A instead of read -a.
typeset instead of declare.
If <<< is a problem, you can use a here document instead. For example:
IFS=";" read -A groups <<EOF
$f2
EOF

Weird bash results using cut

I am trying to run this command:
./smstocurl SLASH2.911325850268888.911325850268896
smstocurl script:
#SLASH2.911325850268888.911325850268896
model=$(echo \&model=$1 | cut -d'.' -f 1)
echo $model
imea1=$(echo \&simImea1=$1 | cut -d'.' -f 2)
echo $imea1
imea2=$(echo \&simImea2=$1 | cut -d'.' -f 3)
echo $imea2
echo $model$imea1$imea2
Result Received
&model=SLASH2911325850268888911325850268896
Result Expected
&model=SLASH2&simImea1=911325850268888&simImea2=911325850268896
What am I missing here ?
You are cutting on the dot, but cut sees the whole echoed string, prefix included. In the first case, field 1 is &model=SLASH2, so the prefix survives and is printed.
In the other cases you ask for fields 2 and 3 (-f 2, -f 3), while the &simImea1= and &simImea2= text sits in field 1, so it gets cut off.
Instead, I would use something like this:
while IFS="." read -r model imea1 imea2
do
printf "&model=%s&simImea1=%s&simImea2=%s\n" "$model" "$imea1" "$imea2"
done <<< "$1"
Note the use of printf with variables, which gives more control over what is written. Using a lot of escapes, as in your echo commands, can be risky.
Test
while IFS="." read -r model imea1 imea2; do printf "&model=%s&simImea1=%s&simImea2=%s\n" "$model" "$imea1" "$imea2"
done <<< "SLASH2.911325850268888.911325850268896"
Returns:
&model=SLASH2&simImea1=911325850268888&simImea2=911325850268896
Alternatively, this sed command does it:
sed -r 's/^([^.]*)\.([^.]*)\.([^.]*)$/\&model=\1\&simImea1=\2\&simImea2=\3/' <<< "$1"
by catching each block of words separated by dots and printing back.
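To see what cut actually receives in the original script, here is a small sketch. It also shows one way to keep the cut approach: cut the bare argument first and add the prefixes afterwards. The function name build_query is made up for illustration.

```shell
arg='SLASH2.911325850268888.911325850268896'

# cut sees the whole echoed string, prefix included:
printf '%s\n' "&simImea1=$arg" | cut -d. -f1   # &simImea1=SLASH2
printf '%s\n' "&simImea1=$arg" | cut -d. -f2   # 911325850268888 (prefix is gone)

# Cutting first and adding the prefixes afterwards keeps both parts.
# build_query is a hypothetical helper name.
build_query() {
    local m i1 i2
    m=$(printf '%s\n' "$1" | cut -d. -f1)
    i1=$(printf '%s\n' "$1" | cut -d. -f2)
    i2=$(printf '%s\n' "$1" | cut -d. -f3)
    printf '&model=%s&simImea1=%s&simImea2=%s\n' "$m" "$i1" "$i2"
}

build_query "$arg"
```

This starts three cut processes per call, so the read-based loop is the lighter option when many values are processed.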
You can also do it this way.
Run:
./program SLASH2.911325850268888.911325850268896
Script:
#!/bin/bash
String=`echo $1 | sed "s/\./\&simImea1=/"`
String=`echo $String | sed "s/\./\&simImea2=/"`
echo "&model=$String"
Output:
&model=SLASH2&simImea1=911325850268888&simImea2=911325850268896
awk way
awk -F. '{print "&model="$1"&simImea1="$2"&simImea2="$3}' <<< "SLASH2.911325850268888.911325850268896"
or
awk -F. '$0="&model="$1"&simImea1="$2"&simImea2="$3' <<< "SLASH2.911325850268888.911325850268896"
output
&model=SLASH2&simImea1=911325850268888&simImea2=911325850268896

bash: grep only lines with certain criteria

I am trying to grep out the lines in a file where the third field matches certain criteria.
I tried using grep but had no luck in filtering out by a field in the file.
I have a file full of records like this:
12794357382;0;219;215
12795287063;0;220;215
12795432063;0;215;220
I need to grep only the lines where the third field is equal to 215 (in this case, only the third line)
Thanks a lot in advance for your help!
Put down the hammer.
$ awk -F ";" '$3 == 215 { print $0 }' <<< $'12794357382;0;219;215\n12795287063;0;220;215\n12795432063;0;215;220'
12795432063;0;215;220
grep:
grep -E "^[^;]*;[^;]*;215;" yourFile
in this case, awk would be easier:
awk -F';' '$3==215' yourFile
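If the field number or the value changes often, the awk test can be wrapped in a small function. This is a sketch only; the name filter_field and its argument order are made up for this example.

```shell
# filter_field N VALUE [FILE...]   (hypothetical helper)
# Print the lines whose N-th ;-separated field equals VALUE.
filter_field() {
    local n="$1" want="$2"
    shift 2
    # $n picks the field whose number is stored in the awk variable n
    awk -F';' -v n="$n" -v want="$want" '$n == want' "$@"
}

printf '%s\n' '12794357382;0;219;215' \
              '12795287063;0;220;215' \
              '12795432063;0;215;220' | filter_field 3 215
```

Unlike a grep pattern, the comparison looks at one field only, so a 215 in the fourth field does not match.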
A solution in pure bash for the pre-processing, still needing a grep:
while read line; do
OLD_IFS=$IFS; IFS=";"
line_array=( $line )
IFS=$OLD_IFS
test "${line_array[2]}" = 215 && echo "$line"
done < file | grep _your_pattern_
Simple egrep (=grep -E)
egrep ';215;[0-9][0-9][0-9]$' /path/to/file
or
egrep ';215;[[:digit:]]{3}$' /path/to/file
How about something like this:
while read -r line; do
if [ "$(echo "$line" | cut -d ";" -f 3)" = "215" ]; then
# This is the line you want
echo "$line"
fi
done < your_file
Here is the sed version to grep for lines where 3rd field is 215:
sed -n '/^[^;]*;[^;]*;215;/p' file.txt
Simplify your problem by putting the 3rd field at the beginning of the line:
cut -d ";" -f 3 file | paste -d ";" - file
then grep for the lines matching the 3rd field and remove the 3rd field at the beginning:
grep "^215;" | cut -d ";" -f 2-
and then you can grep for whatever you want. So the complete solution is:
cut -d ";" -f 3 file | paste -d ";" - file | grep "^215;" | cut -d ";" -f 2- | grep _your_pattern_
Advantage: Easy to understand; drawback: many processes.

Resources