Weird bash results using cut - bash

I am trying to run this command:
./smstocurl SLASH2.911325850268888.911325850268896
smstocurl script:
#SLASH2.911325850268888.911325850268896
model=$(echo \&model=$1 | cut -d'.' -f 1)
echo $model
imea1=$(echo \&simImea1=$1 | cut -d'.' -f 2)
echo $imea1
imea2=$(echo \&simImea2=$1 | cut -d'.' -f 3)
echo $imea2
echo $model$imea1$imea2
Result Received
&model=SLASH2911325850268888911325850268896
Result Expected
&model=SLASH2&simImea1=911325850268888&simImea2=911325850268896
What am I missing here?

You are cutting on the dot (.) after the prefix has already been added. In the first case, field 1 is &model=SLASH2, which still contains the prefix, so it is printed.
However, in the other cases you ask for the 2nd and 3rd fields (-f2, -f3), which hold only the numbers, so the &simImea1= and &simImea2= text is cut off.
Instead, I would use something like this:
while IFS="." read -r model imea1 imea2
do
printf "&model=%s&simImea1=%s&simImea2=%s\n" "$model" "$imea1" "$imea2"
done <<< "$1"
Note the use of printf with variables, which gives more control over what is written. Using a lot of escapes, as in your echo commands, can be risky.
Test
while IFS="." read -r model imea1 imea2; do printf "&model=%s&simImea1=%s&simImea2=%s\n" "$model" "$imea1" "$imea2"
done <<< "SLASH2.911325850268888.911325850268896"
Returns:
&model=SLASH2&simImea1=911325850268888&simImea2=911325850268896
Alternatively, this sed command does it:
sed -r 's/^([^.]*)\.([^.]*)\.([^.]*)$/\&model=\1\&simImea1=\2\&simImea2=\3/' <<< "$1"
It captures each dot-separated block and prints it back with its prefix.
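The original cut approach can also be repaired rather than replaced: cut the raw argument first and add the prefixes afterwards. This is a minimal sketch, assuming a hypothetical helper named build_query (it is not part of the question's script) and input of the form MODEL.IMEI1.IMEI2:

```shell
# build_query is a hypothetical helper name used only for illustration.
build_query() {
  # Split the raw argument on "." before any prefix is added.
  model=$(printf '%s\n' "$1" | cut -d'.' -f1)
  imea1=$(printf '%s\n' "$1" | cut -d'.' -f2)
  imea2=$(printf '%s\n' "$1" | cut -d'.' -f3)
  # Add the prefixes only when printing, so cut can never remove them.
  printf '&model=%s&simImea1=%s&simImea2=%s\n' "$model" "$imea1" "$imea2"
}

build_query 'SLASH2.911325850268888.911325850268896'
```

Inside a script, calling build_query "$1" should give the expected line while keeping the variable names from the question.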

You can also do it this way:
Run:
./program SLASH2.911325850268888.911325850268896
Script:
#!/bin/bash
String=$(echo "$1" | sed "s/\./\&simImea1=/")
String=$(echo "$String" | sed "s/\./\&simImea2=/")
echo "&model=$String"
Output:
&model=SLASH2&simImea1=911325850268888&simImea2=911325850268896

awk way
awk -F. '{print "&model="$1"&simImea1="$2"&simImea2="$3}' <<< "SLASH2.911325850268888.911325850268896"
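or, relying on the fact that the assignment to $0 yields a non-empty string (a true pattern), so awk prints the rewritten record: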
awk -F. '$0="&model="$1"&simImea1="$2"&simImea2="$3' <<< "SLASH2.911325850268888.911325850268896"
output
&model=SLASH2&simImea1=911325850268888&simImea2=911325850268896

Related

How can I parallelize my loop ? (fasta file)

I wrote a script to change specific lines in text files (FASTA format), and I want to parallelize it because there are a lot of lines (~800k). A line to change looks like this:
>CTC14_37541|M00842:336:000000000-C7WWK:1:2101:20913:9309:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=75;p=71|CO
And I want to transform it to:
>Sample-CTC14_Read37541
I have two problems.
I tried running my script with and without a function:
Without the function, it works: all the lines I want to change are modified.
With the function, only one line is modified. Is something wrong in my header() function?
The second problem is the parallelization. I tried something with "&", but I'm not sure that is the best solution. Any ideas?
My code without function and parallel:
#!/bin/bash
TMP_PATH="/path/where/is/my/fasta"
cd $TMP_PATH
for fasta in *.fasta
do
echo $fasta
lines=$(grep ">" $fasta)
for line in $lines
do
if [[ $line = *">"* ]]; then
read_nb="_Read"$(echo $line | cut -d'|' -f1 | cut -d'_' -f2)
sample=$(echo $line | cut -d'_' -f1 | cut -d'>' -f2)
newheader=$(echo ">Sample-$sample$read_nb")
sed -i -e "s/$line/$newheader/g" $fasta
sed -i -e "s/ /\n/g" $fasta
fi
done
done
echo "END"
My code with function and parallel:
#!/bin/bash
TMP_PATH="/path/where/is/my/fasta"
cd $TMP_PATH
n=0
maxjobs=500
header(){
if [[ $line = *">"* ]]; then
read_nb="_Read"$(echo $line | cut -d'|' -f1 | cut -d'_' -f2)
sample=$(echo $line | cut -d'_' -f1 | cut -d'>' -f2)
newheader=$(echo ">Sample-$sample$read_nb")
sed -i -e "s/$line/$newheader/g" $fasta
sed -i -e "s/ /\n/g" $fasta
fi
}
for fasta in *.fasta
do
lines=$(grep ">" $fasta)
for line in $lines
do
header $line &
#limit jobs
if (( $(($((++n)) % $maxjobs)) == 0 )) ; then
wait
echo $n wait
fi
done
done
I have a FASTA file as input that contains several headers and sequences, and I want to transform the headers so that I can use the file in a specific workflow. I need to go from this:
>CTC14_18758|M00842:336:000000000-C7WWK:1:1108:17474:5670:0:66|o:98|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=66;p=62|CO:0|
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGCGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCTTGGGGAGCAAACAGG
>CTC14_20535|M00842:336:000000000-C7WWK:1:1108:28568:20175:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=77;p=64|CO:0|
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACCCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG
>CTC14_24700|M00842:336:000000000-C7WWK:1:1110:7911:9824:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=77;p=71|CO:0|
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG
To this:
>Sample-CTC14_Read18758
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGCGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCTTGGGGAGCAAACAGG
>Sample-CTC14_Read20535
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACCCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG
>Sample-CTC14_Read24700
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG
I want to run this in parallel because I have a lot of lines to change (~700-800k), and it takes a very long time if the script processes the file line by line.
With my script without the function, the job works but takes too long.
With my script with the function and parallel jobs, the job does not work properly: only one header is changed in my FASTA file instead of all of them, and I don't understand why. I tried different ways to write and call my function, but the result is always the same.
I also tried GNU parallel, with the same result. I think there is a problem in my function or in how I call it, but I can't see where.
I think using awk, as you suggested, is a good idea, but I'm not comfortable with it. Can you help me, please?
The actual format of my FASTA file is:
>CTC14_1600|M00842:336:000000000-C7WWK:1:1101:26089:18004:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=77;p=71|CO:0| TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGACGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG$
>CTC14_11169|M00842:336:000000000-C7WWK:1:1105:11636:11876:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=76;p=65|CO:0| TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG$
>CTC14_16471|M00842:336:000000000-C7WWK:1:1107:6941:10486:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=77;p=70|CO:0| TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGGCGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAGCTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTGAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG$
Assuming that each header (the line starting with >CTC14_18758|M00842:336:000000000-) is on its own line, this sed script will convert the input to the output.
#!/bin/sed -f
#skip blank lines
/^[[:space:]]*$/n
#change >CTC14_18758|M00842:336:000000000-
# to >Sample-CTC14_Read18758
s/^>/>Sample-/
s/_/_Read/
/^>/s/|.*$//
# remove 2ndary header
# C7WWK:1:1108:17474:5670:0:66|o:98|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=66;p=62|CO:0| TGGGGAATATTGGAC...
# to
# TGGGGAATATTGGAC...
s/^[^>].*| //
Save that as a file/script.
Then mark it as executable with
chmod +x mySed
and run it like
./mySed -i fileIn
Or, if you get a warning/error message about -i, then run
./mySed fileIn > fileOut && mv fileOut fileIn
Now you can eliminate your header() function and the inner loop in your code.
Just use:
for file in *.fasta ; do
echo "processing file=$file"
/path/to/mySed -i "$file"
# run other processing if needed
# don't think you need wait any more
#uncomment? wait
done
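Since the question mentions awk, here is a minimal awk sketch of the same header rewrite, assuming each header is on its own line as in the first sample. The function name rename_headers is hypothetical. It also sidesteps what is most likely the cause of the "only one header changed" symptom: every backgrounded sed -i rewrites the whole file through a temporary copy, so concurrent runs on the same file overwrite each other's results.

```shell
# rename_headers is a hypothetical name; it reads FASTA on stdin and writes to stdout.
rename_headers() {
  # Split header lines on ">", "_" and "|": $2 is the sample, $3 the read number.
  awk -F'[>_|]' '
    /^>/ { print ">Sample-" $2 "_Read" $3; next }  # rewrite header lines
         { print }                                  # pass sequence lines through
  '
}

# Small demonstration with one header and a shortened, made-up sequence line.
printf '%s\n' \
  '>CTC14_18758|M00842:336:000000000-C7WWK:1:1108:17474:5670:0:66|o:98|' \
  'TGGGGAATATTGG' |
  rename_headers
```

A single pass like this reads each file once instead of running sed once per header, which is usually where the time goes. If parallelism is still wanted, running one job per file (rather than one per line) avoids several processes writing to the same file.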
-------------- version 2 sed ---------------
#!/bin/sed -f
#skip blank lines
/^[[:space:]]*$/n
#>CTC14_18758|M00842:336:000000000-C7WWK:1:1108:17474:5670:0:66|o:98|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=66;p=62|CO:0| TGGGGA...
#change >CTC14_18758|M00842:336:000000000-
# to >Sample-CTC14_Read18758
s/^>/>Sample-/
s/_/_Read/
s/|.*| / /
# /^>/s/-.*| / /
# s/-.*| / /
works with data like
>CTC14_16471|M00842:336:000000000-C7WWK:1:1107:6941:10486:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=77;p=70|CO:0| TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGGCGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAGCTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTGAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG
IHTH

Splitting a text in Unix

I am writing a simple script that splits a variable holding some text, using the code below:
#!/bin/sh
SAMPLE_TEXT=hello.world.testing
echo $SAMPLE_TEXT
OUT_VALUE=$SAMPLE_TEXT | cut -d'.' -f1
echo output is $OUT_VALUE
I expect the output to be "output is hello", but when I run this program I get "output is". Where is my mistake?
To evaluate a command and store it into a variable, use var=$(command).
All together, your code works like this:
SAMPLE_TEXT="hello.world.testing"
echo "$SAMPLE_TEXT"
OUT_VALUE=$(echo "$SAMPLE_TEXT" | cut -d'.' -f1)
# OUT_VALUE=$(cut -d'.' -f1 <<< "$SAMPLE_TEXT") <--- alternatively
echo "output is $OUT_VALUE"
Also, note that I quoted every variable expansion. Quoting prevents word splitting and globbing, so it is good practice in general.
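To see why the original line prints nothing, it may help to run both forms side by side. This is a small sketch, assuming bash or dash, where each part of a pipeline runs in a subshell; the variable names BROKEN and FIXED are made up for the demonstration:

```shell
SAMPLE_TEXT="hello.world.testing"

# Form from the question: the assignment is merely the first command of a
# pipeline. It runs in a subshell and prints nothing, so cut reads empty
# input and the variable in the current shell is never set.
BROKEN=""
BROKEN=$SAMPLE_TEXT | cut -d'.' -f1
printf 'broken: [%s]\n' "$BROKEN"

# Command substitution captures the output of the whole pipeline.
FIXED=$(printf '%s\n' "$SAMPLE_TEXT" | cut -d'.' -f1)
printf 'fixed:  [%s]\n' "$FIXED"
```

The first printf should show empty brackets and the second should show hello.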
Other approaches:
$ sed -r 's/([^\.]*).*/\1/g' <<< "$SAMPLE_TEXT"
hello
$ awk -F. '{print $1}' <<< "$SAMPLE_TEXT"
hello
$ echo "${SAMPLE_TEXT%%.*}"
hello
The answer by fedorqui is the correct answer. Just adding another approach...
$ SAMPLE_TEXT=hello.world.testing
$ IFS=. read OUT_VALUE _ <<< "$SAMPLE_TEXT"
$ echo output is $OUT_VALUE
output is hello
Just to expand on @anishane's comment to his own answer:
$ SAMPLE_TEXT="hello world.this is.a test string"
$ IFS=. read -ra words <<< "$SAMPLE_TEXT"
$ printf "%s\n" "${words[@]}"
hello world
this is
a test string
$ for idx in "${!words[@]}"; do printf "%d\t%s\n" "$idx" "${words[idx]}"; done
0 hello world
1 this is
2 a test string

Split String in Unix Shell Script

I have a String like this
//ABC/REC/TLC/SC-prod/1f9/20/00000000957481f9-08d035805a5c94bf
and want to get the last part:
00000000957481f9-08d035805a5c94bf
Let's say you have
text="//ABC/REC/TLC/SC-prod/1f9/20/00000000957481f9-08d035805a5c94bf"
If you know the position, i.e. in this case the 9th, you can go with
echo "$text" | cut -d'/' -f9
However, if the depth is dynamic and you want everything after the last "/", it's safer to go with:
echo "${text##*/}"
This removes everything from the beginning to the last occurrence of "/" and should be the shortest form to do it.
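The related expansion operators behave differently, which a short sketch can make concrete. The variable names last, rest, dir and trailing are only illustrative:

```shell
text="//ABC/REC/TLC/SC-prod/1f9/20/00000000957481f9-08d035805a5c94bf"

last=${text##*/}  # longest prefix matching "*/" removed: last component only
rest=${text#*/}   # shortest prefix matching "*/" removed: just the first "/"
dir=${text%/*}    # shortest suffix matching "/*" removed: all but the last component

printf '%s\n' "$last" "$rest" "$dir"

# Caveat: with a trailing slash the "last component" is the empty string.
trailing="dir/sub/"
printf '[%s]\n' "${trailing##*/}"
```

These expansions are part of POSIX sh, so they work in bash, dash and ash alike and start no external process.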
For more information on this see: Bash Reference manual
For more information on cut see: cut man page
The tool basename does exactly that:
$ basename //ABC/REC/TLC/SC-prod/1f9/20/00000000957481f9-08d035805a5c94bf
00000000957481f9-08d035805a5c94bf
I would use bash parameter expansion:
$ string="//ABC/REC/TLC/SC-prod/1f9/20/00000000957481f9-08d035805a5c94bf"
$ echo "${string##*/}"
00000000957481f9-08d035805a5c94bf
But here are some other options:
$ awk -F'/' '$0=$NF' <<< "$string"
00000000957481f9-08d035805a5c94bf
$ sed 's#.*/##g' <<< "$string"
00000000957481f9-08d035805a5c94bf
Note: <<< is here-string notation. Here-strings do not create a subshell; however, they are NOT portable to POSIX sh (as implemented by shells such as ash or dash).
In case you want more than just the last part of the path,
you could do something like this:
echo "$PWD" | rev | cut -d'/' -f1-2 | rev
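The same "last two components" result can be sketched with parameter expansion alone, which avoids depending on rev being installed. The function name last_two is hypothetical, and the sketch assumes the path has at least two components and no trailing slash:

```shell
# last_two is a hypothetical helper: prints the last two components of a path.
last_two() {
  path=$1
  parent=${path%/*}   # drop the last component
  # Last component of the parent, then last component of the full path.
  printf '%s/%s\n' "${parent##*/}" "${path##*/}"
}

last_two "/usr/local/bin"
```

For /usr/local/bin this prints local/bin, the same result the rev and cut pipeline gives for that path.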
You can use this BASH regex:
s='//ABC/REC/TLC/SC-prod/1f9/20/00000000957481f9-08d035805a5c94bf'
[[ "$s" =~ [^/]+$ ]] && echo "${BASH_REMATCH[0]}"
00000000957481f9-08d035805a5c94bf
This can be done easily in awk:
string="//ABC/REC/TLC/SC-prod/1f9/20/00000000957481f9-08d035805a5c94bf"
echo "${string}" | awk -v FS="/" '{ print $NF }'
Use "/" as field separator and print the last field.
You can try this...
echo //ABC/REC/TLC/SC-prod/1f9/20/00000000957481f9-08d035805a5c94bf |awk -F "/" '{print $NF}'

bash read loop only reading first line of input variable

I have a read loop that is reading a variable but not behaving the way I expect. I want to read every line of my variable and process each one. Here is my loop:
while read -r line
do
echo $line | sed 's/<\/td>/<\/td>$/g' | cut -d'$' -f2,3,4 >> file.txt
done <<< "$TABLE"
I expect it to process every line of the variable, but instead it only processes the first one. If the loop body is simply echo $line >> file.txt, it works as expected. What's going on here? How do I get the behavior I want?
It seems your lines are delimited by \r instead of \n.
Use this while loop to iterate over the input with read -d $'\r':
while read -rd $'\r' line; do
echo "$line" | sed 's~</td>~</td>$~g' | cut -d'$' -f2,3,4 >> file.txt
done <<< "$TABLE"
If $TABLE contains a multi-line string, I recommend
printf '%s\n' "$TABLE" |
while read -r line; do
echo "$line" | sed 's/<\/td>/<\/td>$/g' | cut -d'$' -f2,3,4 >> file.txt
done
This is also more portable since the '<<<' operator for here-strings is not POSIX.
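If the rows really are separated by carriage returns, another portable option is to convert them to newlines before the loop. This is a minimal sketch under that assumption; the sample TABLE value and the function name split_rows are invented for the demonstration:

```shell
# Hypothetical sample: three rows separated by carriage returns only.
TABLE=$(printf 'row1\rrow2\rrow3')

# split_rows is a hypothetical helper that prints one "row: ..." line per row.
split_rows() {
  # tr turns every CR into a newline so read sees one row per iteration.
  printf '%s\n' "$1" | tr '\r' '\n' | while IFS= read -r line; do
    # Skip the empty lines that CRLF input would leave behind.
    if [ -n "$line" ]; then
      printf 'row: %s\n' "$line"
    fi
  done
}

split_rows "$TABLE"
```

Piping the variable through od -c is a quick way to check whether it actually contains \r characters before choosing between this and the plain newline loop.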

Using values of variables in on-the-spot shell commands (using ``)

I have a shell script that for-loops over input to get a number and string. If I want to test the number in the loop, can I cut the looped-over variable to get the number? For example, something like:
for line in input
do
num=`cut -f1 $line`
...
done
If not, how else can I accomplish this?
Instead of:
num=`cut -f1 $line`
You can do:
num=$(echo "$line" | cut -f1)
OR else using awk:
num=$(awk '{print $1}' <<< "$line")
OR using pure BASH (this splits at the first space, whereas cut -f1 splits on tabs by default):
num=${line%% *}
Your command cut -f1 $line will try to cut the first column from a file whose name is the value of $line.
Is this what you want instead?
while read -r number str
do
echo "$number"
echo "$str"
done < input
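Since the goal was to test the number inside the loop, the read-based loop can be extended as in the sketch below. The function name classify, the threshold of 10 and the sample data are all made up for illustration, and the numbers are assumed to be integers:

```shell
# classify is a hypothetical helper: reads "number string" pairs from stdin.
classify() {
  while read -r number str; do
    # read already split the line, so no cut is needed to get the number.
    if [ "$number" -gt 10 ]; then
      printf '%s is large (%s)\n' "$str" "$number"
    else
      printf '%s is small (%s)\n' "$str" "$number"
    fi
  done
}

printf '5 apple\n42 banana\n' | classify
```

With a real file, classify < input would feed the same loop.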

Resources