bash calculations with numbers from files - bash

I am trying to do a simple thing:
Get the second number on the line with the second occurrence of the word TER, lower it by one, and process it further. The tr -s ' ' is there because the file is delimited not by tabs but by varying amounts of whitespace.
My script:
first_res_atombumb= grep 'TER' tata_sbox_cuda.pdb | head -n 2 | tail -1 |tr -s ' '| cut -f 2 -d ' '
echo $((first_res_atombumb-1))
but this only returns:
255
-1
Of course I want to have 254.
Adding | tr -d '\n' does not help either. What on earth is going on? I have already asked several people at work, and no one seems to know.
The lines in question look like this:
TER 128 DA3 4
TER 255 DA3 8
and if I run grep 'TER' tata_sbox_cuda.pdb | head -n 2 | tail -1 | tr -s ' '| cut -f 2 -d ' ' on the command line, I get what I expect, just 255.

With bash, I'd write
n_ter=0
while read -a words; do
if [[ ${words[0]} == TER ]] && (( ++n_ter == 2 )); then
echo $(( ${words[1]} - 1 ))
fi
done < file
but I'd use awk
awk '$1 == "TER" && ++n == 2 {print $2 - 1}' file
The problem with your code: you forgot to use the $() command substitution syntax
first_res_atombumb= grep 'TER' tata_sbox_cuda.pdb | head -n 2 | tail -1 |tr -s ' '| cut -f 2 -d ' '
# .................^...............................................................................^
echo $((first_res_atombumb-1))
You're setting the variable to an empty string in the environment of the grep command. Then, since you're not capturing the output of that pipeline, "255" is printed to the terminal. Because the variable is unset in your current shell, you get echo $((-1))
All you need is:
first_res_atombumb=$(grep 'TER' tata_sbox_cuda.pdb | head -n 2 | tail -1 |tr -s ' '| cut -f 2 -d ' ')
# .................^^...............................................................................^
But I'd still use awk.
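To make the difference concrete, here is a minimal sketch. The temporary file and the variable names value, leaked and result are only stand-ins for illustration; the two sample lines are the ones quoted in the question.

```shell
#!/usr/bin/env bash
# Hypothetical sample data standing in for tata_sbox_cuda.pdb
pdb=$(mktemp)
printf 'TER   128   DA3   4\nTER     255  DA3   8\n' > "$pdb"

# Without $(...): "value=" is only placed in the environment of grep,
# so the variable is still unset in the current shell afterwards.
unset value
value= grep 'TER' "$pdb" > /dev/null
leaked=${value-unset}
echo "after 'value= cmd': $leaked"

# With $(...): the pipeline's output is captured into the variable.
value=$(grep 'TER' "$pdb" | head -n 2 | tail -n 1 | tr -s ' ' | cut -d ' ' -f 2)
result=$((value - 1))
echo "$result"

rm -f "$pdb"
```

The first echo reports the variable as unset, which is why the original script ended up evaluating $((-1)).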

If I understand your problem correctly you can solve it using AWK:
awk 'BEGIN{v=0} $1 == "TER" {v++;if (v==2) {print $2-1 ;exit}}' tata_sbox_cuda.pdb
Explanation:
BEGIN{v=0} declares the counter and sets it to zero (optional in awk, since an unset variable already evaluates to 0).
$1 == "TER" runs the block in {} only for lines whose first field is TER.
{v++;if (v==2) {print $2-1 ;exit}} increases v and checks whether it is 2, i.e. whether this is the second occurrence of TER; if so, it subtracts 1 from the second field, prints the result, and exits (which skips the remaining lines and makes processing faster).
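The same idea can be sketched with the occurrence number passed in as a parameter. The function name nth_ter_minus_one, the awk variables want and seen, and the extra sample lines are assumptions made for this illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical sample data; only the two TER lines come from the question.
pdb=$(mktemp)
printf 'ATOM 1 X 1\nTER 128 DA3 4\nATOM 2 X 5\nTER 255 DA3 8\nTER 300 DA3 9\n' > "$pdb"

# Print (second field - 1) of the Nth line whose first field is TER, then stop reading.
nth_ter_minus_one() {
  awk -v want="$1" '$1 == "TER" && ++seen == want { print $2 - 1; exit }' "$2"
}

second=$(nth_ter_minus_one 2 "$pdb")
third=$(nth_ter_minus_one 3 "$pdb")
echo "second: $second, third: $third"

rm -f "$pdb"
```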

Related

Extract number in every line of TSV file

I have a file with tab-separated-values and also with blank spaces like this:
! (desambiguación) http://es.dbpedia.org/resource/!_(desambiguación) 5
! (álbum) http://es.dbpedia.org/resource/!_(álbum_de_Trippie_Redd) 2
!! http://es.dbpedia.org/resource/!! 4
$9.99 http://es.dbpedia.org/resource/$9.99 6
Tomlinson http://es.dbpedia.org/resource/(10108)_Tomlinson 20
102 Miriam http://es.dbpedia.org/resource/(102)_Miriam 2
2003 QQ47 http://es.dbpedia.org/resource/(143649)_2003_QQ47 2
I want to extract the last number of every line:
5
2
4
6
20
2
2
For that, I have done this:
while read line;
do
NUMBER=$(echo $line | cut -f 3 -d ' ')
echo $NUMBER
done < $PAIRCOUNTS_FILE
The main problem is that some lines have more spaces than others, and cut doesn't work for me with the default delimiter (tab). I don't know why; maybe it is because I am using WSL.
I have tried cut with several options, but none of them works:
NUMBER=$(echo $line | cut -f 3 -d ' ')
NUMBER=$(echo $line | cut -f 4 -d ' ')
NUMBER=$(echo $line | cut -f 2)
NUMBER=$(echo $line | cut -f 3)
Hope you can help me with this. Thanks in advance.
I want to extract the last number of every line:
You could use grep
grep -Eo '[[:digit:]]+$' file
Or mapfile aka readarray which is a bash4+ feature.
mapfile -t array < file
printf '%s\n' "${array[@]##*[[:space:]]}"
You can use awk:
awk '{print $NF}' file
With cut (if it is truly TAB separated and 3 fields per line):
cat file | cut -f3
If you have some variable number of fields per line, use rev|cut|rev to get the last field:
cat file | rev | cut -f1 | rev
Or with pure Bash and parameter expansion:
while IFS= read -r line; do
last=${line##*$'\t'} # $'\t' is a TAB; this strips everything up to the last TAB
printf "%s\n" "$last";
done <file
Or, read into a bash array and echo the last field:
while IFS=$'\t' read -r -a arr; do
echo "${arr[${#arr[@]}-1]}"
done <file
If you have a mixture of tabs and spaces you can do what usually is a mistake and break a Bash variable on white spaces in general (tabs and spaces) into an array:
while IFS= read -r line; do
arr=($line) # break on either tab or space without quotes
echo "${arr[${#arr[@]}-1]}"
done <file
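As a quick check, here is a sketch that runs three of the approaches above on the same data. The temporary file and the variable names are hypothetical; the two sample lines are taken from the question, with a TAB between fields as the asker describes.

```shell
#!/usr/bin/env bash
# Two TAB-separated sample lines whose first field contains spaces.
tsv=$(mktemp)
printf '! (desambiguación)\thttp://es.dbpedia.org/resource/!_(desambiguación)\t5\n' > "$tsv"
printf '102 Miriam\thttp://es.dbpedia.org/resource/(102)_Miriam\t2\n' >> "$tsv"

# awk splits on any run of blanks, so $NF is always the last field.
with_awk=$(awk '{print $NF}' "$tsv")

# cut with its default TAB delimiter, assuming exactly three fields per line.
with_cut=$(cut -f3 "$tsv")

# Pure bash: split on TAB into an array and print the last element.
with_bash=$(while IFS=$'\t' read -r -a fields; do
  printf '%s\n' "${fields[${#fields[@]}-1]}"
done < "$tsv")

printf '%s\n' "$with_awk"
rm -f "$tsv"
```

All three variables should hold the same two lines here. The awk form is the most forgiving if the real file mixes tabs and spaces.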

how to awk pattern as variable and loop the result?

I assign a keyword to a variable, and I need to run awk on a file using this variable and loop over the result. The file has millions of lines.
I have tried the code below.
DEVICE="DEV2"
while read -r line
do
echo $line
X_keyword=`echo $line | cut -d ',' -f 2 | grep -w "X" | cut -d '=' -f2`
echo $X_keyword
done <<< "$(grep -w $DEVICE $config)"
log="Dev2_PRT.log"
while read -r file
do
VALUE=`echo $file | cut -d '|' -f 1`
HEADER=`echo $VALUE | cut -c 1-4`
echo $file
if [[ $HEADER = 'PTR:' ]]; then
VALUE=`echo $file | cut -d '|' -f 4`
echo $VALUE
XCOORD+=($VALUE)
((X++))
fi
done <<< "awk /$X_keyword/ $log"
expected result:
The log file contains many lines like the ones below:
PTR:1|2|3|4|X_keyword
PTR:1|2|3|4|Y_rest .....
Filter on the X_keyword and get field number 4.
Unfortunately your shell script is simply the wrong approach to this problem (see https://unix.stackexchange.com/q/169716/133219 for some of the reasons why) so you should set it aside and start over.
To demonstrate the solution, let's create a sample input file:
$ seq 10 | tee file
1
2
3
4
5
6
7
8
9
10
and a shell variable to hold a regexp that's a character list of the chars 5, 6, or 7:
$ var='[567]'
Now, given the above input, here is the solution for how to g/re/p pattern as variable and count how many results:
$ awk -v re="$var" '$0~re{print; c++} END{print "---" ORS c+0}' file
5
6
7
---
3
If that's not all you need then please edit your question to clarify your requirements and provide concise, testable sample input and expected output.
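Applied to the log format shown in the question, the same -v technique might look like the sketch below. This is a guess at the requirements, not a confirmed solution: the sample log lines, the assumption that the keyword sits in field 5, and the array name xcoord are all illustrative. mapfile needs bash 4 or later.

```shell
#!/usr/bin/env bash
# Hypothetical log content modelled on the two lines shown in the question.
log=$(mktemp)
printf 'PTR:1|2|3|4|X_keyword\nPTR:1|2|3|9|Y_rest\nPTR:5|6|7|8|X_keyword\n' > "$log"

keyword='X_keyword'

# One awk pass: keep PTR: lines whose fifth field matches the regexp, print field 4.
mapfile -t xcoord < <(awk -F'|' -v re="$keyword" '/^PTR:/ && $5 ~ re { print $4 }' "$log")

echo "${#xcoord[@]} matches: ${xcoord[*]}"
rm -f "$log"
```

A single awk pass avoids spawning several cut processes per line, which matters for a file with millions of lines.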

Echo the command result in a file.txt

I have a script such as :
cat list_id.txt | while read line; do for ACC in $line;
do
echo -n "$ACC\t"
curl -s "link=fasta&retmode=xml" |\
grep TSeq_taxid |\
cut -d '>' -f 2 |\
cut -d '<' -f 1 |\
tr -d "\n"
echo
sleep 0.25
done
done
This script allows me from a list of ID in list_id.txt to get the corresponding names in a database in https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${ACC}&rettype=fasta&retmode=xml
So from this script I get something like
CAA42669\t9913
V00181\t7154
AH002406\t538120
What I would like is to print or echo this result directly into a file called new_ids.txt. I tried echo >> new_ids.txt but the file is empty.
Thanks for your help.
A minimal refactoring of your script might look like
# Avoid useless use of cat
# Use read -r
# Don't use upper case for private variables
while read -r line; do
for acc in $line; do
# printf expands \t to a real TAB; bash's echo -n prints it literally
printf '%s\t' "$acc"
# No backslash necessary after | character
curl -s "link=fasta&retmode=xml" |
# Probably use a proper XML parser for this
grep TSeq_taxid |
cut -d '>' -f 2 |
cut -d '<' -f 1 |
tr -d "\n"
echo
sleep 0.25
done
done <list_id.txt >new_ids.txt
This could probably still be simplified significantly, but without knowledge of what your input file looks like exactly, or what curl returns, this is somewhat speculative.
tr -s ' \t\n' '\n' <list_id.txt |
while read -r acc; do
curl -s "link=fasta&retmode=xml" |
awk -v acc="$acc" '/TSeq_taxid/ {
split($0, a, /[<>]/); print acc "\t" a[3] }'
sleep 0.25
done >new_ids.txt
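The key point, redirecting the output of the whole loop once after done, can be tried offline with the sketch below. The function lookup_taxid is a hypothetical stand-in for the curl pipeline (it returns the values from the question's sample output), and the temporary files replace list_id.txt and new_ids.txt.

```shell
#!/usr/bin/env bash
ids=$(mktemp)   # stands in for list_id.txt
out=$(mktemp)   # stands in for new_ids.txt
printf 'CAA42669 V00181\nAH002406\n' > "$ids"

# Hypothetical stand-in for the curl | grep | cut pipeline; no network access.
lookup_taxid() {
  case $1 in
    CAA42669) echo 9913 ;;
    V00181)   echo 7154 ;;
    *)        echo 538120 ;;
  esac
}

# Split the IDs one per line, then redirect everything the loop prints in one go.
tr -s ' \t\n' '\n' < "$ids" |
while read -r acc; do
  printf '%s\t%s\n' "$acc" "$(lookup_taxid "$acc")"
done > "$out"

result=$(cat "$out")
printf '%s\n' "$result"
rm -f "$ids" "$out"
```

Because the redirection is attached to done, every command inside the loop writes to the same file, and the file is opened only once.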

While Read Line - Limit Number of Lines

I am trying to limit the number of lines found during a while read line loop. For example:
File: order.csv
123456,ORDER1,NEW
123456,ORDER-2,NEW
123456,ORDER-3,SHIPPED
I am doing the following.
cat order.csv | while read line;
do
order=$(echo $line | cut -d "," -f 1)
status=$(echo $line | cut -d "," -f 3)
echo "$order:$status"
done
Which outputs:
123456:NEW
123456:NEW
123456:SHIPPED
How can I limit the number of lines? In this case there are three. How can I limit them to two, so that only the first two are displayed?
Desired output:
123456:NEW
123456:NEW
There are some ways to meet your requirements:
Method 1
Use head to display first few lines of a file.
head -n 2 order.csv | while read line;
do
order=$(echo $line | cut -d "," -f 1)
status=$(echo $line | cut -d "," -f 3)
echo "$order:$status"
done
Method 2
Use a for loop.
for i in {1..2}
do
read line
order=$(echo $line | cut -d "," -f 1)
status=$(echo $line | cut -d "," -f 3)
echo "$order:$status"
done < order.csv
Method 3
Use awk.
awk -F, 'NR <= 2 { print $1":"$3 }' order.csv
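A further variant, sketched here as an assumption rather than as one of the answer's methods, lets read split the fields itself and stops with a counter. The names limit, count, name and output are illustrative.

```shell
#!/usr/bin/env bash
# Sample data from the question, written to a temporary file.
csv=$(mktemp)
printf '123456,ORDER1,NEW\n123456,ORDER-2,NEW\n123456,ORDER-3,SHIPPED\n' > "$csv"

limit=2
count=0
output=''

# IFS=, makes read split each line on commas, so no cut is needed.
while IFS=, read -r order name status; do
  # Stop once "limit" lines have been processed.
  (( count++ < limit )) || break
  output+="$order:$status"$'\n'
done < "$csv"

printf '%s' "$output"
rm -f "$csv"
```

Feeding the loop with done < file rather than a pipe keeps it in the current shell, so variables such as output are still set after the loop ends.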

shell scripting do loop

Sorry, I'm new to Unix, but I'm wondering whether there is any way I can turn the following code into a loop. For example, the file name would change each time, from 1 to 50.
My script is
cut -d ' ' -f5- cd1_abcd_w.txt > cd1_rightformat.txt ;
sed 's! \([^ ]\+\)\( \|$\)!\1 !g' cd1_rightformat.txt ;
sed -i 's/ //g' cd1_rightformat.txt;
cut -d ' ' -f1-4 cd1_abcd_w.txt > cd1_extrainfo.txt ;
I would like to make this into a loop where cd1_abcd_w.txt would then become cd2_abcd_w.txt and output would be cd2_rightformat.txt etc...all the way to 50.
So essentially cd$i.
Many thanks
In bash, you can use brace expansion:
for num in {1..10}; do
echo ${num}
done
Similar to a BASIC for i = 1 to 10 loop, it's inclusive at both ends, so that loop will output the numbers 1 through 10.
You then just replace the echo command with whatever you need to do, such as:
cut -d ' ' -f5- cd${num}_abcd_w.txt >cd${num}_rightformat.txt
# and so on
If you need the numbers less than ten to have a leading zero, change the expression in the for loop to be {01..50} instead. That doesn't appear to be the case here but it's very handy to know.
Also in the not-needed-but-handy-to-know category, you can also specify an increment if you don't want to use the default of one:
pax> for num in {1..50..9}; do echo ${num}; done
1
10
19
28
37
46
(equivalent to the BASIC for i = 1 to 50 step 9).
This should work:
for((i=1;i<=50;i++));do
cut -d ' ' -f5- cd${i}_abcd_w.txt > cd${i}_rightformat.txt ;
sed 's! \([^ ]\+\)\( \|$\)!\1 !g' cd${i}_rightformat.txt ;
sed -i 's/ //g' cd${i}_rightformat.txt;
cut -d ' ' -f1-4 cd${i}_abcd_w.txt > cd${i}_extrainfo.txt ;
done
This would work in bash:
for i in $(seq 50)
do
# ${i} needs the braces: $i_abcd_w would be read as a variable named i_abcd_w
cut -d ' ' -f5- cd${i}_abcd_w.txt > cd${i}_rightformat.txt;
sed 's! \([^ ]\+\)\( \|$\)!\1 !g' cd${i}_rightformat.txt;
sed -i 's/ //g' cd${i}_rightformat.txt;
cut -d ' ' -f1-4 cd${i}_abcd_w.txt > cd${i}_extrainfo.txt;
done
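One detail worth checking when building file names like these is where the variable name ends. A minimal sketch, with the value 7 and the names wrong and right chosen only for illustration:

```shell
#!/usr/bin/env bash
i=7
unset i_abcd_w   # make sure no variable with this name exists

# Without braces the shell looks up a variable called i_abcd_w, which is empty.
wrong="cd$i_abcd_w.txt"

# With braces the name ends at i, and the rest is literal text.
right="cd${i}_abcd_w.txt"

echo "$wrong"
echo "$right"
```

An underscore is a valid character in a variable name, so the braces are what tell the shell that the name is just i.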
