Arithmetic operations using numbers from grep - bash

I have FILE from which I can extract two numbers using grep. The numbers appear in the last column.
$ grep number FILE
number1: 123
number2: 456
I would like to assign the numbers to variables, e.g. $num1 and $num2, and do some arithmetic operations using the variables.
How can I do this using bash commands?

Assumptions:
we want to match on lines that start with the string number
we will always find 2 matches for ^number from the input file
not interested in storing values in an array
Sample data:
$ cat file.dat
number1: 123
not a number: abc
number: 456
We'll use awk to find the desired values and print all to a single line of output:
$ awk '/^number/ { printf "%s ",$2 }' file.dat
123 456
From here we can use read to load the variables:
$ read -r num1 num2 < <(awk '/^number/ { printf "%s ",$2 }' file.dat)
$ typeset -p num1 num2
declare -- num1="123"
declare -- num2="456"
$ echo ".${num1}.${num2}."
.123.456.
NOTE: periods added as visual delimiters
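With num1 and num2 populated, bash arithmetic expansion handles the rest; a quick sketch using the sample values above:
$ echo $(( num1 + num2 ))
579
$ diff=$(( num2 - num1 ))   # store a result in a variable
$ echo "$diff"
333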

Firstly, you need to extract the numbers from the file. Assuming that the file is always in the format stated, you can use a while loop combined with the read command to read the numbers into a named variable, one row at a time.
You can then use the $(( )) operator to perform integer arithmetic to keep a running total of the incoming numbers.
For example:
#!/bin/bash
declare -i total=0                # -i declares an integer.
while read -r discard number; do  # read returns false at EOF. discard is ignored.
    total=$((total+number))       # Variables don't need '$' prefix in this case.
done < FILE                       # the redirection feeds FILE to 'read' via stdin.
echo "Total is: ${total}"

Related

how to assign each of multiple lines in a file as different variable?

This is probably a very simple question. I looked at other answers but couldn't come up with a solution. I have a 365-line date file, as below:
01-01-2000
02-01-2000
I need to read this file line by line and assign each day to a separate variable, like this:
d001=01-01-2000
d002=02-01-2000
I tried while read commands but couldn't get them to work. Assigning them one by one takes a lot of time. How can I do it quickly?
Trying to create hundreds of individually named variables is a waste of time and not really supported; better to use an associative array:
#!/bin/bash
declare -A array
while read -r line; do
    printf -v key 'd%03d' $((++c))
    array[$key]=$line
done < file
Output
for i in "${!array[@]}"; do echo "key=$i value=${array[$i]}"; done
key=d001 value=01-01-2000
key=d002 value=02-01-2000
Assumptions:
an array is acceptable
array index should start with 1
Sample input:
$ cat sample.dat
01-01-2000
02-01-2000
03-01-2000
04-01-2000
05-01-2000
One bash/mapfile option:
unset d # make sure variable is not currently in use
mapfile -t -O1 d < sample.dat # load each line from file into separate array location
This generates:
$ typeset -p d
declare -a d=([1]="01-01-2000" [2]="02-01-2000" [3]="03-01-2000" [4]="04-01-2000" [5]="05-01-2000")
$ for i in "${!d[@]}"; do echo "d[$i] = ${d[i]}"; done
d[1] = 01-01-2000
d[2] = 02-01-2000
d[3] = 03-01-2000
d[4] = 04-01-2000
d[5] = 05-01-2000
In OP's code, references to $d001 now become ${d[1]}.
A quick one-liner would be:
eval $(awk 'BEGIN{cnt=0}{printf "d%3.3d=\"%s\"\n",cnt,$0; cnt++}' your_file)
eval makes the shell variables known inside your script or shell. Use echo $d000 to show the first one of the newly defined variables. There should be no shell special characters (like * and $) inside your_file. Remove eval $() to see the result of the awk command. The \" quoted %s is to allow spaces in the variable values. If you don't have any spaces in your_file you can remove the \" before and after %s.
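For instance, with the two date lines shown in the question, the awk command on its own (eval removed) emits the assignments it is about to evaluate:
$ awk 'BEGIN{cnt=0}{printf "d%3.3d=\"%s\"\n",cnt,$0; cnt++}' your_file
d000="01-01-2000"
d001="02-01-2000"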

How can I assign each column value to its name?

I have a MetaData.csv file that contains many values needed for an analysis. All I want is:
1- Read the column names and create variables named after them.
2- Put the values in each column into the corresponding variable so they can be read by other commands: column_name=Its_value
MetaData.csv:
MAF,HWE,Geno_Missing,Inds_Missing
0.05,1E-06,0.01,0.01
I wrote the following codes but it doesn't work well:
#!/bin/bash
Col_Names=$(head -n 1 MetaData.csv)   # Cut header (comma sep)
Col_Names=$(echo ${Col_Names//,/ })   # Convert header to space sep
Col_Names=($Col_Names)                # Convert header to an array
for i in $(seq 1 ${#Col_Names[@]}); do
    N="$(head -1 MetaData.csv | tr ',' '\n' | nl | grep -w "${Col_Names[$i]}" | tr -d " " | awk -F " " '{print $1}')"
    ${Col_Names[$i]}="$(cat MetaData.csv | cut -d"," -f$N | sed '1d')"
done
Output:
HWE=1E-06: command not found
Geno_Missing=0.01: command not found
Inds_Missing=0.01: command not found
cut: 2: No such file or directory
cut: 3: No such file or directory
cut: 4: No such file or directory
=: command not found
Expected output:
MAF=0.05
HWE=1E-06
Geno_Missing=0.01
Inds_Missing=0.01
Problems:
1- I want to use the array length (${#Col_Names[@]}) as the final iteration value, which is 5, but the array index starts from 0 (0-4), so the MAF column is not captured by the loop. The loop also iterates twice (once over 0-4 and again over 2-4!).
2- When I try to call the values in the variables (echo $MAF), they are empty!
Any solution is really appreciated.
This produces the expected output you posted from the sample input you posted:
$ awk -F, -v OFS='=' 'NR==1{split($0,hdr); next} {for (i=1;i<=NF;i++) print hdr[i], $i}' MetaData.csv
MAF=0.05
HWE=1E-06
Geno_Missing=0.01
Inds_Missing=0.01
If that's not all you need then edit your question to clarify your requirements.
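If you do need the values as shell variables rather than printed output, one way (a sketch, assuming the column names are valid shell identifiers and the values contain no embedded whitespace) is to feed the same awk output to read in the current shell:
while IFS='=' read -r name value; do
    printf -v "$name" '%s' "$value"
done < <(awk -F, -v OFS='=' 'NR==1{split($0,hdr); next} {for (i=1;i<=NF;i++) print hdr[i], $i}' MetaData.csv)
echo "MAF=$MAF HWE=$HWE"    # MAF=0.05 HWE=1E-06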
If I'm understanding your requirements correctly, would you please try something like:
#!/bin/bash
nr=1                                   # initialize input line number to 1
while IFS=, read -r -a ary; do         # split the line on "," and assign the fields to "ary"
    if (( nr == 1 )); then             # handle the header line
        col_names=("${ary[@]}")        # assign column names
    else                               # handle the body lines
        for (( i = 0; i < ${#ary[@]}; i++ )); do
            printf -v "${col_names[i]}" "${ary[i]}"
            # assign the input field to the variable named "${col_names[i]}"
        done
        # now you can access the values via their column names
        echo "Fnames=$Fnames"
        echo "MAF=$MAF"
        fname_list+=("$Fnames")        # create a list of Fnames
    fi
    (( nr++ ))                         # increment the input line number
done < MetaData.csv
echo "${fname_list[@]}"                # print the list of Fnames
Output:
Fnames=19.vcf.gz
MAF=0.05
Fnames=20.vcf.gz
MAF=
Fnames=21.vcf.gz
MAF=
Fnames=22.vcf.gz
MAF=
19.vcf.gz 20.vcf.gz 21.vcf.gz 22.vcf.gz
The statement IFS=, read -r -a ary is mostly equivalent to your
first three lines; it splits the input on "," and assigns the
field values to the array variable ary.
There are several ways to use a variable's value as a variable name
(Indirect Variable References). printf -v VarName Value is one of them.
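A minimal illustration of both directions, independent of the CSV file:
name="MAF"
printf -v "$name" '%s' '0.05'   # assigns the variable whose name is stored in $name, i.e. MAF=0.05
echo "${!name}"                 # indirect expansion: prints 0.05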
[EDIT]
Based on the OP's updated input file, here is an another version:
#!/bin/bash
nr=1                                   # initialize input line number to 1
while IFS=, read -r -a ary; do         # split the line on "," and assign the fields to "ary"
    if (( nr == 1 )); then             # handle the header line
        col_names=("${ary[@]}")        # assign column names
    else                               # handle the body lines
        for (( i = 0; i < ${#ary[@]}; i++ )); do
            printf -v "${col_names[i]}" "${ary[i]}"
            # assign the input field to the variable named "${col_names[i]}"
        done
    fi
    (( nr++ ))                         # increment the input line number
done < MetaData.csv
for n in "${col_names[@]}"; do         # iterate over the variable names
    echo "$n=${!n}"                    # print variable name and its value
done
# you can also specify the variable names literally as follows:
echo "MAF=$MAF HWE=$HWE Geno_Missing=$Geno_Missing Inds_Missing=$Inds_Missing"
Output:
MAF=0.05
HWE=1E-06
Geno_Missing=0.01
Inds_Missing=0.01
MAF=0.05 HWE=1E-06 Geno_Missing=0.01 Inds_Missing=0.01
As for the output, the first four lines are printed by echo "$n=${!n}" and the last line is printed by echo "MAF=$MAF ....
You can choose either statement depending on your usage of the variables in the following code.
I don't really think you can implement a robust CSV reader/parser in Bash, but you can make one work to some extent with simple CSV files. For example, a very simple Bash-implemented CSV parser might look like this:
#!/bin/bash
set -e
ROW_NUMBER='0'
HEADERS=()
while IFS=',' read -ra ROW; do
    if test "$ROW_NUMBER" == '0'; then
        for (( I = 0; I < ${#ROW[@]}; I++ )); do
            HEADERS["$I"]="${ROW[I]}"
        done
    else
        declare -A DATA_ROW_MAP
        for (( I = 0; I < ${#ROW[@]}; I++ )); do
            DATA_ROW_MAP[${HEADERS["$I"]}]="${ROW[I]}"
        done
        # DEMO {
        echo -e "${DATA_ROW_MAP['Fnames']}\t${DATA_ROW_MAP['Inds_Missing']}"
        # } DEMO
        unset DATA_ROW_MAP
    fi
    ROW_NUMBER=$((ROW_NUMBER + 1))
done
Note that it has multiple disadvantages:
it only works with ,-separated fields (truly "C"SV);
it cannot handle multiline records;
it cannot handle field escapes;
it considers the first row always represents a header row.
This is why many commands produce and consume \0-delimited data instead, simply because that control character is easier to handle. Now what I'm not sure about is whether test is the only external command executed by bash (I believe it is, but it could probably be re-implemented using case so that no external test is executed?).
Example of use (with the demo output):
./read-csv.sh < MetaData.csv
19.vcf.gz 0.01
20.vcf.gz
21.vcf.gz
22.vcf.gz
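To illustrate the field-escape limitation mentioned above: a quoted field containing a comma gets split in the middle, because IFS=',' read knows nothing about CSV quoting (hypothetical input):
$ IFS=',' read -ra ROW <<< 'x,"hello, world",y'
$ declare -p ROW
declare -a ROW=([0]="x" [1]="\"hello" [2]=" world\"" [3]="y")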
I wouldn't recommend using this parser at all, but would recommend using a more CSV-oriented tool (Python would probably be the easiest choice to use; or, if your favorite language, as you mentioned, is R, then this is probably another option for you: Run R script from command line).

Bash command to read a line based on the parameters I pass - perform column-based lookups

I have a file links.txt:
1 a.sh
3 b.sh
6 c.sh
4 d.sh
So, if I pass 1,4 as parameters to another file (master.sh), a.sh and d.sh should be stored in a variable.
sed '3!d' would print the 3rd line, but not the line that starts with 3. For that, you need sed '/^3 /!d'. The problem is you can't combine them for more lines, as this means "Delete everything that doesn't start with a 3", which means all other lines will be missed. So, use sed -n '/^3 /p' instead, i.e. don't print by default and tell sed what lines to print, not what lines to delete.
You can loop over the argument and create a sed script from them that prints the lines, then run sed using this output:
#!/bin/bash
file=$1
shift
for id in "$@" ; do
    echo "/^$id /p"
done | sed -nf- "$file"
Run as script.sh filename 3 4.
If you want to remove the id from the output, you can either use
cut -f2 -d' '
or you can modify the generated sed script to do the work
echo "/^$id /s/.* //p"
i.e. only print if the substitution was successful.
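Putting it together (assuming the script is saved as script.sh in the current directory), the matched names can be captured into a variable as the question asks:
$ ./script.sh links.txt 1 4
1 a.sh
4 d.sh
$ files=$(./script.sh links.txt 1 4 | cut -f2 -d' ')
$ echo $files
a.sh d.sh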
This loops through each argument and greps for it in the links file. The result is piped into cut where we specify the delimiter as a space with -d flag and the field number as 2 with -f flag. Finally this is appended to the array called files.
links="links.txt"
files=()
for arg in $#; do
files=("${files[#]}" `grep "^$arg" "$links" | cut -d" " -f2`)
done;
echo ${files[#]}
Usage:
$ ./master.sh 1 4
a.sh d.sh
Edit:
As pointed out by mklement0, the solution above reads the file once per arg. The following first builds the pattern then reads the file just once.
links="links.txt"
pattern="^$1\s"
for arg in ${#:2}; do
pattern+="|^$arg\s"
done
files=$(grep -E "$pattern" "$links" | cut -d" " -f2)
echo ${files[#]}
Usage:
$ ./master.sh 1 4
a.sh d.sh
Here is another example with grep and cut:
#!/bin/bash
for line in $(grep "$1\|$2" links.txt | cut -d' ' -f2)
do
    echo $line
done
Example of usage:
./master.sh 1 4
a.sh
d.sh
Why not just store the values and call them at will:
items=()
while read -r num file
do
    items[num]="$file"
done < links.txt

for arg
do
    echo "${items[arg]}"
done
Now you can use the items array any time you like :)
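For example, with the snippet saved as master.sh and the links.txt from the question:
$ ./master.sh 1 4
a.sh
d.sh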
The following awk solution:
preserves the argument order; that is, the results reflect the order in which the lookup values were specified (as opposed to the order in which the lookup values happen to occur in the file).
If that is not important (i.e., if outputting the results in file order is acceptable), the readarray technique below can be combined with this one-liner, which is a generalized variant of Panta's answer:
grep -f <(printf "^%s\n" "$#") links.txt | cut -d' ' -f2-
performs well, because the input file is only read once; the only requirement is that all key-value pairs fit into memory as a whole (as a single associative Awk array (dictionary)).
works with any lookup values that don't have embedded whitespace.
Similarly, the assumption is that the output column values (containing values such as a.sh in the sample input) have no embedded whitespace. awk doesn't handle quoted fields well, so more work would be needed.
#!/bin/bash
readarray -t files < <(
  awk -v idList="$*" '
    BEGIN { count=split(idList, idArr); for (i in idArr) idDict[idArr[i]]++ }
    $1 in idDict { idDict[$1] = $2 }
    END { for (i=1; i<=count; ++i) print idDict[idArr[i]] }
  ' links.txt
)
# Print results.
printf '%s\n' "${files[@]}"
readarray -t files reads stdin input (<) line by line into array variable files.
Note: readarray requires Bash v4+; on Bash 3.x, such as on macOS, replace this part with
IFS=$'\n' read -d '' -ra files
<(...) is a Bash process substitution that, loosely speaking, presents the output from the enclosed command as if it were (self-deleting) temporary file.
This technique allows readarray to run in the current shell (as opposed to a subshell if a pipeline had been used), which is necessary for the files variable to remain defined in the remainder of the script.
The awk command breaks down as follows:
-v idList="$*" passes the space-separated list of all command-line arguments as a single string to Awk variable idList.
Note that this assumes that the arguments have no embedded spaces, which is indeed the case here and also generally the case with identifiers.
BEGIN { ... } is only executed once, before the individual lines are processed:
split(idList, idArr) splits the input ID list into an array by whitespace and stores the result in idArr.
for (i in idArr) idDict[idArr[i]]++ then converts the (conceptually regular) array into associative array idDict (dictionary), whose keys are the input IDs - this enables efficient lookup by ID later, and also allows storing the lookup result for each ID.
$1 in idDict { idDict[$1] = $2 } is processed for every input line:
Pattern $1 in idDict returns true if the line's first whitespace-separated field ($1) - e.g., 6 - is among the keys (in) of associative array idDict, and, if so, executes the associated action ({...}).
Action { idDict[$1] = $2 } then assigns the second field ($2) - e.g., c.sh - to the idDict entry for key $1.
END { ... } is executed once, after all input lines have been processed:
for (i=1; i<=count; ++i) print idDict[idArr[i]] loops over all input IDs in order and prints each ID's lookup result, which is the value of the dictionary entry with that ID.
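For example (assuming the script is saved as master.sh), the output follows the argument order rather than the file order:
$ ./master.sh 4 1
d.sh
a.sh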

How to iterate over text file having multiple-words-per-line using shell script?

I know how to iterate over lines of text when the text file has contents as below:
abc
pqr
xyz
However, what if the contents of my text file are as below,
abc xyz
cdf pqr
lmn rst
and I need to get the value "abc" stored in one variable and "xyz" stored in another variable. How would I do that?
read splits the line by $IFS as many times as you pass variables to it:
while read var1 var2 ; do
    echo "var1: ${var1} var2: ${var2}"
done
You see, if you pass var1 and var2, the two columns go to separate variables. But note that if the line contains more columns, var2 will contain the whole remaining line, not just column 2.
Type help read for more info.
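For example, using a here-string; everything past the first word ends up in var2:
$ read -r var1 var2 <<< 'abc xyz extra words'
$ echo "var2 is: $var2"
var2 is: xyz extra words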
If the delimiter is a space then you can do:
#!/bin/bash
ALLVALUES=()
while read line
do
    ALLVALUES+=( $line )
done < "/path/to/your/file"
So after, you can just reference an element by ${ALLVALUES[0]} or ${ALLVALUES[1]} etc
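With the sample input from the question, the array ends up holding every word, since the unquoted $line is word-split on whitespace:
$ declare -p ALLVALUES
declare -a ALLVALUES=([0]="abc" [1]="xyz" [2]="cdf" [3]="pqr" [4]="lmn" [5]="rst")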
If you want to read every word in a file into a single array you can do it like this:
arr=()
while read -r -a _a; do
    arr+=("${_a[@]}")
done < infile
This uses -r to keep read from interpreting backslashes in the input and -a to have it split the words (on $IFS) into an array. It then appends all the elements of that array to the accumulating array while remaining safe against globbing and other metacharacters.
This awk command reads the input word by word:
awk -v RS='[[:space:]]+' '1' file
abc
xyz
cdf
pqr
lmn
rst
To populate a shell array use awk command in process substitution:
arr=()
while read -r w; do
    arr+=("$w")
done < <(awk -v RS='[[:space:]]+' '1' file)
And print the array content:
declare -p arr
declare -a arr='([0]="abc" [1]="xyz" [2]="cdf" [3]="pqr" [4]="lmn" [5]="rst")'

Setting a BASH environment variable directly in AWK (in an AWK one-liner)

I have a file that has two columns of floating point values. I also have a C program that takes a floating point value as input and returns another floating point value as output.
What I'd like to do is the following: for each row in the original, execute the C program with the value in the first column as input, and then print out the first column (unchanged) followed by the second column minus the result of the C program.
As an example, suppose c_program returns the square of the input and behaves like this:
$ c_program 4
16
$
and suppose data_file looks like this:
1 10
2 11
3 12
4 13
What I'd like to return as output, in this case, is
1 9
2 7
3 3
4 -3
To write this in really sketchy pseudocode, I want to do something like this:
awk '{print $1, $2 - `c_program $1`}' data_file
But of course, I can't just pass $1, the awk variable, into a call to c_program. What's the right way to do this, and preferably, how could I do it while still maintaining the "awk one-liner"? (I don't want to pull out a sledgehammer and write a full-fledged C program to do this.)
You can just do everything in awk:
awk '{cmd="c_program "$1; cmd|getline l;print $1,$2-l}' file
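One refinement worth considering: each distinct command string opens its own pipe, and awk keeps it open until it is closed, so closing the pipe after each line avoids running out of file descriptors on large inputs:
awk '{cmd="c_program "$1; cmd | getline l; close(cmd); print $1, $2-l}' file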
This shows how to execute a command in awk:
ls | awk '/^a/ {system("ls -ld " $1)}'
You could use a bash script instead:
while read line
do
    FIRST=`echo $line | cut -d' ' -f1`
    SECOND=`echo $line | cut -d' ' -f2`
    OUT=`c_program $FIRST`
    echo $FIRST `expr $SECOND - $OUT`
done < data_file
The shell is a better tool for this, using a little-used feature. There is a shell variable IFS, the Input Field Separator, which sh uses to split command lines when parsing; it defaults to <Space><Tab><Newline>, which is why ls foo is interpreted as two words.
When set is given arguments not beginning with - it sets the positional parameters of the shell to the contents of the arguments as split via IFS, thus:
#!/bin/sh
while read line ; do
    set $line
    subtrahend=`c_program $1`
    echo $1 `expr $2 - $subtrahend`
done < data_file
Pure Bash, without using any external executables other than your program:
#!/bin/bash
while read -r num1 num2
do
    (( result = num2 - $(c_program "$num1") ))   # second column minus c_program(first column)
    echo "$num1 $result"
done < data_file
As others have pointed out: awk is not well equipped for this job. Here is a suggestion in bash:
#!/bin/sh
data_file=$1
while read column_1 column_2 the_rest
do
    result=$(( $column_2 - $(c_program $column_1) ))   # second column minus c_program(first column)
    echo $column_1 $result "$the_rest"
done < $data_file
Save this to a file, say myscript.sh, then invoke it as:
sh myscript.sh data_file
The read command reads each line from the data file (which was redirected to the standard input) and assigns the first two columns to the $column_1 and $column_2 variables. The rest of the line, if there is any, is stored in $the_rest.
Next, the result is calculated based on your requirements and the line is printed out. Note that I surround $the_rest with quotes to preserve spacing; failing to do so would cause multiple spaces in the input file to be squeezed into one.
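With a c_program that returns the square of its input and the data_file from the question, the run would look like this:
$ sh myscript.sh data_file
1 9
2 7
3 3
4 -3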
