How to print and store specific named columns from csv file with new row numbers - bash

I'll start by saying that I'm very new to using bash and to script writing in general.
I have a csv file that has basic column headers and values underneath which looks something like this as an example:
a b c d
3 3 34 4
2 5 4 94
4 5 8 3
9 8 5 7
Is there a way to extract only the numerical values from a specific column and add a number for each row? The numbering starts at 1 on the first row after the column header, then 2, then 3, and so on. For example, for column b the output would be:
1 3
2 5
3 5
4 8
I would like to be able to do this for various different named column headers.
Any help would be appreciated,
Chris

Like this? Using awk:
$ awk 'NR>1{print NR-1, $2}' file
1 3
2 5
3 5
4 8
Explained:
$ awk '              # using awk for the job
NR>1 {               # for the records (rows) after the first
    print NR-1, $2   # output the record number minus one and the second field (column)
}' file              # the input file
"I would like to be able to do this for various different named column headers." With awk you don't specify the column header name but the column number: for column b you use $2, not b.
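For example, with the sample data above, column c is the third field, so you would use $3:
$ awk 'NR>1{print NR-1, $3}' file
1 34
2 4
3 8
4 5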

awk 'NR>1 {print i=1+i, $2}' file
NR>1 skips the first line, in your case the header.
print outputs what follows.
i=1+i increments i and prints it: i starts at 0 and 1 is added, so it prints 1, then 2 the next time, and so on.
$2 prints the second column.
file is the path to your file.

If you have a simple multi-space-delimited file (as in your example), awk is the best tool for the job. To select the column by name in awk you can do something like:
$ awk -v col="b" 'FNR==1 { for (i=1;i<=NF;i++) if ($i==col) x=i; next }
{print FNR-1 OFS $x}' file
1 3
2 5
3 5
4 8
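If you need this for several different headers, one option (a minimal sketch; the function name colbyname is just an illustration) is to wrap the command in a small shell function:
colbyname() {
    # $1 = header name, $2 = input file
    awk -v col="$1" 'FNR==1 { for (i=1;i<=NF;i++) if ($i==col) x=i; next }
                     {print FNR-1 OFS $x}' "$2"
}
$ colbyname b file   # same output as above
$ colbyname d file   # prints 1 4, 2 94, 3 3, 4 7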

Related

Modify values of one column based on values of another column on a line-by-line basis

I'm looking to use bash/awk/sed in order to modify a document.
The document contains multiple columns. Column 5 currently has the value "A" at every row. Column 6 is composed of increasing numbers. I'm attempting a script that goes through the document line by line and checks the value of Column 6; if that value is greater than a certain integer (specifically 275), the value of Column 5 on that same line is changed to "B".
while IFS="" read -r line ; do
    awk 'BEGIN {FS = " "}'
    Num=$(awk '{print $6}' original.txt)
    if [ $Num > 275 ] ; then
        awk '{ gsub("A","B",$5) }'
    fi
done < original.txt >> edited.txt
For the above, I've tried setting the Num variable both inside and outside of the while loop.
I've also tried using a for loop and cat:
awk 'BEGIN {FS = " "}' original.txt
Num=$(awk '{print $6}' heterodimer_P49913/unrelaxed_model_1.pdb)
integer=275
for data in $Num ; do
    if [ $data > $integer ] ; then
        ##Change value in other column to "B" for all lines containing column 6 values greater than "integer"
    fi
done
Thanks in advance.
GNU AWK does not need an external while loop (it has an implicit one); if you need further explanation, read the awk info page. Let the content of file.txt be
1 2 3 4 A 100
1 2 3 4 A 275
1 2 3 4 A 300
and the task be: check the value of Column 6 and, if it is greater than a certain integer (specifically 275), change the value of Column 5 in that same line to "B". Then it might be done using GNU AWK the following way:
awk '$6>275{$5="B"}{print}' file.txt
which gives output
1 2 3 4 A 100
1 2 3 4 A 275
1 2 3 4 B 300
Explanation: the action that sets the value of the 5th field ($5) to B is applied conditionally, only to rows where the value of the 6th field is greater than 275. The print action is applied unconditionally to all lines. Observe that the change, if applied, happens before printing.
(tested in GNU Awk 5.0.1)
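One thing to be aware of (standard awk behavior, not specific to this question): assigning to $5 makes awk rebuild the record using OFS, which defaults to a single space. If your real file were, say, tab-delimited, a sketch that preserves the delimiter would be:
awk 'BEGIN{FS=OFS="\t"} $6>275{$5="B"} {print}' file.txt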

Optimizing grep -f piping commands [duplicate]

This question already has answers here:
Inner join on two text files
(5 answers)
I have two files.
file1 has some keys that have abc in the second column:
et1 abc
et2 abc
et55 abc
file2 has those key values in its last column, along with some other numbers I need to add up:
1 2 3 4 5 et1
5 5 5 5 5 et100
3 3 3 3 3 et55
5 5 5 5 4 et1
6 6 6 6 3 et1
For the keys extracted from file1, I need to add up the corresponding column 5 wherever it matches. file2 itself is very large.
This command seems to be working but it is very slow:
egrep -isr "abc" file1.tcl | awk '{print $1}' | grep -vwf /dev/stdin file2.tcl | awk '{tl+=$5} END {print tl}'
How would I go about optimizing the pipe? Also, what am I doing wrong with grep -f? Is it generally not recommended to do something like this?
Edit: The expected output is the sum of all of column 5 in file2 for rows whose column 6 key is present in file1.
Edit 2: Since file1 has the keys et1, et2 and et55, adding up column 5 of file2 for the matching rows 1, 3, 4 and 5 gives the expected output [5+3+4+3=15].
Use a single awk to read file1 into the keys of an array. Then when reading file2, add $5 to a total variable when $6 is in the array.
awk 'NR==FNR { if ($2 == "abc") a[$1] = 0;
               next }
     $6 in a { total += $5 }
     END { print total }
' file1.tcl file2.tcl
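With the sample files above, this prints the expected total:
15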
Could you please try the following, which reads file2.tcl first and uses fewer loops. Since your expected output was not clear at first, it hasn't been completely tested.
awk 'FNR==NR{a[$NF]+=$(NF-1);next} $2=="abc"{print $1,a[$1]+0}' file2.tcl file1.tcl
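Note that, unlike the answer above, this prints one per-key sum per line rather than a single total; with the sample files it should output:
et1 12
et2 0
et55 3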

compare single column of two files

I have two files, each with two columns separated by a space.
I'd like to find the lines in which column 2 is not the same in both files and output them to a third file.
file A:
1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
2 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
3 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
4 DDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
5 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
6 FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
7 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
8 HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
file B:
1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
2 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
3 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
4 DDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
5 WWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
6 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
7 YYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
8 ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
9 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
10 FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
11 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
12 HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
desired output:
5 WWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
6 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
7 YYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
8 ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
I assumed the easiest way to do this was to grep each line of file A in file B, but I'm new to bash and can't figure out the next step. Any help is greatly appreciated!
You can use awk for this:
$ awk 'FNR==NR {a[$1]=$2; next} $1 in a && a[$1] != $2' fileA fileB
5 WWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
6 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
7 YYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
8 ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
It loops through the first file storing the values in an array a[1st col] = 2nd col. Then, it loops through the second file and prints those lines matching these conditions:
The first column is present in the first file.
The second column value is different from the one in the first file.
To store the result in a new file, just redirect the command's output:
awk 'FNR==NR {a[$1]=$2; next} $1 in a && a[$1] != $2' fileA fileB > fileC
The > fileC at the end is what sends the output to the new file.
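If you also wanted the lines of fileB whose key never appears in fileA at all (lines 9-12 in the example), a variant of the same idea (a sketch, not part of the original answer) would be:
awk 'FNR==NR {a[$1]=$2; next} !($1 in a) || a[$1] != $2' fileA fileB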

Exclude a define pattern using awk

I have a file with two columns, and I want to print the first column only if a certain pattern is not found in the second column. For example, the file could be:
3 0.
5 0.
4 1.
3 1.
10 0.
and I want to print the values in the first column only if the second column isn't the number 1., i.e.
3
5
10
I know that to print the first column I can use
awk '{print $1}' fileInput >> fileOutput
Is it possible to have an if block somewhere?
In general, you just need to indicate what pattern you don't want to match:
awk '! /pattern/' file
In this specific case, where you want to print the 1st column of lines whose 2nd column is not "1.", you can say:
$ awk '$2 != "1." {print $1}' file
3
5
10
When the condition is met, {print $1} is performed, so you get the first column of the file.
In this special case, because 1 evaluates to true and 0 to false, you can do:
awk '!$2 { print $1 }' file
3
5
10
The part before the { } is the condition under which the commands are executed. In this case, !$2 means "column 2 is not true" (i.e. column 2 is falsy).
Edit: this remains the case even with the trailing dot. In fact, all three of these solutions work:
bash-4.2$ cat file
3 0.
5 0.
4 1.
3 1.
10 0.
bash-4.2$ awk '!$2 { print $1 }' file # treat column 2 as a boolean
3
5
10
bash-4.2$ awk '$2 != "1." {print $1}' file # treat column 2 as a string
3
5
10
bash-4.2$ awk '$2 != 1 {print $1}' file # treat column 2 as a number
3
5
10

Divide the first entry and last entry of each row in a file using awk

I have a file with varying row lengths:
120 2 3 4 5 9 0.003
220 2 3 4 0.004
320 2 3 5 6 7 8 8 0.009
I want the output to consist of a single column with entries like:
120/0.003
220/0.004
320/0.009
That is, I want to divide the first column by the last column of each row.
How can I achieve this using awk?
To show the output of the division operation:
$ awk '{ printf $1 "/" $NF "=" ; print ($1/$NF)}' infile
120/0.003=40000
220/0.004=55000
320/0.009=35555.6
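If you want a fixed number of decimal places rather than awk's default numeric formatting, a printf-only sketch would be:
$ awk '{ printf "%s/%s=%.2f\n", $1, $NF, $1/$NF }' infile
120/0.003=40000.00
220/0.004=55000.00
320/0.009=35555.56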
awk will split its input based on the value of FS, which by default is any sequence of whitespace. That means you can get at the first and last column by referring to $1 and $NF. NF is the number of fields in the current line or record.
So to tell awk to print the first and last column do something like this:
awk '{ print $1 "/" $NF }' infile
Output:
120/0.003
220/0.004
320/0.009
