Assign the output of awk to a specific column in a file - bash

I have the following example output from a log file, where I'm trying to get the reverse pointer (PTR) records for the IP addresses in column 7:
2017-01-09 11:25:22.421 0.306 TCP -> 500 20000 1
2017-01-09 11:30:11.210 0.000 TCP -> 100 4000 1
2017-01-09 09:01:22.546 0.000 TCP -> 100 4000 1
If I run this awk command I can extract the reverse records for column 7:
cat test.txt | awk '{print $7}'| grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'| xargs -I % bash -c 'echo "$(dig -x % +short)"'
How do I get the output of the above command to replace what's in column 7, so that it reads for example:
2017-01-09 11:25:22.421 0.306 TCP -> 500 20000 1
2017-01-09 11:30:11.210 0.000 TCP -> 100 4000 1
2017-01-09 09:01:22.546 0.000 TCP -> 100 4000 1

Using awk only:
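$ awk '{split($7,a,":"); r=""; c="dig -x " a[1] " +short"; c|getline r; close(c); $7=r} 1' file
split by : to get the ip from $7 to a[1]
construct the dig command for shell to c var
execute it and store the first line of the result to r
close the command, so that a repeated IP runs dig again instead of reading from the finished one
replace $7 with r and print with 1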
Not showing any example output, as the test file didn't have IPs that would return a reverse record.
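A minimal sketch of the same idea wrapped in a shell function, in case the log is large: it caches each answer so every address is looked up only once, and it only passes plain dotted quads to the shell. The function name resolve_col7 and the idea of passing the lookup command as a parameter are assumptions made here so the sketch can be tried without network access; they are not part of the answer above.

```shell
# resolve_col7 CMD FILE
# Replaces field 7 of FILE with the first output line of "CMD <address>".
# CMD is a parameter (an assumption of this sketch) so it can be tried offline;
# for real lookups it would be something like: dig +short -x
resolve_col7() {
    awk -v cmd="$1" '
    {
        split($7, a, ":")                       # a[1] = address without the port
        ip = a[1]
        # Only plain dotted quads are passed to the shell command.
        if (ip ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) {
            if (!(ip in cache)) {               # look each address up only once
                c = cmd " " ip
                r = ""
                c | getline r
                close(c)                        # free the pipe before the next lookup
                cache[ip] = (r != "" ? r : ip)  # no answer: keep the address
            }
            $7 = cache[ip]
        }
        print
    }' "$2"
}

# Offline try-out with a stand-in command instead of dig:
# resolve_col7 'echo name-of' test.txt
```

Keeping the address when nothing comes back is a design choice of this sketch; the one-liner above leaves the column empty in that case.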


Nethogs print Total Traffic of Process to File

I've been trying for some time now and unfortunately I can't get any further, so I'm hoping you can help me.
I would need to determine the total UP/DOWN traffic since start of the PC for a specific process.
I have found nethogs which gives me the correct values (in the terminal) with the following command.
./nethogs -t -v 2 eth0 2>&1 | awk '/AB/{print $3,"/",$2}'
211 / 561
211 / 561
271 / 620
271 / 620
Now I would need the last (and therefore most recent) value to be saved in the first line in a text file so that I can process it further.
To save all values I have added >|/dev/shm/traffic.log at the end. However, the file is not overwritten; instead, a new line is added every x seconds.
Unfortunately, I am failing and have not yet found a solution.
I would like to ask you to help me here.
   I created a sample ./nethogs file to simulate your output on my local host.
    $ ./nethogs -t -v 2 eth0
    AB 211 561
    AB 211 561
    AB 271 620
    AB 271 620
    $ ./nethogs -t -v 2 eth0 2>&1 | awk '/AB/{print $3,"/",$2}'
    561 / 211
    561 / 211
    620 / 271
    620 / 271
   Then I redirected the output with a plain > instead of >| (the latter is only needed when the shell's noclobber option is set).
    $ ./nethogs -t -v 2 eth0 2>&1 | awk '/AB/{ print $3,"/",$2}' >/dev/shm/traffic.log
    $ cat /dev/shm/traffic.log
    561 / 211
    561 / 211
    620 / 271
    620 / 271
    Hence replace:
    If you are only interested in the last output, you can use:
    $ ./nethogs -t -v 2 eth0 2>&1 | tail -2 | awk '/AB/{print $3,"/",$2}' >/dev/shm/traffic.log
    $ cat /dev/shm/traffic.log
    620 / 271
    620 / 271
Thank you for the feedback.
Unfortunately, nothing is saved in the log when the "tail -2" command comes before the awk:
$ ./nethogs -t -v 2 eth0 2>&1 | tail -2 | awk '/AB/{print $3,"/",$2}' >/dev/shm/traffic.log
nethogs keeps printing lines, and with awk I filter out the ones I need. The last line printed is the current sum of the TX and RX bytes; only this value should be in the first line of the log file.
So the first line of the log file should always correspond to the latest nethogs output that passes the awk filter, and the old data in the file should always be overwritten.
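tail can only print the last lines once its input has ended, and nethogs never ends, which would explain the empty log. One possible way around this is to let awk write the file itself and close it after every matching line, so that each new value replaces the old one. This is a sketch under assumptions: the function name keep_latest is made up here, and it has only been tried with simulated input, not with a real nethogs run.

```shell
# keep_latest FILE: read lines on stdin and keep only the newest matching
# line in FILE. The /AB/ filter and the field order are copied from the question.
keep_latest() {
    awk -v f="$1" '/AB/ {
        print $3, "/", $2 > f    # ">" truncates f each time it is (re)opened
        close(f)                 # closing forces the next print to reopen it
    }'
}

# Intended use (untested against a real nethogs run):
# ./nethogs -t -v 2 eth0 2>&1 | keep_latest /dev/shm/traffic.log
```

If nethogs buffers its output when writing to a pipe, the file may be updated in bursts rather than every few seconds.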

How to merge two tab-separated files and predefine formatting of missing values?

I am trying to merge two unsorted tab separated files by a column of partially overlapping identifiers (gene#) with the option of predefining missing values and keeping the order of the first table.
When using paste on my two example tables missing values end up as empty space.
cat file1
c3 100 300 gene4
c1 300 400 gene1
c13 600 700 gene2
cat file2
gene1 4.2 0.001
gene4 1.05 0.5
paste file1 file2
c3 100 300 gene4 gene1 4.2 0.001
c1 300 400 gene1 gene4 1.05 0.5
c13 600 700 gene2
As you see the result not surprisingly shows empty spaces in non matched lines. Is there a way to keep the order of file1 and fill lines like the third as follows:
c3 100 300 gene4 gene4 1.05 0.5
c1 300 400 gene1 gene1 4.2 0.001
c13 600 700 gene2 NA 1 1
I assume one way could be to build an awk conditional construct. It would be great if you could point me in the right direction.
With awk please try the following:
awk 'FNR==NR {a[$1]=$1; b[$1]=$2; c[$1]=$3; next}
{if (!a[$4]) {a[$4]="N/A"; b[$4]=1; c[$4]=1}
printf "%s %s %s %s\n", $0, a[$4], b[$4], c[$4]}
' file2 file1
which yields:
c3 100 300 gene4 gene4 1.05 0.5
c1 300 400 gene1 gene1 4.2 0.001
c13 600 700 gene2 N/A 1 1
In the 1st line, FNR==NR { command; next } is an idiom to execute the command only when reading the 1st file in the argument list ("file2" in this case). It creates maps (aka associative arrays) that associate the values in "file2" with genes:
gene1 => gene1 (with array a)
gene1 => 4.2 (with array b)
gene1 => 0.001 (with array c)
gene4 => gene4 (with array a)
gene4 => 1.05 (with array b)
gene4 => 0.5 (with array c)
It is not necessary that "file2" is sorted.
The following lines are executed only when reading the 2nd file ("file1") because these lines are skipped when reading the 1st file due to the next statement.
The line {if (!a[$4]) .. is a fallback to assign variables to default values when the associative array a[gene] is undefined (meaning the gene is not found in "file2").
The final line prints the contents of "file1" followed by the associated values via the gene.
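Since the files are tab-separated and the question asks for NA as the placeholder, a variant along the same lines might look as follows. It is only a sketch: the function name merge_genes is made up here, and it assumes exactly three columns in the second table. It sets FS and OFS to a tab and uses the in operator, which tests for a key without creating an array entry.

```shell
# merge_genes TABLE2 TABLE1 (hypothetical helper name)
# TABLE2: gene <TAB> value <TAB> value    TABLE1: ... gene in column 4
merge_genes() {
    awk 'BEGIN { FS = OFS = "\t" }
         # First file: remember the two value columns for each gene.
         FNR == NR { rest[$1] = $2 OFS $3; next }
         # Second file: append the match, or the defaults NA, 1, 1.
         { print $0, (($4 in rest) ? $4 OFS rest[$4] : "NA" OFS 1 OFS 1) }' "$1" "$2"
}

# Usage with the files from the question:
# merge_genes file2 file1
```

The order of the first table is kept because it is read last and printed line by line.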
You can use join:
join -e NA -o '1.1 1.2 1.3 1.4 1.5 2.1 2.2 2.3' -a 1 -1 5 -2 1 <(nl -w1 -s ' ' file1 | sort -k 5) <(sort -k 1 file2) | sed 's/NA\sNA$/1 1/' | sort -n | cut -d ' ' -f 2-
-e NA — replace all missing values with NA
-o ... — output format (field is specified using <file>.<field>)
-a 1 — Keep every line from the left file
-1 5, -2 1 — Fields used to join the files
file1, file2 — The files
nl -w1 -s ' ' file1 — file1 with numbered lines
<(sort -k X fileN) — File N ready to be joined on column X
s/NA\sNA$/1 1/ — Replace every NA NA on end of line with 1 1
| sort -n | cut -d ' ' -f 2- — sort numerically and remove the first column
The example above uses spaces on output. To use tabs, append | tr ' ' '\t'. For instance, if both files are already sorted on the join field:
join -e NA -o '1.1 1.2 1.3 1.4 2.1 2.2 2.3' -a 1 -1 4 -2 1 file1 file2 | sed 's/NA\sNA$/1 1/' | tr ' ' '\t'
The broken lines have a TAB as the last character. Fix this with
paste file1 file2 | sed 's/\t$/\tNA\t1\t1/g'

Sort command strange behaviour

Input file: salary.txt
1 rob hr 10000
2 charls it 20000
4 kk Fin 30000
5 km it 30000
6 kl it 30000
7 mark hr 10000
8 kc it 30000
9 dc fin 40000
10 mn hr 40000
3 abi it 20000
Objective: find all records with the second-highest salary, where the 4th column is the salary (space-separated records).
I ran two similar commands, but the output is entirely different. What am I missing?
Command1:
sort -nr -k4,4 salary.txt | awk '!a[$4]{a[$4]=$4;t++}t==2'
8 kc it 30000
6 kl it 30000
5 km it 30000
4 kk Fin 30000
Command2:
cat salary.txt | sort -nr -k4,4 | awk '!a[$4]{a[$4]=$4;t++}t==2' salary.txt
2 charls it 20000
The difference between the two commands is only the way salary.txt is read, so why is the output entirely different?
Because in the second form awk will read directly from salary.txt - which you are passing as the name of the input file - ignoring the output from sort that you are passing to stdin. Leave out the final salary.txt in command2 and you'll see that the output matches that of command1. In fact, sort behaves the same way and the forms:
cat salary.txt | sort
echo "string that will be ignored" | sort salary.txt
will both yield the exact same output.
In your second command, awk does not read from stdin, because it is given salary.txt as an input file. If you change it to
cat salary.txt | sort -nr -k4,4 | awk '!a[$4]{a[$4]=$4;t++}t==2'
you get the same result as with the first command.
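The behaviour described in both answers can be checked with a throwaway file. This is just an illustrative sketch; the variable names are made up. It also shows that a - operand stands for stdin, in case both sources are wanted.

```shell
f=$(mktemp)
printf 'from file\n' > "$f"

# File operand given: awk reads the file and never touches stdin.
only_file=$(printf 'from stdin\n' | awk '{ print }' "$f")

# No file operand: awk reads stdin.
only_stdin=$(printf 'from stdin\n' | awk '{ print }')

# "-" as an operand means stdin, so both sources are read in the order listed.
both=$(printf 'from stdin\n' | awk '{ print }' - "$f")

printf '%s\n' "$only_file" "$only_stdin" "$both"
rm -f "$f"
```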

Subtract corresponding lines

I have two files, file1.csv
3 1009
7 1012
2 1013
8 1014
and file2.csv
5 1009
3 1010
1 1013
In the shell, I want to subtract the count in the first column in the second file from that in the first file, based on the identifier in the second column. If an identifier is missing from one of the files, its count there is assumed to be 0.
The result would be
-2 1009
-3 1010
7 1012
1 1013
8 1014
The files are huge (several GB). The second columns are sorted.
How would I do this efficiently in the shell?
Assuming that both files are sorted on second column:
$ join -j2 -a1 -a2 -oauto -e0 file1 file2 | awk '{print $2 - $3, $1}'
-2 1009
-3 1010
7 1012
1 1013
8 1014
join will join sorted files.
-j2 will join on the second column.
-a1 will print records from file1 even if there is no corresponding row in file2.
-a2 Same as -a1 but applied for file2.
-oauto is in this case the same as -o1.2,1.1,2.1 which will print the joined column, and then the remaining columns from file1 and file2.
-e0 will insert 0 instead of an empty column. This works with -a1 and -a2.
The output from join is three columns like:
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0
This is piped to awk, which subtracts column 3 from column 2 and reformats the line.
$ awk 'NR==FNR { a[$2]=$1; next }
{ a[$2]-=$1 }
END { for(i in a) print a[i],i }' file1 file2
7 1012
1 1013
8 1014
-2 1009
-3 1010
It reads the first file in memory so you should have enough memory available. If you don't have the memory, I would maybe sort -k2 the files first, then sort -m (merge) them and continue with that output:
$ sort -m -k2 -k3 <(sed 's/$/ 1/' file1|sort -k2) <(sed 's/$/ 2/' file2|sort -k2) # | awk ...
3 1009 1
5 1009 2 # previous $2 = current $2 -> subtract
3 1010 2 # previous $2 =/= current and current $3=2 print -$3
7 1012 1
2 1013 1 # previous $2 =/= current and current $3=1 print prev $2
1 1013 2
8 1014 1
(I'm out of time for now, maybe I'll finish it later)
EDIT by Ed Morton
Hope you don't mind me adding what I was working on rather than posting my own extremely similar answer, feel free to modify or delete it:
$ cat tst.awk
{ split(prev,p) }
$2 == p[2] {
    print p[1] - $1, p[2]
    prev = ""
    next
}
p[2] != "" {
    print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
}
{ prev = $0 }
END {
    split(prev,p)
    if (p[2] != "") print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
}
$ sort -m -k2 <(sed 's/$/ 1/' file1) <(sed 's/$/ 2/' file2) | awk -f tst.awk
-2 1009
-3 1010
7 1012
1 1013
8 1014
Since the files are sorted¹, you can merge them line-by-line with the join utility in coreutils:
$ join -j2 -o auto -e 0 -a 1 -a 2 41144043-a 41144043-b
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0
All those options are required:
-j2 says to join based on the second column of each file
-o auto says to make every row have the same format, beginning with the join key
-e 0 says that missing values should be substituted with zero
-a 1 and -a 2 include rows that are absent from one file or another
the filenames (I've used names based on the question number here)
Now that we have a stream of output in that format, we can do the subtraction on each line. I used this sed command to transform the above output into a dc program:
sed -e 's/.*/c& -n[ ]np/'
This takes the three values on each line and rearranges them into a dc command for the subtraction. For example, the first line becomes (with spaces added for clarity)
c 1009 3 5 -n [ ]n p
which subtracts 5 from 3, prints it, then prints a space, then prints 1009 and a newline, giving
-2 1009
as required.
We can then pipe all these lines into dc, giving us the output file that we want:
$ join -o auto -j2 -e 0 -a 1 -a 2 41144043-a 41144043-b \
> | sed -e 's/.*/c& -n[ ]np/' \
> | dc
-2 1009
-3 1010
7 1012
1 1013
8 1014
¹ The sorting needs to be consistent with LC_COLLATE locale setting. That's unlikely to be an issue if the fields are always numeric.
The full command is:
join -o auto -j2 -e 0 -a 1 -a 2 "$file1" "$file2" | sed -e 's/.*/c& -n[ ]np/' | dc
It works a line at a time, and starts only the three processes you see, so should be reasonably efficient in both memory and CPU.
This assumes the fields are separated by blanks; if the separator is a comma, add the argument -F ','.
awk 'FNR==NR {Inits[$2]=$1; ids[$2]++; next}
{Discounts[$2]=$1; ids[$2]++}
END { for (id in ids) print Inits[ id] - Discounts[ id] " " id}
' file1.csv file2.csv
If memory is an issue (this could be done in a single pipeline, but I prefer to use a temporary file):
awk 'FNR==NR{print;next}{print -1 * $1 " " $2}' file1 file2 \
| sort -k2 \
> file.tmp
awk 'Last != $2 {
if (NR != 1) print Result " " Last
Last = $2; Result = $1
next
}
Last == $2 { Result += $1 }
END { if (NR) print Result " " Last }
' file.tmp
rm file.tmp

Search for a value in a file and remove subsequent lines

I'm developing a shell script but I am stuck with the below part.
I have the file sample.txt:
S.No Sub1 Sub2
1 100 200
2 100 200
3 100 200
4 100 200
5 100 200
6 100 200
7 100 200
I want to search the S.No column in sample.txt. For example, if I'm searching for the value 5, I need only the rows up to 5; I don't want the rows after the point where the value in S.No is larger than 5.
The output, output.txt, must look like this:
S.No Sub1 Sub2
1 100 200
2 100 200
3 100 200
4 100 200
5 100 200
Print the first line and any other line where the first field is less than or equal to 5:
$ awk 'NR==1||$1<=5' file
S.No Sub1 Sub2
1 100 200
2 100 200
3 100 200
4 100 200
5 100 200
Using perl:
perl -ane 'print if $F[0]<=5' file
And the sed solution, with the value to search for in the shell variable n:
sed "/^$n[[:space:]]/q" filename
The sed q command exits after printing the current line
The suggested awk relies on column 1 being sorted numerically. A generic awk that fulfills the question title would be:
gawk -v p=5 '$1==p {print; exit} {print}'
However, in this situation, sed is better IMO. Use -i to modify the input file.
sed '6q' sample.txt > output.txt
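To illustrate the difference on unsorted data, here is a small sketch of the stop-at-first-match approach as a shell function. The name rows_until is made up for this example, and it uses only plain awk features, so gawk should not be required.

```shell
# rows_until VALUE FILE (hypothetical helper name)
# Prints the header and every row up to and including the first row whose
# first field equals VALUE, regardless of how the column is ordered.
rows_until() {
    awk -v p="$1" '{ print } NR > 1 && $1 == p { exit }' "$2"
}

# Usage with the file from the question:
# rows_until 5 sample.txt > output.txt
```

With the first column in the order 3, 7, 5, 1, this prints the rows 3, 7 and 5, whereas the $1<=5 test would print 3, 5 and 1.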
