Finding the difference of values between corresponding fields in two CSV files - shell

I have been trying to find the difference of values between corresponding fields in two CSV files:
$ cat f1.csv
A,B,25,35,50
C,D,30,40,36
$
$ cat f2.csv
E,F,20,40,50
G,H,22,40,40
$
Desired output:
5 -5 0
8 0 -4
I was able to achieve it like this:
$ paste -d "," f1.csv f2.csv
A,B,25,35,50,E,F,20,40,50
C,D,30,40,36,G,H,22,40,40
$
$ paste -d "," f1.csv f2.csv | awk -F, '{print $3-$8 " " $4-$9 " " $5-$10 }'
5 -5 0
8 0 -4
$
Is there any better way to achieve it with awk alone without paste command?

As a first step, replace only paste with awk:
awk -F ',' 'NR==FNR {file1[FNR]=$0; next} {print file1[FNR] FS $0}' f1.csv f2.csv
Output:
A,B,25,35,50,E,F,20,40,50
C,D,30,40,36,G,H,22,40,40
Then split file1[FNR] FS $0 into an array, using , as the field separator:
awk -F ',' 'NR==FNR {file1[FNR]=$0; next} {split(file1[FNR] FS $0, arr, FS); print arr[3]-arr[8], arr[4]-arr[9], arr[5]-arr[10]}' f1.csv f2.csv
Output:
5 -5 0
8 0 -4
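A variation of the same idea that avoids rebuilding and re-splitting the joined record is to store just the numeric fields from f1.csv, keyed by line number. This is only a sketch and assumes the numbers always sit in columns 3-5:
awk -F, 'NR==FNR {a3[FNR]=$3; a4[FNR]=$4; a5[FNR]=$5; next}
         {print a3[FNR]-$3, a4[FNR]-$4, a5[FNR]-$5}' f1.csv f2.csv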
From man awk:
FNR: The input record number in the current input file.
NR: The total number of input records seen so far.
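To see the two counters side by side, running awk over both sample files prints something like this (illustration only):
$ awk '{print FILENAME, NR, FNR}' f1.csv f2.csv
f1.csv 1 1
f1.csv 2 2
f2.csv 3 1
f2.csv 4 2
NR keeps growing across files, while FNR resets to 1 when the second file starts, which is why NR==FNR is true only while the first file is being read.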

Another way, using nl and awk:
$ (nl f1.csv;nl f2.csv) | sort | awk -F, ' {a1=$3;a2=$4;a3=$5; getline; print a1-$3,a2-$4,a3-$5 } '
5 -5 0
8 0 -4
$
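If the number of numeric columns varies, the paste approach can also be generalized with a loop instead of hard-coded field positions. A rough sketch, assuming both files have the same number of fields and the first two fields of each are labels:
paste -d, f1.csv f2.csv | awk -F, '{
    n = NF / 2                      # fields per original file
    out = ""
    for (i = 3; i <= n; i++)        # skip the two label fields
        out = out (i > 3 ? " " : "") ($i - $(i + n))
    print out
}'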

Related

Linux - loop through each element on each line

I have a text file with the following information:
cat test.txt
a,e,c,d,e,f,g,h
d,A,e,f,g,h
I wish to iterate through each line and, for each line, print the indices of all the characters different from e. The ideal output would use either a tab separator or a comma separator:
1 3 4 6 7 8
1 2 4 5 6
or
1,3,4,6,7,8
1,2,4,5,6
I have managed to iterate through each line and print the indices, but the results are printed on the same line and not separated.
while read line;do echo "$line" | awk -F, -v ORS=' ' '{for(i=1;i<=NF;i++) if($i!="e") {print i}}' ;done<test.txt
With the result being
1 3 4 6 7 8 1 2 4 5 6
If I do it using only awk:
awk -F, -v ORS=' ' '{for(i=1;i<=NF;i++) if($i!="e") {print i}}'
I get the same output.
Could anyone help me with this specific issue of separating the lines?
If you don't mind some trailing whitespace, you can just do:
while read line;do echo "$line" | awk -F, '{for(i=1;i<=NF;i++) if($i!="e") {printf i " "}; print ""}' ;done<test.txt
but it would be more typical to omit the while loop and do:
awk -F, '{for(i=1;i<=NF;i++) if($i!="e") {printf i " "}; print ""}' <test.txt
You can avoid the trailing whitespace with the slightly cryptic:
awk -F, '{m=0; for(i=1;i<=NF;i++) if($i!="e") {printf "%c%d", m++ ? " " : "", i }; print ""}' <test.txt
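A slightly less cryptic way to avoid the trailing space is to collect the indices in a variable and print the whole line once per record, a small sketch of the same logic:
awk -F, '{out=""; for(i=1;i<=NF;i++) if($i!="e") out = out (out=="" ? "" : " ") i; print out}' test.txt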

Subtract length of elements in two columns

I have a file from which I get two columns: cut -d $'\t' -f 4,5 file.txt
Now I would like to get the difference in length of each element between columns 1 and 2.
Input from the cut command:
A T
AA T
AC TC
A CT
What I would expect
0
1
0
-1
Using awk.
awk ' {print length($1) - length($2)} ' cutoutput.txt
Or with awk on the original file you can simply do:
awk ' {print length($4) - length($5)} ' file.txt
You can probably do this with awk alone, without using cut. Since we don't have the original input file, here is a version that pipes your cut command into awk:
cut -d $'\t' -f 4,5 file.txt | \
awk '{for (i=1;i<NF;i++) s=length($i)-length($NF); printf s"\n"}'
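If the file is strictly tab-delimited and other columns may themselves contain spaces, the version that works directly on file.txt is safer with an explicit tab field separator; the column numbers below are taken from your cut command:
awk -F'\t' '{print length($4) - length($5)}' file.txt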

How to add number to beginning of each line?

This is what I normally use to add numbers to the beginning of each line:
awk '{ print FNR " " $0 }' file
However, what I need to do is start the number at 1000001. Is there a way to start with a specific number like this instead of having to use line numbers?
There is a special command for this: nl
nl -v1000001 file
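Note that by default nl numbers only non-blank lines and separates the number from the text with a tab. If you want every line numbered and a plain space as separator (closer to the awk output), the standard -b and -s options should do it, for example:
nl -b a -v1000001 -s' ' file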
You can just add 1000000 to FNR (or NR):
awk '{ print (1000000 + FNR), $0 }' file
$ seq 5 | awk -v n=1000000 '{print ++n, $0}'
1000001 1
1000002 2
1000003 3
1000004 4
1000005 5
$ seq 5 | awk -v n=30 '{print ++n, $0}'
31 1
32 2
33 3
34 4
35 5

UNIX: Getting count of occurrences of numbers from a CSV file

I have a CSV file whose first and second columns are ID and domain.
#Input.txt
1,google.com
1,cnn.com
1,dropbox.com
2,bbc.com
3,twitter.com
3,hello.com
3,example.com
4,twitter.com
.............
Now, I would like to get the count of each ID. Yes, this can be done in Excel/Sheets, but the file contains about 1.5 million lines.
Expected Output:
1,3
2,1
3,3
4,1
I tried using cat Input.txt | grep -c 1, which gives me the count of '1' as 3, but I would like to get the count for every ID at once. Can anyone help me with how to achieve this?
awk -F "," '{ ids[$1]++} END { for(id in ids) { print id, ids[id] } }' input
And input is the file with the data.
output:
1 3
2 1
3 3
4 1
Edit:
If you want comma-separated output, you need to set the output separator like this:
awk -F "," 'BEGIN { OFS=","} { ids[$1]++} END { for(id in ids) { print id, ids[id] } }' input
output:
1,3
2,1
3,3
4,1
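One thing to be aware of: the order of for (id in ids) is unspecified in awk, so if you need the IDs sorted, pipe the result through sort (a small addition to the answer above):
awk -F "," 'BEGIN { OFS="," } { ids[$1]++ } END { for(id in ids) print id, ids[id] }' input | sort -t, -k1,1n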
Here's one way, though the count ends up in the first column:
$ zcat Input.txt.gz | cut -d , -f 1 | sort | uniq -c
3 1
1 2
3 3
1 4
Here's another way using awk:
$ awk -F , '{counter[$1]++};
END {for (id in counter) printf "%s,%d\n",id,counter[id];}' Input.txt |
sort
1,3
2,1
3,3
4,1
This will do the job in bash:
$ for i in {1..4}; do echo -n $i, >> OUTPUT && grep -c "^$i," Input.txt >> OUTPUT; done
$ less OUTPUT
1,3
2,1
3,3
4,1
$ awk -F, '{ print $1 }' input.txt | uniq -c | awk '{ print $2 "," $1 }'
1,3
2,1
3,3
4,1
Here is a pure awk solution. It doesn't keep the entire file in memory, so it will probably use less memory than @Joda's answer, but it assumes that the file is sorted:
awk -F, -v OFS=, '$1==prev{c++; next} NR>1{print prev, c} {prev=$1; c=1} END{print prev, c}' file
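If the file is not already grouped by ID, a sort in front takes care of that; a sketch combining the two:
sort -t, -k1,1 Input.txt | awk -F, -v OFS=, '$1==prev{c++; next} NR>1{print prev, c} {prev=$1; c=1} END{print prev, c}'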

Problems in mapping indices using awk

Hi all, I have these data files:
File1
1 The hero
2 Chainsaw and the gang
3 .........
4 .........
where the first field is the id and the second field is the product name
File 2
The hero 12
The hero 2
Chainsaw and the gang 2
.......................
From these two files I want to have a third file
File 3
The hero 12 1
The hero 2 1
Chainsaw and the gang 2 2
.......................
As you can see, I am just appending the indices read from File 1.
I used this method:
awk -F '\t' 'NR == FNR{a[$2]=$1; next}; {print $0, a[$1]}' File1 File2 > File3
where I am creating an associative array using File 1 and just doing lookups using the product names from File 2.
However, my files are huge; I have around 20 million product names and this process is taking a lot of time. Any suggestions on how I can speed it up?
You can use this awk:
awk 'FNR==NR{p=$1; $1=""; sub(/^ +/, ""); a[$0]=p;next} {q=$NF; $NF=""; sub(/ +$/, "")}
($0 in a) {print $0, q, a[$0]}' f1 f2
The hero 12 1
The hero 2 1
Chainsaw and the gang 2 2
The script you posted won't produce the output you want from the input files you posted, so let's fix that first:
$ cat file1
1 The hero
2 Chainsaw and the gang
$ cat file2
The hero 12
The hero 2
Chainsaw and the gang 2
$ awk -F'\t' 'NR==FNR{map[$2]=$1;next} {key=$0; sub(/[[:space:]]+[^[:space:]]+$/,"",key); print $0, map[key]}' file1 file2
The hero 12 1
The hero 2 1
Chainsaw and the gang 2 2
Now, is that really too slow, or were you doing some pre- or post-processing that was the real speed issue?
The obvious speed-up is: if your "file2" is sorted, then you can delete the corresponding map[] entry whenever the key changes, so your map[] gets smaller every time you use it, e.g. something like this (untested):
$ awk -F'\t' '
NR==FNR {map[$2]=$1; next}
{ key=$0; sub(/[[:space:]]+[^[:space:]]+$/,"",key); print $0, map[key] }
key != prev { delete map[prev] }
{ prev = key }
' file1 file2
An alternative approach, for when populating map[] uses too much time/memory and file2 is sorted:
$ awk '
{ key=$0
  sub(/[[:space:]]+[^[:space:]]+$/,"",key)
  if (key != prev) {
      cmd = "awk -F\"\t\" -v key=\"" key "\" \047$2 == key{print $1;exit}\047 file1"
      cmd | getline val
      close(cmd)
  }
  print $0, val
  prev = key
}' file2
From the comments, you're having scaling problems with your lookups. The general fix for that is to merge sorted sequences:
join -t $'\t' -1 2 -2 1 -o 1.2,2.2,1.1 \
<( sort -t $'\t' -k2 file1) \
<( sort -t $'\t' -sk1,1 file2)
I gather Windows can't do process substitution, so you have to use temporary files:
sort -t $'\t' -k2 file1 >idlookup.bykey
sort -t $'\t' -sk1,1 file2 >values.bykey
join -t $'\t' -1 2 -2 1 -o 1.2,2.2,1.1 idlookup.bykey values.bykey
If you need to preserve the value lookup sequence, use nl to put line numbers on the front and sort on those at the end.
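That could look roughly like this (an untested sketch, reusing idlookup.bykey from above): number file2's lines with nl, sort and join on the product name, then sort back on the line number and cut it off:
nl -b a file2 | sort -t $'\t' -k2,2 >values.numbered
join -t $'\t' -1 2 -2 2 -o 2.1,1.2,2.3,1.1 idlookup.bykey values.numbered |
    sort -t $'\t' -k1,1n | cut -f2-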
If your issue is performance, then try this Perl script:
#!/usr/bin/perl -l
use strict;
use warnings;

my %h;
open my $fh1, "<", "file1.txt";
open my $fh2, "<", "file2.txt";
open my $fh3, ">", "file3.txt";

while (<$fh1>) {
    my ($v, $k) = /(\d+)\s+(.*)/;
    $h{$k} = $v;
}

while (<$fh2>) {
    my ($k, $v) = /(.*)\s+(\d+)$/;
    print $fh3 "$k $v $h{$k}" if exists $h{$k};
}
Save the above script as, say, script.pl and run it as perl script.pl. Make sure file1.txt and file2.txt are in the same directory as the script.
