How to convert two columns of a CSV file to consecutive integers? - shell

Hello, say I have a file file1.csv that has two columns, a and b, which are both 22-character strings. It looks something like this:
hWcYwgRKOD77hfm1oKE0IA,5HleiJXMsFkGEsr8Jqr3Ug
hWcYwgRKOD77hfm1oKE0IA,rCDlYd2WHJuiT05sYGxaVA
65q0c2Iw03B8eSuHHTETHw,G40NUD0/op+13yjzBw+hrw
65q0c2Iw03B8eSuHHTETHw,1u8UW/cQ4i1vbSF9wvzu3w
...
And I would like to convert the a, b columns into consecutive integers like:
1,1
1,2
2,3
2,4
Does anyone know how I can do it? I am using Ubuntu 12.04, by the way.
And what if I have another file file2.csv with columns a' and b'? Is there a way to do the same thing to file2, so that if "hWcYwgRKOD77hfm1oKE0IA" maps to 1 in file1, then "hWcYwgRKOD77hfm1oKE0IA" also maps to 1 in file2 if it appears? The same goes for columns b and b'. I would like one output per file: result1.csv and result2.csv.

awk -F, -v OFS=, '{ if ($1 in a) { $1 = a[$1] } else { $1 = a[$1] = ++x }
if ($2 in b) { $2 = b[$2] } else { $2 = b[$2] = ++y } } 1' file
Or, perhaps simpler but possibly less efficient:
awk -F, -v OFS=, '!($1 in a) { a[$1] = ++x } { $1 = a[$1] }
!($2 in b) { b[$2] = ++y } { $2 = b[$2] } 1' file
Or dynamic to any number of columns:
awk -F, -v OFS=, '{ for (i = 1; i <= NF; ++i)
if ((i, $i) in a) { $i = a[i, $i] }
else { $i = a[i, $i] = ++x[i] } } 1' file
Which is also similar to
awk -F, -v OFS=, '{ for (i = 1; i <= NF; ++i) {
if (!((i, $i) in a)) a[i, $i] = ++x[i]
$i = a[i, $i] } } 1' file
Output:
1,1
1,2
2,3
2,4
UPDATE
To apply on two files, try:
awk -F, -v OFS=, '{ if ($1 in a) { $1 = a[$1] } else { $1 = a[$1] = ++x }
if ($2 in b) { $2 = b[$2] } else { $2 = b[$2] = ++y }
print > ("result_" FILENAME) }' file1 file2
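Because the a and b arrays persist across input files, the same string receives the same integer in both result files. A minimal end-to-end check (file contents are sample data from the question; the redirection target is parenthesized for portability across awk implementations):

```shell
# Two inputs that share a value in column 1
cat > file1 <<'EOF'
hWcYwgRKOD77hfm1oKE0IA,5HleiJXMsFkGEsr8Jqr3Ug
65q0c2Iw03B8eSuHHTETHw,G40NUD0/op+13yjzBw+hrw
EOF
cat > file2 <<'EOF'
hWcYwgRKOD77hfm1oKE0IA,1u8UW/cQ4i1vbSF9wvzu3w
EOF

# The arrays a and b are shared, so mappings stay consistent across files
awk -F, -v OFS=, '{ if ($1 in a) { $1 = a[$1] } else { $1 = a[$1] = ++x }
                    if ($2 in b) { $2 = b[$2] } else { $2 = b[$2] = ++y }
                    print > ("result_" FILENAME) }' file1 file2

cat result_file1   # 1,1 and 2,2
cat result_file2   # 1,3 -- column 1 reuses the integer assigned in file1
```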
UPDATE 02
awk -F, -v OFS=, '!($1 in a) { a[$1] = ++x } !($2 in b) { b[$2] = ++y }
{ print $1, $2, a[$1], b[$2] }' file
Output:
hWcYwgRKOD77hfm1oKE0IA,5HleiJXMsFkGEsr8Jqr3Ug,1,1
hWcYwgRKOD77hfm1oKE0IA,rCDlYd2WHJuiT05sYGxaVA,1,2
65q0c2Iw03B8eSuHHTETHw,G40NUD0/op+13yjzBw+hrw,2,3
65q0c2Iw03B8eSuHHTETHw,1u8UW/cQ4i1vbSF9wvzu3w,2,4
File by file version:
awk -F, -v OFS=, '!($1 in a) { a[$1] = ++x } !($2 in b) { b[$2] = ++y }
{ print $1, $2, a[$1], b[$2] > ("result_" FILENAME) }' file1 file2

Related

how to find common columns and their records from two files using awk

I have two files:
File 1:
id|name|address|country
1|abc|efg|xyz
2|asd|dfg|uio
File 2(only headers):
id|name|country
Now, I want an output like:
OUTPUT:
id|name|country
1|abc|xyz
2|asd|uio
Basically, I have a user record file (file1) and a header file (file2). Now, I want to extract only those records from file1 whose columns match those in the header file.
I want to do this using awk or bash.
I tried using:
awk 'BEGIN { OFS="..."} FNR==NR { a[(FNR"")] = $0; next } { print a[(FNR"")], $0 > "test.txt"}' header.txt file.txt
and have no idea what to do next.
Thank You
The following awk may help:
awk -F"|" 'FNR==NR{for(i=1;i<=NF;i++){a[$i]};next} FNR==1 && FNR!=NR{for(j=1;j<=NF;j++){if($j in a){b[++p]=j}}} {for(o=1;o<=p;o++){printf("%s%s",$b[o],o==p?ORS:OFS)}}' OFS="|" File2 File1
Here is the same solution in expanded, non-one-liner form too:
awk -F"|" '
FNR==NR{
    for(i=1;i<=NF;i++){
        a[$i]
    }
    next
}
FNR==1 && FNR!=NR{
    for(j=1;j<=NF;j++){
        if($j in a){ b[++p]=j }
    }
}
{
    for(o=1;o<=p;o++){
        printf("%s%s",$b[o],o==p?ORS:OFS)
    }
}
' OFS="|" File2 File1
Edit by Ed Morton: FWIW here's the same script written with normal indenting/spacing and a couple of more meaningful variable names:
BEGIN { FS=OFS="|" }
NR==FNR {
    for (i=1; i<=NF; i++) {
        names[$i]
    }
    next
}
FNR==1 {
    for (i=1; i<=NF; i++) {
        if ($i in names) {
            f[++numFlds] = i
        }
    }
}
{
    for (i=1; i<=numFlds; i++) {
        printf "%s%s", $(f[i]), (i<numFlds ? OFS : ORS)
    }
}
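A quick way to try the rewritten script, with the file names from the question (here the script is passed inline to awk):

```shell
cat > file1 <<'EOF'
id|name|address|country
1|abc|efg|xyz
2|asd|dfg|uio
EOF
printf 'id|name|country\n' > file2

# file2 (headers only) is read first to load names[]; file1's first line
# then decides which field numbers to keep, in file1's column order
awk '
BEGIN { FS=OFS="|" }
NR==FNR { for (i=1; i<=NF; i++) names[$i]; next }
FNR==1 { for (i=1; i<=NF; i++) if ($i in names) f[++numFlds] = i }
{
    for (i=1; i<=numFlds; i++)
        printf "%s%s", $(f[i]), (i<numFlds ? OFS : ORS)
}
' file2 file1
```

This prints the header followed by id|name|country values for each record.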
with (lots of) unix pipes as Doug McIlroy intended...
$ function p() { sed 1q "$1" | tr '|' '\n' | cat -n | sort -k2; }
$ cut -d'|' -f"$(join -j2 <(p header) <(p file) | sort -k2n | cut -d' ' -f3 | paste -sd,)" file
id|name|country
1|abc|xyz
2|asd|uio
Solution using bash >= 4:
IFS='|' headers1=($(head -n1 "$file1"))
IFS='|' headers2=($(head -n1 "$file2"))
IFS=$'\n'
# find idxes we want to output, ie. mapping of headers1 to headers2
idx=()
for i in $(seq 0 $((${#headers2[@]}-1))); do
    for j in $(seq 0 $((${#headers1[@]}-1))); do
        if [ "${headers2[$i]}" == "${headers1[$j]}" ]; then
            idx+=($j)
            break
        fi
    done
done
# idx=(0 1 3) for example
# simple join output function from https://stackoverflow.com/questions/1527049/join-elements-of-an-array
join_by() { local IFS="$1"; shift; echo "$*"; }
# first line - output headers
join_by '|' "${headers2[@]}"
isfirst=true
while IFS='|' read -r -a vals; do
    # ignore the first (header) line
    if $isfirst; then
        isfirst=false
        continue
    fi
    # keep from the line only the columns at the idx indices
    tmp=()
    for i in "${idx[@]}"; do
        tmp+=("${vals[$i]}")
    done
    # join output with '|'
    join_by '|' "${tmp[@]}"
done < "$file1"
This one respects the order of columns in file1. To demonstrate, the column order was changed:
$ cat file1
id|country|name
The awk:
$ awk '
BEGIN { FS=OFS="|" }
NR==1 { # file1
    n=split($0,a)
    next
}
NR==2 { # file2 header
    for(i=1;i<=NF;i++)
        b[$i]=i
}
{ # output part
    for(i=1;i<=n;i++)
        printf "%s%s", $b[a[i]], (i==n?ORS:OFS)
}' file1 file2
id|country|name
1|xyz|abc
2|uio|asd
(Another version using cut for outputting is in the revisions.)
This is similar to RavinderSingh13's solution, in that it first reads the headers from the shorter file, and then decides which columns to keep from the longer file based on the headers on the first line of it.
It however does the output differently. Instead of constructing a string, it shifts the columns to the left if it does not want to include a particular field.
BEGIN { FS = OFS = "|" }
# read headers from first file
NR == FNR { for (i = 1; i <= NF; ++i) header[$i]; next }
# mark fields in second file as "selected" if the header corresponds
# to a header in the first file
FNR == 1 {
    for (i = 1; i <= NF; ++i)
        select[i] = ($i in header)
}
{
    skip = 0
    pos = 1
    for (i = 1; i <= NF; ++i)
        if (!select[i]) {          # we don't want this field
            ++skip
            $pos = $(pos + skip)   # shift fields left
        } else
            ++pos
    NF -= skip                     # adjust number of fields
    print
}
Running this:
$ mawk -f script.awk file2 file1
id|name|country
1|abc|xyz
2|asd|uio

Compare two files with many columns using awk and get the columns in which the data differ

file 1:
field1|field2|field3|
abc|123|234
def|345|456
hij|567|678
file2:
field1|field2|field3|
abc|890|234
hij|567|658
desired output:
field1|field2|field3|
abc|N|Y
def|345|456
hij|Y|N
I need to compare: if the fields match, then it should put Y, else N, in the output file.
Using awk, you may try this:
awk -F '|' 'FNR == NR {
    p = $1
    sub(p, "")
    a[p] = $0
    next
}
{
    if (FNR > 1 && $1 in a) {
        split(a[$1], b, /\|/)
        printf "%s", $1 FS
        for (i=2; i<=NF; i++)
            printf "%s%s", ($i == b[i] ? "Y" : "N"), (i == NF ? ORS : FS)
    }
    else
        print
}' file2 file1
field1|field2|field3|
abc|N|Y
def|345|456
hij|Y|N
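One caveat worth noting: `sub(p, "")` treats the key as a regular expression, so keys containing regex metacharacters could misbehave. A literal-safe sketch of the same idea, using substr instead (sample data from the question):

```shell
cat > file1 <<'EOF'
field1|field2|field3|
abc|123|234
def|345|456
hij|567|678
EOF
cat > file2 <<'EOF'
field1|field2|field3|
abc|890|234
hij|567|658
EOF

awk -F '|' '
FNR == NR {                               # first file: remember each row keyed by field 1
    a[$1] = substr($0, length($1) + 1)    # strip the key literally, not as a regex
    next
}
FNR > 1 && $1 in a {                      # rows present in both files: compare field by field
    split(a[$1], b, /\|/)
    printf "%s", $1 FS
    for (i = 2; i <= NF; i++)
        printf "%s%s", ($i == b[i] ? "Y" : "N"), (i == NF ? ORS : FS)
    next
}
{ print }                                 # header and unmatched rows pass through unchanged
' file2 file1
```

This prints the same Y/N table as the answer above.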

AWK: increment a field based on values from previous line

Given the following input for AWK:
10;20;20
8;41;41
15;52;52
How could I increase/decrease the values so that:
$1 = remains unchanged
$2 = $2 of previous line + $1 of previous line + 1
$3 = $3 of previous line + $1 of previous line + 1
So the desired output would be:
10;20;20
8;31;31
15;40;40
I need to auto-increment and loop over the lines,
using associative arrays, but it's confusing for me.
Surely, this doesn't work as desired:
#!/bin/awk -f
BEGIN { FS = ";" }
{
print ln, st, of
ln=$1
st=$2 + ln + 1
of=$3 + ln + 1
}
With awk:
awk -F";" -v OFS=";" 'NR!=1{ $2=a[2]+a[1]+1; $3=a[3]+a[1]+1 } { split($0,a,FS) } 1' file
We split each line into an array so that, when processing the next line, we can use the stored values.
test
10;20;20
8;31;31
15;40;40
The following awk may help:
awk -F";" '
FNR==1{
    val=$1;
    val1=$2;
    val2=$3;
    print;
    next
}
{
    $2=val+val1+1;
    $3=val+val2+1;
    print;
    val=$1;
    val1=$2;
    val2=$3;
}' OFS=";" Input_file
For your given Input_file, output will be as follows.
10;20;20
8;31;31
15;40;40
awk 'BEGIN{
FS = OFS = ";"
}
FNR>1{
$2 = p2 + p1 + 1
$3 = p3 + p1 + 1
}
{
p1=$1; p2=$2; p3=$3
}1
' infile
Input:
$ cat infile
10;20;20
8;41;41
15;52;52
Output:
awk 'BEGIN{FS=OFS=";"}FNR>1{$2=p2+p1+1; $3=p3+p1+1 }{p1=$1; p2=$2; p3=$3}1' infile
10;20;20
8;31;31
15;40;40
Or store only fields of your interest
awk -v myfields="2,3" '
BEGIN{
    FS=OFS=";"
    split(myfields,t,/,/)
}
{
    for(i in t)
    {
        if(FNR>1)
        {
            $(t[i]) = a[t[i]] + a[1] + 1
        }
        a[t[i]] = $(t[i])
    }
    a[1] = $1
}1' infile
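The generalized version produces the same result as the fixed-field ones; a quick check on the sample input:

```shell
cat > infile <<'EOF'
10;20;20
8;41;41
15;52;52
EOF

# Fields listed in myfields are updated from the previous line's stored values
awk -v myfields="2,3" '
BEGIN { FS=OFS=";"; split(myfields,t,/,/) }
{
    for (i in t) {
        if (FNR>1)
            $(t[i]) = a[t[i]] + a[1] + 1
        a[t[i]] = $(t[i])
    }
    a[1] = $1
}1' infile
```

which prints 10;20;20, then 8;31;31, then 15;40;40, as before.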

Find strings from one file in another file

I want to match the keywords in FILE2 against each column of FILE1, and print <BLANK> when a keyword from FILE2 is not found in FILE1, regardless of the separators.
FILE1
XYZ=TRS-000|XYZ=TWR-000|UJU=909|GFT=879|JKP=908
XYZ=TRS-000|XYZ=TWR-000|GFT=879|JKP=908
FILE2
TRS-0
TWR
UJU
GFT-8
OUTPUT
XYZ=TRS-000|XYZ=TWR-000|UJU=909|GFT=879
XYZ=TRS-000|XYZ=TWR-000||GFT=879
SCRIPT so far
(This script finds exact matches of FILE2 entries against FILE1 columns, with = as separator. I can't figure out how to test whether a string from FILE2 is contained in a column of FILE1.)
BEGIN{FS=OFS="|"}
NR==FNR{a[++i]=$1;next}
{
    d=""
    delete b
    for(j=1;j<=NF;j++){
        split($j,c,"-")
        b[c[1]]=$j
    }
    for(j=1;j<=i;j++){
        d=d (d==""?"":OFS) (a[j] in b?b[a[j]]:"")
    }
    print d
}
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==FNR { sub(/[^[:alpha:]].*/,""); keys[++numKeys]=$0; next }
{
    delete key2val
    for (i=1;i<=NF;i++) {
        for (keyNr=1; keyNr<=numKeys; keyNr++) {
            key = keys[keyNr]
            if ( $i ~ ("="key"-|^"key"=") ) {
                key2val[key] = $i
            }
        }
    }
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        key = keys[keyNr]
        printf "%s%s", key2val[key], (keyNr<numKeys?OFS:ORS)
    }
}
$ awk -f tst.awk file2 file1
XYZ=TRS-000|XYZ=TWR-000|UJU=909|GFT=879
XYZ=TRS-000|XYZ=TWR-000||GFT=879
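For reference, the full run can be reproduced like this. The sub() call strips everything from the first non-alphabetic character, so the FILE2 entries TRS-0 and GFT-8 become the keys TRS and GFT:

```shell
cat > file1 <<'EOF'
XYZ=TRS-000|XYZ=TWR-000|UJU=909|GFT=879|JKP=908
XYZ=TRS-000|XYZ=TWR-000|GFT=879|JKP=908
EOF
cat > file2 <<'EOF'
TRS-0
TWR
UJU
GFT-8
EOF

# A field matches a key when it contains "=KEY-" or starts with "KEY="
awk '
BEGIN { FS=OFS="|" }
NR==FNR { sub(/[^[:alpha:]].*/,""); keys[++numKeys]=$0; next }
{
    delete key2val
    for (i=1; i<=NF; i++)
        for (k=1; k<=numKeys; k++)
            if ( $i ~ ("=" keys[k] "-|^" keys[k] "=") )
                key2val[keys[k]] = $i
    for (k=1; k<=numKeys; k++)
        printf "%s%s", key2val[keys[k]], (k<numKeys?OFS:ORS)
}' file2 file1
```

Keys with no matching field in a line (UJU in the second line) print as empty columns.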
Based on your sample input, there are two patterns for determining the keyword after the split. This solution sets index i accordingly.
BEGIN {FS=OFS="|"}
NR==FNR {a[$1]=NR; cols=NR; next}
{
    delete out
    for (f=1;f<=NF;++f) {
        i = ($f ~ /^.*=.*-.*$/) ? 2 : 1
        split($f, b, /[=-]/)
        if (b[i] in a) {
            out[a[b[i]]] = $f
        }
    }
    printf "%s", out[1]
    for (j=2; j<=cols; ++j) {
        printf "%s%s", OFS, out[j]
    }
    print ""
}
$ cat file1
XYZ=TRS-000|XYZ=TWR-000|UJU=909|GFT=879|JKP=908
XYZ=TRS-000|XYZ=TWR-000|GFT=879|JKP=908
$ cat file2
TRS
TWR
UJU
GFT
$ awk -f utl.awk file2 file1
XYZ=TRS-000|XYZ=TWR-000|UJU=909|GFT=879
XYZ=TRS-000|XYZ=TWR-000||GFT=879

Rearranging a csv file

I have a file with contents similar to the below
Boy,Football
Boy,Football
Boy,Football
Boy,Squash
Boy,Tennis
Boy,Football
Girl,Tennis
Girl,Squash
Girl,Tennis
Girl,Tennis
Boy,Football
How can I use 'awk' or similar to rearrange this to the below:
Football Tennis Squash
Boy 5 1 1
Girl 0 3 1
I'm not even sure if this is possible, but any help would be great.
$ cat tst.awk
BEGIN{ FS=","; OFS="\t" }
{
    genders[$1]
    sports[$2]
    count[$1,$2]++
}
END {
    printf ""
    for (sport in sports) {
        printf "%s%s", OFS, sport
    }
    print ""
    for (gender in genders) {
        printf "%s", gender
        for (sport in sports) {
            printf "%s%s", OFS, count[gender,sport]+0
        }
        print ""
    }
}
$ awk -f tst.awk file
Squash Tennis Football
Boy 1 1 5
Girl 1 3 0
In general when you know the end point of the loop you put the OFS or ORS after each field:
for (i=1; i<=n; i++) {
printf "%s%s", $i, (i<n?OFS:ORS)
}
but if you don't then you put the OFS before the second and subsequent fields and print the ORS after the loop:
for (x in array) {
printf "%s%s", (++i>1?OFS:""), array[x]
}
print ""
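Both patterns can be tried in isolation; here is a minimal sketch of each (the array contents are illustrative):

```shell
# Known length: separator after every element except the last
awk 'BEGIN {
    n = split("a b c", arr, " ")
    for (i = 1; i <= n; i++)
        printf "%s%s", arr[i], (i < n ? "," : "\n")
}'
# prints a,b,c

# Unknown length (for-in): separator before every element after the first.
# Note that for-in visits elements in an unspecified order.
awk 'BEGIN {
    split("a b c", arr, " ")
    for (x in arr)
        printf "%s%s", (++i > 1 ? "," : ""), arr[x]
    print ""
}'
```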
I do like the:
n = length(array)
for (x in array) {
printf "%s%s", array[x], (++i<n?OFS:ORS)
}
idea to get the end of the loop too, but length(array) is gawk-specific.
Another approach to consider:
$ cat tst.awk
BEGIN{ FS=","; OFS="\t" }
{
    for (i=1; i<=NF; i++) {
        if (!seen[i,$i]++) {
            map[i,++num[i]] = $i
        }
    }
    count[$1,$2]++
}
END {
    for (i=0; i<=num[2]; i++) {
        printf "%s%s", map[2,i], (i<num[2]?OFS:ORS)
    }
    for (i=1; i<=num[1]; i++) {
        printf "%s%s", map[1,i], OFS
        for (j=1; j<=num[2]; j++) {
            printf "%s%s", count[map[1,i],map[2,j]]+0, (j<num[2]?OFS:ORS)
        }
    }
}
$ awk -f tst.awk file
Football Squash Tennis
Boy 5 1 1
Girl 0 1 3
That last will print the rows and columns in the order they were read. Not quite as obvious how it works though :-).
I would just loop normally:
awk -F, -v OFS="\t" '
{names[$1]; sport[$2]; count[$1,$2]++}
END{
    printf "%s", OFS
    for (i in sport)
        printf "%s%s", i, OFS
    print ""
    for (n in names) {
        printf "%s%s", n, OFS
        for (s in sport)
            printf "%s%s", count[n,s]?count[n,s]:0, OFS
        print ""
    }
}' file
This keeps track of three arrays: names[] for the first column, sport[] for the second column and count[name,sport] to count the occurrences of every combination.
Then, it is a matter of looping through the results and printing them in a fancy way and making sure 0 is printed if the count[a,b] does not exist.
Test
$ awk -F, -v OFS="\t" '{names[$1]; sport[$2]; count[$1,$2]++} END{printf "%s", OFS; for (i in sport) printf "%s%s", i, OFS; print ""; for (n in names) {printf "%s%s", n, OFS; for (s in sport) printf "%s%s", count[n,s]?count[n,s]:0, OFS; print ""}}' a
Squash Tennis Football
Boy 1 1 5
Girl 1 3 0
The format is a bit ugly; there are some trailing OFS characters.
To get rid of trailing OFS:
awk -F, -v OFS="\t" '{names[$1]; sport[$2]; count[$1,$2]++} END{printf "%s", OFS; for (i in sport) {cn++; printf "%s%s", i, (cn<length(sport)?OFS:ORS)} for (n in names) {cs=0; printf "%s%s", n, OFS; for (s in sport) {cs++; printf "%s%s", count[n,s]?count[n,s]:0, (cs<length(sport)?OFS:ORS)}}}' a
You can always pipe to column -t for a nice output.
