match a row value to a column value, and rename the row - sorting

I have a file with the following header:
File 1:
location, nameA, nameB, nameC
and a second file with the format:
File2:
ID_number, names
101, nameA
102, nameB
103, nameC
I would like to match the row names from File1 to those in column 2 of File2, and if they match, replace the names in the header with the ID number, so that in the end the resulting file would look like:
File 1:
location, 101, 102, 103
I've mostly been trying to do this with awk, but I can't get it to produce anything, and I'm not sure how to make it do the last part of what I want.
awk -F "," '{print $2}' file2.csv | while read i; do awk 'NR=1;{for (j=0;j<=NF;j++) {if ($j == $i) printF $j; }}' file1.csv;done > test.csv
It's a really large file with thousands of columns and rows, so I just put up a simplified version of the files in my question here.
Thanks!

This should work if your csv fields have no embedded commas. It also assumes that both files have a header line.
awk '
BEGIN { FS=","; OFS=", " }
FNR == 1 {              # if it is the header line
    if (NR != 1)        # if it is the second file
        print           # print it
    next                # go to the next line of the file
}
{ gsub(/ +/, "") }      # compress spaces
NR == FNR {             # if it is the first file
    a[$2] = $1          # save the info
    next                # go to the next line of the file
}
{
    $2=a[$2]; $3=a[$3]; $4=a[$4]    # swap names
    print                           # print the line
}
' file2.csv file1.csv
Test files:
file1.csv
location, nameA, nameB, nameC
Earth, Chuck, Edward, Bob
The Moon, Bob, Doris, Al
file2.csv
ID_number, names
101, Al
102, Bob
103, Chuck
104, Doris
105, Edward
Output:
location, nameA, nameB, nameC
Earth, 103, 105, 102
TheMoon, 102, 104, 101
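Since the question mentions thousands of columns, the hardcoded $2/$3/$4 swaps won't scale. A loop over all fields from the second one onward should behave the same way; here is a sketch, under the same no-embedded-commas assumption:
awk '
BEGIN { FS=","; OFS=", " }
FNR == 1 { if (NR != 1) print; next }   # header handling as above
{ gsub(/ +/, "") }                      # compress spaces
NR == FNR { a[$2] = $1; next }          # first file: remember each ID by name
{
    for (i = 2; i <= NF; i++)           # swap every name field, however many there are
        if ($i in a) $i = a[$i]
    print
}
' file2.csv file1.csv
The in test leaves any name without a matching ID untouched instead of blanking it.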

Related

awk: math operations of multi-column data in multiple CSV files

I am working on a bash script that loops over multi-column data files and executes embedded AWK code to operate on the multi-column data.
#!/bin/bash
home="$PWD"
# folder with the outputs
rescore="${home}"/rescore
# folder with the folders to analyse
storage="${home}"/results

while read -r d; do
    awk -F ", *" ' # set field separator to a comma followed by 0 or more spaces
    FNR==1 {
        if (n) { # calculate the results of the previous file
            f= # apply this equation to rescore data using values of $3 and $2
            f[suffix] = f # store the results in the array
            n=$1 # take ID of the column
        }
        prefix=suffix=FILENAME
        sub(/_.*/, "", prefix)
        sub(/\/[^\/]+$/, "", suffix)
        sub(/^.*_/, "", suffix)
        n = 1 # count of samples
        min = 0 # lowest value of $3 (assuming all $3 < 0)
    }
    FNR > 1 {
        s += $3
        s2 += $3 * $3
        ++n
        if ($3 < min) min = $3 # update the lowest value
    }
    END {
        print "ID" prefix, rescoring
        for (i in n)
            printf "%s %.2f\n", i, f[i]
    }' "${d}_"*/input.csv > "${rescore}/"${d%%_*}".csv"
done < <(find . -maxdepth 1 -type d -name '*_*_*' | awk -F '[_/]' '!seen[$2]++ {print $2}')
Briefly, the workflow should process each line of the input.csv located inside the ${d} folder, which has been correctly identified by my bash script:
# input.csv located in the folder 10V1_cne_lig12
ID, POP, dG
1, 142, -5.6500 # this is dG(min)
2, 10, -5.5000
3, 2, -4.9500
4, 150, -4.1200
My AWK script is expected to process each line of each CSV file, reducing it to two columns and keeping in the output: i) the number from the first column of input.csv (the ID of the processed line) plus the name of the folder ($d) containing the CSV file, and ii) the result of the math operation (f) applied to the numbers in the POP and dG columns of input.csv:
f(ID) = sqrt(((dG(ID)+10)/10)^2 + ((POP(ID)-240)/240)^2)
where dG(ID) is the dG value ($3) of the "rescored" line of input.csv, and POP(ID) is its POP value ($2). Eventually, the output.csv containing the information for one input.csv should be in the following format:
# output.csv
ID, rescore value
1 10V1_cne_lig12, f(ID1)
2 10V1_cne_lig12, f(ID2)
3 10V1_cne_lig12, f(ID3)
4 10V1_cne_lig12, f(ID4)
While the bash part of my code (dealing with the looping over the CSVs in the distinct directories) works correctly, I am stuck with the AWK code, which does not correctly assign the ID of each line, so I cannot apply the math operation shown above to the $2 and $3 columns of the line with the specified ID.
Given the input file folder/file:
ID, POP, dG
1, 142, -5.6500
2, 10, -5.5000
3, 2, -4.9500
4, 150, -4.1200
this script
$ awk -F', *' -v OFS=', ' '
FNR==1 {path=FILENAME; sub(/\/[^/]+$/,"",path); print $1,"rescore value"; next}
{print $1" "path, sqrt((($3+10)/10)^2+(($2-240)/240)^2)}' folder/file
will produce
ID, rescore value
1 folder, 0.596625
2 folder, 1.05873
3 folder, 1.11285
4 folder, 0.697402
Not sure what the rest of your code does, but I guess you can integrate it in.
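For what it's worth, here is a sketch of how that one-liner might be slotted back into the original bash loop (paths and globs taken from the question; the NR == 1 guard is an addition that keeps the header from repeating when several input.csv files match the glob):
while read -r d; do
    awk -F', *' -v OFS=', ' '
        FNR==1 {                                   # start of each input file
            path=FILENAME; sub(/\/[^/]+$/,"",path)
            if (NR==1) print $1, "rescore value"   # print the header only once
            next
        }
        { print $1" "path, sqrt((($3+10)/10)^2+(($2-240)/240)^2) }
    ' "${d}_"*/input.csv > "${rescore}/${d%%_*}.csv"
done < <(find . -maxdepth 1 -type d -name '*_*_*' | awk -F '[_/]' '!seen[$2]++ {print $2}')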

Extract common lines from multiple text files and display original line numbers

What I want:
Extract the common lines from n large files.
Append the original line numbers from each file.
Example:
File1.txt has the following content
apple
banana
cat
File2.txt has the following content
boy
girl
banana
apple
File3.txt has the following content
foo
apple
bar
The output should be a different file:
1 4 2 apple
1, 4 and 2 in the output are the original line numbers in File1.txt, File2.txt and File3.txt where the common line apple appears.
I have tried using grep -nf File1.txt File2.txt File3.txt, but it returns
File2.txt:4:apple
File3.txt:2:apple
Associate each unique line with a space-separated list of the line numbers where it is seen in each file, and print these next to the line at the end if the line is found in all three files.
awk '{
    n[$0] = n[$0] FNR OFS   # append the current line number to the list for this line
    c[$0]++                 # count occurrences of the line
}
END {
    for (r in c)
        if (c[r] == 3)      # found in all three files
            print n[r] r
}' file1 file2 file3
If the number of files is unknown, refer to Ravinder's answer, or just change the hardcoded 3 in the END block with ARGC-1 as shown there.
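That is, the END block above becomes:
END {
    for (r in c)
        if (c[r] == ARGC - 1)   # ARGC counts the program name plus the file arguments
            print n[r] r
}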
A GNU-awk-specific approach that works with any number of files:
#!/usr/bin/gawk -f

BEGINFILE {
    nfiles++                                  # runs before each input file: count the files
}

{
    lines[$0][nfiles] = FNR                   # record this line's number for the current file
}

END {
    PROCINFO["sorted_in"] = "#ind_str_asc"    # iterate over lines in sorted order
    for (line in lines) {
        if (length(lines[line]) == nfiles) {  # the line was seen in every file
            for (file = 1; file <= nfiles; file++)
                printf "%d\t", lines[line][file]
            print line
        }
    }
}
Example:
$ ./showlines file[123].txt
1 4 2 apple
Could you please try the following, written and tested with GNU awk. It makes use of the ARGC value, which gives us the total number of arguments passed to the awk program (the program name plus the input files, hence the ARGC-1 below).
awk '
{
    a[$0]=(a[$0]?a[$0] OFS:"")FNR
    count[$0]++
}
END{
    for(i in count){
        if(count[i]==(ARGC-1)){
            print i,a[i]
        }
    }
}
' file1.txt file2.txt file3.txt
A perl solution:
perl -ne '
    $h{$_} .= "$.\t";   # append the current line number and a tab to the hash value keyed by the line
    $. = 0 if eof;      # reset the line number when end of file is reached
    END{
        while ( ($k,$v) = each %h ) {    # loop over hash entries
            if ( $v =~ y/\t// == 3 ) {   # if the value contains 3 tabs
                print $v.$k              # print the value concatenated with the key
            }
        }
    }' file1.txt file2.txt file3.txt

Adding columns to a csv table with AWK from multiple files

I'm looking to build a csv table by getting values from several files with AWK. I have it working with two files, but I can't scale it beyond that. I'm currently taking the output of the second file, and appending the third, and so on.
Here are example files:
#file1 #file2 #file3 #file4
100 45 1 5
200 23 1 2
300 29 2 1
400 0 1 2
500 74 4 5
This is the goal:
#data.csv
1,100,45,1,5
2,200,23,1,2
3,300,29,2,1
4,400,0,1,2
5,500,74,4,5
This is what I have working:
awk 'FNR==NR { a[FNR""] = NR", " $0","; next } { print a[FNR""], $0}' $file1 $file2
With the result:
1, 100, 45
2, 200, 23
3, 300, 29
4, 400, 0
5, 500, 74
But when I try and get it to work on 3 or more files, like so:
awk 'FNR==NR { a[FNR""] = NR", " $0","; next } { print a[FNR""], $0; next } { print a[FNR""], $0}' $file1 $file2 $file3
I get this output:
1, 100, 45
2, 200, 23
3, 300, 29
4, 400, 0
5, 500, 74
1, 100, 1
2, 200, 1
3, 300, 2
4, 400, 1
5, 500, 4
In the first column the line count restarts, and the second column repeats the first file. The third and subsequent files are added as new rows in the third column, whereas I would expect them to be added as new columns; no new rows are required.
Any help would be greatly appreciated. I have learned most of my AWK from Stack Exchange, and I know I'm missing something fundamental here. Thanks,
As already answered, you can use paste. To get the exact output with comma-delimited line numbering, you can do this:
paste -d, file{1..4} | nl -s, -w1
-s, sets the number separator to a comma (the default is a tab).
-w1 sets the number width to 1, so there are no leading spaces (the default width is larger).
Another solution with awk:
awk '{a[FNR]=a[FNR] "," $0}
END {for (i=1;i<=length(a);i++) print i a[i]}' file{1..4}
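Note that length() on an array is not in POSIX awk (GNU awk and some others support it). If portability matters, one workaround is to track the highest line number by hand:
awk '{a[FNR]=a[FNR] "," $0; if (FNR>max) max=FNR}
END {for (i=1;i<=max;i++) print i a[i]}' file{1..4}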
Why don't you use paste and then simply number each row:
paste -d"," file1 file2 file3 file4
100,45,1,5
200,23,1,2
300,29,2,1
400,0,1,2
500,74,4,5
An awk solution for a variable number of files:
awk '{ !line[FNR] && line[FNR]=FNR; line[FNR]=line[FNR]","$0 }
END { for (i=1; i<=length(line); i++) print line[i] }' file1 file2 ... fileN
For example:
$ awk '{ !line[FNR] && line[FNR]=FNR; line[FNR]=line[FNR]","$0 }
END { for (i=1; i<=length(line); i++) print line[i] }' \
<(seq 1 5) <(seq 11 15) <(seq 21 25) <(seq 31 35)
1,1,11,21,31
2,2,12,22,32
3,3,13,23,33
4,4,14,24,34
5,5,15,25,35
Here is a beginner-friendly solution. If you need to manipulate the data on the way in, you can clearly see which file is being read.
ARGIND is gawk-specific. It tells us which file we are processing. We fill two arrays a and b from file1 and file2, and then print the desired output while processing file3.
awk '
ARGIND == 1 { a[FNR] = $0 ; next }
ARGIND == 2 { b[FNR] = $0 ; next }
ARGIND == 3 { print FNR "," a[FNR] "," b[FNR] "," $0 }
' file1 file2 file3
Output:
1,100,45,1
2,200,23,1
3,300,29,2
4,400,0,1
5,500,74,4
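Since ARGIND is gawk-only, a rough POSIX-awk equivalent is to count file boundaries yourself; a sketch, assuming none of the files is empty:
awk '
FNR == 1 { argind++ }                                   # FNR resets to 1 at each new file
argind == 1 { a[FNR] = $0 ; next }
argind == 2 { b[FNR] = $0 ; next }
argind == 3 { print FNR "," a[FNR] "," b[FNR] "," $0 }
' file1 file2 file3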

Parsing a key value in a CSV file using a shell script

Given this CSV input file:
Id Name Address Phone
---------------------
100 Abc NewYork 1234567890
101 Def San Antonio 9876543210
102 ghi Chicago 7412589630
103 GHJ Los Angeles 7896541259
How do we grep/command for a value using the key?
If the key is 100, the expected output is NewYork.
You can try this:
grep 100 filename.csv | cut -d, -f3
Output:
NewYork
This will search the whole file for the value 100, and return all the values in the 3rd column of the matching rows.
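One caveat: a bare grep 100 also matches 100 occurring anywhere else on a line (inside a phone number, say). Anchoring the pattern to the key column is safer, assuming the file really is comma-separated:
grep '^100,' filename.csv | cut -d, -f3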
With GNU grep:
grep -Po '^100.....\K...........' file
or shorter:
grep -Po '^100.{5}\K.{11}' file
Output:
NewYork
Awk splits lines by whitespace sequences (by default). You could use that to write a condition on the first column.
Your example input looks like fixed-width columns rather than CSV (except for the header). If that's the case, then you can extract the name of the city as a substring:
awk '$1 == 100 { print substr($0, 9, 11); }' input.csv
Here 9 is the starting position of the city column, and 11 is its length.
If, on the other hand, your input file is not what you pasted but really CSV (comma-separated values), and there are no other embedded commas or newline characters in the input, then you can write it like this:
awk -F, '$1 == 100 { print $3 }' input.csv
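If the key comes from a shell variable, it can be passed in with -v rather than being spliced into the script (a small sketch; key is just an illustrative name):
key=100
awk -F, -v key="$key" '$1 == key { print $3 }' input.csv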

Use awk to fix CSV file with commas in unenclosed fields

I have a CSV file which looks like:
height, comment, name
152, he was late, for example, on Tuesday, Fred
162, , Sam
I cannot parse this file because it includes a variable number of unenclosed commas in the comment field (but no other fields). I would like to fix the file using awk (which is very new to me) so that the commas in the second field become semi-colons:
height, comment, name
152, he was late; for example; on Tuesday, Fred
162, , Sam
(Enclosing the entire field in quotes will not solve my problem because my CSV parser does not understand quotes.)
So far I am looking at using NF to work out the number of unenclosed commas and then replacing them using gsub with an unpleasant regex, but I feel I should be able to leverage awk to write a more readable program and I am not sure NF behaves this way.
Essentially just a brute-force solution, but fairly easy to understand. Invoke it with:
$ awk -F "," -f test.awk test.dat
The awk file:
$ cat test.awk
{
    printf "%s, ", $1
    if (NF > 3) {
        # join the extra comment pieces with ";" (no trailing ";")
        for (i = 2; i < NF; i++) {
            printf "%s%s", $i, (i < NF - 1 ? ";" : "")
        }
        printf ", "
    }
    else {
        printf "%s, ", $2
    }
    printf "%s\n", $NF
}
$ cat file
height, comment, name
152, he was late, for example, on Tuesday, Fred
162, , Sam
$ awk -v OFS=, '{
    height = comment = name = $0
    sub(/,.*$/, "", height)                  # keep everything before the first comma
    sub(/^.*,/, "", name)                    # keep everything after the last comma
    gsub(/^[^,]+,|,[^,]+$/, "", comment)     # strip the first and last fields
    gsub(/,/, ";", comment)                  # turn the remaining commas into semicolons
    print height, comment, name
}' file
height, comment, name
152, he was late; for example; on Tuesday, Fred
162, , Sam
