awk: select first column and value in column after matching word - bash

I have a .csv where each row corresponds to a person (first column), followed by attributes and the values available for that person. I want to extract the names and values of a particular attribute for the persons where that attribute is available. The doc is structured as follows:
name,attribute1,value1,attribute2,value2,attribute3,value3
joe,height,5.2,weight,178,hair,
james,,,,,,
jesse,weight,165,height,5.3,hair,brown
jerome,hair,black,breakfast,donuts,height,6.8
I want a file that looks like this:
name,attribute,value
joe,height,5.2
jesse,height,5.3
jerome,height,6.8
Using this earlier post, I've tried a few different awk methods but am still having trouble getting both the first column and then whatever column has the desired value for the attribute (say height). For example the following returns everything.
awk -F "height," '{print $1 "," FS$2}' file.csv
I could grep only the rows with height in them, but I'd prefer to do everything in a single line if I can.

You may use this awk:
cat attrib.awk
BEGIN {
FS=OFS=","
print "name,attribute,value"
}
NR > 1 && match($0, k "[^,]+") {
print $1, substr($0, RSTART+1, RLENGTH-1)
}
# then run it as
awk -v k=',height,' -f attrib.awk file
name,attribute,value
joe,height,5.2
jesse,height,5.3
jerome,height,6.8
# or this one
awk -v k=',weight,' -f attrib.awk file
name,attribute,value
joe,weight,178
jesse,weight,165

With your shown samples, please try the following awk code, written and tested in GNU awk. It sets RS (the record separator) to the regex ^[^,]*,height,[^,]* and then prints RT, the text that matched RS (RT is a GNU awk extension), to get the expected output.
awk -v RS='^[^,]*,height,[^,]*' 'RT{print RT}' Input_file

I'd suggest a sed one-liner:
sed -n 's/^\([^,]*\).*\(,height,[^,]*\).*/\1\2/p' file.csv

One awk idea:
awk -v attr="height" '
BEGIN { FS=OFS="," }
FNR==1 { print "name", "attribute", "value"; next }
{ for (i=2;i<=NF;i+=2) # loop through even-numbered fields
if ($i == attr) { # if field value is an exact match to the "attr" variable then ...
print $1,$i,$(i+1) # print current name, current field and next field to stdout
next # no need to check rest of current line; skip to next input line
}
}
' file.csv
NOTE: this assumes the input value (height in this example) will match exactly (including same capitalization) with a field in the file
This generates:
name,attribute,value
joe,height,5.2
jesse,height,5.3
jerome,height,6.8
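If the capitalization caveat in the NOTE is a concern, one possible variation is to lower-case both sides before comparing. This is only a sketch, not part of the answer above: the temp file and the variable name want are illustrative assumptions, and the data is the sample from the question.

```shell
# Sample data copied from the question; the temp file exists only for this demo.
csv=$(mktemp)
cat > "$csv" <<'EOF'
name,attribute1,value1,attribute2,value2,attribute3,value3
joe,height,5.2,weight,178,hair,
james,,,,,,
jesse,weight,165,height,5.3,hair,brown
jerome,hair,black,breakfast,donuts,height,6.8
EOF

# attr is deliberately passed in a different case than the file uses.
out=$(awk -v attr="Height" '
BEGIN    { FS = OFS = ","; want = tolower(attr) }   # "want" is a hypothetical helper variable
FNR == 1 { print "name", "attribute", "value"; next }
{
    for (i = 2; i < NF; i += 2)          # attribute names sit in even-numbered fields
        if (tolower($i) == want) {       # case-insensitive comparison
            print $1, $i, $(i + 1)       # name, attribute as spelled in the file, value
            next
        }
}' "$csv")
printf '%s\n' "$out"
rm -f "$csv"
```

The attribute is printed as it is spelled in the file, so the output for this sample is the same as shown above.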

With a perl one-liner:
$ perl -lne '
print "name,attribute,value" if $.==1;
print "$1,$2" if /^(\w+).*(height,\d+\.\d+)/
' file
output
name,attribute,value
joe,height,5.2
jesse,height,5.3
jerome,height,6.8

awk accepts variable-value arguments following a -v flag before the script. Thus, the name of the required attribute can be passed into an awk script using the general pattern:
awk -v attr=attribute1 ' {} ' file.csv
Inside the script, the value of the passed variable is referenced by the variable name, in this case attr.
Your criteria are to print column 1, the first column containing the name, the column corresponding to the required header value, and the column immediately after that column (holding the matched values).
Thus, the following script allows you to fish out the column headed "attribute1" and its next neighbour:
awk -v attr=attribute1 ' BEGIN {FS=","} NR==1{for (i=1;i<=NF;i++) if($i == attr) col=i;} {print $1","$col","$(col+1)} ' data.txt
result:
name,attribute1,value1
joe,height,5.2
james,,
jesse,weight,165
jerome,hair,black
another column (attribute 3):
awk -v attr=attribute3 ' BEGIN {FS=","} NR==1{for (i=1;i<=NF;i++) if($i == attr) col=i;} {print $1","$col","$(col+1)} ' data.txt
result:
name,attribute3,value3
joe,hair,
james,,
jesse,hair,brown
jerome,height,6.8
Just change the value of the -v attr= argument for the required column.

Related

change numerical value in file to characters via awk

I'm looking to replace the numerical values in a file with a new value that I provide. The number can appear anywhere in the text; in some cases it is in the third position, but not always. I also want to save the result as a new version of the file.
original format
A:fdg:user#server:r
A:g:1234:xtcy
A:d:1111:xtcy
modified format
A:fdg:user#server:rxtTncC
A:g:replaced_value:xtcy
A:d:replaced_value:xtcy
bash line command with awk:
awk -v newValue="newVALUE" 'BEGIN{FS=OFS=":"} /:.:.*:/ && $3 ~ /^[0-9]+$/ {$3=newValue} 1' original_file.txt > replaced_file.txt
You can simply use sed instead of awk:
sed -E 's/\b[0-9]+\b/replaced_value/g' /path/to/infile > /path/to/outfile
Here is an awk that asks you for replacement values for each numerical value it meets:
$ awk '
BEGIN {
FS=OFS=":" # delimiters
}
{
for(i=1;i<=NF;i++) # loop all fields
if($i~/^[0-9]+$/) { # if numerical value found
printf "Provide replacement value for %d: ",$i > "/dev/stderr"
getline $i < "/dev/stdin" # ask for a replacement
}
}1' file_in > file_out # write output to a new file
I would use GNU AWK for this task in the following way. Let the content of file.txt be
A:fdg:user#server:rxtTncC
A:g:1234:xtcy
A:d:1111:xtcy
then
awk 'BEGIN{newvalue="replacement"}{gsub(/[[:digit:]]+/,newvalue);print}' file.txt
output
A:fdg:user#server:rxtTncC
A:g:replacement:xtcy
A:d:replacement:xtcy
Explanation: replace every run of one or more digits with newvalue. Disclaimer: I assumed a numeric value is something consisting solely of digits.
(tested in gawk 4.2.1)
How about
awk -F : '$3 ~ /^[0-9]+$/ { $3 = "new value"} {print}' original_file >replaced_file
?
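Since the question says the number is not always in the third position, a non-interactive variation is to test every field rather than only $3. The following is a minimal sketch under two assumptions: a "numerical value" is a field made up entirely of digits, and the temp file stands in for original_file.

```shell
# Sample rows from the question; the temp file exists only for this demo.
in_file=$(mktemp)
cat > "$in_file" <<'EOF'
A:fdg:user#server:r
A:g:1234:xtcy
A:d:1111:xtcy
EOF

out=$(awk -v newValue="replaced_value" '
BEGIN { FS = OFS = ":" }
{
    for (i = 1; i <= NF; i++)
        if ($i ~ /^[0-9]+$/)     # the whole field consists of digits
            $i = newValue        # assigning to a field rebuilds $0 using OFS
    print
}' "$in_file")
printf '%s\n' "$out"
rm -f "$in_file"
```

Unlike the gsub() approach, this leaves digits embedded in a longer field (for example abc123) untouched. Redirect the awk output to a new filename to save it.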

grep few columns from a file to another file in shell

The following file is present in file1.txt:
mudId|~|mudType|~|mudNAme|~|mudDate|~|mudEndDate
100|~|Balance|~|Abc|~|21-09-2020|~|22-09-2020
101|~|Clone|~|Bcd|~|11-07-2020|~|12-07-2020
102|~|Ledger|~|Def|~|12-06-2019|~|13-06-2019
How to grep only the columns mudId, mudType and mudDate with all the rows into another file?
The columns are separated by |~|
To meet your criteria of specifying the field names from the heading row, you can use awk utilizing a Regular Expression as the Field-Separator variable (e.g. "[|][~][|]"). For the first record (line), read the field names as array indexes and set the value to the current field index. For your second rule, simply output the field value captured in your array that corresponds to the strings "mudId", "mudType" and "mudDate".
For example you can do:
awk '
BEGIN { FS="[|][~][|]"; OFS="|~|" }
FNR==1 { for(i=1;i<=NF;i++) arr[$i]=i; next }
{ print $arr["mudId"], $arr["mudType"], $arr["mudDate"] }
' file
(note: the above intentionally generalizes to meet your criteria where you want to specify the string names of the fields to output)
If you simply want to write fields 1, 2, & 4 to a new file, you would do:
awk -v FS="[|][~][|]" -v OFS="|~|" 'FNR>1 {print $1,$2,$4}' file
Example Use/Output
Simply copy/middle-mouse paste the above into an xterm where file is in the current directory, e.g.
$ awk '
> BEGIN { FS="[|][~][|]"; OFS="|~|" }
> FNR==1 { for(i=1;i<=NF;i++) arr[$i]=i; next }
> { print $arr["mudId"], $arr["mudType"], $arr["mudDate"] }
> ' file
100|~|Balance|~|21-09-2020
101|~|Clone|~|11-07-2020
102|~|Ledger|~|12-06-2019
(note: if you want the new file space-delimited, just remove OFS="|~|")
or
$ awk -v FS="[|][~][|]" -v OFS="|~|" 'FNR>1 {print $1,$2,$4}' file
100|~|Balance|~|21-09-2020
101|~|Clone|~|11-07-2020
102|~|Ledger|~|12-06-2019
To write the contents to a new filename, just redirect the output to a new filename (e.g. for the last line above, add ' file > newfile)
Look things over and let me know if you have further questions.
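The header-lookup idea can be taken one step further by passing the wanted column names in as a variable instead of hard-coding them in the print statement. This is only a sketch: the variable names cols, want and idx are illustrative assumptions, and the temp file stands in for file1.txt.

```shell
# Sample data from the question; the temp file exists only for this demo.
f=$(mktemp)
cat > "$f" <<'EOF'
mudId|~|mudType|~|mudNAme|~|mudDate|~|mudEndDate
100|~|Balance|~|Abc|~|21-09-2020|~|22-09-2020
101|~|Clone|~|Bcd|~|11-07-2020|~|12-07-2020
102|~|Ledger|~|Def|~|12-06-2019|~|13-06-2019
EOF

out=$(awk -v cols="mudId,mudType,mudDate" '
BEGIN    { FS = "[|][~][|]"; OFS = "|~|"; n = split(cols, want, ",") }
FNR == 1 { for (i = 1; i <= NF; i++) idx[$i] = i; next }   # header name -> field number
{
    line = ""
    for (k = 1; k <= n; k++)                               # join the wanted fields with OFS
        line = line (k > 1 ? OFS : "") $(idx[want[k]])
    print line
}' "$f")
printf '%s\n' "$out"
rm -f "$f"
```

A name that is missing from the header row is not handled here; its lookup yields an empty index, so $(...) would expand to the whole record.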
If the column is fixed by mudId|~|mudType|~|mudNAme|~|mudDate|~|mudEndDate, try this:
sed 's/|~|/\t/g' file1.txt | awk '{print $1"|~|"$2"|~|"$4}'
If a tab can occur in file1.txt, change \t to another character that does not occur in the file, and then add the matching -F option (-F'\t' for a tab) after awk.

Ignore delimiters in quotes and excluding columns dynamically in csv file

I have an awk command to read a csv file with | as the separator. I am using this command as part of my shell script, where the columns to exclude are removed from the output. The list of columns is given as input, e.g. 1 2 3
Command Reference: http://wiki.bash-hackers.org/snipplets/awkcsv
awk -v FS='"| "|^"|"$' '{for i in $test; do $(echo $i=""); done print }' test.csv
$test is 1 2 3
I want to print $1="" $2="" $3="" in front of print all columns. I am getting this error
awk: {for i in $test; do $(echo $i=""); done {print }
awk: ^ syntax error
This command is working properly which prints all the columns
awk -v FS='"| "|^"|"$' '{print }' test.csv
File 1
"first"| "second"| "last"
"fir|st"| "second"| "last"
"firtst one"| "sec|ond field"| "final|ly"
Expected output if I want to exclude the column 2 and 3 dynamically
first
fir|st
firtst one
I need help writing the for loop properly.
With GNU awk for FPAT:
$ awk -v FPAT='"[^"]+"' '{print $1}' file
"first"
"fir|st"
"firtst one"
$ awk -v flds='1' -v FPAT='"[^"]+"' 'BEGIN{n=split(flds,f,/ /)} {for (i=1;i<=n;i++) printf "%s%s", $(f[i]), (i<n?OFS:ORS)}' file
"first"
"fir|st"
"firtst one"
$ awk -v flds='2 3' -v FPAT='"[^"]+"' 'BEGIN{n=split(flds,f,/ /)} {for (i=1;i<=n;i++) printf "%s%s", $(f[i]), (i<n?OFS:ORS)}' file
"second" "last"
"second" "last"
"sec|ond field" "final|ly"
$ awk -v flds='3 1' -v FPAT='"[^"]+"' 'BEGIN{n=split(flds,f,/ /)} {for (i=1;i<=n;i++) printf "%s%s", $(f[i]), (i<n?OFS:ORS)}' file
"last" "first"
"last" "fir|st"
"final|ly" "firtst one"
If you don't want your output fields separated by a blank char then set OFS to whatever you do want with -v OFS='whatever'. If you want to get rid of the surrounding quotes you can use gensub() (since we're using gawk anyway) or substr() on every field, e.g.:
$ awk -v OFS=';' -v flds='1 3' -v FPAT='"[^"]+"' 'BEGIN{n=split(flds,f,/ /)} {for (i=1;i<=n;i++) printf "%s%s", substr($(f[i]),2,length($(f[i]))-2), (i<n?OFS:ORS)}' file
first;last
fir|st;last
firtst one;final|ly
$ awk -v OFS=';' -v flds='1 3' -v FPAT='"[^"]+"' 'BEGIN{n=split(flds,f,/ /)} {for (i=1;i<=n;i++) printf "%s%s", gensub(/"/,"","g",$(f[i])), (i<n?OFS:ORS)}' file
first;last
fir|st;last
firtst one;final|ly
In GNU awk (for FPAT):
$ test="2 3" # fields to exclude in bash var $test
$ awk -v t="$test" ' # taken to awk var t
BEGIN { # first
FPAT="([^|]+)|( *\"[^\"]+\")" # instead of FS, use FPAT
split(t,a," ") # process t to e:
for(i in a) # a[1]=2 -> e[2], etc.
e[a[i]]
}
{
for(i=1;i<=NF;i++) # for each field
if((i in e)==0) { # if field # not in e
gsub(/^\"|\"$/,"",$i) # remove leading and trailing "
b=b (b==""?"":OFS) $i # put to buffer b
}
print b; b="" # output and reset buffer
}' file
first
fir|st
firtst one
FPAT is used as FS can't handle separator in quotes.
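Where GNU awk is not available, the same idea can be approximated without FPAT by pulling the quoted fields out one at a time with match(). This is a sketch only: it assumes every field is wrapped in double quotes and contains no embedded double quote, and the names skip, drop, rest and kept are illustrative.

```shell
# Sample data from the question; the temp file exists only for this demo.
f=$(mktemp)
cat > "$f" <<'EOF'
"first"| "second"| "last"
"fir|st"| "second"| "last"
"firtst one"| "sec|ond field"| "final|ly"
EOF

out=$(awk -v skip="2 3" '
BEGIN { n = split(skip, s, " "); for (k = 1; k <= n; k++) drop[s[k]] = 1 }
{
    rest = $0; fld = 0; kept = 0; line = ""
    while (match(rest, /"[^"]*"/)) {                      # next double-quoted field
        fld++
        if (!(fld in drop)) {
            val = substr(rest, RSTART + 1, RLENGTH - 2)   # strip the surrounding quotes
            line = line (kept++ ? OFS : "") val
        }
        rest = substr(rest, RSTART + RLENGTH)             # continue after this field
    }
    print line
}' "$f")
printf '%s\n' "$out"
rm -f "$f"
```

Because only the text between quotes is ever examined, a | inside a quoted field is never treated as a separator.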
Vikram, if your actual Input_file is exactly the same as the sample shown, then the following may help. I will add an explanation shortly (tested with GNU awk 3.1.7, a rather old version of awk).
awk -v num="2,3" 'BEGIN{
len=split(num, val,",")
}
{while($0){
match($0,/.[^"]*/);
if(substr($0,RSTART,RLENGTH+1) && substr($0,RSTART,RLENGTH+1) !~ /\"\| \"/ && substr($0,RSTART,RLENGTH+1) !~ /^\"$/ && substr($0,RSTART,RLENGTH+1) !~ /^\" \"$/){
array[++i]=substr($0,RSTART,RLENGTH+1)
};
$0=substr($0,RLENGTH+1);
};
for(l=1;l<=len;l++){
delete array[val[l]]
};
for(j=1;j<=length(array);j++){
if(array[j]){
gsub(/^\"|\"$/,"",array[j]);
printf("%s%s",array[j],j==length(array)?"":" ")
}
};
print "";
i="";
delete array
}' Input_file
EDIT1: Adding a code with explanation too here.
awk -v num="2,3" 'BEGIN{ ##creating a variable named num whose value is comma separated values of fields which you want to nullify, starting BEGIN section here.
len=split(num, val,",") ##creating an array named val here whose delimiter is comma and creating len variable whose value is length of array val here.
}
{while($0){ ##Starting a while loop here which will run for a single line till that line is NOT getting null.
match($0,/.[^"]*/);##using match functionality which will look for matches from starting to till a " comes into match.
if(substr($0,RSTART,RLENGTH+1) && substr($0,RSTART,RLENGTH+1) !~ /\"\| \"/ && substr($0,RSTART,RLENGTH+1) !~ /^\"$/ && substr($0,RSTART,RLENGTH+1) !~ /^\" \"$/){##RSTART and RLENGTH are the variables set by match() when the regex matches the line/variable passed to it. This if condition checks the substring starting at RSTART with length RLENGTH+1. 1st: it should not be NULL. 2nd: it should not contain " pipe space ". 3rd: it should not be a lone ". 4th: it should not be " space ". If all conditions are TRUE then do the following actions.
array[++i]=substr($0,RSTART,RLENGTH+1) ##creating an array named array whose index is variable i with increasing value of i and its value is substring of RSTART to till RLENGTH+1.
};
$0=substr($0,RLENGTH+1);##Now removing the matched part from current line which will decrease the length of line and avoid the while loop to become as infinite.
};
for(l=1;l<=len;l++){##Starting a loop here once while above loop is done which runs from starting of variable l=1 to value of len.
delete array[val[l]] ##Deleting here those values which we want to REMOVE from OPs request, so removing here.
};
for(j=1;j<=length(array);j++){##Start a for loop from the value of j=1 till the value of length of array.
if(array[j]){ ##Now making sure array value whose index is j is NOT NULL, if yes then perform following statements.
gsub(/^\"|\"$/,"",array[j]); ##Globally substituting starting " and ending " with NULL in value of array value.
printf("%s%s",array[j],j==length(array)?"":" ") ##Print the array value followed by a space, or by nothing when j equals the array length, because we do not want a space at the end of the line.
}
};
print ""; ##Because above printf will NOT print a new line, so printing a new line.
i=""; ##Nullifying variable i here.
delete array ##Deleting array here.
}' Input_file ##Mentioning Input_file here.

Shell script to add values to a specific column

I have semicolon-separated columns, and I would like to add some characters to a specific column.
aaa;111;bbb
ccc;222;ddd
eee;333;fff
To the second column I want to add '#', so the output should be:
aaa;#111;bbb
ccc;#222;ddd
eee;#333;fff
I tried
awk -F';' -OFS=';' '{ $2 = "#" $2}1' file
It adds the character but replaces all semicolons with spaces.
You could use sed to do your job:
# replaces just the first occurrence of ';', note the absence of `g` that
# would have made it a global replacement
sed 's/;/;#/' file > file.out
or, to do it in place:
sed -i 's/;/;#/' file
Or, use awk:
awk -F';' '{$2 = "#"$2}1' OFS=';' file
All the above commands result in the same output for your example file:
aaa;#111;bbb
ccc;#222;ddd
eee;#333;fff
#atb: Try:
1st:
awk -F";" '{print $1 FS "#" $2 FS $3}' Input_file
Above will work only when your Input_file has 3 fields only.
2nd:
awk -F";" -vfield=2 '{$field="#"$field} 1' OFS=";" Input_file
In the above code you can put any field number in the field variable, as per your requirement.
Here I set the field separator to ";" and pass in a variable named field that holds the field number; "#" is then prepended to that field's value. The 1 is a condition that is always TRUE with no action attached, so the default action, printing the current line, is performed.
You just misunderstood how to set variables. Change -OFS to -v OFS:
awk -F';' -v OFS=';' '{ $2 = "#" $2 }1' file
but in reality you should set them both to the same value at one time:
awk 'BEGIN{FS=OFS=";"} { $2 = "#" $2 }1' file
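To see why the separator matters, here is a small side-by-side sketch on one sample row (the variable names no_ofs and with_ofs are only illustrative). Assigning to a field makes awk rebuild $0, joining the fields with OFS, which defaults to a single space.

```shell
# Without OFS: assigning to $2 rebuilds the record with the default OFS (a space).
no_ofs=$(printf 'aaa;111;bbb\n' | awk -F';' '{ $2 = "#" $2 } 1')

# With FS and OFS set together: the rebuilt record keeps its semicolons.
with_ofs=$(printf 'aaa;111;bbb\n' | awk 'BEGIN { FS = OFS = ";" } { $2 = "#" $2 } 1')

printf '%s\n%s\n' "$no_ofs" "$with_ofs"
# prints:
# aaa #111 bbb
# aaa;#111;bbb
```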

Compare two columns of different files and add new column if it matches

I would like to compare the first two columns of two files, if matched need to print yes else no.
input.txt
123,apple,type1
123,apple,type2
456,orange,type1
6567,kiwi,type2
333,banana,type1
123,apple,type2
qualified.txt
123,apple,type4
6567,kiwi,type2
output.txt
123,apple,type1,yes
123,apple,type2,yes
456,orange,type1,no
6567,kiwi,type2,yes
333,banana,type1,no
123,apple,type2,yes
I was using the command below to split the data, and then I would add one more column based on the result.
Now input.txt has duplicates (in the 1st column), so that method does not work; the file is also huge.
Can we get output.txt with an awk one-liner?
comm -2 -3 input.txt qualified.txt
$ awk -F, 'NR==FNR {a[$1 FS $2];next} {print $0 FS (($1 FS $2) in a?"yes":"no")}' qualified.txt input.txt
123,apple,type1,yes
123,apple,type2,yes
456,orange,type1,no
6567,kiwi,type2,yes
333,banana,type1,no
123,apple,type2,yes
Explained:
NR==FNR { # for the first file
a[$1 FS $2];next # acknowledge the existence of qualified 1st and 2nd field pairs
}
{
print $0 FS ($1 FS $2 in a?"yes":"no") # output input row and "yes" or "no"
} # depending on whether key found in array a
No need to redefine the OFS as $0 isn't modified and doesn't get rebuilt.
You can use awk logic for this as below. Not sure why do you mention one-liner awk command though.
awk -v FS="," -v OFS="," 'FNR==NR{map[$1]=$2;next} {if($1 in map == 0) {$0=$0FS"no"} else {$0=$0FS"yes"}}1' qualified.txt input.txt
123,apple,type1,yes
123,apple,type2,yes
456,orange,type1,no
6567,kiwi,type2,yes
333,banana,type1,no
123,apple,type2,yes
The logic is
The FNR==NR block processes the first file, qualified.txt, and stores column 2 in an array indexed by column 1.
Then, for each line of the 2nd file, {if($1 in map == 0) {$0=$0FS"no"} else {$0=$0FS"yes"}}1 appends no if the entry in column 1 is not an index of the array, and yes otherwise.
-v FS="," -v OFS="," are for setting input and output field separators
It looks like all you need is:
awk 'BEGIN{FS=OFS=","} NR==FNR{a[$1];next} {print $0, ($1 in a ? "yes" : "no")}' qualified.txt input.txt
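The question asks for a match on the first two columns, so a variant keyed on both may be closer to the requirement. The sketch below uses awk's multi-subscript array syntax (the subscripts are joined with SUBSEP); the array name seen and the temp directory are illustrative assumptions, and the data is the sample from the question.

```shell
# Sample files from the question, written to a temp directory for this demo.
dir=$(mktemp -d)
cat > "$dir/qualified.txt" <<'EOF'
123,apple,type4
6567,kiwi,type2
EOF
cat > "$dir/input.txt" <<'EOF'
123,apple,type1
123,apple,type2
456,orange,type1
6567,kiwi,type2
333,banana,type1
123,apple,type2
EOF

out=$(awk '
BEGIN     { FS = OFS = "," }
NR == FNR { seen[$1, $2] = 1; next }                # first file: remember each (col1, col2) pair
{ print $0, (($1, $2) in seen ? "yes" : "no") }     # second file: look the pair up
' "$dir/qualified.txt" "$dir/input.txt")
printf '%s\n' "$out"
rm -rf "$dir"
```

Duplicate rows in input.txt are not a problem here, since each line is looked up independently, and only the keys of qualified.txt are held in memory.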
