Spreading cell values into columns using UNIX - bash

Suppose we have this file:
head file
id,name,value
1,Je,1
2,Je,1
3,Ko,1
4,Ne,1
5,Ne,1
6,Je,1
7,Ko,1
8,Ne,1
9,Ne,1
And I'd like to get this out:
id,Je,Ko,Ne
1,1,0,0
2,1,0,0
3,0,1,0
4,0,0,1
5,0,0,1
6,1,0,0
7,0,1,0
8,0,0,1
9,0,0,1
Does someone know how to get this output, using awk or sed?

Assuming that the possible values of name are only Je or Ko or Ne, you can do:
awk -F, 'BEGIN{print "id,Je,Ko,Ne"}
NR==1{ next }
{je=$2=="Je"?"1":"0";
ko=$2=="Ko"?"1":"0";
ne=$2=="Ne"?"1":"0";
print $1","je","ko","ne}' file
If you want something that will print the values in the same order they are read and not limited to your example fields, you could do:
awk -F, 'BEGIN{OFS=FS; x=1;y=1}
NR==1 { next }
!($2 in oa){ oa[$2]=1; ar[x++]=$2}
{lines[y++]=$0;}
END{
  s="";
  for (i=1; i<x; i++)
    s=s==""?ar[i]:s OFS ar[i];
  print "id" OFS s;
  for (j=1; j<y; j++){
    split(lines[j], a)
    s=""
    for (i=1; i<x; i++) {
      tt=ar[i]==a[2]?"1":"0"
      s=s==""?tt:s OFS tt;
    }
    print a[1] OFS s;
  }
}
' file

Here's a "two-pass solution" (along the lines suggested by @Drakosha) implemented using a single invocation of awk. The implementation would be a little simpler if there were no requirement regarding the ordering of names.
awk -F, '
# global: n, array a
function println(ix,name,value,   i,line) {
  line=ix;
  for (i=0;i<n;i++) {
    if (a[i]==name) {line=line OFS value} else {line=line OFS 0}
  }
  print line;
}
BEGIN {OFS=FS; n=0}
FNR==1 {next}  # skip the header each time
NR==FNR {if (!mem[$2]) {mem[$2] = a[n++] = $2}; next}
!s { s="id"; for (i=0;i<n;i++) {s=s OFS a[i]}; print s}
{println($1, $2, $3)}
' file file

I suggest two passes.
The first will generate all the possible values of column 2 (Je, Ko, Ne, ...).
The second will then be able to trivially generate the output you are looking for.
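The two passes described above could be sketched as two separate awk calls, with the list of names handed from the first to the second in a shell variable. This is only an illustration: the variable names (names, spread, col) and the scratch file are assumptions, and the sample data is a shortened copy of the question's file.

```shell
# Shortened copy of the sample input, written to a scratch file.
tmp=$(mktemp)
printf '%s\n' 'id,name,value' '1,Je,1' '2,Je,1' '3,Ko,1' '4,Ne,1' > "$tmp"

# Pass 1: collect the distinct names in order of first appearance.
names=$(awk -F, 'NR>1 && !seen[$2]++ {printf "%s%s", sep, $2; sep=","}' "$tmp")

# Pass 2: one output column per name, holding the row's value or 0.
spread=$(awk -F, -v names="$names" '
BEGIN { OFS=FS; n=split(names, col, ","); print "id", names }
NR>1  { line=$1
        for (i=1; i<=n; i++) line = line OFS (col[i]==$2 ? $3 : 0)
        print line }' "$tmp")

printf '%s\n' "$spread"
rm -f "$tmp"
```

Unlike the answers that hardcode 1, this sketch copies the third column into the matching name column, which gives the same result for the sample data.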

awk -F, 'BEGIN{s="Je,Ko,Ne";print "id,"s}
NR>1 {m=s; sub($2,1,m); gsub("[^0-9,]+","0",m); print $1","m}' file

Related

awk to get value for a column of next line and add it to the current line in shellscript

I have a CSV file, let's say lines:
cat lines
1:abc
6:def
17:ghi
21:tyu
I wanted to achieve something like this
1:6:abc
6:17:def
17:21:ghi
21::tyu
I tried the code below, but it didn't work:
awk 'BEGIN{FS=OFS=":"}NR>1{nln=$1;cl=$2}NR>0{print $1,nln,$2}' lines
1::abc
6:6:def
17:17:ghi
21:21:tyu
Can you please help?
Here is a potential AWK solution:
cat lines
1:abc
6:def
17:ghi
21:tyu
awk -F":" '{num[NR]=$1; letters[NR]=$2}; END{for(i=1;i<=NR;i++) print num[i] ":" num[i + 1] ":" letters[i]}' lines
1:6:abc
6:17:def
17:21:ghi
21::tyu
Formatted:
awk '
BEGIN {FS=OFS=":"}
{
num[NR] = $1;
letters[NR] = $2
}
END {for (i = 1; i <= NR; i++)
print num[i], num[i + 1], letters[i]
}
' lines
1:6:abc
6:17:def
17:21:ghi
21::tyu
You were close. This is basically your solution, but I switched the order of the code blocks and added an END block to output the last record:
awk 'BEGIN{FS=OFS=":"}FNR>1{print p,$1,q}{p=$1;q=$2}END{print p,"",q}' file
Explained:
$ awk 'BEGIN {
FS=OFS=":" # delims
}
FNR>1 { # all but the first record
print p,$1,q # output $1 and $2 from the previous round
}
{
p=$1 # store for the next round
q=$2
}
END { # gotta output the last record in the END
print p,"",q # "" feels like cheating
}' file
Output:
1:6:abc
6:17:def
17:21:ghi
21::tyu
1st solution: Here is a tac + awk + tac solution. Written and tested with shown samples only.
tac Input_file |
awk '
BEGIN{
FS=OFS=":"
}
{
prev=(prev?$2=prev OFS $2:$2=OFS $2)
}
{
prev=$1
}
1
' | tac
Explanation: Adding detailed explanation for above code.
tac Input_file | ##Printing lines from bottom to top of Input_file.
awk ' ##Getting input from previous command as input to awk.
BEGIN{ ##Starting BEGIN section from here.
FS=OFS=":" ##Setting FS and OFS as colon here.
}
{
prev=(prev?$2=prev OFS $2:$2=OFS $2) ##If prev is set, prepend prev and OFS to $2; otherwise prepend only OFS.
}
{
prev=$1 ##Setting prev to $1 value here.
}
1 ##printing current line here.
' | tac ##Sending awk output to tac to make it in actual sequence.
2nd solution: An awk-only solution that reads Input_file twice.
awk '
BEGIN{
FS=OFS=":"
}
FNR==NR{
if(FNR>1){
arr[FNR-1]=$1
}
next
}
{
$2=(FNR in arr)?(arr[FNR] OFS $2):OFS $2
}
1
' Input_file Input_file

print columns based on column name

Let's say I have a file test.txt that contains
a,b,c,d,e
1,2,3,4,5
6,7,8,9,10
I want to print out columns based on matching column names, either from another text file or from an array. So for example if I was given
arr=(a b c)
I want my output to then be
a,b,c
1,2,3
6,7,8
How can I do this with bash utilities/awk/sed? My actual text file is 3GB (and the line I want to match column values on is actually line 3), so efficient solutions are appreciated. This is what I have so far:
for j in "${arr[@]}"; do awk -F ',' -v a=$j '{ for(i=1;i<=NF;i++) {if($i==a) {print $i}}}' test.txt; done
but the output I get is
a
b
c
which is not only missing the other rows but also prints each column name on its own line.
With your shown samples, please try the following. The code reads two files: another_file.txt (which holds a b c, as in the samples) and the actual input file, test.txt (which holds all the values).
awk '
FNR==NR{
for(i=1;i<=NF;i++){
arr[$i]
}
next
}
FNR==1{
for(i=1;i<=NF;i++){
if($i in arr){
valArr[i]
header=(header?header OFS:"")$i
}
}
print header
next
}
{
val=""
for(i=1;i<=NF;i++){
if(i in valArr){
val=(val?val OFS:"")$i
}
}
print val
}
' another_file.txt FS="," OFS="," test.txt
Output will be as follows:
a,b,c
1,2,3
6,7,8
Explanation: Adding detailed explanation for above solution.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE while reading another_file.txt here.
for(i=1;i<=NF;i++){ ##Traversing through all fields of current line.
arr[$i] ##Creating arr with index of current field value.
}
next ##next will skip all statements from here.
}
FNR==1{ ##Checking if this is 1st line for test.txt file.
for(i=1;i<=NF;i++){ ##Traversing through all fields of current line.
if($i in arr){ ##If current field values comes in arr then do following.
valArr[i] ##Creating valArr which has index of current field number.
header=(header?header OFS:"")$i ##Creating header which has each field value in it.
}
}
print header ##Printing header here.
next ##next will skip all statements from here.
}
{
val="" ##Nullifying val here.
for(i=1;i<=NF;i++){ ##Traversing through all fields of current line.
if(i in valArr){ ##Checking if i is present in valArr then do following.
val=(val?val OFS:"")$i ##Creating val which has current field value.
}
}
print val ##printing val here.
}
' another_file.txt FS="," OFS="," test.txt ##Mentioning Input_file names here.
Here is how you can do this in a single pass awk command:
arr=(a c e)
awk -v cols="${arr[*]}" 'BEGIN {FS=OFS=","; n=split(cols, tmp, / /); for (i=1; i<=n; ++i) hdr[tmp[i]]} NR==1 {for (i=1; i<=NF; ++i) if ($i in hdr) hnum[i]} {for (i=1; i<=NF; ++i) if (i in hnum) {printf "%s%s", (f ? OFS : ""), $i; f=1} f=0; print ""}' file
a,c,e
1,3,5
6,8,10
A more readable form:
awk -v cols="${arr[*]}" '
BEGIN {
FS = OFS = ","
n = split(cols, tmp, / /)
for (i=1; i<=n; ++i)
hdr[tmp[i]]
}
NR == 1 {
for (i=1; i<=NF; ++i)
if ($i in hdr)
hnum[i]
}
{
for (i=1; i<=NF; ++i)
if (i in hnum) {
printf "%s%s", (f ? OFS : ""), $i
f = 1
}
f = 0
print ""
}' file
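The question notes that the header in the real file is on line 3 rather than line 1. A minimal sketch of how the single-pass approach might be adapted follows; hdrline, want, keep and nkeep are hypothetical names, and it assumes the lines above the header should simply be dropped. It also stores the wanted field numbers in a list, so each data row loops only over the selected columns instead of testing all NF fields, which may matter for a 3GB file.

```shell
# Scratch copy of the sample data with two extra lines above the header.
tmp=$(mktemp)
printf '%s\n' 'junk line 1' 'junk line 2' 'a,b,c,d,e' '1,2,3,4,5' '6,7,8,9,10' > "$tmp"

cols='a c e'   # in bash, "${arr[*]}" produces the same space-separated string

picked=$(awk -v cols="$cols" -v hdrline=3 '
BEGIN {
    FS = OFS = ","
    n = split(cols, want, " ")
    for (i = 1; i <= n; ++i) hdr[want[i]]       # wanted column names
}
NR < hdrline { next }                           # assumption: skip lines above the header
NR == hdrline {
    for (i = 1; i <= NF; ++i)
        if ($i in hdr) keep[++nkeep] = i        # remember matching field numbers
}
nkeep {
    for (k = 1; k <= nkeep; ++k)
        printf "%s%s", $(keep[k]), (k < nkeep ? OFS : ORS)
}' "$tmp")

printf '%s\n' "$picked"
rm -f "$tmp"
```

Columns come out in file order, not in the order given in cols, which matches the behaviour of the answers above.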

Define field with FPAT

I am trying to split data into fields in awk, but I can't come up with the right regex using FPAT.
I have tried:
echo 'C002 2019-06-28;16:03;approved;content=L1-34,EE;not taken;;1024 ' | awk 'BEGIN {FPAT = "([^ ]+) +[^ ]+|;"} {print "f1:"$1;print "f2:"$2;print "f3:"$3;print "f6:"$6;print "f7:"$7}'
Expected result:
f1:C002
f2:2019-06-28
f3:16:03
f6:not taken
f7:
There is no simple way to tell the spaces that separate fields from the spaces inside a field.
You need to do as David writes: split on ; and then split the first field on whitespace.
awk -F";" '{split($1,a,"[ \t]+");print "a[1]---"a[1]"\na[2]---"a[2];for (i=1;i<=NF;i++) print i"---"$i}'
a[1]---C002
a[2]---2019-06-28
1---C002 2019-06-28
2---16:03
3---approved
4---content=L1-34,EE
5---not taken
6---
7---1024
A bit similar to the answer of Jotne, but you could write a function to split the record according to your wishes:
awk 'function split_record(string,f, t,n,m) {
n=split(string,t,";"); m=split(t[1],f,"[ \t]+")
for(i=2;i<=n;++i) f[m+i-1]=t[i]
return m+n-1
}
{ split_record($0,f) }
{print "f1:"f[1];print "f2:"f[2];print "f3:"f[3];print "f6:"f[6];print "f7:"f[7]}'
This returns:
f1:C002
f2:2019-06-28
f3:16:03
f6:not taken
f7:
You can update the split record in any way you like.
awk '
BEGIN { FS=OFS=";" }
{
split($1,a,/[[:space:]]+/)
$1 = ""
$0 = a[1] FS a[2] $0
for (i=1; i<=NF; i++) {
print "f" i ":" $i
}
}
' file
f1:C002
f2:2019-06-28
f3:16:03
f4:approved
f5:content=L1-34,EE
f6:not taken
f7:
f8:1024

Match two files and print the matched strings based on the second file using awk

I have two files below named InputFile and Ref
InputFile
1234~code1=yyy:code2=fff:code3=vvv
1256~code2=ttt:code1=yyy:code4=zzz
4567~code4=uuu
8907~code8=ooo:code7=rrr
Ref
code2
code3
code8
code7
I have to match all the records in Ref against InputFile's second column (the columns are ~-delimited, and the second column is split on colons). If a record in Ref is found in InputFile, the script should print the value after the = sign; otherwise it should print nothing.
Desired output
1234~fff~vvv~~
1256~ttt~~~
4567~~~~
8907~~~ooo~rrr
I'm about to load it to a table having the Ref records as the columns.
Here's my script so far:
awk '
BEGIN{
FS=OFS="~"
}
FNR==NR{
a[$0]
next
}
FNR==1 && FNR!=NR{
print
next
}
{
num=split($2,array,"[=:]")
for(i=1;i<=num;i+=2){
if(array[i] in a){
val=val?val OFS array[i+1]:array[i+1]
}
else{
val=val?val OFS "~":"~"
}
}
print $1,val
val=""
}
' Ref InputFile
It prints the values for the codes (code1, code2, etc.) in InputFile that are present in Ref, but not in Ref's order.
Script's output
1234~~fff~vvv
1256~ttt
4567~
8907~ooo~rrr
Something similar to yours:
$ awk -F~ 'NR==FNR {c[NR]=$1; cs=NR; next}
{n=split($2,f,"[=:]");
delete k;
for(i=1;i<n;i+=2) k[f[i]]=f[i+1];
printf "%s", $1;
for(i=1;i<=cs;i++) printf "%s", FS k[c[i]];
print ""}' ref input
1234~fff~vvv~~
1256~ttt~~~
4567~~~~
8907~~~ooo~rrr
Since you want to keep the order of the ref file, don't insert its records as keys of the array; instead, add them as values indexed by their order number (here, the line number). Otherwise you're going to lose the order, which I think is the (only?) issue with your script.
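The point about indexing by line number can be seen in isolation with a small sketch. In awk, for (k in arr) visits keys in an unspecified order, whereas a numeric loop over 1..n replays the order in which the lines were read. The names refs, n and ordered are illustrative only.

```shell
ordered=$(printf '%s\n' code2 code3 code8 code7 | awk '
{ refs[++n] = $0 }               # store each line as a value under its line number
END {
    for (i = 1; i <= n; i++)     # numeric loop: same order as the input
        printf "%s%s", refs[i], (i < n ? "~" : "\n")
}')

printf '%s\n' "$ordered"
```

Had the lines been stored as keys (refs[$0]) and read back with for (k in refs), the order of the output would depend on the awk implementation.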
$ cat tst.awk
BEGIN {
FS = "[~:=]"
OFS = "~"
}
NR == FNR {
refs[++numRefs] = $0
next
}
{
delete ref2val
for (fldNr=2; fldNr<NF; fldNr+=2) {
ref2val[$fldNr] = $(fldNr+1)
}
printf "%s%s", $1, OFS
for (refNr=1; refNr<=numRefs; refNr++) {
ref = refs[refNr]
printf "%s%s", ref2val[ref], (refNr<numRefs ? OFS : ORS)
}
}
$ awk -f tst.awk refs file
1234~fff~vvv~~
1256~ttt~~~
4567~~~~
8907~~~ooo~rrr

bash: all combinations of lines

I have the following file (this is semicolon delimited; the real file is tab-delimited)
abc;173959;172730
def;4186657;4187943
ghi;4703911;4702577
jkl;2243551;2242259
and I want to combine each line with every other line, so that my output would be:
abc;173959;172730;def;4186657;4187943
abc;173959;172730;ghi;4703911;4702577
abc;173959;172730;jkl;2243551;2242259
def;4186657;4187943;ghi;4703911;4702577
def;4186657;4187943;jkl;2243551;2242259
ghi;4703911;4702577;jkl;2243551;2242259
The order is not important.
I came up with the following awk-solution:
awk '{ a[$0] } END { for (i in a){ for (j in a){if (i != j) print (i "\t" j) } } }' file
But this prints the combinations in both directions, for example:
abc;173959;172730;def;4186657;4187943
def;4186657;4187943;abc;173959;172730
Because I am pretty unfamiliar with python or perl, I kindly ask for a solution using awk/bash etc.
In awk:
$ awk '{ a[$0] }
END {
for(i in a) {
delete a[i] # new place for delete
for(j in a)
if(i!=j)
print i ";" j
# delete a[i] # previous and maybe wrong place
}
}' file
def;4186657;4187943;ghi;4703911;4702577
def;4186657;4187943;abc;173959;172730
def;4186657;4187943;jkl;2243551;2242259
ghi;4703911;4702577;abc;173959;172730
ghi;4703911;4702577;jkl;2243551;2242259
abc;173959;172730;jkl;2243551;2242259
Unfortunately the order is random.
Another way, which restores the order and doesn't modify the array a while iterating over it (see comments), is:
$ awk '{ a[NR]=$0 } # index on NR
END {
for(i=1;i<=NR;i++)
for(j=i+1;j<=NR;j++) # j=i+1 is the magic
print a[i] ";" a[j]
}' file
abc;173959;172730;def;4186657;4187943
abc;173959;172730;ghi;4703911;4702577
abc;173959;172730;jkl;2243551;2242259
def;4186657;4187943;ghi;4703911;4702577
def;4186657;4187943;jkl;2243551;2242259
ghi;4703911;4702577;jkl;2243551;2242259
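The question mentions that the real file is tab-delimited. A minimal sketch of the same j=i+1 loop for that case follows, assuming the two halves of each pair should also be joined with a tab; the sep variable and the shortened sample data are made up for illustration.

```shell
# Tab-delimited scratch file with three made-up rows.
tmp=$(mktemp)
printf 'abc\t1\t2\ndef\t3\t4\nghi\t5\t6\n' > "$tmp"

pairs=$(awk -v sep='\t' '
{ a[NR] = $0 }                         # index lines by record number
END {
    for (i = 1; i < NR; i++)
        for (j = i + 1; j <= NR; j++)  # start at i+1 so each pair appears once
            print a[i] sep a[j]
}' "$tmp")

printf '%s\n' "$pairs"
rm -f "$tmp"
```

For n input lines this prints n*(n-1)/2 pairs, so three lines give three pairs.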
This awk should work as well:
awk -F ';' 'NR==FNR{a[++k]=$0; next} {for (i=FNR+1; i<=k; i++) print $0 FS a[i]}' file{,}
abc;173959;172730;def;4186657;4187943
abc;173959;172730;ghi;4703911;4702577
abc;173959;172730;jkl;2243551;2242259
def;4186657;4187943;ghi;4703911;4702577
def;4186657;4187943;jkl;2243551;2242259
ghi;4703911;4702577;jkl;2243551;2242259
Could you please try the following? It keeps the lines in the same order as in Input_file and reads Input_file only once.
awk '{a[FNR]=$0} END{j=1;while(length(a)>=++k){for(q=j+1;q<=FNR;q++){print a[j]";"a[q]}j++};}' Input_file
OR
awk '
{
a[FNR]=$0
}
END{
j=1;
while(length(a)>=++k){
for(q=j+1;q<=FNR;q++){
print a[j]";"a[q]
}
j++
}
}
' Input_file
Output will be as follows.
abc;173959;172730;def;4186657;4187943
abc;173959;172730;ghi;4703911;4702577
abc;173959;172730;jkl;2243551;2242259
def;4186657;4187943;ghi;4703911;4702577
def;4186657;4187943;jkl;2243551;2242259
ghi;4703911;4702577;jkl;2243551;2242259

Resources