print columns based on column name - bash

Let's say I have a file test.txt that contains
a,b,c,d,e
1,2,3,4,5
6,7,8,9,10
I want to print out columns based on matching column names, either from another text file or from an array. So for example if I was given
arr=(a b c)
I want my output to then be
a,b,c
1,2,3
6,7,8
How can I do this with bash utilities/awk/sed? My actual text file is 3GB (and the line I want to match column values on is actually line 3), so efficient solutions are appreciated. This is what I have so far:
for j in "${arr[@]}"; do awk -F ',' -v a=$j '{ for(i=1;i<=NF;i++) {if($i==a) {print $i}}}' test.txt; done
but the output I get is
a
b
c
which not only is missing the other rows, but each column name is printed on one line each.
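The loop prints one name per line because each awk invocation only ever prints matching header fields, never the data rows, and it rescans the whole file once per name. One cheap alternative (a sketch, assuming the sample file and a space-separated list of the wanted names): resolve the names to column numbers from the header once, then let cut select those columns in a single pass, which matters for a 3GB file.

```shell
#!/bin/sh
# Recreate the sample file from the question.
printf 'a,b,c,d,e\n1,2,3,4,5\n6,7,8,9,10\n' > test.txt

wanted="a b c"   # the names from arr, space separated

# Read only the header, map each name to its column number,
# and emit a comma-separated field list for cut.
fields=$(awk -F',' -v w="$wanted" 'NR==1{
    for (i=1; i<=NF; i++) pos[$i]=i
    n = split(w, names, " ")
    for (j=1; j<=n; j++) s = (s ? s "," : "") pos[names[j]]
    print s
    exit
}' test.txt)

cut -d',' -f"$fields" test.txt
```

Note that cut always emits fields in file order, so this only works when the requested order matches the header order; for arbitrary reordering, stay in awk. If the header really sits on line 3 as the question says, `NR==1` would become `NR==3`.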

With your shown samples, please try the following. The code reads two files: another_file.txt (which contains a b c, as per the samples) and the actual input file test.txt (which holds all the values).
awk '
FNR==NR{
  for(i=1;i<=NF;i++){
    arr[$i]
  }
  next
}
FNR==1{
  for(i=1;i<=NF;i++){
    if($i in arr){
      valArr[i]
      header=(header?header OFS:"")$i
    }
  }
  print header
  next
}
{
  val=""
  for(i=1;i<=NF;i++){
    if(i in valArr){
      val=(val?val OFS:"")$i
    }
  }
  print val
}
' another_file.txt FS="," OFS="," test.txt
Output will be as follows:
a,b,c
1,2,3
6,7,8
Explanation: a detailed explanation of the above solution.
awk ' ##Starting the awk program from here.
FNR==NR{ ##This condition is TRUE while reading another_file.txt here.
  for(i=1;i<=NF;i++){ ##Traversing through all fields of the current line.
    arr[$i] ##Creating arr with the current field value as index.
  }
  next ##next will skip all further statements from here.
}
FNR==1{ ##Checking if this is the 1st line of test.txt.
  for(i=1;i<=NF;i++){ ##Traversing through all fields of the current line.
    if($i in arr){ ##If the current field value is present in arr then do the following.
      valArr[i] ##Creating valArr with the current field number as index.
      header=(header?header OFS:"")$i ##Appending each matching field value to header.
    }
  }
  print header ##Printing header here.
  next ##next will skip all further statements from here.
}
{
  val="" ##Nullifying val here.
  for(i=1;i<=NF;i++){ ##Traversing through all fields of the current line.
    if(i in valArr){ ##Checking if i is present in valArr, then do the following.
      val=(val?val OFS:"")$i ##Appending the current field value to val.
    }
  }
  print val ##Printing val here.
}
' another_file.txt FS="," OFS="," test.txt ##Mentioning the input file names here.

Here is how you can do this in a single pass awk command:
arr=(a c e)
awk -v cols="${arr[*]}" 'BEGIN {FS=OFS=","; n=split(cols, tmp, / /); for (i=1; i<=n; ++i) hdr[tmp[i]]} NR==1 {for (i=1; i<=NF; ++i) if ($i in hdr) hnum[i]} {for (i=1; i<=NF; ++i) if (i in hnum) {printf "%s%s", (f ? OFS : ""), $i; f=1} f=0; print ""}' file
a,c,e
1,3,5
6,8,10
A more readable form:
awk -v cols="${arr[*]}" '
BEGIN {
  FS = OFS = ","
  n = split(cols, tmp, / /)
  for (i=1; i<=n; ++i)
    hdr[tmp[i]]
}
NR == 1 {
  for (i=1; i<=NF; ++i)
    if ($i in hdr)
      hnum[i]
}
{
  for (i=1; i<=NF; ++i)
    if (i in hnum) {
      printf "%s%s", (f ? OFS : ""), $i
      f = 1
    }
  f = 0
  print ""
}' file
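The command above can be exercised end to end; a minimal harness (recreating the sample as file, and using a plain string in place of the bash array so it also runs under POSIX sh):

```shell
#!/bin/sh
printf 'a,b,c,d,e\n1,2,3,4,5\n6,7,8,9,10\n' > file

cols="a c e"   # stand-in for "${arr[*]}"
result=$(awk -v cols="$cols" '
BEGIN {
    FS = OFS = ","
    n = split(cols, tmp, / /)
    for (i = 1; i <= n; ++i) hdr[tmp[i]]
}
NR == 1 { for (i = 1; i <= NF; ++i) if ($i in hdr) hnum[i] }
{
    for (i = 1; i <= NF; ++i)
        if (i in hnum) { printf "%s%s", (f ? OFS : ""), $i; f = 1 }
    f = 0
    print ""
}' file)
echo "$result"
```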

Related

awk to get value for a column of next line and add it to the current line in shellscript

I have a csv file, let's say lines:
cat lines
1:abc
6:def
17:ghi
21:tyu
I wanted to achieve something like this
1:6:abc
6:17:def
17:21:ghi
21::tyu
Tried the below code but it didn't work:
awk 'BEGIN{FS=OFS=":"}NR>1{nln=$1;cl=$2}NR>0{print $1,nln,$2}' lines
1::abc
6:6:def
17:17:ghi
21:21:tyu
Can you please help?
Here is a potential AWK solution:
cat lines
1:abc
6:def
17:ghi
21:tyu
awk -F":" '{num[NR]=$1; letters[NR]=$2}; END{for(i=1;i<=NR;i++) print num[i] ":" num[i + 1] ":" letters[i]}' lines
1:6:abc
6:17:def
17:21:ghi
21::tyu
Formatted:
awk '
BEGIN {FS=OFS=":"}
{
  num[NR] = $1
  letters[NR] = $2
}
END {
  for (i = 1; i <= NR; i++)
    print num[i], num[i + 1], letters[i]
}
' lines
1:6:abc
6:17:def
17:21:ghi
21::tyu
Basically this is your solution, but I switched the order of the code blocks and added an END block to output the last record; you were close:
awk 'BEGIN{FS=OFS=":"}FNR>1{print p,$1,q}{p=$1;q=$2}END{print p,"",q}' file
Explained:
$ awk 'BEGIN {
    FS=OFS=":"       # delims
}
FNR>1 {              # all but the first record
    print p,$1,q     # output $1 and $2 from the previous round
}
{
    p=$1             # store for the next round
    q=$2
}
END {                # gotta output the last record in the END
    print p,"",q     # "" feels like cheating
}' file
Output:
1:6:abc
6:17:def
17:21:ghi
21::tyu
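A quick end-to-end check of the buffered-previous-record approach, with the sample recreated inline:

```shell
#!/bin/sh
printf '1:abc\n6:def\n17:ghi\n21:tyu\n' > file

out=$(awk 'BEGIN{FS=OFS=":"}
FNR>1{print p,$1,q}   # print the buffered record joined with the current $1
{p=$1; q=$2}          # buffer the current record for the next round
END{print p,"",q}     # flush the last record; its middle field stays empty
' file)
echo "$out"
```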
1st solution: here is a tac + awk + tac solution, written and tested with the shown samples only.
tac Input_file |
awk '
BEGIN{
  FS=OFS=":"
}
{
  prev=(prev?$2=prev OFS $2:$2=OFS $2)
}
{
  prev=$1
}
1
' | tac
Explanation: a detailed explanation of the above code.
tac Input_file | ##Printing lines of Input_file from bottom to top.
awk ' ##Piping the previous command's output into awk.
BEGIN{ ##Starting the BEGIN section from here.
  FS=OFS=":" ##Setting FS and OFS to colon here.
}
{
  prev=(prev?$2=prev OFS $2:$2=OFS $2) ##If prev is NOT NULL, prepend its value and OFS to $2; else prepend just OFS to $2.
}
{
  prev=$1 ##Setting prev to the $1 value here.
}
1 ##Printing the current line here.
' | tac ##Sending awk output to tac to restore the original order.
2nd solution: an awk-only solution that passes Input_file to it twice.
awk '
BEGIN{
  FS=OFS=":"
}
FNR==NR{
  if(FNR>1){
    arr[FNR-1]=$1
  }
  next
}
{
  $2=(FNR in arr)?(arr[FNR] OFS $2):OFS $2
}
1
' Input_file Input_file
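The two-pass version can be smoke-tested the same way (Input_file recreated inline):

```shell
#!/bin/sh
printf '1:abc\n6:def\n17:ghi\n21:tyu\n' > Input_file

out=$(awk '
BEGIN{ FS=OFS=":" }
FNR==NR{                 # first pass: remember each $1, shifted up one line
    if(FNR>1) arr[FNR-1]=$1
    next
}
{                        # second pass: splice the remembered value into $2
    $2 = (FNR in arr) ? arr[FNR] OFS $2 : OFS $2
}
1
' Input_file Input_file)
echo "$out"
```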

Merge two rows of a file

I have an input file with a large amount of data in the pattern below. Some of the data is shown here:
Data1
C
In;
CP
In;
D
In;
Q
Out;
Data2
CP
In;
D
In;
Q
Out;
Data3
CP
In;
CPN
In;
D
In;
QN
Out;
I want my output as
Data1(C,CP,D,Q)
In C;
In CP;
In D;
Out Q;
Data2 (CP,D,Q)
In CP;
In D;
Out Q;
Data3 (CP,CPN,D,QN)
In CP;
In CPN
In D
Out QN;
I tried the code given in the comment section below, but I'm getting an error. Corrections are welcome.
variation on @EdMorton's suggestion, fixing the desired order of fields:
$ awk 'FNR==1{print;next}!(NR%2){a=$0; next} {printf "%s %s%s%s", $1,a,FS,ORS}' FS=';' file
Data
In A1;
In A2;
Out Z;
$ awk 'NR%2{print sep $0; sep=OFS; next} {printf "%s", $0}' file
Data
A1 In;
A2 In;
Z Out;
Could you please try the following, written and tested with the shown samples.
awk '
FNR==1{
  print
  next
}
val{
  sub(/;$/,OFS val"&")
  print
  val=""
  next
}
{
  val=$0
}
END{
  if(val!=""){
    print val
  }
}' Input_file
Issues in OP's attempt:
1st: the awk code should be enclosed in single quotes, so "$1=="A1" should be changed to '$1=="A1".
2nd: the condition logic looks wrong to me, because if we look only for A1 and In specifically, other lines like Z and Out will be missed; hence the approach above.
Explanation: a detailed explanation of the above.
awk ' ##Starting the awk program from here.
FNR==1{ ##If this is the first line then do the following.
  print ##Printing the current line here.
  next ##next will skip all further statements from here.
}
val{ ##If val is NOT NULL then do the following.
  sub(/;$/,OFS val"&") ##Substituting the trailing semicolon with OFS, val and the semicolon itself here.
  print ##Printing the current line here.
  val="" ##Nullifying val here.
  next ##next will skip all further statements from here.
}
{
  val=$0 ##Saving the current line into val here.
}
END{ ##Starting the END block of this program from here.
  if(val!=""){ ##If val is NOT NULL then do the following.
    print val ##Printing val here.
  }
}' Input_file ##Mentioning the input file name here.
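Running the above on the first block of the shown sample (file recreated inline; the default OFS of a single space is assumed):

```shell
#!/bin/sh
printf 'Data1\nC\nIn;\nCP\nIn;\nD\nIn;\nQ\nOut;\n' > Input_file

out=$(awk '
FNR==1{ print; next }            # keep the header line as is
val{                             # a buffered name exists: attach it before the ;
    sub(/;$/, OFS val "&")
    print
    val=""
    next
}
{ val=$0 }                       # buffer the name line
END{ if (val != "") print val }  # flush a dangling name, if any
' Input_file)
echo "$out"
```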
and a variation on @vgersh99's, but setting FS twice: at the beginning and at the end.
awk -v FS='\n' -v RS= '
{
  gsub(";", " ");
  r2= $3 $2;
  r3= $5 $4;
  r4= $7 $6;
}
{print $1}
{print r2 FS}
{print r3 FS}
{print r4 FS}' FS=';' file
Data
In A1;
In A2;
Out Z;
Does it give you an error?

How to run a bash script in a loop

I wrote a bash script to pull substrings from two input files and save them to an output file. The inputs look like this:
input file 1
>genotype1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
input file 2
gene1 10 20
gene2 40 50
genen x y
my script
>output_file
cat input_file2 | while read row; do
echo $row > temp
geneName=`awk '{print $1}' temp`
startPos=`awk '{print $2}' temp`
endPos=`awk '{print $3}' temp`
length=$(expr $endPos - $startPos)
for i in temp; do
echo ">${geneName}" >> genes_fasta
awk -v S=$startPos -v L=$length '{print substr($0,S,L)}' input_file1 >> output file
done
done
How can I make it work in a loop for more than one string in input file 1?
new input file looks like this:
>genotype1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>genotype2
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
>genotypen...
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn...
I would like a different output file for every genotype, with the genotype name as the file name.
thank you!
If I'm understanding correctly, would you try the following:
awk '
FNR==NR {
  name[NR] = $1
  start[NR] = $2
  len[NR] = $3 - $2
  count = NR
  next
}
/^>/ {
  sub(/^>/,"")
  genotype=$0
  next
}
{
  for (i = 1; i <= count; i++) {
    print ">" name[i] > genotype
    print substr($0, start[i], len[i]) >> genotype
  }
  close(genotype)
}' input_file2 input_file1
input_file1:
>genotype1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>genotype2
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
>genotype3
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
input_file2:
gene1 10 20
gene2 40 50
gene3 20 25
[Results]
genotype1:
>gene1
aaaaaaaaaa
>gene2
aaaaaaaaaa
>gene3
aaaaa
genotype2:
>gene1
bbbbbbbbbb
>gene2
bbbbbbbbbb
>gene3
bbbbb
genotype3:
>gene1
nnnnnnnnnn
>gene2
nnnnnnnnnn
>gene3
nnnnn
[EDIT]
If you want to store the output files to a different directory,
please try the following instead:
dir="./outdir" # directory name to store the output files
# you can modify the name as you want
mkdir -p "$dir"
awk -v dir="$dir" '
FNR==NR {
  name[NR] = $1
  start[NR] = $2
  len[NR] = $3 - $2
  count = NR
  next
}
/^>/ {
  sub(/^>/,"")
  genotype=$0
  next
}
{
  for (i = 1; i <= count; i++) {
    print ">" name[i] > dir"/"genotype
    print substr($0, start[i], len[i]) >> dir"/"genotype
  }
  close(dir"/"genotype)
}' input_file2 input_file1
The first two lines are executed in bash to define and create the destination directory. Then the directory name is passed to awk via the -v option.
Hope this helps.
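A self-contained check of the first variant; the miniature inputs below are made up for the test (shorter sequences than the sample, same shape as the question):

```shell
#!/bin/sh
# Hypothetical miniature inputs: two genotypes of 30 characters each.
printf 'gene1 10 20\ngene2 21 26\n' > input_file2
printf '>g1\n%s\n>g2\n%s\n' \
    "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" \
    "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb" > input_file1

awk '
FNR==NR { name[NR]=$1; start[NR]=$2; len[NR]=$3-$2; count=NR; next }
/^>/    { sub(/^>/,""); genotype=$0; next }
{
    for (i = 1; i <= count; i++) {
        print ">" name[i] > genotype               # ">" truncates when the file is (re)opened
        print substr($0, start[i], len[i]) >> genotype
    }
    close(genotype)
}' input_file2 input_file1

cat g1
```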
Could you please try the following, where I am assuming that Input_file1's lines starting with > should be compared against the first column of Input_file2 (the samples are confusing, so this is written based on the OP's attempt).
awk '
FNR==NR{
  start_point[$1]=$2
  end_point[$1]=$3
  next
}
/^>/{
  sub(/^>/,"")
  val=$0
  next
}
{
  print val ORS substr($0,start_point[val],end_point[val]-start_point[val])
  val=""
}
' Input_file2 Input_file1
Explanation: an explanation of the above code.
awk ' ##Starting the awk program from here.
FNR==NR{ ##Condition FNR==NR is TRUE while the first input file, Input_file2, is being read.
  start_point[$1]=$2 ##Creating an array named start_point with index $1 of the current line and value $2.
  end_point[$1]=$3 ##Creating an array named end_point with index $1 of the current line and value $3.
  next ##next will skip all further statements from here.
}
/^>/{ ##If a line starts with > then do the following.
  sub(/^>/,"") ##Substituting the leading > with NULL.
  val=$0 ##Creating a variable val whose value is $0.
  next ##next will skip all further statements from here.
}
{
  print val ORS substr($0,start_point[val],end_point[val]-start_point[val]) ##Printing val, a newline (ORS) and the substring of the current line starting at start_point[val] with length end_point[val]-start_point[val] (substr takes a length, not an end position).
  val="" ##Nullifying the variable val here.
}
' Input_file2 Input_file1 ##Mentioning the input file names here.

Match two files and print the matched strings based on the second file using awk

I have two files below named InputFile and Ref
InputFile
1234~code1=yyy:code2=fff:code3=vvv
1256~code2=ttt:code1=yyy:code4=zzz
4567~code4=uuu
8907~code8=ooo:code7=rrr
Ref
code2
code3
code8
code7
I have to match all the records in Ref against InputFile's second column (~ delimited, then split on colon (:)). If a record in Ref is found in InputFile, it should print the value after the = sign; otherwise print nothing.
Desired output
1234~fff~vvv~~
1256~ttt~~~
4567~~~~
8907~~~ooo~rrr
I'm about to load it to a table having the Ref records as the columns.
Here's my script so far:
awk '
BEGIN{
  FS=OFS="~"
}
FNR==NR{
  a[$0]
  next
}
FNR==1 && FNR!=NR{
  print
  next
}
{
  num=split($2,array,"[=:]")
  for(i=1;i<=num;i+=2){
    if(array[i] in a){
      val=val?val OFS array[i+1]:array[i+1]
    }
    else{
      val=val?val OFS "~":"~"
    }
  }
  print $1,val
  val=""
}
' Ref InputFile
It prints the codes (code1, code2, etc.) from InputFile that are present in Ref, but not in Ref's order.
Script's output
1234~~fff~vvv
1256~ttt
4567~
8907~ooo~rrr
something similar to yours
$ awk -F~ 'NR==FNR {c[NR]=$1; cs=NR; next}
{n=split($2,f,"[=:]");
delete k;
for(i=1;i<n;i+=2) k[f[i]]=f[i+1];
printf "%s", $1;
for(i=1;i<=cs;i++) printf "%s", FS k[c[i]];
print ""}' ref input
1234~fff~vvv~~
1256~ttt~~~
4567~~~~
8907~~~ooo~rrr
Since you want to keep the order in the ref file, don't insert the names as keys of the array; instead add them as values indexed by the order number (here the line number). Otherwise you're going to lose the order, which I think is the (only?) issue with your script.
$ cat tst.awk
BEGIN {
  FS = "[~:=]"
  OFS = "~"
}
NR == FNR {
  refs[++numRefs] = $0
  next
}
{
  delete ref2val
  for (fldNr=2; fldNr<NF; fldNr+=2) {
    ref2val[$fldNr] = $(fldNr+1)
  }
  printf "%s%s", $1, OFS
  for (refNr=1; refNr<=numRefs; refNr++) {
    ref = refs[refNr]
    printf "%s%s", ref2val[ref], (refNr<numRefs ? OFS : ORS)
  }
}
$ awk -f tst.awk refs file
1234~fff~vvv~~
1256~ttt~~~
4567~~~~
8907~~~ooo~rrr
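tst.awk can be verified end to end with the shown samples (files recreated inline; `delete arr` on a whole array is a widely supported extension, so the POSIX-portable `split("", arr)` is used here instead):

```shell
#!/bin/sh
printf 'code2\ncode3\ncode8\ncode7\n' > refs
printf '1234~code1=yyy:code2=fff:code3=vvv\n1256~code2=ttt:code1=yyy:code4=zzz\n4567~code4=uuu\n8907~code8=ooo:code7=rrr\n' > file

out=$(awk '
BEGIN { FS = "[~:=]"; OFS = "~" }
NR == FNR { refs[++numRefs] = $0; next }
{
    split("", ref2val)                   # reset the per-record name->value map
    for (f = 2; f < NF; f += 2) ref2val[$f] = $(f+1)
    printf "%s%s", $1, OFS
    for (r = 1; r <= numRefs; r++)
        printf "%s%s", ref2val[refs[r]], (r < numRefs ? OFS : ORS)
}' refs file)
echo "$out"
```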

Spreading cell values into columns using UNIX

Suppose we have this file:
head file
id,name,value
1,Je,1
2,Je,1
3,Ko,1
4,Ne,1
5,Ne,1
6,Je,1
7,Ko,1
8,Ne,1
9,Ne,1
And I'd like to get this out:
id,Je,Ko,Ne
1,1,0,0
2,1,0,0
3,0,1,0
4,0,0,1
5,0,0,1
6,1,0,0
7,0,1,0
8,0,0,1
9,0,0,1
Does someone know how to get this output, using awk or sed?
Assuming that the possible values of name are only Je or Ko or Ne, you can do:
awk -F, 'BEGIN{print "id,Je,Ko,Ne"}
NR==1{ next }
{je=$2=="Je"?"1":"0";
ko=$2=="Ko"?"1":"0";
ne=$2=="Ne"?"1":"0";
print $1","je","ko","ne}' file
If you want something that will print the values in the same order they are read and not limited to your example fields, you could do:
awk -F, 'BEGIN{OFS=FS; x=1; y=1}
NR==1 { next }
!($2 in oa){ oa[$2]=1; ar[x++]=$2 }
{ lines[y++]=$0 }
END{
  s=""
  for (i=1; i<x; i++)
    s = s=="" ? ar[i] : s OFS ar[i]
  print "id" OFS s
  for (j=1; j<y; j++){
    split(lines[j], a)
    s=""
    for (i=1; i<x; i++) {
      tt = ar[i]==a[2] ? "1" : "0"
      s = s=="" ? tt : s OFS tt
    }
    print a[1] OFS s
  }
}
' file
Here's a "two-pass solution" (along the lines suggested by @Drakosha) implemented using a single invocation of awk. The implementation would be a little simpler if there were no requirement regarding the ordering of names.
awk -F, '
# global: n, array a
function println(ix,name,value,  i,line) {
  line = ix
  for (i=0; i<n; i++) {
    if (a[i]==name) {line = line OFS value} else {line = line OFS 0}
  }
  print line
}
BEGIN {OFS=FS; n=0}
FNR==1 {next}   # skip the header each time
NR==FNR {if (!mem[$2]) {mem[$2] = a[n++] = $2}; next}
!s { s="id"; for (i=0;i<n;i++) {s = s OFS a[i]}; print s }
{println($1, $2, $3)}
' file file
I suggest 2 passes:
The 1st will generate all the possible values of column 2 (Je, Ko, Ne, ...).
The 2nd can then trivially generate the output you are looking for.
awk -F, 'BEGIN{s="Je,Ko,Ne";print "id,"s}
NR>1 {m=s; sub($2,1,m); gsub("[^0-9,]+","0",m); print $1","m}' file
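The template trick above, substituting the name into a fixed header string and then zeroing every remaining name, can be checked quickly. One caveat: sub() treats $2 as a regex and matches substrings, so a name that is contained in another (or contains regex metacharacters) would misfire.

```shell
#!/bin/sh
printf 'id,name,value\n1,Je,1\n3,Ko,1\n4,Ne,1\n' > file

# For each row, copy the header template, replace the matching name
# with 1, then squash every other name to 0.
out=$(awk -F, 'BEGIN{s="Je,Ko,Ne"; print "id," s}
NR>1 {m=s; sub($2,1,m); gsub("[^0-9,]+","0",m); print $1 "," m}' file)
echo "$out"
```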
