Awk splitting a line by spaces where there are spaces in each field - bash

I've got an R summary table like so:
employee salary startdate
John Doe :1 Min. :21000 Min. :2007-03-14
Jolie Hope:1 1st Qu.:22200 1st Qu.:2007-09-18
Peter Gynn:1 Median :23400 Median :2008-03-25
Mean :23733 Mean :2008-10-02
3rd Qu.:25100 3rd Qu.:2009-07-13
Max. :26800 Max. :2010-11-01
and I need to produce an output csv file like so:
employee,,salary,,startdate,,
John Doe,1,Min.,21000,Min.,2007-03-14
Jolie Hope,1,1st Qu.,22200,1st Qu.,2007-09-18
Peter Gynn,1,Median,23400,Median,2008-03-25
,,Mean,23733,Mean,2008-10-02
,,3rd Qu.,25100,3rd Qu.,2009-07-13
,,Max.,26800,Max.,2010-11-01
so that in Excel each field lands in its own column. However, it doesn't suffice to split the fields on one or more whitespace characters:
awk -F "[ ]+" '{ print $3 }'
It works for the header, but not for the remaining lines:
salary
Doe
Hope:1
Gynn:1
:23733
Qu.:25100
:26800
Is this problem solvable using awk (and maybe sed)?

sed '1 {
s/^[[:space:]]*\([^[:space:]]\{1,\}\)[[:space:]]\{1,\}\([^[:space:]]\{1,\}\)[[:space:]]\{1,\}[[:space:]]\{1,\}\([^[:space:]]\{1,\}\)/\1,,\2,,\3,/
b
}
s/[[:space:]]\{1,\}:/:/g
/^[[:space:]]*\([^:]\{1,\}\):\([^[:space:]]*\)[[:space:]]*\([^:]\{1,\}\):\([^[:space:]]*\)[[:space:]]*\([^:]\{1,\}\):\(.[^[:space:]]*\)/ {
s//\1,\2,\3,\4,\5,\6/
b
}
/^[[:space:]]*\([^:]\{1,\}\):\([^[:space:]]*\)[[:space:]]*\([^:]\{1,\}\):\([^[:space:]]*\)/ {
s//,,\1,\2,\3,\4/
b
}
' YourFile
A sed version, just for fun, if you want to adapt this ArachnoRegEx a bit. awk is a lot more interesting in this case, mainly for any adaptation you may want to add later, but if you only have access to sed ...
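For comparison: if the real file is column-aligned the way R's summary() prints it (runs of two or more spaces between the label:value pairs, which the paste above may have collapsed), a much shorter sed sketch is possible. This is an untested sketch assuming GNU sed (-E) and that alignment; line 1 passes through untouched and would need its own substitution to gain the extra commas in the header:
sed -E '1!{
  # glue each label to its value
  s/[[:space:]]+:/:/g
  # the colon becomes the separator inside a pair
  s/:/,/g
  # runs of two or more spaces separate the pairs
  s/ {2,}/,/g
  # leading indentation collapsed to one comma; the layout needs two empty fields
  s/^,/,,/
}' YourFile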

This uses GNU awk for FIELDWIDTHS, etc., and relies on the first line of input after the header always having all fields populated. It includes the positions that hold just the :s as output fields; I expect you can figure out how to skip those if you do want to use this solution:
$ cat tst.awk
BEGIN { OFS="," }
NR==1 {                 # header line: print each name followed by 3 commas
    for (i=1;i<=NF;i++) {
        printf "%s%s", $i, (i<NF?OFS OFS OFS:ORS)
    }
    next
}
NR==2 {                 # first data line: measure the column layout
    tail = $0
    while ( match(tail,/([^:]+):(\S+(\s+|$))/,a) ) {
        # width of the label, 1 for the colon, width of the value plus its padding
        FIELDWIDTHS = FIELDWIDTHS length(a[1]) " 1 " length(a[2]) " "
        tail = substr(tail,RSTART+RLENGTH)
    }
    $0 = $0             # re-split this record using the new FIELDWIDTHS
}
{
    for (i=1;i<=NF;i++) {
        gsub(/^\s+|\s+$/,"",$i)   # trim the padding from every field
    }
    print
}
$ awk -f tst.awk file
employee,,,salary,,,startdate
John Doe,:,1,Min.,:,21000,Min.,:,2007-03-14
Jolie Hope,:,1,1st Qu.,:,22200,1st Qu.,:,2007-09-18
Peter Gynn,:,1,Median,:,23400,Median,:,2008-03-25
,,,Mean,:,23733,Mean,:,2008-10-02
,,,3rd Qu.,:,25100,3rd Qu.,:,2009-07-13
,,,Max.,:,26800,Max.,:,2010-11-01
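The non-obvious part is the $0 = $0 assignment in the NR==2 block: assigning to $0 forces gawk to re-split the current record, so the FIELDWIDTHS built from that very line takes effect immediately rather than on the next record. A minimal illustration of the same trick (gawk only, hypothetical input):
$ echo 'abcdef' | gawk '{ FIELDWIDTHS="2 2 2"; $0 = $0; print $2 }'
cd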

Related

cut a field from its position & place it in different position

I have 2 files - file1 & file2 with contents as shown.
cat file1.txt
1,2,3
cat file2.txt
a,b,c
& the desired output is as below,
a,1,b,2,c,3
Can anyone please help me achieve this?
So far I have tried this:
paste -d "," file1.txt file2.txt|cut -d , -f4,1,5,2,6,3
& the output came as 1,2,3,a,b,c, because cut always emits fields in their original order regardless of how the -f list is written.
In any case, using cut is not a good approach here, I think: I know there are 3 values in both files, but if there are more values, the command above will not help.
try:
awk -F, 'FNR==NR{for(i=1;i<=NF;i++){a[FNR,i]=$i};next} {printf("%s,%s",a[FNR,1],$1);for(i=2;i<=NF;i++){printf(",%s,%s",a[FNR,i],$i)};print ""}' file2.txt file1.txt
Or, the same solution in non-one-liner form with explanations:
awk -F, 'FNR==NR{         ####field separator is ","; the FNR==NR condition is TRUE while the first file on the command line, file2.txt, is being read.
for(i=1;i<=NF;i++){       ####start a for loop running from i=1 to the total number of fields.
a[FNR,i]=$i               ####create an array named a whose index is FNR,i and whose value is $i (the field value).
};
next                      ####next skips all further statements, so they only run for the second file, file1.txt.
}
{
printf("%s,%s",a[FNR,1],$1);     ####print the saved first field of the corresponding line of file2.txt together with the current file's first field.
for(i=2;i<=NF;i++){              ####start a for loop running up to the value of NF (number of fields).
printf(",%s,%s",a[FNR,i],$i)     ####print the saved array value whose index is FNR,i, and the current $i value too.
};
print ""                         ####print a newline here.
}
' file2.txt file1.txt            ####the input files; file2.txt is read first.
paste -d "," file*|awk -F, '{print $4","$1","$5","$2","$6","$3}'
a,1,b,2,c,3
This is a simple printing operation, and other answers are most welcome, but if the files contain thousands of values, hard-coding every field position like this will not scale.
$ awk '
BEGIN { FS=OFS="," }
NR==FNR { split($0,a); next }
{
    for (i=1;i<=NF;i++) {
        printf "%s%s%s%s", $i, OFS, a[i], (i<NF?OFS:ORS)
    }
}
' file1 file2
a,1,b,2,c,3
or if you prefer:
$ paste -d, file2 file1 |
awk '
BEGIN { FS=OFS="," }
{
    n = NF/2
    for (i=1;i<=n;i++) {
        printf "%s%s%s%s", $i, OFS, $(i+n), (i<n?OFS:ORS)
    }
}
'
a,1,b,2,c,3

Restructure line fields in a file

I'm a newbie to coding but would like to use awk, sed, or bash to solve this problem.
I have a file "input.txt" that looks like this:
Otu13 k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus 0.998
Otu24 k__Bacteria;p__Candidatus_Saccharibacteria;g__Saccharibacteria_genera_incertae_sedis; 1.000;;
Otu59 k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Prevotellaceae;g__Alloprevotella 0.991
Otu41 k__Bacteria;p__Bacteroidetes;g__Alloprevotella 0.998
Firstly, I would like to drop the last column with numbers; then, for the rest of the fields in each line, write them out depending on their prefix (k__, p__, c__, o__, f__, g__).
The values after the prefixes should be printed in that fixed order, so that if one of the prefixes in the sequence is missing on a line (e.g. lines 2 and 4), it is replaced with a blank. In the end I should have 7 fields per line.
My desired output is something like this:
Otu13; Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus
Otu24; Bacteria; Candidatus_Saccharibacteria; ; ; ;Saccharibacteria_genera_incertae_sedis
Otu59; Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Alloprevotella
Otu41; Bacteria;Bacteroidetes; ; ; ; Alloprevotella
Will greatly appreciate your assistance.
It's not clear how/why you'd get the output you show from the input you posted and the description of your requirements, but I think this is what you really want:
$ cat tst.awk
BEGIN { n=split("k p c o f g",order); FS="[ ;]+|__"; OFS=";" }
{
    sub(/[0-9.;[:space:]]+$/,"")   # drop the trailing numeric column (and stray ;s)
    delete f
    for (i=2; i<=NF; i+=2) {       # after splitting, fields alternate prefix, value
        f[$i] = $(i+1)
    }
    printf "%s%s", $1, OFS
    for (i=1; i<=n; i++) {         # emit the values in fixed k p c o f g order
        printf "%s%s", f[order[i]], (i<n ? OFS : ORS)
    }
}
$ awk -f tst.awk file
Otu13;Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus
Otu24;Bacteria;Candidatus_Saccharibacteria;;;;Saccharibacteria_genera_incertae_sedis
Otu59;Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Alloprevotella
Otu41;Bacteria;Bacteroidetes;;;;Alloprevotella
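The FS="[ ;]+|__" assignment does the heavy lifting: it treats both runs of spaces/semicolons and the __ marker as field separators, so prefixes and values land in alternating fields. A quick way to see how a line tokenizes:
$ echo 'Otu41 k__Bacteria;p__Bacteroidetes;g__Alloprevotella' |
  awk 'BEGIN{FS="[ ;]+|__"} {for (i=1;i<=NF;i++) print i, $i}'
1 Otu41
2 k
3 Bacteria
4 p
5 Bacteroidetes
6 g
7 Alloprevotella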

using awk to lookup and insert data

In my continuing crusade not to use MS Excel, I'd like to process some data, send it to a file, and then insert some records from a separate file into a third file using field $1 as the index. Is this possible?
I have data like this:
2600,foo,stack,1,04/02/2015,ACH Payment,ACH Settled,1500
2600,foo,stack,2,04/06/2015,Credit Card Sale,Settled,100
2600,foo,stack,3,04/07/2015,Credit Card Sale,Settled,157.13
2600,foo,stack,4,04/07/2015,ACH Credit,ACH Settled,.03
I have this to group it:
cat group.awk
#!/usr/bin/awk -f
BEGIN {
    OFS = FS = ","
}
NR > 1 {
    arr[$1 OFS $2 OFS $3]++
}
END {
    for (key in arr)
        print key, arr[key]
}
The group makes it like this:
2600,foo,stack,4
Simple multiplication is applied to fields 5, 6 and 7 where applicable; which multiplier applies depends on field 3.
In this example we can say the finished record looks like this:
2600,foo,stack,4,.2,19.8
Now in a separate file, I have this data:
2600,registered user,5hPASLJlHlgJR4AQc9sZQ==
basic flow is:
awk -f group.awk data.csv | awk -f math.awk > finished.csv
Then use awk (if it can do this) to look up field $1 in finished.csv, find the corresponding record in the separate file (bill.csv), and print to a third file or insert into bill.csv.
Expected output in the third file (bill.csv):
x,y,,1111111,2600,,,,,,,19.8,,,registered user,,,,,,,,,,RS,,,N5hPASLJlHlgJR4AQc9sZQ==,z,a
x,y,,1111111,RS,z,a will be pre-populated, so I only need to insert the three new values.
Is this something awk can accomplish?
Edit
Field $3 is the accountID that sets the multiplication on 5, 6 and 7.
Here's the idea:
bill.awk:
NR>1{if($3=="stack" && $4>199) $5=$4*0.03;
if($3=="stack" && $4<200) $5=$4*0.05
if($3=="user") $5=$4*.01
}1
total.awk:
awk -F, -v OFS="," 'NR>1{if($3=="stack" && $5<20) $6=20-$5;
if($3=="stack" && $5>20) $6=0;}1'
This part is working and final output is like above:
2600,foo,stack,4,.2,19.8
4*.05 = .2 and 20 - .2 = 19.8
But the minimum charge is $20,
so we'll correct it: with the $20 minimum applied, the amount becomes 20 instead of 19.8.
The extra populated fields come from a separate file (bill.csv), and I need to fill the 20 into the correct record in bill.csv;
bill.csv contains everything needed except the 20.
before:
x,y,,1111111,2600,,,,,,,,,,,registered user,,,,,,,,,,RS,,,N5hPASLJlHlgJR4AQc9sZQ==,z,a
after:
x,y,,1111111,2600,,,,,,,20,,,registered user,,,,,,,,,,RS,,,N5hPASLJlHlgJR4AQc9sZQ==,z,a
Is this a better explanation? Go on the assumption that group.awk, bill.awk and total.awk are working correctly. I just need to extract the correct total for field $1 and put it in bill.csv in the correct spot.
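A hedged sketch of that final lookup-and-fill step, with the field positions inferred from the before/after examples above (total in field 6 of finished.csv, destination field 12 of bill.csv, matched on the account id in field 5); treat those positions as assumptions:
awk 'BEGIN { FS = OFS = "," }
     NR==FNR { total[$1] = $6; next }    # finished.csv: account id -> total
     ($5 in total) { $12 = total[$5] }   # bill.csv: drop the total into field 12
     { print }
' finished.csv bill.csv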
Maybe this last awk is what you need. I've tried to understand what you want, and I think it is just this awk way of merging:
For explanation: we first save the first file in an array indexed by its first field. Then, for each line of the second file, we check whether field 1 is among the indexes of our array, and if it is, we print the data from the two files together.
awk -F"," 'BEGIN {while (getline < "record.dat"){ a[$1]=$0; }} {if($1 in a){ print a[$1]","$0}}' file.dat
2600,foo,stack,4,10,10.4,2600,registered user,5hPASLJlHlgJR4AQc9sZQ==
This is the kind of solution you need:
$ cat fileA
2600,foo,stack,1,04/02/2015,ACH Payment,ACH Settled,1500
2600,foo,stack,2,04/06/2015,Credit Card Sale,Settled,100
2600,foo,stack,3,04/07/2015,Credit Card Sale,Settled,157.13
2600,foo,stack,4,04/07/2015,ACH Credit,ACH Settled,.03
2600,foo,stack,5,04/09/2015,ACH Payment,ACH Settled,147.10
$ cat fileB
2600,registered user,5hPASLJlHlgJR4AQc9sZQ==
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    cnts[$1][$2 FS $3]++
    next
}
{
    for (val in cnts[$1]) {
        cnt = cnts[$1][val]
        print $1, val, cnt, cnt*2.5, $2, $3
    }
}
$ awk -f tst.awk fileA fileB
2600,foo,stack,5,12.5,registered user,5hPASLJlHlgJR4AQc9sZQ==
but until you update your question we can't provide any more concrete help than that.
The above uses GNU awk 4.* for true 2D arrays.
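If GNU awk 4 isn't available, a hedged POSIX-friendly variant of the same idea uses a composite key plus SUBSEP instead of a true 2D array (untested sketch):
BEGIN { FS=OFS="," }
NR==FNR { cnts[$1, $2 FS $3]++; next }
{
    for (key in cnts) {
        split(key, k, SUBSEP)        # recover the two halves of the composite key
        if (k[1] == $1)
            print k[1], k[2], cnts[key], cnts[key]*2.5, $2, $3
    }
}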

AWK split file by separator and count

I have a large 220MB file. The file is grouped by horizontal rows of dashes ("---"). This is what I have so far:
cat test.list | awk -v ORS="" -v RS="-------------------------------------------------------------------------------" '{print $0;}'
How do I take this and print to a new file every 1000 matches?
Is there another way to do this? I looked at split and csplit, but the "----" rows do not occur predictably, so I have to match them and then split on a count of the matches.
I would like the output to be files of 1000 matches each.
To output the first 1000 records to outputfile0, the next to outputfile1, etc., just do:
awk 'NR%1000 == 1{ file = "outputfile" i++ } { print > file }' ORS= RS=------ test.list
(Note that I truncated the dashes in RS for simplicity.)
Unfortunately, POSIX leaves the behavior of an RS value longer than a single character unspecified, so the above cannot be a portable solution. Perhaps something like twalberg's solution (below) is required:
awk '/^----$/ { if(!(c%1000)) count+=1; c+=1; next }
{print > ("outputfile"count)}' c=1 count=1
Not tested, but something along these lines might work:
awk 'BEGIN {fileno=1; matchcount=0}
/^-------/ { if (++matchcount == 1000) { ++fileno; matchcount=0; } }
{ print $0 > ("output_file_" fileno) }' < test.list
It might be cleaner to put all that in, say, split.awk and use awk -f split.awk test.list instead...
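For completeness: GNU awk specifically does treat a multi-character RS as a regular expression, so with gawk a record-based version is possible. A hedged sketch, assuming no data line consists of 20 or more dashes:
gawk 'BEGIN { RS = "-{20,}\n"; ORS = "" }
      { print > ("outputfile" int((NR-1)/1000)) }' test.list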

How to remove several columns and the field separators at once in AWK?

I have a big file with several thousands of columns. I want to delete some specific columns and the field separators at once with AWK in Bash.
I can delete one column at a time with this oneliner (column 3 will be deleted and its corresponding field separator):
awk -vkf=3 -vFS="\t" -vOFS="\t" '{for(i=kf; i<NF;i++){ $i=$(i+1);}; NF--; print}' < Big_File
However, I want to delete several columns at once... Can someone help me figure this out?
You can pass the list of columns to be deleted from the shell to awk like this:
awk -vkf="3,5,11" ...
then, in the awk program, parse it into an array:
split(kf,kf_array,",")
and then go through all the columns, test whether each particular column is in kf_array, and skip it if so.
The other possibility is to call your one-liner several times :-)
Here is an implementation of Kamil's idea:
awk -v remove="3,8,5" '
BEGIN {
    OFS=FS="\t"
    split(remove,a,",")       # remove="3,8,5" -> a[1]=3, a[2]=8, a[3]=5
    for (i in a) b[a[i]]=1    # b holds the column numbers to delete
}
{
    j=1
    for (i=1;i<=NF;++i) {
        if (!(i in b)) {      # keep only columns not marked for deletion
            $j=$i             # shift kept columns left
            ++j
        }
    }
    NF=j-1                    # truncate the record to the kept columns
    print
}
'
If you can use cut instead of awk, this one is easier with cut.
For example, this obtains columns 1, 3, and 50 onwards from file:
cut -f1,3,50- file
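Since the question is about deleting columns, GNU cut's --complement option expresses that directly (GNU coreutils only; cut uses tab as its default delimiter, which fits this file):
cut --complement -f3,5,11 Big_File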
Something like this should work:
awk -F'\t' -v remove='3|8|5' '
{
    rec = ofs = ""
    for (i=1; i<=NF; i++) {
        if (i !~ "^(" remove ")$" ) {   # keep fields whose position is not in the remove list
            rec = rec ofs $i
            ofs = FS
        }
    }
    print rec
}
' file
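A quick smoke test of that approach, assuming the program body above is saved as drop.awk (hypothetical file name and sample data); removing columns 3, 5 and 8 leaves the others in order, tab-separated:
$ printf 'c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\n' > file
$ awk -F'\t' -v remove='3|8|5' -f drop.awk file
c1	c2	c4	c6	c7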
