Restructure line fields in a file - bash

I'm a newbie to coding, but I would like to use awk, sed, or bash to solve this problem.
I have a file "input.txt" that looks like this:
Otu13 k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus 0.998
Otu24 k__Bacteria;p__Candidatus_Saccharibacteria;g__Saccharibacteria_genera_incertae_sedis; 1.000;;
Otu59 k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Prevotellaceae;g__Alloprevotella 0.991
Otu41 k__Bacteria;p__Bacteroidetes;g__Alloprevotella 0.998
Firstly, I would like to drop the last column with numbers; then, for the rest of the fields in each line, write them out according to their prefix (k__, p__, c__, o__, f__, g__).
The values after the prefixes should be printed in the same order as in the parentheses, such that if one of the prefixes in that sequence is missing (e.g. lines 2 and 4), it is replaced with a blank. In the end I should have 7 fields.
My desired output is something like this:
Otu13; Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus
Otu24; Bacteria; Candidatus_Saccharibacteria; ; ; ;Saccharibacteria_genera_incertae_sedis
Otu59; Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Alloprevotella
Otu41; Bacteria;Bacteroidetes; ; ; ; Alloprevotella
Will greatly appreciate your assistance.

It's not clear how/why you'd get the output you show from the input you posted and the description of your requirements but I think this is what you really want:
$ cat tst.awk
BEGIN { n = split("k p c o f g",order); FS = "[ ;]+|__"; OFS = ";" }
{
    # strip the trailing confidence number (and any stray ";"s or spaces)
    sub(/[0-9.;[:space:]]+$/,"")

    # map each prefix letter to its value, e.g. f["p"] = "Firmicutes"
    delete f
    for (i=2; i<=NF; i+=2) {
        f[$i] = $(i+1)
    }

    # print the Otu id, then the 6 taxonomy levels in fixed order,
    # leaving a field empty when its prefix is absent from the line
    printf "%s%s", $1, OFS
    for (i=1; i<=n; i++) {
        printf "%s%s", f[order[i]], (i<n ? OFS : ORS)
    }
}
$ awk -f tst.awk file
Otu13;Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus
Otu24;Bacteria;Candidatus_Saccharibacteria;;;;Saccharibacteria_genera_incertae_sedis
Otu59;Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Alloprevotella
Otu41;Bacteria;Bacteroidetes;;;;Alloprevotella
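To see why that loop steps through the even-numbered fields, it helps to print the fields that FS="[ ;]+|__" produces; a quick check you can run on one of the sample lines:
$ echo 'Otu41 k__Bacteria;p__Bacteroidetes;g__Alloprevotella 0.998' |
  awk 'BEGIN{FS="[ ;]+|__"} {for (i=1; i<=NF; i++) print i, $i}'
1 Otu41
2 k
3 Bacteria
4 p
5 Bacteroidetes
6 g
7 Alloprevotella
8 0.998
Each prefix letter lands in an even-numbered field with its value right after it, which is what f[$i] = $(i+1) relies on (in the real script the trailing 0.998 is gone by then, because the sub() removes it first).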

Use an array created using awk as a variable in another awk script

I am trying to use awk to extract data with a conditional statement that refers to an array created by another awk script.
The awk script I use for creating the array is as follows:
array=($(awk 'NR>1 { print $1 }' < file.tsv))
Then, to use this array in the other awk script
awk var="${array[@]}" 'FNR==1{ for(i=1;i<=NF;i++){ heading[i]=$i } next } { for(i=2;i<=NF;i++){ if($i=="1" && heading[i] in var){ close(outFile); outFile=heading[i]".txt"; print ">kmer"NR-1"\n"$1 >> (outFile) }}}' < input.txt
However, when I run this, the following error occurs.
awk: fatal: cannot open file 'foo' for reading (No such file or directory)
I've already looked at multiple posts on why this error occurs and on how to correctly implement a shell variable in awk, but none of these have worked so far. However, when removing the shell variable and running the script it does work.
awk 'FNR==1{ for(i=1;i<=NF;i++){ heading[i]=$i } next } { for(i=2;i<=NF;i++){ if($i=="1"){ close(outFile); outFile=heading[i]".txt"; print ">kmer"NR-1"\n"$1 >> (outFile) }}}' < input.txt
I really need that conditional statement but don't know what I am doing wrong with implementing the bash variable in awk and would appreciate some help.
Thx in advance.
That specific error message occurs because you forgot the -v in front of var= (it should be awk -v var=, not just awk var=), but as others have pointed out, you can't set an array variable on the awk command line. Also note that array in your code is a shell array, not an awk array; shell and awk are 2 completely different tools, each with its own syntax, semantics, scopes, etc.
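To see why the error complains about a file: without -v, awk takes its first non-option argument as the program text, so var=... becomes the "program" and your real program string is treated as a filename. A minimal reproduction (with made-up values):
$ awk x=1 '{print}' somefile
awk: fatal: cannot open file '{print}' for reading (No such file or directory)
Here the "program" is the expression x=1, and awk then tries to open {print} as its first input file.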
Here's how to really do what you're trying to do:
array=( "$(awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv)" )
awk -v xyz="${array[*]}" '
    BEGIN { split(xyz,tmp,RS); for (i in tmp) var[tmp[i]] }
    ... now use `var` as you were trying to ...
'
For example:
$ cat file.tsv
col1 col2
a b c d e
f g h i j
$ cat -T file.tsv
col1^Icol2
a b^Ic d e
f g h^Ii j
$ awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv
a b
f g h
$ array=( "$(awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv)" )
$ awk -v xyz="${array[*]}" '
    BEGIN {
        split(xyz,tmp,RS)
        for (i in tmp) {
            var[tmp[i]]
        }
        for (idx in var) {
            print "<" idx ">"
        }
    }
'
<f g h>
<a b>
It's easier and more efficient to process both files in a single awk:
edit: fixed issues in comment, thanks @EdMorton
awk '
    FNR == NR {
        if ( FNR > 1 )
            var[$1]
        next
    }
    FNR == 1 {
        for (i = 1; i <= NF; i++)
            heading[i] = $i
        next
    }
    {
        for (i = 2; i <= NF; i++)
            if ( $i == "1" && heading[i] in var ) {
                outFile = heading[i] ".txt"
                print ">kmer" (NR-1) "\n" $1 >> (outFile)
                close(outFile)
            }
    }
' file.tsv input.txt
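If the FNR == NR pattern is new to you: FNR resets to 1 at the start of each input file while NR keeps counting across files, so the condition is true only while the first file is being read. A minimal sketch of the mechanism:
awk '
    FNR == NR { print "from file 1: " $0; next }
    { print "from file 2: " $0 }
' file.tsv input.txt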
You might store the string in a variable, then use the split function to turn it into an array. Consider the following simple example: let file1.txt content be
A B C
D E F
G H I
and file2.txt content be
1
3
2
then
var1=$(awk '{print $1}' file1.txt)
awk -v var1="$var1" 'BEGIN{split(var1,arr)}{print "First column value in line number",$1,"is",arr[$1]}' file2.txt
gives output
First column value in line number 1 is A
First column value in line number 3 is G
First column value in line number 2 is D
Explanation: I store the output of the 1st awk command, which is then used as the 1st argument to the split function in the 2nd awk command. Disclaimer: this solution assumes all files involved have delimiters compliant with default GNU AWK behavior, i.e. one or more whitespace characters is always the delimiter.
(tested in gawk 4.2.1)
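If your files use a different delimiter, the same pattern still works; set FS explicitly and split on newlines, since the stored values may then contain spaces. A sketch, assuming comma-separated versions of the same files:
var1=$(awk -F',' '{print $1}' file1.txt)
awk -F',' -v var1="$var1" 'BEGIN{split(var1,arr,"\n")}{print "First column value in line number",$1,"is",arr[$1]}' file2.txt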

Awk splitting a line by spaces where there are spaces in each field

I've got an R summary table like so:
      employee    salary         startdate
 John Doe :1   Min.   :21000   Min.   :2007-03-14
 Jolie Hope:1   1st Qu.:22200   1st Qu.:2007-09-18
 Peter Gynn:1   Median :23400   Median :2008-03-25
                Mean   :23733   Mean   :2008-10-02
                3rd Qu.:25100   3rd Qu.:2009-07-13
                Max.   :26800   Max.   :2010-11-01
and I need to produce an output csv file like so:
employee,,salary,,startdate,,
John Doe,1,Min.,21000,Min.,2007-03-14
Jolie Hope,1,1st Qu.,22200,1st Qu.,2007-09-18
Peter Gynn,1,Median,23400,Median,2008-03-25
,,Mean,23733,Mean,2008-10-02
,,3rd Qu.,25100,3rd Qu.,2009-07-13
,,Max.,26800,Max.,2010-11-01
so that in Excel each value ends up in its own cell. However, it doesn't suffice to split the fields by one or more whitespace characters:
awk -F "[ ]+" '{ print $3 }'
It works for the header, but not for the remaining lines:
salary
Doe
Hope:1
Gynn:1
:23733
Qu.:25100
:26800
Is this problem solvable using awk (and maybe sed)?
sed '1 {
s/^[[:space:]]*\([^[:space:]]\{1,\}\)[[:space:]]\{1,\}\([^[:space:]]\{1,\}\)[[:space:]]\{1,\}[[:space:]]\{1,\}\([^[:space:]]\{1,\}\)/\1,,\2,,\3,/
b
}
s/[[:space:]]\{1,\}:/:/g
/^[[:space:]]*\([^:]\{1,\}\):\([^[:space:]]*\)[[:space:]]*\([^:]\{1,\}\):\([^[:space:]]*\)[[:space:]]*\([^:]\{1,\}\):\(.[^[:space:]]*\)/ {
s//\1,\2,\3,\4,\5,\6/
b
}
/^[[:space:]]*\([^:]\{1,\}\):\([^[:space:]]*\)[[:space:]]*\([^:]\{1,\}\):\([^[:space:]]*\)/ {
s//,,\1,\2,\3,\4/
b
}
' YourFile
A sed version, just for fun, if you need it; you'll have to adapt this ArachnoRegEx a bit.
awk is a lot more interesting in this case, mainly because any later adaptation is easier, but if you only have access to sed ...
This uses GNU awk for FIELDWIDTHS, etc. and relies on the first line of input after the header always having all fields populated. It includes the positions that are just :s as output fields; I expect you can figure out how to skip those if you do want to use this solution:
$ cat tst.awk
BEGIN { OFS="," }
NR==1 {
for (i=1;i<=NF;i++) {
printf "%s%s", $i, (i<NF?OFS OFS OFS:ORS)
}
next
}
NR==2 {
tail = $0
while ( match(tail,/([^:]+):(\S+(\s+|$))/,a) ) {
FIELDWIDTHS = FIELDWIDTHS length(a[1]) " 1 " length(a[2]) " "
tail = substr(tail,RSTART+RLENGTH)
}
$0 = $0
}
{
for (i=1;i<=NF;i++) {
gsub(/^\s+|\s+$/,"",$i)
}
print
}
$ awk -f tst.awk file
employee,,,salary,,,startdate
John Doe,:,1,Min.,:,21000,Min.,:,2007-03-14
Jolie Hope,:,1,1st Qu.,:,22200,1st Qu.,:,2007-09-18
Peter Gynn,:,1,Median,:,23400,Median,:,2008-03-25
,,,Mean,:,23733,Mean,:,2008-10-02
,,,3rd Qu.,:,25100,3rd Qu.,:,2009-07-13
,,,Max.,:,26800,Max.,:,2010-11-01
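If FIELDWIDTHS is unfamiliar: it's a GNU awk feature that switches from delimiter-based splitting to fixed-width columns. A minimal standalone sketch (the widths here are made up for illustration):
$ echo 'abcdefghij' | gawk 'BEGIN{FIELDWIDTHS="3 1 6"} {print $1 "|" $2 "|" $3}'
abc|d|efghij
The NR==2 block above builds that width list dynamically from the first fully-populated data row.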

AWK split file by separator and count

I have a large 220mb file. The file is grouped by a horizontal row "---". This is what I have so far:
cat test.list | awk -v ORS="" -v RS="-------------------------------------------------------------------------------" '{print $0;}'
How do I take this and print to a new file every 1000 matches?
Is there another way to do this? I looked at split and csplit, but the "----" rows do not occur predictably, so I have to match them and then split on a count of the matches.
I would like the output files to contain 1000 matches each.
To output the first 1000 records to outputfile0, the next to outputfile1, etc., just do:
awk 'NR%1000 == 1{ file = "outputfile" i++ } { print > file }' ORS= RS=------ test.list
(Note that I truncated the dashes in RS for simplicity.)
Unfortunately, using a value of RS that is more than a single character produces unspecified results, so the above cannot be the solution. Perhaps something like twalberg's solution is required:
awk '/^-+$/ { if (!(c%1000)) count+=1; c+=1; next }
     { print > ("outputfile" count) }' c=1 count=1 test.list
Not tested, but something along these lines might work:
awk 'BEGIN { fileno=1; matchcount=0 }
     /^-------/ { if (++matchcount == 1000) { ++fileno; matchcount=0 } }
     { print $0 > "output_file_" fileno }' < test.list
It might be cleaner to put all that in, say, split.awk and use awk -f split.awk test.list instead...

How to remove several columns and the field separators at once in AWK?

I have a big file with several thousand columns. I want to delete some specific columns and their field separators at once with AWK in Bash.
I can delete one column at a time with this one-liner (column 3 and its corresponding field separator will be deleted):
awk -vkf=3 -vFS="\t" -vOFS="\t" '{for(i=kf; i<NF;i++){ $i=$(i+1);}; NF--; print}' < Big_File
However, I want to delete several columns at once... Can someone help me figure this out?
You can pass a list of columns to be deleted from the shell to awk like this:
awk -vkf="3,5,11" ...
then, in the awk program, parse it into an array:
split(kf,kf_array,",")
and then go through all the columns, test whether each particular column is in kf_array, and skip it if so.
The other possibility is to call your one-liner several times :-)
Here is an implementation of Kamil's idea:
awk -v remove="3,8,5" '
BEGIN {
OFS=FS="\t"
split(remove,a,",")
for (i in a) b[a[i]]=1
}
{
j=1
for (i=1;i<=NF;++i) {
if (!(i in b)) {
$j=$i
++j
}
}
NF=j-1
print
}
'
If you can use cut instead of awk, this is easier. E.g. this obtains columns 1, 3, and 50 onward from file:
cut -f1,3,50- file
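GNU cut also has a --complement flag that matches the question's "delete these columns" phrasing directly (GNU coreutils only, not POSIX):
cut --complement -f3,5,8 file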
Something like this should work:
awk -F'\t' -v remove='3|8|5' '
    {
        rec = ofs = ""
        for (i=1; i<=NF; i++) {
            if ( i !~ "^(" remove ")$" ) {
                rec = rec ofs $i
                ofs = FS
            }
        }
        print rec
    }
' file

Get next field/column with awk

I have a dataset of the following structure:
1234 4334 8677 3753 3453 4554
4564 4834 3244 3656 2644 0474
...
I would like to:
1) search for a specific value, e.g. 4834
2) return the following field (3244)
I'm quite new to awk but realize this is a simple operation. I have created a bash script that asks the user for the input and attempts to return the following field.
But I can't seem to get around scoping in awk. How do I pass the input value to awk?
#!/bin/bash
read input
cat data.txt | awk '{
    for (i=1; i<=NF; i++) {
        if ($i == input) {
            print $(i+1)
        }
    }
}'
Cheers and thanks in advance!
UPDATE Sept. 8th 2011
Thanks for all the replies.
1) It will never happen that the last number of a row is picked - still I appreciate you pointing this out.
2) I have a more general problem with awk. Often I want to "do something" with the result found. In this case I would like to output it to xclip, an application which reads from standard input and copies it to the clipboard. E.g.:
$ echo Hi | xclip
Unfortunately, echo doesn't exist for awk, so I need to return the value and echo it. How would you go about this?
#!/bin/bash
read input
cat data.txt | awk '{
    for (i=1; i<=NF; i++) {
        if ($i == '$input') {
            print $(i+1)
        }
    }
}'
Don't over think it!
You can create an array in awk with the split command:
split($0, ary)
This will split the line $0 into an array called ary. Now, you can use array syntax to find the particular fields:
awk '{
    size = split($0, ary)
    for (i=1; i <= size; i++) {
        print ary[i]
    }
    print "---"
}' data.txt
Now, when you find ary[x] as the field, you can print out ary[x+1].
In your example:
awk -v input="$input" '{
    size = split($0, ary)
    for (i=1; i <= size; i++) {
        if (ary[i] == input) {
            print ary[i+1]
        }
    }
}' data.txt
There is a way of doing this without creating an array, but it's simply much easier to work with arrays in situations like this.
By the way, you can eliminate the cat command by putting the file name after the awk statement and save creating an extraneous process. Everyone knows creating an extraneous process kills a kitten. Please don't kill a kitten.
You pass a shell variable to awk using the -v option. It's cleaner/nicer than having to splice in quotes.
awk -v input="$input" '
for(i=1;i<=NF;i++){
if ($i == input ){
print "Next value: " $(i+1)
}
}
' data.txt
And lose the useless cat.
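As for the xclip part of the update: awk writes to standard output like any other command, so you can pipe the whole thing into xclip (a sketch; xclip reads stdin by default):
awk -v input="$input" '{
    for (i=1; i<=NF; i++)
        if ($i == input) print $(i+1)
}' data.txt | xclip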
Here is my solution: delete everything up to (and including) the search field; the field you then want to print is field #1 ($1):
awk '/4834/ {sub(/^.* * 4834 /, ""); print $1}' data.txt
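The search value is hard-coded there; to drive it from the question's shell variable you could build the regex dynamically (a sketch; it assumes input contains no regex metacharacters and never occurs as a substring of another field):
awk -v input="$input" '$0 ~ input { sub("^.*" input " +", ""); print $1 }' data.txt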
