awk or bash to split lines

I would like to split a csv file which looks like this:
a|b|1,2,3
c|d|4,5
e|f|6,7,8
the goal is this format:
a|b|1
a|b|2
a|b|3
c|d|4
c|d|5
e|f|6
e|f|7
e|f|8
How can I do this in bash or awk?

With bash:
while IFS="|" read -r a b c; do for n in ${c//,/ }; do echo "$a|$b|$n"; done; done <file
Output:
a|b|1
a|b|2
a|b|3
c|d|4
c|d|5
e|f|6
e|f|7
e|f|8
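A variant of the same loop that splits the comma list with read -a instead of unquoted word splitting, so the values are never subject to globbing (a sketch; the /tmp path is just for the demo):

```shell
#!/bin/bash
# Same logic as above, but split the comma list into an array with read -a.
printf 'a|b|1,2,3\nc|d|4,5\n' > /tmp/demo.csv    # sample input
while IFS='|' read -r a b c; do
  IFS=',' read -ra nums <<< "$c"
  for n in "${nums[@]}"; do
    printf '%s|%s|%s\n' "$a" "$b" "$n"
  done
done < /tmp/demo.csv
```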

$ cat hm.awk
{
    s = $0; p = ""
    while (i = index(s, "|")) {   # `p': up to the last '|'
        p = p substr(s, 1, i)     # `s': the rest
        s = substr(s, i + 1)
    }
    n = split(s, a, ",")
    for (i = 1; i <= n; i++)
        print p a[i]
}
Usage:
awk -f hm.awk file.csv

Using awk's split() with a regexp separator:
$ awk '{n=split($0,a,"[|,]");for(i=3;i<=n;i++) print a[1] "|" a[2] "|" a[i]}' file
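A hedged variant of the same split() idea that keeps however many leading fields there are, by stripping the last field from a copy of the line (sample file path is an assumption):

```shell
# Keeps any number of leading "|" fields; only the last field is a comma list.
printf 'a|b|1,2,3\nw|x|y|4,5\n' > /tmp/demo2.csv    # sample input
awk -F'|' '{
  pre = $0
  sub(/[^|]*$/, "", pre)          # pre = everything up to and including the last |
  n = split($NF, a, ",")
  for (i = 1; i <= n; i++) print pre a[i]
}' /tmp/demo2.csv
```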

With Perl:
$ cat ip.csv
a|b|1,2,3
c|d|4,5
e|f|6,7,8
$ perl -F'\|' -lane 'print join "|", @F[0..1],$_ foreach split /,/,$F[2]' ip.csv
a|b|1
a|b|2
a|b|3
c|d|4
c|d|5
e|f|6
e|f|7
e|f|8
splits the input line on | into the @F array
then, for every comma-separated value in the 3rd column, prints it in the desired format
For a generic last column:
perl -F'\|' -lane 'print join "|", @F[0..$#F-1],$_ foreach split /,/,$F[-1]' ip.csv


Use an array created using awk as a variable in another awk script

I am trying to use awk to extract data using a conditional statement containing an array created using another awk script.
The awk script I use for creating the array is as follows:
array=($(awk 'NR>1 { print $1 }' < file.tsv))
Then, to use this array in the other awk script
awk var="${array[@]}" 'FNR==1{ for(i=1;i<=NF;i++){ heading[i]=$i } next } { for(i=2;i<=NF;i++){ if($i=="1" && heading[i] in var){ close(outFile); outFile=heading[i]".txt"; print ">kmer"NR-1"\n"$1 >> (outFile) }}}' < input.txt
However, when I run this, the following error occurs.
awk: fatal: cannot open file 'foo' for reading (No such file or directory)
I've already looked at multiple posts on why this error occurs and on how to correctly implement a shell variable in awk, but none of these have worked so far. However, when removing the shell variable and running the script it does work.
awk 'FNR==1{ for(i=1;i<=NF;i++){ heading[i]=$i } next } { for(i=2;i<=NF;i++){ if($i=="1"){ close(outFile); outFile=heading[i]".txt"; print ">kmer"NR-1"\n"$1 >> (outFile) }}}' < input.txt
I really need that conditional statement but don't know what I am doing wrong with implementing the bash variable in awk and would appreciate some help.
Thx in advance.
That specific error message is because you forgot -v in front of var= (it should be awk -v var=, not just awk var=), but as others have pointed out, you can't set an array variable on the awk command line. Also note that array in your code is a shell array, not an awk array; shell and awk are two completely different tools, each with their own syntax, semantics, scopes, etc.
Here's how to really do what you're trying to do:
array=( "$(awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv)" )
awk -v xyz="${array[*]}" '
BEGIN{ split(xyz,tmp,RS); for (i in tmp) var[tmp[i]] }
... now use `var` as you were trying to ...
'
For example:
$ cat file.tsv
col1 col2
a b c d e
f g h i j
$ cat -T file.tsv
col1^Icol2
a b^Ic d e
f g h^Ii j
$ awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv
a b
f g h
$ array=( "$(awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv)" )
$ awk -v xyz="${array[*]}" '
BEGIN {
    split(xyz,tmp,RS)
    for (i in tmp) {
        var[tmp[i]]
    }
    for (idx in var) {
        print "<" idx ">"
    }
}
'
<f g h>
<a b>
It's easier and more efficient to process both files in a single awk:
edit: fixed issues in comment, thanks @EdMorton
awk '
FNR == NR {
    if ( FNR > 1 )
        var[$1]
    next
}
FNR == 1 {
    for (i = 1; i <= NF; i++)
        heading[i] = $i
    next
}
{
    for (i = 2; i <= NF; i++)
        if ( $i == "1" && heading[i] in var) {
            outFile = heading[i] ".txt"
            print ">kmer" (NR-1) "\n" $1 >> (outFile)
            close(outFile)
        }
}
' file.tsv input.txt
You might store the string in a variable, then use the split function to turn it into an array. Consider the following simple example: let file1.txt content be
A B C
D E F
G H I
and file2.txt content be
1
3
2
then
var1=$(awk '{print $1}' file1.txt)
awk -v var1="$var1" 'BEGIN{split(var1,arr)}{print "First column value in line number",$1,"is",arr[$1]}' file2.txt
gives output
First column value in line number 1 is A
First column value in line number 3 is G
First column value in line number 2 is D
Explanation: I store the output of the 1st awk command, which is then used as the 1st argument to the split function in the 2nd awk command. Disclaimer: this solution assumes all files involved have a delimiter compliant with default GNU AWK behavior, i.e. one or more whitespace characters.
(tested in gawk 4.2.1)
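If the first-column values could themselves contain spaces, the same idea works by splitting on newlines instead of the default whitespace; a sketch (the /tmp file names stand in for file1.txt and file2.txt):

```shell
printf 'A B C\nD E F\nG H I\n' > /tmp/file1.txt    # sample data
printf '1\n3\n2\n' > /tmp/file2.txt
var1=$(awk '{print $1}' /tmp/file1.txt)
# Split on "\n" so each line of var1 becomes one array element.
awk -v var1="$var1" 'BEGIN{split(var1, arr, "\n")} {print arr[$1]}' /tmp/file2.txt
```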

change range of letters in a specific column

I want to change in column 11 these characters
!"#$%&'()*+,-.\/0123456789:;<=>?@ABCDEFGHIJ
for these characters:
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghi
so, if I have 000@! in column 11, it should become PPP_@. I tried awk:
awk '{a = gensub(/[@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghi]/, /[!\"\#$%&'\''()*+,-.\/0123456789:;<=>?@ABCDEFGHIJ]/, "g", $11); print a }' file.txt
but it does not work...
Try Perl.
perl -lane '$F[10] =~ y/!"#$%&'"'"'()*+,-.\/0-9:;<=>?@A-J/@A-Z[\\]^_`a-i/;
print join(" ", @F)'
I am assuming by "column 11" you mean a string of several characters after the tenth run of successive whitespace, which is what the -a option splits on by default (basically to simulate Awk). Unfortunately, changes to the array @F do not show up in the output directly, so you have to reconstruct the output line from (the modified) @F, which will normalize the field delimiter to just a single space.
Just change f = 2 to f = 11:
$ cat tst.awk
BEGIN {
    f = 2
    old = "!\"#$%&'()*+,-.\\/0123456789:;<=>?@ABCDEFGHIJ"
    new = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\\\]^_`abcdefghi"
    n = length(old)
    for (i=1; i<=n; i++) {
        map[substr(old,i,1)] = substr(new,i,1)
    }
}
{
    n = length($f)
    newStr = ""
    for (i=1; i<=n; i++) {
        oldChar = substr($f,i,1)
        newStr = newStr (oldChar in map ? map[oldChar] : oldChar)
    }
    $f = newStr
    print
}
$ cat file
a 000@! b
$ awk -f tst.awk file
a PPP_@ b
Note that you have to escape "s and \s in strings.
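A quick sketch of that escaping point: inside an awk string literal, \" produces a double quote and \\ produces a single backslash:

```shell
awk 'BEGIN {
  s = "quote: \" backslash: \\"   # \" -> "  and  \\ -> \
  print s
}'
```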

Use AWK to print FILENAME to CSV

I have a little script to compare some columns inside a bunch of CSV files.
It's working fine, but there are some things that are bugging me.
Here is the code:
FILES=./*
for f in $FILES
do
cat -v $f | sed "s/\^A/,/g" > op_tmp.csv
awk -F, -vOFS=, 'NR == 1{next} $9=="T"{t[$8]+=$7;n[$8]} $9=="A"{a[$8]+=$7;n[$8]} $9=="C"{c[$8]+=$7;n[$8]} $9=="R"{r[$8]+=$7;n[$8]} $9=="P"{p[$8]+=$7;n[$8]} END{ for (i in n){print i "|" "A" "|" a[i]; print i "|" "C" "|" c[i]; print i "|" "R" "|" r[i]; print i "|" "P" "|" p[i]; print i "|" "T" "|" t[i] "|" (t[i]==a[i]+c[i]+r[i]+p[i] ? "ERROR" : "MATCHED")} }' op_tmp.csv >> output.csv
rm op_tmp.csv
done
Just to explain:
I get all the files in the directory, then I use cat -v and sed to replace the ^A delimiter with a comma.
Then I use the awk one-liner to compare the columns I need and print the result to output.csv.
But now I want to print the filename before every loop.
I tried using the cat, sed and awk in the same line and printing $FILENAME, but it doesn't work:
cat -v $f | sed "s/\^A/,/g" | awk -F, -vOFS=, 'NR == 1{next} $9=="T"{t[$8]+=$7;n[$8]} $9=="A"{a[$8]+=$7;n[$8]} $9=="C"{c[$8]+=$7;n[$8]} $9=="R"{r[$8]+=$7;n[$8]} $9=="P"{p[$8]+=$7;n[$8]} END{ for (i in n){print i "|" "A" "|" a[i]; print i "|" "C" "|" c[i]; print i "|" "R" "|" r[i]; print i "|" "P" "|" p[i]; print i "|" "T" "|" t[i] "|" (t[i]==a[i]+c[i]+r[i]+p[i] ? "ERROR" : "MATCHED")} }' > output.csv
Can anyone help?
You can rewrite the whole script better, but assuming it does what you want for now, just add
echo "$f" >> output.csv
before the awk call.
If you want to add the filename to every awk output line, you have to pass it in as a variable, i.e.
awk ... -v fname="$f" '{...; print fname... etc
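A minimal sketch of that -v approach (the file name and contents are hypothetical):

```shell
printf 'x\ny\n' > /tmp/sample.csv    # stand-in for one of the CSV files
f=/tmp/sample.csv
# fname is a plain awk variable, available on every line and in END blocks.
awk -v fname="$f" '{print fname "|" $0}' "$f"
```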
A rewrite:
for f in ./*; do
awk -F '\x01' -v OFS="|" '
BEGIN {
    letter[1]="A"; letter[2]="C"; letter[3]="R"; letter[4]="P"; letter[5]="T"
    letters["A"] = letters["C"] = letters["R"] = letters["P"] = letters["T"] = 1
}
NR == 1 {next}
$9 in letters {
    count[$9,$8] += $7
    seen[$8]
}
END {
    print FILENAME
    for (i in seen) {
        sum = 0
        for (j=1; j<=4; j++) {
            print i, letter[j], count[letter[j],i]
            sum += count[letter[j],i]
        }
        print i, "T", count["T",i], (count["T",i] == sum ? "ERROR" : "MATCHED")
    }
}
' "$f"
done > output.csv
Notes:
your method of iterating over files will break as soon as you have a filename with a space in it
try to reduce duplication as much as possible.
newlines are free, use them to improve readability
improve your variable names i, n, etc -- here "letter" and "letters" could use improvement to hold some meaning about those symbols.
awk has a FILENAME variable (here's the actual answer to your question)
awk understands \x01 to be a Ctrl-A -- I assume that's the field separator in your input files
define an Output Field Separator that you'll actually use
If you have GNU awk (4.0 or later) you can use the ENDFILE block and do away with the shell for loop altogether:
gawk -F '\x01' -v OFS="|" '
BEGIN {...}
FNR == 1 {next}
$9 in letters {...}
ENDFILE {
print FILENAME
for ...
# clean up the counters for the next file
delete count
delete seen
}
' ./* > output.csv

UPDATED: Bash + Awk: Print first X (dynamic) columns and always the last column

#file test.txt
a b c 5
d e f g h 7
gg jj 2
Say X = 3; I need the output like this:
#file out.txt
a b c 5
d e f 7
gg jj 2
NOT this:
a b c 5
d e f 7
gg jj 2 2 <--- WRONG
I've gotten to this stage:
cat test.txt | awk ' { print $1" "$2" "$3" "NF } '
If you're unsure of the total number of fields, then one option would be to use a loop:
awk '{ for (i = 1; i <= 3 && i < NF; ++i) printf "%s ", $i; print $NF }' file
The loop can be avoided by using a ternary:
awk '{ print $1, $2, (NF > 3 ? $3 OFS $NF : $3) }' file
This is slightly more verbose than the approach suggested by 123 but means that you aren't left with trailing white space on the lines with three fields. OFS is the Output Field Separator, a space by default, which is what print inserts between fields when you use a ,.
Use a $ combined with NF :
cat test.txt | awk ' { print $1" "$2" "$3" "$NF } '
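For a truly dynamic X, the loop approach above can take the cutoff as a -v variable; a sketch (x=3 and the /tmp file name are assumptions for the demo):

```shell
printf 'a b c 5\nd e f g h 7\ngg jj 2\n' > /tmp/test.txt    # sample input
awk -v x=3 '{
  out = ""
  for (i = 1; i < x && i < NF; i++) out = out $i OFS
  # append the last kept field, plus $NF only when there are extra fields
  out = out (NF > x ? $x OFS $NF : $NF)
  print out
}' /tmp/test.txt
```

Like the ternary version, this avoids printing the last field twice on lines that have exactly x fields.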

How to parse data vertically in shell script?

I have data printed out in the console like this:
A B C D E
1 2 3 4 5
I want to manipulate it so A:1 B:2 C:3 D:4 E:5 is printed.
What is the best way to go about it? Should I tokenize the two lines and then print it out using arrays?
How do I go about it in bash?
Awk is good for this.
awk 'NR==1{for(i=1;i<=NF;i++)row[i]=$i} NR==2{for(i=1;i<=NF;i++)printf "%s:%s%s",row[i],$i,(i<NF?" ":"\n")}' oldfile > newfile
A slightly more readable version for scripts:
#!/usr/bin/awk -f
NR == 1 {
    for (i = 1; i <= NF; i++) {
        first_row[i] = $i
    }
}
NR == 2 {
    for (i = 1; i <= NF; i++) {
        # separate the pairs with spaces (awk fields are 1-based)
        printf "%s%s:%s", (i > 1 ? " " : ""), first_row[i], $i
    }
    print ""
}
If you want it to scale vertically, you'll have to say how.
For two lines with any number of elements:
(read LINE;
LINE_ONE=($LINE);
read LINE;
LINE_TWO=($LINE);
for i in `seq 0 $((${#LINE_ONE[@]} - 1))`;
do
echo ${LINE_ONE[$i]}:${LINE_TWO[$i]};
done)
To do pairs of lines just wrap it in a loop.
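A sketch of that wrapping loop, pairing lines two at a time (bash; assumes each pair has the same number of fields):

```shell
#!/bin/bash
printf 'A B C\n1 2 3\nD E\n4 5\n' |    # two pairs of sample lines
while read -r l1 && read -r l2; do
  one=($l1); two=($l2)
  for i in "${!one[@]}"; do
    printf '%s:%s' "${one[$i]}" "${two[$i]}"
    [ "$i" -lt $((${#one[@]} - 1)) ] && printf ' '
  done
  echo
done
```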
This might work for you:
echo -e "A B C D E FFF GGG\n1 2 3 4 5 666 7" |
sed 's/ \|$/:/g;N;:a;s/\([^:]*:\)\([^\n]*\)\n\([^: ]\+ *\)/\2\1\3\n/;ta;s/\n//'
A:1 B:2 C:3 D:4 E:5 FFF:666 GGG:7
Perl one-liner:
perl -lane 'if($.%2){@k=@F}else{print join" ",map{"$k[$_]:$F[$_]"}0..$#F}'
Somewhat more legible version:
#!/usr/bin/perl
my @keys;
while (<>) {
    chomp;
    if ($. % 2) { # odd lines are keys
        @keys = split ' ', $_;
    } else { # even lines are values
        my @values = split ' ', $_;
        print join ' ', map { "$keys[$_]:$values[$_]" } 0..$#values;
    }
}
