Storing a multidimensional array using awk - bash

My input file is
a|b|c|d
w|r|g|h
I want to store the values in an array like
a[1,1] = a
a[1,2] = b
a[2,1] = w
Kindly suggest a way to achieve this in awk/bash.
I have two input files and need to do field-level validation.

Like this
awk -F'|' '{for(i=1;i<=NF;i++)a[NR,i]=$i}
END {print a[1,1],a[2,2]}' file
Output
a r

This parses the file into an awk array:
awk -F \| '{ for(i = 1; i <= NF; ++i) a[NR,i] = $i }' filename
You'll have to add code that uses the array for this to be of any use, of course. Since you didn't say what you wanted to do with the array once it is complete (after the pass over the file), this is all the answer I can give you.
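That said, since the question mentions comparing two input files field by field, a minimal sketch of such follow-on code might look like this (the names file1 and file2, and the mismatch report format, are assumptions):
awk -F'|' '
NR == FNR {                 # first file: store every field
    for (i = 1; i <= NF; i++)
        a[FNR,i] = $i
    next
}
{                           # second file: compare against the stored fields
    for (i = 1; i <= NF; i++)
        if ( a[FNR,i] != $i )
            print "line " FNR ", field " i ": " a[FNR,i] " != " $i
}
' file1 file2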

You're REALLY going to want to get/use gawk 4.* if you're using multi-dimensional arrays, as that's the only awk that supports them. When you write:
a[1,2]
in any awk, you are actually creating a pseudo-multi-dimensional array, which is a 1-dimensional array indexed by the string formed by the concatenation of
1 SUBSEP 2
where SUBSEP is a control char that's unlikely to appear in your input.
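You can see that for yourself by splitting a stored index back apart on SUBSEP:
awk 'BEGIN {
    a[1,2] = "x"
    for (idx in a) {
        split(idx, parts, SUBSEP)    # recovers parts[1]="1", parts[2]="2"
        print parts[1], parts[2], a[idx]
    }
}'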
In GNU awk 4.* you can do:
a[1][2]
(note the different syntax) and that populates an actual multi-dimensional array.
Try this to see the difference:
$ cat tst.awk
BEGIN {
    SUBSEP = ":"    # just to make it visible when printing
    oneD[1,2] = "a"
    oneD[1,3] = "b"
    twoD[1][2] = "c"
    twoD[1][3] = "d"
    for (idx in oneD) {
        print "oneD", idx, oneD[idx]
    }
    print ""
    for (idx1 in twoD) {
        print "twoD", idx1
        for (idx2 in twoD[idx1]) {    # you CANNOT do this with oneD
            print "twoD", idx1, idx2, twoD[idx1][idx2]
        }
    }
}
$ awk -f tst.awk
oneD 1:2 a
oneD 1:3 b
twoD 1
twoD 1 2 c
twoD 1 3 d

Related

Use an array created using awk as a variable in another awk script

I am trying to use awk to extract data using a conditional statement containing an array created using another awk script.
The awk script I use for creating the array is as follows:
array=($(awk 'NR>1 { print $1 }' < file.tsv))
Then, to use this array in the other awk script
awk var="${array[#]}" 'FNR==1{ for(i=1;i<=NF;i++){ heading[i]=$i } next } { for(i=2;i<=NF;i++){ if($i=="1" && heading[i] in var){ close(outFile); outFile=heading[i]".txt"; print ">kmer"NR-1"\n"$1 >> (outFile) }}}' < input.txt
However, when I run this, the following error occurs.
awk: fatal: cannot open file 'foo' for reading (No such file or directory)
I've already looked at multiple posts on why this error occurs and on how to correctly use a shell variable in awk, but none of these have worked so far. However, when I remove the shell variable and run the script, it does work.
awk 'FNR==1{ for(i=1;i<=NF;i++){ heading[i]=$i } next } { for(i=2;i<=NF;i++){ if($i=="1"){ close(outFile); outFile=heading[i]".txt"; print ">kmer"NR-1"\n"$1 >> (outFile) }}}' < input.txt
I really need that conditional statement but don't know what I am doing wrong when passing the bash variable to awk, and would appreciate some help.
Thanks in advance.
That specific error message is because you forgot -v in front of var= (it should be awk -v var=, not just awk var=), but as others have pointed out, you can't set an array variable on the awk command line. Also note that array in your code is a shell array, not an awk array, and shell and awk are two completely different tools, each with their own syntax, semantics, scopes, etc.
Here's how to really do what you're trying to do:
array=( "$(awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv)" )
awk -v xyz="${array[*]}" '
BEGIN{ split(xyz,tmp,RS); for (i in tmp) var[tmp[i]] }
... now use `var` as you were trying to ...
'
For example:
$ cat file.tsv
col1 col2
a b c d e
f g h i j
$ cat -T file.tsv
col1^Icol2
a b^Ic d e
f g h^Ii j
$ awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv
a b
f g h
$ array=( "$(awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv)" )
$ awk -v xyz="${array[*]}" '
BEGIN {
    split(xyz,tmp,RS)
    for (i in tmp) {
        var[tmp[i]]
    }
    for (idx in var) {
        print "<" idx ">"
    }
}
'
<f g h>
<a b>
It's easier and more efficient to process both files in a single awk:
edit: fixed issues in comment, thanks @EdMorton
awk '
FNR == NR {
    if ( FNR > 1 )
        var[$1]
    next
}
FNR == 1 {
    for (i = 1; i <= NF; i++)
        heading[i] = $i
    next
}
{
    for (i = 2; i <= NF; i++)
        if ( $i == "1" && heading[i] in var ) {
            outFile = heading[i] ".txt"
            print ">kmer" (NR-1) "\n" $1 >> (outFile)
            close(outFile)
        }
}
' file.tsv input.txt
You might store the string in a variable, then use the split function to turn it into an array. Consider the following simple example: let file1.txt content be
A B C
D E F
G H I
and file2.txt content be
1
3
2
then
var1=$(awk '{print $1}' file1.txt)
awk -v var1="$var1" 'BEGIN{split(var1,arr)}{print "First column value in line number",$1,"is",arr[$1]}' file2.txt
gives output
First column value in line number 1 is A
First column value in line number 3 is G
First column value in line number 2 is D
Explanation: I store the output of the 1st awk command, which is then used as the 1st argument to the split function in the 2nd awk command. Disclaimer: this solution assumes all files involved have a delimiter compliant with default GNU AWK behavior, i.e. one or more whitespace characters is always the delimiter.
(tested in gawk 4.2.1)
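If your values may themselves contain spaces, that assumption breaks; one workaround (a sketch, assuming file1.txt is tab-separated) is to split on newlines instead:
var1=$(awk -F'\t' '{print $1}' file1.txt)
awk -v var1="$var1" 'BEGIN{split(var1,arr,"\n")}{print "First column value in line number",$1,"is",arr[$1]}' file2.txt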

How to remove columns from a file given the columns in another file in Linux?

Suppose I have a file fileA that contains the column numbers that need to be removed (I really have over 500 columns in my input file fileB),
fileA:
2
5
And I want to remove those columns (2 and 5) from fileB:
a b c d e f
g h i j k l
in Linux to get:
a c d f
g i j l
What should I do? I found out that I could eliminate printing those columns with the code:
awk '{$2=$5="";print $0}' fileB
However, there are two problems with this approach: first, it does not really remove those columns, it just replaces them with empty strings; second, instead of manually typing in those column numbers, how can I get them by reading from another file?
Original Question:
Suppose I have a file A that contains the column numbers that need to be removed,
file A:
223
345
346
567
And I want to remove those columns (223, 345, 567) from file B in Linux. What should I do?
If your cut have the --complement option then you can do:
cut --complement -d ' ' -f "$(echo $(<FileA))" fileB
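If your cut rejects the blank-separated list that $(echo $(<FileA)) produces, you can build a comma-separated one with paste instead:
cut --complement -d ' ' -f "$(paste -sd, FileA)" fileB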
$ cat tst.awk
NR==FNR {
    badFldNrs[$1]
    next
}
FNR == 1 {
    for (inFldNr=1; inFldNr<=NF; inFldNr++) {
        if ( !(inFldNr in badFldNrs) ) {
            out2in[++numOutFlds] = inFldNr
        }
    }
}
{
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        inFldNr = out2in[outFldNr]
        printf "%s%s", $inFldNr, (outFldNr<numOutFlds ? OFS : ORS)
    }
}
$ awk -f tst.awk fileA fileB
a c d f
g i j l
One awk idea:
awk '
FNR==NR { skip[$1] ; next }     # store field #s to be skipped
{
    line=""                     # initialize output variable
    pfx=""                      # first prefix will be ""
    for (i=1;i<=NF;i++)         # loop through the fields in this input line ...
        if ( !(i in skip) ) {   # ... if field # not mentioned in the skip[] array then ...
            line=line pfx $i    # ... add to our output variable
            pfx=OFS             # prefix = OFS for 2nd-nth fields added to output variable
        }
    if ( pfx == OFS )           # if we have something to print ...
        print line              # ... print output variable to stdout
}
' fileA fileB
NOTE: OP hasn't provided the input/output field delimiters; OP can add the appropriate FS/OFS assignments as needed
This generates:
a c d f
g i j l
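For example, if both files were comma-separated, the same logic with the FS/OFS assignments added might look like this (a sketch; substitute your actual delimiter):
awk '
BEGIN   { FS=OFS="," }          # comma-separated input and output
FNR==NR { skip[$1]; next }      # file #1: field numbers to drop
{
    line=""; pfx=""
    for (i=1;i<=NF;i++)
        if ( !(i in skip) ) { line=line pfx $i; pfx=OFS }
    if ( pfx == OFS )
        print line
}
' fileA fileB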
Using awk
$ awk 'NR==FNR {col[$1]=$1;next} {for(i=1;i<=NF;++i) if (i != col[i]) printf("%s ", $i); printf("\n")}' fileA fileB
a c d f
g i j l

How can I name a variable dynamically in awk? [duplicate]

This question already has answers here: Does awk support dynamic user-defined variables? (4 answers) Closed 5 years ago.
I have a requirement where the first "n" rows ("n" being a variable passed at run time) should be stored in "n" corresponding arrays, each array named arrayRow"NR"; here, NR is the built-in awk variable.
awk -F"," -v n=$Header_Count 'NR<=n{
split($0,**("arrayRow" NR)**) }' SampleText.txt
So the first row should be stored in array arrayRow1, the second row in arrayRow2, etc. Any ideas on how to create dynamic array names in awk?
You can't do it without a hacky shell loop and bad coding practices for letting shell variables expand to become part of an awk command, or a function that populates hard-coded, predefined arrays given a numeric arg or similar. Don't even try it.
If you have or can get GNU awk you should use true multi-dimensional arrays instead:
$ cat file
a b c
d e
f g h
i j k
l m
$ awk -v n=3 '
NR<=n { arrayRow[NR][1]; split($0,arrayRow[NR]) }
END {
    for (i in arrayRow) {
        for (j in arrayRow[i]) {
            print i, j, arrayRow[i][j]
        }
    }
}
' file
1 1 a
1 2 b
1 3 c
2 1 d
2 2 e
3 1 f
3 2 g
3 3 h
Otherwise you can use pseudo-multi-dimensional arrays in any awk:
$ awk -v n=3 '
NR<=n { for (i=1;i<=NF;i++) arrayRow[NR,i] = $i; nf[NR]=NF }
END {
    for (i in nf) {
        for (j=1; j<=nf[i]; j++) {
            print i, j, arrayRow[i,j]
        }
    }
}
' file
but you don't actually need multi-dimensional arrays for this at all; just use a 1-dimensional array and split the string into a new array when you need it:
$ awk -v n=3 '
NR<=n { arrayRow[NR] = $0 }
END {
    for (i in arrayRow) {
        split(arrayRow[i],arrayRowI)
        for (j in arrayRowI) {
            print i, j, arrayRowI[j]
        }
    }
}
' file

Remove the last-occurring line of each pattern

I want to exclude/delete the last line matching the pattern {n}{n}{n}.log for each possible 3-digit number. Each line ends with a pattern like "123.log".
Sample input file:
aaaa116.log
a112.log
aaa112.log
a113.log
aaaaa112.log
aaa113.log
aa112.log
aaa116.log
a113.log
aaaaa116.log
aaa113.log
aa114.log
Output file:
aaaa116.log
a112.log
aaa112.log
a113.log
aaaaa112.log
aaa113.log
aaa116.log
a113.log
How could this be done in a bash script?
It is fairly simple to remove the last matching line in awk without retaining order.
awk -F'[^0-9]+' '/[0-9]+\.log$/ {
    t = $(NF - 1);
    if (t in a)
        print a[t];
    a[t] = $0;
}'
To keep the output ordered is more complicated, and requires more memory.
awk -F'[^0-9]+' '/[0-9]+\.log$/ {
    t = $(NF - 1);
    a[++i] = $0;
    b[$0] = t;
    c[t] = i;
}
END {
    for (n = 1; n <= i; n++)
        if (n != c[b[a[n]]])
            print a[n];
}'
To pass through non-matching lines in the first example, a next statement can be added to the action and a pattern of 1 can be appended, as shown below. For the second example, the assignment into array a can be moved to its own action.
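Applied to the first example, those two changes might look like this:
awk -F'[^0-9]+' '
/[0-9]+\.log$/ {
    t = $(NF - 1)
    if (t in a)
        print a[t]
    a[t] = $0
    next          # matching lines are fully handled here ...
}
1                 # ... everything else falls through and is printed unchanged
'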
Probably awk would be the easiest tool for this. For example, this one-liner
tac file | awk 'match($0, /[0-9]{3}\.log/,a) && a[0] in b; {b[a[0]]}' | tac
produces the requested output for the sample input. This does not require awk to store the entire file in memory. Note that the three-argument form of match is specific to GNU awk.
Change the regular expression to suit your specific needs.
This two-pass approach reads the file twice: the first pass records the line number of the last occurrence of each trailing "NNN.log" key, and the second pass prints every line except those:
$ awk '{k=substr($0,length()-6)} NR==FNR{n[k]=NR;next} FNR!=n[k]' file file
aaaa116.log
a112.log
aaa112.log
a113.log
aaaaa112.log
aaa113.log
aaa116.log
a113.log

How to grep number of unique occurrences

I understand that grep -c string can be used to count the occurrences of a given string. What I would like to do is count the number of unique occurrences when only part of the string is known or remains constant.
For example, if I had a file (in this case a log) with several lines containing a constant string and a repeating variable, like so:
string=value1
string=value1
string=value1
string=value2
string=value3
string=value2
Then I would like to be able to identify the number of each unique set, with output similar to the following (ideally with a single grep/awk string):
value1 = 3 occurrences
value2 = 2 occurrences
value3 = 1 occurrences
Does anyone have a solution using grep or awk that might work? Thanks in advance!
This worked perfectly... Thanks to everyone for your comments!
grep -oP "wwn=[^,]*" path/to/file | sort | uniq -c
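(The wwn= pattern comes from the asker's actual log data; for the sample data shown in the question, the equivalent would be something like the following, where PCRE's \K discards the constant prefix from each match:)
grep -oP "string=\K\S+" file | sort | uniq -c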
In general, if you want to grep and also keep track of results, it is best to use awk since it performs such things in a clear manner with a very simple syntax.
So for your given file I would use:
$ awk -F= '/string=/ {count[$2]++} END {for (i in count) print i, count[i]}' file
value1 3
value2 2
value3 1
What is this doing?
-F=
set the field separator to =, so that the left and right parts of each line become $1 and $2.
/string=/ {count[$2]++}
when the pattern "string=" is found, count it! This uses an array count[] to keep track of the number of times each value of the second field has appeared so far.
END {for (i in count) print i, count[i]}
at the end, loop through the results and print them.
Here's an awk script:
#!/usr/bin/awk -f
BEGIN {
    file = ARGV[1]
    while ((getline line < file) > 0) {
        for (i = 2; i < ARGC; ++i) {
            p = ARGV[i]
            if (line ~ p) {
                a[p] += !a[p, line]++    # count each matching line only once per pattern
            }
        }
    }
    for (i = 2; i < ARGC; ++i) {
        p = ARGV[i]
        printf("%s = %d occurrences\n", p, a[p])
    }
    exit
}
Example:
awk -f script.awk somefile ab sh
Output:
ab = 7 occurrences
sh = 2 occurrences
