Passing parameters from shell to awk as an array - shell

I am using a shell script, and within it an awk script. I pass parameters from the shell script to awk using the -v option. When the argument size exceeds a certain limit, I get an 'Argument list too long' error. That was my previous question, but I have since found its root cause. Now my question is:
The variable passed from shell to awk using the -v option is too large, hence the 'argument list too long' error.
My idea is to break the large variable into small chunks, store them in an array, and then pass the array to awk instead of passing the single variable.
My question is:
Is it possible to break the large variable into a small array and then pass it to awk? I know how to modify a shell variable inside an awk script, but how can I modify a shell array inside an awk script?
I have read that the -v option is not advisable and that piping the variable values is suggested instead, as in:
echo variable | awk '{}'
That way the variables would be piped. But I have to pipe an array along with some other variables. Could you please help me?
CODE DESCRIPTION
addvariable=""
export variable
loop begins
eval $(awk -v tempvariable="$addvariable" '{tempvariable = tempvariable "long string"; print "variable=" tempvariable}')
(awk prints a "variable=..." assignment, which eval runs; this is where the shell variable is modified)
In shell:
addvariable=$variable (taking the new value of the shell variable and feeding it back to awk in the next iteration)
loop ends
So the problem now is that as addvariable and variable keep growing, I get the 'argument too long' error. What I have to do is split tempvariable into small chunks, store them in variable[1], variable[2], etc., assign those to addvariable[1], addvariable[2], and feed addvariable[1], addvariable[2], ... instead of feeding the entire addvariable as a whole. So my question is: how do I feed that in as an array, and how do I store the big data inside awk into variable[1], variable[2]?
CODE
addshellvariable=""
for i in {0..10}; do
zcat normalfile$i > FILEA
zcat hugefile$i > FILEB
export shellvariable=""
getdata=$(grep "XXX" FILEB | sort | uniq)   # getdata contains a list of ids
eval $(awk -v getdata="$getdata" -v addshellvariable="$addshellvariable" '
BEGIN {
tempvariable=""
n=split(addshellvariable, tempshellvariableArray, "*")
for (t=1; t<=n; t++) awkarray[tempshellvariableArray[t]]
}
{ for (id in ids) awkarray[id] }   # ids is built from getdata (details omitted in the question)
END {
for (id in awkarray) tempvariable=tempvariable "*" id "*" awkarray[id]
print "shellvariable=" tempvariable
}' FILEA)
addshellvariable=$shellvariable
done
As you can see, awk is embedded inside the shell. On every iteration I need the awkarray contents fed back into awk so that I get the updated values; that is why I print the awk array contents into the shell variable, store that in another shell variable, addshellvariable, and feed it to awk on the next iteration. But when the size of shellvariable passes a certain point, I get an 'Argument too long' error. So I want a solution where, instead of doing
print "shellvariable=" tempvariable, I can do print "shellvariable[1]=" <part of tempvariable>, and so on...

Your shell appears to have limited you. I suspect that your guess is correct, and this isn't an awk problem, it's the scripting language from which you're calling awk.
You can pre-load awk with variables loaded from a file. Check this out:
$ printf 'foo=2\nbar=3\nbaz=4\n' > vars
$ printf 'snarf\nblarg\nbaz\nsnurry\n' > text
$ awk 'NR==FNR{split($0,a,"=");vars[a[1]]=a[2];next} $1 in vars {print vars[$1]}' vars text
4
$
How does this work?
The first two printf lines give us our raw data. Run them without the redirect (or cat the resultant files) if they're not completely clear to you.
The awk script has two main sections. Awk scripts consist of repetitions of condition { commands }. In this case, we've got two of these sets.
The first set has a condition of NR==FNR. This evaluates as "true" if the total record number awk has processed (NR) equals the record number within the current file (FNR). Obviously, this only holds for the FIRST file, because as of the first line of the second file, NR is 1 plus the line count of the first file.
Within this section, we split() the line according to its equals sign, and put the data into an array called vars.
The second set has a condition of $1 in vars, which evaluates to true if the first word of the current line exists as a subscript of the vars array. I include this only as an example of what you can do with vars, since I don't know what you're trying to achieve with these variables.
Does this address your problem? If not, we'll need to see some of your code to get an idea of how to fix it.
UPDATE: per a suggestion in the comments, here's proof that it works for large variables.
First, we prepare our input data:
$ dd if=/dev/random of=out.rand count=128k bs=1k
131072+0 records in
131072+0 records out
134217728 bytes transferred in 3.265765 secs (41098404 bytes/sec)
$ b64encode -o out.b64 out.rand out.rand
$ ls -lh out.b64
-rw-r--r-- 1 ghoti wheel 172M Jul 17 01:08 out.b64
$ awk 'BEGIN{printf("foo=")} NR>1{printf("%s",$0)} END{print ""}' out.b64 > vars
$ ls -lh vars
-rw-r--r-- 1 ghoti wheel 170M Jul 17 01:10 vars
$ wc -l vars
1 vars
$ cut -c1-30 vars
foo=orq0UgQJyUAcwJV0SenJrSHu3j
Okay, we've got a ~170MB variable on a single line. Let's suck it into awk.
$ awk 'NR==FNR{split($0,a,"=");vars[a[1]]=a[2];next} END{print length(vars["foo"]);print "foo=" substr(vars["foo"],1,26);}' vars
178956971
foo=orq0UgQJyUAcwJV0SenJrSHu3j
We can see the size of the variable, and the first 26 characters match what we saw from shell. Yup, it works.
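Applied to your loop, the trick is to let the accumulating state live in a file rather than in a variable, so the argument list never grows. A rough, untested sketch, assuming the state is just a set of ids, one per line (adapt to your '*'-separated key/value format as needed):
statefile=$(mktemp)
for i in {0..10}; do
zcat hugefile$i | grep "XXX" | sort -u > ids
# both the previous state and the new ids simply populate the set
awk '{ awkarray[$0] } END { for (id in awkarray) print id }' "$statefile" ids > newstate
mv newstate "$statefile"   # state grows in the file, never on the command line
done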

Related

How to use awk to split a file and store each filename in a Bash array

Input
A file called input_file.csv, which has 7 columns, and n rows.
Example header and row:
Date Location Team1 Team2 Time Prize_$ Sport
2016 NY Raptors Gators 12pm $500 Soccer
Output
n files, where the rows in each new file are grouped based on their values in column 7 of the original file. Each file is named after that shared value from column 7. Note: each file will have the same header. (The script currently does this.)
Example: if 2 rows in the original file had golf as their value for column 7, they would be grouped together in a file called golf.csv. If 3 other rows shared soccer as their value for column 7, they would be found in soccer.csv.
An array that has the name of each generated file in it. This array lives outside of the scope of awk. (This is what I need help with.)
Example: Array = [golf.csv, soccer.csv]
Situation
The following script produces the desired output. However, I want to run another script on each of the newly generated files and I don't know how.
Question:
My idea is to store the names of each new file in an array. That way, I can loop through the array and do what I want to each file. The code below passes a variable called array into awk, but I don't know how to add the name of each file to the array.
#!/bin/bash
ARRAY=()
awk -v myarray="$ARRAY" -F"\",\"" 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$7]) {print header >> ("" $7 ".csv"); files[$7]=1}; print $0 >> ("" $7 ".csv"); close("" $7 ".csv");}' input_file.csv
for i in "${ARRAY[#]}"
do
:
echo $i
done
Rather than struggling to get awk to fill your shell array variable, why not:
make sure that the *.csv files are created in a clean directory
use globbing to loop over all *.csv files in that directory?
awk -F'","' ... # your original Awk command
for i in *.csv # use globbing to loop over resulting *.csv files
do
:
echo $i
done
Just off the top of my head, untested because you haven't supplied very much sample data, what about this?
#!/usr/bin/awk -f
FNR==1 {
header=$0
next
}
!($7 in files) {
files[$7]=sprintf("sport-%s.csv", $7)
print header > files[$7]
}
{
print > files[$7]
}
END {
printf("declare -a sportlist=(")
for (sport in files) {
printf(" \"%s\"", files[sport])
}
printf(" )\n")
}
The idea here is that we record each sport's output filename in the array files[], keyed by sport name. (You can format the filename inside sprintf() as you see fit.) We step through the file, writing the header line whenever we see a new sport with no recorded filename. Then for non-headers, we print to the file based on the sport name.
For your second issue, exporting the array back to something outside of awk, the END block here outputs a declare line which can be interpreted by bash. If you feel lucky, you can eval this awk script inside command substitution, and the declare command will effectively be interpreted by your shell:
eval $(/path/to/awkscript inputfile.csv)
Or, if you subscribe to the school of thought that considers eval to be evil, you can redirect the awk script's standard output to a temporary file which you source:
/path/to/awkscript inputfile.csv > /tmp/yadda.$$
. /tmp/yadda.$$
(Don't use this temp file, make a real one with mktemp or the like.)
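For instance, the temp file flow might look like this (a sketch; /path/to/awkscript is the script above):
tmp=$(mktemp) || exit 1
/path/to/awkscript inputfile.csv > "$tmp"
. "$tmp"    # source the declare line into the current shell
rm -f "$tmp"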
There's no way for any program to modify the environment of the parent shell. Just have the awk script output the names of the files as standard output, and use command substitution to put them in an array.
filesArray=($(awk ... ))
If the files might have spaces in them, you need a different solution; assuming you're on bash 4, you can just be sure to print each file on a separate line and use readarray:
readarray filesArray < <( awk ... )
If the files might have newlines in them, too, then things get tricky...
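For what it's worth, on bash 4.4 and later you can sidestep newlines in file names by NUL-delimiting the output; a sketch (the awk body is elided just as above, but it must emit each name with printf "%s\0" instead of print):
readarray -d '' filesArray < <( awk ... )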
If your file is not large, you can run another pass to get the unique $7 values, for example:
$ awk 'NR>1&&!a[$7]++{print $7}' sports
will print the values; you can change it to your file name format as well, such as
$ awk 'NR>1&&!a[$7]++{print tolower($7)".csv"}' sports
This can then be piped to your other process, here for example to wc:
$ awk ... sports | xargs wc
This will do what I THINK you want:
oIFS="$IFS"; IFS=$'\n'
array=( $(awk '{out=$7".csv"; print > out} !seen[out]++{print out}' input_file.csv) )
IFS="$oIFS"
If your input file really is comma-separated instead of space-separated as you show in the sample input in your question then adjust the awk script to suit (You might want to look at GNU awk and FPAT).
If you don't have GNU awk then you'll need to add a bit more code to close the open output files as you go.
The above will fail if you have file names that contain newlines but will be fine for blank chars or other white space.
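Once populated, you can loop over the array to run your follow-up processing on each file (other_script below is a stand-in for whatever you want to run):
for f in "${array[@]}"; do
./other_script "$f"
done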

awk output is acting weird

cat TEXT | awk -v var=$i -v varB=$j '$1~var , $1~varB {print $1}' > PROBLEM HERE
I am passing two variables from an array to parse a very large text file by range. And it works, kind of.
if I use ">" the output to the file will ONLY be the last three lines as verified by cat and a text editor.
if I use ">>" the output to the file will include one complete read of TEXT and then it will divide the second read into the ranges I want.
if I let the output go through to the shell I get the same problem as above.
Question:
It appears awk is reading every line and printing it. Then it goes back and selects the ranges from the TEXT file. It does not do this if I use constants in the range pattern search.
I understand awk must read all lines to find the ranges I request.
Why is it printing the entire document?
How can I get it to ONLY print the ranges selected?
This is the last hurdle in a big project and I am beating my head against the table.
Thanks!
Give this a try; you didn't assign varB the right way:
yours: awk -v var="$i" -varB="$j" ...
mine : awk -v var="$i" -v varB="$j" ...
^^
Aside from the typo, you can't use variables inside //; instead you have to match with the ~ operator. Also, quote your shell variables (not strictly needed here, but it sets a good example). For example:
seq 1 10 | awk -v b="3" -v e="5" '$0 ~ b, $0 ~ e'
should print 3..5 as expected
It sounds like this is what you want:
awk -v var="foo" -v varB="bar" '$1~var{f=1} f{print $1} $1~varB{f=0}' file
e.g.
$ cat file
1
2
foo
3
4
bar
5
foo
6
bar
7
$ awk -v var="foo" -v varB="bar" '$1~var{f=1} f{print $1} $1~varB{f=0}' file
foo
3
4
bar
foo
6
bar
but without sample input and expected output it's just a guess and this would not address the SHELL behavior you are seeing wrt use of > vs >>.
Here's what happened. I fed my variables from an array, and I set the loop counter to what I thought was the total length of the array. On the final iteration, a null value was passed to awk for the variable, which caused it to print EVERYTHING. Once the counter matched the actual number of array elements, the printing oddity ended.
As far as the > vs >> goes, I don't know. It did stop, but I wasn't as careful in documenting it. I think what happened is that I used $1 in the print command to save time, and with each line it printed at the end it erased the whole file and left the last three identical matches. Something to ponder. Thanks Ed for the honest work. And no thank you to Robo responses.
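For what it's worth, the null-variable effect is easy to reproduce: an empty string used with ~ matches every line, so the range opens (and closes) on every record and everything prints:
$ seq 1 5 | awk -v b="" -v e="" '$0 ~ b, $0 ~ e'
1
2
3
4
5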

How to use the awk command in a loop to produce several thinned data files

I have several large data files with 8 columns and 120,000 rows. Now I want to keep 1 line every 200 lines starting from the 100th line. I have the script file thin.sh as:
awk '(NR%200==100)' original_file > thinned_file
However, now I have 30 original files, which means I would have to revise the command 30 times, and the original files share similar names:
data.0000.dat, data.0001.dat data.0002.dat, ..., data.0029.dat
I suppose there must be some way to embed the awk command into a loop to accomplish my goal, maybe something like:
for(i=0;i<30;i++);
do
awk '(NR%200==100)' data.$i.dat > data.$i_thinned.dat
done
But I realize there are two digits of 00 in front of $i in the file names (they are zero-padded to four digits). Can I use sprintf("%s") or something? If so, how should awk and sprintf be arranged?
I use Ubuntu and bash.
With seq:
for i in $(seq -f %04g 0 29); do
awk 'NR % 200 == 100' "data.${i}.dat" > "data.${i}_thinned.dat"
done
Alternatively with bash:
for i in {0000..0029}; do
The quotes are not strictly necessary in the first snippet because we know $i does not contain anything nefarious, but it's better to be paranoid about expansion in shell scripts. The braces in "data.${i}_thinned.dat" are necessary so the shell doesn't look for a variable $i_thinned to use. They are not strictly necessary in "data.${i}.dat" because shell variable names cannot have . in them, but consistency is nice.
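If you'd rather keep the C-style loop from the question, bash's printf -v can do the zero-padding; a sketch of the same loop:
for ((n=0; n<30; n++)); do
printf -v i '%04d' "$n"    # zero-pad, e.g. 7 -> 0007
awk 'NR % 200 == 100' "data.${i}.dat" > "data.${i}_thinned.dat"
done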
All you need is:
awk 'FNR==1{close(out); out=FILENAME; sub(/\.dat/,"_thinned&",out)} (FNR%200==100){print > out}' data.[0-9][0-9][0-9][0-9].dat
I used data.[0-9][0-9][0-9][0-9].dat as the file name globbing pattern instead of data.*.dat in case you rerun the script in the same dir where you previously generated all of the "_thinned" files.
Ingredients (GAWK)
1 FNR - The record number in the current file
1 match - Matches a regex string and can capture groups into an array
1 print - Prints the following data (if none is provided, defaults to the current record)
1 *.dat - All files ending with .dat in the current directory
Instructions
In the condition block, check whether the record number in the current file, divided by 200, leaves a remainder of 100.
If it does, run the next block {..}
Take the current file name and match up to the last dot, capturing everything before it with (.*) into array a.
Print into a file named after the captured stem a[1] with the extension _thinned.dat
Finally add *.dat to the end to read all .dat files in the current directory
Resulting code
gawk '(FNR%200==100){match(FILENAME,/(.*)\./,a);print >(a[1]"_thinned.dat")}' *.dat

Printing lines which have a field number greater than, in AWK

I am writing a bash script which takes a parameter and stores it:
threshold = $1
I then have sample data that looks something like:
5 blargh
6 tree
2 dog
1 fox
9 fridge
I wish to print only the lines which have their number greater than the number which is entered as the parameter (threshold).
I am currently using:
awk '{print $1 > $threshold}' ./file
But nothing prints out; help would be appreciated.
You're close, but it needs to be more like this:
$ threshold=3
$ awk -v threshold="$threshold" '$1 > threshold' file
Creating a variable with -v avoids the ugliness of trying to expand shell variables within an awk script.
EDIT:
There are a few problems with the current code you've shown. The first is that your awk script is single quoted (good), which stops $threshold from expanding, and so the value is never inserted in your script. Second, your condition belongs outside the curly braces, which would make it:
$1 > threshold { print }
This works, but the print is not necessary (it's the default action), which is why I shortened it to
$1 > threshold
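For the sample data above (assuming it's saved as file and the threshold is 3), that gives:
$ awk -v threshold=3 '$1 > threshold' file
5 blargh
6 tree
9 fridge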

Is it possible to put 2 commands in one string in awk?

I have created a simple script:
#!/bin/sh
column=${1:-1}
awk '{colawk='$column'+1; print colawk}'
But when I run:
ls -la | ./Column.sh 4
I receive output:
5
5
5
5
But I expected to receive the 5th column. Why this output?
I believe this will do what you've attempted in your example:
#!/bin/sh
let "column=${1:-1} + 1"
awk "{print \$$column}"
However, I don't see why you're adding one to the column index? You'll then not be able to intuitively access the first column.
I'd to it this way instead:
#!/bin/sh
let "column=${1:-1}"
awk "{print \$$column}"
The argument to ./Column.sh will be the column number you want; 0 will give you all columns, while a call without arguments will default the column index to 1.
I know bash. I would like to do the arithmetic in awk.
In that case, how about:
#!/bin/sh
column=${1:-1}
awk 'BEGIN{colawk='$column'+1} {print $colawk}'
Or, simply:
#!/bin/sh
awk 'BEGIN{colawk='${1:-1}'+1} {print $colawk}'
Two things I changed in your script:
put the arithmetic in a BEGIN{} block since it only needs to be done once and not repeated for every input line.
"print $colawk" instead of "print colawk" so we're printing the column indexed by colawk instead of its value.
