AWK -- Select Random Record From A List - bash

I am trying to return a random word from /usr/share/dict/words on my *NIX machine. I have written the following script using BASH, AWK, and SED together to do it, but I feel like it should be writable using AWK alone by using the NR and NF variables somehow.
#!/bin/bash
get_secret_word () {
awk '/^[A-Za-z]+$/ {if (length($1) > 3 && length($1) < 9)
print $1}' /usr/share/dict/words > /tmp/word_list
word_list_length=$(wc -l /tmp/word_list | awk '{print $1}')
random_number=$(( $RANDOM%$word_list_length ))
secret_word=$(sed "${random_number}!d" /tmp/word_list)
return $secret_word
}
get_secret_word
echo $secret_word
Any suggestions? I love AWK, and I'm trying to understand it better.

Try something like:
awk '
BEGIN {
srand('"$RANDOM"')
}
{
if (/^[A-Za-z]+$/ && length() > 3 && length() < 9)
words[i++] = $1
}
END {
print words[int(rand() * i)]
}' /usr/share/dict/words
Whether you store the words in memory or in a file will depend on your use case.
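If memory is a concern, the word list does not need to be stored at all. The following is a minimal sketch, not the approach from the answer above: it swaps in single-slot reservoir sampling, where the n-th matching word replaces the current pick with probability 1/n, so every match ends up equally likely. The function name pick_random_word is made up for illustration, and seeding from $RANDOM (falling back to the shell's PID) is an assumption.

```shell
# pick_random_word FILE: print one random 4-8 letter word from FILE.
# (Hypothetical helper name, used only for this illustration.)
pick_random_word() {
    awk -v seed="${RANDOM:-$$}" '
    BEGIN { srand(seed) }
    # Same filter as above: letters only, length 4 to 8.
    /^[A-Za-z]+$/ && length($0) > 3 && length($0) < 9 {
        # Reservoir sampling with one slot: the n-th match replaces
        # the current pick with probability 1/n.
        if (rand() * ++n < 1) pick = $0
    }
    END { if (n) print pick }' "$1"
}
```

It would be called as pick_random_word /usr/share/dict/words. Nothing is printed when no word passes the filter.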

Related

Use an array created using awk as a variable in another awk script

I am trying to use awk to extract data using a conditional statement containing an array created using another awk script.
The awk script I use for creating the array is as follows:
array=($(awk 'NR>1 { print $1 }' < file.tsv))
Then, to use this array in the other awk script
awk var="${array[@]}" 'FNR==1{ for(i=1;i<=NF;i++){ heading[i]=$i } next } { for(i=2;i<=NF;i++){ if($i=="1" && heading[i] in var){ close(outFile); outFile=heading[i]".txt"; print ">kmer"NR-1"\n"$1 >> (outFile) }}}' < input.txt
However, when I run this, the following error occurs.
awk: fatal: cannot open file 'foo' for reading (No such file or directory)
I've already looked at multiple posts on why this error occurs and on how to correctly implement a shell variable in awk, but none of these have worked so far. However, when removing the shell variable and running the script it does work.
awk 'FNR==1{ for(i=1;i<=NF;i++){ heading[i]=$i } next } { for(i=2;i<=NF;i++){ if($i=="1"){ close(outFile); outFile=heading[i]".txt"; print ">kmer"NR-1"\n"$1 >> (outFile) }}}' < input.txt
I really need that conditional statement but don't know what I am doing wrong with implementing the bash variable in awk and would appreciate some help.
Thx in advance.
That specific error message occurs because you forgot -v in front of var= (it should be awk -v var=, not just awk var=), but as others have pointed out, you can't set an array variable on the awk command line. Also note that array in your code is a shell array, not an awk array, and shell and awk are 2 completely different tools, each with its own syntax, semantics, scopes, etc.
Here's how to really do what you're trying to do:
array=( "$(awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv)" )
awk -v xyz="${array[*]}" '
BEGIN{ split(xyz,tmp,RS); for (i in tmp) var[tmp[i]] }
... now use `var` as you were trying to ...
'
For example:
$ cat file.tsv
col1 col2
a b c d e
f g h i j
$ cat -T file.tsv
col1^Icol2
a b^Ic d e
f g h^Ii j
$ awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv
a b
f g h
$ array=( "$(awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv)" )
$ awk -v xyz="${array[*]}" '
BEGIN {
split(xyz,tmp,RS)
for (i in tmp) {
var[tmp[i]]
}
for (idx in var) {
print "<" idx ">"
}
}
'
<f g h>
<a b>
It's easier and more efficient to process both files in a single awk:
edit: fixed issues in comment, thanks @EdMorton
awk '
FNR == NR {
if ( FNR > 1 )
var[$1]
next
}
FNR == 1 {
for (i = 1; i <= NF; i++)
heading[i] = $i
next
}
{
for (i = 2; i <= NF; i++)
if ( $i == "1" && heading[i] in var) {
outFile = heading[i] ".txt"
print ">kmer" (FNR-1) "\n" $1 >> (outFile)   # FNR, not NR: NR keeps counting across file.tsv
close(outFile)
}
}
' file.tsv input.txt
You could store the values as a string in a shell variable, then use the split function to turn that string into an awk array. Consider the following simple example. Let the content of file1.txt be
A B C
D E F
G H I
and file2.txt content be
1
3
2
then
var1=$(awk '{print $1}' file1.txt)
awk -v var1="$var1" 'BEGIN{split(var1,arr)}{print "First column value in line number",$1,"is",arr[$1]}' file2.txt
gives output
First column value in line number 1 is A
First column value in line number 3 is G
First column value in line number 2 is D
Explanation: I store the output of the 1st awk command in a shell variable, which is then used as the 1st argument to the split function in the 2nd awk command. Disclaimer: this solution assumes all files involved use a delimiter compliant with default GNU AWK behavior, i.e. one or more whitespace characters always act as the delimiter.
(tested in gawk 4.2.1)
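If the values themselves can contain spaces, the whitespace assumption in the disclaimer breaks down. One possible variation, sketched here under my own assumptions (tab-separated input, a made-up function name print_first_fields), is to split on newlines instead and to hand the list to awk through the environment, since ENVIRON values are taken literally while -v values go through escape-sequence processing:

```shell
# print_first_fields FILE: list the first tab-separated field of each line,
# keeping embedded spaces intact. (Hypothetical helper name.)
print_first_fields() {
    list=$(awk -F '\t' '{ print $1 }' "$1")
    # Pass the list through the environment; ENVIRON values are taken literally.
    list="$list" awk 'BEGIN {
        n = split(ENVIRON["list"], arr, "\n")   # one element per line
        for (i = 1; i <= n; i++) print i ": <" arr[i] ">"
    }'
}
```

The explicit "\n" separator is what keeps a value such as "a b" in a single array element.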

awk FS vs FPAT puzzle and counting words but not blank fields

Suppose I have the file:
$ cat file
This, that;
this-that or this.
(Punctuation at the line end is not always there...)
Now I want to count words (with words being defined as one or more ascii case-insensitive letters.) In typical POSIX *nix you could do:
sed -nE 's/[^[:alpha:]]+/ /g; s/ $//p' file | tr ' ' "\n" | tr '[:upper:]' '[:lower:]' | sort | uniq -c
1 or
2 that
3 this
With grep you can shorten that a bit to only match what you define as a word:
grep -oE '[[:alpha:]]+' file | tr '[:upper:]' '[:lower:]' | sort | uniq -c
# same output
With GNU awk, you can use FPAT to replicate matching only what you want (ignore sorting...):
gawk -v FPAT="[[:alpha:]]+" '
{for (i=1;i<=NF;i++) {seen[tolower($i)]++}}
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
3 this
1 or
2 that
Now trying to replicate in POSIX awk I tried:
awk 'BEGIN{FS="[^[:alpha:]]+"}
{ for (i=1;i<=NF;i++) seen[tolower($i)]++ }
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
2
3 this
1 or
2 that
Note the 2 with blank at top. This is from having blank fields from ; at the end of line 1 and . at the end of line 2. If you delete the punctuation at line's end, this issue goes away.
You can partially fix it (for all but the last line) by setting RS="" in the awk, but still get a blank field with the last (only) line.
I can also fix it this way:
awk 'BEGIN{FS="[^[:alpha:]]+"}
{ for (i=1;i<=NF;i++) if ($i) seen[tolower($i)]++ }
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
Which seems a little less than straightforward.
Is there an idiomatic fix I am missing to make POSIX awk act similarly to GNU awk's FPAT solution here?
This should work in POSIX/BSD or any version of awk:
awk -F '[^[:alpha:]]+' '
{for (i=1; i<=NF; ++i) ($i != "") && ++count[tolower($i)]}
END {for (e in count) printf "%4s %s\n", count[e], e}' file
1 or
3 this
2 that
By using -F '[^[:alpha:]]+' we are splitting fields on any non-alpha character.
The ($i != "") condition makes sure that only non-empty fields are counted in count.
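To see where the empty fields come from, here is a small sketch (the helper name show_fields is made up, and [^A-Za-z] stands in for [^[:alpha:]] so it also runs on awks without POSIX character classes). With a regex field separator, a separator at the very start or end of the line still delimits a field, and that field is empty:

```shell
# show_fields LINE: print NF and every field in brackets when LINE is split
# on runs of non-letters. (Hypothetical helper name.)
show_fields() {
    printf '%s\n' "$1" | awk -F '[^A-Za-z]+' '{
        printf "NF=%d:", NF
        for (i = 1; i <= NF; i++) printf " [%s]", $i
        print ""
    }'
}
```

For example, show_fields 'This, that;' prints NF=3: [This] [that] [], and the trailing [] is the blank entry that the ($i != "") test skips.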
With POSIX awk, I'd use match and the builtin RSTART and RLENGTH variables:
#!awk
{
s = $0
while (match(s, /[[:alpha:]]+/)) {
word = substr(s, RSTART, RLENGTH)
count[tolower(word)]++
s = substr(s, RSTART+RLENGTH)
}
}
END {
for (word in count) print count[word], word
}
$ awk -f countwords.awk file
1 or
3 this
2 that
Works with the default BSD awk on my Mac.
With your shown samples, please try following awk code. Written and tested in GNU awk in case you are ok to do this with RS approach.
awk -v RS='[[:alpha:]]+' '
RT{
val[tolower(RT)]++
}
END{
for(word in val){
print val[word], word
}
}
' Input_file
Explanation: The RS variable is set to the regex [[:alpha:]]+, so every run of letters acts as a record separator, and GNU awk stores the text that matched the separator in RT. In the main block, whenever RT is non-empty, val[tolower(RT)] is incremented, which counts each word case-insensitively. The END block loops over val and prints each count followed by its word.
Using RS instead:
$ gawk -v RS="[^[:alpha:]]+" ' # [^a-zA-Z] or something for some awks
$0 { # remove possible leading null string
a[tolower($0)]++
}
END {
for(i in a)
print i,a[i]
}' file
Output:
this 3
or 1
that 2
Tested successfully on gawk and Mac awk (version 20200816) and on mawk and busybox awk using [^a-zA-Z]
With GNU awk using patsplit() and a second array for counting, you can try this:
awk 'patsplit($0, a, /[[:alpha:]]+/) {for (i in a) b[ tolower(a[i]) ]++} END {for (j in b) print b[j], j}' file
3 this
1 or
2 that

put awk command into a separate file

I wrote some code and would now like to put it into a separate file because it is a bit long to type, but I'm having trouble.
This is my code:
awk 'NF && $1!~/^#/ && $1!~/^@/' rmsd.xvg | awk '{for(i=1;i<=NF;i++) {sum[i] += $i; sumsq[i] += ($i)^2}}
END {for (i=2;i<=NF;i++) {
print "\n", sum[i]/NR, sqrt((sumsq[i]-sum[i]^2/NR)/NR)}}' | sort -u
How can this be done?
Create a file named script.awk, and put:
{ for(i=1;i<=NF;i++) {
sum[i] += $i; sumsq[i] += ($i)^2}
}
END {for (i=2;i<=NF;i++) {
print "\n", sum[i]/NR, sqrt((sumsq[i]-sum[i]^2/NR)/NR)
}
}
into it. Then use:
awk 'NF && $1!~/^#/ && $1!~/^@/' rmsd.xvg | awk -f script.awk | sort -u
But there's no need for two separate awk commands. Change the script to:
NF && $1 !~ /^[#@]/ { for(i=1;i<=NF;i++) {
sum[i] += $i; sumsq[i] += ($i)^2}
n++   # count only the records that pass the filter
}
END {for (i=2;i<=NF;i++) {
print "\n", sum[i]/n, sqrt((sumsq[i]-sum[i]^2/n)/n)
}
}
Then:
awk -f script.awk rmsd.xvg | sort -u
You can create a shell script as
#!/bin/bash
awk 'NF && $1!~/^#/ && $1!~/^@/' rmsd.xvg | awk '{for(i=1;i<=NF;i++) {sum[i] += $i; sumsq[i] += ($i)^2}}
END {for (i=2;i<=NF;i++) {
print "\n", sum[i]/NR, sqrt((sumsq[i]-sum[i]^2/NR)/NR)}}' | sort -u
Execute the script as
$ bash fileName
Note that you have two awk commands. However, the first is just a filter and can trivially be combined with the second. The only issue is that instead of using NR in the END action, you'll need to keep count of how many records were acted upon by the first action. The two scripts combined, along with the adjustment for NR, would look like
NF && $1 !~ /^#/ && $1 !~ /^@/ {
for(i=1;i<=NF;i++) {
sum[i] += $i
sumsq[i] += ($i)^2
}
record_count++
}
END {
for (i=2;i<=NF;i++) {
print "\n", sum[i]/record_count, sqrt((sumsq[i]-sum[i]^2/record_count)/record_count)
}
}
I'm assuming that every line has the same number of fields; otherwise, the value of NF in the END action is just the value of NF on the last line, which may or may not have any meaning.
Once the above is saved in something like script.awk, run it with
awk -f script.awk rmsd.xvg | sort -u
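The NF caveat can be sidestepped by remembering the widest record seen. The following is a sketch rather than a drop-in replacement: the function name column_stats and the variable maxnf are made up, the leading "\n" in the output is dropped, and it assumes that the lines to skip start with # or @, as comment and directive lines do in GROMACS .xvg files. Like the script above, it divides every column by the same record count, so it still assumes no column has missing values.

```shell
# column_stats FILE: mean and population standard deviation of columns 2..N,
# skipping blank lines and lines whose first field starts with # or @.
# (Hypothetical helper name.)
column_stats() {
    awk '
    NF && $1 !~ /^[#@]/ {
        for (i = 2; i <= NF; i++) { sum[i] += $i; sumsq[i] += $i * $i }
        if (NF > maxnf) maxnf = NF   # do not rely on NF inside END
        n++                          # records that passed the filter
    }
    END {
        for (i = 2; i <= maxnf; i++)
            print sum[i] / n, sqrt((sumsq[i] - sum[i] ^ 2 / n) / n)
    }' "$1"
}
```

As a sanity check, a second column holding 2, 4, 4, 4, 5, 5, 7, 9 has mean 5 and population standard deviation 2, so that input yields the line 5 2.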

Shell script: copying columns by header in a csv file to another csv file

I have a csv file which I'll be using as input with a format looking like this:
xValue,value1-avg,value1-median,value2-avg,value3-avg,value3-median
1,3,4,20,14,20
The key attributes of the input file are that each "value" will have a variable number of statistics, but the statistic type and "value" will always be separated by a "-". I then want to output the statistics of all the "values" to separate csv files.
The output would then look something like this:
value1.csv
xvalue,value1-avg,value1-median
1,3,4
value2.csv
xvalue,value2-avg
1,20
I've tried finding solutions to this, but all I can find are ways to copy by the column number, not the header name. I need to be able to use the header names to append the associated statistics to each of the output csv files.
Any help is greatly appreciated!
P.S. the output file may have already been written to during previous runs of this script, meaning the code should append to the output file
Untested but should be close:
awk -F, '
NR==1 {
for (i=2;i<=NF;i++) {
outfile = $i
sub(/-.*/,".csv",outfile)
outfiles[i] = outfile
}
}
{
delete(outstr)
for (i=2;i<=NF;i++) {
outfile = outfiles[i]
outstr[outfile] = outstr[outfile] FS $i
}
for (outfile in outstr)
print $1 outstr[outfile] >> outfile
}
' inFile.csv
Note that deleting a whole array with delete(outstr) is gawk-specific. With other awks you can use split("",outstr) to get the same effect.
Note that this appends the output you wanted to existing files BUT that means you'll get the header line repeated on every execution. If that's an issue, tell us how to know when to generate the header line or not but the solution I THINK you'll want would look something like this:
awk -F, '
NR==1 {
for (i=2;i<=NF;i++) {
outfile = $i
sub(/-.*/,".csv",outfile)
outfiles[i] = outfile
}
for (i in outfiles) {
outfile = outfiles[i]   # outfiles is indexed by column number; its values are the file names
exists[outfile] = ( ((getline tmp < outfile) > 0) && (tmp != "") )
close(outfile)
}
}
{
delete(outstr)
for (i=2;i<=NF;i++) {
outfile = outfiles[i]
outstr[outfile] = outstr[outfile] FS $i
}
for (outfile in outstr)
if ( (NR > 1) || !exists[outfile] )
print $1 outstr[outfile] >> outfile
}
' inFile.csv
Just figure out the name associated with each column and use that mapping to manipulate the columns. If you're trying to do this in awk, you can use associative arrays to store the column names and the rows those correspond to. If you're using ksh93 or bash, you can use associative arrays to store the column names and the rows those correspond to. If you're using perl or python or ruby or ... you can...
Or push the column names into an array that maps each name to its column number.
Either way, then you have a list of column headers, which can further be manipulated however you need to.
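As a rough illustration of that mapping in awk, assuming a comma-separated file whose first line is the header (the function name select_columns and the variable names are made up for this sketch):

```shell
# select_columns NAMES FILE: print only the columns of a comma-separated FILE
# whose header appears in the comma-separated list NAMES, in the order given.
# (Hypothetical helper name.)
select_columns() {
    awk -F, -v OFS=, -v names="$1" '
    BEGIN { n = split(names, want, ",") }
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i }   # header name -> column number
    {
        out = ""; sep = ""
        for (k = 1; k <= n; k++)
            if (want[k] in col) { out = out sep $(col[want[k]]); sep = OFS }
        print out
    }' "$2"
}
```

With the sample input from the question, select_columns 'xValue,value2-avg' inFile.csv would print the header xValue,value2-avg followed by 1,20. Names that are not in the header are silently ignored.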
The solution I have found most useful to this kind of problem is to first retrieve the column number using an AWK script (encapsulated in a shell function) and then follow with a cut statement. This technique/strategy turns into a very concise, general and fast solution that can take advantage of co-processing. The non-append case is cleaner, but here is an example that handles the complication of the append you mentioned:
#! /bin/sh
fields() {
LC_ALL=C awk -F, -v pattern="$1" '{
j=0; split("", f)
for (i=1; i<=NF; i++) if ($(i) ~ pattern) f[j++] = i
if (j) {
printf("%s", f[0])
for (i=1; i<j; i++) printf(",%s", f[i])
}
exit 0
}' "$2"
}
cut_fields_with_append() {
if [ -s "$3" ]
then
cut -d, -f `fields "$1" "$2"` "$2" | sed '1 d' >> "$3"
else
cut -d, -f `fields "$1" "$2"` "$2" > "$3"
fi
}
cut_fields_with_append '^[^-]+$|1-' values.csv value1.csv &
cut_fields_with_append '^[^-]+$|2-' values.csv value2.csv &
cut_fields_with_append '^[^-]+$|3-' values.csv value3.csv &
wait
The result is as you would expect:
$ ls
values values.csv
$ cat values.csv
xValue,value1-avg,value1-median,value2-avg,value3-avg,value3-median
1,3,4,20,14,20
$ ./values
$ ls
value1.csv value2.csv value3.csv values values.csv
$ cat value1.csv
xValue,value1-avg,value1-median
1,3,4
$ cat value2.csv
xValue,value2-avg
1,20
$ cat value3.csv
xValue,value3-avg,value3-median
1,14,20
$ ./values
$ cat value1.csv
xValue,value1-avg,value1-median
1,3,4
1,3,4
$ cat value2.csv
xValue,value2-avg
1,20
1,20
$ cat value3.csv
xValue,value3-avg,value3-median
1,14,20
1,14,20
$

Filter a file using shell script tools

I have a file whose contents are
E006:Jane:HR:9800:Asst
E005:Bob:HR:5600:Exe
E002:Barney:Purc:2300:PSE
E009:Miffy:Purc:3600:Mngr
E001:Franny:Accts:7670:Mngr
E003:Ostwald:Mrktg:4800:Trainee
E004:Pearl:Accts:1800:SSE
E009:Lala:Mrktg:6566:SE
E018:Popoye:Sales:6400:QAE
E007:Olan:Sales:5800:Asst
I want to list all employees whose emp codes are between E001 and E018, using a command that includes pipes. Is that possible?
Use sed:
sed -n -e '/^E001:/,/^E018:/p' data.txt
That is, print the lines that are literally between those lines that start with E001 and E018.
If you want to get the employees that are numerically between those, one way to do that would be to do comparisons inline using something like awk (as suggested by hochl). Or, you could take this approach preceded by a sort (if the lines are not already sorted).
sort data.txt | sed -n -e '/^E001:/,/^E018:/p'
You can use awk for such cases:
$ gawk 'BEGIN { FS=":" } /^E([0-9]+)/ { n=substr($1, 2)+0; if (n >= 6 && n <= 18) { print } }' < data.txt
E006:Jane:HR:9800:Asst
E009:Miffy:Purc:3600:Mngr
E009:Lala:Mrktg:6566:SE
E018:Popoye:Sales:6400:QAE
E007:Olan:Sales:5800:Asst
Is that the result you want? This example intentionally only prints employees between 6 and 18 to show that it filters out records. You may print some fields only using $1 or $2 as in print $1 " " $2.
You can try something like this: cut -b2- data.txt | awk -F: '$1 >= 1 && $1 <= 18 { print "E" $0 }'
Just do a string comparison. Since all your sample data matches, I changed the boundaries for illustration:
awk -F: '"E004" <= $1 && $1 <= "E009" {print}'
output
E006:Jane:HR:9800:Asst
E005:Bob:HR:5600:Exe
E009:Miffy:Purc:3600:Mngr
E004:Pearl:Accts:1800:SSE
E009:Lala:Mrktg:6566:SE
E007:Olan:Sales:5800:Asst
You can pass the strings as variables if you don't want to hard-code them in the awk script
awk -F: -v start=E004 -v stop=E009 'start <= $1 && $1 <= stop {print}'
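One caveat: the string comparison works here because every code is zero-padded to the same width. With unpadded codes, "E10" would sort before "E9". If that could happen in your data, a numeric variant along these lines might be safer (a sketch; the function name codes_between is made up, and it assumes every code is the letter E followed by digits):

```shell
# codes_between LO HI FILE: print the records of a colon-separated FILE whose
# code number (the digits after the leading E) lies in LO..HI inclusive.
# (Hypothetical helper name.)
codes_between() {
    awk -F: -v lo="$1" -v hi="$2" '
    $1 ~ /^E[0-9]+$/ {
        n = substr($1, 2) + 0        # "E018" -> 18
        if (n >= lo && n <= hi) print
    }' "$3"
}
```

For example, codes_between 4 9 data.txt selects the same six records as the string comparison above.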

Resources