Code to choose 5,000 numbers in 10,000 randomly - bash

I would like some help writing awk code that randomly chooses 5,000 records out of 10,000.

Sort has a randomizer.
Assuming an input filename of 10k,
sort -R 10k | head -5000 > 5k # write selections to "5k"
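If GNU coreutils' shuf is available, it can do the sampling in one step; a minimal alternative, not part of the answer above:
shuf -n 5000 10k > 5k # pick 5,000 random lines from "10k"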

The following method works for single as well as multi-line records or records with specific record-separators.
Define a script subset.awk:
# Uniform(m) :: returns a random integer such that
# 1 <= Uniform(m) <= m
function Uniform(m) { return 1+int(m * rand()) }
# KnuthShuffle(m) :: creates a random permutation of the range [1,m]
function KnuthShuffle(m,    i,j,k) {
    for (i = 1; i <= m; i++) { permutation[i] = i }
    for (i = m; i >= 2; i--) {
        # classic Fisher-Yates step: swap element i with a uniformly chosen element in [1,i]
        j = Uniform(i)
        k = permutation[i]
        permutation[i] = permutation[j]
        permutation[j] = k
    }
}
BEGIN{ srand() }
{a[NR]=$0}
END{ KnuthShuffle(NR); for(r = 1; r <= count; r++) print a[permutation[r]] }
Then you can run it as:
$ awk -v count=5000 -f subset.awk inputfile > outputfile
Or if you have a file where the record separator is given by a character like #, then you can do:
$ awk -v count=5000 -v RS='#' -v ORS='#' -f subset.awk inputfile > outputfile
If you want to select random paragraphs, you can do:
$ awk -v count=5000 -v RS='' -v ORS='\n\n' -f subset.awk inputfile > outputfile

Related

Check number against all ranges in shell script

I have to check specific numbers from file1 against all ranges from file2.
cat file1
200000000000032805177900000000000000000000922820669 xx
200000000000022805177700000000000000000000922820669 xx
200000000000022805181300000000000000000000922820669 xx
cat file2
228051777, 228051779
228051811, 228051814
228051817, 228051817
output
200000000000022805177700000000000000000000922820669 xx
200000000000022805181300000000000000000000922820669 xx
This is my code so far. It produces the right output, but it is too slow when reading thousands of records.
#segregate records
awk -F', +' '
  # 1st pass (file2): read the lower and upper range bounds
  FNR==NR { lbs[++count] = $1+0; ubs[count] = $2+0; next }
  # 2nd pass (file1): check each line against all ranges
  {
    for (i=1; i<=count; ++i) {
      anum = substr($1,3,22); sub(/^0+/, "", anum)
      if (anum+0 >= lbs[i] && anum+0 <= ubs[i]) { print; next }
    }
  }
' file2 file1
You can use this awk:
awk -F', *' 'NR == FNR {
  lb[++i] = $1;
  ub[i] = $2;
  next
}
{
  anum = substr($1,3,20)+0;
  for (k=1; k<=i; k++)
    if (anum >= lb[k] && anum <= ub[k])
      print
}' file2 file1
200000000000022805177700000000000000000000922820669 xx
200000000000022805181300000000000000000000922820669 xx
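If the ranges in file2 never overlap, the per-line linear scan can be replaced by a binary search over the lower bounds, which helps when there are thousands of ranges. A sketch under that assumption (not part of the original answer; it expects file2 to already be sorted by lower bound):
awk -F', *' '
NR == FNR { lb[++n] = $1+0; ub[n] = $2+0; next }   # 1st pass: load sorted, non-overlapping ranges
{
  anum = substr($1,3,20)+0
  lo = 1; hi = n
  while (lo <= hi) {                               # locate the last range whose lower bound is <= anum
    mid = int((lo+hi)/2)
    if (lb[mid] <= anum) lo = mid+1
    else hi = mid-1
  }
  if (hi >= 1 && anum <= ub[hi]) print             # hi now indexes that range (0 if there is none)
}' file2 file1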

Unix/Bash: Uniq on a cell

I have a tab-separated fileA where the 12th column (counting from 1) contains several comma-separated identifiers. Within the same row, however, some of them can occur more than once:
GO:0042302, GO:0042302, GO:0042302
GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281
....
....
(some with a whitespace after the comma, some without).
I would like to keep only the unique identifiers, removing the duplicates within each row's 12th column:
GO:0042302
GO:0004386,GO:0005524,GO:0006281
....
....
Here is what I have so far:
for row in `fileA`
do
cut -f12 $row | sed "s/,/\n/" | sort | uniq | paste fileA - | \
awk 'BEGIN {OFS=FS="\t"}{print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $13}'
done > out
The idea was to go over one row at a time, cut out the 12th column, replace all commas with newlines, then sort and uniq to get rid of duplicates, paste it back, and print the columns in the right order, skipping the original identifier column.
However, this does not seem to work. Any ideas?
Just for completeness, and because I personally prefer Perl over Awk for this sort of thing, here's a Perl one-liner solution:
perl -F'\t' -lane '%u=(); @k=split/, */,$F[11]; @u{@k}=@k; $F[11]=join",",sort keys %u; print join"\t",@F'
Explanation:
-F'\t' split each input line into fields at tabs (the fields land in @F)
-a -n (part of -lane) loop over the input lines, autosplitting each one
-l automatically remove newlines from input and append one on output
-e get code to execute from the next argument instead of standard input
%u = ();                          # clear out the hash variable %u
@k = split /, */, $F[11];         # split 12th field (1st is 0) on comma plus optional space into array @k
@u{@k} = @k;                      # copy the contents of @k into %u as key/value pairs
Because hash keys are unique, that last step means that the keys of %u are now a deduplicated copy of @k.
$F[11] = join ",", sort keys %u;  # replace the 12th field with the sorted unique list
print join "\t", @F;              # and print out the modified line
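Run against the file, the invocation would be along these lines (the output file name is only illustrative):
perl -F'\t' -lane '%u=(); @k=split/, */,$F[11]; @u{@k}=@k; $F[11]=join",",sort keys %u; print join"\t",@F' fileA > fileA.uniq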
If I understand you correctly, then with awk:
awk -F '\t' 'BEGIN { OFS = FS } { delete b; n = split($12, a, /, */); $12 = ""; for(i = 1; i <= n; ++i) { if(!(a[i] in b)) { b[a[i]]; $12 = $12 a[i] "," } } sub(/,$/, "", $12); print }' filename
This works as follows:
BEGIN { OFS = FS }             # output FS same as input FS
{
    delete b                   # clear dirty table from last pass
    n = split($12, a, /, */)   # split 12th field into tokens,
    $12 = ""                   # then clear it out for reassembly
    for (i = 1; i <= n; ++i) { # wade through those tokens
        if (!(a[i] in b)) {    # those that haven't been seen yet:
            b[a[i]]            # remember that they were seen
            $12 = $12 a[i] "," # append to result
        }
    }
    sub(/,$/, "", $12)         # remove trailing comma from resulting field
    print                      # print the transformed line
}
The delete b; statement (deleting a whole array) has been POSIX-conforming for only a short while, so if you're working with an old, old awk and it fails for you, see @MarkReed's comment for another way that ancient awks should accept (typically split("", b), which also empties the array).
Using field 2 instead of field 12:
$ cat tst.awk
BEGIN{ FS=OFS="\t" }
{
    split($2,f,/ *, */)
    $2 = ""
    delete seen
    for (i=1; i in f; i++) {
        if ( !seen[f[i]]++ ) {
            $2 = $2 (i>1?",":"") f[i]
        }
    }
    print
}
$ cat file
a,a,a GO:0042302, GO:0042302, GO:0042302 b,b,b
c,c,c GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281 d,d,d
$ awk -f tst.awk file
a,a,a GO:0042302 b,b,b
c,c,c GO:0004386,GO:0005524,GO:0006281 d,d,d
If your awk doesn't support delete seen you can use split("",seen).
Using this awk:
awk -F '\t' -v OFS='\t' '{
  delete seen;
  s = "";                            # reset the rebuilt field for every line
  n = split($12, a, /[,; ]+/);
  for (i=1; i<=n; i++) {
    if (!(a[i] in seen)) {
      seen[a[i]];
      s = s (s == "" ? "" : ",") a[i]
    }
  }
  $12 = s } 1' file
GO:0042302
GO:0004386,GO:0005524,GO:0006281
In your example data, a comma followed by a space separates the repeated groups within the 12th field. Every group after the first is merely a repeat of it, and the subfields already appear to be in sorted order.
GO:0042302, GO:0042302, GO:0042302
            ^^^dup1^^^  ^^^dup2^^^
GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281
                                  ^^^^^^^^^^^^^^dup1^^^^^^^^^^^^^^
Based on that, you could simply keep the first of the subfields and toss the rest:
awk -F"\t" '{sub(/, .*/, "", $12)} 1' fileA
If instead you can have different sets of repeated subfields, where the keys are not sorted, like this:
GO:0042302, GO:0042302, GO:0042302, GO:0062122,GO:0055000, GO:0055001, GO:0062122,GO:0055000
GO:0004386,GO:0005524,GO:0006281, GO:0005525, GO:0004386,GO:0005524,GO:0006281
If you are stuck with the default macOS awk, you could introduce sort/uniq functions in an executable awk script:
#!/usr/bin/awk -f
BEGIN { FS = OFS = "\t" }
{
    c = uniq(a, split($12, a, /, |,/))
    sort(a, c)
    s = a[1]
    for (i=2; i<=c; i++) { s = s "," a[i] }
    $12 = s
}
47 # a non-zero pattern is always true, so print the modified line
# take an indexed array as produced by split() and de-dup it
function uniq(arr, len,    i, k, uarr) {
    for (i=len; i>=1; i--) { uarr[arr[i]] }
    delete arr
    for (k in uarr) { arr[++i] = k }
    return( i )
}
# slightly modified from
# http://rosettacode.org/wiki/Sorting_algorithms/Bubble_sort#AWK
function sort(arr, len,    haschanged, tmp, i)
{
    haschanged = 1
    while ( haschanged == 1 ) {
        haschanged = 0
        for (i=1; i<=(len-1); i++) {
            if ( arr[i] > arr[i+1] ) {
                tmp = arr[i]
                arr[i] = arr[i + 1]
                arr[i + 1] = tmp
                haschanged = 1
            }
        }
    }
}
If you had GNU-awk, I think you could swap out the sort(a, c) call with asort(a), and drop the bubble-sort local function completely.
I get the following for the 12th field:
GO:0042302,GO:0055000,GO:0055001,GO:0062122
GO:0004386,GO:0005524,GO:0005525,GO:0006281
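For reference, a gawk-only sketch along those lines (not part of the original answer, again cleaning the 12th field) could look like:
gawk 'BEGIN { FS = OFS = "\t" }
{
    n = split($12, a, /, |,/)
    delete seen; delete b; m = 0
    for (i=1; i<=n; i++)
        if (!(a[i] in seen)) { seen[a[i]]; b[++m] = a[i] }
    asort(b)                     # gawk built-in: sorts the values of b into b[1..m]
    s = b[1]
    for (i=2; i<=m; i++) s = s "," b[i]
    $12 = s
    print
}' fileA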

Storing a multidimensional array using awk

My input file is
a|b|c|d
w|r|g|h
I want to store the values in an array like:
a[1,1] = a
a[1,2] = b
a[2,1] = w
Kindly suggest any way to achieve this in awk/bash. I have two input files and need to do field-level validation.
Like this:
awk -F'|' '{for(i=1;i<=NF;i++)a[NR,i]=$i}
END {print a[1,1],a[2,2]}' file
Output
a r
This parses the file into an awk array:
awk -F \| '{ for(i = 1; i <= NF; ++i) a[NR,i] = $i }' filename
You'll have to add code that uses the array for this to be of any use, of course. Since you didn't say what you wanted to do with the array once it is complete (after the pass over the file), this is all the answer I can give you.
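Purely for illustration (not part of the original answer), an END block that dumps the stored values back out might look like this, assuming every row has the same number of fields:
awk -F \| '{ for(i = 1; i <= NF; ++i) a[NR,i] = $i }
END {
    # NF still holds the field count of the last line read
    for (r = 1; r <= NR; ++r)
        for (c = 1; c <= NF; ++c)
            print "a[" r "," c "] = " a[r,c]
}' filename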
You're REALLY going to want to get/use gawk 4.* if you're using multi-dimensional arrays as that's the only awk that supports them. When you write:
a[1,2]
in any awk you are actually creating a pseudo-multi-dimensional array which is a 1-dimensional array indexed by the string formed by the concatenation of
1 SUBSEP 2
where SUBSEP is a control char that's unlikely to appear in your input.
In GNU awk 4.* you can do:
a[1][2]
(note the different syntax) and that populates an actual multi-dimensional array.
Try this to see the difference:
$ cat tst.awk
BEGIN {
    SUBSEP = ":"            # just to make it visible when printing
    oneD[1,2] = "a"
    oneD[1,3] = "b"
    twoD[1][2] = "c"
    twoD[1][3] = "d"
    for (idx in oneD) {
        print "oneD", idx, oneD[idx]
    }
    print ""
    for (idx1 in twoD) {
        print "twoD", idx1
        for (idx2 in twoD[idx1]) {   # you CANNOT do this with oneD
            print "twoD", idx1, idx2, twoD[idx1][idx2]
        }
    }
}
$ awk -f tst.awk
oneD 1:2 a
oneD 1:3 b
twoD 1
twoD 1 2 c
twoD 1 3 d
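If you are limited to the pseudo-multi-dimensional form, the usual idiom for recovering the individual indices (a small sketch, not part of the answer above) is to split the combined key on SUBSEP:
BEGIN {
    oneD[1,2] = "a"; oneD[1,3] = "b"
    for (idx in oneD) {
        split(idx, parts, SUBSEP)    # parts[1] and parts[2] are the original indices
        print "row " parts[1] ", col " parts[2] ": " oneD[idx]
    }
}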

How to grep number of unique occurrences

I understand that grep -c string can be used to count the occurrences of a given string. What I would like to do is count the number of unique occurrences when only part of the string is known or remains constant.
For example, if I had a file (in this case a log) with several lines containing a constant string and a repeating variable, like so:
string=value1
string=value1
string=value1
string=value2
string=value3
string=value2
Then I would like to be able to identify the number of each unique set, with output similar to the following (ideally with a single grep/awk command):
value1 = 3 occurrences
value2 = 2 occurrences
value3 = 1 occurrences
Does anyone have a solution using grep or awk that might work? Thanks in advance!
This worked perfectly... Thanks to everyone for your comments!
grep -oP "wwn=[^,]*" path/to/file | sort | uniq -c
In general, if you want to grep and also keep track of results, it is best to use awk since it performs such things in a clear manner with a very simple syntax.
So for your given file I would use:
$ awk -F= '/string=/ {count[$2]++} END {for (i in count) print i, count[i]}' file
value1 3
value2 2
value3 1
What is this doing?
-F=
set the field separator to =, so that we can address the left and right parts of each line.
/string=/ {count[$2]++}
when the pattern "string=" is found, process the line: an array count[] keeps track of how many times each value in the second field has appeared so far.
END {for (i in count) print i, count[i]}
at the end, loop through the results and print them.
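For comparison, the same counting can be done without awk on the sample data shown above (the accepted grep command earlier used a wwn= prefix from the asker's real log):
grep -o 'string=[^ ]*' path/to/file | cut -d= -f2 | sort | uniq -c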
Here's an awk script:
#!/usr/bin/awk -f
BEGIN {
file = ARGV[1]
while ((getline line < file) > 0) {
for (i = 2; i < ARGC; ++i) {
p = ARGV[i]
if (line ~ p) {
a[p] += !a[p, line]++
}
}
}
for (i = 2; i < ARGC; ++i) {
p = ARGV[i]
printf("%s = %d occurrences\n", p, a[p])
}
exit
}
Example:
awk -f script.awk somefile ab sh
Output:
ab = 7 occurrences
sh = 2 occurrences

change random line with shellscript

How can I easily (quick and dirty) change, say, 10 random lines of a file with a simple shell script?
I thought about abusing ed and generating random commands and line ranges, but I'd like to know if there is a better way.
awk 'BEGIN{ srand() }
{ lines[++c]=$0 }
END{
    while (d<10) {
        RANDOM = int(1 + rand() * c)
        if ( !(RANDOM in r) ) {
            r[RANDOM]
            print "do something with " lines[RANDOM]
            ++d
        }
    }
}' file
Or if you have the shuf command:
shuf -n 10 "$file" | while read -r line
do
    sed -i "s/$line/replacement/" "$file"
done
Playing off @Dennis' version, this will always output 10. Picking the random line numbers up front in a separate array could create duplicates and, consequently, fewer than 10 modifications.
file=~/testfile
c=$(wc -l < "$file")
awk -v c=$c '
    BEGIN {
        srand();
        count = 10;
    }
    {
        if (c*rand() < count) {
            --count;
            print "do something with " $0;
        } else
            print;
        --c;
    }
' "$file"
This seems to be quite a bit faster:
file=/your/input/file
c=$(wc -l < "$file")
awk -v c=$c 'BEGIN {
    srand();
    for (i=0; i<10; i++) lines[i] = int(1 + rand() * c);
    asort(lines);               # gawk-only built-in
    p = 1
}
{
    if (NR == lines[p]) {
        ++p
        print "do something with " $0
    }
    else print
}' "$file"
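Another line-number based variant of the shuf idea avoids matching on line content entirely; a sketch (GNU coreutils assumed, and "REPLACEMENT" is just a placeholder):
file=/your/input/file
total=$(( $(wc -l < "$file") ))
# pick 10 distinct line numbers and turn each one into a sed command like "12s/.*/REPLACEMENT/"
shuf -i 1-"$total" -n 10 | sed 's|.*|&s/.*/REPLACEMENT/|' > edits.sed
sed -f edits.sed "$file"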
