Sampling without replacement using awk - bash

I have a lot of text files that look like this:
>ALGKAHOLAGGATACCATAGATGGCACGCCCT
>BLGKAHOLAGGATACCATAGATGGCACGCCCT
>HLGKAHOLAGGATACCATAGATGGCACGCCCT
>DLGKAHOLAGGATACCATAGATGGCACGCCCT
>ELGKAHOLAGGATACCATAGATGGCACGCCCT
>FLGKAHOLAGGATACCATAGATGGCACGCCCT
>JGGKAHOLAGGATACCATAGATGGCACGCCCT
>POGKAHOLAGGATACCATAGATGGCACGCCCT
Is there a way to do a sampling without replacement using awk?
For example, I have these 8 lines, and I only want to sample 4 of them randomly into a new file, without replacement.
The output should look something like this:
>FLGKAHOLAGGATACCATAGATGGCACGCCCT
>POGKAHOLAGGATACCATAGATGGCACGCCCT
>ALGKAHOLAGGATACCATAGATGGCACGCCCT
>BLGKAHOLAGGATACCATAGATGGCACGCCCT
Thanks in advance

How about this for a random sampling of 10% of your lines?
awk 'rand()>0.9' yourfile1 yourfile2 anotherfile
I am not sure what you mean by "replacement"... there is no replacement occurring here, just random selection.
Basically, it looks at each line of each file exactly once and generates a random number on the interval 0 to 1. If the random number is greater than 0.9, the line is output. So it is effectively rolling a ten-sided die for each line and only printing the line if the die comes up as 10. There is no chance of a line being printed twice - unless it occurs twice in your files, of course.
For added randomness (!) you can add an srand() at the start, as suggested by @klashxx:
awk 'BEGIN{srand()} rand()>0.9' yourfile(s)

Yes, but I wouldn't. I would use shuf or sort -R (neither is POSIX) to randomize the file and then select the first n lines using head.
If you really want to use awk for this, you would need to use the rand function, as Mark Setchell points out.
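For instance, the whole pipeline is a one-liner (assuming GNU coreutils shuf; input.txt and the sample size 4 are placeholders taken from the question):

```shell
# shuf permutes the lines of the file; -n 4 keeps only the first
# 4 lines of that permutation, i.e. 4 lines sampled without replacement
shuf -n 4 input.txt > sampled.txt

# equivalent with sort -R plus head (typically slower on large files)
sort -R input.txt | head -n 4 > sampled.txt
```

One caveat: sort -R sorts by a random hash of the key, so duplicate lines end up adjacent to each other; shuf does not have that quirk.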

To obtain random samples from a text file, without replacement, means that once a line has been randomly selected (sampled) it cannot be selected again. Thus, if 10 lines of 100 are to be selected, the ten random line numbers need to be unique.
Here is a script to produce NUM random (without replacement) samples from a text FILE:
#!/usr/bin/env bash
# random-samples.sh NUM FILE
# extract NUM random (without replacement) lines from FILE
num=$(( 10#${1:?'Missing sample size'} ))
file="${2:?'Missing file to sample'}"
lines=$(wc -l < "$file")   # number of lines in the file

# get_sample MAX
#
# get a random number between 1 .. MAX
# (see the bash man page on RANDOM)
get_sample() {
    local max="$1"
    # RANDOM is 0..32767; the modulo keeps the result within 1..max
    # (the original (max * RANDOM) / 32767 + 1 could yield max+1)
    local rand=$(( (RANDOM % max) + 1 ))
    echo "$rand"
}
# select_line LINE FILE
#
# select line LINE from FILE
select_line() {
    head -n "$1" "$2" | tail -1
}
declare -A samples   # keep track of which line numbers have been used
for ((i=1; i<=num; i++)) ; do
    sample=
    while [[ -z "$sample" ]]; do
        sample=$(get_sample "$lines")             # get a new sample
        if [[ -n "${samples[$sample]}" ]]; then   # already used?
            sample=                               # yes, go again
        else
            samples[$sample]=1                    # new sample, track it
        fi
    done
    line=$(select_line "$sample" "$file")         # fetch the sampled line
    printf "%2d: %s\n" "$i" "$line"
done
exit
Here is the output of a few invocations:
./random-samples.sh 10 poetry-samples.txt
1: 11. Because I could not stop for death/He kindly stopped for me 2,360,000 Emily Dickinson
2: 25. Hope springs eternal in the human breast 1,080,000 Alexander Pope
3: 43. The moving finger writes; and, having writ,/Moves on 571,000 Edward Fitzgerald
4: 5. And miles to go before I sleep 5,350,000 Robert Frost
5: 6. Not with a bang but a whimper 5,280,000 T.S. Eliot
6: 40. In Xanadu did Kubla Khan 594,000 Coleridge
7: 41. The quality of mercy is not strained 589,000 Shakespeare
8: 7. Tread softly because you tread on my dreams 4,860,000 W.B. Yeats
9: 42. They also serve who only stand and wait 584,000 Milton
10: 48. If you can keep your head when all about you 447,000 Kipling
./random-samples.sh 10 poetry-samples.txt
1: 38. Shall I compare thee to a summers day 638,000 Shakespeare
2: 34. Busy old fool, unruly sun 675,000 John Donne
3: 14. Candy/Is dandy/But liquor/Is quicker 2,150,000 Ogden Nash
4: 45. We few, we happy few, we band of brothers 521,000 Shakespeare
5: 9. Look on my works, ye mighty, and despair 3,080,000 Shelley
6: 11. Because I could not stop for death/He kindly stopped for me 2,360,000 Emily Dickinson
7: 46. If music be the food of love, play on 507,000 Shakespeare
8: 44. What is this life if, full of care,/We have no time to stand and stare 528,000 W.H. Davies
9: 35. Do not go gentle into that good night 665,000 Dylan Thomas
10: 15. But at my back I always hear 2,010,000 Marvell
./random-samples.sh 10 poetry-samples.txt
1: 26. I think that I shall never see/A poem lovely as a tree. 1,080,000 Joyce Kilmer
2: 32. Human kind/Cannot bear very much reality 891,000 T.S. Eliot
3: 14. Candy/Is dandy/But liquor/Is quicker 2,150,000 Ogden Nash
4: 13. My mistress’ eyes are nothing like the sun 2,230,000 Shakespeare
5: 42. They also serve who only stand and wait 584,000 Milton
6: 24. When in disgrace with fortune and men's eyes 1,100,000 Shakespeare
7: 21. A narrow fellow in the grass 1,310,000 Emily Dickinson
8: 9. Look on my works, ye mighty, and despair 3,080,000 Shelley
9: 10. Tis better to have loved and lost/Than never to have loved at all 2,400,000 Tennyson
10: 31. O Romeo, Romeo; wherefore art thou Romeo 912,000 Shakespeare

Maybe it's better to sample the file using a fixed scheme, such as taking one record every 10 lines. You can do that using this awk one-liner:
awk '0==NR%10' filename
If you want to sample a percentage of the total, you can compute the interval the awk one-liner should use so that the number of records printed matches that percentage.
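For example, to print roughly 10% of a file with that fixed-interval scheme, the step can be computed from the line count first (a sketch; filename is a placeholder, and very small files where the division yields 0 would need a guard):

```shell
total=$(wc -l < filename)   # total number of lines
want=$(( total / 10 ))      # sample size: 10% of the total
step=$(( total / want ))    # interval between sampled lines
awk -v s="$step" 'NR % s == 0' filename
```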
I hope this helps!

Related

Comparing an unknown number of variables in bash

I have 1 to 4 Linux server names in a configuration file. I need to take those names out of the configuration file and assign them values (floating-point values derived from Linux commands). The number of servers taken out of the configuration file may vary depending on availability (for example, if a server is down for some reason, we remove it from the configuration file or comment it out), so the fixed number of servers is 4 but it may be fewer. How do I compare the derived values and find the least/minimum of them? It would be great if someone could provide suggestions on this.
To compare two floating point numbers you can use bc. It will print (not return) 0 for false and 1 for true statements:
$ bc <<< '2.01 > 2.1'
0
$ bc <<< '2.1 > 2.01'
1
$ bc <<< '2.01 >= 2.1'
0
$ bc <<< '2.01 >= 2.01'
1

How to count the number of reviews in my files in a folder and then sort by highest to lowest?

thanks anyone who has helped me so far, here is my problem: I have a folder which contains 825 files. Within these files are reviews of a hotel. An example name of one of these files is hotel_72572.dat and this file basically contains the following:
<Overall Rating>4
<Avg. Price>$173
<URL>http://www.tripadvisor.com/ShowUserReviews-g60878-d72572-r23327047-Best_Western_Pioneer_Square_Hotel-Seattle_Washington.html
<Author>everywhereman2
<Content>Old seattle...
<Date>Jan 6, 2009
<img src="http://cdn.tripadvisor.com/img2/new.gif" alt="New"/>
<No. Reader>-1
<No. Helpful>-1
<Overall>5
<Value>5
<Rooms>5
<Location>5
<Cleanliness>5
<Check in / front desk>5
<Service>5
<Business service>5
<Author> //repeats the fields again, each cluster of fields is a review
The fields (line 6 through <Business service>) are then repeated n times, where n is the number of reviews in the file. I thought that counting the number of times "Author" appears per file would achieve this, but perhaps there is a better solution?
I am trying to write a script called countreviews.sh that will count the number of reviews per file in my folder (the folder name is reviews_folder) and then sort the numbers from highest to lowest. An example output would be:
hotel_72572 45
hotel_72579 33
hotel_73727 17
where the prefix is the name of the file and the number is the number of reviews per file. My script must take the folder name as an argument. For example I would type ./countreviews.sh reviews_folder and would get my output.
I have received lots of help over the past few days with many different suggestions, but none of them has achieved what I am trying to do (my fault, due to poor explanations). I hope this finally explains it clearly enough. Thanks again to anyone who has helped me over the past few days, and for any help I get with this question.
grep -c Author hotel_*.dat | sort -t : -k2nr | sed 's/\.dat:/ /'
Output (e.g.):
hotel_72572 45
hotel_72579 33
hotel_73727 17
Update
#!/bin/bash
cd "$1" || exit 1
grep -c Author hotel_*.dat | sort -t : -k2nr | sed 's/\.dat:/ /'

How to resume reading a file?

I'm trying to find the best and most efficient way to resume reading a file from a given point.
The given file is being written frequently (this is a log file).
This file is rotated on a daily basis.
In the log file I'm looking for the pattern 'slow transaction'. Such lines end with a number in parentheses. I want the sum of those numbers.
Example of log line:
Jun 24 2015 10:00:00 slow transaction (5)
Jun 24 2015 10:00:06 slow transaction (1)
This is the easy part, which I could do with an awk command to get a total of 6 for the above example.
Now my challenge is that I want to get the values from this file on a regular basis. I have an external system that polls a custom OID using SNMP; when it hits this OID, the Linux host runs a couple of basic commands.
I want this SNMP polling event to get the number of events since the last polling only. I don't want to have the total every time, just the total of the newly added lines.
Just to mention that only bash can be used, or basic commands such as awk, sed, tail, etc. No Perl or advanced programming languages.
I hope my description is clear enough. Apologies if this is a duplicate; I did some research before posting but did not find anything that precisely corresponds to my need.
Thank you for any assistance
In addition to the methods in the comment link, you can also simply use dd and stat: read the logfile size, save it, sleep 300, then check the logfile size again. If the file size has changed, skip over the old information with dd and read only the new information.
Note: you can add a test to handle the case where the logfile is deleted and then restarted at size 0 (e.g. if ((newsize < size)); then reread the whole file).
Here is a short example with 5 minute intervals:
#!/bin/bash
lfn=${1:-/path/to/logfile}
size=$(stat -c "%s" "$lfn") ## save original log size
while :; do
    newsize=$(stat -c "%s" "$lfn")   ## get new log size
    if ((size != newsize)); then     ## if changed, use new info
        ## use dd to skip over the existing text to the new text
        newtext=$(dd if="$lfn" bs="$size" skip=1 2>/dev/null)
        ## process newtext however you need
        printf "\nnewtext:\n\n%s\n" "$newtext"
        size=$newsize                ## update size to newsize
    fi
    sleep 300
done
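To also cover the rotation case from the note above, the comparison inside the loop can distinguish growth from truncation. Here is one way to sketch it, using tail -c instead of dd so it also works when the saved size is 0:

```shell
if (( newsize < size )); then
    ## file shrank: it was rotated or truncated, so reread everything
    newtext=$(cat "$lfn")
elif (( newsize > size )); then
    ## file grew: read only the bytes appended since the last check
    ## (tail -c +N starts output at byte N)
    newtext=$(tail -c +"$((size + 1))" "$lfn")
fi
size=$newsize
```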

Text-Message Gateways & Incrementing Bash Variable Daily

I have a bash script that is sending me a text daily, for 100 days.
#! /bin/bash
EMAIL="my-phone-gateway@address.net"
MESSAGE="message_content.txt"
mail "$EMAIL" < "$MESSAGE"
Using crontab, I can have the static $MESSAGE sent to me every day.
Other than hard-coding 100 days of texts ;)
How could I implement a variable counter such that I can have my texts say:
"Today is Day #1" on the first day, "Today is Day #2" on the second day, etc. ?
Note: The location of the requested text within the $MESSAGE file doesn't matter. Last line, first line, middle, etc.
The only requirement for an answer here is that I know what day it is relative to the first, where the first day is the day the script was started.
Of course, bonus awesome points for the cleanest, simplest, shortest solution :)
For our nightly build systems, I wrote a C program that does the calculation (using local proprietary libraries that store dates as a number of days since a reference date). Basically, given a (non-changing) reference date, it reports the number of days since the reference date. So, the cron script would have a hard-wired first day in it, and the program would report the number of days since then.
The big advantage of this system is that the reference date doesn't change (very often), so the script doesn't change (very often), and there are no external files to store information in.
There probably are ways to achieve the same effect with standard Unix tools, but I've not sat down and worked out a portable solution. I'd probably think in terms of using Perl. (The C program only works up to 2999 CE; I left a note in the code for people to contact me about 50 years before it becomes a problem for the Y3K fix. It is probably trivial.)
You could perhaps work in terms of Unix timestamps...
Create a script 'days_since 1234567890' which treats the number as the reference date, gets the current time stamp (from date with appropriate format specification; on Linux, date '+%s' would do that job, and it works on Mac OS X too), takes the difference and divides by 86,400 (the number of seconds in a day).
refdate=1234567890
bc <<EOF
scale=0
($(date '+%s') - $refdate) / 86400
EOF
An example:
$ timestamp 1234567890
1234567890 = Fri Feb 13 15:31:30 2009
$ timestamp
1330027280 = Thu Feb 23 12:01:20 2012
$ refdate=1234567890
$ bc <<EOF
> scale=0
> ($(date '+%s') - $refdate) / 86400
> EOF
1104
$
So, if the reference date was 13th Feb 2009, today is day 1104. (The program bc is the calculator; its name has nothing to do with Anno Domini or Before Christ. The program timestamp is another homebrew of mine that prints timestamps according to a format that can be specified; it is a specialized variant of date originally written in the days before date had the functionality, by which I mean in the early 1980s.)
In a Perl one-liner (assuming you specify the reference date in your script):
perl -e 'printf "%d\n", int((time - 1234567890)/ 86400)'
or:
days=$(perl -e 'printf "%d\n", int((time - 1234567890)/ 86400)')
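If even Perl is off the table, the same arithmetic fits in plain bash, since date '+%s' and $(( )) are all that is needed (same hard-coded reference date as above):

```shell
refdate=1234567890                                # hard-wired first day
days=$(( ( $(date '+%s') - refdate ) / 86400 ))   # whole days since then
echo "Today is Day #$(( days + 1 ))"
```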
The only way to accomplish this would be to store the date in a file, and read from that file each day. I would suggest storing the epoch time.
today=$(date +%s)
time_file=~/.first_time    # note: ~ does not expand inside quotes
if [[ -f $time_file ]]; then
    f_time=$(< "$time_file")
else
    f_time=$today
    echo "$f_time" > "$time_file"
fi
printf 'This is day: %s\n' "$(( (today - f_time) / 60 / 60 / 24 ))"
Considering that your script is running only once a day, something like this should work:
#!/bin/bash
EMAIL="my-phone-gateway@address.net"
MESSAGE="message_content.txt"
STFILE=/tmp/start.txt
start=0
[ -f "$STFILE" ] && start=$(<"$STFILE")
start=$((start+1))
echo "$start" > "$STFILE"
# append the day counter to the message body before mailing
# (redirecting from a variable that holds text, not a filename, won't work)
{ cat "$MESSAGE"; echo "Today is Day #${start}"; } | mail "$EMAIL"
A simple answer would be to export the current value to an external file, and read that back in again later.
So, for example, make a file called "CurrentDay.dat" that has the number 1 in it.
Then, in your bash script, read in the number and increment it.
e.g. your bash script could be:
#!/bin/bash
#Your stuff here.
DayCounter=$(<CurrentDay.dat)
#Use the value of DayCounter (i.e. $DayCounter) in your message.
DayCounter=$((DayCounter + 1))
echo $DayCounter > CurrentDay.dat
Of course, you may need to implement some additional checks to avoid something going wrong, but that should work as is.

fastest hashing in a unix environment?

I need to examine the output of a certain script 1000s of times on a unix platform and check if any of it has changed from before.
I've been doing this:
(script_stuff) | md5sum
and storing this value. I actually don't really need "md5", JUST a simple hash function that I can compare against a stored value to see if it has changed. It's okay if there is an occasional false positive.
Is there anything better than md5sum that works faster and generates a fairly usable hash value? The script itself generates a few lines of text - maybe 10-20 on average to max 100 or so.
I had a look at fast md5sum on millions of strings in bash/ubuntu - that's wonderful, but I can't compile a new program. I need a system utility... :(
Additional "background" details:
I've been asked to monitor the DNS records of a set of 1000 or so domains and immediately call certain other scripts if there has been any change. I intend to run a dig xyz +short statement, hash its output, store that, and then check it against the previously stored value. Any change will trigger the other script; otherwise it just goes on. Right now we're planning on using cron for a set of these 1000, but we could think completely differently for "seriously heavy" usage - ~20,000 or so.
I have no idea what the use of such a system would be, I'm just doing this as a job for someone else...
The cksum utility calculates a non-cryptographic CRC checksum.
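A sketch of how cksum could slot into the polling loop described in the question (some_command stands in for the dig invocation, and the state-file path is arbitrary):

```shell
state=/tmp/dns.cksum                 # where the previous checksum lives
new=$(some_command | cksum)          # cksum prints "CRC byte-count"
if [ "$new" != "$(cat "$state" 2>/dev/null)" ]; then
    echo "output changed"            # call the trigger scripts here
    printf '%s\n' "$new" > "$state"
fi
```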
How big is the output you're checking? A hundred lines max. I'd just save the entire original file then use cmp to see if it's changed. Given that a hash calculation will have to read every byte anyway, the only way you'll get an advantage from a checksum type calculation is if the cost of doing it is less than reading two files of that size.
And cmp won't give you any false positives or negatives :-)
pax> echo hello >qq1.txt
pax> echo goodbye >qq2.txt
pax> cp qq1.txt qq3.txt
pax> cmp qq1.txt qq2.txt >/dev/null
pax> echo $?
1
pax> cmp qq1.txt qq3.txt >/dev/null
pax> echo $?
0
Based on your question update:
I've been asked to monitor the DNS records of a set of 1000 or so domains and immediately call certain other scripts if there has been any change. I intend to run a dig xyz +short statement, hash its output, store that, and then check it against the previously stored value. Any change will trigger the other script; otherwise it just goes on. Right now we're planning on using cron for a set of these 1000, but we could think completely differently for "seriously heavy" usage - ~20,000 or so.
I'm not sure you need to worry too much about the file I/O. The following script executed dig microsoft.com +short 5000 times, first with file I/O and then with output to /dev/null (by changing the comments).
#!/bin/bash
rm -rf qqtemp
mkdir qqtemp
((i = 0))
while [[ $i -ne 5000 ]] ; do
    #dig microsoft.com +short >qqtemp/microsoft.com.$i
    dig microsoft.com +short >/dev/null
    ((i = i + 1))
done
The elapsed times at 5 runs each are:
File I/O | /dev/null
----------+-----------
3:09 | 1:52
2:54 | 2:33
2:43 | 3:04
2:49 | 2:38
2:33 | 3:08
After removing the outliers and averaging, the results are 2:49 for the file I/O and 2:45 for the /dev/null. The time difference is four seconds for 5000 iterations, only 1/1250th of a second per item.
However, since an iteration over the 5000 takes up to three minutes, that is the maximum time it will take to detect a problem (a minute and a half on average). If that's not acceptable, you need to move away from bash to another tool.
Given that a single dig only takes about 0.012 seconds, you should theoretically do 5000 in sixty seconds assuming your checking tool takes no time at all. You may be better off doing something like this in Perl and using an associative array to store the output from dig.
Perl's semi-compiled nature means that it will probably run substantially faster than a bash script and Perl's fancy stuff will make the job a lot easier. However, you're unlikely to get that 60-second time much lower just because that's how long it takes to run the dig commands.
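Even without leaving bash, an associative array can keep the previous answers in memory across iterations, avoiding one state file per domain (a sketch; the domain list is illustrative, and this only works while the script keeps running):

```shell
declare -A last                       # previous dig output per domain
domains=(example.com example.org)     # illustrative list of ~1000 names

for d in "${domains[@]}"; do
    cur=$(dig "$d" +short)
    # trigger only when we have a previous value and it differs
    if [[ -n "${last[$d]+x}" && "$cur" != "${last[$d]}" ]]; then
        echo "$d changed"             # call the trigger script here
    fi
    last[$d]=$cur
done
```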