Randomly split a file with a specific proportion

I want to randomly 80/20 split a file using awk.
I have read and tried the option found HERE, in which something like the following is proposed:
$ awk -v N=`cat FILE | wc -l` 'rand()<3000/N' FILE
works great if you want a random selection.
However, is it possible to alter this awk command so that it splits the one file into two files in an 80/20 (or any other) proportion?

With gawk, you'd write
gawk '
BEGIN {srand()}
{f = FILENAME (rand() <= 0.8 ? ".80" : ".20"); print > f}
' file
Example:
seq 100 > 100.txt
gawk 'BEGIN {srand()} {f = FILENAME (rand() <= 0.8 ? ".80" : ".20"); print > f}' 100.txt
wc -l 100.txt*
100 100.txt
23 100.txt.20
77 100.txt.80
200 total
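A small variation of the same idea takes the proportion as a variable, so the same command covers any split (a sketch, assuming gawk; the p variable and the .keep/.rest suffixes are my own naming):
gawk -v p=0.8 'BEGIN {srand()} {f = FILENAME (rand() <= p ? ".keep" : ".rest"); print > f}' 100.txt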
To ensure 20 lines in the "20" file:
$ paste -d $'\034' <(seq $(wc -l < "$file") | sort -R) "$file" \
| awk -F $'\034' -v file="$file" '{
f = file ($1 <= 20 ? ".20" : ".80")
print $2 > f
}'
$ wc -l "$file"*
100 testfile
20 testfile.20
80 testfile.80
200 total
\034 is the ASCII FS character, unlikely to appear in a text file.
Using sort -R to shuffle the input may not be portable, though it is available in both GNU and BSD sort.
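If shuf is available, the same exact 20/80 split can be done without sort -R; a minimal sketch (note that, unlike the paste approach above, this does not preserve the original line order in the output files):
shuf "$file" | awk -v file="$file" 'NR <= 20 {print > (file ".20"); next} {print > (file ".80")}'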

Random line using sed

I want to select a random line with sed. I know shuf -n and sort -R | head -n do the job, but shuf requires installing coreutils, and the sort solution isn't optimal on large data.
Here is what I tested:
echo "$var" | shuf -n1
which gives the desired result, but I'm worried about its portability; that's why I want to try it with sed.
var="Hi
i am a student
learning scripts"
output:
i am a student
output:
Hi
It must be Random.
It depends greatly on what you want your pseudo-random probability distribution to look like. (Don't try for random; be content with pseudo-random. If you do manage to generate a truly random value, go collect your Nobel Prize.) If you just want a uniform distribution (e.g., each line has equal probability of being selected), then you'll need to know a priori how many lines are in the file. Getting a perfectly uniform distribution is not quite as easy as allowing the earlier lines in the file to be slightly more likely to be selected, and since the latter is easy, we'll do that. Assuming that the number of lines is less than 32769, you can simply do:
N=$(wc -l < input-file)
sed -n -e $((RANDOM % N + 1))p input-file
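Since $RANDOM only ranges over 0..32767, a guarded version might look like this (a sketch; the error message is my own):
N=$(wc -l < input-file)
if [ "$N" -gt 32768 ]; then
    echo "input has more than 32768 lines; \$RANDOM cannot reach them all" >&2
else
    sed -n "$((RANDOM % N + 1))p" input-file
fi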
-- edit --
After thinking about it for a bit, I realize you don't need to know the number of lines, so you don't need to read the data twice. I haven't done a rigorous analysis, but I believe that the following gives a uniform distribution (each line replaces the currently held line with probability 1/NR, which works out to every line having the same 1/N chance of being the one printed at the end):
awk 'BEGIN{srand()} rand() < 1/NR { out=$0 } END { print out }' input-file
-- edit --
Ed Morton suggests in the comments that we should be able to invoke rand() only once. That seems like it ought to work, but doesn't seem to. Curious:
$ time for i in $(seq 400); do awk -v seed=$(( $(date +%s) + i)) 'BEGIN{srand(seed); r=rand()} r < 1/NR { out=$0 } END { print out}' input; done | awk '{a[$0]++} END { for (i in a) print i, a[i]}' | sort
1 205
2 64
3 37
4 21
5 9
6 9
7 9
8 46
real 0m1.862s
user 0m0.689s
sys 0m0.907s
$ time for i in $(seq 400); do awk -v seed=$(( $(date +%s) + i)) 'BEGIN{srand(seed)} rand() < 1/NR { out=$0 } END { print out}' input; done | awk '{a[$0]++} END { for (i in a) print i, a[i]}' | sort
1 55
2 60
3 37
4 50
5 57
6 45
7 50
8 46
real 0m1.924s
user 0m0.710s
sys 0m0.932s
var="Hi
i am a student
learning scripts"
mapfile -t array <<< "$var" # create array from $var
echo "${array[$RANDOM % ${#array[@]}]}"
echo "${array[$RANDOM % ${#array[@]}]}"
Output (e.g.):
learning scripts
i am a student
See: help mapfile
This seems to be the best solution for large input files:
awk -v seed="$RANDOM" -v max="$(wc -l < file)" 'BEGIN{srand(seed); n=int(rand()*max)+1} NR==n{print; exit}' file
as it uses standard UNIX tools, it's not restricted to files that are 32,768 lines long or less, it doesn't have any bias towards either end of the input, it'll produce different output even if called twice in 1 second, and it exits immediately after the target line is printed rather than continuing to the end of the input.
Update:
Having said the above, I have no explanation for why a script that calls rand() once per line and reads every line of input is about twice as fast as a script that calls rand() once and exits at the first matching line:
$ seq 100000 > file
$ time for i in $(seq 500); do
awk -v seed="$RANDOM" -v max="$(wc -l < file)" 'BEGIN{srand(seed); n=int(rand()*max)+1} NR==n{print; exit}' file;
done > o3
real 1m0.712s
user 0m8.062s
sys 0m9.340s
$ time for i in $(seq 500); do
awk -v seed="$RANDOM" 'BEGIN{srand(seed)} rand() < 1/NR{ out=$0 } END { print out}' file;
done > o4
real 0m29.950s
user 0m9.918s
sys 0m2.501s
They both produced very similar types of output:
$ awk '{a[$0]++} END { for (i in a) print i, a[i]}' o3 | awk '{sum+=$2; max=(NR>1&&max>$2?max:$2); min=(NR>1&&min<$2?min:$2)} END{print NR, sum, min, max}'
498 500 1 2
$ awk '{a[$0]++} END { for (i in a) print i, a[i]}' o4 | awk '{sum+=$2; max=(NR>1&&max>$2?max:$2); min=(NR>1&&min<$2?min:$2)} END{print NR, sum, min, max}'
490 500 1 3
Final Update:
Turns out it was calling wc that (unexpectedly to me at least!) was taking most of the time. Here's the improvement when we take it out of the loop:
$ time { max=$(wc -l < file); for i in $(seq 500); do awk -v seed="$RANDOM" -v max="$max" 'BEGIN{srand(seed); n=int(rand()*max)+1} NR==n{print; exit}' file; done } > o3
real 0m24.556s
user 0m5.044s
sys 0m1.565s
so the solution where we call wc up front and rand() once is faster than calling rand() for every line as expected.
In a bash shell, first initialize the seed to the number of lines cubed (or any value of your choice):
$ i=;while read a; do let i++;done<<<$var; let RANDOM=i*i*i
$ let l=$RANDOM%$i+1 ;echo -e $var |sed -En "$l p"
If you move your data to a file named varfile:
$ echo -e $var >varfile
$ i=;while read a; do let i++;done<varfile; let RANDOM=i*i*i
$ let l=$RANDOM%$i+1 ;sed -En "$l p" varfile
Put the last command inside a loop if you need repeated picks, e.g. for ((c=0; c<9; c++)); do ...; done
Using GNU sed and bash; no wc or awk:
f=input-file
sed -n $((RANDOM%($(sed = $f | sed '2~2d' | sed -n '$p')) + 1))p $f
Note: The three seds in $(...) are an inefficient way to fake wc -l < $f. Maybe there's a better way -- using only sed of course.
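A somewhat simpler sed-only line count is sed -n '$=', which prints the number of the last line; a sketch of the same idea using it (still limited by the 32768 range of $RANDOM):
f=input-file
sed -n "$((RANDOM % $(sed -n '$=' "$f") + 1))p" "$f"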
Using shuf:
$ echo "$var" | shuf -n 1
Output:
Hi

How to work around open files limit when demuxing files?

I frequently have large text files (10-100GB decompressed) to demultiplex based on barcodes in each line, where in practice the number of resulting individual files (unique barcodes) is between 1K and 20K. I've been using awk for this and it accomplishes the task. However, I've noticed that the rate of demuxing larger files (which correlates with more unique barcodes used) is significantly slower (10-20X). Checking ulimit -n shows 4096 as the limit on open files per process, so I suspect that the slowdown is due to the overhead of awk being forced to constantly close and reopen files whenever the total number of demuxed files exceeds 4096.
Lacking root access (i.e., the limit is fixed), what kinds of workarounds could be used to circumvent this bottleneck?
I do have a list of all barcodes present in each file, so I've considered forking multiple awk processes where each is assigned a mutually exclusive subset (< 4096) of barcodes to search for. However, I'm concerned the overhead of having to check each line's barcode for set membership might defeat the gains of not closing files.
Is there a better strategy?
I'm not married to awk, so approaches in other scripting or compiled languages are welcome.
Specific Example
Data Generation (FASTQ with barcodes)
The following generates data similar to what I'm specifically working with. Each entry consists of 4 lines, where the barcode is an 18-character word using the non-ambiguous DNA alphabet.
1024 unique barcodes | 1 million reads
cat /dev/urandom | tr -dc "ACGT" | fold -w 5 | \
awk '{ print "#batch."NR"_"$0"AAAAAAAAAAAAA_ACGTAC length=1\nA\n+\nI" }' | \
head -n 4000000 > cells.1K.fastq
16384 unique barcodes | 1 million reads
cat /dev/urandom | tr -dc "ACGT" | fold -w 7 | \
awk '{ print "#batch."NR"_"$0"AAAAAAAAAAA_ACGTAC length=1\nA\n+\nI" }' | \
head -n 4000000 > cells.16K.fastq
awk script for demultiplexing
Note that in this case I'm writing to 2 files for each unique barcode.
demux.awk
#!/usr/bin/awk -f
BEGIN {
    if (length(outdir) == 0 || length(prefix) == 0) {
        print "Variables 'outdir' and 'prefix' must be defined!" > "/dev/stderr";
        exit 1;
    }
    print "[INFO] Initiating demuxing..." > "/dev/stderr";
}
{
    if (NR%4 == 1) {
        match($1, /.*_([ACGT]{18})_([ACGTN]{6}).*/, bx);
        print bx[2] >> outdir"/"prefix"."bx[1]".umi";
    }
    print >> outdir"/"prefix"."bx[1]".fastq";
    if (NR%40000 == 0) {
        printf("[INFO] %d reads processed\n", NR/4) > "/dev/stderr";
    }
}
END {
    printf("[INFO] %d total reads processed\n", NR/4) > "/dev/stderr";
}
Usage
awk -v outdir="/tmp/demux1K" -v prefix="batch" -f demux.awk cells.1K.fastq
or similarly for the cells.16K.fastq.
Assuming you're the only one running awk, you can verify the approximate number of open files using
lsof | grep "awk" | wc -l
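On Linux, counting the entries in /proc/<pid>/fd can be a bit more precise than parsing lsof output (a sketch, assuming the process shows up under the name awk; adjust the pgrep pattern if it does not):
for pid in $(pgrep -x awk); do ls /proc/"$pid"/fd | wc -l; done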
Observed Behavior
Despite the files being the same size, the one with 16K unique barcodes runs 10X-20X slower than the one with only 1K unique barcodes.
Without seeing any sample input/output or the script you're currently executing, it's very much guesswork, but if you currently have the barcode in field 1 and are doing the following (assuming GNU awk, so you don't have your own code managing the open files):
awk '{print > $1}' file
then, if managing open files really is your problem, you'll get a significant improvement if you change it to:
sort file | awk '$1!=f{close(f); f=$1} {print > f}'
The above is, of course, making assumptions about what these barcode values are, which field holds them, what separates fields, whether or not the output order has to match the original, what else your code might be doing that gets slower as the input grows, etc., etc., since you haven't shown us any of that yet.
If that's not all you need then edit your question to include the missing MCVE.
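For illustration, a self-contained sketch of that sort-then-close pattern, assuming a whitespace-separated barcode in field 1 (the demux.* output names are made up):
sort file |
awk '
    $1 != prev { if (out) close(out); out = "demux." $1 ".txt"; prev = $1 }   # only one output file open at a time
    { print > out }
'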
Given your updated question with your script and the info that the input is in 4-line blocks, I'd approach the problem by adding the key "bx" values at the front of each record, using NUL to separate the 4-line blocks, and then using NUL as the record separator for sort and the subsequent awk:
$ cat tst.sh
infile="$1"
outdir="${infile}_out"
prefix="foo"
mkdir -p "$outdir" || exit 1
awk -F'[_[:space:]]' -v OFS='\t' -v ORS= '
    NR%4 == 1 { print $2 OFS $3 OFS }
    { print $0 (NR%4 ? RS : "\0") }
' "$infile" |
sort -z |
awk -v RS='\0' -F'\t' -v outdir="$outdir" -v prefix="$prefix" '
    BEGIN {
        if ( (outdir == "") || (prefix == "") ) {
            print "Variables \047outdir\047 and \047prefix\047 must be defined!" | "cat>&2"
            exit 1
        }
        print "[INFO] Initiating demuxing..." | "cat>&2"
        outBase = outdir "/" prefix "."
    }
    {
        bx1 = $1
        bx2 = $2
        fastq = $3
        if ( bx1 != prevBx1 ) {
            close(umiOut)
            close(fastqOut)
            umiOut = outBase bx1 ".umi"
            fastqOut = outBase bx1 ".fastq"
            prevBx1 = bx1
        }
        print bx2 > umiOut
        print fastq > fastqOut
        if (NR%10000 == 0) {
            printf "[INFO] %d reads processed\n", NR | "cat>&2"
        }
    }
    END {
        printf "[INFO] %d total reads processed\n", NR | "cat>&2"
    }
'
When run against input files generated as you describe in your question:
$ wc -l cells.*.fastq
4000000 cells.16K.fastq
4000000 cells.1K.fastq
the results are:
$ time ./tst.sh cells.1K.fastq 2>/dev/null
real 0m55.333s
user 0m56.750s
sys 0m1.277s
$ ls cells.1K.fastq_out | wc -l
2048
$ wc -l cells.1K.fastq_out/*.umi | tail -1
1000000 total
$ wc -l cells.1K.fastq_out/*.fastq | tail -1
4000000 total
$ time ./tst.sh cells.16K.fastq 2>/dev/null
real 1m6.815s
user 0m59.058s
sys 0m5.833s
$ ls cells.16K.fastq_out | wc -l
32768
$ wc -l cells.16K.fastq_out/*.umi | tail -1
1000000 total
$ wc -l cells.16K.fastq_out/*.fastq | tail -1
4000000 total

How to grep and execute command for every multiline match

Is there a way to process multiline grep output with one command per match?
I've got something like
<fulldata>
<value>1</value>
<value>2</value>
</fulldata>
<fulldata>
<value>2</value>
<value>3</value>
</fulldata>
and I want to compute the mean and standard deviation, and do some other things, with each data element on its own.
In this case, I want to execute
function printStatistics {
    mean1=$(awk -F ';' '{print $1}' $1 | awk '{sum += $1; square += $1^2} END {print sum / NR}')
    deviation1=$(awk -F ';' '{print $1}' $1 | awk '{sum += $1; square += $1^2} END {print sqrt(square / NR - (sum/NR)^2)}')
    size=$(cat $1 | wc -l)
    echo $mean1 $deviation1 $size
}
with the expected result (for the sample data), ideally separated by newlines:
1,5 0,7 2
2,5 0,7 2
Running
cat add.xml | grep "<fulldata" -A 2001 | while read line ; do echo "Line: $line" ; done
as suggested in How to grep and execute a command (for every match), results in one iteration for each line, but I want one iteration per <fulldata> block (in order to execute the awk stuff on it later).
Is this possible with grep, or is this a use case where another language would be more appropriate?
It is bad practice to parse HTML/XML with grep, because it's not reliable. If you are using Mac OS X, you can use a preinstalled CLI tool called xmllint to select specific elements. On Linux, you can use the standard package manager to get it.
There is also xgrep, and probably others that I don't know about.
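For example, since the sample input has no single root element (so it is not well-formed XML as-is), a minimal sketch wraps it first and then queries it with xmllint (wrapped.xml and the <root> tag are my own):
{ echo '<root>'; cat add.xml; echo '</root>'; } > wrapped.xml
xmllint --xpath 'count(//fulldata)' wrapped.xml        # number of <fulldata> blocks
xmllint --xpath 'string(//fulldata[1])' wrapped.xml    # text content of the first block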
awk to the rescue!
$ awk -v RS='\n?</?fulldata>\n' -F'\n' '
    !(NR%2){gsub("</?value>","");
            s=ss=0;
            for(i=1;i<=NF;i++) {s+=$i; ss+=$i^2}
            printf "%.1f %.1f %d\n", s/NF, sqrt((ss-s^2/NF)/(NF-1)), NF}' file
1.5 0.7 2
2.5 0.7 2
For the sample standard deviation as computed, you need to guard against the single-observation (NF==1) case.
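For completeness, a sketch of that guard, printing 0 as the deviation when there is only one observation (that fallback value is my own choice; like the original, this relies on gawk's regex RS):
awk -v RS='\n?</?fulldata>\n' -F'\n' '
    !(NR%2){ gsub("</?value>","")
             s=ss=0
             for(i=1;i<=NF;i++){ s+=$i; ss+=$i^2 }
             sd = (NF>1 ? sqrt((ss-s^2/NF)/(NF-1)) : 0)
             printf "%.1f %.1f %d\n", s/NF, sd, NF }' file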
Complex xmlstarlet + awk solution:
xmlstarlet ed -u "//fulldata/value" -x "concat(.,',')" add.xml \
| xmlstarlet sel -B -t -v "//fulldata" -n \
| awk -F, '{ n=NF-1; sum=sq=0; for(i=1;i<=n;i++) { sum+=$i; sq+=$i^2 }
printf "%.1f\n%.1f\n%d\n", sum/n, sqrt((sq-sum^2/n)/(n-1)), n }'
The output:
1.5
0.7
2
2.5
0.7
2

BASH: Padding a series of HEX values based on the longest string

I have this odd condition where I've been given a series of HEX values that represent binary data. The interesting thing is that they are occasionally different lengths, such as:
40000001AA
0000000100
A0000001
000001
20000001B0
40040001B0
I would like to append 0's on the end to make them all the same length, based on the longest entry. So, in the example above I have four entries that are 10 characters long, terminated by '\n', and a few short ones (in the actual data, I have 200k entries with about 1k short ones). What I would like to do is figure out the longest string in the file, and then go through and pad the short ones; however, I haven't been able to figure it out. Any suggestions would be appreciated.
Using standard two-pass awk:
awk 'NR==FNR{if (len < length()) len=length(); next}
{s = sprintf("%-*s", len, $0); gsub(/ /, "0", s); print s}' file file
40000001AA
0000000100
A000000100
0000010000
20000001B0
40040001B0
Or using GNU wc with awk:
awk -v len="$(wc -L < file)" '
{s = sprintf("%-*s", len, $0); gsub(/ /, "0", s); print s}' file
40000001AA
0000000100
A000000100
0000010000
20000001B0
40040001B0
As you use Bash, there is a big chance that you also use other GNU tools. In that case, wc can easily tell you the length of the longest line in the file using the -L option. Example:
$ wc -L /tmp/HEX
10 /tmp/HEX
Padding can be done like this:
$ while read i; do echo $(echo "$i"0000000000 | head -c 10); done < /tmp/HEX
40000001AA
0000000100
A000000100
0000010000
20000001B0
40040001B0
A one-liner:
while read i; do eval printf "$i%.s0" {1..$(wc -L /tmp/HEX | cut -d ' ' -f1)} | head -c $(wc -L /tmp/HEX | cut -d ' ' -f1); echo; done < /tmp/HEX
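A variation on the same idea that builds the zero padding once and avoids running head (and eval) for every line (a sketch, still relying on GNU wc for -L):
max=$(wc -L < /tmp/HEX)
zeros=$(printf '%*s' "$max" '' | tr ' ' '0')
while read -r i; do
    printf '%s\n' "${i}${zeros:${#i}}"
done < /tmp/HEX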
In general, the way to zero-pad a string on either or both sides (using 5 as the desired field width, for example) is:
$ echo '17' | awk '{printf "%0*s\n", 5, $0}'
00017
$ echo '17' | awk '{printf "%s%0*s\n", $0, 5-length(), ""}'
17000
$ echo '17' | awk '{w=int((5+length())/2); printf "%0*s%0*s\n", w, $0, 5-w, ""}'
01700
$ echo '17' | awk '{w=int((5+length()+1)/2); printf "%0*s%0*s\n", w, $0, 5-w, ""}'
00170
so for your example:
$ awk '{cur=length()} NR==FNR{max=(cur>max?cur:max);next} {printf "%s%0*s\n", $0, max-cur, ""}' file file
40000001AA
0000000100
A000000100
0000010000
20000001B0
40040001B0
Let's suppose you have these values in a file:
file=/tmp/hex.txt
Find out the length of the longest number:
longest=$(wc -L < $file)
Now, for each number in the file, pad it with zeroes:
while read number; do
printf "%-${longest}s\n" $number | sed 's/ /0/g'
done < $file
This is what the script will print to stdout:
40000001AA
0000000100
A000000100
0000010000
20000001B0
40040001B0

count the number of words between two lines in a text file

As the title says, I'm wondering if there is an easier way of getting the number of words between two lines in a text file, using the text processing tools available on *nix.
For example, given a text file as follows,
a bc ae
a b
ae we wke wew
count the words between lines: 1-2 -> 5, 2-3 -> 6.
You can use sed and wc like this:
sed -n '1,2p' file | wc -w
5
and
sed -n '2,3p' file | wc -w
6
You can do this with a simple awk command:
awk -v start='1' -v end='2' 'NR>=start && NR <=end{sum+=NF}END{print sum}' file
For the sample file you have provided:-
$ cat file
a bc ae
a b
ae we wke wew
$ awk -v start='1' -v end='2' 'NR>=start && NR <=end{sum+=NF}END{print sum}' file
5
$ awk -v start='2' -v end='3' 'NR>=start && NR <=end{sum+=NF}END{print sum}' file
6
$ awk -v start='1' -v end='3' 'NR>=start && NR <=end{sum+=NF}END{print sum}' file
9
The logic is simple:
Use the start and end variables to specify the line range in the file; they are awk variables.
NR>=start && NR<=end provides the condition that restricts processing to the lines you need.
sum+=NF does the word count arithmetic. NF is a special awk variable which counts the number of words delimited by FS, which in this case is white-space.
END{print sum} prints the final count.
Worked fine on GNU Awk 3.1.7
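A small variation on the same command (a sketch) stops reading as soon as the end line has passed, which can help on large files; the END block still runs, so the sum is printed as before:
awk -v start='2' -v end='3' 'NR > end {exit} NR >= start {sum += NF} END {print sum}' file
6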
