Why is generating a higher amount of random data much slower? - bash

I want to generate a large amount of random numbers. I wrote the following bash command (note that I am using cat here for demonstration purposes; in my real use case, I am piping the numbers into a process):
for i in {1..99999999}; do echo -e "$(cat /dev/urandom | tr -dc '0-9' | fold -w 5 | head -n 1)"; done | cat
The numbers are printed at a very low rate. However, if I generate a smaller amount, it is much faster:
for i in {1..9999}; do echo -e "$(cat /dev/urandom | tr -dc '0-9' | fold -w 5 | head -n 1)"; done | cat
Note that the only difference is 9999 instead of 99999999.
Why is this? Is the data buffered somewhere? Is there a way to optimize this, so that the random numbers are piped/streamed into cat immediately?

Why is this?
Expanding {1..99999999} into 100000000 arguments and then parsing them requires a lot of memory allocation from bash. This significantly stalls the whole system.
Additionally, large chunks of data are read from /dev/urandom, and about 96% of that data is filtered out by tr -dc '0-9'. This significantly depletes the entropy pool and additionally stalls the whole system.
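As a hedged aside, if you really need a loop of that length, an arithmetic for loop avoids expanding all 100000000 words up front (the loop body below is only a placeholder):
# Sketch: iterates without building the full brace-expanded argument list in memory
for ((i = 1; i <= 99999999; i++)); do
    :   # generate one number here
done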
Is the data buffered somewhere?
Each process has its own buffer, so:
cat /dev/urandom is buffering
tr -dc '0-9' is buffering
fold -w 5 is buffering
head -n 1 is buffering
the left side of the pipeline (the shell) has its own buffer
and the right side (| cat) has its own buffer
That's 6 buffering places. Even ignoring input buffering from head -n1 and from the right side of the pipeline | cat, that's 4 output buffers.
Also, save animals and stop cat abuse. Use tr </dev/urandom instead of cat /dev/urandom | tr. Fun fact: tr can't take a filename as an argument.
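If you only want the existing pipeline to stream instead of stall, a hedged workaround (assuming GNU coreutils' stdbuf is available) is to disable output buffering in the intermediate stages:
# Sketch: each stage writes unbuffered, so head and the consumer see bytes immediately
stdbuf -o0 tr -dc '0-9' </dev/urandom | stdbuf -o0 fold -w 5 | head -n 1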
Is there a way to optimize this, so that the random numbers are piped/streamed into cat immediately?
Remove the whole code.
Take only as few bytes from the random source as you need. To generate a 32-bit number you only need 32 bits, no more. To generate a 5-digit number, you only need 17 bits; rounding up to 8-bit bytes, that's only 3 bytes. The tr -dc '0-9' is a cool trick, but it definitely shouldn't be used in any real code.
I recently answered a similar question; copying the code from there, you could:
for ((i=0;i<100000000;++i)); do echo "$((0x$(dd if=/dev/urandom of=/dev/stdout bs=4 count=1 status=none | xxd -p)))"; done | cut -c-5
# cut to take first 5 digits
But that would still be unacceptably slow, as it runs 2 processes for each random number (and I think just taking the first 5 digits gives a bad distribution).
I suggest using $RANDOM, available in bash. If not, use $SRANDOM if you really want /dev/urandom (and really know why you want it). Failing that, I suggest writing the random number generation from /dev/urandom in a real programming language, like C, C++, Python, Perl, or Ruby. I believe one could even write it in awk.
The following looks nice, but converting binary data to hex only to convert it to decimal later is a workaround for the fact that the shell just can't work with binary data:
count=10;
# take count*4 bytes from input
dd if=/dev/urandom of=/dev/stdout bs=4 count=$count status=none |
# Convert bytes to hex 4 bytes at a time
xxd -p -c 4 |
# Convert hex to decimal using GNU awk
awk --non-decimal-data '{printf "%d\n", "0x"$0}'
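For comparison, a hedged sketch that stays entirely inside bash using $RANDOM (not cryptographic; two 15-bit values are combined, and the modulo introduces a small bias):
# Sketch: 100000000 zero-padded five-digit numbers from bash's built-in PRNG
for ((i = 0; i < 100000000; ++i)); do
    printf '%05d\n' $(( ((RANDOM << 15) | RANDOM) % 100000 ))
done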

Why are you running this in a loop? You can just run a single set of these commands to generate everything, e.g.:
cat /dev/urandom | tr -dc '0-9' | fold -w 5 | head -n 100000000
I.e. just generate a single stream of numbers, rather than generate them individually.
I'd second the suggestion of using another language for this, it should be much more efficient. For example, in Python it would just be:
from random import randrange
for _ in range(100000000):
    print(randrange(100000))

@SamMason gave the best answer so far, as he completely did away with the loop:
cat /dev/urandom | tr -dc '0-9' | fold -w 5 | head -n 100000000
That still leaves a lot of room for improvement though. First, tr -dc '0-9' only uses about 4% of the stuff that's coming out of /dev/urandom :-) Second, depending on how those random numbers will be consumed in the end, some additional overhead may be incurred for getting rid of leading zeros -- so that some numbers will not be interpreted as octal. Let me suggest a better alternative, using the od command:
outputFile=/dev/null # For test. Replace with the real file.
count=100000000
od -An -t u2 -w2 /dev/urandom | head -n $count >$outputFile
A quick test with the time command showed this to be roughly four times faster than the tr version. And there is really no need for using "another language", as both od and head are highly optimized, and this whole thing runs at native speed.
NOTE: The above command will be generating 16-bit integers, ranging from 0 to 65535 inclusive. If you need a larger range, then you could go for 32 bit numbers, and that will give you a range from 0 to 4294967295:
od -An -t u4 -w4 /dev/urandom | head -n $count >$outputFile
If needed, the end user can scale those down to the desired size with a modulo division.
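For example, a hedged sketch that does the scaling with awk (note that plain modulo scaling is slightly biased whenever the range does not divide evenly):
# Prints numbers in the range 0..99999, one per line
od -An -t u4 -w4 /dev/urandom | head -n "$count" | awk '{ print $1 % 100000 }' > "$outputFile"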

Convert array of bytes to base64 string using bash [duplicate]

cat /dev/urandom is always a fun way to create scrolling characters on your display, but produces too many non-printable characters.
Is there an easy way to encode it on the command line in such a way that all of its output is readable characters, base64 or uuencode for example?
Note that I prefer solutions that require no additional files to be created.
What about something like
cat /dev/urandom | base64
Which gives (lots of) stuff like
hX6VYoTG6n+suzKhPl35rI+Bsef8FwVKDYlzEJ2i5HLKa38SLLrE9bW9jViSR1PJGsDmNOEgWu+6
HdYm9SsRDcvDlZAdMXAiHBmq6BZXnj0w87YbdMnB0e2fyUY6ZkiHw+A0oNWCnJLME9/6vJUGsnPL
TEw4YI0fX5ZUvItt0skSSmI5EhaZn09gWEBKRjXVoGCOWVlXbOURkOcbemhsF1pGsRE2WKiOSvsr
Xj/5swkAA5csea1TW5mQ1qe7GBls6QBYapkxEMmJxXvatxFWjHVT3lKV0YVR3SI2CxOBePUgWxiL
ZkQccl+PGBWmkD7vW62bu1Lkp8edf7R/E653pi+e4WjLkN2wKl1uBbRroFsT71NzNBalvR/ZkFaa
2I04koI49ijYuqNojN5PoutNAVijyJDA9xMn1Z5UTdUB7LNerWiU64fUl+cgCC1g+nU2IOH7MEbv
gT0Mr5V+XAeLJUJSkFmxqg75U+mnUkpFF2dJiWivjvnuFO+khdjbVYNMD11n4fCQvN9AywzH23uo
03iOY1uv27ENeBfieFxiRwFfEkPDgTyIL3W6zgL0MEvxetk5kc0EJTlhvin7PwD/BtosN2dlfPvw
cjTKbdf43fru+WnFknH4cQq1LzN/foZqp+4FmoLjCvda21+Ckediz5mOhl0Gzuof8AuDFvReF5OU
Or, without the (useless) cat+pipe:
base64 /dev/urandom
(Same kind of output ^^ )
EDIT: you can also use the --wrap option of base64, to avoid having "short lines":
base64 --wrap=0 /dev/urandom
This will remove wrapping, and you'll get "full-screen" display ^^
A number of folks have suggested catting and piping through base64 or uuencode. One issue with this is that you can't control how much data to read (it will continue forever, or until you hit ctrl+c). Another possibility is to use the dd command, which will let you specify how much data to read before exiting. For example, to read 1kb:
dd if=/dev/urandom bs=1k count=1 2>/dev/null | base64
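If GNU head is available, a hedged shorter variant reads the same fixed amount without dd:
# 1K = 1024 bytes with GNU head
head -c 1K /dev/urandom | base64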
Another option is to pipe to the strings command, which may give more variety in its output (non-printable characters are discarded, and any runs of at least 4 printable characters [by default] are displayed). The problem with strings is that it displays each "run" on its own line.
dd if=/dev/urandom bs=1k count=1 2>/dev/null | strings
(of course you can replace the entire command with
strings /dev/urandom
if you don't want it to ever stop).
If you want something really funky, try one of:
cat -v /dev/urandom
dd if=/dev/urandom bs=1k count=1 2>/dev/null | cat -v
So, what is wrong with
cat /dev/urandom | uuencode -
?
Fixed after the first attempt didn't actually work... ::sigh::
BTW-- Many unix utilities use '-' in place of a filename to mean "use the standard input".
There are already several good answers on how to base64 encode random data (i.e. cat /dev/urandom | base64). However in the body of your question you elaborate:
... encode [urandom] on the command-line in such a way that all of its output is readable characters, base64 or uuencode for example.
Given that you don't actually require parseable base64 and just want it to be readable, I'd suggest
cat /dev/urandom | tr -dC '[:graph:]'
base64 only outputs alphanumeric characters and two symbols (+ and / by default). [:graph:] will match any printable non-whitespace ascii, including many symbols/punctuation-marks that base64 lacks. Therefore using tr -dC '[:graph:]' will result in a more random-looking output, and have better input/output efficiency.
I often use < /dev/random stdbuf -o0 tr -Cd '[:graph:]' | stdbuf -o0 head --bytes 32 for generating strong passwords.
You can do more interesting stuff with bash's process substitution (FIFO pipes):
uuencode <(head -c 200 /dev/urandom | base64 | gzip)
cat /dev/urandom | tr -dc 'a-zA-Z0-9'
Try
xxd -ps /dev/urandom
xxd(1)

GNU Parallel: split file into children

Goal
Use GNU Parallel to split a large .gz file into children. Since the server has 16 CPUs, create 16 children. Each child should contain, at most, N lines. Here, N = 104,214,420 lines. Children should be in .gz format.
Input File
name: file1.fastq.gz
size: 39 GB
line count: 1,667,430,708 (uncompressed)
Hardware
36 GB Memory
16 CPUs
HPCC environment (I'm not admin)
Code
Version 1
zcat "${input_file}" | parallel --pipe -N 104214420 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
Three days later, the job was not finished. split_log.txt was empty. No children were visible in the output directory. Log files indicated that Parallel had increased the --block-size from 1 MB (the default) to over 2 GB. This inspired me to change my code to Version 2.
Version 2
# --block-size 3000000000 means a single record could be 3 GB long. Parallel will increase this value if needed.
zcat "${input_file}" | "${parallel}" --pipe -N 104214420 --block-size 3000000000 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
The job has been running for ~2 hours. split_log.txt is empty. No children are visible in the output directory yet. So far, log files show the following warning:
parallel: Warning: --blocksize >= 2G causes problems. Using 2G-1.
Questions
How can my code be improved ?
Is there a faster way to accomplish this goal ?
Let us assume that the file is a fastq file, and that the record size therefore is 4 lines.
You tell that to GNU Parallel with -L 4.
In a fastq file the order does not matter, so you want to pass blocks of n*4 lines to the children.
To do that efficiently you use --pipe-part, except --pipe-part does not work with compressed files and does not work with -L, so you have to settle for --pipe.
zcat file1.fastq.gz |
parallel -j16 --pipe -L 4 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
This will pass a block to 16 children; a block defaults to 1 MB and is chopped at a record boundary (i.e. 4 lines). It will run a job for each block. But what you really want is to have the input passed to only 16 jobs in total, and you can do that with round robin. Unfortunately there is an element of randomness in --round-robin, so --resume-failed will not work:
zcat file1.fastq.gz |
parallel -j16 --pipe -L 4 --joblog split_log.txt --round-robin "gzip > ${input_file}_child_{#}.gz"
parallel will be struggling to keep up with the 16 gzips, but you should be able to compress 100-200 MB/s.
Now if you had the fastq file uncompressed, we could do it even faster, but we will have to cheat a little: often in fastq files you will have seqnames that start with the same string:
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333
Here it is @EAS54_6_R. Unfortunately this is also a valid string in the quality line (which is a really dumb design), but in practice we would be extremely surprised to see a quality line starting with @EAS54_6_R. It just does not happen.
We can use that to our advantage, because now you can use \n followed by @EAS54_6_R as a record separator, and then we can use --pipe-part. The added benefit is that the order will remain the same. Here you would have to set the block size to 1/16 of the size of file1.fastq:
parallel -a file1.fastq --block <<1/16th of the size of file1.fastq>> -j16 --pipe-part --recend '\n' --recstart '@EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"
If you use GNU Parallel 20161222 then GNU Parallel can do that computation for you. --block -1 means: Choose a block-size so that you can give one block to each of the 16 jobslots.
parallel -a file1.fastq --block -1 -j16 --pipe-part --recend '\n' --recstart '@EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"
Here GNU Parallel will not be the limiting factor: It can easily transfer 20 GB/s.
It is annoying having to open the file to see what the recstart value should be, so this will work in most cases:
parallel -a file1.fastq --pipe-part --block -1 -j16 \
  --regexp --recend '\n' --recstart '@.*\n[A-Za-z\n\.~]' \
  my_command
Here we assume that the lines will start like this:
@<anything>
[A-Za-z\n\.~]<anything>
<anything>
<anything>
Even if you have a few quality lines starting with '@', they will never be followed by a line starting with [A-Za-z\n.~], because a quality line is always followed by the seqname line, which starts with @.
You could also have a block size so big that it corresponded to 1/16 of the uncompressed file, but that would be a bad idea:
You would have to be able to keep the full uncompressed file in RAM.
The last gzip will only be started after the last byte has been read (and the first gzip will probably be done by then).
By setting the number of records to 104214420 (using -N) this is basically what you are doing, and your server is probably struggling with keeping the 150 GB of uncompressed data in its 36 GB of RAM.
Paired end poses a restriction: The order does not matter, but the order must be predictable for different files. E.g. record n in file1.r1.fastq.gz must match record n in file1.r2.fastq.gz.
split -n r/16 is very efficient for doing simple round-robin. It does, however, not support multiline records. So we insert \0 as a record separator after every 4th line, which we remove after the splitting. --filter runs a command on the input, so we do not need to save the uncompressed data:
doit() { perl -pe 's/\0//' | gzip > "$FILE.gz"; }
export -f doit
zcat big.gz | perl -pe '($.-1)%4 or print "\0"' | split -t '\0' -n r/16 --filter doit - big.
Filenames will be named big.aa.gz .. big.ap.gz.
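A hedged sanity check afterwards (GNU coreutils assumed): the children together should contain the same number of lines as the input. Note that this decompresses everything again, so it is slow:
# The two counts should match, since the inserted \0 separators carry no newlines
zcat big.gz | wc -l
zcat big.??.gz | wc -l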

Function in bash that generates random chars from /dev/random

In my .bashrc file I appended the line psswd() { LC_ALL=C tr -dc 'a-zA-Z0-9-!"#/#$%^&*()_+~' < /dev/urandom | head -c "$1";echo ;} so that when I type psswd n in the terminal, it returns a string of n random characters. I would like to achieve the same but using /dev/random instead of /dev/urandom. However, when I replace /urandom with /random, calling psswd does nothing (it cannot output even a single random character after 1 hour); it's as if it's frozen. I don't know why this is so, and I know it's not a problem of not having enough entropy, because the command od -An -N1 -i /dev/random returns a random number.
Note that this last command returns a random number almost instantly if I type it, say, after a fresh reboot. But if I have invoked a call to psswd n with /random, then the command returns a random number only after about 15 seconds. So the call to /random seems to have some effect on /dev/random, even though it produces no output when I call the function psswd.
Overall I'd like to know how I could create a function that uses /dev/random to generate a random string of n characters.
This happens because libc will buffer tr's output when it's not a terminal. On GNU/Linux, that buffer is 4096 bytes. This means that tr has to produce 4096 bytes of output before head will see the first few bytes, even though it only asks for e.g. 8.
Since you only keep 78 out of 256 values, /dev/random has to produce on average 4096*256/78 = 13443 bytes of random output before you get your password.
/dev/random on my system, starting from an empty pool, took 26 seconds to generate 20 bytes. That means those bytes would take 13443*26/20 = 17475 seconds, or almost 5 hours, to generate a password.
At this point it would print the password, but it would require another bufferful for tr to realize that head doesn't want any more, so it would take another 5 hours before the command would exit.
If you disabled buffering, you would only need to generate (8+1)*256/78 = 29 bytes, which would take a mere ~38 seconds. On GNU/Linux, you can do this with stdbuf -o0:
$ time { LC_ALL=C stdbuf -o0 tr -dc 'a-zA-Z0-9-!"#/#$%^&*()_+~' < /dev/random | head -c 8; echo; }
9D^MKbT)
real 0m36.172s
user 0m0.000s
sys 0m0.010s
…when I replace /urandom by /random, calling psswd does nothing…
That happens. As Wikipedia explains:
In Unix-like operating systems, /dev/random is a special file that serves as a blocking pseudorandom number generator....
A counterpart to /dev/random is /dev/urandom ("unlimited"/non-blocking random source) [emphasis added].
"Blocking" means that when it thinks it has run out of entropy, it stops producing numbers. For example, on my system, the first command in your pipeline produces the following:
$ LC_ALL=C tr -dc 'a-zA-Z0-9-!"#/#$%^&*()_+~' < /dev/random
~Sk(+!h
And, then, it hangs, presumably while waiting for more entropy.
This issue is discussed further here where it is argued that /dev/urandom is good enough.
Speeding it up
tr appears to buffer its output which delays the appearance of characters. The work-around is to use stdbuf and, for me, this results in a substantial speed-up:
LC_ALL=C stdbuf -o0 tr -dc 'a-zA-Z0-9-!"#/#$%^&*()_+~' < /dev/random
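Putting that back into the function from the question, a hedged sketch (GNU stdbuf assumed) would be:
# Still blocks while the kernel's entropy estimate is low, but no longer waits for a full 4096-byte buffer
psswd() { LC_ALL=C stdbuf -o0 tr -dc 'a-zA-Z0-9-!"#/#$%^&*()_+~' < /dev/random | head -c "$1"; echo; }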

Grepping a huge file (80GB) any way to speed it up?

grep -i -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql
This has been running for an hour on a fairly powerful Linux server which is otherwise not overloaded.
Any alternative to grep? Anything about my syntax that can be improved (is egrep or fgrep better)?
The file is actually in a directory which is shared via a mount with another server, but the actual disk space is local, so that shouldn't make any difference.
The grep is grabbing up to 93% CPU.
Here are a few options:
1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.
2) Use fgrep because you're searching for a fixed string, not a regular expression.
3) Remove the -i option, if you don't need it.
So your command becomes:
LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql
It will also be faster if you copy your file to RAM disk.
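For example, a hedged sketch using /dev/shm, which is usually a tmpfs mount; this only makes sense if the machine has considerably more free RAM than the file size:
# Copies the file into RAM-backed storage, then greps it there
cp eightygigsfile.sql /dev/shm/
LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' /dev/shm/eightygigsfile.sql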
If you have a multicore CPU, I would really recommend GNU parallel. To grep a big file in parallel use:
< eightygigsfile.sql parallel --pipe grep -i -C 5 'db_pd.Clients'
Depending on your disks and CPUs it may be faster to read larger blocks:
< eightygigsfile.sql parallel --pipe --block 10M grep -i -C 5 'db_pd.Clients'
It's not entirely clear from your question, but other options for grep include (a combined sketch follows the list):
Dropping the -i flag.
Using the -F flag for a fixed string
Disabling NLS with LANG=C
Setting a max number of matches with the -m flag.
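Combining those, a hedged sketch (the -m value of 100 is only a placeholder):
# Fixed string, case-sensitive, C locale, stop after 100 matches; adjust or drop -m as needed
LANG=C grep -F -m 100 -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql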
Some trivial improvements:
Remove the -i option, if you can, case insensitive is quite slow.
Replace the . with \.
A single dot is the regex symbol that matches any character, which is also slow.
Two lines of attack:
Are you sure you need the -i, or can you get rid of it?
Do you have more cores to play with? grep is single-threaded, so you might want to start more of them at different offsets.
< eightygigsfile.sql parallel -k -j120% -n10 -m grep -F -i -C 5 'db_pd.Clients'
If you need to search for multiple strings, grep -f strings.txt saves a ton of time. The above is a translation of something that I am currently testing. The -j and -n option values seemed to work best for my use case. The -F grep also made a big difference.
Try ripgrep
It provides much better results compared to grep.
All the above answers were great. What really did help me on my 111GB file was using LC_ALL=C fgrep -m <maxnum> fixed_string filename.
However, sometimes there may be 0 or more repeating patterns, in which case calculating the maxnum isn't possible. The workaround is to use the start and end patterns for the event(s) you are trying to process, and then work on the line numbers between them. Like so:
startline=$(grep -n -m 1 "$start_pattern" file|awk -F":" {'print $1'})
endline=$(grep -n -m 1 "$end_pattern" file |awk -F":" {'print $1'})
logs=$(tail -n +$startline file |head -n $(($endline - $startline + 1)))
Then work on this subset of logs!
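A hedged single-pass alternative for the last step uses sed, which quits right after the end line instead of scanning the rest of the file:
# Same range as the tail|head combination above
logs=$(sed -n "${startline},${endline}p;${endline}q" file)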
Hmm... what speeds do you need? I created a synthetic 77.6 GB file with nearly 525 mn rows with plenty of Unicode:
rows = 524759550. | UTF8 chars = 54008311367. | bytes = 83332269969.
and randomly selected rows at an average rate of 1 in every 3^5 (using rand(), not just NR % 243) to place the string db_pd.Clients at a random position in the middle of the existing text, totaling 2.16 mn rows where the regex pattern hits:
rows = 2160088. | UTF8 chars = 42286394. | bytes = 42286394.
% dtp; pvE0 < testfile_gigantic_001.txt|
mawk2 '
_^(_<_)<NF { print (__=NR-(_+=(_^=_<_)+(++_)))<!_\
?_~_:__,++__+_+_ }' FS='db_pd[.]Clients' OFS=','
in0: 77.6GiB 0:00:59 [1.31GiB/s] [1.31GiB/s] [===>] 100%
out9: 40.3MiB 0:00:59 [ 699KiB/s] [ 699KiB/s] [ <=> ]
524755459,524755470
524756132,524756143
524756326,524756337
524756548,524756559
524756782,524756793
524756998,524757009
524757361,524757372
And mawk2 took just 59 seconds to extract a list of the row ranges it needs. From there it should be relatively trivial. Some overlapping may exist.
At throughput rates of 1.3 GiB/s, as calculated by pv above, it might even be detrimental to use utilities like parallel to split the tasks.

How do I cause my computer to use a lot of RAM?

I am looking for a command (or commands) that could cause my Linux computer to use a lot of RAM.
Any pointers?
I want to run that in the background while I do some other task that needs to be tested under already-high RAM usage.
Thanks!
That one will cause growing memory usage by bash:
while read f; do arr[$((i++))]=$f; done < /dev/urandom
Another hard way might be (be careful):
while read f; do arr="$arr:$arr:$arr:$arr:$f"; done < /dev/urandom
Softer version:
while read f; do arr="$arr:$f"; done < /dev/urandom
To reclaim parts of the memory, call:
unset arr
You could also mix it with a fork bomb... but I would avoid it
The current answer will try to use more RAM until the system is simply completely out of it. For a more controlled approach, if you have basic GNU tools (head and tail) or BusyBox on Linux, you can do this to fill a certain amount of free memory:
</dev/zero head -c BYTES | tail
</dev/zero head -c 5000m | tail #~5GB, portable
</dev/zero head -c 5G | tail #5GiB on GNU (not busybox)
cat /dev/zero | head -c 5G | tail #Easier notation; does the same thing
This works because tail needs to keep the current line in memory, in case it turns out to be the last line. The line, read from /dev/zero which outputs only null bytes and no newlines, will be infinitely long, but is limited by head to BYTES bytes, thus tail will use only that much memory. For a more precise amount, you will need to check how much RAM head and tail itself use on your system and subtract that.
To just quickly run out of RAM completely, you can remove the limiting head part:
tail /dev/zero
If you want to also add a duration, this can be done quite easily in bash (will not work in sh):
cat <( </dev/zero head -c 500m) <(sleep SECONDS) | tail
The <(command) thing seems to be little known but is often extremely useful, more info on it here: http://tldp.org/LDP/abs/html/process-sub.html
The cat command will wait for its inputs to complete before exiting, and by keeping one of the pipes open, it will keep tail alive.
If you have pv and want to slowly increase RAM use:
</dev/zero head -c TOTAL | pv -L BYTES_PER_SEC | tail
</dev/zero head -c 1000m | pv -L 10m | tail
The latter will use up to one gigabyte at a rate of ten megabytes per second. As an added bonus, pv will show the current rate of use and the total use so far. Of course this can also be done with previous variants:
</dev/zero head -c 500m | pv | tail
Just inserting the | pv | part will show you the current status (throughput and total by default).
If you do not have a /dev/zero device, the standard yes and tr tools might substitute: yes | tr \\n x | head -c BYTES | tail (yes outputs an infinite amount of "yes"es, tr substitutes the newline such that everything becomes one huge line and tail needs to keep all that in memory).
Another, simpler alternative is using dd: dd if=/dev/zero bs=1G of=/dev/null uses 1GB of memory on GNU and BusyBox, but also 100% CPU on one core.
Finally, if your head does not accept a suffix, you can calculate an amount of bytes inline, for example 50 megabytes: head -c $((1024*1024*50))
Cross-posted from my answer on the Unix StackExchange
ulimit -m can be used to impose an artificial memory limit on a process.
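On current Linux kernels the RSS limit set by ulimit -m is often not enforced, so a hedged alternative is to limit virtual memory instead (1 GiB here, applied only inside the subshell):
# some_command is a placeholder for whatever you want to run under the limit; 1048576 is in KiB
( ulimit -v 1048576; some_command )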
Use this at your own risk:
:(){ :|:& };:
=> Explosion in your RAM
