Counting integer frequency through pipe - bash

Description
I have a for loop in bash with 10^4 iterations in total. In each iteration, a list of roughly 10^7 numbers is generated from a pipe, each number an integer between 1 and 10^8. I want to keep track of how many times each integer appeared. The ideal output would be a .txt file with 10^8 lines, each line containing the count for the integer corresponding to the line number.
As a significant proportion of integers do not appear while others appear nearly every iteration, I imagined using a hashmap, so as to limit analysis to numbers that have appeared. However, I do not know how to fill it with numbers appearing sequentially from a pipe. Any help would be greatly appreciated!
Reproducible example:
sample.R
args = commandArgs(trailingOnly=TRUE)
n_samples = as.numeric(args[1])
n_max = as.numeric(args[2])
v = as.character(sample(1:n_max, n_samples))
writeLines(v)
for loop:
for i in {1..n_loops}
do
Rscript sample.R n_samples n_max | "COLLECT AND INCREMENT HERE"
done
where in my case n_loops = 10^4, n_samples = 10^7 and n_max = 10^8.

Simple Approach
Before doing premature optimization, try the usual approach with sort | uniq -c first -- if that is fast enough, you have less work and a shorter script. To speed things up without too much hassle, give sort more memory with -S and use the simple C locale by setting LC_ALL=C.
for i in {1..10000}; do
Rscript sample.R n_samples n_max
done | LC_ALL=C sort -nS40% | LC_ALL=C uniq -c
The output will have lines of the form number_of_matches integer_from_the_output. Only integers which appeared at least once will be listed.
To convert this format (inefficiently) into your preferred format with 10^8 lines, each containing the count for the integer corresponding to the line number, replace the ... | sort | uniq -c part with the following command:
... | cat - <(seq 100''000''000) | LC_ALL=C sort -nS40% | LC_ALL=C uniq -c | awk '{$1--;$2=""}1'
This assumes that all the generated integers are between 1 and 10^8 inclusive: seq contributes exactly one extra occurrence of every integer in that range, the $1-- in the awk script subtracts that extra occurrence again, and $2="" drops the integer itself, so only the count remains on the line whose number matches the integer. The result gets mangled if any values outside this range appear.
Hash Map
If you want to go with the hash map, the simplest implementation would probably be an awk script:
for i in {1..10000}; do
Rscript sample.R n_samples n_max
done | awk '{a[$0]++} END {for (ln=1; ln<=100000000; ln++) print int(a[ln])}'
However, I'm unsure whether this is such a good idea. The hash map could allocate much more memory than the actual data requires and is probably slow for that many entries.
Also, your awk implementation has to support large numbers; 32-bit integers are not sufficient. If the entire output is just the same integer repeated over and over again, you can get up to ...
10^4 iterations * 10^7 occurrences/iteration = 10^(4+7) occurrences = 10^11 occurrences
... of that integer. To store the maximal count of 10^11 you need at least 37 bits, since log2(10^11) ≈ 36.5.
GNU awk 5 on a 64-bit system seems to handle numbers of that size; it stores numbers as double-precision floats, which represent integers exactly up to 2^53.
Faster Approach
Counting occurrences in a data structure is a good idea. However, a hash map is overkill as you have "only" 10^8 possible values as output. Therefore, you can use an array with 10^8 entries of 64-bit counters. The array would use ...
64 bit * 10^8 = 8 Byte * 10^8 = 800 * 10^6 Byte = 800 MByte
... of memory. I think 800 MByte should be available even on old PCs and laptops from 10 years ago.
To implement this approach, use a "normal" programming language of your choice. Bash is not the right tool for this job. You can use bash to pipe the output of the loop into your program. Alternatively, you can execute the for loop directly in your program.
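For illustration, here is a rough C sketch of such a counter program (my own example, not part of the original answer; the program name count and the exact I/O handling are arbitrary choices). It reads one integer per line from stdin and prints 10^8 lines, where line i holds the count for integer i:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>

#define N_MAX 100000000L   /* integers are between 1 and 10^8 */

int main(void) {
    /* ~800 MByte of zero-initialized 64-bit counters */
    uint64_t *count = calloc(N_MAX, sizeof *count);
    if (!count) { perror("calloc"); return 1; }

    long v;
    while (scanf("%ld", &v) == 1)      /* one integer per line from the pipe */
        if (v >= 1 && v <= N_MAX)
            count[v - 1]++;            /* 64-bit counters cannot overflow at 10^11 */

    for (long i = 0; i < N_MAX; i++)   /* output line i+1 holds the count for integer i+1 */
        printf("%" PRIu64 "\n", count[i]);

    free(count);
    return 0;
}
Compile it with something like gcc -O2 -o count count.c and pipe the output of the whole for loop into ./count > counts.txt.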

Related

bash script problem with understanding how shuf works

I have the following problem understanding this line of code
for NUMBER in $(shuf -i1-$MAX_NUMBER)
Do I understand correctly that this takes consecutive numbers up to "$MAX_NUMBER", or does "shuf -i1-" change something?
shuf -i1-$MAX_NUMBER prints a random permutation of the numbers in the range 1 to $MAX_NUMBER (i.e., not consecutive).
This means that in each iteration of the loop, the value of $NUMBER will be a random value between 1 and $MAX_NUMBER, until all numbers have been used.

Solved: Grep and Dynamically Truncate at Same Time

Given the following:
for(condition which changes $z)
aptitude show $z | grep -E 'Uncompressed Size: |x' | sed "s/Uncompressed Size: //"
done
That means 3 items are output to the screen ($z, Uncompressed Size, x).
I want all of that to fit on one line, and a line I deem to be 100 characters.
So ($z, Uncompressed Size, x) must fit on one line, but x is very long and will have to be truncated. There is therefore a requirement to add up the characters "used" by $z and Uncompressed Size so that x can be truncated dynamically. I love scripting, and being able to do this I deem an absolute must. Needless to say, all 3 items output to the screen change, so the characters of the first two outputs must be calculated and subtracted from the characters allowed for x, and the sum of characters across all 3 items cannot exceed 100.
sed 's/.//5g'
Lmao, sometimes I wish I thought in simpler terms; complicated description + simple solution = simple problem over complicated by interpreter.
Thank you, Barmar
That only leaves sed with (100 - the amount of characters used by $z, which is ${#z}).

Fastest way to delete duplicates in large wordlist? [duplicate]

This question already has answers here:
`uniq` without sorting an immense text file?
A similar question was asked here, but it didn't address why there is a speed difference between sort and awk.
I asked this question first on Unix Stack Exchange, but since they told me it would be a good question for Stack Overflow, I'll post it here.
I need to deduplicate a large wordlist. I tried several commands and did some research here and here, where they explained that the fastest way to deduplicate a wordlist seems to be awk, because awk doesn't sort the list: it uses hash lookups to keep track of the items and delete duplicates. Since awk uses hash lookups, they argued that the big O is like this:
awk --> O(n) ?
sort --> O(n log n) ?
However, I found that this isn't true. Here are my testing results. I generated two random wordlists using this Python script.
List1 = 7 MB
List2 = 690 MB
Test commands
sort -u input.txt -o output.txt
awk '!x[$0]++' input.txt > output.txt
Results AWK:
List1
real 0m1.643s
user 0m1.565s
sys 0m0.062s
List2
real 2m6.918s
user 2m4.499s
sys 0m1.345s
Results SORT:
List1
real 0m0.724s
user 0m0.666s
sys 0m0.048s
List2
real 1m27.254s
user 1m25.013s
sys 0m1.251s
I ran these tests over and over again and found consistent results, namely that sort is a lot faster. Could someone explain why, and whether there is an even faster way to do it?
************ Update ***********
Things that could have flawed my outcomes are:
Caching: I've excluded this possibility by changing the order of execution of the tests.
Constant factors of the big-O notation: I think they should have become irrelevant at this point due to the size of the wordlists (600 MB).
Bad implementation of the algorithms: this remains a possibility; I haven't checked the source code of awk and sort.
Your sample input has a lot of duplicate values; you only have 1,000,000 distinct values in a sample size of 100,000,000, so you would expect only 1% of the values to be unique. I don't know exactly how sort -u works, but I imagine it is a merge sort that filters out duplicate values during each merge. The effective input size would then be much smaller than 100,000,000. Rerunning your commands with only 1,000,000 values, but chosen from 500,000 distinct values (so that 50%, not 1%, are expected to be unique) produces the following results:
% time awk '!x[$0]++' randomwordlist.txt > /dev/null
awk ... 1.32s user 0.02s system 99% cpu 1.338 total
% time sort -u randomwordlist.txt -o /dev/null
sort ... 14.25s user 0.04s system 99% cpu 14.304 total
The big-O notation only tells you that there is some N beyond which O(N) will be faster than O(N*log N). The actual number of operations includes constant factors and additive terms, so in reality the numbers are more like O(N) ~ k1 * N + c1 and O(N * log N) ~ k2 * N * log(N) + c2. Which one is faster for a chosen N depends on the values of k and c.
Some input/algorithm combinations lead to very small k and c.
Either program may not use the optimum algorithm.
Caching effects? If you always run test 1 before test 2, the second test may use already-cached data, while the first always has to load from scratch. Proper elimination/determination of cache effects is an art.
Something else I haven't thought of and others will be quick to point out :-)
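To make the merge-with-deduplication idea above concrete, here is a toy C sketch (my own illustration, not sort's actual source; merge_unique is a made-up helper). Merging two already sorted runs while skipping repeated values means every merge pass emits only distinct values, so later passes operate on far less data than the original input:
#include <stdio.h>
#include <string.h>

/* Merge two sorted string arrays, emitting each distinct value only once.
 * Returns the number of strings written to out. */
size_t merge_unique(const char **a, size_t na,
                    const char **b, size_t nb, const char **out) {
    size_t i = 0, j = 0, n = 0;
    const char *last = NULL;
    while (i < na || j < nb) {
        const char *next;
        if (j >= nb || (i < na && strcmp(a[i], b[j]) <= 0))
            next = a[i++];
        else
            next = b[j++];
        if (last == NULL || strcmp(next, last) != 0)   /* emit only new values */
            out[n++] = next;
        last = next;
    }
    return n;
}

int main(void) {
    const char *run1[] = { "apple", "banana", "pear" };
    const char *run2[] = { "apple", "pear", "plum" };
    const char *out[6];
    size_t n = merge_unique(run1, 3, run2, 3, out);
    for (size_t i = 0; i < n; i++)
        puts(out[i]);   /* prints 4 lines (apple, banana, pear, plum) instead of 6 */
    return 0;
}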

Is there a way to split a large file into chunks with random sizes?

I know you can split a file with split, but for test purposes I would like to split a large file into chunks whose sizes differ. Is this possible?
Alternatively, if the above-mentioned file is a zip, is there a way to split it into volumes of unequal sizes?
Any suggestions welcome! Thanks!
So the general question that you're asking is: how can I compute N random integers that sum to S? Specifically, S is the size of your file and N is how many smaller files you want to break it into.
For example, assume that you want to split your file into 4 parts. If a, b, c, and d are four random numbers, then:
a + b + c + d = X
a/X + b/X + c/X + d/X = 1
S*a/X + S*b/X + S*c/X + S*d/X = S
This gives us four random numbers that sum to S, the size of your file.
Which means you'd want to write a script that:
Computes N random numbers (any random numbers).
Computes X as the sum of those random numbers.
Multiplies each of those random numbers by S/X (and makes sure you're left with integers greater than 0 that sum to S)
Splits the original file into pieces using the generated random numbers as sizes, using whatever tool you want.
This is a little much for a shell script, but would be pretty straightforward in something like Perl.
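As a rough illustration of that calculation (my own sketch, written in C rather than Perl; it assumes the total size is much larger than the number of chunks, so no chunk rounds down to zero), a small helper could print the chunk sizes, which you would then feed to dd, head -c, or a similar tool:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s total_bytes n_chunks\n", argv[0]);
        return 1;
    }
    long total = atol(argv[1]);            /* S: size of the file in bytes */
    int n = atoi(argv[2]);                 /* N: number of chunks wanted */
    if (total < 1 || n < 1) return 1;

    srand((unsigned)time(NULL));
    long r[n];
    long x = 0;
    for (int i = 0; i < n; i++) {          /* step 1: pick any random numbers */
        r[i] = rand() % 1000 + 1;
        x += r[i];                         /* step 2: X = their sum */
    }

    long used = 0;
    for (int i = 0; i < n; i++) {          /* step 3: scale each one by S/X */
        long size = (i == n - 1)
            ? total - used                 /* last chunk absorbs the rounding error */
            : r[i] * total / x;
        used += size;
        printf("%ld\n", size);             /* step 4: use these sizes to split the file */
    }
    return 0;
}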
Since you tagged the question only with shell, I assume you want to handle it with a shell script and the common Linux commands/tools.
As far as I know, there is no existing tool/command that can split a file into random-sized pieces. To split a file, we can consider split or dd.
Both tools support options for how big each split file should be, or how many files to split into. Let's say we first use dd/split to split your file into 500 parts of equal size, so we have:
foo.zip.001
foo.zip.002
foo.zip.003
...
foo.zip.500
Then we take this file list as input and merge (cat) groups of them back together. This step could be done with awk or a shell script.
For example, we can build a set of cat statements like:
cat foo.zip.001 foo.zip.002 > part1
cat foo.zip.003 foo.zip.004 foo.zip.005 > part2
cat foo.zip.006 foo.zip.007 foo.zip.008 foo.zip.009 > part3
....
Run the generated cat statements and you get the final part1..partN, each part with a different size.
For example:
kent$ seq -f'foo.zip.%g' 20|awk 'BEGIN{i=k=2}NR<i{s=s sprintf("%s ",$0);next}{k++;i=(NR+k);print "cat "s$0" >part"k-2;s="" }'
cat foo.zip.1 foo.zip.2 >part1
cat foo.zip.3 foo.zip.4 foo.zip.5 >part2
cat foo.zip.6 foo.zip.7 foo.zip.8 foo.zip.9 >part3
cat foo.zip.10 foo.zip.11 foo.zip.12 foo.zip.13 foo.zip.14 >part4
cat foo.zip.15 foo.zip.16 foo.zip.17 foo.zip.18 foo.zip.19 foo.zip.20 >part5
How the performance compares you'll have to test on your own... but at least this should work for your requirement.

Fastest way to sort files

I have a huge text file with lines like:
-568.563626 159 33 -1109.660591 -1231.295129 4.381508
-541.181308 159 28 -1019.279615 -1059.115975 4.632301
-535.370812 155 29 -1033.071786 -1152.907805 4.420473
-533.547101 157 28 -1046.218277 -1063.389677 4.423696
What I want is to sort the file by the 5th column, so I would get
-568.563626 159 33 -1109.660591 -1231.295129 4.381508
-535.370812 155 29 -1033.071786 -1152.907805 4.420473
-533.547101 157 28 -1046.218277 -1063.389677 4.423696
-541.181308 159 28 -1019.279615 -1059.115975 4.632301
For this I use:
for i in file.txt ; do sort -k5n $i ; done
I wonder if this is the fastest or most efficient way.
Thanks
Why use for? Why not just:
sort -k5n file.txt
Which sort is more efficient depends on a number of issues. You could no doubt make a faster sort for specific data sets (size and other properties); bubble sort can actually outperform other sorts with particular inputs.
However, have you tested the standard sort and established that it's too slow? That's the first thing you should do. My machine (which is by no means the gruntiest on the planet) can do 4 million of those lines in under ten seconds:
real 0m9.023s
user 0m8.689s
sys 0m0.332s
Having said that, there is at least one trick which may speed it up: transform the file into fixed-length records with fixed-length fields before sorting it. Sorting on a specific set of character positions in fixed-length records can often be much faster than the more flexible sorting of variable-length fields and records that sort otherwise has to perform.
That way, you add an O(n) operation (the transformation) to speed up what is probably at best an O(n log n) operation (the sort).
But, as with all optimisations, measure, don't guess!
If you have many different files to sort you may use a loop; however, since you have only one file, just pass the filename to sort:
$ sort -k5n file

Resources