Fastest way to delete duplicates in large wordlist? [duplicate] - bash

This question already has answers here: `uniq` without sorting an immense text file? (6 answers)
Closed 7 years ago.
A similar question was asked here, but it didn't address why there is a speed difference between sort and awk.
I asked this question first on Unix Stack Exchange, but since I was told it would be a better fit for Stack Overflow, I'm posting it here.
I need to deduplicate a large wordlist. I tried several commands and did some research here and here, where it was explained that the fastest way to deduplicate a wordlist seems to be awk, because awk doesn't sort the list: it uses hash lookups to keep track of the items and delete duplicates. Since awk uses hash lookups, they argued that the big O is like this:
awk --> O(n) ?
sort --> O(n log n) ?
However, I found that this isn't true. Here are my test results. I generated two random wordlists using this Python script.
List1 = 7 MB
List2 = 690 MB
Test commands
sort -u input.txt -o output.txt
awk '!x[$0]++' input.txt > output.txt
Results AWK:
List1
real 0m1.643s
user 0m1.565s
sys 0m0.062s
List2
real 2m6.918s
user 2m4.499s
sys 0m1.345s
Results SORT:
List1
real 0m0.724s
user 0m0.666s
sys 0m0.048s
List2
real 1m27.254s
user 1m25.013s
sys 0m1.251s
I ran these tests over and over again and found consistent results, namely that sort is a lot faster. Could someone explain why, and whether there is an even faster way to do it?
Update
Things that could have flawed my outcomes are:
Caching: I've excluded this possibility by changing the order of execution of the tests.
Constant factors of the big O notation: I think they should have become irrelevant at this point due to the size of the wordlists (690 MB).
Bad implementation of the algorithms: this remains a possibility; I haven't checked the source code of awk and sort.

Your sample input has a lot of duplicate values; you only have 1,000,000 distinct values in a sample size of 100,000,000, so you would expect only 1% of the values to be unique. I don't know exactly how sort -u works, but imagine it is a merge sort that filters out duplicates during each merge. The effective input size would then be much smaller than 100,000,000. Rerunning your commands with only 1,000,000 values, but chosen from 500,000 distinct values (so that 50%, not 1%, are expected to be unique), produces the following results:
% time awk '!x[$0]++' randomwordlist.txt > /dev/null
awk ... 1.32s user 0.02s system 99% cpu 1.338 total
% time sort -u randomwordlist.txt -o /dev/null
sort ... 14.25s user 0.04s system 99% cpu 14.304 total

The big-O notation only tells you that there is some N beyond which an O(N) algorithm will be faster than an O(N log N) one. The actual number of operations includes constant factors and additive terms, so in reality the running times look more like O(N) ~ k1 * N + c1 and O(N log N) ~ k2 * N * log(N) + c2. Which one is faster for a chosen N depends on the values of the k and c (a toy illustration follows after this list). Possible explanations for your results:
Some input/algorithm combinations lead to very small k and c.
Either program may not use the optimum algorithm.
Caching effects? If you always run test 1 before test 2, the second test may use already cached data, while the first always has to load from scratch. Proper elimination/determination of cache effects is an art.
Something else I haven't thought of and others will be quick to point out :-)
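To make that concrete, here is a toy Python sketch with made-up constants (k1 = 5, k2 = 0.5, additive terms ignored; these numbers are illustrative, not measured from awk or sort):
from math import log2

k1, k2 = 5.0, 0.5                      # made-up constants, not measurements
linear = lambda n: k1 * n              # "O(N)" with its constant factor
nlogn = lambda n: k2 * n * log2(n)     # "O(N log N)" with its constant factor

for n in (100, 1024, 10000, 1000000):
    # the linear method only wins once log2(n) > k1/k2 = 10, i.e. n > 1024
    print(n, "linear wins" if linear(n) < nlogn(n) else "n log n wins")
With a big enough constant on the hash-based approach (allocation, hashing, rehashing), the asymptotically slower sort can still win at the sizes tested above.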

Related

Counting integer frequency through pipe

Description
I have a for loop in bash with 10^4 iterations in total. In each iteration, a list of roughly 10^7 numbers is generated from a pipe, each number an integer between 1 and 10^8. I want to keep track of how many times each integer appeared. The ideal output would be a .txt file with 10^8 lines, each line containing the counter for the integer corresponding to the line number.
As a significant proportion of integers do not appear while others appear in nearly every iteration, I imagined using a hashmap, so as to limit the analysis to numbers that have appeared. However, I do not know how to fill it with numbers arriving sequentially from a pipe. Any help would be greatly appreciated!
Reproducible example:
sample.R
args = commandArgs(trailingOnly=TRUE)
n_samples = as.numeric(args[1])
n_max = as.numeric(args[2])
v = as.character(sample(1:n_max, n_samples))
writeLines(v)
for loop:
for i in {1..n_loops}
do
Rscript sample.R n_samples n_max | "COLLECT AND INCREMENT HERE"
done
where in my case n_loops=10^4, n_samples=10^7, n_max=10^8.
Simple Approach
Before doing premature optimization, try the usual approach with sort | uniq -c first -- if that is fast enough, you have less work and a shorter script. To speed things up without too much hassle, increase the sort buffer with -S and use the simplest locale with LC_ALL=C.
for i in {1..10000}; do
Rscript sample.R n_samples n_max
done | LC_ALL=C sort -nS40% | LC_ALL=C uniq -c
The output will have lines of the form number_of_matches integer_from_the_output. Only integers which appeared at least once will be listed.
To convert this format (inefficiently) into your preferred format with 10^8 lines, each containing the count for the integer corresponding to the line number, replace the ... | sort | uniq -c part with the following command:
... | cat - <(seq 100''000''000) | LC_ALL=C sort -nS40% | LC_ALL=C uniq -c | awk '{$1--;$2=""}1'
This assumes that all the generated integers are between 1 and 10^8 inclusive. The result gets mangled if any other values appear more than once.
Hash Map
If you want to go with the hash map, the simplest implementation would probably be an awk script:
for i in {1..10000}; do
Rscript sample.R n_samples n_max
done | awk '{a[$0]++} END {for (ln=1; ln<=100000000; ln++) print int(a[ln])}'
However, I'm unsure whether this is such a good idea. The hash map could allocate much more memory than the actual data requires and is probably slow for that many entries.
Also, your awk implementation has to support large numbers. 32-bit integers are not sufficient. If the entire output is just the same integer repeated over and over again, you can get up to ...
10^4 iterations * 10^7 occurrences/iteration = 10^(4+7) occurrences = 10^11 occurrences
... of that integer. To store the maximal count of 10^11 you need at least 37 bits, since 37 > log2(10^11).
GNU awk 5 on a 64-bit system seems to handle numbers of that size.
Faster Approach
Counting occurrences in a data structure is a good idea. However, a hash map is overkill, as you have "only" 10^8 possible values as output. Therefore, you can use an array with 10^8 entries of 64-bit counters. The array would use ...
64 bit * 10^8 = 8 byte * 10^8 = 800 MByte
... of memory. I think 800 MByte should be free even on old PCs and Laptops from 10 years ago.
To implement this approach, use a "normal" programming language of your choice. Bash is not the right tool for this job. You can use bash to pipe the output of the loop into your program. Alternatively, you can execute the for loop directly in your program.
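For illustration only, here is a minimal Python sketch of that flat-array approach (the script name count_ints.py and the output file name are my own assumptions; it expects one integer per line on stdin):
#!/usr/bin/env python3
# count_ints.py -- hypothetical helper, not part of the original answer.
# A flat array of 10^8 signed 64-bit counters, indexed by value.
import sys
from array import array

N_MAX = 10**8                       # values are assumed to lie in 1..10^8
counts = array('q', [0]) * N_MAX    # 'q' = 64-bit; roughly the 800 MB computed above

for line in sys.stdin:
    counts[int(line) - 1] += 1      # shift to a 0-based index

with open("counts.txt", "w") as out:
    for c in counts:
        out.write("%d\n" % c)       # line i holds the count for integer i
It could be fed from bash much like in the question, e.g. for i in {1..10000}; do Rscript sample.R 10000000 100000000; done | python3 count_ints.py.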

How to compare all the lines in a sorted file (file size > 1GB) in a very efficient manner

Let's say the input file is:
Hi my name NONE
Hi my name is ABC
Hi my name is ABC
Hi my name is DEF
Hi my name is DEF
Hi my name is XYZ
I have to create the following output:
Hi my name NONE 1
Hi my name is ABC 2
Hi my name is DEF 2
Hi my name is XYZ 1
The number of words in a single line can vary from 2 to 10. File size will be more than 1GB.
How can I get the required output in the minimum possible time? My current implementation uses a C++ program that reads a line from the file and then compares it with the next line. The running time of this implementation will always be O(n), where n is the number of characters in the file.
To improve the running time, the next option is to use mmap. But before implementing it, I just wanted to confirm: is there a faster way to do it? Using any other language/scripting?
uniq -c filename | perl -lane 'print "@F[1..$#F] $F[0]"'
The perl step is only to take the output of uniq (which looks like "2 Hi my name is ABC") and re-order it into "Hi my name is ABC 2". You can use a different language for it, or else leave it off entirely.
As for your question about runtime, big-O seems misplaced here; surely there isn't any chance of scanning the whole file in less than O(n). mmap and strchr seem like possibilities for constant-factor speedups, but a stdio-based approach is probably good enough unless your stdio sucks.
The code for BSD uniq could be illustrative here. It does a very simple job with fgets, strcmp, and a very few variables.
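If a scripting-language comparison helps, here is a minimal Python sketch of the same adjacent-line counting (not the BSD source, just the idea), assuming the file is already sorted so duplicates are adjacent:
import sys

# Compare each line with the previous one and emit "line count" whenever
# a run of identical lines ends -- the same single pass uniq -c makes.
prev, count = None, 0
for line in sys.stdin:
    if line == prev:
        count += 1
    else:
        if prev is not None:
            sys.stdout.write("%s %d\n" % (prev.rstrip("\n"), count))
        prev, count = line, 1
if prev is not None:
    sys.stdout.write("%s %d\n" % (prev.rstrip("\n"), count))
Its output already has the "line count" layout asked for, so no perl reordering step is needed.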
In most cases this operation will be completely I/O bound (especially using well-designed C++).
Given that, it's likely the only bottleneck you need to care about is the disk.
I think you will find this to be relevant:
mmap() vs. reading blocks
Ben Collins has a very good answer comparing mmap to standard read/write.
Well, there are two time scales you are comparing that aren't really related to each other. The first is algorithmic complexity, which you are expressing in O notation. This has, however, nothing to do with the complexity of reading from a file.
Say, in the ideal case, you have all your data in memory and have to find the duplicates with an algorithm. Depending on how your data is organized (e.g. a simple list, a hash map, etc.), finding duplicates could take O(n^2), O(n), or even O(1) if you have a perfect hash (just for detecting a given item).
Reading from a file or mapping it to memory has no relation to the big-O notation, so you don't consider it in the complexity calculation at all. You just pick the option that takes less measured time, nothing more.

Print 1 followed by googolplex number of zeros

Assuming we are not concerned about running time of the program (which is practically infinite for human mortals) and using limited amount of memory (2^64 bytes), we want to print out in base 10, the exact value of 10^(googolplex), one digit at a time on screen (mostly zeros).
Describe an algorithm (which can be coded on current day computers), or write a program to do this.
Since we cannot practically check the output, we will rely on collective opinion on the correctness of the program.
NOTE: I do not know the solution, or whether a solution exists or not. The problem is my own invention. To those readers who are quick to mark this off-topic... kindly reconsider. This is difficult and a bit theoretical, but definitely CS.
This is impossible. There are more states (10^(10^100)) in the program than there are electrons in the universe (~10^80). Therefore, in our universe, there can be no such realization of a machine capable of executing the task.
First of all, we note that 10^(10^100) is equivalent to ((((10^10)^10)^...)^10), 100 times.
Or 10↑↑↑↑↑↑↑↑↑↑10.
This gives rise to the following solution:
print 1
for i in A(10, 100)
print 0
in bash:
printf 1
while true; do
printf 0
done
... close enough.
Here's an algorithm that solves this:
print 1
for 1 to 10^(10^100)
print 0
One can trivially prove correctness using Hoare logic:
There are no preconditions.
The postcondition is that a one followed by 10^(10^100) zeros is printed.
The loop invariant is that the number of zeros printed so far is equal to i.
EDIT: A machine to solve the problem needs the ability to distinguish between a googolplex of distinct states: each state is the result of printing one more zero than the previous one. The amount of memory needed to do this is the same as that needed to store the number one googolplex. If there isn't that much memory available, this problem cannot be solved.
This does not mean it isn't a computable problem: it can be solved by a Turing machine because a Turing machine has a limitless amount of memory.
There definitely is a solution to this problem in theory, assuming of course you have a machine that is capable of producing that sort of output. I'm pretty sure that a googolplex is larger than the number of atoms in the universe, at least according to what the physicists tell us, so I don't think that any physically realizable model of computation could print it out. However, mathematically speaking, you could define a Turing machine capable of printing out the value by just giving it a googolplex-ish number of states and having each write a zero and then move to the next lower state.
Consider the following:
The console window to which you are printing the output will have a maximum buffer size.
When this buffer size is exceeded, anything printed earlier is discarded, and the user will not be able to scroll back to see it.
The maximum buffer size will be minuscule compared to a googolplex.
Therefore, if you want to mimic the user experience of your program running to completion, find the maximum buffer size of the console you will print to and print that many zeroes.
Hurray laziness!

Fastest way to sort files

I have a huge text file with lines like:
-568.563626 159 33 -1109.660591 -1231.295129 4.381508
-541.181308 159 28 -1019.279615 -1059.115975 4.632301
-535.370812 155 29 -1033.071786 -1152.907805 4.420473
-533.547101 157 28 -1046.218277 -1063.389677 4.423696
What I want is to sort the file, depending on the 5th column, so I would get
-568.563626 159 33 -1109.660591 -1231.295129 4.381508
-535.370812 155 29 -1033.071786 -1152.907805 4.420473
-533.547101 157 28 -1046.218277 -1063.389677 4.423696
-541.181308 159 28 -1019.279615 -1059.115975 4.632301
For this I use:
for i in file.txt ; do sort -k5n $i ; done
I wonder if this is the fastest or most efficient way.
Thanks
Why use for? Why not just:
sort -k5n file.txt
And which sort is more efficient depends on a number of issues. You could no doubt make a faster sort for specific data sets (size and other properties); bubble sort can actually outperform other sorts with particular inputs.
However, have you tested the standard sort and established that it's too slow? That's the first thing you should do. My machine (which is by no means the gruntiest on the planet) can do 4 million of those lines in under ten seconds:
real 0m9.023s
user 0m8.689s
sys 0m0.332s
Having said that, there is at least one trick which may speed it up. Transform the file into fixed-length records with fixed-length fields before applying sort to it. Sorting on a specific set of character positions in fixed-length records can often be much faster than the more flexible sorting that variable field and record sizes require.
That way, you add an O(n) operation (the transformation) to speed up what is probably at best an O(n log n) operation (the sort).
But, as with all optimisations, measure, don't guess!
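As a rough sketch of that transformation in Python (the field width of 15 characters is an assumption that happens to fit the sample data, not something prescribed here):
import sys

WIDTH = 15                          # assumed fixed width per field
for line in sys.stdin:
    fields = line.split()
    # right-align every field to the same width so all records have equal length
    sys.stdout.write(" ".join("%*s" % (WIDTH, f) for f in fields) + "\n")
The padded output can then still be sorted with sort -k5,5n, but every record and field now has a fixed size; whether that pays off is exactly the kind of thing to measure rather than guess.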
If you have many different files to sort you may use a loop; however, since you have only one file, just pass the filename to sort:
$ sort -k5n file

How would you sort 1 million 32-bit integers in 2MB of RAM?

Please, provide code examples in a language of your choice.
Update:
No constraints set on external storage.
Example: Integers are received/sent via network. There is a sufficient space on local disk for intermediate results.
Split the problem into pieces small enough to fit into available memory, then use merge sort to combine them.
Sorting a million 32-bit integers in 2MB of RAM using Python by Guido van Rossum
1 million 32-bit integers = 4 MB of memory.
You should sort them using some algorithm that uses external storage. Mergesort, for example.
You need to provide more information. What extra storage is available? Where are you supposed to store the result?
Otherwise, the most general answer:
1. Load the first half of the data into memory (2MB), sort it by any method, and output it to a file.
2. Load the second half of the data into memory (2MB), sort it by any method, and keep it in memory.
3. Use a merge algorithm to merge the two sorted halves and output the complete sorted data set to a file.
This Wikipedia article on external sorting has some useful information.
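A minimal Python sketch of this chunk-and-merge idea (the run-file names, chunk size, and plain-text format are my own assumptions, not part of the answer):
import heapq

CHUNK = 500000                       # assumed: one run of half a million ints fits in memory

def write_run(sorted_chunk, idx):
    # Write one sorted run to disk, one integer per line.
    path = "run_%d.txt" % idx
    with open(path, "w") as f:
        f.write("\n".join(map(str, sorted_chunk)) + "\n")
    return path

def external_sort(ints, out_path):
    # Phase 1: sort fixed-size chunks in memory and write each as a run file.
    runs, chunk = [], []
    for x in ints:
        chunk.append(x)
        if len(chunk) == CHUNK:
            runs.append(write_run(sorted(chunk), len(runs)))
            chunk = []
    if chunk:
        runs.append(write_run(sorted(chunk), len(runs)))
    # Phase 2: k-way merge of the sorted runs with a heap.
    files = [open(r) for r in runs]
    with open(out_path, "w") as out:
        for value in heapq.merge(*[(int(line) for line in f) for f in files]):
            out.write("%d\n" % value)
    for f in files:
        f.close()
With a hard 2 MB limit, CHUNK has to be small enough that one run (plus interpreter overhead) fits in memory; the three steps above correspond to splitting the input into exactly two such runs.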
Dual tournament sort with polyphased merge
#!/usr/bin/env python
import random
from sort import Pickle, Polyphase

nrecords = 1000000
available_memory = 2000000  # number of bytes
# NOTE: it doesn't count memory required by the Python interpreter
record_size = 24  # (20 + 4) number of bytes per element in a Python list
heap_size = available_memory / record_size

p = Polyphase(compare=lambda x, y: cmp(y, x),  # descending order
              file_maker=Pickle,
              verbose=True,
              heap_size=heap_size,
              max_files=4 * (nrecords / heap_size + 1))

# put records
maxel = 1000000000
for _ in xrange(nrecords):
    p.put(random.randrange(maxel))

# get sorted records
last = maxel
for n, el in enumerate(p.get_all()):
    if el > last:  # elements must be in descending order
        print "not sorted %d: %d %d" % (n, el, last)
        break
    last = el
assert nrecords == (n + 1)  # check all records read
Um, store them all in a file.
Memory map the file (you said there was only 2M of RAM; let's assume the address space is large enough to memory map a file).
Sort them using the file backing store as if it were real memory now!
Here's a valid and fun solution.
Load half the numbers into memory. Heap sort them in place and write the output to a file. Repeat for the other half. Use external sort (basically a merge sort that takes file i/o into account) to merge the two files.
Aside:
Make heap sort faster in the face of slow external storage:
Start constructing the heap before all the integers are in memory.
Start putting the integers back into the output file while heap sort is still extracting elements
As people above mention, 1 million 32-bit ints take 4 MB.
To fit as many numbers as possible into as little space as possible, use the types int, short, and char in C++. You could be slick (but have odd, dirty code) by doing several kinds of casts to stuff things everywhere.
Here it is, off the top of my head.
Anything that is less than 2^8 (0-255) gets stored as a char (1-byte data type).
Anything that is less than 2^16 (256-65535) and >= 2^8 gets stored as a short (2-byte data type).
The rest of the values would be put into an int (4-byte data type).
You would want to specify where the char section starts and ends, where the short section starts and ends, and where the int section starts and ends.
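Purely to illustrate the space-saving side of that idea (it says nothing about keeping the values sortable), here is a Python sketch using fixed-width array types in place of C++ char/short/int, with made-up example data:
from array import array

# Partition non-negative values into 1-, 2-, and 4-byte containers by magnitude.
small = array('B')    # < 2^8,  1 byte each
medium = array('H')   # < 2^16, 2 bytes each
large = array('I')    # the rest, typically 4 bytes each

values = [12, 300, 70000, 255, 65535, 1 << 30]   # example data only
for v in values:
    (small if v < 2**8 else medium if v < 2**16 else large).append(v)

packed = sum(a.itemsize * len(a) for a in (small, medium, large))
print(packed, "bytes instead of", 4 * len(values))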
No example, but Bucket Sort has relatively low complexity and is easy enough to implement
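Since no example is given, here is a minimal in-memory bucket sort sketch in Python (it ignores the 2 MB constraint; honouring it would mean spilling each bucket to disk and sorting the buckets one at a time):
def bucket_sort(values, n_buckets=1024):
    # Distribute 32-bit unsigned integers into buckets by value range,
    # sort each small bucket, and concatenate the results.
    span = (2**32 + n_buckets - 1) // n_buckets
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        buckets[v // span].append(v)
    out = []
    for b in buckets:
        out.extend(sorted(b))            # any in-bucket sort will do
    return out

print(bucket_sort([7, 2**31, 42, 5]))    # [5, 7, 42, 2147483648]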
