find most frequent elements in set - algorithm

Let's say you are building an analytics program for your site, in which you log the address of a page every time it is visited, so your log.txt could be:
x.com/a
x.com/b
x.com/a
x.com/c
x.com/a
There is no counter and no SQL is used; it's just a log file. Given that it has thousands of entries but only a thousand or so unique addresses (x.com/a, x.com/b), what's the most efficient way of going through this list and spitting out the top 10 URLs?
My best solution would be to go through the log file and, if the domain does not exist in a hash table, add it as a key; otherwise increment its value. Then I would search the hash table for the 10 largest values.
I'm not convinced this is the best solution, not only because of the space complexity (what happens if the number of unique domains grows from a few thousand to a few million?), but also because I would need another pass over the hash table to find the largest values.

Even for a few thousand or a few million entries, your approach is just fine: it has a linear average run time (O(n)), so it is not bad at all.
However, you can use a map-reduce approach if you want more scalability.
map(file):
    for each entry in file:
        emitIntermediate(getDomainName(entry), "1")

reduce(domain, list):
    emit(domain, size(list))
The above will efficiently give you the list of (domain, count) tuples, and all you have to do is select the top 10.
Selecting the top 10 can be done using a map-reduce (distributed) sort for scalability, or using a min heap (iterate while maintaining the top 10 elements encountered in the heap). The second is explained in more detail in this thread.
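For a single machine, a rough Python sketch of that combination (hash-table counting plus a bounded min-heap for the top 10; heapq.nlargest maintains such a heap internally) might look like this, assuming one URL per line in log.txt:
import heapq
from collections import defaultdict

# count occurrences in a hash table, one pass over the log
counts = defaultdict(int)
with open("log.txt") as f:
    for line in f:
        counts[line.strip()] += 1

# keep only the 10 largest counts; nlargest uses a size-10 min-heap internally
top10 = heapq.nlargest(10, counts.items(), key=lambda kv: kv[1])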
About space complexity: if you are using a 64-bit system, you can treat the virtual address space as RAM and let the OS do what it can (swapping pages to disk when needed); it is very unlikely that you will need more than the amount of virtual memory you have on a 64-bit machine. An alternative is to use a hash table (or a B+ tree) optimized for file systems and run the same algorithm on it.
However, if this is indeed the case, i.e. the data does not fit in RAM, and you cannot use map-reduce, I suspect that sorting and iterating, though O(n log n), will be more efficient (using an external sort), because the number of disk accesses will be minimized, and disk access is much slower than RAM access.

I'd suggest that inspecting the file each time is the wrong way to go. A better solution might be to parse the file and push the data into a database (RavenDB or another NoSQL store would be the easiest). Once there, querying the data becomes trivial, even with very large amounts of data.

Don't reinvent the wheel. Coreutils' sort and uniq can process your log file:
sort log.txt | uniq -c | sort -n -r
Coreutils are available on *nix systems and have been ported to Windows.
If you do need to roll this processing into your own code, consult your language's available libraries for its version of a multiset. Python's, for example, is the Counter class, which will happily tell you the most_common([n]).
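For example (assuming, as above, one URL per line in log.txt):
from collections import Counter

with open("log.txt") as f:
    counts = Counter(line.strip() for line in f)

print(counts.most_common(10))   # [(url, count), ...] for the 10 most frequent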


Using ChronicleMap as a key-value database

I would like to use a ChronicleMap as a memory-mapped key-value database (String to byte[]). It should be able to hold up to the order of 100 million entries. Reads/gets will happen much more frequently than writes/puts, with an expected write rate of less than 10 entries/sec. While the keys would be similar in length, the length of the value could vary strongly: it could be anything from a few bytes up to tens of MB. Yet, the majority of values will have a length between 500 and 1000 bytes.
Having read a bit about ChronicleMap, I am amazed by its features and am wondering why I can't find articles describing its use as a general key-value database. To me there seem to be a lot of advantages to using ChronicleMap for such a purpose. What am I missing here?
What are the drawbacks of using ChronicleMap for the given boundary conditions?
I voted for closing this question because any "drawbacks" would be relative.
As a data structure, Chronicle Map is not sorted, so it doesn't fit when you need to iterate the key-value pairs in sorted order by key.
A limitation of the current implementation is that you need to specify the number of entries that are going to be stored in the map in advance. If the actual number isn't close to the specified one, you are going to overuse memory and disk (though not very severely on Linux systems), and if the actual number of entries exceeds the specified number by approximately 20% or more, operation performance starts to degrade, with the performance hit growing linearly as the number of entries grows further. See https://github.com/OpenHFT/Chronicle-Map/issues/105
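For a rough sense of scale under the question's own numbers: 100 million entries at an average of, say, 750 bytes per value is on the order of 75 GB of value data alone, so such a map would be far larger than typical RAM and would effectively be served from disk through the OS page cache.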

Sort the contents of a large file with low RAM

You have a large file, around 50GB, containing numbers. You need to read the file, sort the contents, and copy the sorted contents to another new file.
Condition: you have only 1GB of RAM on the computer. However, disk space is not an issue.
When we sort items that all fit in memory, we call it internal sorting. When the items are too big to fit in memory, we call it external sorting.
The Art of Computer Programming, Vol. 3: Sorting and Searching discusses external sorting algorithms in detail on page 248 (one of them is merge sort).
You also mention that the file contains 50GB of numbers. Maybe there are a lot of duplicate numbers; you might as well use counting sort if there are.
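As a minimal sketch of the external merge sort idea in Python (assuming one number per line; chunk_size and the temporary-file handling are illustrative choices, not a tuned implementation):
import heapq
import os
import tempfile
from itertools import islice

def external_sort(input_path, output_path, chunk_size=10_000_000):
    # Phase 1: read chunks that fit in RAM, sort each, write sorted runs to temp files
    run_paths = []
    with open(input_path) as f:
        while True:
            chunk = [int(line) for line in islice(f, chunk_size)]
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(f"{n}\n" for n in chunk)
            run_paths.append(path)

    # Phase 2: k-way merge of the sorted runs; heapq.merge streams them lazily
    runs = [open(p) for p in run_paths]
    with open(output_path, "w") as out:
        merged = heapq.merge(*((int(line) for line in r) for r in runs))
        out.writelines(f"{n}\n" for n in merged)
    for r in runs:
        r.close()
    for p in run_paths:
        os.remove(p)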

External Sorting with a heap?

I have a file with a large amount of data, and I want to sort it holding only a fraction of the data in memory at any given time.
I've noticed that merge sort is popular for external sorting, but I'm wondering if it can be done with a heap (min or max). Basically my goal is to get the top (using arbitrary numbers) 10 items in a 100-item list while never holding more than 10 items in memory.
I mostly understand heaps, and I understand that heapifying the data would put it in the appropriate order, from which I could just take the last fraction of it as my solution, but I can't figure out how to do this without an I/O for every freakin' item.
Ideas?
Thanks! :D
Using a heapsort requires lots of seek operations in the file for creating the heap initially and also when removing the top element. For that reason, it's not a good idea.
However, you can use a variation of mergesort where every heap element is a sorted list. The size of the lists is determined by how much you want to keep in memory. You create these lists from the input file by loading chunks of data, sorting them, and then writing them to temporary files. Then, you treat every file as one list, read its first element, and create a heap from those. When removing the top element, you remove it from its list and restore the heap condition if necessary.
There is one aspect though that makes these facts about sorting irrelevant: You say you want to determine the top 10 elements. For that, you could indeed use an in-memory heap. Just take an element from the file, push it onto the heap and if the size of the heap exceeds 10, remove the lowest element. To make it more efficient, only push it onto the heap if the size is below 10 or it is above the lowest element, which you then replace and re-heapify. Keeping the top ten in a heap allows you to only scan through the file once, everything else will be done in-memory. Using a binary tree instead of a heap would also work and probably be similarly fast, for a small number like 10, you could even use an array and bubblesort the elements in place.
Note: I'm assuming that 10 and 100 were just examples. If your numbers are really that low, any discussion about efficiency is probably moot, unless you're doing this operation several times per second.
Yes, you can use a heap to find the top-k items in a large file, holding only the heap + an I/O buffer in memory.
The following will obtain the min-k items by making use of a max-heap of length k. You could read the file sequentially, doing an I/O for every item, but it will generally be much faster to load the data in blocks into an auxiliary buffer of length b. The method runs in O(n*log(k)) operations using O(k + b) space.
while (file not empty)
    read block from file
    for (i = all items in block)
        if (heap.count() < k)
            heap.push(item[i])
        else if (item[i] < heap.root())
            heap.pop_root()
            heap.push(item[i])
        endif
    endfor
endwhile
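A rough Python rendering of the same loop (Python's heapq only provides a min-heap, so the max-heap of the k smallest items is emulated by storing negated values; the one-number-per-line file format is an assumption):
import heapq

def min_k(path, k, block_bytes=1 << 20):
    heap = []                                  # stores -value, so heap[0] is -(largest kept value)
    with open(path) as f:
        while True:
            block = f.readlines(block_bytes)   # read roughly block_bytes worth of lines at a time
            if not block:
                break
            for line in block:
                x = float(line)
                if len(heap) < k:
                    heapq.heappush(heap, -x)
                elif x < -heap[0]:             # smaller than the largest value currently kept
                    heapq.heapreplace(heap, -x)
    return sorted(-v for v in heap)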
Heaps require lots of nonsequential access. Mergesort is great for external sorting because it does a whole lot of sequential access.
Sequential access is a hell of a lot faster on the kinds of disks that spin because the head doesn't need to move. Sequential access will probably also be a hell of a lot faster on solid-state disks than heapsort's access because they do accesses in blocks that are probably considerably larger than a single thing in your file.
By using merge sort and passing the two values by reference, you only have to hold the two comparison values in a buffer and move through the array until it is sorted in place.

Why does LevelDB need more than two levels?

I think only two levels (level-0 and level-1) would be OK. Why does LevelDB need level-2, level-3, and more?
I'll point you in the direction of some articles on LevelDB and its underlying storage structure.
So, the documentation for LevelDB discusses merges among levels:
These merges have the effect of gradually migrating new updates from the young level to the largest level using only bulk reads and writes (i.e., minimizing expensive seeks).
LevelDB is similar in structure to Log-Structured Merge Trees. The paper discusses the different levels if you're interested in the analysis; if you can get through the mathematics, it seems to be your best bet for understanding the data structure.
A much easier-to-read analysis of LevelDB talks about the datastore's relation to LSM trees, but in terms of your question about the levels, all it says is:
Finally, having hundreds of on-disk SSTables is also not a great idea, hence periodically we will run a process to merge the on-disk SSTables.
Probably the LevelDB documentation provides the best answer: maximize the size of the writes and reads, since LevelDB is on-disk (slow-seek) data storage.
Good Luck!
I think it is mostly to do with easy and quick merging of levels.
In LevelDB, level-(i+1) has approximately 10 times the data of level-i. This is more analogous to a multi-level cache structure, where if the database has 1000 records between keys x1 and x2, then 10 of the most frequently accessed ones in that range would be in level-1, 100 in the same range would be in level-2, and the rest in level-3 (this is not exact, but just to give an intuitive idea of the levels). In this setup, to merge a file in level-i we need to look at at most 10 files in level-(i+1), and they can all be brought into memory, quickly merged, and written back. This results in reading relatively small chunks of data for each compaction/merging operation.
On the other hand, if you had just 2 levels, the key range in one level-0 file could potentially match thousands of files in level-1, and all of them would need to be opened for merging, which is going to be pretty slow. Note that an important assumption here is that we have fixed-size files (say 2MB). With variable-length files in level-1, your idea could still work, and I think a variant of that is used in systems like HBase and Cassandra.
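As a rough illustration with those numbers (2MB files, a 10x size ratio between levels, keys spread evenly): compacting one level-i file means reading roughly 10 overlapping files from level-(i+1), i.e. on the order of 22MB of mostly sequential I/O per compaction. With only two levels, a level-1 holding, say, 20GB would consist of about 10,000 files, and a single level-0 file with a wide key range could overlap a large fraction of them.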
Now, if your concern is about lookup delay with many levels: again, this is like a multi-level cache structure, and the most recently written data will be in the higher levels, which helps with typical locality of reference.
Level 0 is data in memory; the other levels are disk data. The important part is that data within a level is sorted. If level-1 consists of three 2MB files, then file1 holds keys 0..50 (sorted), file2 holds 150..200, and file3 holds 300..400 (as an example). So when the memory level is full, we need to insert its data into disk in the most efficient manner, which is sequential writing (using as few disk seeks as possible). Imagine that in memory we have keys 60-120: cool, we just write them out sequentially as a file, which becomes the new file2 in level-1. Very efficient!
But now imagine that level-1 is much larger than level-0 (which is reasonable, as level-0 is memory). In this case there are many files in level-1, and our keys in memory (60-120) now belong to many files, since the key range in level-1 is very fine grained. To merge level-0 with level-1 we would need to read many files, make a lot of random seeks, build new files in memory, and write them out. So this is where the many-levels idea kicks in: we have many layers, each somewhat larger than the previous one (x10) but not much larger, so when we have to migrate data from the (i-1)-th to the i-th layer we have a good chance of having to read the least amount of files.
Now, since data might change, there may be no need to propagate it to the higher, more expensive layers (it might be changed or deleted in the meantime), and so we avoid expensive merges altogether. The data that does end up in the last level is statistically the least likely to change, so it is the best fit for the most-expensive-to-merge-with last layer.

General approach to count word occurrence in large number of files

This is sort of an algorithm question. To make it clear, I'm not interested in working code but in how to approach the task generally.
We have a server with 4 CPUs and no databases. There are 100,000 HTML documents stored on disk, each 2MB in size. We need an efficient way to determine how many times the word "CAMERA" (case insensitive) appears in that collection.
My approach would be to parse the HTML documents to extract only the words, then sort the words, and then use binary search on that collection.
In other words, I would create threads to let them use all 4 CPUs to parse the HTML documents into a single large word-collection text file, then sort it, and then use binary search.
What do you think of this?
Have you tried grep? That's what I would do.
It will probably take some experimentation to figure out the right way to pass it so much data and make sure ahead of time that the results come out right, because it's going to take a little while.
I would not recommend sorting that much data.
Well, it is not a complete pseudo-code answer, but I don't think there is one. To get optimal performance you need to know a LOT about your HW architecture. Here are the notes:
There is no need to sort the data at all, nor to use binary search. Just read the files (read each file sequentially from disk) and, while doing so, check whether the word "camera" appears in them.
The bottleneck in the program will most likely be IO (disk reads), since disk access is MUCH slower than CPU calculations. So, to optimize the program, one should focus on optimizing the disk reads.
To optimize the disk reads, one should know the disk architecture. For example, if you have only one disk (and no RAID), there is really no point in multi-threading, assuming the disk can process only a single request at a time. If that is the case, use a single thread.
However, if you have multiple disks, it does not matter how many cores you have: you should spawn #disks threads (assuming the files are evenly separated among the disks). Since disk I/O is the bottleneck, having multiple threads concurrently requesting data from the disks keeps all of them busy and effectively reduces the time consumption significantly.
Something like?
htmlDocuments = getPathsOfHtmlDocuments()
threadsafe counter = new Counter(0)
scheduler = scheduler with max 4 threads

for (htmlDocument : htmlDocuments) {
    scheduler.schedule(new SearchForCameraJob("Camera", htmlDocument, counter))
}
wait while scheduler.hasUnfinishedJobs
print "Found camera " + counter + " times"

class SearchForCameraJob(searchString, pathToFile, counter) {
    document = readFile(pathToFile);
    while (document.findNext(searchString)) {
        counter.increment();
    }
}
If your documents are located on a single local hard drive, you will be constrained by I/O, not CPU.
I would use the very simple approach of serially loading every file into memory and scanning it for the target word, increasing a counter.
If you try to use 4 threads in an attempt to speed it up (say, 25,000 files per thread), it will likely make it slower, because I/O does not like overlapping access patterns from competing processes/threads.
If, however, the files are spread across multiple hard drives, you should start as many threads as you have drives, and each thread should read data from that drive only.
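A minimal sketch of that serial scan in Python (a plain case-insensitive substring count with no HTML parsing; the directory layout and .html extension are assumptions), where the multi-drive variant would simply run one such loop per drive in its own thread:
from pathlib import Path

def count_word(root, word="camera"):
    # serially load each document and count case-insensitive substring matches
    word = word.lower()
    total = 0
    for path in Path(root).rglob("*.html"):
        total += path.read_text(errors="ignore").lower().count(word)
    return total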
You can use the Boyer-Moore algorithm. It is difficult to say which programming language is right for such an application, but you could write it in C++ so as to directly optimize your native code. Obviously you would need to use multithreading.
Among HTML document parsing libraries, you could choose Xerces-C++.
