I have 4,000,000,000 (four billion) edges for an undirected graph. They are represented in a large text file as pairs of node ids. I would like to compute the connected components of this graph. Unfortunately, once you load in the node ids with the edges into memory this takes more than the 128GB of RAM I have available.
Is there an out-of-core algorithm for finding connected components that is relatively simple to implement? Or even better, can it be cobbled together with Unix command tools and existing (Python) libraries?
Based on the description of the problem you've provided and the answers you provided in the comments, I think the easiest way to do this might be to use an approach like the one @dreamzor described. Here's a more fleshed-out version of that answer.
The basic idea is to convert the data to a more compressed format that fits into memory, to run a regular connected components algorithm on that data, then to decompress it. Notice that if you assign each node a 32-bit numeric ID, then the total space required to store all the nodes is at most the space for four billion nodes and eight billion edges (assuming you store two copies of each edge), which is space for twelve billion 32-bit integers, only around 48GB of space, below your memory threshold.
To start off, write a script that reads in the edges file, assigns a numeric ID to each node (perhaps sequentially in the order in which they appear). Have this script write this mapping to a file and, as it goes, write a new edges file that uses the numeric IDs of the nodes rather than the string names. When you're done, you'll have a names file mapping IDs to names and an edges file that takes up much less space than before. You mentioned in the comments that you can fit all the node names into memory, so this step should be very reasonable. Note that you don't need to store all the edges in memory - you can stream them through the program - so that shouldn't be a bottleneck.
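As a rough illustration of this relabeling step, here is a minimal Python sketch. It assumes one whitespace-separated pair of node names per line; the function name `relabel_edges` and the output format are made up for the example, not something from the question.

```python
def relabel_edges(edge_lines, out_edges):
    """Stream 'name_a name_b' lines and write 'id_a id_b' lines.

    Returns the list of names indexed by numeric ID (the "names file").
    Only the name table is held in memory; edges are streamed through.
    """
    ids = {}     # name -> numeric ID
    names = []   # numeric ID -> name
    for line in edge_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # assumption: skip blank or malformed lines
        pair = []
        for name in parts:
            if name not in ids:
                ids[name] = len(names)  # IDs are handed out in order of appearance
                names.append(name)
            pair.append(ids[name])
        out_edges.write(f"{pair[0]} {pair[1]}\n")
    return names
```

Both arguments can be ordinary open file objects, so the input never has to fit in memory; writing `names` out one per line gives the ID-to-name mapping file.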
Next, write a program that reads the edges file - but not the names file - into memory and finds connected components using any reasonable algorithm (BFS or DFS would be great here). If you're careful with your memory (using something like C or C++ here would be a good call), this should fit comfortably into main memory. When you're done, write out all the clusters to an external file by numeric ID. You now have a list of all the CCs by ID.
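To make the memory layout concrete, here is a small Python sketch of BFS over a compressed adjacency list built from flat integer arrays. It is only meant to show the idea (for four billion edges you would want C or C++ as suggested above). The name `connected_components` is hypothetical, and `edges` is assumed to be re-iterable, for example by reading the numeric edges file twice.

```python
from array import array
from collections import deque

def connected_components(num_nodes, edges):
    """Return (labels, count): labels[v] is the component number of node v."""
    # Pass 1: count degrees, then turn them into start offsets (prefix sums).
    offsets = array("Q", [0]) * (num_nodes + 1)
    for a, b in edges:
        offsets[a + 1] += 1
        offsets[b + 1] += 1
    for v in range(num_nodes):
        offsets[v + 1] += offsets[v]
    # Pass 2: fill in neighbours; each edge is stored once per endpoint.
    targets = array("I", [0]) * offsets[num_nodes]
    cursor = offsets[:]  # next free slot for each node
    for a, b in edges:
        targets[cursor[a]] = b
        cursor[a] += 1
        targets[cursor[b]] = a
        cursor[b] += 1
    del cursor
    # BFS from every node that has not been labelled yet.
    labels = array("i", [-1]) * num_nodes
    count = 0
    for start in range(num_nodes):
        if labels[start] != -1:
            continue
        labels[start] = count
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for i in range(offsets[v], offsets[v + 1]):
                w = targets[i]
                if labels[w] == -1:
                    labels[w] = count
                    queue.append(w)
        count += 1
    return labels, count
```

The neighbours of node `v` live in `targets[offsets[v]:offsets[v + 1]]`, so the whole graph costs one integer per node plus two per edge.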
Finally, write a program that reads in the ID to node mapping from the names file, then streams in the cluster IDs and writes out the names of all the nodes in each cluster to a final file.
This approach should be relatively straightforward to implement because the key idea is to keep the existing algorithms you're used to but just change the representation of the graph to be more memory efficient. I've used approaches like this in the past when dealing with huge graphs (Wikipedia) and it's worked beautifully even on systems with less memory than yours.
You can hold just an array that gives each vertex a "color" (an int value), then run through the file without loading the entire set of links, coloring vertices as you go: assign a new color if neither vertex is colored; copy the existing color if one vertex is colored and the other isn't; and, if both are colored differently, keep the lower of the two colors and repaint every other vertex in the array that has the higher color. A pseudocode example:
int nextColor = 1;
int merges = 0;
int[] vertices;  // vertices[v] is the color of v; 0 means "not colored yet"
while (!file.eof()) {
    link = file.readLink();
    c1 = vertices[link.a];
    c2 = vertices[link.b];
    if ((c1 == 0) && (c2 == 0)) {
        vertices[link.a] = nextColor;
        vertices[link.b] = nextColor;
        nextColor++;
    } else if ((c1 != 0) && (c2 != 0)) {
        if (c1 != c2) {
            // both colored differently: merge, keeping the lower color
            keep = min(c1, c2);
            drop = max(c1, c2);
            for (i = vertices.length - 1; i >= 0; i--)
                if (vertices[i] == drop) vertices[i] = keep;
            merges++;
        }
    } else if (c1 == 0) vertices[link.a] = c2; // only c1 is 0
    else vertices[link.b] = c1; // only c2 is 0
}
If you choose a type smaller than 32 bits for storing a vertex's color, you may need to check first whether nextColor has reached its maximum, keep a list of unused colors (those released by merges), and skip coloring a new pair of vertices when no color is available. Then re-run the file reading process if all the colors were used up and any merges occurred.
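The repainting loop scans the whole array on every merge, which can get expensive with billions of links. A common variation, and a different technique from the repainting described here, is union-find (disjoint sets): each vertex stores a parent pointer instead of a color, and a merge only relinks one root. A minimal Python sketch, assuming vertices are already numbered 0..n-1 and using a made-up function name:

```python
from array import array

def component_labels(num_vertices, edges):
    """Union-find over a flat array; returns the smallest vertex of each component."""
    parent = array("I", range(num_vertices))  # every vertex starts as its own root

    def find(v):
        # Walk up to the root, shortening the path as we go (path halving).
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:  # edges can be streamed straight from the file
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the lower root, mirroring "keep the lower color".
            if ra < rb:
                parent[rb] = ra
            else:
                parent[ra] = rb
    return [find(v) for v in range(num_vertices)]
```

As with the colors, the memory cost is still one integer per vertex, and the file is read only once.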
UPDATE: Since the vertices aren't really ints but strings, you should also keep a map of string to int while parsing that file. If your strings are limited in length, you can probably fit them all into memory as a hash table, but I'd pre-process the file by creating another file that has all strings "s1" replaced with "1", "s2" with "2", etc., where "s1", "s2" are whatever names appear as vertices in the file, so that the data is compacted to a list of pairs of ints. If you'll be processing similar data later (that is, your graph isn't changing much and contains largely the same vertex names), store the "metadata" file with the links from names to ints to ease further pre-processing.
The Problem
On a server, I host ids in a json file. From clients, I need to mandate the server to intersect and sometimes negate these ids (the ids never travel to the client even though the client instructs the server its operations to perform).
I typically have 1000's of ids, often have 100,000's of ids, and have a maximum of 56,000,000 of them, where each value is unique and between -100,000,000 and +100,000,000.
These ids files are stable and do not change (so it is possible to generate a different representation for it that is better adapted for the calculations if needed).
Sample ids
Largest file sizes
I need an algorithm that will intersect ids in the sub-second range for most cases. What would you suggest? I code in java, but do not limit myself to java for the resolution of this problem (I could use JNI to bridge to native language).
Potential solutions to consider
Although you could not limit yourselves to the following list of broad considerations for solutions, here is a list of what I internally debated to resolve the situation.
Neural-Network pre-qualifier: Train a neural-network for each ids list that accepts another list of ids to score its intersection potential (0 means definitely no intersection, 1 means definitely there is an intersection). Since neural networks are good and efficient at pattern recognition, I am thinking of pre-qualifying a more time-consuming algorithm behind it.
Assembly-language: On a Linux server, code an assembly module that implements such an algorithm. I know that assembly is a mess to maintain and code, but sometimes one needs the speed of a highly optimized algorithm without the overhead of a higher-level compiler. Maybe this use-case is simple enough to benefit from an assembly language routine to be executed directly on the Linux server (and then I'd always pay attention to stick with the same processor to avoid having to re-write this too often)? Or, alternately, maybe C would be close enough to assembly to produce clean and optimized assembly code without the overhead of maintaining assembly code.
Images and GPU: GPU and image processing could be used and, instead of comparing ids, I could BITAND images. That is, I create a B&W image of each ids list. Since each id has a unique value between -100,000,000 and +100,000,000 (where a maximum of 56,000,000 of them are used), the image would be mostly black, but a pixel would become white if the corresponding id is set. Then, instead of keeping the list of ids, I'd keep the images, and do a BITAND operation on both images to intersect them. This may be fast indeed, but then translating the resulting image back to ids may be the bottleneck. Also, each image could be significantly large (maybe too large for this to be a viable solution). An estimate for a 200,000,000-bit sequence is 23MB each; just loading this in memory is quite demanding.
String-matching algorithms: String comparisons have many adapted algorithms that are typically extremely efficient at their task. Create a binary file for each ids set. Each id would be 4 bytes long. The corresponding binary file would have each and every id sequenced as their 4 bytes equivalent into it. The algorithm could then be to process the smallest file to match each 4 bytes sequence as a string into the other file.
Am I missing anything? Any other potential solution? Could any of these approaches be worth diving into them?
I did not yet try anything as I want to secure a strategy before I invest what I believe will be a significant amount of time into this.
EDIT #1:
Could the solution be a map of hashes for each sector in the list? If the information is structured in such a way that each id resides under its corresponding hash key, then the smaller ids set could be run through sequentially: matching an id against the larger ids set would first require hashing the value to find the key, and then sequentially matching against the ids stored under that key.
This should make the algorithm an O(n) one, and since I'd pick the smallest ids set as the one to run through sequentially, n is small. Does that make sense? Is that the solution?
Something like this (where the H entry is the hash):
{
"H780" : [ 45902780, 46062780, -42912780, -19812780, 25323780, 40572780, -30131780, 60266780, -26203780, 46152780, 67216780, 71666780, -67146780, 46162780, 67226780, 67781780, -47021780, 46122780, 19973780, 22113780, 67876780, 42692780, -18473780, 30993780, 67711780, 67791780, -44036780, -45904780, -42142780, 18703780, 60276780, 46182780, 63600780, 63680780, -70486780, -68290780, -18493780, -68210780, 67731780, 46092780, 63450780, 30074780, 24772780, -26483780, 68371780, -18483780, 18723780, -29834780, 46202780, 67821780, 29594780, 46082780, 44632780, -68406780, -68310780, -44056780, 67751780, 45912780, 40842780, 44642780, 18743780, -68220780, -44066780, 46142780, -26193780, 67681780, 46222780, 67761780 ],
"H782" : [ 27343782, 67456782, 18693782, 43322782, -37832782, 46152782, 19113782, -68411782, 18763782, 67466782, -68400782, -68320782, 34031782, 45056782, -26713782, -61776782, 67791782, 44176782, -44096782, 34041782, -39324782, -21873782, 67961782, 18703782, 44186782, -31143782, 67721782, -68340782, 36103782, 19143782, 19223782, 31711782, 66350782, 43362782, 18733782, -29233782, 67811782, -44076782, -19623782, -68290782, 31721782, 19233782, 65726782, 27313782, 43352782, -68280782, 67346782, -44086782, 67741782, -19203782, -19363782, 29583782, 67911782, 67751782, 26663782, -67910782, 19213782, 45992782, -17201782, 43372782, -19992782, -44066782, 46142782, 29993782 ],
"H540" : [...
You can convert each file (list of ids) into a bit-array of length 200_000_001, where bit at index j is set if the list contains value j-100_000_000. It is possible, because the range of id values is fixed and small.
Then you can simply use bitwise and and not operations to intersect and negate lists of ids. Depending on the language and libraries used, it would require operating element-wise: iterating over arrays and applying corresponding operations to each index.
Finally, you should measure your performance and decide whether you need to do some optimizations, such as parallelizing operations (you can work on different parts of arrays on different processors), preloading some of arrays (or all of them) into memory, using GPU, etc.
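A minimal Python sketch of the bit-array idea, to make the index arithmetic concrete. The function names and the `lo`/`hi` parameters are made up for the example; the defaults match the id range in the question, and Python's big integers stand in for the element-wise loop you would write in Java or C.

```python
def to_bitmap(ids, lo=-100_000_000, hi=100_000_000):
    """Bit j of the result is set when the list contains the value j + lo."""
    bits = bytearray((hi - lo) // 8 + 1)
    for v in ids:
        j = v - lo
        bits[j >> 3] |= 1 << (j & 7)
    return bits

def intersect(a, b):
    """Bitwise AND of two bitmaps of equal length."""
    x = int.from_bytes(a, "little") & int.from_bytes(b, "little")
    return bytearray(x.to_bytes(len(a), "little"))

def and_not(a, b):
    """Ids in a that are not in b (intersection with the negation of b)."""
    x = int.from_bytes(a, "little") & ~int.from_bytes(b, "little")
    return bytearray(x.to_bytes(len(a), "little"))

def from_bitmap(bits, lo=-100_000_000):
    """Translate a bitmap back into a sorted list of ids."""
    return [lo + 8 * i + k
            for i, byte in enumerate(bits) if byte
            for k in range(8) if byte >> k & 1]
```

With the default range each bitmap is about 25 MB regardless of how many ids it holds, which is the memory cost to weigh against the speed of the AND.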
First, the bitmap approach will produce the required performance, at a huge overhead in memory. You'll need to benchmark it, but I'd expect times of maybe 0.2 seconds, with that almost entirely dominated by the cost of loading data from disk, and then reading the result.
However there is another approach that is worth considering. It will use less memory most of the time. For most of the files that you state, it will perform well.
First let's use Cap'n Proto for a file format. The type can be something like this:
struct Ids {
  is_negated @0 :Bool;
  ids @1 :List(Int32);
}
The key is that ids are always kept sorted. So list operations are a question of running through them in parallel. And now:
Applying not is just flipping is_negated.
If neither is negated, it is a question of finding IDs in both lists.
If the first is not negated and the second is, you just want to find IDs in the first that are not in the second.
If the first is negated and the second is not, you just want to find IDs in the second that are not in the first.
If both are negated, you just want to find all ids in either list, and the result stays negated.
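The four cases can be sketched in Python as a single merge walk over the two sorted lists. This is only an illustration: a list is modeled as an `(is_negated, sorted_ids)` tuple rather than a Cap'n Proto message, and the function names are made up.

```python
def merge_walk(a, b, keep):
    """Walk two sorted lists in parallel; keep(in_a, in_b) decides what is emitted."""
    out = []
    i = j = 0
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            value, in_a, in_b = a[i], True, False
            i += 1
        elif i == len(a) or b[j] < a[i]:
            value, in_a, in_b = b[j], False, True
            j += 1
        else:  # same id in both lists
            value, in_a, in_b = a[i], True, True
            i += 1
            j += 1
        if keep(in_a, in_b):
            out.append(value)
    return out

def negate(x):
    """Applying not is just flipping the flag."""
    return (not x[0], x[1])

def intersect(x, y):
    (neg_x, a), (neg_y, b) = x, y
    if not neg_x and not neg_y:   # ids in both lists
        return (False, merge_walk(a, b, lambda in_a, in_b: in_a and in_b))
    if not neg_x and neg_y:       # ids in the first that are not in the second
        return (False, merge_walk(a, b, lambda in_a, in_b: in_a and not in_b))
    if neg_x and not neg_y:       # ids in the second that are not in the first
        return (False, merge_walk(a, b, lambda in_a, in_b: in_b and not in_a))
    # Both negated: not-A and not-B equals not-(A or B), so store the union, negated.
    return (True, merge_walk(a, b, lambda in_a, in_b: True))
```

Every branch touches each id at most once, which is where the "number of comparisons is about the sum of the two sizes" estimate comes from.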
If your list has 100k entries, then the file will be about 400k. A not requires copying 400k of data (very fast). And intersecting with another list of the same size involves 200k comparisons. Integer comparisons complete in a clock cycle, and branch mispredictions take something like 10-20 clock cycles. So you should be able to do this operation in the 0-2 millisecond range.
Your worst case 56,000,000 file will take over 200 MB and intersecting 2 of them can take around 200 million operations. This is in the 0-2 second range.
For the 56 million file and a 10k file, your time is almost all spent on numbers in the 56 million file and not in the 10k one. You can speed that up by adding a "galloping" mode where you do a binary search forward in the larger file looking for the next matching number and picking most of them. Do be warned that this code tends to be tricky and involves lots of mispredictions. You'll have to benchmark it to find out how big a size difference is needed.
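One possible shape for the galloping mode, sketched in Python with the standard `bisect` module. The function name is hypothetical, and both inputs are assumed to be sorted lists of unique ids; a production version would work on the file's integer array directly.

```python
from bisect import bisect_left

def gallop_intersect(small, large):
    """Intersect a short sorted list with a much longer sorted list."""
    out = []
    lo = 0  # everything in large before index lo is smaller than the current id
    for v in small:
        # Gallop: probe at exponentially growing distances until we pass v.
        step = 1
        hi = lo
        while hi < len(large) and large[hi] < v:
            lo = hi + 1
            hi += step
            step *= 2
        # Binary search inside the last bracket only.
        lo = bisect_left(large, v, lo, min(hi, len(large)))
        if lo < len(large) and large[lo] == v:
            out.append(v)
    return out
```

The work per id of the small list is logarithmic in the distance jumped, so the long list is mostly skipped rather than scanned.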
In general this approach will lose for your very biggest files. But it will be a huge win for most of the sizes of file that you've talked about.
In the specific problem I'm dealing with, the processes arranged in a 3D topology have to exchange portions of a 3D array A(:,:,:) with each other. In particular, each one has to send a given number of slices of A to the processes in the six oriented directions (e.g. A(nx-1:nx,:,:) to the process in the positive 1st dimension, A(1:3,:,:) in the negative one, A(:,ny-3:ny,:) in the positive y-dimension, and so on).
In order to do so I'm going to define a set of subarray types (by means of MPI_TYPE_CREATE_SUBARRAY) to be used in communications (maybe MPI_NEIGHBOR_ALLTOALL, or its V or W extension). The question is about what the better choice, in terms of performance, between:
define 3 subarrays (one for each dimension), each one being actually a 2D array, and then make the communications send along each dimension a different number of these types in the two directions, or
define 6 subarray (one for each oriented direction), each one still being a 3D array, and then make the communications send along each dimension one element of the two types in the two directions?
Finally, to be more general, as in the title, is it better to define more "basic" MPI derived data types and use counts greater than 1 in the communications, or to define "bigger" types and use counts = 1 in the communications?
MPI derived datatypes are defined to provide the library a means of packing and unpacking the data you send.
For basic types (MPI_INT, MPI_DOUBLE, etc.) there's no problem since the data in memory is already contiguous: there are no holes in memory.
For more complex types such as multidimensional arrays or structures, sending the data as is may be inefficient because you are probably sending useless data. For this reason, data is packed into a contiguous array of bytes, sent to the destination and then unpacked again to restore its original shape.
That being said, you need to create a derived datatype for each different shape in memory. For example, A(1:3,:,:) and A(nx-2:nx,:,:) represent the same datatype. But A(nx-2:nx,:,:) and A(:,nx-2:nx,:) don't. If you specify correctly the stride access (the gap between consecutive datatypes), you can even specify a 2D derived datatype and then vary the count argument to get better flexibility of your program.
Finally, to answer your last question, this is probably worth benchmarking, although I think the difference will not be very noticeable, since it results in a single MPI message in both cases.
Let's say I have a document & the document is spread across 4 different machines, I would like to get a character which has the highest repeated count (all 4 machines combined).
One approach I have is to use a hashmap in each machine and calculate the frequency on each machine individually and then pass that hashmap to the main server where hashmaps from all the 4 machines will be merged.
Thus we'll get the character with the highest frequency.
But the catch here is that I want to minimize the data transferred from each machine.
What improvements can be made ?
[EDIT]
Each machine holds a part of the document
If you don't mind it taking longer...
Each computer passes the most frequent character(s). Hopefully, the number of characters with the highest frequency is low. Ideally, it would be almost always only one.
Main server combines them into a set. If the set has a single character, we are done. Otherwise this set is passed along to the computers, likely as an array or list. Assuming only one character from each computer, this list would have only 2-4 characters.
Each computer returns the frequencies of each character in the set.
Main server sums the frequencies, obtaining the most frequent.
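Here is a small Python simulation of these four steps, next to the full hashmap merge from the question for comparison. The machines are modeled as plain strings and all function names are made up. One thing the simulation makes visible: the candidate set only contains local winners, so a character that is a close second on every machine can have the highest total without ever being proposed.

```python
from collections import Counter

def local_top(counts):
    """Step 1: the most frequent character(s) on one machine."""
    best = max(counts.values())
    return {ch for ch, n in counts.items() if n == best}

def candidate_winner(parts):
    """Two-round protocol: exchange candidates first, then only their counts."""
    machines = [Counter(p) for p in parts]  # each Counter stays on its machine
    candidates = set().union(*(local_top(c) for c in machines))
    if len(candidates) == 1:
        return next(iter(candidates))
    totals = Counter()
    for counts in machines:       # steps 3 and 4: sum candidate frequencies
        for ch in candidates:
            totals[ch] += counts[ch]
    return max(sorted(totals), key=totals.get)  # ties go to the smallest character

def exact_winner(parts):
    """Baseline: ship every machine's full hashmap and merge them."""
    totals = Counter()
    for p in parts:
        totals.update(p)
    return max(sorted(totals), key=totals.get)
```

For example, with the parts "aaacc" and "bbbcc" the protocol proposes only 'a' and 'b', while the full merge finds 'c' with four occurrences, so the saving in transferred data comes at the price of exactness.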
I assert that without prior knowledge of the distribution of characters in the document then any approach you take will have to reduce the data from all 4 computers onto one of them. To minimise the data transferred it is necessary to minimise the size of the data structure which holds the character counts on each computer.
Supposing that you are working with an alphabet of N characters, your problem is now the design of a data structure which can hold N integers (each in some range [0..m], m being the number of characters in the document), and there is any number of such data structures to be found.
Of course, if you have prior knowledge of the distribution of characters, for example if you know that it is pure text written in English, you have a range of possible approaches to data compression.
Given the relatively small values for N and m likely to be found in practice, I agree with the general thrust of the commentary that it is probably not worth devising a complicated structure to minimise the amount of data transferred; sending an array of N integers would be adequate in most conceivable circumstances.
Given two files, each containing a list of words (around a million), we need to find the words they have in common.
Use some efficient algorithm; there is also not enough memory available (certainly not enough for 1 million words). Some basic C programming code, if possible, would help.
The files are not sorted. We can use some sort of algorithm. Please support it with basic code.
Sorting the external file with minimum memory available: how can it be implemented in C?
Is anybody game for external sorting of a file? Please share some code for this.
Yet another approach.
General. First, notice that doing this naively (comparing every word of one file with every word of the other) takes O(N^2). With N=1,000,000, this is a LOT. Sorting each list would take O(N*log(N)); then you can find the intersection in one pass by merging the files (see below). So the total is O(2N*log(N) + 2N) = O(N*log(N)).
Sorting a file. Now let's address the fact that working with files is much slower than with memory, especially when sorting where you need to move things around. One way to solve this is - decide the size of the chunk that can be loaded into memory. Load the file one chunk at a time, sort it efficiently and save into a separate temporary file. The sorted chunks can be merged (again, see below) into one sorted file in one pass.
Merging. When you have 2 sorted lists (files or not), you can merge them into one sorted list easily in one pass: have 2 "pointers", initially pointing to the first entry in each list. In each step, compare the values the pointers point to. Move the smaller value to the merged list (the one you are constructing) and advance its pointer.
You can modify the merge algorithm easily to make it find the intersection: if the pointed values are equal, move the value to the results (consider how you want to deal with duplicates).
For merging more than 2 lists (as in sorting the file above) you can generalize the algorithm for using k pointers.
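The question asks for C, but the structure is easier to see in a short Python sketch first: sort chunks into temporary run files, merge the runs with k pointers (`heapq.merge` does exactly that), and walk the two sorted streams to find the intersection. The function names and the one-word-per-line format are assumptions made for the example.

```python
import heapq
import os
import tempfile
from itertools import islice

def sorted_runs(path, chunk_size):
    """Sort the file chunk by chunk; each sorted chunk goes to its own temp file."""
    runs = []
    with open(path) as src:
        while True:
            lines = list(islice(src, chunk_size))
            if not lines:
                break
            words = sorted(w for w in map(str.strip, lines) if w)
            fd, run_path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(w + "\n" for w in words)
            runs.append(run_path)
    return runs

def merged_words(runs):
    """k-way merge of the sorted runs; only one line per run is in memory."""
    files = [open(p) for p in runs]
    try:
        streams = [(line.rstrip("\n") for line in f) for f in files]
        yield from heapq.merge(*streams)
    finally:
        for f in files:
            f.close()
        for p in runs:
            os.remove(p)

def common_words(path_a, path_b, chunk_size=100_000):
    """Yield the words present in both files, in sorted order, without duplicates."""
    a = merged_words(sorted_runs(path_a, chunk_size))
    b = merged_words(sorted_runs(path_b, chunk_size))
    try:
        x, y = next(a, None), next(b, None)
        while x is not None and y is not None:
            if x < y:
                x = next(a, None)
            elif y < x:
                y = next(b, None)
            else:
                yield x
                word = x
                while x == word:   # skip repeats of the word just reported
                    x = next(a, None)
                while y == word:
                    y = next(b, None)
    finally:
        a.close()
        b.close()
```

Memory use is bounded by `chunk_size` words during sorting and by one word per run during merging, so `chunk_size` is the knob to tune to the RAM you actually have.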
If you had enough memory to read the first file completely into RAM, I would suggest reading it into a dictionary (word -> index of that word ), loop over the words of the second file and test if the word is contained in that dictionary. Memory for a million words is not much today.
If you do not have enough memory, split the first file into chunks that fit into memory and do as I said above for each chunk. For example, fill the dictionary with the first 100,000 words, find every common word for that, then read the file a second time extracting words 100,001 up to 200,000, find the common words for that part, and so on.
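A minimal sketch of the chunked variant, in Python for brevity rather than C. It assumes one word per line and that the set of common words itself fits in memory; the function name is made up. Instead of re-reading the first file from the start, it keeps reading where the previous chunk ended and re-scans the second file once per chunk.

```python
from itertools import islice

def common_words_chunked(path_a, path_b, chunk_size=100_000):
    """Intersect two word files while holding only chunk_size words of the first."""
    common = set()
    with open(path_a) as file_a:
        while True:
            lines = list(islice(file_a, chunk_size))
            if not lines:
                break
            chunk = {line.strip() for line in lines}
            chunk.discard("")  # ignore blank lines
            # One full pass over the second file per chunk of the first.
            with open(path_b) as file_b:
                for line in file_b:
                    word = line.strip()
                    if word in chunk:
                        common.add(word)
    return common
```

The number of passes over the second file is the size of the first file divided by `chunk_size`, so larger chunks trade memory for fewer passes.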
And now the hard part: you need a dictionary structure, and you said "basic C". If you are willing to use "basic C++", there is the hash_map data structure provided as an extension to the standard library by common compiler vendors (since C++11, the standard std::unordered_map fills the same role). In basic C, you should also try to use a ready-made library for that; read this SO post to find a link to a free library which seems to support that.
Your problem is: Given two sets of items, find the intersection (items common to both), while staying within the constraints of inadequate RAM (less than the size of any set).
Since finding an intersection requires comparing/searching each item in another set, you must have enough RAM to store at least one of the sets (the smaller one) to have an efficient algorithm.
Assume that you know for a fact that the intersection is much smaller than both sets and fits completely inside available memory -- otherwise you'll have to do further work in flushing the results to disk.
If you are working under memory constraints, partition the larger set into parts that fit inside 1/3 of the available memory. Then partition the smaller set into parts that fit the second 1/3. The remaining 1/3 memory is used to store the results.
Optimize by finding the max and min of the partition for the larger set. This is the set that you are comparing from. Then when loading the corresponding partition of the smaller set, skip all items outside the min-max range.
First find the intersection of both partitions through a double-loop, storing common items to the results set and removing them from the original sets to save on comparisons further down the loop.
Then replace the partition in the smaller set with the second partition (skipping items outside the min-max). Repeat. Notice that the partition in the larger set is reduced -- with common items already removed.
After running through the entire smaller set, repeat with the next partition of the larger set.
Now, if you do not need to preserve the two original sets (e.g. you can overwrite both files), then you can further optimize by removing common items from disk as well. This way, those items no longer need to be compared in further partitions. You then partition the sets by skipping over removed ones.
I would give prefix trees (aka tries) a shot.
My initial approach would be to determine a maximum depth for the trie that would fit nicely within my RAM limits. Pick an arbitrary depth (say 3, you can tweak it later) and construct a trie up to that depth, for the smaller file. Each leaf would be a list of "file pointers" to words that start with the prefix encoded by the path you followed to reach the leaf. These "file pointers" would keep an offset into the file and the word length.
Then process the second file by reading each word from it and trying to find it in the first file using the trie you constructed. It would allow you to fail faster on words that don't match. The deeper your trie, the faster you can fail, but the more memory you would consume.
Of course, like Stephen Chung said, you still need RAM to store enough information to describe at least one of the files, if you really need an efficient algorithm. If you don't have enough memory -- and you probably don't, because I estimate my approach would require approximately the same amount of memory you would need to load a file whose words were 14-22 characters long -- then you have to process even the first file by parts. In that case, I would actually recommend using the trie for the larger file, not the smaller. Just partition it in parts that are no bigger than the smaller file (or no bigger than your RAM constraints allow, really) and do the whole process I described for each part.
Despite the length, this is sort of off the top of my head. I might be horribly wrong in some details, but this is how I would initially approach the problem and then see where it would take me.
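To make the "file pointers" idea concrete, here is a simplified Python sketch. Instead of a node-per-letter trie it uses a dict keyed by the first `depth` bytes of each word, which plays the role of the depth-limited trie's leaves; each entry holds (offset, length) pairs into the smaller file. All names are made up, and one word per line is assumed.

```python
def build_prefix_index(path, depth=3):
    """Map each prefix of length `depth` to a list of (offset, length) pairs."""
    index = {}
    with open(path, "rb") as f:   # binary mode, so offsets are byte offsets
        offset = 0
        for line in f:
            word = line.strip()
            if word:
                start = offset + line.find(word)
                index.setdefault(word[:depth], []).append((start, len(word)))
            offset += len(line)
    return index

def contains(f, index, word, depth=3):
    """Check `word` against the indexed file, reading only plausible candidates."""
    for offset, length in index.get(word[:depth], ()):
        if length == len(word):   # cheap rejection before touching the disk
            f.seek(offset)
            if f.read(length) == word:
                return True
    return False

def common_words_indexed(path_small, path_large, depth=3):
    """Words (as bytes) of the large file that also occur in the small file."""
    index = build_prefix_index(path_small, depth)
    found = set()
    with open(path_small, "rb") as small, open(path_large, "rb") as large:
        for line in large:
            word = line.strip()
            if word and contains(small, index, word, depth):
                found.add(word)
    return sorted(found)
```

Only prefixes, offsets and lengths are kept in RAM; the words themselves stay on disk, and a larger `depth` means fewer seeks per lookup at the cost of a bigger index.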
If you're looking for memory efficiency with this sort of thing you'll be hard pushed to get time efficiency. My example will be written in Python, but should be relatively easy to implement in any language.
# file1, file2 and delim are assumed to be defined by the caller
with open(file1) as file_1:
    current_word_1 = read_to_delim(file_1, delim)
    while current_word_1:
        # re-scan the whole second file for every word of the first
        with open(file2) as file_2:
            current_word_2 = read_to_delim(file_2, delim)
            while current_word_2:
                if current_word_2 == current_word_1:
                    print(current_word_2)
                current_word_2 = read_to_delim(file_2, delim)
        current_word_1 = read_to_delim(file_1, delim)
I leave read_to_delim to you, but this is the extreme case that is memory-optimal but time-least-optimal.
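For completeness, one way read_to_delim could look. This is a guess at the intended helper, not the author's code: it assumes a text-mode file and a single-character delimiter, and it returns an empty string only at end of file so that the while loops terminate correctly.

```python
def read_to_delim(f, delim):
    """Return the next non-empty token from text file f, or '' at end of file."""
    chars = []
    while True:
        c = f.read(1)           # one character at a time keeps memory use minimal
        if c == "":             # end of file: return whatever was collected
            return "".join(chars)
        if c == delim:
            if chars:
                return "".join(chars)
            continue            # skip empty tokens such as doubled delimiters
        chars.append(c)
```

With newline-separated words you would call it with `delim="\n"`; Python's own file buffering keeps the character-by-character reads from hitting the disk each time.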
Depending on your application, of course, you could load the two files into a database, perform a left outer join, and discard the rows for which one of the two columns is null.
I have a huge list of multi-byte sequences (let's call them words) that I need to store in a file and that I need to be able to look up quickly. Huge means: about 2 million of those, each 10-20 bytes in length.
Furthermore, each word shall have a tag value associated with it, so that I can use that to reference more (external) data for each item (hence, a spellchecker's dictionary is not working here as that only provides a hit-test).
If this were just in memory, and if memory was plenty, I could simply store all words in a hashed map (aka dictionary, aka key-value pairs), or in a sorted list for a binary search.
However, I'd like to compress the data highly, and would also prefer not to have to read the data into memory but rather search inside the file.
As the words are mostly based on the English language, there's a certain likelihood that certain "syllables" in the words occur more often than others - which is probably helpful for an efficient algorithm.
Can someone point me to an efficient technique or algorithm for this?
Or even code examples?
Update
I figure that a DAWG, or anything similar that routes paths into common suffixes, won't work for me, because then I won't be able to tag each complete word path with an individual value. If I were to detect common suffixes, I'd have to put them into their own dictionary (lookup table) so that a trie node could reference them, yet the node would keep its own ending node for storing that path's tag value.
In fact, that's probably the way to go:
Instead of building the tree nodes for single chars only, I could try to find often-used character sequences, and make a node for those as well. That way, single nodes can cover multiple chars, maybe leading to better compression.
Now, if that's viable, how would I actually find often-used sub-sequences in all my phrases?
With about 2 million phrases consisting of usually 1-3 words, it'll be tough to run all permutations of all possible substrings...
There exists a data structure called a trie. I believe that this data structure is perfectly suited to your requirements. Basically, a trie is a tree where each node is a letter and each node has child nodes. In a letter-based trie, there would be up to 26 children per node.
Depending on what language you are using, the children may be easier or better to store as a variable-length list during construction.
This structure gives:
a) Fast searching. Following a word of length n, you can find the string in n links in the tree.
b) Compression. Common prefixes are stored.
Example: the words BANANA and BANAL share the nodes B, A, N, A, and the last of these (the second A) has two children, L and N. Your nodes can also store other information about the word.
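A minimal in-memory sketch of such a trie in Python, with a tag stored on the node where each word ends. The class and method names are made up, and a dict of children is used instead of a fixed 26-slot array; an on-disk layout would need its own node encoding.

```python
class TrieNode:
    __slots__ = ("children", "tag")

    def __init__(self):
        self.children = {}   # letter -> TrieNode
        self.tag = None      # tag value if a word ends here, else None


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, tag):
        """Add word, sharing nodes with any existing common prefix."""
        node = self.root
        for ch in word:
            child = node.children.get(ch)
            if child is None:
                child = TrieNode()
                node.children[ch] = child
            node = child
        node.tag = tag

    def lookup(self, word):
        """Follow one link per letter; return the tag, or None if absent."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.tag
```

Inserting BANANA and BANAL creates eight nodes in total (the root, the shared B, A, N, A, then N, A on one branch and L on the other), and each word keeps its own tag.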
(http://en.wikipedia.org/wiki/Trie)
Andrew JS
I would recommend using a Trie or a DAWG (directed acyclic word graph). There is a great lecture from Stanford on doing exactly what you want here: http://academicearth.org/lectures/lexicon-case-study
Have a look at the paper "How to squeeze a lexicon". It explains how to build a minimized finite state automaton (which is just another name for a DAWG) with a one-to-one mapping of words to numbers and vice versa. Exactly what you need.
You should get familiar with indexed files.
Have you tried just using a hash map? Thing is, on a modern OS architecture, the OS will use virtual memory to swap out unused memory segments to disk anyway. So it may turn out that just loading it all into a hash map is actually efficient.
And as jkff points out, your list would only be about 40 MB, which is not all that much.