Joining very large lists [closed] - algorithm

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
Let's put some numbers first:
The largest of the lists is about 100M records (but is expected to grow up to 500M). The other lists (5-6 of them) are in the millions but will stay under 100M for the foreseeable future.
These are always joined on a single id, and never on any other parameters.
What's the best algorithm for joining such lists?
I was thinking along the lines of distributed computing: use a good hash function (the consistent-hashing kind, where you can add a node without much data movement) and split these lists into several smaller files. Since they are always joined on the common id (which I will be hashing), the problem would boil down to joining pairs of small files, perhaps using the *nix join command.
A DB (at least MySQL) would join using a merge join (since the join is on the primary key). Is that going to be more efficient than my approach?
I know it's best to test and see, but given the magnitude of these files that is pretty time consuming. I would like to do some theoretical calculation first and then see how it fares in practice.
Any insights on these or other ideas would be helpful. I don't mind if it takes slightly longer, but I would prefer the best utilization of the resources I have. Don't have a huge budget :)

Use a database. Databases are designed for performing joins (with the right indexes, of course!)
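For intuition about what the database does internally, here is a minimal Ruby sketch of a sort-merge join over two inputs that are already sorted by id. It assumes ids are unique within each list (as with primary keys); the record shapes are illustrative only.

```ruby
# Two-pointer sort-merge join on a common :id key.
# Both inputs must be pre-sorted by :id, with unique ids in each
# (a primary-key join); duplicate keys would need nested handling.
def merge_join(left, right)
  out = []
  i = j = 0
  while i < left.size && j < right.size
    a, b = left[i], right[j]
    if a[:id] < b[:id]
      i += 1            # left record has no partner yet; advance left
    elsif a[:id] > b[:id]
      j += 1            # right record has no partner yet; advance right
    else
      out << a.merge(b) # matching ids: combine the two records
      i += 1
      j += 1
    end
  end
  out
end

left  = [{ id: 1, name: "a" }, { id: 3, name: "b" }]
right = [{ id: 3, score: 9 }, { id: 4, score: 7 }]
merge_join(left, right)  # => [{ id: 3, name: "b", score: 9 }]
```

Each input is scanned exactly once, so the join itself is linear in the total input size once both sides are sorted (or, as with a primary-key index, already stored in order).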

Related

What's a good selective pressure to use in tournament selection in a genetic algorithm? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
What is the optimal and usual value of selective pressure in tournament selection? What percent of the best members of the current generation should propagate to the next generation?
Unfortunately, there isn't a great answer to this question. The optimal parameters will vary from problem to problem, and people use a wide range of them. Selecting the right tournament selection parameters is currently more of an art than a science. Stronger selective pressure (a larger tournament) will generally result in the population converging on a solution faster, at the cost of that solution potentially not being as good. This is called the exploration vs. exploitation tradeoff, and it underlies most algorithms for searching a large space of possible solutions - you're not going to get away from it.
I know that's not very helpful, though - you want a starting place, and that's completely reasonable. So here's the best one I know of (and I know a number of others who use it as a go-to default tournament configuration as well): a tournament size of two. Basically, this means you just keep picking random pairs of solutions, choosing the best one, and sending it to the next generation (with mutation and crossover as desired), until the next generation is the desired size. This has the nice property that any member of the population besides the absolute worst has a chance of getting to the next generation, but better ones have a better chance.
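The size-two tournament described above can be sketched in a few lines of Ruby. The names and the toy fitness function are illustrative, and mutation/crossover are omitted:

```ruby
# Pick two random members and keep the fitter one (tournament size 2).
# `fitness` is a callable where higher is better.
def tournament_select(population, fitness, rng: Random.new)
  a = population[rng.rand(population.size)]
  b = population[rng.rand(population.size)]
  fitness.call(a) >= fitness.call(b) ? a : b
end

# Build the next generation by repeated tournaments until it is full.
def next_generation(population, fitness, rng: Random.new)
  Array.new(population.size) { tournament_select(population, fitness, rng: rng) }
end

pop = [1, 5, 3, 9, 2]
fit = ->(x) { x }  # toy fitness: the value itself
gen = next_generation(pop, fit)
```

Note the property mentioned above: any member except the absolute worst can win some pairing, but fitter members win more of theirs, so selection pressure stays gentle.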

How to implement a general-purpose list type? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
Suppose you are developing a language that is intended to be used for scripting, prototyping, as a macro language for application automation, or as an interactive calculator. It's dynamically typed, and memory management is implicit, based on garbage collection. In most cases, you do not expect users to use highly optimized algorithms or hand-picked and fine-tuned data structures. You want to provide a general-purpose list type that would have decent performance on average. It must support all kinds of operations: iteration, random access by index, prepending and appending elements, insertion, deletion, mapping, filtering, membership testing, concatenation, splitting, reversing, sorting, cloning, and extracting segments. It could be used with both small and large numbers of elements (but you can assume everything fits into physical memory). It's intended only for single-threaded access, so you need not care about thread safety.
You expect users to reach for this general-purpose list type no matter what their scenario or usage pattern is. Some users might even want to use it as a sparse array, where most elements have some default value (e.g. 0) and only a few elements have non-default values.
What implementation would you choose?
We assume that you can afford to invest significant development effort, so the solution need not be necessarily simple. For example, you could implement different ways of internal organization of data, and switch between them depending on number of elements or usage patterns. High performance is a more important goal than reducing memory consumption, so you can afford some memory overhead if it wins you performance.
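Purely as an illustration of the "switch between internal representations" idea mentioned above (not a recommendation), here is a toy Ruby sketch of a list that starts as a dense Array and moves to a sparse Hash when most slots hold the default value. The class name and threshold are invented for this sketch, and real heuristics would need to be far more careful:

```ruby
# Toy fixed-size list that switches from a dense Array to a sparse
# Hash-of-nondefaults when >90% of its slots hold the default value.
class HybridList
  SPARSE_THRESHOLD = 0.9  # arbitrary illustrative cutoff

  attr_reader :size

  def initialize(size, default = 0)
    @size = size
    @default = default
    @dense = Array.new(size, default)  # start with the dense form
    @sparse = nil                      # Hash of index => non-default value
  end

  def [](i)
    @sparse ? @sparse.fetch(i, @default) : @dense[i]
  end

  def []=(i, v)
    if @sparse
      if v == @default
        @sparse.delete(i)   # storing the default just drops the entry
      else
        @sparse[i] = v
      end
    else
      @dense[i] = v
      maybe_go_sparse
    end
  end

  private

  # Switch representations when the dense array is mostly defaults.
  def maybe_go_sparse
    return unless @dense.count(@default).fdiv(@size) > SPARSE_THRESHOLD
    @sparse = {}
    @dense.each_with_index { |v, i| @sparse[i] = v if v != @default }
    @dense = nil
  end
end
```

A production version would also need the reverse transition, amortized checks instead of scanning on every write, and all the list operations from the question; the point here is only that the representation switch can be hidden entirely behind the public indexing API.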

Is using unique data better than random data for a hash? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
I need to generate global unique ids by hashing some data.
On the one hand, I could use a combination of timestamp and network address, which is unique since every computer can only create one id at a given instant. But since this data is too long, I'd need to hash it, and thus collisions could occur. (As a side note, we could also throw in a random number if the timestamp is not exact enough.)
On the other hand, I could just use a random number and hash that. Shouldn't that bring exactly the same hash collision probability as the first approach? It is interesting because this approach would be faster and is much easier to implement.
Is there a difference in terms of hash collisions when using unique data rather than random data? (By the way, I will not use real GUIDs as described by the standard but mine will only be 64 bits long. But that shouldn't affect the question.)
Why bother to hash a random number? Hashing is designed to map inputs uniformly to a keyspace, but PRNGs are already giving you a uniform mapping of outcomes. All you're doing is creating more work.
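Either way, what matters for collisions is the birthday bound on the 64-bit id space: with n uniformly distributed 64-bit values, P(collision) ≈ 1 − exp(−n(n−1)/2^65), regardless of whether the uniform values come from hashing unique inputs or straight from a good PRNG. A quick Ruby sketch:

```ruby
# Birthday-bound approximation for the probability that at least two of
# n uniformly random `bits`-bit ids collide.
def collision_probability(n, bits = 64)
  1.0 - Math.exp(-n.to_f * (n - 1) / 2.0 / (2.0**bits))
end

collision_probability(1_000_000)      # ≈ 2.7e-8: a million 64-bit ids are quite safe
collision_probability(5_000_000_000)  # ≈ 0.49: billions of ids are not
```

So at 64 bits the unique-vs-random distinction matters much less than how many ids you expect to generate over the system's lifetime.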

Naive approaches to detecting plagiarism? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
Let's say you wanted to compare essays by students and see if one of those essays was plagiarized. How would you go about this in a naive manner (i.e., with a not-too-complicated approach)? Of course, there are simple ways like comparing the words used in the essays, and complicated ways like using compression functions, but what are some other ways to check for plagiarism without too much complexity/theory?
There are several papers giving several approaches; I recommend reading this one. The paper presents an algorithm based on an index structure built over the entire file collection, which the authors say can be used to find similar code fragments in a large software system. Before the index is built, all the files in the collection are tokenized. Tokenizing is a simple parsing problem and can be solved in linear time. For each of the N files in the collection, the output of the tokenizer for a file F_i is a string of n_i tokens.
Here is another paper you could read. Another good algorithm is based on SCAM: it detects plagiarism by comparing the sets of words that are common between the test document and a registered document. The authors' plagiarism detection system, like many information retrieval systems, is evaluated with precision and recall metrics.
You could take a look at Dick Grune's similarity comparator, which claims to work on natural language texts as well (I've only tried it on software). The algorithms are described as well. (By the way, his book on parsing is really good, in my opinion.)

Ruby Object manipulation [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
We have an algorithm that compares Ruby objects coming from MongoDB. The majority of the time is spent taking the results (~1000), assigning a weight to each, and comparing them to a base object. This process takes ~2 seconds for 1000 objects. Afterwards, we order the objects by weight and take the top 10.
Given that the number of initial matches will continue to grow, I'm looking for more efficient ways to compare and sort matches in Ruby.
I know this is kind of vague, but let's assume they are User objects that have arrays of data about the person and we're comparing them to a single user to find the best match for that user.
Have you considered storing/caching the weight? This works well if the weight depends only on the attributes of each user and not on values external to that user.
Also, how complex is the calculation involving the weight associated with a user and the "base" user? If it's complex you may want to consider using a graph database, which can store data that is specific to the relation between 2 nodes/objects.
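The weight-caching suggestion above can be sketched in Ruby as: compute each candidate's weight exactly once, sort by it, and take the top 10. The Candidate struct and the toy weight function are stand-ins for the real objects and scoring:

```ruby
Candidate = Struct.new(:id, :data)

# Rank candidates against `base`, computing each weight only once.
def top_matches(base, candidates, limit = 10)
  weighted = candidates.map { |c| [weight(base, c), c] }
  weighted.sort_by { |w, _| -w }.first(limit).map(&:last)
end

# Toy weight: count of shared data items (stand-in for the real scoring).
def weight(base, candidate)
  (base.data & candidate.data).size
end

base = Candidate.new(0, [1, 2, 3])
candidates = [Candidate.new(1, [1]), Candidate.new(2, [1, 2, 3]), Candidate.new(3, [])]
top_matches(base, candidates, 2).map(&:id)  # => [2, 1]
```

If the weights are stored on the user records (as suggested), the map step disappears entirely and only the sort remains; for just the top 10 out of many candidates, Enumerable#max_by with a count argument also avoids a full sort.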
