We have an algorithm that compares Ruby objects coming from MongoDB. The majority of the time is spent taking the results (~1000 of them), assigning a weight to each, and comparing them to a base object. This process takes ~2 seconds for 1000 objects. Afterwards, we order the objects by weight and take the top 10.
Given that the number of initial matches will continue to grow, I'm looking for more efficient ways to compare and sort matches in Ruby.
I know this is kind of vague, but let's assume they are User objects that have arrays of data about the person and we're comparing them to a single user to find the best match for that user.
Have you considered storing/caching the weight? This works well if the weight depends only on the attributes of each user and not on values external to that user.
Also, how complex is the calculation of the weight between a user and the "base" user? If it's complex, you may want to consider using a graph database, which can store data specific to the relation between two nodes/objects.
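A minimal sketch of both ideas, assuming each user is a plain object holding an array of attributes (the names and the overlap-based weight below are made up for illustration): compute the weight once per candidate and let Ruby keep only the top 10.

User = Struct.new(:name, :attributes)

# Example weight: how many attributes the candidate shares with the base user.
def weight(candidate, base)
  (candidate.attributes & base.attributes).size
end

base_user  = User.new("base", %w[ruby mongodb hiking jazz])
candidates = [
  User.new("a", %w[ruby jazz]),
  User.new("b", %w[hiking jazz mongodb ruby]),
  User.new("c", %w[golf])
]

# max_by calls the block exactly once per element, so the expensive comparison
# runs N times, and max_by(10) keeps the ten best without sorting all N results.
top_matches = candidates.max_by(10) { |u| weight(u, base_user) }

If the weight only depends on a user's own attributes, you can also memoize it on the user record so repeated rankings skip the calculation entirely.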
I have a question about machine learning regarding predictions.
So typically I would have a dataset with x's and y's that I would train my algorithm on. But what if I just have a dataset with input variables only (x's) and no actual labels (y's)?
For example, I'm looking for fraudulent transactions.
In dataset A I have a bunch of input variables like amounts, zip codes, merchant, etc., and I have a fraud status variable that is 1 for possible fraud and 0 for a safe transaction. Here I have known frauds/known non-frauds that I can train my model on.
However, what if I have a dataset where there is no fraud variable? All I have is my input variables and no variable that indicates whether a transaction is fraud or not. How could an ML algorithm then predict the probability of a transaction being fraudulent for this specific dataset?
I think what you are looking for is anomaly detection. In anomaly detection, you try to find the data points that are different from the rest of the data; in your case, those are the fraudulent transactions.
There are quite a few algorithms available in sklearn; see its novelty and outlier detection documentation. I would recommend starting with the IsolationForest model for your problem.
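The recommendation above is sklearn's IsolationForest; purely to illustrate what anomaly detection means without labels (this is not IsolationForest, and the feature names are made up), here is a toy Ruby sketch that scores each transaction by how far its numeric features sit from the rest and flags the highest-scoring row:

transactions = [
  { amount: 12.0,  hour: 14 },
  { amount: 9.5,   hour: 13 },
  { amount: 11.0,  hour: 15 },
  { amount: 950.0, hour: 3 }   # unusual relative to the rest
]

features = transactions.first.keys

# Per-feature mean and standard deviation.
stats = features.to_h do |f|
  values = transactions.map { |t| t[f].to_f }
  mean   = values.sum / values.size
  sd     = Math.sqrt(values.sum { |v| (v - mean)**2 } / values.size)
  [f, [mean, sd]]
end

# Anomaly score = sum of absolute z-scores across features; no labels needed.
scored = transactions.map do |t|
  score = features.sum do |f|
    mean, sd = stats[f]
    sd.zero? ? 0.0 : (t[f] - mean).abs / sd
  end
  [t, score]
end

most_suspicious = scored.max_by { |_, score| score }.first
# => { amount: 950.0, hour: 3 }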
I'm reading about AI, and in the notes it is mentioned:
A lookup table in chess would have roughly 35^100 entries.
But what does this mean? Is there any way we could find out how long it would take the computer to search through the table and find its entry? Would we assume there is some order, or that there is no order?
The number of atoms in the known universe is estimated to be around 10^80 which is much less than 35^100. With current technology, at least a few thousand atoms are required to store a single bit. I assume that each entry of your table would have multiple bits. You would need some really advanced technology to implement the memory of your computer.
So the answer is: with current technology it is not a matter of time; it is simply impossible.
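For a feel for the numbers, Ruby's arbitrary-precision integers let you check the orders of magnitude directly:

entries = 35**100
entries.to_s.length   # => 155, so 35**100 is roughly 10**154
atoms   = 10**80      # estimated atoms in the observable universe
entries / atoms       # => about 10**74 table entries for every atom in the universe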
In order to support users learning English, I want to make a multiple-choice quiz using the vocabulary that the user is studying.
For example, if the user is learning "angel", then I need an algorithm to produce some similar words such as "angle" and "angled".
Another example: if the user is learning "accountant", then I need an algorithm to produce some similar words such as "accounttant", "acountant", and "acounttant".
You could compute the Levenshtein distance from the starting word to each word in your vocabulary and pick the 2 or 3 with the shortest distances.
Depending on how many words are in your dictionary, this might take a long time, so I would recommend bailing out after a certain (small) number of steps - i.e. if you have made 3 edits and still haven't arrived at your target word, stop and move on to the next one.
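A minimal sketch of that approach (the method names and the cutoff of 3 are illustrative): a standard dynamic-programming Levenshtein distance that gives up once a whole row of the table exceeds the cutoff, then min_by picks the closest candidates.

def levenshtein(a, b, max = nil)
  return b.length if a.empty?
  return a.length if b.empty?

  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    # Bail out early once every value in the current row exceeds the cutoff.
    return max + 1 if max && curr.min > max
    prev = curr
  end
  prev.last
end

def closest_words(target, dictionary, count = 3)
  dictionary.reject { |w| w == target }
            .min_by(count) { |w| levenshtein(target, w, 3) }
end

closest_words("angel", %w[angle angled angels anger table])
# => e.g. ["angels", "anger", "angle"] (ties may come back in a different order)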
Why does SecureRandom.uuid create a unique string?
SecureRandom.uuid
# => "35cb4e30-54e1-49f9-b5ce-4134799eb2c0"
Is the string that SecureRandom.uuid creates never repeated?
The string is not in fact guaranteed unique. There is a very small but finite chance of a collision.
However in practice you will never see two ids generated using this mechanism that are the same, because the probability is so low.
You may safely treat a call to SecureRandom.uuid as generating a globally unique string in code that needs to manage many billions of database entities.
Here is a small table of collision probabilities.
Opinion: if I were to pick an arbitrary limit where you might start to see one or two collisions across the entire dataset with a realistic probability, I would go for around 10**16 ids - assuming you create a million ids per second in your system, it would take roughly 300 years to reach that size. Even then, the probability of seeing any collision over the whole lifetime of the project would be roughly 1 in 100,000.
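If you want to sanity-check those figures yourself: a version 4 UUID carries 122 random bits, and the usual birthday-problem approximation for the chance of any collision among n ids is n^2 / 2^123. A tiny Ruby sketch (the sample sizes are just examples):

RANDOM_BITS = 122   # random bits in a version 4 UUID

def approx_collision_probability(n)
  n.to_f**2 / 2.0**(RANDOM_BITS + 1)
end

approx_collision_probability(1_000_000_000)  # => ~1e-19 for a billion ids
approx_collision_probability(10**16)         # => ~1e-5, i.e. about 1 in 100,000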
I need to generate globally unique ids by hashing some data.
On the one hand, I could use a combination of timestamp and network address, which is unique since each computer can only create one id at a time. But since this data is too long, I'd need to hash it, and thus collisions could occur. (As a side note, we could also throw in a random number if the timestamp is not precise enough.)
On the other hand, I could just use a random number and hash that. Shouldn't that give exactly the same hash collision probability as the first approach? This is interesting because that approach would be faster and much easier to implement.
Is there a difference in terms of hash collisions when using unique data rather than random data? (By the way, I will not use real GUIDs as described by the standard; mine will only be 64 bits long. But that shouldn't affect the question.)
Why bother to hash a random number? Hashing is designed to map inputs uniformly onto a keyspace, but a good random number generator already gives you uniformly distributed values. All you're doing is creating more work.
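To make that concrete in Ruby (a sketch; the hash-input fields are just examples): both routes below end up with 64 bits whose collision behaviour is governed by the same birthday bound, and the second one skips the hashing step entirely.

require 'securerandom'
require 'digest'
require 'socket'

# 1. Hash "unique" data (host + timestamp + some randomness) and keep 64 bits.
material  = "#{Socket.gethostname}-#{Time.now.to_f}-#{SecureRandom.hex(4)}"
hashed_id = Digest::SHA256.hexdigest(material)[0, 16].to_i(16)

# 2. Just take 64 random bits directly - same uniformity, less work.
random_id = SecureRandom.random_number(2**64)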