Increase Speed of Wordle Bot (General For Any Programming Language) - performance

I am working on a Wordle bot to calculate the best first (and subsequent) guess(es). I am assuming all 13,000 possible guesses are equally likely (for now), and I am running into a huge speed issue.
I cannot think of a valid way to avoid a triple for loop, with each level iterating over 13,000 elements. This is far too slow: it would take about 20 hours on my laptop to compute the best first guess (I assume subsequent guesses would be faster because fewer valid words remain).
The way I see it, I definitely need the first loop to test each word as a guess.
I need the second loop to find the colors/results given each possible answer with that guess.
Then, I need the third loop to see which words are still valid given the guess and the colors.
Then, I find the average words remaining for each guess, and choose the one with the lowest average using a priority queue.
I think the first two loops are unavoidable, but maybe there's a way to get rid of the third?
Does anyone have any thoughts or suggestions?

I did a similar thing for a list with 11,000 or so entries. It took 27 minutes.
I pregenerated, in one pass, a bitfield of the letters in each word
(i.e. one 32-bit integer per word) and then did crude testing using
the AND instruction. If a word failed that test, it skipped the rest of the loop.
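
A minimal Python sketch of that bitfield pre-test, under some assumptions: words are lowercase a-z, and the names letter_mask, could_match, required and forbidden are made up for illustration rather than taken from the answer's code. The mask test is only a crude filter. A word that passes still needs the full positional check, and a grey letter should only go into the forbidden mask if no other copy of it was marked green or yellow.

```python
def letter_mask(word: str) -> int:
    """Return an integer with bit i set if chr(ord('a') + i) occurs in word."""
    mask = 0
    for ch in word:
        mask |= 1 << (ord(ch) - ord("a"))
    return mask


def could_match(candidate_mask: int, required_mask: int, forbidden_mask: int) -> bool:
    """Cheap pre-test: every required letter present, no forbidden letter present."""
    has_required = (candidate_mask & required_mask) == required_mask
    has_forbidden = (candidate_mask & forbidden_mask) != 0
    return has_required and not has_forbidden


words = ["crane", "slate", "pious", "nymph"]
masks = [letter_mask(w) for w in words]  # built once, in one pass

required = letter_mask("a")    # hypothetical feedback: 'a' is in the answer
forbidden = letter_mask("cr")  # hypothetical feedback: 'c' and 'r' are not

# Only words that survive the AND test go on to the expensive exact check.
survivors = [w for w, m in zip(words, masks) if could_match(m, required, forbidden)]
```

Because the masks are computed once up front, the inner loop costs two AND operations per word before any per-letter comparison happens.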

Related

Using syntax to add a count of the number of cases which match that case's value

I don't think this matches any existing question but it seems kind of fundamental.
I have a variable full of ranks. Think of it like people who ran a marathon and their places. But there are lots of draws so there might be 5 firsts and 4 seconds and 9 thirds and so on.
Each case has a variable with their place, but the people who finished third are not actually third: from the figures above, they are joint 10th. The people who finished second are joint 6th.
How do I create a new variable with the marathon runners' actual places in the race?
If I understand right, you want the Nth place to reflect the number of actual people above in the list?
Here is a way to do that:
sort cases by OrigPlace.
compute MyPlace=OrigPlace.
if $casenum>1 and OrigPlace<>lag(OrigPlace) MyPlace=$casenum.
if $casenum>1 and OrigPlace=lag(OrigPlace) MyPlace=lag(MyPlace).
exe.
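
For readers who don't use SPSS, the same logic can be sketched in Python. This is an illustration only: competition_ranks is a hypothetical name, and unlike the syntax above it gives the first case rank 1 rather than copying its original place (the two agree whenever the lowest original place is 1).

```python
def competition_ranks(orig_places):
    """Return the actual finishing place for each case, in sorted order.

    A case gets its 1-based position in the sorted list unless it ties with
    the previous case, in which case it inherits that case's rank.
    """
    ordered = sorted(orig_places)
    ranks = []
    for position, place in enumerate(ordered, start=1):
        if position > 1 and place == ordered[position - 2]:
            ranks.append(ranks[-1])   # same place as the previous case: share its rank
        else:
            ranks.append(position)    # new place: rank is the case number
    return ranks


# The marathon from the question: 5 firsts, 4 seconds, 9 thirds.
places = [1] * 5 + [2] * 4 + [3] * 9
actual = competition_ranks(places)
```

With the question's figures, the seconds come out as joint 6th and the thirds as joint 10th.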

Why does a finite sum take so long to calculate?

I'm trying to compute the following sum:
It is calculated instantly. So I raise the number of points to 24^3 and it is still fast:
But when the number of points is 25^3, the calculation takes so long that I gave up waiting for the result! Moreover, there is a warning:
Why is it so time-consuming to calculate a finite sum? How can I get a precise answer?
Try
max=24;
Timing[N[
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,2,max},{j,1,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,2,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,1,1},{k,2,max}]]]
which quickly returns
{0.143978,14330.9}
and
max=25;
Timing[N[
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,2,max},{j,1,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,2,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,1,1},{k,2,max}]]]
which quickly returns
{0.156976,14636.6}
and even
max=50;
Timing[N[
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,2,max},{j,1,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,2,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,1,1},{k,2,max}]]]
which quickly returns
{1.36679,16932.5}
Changing your code in this way avoids doing hundreds or thousands of If tests that will almost always result in True. And it potentially uses symbolic algorithms to find those results instead of needing to add up each one of the individual values.
Compare those results and times if you replace Sum with NSum and if you replace /500 with *.002.
As for why the times suddenly change as you increment the bound: other people have noticed in the past that there appear to be some hard-coded bounds inside some of the numerical algorithms. When a range is small enough Mathematica will use one algorithm, but when the range is just large enough to exceed that bound it will switch to another, potentially slower, algorithm. It is difficult or impossible to know exactly why you see this change without being able to inspect the decisions being made inside the algorithms, and nobody outside Wolfram gets to see that information.
To get a more precise numerical value you can change N[...] to N[...,64] or N[...,256] or eliminate the N entirely and get a large complicated exact numeric result.
Be cautious with this, check the results carefully to make certain that I have not made any mistakes. And some of this is just guesswork on my part.
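
One way to check the split independently of Mathematica is a small Python sketch. It assumes the original code summed over every (i, j, k) from 1 to max and used an If test to skip the singular point (1, 1, 1); the question's code is not shown here, so that is a guess. The names term, split_sum and guarded_sum are hypothetical.

```python
import math
from itertools import product


def term(i, j, k):
    """One summand, 1/(e^x - 1) with x = (i^2 + j^2 + k^2 - 3)/500."""
    return 1.0 / math.expm1((i * i + j * j + k * k - 3) / 500)


def split_sum(max_n):
    """The three-part sum from the answer: no point needs an If test."""
    r = range(1, max_n + 1)
    tail = range(2, max_n + 1)
    part1 = sum(term(i, j, k) for i in tail for j in r for k in r)
    part2 = sum(term(1, j, k) for j in tail for k in r)
    part3 = sum(term(1, 1, k) for k in tail)
    return part1 + part2 + part3


def guarded_sum(max_n):
    """Assumed original form: visit every point and skip (1, 1, 1) explicitly."""
    total = 0.0
    for i, j, k in product(range(1, max_n + 1), repeat=3):
        if (i, j, k) != (1, 1, 1):
            total += term(i, j, k)
    return total
```

The three ranges together cover every point of the cube except (1, 1, 1), which is where the denominator is zero, so the two functions should agree up to floating-point rounding.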

Is it possible to come up with a distributed / multi-core implementation of a prime sieve?

I have been working on a prime sieve algorithm, and the basic implementation is working fine for me. What I am currently struggling with is a way to divide and distribute the calculation onto multiple processors.
I know it would require storing the actual sieve in a shared memory area or a text file, but how would one go about dividing the calculation-related steps?
Any lead would help. Thanks!
Split the numbers into sections of equal size, each processor will be responsible for one of these sections.
Another processor (or one of the processors) will generate the numbers whose multiples need to be crossed off, and pass each such number to all the other processors.
Each of the processors will then use the remainder of the section size divided by the given number and its own section index to determine the offset into its own section, and then loop through and cross off the applicable numbers.
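
The section-based scheme can be sketched in Python as follows. This is a sequential illustration only: in a real run each call to sieve_section would be handed to a separate processor. The names base_primes, sieve_section and segmented_primes are hypothetical, and the sketch assumes n is at least 4.

```python
import math


def base_primes(limit):
    """Plain sieve up to limit: the numbers whose multiples must be crossed off."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, math.isqrt(limit) + 1):
        if sieve[p]:
            sieve[p * p::p] = bytearray(len(range(p * p, limit + 1, p)))
    return [i for i, flag in enumerate(sieve) if flag]


def sieve_section(lo, hi, primes):
    """Return the primes in [lo, hi), given every base prime up to sqrt(hi - 1)."""
    flags = bytearray([1]) * (hi - lo)
    for p in primes:
        # Offset into this section: first multiple of p at or after lo,
        # but never below p*p so that p itself is not crossed off.
        start = max(p * p, -(-lo // p) * p)
        for multiple in range(start, hi, p):
            flags[multiple - lo] = 0
    return [lo + i for i, flag in enumerate(flags) if flag and lo + i >= 2]


def segmented_primes(n, sections):
    """Split 0..n into equal sections and sieve each one independently."""
    primes = base_primes(math.isqrt(n))
    size = -(-(n + 1) // sections)  # ceiling division
    result = []
    for index in range(sections):
        lo = index * size
        hi = min(lo + size, n + 1)
        if lo < hi:
            result.extend(sieve_section(lo, hi, primes))
    return result
```

Each section only needs its own bounds and the list of base primes, so no section ever writes into another section's memory.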
Alternatively, one could get a much simpler approach by just using shared memory.
Let the first processor start crossing off multiples of 2, the second multiples of 3, the third multiples of 5, etc.
Essentially just let each processor grab the next number from the array and run with it.
If you don't do this well, you may end up with the third processor crossing off multiples of 4, because the first hadn't reached 4 yet when the third started, so 4 was not crossed off. It shouldn't result in too much more work, though: it will take increasingly longer for a multiple of some prime to be grabbed by a processor, while it will always be the first value crossed off by a processor handling that prime, so the likelihood of this redundancy happening decreases very quickly.
Using shared memory like this tends to be risky. If you plan on using one bit per index, most languages don't let you work at that level directly, so you'll end up doing bitwise operations (probably bitwise AND) on a few bytes to make your changes, although this complexity might be hidden behind some API. In many languages that operation is also not atomic: one thread can read a value, AND it, and write it back, while another thread reads the same value before the first thread's write, ANDs it, and writes it back afterwards, so the first thread's changes are lost. There's no simple, efficient fix for this; what exactly you need to do will depend on the language.

brute force search optimisation

I have a function that is engineered as follows:
int brutesearch(startNumber,endNumber);
This function performs a linear search and returns the number that matches my criteria, or null if none is found in the searched numbers.
Say that:
I want to search all 6 digits numbers to find one that does something I want
I can run the brutesearch() function multithreaded
I have a laptop with 4 cores
My question is the following:
What is my best bet for optimising this search? Dividing the number space in 4 segments and running 4 instances of the function one on each core? Or dividing for example in 10 segments and running them all together, or dividing in 12 segments and running them in batches of 4 using a queue?
Any ideas?
Knowing nothing about your search criteria (there may be other considerations created by the memory subsystem), the tradeoff here is between the cost of having some processors do more work than others (e.g., because the search predicate is faster on some values than others, or because other threads were scheduled) and the cost of coordinating the work. A strategy that's worked well for me is to have a work queue from which threads grab a constant/#threads fraction of the remaining tasks each time. With only four processors it's pretty hard to go wrong, though, and the really big running-time wins are in algorithms.
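
The "constant/#threads fraction of the remaining tasks" idea can be sketched as a chunk generator in Python. The name guided_chunks and the parameters fraction and min_size are assumptions made for this illustration; the answer does not say which constant or minimum chunk size it used.

```python
def guided_chunks(start, end, threads, fraction=0.5, min_size=1):
    """Yield half-open (lo, hi) ranges covering [start, end).

    Each chunk takes fraction/threads of whatever is still unassigned,
    so chunks start large and shrink as the work runs out.
    """
    lo = start
    while lo < end:
        remaining = end - lo
        size = max(min_size, int(remaining * fraction / threads))
        hi = min(end, lo + size)
        yield lo, hi
        lo = hi


# All 6-digit numbers, 4 threads; a worker would pull the next chunk when idle.
chunks = list(guided_chunks(100_000, 1_000_000, threads=4, min_size=1000))
```

Large early chunks keep coordination cheap, while the small late chunks let a thread that finishes early pick up leftover work instead of idling.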
There is no general answer. You need to give more information.
If each of your comparisons is completely independent of the others, there are no opportunities for saving computation in a global resource (say, no global hash tables are involved), and your operations are all done in a single stage,
then your best bet is to just divide your problem space by the number of cores you have available, in this case 4, and send 1/4 of the data to each core.
For example, if you had 10 million unique numbers that you wanted to test for primality, or 10 million passwords you were trying to hash to find a match, then just divide by 4.
If you have a real world problem, then you need to know a lot more about the underlying operations to get a good solution. For example if a global resource is involved, then you won't get any improvement from parallelism unless you isolate the operations on the global resource somehow.
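
A rough Python sketch of the plain divide-by-4 approach, under stated assumptions: brutesearch here is a stand-in that takes the predicate as an extra argument (the question's function does not), and split_range and parallel_search are hypothetical helpers. It uses threads to mirror the question; in CPython a CPU-bound predicate would normally need processes (for example ProcessPoolExecutor) to actually use four cores.

```python
from concurrent.futures import ThreadPoolExecutor


def brutesearch(start_number, end_number, predicate):
    """Linear search over [start_number, end_number); None if nothing matches."""
    for n in range(start_number, end_number):
        if predicate(n):
            return n
    return None


def split_range(start, end, parts):
    """Divide [start, end) into `parts` contiguous segments of near-equal size."""
    step, extra = divmod(end - start, parts)
    bounds = [start]
    for i in range(parts):
        bounds.append(bounds[-1] + step + (1 if i < extra else 0))
    return list(zip(bounds[:-1], bounds[1:]))


def parallel_search(start, end, predicate, workers=4):
    """Run one brutesearch per segment and return the first match found."""
    segments = split_range(start, end, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda seg: brutesearch(seg[0], seg[1], predicate), segments)
        return next((r for r in results if r is not None), None)


# Hypothetical criterion: find the 6-digit number 765432.
found = parallel_search(100_000, 1_000_000, lambda n: n == 765_432)
```

This version waits for every segment to finish; stopping the other workers as soon as one finds a match would need a shared flag that each search checks periodically.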

Log combing algorithm

We get these ~50GB data files consisting of 16 byte codes, and I want to find any code that occurs 1/2% of the time or more. Is there any way I can do that in a single pass over the data?
Edit: There are tons of codes - it's possible that every code is different.
EPILOGUE: I've selected Darius Bacon as best answer, because I think the best algorithm is a modification of the majority element he linked to. The majority algorithm should be modifiable to only use a tiny amount of memory - like 201 codes to get 1/2% I think. Basically you just walk the stream counting up to 201 distinct codes. As soon as you find 201 distinct codes, you drop one of each code (deduct 1 from the counters, forgetting anything that becomes 0). At the end, you have dropped at most N/201 times, so any code occurring more times than that must still be around.
But it's a two-pass algorithm, not one. You need a second pass to tally the counts of the candidates. It's actually easy to see that any solution to this problem must use at least 2 passes (the first batch of elements you load could all be different and one of those codes could end up being exactly 1/2%).
Thanks for the help!
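
The counter-dropping scheme in the epilogue can be sketched in Python roughly like this. It is an illustration under assumptions: frequent_candidates and frequent_codes are hypothetical names, read_stream stands for any callable that returns a fresh iterator over the codes (for example by reopening the file), and denominator=200 encodes the 1/2% threshold.

```python
from collections import Counter


def frequent_candidates(stream, k):
    """Pass 1: return a superset of the items occurring more than n/k times."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # k distinct codes in hand (k - 1 stored plus this one):
            # drop one of each, forgetting counters that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)


def frequent_codes(read_stream, denominator=200):
    """Two passes: find candidates, then tally them exactly."""
    candidates = frequent_candidates(read_stream(), k=denominator + 1)
    counts = Counter()
    total = 0
    for item in read_stream():  # pass 2: exact counts for the candidates only
        total += 1
        if item in candidates:
            counts[item] += 1
    # Keep codes that occur at least 1/denominator of the time.
    return {code: n for code, n in counts.items() if n * denominator >= total}


# Toy data with a 10% threshold: "A" is 30%, "B" exactly 10%, the rest 1% each.
data = ["A"] * 30 + ["B"] * 10 + [f"x{i}" for i in range(60)]
result = frequent_codes(lambda: iter(data), denominator=10)
```

Each drop cancels k occurrences at once, so there are at most N/k drops, and any code that occurs more often than that still has a counter at the end of the first pass.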
Metwally et al., Efficient Computation of Frequent and Top-k Elements in Data Streams (2005). There were some other relevant papers I read for my work at Yahoo that I can't find now; but this looks like a good start.
Edit: Ah, see this Brian Hayes article. It sketches an exact algorithm due to Demaine et al., with references. It does it in one pass with very little memory, yielding a set of items including the frequent ones you're looking for, if they exist. Getting the exact counts takes a (now-tractable) second pass.
This will depend on the distribution of the codes. If there is a small enough number of distinct codes, you can build a frequency distribution (http://en.wikipedia.org/wiki/Frequency_distribution) in core with a map. Otherwise you will probably have to build a histogram (http://en.wikipedia.org/wiki/Histogram) and then make multiple passes over the data, examining the frequencies of the codes in each bucket.
Sort chunks of the file in memory, as if you were performing an external sort. Rather than writing out all of the sorted codes in each chunk, however, you can just write each distinct code and the number of occurrences in that chunk. Finally, merge these summary records to find the number of occurrences of each code.
This process scales to any size data, and it only makes one pass over the input data. Multiple merge passes may be required, depending on how many summary files you want to open at once.
Sorting the file allows you to count the number of occurrences of each code using a fixed amount of memory, regardless of the input size.
You also know the total number of codes (either by dividing the input size by a fixed code size, or by counting the number of variable length codes during the sorting pass in a more general problem).
So, you know the proportion of the input associated with each code.
This is basically the pipeline sort * | uniq -c
If every code appears just once, that's no problem; you just need to be able to count them.
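
A small Python sketch of the chunk-summary idea, with assumptions stated up front: the summaries are kept as in-memory lists here, whereas the approach described above would write each one to a file, and chunk_summaries and merge_summaries are hypothetical names.

```python
import heapq
from itertools import groupby, islice
from operator import itemgetter


def chunk_summaries(codes, chunk_size):
    """Yield one sorted list of (code, count) pairs per chunk of the input."""
    iterator = iter(codes)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        # Sort the chunk, then collapse runs of equal codes (like sort | uniq -c).
        yield [(code, len(list(run))) for code, run in groupby(sorted(chunk))]


def merge_summaries(summaries):
    """Merge sorted summaries, adding up the counts of equal codes."""
    merged = heapq.merge(*summaries, key=itemgetter(0))
    for code, group in groupby(merged, key=itemgetter(0)):
        yield code, sum(count for _, count in group)


codes = list("abacabad")
totals = dict(merge_summaries(chunk_summaries(codes, chunk_size=3)))
```

Only one chunk has to be sorted in memory at a time, and the merge step reads each summary sequentially.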
That depends on how many different codes exist, and how much memory you have available.
My first idea would be to build a hash table of counters, with the codes as keys. Loop through the entire file, increasing the counter of the respective code, and counting the overall number. Finally, filter all keys with counters that exceed (* overall-counter 1/200).
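
In Python the hash-table version might look like the sketch below. It only works if all distinct codes fit in memory, as the answer says. The name codes_over_threshold is hypothetical, and the comparison keeps codes at or above the threshold ("1/2% of the time or more", as in the question) rather than strictly above it.

```python
from collections import Counter


def codes_over_threshold(codes, denominator=200):
    """Count every code in one pass, then keep those at or above 1/denominator."""
    counts = Counter()
    total = 0
    for code in codes:
        counts[code] += 1
        total += 1
    # Integer comparison avoids floating-point rounding at the boundary.
    return {code: n for code, n in counts.items() if n * denominator >= total}


# Toy example with a 25% threshold over 8 codes.
sample = ["a"] * 5 + ["b"] + ["c"] * 2
frequent = codes_over_threshold(sample, denominator=4)
```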
If the files consist solely of 16-byte codes, and you know how large each file is, you can calculate the number of codes in each file. Then you can find the 0.5% threshold and follow any of the other suggestions to count the occurrences of each code, recording each one whose frequency crosses the threshold.
Do the contents of each file represent a single data set, or is there an arbitrary cutoff between files? In the latter case, and assuming a fairly constant distribution of codes over time, you can make your life simpler by splitting each file into smaller, more manageable chunks. As a bonus, you'll get preliminary results faster and can pipeline them into the next process earlier.
