Using syntax to add a count of the number of cases which match that case's value - syntax

I don't think this matches any existing question but it seems kind of fundamental.
I have a variable full of ranks. Think of it like people who ran a marathon and their places. But there are lots of draws so there might be 5 firsts and 4 seconds and 9 thirds and so on.
Each case has a variable with their place except the people who finished third are not actually third. They are joint 10th from the above figures. The people who finished second are joint 6th.
How do I create a new variable with the marathon runners actual places in the race?

If I understand right, you want the Nth place to reflect the number of actual people above in the list?
Here is a way to do that:
sort cases by OrigPlace.
compute MyPlace=OrigPlace.
if $casenum>1 and OrigPlace<>lag(OrigPlace) MyPlace=$casenum.
if $casenum>1 and OrigPlace=lag(OrigPlace) MyPlace=lag(MyPlace).
exe.
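For readers without SPSS, the same idea can be sketched in Python. This is only an illustration of the logic above, not part of the original answer; the function name `competition_places` and the sample data are made up for the example.

```python
from collections import Counter

def competition_places(orig_places):
    """Map each tied place to 1 + the number of runners who finished ahead."""
    counts = Counter(orig_places)
    actual = {}
    ahead = 0  # runners in strictly better places seen so far
    for place in sorted(counts):
        actual[place] = ahead + 1
        ahead += counts[place]
    return [actual[p] for p in orig_places]

# 5 firsts, 4 seconds and 9 thirds, as in the question
places = [1] * 5 + [2] * 4 + [3] * 9
result = competition_places(places)
```

With these figures the firsts stay 1st, the seconds become joint 6th and the thirds become joint 10th.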


Increase Speed of Wordle Bot (General For Any Programming Language)

I am working on a Wordle bot to calculate the best first (and subsequent) guess(es). I am assuming all 13,000 possible guesses are equally likely (for now), and I am running into a huge speed issue.
I can not think of a valid way to avoid a triple for loop, each with 13,000 elements. Obviously, this is ridiculously slow and would take about 20 hours on my laptop to compute the best first guess (I assume it would be faster for subsequent guesses due to fewer valid words).
The way I see it, I definitely need the first loop to test each word as a guess.
I need the second loop to find the colors/results given each possible answer with that guess.
Then, I need the third loop to see which guesses are valid given the guess and the colors.
Then, I find the average words remaining for each guess, and choose the one with the lowest average using a priority queue.
I think the first two loops are unavoidable, but maybe there's a way to get rid of the third?
Does anyone have any thoughts or suggestions?
I did a similar thing, for a list with 11,000 or so entries. It took 27 minutes.
I did it using a pregenerated (in one pass) list of the letters in each word,
stored as a bitfield (i.e. one 32-bit integer), and then did crude testing using
the AND instruction. If a word failed that test, it skipped the rest of the loop.
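A minimal sketch of that bitfield prefilter in Python, assuming lowercase a-z words. The names `letter_mask` and `prefilter` and the tiny word list are hypothetical, and the test is deliberately crude: it ignores letter positions and repeated letters, so survivors still need the full colour check.

```python
def letter_mask(word):
    """Pack the set of letters in a lowercase a-z word into one integer."""
    mask = 0
    for ch in word:
        mask |= 1 << (ord(ch) - ord("a"))
    return mask

def prefilter(candidates, masks, required, forbidden):
    """Keep words that contain every required letter and no forbidden letter."""
    return [
        w for w, m in zip(candidates, masks)
        if (m & required) == required and (m & forbidden) == 0
    ]

words = ["crane", "slate", "pious", "mount"]
masks = [letter_mask(w) for w in words]  # computed once, in one pass

# Hypothetical feedback: 'a' and 'e' are present somewhere, 'c' is absent
required = letter_mask("ae")
forbidden = letter_mask("c")
survivors = prefilter(words, masks, required, forbidden)
```

Because the masks are built once up front, the inner loop does two integer ANDs per word before any per-letter work.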

Fill all connected grid squares of the same type

Foreword: I am aware there is another question like this, however mine has very specific restrictions. I have done my best to make this question applicable to many, as it is a generic grid issue, but if it still does not belong here, then I am sorry, and please be nice about it. I have found in the past stackoverflow to be a very picky and hostile environment to question askers, but I'm hoping that was just a bad couple people.
Goal(abstract): Check all connected grid squares in a 3D grid that are of the same type and touching on one face.
Goal(specific/implementation): Create a "fill bucket" tool in Minecraft with command blocks.
Knowledge of Minecraft not really necessary to answer, this is more of an algorithm question, and I will be staying away from Minecraft specifics.
Restrictions: I can do this in code with recursive functions, but in Minecraft there are some limitations I am wondering if are possible to get around. 1: no arrays(data structure) permitted. In Minecraft I can store an integer variable and do basic calculations with it (+,-,*,/,%(mod),=,==), but that's it. I cannot dynamically create variables or have the program create anything with a name that I did not set out ahead of time. I can do "IF" and "OR" statements, and everything that derives from them. I CANNOT have multiple program pointers - that is, I can't have things like recursive functions, which require a program to stop executing, execute itself from beginning to end, and then resume executing where it was - I have minimal control over the program flow. I can use loops and conditional exits (so FOR loops). I can have a marker on the grid in 3D space that can move regardless of the presence of blocks (I'm using an armour stand, for those who know), and I can test grid squares relative to that marker.
So say my grid is full of empty spaces only. There are separate clusters of filled squares in opposite corners, not touching each other. If I "use" my fillbucket tool on one block / filled grid square, I want it to use a single marker to check and identify all the connected grid squares - basically, I need to be sure that it traverses the entire shape, all the nooks and crannies, but not the squares that are not connected to that shape. So in the end, one of the two clusters, from me only selecting a single square of it, will be erased/replaced by another kind of block, without affecting the other blocks around it.
Again, apologies if this doesn't belong here. And only answer this if you WANT to tackle the challenge - it's not important or anything, I just want to do this. You don't have to answer it if you don't want to. Or if you can solve this problem for a 2D grid, that would be helpful as well, as I could possibly extend that to work for 3D.
Thank you, and if I get nobody degrading me for how I wrote this post or the fact that I did, then I will consider this a success :)
With help from this and other sources, I figured it out! It turns out that, since all recursive functions (or at least most of them) can be written as FOR loops, that I can make a recursive function in Minecraft. So I did, and the general idea of it is as follows:
For explaining the program, you may assume the situation is a largely empty grid with a grouping of filled squares in one part of it, and the goal is to replace the kind of block that that grouping is made of with a different block. We'll say the grouping currently consists of red blocks, and we want to change them to blue blocks.
Initialization:
IDs - An objective (data structure) for holding each marker's ID (score)
numIDs - An integer variable for holding number of IDs/markers active
Create one marker at selected grid position with ID [1] (aka give it a score of 1 in the "IDs" objective). This grid position will be a filled square from which to start replacing blocks.
Increment numIDs
Main program:
FOR loop that goes from 1 to numIDs
{
at marker with ID [1], fill grid square with blue block
step 1. test block one to the +x for a red block
step 2. if found, create marker there with ID [numIDs]
step 3. increment numIDs
[//repeat steps 1 2 and 3 for the other five adjacent grid squares: +z, -x, -z, +y, and -y]
delete stand[1]
numIDs -= 1
subtract 1 from every marker's ID's, so that the next marker to evaluate, which was [2], now has ID [1].
} (end loop)
So that's what I came up with, and it works like a charm. Sorry if my explanation is hard to understand, I'm trying to explain in a way that might make sense to both coders and Minecraft players, and maybe achieving neither :P
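For coders, the marker scheme above is essentially a breadth-first flood fill, with the numbered markers acting as a queue. Here is a rough Python sketch of that reading. The dict-based grid and the name `flood_fill` are assumptions for illustration, and unlike the steps above it recolours a square as soon as it is found, so the same square is never queued twice.

```python
from collections import deque

# The six face-adjacent neighbours: +x, -x, +y, -y, +z, -z
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def flood_fill(grid, start, new_block):
    """Replace the face-connected region containing `start` with `new_block`.

    `grid` maps (x, y, z) to a block name; missing keys count as empty.
    Returns the number of squares replaced.
    """
    old_block = grid.get(start)
    if old_block is None or old_block == new_block:
        return 0
    queue = deque([start])  # plays the role of the numbered markers
    grid[start] = new_block
    filled = 1
    while queue:
        x, y, z = queue.popleft()  # the marker that currently has ID 1
        for dx, dy, dz in NEIGHBOURS:
            pos = (x + dx, y + dy, z + dz)
            if grid.get(pos) == old_block:
                grid[pos] = new_block  # recolour on discovery
                queue.append(pos)
                filled += 1
    return filled

# Two separate red clusters; only the one containing the start is changed
grid = {(0, 0, 0): "red", (1, 0, 0): "red", (1, 1, 0): "red", (5, 5, 5): "red"}
count = flood_fill(grid, (0, 0, 0), "blue")
```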

Need help pseudocoding a dice game and don't know where to even start [closed]

I am a beginner IT student and doing a project for my programming logic and design class. I need to create pseudocode for a dice game that allows you 2 rolls with 5 dice. On the first roll you get to pick 1 die to keep. The computer then rolls the other 4 dice and calculates your score based on what you rolled. There are 3 rolls per game and the total score is displayed. Rolling nothing takes points away. The scoring is: 2 of a kind=50 points, 3 of a kind=75 points, 4 of a kind=100 points and nothing subtracts 50 points.
The whole problem I have is I don't even know where to start. I think I need this to repeat 3 times, but what variables do I set? Please someone help me, I can't really ask my instructor because he is outside smoking the whole class and everything I have learned about this class mostly came from the internet and reading the book. I don't want to fail this class...someone please help me through this???
First of all don't panic. What you are about to do is break the task down into small steps.
Pseudo-code is not really code - you can't use it directly as a language, but instead it is just plain english to describe what it is you are doing and the flow of events.
So what are the initial steps to get you started?
Ask yourself what are the facts, what do you know exist in advance. These are the "declarations" that you make.
You have five dice. Each is a separate object so each gets its own variable declaration
dice_1
dice_2
dice_3
dice_4
dice_5
Next decide if each die has an initial value
dice_1 initial value = 0
etc...
Next you know that you have to throw the dice a number of times. Throwing is a variable with an initial value
turns initial value = 2
turns_counter initial value = 0
You should be getting the idea now. Are there other things you should declare in advance? I think so!
Next you have to decide what it is you are doing step by step. Is it just a sequence of events or is it repeating? If it's repeating how do you get it to stop?
While turns_counter is less than 2
Repeat the following:
turns_counter = turns_counter + 1
if turns_counter = 2
Throw. Collect_result. Sum_result.
else
Throw. Collect_result. Sum_result. Remove_a_dice.
endif.
Perhaps you have to tell the reusable code which objects it is going to work with? These are parameters that you pass to the reusable code, e.g. Throw(dice_1). Perhaps you also need to update some variables that you created? Do that in the reusable code, with or without passing them as parameters.
This is by no means complete or perfect, but you should get the idea about what's going on and how to break it down. It could take quite a while to do.
Most languages provide a pseudo-random number generator function that returns a random number within a certain range. I would start by figuring out which language you'll use and which function it provides.
Once you have that, you will need to call it for each roll of each dice. If you are rolling 5 dice, you would call it 5 times. And you would call it 5 more times for a second roll.
That's a start anyway.
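As an illustration, if the language were Python, the standard library's `random.randint` would do the job. The helper name `roll_dice` and the choice of which die to keep are placeholders, not part of the assignment.

```python
import random

def roll_dice(count):
    """Roll `count` six-sided dice and return the face values as a list."""
    return [random.randint(1, 6) for _ in range(count)]  # randint includes both ends

first_roll = roll_dice(5)            # first roll: all 5 dice
kept = first_roll[0]                 # stand-in for the player's choice of die
second_roll = [kept] + roll_dice(4)  # second roll: the other 4 dice are re-rolled
```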
You have already almost answered the question by simply writing it down here. There is no strict definition of what pseudocode is. Why don't you start by re-writing what you've described here as a sequence of steps. Then, for each step simply refine that step further until you think you've made it as fine-grain as you like.
You could start with something like this:
Roll 5 dice.
Pick 1 die to keep.
Roll the other 4 dice.
Calculate the score.
// etc...
Quite weird to think that it's easier to ask SO than your instructor! :)
The easiest way to get started on this is to not rigorously bind yourself to the constraint of a specific language, or even to pseudocode. Simply, in natural English, write out how you would do this. Imagine that YOU are the computer, and somebody wants to play the game with you. Just imagine, in very specific detail, what you would do at each potential step, i.e.
Give the user 5 dice
Ask the user to roll them
From that roll, allow the user to pick one die to keep
...etc. Once you have done this, and you are sure it is correct, start transforming it into pseudo code by thinking about what a computer would need to do to solve this problem. For instance, you'll need a variable keeping track of how many points the user has, as well as how many total rolls have occurred. If you were very specific in your English description of the problem, this should mean you basically only need to plug pseudo code into a few sentences you already have - in other words, you're just substituting one type of pseudo code for another.
I'd like to help, but straight-up providing the pseudo code wouldn't be very helpful to you. One of the hardest steps in beginning programming is learning to break a problem down into its constituent elements. That type of granular thinking is unintuitive at first, but gets easier the more time you spend on it.
Well, pseudo-code, in my experience, is best drawn up when you pretend you're writing up the work for someone else to do:
THINGS WE NEED
Dice
Players
Score
THINGS WE TRACK
Dice rolls
Player score
THINGS WE KNOW
(These are also called constants)
Nothing (-50)
2 of a kind (+50)
3 of a kind (+75)
4 of a kind (+100)
All of these are vital tools to getting started. And...well, asking questions on stackoverflow.
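To show how the "things we know" turn into code, here is one possible scoring sketch in Python. The names `SCORES`, `NOTHING_PENALTY` and `score_roll` are made up, and two details are assumptions because the question does not cover them: five of a kind is scored like four of a kind, and only the largest group of matching dice counts.

```python
from collections import Counter

SCORES = {2: 50, 3: 75, 4: 100}  # size of the largest matching group -> points
NOTHING_PENALTY = -50

def score_roll(dice):
    """Score one roll from the size of its largest group of equal dice."""
    best = max(Counter(dice).values())
    # Assumption: five of a kind is capped at the four-of-a-kind score
    return SCORES.get(min(best, 4), NOTHING_PENALTY)

total = sum(score_roll(r) for r in ([1, 2, 3, 4, 5], [2, 2, 2, 4, 5], [6, 6, 6, 6, 1]))
```

Here the three sample rolls score -50, +75 and +100, so `total` is 125.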
Next, define your "actions" (things we do), which utilizes the above known things that we will need.
I would start the same place I always do: creating our things.
def player():
    """Create a new player"""
def dice():
    """Creates 4 new, 6 sided dice"""
def welcome():
    """Welcome player by name, give option to quit"""
def game():
    """Initialize number of turns (start at 0)"""
def humanturn():
    """Roll dice, display, ask which one they'll keep"""
def compturn():
    """Roll four dice"""
def check():
    """Check for any matches in the dice"""
def score():
    """Tally up the score for any matches"""
def endturn():
    """Update turn(s), update total score"""
def gameover():
    """Display name, total score, ask for retry"""
def quit():
    """Quit the game"""
Those are your components, all fleshed out in a very procedural manner. There are many other ways to do this that are much better, but for now you're just writing the skeleton of an idea. You may be tempted to combine many of these methods together when you're ready to start coding, but it's a good idea to separate everything until you're confident you won't get lost chasing down a bug.
Good luck!

Seeking algo for text diff that detects and can group similar lines

I am in the process of writing a diff text tool to compare two similar source code files.
There are many such "diff" tools around, but mine shall be a little improved:
If it finds a set of lines are mismatched on both sides (ie. in both files), it shall not only highlight those lines but also highlight the individual changes in these lines (I call this inter-line comparison here).
An example of my somewhat working solution:
(Screenshot: http://files.tempel.org/tmp/diff_example.png)
What it currently does is take a set of mismatched lines and run their single chars through the diff algo once more, producing the pink highlighting.
However, the second set of mismatches, containing "original 2", requires more work: Here, the first two right lines ("added line a/b") were added, while the third line is an altered version of the left side. I wish my software to detect this difference between a likely alteration and a probable new line.
When looking at this simple example, I can rather easily detect this case:
With an algo such as Levenshtein, I could find that of all right lines in the set of 3 to 5, line 5 matches left line 3 best, thus I could deduce that lines 3 and 4 on the right were added, and perform the inter-line comparison on left line 3 and right line 5.
So far, so good. But I am still stuck with how to turn this into a more general algorithm for this purpose.
In a more complex situation, a set of different lines could have added lines on both sides, with a few closely matching lines in between. This gets quite complicated:
I'd have to match not only the first line on the left to the best on the right, but vice versa as well, and so on with all other lines. Basically, I have to match every line on the left against every one on the right. At worst, this might create even crossings, so that it's not easily clear any more which lines were newly inserted and which were just altered (Note: I do not want to deal with possible moved lines in such a block, unless that would actually simplify the algorithm).
Sure, this is never going to be perfect, but I'm trying to get it better than it is now. Any suggestions that aren't too theoretical but rather practical (I'm not good at understanding abstract algos) are appreciated.
Update
I must admit that I do not even understand how the LCS algo works. I simply feed it two arrays of strings and out comes a list of which sequences do not match. I am basically using the code from here: http://www.incava.org/projects/java/java-diff
Looking at the code I find one function equal() that is responsible for telling the algorithm whether two lines match or not. Based on what Pavel suggested, I wonder if that's the place where I'd make the changes. But how? This function only returns a boolean - not a relative value that could identify the quality of the match. And I cannot simply use a fixed Levenshtein ratio that would decide whether a similar line is still considered equal or not - I'll need something that's self-adapting to the entire set of lines in question.
So, what I'm basically saying is that I still do not understand where I'd apply the fuzzy value that relates to the relative similarity of lines that do not (exactly) match.
Levenshtein distance is based on the notion of an "edit script" that transforms one string into another. It's very closely related to the Needleman-Wunsch algorithm used for aligning DNA sequences by inserting gap characters, in which we search for the alignment that maximises a score in O(nm) time using dynamic programming. Exact matches between characters increase the score, while mismatches or inserted gap characters reduce the score. An example alignment of AACTTGCCA and AATGCGAT:
AACTTGCCA-
AA-T-GCGAT
(6 matches, 1 mismatch, 3 gap characters, 3 gap regions)
We can think of the top string being the "starting" sequence that we are transforming into the "final" sequence on the bottom. Each column containing a - gap character on the bottom is a deletion, each column with a - on the top is an insertion, and each column with different (non-gap) characters is a substitution. There are 2 deletions, 1 insertion and 1 substitution in the above alignment, so the Levenshtein distance is 4.
Here is another alignment of the same strings, with the same Levenshtein distance:
AACTTGCCA-
AA--TGCGAT
(6 matches, 1 mismatch, 3 gap characters, 2 gap regions)
But notice that although there are the same number of gaps, there is one less gap region. Because biological processes are more likely to create wide gaps than multiple separate gaps, biologists prefer this alignment -- and so will the users of your program. This is accomplished by also penalising the number of gap regions in the scores that we compute. An O(nm) algorithm to accomplish this for strings of lengths n and m was given by Gotoh in 1982 in a paper called "An improved algorithm for matching biological sequences". Unfortunately, I can't find any links to free full text of the paper -- but there are many useful tutorials that you can find by googling "sequence alignment" and "affine gap penalty".
In general, different choices of match, mismatch, gap and gap region weights will give different alignments, but any negative score for gap regions will prefer the bottom alignment above to the top one.
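To make the effect of a gap-region penalty concrete, the two alignments above can be scored with a small Python helper. This only scores an alignment that is already given (it is not Gotoh's algorithm), and the weights are arbitrary assumptions chosen for the example.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap=-1, gap_open=-2):
    """Score a finished alignment; `gap_open` is charged once per gap region."""
    assert len(top) == len(bottom)
    score = 0
    prev_gap_row = None  # row holding the previous column's gap, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            score += gap
            if row != prev_gap_row:
                score += gap_open  # a new gap region starts here
            prev_gap_row = row
        else:
            score += match if a == b else mismatch
            prev_gap_row = None
    return score

first = alignment_score("AACTTGCCA-", "AA-T-GCGAT")   # 3 gap regions
second = alignment_score("AACTTGCCA-", "AA--TGCGAT")  # 2 gap regions
```

With these weights the first alignment scores -4 and the second -2. The whole difference comes from the one extra gap region.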
What does all this have to do with your problem? If you use Gotoh's algorithm on individual characters with a suitable gap penalty (arrived at with a few empirical tests), you should find a significant decrease in the number of terrible-looking alignments like the example you gave.
Efficiency Considerations
Ideally, you could just do this on characters and ignore lines altogether, since the affine penalty will work to cluster changes into blocks spanning many lines wherever it can. But because of the higher running time, it may be more realistic to do a first pass on lines and then rerun the algorithm on characters, using as input all lines that are not identical. Under this scheme, any shared block of identical lines can be handled by compressing it into a single "character" with inflated matching weight, which helps to ensure no "crossings" appear.
With an algo such as Levenshtein, I could find that of all right lines in the set of 3 to 5, line 5 matches left line 3 best, thus I could deduce that lines 3 and 4 on the right were added, and perform the inter-line comparison on left line 3 and right line 5.
After you have determined it, use the same algorithm to determine which lines in these two chunks match each other. But you need to make a slight modification. When you used the algorithm to match equal lines, the lines could either match or not match, so that added either 0 or 1 to the cell of the table you used.
When comparing strings in one chunk some of them are "more equal" than others (ack. to Orwell). So they can add a real number from 0 to 1 to the cell when considering what sequence matches best so far.
To compute this metric (from 0 to 1), you can apply to each pair of strings you encounter... right, the same algorithm again (actually, you already did this when you were doing the first pass of the Levenshtein algorithm). This will compute the length of the LCS, whose ratio to the average length of the two strings would be the metric value.
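A small Python sketch of that metric, assuming a plain dynamic-programming LCS. The names `lcs_length` and `similarity` and the sample lines are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def similarity(a, b):
    """LCS length divided by the average length: 0.0 (nothing shared) to 1.0 (equal)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / ((len(a) + len(b)) / 2)

# Pick the right-hand line that most resembles the left-hand line
left = "int total = count + 1;"
rights = ["// added line a", "// added line b", "int total = count + 2;"]
best = max(rights, key=lambda r: similarity(left, r))
```

The value returned by `similarity` is the real number between 0 and 1 that would go into the table cell in place of the 0-or-1 match result.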
Or, you can borrow the algorithm from one of diff tools. For instance, vimdiff can highlight the matches you require.
Here's one possible solution someone else just made me realize:
My original approach was like this:
Split the text up into separate lines and use LCS algo to determine where there are blocks of nonmatching lines.
Use some smart algo (which this question is about) to figure out which of these lines closely match, i.e. to tell that these lines were modified between revisions.
Compare those closely matching lines line-by-line using LCS again, while marking the non-matching lines as entirely new.
While this would allow for a better visual display of changes when comparing source code revisions, I now found that a much simpler approach is usually sufficient. It works like this:
Same as above.
Take the right and left block of nonmatching lines, concatenate those lines, and tokenize them (either into language-specific tokens/words, or just into single characters)
Apply the LCS algo on the two arrays of tokens.
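A rough Python sketch of these steps, with one substitution that should be stated plainly: it uses the standard library's `difflib.SequenceMatcher` in place of the LCS routine, and that class uses its own matching heuristic rather than a strict LCS. The tokenizer regex and the sample lines are assumptions.

```python
import difflib
import re

def tokenize(text):
    """Split text into words, whitespace runs and single punctuation characters."""
    return re.findall(r"\w+|\s+|[^\w\s]", text)

def token_diff(left_lines, right_lines):
    """Concatenate each block of lines, tokenize it, and diff the token lists."""
    a = tokenize("\n".join(left_lines))
    b = tokenize("\n".join(right_lines))
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return a, b, matcher.get_opcodes()

left = ["x = compute(1)"]
right = ["# added line a", "x = compute(2)"]
a, b, ops = token_diff(left, right)

# Everything that is not tagged 'equal' is what would get highlighted
changed = [
    (tag, "".join(a[i1:i2]), "".join(b[j1:j2]))
    for tag, i1, i2, j1, j2 in ops
    if tag != "equal"
]
```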
Maybe those who replied to my original question assumed that I knew to do this all the time, but I had my focus so strongly on a per-line comparison that it did not occur to me to apply LCS on the set of lines by concatenating them, instead of processing them line-by-line.
So, while this approach will not provide as detailed change information as my original intent was, it still does improve the results over what I started yesterday with when I wrote this question.
I'll leave this question open for a while longer - maybe someone else, reading all this, can still provide a complete answer (Pavel and random_hacker offered some suggestions, but it's not a complete solution yet - anyway, thank you for the helpful comments).

Algorithm for most recently/often contacts for auto-complete?

We have an auto-complete list that's populated when you send an email to someone, which is all well and good until the list gets really big. Then you need to type more and more of an address to get to the one you want, which goes against the purpose of auto-complete.
I was thinking that some logic should be added so that the auto-complete results should be sorted by some function of most recently contacted or most often contacted rather than just alphabetical order.
What I want to know is if there's any known good algorithms for this kind of search, or if anyone has any suggestions.
I was thinking just a point system thing, with something like same day is 5 points, last three days is 4 points, last week is 3 points, last month is 2 points and last 6 months is 1 point. Then for most often, 25+ is 5 points, 15+ is 4, 10+ is 3, 5+ is 2, 2+ is 1. No real logic other than those numbers "feel" about right.
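Written out as a Python sketch, the point system might look like this. The table and function names are invented for the example, and treating "last month" as 30 days and "last 6 months" as 182 days is an assumption.

```python
from datetime import date

# (maximum days since last contact, points), checked in order
RECENCY_POINTS = [(0, 5), (3, 4), (7, 3), (30, 2), (182, 1)]
# (minimum number of emails sent, points), checked in order
FREQUENCY_POINTS = [(25, 5), (15, 4), (10, 3), (5, 2), (2, 1)]

def contact_score(last_contacted, times_contacted, today):
    """Recency points plus frequency points for one contact."""
    days_ago = (today - last_contacted).days
    recency = next((pts for limit, pts in RECENCY_POINTS if days_ago <= limit), 0)
    frequency = next(
        (pts for minimum, pts in FREQUENCY_POINTS if times_contacted >= minimum), 0
    )
    return recency + frequency

today = date(2024, 6, 30)
scores = {
    "daily colleague": contact_score(date(2024, 6, 30), 30, today),
    "new contact": contact_score(date(2024, 6, 25), 1, today),
    "old regular": contact_score(date(2023, 1, 1), 12, today),
}
```

Sorting the auto-complete matches by this score, highest first, gives the ordering described.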
Other than just arbitrarily picked numbers does anyone have any input? Other numbers also welcome if you can give a reason why you think they're better than mine
Edit: This would be primarily in a business environment where recentness (yay for making up words) is often just as important as frequency. Also, past a certain point there really isn't much difference between say someone you talked to 80 times vs say 30 times.
Take a look at Self organizing lists.
A quick and dirty look:
Move to Front Heuristic:
A linked list, Such that whenever a node is selected, it is moved to the front of the list.
Frequency Heuristic:
A linked list, such that whenever a node is selected, its frequency count is incremented, and then the node is bubbled towards the front of the list, so that the most frequently accessed is at the head of the list.
It looks like the move to front implementation would best suit your needs.
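A minimal move-to-front sketch in Python. It uses a plain Python list instead of a linked list, which is fine for illustration but makes each move linear in the list size; the class and method names are hypothetical.

```python
class MoveToFrontList:
    """Auto-complete list that keeps the most recently selected address first."""

    def __init__(self, addresses=()):
        self._items = list(addresses)

    def select(self, address):
        """Record a use: move the address to the front, adding it if it is new."""
        if address in self._items:
            self._items.remove(address)
        self._items.insert(0, address)

    def complete(self, prefix, limit=5):
        """Return up to `limit` addresses starting with `prefix`, most recent first."""
        return [a for a in self._items if a.startswith(prefix)][:limit]

contacts = MoveToFrontList(["alice@example.com", "albert@example.com", "bob@example.com"])
contacts.select("albert@example.com")
suggestions = contacts.complete("al")
```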
EDIT: When an address is selected, add one to its frequency, and move to the front of the group of nodes with the same weight (or (weight div x) for coarser groupings). I see aging as a real problem with your proposed implementation, in that it requires calculating a weight on each and every item. A self organizing list is a good way to go, but the algorithm needs a bit of tweaking to do what you want.
Further Edit:
Aging refers to the fact that weights decrease over time, which means you need to know each and every time an address was used. Which means, that you have to have the entire email history available to you when you construct your list.
The issue is that we want to perform calculations (other than search) on a node only when it is actually accessed -- This gives us our statistical good performance.
This kind of thing seems similar to what Firefox does when suggesting the site you are typing.
Unfortunately I don't know exactly how Firefox does it. A point system seems good as well, though you may need to balance your points :)
I'd go for something similar to:
NoM = Number of Mail
(NoM sent to X today) + 1/2 * (NoM sent to X during the last week)/7 + 1/3 * (NoM sent to X during the last month)/30
Contacts you did not write to during the last month (this window could be changed) will have 0 points. You could sort those by total NoM sent (since they are on the contact list :). These will be shown after contacts with points > 0.
It's just an idea; the point is to give weight both to the contacts you mail most and to the ones you mailed most recently.
If you want to get crazy, mark the most 'active' emails in one of several ways:
Last access
Frequency of use
Contacts with pending sales
Direct bosses
Etc
Then, present the active emails at the top of the list. Pay attention to which "group" your user uses most. Switch to that sorting strategy exclusively after enough data is collected.
It's a lot of work but kind of fun...
Maybe count the number of emails sent to each address. Then:
ORDER BY EmailCount DESC, LastName, FirstName
That way, your most-often-used addresses come first, even if they haven't been used in a few days.
I like the idea of a point-based system, with points for recent use, frequency of use, and potentially other factors (prefer contacts in the local domain?).
I've worked on a few systems like this, and neither "most recently used" nor "most commonly used" work very well. The "most recent" can be a real pain if you accidentally mis-type something once. Alternatively, "most used" doesn't evolve much over time, if you had a lot of contact with somebody last year, but now your job has changed, for example.
Once you have the set of measurements you want to use, you could create an interactive application to test out different weights, and see which ones give you the best results for some sample data.
This paper describes a single-parameter family of cache eviction policies that includes least recently used and least frequently used policies as special cases.
The parameter, lambda, ranges from 0 to 1. When lambda is 0 it performs exactly like an LFU cache, when lambda is 1 it performs exactly like an LRU cache. In between 0 and 1 it combines both recency and frequency information in a natural way.
In spite of an answer having been chosen, I want to submit my approach for consideration, and feedback.
I would account for frequency by incrementing a counter each use, but by some larger-than-one value, like 10 (To add precision to the second point).
I would account for recency by multiplying all counters at regular intervals (say, 24 hours) by some diminisher (say, 0.9).
Each use:
UPDATE `addresslist` SET `favor` = `favor` + 10 WHERE `address` = 'foo#bar.com'
Each interval:
UPDATE `addresslist` SET `favor` = FLOOR(`favor` * 0.9)
In this way I collapse both frequency and recency to one field, avoid the need for keeping a detailed history to derive {last day, last week, last month} and keep the math (mostly) integer.
The increment and diminisher would have to be adjusted to preference, of course.
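To see how the two updates interact, here is the same scheme as an in-memory Python sketch. The class `FavorTable` is hypothetical and simply mirrors the two UPDATE statements above.

```python
import math

class FavorTable:
    """One 'favor' counter per address, bumped on use and decayed on a schedule."""

    def __init__(self, increment=10, diminisher=0.9):
        self.increment = increment
        self.diminisher = diminisher
        self.favor = {}

    def record_use(self, address):
        """Equivalent of the 'each use' UPDATE."""
        self.favor[address] = self.favor.get(address, 0) + self.increment

    def decay(self):
        """Equivalent of the 'each interval' UPDATE, keeping values as integers."""
        for address in self.favor:
            self.favor[address] = math.floor(self.favor[address] * self.diminisher)

    def ranked(self):
        """Addresses ordered from most to least favored."""
        return sorted(self.favor, key=self.favor.get, reverse=True)

table = FavorTable()
for _ in range(3):
    table.record_use("old")  # used three times: favor 30
for _ in range(5):
    table.decay()            # five intervals pass: 30, 27, 24, 21, 18, 16
for _ in range(2):
    table.record_use("new")  # used twice just now: favor 20
order = table.ranked()
```

After five intervals the older contact has decayed from 30 to 16, so the contact used twice just now (20) ranks first.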
