Algorithm for this format

I am trying to develop a program to compute the tax for a given base salary. Given the format of the income tax table that I have, I believe there should be a formula or algorithm to calculate the tax for a given base salary. Below is a sample of the income tax table. I can do lookups, but I think an algorithm or formula might be the better approach.
Annual Salary    Monthly Range From    Monthly Range To    Tax Due
18,000.00 0.00 1,500.00 0.00
18,060.00 1,500.01 1,505.00 0.25
18,120.00 1,505.01 1,510.00 0.50
18,180.00 1,510.01 1,515.00 0.75
18,240.00 1,515.01 1,520.00 1.00
.....
27,960.00 2,325.01 2,330.00 41.50 --last line of first format
28,020.00 2,330.01 2,335.00 41.83 -- start of second format
28,080.00 2,335.01 2,340.00 42.33
28,140.00 2,340.01 2,345.00 42.83
I was able to find a formula for the first part and it's working fine in my program, but I can't make the second part work.
The formula for the first part is below :
The tax is 0.25 for every 5.00, or portion thereof, above 1,500. So I am using:
Ceil((monthly−1500)/5) * 0.25
I have tried similar methods for the second pattern, but it's not working so far.

It seems like you just want to offset it by some number, specifically 41.33.
41.33 + Ceil((monthly − 2330) / 5) * 0.5
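Putting the two brackets together, here is a minimal sketch of the whole calculation in Java, assuming the two patterns shown in the table continue beyond the sample rows:

// Minimal sketch of the piecewise monthly tax formula, assuming the two
// bracket patterns above hold beyond the sample rows shown.
static double monthlyTax(double monthly) {
    if (monthly <= 1500.00) {
        return 0.0;                                                // no tax due
    } else if (monthly <= 2330.00) {
        return Math.ceil((monthly - 1500.0) / 5.0) * 0.25;         // 0.25 per 5.00 above 1,500
    } else {
        return 41.33 + Math.ceil((monthly - 2330.0) / 5.0) * 0.50; // second bracket
    }
}

For example, monthlyTax(2335.00) gives 41.83 and monthlyTax(2340.00) gives 42.33, matching the table rows above.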
Although this seems like a better solution (and likely the least effort for dealing with new versions of the table), as sketched below:
Read the data in some automated way.
Do a binary search for the range the salary falls under.
Look up the tax due.
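Here is a minimal sketch of that lookup in Java, assuming the table has already been read into parallel arrays (the array names are illustrative):

// Binary search for the last bracket whose "Monthly From" value is <= the
// salary, then return the corresponding tax due. Assumes rangeFrom is sorted
// ascending and parallel to taxDue.
static double lookupTax(double monthly, double[] rangeFrom, double[] taxDue) {
    int lo = 0, hi = rangeFrom.length - 1, match = 0;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (rangeFrom[mid] <= monthly) {
            match = mid;      // candidate bracket; keep searching to the right
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return taxDue[match];
}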

A fully automated algorithm (sketched below):
a linear fit of the salary/tax table;
then an interpolation search for the exact value.
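A hedged sketch of that idea in Java, again assuming the "Monthly From" column is loaded into a sorted array (names are illustrative):

// Interpolation search over the sorted "Monthly From" column: probe where a
// linear fit of the values predicts the salary to be, then narrow the bounds.
// Returns the index of the last bracket whose lower bound is <= monthly.
static int findBracket(double[] from, double monthly) {
    int lo = 0, hi = from.length - 1;
    if (monthly >= from[hi]) return hi;
    if (monthly < from[lo]) return -1;        // below the first bracket
    while (hi - lo > 1) {                     // invariant: from[lo] <= monthly < from[hi]
        int pos = lo + (int) ((monthly - from[lo]) * (hi - lo) / (from[hi] - from[lo]));
        pos = Math.max(lo + 1, Math.min(pos, hi - 1)); // keep the probe strictly inside
        if (from[pos] <= monthly) {
            lo = pos;
        } else {
            hi = pos;
        }
    }
    return lo;
}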

Related

rule based algorithm for text classification

I'm trying to build a rule based text classification system to speed up the manual categorization of documents.
The long term aim is to use these manually classified documents as training data for an AI.
The classification system has over 100 categories.
The idea is to manually build a list of 'words' associated for each category in the classification system.
The list of words will be built by manually classifying a small number of documents and manually identifying the common words I find in each document.
The job of the rule engine is to attempt to identify other documents that belong in the same category based on the assigned words.
I'm looking to put a weight on each word associated with a category.
The intent is to initially weight the words based on their occurrence in the manually labelled documents.
So if the word 'select' appeared in a 1,000-word document 50 times, it would get a weighting of 5% (50/1000).
The rule engine's job is then to score other documents based on the occurrences of words and their relative weighting.
What I'm not certain of is how this scoring process should work, or how to normalize the data given the variance in document size (from 100 words to 10,000 words would be typical).
The intent is to have an iterative process (manually validate classification/add-remove words/adjust weights/classify documents via rule-engine).
With each iteration the rule-engine will hopefully get better at correctly classifying the documents, reducing the labelling process to a Good/Bad classification action. Provided a significant percentage (even 50% should probably be fine) are correctly labelled, the process should proceed rapidly.
I've heard that concepts such as linear regression might apply to this type of problem but don't know enough to google effectively.
Edit:
I've had some thoughts on how to go about the normalisation process.
normalise all documents to an 'average' size of 1000 words.
count the words in a document - total word count e.g. 250 words
count each word of interest - e.g. CheckBox occurred 25 times
calculate the occurrence of each word as a percentage of the actual document's word count, e.g. checkbox = 10% (25/250)
If we have three words of interest: checkbox, select, multi
We end up with a set of ratios:
checkbox: select : multi
0.05 : 0.01 : 0.02
When scoring we are now looking for documents that have the closest matching ratio.
If a document presents with the following ratio:
0.04 : 0.02 : 0.01
Then we can define the distance between the two documents as:
0.05 - 0.04 + 0.01 - 0.02 + 0.02 - 0.01 = 0.01
The problem with this approach is that we care about the overall distance, so the second word is problematic: it reduces the distance because its ratio differs in the opposite direction from the other words.
To counter this we need to flip the calculation on the second word so that it moves the distance in the same direction:
0.05 - 0.04 + 0.02 - 0.01 + 0.02 - 0.01 = 0.03
The second equation would appear to more accurately reflect the distance between the two documents.
Given we are talking about distance rather than a vector we would always take the absolute value of the result.
A distance of zero is considered an exact match.
I'm not entirely happy with this approach as some words are 'good' words and any number of them should be considered a positive.
e.g. if the classification is checkbox then the word checkbox should always be seen to reduce the distance.
Maybe we deal with this by nominating one or more words as 'key words'.
When a keyword appears, if its word ratio is greater than the expected ratio, then the distance for that word is considered 0.
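To make the scoring idea above concrete, here is a minimal sketch in Java of the absolute-ratio distance with the keyword rule (assuming the ratios have already been computed; all names are illustrative):

import java.util.Map;
import java.util.Set;

class RatioDistance {
    // Distance between a category's expected word ratios and a document's
    // observed ratios: sum of absolute differences per tracked word, with
    // keywords that exceed their expected ratio contributing no distance.
    static double distance(Map<String, Double> categoryRatios,
                           Map<String, Double> docRatios,
                           Set<String> keywords) {
        double dist = 0.0;
        for (Map.Entry<String, Double> e : categoryRatios.entrySet()) {
            String word = e.getKey();
            double expected = e.getValue();
            double observed = docRatios.getOrDefault(word, 0.0);
            if (keywords.contains(word) && observed > expected) {
                continue; // an over-represented keyword never adds distance
            }
            dist += Math.abs(expected - observed);
        }
        return dist; // 0.0 means an exact match on the tracked words
    }
}

A distance of 0.0 is an exact match, and documents can be ranked against a category by ascending distance.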
If you have some data that is labelled, it sounds like you might want a naive Bayes classifier with a bag-of-words feature vector.
If you don't have data with labels, it sounds like you might want topic modelling, such as LDA.
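For the labelled case, a minimal sketch of naive Bayes scoring with bag-of-words counts might look like this (the log prior and per-word log probabilities would be estimated from the manually labelled documents; names are illustrative):

import java.util.Map;

class NaiveBayesScore {
    // Score one document against one category: log prior plus, for each word,
    // its count times log P(word | category). Unseen words fall back to a
    // smoothed probability. Compute this per category and pick the highest.
    static double score(Map<String, Integer> docWordCounts,
                        double logPrior,
                        Map<String, Double> logWordProb,
                        double logUnseenWordProb) {
        double logScore = logPrior;
        for (Map.Entry<String, Integer> e : docWordCounts.entrySet()) {
            double logP = logWordProb.getOrDefault(e.getKey(), logUnseenWordProb);
            logScore += e.getValue() * logP;
        }
        return logScore;
    }
}

Because each document is scored against every category with the same counts, the variation in document length tends to matter less for picking the best category.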

Rounding Calculated fields in Dynamics 365

I'm hoping that this is a fairly simple solution, but I'm fairly new to Dynamics 365 development and the documentation and prior questions I've seen so far haven't been too helpful.
I need to create a specifically rounded calculated field within an entity. Currently the field has been set up as a decimal type and I have the precision value set to 0 to produce a whole number.
The calculation I am currently using in the field calculation editor is simply x + y / 100. However whatever result comes from that needs to be always rounded up to the next whole number rather than the nearest one. Is there a way to force the field logic to always round upwards?
If a direct answer isn't available, any resources would be appreciated.
I would say it is working as expected, since a fractional part of less than 0.5 rounds down to the nearest whole number (floor) and a fractional part greater than 0.5 rounds up (ceiling).
To handle your scenario of always rounding up to the next whole number (ceiling), I would recommend adding 49 to the formula, i.e. (x + y + 49) / 100, which in effect adds 0.49 before the rounding happens. This trick is off the top of my head, but it may be the best option from a calculated-field perspective.
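To see why the 0.49 offset behaves like always rounding up, here is a small check of the arithmetic in plain Java, assuming x and y are whole numbers (illustration only, not Dynamics field logic):

// Illustration: with whole-number x and y, adding 49 before dividing by 100
// and rounding to the nearest integer matches rounding the plain result up.
static long withOffset(long x, long y) {
    return Math.round((x + y + 49) / 100.0);   // what the suggested calculated field does
}

static long roundedUp(long x, long y) {
    return (long) Math.ceil((x + y) / 100.0);  // the always-round-up result the question wants
}

// e.g. withOffset(150, 51) and roundedUp(150, 51) both give 3;
//      withOffset(150, 50) and roundedUp(150, 50) both give 2.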

500px.com Ranking Algorithm

I was recently wondering how http://500px.com calculates their "Pulse" rating.
The "Pulse" is a score from 1..100 based on the popularity of the photo.
I think it might use some of the following criteria:
Number of likes
Number of "favorites"
Number of comments
Total views
maybe the time since the photo has been uploaded
maybe some other non-obvious criteria like the user's follower count, user rank, camera model or similar
How would I achieve some sort of algorithm like this?
Any advice on how to implement an algorithm with this criteria (and maybe some code) would be appreciated too.
I don't know too much about the site but systems like this generally work the same way. Normalize a set of weighted values to produce a single comparable value.
Define your list of rules, weight them based on importance, then run them all together to get your final value.
In this case it would be something like:
Total number of visits = 10%
Total number of Likes = 10%
Number of likes / number of visits = 40% (popularity = percentage of visitors that liked it)
number of Likes in last 30 days = 20% (current popularity)
author rating = 20%
Now we need to normalize the values for those rules. Depending on what your data is, the scale of the numbers, etc., this will be different for each rule, so we need a workable range, say between 1 and 100.
Example normalizations for the above:
= percentage of visits out of 50,000 visits (a good number of visits)
(visits / 50000) * 100
= percentage of likes out of 10,000 likes (a good number of likes)
(likes / 10000) * 100
= percentage of visitors that liked it
(likes / visits) * 100
= percentage of likes in the last 30 days out of 1,000 likes (a good number of likes for a 30-day period)
(likesIn30Days / 1000) * 100
= arbitrary rating of the author
Make sure all of these have a maximum value of 100 (if a value comes out over 100, cap it). Then we need to combine them according to their weighting:
Popularity = (rule1 * 0.1) + (rule2 * 0.1) + (rule3 * 0.4) + (rule4 * 0.2) + (rule5 * 0.2)
This is all off the top of my head and rough. There are obviously more efficient ways to do this, as you don't need to normalize to a percentage at every stage, but I hope it helps you get the gist.
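As a rough illustration, the whole normalize-then-weight scheme could be sketched in Java like this (the caps and weights are just the example numbers from this answer, not 500px's real values):

// Rough sketch of the normalize-then-weight scheme. Each rule is scaled to a
// 0..100 value (capped at 100), then combined with the example weights above.
static double pulse(double visits, double likes, double likesLast30Days,
                    double authorRating /* assumed already on a 0..100 scale */) {
    double r1 = Math.min(visits / 50000.0 * 100.0, 100.0);                  // total visits
    double r2 = Math.min(likes / 10000.0 * 100.0, 100.0);                   // total likes
    double r3 = visits > 0 ? Math.min(likes / visits * 100.0, 100.0) : 0.0; // % of visitors who liked it
    double r4 = Math.min(likesLast30Days / 1000.0 * 100.0, 100.0);          // recent likes
    double r5 = Math.min(authorRating, 100.0);                              // author rating
    return r1 * 0.1 + r2 * 0.1 + r3 * 0.4 + r4 * 0.2 + r5 * 0.2;            // weighted combination
}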
Update
I haven't really got any references or extra reading. I've never really worked with this as a larger concept, only in small implementations.
I think most of what you read, though, is going to be about ranking-system methodology in general and the theory behind it, because your implementation will be very different depending on your rules and data format. It seems like a huge concept when it will actually probably come down to around 10 lines of code, not counting aggregating your data.
You may want to also refer to the following
How Reddit ranking algorithms work
How Hacker News ranking algorithm works
How to Build a Popularity Algorithm You can be Proud of
500px explains their (now outdated) Pulse ranking algorithm to some extent in their blog:
https://500px.com/blog/52/how-rating-works-and-why-there-s-a-lot-more-to-a-rating-than-just-a-number
Pretty interesting, and different from what the solutions here on SO suggest so far.

Formula for popularity? (based on "like it", "comments", "views")

I have some pages on a website and I have to create an ordering based on "popularity"/"activity"
The parameters that I have to use are:
views to the page
comments made on the page (there is a form at the bottom where users can make comments)
clicks made to the "like it" icon
Are there any standards for what a formula for popularity would be? (if not opinions are good too)
(initially I thought of views + 10*comments + 10*likeit)
Actually there is an accepted best way to calculate this:
http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
You may need to combine 'likes' and 'comments' into a single score, assigning your own weighting factor to each, before plugging it into the formula as the 'positive vote' value.
from the link above:
Score = Lower bound of Wilson score confidence interval for a
Bernoulli parameter
We need to balance the proportion of positive ratings with
the uncertainty of a small number of observations. Fortunately, the
math for this was worked out in 1927 by Edwin B. Wilson. What we want
to ask is: Given the ratings I have, there is a 95% chance that the
"real" fraction of positive ratings is at least what? Wilson gives the
answer. Considering only positive and negative ratings (i.e. not a
5-star scale), the lower bound on the proportion of positive ratings
is given by:
(p̂ + z²/(2n) ± z·sqrt((p̂(1−p̂) + z²/(4n)) / n)) / (1 + z²/n)
(Use minus where it says plus/minus to calculate the lower bound.)
Here p̂ is the observed fraction of positive ratings, z (i.e. z_{α/2}) is the
(1 − α/2) quantile of the standard normal distribution, and n is the
total number of ratings. The same formula implemented in Ruby:
require 'statistics2'

def ci_lower_bound(pos, n, confidence)
  if n == 0
    return 0
  end
  z = Statistics2.pnormaldist(1-(1-confidence)/2)
  phat = 1.0*pos/n
  (phat + z*z/(2*n) - z * Math.sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)
end
pos is the number of positive ratings, n is the total number of
ratings, and confidence refers to the statistical confidence level:
pick 0.95 to have a 95% chance that your lower bound is correct, 0.975
to have a 97.5% chance, etc. The z-score in this function never
changes, so if you don't have a statistics package handy or if
performance is an issue you can always hard-code a value here for z.
(Use 1.96 for a confidence level of 0.95.)
The same formula as an SQL query:
SELECT widget_id,
       ((positive + 1.9208) / (positive + negative) -
        1.96 * SQRT((positive * negative) / (positive + negative) + 0.9604) /
        (positive + negative)) / (1 + 3.8416 / (positive + negative))
       AS ci_lower_bound
FROM widgets
WHERE positive + negative > 0
ORDER BY ci_lower_bound DESC;
There is no standard formula for this (how could there be?)
What you have looks like a fairly normal solution, and would probably work well. Of course, you should play around with the 10's to find values that suit your needs.
Depending on your requirements, you might also want to add in a time factor (i.e. -X points per week) so that old pages become less popular. Alternatively, you could change your "page views" to "page views in the last month". Again, this depends on your needs, it may not be relevant.
You could do something like what YouTube does: just sort by the largest count per category, for example most viewed, most commented, most liked. A different page could come first in each category, though the rankings would likely be correlated. If you only need a single ranking, then you would have to come up with a formula of some sort, preferably derived empirically by analysing a bunch of data you already have, deciding what should count as good or bad, and working backwards to fit an equation that matches those decisions.
You could even attempt a machine learning approach to "learn" what a good weighting is for combining each of these numbers as in your example formula. Doing it manually might also not be too hard.
I use:
(C*comments + L*likeit)*100/views
where you must use C and L depending on how much you value each attribute.
I use C=1 and L=1.
This gives you the percentage of views that generated a positive action, making the items with a higher percentage the most "popular".
I like this because it makes it possible for newer items to be very popular at first, showing up first and getting more views and thus becoming less popular (or more) until stabilizing.
Anyway, I hope it helps.
PS: It would work just the same without the "*100", but I like percentages.
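A minimal sketch of that formula in Java (C and L are the weights you pick; both 1 here, as in the answer):

// Percentage of views that produced a positive action (comment or like).
static double popularity(double views, double comments, double likes) {
    final double C = 1.0;   // weight for comments
    final double L = 1.0;   // weight for likes
    if (views == 0) {
        return 0.0;         // avoid dividing by zero for brand-new pages
    }
    return (C * comments + L * likes) * 100.0 / views;
}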
I would value comments more than "like it"s if the content invites a discussion. If it's just stating facts, an equal ratio for comments and the like count seems OK (though 10 is a bit too much, I think...).
Do the views take into account the time the user spent on the page somehow? You might use that as well, since a 2-second view means less than a 3-minute one.
Java code for Anentropic's answer:
public static double getRank(double thumbsUp, double thumbsDown) {
    double totalVotes = thumbsUp + thumbsDown;
    if (totalVotes > 0) {
        return ((thumbsUp + 1.9208) / totalVotes -
                1.96 * Math.sqrt((thumbsUp * thumbsDown) / totalVotes + 0.9604) /
                totalVotes) / (1 + (3.8416 / totalVotes));
    } else {
        return 0;
    }
}

Smooth average of sales data

How can I calculate the average of a set of data while smoothing over any points that are outside the "norm"? It's been a while since I had to do any real math, but I'm sure I learned this somewhere...
Lets say I have 12 days of sales data on one item: 2,2,2,50,10,15,9,6,2,0,2,1
I would like to calculate the average sales per day without allowing the 4th day (50) to screw up the average too much. Log, Percentile, something like that I think...
It sounds to me that you're looking for a moving average.
You can also filter by thresholding at some multiple of the standard deviation. This would filter out results that were much farther than expected from the mean (average).
Standard deviation is simply sqrt(sum((your_values - average_value)^2) / number_of_values).
edit: You can also look at weighting each value by its deviation from the mean. Values that are far out can be weighted as 1 / exp(deviation) and therefore contribute much less the farther from the mean they are.
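A minimal sketch of the thresholding idea in Java, dropping any day that is more than k standard deviations from the mean before averaging (k is a parameter you would tune):

// Average the values after discarding anything more than k standard
// deviations away from the mean (population standard deviation).
static double smoothedAverage(double[] values, double k) {
    double mean = 0.0;
    for (double v : values) mean += v;
    mean /= values.length;

    double variance = 0.0;
    for (double v : values) variance += (v - mean) * (v - mean);
    double stdDev = Math.sqrt(variance / values.length);

    double sum = 0.0;
    int kept = 0;
    for (double v : values) {
        if (Math.abs(v - mean) <= k * stdDev) {   // keep values within k std devs
            sum += v;
            kept++;
        }
    }
    return kept > 0 ? sum / kept : mean;
}

With the question's sample data and k = 2, the 50 falls outside the threshold and is dropped, while the remaining days are averaged normally.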
You'll want to use something like the IQR (interquartile range). Basically you break the data into quartiles and then calculate your central tendency from the first and third quartiles, which keeps the extremes from skewing the result.
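One way to read that (an interpretation, not necessarily exactly what the answer meant) is an interquartile mean; a minimal Java sketch:

import java.util.Arrays;

// Interquartile mean: sort the values, drop the bottom and top quarter, and
// average what is left between the first and third quartiles.
static double interquartileMean(double[] values) {
    double[] sorted = values.clone();
    Arrays.sort(sorted);
    int q1 = sorted.length / 4;            // index where the middle half starts
    int q3 = sorted.length - q1;           // index where the middle half ends (exclusive)
    double sum = 0.0;
    for (int i = q1; i < q3; i++) {
        sum += sorted[i];
    }
    return sum / (q3 - q1);
}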
