I'll explain the matter.
Introduction
I have a database containing some transactions between people.
Each transaction contains lots of data, but for this algorithm only three things are needed: the value of the transaction, the user giving money, and the user receiving money. (I'm a student trying to build a real-world project on my own for a group of friends.)
Steps
This is not real-time processing; it only needs to run once or twice over a long period of time.
The input to the algorithm is the simplified transaction list, from which it builds a map of each user's net balance, i.e. what the user gave minus what the user received (e.g. from [Mark, Sophie, 20], [John, Mark, 30] to [Mark : -10, Sophie : -20, John : +30]).
After that I'd like to find the most efficient way to reset all the members of the dataset to 0. (In this case it would be Mark gives 10 to John and Sophie gives 20 to John, but for a larger number of transactions there will be optimal and sub-optimal ways.)
What I thought
My first thought is a greedy approach: take the max and the min and "equalize" them, so that one (or at best both) becomes 0, and continue like that until no values are left (zero values will be removed from the dataset so as not to loop, or there will be a check that a value is not 0; probably the former).
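The greedy idea above can be sketched as follows. This is only an illustration, assuming integer amounts and the sign convention from the example (positive means the user is owed money); the function names are made up for this sketch. It always finishes in at most n - 1 payments for n users with a non-zero balance, but it is a heuristic and is not guaranteed to find the minimum number of payments.

```python
from collections import defaultdict


def net_balances(transactions):
    """Net balance per user: amount given minus amount received."""
    balance = defaultdict(int)
    for giver, receiver, value in transactions:
        balance[giver] += value
        balance[receiver] -= value
    return dict(balance)


def greedy_settle(balance):
    """Repeatedly pair the largest creditor with the largest debtor."""
    # Drop users who are already at zero so the loop terminates.
    open_balances = {user: b for user, b in balance.items() if b != 0}
    payments = []
    while open_balances:
        creditor = max(open_balances, key=open_balances.get)  # most positive
        debtor = min(open_balances, key=open_balances.get)    # most negative
        amount = min(open_balances[creditor], -open_balances[debtor])
        payments.append((debtor, creditor, amount))
        open_balances[creditor] -= amount
        open_balances[debtor] += amount
        # At least one of the two is now settled; remove whoever reached zero.
        for user in (creditor, debtor):
            if open_balances[user] == 0:
                del open_balances[user]
    return payments


balances = net_balances([("Mark", "Sophie", 20), ("John", "Mark", 30)])
print(greedy_settle(balances))
```

Each tuple in the result reads (payer, payee, amount).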
I hope I explained it well; if not, feel free to ask for more. Is there a better way to minimize the number of transactions needed to equalize the dataset?
Thanks all for your attention.
I'm trying to develop a rating system for an application I'm working on. Basically, the app allows you to rate an object from 1 to 5 (represented by stars). But of course I know that keeping a rating count and adding the rating number itself is not feasible.
So the first thing that came to mind was dividing the received rating by the total number of ratings given. For example, if the object received a rating of 2 from a user and it has been rated 100 times, maybe adding 2/100. However, I believe this method is not good enough since 1) it is a naive approach, and 2) to get the number of times the object has been rated I have to do a lookup on the DB, which might end up having time complexity O(n).
So I was wondering what alternative, and possibly better, ways there are to approach this problem?
You can keep two additional values in the DB: the number of times the object was rated and the total sum of all ratings. To update an object's rating you then only need to:
Add the new rating to the total sum.
Increment the number of times it was rated.
Divide the total sum by the number of times it was rated.
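A minimal sketch of that bookkeeping, with an in-memory object standing in for the two extra DB columns (the class and attribute names are hypothetical):

```python
class RatingSummary:
    """Keeps only a running count and sum, so each update is O(1)."""

    def __init__(self):
        self.count = 0  # number of times the object was rated
        self.total = 0  # sum of all ratings received

    def add(self, stars):
        if not 1 <= stars <= 5:
            raise ValueError("rating must be between 1 and 5")
        self.total += stars
        self.count += 1

    @property
    def average(self):
        # No ratings yet: there is no meaningful average.
        return self.total / self.count if self.count else None


summary = RatingSummary()
for stars in (5, 4, 3):
    summary.add(stars)
print(summary.average)
```

No scan over individual ratings is needed to produce the average.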
There are many approaches to this, but before that, check:
Whether all feedback givers are treated equally or some carry more weight than others (like a panel review, etc.)
Whether the objective is to provide only an average, or a score band or similar. Consider a scenario like this website, which shows a total reputation score.
And yes, if an average is to be computed, you need the total and the count of feedback and then have to compute it; that's plain maths. But if you need any other method, be prepared for more compute cycles. Balancing database hits against compute cycles is the next stage of design. First get your requirements and your approach to the solution in place.
I think you should keep separate counters for 1 star, 2 stars, and so on. To calculate the rating, you'd compute rating = (1*numOneStars + 2*numTwoStars + 3*numThreeStars + 4*numFourStars + 5*numFiveStars) / (numOneStars + numTwoStars + numThreeStars + numFourStars + numFiveStars)
This way you can also show, like Amazon, how many people voted 1 star and how many voted 5 stars.
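As a sketch, assuming the five counters are held in a dictionary keyed by star value (the name `star_counts` is made up here):

```python
def average_from_histogram(star_counts):
    """Weighted average of a {stars: number_of_votes} histogram."""
    total_votes = sum(star_counts.values())
    if total_votes == 0:
        return None  # nothing rated yet
    weighted_sum = sum(stars * votes for stars, votes in star_counts.items())
    return weighted_sum / total_votes


print(average_from_histogram({1: 1, 2: 0, 3: 0, 4: 0, 5: 3}))
```

The same counters can be displayed directly as the per-star breakdown.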
Have you considered a vote up/down mechanism over numbers of stars? It doesn't directly solve your problem but it's worth noting that other sites such as YouTube, Facebook, StackOverflow etc all use +/- voting as it is often much more effective than star based ratings.
Here is my scenario,
I run a massage place which offers various types of massage, say a 30 min massage, a 45 min massage, a 1 hour massage, etc. I have 50 rooms, 100 employees and 30 pieces of equipment. When a customer books a massage appointment, the appointment requires 1 room, 1 employee and 1 piece of equipment to be available.
What is a good algorithm to find available resources for 10 guests for a given day?
Resources:
Room – 50
Staff – 100
Equipment – 30
Business Hours : 9AM - 6PM
Staff Hours: 9AM- 6PM
No of guests: 10
Services
5 Guests- (1 hour massages)
3 Guests - (45mins massages)
2 Guests - (1 hour massage).
They are coming around the same time. Assume there are no other appointments on that day.
What is the best way to get:
Top 10 results - the fastest search that meets all conditions returns the top 10 result set. Top ten is defined by earliest available time: 9 – 11AM is the best result set; 9AM – 5PM is not as good.
Exhaustive search (find all combinations) - all sets, i.e. every possible combination
First available (only return the first match) – stop after one of the conditions has been met
I would appreciate your help.
Thanks
Nick
First, it seems the number of employees, rooms, and equipment are irrelevant. It seems like you only care about which of those is the lowest number. That is your inventory. So in your case, inventory = 30.
Next, it sounds like you can service all 10 people at the same time within the first hour of business. In fact, you can service 30 people at the same time.
So no algorithm is necessary to figure that out; it's a static solution. If you take @Mario The Spoon's advice and weight the different massage durations by their corresponding profits, then you can start optimizing when you have more than 30 customers at a time.
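To make the "inventory" argument concrete, here is a rough sketch. It assumes, as this answer does, that any room, employee and piece of equipment can be combined, so the scarcest resource sets the number of massages that can run in parallel. The function names and the greedy earliest-start rule are assumptions of this sketch, not a full scheduler.

```python
import heapq


def bottleneck(rooms, staff, equipment):
    """Number of complete room + employee + equipment sets."""
    return min(rooms, staff, equipment)


def first_available(durations_min, capacity, open_min=9 * 60, close_min=18 * 60):
    """Give each guest the earliest start (minutes after midnight) on a free set.

    Returns one start time per guest, or None if the guest cannot finish
    before closing time.
    """
    free_at = [open_min] * capacity  # when each resource set becomes free
    heapq.heapify(free_at)
    starts = []
    for duration in durations_min:
        start = heapq.heappop(free_at)
        if start + duration > close_min:
            heapq.heappush(free_at, start)  # set stays free, guest not served
            starts.append(None)
        else:
            heapq.heappush(free_at, start + duration)
            starts.append(start)
    return starts


capacity = bottleneck(rooms=50, staff=100, equipment=30)
guests = [60] * 5 + [45] * 3 + [60] * 2
print(capacity, first_available(guests, capacity))
```

With the numbers from the question, every guest gets a 9AM start because 10 guests fit within the 30 available sets.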
Looks like you are trying to solve a problem for which there are quite specialized software applications. If your problem is small enough, you could try to do a brute force approach using some looping and backtracking, but as soon as the problem becomes too big, it will take too much time to iterate through all possibilities.
If the problem starts to get big, look for more specialized software. Things to look for are "constraint based optimization" and "constraint programming".
E.g. the ECLiPSe tool is an open-source constraint programming environment. You can find some examples at http://eclipseclp.org/examples/index.html. One nice example you can find there is the SEND+MORE=MONEY problem. In this problem you have the following equation:
        S E N D
    +   M O R E
    -----------
    = M O N E Y
Replace every letter by a digit so that the sum is correct.
This also illustrates that although you can solve this brute-force, there are more intelligent ways to solve this (see http://eclipseclp.org/examples/sendmore.pl.txt).
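For comparison with the constraint-programming version, the brute-force approach can be sketched in a few lines. It tries every assignment of distinct digits to the eight letters (about 1.8 million candidates), which is exactly the kind of blind enumeration a constraint solver avoids; it assumes the usual rule that leading digits are not zero.

```python
from itertools import permutations


def solve_send_more_money():
    """Try every assignment of distinct digits to S, E, N, D, M, O, R, Y."""
    for s, e, n, d, m, o, r, y in permutations(range(10), 8):
        if s == 0 or m == 0:
            continue  # leading digits must not be zero
        send = 1000 * s + 100 * e + 10 * n + d
        more = 1000 * m + 100 * o + 10 * r + e
        money = 10000 * m + 1000 * o + 100 * n + 10 * e + y
        if send + more == money:
            return send, more, money
    return None


print(solve_send_more_money())  # (9567, 1085, 10652)
```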
Just an idea to find a solution:
You might want to try to solve it with a constraint satisfaction problem (CSP) algorithm. That's what some people do when they have to solve timetable problems in general (e.g. room reservation at a university).
There are several tricks to improve CSP performance, like forward checking, or building a DAG and then doing a topological sort, and so on...
Just let me know, if you need more information about CSP :)
EDIT: Just to make sure someone is not breaking their head on the problem... I am not looking for the best optimal algorithm. Some heuristic that makes sense is fine.
I made a previous attempt at formulating this and realized I did not do a great job at it so I removed that question. I have taken another shot at formulating my problem. Please feel free to provide any constructive criticism that can help me improve this.
Input:
N people
k announcements that I can make
Distance that my voice can be heard (say 5 meters) i.e. I may decide to announce or not depending on the number of people within these 5 meters
Goal:
Maximize the total number of people who have heard my k announcements and (optionally) minimize the time in which I can finish announcing all k announcements
Constraints:
Once a person hears my announcement, he is removed from the total, i.e. if he heard my first announcement, I do not count him even if he hears my second announcement
I can see the same person as well as the same set of people within my proximity
Example:
Let us consider 10 people numbered from 1 to 10 and the following pattern of arrival:
Time slot 1: 1 (payoff = 1)
Time slot 2: 2 3 4 5 (payoff = 4)
Time slot 3: 5 6 7 8 (payoff = 4 if no announcement was made previously in time slot 2, 3 if an announcement was made in time slot 2)
Time slot 4: 9 10 (payoff = 2)
and I am given 2 announcements to make. Now if I were an oracle, I would choose time slot 2 and time slot 3, because then 7 people would have heard (person 5 already heard my announcement in time slot 2, so I do not count him again). I am looking for an online algorithm that will help me decide whether or not to make an announcement, and if so, based on what factors. Does anyone have any ideas on what algorithms can be used to solve this or a simpler version of this problem?
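For reference, the oracle's choice in this example can be checked with a small brute-force sketch. This is an offline computation that knows all arrivals in advance, so it is not the online algorithm asked for, only a baseline an online heuristic could be compared against; the function name is made up.

```python
from itertools import combinations


def best_slots(arrivals, k):
    """Offline oracle: try every choice of k slots, count distinct listeners."""
    best_choice, best_heard = (), set()
    for choice in combinations(sorted(arrivals), k):
        heard = set().union(*(arrivals[slot] for slot in choice))
        if len(heard) > len(best_heard):
            best_choice, best_heard = choice, heard
    return best_choice, len(best_heard)


arrivals = {1: {1}, 2: {2, 3, 4, 5}, 3: {5, 6, 7, 8}, 4: {9, 10}}
print(best_slots(arrivals, 2))  # ((2, 3), 7)
```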
There should be an approach relying upon a max-flow algorithm. In essence, you're trying to push the maximum number of messages from start to end. Though it would be multidimensional, you could have a super-source that connects to each value of t, then have each value of t connect to the people you can reach at that time, and then have those people connect to a super-sink. This way, you simply have to compute a max-flow (with the added constraint of no more than k shouts, which should be solvable with a bit of dynamic programming). It's a terrifically dirty way to solve it, but it should get the job done deterministically and without the use of heuristics.
I don't know that there is really a way to solve this or an algorithm to do it the way you have formulated it.
It seems like basically you are trying to reach the maximum number of people with exactly 2 announcements. But without knowing any information about the groups of people in advance, you can't really make any kind of intelligent decision about whether or not to use your first announcement. Your second one at least has the benefit of knowing when not to be used (i.e. if the group has no new members then you know it's not worth wasting the announcement). But it still has basically the same problem.
The only real way to solve this is to use knowledge about the type of data or the desired outcome to make guesses. If you know that groups average 100 people with a standard deviation of 10, then you could just refuse to announce if less than 90 people are present. Or, if you know you need to reach at least 100 people with two announcements, you could choose never to announce to less than 50 at once. Obviously those approaches risk never announcing at all if the actual data does not meet what you would expect. But that's always going to be a risk, since you could get 1 person in the first group and then 0 in all of the rest, no matter what you do.
Or you could try defining the problem more clearly; I have a hard time figuring out how to relate this to computers.
Let's start by trying to solve the simplest possible variant of the problem: assume N people and K timeslots, but only one possible announcement. Let's also assume that each person will only ever stay for one timeslot and that each person who hasn't yet shown up is equally likely to show up at any future timeslot.
Given these simplifications, at each timeslot you look at the payoff of announcing at the current timeslot and compare it to the chance of a future timeslot having a higher payoff. E.g., assume 4 people and 3 timeslots:
Timeslot 1: Person 1 shows up, so you know you could get a payoff of 1 by announcing. But then you have 3 people still to show up in the 2 remaining timeslots, so at least one of those timeslots is guaranteed to have 2 people, so don't announce.
So at each timeslot, you can calculate the chance that a later timeslot will have a higher payoff than the current one by treating the remaining N people and K timeslots as N independent random numbers, each drawn from 1..K, and calculating the chance that at least one value in 1..K is hit more than or equal to the current-payoff times. (This is similar to the birthday problem, but for more than one collision.) Then you need to decide how much to discount based on expected variance (a bird in the hand, etc.).
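For small cases that chance can be computed exactly by enumeration. The sketch below follows the simplifications above (each remaining person picks one of the remaining timeslots uniformly and independently) and counts a later slot as better only if it is strictly higher than the current payoff; change `>` to `>=` for the "more than or equal" reading. The function name is made up, and the enumeration grows as slots ** people, so it is only usable for tiny inputs.

```python
from collections import Counter
from itertools import product


def prob_better_slot_later(remaining_people, remaining_slots, current_payoff):
    """Exact chance that some later slot gets more people than current_payoff."""
    if remaining_slots == 0:
        return 0.0
    better = 0
    total = 0
    # Every way of placing each remaining person into one remaining slot.
    for assignment in product(range(remaining_slots), repeat=remaining_people):
        total += 1
        counts = Counter(assignment)
        if counts and max(counts.values()) > current_payoff:
            better += 1
    return better / total


# 4 people, 3 timeslots, one person showed up in timeslot 1:
print(prob_better_slot_later(remaining_people=3, remaining_slots=2, current_payoff=1))
```

For the 4-person example this gives 1.0, matching the pigeonhole argument for not announcing in timeslot 1.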
Generalization of this solution to the original problem is left as an exercise for the reader.
Suppose you were able keep track of the news mentions of different entities, like say "Steve Jobs" and "Steve Ballmer".
What are ways that you could tell whether the number of mentions per entity in a given time period was unusual relative to their normal frequency of appearance?
I imagine that for a more popular person like Steve Jobs an increase of about 50% might be unusual (an increase from 1000 to 1500), while for a relatively unknown CEO an increase of 1000% for a given day could be possible (an increase from 2 to 200). If you didn't have a way of scaling that, your unusualness index could be dominated by unheard-ofs getting their 15 minutes of fame.
update: To make it clearer, it's assumed that you are already able to get a continuous news stream and identify entities in each news item and store all of this in a relational data store.
You could use a rolling average. This is how a lot of stock trackers work. By tracking the last n data points, you could see if this change was a substantial change outside of their usual variance.
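A minimal sketch of a rolling average per entity, assuming one count per time period; the class name and the choice to report the ratio of the new count to the rolling mean are illustrative only.

```python
from collections import deque


class RollingMean:
    """Keeps the last `window` counts for one entity."""

    def __init__(self, window):
        self._values = deque(maxlen=window)  # old values fall off automatically

    def ratio_to_mean(self, count):
        """Return count / mean of the previous window, then record count.

        Returns None while there is no usable history.
        """
        ratio = None
        if self._values:
            mean = sum(self._values) / len(self._values)
            ratio = count / mean if mean else None
        self._values.append(count)
        return ratio


jobs = RollingMean(window=7)
for daily_mentions in (1000, 1000, 1500):
    print(jobs.ratio_to_mean(daily_mentions))
```

Because the ratio is relative to each entity's own recent level, a jump from 1000 to 1500 and a jump from 2 to 3 both read as 1.5.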
You could also try some normalization. One very simple one would be that each category has a total number of mentions (m), a fractional change from the last time period (δ), and then some normalized value (z) where z = m * δ. Let's look at the table below (m0 is the previous value of m, and δ = (m - m0) / m0):
Name            m      m0     δ      z
Steve Jobs      4950   4500   .10    495
Steve Ballmer   400    300    .33    132
Larry Ellison   50     10     4.0    200
Andy Nobody     50     40     .25    12.5
Here, a 400% change for unknown Larry Ellison results in a z value of 200, a 10% change for the much better known Steve Jobs is 495, and my spike of 25% is still a low 12.5. You could tweak this algorithm depending on what you feel are good weights, or use standard deviation or the rolling average to find if this is far away from their "expected" results.
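The z = m * δ normalization is easy to try out. The sketch below assumes δ is the fractional change from the previous period, (m - m0) / m0; the function name is made up.

```python
def normalized_change(m, m0):
    """z = m * delta, with delta the fractional change from m0 to m."""
    if m0 == 0:
        return None  # no previous mentions: fractional change is undefined
    delta = (m - m0) / m0
    return m * delta


# Steve Jobs row: m = 4950, m0 = 4500, a 10% increase.
print(round(normalized_change(4950, 4500), 1))  # 495.0
```

Multiplying by m damps large percentage swings on tiny counts, which is the scaling effect the question asks for.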
Create a database and keep a history of stories with a time stamp. You then have a history of stories over time of each category of news item you're monitoring.
Periodically calculate the number of stories per unit of time (you choose the unit).
Test if the current value is more than X standard deviations away from the historical data.
Some data will be more volatile than others so you may need to adjust X appropriately. X=1 is a reasonable starting point
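A sketch of that test using the standard library, assuming `history` is a list of past counts per unit of time for one entity (the function name and the handling of a flat history are choices made for this sketch):

```python
from statistics import mean, pstdev


def is_unusual(history, current, x=1.0):
    """True if current is more than x standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return current != mu  # flat history: any change stands out
    return abs(current - mu) > x * sigma


history = [10, 12, 8, 10]
print(is_unusual(history, 11), is_unusual(history, 20))
```

Raising `x` makes the test less sensitive for entities whose counts are naturally volatile.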
Way oversimplified:
Store people's names and the number of articles created in the past 24 hours that mention them. Compare to historical data.
Real life:
If you're trying to dynamically pick out people's names, how would you go about doing that? When searching through articles, how do you grab names? Once you grab a new name, do you search all articles for it? How do you separate Steve Jobs from Apple from Steve Jobs the new star running back who is generating a lot of articles?
If you're looking for simplicity, create a table with 50 people's names that you insert yourself. Every day at midnight, have your program run a quick Google query for the past 24 hours and store the number of results. There are a lot of variables in this, though, that we're not accounting for.
The method you use is going to depend on the distribution of the counts for each person. My hunch is that they are not going to be normally distributed, which means that some of the standard approaches to longitudinal data might not be appropriate - especially for the small-fry, unknown CEOs you mention, who will have data that are very much non-continuous.
I'm really not well-versed enough in longitudinal methods to give you a solid answer here, but here's what I'd probably do if you locked me in a room to implement this right now:
Dig up a bunch of past data. Hard to say how much you'd need, but I would basically go until it gets computationally insane or the timeline gets unrealistic (not expecting Steve Jobs references from the 1930s).
In preparation for creating a simulated "probability distribution" of sorts (I'm using terms loosely here), more recent data needs to be weighted more than past data - e.g., a thousand years from now, hearing one mention of (this) Steve Jobs might be considered a noteworthy event, so you wouldn't want to be using expected counts from today (Andy's rolling mean is using this same principle). For each count (day) in your database, create a sampling probability that decays over time. Yesterday is the most relevant datum and should be sampled frequently; 30 years ago should not.
Sample out of that dataset using the weights and with replacement (i.e., same datum can be sampled more than once). How many draws you make depends on the data, how many people you're tracking, how good your hardware is, etc. More is better.
Compare your actual count of stories for the day in question to that distribution. What percent of the simulated counts lie above your real count? That's roughly (god don't let any economists look at this) the probability of your real count or a larger one happening on that day. Now you decide what's relevant - 5% is the norm, but it's an arbitrary, stupid norm. Just browse your results for awhile and see what seems relevant to you. The end.
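The steps above can be sketched roughly as follows. The exponential decay with a half-life, the default of 10,000 draws, and all names are assumptions made for illustration; the answer only says that sampling weights should decay over time.

```python
import random


def tail_probability(daily_counts, todays_count, half_life_days=30.0,
                     draws=10_000, seed=None):
    """Share of weighted resamples that are at least as large as today's count.

    daily_counts[0] is yesterday, daily_counts[1] the day before, and so on.
    """
    rng = random.Random(seed)
    # Older days get exponentially smaller sampling weights.
    weights = [0.5 ** (age / half_life_days) for age in range(len(daily_counts))]
    # Sample with replacement, so the same day can be drawn more than once.
    sample = rng.choices(daily_counts, weights=weights, k=draws)
    return sum(1 for count in sample if count >= todays_count) / draws


history = [300, 2000, 15000, 250, 280, 310, 290]
print(tail_probability(history, todays_count=5000, seed=1))
```

A small result means today's count is rare relative to the recency-weighted history; where to put the cut-off is the judgment call described above.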
Here's what sucks about this method: there's no trend in it. If Steve Jobs had 15,000 a week ago, 2000 three days ago, and 300 yesterday, there's a clear downward trend. But the method outlined above can only account for that by reducing the weights for the older data; it has no way to project that trend forward. It assumes that the process is basically stationary - that there's no real change going on over time, just more and less probable events from the same random process.
Anyway, if you have the patience and willpower, check into some real statistics. You could look into multilevel models (each day is a repeated measure nested within an individual), for example. Just beware of your parametric assumptions... mention counts, especially on the small end, are not going to be normal. If they fit a parametric distribution at all, it would be in the Poisson family: the Poisson itself (good luck), the overdispersed Poisson (aka negative binomial), or the zero-inflated Poisson (quite likely for your small-fry, no chance for Steve).
Awesome question, at any rate. Lend your support to the statistics StackExchange site, and once it's up you'll be able to get a much better answer than this.
We have an auto-complete list that's populated when you send an email to someone, which is all well and good until the list gets really big and you need to type more and more of an address to get to the one you want, which defeats the purpose of auto-complete.
I was thinking that some logic should be added so that the auto-complete results are sorted by some function of most recently contacted or most often contacted, rather than just alphabetical order.
What I want to know is whether there are any known good algorithms for this kind of search, or if anyone has any suggestions.
I was thinking just a point system thing, with something like same day is 5 points, last three days is 4 points, last week is 3 points, last month is 2 points and last 6 months is 1 point. Then for most often, 25+ is 5 points, 15+ is 4, 10+ is 3, 5+ is 2, 2+ is 1. No real logic other than those numbers "feel" about right.
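Written out as code, the point system might look like the sketch below. The thresholds are the ones listed above; treating "last month" as 30 days and "last 6 months" as 180 days, and combining the two scores by simple addition, are assumptions of this sketch.

```python
def recency_points(days_since_last_contact):
    """5 for same day, 4 within 3 days, 3 within a week, 2 within a month, 1 within 6 months."""
    for max_days, points in ((0, 5), (3, 4), (7, 3), (30, 2), (180, 1)):
        if days_since_last_contact <= max_days:
            return points
    return 0


def frequency_points(times_contacted):
    """5 for 25+, 4 for 15+, 3 for 10+, 2 for 5+, 1 for 2+."""
    for min_count, points in ((25, 5), (15, 4), (10, 3), (5, 2), (2, 1)):
        if times_contacted >= min_count:
            return points
    return 0


def contact_score(days_since_last_contact, times_contacted):
    # Assumption: recency and frequency are weighted equally and added.
    return recency_points(days_since_last_contact) + frequency_points(times_contacted)


print(contact_score(days_since_last_contact=2, times_contacted=12))
```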
Other than just arbitrarily picked numbers, does anyone have any input? Other numbers are also welcome if you can give a reason why you think they're better than mine.
Edit: This would be primarily in a business environment where recentness (yay for making up words) is often just as important as frequency. Also, past a certain point there really isn't much difference between say someone you talked to 80 times vs say 30 times.
Take a look at Self organizing lists.
A quick and dirty look:
Move to Front Heuristic:
A linked list, Such that whenever a node is selected, it is moved to the front of the list.
Frequency Heuristic:
A linked list, such that whenever a node is selected, its frequency count is incremented, and then the node is bubbled towards the front of the list, so that the most frequently accessed is at the head of the list.
It looks like the move to front implementation would best suit your needs.
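A small sketch of the move-to-front heuristic applied to auto-complete. It uses a plain Python list instead of a linked list for brevity, and the class and method names are made up.

```python
class MoveToFrontList:
    """Self-organizing list: the most recently selected item is always first."""

    def __init__(self, items=()):
        self._items = list(items)

    def select(self, item):
        """Record that an address was used by moving it to the front."""
        if item in self._items:
            self._items.remove(item)
        self._items.insert(0, item)

    def complete(self, prefix, limit=10):
        """Matches for the typed prefix, most recently used first."""
        matches = [item for item in self._items if item.startswith(prefix)]
        return matches[:limit]


contacts = MoveToFrontList(["alice@example.com", "alan@example.com", "bob@example.com"])
contacts.select("alan@example.com")
print(contacts.complete("al"))
```

Work is done only for the address that was actually selected; nothing else in the list needs to be rescored.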
EDIT: When an address is selected, add one to its frequency, and move it to the front of the group of nodes with the same weight (or (weight div x) for coarser groupings). I see aging as a real problem with your proposed implementation, in that it requires calculating a weight on each and every item. A self-organizing list is a good way to go, but the algorithm needs a bit of tweaking to do what you want.
Further Edit:
Aging refers to the fact that weights decrease over time, which means you need to know each and every time an address was used. That in turn means you have to have the entire email history available when you construct your list.
The issue is that we want to perform calculations (other than search) on a node only when it is actually accessed. This is what gives us statistically good performance.
This seems similar to what Firefox does when suggesting the site you are typing.
Unfortunately I don't know exactly how Firefox does it. A point system seems good as well; maybe you'll need to balance your points :)
I'd go for something similar to:
NoM = Number of Mail
(NoM sent to X today) + 1/2 * (NoM sent to X during the last week)/7 + 1/3 * (NoM sent to X during the last month)/30
Contacts you did not write to during the last month (this period could be changed) will have 0 points. You could sort those by total NoM sent (they are on the contact list, after all :). These will be shown after the contacts with points > 0.
It's just an idea; the point is to give different weight to the most-mailed and the most recently mailed contacts.
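The formula and the fallback ordering can be sketched like this; the tuple layout and function names are hypothetical, and the counts are assumed to be already available per contact.

```python
def mail_score(sent_today, sent_last_week, sent_last_month):
    """today + 1/2 * (weekly daily average) + 1/3 * (monthly daily average)."""
    return sent_today + sent_last_week / (2 * 7) + sent_last_month / (3 * 30)


def rank_contacts(contacts):
    """contacts: list of (address, today, last_week, last_month, total_sent).

    Sorted by score, highest first; ties (including all zero-score contacts)
    are ordered by total mails sent.
    """
    return sorted(
        contacts,
        key=lambda c: (mail_score(c[1], c[2], c[3]), c[4]),
        reverse=True,
    )


contacts = [
    ("old@example.com", 0, 0, 0, 500),
    ("boss@example.com", 2, 14, 90, 300),
    ("rare@example.com", 0, 0, 0, 3),
]
print([c[0] for c in rank_contacts(contacts)])
```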
If you want to get crazy, mark the most 'active' emails in one of several ways:
Last access
Frequency of use
Contacts with pending sales
Direct bosses
Etc
Then, present the active emails at the top of the list. Pay attention to which "group" your user uses most. Switch to that sorting strategy exclusively after enough data is collected.
It's a lot of work but kind of fun...
Maybe count the number of emails sent to each address. Then:
ORDER BY EmailCount DESC, LastName, FirstName
That way, your most-often-used addresses come first, even if they haven't been used in a few days.
I like the idea of a point-based system, with points for recent use, frequency of use, and potentially other factors (prefer contacts in the local domain?).
I've worked on a few systems like this, and neither "most recently used" nor "most commonly used" work very well. The "most recent" can be a real pain if you accidentally mis-type something once. Alternatively, "most used" doesn't evolve much over time, if you had a lot of contact with somebody last year, but now your job has changed, for example.
Once you have the set of measurements you want to use, you could create an interactive application to test out different weights, and see which ones give you the best results for some sample data.
This paper describes a single-parameter family of cache eviction policies that includes least recently used and least frequently used policies as special cases.
The parameter, lambda, ranges from 0 to 1. When lambda is 0 it performs exactly like an LFU cache, when lambda is 1 it performs exactly like an LRU cache. In between 0 and 1 it combines both recency and frequency information in a natural way.
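To give a feel for how one parameter can blend the two, here is a sketch in which every past use contributes 0.5 ** (lambda * age) to a contact's score. This particular weighting is an assumption chosen to reproduce the end-point behaviour described above (pure counting at lambda = 0, recency dominating at lambda = 1); it is not taken from the paper, and the class name is made up.

```python
class RecencyFrequencyScore:
    """Score = sum over past uses of 0.5 ** (lam * age of that use)."""

    def __init__(self, lam):
        self.lam = lam
        self.value = 0.0       # score as of the last use
        self.last_time = None  # time of the last use

    def score_at(self, now):
        """Current score, decayed from the time of the last use."""
        if self.last_time is None:
            return 0.0
        return self.value * 0.5 ** (self.lam * (now - self.last_time))

    def touch(self, now):
        """Record a use at time `now`; only this one entry is recomputed."""
        self.value = self.score_at(now) + 1.0
        self.last_time = now


for lam in (0.0, 1.0):
    often, recent = RecencyFrequencyScore(lam), RecencyFrequencyScore(lam)
    for t in (0, 1, 2):
        often.touch(t)   # used three times, a while ago
    recent.touch(3)      # used once, most recently
    print(lam, often.score_at(4), recent.score_at(4))
```

With lambda = 0 the often-used contact ranks higher; with lambda = 1 the recently used one does.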
In spite of an answer having been chosen, I want to submit my approach for consideration, and feedback.
I would account for frequency by incrementing a counter each use, but by some larger-than-one value, like 10 (To add precision to the second point).
I would account for recency by multiplying all counters at regular intervals (say, 24 hours) by some diminisher (say, 0.9).
Each use:
UPDATE `addresslist` SET `favor` = `favor` + 10 WHERE `address` = 'foo@bar.com'
Each interval:
UPDATE `addresslist` SET `favor` = FLOOR(`favor` * 0.9)
In this way I collapse both frequency and recency to one field, avoid the need for keeping a detailed history to derive {last day, last week, last month} and keep the math (mostly) integer.
The increment and diminisher would have to be adjusted to preference, of course.
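To see how the two updates interact over time, here is a small simulation of the same arithmetic outside the database. It uses integer maths (keep 90 out of every 100) in place of FLOOR(favor * 0.9); the function names are made up.

```python
def use(favor, increment=10):
    """Each use: favor = favor + 10."""
    return favor + increment


def decay(favor, keep_percent=90):
    """Each interval: favor = FLOOR(favor * 0.9), done in integer arithmetic."""
    return favor * keep_percent // 100


favor = 0
favor = use(favor)       # one mail sent
history = [favor]
for _ in range(5):       # five quiet intervals afterwards
    favor = decay(favor)
    history.append(favor)
print(history)  # [10, 9, 8, 7, 6, 5]
```

Because of the floor, a contact that is never used again eventually reaches 0 rather than hovering at a small fraction.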