I have a LAMP-based business application, SugarCRM to be more precise. There are 120+ active users at the moment. Every day each user generates some records that are used in a complex calculation to get a so-called "individual rating".
It takes about 6 seconds to calculate one "individual rating" value, and that was not a big problem before: each user hits the provided link to start the "individual rating" calculation, waits 6-7 seconds, and gets the value displayed.
But now I need to implement an "overall rating" calculation. That means that in addition to the "individual rating" I have to calculate and display to the user:
minimum individual rating among ALL the users of the application
maximum individual rating among ALL the users of the application
current user's position among all individual ratings.
Say the current user has an individual rating of 220 points, the minimum rating is 80, the maximum is 235, and he is in 23rd position among all users.
What are (imho) the main problems to be solved?
If one calculation takes 6 seconds, the overall calculation will take more than 10 minutes (120+ users × 6 seconds). I think it's no good to make the application almost inaccessible for that period. And what if the number of users grows 2-3 times in the near future?
Those calculations could be done as a nightly job, but the users are in different timezones. In Russia the difference between the extreme timezones is 9 hours, so people in the western part of Russia are still working in "today" while people in the eastern part are waking up to work in "tomorrow". So what is the best time for a nightly job in this case?
Are there any best practices, approaches, or algorithms for building such a rating system?
Given only the information provided, the only options I see:
The obvious one - reduce the time taken for a rating calculation (6 seconds to calculate 1 user's rating seems like a lot)
If possible, have intermediate values, only some of which you recalculate as required (for example, have 10 values that make up the rating, all based on different data; when some of the data changes, flag the appropriate values for recalculation). A sketch of this is shown after this list. Either do this recalculation:
During your daily recalculation or
When the update happens
Partial batch calculation - only recalculate x of the users' ratings at chosen intervals (where x is some chosen value) - has the disadvantage that, at all times, some of the ratings can be out of date
Calculate if not busy - either continuously recalculate ratings or only do so at a chosen interval, but instead of locking the system, have it run as a background process, only doing work if the system is idle
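For the "intermediate values" idea above, here is a minimal sketch; it assumes the rating is a simple sum of independently computed components, and all names are hypothetical:

```python
# Hypothetical sketch: cache each rating component and only recompute the
# components whose underlying data changed (a per-component "dirty" flag).

class RatingCache:
    def __init__(self, component_funcs):
        # component_funcs: {component_name: function(user_id) -> number}
        self.component_funcs = component_funcs
        self.values = {}   # user_id -> {component_name: cached value}
        self.dirty = {}    # user_id -> set of component names needing recalculation

    def mark_dirty(self, user_id, component_name):
        """Call this whenever data affecting one component changes."""
        self.dirty.setdefault(user_id, set()).add(component_name)

    def rating(self, user_id):
        """Recompute only what is needed, then sum the cached components."""
        cached = self.values.get(user_id)
        if cached is None:                       # first call: compute everything
            cached = self.values[user_id] = {}
            to_compute = set(self.component_funcs)
        else:                                    # later calls: only dirty components
            to_compute = self.dirty.pop(user_id, set())
        for name in to_compute:
            cached[name] = self.component_funcs[name](user_id)
        return sum(cached.values())
```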
(Sorry, I couldn't fit this into a comment, so I decided to post it as an answer.)
@Dukeling
The SQL query that takes almost all of the calculation time mentioned above is just a replication of business logic that would otherwise be executed in PHP code. The logic was moved into SQL in the hope of reducing calculation time. OK, I'll try both optimizing the SQL query and executing the logic in PHP code.
Suppose that after optimization the application calculates an individual rating in just 1 second. Great! But even in this case, the first user logged into the system would have to wait 120 seconds (120+ users × 1 sec = 120 sec) for the overall rating to be calculated and to get his position in it.
I’m thinking of implementing the following approach:
Let’s have 2 “overall ratings” – “today” and “yesterday”.
For display purposes we'll use the "yesterday" overall rating, represented as a big, already sorted PHP array.
When a user hits the calculation link he starts the "today" calculation, but the application displays the "yesterday" value to him. Thus we have a quickly accessible "yesterday" rating, and each user randomly launches the rating calculation that will be displayed tomorrow.
The user list is partitioned by timezones. Each hour a cron job starts and checks whether there are any users in the selected timezone who don't have a "today" individual rating calculated yet (e.g. the user didn't log into the application). If so, the application calculates the individual rating and puts its value into the "today" (still invisible) overall rating array. Thus we have a cron job that runs nightly for each timezone-specific user group and fills the probable gaps in case users didn't log into the system.
After all users in all timezones have been processed, the application
sorts the "today" array,
drops the "yesterday" one,
renames "today" to "yesterday", and
initializes a new "today".
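Purely for illustration, a rough sketch of that rotation (the real thing would live in SQL/PHP; all names here are hypothetical):

```python
# Hypothetical sketch of the "today"/"yesterday" rotation described above.
# "today" accumulates ratings as users (or the hourly cron) trigger calculations;
# "yesterday" is the sorted snapshot that is actually shown to users.

def record_individual_rating(today, user_id, rating):
    """Called whenever a user's "today" individual rating has been calculated."""
    today[user_id] = rating

def overall_rating_view(yesterday_sorted, user_id):
    """Return (rating, 1-based position, max, min) from the displayed snapshot,
    or None if the user has no rating in yesterday's snapshot."""
    for position, (uid, rating) in enumerate(yesterday_sorted, start=1):
        if uid == user_id:
            return rating, position, yesterday_sorted[0][1], yesterday_sorted[-1][1]
    return None

def nightly_rotate(today):
    """After every timezone has been processed: sort, promote, start fresh."""
    yesterday_sorted = sorted(today.items(), key=lambda kv: kv[1], reverse=True)
    return yesterday_sorted, {}   # new "yesterday" snapshot, new empty "today"
```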
What do you think of it? Is it reasonable enough or not?
Related
Scenario:
I need to give users the opportunity to book different times for the service.
The caveat is that I don't have bookings in advance; I need to fill them as they come in.
Bookings can be represented as key-value pairs:
[startTime, duration]
So, for example, [9,3] would mean the event starts at 9 o'clock and has a duration of 3 hours.
Rules:
users come in one by one; there is never a batch of user requests
no bookings can overlap
service is available 24/7 so no need to worry about “working time”
users choose duration on their own
obviously, once a user chooses and confirms his booking we cannot shuffle it anymore
we don't want gaps shorter than some amount of time. This rule is based on the probability that future users will fill the gap: for example, if the distribution of durations over users' bookings is such that the probability of future users filling a gap shorter than x hours is less than p, then we want a rule that a gap cannot be shorter than x. (For the purposes of this question we can assume x is hardcoded; here I'm just explaining the reasoning.)
the goal is to maximize the service's busy duration
My thinking so far...
I keep the list of bookings made so far
I also keep track of gaps (as they are potential slots for new users booking)
When a new user comes with his booking [startTime, duration], I first check for the ideal case where gapLength = duration. If there are no such gaps, I find all slots (gaps) that satisfy the condition gapLength - duration > minimumGapDuration and order them in descending order by that gapLength - duration value.
I assign the user to the first gap, the one with the maximum value of gapLength - duration, since that gives me the highest probability that the gap remaining after this booking will also get filled in the future (a rough sketch is below).
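Here is a minimal sketch of that gap-selection rule, assuming the booking always starts at the beginning of the chosen gap and that minimumGapDuration is hardcoded (all names are hypothetical):

```python
# Hypothetical sketch of the gap-selection rule described above.
# gaps: list of (gapStart, gapLength); duration: the requested booking length.

MINIMUM_GAP_DURATION = 2   # hardcoded "x" from the rules above

def choose_gap(gaps, duration):
    # Ideal case: a gap the booking fills exactly.
    for gap in gaps:
        if gap[1] == duration:
            return gap
    # Otherwise: among gaps whose leftover stays above the minimum,
    # take the one with the largest leftover (gapLength - duration).
    candidates = [g for g in gaps if g[1] - duration > MINIMUM_GAP_DURATION]
    return max(candidates, key=lambda g: g[1] - duration) if candidates else None

def book(gaps, duration):
    """Place a booking; return (booking, updated gaps) or (None, gaps)."""
    gap = choose_gap(gaps, duration)
    if gap is None:
        return None, gaps
    start, length = gap
    remaining = [g for g in gaps if g != gap]
    if length > duration:                          # the leftover becomes a new gap
        remaining.append((start + duration, length - duration))
    return (start, duration), remaining
```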
Questions:
Are there any problems with my approach that I'm missing?
Are there any algorithms that solve this particular problem?
Is there some usual approach (a good starting point) that I could start with and optimize later? (I'm really just trying to gather enough information to start without making some critical mistake; optimizations can/should come later.)
PS.
From my research so far it sounds like this might be a case for constraint programming. I would like to avoid it if possible, as I have no clue about it (maybe it's simple, I just don't know), but if it makes a real difference I will go for its benefits and implement it.
I went through Stack Overflow for similar problems but didn't find one with unknown future events. If there is such a question and this is a direct duplicate, please refer me to it.
There is a carwash that can only service 1 customer at a time. The goal of the carwash is to have as many happy customers as possible by making them wait the least amount of time in the queue. If customers can be serviced within 15 minutes they are joyful, within an hour they are happy, between 1 and 3 hours neutral, and between 3 and 8 hours angry. (The goal is to minimize angry people and maximize happy people.) The only caveat is that each car takes a different amount of time to wash and service, so we cannot always serve on a first-come, first-served basis given the goal of maximizing customer utility. So it may look like this:
Customer Line :
Customer1) Task:6 Minutes (1st arrival)
Customer2) Task:3 Minutes (2nd arrival)
Customer3) Task:9 Minutes (3rd)
Customer4) Task:4 Minutes (4th)
Service Line:
Serve Customer 2, Serve Customer 1, Serve Customer 4, Serve Customer 3.
In this way, no one waited in line for more than 15 minutes before being served. I know I should use some form of priority queue to implement this, but I honestly don't know how I should give priority to the different customers. I cannot give priority to the customers with the least amount of work needed, since they may, for example, be the last customers to have arrived (the others would be very angry), and I cannot just compare based on arrival time, since the first guy might have a task that takes the whole day. So how would I compare these customers to each other with the goal of maximizing happiness?
Cheers
This is similar to ordering calls to a call center. It becomes more interesting when you have gold and silver customers. Anyway:
Every car has a readyTime (the earliest time it can be washed, so its arrival time, which might be different for each car) and a dueTime (the moment the customer becomes angry, so 3 hours after readyTime).
Then use a constraint solver (like OptaPlanner) to determine the order of the cars (*). The order of the cars (which is a genuine planning variable) implies the startWashingTime of each car (which is a shadow variable), because in your example, if customer 1 is ordered after customer 2 and if we start at 08:00, we can deduce that customer 1's startWashingTime is 08:03.
Then the endWashingTime is startWashingTime + washingDuration.
Then it's just a matter of adding 2 constraints and letting the solver solve() it:
endWashingTime must be lower than dueTime; this is a hard constraint. This is to have no angry customers.
endWashingTime must be lower than readyTime plus 15 minutes; this is a soft constraint. This is to maximize happy customers.
(*) This problem is NP-complete or NP-hard because you can relax it to a knapsack problem. In practice this means: you can't write an algorithm for it that scales out and finds the optimal solution in reasonable time. But a constraint solver can get you close.
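Not OptaPlanner itself, but a small sketch of how a candidate washing order could be scored against those two constraints (the hard score penalizes angry customers, the soft score penalizes customers who are not served within 15 minutes). Times are in minutes, the arrival times are assumed to be one minute apart since the question only gives the arrival order, and the brute-force search at the end only works for toy sizes, which is exactly where a solver comes in:

```python
from itertools import permutations

# Each car: (readyTime, washingDuration), in minutes from the start of the day.

def score(order):
    """Return (hard, soft); higher is better, with the hard score compared first."""
    hard = soft = 0
    clock = 0
    for ready, duration in order:
        start = max(clock, ready)      # startWashingTime: can't start before arrival
        end = start + duration         # endWashingTime
        clock = end
        if end > ready + 180:          # dueTime = readyTime + 3 hours (hard constraint)
            hard -= 1                  # an angry customer
        if end > ready + 15:           # 15-minute threshold (soft constraint)
            soft -= 1
    return hard, soft

# The four customers from the question, assumed to arrive one minute apart:
cars = [(0, 6), (1, 3), (2, 9), (3, 4)]
best_order = max(permutations(cars), key=score)
```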
Actually my question is the same as the one stated here https://parse.com/questions/issue-with-using-countobjects-on-class-with-large-number-of-objects-millions, a nine-month-old thread that unfortunately never got a real answer.
I have a parse backend for a mobile game, and I want to query the global rank of the user based on his score.
How can I query this, assuming I have more than 1000 users, and a lot more score entries?
Also, I want to be able to query the all-time rank as well as the last-24-hours rank.
Is this even possible with parse?
If you need to do large queries like that, you should redesign your solution. Keep a leaderboard class that gets updated with cloud code; either in an afterSave hook, or with a scheduled job that runs regularly. Your primary focus must be on lightning fast lookups. As soon as your lookups start getting large or calculation heavy, you should redesign your backend for scalability.
A solution for all-time and time-based leaderboards:
You should keep all users' score records in a separate entity, say LeaderBoard, and maintain the list of top users for the all-time or time-based leaderboard in another separate entity, say TopScoreUsers. This way you do not need to run a query against the actual leaderboard entity (i.e. LeaderBoard) every time to get your top users.
Suppose your app/game has an all-time leaderboard that should show only the top 100 users. When any user submits a score, you first check whether this user falls into the top-100 list (from TopScoreUsers). If yes, add them to the TopScoreUsers entity (if this list already has 100 users, find the place of the current user in this list and remove the last record) and then update the user's own score in the LeaderBoard entity; otherwise only update the user's score in the LeaderBoard entity. (A rough sketch of this follows below.)
The same applies to a time-based leaderboard.
For user rank, the recommended way is to use some calculation or formula to determine an approximate rank; otherwise it is tough to find/manage a user's actual rank when you have millions of users.
Hopefully this will help you.
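Not Parse-specific, but a rough sketch of the TopScoreUsers maintenance described above, keeping only the top 100 scores in a small structure that is cheap to read (names are hypothetical, and for simplicity it ignores the case of a user already in the top list improving their own score):

```python
import heapq

TOP_N = 100

def submit_score(top_scores, leaderboard, user_id, score):
    """leaderboard: {user_id: best score}  -- the full LeaderBoard entity.
    top_scores: min-heap of (score, user_id) with at most TOP_N entries,
    so top_scores[0] is always the lowest score still in the top list."""
    leaderboard[user_id] = max(score, leaderboard.get(user_id, score))
    if len(top_scores) < TOP_N:
        heapq.heappush(top_scores, (score, user_id))
    elif score > top_scores[0][0]:
        # The new score beats the lowest of the current top 100: replace it.
        heapq.heapreplace(top_scores, (score, user_id))
```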
Let's say you have a large number of transactions over time flowing into/out of an account. Every time a new transaction comes in you need to recalculate the balance to show to the user. It's inefficient to recalculate the total balance (summing up the rows from all transactions) every time a new transaction comes in (imagine there are too many transactions).
Obviously you could increment or decrement the balance. But let's say actions aren't guaranteed to happen atomically; for example, in between saving a transaction and the increment, your database could restart, missing the increment.
I'm wondering what algorithms/approaches exist to solve this. It seems like you could:
do the increment/decrement method, and periodically correct it (it would still periodically show an incorrect balance, though, which is not great)
do a map/reduce, although this is still overkill if new transactions come in frequently (it goes over 100% of the old data even though only a tiny portion changed)
save some sort of markers over time, so once you've summed a period of time you can reuse that sum later, maybe in some sort of hierarchy so you can keep the day's total, the week's total, the month's, the year's, etc., making it always fast to get the current balance (today's transactions plus the prior days, weeks, months, years; a rough sketch of this is below)
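To make that third option concrete, here's a rough sketch of a daily snapshot plus the day's transactions; the names and the storage layout are just placeholders:

```python
import datetime

def balance(snapshots, transactions, as_of):
    """snapshots: {date: balance at the end of that day}, written once per day.
    transactions: list of (timestamp, amount).
    as_of: datetime at which we want the balance."""
    # Latest snapshot from a day strictly before the day of interest.
    prior_days = [d for d in snapshots if d < as_of.date()]
    base_day = max(prior_days) if prior_days else datetime.date.min
    base = snapshots.get(base_day, 0)
    # Only the transactions after the snapshot are summed; the bulk of the
    # history has already been folded into the snapshot.
    recent = sum(amount for ts, amount in transactions
                 if ts.date() > base_day and ts <= as_of)
    return base + recent
```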
It seems like this has been invented before. What should I be searching for? Thanks!
Suppose you were able to keep track of the news mentions of different entities, like say "Steve Jobs" and "Steve Ballmer".
What are some ways you could tell whether the number of mentions per entity in a given time period was unusual relative to their normal frequency of appearance?
I imagine that for a more popular person like Steve Jobs an increase of like 50% might be unusual (an increase of 1000 to 1500), while for a relatively unknown CEO an increase of 1000% for a given day could be possible (an increase of 2 to 200). If you didn't have a way of scaling that your unusualness index could be dominated by unheard-ofs getting their 15 minutes of fame.
update: To make it clearer, it's assumed that you are already able to get a continuous news stream and identify entities in each news item and store all of this in a relational data store.
You could use a rolling average. This is how a lot of stock trackers work. By tracking the last n data points, you could see if this change was a substantial change outside of their usual variance.
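A minimal sketch of that rolling-window check, flagging a count that lands outside the recent variance; the window size and the number of standard deviations are arbitrary knobs:

```python
from statistics import mean, stdev

def is_unusual(history, todays_count, window=30, num_sds=2.0):
    """history: list of daily mention counts for one entity, oldest first."""
    recent = history[-window:]
    if len(recent) < 2:
        return False                       # not enough data to judge the variance
    mu, sd = mean(recent), stdev(recent)
    if sd == 0:
        return todays_count != mu          # any change is "unusual" for a flat series
    return abs(todays_count - mu) > num_sds * sd
```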
You could also try some normalization. One very simple approach would be that each category has a total number of mentions (m), a percent change from the last time period (δ), and then some normalized value (z) where z = m * δ. Let's look at the table below (m0 is the previous value of m):
Name           m     m0    δ     z
Steve Jobs     4950  4500  0.10  495
Steve Ballmer  400   300   0.33  132
Larry Ellison  50    10    4.00  200
Andy Nobody    50    40    0.20  10
Here, a 400% change for the unknown Larry Ellison results in a z value of 200, a 10% change for the much better known Steve Jobs gives 495, and my spike of 20% is still a low 10. You could tweak this algorithm depending on what you feel are good weights, or use the standard deviation or the rolling average to find whether this is far away from the "expected" results.
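The same normalization in a couple of lines, just to pin down the formula (the asserts use the rows above where δ is exact):

```python
def z_value(m, m0):
    """z = m * delta, where delta is the relative change from the last period."""
    delta = (m - m0) / m0
    return m * delta

assert round(z_value(4950, 4500)) == 495   # Steve Jobs: 10% change
assert round(z_value(50, 10)) == 200       # Larry Ellison: 400% change
```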
Create a database and keep a history of stories with a time stamp. You then have a history of stories over time of each category of news item you're monitoring.
Periodically calculate the number of stories per unit of time (you choose the unit).
Test if the current value is more than X standard deviations away from the historical data.
Some data will be more volatile than others, so you may need to adjust X appropriately. X = 1 is a reasonable starting point.
Way oversimplified:
store people's names and the number of articles created in the past 24 hours that involve their name, and compare to historical data.
Real life:
If you're trying to dynamically pick out people's names, how would you go about doing that? Searching through articles how do you grab names? Once you grab a new name, do you search for all articles for him? How do you separate out Steve Jobs from Apple from Steve Jobs the new star running back that is generating a lot of articles?
If you're looking for simplicity, create a table with 50 people's names that you actually insert. Every day at midnight, have your program run a quick google query for past 24 hours and store the number of results. There are a lot of variables in this though that we're not accounting for.
The method you use is going to depend on the distribution of the counts for each person. My hunch is that they are not going to be normally distributed, which means that some of the standard approaches to longitudinal data might not be appropriate - especially for the small-fry, unknown CEOs you mention, who will have data that are very much non-continuous.
I'm really not well-versed enough in longitudinal methods to give you a solid answer here, but here's what I'd probably do if you locked me in a room to implement this right now:
Dig up a bunch of past data. Hard to say how much you'd need, but I would basically go until it gets computationally insane or the timeline gets unrealistic (not expecting Steve Jobs references from the 1930s).
In preparation for creating a simulated "probability distribution" of sorts (I'm using terms loosely here), more recent data needs to be weighted more than past data - e.g., a thousand years from now, hearing one mention of (this) Steve Jobs might be considered a noteworthy event, so you wouldn't want to be using expected counts from today (Andy's rolling mean is using this same principle). For each count (day) in your database, create a sampling probability that decays over time. Yesterday is the most relevant datum and should be sampled frequently; 30 years ago should not.
Sample out of that dataset using the weights and with replacement (i.e., same datum can be sampled more than once). How many draws you make depends on the data, how many people you're tracking, how good your hardware is, etc. More is better.
Compare your actual count of stories for the day in question to that distribution. What percent of the simulated counts lie above your real count? That's roughly (god don't let any economists look at this) the probability of your real count or a larger one happening on that day. Now you decide what's relevant - 5% is the norm, but it's an arbitrary, stupid norm. Just browse your results for awhile and see what seems relevant to you. The end.
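A rough sketch of steps 2-4, using exponentially decaying sampling weights; the decay rate and the number of draws are arbitrary choices:

```python
import random

def simulated_p_value(past_counts, todays_count, decay=0.99, draws=10_000):
    """past_counts: daily mention counts for one entity, oldest first.
    Returns the share of weighted resamples >= today's count -- roughly the
    probability of seeing today's count or larger under the sampled history."""
    n = len(past_counts)
    # More recent days get larger weights; the weight decays going back in time.
    weights = [decay ** (n - 1 - i) for i in range(n)]
    sample = random.choices(past_counts, weights=weights, k=draws)  # with replacement
    return sum(1 for c in sample if c >= todays_count) / draws
```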
Here's what sucks about this method: there's no trend in it. If Steve Jobs had 15,000 a week ago, 2000 three days ago, and 300 yesterday, there's a clear downward trend. But the method outlined above can only account for that by reducing the weights for the older data; it has no way to project that trend forward. It assumes that the process is basically stationary - that there's no real change going on over time, just more and less probable events from the same random process.
Anyway, if you have the patience and willpower, check into some real statistics. You could look into multilevel models (each day is a repeated measure nested within an individual), for example. Just beware of your parametric assumptions... mention counts, especially on the small end, are not going to be normal. If they fit a parametric distribution at all, it would be in the Poisson family: the Poisson itself (good luck), the overdispersed Poisson (aka negative binomial), or the zero-inflated Poisson (quite likely for your small-fry, no chance for Steve).
Awesome question, at any rate. Lend your support to the statistics StackExchange site, and once it's up you'll be able to get a much better answer than this.