My question is the same as the one asked here: https://parse.com/questions/issue-with-using-countobjects-on-class-with-large-number-of-objects-millions, a nine-month-old thread that unfortunately never got a real answer.
I have a Parse backend for a mobile game, and I want to query a user's global rank based on their score.
How can I query this, assuming I have more than 1000 users and many more score entries?
I also want to be able to query the all-time rank as well as the last-24-hours rank.
Is this even possible with Parse?
If you need to do large queries like that, you should redesign your solution. Keep a leaderboard class that gets updated with cloud code; either in an afterSave hook, or with a scheduled job that runs regularly. Your primary focus must be on lightning fast lookups. As soon as your lookups start getting large or calculation heavy, you should redesign your backend for scalability.
A solution for both all-time and time-based leaderboards:
Keep every user's score in one entity, say LeaderBoard, and maintain the list of top users for the all-time or time-based leaderboard in a separate entity, say TopScoreUsers. That way you do not need to query the full LeaderBoard entity every time you want the top users.
Suppose your app or game has an all-time leaderboard that shows only the top 100 users. When a user submits a score, first check whether it places the user in the top 100 (from TopScoreUsers). If it does, add the user to TopScoreUsers (if the list already holds 100 users, insert the user at the right position and remove the last record), then update the user's own score in LeaderBoard. Otherwise, only update the user's score in LeaderBoard.
The same applies to time-based leaderboards.
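The top-100 bookkeeping described above could look roughly like this. It is only an in-memory sketch in Python, not Parse Cloud Code; the class name TopScores and its methods are made up for illustration, and a real version would store the entries in the TopScoreUsers entity.

```python
import bisect


class TopScores:
    """Keeps at most `limit` (score, user) entries; hypothetical helper."""

    def __init__(self, limit=100):
        self.limit = limit
        self.entries = []  # sorted ascending by score, so the best entry is last

    def submit(self, user, score):
        """Returns True if the user is on the board after this submission."""
        old = next((e for e in self.entries if e[1] == user), None)
        if old is not None:
            if old[0] >= score:
                return True  # the existing entry is at least as good
            self.entries.remove(old)
        if len(self.entries) >= self.limit and score <= self.entries[0][0]:
            return False  # board is full and the score does not beat the lowest entry
        bisect.insort(self.entries, (score, user))
        if len(self.entries) > self.limit:
            self.entries.pop(0)  # remove the last-ranked record
        return True

    def top(self):
        """Entries as (user, score), highest score first."""
        return [(user, score) for score, user in reversed(self.entries)]
```

Whether or not submit returns True, the user's own score would still be written to LeaderBoard, as described above.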
For a user's rank, the recommended approach is to estimate it with a calculation or formula; finding and maintaining every user's exact rank is hard once there are millions of users.
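One possible way to do such an estimate (my assumption, the answer does not name a specific formula) is to keep a count of users per score bucket and derive rank bounds from the counts. The names RankEstimator, bucket_size and max_score are hypothetical, and scores are assumed to be non-negative integers.

```python
class RankEstimator:
    """Approximate ranks from per-bucket user counts; hypothetical helper."""

    def __init__(self, bucket_size=100, max_score=10_000):
        self.bucket_size = bucket_size
        self.counts = [0] * (max_score // bucket_size + 1)

    def _bucket(self, score):
        # Scores above max_score all land in the last bucket.
        return min(score // self.bucket_size, len(self.counts) - 1)

    def add(self, score):
        self.counts[self._bucket(score)] += 1

    def move(self, old_score, new_score):
        """Call when a user's stored score changes."""
        self.counts[self._bucket(old_score)] -= 1
        self.counts[self._bucket(new_score)] += 1

    def rank_bounds(self, score):
        """Returns (best, worst) possible rank for a user with this score."""
        b = self._bucket(score)
        higher = sum(self.counts[b + 1:])
        return higher + 1, higher + max(self.counts[b], 1)
```

The cost of a lookup depends on the number of buckets rather than the number of users, and smaller buckets give tighter bounds.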
Hopefully this will help you.
There are mainly two entities in this problem:
Leaderboard - holds the type (lowest first/highest first), description, name, etc.
Score - the score value submitted by the player, held together with the player details
Use cases:
We need to fetch the top 10 scores
For a monthly leaderboard, we need to find the top 3
Domain Rules:
A player can submit any number of scores
The leaderboard ranking needs to be based on the type defined in leaderboard (lowest/highest)
For such a system, where
Leaderboard and Score have a one-to-many relationship
Score needs to hold information about the player (which is a separate aggregate root in a different Bounded Context)
how should it be designed in DDD?
Scenario 1:
Will Leaderboard be the aggregate root, with every score added through the Leaderboard aggregate root?
Queries:
Here, a score has no meaning without a Leaderboard, yet no domain rule requires a score to be added via the Leaderboard aggregate root. This is a dilemma; how should it be handled?
How do I get the player details to put into the score? Do I need to fetch the player details in a domain service and pass them to the Leaderboard aggregate root when adding the score?
Scenario 2:
Leaderboard and LeaderboardScore are two different Aggregate roots.
Queries:
While calculating ranks, do we need to fetch the scores from the score aggregate root and the type info from the leaderboard to fulfil the use case?
Should most of the code serving the use cases then live in a Domain Service or in an Application Service?
I would approach it with score and leaderboard being their own aggregates. Score changes publish domain events which get fed (asynchronously, since eventual consistency is probably OK) to update the leaderboard.
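A rough sketch of that idea, in Python for brevity. The names ScoreSubmitted, EventBus and the in-process synchronous bus are placeholders I made up; in a real system the events would go through a queue or message broker and be handled asynchronously, and the player details would be resolved from the other bounded context by id.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreSubmitted:
    """Domain event published when a score is saved (hypothetical name)."""
    leaderboard_id: str
    player_id: str
    value: int


class EventBus:
    """Synchronous stand-in for an asynchronous message bus."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event):
        for handler in self._handlers[type(event)]:
            handler(event)


class Leaderboard:
    """Own aggregate; keeps each player's best score for its ranking type."""

    def __init__(self, leaderboard_id, lowest_first=False):
        self.id = leaderboard_id
        self.lowest_first = lowest_first
        self._best = {}  # player_id -> best value seen so far

    def on_score_submitted(self, event):
        if event.leaderboard_id != self.id:
            return
        current = self._best.get(event.player_id)
        if current is None:
            self._best[event.player_id] = event.value
        elif self.lowest_first and event.value < current:
            self._best[event.player_id] = event.value
        elif not self.lowest_first and event.value > current:
            self._best[event.player_id] = event.value

    def top(self, n):
        ranked = sorted(self._best.items(), key=lambda kv: kv[1],
                        reverse=not self.lowest_first)
        return ranked[:n]
```

Usage would be along the lines of bus.subscribe(ScoreSubmitted, board.on_score_submitted), with the score aggregate (or its repository) publishing a ScoreSubmitted event after each save.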
Scenario:
I need to give users the opportunity to book different times for the service.
The caveat is that I don't have the bookings in advance; I need to place them as they come in.
Bookings can be represented as pairs:
[startTime, duration]
So, for example, [9,3] would mean the event starts at 9 o’clock and has a duration of 3 hours.
Rules:
users come in one by one; there is never a batch of user requests
no bookings can overlap
the service is available 24/7, so there is no need to worry about “working time”
users choose the duration on their own
obviously, once a user chooses and confirms a booking, we cannot move it anymore
we don't want gaps shorter than some amount of time. This is based on the probability that future users will fill the gap. For example, if the distribution of booking durations is such that the probability of a future user filling a gap shorter than x hours is less than p, then we want a rule that a gap cannot be shorter than x. (For the purpose of this question, we can assume x is hardcoded; here I am just explaining the reasons.)
the goal is to maximize the time the service is busy
My thinking so far...
I keep the list of bookings made so far
I also keep track of gaps (as they are potential slots for new bookings)
When a new user comes with a booking [startTime, duration], I first check for the ideal case where gapLength = duration. If there is no such gap, I find all slots (gaps) that satisfy gapLength - duration > minimumGapDuration and order them in descending order by gapLength - duration
I assign the user to the first gap, the one with the maximum value of gapLength - duration, since that gives the highest probability that the gap remaining after this booking will also be filled in the future
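To make the selection rule concrete, here is how I would write it down in Python. This is only a sketch of the rule as stated above: gaps are assumed to be (start, length) pairs, choose_gap is a made-up name, and the requested startTime is ignored, since the rule itself only compares lengths.

```python
def choose_gap(gaps, duration, min_gap):
    """Picks a gap for a booking of `duration`, or returns None.

    gaps: list of (start, length) pairs describing the free slots.
    """
    # Ideal case: a gap that the booking fills exactly.
    for gap in gaps:
        if gap[1] == duration:
            return gap
    # Otherwise keep only gaps whose leftover is larger than the minimum gap.
    candidates = [g for g in gaps if g[1] - duration > min_gap]
    if not candidates:
        return None
    # max() returns the first gap with the largest leftover.
    return max(candidates, key=lambda g: g[1] - duration)
```

For example, with gaps [(0, 4), (10, 6), (20, 9)], a duration of 3 and a minimum gap of 2, the leftovers are 1, 3 and 6, so the gap starting at 20 is chosen.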
Questions:
Are there any problems with my approach that I am missing?
Are there any algorithms that solve this particular problem?
Is there a usual approach (a good starting point) that I could start with and optimize later? (I am trying to get enough information to start without making a critical mistake; optimizations can and should come later.)
PS.
From my research so far, it sounds like this might be a case for constraint programming. I would like to avoid it if possible, as I know nothing about it (maybe it's simple, I just don't know), but if it makes a real difference, I will implement it for its benefits.
I searched Stack Overflow for similar problems but didn't find one with unknown future events. If there is one and this is a direct duplicate, please refer me to it.
(This is a theoretical question for a system design I am working on - advising changes is great)
I have a large table of GPS data which contains the following columns:
locID - PK
userID - User ID of the user of the app
lat
long
timestamp - point in UNIX time when this data was recorded
I am trying to design a way for a server to go through this dataset and check whether any "users" were in a specific "place" together (e.g. within 50m of each other) in a specific time range (2min), e.g. did user1 visit the same vicinity as user2 within that 2min time gap.
The only way I can currently think of is to check each row, one by one, against all the rows in the same timeframe using a coordinate distance check algorithm. But the problem is that if the users are all around the world and have thousands, maybe millions, of rows in that 5min timeframe, this would not work efficiently.
Also, what if I want to know how long they were in each other's vicinity?
Any ideas or thoughts would be helpful, including which database to use (I am thinking either PostgreSQL or maybe Cassandra) and the table layout. All help appreciated.
Divide the globe into patches, where each patch is small enough to contain only a few thousand people, say 200m by 200m, and add the patchID as an attribute to each entry in the database. Note that two users cannot be in close proximity if they aren't in the same patch or in adjacent patches. Therefore, when checking for two users in the same place at a given time, query the database for a given patchID and the eight surrounding patchIDs, to get a subset of the database that may contain possible collisions.
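A small sketch of the patch idea in Python. To keep it simple I assume a grid with a fixed cell size in degrees (CELL_DEG is a made-up constant, about 220m north-south); such cells get narrower east-west toward the poles, so this only works where a cell is still at least as wide as the proximity distance, and wrap-around at the 180° meridian is ignored.

```python
import math

CELL_DEG = 0.002  # hypothetical cell size in degrees (roughly 220 m north-south)


def patch_id(lat, lon):
    """Maps a coordinate to its (row, col) patch on a fixed-degree grid."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def neighbourhood(patch):
    """The patch itself plus its eight surrounding patches."""
    row, col = patch
    return [(row + dr, col + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def may_be_close(a, b):
    """Cheap pre-filter: True if two (lat, lon) points share or touch patches."""
    return patch_id(*b) in neighbourhood(patch_id(*a))
```

The patch id would be stored (and indexed) with each row, so only rows passing this pre-filter and falling in the same time window need the exact distance check.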
I have a LAMP-based business application, SugarCRM to be more precise. There are 120+ active users at the moment. Every day each user generates some records that are used in a complex calculation to get a so-called “individual rating”.
It takes about 6 seconds to calculate one “individual rating” value. That was not a big problem before: each user clicks the provided link to start the “individual rating” calculation, waits 6-7 seconds, and gets the value displayed.
But now I need to implement an “overall rating” calculation. That means that, in addition to the “individual rating”, I have to calculate and display to the user:
minimum individual rating among ALL the users of the application
maximum individual rating among ALL the users of the application
current user position in the range of all individual ratings.
Say the current user has an individual rating of 220 points, the minimum rating is 80, the maximum is 235, and he is in 23rd position among all the users.
What are (imho) the main problems to be solved?
If one calculation lasts 6 seconds, then the overall calculation will take more than 10 minutes (120+ users * 6 sec = 720+ sec). I think it’s no good to make the application almost inaccessible for that period. And what if the number of users grows 2-3 times in the near future?
Those calculations could be done as a nightly job, but the users are in different timezones. In Russia the difference between the extreme timezones is 9 hours, so people in the western part of Russia are still working in “today” while people in the eastern part are waking up to work in “tomorrow”. So what is the best time for a nightly job in this case?
Are there any best practices, approaches, or algorithms for building such a rating system?
Given only the information provided, the only options I see:
The obvious one - reduce the time taken for a rating calculation (6 seconds to calculate 1 user's rating seems like a lot)
If possible, keep intermediate values and recalculate only some of them, as required (for example, have 10 values that make up the rating, all based on different data; when some of the data changes, flag the affected values for recalculation). Do this recalculation either:
During your daily recalculation or
When the update happens
Partial batch calculation - only recalculate x of the users' ratings at chosen intervals (where x is some chosen value) - has the disadvantage that, at all times, some of the ratings can be out of date
Calculate if not busy - either continuously recalculate ratings or only do so at a chosen interval, but instead of locking the system, have it run as a background process, only doing work if the system is idle
(Sorry, I couldn't fit this into a comment, so I decided to post it as an answer.)
@Dukeling
The SQL query that takes almost all of the calculation time mentioned above is just a replication of business logic that should be executed in PHP code. The logic was moved into SQL in the hope of reducing the calculation time. OK, I’ll try both optimizing the SQL query and executing the logic in PHP code.
Suppose that after that the optimized application calculates an individual rating in just 1 second. Great! But even in this case the first user who logs into the system has to wait 120 seconds (120+ users * 1 sec = 120 sec) for the overall rating to be calculated and to get his position in it.
I’m thinking of implementing the following approach:
Let’s have 2 “overall ratings”: “today” and “yesterday”.
For display purposes we’ll use the “yesterday” overall rating, represented as a huge, already sorted PHP array.
When a user hits the calculation link, he starts the “today” calculation, but the application shows him the “yesterday” value. Thus we have a quickly accessible “yesterday” rating, and each user, at a random moment, launches the calculation of the rating that will be displayed tomorrow.
The user list is partitioned by timezone. Each hour a cron job starts and checks whether there are any users in the selected timezone who don’t have a “today” individual rating calculated (e.g. the user didn’t log into the application). If so, the application calculates the individual rating and puts its value into the “today” (still invisible) overall rating array. Thus we have a cron job that runs nightly for each timezone-specific user group and fills the probable gaps in case users didn’t log into the system.
After all users in all timezones have been processed, the application
sorts the “today” array,
drops the “yesterday” one,
renames “today” to “yesterday” and
initializes a new “today”.
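In code, the swap I have in mind would be roughly the following. The real thing would be PHP inside SugarCRM; this is just a language-neutral sketch in Python, and the names OverallRating, record_today, rollover and summary are made up.

```python
class OverallRating:
    """Two-buffer rating: users see "yesterday" while "today" is being filled."""

    def __init__(self):
        self.yesterday = []  # list of (rating, user_id), highest rating first
        self.today = {}      # user_id -> rating, not visible to users yet

    def record_today(self, user_id, rating):
        """Called when a user (or the hourly cron job) calculates a rating."""
        self.today[user_id] = rating

    def rollover(self):
        """Run once all users in all timezones have been processed."""
        self.yesterday = sorted(
            ((rating, user) for user, rating in self.today.items()),
            reverse=True,
        )
        self.today = {}

    def summary(self, user_id):
        """Min, max and position of the user in the visible rating."""
        position = next(
            i for i, (_, user) in enumerate(self.yesterday, start=1)
            if user == user_id
        )
        return {
            "min": self.yesterday[-1][0],
            "max": self.yesterday[0][0],
            "position": position,
        }
```

Displaying the overall rating then never triggers a calculation; it only reads the already sorted “yesterday” data.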
What do you think of it? Is it reasonable enough or not?
Data for various stocks is coming in continuously from various stock exchanges. Which data structure is suitable for storing this data?
Things to consider are:
a) effective retrieval and update of data is required as stock data changes per second or microsecond during trading time.
I thought of using a heap, as the number of stocks would be more or less constant and the most frequently used operations are retrieval and update, so a heap should perform well for this scenario.
b) need to show stocks which are currently trending (as in volume of shares being sold most active and least active, high profit and loss on a particular day)
I am not sure how to go about this.
c) as storing to a database using any programming language has some latency, considering the number of stocks that will be traded during a particular time, how can you store all the transactional data persistently?
PS: This is an interview question from Morgan Stanley.
A heap doesn't support efficient random access (i.e. look-up by index) nor getting the top k elements without removing elements (which is not desired).
My answer would be something like:
A database would be the preferred choice for this, as, with a proper table structure and indexing, all of the required operations can be done efficiently.
So I suppose this is more a theoretical question about understanding of data structures (related to in-memory storage, rather than persistent).
It seems multiple data structures are the way to go:
a) Effective retrieval and update of data is required as stock data changes per second or microsecond during trading time.
A map would make sense for this one. Hash-map or tree-map allows for fast look-up.
b) How to show stocks which are currently trending (as in volume of shares being sold most active and least active, high profit and loss on a particular day)?
Just about any sorted data structure seems to make sense here (with the above map having pointers to the correct node, or pointing to the same node). One for activity and one for profit.
I'd probably go with a sorted (double) linked-list. It takes minimal time to get the first or last n items. Since you have a pointer to the element through the map, updating takes as long as the map lookup plus the number of moves of that item required to get it sorted again (if any). If an item often moves many indices at once, a linked-list would not be a good option (in which case I'd probably go for a Binary Search Tree).
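A minimal sketch of the map-plus-sorted-structure combination, in Python. For brevity I use a plain sorted list with bisect rather than the linked list or BST discussed above, so inserting and deleting shift elements (O(n)); the class and method names are made up, and only the activity (volume) ordering is shown.

```python
import bisect


class StockBoard:
    """Map for fast lookup plus a sorted list for trending queries."""

    def __init__(self):
        self.volume = {}     # symbol -> latest traded volume
        self.by_volume = []  # (volume, symbol) pairs, sorted ascending

    def update(self, symbol, volume):
        old = self.volume.get(symbol)
        if old is not None:
            # The map tells us the old key, so the old entry is found by bisection.
            i = bisect.bisect_left(self.by_volume, (old, symbol))
            del self.by_volume[i]
        self.volume[symbol] = volume
        bisect.insort(self.by_volume, (volume, symbol))

    def most_active(self, n):
        return [symbol for _, symbol in self.by_volume[::-1][:n]]

    def least_active(self, n):
        return [symbol for _, symbol in self.by_volume[:n]]
```

A second sorted structure of the same kind would be kept for profit and loss.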
c) How can you store all the transactional data persistently?
I understand this question as - if the connection to the database is lost or the database goes down at any point, how do you ensure there is no data corruption? If this is not it, I would've asked for a rephrase.
Just about any database course should cover this.
As far as I remember - it has to do with creating another record, updating this record, and only setting the real pointer to this record once it has been fully updated. Before this you might also have to set a pointer to the old record so you can check if it's been deleted if something happens after setting the pointer away, but before deletion.
Another option is having an active transaction table, which you add to when starting a transaction and remove from when the transaction completes (it also stores all the details required to roll back or resume the transaction). Then, once everything is okay again, you check this table and roll back or resume any transactions that have not yet completed.
If I had to choose, I would go for a hash table.
Reason: O(1) average-case complexity for lookup and update. Whether it is synchronized and thread-safe depends on the implementation (Java's Hashtable is; HashMap is not).
Provided:
1. A good hash function to avoid collisions.
2. A high-performance cache.
While this is a language agnostic question, a few of the requirements jumped out at me. For example:
effective retrieval and update of data is required as stock data changes per second or microsecond during trading time.
The Java class HashMap uses the hash code of a key to rapidly access values in its collection. Its get and put operations run in O(1) time on average, which is ideal.
need to show stocks which are currently trending (as in volume of shares being sold most active and least active, high profit and loss on a particular day)
This is an implementation based issue. Your best bet is to implement a fast sorting algorithm, like QuickSort or Mergesort.
as storing to database using any programming language has some latency considering the amount of stocks that will be traded during a particular time, how can u store all the transactional data persistently??
A database would have been my first choice, but it depends on your resources.