I came across this question during an interview.
Let's say we have log entries for users visiting websites; each entry contains the website, the user, and the time. We need to design a data structure from which we can get:
The top five visitors of a specific website
The top five websites visited by a specific user
Websites that were visited exactly 100 times in a day
All in real time
My first thought was that we could just use a database to store the log, and each time we need the information we'd do the counting and sorting per user or per website. But that isn't real time, since a lot of computation is needed to get the information.
Then I thought we could use a HashMap for each problem. For example, for the first problem we could use HashMap<Website, TreeMap<User, Count>> so that we can get the top five visitors of a specific website. But the interviewer said we can only use one data structure for all three problems, whereas the second problem would use HashMap<User, TreeMap<Website, Count>>, which has different key and value types.
Can anyone think of a good solution for this problem?
A map of maps, with generic types, as a basic approach.
The first map represents the global data structure, which will contain the maps for the three problems.
In the first inner map, you'll have the website as the key and a list of its top 5 users as the value.
In the second inner map, you'll have the user as the key and a list of the top 5 websites they visited as the value.
For the last problem, the third inner map can have the website as the key and the number of visits as the value.
If what they meant was to use the same data structure for three different problems, then forget about the global map.
If you want to go a little deeper, you can consider an adjacency-matrix implementation where your user and website identifiers are the row/column indices and the values are the visit counts.
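For illustration, here is a minimal Java sketch of the map-of-maps idea. Because the inner maps have different key/value types, this sketch keeps them as three separate fields on one object instead of a single typed outer map, stores per-pair counts, and computes the top five on demand; all names (VisitStats, recordVisit, and so on) are made up for the example.

```java
import java.util.*;
import java.util.stream.Collectors;

public class VisitStats {
    // site -> (user -> visit count): answers "top 5 users of a site"
    private final Map<String, Map<String, Integer>> usersBySite = new HashMap<>();
    // user -> (site -> visit count): answers "top 5 sites of a user"
    private final Map<String, Map<String, Integer>> sitesByUser = new HashMap<>();
    // site -> total visits today: answers "sites visited exactly 100 times"
    private final Map<String, Integer> dailyVisits = new HashMap<>();

    // Update all three views for every log entry, in real time.
    public void recordVisit(String site, String user) {
        usersBySite.computeIfAbsent(site, s -> new HashMap<>()).merge(user, 1, Integer::sum);
        sitesByUser.computeIfAbsent(user, u -> new HashMap<>()).merge(site, 1, Integer::sum);
        dailyVisits.merge(site, 1, Integer::sum);
    }

    // Top five keys of a count map, by descending count.
    private static List<String> topFive(Map<String, Integer> counts) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(5)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public List<String> topUsersOfSite(String site) {
        return topFive(usersBySite.getOrDefault(site, Collections.emptyMap()));
    }

    public List<String> topSitesOfUser(String user) {
        return topFive(sitesByUser.getOrDefault(user, Collections.emptyMap()));
    }

    public List<String> sitesVisitedExactly(int times) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : dailyVisits.entrySet()) {
            if (e.getValue() == times) result.add(e.getKey());
        }
        return result;
    }
}
```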
I want to implement a search feature in my app that re-filters upon each new character entered into the search bar, so users can search for other users. This is a fairly common feature in apps, but as a beginner it seems like a very computationally expensive process. It seems that one of two things must happen:
For each new character typed, the frontend queries the backend, which applies filter and returns.
The frontend loads all (or many) possible results beforehand and updates filter on the stored info as new characters are entered.
It would seem that 1) has time-complexity issues, as it makes O(n) queries (where n is the number of characters) per search. This is especially problematic because the filtered search results are expected to update near instantaneously. Additionally, my average query time is probably slower than most, as I'm using a three-tier architecture (frontend <-> server <-> graph database).
I don't like 2)--at least in its straightforward form--as the number of possible results can get very large. We can reduce the space complexity of this by querying only for a limited set of user attributes (perhaps only uid and name, and fetching details on the fly if needed), but the point remains.
Things get more interesting if we modify 2) to load only a sample of users (and here we can use data like Location as well as ML/AI to select). The problem with this is that the searching user could always be looking for someone we didn't select. It would be a horrible (even if rare) experience for a user to know their friend was on the app but was unable to find them because our algorithm was only accurate for 99% of searches.
I am sure this is possible--other apps seem to pull it off--so what am I missing?
First, you should avoid querying the server for each character typed. Most of the time the user types a bunch of characters very fast without looking at the suggested results, especially because with only a few characters the results wouldn't be specific enough. Autocompletion systems typically adopt both of the following rules (a small sketch follows the list):
query only if the string is at least 2-3 characters long;
query only once the user has stopped typing, i.e. about 300 ms after the last keystroke.
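The two rules above can be combined into a small debounce helper. Here is a minimal Java sketch; the names (SearchDebouncer, onKeystroke) and the 3-character / 300 ms thresholds are just illustrative choices, not something the answer prescribes.

```java
import java.util.concurrent.*;

// Ignore very short strings and only fire the query ~300 ms after the last keystroke.
public class SearchDebouncer {
    private static final int MIN_CHARS = 3;
    private static final long QUIET_MS = 300;

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> pending;

    // Called on every keystroke with the current contents of the search box.
    public synchronized void onKeystroke(String text, Runnable query) {
        if (pending != null) {
            pending.cancel(false);   // the user is still typing; drop the previous query
        }
        if (text.length() < MIN_CHARS) {
            return;                  // too short to give specific enough results
        }
        pending = scheduler.schedule(query, QUIET_MS, TimeUnit.MILLISECONDS);
    }
}
```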
To get all the pertinent results without a huge data transfer, you could implement progressive data loading. Just load enough results to fill the page height, then load more as the user scrolls down. However, if you reach a very high number of results, you should stop retrieving them and ask the user to type a more specific search.
If you want to make your users happy, try to sort the results by relevance. For example, if you know where the users are located you may sort the results by distance, because if I live in Italy and search for "Ste", it is more likely that my friend is Stefano, who lives in Rome, than Steve, who lives in NY.
Imagine: someone has a huge website selling, let's say, T-shirts.
we want to show paginated, sorted listings of offers, with options to filter by parameters, let's say T-shirt colour.
offers should be sortable by any of 5 properties (creation date, price, etc.)
Important requirement 1: we have to give a user an ability to browse all the 15 million offers, and not just the "top-N".
Important requirement 2: they must be able to jump to a random page at any time, not just flick through them sequentially
we use some sort of traditional data storage (MongoDB, to be precise).
The problem is that MongoDB (as well as other traditional databases) performs poorly when it comes to big offsets. Imagine if a user wants to fetch a page of results somewhere in the middle of this huge list sorted by creation date with some additional filters (for instance - by colour)
There is an article describing this kind of problem:
http://openmymind.net/Paging-And-Ranking-With-Large-Offsets-MongoDB-vs-Redis-vs-Postgresql/
Okay, so we are told that Redis is a solution for this kind of problem. You "just" need to prepare certain data structures and search them instead of your primary storage.
the question is:
What kinds of structures and approaches would you suggest using to solve this with Redis?
Sorted Sets, paging through with ZRANGE.
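To make that concrete, here is a minimal sketch using the Jedis client, assuming one sorted set per sort-order/filter combination (the key name offers:tshirts:red:by_date is made up) with the sort property, e.g. the creation timestamp, as the score:

```java
import redis.clients.jedis.Jedis;

public class OfferPaging {
    private static final int PAGE_SIZE = 50;

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Index an offer: score = creation timestamp, member = offer id.
            jedis.zadd("offers:tshirts:red:by_date", 1700000000d, "offer:42");

            // Page N (0-based): ZRANGE by rank is O(log n + page size),
            // no matter how deep the offset is.
            int page = 1234;
            long start = (long) page * PAGE_SIZE;
            long stop = start + PAGE_SIZE - 1;
            for (String offerId : jedis.zrange("offers:tshirts:red:by_date", start, stop)) {
                System.out.println(offerId);   // then fetch the full documents from MongoDB by id
            }
        }
    }
}
```

The trade-off is that you maintain one sorted set per sort order (and, if needed, per filter value), keeping only ids in Redis and the full offer documents in the primary store.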
Let's say I have a service operation like this:
api/places/?category=entertainment&geo=123,456&area=30&orderBy=distance
So the user is searching for places of entertainment near the geo location (123,456), within a 30 km boundary, and wants the results sorted by distance.
Suppose the search should be paged: say 500 items satisfy the query but the page size is 50, so there will be 10 pages.
Each item in the database stores only the geo location of the place, so I have to fetch all 500 items from the DB first, calculate the distance of each item, cut the array down to the requested page and then return it.
So every time the user requests the next page, I have to query all 500 items and do the same thing again.
Is this the right way to implement a service like that, or is there a better strategy?
It seems to be worse when my database doesn't have the geo location, because I am using a different API provider to give me the geo of a place. That means I have to query everything, hit another service to get the geo, calculate, and only then am I able to sort... :(
thank you very much!
If you were developing a single-page app for the search results, you wouldn't need to send another request to the server every time the user presses Next. You could use a pagination library that takes the full set of results and sorts them into pages accordingly.
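To illustrate the "sort the full result set once, then page through it locally" idea, here is a rough Java sketch; the Place record, its fields, and the haversine helper are assumptions for the example, not part of the question's API:

```java
import java.util.*;

public class NearbyPlaces {
    record Place(String name, double lat, double lon) {}

    // Great-circle distance in km (haversine formula).
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double r = 6371.0;
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * r * Math.asin(Math.sqrt(a));
    }

    // Sort all matching places by distance once, then slice out one page.
    static List<Place> page(List<Place> places, double userLat, double userLon,
                            int pageIndex, int pageSize) {
        List<Place> sorted = new ArrayList<>(places);
        sorted.sort(Comparator.comparingDouble(
                (Place p) -> distanceKm(userLat, userLon, p.lat(), p.lon())));
        int from = Math.min(pageIndex * pageSize, sorted.size());
        int to = Math.min(from + pageSize, sorted.size());
        return sorted.subList(from, to);
    }
}
```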
In this situation, there's a trade-off between the size of the data you want to store and the speed and efficiency of your web application, and in this sort of situation you really are dealing with large data sets. You should ideally have additional databases for each general geographic region (such as Northeast, Southeast) that store the distance between each store and each location the user can enter. You should use a separate server for this, and aggregate the data at intervals (say, every six hours) using an automated database operation, such as running MongoDB scripts.
You should also consider using weighted graphs to store the distances between locations. You could then traverse them more easily with a graph algorithm.
I'm writing a simple website. I want to be able to group users by the routes they take on my site. For example, I have this site tree, but the final product will be more complicated. Let's say I have three users.
User one route is A->B->C->D
User two route is A->B->C->E
User three route is J->K
We can say that users one and two belong to the same group and user three belongs to some other group.
My question is: what algorithm (or algorithms) should I use to accomplish that? Also, what data do I need to collect?
I have some ideas; however, I want to compare them with those of someone who might have more experience than me.
I'm looking for suggestions rather than an exact solution to my problem. Also, if there are any ready-made solutions I can read about, I would appreciate that too.
You could also consider common subpaths. A pointer in this direction, and to web usage analysis in general, is: http://pdf.aminer.org/000/473/298/improving_the_effectiveness_of_a_web_site_with_web_usage.pdf
As a first cut, it seems reasonable to divide the problem into (1) defining a similarity score for two traces and (2) using the similarity scores to cluster. One possibility for (1) is Levenshtein distance. Two possibilities for (2) are farthest-point clustering and k-modes.
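As a sketch of option (1), here is Levenshtein distance computed in Java over routes treated as sequences of page names rather than strings of characters (the clustering step from (2) is left out):

```java
import java.util.List;

public class RouteDistance {
    // Edit distance between two routes: minimum number of page insertions,
    // deletions, or substitutions needed to turn one route into the other.
    static int levenshtein(List<String> a, List<String> b) {
        int[][] d = new int[a.size() + 1][b.size() + 1];
        for (int i = 0; i <= a.size(); i++) d[i][0] = i;
        for (int j = 0; j <= b.size(); j++) d[0][j] = j;
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,     // deletion
                                            d[i][j - 1] + 1),    // insertion
                                   d[i - 1][j - 1] + cost);      // substitution
            }
        }
        return d[a.size()][b.size()];
    }

    public static void main(String[] args) {
        // Routes from the question: users one and two differ in a single step,
        // user three is far from both.
        System.out.println(levenshtein(List.of("A", "B", "C", "D"), List.of("A", "B", "C", "E"))); // 1
        System.out.println(levenshtein(List.of("A", "B", "C", "D"), List.of("J", "K")));           // 4
    }
}
```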
While I was going through Stack Overflow, I observed that whatever questions I had visited earlier were marked in a different color. Then I started thinking about how Stack Overflow detects this.
Can somebody tell me what algorithm they use? Not only Stack Overflow; maybe other sites use it too.
Maybe they are storing the question numbers in my cookie, and after parsing the cookie data they can tell which questions I have visited. But if I have visited many questions, is this approach feasible?
Update
As everybody has mentioned, this is a browser feature, so the question is: how do browsers remember so many links? What algorithm or data structure do they use to store them?
Actually, it's your user agent (e.g. your browser) that is remembering visited links. A site can then use CSS to style them to its liking.
User agents commonly display unvisited links differently from previously visited ones. CSS provides the pseudo-classes ':link' and ':visited' to distinguish them.
As for your updated question: a glance at the Chrome source code brings up some kind of hash table as the data structure.
Also, if your user agent is only interested in whether a link was visited or not, it just needs to compute a fingerprint (e.g. CityHash) of each URL and compare the cached fingerprints with the fingerprints of the links found on a page.
Even if you visited one new URL every 10 seconds for a whole month (roughly 260,000 URLs), and assuming each fingerprint takes up 40 bytes, you would consume only about 10 megabytes of memory.
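As a rough sketch of the fingerprint idea: the JDK has no built-in CityHash, so a simple FNV-1a 64-bit hash stands in for it here, and the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

public class VisitedLinks {
    private final Set<Long> fingerprints = new HashSet<>();

    // 64-bit FNV-1a hash of the URL (stand-in for CityHash).
    private static long fingerprint(String url) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xffL);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    public void markVisited(String url) {
        fingerprints.add(fingerprint(url));
    }

    // Used to decide whether a link on the current page gets :visited styling.
    public boolean isVisited(String url) {
        return fingerprints.contains(fingerprint(url));
    }
}
```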