While browsing Stack Overflow, I noticed that the questions I had visited earlier are marked in a different color. That made me wonder how Stack Overflow detects this.
Can somebody tell me what algorithm they use? This isn't only about Stack Overflow; other sites may do it too.
Maybe they store the question numbers in a cookie and, after parsing the cookie data, can tell which questions I have visited. But if I have visited many questions, is that approach feasible?
Update
As everybody has mentioned, this is a browser feature. So the question becomes: how does the browser remember so many links? What algorithm or data structure does it use to store them?
Actually, it's your user agent (i.e. browser) that remembers visited links. A site can then use CSS to style them to its liking.
User agents commonly display unvisited links differently from previously visited ones. CSS provides the pseudo-classes ':link' and ':visited' to distinguish them.
As for your updated question: a glance at the Chrome source code suggests some kind of hash table as the data structure.
Also, if your user agent only needs to know whether a link was visited or not, it only needs to compute a fingerprint (e.g. CityHash) of each URL and compare the cached fingerprints with the fingerprints of the links found on a page.
Even if you visited one new URL every 10 seconds for a whole month, and assuming each fingerprint takes up 40 bytes, you would consume only about 10 megabytes of memory.
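A minimal sketch of that fingerprint idea, assuming a plain 64-bit hash of the URL string stands in for whatever scheme a real browser actually uses:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of a visited-link store: keep only a 64-bit fingerprint per URL.
// The hash function below (FNV-1a) is a stand-in, not what any browser really uses.
public class VisitedLinks {
    private final Set<Long> fingerprints = new HashSet<>();

    private long fingerprint(String url) {
        long hash = 0xcbf29ce484222325L;          // FNV-1a 64-bit offset basis
        for (int i = 0; i < url.length(); i++) {
            hash ^= url.charAt(i);
            hash *= 0x100000001b3L;               // FNV-1a 64-bit prime
        }
        return hash;
    }

    public void markVisited(String url) {
        fingerprints.add(fingerprint(url));
    }

    public boolean isVisited(String url) {
        return fingerprints.contains(fingerprint(url));
    }
}
```

When rendering a page, the user agent only has to look up each link's fingerprint in this set to decide whether to apply the ':visited' style.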
Related
I came across this question during an interview.
Let's say we have log information for users visiting websites, where each entry includes the website, the user, and the time. We need to design a data structure from which we can get:
The top five visitors of a specific website
The top five websites visited by a specific user
Websites that are visited only 100 times in a day
All in real time.
The first thought that came to mind is that we could just use a database to store the log and, each time, count and sort per user or per website. But that isn't real time, as we would need a lot of computation to get the information.
Then I thought we could use a HashMap for each problem. For example, for each website we could use HashMap<Website, TreeMap<User, Integer>>, so that we can get the top five visitors of a specific website. But the interviewer said we can only use one data structure for all three problems, whereas the second problem would need HashMap<User, TreeMap<Website, Integer>>, which has different key and value types.
Can anyone think of a good solution for this problem?
A map of maps, with generic types, as a basic approach.
The first map represents the global data structure, which will contain the maps for the three problems.
In the first inner map, the website is the key and a list of its top 5 users is the value.
In the second inner map, the user is the key and a list of the top 5 websites they visited is the value.
For the last problem, the third inner map has the website as the key and the number of visits as the value.
If what they meant was to have the same data structure for the three different problems, then forget about the global map.
If you want to go a little deeper, you can consider an adjacency-matrix implementation where user and website identifiers are your row/column indices and the values are the numbers of visits.
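A minimal sketch of that nested-map idea, with hypothetical String keys for users and websites; problem 2 (top websites per user) would either scan this structure or keep a mirrored map keyed by user, which is exactly the duplication the interviewer pushed back on:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: one nested-map structure answering the website-centric questions.
// visits.get(website).get(user) is how many times that user hit that website.
public class VisitLog {
    private final Map<String, Map<String, Integer>> visits = new HashMap<>();

    public void record(String website, String user) {
        visits.computeIfAbsent(website, w -> new HashMap<>())
              .merge(user, 1, Integer::sum);
    }

    // Problem 1: top five visitors of a given website.
    public List<String> topUsersOf(String website) {
        return visits.getOrDefault(website, Map.of()).entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
                .limit(5)
                .map(Map.Entry::getKey)
                .toList();
    }

    // Problem 3: websites visited exactly n times in total.
    public List<String> websitesWithVisitCount(int n) {
        return visits.entrySet().stream()
                .filter(e -> e.getValue().values().stream().mapToInt(Integer::intValue).sum() == n)
                .map(Map.Entry::getKey)
                .toList();
    }
}
```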
I was recently asked this in an interview (Software Engineer) and didn't really know how to go about answering the question.
The question was focused both on the algorithm behind the spreadsheet and on how it would interact with the browser. I was a bit confused about what data structure would be optimal for handling the cells and their values. I guess any form of hash table would work, with the cell as the unique key and the value being the object in the cell? Then, when something gets updated, you'd just update that entry in your table. The interviewer hinted at a graph, but I was unsure how a graph would be useful for a spreadsheet.
Other things I considered were:
Spreadsheet in a browser = auto-save. At any update, send all the data back to the server
Cells that are related to each other, e.g. C1 = C2+C3, C5 = C1-C4. If the value of C2 changes, both C1 and C5 change.
Usage of design patterns? Does one stand out over another for this particular situation?
Any tips on how to tackle this problem? Aside from the algorithm of the spreadsheet itself, what else could the interviewer have wanted? Does the fact that it's in a browser, as compared to a separate application, add any difficulties?
Thanks!
For an interview this is a good question. If this were asked as an actual task at your job, there would be a simple answer: use a third-party component; there are a few good commercial ones.
While we can't say for sure what your interviewer wanted, for me this is a good question precisely because it is so open ended and has so many correct possible answers.
You can talk about the UI and how to implement the kind of dynamic grid you need for a spreadsheet and all the functionality of the cells and rows and columns and selection of cells and ranges and editing of values and formulas. You probably could talk for a while on the UI implications alone.
Alternatively, you can go the data route: talk about data structures to hold a spreadsheet, about the links between cells created by formulas, about how to detect and deal with circular references, and about how in a browser you have less control over memory, so very large spreadsheets could run into problems earlier. You can talk about what is available in JavaScript vs. a native language and how this impacts the data structures and calculations.
Also on the data side, a big issue with spreadsheets is numerical accuracy and floating-point calculations. Floating-point numbers are made to be fast but are not necessarily accurate at extreme levels of precision, and this leads to a lot of confusing questions. I believe Excel fairly recently switched to its own fixed-decimal representation, as it's now viable to do spreadsheet-level calculations without using the built-in floating-point arithmetic.
You can also talk about how data structures and calculation affect performance. In a browser you don't have threads (yet), so you can't run all the calculations in the background. If you have 100,000 rows with complex calculations and change one value that cascades across everything, you can get a slow-script warning; you need to break up the calculation.
Finally, you can come at it from the user experience angle. How is the experience in a browser different from a native application? What are the advantages, and what cool things can you do in a browser that may be difficult in a desktop application? What things are far more complicated or even totally impossible (for example, associating your spreadsheet app with a file type so a user can double-click a file and open it in your online spreadsheet app, although I may be wrong about that still being unsupported)?
Good question, lots of right answers, very open ended.
On the other hand, you could also have had a bad interviewer that is specifically looking for the answer they want and in that case you're pretty much out of luck unless you're telepathic.
There is hopelessly much you could say about this. I'd probably start with:
If most of the cells are filled, use a simple 2D array to store it.
Otherwise, use a hash table mapping location to cell.
Or perhaps something like a kd-tree, which should allow for more efficient "get everything in the displayed area" queries.
By graph, your interviewer probably meant having each cell be a vertex and each reference to another cell a directed edge. This would let you check for circular references fairly easily and efficiently update all cells that need to change.
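A minimal sketch combining both suggestions (a hash map from cell location to contents plus a directed dependency graph), with hypothetical names; a real engine would also parse formulas and order recalculation topologically:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: sparse cell storage plus a dependency graph of cell references.
public class SheetGraph {
    // Sparse storage: only cells that actually hold something are present.
    private final Map<String, String> formulas = new HashMap<>();        // "C1" -> "=C2+C3"
    // Directed edges: key depends on each cell in the value set.
    private final Map<String, Set<String>> dependsOn = new HashMap<>();

    public void setFormula(String cell, String formula, Set<String> referencedCells) {
        formulas.put(cell, formula);
        dependsOn.put(cell, new HashSet<>(referencedCells));
    }

    // Detect a circular reference starting from 'cell' with a plain DFS.
    public boolean hasCycle(String cell) {
        return dfs(cell, new HashSet<>(), new HashSet<>());
    }

    private boolean dfs(String cell, Set<String> onPath, Set<String> done) {
        if (onPath.contains(cell)) return true;          // back edge -> cycle
        if (done.contains(cell)) return false;
        onPath.add(cell);
        for (String dep : dependsOn.getOrDefault(cell, Set.of())) {
            if (dfs(dep, onPath, done)) return true;
        }
        onPath.remove(cell);
        done.add(cell);
        return false;
    }

    // Cells that must be recalculated when 'changed' changes:
    // every cell whose dependency chain reaches 'changed'.
    public List<String> cellsToRecalculate(String changed) {
        List<String> result = new ArrayList<>();
        for (String cell : dependsOn.keySet()) {
            if (reaches(cell, changed, new HashSet<>())) result.add(cell);
        }
        return result;
    }

    private boolean reaches(String from, String target, Set<String> seen) {
        if (!seen.add(from)) return false;
        for (String dep : dependsOn.getOrDefault(from, Set.of())) {
            if (dep.equals(target) || reaches(dep, target, seen)) return true;
        }
        return false;
    }
}
```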
"In a browser" (presumably meaning "over a network" - actually "in a browser" doesn't mean all that much by itself - one can write a program that runs in a browser but only runs locally) is significant - you probably need to consider:
What are you storing locally (everything or just the subset of cells that are currently visible)
How are you sending updates to the server (are you sending every change or keeping a collection of changed cells and only sending updates on save, or are you not storing changes separately and just sending the whole grid across during save)
Auto-save should probably be considered as well
Will you have an "undo"? Will it only be local? If not, how will you handle it on the server, and how will you send the updates through?
Is only this one user allowed to work with it at a time (or do you have to cater for multi-user, which brings dealing with conflicts, among other things, to the table)
Looking at the CSS cursor property just begs for one to create a spreadsheet web application.
HTML table or CSS grid? HTML tables are purpose-built for tabular data.
Resizing cell height and width is achievable with offsetX and offsetY.
Storing the data is trivial. It can be Mongo, MySQL, Firebase, ...whatever. On blur, send the update.
JavaScript/ECMAScript is more than capable of delivering all the Excel built-in functions. Did I mention web workers?
Need to increment letters, as in column IDs? I've got you covered (see the sketch after this list).
Most importantly, don't do it. Why? Because it's already been done.
Find a need and work on that project.
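As promised above, a minimal sketch of that column-ID arithmetic, assuming Excel-style letters (A ... Z, AA, AB, ...):

```java
// Sketch: convert a 1-based column index to an Excel-style column ID and back.
public class ColumnIds {
    // 1 -> "A", 26 -> "Z", 27 -> "AA", 28 -> "AB", ...
    public static String toLetters(int index) {
        StringBuilder sb = new StringBuilder();
        while (index > 0) {
            index--;                                   // shift to 0-based for this digit
            sb.insert(0, (char) ('A' + index % 26));
            index /= 26;
        }
        return sb.toString();
    }

    // "A" -> 1, "AA" -> 27
    public static int toIndex(String letters) {
        int index = 0;
        for (char c : letters.toCharArray()) {
            index = index * 26 + (c - 'A' + 1);
        }
        return index;
    }

    // "C" -> "D": incrementing is just index arithmetic.
    public static String next(String letters) {
        return toLetters(toIndex(letters) + 1);
    }
}
```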
I have a website that gives users different outcomes depending on a virtual dice roll. I want them to trust that my random numbers are honest, so instead of me determining it in my own code (which to my skeptical users is a black box I can manipulate), I want to come up with some other mechanism.
One idea is to point to some credible website (e.g. governmental) that has a publicly observable random number that changes over time. Then I could say, "We will base your outcome on a number between 0 and 9, which will be the number at [url] in 10 seconds."
Any suggestions?
I'd go with the ANU QRNG site myself. It has public, anonymous URLs for several kinds of numbers, and real-time pages to observe them:
Hex numbers
Number: https://qrng.anu.edu.au/ran_hex.php
Stream: https://qrng.anu.edu.au/RainHex.php
Binary numbers
Number: https://qrng.anu.edu.au/ran_bin.php
Stream: https://qrng.anu.edu.au/RainBin.php
It also includes references to the scientific explanation of the source of randomness, and practical demonstrations of it, even one specifically for dice.
From your code you can just retrieve the number URL mentioned above.
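A minimal sketch of fetching that "Number" URL and reducing it to a digit between 0 and 9; the exact response format is an assumption on my part, so treat the parsing as illustrative:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: fetch a hex string from the QRNG endpoint and map it to a 0-9 roll.
public class RemoteDice {
    public static int roll() throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://qrng.anu.edu.au/ran_hex.php")).build();
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // Assumption: the body contains hex digits; keep only those.
        String hex = body.replaceAll("[^0-9a-fA-F]", "");
        if (hex.isEmpty()) throw new IllegalStateException("no hex digits in response");

        // Reduce the first few hex digits to a number between 0 and 9.
        long value = Long.parseLong(hex.substring(0, Math.min(8, hex.length())), 16);
        return (int) (value % 10);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Roll: " + roll());
    }
}
```

Taking the value mod 10 introduces a small bias; for real use you'd want to reject values above the largest multiple of 10.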
Alternative if verifiability is important
A completely alternative approach: when the deadline falls, retrieve the homepage of an outside-controlled, high-traffic interactive site, such as the questions page of Stack Overflow. Store the page, take its MD5 or SHA1 hash, and derive your roll from that.
You can then:
Show the page as it was at the snapshot time to verify it's working HTML
Show its HTML source, which is full of timestamps, to verify authenticity and the time of retrieval to nearly the second
Let people verify the hash for themselves based on that
Argue the value is effectively random, because it is practically impossible to predict what would need to change on a site like SO to produce a given hash value
Any attempt to tamper with this system, such as Jeff deliberately re-serving an old page because he knows the MD5 hash it produces, is easily debunked by the real-time nature of the site: everyone could see that the questions aren't recent relative to the snapshot time.
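A minimal sketch of the snapshot-and-hash approach, assuming SHA-256 over the raw HTML (any published hash would do) and that you archive both the page and the digest for later verification:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch: derive a 0-9 outcome from a hash of a public, fast-changing page.
public class SnapshotDice {
    public static int rollFrom(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String html = client.send(
                HttpRequest.newBuilder(URI.create(url)).build(),
                HttpResponse.BodyHandlers.ofString()).body();

        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(html.getBytes(StandardCharsets.UTF_8));

        // Publish 'html' and the hex digest so users can re-verify the roll.
        int firstByte = digest[0] & 0xFF;
        return firstByte % 10;   // small modulo bias; acceptable for illustration
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rollFrom("https://stackoverflow.com/questions"));
    }
}
```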
I'm looking for algorithms/techniques that can measure the importance of a single webpage. Leaving PageRank aside, are there other methods of producing such a rating based on content, structure, and the hyperlinks between pages?
I'm not only talking about the connection from www.foo.com to www.bar.com as PageRank does, but also from www.foo.com/bar to www.foo.com/baz and so on (besides adapting PageRank for these needs).
How do I "define" importance: I think of importance in this context as "how relevant this page is to the user, as well as how important it is to the rest of the site".
E.g. a Christmas raffle announced on the start page, with only a single link leading to it, is more important both to the user and to the site. An imprint page, which has a link from every page (since it usually sits somewhere in the footer), is not important even though many links point to it. The imprint is also not important to the site as a unit, since it doesn't add any real value to the site's purpose (giving information, selling products, a general service, etc.).
There is also SALSA, which is more stable than HITS [so it suffers less from spam].
Since you are also interested in the context of pages, you might want to have a look at Haveliwala's work on topic-sensitive PageRank.
Another famous algorithm is Hubs and Authorities (HITS). Basically you score each page both as a hub (a page with a lot of outbound links) and as an authority (a page with a lot of inbound links).
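A minimal sketch of the HITS iteration on a small link graph, just to make the hub/authority scoring concrete (the adjacency representation and iteration count are arbitrary choices):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a few HITS iterations. links.get(p) is the list of pages p links to.
public class Hits {
    public static void run(Map<String, List<String>> links, int iterations) {
        Map<String, Double> hub = new HashMap<>();
        Map<String, Double> auth = new HashMap<>();
        links.keySet().forEach(p -> { hub.put(p, 1.0); auth.put(p, 1.0); });

        for (int i = 0; i < iterations; i++) {
            // Authority score: sum of hub scores of the pages linking to you.
            Map<String, Double> newAuth = new HashMap<>();
            links.forEach((p, outs) -> outs.forEach(q ->
                    newAuth.merge(q, hub.get(p), Double::sum)));
            // Hub score: sum of authority scores of the pages you link to.
            Map<String, Double> newHub = new HashMap<>();
            links.forEach((p, outs) -> newHub.put(p,
                    outs.stream().mapToDouble(q -> newAuth.getOrDefault(q, 0.0)).sum()));
            normalize(newAuth);
            normalize(newHub);
            auth.clear(); auth.putAll(newAuth);
            hub.clear(); hub.putAll(newHub);
        }
        System.out.println("authorities: " + auth + "\nhubs: " + hub);
    }

    private static void normalize(Map<String, Double> scores) {
        double sum = scores.values().stream().mapToDouble(Double::doubleValue).sum();
        if (sum > 0) scores.replaceAll((k, v) -> v / sum);
    }
}
```

Running this on a tiny graph where A links to B and C and B links to C ends up with C as the strongest authority and A as the strongest hub.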
But you should really define what you mean by importance. What does "important" actually mean? PageRank defines it with respect to inbound links; that is PageRank's definition.
If you define importance as having photos, because you like photography, then you could come up with an importance metric like the number of photos on the page. Another metric could be the number of inbound links from photography sites (like flickr.com, 500px, ...).
Using your definition of importance, you could use 1 - (the number of inbound links divided by the number of pages on the site). This gives you a number between 0 and 1, where 0 means not important and 1 means important.
Using this metric, your imprint, which is linked from every page of the site, has an importance of 0. Your Christmas sale page, which has only one link to it, has an importance of almost 1.
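A minimal sketch of that metric with made-up page counts:

```java
// Sketch: importance = 1 - (inbound links / pages on the site).
public class LinkImportance {
    static double importance(int inboundLinks, int pagesOnSite) {
        return 1.0 - (double) inboundLinks / pagesOnSite;
    }

    public static void main(String[] args) {
        int pages = 200;                                   // hypothetical site size
        System.out.println(importance(pages, pages));      // imprint: linked everywhere -> 0.0
        System.out.println(importance(1, pages));          // raffle page: one link -> 0.995
    }
}
```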
I devised an idea a long time ago and never got around to implementing it, and I would like to know whether it is practical, in the sense that it would significantly decrease loading times in modern browsers. It relies on the fact that related tasks are often done more quickly when they are done together in bulk, and that the browser could be downloading content for other pages using a statistical model instead of sitting idle while the user is browsing. I've pasted below an excerpt from what I originally wrote, which describes the idea.
Description.
When people visit websites, I conjecture that a probability density function P(q, t), where q is a real-valued integer representing the ID of a website and t is another real-valued, non-negative integer representing the time of day, can predict the sequence of webpages visited by the typical human accurately enough to warrant requesting and loading the HTML documents the user is going to read in advance.
For a given website, have the document which appears to be the "main page" of the website, through which users access the other sections, be represented by the root of a tree structure. The probability that the user will visit the root node of the tree can be represented in two ways. If the user wishes to allow a process to execute automatically upon the initialization of the operating system to pre-fetch webpages from websites (using a process elaborated later) which the user frequently accesses upon opening the web browser, the probability function which determines whether a given website will have its webpages pre-fetched can be determined using a self-adapting heuristic model based on the user's history (or by manual input). Otherwise, if no such process is desired by the user, the value of P for the root node is irrelevant, since the pre-fetching process is only used after the user visits the main page of the website.
Children in the tree described earlier are each associated with an individual probability function P(q, t) (this function can be a lookup table which stores time-webpage pairs). Thus, the sequences of webpages the user visits over time are logged using this tree structure. For instance, at 7:00 AM, there may be a 71/80 chance that I visit the "WTF" section on Reddit after loading the main page of that site.
Based on the values of the probability function P for each node in the tree, chains of webpages extending a certain depth from the root node, where the net probability that each sequence is followed, P_c, is past a certain threshold, P_min, are requested upon the user visiting the main page of the site. If the downloading of one webpage finishes before another is processed, a thread pool is used so that another core is assigned the task of parsing the next webpage in the queue of webpages to be parsed. Hopefully, in this manner, a large portion of the webpages the user clicks may be displayed much more quickly than they would be otherwise.
I left out many details and optimizations since I just wanted this to be a brief description of what I was thinking about. Thank you very much for taking the time to read this post; feel free to ask any further questions if you have them.
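A minimal sketch of the thresholded chain selection described above, with hypothetical types; the probability table and the P_min threshold are the parts the question actually proposes, everything else is scaffolding:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: walk the page tree and collect chains whose cumulative probability
// of being followed (P_c) stays above a threshold (P_min); those get prefetched.
public class Prefetcher {
    static class PageNode {
        final String url;
        final double probability;        // P(q, t) for the current time of day
        final List<PageNode> children = new ArrayList<>();
        PageNode(String url, double probability) { this.url = url; this.probability = probability; }
    }

    // Collect every URL reachable from the root while the chain probability >= pMin.
    static List<String> pagesToPrefetch(PageNode root, double pMin) {
        List<String> result = new ArrayList<>();
        for (PageNode child : root.children) {
            collect(child, child.probability, pMin, result);
        }
        return result;
    }

    private static void collect(PageNode node, double chainProbability, double pMin, List<String> out) {
        if (chainProbability < pMin) return;     // prune: chain too unlikely to be followed
        out.add(node.url);
        for (PageNode child : node.children) {
            collect(child, chainProbability * child.probability, pMin, out);
        }
    }

    public static void main(String[] args) {
        PageNode main = new PageNode("https://example.com/", 1.0);
        PageNode likely = new PageNode("https://example.com/section/popular", 71.0 / 80.0);
        PageNode unlikely = new PageNode("https://example.com/section/rare", 0.05);
        main.children.add(likely);
        main.children.add(unlikely);
        likely.children.add(new PageNode("https://example.com/section/popular/top", 0.6));

        // Prints the popular section and its top page; the rare section is pruned.
        System.out.println(pagesToPrefetch(main, 0.5));
    }
}
```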
Interesting idea -- and there have been some implementations of pre-fetching in browsers, though without the brains you propose, which could help a lot. I think there are some flaws with this plan:
a) web page browsing, in most cases, is fast enough for most purposes.
b) bandwidth is becoming metered -- if I'm just glancing at the home page, do I as a user want to pay to download the other pages? Moreover, in the cases where this sort of thing could be useful (e.g. a slow 3G connection), bandwidth tends to be more tightly metered, and perhaps not so good at concurrency (e.g. CDMA 3G connections).
c) from a server operator's point of view, I'd rather just serve requested pages in most cases. Rendering pages that never get seen costs me cycles and bandwidth. If you are like a lot of folks and on some cloud computing platform, you are paying by the cycle and the byte.
d) would require re-building lots of analytics systems, many of which still operate on the theory of request == impression
Or, the short summary is that there really isn't a need to presage what people will view in order to speed up serving and rendering pages. Now, a place where something like this could be really useful would be the "hey, if you liked X you'll probably like Y" scenario, popping up links to that content (or those products) for folks.
Windows does the same thing with disk access: it "knows" that you are likely to start, say, Firefox at a certain time and preloads it.
SuperFetch also keeps track of what times of day those applications are used, which allows it to intelligently pre-load information that is expected to be used in the near future.
http://en.wikipedia.org/wiki/Windows_Vista_I/O_technologies#SuperFetch
Pointing out existing tech that does a similar thing:
RSS readers load feeds in the background, on the assumption that the user will want to read them sooner or later. There's no probability function that selects feeds for download, though; the user explicitly selects them.
Browser start pages and pinned tabs: these load as you start your browser; again, the user gets to select which websites are worth having around all the time.
Your proposal boils down to predicting where the user is most likely to click next, given the current website and the current time of day. I can think of a few other factors that play a role here:
what other websites are open in tabs ("user opened song in youtube, preload lyrics and guitar chords!")
what other applications are running ("user is looking at invoice in e-mail reader, preload internet bank")
which person is using the computer--use webcam to recognize faces, know which sites each one frequents