How many bits are sufficient to hash a webpage in English? [closed] - algorithm

Recently I came across a question that asked how many bits are sufficient to hash a webpage, given these assumptions:
There are 1 billion web pages
The average length of web pages is 300 words
We have 250,000 words in English
The pages are in ASCII
Apparently there is no single right answer to this problem, but the aim of the question is to see how the general method works.

You haven't defined what it means to “hash a webpage”; that phrase appears in this question and in a couple of other pages on the Internet. On those other pages it is used to mean computing a checksum (for example with sha1sum) to verify that content is intact. If that's what you mean, then you need all the bits of every page that is to be “hashed”; on average, that is 300 * 8 * (average English word length in characters). The question doesn't specify the average English word length, but if it is five letters plus a space, the average number of bits per page is 300 * 6 * 8, or 14400.
If you instead mean putting all the words of all the webpages into an index structure, so that a search can find all the webpages containing any given set of words, one answer is about 10^13 bits: there are 300 billion word references in a billion pages; each reference takes log_2(10^9) bits, or about 30 bits, if references are stored naively; hence about 9 trillion bits, i.e. roughly 10^13. You can also work out that naive storage for a billion URLs is at least an order of magnitude smaller than that, i.e. at most about 10^12 bits. Special methods might reduce reference storage by a couple of orders of magnitude, but because URLs are easier to compress or store compactly (e.g. via a trie), reference storage is still likely to be far larger than what is needed to store the URLs.
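To make the arithmetic concrete, here is a small back-of-the-envelope sketch of both estimates (the 100-character average URL length is an assumption, not something given in the question):

import math

PAGES = 10**9            # 1 billion web pages
WORDS_PER_PAGE = 300     # average page length in words
AVG_WORD_CHARS = 6       # five letters plus a space (assumed)
AVG_URL_CHARS = 100      # assumed average URL length, not given in the question

# Checksum interpretation: every bit of the page content is needed.
bits_per_page = WORDS_PER_PAGE * AVG_WORD_CHARS * 8
print(bits_per_page)                         # 14400 bits per page

# Index interpretation: one ~30-bit page reference per word occurrence.
bits_per_reference = math.ceil(math.log2(PAGES))
total_reference_bits = PAGES * WORDS_PER_PAGE * bits_per_reference
print(f"{total_reference_bits:.1e}")         # 9.0e+12, i.e. about 10^13 bits

# Naive URL storage, for comparison.
url_bits = PAGES * AVG_URL_CHARS * 8
print(f"{url_bits:.1e}")                     # 8.0e+11, i.e. about 10^12 bits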

Related

Purpose of floating points in Ruby [closed]

What is the purpose of floating-point numbers in Ruby? I found some information about using fewer bytes or increasing accuracy, but I do not understand why you would not always use floats. Wouldn't they give you a more accurate result?
In the past, integer ops were much faster and sometimes the FPU was not present or was optional in the architecture.
However, today, FP is almost universal, it's quite fast, and in fact it is possible to use FP for everything.
Most or all Javascript implementations work like that.
In general, though, integer ops are still faster, and the catalog of available operations matches more closely what programmers expect. 64-bit integers also map better to bytes and the storage system than the 53-bit integers effectively provided by double-precision floating point.
A full-featured language like Ruby will almost always implement both integer and FP ops. It gives the user more choice for attribute domains, while more streamlined languages like JavaScript may pick one or the other. Ruby is much more likely than JavaScript to need something like an ORM.
Note, however, that the reason is not "more accuracy". FP and integer operations return exactly the same results for integer operands, as long as the values fit. A double carries 53 significant bits (a 52-bit mantissa plus an implicit leading bit): more than a standard 32-bit int, but less than the also-common 64-bit long or long long, so neither side really wins or loses on precision. Both are exact within their range.
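A quick way to see that limit (shown here as a short Python illustration; Ruby's Float is the same IEEE 754 double, so it behaves the same way):

# Integers are exact in a double only up to 2**53; beyond that, gaps appear.
print(2.0**53 == 2.0**53 + 1)   # True: 2**53 + 1 is not representable as a double
print(2**53 == 2**53 + 1)       # False: arbitrary-precision integers stay exact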
And yes, as Jörg hints, the integer ops are more easily extended to greater precision.
Integers are typically faster for some operations, and sometimes you want the truncation that results from integer division.

Repeated Sampling Without Replacement [closed]

I want to generate 10 random numbers from the population 1:1000, and the code that generates these numbers is repeated 10 times. I want the sampling to be without replacement, such that the intersection between the 10 sets of 10 random numbers is empty.
First, using the sample function in R with replace set to FALSE doesn't help much on its own, and when I searched online I found a package for doing this called urn, but I can't download packages in R. In short, I want to do exactly what the following code does:
http://rss.acs.unt.edu/Rdoc/library/urn/html/urn.html
but manually instead of using the urn package
I tried the following code, where I select rows from "data" at random, but the samples generated aren't unique:
for (j in 1:10) {
  x <- unique(data[, 2])                                      # the same candidate pool is rebuilt on every iteration
  tr <- sample(length(x), 0.9 * length(x), replace = FALSE)   # indices are drawn independently each pass, so they can repeat across iterations; tr is also overwritten every time
}
Taking into account @ElKamina's comment, you could generate 100 numbers using sample and arrange them in a 10 x 10 matrix:
matrix(sample(1:1000, 100, FALSE), ncol=10)
I like sampling 100 values and putting them in a 10 by 10 matrix best, but another option would be to sample the first 10 from the full list, then use setdiff to compute the set without the 10 already chosen, choose another 10 from that set, use setdiff again, and so on.
That approach may work better if you don't know ahead of time how many samples you need or how many values go in each one, although in those cases you could also use sample to randomly permute the whole list of 1000 and then just pick off successive groups from the permuted list.

What technique do they use to compress the URLs? [closed]

What algorithm/technique do most of these sites use to compress a URL?
Adfly shortens a URL to something like "5Y8F2", which is superb. It produces the most compressed URLs I have ever seen.
You can find some information in the Wikipedia article URL shortening.
Quoting this article:
There are several techniques to implement a URL shortening. Keys can be generated in base 36, assuming 26 letters and 10 numbers. In this case, each character in the sequence will be 0, 1, 2, ..., 9, a, b, c, ..., y, z. Alternatively, if uppercase and lowercase letters are differentiated, then each character can represent a single digit within a number of base 62 (26 + 26 + 10). In order to form the key, a hash function can be made, or a random number generated so that key sequence is not predictable. Or users may propose their own keys. For example, http://en.wikipedia.org/w/index.php?title=TinyURL&diff=283621022&oldid=283308287 can be shortened to http://bit.ly/tinyurlwiki.
I think they do not actually compress it; they just generate a short key and map it to the real URL you submitted. So if they decide to make the key N characters long, they can support (number of allowed URL characters)^N distinct URLs.
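As an illustration of that counting argument and of the base-62 scheme quoted above, a shortener can simply store each long URL under an auto-incrementing ID and encode that ID in base 62. This is a sketch of the general idea, not Adfly's actual implementation:

import string

# 0-9, a-z, A-Z: the 62 characters mentioned in the quoted article.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode_base62(n: int) -> str:
    """Encode a non-negative database ID as a short base-62 key."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def decode_base62(key: str) -> int:
    """Recover the database ID from a short key."""
    n = 0
    for ch in key:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(encode_base62(123456789))   # "8m0Kx" -- a 5-character key
# A key of length 5 can address 62**5, i.e. roughly 916 million, distinct URLs.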

Image Hash for very similar images [closed]

I am taking screenshots of an application and trying to detect whether the exact image has been seen before. I want even trivial changes to count as different - e.g. if there is text in the image and the spelling changes, that counts as a mismatch.
I've been successfully using an MD5 hash of the contents of a screenshot image to look it up in a database of known images and detect whether it has been seen before.
Now I have ported the system to another machine, and despite my attempts to match the configurations exactly, I am getting ever-so-slightly different images from those on the old machine. When I say different, the changes are minute - if I blow up the old and new images and flick between them, I can't see a single difference! Nonetheless, ImageMagick's compare command can see a smattering of pixels that differ.
So my MD5 hashes are no longer matching. Rather than a simple MD5 hash, I need an image hash.
Doing my research, I find that most image hashes try to be fairly generous - they accept resized, transformed and watermarked images, with correspondingly more false-positive matches. I want an image hash that is far more strict - the only changes permitted are minute changes in colour.
Can anyone recommend an image hash library or algorithm? (Not an application, like dupdetector).
Remember: My requirements are different from the many similar questions in that I don't want a liberal algorithm like shrinking or pHash, and I don't want a comparison tool like structural similarity or ImageMagick's compare.
I want a hash that makes very similar images give the same hash value. Is that even possible?
You can have a look at the following paper, called "Spectral hashing". It describes an algorithm designed to produce hash codes from images so that similar images are grouped together (see the retrieval examples at the end of the paper). It is a good starting point.
The link: http://www.cs.huji.ac.il/~yweiss/SpectralHashing/
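For the narrower case described in the question (images that are pixel-for-pixel identical except for minute colour differences), a much simpler baseline, offered here only as a hedged sketch and not as part of the spectral hashing paper above, is to quantize each channel value before hashing, so that tiny colour deviations collapse to the same digest:

import hashlib

def strict_image_hash(pixel_bytes: bytes, keep_bits: int = 4) -> str:
    """Hash raw decoded pixel data after dropping each channel's low-order bits."""
    mask = (0xFF << (8 - keep_bits)) & 0xFF
    quantized = bytes(b & mask for b in pixel_bytes)   # e.g. 200 and 203 both become 192
    return hashlib.md5(quantized).hexdigest()

# Two pixel buffers differing only in the lowest bits hash identically;
# a real change (different text, say) still produces a different digest.
a = bytes([200, 201, 202, 120, 121, 122])
b = bytes([201, 200, 203, 121, 120, 123])
print(strict_image_hash(a) == strict_image_hash(b))   # True

One caveat: values that happen to straddle a quantization boundary (for example 207 versus 208) will still hash differently, so this is a cheap pre-filter rather than a guarantee.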

What algorithm to choose [closed]

Asked in a recent interview:
What data structure would you use to implement spell correction in a document. The goal is to find if a given word typed by the user is in the dictionary or not (no need to correct it).
What is the complexity?
I would use a "Radix" (or "Patricia") tree to index the dictionary. See here, including an example of its use to index dictionary words: https://secure.wikimedia.org/wikipedia/en/wiki/Radix_tree. There is a useful discussion of its complexity at that link.
If I'm understanding the question correctly, you are given a dictionary (or a list of "correct" words) and are asked to say whether an input word is in the dictionary. So you're looking for data structures with very fast lookup times. I would go with a hash table.
I would use a DAWG (Directed Acyclic Word Graph) which is basically a compressed Trie.
These are commonly used in algorithms for Scrabble and other words games, like Boggle.
I've done this before. The TWL06 Scrabble dictionary with 170,000 words fits in a 700 KB structure both on disk and in RAM.
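A DAWG takes some machinery to build (it is essentially a minimized trie), but a plain trie already shows the lookup idea. A rough sketch, not the structure actually used for TWL06:

class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}      # letter -> TrieNode
        self.is_word = False    # True if a dictionary word ends here

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word: str) -> bool:
        """O(len(word)) lookup, independent of dictionary size."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

dictionary = Trie(["cat", "cats", "dog"])
print(dictionary.contains("cats"))   # True
print(dictionary.contains("ca"))     # False: a prefix, not a stored word

A DAWG then merges identical subtrees (shared suffixes), which is how a 170,000-word dictionary can fit in a few hundred kilobytes.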
The Levenshtein distance tells you how many single-letter edits you need to make to get from one string to another; by finding the dictionary word with the fewest edits you can suggest correct words (see also the Damerau-Levenshtein distance).
To increase performance, you should not calculate the distance against your whole dictionary; constrain the candidate set with some heuristic, for instance only words that start with the same first letter.
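A standard dynamic-programming implementation of the distance, as a short sketch:

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("speling", "spelling"))   # 1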
Bloom filter. False positives are possible, but false negatives are not. Since you know the dictionary in advance, you can eliminate the false positives by using a perfect hash for your input (the dictionary). Or you can use the Bloom filter as an auxiliary data structure in front of your actual dictionary data structure, as a fast pre-filter.
Edit: of course, the lookup complexity for a Bloom filter is O(1).
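A minimal Bloom filter sketch (the bit-array size and the double-hashing scheme here are arbitrary illustrative choices):

import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, word: str):
        # Derive num_hashes bit positions from one MD5 digest (double hashing).
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, word: str) -> None:
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, word: str) -> bool:
        """False means definitely absent; True means probably present."""
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word))

bf = BloomFilter()
for w in ["cat", "dog", "house"]:
    bf.add(w)
print(bf.might_contain("dog"))   # True
print(bf.might_contain("dgo"))   # False (or, very rarely, a false positive)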
