Closed 10 years ago.
What algorithm or technique do most of these sites use to shorten a URL?
Adfly shortens a URL to something like "5Y8F2", which is superb; it produces the shortest URLs I have ever seen.
You can find some information in the Wikipedia article URL shortening.
Quoting this article:
There are several techniques to implement a URL shortening. Keys can be generated in base 36, assuming 26 letters and 10 numbers. In this case, each character in the sequence will be 0, 1, 2, ..., 9, a, b, c, ..., y, z. Alternatively, if uppercase and lowercase letters are differentiated, then each character can represent a single digit within a number of base 62 (26 + 26 + 10). In order to form the key, a hash function can be made, or a random number generated so that key sequence is not predictable. Or users may propose their own keys. For example, http://en.wikipedia.org/w/index.php?title=TinyURL&diff=283621022&oldid=283308287 can be shortened to http://bit.ly/tinyurlwiki.
I think they do not compress the URL at all; they just generate a short key and map it to the real URL you submitted. So if they decide to make the key N characters long, they can support (number of allowed key characters)^N distinct URLs.
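Since the question doesn't name a language, here is a minimal sketch in R of that generate-and-map idea (the real services' implementations aren't public): store the long URL under an auto-incrementing integer id and encode that id in base 62 to get the short key.

alphabet <- c(0:9, letters, LETTERS)   # 62 symbols: 0-9, a-z, A-Z

to_base62 <- function(id) {
  if (id == 0) return("0")
  key <- character(0)
  while (id > 0) {
    key <- c(alphabet[id %% 62 + 1], key)  # prepend so the most significant digit ends up first
    id  <- id %/% 62
  }
  paste(key, collapse = "")
}

to_base62(1234567)   # "5ban" -- the short code stored alongside the original URL

A 4-character key already covers 62^4 (about 14.8 million) URLs, and a 5-character key like "5Y8F2" covers 62^5, over 900 million.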
Closed 9 years ago.
I want to generate 10 random numbers from the population 1:1000, and the code that generates these numbers is repeated 10 times. I want the sampling to be without replacement, so that the intersection between the 10 sets of 10 random numbers is empty.
Using the sample function in R with replace set to FALSE doesn't help much on its own. When I searched online I found a function for this called urn, but I can't download packages in R. In short, I want to do exactly what the following does:
http://rss.acs.unt.edu/Rdoc/library/urn/html/urn.html
but manually, instead of using the urn package.
I tried the following code, where I select rows from "data" at random, but the samples generated aren't unique:
for (j in 1:10) {
  x  <- unique(data[, 2])
  tr <- sample(length(x), 0.9 * length(x), replace = FALSE)
}
Taking into account @ElKamina's comment, you could generate 100 numbers using sample and arrange them into a 10 x 10 matrix:
matrix(sample(1:1000, 100, replace = FALSE), ncol = 10)
I like the "sample 100 values and put them in a 10 by 10 matrix" approach best, but another option is to sample the first 10 from the full list, use setdiff to remove the 10 already chosen, choose another 10 from what remains, apply setdiff again, and so on (see the sketch below).
That way may work better if you don't know ahead of time how many samples you need or how many values go in each sample, although in those cases you could also use sample to randomly permute the whole list of 1000 and just pick off groups from the permuted list.
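For completeness, a sketch of that setdiff approach, assuming (as in the question) 10 disjoint samples of size 10 from 1:1000:

pool    <- 1:1000
samples <- vector("list", 10)
for (j in 1:10) {
  samples[[j]] <- sample(pool, 10)              # replace = FALSE is the default
  pool         <- setdiff(pool, samples[[j]])   # drop the numbers already drawn
}
any(duplicated(unlist(samples)))                # FALSE: no number appears in two samples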
Closed 10 years ago.
Recently I came across a question which asked how many bits are sufficient to hash a webpage, given these assumptions:
There are 1 billion web pages
The average length of web pages is 300 words
We have 250,000 words in English
The pages are in ASCII
Apparently there is no single right answer to this problem, but the aim of the question is to see how the general method works.
You haven't defined what it means to "hash a webpage"; that phrase appears in this question and in a couple of other pages on the Internet, where it is used to mean computing a checksum (for example with sha1sum) to verify that content is intact. If that's what you mean, then you need all the bits of any page that's to be "hashed": on average, that is 300 * 8 * (average English word length in characters). The question doesn't specify the average English word length, but if it is five letters plus a space, the average number of bits per page is 300 * 8 * 6, or 14,400.
If you instead mean putting all the words of all the webpages into an index structure, so that a search can find every webpage containing a given set of words, one answer is about 10^13 bits: there are 300 billion word references in a billion pages; each reference takes log2(1G) bits, or about 30 bits, if references are stored naively; hence about 9 trillion bits, i.e. roughly 10^13. You can also work out that naive storage for a billion URLs is at least an order of magnitude smaller than that, i.e. at most about 10^12 bits. Special methods might reduce reference storage by a couple of orders of magnitude, but because URLs are easier to compress or store compactly (via, e.g., a trie), reference storage is still likely to dwarf what is needed for storing the URLs.
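To spell the arithmetic out, a back-of-the-envelope sketch in R using the figures assumed above:

pages      <- 1e9        # 1 billion web pages
words_page <- 300        # average words per page
bits_char  <- 8          # ASCII
chars_word <- 6          # assumption from the answer: five letters plus a space

words_page * chars_word * bits_char           # 14400 bits in one average page
pages * words_page * ceiling(log2(pages))     # 9e12, i.e. roughly 10^13 bits of naive 30-bit references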
Closed 10 years ago.
Given a known set $A$ of distinct numbers from $0$ to $2^{n+1}-1$; in binary, each number is an $n$-dimensional vector with 0/1 elements. Now, for an arbitrary subset $S$ of $A$ containing $m$ distinct numbers, is it possible to find a function $f$ such that $f(S)$ becomes $\{0, 1, \ldots, m-1\}$, while $f(A \setminus S)$ does not fall in $\{0, 1, \ldots, m-1\}$? The function $f$ should be as simple as possible; a linear one is preferred. Thanks.
The keyword you're looking for is a minimal perfect hash function, and yes, it's always possible to construct a minimal perfect hash function for a given S.
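To make that concrete, here is a tiny sketch in R (the contents of S are made up): for a fixed, known S, the zero-based rank within S is itself a minimal perfect hash; the m members map bijectively onto 0, 1, ..., m-1, and non-members fall outside that range (here they map to NA). Note this is a lookup, not the linear function the question hoped for.

S <- c(5L, 17L, 42L, 99L)                 # a hypothetical subset of A, m = 4
f <- function(x) match(x, sort(S)) - 1L   # zero-based rank within S

f(S)      # 0 1 2 3 -- a permutation of 0..m-1
f(7L)     # NA -- not in S, so it never lands in 0..m-1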
Closed 10 years ago.
I have a Hash, and I need to access the following:
parsed["HotelInformationResponse"]["PropertyAmenities"]["PropertyAmenity"]
that needs to go on a line with a variable assignment. This makes it longer than 80 characters, which is where I wrap my lines. What's the most elegant way to wrap that to make it fit?
Text editors are only a tool. Just because you wrap your lines at 80 characters doesn't mean that 100% of your lines must be under 80 characters. Some expressions cannot (or should not) be broken up and simply happen to be long. In a language that aims for syntax which reads like natural language, verbose method or variable names (such as "HotelInformationResponse") sometimes require more space.
To directly answer, you can assign different parts to separate variables:
response = parsed["HotelInformationResponse"]
amenities = response["PropertyAmenities"]
amenity = amenities["PropertyAmenity"]
This would be preferable if you are reusing parts of the hash, so you aren't calling parsed["HotelInformationResponse"]["PropertyAmenities"] repeatedly.
Closed 11 years ago.
What is the best fuzzy matching algorithm (fuzzy logic, n-gram, Levenshtein, Soundex, ...) to process more than 100,000 records in the least time?
I suggest you read the articles by Navarro mentioned in the References section of the Wikipedia article titled Approximate string matching. Basing your decision on actual research is always better than relying on suggestions from random strangers, especially if performance on a known set of records is important to you.
It depends massively on your data. Certain fields can be matched better than others. For example, a postcode has a defined format, so it can be compared differently from free-form strings; people can be matched on initials and date of birth, or other combinations, etc.
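As a small illustration (the records here are made up), base R's adist computes Levenshtein distances and agrep does approximate matching; for 100,000+ records you would normally block or index the data first rather than compare every pair:

records <- c("Jon Smith", "John Smith", "Jane Smyth", "J. Smith")
query   <- "John Smyth"

d <- adist(query, records)   # Levenshtein distance from the query to each record
records[which.min(d)]        # "John Smith" -- the closest record
agrep(query, records, max.distance = 2, value = TRUE)   # records that approximately contain the query (at most 2 edits)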