Repeated Sampling Without Replacement [closed] - algorithm

Closed 9 years ago.
I want to generate 10 random numbers from the population 1:1000, and the code that generates these numbers is repeated 10 times. I want the sampling to be without replacement, so that the intersection between the 10 sets of 10 random numbers is empty.
First, using the sample function in R with replace set to FALSE doesn't help much on its own, and
when I searched online I found a function for doing this in a package called urn, but I can't download packages in R. So, in short, I want to do exactly what the following code does:
http://rss.acs.unt.edu/Rdoc/library/urn/html/urn.html
but manually instead of using the urn package
I tried the following code, but the samples generated aren't unique across iterations (I select rows from "data" randomly):
for (j in 1:10) {
  x  <- unique(data[, 2])                                   # the same pool is rebuilt on every pass
  tr <- sample(length(x), 0.9 * length(x), replace = FALSE) # each draw is independent of the previous ones, so draws can overlap, and tr is overwritten each time
}

Taking into account @ElKamina's comment, you could generate 100 numbers using sample and arrange them in a 10 x 10 matrix:
matrix(sample(1:1000, 100, FALSE), ncol=10)

I like the "sample 100 values and put them in a 10 by 10 matrix" approach best, but another option would be to sample the first 10 from the full list, then use setdiff to compute the set without the 10 already chosen, choose another 10 from that group, use setdiff again, and so on.
This way may work better if you don't know ahead of time how many samples you need or how many values go in each sample, though even in that case you could use sample to randomly permute the whole list of 1000 and then just pick off groups from the permuted list.
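A minimal sketch of that setdiff approach, using the numbers from the question (10 disjoint samples of 10 values drawn from 1:1000); the names pool and samples are just illustrative:
# Draw 10 disjoint samples of 10 from 1:1000 by shrinking the candidate pool
pool <- 1:1000
samples <- vector("list", 10)
for (j in 1:10) {
  samples[[j]] <- sample(pool, 10)    # sample() is without replacement by default here
  pool <- setdiff(pool, samples[[j]]) # remove the chosen values from the pool
}
# Sanity check: all 100 drawn values are distinct, so pairwise intersections are empty
length(unique(unlist(samples))) == 100
The permutation variant mentioned above amounts to sample(1:1000, 1000) followed by cutting the result into consecutive blocks of 10.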

Related

Algorithm choice for schedule of campaign stops given route information and list of stops? [closed]

Closed 9 years ago.
SHOW_SCHEDULE(START_CITY, START_STATE, HOURS)
This function looks at the current set of campaign stops stored in the system and creates a schedule for the candidate. The schedule includes a subset of the stored campaign stops and the route information between them. The schedule must include the maximum number of campaign stops that can be accommodated within a given number of hours. START_CITY and START_STATE together denote the first city in the schedule. HOURS denotes
the number of hours for which the schedule is being made.
What would be the best algorithm for this function?
You could look at this answer, which talks about Dijkstra's algorithm for routing (you would probably need to define your graph accordingly).
Basically, treat your stops as vertices; a route is then a walk through those vertices.
Now, since you bring in the time dimension, the route becomes somewhat non-static. Have a look at Distance Vector routing, again as suggested in the above-mentioned answer.
The links below should give some more insight into, and comparison of, routing algorithms:
Wikipedia Journey Planner
This paper compares other algorithms that are faster than Dijkstra's algorithm.
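To make the Dijkstra suggestion concrete, here is a minimal single-source shortest-path sketch in R; the stop names and travel times in the edges table are invented purely for illustration, and the real scheduling problem would still need a layer on top that chooses which stops to visit within the HOURS budget:
# Vertices are campaign stops, edge weights are travel hours (made-up numbers)
edges <- data.frame(
  from  = c("Austin", "Austin",  "Dallas",  "Dallas", "Houston"),
  to    = c("Dallas", "Houston", "Houston", "Waco",   "Waco"),
  hours = c(3, 2.5, 3.5, 1.5, 3),
  stringsAsFactors = FALSE
)
# travel is possible in both directions, so add the reverse edges
edges <- rbind(edges, setNames(edges[, c("to", "from", "hours")], names(edges)))

dijkstra <- function(edges, start) {
  nodes <- unique(c(edges$from, edges$to))
  dist  <- setNames(rep(Inf, length(nodes)), nodes)
  dist[start] <- 0
  visited <- character(0)
  while (length(visited) < length(nodes)) {
    unvisited <- setdiff(nodes, visited)
    u <- unvisited[which.min(dist[unvisited])]  # closest unvisited stop
    if (is.infinite(dist[u])) break             # remaining stops are unreachable
    visited <- c(visited, u)
    out <- edges[edges$from == u, ]
    for (i in seq_len(nrow(out))) {             # relax every edge leaving u
      cand <- dist[u] + out$hours[i]
      if (cand < dist[out$to[i]]) dist[out$to[i]] <- cand
    }
  }
  dist
}

dijkstra(edges, "Austin")  # shortest travel time in hours from Austin to every stop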

How many bits are sufficient to hash a webpage in English? [closed]

Closed 10 years ago.
Recently I came across a question which asked how many bits are sufficient to hash a webpage, under these assumptions:
There are 1 billion web pages
The average length of web pages is 300 words
We have 250,000 words in English
The pages are in ASCII
Apparently there is no one right answer to this problem, but the aim of the question is to see how the general method works.
You haven't defined what it means to “hash a webpage”; that phrase appears in this question and in a couple of other pages on the Internet. In those other pages it is used to mean computing a checksum (for example with sha1sum) to verify that content is intact. If that's what you mean, then you need all the bits of any page that's to be “hashed”; on average, that is 300 * 8 * (average English word length in characters). The question doesn't specify the average English word length, but if it is five letters plus a space, the average number of bits per page is 6 * 300 * 8, or 14400.
If you instead mean putting all the words of all the webpages into an index structure, to allow a search to find all the webpages that contain any given set of words, one answer is about 10^13 bits: there are 300 billion word references in a billion pages; each reference uses log_2(10^9) bits, or about 30 bits, if references are stored naively; hence about 9 trillion bits, or roughly 10^13. You can also work out that naive storage for a billion URLs is at least an order of magnitude smaller than that, i.e. 10^12 bits at most. Special methods might be used to reduce reference storage by a couple of orders of magnitude, but because URLs are easier to compress or store compactly (via, e.g., a trie), reference storage is likely to remain far larger than what is needed for storing the URLs.
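For concreteness, both back-of-the-envelope figures above can be reproduced directly; the numbers are the ones assumed in the answer (six characters per word, 10^9 pages, 300 words per page):
# Checksum interpretation: bits needed to read one whole page
words_per_page <- 300
chars_per_word <- 6                    # five letters plus a trailing space (assumption)
bits_per_char  <- 8                    # ASCII
words_per_page * chars_per_word * bits_per_char   # 14400 bits per page

# Index interpretation: bits for naive word-to-page references over the whole corpus
pages        <- 1e9
references   <- pages * words_per_page # 3e11 word occurrences in total
bits_per_ref <- ceiling(log2(pages))   # about 30 bits to name one of 1e9 pages
references * bits_per_ref              # about 9e12, i.e. roughly 10^13 bits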

What technique do they use to compress the URLs? [closed]

Closed 10 years ago.
What algorithm/technique do most of these sites use to compress the URL?
Adfly shortens a URL to, e.g., "5Y8F2", which is superb. It produces the most compressed URLs I have ever seen.
You can find some information in the Wikipedia article URL shortening.
Quoting this article:
There are several techniques to implement a URL shortening. Keys can be generated in base 36, assuming 26 letters and 10 numbers. In this case, each character in the sequence will be 0, 1, 2, ..., 9, a, b, c, ..., y, z. Alternatively, if uppercase and lowercase letters are differentiated, then each character can represent a single digit within a number of base 62 (26 + 26 + 10). In order to form the key, a hash function can be made, or a random number generated so that key sequence is not predictable. Or users may propose their own keys. For example, http://en.wikipedia.org/w/index.php?title=TinyURL&diff=283621022&oldid=283308287 can be shortened to http://bit.ly/tinyurlwiki.
I think they do not compress it; they just generate a short key and map it to the real URL you submitted. So if they decide to make the key N characters long, they will be able to support (number of possible key characters)^N distinct URLs.
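As a sketch of the base-62 scheme quoted above: store each long URL under an auto-incrementing integer id and encode that id with a 62-character alphabet to get the short key; the function name and alphabet order below are arbitrary choices, not what any particular shortener actually uses:
# Encode a positive integer id as a base-62 key (digits, lowercase, uppercase)
alphabet <- c(0:9, letters, LETTERS)   # coerced to character: 62 symbols in total

to_base62 <- function(id) {
  key <- character(0)
  while (id > 0) {
    key <- c(alphabet[id %% 62 + 1], key)  # prepend the least significant symbol
    id  <- id %/% 62
  }
  paste(key, collapse = "")
}

to_base62(125)   # "21", because 125 = 2*62 + 1
# A 5-character key such as "5Y8F2" can address 62^5, roughly 9.2e8, distinct URLs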

Kahan summation and relative errors; or real life war stories of "getting the wrong result instead of the correct one" [closed]

Closed 10 years ago.
I am interested in "war stories" like this one:
I wrote a program involving the sum of floating-point numbers, but I did not use Kahan summation.
The sum was bad_sum and the program gave me a wrong result.
A colleague of mine, more versed than I am in numerical analysis, had a look at the code and suggested that I use Kahan summation; the sum is now good_sum and the program gives me the correct result.
I am interested in real-life production code, not in code samples "artificially" created in order to explain the Kahan summation algorithm.
In particular, what was the relative error (bad_sum - good_sum) / good_sum for your application?
Up to now I have no similar story to tell. Maybe I will run some tests (running my program on an input data set, logging the program results and the sums with and without Kahan summation, and estimating the relative error).
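For readers who want to see what the algorithm itself looks like, here is a minimal sketch of Kahan (compensated) summation, written in R only because that is the language used elsewhere on this page; it is an illustration, not production code from any of the stories asked for above:
# Kahan summation: carry the rounding error of each addition into the next one
kahan_sum <- function(x) {
  s    <- 0                 # running sum
  comp <- 0                 # running compensation for lost low-order bits
  for (xi in x) {
    y    <- xi - comp       # correct the next term by the previously lost part
    t    <- s + y           # low-order bits of y are lost in this addition
    comp <- (t - s) - y     # algebraically zero; numerically it recovers the loss
    s    <- t
  }
  s
}

x <- rep(0.1, 1e6)
naive <- Reduce(`+`, x)                  # plain left-to-right accumulation
(naive - kahan_sum(x)) / kahan_sum(x)    # relative error in the sense of the question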

Why are we allowed to harass an iterating variable in a for loop [closed]

Closed 11 years ago.
Sorry about the question being so generic, but I've always wondered about this.
In a for loop, why doesn't a compiler determine the number of times it has to run based on the initializer, condition, and increment, and then run the loop for that predetermined number of times?
Even the enhanced for in Java and the for in Python, which let us iterate over collections, act funny if we modify the collection.
If we do want to change the iterating variable or the object we are iterating upon, we might as well use a while loop instead of a for. Are there any advantages to using a for loop in such a case?
Since no language does a for loop the way I have described, there must be a lot of things I haven't thought about. Please point them out.
That's what an optimizing compiler can do if it decides that's the right optimization. It's called loop unrolling, and in a C compiler you can usually encourage it with the flag -funroll-loops. The main issue is that you don't always know at compile time how many iterations you are going to need, so compilers have to be able to handle the general case correctly. If a compiler can determine that the number of loop iterations is invariant and reasonably small, it will likely output machine code with the loop unrolled.
The other major issue is file size. If you know you'll have to iterate 1,000,000,000 times, fully unrolling that loop will make your executable binary huge.
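Purely to illustrate what "unrolled" means, here is the same accumulation written rolled and then manually unrolled by a factor of four, in R for consistency with the rest of this page; a compiler performs this rewrite on the machine code, where the saving comes from fewer loop-condition checks and branches per element:
x <- runif(1e6)   # 1e6 is divisible by 4, so no remainder loop is needed

# Rolled: one addition and one loop-condition check per element
s1 <- 0
for (i in seq(1, length(x), by = 1)) {
  s1 <- s1 + x[i]
}

# Unrolled by 4: four additions per iteration, a quarter of the condition checks
s2 <- 0
for (i in seq(1, length(x), by = 4)) {
  s2 <- s2 + x[i] + x[i + 1] + x[i + 2] + x[i + 3]
}

all.equal(s1, s2)   # the two forms compute the same sum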
