I am trying to write a decoder for the GPU. My encoding scheme has data dependencies between lines: when decoding columns of data, each column depends on the previous one. I want to parallelize the internal computation of each column but execute the columns one by one, sequentially. However, I am having trouble getting this right.
Below I have modeled a toy example to show the problem:
Func f;
Var x,y;
RDom r(1,3,1,3); // r.x and r.y each run from 1 to 3 (min 1, extent 3)
f(x,y) = 0;
f(0,y) = y;
Expr p_1 = f(r.x-1,r.y);
Expr p_2 = f(r.x-1,r.y-1);
f(r.x,r.y) = p_1 + p_2;
Buffer<int32_t> output_2D = f.realize({4,4});
A visualization of this program can be seen here: Serial Computation Visualisation
This reduction should give the following array:
int expected_output[4][4] = {{0,0,0,0},
{1,1,1,1},
{2,3,4,5},
{3,5,8,12}};
And checking with Catch2, I can see that it actually calculates this correctly:
for(int j = 0; j < output_2D.height(); j++){
    for(int i = 0; i < output_2D.width(); i++){
        CAPTURE(i,j);
        REQUIRE(expected_output[j][i]==output_2D(i,j));
    }
}
My task is to speed this computation up. Since column one depends on column zero, I have to calculate the columns in series. I can, however, calculate all the values within a column in parallel. Please see Computation Steps Parallel and Desired Pipeline to see how I want Halide to compute the pipeline.
I tried doing this in Halide using f.update(1).allow_race_conditions().parallel(r.y);, and it does almost what I want.
f(r.x,r.y) = p_1 + p_2;
f.update(1).allow_race_conditions().parallel(r.y);
f.trace_stores();
Buffer<int32_t> output_2D = f.realize({4,4});
For some reason, however, parallel(r.y) executes the columns in seemingly random order.
It yields the following store trace:
Init Image:
Store f29.0(0, 0) = 0
Store f29.0(1, 0) = 0
....
Store f29.0(3, 3) = 0
Init first row:
Store f29.0(0, 0) = 0
Store f29.0(1, 0) = 1
Store f29.0(2, 0) = 2
Store f29.0(3, 0) = 3
Start Parallel Computation:
Store f29.0(1, 1) = 1 // First parallel column
Store f29.0(2, 1) = 1
Store f29.0(3, 1) = 1
Store f29.0(1, 3) = 5 // Second parallel column: THIS IS MY PROBLEM
Store f29.0(2, 3) = 5 // This should be column 2 not column 3.
Store f29.0(3, 3) = 5
Store f29.0(1, 2) = 3
Store f29.0(2, 2) = 4
Store f29.0(3, 2) = 5
A visualization of this pattern can be seen here in this figure: Current Pipeline.
I know that I am explicitly enabling race conditions, so I must be doing something wrong, but I don't know the right way to do this, and this is the closest I have come. I could vectorize() with respect to y, and that gives the correct evaluation, but I want to use parallel() to gain a greater speedup for larger matrices/images. rfactor might be a solution, since my problem should be associative in the y direction, but it might not work because it is non-associative in the x direction (each column depends on the previous one). Does anyone know how to be serial in x and parallel in y when using RDoms?
I got asked this question in an interview and couldn't find a solution.
Given an array of characters, delete all characters that are repeated k or more times consecutively, and add a '#' at the end of the array for every deleted character.
Example:
"xavvvarrrt"->"xaat######"
O(1) memory and O(n) time, without writing to the same cell twice.
The tricky part for me was that I am not allowed to overwrite a cell more than once, which means I need to know exactly where each character will end up after the duplicates are deleted.
The best I could come up with was to iterate once over the array, saving the occurrences of each character in a map; then iterate again and, if the current character is not deleted, move it to its new position according to an offset variable, updating the offset whenever a character is deleted.
The problem with this approach is that it won't work in this scenario:
"aabbaa" because 'a' appears at two different places.
So when I thought about saving an array of occurrences in the map but now it won't use O(1) memory.
Thanks
This seems to work with your examples, although it seems a little complicated to me :) I wonder if we could simplify it. The basic idea is to traverse from left to right, keeping a record of how many places in the current block of duplicates are still available to replace, while the right pointer looks for more blocks to shift over.
JavaScript code:
function f(str){
    str = str.split('')
    let r = 1
    let l = 0
    let to_fill = 0
    let count = 1
    let fill = function(){
        while (count > 0 && (to_fill > 0 || l < r)){
            str[l] = str[r - count]
            l++
            count--
            to_fill--
        }
    }
    for (; r<str.length; r++){
        if (str[r] == str[r-1]){
            count++
        } else if (count < 3){
            if (to_fill)
                fill()
            count = 1
            if (!to_fill)
                l = r
        } else if (!to_fill){
            to_fill = count
            count = 1
        } else {
            count = 1
        }
    }
    if (count < 3)
        fill()
    while (l < str.length)
        str[l++] = '#'
    return str.join('')
}
var str = "aayyyycbbbee"
console.log(str)
console.log(f(str)) // "aacee#######"
str = "xavvvarrrt"
console.log(str)
console.log(f(str)) // "xaat######"
str = "xxaavvvaarrrbbsssgggtt"
console.log(str)
console.log(f(str))
Here is a version similar to the other JS answer, but a bit simpler:
function repl(str) {
    str = str.split("");
    var count = 1, write = 0;
    for (var read = 0; read < str.length; read++) {
        if (str[read] == str[read+1])
            count++;
        else {
            if (count < 3) {
                for (var i = 0; i < count; i++)
                    str[write++] = str[read];
            }
            count = 1;
        }
    }
    while (write < str.length)
        str[write++] = '#';
    return str.join("");
}

function demo(str) {
    console.log(str + " ==> " + repl(str));
}
demo("a");
demo("aa");
demo("aaa");
demo("aaaaaaa");
demo("aayyyycbbbee");
demo("xavvvarrrt");
demo("xxxaaaaxxxaaa");
demo("xxaavvvaarrrbbsssgggtt");
/*
Output:
a ==> a
aa ==> aa
aaa ==> ###
aaaaaaa ==> #######
aayyyycbbbee ==> aacee#######
xavvvarrrt ==> xaat######
xxxaaaaxxxaaa ==> #############
xxaavvvaarrrbbsssgggtt ==> xxaaaabbtt############
*/
The idea is to keep one index for reading the next character and one for writing, as well as the count of consecutive repeated characters. If the following character is equal to the current one, we just increase the counter; otherwise we copy all runs with a count below 3, increasing the write index appropriately.
At the end of reading, everything from the current write index up to the end of the array corresponds to the repeated characters we skipped, so we just fill that span with hashes.
As we only store 3 values, memory consumption is O(1); we read each array cell at most twice, so O(n) time (the extra reads on writing could be eliminated with another variable); and each cell is written exactly once.
I have a very large set (billions of elements or more; it's expected to grow exponentially to some level), and I want to generate seemingly random elements from it without repeating. I know I can pick a random number, repeat, and record the elements I have generated, but that takes more and more memory as numbers are generated, and wouldn't be practical after a couple million elements.
I mean, I could say 1, 2, 3 up to billions, and each would take constant time without remembering all the previous ones; or I could say 1, 3, 5, 7, 9, and so on, then 2, 4, 6, 8, 10. But is there a more sophisticated way to do that and eventually get a seemingly random permutation of the set?
Update
1. The set does not change size during the generation process. I meant that when the user's input increases linearly, the size of the set increases exponentially.
2. In short, the set is like the set of every integer from 1 to 10 billion or more.
3. At length: it goes up to 10 billion or more because each element carries the information of many independent choices. For example, imagine an RPG character that has 10 attributes, each of which can go from 1 to 100 (for my problem, different choices can have different ranges); there are thus 10^20 possible characters, and the number "10873456879326587345" would correspond to a character that has "11, 88, 35, ...". I would like an algorithm to generate them one by one without repeating, while making them look random.
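To illustrate that correspondence, here is a hypothetical Python sketch that treats the element number as a mixed-radix numeral, one digit per attribute (the ten equal ranges are just this example's; real ranges could differ per choice):

# Decode one element number into its independent choices.
def decode(element, ranges=(100,) * 10):
    attributes = []
    for r in ranges:
        attributes.append(element % r + 1)  # this attribute's value, 1..r
        element //= r                       # move on to the next digit
    return attributes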
Thanks for the interesting question. You can create a "pseudorandom"* (cyclic) permutation with a few bytes of state using modular exponentiation. Say we have n elements. Search for a prime p that's bigger than n+1. Then find a primitive root g modulo p. Basically by the definition of a primitive root, the map x --> (g * x) % p is a cyclic permutation of {1, ..., p-1}, and so x --> ((g * (x+1)) % p) - 1 is a cyclic permutation of {0, ..., p-2}. We can get a cyclic permutation of {0, ..., n-1} by repeating the previous permutation whenever it gives a value bigger than (or equal to) n.
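A minimal Python sketch of that step (p and g are assumed inputs here: a prime p > n+1 and a primitive root g modulo p, which the package below searches for):

# Sketch only: repeatedly apply the permutation of {0, ..., p-2} until the
# value lands inside {0, ..., n-1} ("cycle walking").
def make_cycle(n, p, g):
    def step(x):
        x = (g * (x + 1)) % p - 1
        while x >= n:
            x = (g * (x + 1)) % p - 1
        return x
    return step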
I implemented this idea as a Go package. https://github.com/bwesterb/powercycle
package main

import (
    "fmt"
    "github.com/bwesterb/powercycle"
)

func main() {
    var x uint64
    cycle := powercycle.New(10)
    for i := 0; i < 10; i++ {
        fmt.Println(x)
        x = cycle.Apply(x)
    }
}
This outputs something like
0
6
4
1
2
9
3
5
8
7
but that might vary, of course, depending on the generator chosen.
It's fast, but not super fast: on my five-year-old i7 it takes less than 210 ns to compute one application of a cycle on 1,000,000,000,000,000 elements. More details:
BenchmarkNew10-8 1000000 1328 ns/op
BenchmarkNew1000-8 500000 2566 ns/op
BenchmarkNew1000000-8 50000 25893 ns/op
BenchmarkNew1000000000-8 200000 7589 ns/op
BenchmarkNew1000000000000-8 2000 648785 ns/op
BenchmarkApply10-8 10000000 170 ns/op
BenchmarkApply1000-8 10000000 173 ns/op
BenchmarkApply1000000-8 10000000 172 ns/op
BenchmarkApply1000000000-8 10000000 169 ns/op
BenchmarkApply1000000000000-8 10000000 201 ns/op
BenchmarkApply1000000000000000-8 10000000 204 ns/op
Why did I say "pseudorandom"? Well, we are always creating a very specific kind of cycle: namely one that uses modular exponentiation. It looks pretty pseudorandom though.
I would use a random number and swap the element at that position with an element at the beginning of the set.
Here's some pseudocode:
set = [1, 2, 3, 4, 5, 6]
picked = 0

Function PickNext(set, picked)
    If picked > Len(set) - 1 Then
        Return Nothing
    End If

    // random number between picked (inclusive) and length (exclusive)
    r = RandomInt(picked, Len(set))

    // swap the picked element to the beginning of the set
    result = set[r]
    set[r] = set[picked]
    set[picked] = result

    // update picked
    picked++

    // return your next random element
    Return result
End Function
Every time you pick an element there is one swap, and the only extra memory used is the picked variable. The swap works whether the elements are in a database or in memory.
EDIT: Here's a JSFiddle of a working implementation: http://jsfiddle.net/sun8rw4d/
JavaScript
var set = [];
set.picked = 0;

function pickNext(set) {
    if(set.picked > set.length - 1) { return null; }
    var r = set.picked + Math.floor(Math.random() * (set.length - set.picked));
    var result = set[r];
    set[r] = set[set.picked];
    set[set.picked] = result;
    set.picked++;
    return result;
}

// testing
for(var i=0; i<100; i++) {
    set.push(i);
}
while(pickNext(set) !== null) { }
document.body.innerHTML += set.toString();
EDIT 2: Finally, a random binary walk of the set. This can be accomplished with O(log2(N)) stack space (memory), which for 10 billion is only 33. There's no shuffling or swapping involved. Using ternary instead of binary might yield even better pseudorandom results.
// on the fly set generator
var count = 0;
var maxValue = 64;

function nextElement() {
    // restart the generation
    if(count == maxValue) {
        count = 0;
    }
    return count++;
}

// code to pseudo randomly select elements
var current = 0;
var stack = [0, maxValue - 1];

function randomBinaryWalk() {
    if(stack.length == 0) { return null; }
    var high = stack.pop();
    var low = stack.pop();
    var mid = ((high + low) / 2) | 0;

    // pseudo randomly choose the next path
    if(Math.random() > 0.5) {
        if(low <= mid - 1) {
            stack.push(low);
            stack.push(mid - 1);
        }
        if(mid + 1 <= high) {
            stack.push(mid + 1);
            stack.push(high);
        }
    } else {
        if(mid + 1 <= high) {
            stack.push(mid + 1);
            stack.push(high);
        }
        if(low <= mid - 1) {
            stack.push(low);
            stack.push(mid - 1);
        }
    }

    // how many elements to skip
    var toMid = (current < mid ? mid - current : (maxValue - current) + mid);

    // skip elements
    for(var i = 0; i < toMid - 1; i++) {
        nextElement();
    }
    current = mid;

    // get result
    return nextElement();
}

// test
var result;
var list = [];
do {
    result = randomBinaryWalk();
    list.push(result);
} while(result !== null);
document.body.innerHTML += '<br/>' + list.toString();
Here are the results from a couple of runs with a small set of 64 elements. JSFiddle: http://jsfiddle.net/yooLjtgu/
30,46,38,34,36,35,37,32,33,31,42,40,41,39,44,45,43,54,50,52,53,51,48,47,49,58,60,59,61,62,56,57,55,14,22,18,20,19,21,16,15,17,26,28,29,27,24,25,23,6,2,4,5,3,0,1,63,10,8,7,9,12,11,13
30,14,22,18,16,15,17,20,19,21,26,28,29,27,24,23,25,6,10,8,7,9,12,13,11,2,0,63,1,4,5,3,46,38,42,44,45,43,40,41,39,34,36,35,37,32,31,33,54,58,56,55,57,60,59,61,62,50,48,49,47,52,51,53
As I mentioned in my comment, unless you have an efficient way to skip to a specific point in your "on the fly" generation of the set, this will not be very efficient.
If the set is enumerable, then use a pseudo-random integer generator adjusted to the period 0 .. 2^n - 1, where the upper bound is just greater than the size of your set, and generate pseudo-random integers, discarding those greater than the size of your set. Use those integers to index items from your set.
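As a sketch of this approach in Python, here is a linear congruential generator whose modulus is the power of two just above the set size; the constants a = 5 and c = 3 satisfy the Hull-Dobell conditions for a full period modulo 2^k, but are otherwise arbitrary choices of mine:

# Visit every index below `size` exactly once, in a scrambled order.
def pseudo_random_indices(size):
    m = 1 << size.bit_length()   # period 2^k just above the set size
    a, c, x = 5, 3, 0            # a % 4 == 1 and c odd -> full period mod 2^k
    for _ in range(m):
        x = (a * x + c) % m
        if x < size:             # discard integers beyond the set size
            yield x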
Pre-compute a series of indices (e.g. in a file) which has the properties you need, then randomly choose a start index for your enumeration and use the series in a round-robin manner.
The length of your pre-computed series should be greater than the maximum size of the set.
If you combine this (depending on your programming language etc.) with file mappings, your final nextIndex(INOUT state) function is (nearly) as simple as return mappedIndices[state++ % PERIOD];, provided each entry has a fixed size (e.g. 8 bytes -> uint64_t).
Of course, the returned value could be > your current set size. Simply draw indices until you get one which is <= your set's current size.
Update (In response to question-update):
There is another option to achieve your goal, if it is about creating 10 billion unique characters for your RPG: generate a GUID and write yourself a function which computes your number from the GUID. See man uuid if you are on a unix system; otherwise google it. Some parts of a uuid are not random but contain meta-info; some parts are either systematic (such as your network card's MAC address) or random, depending on the generator algorithm. But they are very, very likely to be unique. So, whenever you need a new unique number, generate a uuid and transform it to your number by means of some algorithm which basically maps the uuid bytes to your number in a non-trivial way (e.g. using hash functions).
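A sketch of the uuid idea in Python (as said above, uniqueness is only very likely, not guaranteed; reducing a random uuid modulo the set size keeps collisions improbable rather than impossible):

# Map a random uuid's 128 bits into the set by taking a remainder.
import uuid

def next_number(set_size):
    return uuid.uuid4().int % set_size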
In this problem we consider only strings of lower-case English letters (a-z).
A string is a palindrome if it has exactly the same sequence of characters when traversed left-to-right as right-to-left. For example, the following strings are palindromes:
"kayak"
"codilitytilidoc"
"neveroddoreven"
A string A is an anagram of a string B if it consists of exactly the same characters, but possibly in another order. For example, the following strings are each other's anagrams:
A="mary" B="army" A="rocketboys" B="octobersky" A="codility" B="codility"
Write a function
int isAnagramOfPalindrome(String S);
which returns 1 if the string S is an anagram of some palindrome, or returns 0 otherwise.
For example, your function should return 1 for the argument "dooernedeevrvn", because it is an anagram of the palindrome "neveroddoreven". For the argument "aabcba", your function should return 0.
'Algorithm' would be too big a word for it.
You can construct a palindrome from a given character set if and only if each character occurs in that set an even number of times (with the possible exception of one character).
For any other set, you can easily show that no palindrome exists: each character off the center must be matched by an identical character at the mirrored position, so all characters pair up, and only the middle character can be left with an odd count.
The proof is simple in both cases, but let me know if that wasn't clear.
In a palindrome, every character must have a copy of itself, a "twin", on the other side of the string, except for the middle letter, which can act as its own twin.
The algorithm you seek would create a length-26 array, one slot for each lowercase letter, and count the characters in the string, placing the quantity of character n at index n of the array. Then it would pass through the array and count the letters with an odd quantity (letters where one occurrence does not have a twin). If this number is 0 or 1, place the single odd letter, if any, in the center, and a palindrome is easily generated. Otherwise, it's impossible to generate one, because two or more letters without twins exist, and they can't all be in the center.
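A short Python sketch of that counting scheme (the function name is my own):

# Count each letter into a length-26 array, then count odd quantities.
def can_form_palindrome(s):
    counts = [0] * 26
    for ch in s:
        counts[ord(ch) - ord('a')] += 1    # quantity of letter n at index n
    odd = sum(1 for c in counts if c % 2)  # letters left without a twin
    return odd <= 1                        # at most one letter may take the center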
I came up with this solution in JavaScript.
This solution is based on the premise that a string is an anagram of a palindrome if and only if at most one character appears an odd number of times in it.
function solution(S) {
    var retval = 0;
    var sorted = S.split('').sort(); // sort the input characters and store
                                     // them in a char array
    var array = new Array();
    for (var i = 0; i < sorted.length; i++) {
        // check if two adjacent chars are the same; if so, copy both chars
        // to the new array and additionally increment the counter to skip
        // the second char of the pair in the loop.
        if ((sorted[i] === sorted[i + 1]) && (sorted[i + 1] != undefined)) {
            array.push.apply(array, sorted.slice(i, i + 2));
            i = i + 1;
        }
    }
    // if at most one character of the original string was left unpaired
    if (sorted.length <= array.length + 1) {
        retval = 1;
    }
    //console.log("new array-> " + array);
    //console.log("sorted array-> " + sorted);
    return retval;
}
I wrote this code in Java. I don't know if it's going to be a good one ^^
public static int isAnagramOfPalindrome(String str) {
    // the list holds each character that is currently unpaired
    ArrayList<Character> a = new ArrayList<Character>();
    for (int i = 0; i < str.length(); i++) {
        if (a.contains(str.charAt(i))) {
            a.remove((Object) str.charAt(i));
        } else {
            a.add(str.charAt(i));
        }
    }
    if (a.size() > 1)
        return 0;
    return 1;
}
Algorithm:
Count the number of occurrences of each character.
Only one character with an odd number of occurrences is allowed, since in a palindrome the maximum number of characters with an odd count is 1.
All other characters should occur an even number of times.
If (2) or (3) fails, the given string is not an anagram of a palindrome.
This adds to the other answers given. We want to keep track of the count of each letter seen. If we have more than one letter with an odd count, we will not be able to form a palindrome: the letter with an odd count would go in the middle, but only one letter can do so.
We can use a hashmap to keep track of the counts. Lookup in a hashmap is O(1), so it is fast, and we are able to run the whole algorithm in O(n). Here it is in code:
if __name__ == '__main__':
    line = input()
    dic = {}
    for ch in line:
        if ch in dic:
            dic[ch] += 1
        else:
            dic[ch] = 1
    chars_whose_count_is_odd = 0
    for key, value in dic.items():
        if value % 2 == 1:
            chars_whose_count_is_odd += 1
    if chars_whose_count_is_odd > 1:
        print("NO")
    else:
        print("YES")
I have a neat solution in PHP posted in this question about complexities.
class Solution {
    // Function to determine if the input string can make a palindrome by rearranging it
    static public function isAnagramOfPalindrome($S) {
        // here I am counting how many characters have an odd number of occurrences
        $odds = count(array_filter(count_chars($S, 1), function($var) {
            return ($var & 1);
        }));
        // If the string length is odd, a palindrome has 1 character with an odd number of occurrences
        // If the string length is even, all characters should have an even number of occurrences
        return (int)($odds == (strlen($S) & 1));
    }
}

echo Solution::isAnagramOfPalindrome($_POST['input']);
It uses built-in PHP functions (why not?), but you can implement them yourself, as those functions are quite simple. First, the count_chars function generates an associative array (a dictionary, in Python terms) of all characters that appear in the string, with their numbers of occurrences. It can be substituted with a custom function like this:

$count_chars = array();
foreach (str_split($S) as $char) {
    if (array_key_exists($char, $count_chars)) {
        $count_chars[$char]++;
    } else {
        $count_chars[$char] = 1;
    }
}
Then, array_filter with a counting callback is applied to count how many characters have an odd number of occurrences:

$odds = 0;
foreach ($count_chars as $count) {
    $odds += $count % 2;
}

And then you just apply the comparison in the return statement (explained in the comments of the original function):

return ($odds == strlen($S) % 2);
This runs in O(n). All characters except at most one must occur an even number of times; the one optional odd character can occur any odd number of times.
e.g.
abababa
def anagram_of_pali(word):
    counts = {}
    nb_of_odds = 0
    for char in word:
        if char in counts:
            counts[char] += 1
        else:
            counts[char] = 1
    for char in counts:
        if counts[char] % 2 != 0:
            nb_of_odds += 1
    return nb_of_odds <= 1
You just have to count all the letters and check how many of them have odd counts. If more than one letter has an odd count, the string does not satisfy the palindrome condition above.
Furthermore, since the number of letters with odd counts always has the same parity as the string length, it is not necessary to check whether the string length is even or odd. This takes O(n) time:
Here's the implementation in javascript:
function canRearrangeToPalindrome(str)
{
    var letterCounts = {};
    var letter;
    var palindromeSum = 0;
    for (var i = 0; i < str.length; i++) {
        letter = str[i];
        letterCounts[letter] = letterCounts[letter] || 0;
        letterCounts[letter]++;
    }
    for (var letterCount in letterCounts) {
        palindromeSum += letterCounts[letterCount] % 2;
    }
    return palindromeSum < 2;
}
All right, it's been a while, but as I was asked such a question in a job interview, I wanted to give it a try in a few lines of Python. The basic idea is that if an anagram that is a palindrome exists, then for an even number of letters each character must occur an even number of times (i.e. count % 2 == 0). In addition, for an odd number of characters, one character (the one in the middle) may occur an odd number of times (count % 2 == 1).
I used a set in Python to get the unique characters, then simply count occurrences and break the loop once the condition can no longer be fulfilled. Example code (Python 3):
def is_palindrome(s):
    letters = set(s)
    oddc = 0
    fail = False
    for c in letters:
        if s.count(c) % 2 == 1:
            oddc = oddc + 1
            if oddc > 0 and len(s) % 2 == 0:
                fail = True
                break
            elif oddc > 1:
                fail = True
                break
    return not fail
def is_anagram_of_palindrome(S):
    L = [0 for _ in range(26)]       # one parity bit per lowercase letter
    a = ord('a')
    length = 0
    for s in S:
        length += 1
        i = ord(s) - a
        L[i] = abs(L[i] - 1)         # toggle: 1 means this letter's count is odd
    return length > 0 and sum(L) < 2 and 1 or 0   # 1 if at most one odd count
While you can detect that the given string "S" is a candidate palindrome using the given techniques, it is still not very useful. According to the implementations given,
isAnagramOfPalindrome("rrss") would return true, but there is no actual palindrome, because:
A palindrome is a word, phrase, number, or other sequence of symbols or elements, whose meaning may be interpreted the same way in either forward or reverse direction. (Wikipedia)
And rssr or srrs is not an actual word or phrase that is interpretable. The same goes for its anagram: aarrdd is not an anagram of radar, because it is not interpretable.
So, the solutions given must be augmented with a heuristic check of the input to see if it's even a word, then a verification (via the implementations given) that it is palindrome-able at all, and then a heuristic search through the collected buckets, with (n/2)! permutations, to see which of those are ACTUALLY palindromes and not garbage. The search is only (n/2)! and not n! because you calculate all permutations of one half of the repeated letters, and then mirror them over (in addition to possibly adding the singular pivot letter) to create all possible palindromes.
I disagree that 'algorithm' is too big a word for this, because the search can be done purely recursively, or using dynamic programming (in the case of words with letters whose occurrences are greater than 2), and is non-trivial.
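To illustrate that mirroring search, here is a small brute-force Python sketch (exponential in the word length, and it leaves out the dictionary check against real words):

# Permute one half of the letter multiset, then mirror each half around the
# optional pivot letter to enumerate every possible palindrome.
from collections import Counter
from itertools import permutations

def all_palindromes(word):
    counts = Counter(word)
    odd = [c for c, n in counts.items() if n % 2]
    if len(odd) > 1:
        return                          # not palindrome-able at all
    middle = odd[0] if odd else ""
    half = "".join(c * (n // 2) for c, n in counts.items())
    for p in set(permutations(half)):   # (n/2)! candidate halves
        h = "".join(p)
        yield h + middle + h[::-1]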
Here's some code; it implements the same idea as the top answer that describes the algorithm.
#include <iostream>
#include <string>
#include <vector>
#include <stack>
#include <algorithm>

using namespace std;

bool fun(string in)
{
    vector<char> input(in.begin(), in.end());
    sort(input.begin(), input.end());

    // after sorting, equal characters are adjacent, so pairs cancel on the stack
    stack<char> ret;

    for (size_t i = 0; i < input.size(); i++)
    {
        if (!ret.empty() && ret.top() == input.at(i))
        {
            ret.pop();
        }
        else
        {
            ret.push(input.at(i));
        }
    }

    // at most one unpaired character may remain (the middle of the palindrome)
    return ret.size() <= 1;
}

int main()
{
    string input;
    cout << "Enter word/number" << endl;
    cin >> input;
    cout << fun(input) << endl;
    return 0;
}
I'm familiar with the algorithm for reading a single random line from a file without reading the whole file into memory. I wonder if this technique can be extended to N random lines?
The use case is for a password generator which concatenates N random words pulled out of a dictionary file, one word per line (like /usr/share/dict/words). You might come up with angela.ham.lewis.pathos. Right now it reads the whole dictionary file into an array and picks N random elements from that array. I would like to eliminate the array, or any other in-memory storage of the file, and only read the file once.
(No, this isn't a practical optimization exercise. I'm interested in the algorithm.)
Update:
Thank you all for your answers.
Answers fell into three categories: modifications of the full-read algorithm, random seek, or indexing the lines and seeking to them randomly.
The random seek is much faster, and constant with respect to file size, but it distributes based on file size, not on the number of words. It also allows duplicates (which can be avoided, but that makes the algorithm O(inf)). Here is my reimplementation of my password generator using that algorithm. I realize that by reading forward from the seek point, rather than backwards, it has an off-by-one error should the seek fall inside the last line. Correcting it is left as an exercise for the editor.
#!/usr/bin/perl -lw

my $Words      = "/usr/share/dict/words";
my $Max_Length = 8;
my $Num_Words  = 4;

my $size = -s $Words;

my @words;
open my $fh, "<", $Words or die $!;

for(1..$Num_Words) {
    seek $fh, int rand $size, 0 or die $!;
    <$fh>;
    my $word = <$fh>;
    chomp $word;
    redo if length $word > $Max_Length;
    push @words, $word;
}
print join ".", @words;
And then there's Guffa's answer, which is what I was looking for: an extension of the original algorithm. It is slower, since it has to read the whole file, but it distributes by word, allows filtering without changing the efficiency of the algorithm, and (I think) has no duplicates.
#!/usr/bin/perl -lw

my $Words      = "/usr/share/dict/words";
my $Max_Length = 8;
my $Num_Words  = 4;

my @words;
open my $fh, "<", $Words or die $!;

my $count = 0;
while(my $line = <$fh>) {
    chomp $line;
    $count++;

    if( $count <= $Num_Words ) {
        $words[$count-1] = $line;
    }
    elsif( rand($count) <= $Num_Words ) {
        $words[rand($Num_Words)] = $line;
    }
}
print join ".", @words;
Finally, the index-and-seek algorithm has the advantage of distributing by word rather than by file size. The disadvantage is that it reads the whole file, and its memory usage scales linearly with the number of words in the file. Might as well use Guffa's algorithm.
The algorithm is not implemented in a very good and clear way in that example... Some pseudocode that explains it better would be:
cnt = 0
while not end of file {
    read line
    cnt = cnt + 1
    if random(1 to cnt) = 1 {
        result = line
    }
}
As you can see, the idea is that you read each line in the file and calculate the probability that that line should be the one chosen. After reading the first line the probability is 100%, after reading the second line it is 50%, and so on.
This can be expanded to picking N items by keeping an array of size N instead of a single variable, and calculating the probability that a line replaces one of the current ones in the array:
var result[1..N]
cnt = 0
while not end of file {
    read line
    cnt = cnt + 1
    if cnt <= N {
        result[cnt] = line
    } else if random(1 to cnt) <= N {
        result[random(1 to N)] = line
    }
}
Edit:
Here's the code implemented in C#:
public static List<string> GetRandomLines(string path, int count) {
    List<string> result = new List<string>();
    Random rnd = new Random();
    int cnt = 0;
    string line;
    using (StreamReader reader = new StreamReader(path)) {
        while ((line = reader.ReadLine()) != null) {
            cnt++;
            int pos = rnd.Next(cnt);
            if (cnt <= count) {
                result.Insert(pos, line);
            } else {
                if (pos < count) {
                    result[pos] = line;
                }
            }
        }
    }
    return result;
}
I made a test by running the method 100,000 times, picking 5 lines out of 20, and counting the occurrences of the lines. This is the result:
25105
24966
24808
24966
25279
24824
25068
24901
25145
24895
25087
25272
24971
24775
25024
25180
25027
25000
24900
24807
As you see, the distribution is as good as you could ever want. :)
(I moved the creation of the Random object out of the method when running the test, to avoid seeding problems, as the seed is taken from the system clock.)
Note:
You might want to scramble the order of the resulting array if you want the lines randomly ordered. As the first N lines are placed in order in the array, they are not randomly placed if they remain at the end. For example, if N is three or larger and the third line is picked, it will always be at the third position in the array.
Edit 2:
I changed the code to use a List<string> instead of a string[]. That makes it easy to insert the first N items in random order. I updated the test data from a new test run, so that you can see that the distribution is still good.
Now, my Perl is not what it used to be, but trusting the implicit claim in your reference (that the distribution of line numbers selected this way is uniform), it seems this should work:
srand;
(rand($.) < 1 && ($line1 = $_)) || (rand($.) < 1 && ($line2 = $_)) while <>;
Just like the original algorithm, this is one-pass and constant memory.
Edit
I just realized you need N, not 2. You can repeat the OR-ed expression N times if you know N in advance.
This is about the first time I've seen Perl code ... it is incredibly unreadable ... ;) But that shouldn't matter. Why don't you just repeat the cryptic line N times?
If I had to write this, I would just seek to a random position in the file, read to the end of the current line (the next newline), and then read one full line up to the following newline. Add some error handling for the case where you seeked into the last line, repeat all this N times, and you are done. I guess
srand;
rand($.) < 1 && ($line = $_) while <>;
is the Perl way to do such a single step. You could also read backwards from the initial position up to the previous newline or the beginning of the file, and then read a line forward again. But this doesn't really matter.
UPDATE
I have to admit that seeking somewhere into the file will not generate a perfectly uniform distribution, because of the differing line lengths. Whether this fluctuation matters depends on the usage scenario, of course.
If you need a perfectly uniform distribution, you need to read the whole file at least once to get the number of lines. In that case the algorithm given by Guffa is probably the cleverest solution, because it requires reading the file exactly once.
If you don't need to do it within the scope of Perl, shuf is a really nice command-line utility for this. To do what you're looking to do:
$ shuf -n N file > newfile
Quick and dirty bash
function randomLine {
    numlines=`wc -l $1 | awk '{print $1}'`
    t=`date +%s`
    t=`expr $t + $RANDOM`
    a=`expr $t % $numlines + 1`
    RETURN=`head -n $a $1 | tail -n 1`
    return 0
}
randomLine test.sh
echo $RETURN
Pick a random point in the file, scan backwards for the previous EOL, search forward for the current EOL, and return the line.
FILE *file = fopen("words.txt", "r");
int fs = filesize("words.txt");       /* assume some filesize() helper */
int ptr = rand() % fs;                /* 0 to fs-1 */
int start = max(ptr - MAX_LINE_LENGTH, 0);
int end = min(ptr + MAX_LINE_LENGTH, fs - 1);
int bufsize = end - start;

fseek(file, start, SEEK_SET);
char *buf = malloc(bufsize);
fread(buf, 1, bufsize, file);

char *startp = buf + ptr - start;
char *finp = buf + ptr - start + 1;
while (startp > buf && *startp != '\n') {
    startp--;
}
while (finp < buf + bufsize && *finp != '\n') {
    finp++;
}
*finp = '\0';
startp++;
return startp;
Lots of off-by-one errors and crap in there, bad memory management, and other horrors. If this actually compiles, you get a nickel. (Please send a self-addressed stamped envelope and $5 handling to receive your free nickel.)
But you should get the idea.
Longer lines statistically have a higher chance of being selected than shorter lines. But the run time of this is effectively constant regardless of file size. If you have a lot of words of mostly similar length, the statisticians won't be happy (they never are anyway), but in practice it will be close enough.
I'd say:
Read the file and count the number of \n characters. That's the number of lines; let's call it L.
Store their positions in a small array in memory.
Pick two random line numbers lower than L, fetch their offsets, and you're done.
You'd use just a small array and read the whole file once, plus 2 lines afterwards.
You could do a two-pass algorithm. First get the position of each newline, pushing those positions into a vector. Then pick random items from that vector; call such an index i.
Read from the file between positions v[i] and v[i+1] to get your line.
During the first pass, read the file with a small buffer so as not to read it all into RAM at once.
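A sketch of that two-pass idea in Python (binary mode so the byte offsets are exact; random.sample guarantees distinct lines):

# Pass 1: record each line's byte offset. Pass 2: seek straight to n picks.
import random

def random_lines(path, n):
    offsets, pos = [], 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
        lines = []
        for i in random.sample(range(len(offsets)), n):
            f.seek(offsets[i])
            lines.append(f.readline().decode().rstrip("\n"))
    return lines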