This code successfully builds an array of all possible five-letter combinations:
a = ('aaaaa'..'zzzzz').to_a
However, when I try six or more characters, it runs for about 10 minutes and then the task is killed. Is there any way to make it finish without being killed? Is it limited by hardware?
You are indeed limited by hardware. In oversimplified terms, you are facing two limitations here: processing power and memory capacity.
The formula for permutations with repetition (n**k) tells us that you are trying to generate and process 26**6 = 308_915_776 elements.
(x..y) creates a Range, which knows how to generate all of its elements but doesn't do so eagerly. When you call Range#to_a, however, Ruby tries to generate all of those elements and hold them in memory at once. After some time, the process runs out of memory and dies.
To avoid the memory restriction, you could instead take advantage of the fact that Range is also Enumerable. For example:
('aaaaaaa'..'zzzzzzz').each { |seven_letter_word| puts seven_letter_word }
will instantly start printing strings. Eventually (after a lot of waiting) it will loop through all of them.
However, note that this lets you bypass the memory restriction but not the processing one. For that there are no shortcuts other than understanding the specifics of the problem at hand.
I am trying to develop the following lock-free code (C++11):
int val_max;
std::array<std::atomic<int>, 255> vector;
if (vector[i] > val_max) {
    val_max = vector[i];
}
The problem is that with many threads (128 of them) the result is not correct. For example, if val_max = 1 and three threads (with vector[i] = 5, 15 and 20) all pass the comparison before any of them writes, they will all execute the assignment on the next line, which is a data race.
So I don't know the best way to solve this with the facilities available in C++11 while protecting the whole code (I cannot use mutexes or locks).
Any suggestions? Thanks in advance!
You need to describe the bigger problem you need to solve, and why you want multiple threads for this.
If you have a lot of data and want to find the maximum, and you want to split that problem, you're doing it wrong. If all threads try to access a shared maximum, not only is it hard to get right, but by the time you have it right, you have fully serialized accesses, thus making the entire thing an exercise in adding complexity and thread overhead to a serial program.
The correct way to make it parallel is to give every thread a chunk of the array to work on (and the array members are not atomics), for which the thread calculates a local maximum, then when all threads are done have one thread find the maximum of the individual results.
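The chunked approach described above could be sketched roughly as follows. This is only a minimal illustration in C++11: the function name parallel_max, the plain std::vector<int> input and the num_threads parameter are assumptions made for the example, not something taken from the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not from the question): maximum of a non-empty vector,
// computed by up to num_threads workers that each scan one chunk.
int parallel_max(const std::vector<int>& data, std::size_t num_threads) {
    assert(!data.empty());
    num_threads = std::max<std::size_t>(1, std::min(num_threads, data.size()));
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    // One slot per thread; seeding with data[0] keeps unused slots harmless.
    std::vector<int> local_max(num_threads, data[0]);
    std::vector<std::thread> workers;

    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        if (begin >= data.size()) break;  // no elements left for this thread
        const std::size_t end = std::min(begin + chunk, data.size());
        workers.emplace_back([&data, &local_max, t, begin, end] {
            int best = data[begin];
            for (std::size_t i = begin + 1; i < end; ++i) {
                best = std::max(best, data[i]);
            }
            local_max[t] = best;  // each thread writes only its own slot
        });
    }
    for (auto& w : workers) w.join();

    // Serial reduction over the handful of per-thread results.
    return *std::max_element(local_max.begin(), local_max.end());
}
```

No atomics are needed here: the input is only read, and every thread writes to a distinct element of local_max before the join.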
1. Do an atomic fetch of val_max.
2. If the value fetched is greater than or equal to vector[i], stop, you are done.
3. Do an atomic compare exchange: compare val_max to the value you read in step 1, and if they are equal, exchange it for the value of vector[i].
4. If the compare succeeded, stop, you are done.
5. Go to step 1; you raced with another thread that made forward progress.
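In C++11 those steps might look roughly like the sketch below. It assumes val_max is changed to a std::atomic<int> (in the question it is a plain int), and the names atomic_store_max and demo_max are made up for illustration.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper: raise `target` to `candidate` if candidate is larger.
void atomic_store_max(std::atomic<int>& target, int candidate) {
    int observed = target.load();  // atomic fetch of the current maximum
    // Stop as soon as the stored value is already >= candidate.
    while (observed < candidate &&
           // Swap in candidate only if target still equals `observed`.
           // On failure, `observed` is refreshed with the current value,
           // which plays the role of "go back and fetch again".
           !target.compare_exchange_weak(observed, candidate)) {
    }
}

// Usage sketch with the values from the question: val_max starts at 1 and
// three threads publish 5, 15 and 20 concurrently.
int demo_max() {
    std::atomic<int> val_max(1);  // assumption: val_max made atomic
    std::vector<std::thread> threads;
    for (int v : {5, 15, 20}) {
        threads.emplace_back([&val_max, v] { atomic_store_max(val_max, v); });
    }
    for (auto& t : threads) t.join();
    return val_max.load();
}
```

compare_exchange_weak may fail spuriously, which is harmless here because the loop simply retries with the refreshed value.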
Let's say that I have a variable x.
x = 0
I then spawn some number of threads, and each of them may or may not run the following expression WITHOUT the use of atomics.
x |= 1
After all threads have joined with my main thread, the main thread branches on the value.
if(x) { ... } else { ... }
Is it possible for there to be a race condition in this situation? My thoughts say no, because it doesn't seem to matter whether or not a thread is interrupted by another thread between reading and writing 'x' (either way, the result is 'x == 1'). That said, I want to make sure I'm not missing something stupidly obvious or ridiculously subtle.
Also, if you happen to provide an answer to the contrary, please provide an instruction-by-instruction example!
Context:
I'm trying to, in OpenCL, have my threads indicate the presence or absence of a feature among any of their work-items. If any of the threads indicate the presence of the feature, my host ought to be able to branch on the result. I'm thinking of using the above method. If you guys have a better suggestion, that works too!
Detail:
I'm trying to add early-exit to my OpenCL radix-sort implementation, to skip radix passes if the data is banded (i.e. 'x' above would be x[RADIX] and I'd have all work groups, right after partial reduction of the data, indicate presence or absence of elements in the RADIX bins via 'x').
It may work within a work-group. You will need to insert a barrier before testing x. I'm not sure it will be faster than using atomic increments.
It will not work across several work-groups. Imagine you have 1000 work-groups to run on 20 cores. Typically, only a small number of work-groups can be resident on a single core, for example 4, meaning only 80 work-groups can be in flight inside the GPU at a given time. Once a work-group is done executing, it is retired, and another one is started. Halting a kernel in the middle of execution to wait for all 1000 work-groups to reach the same point is impossible.
I have a function that I use to look up a value based on an index. The value takes some time to calculate, so I want to do it with ParallelMap, and references another similar such function that returns a list of expressions, also based on an index.
However, when I set it all up in a seemingly reasonable fashion, I see some very bizarre behaviour. First, I see that the function appears to work, albeit very slowly. For large indexes, however, the processor activity in Taskmangler stays entirely at zero for an extended period of time (i.e. 2-4 minutes) where all instances of Mathematica are seemingly inert. Then, without the slightest blip of CPU use, a result appears. Is this another case of Mathematica spukhafte Fernwirkung?
That is, I want to create a variable/function that stores an expression, here a list of integers (ListOfInts), and then on the parallel workers I want to perform some function on that expression (here I apply a set of replacement rules and take the Min). I want the result of that function to also be indexed by the same index under another variable/function (IndexedFunk), whose result is then available back on the main instance of Mathematica:
(*some arbitrary rules that will convert some of the integers to negative values:*)
rulez=Dispatch[Thread[Rule[Range[222],-Range[222]]]];
maxIndex = 333;
Clear[ListOfInts]
Scan[(ListOfInts[#]=RandomInteger[{1,999},55])&,Range[maxIndex ]]
(*just for safety's sake:*)
DistributeDefinitions[rulez, ListOfInts]
Clear[IndexedFunk]
(*I believe I have to have at least one value of IndexedFunk defined before I Share the definition to the workers:*)
IndexedFunk[1]=Min[ListOfInts[1]]/.rulez
(*... and this should let me retrieve the values back on the primary instance of MMA:*)
SetSharedFunction[IndexedFunk]
(*Now, here is the mysterious part: this just sits there on my multiprocessor machine for many minutes until suddenly a result appears. If I up maxIndex to say 99999 (and of course re-execute the above code again) then the effect can more clearly be seen.*)
AbsoluteTiming[Short[ParallelMap[(IndexedFunk[#]=Min[ListOfInts[#]/.rulez])&, Range[maxIndex]]]]
I believe this is some bug, but then I am still trying to figure out Mathematica Parallel, so I can't be too confident in this conclusion. Despite its being depressingly slow, it is nonetheless impressive in its ability to perform calculations without actually requiring a CPU to do so.
I thought perhaps it was due to whatever communications protocol is being used between the master and slave processes: perhaps it is so slow that it just appears that the processors are doing nothing when in fact they are just waiting to send the next bit of some definition or other. In which case I thought ParallelMap[..., Method->"CoarsestGrained"] would be of some use. But no, that doesn't work either.
A question: "Am I doing something obviously wrong, or is this a bug?"
I am afraid you are. The problem is with the shared definition of a variable. Mathematica maintains a single coherent value across all copies of the variable in all kernels, and therefore that variable becomes a single point of huge contention. The CPU is idle because the kernels line up in a queue waiting for the variable IndexedFunk, and most of the time is spent in interprocess or inter-machine communication. Go figure.
By the way, there is no function SetSharedDefinition in any Mathematica version I know of. You probably intended to write SetSharedVariable. But remove that evil call anyway! To avoid contention, return results from the parallelized computation as a list of pairs, and then assemble them into downvalues of your variable at the main kernel:
Clear[IndexedFunk]
Scan[(IndexedFunk[#[[1]]] = #[[2]]) &,
ParallelMap[{#, Min[ListOfInts[#] /. rulez]} &, Range[maxIndex]]
]
ParallelMap takes care of distributing definitions automagically, so the call to DistributeDefinitions is superfluous. (As a minor note, it is not correct as written, since it omits the maxIndex variable, but the omission is automatically taken care of by ParallelMap in this particular case.)
EDIT, NB!: The automatic distribution applies only to version 8 of Mathematica. Thanks @MikeHoneychurch for the correction.
Consider something like...
for (int i = 0; i < test.size(); ++i) {
    test[i].foo();
    test[i].bar();
}
Now consider..
for (int i = 0; i < test.size(); ++i) {
    test[i].foo();
}
for (int i = 0; i < test.size(); ++i) {
    test[i].bar();
}
Is there a large difference in time spent between these two? I.e. what is the cost of the actual iteration? It seems like the only real operations you are repeating are an increment and a comparison (though I suppose this would become significant for a very large n). Am I missing something?
First, as noted above, if your compiler can't optimize the size() method out so it's just called once, or is nothing more than a single read (no function call overhead), then it will hurt.
There is a second effect you may want to be concerned with, though. If your container size is large enough, then the first case will perform faster. This is because, when it gets to test[i].bar(), test[i] will be cached. The second case, with split loops, will thrash the cache, since test[i] will always need to be reloaded from main memory for each function.
Worse, if your container (std::vector, I'm guessing) has so many items that it won't all fit in memory, and some of it has to live in swap on your disk, then the difference will be huge as you have to load things in from disk twice.
However, there is one final thing that you have to consider: all this only makes a difference if there is no order dependency between the function calls (really, between different objects in the container). Because, if you work it out, the first case does:
test[0].foo();
test[0].bar();
test[1].foo();
test[1].bar();
test[2].foo();
test[2].bar();
// ...
test[test.size()-1].foo();
test[test.size()-1].bar();
while the second does:
test[0].foo();
test[1].foo();
test[2].foo();
// ...
test[test.size()-1].foo();
test[0].bar();
test[1].bar();
test[2].bar();
// ...
test[test.size()-1].bar();
So if your bar() assumes that all foo()'s have run, you will break it if you change the second case to the first. Likewise, if bar() assumes that foo() has not been run on later objects, then moving from the first case to the second will break your code.
So be careful and document what you do.
There are many aspects to such a comparison.
First, the complexity of both options is O(n), so the difference isn't very big anyway. I mean, you need not care about it if you are writing a fairly big and complex program with a large n and "heavy" operations .foo() and .bar(). You only need to care about it for very small, simple programs (the kind written for embedded devices, for example).
Second, it will depend on the programming language and compiler. I believe that, for instance, many C++ compilers can optimize your second option into the same code as the first, at least when foo() and bar() are simple enough for the compiler to see that reordering them is safe.
Third, if the compiler hasn't optimized your code, the performance difference will depend heavily on the target processor. Consider the loop in terms of assembly instructions; it will look something like this (pseudo assembly language):
LABEL L1:
do this ;; some commands
call that
IF condition
goto L1
;; some more instructions, ELSE part
That is, each pass through the loop is just an IF statement. But modern processors don't like IFs. Processors may reorder instructions to execute them ahead of time or simply to avoid idling, and with an IF (in fact, a conditional goto or jump) the processor does not know in advance whether it may reorder the operations or not.
There's also a mechanism called the branch predictor. From Wikipedia:
branch predictor is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known for sure.
This softens the effect of IFs, though if the predictor's guess is wrong, the benefit is lost.
So you can see that there are a lot of variables for both of your options: the target language and compiler, the target machine, its processor and its branch predictor. All of this makes for a very complex system, and you cannot foresee exactly what result you will get. I believe that, unless you are dealing with embedded systems or something like that, the best solution is just to use the form you are more comfortable with.
For your examples you have the additional concern of how expensive .size() is, since it's compared for each time i increments in most languages.
How expensive is it? Well, that depends; it's certainly all relative. If .foo() and .bar() are expensive, the cost of the actual iteration is probably minuscule in comparison. If they're pretty lightweight, then it'll be a larger percentage of your execution time. If you want to know about a particular case, test it; this is the only way to be sure about your specific scenario.
Personally, I'd go with the single iteration to be on the cheap side for sure (unless you need the .foo() calls to happen before the .bar() calls).
I assume .size() will be constant. Otherwise, the first code example might not give the same as the second one.
Most compilers would probably store .size() in a variable before the loop starts, so the .size() time will be cut down.
Therefore the time spent on the work inside the two for loops will be the same, but the loop overhead itself (the increment and the comparison) will be doubled.
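If you would rather not rely on the compiler to do that, the size can be read once by hand. A minimal sketch, assuming test is a std::vector of some element type with foo() and bar() members; the Item struct and the process function below are hypothetical stand-ins, not code from the question:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical element type standing in for whatever `test` holds.
struct Item {
    int foo_calls = 0;
    int bar_calls = 0;
    void foo() { ++foo_calls; }
    void bar() { ++bar_calls; }
};

// Single loop; size() is evaluated once, in the loop initializer.
void process(std::vector<Item>& test) {
    for (std::size_t i = 0, n = test.size(); i < n; ++i) {
        test[i].foo();
        test[i].bar();
    }
}
```

This is only valid as long as nothing inside the loop changes the size of the container.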
Performance tag, right.
As long as you are concentrating on the "cost" of this or that minor code segment, you are oblivious to the bigger picture (isolation); and your intention is to justify something that, at a higher level (outside your isolated context), is simply bad practice, and breaks guidelines. The question is too low level and therefore too isolated. A system or program which is a set of integrated components will perform much better than a collection of isolated components.
The fact that this or that isolated component (work inside the loop) is fast or faster is irrelevant when the loop itself is repeated unnecessarily, and which would therefore take twice the time.
Given that you have one family car (CPU), why on Earth would you:
sit at home and send your wife out to do her shopping
wait until she returns
take the car, go out and do your shopping
leaving her to wait until you return
If it needs to be stated, you would spend (a) almost half of your hard-earned resources executing one trip and shopping at the same time and (b) have those resources available to have fun together when you get home.
It has nothing to do with the price of petrol at 9:00 on a Saturday, or the time it takes to grind coffee at the café, or cost of each iteration.
Yes, there is a large diff in the time and the resources used. But the cost is not merely in the overhead per iteration; it is in the overall cost of the one organised trip vs the two serial trips.
Performance is about architecture: never doing anything twice (that you can do once), which belongs to the higher levels of organisation, the integration of the parts that make up the whole. It is not about counting pennies at the bowser or cycles per iteration; those are lower orders of organisation, which are just a collection of fragmented parts (not a systemic whole).
Maseratis cannot get through traffic jams any faster than station wagons.
I have an array of integers
a = [1,2,3,4]
When I do
a.join
Ruby internally calls the to_s method 4 times, which is too slow for my needs.
What is the fastest method to output a big array of integers to the console?
I mean:
a = [1,2,3,4........,1,2,3,9], should be:
1234........1239
If you want to print an integer to stdout, you need to convert it to a string first, since that's all stdout understands. If you want to print two integers to stdout, you need to convert both of them to a string first. If you want to print three integers to stdout, you need to convert all three of them to a string first. If you want to print one billion integers to stdout, you need to convert all one billion of them to a string first.
There's nothing you, we, or Ruby, or really any programming language can do about that.
You could try interleaving the conversion with the I/O by doing a lazy stream implementation. You could try to do the conversion and the I/O in parallel, by doing a lazy stream implementation and separating the conversion and the I/O into two separate threads. (Be sure to use a Ruby implementation which can actually execute parallel threads, not all of them can: MRI, YARV and Rubinius can't, for example.)
You can parallelize the conversion, by converting separate chunks in the array in separate threads in parallel. You can even buy a billion core machine and convert all billion integers at the same time in parallel.
But even then, the fact of the matter remains: every single integer needs to be converted. Whether you do that one after the other first, and then print them or do it one after the other interleaved with the I/O or do it one after the other in parallel with the I/O or even convert all of them at the same time on a billion core CPU: the number of needed conversions does not magically decrease. A large number of integers means a large number of conversions. Even if you do all billion conversions in a billion core CPU in parallel, it's still a billion conversions, i.e. a billion calls to to_s.
As stated in the comments above, if Fixnum#to_s is not performing quickly enough for you, then you really need to consider whether Ruby is the correct tool for this particular task.
However, there are a couple of things you could do that may or may not be applicable for your situation.
If the building of the array happens outside the time critical area then build the array, or a copy of the array with strings instead of integers. With my small test of 10000 integers this is approximately 5 times faster.
If you control both the reading and the writing process then use Array.pack to write the output and String.unpack to read the result. This may not be quicker as pack seems to call Fixnum.to_int even when the elements are already Integers.
I expect these figures would be different with each version of Ruby so it is worth checking for your particular target version.
The slowness in your program does not come from to_s being called 4 times, but from printing to the console. Console output is slow, and you can't really do anything about it.
For single digits you can do this
[1,2,3,4,5].map{|x|(x+48).chr}.join
If you need to speed up larger numbers you could try memoizing the result of to_s
Unless you really need to see the numbers on the console (and it sounds like you do not), write them to a file in binary; that should be much faster.
And you can pipe binary files into other programs if that is what you need to do, not just text.