How to print array of integers into console quickly? - ruby

I have an array of integers
a = [1,2,3,4]
When I do
a.join
Ruby internally calls the to_s method 4 times, which is too slow for my needs.
What is the fastest method to output an big array of integers to console?
I mean:
a = [1,2,3,4........,1,2,3,9], should be:
1234........1239

If you want to print an integer to stdout, you need to convert it to a string first, since that's all stdout understands. If you want to print two integers to stdout, you need to convert both of them to a string first. If you want to print three integers to stdout, you need to convert all three of them to a string first. If you want to print one billion integers to stdout, you need to convert all one billion of them to a string first.
There's nothing you, we, or Ruby, or really any programming language can do about that.
You could try interleaving the conversion with the I/O by doing a lazy stream implementation. You could try to do the conversion and the I/O in parallel, by doing a lazy stream implementation and separating the conversion and the I/O into two separate threads. (Be sure to use a Ruby implementation which can actually execute parallel threads, not all of them can: MRI, YARV and Rubinius can't, for example.)
You can parallelize the conversion, by converting separate chunks in the array in separate threads in parallel. You can even buy a billion core machine and convert all billion integers at the same time in parallel.
But even then, the fact of the matter remains: every single integer needs to be converted. Whether you do that one after the other first, and then print them or do it one after the other interleaved with the I/O or do it one after the other in parallel with the I/O or even convert all of them at the same time on a billion core CPU: the number of needed conversions does not magically decrease. A large number of integers means a large number of conversions. Even if you do all billion conversions in a billion core CPU in parallel, it's still a billion conversions, i.e. a billion calls to to_s.

As stated in the comments above if Fixnum.to_s is not performing quickly enough for you then you really need to consider whether Ruby is the correct tool for this particular task.
However, there are a couple of things you could do that may or may not be applicable for your situation.
If the building of the array happens outside the time critical area then build the array, or a copy of the array with strings instead of integers. With my small test of 10000 integers this is approximately 5 times faster.
If you control both the reading and the writing process then use Array.pack to write the output and String.unpack to read the result. This may not be quicker as pack seems to call Fixnum.to_int even when the elements are already Integers.
I expect these figures would be different with each version of Ruby so it is worth checking for your particular target version.

The slowness in you program does not come from to_s being called 4 times, but from printing to the console. Console output is slow, and you can't really do anything about it.

For single digits you can do this
[1,2,3,4,5].map{|x|(x+48).chr}.join
If you need to speed up larger numbers you could try memoizing the result of to_s

Unless you really need to see the numbers on the console (and it sound like you do not) then write them to a file in binary - should be much faster.
And you can pipe binary files into other programs if that is what you need to do, not just text.

Related

Get all possible letter combinations with more than 5 letters

This code successfully would make array with all possible letter combinations with 5 characters.
a = ('aaaaa'..'zzzzz').to_a
however, when I try to go for 6+ characters, it takes like 10 mins, and then kills the task. Is there any way for it to actually load without killing the task? Is it limited by hardware?
You are indeed limited by hardware. In oversimplified terms, there are two limitation that you are facing here - processing power and memory capacity.
The "k-permutations of n" formula will tell us that you are trying to generate and process 26**6 = 308_915_776 elements.
(x..y) creates a Range, which knows how to generate all of its elements, but doesn't eagerly do so. When you call Range#to_a however, your processor tries to generate all those elements. After some time, the process runs out of memory and dies.
To avoid the memory restriction, you could instead take advantage of the fact that Range is also Enumerable. For example:
('aaaaaaa'..'zzzzzzz').each { |seven_letter_word| puts seven_letter_word }
will instantly start printing strings. Eventually (after a lot of waiting) it will loop through all of them.
However, note that this will let you bypass the memory restriction, but not the processing one. For that there are no shortcuts other than understand the specifics of the problem at hand.

Fortran implied do write speedup

tl;dr: I found that an "implied do" write was slower than an explicit one under certain circumstances, and want to understand why/if I can improve this.
Details:
I've got a code that does something to the effect of:
DO i=1,n
calculations...
!m, x, and y all change each pass through the loop
IF(m.GT.1)THEN
DO j=1,m
WRITE(10,*)x(j),y(j) !where 10 is an output file
ENDDO
ENDIF
ENDDO
The output file ends up being fairly large, and so it seems like the writing is a big performance factor, so I wanted to optimize it. Before anyone asks, no, moving away from ASCII isn't an option due to various downstream requirements. Accordingly, I rewrote the IF statement (and contents) as:
IF(m.GT.1)THEN
!build format statement for write
WRITE(mm1,*)m-1
mm1=ADJUSTL(mm1)
!implied do write statement
WRITE(10,'('//TRIM(mm1)//'(i9,1x,f7.5/),i9,1x,f7.5)')(x(j),y(j),j=1,m)
ELSEIF(m.EQ.1)THEN
WRITE(10,'(i9,1x,f7.5)')x(1),y(1)
ENDIF
This builds the format statement according to the # of values to be written out, then does a single write statement to output things. I've found that the code actually runs slower with this formulation. For reference, I've seen significant speedup on the same system (hardware and software) when going to an implied do write statement when the amount of data to be written was fixed. Under the assumption that the WRITE statement, itself, is faster, then that would mean the overhead from the couple of lines building that statement are what take the added time, but that seems hard to believe. For reference, m can vary a fair amount, but probably averages at least 1000. Is the concatenation of strings // a very slow operator, or is there something else I'm missing? Thanks in advance.
I haven't specific timing information to add, but your data transfer with an implied do loop is needlessly complicated.
In the first fragment, with the explicit looping, you are writing each pair of numbers to distinct records and you wish to repeat this output with the implied do loop. To do this, you use the slash edit descriptor to terminate each record once a pair has been written.
The needless complexity comes from two areas:
you have distinct cases for one/more than one pair;
for the more-than-one case you construct a format including a "dynamic" repeat count.
As Vladimir F comments you could just use a very large repeat count: it isn't erroneous for an edit descriptor to be processed when there are no more items to be written. The output terminates (successfully) when reaching such a non-matching descriptor. You could, then, just write
WRITE(10,'(*(i9,1x,f7.5/))') (x(j),y(j),j=1,m) ! * replacing a large count
rather than the if construct and the format creation.
Now, this doesn't quite match your first output. As I mentioned above, output termination comes about when a data edit descriptor is reached when there is no corresponding item to output. This means that / will be processed before that happens: you have a final empty record.
The colon edit descriptor is useful here:
WRITE(10,'(*(i9,1x,f7.5,:,/))') (x(j),y(j),j=1,m)
On reaching a : processing stops immediately if there is no remaining output item to process.
But my preferred approach is the far simpler
WRITE(10,'(i9,1x,f7.5)') (x(j),y(j),j=1,m) ! No repeat count
You had the more detailed format to include record termination. However, we have what is known as format reversion: if a format end is reached and more remains to be output then the record is terminated and processing goes back to the start of the format.
Whether these things make your output faster remains to be seen, but they certainly make the code itself much cleaner and clearer.
As a final note, it used to be trendy to avoid additional X editing. If your numbers fit inside the field of width 7 then 1x,f7.5 could be replaced by f8.5 and have the same look: the representation is right-justified in the field. It was claimed that this reduction had performance benefits with fewer switching of descriptors.

Fortran unformatted I/O optimization

I'm working on a set of Fortran programs that are heavily I/O bound, and so am trying to optimize this. I've read at multiple places that writing entire arrays is faster than individual elements, i.e. WRITE(10)arr is faster than DO i=1,n; WRITE(10) arr(i); ENDDO. But, I'm unclear where my case would fall in this regard. Conceptually, my code is something like:
OPEN(10,FILE='testfile',FORM='UNFORMATTED')
DO i=1,n
[calculations to determine m values stored in array arr]
WRITE(10) m
DO j=1,m
WRITE(10) arr(j)
ENDDO
ENDDO
But m may change each time through the DO i=1,n loop such that writing the whole array arr isn't an option. So, collapsing the DO loop for writing would end up with WRITE(10) arr(1:m), which isn't the same as writing the whole array. Would this still provide a speed-up to writing, what about reading? I could allocate an array of size m after the calculations, assign the values to that array, write it, then deallocate it, but that seems too involved.
I've also seen differing information on implied DO loop writes, i.e. WRITE(10) (arr(j),j=1,m), as to whether they help/hurt on I/O overhead.
I'm running a couple of tests now, and intend to update with my observations. Other suggestions on applicable
Additional details:
The first program creates a large file, the second reads it. And, no, merging the two programs and keeping everything in memory isn't a valid option.
I'm using unformatted I/O and have access to the Portland Group and gfortran compilers. It's my understanding the PG's is generally faster, so that's what I'm using.
The output file is currently ~600 GB, the codes take several hours to run.
The second program (reading in the file) seems especially costly. I've monitored the system and seen that it's mostly CPU-bound, even when I reduce the code to little more than reading the file, indicating that there is very significant CPU overhead on all the I/O calls when each value is read in one-at-a-time.
Compiler flags: -O3 (high optimization) -fastsse (various performance enhancements, optimized for SSE hardware) -Mipa=fast,inline (enables aggressive inter-procedural analysis/optimization on compiler)
UPDATE
I ran the codes with WRITE(10) arr(1:m) and READ(10) arr(1:m). My tests with these agreed, and showed a reduction in runtime of about 30% for the WRITE code, the output file is also slightly less than half the original's size. For the second code, reading in the file, I made the code do basically nothing but read the file to compare pure read time. This reduced the run time by a factor of 30.
If you use normal unformatted (record-oriented) I/O, you also write a record marker before and after the data itself. So you add eight bytes (usually) of overhead to each data item, which can easily (almost) double the data written to disc if your number is a double precision. The runtime overhead mentioned in the other answers is also significant.
The argument above does not apply if you use unformatted stream.
So, use
WRITE (10) m
WRITE (10) arr(1:m)
For gfortran, this is faster than an implied DO loop (i.e. the solution WRITE (10) (arr(i),i=1,m)).
In the suggested solution, an array descriptor is built and passed to the library with a single call. I/O can then be done much more efficiently, in your case taking advantage of the fact that the data is contiguous.
For the implied DO loop, gfortran issues multiple library calls, with much more overhead. This could be optimized, and is subject of a long-standing bug report, PR 35339, but some complicated corner cases and the presence of a viable alternative have kept this from being optimized.
I would also suggest doing I/O in stream access, not because of the rather insignificant saving in space (see above) but because keeping up the leading record marker up to date on writing needs a seek, which is additional effort.
If your data size is very large, above ~ 2^31 bytes, you might run into different behavior with record markers. gfortran uses subrecords in this case (compatible to Intel), but it should just work. I don't know what Portland does in this case.
For reading, of course, you can read m, then allocate an allocatable array, then read the whole array in one READ statement.
The point of avoiding outputting an array by looping over multiple WRITE() operations is to avoid the multiple WRITE() operations. It's not particularly important that the data being output are all the members of the array.
Writing either an array section or a whole array via a single WRITE() operation is a good bet. An implied DO loop cannot be worse than an explicit outer loop, but whether it's any better is a question of compiler implementation. (Though I'd expect the implied-DO to be better than an outer loop.)

How to first check files on equality before doing a byte by byte comparison?

I am writing a program that compare a lot of files.
I first group files by filesize. Then I check them byte by byte between grouped files. What params or propeties can I check before byte by byte comparsion to minimize using it?
Upd:
To get check sum i need to read entire file. I seek some property that can filter unequal files. I forgot to say that i need 100% equal of files. Hash functions have collision.
If the files are recorded as being the same size by the operating system then there is no way to know if they are different other than checking bytes.
For a group of files, once two files are known to be the same, then the comparison only needs to be done for one of the two. It would be wise to sort the files in a group by date for this reason, on the theory that files with similar dates are more likely to be identical. Thus, you should maintain lists of identical files. When a new comparison is done it need only be compared to the head of the list.
You should allocate as much memory as possible up front and keep the list heads in memory.
When the comparison is being done you should not actually compare bytes, but words. For example, on a 32-bit machine you would read data in 512-byte blocks from the hard drive and then each block would be compared 4-bytes at a time. Newer x86 processors have vectorized op instructions called MMX. You want to be sure you are using those.
If you are writing in C for an Intel box, use Intel's compiler, not Microsoft's. Double check the assembly to make sure the compiler is not doing something stupid.
You can also increase the speed of the work by parallelizing it. This is done by creating threads. For example, if the code is running on a quad core machine you create 4 threads and divide the work among the 4 threads.
Check file's checksum. It was mend for this task
For Python you can use hashlib. For C you can use, for example, md5 from openssl. There are similar functions for php, MySQL, and probably for every other programming language
Eventually you can use linux built-in md5sum

Does Kernel::srand have a maximum input value?

I'm trying to seed a random number generator with the output of a hash. Currently I'm computing a SHA-1 hash, converting it to a giant integer, and feeding it to srand to initialize the RNG. This is so that I can get a predictable set of random numbers for an set of infinite cartesian coordinates (I'm hashing the coordinates).
I'm wondering whether Kernel::srand actually has a maximum value that it'll take, after which it truncates it in some way. The docs don't really make this obvious - they just say "a number".
I'll try to figure it out myself, but I'm assuming somebody out there has run into this already.
Knowing what programmers are like, it probably just calls libc's srand(). Either way, it's probably limited to 2^32-1, 2^31-1, 2^16-1, or 2^15-1.
There's also a danger that the value is clipped when cast from a biginteger to a C int/long, instead of only taking the low-order bits.
An easy test is to seed with 1 and take the first output. Then, seed with 2i+1 for i in [1..64] or so, take the first output of each, and compare. If you get a match for some i=n and all greater is, then it's probably doing arithmetic modulo 2n.
Note that the random number generator is almost certainly limited to 32 or 48 bits of entropy anyway, so there's little point seeding it with a huge value, and an attacker can reasonably easily predict future outputs given past outputs (and an "attacker" could simply be a player on a public nethack server).
EDIT: So I was wrong.
According to the docs for Kernel::rand(),
Ruby currently uses a modified Mersenne Twister with a period of 2**19937-1.
This means it's not just a call to libc's rand(). The Mersenne Twister is statistically superior (but not cryptographically secure). But anyway.
Testing using Kernel::srand(0); Kernel::sprintf("%x",Kernel::rand(2**32)) for various output sizes (2*16, 2*32, 2*36, 2*60, 2*64, 2*32+1, 2*35, 2*34+1), a few things are evident:
It figures out how many bits it needs (number of bits in max-1).
It generates output in groups of 32 bits, most-significant-bits-first, and drops the top bits (i.e. 0x[r0][r1][r2][r3][r4] with the top bits masked off).
If it's not less than max, it does some sort of retry. It's not obvious what this is from the output.
If it is less than max, it outputs the result.
I'm not sure why 2*32+1 and 2*64+1 are special (they produce the same output from Kernel::rand(2**1024) so probably have the exact same state) — I haven't found another collision.
The good news is that it doesn't simply clip to some arbitrary maximum (i.e. passing in huge numbers isn't equivalent to passing in 2**31-1), which is the most obvious thing that can go wrong. Kernel::srand() also returns the previous seed, which appears to be 128-bit, so it seems likely to be safe to pass in something large.
EDIT 2: Of course, there's no guarantee that the output will be reproducible between different Ruby versions (the docs merely say what it "currently uses"; apparently this was initially committed in 2002). Java has several portable deterministic PRNGs (SecureRandom.getInstance("SHA1PRNG","SUN"), albeit slow); I'm not aware of something similar for Ruby.

Resources