Increasing performance in file manipulation

I have a file which contains a matrix of numbers as follows:
0 10 24 10 13 4 101 ...
6 0 52 10 4 5 0 4 ...
3 4 0 86 29 20 77 294 ...
4 1 1 0 78 100 83 199 ...
5 4 9 10 0 58 8 19 ...
6 58 60 13 68 0 148 41 ...
. .
. .
. .
What I am trying to do is sum each row and write each row's sum to a new file, one sum per line.
I have tried doing it in Haskell using ByteStrings, but the performance is three times as slow as the Python implementation. Here is the Haskell implementation:
import qualified Data.ByteString.Char8 as B

-- This function is for summing a row
sumrows r = foldr (\x y -> (maybe 0 (*1) $ fst <$> (B.readInt x)) + y) 0 (B.split ' ' r)

-- This function is for mapping the sumrows function to each line
sumfile f = map (\x -> (show x) ++ "\n") (map sumrows (B.split '\n' f))

main = do
    contents <- B.readFile "telematrix"
    -- I get the sum of each line, and then pack up all the results so that it can be written
    B.writeFile "teleDensity" $ (B.pack . unwords) (sumfile contents)
    print "complete"
This takes about 14 seconds for a 25 MB file.
Here is the Python implementation:
fd = open("telematrix", "r")
nfd = open("teleDensity", "w")
for line in fd:
    nfd.write(str(sum(map(int, line.split(" ")))) + "\n")
fd.close()
nfd.close()
This takes about 5 seconds for the same 25 MB file.
Any suggestions on how to improve the performance of the Haskell implementation?

It seems that the problem was that I was compiling and running the program with runhaskell, as opposed to compiling with ghc and then running the binary. By compiling first and then running, I cut the Haskell runtime to about 1 second.

At a glance, I would bet your first bottleneck is the ++ on strings in sumfile, which traverses and rebuilds its left operand on every append. Instead of appending "\n" to each line yourself, you could replace the unwords call with unlines, which does exactly what you want here. That should get you a nice little speed boost.
A more minor nitpick is that the (*1) in the maybe call is unneeded. Using id there would be more efficient, since (*1) wastes a multiplication, but that's no more than a few processor cycles.
Then finally, I have to ask why you're using ByteStrings here. A ByteString stores string data efficiently as an array, like traditional strings in a more imperative language. However, what you're doing here involves splitting the string and iterating over the elements, which are operations that linked lists are suited for. I would honestly recommend the traditional [Char] type in this case. That B.split call may be what's hurting you, since it has to copy the entire line into separate arrays for the split pieces, whereas the words function on linked lists of characters simply splits the linked structure at a few points.

The main reason for the poor performance was that I was using runhaskell instead of first compiling and then running the program. So I switched from:
runhaskell program.hs
to
ghc program.hs
./program


hpack encoding integer significance

After reading this, https://httpwg.org/specs/rfc7541.html#integer.representation
I am confused about quite a few things, although I seem to have the overall gist of the idea.
For one, what are the 'prefixes' exactly/what is their purpose?
For two:
C.1.1. Example 1: Encoding 10 Using a 5-Bit Prefix
The value 10 is to be encoded with a 5-bit prefix.
10 is less than 31 (2^5 - 1) and is represented using the 5-bit prefix.
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
| X | X | X | 0 | 1 | 0 | 1 | 0 | 10 stored on 5 bits
+---+---+---+---+---+---+---+---+
What are the leading Xs? What is the starting 0 for?
>>> bin(10)
'0b1010'
>>>
Typing this in the python IDE, you see almost the same output... Why does it differ?
This is when the number fits within the number of prefix bits though, making it seemingly simple.
C.1.2. Example 2: Encoding 1337 Using a 5-Bit Prefix
The value I=1337 is to be encoded with a 5-bit prefix.
1337 is greater than 31 (2^5 - 1).
The 5-bit prefix is filled with its max value (31).
I = 1337 - (2^5 - 1) = 1306.
I (1306) is greater than or equal to 128, so the while loop body executes:
I % 128 == 26
26 + 128 == 154
154 is encoded in 8 bits as: 10011010
I is set to 10 (1306 / 128 == 10)
I is no longer greater than or equal to 128, so the while loop terminates.
I, now 10, is encoded in 8 bits as: 00001010.
The process ends.
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
| X | X | X | 1 | 1 | 1 | 1 | 1 | Prefix = 31, I = 1306
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1306>=128, encode(154), I=1306/128
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 10<128, encode(10), done
+---+---+---+---+---+---+---+---+
The octet-like diagram shows three different numbers being produced... Since the numbers are produced throughout the loop, how do you replicate this octet-like diagram within an integer? What is the actual final result: the diagram, or "I" being 10, i.e. 00001010?
def f(a, b):
    if a < 2**b - 1:
        print(a)
    else:
        c = 2**b - 1
        remain = a - c
        print(c)
        if remain >= 128:
            while 1:
                e = remain % 128
                g = e + 128
                remain = remain / 128
                if remain >= 128:
                    continue
                else:
                    print(remain)
                    c += int(remain)
                    print(c)
                    break
As I'm trying to figure this out, I wrote a quick Python implementation of it. It seems that I am left with a few useless variables, one being g, which in the documentation is the 26 + 128 == 154.
Lastly, where does 128 come from? I can't find any relation between the numbers besides the fact 2 raised to the 7th power is 128, but why is that significant? Is this because the first bit is reserved as a continuation flag? and an octet contains 8 bits so 8 - 1 = 7?
For one, what are the 'prefixes' exactly/what is their purpose?
Integers are used in a few places in HPACK messages, and often they have leading bits that cannot be used for the actual integer. Therefore there will often be a few leading digits that are unavailable for the integer itself. They are represented by the X. For the purposes of this calculation it doesn't matter what those Xs are: they could be 000, or 111, or 010, etc. Also, there will not always be 3 Xs - that is just an example. There could be only one leading X, or two, or four, etc.
For example, to look up a previously HPACK-decoded header, we use 6.1. Indexed Header Field Representation, which starts with a leading 1, followed by the table index value. Therefore that 1 is the X in the previous example. We have 7 bits (instead of only 5 bits in the original example in your question). If the table index value is less than 127 we can represent it using those 7 bits. If it's 127 or greater then we need to do some extra work (we'll come back to this).
If it's a new value we want to add to the table (to reuse in future requests), but we already have that header name in the table (so it's just a new value for that name we want as a new entry) then we use 6.2.1. Literal Header Field with Incremental Indexing. This has 2 bits at the beginning (01 - which are the Xs), and we only have 6-bits this time to represent the index of the name we want to reuse. So in this case there are two Xs.
So don't worry about there being 3 Xs - that's just an example. In the above examples there was one X (as first bit had to be 1), and two Xs (as first two bits had to be 01) respectively. The Integer Representation section is telling you how to handle any prefixed integer, whether prefixed by 1, 2, 3... etc unusable "X" bits.
What are the leading Xs? What is the starting 0 for?
The leading Xs are discussed above. The starting 0 is just because, in this example we have 5-bits to represent the integers and only need 4-bits. So we pad it with 0. If the value to encode was 20 it would be 10100. If the value was 40, we couldn't fit it in 5-bits so need to do something else.
Typing this in the python IDE, you see almost the same output... Why does it differ?
Python uses 0b to show it's a binary number. It doesn't bother showing any leading zeros. So 0b1010 is the same as 0b01010 and also the same as 0b00001010.
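For example, if you force the width with format() you get a picture matching the diagram:
>>> format(10, '05b')
'01010'
>>> format(10, '08b')
'00001010'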
This is when the number fits within the number of prefix bits though, making it seemingly simple.
Exactly. If you need more than the number of bits you have, you don't have space for it. You can't just use more bits as HPACK will not know whether you are intending to use more bits (so should look at next byte) or if it's just a straight number (so only look at this one byte). It needs a signal to know that. That signal is using all 1s.
So to encode 40 in 5 bits, we need to use 11111 to say "it's not big enough", overflow to next byte. 11111 in binary is 31, so we know it's bigger than that, so we'll not waste that, and instead use it, and subtract it from the 40 to give 9 left to encode in the next byte. A new additional byte gives us 8 new bits to play with (well actually only 7 as we'll soon discover, as the first bit is used to signal a further overflow). This is enough so we can use 00001001 to encode our 9. So our complex number is represented in two bytes: XXX11111 and 00001001.
If we want to encode a value bigger than can fit in the prefix bits, AND the left-over is bigger than the 127 that would fit into the available 7 bits of the second byte, then we can't use this overflow mechanism with just two bytes. Instead we use another "overflow, overflow" mechanism using three bytes:
For this "overflow, overflow" mechanism, we set the first byte bits to 1s as usual for an overflow (XXX11111) and then set the first bit of the second byte to 1. This leaves 7 bits available to encode the value, plus the next 8 bits in the third byte we're going to have to use (actually only 7 bits of the third byte, because again it uses the first bit to indicate another overflow).
There are various ways they could have gone about this using the second and third bytes. What they decided to do was encode this as two numbers: the 128 mod, and the 128 multiplier.
1337 = 31 + (128 * 10) + 26
So that means the first byte is set to 31 as per the previous example, the second byte is set to 26 (which is 11010) plus the leading 1 to show we're using the overflow-overflow method (so 10011010), and the third byte is set to 10 (or 00001010).
So 1337 is encoded in three bytes: XXX11111 10011010 00001010 (with X set to whatever those values were).
Using the 128 mod and multiplier is quite efficient and means this large number (and in fact any number up to 16,383) can be represented in three bytes - which is, not coincidentally, also the max integer that can be represented in 7 + 7 = 14 bits. But it does take a bit of getting your head around!
If it's bigger than 16,383 then we need to do another round of overflow in a similar manner.
All this seems horrendously complex but is actually relatively simple and efficient to code up. Computers can do this pretty easily and quickly.
It seems that I am left with a few useless variables, one being g
You are not printing this value (g) inside the loop; you only print the left-over value in the else branch. You need to print both.
which in the documentation is the 26 + 128 == 154.
Lastly, where does 128 come from? I can't find any relation between the numbers besides the fact 2 raised to the 7th power is 128, but why is that significant? Is this because the first bit is reserved as a continuation flag? and an octet contains 8 bits so 8 - 1 = 7?
Exactly, it's because the first bit (value 128) needs to be set as per explanation above, to show we are continuing/overflowing into needing a third byte.
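Putting all of that together, here is a minimal Python sketch of the encoding and decoding loops (the function names are mine, not from the RFC; the X bits are assumed to be handled by the caller):

def encode_int(value, prefix_bits):
    # Largest value that fits directly in the prefix: 2^N - 1.
    max_prefix = (1 << prefix_bits) - 1
    if value < max_prefix:
        return [value]                        # fits in the prefix itself
    octets = [max_prefix]                     # all-1s prefix signals overflow
    value -= max_prefix
    while value >= 128:
        octets.append(value % 128 + 128)      # low 7 bits, continuation bit set
        value //= 128
    octets.append(value)                      # last byte: continuation bit clear
    return octets

def decode_int(octets, prefix_bits):
    # Inverse operation: read the prefix, then any continuation bytes.
    max_prefix = (1 << prefix_bits) - 1
    value = octets[0] & max_prefix            # mask off the X bits
    if value < max_prefix:
        return value
    shift = 0
    for b in octets[1:]:
        value += (b & 127) << shift           # low 7 bits of each continuation byte
        shift += 7
        if b < 128:                           # continuation bit clear: done
            break
    return value

print(encode_int(10, 5))     # [10]           -> 00001010
print(encode_int(40, 5))     # [31, 9]        -> XXX11111 00001001
print(encode_int(1337, 5))   # [31, 154, 10]  -> XXX11111 10011010 00001010
print(decode_int([31, 154, 10], 5))  # 1337

So the "actual final result" for 1337 is the whole three-octet sequence shown in the diagram, not any single one of the numbers produced along the way.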

Why does Frame.ofRecords garble its results when fed a sequence generated by a parallel calculation?

I am running some code that calculates a sequence of records and calls Frame.ofRecords with that sequence as its argument. The records are calculated using PSeq.map from the library FSharp.Collections.ParallelSeq.
If I convert the sequence into a list then the output is OK. Here is the code and the output:
let summaryReport path (writeOpenPolicy: WriteOpenPolicy) (outputs: Output seq) =
    let foo (output: Output) =
        let temp =
            { Name = output.Name
              Strategy = string output.Strategy
              SharpeRatio = (fst output.PandLStats).SharpeRatio
              CalmarRatio = (fst output.PandLStats).CalmarRatio }
        printfn "************************************* %A" temp
        temp
    outputs
    |> Seq.map foo
    |> List.ofSeq // this is the line that makes a difference
    |> Frame.ofRecords
    |> frameToCsv path writeOpenPolicy ["Name"] "Summary_Statistics"
Name Name Strategy SharpeRatio CalmarRatio
0 Singleton_AAPL MyStrategy 0.317372564 0.103940018
1 Singleton_MSFT MyStrategy 0.372516931 0.130150478
2 Singleton_IBM MyStrategy Infinity
The printfn command let me verify by inspection that in each case the variable temp was calculated correctly.
The last code line is just a wrapper around FrameExtensions.SaveCsv.
If I remove the |> List.ofSeq line then what comes out is garbled:
Name Name Strategy SharpeRatio CalmarRatio
0 Singleton_IBM MyStrategy 0.317372564 0.130150478
1 Singleton_MSFT MyStrategy 0.103940018
2 Singleton_AAPL MyStrategy 0.372516931 Infinity
Notice that the empty (corresponding to NaN) and Infinity items are now in different lines and other things are also mixed up.
Why is this happening?
The Frame.ofRecords function iterates over the sequence multiple times, so if your sequence returns different data when iterated repeatedly, you will get inconsistent data in the frame.
Here is a minimal example:
let mutable n = 0.
let nums = seq { for i in 0 .. 10 do n <- n + 1.; yield n, n }
Frame.ofRecords nums
This returns:
Item1 Item2
0 -> 1 12
1 -> 2 13
2 -> 3 14
3 -> 4 15
4 -> 5 16
5 -> 6 17
6 -> 7 18
7 -> 8 19
8 -> 9 20
9 -> 10 21
10 -> 11 22
As you can see, the first item is obtained during the first iteration of the sequence, while the second item is obtained during the second iteration.
This should probably be better documented, but it makes the performance better in typical scenarios - if you can send a PR to the docs, that would be useful.
Parallel sequences run in arbitrary order, because the work gets split across many processors, so the result set will come back in effectively random order. You can always sort the results afterwards, or not run your data in parallel.
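As an aside, the same ordering effect shows up in other languages' parallel map primitives. A small Python illustration (an analogy only, not the PSeq API):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(4) as pool:
        # imap preserves the input order; imap_unordered yields results
        # as workers finish, i.e. in arbitrary order.
        print(list(pool.imap(square, range(8))))
        print(list(pool.imap_unordered(square, range(8))))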

Fortran passing parameters with brackets prevents changes

In this question I asked about a method to explicitly prevent passed arguments from being changed. An obvious solution is to define copies of the arguments and run the algorithm on those copies. However, in the comments I was pointed to the fact that I could call the function with the argument I didn't want to change wrapped in brackets. This would have the same effect as creating a copy of that passed variable, so that it would not change. But I don't understand how it works and what the brackets are actually doing. So could someone explain it to me?
Here is a simple example where the behaviour occurs as I described.
program argTest
    implicit none
    real :: a, b, c

    interface !optional interface
        subroutine change(a,b,c)
            real :: a, b, c
        end subroutine change
    end interface

    write(*,*) 'Input a,b,c: '
    read(*,*) a, b, c

    write(*,*) 'Values at start:'
    write(*,*) 'a:', a
    write(*,*) 'b:', b
    write(*,*) 'c:', c

    call change((a),b,c)
    write(*,*) 'Values after calling change with brackets around a:'
    write(*,*) 'a:', a
    write(*,*) 'b:', b
    write(*,*) 'c:', c

    call change(a,b,c)
    write(*,*) 'Values after calling change without brackets:'
    write(*,*) 'a:', a
    write(*,*) 'b:', b
    write(*,*) 'c:', c

end program argTest


subroutine change(a,b,c)
    real :: a, b, c

    a = a*2
    b = b*3
    c = c*4

end subroutine change
The syntax (a), in the context of the code in the question, is an expression. In the absence of pointer results, an expression is evaluated to yield a value. In this case the value of the expression is the same as the value of the variable a.
While the result of evaluating the expression (a) and the variable a have the same value, they are not the same thing - the value of a variable is not the same concept as the variable itself. This is used in some situations where the same variable needs to be supplied as both an input argument and as a separate output argument, which would otherwise run afoul of Fortran's restrictions on aliasing of arguments.
HOWEVER - as stated above - in the absence of a pointer result, the result of evaluating an expression is a value, not a variable. You are not permitted to redefine a value. Conceptually, it makes no sense to say "I am going to change the meaning of the value 2", or "I am going to change the meaning of the result of evaluating 1 + 1".
When you use such an expression as an actual argument, it must not be associated with a dummy argument that is redefined inside the procedure.
Inside the subroutine change, the dummy argument that is associated with the value of the expression (a) is redefined. This is non-conforming.
Whether a copy is made or not is an implementation detail that you cannot (and must not) count on - the comment in the linked question is inaccurate. For example, a compiler that is aware of the restriction discussed above knows that the subroutine change cannot change its first argument in a conforming way, may know that a is not otherwise visible to change, and may therefore decide that it doesn't need to make a temporary copy of a for the expression result.
If you need to make a temporary copy of something, then write the statements that make a copy.
real :: tmp_a
...
tmp_a = a
call change(tmp_a, b, c)
I think the explanation is this, though I can't point to a part of the standard that makes it explicit, ...
(a) is an expression whose result is the same as a. What gets passed to the subroutine is the result of evaluating that expression. Fortran is disallowing an assignment to that result, just as it would if you passed cos(a) to the subroutine. I guess that the result of (a) is almost exactly the same as a copy of a, which might explain the behaviour that is puzzling the OP.
I don't have Fortran on this computer, but if I did I'd try a few more cases where the difference between a and (a) might be important, such as
(a) = some_value
to see what the compiler makes of them.
@IanH's comment, below, points out the relevant part of the language standard.
It may be interesting to actually print the addresses of the actual and dummy arguments using the (non-standard) loc() function and compare them, for example:
program main
    implicit none
    integer :: a
    a = 5
    print *, "address(a) = ", loc( a )
    call sub( 100 * a )
    call sub( 1 * a )
    call sub( 1 * (a) )
    call sub( (a) )
    call sub( a )
contains

    subroutine sub( n )
        integer :: n
        n = n + 1
        print "(2(a,i4,3x),a,i18)", "a=", a, " n=", n, "address(n) =", loc( n )
    end subroutine

end program
The output becomes like this, which shows that a temporary variable containing the result of an expression is actually passed to sub() (except for the last case).
# gfortran-6
address(a) = 140734780422480
a= 5 n= 501 address(n) = 140734780422468
a= 5 n= 6 address(n) = 140734780422464
a= 5 n= 6 address(n) = 140734780422460
a= 5 n= 6 address(n) = 140734780422456
a= 6 n= 6 address(n) = 140734780422480
# ifort-16
address(a) = 140734590990224
a= 5 n= 501 address(n) = 140734590990208
a= 5 n= 6 address(n) = 140734590990212
a= 5 n= 6 address(n) = 140734590990216
a= 5 n= 6 address(n) = 140734590990220
a= 6 n= 6 address(n) = 140734590990224
# Oracle fortran 12.5
address(a) = 6296328
a= 5 n= 501 address(n) = 140737477281416
a= 5 n= 6 address(n) = 140737477281420
a= 5 n= 6 address(n) = 140737477281424
a= 5 n= 6 address(n) = 140737477281428
a= 6 n= 6 address(n) = 6296328
(It is interesting that Oracle uses a very small address for a for some reason... though other compilers use very similar addresses.)
[ Edit ] According to the above answer by Ian, it is illegal to assign a value to the memory holding the result of an expression (which is a value, i.e. a constant, not a variable). So please take the above code just as an attempt to confirm that what is passed with (...) is different from the original a.

Consolidate 10 bit Value into a Unique Byte

As part of an algorithm I'm writing, I need to find a way to convert a 10-bit word into a unique 8-bit word. The 10-bit word is made up of 5 pairs, where each pair can only ever equal 0, 1 or 2 (never 3). For example:
|00|10|00|01|10|
This value needs to somehow be consolidated into a single, unique byte.
As each pair can never equal 3, there are a wide range of values that this 10-bit word will never represent, which makes me think that it is possible to create an algorithm to perform this conversion. The simplest way to do this would be to use a lookup table, but it seems like a waste of resources to store ~680 values which will only be used once in my program. I've already tried to incorporate one of the pairs into the others somehow, but every attempt I've made has resulted in a non-unique value, and I'm now very quickly running out of ideas!
Any help?
The number you have is essentially base 3. You just need to convert this to base 2.
There are 5 pairs, so 3^5 = 243 numbers. And 8 bits is 2^8 = 256 numbers, so it's possible.
The simplest way to convert between bases is to go to base 10 first.
So, for your example:
00|10|00|01|10
Base 3: 02012
Base 10: 2*3^3 + 1*3^1 + 2*3^0
= 54 + 3 + 2
= 59
Base 2:
59 % 2 = 1
/2 29 % 2 = 1
/2 14 % 2 = 0
/2 7 % 2 = 1
/2 3 % 2 = 1
/2 1 % 2 = 1
So, reading the remainders from bottom to top, 111011 is your number in binary.
This explains the above process in a bit more detail.
Note that once you have 59 above stored in a 1-byte integer, you'll probably already have what you want, thus explicitly converting to base 2 might not be necessary.
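For completeness, here is a small Python sketch of that conversion, assuming the five pairs are packed most-significant first into a 10-bit integer (the function name is mine):

def pack_pairs(word10):
    # Treat the five 2-bit pairs as base-3 digits, most significant first.
    result = 0
    for shift in (8, 6, 4, 2, 0):
        pair = (word10 >> shift) & 0b11    # extract one 2-bit pair
        assert pair != 3, "each pair may only be 0, 1 or 2"
        result = result * 3 + pair
    return result

print(pack_pairs(0b0010000110))  # the example |00|10|00|01|10| -> 59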
What you basically have is a base-3 number, and you want to convert it to a single number 0 - 255; luckily 5 digits in ternary (base 3) give only 243 combinations.
What you'll need to do is compute:
(1st digit x 3^4) + (2nd x 3^3) + (3rd x 3^2) + (4th x 3) + (5th)
This will give you a number 0 to 242.
You are considering storing some information in a byte. A byte can hold at most 2^8 = 256 states.
Your value has a total of 3^5 = 243 states, and 243 < 256, which makes the conversion possible.
Consider your pairs as ABCDE (each character can be 0, 1 or 2).
You can just calculate A*3^4 + B*3^3 + C*3^2 + D*3 + E as your result. I guarantee the result will be in the range 0 -- 242.

Speeding up reshaping person to period-format dataframe in R

I have a dataset with longitudinal data in a person-oriented format, as follows:
pid varA_1 varB_1 varA_2 varB_2 varA_3 varB_3 ...
1 1 1 0 3 2 1
2 0 1 0 2 2 1
...
50k 1 0 1 3 1 0
This results in a large dataframe, with a minimum of 50k observations and 90 variables measured for up to 29 periods.
I would like to get a more period-oriented format, as follows:
pid index start stop varA varB varC ...
1 1 ...
1 2
...
1 29
2 1
I have tried different approaches for reshaping the dataframe (*apply, plyr, reshape2, loops, appending vs. prefilling all numeric matrices, etc.), but do not seem to get a decent processing time (over 40 minutes for subsets). I have picked up various hints along the way on what to avoid, but I'm still not sure if I'm missing some bottleneck or possible speedup.
Is there an optimal way to approach this kind of data processing, so that I can evaluate the best-case processing time achievable in pure R code? There have been similar questions on Stack Overflow, but they did not result in convincing answers...
First, let's build the data example (I am using 5e3 instead of 50e3 to avoid memory problems with my configuration):
nObs <- 5e3
nVar <- 90
nPeriods <- 29
dat <- matrix(rnorm(nObs*nVar*nPeriods), nrow=nObs, ncol=nVar*nPeriods)
df <- data.frame(id=seq_len(nObs), dat)
nmsV <- paste('Var', seq_len(nVar), sep='')
nmsPeriods <- paste('T', seq_len(nPeriods), sep='')
nms <- c(outer(nmsV, nmsPeriods, paste, sep='_'))
names(df)[-1] <- nms
And now with stats::reshape you change the format:
df2 <- reshape(df, dir = "long", varying = 2:((nVar*nPeriods)+1), sep = "_")
I am not sure if this is the fast solution you are looking for.
The well-aged stack() function can be very fast, if things fit into memory.
For a large set, using a (transparent) SQLite database as an intermediate is best. Try Gabor's package sqldf; there are many examples on its Google Code page:
http://code.google.com/p/sqldf/
