In MATLAB, we can use a vector B to get a sublist of a vector A by writing A(B). Is there any way to do that in Eigen?
With the current development branch (since mid-2018), and starting with the 3.4 release (which will likely come out at the start of 2019), you can simply write A(B) if B is either an Eigen object or a std::vector with integer coefficients.
See new features of the 3.4 release page for details.
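For illustration, here is a minimal sketch of that 3.4-style indexing; the concrete sizes and index values are made up for the example:

#include <vector>
#include <iostream>
#include <Eigen/Dense>

int main() {
    Eigen::VectorXd A(5);
    A << 10, 20, 30, 40, 50;

    // B holds the indices to pick, much like MATLAB's A(B) (0-based here).
    std::vector<int> B{4, 2, 0};

    // With Eigen 3.4+, A(B) returns the coefficients at those indices.
    Eigen::VectorXd sub = A(B);

    std::cout << sub.transpose() << std::endl;  // prints: 50 30 10
}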
There is a variable first_variable which is always taken modulo some number, mod_value.
In every step, first_variable is multiplied by some number, second_variable.
The range of all three variables is from 1 to 10^18.
For that I use the formula:
first_variable = ((first_variable%mod_value)*(second_variable%mod_value))%mod_value
But this gives a wrong answer: for example, when first_variable and second_variable are both (10^18)-1 and mod_value = 10^18.
Please suggest a method so that first_variable always gives the right answer.
It seems you are using a runtime where arithmetic is implemented with 64-bit integers. You can check this with multipliers such as 2^32: if their product comes out as 0, my guess is confirmed. In that case, you should switch to an arbitrary-precision arithmetic implementation, or at least one much wider than your current 64 bits. Python and Erlang, for example, support arbitrary-precision integers natively.
I've seen in the comments that you use C++. If so, look at the GMP library and its analogues. Or, if 128 bits is enough, modern GCC supports it through the built-in __int128 type.
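For instance, a minimal sketch of the 128-bit route with GCC/Clang (the helper name mulmod is just for illustration):

#include <cstdint>

// Multiply a and b modulo m without overflowing 64 bits, by computing the
// intermediate product in the compiler's built-in unsigned __int128 type.
// Assumes m > 0; works even for values close to 10^18.
uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    unsigned __int128 product = static_cast<unsigned __int128>(a % m) * (b % m);
    return static_cast<uint64_t>(product % m);
}

The update then becomes first_variable = mulmod(first_variable, second_variable, mod_value), which stays correct even when all three values are near 10^18.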
This is basically an overflow, so you should either use a smaller mod_value (up to about 10^9, so that the product of two residues is at most about 10^18 and still fits in a 64-bit integer) or limit the range of first_variable and second_variable.
Your intermediate product is on the order of 10^36, i.e. roughly 2^120, which cannot fit in any primitive integer type in languages like Java or C++. Use a big-integer type (BigInteger in Java, or a library such as GMP in C++) to get around it; plain Python integers are already arbitrary-precision.
I am trying to write a backtracking algorithm that keeps state using mutable BitSets. It works fine, but I want it to go faster!
The crux is: given two mutable.BitSets alpha and beta, I need to determine whether any of the bits of alpha are set in beta, i.e. a bitwise AND. I do not need the resulting set; I just need to know whether the intersection is non-empty:
(alpha intersect beta).nonEmpty
or
(alpha & beta).nonEmpty
but both of these construct a set which is then tested for size. I really just need a Boolean and would like to avoid the cost of constructing the intermediate set.
Is there a better way?
TIA
Nivag
Referring to the API docs, you may use the find and contains methods:
alpha find (beta.contains) isDefined
OR
Even better, use the exists method:
alpha exists (beta.contains)
OR
Even shorter and better, use the apply method of BitSet, which is equivalent to its contains method:
alpha exists beta
My question is similar to this one: Multiply a 3D matrix with a 2D matrix. However, I'm coding in Fortran.
Say I have an RxSxT matrix A and an SxU matrix B, where R, S, T, U are integers, and I want to multiply A(:,:,0) with B. How can I do this with matmul? When I do something like
C(:,:,0) = matmul(A(:,:,0),B)
The compiler (gfortran) gives:
Warning:Array reference at (1) is out of bounds (0 < 1) in dimension 3
f951: internal compiler error: Segmentation fault
Is there a way around this?
Thanks.
EDIT: I should add that I'm actually transposing the second matrix. Say A is an RxSxT matrix and B a UxS matrix. Then
C(:,:,0) = matmul(B, transpose(A(:,:,0)))
That transpose might be part of the problem. Does it convert A(i,j,k) to A(k,i,j)?
Remember that in Fortran your array indices start from 1 by default. So unless you have declared your array A with a non-default lower bound on the 3rd dimension (for example, as A(R,S,0:T-1)), gfortran is entirely correct in pointing out your error.
Of course, an internal compiler error is always a compiler bug; unless you have some ancient version of gfortran, please file a bug at http://gcc.gnu.org/bugzilla.
transpose(A(:,:,0)) should interchange the indices, mapping A(i,j,0) to A(j,i,0); A(:,:,0) is a rank-two array.
The compiler should never crash, whether or not the input source code is correct. Are you using the latest version of gfortran? You could report this "internal compiler error: Segmentation fault" to the gfortran development team: http://gcc.gnu.org/wiki/GFortran#bugs
I'm working on an implementation of one of the SHA3 candidates, JH. I'm at the point where the algorithm passes all KATs (Known Answer Tests) provided by NIST, and I have also made it an instance of the Crypto-API. Thus I have begun looking into its performance. But I'm quite new to Haskell and don't really know what to look for when profiling.
At the moment my code is consistently slower than the reference implementation written in C, by a factor of 10 for all input lengths (C code found here: http://www3.ntu.edu.sg/home/wuhj/research/jh/jh_bitslice_ref64.h).
My Haskell code is found here: https://github.com/hakoja/SHA3/blob/master/Data/Digest/JHInternal.hs.
Now I don't expect you to wade through all my code, rather I would just want some tips on a couple of functions. I have run some performance tests and this is (part of) the performance file generated by GHC:
Tue Oct 25 19:01 2011 Time and Allocation Profiling Report (Final)
main +RTS -sstderr -p -hc -RTS jh e False
total time = 6.56 secs (328 ticks @ 20 ms)
total alloc = 4,086,951,472 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
roundFunction Data.Digest.JHInternal 28.4 37.4
word128Shift Data.BigWord.Word128 14.9 19.7
blockMap Data.Digest.JHInternal 11.9 12.9
getBytes Data.Serialize.Get 6.7 2.4
unGet Data.Serialize.Get 5.5 1.3
sbox Data.Digest.JHInternal 4.0 7.4
getWord64be Data.Serialize.Get 3.7 1.6
e8 Data.Digest.JHInternal 3.7 0.0
swap4 Data.Digest.JHInternal 3.0 0.7
swap16 Data.Digest.JHInternal 3.0 0.7
swap8 Data.Digest.JHInternal 1.8 0.7
swap32 Data.Digest.JHInternal 1.8 0.7
parseBlock Data.Digest.JHInternal 1.8 1.2
swap2 Data.Digest.JHInternal 1.5 0.7
swap1 Data.Digest.JHInternal 1.5 0.7
linearTransform Data.Digest.JHInternal 1.5 8.6
shiftl_w64 Data.Serialize.Get 1.2 1.1
Detailed breakdown omitted ...
Now quickly about the JH algorithm:
It's a hash algorithm which consists of a compression function F8, which is repeated as long as there are input blocks (of length 512 bits). This is just how the SHA functions operate. The F8 function consists of the E8 function, which applies a round function 42 times. The round function itself consists of three parts:
an sbox, a linear transformation, and a permutation (called swap in my code).
Thus it's reasonable that most of the time is spent in the round function. Still, I would like to know how those parts could be improved. For instance: the blockMap function is just a utility function, mapping a function over the elements of a 4-tuple. So why is it performing so badly? Any suggestions would be welcome, and not just on single functions; i.e., are there structural changes you would make in order to improve the performance?
I have tried looking at the Core output, but unfortunately that's way over my head.
I attach some of the heap profiles at the end as well in case that could be of interest.
EDIT:
I forgot to mention my setup and build. I run it on an x86_64 Arch Linux machine, GHC 7.0.3-2 (I think), with compile options:
ghc --make -O2 -funbox-strict-fields
Unfortunately there seems to be a bug on the Linux platform when compiling via C or LLVM, giving me the error:
Error: .size expression for XXXX does not evaluate to a constant
so I have not been able to see the effect of that.
Switch to unboxed Vectors (from Array, used for constants)
Use unsafeIndex instead of incurring the bounds check and data dependency from safe indexing (i.e. !)
Unpack Block1024 as you did with Block512 (or at least use UnboxedTuples)
Use unsafeShift{R,L} so you don't incur the check on the shift value (coming in GHC 7.4)
Unfold the roundFunction so you have one rather ugly and verbose e8 function. This was significant in pureMD5 (the rolled version was prettier but massively slower than the unrolled version). You might be able to use TH to do this and keep the code smallish. If you do this then you'll have no need for constants, as these values will be explicit in the code, resulting in a more cache-friendly binary.
Unpack your Word128 values.
Define your own addition for Word128, don't lift Integer. See LargeWord for an example of how this can be done.
Use rem, not mod.
Compile with optimization (-O2) and try llvm (-fllvm)
EDIT: And cabalize your git repo along with a benchmark so we can help you more easily ;-). Good work on including a crypto-api instance.
The lower graph shows that a lot of memory is occupied by lists. Unless there are more lurking in other modules, they can only come from e8. Maybe you'll have to bite the bullet and make that a loop instead of a fold, but for starters, since Block1024 is a pair, the foldl' doesn't do much evaluation on the fly (unless the strictness analyser has become significantly better). Try making that stricter: data Block1024 = B1024 !Block512 !Block512; perhaps it also needs {-# UNPACK #-} pragmas. In roundFunction, use rem instead of mod (this will only have minor impact, but it's a bit faster) and make the let bindings strict. In the swapN functions, you might get better performance by giving the constants in the form W x y rather than as 128-bit hex numbers.
I can't guarantee those changes will help, but that's what looks most promising after a short glance.
Ok, so I thought I would chime in with an update of what I have done and the results obtained thus far. Changes made:
Switched from Array to UnboxedArray (made Word128 an instance type)
Used UnboxedArray + fold in e8 instead of lists and (prelude) fold
Used unsafeIndex instead of !
Changed type of Block1024 to a real datatype (similar to Block512), and unpacked its arguments
Updated GHC to version 7.2.1 on Arch Linux, thus fixing the problem with compiling via C or LLVM
Switched mod to rem in some places, but NOT in roundFunction. When I do it there, compilation suddenly takes an awful lot of time, and the run time becomes 10 times slower! Does anyone know why that may be? It only happens with GHC 7.2.1, not GHC 7.0.3.
I compile with the following options:
ghc-7.2.1 --make -O2 -funbox-strict-fields main.hs ./Tests/testframe.hs -fvia-C -optc-O2
And the results? Roughly a 50% reduction in time. On an input of ~107 MB, the code now takes 3 minutes, compared to the previous 6-7 minutes. The C version takes 42 seconds.
Things I tried, but which didn't result in better performance:
Unrolled the e8 function like this:
e8 !h = go h 0
  where go !x !n
          | n == 42   = x
          | otherwise = go h' (n + 1)
          where !h' = roundFunction x n
Tried breaking up the swapN functions to use the underlying Word64 values directly:
swap1 (W xh xl) =
    shiftL (W (xh .&. 0x5555555555555555) (xl .&. 0x5555555555555555)) 1
    .|.
    shiftR (W (xh .&. 0xaaaaaaaaaaaaaaaa) (xl .&. 0xaaaaaaaaaaaaaaaa)) 1
Tried using the LLVM backend
All of these attempts gave worse performance than what I have currently. I don't know if that's because I'm doing it wrong (especially the unrolling of e8), or because they just are worse options.
Still I have some new questions with these new tweaks.
Suddenly I have gotten this peculiar bump in memory usage. Take a look at the following heap profiles:
Why has this happened? Is it because of the UnboxedArray? And what does SYSTEM mean?
When I compile via C I get the following warning:
Warning: The -fvia-C flag does nothing; it will be removed in a future GHC release
Is this true? Why then, do I see better performance using it, rather than not?
It looks like you did a fair amount of tweaking already; I'm curious what the performance is like without explicit strictness annotations (BangPatterns) and the various compiler pragmas (UNPACK, INLINE)... Also, a dumb question: what optimization flags are you using?
Anyway, two suggestions which may be completely awful:
Use unboxed primitive types where you can (e.g. replace Data.Word.Word64 with GHC.Word.Word64#, make sure word128Shift is using Int#, etc.) to avoid heap allocation. This is, of course, non-portable.
Try Data.Sequence instead of []
At any rate, rather than looking at the Core output, try looking at the intermediate C files (*.hc) instead. It can be hard to wade through, but sometimes makes it obvious where the compiler wasn't quite as sharp as you'd hoped.
I don't understand why version numbers of the Boost library are incremented only by 1/100 (e.g. 1.33, 1.34, and so on) even when major additions, such as huge new libraries, are made. Is there any strong motivation behind this?
It says in the Boost FAQ:
What do the Boost version numbers mean? The scheme is x.y.z, where x is incremented only for massive changes, such as a reorganization of many libraries, y is incremented whenever a new library is added, and z is incremented for maintenance releases. y and z are reset to 0 if the value to the left changes.
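So going from 1.33 to 1.34, for example, only signals that at least one new library was added; the leading 1 changes only for massive reorganizations, and a third component, as in 1.34.1, marks a maintenance release.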