To all HE experts out there:
I want to implement a matrix-vector multiplication with very large matrices (600000 x 55). Currently I am able to perform HE operations like addition, multiplication, inner product, etc. with small inputs. When I try to apply these operations to larger inputs, I get errors like invalid next size (normal), or I run out of main memory until the OS kills the process (exit code 9).
Do you have any recommendations or examples of how to achieve an efficient matrix-vector multiplication or something similar? (Using BFV and CKKS.)
PS: I am using the PALISADE library, but if you have better suggestions, like SEAL or HElib, I would happily use them as well.
CKKS, which is also available in PALISADE, would be a much better option for your scenario, as it supports approximate (floating-point-like) arithmetic and does not require high precision (i.e., a large plaintext modulus). BFV performs all operations exactly (mod the plaintext modulus), so you would have to use a really large plaintext modulus to make sure your result does not wrap around it. This gets much worse as you increase the multiplicative depth, e.g., with two chained multiplications.
For matrix-vector multiplication, you could use the techniques described in https://eprint.iacr.org/2019/223, https://eprint.iacr.org/2018/254, and the supplemental information of https://eprint.iacr.org/2020/563. The main idea is to choose the right encoding and take advantage of SIMD packing. You would work with a power-of-two vector size and could pack the matrix either as 64xY (multiple rows) per ciphertext or a part of each row per ciphertext, depending on which one is more efficient.
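To make the packing idea concrete, here is a minimal plain-Python sketch (no HE library involved; all names are illustrative) of the rotate-and-sum inner product that row-wise SIMD packing maps onto. In PALISADE or SEAL, the element-wise multiply and the rotation below would be the corresponding operations on packed ciphertext slots.

    # Plain-Python sketch of the SIMD "rotate-and-sum" inner product that the
    # packed encodings above rely on. In an HE scheme, `row` and `vec` would be
    # packed into ciphertext/plaintext slots and rotate() would be a slot rotation.
    def rotate(v, k):
        """Cyclically rotate a slot vector left by k positions."""
        k %= len(v)
        return v[k:] + v[:k]

    def packed_inner_product(row, vec):
        """Inner product of two packed slot vectors of power-of-two length."""
        assert len(row) == len(vec) and (len(row) & (len(row) - 1)) == 0
        acc = [r * x for r, x in zip(row, vec)]      # element-wise (SIMD) multiply
        step = len(acc) // 2
        while step >= 1:                             # log2(n) rotate-and-add rounds
            acc = [a + b for a, b in zip(acc, rotate(acc, step))]
            step //= 2
        return acc[0]                                # every slot now holds the sum

    print(packed_inner_product([1, 2, 3, 4], [10, 20, 30, 40]))  # 300

Under the 64xY packing mentioned above, each 55-entry row would be zero-padded to 64 slots; after the multiply, six rotate-and-add rounds (steps 32, 16, ..., 1) leave each row's inner product in that row's first slot.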
Let's say I'm creating a hash table of between 7 and 8 million elements, using linear probing to handle collisions. How do I figure out how many buckets are required?
There is no perfect answer... the number of buckets affects both memory usage and performance, and the more collision-prone the specific elements are (in combination with your hash function and table size; e.g. a prime number of buckets tends to be more tolerant than a power of two), the more buckets you may want.
So, if you need accurate tuning, the best way is to get realistic data and try a range of load factors (i.e. the ratio of elements to buckets), seeing where the memory/performance tradeoff suits you best.
If you just want a generally useful load factor as a point of departure, perhaps try 0.7 to 0.8 if you have a halfway decent hash function. In other words, an oft-sane ballpark figure for the number of buckets would be 8 million / 0.8 to 8 million / 0.7, which is roughly 10 to 11.4 million.
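As a trivial arithmetic helper (Python; the function name is just illustrative):

    import math

    def bucket_count(expected_elements, load_factor=0.75):
        """Ballpark bucket count for a target load factor (0.7-0.8 for linear probing)."""
        return math.ceil(expected_elements / load_factor)

    print(bucket_count(8_000_000, 0.7))  # ~11.4 million buckets
    print(bucket_count(8_000_000, 0.8))  # 10 million buckets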
If you're serious about tuning this aggressively, and don't have other good reasons for sticking with linear probing (e.g. to support element deletions with immediate compaction rather than "tombstones" marking once-used buckets, over which lookups/deletions must skip and continue probing), you should move off linear probing, as it will give you a lot more collisions than almost any alternative.
Let's say some researchers have figured out a way to analyze data and they have developed an algorithm for that. At the time, the algorithm is described in a book, using lots of mathematical formulas.
Now the algorithm needs to be implemented in software. The developer can read the formulas and starts translating, e.g., Sum(f(x)) for x in 1..n (it seems TeX is not allowed here), into a for loop.
Depending on how the developer converts the formula into code, there might be overflow or loss of precision in floating-point operations. Not knowing much about real-world input values, unit tests might not detect those issues. However, in some cases this can be avoided just by reordering the operations or simplifying terms.
I wonder who is responsible for the precision of the output. Is it the mathematician or is it the developer? The mathematician might not know enough about computer number formats while the developer might not know enough about mathematics to restructure the formula.
A simple example:
Take the binomial coefficient "n choose k", which translates to n! / (k! (n-k)!).
A simple implementation would probably use the factorial function and then input the numbers directly (pseudo code):
result = fac(n) / (fac(k) * fac(n-k))
This can lead to overflows for larger n. Knowing that, one could divide n! by k! first and do (pseudo code):
result = 1
for (i = k+1 to n) result *= i
result = result / fac(n-k)
which is a) faster, because it needs fewer calculations, and b) far less prone to overflow.
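For reference, a runnable sketch of the example above in Python (whose integers are arbitrary-precision, so it shows the structure rather than the overflow; in a fixed-width integer type the naive version overflows long before the incremental ones). The interleaved variant is an extra refinement not mentioned above:

    from math import factorial

    def binomial_naive(n, k):
        # direct translation of n! / (k! (n-k)!); the intermediates get huge
        return factorial(n) // (factorial(k) * factorial(n - k))

    def binomial_incremental(n, k):
        # n!/k! computed as (k+1) * (k+2) * ... * n, then divided by (n-k)!
        result = 1
        for i in range(k + 1, n + 1):
            result *= i
        return result // factorial(n - k)

    def binomial_interleaved(n, k):
        # multiply and divide in lockstep: every intermediate value is itself a
        # binomial coefficient, so nothing ever grows past the final answer
        result = 1
        for i in range(1, k + 1):
            result = result * (n - k + i) // i
        return result

    assert binomial_naive(52, 5) == binomial_incremental(52, 5) == binomial_interleaved(52, 5) == 2598960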
This science is called numerical analysis
http://en.wikipedia.org/wiki/Numerical_analysis
In my understanding, the analysis is on the mathematician's side, but it is the programmer's responsibility to know that the problem exists and to look for the correct, well-known solutions (like not using a simple Euler integrator but Runge-Kutta).
Short answer: developer.
The algorithm (or just the formula) manipulates arbitrary-precision real numbers as pure mathematical objects.
The code (based on the formula) works on real hardware and must overcome its limitations (which depend on your hardware) by using more complex code.
Example: the formula f(x,y) = x * y may lead to very complex source code if x and y are 64-bit floating-point numbers and your hardware is an 8-bit microcontroller without FPU support and without an integer MUL instruction.
I am thinking of something like a 3D array indexed by the x, y, z coordinates. But that would be a waste of memory, since a lot of block spaces are empty. Another solution would be a hashmap ((x,y,z) -> BlockObject), but that doesn't seem too efficient either.
When I say efficient, I do not mean optimal. It simply means it should be enough to run smoothly on a modern-day computer. Keep in mind that the worlds generated by Minecraft are quite huge, so efficiency is important regardless. There is also a ton of metadata that needs to be stored.
As noted in my comment, I have no idea how Minecraft does this, but a common, efficient way of representing this sort of data is an octree: http://en.wikipedia.org/wiki/Octree. The general idea is that it's like a binary tree, but in three-space. You recursively divide each block of space in each dimension to get eight smaller blocks, and each block contains pointers to the smaller blocks and a pointer to its parent block.
This allows you to be efficient about storing large blocks of the same material (e.g., "empty space"), because you can terminate the recursion whenever you get to a block that is made up of all the same thing, even if you haven't recursed down to the level of individual "cube" units.
Also, this means that you can efficiently find all the cubes in a given region by taking your current block and going up the tree just far enough to get to a block that contains all you can see -- and that way, you can very easily ignore all the cubes that are somewhere else.
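As an illustration of that idea, here is a minimal octree sketch (the general structure only, not Minecraft's actual format): a node either stores one uniform material for its whole cube or splits into eight children covering the octants, and regions that become uniform collapse back into a single leaf.

    class OctreeNode:
        def __init__(self, material="empty", parent=None):
            self.material = material   # meaningful when the node is a uniform leaf
            self.parent = parent
            self.children = None       # None => uniform leaf; otherwise a list of 8 children

        def split(self):
            """Subdivide a uniform node into eight children with the same material."""
            if self.children is None:
                self.children = [OctreeNode(self.material, parent=self) for _ in range(8)]

        def set_block(self, x, y, z, size, material):
            """Set one unit cube inside a node spanning size**3 units (size is a power of two)."""
            if size == 1:
                self.material = material
                return
            self.split()
            half = size // 2
            index = (x >= half) | ((y >= half) << 1) | ((z >= half) << 2)
            self.children[index].set_block(x % half, y % half, z % half, half, material)
            # If all eight children ended up as identical uniform leaves, merge them again.
            if all(c.children is None and c.material == material for c in self.children):
                self.children, self.material = None, material

For example, root = OctreeNode(); root.set_block(3, 0, 5, 16, "stone") carves a single stone block out of a 16**3 region while the rest of the region stays a handful of uniform nodes.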
If you're interested in exploring alternative ways to represent Minecraft world (chunk) data, you can also look into the idea of bitstrings. Each 'chunk' comprises a 16*16*128 volume, and each 16*16 layer can be represented compactly (one bit per block for a given block type) and consolidated into a binary string.
As this approach is highly specific to one goal (trading client computation for highly optimized storage and transfer time), it seems imprudent to attempt to explain all the details here, but I have created a specification for just this purpose, if you're interested.
Using this method, the storage cost is drastically different from the current 1 byte per block; it is instead 'variable bit rate': per block type, (1 bit per block, rounded up to a multiple of 8) * (the number of unique layers that block type appears in within the chunk) + 2 bytes.
This is then summed over the number of unique block types in that chunk.
Pretty much only in deliberate edge cases can this be more expensive than a normally structured chunk; well over 99% of Minecraft chunks are naturally generated and would benefit from this variable-bit representation by a ratio of 8:1 or more in many of my tests.
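Read literally, the cost formula above works out to roughly the following sketch (the per-layer, one-bit-per-block reading and all names here are assumptions; the specification mentioned above may differ):

    def chunk_storage_cost_bytes(layers_per_blocktype, blocks_per_layer=16 * 16):
        """layers_per_blocktype maps block type -> number of 16x16 layers it appears in."""
        layer_bytes = (blocks_per_layer + 7) // 8            # 1 bit per block, rounded up to whole bytes
        return sum(layer_bytes * layers + 2                  # plus a 2-byte header per block type
                   for layers in layers_per_blocktype.values())

    # Example chunk: stone in 60 layers, dirt in 5, air in 128
    print(chunk_storage_cost_bytes({"stone": 60, "dirt": 5, "air": 128}))   # 6182 bytes
    print(16 * 16 * 128)                                     # 32768 bytes at 1 byte per block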
Your best bet is to decompile Minecraft and look at the source. Modifying Minecraft: The Source Code is a nice walkthrough on how to do that.
Minecraft is very far from efficient. It just stores "chunks" of data.
Check out the "Map formats" in the Development Resources at Minecraft Wiki. AFAIK, the internal representation is exactly the same.
I understand that this makes the algorithms faster and lets them use less storage space, and that these would have been critical features for software to run on the hardware of previous decades, but is this still an important feature? If the calculations were done with exact rational arithmetic, there would be no rounding errors at all, which would simplify many algorithms, as you would no longer have to worry about catastrophic cancellation or anything like that.
Floating point is much faster than arbitrary-precision and symbolic packages, and 12-16 significant figures is usually plenty for demanding science/engineering applications where non-integral computations are relevant.
The programming language ABC used rational numbers (x / y where x and y were integers) wherever possible.
Sometimes calculations would become very slow because the numerator and denominator had become very big.
So it turns out that it's a bad idea if you don't put some kind of limit on the numerator and denominator.
In the vast majority of computations, the size of the numbers required to compute answers exactly would quickly grow beyond the point where the computation would be worth the effort, and in many calculations it would grow beyond the point where exact calculation would even be possible. Consider that even running something like a simple third-order IIR filter for a dozen iterations would require a fraction with thousands of bits in the denominator; running the algorithm for a few thousand iterations (hardly an unusual operation) could require more bits in the denominator than there are atoms in the universe.
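A quick sketch of that growth using Python's fractions module, with made-up small coefficients (realistic coefficients, whose denominators are not this tiny, blow up far faster):

    from fractions import Fraction

    # y[n] = a1*y[n-1] + a2*y[n-2] + a3*y[n-3] + 1, an arbitrary 3rd-order recurrence
    a1, a2, a3 = Fraction(1, 3), Fraction(-1, 7), Fraction(2, 11)
    y = [Fraction(1)] * 3

    for n in range(1, 61):
        y.append(a1 * y[-1] + a2 * y[-2] + a3 * y[-3] + 1)
        if n % 20 == 0:
            # The exact denominator keeps growing without bound; a float stays 64 bits forever.
            print(f"after {n:2d} steps: {y[-1].denominator.bit_length()} bits in the denominator")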
Many numerical algorithms still require fixed-precision numbers in order to perform well enough. Such calculations can be implemented in hardware because the numbers fit entirely in registers, whereas arbitrary-precision calculations must be implemented in software, and there is a massive performance difference between the two. Ask anybody who crunches numbers for a living whether they'd be OK with things running X amount slower, and they will probably say "no, that's completely unworkable."
Also, I think you'll find that having arbitrary precision is impractical and even impossible. For example, the number of decimal places can grow fast enough that you'll want to drop some. And then you're back to square one: rounded number problems!
Finally, sometimes the numbers beyond a certain precision do not matter anyway. For example, generally the number of significant digits should reflect the level of experimental uncertainty.
So, which algorithms do you have in mind?
Traditionally integer arithmetic is easier and cheaper to implement in hardware (uses less space on the die so you can fit more units on there). Especially when you go into the DSP segment this can make a lot of difference.
Who knows the most robust algorithm for a chromatic instrument tuner?
I am trying to write an instrument tuner. I have tried the following two algorithms:
FFT to create a Welch periodogram and then detect the peak frequency
A simple autocorrelation (http://en.wikipedia.org/wiki/Autocorrelation)
I encountered the following basic problems:
Accuracy 1: in an FFT, the relation between sample rate, recording length, and bin size is fixed. This means that I need to record 1-2 seconds of data to get an accuracy of a few cents. This is not exactly what I would call real-time.
Accuracy 2: autocorrelation works a bit better. To get the needed accuracy of a few cents, I had to introduce linear interpolation of samples.
Robustness: in the case of a guitar I see a lot of overtones. Some overtones are actually stronger than the main tone produced by the string. I could not find a robust way to pick out the string actually being played.
Still, any cheap electronic tuner works more robustly than my implementation.
How are those tuners implemented?
You can interpolate FFTs as well, and you can often use the higher harmonics for increased precision. You need to know a little bit about the harmonics the instrument produces, and it's easier if you can assume you're less than half an octave off target, but even in the absence of that, the fundamental frequency is usually much stronger than the first subharmonic and is not that far below the primary harmonic. A simple heuristic should let you pick the fundamental frequency.
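As a sketch of the FFT-interpolation idea (NumPy; parabolic interpolation of the log-magnitude peak is one common way to get sub-bin accuracy from a short window, and the function name is just illustrative):

    import numpy as np

    def refined_peak_hz(signal, sample_rate):
        """Coarse FFT peak refined by parabolic interpolation of the log-magnitude."""
        window = np.hanning(len(signal))
        spectrum = np.abs(np.fft.rfft(signal * window))
        k = int(np.argmax(spectrum[1:-1])) + 1               # coarse peak bin
        a, b, c = np.log(spectrum[k - 1:k + 2] + 1e-12)      # neighbours around the peak
        delta = 0.5 * (a - c) / (a - 2 * b + c)              # vertex offset, in bins
        return (k + delta) * sample_rate / len(signal)

    sr, f0 = 44100, 440.3
    t = np.arange(0, 0.2, 1.0 / sr)                          # only 200 ms of signal
    print(refined_peak_hz(np.sin(2 * np.pi * f0 * t), sr))   # close to 440.3 Hz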
I doubt that the autocorrelation method will work all that robustly across instruments, but you should get a series of self-similarity scores that is highest when you're offset by one fundamental period. If you go two periods, you should get the same score again (to within noise and differential damping of the different harmonics).
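A corresponding autocorrelation sketch (again NumPy, names illustrative): pick the lag with the highest self-similarity inside a plausible pitch range, then refine around that lag with interpolation if you need cent-level accuracy.

    import numpy as np

    def acf_fundamental_hz(signal, sample_rate, fmin=60.0, fmax=1500.0):
        """Pick the autocorrelation lag (fundamental period) with the highest score."""
        signal = signal - np.mean(signal)
        acf = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
        lo, hi = int(sample_rate / fmax), int(sample_rate / fmin)
        lag = lo + int(np.argmax(acf[lo:hi]))                # best lag in the allowed range
        return sample_rate / lag

    sr = 44100
    t = np.arange(0, 0.2, 1.0 / sr)
    print(acf_fundamental_hz(np.sin(2 * np.pi * 196.0 * t), sr))   # ~196 Hz, quantized to an integer lag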
There's a pretty cool algorithm called Bitstream Autocorrelation. It doesn't take too many CPU cycles, and it's very accurate. You basically find all the zero-crossing points and save the result as a binary string. Then you run autocorrelation on that string. It's fast because you can use XOR instead of floating-point multiplication.
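A rough sketch of that idea (NumPy, with unpacked sign bits for clarity; a real implementation would pack the bits into machine words and use XOR plus a popcount per word):

    import numpy as np

    def bitstream_fundamental_hz(signal, sample_rate, fmin=60.0, fmax=1500.0):
        """Reduce the signal to sign bits, then find the lag with the fewest XOR mismatches."""
        bits = (signal >= 0).astype(np.uint8)                # one bit per sample (encodes the zero crossings)
        lo, hi = int(sample_rate / fmax), int(sample_rate / fmin)
        best_lag, best_score = lo, 1.0
        for lag in range(lo, hi):
            # fraction of positions where the stream and its shifted copy disagree
            mismatch = np.count_nonzero(bits[:-lag] ^ bits[lag:]) / (len(bits) - lag)
            if mismatch < best_score:
                best_lag, best_score = lag, mismatch
        return sample_rate / best_lag

    sr = 44100
    t = np.arange(0, 0.2, 1.0 / sr)
    print(bitstream_fundamental_hz(np.sin(2 * np.pi * 196.0 * t), sr))   # ~196 Hz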