Matrix multiplier with Chisel

I want to describe a matrix multiplier with Chisel, but there are some things that I do not understand.
First, I found a response giving the code for a 3x5 matrix multiplier. I would like to generalize it for any square matrix up to 128x128. I know that in Chisel I can parameterize a module by giving it a size parameter (so that I use n.W instead of a fixed width).
But at the end of the day, a Verilog file will be generated, right? So the parameters have to be fixed? I am probably confusing some things. My purpose is to adapt the code to be able to perform any matrix multiplication up to 128x128, and I do not know if it is technically possible.

The advantage of Chisel is that everything can be parameterized. That being said, at the end of the day, when you are making your physical hardware, the parameter obviously has to be fixed. The advantage of parameterizing is that if you don't know your exact requirements yet (like the available die area), you can have a parameterized version ready, and when the time comes you plug in the values you need and generate the Verilog for that parameter. And to answer your question: yes, it is possible to perform any matrix multiplication up to 128x128 (or beyond, if your laptop's RAM is sufficient). You only get the Verilog when you invoke the hardware driver (the linked guide explains how to generate Verilog from Chisel), so go ahead and create your parameterized hardware.
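As a loose analogy (plain Python rather than Chisel, purely to illustrate the point): a hardware generator is ordinary code whose size parameter is only frozen when you ask it to emit Verilog, which is what Chisel's elaboration step does for a parameterized module.

    def emit_adder(width: int) -> str:
        # Toy stand-in for elaboration: 'width' exists only in the generator;
        # the emitted Verilog has a fixed, concrete size baked in.
        return f"""module adder_{width} (
      input  [{width - 1}:0] a,
      input  [{width - 1}:0] b,
      output [{width}:0]     sum
    );
      assign sum = a + b;
    endmodule
    """

    # Pick the parameter only when you actually need the hardware:
    print(emit_adder(128))

In Chisel the Scala parameter (your n in n.W) stays free while you write the generator; only the invocation that emits Verilog pins it to a concrete value such as 128.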

Related

How to store an equation in EEPROM?

I'm working with embedded systems. For the sake of explanation, I'm working with a dsPIC33EP and a simple serial EEPROM.
Suppose I'm building a controller that uses a linear control scheme (y = mx + b). If the controller needs many different settings, it's easy: store the m and the b in EEPROM and retrieve them for each setting.
Now suppose I want to have different equations for different settings. I would have to pre-program all the equations and then have a method for selecting one and pulling its settings from the EEPROM. It's harder, because you need to know the equations ahead of time, but still doable.
Now suppose that you don't know the equations ahead of time. Maybe you have to do a piecewise approximation, for example. How could you store something like that in memory, so that all the controller has to do is feed it a sensor reading and get back a control variable? Kind of like passing a variable to a function and getting the answer passed back.
How could you store a function like that in memory if only the current state is important?
How could you store a function like that if past states are important (if the control equation is second, third or fourth order for example)?
The dsPICs have limited RAM but quite a bit of flash, enough for a small but effective text parser. Have you thought of using some form of text-based script? These can be translated to a more efficient data format at run time.
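As a sketch of the piecewise-approximation idea (written in Python for brevity; on the dsPIC this would be C, with the breakpoint table serialized into the serial EEPROM), the "function" is stored as data, a table of breakpoints, and the firmware only needs one generic interpolator:

    # Hypothetical layout: the EEPROM holds a sorted list of (x, y) breakpoints;
    # the controller stores data, not code, and evaluates it by interpolation.
    def eval_piecewise(breakpoints, x):
        # Linear interpolation over sorted (x, y) pairs, clamped at the ends.
        if x <= breakpoints[0][0]:
            return breakpoints[0][1]
        if x >= breakpoints[-1][0]:
            return breakpoints[-1][1]
        for (x0, y0), (x1, y1) in zip(breakpoints, breakpoints[1:]):
            if x0 <= x <= x1:
                return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

    # An "equation" loaded from EEPROM as data:
    table = [(0.0, 0.0), (10.0, 2.5), (20.0, 4.0), (50.0, 5.0)]
    print(eval_piecewise(table, 12.0))  # control output for a sensor reading of 12.0

For the higher-order case the same trick works: store the coefficients of a difference equation (e.g. u[k] = a1*u[k-1] + b0*e[k] + b1*e[k-1]) as data and keep the few past samples in RAM.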

Determinism in tensorflow gradient updates?

So I have a very simple NN script written in TensorFlow, and I am having a hard time trying to track down where some "randomness" is coming from.
I have recorded the weights, gradients, and logits of my network as I train, and for the first iteration it is clear that everything starts off the same. I have a SEED value for how data is read in and a SEED value for initializing the weights of the net. Those I never change.
My problem is that on, say, the second iteration of every re-run I do, I start to see the gradients diverge (by a small amount, like 1e-6 or so). Over time, this of course leads to non-repeatable behaviour.
What might the cause of this be? I don't know where any possible source of randomness might be coming from...
Thanks
There's a good chance you could get deterministic results if you run your network on the CPU (export CUDA_VISIBLE_DEVICES=), with a single thread in the Eigen thread pool (tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=1))), one Python thread (no multi-threaded queue runners, which you get from ops like tf.batch), and a single well-defined operation order. Also, using inter_op_parallelism_threads=1 may help in some scenarios.
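For reference, a minimal sketch of that configuration using the TF 1.x API quoted above:

    import os
    import tensorflow as tf

    os.environ["CUDA_VISIBLE_DEVICES"] = ""       # hide the GPU, run on the CPU
    config = tf.ConfigProto(intra_op_parallelism_threads=1,   # single-threaded Eigen pool
                            inter_op_parallelism_threads=1)   # no parallel op scheduling
    sess = tf.Session(config=config)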
One issue is that floating point addition/multiplication is non-associative, so one fool-proof way to get deterministic results is to use integer arithmetic or quantized values.
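A two-line illustration of that non-associativity, which is why a different summation order on a re-run really can change the result:

    print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
    print(0.1 + (0.2 + 0.3))   # 0.6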
Barring that, you could isolate which operation is non-deterministic and try to avoid using that op. For instance, there's the tf.add_n op, which doesn't say anything about the order in which it sums its values, and different orders produce different results.
Getting deterministic results is a bit of an uphill battle because determinism is in conflict with performance, and performance is usually the goal that gets more attention. An alternative to trying to get the exact same numbers on reruns is to focus on numerical stability -- if your algorithm is stable, then you will get reproducible results (i.e., the same number of misclassifications) even though the exact parameter values may be slightly different.
The TensorFlow reduce_sum op is specifically known to be non-deterministic, and reduce_sum is used for calculating bias gradients.
This post discusses a workaround that avoids reduce_sum (i.e., taking the dot product of a vector with a vector of all 1's is the same as reduce_sum of that vector).
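A sketch of that workaround in TF 1.x style (the shapes here are purely illustrative):

    import tensorflow as tf

    dy = tf.placeholder(tf.float32, shape=[32, 10])   # e.g. an upstream gradient for a batch of 32

    bias_grad_nondet = tf.reduce_sum(dy, axis=0)      # summation order unspecified
    ones = tf.ones([1, 32])
    bias_grad_det = tf.matmul(ones, dy)               # the same sum expressed as a matmul, shape [1, 10]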
I have faced the same problem.
The working solution for me was to:
1- Use tf.set_random_seed(1) so that all tf functions have the same seed on every new run.
2- Train the model using the CPU, not the GPU, to avoid non-deterministic GPU operations due to precision (both steps are shown in the snippet below).
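In code, those two steps amount to (TF 1.x API):

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = ""   # force CPU execution; set before TF touches the GPU

    import tensorflow as tf
    tf.set_random_seed(1)                     # graph-level seed; call it before building the graph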

How to Scale SPICE Matrix so LU-decomposition doesn't Fail

I am implementing a SPICE solver. I have the following problem: say I put two diodes and a current source in series (standard diodes). I use MNA and Boost LU-decomposition. The problem is that the nodal matrix very quickly becomes near-singular. I think I have to scale the values, but I don't know how, and I couldn't find anything on the Internet. Any ideas how to do this scaling?
From a numerical perspective, there is a scaling technique for this kind of near-singular matrix. Basically, the technique is to divide each row of A by the sum (or maximum) of the absolute values in that row. You can look at KLU, a linear solver for circuit simulation, for more details.
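A small NumPy sketch of that row scaling (the right-hand side b is scaled by the same factors, so the solution x is unchanged):

    import numpy as np

    def row_equilibrate(A, b):
        # Divide each row of A (and the matching entry of b) by the row's largest
        # absolute value, so all rows end up with comparable magnitude.
        s = np.max(np.abs(A), axis=1)
        s[s == 0] = 1.0                  # leave all-zero rows untouched
        return A / s[:, None], b / s

    # Scale first, then LU-factor and solve the scaled system.
    A = np.array([[1e12, 2.0], [3.0, 4e-9]])
    b = np.array([1.0, 2.0])
    As, bs = row_equilibrate(A, b)
    x = np.linalg.solve(As, bs)          # same x as the original system, better-scaled rows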
From the SPICE-simulation perspective, simulators use the so-called Gmin stepping technique to iteratively compute and approach the real answer. You can find this in the documentation of the SPICE project QUCS (Quite Universal Circuit Simulator).
Scaling does not help when the matrix has both very large and very small entries.
It is necessary to use some or all of the many tricks that were developed for circuit-solver applications. A good start is clipping the range of the exponential and log function arguments to reasonable values -- in most circuits a diode forward voltage is never more than 1 V and the diode reverse current is never less than 1 pA.
Actually, look at all library functions and wrap them in code that makes their arguments and results suitable for circuit-solving purposes. Simple clipping is sometimes good enough, but it is way better to make sure the functions stay (twice) differentiable and continuous.
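As a sketch of that wrapping idea (plain Python, with illustrative constants roughly matching a silicon diode at room temperature):

    import math

    V_T = 0.02585     # thermal voltage kT/q at ~300 K, in volts
    V_CLIP = 1.0      # assume a forward voltage never exceeds ~1 V

    def diode_exp(v):
        # exp(v / V_T) with the argument clipped; above the clip point the function
        # continues linearly, so its value and first derivative stay continuous for
        # Newton iterations (a quadratic continuation would keep it twice differentiable).
        x = v / V_T
        x_max = V_CLIP / V_T
        if x <= x_max:
            return math.exp(x)
        return math.exp(x_max) * (1.0 + (x - x_max))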

Range reduction for trigonometric functions

I'm trying to implement range reduction for trigonometric functions.
I found this paper http://www.computer.org/csdl/proceedings/pcspa/2010/4180/00/4180b048-abs.html which talks about using 64-bit integer arithmetic.
The idea presented should work, but there seems to be some problem with the equations in the paper.
Is this more efficient than the reduction implemented in fdlibm?
Should you want to perform a complete floating-point range reduction, consult K.C. Ng's "ARGUMENT REDUCTION FOR HUGE ARGUMENTS: Good to the Last Bit", readily findable on the web.
The salient issue is that to do range reduction for standard trig functions such as sine(x), where x is in radians, one must do a precise mod operation involving Pi. The mod needs to extend 4/Pi out to enough fractional bit places to have a meaningful result. This paper details that process and how far one needs to go. It turns out to be potentially hundreds of bits, but not millions. Possibly you are already aware of this issue, but if not, it is what you need to know to make a good reduction using 64-bit routines or whatever.
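A toy illustration of why the extra bits matter, using Python's decimal module as a stand-in for the fixed-point tables of 2/Pi (or 4/Pi) bits that a real reduction such as the one in Ng's paper carries:

    import math
    from decimal import Decimal, getcontext

    getcontext().prec = 80    # far more digits of pi than a double carries

    PI = Decimal("3.14159265358979323846264338327950288419716939937510"
                 "58209749445923078164062862089986280348253421170679")

    def reduce_huge(x):
        # x mod 2*pi computed with extended-precision pi; rounding to the nearest
        # multiple puts the result roughly in [-pi, pi].
        d = Decimal(x)
        two_pi = 2 * PI
        k = (d / two_pi).to_integral_value()
        return float(d - k * two_pi)

    x = 1.0e22
    print(math.fmod(x, 2 * math.pi))   # double-precision pi: no correct bits survive
    print(reduce_huge(x))              # extended-precision pi: a usable remainder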

Can raymarching be accelerated under an SIMD architecture?

The answer would seem to be no, because raymarching is highly conditional, i.e. each ray follows a unique execution path, since at each step we check for opacity, termination, etc., which vary based on the direction of the individual ray.
So it would seem that SIMD would largely not be able to accelerate this; rather, MIMD would be required for acceleration.
Does this make sense? Or am I missing something(s)?
As stated already, you could probably get a speedup from implementing your vector math using SSE instructions (be aware of the effects discussed here -- also for the other approach). This approach would allow the code to stay concise and maintainable.
I assume, however, that your question is about "packet traversal" (or something like it), in other words processing multiple scalar values, each belonging to a different ray:
In principle it should be possible to defer the shading to another pass. The SIMD packet could be repopulated with a new ray once the bare marching pass terminates, with the temporary result stored as input for the shading pass. This would allow you to parallelize a certain, case-dependent percentage of your code, exploiting all four SIMD lanes.
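Here is a hedged sketch of the packet idea, using NumPy boolean masks in place of SSE lanes since the principle is the same; the lane refilling and the separate shading pass described above are only noted in the comments, not implemented:

    import numpy as np

    def march_packet(origins, dirs, sdf, max_steps=64, eps=1e-4, t_max=100.0):
        # Sphere-trace a packet of rays in lock-step. 'active' plays the role of the
        # SIMD lane mask: rays that hit (or escape) stop advancing while the rest
        # keep stepping. A full packet tracer would refill dead lanes from a queue
        # of pending rays and hand the hit records to a deferred shading pass.
        n = origins.shape[0]
        t = np.zeros(n)
        hit = np.zeros(n, dtype=bool)
        active = np.ones(n, dtype=bool)
        for _ in range(max_steps):
            idx = np.flatnonzero(active)
            p = origins[idx] + t[idx, None] * dirs[idx]
            dist = sdf(p)                                  # distance estimate per active ray
            t[idx] += dist
            hit[idx[dist < eps]] = True
            active[idx[(dist < eps) | (t[idx] > t_max)]] = False
            if not active.any():
                break
        return t, hit

    # Toy usage: a unit-sphere distance field standing in for the real scene.
    sphere = lambda p: np.linalg.norm(p, axis=1) - 1.0
    o = np.tile([0.0, 0.0, -3.0], (4, 1))
    d = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.0], [0.5, 0.0, 1.0], [1.0, 0.0, 1.0]])
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    print(march_packet(o, d, sphere))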
Tiling the image and indexing the rays within it in Morton order might be a good idea too, in order to avoid cache pressure (unless your geometry is strictly procedural).
You won't know whether it pays off unless you try. My guess is that if it does, the amount of speedup might not be worth the complication of the code for just four lanes.
Have you considered using an SIMT architecture such as a programmable GPU? A somewhat up-to-date programmable graphics board allows you to perform raymarching at interactive rates (see it happen in your browser here).
Over the last few days I built a software-based raymarcher for a Menger sponge. At the moment it uses no SIMD, and I also used no special algorithm. I just trace from -1 to 1 in X and Y, which are U and V for the destination texture. Then I have a camera position and a destination, which I use to calculate the increment vector for the raymarch.
After that I use a constant number of iterations, in which only one branch decides whether there's an intersection with the fractal volume. So if my camera eye is E and my direction vector is D, I have to find the smallest t. If I find it, or reach a maximal distance, I break the loop. At the end I have t -- from that I calculate the fragment color.
In my opinion it should be possible to parallelize these operations with SSE1/2, because one can resolve the branch by nulling out that ray's field in the vector (__m64 / __m128), so further SIMD operations won't affect it. It really depends on what you raymarch/-cast, but if you just calculate a fragment color from a function (as my fractal curve here is) and don't access memory non-linearly, there are some tricks to make it possible.
Sure, this answer contains speculation, but I will keep you informed when I've parallelized this routine.
Only insofar as SSE, for instance, lets you do operations on vectors in parallel.
