I am working on designing a mandelbrot viewer and I am designing hardware for squaring values. My squarer is recursively built where a 4bit squarer relies on 2, 2bit squarers. so for my 16 bit squarer, that has 2 8bits squarers, and each one of those has 2 4bit squarer's.
As you can see the recursivity begins to make the design blow up in complexity. To help speed up my design i would like to use a 4input ROM that emulates a 4bit squarer. So when you enter 3 in the rom, it outputs 9, when you enter 15, it outputs 225.
I know that a normal LUT implemented in a logic cell ay have 3 or 4 input variables and only 1 output, but i need an 8 bit output so I need more of a ROM then a LUT.
Any and all help is appreciated, Im curious how the FPGA will store those ROMs and if storing it in ROM would be faster than computing the 4input Square.
-
Jarvi
To square a 4-bit number explicitly using LUTs, you would need to use 8 4-input LUTs. Each LUT's output would give you one bit of the 8-bit product.
The overall size and fmax performance of your design may be achieved with this approach, using larger block RAM primitives (as ROM), dedicated MAC (multiply-accumulate) units, or by using the normal mulitiplication operator * and relying on your synthesis tool's optimization.
You may also want to review some research papers related to this topic, for example here.
Related
I'm working on a guitar effects "pedal" using the NEXSYS A7 Board.
For this purpose, I've purchased the I2S2 PMOD and successfully got it up and running using the example code provided by Digilent.
Currently, the design is a "pass-through", meaning that audio comes into the FPGA and immediately out.
I'm wondering what would be the correct way to store the data, make some DSP on this data to create the effects, and then transmit the modified data back to the I2S2 PMOD.
Maybe it's unnecessary to store the data?
maybe I can pass it through an RTL block that's responsible for applying the effect and then simply transmit the modified data out?
Collated from comments and extended.
For a live performance pedal you don't want to store much data; usually 10s of ms or less. Start with something simple : store 50 or 100ms of data in a ring (read old data, store new data, inc address modulo memory size). Output = Newdata = ( incoming sample * 0.n + olddata * (1 - 0.n)) for variable n. Very crude reverb or echo.
Yes, ring = ring buffer FIFO. And you'll see my description is a very crude implementation of a ring buffer FIFO.
Now extend it to separate read and write pointers. Now read and write at different, harmonically related rates ... you have a pitch changer. With glitches when the pointers cross.
Think of ways to hide the glitches, and soon you'll be able to make the crappy noises Autotune adds to most all modern music from that bloody Cher song onwards. (This takes serious DSP : something called interpolating filters is probably the simplest way. Live with the glitches for now)
btw if I'm interested in a distortion effect, can it be accomplished by simply multiplying the incoming data by a constant?
Multiplying by a constant is ... gain.
Multiplying a signal by itself is squaring it ... aka second harmonic distortion or 2HD (which produces components on the octave of each tone in the input).
Multiplying a signal by the 2HD is cubing it ... aka 3HD, producing components a perfect fifth above the octave.
Multiplying the 2HD by the 2HD is the fourth power ... aka 4HD, producing components 2 octaves higher, or a perfect fourth above that fifth.
Multiply the 4HD by the signal to produce 5HD ... and so on to probably the 7th. Also note that these components will decrease dramatically in level; you probably want to add gain beyond 2HD, multiply by 4 (= shift left 2 bits) as a starting point, and increase or decrease as desired.
Now multiply each of these by a variable gain and mix them (mixing is simple addition) to add as many distortion components you want as loud as you want ... don't forget to add in the original signal!
There are other approaches to adding distortion. Try simply saturating all signals above 0.25 to 0.25, and all signals below -0.25 to -0.25, aka clipping. Sounds nasty but mix a bit of this into the above, for a buzz.
Learn how to make white noise (pseudo-random number, usually from a LFSR).
Multiply this by the input signal, and mix or match with the above, for some fuzz.
Learn digital filtering (low pass, high pass, band pass for EQ), and how to control filters with noise or the input signal, the world of sound is open to you.
One of the AVX-512 instruction set extensions is AVX-512 + GFNI, " Galois Field New Instructions".
Galois theory is about field extensions. What does that have to do with processing vectorized integer or floating-point values? The instructions supposedly perform "Galois field affine transformation", the inverse of that, and "Galois field multiply bytes".
What fields are those? What do these instructions actually do and what is it good for?
These instructions are closely related to the AES (Rijndael) block cipher. GF2P8AFFINEINVQB performs a Rijndael S-Box substitution with a user-defined affine transformation.
GF2P8AFFINEQB is essentially a (carry-less) multiplication of an 8x8 bit matrix with an 8-bit vector in GF(2), so it should be useful in other bit-oriented algorithms. It can also be used to convert between isomorphic representations of GF(28).
GF2P8MULB multiplies two (vectors of) elements of GF(28), actually 8-bit numbers in polynomial representation with the Rijndael reduction polynomial. This operation is used in Rijndael's MixColumns step.
Note that multiplication in finite fields is only loosely related to integer multiplication.
One of the major use-cases is I think SW RAID6 parity, including generating new parity on every write. (Not just during recovery / rebuild). RAID5 can use simple XOR parity for its one and only parity member of each stripe, but RAID6 needs two different parities that can recover N blocks of data from any N of the N+2 blocks of data+parity. This is forward error correction, a similar kind of problem that ECC solves.
Galois Fields are useful for this; they're the basis of widely-used Reed-Solomon codes, for example. e.g. Par2 uses 16-bit Galois Fields to allow very large block counts to generate relatively fine-grained error-recovery data for a large file or set of files. (Up to 64k blocks).
Unfortunately GFNI is not great for PAR2 because GFNI only supports GF2P8 GF(28), not the GF(216) that par2 uses. http://lab.jerasure.org/jerasure/gf-complete/issues/14 says it's possible to use GF2P8AFFINEQB to implement wider word sizes so it might be possible to speed up PAR2 with it.
But it should be useful for RAID6, including generating new parity on writes which is pretty CPU intensive. The Linux kernel's md driver already includes inline asm to use SSE2 or AVX2, one of the few uses of kernel_fpu_begin() and kernel_fpu_end(). (A 2013 paper looks at optimizing GF coding using Intel SIMD, mentioning Linux's md RAID and GF-Complete, the project linked earlier. The current state of the art is something like two pshufb byte shuffles to implement a 4-bit table lookup; GFNI could bring that down to 1 instruction especially if the hard-coded GF polynomial baked into gf2p8mulb is used.)
(RAID6 uses parity in a different way than par2, generating separate parity for each stripe "vertically" across disks, instead of "horizontally" for one big array of data. The underlying math is similar.)
Intel pretty probably plans to support GFNI on some future Silvermont-family Atom because there are legacy-SSE encodings of the instructions, without 3-operand VEX or EVEX. Many other new instructions are introduced with only VEX encodings, including some of the BMI1/BMI2 scalar integer instructions.
Silvermont-family (Airmont, Goldmont, Tremont, ...) gets some use in NAS appliances where most of the CPU demand could come from RAID6. A future version of it with GFNI could save power, or avoid bottlenecks without raising clock speed.
AVX + GFNI implies support for a YMM version (even without AVX2), and AVX512F + GFNI implies a ZMM version. (The HTML extract at felixcloutier.com strangely only mentions the non-VEX 128-bit encoding while also listing a _mm_maskz_gf2p8affine_epi64_epi8 intrinsic (masking requires EVEX). HJLebbink's HTML extract does include the VEX and EVEX forms. Maybe they only appear in Intel's "future extensions" manual which HJ scrapes but Felix doesn't.)
Using 512-bit vectors does limit turbo clock speeds for a short time after (on Skylake-Xeon), so it might not be desirable for the kernel to do that. But it could give a significant reduction in CPU overhead for some cases, if you're not memory-bound.
A "field" is a mathematical concept:
(wikipedia) In mathematics, a field is a set on which addition, subtraction, multiplication, and division are defined and behave as the corresponding operations on rational and real numbers do.
...
including the existence of an additive inverse −a for all elements a, and of a multiplicative inverse b−1 for every nonzero element b
Galois Fields are a kind of Finite Field which have this property: the bits in a GF8 number represent 0 or 1 coefficients of a polynomial of degree 8. (It's quite possible I totally butchered that, but it's something like that rather than place-value.) That's why carryless addition (aka XOR) and carryless multiplication (using shift/XOR instead of shift/add) is useful over Galois fields)
gf2p8mulb's baked-in polynomial of x^8 + x^4 + x^3 + x + 1 matches the one used in AES (Rijndael); this lends more weight to #nwellnhof's hypothesis that Intel just included it because the HW was there.
If it's also used in any other common application, it might give us a clue of the "intended" use case for these instructions.
There is a VAES extension that provides versions of AESENC and related instructions for YMM and ZMM vectors, up from just 128-bit vectors with AES-NI + AVX2. So Intel apparently is extending AES HW to 512-bit SIMD vectors. IDK if this motivates wide GFNI or vice versa, or some of both. (Wide GFNI makes a huge amount of sense; if it was limited to 128-bit, an optimized AVX512 implementation using vpshufb for lookup tables would beat it.)
To answer the purpose part, my guess is that these were added primarily for accelerating SM4 encryption, which shares similarity with AES in design.
This guess comes from the fact that ARM also added SM4 acceleration in ARMv8.4 at around the same time, suggesting that chipmakers want to accelerate this algorithm, probably because it'll gain significant traction in the Chinese market. Also, the fact that it's the only AVX512 extension added in Icelake which also has an SSE encoding, so that Tremont could support it, suggests that they intended it for networking/storage purposes.
GFNI is also quite useful in Reed Solomon coding for error correction (as mentioned by Peter above). It's directly applicable to any GF(28) implementation (such as this) and the affine instruction can be used for other field sizes and polynomials - in fact, it's the fastest technique I know of to do so on an Intel processor.
The affine instruction also has a bunch of out-of-band use cases, including 8-bit shifts and bit-permutes. It's equivalent to RISC-V's bmatxor instruction, where some use cases are listed here.
Some links describing use cases for this instruction.
I have a problem which is easier solved with a HLS tool than with writing down the raw VHDL / verilog. Currently I'm using a Xilinx Virtex-7 as I think this has been solved already by some other vendors.
I can use VHDL 2008.
So imagine in VHDL you have many calculations such as:
p1 <= a x b - c;
p2 <= p1 x d - e;
p3 <= p2 x f - g;
p4 <= p2 x p1 - p3;
Currently if I were to write this with IP Cores, it would be four DSP IP cores, and because of the different port widths, I'd have to generate this IP core 4 times. Anytime I make a change to some of these external signals, all the widths would change again. Keeping track of all this resizing is a pain, especially when resizing signed vectors down.
I have a lot of maths and thus a lot of DSP logic. It would be easier to write this block with a HLS tool. Ideally I would like it to handle the widths and bitshift the data accordingly.
Does such a tool exist? Which one would you recommend?
Bonus points:
Do any of these tools handle floating point maths and let you control precision?
There are lots of ways to accomplish your goal. But first to address your points.
Currently if I were to write this with IP Cores, it would be three DSP IP cores, and because of the different port widths, I'd have to generate this IP core 3 times.
Not necessarily. If your inputs a through g are all fixed point, you can use ieee.numeric_std or in VHDL-2008 you can use ieee.fixed_pkg. These will infer DSP cores (such as the DSP48 on Xilinx). For example:
-- Assume a, b, and c are all signed integers (or implicit fixed point)
signal a : signed(7 downto 0);
signal b : signed(7 downto 0);
signal c : signed(7 downto 0);
signal p1 : signed(a'length+b'length downto 0); -- a times b produces a'length + b'length +1 (which also corresponds to (a times b) - c adding one bit).
...
p1 <= a*b - resize(c, p1'length);
This will imply multipliers and adders.
And this can be similarly done with UFIXED or SFIXED. But you do need to track the bit widths.
Also, there is a floating point package (ieee.float_pkg), but I would NOT recommend that for hardware. You are better off timing and resource-wise to implement it in fixed point.
Anytime I make a change to some of these external signals, all the widths would change again. Keeping track of all this resizing is a pain.
You can do this automatically. Look at my example above. You can easily determine widths based on the operations. Multiplications sum the number of bits. Additions add a single bit. So, if I have:
y <= a * b;
Then I can derive the length of y as simply a'length + b'length. It can be done. The issue, however, is bit growth. The chain of operations you describe will grow significantly if you keep full precision. At certain points you will need to truncate or round to reduce the number of bits. This is the hard part, it how much error you can tolerate is dependent upon the algorithm and expected data input.
I have a lot of maths and thus a lot of DSP logic. It would be easier to write this block with a HLS tool. Ideally I would like it to handle the widths and bitshift the data accordingly.
Automatic handling is the hard part. In VHDL this will not happen (nor Verilog for that matter). But you can track it fairly well and have bit widths update as necessary. But it will not automatically handle things like rounding, truncation, and managing error bounds. A DSP engineer should be handing those issues and directing the RTL developer on the appropriate widths and when to round or truncate.
Does such a tool exist? Which one would you recommend?
There are a variety of options to do this at a higher level. None of these are particularly frugal with respect to resources. Matlab has a code generation tool that will convert Matlab models (suitably constructed) into RTL. It will even analyze issues such as rounding, truncation, and determine appropriate bit widths. You can control the precision, but it is fixed point. We've played with it, and found it very far from producing efficient, high-speed code.
Alternatively, Xilinx does have an HLS suite (see Vivado). I'm not all that well versed in the methodology, but as I understand it, it allows writing C code to implement algorithms. The C doe is then "synthesized" to something that executes in some sort of execution engine. You still have to interface that C code to RTL infrastructure, and that's a challenge in its own right. The reason we have so far not pursued it heavily (even though we do DSP heavy designs) is that it is a big challenge to simulate both the HLS and RTL together as a system.
In the past I found flopoco to generate arbitrary math functions in hardware. If I recall correctly, it supports many types of functions. For instance it could generate a arithmetic core to compute something like a=3*sin²(x+pi/3). For these calculations allows you to specify the overall precision of the inputs/outputs (for floating point/fixed point) or the width of the inputs ( integer ). Execution frequency and whether or not to pipeline the function can also be specified.
Here is an old tutorial I found on how to use it: tutorial
Dear I am using xilinx FFT IP cores for FFT transformation but the problem is that FFT IP core takes fixed transformations length of 64,128,256,512,...
is it possible to use transform length of 50 , 100 , 126 etc. ie other than the available transform length of the iP core
is there any other solution to implement FFT for variety of transform lengths
Haider
Generally FFTs are implemented in hardware using a particular architecture called a butterfly. This architecture only works for power of 2 block sizes. It is possible to do arbitrary length FFTs, but the implementation is more complicated. Generally the solution when you need a non-power-of-2 size FFT is to zero pad to the closest power of 2 size. So if you need 50 points, pad it to 64 points with zeros.
I need to use keyboard as input for musical notes, and digilent speaker as output.
I plan to use only one octave.
My most intriguing questions are:
How do I represent the musical notes in VHDL code.
How do I (or do I need to) implement a DAC module that uses Spartan 3E Starter's built-in DAC? I have read on other forums that it can't be implemented. I need to use it in order to transmit the note to the speaker. The teacher who supervises my and my colleagues' projects suggested me to look into PWM for that(but all I've found is explained in electronic manner, no accompanying code, or explanation on implementation).
Besides keyboard controller, a processing module(for returning the note corresponding to the pressed key from the notes vector) and DAC, that I have figured out so far that I need, what else do I need.
There is a DAC (see comments)
There is no DAC on the Spartan-3E Starter Kit. Using a low-pass PWM signal is a common way to generate analog signal level from digital output.
You need to define a precision for your PWM, let's say 8 bits or 256 levels. For each audio sample you want to output, you need to count from 0 to 255. When the counter is less than the desired sample level, output 1, otherwise output 0. When the counter reach 255, reset it and go to the next sample.
Thus, if you want 8 bits precision (256 levels) and 8KHz signal, the counter will have to run at 256*8000 = 2.048MHz.
For your other questions, there is no easy answer. It's your job, as designer, to figure that out.