Dear all, I am using the Xilinx FFT IP core for FFT transformation, but the problem is that the FFT IP core only supports fixed transform lengths of 64, 128, 256, 512, and so on.
Is it possible to use transform lengths of 50, 100, 126, etc., i.e. lengths other than those the IP core provides?
Is there any other solution for implementing an FFT for a variety of transform lengths?
Haider
Generally, FFTs are implemented in hardware using a particular architecture called a butterfly. This (radix-2) architecture only works for power-of-two block sizes. It is possible to do arbitrary-length FFTs, but the implementation is more complicated. The usual solution when you need a non-power-of-two FFT is to zero-pad the input to the nearest power-of-two size. So if you need 50 points, pad to 64 points with zeros.
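As a minimal sketch (in software, but the same idea applies when buffering samples into the core), zero-padding a 50-point frame up to a 64-point transform length might look like:

```cpp
#include <vector>
#include <cstddef>

// Pad a frame with zeros up to the next supported transform length.
// The target length (64) is an assumption matching the example above.
std::vector<float> zero_pad(const std::vector<float>& frame, std::size_t target_len) {
    std::vector<float> padded(frame);   // copy the real samples
    padded.resize(target_len, 0.0f);    // append zeros up to target_len
    return padded;
}
```

Note that zero-padding interpolates the spectrum (finer bin spacing) but does not add frequency resolution; for most applications that is an acceptable trade-off.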
Related
I am working on designing a Mandelbrot viewer, and I am designing hardware for squaring values. My squarer is built recursively: a 4-bit squarer relies on two 2-bit squarers, so my 16-bit squarer has two 8-bit squarers, and each of those has two 4-bit squarers.
As you can see, the recursion makes the design blow up in complexity. To help speed up my design I would like to use a 4-input ROM that emulates a 4-bit squarer: when you enter 3, the ROM outputs 9; when you enter 15, it outputs 225.
I know that a normal LUT implemented in a logic cell may have 3 or 4 input variables and only 1 output, but I need an 8-bit output, so I need more of a ROM than a LUT.
Any and all help is appreciated. I'm curious how the FPGA will store those ROMs, and whether storing the table in ROM would be faster than computing the 4-bit square.
-
Jarvi
To square a 4-bit number explicitly using LUTs, you would need to use 8 4-input LUTs. Each LUT's output would give you one bit of the 8-bit product.
Better overall size and fmax performance may be achieved with other approaches: using larger block-RAM primitives (as ROM), using dedicated MAC (multiply-accumulate) units, or simply using the normal multiplication operator * and relying on your synthesis tool's optimization.
You may also want to review some research papers related to this topic, for example here.
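To make the ROM contents concrete, here is a sketch (plain C++ standing in for the HDL initialization table) of the 16-entry, 8-bit-wide table a 4-bit squarer ROM would hold:

```cpp
#include <array>
#include <cstdint>

// ROM contents for a 4-bit squarer: 16 entries, each an 8-bit square.
// In HDL this would typically become a LUT/ROM initialization table.
constexpr std::array<std::uint8_t, 16> make_square_rom() {
    std::array<std::uint8_t, 16> rom{};
    for (unsigned x = 0; x < 16; ++x)
        rom[x] = static_cast<std::uint8_t>(x * x);  // 15*15 = 225 fits in 8 bits
    return rom;
}

constexpr auto SQUARE_ROM = make_square_rom();
// SQUARE_ROM[3] == 9 and SQUARE_ROM[15] == 225, as in the question.
```

Since each of the 8 output bits is a function of only the 4 input bits, the synthesizer can also split this single table into eight 4-input LUTs, one per output bit, exactly as described above.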
I have to multiply two very large (~2000 x 2000) dense matrices whose entries are floats with arbitrary precision (I am using GMP and the precision is currently set to 600). I was wondering if there is any CUDA library that supports arbitrary-precision arithmetic? The only library that I have found is called CAMPARY; however, it seems to be missing references to some of the functions it uses.
The other solution that I was thinking about was implementing a version of the Karatsuba algorithm for multiplying matrices with arbitrary precision entries. The end step of the algorithm would just be multiplying matrices of doubles, which could be done very efficiently using cuBLAS. Is there any similar implementation already out there?
Since nobody has suggested such a library so far, let's assume that one doesn't exist.
You could always implement the naive implementation:
One grid thread for each pair of coordinates in the output matrix.
Each thread performs an inner product of a row and a column in the input matrices.
Individual element operations will use the code taken from the GMP (hopefully not much more than copy-and-paste).
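As a CPU sketch of that scheme (with double standing in for the GMP arbitrary-precision element type), the body of the inner loop below is what each grid thread would run for its (i, j):

```cpp
#include <vector>
#include <cstddef>

// Naive scheme: one "thread" per output element (i, j), each computing an
// inner product of row i of A with column j of B (all matrices row-major).
// double stands in for the GMP arbitrary-precision element type; the GMP
// add/mul calls would replace the += and * below.
std::vector<double> naive_matmul(const std::vector<double>& A,
                                 const std::vector<double>& B,
                                 std::size_t n) {
    std::vector<double> C(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {  // one grid thread per (i, j)
            double acc = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
    return C;
}
```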
But you can also do better than this - just like you can do better for regular-float matrix multiplication. Here's my idea (likely not the best of course):
Consider the worked example of matrix multiplication using shared memory in the CUDA C Programming Guide. It suggests putting small submatrices in shared memory. You can still do this - but you need to be careful with shared memory sizes (they're small...):
A typical GPU today has 64 KB shared memory usable per grid block (or more)
They take 16 x 16 submatrix.
Times 2 (for the two multiplicands)
Times ceil(801/8) = 101 bytes per element (assuming the GMP representation uses 600 bits for the mantissa, one bit for the sign, and 200 bits for the exponent)
So 512 * 101 < 64 KB !
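The budget above can be checked with a couple of lines (the 600/1/200-bit element layout is the assumption from the list above):

```cpp
#include <cstddef>

// Shared-memory budget for the 16 x 16 tiling with ~801-bit elements.
constexpr std::size_t element_bytes = (600 + 1 + 200 + 7) / 8;  // ceil(801/8) = 101
constexpr std::size_t tile_elems    = 16 * 16;                  // one submatrix
constexpr std::size_t total_bytes   = 2 * tile_elems * element_bytes;  // two tiles
static_assert(total_bytes == 512 * 101, "matches the 512 * 101 figure above");
static_assert(total_bytes < 64 * 1024,  "fits in 64 KB of shared memory");
```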
That means you can probably just use the code in their worked example as-is, again replacing the float multiplication and addition with code from GMP.
You may then want to consider something like parallelizing the GMP code itself, i.e. using multiple threads to work together on single pairs of 600-bit-precision numbers. That would likely help your shared memory reading pattern. Alternatively, you could interleave the placement of 4-byte sequences from the representation of your elements, in shared memory, for the same effect.
I realize this is a bit hand-wavy, but I'm pretty certain I've waved my hands correctly and it would be a "simple matter of coding".
I'm a J newbie, and am trying to import one of my large datasets for further experimentation. It is a 2D matrix of doubles, approximately 80000x50000. So far, I have found two different methods to load data into J.
The first is to convert the data into J format (replacing negatives with underscores, putting exponential notation numbers into J format, etc) and then load with (adapted from J: Handy method to enter a matrix?):
(".;._2) fread 'path/to/file'
The second method is to use tables/dsv.
I am experiencing the same problem with both methods: namely, that these methods work with small matrices, but fail at approximately 10M values. It seems the input just gets truncated to some arbitrary limit. How can I load matrices of arbitrary size? If I have to convert to some binary format, that's OK, as long as there is a description of the format somewhere.
I should add that this is a 64-bit system and build of J, and I can successfully create a matrix of random numbers of the appropriate size, so it doesn't seem to be a limitation on matrix size per se but only during I/O.
Thanks!
EDIT: I did not find what exactly was causing this, but thanks to Dane, I found a workaround using JMF (the 'data/jmf' package). It turns out that JMF is just straight binary data with no header, and native (?) or little-endian data can be mapped directly with JFL map_jmf_ 'x';'whatever.bin'
You're running out of memory. A quick test to see how much space integers take up yields the following:
7!:2 'i. 80000 5000'
8589936256
That is, an 80,000 by 5,000 matrix of integers requires 8 GB of memory. Your 80,000 by 50,000 matrix, if it were of integers, would require approximately 80 GB of memory.
Your next question should be about performing array or matrix operations on a matrix too big to load into memory.
I am trying to implement the idea found in this paper:
http://crypto.stanford.edu/craig/easy-fhe.pdf.
However, I do not know how to compute the circuit representation of an algorithm.
Suppose I have a function that takes a list of exactly 32 32-bit signed integers, and returns a 64-bit signed integer representing the sum of the integers. How can I convert this function to a Boolean function? That is, I need to design a circuit where each output wire is a Boolean function (ANDs, ORs, and NOTs) of the 1024 input wires.
Notice that the function will take a fixed width input and produce a fixed width output.
Are there any techniques from electrical engineering or math that I can use?
Consider the logic in an FPGA. I think this will help you get an idea of the kind of circuit needed to sum exactly 32 32-bit inputs into a 64-bit output.
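As a sketch of what that decomposition looks like in software, here is a ripple-carry adder built from only AND, OR, and NOT on individual bits (XOR expressed in terms of them). Summing the 32 inputs would chain 31 such adders at the appropriate widths; with two's-complement representation and sign extension, the same gates handle signed addition:

```cpp
#include <cstdint>

// Gate primitives: everything below is built from AND, OR, NOT on single bits.
static inline uint64_t NOT_(uint64_t a) { return a ^ 1u; }
static inline uint64_t AND_(uint64_t a, uint64_t b) { return a & b; }
static inline uint64_t OR_(uint64_t a, uint64_t b)  { return a | b; }
// XOR is not primitive here; express it as (a AND NOT b) OR (NOT a AND b).
static inline uint64_t XOR_(uint64_t a, uint64_t b) {
    return OR_(AND_(a, NOT_(b)), AND_(NOT_(a), b));
}

// Ripple-carry adder: each output bit is a Boolean function of the input wires.
// width-bit inputs, (width+1)-bit result (the extra bit is the final carry-out).
uint64_t ripple_add(uint64_t x, uint64_t y, unsigned width) {
    uint64_t carry = 0, result = 0;
    for (unsigned i = 0; i < width; ++i) {
        uint64_t a = (x >> i) & 1u, b = (y >> i) & 1u;
        uint64_t sum = XOR_(XOR_(a, b), carry);           // full-adder sum bit
        carry = OR_(AND_(a, b), AND_(carry, XOR_(a, b))); // full-adder carry-out
        result |= sum << i;
    }
    return result | (carry << width);
}
```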
How can I convert a baseband sampled signal from real-valued samples to complex-valued samples (real, imaginary) and vice versa?
My samples are integers, and I'm looking for a fast (but accurate) conversion algorithm.
A C++ sample code (real, not complex ;-) would be more than welcome.
Edit: IPP code would be much welcome.
Edit: I'm looking for a method that will convert n real samples to n/2 complex samples and vice versa, without affecting the bandwidth.
Adding zeros as the imaginary part is conceptually the first step in what you want to do. Initially you have a real-only signal whose samples and spectrum look like this:
[r0, r1, r2, r3, ...]
/-~--------\
DC +Fs/2
If you stuff it with zeros for the imaginary value, you'll see that you really have both positive and negative frequencies as mirror images:
[r0 + 0i, r1 + 0i, r2 + 0i, r3 + 0i, ...]
/--------~-\ /-~--------\
-Fs/2 DC +Fs/2
Next, you multiply that signal in the time domain by a complex tone at -Fs/4 (tuning the signal). Your signal will look like
----~-\ /-~--------\ /------
DC
So now, you low-pass filter to keep only the center half, and you get:
________/-~--------\________
DC
Then you decimate by two and you end up with:
/-~--------\
Which is what you want.
All of these steps can be performed efficiently in the time domain. If you pay attention to all of the intermediate steps, you'll notice that there are many places where you're multiplying by 0, +1, -1, +i, or -i. Furthermore, the half band low pass filter will have a lot of zeros and some symmetry to exploit. Since you know you're going to decimate by 2, you only have to calculate the samples you intend to keep. If you work through the algebra, you'll find a lot of places to simplify it for a clean and fast implementation.
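The steps above can be sketched as follows (an unoptimized illustration, before applying any of the simplifications just mentioned; the 7-tap half-band filter is an arbitrary illustrative choice, not a recommendation):

```cpp
#include <complex>
#include <vector>
#include <cstddef>

// Real -> complex-baseband sketch: 1) treat real samples as complex,
// 2) tune by -Fs/4, 3) half-band low-pass, 4) decimate by 2.
std::vector<std::complex<double>>
real_to_complex(const std::vector<double>& x) {
    using cd = std::complex<double>;
    const std::size_t n = x.size();

    // Tune by -Fs/4: multiply sample k by exp(-i*pi*k/2) = {1, -i, -1, +i}.
    static const cd tone[4] = { {1, 0}, {0, -1}, {-1, 0}, {0, 1} };
    std::vector<cd> tuned(n);
    for (std::size_t k = 0; k < n; ++k)
        tuned[k] = x[k] * tone[k % 4];

    // Half-band low-pass, 7 taps, DC gain 1; odd taps are zero by design.
    static const double h[7] = { -1.0/32, 0.0, 9.0/32, 0.5, 9.0/32, 0.0, -1.0/32 };

    // Filter and decimate by 2 in one pass: only compute the kept samples.
    std::vector<cd> out(n / 2);
    for (std::size_t m = 0; m < n / 2; ++m) {
        cd acc(0.0, 0.0);
        for (int t = 0; t < 7; ++t) {
            std::ptrdiff_t idx = static_cast<std::ptrdiff_t>(2 * m) + t - 3;
            if (idx >= 0 && idx < static_cast<std::ptrdiff_t>(n))
                acc += h[t] * tuned[static_cast<std::size_t>(idx)];
        }
        out[m] = acc;
    }
    return out;  // n real samples in, n/2 complex samples out
}
```

A cleaned-up version would exploit the zeros in the tone and the filter as described above; this form just makes each step visible.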
Ultimately, this is all equivalent to a Hilbert transform, but I think it's much easier to understand when you decompose it into pieces like this.
Converting back to real from complex is similar. You'll stuff it with zeroes for every other sample to undo the decimation. You'll filter the complex signal to remove an alias you just introduced. You'll tune it up by Fs/4, and then throw away the imaginary component. (Sorry, I'm all ascii-arted out... :-)
Note that this conversion is lossy near the boundaries. You'd have to use an infinite length filter to do it perfectly.
If you want to create a complex vector with a strictly real spectrum, just add an imaginary component of 0.0 to every sample. Depending on your data format, this may be as easy as creating a double length memory array, zeroing it, and copying from every element of the source into every other element of the destination.
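A sketch of that interleaved layout (the 16-bit sample type is an assumption, based on the integer samples mentioned in the question):

```cpp
#include <vector>
#include <cstdint>
#include <cstddef>

// Pack real samples into interleaved (re, im) pairs with im = 0:
// a double-length, zeroed array with the source copied into every other slot.
std::vector<std::int16_t> to_interleaved_complex(const std::vector<std::int16_t>& re) {
    std::vector<std::int16_t> out(2 * re.size(), 0);  // double length, zeroed
    for (std::size_t k = 0; k < re.size(); ++k)
        out[2 * k] = re[k];                           // real parts in even slots
    return out;
}
```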
If you want to convert a complex vector containing complex data (non-zero imaginary components above your required minimum noise floor) into a real vector, you will need to double your bandwidth in order to not lose information, which may or may not make sense, unless you are modulating, demodulating or filtering the signal.
If you want to produce a one-sided signal with a complex spectrum from a real vector, you can use a Hilbert transform (or filter) to create an imaginary component with the same spectrum but conjugate phase (except for DC). This would probably not be both fast and accurate.
I'm not sure if that is what you're looking for but you might want to check the Hilbert Transform, which can be used to find the analytic representation of real-valued signals, i.e., a signal with the same amount of information but with no negative frequency components.
Such representation is mostly useful in Digital Signal Processing techniques employing Spectral Shifting such as Single Sideband Modulation, an efficient form of Amplitude Modulation (AM) that uses half the bandwidth used by the raw AM.
I don't have enough points to vote zml up yet, but his is clearly the right answer. The Hilbert transform essentially converts your real-valued signal into its more natural domain, where the components of sound are complex "phasors" rather than sine waves. It does this by essentially chopping off half of the Fourier spectrum, which involves a single choice of "helicity" (i.e. CW vs. CCW) but allows you to do things like perfectly pitch-shift by multiplying by a single phasor. The possibilities are endless, and I hope this complex representation of audio catches on!
Intel Performance Primitives (IPP) has a function which does exactly this.
From their documentation:
The ippsHilbert function computes a complex analytic signal,
which contains the original real signal as its real part and
computed Hilbert transform as its imaginary part.
https://software.intel.com/content/www/us/en/develop/documentation/ipp-dev-reference/top/volume-1-signal-and-data-processing/transform-functions/hilbert-transform-functions/hilbert.html#hilbert