coreml inference results are different with cpu and gpu - xcode

os system: macos Catalina 10.15.2
xcode: 11.3
coreml3.0
I give same model input to the same mlmodel. But the inference results are different using cpu device and gpu device.
The results are as follows, the left file is inference result (second column) using cpu and the right file is inference result (second column) using CpuAndGpu. I use beyond compare to compare the two files, and the data marked with red color are the difference.
Does anyone know about the problem and how to fix it?
enter image description here

This is not a problem per se. On the GPU, 16-bit floats are used while on the CPU 32-bit floats are used. 16-bit floats have less precision, which explains the different results you're getting.
Some numbers will be slightly larger, some will be slightly smaller, but generally these effects cancel out and you won't notice the difference.
(However, if your model generates images, you may notice pixel artifacts from the lower precision provided by the 16-bit floats.)

Related

FPGA LUT with multiple outputs

I am working on designing a mandelbrot viewer and I am designing hardware for squaring values. My squarer is recursively built where a 4bit squarer relies on 2, 2bit squarers. so for my 16 bit squarer, that has 2 8bits squarers, and each one of those has 2 4bit squarer's.
As you can see the recursivity begins to make the design blow up in complexity. To help speed up my design i would like to use a 4input ROM that emulates a 4bit squarer. So when you enter 3 in the rom, it outputs 9, when you enter 15, it outputs 225.
I know that a normal LUT implemented in a logic cell ay have 3 or 4 input variables and only 1 output, but i need an 8 bit output so I need more of a ROM then a LUT.
Any and all help is appreciated, Im curious how the FPGA will store those ROMs and if storing it in ROM would be faster than computing the 4input Square.
-
Jarvi
To square a 4-bit number explicitly using LUTs, you would need to use 8 4-input LUTs. Each LUT's output would give you one bit of the 8-bit product.
The overall size and fmax performance of your design may be achieved with this approach, using larger block RAM primitives (as ROM), dedicated MAC (multiply-accumulate) units, or by using the normal mulitiplication operator * and relying on your synthesis tool's optimization.
You may also want to review some research papers related to this topic, for example here.

how data is interpreted in computers

Something that troubles me when I think of is how incoming data is interpreted in computers. I searched a lot but could not find an answer so as a last resort I am asking in here. What I am saying is that you plug in a USB to your computer and data stream starts. Your computer receives ones and zeros from the USB and interpret them correctly like for example inside of the USB there are pictures with different names and different formats and resolutions. What I do not understand is how computer correctly puts them together and the big picture emerges. This could be seen as a stupid question but had me thinking for a while. How does this system work?
I am not a computer scientist but I am studying Electrical and electronics engineering and know somethings.
It is all just streams of ones and zeros, which get counted up into bytes. As you probably know one can multiplex them, but with modern hardware that isn't very necessary (the 's' in USB standing for 'serial)
A pure black and white image of an "A" would be a 2d array:
111
101
111
101
101
3x5 font
I would guess that "A" is stored in a font file as 111101111101101, with a known length of 3*5=15 bits.
When displayed in a window, that A would be broken down into lines, and inserted on the respective line of the window, becoming a stream which contains 320x256 pixels perhaps.
When the length of data is not constant, it can:
If there is a max size, could be the size of the max size (integers and other primitive data types do this, a 0 takes 32/64 bits, as does 400123)
A length is included somewhere, often a sort of "header"
It gets chunked up into either constant or variable sized chunks, and has a continue bit (UTF-8 is a good simple example of constant chunks, some networking protocols (maybe TCP/IP) are a good example of variable chunks)
Both sides need to know how to decode the data, in your example of a USB stick with an image on it. The operating system has a driver which understands the UUID is a storage device, and attempts to read special sectors from it. If it detects a partition type it recognizes (for windows that would be NTFS or FAT32), it will then load the file tables, using drivers that understand how to decode those. It finds a filename allows access via the filename. Then an image reading program is able to load the bytestream of that file and decode it using its headers and installed codecs into a raster image array. If any of those pieces are not available in your system, you cannot view the image, and it will be just any random binary to you (if you format the usb stick with Linux, or use a uncommon/old image format)
So its all various level of explicit or implicit handshakes to agree on what the data is when you get to the higher levels (higher level being at least once you agree on endianness and baudrate of data transmission)

Octave - out of memory or dimension too large for Octave's index type

I'm aware of the fact that there are 3 questions with a similar exception message. Unfortunately none of the questions is answered and the comments could not solve my problem.
I use Octave 4.2.1 in the 64 bit version on a Windows 10 System with 16 GB RAM in total and ~11 GB free during runtime.
When i try to multiply a 60000 x 10 matrix with a 10 x 60000 matrix Octave comes up with the following exception:
error: out of memory or dimension too large for Octave's index type
This multiplication would result in a 60000 x 60000 matrix and therefore should not be a problem for a 64 bit index.
I can't even do zeros(60000,60000);
I don't get what i am doing wrong. Could someone point me into the right direction?
As is often the case, this error is often misinterpreted (maybe we should address this as a bug on the octave tracker already ;) )
>> 60000*60000
ans = 3.6000e+09
>> intmax
ans = 2147483647
>> 60000*60000 > intmax
ans = 1
I.e. the number of elements of the resulting 60000x60000 matrix is larger than the maximum integer representation supported by the system, therefore there is no way to linearly index such a matrix using an integer index.
Also, in order to use actual 64-bit indexing, you need to compile octave in that manner, as this tends not to be the default, but unfortunately that's not as straightforward as you might wish, as you'll have to use the respective 64-bit supporting libraries as well. More on that here.
Having said that, it may well be possible to make use of sparse matrices instead if your matrices are indeed sparse in nature. If not, you're essentially using 'big data', and you need to find workarounds, such as blockprocessing / mapping large arrays to files etc. It's worth reading up on common "big data" techniques. Unfortunately octave does not seem to support matlab's memmapfile command just yet, but you can simulate this using fwrite / fread / fseek appropriately to read appropriate ranges from a file.

CUDA implementation for arbitrary precision arithmetics

I have to multiply two very large (~ 2000 X 2000) dense matrices whose entries are floats with arbitrary precision (I am using GMP and the precision is currently set to 600). I was wondering if there is any CUDA library that supports arbitrary precision arithmetics? The only library that I have found is called CAMPARY however it seems to be missing some references to some of the used functions.
The other solution that I was thinking about was implementing a version of the Karatsuba algorithm for multiplying matrices with arbitrary precision entries. The end step of the algorithm would just be multiplying matrices of doubles, which could be done very efficiently using cuBLAS. Is there any similar implementation already out there?
Since nobody has suggested such a library so far, let's assume that one doesn't exist.
You could always implement the naive implementation:
One grid thread for each pair of coordinates in the output matrix.
Each thread performs an inner product of a row and a column in the input matrices.
Individual element operations will use the code taken from the GMP (hopefully not much more than copy-and-paste).
But you can also do better than this - just like you can do better for regular-float matrix multiplication. Here's my idea (likely not the best of course):
Consider the worked example of matrix multiplication using shared memory in the CUDA C Programming Guide. It suggests putting small submatrices in shared memory. You can still do this - but you need to be careful with shared memory sizes (they're small...):
A typical GPU today has 64 KB shared memory usable per grid block (or more)
They take 16 x 16 submatrix.
Times 2 (for the two multiplicands)
Times ceil(801/8) (assuming the GMP representation uses 600 bits from the mantissa, one bit for the sign and 200 bits from the exponent)
So 512 * 101 < 64 KB !
That means you can probably just use the code in their worked example as-is, again replacing the float multiplication and addition with code from GMP.
You may then want to consider something like parallelizing the GMP code itself, i.e. using multiple threads to work together on single pairs of 600-bit-precision numbers. That would likely help your shared memory reading pattern. Alternatively, you could interleave the placement of 4-byte sequences from the representation of your elements, in shared memory, for the same effect.
I realize this is a bit hand-wavy, but I'm pretty certain I've waved my hands correctly and it would be a "simple matter of coding".

Does Global Work Size Need to be Multiple of Work Group Size in OpenCL?

Hello: Does Global Work Size (Dimensions) Need to be Multiple of Work Group Size (Dimensions) in OpenCL?
If so, is there a standard way of handling matrices not a multiple of the work group dimensions? I can think of two possibilities:
Dynamically set the size of the work group dimensions to a factor of the global work dimensions. (this would incur the overhead of finding a factor and possibly set the work group to a non-optimal size.)
Increase the dimensions of the global work to be the nearest multiple of the work group dimensions, keeping all input and output buffers the same but checking bounds in the kernel to avoid segfaulting, i.e. do nothing on the work items out of bound of the desired output. (This seems like the better way.)
Would the second way work? Is there a better way? (Or is it not necessary because work group dimensions need not divide global work dimensions?)
Thanks!
Thx for the link Chad. But actually, if you read on:
If local_work_size is specified, the
values specified in global_work_size[0], … global_work_size[work_dim - 1] must be evenly
divisible by the corresponding values specified in local_work_size[0], …
local_work_size[work_dim – 1].
So YES, the local work size must be a multiple of the global work size.
I also think the assigning the global work size to the nearest multiple and being careful about bounds should work, I'll post a comment when I get around to trying it.
This seems to be an old post, but let me update this post with some new information. Hopefully, it could help someone else.
Does Global Work Size (Dimensions) Need to be Multiple of Work Group
Size (Dimensions) in OpenCL?
Answer: True till OpenCL 2.0. Before CL2.0, your global work size must be a multiple of local work size, otherwise you will get an error message when you execute clEnqueueNDRangeKernel.
But from CL2.0, this is not required anymore. You can use whatever global work size which fits your application dimensions. However, please remember that the hardware implementation might still use the "old" way, which means padding the global work group size. Therefore, it makes the performance highly dependent on the hardware architecture. You may see quite different performance on different hardware/platforms. Plus, you want to make your application back compatible to support older platform which only supports CL up to version 1.2. So, I think this new feature added in CL2.0 is just for easy programming, to get better controllable performance and backward compatibility, I suggest you still use the following method mentioned by you:
Increase the dimensions of the global work to be the nearest multiple
of the work group dimensions, keeping all input and output buffers the
same but checking bounds in the kernel to avoid segfaulting, i.e. do
nothing on the work items out of bound of the desired output. (This
seems like the better way.)
Answer: you are absolutely right. This is the right way to handle such case. Carefully design the local work group size (considering factors such as register usage, cache hit/miss, memory access pattern and so on). And then pad your global work size to a multiple of local work size. Then, you are good to go.
Another thing to consider is that you can utilize the image object to store the data instead of buffer, if there are quite a lot of boundary checking work in your kernel. For image, the boundary check is automatically done by hardware, almost no overhead in most of the implementations. Therefore, padding your global work size, store your data in image object, then, you just need to write your code normally without worrying about the boundary checking.
According to the standard it doesn't have to be from what I saw. I think I would handle it with a branch, but I don't know exactly what kind of matrix operation you are doing.
http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf#page=131
global_work_size points to an array
of work_dim unsigned values that
describe the number of global
work-items in work_dim dimensions that
will execute the kernel function. The
total number of global work-items is
computed as global_work_size[0] *
... * global_work_size[work_dim –
1].
The values specified in
global_work_size + corresponding
values specified in global_work_offset
cannot exceed the range given by the
sizeof(size_t) for the device on
which the kernel execution will be
enqueued. The sizeof(size_t) for a
device can be determined using
CL_DEVICE_ADDRESS_BITS in table 4.3.
If, for example,
CL_DEVICE_ADDRESS_BITS = 32, i.e.
the device uses a 32-bit address
space, size_t is a 32-bit unsigned
integer and global_work_size values
must be in the range 1 .. 2^32 - 1.
Values outside this range return a
CL_OUT_OF_RESOURCES error.

Resources