Halide with OpenCL using const global <type>* restrict arguments? - halide

In OpenCL we get an efficient hardware path for input arguments when we specify them as const global <type>* restrict, as in this piece of handwritten OpenCL code:
__kernel void oclConvolveGlobalMem(const global float* restrict input,
                                   constant float* restrict filterWeights,
                                   global float* restrict output)
However, as seen with HL_DEBUG_CODEGEN=1 Halide generates:
// Address spaces for kernel_conv_70_s0_y___block_id_y
#define __address_space__conv__70 __global
#define __address_space__input __global
#define __address_space__kernel __global
__kernel void kernel_conv_70_s0_y___block_id_y(
    const int _conv__70_extent_0,
    const int _conv__70_extent_1,
    const int _conv__70_min_0,
    const int _conv__70_min_1,
    const int _conv__70_stride_1,
    const int _input_min_0,
    const int _input_min_1,
    const int _input_stride_1,
    const int _kernel_min_0,
    const int _kernel_min_1,
    const int _kernel_stride_1,
    __address_space__conv__70 float *_conv__70,
    __address_space__input const float *_input,
    __address_space__kernel const float *_kernel,
    __address_space___shared int16* __shared)
where the input argument is not declared restrict. I expect this to severely limit performance. How do I get Halide to express that the pointers are restrict-qualified (the buffers they point to do not alias)?

When did you last update Halide? Halide recently (sort of, October 2016) added restrict to buffer arguments: https://github.com/halide/Halide/pull/1550. The latest binary release does have this change, barely.
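For reference, here is a minimal sketch of how you might check this after upgrading: it compiles a small pipeline for OpenCL so the generated kernel source can be inspected with HL_DEBUG_CODEGEN=1. This is hedged: it uses the Halide C++ API roughly as it stood around that pull request (late 2016, with the older gpu_tile overload), and the Func, Var and tile sizes are illustrative, not taken from the original pipeline.
#include "Halide.h"
using namespace Halide;

int main() {
    // Illustrative stand-in for the convolution pipeline in the question.
    ImageParam input(Float(32), 2);
    Func conv("conv");
    Var x("x"), y("y");
    conv(x, y) = 0.25f * input(x, y) + 0.75f * input(x + 1, y);

    // Target the host with the OpenCL feature enabled.
    Target target = get_host_target();
    target.set_feature(Target::OpenCL);

    // Map the pure dimensions onto GPU blocks/threads (older-style gpu_tile).
    conv.gpu_tile(x, y, 16, 16);

    // Compiling with HL_DEBUG_CODEGEN=1 set in the environment prints the
    // generated OpenCL C source, so the restrict qualifiers on the buffer
    // arguments can be checked.
    conv.compile_jit(target);
    return 0;
}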

Related

ARM GCC compiler "buggy" conversion

Problem
I am working on optimizing flash memory usage on an STM32F051. It turns out that conversion between float and int types consumes a lot of flash.
Digging into this, the conversion to int takes around 200 bytes of flash memory, while the conversion to unsigned int takes around 1500 bytes!
It is well known that int and unsigned int differ only in the interpretation of the sign bit, so this behavior is a great mystery to me.
Note: Performing the 2-stage conversion float -> int -> unsigned int also consumes only around 200 bytes.
Questions
Analyzing this, I have the following questions:
1) What is the mechanism of the conversion from float to unsigned int? Why does it take so much memory, when at the same time the conversion float -> int -> unsigned int takes so little? Is it connected with the IEEE 754 standard?
2) Are there any problems to be expected when the conversion float -> int -> unsigned int is used instead of a direct float -> unsigned int?
3) Are there any ways to wrap the float -> unsigned int conversion while keeping the low memory footprint?
Note: A similar question has already been asked here (Trying to understand how the casting/conversion is done by compiler, e.g., when cast from float to int), but there is still no clear answer there, and my question is specifically about the memory usage.
Technical data
Compiler: ARM-NONE-EABI-GCC (gcc version 4.9.3 20141119 (release))
MCU: STM32F051
MCU's core: 32 bit ARM CORTEX-M0
Code example
float -> int (~200 bytes of flash)
int main() {
    volatile float f;
    volatile int i;
    i = f;
    return 0;
}
float -> unsigned int (~1500 bytes! of flash)
int main() {
    volatile float f;
    volatile unsigned int ui;
    ui = f;
    return 0;
}
float ->int-> unsigned int (~200 bytes of flash)
int main() {
    volatile float f;
    volatile int i;
    volatile unsigned int ui;
    i = f;
    ui = i;
    return 0;
}
There is no fundamental reason why the conversion from float to unsigned int should be larger than the conversion from float to signed int; in practice the float to unsigned int conversion can even be made smaller than the float to signed int conversion.
I did some investigation using the GNU Arm Embedded Toolchain (Version 7-2018-q2), and as far as I can see the size problem is due to a flaw in the gcc runtime library. For some reason this library does not provide a specialized version of the __aeabi_f2uiz function for Arm v6-M; instead it falls back on a much larger general version.
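If upgrading or replacing the library routine is not an option, one hedged workaround sketch for question 3 (my own, not part of the toolchain) is to route the conversion through the small float-to-signed-int path and handle the upper half of the unsigned range by hand. It assumes the input is non-negative, finite and below 2^32:
/* Hedged workaround sketch: convert via the compact float->int path and
 * handle values at or above 2^31 manually. Assumes 0 <= f < 2^32 and no NaN;
 * anything outside that range is the caller's responsibility. */
static unsigned int float_to_uint(float f)
{
    if (f >= 2147483648.0f) {
        /* Shift into signed range, convert, then add 2^31 back. */
        return (unsigned int)(int)(f - 2147483648.0f) + 0x80000000u;
    }
    return (unsigned int)(int)f;   /* uses the small __aeabi_f2iz routine */
}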

cmake finds cuda but fails to find cuda libraries on Windows

I have a small cmake project that works perfectly well on Linux but fails on Windows 10 (I tried with two different computers) with the latest versions of cmake and CUDA 8. It finds CUDA just fine, but fails to find the libraries. My cmake file:
cmake_minimum_required(VERSION 3.0)
project(myproject)
find_package(CUDA REQUIRED)
cuda_add_library(myproject STATIC matrix_mm.cu)
target_link_libraries(myproject ${CUDA_CUBLAS_LIBRARIES})
message(STATUS "")
message(STATUS "FoundCUDA : ${CUDA_FOUND}")
message(STATUS "Cuda cublas libraries : ${CUDA_CUBLAS_LIBRARIES}")
In the same folder, I have the header matrix_mm.cuh:
#include <cstdlib>
namespace myproject {
float* cuda_mm(const float *a, const float *b, const size_t m, const size_t k, const size_t n);
} /* end namespace myproject */
and matrix_mm.cu:
#include <cublas_v2.h>
#include "matrix_mm.cuh"

namespace myproject {

// Adapted from https://solarianprogrammer.com/2012/05/31/matrix-multiplication-cuda-cublas-curand-thrust/
void gpu_blas_mmul(const float *a, const float *b, float *c, const size_t m, const size_t k, const size_t n) {
    int lda = m, ldb = k, ldc = m;
    const float alf = 1;
    const float bet = 0;
    const float *alpha = &alf;
    const float *beta = &bet;

    // Create a handle for CUBLAS
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Do the actual multiplication
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);

    // Destroy the handle
    cublasDestroy(handle);
}

float* cuda_mm(const float *a, const float *b, const size_t m, const size_t k, const size_t n) {
    size_t const a_bytes = m * k * sizeof(float);
    size_t const b_bytes = k * n * sizeof(float);
    size_t const c_bytes = m * n * sizeof(float);

    float* c = (float*)std::malloc(c_bytes);

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, a_bytes);
    cudaMalloc(&d_B, b_bytes);
    cudaMalloc(&d_C, c_bytes);

    cudaMemcpy(d_A, a, a_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, b, b_bytes, cudaMemcpyHostToDevice);

    gpu_blas_mmul(d_A, d_B, d_C, m, k, n);

    cudaMemcpy(c, d_C, c_bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return c;
}

} /* end namespace myproject */
On Linux I get:
-- FoundCUDA : TRUE
-- Toolkit root : /usr
-- Cuda cublas libraries : /usr/lib/x86_64-linux-gnu/libcublas.so
While on both Windows 10 machines I get
-- FoundCUDA : TRUE
-- Toolkit root : C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v8.0
-- Cuda cublas libraries : CUDA_cublas_LIBRARY-NOTFOUND;CUDA_cublas_device_LIBRARY-NOTFOUND
...and of course it fails to compile because the linker can't find cublas.
I tried quite a few things: making the lib SHARED instead of STATIC, making sure CUDA was in Windows' environment variables, etc., but nothing worked.
This is the CMake snippet I use to find CUDA 8 on Windows 10 with CMake 3.7.1:
cmake_minimum_required(VERSION 3.7)
project(myproject)

# Check for CUDA ENV vars
IF(NOT DEFINED ENV{CUDA_PATH})
    MESSAGE(FATAL_ERROR "CUDA_PATH Environment variable is not set.")
ENDIF(NOT DEFINED ENV{CUDA_PATH})

# Set the toolkit path
FILE(TO_CMAKE_PATH "$ENV{CUDA_PATH}" CUDA_TOOLKIT_ROOT_DIR)
SET(CUDA_TOOLKIT_ROOT_DIR ${CUDA_TOOLKIT_ROOT_DIR} CACHE STRING "Root directory of the Cuda Library" FORCE)

# Find the package
find_package(CUDA REQUIRED)

# Create an interface library as a link target (Requires CMake 3.7.0+)
add_library(cuda INTERFACE)
set_target_properties(cuda PROPERTIES
    INTERFACE_INCLUDE_DIRECTORIES ${CUDA_INCLUDE_DIRS}
    INTERFACE_LINK_LIBRARIES "${CUDA_LIBRARIES};${CUDA_CUFFT_LIBRARIES};${CUDA_CUBLAS_LIBRARIES}"
)

SET(CUDA_HOST_COMPILATION_CPP ON)

cuda_add_library(myproject STATIC matrix_mm.cu)
target_link_libraries(myproject cuda)
I think the quotes around the libraries are important since the Windows path will contain spaces.
I would also make sure that you delete your cache and regenerate the project. It's often the cause of errors when variable values appear correct on the surface (or when you make changes to a non-FORCE cache variable).

C++ Tuples and Readability

I think this is more of a philosophical question about readability and tupled types in C++11.
I am writing some code to produce Gaussian Mixture Models (the details are kind of irrelevant, but it serves as a nice example). My code is below:
GMM.hpp
#pragma once
#include <opencv2/opencv.hpp>
#include <vector>
#include <tuple>
#include "../Util/Types.hpp"

namespace LocalDescriptorAndBagOfFeature
{
    // Weighted gaussian is defined as a (weight, mean vector, covariance matrix)
    typedef std::tuple<double, cv::Mat, cv::Mat> WeightedGaussian;

    class GMM
    {
        public:
            GMM(int numGaussians);
            void Train(const FeatureSet &featureSet);
            std::vector<double> Supervector(const BagOfFeatures &bof);
            int NumGaussians(void) const;
            double operator ()(const cv::Mat &x) const;

        private:
            static double ComputeWeightedGaussian(const cv::Mat &x, WeightedGaussian wg);
            std::vector<WeightedGaussian> _Gaussians;
            int _NumGaussians;
    };
}
GMM.cpp
#include "GMM.hpp"
#include <cmath>
#include <numeric>

using namespace LocalDescriptorAndBagOfFeature;

double GMM::ComputeWeightedGaussian(const cv::Mat &x, WeightedGaussian wg)
{
    double weight;
    cv::Mat mean, covariance;
    std::tie(weight, mean, covariance) = wg;

    cv::Mat precision;
    cv::invert(covariance, precision);
    double detp = cv::determinant(precision);
    double outter = std::sqrt(detp / 2.0 * M_PI);

    cv::Mat meanDist = x - mean;
    cv::Mat meanDistTrans;
    cv::transpose(meanDist, meanDistTrans);

    cv::Mat symmetricProduct = meanDistTrans * precision * meanDist; // This is a "1x1" matrix e.g. a scalar value
    double inner = symmetricProduct.at<double>(0,0) / -2.0;

    return weight * outter * std::exp(inner);
}

double GMM::operator ()(const cv::Mat &x) const
{
    return std::accumulate(_Gaussians.begin(), _Gaussians.end(), 0.0,
        [&x](double val, WeightedGaussian wg) { return val + ComputeWeightedGaussian(x, wg); });
}
In this case, am I gaining anything (clarity, readability, speed, ...) by using a tuple representation for the weighted Gaussian distribution over using a struct, or even a class with its own operator()?
You're reducing the size of your source code a little bit, but I'd argue that you're reducing its overall readability and type safety. Specifically, if you defined:
struct WeightedGaussian {
    double weight;
    cv::Mat mean, covariance;
};
then you wouldn't have a chance of writing the incorrect
std::tie(weight, covariance, mean) = wg;
and you'd guarantee that your users would use wg.mean instead of std::get<0>(wg). The biggest downside is that std::tuple comes with definitions of operator< and operator==, while you have to implement them yourself for a custom struct:
bool operator<(const WeightedGaussian& lhs, const WeightedGaussian& rhs) {
    return std::tie(lhs.weight, lhs.mean, lhs.covariance) <
           std::tie(rhs.weight, rhs.mean, rhs.covariance);
}
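As a standalone illustration of the difference, here is a hedged sketch: the WeightedGaussianS/WeightedGaussianT names and the use of plain doubles in place of cv::Mat are mine, just to keep the example self-contained and compilable.
#include <tuple>
#include <iostream>

struct WeightedGaussianS { double weight, mean, variance; };
typedef std::tuple<double, double, double> WeightedGaussianT;

int main() {
    WeightedGaussianT t(0.5, 1.0, 2.0);
    WeightedGaussianS s = {0.5, 1.0, 2.0};

    // Tuple access: all elements have the same type, so a std::tie in the
    // wrong order compiles silently and quietly swaps mean and variance.
    double w, m, v;
    std::tie(w, v, m) = t;

    // Struct access: the member names document themselves.
    std::cout << s.weight << " " << s.mean << " " << s.variance << "\n";
    std::cout << w << " " << m << " " << v << "\n";
    return 0;
}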

Atomic max for floats in OpenCL

I need an atomic max function for floats in OpenCL. This is my current naive code using atomic_xchg
float value = data[index];
if ( value > *max_value )
{
    atomic_xchg(max_value, value);
}
This code gives the correct result when using an Intel CPU, but not for a Nvidia GPU. Is this code correct, or can anyone help me?
You can do it like this:
//Function to perform the atomic max
inline void AtomicMax(volatile __global float *source, const float operand) {
    union {
        unsigned int intVal;
        float floatVal;
    } newVal;

    union {
        unsigned int intVal;
        float floatVal;
    } prevVal;

    do {
        prevVal.floatVal = *source;
        newVal.floatVal = max(prevVal.floatVal, operand);
    } while (atomic_cmpxchg((volatile __global unsigned int *)source, prevVal.intVal, newVal.intVal) != prevVal.intVal);
}

__kernel void mykern(__global float *data, __global float *max_value) {
    unsigned int index = get_global_id(0);
    float value = data[index];
    AtomicMax(max_value, value);
}
As stated in LINK.
What it does is create a union of float and unsigned int: the max is computed on the float, but the atomic compare-and-exchange operates on the integer representation. The swap only succeeds when the integer value currently in memory still matches the one that was read; otherwise the loop retries.
However, the slowdown caused by this kind of atomic loop can be significant, so use it sparingly.
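The same bit-reinterpretation compare-and-swap pattern can be written on the host as well. Here is a hedged plain C++ sketch using std::atomic, purely to illustrate the loop structure; it is not a replacement for the OpenCL version above:
#include <atomic>
#include <cstring>

// Keep the float's bit pattern in an atomic unsigned int and retry the
// compare-exchange until the stored value is at least `operand`.
void atomic_max_float(std::atomic<unsigned int> &bits, float operand)
{
    unsigned int prev = bits.load();
    for (;;) {
        float prev_f;
        std::memcpy(&prev_f, &prev, sizeof prev_f);   // reinterpret bits as float
        float next_f = prev_f > operand ? prev_f : operand;
        unsigned int next;
        std::memcpy(&next, &next_f, sizeof next);
        // On failure compare_exchange_weak reloads `prev` with the current
        // value, so the loop simply recomputes the max and tries again.
        if (bits.compare_exchange_weak(prev, next))
            break;
    }
}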

Initialize device array in CUDA

How do I initialize device array which is allocated using cudaMalloc()?
I tried cudaMemset, but it fails for every value except 0. The code for cudaMemset looks like the following, where value is set to 5:
cudaMemset(devPtr,value,number_bytes)
As you are discovering, cudaMemset works like the C standard library memset. Quoting from the documentation:
cudaError_t cudaMemset ( void * devPtr,
                         int    value,
                         size_t count )
Fills the first count bytes of the memory area pointed to by devPtr
with the constant byte value value.
So value is a byte value. If you do something like:
int *devPtr;
cudaMalloc((void **)&devPtr,number_bytes);
const int value = 5;
cudaMemset(devPtr,value,number_bytes);
what you are asking to happen is that each byte of devPtr will be set to 5. If devPtr were an array of integers, the result would be that each integer word would have the value 84215045 (0x05050505). This is probably not what you had in mind.
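A quick hedged check of that behaviour (my own snippet, not from the original answer) is to memset a device buffer with 5 and read one int back:
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t number_bytes = 16 * sizeof(int);
    int *devPtr;
    cudaMalloc((void **)&devPtr, number_bytes);
    cudaMemset(devPtr, 5, number_bytes);

    int host_value = 0;
    cudaMemcpy(&host_value, devPtr, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d (0x%08x)\n", host_value, host_value);  // prints 84215045 (0x05050505)

    cudaFree(devPtr);
    return 0;
}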
Using the runtime API, what you could do is write your own generic kernel to do this. It could be as simple as
template<typename T>
__global__ void initKernel(T * devPtr, const T val, const size_t nwords)
{
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = blockDim.x * gridDim.x;
    for(; tidx < nwords; tidx += stride)
        devPtr[tidx] = val;
}
(standard disclaimer: written in browser, never compiled, never tested, use at own risk).
Just instantiate the template for the types you need and call it with a suitable grid and block size, paying attention to the last argument now being a word count, not a byte count as in cudaMemset. This isn't really any different from what cudaMemset does anyway; using that API call results in a kernel launch which is not too different from what I posted above.
Alternatively, if you can use the driver API, there are cuMemsetD16 and cuMemsetD32, which do the same thing but for 16 bit and 32 bit word types. If you need to set 64 bit or larger types (so doubles or vector types), your best option is to use your own kernel.
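For concreteness, here is a hedged sketch (my own, with illustrative sizes) of instantiating and launching the templated initKernel above:
#include <cuda_runtime.h>

void fill_with_fives()
{
    const size_t nwords = 1 << 20;            // ~1M floats
    float *devPtr = 0;
    cudaMalloc((void **)&devPtr, nwords * sizeof(float));

    // A fixed, modest grid is enough: the grid-stride loop inside initKernel
    // covers any remaining elements.
    const int blockSize = 256;
    const int gridSize  = 128;
    initKernel<float><<<gridSize, blockSize>>>(devPtr, 5.0f, nwords);
    cudaDeviceSynchronize();

    cudaFree(devPtr);
}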
I also needed a solution to this question and I didn't really understand the other proposed solution. In particular, I didn't understand why it iterates over the grid with for(; tidx < nwords; tidx += stride), how the kernel should be invoked, and why it uses the counter-intuitive word counts.
I therefore created a much simpler monolithic generic kernel and parameterized it with a stride, i.e. you may use it to initialize a matrix in multiple ways, e.g. setting rows or columns to any value:
template <typename T>
__global__ void kernelInitializeArray(T* __restrict__ a, const T value,
                                      const size_t n, const size_t incx) {
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid*incx < n) {
        a[tid*incx] = value;
    }
}
Then you may invoke the kernel like this:
template <typename T>
void deviceInitializeArray(T* a, const T value, const size_t n, const size_t incx) {
    int number_of_blocks = ((n / incx) + BLOCK_SIZE - 1) / BLOCK_SIZE;
    dim3 gridDim(number_of_blocks, 1);
    dim3 blockDim(BLOCK_SIZE, 1);
    kernelInitializeArray<T> <<<gridDim, blockDim>>>(a, value, n, incx);
}
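A hedged usage sketch (mine, not part of the original answer): it assumes BLOCK_SIZE was #defined before the helper above (e.g. to 256) and that a column-major m x k matrix is already allocated on the device.
void initialize_matrix_edges(float *d_matrix, size_t m, size_t k)
{
    // Fill the first column (m consecutive elements) with 1.0f.
    deviceInitializeArray<float>(d_matrix, 1.0f, m, 1);

    // Fill the first row (every m-th element across the k columns) with 2.0f.
    deviceInitializeArray<float>(d_matrix, 2.0f, m * k, m);

    cudaDeviceSynchronize();
}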

Resources