stoi out of range exception when reusing unique_ptr - socket.io

I'm trying to transmit a Draco-encoded point cloud via socket.io.
In the first iteration everything is fine, but in every following iteration abort() is called, because I get an out of range exception in _NODISCARD inline int stoi.
for (int i = 0; i < count; i++)
{
    // get point cloud from Kinect source
    cwipc* capturedPC = ts->get();
    cwipcH.CopyPointsToStruct(capturedPC);
    // create draco point cloud from captured point cloud (this throws the out of range exception on the second iteration)
    std::unique_ptr<draco::PointCloud> pc = dracoH.BuildPointCloud(capturedPC, cwipcH);
    // write the encoder buffer with position quantization, compression speed and decompression speed
    eBuff = dracoH.GetEncoderBuffer(move(pc), 13, 10, 10);
    std::cout << "Buffer size " << eBuff.size() << "\n";
    // copy encoder buffer to new char*
    size_t size = eBuff.size();
    const char* eBuffData = eBuff.data();
    char* eBuffCpy = new char[size];
    strncpy_s(eBuffCpy, sizeof(eBuffData), eBuffData, size);
    // create shared pointer for point cloud payload
    std::shared_ptr<std::string> payload = std::make_shared<std::string>(eBuffData, size);
    // transmit payload
    client.socket()->emit("Encoder Buffer", payload);
    // free memory
    capturedPC->free();
    eBuff.Clear();
    payload.reset();
    pc.reset();
}
I've tried copying the encoder buffer to a new char*, but the error still persists.
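For reference, a binary-safe way to make that copy could look like the sketch below (it reuses the names from the loop above and is not, by itself, a fix for the stoi exception). Note that strncpy_s is meant for NUL-terminated strings, and its second argument here, sizeof(eBuffData), is the size of a pointer rather than the size of the buffer, while the encoder buffer is binary data that may contain embedded zero bytes.
// Sketch: copy the raw encoder bytes without C-string functions.
const size_t size = eBuff.size();
std::vector<char> eBuffCpy(eBuff.data(), eBuff.data() + size);      // owning, binary-safe copy
auto payload = std::make_shared<std::string>(eBuff.data(), size);   // std::string(ptr, len) also copies raw bytes
client.socket()->emit("Encoder Buffer", payload);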


Faster way to read/write a std::unordered_map from/to a file

I am working with some very large std::unordered_maps (hundreds of millions of entries) and need to save and load them to and from a file. The way I am currently doing this is by iterating through the map and reading/writing each key and value pair one at a time:
std::unordered_map<unsigned long long int, char> map;

void save(){
    std::unordered_map<unsigned long long int, char>::iterator iter;
    FILE *f = fopen("map", "wb");
    for(iter=map.begin(); iter!=map.end(); iter++){
        fwrite(&(iter->first), 8, 1, f);
        fwrite(&(iter->second), 1, 1, f);
    }
    fclose(f);
}

void load(){
    FILE *f = fopen("map", "rb");
    unsigned long long int key;
    char val;
    while(fread(&key, 8, 1, f)){
        fread(&val, 1, 1, f);
        map[key] = val;
    }
    fclose(f);
}
But with around 624 million entries, reading the map from a file took 9 minutes. Writing to a file was faster but still took several minutes. Is there a faster way to do this?
C++ unordered_map implementations must all use chaining. There are a variety of really good reasons why you might want to do this for a general purpose hash table, which are discussed here.
This has enormous implications for performance. Most importantly, it means that the entries of the hash table are likely to be scattered throughout memory in a way which makes accessing each one an order of magnitude (or so) less efficient than would be the case if they could somehow be accessed serially.
Fortunately, you can build hash tables that, when nearly full, give near-sequential access to adjacent elements. This is done using open addressing.
Since your hash table is not general purpose, you could try this.
Below, I've built a simple hash table container with open addressing and linear probing. It assumes a few things:
Your keys are already somehow randomly distributed. This obviates the need for a hash function (though decent hash functions are fairly simple to build, even if great hash functions are difficult).
You only ever add elements to the hash table, you do not delete them. If this were not the case you'd need to change the used vector into something that could hold three states: USED, UNUSED, and TOMBSTONE, where TOMBSTONE is the state of a deleted element, used to continue a linear search probe but halt a linear insert probe.
That you know the size of your hash table ahead of time, so you don't need to resize/rehash it.
That you don't need to traverse your elements in any particular order.
Of course, there are probably all kinds of excellent implementations of open addressing hash tables online which solve many of the above issues. However, the simplicity of my table allows me to convey the important point.
The important point is this: my design allows all the hash table's information to be stored in three vectors. That is: the memory is contiguous.
Contiguous memory is fast to allocate, fast to read from, and fast to write to. The effect of this is profound.
Using the same test setup as my previous answer, I get the following times:
Save. Save time = 82.9345 ms
Load. Load time = 115.111 ms
This is a 95% decrease in save time (22x faster) and a 98% decrease in load time (62x faster).
Code:
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <iostream>
#include <limits>
#include <random>
#include <stdexcept>
#include <vector>

const int TEST_TABLE_SIZE = 10000000;

template<class K, class V>
class SimpleHash {
 public:
  int usedslots = 0;

  std::vector<K>       keys;
  std::vector<V>       vals;
  std::vector<uint8_t> used;

  //size0 should be a prime and about 30% larger than the maximum number needed
  SimpleHash(int size0){
    vals.resize(size0);
    keys.resize(size0);
    used.resize(size0/8+1,0);
  }

  //If the key values are already uniformly distributed, using a hash gains us
  //nothing
  uint64_t hash(const K key){
    return key;
  }

  bool isUsed(const uint64_t loc){
    const auto used_loc = loc/8;
    const auto used_bit = 1<<(loc%8);
    return used[used_loc]&used_bit;
  }

  void setUsed(const uint64_t loc){
    const auto used_loc = loc/8;
    const auto used_bit = 1<<(loc%8);
    used[used_loc] |= used_bit;
  }

  void insert(const K key, const V val){
    uint64_t loc = hash(key)%keys.size();

    //Use linear probing. Can create infinite loops if table too full.
    while(isUsed(loc)){ loc = (loc+1)%keys.size(); }

    setUsed(loc);
    usedslots++;
    keys[loc] = key;
    vals[loc] = val;
  }

  V& get(const K key) {
    uint64_t loc = hash(key)%keys.size();

    while(true){
      if(!isUsed(loc))
        throw std::runtime_error("Item not present!");
      if(keys[loc]==key)
        return vals[loc];
      loc = (loc+1)%keys.size();
    }
  }

  uint64_t usedSize() const {
    return usedslots;
  }

  uint64_t size() const {
    return keys.size();
  }
};
typedef SimpleHash<uint64_t, char> table_t;

void SaveSimpleHash(const table_t &map){
  std::cout<<"Save. ";
  const auto start = std::chrono::steady_clock::now();

  FILE *f = fopen("/z/map", "wb");
  uint64_t size = map.size();
  fwrite(&size, 8, 1, f);
  fwrite(map.keys.data(), 8, size, f);
  fwrite(map.vals.data(), 1, size, f);
  fwrite(map.used.data(), 1, size/8+1, f);
  fclose(f);

  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Save time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

table_t LoadSimpleHash(){
  std::cout<<"Load. ";
  const auto start = std::chrono::steady_clock::now();

  FILE *f = fopen("/z/map", "rb");
  uint64_t size;
  fread(&size, 8, 1, f);
  table_t map(size);
  fread(map.keys.data(), 8, size, f);
  fread(map.vals.data(), 1, size, f);
  fread(map.used.data(), 1, size/8+1, f);
  fclose(f);

  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;

  return map;
}

int main(){
  //Perfectly horrendous way of seeding a PRNG, but we'll do it here for brevity
  auto generator = std::mt19937(12345); //Combination of my luggage
  //Generate values within the specified closed intervals
  auto key_rand = std::bind(std::uniform_int_distribution<uint64_t>(0,std::numeric_limits<uint64_t>::max()), generator);
  auto val_rand = std::bind(std::uniform_int_distribution<int>(std::numeric_limits<char>::lowest(),std::numeric_limits<char>::max()), generator);

  table_t map(1.3*TEST_TABLE_SIZE);
  std::cout<<"Created table of size "<<map.size()<<std::endl;

  std::cout<<"Generating test data..."<<std::endl;
  for(int i=0;i<TEST_TABLE_SIZE;i++)
    map.insert(key_rand(),(char)val_rand()); //Low chance of collisions, so we get quite close to the desired size

  map.insert(23,42);
  assert(map.get(23)==42);

  SaveSimpleHash(map);
  auto newmap = LoadSimpleHash();

  //Ensure that the load worked
  for(int i=0;i<map.keys.size();i++)
    assert(map.keys.at(i)==newmap.keys.at(i));
  for(int i=0;i<map.vals.size();i++)
    assert(map.vals.at(i)==newmap.vals.at(i));
  for(int i=0;i<map.used.size();i++)
    assert(map.used.at(i)==newmap.used.at(i));
}
(Edit: I've added a new answer to this question which achieves a 95% decrease in wall-times.)
I made a Minimum Working Example that illustrates the problem you are trying to solve. This is something you should always do in your questions.
I then eliminated the unsigned long long int stuff and replaced it with uint64_t from the cstdint library. This ensures that we are operating on the same data size, since unsigned long long int can mean almost anything depending on what computer/compiler you use.
The resulting MWE looks like:
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <deque>
#include <functional>
#include <iostream>
#include <limits>
#include <random>
#include <unordered_map>
#include <vector>

typedef std::unordered_map<uint64_t, char> table_t;

const int TEST_TABLE_SIZE = 10000000;
void Save(const table_t &map){
  std::cout<<"Save. ";
  const auto start = std::chrono::steady_clock::now();

  FILE *f = fopen("/z/map", "wb");
  for(auto iter=map.begin(); iter!=map.end(); iter++){
    fwrite(&(iter->first), 8, 1, f);
    fwrite(&(iter->second), 1, 1, f);
  }
  fclose(f);

  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Save time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

//Take advantage of the limited range of values to save time
void SaveLookup(const table_t &map){
  std::cout<<"SaveLookup. ";
  const auto start = std::chrono::steady_clock::now();

  //Create a lookup table
  std::vector< std::deque<uint64_t> > lookup(256);
  for(auto &kv: map)
    lookup.at(kv.second+128).emplace_back(kv.first);

  //Save lookup table header
  FILE *f = fopen("/z/map", "wb");
  for(const auto &row: lookup){
    const uint32_t rowsize = row.size();
    fwrite(&rowsize, 4, 1, f);
  }

  //Save values
  for(const auto &row: lookup)
    for(const auto &val: row)
      fwrite(&val, 8, 1, f);
  fclose(f);

  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Save time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

//Take advantage of the limited range of values and contiguous memory to
//save time
void SaveLookupVector(const table_t &map){
  std::cout<<"SaveLookupVector. ";
  const auto start = std::chrono::steady_clock::now();

  //Create a lookup table
  std::vector< std::vector<uint64_t> > lookup(256);
  for(auto &kv: map)
    lookup.at(kv.second+128).emplace_back(kv.first);

  //Save lookup table header
  FILE *f = fopen("/z/map", "wb");
  for(const auto &row: lookup){
    const uint32_t rowsize = row.size();
    fwrite(&rowsize, 4, 1, f);
  }

  //Save values
  for(const auto &row: lookup)
    fwrite(row.data(), 8, row.size(), f);
  fclose(f);

  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Save time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}
void Load(table_t &map){
  std::cout<<"Load. ";
  const auto start = std::chrono::steady_clock::now();

  FILE *f = fopen("/z/map", "rb");
  uint64_t key;
  char val;
  while(fread(&key, 8, 1, f)){
    fread(&val, 1, 1, f);
    map[key] = val;
  }
  fclose(f);

  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

void Load2(table_t &map){
  std::cout<<"Load with Reserve. ";
  map.reserve(TEST_TABLE_SIZE+TEST_TABLE_SIZE/8);
  const auto start = std::chrono::steady_clock::now();

  FILE *f = fopen("/z/map", "rb");
  uint64_t key;
  char val;
  while(fread(&key, 8, 1, f)){
    fread(&val, 1, 1, f);
    map[key] = val;
  }
  fclose(f);

  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

//Take advantage of the limited range of values to save time
void LoadLookup(table_t &map){
  std::cout<<"LoadLookup. ";
  map.reserve(TEST_TABLE_SIZE+TEST_TABLE_SIZE/8);
  const auto start = std::chrono::steady_clock::now();

  FILE *f = fopen("/z/map", "rb");

  //Read the header
  std::vector<uint32_t> inpsizes(256);
  for(int i=0;i<256;i++)
    fread(&inpsizes[i], 4, 1, f);

  uint64_t key;
  for(int i=0;i<256;i++){
    const char val = i-128;
    for(int v=0;v<inpsizes.at(i);v++){
      fread(&key, 8, 1, f);
      map[key] = val;
    }
  }
  fclose(f);

  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

//Take advantage of the limited range of values and contiguous memory to save time
void LoadLookupVector(table_t &map){
  std::cout<<"LoadLookupVector. ";
  map.reserve(TEST_TABLE_SIZE+TEST_TABLE_SIZE/8);
  const auto start = std::chrono::steady_clock::now();

  FILE *f = fopen("/z/map", "rb");

  //Read the header
  std::vector<uint32_t> inpsizes(256);
  for(int i=0;i<256;i++)
    fread(&inpsizes[i], 4, 1, f);

  for(int i=0;i<256;i++){
    const char val = i-128;
    std::vector<uint64_t> keys(inpsizes[i]);
    fread(keys.data(), 8, inpsizes[i], f);
    for(const auto &key: keys)
      map[key] = val;
  }
  fclose(f);

  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}
int main(){
  //Perfectly horrendous way of seeding a PRNG, but we'll do it here for brevity
  auto generator = std::mt19937(12345); //Combination of my luggage
  //Generate values within the specified closed intervals
  auto key_rand = std::bind(std::uniform_int_distribution<uint64_t>(0,std::numeric_limits<uint64_t>::max()), generator);
  auto val_rand = std::bind(std::uniform_int_distribution<int>(std::numeric_limits<char>::lowest(),std::numeric_limits<char>::max()), generator);

  std::cout<<"Generating test data..."<<std::endl;
  //Generate a test table
  table_t map;
  for(int i=0;i<TEST_TABLE_SIZE;i++)
    map[key_rand()] = (char)val_rand(); //Low chance of collisions, so we get quite close to the desired size

  Save(map);
  { table_t map2; Load (map2); }
  { table_t map2; Load2(map2); }

  SaveLookup(map);
  SaveLookupVector(map);
  { table_t map2; LoadLookup (map2); }
  { table_t map2; LoadLookupVector(map2); }
}
On the test data set I use, this gives a write time of 1982 ms and a read time (using your original code) of 7467 ms. It seemed as though the read time was the biggest bottleneck, so I created a new function, Load2, which reserves sufficient space for the unordered_map prior to reading. This dropped the read time to 4700 ms (a 37% saving).
Edit 1
Now, I note that the values of your unordered_map can only take on 256 distinct values. Thus, I can easily convert the unordered_map into a kind of lookup table in RAM. That is, rather than having:
123123 1
234234 0
345345 1
237872 1
I can rearrange the data to look like:
0 234234
1 123123 345345 237872
What's the advantage of this? It means that I no longer have to write the value to disk. That saves 1 byte per table entry. Since each table entry consists of 8 bytes for the key and 1 byte for the value, this should give me an 11% savings in both read and write time minus the cost of rearranging the memory (which I expect to be low, because RAM).
Finally, once I've done the above rearrangement, if I have a lot of spare RAM on the machine, I can pack everything into a vector and read/write the contiguous data to disk.
Doing all this gives the following times:
Save. Save time = 1836.52 ms
Load. Load time = 7114.93 ms
Load with Reserve. Load time = 4277.58 ms
SaveLookup. Save time = 1688.73 ms
SaveLookupVector. Save time = 1394.95 ms
LoadLookup. Load time = 3927.3 ms
LoadLookupVector. Load time = 3739.37 ms
Note that the transition from Save to SaveLookup gives an 8% speed-up and the transition from Load with Reserve to LoadLookup gives an 8% speed-up as well. This is right in line with our theory!
Using contiguous memory as well gives a total of a 24% speed-up over your original save time and a total of a 47% speed-up over your original load time.
Since your data seems to be static, and given the number of items, I would certainly consider using your own structure in a binary file and then using memory mapping on that file.
Opening would be instant (just mmap the file).
If you write the values in sorted order, you can use binary search on the mapped data.
If that is not good enough, you could split your data in buckets and store a list with offsets at the beginning of the file - or maybe even use some hash key.
If your keys are all unique and somewhat contiguous, you could even get a smaller file by only storing the char values in file position [key] (and use a special value for null values). Of course that wouldn't work for the full uint64 range, but depending on the data they could be grouped together in buckets containing an offset.
Using mmap this way would also use a lot less memory.
For faster access you could create your own hash map on disk (still with 'instant load').
For example, say you have 1 million hashes (in your case there would be a lot more), you could write 1 million uint64 filepos values at the beginning of the file (the hash value would be the position of the uint64 containing the filepos). Each location would point to a block with one or more key/value pairs, and each of those blocks would start with a count.
If the blocks are aligned on 2 or 4 bytes, a uint32 filepos could be used instead (multiply pos with 2 or 4).
Since the data is static you don't have to worry about possible insertions or deletions, which makes it rather easy to implement.
This has the advantage that you can still mmap the whole file, and all the key/value pairs with the same hash are close together, which brings them into the L1 cache (as compared to, say, linked lists).
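A minimal sketch of the simplest variant (sorted, fixed-size records plus binary search directly over the mapping); this is my own illustration using POSIX mmap, and the packed 9-byte Entry record and the MappedTable name are assumptions, not part of the answer:
// Sketch: memory-map a file of sorted, packed {key, value} records and look
// keys up with binary search. Opening is effectively instant; pages are
// faulted in on demand.
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#pragma pack(push, 1)
struct Entry { uint64_t key; char val; };   // 9 bytes per record on disk
#pragma pack(pop)

class MappedTable {
 public:
  explicit MappedTable(const char *path){
    fd_ = open(path, O_RDONLY);
    if(fd_ < 0) throw std::runtime_error("open failed");
    struct stat st;
    fstat(fd_, &st);
    map_len_ = st.st_size;
    count_   = map_len_ / sizeof(Entry);
    data_ = static_cast<Entry*>(mmap(nullptr, map_len_, PROT_READ, MAP_PRIVATE, fd_, 0));
    if(data_ == MAP_FAILED) throw std::runtime_error("mmap failed");
  }
  ~MappedTable(){ munmap(data_, map_len_); close(fd_); }

  // Binary search over the sorted, memory-mapped records; no load step needed.
  bool lookup(uint64_t key, char &val) const {
    const Entry *end = data_ + count_;
    const Entry *it  = std::lower_bound(data_, end, key,
        [](const Entry &e, uint64_t k){ return e.key < k; });
    if(it == end || it->key != key) return false;
    val = it->val;
    return true;
  }

 private:
  int     fd_      = -1;
  Entry  *data_    = nullptr;
  size_t  map_len_ = 0;
  size_t  count_   = 0;
};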
I would assume you need the map in order to write the values to the file in key order. If so, it would be better to load the values only once into a container, possibly a std::deque since the amount of data is large, call std::sort once, and then iterate through the std::deque to write the values. You would gain cache performance, and the run-time complexity of std::sort is N*log(N), which is better than balancing your map ~624 million times or paying for cache misses in an unordered_map.
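A minimal sketch of that approach (my own illustration), assuming the same binary key/value layout the question already uses; the SortedSave name and the file paths are made up:
// Sketch: load once into a deque, sort once, then write sequentially.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <deque>
#include <utility>

void SortedSave(const char *inpath, const char *outpath){
  // Load all key/value pairs into a deque (no tree balancing, no rehashing).
  std::deque<std::pair<uint64_t, char>> entries;
  FILE *in = fopen(inpath, "rb");
  uint64_t key;
  char val;
  while(fread(&key, 8, 1, in) == 1){
    fread(&val, 1, 1, in);
    entries.emplace_back(key, val);
  }
  fclose(in);

  // One O(N log N) sort instead of hundreds of millions of tree insertions.
  std::sort(entries.begin(), entries.end());

  // Sequential write of the ordered entries.
  FILE *out = fopen(outpath, "wb");
  for(const auto &kv : entries){
    fwrite(&kv.first,  8, 1, out);
    fwrite(&kv.second, 1, 1, out);
  }
  fclose(out);
}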
Perhaps a prefix-ordered traversal during save would help to reduce the amount of internal reordering during load?
Of course, you don't have visibility of the internal structure of the STL map containers, so the best you could do would be to simulate that by binary-chopping the iterator range as if it were linear. Given that you know the total N nodes, save node N/2, then N/4, N*3/4, and so on.
This can be done algorithmically by visiting every odd N/(2^p) node in each pass p: N/2, then N*1/4 and N*3/4, then N*1/8, N*3/8, N*5/8, N*7/8, etc. You need to ensure that the series keeps step sizes such that N*4/8 = N/2 (without resorting to step sizes of 2^(P-p)), and that in the last pass you visit every remaining node. You may find it advantageous to pre-calculate the highest pass number P (~log2(N)) and the float value S = N/(2^P) such that 0.5 < S <= 1.0, and then scale that back up for each pass p.
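A rough sketch of that save order (my own illustration, not the answer author's code), assuming the entries have first been copied in sorted order into a contiguous vector; emitting midpoints level by level is the idea described above, so a later load that inserts in this order should need little internal reordering:
// Sketch only: write map entries in median-first (level) order.
// 'sorted' is assumed to hold the key/value pairs in ascending key order.
#include <cstdint>
#include <cstdio>
#include <deque>
#include <utility>
#include <vector>

void SaveMedianFirst(const std::vector<std::pair<uint64_t, char>> &sorted, FILE *f){
  // Queue of half-open index ranges; each pass emits the midpoints of the
  // ranges produced by the previous pass (N/2, then N/4 and 3N/4, and so on).
  std::deque<std::pair<size_t, size_t>> ranges;
  ranges.push_back(std::make_pair(size_t(0), sorted.size()));
  while(!ranges.empty()){
    const size_t lo = ranges.front().first;
    const size_t hi = ranges.front().second;
    ranges.pop_front();
    if(lo >= hi) continue;
    const size_t mid = lo + (hi - lo) / 2;
    fwrite(&sorted[mid].first,  sizeof(uint64_t), 1, f);
    fwrite(&sorted[mid].second, sizeof(char),     1, f);
    ranges.push_back(std::make_pair(lo, mid));
    ranges.push_back(std::make_pair(mid + 1, hi));
  }
}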
But as others have said, you need to profile it first to see if this is your issue, and profile again to see if this approach helps.

Crackling audio due to wrong audio data

I'm using the CoreAudio low-level API for audio capturing. The app target is Mac OS X, not iOS.
During testing, from time to time we get a very annoying noise modulated with the real audio. The phenomenon develops over time, starting barely noticeable and becoming more and more dominant.
Analyzing the captured audio in Audacity indicates that the end part of the audio packet is wrong.
Here is a sample picture:
The intrusion repeats every 40 ms, which is the configured packetization time (in terms of buffer samples).
Update:
Over time the gap becomes larger; here is another snapshot from the same captured file 10 minutes later. The gap now contains 1460 samples, which is 33 ms out of the total 40 ms of the packet!
CODE SNIPPETS:
capture callback
OSStatus MacOS_AudioDevice::captureCallback(void *inRefCon,
                                            AudioUnitRenderActionFlags *ioActionFlags,
                                            const AudioTimeStamp *inTimeStamp,
                                            UInt32 inBusNumber,
                                            UInt32 inNumberFrames,
                                            AudioBufferList *ioData)
{
    MacOS_AudioDevice* _this = static_cast<MacOS_AudioDevice*>(inRefCon);

    // Get the new audio data
    OSStatus err = AudioUnitRender(_this->m_AUHAL, ioActionFlags, inTimeStamp, inBusNumber, inNumberFrames, _this->m_InputBuffer);
    if (err != noErr)
    {
        ...
        return err;
    }

    // ignore callback on unexpected buffer size
    if (_this->m_params.bufferSizeSamples != inNumberFrames)
    {
        ...
        return noErr;
    }

    // Deliver audio data
    DeviceIOMessage message;
    message.bufferSizeBytes = _this->m_deviceBufferSizeBytes;
    message.buffer = _this->m_InputBuffer->mBuffers[0].mData;
    if (_this->m_callbackFunc)
    {
        _this->m_callbackFunc(_this, message);
    }

    return noErr;
}
Open and start capture device:
void MacOS_AudioDevice::openAUHALCapture()
{
    UInt32 enableIO;
    AudioStreamBasicDescription streamFormat;
    UInt32 size;
    SInt32 *channelArr;
    std::stringstream ss;

    AudioObjectPropertyAddress deviceBufSizeProperty =
    {
        kAudioDevicePropertyBufferFrameSize,
        kAudioDevicePropertyScopeInput,
        kAudioObjectPropertyElementMaster
    };

    // AUHAL
    AudioComponentDescription cd = {kAudioUnitType_Output, kAudioUnitSubType_HALOutput, kAudioUnitManufacturer_Apple, 0, 0};
    AudioComponent HALOutput = AudioComponentFindNext(NULL, &cd);
    verify_macosapi(AudioComponentInstanceNew(HALOutput, &m_AUHAL));

    verify_macosapi(AudioUnitInitialize(m_AUHAL));

    // enable input IO
    enableIO = 1;
    verify_macosapi(AudioUnitSetProperty(m_AUHAL, kAudioOutputUnitProperty_EnableIO, kAudioUnitScope_Input, 1, &enableIO, sizeof(enableIO)));

    // disable output IO
    enableIO = 0;
    verify_macosapi(AudioUnitSetProperty(m_AUHAL, kAudioOutputUnitProperty_EnableIO, kAudioUnitScope_Output, 0, &enableIO, sizeof(enableIO)));

    // Setup current device
    size = sizeof(AudioDeviceID);
    verify_macosapi(AudioUnitSetProperty(m_AUHAL, kAudioOutputUnitProperty_CurrentDevice, kAudioUnitScope_Global, 0, &m_MacDeviceID, sizeof(AudioDeviceID)));

    // Set device native buffer length before setting AUHAL stream
    size = sizeof(m_originalDeviceBufferTimeFrames);
    verify_macosapi(AudioObjectSetPropertyData(m_MacDeviceID, &deviceBufSizeProperty, 0, NULL, size, &m_originalDeviceBufferTimeFrames));

    // Get device format
    size = sizeof(AudioStreamBasicDescription);
    verify_macosapi(AudioUnitGetProperty(m_AUHAL, kAudioUnitProperty_StreamFormat, kAudioUnitScope_Input, 1, &streamFormat, &size));

    // Setup channel map
    assert(m_params.numOfChannels <= streamFormat.mChannelsPerFrame);
    channelArr = new SInt32[streamFormat.mChannelsPerFrame];
    for (int i = 0; i < streamFormat.mChannelsPerFrame; i++)
        channelArr[i] = -1;
    for (int i = 0; i < m_params.numOfChannels; i++)
        channelArr[i] = i;

    verify_macosapi(AudioUnitSetProperty(m_AUHAL, kAudioOutputUnitProperty_ChannelMap, kAudioUnitScope_Input, 1, channelArr, sizeof(SInt32) * streamFormat.mChannelsPerFrame));
    delete [] channelArr;

    // Setup stream converters
    streamFormat.mFormatID = kAudioFormatLinearPCM;
    streamFormat.mFormatFlags = kAudioFormatFlagIsSignedInteger;
    streamFormat.mFramesPerPacket = m_SamplesPerPacket;
    streamFormat.mBitsPerChannel = m_params.sampleDepthBits;
    streamFormat.mSampleRate = m_deviceSampleRate;
    streamFormat.mChannelsPerFrame = 1;
    streamFormat.mBytesPerFrame = 2;
    streamFormat.mBytesPerPacket = streamFormat.mFramesPerPacket * streamFormat.mBytesPerFrame;

    verify_macosapi(AudioUnitSetProperty(m_AUHAL, kAudioUnitProperty_StreamFormat, kAudioUnitScope_Output, 1, &streamFormat, size));

    // Setup callbacks
    AURenderCallbackStruct input;
    input.inputProc = captureCallback;
    input.inputProcRefCon = this;
    verify_macosapi(AudioUnitSetProperty(m_AUHAL, kAudioOutputUnitProperty_SetInputCallback, kAudioUnitScope_Global, 0, &input, sizeof(input)));

    // Calculate the size of the IO buffer (in samples)
    if (m_params.bufferSizeMS != -1)
    {
        unsigned int desiredSignalsInBuffer = (m_params.bufferSizeMS / (double)1000) * m_deviceSampleRate;

        // making sure the value stays in the device's supported range
        desiredSignalsInBuffer = std::min<unsigned int>(desiredSignalsInBuffer, m_deviceBufferFramesRange.mMaximum);
        desiredSignalsInBuffer = std::max<unsigned int>(m_deviceBufferFramesRange.mMinimum, desiredSignalsInBuffer);

        m_deviceBufferFrames = desiredSignalsInBuffer;
    }

    // Set device buffer length
    size = sizeof(m_deviceBufferFrames);
    verify_macosapi(AudioObjectSetPropertyData(m_MacDeviceID, &deviceBufSizeProperty, 0, NULL, size, &m_deviceBufferFrames));

    m_deviceBufferSizeBytes = m_deviceBufferFrames * streamFormat.mBytesPerFrame;
    m_deviceBufferTimeMS = 1000 * m_deviceBufferFrames / m_deviceSampleRate;

    // Calculate number of buffers from channels
    size = offsetof(AudioBufferList, mBuffers[0]) + (sizeof(AudioBuffer) * m_params.numOfChannels);

    // Allocate input buffer
    m_InputBuffer = (AudioBufferList *)malloc(size);
    m_InputBuffer->mNumberBuffers = m_params.numOfChannels;

    // Pre-malloc buffers for AudioBufferLists
    for(UInt32 i = 0; i < m_InputBuffer->mNumberBuffers; i++)
    {
        m_InputBuffer->mBuffers[i].mNumberChannels = 1;
        m_InputBuffer->mBuffers[i].mDataByteSize = m_deviceBufferSizeBytes;
        m_InputBuffer->mBuffers[i].mData = malloc(m_deviceBufferSizeBytes);
    }

    // Update class properties
    m_params.sampleRateHz = streamFormat.mSampleRate;
    m_params.bufferSizeSamples = m_deviceBufferFrames;
    m_params.bufferSizeBytes = m_params.bufferSizeSamples * streamFormat.mBytesPerFrame;
}
eADMReturnCode MacOS_AudioDevice::start()
{
    eADMReturnCode ret = OK;
    LOGAPI(ret);

    if (!m_isStarted && m_isOpen)
    {
        OSStatus err = AudioOutputUnitStart(m_AUHAL);
        if (err == noErr)
            m_isStarted = true;
        else
            ret = ERROR;
    }
    return ret;
}
Any idea what causes it and how to solve it?
Thanks in advance!
Periodic glitches or dropouts can be caused by not paying attention to or by not fully processing the number of frames sent to each audio callback. Valid buffers don't always contain the expected or same number of samples (inNumberFrames might not equal bufferSizeSamples or the previous inNumberFrames in a perfectly valid audio buffer).
It is possible that these types of glitches might be caused by attempting to record at 44.1k on some models of iOS devices that only support 48k audio in hardware.
Some types of glitch might also be caused by any non-hard-real-time code within your m_callbackFunc function (such as any synchronous file reads/writes, OS calls, Objective C message dispatch, GC, or memory allocation/deallocation).
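One way to honor inNumberFrames, rather than dropping buffers whose size differs from the configured packet size, is to accumulate samples and emit fixed-size packets only when enough have arrived. Below is a rough sketch under some stated assumptions: mono 16-bit samples, and two hypothetical members that are not in the original code (m_fifo, a lock-free FIFO of int16_t samples, and m_packetScratch, a packet-sized scratch buffer).
// Sketch only: reworked capture callback that consumes exactly inNumberFrames
// each time and delivers fixed-size packets from an intermediate FIFO.
OSStatus MacOS_AudioDevice::captureCallback(void *inRefCon,
                                            AudioUnitRenderActionFlags *ioActionFlags,
                                            const AudioTimeStamp *inTimeStamp,
                                            UInt32 inBusNumber,
                                            UInt32 inNumberFrames,
                                            AudioBufferList *ioData)
{
    MacOS_AudioDevice* _this = static_cast<MacOS_AudioDevice*>(inRefCon);

    OSStatus err = AudioUnitRender(_this->m_AUHAL, ioActionFlags, inTimeStamp, inBusNumber, inNumberFrames, _this->m_InputBuffer);
    if (err != noErr)
        return err;

    // Push whatever CoreAudio delivered; it may be more or fewer frames than
    // the configured packet size on any given callback.
    const int16_t* samples = static_cast<const int16_t*>(_this->m_InputBuffer->mBuffers[0].mData);
    _this->m_fifo.push(samples, inNumberFrames);   // hypothetical FIFO member

    // Emit fixed-size packets only once enough samples have accumulated.
    while (_this->m_fifo.size() >= _this->m_params.bufferSizeSamples)
    {
        _this->m_fifo.pop(_this->m_packetScratch, _this->m_params.bufferSizeSamples);  // hypothetical scratch buffer
        DeviceIOMessage message;
        message.bufferSizeBytes = _this->m_params.bufferSizeSamples * sizeof(int16_t);
        message.buffer = _this->m_packetScratch;
        if (_this->m_callbackFunc)
            _this->m_callbackFunc(_this, message);
    }
    return noErr;
}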

Readfile() occasionally returns 998 / ERROR_NOACCESS

I am trying to implement fast IO under Windows, working my way up to overlapped IO. From my research, unbuffered IO requires page-aligned buffers. I've attempted to implement this in my code below. However, ReadFile's last error occasionally reports no access (error 998, ERROR_NOACCESS) prior to completing the read, after a few reads of a page-aligned buffer. Sometimes 16 reads, sometimes 4, etc.
I can't for the life of me figure out why I am occasionally getting this error. Any insight would be helpful.
ci::BufferRef CinderSequenceRendererApp::CreateFileLoadWinNoBufferSequential(fs::path path) {
    HANDLE file = CreateFile(path.c_str(), GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN, 0);
    if (file == INVALID_HANDLE_VALUE)
    {
        console() << "Could not open file for reading" << std::endl;
    }

    ci::BufferRef latestAvailableBufferRef = nullptr;

    LARGE_INTEGER nLargeInteger = { 0 };
    GetFileSizeEx(file, &nLargeInteger);

    // how many reads do we need to fill our buffer with a buffer size of x and a read size of y
    // Our buffer needs to hold 'n' sector sizes that will fit the size of the file
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    long readAmount = si.dwPageSize;

    int numReads = 0;
    ULONG bufferSize = 0;

    // calculate sector aligned buffer size that holds our file size
    while (bufferSize < nLargeInteger.QuadPart)
    {
        numReads++;
        bufferSize = (numReads) * readAmount;
    }

    // need one page extra for null if we need it
    latestAvailableBufferRef = ci::Buffer::create(bufferSize + readAmount);

    if (latestAvailableBufferRef != nullptr)
    {
        DWORD outputBytes = 1;
        // output bytes = 0 when EOF
        void* address = latestAvailableBufferRef->getData();

        DWORD bytesRead = 0;
        while (outputBytes != 0)
        {
            bool result = ReadFile(file, address, readAmount, &outputBytes, 0);
            if (!result) //&& (outputBytes == 0))
            {
                getLastReadError();
            }
            address = (void*)((long)address + readAmount);
            bytesRead += outputBytes;
        }
    }

    CloseHandle(file);

    // resize our buffer to expected file size?
    latestAvailableBufferRef->resize(nLargeInteger.QuadPart);

    return latestAvailableBufferRef;
}
Cast to long long: I was truncating my pointer address. Duh. Thanks to @jonathan-potter.
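For reference, the corrected pointer advance might look like this (a sketch reusing the names from the code above; on 64-bit Windows, long is 32 bits, so the original cast truncated the address):
// Original: address = (void*)((long)address + readAmount);   // truncates the pointer on 64-bit Windows
address = static_cast<void*>(static_cast<char*>(address) + readAmount);
// or, with an integer type wide enough for a pointer:
address = reinterpret_cast<void*>(reinterpret_cast<uintptr_t>(address) + readAmount);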

Pass host pointer array to device global memory pointer array?

Suppose we have:
struct collapsed {
    char **seq;
    int num;
};
...
__device__ collapsed *xdev;
...
collapsed *x_dev;
cudaGetSymbolAddress((void **)&x_dev, xdev);
cudaMemcpyToSymbol(x_dev, x, sizeof(collapsed)*size); // x already defined as collapsed*; this line gives ERROR
Why do you think I am getting an error at the last line: invalid device symbol?
The first problem here is that x_dev isn't a device symbol. It might contain an address in device memory, but that address cannot be passed to cudaMemcpyToSymbol. The call should just be:
cudaMemcpyToSymbol(xdev, ......);
Which brings up the second problem. Doing this:
cudaMemcpyToSymbol(xdev, x, sizeof(collapsed)*size);
would be illegal. xdev is a pointer, so the only valid value you can copy to xdev is a device address. If x is the address of a struct collapsed in device memory, then the only valid version of this memory transfer operation is
cudaMemcpyToSymbol(xdev, &x, sizeof(collapsed *));
i.e. x must previously have been set to the address of memory allocated on the device, something like:
collapsed *x;
cudaMalloc((void **)&x, sizeof(collapsed)*size);
cudaMemcpy(x, host_src, sizeof(collapsed)*size, cudaMemcpyHostToDevice);
As promised, here is a complete working example. First the code:
#include <cstdlib>
#include <iostream>
#include <string>
#include <cuda_runtime.h>

struct collapsed {
    char **seq;
    int num;
};

__device__ collapsed xdev;

__global__
void kernel(const size_t item_sz)
{
    if (threadIdx.x < xdev.num) {
        char *p = xdev.seq[threadIdx.x];
        char val = 0x30 + threadIdx.x;
        for(size_t i=0; i<item_sz; i++) {
            p[i] = val;
        }
    }
}

#define gpuQ(ans) { gpu_assert((ans), __FILE__, __LINE__); }
void gpu_assert(cudaError_t code, const char *file, const int line)
{
    if (code != cudaSuccess)
    {
        std::cerr << "gpu_assert: " << cudaGetErrorString(code) << " "
                  << file << " " << line << std::endl;
        exit(code);
    }
}
int main(void)
{
    const int nitems = 32;
    const size_t item_sz = 16;
    const size_t buf_sz = size_t(nitems) * item_sz;

    // Gpu memory for sequences
    char *_buf;
    gpuQ( cudaMalloc((void **)&_buf, buf_sz) );
    gpuQ( cudaMemset(_buf, 0x7a, buf_sz) );

    // Host array for holding sequence device pointers
    char **seq = new char*[nitems];
    size_t offset = 0;
    for(int i=0; i<nitems; i++, offset += item_sz) {
        seq[i] = _buf + offset;
    }

    // Device array holding sequence pointers
    char **_seq;
    size_t seq_sz = sizeof(char*) * size_t(nitems);
    gpuQ( cudaMalloc((void **)&_seq, seq_sz) );
    gpuQ( cudaMemcpy(_seq, seq, seq_sz, cudaMemcpyHostToDevice) );

    // Host copy of the xdev structure to copy to the device
    collapsed xdev_host;
    xdev_host.num = nitems;
    xdev_host.seq = _seq;

    // Copy to device symbol
    gpuQ( cudaMemcpyToSymbol(xdev, &xdev_host, sizeof(collapsed)) );

    // Run Kernel
    kernel<<<1,nitems>>>(item_sz);

    // Copy back buffer
    char *buf = new char[buf_sz];
    gpuQ( cudaMemcpy(buf, _buf, buf_sz, cudaMemcpyDeviceToHost) );

    // Print out seq values
    // Each string should be ASCII starting from '0' (0x30)
    char *seq_vals = buf;
    for(int i=0; i<nitems; i++, seq_vals += item_sz) {
        std::string s;
        s.append(seq_vals, item_sz);
        std::cout << s << std::endl;
    }

    return 0;
}
and here it is compiled and run:
$ /usr/local/cuda/bin/nvcc -arch=sm_12 -Xptxas=-v -g -G -o erogol erogol.cu
./erogol.cu(19): Warning: Cannot tell what pointer points to, assuming global memory space
ptxas info : 8 bytes gmem, 4 bytes cmem[14]
ptxas info : Compiling entry function '_Z6kernelm' for 'sm_12'
ptxas info : Used 5 registers, 20 bytes smem, 4 bytes cmem[1]
$ /usr/local/cuda/bin/cuda-memcheck ./erogol
========= CUDA-MEMCHECK
0000000000000000
1111111111111111
2222222222222222
3333333333333333
4444444444444444
5555555555555555
6666666666666666
7777777777777777
8888888888888888
9999999999999999
::::::::::::::::
;;;;;;;;;;;;;;;;
<<<<<<<<<<<<<<<<
================
>>>>>>>>>>>>>>>>
????????????????
@@@@@@@@@@@@@@@@
AAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCC
DDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEE
FFFFFFFFFFFFFFFF
GGGGGGGGGGGGGGGG
HHHHHHHHHHHHHHHH
IIIIIIIIIIIIIIII
JJJJJJJJJJJJJJJJ
KKKKKKKKKKKKKKKK
LLLLLLLLLLLLLLLL
MMMMMMMMMMMMMMMM
NNNNNNNNNNNNNNNN
OOOOOOOOOOOOOOOO
========= ERROR SUMMARY: 0 errors
Some notes:
To simplify things a bit, I have only used a single memory allocation _buf to hold all of the string data. Each value of seq is set to a different address within _buf. This is functionally equivalent to running a separate cudaMalloc call for each pointer, but much faster.
The key concept is to assemble a copy of the structure you wish to access on the device in host memory, then copy that to the device. All of the pointers in my xdev_host are device pointers. The CUDA API doesn't have any sort of deep copy or automatic pointer translation facility, so it is the programmer's responsibility to make sure this is correct.
Each thread in the kernel just fills its sequence with a different ASCII character. Note that I have declared my xdev as a structure rather than a pointer to a structure, and copy values rather than a reference to the __device__ symbol (again to simplify things slightly). But otherwise the sequence of operations is what you would need to make your design pattern work.
Because I only have access to a compute 1.x device, the compiler issues a warning. On compute 2.x and 3.x devices this won't happen because of the improved memory model in those devices. The warning is normal and can be safely ignored.
Because each sequence is just written into a different part of _buf, I can transfer all the sequences back to the host with a single cudaMemcpy call.

Conversion of PNG image to base64 in Windows phone7.1

I want to convert a PNG image found at a path to base64 for an HTML page in Windows Phone 7.1. How can it be done?
Stream imgStream;
imgStream = Assembly.GetExecutingAssembly().GetManifestResourceStream("NewUIChanges.Htmlfile.round1.png");
byte[] data = new byte[(int)imgStream.Length];
int offset = 0;
while (offset < data.Length)
{
    int bytesRead = imgStream.Read(data, offset, data.Length - offset);
    if (bytesRead <= 0)
    {
        throw new EndOfStreamException("Stream wasn't as long as it claimed");
    }
    offset += bytesRead;
}
The fact that it's a PNG image is actually irrelevant - all you need to know is that you've got some bytes that you need to convert into base64.
Read the data from a stream into a byte array, and then use Convert.ToBase64String. Reading a byte array from a stream can be slightly fiddly, depending on whether the stream advertises its length or not. If it does, you can use:
byte[] data = new byte[(int) stream.Length];
int offset = 0;
while (offset < data.Length)
{
    int bytesRead = stream.Read(data, offset, data.Length - offset);
    if (bytesRead <= 0)
    {
        throw new EndOfStreamException("Stream wasn't as long as it claimed");
    }
    offset += bytesRead;
}
If it doesn't, the simplest approach is probably to copy it to a MemoryStream:
using (MemoryStream ms = new MemoryStream())
{
    byte[] buffer = new byte[8 * 1024];
    int bytesRead;
    while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        ms.Write(buffer, 0, bytesRead);
    }
    return ms.ToArray();
}
So once you've used either of those bits of code (or anything else suitable) to get a byte array, just use Convert.ToBase64String and you're away.
There are probably streaming solutions which will avoid ever having the whole byte array in memory - e.g. building up a StringBuilder of base64 data as it goes - but they would be more complicated. Unless you're going to deal with very large files, I'd stick with the above.
