How to handle the entropy of two images from an .mp4 video

I am working on entropy. I am grabbing consecutive frames from an .mp4 file and I want to compute the entropy of the current frame against the previous frame. If the entropy between them is not zero, the frame should be checked; otherwise it should be ignored. The previous frame should be kept and the next current frame taken after 2 seconds; if the entropy is zero, the frame is ignored and we wait another 2 seconds. Here is my code:
capture.open("recog.mp4");
if (!capture.isOpened()) {
    cerr << "can not open camera or video file" << endl;
}
while (1)
{
    capture >> current_frame;
    if (current_frame.empty())
        break;
    if (!previous_frame.empty()) {
        subtract(current_frame, previous_frame, pre_img);
        Mat hist;
        int channels[] = {0};
        int histSize[] = {32};
        float range[] = { 0, 256 };
        const float* ranges[] = { range };
        calcHist(&pre_img, 1, channels, Mat(), // do not use mask
                 hist, 1, histSize, ranges,
                 true,   // the histogram is uniform
                 false);
        Mat histNorm = hist / (pre_img.rows * pre_img.cols);
        double entropy = 0.0;
        for (int i = 0; i < histNorm.rows; i++)
        {
            float binEntry = histNorm.at<float>(i, 0);
            if (binEntry != 0.0)
            {
                entropy -= binEntry * log(binEntry);
            }
            else
            {
                // ignore the frame and go for the next, but how do I code that? Is there any function to ignore it?
            }
        }
    }
    waitKey(10);
    current_frame.copyTo(previous_frame);
}
As far as I can tell, this computes the entropy of only one image (the current one), which then becomes the previous image when the next one comes in. I also get an error on log2 when I use it like this: entropy -= binEntry * log2(binEntry);. Can you please tell me how to ignore the frame when the entropy is zero, so that the .mp4 keeps running? And should I use cvWaitKey(2) to check the .mp4 again after 2 seconds, i.e. the .mp4 keeps playing while I am ignoring frames?
By "ignore" I mean: when the current frame is subtracted from the previous one and the entropy is 0, the previous frame stays the previous frame (the current one does not replace it); then we wait 2 seconds for the next current frame and perform the task on that.

To ignore a certain number of frames, simply read them from the stream and discard them:
for(int i=0; i<60; i++)
capture >> current_frame;
If your video has 30fps this would skip 2 seconds of video.
To act only when the entropy is greater than a certain threshold, add something like this:
if ( entropy > 1.0 )
{
// do something
}
I used a threshold, because due to noise the entropy probably will never be zero between different frames.
If your compiler does not provide the log2 function, you can emulate it (for example as log(x)/log(2.0)) as described here.
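Putting those pieces together, a minimal sketch of the whole loop might look like the following. It converts to grayscale before the histogram, uses a hypothetical entropyOf() helper (not part of the original code), derives the 2-second skip from CAP_PROP_FPS with a fallback of 60 frames, and uses the threshold of 1.0 from the answer above:
#include <opencv2/opencv.hpp>
#include <cmath>
#include <iostream>
using namespace cv;
using namespace std;

// Hypothetical helper: entropy of a single-channel image from a 32-bin histogram.
static double entropyOf(const Mat& img)
{
    Mat hist;
    int channels[] = {0};
    int histSize[] = {32};
    float range[] = {0, 256};
    const float* ranges[] = {range};
    calcHist(&img, 1, channels, Mat(), hist, 1, histSize, ranges, true, false);
    Mat histNorm = hist / (double)(img.rows * img.cols);
    double entropy = 0.0;
    for (int i = 0; i < histNorm.rows; i++) {
        float p = histNorm.at<float>(i, 0);
        if (p > 0.0f)
            entropy -= p * std::log(p) / std::log(2.0);  // log2 emulated via natural log
    }
    return entropy;
}

int main()
{
    VideoCapture capture("recog.mp4");
    if (!capture.isOpened()) {
        cerr << "can not open camera or video file" << endl;
        return 1;
    }
    double fps = capture.get(CAP_PROP_FPS);
    int skip = (fps > 0) ? (int)(fps * 2.0) : 60;   // roughly 2 seconds worth of frames

    Mat current_frame, previous_frame, diff;
    while (true) {
        // skip ~2 seconds of video before looking at the next frame
        for (int i = 0; i < skip; i++)
            capture >> current_frame;
        if (current_frame.empty())
            break;

        Mat gray;
        cvtColor(current_frame, gray, COLOR_BGR2GRAY);
        if (!previous_frame.empty()) {
            subtract(gray, previous_frame, diff);
            double e = entropyOf(diff);
            if (e > 1.0) {                       // threshold instead of exact zero
                gray.copyTo(previous_frame);     // frame accepted: it becomes the new reference
                // ... process the accepted frame here ...
            }
            // otherwise the frame is ignored and previous_frame is kept
        } else {
            gray.copyTo(previous_frame);         // first frame becomes the reference
        }
        if (waitKey(10) == 27)
            break;
    }
    return 0;
}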

Related

FFMPEG - AVFrame to per channel array conversion

I am looking to copy an AVFrame into an array where pixels are stored one channel at a time in a row-major order.
Details:
I am using FFMPEG's api to read frames from a video. I have used avcodec_decode_video2 to fetch each frame as an AVFrame as follows:
AVFormatContext* fmt_ctx = NULL;
avformat_open_input(&fmt_ctx, filepath, NULL, NULL);
...
int video_stream_idx; // stores the stream index for the video
...
AVFrame* vid_frame = NULL;
vid_frame = av_frame_alloc();
AVPacket vid_pckt;
int frame_finish;
...
while (av_read_frame(fmt_ctx, &vid_pckt) >= 0) {
if (vid_pckt.stream_index == video_stream_idx) {
avcodec_decode_video2(cdc_ctx, vid_frame, &frame_finish, &vid_pckt);
if (frame_finish) {
/* perform conversion */
}
}
}
The destination array looks like this:
unsigned char* frame_arr = new unsigned char [cdc_ctx->width * cdc_ctx->height * 3];
I need to copy all of vid_frame into frame_arr, where the range of pixel values should be [0, 255]. The problem is that the array needs to store the frame in row major order, one channel at a time, i.e. R11, R12, ... R21, R22, ... G11, G12, ... G21, G22, ... B11, B12, ... B21, B22, ... (I have used the notation [color channel][row index][column index], i.e. G21 is the green channel value of pixel at row 2, column 1). I have had a look at sws_scale, but I don't understand it enough to figure out whether that function is capable of doing such a conversion. Can somebody help!! :)
The format you called "one channel at a time" is known as planar (btw, the opposite format is called packed), and almost every pixel format is in row order.
The problem here is that the input format may vary, and all of them should be converted to one format. That's what sws_scale() does.
However, there is no such planar RGB format in the ffmpeg libs yet. You would have to write your own pixel format description into the ffmpeg source file libavutil/pixdesc.c and rebuild the libs.
Or you can just convert the frame into the AV_PIX_FMT_GBRP format, which is the closest to what you want. AV_PIX_FMT_GBRP is a planar format where the green channel comes first and red last (blue in the middle). Then rearrange those channels; a sketch of that rearrangement follows the code below.
// Create a SwsContext first:
SwsContext* sws_ctx = sws_getContext(cdc_ctx->width, cdc_ctx->height, cdc_ctx->pix_fmt,
                                     cdc_ctx->width, cdc_ctx->height, AV_PIX_FMT_GBRP,
                                     0, 0, 0, 0);
// alloc some new space for storing the converted frame
AVFrame* gbr_frame = av_frame_alloc();
gbr_frame->format = AV_PIX_FMT_GBRP;
gbr_frame->width = cdc_ctx->width;
gbr_frame->height = cdc_ctx->height;
av_frame_get_buffer(gbr_frame, 32);
....
while (av_read_frame(fmt_ctx, &vid_pckt) >= 0) {
    ret = avcodec_send_packet(cdc_ctx, &vid_pckt);
    // In particular, we don't expect AVERROR(EAGAIN), because we read all
    // decoded frames with avcodec_receive_frame() until done.
    if (ret < 0)
        break;
    ret = avcodec_receive_frame(cdc_ctx, vid_frame);
    if (ret < 0 && ret != AVERROR(EAGAIN) && ret != AVERROR_EOF)
        break;
    if (ret >= 0) {
        // convert image from native format to planar GBR
        sws_scale(sws_ctx, vid_frame->data,
                  vid_frame->linesize, 0, vid_frame->height,
                  gbr_frame->data, gbr_frame->linesize);
        // rearrange gbr channels in gbr_frame as you like
        // g channel is gbr_frame->data[0]
        // b channel is gbr_frame->data[1]
        // r channel is gbr_frame->data[2]
        // ......
    }
}
av_frame_free(&gbr_frame);
av_frame_free(&vid_frame);
sws_freeContext(sws_ctx);
avformat_close_input(&fmt_ctx);
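If the goal is exactly the frame_arr layout from the question (all R values, then all G, then all B, each row-major), the rearranging step could be a per-plane copy like the sketch below. It assumes gbr_frame has been converted as above, runs inside the decode loop once per received frame, and respects each plane's linesize padding:
// AV_PIX_FMT_GBRP plane order: data[0] = G, data[1] = B, data[2] = R.
int w = cdc_ctx->width;
int h = cdc_ctx->height;
unsigned char* dstR = frame_arr;                 // R plane first in frame_arr
unsigned char* dstG = frame_arr + w * h;         // then G plane
unsigned char* dstB = frame_arr + 2 * w * h;     // then B plane
for (int row = 0; row < h; row++) {
    memcpy(dstR + row * w, gbr_frame->data[2] + row * gbr_frame->linesize[2], w);
    memcpy(dstG + row * w, gbr_frame->data[0] + row * gbr_frame->linesize[0], w);
    memcpy(dstB + row * w, gbr_frame->data[1] + row * gbr_frame->linesize[1], w);
}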

How can I use mach_absolute_time without overflowing?

On Darwin, the POSIX standard clock_gettime(CLOCK_MONOTONIC) timer is not available. Instead, the highest resolution monotonic timer is obtained through the mach_absolute_time function from mach/mach_time.h.
The result returned may be an unadjusted tick count from the processor, in which case the time units could be a strange multiple. For example, on a CPU with a 33MHz tick count, Darwin returns 1000000000/33333335 as the exact units of the returned result (ie, multiply the mach_absolute_time by that fraction to obtain a nanosecond value).
We usually wish to convert from exact ticks to "standard" (decimal) units, but unfortunately, naively multiplying the absolute time by the fraction will overflow even in 64-bit arithmetic. This is an error that Apple's sole piece of documentation on mach_absolute_time falls into (Technical Q&A QA1398).
How should I write a function that correctly uses mach_absolute_time?
Note that this is not a theoretical problem: the sample code in QA1398 completely fails to work on PowerPC-based Macs. On Intel Macs, mach_timebase_info always returns 1/1 as the scaling factor because the CPU's raw tick count is unreliable (dynamic speed-stepping), so the API does the scaling for you. On PowerPC Macs, mach_timebase_info returns either 1000000000/33333335 or 1000000000/25000000, so Apple's provided code definitely overflows every few minutes. Oops.
Most-precise (best) answer
Perform the arithmetic at 128-bit precision to avoid the overflow!
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
}
uint64_t scale(uint64_t i) {
return scaleHighPrecision(i - bias, tb.numer, tb.denom);
}
static uint64_t scaleHighPrecision(uint64_t i, uint32_t numer,
uint32_t denom) {
uint64_t high = (i >> 32) * numer;
uint64_t low = (i & 0xffffffffull) * numer / denom;
uint64_t highRem = ((high % denom) << 32) / denom;
high /= denom;
return (high << 32) + highRem + low;
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return data.scale(now);
}
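Usage is the same for this variant and the two that follow: call the function twice and take the difference of the returned nanosecond values. A minimal sketch, assuming the monotonicTimeNanos definition above and <cstdio>:
uint64_t t0 = monotonicTimeNanos();
// ... work being timed ...
uint64_t t1 = monotonicTimeNanos();
printf("elapsed: %llu ns\n", (unsigned long long)(t1 - t0));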
A simple low-resolution answer
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process. The clock may run up to 0.1% faster or slower
// than the "exact" tick count.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
if (tb.denom > 1024) {
double frac = (double)tb.numer/tb.denom;
tb.denom = 1024;
tb.numer = tb.denom * frac + 0.5;
assert(tb.numer > 0);
}
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return (now - data.bias) * data.tb.numer / data.tb.denom;
}
A fiddly solution using low-precision arithmetic but using continued fractions to avoid loss of accuracy
// This function returns the rational number inside the given interval with
// the smallest denominator (and smallest numerator breaks ties; correctness
// proof neglects floating-point errors).
static mach_timebase_info_data_t bestFrac(double a, double b) {
if (floor(a) < floor(b))
{ mach_timebase_info_data_t rv = {(int)ceil(a), 1}; return rv; }
double m = floor(a);
mach_timebase_info_data_t next = bestFrac(1/(b-m), 1/(a-m));
mach_timebase_info_data_t rv = {(int)m*next.numer + next.denom, next.numer};
return rv;
}
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process. The clock may run up to 0.1% faster or slower
// than the "exact" tick count. However, although the bound on the error is
// the same as for the pragmatic answer, the error is actually minimized over
// the given accuracy bound.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
double frac = (double)tb.numer/tb.denom;
uint64_t spanTarget = 315360000000000000llu; // 10 years
if (getExpressibleSpan(tb.numer, tb.denom) >= spanTarget)
return;
for (double errorTarget = 1/1024.0; errorTarget > 0.000001;) {
mach_timebase_info_data_t newFrac =
bestFrac((1-errorTarget)*frac, (1+errorTarget)*frac);
if (getExpressibleSpan(newFrac.numer, newFrac.denom) < spanTarget)
break;
tb = newFrac;
errorTarget = fabs((double)tb.numer/tb.denom - frac) / frac / 8;
}
assert(getExpressibleSpan(tb.numer, tb.denom) >= spanTarget);
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return (now - data.bias) * data.tb.numer / data.tb.denom;
}
The derivation
We aim to reduce the fraction returned by mach_timebase_info to one that is essentially the same, but with a small denominator. The size of the timespan that we can handle is limited only by the size of the denominator, not the numerator of the fraction we shall multiply by:
uint64_t getExpressibleSpan(uint32_t numer, uint32_t denom) {
// This is just less than the smallest thing we can multiply numer by without
// overflowing. ceilLog2(numer) = 64 - number of leading zeros of numer
uint64_t maxDiffWithoutOverflow = ((uint64_t)1 << (64 - ceilLog2(numer))) - 1;
return maxDiffWithoutOverflow * numer / denom;
}
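ceilLog2 is not shown above; per the comment, it is 64 minus the number of leading zeros of its argument. A minimal sketch, assuming a GCC/Clang-style __builtin_clzll is available (numer is always nonzero here, so the builtin's undefined zero-input case never arises):
#include <stdint.h>

static inline uint32_t ceilLog2(uint64_t x) {
    // 64 - (number of leading zero bits of x), matching the comment above.
    return 64 - (uint32_t)__builtin_clzll(x);
}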
If denom=33333335 as returned by mach_timebase_info, we can handle differences of up to 18 seconds only before the multiplication by numer overflows. As getExpressibleSpan shows, by calculating a rough lower bound for this, the size of numer doesn't matter: halving numer doubles maxDiffWithoutOverflow. The only goal therefore is to produce a fraction close to numer/denom that has a smaller denominator. The simplest method to do this is using continued fractions.
The continued fractions method is rather handy. bestFrac clearly works correctly if the provided interval contains an integer: it returns the least integer in the interval over 1. Otherwise, it calls itself recursively with a strictly larger interval and returns m+1/next. The final result is a continued fraction that can be shown by induction to have the correct property: it's optimal, the fraction inside the given interval with the least denominator.
Finally, we reduce the fraction Darwin passes us to a smaller one to use when rescaling the mach_absolute_time to nanoseconds. We may introduce an error here because we can't reduce the fraction in general without losing accuracy. We set ourselves the target of 0.1% error, and check that we've reduced the fraction enough for common timespans (up to ten years) to be handled correctly.
Arguably the method is over-complicated for what it does, but it handles correctly anything the API can throw at it, and the resulting code is still short and extremely fast (bestFrac typically recurses only three or four iterations deep before returning a denominator less than 1000 for random intervals [a,a*1.002]).
You're worrying about overflow when multiplying/dividing with values from the mach_timebase_info struct, which is used for conversion to nanoseconds. So, while it may not fit your exact needs, there are easier ways to get a count in nanoseconds or seconds.
All solutions below are using mach_absolute_time internally (and NOT the wall clock).
Use double instead of uint64_t
(supported in Objective-C and Swift)
double tbInSeconds = 0;
mach_timebase_info_data_t tb;
kern_return_t kError = mach_timebase_info(&tb);
if (kError == 0) {
tbInSeconds = 1e-9 * (double)tb.numer / (double)tb.denom;
}
(remove the 1e-9 if you want nanoseconds)
Usage:
uint64_t start = mach_absolute_time();
// do something
uint64_t stop = mach_absolute_time();
double durationInSeconds = tbInSeconds * (stop - start);
Use ProcessInfo.processInfo.systemUptime
(supported in Objective-C and Swift)
It does the job in double seconds directly:
CFTimeInterval start = NSProcessInfo.processInfo.systemUptime;
// do something
CFTimeInterval stop = NSProcessInfo.processInfo.systemUptime;
NSTimeInterval durationInSeconds = stop - start;
For reference, the source code of systemUptime just does something similar to the previous solution:
struct mach_timebase_info info;
mach_timebase_info(&info);
__CFTSRRate = (1.0E9 / (double)info.numer) * (double)info.denom;
__CF1_TSRRate = 1.0 / __CFTSRRate;
uint64_t tsr = mach_absolute_time();
return (CFTimeInterval)((double)tsr * __CF1_TSRRate);
Use QuartzCore.CACurrentMediaTime()
(supported in Objective-C and Swift)
Same as systemUptime, but without being open source.
Use Dispatch.DispatchTime.now()
(supported in Swift only)
Another wrapper around mach_absolute_time(). Base precision is nanoseconds, backed with UInt64.
let start = DispatchTime.now()
// do something
let stop = DispatchTime.now()
let durationInSeconds = Double(stop.uptimeNanoseconds - start.uptimeNanoseconds) / 1_000_000_000
For reference, the source code of DispatchTime.now() shows that it simply returns DispatchTime(rawValue: mach_absolute_time()). And the calculation for uptimeNanoseconds is:
(result, overflow) = result.multipliedReportingOverflow(by: UInt64(DispatchTime.timebaseInfo.numer))
result = overflow ? UInt64.max : result / UInt64(DispatchTime.timebaseInfo.denom)
So it simply saturates to UInt64.max if the multiplication cannot be stored in a UInt64.
If mach_absolute_time() ever wraps back to 0, reset your time calculations whenever the new value is less than the last one you checked.
That's the problem: the documentation does not say what happens when the uint64 reaches all ones (binary).
Read it: https://developer.apple.com/documentation/kernel/1462446-mach_absolute_time

OpenCL slow -- not sure why

I'm teaching myself OpenCL by trying to optimize the mpeg4dst reference audio encoder. I achieved a 3x speedup by using vector instructions on CPU but I figured the GPU could probably do better.
I'm focusing on computing auto-correlation vectors in OpenCL as my first area of improvement. The CPU code is:
for (int i = 0; i < NrOfChannels; i++) {
for (int shift = 0; shift <= PredOrder[ChannelFilter[i]]; shift++)
vDSP_dotpr(Signal[i] + shift, 1, Signal[i], 1, &out, NrOfChannelBits - shift);
}
NrOfChannels = 6
PredOrder = 129
NrOfChannelBits = 150528.
On my test file, this function takes approximately 188 ms to complete.
Here's my OpenCL method:
kernel void calculateAutocorrelation(size_t offset,
global const float *input,
global float *output,
size_t size) {
size_t index = get_global_id(0);
size_t end = size - index;
float sum = 0.0;
for (size_t i = 0; i < end; i++)
sum += input[i + offset] * input[i + offset + index];
output[index] = sum;
}
This is how it is called:
gcl_memcpy(gpu_signal_in, Signal, sizeof(float) * NrOfChannels * MAXCHBITS);
for (int i = 0; i < NrOfChannels; i++) {
size_t sz = PredOrder[ChannelFilter[i]] + 1;
cl_ndrange range = { 1, { 0, 0, 0 }, { sz, 0, 0}, { 0, 0, 0 } };
calculateAutocorrelation_kernel(&range, i * MAXCHBITS, (cl_float *)gpu_signal_in, (cl_float *)gpu_out, NrOfChannelBits);
gcl_memcpy(out, gpu_out, sizeof(float) * sz);
}
According to Instruments, my OpenCL implementation seems to take about 13ms, with about 54ms of memory copy overhead (gcl_memcpy).
When I use a much larger test file (1 minute of 2-channel music vs. 1 second of 6-channel), the measured performance of the OpenCL code seems to be the same, but the CPU usage falls to about 50% and the whole program takes about 2x longer to run.
I can't find a cause for this in Instruments and I haven't read anything yet that suggests that I should expect very heavy overhead switching in and out of OpenCL.
If I'm reading your kernel code correctly, each work item is iterating over all of the data from its location to the end. This isn't going to be efficient. For one (and the primary performance concern), the memory accesses won't be coalesced and so won't be at full memory bandwidth. Secondly, because each work item has a different amount of work, there will be branch divergence within a work group, which will leave some threads idle waiting for others.
This seems like it has a lot in common with a reduction problem and I'd suggest reading up on "parallel reduction" to get some hints about doing an operation like this in parallel.
To see how memory is being read, work out how 16 work items (say, global_id 0 to 15) will be reading data for each step.
Note that if every work item in a work group accesses the same memory, there is a "broadcast" optimization the hardware can make. So just reversing the order of your loop could improve things.

Strange fseek()/fwrite() performance on MacOS

I have problems with write performance of fseek()/fwrite() on my Mac. I'm operating on large files up to 4 GB of size, tests below were made with a rather small one with only 120 MB. My strategy is as follows:
fopen() a new file on disk
fill the file with zeroes (takes ~3 seconds)
write small blocks of data to random positions (30.000 blocks, 4k each)
The whole procedure takes around 120 seconds.
The write strategy is bound to an image rotation algorithm (see my question here) and unless someone comes up with a faster solution for the rotation problem, I'm not able to change the strategy of using fseek() and then writing 4k or less to the file.
What I am observing is this: The first few thousand fseek()/fwrite() perform quite well, but the performance drops very fast, faster than you would expect from any system cache being filled up. The chart below shows fwrite()s per second vs time in seconds. As you see, after 7 seconds the fseek()/fwrite() rate reaches approx. 200 per second, still going down until it reaches 100 per second at the very end of the process.
In the middle of the process (2 or 3 times), the OS decides to flush file contents to disk, which I can see from my console output hanging for a few seconds; during that time I see approx. 5 MB/s of writes on my disk (which isn't that much). After fclose() the system seems to write out the whole file, and I see 20 MB/s of disk activity for a longer period of time.
If I use fflush() every 5.000 fwrite()s, the behaviour doesn't change at all. Putting in fclose()/fopen() to force flushing somehow speeds up the whole thing by approx. 10%.
I did profile the process (screenshot below) and you see, that virtually all time is spent inside fwrite() and fseek() which can be drilled down to __write_nocancel() for both of them.
Completely absurd summary
Imagine the case where my input data fits into my buffers completely and thus I'm able to write my rotated output data linearly without the need to split the write process into fragments. I still use fseek() to position the file pointer, just because the logic of the writing function behaves that way, but the file pointer in this case is set to the same position where it already was. One would expect no performance impact. Wrong.
What is absurd is, if I remove the calls to fseek() for that special case, my function finishes within 2.7 seconds instead of 120 seconds.
Now, after a long foreword, the question is: Why does fseek() have such an impact on performance, even if I seek to the same position? How could I speed it up (by another strategy or other function calls, disabling caching if possible, memory mapped access, ...)?
For reference, here's my code (not tidied up, not optimized, containing lots of debug output):
-(bool)writeRotatedRaw:(TIFF*)tiff toFile:(NSString*)strFile
{
if(!tiff) return NO;
if(!strFile) return NO;
NSLog(#"Starting to rotate '%#'...", strFile);
FILE *f = fopen([strFile UTF8String], "w");
if(!f)
{
NSString *msg = [NSString stringWithFormat:@"Could not open '%@' for writing.", strFile];
NSRunAlertPanel(@"Error", msg, @"OK", nil, nil);
return NO;
}
#define LINE_CACHE_SIZE (1024*1024*256)
int h = [tiff iImageHeight];
int w = [tiff iImageWidth];
int iWordSize = [tiff iBitsPerSample]/8;
int iBitsPerPixel = [tiff iBitsPerSample];
int iLineSize = w*iWordSize;
int iLinesInCache = LINE_CACHE_SIZE / iLineSize;
int iLinesToGo = h, iLinesToRead;
NSLog(#"Creating temporary file");
double time = CACurrentMediaTime();
double lastTime = time;
unsigned char *dummy = calloc(iLineSize, 1);
for(int i=0; i<h; i++) fwrite(dummy, 1, iLineSize, f);
free(dummy);
fclose(f);
f = fopen([strFile UTF8String], "w");
NSLog(#"Created temporary file (%.1f MB) in %.1f seconds", (float)iLineSize*(float)h/1024.0f/1024.0f, CACurrentMediaTime()-time);
fseek(f, 0, SEEK_SET);
lastTime = CACurrentMediaTime();
time = CACurrentMediaTime();
int y=0;
unsigned char *ucRotatedPixels = malloc(iLinesInCache*iWordSize);
unsigned short int *uRotatedPixels = (unsigned short int*)ucRotatedPixels;
unsigned char *ucLineCache = malloc(w*iWordSize*iLinesInCache);
unsigned short int *uLineCache = (unsigned short int*)ucLineCache;
unsigned char *uc;
unsigned int uSizeCounter=0, uMaxSize = iLineSize*h, numfwrites=0, lastwrites=0;
while(iLinesToGo>0)
{
iLinesToRead = iLinesToGo;
if(iLinesToRead>iLinesInCache) iLinesToRead = iLinesInCache;
for(int i=0; i<iLinesToRead; i++)
{
// read as much lines as fit into buffer
uc = [tiff getRawLine:y+i withBitsPerPixel:iBitsPerPixel];
memcpy(ucLineCache+i*iLineSize, uc, iLineSize);
}
for(int x=0; x<w; x++)
{
if(iBitsPerPixel==8)
{
for(int i=0; i<iLinesToRead; i++)
{
ucRotatedPixels[iLinesToRead-i-1] = ucLineCache[i*w+x];
}
fseek(f, w*x+(h-y-1), SEEK_SET);
fwrite(ucRotatedPixels, 1, iLinesToRead, f);
numfwrites++;
uSizeCounter += iLinesToRead;
if(CACurrentMediaTime()-lastTime>1.0)
{
lastTime = CACurrentMediaTime();
NSLog(#"Progress: %.1f %%, x=%d, y=%d, iLinesToRead=%d\t%d", (float)uSizeCounter * 100.0f / (float)uMaxSize, x, y, iLinesToRead, numfwrites);
}
}
else
{
for(int i=0; i<iLinesToRead; i++)
{
uRotatedPixels[iLinesToRead-i-1] = uLineCache[i*w+x];
}
fseek(f, (w*x+(h-y-1))*2, SEEK_SET);
fwrite(uRotatedPixels, 2, iLinesToRead, f);
uSizeCounter += iLinesToRead*2;
if(CACurrentMediaTime()-lastTime>1.0)
{
lastTime = CACurrentMediaTime();
NSLog(#"Progress: %.1f %%, x=%d, y=%d, iLinesToRead=%d\t%d", (float)uSizeCounter * 100.0f / (float)uMaxSize, x, y, iLinesToRead, numfwrites);
}
}
}
y += iLinesInCache;
iLinesToGo -= iLinesToRead;
}
free(ucLineCache);
free(ucRotatedPixels);
fclose(f);
NSLog(#"Finished, %.1f s", (CACurrentMediaTime()-time));
return YES;
}
I'm a bit lost because I do not understand how the system "optimizes" my calls. Any input is appreciated.
Just to somehow close this question, I'll answer it myself and share my solution.
Although I wasn't able to improve the performance of the fseek() calls themselves, I did implement a well-performing workaround. The aim was to avoid fseek() at any cost. Because I need to write fragments of data to different positions of the target file, but those fragments appear at equal distances and the gaps between them are filled by other fragments written somewhat later in the process, I split the writing into multiple files. I write to as many files as there are fragment streams and then, in a last step, re-open all those temporary files, read them in rotation, and write the data blocks linearly to the target file. The performance of this is good, reaching approx. 4 seconds for the example given above.
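A minimal sketch of that idea (sequential writes into per-stream temporary files, followed by one linear merge pass; kNumStreams, kFragSize and kFragsPerStream are placeholders, not values from the project):
#include <cstdio>
#include <vector>
#include <string>

// Sketch: write each fragment stream sequentially to its own temp file,
// then merge them into the target file with purely linear writes.
int main() {
    const int kNumStreams = 4;          // placeholder: number of interleaved fragment streams
    const size_t kFragSize = 4096;      // placeholder: fragment size in bytes
    const int kFragsPerStream = 1000;   // placeholder: fragments per stream

    // 1) sequential writes, one file per stream (no fseek needed)
    std::vector<FILE*> tmp(kNumStreams);
    std::vector<char> frag(kFragSize, 0);
    for (int s = 0; s < kNumStreams; s++) {
        std::string name = "stream_" + std::to_string(s) + ".tmp";
        tmp[s] = fopen(name.c_str(), "wb");
        for (int f = 0; f < kFragsPerStream; f++)
            fwrite(frag.data(), 1, kFragSize, tmp[s]);   // produce the fragment data here
        fclose(tmp[s]);
    }

    // 2) merge pass: read the temp files in rotation, write the target strictly linearly
    FILE* out = fopen("target.bin", "wb");
    for (int s = 0; s < kNumStreams; s++)
        tmp[s] = fopen(("stream_" + std::to_string(s) + ".tmp").c_str(), "rb");
    for (int f = 0; f < kFragsPerStream; f++) {
        for (int s = 0; s < kNumStreams; s++) {
            fread(frag.data(), 1, kFragSize, tmp[s]);
            fwrite(frag.data(), 1, kFragSize, out);      // sequential output, no fseek
        }
    }
    for (int s = 0; s < kNumStreams; s++) fclose(tmp[s]);
    fclose(out);
    return 0;
}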

calculate sending file speed/sec in less than a second (without using thread.sleep)

This is a file transfer (Server-Client tcp sockets)
The code below shows the transfer rate (KB/s) once every second.
I want to show the speed (rate/s) every time I send data to the client. How do I calculate the speed each time (without using Thread.Sleep(1000))?
private void timeElasped()
{
int rate = 0;
int prevSent = 0;
while (fileTransfer.busy)
{
rate = fileTransfer.Sent - prevSent ;
prevSent = fileTransfer.Sent;
RateLabel(string.Format("{0}/Sec", CnvrtUnit(rate)));
if(rate!=0)
Timeleft = (fileTransfer.fileSize - fileTransfer.sum) / rate;
TimeSpan t = TimeSpan.FromSeconds(Timeleft);
timeLeftLabel(FormatRemainingText(rate, t));
Thread.Sleep(1000);
}
}
You have two decisions to make:
Over how much time do you want to take the average transfer speed?
How often do you want to update/report the result?
Recall that there is no such thing as the current instantaneous transfer speed. Or, more correctly, the current instantaneous transfer speed is always either the full physical speed of your network interface (e.g. 100 Mbps) or zero, corresponding to the situations "there is a packet being sent/received right this microsecond" and "the line is idle". So you have to average.
In the code above, you have chosen one second as the value for both (1) and (2). (1) and (2) being equal is the simplest case to code.
I recommend that you choose a longer period for (1). Averaging over only one second is going to make for a pretty jittery transfer speed on all but the smoothest file transfers. Consider, for example, that Cisco IOS averages over 5 minutes by default and doesn't let you configure less than 30 seconds.
For (2), you can continue to use 1 second, or, if you like, even less than one second.
Choose a value for (1) that is a multiple of the value you choose for (2). Let n be (1) divided by (2). For example, (1) is 10 seconds, (2) is 500 ms, and n = 20.
Create a ring buffer with n entries. Every time (2) elapses, replace the oldest entry in the ring buffer with the number of bytes transferred since the previous time (2) elapsed, then recalculate the transfer speed as the sum of all the entries in the buffer divided by (1).
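A minimal sketch of that ring-buffer moving average (shown in C++ for brevity; the same structure translates directly to C#, and the names and timer hookup are placeholders):
#include <cstdint>
#include <vector>

// Moving-average transfer rate over a window of n samples.
// Call addSample() every time period (2) elapses, passing the bytes
// transferred since the previous call; bytesPerSecond() averages over (1).
class RateMeter {
public:
    RateMeter(int n, double samplePeriodSeconds)
        : samples_(n, 0), period_(samplePeriodSeconds) {}

    void addSample(uint64_t bytesSinceLastSample) {
        samples_[next_] = bytesSinceLastSample;          // overwrite the oldest entry
        next_ = (next_ + 1) % samples_.size();
        if (count_ < samples_.size()) count_++;
    }

    double bytesPerSecond() const {
        uint64_t total = 0;
        for (size_t i = 0; i < count_; i++) total += samples_[i];
        // average over the filled part of the window (equals (1) once the buffer is full)
        return count_ ? total / (count_ * period_) : 0.0;
    }

private:
    std::vector<uint64_t> samples_;
    size_t next_ = 0;
    size_t count_ = 0;   // how many entries are filled (start-up phase)
    double period_;
};

// Usage, assuming (1) = 10 s and (2) = 0.5 s, so n = 20:
//   RateMeter meter(20, 0.5);
//   // every 500 ms: meter.addSample(sentBytes - prevSentBytes); prevSentBytes = sentBytes;
//   // display: meter.bytesPerSecond()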
In the form constructor:
Timer timer1 = new Timer();
public Form1()
{
InitializeComponent();
this.timer1.Enabled = true;
this.timer1.Interval = 1000;
this.timer1.Tick += new System.EventHandler(this.timer1_Tick);
}
Or add the timer from the toolbox and set the values above.
The sum of sent bytes should be accessible so our method can read its value every second:
long sentBytes = 0; //the sent bytes that updated from sending method
long prevSentBytes = 0; //which references to the previous sentByte
double totalSeconds = 0; //seconds counter to show total time .. it increases everytime the timer1 ticks.
private void timer1_Tick(object sender, EventArgs e)
{
long speed = sentBytes - prevSentBytes ; //here's the Transfer-Rate or Speed
prevSentBytes = sentBytes ;
labelSpeed.Text = CnvrtUnit(speed) + "/S"; //display the speed like (100 kb/s) to a label
if (speed > 0) //considering that the speed would be 0 sometimes.. we avoid dividing on 0 exception
{
totalSeconds++; //increasing total-time
labelTime.Text = TimeToText(TimeSpan.FromSeconds((sizeAll - sumAll) / speed));
//displaying time-left in label
labelTotalTime.Text = TimeToText(TimeSpan.FromSeconds(totalSeconds));
//displaying total-time in label
}
}
private string TimeToText(TimeSpan t)
{
return string.Format("{2:D2}:{1:D2}:{0:D2}", t.Seconds, t.Minutes, t.Hours);
}
private string CnvrtUnit(long source)
{
const int byteConversion = 1024;
double bytes = Convert.ToDouble(source);
if (bytes >= Math.Pow(byteConversion, 3)) //GB Range
{
return string.Concat(Math.Round(bytes / Math.Pow(byteConversion, 3), 2), " GB");
}
else if (bytes >= Math.Pow(byteConversion, 2)) //MB Range
{
return string.Concat(Math.Round(bytes / Math.Pow(byteConversion, 2), 2), " MB");
}
else if (bytes >= byteConversion) //KB Range
{
return string.Concat(Math.Round(bytes / byteConversion, 2), " KB");
}
else //Bytes
{
return string.Concat(bytes, " Bytes");
}
}
