Calculate the probability of undetected errors for a whole file (not per packet) with CRCs

If we have a large file, I want to know how this will affect the probability of undetected errors, especially with CRCs.
I know that the undetected error rate (per packet or chunk) = BitRate × BER × 0.5^k, where k is the FCS length of the CRC; for CRC-32, k is 31.
From this equation and the picture below, the packet size does not affect the probability of undetected error for different packet sizes. Suppose we have 1,000,000 packets, each with a 2^(-32) probability of undetected error. How can I calculate the probability of undetected error for the entire 1-petabyte file?

The formula for the mis-detection rate over multiple packets generally involves calculating 1 minus the probability of zero mis-detections. Assuming the per-packet mis-detection rate is 1/(2^32), the mis-detection rate for 1,000,000 packets is 1 - (1 - 1/(2^32))^1000000 ~= 1 - 0.999999999767169356346^1000000 ~= 0.0002328.
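As a sanity check, here is a minimal Python sketch of the same calculation; the per-packet rate and the packet count are simply the figures assumed above, and the n*p approximation is shown because n*p is small:

p = 1.0 / 2**32              # assumed per-packet undetected-error probability
n = 1_000_000                # assumed number of packets
prob_any_undetected = 1.0 - (1.0 - p) ** n
print(prob_any_undetected)   # ~0.0002328
# For small n*p this is well approximated by n*p:
print(n * p)                 # ~0.0002328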

Related

What’s the most compression we can hope for in a file that contains 1000 bits (Huffman algorithm)?

How much can a file containing 1000 bits, where 1 appears with 10% probability and 0 with 90% probability, be compressed with a Huffman code?
Maybe a factor of two.
But only if you do not include the overhead of sending the description of the Huffman code along with the data. For 1000 bits, that overhead will dominate the problem, and determine your maximum compression ratio. I find that for that small of a sample, 125 bytes, general-purpose compressors get it down to only around 100 to 120 bytes, due to the overhead.
A custom Huffman code just for this on bytes from such a stream gives a factor of 2.10, assuming the other side already knows the code. The best you could hope for is the entropy, e.g. with an arithmetic code, which gives 2.13.
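For reference, a small Python sketch of the entropy bound quoted above (the 2.13 figure); the 2.10 byte-level Huffman factor would require actually constructing the code:

import math

p = 0.10                                              # probability of a 1 bit
h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # entropy in bits per source bit
print(h)             # ~0.469
print(1 / h)         # ~2.13, the best possible compression factor
print(1000 * h / 8)  # ~58.6 bytes: entropy limit for 1000 bits, ignoring any overhead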

Unable to reduce the bytes of a MIDI file

I am trying to do some operations on MIDI tracks, such as increasing/decreasing the playback speed.
For those who want more detail: to change the playback speed, I need to divide the 'delta times' of each track by the multiplier. E.g., if I want to speed up the track 2x, I divide the delta times by 2. Delta times are stored as variable-length quantities, so if I divide them, I need to update the track's size by subtracting the bytes saved, to keep the track consistent (shorter delta times need fewer bytes when stored as a variable-length quantity).
In my struct, the track length (size in bytes of the entire track) is stored as a uint32_t. The problem occurs when I try to store the changed track size back. So let's say my original track size was 3200 and, after reducing the delta times, the difference is 240 bytes; then I simply subtract this difference from the original length. However, when I use the 'du' command to check the new file size, the file size inflates heavily, going from around 16 kB to 2000 kB. I don't understand why.
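For context, MIDI delta times use the standard variable-length-quantity encoding (7 data bits per byte, high bit set on every byte except the last). A minimal Python sketch, independent of the asker's C structs, shows why halving a delta time can shrink its encoded size:

def encode_vlq(value):
    # MIDI variable-length quantity: 7 bits per byte, MSB set on all but the last byte.
    out = [value & 0x7F]
    value >>= 7
    while value:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(out))

print(len(encode_vlq(200)))        # 2 bytes
print(len(encode_vlq(200 // 2)))   # 1 byte: the track length must shrink accordingly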

Desired Compute-To-Memory-Ratio (OP/B) on GPU

I am trying to understand the architecture of GPUs and how we assess the performance of our programs on the GPU. I know that an application can be:
Compute-bound: performance limited by the FLOPS rate. The processor’s cores are fully utilized (always have work to do)
Memory-bound: performance limited by the memory bandwidth. The processor’s cores are frequently idle because memory cannot supply data fast enough
The image below shows the FLOPS rate, peak memory bandwidth, and the Desired Compute to memory ratio, labeled by (OP/B), for each microarchitecture.
I also have an example of how to compute this OP/B metric. Example: below is part of a CUDA kernel for matrix-matrix multiplication
for (unsigned int i = 0; i < N; ++i) {
    sum += A[row*N + i]*B[i*N + col];
}
and the way to calculate OP/B for this matrix-matrix multiplication is as follows:
Matrix multiplication performs 0.25 OP/B
1 FP add and 1 FP mul for every 2 FP values (8B) loaded
Ignoring stores
and if we want to utilize this:
But matrix multiplication has high potential for reuse. For NxN matrices:
Data loaded: (2 input matrices)×(N^2 values)×(4 B) = 8N^2 B
Operations: (N^2 dot products)×(N adds + N muls each) = 2N^3 OP
Potential compute-to-memory ratio: 0.25N OP/B
So if I understand this correctly, I have the following questions:
Is it always the case that the greater the OP/B, the better?
How do we know how many FP operations we have? Is it just the adds and the multiplications?
How do we know how many bytes are loaded per FP operation?
Is it always the case that the greater the OP/B, the better?
Not always. The target value balances the load on the compute pipe and the memory pipe (i.e. at that op/byte level both pipes are fully loaded). As you increase op/byte beyond that level, your code switches from balanced to compute-bound. Once your code is compute-bound, performance is dictated by the compute pipe, which is the limiting factor. Additional op/byte increases beyond this point may have no effect on performance.
How do we know how many FP operations we have? Is it just the adds and the multiplications?
Yes, for the simple code you have shown, it is the adds and multiplies. More complicated codes may have other contributing factors (e.g. sin, cos, etc.).
As an alternative to manually counting the FP operations, GPU profilers can report the number of FP ops a code has executed.
How do we know how many bytes are loaded per FP operation?
Similar to the previous question, for simple codes you can count manually; for complex codes you may wish to use profiler capabilities to estimate. For the code you have shown:
sum += A[row*N + i]*B[i*N + col];
The values from A and B have to be loaded. If they are float quantities, they are 4 bytes each, for a total of 8 bytes. That line of code requires one floating-point multiplication (A * B) and one floating-point add (sum +=). The compiler will fuse these into a single fused multiply-add instruction, but the net effect is that you perform two floating-point operations per 8 bytes loaded: op/byte = 2/8 = 1/4. The loop does not change the ratio in this case. To increase this number, you would want to explore various optimization methods, such as a tiled shared-memory matrix multiply, or just use cuBLAS.
(Operations like row*N + i are integer arithmetic and don't contribute to the floating-point count, although it's possible they are significant, performance-wise.)
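To make the counting concrete, here is a minimal Python sketch of the arithmetic above; the matrix size N is an arbitrary example value:

N = 1024                       # example matrix dimension (assumed)

# Per-element view: 2 FP ops (one mul + one add) per 8 bytes loaded.
print(2 / 8)                   # 0.25 OP/B for the naive kernel

# Whole-problem view with full reuse: 2*N^3 FP ops vs. 8*N^2 bytes of input data.
ops = 2 * N**3                 # N^2 dot products, N adds + N muls each
bytes_loaded = 2 * N**2 * 4    # two float32 input matrices
print(ops / bytes_loaded)      # 0.25*N = 256.0 for N = 1024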

Extending Goertzel algorithm to 24 kHz, 32 kHz and 48 kHz in python

I'm learning to implement Goertzel's algorithm to detect DTMF tones from recorded wave files. I got one implemented in python from here. It supports audio sampled at 8 kHz and 16 kHz. I would like to extend it to support audio files sampled at 24 kHz, 32 kHz and 48 kHz.
From the code I got from the link above, I see that the author has set the following precondition parameters/constants:
self.MAX_BINS = 8
if pfreq == 16000:
    self.GOERTZEL_N = 210
    self.SAMPLING_RATE = 16000
else:
    self.GOERTZEL_N = 92
    self.SAMPLING_RATE = 8000
According to this article, before one can do the actual Goertzel, two of the preliminary calculations are:
Decide on the sampling rate.
Choose the block size, N
So, the author has clearly set the block size to 210 for 16 kHz inputs and 92 for 8 kHz inputs. Now, I would like to understand:
how the author arrived at these block sizes?
what the block size would be for 24 kHz, 32 kHz and 48 kHz samples?
The block size determines the frequency resolution/selectivity and the time it takes to gather a block of samples.
The bandwidth of your detector is about Fs/N, and of course the time it takes to gather a block is N/Fs.
For equivalent performance, you should keep the ratio between Fs and N roughly the same, so that both of those measurements remain unchanged.
It is also important, though, to adjust your block size to be as close as possible to a multiple of the wavelengths (periods) of the tones you want to detect. The Goertzel algorithm is basically a quick way to calculate a few selected DFT bins, and this adjustment puts the frequencies you want to see near the center of those bins.
Optimizing the block size according to this last point is probably why Fs/N is not exactly the same in the code you have for the 8 kHz and 16 kHz sampling rates.
You could redo this optimization for the other sampling rates you want to support, but really, performance will be equivalent to what you already have if you just use N = 210 * Fs / 16000.
You can find a detailed description of the block size choice here: http://www.telfor.rs/telfor2006/Radovi/10_S_18.pdf
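A minimal Python sketch of that scaling rule, using the suggested N = 210 * Fs / 16000 (the rounding is my own assumption, and you may still want to nudge each N toward a multiple of the DTMF tone periods, as described above):

def goertzel_block_size(fs, ref_fs=16000, ref_n=210):
    # Keep Fs/N (detector bandwidth) and N/Fs (block duration) roughly constant.
    return round(ref_n * fs / ref_fs)

for fs in (8000, 16000, 24000, 32000, 48000):
    print(fs, goertzel_block_size(fs))
# 8000 105   (the reference code uses 92, tuned toward the DTMF periods)
# 16000 210
# 24000 315
# 32000 420
# 48000 630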

Why is the execution time of tf.nn.conv2d different when the multiply counts are the same?

I am using TensorFlow to build a CNN for an image-classification experiment, and I found the following phenomenon:
operation 1: tf.nn.conv2d(x, [3,3,32,32], strides=[1,1,1,1], padding='SAME')
The shape of x is [128,128,32], meaning a convolution with a 3x3 kernel on x, where both the input and output channel counts are 32. The total number of multiplies is
3*3*32*32*128*128 = 150,994,944
operation 2: tf.nn.conv2d(x, [3,3,64,64], strides=[1,1,1,1], padding='SAME')
The shape of x is [64,64,64], meaning a convolution with a 3x3 kernel on x, where both the input and output channel counts are 64. The total number of multiplies is
3*3*64*64*64*64 = 150,994,944
Compared with operation 1, the feature-map size of operation 2 is halved and the channel count is doubled. The multiply counts are the same, so the running times should be the same, but in practice the running time of operation 1 is longer than that of operation 2.
My measurement method is shown below:
Eliminating a convolution of operation 1 reduced the training time for one epoch by 23 seconds, so the running time of operation 1 is 23 seconds.
Eliminating a convolution of operation 2 reduced the training time for one epoch by 13 seconds, so the running time of operation 2 is 13 seconds.
The phenomenon is reproducible every time.
My GPU is an NVIDIA GTX 980 Ti; the OS is Ubuntu 16.04.
So the question is: why is the running time of operation 1 longer than that of operation 2?
If I had to guess, it has to do with how the image is ordered in memory. Remember that in memory everything is stored in a flattened format. This means that if you have a tensor of shape [128, 128, 32], the 32 features/channels are stored next to each other, then all of the rows, then all of the columns. https://en.wikipedia.org/wiki/Row-major_order
Accessing closely packed memory is very important to performance, especially on a GPU, which has a large memory bus and is optimized for aligned, in-order memory access. In the case with the larger image you have to skip around the image more, and the memory access is more out of order. In case 2 you can do more in-order memory access, which gives you more speed. Multiplications are very fast operations; I bet that with a convolution, memory access is the bottleneck that limits performance.
chasep255's answer is good and probably correct.
Another possibility (or an alternative way of thinking about chasep255's answer) is to consider how caching (all the little hardware tricks that speed up memory fetches, address mapping, etc.) could be producing what you see.
You basically have two things: a stream of X input data and a static filter matrix. In case 1 you have 9*1024 static elements; in case 2 you have 4 times as many. Both cases have the same total multiplication count, but in case 2 the process finds more of its data where it expects it (i.e. where it was the last time it was asked for). Net result: fewer memory access stalls, more speed.
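If you want to isolate the two ops from the rest of the training loop, a minimal timing sketch along these lines might help; it uses the TensorFlow 2 eager/tf.function API and made-up random tensors, so only the shapes come from the question:

import time
import tensorflow as tf

def time_conv(x_shape, w_shape, iters=50):
    # Random data with the shapes from the question; a batch dimension is added for conv2d.
    x = tf.random.normal([1, *x_shape])
    w = tf.random.normal(w_shape)
    conv = tf.function(lambda: tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME'))
    conv()                           # warm-up: builds the graph and selects a convolution algorithm
    start = time.perf_counter()
    for _ in range(iters):
        conv().numpy()               # .numpy() forces the GPU work to finish before timing
    return (time.perf_counter() - start) / iters

print("op1:", time_conv([128, 128, 32], [3, 3, 32, 32]))
print("op2:", time_conv([64, 64, 64], [3, 3, 64, 64]))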

Resources