I am developing an application for time-proportional control. I know that the output is proportional to the error. For example, if the SetPoint is 20 and the proportional band is 2, then if the actual temperature (AT) is 19, the output would be 50%. However, I cannot figure out what happens if AT is higher than 20. For example, if AT is 21, the output would again be 50%, which does not make any sense.
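For concreteness, one common convention (an assumption here, not something stated above) computes the output from the signed error and clamps it, so a temperature above the band drives the output to 0% rather than back to 50%. A minimal sketch in C++:

#include <algorithm>
#include <cstdio>

// One common convention (an assumption, not taken from the question): the
// proportional band sits just below the SetPoint, and the output is the
// signed error scaled to percent and clamped to [0, 100].
double tpc_output_percent(double setpoint, double actual, double band) {
    double raw = 100.0 * (setpoint - actual) / band;
    return std::clamp(raw, 0.0, 100.0);   // above the band the output clamps to 0%
}

int main() {
    std::printf("AT=19 -> %.0f%%\n", tpc_output_percent(20, 19, 2)); // 50%
    std::printf("AT=21 -> %.0f%%\n", tpc_output_percent(20, 21, 2)); // 0%, not 50%
}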
High-performance malloc implementations often use segregated free lists; that is, each of the more common (smaller) sizes gets its own separate free list.
A first attempt at this could say, below a certain threshold, the size class is just the size divided by 8, rounded up. But actual implementations have more nuance, arranging the recognized size classes on something like an exponential curve (but gentler than simply doubling at each step), e.g. http://jemalloc.net/jemalloc.3.html
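For concreteness, that naive below-threshold mapping would be something like this (a sketch, names are mine):

#include <cstddef>

// Naive mapping: below some threshold, class = ceil(size / 8).
inline unsigned naive_size_class(std::size_t size) {
    return static_cast<unsigned>((size + 7) / 8);
}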
I'm trying to figure out how to convert a size to a size class on some such curve. Now, in principle this is not difficult; there are several ways to do it. But to achieve the desired goal of speeding up the common case, it really needs to be fast, preferably only a few instructions.
What's the fastest way to do this conversion?
In the dark ages, when I used to worry about those sorts of things, I just iterated through all the possible sizes starting at the smallest one.
This actually makes a lot of sense, since allocating memory strongly implies work outside of the actual allocation -- like initializing and using that memory -- that is proportional to the allocation size. In all but the smallest allocations, that overhead will swamp whatever you spend to pick a size class.
Only the small ones really need to be fast.
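A sketch of that straightforward scan (the table values are illustrative, not from the answer):

#include <cstddef>

// Walk a small, sorted table of size classes until one fits; for the small
// sizes that matter, the table stays hot in cache, so this is cheap in practice.
static const std::size_t kSizes[] = {8, 12, 16, 24, 32, 48, 64, 96, 128};

unsigned size_class_by_scan(std::size_t n) {   // assumes n <= kSizes[last]
    unsigned i = 0;
    while (kSizes[i] < n) ++i;
    return i;
}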
Let's assume you want to use all the power-of-two sizes plus the halfway points between them, i.e. 8, 12, 16, 24, 32, 48, 64, etc., up to 4096.
Check that the value is less than or equal to 4096; I have arbitrarily chosen that as the largest allocatable size for this example.
Take the size of your struct, then take the index of the highest set bit times two; add 1 if the next bit down is also set, and add one more if the value is larger than the value those two bits alone would give. That gives you an index into the size list. It should be 5-6 assembly instructions.
So 26 is 16+8+2 (bits 4, 3, and 1): 4*2 = 8, plus 1 because bit 3 is set, plus 1 because 26 > 24, so index 10 is chosen, which is the 32-byte list.
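A sketch of that bit trick in C++ (my naming; __builtin_clzll is a GCC/Clang builtin, so substitute std::countl_zero or _BitScanReverse as appropriate):

#include <cstddef>

// Size list assumed to be 8, 12, 16, 24, 32, 48, ... so that index 2*k holds
// 2^k and index 2*k+1 holds 3*2^(k-1); with a minimum size of 8, the first
// used index is 6, so subtract an offset if you want a dense table.
inline unsigned size_to_class(std::size_t n) {   // requires 8 <= n <= 4096
    unsigned hb = 63u - static_cast<unsigned>(__builtin_clzll(n)); // highest set bit
    std::size_t half = (std::size_t{1} << hb) >> 1;                // the next bit down
    unsigned idx = hb * 2;                  // power-of-two step
    idx += (n & half) ? 1u : 0u;            // past the half step (e.g. 24 within [16, 32))
    idx += (n & (half - 1)) ? 1u : 0u;      // any smaller remainder rounds up one more
    return idx;                             // e.g. size_to_class(26) == 10 -> 32-byte list
}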
Your system might require some minimum allocation size.
Also, if you're doing a lot of allocations, consider using a pool allocator that is private to your program, with the system allocator as backup.
I ran a linear search on an array containing all unique elements in the range [1, 10000], sorted in increasing order, for every search value from 1 to 10000, and plotted the runtime vs. search value graph as follows:
Upon closely analysing the zoomed-in version of the plot:
I found that the runtime for some larger search values is smaller than for some lower search values, and vice versa.
My best guess is that this is related to how the CPU processes data through main memory and the cache, but I don't have a firm, quantifiable explanation.
Any hint would be greatly appreciated.
PS: The code was written in C++ and executed on a Linux virtual machine with 4 vCPUs on Google Cloud. The runtime was measured using the C++ chrono library.
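For reference, the measurement would have looked something like this minimal sketch (not the original code):

#include <chrono>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> a(10000);
    std::iota(a.begin(), a.end(), 1);               // 1..10000, already sorted
    for (int target = 1; target <= 10000; ++target) {
        auto t0 = std::chrono::steady_clock::now();
        volatile int pos = -1;                      // volatile so the search is not optimized away
        for (int i = 0; i < static_cast<int>(a.size()); ++i)
            if (a[i] == target) { pos = i; break; }
        auto t1 = std::chrono::steady_clock::now();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        std::printf("%d %lld\n", target, static_cast<long long>(ns));
    }
}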
CPU cache sizes depend on the CPU model, and there are several cache levels, so your experiment should take those factors into account. An L1 data cache is typically 32 KiB, which is comparable to your array of 10000 elements (about 40 KB if they are 4-byte ints). But I don't think this is cache misses: even a miss that goes all the way to main memory costs on the order of 100 ns, which is much smaller than the gap between the lowest line and the second one, which is about 5 µs. I suspect that second cloud of points comes from context switching: the longer the task, the more likely a context switch occurs during it, which is why the cloud on the right side is thicker.
Now for the zoomed-in figure. Since Linux is not a real-time OS, its time measurement is not entirely reliable; IIRC the minimal reporting unit here is a microsecond. Now, if a certain task takes exactly 15.45 microseconds, the duration reported depends on when it started. If the task started exactly at a clock tick, the reported time would be 15 microseconds. If it started when the internal clock was 0.1 microseconds past a tick, you would get 16 microseconds. What you see on the graph is the analogue straight line approximated onto a discrete-valued axis. So the duration you get is not the actual task duration, but the real value plus the task's starting offset within the microsecond (which is roughly uniformly distributed, ~U[0,1)), rounded to the nearest integer.
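To illustrate that rounding effect (a toy simulation, not a claim about the actual clock implementation), a fixed 15.45 µs task reported with microsecond granularity comes out as 15 or 16 depending only on its starting offset within the microsecond:

#include <cmath>
#include <cstdio>

int main() {
    const double duration_us = 15.45;                  // "true" task duration
    for (int i = 0; i < 10; ++i) {
        double start = i * 0.1;                        // offset within the microsecond
        long reported = std::lround(start + duration_us) - std::lround(start);
        std::printf("start offset %.1f us -> reported %ld us\n", start, reported);
    }
}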
I'm trying to determine the ideal number of cores for my model (in this example I'm using xbeach, but I'm having the same issue when running other models such as SWAN) by doing a speedup test, i.e. running the same code, mpiexec -n <np> mymodel.exe, with <np> = 2, 4, 8, 12, 16, 18, 20, 24, 32, and 36 cores/processes and seeing how long it takes. I ran the test multiple times, each with different results - see image.
To my understanding, MPI tries to distribute all tasks evenly across all processors (so having other applications running in the background might slow things down), but I'm still confused about why I get SUCH a large range of computing times for the same code with the same number of cores/tasks. Why could it be that my model takes 18 minutes to run on 18 cores one time and then 128 minutes the next time I run it?
I'm using MPICH2 1.4.1p1 on Windows 10 on a machine with 2 NUMA nodes of 36 cores each, so any potential background tasks should easily be handled on the remaining cores/nodes. It also doesn't look like I'm running out of memory.
To summarize my questions are:
Is it normal to have such a large range in computation time when using MPI?
If it is - why?
Is there a better way than mine to determine the ideal number of cores for an MPI program?
I am using TensorFlow to build a CNN for an image-classification experiment, and I found the following phenomenon:
operation 1: tf.nn.conv2d(x, [3,3,32,32], strides=[1,1,1,1], padding='SAME')
The shape of x is [128,128,32]; this is a convolution with a 3x3 kernel on x where both the input and output channel counts are 32, so the total number of multiplications is
3*3*32*32*128*128 = 150994944
operation 2: tf.nn.conv2d(x, [3,3,64,64], strides=[1,1,1,1], padding='SAME')
The shape of x is [64,64,64]; this is a convolution with a 3x3 kernel on x where both the input and output channel counts are 64, so the total number of multiplications is
3*3*64*64*64*64 = 150994944
Compared with operation 1, the feature-map size of operation 2 is scaled down to 1/2 and the channel count is doubled. The number of multiplications is the same, so the running time should be the same. But in practice, the running time of operation 1 is longer than that of operation 2.
My measurement method is shown below:
Removing one convolution of operation 1 reduces the training time for one epoch by 23 seconds, which means the running time of operation 1 is 23 seconds.
Removing one convolution of operation 2 reduces the training time for one epoch by 13 seconds, which means the running time of operation 2 is 13 seconds.
The phenomenon reproduces every time.
My GPU is an NVIDIA GTX 980 Ti and the OS is Ubuntu 16.04.
So the question is: why is the running time of operation 1 longer than that of operation 2?
If I had to guess, it has to do with how the image is ordered in memory. Remember that in memory everything is stored in a flattened format. This means that if you have a tensor of shape [128, 128, 32], the 32 features/channels are stored next to each other; then come the elements within a row, then the rows themselves. https://en.wikipedia.org/wiki/Row-major_order
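In other words, for a channels-last H x W x C tensor the flattened offset is roughly this (a sketch, names are mine):

#include <cstddef>

// Row-major, channels-last flattening: the C channels of one pixel are
// adjacent, then the pixels along a row, then the rows themselves.
inline std::size_t flat_index(std::size_t h, std::size_t w, std::size_t c,
                              std::size_t W, std::size_t C) {
    return (h * W + w) * C + c;
}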
Accessing closely packed memory is very important to performance, especially on a GPU, which has a wide memory bus and is optimized for aligned, in-order memory access. In the case with the larger image you have to skip around the image more, and the memory access is more out of order. In case 2 you can do more in-order memory access, which gives you more speed. Multiplications are very fast operations. I bet that, with a convolution, memory access is the bottleneck that limits performance.
chasep255's answer is good and probably correct.
Another possibility (or alternative way of thinking about chasep255's answer) is to consider how caching (all the little hardware tricks that can speed up memory fetches, address mapping, etc) could be producing what you see...
You have basically two things: a stream of X input data and a static filter matrix. In case 1 you have 9*1024 static elements; in case 2 you have 4 times as many. Both cases have the same total multiplication count, but in case 2 the process is finding more of its data where it expects it (i.e. where it was the last time it was asked for). Net result: fewer memory-access stalls, more speed.
Today I found that my OpenGL ES program's frame time sometimes increases for no apparent reason: usually it is 16 ms, but sometimes a frame takes 33 ms to finish. After hours of profiling and research, I found the reason: the frame time increases because eglSwapBuffers takes much longer than usual. Usually the time spent in eglSwapBuffers is less than 10 milliseconds, but sometimes it takes about 26 milliseconds.
The scene is static, so shouldn't the frame time be stable?
Does anybody know the reason? What should I do to make my frame time stable?
There was an answer in a different thread that helped me a lot with this problem.
This kind of behavior is usually caused by a mismatch between the window and surface pixel formats, e.g. 16-bit (RGB565) vs. 32-bit.
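If this is an Android native window (an assumption on my part), one way to avoid that mismatch is to make the window's buffer format match the chosen EGLConfig before creating the surface, roughly:

#include <EGL/egl.h>
#include <android/native_window.h>

// Sketch: query the native pixel format the EGLConfig expects and give the
// window buffers the same format, so the compositor need not convert each frame.
EGLSurface createMatchedSurface(EGLDisplay display, EGLConfig config,
                                ANativeWindow* window) {
    EGLint format = 0;
    eglGetConfigAttrib(display, config, EGL_NATIVE_VISUAL_ID, &format);
    ANativeWindow_setBuffersGeometry(window, 0, 0, format);
    return eglCreateWindowSurface(display, config, window, nullptr);
}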
I am also encountering this problem.
I found that if the window of the EGL surface is resized to be bigger, the time spent in eglSwapBuffers becomes much longer (about 2x the normal state).
In my case it turned out to be MSAA.
Using 4x MSAA caused my eglSwapBuffers() to take 30 milliseconds.
I had to take two lines out of my config and got back to a 2 ms swap:
const EGLint attribs[] = {
    EGL_RENDERABLE_TYPE, EGL_OPENGL_ES2_BIT,
    EGL_SURFACE_TYPE, EGL_WINDOW_BIT,
    EGL_DEPTH_SIZE, 16,
    EGL_BLUE_SIZE, 8,
    EGL_GREEN_SIZE, 8,
    EGL_RED_SIZE, 8,
    // EGL_SAMPLE_BUFFERS, 1,
    // EGL_SAMPLES, 4,
    EGL_NONE
};