CUDA / OpenCL cache coherence, locality and space-filling curves - caching

I'm working on a CUDA app that makes use of all available RAM on the card, and am trying to figure out different ways to reduce cache misses.
The problem domain consists of a large 2- or 3-D grid, depending on the type of problem being solved. (For those interested, it's an FDTD simulator). Each element depends on either two or four elements in "parallel" arrays (that is, another array of nearly identical dimensions), so the kernels must access either three or six different arrays.
The Problem
*Hopefully this isn't "too localized". Feel free to edit the question
The relationship between the three arrays can be visualized as follows (apologies for the mediocre ASCII art):
A[0,0]-C[0,0]--A ---- C ---- A ---- C ---- A
|             |             |             |
|             |             |             |
B[0,0]        B             B             B
|             |             |             |
|             |             |             |
A ---- C ---- A ---- C ---- A ---- C ---- A
|             |             |             |
|             |             |             |
B             B             B             B
|             |             |             |
|             |             |             |
A ---- C ---- A ---- C ---- A ---- C ---- A
|             |             |             |
|             |             |             |
B             B             B             B[3,2]
|             |             |             |
|             |             |             |
A ---- C ---- A ---- C ---- A ---- C ---- A[3,3]
                                 [2,3]
Items connected by lines are coupled. As can be seen above, A[] depends on both B[] and C[], while B[] depends only on A[], as does C[]. All of A[] is updated in the first kernel, and all of B[] and C[] are updated in a second pass.
If I declare these arrays as simple 2D arrays, I wind up with strided memory access. For a very large domain size (the grid above, at roughly 3x3, is tiny by comparison), this causes occupancy and performance deficiencies.
So, I thought about rearranging the array layout along a Z-order curve.
Also, it would be fairly trivial to interleave these into one array, which should improve fetch performance since (depending on the interleave order) at least half of the elements required for a given cell update would be close to one another. However, it's not clear to me whether the GPU uses multiple data pointers when accessing multiple arrays. If it does, this imagined benefit could actually be a hindrance.
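For concreteness, the Z-order (Morton) index of a 2-D cell is obtained by interleaving the bits of its coordinates. The following plain-C sketch (the interleaved struct and all names are illustrative only, not from the actual simulator) shows the idea:

#include <stdint.h>

/* Spread the lower 16 bits of v out so a zero bit sits between each of them. */
static uint32_t part1by1(uint32_t v)
{
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Z-order (Morton) index of cell (x, y): interleave the coordinate bits. */
static uint32_t morton2d(uint32_t x, uint32_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}

/* One possible interleaving of the three fields into a single array: */
typedef struct { float a, b, c; } cell_t;

/* cells[morton2d(x, y)].a, .b, .c -- cells that are neighbours on the grid
 * are usually (though not always) neighbours in memory as well. */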
The Questions
I've read that NVIDIA does this automatically behind the scenes when using texture memory or a cudaArray. If this is not the case, should I expect the increased latency when crossing large spans (when the Z curve goes from the upper right to the bottom left at a high subdivision level) to eliminate the benefit of the locality in smaller grids?
Dividing the grid into smaller blocks that can fit in shared memory should certainly help, and the Z order makes this fairly trivial. Should I have a separate kernel pass that updates boundaries between blocks? Will the overhead of launching another kernel be significant compared to the savings I expect?
Is there any real benefit to using a 2D vs 1D array? I expect memory to be linear, but am unsure if there is any real meaning to the 2D memory layout metaphor that's often used in CUDA literature.
Wow - long question. Thanks for reading and answering any/all of this.

Just to get this off of the unanswered list:
After a lot of benchmarking and playing with different arrangements, the fastest approach I found was to keep the arrays interleaved in Z-order, so that most of the values required by a thread were located near each other in RAM. This improved cache behavior (and thus performance). Obviously there are many cases where the Z order fails to keep required values close together. I wonder whether rotating quadrants to reduce the "distance" between the end of one Z and the start of the next would help, but I haven't tried it.
Thanks to everyone for the advice.

Related

Implementing a standard stack in a page-writable sector-erasable flash memory

I would like to implement a simple stack in my microcontroller firmware.
The stack I would like to implement is like this, i.e. something standard.
The problem is that the flash memory IC I am using supports page-granularity writes but only sector-granularity erases, and, just like any other NAND flash, a region must be erased before new data can be written to it.
So, as a summary: I erase a sector and write some data to its first page. To rewrite even one byte of that page, I have to erase the whole sector first.
In my stack,
| O | O | O | O | O | O |
I push some data:
| W | W | W | O | O | O |
and then pop one of them:
| W | W | P | O | O | O |
Now I would like to push another item. Here the problem appears. If, just like a standard stack, I decide to write over the previous data and just change the index, I must erase the whole sector first! Therefore I have to solve this problem programmatically.
Any ideas?
P.S: The flash memory map is: 16 Blocks -> Each Block = 16 Sectors -> Each Sector = 16 Pages -> Each Page = 256 Bytes
I have a write command at page level with an offset, and erase commands at sector and block level.
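For reference, the geometry above works out as follows (illustrative C constants derived from the stated numbers):

#define PAGE_SIZE         256u   /* bytes */
#define PAGES_PER_SECTOR  16u
#define SECTORS_PER_BLOCK 16u
#define BLOCK_COUNT       16u

#define SECTOR_SIZE (PAGES_PER_SECTOR * PAGE_SIZE)      /*   4 KiB */
#define BLOCK_SIZE  (SECTORS_PER_BLOCK * SECTOR_SIZE)   /*  64 KiB */
#define TOTAL_SIZE  (BLOCK_COUNT * BLOCK_SIZE)          /*   1 MiB */

So a write can target a single 256-byte page (with an offset), but the smallest unit that can be erased is a 4 KiB sector.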
Well, if you don't want to erase and rewrite a whole sector with every push, you'll have to tolerate some old data in your stack:
| W | W | P | W | O | O |
You will need to reserve at least one bit in each item that allows you to distinguish between the valid and invalid ones. Assuming that a sector-erase fills the memory with 1s, leave the valid bit at 1 in all valid items that you write. You will then be able to change it to a 0 by writing just one page when you pop the item off, marking it as invalid.
Since you can't reuse item slots in this scheme, your stack will grow continuously toward the end of memory. That is actually what you want, since that NAND flash can only be written so many times before it dies, and this scheme will spread the writes out somewhat. When you get to the end of space, you can erase and rewrite the whole thing to remove all the gaps.
It may happen that you end up with a very long sequence of invalid items, so popping past them could involve scanning all the intermediate ones. You can fix this problem by reserving more than one bit in each item. When you write a valid item, you use these bits to store the number of invalid items that precede it, plus 1. This lets you skip back quickly and pop in constant time.
Then, when you want to mark an item as invalid, you can change the bits reserved for this count to all 0s, again by writing just one page. This zero count could not appear in a valid item, so it serves as an invalid mark.
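Putting the valid bit and the skip count together, a C sketch could look like this (flash_write and flash_read are hypothetical driver calls; it assumes the part allows programming within an already-written page as long as bits only go from 1 to 0, and it keeps the two stack indices in RAM):

#include <stdint.h>
#include <string.h>

/* Sketch only: one item per 16-byte slot; the first byte is the status/skip
 * field.  0xFF = erased (never written), 0 = invalid (popped),
 * n > 0 = valid, where n = number of invalid items directly below it + 1. */
#define ITEM_SIZE  16u
#define STACK_BASE 0x0000u

typedef struct {
    uint8_t skip;
    uint8_t data[ITEM_SIZE - 1];
} item_t;

/* Hypothetical driver calls: a page write at an arbitrary offset, and a read. */
extern void flash_write(uint32_t addr, const void *buf, uint32_t len);
extern void flash_read (uint32_t addr, void *buf, uint32_t len);

static int32_t  top_valid = -1;  /* index of the topmost valid item, -1 = empty */
static uint32_t next_free = 0;   /* next never-written slot                      */

void stack_push(const uint8_t payload[ITEM_SIZE - 1])
{
    item_t it;
    it.skip = (uint8_t)((int32_t)next_free - top_valid);  /* invalid run below + 1 */
    memcpy(it.data, payload, sizeof it.data);
    flash_write(STACK_BASE + next_free * ITEM_SIZE, &it, sizeof it);
    top_valid = (int32_t)next_free;
    next_free++;
}

int stack_pop(uint8_t payload_out[ITEM_SIZE - 1])
{
    if (top_valid < 0)
        return -1;                                   /* stack is empty */

    item_t it;
    flash_read(STACK_BASE + (uint32_t)top_valid * ITEM_SIZE, &it, sizeof it);
    memcpy(payload_out, it.data, sizeof it.data);

    /* Invalidate the item by programming its skip byte to 0: only 1 -> 0
     * bit transitions, so no sector erase is needed. */
    uint8_t zero = 0;
    flash_write(STACK_BASE + (uint32_t)top_valid * ITEM_SIZE, &zero, 1);

    top_valid -= it.skip;        /* jump over the whole invalid run in one step */
    return 0;
}

Slot reuse is still impossible, so when next_free reaches the end of the reserved area you erase the region and rewrite the surviving items, as described above. An 8-bit skip field can only describe an invalid run of up to 254 items; a real implementation would widen it or fall back to scanning.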

How to use Amdahl's Law (overall speedup vs speedup)

Recall Amdahl’s law on estimating the best possible speedup. Answer the following questions.
You have a program that has 40% of its code parallelized on three processors, and just for this fraction of code, a speedup of 2.3 is achieved. What is the overall speedup?
I'm having trouble understanding the difference between speedup and overall speedup in this question. I know there must be a difference by the way this question is worded.
Q : What is the overall speedup?
It is best to start not with the original, trivial Amdahl's law formula, but with a more contemporary view that extends the original, in which add-on overhead costs are discussed and the atomicity of split work is also explained.
Two sections, one accelerated by a "local" speedup, one overall result
Your original problem formulation seems to bypass the real-world process-orchestration overheads explained there by simply postulating a (net-local) speedup. The add-on overhead costs of implementing the <PAR>-able section under review are "hidden": they show up only as a kind of inefficiency, in that three times more code-execution resources yield just a 2.3x speedup rather than 3.0x. In other words, more than the theoretical 1/3 of the time is actually spent on the initial setup (an add-on overhead not present in pure-[SERIAL] code execution), plus the parallel processing itself (the useful work, now on triple the code-execution capacity), plus the termination and collection of results back into the "main" code (again add-on overheads not present in pure-[SERIAL] code execution).
"Hiding" these natural costs of going into and out of [PARALLEL] code-execution sections simplifies the homework, yet a proper understanding of the real-life costs is crucial, so that one does not spend far more (on setup and all the other add-on overheads that are unavoidable in the real world) than one ever receives back (from the hoped-for speedup of splitting the work across many processors).
|-------> time
|START:
| |DONE: 100% of the code
| | |
|______________________________________<SEQ>______60%_|_40%__________________<PAR>-able__|
o--------------------------------------<SEQ>----------o----------------------<PAR>-able--o CPU_x runs both <SEQ> and <PAR>-able sections of code, in a pure [SERIAL] process-flow orchestration, one after another
| |
| |
|-------> time
|START: |
| | |DONE: 100% of the code :
o--------------------------------------<SEQ>----------o | :
| o---------o .. .. .. .. ..CPU_1 runs <PAR>'d code
| o---------o .. .. .. .. ..CPU_2 runs <PAR>'d code
| o---------o .. .. .. .. ..CPU_3 runs <PAR>'d code
| | |
| | |
| <_not_1/3_> just ~ 2.3x faster (not 3x) perhaps reflects real-costs (penalisations) of new, add-on, process-organisation related setup + termination overheads
|______________________________________<SEQ>______60%_|_________|~ 40% / 2.3x ~ 17.39% i.e. the <PAR>-section has gained a local ( "net"-section ) speedup of 2.3x instead of 3.0x, achievable on 3-CPU-code-execution streams
| | |
Net overall speedup ( if no other process-organisation related add-on overhead costs were accrued )
is:
( 60% + ( 40% / 1.0 ) )
---------------------------- ~ 1.2921 x
( 60% + ( 40% / 2.3 ) )
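In standard notation, with parallel fraction p = 0.40 and a local speedup s = 2.3 achieved on that fraction, this is just Amdahl's law evaluated once:

overall speedup = 1 / ( (1 - p) + p / s )
                = 1 / ( 0.60 + 0.40 / 2.3 )
                ~ 1 / 0.7739
                ~ 1.2921 x

For comparison, a perfect s = 3.0 on the three processors would have given 1 / ( 0.60 + 0.40 / 3.0 ) ~ 1.3636 x, which shows what the merely-2.3x "local" speedup already costs at the overall level.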

Chronological sequence of events with a running commentary *ahead* of the event

I apologise in advance if this is somewhat vague, but I'm lacking even the basic idea of how to approach this - including even knowing whether there is a proper term to search for.
I am trying to code a table-driven system for animating chronological events, where descriptive commentary is also pulled out of the table, but that commentary is issued ahead of the actual event animation - so not like subtitles, which appear synchronously with the event on screen.
To explain further:
Given a table with
Seq | Event | Start | End | Pace
1 | Walk up to A | | A | Walk
2 | Stand at A | A | | Stand
3 | Walk from A to B | A | B | Walk
4 | Run from B to C | B | C | Run
5 | Stand at C, turn left | C | | Stand
6 | Turn left at C and walk to D | C | D | Walk
7 | At D quickly spin around your own axis | D | | Stand
8 | Run back to A | D | A | Run
And so on. The speed of the movements and duration of standing in place or spinning around your axis are set by variable properties - some people are faster than others, after all.
The point is that the commentary - essentially the text in the Event column - will be read before the actual movement.
Creating a sequence like this
Seq | Commentary | Movement
0 | See the toon walk up to A and stop there | animation of movement to A from random point
1 | They now stand and will move from A to B | toon stands in place
2 | at B they will run to C | toon is walking towards B
3 | at C they will turn left | toon is running towards C
4 | then they walk from C to D | toon is still running towards C
5 | at D they spin around their axis | toon is walking towards D
6 | after spin they will run back to A | toon does the spin at D,
| | based on timing they are running back to D
I've tried to include some edge cases here. For example, where an event is so short that the duration of the commentary would be longer than the duration of the animation, forcing an immediate back-to-back reading of two commentaries ahead of the movement.
Thinking of systems like video and audio editors, with their multi-track approach, I thought I could do something similar. I would "pre-compile" the sequence table into a timeline and then create the commentary track by time-shifting it "to the left" (i.e. backwards in time, as it were). The playback would then present the two streams.
But I utterly and totally lack the knowledge to do that and am not even sure what to look into, so I can learn. Is this some form of state machine? An event loop?
The system will ultimately be developed in C#/.NET. Xamarin to be precise, to allow it to run on both Android and iOS. The animation code actually exists, to the extent of having the toon walk, run, and stand between points on a grid (where the points boil down to coordinates). Audio commentary, as in reading the event text aloud, is also straightforward nowadays (there are even device-independent PCLs).
It's the timing and synchronisation that I am utterly lost with! Creating that time stream between offset animation and audio out of the information in the table (and properties). I looked into things like state machines and graph theory, but frankly a lot of that went over my head.
What should I even research here? I'd be more than happy with an answer of "what you're trying to do is called XXX, google it for the algorithms" and "read THIS to find out more about doing XXX and YYY" and "THIS describes the algorithms (even if in another language or not language at all) to do ZZZ".
Let's assume you can calculate the lengths of the animations and the lengths of the commentaries ahead of time.
action | length in sec
a1 | 5
a2 | 2
a3 | 3
commentary | length in sec
c1 | 2
c2 | 10
c3 | 5
Create a schedule for the actions first:
a-schedule | start | end
a1 | 0 | 5
a2 | 5 | 7
a3 | 7 | 10
Now create a schedule for the commentaries by adjusting the end time of each commentary to the start time of the appropriate action:
c-schedule | start | end
c1 | -2 | 0
c2 | -5 | 5
c3 | 2 | 7
Now iterate backwards through this last table, starting from the second-to-last element, and do the following:
for (int x = c.length - 2; x >= 0; x--) {
    if (c[x].end > c[x+1].start) {
        // shift the earlier commentary left by the amount of overlap
        int shift = c[x].end - c[x+1].start;
        c[x].end   -= shift;
        c[x].start -= shift;
    }
}
This results in the following table:
c-schedule | start | end
c1 | -10 | -8
c2 | -8 | 2
c3 | 2 | 7
Each commentary will finish before its corresponding action; however - if most of the commentaries last longer than the actions - they can happen way before it. But if there are several actions which are longer than their commentaries, then the commentary can gain on the action there.
The table shows that you have to start playing the first commentary 10 seconds before start playing the first action. You can shift both of these values if you want a zero-based schedule.
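A minimal, self-contained sketch of the whole procedure in C (the question targets C#/.NET, so treat this purely as an illustration of the algorithm; names and the hard-coded lengths come from the example tables above):

#include <stdio.h>

#define N 3

typedef struct { double start, end; } interval;

int main(void) {
    /* Lengths from the example tables above. */
    double a_len[N] = { 5, 2, 3 };   /* animation lengths  */
    double c_len[N] = { 2, 10, 5 };  /* commentary lengths */

    interval a[N], c[N];

    /* 1. Schedule the actions back to back. */
    double t = 0;
    for (int i = 0; i < N; i++) {
        a[i].start = t;
        a[i].end   = t + a_len[i];
        t = a[i].end;
    }

    /* 2. Each commentary initially ends where its action starts. */
    for (int i = 0; i < N; i++) {
        c[i].end   = a[i].start;
        c[i].start = c[i].end - c_len[i];
    }

    /* 3. Walk backwards and shift earlier commentaries left so they
     *    never overlap the next one. */
    for (int x = N - 2; x >= 0; x--) {
        if (c[x].end > c[x + 1].start) {
            double shift = c[x].end - c[x + 1].start;
            c[x].end   -= shift;
            c[x].start -= shift;
        }
    }

    for (int i = 0; i < N; i++)
        printf("c%d: %6.1f .. %6.1f   a%d: %5.1f .. %5.1f\n",
               i + 1, c[i].start, c[i].end, i + 1, a[i].start, a[i].end);
    return 0;
}

Running it reproduces the c-schedule above: c1 from -10 to -8, c2 from -8 to 2, c3 from 2 to 7.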

Why doesn't Scheme have primitive C data types like int, float, etc.?

Also, how does it allocate memory from the memory pool? How many bytes for symbols and numbers, and how does it handle type casting, since it doesn't have int and float types for conversions?
I really tried researching this on the internet. I'm sorry I have to ask here, because I found nothing.
Like other dynamically typed languages, Scheme does have types, but they're associated with values instead of with variables. This means you can assign a boolean to a variable at one point and a number at another point in time.
Scheme doesn't use C types, because a Scheme implementation isn't necessarily tied to C at all: several compilers emit native code, without going through C. And like the other answers mention, Scheme (and Lisp before it) tries to free the programmer from having to deal with such (usually) unimportant details as the target machine's register size.
Numeric types specifically are pretty sophisticated in Lisp variants. Scheme has the so-called numeric tower that abstracts away details of representation. Much like many "newer" languages such as Go, Python, and Ruby, Scheme will represent small integers (called "fixnums") in a machine register or word in memory. This means it'll be fast like in C, but it will automatically switch to a different representation once the integer exceeds that size, so that arbitrarily large numbers can be represented without needing any special provisioning.
The other answers have already shown you the implementation details of some Schemes. I've recently blogged about CHICKEN Scheme's internal data representation. The post contains links to data representation of several other Schemes, and at the end you'll find further references to data representation in Python, Ruby, Perl and older Lisp variants.
The beauty of Lisp and Scheme is that these are such old languages, but they still contain "new ideas" that only now get added to other languages. Garbage collection pretty much had to be invented for Lisp to work, it supported a numeric tower for a long time, object orientation was added to it at a pretty early date, anonymous procedures were in there from the beginning I think, and closures were introduced by Scheme when its authors proved that lambda can be implemented as efficiently as goto.
All of this was invented between the 1950s and the 1980s. Meanwhile, it took a long long time before even garbage collection became accepted in the mainstream (basically with Java, so about 45 years), and general support for closures/anonymous procedures has become popular only in the last 5 years or so. Even tail call optimization isn't implemented in most languages; JavaScript programmers are only now discovering it. And how many "modern" languages still require the programmer to handle arbitrarily large integers using a separate set of operators and as a special type?
Note that a lot of these ideas (including the numeric type conversion you asked about) introduce additional overhead, but the overhead can be reduced by clever implementation techniques. And in the end most are a net win because they can improve programmer productivity. And if you need C or assembly performance in selected parts of your code, most implementations allow you to drop down to the metal through various tricks, so this is not closed off to you. The disadvantage would be that it isn't standardized (though there is cffi for Common Lisp), but like I said, Scheme isn't tied to C so it would be very rude if the spec enforced a C foreign function interface onto non-C implementations.
The answer to this question is implementation dependent.
Here is how it was done in the Scheme compiler workshop.
The compiler generated machine code for a 32-bit Sparc machine.
See http://www.cs.indiana.edu/eip/compile/back.html
Data Formats
All of our data are represented by 32-bit words, with the lower three bits as a kind of type-tag. While this would normally only allow us eight types, we cheat a little bit: Booleans, empty-lists and characters can be represented in (much) less than 32 bits, so we steal a few of their data bits for an ``extended'' type tag.
Numbers:
--------------------------------------
| 29-bit 2's complement integer 000 |
--------------------------------------
Booleans:
------------------- -------------------
#t: | ... 1 00000 001 | #f: | ... 0 00000 001 |
------------------- -------------------
Empty lists:
-----------------
| ... 00001 001 |
-----------------
Characters:
---------------------------------------
| ... 8-bit character data 00010 001 |
---------------------------------------
Pairs, strings, symbols, vectors and closures maintain a 3-bit type tag, but devote the rest of their 32 bits to an address into the heap where the actual value is stored:
Pairs:
--------------- -------------
| address 010 | --> | car | cdr |
-----\--------- / -------------
-----------
Strings:
--------------- -------------------------------------------------
| address 011 | --> | length | string data (may span many words)... |
-----\--------- / -------------------------------------------------
-----------
Symbols:
--------------- --------------------------
| address 100 | --> | symbol name (a string) |
-----\--------- / --------------------------
-----------
Vectors:
---------------
| address 101 |
-----|---------
v
-----------------------------------------------------------
| length | (v-ref 0) | (v-ref 1) | ... | (v-ref length-1) |
-----------------------------------------------------------
Closures:
---------------
| address 110 |
-----|---------
v
-----------------------------------------------------------------------
| length | code pointer | (free 0) | (free 1) | ... | (free length-1) |
-----------------------------------------------------------------------
The short answer is that it has primitive data types, but you as a programmer don't need to worry about it.
The designer of Lisp was from a math background and didn't take the limitations of a specific platform as inspiration. In math a number isn't 32 bits, but we do differentiate between exact numbers and inexact ones.
Scheme was originally interpreted in MacLisp and inherited the types and primitives of MacLisp. MacLisp is based on Lisp 1.5.
A variable doesn't have a type, and most implementations use a machine pointer as its underlying data type. Primitives like chars, symbols, and small integers are stored right in the address by manipulating the least significant bits as a type flag, which would always be zero for an actual object, since the machine aligns objects in memory to register width.
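To make the tag-bits idea concrete, here is a small C sketch of one common representation (a sketch only, not the layout of any particular Scheme implementation; compare the Sparc word formats shown above):

#include <stdint.h>
#include <assert.h>

/* Heap objects are aligned to the machine word, so a real pointer always
 * has zeroes in its low bits; a small integer ("fixnum") borrows those
 * bits as a tag and keeps its value in the rest of the word. */
typedef uintptr_t value;

#define TAG_BITS   1
#define FIXNUM_TAG 1u     /* ...xxxxx1 = immediate small integer  */
                          /* ...xxxxx0 = pointer to a heap object */

static value    make_fixnum(intptr_t n) { return ((uintptr_t)n << TAG_BITS) | FIXNUM_TAG; }
static intptr_t fixnum_value(value v)   { return (intptr_t)v >> TAG_BITS; }  /* sign-extending shift: implementation-defined for negatives */
static int      is_fixnum(value v)      { return (v & FIXNUM_TAG) != 0; }

int main(void)
{
    value v = make_fixnum(42);
    assert(is_fixnum(v));
    assert(fixnum_value(v) == 42);  /* the 42 lives in the word itself; no heap allocation */
    return 0;
}

When a result no longer fits in the fixnum range, the implementation switches over to a heap-allocated representation behind the same value type, which is what the Common Lisp type-of example below illustrates.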
If you add two integers and the result becomes bigger than what fits in that representation, the result is of a different type. In C it would overflow.
;; This is Common Lisp, but the same happens in Scheme
(type-of 1) ; ==> BIT
(type-of 10) ; ==> (INTEGER 0 281474976710655)
(type-of 10000000000000000) ; ==> (INTEGER (281474976710655))
The types of the objects are different even though we treat them the same. The first two don't use any space beyond the pointer itself, but the last is a pointer to an actual object that is allocated on the heap.
All of this is implementation dependent. The Scheme standard does not dictate how it's done, but many implementations do it just like this. You can read the standard and it says nothing about how to model numbers, only the behavior. You could make an R6RS Scheme that stores everything in byte arrays.

How can MARS produce weird constants in terms?

I've been reading about an interesting machine learning algorithm, MARS (multivariate adaptive regression splines).
As far as I understand the algorithm, from Wikipedia and Friedman's papers, it works in two stages, a forward pass and a backward pass. I'll ignore the backward pass for now, since the forward pass is the part I'm interested in. The steps for the forward pass, as far as I can tell, are:
1. Start with just the mean of the data.
2. Generate a new term pair through exhaustive search.
3. Repeat step 2 while improvements are being made.
And to generate a term pair, MARS appears to do the following (a small illustrative sketch follows the list):
1. Select an existing term (e)
2. Select a variable (x)
3. Select a value of that variable (v)
4. Return two terms, one of the form e*max(0, x-v) and the other of the form e*max(0, v-x)
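For concreteness, the mirrored hinge pair built from an existing term can be written as a tiny C sketch (illustrative only, not taken from any MARS implementation):

/* The hinge function h(t) = max(0, t). */
static double hinge(double t) { return t > 0.0 ? t : 0.0; }

/* Value of the two candidate terms at one data point, given the value e of
 * the existing (parent) term at that point, the chosen variable's value x,
 * and the knot v. */
static double term_plus (double e, double x, double v) { return e * hinge(x - v); }
static double term_minus(double e, double x, double v) { return e * hinge(v - x); }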
And this makes sense to me. I could see how, for example, a data table like this:
+---+---+---+
| A | B | Z |
+---+---+---+
| 5 | 6 | 1 |
| 7 | 2 | 2 |
| 3 | 1 | 3 |
+---+---+---+
Could produce terms like 2*max(0, B-1) or even 8*max(0, B-1)*max(0, 3-A). However, the Wikipedia page has an example that I don't understand. It has an ozone example where the first term is 25. However, it also has a term in the final regression with a coefficient that is negative and fractional. I don't see how this is possible: since the initial term is 25, and you can only multiply by previous terms, and no previous term can have a negative coefficient, how could you ever end up with one?
What am I missing?
As I see it, either I misunderstand term generation, or I misunderstand the simplification process. However, simplification as described seems to only delete terms, not modify them. Can you see what I am missing here?
