I apologise in advance if this is somewhat vague, but I'm lacking even the basic idea of how to approach this - including even knowing whether there is a proper term to search for.
I am trying to code a table-driven system of animating chronological events, where descriptive commentary is also pulled out of the table but that commentary is issued ahead of the actual event animation, so not like sub-titles, which appear synchronous with the event on screen.
To explain further:
Given a table with
Seq | Event | Start | End | Pace
1 | Walk up to A | | A | Walk
2 | Stand at A | A | | Stand
3 | Walk from A to B | A | B | Walk
4 | Run from B to C | B | C | Run
5 | Stand at C, turn left | C | | Stand
6 | Turn left at C and walk to D | C | D | Walk
7 | At D quickly spin around your own axis | D | | Stand
8 | Run back to A | D | A | Run
And so on. The speed of the movements and duration of standing in place or spinning around your axis are set by variable properties - some people are faster than others, after all.
The point is that the commentary, essentially the text in the Event column, will be read before the actual movement.
Creating a sequence like this
Seq | Commentary | Movement
0 | See the toon walk up to A and stop there | animation of movement to A from random point
1 | They now stand and will move from A to B | toon stands in place
2 | at B they will run to C | toon is walking towards B
3 | at C they will turn left | toon is running towards C
4 | then they walk from C to D | toon is still running towards C
5 | at D they spin around their axis | toon is walking towards D
6 | after spin they will run back to A | toon does the spin at D,
| | based on timing they are running back to D
I've tried to include some edge cases here. For example, an event can be so short that its commentary takes longer than its animation, which forces two commentaries to be read back-to-back ahead of the movement.
Thinking of systems like video and audio editors, with their multi-track approach, I thought I could do something similar. I would "pre-compile" the sequence table into a timeline and then create the commentary track by timeshifting it "to the left" (i.e. backwards in time, as it were). The playback would then present the two streams.
But I utterly and totally lack the knowledge to do that and am not even sure what to look into, so I can learn. Is this some form of state machine? An event loop?
The system will ultimately be developed in C#/.NET. Xamarin to be precise, to allow it to run on both Android and iOS. The animation code actually exists, to the extent of having the toon walk, run and stand between points on a grid (where the points boil down to coordinates). Audio commentary, as in reading the event text, is also straightforward nowadays (there are even device-independent PCLs).
It's the timing and synchronisation that I am utterly lost with! Creating that time stream between offset animation and audio out of the information in the table (and properties). I looked into things like state machines and graph theory, but frankly a lot of that went over my head.
What should I even research here? I'd be more than happy with an answer of "what you're trying to do is called XXX, google it for the algorithms" and "read THIS to find out more about doing XXX and YYY" and "THIS describes the algorithms (even if in another language or not language at all) to do ZZZ".
Let's assume you can calculate the lengths of the animations and the lengths of the commentaries ahead.
action | length in sec
a1 | 5
a2 | 2
a3 | 3
commentary | length in sec
c1 | 2
c2 | 10
c3 | 5
Create a schedule for the actions first:
a-schedule | start | end
a1 | 0 | 5
a2 | 5 | 7
a3 | 7 | 10
Now create a schedule for the commentaries by adjusting the end time of each commentary to the start time of the appropriate action:
c-schedule | start | end
c1 | -2 | 0
c2 | -5 | 5
c3 | 2 | 7
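Now iterate backwards through this last table, starting from the second-to-last element, and do the following:
for (int x = c.length - 2; x >= 0; x--) {
    if (c[x].end > c[x+1].start) {
        // Compute the overlap once, before changing either field;
        // otherwise the second subtraction would see an overlap of 0.
        int shift = c[x].end - c[x+1].start;
        c[x].end -= shift;
        c[x].start -= shift;
    }
}
This results in the following table: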
c-schedule | start | end
c1 | -10 | -8
c2 | -8 | 2
c3 | 2 | 7
Each commentary will finish before its corresponding action starts. However, if most of the commentaries last longer than their actions, a commentary can be read well before its action. Where several actions in a row are longer than their commentaries, the commentary can catch up with the action again.
The table shows that you have to start playing the first commentary 10 seconds before you start playing the first action. You can shift both schedules by the same amount if you want a zero-based schedule.
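The steps above can be sketched in a few lines. This is only an illustration in Python, not the C# the question targets, and the function names `build_schedules` and `zero_based` are made up for this example. It assumes all lengths are known up front, as stated above.

```python
def build_schedules(action_lengths, commentary_lengths):
    """Return (actions, commentaries) as lists of (start, end) tuples."""
    # Actions simply run back to back, starting at time 0.
    actions = []
    t = 0
    for length in action_lengths:
        actions.append((t, t + length))
        t += length

    # Each commentary initially ends exactly when its action starts.
    commentaries = [
        (start - c_len, start)
        for (start, _end), c_len in zip(actions, commentary_lengths)
    ]

    # Walk backwards and push any overlapping commentary earlier.
    for i in range(len(commentaries) - 2, -1, -1):
        start, end = commentaries[i]
        next_start = commentaries[i + 1][0]
        if end > next_start:
            shift = end - next_start
            commentaries[i] = (start - shift, end - shift)
    return actions, commentaries


def zero_based(actions, commentaries):
    """Shift both schedules so the earliest start time becomes 0."""
    earliest = min(start for start, _end in actions + commentaries)
    shift = lambda track: [(s - earliest, e - earliest) for s, e in track]
    return shift(actions), shift(commentaries)


actions, commentaries = build_schedules([5, 2, 3], [2, 10, 5])
print(commentaries)  # [(-10, -8), (-8, 2), (2, 7)]
```

Playback then only has to walk both lists against a single clock and start each item when its start time is reached.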
Related
I would like to implement a simple stack in my microcontroller firmware.
The stack I would like to implement is like this, i.e. something standard.
The problem is that the flash memory IC I am using supports page granularity for writing but sector granularity for erasing. Just like with any other NAND flash, before writing data to a part of the flash you have to erase that part.
So as a summary, I erase a sector and write some data on its first page. For rewriting even one byte of that page, I should erase the whole sector first.
In my stack,
| O | O | O | O | O | O |
I push some data:
| W | W | W | O | O | O |
and then pop one of them:
| W | W | P | O | O | O |
Now I would like to push another item. Here the problem appears. If, just like in a standard stack, I decide to overwrite the previous data and just change the index, I must erase the whole sector first! Therefore I should solve this problem programmatically.
Any ideas?
P.S: The flash memory map is: 16 Blocks -> Each Block = 16 Sectors -> Each Sector = 16 Pages -> Each Page = 256 Bytes
I have a write-command at page level with an offset and erase command for sector and block.
Well, if you don't want to write a whole sector with every push, you'll have to tolerate some old data in your stack:
| W | W | P | W | O | O |
You will need to reserve at least one bit in each item that allows you to distinguish between the valid and invalid ones. Assuming that a sector-erase fills it with 1s, then leave the valid bit 1 in all valid items that you write. You will then be able to change it to a 0 by writing just one page when you pop the item off, marking it as invalid.
Since you can't reuse item slots in this scheme, your stack will grow continuously toward the end of memory. That is actually what you want, since that NAND flash can only be written so many times before it dies, and this scheme will spread the writes out somewhat. When you get to the end of space, you can erase and rewrite the whole thing to remove all the gaps.
It may happen that you end up with a very long sequence of invalid items, so popping past them could involve scanning all the intermediate ones. You can fix this problem by reserving more than one bit in each item. When you write a valid item, you use these bits to store the number of invalid items that precede it, plus 1. This lets you skip back quickly and pop in constant time.
Then, when you want to mark an item as invalid, you can change the bits reserved for this count to all 0s, again by writing just one page. This zero count could not appear in a valid item, so it serves as an invalid mark.
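The scheme described above can be simulated as follows. This is a minimal sketch under several assumptions that are not in the original: each item is one skip-count byte plus one payload byte, an erased byte reads as 0xFF, and the `top` and `next_free` indices are kept in RAM (after a reset they would have to be recovered by scanning). The class name `FlashStack` is hypothetical.

```python
ERASED = 0xFF   # assumption: an erased byte reads as all 1s
INVALID = 0x00  # a zero skip count marks a popped (invalid) item


class FlashStack:
    def __init__(self, slots):
        self.count = [ERASED] * slots  # per-item skip-count byte
        self.data = [ERASED] * slots   # per-item one-byte payload
        self.top = -1                  # index of the newest valid item
        self.next_free = 0             # first slot never written since erase

    @staticmethod
    def _program(cells, index, value):
        # Programming flash can only clear bits (1 -> 0), never set them.
        cells[index] &= value

    def push(self, value):
        if self.next_free == len(self.data):
            raise OverflowError("region full: erase and compact first")
        # Number of invalid items preceding this slot, plus 1.
        skip = self.next_free - self.top
        if skip >= ERASED:
            raise OverflowError("skip count does not fit in one byte")
        self._program(self.count, self.next_free, skip)
        self._program(self.data, self.next_free, value)
        self.top = self.next_free
        self.next_free += 1

    def pop(self):
        if self.top < 0:
            raise IndexError("pop from empty stack")
        index = self.top
        value = self.data[index]
        skip = self.count[index]
        # Mark the item invalid by clearing its count; no erase needed.
        self._program(self.count, index, INVALID)
        # Jump straight back over any invalid items in constant time.
        self.top = index - skip
        return value


stack = FlashStack(slots=16)
for item in (1, 2, 3):
    stack.push(item)
stack.pop()    # removes 3, leaving a dead slot behind
stack.push(4)  # written to a fresh slot with skip count 2
```

Once `next_free` reaches the end of the region, the valid items would be copied out, the sectors erased, and the items written back without gaps.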
Recall Amdahl’s law on estimating the best possible speedup. Answer the following questions.
You have a program that has 40% of its code parallelized on three processors, and just for this fraction of code, a speedup of 2.3 is achieved. What is the overall speedup?
I'm having trouble understanding the difference between speedup and overall speedup in this question. I know there must be a difference by the way this question is worded.
Q : What is the overall speedup?
It is best to start not with the original, trivial Amdahl's law formula, but with a more contemporary view that extends it, one that discusses add-on overhead costs and also explains the atomicity of split work.
Two sections, one accelerated by a "local" speedup, one overall result
Your problem formulation side-steps the real-world process-orchestration overheads explained there by simply postulating a net, local speedup for the <PAR>-able section. The add-on overhead costs of that section are "hidden" and show up only as an inefficiency: with three times the resources for code execution, the section runs just 2.3x faster, not 3.0x. So more than the theoretical 1/3 of the original time is spent, covering the initial set-up (an add-on overhead time, not present in a pure-[SERIAL] code execution), the parallel processing itself (the useful work, now on triple the code-execution capacity) and the termination and collection of results back into the "main" code (again add-on overhead times, not present in a pure-[SERIAL] code execution).
"Hiding" these natural costs of going into and out of [PARALLEL] code-execution sections simplifies the homework. Yet a proper understanding of the real-life costs is crucial, so as not to spend far more on set-ups and all the other unavoidable add-on overheads than one would ever get back from the hoped-for speedup of splitting the work across many processors.
|-------> time
|START:
| |DONE: 100% of the code
| | |
|______________________________________<SEQ>______60%_|_40%__________________<PAR>-able__|
o--------------------------------------<SEQ>----------o----------------------<PAR>-able--o CPU_x runs both <SEQ> and <PAR>-able sections of code, in a pure [SERIAL] process-flow orchestration, one after another
| |
| |
|-------> time
|START: |
| | |DONE: 100% of the code :
o--------------------------------------<SEQ>----------o | :
| o---------o .. .. .. .. ..CPU_1 runs <PAR>'d code
| o---------o .. .. .. .. ..CPU_2 runs <PAR>'d code
| o---------o .. .. .. .. ..CPU_3 runs <PAR>'d code
| | |
| | |
| <_not_1/3_> just ~ 2.3x faster (not 3x) perhaps reflects real-costs (penalisations) of new, add-on, process-organisation related setup + termination overheads
|______________________________________<SEQ>______60%_|_________|~ 40% / 2.3x ~ 17.39% i.e. the <PAR>-section has gained a local ( "net"-section ) speedup of 2.3x instead of 3.0x, achievable on 3-CPU-code-execution streams
| | |
Net overall speedup ( if no other process-organisation related add-on overhead costs were accrued )
is:
( 60% + ( 40% / 1.0 ) )
---------------------------- ~ 1.2921 x
( 60% + ( 40% / 2.3 ) )
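The same arithmetic as a small check, assuming (as the question does) that the 2.3x figure is the net speedup of the 40% section alone. The function name `overall_speedup` is just an illustration.

```python
def overall_speedup(parallel_fraction, local_speedup):
    """Overall speedup when only part of the run time is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    # New run time relative to the original: the serial part is unchanged,
    # the accelerated part shrinks by its local speedup.
    new_time = serial_fraction + parallel_fraction / local_speedup
    return 1.0 / new_time


print(round(overall_speedup(0.40, 2.3), 4))  # 1.2921
```

So the "speedup" of 2.3 applies to the parallelized 40% only, while the "overall speedup" of about 1.29 applies to the whole program.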
I've been reading about an interesting machine learning algorithm, MARS (multivariate adaptive regression splines).
As far as I understand the algorithm, from Wikipedia and Friedman's papers, it works in two stages, a forward pass and a backward pass. I'll ignore the backward pass for now, since the forward pass is the part I'm interested in. The steps of the forward pass, as far as I can tell, are:
1. Start with just the mean of the data.
2. Generate a new term pair through exhaustive search.
3. Repeat step 2 while improvements are being made.
And to generate a term pair MARS appears to do the following:
1. Select an existing term (e).
2. Select a variable (x).
3. Select a value of that variable (v).
4. Return two terms, one of the form e*max(0, x-v) and the other of the form e*max(0, v-x).
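The term-pair step as described here can be sketched like this. It only builds the two mirrored hinge terms for one choice of (e, x, v). The exhaustive search over candidates and any coefficients are deliberately left out, and the names `hinge_pair` and `constant` are made up for the illustration.

```python
def constant(row):
    """The initial term: the same value for every row."""
    return 1.0


def hinge_pair(existing_term, column, knot):
    """Return the two mirrored hinge terms built on an existing term."""
    def positive(row):
        # e * max(0, x - v)
        return existing_term(row) * max(0.0, row[column] - knot)

    def negative(row):
        # e * max(0, v - x)
        return existing_term(row) * max(0.0, knot - row[column])

    return positive, negative


# Hypothetical sample rows with two input columns, A and B.
rows = [{"A": 5, "B": 6}, {"A": 7, "B": 2}, {"A": 3, "B": 1}]

b_above_1, b_below_1 = hinge_pair(constant, "B", 1)
print([b_above_1(r) for r in rows])  # [5.0, 1.0, 0.0]

# A new pair can multiply an existing term, giving an interaction term.
interaction, _ = hinge_pair(b_above_1, "A", 3)
print([interaction(r) for r in rows])  # [10.0, 4.0, 0.0]
```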
And this makes sense to me. I could see how, for example, a data table like this:
+---+---+---+
| A | B | Z |
+---+---+---+
| 5 | 6 | 1 |
| 7 | 2 | 2 |
| 3 | 1 | 3 |
+---+---+---+
Could produce terms like 2*max(0, B-1) or even 8*max(0, B-1)*max(0, 3-A). However, the Wikipedia page has an example that I don't understand. It has an ozone example where the first term is 25. However, it also has a term in the final regression whose coefficient is negative and fractional. I don't see how this is possible: since the initial term is 5, you can only multiply by previous terms, and no previous term can have a negative coefficient, how could you ever end up with one?
What am I missing?
As I see it, either I misunderstand term generation, or I misunderstand the simplification process. However, simplification as described seems to only delete terms, not modify them. Can you see what I am missing here?
I'm working on a CUDA app that makes use of all available RAM on the card, and am trying to figure out different ways to reduce cache misses.
The problem domain consists of a large 2- or 3-D grid, depending on the type of problem being solved. (For those interested, it's an FDTD simulator). Each element depends on either two or four elements in "parallel" arrays (that is, another array of nearly identical dimensions), so the kernels must access either three or six different arrays.
The Problem
*Hopefully this isn't "too localized". Feel free to edit the question
The relationship between the three arrays can be visualized as follows (apologies for the mediocre ASCII art):
A[0,0] -C[0,0]- A ---- C ---- A ---- C ---- A
| | | |
| | | |
B[0,0] B B B
| | | |
| | | |
A ---- C ---- A ---- C ---- A ---- C ---- A
| | | |
| | | |
B B B B
| | | |
| | | |
A ---- C ---- A ---- C ---- A ---- C ---- A
| | | |
| | | |
B B B B[3,2]
| | | |
| | | |
A ---- C ---- A ---- C ---- A ---- C ---- A[3,3]
[2,3]
Items connected by lines are coupled. As can be seen above, A[] depends on both B[] and C[], while B[] depends only on A[], as does C[]. All of A[] is updated in the first kernel, and all of B[] and C[] are updated in a second pass.
If I declare these arrays as simple 2D arrays, I wind up with strided memory access. For a very large domain size (3x3 +- 1 in the grid above), this causes occupancy and performance deficiencies.
So, I thought about rearranging the array layout in a Z-order curve:
Also, it would be fairly trivial to interleave these into one array, which should improve fetch performance since (depending on the interleave order) at least half of the elements required for a given cell update would be close to one another. However, it's not clear to me if GPU uses multiple data pointers when accessing multiple arrays. If so, this imagined benefit could actually be a hindrance.
The Questions
I've read that NVidia does this automatically behind the scenes when using texture memory, or a cudaArray. If this is not the case, should I expect the increased latency when crossing large spans (when the Z curve goes from upper right to bottom left at a high subdivision level) to eliminate the benefit of the locality in smaller grids?
Dividing the grid into smaller blocks that can fit in shared memory should certainly help, and the Z order makes this fairly trivial. Should I have a separate kernel pass that updates boundaries between blocks? Will the overhead of launching another kernel be significant compared to the savings I expect ?
Is there any real benefit to using a 2D vs 1D array? I expect memory to be linear, but am unsure if there is any real meaning to the 2D memory layout metaphor that's often used in CUDA literature.
Wow - long question. Thanks for reading and answering any/all of this.
Just to get this off of the unanswered list:
After a lot of benchmarking and playing with different arrangements, the fastest approach I found was to keep the arrays interleaved in Z-order so that most of the values required by a thread were located near each other in RAM. This improved cache behavior (and thus performance). Obviously there are many cases where Z-order fails to keep required values close together. I wonder if rotating quadrants would help by reducing the "distance" between the end of a Z and the next quadrant, but I haven't tried that.
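For readers unfamiliar with Z-order (Morton) indexing, the address calculation can be sketched as below. This is an illustration in Python rather than the CUDA code that was benchmarked. The layout in `interleaved_index` (A, B and C for a cell stored next to each other) is an assumption about what "interleaved" means here, not the actual layout used.

```python
def part1by1(n):
    """Spread the low 16 bits of n so a zero bit sits between each pair."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def morton2d(x, y):
    """Interleave the bits of x and y into one Z-order index."""
    return part1by1(x) | (part1by1(y) << 1)


def interleaved_index(x, y, field, fields=3):
    """Position of one field (0=A, 1=B, 2=C) in a single flat array.

    Assumed layout: the values for one cell are adjacent, and cells
    are ordered along the Z curve.
    """
    return morton2d(x, y) * fields + field


# The first four cells of the curve trace out the "Z" shape.
print([morton2d(x, y) for y in (0, 1) for x in (0, 1)])  # [0, 1, 2, 3]
```

The large jumps mentioned above occur where the curve leaves one quadrant and enters the next, for example between index 3 at (1, 1) and index 4 at (2, 0).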
Thanks to everyone for the advice.
At work we are looking into common problems that lead to high cyclomatic complexity. For example, having a large if-else statement can lead to high cyclomatic complexity, but can be resolved by replacing conditionals with polymorphism. What other examples have you found?
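As a concrete picture of the refactoring mentioned above, here is a small sketch of replacing a conditional with polymorphism. The shape example and all names in it are invented for illustration.

```python
import math
from abc import ABC, abstractmethod


def area_with_conditionals(kind, a, b=0.0):
    # Every branch adds a decision point to this one function.
    if kind == "circle":
        return math.pi * a * a
    elif kind == "rectangle":
        return a * b
    elif kind == "triangle":
        return 0.5 * a * b
    raise ValueError(f"unknown shape: {kind}")


class Shape(ABC):
    @abstractmethod
    def area(self):
        """Each subclass supplies its own formula; no branching needed."""


class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return math.pi * self.radius * self.radius


class Rectangle(Shape):
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height


def total_area(shapes):
    # The caller no longer needs to know which kind of shape it has.
    return sum(shape.area() for shape in shapes)
```

The branching has not disappeared; it has moved into method dispatch, so each individual method stays at a complexity of 1.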
See NDepend's definition of Cyclomatic Complexity.
Nesting Depth is also a great code metric.
Cyclomatic complexity is a popular procedural software metric equal to the number of decisions that can be taken in a procedure. Concretely, in C# the CC of a method is 1 + {the number of following expressions found in the body of the method}:
if | while | for | foreach | case | default | continue | goto | && | || | catch | ternary operator ?: | ??
Following expressions are not counted for CC computation:
else | do | switch | try | using | throw | finally | return | object creation | method call | field access
Adapted to the OO world, this metric is defined both on methods and classes/structures (as the sum of its methods CC). Notice that the CC of an anonymous method is not counted when computing the CC of its outer method.
Recommendations: Methods where CC is higher than 15 are hard to understand and maintain. Methods where CC is higher than 30 are extremely complex and should be split in smaller methods (except if they are automatically generated by a tool).
Another way to avoid using so many ifs is to implement a Finite State Machine. Because events fire transitions, the conditionals become implicit, and clearer, in the transitions that change the state of the system. The control flow is easier to follow.
Here is a link that mentions some of its benefits:
http://www.skorks.com/2011/09/why-developers-never-use-state-machines/
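A table-driven state machine of the kind described could look like this minimal sketch. The turnstile states and events are a made-up example and are not taken from the linked article.

```python
# Every (state, event) pair maps to the next state. The decisions live
# in this table instead of in nested if/else blocks.
TRANSITIONS = {
    ("locked", "coin"): "unlocked",
    ("locked", "push"): "locked",
    ("unlocked", "coin"): "unlocked",
    ("unlocked", "push"): "locked",
}


def step(state, event):
    """Return the next state, rejecting pairs missing from the table."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition for {event!r} in {state!r}") from None


def run(events, state="locked"):
    """Feed a sequence of events through the machine."""
    for event in events:
        state = step(state, event)
    return state


print(run(["coin", "push", "push"]))  # locked
```

Adding a new state or event means adding rows to the table, while `step` keeps the same low complexity.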