Understanding APL's Inner Product

Understanding APL's Inner Product - matrix

Here is an excerpt from the Mastering Dyalog APL book, from the chapter on Inner Products:
HMS is a variable which contains duration in Hours, Minutes, and Seconds: HMS ← 3 44 29 Chapter J – Operators 397
We would like to convert it into seconds. We shall see 3 methods just now, and a 4th
method
will be given in another chapter.
A horrible solution (3600×HMS[1]) + (60×HMS[2]) + HMS[3]
A good APL solution +/ 3600 60 1 × HMS
An excellent solution with Inner Product 3600 60 1 +.× HMS
It then says that The second and third solutions are equivalent in terms of number of characters typed and performance.
As I understand it, APL programmers should generally use Inner Product, as well as Outer Product, as much as possible. Is that correct?
Can you give an example when using Inner Product would lead to performance gains? What exactly happens when I use Inner Product (on a lower level)? Is the first solution presented below horrible just because it doesn't use APL syntax in a proper way or does it actually have worse performance?
I know there are few questions but want I am asking about in general is how the Inner/Outer Products work and when exactly should an APL programmer use them.

We’ve done work to optimize both the +/ and the +.×.
MBaas is right in that it happens that the +/ in this instance is slightly better than the +.×
Our general advice is: use the constructs in the language best suited for the job, and eventually the implementation will catch up.
The "horrible" solution is considered bad as it does not use array thinking.
Regards,
Vince, Dyalog Support

APL programmers should generally use Inner Product, as well as Outer Product, as much as possible. Is that correct?
It is really up to the APL programmer and the task at hand, but if something makes APL code more concise and efficient, I don't see why a programmer wouldn't opt for it.
In this particular case 60⊥HMS is even more concise and efficient than the inner product.
Can you give an example when using Inner Product would lead to performance gains?
As typical in array-oriented programming, performance gains are achieved by doing things in one go.
Most APL functions are implicit loops---their implementation uses a counter, a limit for it, and an increment step.
The shorter your code is, the better, because not only it's easier to hold in one's head, it's also more efficient as the interpreter has to do fewer passes over the data.
Some implementations do loop fusion in an attempt to reduce this overhead.
Some have idiom recognition---certain combinations of squiggles are special-cased in the interpreter. Doing things in one go also allows the interpreter to do clever optimisations like using the SSE instruction set or GPUs.
Coming back to inner product, let's take the example of A f.g B where A and B are vectors and see how f and g are applied (in Dyalog):
f←{⎕←(⍕⍺),' f ',⍕⍵ ⋄ ⍺+⍵}
g←{⎕←(⍕⍺),' g ',⍕⍵ ⋄ ⍺×⍵}
0 1 2 3 4 f.g 5 6 7 8 9
4 g 9
3 g 8
24 f 36
2 g 7
14 f 60
1 g 6
6 f 74
0 g 5
0 f 80
80
You can see from the above that calls to f and g are interleaved. The interpreter apples f and reduces on g simultaneously, in one pass, avoiding the creation of a temporary array, like f/ A g B would do.
Another example: http://archive.vector.org.uk/art10500200
You can test the performance of different solutions for yourself and see which one works best:
)copy dfns.dws cmpx
⍝ or: ")copy dfns cmpx" if you are using Windows
HMS ← 3 44 29
cmpx '(3600×HMS[1]) + (60×HMS[2]) + HMS[3]' '+/ 3600 60 1 × HMS' '3600 60 1 +.× HMS' '60⊥HMS'
(3600×HMS[1]) + (60×HMS[2]) + HMS[3] → 2.7E¯6 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
+/ 3600 60 1 × HMS → 9.3E¯7 | -66% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
3600 60 1 +.× HMS → 8.9E¯7 | -68% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
60⊥HMS → 4.8E¯7 | -83% ⎕⎕⎕⎕⎕⎕⎕

The problem with generalization is that they might be incorrect, but as rule of thumb I'd say using the inner & outer products will benefit readability as well as performance ;-)
Now, looking at the thing in practice:
` ]performance.RunTime (3600×HMS[1])+(60×HMS[2])+HMS[3] -repeat=100000
Benchmarking "(3600×HMS[1])+(60×HMS[2])+HMS[3]", repeat=100000
Exp
CPU (avg): 0.001558503836
Elapsed: 0.001618446292
]performance.RunTime '+/ 3600 60 1 × HMS' -repeat=100000
Benchmarking "+/ 3600 60 1 × HMS", repeat=100000
Exp
CPU (avg): 0.0004698496481
Elapsed: 0.0004698496481
`
That is quite a difference - if you repeat it enough times to be measureable ;-)
But of course with larger dataset the advantage gets more visible!
Let's also look at the 3variant:
` ]performance.RunTime '3600 60 1 +.× HMS' -repeat=100000
Benchmarking "3600 60 1 +.× HMS", repeat=100000
Exp
CPU (avg): 0.0004698496481
Elapsed: 0.000439859245
`
No difference here, but again - with "real data" (larger array) you should see a much clearer difference. I think a simple explanation is that inner product is like one 'statement' for the interpreter, whereas the first variant has 3 single multiplications, indexing and needs to consider priorities (brackets) and then sum up that vector, which sounds like a lot of sweat ;-)
The 2nd statement has one multiplication only (for a vector), so it elimates several steps already, and the inner product enables the interpreter to possibly combine some of its internal working to do his job even faster.
BUT now here's a surprise:
v1←(10000/3600 60 1) ⋄v2← 10000/HMS
]performance.RunTime '+/v1 × v2' 'v1 +.× v2' -repeat=100000 -compare
+/v1 × v2 → 6.0E¯5 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
v1 +.× v2 → 6.3E¯5 | +5% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
I expected that the bigger arguments would help to make the last expression's performance-advantage more visible - but actually #2 won. Maybe Dyalog optimized case #2 more than #3... ;-)

Related

Need a faster and efficient way to add elements to a list in python

I am trying create a large list of numbers .
a = '1 1 1 2 2 0 0 1 1 1 1 9 9 0 0' (it goes over a ten million).
I've tried these methods:
%timeit l = list(map(int, a.split())) it was 4.07 µs per loop
%timeit l = a.split(' ') this was 462 ns per loop
%timeit l = [i for i in a.split()] it took 1.19 µs per loop
I understand that the 2nd and 3rd variants are character lists whereas first is an integer list, this is fine. But as the number of elements gets to over ten million, it can take up to 6 seconds to create a list. This is too long for my purposes.
Could someone tell me a more faster and efficient way to do this.
Thanks

In plain Python, not using a third-party extension, a.split() ought to be the fastest way to split your input into a list. The str.split() function only has one job and it is specialized for this use.

If you know your input consists of single digits separated by single spaces then you can also consider:
b = ord('0')
[ord(a)-b for a in A[::2]]
this makes a list of 10 million integers in 0.2 seconds on my computer.

I tested the various answers in a jupyter notebook, and Peter de Rivas does seem to edge out the others proposed.
Interestingly, the mapping to integer seems to be the bottleneck. The str.split() operation itself is an order of magnitude faster.

Lua: Code optimization vector length calculation

I have a script in a game with a function that gets called every second. Distances between player objects and other game objects are calculated every second there. The problem is that there can be thoretically 800 function calls in 1 second(max 40 players * 2 main objects(1 up to 10 sub-objects)). I have to optimize this function for less processing. this is my current function:
local square = math.sqrt;
local getDistance = function(a, b)
local x, y, z = a.x-b.x, a.y-b.y, a.z-b.z;
return square(x*x+y*y+z*z);
end;
-- for example followed by: for i = 800, 1 do getDistance(posA, posB); end
I found out, that the localization of the math.sqrt function through
local square = math.sqrt;
is a big optimization regarding to the speed, and the code
x*x+y*y+z*z
is faster than this code:
x^2+y^2+z^2
I don't know if the localization of x, y and z is better than using the class method "." twice, so maybe square(a.x*b.x+a.y*b.y+a.z*b.z) is better than the code local x, y, z = a.x-b.x, a.y-b.y, a.z-b.z;
square(x*x+y*y+z*z);
Is there a better way in maths to calculate the vector length or are there more performance tips in Lua?

You should read Roberto Ierusalimschy's Lua Performance Tips (Roberto is the chief architect of Lua). It touches some of the small optimizations you're asking about (such as localizing library functions and replacing exponents with their mutiplicative equivalents). Most importantly, it conveys one of the most important and overlooked ideas in engineering: sometimes the best solution involves changing your problem. You're not going to fix a 30-million-calculation leak by reducing the number of CPU cycles the calculation takes.
In your specific case of distance calculation, you'll find it's best to make your primitive calculation return the intermediate sum representing squared distance and allow the use case to call the final Pythagorean step only if they need it, which they often don't (for instance, you don't need to perform the square root to compare which of two squared lengths is longer).
This really should come before any discussion of optimization, though: don't worry about problems that aren't the problem. Rather than scouring your code for any possible issues, jump directly to fixing the biggest one - and if performance is outpacing missing functionality, bugs and/or UX shortcomings for your most glaring issue, it's nigh-impossible for micro-inefficiencies to have piled up to the point of outpacing a single bottleneck statement.
Or, as the opening of the cited article states:
In Lua, as in any other programming language, we should always follow the two
maxims of program optimization:
Rule #1: Don’t do it.
Rule #2: Don’t do it yet. (for experts only)

I honestly doubt these kinds of micro-optimizations really help any.
You should be focusing on your algorithms instead, like for example get rid of some distance calculations through pruning, stop calculating the square roots of values for comparison (tip: if a^2<b^2 and a>0 and b>0, then a<b), etc etc

Your "brute force" approach doesn't scale well.
What I mean by that is that every new object/player included in the system increases the number of operations significantly:
+---------+--------------+
| objects | calculations |
+---------+--------------+
| 40 | 1600 |
| 45 | 2025 |
| 50 | 2500 |
| 55 | 3025 |
| 60 | 3600 |
... ... ...
| 100 | 10000 |
+---------+--------------+
If you keep comparing "everything with everything", your algorithm will start taking more and more CPU cycles, in a cuadratic way.
The best option you have for optimizing your code isn't not in "fine tuning" the math operations or using local variables instead of references.
What will really boost your algorithm will be eliminating calculations that you don't need.
The most obvious example would be not calculating the distance between Player1 and Player2 if you already have calculated the distance between Player2 and Player1. This simple optimization should reduce your time by a half.
Another very common implementation consists in dividing the space into "zones". When two objects are on the same zone, you calculate the space between them normally. When they are in different zones, you use an approximation. The ideal way of dividing the space will depend on your context; an example would be dividing the space into a grid, and for players on different squares, use the distance between the centers of their squares, that you have computed in advance).
There's a whole branch in programming dealing with this issue; It's called Space Partitioning. Give this a look:
http://en.wikipedia.org/wiki/Space_partitioning

Seriously?
Running 800 of those calculations should not take more than 0.001 second - even in Lua on a phone.
Did you do some profiling to see if it's really slowing you down? Did you replace that function with "return (0)" to verify performance improves (yes, function will be lost).
Are you sure it's run every second and not every millisecond?
I haven't see an issue running 800 of anything simple in 1 second since like 1987.

If you want to calc sqrt for positive number a, take a recursive sequense
x_0 = a
x_n+1 = 1/2 * (x_n + a / x_n)
x_n goes to sqrt(a) with n -> infinity. first several iterations should be fast enough.
BTW! Maybe you'll try to use the following formula for length of vector instesd of standart.
local getDistance = function(a, b)
local x, y, z = a.x-b.x, a.y-b.y, a.z-b.z;
return x+y+z;
end;
It's much more easier to compute and in some cases (e.g. if distance is needed to know whether two object are close) it may act adequate.

Are Haskell List Comprehensions Inefficient?

I started doing Project Euler and got to problem number 9. Since I was using Project Euler to learn Haskell, I decided to use list comprehensions (as shown in Learn You A Haskell). I do that and GHCI takes awhile to figure out the triplet, which I figured is normal because of the calculations involved. Now, at work yesterday (I don't work as a programmer professionally, yet) I was talking to a friend who knows VBA and he wanted to try to find the answers in VBA. I thought it would be a fun challenge as well, and I churn out some basic for loops and if statements, but what got me was that it was much faster than Haskell was.
My question is: are Haskell's list comprehension incredibly inefficient? At first I thought it was just because I was in GHC's interactive mode, but then I realized VBA is interpreted too.
Please note, I didn't post my code because of it being an answer to project euler. If it will answer my question (as in I'm doing something wrong) then I will gladly post the code.
[edit]
Here is my Haskell list comprehension:
[(a,b,c) | c <- [1..1000], b <- [1..c], a <- [1..b], a+b+c=1000, a^2+b^2=c^2]
I guess I could've lowered the range on c but is that what is really slowing it down?

There are two things you could be doing with this problem that could make your code slow. One is how you are trying values for a, b and c. If you loop through all possible values for a, b, c from 1 to 1000, you'll be spending a long time. To give a hint, you can make use of a+b+c=1000 if you rearrange it for c. The other is that if you only use a list comprehension, it will process every possible value for a, b and c. The problem tells you that there is only one unique set of numbers that satisfies the problem, so if you change your answer from this:
[ a * b * c | .... ]
to:
head [ a * b * c | ... ]
then Haskell's lazy evaluation means that it will stop after finding the first answer. This is the Haskell equivalent of breaking out of your VBA loop when you find the first answer. When I used both these tips, I had an answer that completed very quickly (under a second) in ghci.
Addendum: I missed at first the condition a < b < c. You can also make use of this in your list comprehensions; it is valid to say things along the lines of:
[(a, b) | b <- [1..100], a <- [1..b-1]]

Consider this simplified version of your list comprehension:
[(a,b,c) | a <- [1..1000], b <- [1..1000], c <- [1..1000]]
This will give all possible combinations of a, b, and c. It's kind of like saying, "how many ways can three one-thousand-sided dice land?" The answer is 1000*1000*1000 = 1,000,000,000 different combinations. If it took 0.001 seconds to generate each combination, it would therefore take 1,000,000 seconds (~11.5 days) to finish all combinations. (OK, 0.001 seconds is actually pretty slow for a computer, but you get the idea)
When you add predicates to your list comprehension, it still takes the same amount of time to compute; in fact, it takes longer since it needs to check the predicate for each of the 1 billion combinations it computes.
Now consider your comprehension. It looks like it should be much faster, right?
[(a,b,c) | c <- [1..1000], b <- [1..c], a <- [1..b], a+b+c=1000, a^2+b^2=c^2]
There are 1000 choices for c. How many are there for b and a? Well, the average choice for c is 500. For all choices of c, then, there are an average of 500 choices for b (since b can range from 1 to c). Likewise, for all choices of c and b, there are an average of 250 choices for a. That's very hand-wavy, but I'm fairly sure it's accurate. So 1000 choices for c * 1000/2 choices for b * 1000/4 choices for a = 1 billion / 8 ~= 100 million. It's 8x faster, but if you paid attention, you'll notice it's actually the same big-Oh complexity as the simplified version above. If we compared "simplified" vs "improved" versions of the same problem, but from [1..100000] instead of [1..1000], the "improved" would still only be 8x faster than the "simplified".
Don't get me wrong, 8x is a wonderful constant-factor speedup. But unless you want to wait a couple hours to get the solution, you'll need to get a better big-Oh.
As Neil noted, the way to reduce the complexity of this problem is, for a given b and c, choose the a that satisfies a+b+c=1000. That way, you're not trying a bunch of as that will fail. This will drop the big-Oh complexity; you'll only be considering approximately 1000 * 500 * 1 = 500,000 combinations, instead of ~100,000,000.

Once you get the solution to the problem you can check out other peoples versions of Haskell solutions on the Project Euler site to get an idea of how other people have solved the problem. Incidentally, here is a link to the referenced problem: http://projecteuler.net/index.php?section=problems&id=9

In addition to what everyone else has said about generating fewer elements in the generators, you can also switch to using Int instead of Integer as the type of the numbers. The default is Integer, but your numbers are small enough to fit in an Int.
(Also, to nitpick, Haskell list comprehensions have no speed. Haskell is a language definition with very little operational semantics. A particular Haskell implementation might have slow list comprehensions, though.)

Is this a clever or stupid way to do an integer divide function?

I'm a Computer Science major, interested in how assembly languages handle a integer divide function. It seems that simply adding up to the numerator, while giving both the division and the mod, is way too impractical, so I came up with another way to divide using bit shifting, subtracting, and 2 look up tables.
Basically, the function takes the denominator, and makes "blocks" based on the highest power of 2. So dividing by 15 makes binary blocks of 4, dividing by 5 makes binary blocks of 3, etc. Then generate the first 2^block-size multiple of the denominator. For each multiple, write the values AFTER the first block into the look up table, keyed by the value of the first block.
Example: Multiples of 5 in binary - block size 3 (octal)
000 000 **101** - 5 maps to 0
000 001 **010** - 2 maps to 1
000 001 **111** - 7 maps to 1
000 010 **100** - 4 maps to 2
000 011 **001** - 1 maps to 3
000 011 **110** - 6 maps to 3
000 100 **011** - 3 maps to 4
000 101 **000** - 0 maps to 5
So the actual procedure involves getting the first block, left bit-shifting over the first block, and subtracting the value that the blocks maps to. If the resulting number comes out to 0, then it's perfectly divisible, and if the value becomes negative, it's not.
If you add another enumeration look up table, where you map the values to a counter as they come in, you can calculate the result of the division!
Example: Multiples of 5 again
5 maps to 1
2 maps to 2
7 maps to 3
4 maps to 4
1 maps to 5
6 maps to 6
3 maps to 7
0 maps to 8
Then all that's left is mapping every block to the counter-table, and you have your answer.
There are a few problems with this method.
If the answer isn't perfectly divisible, then the function returns back junk.
For high Integer values, this won't work, because a 5 block size will get truncated at the end of a 32 bit or 64 bit integer.
It's about 100 times slower than the standard division in C.
If the denominator is a factor of the divisor, then your blocks must map to multiple values, and you need even more tables. This can be solved with prime factorization, but all the methods I've read about easy/quick prime factorization involve dividing, defeating the purpose of this.
So I have 2 questions: First, is there an algorithm similar to this out there already? I've looked around, and I can't seem to find any like it. Second, How do actual assembly languages handle Integer division?
Sorry if there are any formatting mistake, this is my first time posting to stack overflow.

Sorry i answer so late. Ok, first regarding the commenters of your question: they think you are trying to do what the assembly memonic DIV or IDIV achieves by using different instructions in assembly. To me it seems you want to know how the op-codes that are selected by DIV and IDIV achieve division in hardware. To my knowledge Intel uses the SRT algorithm (uses a lookup-table) and AMD uses the Goldschmidt algorithm. I think what you are doing is similar to SRT. You can take a look at both of them here:
http://en.wikipedia.org/wiki/Division_%28digital%29

Aggregating automatically-generated feature vectors

I've got a classification system, which I will unfortunately need to be vague about for work reasons. Say we have 5 features to consider, it is basically a set of rules:
A B C D E Result
1 2 b 5 3 X
1 2 c 5 4 X
1 2 e 5 2 X
We take a subject and get its values for A-E, then try matching the rules in sequence. If one matches we return the first result.
C is a discrete value, which could be any of a-e. The rest are just integers.
The ruleset has been automatically generated from our old system and has an extremely large number of rules (~25 million). The old rules were if statements, e.g.
result("X") if $A >= 1 && $A <= 10 && $C eq 'A';
As you can see, the old rules often do not even use some features, or accept ranges. Some are more annoying:
result("Y") if ($A == 1 && $B == 2) || ($A == 2 && $B == 4);
The ruleset needs to be much smaller as it has to be human maintained, so I'd like to shrink rule sets so that the first example would become:
A B C D E Result
1 2 bce 5 2-4 X
The upshot is that we can split the ruleset by the Result column and shrink each independently. However, I cannot think of an easy way to identify and shrink down the ruleset. I've tried clustering algorithms but they choke because some of the data is discrete, and treating it as continuous is imperfect. Another example:
A B C Result
1 2 a X
1 2 b X
(repeat a few hundred times)
2 4 a X
2 4 b X
(ditto)
In an ideal world, this would be two rules:
A B C Result
1 2 * X
2 4 * X
That is: not only would the algorithm identify the relationship between A and B, but would also deduce that C is noise (not important for the rule)
Does anyone have an idea of how to go about this problem? Any language or library is fair game, as I expect this to be a mostly one-off process. Thanks in advance.

Check out the Weka machine learning lib for Java. The API is a little bit crufty but it's very useful. Overall, what you seem to want is an off-the-shelf machine learning algorithm, which is exactly what Weka contains. You're apparently looking for something relatively easy to interpret (you mention that you want it to deduce the relationship between A and B and to tell you that C is just noise.) You could try a decision tree, such as J48, as these are usually easy to visualize/interpret.

Twenty-five million rules? How many features? How many values per feature? Is it possible to iterate through all combinations in practical time? If you can, you could begin by separating the rules into groups by result.
Then, for each result, do the following. Considering each feature as a dimension, and the allowed values for a feature as the metric along that dimension, construct a huge Karnaugh map representing the entire rule set.
The map has two uses. One: research automated methods for the Quine-McCluskey algorithm. A lot of work has been done in this area. There are even a few programs available, although probably none of them will deal with a Karnaugh map of the size you're going to make.
Two: when you have created your final reduced rule set, iterate over all combinations of all values for all features again, and construct another Karnaugh map using the reduced rule set. If the maps match, your rule sets are equivalent.
-Al.

You could try a neural network approach, trained via backpropagation, assuming you have or can randomly generate (based on the old ruleset) a large set of data that hit all your classes. Using a hidden layer of appropriate size will allow you to approximate arbitrary discriminant functions in your feature space. This is more or less the same idea as clustering, but due to the training paradigm should have no issue with your discrete inputs.
This may, however, be a little too "black box" for your case, particularly if you have zero tolerance for false positives and negatives (although, it being a one-off process, you get an arbitrary degree of confidence by checking a gargantuan validation set).

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio