Asymptotically, Seq performs the same as or better than [] for every operation. But since its structure is more complicated than that of a list, its constant overhead will probably make it slower at small sizes. I'd like to know how much, in particular:
How much slower is <| compared to :?
How much slower is folding over/traversing Seq compared to folding over/traversing [] (excluding the cost of a folding/traversing function)?
What is the size (approximately) for which \xs x -> xs ++ [x] becomes slower than |>?
What is the size (approximately) for which ++ becomes slower than ><?
What's the cost of calling viewl and pattern matching on the result compared to pattern matching on a list?
How much memory does an n-element Seq occupy compared to an n-element list? (Not counting the memory occupied by the elements, only the structure.)
I know that it's difficult to measure, since with Seq we talk about amortized complexity, but I'd like to know at least some rough numbers.
This should be a start - http://www.haskell.org/haskellwiki/Performance#Data.Sequence_vs._lists
A sequence uses between 5/6 and 4/3 times as much space as the equivalent list (assuming an overhead of one word per node, as in GHC). If only deque operations are used, the space usage will be near the lower end of the range, because all internal nodes will be ternary. Heavy use of split and append will result in sequences using approximately the same space as lists. In detail:
a list of length n consists of n cons nodes, each occupying 3 words.
a sequence of length n has approximately n/(k-1) nodes, where k is the average arity of the internal nodes (each 2 or 3). There is a pointer, a size and overhead for each node, plus a pointer for each element, i.e. n(3/(k-1) + 1) words.
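Plugging the two extremes into that formula gives a quick check of the 5/6 to 4/3 range quoted above:

all internal nodes ternary (k = 3): n * (3/(3-1) + 1) = 2.5n words vs. 3n words for a list, a ratio of 5/6
all internal nodes binary (k = 2): n * (3/(2-1) + 1) = 4n words vs. 3n words for a list, a ratio of 4/3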
List is a non-trivial constant-factor faster for operations at the head (cons and head), making it a more efficient choice for stack-like and stream-like access patterns. Data.Sequence is faster for every other access pattern, such as queue and random access.
I have one more concrete result to add to the answer above. I am solving a Langevin equation, and I implemented the solver with both List and Data.Sequence. A lot of insertions at the back of the list/sequence happen in this solution.
To sum up, I did not see any improvement in speed; in fact, performance deteriorated with Seq. Moreover, with Data.Sequence I needed to increase the memory available to the Haskell RTS.
Since I am definitely not an authority on optimization, I post both cases below. I'd be glad to know if this can be improved. Both programs were compiled with the -O2 flag.
Solution with List, takes approx 13.01 sec
Solution with Data.Sequence, takes approx 15.13 sec
In my application, I have a double complex N*3 matrix (where N is several thousand) and a 3*1 vector, and I am forming an N*1 using zgemv.
The N*3 is a subsection of a larger M*3 matrix (where M is slightly larger than N, but of the same order of magnitude).
Each thread must perform a zgemv call to a different subsection of the larger matrix. That is, the N*3 is different for every thread. But all of the N*3 are formed from some portion of the larger M*3.
There isn't enough memory for each thread to store an independent N*3. Furthermore, the M*3 is too large to fit in shared memory. Thus each thread must pull its data from a single copy of the M*3. How can I do this without millions of threads serializing memory reads to the same memory locations in the M*3? Is there a more efficient way to approach this?
Probably, based on what I can gather so far, there are 2 types of optimizations I would want to consider:
convert operations that use the same N subset to a matrix-matrix multiply (zgemm), instead of multiple zgemv operations.
cache-block for the GPU L2 cache.
I'll discuss these in reverse order using these numbers for discussion:
M: ~10,000
N: ~3,000
cublas zgemv calls: ~1e6
"typical" Kepler L2: 1.5MB
An Nx3 matrix holds approximately 10,000 elements (N x 3 ≈ 3,000 x 3), each of which is 16 bytes, so call it 160 KB. We could therefore store ~5-10 of these subsets in a memory size comparable to the L2 cache size (without taking into account overlap of subsets, which would increase the residency of subsets in L2).
There are (M-N) possible unique contiguous N-row subsets in the M matrix. There are 1e6 zgemv calls, so on average each subset gets re-used 1e6/(M-N) times, approximately 100-150 times each. We could store about 10 of these subsets in the proposed L2, so we could "chunk" our 1e6 calls into "chunks" of ~1,000 calls that all operate out of the same data set.
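Spelling that arithmetic out with the numbers above:

unique contiguous N-row subsets: M - N ≈ 10,000 - 3,000 = 7,000
average reuse per subset: 1e6 / 7,000 ≈ 140 calls
size of one N x 3 subset: ~3,000 * 3 * 16 bytes ≈ 144 KB, so roughly 10 subsets fit in a 1.5 MB L2
calls per cache-sized chunk: ~10 subsets * ~140 reuses ≈ 1,000-1,500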
Therefore the process I would follow would be:
transfer the M*3 matrix to the device
predetermine the N*3 subset needed by each thread.
sort or otherwise group like subsets together
divide the sorted sets into cache-sized blocks
for each block, launch a CDP kernel that will spawn the necessary zgemv calls
repeat the above step until all blocks are processed.
One might also wonder if such a strategy could be extended (with considerably more complexity) to L1/Texture. Unfortunately, I think CDP would confound your efforts to achieve this. It's pretty rare that people want to invest the effort to cache-block for L1 anyway.
To extend the above strategy to the gemm case, once you sort your zgemv operations by the particular N subset they require, you will have grouped like operations together. If the above arithmetic is correct, you will have on average around 100-150 gemv operations needed for each particular N-subset. You should group the corresponding vectors for those gemv operations into a matrix, and convert the 100-150 gemv operations into a single gemm operation.
This reduces your ~1e6 zgemv operations to ~1e4 zgemm operations. You can then still cache-block however many of these zgemm operations will be "adjacent" in M and fit in a single cache-block, into a single CDP kernel call, to benefit from L2 cache reuse.
Given the operational intensity of GEMM vs. GEMV, it might make sense to dispense with the complexity of CDP altogether and simply run a host loop that dispatches the ZGEMM call for each particular N subset. That host loop would iterate roughly (M-N) times, once per subset.
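To make that last suggestion concrete, here is a minimal host-side sketch (not your code - SubsetJob, d_M, the row offsets and the grouped vector blocks are placeholder names, and I'm assuming the usual cuBLAS column-major layout, so an N x 3 subsection of the M x 3 matrix is addressed in place as a pointer offset with lda = M):

#include <cublas_v2.h>
#include <cuComplex.h>
#include <vector>

// One entry per unique N-row subset: where it starts in the M x 3 matrix,
// and how many right-hand-side vectors were grouped for it (hypothetical bookkeeping).
struct SubsetJob {
    int rowOffset;                // starting row of this N x 3 subsection inside the M x 3 matrix
    int numVectors;               // how many 3 x 1 vectors were grouped for this subset (~100-150)
    const cuDoubleComplex* d_B;   // device pointer to the grouped 3 x numVectors input block
    cuDoubleComplex* d_C;         // device pointer to the N x numVectors output block
};

void runGroupedZgemm(cublasHandle_t handle,
                     const cuDoubleComplex* d_M,   // M x 3 matrix on the device, column-major, lda = Mrows
                     int Mrows, int Nrows,
                     const std::vector<SubsetJob>& jobs)
{
    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);

    for (const SubsetJob& job : jobs) {
        // C (N x numVectors) = A (N x 3) * B (3 x numVectors)
        // The N x 3 subsection is not copied; it is just d_M + rowOffset with lda = Mrows.
        cublasZgemm(handle,
                    CUBLAS_OP_N, CUBLAS_OP_N,
                    Nrows, job.numVectors, 3,
                    &one,
                    d_M + job.rowOffset, Mrows,
                    job.d_B, 3,
                    &zero,
                    job.d_C, Nrows);
    }
}

Each iteration replaces the ~100-150 zgemv calls for one subset with a single zgemm, and if the jobs are sorted by rowOffset, consecutive iterations still benefit from the L2 reuse discussed earlier.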
Why is there no information in Google / Wikipedia about the unrolled skip list, i.e. a combination of an unrolled linked list and a skip list?
Probably because it wouldn't typically give you much of a performance improvement, if any, and it would be somewhat involved to code correctly.
First, the unrolled linked list typically uses a pretty small node size. As the Wikipedia article says: " just large enough so that the node fills a single cache line or a small multiple thereof." On modern Intel processors, a cache line is 64 bytes. Skip list nodes have, on average, two pointers per node, which means an average of 16 bytes per node for the forward pointers. Plus whatever the data for the node is: 4 or 8 bytes for a scalar value, or 8 bytes for a reference (I'm assuming a 64 bit machine here).
So figure 24 bytes, total, for an "element." Except that the elements aren't fixed size. They have a varying number of forward pointers. So you either need to make each element a fixed size by allocating an array for the maximum number of forward pointers for each element (which for a skip list with 32 levels would require 256 bytes), or use a dynamically allocated array that's the correct size. So your element becomes, in essence:
struct UnrolledSkipListElement
{
    void* data;                                  // 64-bit pointer to the data item
    UnrolledSkipListElement** forward_pointers;  // dynamically allocated array of forward pointers
};
That would reduce your element size to just 16 bytes. But then you lose much of the cache-friendly behavior that you got from unrolling. To find out where you go next, you have to dereference the forward_pointers array, which is going to incur a cache miss, and therefore eliminate the savings you got by doing the unrolling. In addition, that dynamically allocated array of pointers isn't free: there's some (small) overhead involved in allocating that memory.
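For comparison, the fixed-size layout rejected above (an array sized for the maximum number of forward pointers) would look roughly like this - just a sketch, assuming the 32-level maximum and 64-bit pointers from earlier:

struct UnrolledSkipListElementFixed
{
    void* data;                                   // 8 bytes
    UnrolledSkipListElementFixed* forward[32];    // 32 * 8 = 256 bytes, mostly unused
};

At 264 bytes, a single element already spans more than four 64-byte cache lines, which defeats the purpose of packing several elements into one node.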
If you can find some way around that problem, you're still not going to gain much. A big reason for unrolling a linked list is that you must visit every node (up to the node you find) when you're searching it. So any time you can save with each link traversal adds up to very big savings. But with a skip list you make large jumps. In a perfectly organized skip list, for example, you could skip half the nodes on the first jump (if the node you're looking for is in the second half of the list). If your nodes in the unrolled skip list only contain four elements, then the only savings you gain will be at levels 0, 1, and 2. At higher levels you're skipping more than three nodes ahead and as a result you will incur a cache miss.
So the skip list isn't unrolled because it would be somewhat involved to implement and it wouldn't give you much of a performance boost, if any. And it might very well cause the list to be slower.
Linked list complexity is O(N)
Skip list complexity is O(Log N)
Unrolled linked list complexity can be calculated as follows:
O(N / (M/2) + log M) = O(2N/M + log M)
where M is the number of elements in a single node.
Because the log M term is not significant,
the unrolled linked list complexity is O(N/M).
If we combine a skip list with an unrolled linked list, the new complexity will be
O(log N + <some term from the unrolled linked list, such as N1/M>)
This means the "new" complexity will not be as much better as one might first think. The new complexity might even be worse than the original O(log N), and the implementation will be more complex as well. So the gain is questionable and rather dubious.
Also, since a single node will hold lots of data but only a single "forward" array, the "tree" will not be as well balanced, and this will ruin the O(log N) part of the equation.
Recently I have been working with combinations of words to make "phrases" in different languages, and there are a few things I could use some expert input on.
Defining some constants for this,
Depth (n) is on average 6-7.
The length of the input set is ~160 unique words.
Memory - generating all the combinations of 160 words wastes lots of space. I can abuse databases by writing them to disk, but then I take a hit in performance because I need to constantly wait for IO. The other trick is to generate the combinations on the fly, like a generator object.
Time - if I'm not wrong, n choose k gets big fast, following the formula n! / (k! * (n - k)!), which means that the input sets get huge quickly.
My question is thus:
Considering I have a function f(x) that takes a combination and applies a calculation that has a cost, e.g.
func f(x) {
    if query_mysql("text search query").value > 15 {
        return true
    }
    return false
}
How can I efficiently process and execute this function on a huge set of combinations?
Bonus question, can combinations be generated concurrently?
Update: I already know how to generate them conventionally; it's more a case of making it efficient.
One approach would be to first calculate how much parallelism you can get, based on the number of threads you've got. Let the number of threads be T, and split the work as follows:
sort the elements according to some total ordering.
Find the smallest number d such that Choose(n,d) >= T.
Find all combinations of depth exactly d (this is typically much cheaper than going to the full depth, and can be computed on one core).
Now, spread the work to your T cores, each getting a set of 'prefixes' (each prefix c is a combination of size d), and for each prefix, find all the suffixes whose 'smallest' element is 'bigger' than max(c) according to the total ordering.
This approach also translates nicely to the map-reduce paradigm:
map(words): // one mapper
    sort(words) // by some total ordering function
    generate all combinations of depth `d` exactly // d, NOT k!
    for each combination c produced:
        idx <- index in words of max(c)
        emit(c, words[idx+1:end])

reduce(c1, words): // T reducers
    combinations <- generate all combinations of size k-d from words
    for each c2 in combinations:
        c <- concat(c1, c2)
        emit(c, f(c))
Use one of the many known algorithms to generate combinations. Chase's Twiddle algorithm is one of the best known and is perfectly suitable. It captures its state in an array, so it can be restarted or seeded if you wish.
See Algorithm to return all combinations of k elements from n for lots more.
You can progress through your list at your own pace, using minimal memory and no disk IO. Generating each combination will take a microscopic amount of time compared to the 1 sec or so of your computation.
This algorithm (and many others) is easily adapted for parallel execution if you have the necessary skills.
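If you'd rather not implement Twiddle yourself, here is a sketch of the same idea - explicit, restartable state - using std::prev_permutation over a selection mask (this is not Chase's algorithm, just the simplest stateful enumerator; the names are mine):

#include <algorithm>
#include <string>
#include <vector>

// Calls visit(combo) for every k-element combination of items, one at a time.
// The mask string is the entire state, so it can be saved and used to resume later.
// Assumes k <= items.size().
template <typename T, typename Visit>
void forEachCombination(const std::vector<T>& items, std::size_t k, Visit visit)
{
    std::string mask(k, '1');
    mask.resize(items.size(), '0');   // "111...000" is the lexicographically largest arrangement
    do {
        std::vector<T> combo;
        for (std::size_t i = 0; i < items.size(); ++i)
            if (mask[i] == '1') combo.push_back(items[i]);
        visit(combo);                 // e.g. run f(x) here, or push combo onto a work queue
    } while (std::prev_permutation(mask.begin(), mask.end()));
}

Memory stays constant (just the mask and the current combination), and to parallelize you can hand different starting prefixes/masks to different workers, as in the earlier answer.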
I have two arrays, N and M. They are both arbitrarily sized, though N is usually smaller than M. I want to find out which elements of N also exist in M, in the fastest way possible.
To give you an example of one possible instance of the program, N is an array 12 units in size, and M is an array 1,000 units in size. I want to find which elements in N also exist in M. (There may not be any matches.) The more parallel the solution, the better.
I used to use a hash map for this, but it's not quite as efficient as I'd like it to be.
Typing this out, I just thought of running a binary search over M on size(N) independent threads (using CUDA). I'll see how that works, though other suggestions are welcome.
1000 is a very small number. Also, keep in mind that parallelizing a search will only give you speedup as the number of cores you have increases. If you have more threads than cores, your application will start to slow down again due to context switching and aggregating information.
A simple solution for your problem is to use a hash join. Build a hash table from M, then look up the elements of N in it (or vice versa; since both your arrays are small it doesn't matter much).
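A minimal single-threaded version of that hash join might look like this (a sketch; the int element type and the names are placeholders):

#include <unordered_set>
#include <vector>

// Returns the elements of n that also appear in m: build a hash table from m, probe with n.
std::vector<int> intersect(const std::vector<int>& n, const std::vector<int>& m)
{
    std::unordered_set<int> table(m.begin(), m.end());   // build side: O(|M|) expected
    std::vector<int> matches;
    for (int x : n)                                      // probe side: O(|N|) expected
        if (table.count(x)) matches.push_back(x);
    return matches;
}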
Edit: in response to your comment, my answer doesn't change too much. You can still speed up linearly only until your number of threads equals your number of processors, and not past that.
If you want to implement a parallel hash join, this would not be difficult. Start by building X-1 hash tables, where X is the number of threads/processors you have. Use a second hash function which returns a value modulo X-1 to determine which hash table each element should be in.
When performing the search, your main thread can apply the auxiliary hash function to each element to determine which thread to hand it off to for searching.
Just sort N. Then for each element of M, do a binary search for it over sorted N. Finding the M items in N is trivially parallel even if you do a linear search over an unsorted N of size 12.
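A sketch of that approach (again with placeholder types; the loop over M is the part you would split across threads or CUDA blocks):

#include <algorithm>
#include <vector>

std::vector<int> intersectBySearch(std::vector<int> n, const std::vector<int>& m)
{
    std::sort(n.begin(), n.end());                       // N is tiny (~12), so this cost is negligible
    std::vector<int> matches;
    for (int x : m)                                      // iterations are independent apart from collecting matches
        if (std::binary_search(n.begin(), n.end(), x))
            matches.push_back(x);
    return matches;
}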
I was wondering: how does one decide the resizing factor by which a dynamic array grows?
On Wikipedia and elsewhere I have always seen the capacity being increased by a factor of 2. Why 2? Why not 3? How does one decide this factor? If it is language dependent, I would like to know the answer for Java.
Actually in Java's ArrayList the formula to calculate the new capacity after a resize is:
newCapacity = (oldCapacity * 3)/2 + 1;
This means roughly a 1.5x factor.
I don't know the reason behind this particular number, but I hope someone has done a statistical analysis and found it to be a good compromise between space and computational overhead.
Quoting from Wikipedia:
As n elements are inserted, the capacities form a geometric progression. Expanding the array by any constant proportion ensures that inserting n elements takes O(n) time overall, meaning that each insertion takes amortized constant time. The value of this proportion a leads to a time-space tradeoff: the average time per insertion operation is about a/(a−1), while the number of wasted cells is bounded above by (a−1)n. The choice of a depends on the library or application: a = 3/2 and a = 2 are commonly used.
So it seems to be a good compromise between CPU time and wasted memory. I guess the "best" value depends on what your application does.
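For example, plugging numbers into the quoted formulas: with a = 2 you average about 2/(2-1) = 2 copy operations per insertion but may waste up to (2-1)n = n cells, while with a = 3/2 you average about (3/2)/(1/2) = 3 copies per insertion but waste at most n/2 cells.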
Would you waste more space than you actually use?
If not, the factor should be less than or equal to 2.
If you want it to be an integer so it is simple to work with, there is only one choice.
There is another difference between a growth rate of 2X and a growth rate of 1.5X that nobody here has discussed yet.
Each time we allocate a new buffer to increase our dynamic array capacity, we are building up a region of unused memory preceding the array. If the growth rate is too high, then this region cannot ever be reused in the array.
To visualize, let "X" represent memory cells used by our array, and "O" represent memory cells that we can no longer use. A growth rate of 2X looks like so:
[X] -> [OXX] -> [OOOXXXX] -> [OOOOOOOXXXXXXXX]
... notice that the preceding O's keep growing! In fact, with a 2X growth rate, we can never use that memory again in our array.
But, with a 1.5X growth multiplier (rounded down, but at least 1), the usage looks like:
[X] -> [OXX] -> [OOOXXX] -> [OOOOOOXXXX] -> [XXXXXX]
Wait a sec, we were able to reclaim the old space! That's because the size of the unused space caught up with the size of the array.
If you work out the math, the limit growth factor is Phi (or about 1.618). Anything larger than Phi, and you cannot reclaim the old space.
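Working that math out under the same model (the new buffer must fit into the hole left by all buffers older than the current one), with growth factor r and buffer sizes 1, r, r^2, ...:

hole available before allocation n: 1 + r + ... + r^(n-2) = (r^(n-1) - 1) / (r - 1)
reuse requires: r^n <= (r^(n-1) - 1) / (r - 1)
divide by r^(n-1) and let n grow: r <= 1 / (r - 1), i.e. r^2 - r - 1 <= 0

The positive root of r^2 - r - 1 = 0 is (1 + sqrt(5)) / 2 = Phi ≈ 1.618, which is exactly the limit stated above.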