Massive Performance Problem - Using Channels in Julia - performance

Summary
Benchmarking times for Channels in Julia - using a ~5GB tsv file
Baseline: Bash tools (cat, grep - baseline written in C)
~ 2 seconds
Julia: Simple loop with eachline
~ 4-5 seconds (2nd-run, not pre-compilation, etc)
Julia Channel implementation
~ 11 seconds (2nd-run, not pre-compilation, etc)
Also:
Pure Python
~ 4-5 seconds
Longer Explaination
I have been working towards making the most performant/standard type of multiprocessing design pattern wherein data is either streamed from disk or a download stream, pieces are fed to all cores on the system, and then the output from this is serialized to disk.
This is obviously a hugely important design to get right, since most programming tasks fall within this description.
Julia seems like a great choice for this due to it's supposed ability to be performant.
In order to serialize the IO to/from disk or download and then to send data to each processor, Channels seem to be the suggested choice by Julia.
However, my tests so far seem to indicate that this is extremely non-performant.
The simplest example displays just how exceedingly slow Channels (and Julia!) are at this.
It's been very disappointing.
A simple example of grep and cat (removing multiprocessing bits for clarity):
Julia code:
using CodecZlib: GzipDecompressorStream
using TranscodingStreams: NoopStream
"""
A simple function to "generate" (place into a Channel) lines from a file
- This mimics python-like behavior of 'yield'
"""
function cat_ch(fpath)
Channel() do ch
codec = endswith(fpath, ".gz") ? GzipDecompressorStream : NoopStream
open(codec, fpath, "r") do stream
for (i, l) in enumerate(eachline(stream))
put!(ch, (i, l))
end
end
end
end
function grep_ch(line_chnl, searchstr)
Channel() do ch
for (i, l) in line_chnl
if occursin(searchstr, l)
put!(ch, (i, l))
end
end
end
end
function catgrep_ch(fpath, search)
for (i, l) in grep_ch(cat_ch(fpath), search)
println((i, l))
end
end
function catgrep(fpath, search)
codec = endswith(fpath, ".gz") ? GzipDecompressorStream : NoopStream
open(codec, fpath, "r") do stream
for (i, l) in enumerate(eachline(stream))
if occursin(search, l)
println((i,l))
end
end
end
end
if abspath(PROGRAM_FILE) == #__FILE__
fpath = ARGS[1]
search = ARGS[2]
catgrep_ch(fpath, search)
end
Performance Benchmarks
1) Baseline:
user#computer>> time (cat bigfile.tsv | grep seachterm)
real 0m1.952s
user 0m0.205s
sys 0m2.525s
3) Without Channels (Simple) in Julia:
julia> include("test1.jl")
julia> #time catgrep("bigfile.tsv", "seachterm")
4.448542 seconds (20.30 M allocations: 10.940 GiB, 5.00% gc time)
julia> #time catgrep("bigfile.tsv", "seachterm")
4.512661 seconds (20.30 M allocations: 10.940 GiB, 4.87% gc time)
So, it's like 2-3x worse, in the most simplistic possible case. Nothing fancy is done here at all, and it isn't due to pre-compilation.
3) Channels in Julia:
julia> #time catgrep_ch("bigfile.tsv", "seachterm")
11.691557 seconds (65.45 M allocations: 12.140 GiB, 3.06% gc time, 0.80% compilation time)
julia> #time catgrep_ch("bigfile.tsv", "seachterm")
11.403931 seconds (65.30 M allocations: 12.132 GiB, 3.03% gc time)
This is really horrible, and I'm not sure how it becomes so sluggish.
Is the way in which Channels are used here wrong?

Julia, grep and Python uses different algorithms when it comes to string searching. There are many algorithm and some are far better than others in specific cases.
grep is highly optimized so to run quickly in many situation including in your specific use-case. Indeed, according to the GNU documentation, the Boyer-Moore fast string searching algorithm is used to match a single fixed pattern, and the Aho-Corasick algorithm is used to match multiple fixed patterns. In your specific use-case, Boyer-Moore is select and it is generally fast since it can skip part of the input based on the searched string. Its best-case complexity is Ω(n/m) and its worst-case complexity is O(mn). It is extremely fast if the text rarely contains characters of the searched string. For example, searching seachterm in this is a test with a pretty long sentence (repeated 58.5 million times) is 10 times faster than searching iss while both are not present in the target file. This is because Boyer-Moore search for the last letter of the searched string (a m) in the text and cannot find it so it can be very fast. There are other reasons explaining why grep is so fast compared to most alternative methods. One of them is that grep do not create/allocate sub-strings for each lines and use a huge raw buffer instead. Note that cat bigfile.tsv | grep seachterm can be significantly slower than grep seachterm bigfile.tsv since the pipe introduce a significant overhead when the parsing is fast enough.
CPython uses a mix of different algorithms so be efficient in most cases. Based on the implementation, they use a mix of the Boyer-Moore algorithm "incorporating ideas of Horspool and Sunday". They claim the resulting algorithm to be faster than other algorithm like Knuth-Morris-Pratt for example. For long strings, they use an even faster algorithm that is very efficient: the Crochemore and Perrin's Two-Way algorithm (a mix of BM and KMP). This one runs in O(n+m) in the worst case which is optimal. Note that while this implementation is great, splitting lines of a file and creating many string object can significantly decrease the performance. THis is certainly why your python implementation is not so fast compared to grep.
In Julia code, the file splitting in lines which introduces a significant overhead and put a pressure on the garbage-collector. Furthermore, occursin do not seems particularly optimized. There is no comment in the code about which algorithm is used. That being said, it looks like a naive generic brute-force algorithm running it O(mn) time. Such a code cannot compete with optimized implementations of efficient algorithms like the one in Python and grep.
Channels are a bit similar to coroutines and fibers (or any "light threads") with a FIFO queue so to manage messages. Such a construct introduces a significant overhead due to expensive software-defined context-switches (aka yield that mainly consists in saving/restoring some registers). The negative effect on performance can be delayed. Indeed, light threading systems have their own stack and they own code context. Thus, when the processor do a light-thread context switch, this can cause data/code cache-misses. For more information about how channels you can read the documentation about it (which mention an embedded task scheduler) or directly read the code.
In addition, channels create objects/message than needs to be managed by the garbage collector putting even more pressure on it. In fact, the number of allocation is >3 times bigger in the channel based version. One can argue that the reported GC overhead is low but such metrics often underestimate the overall overhead which include allocations, memory diffusion/fragmentation, GC collections, cache-effects, etc. (and, in this case, even I/O overlapping effects).
I think the main problem with the channel-based implementation is that the channel of your code are unbuffered (see the documentation about it). Using wide buffers can help to significantly reduce the number of context-switches and so the overhead. This may increase the latency but there is often a trade-off to make between latency and throughput (especially in scheduling). Alternatively, note that there are some packages that can be faster than built-ins channels.

Edit (with regard to new info from #chase)
#chase as far as I understand you are comparing performance of yield in Python which is a generator for non materialized lists vs a Channel in Julia which is a FIFO queue which supports for multi-threaded insertion and polling of elements. In this case this you are comparing two very different things (like apples-to-oranges).
If your goal is the implementation of processing similar in ideas to grep have a look at the performance tips below.
Performance tips
Channel will add a big overhead like any additional communication layer.
If you need performance you need to:
Use either #distributed or Threads.#threads to create parallel workers
Each worker opens the file for reading
Use seek to allocate their location (e.g. having a 1000 byte of file and 2 workers the first one starts at byte 0 and the second does seek(500).
Remember to implement mechanism in such a way that you handle situation that your worker gets data in the middle of the line
Operate directly on raw bytes rather than String (for performance)

Related

Incremental text file processing for parallel processing

I'm at the first experience with the Julia language, and I'm quite surprises by its simplicity.
I need to process big files, where each line is composed by a set of tab separated strings. As a first example, I started by a simple count program; I managed to use #parallel with the following code:
d = open(f)
lis = readlines(d)
ntrue = #parallel (+) for li in lis
contains(li,s)
end
println(ntrue)
close(d)
end
I compared the parallel approach against a simple "serial" one with a 3.5GB file (more than 1 million lines). On a 4-cores Intel Xeon E5-1620, 3.60GHz, with 32GB of RAM, What I've got is:
Parallel = 10.5 seconds; Serial = 12.3 seconds; Allocated Memory = 5.2
GB;
My first concern is about memory allocation; is there a better way to read the file incrementally in order to lower the memory allocation, while preserving the benefits of parallelizing the processing?
Secondly, since the CPU gain related to the use of #parallel is not astonishing, I'm wondering if it might be related to the specific case itself, or to my naive use of the parallel features of Julia? In the latter case, what would be the right approach to follow? Thanks for the help!
Your program is reading all of the file into memory as a large array of strings at once. You may want to try a serial version that processes the lines one at a time instead (i.e. streaming):
const s = "needle" # it's important for this to be const
open(f) do d
ntrue = 0
for li in eachline(d)
ntrue += contains(li,s)
end
println(ntrue)
end
This avoids allocating an array to hold all of the strings and avoids allocating all of string objects at once, allowing the program to reuse the same memory by periodically reclaiming it during garbage collection. You may want to try this and see if that improves the performance sufficiently for you. The fact that s is const is important since it allows the compiler to predict the types in the for loop body, which isn't possible if s could change value (and thus type) at any time.
If you still want to process the file in parallel, you will have to open the file in each worker and advance each worker's read cursor (using the seek function) to an appropriate point in the file to start reading lines. Note that you'll have to be careful to avoid reading in the middle of a line and you'll have to make sure each worker does all of the lines assigned to it and no more – otherwise you might miss some instances of the search string or double count some of them.
If this workload isn't just an example and you actually just want to count the number of lines in which a certain string occurs in a file, you may just want to use the grep command, e.g. calling it from Julia like this:
julia> s = "boo"
"boo"
julia> f = "/usr/share/dict/words"
"/usr/share/dict/words"
julia> parse(Int, readchomp(`grep -c -F $s $f`))
292
Since the grep command has been carefully optimized over decades to search text files for lines matching certain patterns, it's hard to beat its performance. [Note: if it's possible that zero lines contain the pattern you're looking for, you will want to wrap the grep command in a call to the ignorestatus function since the grep command returns an error status code when there are no matches.]

polyfit on GPUArray is extremely slow [duplicate]

function w=oja(X, varargin)
% get the dimensionality
[m n] = size(X);
% random initial weights
w = randn(m,1);
options = struct( ...
'rate', .00005, ...
'niter', 5000, ...
'delta', .0001);
options = getopt(options, varargin);
success = 0;
% run through all input samples
for iter = 1:options.niter
y = w'*X;
for ii = 1:n
% y is a scalar, not a vector
w = w + options.rate*(y(ii)*X(:,ii) - y(ii)^2*w);
end
end
if (any(~isfinite(w)))
warning('Lost convergence; lower learning rate?');
end
end
size(X)= 400 153600
This code implements oja's rule and runs slow. I am not able to vectorize it any more. To make it run faster I wanted to do computations on the GPU, therefore I changed
X=gpuArray(X)
But the code instead ran slower. The computation used seems to be compatible with GPU. Please let me know my mistake.
Profile Code Output:
Complete details:
https://drive.google.com/file/d/0B16PrXUjs69zRjFhSHhOSTI5RzQ/view?usp=sharing
This is not a full answer on how to solve it, but more an explanation why GPUs does not speed up, but actually enormously slow down your code.
GPUs are fantastic to speed up code that is parallel, meaning that they can do A LOT of things at the same time (i.e. my GPU can do 30070 things at the same time, while a modern CPU cant go over 16). However, GPU processors are very slow! Nowadays a decent CPU has around 2~3Ghz speed while a modern GPU has 700Mhz. This means that a CPU is much faster than a GPU, but as GPUs can do lots of things at the same time they can win overall.
Once I saw it explained as: What do you prefer, A million dollar sports car or a scooter? A million dolar car or a thousand scooters? And what if your job is to deliver pizza? Hopefully you answered a thousand scooters for this last one (unless you are a scooter fan and you answered the scooters in all of them, but that's not the point). (source and good introduction to GPU)
Back to your code: your code is incredibly sequential. Every inner iteration depends in the previous one and the same with the outer iteration. You can not run 2 of these in parallel, as you need the result from one iteration to run the next one. This means that you will not get a pizza order until you have delivered the last one, thus what you want is to deliver 1 by 1, as fast as you can (so sports car is better!).
And actually, each of these 1 line equations is incredibly fast! If I run 50 of them in my computer I get 13.034 seconds on that line which is 1.69 microseconds per iteration (7680000 calls).
Thus your problem is not that your code is slow, is that you call it A LOT of times. The GPU will not accelerate this line of code, because it is already very fast, and we know that CPUs are faster than GPUs for these kind of things.
Thus, unfortunately, GPUs suck for sequential code and your code is very sequential, therefore you can not use GPUs to speed up. An HPC will neither help, because every loop iteration depends in the previous one (no parfor :( ).
So, as far I can say, you will need to deal with it.

Fortran unformatted I/O optimization

I'm working on a set of Fortran programs that are heavily I/O bound, and so am trying to optimize this. I've read at multiple places that writing entire arrays is faster than individual elements, i.e. WRITE(10)arr is faster than DO i=1,n; WRITE(10) arr(i); ENDDO. But, I'm unclear where my case would fall in this regard. Conceptually, my code is something like:
OPEN(10,FILE='testfile',FORM='UNFORMATTED')
DO i=1,n
[calculations to determine m values stored in array arr]
WRITE(10) m
DO j=1,m
WRITE(10) arr(j)
ENDDO
ENDDO
But m may change each time through the DO i=1,n loop such that writing the whole array arr isn't an option. So, collapsing the DO loop for writing would end up with WRITE(10) arr(1:m), which isn't the same as writing the whole array. Would this still provide a speed-up to writing, what about reading? I could allocate an array of size m after the calculations, assign the values to that array, write it, then deallocate it, but that seems too involved.
I've also seen differing information on implied DO loop writes, i.e. WRITE(10) (arr(j),j=1,m), as to whether they help/hurt on I/O overhead.
I'm running a couple of tests now, and intend to update with my observations. Other suggestions on applicable
Additional details:
The first program creates a large file, the second reads it. And, no, merging the two programs and keeping everything in memory isn't a valid option.
I'm using unformatted I/O and have access to the Portland Group and gfortran compilers. It's my understanding the PG's is generally faster, so that's what I'm using.
The output file is currently ~600 GB, the codes take several hours to run.
The second program (reading in the file) seems especially costly. I've monitored the system and seen that it's mostly CPU-bound, even when I reduce the code to little more than reading the file, indicating that there is very significant CPU overhead on all the I/O calls when each value is read in one-at-a-time.
Compiler flags: -O3 (high optimization) -fastsse (various performance enhancements, optimized for SSE hardware) -Mipa=fast,inline (enables aggressive inter-procedural analysis/optimization on compiler)
UPDATE
I ran the codes with WRITE(10) arr(1:m) and READ(10) arr(1:m). My tests with these agreed, and showed a reduction in runtime of about 30% for the WRITE code, the output file is also slightly less than half the original's size. For the second code, reading in the file, I made the code do basically nothing but read the file to compare pure read time. This reduced the run time by a factor of 30.
If you use normal unformatted (record-oriented) I/O, you also write a record marker before and after the data itself. So you add eight bytes (usually) of overhead to each data item, which can easily (almost) double the data written to disc if your number is a double precision. The runtime overhead mentioned in the other answers is also significant.
The argument above does not apply if you use unformatted stream.
So, use
WRITE (10) m
WRITE (10) arr(1:m)
For gfortran, this is faster than an implied DO loop (i.e. the solution WRITE (10) (arr(i),i=1,m)).
In the suggested solution, an array descriptor is built and passed to the library with a single call. I/O can then be done much more efficiently, in your case taking advantage of the fact that the data is contiguous.
For the implied DO loop, gfortran issues multiple library calls, with much more overhead. This could be optimized, and is subject of a long-standing bug report, PR 35339, but some complicated corner cases and the presence of a viable alternative have kept this from being optimized.
I would also suggest doing I/O in stream access, not because of the rather insignificant saving in space (see above) but because keeping up the leading record marker up to date on writing needs a seek, which is additional effort.
If your data size is very large, above ~ 2^31 bytes, you might run into different behavior with record markers. gfortran uses subrecords in this case (compatible to Intel), but it should just work. I don't know what Portland does in this case.
For reading, of course, you can read m, then allocate an allocatable array, then read the whole array in one READ statement.
The point of avoiding outputting an array by looping over multiple WRITE() operations is to avoid the multiple WRITE() operations. It's not particularly important that the data being output are all the members of the array.
Writing either an array section or a whole array via a single WRITE() operation is a good bet. An implied DO loop cannot be worse than an explicit outer loop, but whether it's any better is a question of compiler implementation. (Though I'd expect the implied-DO to be better than an outer loop.)

Lazy Evaluation: Why is it faster, advantages vs disadvantages, mechanics (why it uses less cpu; examples?) and simple proof of concept examples [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
Lazy evaluation is said to be a way of delaying a process until the first time it is needed. This tends to avoid repeated evaluations and thats why I would imagine that is performing a lot faster.
Functional language like Haskell (and JavaScript..?) have this functionality built-in.
However, I don't understand how and why other 'normal' approaches (that is; same functionality but not using lazy evaluation) are slower.. how and why do these other approaches do repeated evaluations? Can someone elaborate on this by giving simple examples and explaining the mechanics of each approach?
Also, according to Wikipedia page about lazy evaluation these are said to be the advantages of this approach:
Performance increases by avoiding needless calculations, and error
conditions in evaluating compound expressions
The ability to construct potentially infinite data structures
The ability to define control flow (structures) as abstractions
instead of primitives
However, can we just control the calculations needed and avoid repeating the same ones? (1)
We can use i.e. a Linked List to create an infinite data structure (2)
Can we do (3) already..??? We can define classes/templates/objects and use those instead of primitives (i.e JavaScript).
Additionally, it seems to me that (at least from the cases i have seen), lazy evaluation goes hand-to-hand with recursion and using the 'head' and 'tail' (along with others) notions. Surely, there are cases where recursion is useful but is lazy evaluation something more than that...? more than a recursive approach to solving a problem..? Streamjs is JavaScript library that uses recursion along with some other simple operations (head,tail,etc) to perform lazy evaluation.
It seems i can't get my head around it...
Thanks in advance for any contribution.
I'll show examples in both Python 2.7 and Haskell.
Say, for example, you wanted to do a really inefficient sum of all the numbers from 0 to 10,000,000. You could do this with a for loop in Python as
total = 0
for i in range(10000000):
total += i
print total
On my computer, this takes about 1.3s to execute. If instead, I changed range to xrange (the generator form of range, lazily produces a sequence of numbers), it takes 1.2s, only slightly faster. However, if I check the memory used (using the memory_profiler package), the version with range uses about 155MB of RAM, while the xrange version uses only 1MB of RAM (both numbers not including the ~11MB Python uses). This is an incredibly dramatic difference, and we can see where it comes from with this tool as well:
Mem usage Increment Line Contents
===========================================
10.875 MiB 0.004 MiB total = 0
165.926 MiB 155.051 MiB for i in range(10000000):
165.926 MiB 0.000 MiB total += i
return total
This says that before we started we were using 10.875MB, total = 0 added 0.004MB, and then for i in range(10000000): added 155.051MB when it generated the entire list of numbers [0..9999999]. If we compare to the xrange version:
Mem usage Increment Line Contents
===========================================
11.000 MiB 0.004 MiB total = 0
11.109 MiB 0.109 MiB for i in xrange(10000000):
11.109 MiB 0.000 MiB total += i
return total
So we started with 11MB and for i in xrange(10000000): added only 0.109MB. This is a huge memory savings by only adding a single letter to the code. While this example is fairly contrived, it shows how not computing a whole list until the element is needed can make things a lot more memory efficient.
Python has iterators and generators which act as a sort of "lazy" programming for when you need to yield sequences of data (although there's nothing stopping you from using them for single values), but Haskell has laziness built into every value in the language, even user-defined ones. This lets you take advantage of things like data structures that won't fit in memory without having to program complicated ways around that fact. The canonical example would be the fibonacci sequence:
fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
which very elegantly expresses this famous sequence to define a recursive infinite list generating all fibonacci numbers. It's CPU efficient because all values are cached, so each element only has to be computed once (compared to a naive recursive implementation)1, but if you calculate too many elements your computer will eventually run out of RAM because you're now storing this huge list of numbers. This is an example where lazy programming lets you have CPU efficiency, but not RAM efficiency. There is a way around this, though. If you were to write
fib :: Int -> Integer
fib n = let fibs = 1 : 1 : zipWith (+) fibs (tail fibs) in fibs !! n
then this runs in near-constant memory, and does so very quickly, but memoization is lost as subsequent calls to fib have to recompute fibs.
A more complex example can be found here, where the author shows how to use lazy programming and recursion in Haskell to perform dynamic programming with arrays, a feat that most initially think is very difficult and requires mutation, but Haskell manages to do very easily with "tying the knot" style recursion. It results in both CPU and RAM efficiency, and does so in fewer lines than I'd expect in C/C++.
All this being said, there are plenty of cases where lazy programming is annoying. Often you can build up huge numbers of thunks instead of computing things as you go (I'm looking at you, foldl), and some strictness has to be introduced to attain efficiency. It also bites a lot of people with IO, when you read a file to a string as a thunk, close the file, and then try to operate on that string. It's only after the file is closed that the thunk gets evaluated, causing an IO error to occur and crashes your program. As with anything, lazy programming is not without its flaws, gotchas, and pitfalls. It takes time to learn how to work with it well, and to know what its limitations are.
1) By "naive recursive implementation", I mean implementing the fibonacci sequence as
fib :: Integer -> Integer
fib 0 = 1
fib 1 = 1
fib n = fib (n-1) + fib (n-2)
With this implementation, you can see the mathematical definition very clearly, it's very much in the style of inductive proofs, you show your base cases and then the general case. However, if I call fib 5, this will "expand" into something like
fib 5 = fib 4 + fib 3
= fib 3 + fib 2 + fib 2 + fib 1
= fib 2 + fib 1 + fib 1 + fib 0 + fib 1 + fib 0 + fib 1
= fib 1 + fib 0 + fib 1 + fib 1 + fib 0 + fib 1 + fib 0 + fib 1
= 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1
= 8
When instead we'd like to share some of those computations, that way fib 3 only gets computed once, fib 2 only gets computed once, etc.
By using a recursively defined list in Haskell, we can avoid this. Internally, this list is represented something like this:
fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
= 1 : 1 : zipWith (+) (f1:f2:fs) (f2:fs)
^--------------------^ ^ ^
^-------------------|-------|
= 1 : 1 : 2 : zipWith (+) (f2:f3:fs) (f3:fs)
^--------------------^ ^ ^
^-------------------|-------|
= 1 : 1 : 2 : 3 : zipWith (+) (f3:f4:fs) (f4:fs)
^--------------------^ ^ ^
^-------------------|-------|
So hopefully you can see the pattern forming here, as the list is build, it keeps pointers back to the last two elements generated in order to compute the next element. This means that for the nth element computed, there are n-2 additions performed. Even for the naive fib 5, you can see that there are more additions performed than that, and the number of additions will continue to grow exponentially. This definition is made possible through laziness and recursions, letting us turn an O(2^n) algorithm into an O(n) algorithm, but we have to give up RAM to do so. If this is defined at the top level, then values are cached for the lifetime of the program. It does mean that if you need to refer to the 1000th element repeatedly, you don't have to recompute it, just index it.
On the other hand, the definition
fib :: Int -> Integer
fib n =
let fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
in fibs !! n
uses a local copy of fibs every time fib is called. We don't get caching between calls to fib, but we do get local caching, leaving our complexity O(n). Additionally, GHC is smart enough to know that we don't have to keep the beginning of the list around after we've used it to calculate the next element, so as we traverse fibs looking for the nth element, it only needs to hold on to 2-3 elements and a thunk pointing at the next element. This saves us RAM while computing it, and since it isn't defined at a global level it doesn't eat up RAM over the lifetime of the program. It's a tradeoff between when we want to spend RAM and CPU cycles, and different approaches are better for different situations. These techniques are applicable to much of Haskell programming in general, not just for this sequence!
Lazy evaluation is not, in general, faster. When it's said that lazy evaluation is more efficient, it is because when you consider Lambda Calculus (which is essentially what your Haskell programs are once the compiler finishes de-sugaring them) as a system of terms and reduction rules, then applying those rules in the order specified by the rules of a call-by-name with sharing evaluation policy always applies the same or fewer reduction rules than when you follow the rules in the order specified by call-by-value evaluation.
The reason that this theoretical result does not make lazy evaluation faster in general is that the translation to a linear sequential machine model with a memory access bottleneck tends to make all the reductions performed much more expensive! Initial attempts at implementing this model on computers led to programs that executed orders of magnitude more slowly than typical eagerly-evaluating language implementations. It has taken a lot of research and engineering into techniques for implementing lazy evaluation efficiently to get Haskell performance to where it is today. And the fastest Haskell programs take advantage of a form of static analysis called "strictness analysis" which attempts to determine at compile time which expressions will always be needed so that they can be evaluated eagerly rather than lazily.
There are still some cases where straightforward implementations of algorithms will execute faster in Haskell due to only evaluating terms that are needed for the result, but even eager languages always have some facility for evaluating some expressions by need. Conditionals and short-circuiting boolean expressions are ubiquitous examples, and in many eager languages, one can also delay evaluation by wrapping an expression in an anonymous function or some other sort of delaying form. So you can typically use these mechanisms (or even more awkward rewrites) to avoid evaluating expensive things that won't be necessary in an eager language.
The real advantage of Haskell's lazy evaluation is not a performance-related one. Haskell makes it easier to pull expressions apart, re-combine them in different ways, and generally reason about code as if it were a system of mathematical equations instead of being a sequentially-evaluated set of machine instructions. By not specifying any evaluation order, it forced the developers of the language to avoid side-effects that rely on a simple evaluation ordering, such as mutation or IO. This in turn led to a host of elegant abstractions that are generally useful and might not have been developed into usability otherwise.
The state of Haskell is now such that you can write high-level, elegant algorithms that make better re-use of existing higher-order functions and data structures than in nearly any other high-level typed language. And once you become familiar with the costs and benefits of lazy evaluation and how to control when it occurs, you can ensure that the elegant code also performs very well. But getting the elegant code to a state of high performance is not necessarily automatic and may require a bit more thought than in a similar but eagerly-evaluated language.
The concept of "lazy evaluation" is only about 1 thing, and only about that 1 thing:
The ability to postpone evaluation of something until needed
That's it.
Everything else in that wikipedia article follows from it.
Infinite data structures? Not a problem. We'll just make sure we don't actually figure out what the next element is until you actually ask for it. For instance, asking some code what the next value after X is, if the operation to perform is just to increase X by 1, will be infite. If you create a list containing all those values, it's going to fill your available memory in the computer. If you only figure out what the next value is when asked, not so much.
Needless calculations? Sure. You can return an object containing a lot of properties that when asked will provide you with some value. If you don't ask (ie. never inspect the value of a given property), the calculation necessary to figure out the value of that property will never be done.
Control flow ... ? Not at all sure what that is about.
The purpose of lazy evaluation of something is exactly as I stated to begin with, to avoid evaluating something until you actually need it. Be it the next value of something, the value of a property, whatever, adding support for lazy evaluation might conserve CPU cycles.
What would the alternative be?
I want to return an object to the calling code, containing any number of properties, some of which might be expensive to calculate. Without lazy evaluation, I would have to calculate the values of all those properties either:
Before constructing the object
After constructing the object, on the first time you inspected a property
After constructing the object, every time you inspected that property
With lazy evaluation you usually end up with number 2. You postpone evaluating the value of that property until some code inspects it. Note that you might cache the value once evaluated, which would save CPU cycles when inspecting the same property more than once, but that is caching, not quite the same, but in the same line of work: optimizations.

Time taken in executing % / * + - operations

Recently, i heard that % operator is costly in terms of time.
So, the question is that, is there a way to find the remainder faster?
Also your help will be appreciated if anyone can tell the difference in the execution of % / * + - operations.
In some cases where you're using power-of-2 divisors you can do better with roll-your-own techniques for calculating remainder, but generally a halfway decent compiler will do the best job possible with variable divisors, or "odd" divisors that don't fit any pattern.
Note that a few CPUs don't even have a multiply operation, and so (on those) multiply is quite slow vs add (at least 64x for a 32-bit multiply). (But a smart compiler may improve on this if the multiplier is a literal.) A slightly larger number do not have a divide operation or have a pretty slow one. (On a CPU with a fast multiplier multiply may only be on the order of 4 times slower than add, but on "normal" hardware it's 16-32 times slower for a 32 bit operation. Divide is inherently 2-4x slower than multiply, but can be much slower on some hardware.)
The remainder operation is rarely implemented in hardware, and normally A % B maps to something along the lines of A - ((A / B) * B) (a few extra operations may be required to assure the proper sign, et al).
(I learned about this stuff while microprogramming the instruction set for the SUMC computer for RCA/NASA back in the early 70s.)
No, the compiler is going to implement % in the most efficient way possible.
In terms of speed, + and - are the fastest (and are equally fast, generally done by the same hardware).
*, /, and % are much slower. Multiplication is basically done by the method you learn in grade school- multiply the first number by every digit in the second number and add the results. With some hacks made possible by binary. As of a few years ago, multiply was 3x slower than add. Division should be similar to multiply. Remainder is similar to division (in fact it generally calculates both at once).
Exact differences depend on the CPU type and exact model. You'd need to look up the latencies in the CPU spec sheets for your particular machine.

Resources