Amortized Analysis of Algorithms

I am currently reading about amortized analysis. I am not able to fully understand how it differs from the normal analysis we perform to calculate the average or worst-case behaviour of algorithms. Can someone explain it with an example of sorting or something?

Amortized analysis gives the average performance (over time) of each operation in
the worst case.
In a sequence of operations the worst case does not occur in every operation: some operations may be cheap, some may be expensive. Therefore, a traditional worst-case per-operation analysis can give an overly pessimistic bound. For example, in a dynamic array only some inserts take linear time, while the others take constant time.
When different inserts take different times, how can we accurately calculate the total time? The amortized approach is going to assign an "artificial cost" to each operation in the sequence, called the amortized cost of an operation. It requires that the total real cost of the sequence should be bounded by the total of the amortized costs of all the operations.
Note, there is sometimes flexibility in the assignment of amortized costs.
Three methods are used in amortized analysis (a small worked example of the accounting method follows this list):
Aggregate Method (or brute force)
Accounting Method (or the banker's method)
Potential Method (or the physicist's method)
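As a quick worked illustration of the accounting method (the numbers are chosen just for this example): for a dynamic array that doubles when full, charge every insert an amortized cost of 3 dollars.
1 dollar pays for writing the new element itself.
2 dollars are banked toward the next resize: when the array grows from size k to 2k, the k/2 elements inserted since the previous resize have banked 2 dollars each, i.e. k dollars in total, exactly enough to pay for copying all k existing elements.
So the total real cost of n inserts is bounded by 3n dollars, and each insert is O(1) amortized.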
For instance, assume we’re sorting an array in which all the keys are distinct (this is the slowest case, and it takes the same amount of time as when they are not, provided we don’t do anything special with keys that equal the pivot).
Quicksort chooses a random pivot. The pivot is equally likely to be the smallest key,
the second smallest, the third smallest, ..., or the largest. For each key, the
probability is 1/n. Let T(n) be a random variable equal to the running time of quicksort on
n distinct keys. Suppose quicksort picks the ith smallest key as the pivot. Then we run quicksort recursively on a list of length i − 1 and on a list of
length n − i. It takes O(n) time to partition and concatenate the lists (let’s
say at most n dollars), so the running time is
T(n) ≤ n + T(i − 1) + T(n − i).
Here i is a random variable that can be any number from 1 (pivot is the
smallest key) to n (pivot is the largest key), each chosen with probability 1/n,
so
E[T(n)] ≤ n + (1/n) Σ_{i=1}^{n} (E[T(i − 1)] + E[T(n − i)]).
This equation is called a recurrence. The base cases for the recurrence are T(0) = 1 and T(1) = 1. This means that sorting a list of length zero or one takes at most one dollar (unit of time).
Solving this recurrence (for example, by guessing a bound and verifying it by induction) gives E[T(j)] ≤ 1 + 8j log_2 j. The expression 1 + 8j log_2 j might be an overestimate, but it doesn’t matter. The important point is that this proves that E[T(n)] is in O(n log n).
In other words, the expected running time of quicksort is in O(n log n).
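For concreteness, here is a minimal Python sketch of the variant analyzed above, quicksort with a uniformly random pivot (a throwaway illustration, not a tuned implementation):

import random

def quicksort(keys):
    # Assumes all keys are distinct, as in the analysis above.
    if len(keys) <= 1:
        return keys                # base cases: T(0) = T(1) = 1 dollar
    pivot = random.choice(keys)    # each key is the pivot with probability 1/n
    smaller = [k for k in keys if k < pivot]
    larger = [k for k in keys if k > pivot]
    # Partitioning and concatenating costs O(n) -- "at most n dollars".
    return quicksort(smaller) + [pivot] + quicksort(larger)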
Also there’s a subtle but important difference between amortized running time
and expected running time. Quicksort with random pivots takes O(n log n) expected running time, but its worst-case running time is in Θ(n^2). This means that there is a small
possibility that quicksort will cost Θ(n^2) dollars, but the probability that this
will happen approaches zero as n grows large.
Quicksort O(n log n) expected time
Quickselect Θ(n) expected time
For comparison, the comparison-based sorting lower bound is Ω(n log n).
Finally, you can find more information about quicksort average-case analysis here.

average - a probabilistic analysis; the average is taken over all of the possible inputs, and it is an estimate of the likely run time of the algorithm.
amortized - a non-probabilistic analysis, calculated in relation to a batch of calls to the algorithm.
example - dynamic sized stack:
say we define a stack of some size, and whenever we use up the space, we allocate twice the old size, and copy the elements into the new location.
overall our costs are:
O(1) per insertion/deletion
O(n) per insertion (allocation and copying) when the stack is full
so now we ask, how much time would n insertions take?
one might say O(n^2), however we don't pay O(n) for every insertion.
that would be pessimistic; the correct answer is O(n) time for n insertions, let's see why:
let's say we start with array size = 1.
ignoring copying, we would pay O(n) for the n insertions themselves.
now, we do a full copy only when the stack has these number of elements:
1,2,4,8,...,n/2,n
for each of these sizes we do a copy and alloc, so to sum the cost we get:
const*(1+2+4+8+...+n/4+n/2+n) = const*(n+n/2+n/4+...+8+4+2+1) <= const*n(1+1/2+1/4+1/8+...)
where (1+1/2+1/4+1/8+...) = 2
so we pay O(n) for all of the copying + O(n) for the actual n insertions
O(n) worst case for n operations -> O(1) amortized per operation.
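Here is a minimal Python sketch of that doubling stack (the names are mine, just for illustration); it counts the real cost of each push so you can check that n pushes cost O(n) total:

class DynamicStack:
    # Array-backed stack that doubles its capacity when full.
    def __init__(self):
        self.capacity = 1
        self.items = [None] * self.capacity
        self.size = 0
        self.total_cost = 0  # real cost in "dollars" (element writes)

    def push(self, value):
        if self.size == self.capacity:
            # Full: allocate twice the space and copy everything over.
            self.capacity *= 2
            new_items = [None] * self.capacity
            for i in range(self.size):
                new_items[i] = self.items[i]
                self.total_cost += 1   # one dollar per copied element
            self.items = new_items
        self.items[self.size] = value
        self.size += 1
        self.total_cost += 1           # one dollar for the write itself

s = DynamicStack()
for k in range(1000):
    s.push(k)
print(s.total_cost)  # below 3 * 1000: O(n) overall, O(1) amortized per push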

Related

Basic Confusion on MERGE SORT’s DIVIDE step

In Merge Sort, the running time of the Divide step is considered theta(1), i.e. constant time, in CLRS.
My confusion is :
Let divide be an elementary operation which takes constant time C1.
Now, consider the pseudocode (rendered below).
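(For reference, here is a Python rendering, from memory, of the standard CLRS-style MERGE-SORT; the exact book text may differ slightly:)

def merge(A, p, q, r):
    # COMBINE: merge the sorted runs A[p..q] and A[q+1..r] back into A[p..r].
    left, right = A[p:q + 1], A[q + 1:r + 1]
    i = j = 0
    for k in range(p, r + 1):
        if j >= len(right) or (i < len(left) and left[i] <= right[j]):
            A[k] = left[i]; i += 1
        else:
            A[k] = right[j]; j += 1

def merge_sort(A, p, r):
    if p < r:
        q = (p + r) // 2           # DIVIDE: compute the middle -- constant time C1
        merge_sort(A, p, q)        # recursively sort the left half
        merge_sort(A, q + 1, r)    # recursively sort the right half
        merge(A, p, q, r)          # merge the two sorted halves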
Now, when the MERGE-SORT function is called for the first time, it will take C1 to divide the input (note: we are not considering the time taken by the Merge function).
And MERGE-SORT will divide the input lg(n) times (or we can say it will call itself to a depth of lg(n)).
So the total time taken in dividing inputs should be C1 * lg(n),
that is, theta(lg(n)).
But in CLRS:
the Divide step is said to require theta(1) time (as a whole).
NOTE: lg is log to the base 2
P.S. - Sorry in advance, my English is not up to the mark. Edits are welcome :)
One step of division indeed takes constant time O(1).
But contrary to your thinking, there are Θ(N) divisions in total (1+2+4+8+...N/2).
Anyway, this hardly matters, as the total merging workload is Θ(N log N) and dominates.
The basic understanding of Merge Sort is as such - Consider we have two halves of an array which are already sorted. We can merge these two halves in O(n) time.
Mathematically speaking, the function Merge Sort calls itself twice on its two halves (the DIVIDE step), and then merges these two (the COMBINE step).
Now, if we solve this recurrence relation (by using appropriate mathematical analysis), we get:
T(n) = 2T(n/2) + O(n), which solves to T(n) = O(n log n).
This indicates that the DIVIDE step, which takes O(1) time, is called O(n) times, while the total workload of merging all the smaller halves of various sizes is O(n log n), which dominates. Hence the complexity of O(n log n) (thanks to Yves Daoust for this clarification).
However, the DIVIDE step takes O(1) time only, which is called O(n) times. Thus, the workload for division is O(n).
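If you want to convince yourself of those counts empirically, here is a throwaway instrumented sketch (the counter names are mine): it counts one divide per recursive call and one unit of merge work per comparison.

divides = 0
merge_work = 0

def msort(a):
    global divides, merge_work
    if len(a) <= 1:
        return a
    divides += 1                     # one O(1) divide per call
    mid = len(a) // 2
    left, right = msort(a[:mid]), msort(a[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        merge_work += 1              # one comparison of merge workload
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:]); merged.extend(right[j:])
    return merged

import random
msort([random.random() for _ in range(1024)])
print(divides, merge_work)   # divides is about n; merge_work is about n * log2(n)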

Quicksort Analysis

Question:
Here's a modification of quick sort: whenever we have ten items or fewer in a sublist, we sort the sublist using selection sort rather than further recursing with quicksort. Does this change the big-oh time complexity of quicksort? Explain.
In my opinion the big-oh time complexity would change. We know that selection sort is O(n^2) and therefore sorting the sublist of ten items or fewer would take O(n^2). Until we get to a sublist that has ten or fewer items we would use quicksort and keep partitioning the list. So in the end we would have O( nlogn + n^2) which is O(n^2).
Am I correct? If not, could someone explain why?
The reason that the time complexity is actually unaffected is that 10 is a constant term. No matter how large the total array is, it always takes a constant amount of time to sort subarrays of size 10 or less. If you are sorting a list with one million elements, that constant 10 is going to play a very small role in the actual time it takes to sort the list (most of the time will be spent partitioning the original array into subarrays recursively).
If sorting a list of 10 elements takes constant time, partitioning the array at each recursive step is linear, and you end up with O(n) subarrays of 10 items or fewer (each sorted in constant time), you end up with O(n log n + n), which is the same as O(n log n).
Saying that selection sort is O(n^2) means that the running time of the algorithm increases quadratically with the size of the input. Running selection sort on an array with a constant number of elements will always take constant time on its own, but if you were to compare the running time of selection sort on arrays of varying sizes, you would see a quadratic increase in the running time as input size varies linearly.
The big O complexity does not change. Please read up on the Master Method (aka Master Theorem) https://en.wikipedia.org/wiki/Master_theorem
If you think through the algorithm, as the size of the sorting list grows exceptionally large, the time to sort the final ten in any given recursion subtree will make an insignificant contribution to the overall running time.
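To make the modification concrete, here is a minimal Python sketch of the hybrid (the last-element pivot and the exact cutoff are illustrative choices, not part of the question):

CUTOFF = 10

def selection_sort(a, lo, hi):
    # Sort a[lo..hi] in place; fine for tiny ranges.
    for i in range(lo, hi):
        m = min(range(i, hi + 1), key=lambda k: a[k])
        a[i], a[m] = a[m], a[i]

def hybrid_quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if hi - lo + 1 <= CUTOFF:
        selection_sort(a, lo, hi)    # constant work per tiny sublist
        return
    pivot = a[hi]                    # last element as pivot (illustrative)
    i = lo
    for j in range(lo, hi):          # Lomuto partition: O(n) per level
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    hybrid_quicksort(a, lo, i - 1)
    hybrid_quicksort(a, i + 1, hi)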

what type of maths is required to understand algorithm time and space complexities?

I've been applying for jobs, and every time I hear questions about algorithm time/space complexity, I cringe and stumble. No matter how much I read, my brain seems to be programmed not to get any of it, and I think the reason comes down to the fact that I have a very low mathematical background due to skipping school. This may not be the usual S.O question, and may potentially even be removed for being fundamentally about the maths, but at least I'm hoping I'll figure out where to go next with this question.
I don't know why job people would go into that, so here are just a few examples. The whole "complexity" thing is just to provide an indication of how much time (or memory) the algorithm uses.
Now, if you have an array with values, accessing the value at a given index is O(1) -- constant. It doesn't matter how many elements are in the array; if you have an index, you can get the element directly.
If, on the other hand, you are looking for a specific value, you'll have no choice but to look at every element (at least until you find it, but that doesn't matter for the complexity thing). Thus, searching in an unsorted array is O(n): the runtime corresponds to the number of elements.
If, on the other hand, you have a sorted array, then you can do a "binary search", which would be O(log n). "Log n" is the base-2 logarithm, which is basically the inverse of 2^n. For example, 2^10 is 2*2*2*2...*2 10 times = 1024, and log2(1024) is 10. Thus, algorithms with O(log n) are generally considered pretty good: to find an element in a sorted array using binary search, if the array has up to 1024 elements then the binary search will only have to look at 10 of them to find any value. For 1025-2048 elements it would be 11 peeks at most, for 2049-4096 it would be 12, and so on. So, adding more elements will only slowly increase the runtime.
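For instance, here is a minimal binary search sketch (my own illustration in plain Python):

def binary_search(sorted_arr, target):
    lo, hi = 0, len(sorted_arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_arr[mid] == target:
            return mid               # found after at most ~log2(n) peeks
        elif sorted_arr[mid] < target:
            lo = mid + 1             # discard the lower half
        else:
            hi = mid - 1             # discard the upper half
    return -1                        # not present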
Of course, things can get a lot worse. A trivial sorting algorithm tends to be O(n^2), meaning that it needs 2^2 = 4 "operations" for an array with just 2 elements, 3^2 = 9 if the array has 3, 4^2 = 16 if the array has 4 elements and so on. Pretty bad actually, considering that an array with just 1000 elements would already require 1000*1000 = 1 million compares to sort. This is called quadratic growth, and of course it can get even worse: O(n^3), O(n^4) etc., polynomial growth of ever higher degree.
A "good" sorting algorithm is O(n*log n). Assuming an array with 1024 elements, this would be 1024*10=10240 compares --- much better than the 1 million we had before.
Just take these O(...) as indicators of the runtime behavior (or memory footprint if applied to memory). I did plug in real numbers so you can see how the numbers change, but those are not important, and usually these complexities are worst case. Nonetheless, by just looking at the numbers, "constant time" is obviously best, and quadratic or worse is always bad because the runtime (or memory use) skyrockets really fast.
EDIT: also, you are not really interested in constant factors; you don't usually see "O(2n)". This would still be "O(n)" -- the runtime relates directly to the number of elements.
To analyze the time/space complexity of an algorithm, high-school knowledge should be fine. I studied this at uni during my first semester and I was just fine.
The fields of interests for the basics are:
basic calculus (understanding what a "limit" and an asymptote are)
series calculations (especially the sum of an arithmetic progression, which is often used)
basic knowledge of combinatorics ("how many options are there to...?")
basics of polynomial arithmetic (p(x) + q(x) = ?)
The above is true for analyzing the complexity of algorithms. Calculating the complexity of problems is a much deeper field, which is still under active research: complexity theory. This requires extensive knowledge of set theory, theory of computation, advanced calculus, linear algebra and much more.
While knowing something about calculus, summing series, and discrete mathematics are all Good Things, from your question and from my limited experience in industry, I'd doubt that your interviewers are expecting that level of understanding.
In practice, you can make useful big-O statements about time and space complexity without having to do much mathematical thinking. Here are the basics, which I'll talk about in terms of time complexity just to make the language less abstract.
A big-O time complexity tells you how the worst-case running time of your algorithm scales with the size of its input. The actual numbers you get from the big-O function are an indication of the number of constant time operations your algorithm will perform on a given size of input.
A big-O function, therefore, simply counts the number of constant time operations your algorithm will perform.
A constant time operation is said to be O(1). [Note that any fixed length sequence of constant time operations is also O(1) since the sequence also takes a constant amount of time.]
O(k) = O(1), for any constant k.
If your algorithm performs several operations in series, you sum their costs.
O(f) + O(g) = O(f + g)
If your algorithm performs an operation multiple times, you multiply the cost of the operation by the number of times it is performed.
n * O(f) = O(n * f)
O(f) * O(f) * ... * O(f) = O(f^n), where there are n terms on the left hand side
A classic big-O function is log(n), which invariably corresponds to "the height of the balanced tree containing n items". You can get away with just knowing that sorting is O(n log(n)).
Finally, you only report the fastest growing term in a big-O function since, as the size of the input grows, this will dominate all the other terms. Any constant factors are also discarded, since we're only interested in the scaling properties of the result.
E.g., O(2(n^2) + n) = O(n^2).
Here are two examples.
Bubble Sorting n Items
Each traversal of the items sorts (at least) one item into place. We therefore need n traversals to sort all the items.
O(bubble-sort(n)) = n * O(traversal(n))
= O(n * traversal(n))
Each traversal of the items involves n - 1 adjacent compare-and-swap operations.
O(traversal(n)) = (n - 1) * O(compare-and-swap)
= O((n - 1) * compare-and-swap)
Compare-and-swap is a constant time operation.
O(compare-and-swap) = O(1)
Collecting our terms, we get:
O(bubble-sort(n)) = O(n * (n - 1) * 1)
= O(n^2 - n)
= O(n^2)
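To tie those terms to code, here is a plain bubble sort sketch (my own, deliberately naive):

def bubble_sort(items):
    n = len(items)
    for _ in range(n):                   # n traversals
        for i in range(n - 1):           # n - 1 compare-and-swaps per traversal
            if items[i] > items[i + 1]:  # compare: O(1)
                items[i], items[i + 1] = items[i + 1], items[i]  # swap: O(1)
    return items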
Merge Sorting n Items
Merge-sort works bottom-up, merging items into pairs, pairs into fours, fours into eights, and so on until the list is sorted. Call each such set of operations a "merge-traversal". There can be at most log_2(n) merge traversals since n = 2 ^ log_2(n) and at each level we are doubling the sizes of the sub-lists being merged. Therefore,
O(merge-sort(n)) = log_2(n) * O(merge-traversal(n))
= O(log_2(n) * merge-traversal(n))
Each merge-traversal traverses all the input data once. Each input item is the subject of at least one compare-and-select operation and each compare-and-select operation chooses one of a pair of items to "emit". Hence
O(merge-traversal(n)) = n * O(compare-and-select)
= O(n * compare-and-select)
Each compare-and-select operation takes constant time:
O(compare-and-select) = O(1)
Collecting terms, we get
O(merge-sort(n)) = O(log_2(n) * n * 1)
= O(n * log_2(n))
= O(n * log(n)), since change of log base is
multiplication by a constant.
Ta daaaa!

Intuitive explanation for why QuickSort is n log n?

Is anybody able to give a 'plain english' intuitive, yet formal, explanation of what makes QuickSort n log n? From my understanding it has to make a pass over n items, and it does this log n times... I'm not sure how to put into words why it does this log n times.
Complexity
A Quicksort starts by partitioning the input into two chunks: it chooses a "pivot" value, and partitions the input into those less than the pivot value and those larger than the pivot value (any equal to the pivot value have to go into one or the other, but for a basic description it doesn't matter much which those end up in).
Since the input (by definition) isn't sorted, to partition it like that, it has to look at every item in the input, so that's an O(N) operation. After it's partitioned the input the first time, it recursively sorts each of those "chunks". Each of those recursive calls looks at every one of its inputs, so between the two calls it ends up visiting every input value (again). So, at the first "level" of partitioning, we have one call that looks at every input item. At the second level, we have two partitioning steps, but between the two, they (again) look at every input item. Each successive level has more individual partitioning steps, but in total the calls at each level look at all the input items.
It continues partitioning the input into smaller and smaller pieces until it reaches some lower limit on the size of a partition. The smallest that could possibly be would be a single item in each partition.
Ideal Case
In the ideal case we hope each partitioning step breaks the input in half. The "halves" probably won't be precisely equal, but if we choose the pivot well, they should be pretty close. To keep the math simple, let's assume perfect partitioning, so we get exact halves every time.
In this case, the number of times we can break it in half will be the base-2 logarithm of the number of inputs. For example, given 128 inputs, we get partition sizes of 64, 32, 16, 8, 4, 2, and 1. That's 7 levels of partitioning (and yes log2(128) = 7).
So, we have log(N) partitioning "levels", and each level has to visit all N inputs. So, log(N) levels times N operations per level gives us O(N log N) overall complexity.
Worst Case
Now let's revisit that assumption that each partitioning level will "break" the input precisely in half. Depending on how good a choice of partitioning element we make, we might not get precisely equal halves. So what's the worst that could happen? The worst case is a pivot that's actually the smallest or largest element in the input. In this case, we do an O(N) partitioning level, but instead of getting two halves of equal size, we've ended up with one partition of one element, and one partition of N-1 elements. If that happens for every level of partitioning, we obviously end up doing O(N) partitioning levels before every partition is down to one element.
This gives the technically correct big-O complexity for Quicksort (big-O officially refers to the upper bound on complexity). Since we have O(N) levels of partitioning, and each level requires O(N) steps, we end up with O(N * N) (i.e., O(N^2)) complexity.
Practical implementations
As a practical matter, a real implementation will typically stop partitioning before it actually reaches partitions of a single element. In a typical case, when a partition contains, say, 10 elements or fewer, you'll stop partitioning and use something like an insertion sort (since it's typically faster for a small number of elements).
Modified Algorithms
More recently other modifications to Quicksort have been invented (e.g., Introsort, PDQ Sort) which prevent that O(N^2) worst case. Introsort does so by keeping track of the current partitioning "level", and when/if it goes too deep, it'll switch to a heap sort, which is slower than Quicksort for typical inputs, but guarantees O(N log N) complexity for any inputs.
PDQ Sort adds another twist to that: since heap sort is slower, it tries to avoid switching to heap sort if possible. To do that, if it looks like it's getting poor pivot values, it'll randomly shuffle some of the inputs before choosing a pivot. Then, if (and only if) that fails to produce sufficiently better pivot values, it'll switch to using a heap sort instead.
Each partitioning operation takes O(n) operations (one pass on the array).
On average, each partitioning divides the array into two roughly equal parts, which gives about log n levels of partitioning. In total we have O(n * log n) operations.
I.e. on average, log n partitioning levels, and each level takes O(n) operations.
There's a key intuition behind logarithms:
The number of times you can divide a number n by a constant before reaching 1 is O(log n).
In other words, if you see a runtime that has an O(log n) term, there's a good chance that you'll find something that repeatedly shrinks by a constant factor.
In quicksort, what's shrinking by a constant factor is the size of the largest recursive call at each level. Quicksort works by picking a pivot, splitting the array into two subarrays of elements smaller than the pivot and elements bigger than the pivot, then recursively sorting each subarray.
If you pick the pivot randomly, then there's a 50% chance that the chosen pivot will be in the middle 50% of the elements, which means that there's a 50% chance that the larger of the two subarrays will be at most 75% the size of the original. (Do you see why?)
Therefore, a good intuition for why quicksort runs in time O(n log n) is the following: each layer in the recursion tree does O(n) work, and since each recursive call has a good chance of reducing the size of the array by at least 25%, we'd expect there to be O(log n) layers before you run out of elements to throw away out of the array.
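If you want to see that empirically (a quick throwaway experiment, not a proof), measure the recursion depth of quicksort with random pivots:

import random

def quicksort_depth(n_items):
    # Returns the maximum recursion depth reached on n random (distinct) items.
    def qs(a, depth):
        if len(a) <= 1:
            return depth
        pivot = random.choice(a)
        smaller = [x for x in a if x < pivot]
        larger = [x for x in a if x > pivot]
        return max(qs(smaller, depth + 1), qs(larger, depth + 1))
    return qs([random.random() for _ in range(n_items)], 0)

for n in (1000, 10000, 100000):
    print(n, quicksort_depth(n))   # depth grows like log n, not like n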
This assumes, of course, that you're choosing pivots randomly. Many implementations of quicksort use heuristics to try to get a nice pivot without too much work, and those implementations can, unfortunately, lead to poor overall runtimes in the worst case. Jerry Coffin's excellent answer to this question talks about some variations on quicksort that guarantee O(n log n) worst-case behavior by switching which sorting algorithms are used, and that's a great place to look for more information about this.
Well, it's not always n log n. That is the performance when the pivot chosen is approximately in the middle. In the worst case, if you choose the smallest or the largest element as the pivot, then the time will be O(n^2).
To visualize 'n log n', you can assume the pivot to be the element closest to the median of all the elements in the array to be sorted.
This would partition the array into 2 parts of roughly same length.
On both of these you apply the quicksort procedure.
As each step halves the length of the array, you will do this log n (base 2) times until you reach length = 1, i.e. a sorted array of 1 element.
Break the sorting algorithm into two parts. First is the partitioning, and second the recursive calls. The complexity of partitioning is O(N), and the depth of recursion in the ideal case is O(log N). For example, if you have 4 inputs then there will be 2 (= log 4) levels of recursive calls. Multiplying both, you get O(N log N). It is a very basic explanation.
In fact you need to find the position of all N elements (each becomes a pivot), but the maximum number of comparisons is log N for each element (the first partition covers N elements, the second level N/2 per side, the third N/4, ..., assuming the pivot is the median element).
In the ideal scenario, the first-level call places 1 element in its proper position. There are 2 calls at the second level taking O(n) time combined, but they put 2 elements in their proper positions. In the same way, there will be 4 calls at the 3rd level which take O(n) combined time but place 4 elements into their proper positions. So the depth of the recursion tree is log(n), and at each depth O(n) time is needed across all recursive calls. So the time complexity is O(n log n).

How to calculate order (big O) for more complex algorithms (eg quicksort)

I know there are quite a bunch of questions about big O notation, I have already checked:
Plain english explanation of Big O
Big O, how do you calculate/approximate it?
Big O Notation Homework--Code Fragment Algorithm Analysis?
to name a few.
I know by "intuition" how to calculate it for n, n^2, n! and so on; however, I am completely lost on how to calculate it for algorithms that are log n, n log n, n log log n and so on.
What I mean is, I know that Quick Sort is n log n (on average).. but, why? Same thing for merge/comb, etc.
Could anybody explain to me, in a not too math-y way, how you calculate this?
The main reason is that I'm about to have a big interview and I'm pretty sure they'll ask for this kind of stuff. I have researched for a few days now, and everybody seems to have either an explanation of why bubble sort is n^2 or an explanation that (for me) is unreadable, like the one on Wikipedia.
The logarithm is the inverse operation of exponentiation. An example of exponentiation is when you double the number of items at each step. Thus, a logarithmic algorithm often halves the number of items at each step. For example, binary search falls into this category.
Many algorithms require a logarithmic number of big steps, but each big step requires O(n) units of work. Mergesort falls into this category.
Usually you can identify these kinds of problems by visualizing them as a balanced binary tree. For example, here's merge sort:
6 2 0 4 1 3 7 5
2 6 0 4 1 3 5 7
0 2 4 6 1 3 5 7
0 1 2 3 4 5 6 7
At the top is the input, as leaves of the tree. The algorithm creates a new node by sorting the two nodes above it. We know the height of a balanced binary tree is O(log n) so there are O(log n) big steps. However, creating each new row takes O(n) work. O(log n) big steps of O(n) work each means that mergesort is O(n log n) overall.
Generally, O(log n) algorithms look like the function below. They get to discard half of the data at each step.
def function(data, n):
    if n <= constant:
        return do_simple_case(data, n)
    if some_condition():
        function(data[:n//2], n // 2)      # Recurse on first half of data
    else:
        function(data[n//2:], n - n // 2)  # Recurse on second half of data
While O(n log n) algorithms look like the function below. They also split the data in half, but they need to consider both halves.
def function(data, n):
    if n <= constant:
        return do_simple_case(data, n)
    part1 = function(data[:n//2], n // 2)      # Recurse on first half of data
    part2 = function(data[n//2:], n - n // 2)  # Recurse on second half of data
    return combine(part1, part2)
Where do_simple_case() takes O(1) time and combine() takes no more than O(n) time.
The algorithms don't need to split the data exactly in half. They could split it into one-third and two-thirds, and that would be fine. For average-case performance, splitting it in half on average is sufficient (like QuickSort). As long as the recursion is done on pieces of size (n/something) and (n - n/something), it's okay. If it's breaking the input into a constant-size piece (k) and (n-k), then the height of the tree will be O(n) and not O(log n).
You can usually claim log n for algorithms where the space/time is halved each time they run. A good example of this is any binary algorithm (e.g., binary search). You pick either left or right, which then cuts the space you're searching in half. The pattern of repeatedly halving is log n.
For some algorithms, getting a tight bound for the running time through intuition is close to impossible (I don't think I'll ever be able to intuit a O(n log log n) running time, for instance, and I doubt anyone will ever expect you to). If you can get your hands on the CLRS Introduction to Algorithms text, you'll find a pretty thorough treatment of asymptotic notation which is appropriately rigorous without being completely opaque.
If the algorithm is recursive, one simple way to derive a bound is to write out a recurrence and then set out to solve it, either iteratively or using the Master Theorem or some other way. For instance, if you're not looking to be super rigorous about it, the easiest way to get QuickSort's running time is through the Master Theorem -- QuickSort entails partitioning the array into two relatively equal subarrays (it should be fairly intuitive to see that this is O(n)), and then calling QuickSort recursively on those two subarrays. Then if we let T(n) denote the running time, we have T(n) = 2T(n/2) + O(n), which by the Master Method is O(n log n).
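Spelling that last step out (standard Master Theorem bookkeeping, nothing specific to this thread): in T(n) = 2T(n/2) + O(n) we have a = 2, b = 2 and f(n) = O(n). Since n^(log_b a) = n^(log_2 2) = n matches f(n), the balanced case of the theorem applies and T(n) = Θ(n log n).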
Check out the "phone book" example given here: What is a plain English explanation of "Big O" notation?
Remember that Big-O is all about scale: how much more operation will this algorithm require as the data set grows?
O(log n) generally means you can cut the dataset in half with each iteration (e.g. binary search)
O(n log n) means you're performing an O(log n) operation for each item in your dataset
O(n log log n) running times do exist, but they're exotic, and I don't have an intuitive example at hand.
I'll attempt to do an intuitive analysis of why Mergesort is n log n and if you can give me an example of an n log log n algorithm, I can work through it as well.
Mergesort is a sorting example that works by splitting a list of elements repeatedly until only single elements exist, and then merging these lists back together. The primary operation in each of these merges is comparison, and each merge requires at most n comparisons where n is the length of the two lists combined. From this you can derive the recurrence and easily solve it, but we'll avoid that method.
Instead, consider how Mergesort is going to behave: we're going to take a list and split it, then take those halves and split them again, until we have n partitions of length 1. I hope that it's easy to see that this recursion will only go log(n) deep until we have split the list up into our n partitions.
Now that we have our n partitions, each of these will need to be merged; then once those are merged, the next level will need to be merged, until we have a list of length n again. Refer to wikipedia's graphic for a simple example of this process: http://en.wikipedia.org/wiki/File:Merge_sort_algorithm_diagram.svg
Now consider the amount of time that this process will take: we're going to have log(n) levels, and at each level we will have to merge all of the lists. As it turns out, each level will take n time to merge, because we'll be merging a total of n elements each time. Then you can fairly easily see that it will take n log(n) time to sort an array with mergesort, if you take the comparison operation to be the most important operation.
If anything is unclear or I skipped somewhere please let me know and I can try to be more verbose.
Edit Second Explanation:
Let me think if I can explain this better.
The problem is broken into a bunch of smaller lists and then the smaller lists are sorted and merged until you return to the original list which is now sorted.
When you break up the problem you have several different levels of size: first you'll have two lists of size n/2, n/2; then at the next level you'll have four lists of size n/4, n/4, n/4, n/4; at the next level you'll have eight lists of size n/8; and this continues until n/2^k is equal to 1 (each subdivision is the length divided by a power of 2; not all lengths will divide evenly, so it won't be quite this pretty). This is repeated division by two and can continue at most log_2(n) times, because 2^(log_2(n)) = n, so any more division by 2 would yield a list of size zero.
Now the important thing to note is that at every level we have n elements so for each level the merge will take n time, because merge is a linear operation. If there are log(n) levels of the recursion then we will perform this linear operation log(n) times, therefore our running time will be n log(n).
Sorry if that isn't helpful either.
When applying a divide-and-conquer algorithm where you partition the problem into sub-problems until it is so simple that it is trivial, if the partitioning goes well, the size of each sub-problem is n/2 or thereabout. This is often the origin of the log(n) that crops up in big-O complexity: O(log(n)) is the number of recursive calls needed when partitioning goes well.
