How to estimate the time complexity of the webpack build/compilation?

Let's say we use TypeScript in a project that has 1000 lines (or characters) of code and 10 ES module dependencies. Assume the base compilation complexity of the TypeScript loader is n per line and the per-dependency bundling cost is m. Can we estimate an approximate time complexity for webpack building this project into a single JS file? Is there a formula like <lines of code> * <loader complexity> + <number of dependencies> * <bundle complexity>, so that the final time complexity for this project would be 1000 * n + 10 * m?
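As a minimal sketch of the cost model proposed in the question (n and m are the assumed per-line loader cost and per-dependency bundling cost; this is only the question's own model, not a statement about how webpack actually behaves):
// Hypothetical cost model from the question: cost = lines * n + dependencies * m.
// The class name and constants are illustrative only.
public class BuildCostModel {
    static double estimate(int linesOfCode, int dependencies, double n, double m) {
        return linesOfCode * n + dependencies * m;
    }

    public static void main(String[] args) {
        // The project described above: 1000 lines and 10 ES module dependencies.
        System.out.println(estimate(1000, 10, 1.0, 1.0));   // 1000 * n + 10 * m
    }
}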

Related

merging N sorted files using K way merge

There is decent literature about merging sorted files, or say merging K sorted files. They all work on the idea that the first element of each file is put into a heap; then, until the heap is empty, you poll that element and fetch another one from the file the polled element came from. This works as long as one record from each file can be held in the heap.
Now let us say I have N sorted files but I can only bring K records into the heap, with K < N; say N = K*c, where "c" is a multiplier, implying that N is so large that it is some multiple of K. Clearly, this will require doing a K-way merge over and over until we are left with only K files, and then we merge those one last time into the final sorted output. How do I implement this, and what will be the complexity?
There are multiple examples of k-way merge written in Java. One is http://www.sanfoundry.com/java-program-k-way-merge-algorithm/.
To implement your merge, you just have to write a simple wrapper that continually scans your directory, feeding the thing files until there's only one left. The basic idea is:
while number of files > 1
    fileList = load all file names
    i = 0
    while i < fileList.length
        filesToMerge = copy files i through i+k-1 from fileList
        merge(filesToMerge, output file name)
        i += k
    end while
end while
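A minimal Java sketch of this wrapper, assuming records are lines of text in sorted order (one per line) and that merged input files are deleted so the next pass only sees unmerged and newly produced files; class and file names are illustrative:
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class MultiPassMerge {

    // Holds one open input file together with its current (smallest unread) line.
    static final class Source {
        final BufferedReader reader;
        String current;
        Source(Path p) throws IOException {
            reader = Files.newBufferedReader(p);
            current = reader.readLine();
        }
        boolean advance() throws IOException {
            current = reader.readLine();
            return current != null;
        }
    }

    // k-way merge of sorted text files (one record per line) into a single output file.
    static void merge(List<Path> inputs, Path output) throws IOException {
        PriorityQueue<Source> heap =
                new PriorityQueue<>(Comparator.comparing((Source s) -> s.current));
        for (Path p : inputs) {
            Source s = new Source(p);
            if (s.current != null) heap.add(s); else s.reader.close();
        }
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            while (!heap.isEmpty()) {
                Source s = heap.poll();        // smallest current record
                out.write(s.current);
                out.newLine();
                if (s.advance()) heap.add(s);  // refill from the same file
                else s.reader.close();
            }
        }
    }

    // Repeatedly merge groups of k files until only one file remains.
    static Path mergeAll(List<Path> files, int k, Path workDir) throws IOException {
        int pass = 0;
        while (files.size() > 1) {
            List<Path> next = new ArrayList<>();
            for (int i = 0; i < files.size(); i += k) {
                List<Path> group = files.subList(i, Math.min(i + k, files.size()));
                Path out = workDir.resolve("pass" + pass + "_" + i + ".txt");
                merge(group, out);
                for (Path p : group) Files.delete(p);  // inputs no longer needed
                next.add(out);
            }
            files = next;
            pass++;
        }
        return files.get(0);
    }
}
Each pass rewrites all of the data once, which is what drives the per-pass cost in the complexity analysis below.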
Complexity analysis
This is easier to think about if we assume that each file contains the same number of items.
You have to merge M files, each of which contains n items, but you can only merge k files at a time. So you have to do log_k(M) passes (rounded up). That is, if you have 1,024 files and you can only merge 16 at a time, then you'll make one pass that merges 16 files at a time, creating a total of 64 files. Then you'll make another pass that merges 16 files at a time, creating four files, and your final pass will merge those four files to create the output.
If you have k files, each of which contains n items, then the complexity of merging them is O(n*k log k).
So in the first pass you do M/k merges, each of which has complexity O(nk log k). That's O((M/k) * n * k * log k), or O(Mn log k).
Now each of your files contains n*k items, and you do M/k² merges of k files each. So the second-pass complexity is O((M/k²) * n * k² * log k). Simplified, that too works out to O(Mn log k).
Note that in every pass you're working with M*n items, so each pass you do is O(Mn log k). And you're doing log_k(M) passes. So the total complexity is O(log_k(M) * Mn log k), which simplifies to O(Mn log M), because log_k(M) * log k = log M.
The assumption that every file contains the same number of items doesn't affect the asymptotic analysis because, as I've shown, every pass manipulates the same number of items: M*n.
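A quick sanity check of the pass count for the 1,024-file, 16-way example above (just a sketch; n, the per-file item count, does not affect the number of passes):
// Counts passes and files remaining after each pass when M files are merged k at a time.
// Every pass still touches all M*n items, matching the analysis above.
public class PassCount {
    public static void main(String[] args) {
        int files = 1024, k = 16;
        int pass = 0;
        while (files > 1) {
            int merged = (files + k - 1) / k;   // ceil(files / k)
            pass++;
            System.out.println("pass " + pass + ": " + files + " files -> " + merged);
            files = merged;
        }
        // Prints three passes: 1024 -> 64 -> 4 -> 1.
    }
}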
These are my thoughts:
I would do it iteratively. First I would do p = floor(N/K) merges to get p sorted files. Then I would keep doing this on the p + N%K remaining files, until fewer than K files remain, and finally one last merge gives the sorted file.
Does it make sense?

Algorithm: Picking n items of different weights to obtain an average item-weight

I have 5 folders, each containing 'n' files; the five folders hold files of size 10KB, 500KB, 1MB, 5MB and 30MB respectively.
Now I need to pick exactly 15000 files from these folders and place them into a new folder such that I pick at least one file from each of the folders and the average file size remains around 1MB.
I've tried working with a weighted average distribution, as well as along the lines of this problem http://goo.gl/uAHOk1, but couldn't reach any conclusion.
Is this problem solvable in polynomial time?
EDIT
From comments:
For the sake of clarity, you may assume each folder has exactly 16k files.
By around 1 MB I meant the average file size should be in the range of 1 to 1.5 MB. For example, if I had to pick exactly 5 files while keeping the constraints of my problem, the only solution would be picking one file from each folder, and then the average file size would be about 7.3 MB.
If you want the average size to be as close to your value as possible, this problem resembles the following ILP:
s_ij = size of file i in folder j    [parameter]
X_ij = 1 if file i from folder j is selected, 0 otherwise    [binary variable]
max   Sum_ij s_ij * X_ij
such that
      Sum_ij s_ij * X_ij <= 15,000 * average_size
      Sum_ij X_ij = 15,000
      Sum_i  X_ij >= 1    for all j
This is pretty much a bin packing problem with one additional dimension and constraint (the one-file-per-folder requirement).
As Harold mentioned, we can start by going through each folder and selecting a file - for instance the smallest one. This can be done in polynomial time.
What's left is a bin packing problem where you can choose from any file in any folder to fill the gap between 15,000*average_size and the sum of pre-picked files. Bin packing is known to be NP-hard though, so you won't be able to solve this in polynomial time.
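A sketch of that pre-picking step, with made-up folder contents (sizes in KB) purely for illustration; what is left afterwards is the bin-packing-like gap-filling part:
import java.util.*;

public class PrePick {
    // Pick the smallest file from each folder first (polynomial time),
    // then report how much size budget remains for the other 15,000 - 5 files.
    public static void main(String[] args) {
        // Illustrative sizes in KB; each inner list stands for one folder's files.
        List<List<Integer>> folders = List.of(
                List.of(10, 10, 10), List.of(500, 500), List.of(1024, 1024),
                List.of(5 * 1024), List.of(30 * 1024));
        long targetTotal = 15_000L * 1024;   // 15,000 files * 1 MB average, in KB
        long picked = 0;
        for (List<Integer> folder : folders) {
            picked += Collections.min(folder);   // satisfies the one-file-per-folder constraint
        }
        System.out.println("Remaining size budget to fill (KB): " + (targetTotal - picked));
        // Filling this gap with exactly 15,000 - folders.size() more files is the
        // NP-hard part described above.
    }
}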

Given a time complexity, how do you calculate run time given an n?

I was given an algorithm and estimated the time complexity T(n) to be 3*n! + 2.
I know that the time required for the algorithm to run when n = 10 is 1 second, and I wish to calculate the run time for n = 20.
I'm a little confused on how to approach this. I assumed since n=10 that I just plug it into T(n), which gives 3*(10!) + 2, which is obviously not 1 (second).
Can anyone give some tips on how to approach this properly? Thanks!
As @MarkRansom wrote in the comments, you would have to solve the equation
Runtime(m) / T(m) = Runtime(n) / T(n)
for Runtime(m). In this special case the result is so big (see @shapiro.yaacov's comment) that it doesn't matter whether the value is exact or not.
Let's say your complexity is T(n) = 2n² and you measure 1 second for n = 1000. This leads us to
Runtime(2000) = T(2000) ⋅ 1s / T(1000) = 4s
But this does not mean that your algorithm runs in exactly 4 seconds. It can be much worse if your input outgrows a particular level of memory. E.g. maybe the input for n = 1000 fits into the cache; if the input for n = 2000 does not, it has to be fetched from RAM, and your running time will be worse by, say, a factor of 50 (just a number thrown in; I don't know exactly how much slower RAM is compared to L3 cache).
It gets even worse if you have a giant input that does not fit into RAM and has to be stored on a hard disk.
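Applying the same proportion to the original T(n) = 3*n! + 2 with Runtime(10) = 1 second (a sketch that only evaluates the ratio, under the assumption above that run time scales with T):
import java.math.BigInteger;

public class RuntimeEstimate {
    // T(n) = 3 * n! + 2
    static BigInteger t(int n) {
        BigInteger fact = BigInteger.ONE;
        for (int i = 2; i <= n; i++) fact = fact.multiply(BigInteger.valueOf(i));
        return fact.multiply(BigInteger.valueOf(3)).add(BigInteger.valueOf(2));
    }

    public static void main(String[] args) {
        // Runtime(20) ≈ Runtime(10) * T(20) / T(10), with Runtime(10) = 1 second.
        BigInteger seconds = t(20).divide(t(10));
        System.out.println(seconds + " seconds, roughly "
                + seconds.divide(BigInteger.valueOf(3600L * 24 * 365)) + " years");
    }
}
The result is on the order of 6.7 * 10^11 seconds, i.e. over twenty thousand years, which is why the exact value hardly matters.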

Greedy Algorithm Optimization

I have the following problem:
Let there be n projects.
Let F_i(x) equal the number of points you will obtain if you spend x units of time working on project i.
You have T units of time to use and work on any project you would
like.
The goal is to maximize the number of points you will earn; the F functions are non-decreasing.
The F functions have diminishing marginal returns: in other words, the (x+1)-th unit of time spent on a particular project yields a smaller increase in total points earned from that project than the x-th unit did.
I have come up with the following O(n log n + T log n) algorithm, but I am supposed to find an algorithm running in O(n + T log n):
sum = 0
schedule[] = all zeros
gain[] = sorted array of F_i(1) for every project i
while sum < T
    P = project with the maximum gain        // getMax(gain)
    schedule[P]++
    sum++
    gain.sortedDelete(P)                     // remove P's old marginal gain
    gain.sortedInsert(F_P(schedule[P] + 1) - F_P(schedule[P]))   // P's next marginal gain
return schedule
That is, it takes O(n log n) to sort the initial gain array and O(T log n) to run through the loop. I have thought through this problem more than I care to admit and cannot come up with an algorithm that would run in O(n + T log n).
For the first case, use a heap: constructing the heap takes O(n) time, and each extract-max / decrease-key operation takes O(log n) time, which gives the O(n + T log n) bound.
For the second case, construct an n×T table where the i-th column denotes the solution for the case T = i. The (i+1)-th column depends only on the values in the i-th column and the function F, hence it is computable in O(nT) time. I did not think through all the cases thoroughly, but this should give you a good start.
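A Java sketch of that heap-based approach, assuming the point functions are supplied as an array of callbacks (the interface, names and toy functions are illustrative). java.util.PriorityQueue has no decrease-key, so the sketch re-inserts the updated gain instead; the collection constructor heapifies in O(n), and each of the T iterations does one poll and one add at O(log n), matching O(n + T log n):
import java.util.*;
import java.util.function.IntUnaryOperator;

public class GreedySchedule {

    // One heap entry: a project and the marginal gain of giving it one more time unit.
    record Entry(int project, long gain) implements Comparable<Entry> {
        public int compareTo(Entry o) { return Long.compare(o.gain, gain); } // max-heap order
    }

    // F[i] maps units of time spent on project i to points earned (non-decreasing,
    // with diminishing marginal returns). Returns schedule[i] = time given to project i.
    static int[] schedule(IntUnaryOperator[] F, int T) {
        int n = F.length;
        int[] schedule = new int[n];
        List<Entry> initial = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            initial.add(new Entry(i, F[i].applyAsInt(1)));   // gain of the first unit
        }
        PriorityQueue<Entry> heap = new PriorityQueue<>(initial);  // O(n) heapify
        for (int spent = 0; spent < T; spent++) {
            Entry best = heap.poll();            // project with the largest marginal gain
            int p = best.project();
            schedule[p]++;
            long next = F[p].applyAsInt(schedule[p] + 1) - F[p].applyAsInt(schedule[p]);
            heap.add(new Entry(p, next));        // O(log n) re-insert of the new marginal gain
        }
        return schedule;
    }

    public static void main(String[] args) {
        // Toy example with two projects whose marginal gains never increase.
        IntUnaryOperator[] F = {
                x -> 10 * x - x * x / 2,   // project 0
                x -> 6 * x                 // project 1
        };
        System.out.println(Arrays.toString(schedule(F, 5)));
    }
}
This is the same greedy as the pseudocode in the question, with the heap standing in for the sorted gain array.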

Big O confusion

I'm testing out some functions I made and I am trying to figure out the time complexity.
My problem is that even after reading up on some articles on Big O I can't figure out what the following should be:
1000 loops : 15000 objects : time 6 ms
1000 loops : 30000 objects : time 9 ms
1000 loops : 60000 objects : time 15 ms
1000 loops : 120000 objects : time 75 ms
The difference between the first two is 3 ms, then 6 ms, and then 60 ms, so the increase roughly doubles at first and then jumps. I know this isn't O(n), and I think it's not O(log n).
When I try out different sets of data, the time doesn't always go up. For example take this sequence (ms): 11-17-26-90-78-173-300
The 78 ms looks out of place. Is this even possible?
Edit:
NVM, I'll just have to talk this over with my college tutor.
The output of time differs too much with different variables.
Thanks for those who tried to help though!
Big O notation is not about how long it takes exactly for an operation to complete. It is a (very rough) estimation of how various algorithms compare asymptotically with respect to changing input sizes, expressed in generic "steps". That is "how many steps does my algorithm do for an input of N elements?".
Having said this, note that in the Big O notation constants are ignored. Therefore a loop over N elements doing 100 calculations at each iteration would be 100 * N but still equal to O(N). Similarly, a loop doing 10000 calculations would still be O(N).
Hence in your example, if you have something like:
for (int i = 0; i < 1000; i++)
    for (int j = 0; j < N; j++)
        // computations
it would be 1000 * N = O(N).
Big O is just a simplified estimate of an algorithm's running time, which basically says that if one algorithm has running time O(N) and another has O(N^2), then the first one will eventually be faster than the second, i.e. for all sufficiently large N. This estimate of course does not take into account anything related to the underlying platform like CPU speed, caching, I/O bottlenecks, etc.
Assuming you can't get the big-O from theory alone, then I think you need to look over more orders of magnitude in n -- at least three, preferably six or more (you will just need to experiment to see what variation in n is required). Leave the thing running overnight if you have to. Then plot the results logarithmically.
Basically I suspect you are looking at noise right now.
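A sketch of the "plot it logarithmically" idea: fit a least-squares line to log(time) versus log(n); the slope estimates the exponent (a slope near 1 suggests O(n), near 2 suggests O(n²)), though with only the four noisy points from the question it is at best a hint:
public class LogLogFit {
    // Least-squares slope of log(time) vs log(n). Only meaningful when the
    // measurements span several orders of magnitude in n.
    static double slope(double[] n, double[] time) {
        int k = n.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < k; i++) {
            double x = Math.log(n[i]), y = Math.log(time[i]);
            sx += x; sy += y; sxx += x * x; sxy += x * y;
        }
        return (k * sxy - sx * sy) / (k * sxx - sx * sx);
    }

    public static void main(String[] args) {
        // The question's measurements: object counts and times in milliseconds.
        double[] n    = {15000, 30000, 60000, 120000};
        double[] time = {6, 9, 15, 75};
        System.out.printf("estimated exponent: %.2f%n", slope(n, time));
    }
}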
Without seeing your actual algorithm, I can only guess:
If you allow a constant initialisation overhead of 3ms, you end up with
1000 x 15,000 = (overhead: 3) + 3
1000 x 30,000 = (overhead: 3) + 6
1000 x 60,000 = (overhead: 3) + 12
This, to me, appears to be O(n)
The disparity in your timings across different datasets could be due to any number of factors.
