How to gauge Excel Calculation speed? [closed] - performance

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions concerning problems with code you've written must describe the specific problem — and include valid code to reproduce it — in the question itself. See SSCCE.org for guidance.
Closed 9 years ago.
I am developing an Excel 2010 application which contains complex calculations across 60+ worksheets. When I change certain data in any cell, the background calculation takes a long time (I want calculation to stay automatic).
Is there any way to find out which formulas are taking more time than others?
What is a better approach to improve performance - multiple simple formulas vs. a single complex (multi-step) formula?
i.e.
[STEP-1] E1 = C1 * D1
[STEP-2] F1 = E1 / B1
[STEP-3] G1 = F1 + A1
OR
[SINGLE STEP] G1 = (C1 * D1 / B1) + A1
Suggestions will be appreciated!
Thanks

As for the second part, if you use ordinary non-volatile functions then multiple simple formulas could be better for two reasons:
On simple recalculations (without rebuilding the dependency tree) Excel will recalculate only the parts that actually changed. E.g. in your single-step example, if the value in A1 changes then Excel will have to recalculate the expression in the parentheses (C1 * D1 / B1) even though the values of C1, D1 and B1 are unchanged. When you replace that part with a reference to F1, F1 will not be recalculated if only A1 changes.
Multiple simple formulas can also be calculated in parallel if you have multiple cores.
Another useful link in addition to MSDN: http://www.decisionmodels.com/calcsecretsc.htm
Volatile functions are evil in very large workbooks, especially OFFSET and INDIRECT. They all are recalculated every time anything changes in a file, and they are always calculated in a single thread. Any cell that depends on a cell with a volatile function becomes volatile as well, because all dependencies have to be recalculated every time a volatile function is recalculated. This viral volatility in a big file could seriously damage performance. Using many simple formulas helps in this case as well, since many dependencies could remain non-volatile.
From the link above:
Some Excel features do not use multithreaded calculation, for example:
Data table calculation (but structured references to tables do use MTC)
User-defined functions (but XLL functions can be multithread-enabled)
XLM functions
INDIRECT, and CELL functions that use either the format2 or address options
GETPIVOTDATA and other functions referring to PivotTables or cubes
Range.Calculate and Range.CalculateRowMajorOrder
Cells in circular reference loops
Once upon a time I inherited a big file that took 30 min to recalculate on a dedicated fast machine and that was due to crazy usage of OFFSETS to access data from a big sheet. Just by moving calculation logic from Excel to Access and importing results via a pivot table I reduced total calculation time to several seconds!

This may help with your first question: http://msdn.microsoft.com/en-us/library/ff700515%28v=office.14%29.aspx , though as you can see there, your question may be close to off topic, since answering it comprehensively would require a book. For your second question I'd guess "no discernible difference".

Related

Is there a more efficient way to manage large data than using facts?

I finished working on an AI project that manages long strings (about 1300 characters each).
The issue is an "Out of local stack" exception when running the algorithm with a population (NP) larger than 10 or with more iterations (G) than 20.
Currently, I manage data using facts like this:
x(INDEX, NAMESTRING1, STRING1, NAMESTRING2, STRING2, FITNESS, VAR)
for storing current solutions
and
h(INDEX, NAMESTRING1, STRING1, NAMESTRING2, STRING2, FITNESS, VAR)
for storing possible solutions made by the algorithm.
Where STRING1 and STRING2 are two lists containing the characters of two DNA chains. These lists get modified by the algorithm, which adds '_' characters.
My code calls to assert/1 and retract/1 about NP * G * 4 * 2 times during each iteration.
I tried:
1.- Setting a higher stack limit, which seems to work, with reservations.
2.- Using ! where necessary.
I was wondering if managing data with a list of lists would solve the problem.
Asserting and retracting facts should be very efficient and behave in much the same way as inserting and removing documents to/from an in-memory key-value database.
"Out of local stack space" indicates that your program recurses too deeply or has too many local variables in certain stack frames. Using ! certainly helps as the Prolog Processor can optimize away stack frames "that will never be re-visited in the future".
There is not enough information to judge this, but depending on the algorithm you may be able to replace depth-first search with an iterative-deepening approach (as one possibility that comes to mind); a rough sketch of that idea follows below.
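To illustrate that suggestion only: a minimal, language-agnostic sketch of iterative deepening, written in Python purely for brevity (the goal/expand callbacks are illustrative and not taken from the question's code).

# Run a depth-limited DFS with a growing depth bound, so the recursion
# (and hence the stack) never goes deeper than the current limit.
def depth_limited(state, goal, expand, limit):
    if goal(state):
        return [state]
    if limit == 0:
        return None
    for nxt in expand(state):
        path = depth_limited(nxt, goal, expand, limit - 1)
        if path is not None:
            return [state] + path
    return None

def iterative_deepening(start, goal, expand, max_depth=50):
    for limit in range(max_depth + 1):
        path = depth_limited(start, goal, expand, limit)
        if path is not None:
            return path
    return None  # no solution within max_depth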

Sampling from discrete distribution without replacement where the probabilities change each draw

I have a sequence S = (s1, s2, ..., sk) with probability weights P = (p1, p2, ..., pk) for the sequence sites, where the sum of P is 1. The maximum length of S may be around 10^9.
During the simulation a site k is picked and modified after each draw, so the corresponding pk also changes with every pass. The expected number of site exchanges is about 50k - 100k per simulation.
Question 1: How would you suggest drawing a site?
I implemented the following logic, which seems fine by itself and follows the literature (see e.g. here):
P_sum = 0.0
random_number = draw_random()  # uniform float in [0, 1)
for counter in range(len(P)):
    P_sum += P[counter]
    if random_number < P_sum:
        return counter
return len(P) - 1  # guard against floating-point round-off
By testing the simulation I observed a strong bias which seems to reproduce the random generator's distribution (see here). Three different generators produce three different results... which is fair enough, but none of them is correct in all states.
Walker's and Knuth's methods with lookup tables seem too time-expensive for me, as the lookup tables would have to be recalculated after every draw.
Question 2: How can I reduce the bias from the randomness? I currently have three different generators built in (only one is used per simulation), each uniformly distributed over its range. I know this is a difficult question without seeing a single line of the simulation code.
Question 3: Is there a library for this?
As it's not too much code I have no problem writing it on my own, but is there another library for it that is not Boost? I'm asking since this question may be outdated... Not Boost, because I don't want to build in a fourth random generator and pull in that large library.
Question 4: Is there a faster alternative?
I know this topic has been answered many thousands of times before, but none of the answers satisfies me or gives me a good alternative. E.g. here seems to have the same problem, but I don't understand which heap is built where and why; besides, it seems very complicated for such an "easy" thing.
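One standard structure that makes both the weight update and the draw fast without rebuilding anything is a tree of partial sums over the weights (a Fenwick / binary indexed tree): updating one weight and drawing a site are both O(log k). A rough Python sketch of the idea, purely illustrative and assuming the k weights fit in memory:

import random

class WeightedSampler:
    """Fenwick (binary indexed) tree over the site weights: updating one
    weight and drawing a site are both O(log k), so nothing has to be
    rebuilt between draws."""

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)
        for i, w in enumerate(weights):
            self.update(i, w)

    def update(self, i, delta):
        """Add delta to the weight of site i (0-based)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def total(self):
        i, s = self.n, 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def draw(self):
        """Return a site index with probability proportional to its weight."""
        r = random.random() * self.total()
        idx, step = 0, 1 << self.n.bit_length()
        while step:
            nxt = idx + step
            if nxt <= self.n and self.tree[nxt] <= r:
                r -= self.tree[nxt]
                idx = nxt
            step >>= 1
        return idx  # 0-based index of the chosen site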
Thank you for your support!

From Log value to Exponential value, huge Distortion for prediction of machine learning algorithm [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
I built a machine learning algorithm to predict a Y' value. For this, I used the log of Y for data scaling.
Since I get the predicted Y' and the actual Y on the log scale, I have to convert both back by exponentiating.
But there is huge distortion for values above roughly exp(7) (about 1100)... It produces a large MSE (error).
How can I avoid this huge distortion? (Generally, I need to get values over 1000.)
Thanks!!
For this, I used Log value of Y for data scaling.
Not for scaling, but to make the target variable's distribution normal.
If your MSE grows as the real target value grows, it means the model simply can't fit the big values well enough. Usually this can be solved by cleaning the data (removing outliers), or by taking another ML model.
UPDATE
You can run KFold and, for each fold, calculate the MSE/MAE between the predicted and real values. Then take the cases with big errors and look at which parameters/features those cases have (a short sketch follows after these suggestions).
You can eliminate cases with big errors, but it's usually dangerous.
In general, a bad fit on big values means that you did not remove outliers from your original dataset. Plot histograms and scatter plots and make sure that you don't have any.
Check categorical variables: maybe some categories are rare (<= 5% of cases). If so, group them.
Or you need to create 2 models: one for small values, one for big ones.
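A minimal sketch of the KFold error inspection suggested above, assuming scikit-learn and numpy; the regressor and the number of worst cases shown are illustrative choices, not part of the original advice.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def inspect_fold_errors(X, y, n_splits=5):
    """Train on log(y), report the MSE per fold on the original scale and
    print the worst test rows so their features can be inspected."""
    y_log = np.log(y)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
        model = RandomForestRegressor(random_state=0)
        model.fit(X[train_idx], y_log[train_idx])
        pred = np.exp(model.predict(X[test_idx]))  # back to the original scale
        mse = mean_squared_error(y[test_idx], pred)
        worst = np.argsort(np.abs(y[test_idx] - pred))[-10:]
        print(f"fold {fold}: MSE = {mse:.1f}, worst test rows: {test_idx[worst]}")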

What is Youtube comment system sorting / ranking algorithm? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 4 years ago.
Youtube provides two sorting options: Newest first and Top comments. "Newest first" is pretty simple: the comments are just sorted by their post date. But "Top comments" seems to be a lot more complex than just sorting by "thumb up"s.
After a short research, I found out that the order of comments depends on those things:
Number of "thumb up"s and "thumb down"s
Post date
Number of replies to that comment
But I don't know how Youtube uses this information to decide the order, like what information is more important and what is less important.
Is there any article about this topic that I could refer to?
Thanks!
I have the answer to your question.
After searching the internet for the answer to this, I never found precisely what I was looking for. So, my colleagues and I decided to experiment using the system with the Youtube comments.
First of all, we sorted what we believed to be popular videos into one section, average videos into another, and less popular into the last. There were 200 videos in each section, and after days of examining we started to notice a pattern. We found that you were right about the three things required, but we also dove a little deeper and found an additional variable.
The Youtube comment system depends on four things:
1) Time it was posted,
2) Like/dislike ratio of a comment,
3) Number of replies,
4) And, believe it or not, WHO posted it.
The average like/dislike ratio of every public comment you've ever posted feeds into it, as (we predicted) they believe that users with low like/dislike ratios tend to post comments that many people do not like or simply disagree with.
There is an algorithm to it, and it is simpler than you might think. Basically there are these things that we called "module points," and a comment gets a certain number of them based on these four factors. First, here's what you need to know about module point conversion for TWO of the factors:
For the like/dislike ratio on the comment, multiply that number by ten.
For the amount of replies (NOT from the original poster) that the comment has, each reply is worth two module points.
These are the two basic factors that tell the amount of module points the comment has.
For example, if a comment had 27 likes and 8 dislikes, then the ratio would be 3.375. Multiplying by 10, you would then have 33.75 module points. Using the next factor, amount of replies, let's say this comment has 4 direct replies to it. Multiplying 2 by 4, we get 8. This is the part where you add 8 onto the accumulative module points, giving you a total of 41.75 module points.
But we're not done here; this is where it gets tricky.
Using the average like/dislike ratio of a person's total comments that they've ever posted publicly, we found that the formula added onto the accumulative module points is this:
C = MP(R/3) + (MP/10)
where C = Comment Position Variable; MP = Module Points; R = Person's total like/dislike ratio
Trust me, we spent DAYS just on this part, which was probably the most frustrating. Even though the 3 and the 10 within this equation seem random and unnecessary, so far all of the comments we tested this equation on passed the test, but failed when those two constants were removed. Once this equation is done, it gives you a number that we named the Position Variable.
However, we are not even done yet, we still haven't talked about time.
I was actually quite surprised that this part didn't take as long as I expected, but it sure was a pain doing this equation every single time for every comment we tested. At first, when testing it, we figured that the time was just there to break the tie if two comments had equal Position Variables.
In fact, I almost called it a wrap on the experiment when this happened, but upon further inspection we found there was more to do. We found that some comments with the same Position Variable still outranked each other, and the timing seemed to be random! After a few more days of inspection, here is where the final result comes in:
There is yet ANOTHER equation that we must find before applying the 4th variable. Using another separate equation, here's what our algebraic deductions came down to:
X = (1/3) x (S/10 + A) x |A - 3S|
where X = Timing Variable; S = How long ago the video was posted in minutes; A = How long ago the comment was posted in minutes
I wish I was making this up, but unfortunately this is how complicated the system is. There are mathematical reasons behind the other variables, but they are far too complex to explain; it would probably take at least three paragraphs. We tested this equation on more than 150 comments, and all of them checked out.
Once you find X, which is what we called the Timing Variable, all you have to do from here is apply it to this equation:
N = X(C/4 + 1)
where X = Timing Variable; C = Position Variable
N is the answer to all your problems.
This is the final equation, the final answer. The simple conclusion: the higher N, the higher up the comment is.
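For readability, here is the whole chain of formulas from this answer collected in one place and translated literally into a short Python function; the constants and variable names come from the text above, not from any official source.

def comment_rank_value(likes, dislikes, replies, poster_ratio,
                       video_age_min, comment_age_min):
    """Literal translation of the formulas described in this answer."""
    # (the text's ratio is undefined for zero dislikes; handle that separately)
    mp = (likes / dislikes) * 10 + replies * 2     # accumulated module points
    c = mp * (poster_ratio / 3) + mp / 10          # C, the Position Variable
    s, a = video_age_min, comment_age_min
    x = (1 / 3) * (s / 10 + a) * abs(a - 3 * s)    # X, the Timing Variable
    return x * (c / 4 + 1)                         # N: higher means ranked higher

# The worked example above: 27 likes, 8 dislikes and 4 replies give
# 3.375 * 10 + 4 * 2 = 41.75 module points before the poster ratio is applied.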
Note: Special thanks to my colleagues: David Mattison, Josh Williams, Diego Mendieta, Steven Orsette, and Kyle Shropshire. I could have never found out this without them and the work they put into this.

Good Data Structure for Unit Conversion? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
StackOverflow crowd. I have a very open-ended software design question.
I've been looking for an elegant solution to this for a while and I was wondering if anyone here had some brilliant insight into the problem. Consider this to be like a data structures puzzle.
What I am trying to do is to create a unit converter that is capable of converting from any unit to any unit. Assume that the lexing and parsing is already done. A few simple examples:
Convert("days","hours") // Yields 24
Convert("revolutions", "degrees") // Yields 360
To make things a little more complicated, it must smoothly handle ambiguities between inputs:
Convert("minutes","hours") // Yields (1/60)
Convert("minutes","revolutions") // Yields (1/21600)
To make things even more fun, it must handle complex units without needing to enumerate all possibilities:
Convert("meters/second","kilometers/hour")
Convert("miles/hour","knots")
Convert("Newton meters","foot pounds")
Convert("Acre feet","meters^3")
There's no right or wrong answer; I'm looking for ideas on how to accomplish this. There's always a brute force solution, but I want something elegant that is simple and scalable.
I would start with a hashtable (or persisted lookup table - your choice how you implement) that carries unit conversions between as many pairs as you care to put in. If you put in every possible pair, then this is your brute force approach.
If you have only partial pairs, you can then do a search across the pairs you do have to find a combination. For example, let's say I have these two entries in my hashtable:
Feet|Inches|12
Inches|Centimeters|2.54
Now if I want to convert feet to centimeters, I have a simple graph search: vertices are Feet, Inches, and Centimeters, and edges are the 12 and 2.54 conversion factors. The solution in this case is the two edges 12 and 2.54 (combined via multiplication, of course, giving 30.48 centimeters per foot). You can get fancier with the graph parameters if you want to.
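A minimal Python sketch of that graph search, with the convention that each stored factor means "multiply a value in the first unit by the factor to get the second unit" (so feet to inches stores 12); the two table rows are just the example above.

from collections import deque

# value_in_first_unit * factor == value_in_second_unit
FACTORS = {
    ("feet", "inches"): 12.0,
    ("inches", "centimeters"): 2.54,
}

def build_graph(factors):
    graph = {}
    for (a, b), f in factors.items():
        graph.setdefault(a, {})[b] = f
        graph.setdefault(b, {})[a] = 1.0 / f  # store the inverse edge as well
    return graph

def convert_factor(src, dst, graph):
    """Breadth-first search for a chain of conversions from src to dst;
    the factors along the path are combined by multiplication."""
    queue, seen = deque([(src, 1.0)]), {src}
    while queue:
        unit, factor = queue.popleft()
        if unit == dst:
            return factor
        for nxt, f in graph.get(unit, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, factor * f))
    raise ValueError(f"no conversion path from {src} to {dst}")

# convert_factor("feet", "centimeters", build_graph(FACTORS)) -> 30.48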
Another approach might be applying abductive reasoning - look into AI texts about algebraic problem solvers for this...
Edit: Addressing Compound Units
Simplified problem: convert "Acres" to "Meters^2"
In this case, the key is understanding that we are talking about units of length, so why don't we insert a new column into the table for unit type, which can be "length" or "area". This will help performance even in the earlier cases, as it gives you an easy column to pare down your search space.
Now the trick is to understand that length^2 = area. Why not add another lookup that stores this metadata:
Area|Length|Length|*
We couple this with the primary units table:
Meters|Feet|3.28|Length
Acres|Feet^2|43560|Area
So the algorithm goes:
Solution is m^2, which is m * m, which is a length * length.
Input is acres, which is an area.
Search the meta table for m, and find the length * length mapping. Note that in more complex examples there may be more than one valid mapping.
Append to the solution a conversion Acres->Feet^2.
Perform the original graph search for Feet->M.
Note that:
The algorithm won't know whether to use area or length as the basic domain in which to work. You can provide it hints, or let it search both spaces.
The meta table gets a little brute-force-ish.
The meta table will need to get smarter if you start mixing types (e.g. Resistance = Voltage / Current) or doing something really ugly and mixing unit systems (e.g. a FooArea = Meters * Feet).
Whatever structure you choose (and your choice may well be directed by your preferred implementation: OO? functional? DBMS table?), I think you need to identify the structure of units themselves.
For example a measurement of 1000km/hr has several components:
a scalar magnitude, 1000;
a prefix, in this case kilo; and
a dimension, in this case L.T^(-1), that is, length divided by time.
Your modelling of measurements with units needs to capture at least this complexity.
As has already been suggested, you should establish what the base set of units you are going to use are, and the SI base units immediately suggest themselves. Your data structure(s) for modelling units would then be defined in terms of those base units. You might therefore define a table (thinking RDBMS here, but easily translatable into your preferred implementation) with entries such as:
unit name     | dimension               | conversion to base
foot          | Length                  | 0.3048
gallon (UK)   | Length^3                | 4.546092 x 10^(-3)
kilowatt-hour | Mass.Length^2.Time^(-2) | 3.6 x 10^6
and so forth. You'll also need a table to translate prefixes (kilo-, nano-, mega-, mebi-, etc.) into multiplying factors, and a table of base units for each of the dimensions (i.e. the meter is the base unit for Length, the second for Time, etc.). You'll also have to cope with units such as feet which are simply synonyms for other units.
The purpose of dimension is, of course, to ensure that your conversions and other operations (such as adding 2 feet to 3.5 metres) are commensurate.
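A minimal Python sketch of this scheme; the entries and the choice of only three base dimensions (Length, Mass, Time) are illustrative, and a real table would also carry prefixes and synonyms as described above.

# Each unit maps to a factor to the base unit plus a dimension vector of
# exponents over (Length, Mass, Time).
UNITS = {
    "meter":  (1.0,    (1, 0, 0)),
    "foot":   (0.3048, (1, 0, 0)),
    "second": (1.0,    (0, 0, 1)),
    "hour":   (3600.0, (0, 0, 1)),
    "joule":  (1.0,    (2, 1, -2)),
}

def convert(value, src, dst):
    f_src, dim_src = UNITS[src]
    f_dst, dim_dst = UNITS[dst]
    if dim_src != dim_dst:  # the dimension check keeps operations commensurate
        raise ValueError(f"{src} and {dst} are not commensurate")
    return value * f_src / f_dst

# convert(1.0, "foot", "meter") -> 0.3048
# convert(2.0, "hour", "second") -> 7200.0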
And, for further reading, I suggest this book by Cardarelli.
EDIT in response to comments ...
I'm trying to veer away from suggesting (implementation-specific) solutions so I'll waffle a bit more. Compound units, such as kilowatt-hours, do pose a problem. One approach would be to tag measurements with multiple unit-expressions, such as kilowatt and hour, and a rule for combining them, in this case multiplication. I could see this getting quite hairy quite quickly. It might be better to restrict the valid set of units to the most common ones in the domain of the application.
As to dealing with measurements in mixed units, well the purpose of defining the Dimension of a unit is to provide some means to ensure that only sensible operations can be applied to measurements-with-units. So, it's sensible to add two lengths (L+L) together, but not a length (L) and a volume (L^3). On the other hand it is sensible to divide a volume by a length (to get an area (L^2)). And it's kind of up to the application to determine if strange units such as kilowatt-hours per square metre are valid.
Finally, the book I link to does enumerate all the possibilities, I guess most sensible applications with units will implement only a selection.
I would start by choosing a standard unit for every quantity (e.g. meters for length, newtons for force, etc.) and then storing all the conversion factors to that unit in a table.
Then to go from days to hours, for example, you find the conversion factors for seconds per day and seconds per hour and divide them to find the answer.
For ambiguities, each unit could be associated with all the types of quantities it measures, and to determine which conversion to do, you would take the intersection of those two sets of types (and if you're left with zero or more than one you would report an error).
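A small Python sketch of that intersection rule; the sets below are illustrative only.

# Each unit is tagged with the quantity types it can measure; the
# intersection of the two sets decides which conversion applies.
QUANTITY_TYPES = {
    "minutes":     {"time", "angle"},   # minutes of time or of arc
    "hours":       {"time"},
    "revolutions": {"angle"},
    "degrees":     {"angle", "temperature"},
}

def quantity_for(src, dst):
    common = QUANTITY_TYPES[src] & QUANTITY_TYPES[dst]
    if len(common) != 1:
        raise ValueError(f"ambiguous or impossible conversion: {common or 'none'}")
    return common.pop()

# quantity_for("minutes", "hours")       -> "time"
# quantity_for("minutes", "revolutions") -> "angle"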
I assume that you want to hold the data about conversion in some kind of triples (fstUnit, sndUnit, multiplier).
For single unit conversions:
Use some hash function to map the unit structure to a number in O(1), and then put all multipliers in a matrix (you only have to store the upper-right triangle, because the mirrored entries are just the reciprocals).
For complex cases:
Example 1. m/s to km/h. You check (m,km) in the matrix, then the (s,h), then multiply the results.
Example 2. m^3 to km^3. You check (m,km) and take it to the third power.
Of course you need to raise errors when the types don't match, like area and volume.
You can make a class for Units that takes the conversion factor and the exponents of all basic units (I'd suggest using metric units for this; it makes your life easier). E.g. in Pseudo-Java:
public class Unit {
    public Unit(double factor, int meterExp, int secondExp, int kilogrammExp ... [other base units]) {
        ...
    }
}
//you need the speed in km/h (1 m/s is 3.6 km/h):
Unit kmPerH = new Unit(1 / 3.6, 1, -1, 0, ...)
I would have a table with these fields:
conversionID
fromUnit
toUnit
multiplier
and however many rows you need to store all the conversions you want to support
If you want to support a multi-step process (degrees F to C), you'd need a one-to-many relationship with the units table, say called conversionStep, with fields like
conversionID
sequence
operator
value
If you want to store one set of conversions but support multi-step conversions, like storing
Feet|Inches|12
Inches|Centimeters|2.54
and supporting converting from Feet to Centimeters, I would store a conversion plan in another table, like
conversionPlanID
startUnits
endUnits
via
your row would look like
1 | feet | centimeters | inches
