I just created a VERY large neural net, albeit on very powerful hardware, and imagine my shock and disappointment when I realized that NeuralFit[] from the NeuralNetworks` package only seems to use one core, and not even to its fullest capacity. I was heartbroken. Do I really have to write an entire NN implementation from scratch? Or did I miss something simple?
My net takes 200 inputs through 2 hidden layers of 300 neurons each to produce 100 outputs. I understand we're talking about trillions of calculations, but as long as I know my hardware is the weak point, that can be upgraded. The machine should handle training of such a net fairly well if left alone for a while (a 4 GHz, 8-thread machine with 24 GB of 2000 MHz CL7 memory and RAID-0 SSDs on SATA-III - I'm fairly sure).
Ideas? Suggestions? Thanks in advance for your input.
I am the author of the Neural Network Package. It is easy to parallelize the evaluation of a neural network given the input, that is, to compute the output of the network given the inputs (and all the weights, the parameters of the network). However, this evaluation is not very time consuming, and it is not very interesting to parallelize for most problems. The training of the network, on the other hand, is often time consuming and, unfortunately, not easy to parallelize. The training can be done with different algorithms, and the best ones are not easy to parallelize. My contact info can be found on the product's homepage on the Wolfram website. Improvement suggestions are very welcome.
The latest version of the package works fine on versions 9 and 10 if you switch off the Suggestions Bar (under Preferences). The reason is that the package uses the old Help Browser for its documentation, which crashes in combination with the Suggestions Bar.
yours Jonas
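To make the distinction in the answer above concrete, here is a minimal sketch in Python/NumPy (not the NeuralNetworks` package; all names and sizes are illustrative): evaluating a batch is independent per input row and could be split across cores freely, whereas each training step consumes the weights produced by the previous one, which is what makes training hard to parallelize.

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((200, 300)) * 0.05   # input -> hidden weights
W2 = rng.standard_normal((300, 100)) * 0.05   # hidden -> output weights

def evaluate(X, W1, W2):
    # Forward pass for a whole batch; each row of X is independent, so this
    # work could be split across cores with no coordination at all.
    return np.tanh(X @ W1) @ W2

def train_step(W1, W2, X, Y, lr=1e-3):
    # One gradient-descent step; step k needs the weights from step k-1,
    # which is what makes the training loop inherently sequential.
    H = np.tanh(X @ W1)
    err = H @ W2 - Y
    gW2 = H.T @ err
    gW1 = X.T @ ((err @ W2.T) * (1.0 - H**2))
    return W1 - lr * gW1, W2 - lr * gW2

X = rng.standard_normal((64, 200))
Y = rng.standard_normal((64, 100))
for _ in range(10):                            # sequential chain of updates
    W1, W2 = train_step(W1, W2, X, Y)
print(evaluate(X, W1, W2).shape)               # (64, 100)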
You can contact the author of the package directly, he is a very approachable fellow and might be able to make some suggestions.
I'm not sure how you wrote your code or how it is written inside the package you are using, but try to use vectorization; it really speeds up the linear algebra computations. The ml-class.org course shows how it's done.
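To illustrate the vectorization point, here is a generic NumPy sketch (not tied to the package in question): the same single-layer computation written as explicit Python loops versus one matrix product - the vectorized form is what lets an optimized BLAS keep the hardware busy.

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 200))    # 200 samples, 200 inputs
W = rng.standard_normal((200, 300))    # weights of a 300-neuron layer

def layer_loop(X, W):
    # Explicit Python loops: one dot product per sample per neuron.
    out = np.zeros((X.shape[0], W.shape[1]))
    for i in range(X.shape[0]):
        for j in range(W.shape[1]):
            out[i, j] = np.dot(X[i], W[:, j])
    return out

def layer_vectorized(X, W):
    # One matrix product: a single call into optimized BLAS.
    return X @ W

assert np.allclose(layer_loop(X, W), layer_vectorized(X, W))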
I have written a new algorithm for something. Now I need to compare it with existing methods, some of which are about 10 years old.
The idea I had is to look at benchmarks of different processors over the years in order to establish how much faster my processor (an i7-920) is than an average processor from 2003. Then I would simply divide the old methods' execution times by that speedup factor and use those numbers to compare against my own algorithm.
Has something like this been done before, so I don't redo existing work?
Can such a comparison be done some other way?
Are there some scientific papers written about such comparisons which I can reference?
I don't know which of these are possible for you, but here's a list of options I can think of:
1. Run their implementation side-by-side on your machine against yours. This is the best option (see the timing sketch after this list).
2. Rewrite their implementations and do (1). You should preferably compare your rewrite against their published results to make sure you get vaguely similar numbers.
3. Find a library that implements their algorithm (or multiple libraries) and do (1). I suggest multiple libraries, if possible, since a single one may not have implemented the algorithm efficiently. You may also want to compare these against their published results.
4. Compare the algorithms mathematically. This may be difficult, but it's not impossible.
5. Do what you presented.
(a) I would not recommend this, as factors other than processor speed (memory, cache, disk) also affect how fast an algorithm runs. Getting an equation that perfectly balances all of these will likely be very difficult.
(b) There is a massive difference between top- and bottom-of-the-line computers, so using the average is not a particularly good idea. If the authors didn't provide details about their hardware, I'm afraid your benchmark is not likely to be very accurate.
6. Go out and buy a machine of similar specs to the one used in the original experiments to benchmark on. A 10-year-old machine should be pretty cheap, if you can find one. Also, see (5.b).
7. Contact the authors to enable any of the other options. Papers often provide contact details for the authors, or you should be able to find them elsewhere if they have any sort of online presence and you're half-decent at using Google.
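For option 1, a minimal timing harness might look like the sketch below (Python; old_method, new_method, and the benchmark instances are placeholders for your own code and data):

import statistics
import time

def benchmark(fn, instances, repeats=5):
    # Run fn over the same problem instances several times and report the
    # best and the median wall-clock time.
    runs = []
    for _ in range(repeats):
        start = time.perf_counter()
        for x in instances:
            fn(x)
        runs.append(time.perf_counter() - start)
    return min(runs), statistics.median(runs)

# instances = load_benchmark_instances()      # same inputs for both methods
# print("existing method:", benchmark(old_method, instances))
# print("new method:     ", benchmark(new_method, instances))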
If I were reviewing your results, I would be annoyed if you attempted to demonstrate less than an order of magnitude speedup this way. There are a lot of variables determining algorithm performance, and I would be skeptical that a generic benchmark could capture the right ones. My gold standard is old and new algorithms implemented by the same programmer, with similar effort made to optimize, running on the same hardware. Using the previous authors' implementation instead of making a new one is commonplace in the experimental algorithms literature, but using different hardware isn't.
Algorithmic performance is usually measured in big-O terms, for which it is better to count basic operations, like comparisons, and do it for a range of input sizes.
If you must measure overall time, at least eliminate other sources of difference.
As @larsmans said, do it on the same processor.
Also, if there is existing work, there's no harm in repeating it.
Generally, in science, that's a good thing.
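As a concrete example of counting basic operations rather than wall-clock time, here is a small sketch (Python, with insertion sort standing in for the algorithm under study) that reports the number of comparisons over a range of input sizes:

import random

def insertion_sort_comparisons(values):
    # Sort a copy and count the key comparisons performed.
    a, comparisons = list(values), 0
    for i in range(1, len(a)):
        j = i
        while j > 0:
            comparisons += 1
            if a[j - 1] <= a[j]:
                break
            a[j - 1], a[j] = a[j], a[j - 1]
            j -= 1
    return comparisons

random.seed(0)
for n in (100, 200, 400, 800):       # doubling sizes exposes the growth rate
    data = [random.random() for _ in range(n)]
    print(n, insertion_sort_comparisons(data))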
You should attempt to reduce the number of differing factors between the two runs. I think timing the two algorithms side by side on the same machine and comparing their big-O complexity are both equally valid and important. You should also attempt to use updated libraries and other external functions; using outdated ones may also skew your timing results.
With the vast array of microcontrollers out there, and even different tiers of Arduinos each providing more power than the last, is there a mathematical way, or some way of knowing by analysis, how much processing power you need to run your program as designed, so you can choose the right micro?
That is, without just trial and error - trying it and, if it is too slow, buying the next chip up.
I've had to do performance projections for computer systems that did not exist yet. Things like cycle time ratios can only give a very rough guide. Generally, I had to resort to simulation, the nearest I could get to measuring on actual hardware.
That said, you may be able to find numbers for benchmarks similar to your code that will at least give you a starting point.
I would not do it by working up one chip at a time - your code may have a problem that makes it too slow for any feasible chip. I would try to find a chip that is fast enough, and work down if it is much faster than needed.
What are the fast solvers for FEM equations? I would prefer an open source implementation, but if there is a good commercial one, I won't mind paying for it.
Code Aster is an open source FE code.
The pre- and post-processing is usually done with Salome - both originate from EDF.
How about FEAP? It has full source code available when you purchase it. It is a pretty big project - maybe it's too much for your needs - but check it out.
FEAP is a general purpose finite element analysis program which is designed for research and educational use. Source code of the full program is available for compilation using Windows (Compaq or Intel compiler), LINUX or UNIX operating systems, and Mac OS X based Apple systems.
It also has a free Personal Edition called FEAPpv, including source code. Differences between the two versions are listed in this PDF.
"brad"? do you mean "broad"?
you don't say if your problem is linear or non-linear. that'll make a very big difference.
the solver depends on the type of equation and the size of your problem. for elliptic pdes you can choose standard linear algebra techniques like lu decomposition, iterative methods like successive over-relaxation, or wavefront solvers that minimize memory consumption.
some people like solving non-linear steady-state problems as if they were dynamics problems. the idea is to create "fake" mass and damping matrices and use explicit time integration to converge to steady state.
lots of choices. standard linear algebra is a good starting point.
language? java?
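As a very small illustration of the "standard linear algebra" starting point mentioned above (a generic SciPy sketch, not tied to any particular FEM package), here is a 1D Poisson model problem assembled as a sparse tridiagonal stiffness matrix and solved both with a direct sparse factorization and with conjugate gradient:

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 1000                                    # number of interior nodes
h = 1.0 / (n + 1)
# Tridiagonal stiffness matrix for -u'' = 1 with linear elements, u(0)=u(1)=0.
K = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csc") / h
f = np.full(n, h)                           # load vector for a unit source term

u_direct = spla.spsolve(K, f)               # direct sparse solve: robust, more memory
u_cg, info = spla.cg(K, f, maxiter=5000)    # conjugate gradient: SPD systems, low memory
print(info, np.abs(u_direct - u_cg).max())  # info == 0 means CG converged; difference should be small

For large 3D problems the choice between direct and iterative (usually preconditioned) solvers is exactly the trade-off between memory and robustness hinted at above.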
Oops, that's kind of a brad question.
Solving differential equations usually starts with analyzing the equation itself. Some equations are notoriously difficult to solve efficiently, e.g. indefinite boundary problems.
So if you have something other than an elliptic problem, you had better prepare for hard times ahead.
The next important and crucial part is transferring the continuous problem onto a discrete mesh. Typically, the accuracy of your results will vary with the way this mesh is generated. You'll need some sound experience here.
So I'd say there is no such thing as the fast solver for FEM equations. Anyway, while Wikipedia gives a short overview of the topic, you might also have a look at the German Wikipedia page. It lists well-known FEM implementations.
OpenFOAM and Elmer are two open source solvers. Not sure about Elmer, but I think OpenFOAM might use the control volume approach.
I used OpenFOAM for fluid dynamics research. You can do parallel processing with it with MPI. And if you have a Cray T3E it will be fast!
It's open source :D
http://www.opencfd.co.uk/openfoam/features.html#features
Please have a look at the Deal.II open source library:
http://www.dealii.org/
They also provide a VirtualBox image that comes with the libraries pre-installed.
It's a tricky question I was asked the other day... We're working on a pretty complex telephony (SIP) application with mixed C++ and PHP code with MySQL databases and several open source components.
A telecom engineer asked us to estimate the performance of the application (which is not ready yet). He went like 'well, you know how many packets can pass through the Linux kernel per second, plus you might know how quick your app is, so tell me how many calls will pass through your stuff per second'.
It seems like nonsense to me, as there are a million scenarios that might happen (well, literally...).
However... is there a way to estimate application performance (knowing the hardware it will run on, being able to run standard benchmarks on it, etc) before actual testing?
You certainly can bound the problem with upper (max throughput) limits. There is nothing nonsensical about that. In fact, not knowing that stuff indicates a pretty haphazard approach to a problem - especially in the telephony world.
You can work through the problem yourself - what is the minimum "work" you have to accomplish for a transaction or whatever unit of task you have in your app?
Some messages to and from, some processing and a database hit for example? Getting information on the individual pieces will give you an idea of the fastest possible throughput. If you load up the system and see significantly lower performance then you can take time to figure out where you are possibly losing throughput with inefficient algorithms, etc.
EDIT
To do this exercise you need to know all the steps your app does for each use case. Then you can identify the max throughput for each use case. You should definitely know this stuff prior to release and going live.
I'm ignoring the worst case analysis as that - as you point out - is quite a bit harder.
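A back-of-envelope version of that exercise might look like this (Python; every per-step cost below is a made-up placeholder - the point is the shape of the calculation, not the numbers):

per_call_ms = {
    "sip_parsing": 0.2,        # parse/serialize the SIP messages (placeholder)
    "business_logic": 0.5,     # C++/PHP processing (placeholder)
    "database": 1.5,           # one indexed MySQL lookup (placeholder)
    "network_io": 0.3,         # kernel and socket overhead (placeholder)
}

total_ms = sum(per_call_ms.values())           # minimum work per call
workers = 8                                    # e.g. worker processes
print(f"upper bound: {1000 / total_ms:.0f} calls/s per worker, "
      f"{workers * 1000 / total_ms:.0f} calls/s in total")

Replace the placeholders with measured numbers for your own pieces and the result is exactly the kind of upper bound the answer above describes.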
See Capacity Planning for Web Performance: Metrics, Models, and Methods. There are also some tools that can do this sort of discrete event simulation:
Hyperformix
SimPy
WikiPedia list of simulation tools
This stuff ain't easy, and the commercial tools will cost ya. The Capacity Planning book comes with a CD with lots of Excel workbook templates and examples of models that can jump start you.
Good luck :)
If you really have to answer this you could say something like this:
"I don't know off the top of my head. I am will to estimate this for you but it will take time. Obviously the accuracy of my answer depends upon how much effort (I.E. time) I put into calculating my estimate. How much time should I put into calculating my estimate?"
Put the burden back on them. If they really want an accurate answer, they're going to have to let you build at least some test applications that can simulate the actual environment.
You can spike to measure performance. Your whole system may not be working yet, but you know how the parts are intended to fit together. You can whip something up in a few hours that does the same kind of work as the final app will, across all the layers, and use it to measure performance of your design.
Remember: prototypes are broad, spikes are deep.
You should do the estimate. An estimate won't give you the right answer, but it will make you think about the problem. Right now it sounds like you're coding and hoping that everything will be OK. Or you are in panic mode and feel you don't have time for estimates.
Spend some time thinking about it. Analyse the important use cases. Think about the memory you may need; think about database access; think about network access (local and remote). These will affect the performance of your system. Get the whole team together to do this.
Regularly measure your system's performance during development for these important use cases. Mock up components or other systems if you have to. Analyse the results. How do they compare to your estimate? Maybe some components are memory-, database-, or network-bound. Maybe you need more memory, less database access, simpler queries, or caching. You don't have to make these changes straight away. However, you now know how your system operates and what you need to do.
Result: Fewer nasty surprises at system test. Less panic as the release date looms.
You can definitely do capacity planning in advance, but the quality of the estimate will depend on the quality of the data available.
The best estimate is to build the system in test, run simulated workloads, then predict capacity as a function of performance requirements and workload. These 3 form a prediction space - given 2 of the 3, you can predict the third:
Given performance requirements and capacity (i.e. hardware) you can calculate the workload you can handle.
Given performance requirements and workload, you can calculate the capacity (i.e. hardware) that you need.
Given workload and capacity, you can predict your expected performance.
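As a toy illustration of that "given two, predict the third" relationship (Python; the service demand, core count, and request rate are placeholder numbers, and a real capacity plan would use a proper queueing model rather than this simple utilization arithmetic):

service_demand_s = 0.004    # measured CPU time per request (performance side, placeholder)
cores = 8                   # capacity (placeholder)
workload_rps = 1500         # offered workload, requests per second (placeholder)

utilization = workload_rps * service_demand_s / cores
max_rps = cores / service_demand_s
print(f"utilization {utilization:.0%}, max sustainable {max_rps:.0f} req/s")

# Or fix a target utilization and solve for the capacity you need:
target_utilization = 0.7
cores_needed = workload_rps * service_demand_s / target_utilization
print(f"cores needed at {target_utilization:.0%} utilization: {cores_needed:.1f}")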
This is true in some domains, but unless you are an expert in the domain you won't have any idea. For example, I write code that controls industrial robots. The speed is limited by the robot's motion, not by the execution speed of the code. Knowing how fast the robot is and how far it has to go, we can make fairly good estimates of "speed". I'd have no idea how to estimate time for your application.
Let's just assume for now that you have narrowed down where the typical bottlenecks in your app are. For all you know, it might be the batch process you run to reindex your tables; it could be the SQL queries that run over your effective-dated trees; it could be the XML marshalling of a few hundred composite objects. In other words, you might have something like this:
public Result takeAnAnnoyingLongTime(Input in) {
// impl of above
}
Unfortunately, even after you've identified your bottleneck, all you can do is chip away at it. No simple solution is available.
How do you measure the performance of your bottleneck so that you know your fixes are headed in the right direction?
Two points:
Beware of the infamous "optimizing the idle loop" problem. (E.g. see the optimization story under the heading "Porsche-in-the-parking-lot".) That is, just because a routine is taking a significant amount of time (as shown by your profiling), don't assume that it's responsible for slow performance as perceived by the user.
The biggest performance gains often come not from that clever tweak or optimization to the implementation of the algorithm, but from realising that there's a better algorithm altogether. Some improvements are relatively obvious, while others require more detailed analysis of the algorithms, and possibly a major change to the data structures involved. This may include trading off processor time for I/O time, in which case you need to make sure that you're not optimizing only one of those measures.
Bringing it back to the question asked, make sure that whatever you're measuring represents what the user actually experiences, otherwise your efforts could be a complete waste of time.
1. Profile it.
2. Find the top line in the profiler, attempt to make it faster.
3. Profile it.
4. If it worked, go to 1. If it didn't work, go to 2.
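One way to do the "profile it" steps with Python's standard library might look like this (slow_function is a stand-in for your real entry point):

import cProfile
import io
import pstats

def slow_function():
    # Stand-in for the real code you want to profile.
    return sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
slow_function()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())       # the top line is where to start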
I'd measure them using the same tools / methods that allowed me to find them in the first place.
Namely, sticking timing and logging calls all over the place. If the numbers start going down, then you just might be doing the right thing.
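A sketch of what "sticking timing and logging calls all over the place" can look like (Python; suspected_bottleneck is a placeholder for whatever you wrap):

import functools
import logging
import time

logging.basicConfig(level=logging.INFO)

def timed(fn):
    # Log the wall-clock time of every call to the wrapped function.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logging.info("%s took %.3f ms", fn.__name__, elapsed_ms)
    return wrapper

@timed
def suspected_bottleneck(n=200_000):
    return sorted(range(n), reverse=True)

suspected_bottleneck()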
As mentioned in this MSDN column, performance tuning is compared to the job of painting the Golden Gate Bridge: once you finish painting the entire thing, it's time to go back to the beginning and start again.
This is not a hard problem. The first thing you need to understand is that measuring performance is not how you find performance problems. Knowing how slow something is doesn't help you find out why. You need a diagnostic tool, and a good one. I've had a lot of experience doing this, and this is the best method. It is not automatic, but it runs rings around most profilers.
It's an interesting question. I don't think anyone knows the answer. I believe a significant part of the problem is that for more complicated programs, no one can predict their complexity. Therefore, even if you have profiler results, it's very hard to interpret them in terms of the changes that should be made to the program, because you have no theoretical basis for what the optimal solution is.
I think this is a reason why we have so much bloated software. We optimize only enough that quite simple cases work on our fast machines. But once you put such pieces together into a large system, or use input an order of magnitude larger, the wrong algorithms (which were until then invisible, both theoretically and practically) start showing their true complexity.
Example: you create a string class which handles Unicode. You use it somewhere like processing computer-generated XML, where it really doesn't matter. But the Unicode processing is in there, consuming part of the resources. By itself, the string class can be very fast, but call it a million times and the program will be slow.
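A toy illustration of that effect (Python, with unicodedata.normalize standing in for "Unicode processing the data never needs"): the per-call cost looks negligible, but called a million times it shows up clearly.

import time
import unicodedata

def process_plain(s):
    return s.upper()

def process_with_unicode(s):
    # The normalization is unnecessary for this ASCII-only input.
    return unicodedata.normalize("NFC", s).upper()

data = ["<tag>plain ascii</tag>"] * 1_000_000
for fn in (process_plain, process_with_unicode):
    start = time.perf_counter()
    for s in data:
        fn(s)
    print(fn.__name__, f"{time.perf_counter() - start:.2f}s")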
I believe that most of the current software bloat is of this nature. There is a way to reduce it, but it contradicts OOP. There is an interesting book about various techniques; it's memory oriented, but most of them could be adapted to gain speed instead.
I'd identify two things:
1) what complexity is it? The easiest way to tell is to graph time taken versus size of input (see the sketch at the end of this answer).
2) how is it bound? Is it memory, or disk, or IPC with other processes or machines, or..
Now, point (2) is the easier one to tackle and explain: lots of things go faster if you have more RAM, a faster machine, faster disks, or a move to gigabit Ethernet. If you identify your pain point, you can put some money into hardware to make it tolerable.
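For point (1), a minimal sketch of the graphing approach (Python; algorithm_under_test is a deliberately quadratic stand-in for the real bottleneck) - the ratio between successive doublings hints at the complexity class, roughly 2x per doubling for O(n) and 4x for O(n^2):

import time

def algorithm_under_test(data):
    # Deliberately quadratic stand-in for the real bottleneck.
    return [x for x in data if data.count(x) == 1]

previous = None
for n in (500, 1000, 2000, 4000):
    data = list(range(n))
    start = time.perf_counter()
    algorithm_under_test(data)
    elapsed = time.perf_counter() - start
    ratio = f"{elapsed / previous:.1f}x" if previous else "-"
    print(f"n={n:5d}  {elapsed:.4f}s  {ratio}")
    previous = elapsed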