I am using my MacBook Pro. I am trying to run the MXNet Python demo code and the execution time is extremely slow; it takes a long time to execute. Is this normal? Also, I want to run MXNet on a Raspberry Pi 3.
Almost all deep learning frameworks (MXNet included) run much faster with a CUDA-capable GPU from NVIDIA. GPUs can often speed up the kinds of vector math needed for deep learning by 100x. Apple stopped building machines with NVIDIA GPUs several years ago (2012, IIRC). If you have one of those, make sure you have CUDA working on your Mac. I'm not aware of any way right now to get MXNet to make use of the AMD or Intel GPUs that ship with Apple machines. Also know that even with the fastest GPUs, deep learning jobs will often take hours, days, or even weeks to complete. So patience is definitely part of the game, regardless of what hardware you're using.
That said, GPUs aren't the only way to run deep learning systems. Particularly for making predictions (inference) with pre-trained models, CPUs are often just fine. So this can be useful for a task like semantic image processing.
Or, when training, using smaller datasets and smaller models can make them run faster. Also, to make sure you're getting the most out of your CPU, check that you have installed a good BLAS library such as Intel's MKL.
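For instance, here is a minimal CPU-only inference sketch (assuming a recent MXNet version that ships the Gluon model zoo; the model choice is just an illustrative example):

```python
import mxnet as mx
from mxnet.gluon.model_zoo import vision

ctx = mx.cpu()  # explicit CPU context; use mx.gpu(0) only if a CUDA device is available

# A small pre-trained network keeps CPU inference reasonably fast.
net = vision.mobilenet_v2_1_0(pretrained=True, ctx=ctx)

# Dummy 224x224 RGB batch; real code would load and normalize an actual image.
x = mx.nd.random.uniform(shape=(1, 3, 224, 224), ctx=ctx)
probs = net(x).softmax()
print(probs.argmax(axis=1))
```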
But getting any useful work out of a Raspberry Pi is going to take some careful optimization, even for inference. This is an area of active scientific research. See for example this paper. Or look at adding a USB hardware accelerator.
Judging by the latest news, the new Apple A11 Bionic processor scores higher than a mobile Intel Core i7 in the Geekbench benchmark.
As I understand it, this benchmark consists of many different tests. These tests simulate different workloads, including loads that can occur in everyday use.
Some people state that these results cannot be compared to x86 results. They say that x86 is able to perform "more complex tasks". As examples, they cite Photoshop, video conversion, and scientific calculations. I agree that software for ARM is often only a "lightweight" version of desktop software. But it seems to me that this limitation is caused by the format of mobile operating systems (do your work on the go, no mouse, etc.), and not by the performance of ARM.
As a counterexample, let's look at Safari. A browser is a complex program, and Safari works just as well on the iPad as on the Mac. Moreover, if we take the results of SunSpider (a JS benchmark), it turns out that Safari on the iPad scores higher.
I think that in everyday tasks (web, office, music/films), the performance of ARM (A10X, A11) and x86 (a dual-core mobile Intel i7) is comparable.
Are there any kinds of tasks where ARM really lags far behind x86? If so, what is the reason for this? What's stopping Apple from releasing a laptop on ARM? They have already done something similar with the migration from POWER to x86. Is this a technical restriction, or just marketing?
(I intended this as a comment since the question is off topic, but it got long.)
Of course you can compare; you just need to be very careful, which most people aren't. The fact that the companies publishing (or "leaking") results are biased doesn't help much either.
The common misconception is that you can run a benchmark on two systems and get a single comparable score for each. That ignores the fact that different systems have different optimization points, most often with regard to power (or "TDP"). What you need to look at is the power/performance curve: this graph shows how the system reacts to more power (raising the frequency, enabling more performance features, etc.) and how much that contributes to its performance.
One system can win over the low power range, but lose when the available power increases since it doesn't scale that well (or even stops scaling at some point). This is usually the case with Arm, as most of these CPUs are tuned for low power, while x86 covers a larger domain and scales much better.
If you are forced to observe a single point along the graph (which is a legitimate scenario, for example if you're looking for a CPU for a low-power device), at least make sure the comparison is fair and uses the same power envelope.
There are of course other factors that must be aligned (and sometimes aren't, due to negligence or an intention to cheat): the workload should be the same (I've seen different versions compared), and the compiler should be as close as possible. Generating ARM vs x86 code is already a difference, but the compiler's intermediate optimizations should be similar; when comparing two x86 vendors like Intel and AMD, you should prefer the same binary, unless you also want to allow machine-specific optimizations.
Finally, the systems themselves should also be similar, which is not the case when comparing a smartphone against a PC/MacBook. The memory could differ, the core count, and so on. These could be legitimate differences, but they are not really related to one architecture being better than the other.
The topic is a bit bogus: between the ISA and an application or its source code there are many abstraction levels, and the only metrics we have (execution time, or throughput) depend on many factors that could favor one or the other: the algorithm choices, the optimizations written into the source code, the compiler/interpreter implementation and its optimizations, and the operating system's behaviour. So the two are not exactly, mathematically comparable.
However, looking at the numbers, and at the usefulness of the mobile applications written for them, and speaking as a management engineer, ARM chips seem capable of running quite well.
I think the only reason is the inertia of the standards already spread around (note that Microsoft offers a variant of Windows running on ARM processors, and Debian ARM variants are ready: https://www.debian.org/distrib/netinst).
Looking at raw numbers, ARMv8 cores seem close to x86-64 ones.
Note the i7-3770K results: https://en.wikipedia.org/wiki/Instructions_per_second#MIPS
A summary of recent ARMv8 CPU characteristics; note the decode width, dispatch width, and caches, and compare the last column for the Cortex-A73 with the i7-3770K:
https://en.wikipedia.org/wiki/Comparison_of_ARMv8-A_cores
Intel Ivy Bridge characteristics:
https://en.wikichip.org/wiki/intel/microarchitectures/ivy_bridge_(client)
Cortex-A75 details: https://www.anandtech.com/show/11441/dynamiq-and-arms-new-cpus-cortex-a75-a55
The topic of power consumption is complex, too. The basic rule underlying all the frequency/voltage rules (used and abused) all over the web is transistor rise time: https://en.wikipedia.org/wiki/Rise_time
There is a fixed delay in the switching of a transistor; this determines the maximum frequency at which it can switch, and when many transistors are linked in cascade these delays add up in a nonlinear way (it takes some integration to demonstrate this). As a result, about ten years ago, to increase clock speeds companies tried to split the execution of an operation into more stages and run operations in a pipelined way, even inside a logical pipeline stage. https://en.wikipedia.org/wiki/Instruction_pipelining
The rise time depends on physical characteristics (the materials and shape of the transistors). It can be reduced by increasing the voltage, so the transistor switches faster, since the switching amounts (allow me the term) to the charge/discharge of a capacitor that triggers the opening/closing of the transistor channel.
These ARM chips are designed for low-power applications; by changing the design they could easily gain MHz, but they would use much more power. How much? Again, not comparable unless you work inside a foundry and have the numbers.
Examples of server-class ARM processors that are closer to desktop/workstation CPUs in power consumption are the Cavium and Qualcomm Falkor CPUs, and some benchmarks report that they are not bad.
I have:
MacBook Pro (Retina, 13-inch, Mid 2014)
OS X Yosemite
Intel Iris 1536MB
I have heard that I cannot use Theano with the GPU, only with the CPU, but I want to know whether the programming is the same and Theano internally works with the CPU or, in the other case, with the GPU. Or, when I program, do I have to write the code differently for each one?
Thanks a lot
For the most part a Theano program that runs well on a CPU will also run well on a GPU, with no changes to the code required. There are, however, a few things to bear in mind.
Not all operations have GPU versions, so it's possible to create a computation that contains a component that cannot run on the GPU at all. When you run one of these computations on a GPU, it will silently fall back to the CPU, so the program will run without failing and the result will be correct, but the computation will not run as efficiently as it might and will be slower because it has to copy data back and forth between main and GPU memory.
How you access your data can affect GPU performance quite a lot but has little impact on CPU performance. Main memory tends to be larger than GPU memory so it's often the case that your entire dataset can fit in main memory but not in GPU memory. There are techniques that can be used to avoid this problem but you need to bear in mind the GPU limitations in advance.
If you stick to conventional neural network techniques, and follow the patterns used in the Theano sample/tutorial code, then it will probably run fine on the GPU since this is the primary use-case for Theano.
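As a rough illustration (a minimal sketch; the exact device flag depends on your Theano version and backend, e.g. device=gpu for the old backend or device=cuda for the newer gpuarray backend), the same script can be pointed at the CPU or the GPU purely through configuration:

```python
# Same code runs on CPU or GPU; select the device via configuration, e.g.:
#   THEANO_FLAGS=device=cpu python predict.py
#   THEANO_FLAGS=device=cuda,floatX=float32 python predict.py
import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')  # symbolic input batch
w = theano.shared(  # parameters live in shared variables, placed on the selected device
    np.random.randn(784, 10).astype(theano.config.floatX), name='w')
y = T.nnet.softmax(T.dot(x, w))  # a toy linear classifier
predict = theano.function([x], y.argmax(axis=1))  # compiled for whichever device is active

batch = np.random.randn(32, 784).astype(theano.config.floatX)
print(predict(batch))
```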
Yes, effectively Theano detects whether you have a GPU or not and decides whether to use the CPU or the GPU when creating variables. The only difficulty is that when you create a model and its variables with Theano, the configuration of those variables differs depending on whether a GPU is used. In other words, if you create a model on the GPU (or CPU), save it in a *.pickle file (for example), and then move it to another PC without a GPU (or without GPU support, respectively), the saved model and its variables will not work.
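One common workaround (a sketch, not tied to any particular Theano version; the variable name and file name here are just examples) is to save the raw parameter values as plain NumPy arrays rather than pickling the shared variables themselves, since NumPy arrays are device-independent:

```python
import numpy as np
import theano

# Hypothetical shared variable, e.g. a weight matrix created as part of a model.
w = theano.shared(np.zeros((784, 10), dtype=theano.config.floatX), name='w')

# Save: pull the values out as plain NumPy arrays instead of pickling the
# shared variables themselves.
np.savez('params.npz', w=w.get_value())

# Load on any machine (CPU-only or GPU) by pushing the values back into
# shared variables created for that machine's configuration.
params = np.load('params.npz')
w.set_value(params['w'].astype(theano.config.floatX))
```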
I know this question is only partially programming-related, because the answer I would like to get originally comes from these two questions:
Why is the number of CPU cores so low (vs GPUs)? and Why aren't we using GPUs instead of CPUs, GPUs only, or CPUs only? (I know that GPUs are specialized while CPUs are more general-purpose, etc.) I also know that there are memory limitations (host vs GPU), along with differences in precision and caches. But in terms of hardware, comparing high-end to high-end, GPUs are much, much more performant.
So my question is: could we use GPUs instead of CPUs for the OS, applications, etc.?
The reason I am asking this question is that I would like to know why current computers still use two main processing units (CPU and GPU) with two main memory and caching systems, even though this is not something a programmer would want.
Current GPUs lack many of the facilities of a modern CPU that are generally considered important (crucial, really) to things like an OS.
Just for example, an OS normally uses virtual memory and paging to manage processes. Paging allows the OS to give each process its own address space, (almost) completely isolated from every other process. At least based on publicly available information, most GPUs don't support paging at all (or at least not in the way an OS needs).
GPUs also operate at much lower clock speeds than CPUs, so they only provide high performance for embarrassingly parallel problems. CPUs generally provide much higher performance for single-threaded code. Most of the code in an OS isn't highly parallel; in fact, a lot of it is quite difficult to make parallel at all (e.g., for years, Linux had a giant lock to ensure only one thread executed most kernel code at any given time). For this kind of task, a GPU would be unlikely to provide any benefit.
From a programming viewpoint, a GPU is a mixed blessing (at best). People have spent years working on programming models to make programming a GPU even halfway sane, and even so it's much more difficult (in general) than CPU programming. Given the difficulty of getting even relatively trivial things to work well on a GPU, I can't imagine attempting to write anything even close to as large and complex as an operating system to run on one.
GPUs are designed for graphics related processing (obviously), which is inherently something that benefits from parallel processing (doing multiple tasks/calculations at once). This means that unlike modern CPUs, which as you probably know usually have 2-8 cores, GPUs have hundreds of cores. This means that they are uniquely suited to processing things like ray tracing or anything else that you might encounter in a 3D game or other graphics intensive activity.
CPUs on the other hand have a relatively limited number of cores because the tasks that a CPU faces usually do not benefit from parallel processing nearly as much as rendering a 3D scene would. In fact, having too many cores in a CPU could actually degrade the performance of a machine, because of the nature of the tasks a CPU usually does and the fact that a lot of programs would not be written to take advantage of the multitude of cores. This means that for internet browsing or most other desktop tasks, a CPU with a few powerful cores would be better suited for the job than a GPU with many, many smaller cores.
Another thing to note is that more cores usually means more power needed. This means that a 256-core phone or laptop would be pretty impractical from a power and heat standpoint, not to mention the manufacturing challenges and costs.
Operating systems are usually pretty simple, if you look at their structure. But parallelizing them will not improve speeds much; only raw clock speed will.
GPUs simply lack hardware features and many instructions in their instruction sets that an OS needs; it's a matter of sophistication. Just think of virtualization features (Intel VT-x or AMD's AMD-V).
GPU cores are like dumb ants, whereas a CPU is like a complex human, so to speak. Both have different energy consumption because of this and produce very different amounts of heat.
See this extensive Super User answer for more info.
Because nobody will spend money and time on this, except for some enthusiasts like this one: http://gerigeri.uw.hu/DawnOS/history.html (now here: http://users.atw.hu/gerigeri/DawnOS/history.html)
Dawn now works on GPU-s: with a new OpenCL capable emulator, Dawn now boots and works on Graphics Cards, GPU-s and IGP-s (with OpenCL 1.0). Dawn is the first and only operating system to boot and work fully on a graphics chip.
I am a C++ programmer who develops image and video algorithms. Should I learn Nvidia CUDA? Or is it one of those technologies that will disappear?
CUDA is currently a single-vendor technology from NVIDIA and therefore doesn't have the multi-vendor support that OpenCL does.
However, it's more mature than OpenCL, has great documentation, and the skills learnt using it will transfer easily to other parallel data processing toolkits.
As an example of this, read the Data Parallel Algorithms paper by Steele and Hillis and then look at the Nvidia tutorials - there's a clear link between the two, yet the Steele/Hillis paper was written over 20 years before CUDA was introduced.
Finally, the FCUDA project is working to allow CUDA code to target non-NVIDIA hardware (FPGAs).
CUDA should stick around for a while, but if you're just starting out, I'd recommend looking at OpenCL or DirectCompute. Both of these run on ATI as well as NVidia hardware, in addition to also working on the vector units (SSE) of CPUs.
I think you should rather stick with OpenCL, which is an open standard and is supported by ATI, NVIDIA, and more. CUDA might not disappear in the next few years, but in any case it is not compatible with non-NVIDIA GPUs.
OpenCL might take some time to become pervasive, but I found learning CUDA very informative, and I don't think CUDA is going to be out of the limelight anytime soon. Besides, CUDA is easy enough that the time it takes to learn it is much shorter than its likely shelf life.
This is the era of high-performance, parallel computing. CUDA and OpenCL are the emerging technologies of GPU computing, which is itself a form of high-performance computing. If you are a passionate programmer and want to push parallel algorithms to their limits, you should really go for these technologies. The data-parallel parts of your program can execute in a fraction of a second on a many-core GPU architecture, where they would usually take much longer on a CPU.
The 1.0 spec for OpenCL just came out a few days ago (Spec is here) and I've just started to read through it. I want to know if it plays well with other high performance multiprocessing APIs like OpenMP (spec) and I want to know what I should learn. So, here are my basic questions:
If I am already using OpenMP, will that break OpenCL or vice-versa?
Is OpenCL more powerful than OpenMP? Or are they intended to be complementary?
Is there a standard way of connecting an OpenCL program to a standard C99 program (or any other language)? What is it?
Does anyone know if anyone is writing an OpenCL book? I'm reading the spec, but I've found books to be more helpful.
OpenMP and OpenCL are distinct, but can be made to work together. Neither of them should "break" the other.
For the sake of argument, let's assume there's a tradeoff between minimizing changes to an existing codebase and performance or computing power. OMP is "easy" in that you can apply it "magically" to embarrassingly parallel problems with a quick pragma or two.
OpenCL introduces brand new high-level concepts beyond typical OS threading models. Khronos probably doesn't want to say it out loud, but its genesis is in NVIDIA's CUDA. If you want to see how it works today, download the CUDA SDK and start playing. If you don't have any NVIDIA GPUs, don't worry, there's a GPU-emulator software option. OpenCL is a handy abstraction of a GPU that should apply to CPUs, DSPs, "accelerators" (Khronos' nickname for IBM's CellBE and probably Intel's Larrabee).
OpenCL is not supposed to be "written directly in C99". It's referred to as a C99 extension since its syntax is similar/identical to C99 with some new keywords. You cannot call libc (or any other library) from a kernel.
You could use both, but theoretically OpenCL should be "better" (in that it's portable to more computing devices) if you're willing to port your code. You cannot use OpenMP pragmas in an OpenCL kernel.
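To make the host/kernel split concrete, here is a minimal sketch of driving an OpenCL kernel from a host program, using the PyOpenCL binding as one example of calling OpenCL from "any other language" (the kernel itself is written in the C99-like OpenCL C and cannot call libc; the kernel name and sizes here are arbitrary):

```python
import numpy as np
import pyopencl as cl

# Host side: pick a device, create a context and a command queue.
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a = np.random.rand(1024).astype(np.float32)
b = np.random.rand(1024).astype(np.float32)
out = np.empty_like(a)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, out.nbytes)

# Device side: the kernel is compiled at runtime from OpenCL C source.
program = cl.Program(ctx, """
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *out) {
    int gid = get_global_id(0);
    out[gid] = a[gid] + b[gid];
}
""").build()

# Launch one work-item per element, then copy the result back to the host.
program.vadd(queue, a.shape, None, a_buf, b_buf, out_buf)
cl.enqueue_copy(queue, out, out_buf)
print(np.allclose(out, a + b))
```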
See also:
http://wikipedia.org/wiki/OpenCL
CUDA
LLVM
For the most part OpenMP and OpenCL are independent from each other. They are both ways of giving the developer access to parallelism on their platform.
OpenMP is designed to work well with multiple (identical) processors, where work that is approximately equal can be (nearly) automatically farmed out between them.
OpenCL is a somewhat different beast, in that it really shines when working with special co-processor hardware. It will allow you to offload some of the heavy-duty number crunching to the GPU or some other co-processor, like in the Cell. However, it was also built with the idea that it could be used to harness other main processors, as are now common in multi-core computers. I would consider this feature secondary, and if this is all you intend to use OpenCL for, I would not recommend it.
That said, I'd guess it would be somewhat challenging, though definitely not impossible to get OpenMP and OpenCL to work together in the same problem.
The first thing to think about is what work you're giving to OpenCL. This would definitely be a case where you would only want OpenCL to run on the GPU/co-processor, not on the other main processors/cores, since OpenMP is already using those. It wouldn't (shouldn't) cause application errors to run OpenCL and OpenMP on the same main processor, but it will cause undesirable scheduling where both the OpenMP and OpenCL parts run slower, because they spend a good chunk of their time switching back and forth between each other. This would also happen if you ran any other processor-hungry process on the same core at the same time.
The other big thing to think about is how you're going to schedule the tasks that run on the co-processor. It's true that you can feed a lot of work into one of the modern GPUs, but there are lots of things to think about with the pipeline and memory usage. What you wouldn't want is to have 8 different OpenMP threads each trying to send their own work to the co-processor at the same time. I would recommend having only one thread manage all the interactions with the co-processor, so it can make sure to feed it work in an efficient manner.
That said, I'm sure there are programs that have multiple types of tasks happening at the same time, where one type of task could always be farmed out to the Co-Processor and another kind of task could be handled by the multi-core main processor. This would be a fine example of a time to mix OpenMP and OpenCL.
Good Luck!
OpenCL is supposed to be written directly in C99, AFAIK? There are header files available for it now, anyhow.
By the way, there is work on compiling OpenMP to GPGPU code using CUDA.