Mathematica Parallel computing with VMware Client

I have access to a virtual machine via VMware Client and want to do parallel computing with Mathematica (there is no direct internet access on the VM).
However, Mathematica is not installed on the VM, and I don't want to buy an additional license.
So I want to keep the interface on my laptop and offload as much of the computation as possible to the VM.
My questions:
- Is it possible?
- How does it work?
Thanks,
Andreas

The use of Mathematica's native parallel computing capabilities does kind of depend on there being a Mathematica kernel running on each processing element (core, CPU, functional unit, whatever the heck they're called these days) in the gang. So, without going outside Mathematica, you're dead in the water; it ain't possible.
On the other hand, you could use Mathematica's recently enhanced capabilities to play nicely with programs written in other languages (as long as those other languages are C) to write a back-end to run on the VM and call it from the Mathematica installation on your laptop. And, of course, once you start writing in C, you can do all the parallelisation you want.
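One way to flesh that out: a minimal sketch, assuming you go the standalone-C-program route, is a back-end compiled on the VM with OpenMP, which the Mathematica front end on the laptop invokes remotely (e.g. over ssh) and reads the result back. The host name, the argument and the toy workload below are all placeholders, not anything from the answer above.

    /* backend.c - hypothetical compute back-end to run on the VM.
     * Compile there with:  gcc -O2 -fopenmp backend.c -o backend
     * Mathematica on the laptop could then call it remotely, e.g.
     *   ReadList["!ssh user@vm ./backend 100000000", Number]
     * (host name, argument and workload are placeholders).
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        long n = (argc > 1) ? atol(argv[1]) : 1000000;
        double sum = 0.0;

        /* The actual work runs in parallel on all cores of the VM. */
        #pragma omp parallel for reduction(+:sum)
        for (long i = 1; i <= n; ++i)
            sum += 1.0 / ((double)i * (double)i);   /* toy workload: converges to pi^2/6 */

        printf("%.15g\n", sum);   /* single number read back by the front end */
        return 0;
    }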

Related

How to run an OpenMP program on clusters with multiple nodes? [duplicate]

I want to know if it would be possible to run an OpenMP program on multiple hosts. So far I have only heard of programs that can be executed on multiple threads, but all within the same physical computer. Is it possible to execute a program on two (or more) clients? I don't want to use MPI.
Yes, it is possible to run OpenMP programs on a distributed system, but I doubt it is within the reach of every user around. ScaleMP offers vSMP - an expensive commercial hypervisor software that allows one to create a virtual NUMA machine on top of many networked hosts, then run a regular OS (Linux or Windows) inside this VM. It requires a fast network interconnect (e.g. InfiniBand) and dedicated hosts (since it runs as a hypervisor beneath the normal OS). We have an operational vSMP cluster here and it runs unmodified OpenMP applications, but performance is strongly dependent on data hierarchy and access patterns.
NICTA used to develop a similar SSI hypervisor named vNUMA, but that development has also stopped. Besides, their solution was IA64-specific (IA64 being Intel Itanium, not to be confused with Intel 64, which is their current generation of x86 CPUs).
Intel used to develop Cluster OpenMP (ClOMP; not to be confused with the similarly named project to bring OpenMP support to Clang), but it was abandoned due to "general lack of interest among customers and fewer cases than expected where it showed a benefit" (from here). ClOMP was an Intel extension to OpenMP and was built into the Intel compiler suite, i.e. you couldn't use it with GCC (this request to start ClOMP development for GCC went into limbo). If you have access to old versions of the Intel compilers (versions 9.1 to 11.1), you would have to obtain a (trial) ClOMP license, which might be next to impossible given that the product is dead and old (trial) licenses have already expired. Then again, starting with version 12.0, Intel compilers no longer support ClOMP.
Other research projects exist (just search for "distributed shared memory"), but only vSMP (the ScaleMP solution) seems mature enough for production HPC environments (and it's priced accordingly). It seems like most efforts now go into the development of PGAS languages (Co-Array Fortran, Unified Parallel C, etc.) instead. I would suggest that you have a look at Berkeley UPC or invest some time in learning MPI, as it is definitely not going away in the years to come.
Earlier, there was Cluster OpenMP.
Cluster OpenMP was an implementation of OpenMP that could make use of multiple SMP machines without resorting to MPI. This had the advantage of eliminating the need to write explicit messaging code, as well as not mixing programming paradigms. The shared memory in Cluster OpenMP was maintained across all machines through a distributed shared-memory subsystem. Cluster OpenMP is based on the relaxed memory consistency of OpenMP, allowing shared variables to be made consistent only when absolutely necessary. (source)
Performance Considerations for Cluster OpenMP
Some memory operations are much more expensive than others. To achieve good performance with Cluster OpenMP, the number of accesses to unprotected pages must be as high as possible, relative to the number of accesses to protected pages. This means that once a page is brought up-to-date on a given node, a large number of accesses should be made to it before the next synchronization. In order to accomplish this, a program should have as little synchronization as possible, and re-use the data on a given page as much as possible. This translates to avoiding fine-grained synchronization, such as atomic constructs or locks, and having high data locality. (source)
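As a rough illustration of that advice in plain OpenMP C (nothing ClOMP-specific, and purely my own example): replace per-iteration atomic updates with a reduction, so each thread works on its own data for as long as possible and synchronises only once.

    /* Illustration of the locality advice above (plain OpenMP, not ClOMP-specific). */
    #include <stdio.h>

    #define N 10000000
    static double a[N];

    int main(void)
    {
        double total = 0.0;
        for (long i = 0; i < N; ++i)
            a[i] = (double)i;

        /* Fine-grained version (bad on a DSM system): an atomic update,
         * i.e. a synchronisation, on every single iteration.
         *
         *   #pragma omp parallel for
         *   for (long i = 0; i < N; ++i) {
         *       #pragma omp atomic
         *       total += a[i];
         *   }
         */

        /* Coarse-grained version: each thread accumulates privately, pages are
         * re-used many times, and threads synchronise once at the very end. */
        #pragma omp parallel for reduction(+:total)
        for (long i = 0; i < N; ++i)
            total += a[i];

        printf("%.1f\n", total);
        return 0;
    }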
Another option for running OpenMP programs on multiple hosts is the remote offloading plugin in the LLVM OpenMP runtime.
https://openmp.llvm.org/design/Runtimes.html#remote-offloading-plugin
The big issue with running OpenMP programs on distributed memory is data movement. Coincidentally, that is also one of the major issues in programming GPUs. Extending OpenMP to handle GPU programming has given rise to OpenMP directives that describe data transfer. Programming GPUs has also forced programmers to think more carefully about building programs with data movement in mind.
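As a sketch of what those data-transfer directives look like (assuming a toolchain built with offloading enabled, e.g. clang with the LLVM OpenMP runtime; the device could be a GPU or, with the remote-offloading plugin, another host), the map clauses below make the movement of a and b between host and device explicit.

    /* Sketch of an OpenMP target region with explicit data mapping. */
    #include <stdio.h>

    #define N 1024

    int main(void)
    {
        double a[N], b[N];
        for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2.0 * i; }

        /* The map clauses describe exactly which data moves to and from the device. */
        #pragma omp target teams distribute parallel for \
                map(to: b[0:N]) map(tofrom: a[0:N])
        for (int i = 0; i < N; ++i)
            a[i] += b[i];

        printf("a[0]=%g a[N-1]=%g\n", a[0], a[N - 1]);
        return 0;
    }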

high performance runtime

This is the first time I have submitted a question to this forum.
I'm posting a general question; I don't have to develop an application for a specific purpose.
After a lot of "googling" I still haven't found a language/runtime/script engine/virtual machine that matches these 5 requirements:
- memory allocation of variables/values or objects cleaned up at run time (e.g. à la C++ with the delete keyword, or free in C)
- the language (and consequently the program) is a script or pseudo-compiled à la byte code, and should be portable across the main operating systems (Windows, Linux, *BSD, Solaris) and platforms (32/64-bit)
- native use of multiple cores (engine/runtime)
- no limit on heap usage
- a library for networking
The programming language for building applications that run on this engine can be anything (the paradigm is not important).
I hope that this post won't stir up a holy war, but I'd like to focus on the engine's behaviour during program execution.
Sorry for my bad English.
Luke
I think Erlang might fit your requirements:
most data is either allocated in local scopes, and therefore immediately deleted after use, or contained in library-powered permanent storage like ETS, DETS or Mnesia. There is garbage collection, though, but the paradigm of the language makes it far less important.
the Erlang compiler compiles the source code to BEAM virtual machine byte code which, unlike Java's, is register-based and thus much faster. The VM is available for:
Solaris (including 64 bit)
BSD
Linux
OSX
TRU64
Windows NT/2000/2003/XP/Vista/7
VxWorks
Erlang has been designed for distributed systems, concurrency and reliability from day one
Erlang's heap grows with your demand for it; it is initially limited and expanded automatically (there are numerous tweaks you can use to configure this on a per-VM basis)
Erlang comes from a networking background and provides tons of libraries from IP to higher-level protocols

OpenCL distribution

I'm currently developing an OpenCL application for a very heterogeneous set of computers (using JavaCL, to be specific). In order to maximize performance I want to use a GPU if one is available, and otherwise fall back to the CPU and use SIMD instructions. My plan is to implement the OpenCL code using vector types, because my understanding is that this allows CPUs to vectorize the instructions and use SIMD instructions.
My question, however, is which OpenCL implementation to use. E.g. if the computer has an Nvidia GPU I assume it's best to use Nvidia's library, but if no GPU is available I want to use Intel's library for the SIMD instructions.
How do I achieve this? Is this handled automatically or do I have to include all libraries and implement some logic to pick the right one? It feels like this is a problem that more people than I are facing.
Update
After testing the different OpenCL drivers this is my experience so far:
- Intel: crashed the JVM when JavaCL tried to call it. After a restart it didn't crash the JVM, but it also didn't return any usable devices (I was using an Intel i7 CPU). When I compiled the OpenCL code offline it seemed to be able to do some auto-vectorization, so Intel's compiler seems quite nice.
- Nvidia: refused to install their WHQL drivers because it claimed I didn't have an Nvidia card (that computer has a GeForce GT 330M). When I tried it on a different computer I managed to get all the way to creating a kernel, but at the first execution it crashed the drivers (the screen flickered for a while and Windows 7 said it had to restart the drivers). The second execution caused a blue screen of death.
- AMD/ATI: refused to install the 32-bit SDK (I tried that since I will be using a 32-bit JVM), but the 64-bit SDK worked well. This is the only driver on which I've managed to execute the code (after a restart, because at first it gave a cryptic error message when compiling). However, it doesn't seem to be able to do any implicit vectorization, and since I don't have an ATI GPU I didn't get any performance increase compared to the Java implementation. If I use vector types I might see some improvements, though.
TL;DR None of the drivers seem ready for commercial use. I'm probably better off creating a JNI module with C code compiled to use SSE instructions.
First try to understand hosts & devices: http://www.streamcomputing.eu/blog/2011-07-14/basic-concept-hosts-and-devices/
Basically you can just do exactly what you described: check if a certain driver is available and if not, try the next one. What you choose first depends completely on your own preference. I would pick the device I have tested my kernel best on. In JavaCL you can pick the fastest device with JavaCL.createBestContext and CLPlatform.getBestDevice, check the host-code here: http://ochafik.com/blog/?p=501
Note that NVidia does not support CPUs via their driver; only AMD and Intel do. Also, targeting multiple devices (say 2 GPUs and a CPU) is a bit more difficult.
There is no API that provides exactly what you want. However, you can do the following:
- iterate over clGetPlatformIDs and, for each platform, query the number of devices (clGetDeviceIDs) and the type of each device, and pick a platform which has both types;
- then build a map in your code that maps each device type to the list of platforms supporting it, ordered in some manner;
- finally, take the first item in the list corresponding to CL_DEVICE_TYPE_CPU and the first item corresponding to CL_DEVICE_TYPE_GPU;
- if both results are equal (platform_cpu == platform_gpu), pick one of them and use it for both.
If there is a platform supporting both, you will get a match as above since the lists are ordered. You can then also do load balancing on a single platform if you like, as Intel's implementation allows.
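For reference, here is roughly what that enumerate-and-pick logic looks like against the raw OpenCL C API (my own sketch, with most error handling omitted; preferring a GPU first and falling back to a CPU is just one possible policy):

    /* Rough sketch: enumerate platforms, prefer one exposing a GPU, fall back to CPU. */
    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platforms[8];
        cl_uint num_platforms = 0;
        clGetPlatformIDs(8, platforms, &num_platforms);

        cl_platform_id chosen = NULL;
        cl_device_id device = NULL;

        /* First pass: look for a platform with a GPU device. */
        for (cl_uint i = 0; i < num_platforms && !chosen; ++i)
            if (clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, 1, &device, NULL) == CL_SUCCESS)
                chosen = platforms[i];

        /* Second pass: fall back to a CPU device (e.g. Intel's or AMD's CPU runtime). */
        for (cl_uint i = 0; i < num_platforms && !chosen; ++i)
            if (clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_CPU, 1, &device, NULL) == CL_SUCCESS)
                chosen = platforms[i];

        if (!chosen) {
            fprintf(stderr, "No usable OpenCL device found\n");
            return 1;
        }

        char name[256];
        clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof name, name, NULL);
        printf("Using device: %s\n", name);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        /* ... create a command queue, build the program, run kernels ... */
        clReleaseContext(ctx);
        return 0;
    }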
Sorry for being late to the party, but regarding Intel's implementation behaviour under JavaCL, I'm afraid you've been bitten by a JavaCL bug:
https://github.com/ochafik/nativelibs4java/issues/297
Fixed in JavaCL 1.0.0-RC2 !
Cheers

Can the Ruby language be used to build operating systems?

Can the Ruby language be used to create an entire new mobile operating system or desktop operating system i.e. can it be used in system programming?
Well, there are a few operating systems out there right now which use higher-level languages than C. Basically the Ruby interpreter itself would need to be written in something low-level, and there would need to be some boot-loading code that loaded a fully functional Ruby interpreter into memory as a standalone kernel. Once the Ruby interpreter is bootstrapped and running in kernel mode (or one of the inner rings), there would be nothing stopping you from building a whole OS on top of it.
Unfortunately, it would likely be very slow. Garbage collection for every OS function would probably be rather noticeable. The Ruby interpreter would be responsible for basic things like task scheduling and the network stack, and running those on a garbage-collected runtime would slow things down considerably. To work around this, odds are good that the "performance critical" pieces would still be written in C.
So yes, technically speaking this is possible. But no one in their right mind would try it (cue crazy person in 3... 2...)
For all practical purposes: No.
While the language itself is not suited for such a task, it is imaginable (in some other universe ;-) that a Ruby run-time could be developed with such a goal in mind.
The only "high level" language -- yes, the quotes are there for a reason, I don't consider C very "high level" these days -- that I know of designed for systems programming is BitC. (Which is quite unlike Ruby.)
Happy coding.
Edit: Here is a list of "Lisp-based OSes". While not Ruby, the dynamically-typed/garbage-collected nature of (many) Lisp implementations makes for a favorable comparison: if those crazy Lispers can do/attempt it, then so can some Ruby fanatic ... or at least they can wish for it ;-) There is even a link to an OCaml OS on the list...
No, not directly
In the same way that Rails is built on top of Ruby, Ruby is built on top of the services that lower layers .. the real OS .. provide.
I suppose one could subset Ruby until it functionally resembled C and then build an OS out of that, but it wouldn't be worth it. Sure, it would have a nice if .. end but C syntax is perfectly usable and we already have C language systems. Also, operating systems don't handle character data very much, so all of the Ruby features to manipulate it wouldn't be as valuable in a kernel.
If we were starting from scratch today we might actually try (as various experimental projects have) to use garbage collected memory allocation in a kernel but we already have OS kernels.
People are making investments at the higher layers rather than redoing work already done. After all, with all the upper level software to run these days, a new kernel would need to present a compatible interface and the question would then be asked "why not just run the nice kernels we already have?".
Now, the application API for a mobile OS could indeed be done for Ruby. So, just as Android apps are written in Java, RubyPhone apps could be written in Ruby. But Ruby might not be the best possible starting point for a rich application platform. Its development so far has been oriented to server-side problems. There exist various graphical interface gems but I don't think they are widely used.
Basically yes, but with a big disclaimer .. which is basically Chris' answer with a different spin on it. Since using Ruby for the kernel would kinda suck performance-wise, you'd probably want to build around a Linux-ish kernel and just not load any of the rest of the operating system. This is basically what Android does: the kernel is a fork of Linux (and is maintained close to Linux), the console is a WebKit screen, and the interpreter is Java with some Android-specific libraries. I.e., Android is Java masquerading as an OS .. you could do about the same thing with Ruby instead of Java and take only a smallish performance hit compared to Java.
While building a whole OS from scratch in Ruby seems like a multi-billion project (think of all the drivers), a Linux kernel module that runs simple Ruby scripts does make sense to me - even if only for prototyping new Linux drivers.

LPT control on Windows

I am starting a new project which will use a microcontroller. The easiest way to program it is via the parallel port. But there are a few things I hope you can help me with. Oh, and the preferred language is C and the platform is Windows.
So, I studied LPT ports and Windows a bit, and from what I learned the most important point is: since Windows NT-based systems, you cannot use instructions for direct port manipulation. This is supposedly because programs now run in a different privilege mode, which doesn't allow the kind of instructions used by the outport() function.
But at this point, I don't understand a few things. First, I thought that Windows had actually used privilege levels since its first protected-mode version, but apparently that's a wrong assumption.
More importantly, I thought that Windows included functions for just about any hardware communication. I mean, for anything you do in Windows these days, you just call Windows functions which in turn call kernel services. I assumed that outport() doesn't use any Windows function and just performs the communication itself, which is prohibited now. But I am literally shocked that there is no system function to control parallel ports in modern Windows systems. At least that's what I read.
But even if I could get control of the parallel port, there comes my second problem.
For programming the controller I need to follow a special protocol, especially its timing. But since Windows is multitasking, I worry that the scheduler might switch to another app, so that right when it's time to switch a signal on the LPT my program just won't be running.
Oh, by the way, I know I could use some 3rd-party app, but I like to be able to do it myself, or at least before I use a 3rd-party app I want to know how it works. And yes, you can program some microcontrollers just through the parallel port with some resistors, I know this for sure.
Thanks.
For Windows you need to install a DLL which contains a driver that runs at elevated privileges to get access to the hardware ports.
You can find such a library at :
http://logix4u.net/Legacy_Ports/Parallel_Port/Inpout32.dll_for_Windows_98/2000/NT/XP.html
There are also some links to sample code.
I do not know which microcontroller you are using, but I have programmed a variety of them in the past and never had issues with timing, at least for programming. The programming protocols are usually robust enough to deal with the jitter caused by multitasking. Just keep your clock edges and signal edges well separated and it should go fine.
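For completeness, here is a rough sketch of how such a library is typically driven from C; Inp32/Out32 are the functions the Inpout32 library exports, and 0x378 is the traditional LPT1 base address, which may differ on your machine.

    /* Minimal sketch of toggling the LPT data port via Inpout32.dll. */
    #include <stdio.h>
    #include <windows.h>

    typedef void  (__stdcall *out32_t)(short port, short value);
    typedef short (__stdcall *inp32_t)(short port);

    int main(void)
    {
        HMODULE lib = LoadLibraryA("inpout32.dll");
        if (!lib) { fprintf(stderr, "inpout32.dll not found\n"); return 1; }

        out32_t Out32 = (out32_t)GetProcAddress(lib, "Out32");
        inp32_t Inp32 = (inp32_t)GetProcAddress(lib, "Inp32");
        if (!Out32 || !Inp32) { fprintf(stderr, "exports missing\n"); return 1; }

        Out32(0x378, 0xFF);   /* set all 8 data lines high */
        Sleep(10);            /* crude delay between edges */
        Out32(0x378, 0x00);   /* and low again */

        printf("status register: 0x%02X\n", Inp32(0x379) & 0xFF);

        FreeLibrary(lib);
        return 0;
    }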
