This might sound a bit naive, but I'm unable to find a satisfying answer to a question that's been on my mind.
Let's say there is an algorithm X, which is implemented in 10 different programming languages. After the bootstrap stage, each program executes the algorithm over and over again. My question is: would there be any difference at the hardware level when they all execute on the same CPU?
What I understand is that the set of hardware resources (registers, etc.) on each CPU is limited. Hence, executing the same core algorithm should follow a similar (if not identical) pattern in terms of the fetch–decode–execute cycle.
Related
I wanted to know: how do computers calculate logarithms?
I don't mean the library functions; for example, Python has the math.log() function. What I want to know is what exactly that function does, and whether it can be reimplemented (perhaps more accurately).
Is there a formula for it? Or an algorithm? (I don't think the computer keeps a log table!)
Thanks
The GNU C library, for example, uses the fyl2x assembler instruction, which means that logarithms are computed directly by the hardware.
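Just to illustrate, the call boils down to something like this GCC-style inline assembly (x86/x86-64 only; this is a sketch, not glibc's actual source, though the constraint idiom is the common way of driving two-operand x87 instructions):

    #include <math.h>   /* M_E, M_LN2 */
    #include <stdio.h>

    /* fyl2x computes ST(1) * log2(ST(0)) on the x87 FPU, so loading
       ln(2) into ST(1) yields the natural logarithm of x. */
    static double log_via_fyl2x(double x)
    {
        double res;
        __asm__ ("fyl2x" : "=t" (res) : "0" (x), "u" (M_LN2) : "st(1)");
        return res;
    }

    int main(void)
    {
        printf("%.15f\n", log_via_fyl2x(M_E));  /* prints ~1.0 */
        return 0;
    }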
Hence one should ask: what algorithm do CPUs use to calculate logarithms?
It depends on the CPU. For Intel IA-64, it is a Taylor series combined with a lookup table.
More info can be found here: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.5177
and here: http://www.computer.org/csdl/proceedings/arith/1999/0116/00/01160004.pdf
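To make the table-plus-series idea concrete, here is a small C sketch. The anchor spacing (1/16), the 17-entry table and the 4-term series are my own illustrative choices, not the actual IA-64 parameters, and a real implementation would also handle zero, negative, infinite and NaN inputs:

    #include <math.h>   /* frexp() for range reduction; log() only for comparison */
    #include <stdio.h>

    /* Precomputed table of ln(1 + k/16) for k = 0..16. */
    static const double ln_table[17] = {
        0.00000000000, 0.06062462182, 0.11778303566, 0.17185025693,
        0.22314355131, 0.27193371548, 0.31845373111, 0.36290549368,
        0.40546510811, 0.44628710263, 0.48550781578, 0.52324814376,
        0.55961578794, 0.59470710774, 0.62860865942, 0.66139848224,
        0.69314718056
    };

    /* ln(x) for finite x > 0 via table lookup plus a short Taylor series. */
    static double my_log(double x)
    {
        int e;
        double m = frexp(x, &e);        /* x = m * 2^e, m in [0.5, 1) */
        m *= 2.0; e -= 1;               /* renormalize: m in [1, 2)   */

        /* Nearest table anchor a = 1 + k/16; then
           ln(m) = ln(a) + ln(1 + r) with r = (m - a)/a, |r| <= 1/32. */
        int k = (int)((m - 1.0) * 16.0 + 0.5);
        double a = 1.0 + k / 16.0;
        double r = (m - a) / a;

        /* Taylor series for ln(1 + r); error ~ r^5/5 < 1e-8 here. */
        double ln1pr = r - r*r/2.0 + r*r*r/3.0 - r*r*r*r/4.0;

        return e * 0.69314718055994531 + ln_table[k] + ln1pr;
    }

    int main(void)
    {
        printf("%.12f vs %.12f\n", my_log(10.0), log(10.0));
        return 0;
    }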
This is a hugely open, broad, "it depends" kind of question.
Every programming language, core library, and system may use different algorithms, mechanisms, and machine-code instructions to perform mathematical (and any other kind of) calculations.
Furthermore, even if every programming language in the world used the same algorithm X (which is very unlikely), that still would not mean the computer calculates logarithms in way X: the machine will (most likely) still do its machine-level job differently under different circumstances, even though the high-level algorithm is the same.
Bear in mind that computer architectures differ, operating systems differ, and assembly instructions can be very different from CPU to CPU.
I really think you should ask more specific and concrete questions on this website.
I am planning to write something to take advantage of the many devices that I have at home.
Basically my aim is to use the laptop to execute calculations, and also to use my main desktop computer to add more power (and finish the task quicker). I work with cellular simulation and chemical interactions, so to me would be great to take advantage of all that I have available at home.
I am using mainly OS X, so I need something that works with that OS. I can code in Objective-C, C and C++.
I am aware of GCD, OpenCL and MPI, but I am not sure which way to go.
I was planning not to use the full power of my desktop, but only some of the available cores (that way I can continue to work on the desktop doing other tasks that are not so resource intensive). In particular I would love to use the graphics card's power (it is an ATI card, so no CUDA), since all I mainly do is spreadsheets, word processing and coding with Xcode, and the graphics card's resources are basically unused in that scenario.
Is there a specific set of libraries or APIs, among the aforementioned three, that would allow me to selectively route tasks and use resources on another machine without leaving control totally to the compiler? I've heard that GCD is great but has very limited control over where the blocks are executed, while MPI is on the other side of the spectrum... OpenCL seems to be in the middle.
Before diving into one of these technologies I would like to know which one would most likely suit my needs; I am sure that some other researcher has already used parallel computing successfully to achieve what I am trying to achieve.
Thanks in advance.
MPI is more for large-scale scientific computing (many processors, many nodes, etc.), not for a weekend project. For what you describe I would suggest using OpenCL, or one of the more distributed frameworks from the AMQP protocol family such as ZeroMQ or RabbitMQ, or a combination of OpenCL and AMQP. Or, even simpler, consider multithreading; I would suggest OpenMP for that. I'm not sure if you are looking for direct solvers or parallel functions, but many already exist for both GPUs and CPUs, and you can find them on the web.
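For the multithreading route, this is roughly what OpenMP looks like in C (simulate_cell is a made-up stand-in for your per-element computation; compile with -fopenmp):

    #include <omp.h>
    #include <stdio.h>

    /* Hypothetical stand-in for one unit of your simulation work. */
    static double simulate_cell(int i) { return i * 0.001; }

    int main(void)
    {
        const int n = 1000000;
        double total = 0.0;

        /* Use only some of the cores, leaving the rest free for
           other work, as you described. */
        omp_set_num_threads(4);

        #pragma omp parallel for reduction(+:total)
        for (int i = 0; i < n; i++)
            total += simulate_cell(i);

        printf("total = %f\n", total);
        return 0;
    }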
Sorry, but this question simply cannot be meaningfully answered as posed. To be sure, I could toss out a collection of buzzwords describing various technologies to look at like GCD, OpenMPI, OpenCL, CUDA and any number of other technologies which allow one to run a single program on multiple cores, multiple programs on different cooperating computers, or a single program distributed across CPU and GPU, and it sounds like you know about a number of those already so I wouldn't even be adding much value in listing the buzzwords.
To simply toss out such terms without knowing the full specifics of the problem you're trying to solve, however, is a bit like saying that you know English, French and a little German so sure, by all means - mix them all together in a single paragraph without knowing anything about the target audience! Similarly, you can parallelize a given computation in any number of ways, across any number of different processing elements, but whether that parallelization is actually a win or not is going to be entirely dependent on the nature of the algorithm, its data dependencies, how much computation is expected for each reasonable "work chunk", and whether it can be executed on a GPU with sufficient numeric precision, among many other factors. The more complex the technology you choose, the more those factors matter and the greater the possibility that the resulting code will actually be slower than its single-threaded, single machine counterpart. IPC overhead and data copying can, and frequently do, swamp all of the gains one might realize from trying to naively parallelize something and then add additional overhead on top of that, resulting in a net loss. This is why engineers who can do this kind of work meaningfully and well are in such high demand. :)
Without knowing anything about your calculations, I would move in baby steps. First try a simple multi-processor framework like GCD (which is already built into OS X and requires no additional dependencies to use) and figure out how to factor your code such that it can effectively use all of the available cores on a single machine. Once you've learned where the wins are (and if there even are any - if multi-threading isn't helping, multi-machine parallelization almost certainly won't either), try setting up several instances of the calculation on several machines with a simple IPC model that allows for distributing the work. Having already factored your algorithm(s) for multiple threads, it should be comparatively straightforward to further generalize the approach across multiple machines (though it bears noting that the two are NOT the same problem, and either way you still want to use all the cores available on any given target machine, so the two challenges are both complementary and orthogonal).
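As a concrete first baby step, a GCD version of a trivially parallel loop might look like this on OS X (compiled with clang, which provides the blocks extension; work_chunk is a made-up stand-in for one piece of your calculation):

    #include <dispatch/dispatch.h>
    #include <stdio.h>

    #define CHUNKS 1000

    static double results[CHUNKS];

    /* Hypothetical stand-in for one chunk of the real calculation. */
    static double work_chunk(size_t i) { return (double)i * 0.001; }

    int main(void)
    {
        /* dispatch_apply spreads the iterations across whatever cores
           GCD chooses; you do not control placement, which is exactly
           the limited-control trade-off mentioned above. */
        dispatch_queue_t q =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        dispatch_apply(CHUNKS, q, ^(size_t i) {
            results[i] = work_chunk(i);   /* each index is touched once */
        });

        double total = 0.0;
        for (size_t i = 0; i < CHUNKS; i++)
            total += results[i];
        printf("total = %f\n", total);
        return 0;
    }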
I will start with the question and then proceed to explain the need:
Given a single C++ source code file which compiles well in modern g++ and uses nothing more than the standard library, can I set up a single-task operating system and run it there?
EDIT: Following the comment by Chris Lively, I should have asked: what's the easiest way you can suggest to tweak Linux into effectively giving me single-tasking behavior?
Nevertheless, it seems like I did get a good answer although I did not phrase my question well enough. See the second paragraph in sarnold's answer regarding the scheduler.
Motivation:
In some programming competitions the communication between a contestant's program and the grading program involves a huge number of very short interactions.
Thus, using getrusage to measure the time spent by a contestant's program is inaccurate, because getrusage works by sampling a process at constant intervals (usually around once per 10 ms), which are too long compared to the duration of each interaction.
Another approach to timing would be to measure the time before and after the program is run using something like clock_gettime and then subtract the values. We should also subtract the amount of time spent on I/O, and this can be done by intercepting printf and scanf using something like LD_PRELOAD and accumulating the time spent in each of these functions by checking the time just before and just after each call to printf/scanf (it's OK to require contestants to use these functions only for I/O).
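For concreteness, the interception shim might look something like the sketch below (printf only; scanf would be wrapped the same way, and the file and symbol names are of course mine):

    /* io_timer.c -- sketch of the LD_PRELOAD idea described above.
     * Build: gcc -shared -fPIC -o io_timer.so io_timer.c -ldl
     * Run:   LD_PRELOAD=./io_timer.so ./contestant
     * (Compile the contestant with -fno-builtin so the compiler does
     * not silently replace printf calls with puts.)
     */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <time.h>

    static long long io_nanos;   /* accumulated time inside printf */

    int printf(const char *fmt, ...)
    {
        static int (*real_vprintf)(const char *, va_list);
        if (!real_vprintf)
            real_vprintf = (int (*)(const char *, va_list))
                           dlsym(RTLD_NEXT, "vprintf");

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        va_list ap;
        va_start(ap, fmt);
        int ret = real_vprintf(fmt, ap);
        va_end(ap);

        clock_gettime(CLOCK_MONOTONIC, &t1);
        io_nanos += (t1.tv_sec - t0.tv_sec) * 1000000000LL
                  + (t1.tv_nsec - t0.tv_nsec);
        return ret;
    }

    __attribute__((destructor))
    static void report(void)
    {
        fprintf(stderr, "time in printf: %lld ns\n", io_nanos);
    }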
The method proposed above is of course only valid assuming that the contestant's program is the only program running, which is why I want a single-tasking OS.
To run both the contestant's program and the grading program at the same time, I would need a mechanism which, when one of these programs tries to read input and blocks, runs the other program until it writes enough output. I still consider this to be single-tasking because the programs are not going to run at the same time; the "context switch" would occur when it is needed.
Note: I am aware of the fact that there are additional issues to timing such as CPU power management, but I would like to start by addressing the issue of context switches and multitasking.
First things first: what I think would suit your needs best would actually be a language interpreter -- one that you can instrument to keep track of the "execution time" of the program in some purpose-made units, like "mems" to represent memory accesses or "cycles" to represent the speeds of different instructions. Knuth's MMIX from The Art of Computer Programming may provide exactly that functionality, though Python, Ruby, Java and Erlang all have reasonable interpreters that could provide some instruction count / cost / memory-access cost should you do enough rewriting. (But you'd lose C++, obviously.)
Another approach that might work well for you -- if used with extreme care -- is to run your programming problems in the SCHED_FIFO or SCHED_RR real-time scheduling class. Programs running in one of these real-time priority classes will not yield to other processes on the system, allowing them to dominate all other tasks. (Be sure to run your sshd(8) and sh(1) in a higher real-time class to allow you to kill runaway tasks.)
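Entering one of those classes is a few lines of C on Linux (the priority 50 here is an arbitrary mid-range choice, and the call needs root or CAP_SYS_NICE):

    #include <errno.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Put the current process into the SCHED_FIFO class. */
        struct sched_param sp = { .sched_priority = 50 };
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
            fprintf(stderr, "sched_setscheduler: %s\n", strerror(errno));
            return 1;
        }
        /* ... exec the contestant's program here ... */
        return 0;
    }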
Just out of curiosity...
Is there a platform-independent algorithm that produces a comparable value, so that I could implement the algorithm on different machines that were introduced to the market bi-yearly and see how they fit with Moore's law by checking the values the algorithm returns on those machines?
Most of the transistors that are put onto your CPU by Intel and AMD are put there with the purpose of speeding it up one way or another, so a possible proxy for "how many transistors are on there?" is, "how fast is it?". Often when people talk about Moore's law in relation to a CPU it's performance that they're talking about, even though that's not what Moore said.
Benchmarking a CPU is notoriously arbitrary, though. What weightings do you give to your various speed tests? Suppose that next year, Intel invents 20 new SIMD instructions, and adds corresponding silicon to their chips to implement them. Unless your code uses those instructions, there's no way it's going to notice that they're there, so they won't affect your results and you won't report an increase in your performance/transistor index. Since they were invented after you wrote your code, you can't execute them explicitly, so the only way they're going to be used is if an up-to-date compiler, with options to target the new version of the CPU, finds some code in your benchmark that it thinks will benefit from the new instructions. Not very reliable: you simply can't detect new transistors if you can't find a way to use them.
Performance of a single core of a CPU on simple benchmarks has in any case hit something of a roadblock in the last few years. CPU manufacturers are adding cores, and adding special-purpose instructions and silicon, so programs have more resources to draw on if they know how to use them, but boring old arithmetic isn't getting much faster. It's hard to know for what special purposes CPU manufacturers will be adding transistors in 5 or 10 years time, but if you can do that then you could possibly write benchmarks now that will tell you when they've done it.
I don't know much about GPUs, but if you can somehow detect the number of GPU cores on your machine (counting parallel shaders and whatnot), that might actually be the best proxy for raw number of transistors. I guess the number of transistors in each core does go up over time too, but the number of cores on modern graphics cards is rocketing, so actually that might account for the bulk of the new transistors related to processing. Whether that will still be the case in 5 or 10 years, again, who knows.
Another big transistor count is RAM - presumably for a given type of RAM, the number of transistors is pretty much proportional to capacity, and that at least is easily measured using OS-specific functions.
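For example, on systems that provide the (non-standard but widespread) _SC_PHYS_PAGES and _SC_PAGE_SIZE names, a sysconf call is all it takes:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Both names are supported by Linux and OS X, among others. */
        long long pages = sysconf(_SC_PHYS_PAGES);
        long long page_size = sysconf(_SC_PAGE_SIZE);
        printf("physical RAM: %lld MiB\n",
               pages * page_size / (1024 * 1024));
        return 0;
    }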
If you stick a SSD in a machine, I bet you pile on the transistor count too. Is that the sort of thing you're interested in, though? Really Moore's law was about single ICs, not the total contents of a beige (well, white or silver these days) box at a given price point.
Well, the algorithm could be really simple, like calculating FLOPS (floating-point operations per second): get the system time, do a million floating-point operations, get the time again, and take the difference (or use the LINPACK benchmarks, which are used to rate supercomputers). However, implementing this in a platform-independent way is the tricky part.
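A naive sketch of that in C (clock_gettime is POSIX, so this is already not fully platform-independent, which illustrates the problem; the volatile qualifiers just keep the compiler from deleting the loop):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        enum { N = 100000000 };          /* 1e8 iterations          */
        volatile double x = 1.000001, acc = 0.0;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            acc += x * 1.0000001;        /* 2 flops per iteration   */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.0f MFLOPS\n", 2.0 * N / secs / 1e6);
        return 0;
    }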
Years ago, I heard that someone was about to demonstrate that every computer program could be written with just three instructions:
Assignment
Conditional
Loop
Please, I would like to hear your opinion. I mean representing any algorithm as a computer program: do you agree with this?
No need. The minimal theoretical computer needs just one instruction. They are called One Instruction Set Computers (OISC for short, kinda like the ultimate RISC).
There are two types. The first is a theoretically "pure" one instruction machine in which the instruction really works like a regular instruction in normal CPUs. The instruction is usually:
subtract and branch if less than zero
or variations thereof. The Wikipedia article has examples of how this single instruction can be used to write code that emulates other instructions.
The second type is not theoretically pure. It is the transfer-triggered architecture (Wikipedia again, sorry). This family of architectures is also known as move machines, and I have designed and built some myself.
Some consider move machines cheating, since the machine actually has all the regular instructions, only memory-mapped instead of being part of the opcode. But move machines are not merely theoretical, they are practical (like I said, I've built some myself). There is even a commercially available family of CPUs built by Maxim: the MAXQ. If you look at the MAXQ instruction set (they call it a transfer set, since there is really only one instruction; I usually call it a register set) you will see that MAXQ assembly looks rather like a standard accumulator-based architecture.
This is a consequence of Turing Completeness, which is something that was established many decades ago.
Alan Turing, the famous computer scientist, proved that any computable function could be computed using a Turing Machine. A Turing machine is a very simple theoretical device which can do only a few things. It can read and write to a tape (i.e. memory), maintain an internal state which is altered by the contents read from memory, and use the internal state and the last read memory cell to determine which direction to move the tape before reading the next memory cell.
The operations of assignment, conditional, and loop are sufficient to simulate a Turing Machine. Reading and writing memory and maintaining state require assignment. Changing the direction of the tape based on state and memory contents requires conditionals and loops. "Loops" are in fact a bit more high-level than what is actually required; all that is really required is that program flow can jump backwards somehow. This implies that you can create loops if you want to, but the language does not need an explicit loop construct.
Since these three operations allow simulation of a Turing Machine, and a Turing Machine has been proven to be able to compute any computable function, it follows that any language which provides these operations is also able to compute any computable function.
Edit: And, as other answerers pointed out, these operations do not need to be discrete. You can craft a single instruction which does all three of these things (assign, compare, and branch) in such a way that it can simulate a Turing machine all by itself.
The minimal set is a single instruction, but you have to choose a fitting one: for example, a One Instruction Set Computer.
When I studied, we used such a "computer" to calculate factorial, using just a single instruction:
SBN - Subtract and Branch if Negative:
SBN A, B, C
Meaning:
if((Memory[A] -= Memory[B]) < 0) goto C
// (Wikipedia has a slightly different definition)
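To show how little machinery this takes, here is a complete C emulator for exactly that SBN definition, running a three-instruction program that computes 5 + 7 (the memory layout and the halt-on-negative-branch-target convention are my own choices):

    #include <stdio.h>

    /* One shared memory for code and data. Each instruction is a
       triple (A, B, C): mem[A] -= mem[B]; if the result is negative,
       jump to C (a negative C halts), otherwise fall through. */
    enum { Z = 9, X = 10, Y = 11 };   /* data addresses */

    static int mem[] = {
        Z, Y,  3,   /* Z -= Y  => Z = -7 (the branch just goes to 3)  */
        X, Z,  6,   /* X -= Z  => X = 5 - (-7) = 12, falls through    */
        Z, X, -1,   /* Z -= X  => negative, so branch to -1 = halt    */
        0,          /* Z: scratch cell, starts at 0                   */
        5,          /* X: first operand, becomes the result           */
        7,          /* Y: second operand                              */
    };

    int main(void)
    {
        int pc = 0;
        while (pc >= 0) {
            int a = mem[pc], b = mem[pc + 1], c = mem[pc + 2];
            mem[a] -= mem[b];
            pc = (mem[a] < 0) ? c : pc + 3;
        }
        printf("X = %d\n", mem[X]);   /* prints X = 12 */
        return 0;
    }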
Notable one instruction set computer (OISC) implementations
This answer will focus on interesting implementations of single instruction set CPUs, compilers and assemblers.
movfuscator
https://github.com/xoreaxeaxeax/movfuscator
Compiles C code using only mov x86 instructions, showing in a very concrete way that a single instruction suffices.
The Turing completeness seems to have been proven in a paper: https://www.cl.cam.ac.uk/~sd601/papers/mov.pdf
subleq
https://esolangs.org/wiki/Subleq:
https://github.com/hasithvm/subleq-verilog Verilog, Xilinx ISE.
https://github.com/purisc-group/purisc Verilog and VHDL, Altera. Maybe that project has a clang backend, but I can't use it: https://github.com/purisc-group/purisc/issues/5
http://mazonka.com/subleq/sqasm.cpp | http://mazonka.com/subleq/sqrun.cpp C++-based assembler and emulator.
See also
What is the minimum instruction set required for any Assembly language to be considered useful?
https://softwareengineering.stackexchange.com/questions/230538/what-is-the-absolute-minimum-set-of-instructions-required-to-build-a-turing-comp/325501
In 1966, Böhm and Jacopini published a paper in which they demonstrated that all programs could be written in terms of only three control structures:
the sequence structure,
the selection structure
and the repetition structure.
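In C terms, all three show up in a few lines (a trivial illustration):

    /* sequence, selection and repetition in one small function */
    int sum_even(int n)
    {
        int s = 0;                       /* sequence: statements in order */
        for (int i = 0; i <= n; i++)     /* repetition                    */
            if (i % 2 == 0)              /* selection                     */
                s += i;
        return s;
    }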
Programmers using Haskell might argue that you only need the conditional and the loop, because assignments and mutable state don't exist in Haskell.