I haven't looked in ages at how a computer actually starts up, so I started playing around with writing my own loader that boots into IA-32e mode and initializes all the CPUs with some dummy code to run. I'm fairly far along, but I'm getting tired of writing trivial things in assembler.
Here's a toy case of what I would like to achieve. Say I want to write a simple piece of code that prints a C-style string and keeps track of the cursor in some fixed location in memory. A C implementation would be something along the following lines (this code is untested, I wrote it on the fly, so don't comment on bugs, since they're not relevant):
#define VIDEORAM_ADDRESS 0xb8000 /* VGA colour text-mode buffer */
#define VIDEORAM_LINE_LENGTH 160 /* 80 cells of 2 bytes (character, attribute) */
#define VGA_GREY_ON_BLACK 0x07
#define CURSOR_X 0x100 /* dummy address */
#define CURSOR_Y 0x101
void printk(const char *s)
{
    volatile char *p;
    int x, y;
    x = *(volatile char*)CURSOR_X;
    y = *(volatile char*)CURSOR_Y;
    while(*s != 0) {
        if(*s == '\n') {
            s++; /* consume the newline so the loop can advance */
            y++;
            y = y >= 25 ? 0 : y;
            x = 0;
        } else {
            /* each cell is two bytes, so column x starts at byte offset 2*x */
            p = (volatile char*)VIDEORAM_ADDRESS + x * 2 + y * VIDEORAM_LINE_LENGTH;
            *p++ = *s++;
            *p = VGA_GREY_ON_BLACK;
            x++;
            if(x >= 80) {
                y++;
                y = y >= 25 ? 0 : y;
                x = 0;
            }
        }
    }
    *(volatile char*)CURSOR_X = x;
    *(volatile char*)CURSOR_Y = y;
}
I can compile this with gcc -m32 -O2 -S printk.c, which generates printk.s. My question is essentially: how do I combine this with a handwritten assembly file? The end result should of course be nothing but a single binary blob of machine code and data that the BIOS loads at 0000:7C00 if, say, I want to include the code in the stage 1 loader loaded from a disk and call it after switching over to protected mode.
Would an alternative be to put an .include directive somewhere in the handwritten assembly file to get the code included? Unfortunately, gcc emits all kinds of directives for the GNU Assembler in the .s file and I really only want the code for the printk function.
Is there some canonical way of doing this?
I know that machines find it difficult to make calculations involving very large numbers.
Let's say I want to find the square of a million-digit number. Will a typical computer give an answer almost instantly? How much time does it take to handle million-digit calculations?
Also, what is the reason for computers being slow in such calculations?
I found some calculator websites which claim that they can do the task instantly. Will a computer become faster if it uses the method those websites use?
On my PC it takes more than 21 minutes to compute the square root of a number with 1 million digits. See the details below. It should be possible to achieve faster times, but "almost instantly" is probably not feasible without making use of special hardware (like graphics boards with CUDA support).
I have written a test program in C# to find the runtimes for calculating the square root with Newton's method. It uses the System.Numerics library, which features the BigInteger class for arbitrary-precision arithmetic.
The runtime depends on the initial value assumed for the iterative calculation method. Deriving the initial value from the highest non-zero bit of the number turned out to be faster than simply always using 1 as the initial value.
using System;
using System.Diagnostics;
using System.Numerics;
namespace akBigSquareRoot
{
    class Program
    {
        static void Main(string[] args)
        {
            Stopwatch stopWatch = new Stopwatch();
            Console.WriteLine(" nDigits error iterations elapsed ");
            Console.WriteLine("-----------------------------------------");
            for (int nDigits = 10; nDigits <= 1e6; nDigits *= 10)
            {
                // create a base number with nDigits/2 digits
                BigInteger x = 1;
                for (int i = 0; i < nDigits / 2; i++)
                {
                    x *= 10;
                }
                BigInteger square = x * x;
                stopWatch.Restart();
                int iterations;
                BigInteger root = sqrt(square, out iterations);
                stopWatch.Stop();
                BigInteger error = x - root;
                TimeSpan ts = stopWatch.Elapsed;
                string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
                    ts.Hours, ts.Minutes, ts.Seconds,
                    ts.Milliseconds / 10);
                Console.WriteLine("{0,8} {1,6} {2,6} {3}", nDigits, error, iterations, elapsedTime);
            }
            Console.WriteLine("\n<end reached>");
            Console.ReadKey();
        }
        public static BigInteger sqrt(BigInteger x, out int iterations)
        {
            // initial value: a power of two with about half the bits of x
            BigInteger div = BigInteger.One << (bitLength(x) / 2);
            // BigInteger div = 1;
            BigInteger div2 = div;
            BigInteger y;
            // Loop until we hit the same value twice in a row, or wind
            // up alternating.
            iterations = 0;
            while (true)
            {
                iterations++;
                y = (div + (x / div)) >> 1;
                if ((y == div) || (y == div2))
                    return y;
                div2 = div;
                div = y;
            }
        }
        private static int bitLength(BigInteger x)
        {
            int len = 0;
            do
            {
                len++;
            } while ((x >>= 1) != 0);
            return len;
        }
    }
}
The results on a DELL XPS 8300 with Intel Core i7-2600 CPU 3.40 GHz:
 nDigits   error   iterations   elapsed
-------------------------------------------
      10       0            4   00:00:00.00
     100       0            7   00:00:00.00
    1000       0           10   00:00:00.00
   10000       0           14   00:00:00.09
  100000       0           17   00:00:09.81
 1000000       0           20   00:21:18.38
Increasing the number of digits by a factor of 10 results in three additional iterations in the search procedure. But due to the increased bit length, each search iteration is slowed down substantially.
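To see why the choice of initial value matters, the same iteration can be tried on a machine word. This is only a minimal sketch in C, assuming inputs below 2^63 so that div + x / div cannot overflow; isqrt_newton and bit_length_u64 are illustrative names and not part of the C# program above.

```c
#include <stdint.h>

/* Number of bits needed to represent x (at least 1), mirroring bitLength. */
int bit_length_u64(uint64_t x)
{
    int len = 0;
    do {
        len++;
    } while ((x >>= 1) != 0);
    return len;
}

/* Integer square root by Newton's method, starting from seed (must be > 0).
 * Assumes x < 2^63 so that div + x / div fits in 64 bits.
 * Stops when the estimate repeats or starts alternating, like the C# loop,
 * and reports the number of iterations through *iterations. */
uint64_t isqrt_newton(uint64_t x, uint64_t seed, int *iterations)
{
    uint64_t div = seed, div2 = seed, y;
    *iterations = 0;
    for (;;) {
        (*iterations)++;
        y = (div + x / div) >> 1;
        if (y == div || y == div2)
            return y;
        div2 = div;
        div = y;
    }
}
```

Calling isqrt_newton(x, (uint64_t)1 << (bit_length_u64(x) / 2), &n) and isqrt_newton(x, 1, &n) for x = 10^18 gives the same root, but the call seeded with 1 needs many more iterations: the estimate first has to be roughly halved about 30 times before the fast convergence sets in.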
The computational complexity of calculating square (and higher degree) roots is discussed in a related post.
How can I change the sign of an int using bitwise operators? Obviously we can use x*=-1 or x/=-1. Is there a faster way of doing this?
I did a small test as below. Just for curiosity...
public class ChangeSign {
    public static void main(String[] args) {
        int x = 198347;
        int LOOP = 1000000;
        int y;
        long start = System.nanoTime();
        for (int i = 0; i < LOOP; i++) {
            y = (~x) + 1;
        }
        long mid1 = System.nanoTime();
        for (int i = 0; i < LOOP; i++) {
            y = -x;
        }
        long mid2 = System.nanoTime();
        for (int i = 0; i < LOOP; i++) {
            y = x * -1;
        }
        long mid3 = System.nanoTime();
        for (int i = 0; i < LOOP; i++) {
            y = x / -1;
        }
        long end = System.nanoTime();
        System.out.println(mid1 - start);
        System.out.println(mid2 - mid1);
        System.out.println(mid3 - mid2);
        System.out.println(end - mid3);
    }
}
The output is similar to:
2200211
835772
1255797
4651923
The speed difference between integer (non-floating-point) addition/multiplication and bitwise operations is negligible on almost all machines.
There is no general way to turn an n-bit signed integer into its negative equivalent using only bitwise operations, as the negation operation looks like x = (~x) + 1, which requires one addition. However, assuming the signed integer is 32 bit you can probably write a bitwise equation to do this calculation. Note: do not do this.
The most common, readable way to negate a number is x = -x.
Java uses two's complement representation. To change the sign, you must do a bitwise negation (equivalent to XOR with 0xFFFFFFFF for a 32-bit int) and add 1.
x = ~x + 1;
I am almost sure that -x is, if anything, faster than that.
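The identity can be checked on fixed-width values. A minimal sketch in C, assuming 32-bit two's complement and using unsigned arithmetic so that wrap-around is well defined; negate_bits is an illustrative name.

```c
#include <stdint.h>

/* Two's complement negation of a 32-bit pattern: flip all bits, add one.
 * Unsigned arithmetic is used so that overflow wraps instead of being
 * undefined behaviour. */
uint32_t negate_bits(uint32_t x)
{
    return ~x + 1u;
}
```

One edge case is worth knowing: the pattern 0x80000000 (the most negative 32-bit int) negates to itself, which is the same reason -Integer.MIN_VALUE equals Integer.MIN_VALUE in Java.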
Solution using a high-level language
Questions like these are popular in interviews and in the competitive programming world.
I landed here while researching more solutions for negating a number without using the - or + operator.
For this:
1. Complement the number using the ~ operator.
2. Add 1 to the number obtained in step 1 using half-adder logic:
int addNumbers(int x, int y) {
    if (y == 0) return x; // no carry left, x holds the sum
    return addNumbers(x ^ y, (x & y) << 1);
}
Here x^y adds the bits without carrying, and (x&y)<<1 is the carry, which is added in on the next call.
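The two steps can also be written without recursion. This is a minimal sketch in C, assuming 32-bit two's complement integers; add_bitwise and negate_bitwise are names made up for illustration.

```c
#include <stdint.h>

/* Add two 32-bit values using only bitwise operations.
 * x ^ y is the sum without carries; (x & y) << 1 is the carry, which is
 * fed back in until no carry remains (at most 32 rounds). */
uint32_t add_bitwise(uint32_t x, uint32_t y)
{
    while (y != 0) {
        uint32_t carry = (x & y) << 1;
        x ^= y;
        y = carry;
    }
    return x;
}

/* Negate without using + or -: complement, then add 1 bitwise.
 * The conversion back to int32_t is implementation-defined in C for values
 * above INT32_MAX; common compilers wrap as expected. */
int32_t negate_bitwise(int32_t v)
{
    return (int32_t)add_bitwise(~(uint32_t)v, 1u);
}
```

For example, negate_bitwise(198347) yields -198347 and add_bitwise(758, 423) yields 1181.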
Is there a way to compute modulo 511 (and 127) faster than with the "%" operator?
int c = 758 % 511;
int d = 423 % 127;
Here is a way to do fast modulo by 511, assuming that x is non-negative and at most 32767. It's about twice as fast as x%511. It does the modulo in five steps: two multiplications, one addition, one shift, and one subtraction.
inline int fast_mod_511(int x) {
    int y = (513*x+64)>>18;
    return x - 511*y;
}
Here is the theory behind how I arrived at this. The code I used to test it is posted at the end.
Let's consider
y = x/511 = x/(512-1) = x/512 * 1/(1-1/512).
Let's define z = 512, then
y = x/z * 1/(1-1/z).
Using the Taylor expansion of 1/(1-1/z)
y = x/z * (1 + 1/z + 1/z^2 + 1/z^3 + ...).
Now if we know that x has a limited range we can cut the expansion. Let's assume x is always less than 2^15=32768. Then, keeping only the first two terms, we can write (approximately)
512*512*y = (1+512)*x = 513*x.
After looking at the digits which are significant (the added 64 makes up for the terms that were dropped) we arrive at
y = (513*x+64)>>18 //512^2 = 2^18.
We can divide x/511 (assuming x is less than 32768) in three steps:
multiply,
add,
shift.
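Because the range is small, the formula can simply be checked exhaustively against the % operator. A minimal sketch in C; first_mismatch is a hypothetical helper added only for this check.

```c
/* Fast x % 511 for small non-negative x, as derived above. */
int fast_mod_511(int x)
{
    int y = (513 * x + 64) >> 18; /* approximates x / 511 */
    return x - 511 * y;
}

/* Return the first x in [0, limit) where fast_mod_511 disagrees with the
 * % operator, or -1 if there is no such x. */
int first_mismatch(int limit)
{
    for (int x = 0; x < limit; x++) {
        if (fast_mod_511(x) != x % 511)
            return x;
    }
    return -1;
}
```

first_mismatch(32768) returns -1, so the shortcut is exact on the stated range. It stops being exact shortly afterwards: at x = 33215 (65 * 511) the shortcut returns 511 instead of 0, so the range assumption really matters.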
Here is the code I used to profile this in MSVC2013 64-bit release mode on an Ivy Bridge core.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
inline int fast_mod_511(int x) {
    int y = (513*x+64)>>18;
    return x - 511*y;
}
int main() {
    unsigned int i, x;
    volatile unsigned int r;
    double dtime;
    dtime = omp_get_wtime();
    for(i=0; i<100000; i++) {
        for(int j=0; j<32768; j++) {
            r = j%511;
        }
    }
    dtime =omp_get_wtime() - dtime;
    printf("time %f\n", dtime);
    dtime = omp_get_wtime();
    for(i=0; i<100000; i++) {
        for(int j=0; j<32768; j++) {
            r = fast_mod_511(j);
        }
    }
    dtime =omp_get_wtime() - dtime;
    printf("time %f\n", dtime);
}
You can use a lookup table with the solutions pre-stored. If you create an array of a million integers looking up is about twice as fast as actually doing modulo in my C# app.
// fill an array
var mod511 = new int[1000000];
for (int x = 0; x < 1000000; x++) mod511[x] = x % 511;
and instead of using
c = 758 % 511;
you use
c = mod511[758];
This will cost you (possibly a lot of) memory, and will obviously not work if you want to use it for very large numbers also. But it is faster.
If you have to repeat those two modulus operations on a large amount of data and your CPU supports SIMD (for example Intel's SSE/AVX/AVX2), then you can vectorize the operations, i.e., operate on many data elements in parallel. You can do this by using intrinsics or inline assembly. Yes, the solution will be platform specific, but maybe that is fine...
For one of the projects I'm doing right now, I need to look at the performance (amongst other things) of different concurrency-enabled programming languages.
At the moment I'm looking into comparing Stackless Python and C++ pthreads, so the focus is on these two languages, but other languages will probably be tested later. Of course the comparison must be as representative and accurate as possible, so my first thought was to start looking for some standard concurrent/multi-threaded benchmark problems. Alas, I couldn't find any decent or standard tests/problems/benchmarks.
So my question is as follows: Do you have a suggestion for a good, easy or quick problem to test the performance of the programming language (and to expose its strong and weak points in the process)?
Surely you should be testing hardware and compilers rather than a language for concurrency performance?
I would be looking at a language from the point of view of how easy and productive it is in terms of concurrency and how much it 'insulates' the programmer from making locking mistakes.
EDIT: from past experience as a researcher designing parallel algorithms, I think you will find in most cases the concurrent performance will depend largely on how an algorithm is parallelised, and how it targets the underlying hardware.
Also, benchmarks are notoriously unequal; this is even more so in a parallel environment. For instance, a benchmark that 'crunches' very large matrices would be suited to a vector pipeline processor, whereas a parallel sort might be better suited to more general purpose multi core CPUs.
These might be useful:
Parallel Benchmarks
NAS Parallel Benchmarks
Well, there are a few classics, but different tests emphasize different features. Some distributed systems may be more robust, have more efficient message-passing, etc. Higher message overhead can hurt scalability, since the normal way to scale up to more machines is to send a larger number of small messages. Some classic problems you can try are a distributed Sieve of Eratosthenes or a poorly implemented Fibonacci sequence calculator (i.e. to calculate the 8th number in the series, spin off a machine for the 7th, and another for the 6th). Pretty much any divide-and-conquer algorithm can be done concurrently. You could also do a concurrent implementation of Conway's Game of Life or heat transfer. Note that all of these algorithms have different focuses, and thus you probably will not get one distributed system doing the best in all of them.
I'd say the easiest one to implement quickly is the poorly implemented Fibonacci calculator, though it places too much emphasis on creating threads and too little on communication between those threads.
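As a rough illustration of that poorly implemented Fibonacci calculator, here is a minimal sketch in C with POSIX threads standing in for machines. The names fib_worker and fib_threaded are made up, the indexing assumes fib(0) = 0 and fib(1) = 1, and on some toolchains it has to be built with -pthread.

```c
#include <pthread.h>
#include <stdint.h>

/* Deliberately wasteful Fibonacci: every call above the base case starts
 * two threads, one per sub-problem, and waits for both of them.
 * The argument and the result are passed through the void* slot. */
void *fib_worker(void *arg)
{
    intptr_t n = (intptr_t)arg;
    if (n < 2)
        return (void *)n;

    pthread_t left, right;
    void *a, *b;
    int left_ok = pthread_create(&left, NULL, fib_worker, (void *)(n - 1)) == 0;
    int right_ok = pthread_create(&right, NULL, fib_worker, (void *)(n - 2)) == 0;

    /* If a thread could not be created, compute that half inline instead. */
    if (left_ok)
        pthread_join(left, &a);
    else
        a = fib_worker((void *)(n - 1));
    if (right_ok)
        pthread_join(right, &b);
    else
        b = fib_worker((void *)(n - 2));

    return (void *)((intptr_t)a + (intptr_t)b);
}

long fib_threaded(int n)
{
    return (long)(intptr_t)fib_worker((void *)(intptr_t)n);
}
```

With this indexing fib_threaded(8) is 21. Almost all of the time goes into creating and joining threads, which is exactly the bias mentioned above, so it says more about thread start-up cost than about communication.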
"Surely you should be testing hardware and compilers rather than a language for concurrency performance?"
No, hardware and compilers are irrelevant for my testing purposes. I'm just looking for some good problems that can test how well code, written in one language, can compete against code from another language. I'm really testing the constructs available in the specific languages to do concurrent programming. And one of the criteria is performance (measured in time).
Some of the other test criteria I'm looking for are:
how easy is it to write correct code; because as we all know, concurrent programming is harder than writing single-threaded programs
what is the technique used to do concurrent programming: event-driven, actor based, message passing, ...
how much code must be written by the programmer himself and how much is done automatically for him: this can also be tested with the given benchmark problems
what's the level of abstraction and how much overhead is involved when translated back to machine code
So actually, I'm not looking for performance as the only and best parameter (which would indeed send me to the hardware and the compilers instead of the language itself). I'm actually looking from a programmer's point of view to check what language is best suited for what kind of problems, what its weaknesses and strengths are, and so on...
Bear in mind that this is just a small project and the tests are therefore to be kept small as well (rigorous testing of everything is therefore not feasible).
I have decided to use the Mandelbrot set (the escape time algorithm to be more precise) to benchmark the different languages.
It suits me quite well, as the original algorithm can easily be implemented and creating the multi-threaded variant from it is not that much work.
Below is the code I currently have. It is still a single-threaded variant, but I'll update it as soon as I'm satisfied with the result.
#include <cstdlib> //for atoi
#include <iostream>
#include <iomanip> //for setw and setfill
#include <vector>
int DoThread(const double x, const double y, int maxiter) {
    double curX,curY,xSquare,ySquare;
    int i;
    curX = x + x*x - y*y;
    curY = y + x*y + x*y;
    ySquare = curY*curY;
    xSquare = curX*curX;
    for (i=0; i<maxiter && ySquare + xSquare < 4;i++) {
        ySquare = curY*curY;
        xSquare = curX*curX;
        curY = y + curX*curY + curX*curY;
        curX = x - ySquare + xSquare;
    }
    return i;
}
void SingleThreaded(int horizPixels, int vertPixels, int maxiter, std::vector<std::vector<int> >& result) {
    for(int x = horizPixels; x > 0; x--) {
        for(int y = vertPixels; y > 0; y--) {
            //3.0 -> so we always have -1.5 -> 1.5 as the window; (x - (horizPixels / 2)) will go from -horizPixels/2 to +horizPixels/2
            result[x-1][y-1] = DoThread((3.0 / horizPixels) * (x - (horizPixels / 2)),(3.0 / vertPixels) * (y - (vertPixels / 2)),maxiter);
        }
    }
}
int main(int argc, char* argv[]) {
    //all four arguments are required
    if (argc < 5) {
        std::cerr << "usage: mandelbrot <width> <height> <maxiter> <threads>" << std::endl;
        return 1;
    }
    //first arg = length along horizontal axis
    int horizPixels = atoi(argv[1]);
    //second arg = length along vertical axis
    int vertPixels = atoi(argv[2]);
    //third arg = iterations
    int maxiter = atoi(argv[3]);
    //fourth arg = threads
    int threadCount = atoi(argv[4]);
    std::vector<std::vector<int> > result(horizPixels, std::vector<int>(vertPixels,0)); //create and init 2-dimensional vector
    SingleThreaded(horizPixels, vertPixels, maxiter, result);
    //TODO: remove these lines
    for(int y = 0; y < vertPixels; y++) {
        for(int x = 0; x < horizPixels; x++) {
            std::cout << std::setw(2) << std::setfill('0') << std::hex << result[x][y] << " ";
        }
        std::cout << std::endl;
    }
}
I've tested it with gcc under Linux, but I'm sure it works under other compilers/Operating Systems as well. To get it to work you have to enter some command line arguments like so:
mandelbrot 106 500 255 1
the first argument is the width (x-axis)
the second argument is the height (y-axis)
the third argument is the maximum number of iterations (the number of colors)
the last one is the number of threads (but that one is currently not used)
At my resolution, the above example gives me a nice ASCII-art representation of a Mandelbrot set. But try it for yourself with different arguments (the first one will be the most important one, as that will be the width).
Below you can find the code I hacked together to test the multi-threaded performance of pthreads. I haven't cleaned it up and no optimizations have been made, so the code is a bit raw.
The code to save the calculated Mandelbrot set as a bitmap is not mine; you can find it here
#include <cstdlib> //for atoi
#include <iostream>
#include <iomanip> //for setw and setfill
#include <vector>
#include "bitmap_Image.h" //for saving the mandelbrot as a bmp
#include <pthread.h>
pthread_mutex_t mutexCounter = PTHREAD_MUTEX_INITIALIZER; //protects sharedCounter
int sharedCounter(0);
int percent(0);
int horizPixels(0);
int vertPixels(0);
int maxiter(0);
//doesn't need to be locked
std::vector<std::vector<int> > result; //create 2 dimensional vector
void *DoThread(void *null) {
    double curX,curY,xSquare,ySquare,x,y;
    int i, intx, inty, counter;
    counter = 0;
    do {
        counter++;
        //claim the next pixel index under the lock
        pthread_mutex_lock (&mutexCounter); //lock
        intx = int((sharedCounter / vertPixels) + 0.5);
        inty = sharedCounter % vertPixels;
        sharedCounter++;
        pthread_mutex_unlock (&mutexCounter); //unlock
        //exit thread when finished
        if (intx >= horizPixels) {
            std::cout << "exited thread - I did " << counter << " calculations" << std::endl;
            pthread_exit((void*) 0);
        }
        //set x and y to the correct value now -> in the range like singlethread
        x = (3.0 / horizPixels) * (intx - (horizPixels / 1.5));
        y = (3.0 / vertPixels) * (inty - (vertPixels / 2));
        curX = x + x*x - y*y;
        curY = y + x*y + x*y;
        ySquare = curY*curY;
        xSquare = curX*curX;
        for (i=0; i<maxiter && ySquare + xSquare < 4;i++){
            ySquare = curY*curY;
            xSquare = curX*curX;
            curY = y + curX*curY + curX*curY;
            curX = x - ySquare + xSquare;
        }
        result[intx][inty] = i;
    } while (true);
}
int DoSingleThread(const double x, const double y) {
    double curX,curY,xSquare,ySquare;
    int i;
    curX = x + x*x - y*y;
    curY = y + x*y + x*y;
    ySquare = curY*curY;
    xSquare = curX*curX;
    for (i=0; i<maxiter && ySquare + xSquare < 4;i++){
        ySquare = curY*curY;
        xSquare = curX*curX;
        curY = y + curX*curY + curX*curY;
        curX = x - ySquare + xSquare;
    }
    return i;
}
void SingleThreaded(std::vector<std::vector<int> >& result) {
    for(int x = horizPixels - 1; x != -1; x--) {
        for(int y = vertPixels - 1; y != -1; y--) {
            //3.0 -> the window is 3 units wide; with the offsets below x runs from -2.0 to 1.0 and y from -1.5 to 1.5
            result[x][y] = DoSingleThread((3.0 / horizPixels) * (x - (horizPixels / 1.5)),(3.0 / vertPixels) * (y - (vertPixels / 2)));
        }
    }
}
void MultiThreaded(int threadCount, std::vector<std::vector<int> >& result) {
    /* Initialize and set thread detached attribute */
    std::vector<pthread_t> thread(threadCount);
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    //the main thread only coordinates, so threadCount - 1 workers are started
    for (int i = 0; i < threadCount - 1; i++) {
        pthread_create(&thread[i], &attr, DoThread, NULL);
    }
    pthread_attr_destroy(&attr);
    std::cout << "all threads created" << std::endl;
    for(int i = 0; i < threadCount - 1; i++) {
        pthread_join(thread[i], NULL);
    }
    std::cout << "all threads joined" << std::endl;
}
int main(int argc, char* argv[]) {
    //all four arguments are required
    if (argc < 5) {
        std::cerr << "usage: mandelbrot <width> <height> <maxiter> <threads>" << std::endl;
        return 1;
    }
    //first arg = length along horizontal axis
    horizPixels = atoi(argv[1]);
    //second arg = length along vertical axis
    vertPixels = atoi(argv[2]);
    //third arg = iterations
    maxiter = atoi(argv[3]);
    //fourth arg = threads
    int threadCount = atoi(argv[4]);
    result = std::vector<std::vector<int> >(horizPixels, std::vector<int>(vertPixels,21)); // init 2-dimensional vector
    if (threadCount <= 1) {
        SingleThreaded(result);
    } else {
        MultiThreaded(threadCount, result);
    }
    //TODO: remove these lines
    bitmapImage image(horizPixels, vertPixels);
    for(int y = 0; y < vertPixels; y++) {
        for(int x = 0; x < horizPixels; x++) {
            image.setPixelRGB(x,y,16777216*result[x][y]/maxiter % 256, 65536*result[x][y]/maxiter % 256, 256*result[x][y]/maxiter % 256);
            //std::cout << std::setw(2) << std::setfill('0') << std::hex << result[x][y] << " ";
        }
        std::cout << std::endl;
    }
    //"~" is expanded by the shell, not by file APIs; adjust the path if needed
    image.saveToBitmapFile("~/Desktop/test.bmp",32);
}
Good results can be obtained by running the program with the following arguments:
mandelbrot 5120 3840 256 3
That way you will get an image that is 5 * 1024 wide and 5 * 768 high, with 256 colors (alas, you will only get 1 or 2) and 3 threads (1 main thread that doesn't do any work except creating the worker threads, and 2 worker threads).
Since the benchmarks game moved to a quad-core machine in September 2008, many programs in different programming languages have been rewritten to exploit quad-core, for example the first 10 mandelbrot programs.