I'm trying to write a simple program in C++ to demonstrate the math behind RSA. I'm using the GMP library ( https://gmplib.org/ ) so that I can scale it up later with larger primes.
When I attempt to calculate d, the decryption exponent, as the modular inverse of e mod phi(n), the program segfaults and I am lost as to why.
Can anyone shed some light on this issue?
#include <gmp.h> // For the GMP library
int main()
{
mpz_t n,p,q,e,c,d,h;
mpz_init(n);
mpz_init(h);
mpz_init_set_str(e, "65537", 10);
mpz_init_set_str(p, "1298849", 10);
mpz_init_set_str(q, "1298863", 10);
mpz_mul(n,p,q);
mpz_sub_ui(p, p, 1UL);
mpz_sub_ui(q, q, 1UL);
mpz_mul(h, p, q);
gmp_printf ("%Zd\n", h);
//This next line segfaults it.
mpz_invert(d,e,h);
return 0;
}
Any help is appreciated, I'm pretty stumped!
To Compile:
g++ -std=c++11 Example.cpp -lgmp -lgmpxx -o Example
You never initialized c or d, so you can't use them for calculation. Every mpz_t must be initialized (e.g. with mpz_init) before it is passed to a GMP function, even as a pure output operand, so the mpz_invert(d, e, h) call tries to store its result through d's uninitialized internal pointer and crashes. (c is never used at all and can simply be removed.)
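A minimal sketch of the fix (add #include <stdio.h> for fprintf; note that mpz_invert returns 0 when no inverse exists, which is worth checking):
mpz_init(d); // output operands must be initialized before use, just like the others
if (mpz_invert(d, e, h) == 0)
    fprintf(stderr, "e is not invertible mod phi(n)\n");
else
    gmp_printf("d = %Zd\n", d);
mpz_clears(n, p, q, e, d, h, NULL); // release everything before returning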
I am testing a simple program which calculates the Mandelbrot fractal. I have been checking its performance depending on the number of iterations in the function that checks whether a point belongs to the Mandelbrot set or not.
The surprising thing is that I am getting a big difference in times after adding the -fPIC flag. From what I have read, the overhead is usually negligible, and the highest overhead I came across was about 6%, yet I measured around 30%. Any advice will be appreciated!
Details of my project
I use the -O3 flag, gcc 4.7.2, Ubuntu 12.04.2, x86_64.
The results look as follows ("C/C(fPIC)" is the ratio of the plain-C time to the -fPIC time):
#iter    C (fPIC)    C        C/C(fPIC)
1        0.01        0.01     1.00
100      0.04        0.03     0.75
200      0.06        0.04     0.67
500      0.15        0.1      0.67
1000     0.28        0.19     0.68
2000     0.56        0.37     0.66
4000     1.11        0.72     0.65
8000     2.21        1.47     0.67
16000    4.42        2.88     0.65
32000    8.8         5.77     0.66
64000    17.6        11.53    0.66
Commands I use:
gcc -O3 -fPIC fractalMain.c fractal.c -o ffpic
gcc -O3 fractalMain.c fractal.c -o f
Code: fractalMain.c
#include <time.h>
#include <stdio.h>
#include <stdbool.h>
#include "fractal.h"
int main()
{
int iterNumber[] = {1, 100, 200, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000};
int it;
for(it = 0; it < 11; ++it)
{
clock_t start = clock();
fractal(iterNumber[it]);
clock_t end = clock();
double seconds = (end - start) / (double)CLOCKS_PER_SEC; // elapsed CPU time in seconds
printf("Iter: %d, time: %lf \n", iterNumber[it], seconds);
}
return 0;
}
Code: fractal.h
#ifndef FRACTAL_H
#define FRACTAL_H
void fractal(int iter);
#endif
Code: fractal.c
#include <stdio.h>
#include <stdbool.h>
#include "fractal.h"
void multiplyComplex(double a_re, double a_im, double b_re, double b_im, double* res_re, double* res_im)
{
*res_re = a_re*b_re - a_im*b_im;
*res_im = a_re*b_im + a_im*b_re;
}
void sqComplex(double a_re, double a_im, double* res_re, double* res_im)
{
multiplyComplex(a_re, a_im, a_re, a_im, res_re, res_im);
}
bool isInSet(double P_re, double P_im, double C_re, double C_im, int iter)
{
double zPrev_re = P_re;
double zPrev_im = P_im;
double zNext_re = 0;
double zNext_im = 0;
double* p_zNext_re = &zNext_re;
double* p_zNext_im = &zNext_im;
int i;
for(i = 1; i <= iter; ++i)
{
sqComplex(zPrev_re, zPrev_im, p_zNext_re, p_zNext_im);
zNext_re = zNext_re + C_re;
zNext_im = zNext_im + C_im;
if(zNext_re*zNext_re+zNext_im*zNext_im > 4)
{
return false;
}
zPrev_re = zNext_re;
zPrev_im = zNext_im;
}
return true;
}
bool isMandelbrot(double P_re, double P_im, int iter)
{
return isInSet(0, 0, P_re, P_im, iter);
}
void fractal(int iter)
{
int noIterations = iter;
double xMin = -1.8;
double xMax = 1.6;
double yMin = -1.3;
double yMax = 0.8;
int xDim = 512;
int yDim = 384;
double P_re, P_im;
int nop = 0; // initialized: otherwise the printf below could read an indeterminate value
int x, y;
for(x = 0; x < xDim; ++x)
for(y = 0; y < yDim; ++y)
{
P_re = (double)x*(xMax-xMin)/(double)xDim+xMin;
P_im = (double)y*(yMax-yMin)/(double)yDim+yMin;
if(isMandelbrot(P_re, P_im, noIterations))
nop = x+y;
}
printf("%d", nop);
}
Story behind the comparison
It might look a bit artificial to add the -fPIC flag when building an executable (as per one of the comments). So a few words of explanation: first I only compiled the program as an executable and wanted to compare it to my Lua code, which calls the isMandelbrot function from C. So I created a shared object to call it from Lua - and saw big time differences, but couldn't understand why they grew with the number of iterations. In the end I found out that it was because of -fPIC. When I create a little C program which calls my Lua script (so effectively I do the same thing, only without needing the .so), the times are very similar to C without -fPIC. I have checked this in a few configurations over the last few days and it consistently shows two sets of very similar results: faster without -fPIC and slower with it.
It turns out that when you compile without the -fPIC option, multiplyComplex, sqComplex, isInSet and isMandelbrot are inlined automatically by the compiler. If you define those functions as static, you will likely get the same performance when compiling with -fPIC, because the compiler will be free to perform inlining.
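For example, a sketch of that change in fractal.c (assuming nothing outside fractal.c calls the helpers directly; if you later build the .so for Lua, isMandelbrot itself must of course stay non-static):
/* internal linkage: the compiler may inline these even when compiling with -fPIC */
static void multiplyComplex(double a_re, double a_im, double b_re, double b_im,
                            double* res_re, double* res_im)
{
    *res_re = a_re*b_re - a_im*b_im;
    *res_im = a_re*b_im + a_im*b_re;
}
/* ...and likewise for sqComplex and isInSet */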
The reason why the compiler is unable to automatically inline the helper functions has to do with symbol interposition. Position independent code is required to access all global data indirectly, i.e. through the global offset table. The very same constraint applies to function calls, which have to go through the procedure linkage table. Since a symbol might get interposed by another one at runtime (see LD_PRELOAD), the compiler cannot simply assume that it is safe to inline a function with global visibility.
The very same assumption can be made if you compile without -fPIC, i.e. the compiler can safely assume that a global symbol defined in the executable cannot be interposed because the lookup scope begins with the executable itself which is then followed by all other libraries, including the preloaded ones.
For a more thorough understanding have a look at the following paper.
As other people already pointed out -fPIC forces GCC to disable many optimizations e.g. inlining and cloning. I'd like to point out several ways to overcome this:
replace -fPIC with -fPIE if you are compiling main executable (not libraries) as this allows compiler to assume that interposition is not possible;
use -fvisibility=hidden and __attribute__((visibility("default"))) to export only the necessary functions from the library and hide the rest (sketched after this list); this would allow GCC to optimize hidden functions more aggressively;
use private symbol aliases (__attribute__((alias ("__f")));) to refer to library functions from within the library; this would again untie GCC's hands
previous suggestion can be automated with -fno-semantic-interposition flag that was added in recent GCC versions
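As an illustration of the visibility approach (a sketch with made-up file and function names; the attribute syntax is GCC/Clang-specific):
/* mylib.c -- build with: gcc -O3 -fPIC -fvisibility=hidden -shared mylib.c -o libmylib.so */
#define API __attribute__((visibility("default")))

/* not marked API: hidden under -fvisibility=hidden, so GCC may inline or clone it */
double square(double x) { return x * x; }

/* explicitly exported entry point */
API double sum_of_squares(const double* a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += square(a[i]);
    return s;
}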
It's interesting to note that Clang is different from GCC as it allows all optimizations by default regardless of -fPIC (can be overridden with -fsemantic-interposition to obtain GCC-like behavior).
As others have discussed in the comment section of your opening post, compiling with -flto should help reduce the difference in run-times you are seeing for this particular case, since gcc's link-time optimisations will likely figure out that it's actually ok to inline a couple of functions ;)
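For example, mirroring the commands above (-flto has to be in effect at both compile and link time, which a single invocation covers):
gcc -O3 -fPIC -flto fractalMain.c fractal.c -o ffpic
gcc -O3 -flto fractalMain.c fractal.c -o f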
In general, link-time optimisation can lead to sizeable reductions in code size (~6%) link to paper on link time optimisations in gold, and thus run time as well (more of your program fits in the cache). Also note that -fPIC is mostly viewed as a feature that enables tighter security and is always enabled in Android. This question on SO briefly discusses it as well. Also, just to let you know, -fpic is the faster version of -fPIC, so if you must use -fPIC try -fpic instead - link to gcc docs. For x86 it might not make a difference, but you should check this for yourself or ask on gcc-help.
I am facing some very weird rounding errors when compiling my code with Intel 2018 compared to gcc 7.2.0. I'm simply taking the absolute value of extremely small numbers:
#include <cfloat>
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
int main() {
double numa = -1.3654159537789158e-08;
double numb = -7.0949094162313382e-08;
if (isnan(numa))
printf("numa is nan \n");
if (isnan(numb))
printf("numb is nan \n");
printf("abs(numa) %.17g \n", abs(numa));
printf("abs(numb) %.17g \n", abs(numb));
if ((isnan(numa) || (abs(numa) < DBL_EPSILON)) || (isnan(numb) || (abs(numb) < DBL_EPSILON))) {
printf("x %.17g y %.17g DBL_E %.17g \n", numa, numb, DBL_EPSILON);
}
return 0;
}
Here is the output when compiling the code with gcc 7.2.0, which is expected:
$ ./a.out
abs(numa) 1.3654159537789158e-08
abs(numb) 7.0949094162313382e-08
But it is a different story for intel/2018:
$ ./a.out
abs(numa) 2.0410903428666442e-314
abs(numb) 2.0410903428666442e-314
x -1.3654159537789158e-08 y -7.0949094162313382e-08 DBL_E 2.2204460492503131e-16
What could cause my version of Intel compilers to have such a huge difference?
Wrong function or wrong language
Output with "gcc 7.2.0" is as expected because OP compiled with C++
With "intel/2018" the output is consistent with a forced C compilation.
With C, abs(numa) converts numa to an int with the value 0, and the line below is undefined behavior (UB), as "%.17g" expects a double and not an int.
// In C UB: vvvvv------vvvvvvvvv
printf("abs(numa) %.17g \n", abs(numa));
With the UB output of "abs(numa) 2.0410903428666442e-314", we can do some forensics.
Typical 2.0410903428666442e-314 in binary is
00000000 00000000 00000000 00000000 11110110 00111101 01001110 00101110
This is consistent with some C compilations that pass a 32-bit int 0, which printf() then retrieves, along with whatever junk follows it, as the expected double.
As UB, this result may vary from time to time, if output at all, yet it is a good indicator of the problem: compile as C++, or change to fabs() (#dmuir) to take the absolute value of a double in both C++ and C.
Some kudos to OP for using "%g" (or "%e") when debugging floating-point issues. Far more informative than "%f".
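For completeness, a tiny repro of the suggested fix that behaves the same whether compiled as C or C++ (fabs from <math.h> takes and returns double):
#include <math.h>
#include <stdio.h>

int main(void) {
    double numa = -1.3654159537789158e-08;
    printf("fabs(numa) %.17g \n", fabs(numa)); /* a genuine double reaches printf */
    return 0;
}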
I'm sorry if this is a really stupid question, but I really need this for my master's thesis, and I just can't find a way. I need to calculate the complete elliptic integral of the first kind with Eclipse 3.8 on an Ubuntu laptop. My compiler is set to -c -fmessage-length=0 -std=c++11.
As for the ubuntu version, it's
#laptop:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty
and for the gcc compiler, it is
laptop:~$ gcc --version
gcc (Ubuntu 4.8.5-2ubuntu1~14.04.1) 4.8.5
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I found under mathematical special functions that there is a function double comp_ellint_1( float arg ) that would do the job, but as I understand it, it is only included in C++17, which I do not have installed and cannot find instructions for installing. But apparently there is a possibility to calculate the function without C++17, because it says:
As all special functions, comp_ellint_1 is only guaranteed to be available in <cmath> if __STDCPP_MATH_SPEC_FUNCS__ is defined by the implementation to a value at least 201003L and if the user defines __STDCPP_WANT_MATH_SPEC_FUNCS__ before including any standard library headers.
But their example code
#define __STDCPP_WANT_MATH_SPEC_FUNCS__ 1
#include <cmath>
#include <iostream>
int main(){
double integral= std::comp_ellint_1(0);
return 0;
}
does not work, the error being 15:22: error: ‘comp_ellint_1’ is not a member of ‘std’. I've also tried
#define _STDCPP_MATH_SPEC_FUNCS__201003L
#define __STDCPP_WANT_MATH_SPEC_FUNCS__ 1
#include <cmath>
#include <iostream>
int main(){
double integral= std::comp_ellint_1(0);
return 0;
}
which leads to the same error. It does not say whether I need to install certain packages to make it work (and if so, which ones and how to install them). Or am I making a different mistake?
I'd be super thankful for any ideas how to solve this, so thank you very much in advance!
Your gcc 4.8.5 has this function as std::tr1::comp_ellint_1.
You will need to #include <tr1/cmath>.
This is mentioned on the cppreference page for its C++17 version.
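A minimal sketch of that approach (comp_ellint_1(0) equals pi/2, which makes a handy sanity check):
#include <tr1/cmath> // GCC ships the TR1 special functions here
#include <iostream>

int main() {
    double integral = std::tr1::comp_ellint_1(0.0); // K(0) = pi/2
    std::cout << integral << std::endl;             // prints ~1.5708
    return 0;
}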
If that does not work, or you also want to run on older versions, you can use Boost. In Visual Studio you should include:
#define BOOST_CONFIG_SUPPRESS_OUTDATED_MESSAGE
#include <boost/lambda/lambda.hpp>
#include <boost/math/special_functions/ellint_1.hpp>
#include <boost/math/special_functions/ellint_2.hpp>
#include <boost/math/special_functions/ellint_3.hpp>
Then:
using namespace boost::math;
double Kk = ellint_1(k);                // K(k); k is the modulus
double Ek1 = ellint_2(k) / (q - 4.*al); // q and al come from your own formula
To do that, put a copy of Boost on your hard disk, for example at C:\boost_1_66_0.
Then, in the project properties, add the following settings:
C/C++ Directories->additional include directories: C:\boost_1_66_0
C/C++->Precompiled headers->Precompiled header-> Not use precompiled headers
Linker->general->Additional Library Directories->C:\boost_1_66_0\libs;
Another way is to use the following function, which calculates both complete integrals: first and second kind. I tested it against an online tool and against ellint_1 and ellint_2, and it worked well:
#include <math.h>   // sqrt, fabs
#include <float.h>  // DBL_MAX, DBL_EPSILON
void Complete_Elliptic_Integrals(double x, double* Fk, double* Ek)
{
const double PI_2 = 1.5707963267948966192313216916397514; // pi/2
const double PI_4 = 0.7853981633974483096156608458198757; // pi/4
double k; // modulus
double m; // the parameter of the elliptic function m = modulus^2
double a; // arithmetic mean
double g; // geometric mean
double a_old; // previous arithmetic mean
double g_old; // previous geometric mean
double two_n; // power of 2
double sum;
if ( x == 0.0 ) { *Fk = PI_2; *Ek = PI_2; return; } // K(0) = E(0) = pi/2; use the local PI_2, since M_PI_2 is not guaranteed by the C standard
k = fabs(x);
m = k * k;
if ( m == 1.0 ) { *Fk = DBL_MAX; *Ek = 1.0; return; }
a = 1.0;
g = sqrt(1.0 - m);
two_n = 1.0;
sum = 2.0 - m;
for (int i=0;i<100;i++)
{
g_old = g;
a_old = a;
a = 0.5 * (g_old + a_old);
g = g_old * a_old;
two_n += two_n;
sum -= two_n * (a * a - g);
if ( fabs(a_old - g_old) <= (a_old * DBL_EPSILON) ) break;
g = sqrt(g);
}
*Fk = (double) (PI_2 / a);
*Ek = (double) ((PI_4 / a) * sum);
return;
}
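For example, a quick sanity check (expected values for modulus k = 0.5 taken from standard tables):
double K, E;
Complete_Elliptic_Integrals(0.5, &K, &E);
/* K ≈ 1.6857504, E ≈ 1.4674622 */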
Unfortunately it takes about twice as long to run as ellint_1 and ellint_2.
I've been using the ConjugateGradient solver in Eigen 3.2 and decided to try upgrading to Eigen 3.3.3 with the hope of benefiting from the new multi-threading features.
Sadly, the solver seems slower (~10%) when I enable -fopenmp with GCC 4.8.4. Looking at xosview, I see that all 8 CPUs are being used, yet performance is slower...
After some testing, I discovered that if I disable compiler optimization (use -O0 instead of -O3), then -fopenmp does speed up the solver by ~50%.
Of course, it's not really worth disabling optimization just to benefit from multi-threading, since that would be even slower overall.
Following advice from https://stackoverflow.com/a/42135567/7974125, I am storing the full sparse matrix and passing Lower|Upper as the UpLo parameter.
I've also tried each of the 3 preconditioners and also tried using RowMajor matrices, to no avail.
Is there anything else to try to get the full benefits of both multi-threading and compiler optimization?
I cannot post my actual code, but this is a quick test using the Laplacian example from Eigen's documentation, except for some changes to use ConjugateGradient instead of SimplicialCholesky. (Both of these solvers work with SPD matrices.)
#include <Eigen/Sparse>
#include <bench/BenchTimer.h>
#include <iostream>
#include <vector>
using namespace Eigen;
using namespace std;
// Use RowMajor to make use of multi-threading
typedef SparseMatrix<double, RowMajor> SpMat;
typedef Triplet<double> T;
// Assemble sparse matrix from
// https://eigen.tuxfamily.org/dox/TutorialSparse_example_details.html
void insertCoefficient(int id, int i, int j, double w, vector<T>& coeffs,
VectorXd& b, const VectorXd& boundary)
{
int n = int(boundary.size());
int id1 = i+j*n;
if(i==-1 || i==n) b(id) -= w * boundary(j); // constrained coefficient
else if(j==-1 || j==n) b(id) -= w * boundary(i); // constrained coefficient
else coeffs.push_back(T(id,id1,w)); // unknown coefficient
}
void buildProblem(vector<T>& coefficients, VectorXd& b, int n)
{
b.setZero();
ArrayXd boundary = ArrayXd::LinSpaced(n, 0,M_PI).sin().pow(2);
for(int j=0; j<n; ++j)
{
for(int i=0; i<n; ++i)
{
int id = i+j*n;
insertCoefficient(id, i-1,j, -1, coefficients, b, boundary);
insertCoefficient(id, i+1,j, -1, coefficients, b, boundary);
insertCoefficient(id, i,j-1, -1, coefficients, b, boundary);
insertCoefficient(id, i,j+1, -1, coefficients, b, boundary);
insertCoefficient(id, i,j, 4, coefficients, b, boundary);
}
}
}
int main()
{
int n = 300; // size of the image
int m = n*n; // number of unknowns (=number of pixels)
// Assembly:
vector<T> coefficients; // list of non-zeros coefficients
VectorXd b(m); // the right hand side-vector resulting from the constraints
buildProblem(coefficients, b, n);
SpMat A(m,m);
A.setFromTriplets(coefficients.begin(), coefficients.end());
// Solving:
// Use ConjugateGradient with Lower|Upper as the UpLo template parameter to make use of multi-threading
BenchTimer t;
t.reset(); t.start();
ConjugateGradient<SpMat, Lower|Upper> solver(A);
VectorXd x = solver.solve(b); // use the factorization to solve for the given right hand side
t.stop();
cout << "Real time: " << t.value(1) << endl; // 0=CPU_TIMER, 1=REAL_TIMER
return 0;
}
Resulting output:
// No optimization, without OpenMP
g++ cg.cpp -O0 -I./eigen -o cg
./cg
Real time: 23.9473
// No optimization, with OpenMP
g++ cg.cpp -O0 -I./eigen -fopenmp -o cg
./cg
Real time: 17.6621
// -O3 optimization, without OpenMP
g++ cg.cpp -O3 -I./eigen -o cg
./cg
Real time: 0.924272
// -O3 optimization, with OpenMP
g++ cg.cpp -O3 -I./eigen -fopenmp -o cg
./cg
Real time: 1.04809
Your problem is too small to expect any benefit from multi-threading. Sparse matrices are expected to be at least one order of magnitude larger before multi-threading pays off. Eigen's code should be adjusted to reduce the number of threads in this case.
Moreover, I guess that you only have 4 physical cores, so running with OMP_NUM_THREADS=4 ./cg might help.
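If you do want to pin the thread count from inside the program instead, a minimal sketch (the 4 is an assumption about your physical core count):
#include <Eigen/Core>
#include <iostream>

int main() {
    Eigen::setNbThreads(4); // cap the OpenMP threads Eigen uses
    std::cout << "Eigen threads: " << Eigen::nbThreads() << std::endl;
    // ... assemble A and run ConjugateGradient exactly as in the question ...
    return 0;
}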
Briefly speaking, my question is about compiling and building files (and libraries) with two different compilers while exploiting OpenACC constructs in the source files.
I have a C source file that has an OpenACC construct. It has only a simple function that computes the total sum of an array:
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
double calculate_sum(int n, double *a) {
double sum = 0;
int i;
printf("Num devices: %d\n", acc_get_num_devices(acc_device_nvidia));
#pragma acc parallel loop copyin(a[0:n]) reduction(+:sum) // explicit reduction: without it, the concurrent updates to sum race
for(i=0;i<n;i++) {
sum += a[i];
}
return sum;
}
I can easily compile it using the following line:
pgcc -acc -ta=nvidia -c libmyacc.c
Then, I create a static library with the following line:
ar -cvq libmyacc.a libmyacc.o
To use my library, I wrote a piece of code as follows:
#include <stdio.h>
#include <stdlib.h>
#define N 1000
extern double calculate_sum(int n, double *a);
int main() {
printf("Hello --- Start of the main.\n");
double *a = (double*) malloc(sizeof(double) * N);
int i;
for(i=0;i<N;i++) {
a[i] = (i+1) * 1.0;
}
double sum = 0.0;
for(i=0;i<N;i++) {
sum += a[i];
}
printf("Sum: %.3f\n", sum);
double sum2 = -1;
sum2 = calculate_sum(N, a);
printf("Sum2: %.3f\n", sum2);
return 0;
}
Now, I can use this static library with the PGI compiler itself to compile the above source (f1.c):
pgcc -acc -ta=nvidia f1.c libmyacc.a
And it will execute flawlessly. However, it differs for gcc, and this is where my question lies: how can I build it properly with gcc?
Thanks to Jeff's comment on this question (linking pgi compiled library with gcc linker), I can now build my source file (f1.c) without errors, but the executable aborts with a fatal error.
This is what I use to compile my source file with gcc (f1.c):
gcc f1.c -L/opt/pgi/linux86-64/16.5/lib -L/usr/lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -L. -laccapi -laccg -laccn -laccg2 -ldl -lcudadevice -lpgmp -lnuma -lpthread -lnspgc -lpgc -lm -lgcc -lc -lgcc -lmyacc
This is the error:
Num devices: 2
Accelerator Fatal Error: No CUDA device code available
Thanks to the -v option when compiling f1.c with the PGI compiler, I can see that the compiler invokes many other tools from PGI and NVIDIA (like pgacclnk and nvlink).
My questions:
Am I on the wrong path? Can I call functions in PGI-compiled libraries from GCC and use OpenACC within those functions?
If the answer to the above is positive, can I still link without the extra steps (calling pgacclnk and nvlink) that PGI takes?
If the answer to that is also positive, what should I do?
Add "-ta=tesla:nordc" to your pgcc compilation. By default PGI uses runtime dynamic compilation (RDC) for the GPU code. However RDC requires an extra link step (with nvlink) that gcc does not support. The "nordc" sub-option disables RDC so you'll be able to use OpenACC code in a library. However by disabling RDC you can no longer call external device routines from a compute region.
% pgcc -acc -ta=tesla:nordc -c libmyacc.c
% ar -cvq libmyacc.a libmyacc.o
a - libmyacc.o
% gcc f1.c -L/proj/pgi/linux86-64/16.5/lib -L/usr/lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -L. -laccapi -laccg -laccn -laccg2 -ldl -lcudadevice -lpgmp -lnuma -lpthread -lnspgc -lpgc -lm -lgcc -lc -lgcc -lmyacc
% a.out
Hello --- Start of the main.
Sum: 500500.000
Num devices: 8
Sum2: 500500.000
Hope this helps,
Mat