SSE floating point dot product for dummies

I have read many SO questions about SSE/SIMD (e.g., Getting started with SSE), but I'm still confused by all of it. All I want is a dot product between two double precision floating-point vectors, in C (C99 FWIW). I'm using GCC.
Can someone post a simple and complete example, including how to convert double vectors to the SSE types and back again?
[Edit 2012-10-08]
Here's some SSE2 code I managed to cobble together, critiques?
#include <emmintrin.h>
#include <stdlib.h>   /* calloc, free */
double dotprod(double *restrict a, double *restrict b, int n)
{
    __m128d aa, bb, cc, ss;
    int i, n1 = n - 1;
    double *s = calloc(2, sizeof(double));
    double s2 = 0;
    ss = _mm_set1_pd(0);
    for(i = 0 ; i < n1 ; i += 2)
    {
        aa = _mm_load_pd(a + i);
        bb = _mm_load_pd(b + i);
        cc = _mm_mul_pd(aa, bb);
        ss = _mm_add_pd(ss, cc);
    }
    _mm_store_pd(s, ss);
    s2 = s[0] + s[1];
    if(i < n)
        s2 += a[i] * b[i];
    free(s);
    return s2;
}
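One possible critique of the code above: `_mm_load_pd` and `_mm_store_pd` require 16-byte-aligned addresses, which neither the callers' arrays nor `calloc` are guaranteed to provide, and the heap allocation for two doubles is avoidable. The following is only a sketch of an alternative under those observations, not a tuned implementation; the name `dotprod_sse2` is made up for illustration. It uses unaligned loads and a small stack buffer for the horizontal sum.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Dot product of two double arrays of length n using SSE2.
   Unaligned loads are used, so a and b need no special alignment. */
double dotprod_sse2(const double *a, const double *b, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    size_t i = 0;

    /* Process two doubles per iteration. */
    for (; i + 1 < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        acc = _mm_add_pd(acc, _mm_mul_pd(va, vb));
    }

    /* Horizontal sum: write both lanes to a small stack buffer. */
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    double sum = lanes[0] + lanes[1];

    /* Odd n: handle the last element in scalar code. */
    if (i < n)
        sum += a[i] * b[i];
    return sum;
}
```

If the inputs are known to be 16-byte aligned, the aligned load variants can be swapped back in; whether that makes a measurable difference depends on the CPU and should be benchmarked.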

Is there any optimization function in Rcpp

The following is my Rcpp code. I want to minimize the objective function logtpoi(x,theta) with respect to theta in R using 'nlminb', but I found it slow.
I have two questions:
Can anyone improve my Rcpp code?
Are there any optimization functions in Rcpp? If so, maybe I can call them directly from Rcpp. How would I use them? Thank you very much.
My code:
#include <RcppArmadillo.h>
using namespace Rcpp;
using namespace arma;
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::export]]
List dtpoi0(const IntegerVector& x, const NumericVector& theta){
//x is 3-dim vector; theta is a 6-dim parameter vector.
//be careful the order of theta1,...,theta6.
double theta1 = theta[0]; double theta2 = theta[1];
double theta3 = theta[2]; double theta4 = theta[3];
double theta5 = theta[4]; double theta6 = theta[5];
int x1 = x[0]; int x2 = x[1]; int x3 = x[2];
IntegerVector z1 = IntegerVector::create(x1,x2);
IntegerVector z2 = IntegerVector::create(x1,x3);
IntegerVector z3 = IntegerVector::create(x2,x3);
int s1 = min(z1); int s2 = min(z2); int s3 = min(z3);
arma::imat missy(1,3,fill::zeros); arma::irowvec ijk={0,0,0};
for (int i = 0; i <= s1; ++i) {
for (int j = 0; j <= s2; ++j) {
for (int k = 0; k <= s3; ++k) {
if ((i+j <= s1) & (i+k <= s2) & ( j+k <= s3))
{ ijk = {i,j,k};
missy = join_cols(missy,ijk);}
}
}
}
IntegerMatrix misy = as<IntegerMatrix>(wrap(missy));
IntegerVector u1 = IntegerVector::create(0);
IntegerVector u2 = IntegerVector::create(0);
IntegerVector u3 = IntegerVector::create(0);
IntegerVector u4 = IntegerVector::create(0);
IntegerVector u5 = IntegerVector::create(0);
IntegerVector u6 = IntegerVector::create(0);
int total = misy.nrow();
double fvalue = 0;
NumericVector part1(1); NumericVector part2(1);
NumericVector part3(1); NumericVector part4(1);
NumericVector part5(1); NumericVector part6(1);
for (int l = 1; l < total; ++l) {
u1 = IntegerVector::create(x1-misy(l,0)-misy(l,1));
u2 = IntegerVector::create(x2-misy(l,0)-misy(l,2));
u3 = IntegerVector::create(x3-misy(l,1)-misy(l,2));
u4 = IntegerVector::create(misy(l,0));
u5 = IntegerVector::create(misy(l,1));
u6 = IntegerVector::create(misy(l,2));
part1 = dpois(u1,theta1);
part2 = dpois(u2,theta2);
part3 = dpois(u3,theta3);
part4 = dpois(u4,theta4);
part5 = dpois(u5,theta5);
part6 = dpois(u6,theta6);
fvalue = fvalue + (part1*part2*part3*part4*part5*part6)[0]; }
return(List::create(Named("misy") = misy,Named("fvalue") = fvalue));
}
// [[Rcpp::export]]
NumericVector dtpoi(const IntegerMatrix& x, const NumericVector& theta){
//x is n*3 matrix, n is the number of observations.
int n = x.nrow();
NumericVector density(n);
for (int i = 0; i < n; ++i){
density(i) = dtpoi0(x.row(i),theta)["fvalue"];
}
return(density);
}
// [[Rcpp::export]]
double logtpoi0(const IntegerMatrix& x,const NumericVector theta){
// theta must be a 6-dimension parameter.
double nln = -sum(log( dtpoi(x,theta) + 1e-60 ));
if(arma::is_finite(nln)) {nln = nln;} else {nln = -1e10;}
return(nln);
}

Huge caveat ahead: I don’t really know Armadillo. But I’ve had a stab at it because the code looks interesting.
A few general things:
You don’t need to declare things before you assign them for the first time. In particular, it’s generally not necessary to declare vectors outside a loop if they’re only used inside the loop. Declaring them inside the loop is probably no less efficient than declaring them outside. However, if your code is too slow, it makes sense to profile this carefully and test whether the assumption holds.
Many of your declarations are just aliases for vector elements and don’t seem necessary.
Your z{1…3} vectors aren’t necessary. C++ has a min function to find the minimum of two elements.
dtpoi0 contains two main loops. Both of these have been heavily modified in my code:
The first loop iterates over many ks that are never used, due to the internal if that tests whether i + j exceeds s1. By pulling this check into the loop condition of j, we perform fewer k loops.
Your if uses & instead of &&. Like in R, using && rather than & causes short-circuiting. While this is probably not more efficient in this case, using && is idiomatic, whereas & causes head-scratching (my code uses and which is an alternative way of spelling && in C++; I prefer its readability).
The second loop effectively performs a matrix operation manually. I feel that there should be a way of expressing this purely with matrix operations, but as mentioned I’m not an Armadillo user. Still, my changes attempt to vectorise as much of this operation as possible (if nothing else this makes the code shorter). The dpois inner product is unfortunately still inside a loop.
The logic of logtpoi0 can be made more idiomatic and (IMHO) more readable by using the conditional operator instead of if.
const-correctness is a big deal in C++, since it weeds out accidental modifications. Use const liberally when declaring variables that are not supposed to change.
In terms of efficiency, the biggest hit when calling dtpoi or logtpoi0 is probably the conversion of missy to misy, which causes allocations and memory copies. Only convert to IntegerMatrix when necessary, i.e. when actually returning that value to R. For that reason, I’ve split dtpoi0 into two parts.
Another inefficiency is the fact that the first loop in dtpoi0 grows a matrix by appending rows (join_cols concatenates vertically). That’s a big no-no. However, rewriting the code to avoid this isn’t trivial.
#include <algorithm>
#include <RcppArmadillo.h>
// [[Rcpp::depends("RcppArmadillo")]]
using namespace Rcpp;
using namespace arma;
imat dtpoi0_mat(const IntegerVector& x) {
const int s1 = std::min(x[0], x[1]);
const int s2 = std::min(x[0], x[2]);
const int s3 = std::min(x[1], x[2]);
imat missy(1, 3, fill::zeros);
for (int i = 0; i <= s1; ++i) {
for (int j = 0; j <= s2 and i + j <= s1; ++j) {
for (int k = 0; k <= s3 and i + k <= s2 and j + k <= s3; ++k) {
missy = join_cols(missy, irowvec{i, j, k});
}
}
}
return missy;
}
double dtpoi0_fvalue(const IntegerVector& x, const NumericVector& theta, imat& missy) {
double fvalue = 0.0;
ivec xx = as<ivec>(x);
missy.each_row([&](irowvec& v) {
// u = (x1-i-j, x2-i-k, x3-j-k, i, j, k); element indexing yields a column, so v is transposed
const ivec u(join_cols(xx - v(uvec{0, 0, 1}) - v(uvec{1, 2, 2}), v.t()));
double prod = 1;
for (int i = 0; i < u.n_elem; ++i) {
prod *= R::dpois(u[i], theta[i], 0);
}
fvalue += prod;
});
return fvalue;
}
double dtpoi0_fvalue(const IntegerVector& x, const NumericVector& theta) {
imat missy = dtpoi0_mat(x);
return dtpoi0_fvalue(x, theta, missy);
}
// [[Rcpp::export]]
List dtpoi0(const IntegerVector& x, const NumericVector& theta) {
imat missy = dtpoi0_mat(x);
const double fvalue = dtpoi0_fvalue(x, theta, missy);
return List::create(Named("misy") = as<IntegerMatrix>(wrap(missy)), Named("fvalue") = fvalue);
}
// [[Rcpp::export]]
NumericVector dtpoi(const IntegerMatrix& x, const NumericVector& theta) {
//x is n*3 matrix, n is the number of observations.
int n = x.nrow();
NumericVector density(n);
for (int i = 0; i < n; ++i){
density(i) = dtpoi0_fvalue(x.row(i), theta);
}
return density;
}
// [[Rcpp::export]]
double logtpoi0(const IntegerMatrix& x, const NumericVector theta) {
// theta must be a 6-dimension parameter.
const double nln = -sum(log(dtpoi(x, theta) + 1e-60));
return is_finite(nln) ? nln : -1e10;
}
Important: This compiles, but I can’t test its correctness. It’s entirely possible (even likely!) that my refactor introduced errors. It should therefore only be viewed as a solution sketch, and should by no means be copied and pasted into an application.
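One way to avoid growing the matrix row by row is to count the rows in a first pass and allocate once. The sketch below is plain C rather than Armadillo, so it only illustrates the idea; `enumerate_triples` and its flat three-ints-per-row layout are hypothetical, and the loop bounds are copied from dtpoi0_mat above.

```c
#include <stdlib.h>

/* Hypothetical helper: enumerate all (i, j, k) with i + j <= s1,
   i + k <= s2 and j + k <= s3 as rows of a flat array (3 ints per row).
   The row count is stored in *rows; the caller frees the result. */
int *enumerate_triples(int s1, int s2, int s3, size_t *rows)
{
    size_t count = 0;

    /* Pass 1: count the rows so that a single allocation suffices. */
    for (int i = 0; i <= s1; ++i)
        for (int j = 0; j <= s2 && i + j <= s1; ++j)
            for (int k = 0; k <= s3 && i + k <= s2 && j + k <= s3; ++k)
                ++count;

    *rows = 0;
    if (count == 0)
        return NULL;
    int *out = malloc(count * 3 * sizeof *out);
    if (out == NULL)
        return NULL;

    /* Pass 2: fill the preallocated buffer in the same order. */
    size_t r = 0;
    for (int i = 0; i <= s1; ++i)
        for (int j = 0; j <= s2 && i + j <= s1; ++j)
            for (int k = 0; k <= s3 && i + k <= s2 && j + k <= s3; ++k) {
                out[3 * r + 0] = i;
                out[3 * r + 1] = j;
                out[3 * r + 2] = k;
                ++r;
            }

    *rows = count;
    return out;
}
```

The enumeration runs twice, but it is cheap compared with reallocating and copying the whole matrix on every appended row. In Armadillo the same idea would presumably mean sizing the imat once and assigning rows by index.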

How to find coefficients of linear combination of n numbers, which gives their GCD?

I have an array of numbers and I want their corresponding coefficients such that their linear combination would give me their greatest common divisor.
I found how to do this for 2 numbers in this link, but I was wondering how to do it efficiently for an array of numbers.
ll gcdExtended(ll a, ll b, ll *x, ll *y)
{
    // Base Case
    if (a == 0)
    {
        *x = 0;
        *y = 1;
        return b;
    }
    ll x1, y1; // To store results of recursive call
    ll gcd = gcdExtended(b%a, a, &x1, &y1);
    // Update x and y using results of recursive
    // call
    *x = y1 - (b/a) * x1;
    *y = x1;
    return gcd;
}
ll calc(ll gcd, int i)
{
    ll x,y;
    // gcdExtended takes pointers, so pass the addresses of x and y
    ll gcd_new = gcdExtended(gcd,conc[i],&x,&y);
    if (i == conc.size()-1)
    {
        return x;
    }
    ll temp = calc(gcd_new, i+1);
    x *= temp;
    y *= temp;
    return x;
}
Here is what I tried. ll stands for long long int and conc is a global vector containing the numbers whose GCD needs to be found.
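A sketch of one way to extend the two-number routine to an array, written in C: fold the numbers in one at a time, and whenever the running GCD g is combined with the next number as g*x + next*y, scale every coefficient found so far by x and record y for the new number. The names `ext_gcd` and `gcd_coefficients` are made up for this illustration, non-negative inputs are assumed, and the coefficients can grow quickly, so overflow of long long is not handled.

```c
#include <stddef.h>

typedef long long ll;

/* Iterative extended Euclid: returns g = gcd(a, b) and sets *x, *y
   so that a * (*x) + b * (*y) == g. Assumes a, b >= 0. */
static ll ext_gcd(ll a, ll b, ll *x, ll *y)
{
    ll x0 = 1, y0 = 0, x1 = 0, y1 = 1;
    while (b != 0) {
        ll q = a / b, t;
        t = a - q * b;   a = b;   b = t;
        t = x0 - q * x1; x0 = x1; x1 = t;
        t = y0 - q * y1; y0 = y1; y1 = t;
    }
    *x = x0;
    *y = y0;
    return a;
}

/* Fills coef[0..n-1] so that the sum of nums[i] * coef[i] equals the
   GCD of all nums, and returns that GCD (0 for an empty array). */
ll gcd_coefficients(const ll *nums, size_t n, ll *coef)
{
    if (n == 0)
        return 0;
    ll g = nums[0];
    coef[0] = 1;
    for (size_t i = 1; i < n; ++i) {
        ll x, y;
        g = ext_gcd(g, nums[i], &x, &y);
        /* old g == sum of nums[j] * coef[j], so scale those by x */
        for (size_t j = 0; j < i; ++j)
            coef[j] *= x;
        coef[i] = y;
    }
    return g;
}
```

The inner rescaling loop makes this quadratic in the number of elements. Storing each step's x and multiplying them together in a single backward pass would make it linear, which is roughly what the recursive attempt above is aiming for.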

Clarification of Answer... find the max possible two equal sum in a SET

I need a clarification of the answer of this question but I can not comment (not enough rep) so I ask a new question. Hope it is ok.
The problem is this:
Given an array, you have to find the max possible two equal sum, you
can exclude elements.
i.e 1,2,3,4,6 is given array we can have max two equal sum as 6+2 =
4+3+1
i.e 4,10,18, 22, we can get two equal sum as 18+4 = 22
what would be your approach to solve this problem apart from brute
force to find all computation and checking two possible equal sum?
edit 1: max no of array elements are N <= 50 and each element can be
up to 1<= K <=1000
edit 2: Total elements sum cannot be greater than 1000.
The approved answer says:
I suggest solving this using DP where instead of tracking A,B (the
size of the two sets), you instead track A+B,A-B (the sum and
difference of the two sets).
Then for each element in the array, try adding it to A, or B, or
neither.
The advantage of tracking the sum/difference is that you only need to
keep track of a single value for each difference, namely the largest
value of the sum you have seen for this difference.
What I do not understand is:
If this were the subset sum problem, I could solve it with DP, using a memoization matrix of size (N x P), where N is the size of the set and P is the target sum...
But I cannot figure out how I should keep track of A+B and A-B (as the author of the approved answer suggests). What should the dimensions of the memoization matrix be, and how does that help to solve the problem?
The author of the answer was kind enough to provide a code example, but it is hard for me to understand since I do not know Python (I know Java).

I think that trying to relate this solution to the single subset problem might be misleading for you. Here we are concerned with a maximum achievable sum, and what's more, we need to distinguish between two disjoint sets of numbers as we traverse. Clearly tracking specific combinations would be too expensive.
Looking at the difference between sets A and B, we can say:
A - B = d
A = d + B
Clearly, we want the highest sum when d = 0. How do we know that sum? It's (A + B) / 2!
For the transition in the dynamic program, we'd like to know if it's better to place the current element in A, B or neither. This is achieved like this:
e <- current element
d <- difference between A and B
(1) add e to A -> d + e
why?
A = d + B
(A + e) = d + e + B
(2) add e to B -> d - e
why?
A = d + B
A = d - e + (B + e)
(3) don't use e -> that's simply
what we already have stored for d
Let's look at Peter de Rivas' code for the transition:
# update a copy of our map, so
# we can reference previous values,
# while assigning new values
D2=D.copy()
# d is A - B
# s is A + B
for d,s in D.items():
    # a new sum that includes element a
    # we haven't decided if a
    # will be in A or B
    s2 = s + a
    # d2 will take on each value here
    # in turn, once d - a (adding a to B),
    # and once d + a (adding a to A)
    for d2 in [d-a, d+a]:
        # The main transition:
        # the two new differences,
        # (d-a) and (d+a) as keys in
        # our map get the highest sum
        # seen so far, either (1) the
        # new sum, s2, or (2) what we
        # already stored (meaning `a`
        # will be excluded here)
        # so all three possibilities
        # are covered.
        D2[abs(d2)] = max(D2[abs(d2)], s2)
In the end we have stored the highest A + B seen for d = 0, where the elements in A and B form disjoint sets. Return (A + B) / 2.
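For readers who do not know Python, here is a sketch of the same transition in C, with the map replaced by a one-dimensional array indexed by the absolute difference. This answers the "dimensions" question under one reading: a single array of length (total sum + 1), holding the best A + B per difference. The function name `max_equal_sum` is hypothetical, and the code assumes all elements are positive.

```c
#include <stdlib.h>
#include <string.h>

/* Returns the largest S such that two disjoint subsets of arr both sum
   to S (0 if only the empty subsets qualify), or -1 if allocation fails. */
int max_equal_sum(const int *arr, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; ++i)
        total += arr[i];

    /* best[d] = largest A + B seen with |A - B| == d, or -1 if unreachable */
    size_t len = (size_t)total + 1;
    int *best = malloc(len * sizeof *best);
    int *next = malloc(len * sizeof *next);
    if (best == NULL || next == NULL) {
        free(best);
        free(next);
        return -1;
    }
    for (size_t d = 0; d < len; ++d)
        best[d] = -1;
    best[0] = 0;

    for (size_t i = 0; i < n; ++i) {
        int a = arr[i];
        /* Option 3 (skip a): start from a copy of the current table. */
        memcpy(next, best, len * sizeof *best);
        for (int d = 0; d <= total; ++d) {
            if (best[d] < 0)
                continue;
            int s2 = best[d] + a;     /* new A + B when a is used */
            int up = d + a;           /* a joins the larger set */
            int down = abs(d - a);    /* a joins the smaller set */
            if (up <= total && next[up] < s2)
                next[up] = s2;
            if (next[down] < s2)
                next[down] = s2;
        }
        int *tmp = best;
        best = next;
        next = tmp;
    }

    int result = best[0] / 2;         /* d == 0 means A == B == (A + B) / 2 */
    free(best);
    free(next);
    return result;
}
```

With the limits quoted in the question (total sum at most 1000) the table has at most 1001 entries, so the work is roughly N times that.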

Try this DP approach; it works fine.
/*
*
i/p ::
1
5
1 2 3 4 6
o/p : 8
1
4
4 10 18 22
o/p : 22
1
4
4 118 22 3
o/p : 0
*/
import java.util.Scanner;
public class TwoPipesOfMaxEqualLength {
public static void main(String[] args) {
Scanner sc = new Scanner(System.in);
int t = sc.nextInt();
while (t-- > 0) {
int n = sc.nextInt();
int[] arr = new int[n + 1];
for (int i = 1; i <= n; i++) {
arr[i] = sc.nextInt();
}
MaxLength(arr, n);
}
}
private static void MaxLength(int[] arr, int n) {
int dp[][] = new int[1005][1005];
int dp1[][] = new int[1005][1005];
// initialize dp with values as 0.
for (int i = 0; i <= 1000; i++) {
for (int j = 0; j <= 1000; j++)
dp[i][j] = 0;
}
// make (0,0) as 1.
dp[0][0] = 1;
for (int i = 1; i <= n; i++) {
for (int j = 0; j <= 1000; j++) {
for (int k = 0; k <= 1000; k++) {
if (j >= arr[i]) {
if (dp[j - arr[i]][k] == 1) {
dp1[j][k] = 1;
}
}
if (k >= arr[i]) {
if (dp[j][k - arr[i]] == 1) {
dp1[j][k] = 1;
}
}
if (dp[j][k] == 1) {
dp1[j][k] = 1;
}
}
}
for (int j = 0; j <= 1000; j++) {
for (int k = 0; k <= 1000; k++) {
dp[j][k] = dp1[j][k];
dp1[j][k] = 0;
}
}
}
int ans = 0;
for (int i = 1; i <= 1000; i++) {
if (dp[i][i] == 1) {
ans = i;
}
}
System.out.println(ans);
}
}

#include <bits/stdc++.h>
using namespace std;
/*
Brute force recursive solve.
*/
void solve(vector<int>&arr, int &ans, int p1, int p2, int idx, int mx_p){
// if p1 == p2, we have a potential answer
if(p1 == p2){
ans = max(ans, p1);
}
//base case 1:
if((p1>mx_p) || (p2>mx_p) || (idx >= arr.size())){
return;
}
// leave the current element
solve(arr, ans, p1, p2, idx+1, mx_p);
// add the current element to p1
solve(arr, ans, p1+arr[idx], p2, idx+1, mx_p);
// add the current element to p2
solve(arr, ans, p1, p2+arr[idx], idx+1, mx_p);
}
/*
Recursive solve with memoization.
*/
int solve(vector<vector<vector<int>>>&memo, vector<int>&arr,
int p1, int p2, int idx, int mx_p){
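// base case 1: a sum exceeds the largest possible answer
if((p1>mx_p) || (p2>mx_p)){
return -1;
}
// base case 2: all elements consumed; avoids reading arr[arr.size()]
if(idx >= (int)arr.size()){
return (p1 == p2) ? p1 : -1;
}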
// memo'ed answer
if(memo[p1][p2][idx]>-1){
return memo[p1][p2][idx];
}
// if p1 == p2, we have a potential answer
if(p1 == p2){
memo[p1][p2][idx] = max(memo[p1][p2][idx], p1);
}
// leave the current element
memo[p1][p2][idx] = max(memo[p1][p2][idx], solve(memo, arr, p1, p2,
idx+1, mx_p));
// add the current element to p1
memo[p1][p2][idx] = max(memo[p1][p2][idx],
solve(memo, arr, p1+arr[idx], p2, idx+1, mx_p));
// add the current element to p2
memo[p1][p2][idx] = max(memo[p1][p2][idx],
solve(memo, arr, p1, p2+arr[idx], idx+1, mx_p));
return memo[p1][p2][idx];
}
int main(){
vector<int>arr = {1, 2, 3, 4, 7};
int ans = 0;
int mx_p = 0;
for(auto i:arr){
mx_p += i;
}
mx_p /= 2;
vector<vector<vector<int>>>memo(mx_p+1, vector<vector<int>>(mx_p+1,
vector<int>(arr.size()+1,-1)));
ans = solve(memo, arr, 0, 0, 0, mx_p);
ans = (ans>=0)?ans:0;
// solve(arr, ans, 0, 0, 0, mx_p);
cout << ans << endl;
return 0;
}

Portable efficient alternative to PDEP without using BMI2?

The documentation for the parallel deposit instruction (PDEP) in Intel's Bit Manipulation Instruction Set 2 (BMI2) describes the following serial implementation for the instruction (C-like pseudocode):
U64 _pdep_u64(U64 val, U64 mask) {
    U64 res = 0;
    for (U64 bb = 1; mask; bb += bb) {
        if (val & bb)
            res |= mask & -mask;
        mask &= mask - 1;
    }
    return res;
}
See also Intel's pdep insn ref manual entry.
This algorithm is O(n), where n is the number of set bits in mask, which obviously has a worst case of O(k) where k is the total number of bits in mask.
Is a more efficient worst case algorithm possible?
Is it possible to make a faster version that assumes that val has at most one bit set, ie either equals 0 or equals 1<<r for some value of r from 0 to 63?

The second part of the question, about the special case of a 1-bit deposit, requires two steps. In the first step, we need to determine the bit index r of the single 1-bit in val, with a suitable response in case val is zero. This can easily be accomplished via the POSIX function ffs, or if r is known by other means, as alluded to by the asker in comments. In the second step we need to identify bit index i of the r-th 1-bit in mask, if it exists. We can then deposit the r-th bit of val at bit i.
One way of finding the index of the r-th 1-bit in mask is to tally the 1-bits using a classical population count algorithm based on binary partitioning, and record all of the intermediate group-wise bit counts. We then perform a binary search on the recorded bit-count data to identify the position of the desired bit.
The following C-code demonstrates this using 64-bit data. Whether this is actually faster than the iterative method will very much depend on typical values of mask and val.
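#define _GNU_SOURCE   /* needed for ffsll on older glibc */
#include <stdint.h>
#include <string.h>   /* ffsll (glibc extension) */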
/* Find the index of the n-th 1-bit in mask, n >= 0
The index of the least significant bit is 0
Return -1 if there is no such bit
*/
int find_nth_set_bit (uint64_t mask, int n)
{
int t, i = n, r = 0;
const uint64_t m1 = 0x5555555555555555ULL; // even bits
const uint64_t m2 = 0x3333333333333333ULL; // even 2-bit groups
const uint64_t m4 = 0x0f0f0f0f0f0f0f0fULL; // even nibbles
const uint64_t m8 = 0x00ff00ff00ff00ffULL; // even bytes
uint64_t c1 = mask;
uint64_t c2 = c1 - ((c1 >> 1) & m1);
uint64_t c4 = ((c2 >> 2) & m2) + (c2 & m2);
uint64_t c8 = ((c4 >> 4) + c4) & m4;
uint64_t c16 = ((c8 >> 8) + c8) & m8;
uint64_t c32 = (c16 >> 16) + c16;
int c64 = (int)(((c32 >> 32) + c32) & 0x7f);
t = (c32 ) & 0x3f; if (i >= t) { r += 32; i -= t; }
t = (c16>> r) & 0x1f; if (i >= t) { r += 16; i -= t; }
t = (c8 >> r) & 0x0f; if (i >= t) { r += 8; i -= t; }
t = (c4 >> r) & 0x07; if (i >= t) { r += 4; i -= t; }
t = (c2 >> r) & 0x03; if (i >= t) { r += 2; i -= t; }
t = (c1 >> r) & 0x01; if (i >= t) { r += 1; }
if (n >= c64) r = -1;
return r;
}
/* val is either zero or has a single 1-bit.
Return -1 if val is zero, otherwise the index of the 1-bit
The index of the least significant bit is 0
*/
int find_bit_index (uint64_t val)
{
return ffsll (val) - 1;
}
uint64_t deposit_single_bit (uint64_t val, uint64_t mask)
{
uint64_t res = (uint64_t)0;
int r = find_bit_index (val);
if (r >= 0) {
int i = find_nth_set_bit (mask, r);
if (i >= 0) res = (uint64_t)1 << i;
}
return res;
}
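For comparison, the iterative method for the single-bit case can be sketched in a few lines: find the index r of the bit in val, clear the r lowest 1-bits of mask, and keep the lowest remaining one. This is a baseline for benchmarking rather than a recommendation; the name `deposit_single_bit_loop` is made up here, and like the code above it assumes val is zero or has exactly one bit set.

```c
#include <stdint.h>

/* Baseline sketch: deposit the single set bit of val (if any) into mask.
   Runs in O(r) steps, where r is the index of the set bit in val. */
uint64_t deposit_single_bit_loop(uint64_t val, uint64_t mask)
{
    if (val == 0)
        return 0;

    /* Index of the set bit in val, found with a portable loop. */
    int r = 0;
    while (((val >> r) & 1u) == 0)
        ++r;

    /* Drop the r lowest set bits of mask. */
    while (r-- > 0 && mask != 0)
        mask &= mask - 1;

    /* Isolate the lowest remaining set bit (0 if none is left). */
    return mask & -mask;
}
```

For example, with mask = 0x1A (bits 1, 3 and 4 set) and val = 4 (bit 2), the result is 0x10, which matches what the serial PDEP pseudocode in the question produces for those inputs.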

Using extended euclidean algorithm to find number of intersections of a line segment with points on a 2D grid

Consider the grid formed by the points (M*x, M*y), and two points A(x1,y1) and B(x2,y2), where all the variables are integers. I need to count how many grid points lie on the line segment from A to B. I know that this can be done using the extended Euclidean algorithm somehow, but I have no clue how to do it. I would appreciate your help.

Rephrased, you want to determine how many numbers t satisfy
(1) M divides (1-t) x1 + t x2
(2) M divides (1-t) y1 + t y2
(3) 0 <= t <= 1.
Let's focus on (1). We introduce an integer variable q to represent the divisibility constraint and solve for t:
exists integer q, M q = (1-t) x1 + t x2
exists integer q, M q - x1 = (x2 - x1) t.
If x1 is not equal to x2, then this gives a periodic set of possibilities of the form t in {a + b q | q integer}, where a and b are fractions. Otherwise, all t or no t are solutions.
The extended Euclidean algorithm is useful for intersecting the solution sets arising from (1) and (2). Suppose that we want to compute the intersection
{a + b q | q integer} intersect {c + d s | s integer}.
By expressing a generic common element in two different ways, we arrive at a linear Diophantine equation:
a + b q = c + d s,
where a, b, c, d are constant and q, s are integer. Let's clear denominators and gather terms into one equation
A q + B s = C,
where A, B, C are integers. This equation has solutions if and only if the greatest common divisor g of A and B also divides C. Use the extended Euclidean algorithm to compute integer coefficients u, v such that
A u + B v = g.
Then we have a solution
q = u (C/g) + k (B/g)
s = v (C/g) - k (A/g)
for each integer k.
Finally, we have to take constraint (3) into consideration. This should boil down to some arithmetic and one floor division, but I'd rather not work out the details (the special cases of what I've written so far already will take quite a lot of your time).
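The Diophantine step above can be sketched in C as follows. The names `ext_gcd` and `solve_diophantine` are hypothetical, the degenerate case A = B = 0 is not handled, and overflow is ignored; constraint (3) still has to be applied to the resulting family of solutions afterwards.

```c
/* Extended Euclid: returns g = gcd(|a|, |b|) >= 0 and sets *u, *v
   so that a * (*u) + b * (*v) == g. */
static long long ext_gcd(long long a, long long b, long long *u, long long *v)
{
    long long u0 = 1, v0 = 0, u1 = 0, v1 = 1;
    while (b != 0) {
        long long q = a / b, t;
        t = a - q * b;   a = b;   b = t;
        t = u0 - q * u1; u0 = u1; u1 = t;
        t = v0 - q * v1; v0 = v1; v1 = t;
    }
    if (a < 0) {                 /* normalise so that g is non-negative */
        a = -a;
        u0 = -u0;
        v0 = -v0;
    }
    *u = u0;
    *v = v0;
    return a;
}

/* Solves A*q + B*s == C over the integers. On success returns 1 and
   stores one solution (q0, s0) and the step (dq, ds); every solution is
   (q0 + k*dq, s0 - k*ds) for an integer k. Returns 0 if there is no
   solution, or if A == B == 0 (not handled in this sketch). */
int solve_diophantine(long long A, long long B, long long C,
                      long long *q0, long long *s0,
                      long long *dq, long long *ds)
{
    long long u, v;
    long long g = ext_gcd(A, B, &u, &v);
    if (g == 0 || C % g != 0)
        return 0;
    *q0 = u * (C / g);
    *s0 = v * (C / g);
    *dq = B / g;
    *ds = A / g;
    return 1;
}
```

For instance, 6 q + 10 s = 8 is solvable because gcd(6, 10) = 2 divides 8, whereas 6 q + 10 s = 7 is not.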

Let dX = Abs(x2-x1) and dY = Abs(y2 - y1).
Then the number of lattice points on the segment is
P = GCD(dX, dY) + 1
(including start and end points)
where GCD is the greatest common divisor (through usual (not extended) Euclidean algorithm)
(See last Properties here)
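As a small C sketch of this formula (function names are made up for illustration): it counts points of the unit integer lattice, so it applies to the question's grid directly only when M = 1, or after dividing all coordinates by M when both endpoints are themselves grid points.

```c
#include <stdlib.h>

/* Plain (non-extended) Euclidean algorithm for non-negative inputs. */
static long long gcd_ll(long long a, long long b)
{
    while (b != 0) {
        long long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Number of integer lattice points on the closed segment from
   (x1, y1) to (x2, y2), both endpoints included. */
long long lattice_points_on_segment(long long x1, long long y1,
                                    long long x2, long long y2)
{
    long long dx = llabs(x2 - x1);
    long long dy = llabs(y2 - y1);
    return gcd_ll(dx, dy) + 1;
}
```

For the segment from (0,0) to (4,6) this gives GCD(4, 6) + 1 = 3, namely (0,0), (2,3) and (4,6).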

Following instructions of Mr. David Eisenstat I have managed to write a program in c++ that calculates the answer.
#include <iostream>
#include <cstdio>    // scanf
#include <cstdlib>   // abs
#include <cmath>     // floor, ceil
using namespace std;
int gcd (int A, int B, int &u, int &v)
{
int Ad = 1;
int Bd = 1;
if (A < 0) { Ad = -1; A = -A; }
if (B < 0) { Bd = -1; B = -B; }
int x = 1, y = 0;
u = 0, v = 1;
while (A != 0)
{
int q = B/A;
int r = B%A;
int m = u-x*q;
int n = v-y*q;
B = A;
A = r;
u = x;
v = y;
x = m;
y = n;
}
u *= Ad;
v *= Bd;
return B;
}
int main(int argc, const char * argv[])
{
int t;
scanf("%d", &t);
for (int i = 0; i<t; i++) {
int x1, x2, y1, y2, M;
scanf("%d %d %d %d %d", &M, &x1, &y1, &x2, &y2);
if ( x1 == x2 ) // vertical line
{
if (x1%M != 0) { // in between the horizontal lines
cout << 0 << endl;
} else
{
int r = abs((y2-y1)/M); // number of points
if (y2%M == 0 || y1%M == 0) // +1 if the line starts or ends on the point
r++;
cout << r << endl;
}
} else {
if (x2 < x1)
{
int c;
c = x1;
x1 = x2;
x2 = c;
}
int A, B, C;
C = x1*y2-y1*x2;
A = M*(y2-y1);
B = -M*(x2-x1);
int u, v;
int g = gcd(A, B, u, v);
//cout << "A: " << A << " B: " << B << " C: " << C << endl;
//cout << A << "*" << u <<"+"<< B <<"*"<<v<<"="<<g<<endl;
double a = -x1/(double)(x2-x1);
double b = M/(double)(x2-x1);
double Z = (-a-C*b/g*u)*g/(B*b);
double Y = (1-a-C*b/g*u)*g/(B*b);
cout << floor(Z) - ceil(Y) + 1 << endl;
}
}
return 0;
}
Thank you for your help! Please check if all special cases are taken into consideration.
