Accelerate matrix speed in neon arm


I ran the code below on 1000 x 1000 matrices and it took 1.4 s. Is there any way to make it faster? The code was tested on a Raspberry Pi 4.
NumPy's matrix multiply takes 0.14 s for the same size.
A and B are both 1000 x 1000.
The code is auto-vectorized during compilation.
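#include <arm_neon.h>
/* A (n x k), B (k x m) and C (n x m) are stored column-major; n, m and k must be multiples of 4. */
void matrix_multiply_neon(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {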
int A_idx;
int B_idx;
int C_idx;
float32x4_t A0;
float32x4_t A1;
float32x4_t A2;
float32x4_t A3;
float32x4_t B0;
float32x4_t B1;
float32x4_t B2;
float32x4_t B3;
float32x4_t C0;
float32x4_t C1;
float32x4_t C2;
float32x4_t C3;
for (int i_idx=0; i_idx<n; i_idx+=4) {
for (int j_idx=0; j_idx<m; j_idx+=4) {
// Zero accumulators before matrix op
C0 = vmovq_n_f32(0);
C1 = vmovq_n_f32(0);
C2 = vmovq_n_f32(0);
C3 = vmovq_n_f32(0);
for (int k_idx=0; k_idx<k; k_idx+=4) {
A_idx = i_idx + n*k_idx;
B_idx = k*j_idx + k_idx;
A0 = vld1q_f32(A+A_idx);
A1 = vld1q_f32(A+A_idx+n);
A2 = vld1q_f32(A+A_idx+2*n);
A3 = vld1q_f32(A+A_idx+3*n);
B0 = vld1q_f32(B+B_idx);
C0 = vfmaq_laneq_f32(C0, A0, B0, 0);
C0 = vfmaq_laneq_f32(C0, A1, B0, 1);
C0 = vfmaq_laneq_f32(C0, A2, B0, 2);
C0 = vfmaq_laneq_f32(C0, A3, B0, 3);
B1 = vld1q_f32(B+B_idx+k);
C1 = vfmaq_laneq_f32(C1, A0, B1, 0);
C1 = vfmaq_laneq_f32(C1, A1, B1, 1);
C1 = vfmaq_laneq_f32(C1, A2, B1, 2);
C1 = vfmaq_laneq_f32(C1, A3, B1, 3);
B2 = vld1q_f32(B+B_idx+2*k);
C2 = vfmaq_laneq_f32(C2, A0, B2, 0);
C2 = vfmaq_laneq_f32(C2, A1, B2, 1);
C2 = vfmaq_laneq_f32(C2, A2, B2, 2);
C2 = vfmaq_laneq_f32(C2, A3, B2, 3);
B3 = vld1q_f32(B+B_idx+3*k);
C3 = vfmaq_laneq_f32(C3, A0, B3, 0);
C3 = vfmaq_laneq_f32(C3, A1, B3, 1);
C3 = vfmaq_laneq_f32(C3, A2, B3, 2);
C3 = vfmaq_laneq_f32(C3, A3, B3, 3);
}
// Compute base index for stores
C_idx = n*j_idx + i_idx;
vst1q_f32(C+C_idx, C0);
vst1q_f32(C+C_idx+n, C1);
vst1q_f32(C+C_idx+2*n, C2);
vst1q_f32(C+C_idx+3*n, C3);
}
}
}


Accurate floating-point computation of the sum and difference of two products

The difference of two products and the sum of two products are two primitives found in a variety of common computations. diff_of_products (a,b,c,d) := ab - cd and sum_of_products(a,b,c,d) := ab + cd are closely-related companion functions that differ only by the sign of some of their operands. Examples for the use of these primitives are:
Computation of a complex multiplication with x = (a + i b) and y = (c + i d):
x*y = diff_of_products (a, c, b, d) + i sum_of_products (a, d, b, c)
Computation of the determinant ad - bc of the 2x2 matrix with rows (a, b) and (c, d): diff_of_products (a, d, b, c)
In a right triangle, computation of the length of the opposite cathetus from the hypotenuse h and the adjacent cathetus a: sqrt (diff_of_products (h, h, a, a))
Computation of the two real solutions of a quadratic equation with positive discriminant:
q = -(b + copysign (sqrt (diff_of_products (b, b, 4a, c)), b)) / 2
x0 = q / a
x1 = c / q
Computation of a 3D cross product a = b ⨯ c:
ax = diff_of_products (by, cz, bz, cy)
ay = diff_of_products (bz, cx, bx, cz)
az = diff_of_products (bx, cy, by, cx)
When computing with IEEE-754 binary floating-point formats, besides obvious issues with potential overflow and underflow, naive implementations of either function can suffer from catastrophic cancellation when the two products are similar in magnitude but of opposite signs for sum_of_products() or same sign for diff_of_products().
Focusing only on the accuracy aspect, how can one implement these functions robustly in the context of IEEE-754 binary arithmetic? The availability of fused multiply-add operations can be assumed, as this operation is supported by most modern processor architectures and exposed, via standard functions, in many programming languages. Without loss of generality, discussion can be restricted to single precision (IEEE-754 binary32) format for ease of exposition and testing.

The utility of the fused multiply-add (FMA) operation in providing protection against subtractive cancellation stems from the participation of the full double-width product in the final addition. To my knowledge, the first public record of its utility for accurately and robustly computing the solutions of quadratic equations is two sets of informal notes by renowned floating-point expert William Kahan:
William Kahan, "Matlab’s Loss is Nobody’s Gain". August 1998, revised July 2004 (online)
William Kahan, "On the Cost of Floating-Point Computation Without Extra-Precise Arithmetic". November 2004 (online)
The standard work on numerical computing by Higham was the first in which I encountered Kahan's algorithm applied to the computation of the determinant of a 2x2 matrix (p. 65):
Nicholas J. Higham, "Accuracy and Stability of Numerical Algorithms", SIAM 1996
A different algorithm for the computation of ab+cd, also based on FMA, was published by three Intel researchers in the context of Intel's first CPU with FMA support, the Itanium processor (p. 273):
Marius Cornea, John Harrison, and Ping Tak Peter Tang: "Scientific Computing on Itanium-based Systems." Intel Press 2002
In recent years, four papers by French researchers examined both algorithms in detail and provided error bounds with mathematical proofs. For binary floating-point arithmetic, provided there is no overflow or underflow in intermediate computation, the maximum relative error of both Kahan's algorithm and the Cornea-Harrison-Tang (CHT) algorithm was shown to be twice the unit roundoff asymptotically, that is, 2u. For IEEE-754 binary32 (single precision) this error bound is 2^-23, and for IEEE-754 binary64 (double precision) it is 2^-52.
Furthermore it was shown that the error in Kahan's algorithm is at most 1.5 ulps for binary floating-point arithmetic. From the literature I am not aware of an equivalent result, that is, a proven ulp error bound, for the CHT algorithm. My own experiments using the code below suggest an error bound of 1.25 ulp.
Sylvie Boldo, "Kahan’s algorithm for a correct discriminant computation at last formally proven", IEEE Transactions on Computers, Vol. 58, No. 2, February 2009, pp. 220-225 (online)
Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller, "Further Analysis of Kahan's Algorithm for the Accurate Computation of 2x2 Determinants", Mathematics of Computation, Vol. 82, No. 284, Oct. 2013, pp. 2245-2264 (online)
Jean-Michel Muller, "On the Error of Computing ab+cd using Cornea, Harrison and Tang's Method", ACM Transactions on Mathematical Software, Vol. 41, No.2, January 2015, Article 7 (online)
Claude-Pierre Jeannerod, "A Radix-Independent Error Analysis of the Cornea-Harrison-Tang Method", ACM Transactions on Mathematical Software Vol. 42, No. 3, May 2016, Article 19 (online)
While Kahan's algorithm requires four floating-point operations, two of which are FMAs, the CHT algorithm requires seven floating-point operations, two of which are FMAs. I constructed the test framework below to explore what other trade-offs may exist. I experimentally confirmed the bounds from the literature on the relative error of both algorithms and the ulp error of Kahan's algorithm. My experiments indicate that the CHT algorithm provides a smaller ulp error bound of 1.25 ulp, but that it also produces incorrectly-rounded results at roughly twice the rate of Kahan's algorithm.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <float.h>
#include <math.h>
#define TEST_SUM (0) // function under test. 0: a*b-c*d; 1: a*b+c*d
#define USE_CHT (0) // algorithm. 0: Kahan; 1: Cornea-Harrison-Tang
/*
Compute a*b-c*d with error <= 1.5 ulp. Maximum relative err = 2**-23
Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller,
"Further Analysis of Kahan's Algorithm for the Accurate Computation
of 2x2 Determinants", Mathematics of Computation, Vol. 82, No. 284,
Oct. 2013, pp. 2245-2264
*/
float diff_of_products_kahan (float a, float b, float c, float d)
{
float w = d * c;
float e = fmaf (c, -d, w);
float f = fmaf (a, b, -w);
return f + e;
}
/*
Compute a*b-c*d with error <= 1.25 ulp (?). Maximum relative err = 2**-23
Claude-Pierre Jeannerod, "A Radix-Independent Error Analysis of the
Cornea-Harrison-Tang Method", ACM Transactions on Mathematical Software
Vol. 42, No. 3, Article 19 (May 2016).
*/
float diff_of_products_cht (float a, float b, float c, float d)
{
float p1 = a * b;
float p2 = c * d;
float e1 = fmaf (a, b, -p1);
float e2 = fmaf (c, -d, p2);
float r = p1 - p2;
float e = e1 + e2;
return r + e;
}
/*
Compute a*b+c*d with error <= 1.5 ulp. Maximum relative err = 2**-23
Jean-Michel Muller, "On the Error of Computing ab+cd using Cornea,
Harrison and Tang's Method", ACM Transactions on Mathematical Software,
Vol. 41, No.2, Article 7, (January 2015)
*/
float sum_of_products_kahan (float a, float b, float c, float d)
{
float w = c * d;
float e = fmaf (c, -d, w);
float f = fmaf (a, b, w);
return f - e;
}
/*
Compute a*b+c*d with error <= 1.25 ulp (?). Maximum relative err = 2**-23
Claude-Pierre Jeannerod, "A Radix-Independent Error Analysis of the
Cornea-Harrison-Tang Method", ACM Transactions on Mathematical Software
Vol. 42, No. 3, Article 19 (May 2016).
*/
float sum_of_products_cht (float a, float b, float c, float d)
{
float p1 = a * b;
float p2 = c * d;
float e1 = fmaf (a, b, -p1);
float e2 = fmaf (c, d, -p2);
float r = p1 + p2;
float e = e1 + e2;
return r + e;
}
// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579) /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)
typedef struct {
double y;
double x;
} dbldbl;
dbldbl make_dbldbl (double head, double tail)
{
dbldbl z;
z.x = tail;
z.y = head;
return z;
}
dbldbl add_dbldbl (dbldbl a, dbldbl b) {
dbldbl z;
double t1, t2, t3, t4, t5;
t1 = a.y + b.y;
t2 = t1 - a.y;
t3 = (a.y + (t2 - t1)) + (b.y - t2);
t4 = a.x + b.x;
t2 = t4 - a.x;
t5 = (a.x + (t2 - t4)) + (b.x - t2);
t3 = t3 + t4;
t4 = t1 + t3;
t3 = (t1 - t4) + t3;
t3 = t3 + t5;
z.y = t4 + t3;
z.x = (t4 - z.y) + t3;
return z;
}
dbldbl sub_dbldbl (dbldbl a, dbldbl b)
{
dbldbl z;
double t1, t2, t3, t4, t5;
t1 = a.y - b.y;
t2 = t1 - a.y;
t3 = (a.y + (t2 - t1)) - (b.y + t2);
t4 = a.x - b.x;
t2 = t4 - a.x;
t5 = (a.x + (t2 - t4)) - (b.x + t2);
t3 = t3 + t4;
t4 = t1 + t3;
t3 = (t1 - t4) + t3;
t3 = t3 + t5;
z.y = t4 + t3;
z.x = (t4 - z.y) + t3;
return z;
}
dbldbl mul_dbldbl (dbldbl a, dbldbl b)
{
dbldbl t, z;
t.y = a.y * b.y;
t.x = fma (a.y, b.y, -t.y);
t.x = fma (a.x, b.x, t.x);
t.x = fma (a.y, b.x, t.x);
t.x = fma (a.x, b.y, t.x);
z.y = t.y + t.x;
z.x = (t.y - z.y) + t.x;
return z;
}
double prod_diff_ref (float a, float b, float c, float d)
{
dbldbl t = sub_dbldbl (
mul_dbldbl (make_dbldbl ((double)a, 0), make_dbldbl ((double)b, 0)),
mul_dbldbl (make_dbldbl ((double)c, 0), make_dbldbl ((double)d, 0))
);
return t.x + t.y;
}
double prod_sum_ref (float a, float b, float c, float d)
{
dbldbl t = add_dbldbl (
mul_dbldbl (make_dbldbl ((double)a, 0), make_dbldbl ((double)b, 0)),
mul_dbldbl (make_dbldbl ((double)c, 0), make_dbldbl ((double)d, 0))
);
return t.x + t.y;
}
float __uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
uint32_t __float_as_uint32 (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof r);
return r;
}
uint64_t __double_as_uint64 (double a)
{
uint64_t r;
memcpy (&r, &a, sizeof r);
return r;
}
static double floatUlpErr (float res, double ref)
{
uint64_t i, j, err;
int expoRef;
/* ulp error cannot be computed if either operand is NaN, infinity, zero */
if (isnan(res) || isnan (ref) || isinf(res) || isinf (ref) ||
(res == 0.0f) || (ref == 0.0f)) {
return 0.0;
}
/* Convert the float result to an "extended float". This is like a float
with 56 instead of 24 effective mantissa bits.
*/
i = ((uint64_t)__float_as_uint32(res)) << 32;
/* Convert the double reference to an "extended float". If the reference is
>= 2^129, we need to clamp to the maximum "extended float". If reference
is < 2^-126, we need to denormalize because of float's limited exponent
range.
*/
expoRef = (int)(((__double_as_uint64(ref) >> 52) & 0x7ff) - 1023);
if (expoRef >= 129) {
j = (__double_as_uint64(ref) & 0x8000000000000000ULL) |
0x7fffffffffffffffULL;
} else if (expoRef < -126) {
j = ((__double_as_uint64(ref) << 11) | 0x8000000000000000ULL) >> 8;
j = j >> (-(expoRef + 126));
j = j | (__double_as_uint64(ref) & 0x8000000000000000ULL);
} else {
j = ((__double_as_uint64(ref) << 11) & 0x7fffffffffffffffULL) >> 8;
j = j | ((uint64_t)(expoRef + 127) << 55);
j = j | (__double_as_uint64(ref) & 0x8000000000000000ULL);
}
err = (i < j) ? (j - i) : (i - j);
return err / 4294967296.0;
}
int main (void)
{
const float ULMT = sqrtf (FLT_MAX) / 2; // avoid overflow
const float LLMT = sqrtf (FLT_MIN) * 2; // avoid underflow
const uint64_t N = 1ULL << 38;
double ref, ulp, relerr, maxrelerr = 0, maxulp = 0;
uint64_t count = 0LL, incorrectly_rounded = 0LL;
uint32_t ai, bi, ci, di;
float af, bf, cf, df, resf;
#if TEST_SUM
printf ("testing a*b+c*d ");
#else
printf ("testing a*b-c*d ");
#endif // TEST_SUM
#if USE_CHT
printf ("using Cornea-Harrison-Tang algorithm\n");
#else
printf ("using Kahan algorithm\n");
#endif
do {
do {
ai = KISS;
af = __uint32_as_float (ai);
} while (!isfinite(af) || (fabsf (af) > ULMT) || (fabsf (af) < LLMT));
do {
bi = KISS;
bf = __uint32_as_float (bi);
} while (!isfinite(bf) || (fabsf (bf) > ULMT) || (fabsf (bf) < LLMT));
do {
ci = KISS;
cf = __uint32_as_float (ci);
} while (!isfinite(cf) || (fabsf (cf) > ULMT) || (fabsf (cf) < LLMT));
do {
di = KISS;
df = __uint32_as_float (di);
} while (!isfinite(df) || (fabsf (df) > ULMT) || (fabsf (df) < LLMT));
count++;
#if TEST_SUM
#if USE_CHT
resf = sum_of_products_cht (af, bf, cf, df);
#else // USE_CHT
resf = sum_of_products_kahan (af, bf, cf, df);
#endif // USE_CHT
ref = prod_sum_ref (af, bf, cf, df);
#else // TEST_SUM
#if USE_CHT
resf = diff_of_products_cht (af, bf, cf, df);
#else // USE_CHT
resf = diff_of_products_kahan (af, bf, cf, df);
#endif // USE_CHT
ref = prod_diff_ref (af, bf, cf, df);
#endif // TEST_SUM
ulp = floatUlpErr (resf, ref);
incorrectly_rounded += ulp > 0.5;
relerr = fabs ((resf - ref) / ref);
if ((ulp > maxulp) || ((ulp == maxulp) && (relerr > maxrelerr))) {
maxulp = ulp;
maxrelerr = relerr;
printf ("%13llu %12llu ulp=%.9f a=% 15.8e b=% 15.8e c=% 15.8e d=% 15.8e res=% 16.6a ref=% 23.13a relerr=%13.9e\n",
(unsigned long long)count, (unsigned long long)incorrectly_rounded, ulp, af, bf, cf, df, resf, ref, relerr);
}
} while (count <= N);
return EXIT_SUCCESS;
}

How to pack Boolean operations using gcc or other compilers?

Intel CPUs can apply a bitwise operation to 256 or 512 bits at once using vector instructions (AVX2, AVX-512). Assume I have a snippet of code that looks like this:
#include <stdio.h>
int main()
{
_Bool i0, i1, i2, i3, w0, w1, w2, w3, w4;
i0 = 1;
i1 = 1;
i2 = 0;
i3 = 0;
w0 = i0 & i1;
w1 = i1 & i2;
w2 = i0 & i3;
w3 = w0 & w1;
w4 = w1 & w2;
printf("%d %d %d %d\n", i0, i1, i2, i3);
printf("%d %d %d %d %d\n", w0, w1, w2, w3, w4);
return 0;
}
Does GCC or the Intel compiler vectorize this code automatically, or do I need to rewrite it to benefit from vectorization? Ideally, I would like the first three operations to be performed in parallel, and then the next two computed in parallel.

Strassen Multiplication Algorithm StackOverFlow Error

I am working on implementing Strassen's multiplication. Below is my method for multiplying two matrices with a divide-and-conquer approach.
public static double[][] multiply(double[][] A, double[][] B)
{
int n = A.length;
double[][] R = new double[n][n];
/** base case **/
//if (n == 1){
// R[0][0] = A[0][0] * B[0][0];
// }
//else{
double[][] A11 = new double [n/2][n/2];
double[][] A12 = new double [n/2][n/2];
double[][] A21 = new double [n/2][n/2];
double[][] A22 = new double [n/2][n/2];
double[][] B11 = new double [n/2][n/2];
double[][] B12 = new double [n/2][n/2];
double[][] B21 = new double [n/2][n/2];
double[][] B22 = new double [n/2][n/2];
/** Dividing matrix A into 4 halves **/
split(A, A11, 0 , 0);
split(A, A12, 0 , n/2);
split(A, A21, n/2, 0);
split(A, A22, n/2, n/2);
/** Dividing matrix B into 4 halves **/
split(B, B11, 0 , 0);
split(B, B12, 0 , n/2);
split(B, B21, n/2, 0);
split(B, B22, n/2, n/2);
/**
M1 = (A11 + A22)(B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 - B22)
M4 = A22 (B21 - B11)
M5 = (A11 + A12) B22
M6 = (A21 - A11) (B11 + B12)
M7 = (A12 - A22) (B21 + B22)
**/
double [][] M1 = multiply(add(A11, A22), add(B11, B22));
double [][] M2 = multiply(add(A21, A22), B11);
double [][] M3 = multiply(A11, sub(B12, B22));
double [][] M4 = multiply(A22, sub(B21, B11));
double [][] M5 = multiply(add(A11, A12), B22);
double [][] M6 = multiply(sub(A21, A11), add(B11, B12));
double [][] M7 = multiply(sub(A12, A22), add(B21, B22));
/**
C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6
**/
double [][] C11 = add(sub(add(M1, M4), M5), M7);
double [][] C12 = add(M3, M5);
double [][] C21 = add(M2, M4);
double [][] C22 = add(sub(add(M1, M3), M2), M6);
/** join 4 halves into one result matrix **/
join(C11, R, 0 , 0);
join(C12, R, 0 , n/2);
join(C21, R, n/2, 0);
join(C22, R, n/2, n/2);
/** return result **/
return R;
}
To run this code, I read two text files, one for matrix A and one for matrix B. For testing, the two matrices are identical:
5
3.250 6.130 3.180 7.680 9.060
5.450 1.660 6.790 6.650 4.250
4.460 8.260 7.870 7.880 1.890
1.460 8.510 8.510 3.510 1.440
1.590 7.160 4.400 3.310 1.970
Where the first line is n, and the following lines are the matrix.
My problem is that I get a StackOverflowError on the line
int n = A.length;
and I can't figure out why or where to look. So my question is: does the problem lie in this algorithm, or in my main method?

Pixel by pixel Bézier Curve

The quadratic/cubic Bézier curve code I find via Google mostly works by subdividing the curve into a series of points and connecting them with straight lines. The rasterization happens in the line algorithm, not in the Bézier one. Algorithms like Bresenham's work pixel by pixel to rasterize a line, and can be optimized (see Po-Han Lin's solution).
What is a quadratic bézier curve algorithm that works pixel-by-pixel like line algorithms instead of by plotting a series of points?

A variation of Bresenham's Algorithm works with quadratic functions like circles, ellipses, and parabolas, so it should work with quadratic Bezier curves too.
I was going to attempt an implementation, but then I found one on the web: http://members.chello.at/~easyfilter/bresenham.html.
If you want more detail or additional examples, the page mentioned above has a link to a 100-page PDF elaborating on the method: http://members.chello.at/~easyfilter/Bresenham.pdf.
Here's the code from Alois Zingl's site for plotting any quadratic Bezier curve. The first routine subdivides the curve at horizontal and vertical gradient changes:
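#include <assert.h>
#include <math.h>
#include <stdlib.h>
void setPixel(int x, int y);                   /* supplied by your graphics code */
void plotLine(int x0, int y0, int x1, int y1); /* line routine from the same site */
void plotQuadBezierSeg(int x0, int y0, int x1, int y1, int x2, int y2);
void plotQuadBezier(int x0, int y0, int x1, int y1, int x2, int y2)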
{ /* plot any quadratic Bezier curve */
int x = x0-x1, y = y0-y1;
double t = x0-2*x1+x2, r;
if ((long)x*(x2-x1) > 0) { /* horizontal cut at P4? */
if ((long)y*(y2-y1) > 0) /* vertical cut at P6 too? */
if (fabs((y0-2*y1+y2)/t*x) > abs(y)) { /* which first? */
x0 = x2; x2 = x+x1; y0 = y2; y2 = y+y1; /* swap points */
} /* now horizontal cut at P4 comes first */
t = (x0-x1)/t;
r = (1-t)*((1-t)*y0+2.0*t*y1)+t*t*y2; /* By(t=P4) */
t = (x0*x2-x1*x1)*t/(x0-x1); /* gradient dP4/dx=0 */
x = floor(t+0.5); y = floor(r+0.5);
r = (y1-y0)*(t-x0)/(x1-x0)+y0; /* intersect P3 | P0 P1 */
plotQuadBezierSeg(x0,y0, x,floor(r+0.5), x,y);
r = (y1-y2)*(t-x2)/(x1-x2)+y2; /* intersect P4 | P1 P2 */
x0 = x1 = x; y0 = y; y1 = floor(r+0.5); /* P0 = P4, P1 = P8 */
}
if ((long)(y0-y1)*(y2-y1) > 0) { /* vertical cut at P6? */
t = y0-2*y1+y2; t = (y0-y1)/t;
r = (1-t)*((1-t)*x0+2.0*t*x1)+t*t*x2; /* Bx(t=P6) */
t = (y0*y2-y1*y1)*t/(y0-y1); /* gradient dP6/dy=0 */
x = floor(r+0.5); y = floor(t+0.5);
r = (x1-x0)*(t-y0)/(y1-y0)+x0; /* intersect P6 | P0 P1 */
plotQuadBezierSeg(x0,y0, floor(r+0.5),y, x,y);
r = (x1-x2)*(t-y2)/(y1-y2)+x2; /* intersect P7 | P1 P2 */
x0 = x; x1 = floor(r+0.5); y0 = y1 = y; /* P0 = P6, P1 = P7 */
}
plotQuadBezierSeg(x0,y0, x1,y1, x2,y2); /* remaining part */
}
The second routine actually plots a Bezier curve segment (one without gradient changes):
void plotQuadBezierSeg(int x0, int y0, int x1, int y1, int x2, int y2)
{ /* plot a limited quadratic Bezier segment */
int sx = x2-x1, sy = y2-y1;
long xx = x0-x1, yy = y0-y1, xy; /* relative values for checks */
double dx, dy, err, cur = xx*sy-yy*sx; /* curvature */
assert(xx*sx <= 0 && yy*sy <= 0); /* sign of gradient must not change */
if (sx*(long)sx+sy*(long)sy > xx*xx+yy*yy) { /* begin with longer part */
x2 = x0; x0 = sx+x1; y2 = y0; y0 = sy+y1; cur = -cur; /* swap P0 P2 */
}
if (cur != 0) { /* no straight line */
xx += sx; xx *= sx = x0 < x2 ? 1 : -1; /* x step direction */
yy += sy; yy *= sy = y0 < y2 ? 1 : -1; /* y step direction */
xy = 2*xx*yy; xx *= xx; yy *= yy; /* differences 2nd degree */
if (cur*sx*sy < 0) { /* negated curvature? */
xx = -xx; yy = -yy; xy = -xy; cur = -cur;
}
dx = 4.0*sy*cur*(x1-x0)+xx-xy; /* differences 1st degree */
dy = 4.0*sx*cur*(y0-y1)+yy-xy;
xx += xx; yy += yy; err = dx+dy+xy; /* error 1st step */
do {
setPixel(x0,y0); /* plot curve */
if (x0 == x2 && y0 == y2) return; /* last pixel -> curve finished */
y1 = 2*err < dx; /* save value for test of y step */
if (2*err > dy) { x0 += sx; dx -= xy; err += dy += yy; } /* x step */
if ( y1 ) { y0 += sy; dy -= xy; err += dx += xx; } /* y step */
} while (dy < 0 && dx > 0); /* gradient negates -> algorithm fails */
}
plotLine(x0,y0, x2,y2); /* plot remaining part to end */
}
Code for antialiasing is also available on the site.
The corresponding functions from Zingl's site for cubic Bezier curves are
void plotCubicBezierSeg(int x0, int y0, float x1, float y1,
float x2, float y2, int x3, int y3);
void plotCubicBezier(int x0, int y0, int x1, int y1,
int x2, int y2, int x3, int y3)
{ /* plot any cubic Bezier curve */
int n = 0, i = 0;
long xc = x0+x1-x2-x3, xa = xc-4*(x1-x2);
long xb = x0-x1-x2+x3, xd = xb+4*(x1+x2);
long yc = y0+y1-y2-y3, ya = yc-4*(y1-y2);
long yb = y0-y1-y2+y3, yd = yb+4*(y1+y2);
float fx0 = x0, fx1, fx2, fx3, fy0 = y0, fy1, fy2, fy3;
double t1 = xb*xb-xa*xc, t2, t[5];
/* sub-divide curve at gradient sign changes */
if (xa == 0) { /* horizontal */
if (labs(xc) < 2*labs(xb)) t[n++] = xc/(2.0*xb); /* one change */
} else if (t1 > 0.0) { /* two changes */
t2 = sqrt(t1);
t1 = (xb-t2)/xa; if (fabs(t1) < 1.0) t[n++] = t1;
t1 = (xb+t2)/xa; if (fabs(t1) < 1.0) t[n++] = t1;
}
t1 = yb*yb-ya*yc;
if (ya == 0) { /* vertical */
if (labs(yc) < 2*labs(yb)) t[n++] = yc/(2.0*yb); /* one change */
} else if (t1 > 0.0) { /* two changes */
t2 = sqrt(t1);
t1 = (yb-t2)/ya; if (fabs(t1) < 1.0) t[n++] = t1;
t1 = (yb+t2)/ya; if (fabs(t1) < 1.0) t[n++] = t1;
}
for (i = 1; i < n; i++) /* bubble sort of 4 points */
if ((t1 = t[i-1]) > t[i]) { t[i-1] = t[i]; t[i] = t1; i = 0; }
t1 = -1.0; t[n] = 1.0; /* begin / end point */
for (i = 0; i <= n; i++) { /* plot each segment separately */
t2 = t[i]; /* sub-divide at t[i-1], t[i] */
fx1 = (t1*(t1*xb-2*xc)-t2*(t1*(t1*xa-2*xb)+xc)+xd)/8-fx0;
fy1 = (t1*(t1*yb-2*yc)-t2*(t1*(t1*ya-2*yb)+yc)+yd)/8-fy0;
fx2 = (t2*(t2*xb-2*xc)-t1*(t2*(t2*xa-2*xb)+xc)+xd)/8-fx0;
fy2 = (t2*(t2*yb-2*yc)-t1*(t2*(t2*ya-2*yb)+yc)+yd)/8-fy0;
fx0 -= fx3 = (t2*(t2*(3*xb-t2*xa)-3*xc)+xd)/8;
fy0 -= fy3 = (t2*(t2*(3*yb-t2*ya)-3*yc)+yd)/8;
x3 = floor(fx3+0.5); y3 = floor(fy3+0.5); /* scale bounds to int */
if (fx0 != 0.0) { fx1 *= fx0 = (x0-x3)/fx0; fx2 *= fx0; }
if (fy0 != 0.0) { fy1 *= fy0 = (y0-y3)/fy0; fy2 *= fy0; }
if (x0 != x3 || y0 != y3) /* segment t1 - t2 */
plotCubicBezierSeg(x0,y0, x0+fx1,y0+fy1, x0+fx2,y0+fy2, x3,y3);
x0 = x3; y0 = y3; fx0 = fx3; fy0 = fy3; t1 = t2;
}
}
and
void plotCubicBezierSeg(int x0, int y0, float x1, float y1,
float x2, float y2, int x3, int y3)
{ /* plot limited cubic Bezier segment */
int f, fx, fy, leg = 1;
int sx = x0 < x3 ? 1 : -1, sy = y0 < y3 ? 1 : -1; /* step direction */
float xc = -fabs(x0+x1-x2-x3), xa = xc-4*sx*(x1-x2), xb = sx*(x0-x1-x2+x3);
float yc = -fabs(y0+y1-y2-y3), ya = yc-4*sy*(y1-y2), yb = sy*(y0-y1-y2+y3);
double ab, ac, bc, cb, xx, xy, yy, dx, dy, ex, *pxy, EP = 0.01;
/* check for curve restrains */
/* slope P0-P1 == P2-P3 and (P0-P3 == P1-P2 or no slope change) */
assert((x1-x0)*(x2-x3) < EP && ((x3-x0)*(x1-x2) < EP || xb*xb < xa*xc+EP));
assert((y1-y0)*(y2-y3) < EP && ((y3-y0)*(y1-y2) < EP || yb*yb < ya*yc+EP));
if (xa == 0 && ya == 0) { /* quadratic Bezier */
sx = floor((3*x1-x0+1)/2); sy = floor((3*y1-y0+1)/2); /* new midpoint */
plotQuadBezierSeg(x0,y0, sx,sy, x3,y3); return; /* void function: call, then return */
}
x1 = (x1-x0)*(x1-x0)+(y1-y0)*(y1-y0)+1; /* line lengths */
x2 = (x2-x3)*(x2-x3)+(y2-y3)*(y2-y3)+1;
do { /* loop over both ends */
ab = xa*yb-xb*ya; ac = xa*yc-xc*ya; bc = xb*yc-xc*yb;
ex = ab*(ab+ac-3*bc)+ac*ac; /* P0 part of self-intersection loop? */
f = ex > 0 ? 1 : sqrt(1+1024/x1); /* calculate resolution */
ab *= f; ac *= f; bc *= f; ex *= f*f; /* increase resolution */
xy = 9*(ab+ac+bc)/8; cb = 8*(xa-ya);/* init differences of 1st degree */
dx = 27*(8*ab*(yb*yb-ya*yc)+ex*(ya+2*yb+yc))/64-ya*ya*(xy-ya);
dy = 27*(8*ab*(xb*xb-xa*xc)-ex*(xa+2*xb+xc))/64-xa*xa*(xy+xa);
/* init differences of 2nd degree */
xx = 3*(3*ab*(3*yb*yb-ya*ya-2*ya*yc)-ya*(3*ac*(ya+yb)+ya*cb))/4;
yy = 3*(3*ab*(3*xb*xb-xa*xa-2*xa*xc)-xa*(3*ac*(xa+xb)+xa*cb))/4;
xy = xa*ya*(6*ab+6*ac-3*bc+cb); ac = ya*ya; cb = xa*xa;
xy = 3*(xy+9*f*(cb*yb*yc-xb*xc*ac)-18*xb*yb*ab)/8;
if (ex < 0) { /* negate values if inside self-intersection loop */
dx = -dx; dy = -dy; xx = -xx; yy = -yy; xy = -xy; ac = -ac; cb = -cb;
} /* init differences of 3rd degree */
ab = 6*ya*ac; ac = -6*xa*ac; bc = 6*ya*cb; cb = -6*xa*cb;
dx += xy; ex = dx+dy; dy += xy; /* error of 1st step */
for (pxy = &xy, fx = fy = f; x0 != x3 && y0 != y3; ) {
setPixel(x0,y0); /* plot curve */
do { /* move sub-steps of one pixel */
if (dx > *pxy || dy < *pxy) goto exit; /* confusing values */
y1 = 2*ex-dy; /* save value for test of y step */
if (2*ex >= dx) { /* x sub-step */
fx--; ex += dx += xx; dy += xy += ac; yy += bc; xx += ab;
}
if (y1 <= 0) { /* y sub-step */
fy--; ex += dy += yy; dx += xy += bc; xx += ac; yy += cb;
}
} while (fx > 0 && fy > 0); /* pixel complete? */
if (2*fx <= f) { x0 += sx; fx += f; } /* x step */
if (2*fy <= f) { y0 += sy; fy += f; } /* y step */
if (pxy == &xy && dx < 0 && dy > 0) pxy = &EP;/* pixel ahead valid */
}
exit: xx = x0; x0 = x3; x3 = xx; sx = -sx; xb = -xb; /* swap legs */
yy = y0; y0 = y3; y3 = yy; sy = -sy; yb = -yb; x1 = x2;
} while (leg--); /* try other end */
plotLine(x0,y0, x3,y3); /* remaining part in case of cusp or crunode */
}
As Mike 'Pomax' Kamermans has noted, the solution for cubic Bezier curves on the site is not complete; in particular, there are issues with antialiasing cubic Bezier curves, and the discussion of rational cubic Bezier curves is incomplete.

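You can use De Casteljau's algorithm to subdivide a curve into enough pieces that each subsection is about one pixel.
This is the equation for finding the [x,y] point on a quadratic curve at parameter T (the closed-form Bernstein polynomial, which gives the same point as De Casteljau's repeated interpolation):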
// Given 3 control points defining the Quadratic curve
// and given T which is an interval between 0.00 and 1.00 along the curve.
// Note:
// At the curve's starting control point T==0.00.
// At the curve's ending control point T==1.00.
var x = Math.pow(1-T,2)*startPt.x + 2 * (1-T) * T * controlPt.x + Math.pow(T,2) * endPt.x;
var y = Math.pow(1-T,2)*startPt.y + 2 * (1-T) * T * controlPt.y + Math.pow(T,2) * endPt.y;
To make practical use of this equation, you can input about 1000 T values between 0.00 and 1.00. This results in a set of 1000 points guaranteed to be along the Quadratic Curve.
Calculating 1000 points along the curve is probably over-sampling (some calculated points will be at the same pixel coordinate) so you will want to de-duplicate the 1000 points until the set represents unique pixel coordinates along the curve.
There is a similar equation for Cubic Bezier curves.
Here's example code that plots a Quadratic Curve as a set of calculated pixels:
var canvas=document.getElementById("canvas");
var ctx=canvas.getContext("2d");
var points=[];
var lastX=-100;
var lastY=-100;
var startPt={x:50,y:200};
var controlPt={x:150,y:25};
var endPt={x:250,y:100};
for(var t=0;t<1000;t++){
var xyAtT=getQuadraticBezierXYatT(startPt,controlPt,endPt,t/1000);
var x=parseInt(xyAtT.x);
var y=parseInt(xyAtT.y);
if(!(x==lastX && y==lastY)){
points.push(xyAtT);
lastX=x;
lastY=y;
}
}
$('#curve').text('Quadratic Curve made up of '+points.length+' individual points');
ctx.fillStyle='red';
for(var i=0;i<points.length;i++){
var x=points[i].x;
var y=points[i].y;
ctx.fillRect(x,y,1,1);
}
function getQuadraticBezierXYatT(startPt,controlPt,endPt,T) {
var x = Math.pow(1-T,2) * startPt.x + 2 * (1-T) * T * controlPt.x + Math.pow(T,2) * endPt.x;
var y = Math.pow(1-T,2) * startPt.y + 2 * (1-T) * T * controlPt.y + Math.pow(T,2) * endPt.y;
return( {x:x,y:y} );
}
body{ background-color: ivory; }
#canvas{border:1px solid red; margin:0 auto; }
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<h4 id='curve'>Q</h4>
<canvas id="canvas" width=350 height=300></canvas>

The thing to realise here is that "line segments", when created small enough, are equivalent to pixels. Bezier curves are not linearly traversable curves, so we can't easily "skip ahead to the next pixel" in a single step, like we can for lines or circular arcs.
You could, of course, take the tangent at any point for a t you already have, and then guess which next value t' will lie a pixel further. However, what typically happens is that you guess, and guess wrong because the curve does not behave linearly, then you check to see how "off" your guess was, correct your guess, and then check again. Repeat until you've converged on the next pixel: this is far, far slower than just flattening the curve to a high number of line segments instead, which is a fast operation.
If you pick the number of segments such that they're appropriate to the curve's length, given the display it's rendered to, no one will be able to tell you flattened the curve.
There are ways to reparameterize Bezier curves, but they're expensive, and different canonical curves require different reparameterization, so that's really not faster either. What tends to be the most useful for discrete displays is to build a LUT (lookup table) for your curve, with a length that works for the size the curve is on the display, and then using that LUT as your base data for drawing, intersection detection, etc. etc.

First of all, I'd like to say that the fastest and most reliable way to render Bezier curves is to approximate them by a polyline via adaptive subdivision, then render the polyline. The approach by @markE of drawing many points sampled on the curve is rather fast, but it can skip pixels. Here I describe another approach, which is closest to line rasterization (though it is slow and hard to implement robustly).
I'll usually treat the curve parameter as time. Here is the pseudocode:
1. Put your cursor at the first control point and find the pixel that contains it.
2. For each of the four sides of that pixel, find when the Bezier curve crosses the side's line by solving a quadratic equation.
3. Among all the calculated side-crossing times, choose the earliest one that lies strictly in the future.
4. Move to the neighboring pixel across the side that was chosen.
5. Set the current time to that crossing time.
6. Repeat from step 2.
The loop runs until the time parameter exceeds one. Also note that the algorithm has severe issues with curves that exactly touch a side of a pixel; I suppose this is solvable with a special check.
Here is the main code:
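#include <algorithm>
#include <cmath>
const double EPS = 1e-9;
const double INF = 1e100;
struct Point { double a[2]; };      // a[0] = x, a[1] = y
struct Bezier { Point pts[3]; };    // quadratic curve: P0, control point P1, P2
void DrawPixel(int x, int y);       // defined in the full code (writes into the bitmap)
// Returns the smallest t > minp at which the curve coordinate
// p0*(1-t)^2 + p1*2t(1-t) + p2*t^2 equals val, or INF if there is none.
double WhenEquals(double p0, double p1, double p2, double val, double minp) {
    // Rewritten as qa*t^2 + 2*qb*t + qc = 0
    double qa = p0 + p2 - 2 * p1;
    double qb = p1 - p0;
    double qc = p0 - val;
    if (std::fabs(qa) <= EPS) {
        // Singular (linear) case: 2*qb*t + qc = 0
        if (std::fabs(qb) <= EPS)
            return INF;             // coordinate is constant, never crosses val
        double t = -qc / (2 * qb);
        return t > minp + EPS ? t : INF;
    }
    double qd = qb * qb - qa * qc;  // a quarter of the discriminant
    if (qd < -EPS)
        return INF;
    qd = std::sqrt(std::max(qd, 0.0));
    double t1 = (-qb - qd) / qa;
    double t2 = (-qb + qd) / qa;
    if (t2 < t1) std::swap(t1, t2);
    if (t1 > minp + EPS)
        return t1;
    else if (t2 > minp + EPS)
        return t2;
    return INF;
}
// Walks the curve pixel by pixel: starting from the pixel that contains P0,
// repeatedly finds the earliest future crossing of one of the four pixel
// sides and steps into the neighboring pixel across that side.
void DrawCurve(const Bezier &curve) {
    int cell[2];
    for (int c = 0; c < 2; c++)
        cell[c] = int(std::floor(curve.pts[0].a[c]));
    DrawPixel(cell[0], cell[1]);
    double param = 0.0;
    while (true) {
        int bc = -1, bs = -1;       // best coordinate (0 = x, 1 = y) and side (0 = low, 1 = high)
        double bestTime = 1.0;      // crossings after t = 1 are ignored
        for (int c = 0; c < 2; c++)
            for (int s = 0; s < 2; s++) {
                double crit = WhenEquals(
                    curve.pts[0].a[c],
                    curve.pts[1].a[c],
                    curve.pts[2].a[c],
                    cell[c] + s, param
                );
                if (crit < bestTime) {
                    bestTime = crit;
                    bc = c, bs = s;
                }
            }
        if (bc < 0)
            break;                  // no more side crossings before t = 1
        param = bestTime;
        cell[bc] += (2 * bs - 1);   // -1 across the low side, +1 across the high side
        DrawPixel(cell[0], cell[1]);
    }
}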
Full code is available here.
It uses loadbmp.h, which is available here.

Using 2.4" MCUFriend TFT LCD Display with Arduino

I am hoping that someone is familiar with the 2.4" TFT LCD display board from MCUFriend. I am having trouble using this board with my Arduino Uno, and I was hoping someone could help.
The problem I am having is that all of these colored lines are drawn on the screen after a reset and initialization. Right now all I am trying to do is fill the screen and draw a box. Here is my code:
#include <Adafruit_GFX.h>
#include <TouchScreen.h>
#include <Adafruit_TFTLCD.h>
// Control pins (this shield uses an 8-bit parallel bus, not SPI)
#define LCD_CS A3
#define LCD_CD A2
#define LCD_WR A1
#define LCD_RD A0
#define LCD_RESET A4
// Color definitions
#define BLACK 0x0000
#define WHITE 0xFFFF
#define BOXSIZE 40
Adafruit_TFTLCD tft(LCD_CS, LCD_CD, LCD_WR, LCD_RD, LCD_RESET);
void setup() {
    Serial.begin(9600);
    tft.reset();
    tft.begin();
    tft.fillScreen(BLACK);
    tft.drawRect(100, 100, BOXSIZE, BOXSIZE, WHITE);
}
void loop() {
}
This is what my screen is doing:
As you can see, the background is black, and a box is being drawn behind these colored bars.
Any help would be greatly appreciated!!
Thank you very very much!

I'm new to Arduino myself, but I have the same screen and it works perfectly. Your problem is probably that the TFT shield is shorting against the top of the Arduino's USB connector; put something non-conductive there and reset. If you're still having trouble, try removing the shield and watching each pin as you insert it to make sure they all go into the correct headers: LCD_02 should be in digital pin 2.
Here is the code I used for testing; it uses the same library. Hope that helps.
#include <Adafruit_GFX.h> // Core graphics library
#include <Adafruit_TFTLCD.h> // Hardware-specific library
// The control pins for the LCD can be assigned to any digital or
// analog pins...but we'll use the analog pins as this allows us to
// double up the pins with the touch screen (see the TFT paint example).
#define LCD_CS A3 // Chip Select goes to Analog 3
#define LCD_CD A2 // Command/Data goes to Analog 2
#define LCD_WR A1 // LCD Write goes to Analog 1
#define LCD_RD A0 // LCD Read goes to Analog 0
#define LCD_RESET A4 // Can alternately just connect to Arduino's reset pin
// When using the BREAKOUT BOARD only, use these 8 data lines to the LCD:
// For the Arduino Uno, Duemilanove, Diecimila, etc.:
// D0 connects to digital pin 8 (Notice these are
// D1 connects to digital pin 9 NOT in order!)
// D2 connects to digital pin 2
// D3 connects to digital pin 3
// D4 connects to digital pin 4
// D5 connects to digital pin 5
// D6 connects to digital pin 6
// D7 connects to digital pin 7
// For the Arduino Mega, use digital pins 22 through 29
// (on the 2-row header at the end of the board).
// Assign human-readable names to some common 16-bit color values:
#define BLACK 0x0000
#define BLUE 0x001F
#define RED 0xF800
#define GREEN 0x07E0
#define CYAN 0x07FF
#define MAGENTA 0xF81F
#define YELLOW 0xFFE0
#define WHITE 0xFFFF
Adafruit_TFTLCD tft(LCD_CS, LCD_CD, LCD_WR, LCD_RD, LCD_RESET);
void setup(void) {
    Serial.begin(9600);
    Serial.println(F("TFT LCD test"));
#ifdef USE_ADAFRUIT_SHIELD_PINOUT
    Serial.println(F("Using Adafruit 2.8\" TFT Arduino Shield Pinout"));
#else
    Serial.println(F("Using Adafruit 2.8\" TFT Breakout Board Pinout"));
#endif
    Serial.print("TFT size is "); Serial.print(tft.width()); Serial.print("x"); Serial.println(tft.height());
    tft.reset();
    uint16_t identifier = tft.readID();
    if(identifier == 0x9325) {
        Serial.println(F("Found ILI9325 LCD driver"));
    } else if(identifier == 0x9327) {
        Serial.println(F("Found ILI9327 LCD driver"));
    } else if(identifier == 0x9328) {
        Serial.println(F("Found ILI9328 LCD driver"));
    } else if(identifier == 0x7575) {
        Serial.println(F("Found HX8347G LCD driver"));
    } else if(identifier == 0x9341) {
        Serial.println(F("Found ILI9341 LCD driver"));
    } else if(identifier == 0x8357) {
        Serial.println(F("Found HX8357D LCD driver"));
    } else if(identifier == 0x0154) {
        Serial.println(F("Found S6D0154 LCD driver"));
    } else {
        Serial.print(F("Unknown LCD driver chip: "));
        Serial.println(identifier, HEX);
        Serial.println(F("If using the Adafruit 2.8\" TFT Arduino shield, the line:"));
        Serial.println(F(" #define USE_ADAFRUIT_SHIELD_PINOUT"));
        Serial.println(F("should appear in the library header (Adafruit_TFTLCD.h)."));
        Serial.println(F("If using the breakout board, it should NOT be #defined!"));
        Serial.println(F("Also if using the breakout, double-check that all wiring"));
        Serial.println(F("matches the tutorial."));
        return;
    }
    tft.begin(identifier);
    Serial.println(F("Benchmark Time (microseconds)"));
    Serial.print(F("Screen fill "));
    Serial.println(testFillScreen());
    delay(500);
    Serial.print(F("Text "));
    Serial.println(testText());
    delay(3000);
    Serial.print(F("Lines "));
    Serial.println(testLines(CYAN));
    delay(500);
    Serial.print(F("Horiz/Vert Lines "));
    Serial.println(testFastLines(RED, BLUE));
    delay(500);
    Serial.print(F("Rectangles (outline) "));
    Serial.println(testRects(GREEN));
    delay(500);
    Serial.print(F("Rectangles (filled) "));
    Serial.println(testFilledRects(YELLOW, MAGENTA));
    delay(500);
    Serial.print(F("Circles (filled) "));
    Serial.println(testFilledCircles(10, MAGENTA));
    Serial.print(F("Circles (outline) "));
    Serial.println(testCircles(10, WHITE));
    delay(500);
    Serial.print(F("Triangles (outline) "));
    Serial.println(testTriangles());
    delay(500);
    Serial.print(F("Triangles (filled) "));
    Serial.println(testFilledTriangles());
    delay(500);
    Serial.print(F("Rounded rects (outline) "));
    Serial.println(testRoundRects());
    delay(500);
    Serial.print(F("Rounded rects (filled) "));
    Serial.println(testFilledRoundRects());
    delay(500);
    Serial.println(F("Done!"));
}
void loop(void) {
    for(uint8_t rotation=0; rotation<4; rotation++) {
        tft.setRotation(rotation);
        testText();
        delay(2000);
    }
}
unsigned long testFillScreen() {
    unsigned long start = micros();
    tft.fillScreen(BLACK);
    tft.fillScreen(RED);
    tft.fillScreen(GREEN);
    tft.fillScreen(BLUE);
    tft.fillScreen(BLACK);
    return micros() - start;
}
unsigned long testText() {
    tft.fillScreen(BLACK);
    unsigned long start = micros();
    tft.setCursor(0, 0);
    tft.setTextColor(WHITE); tft.setTextSize(1);
    tft.println("Hello World!");
    tft.setTextColor(YELLOW); tft.setTextSize(2);
    tft.println(1234.56);
    tft.setTextColor(RED); tft.setTextSize(3);
    tft.println(0xDEADBEEF, HEX);
    tft.println();
    tft.setTextColor(GREEN);
    tft.setTextSize(5);
    tft.println("Groop");
    tft.setTextSize(2);
    tft.println("I implore thee,");
    tft.setTextSize(1);
    tft.println("my foonting turlingdromes.");
    tft.println("And hooptiously drangle me");
    tft.println("with crinkly bindlewurdles,");
    tft.println("Or I will rend thee");
    tft.println("in the gobberwarts");
    tft.println("with my blurglecruncheon,");
    tft.println("see if I don't!");
    return micros() - start;
}
unsigned long testLines(uint16_t color) {
    unsigned long start, t;
    int x1, y1, x2, y2,
        w = tft.width(),
        h = tft.height();
    tft.fillScreen(BLACK);
    x1 = y1 = 0;
    y2 = h - 1;
    start = micros();
    for(x2=0; x2<w; x2+=6) tft.drawLine(x1, y1, x2, y2, color);
    x2 = w - 1;
    for(y2=0; y2<h; y2+=6) tft.drawLine(x1, y1, x2, y2, color);
    t = micros() - start; // fillScreen doesn't count against timing
    tft.fillScreen(BLACK);
    x1 = w - 1;
    y1 = 0;
    y2 = h - 1;
    start = micros();
    for(x2=0; x2<w; x2+=6) tft.drawLine(x1, y1, x2, y2, color);
    x2 = 0;
    for(y2=0; y2<h; y2+=6) tft.drawLine(x1, y1, x2, y2, color);
    t += micros() - start;
    tft.fillScreen(BLACK);
    x1 = 0;
    y1 = h - 1;
    y2 = 0;
    start = micros();
    for(x2=0; x2<w; x2+=6) tft.drawLine(x1, y1, x2, y2, color);
    x2 = w - 1;
    for(y2=0; y2<h; y2+=6) tft.drawLine(x1, y1, x2, y2, color);
    t += micros() - start;
    tft.fillScreen(BLACK);
    x1 = w - 1;
    y1 = h - 1;
    y2 = 0;
    start = micros();
    for(x2=0; x2<w; x2+=6) tft.drawLine(x1, y1, x2, y2, color);
    x2 = 0;
    for(y2=0; y2<h; y2+=6) tft.drawLine(x1, y1, x2, y2, color);
    t += micros() - start; // accumulate the last pass too
    return t;              // total time for all four passes
}
unsigned long testFastLines(uint16_t color1, uint16_t color2) {
    unsigned long start;
    int x, y, w = tft.width(), h = tft.height();
    tft.fillScreen(BLACK);
    start = micros();
    for(y=0; y<h; y+=5) tft.drawFastHLine(0, y, w, color1);
    for(x=0; x<w; x+=5) tft.drawFastVLine(x, 0, h, color2);
    return micros() - start;
}
unsigned long testRects(uint16_t color) {
    unsigned long start;
    int n, i, i2,
        cx = tft.width() / 2,
        cy = tft.height() / 2;
    tft.fillScreen(BLACK);
    n = min(tft.width(), tft.height());
    start = micros();
    for(i=2; i<n; i+=6) {
        i2 = i / 2;
        tft.drawRect(cx-i2, cy-i2, i, i, color);
    }
    return micros() - start;
}
unsigned long testFilledRects(uint16_t color1, uint16_t color2) {
    unsigned long start, t = 0;
    int n, i, i2,
        cx = tft.width() / 2 - 1,
        cy = tft.height() / 2 - 1;
    tft.fillScreen(BLACK);
    n = min(tft.width(), tft.height());
    for(i=n; i>0; i-=6) {
        i2 = i / 2;
        start = micros();
        tft.fillRect(cx-i2, cy-i2, i, i, color1);
        t += micros() - start;
        // Outlines are not included in timing results
        tft.drawRect(cx-i2, cy-i2, i, i, color2);
    }
    return t;
}
unsigned long testFilledCircles(uint8_t radius, uint16_t color) {
    unsigned long start;
    int x, y, w = tft.width(), h = tft.height(), r2 = radius * 2;
    tft.fillScreen(BLACK);
    start = micros();
    for(x=radius; x<w; x+=r2) {
        for(y=radius; y<h; y+=r2) {
            tft.fillCircle(x, y, radius, color);
        }
    }
    return micros() - start;
}
unsigned long testCircles(uint8_t radius, uint16_t color) {
    unsigned long start;
    int x, y, r2 = radius * 2,
        w = tft.width() + radius,
        h = tft.height() + radius;
    // Screen is not cleared for this one -- this is
    // intentional and does not affect the reported time.
    start = micros();
    for(x=0; x<w; x+=r2) {
        for(y=0; y<h; y+=r2) {
            tft.drawCircle(x, y, radius, color);
        }
    }
    return micros() - start;
}
unsigned long testTriangles() {
    unsigned long start;
    int n, i, cx = tft.width() / 2 - 1,
        cy = tft.height() / 2 - 1;
    tft.fillScreen(BLACK);
    n = min(cx, cy);
    start = micros();
    for(i=0; i<n; i+=5) {
        tft.drawTriangle(
            cx    , cy - i, // peak
            cx - i, cy + i, // bottom left
            cx + i, cy + i, // bottom right
            tft.color565(0, 0, i));
    }
    return micros() - start;
}
unsigned long testFilledTriangles() {
    unsigned long start, t = 0;
    int i, cx = tft.width() / 2 - 1,
        cy = tft.height() / 2 - 1;
    tft.fillScreen(BLACK);
    start = micros();
    for(i=min(cx,cy); i>10; i-=5) {
        start = micros();
        tft.fillTriangle(cx, cy - i, cx - i, cy + i, cx + i, cy + i,
            tft.color565(0, i, i));
        t += micros() - start;
        tft.drawTriangle(cx, cy - i, cx - i, cy + i, cx + i, cy + i,
            tft.color565(i, i, 0));
    }
    return t;
}
unsigned long testRoundRects() {
    unsigned long start;
    int w, i, i2,
        cx = tft.width() / 2 - 1,
        cy = tft.height() / 2 - 1;
    tft.fillScreen(BLACK);
    w = min(tft.width(), tft.height());
    start = micros();
    for(i=0; i<w; i+=6) {
        i2 = i / 2;
        tft.drawRoundRect(cx-i2, cy-i2, i, i, i/8, tft.color565(i, 0, 0));
    }
    return micros() - start;
}
unsigned long testFilledRoundRects() {
    unsigned long start;
    int i, i2,
        cx = tft.width() / 2 - 1,
        cy = tft.height() / 2 - 1;
    tft.fillScreen(BLACK);
    start = micros();
    for(i=min(tft.width(), tft.height()); i>20; i-=6) {
        i2 = i / 2;
        tft.fillRoundRect(cx-i2, cy-i2, i, i, i/8, tft.color565(0, i, 0));
    }
    return micros() - start;
}

In the line:
uint16_t identifier = tft.readID();
you can hard-code your LCD driver ID instead, like this:
uint16_t identifier = 0x7575; // Change this to your LCD driver ID; check the serial output of Serial.println(identifier, HEX)
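Here is a hedged variant of this fix for when you want to keep auto-detection where it works. It accepts the ID reported by `tft.readID()` only if it is one the test sketch checks for, and otherwise falls back to a hard-coded ID. `chooseDriverId` and `kKnownIds` are hypothetical names, and the ID list is copied from the sketch's if/else chain:

```cpp
#include <cassert>
#include <cstdint>

// Driver IDs recognized by the test sketch's if/else chain.
const uint16_t kKnownIds[] = {0x9325, 0x9327, 0x9328, 0x7575, 0x9341, 0x8357, 0x0154};

// Return the reported ID if it is a known one, otherwise the hard-coded fallback.
uint16_t chooseDriverId(uint16_t reported, uint16_t fallback) {
    for (uint16_t id : kKnownIds)
        if (id == reported) return id;
    return fallback;
}
```

In the sketch this would be used as `uint16_t identifier = chooseDriverId(tft.readID(), 0x7575);` with `0x7575` replaced by whatever driver your board actually has.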
