ASCII to Integer conversion in Verilog

I have a sequence of ASCII characters arriving sequentially from a UART. I want to convert from ASCII to the represented integers. For example, I receive 123 which is { 8'h31, 8'h32, 8'h33 } and I want to convert it to 8'h7B. Can anyone provide assistance?

I assume you are asking about synthesizable RTL that converts a sequence of ASCII-coded characters into the corresponding number. If so, there are generally two ways of doing this. You can process the stream sequentially: convert each incoming character to binary, multiply the accumulated value by ten, and add the current digit (a minimal sketch of this approach is shown after the lookup-table example below). However, this method is slow. Alternatively, if you have all of the input handy, you can simply use lookup tables to convert the input "string" into a binary number. Below is an example that I sketched up some time back:
/*
* An example of converting an ASCII string into binary using lookup tables.
*
* Copyright (C) 2012 Vlad Lazarenko <vlad@lazarenko.me>
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
* SOFTWARE.
*/
// synopsys translate_off
`timescale 1 ns / 1 ps
// synopsys translate_on
/*
* Using carefully chosen minimal size for registers effectively increases fMAX.
* However, synthesis tools complain about the result being truncated. It is possible
* to use full 32-bit registers and provide false paths, but easier to just
* disable the warning.
*/
// altera message_off 10230
/*
* Convert unsigned 32-bit ASCII number representation into binary.
*
* Data bus width is 80 bit for value (up to 10 characters).
* Empty flag is 4 bit wide.
* Output latency is 4 cycles.
* fMAX on Arria II device is a little above 300 MHz.
* @ 300 MHz, the throughput is ~22.32 Gbps (or ~2.79 GBps).
*/
module ascii_to_binary(clk, reset_n, data, mod, result);
input clk;
input reset_n;
input [79:0] data;
input [3:0] mod;
output reg [31:0] result;
// Convert a single ASCII digit into a corresponding
// 4-bit binary number by subtracting 48.
function [3:0] c2i;
input [7:0] c;
reg [7:0] i;
begin
i = (c - 6'd48);
c2i = i[3:0];
end
endfunction
// Convert a single ASCII digit into a corresponding
// 4-bit binary number by subtracting 48 and multiplying
// the result. The multiplication is used to normalize
// a single-digit number depending on its position
// in the input data sequence. For example, this function
// can be used to transform a sequence of ASCII digits
// like '1', '2', '3' into digits like 100, 20, 3.
// This function could potentially use multipliers instead of
// a lookup table. However, using multipliers can reduce fMAX.
function [31:0] c2d;
input [7:0] c;
input integer m;
reg [31:0] d;
begin
case (c2i(c))
4'd0: d = 0; // "0" is always 0
4'd1: d = m; // Multiplying 1 by "m" always yields "m"
4'd2: d = m * 2;
4'd3: d = m * 3;
4'd4: d = m * 4;
4'd5: d = m * 5;
4'd6: d = m * 6;
4'd7: d = m * 7;
4'd8: d = m * 8;
4'd9: d = m * 9;
4'd10: d = 0; // Don't care (false path)
4'd11: d = 0; // Don't care (false path)
4'd12: d = 0; // Don't care (false path)
4'd13: d = 0; // Don't care (false path)
4'd14: d = 0; // Don't care (false path)
4'd15: d = 0; // Don't care (false path)
endcase
c2d = d[31:0];
end
endfunction
// Stage 1 registers. Each word holds a single converted
// and adjusted/normalized digit.
reg [31:0] m9;
reg [31:0] m8;
reg [26:0] m7;
reg [23:0] m6;
reg [19:0] m5;
reg [16:0] m4;
reg [13:0] m3;
reg [9:0] m2;
reg [6:0] m1;
reg [3:0] m0;
// Stage 2 sum registers.
reg [31:0] s0;
reg [31:0] s1;
reg [31:0] s2;
reg [31:0] s3;
// Stage 3 sum registers.
reg [31:0] s4;
reg [31:0] s5;
always @ (posedge clk or negedge reset_n) begin
if (!reset_n) begin
m0 <= 0;
m1 <= 0;
m2 <= 0;
m3 <= 0;
m4 <= 0;
m5 <= 0;
m6 <= 0;
m7 <= 0;
m8 <= 0;
m9 <= 0;
s0 <= 0;
s1 <= 0;
s2 <= 0;
s3 <= 0;
s4 <= 0;
s5 <= 0;
result <= 0;
end else begin
/*
* Pipeline stage #1: Convert every ASCII character into a binary
* number, normalize every number depending on the word position
* and valid input data length. For example:
* - '1', '2' turns into 10 and 2.
* - '1', '2', '3' turns into 100, 20 and 3.
* - '1', '2', '3', '4' turns into 1000, 200, 30 and 4
*/
/*
* Empty signal is 4 bits wide, but its valid range is from 0 to 9.
* When the MSB of the empty signal is low, only 3 bits are compared using
* a full case. Otherwise, the LSB is checked to differentiate between
* 8 and 9 (4'b1000 and 4'b1001).
*/
if (mod[3:3] == 1'b0) begin
case (mod[2:0])
3'd0: begin
m9 <= c2d(data[79:72], 1000000000);
m8 <= c2d(data[71:64], 100000000);
m7 <= c2d(data[63:56], 10000000);
m6 <= c2d(data[55:48], 1000000);
m5 <= c2d(data[47:40], 100000);
m4 <= c2d(data[39:32], 10000);
m3 <= c2d(data[31:24], 1000);
m2 <= c2d(data[23:16], 100);
m1 <= c2d(data[15:8], 10);
m0 <= c2i(data[7:0]);
end
3'd1: begin
m9 <= c2d(data[79:72], 100000000);
m8 <= c2d(data[71:64], 10000000);
m7 <= c2d(data[63:56], 1000000);
m6 <= c2d(data[55:48], 100000);
m5 <= c2d(data[47:40], 10000);
m4 <= c2d(data[39:32], 1000);
m3 <= c2d(data[31:24], 100);
m2 <= c2d(data[23:16], 10);
m1 <= c2i(data[15:8]);
m0 <= 0;
end
3'd2: begin
m9 <= c2d(data[79:72], 10000000);
m8 <= c2d(data[71:64], 1000000);
m7 <= c2d(data[63:56], 100000);
m6 <= c2d(data[55:48], 10000);
m5 <= c2d(data[47:40], 1000);
m4 <= c2d(data[39:32], 100);
m3 <= c2d(data[31:24], 10);
m2 <= c2i(data[23:16]);
m1 <= 0;
m0 <= 0;
end
3'd3: begin
m9 <= c2d(data[79:72], 1000000);
m8 <= c2d(data[71:64], 100000);
m7 <= c2d(data[63:56], 10000);
m6 <= c2d(data[55:48], 1000);
m5 <= c2d(data[47:40], 100);
m4 <= c2d(data[39:32], 10);
m3 <= c2i(data[31:24]);
m2 <= 0;
m1 <= 0;
m0 <= 0;
end
3'd4: begin
m9 <= c2d(data[79:72], 100000);
m8 <= c2d(data[71:64], 10000);
m7 <= c2d(data[63:56], 1000);
m6 <= c2d(data[55:48], 100);
m5 <= c2d(data[47:40], 10);
m4 <= c2i(data[39:32]);
m3 <= 0;
m2 <= 0;
m1 <= 0;
m0 <= 0;
end
3'd5: begin
m9 <= c2d(data[79:72], 10000);
m8 <= c2d(data[71:64], 1000);
m7 <= c2d(data[63:56], 100);
m6 <= c2d(data[55:48], 10);
m5 <= c2i(data[47:40]);
m4 <= 0;
m3 <= 0;
m2 <= 0;
m1 <= 0;
m0 <= 0;
end
3'd6: begin
m9 <= c2d(data[79:72], 1000);
m8 <= c2d(data[71:64], 100);
m7 <= c2d(data[63:56], 10);
m6 <= c2i(data[55:48]);
m5 <= 0;
m4 <= 0;
m3 <= 0;
m2 <= 0;
m1 <= 0;
m0 <= 0;
end
3'd7: begin
m9 <= c2d(data[79:72], 100);
m8 <= c2d(data[71:64], 10);
m7 <= c2i(data[63:56]);
m6 <= 0;
m5 <= 0;
m4 <= 0;
m3 <= 0;
m2 <= 0;
m1 <= 0;
m0 <= 0;
end
endcase
end else begin
case (mod[0:0])
1'b0: begin
m9 <= c2d(data[79:72], 10);
m8 <= c2i(data[71:64]);
end
1'b1: begin
m9 <= c2i(data[79:72]);
m8 <= 0;
end
endcase
m7 <= 0;
m6 <= 0;
m5 <= 0;
m4 <= 0;
m3 <= 0;
m2 <= 0;
m1 <= 0;
m0 <= 0;
end
// Pipeline stage #2: Sum up numbers from the previous step.
s0 <= m9 + m0;
s1 <= m8 + m1;
s2 <= m7 + (m2 + m3);
s3 <= m6 + (m4 + m5);
// Pipeline stage #3: Sum previous partial sums.
s4 <= (s0 + s1);
s5 <= (s2 + s3);
// Last pipeline stage #4: Sum previous partial sums.
// This yields a 32-bit result.
result <= (s4 + s5);
end
end
endmodule
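For completeness, here is a minimal sketch of the sequential multiply-by-ten approach mentioned at the top of this answer. It is only an illustration, not part of the module above: the start/char_valid/char_in handshake names are assumptions, and there is no handling of non-digit characters or overflow.
module ascii_to_binary_seq (
    input  wire        clk,
    input  wire        reset_n,
    input  wire        start,       // pulse to clear the accumulator before a new number
    input  wire        char_valid,  // high when char_in holds the next ASCII digit
    input  wire [7:0]  char_in,     // expected to be '0'..'9'
    output reg  [31:0] value        // running binary value
);
    always @(posedge clk or negedge reset_n) begin
        if (!reset_n)
            value <= 32'd0;
        else if (start)
            value <= 32'd0;
        else if (char_valid)
            // value = value * 10 + digit; 10x = 8x + 2x, so shifts and adds are enough
            value <= (value << 3) + (value << 1) + (char_in - 8'h30);
    end
endmodule
Feeding it '1', '2', '3' over three cycles leaves value at 123 (8'h7B), which matches the example in the question.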
You can find more details, along with (synthesizable) implementations of both methods, a test bench, waveforms, and even some software examples, in my ASCII Horror — String to Binary Conversion article.
Hope it helps. Good Luck!

If your compiler supports SystemVerilog, you can use the atoi string method:
str.atoi() returns the integer corresponding to the ASCII decimal representation in str. For example:
string str = "123";
int i = str.atoi(); // assigns 123 to i.
Otherwise, you need to write your own atoi function using a method similar to Ross's suggestion.
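If you do go that route, a minimal sketch of such a function could look like the following (this assumes the string holds only decimal digit characters, with no sign or error handling, and my_atoi is just an illustrative name):
function automatic int my_atoi(string s);
    int result = 0;
    for (int i = 0; i < s.len(); i++)
        result = result * 10 + (s[i] - "0"); // s[i] is the ASCII code of character i
    return result;
endfunction
my_atoi("123") then returns 123, the same result as str.atoi().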

Here is a small example of code.
First, an example of creating a byte dynamic array from a string. The dynamic array of bytes contains the ASCII code of each character. The advantage is that a dynamic array of bytes can be randomized, whereas strings cannot be randomized. It is created with something like:
string stringvar = "This is an example text";
rand byte byte_din_array[];
for (int i = 0; i < stringvar.len(); i++) begin
    byte_din_array = {byte_din_array, stringvar[i]};
    // stringvar[i] returns an empty byte if the index is beyond the string length.
    // The advantage of using stringvar[i] instead of stringvar.atoi() is that
    // the string can contain any ASCII characters, not just digits.
    // The disadvantage is that each byte holds the ASCII code of the
    // character, which is not human readable.
end
Here is an example of converting the dynamic array of bytes back into a concatenated string.
You may have partly randomized the previous dynamic array (with constraints) inside an xfer, or changed it in post_randomize.
function string convert_byte_array2string(byte stringdescriptionholder[]);
    automatic string temp_str = "";
    automatic byte byte_temp;
    automatic string str_test;
    for (int unsigned i = 0; i < stringdescriptionholder.size(); i++) begin
        byte_temp = stringdescriptionholder[i];
        str_test = string'(byte_temp); // the string cast converts the numeric ASCII representation into a one-character string
        temp_str = {temp_str, str_test};
    end
    return temp_str;
endfunction
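A quick usage sketch of the function above, building the byte array from a string first (the names greet, msg and s are only for illustration):
string greet = "Hi!";
byte msg[];
string s;
initial begin
    for (int i = 0; i < greet.len(); i++)
        msg = {msg, greet[i]};          // same construction as in the first example
    s = convert_byte_array2string(msg); // s now holds "Hi!" again
end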
If you want more information about strings, I recommend reading Section 3.7 of the SystemVerilog 3.1a Language Reference Manual (LRM), Accellera's Extensions to Verilog.
It covers the string data type and explains the built-in methods used with it.
You can also find this information in Section 6.16 of the IEEE Standard for SystemVerilog—Unified Hardware Design, Specification, and Verification Language (IEEE Std 1800-2012), which is probably more detailed than the LRM.

Related

Overflow detection in unsigned division algorithm

I have a question about the 64/32-bit division algorithm as it appears in Hacker's Delight, Chapter 9-4 Unsigned Long Division, Figure 9-3 ("divlu"). Online it can be seen here; I copy-pasted it as follows:
unsigned divlu2(unsigned u1, unsigned u0, unsigned v,
unsigned *r) {
const unsigned b = 65536; // Number base (16 bits).
unsigned un1, un0, // Norm. dividend LSD's.
vn1, vn0, // Norm. divisor digits.
q1, q0, // Quotient digits.
un32, un21, un10,// Dividend digit pairs.
rhat; // A remainder.
int s; // Shift amount for norm.
if (u1 >= v) { // If overflow, set rem.
if (r != NULL) // to an impossible value,
*r = 0xFFFFFFFF; // and return the largest
return 0xFFFFFFFF;} // possible quotient.
s = nlz(v); // 0 <= s <= 31.
v = v << s; // Normalize divisor.
vn1 = v >> 16; // Break divisor up into
vn0 = v & 0xFFFF; // two 16-bit digits.
un32 = (u1 << s) | (u0 >> 32 - s) & (-s >> 31);
un10 = u0 << s; // Shift dividend left.
un1 = un10 >> 16; // Break right half of
un0 = un10 & 0xFFFF; // dividend into two digits.
q1 = un32/vn1; // Compute the first
rhat = un32 - q1*vn1; // quotient digit, q1.
again1:
if (q1 >= b || q1*vn0 > b*rhat + un1) {
q1 = q1 - 1;
rhat = rhat + vn1;
if (rhat < b) goto again1;}
un21 = un32*b + un1 - q1*v; // Multiply and subtract.
q0 = un21/vn1; // Compute the second
rhat = un21 - q0*vn1; // quotient digit, q0.
again2:
if (q0 >= b || q0*vn0 > b*rhat + un0) {
q0 = q0 - 1;
rhat = rhat + vn1;
if (rhat < b) goto again2;}
if (r != NULL) // If remainder is wanted,
*r = (un21*b + un0 - q0*v) >> s; // return it.
return q1*b + q0;
}
Specifically, I'm interested in the bounds of the variable un21. How large can it be? Somewhat surprisingly, it can be larger than v, but by how much?
In other words, under again2 there is the test q0 >= b. If I wanted to know whether the division (q0 = un21/vn1) eventually overflows, is it enough to test (un21 >> 16) == vn1, or does it have to read (un21 >> 16) >= vn1, instead of q0 >= b?
The idea is to know in advance, prior to calculating the quotient, whether the division overflows or not.

How to give input matrix in Verilog code?

I am writing Verilog code for inserting values into a 4x4 matrix.
I need to collect 16 inputs, one for each element of the 4x4 matrix. How can I do that?
reg [15:0]fun_out;
integer x, y;
always @(posedge clk or negedge rst_n) begin
if (~rst_n) begin
for (x = 0; x < 4; x = x + 1) begin
for (y = 0; y < 4; y = y + 1) begin
data[0][0] <= fun_out[0];
data[0][1] <= fun_out[1];
data[0][2] <= fun_out[2];
data[0][3] <= fun_out[3];
data[1][0] <= fun_out[4];
data[2][0] <= fun_out[5];
........
........
data[4][3] <= fun_out[14];
data[4][4] <= fun_out[15];
end
end
end
else begin
data[x][y]<=4'b0;
end
end
Taking into account that NUM_WIDTH is the number of bits per value inside the matrix:
localparam NUM_WIDTH = 4;
localparam NUM_X = 4;
localparam NUM_Y = 4;
We can use "+:" to index a chunk of an array. With that, we can pack and assign the data in every row of the matrix as:
genvar I, J;
wire [(NUM_WIDTH*NUM_X)-1:0] matrix [NUM_Y-1:0];
generate
for(I = 0; I < 4; I = I + 1) begin: matrix_gen_y
for(J = 0; J < 4; J = J + 1) begin: matrix_gen_x
assign matrix[I][(J*NUM_WIDTH)+:NUM_WIDTH] = (I*NUM_X)+J+1; //..from 1 to 16
end
end
endgenerate
So, in order to index a value inside the matrix with "x" and "y" indexes:
assign value = matrix[y_idx][(x_idx*NUM_WIDTH)+:NUM_WIDTH];
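To address the sequential-collection part of the question, here is a rough sketch of capturing 16 values one at a time into a register matrix. It assumes each element is NUM_WIDTH bits wide, that one element arrives per clock, and that din/din_valid are hypothetical input signals; elements are stored in row-major order.
reg [NUM_WIDTH-1:0] data [0:NUM_Y-1][0:NUM_X-1];
reg [3:0] wr_cnt; // counts the 16 incoming elements

always @(posedge clk or negedge rst_n) begin
    if (!rst_n)
        wr_cnt <= 4'd0;
    else if (din_valid) begin
        data[wr_cnt[3:2]][wr_cnt[1:0]] <= din; // row = upper two bits, column = lower two bits
        wr_cnt <= wr_cnt + 4'd1;               // wraps after 16 elements
    end
end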

How to perform division of two Q15 values in Verilog, without using the '/' (division) operator?

The division operator (/) is expensive on an FPGA. Is it possible to perform division of two Q15 format numbers (16-bit fixed-point numbers) with basic shift operations?
Could someone help me by providing an example?
Thanks in advance!
Fixed-point arithmetic is just integer arithmetic with a bit of scaling thrown in. Q15 is a purely fractional format stored as a signed 16-bit integer with a scale factor of 2^15, able to represent values in the interval [-1, 1). Clearly, division only makes sense in Q15 when the divisor's magnitude exceeds the dividend's magnitude, as otherwise the quotient's magnitude exceeds the representable range.
Before embarking on a custom Verilog implementation of fixed-point division, you will want to check your FPGA vendor's library offerings, as a fixed-point library including pipelined division is often available. There are also open source projects that may be relevant, such as this one.
When using integer division operators for fixed-point division, we need to adjust for the fact that the division removes the scale factor, i.e. (a * 2^scale) / (b * 2^scale) = a/b, while the correct fixed-point result is (a/b) * 2^scale. This is easily fixed by pre-multiplying the dividend by 2^scale, as in the following C implementation:
int16_t div_q15 (int16_t dividend, int16_t divisor)
{
return (int16_t)(((int32_t)dividend << 15) / (int32_t)divisor);
}
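The same pre-scaling idea can be written directly in SystemVerilog as a behavioral reference model. This sketch still uses the '/' operator, so it is only useful in simulation (for example, to check a shift-based implementation against), not as an answer to the hardware question itself:
function automatic logic signed [15:0] q15_div_ref
    (input logic signed [15:0] dividend,
     input logic signed [15:0] divisor);
    logic signed [31:0] num;
    num = dividend;            // sign-extend to 32 bits
    num = num <<< 15;          // pre-multiply by 2^15 (the Q15 scale factor)
    return 16'(num / divisor); // signed '/' truncates toward zero
endfunction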
Wikipedia gives a reasonable overview of how to implement binary division on a bit-by-bit basis using add, subtract, and shift operations. These methods are closely related to the longhand division taught in grade school. For FPGAs, the use of the non-restoring method is often preferred, as pointed out by this paper, for example:
Nikolay Sorokin, "Implementation of high-speed fixed-point dividers on FPGA". Journal of Computer Science & Technology, Vol. 6, No. 1, April 2006, pp. 8-11.
Here is C code that shows how the non-restoring method may be used for the division of 16-bit two's-complement operands:
/* bit-wise non-restoring two's complement division */
void int16_div (int16_t dividend, int16_t divisor, int16_t *quot, int16_t *rem)
{
const int operand_bits = (int) (sizeof (int16_t) * CHAR_BIT);
uint16_t d = (uint16_t)divisor;
uint16_t nd = 0 - d; /* -divisor */
uint16_t r, q = 0; /* remainder, quotient */
uint32_t dd = (uint32_t)d << operand_bits; /* expanded divisor */
uint32_t pp = dividend; /* partial remainder */
int i;
for (i = operand_bits - 1; i >= 0; i--) {
if ((int32_t)(pp ^ dd) < 0) {
q = (q << 1) + 0; /* record quotient bit -1 (as 0) */
pp = (pp << 1) + dd;
} else {
q = (q << 1) + 1; /* record quotient bit +1 (as 1) */
pp = (pp << 1) - dd;
}
}
/* convert quotient from digit set {-1,1} to plain two's complement */
q = (q << 1) + 1;
/* remainder is upper half of partial remainder */
r = (uint16_t)(pp >> operand_bits);
/* fix up cases where we worked past a partial remainder of zero */
if (r == d) { /* remainder equal to divisor */
q = q + 1;
r = 0;
} else if (r == nd) { /* remainder equal to -divisor */
q = q - 1;
r = 0;
}
/* for truncating division, remainder must have same sign as dividend */
if (r && ((int16_t)(dividend ^ r) < 0)) {
if ((int16_t)q < 0) {
q = q + 1;
r = r - d;
} else {
q = q - 1;
r = r + d;
}
}
*quot = (int16_t)q;
*rem = (int16_t)r;
}
Note that there are multiple ways of dealing with the various special cases that arise in non-restoring division. For example, one frequently sees code that detects a zero partial remainder pp and exits the loop over the quotient bits early in this case. Here I assume that an FPGA implementation would unroll the loop completely to create a pipelined implementation, in which case early termination is not helpful. Instead, a final correction is applied to those quotients that are affected by ignoring a partial remainder of zero.
In order to create a Q15 division from the above, we have to make just a single change: incorporating the up-scaling of the dividend. Instead of:
uint32_t pp = dividend; /* partial remainder */
we now use this:
uint32_t pp = dividend << 15; /* partial remainder; incorporate Q15 scaling */
The resulting C code (sorry, I won't provide ready-to-use Verilog code) including the test framework is:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <limits.h>
#include <math.h>
/* bit-wise non-restoring two's complement division */
void q15_div (int16_t dividend, int16_t divisor, int16_t *quot, int16_t *rem)
{
const int operand_bits = (int) (sizeof (int16_t) * CHAR_BIT);
uint16_t d = (uint16_t)divisor;
uint16_t nd = 0 - d; /* -divisor */
uint16_t r, q = 0; /* remainder, quotient */
uint32_t dd = (uint32_t)d << operand_bits; /* expanded divisor */
uint32_t pp = dividend << 15; /* partial remainder, incorporate Q15 scaling */
int i;
for (i = operand_bits - 1; i >= 0; i--) {
if ((int32_t)(pp ^ dd) < 0) {
q = (q << 1) + 0; /* record quotient bit -1 (as 0) */
pp = (pp << 1) + dd;
} else {
q = (q << 1) + 1; /* record quotient bit +1 (as 1) */
pp = (pp << 1) - dd;
}
}
/* convert quotient from digit set {-1,1} to plain two's complement */
q = (q << 1) + 1;
/* remainder is upper half of partial remainder */
r = (uint16_t)(pp >> operand_bits);
/* fix up cases where we worked past a partial remainder of zero */
if (r == d) { /* remainder equal to divisor */
q = q + 1;
r = 0;
} else if (r == nd) { /* remainder equal to -divisor */
q = q - 1;
r = 0;
}
/* for truncating division, remainder must have same sign as dividend */
if (r && ((int16_t)(dividend ^ r) < 0)) {
if ((int16_t)q < 0) {
q = q + 1;
r = r - d;
} else {
q = q - 1;
r = r + d;
}
}
*quot = (int16_t)q;
*rem = (int16_t)r;
}
int main (void)
{
uint16_t dividend, divisor, ref_q, res_q, res_r;
double quot, fxscale = (1 << 15);
dividend = 0;
do {
printf ("\r%04x", dividend);
divisor = 1;
do {
quot = trunc (fxscale * (int16_t)dividend / (int16_t)divisor);
/* Q15 can only represent numbers in [-1, 1) */
if ((quot >= -1.0) && (quot < 1.0)) {
ref_q = (int16_t)((((int32_t)(int16_t)dividend) << 15) /
((int32_t)(int16_t)divisor));
q15_div ((int16_t)dividend, (int16_t)divisor,
(int16_t *)&res_q, (int16_t *)&res_r);
if (res_q != ref_q) {
printf ("!r dividend=%04x (%f) divisor=%04x (%f) res=%04x (%f) ref=%04x (%f)\n",
dividend, (int16_t)dividend / fxscale,
divisor, (int16_t)divisor / fxscale,
res_q, (int16_t)res_q / fxscale,
ref_q, (int16_t)ref_q / fxscale);
}
}
divisor++;
} while (divisor);
dividend++;
} while (dividend);
return EXIT_SUCCESS;
}
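Not ready-to-use hardware, but to give a rough idea of how the core recurrence maps to RTL, here is a purely combinational SystemVerilog sketch of just the unrolled loop from int16_div above. The zero-remainder and sign corrections described earlier would still have to be applied to q_raw and r_raw, and for an FPGA implementation you would insert pipeline registers between iterations:
module nr_div16_core (
    input  wire signed [15:0] dividend,
    input  wire signed [15:0] divisor,
    output logic       [15:0] q_raw,  // quotient before the correction steps
    output logic       [15:0] r_raw   // remainder before the correction steps
);
    always_comb begin : core
        logic [31:0] pp; // partial remainder
        logic [31:0] dd; // divisor aligned to the upper half
        pp    = {{16{dividend[15]}}, dividend};
        dd    = {divisor, 16'b0};
        q_raw = '0;
        for (int i = 15; i >= 0; i--) begin
            if (pp[31] ^ dd[31]) begin     // signs differ: quotient digit -1 (recorded as 0)
                q_raw = {q_raw[14:0], 1'b0};
                pp    = (pp << 1) + dd;
            end else begin                 // signs match: quotient digit +1 (recorded as 1)
                q_raw = {q_raw[14:0], 1'b1};
                pp    = (pp << 1) - dd;
            end
        end
        q_raw = {q_raw[14:0], 1'b1};       // convert digit set {-1,+1} to two's complement
        r_raw = pp[31:16];                 // remainder is the upper half of pp
    end
endmodule
For the Q15 variant, pp would instead be initialized to the sign-extended dividend shifted left by 15, mirroring the single change made in the C code.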

For loop implementation in SystemVerilog

I'm just learning HDL and I'm interested in how a for loop is implemented in SystemVerilog.
With the following code...
always_ff @(posedge clk)
begin
for(int i = 0; i < 32; i++) s[i] = a[i] + b[i];
end
Will I end up with 32 adders in the logic, all operating simultaneously? Or are the additions performed sequentially somehow?
Thanks
Boscoe
Loops which can be statically unrolled (as in your example) can be synthesised.
The example you gave has to execute in a single clock cycle, so there is nothing sequential about the hardware generated.
Your example:
always_ff @(posedge clk) begin
for(int i = 0; i < 32; i++) begin
s[i] <= a[i] + b[i];
end
end
Is just (32 parallel adders):
always_ff @(posedge clk) begin
s[0] <= a[0] + b[0];
s[1] <= a[1] + b[1];
s[2] <= a[2] + b[2];
//...
end
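An equivalent way to make the replication explicit is a generate loop. This sketch assumes s, a and b are unpacked arrays, as in the code above:
genvar i;
generate
    for (i = 0; i < 32; i = i + 1) begin : g_adders
        always_ff @(posedge clk)
            s[i] <= a[i] + b[i];
    end
endgenerate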

How do I apply the DCT values to the formula?

I am implementing an algorithm for face detection described in a paper I have found. At the end of the paper, it uses the DCT values to remove false alarms by applying some formulas. One of them is the following:
My question is: I have calculated the DCT values for MxN; now how do I apply them to the formula?
EDIT: Is this what you mean? (The 0 to 7 inner loops run over a random part of the 100x100 dct1 array, which holds only the Y DCT; Cb and Cr are not needed for the algorithm.)
for (i = 0; i <= M * N * (4 - 1); i++) {
    for (m = 0; m <= 7; m++) {
        for (n = 0; n <= 7; n++) {
            value += std::pow(dct1.at<float>(m, n), 2);
        }
    }
}
