I'm writing a NES/Famicom emulator. I register a callback function that is called every time a pixel is rendered, which means the callback runs roughly 3.7 million times per second (256 width * 240 height * 60 fps).
In my callback function there are many array/slice operations, and I found that Go does bounds checking every time I index an element. But the indexes are results of bitwise AND operations, so I can tell that they will never exceed either bound.
So, I'm here to ask if there is a way to disable bounds checking?
Thank you.
You can disable bounds checking with a compiler flag passed through gcflags:
go build -gcflags=-B .
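If you first want to see which indexing operations actually still carry a bounds check (so you can judge whether disabling them globally is worth it), the compiler can report its bounds-check-elimination decisions. The flag below is how I recall it for recent Go versions, so verify it against your toolchain:
go build -gcflags="-d=ssa/check_bce/debug=1" .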
If you really need to avoid the bounds check, you can use the unsafe package and use C-style pointer arithmetic to perform your lookups:
index := 2
size := unsafe.Sizeof(YourStruct{})
p := unsafe.Pointer(&yourStructSlice[0])
indexp := (unsafe.Pointer)(uintptr(p) + size*uintptr(index))
yourStructPtr := (*YourStruct)(indexp)
https://play.golang.org/p/GDNphKsJPOv
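For reference, here is a self-contained sketch of the same idea; YourStruct and the slice contents are made up purely for illustration:

package main

import (
    "fmt"
    "unsafe"
)

type YourStruct struct {
    A, B int64
}

func main() {
    yourStructSlice := []YourStruct{{1, 2}, {3, 4}, {5, 6}}

    index := 2
    size := unsafe.Sizeof(YourStruct{})
    p := unsafe.Pointer(&yourStructSlice[0])

    // No bounds check here: we compute the address ourselves,
    // so keeping index in range is entirely our responsibility.
    indexp := unsafe.Pointer(uintptr(p) + size*uintptr(index))
    yourStructPtr := (*YourStruct)(indexp)

    fmt.Println(*yourStructPtr) // {5 6}
}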
You should time it to determine how much CPU time you are actually saving by doing this; it is probably possible to make the lookups faster with this approach, but measure before committing to it.
Also, you may want to have a look at the actual generated instructions to make sure that what you are outputting is actually more efficient. Doing lookups without bounds checks may well be more trouble than it's worth. Some info on how to do that here: https://github.com/teh-cmc/go-internals/blob/master/chapter1_assembly_primer/README.md
Another common approach is to write performance critical code in assembly (see https://golang.org/doc/asm). Ain't no automatic bounds checking in asm :)
The XY Problem
The XY problem is asking about your attempted solution rather than your actual problem.
Your real problem is overall performance. Let's see some benchmarks to show whether bounds checking is a significant problem. It may not be: in the benchmark below, the difference is less than one millisecond per second of emulated frames.
Bounds check:
BenchmarkPixels-4 300 4034580 ns/op
No bounds check:
BenchmarkPixels-4 500 3150985 ns/op
bounds_test.go:
package main

import (
    "testing"
)

const (
    width  = 256
    height = 240
    frames = 60
)

var pixels [width * height]byte

func writePixel(w, h int) {
    pixels[w*height+h] = 42
}

func BenchmarkPixels(b *testing.B) {
    for N := 0; N < b.N; N++ {
        for f := 0; f < frames; f++ {
            for w := 0; w < width; w++ {
                for h := 0; h < height; h++ {
                    writePixel(w, h)
                }
            }
        }
    }
}
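(The "No bounds check" numbers were presumably obtained by compiling the same benchmark with checks disabled, e.g. with go test -gcflags=-B -bench=BenchmarkPixels.)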
Using only atomic operations, how can I implement the following code?
const Max = 8

var index int

func add() int {
    index++
    if index >= Max {
        index = 0
    }
    return index
}
For example, something like this:
func add() int {
    atomic.AddUint32(&index, 1)
    // error: race condition
    atomic.CompareAndSwapUint32(&index, Max, 0)
    return index
}
But that is wrong: there is a race condition. Can this be implemented without using a lock?
Solving it without loops and locks
A simple implementation may look like this:
const Max = 8

var index int64

func Inc() int64 {
    value := atomic.AddInt64(&index, 1)
    if value < Max {
        return value // We're done
    }

    // Must normalize, optionally reset:
    value %= Max
    if value == 0 {
        atomic.AddInt64(&index, -Max) // Reset
    }
    return value
}
How does it work?
It simply adds 1 to the counter; atomic.AddInt64() returns the new value. If it's less than Max, "we're done", we can return it.
If it's greater than or equal to Max, then we have to normalize the value (make sure it's in the range [0..Max)) and reset the counter.
Reset may only be done by a single caller (a single goroutine), which will be selected by the counter's value. The winner will be the one that caused the counter to reach Max.
And the trick to avoid the need of locks is to reset it by adding -Max, not by setting it to 0. Since the counter's value is normalized, it won't cause any problems if other goroutines are calling it and incrementing it concurrently.
Of course, with many goroutines calling Inc() concurrently, the counter may be incremented more than Max times before the goroutine that ought to reset it can actually carry out the reset, which would cause the counter to reach or exceed 2 * Max or even 3 * Max (in general: n * Max). So we handle this by using the value % Max == 0 condition to decide whether a reset should happen, which again will only be true in a single goroutine for each possible value of n.
Simplification
Note that the normalization does not change values already in the range [0..Max), so you may opt to always perform the normalization. If you want to, you may simplify it to this:
func Inc() int64 {
    value := atomic.AddInt64(&index, 1) % Max
    if value == 0 {
        atomic.AddInt64(&index, -Max) // Reset
    }
    return value
}
Reading the counter without incrementing it
The index variable should not be accessed directly. If there's a need to read the counter's current value without incrementing it, the following function may be used:
func Get() int64 {
    return atomic.LoadInt64(&index) % Max
}
Extreme scenario
Let's analyze an "extreme" scenario. In this, Inc() is called 7 times, returning the numbers 1..7. Now the next call to Inc() after the increment will see that the counter is at 8 = Max. It will then normalize the value to 0 and wants to reset the counter. Now let's say before the reset (which is to add -8) is actually executed, 8 other calls happen. They will increment the counter 8 times, and the last one will again see that the counter's value is 16 = 2 * Max. All the calls will normalize the values into the range 0..7, and the last call will again go on to perform a reset. Let's say this reset is again delayed (e.g. for scheduling reasons), and yet another 8 calls come in. For the last, the counter's value will be 24 = 3 * Max, the last call again will go on to perform a reset.
Note that all calls will only return values in the range [0..Max). Once all reset operations are executed, the counter's value will be 0, properly, because it had a value of 24 and there were 3 "pending" reset operations. In practice there's only a slight chance for this to happen, but this solution handles it nicely and efficiently.
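If you want to convince yourself of that, a quick test like the following (goroutine and iteration counts are arbitrary, and it reuses the simplified Inc() from above) hammers the counter from several goroutines and checks that every returned value stays in [0..Max):

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

const Max = 8

var index int64

func Inc() int64 {
    value := atomic.AddInt64(&index, 1) % Max
    if value == 0 {
        atomic.AddInt64(&index, -Max) // Reset
    }
    return value
}

func main() {
    var wg sync.WaitGroup
    for g := 0; g < 4; g++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := 0; i < 100000; i++ {
                if v := Inc(); v < 0 || v >= Max {
                    panic(fmt.Sprintf("out of range: %d", v))
                }
            }
        }()
    }
    wg.Wait()
    fmt.Println("all values stayed in [0..Max)")
}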
I assume your goal is to never let index have a value equal to or greater than Max. This can be solved using a CAS (Compare-And-Swap) loop:
const Max = 8

var index int32

func add() int32 {
    var next int32
    for {
        prev := atomic.LoadInt32(&index)
        next = prev + 1
        if next >= Max {
            next = 0
        }
        if atomic.CompareAndSwapInt32(&index, prev, next) {
            break
        }
    }
    return next
}
CAS can be used to implement almost any operation atomically like this. The algorithm is:
Load the value
Perform the desired operation
Use CAS, goto 1 on failure.
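As a further illustration of that recipe (this sketch is mine, not from the question, and assumes the sync/atomic import), the same pattern can implement an atomic "store the maximum" operation:

// AtomicMaxInt64 atomically updates *addr to max(*addr, x)
// using the load / compute / CAS-retry recipe above.
func AtomicMaxInt64(addr *int64, x int64) {
    for {
        prev := atomic.LoadInt64(addr) // 1. load the value
        if prev >= x {
            return // already at least x, nothing to do
        }
        // 2. the desired operation: the new value would be x
        // 3. CAS; if another goroutine changed *addr meanwhile, retry
        if atomic.CompareAndSwapInt64(addr, prev, x) {
            return
        }
    }
}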
So, I have a piece of code that is concurrent and is meant to run on each CPU/core.
There are two large vectors with input/output values
var (
    input  = make([]float64, rowCount)
    output = make([]float64, rowCount)
)
These are filled, and I want to compute the distance (error) between each input-output pair. Since the pairs are independent, a possible concurrent version is the following:
var d float64 // Error to be computed

// Setup a worker "for each CPU"
ch := make(chan float64)
nw := runtime.NumCPU()
for w := 0; w < nw; w++ {
    go func(id int) {
        var wd float64
        // e.g. nw = 4
        // worker0, i = 0, 4, 8, 12...
        // worker1, i = 1, 5, 9, 13...
        // worker2, i = 2, 6, 10, 14...
        // worker3, i = 3, 7, 11, 15...
        for i := id; i < rowCount; i += nw {
            res := compute(input[i])
            wd += distance(res, output[i])
        }
        ch <- wd
    }(w)
}

// Compute total distance
for w := 0; w < nw; w++ {
    d += <-ch
}
The idea is to have a single worker for each CPU/core, and each worker processes a subset of the rows.
The problem I'm having is that this code is no faster than the serial code.
Now, I'm using Go 1.7, so runtime.GOMAXPROCS should already be set to runtime.NumCPU(), but even setting it explicitly does not improve performance.
distance is just (a-b)*(a-b);
compute is a bit more complex, but should be reentrant and use global data only for reading (and uses math.Pow and math.Sqrt functions);
no other goroutine is running.
So, besides accessing the global data (input/output) for reading, there are no locks/mutexes that I am aware of (not using math/rand, for example).
I also compiled with -race and nothing emerged.
My host has 4 virtual cores, but when I run this code I see (using htop) CPU usage of about 102%, whereas I expected something around 380%, as happened in the past with other Go code that used all the cores.
I would like to investigate, but I don't know how the runtime allocates threads and schedules goroutines.
How can I debug this kind of issue? Can pprof help me in this case? What about the runtime package?
Thanks in advance
Sorry, but in the end I got the measurement wrong. @JimB was right, and I had a minor leak, but not so much as to justify a slowdown of this magnitude.
My expectations were too high: the function I was making concurrent was called only at the beginning of the program, therefore the performance improvement was just minor.
After applying the pattern to other sections of the program, I got the expected results. My mistake was in evaluating which section was the most important.
Anyway, I learned a lot of interesting things meanwhile, so thanks a lot to all the people trying to help!
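As for the pprof part of the original question: a minimal way to capture a CPU profile of a section like this (the file name and the placement of the worker code are placeholders) is something along these lines:

package main

import (
    "log"
    "os"
    "runtime/pprof"
)

func main() {
    f, err := os.Create("cpu.prof")
    if err != nil {
        log.Fatal(err)
    }
    if err := pprof.StartCPUProfile(f); err != nil {
        log.Fatal(err)
    }
    defer pprof.StopCPUProfile()

    // ... run the worker code from the question here ...
}

Then "go tool pprof yourbinary cpu.prof" shows where the time actually goes, and the runtime/trace package can show how goroutines are spread across cores.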
I want to convert a float64 number, let's say 1.003, to 1003 (integer type). My implementation simply multiplies the float64 by 1000 and casts it to int.
package main

import "fmt"

func main() {
    var f float64 = 1.003
    fmt.Println(int(f * 1000))
}
But when I run that code, what I get is 1002, not 1003, because 1.003 cannot be represented exactly in binary floating point, so the variable actually holds something like 1.002999... What is the correct approach to this kind of operation in Go?
Go spec: Conversions:
Conversions between numeric types
When converting a floating-point number to an integer, the fraction is discarded (truncation towards zero).
So basically when you convert a floating-point number to an integer, only the integer part is kept.
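For example (values picked just to illustrate the truncation):

fmt.Println(int(1.9))  // 1
fmt.Println(int(-1.9)) // -1  (truncation is towards zero, not downwards)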
If you just want to avoid the error arising from the finite-bit representation, just add 0.5 to the number before converting it to int. No external libraries or standard library function calls are required.
Since float -> int conversion is not rounding but keeping the integer part, this will give you the desired result. Taking into consideration both the possible smaller and greater representation:
1002.9999 + 0.5 = 1003.4999; integer part: 1003
1003.0001 + 0.5 = 1003.5001; integer part: 1003
So simply just write:
var f float64 = 1.003
fmt.Println(int(f * 1000 + 0.5))
To wrap this into a function:
func toint(f float64) int {
    return int(f + 0.5)
}

// Using it:
fmt.Println(toint(f * 1000))
Try them on the Go Playground.
Note:
Be careful when you apply this in case of negative numbers! For example if you have a value of -1.003, then you probably want the result to be -1003. But if you add 0.5 to it:
-1002.9999 + 0.5 = -1002.4999; integer part: -1002
-1003.0001 + 0.5 = -1002.5001; integer part: -1002
So if you have negative numbers, you have to either:
subtract 0.5 instead of adding it
or add 0.5 but subtract 1 from the result
Incorporating this into our helper function:
func toint(f float64) int {
    if f < 0 {
        return int(f - 0.5)
    }
    return int(f + 0.5)
}
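A quick check of both branches (input values chosen just for illustration):

fmt.Println(toint(1.003 * 1000))  // 1003
fmt.Println(toint(-1.003 * 1000)) // -1003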
As Will mentions, this comes down to how floats are represented on various platforms. Essentially you need to round the float rather than letting the default truncation happen. There's no standard library function for this (math.Round was only added later, in Go 1.10), probably because there's a lot of possible behavior and it's trivial to implement.
If you knew you'd always have errors of the sort described, where you're slightly below (1299.999999) the value desired (1300.00000), you could use the math library's Ceil function:
f := 1.29999
n := math.Ceil(f*1000)
But what if you have different kinds of floating-point error and want more general rounding behavior? Use the math library's Modf function to separate your floating-point value at the decimal point:
f := 1.29999
f1, f2 := math.Modf(f * 1000)
n := int(f1) // n = 1299
if f2 > .5 {
    n++
}
fmt.Println(n)
You can run a slightly more generalized version of this code in the playground yourself.
This is likely a problem with floating points in general in most programming languages, though some have different implementations than others. I won't go into the intricacies here, but most languages have a "decimal" approach, either in the standard library or as a third-party library, to get finer precision.
For instance, I've found the inf.v0 package largely useful. Underlying the library is a Dec struct that holds the exponent and the unscaled integer value. Therefore, it's able to hold 1.003 as 1003 * 10^-3. See below for an example:
package main

import (
    "fmt"

    "gopkg.in/inf.v0"
)

func main() {
    // represents 1003 * 10^-3
    someDec := inf.NewDec(1003, 3)

    // multiply someDec by 1000 * 10^0,
    // which translates to 1003 * 10^-3 * 1000 * 10^0
    someDec.Mul(someDec, inf.NewDec(1000, 0))

    // inf.RoundHalfUp rounds half up in the 0th scale, e.g. 0.5 rounds to 1
    value, ok := someDec.Round(someDec, 0, inf.RoundHalfUp).Unscaled()
    fmt.Println(value, ok)
}
Hope this helps!
I'm solving Project Euler problem 16. I've ended up with code that can logically solve it, but it's unable to process the numbers, as I believe it's overflowing or something. I tried int64 in place of int but it just prints 0, 0. If I change the power to anything below 30 it works, but above 30 it does not. Can anyone point out my mistake? I believe it's not able to calculate 2^1000.
// PE_16 project main.go
package main

import (
    "fmt"
)

func power(x, y int) int {
    var pow int
    var final int
    final = 1
    for pow = 1; pow <= y; pow++ {
        final = final * x
    }
    return final
}

func main() {
    var stp int
    var sumfdigits int
    var u, t, h, th, tth, l int
    stp = power(2, 1000)
    fmt.Println(stp)
    u = stp / 1 % 10
    t = stp / 10 % 10
    h = stp / 100 % 10
    th = stp / 1000 % 10
    tth = stp / 10000 % 10
    l = stp / 100000 % 10
    sumfdigits = u + t + h + th + tth + l
    fmt.Println(sumfdigits)
}
Your approach to this problem requires exact integer math up to 1000 bits in size, but you're using int, which is 32 or 64 bits. math/big.Int can handle such a task. I intentionally do not provide a ready-made solution using big.Int, as I assume your goal is to learn by doing it yourself, which I believe is the intent of Project Euler.
As noted by @jnml, ints aren't large enough; if you wish to calculate 2^1000 in Go, big.Ints are a good choice here. Note that math/big provides the Exp() method, which will be easier to use than converting your power function to big.Ints.
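For instance, a minimal sketch of just the exponentiation step (deliberately not the full solution) would look something like this:

package main

import (
    "fmt"
    "math/big"
)

func main() {
    // 2^1000 computed exactly; a nil modulus means plain exponentiation
    n := new(big.Int).Exp(big.NewInt(2), big.NewInt(1000), nil)

    // Converting to a string is one easy way to get at the decimal digits
    fmt.Println(n.String())
}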
I worked through some Project Euler problems about a year ago, doing them in Go to get to know the language. I didn't like the ones that required big.Ints, which aren't so easy to work with in Go. For this one, I "cheated" and did it in one line of Ruby:
Removed because I remembered it was considered bad form to show a working solution, even in a different language.
Anyway, my Ruby example shows another thing I learned with Go's big.Ints: sometimes it's easier to convert them to a string and work with that string than to work with the big.Int itself. This problem strikes me as one of those cases.
Converting my Ruby algorithm to Go, I only work with big.Ints on one line, then it's easy to work with the string and get the answer in just a few lines of code.
You don't need to use math/big. Below is a schoolboy-maths way of doubling a decimal number, as a hint!
xs holds the decimal digits in least-significant-first order. Pass in a pointer to the digits (pxs), as the slice might need to get bigger.
func double(pxs *[]int) {
    xs := *pxs
    carry := 0
    for i, x := range xs {
        n := x*2 + carry
        if n >= 10 {
            carry = 1
            n -= 10
        } else {
            carry = 0
        }
        xs[i] = n
    }
    if carry != 0 {
        *pxs = append(xs, carry)
    }
}
I am trying to create random lines and select some of them, which are really rare. My code is rather simple, but to get something usable I need to create very large vectors (i.e. <100000000 x 1; the tracks variable in my code). Is there any way to create larger vectors and to reduce the time needed for all those calculations?
My code is
%Initial line values
tracks=input('Give me the number of muon tracks: ');
width=1e-4;
height=2e-4;
Ystart=15.*ones(tracks,1);
Xstart=-40+80.*rand(tracks,1);
%Xend=-40+80.*rand(tracks,1);
Xend=laprnd(tracks,1,Xstart,15);
X=[Xstart';Xend'];
Y=[Ystart';zeros(1,tracks)];
b=(Ystart.*Xend)./(Xend-Xstart);
hot=0;
cold=0;
for i=1:tracks
    if ((Xend(i,1)<width/2 && Xend(i,1)>-width/2)||(b(i,1)<height && b(i,1)>0))
        plot(X(:, i),Y(:, i),'r');%the chosen ones!
        hold all
        hot=hot+1;
    else
        %plot(X(:, i),Y(:, i),'b');%the rest of them
        %hold all
        cold=cold+1;
    end
end
I am also using and calling a Laplace distribution generator made by Elvis Chen, which can be found here:
function y = laprnd(m, n, mu, sigma)
%LAPRND generate i.i.d. laplacian random number drawn from laplacian distribution
%   with mean mu and standard deviation sigma.
%   mu    : mean
%   sigma : standard deviation
%   [m, n] : the dimension of y.
%   Default mu = 0, sigma = 1.
%   For more information, refer to
%   http://en.wikipedia.org/wiki/Laplace_distribution
%   Author : Elvis Chen (bee33@sjtu.edu.cn)
%   Date   : 01/19/07

% Check inputs
if nargin < 2
    error('At least two inputs are required');
end
if nargin == 2
    mu = 0; sigma = 1;
end
if nargin == 3
    sigma = 1;
end

% Generate Laplacian noise
u = rand(m, n) - 0.5;
b = sigma / sqrt(2);
y = mu - b * sign(u) .* log(1 - 2 * abs(u));
The resulting plot is: [figure omitted]
As you indicate, your problem is two-fold. On the one hand, you have memory issues because you need to do so many trials. On the other hand, you have performance issues, because you have to process all those trials.
Solutions to each issue often have a negative impact on the other issue. IMHO, the best approach would be to find a compromise.
More trials are only possible if you get rid of those gargantuan arrays that are required for vectorization, and use a different strategy to do the loop. I will give priority to the possibility of using more trials, possibly at the cost of optimal performance.
When I execute your code as-is in the Matlab profiler, it immediately shows that the initial memory allocation for all your variables takes a lot of time. It also shows that the plot and hold all commands are the most time-consuming lines of them all. Some more trial-and-error shows that there is a disappointingly low maximum value for the trials you can do before OUT OF MEMORY errors start appearing.
The loop can be accelerated tremendously if you know a few things about its limitations in Matlab. In older versions of Matlab, it used to be true that loops should be avoided completely in favor of 'vectorized' code. In recent versions (I believe R2008a and up), the Mathworks introduced a piece of technology called the JIT accelerator (Just-in-Time compiler) which translates M-code into machine language on the fly during execution. Simply put, the JIT accelerator allows your code to bypass Matlab's interpreter and talk much more directly with the underlying hardware, which can save a lot of time.
The advice you'll hear a lot, that loops should be avoided in Matlab, is no longer generally true. While vectorization still has its value, any procedure of sizable complexity that is implemented using only vectorized code is often illegible, hard to understand, hard to change and hard to maintain. An implementation of the same procedure that uses loops often has none of these drawbacks; moreover, it will quite often be faster and require less memory.
Unfortunately, the JIT accelerator has a few nasty (and IMHO, unnecessary) limitations that you'll have to learn about.
One such thing is plot; it's generally a better idea to let a loop do nothing other than collect and manipulate data, and delay any plotting commands etc. until after the loop.
Another such thing is hold; the hold function is not a Matlab built-in function, meaning, it is implemented in M-language. Matlab's JIT accelerator is not able to accelerate non-builtin functions when used in a loop, meaning, your entire loop will run at Matlab's interpretation speed, rather than machine-language speed! Therefore, also delay this command until after the loop :)
Now, in case you're wondering, this last step can make a HUGE difference -- I know of one case where copy-pasting a function body into the upper-level loop caused a 1200x performance improvement. Days of execution time had been reduced to minutes!
There is actually another minor issue in your loop (a really small one, and a rather inconvenient one, I will immediately agree) -- the name of the loop variable should not be i. The name i is the name of the imaginary unit in Matlab, and the name resolution will unnecessarily consume time on each iteration. It's small, but non-negligible.
Now, considering all this, I've come to the following implementation:
function [hot, cold, h] = MuonTracks(tracks)
    % NOTE: no variables larger than 1x1 are initialized
    width  = 1e-4;
    height = 2e-4;

    % constant used for Laplacian noise distribution
    bL = 15 / sqrt(2);

    % Loop through all tracks
    X   = [];
    hot = 0;
    ii  = 0;
    while ii < tracks
        ii = ii + 1;

        % Note that I've inlined (== copy-pasted) the original laprnd()
        % function call. This was necessary to work around limitations
        % in loops in Matlab, and prevent the necessity of those HUGE
        % variables.
        %
        % Of course, you can still easily generalize all of this:

        % the new data
        u      = rand - 0.5;
        Ystart = 15;
        Xstart = 800*rand - 400;
        Xend   = Xstart - bL*sign(u)*log(1 - 2*abs(u));
        b      = (Ystart*Xend)/(Xend - Xstart);

        % the test
        if ((b < height && b > 0)) || ...
                (Xend < width/2 && Xend > -width/2)
            hot = hot + 1;
            % growing an array is perfectly fine when the chances of it
            % happening are so slim
            X = [X [Xstart; Xend]]; %#ok
        end
    end

    % This is trivial to do here, and prevents an 'else' in the loop
    cold = tracks - hot;

    % Now plot the chosen ones
    h = figure;
    hold all
    Y = repmat([15;0], 1, size(X,2));
    plot(X, Y, 'r');
end
With this implementation, I can do this:
>> tic, MuonTracks(1e8); toc
Elapsed time is 24.738725 seconds.
with a completely negligible memory footprint.
The profiler now also shows a nice and even distribution of effort along the code; no lines that really stand out because of their memory use or performance.
It's possibly not the fastest possible implementation (if anyone sees obvious improvements, please, feel free to edit them in). But, if you're willing to wait, you'll be able to do MuonTracks(1e23) (or higher :)
I've also done an implementation in C, which can be compiled into a Matlab MEX file:
/* DoMuonCounting.c */
#include <math.h>
#include <matrix.h>
#include <mex.h>
#include <time.h>
#include <stdlib.h>

void CountMuons(
    unsigned long long tracks,
    unsigned long long *hot, unsigned long long *cold, double *Xout);

/* simple little helper functions */
double sign(double x)  { return (x>0)-(x<0); }
double rand_double()   { return (double)rand()/(double)RAND_MAX; }

/* the gateway function */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int
        dims[] = {1,1};

    const mxArray
        /* Output arguments */
        *hot_out  = plhs[0] = mxCreateNumericArray(2,dims, mxUINT64_CLASS,0),
        *cold_out = plhs[1] = mxCreateNumericArray(2,dims, mxUINT64_CLASS,0),
        *X_out    = plhs[2] = mxCreateDoubleMatrix(2,10000, mxREAL);

    const unsigned long long
        tracks = (const unsigned long long)mxGetPr(prhs[0])[0];

    unsigned long long
        *hot  = (unsigned long long*)mxGetPr(hot_out),
        *cold = (unsigned long long*)mxGetPr(cold_out);

    double
        *Xout = mxGetPr(X_out);

    /* call the actual function, and return */
    CountMuons(tracks, hot,cold, Xout);
}

// The actual muon counting
void CountMuons(
    unsigned long long tracks,
    unsigned long long *hot, unsigned long long *cold, double *Xout)
{
    const double
        width  = 1.0e-4,
        height = 2.0e-4,
        bL     = 15.0/sqrt(2.0),
        Ystart = 15.0;

    double
        Xstart,
        Xend,
        u,
        b;

    unsigned long long
        i = 0ul;

    *hot  = 0ul;
    *cold = tracks;

    /* seed the RNG */
    srand((unsigned)time(NULL));

    /* aaaand start! */
    while (i++ < tracks)
    {
        u      = rand_double() - 0.5;
        Xstart = 800.0*rand_double() - 400.0;
        Xend   = Xstart - bL*sign(u)*log(1.0-2.0*fabs(u));
        b      = (Ystart*Xend)/(Xend-Xstart);

        if ((b < height && b > 0.0) || (Xend < width/2.0 && Xend > -width/2.0))
        {
            Xout[0 + *hot*2] = Xstart;
            Xout[1 + *hot*2] = Xend;
            ++(*hot);
            --(*cold);
        }
    }
}
compile in Matlab with
mex DoMuonCounting.c
(after having run mex -setup :) and then use it in conjunction with a small M-wrapper like this:
function [hot,cold, h] = MuonTrack2(tracks)
    % call the MEX function
    [hot,cold, Xtmp] = DoMuonCounting(tracks);

    % process outputs, and generate plots
    hot = uint32(hot); % circumvents limitations in 32-bit matlab
    X = Xtmp(:,1:hot);
    clear Xtmp

    h = NaN;
    if ~isempty(X)
        h = figure;
        hold all
        Y = repmat([15;0], 1, hot);
        plot(X, Y, 'r');
    end
end
which allows me to do
>> tic, MuonTrack2(1e8); toc
Elapsed time is 14.496355 seconds.
Note that the memory footprint of the MEX version is slightly larger, but I think that's nothing to worry about.
The only flaw I see is the fixed maximum number of Muon counts (hard-coded as 10000 as the initial array size of Xout; needed because there are no dynamically growing arrays in standard C)...if you're worried this limit could be broken, simply increase it, change it to be equal to a fraction of tracks, or do some smarter (but more painful) dynamic array-growing tricks.
In Matlab, it is sometimes faster to vectorize rather than use a for loop. For example, this expression:
(Xend(i,1) < width/2 && Xend(i,1) > -width/2) || (b(i,1) < height && b(i,1) > 0)
which is defined for each value of i, can be rewritten in a vectorised manner like this:
isChosen = (Xend(:,1) < width/2 & Xend(:,1) > -width/2) | (b(:,1) < height & b(:,1)>0)
Expressions like Xend(:,1) will give you a column vector, so Xend(:,1) < width/2 will give you a column vector of boolean values. Note that I have used & rather than && - this is because & performs an element-wise logical AND, unlike &&, which only works on scalar values. In this way you can build the entire expression, such that the variable isChosen holds a column vector of boolean values, one for each row of your Xend/b vectors.
Getting counts is now as simple as this:
hot = sum(isChosen);
since true is represented by 1. And:
cold = sum(~isChosen);
Finally, you can get the data points by using the boolean vector to select rows:
plot(X(:, isChosen),Y(:, isChosen),'r'); % Plot chosen values
hold all;
plot(X(:, ~isChosen),Y(:, ~isChosen),'b'); % Plot unchosen values
EDIT: The code should look like this:
isChosen = (Xend(:,1) < width/2 & Xend(:,1) > -width/2) | (b(:,1) < height & b(:,1)>0);
hot = sum(isChosen);
cold = sum(~isChosen);
plot(X(:, isChosen),Y(:, isChosen),'r'); % Plot chosen values