I’ve been working on building my first neural network from scratch, but the accuracy is very low (~10%).
import math
import numpy as np
####################################################################################################
# In this program, we will create a class that builds a neural network from scratch and use it to identify the images of digits 0-9 imported
# from the sklearn library.
####################################################################################################
# Create the ‘Nnetwork’ class and define its arguments:
# Set the number of neurons/nodes for each layer
# and initialize the weight matrices:
class Nnetwork:
def __init__(self):
# Initialize the values of the NN to the parameters passed into the function
self.no_of_in_nodes = 8 * 8 # Hardcoded for now
self.no_of_out_nodes = 10 # Right now we are doing 10 because we want to be able to recognize 10 different digits 0-9
self.no_of_hidden_nodes = math.floor(
self.no_of_in_nodes * 2 / 3) + self.no_of_out_nodes # Formula for calculating appropriate amount of hidden nodes
self.no_of_hidden_layers = 2 # This number will help determining how many nodes per hidden layer there are
self.learning_rate = .01 # How strongly the weights will change after back propagation
# Calculate the number of nodes per hidden layer to make life easier.
# We are assuming that all hidden layers have the same amount of nodes
self.no_of_nodes_per_hidden_layer = int(self.no_of_hidden_nodes / self.no_of_hidden_layers)
# print("Number of nodes per hidden layer:", self.no_of_nodes_per_hidden_layer)
# Initialize the weight matrices of the NN
# We have 3 weight matrices, two for the hidden layers and one for the output layer because they all differ in size
self.first_hidden_layer_weights = np.random.rand(self.no_of_in_nodes + 1,
self.no_of_nodes_per_hidden_layer) # The +1 here is for the bias weights.
# The first row of the weight matrix is the bias weights and the following rows are the weights of the input neurons
# This means that technically every single neuron in the hidden layer will have one bias node connected to it multiplied by a weight
# But in reality it is just one bias node with a weight going into every single node of the hidden layer. Each of these weights
# counts as one parameter. E.g. if the input layer had 5 neurons and a bias node, and the output layer has 4 neurons, then the model
# has 4*(5+1) parameters. 5 weights plus the bias weight for each node, and that for each of the 4 nodes in the output layer.
# print(self.hidden_layer_weights.shape)
self.second_hidden_layer_weights = np.random.rand(self.no_of_nodes_per_hidden_layer + 1,
self.no_of_nodes_per_hidden_layer)
self.output_layer_weights = np.random.rand(self.no_of_nodes_per_hidden_layer + 1,
self.no_of_out_nodes) # The + 1 here is for the bias weights again
# print(self.output_layer_weights.shape)
# Initialize node matrices of the NN. These will hold the activation values of the nodes and are initialized to 0
self.input_layer_nodes = np.zeros((self.no_of_in_nodes + 1, 1)) # + 1 for extra bias node
self.input_layer_nodes[0, 0] = 1
# print(self.input_layer_nodes)
self.first_hidden_layer_nodes = np.zeros(
(self.no_of_nodes_per_hidden_layer + 1, 1)) # +1 for extra bias node
self.first_hidden_layer_nodes[0, 0] = 1
self.second_hidden_layer_nodes = np.zeros(
(self.no_of_nodes_per_hidden_layer + 1, 1)) # +1 for extra bias node
self.second_hidden_layer_nodes[0, 0] = 1
self.output_layer_nodes = np.zeros((self.no_of_out_nodes, 1))
# print(self.input_layer_nodes.shape)
# print(self.hidden_layer_nodes.shape)
# print(self.output_layer_nodes.shape)
####################################################################################################
# This will be our method for training our neural network. It takes an input vector and the desired output (labels)
def forwardprop(self, input_vector, target_vector):
# First we need to normalize the input vector, otherwise the tanh function will just return 0s and 1s
# print(input_vector.shape)
norm_input = input_vector / np.linalg.norm(input_vector)
# Let's assume that the input we receive is already in the vector form we need for simplicity.
self.input_layer_nodes[1:, :] = norm_input # Initialize the input layer nodes to the input.
# Index every node starting from 1 instead of 0 since the 0th node is the bias node
# print(self.input_layer_nodes)
# ------------------------
# Now we need to apply all the weights of the input layer to the first hidden layer
# The way the weights are organized, each row represents all the weights for a single neuron in the input layer. If we take the
# transpose of this matrix, then each column will represent all the weights going from that specific node of the input layer to the
# nodes of the next layer (hidden layer).
# In this case we only have two hidden layers, so I did it by hand, but in the case of many hidden layers we need to loop through
# them here.
# We will perform the matrix multiplication, but then we need to normalize the data so that the tanh function can make something
# useful out of it.
# ----------------------------
# Forward propagation from input layer into first hidden layer
first_hidden_layer_nodes = np.matmul(self.first_hidden_layer_weights.transpose(), self.input_layer_nodes)
# print(hidden_layer_nodes)
# This gives us the weighted sum of the activation from the input layer
# print(hidden_layer_nodes)
norm_first_hidden_layer_nodes = first_hidden_layer_nodes / np.linalg.norm(first_hidden_layer_nodes)
# print(norm_hidden_layer_nodes)
self.first_hidden_layer_nodes[1:, :] = tanh(norm_first_hidden_layer_nodes) # Larger number give all 1's
# print(self.hidden_layer_nodes)
# ----------------------------
# Forward propagation from first hidden layer into second hidden layer
second_hidden_layer_nodes = np.matmul(self.second_hidden_layer_weights.transpose(), self.first_hidden_layer_nodes)
# print(hidden_layer_nodes)
# This gives us the weighted sum of the activation from the first hidden layer
# print(hidden_layer_nodes)
norm_second_hidden_layer_nodes = second_hidden_layer_nodes / np.linalg.norm(second_hidden_layer_nodes)
# print(norm_hidden_layer_nodes)
self.second_hidden_layer_nodes[1:, :] = tanh(norm_second_hidden_layer_nodes) # Larger number give all 1's
# ----------------------------
# Forward propagation from second hidden layer into output layer
output_layer_nodes = np.matmul(self.output_layer_weights.transpose(), self.second_hidden_layer_nodes)
norm_output_layer_nodes = output_layer_nodes / np.linalg.norm(output_layer_nodes)
self.output_layer_nodes = tanh(norm_output_layer_nodes)
# print(self.output_layer_nodes)
# print('Loss:', self.loss(target_vector))
####################################################################################################
def backprop(self, target, learning_rate):
"""All of these steps are explained in detail in this video: https://www.youtube.com/watch?v=tIeHLnjs5U8"""
# Backprop would usually contain a loop but since we only have 2 hidden layers we can do it manually for clarity
# ---------------------------------------------------------------------------------------- #
# WEIGHT ADJUSTMENT OUTPUT LAYER
# z is the derivative of the activation of the previous layer multiplied by the weights with the bias added on
# It is needed for calculating the derivatives of the components of the derivative of the cost function with respect to the weights
z = np.matmul(self.output_layer_weights.transpose(),
self.second_hidden_layer_nodes)
# print(z.shape)
# Again normalize for tanh function
z_norm = z / np.linalg.norm(z)
# This is the derivative of z with respect to the weights
hidden_activation = self.second_hidden_layer_nodes # Doesn't need to be normalized because it already went through norm tanh
# print(hidden_activation)
# This is the derivative of the tanh function
tanh_derivative = 1 - np.square(tanh(z_norm))
# This is the derivative of the cost with respect to the activation of the current layer
cost_derivative = 2 * (self.output_layer_nodes - target)
# The product of all three of these combined should
# give us the negative gradient meaning the amount which we should adjust the
# weights to move towards our desired result
output_layer_gradient = -np.matmul(hidden_activation, np.multiply(tanh_derivative, cost_derivative).transpose())[1:, :]
# print(output_layer_gradient)
# After this, we need to add the result to our weight matrix so that the adjustments take effect
# Do not modify these weights yet as we need them to calculate the activation of the previous layer that minimizes the cost function
# BIAS WEIGHT ADJUSTMENT OUTPUT LAYER
bias_weights_output_layer = np.multiply(tanh_derivative, cost_derivative).transpose()
# ---------------------------------------------------------------------------------------- #
# WEIGHT ADJUSTMENT SECOND HIDDEN LAYER
ideal_activation = np.matmul(self.output_layer_weights[1:, :], np.multiply(tanh_derivative, cost_derivative))
norm_ideal_activation = ideal_activation / np.linalg.norm(ideal_activation)
z = np.matmul(self.second_hidden_layer_weights.transpose(),
self.first_hidden_layer_nodes) # 0 is our bias here, but we will have a matrix for it in the future
# Again normalize for tanh function
z_norm = z / np.linalg.norm(z)
# This is the derivative of z with respect to the weights
input_activation = self.first_hidden_layer_nodes
# print(input_activation)
# This is the derivative of the sigmoid function
# sigmoid_derivative = np.exp(-z_norm) / np.square(1 + np.exp(-z_norm))
# print(sigmoid_derivative.shape)
# This is the derivative of tanh
tanh_derivative = 1 - np.square(tanh(z_norm))
# This is the derivative of the cost with respect to the output layer of this example
cost_derivative = 2 * (
self.second_hidden_layer_nodes[1:, :] - norm_ideal_activation) # We can ignore the bias activation in this case
# because there is no weight going from the input layer into the bias node
# print(cost_derivative.shape)
# The product of all three of these combined should give us the negative gradient meaning the amount which we should adjust the
# weights to move towards our desired result
# We want to omit the first row because we will be calculating the change of weights for the bias separately
second_hidden_layer_gradient = -np.matmul(input_activation, np.multiply(tanh_derivative, cost_derivative).transpose())[1:, :]
# print(second_hidden_layer_gradient.shape)
# BIAS WEIGHT ADJUSTMENT SECOND HIDDEN LAYER
bias_weights_second_hidden_layer = np.multiply(tanh_derivative, cost_derivative).transpose()
# ---------------------------------------------------------------------------------------- #
# Here we will calculate the desired activation of the previous layer to minimize the cost function
ideal_activation = np.matmul(self.second_hidden_layer_weights, np.multiply(tanh_derivative, cost_derivative))
norm_ideal_activation = ideal_activation / np.linalg.norm(ideal_activation)
# WEIGHT ADJUSTMENT FIRST HIDDEN
# Then we need to do the same for the weights between the first hidden layer and the input layer
# We need to get the derivative of the cost function with respect to the activation of the previous layer so that we can calculate
# the difference of the activation of that layer and the activation of that layer that would minimize the cost.
z = np.matmul(self.first_hidden_layer_weights.transpose(),
self.input_layer_nodes) # 0 is our bias here, but we will have a matrix for it in the future
# Again normalize for tanh function
z_norm = z / np.linalg.norm(z)
# This is the derivative of z with respect to the weights
input_activation = self.input_layer_nodes
# print(input_activation)
# This is the derivative of the sigmoid function
# sigmoid_derivative = np.exp(-z_norm) / np.square(1 + np.exp(-z_norm))
# print(sigmoid_derivative.shape)
# This is the derivative of tanh
tanh_derivative = 1 - np.square(tanh(z_norm))
# This is the derivative of the cost with respect to the output layer of this example
cost_derivative = 2 * (
self.first_hidden_layer_nodes[1:, :] - norm_ideal_activation[1:, :]) # We can ignore the bias activation in this case
# because there is no weight going from the input layer into the bias node
# print(cost_derivative.shape)
# The product of all three of these combined should give us the negative gradient meaning the amount which we should adjust the
# weights to move towards our desired result
first_hidden_layer_gradient = -np.matmul(input_activation, np.multiply(tanh_derivative, cost_derivative).transpose())[1:, :]
# print(hidden_layer_gradient)
# BIAS WEIGHT ADJUSTMENT FIRST HIDDEN LAYER
bias_weights_first_hidden_layer = np.multiply(tanh_derivative, cost_derivative).transpose()
# print(self.first_hidden_layer_weights[0])
# --------------------------------------------------------------------------------------- #
# Now we can update all weights with their respective new gradients
self.output_layer_weights[1:, :] = self.output_layer_weights[1:, :] + output_layer_gradient * learning_rate
self.output_layer_weights[0] = bias_weights_output_layer * learning_rate
self.second_hidden_layer_weights[1:, :] = self.second_hidden_layer_nodes[1:, :] + second_hidden_layer_gradient * learning_rate
self.second_hidden_layer_weights[0] = bias_weights_second_hidden_layer * learning_rate
self.first_hidden_layer_weights[1:, :] = self.first_hidden_layer_weights[1:, :] + first_hidden_layer_gradient * learning_rate
self.first_hidden_layer_weights[0] = bias_weights_first_hidden_layer * learning_rate
It takes 8x8 greyscale images and is supposed to be able to recognize digits from 0-9. I can’t find any obvious mistakes in the math, but I’m sure that I must be making a bad mistake somewhere. I am also a bit confused on when exactly to normalize. I’ve assumed that I need to normalize every input I give to an activation function since very large and very small inputs just result in 0s and 1s. I’ve tried to document my thought process into the code as much as possible through comments. The data set I’m using is from sklearn so I am not worried about that being an issue and I’ve also played around with the learning rate, the test-train split, and the number of hidden layers. Any help would be appreciated.
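On the normalization question, a small numeric sketch may help. It is not taken from the linked repository: it assumes NumPy's `np.tanh` in place of the post's `tanh` helper, a 0 to 16 pixel range (as in sklearn's 8x8 digits), and a hypothetical hidden layer of 26 nodes. It shows that tanh saturates toward -1 and +1 for large inputs, and that dividing pre-activations by their L2 norm removes their overall magnitude, whereas scaling the input once by a fixed constant does not.

```python
import numpy as np

# tanh saturates toward -1 and +1 (not 0 and 1) for large-magnitude inputs.
print(np.tanh(np.array([-20.0, -1.0, 0.0, 1.0, 20.0])))

# Hypothetical flattened 8x8 image; the 0..16 value range is an assumption.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 17, size=(64, 1)).astype(float)

# One-off input scaling with a fixed divisor keeps relative intensities
# comparable across images and maps every pixel into [0, 1].
scaled = pixels / 16.0

# Per-layer L2 normalization as done in the post: z and 2*z end up identical,
# so the scale of the weighted sum no longer influences the activation.
z = rng.normal(size=(26, 1)) * 50.0
z_unit = z / np.linalg.norm(z)
same_direction = (2.0 * z) / np.linalg.norm(2.0 * z)
print(np.allclose(z_unit, same_direction))
```

If the forward pass divides by a norm, the backward pass would have to differentiate through that division as well, which is one reason input scaling is usually done once, outside the network.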
Thank you.
https://github.com/WallerCodes/FirstNN/blob/main/neuralNetwork.py
Related
I am not very experienced with data structures and need some help:
I want to use np.fft.fft to get the amplitudes and phase angles for each barcode.
Each barcode has 512 data points (rows) that form one signal, and I want to build a loop that generates the corresponding complex numbers for each barcode.
That is, indices [0] to [511] form a single period, and np.fft.fft is computed over that range.
The next barcode runs from index [512] to [1023], and so on until the end.
Could someone give me some guidelines?
Many thanks in advance!!
And I've already written the code like this:
import math
import numpy as np

def generate_Nth_sine_wave(signal, N):
    """Extracts the Nth harmonic of a given signal.
    It assumes that the input belongs to a single period, spaced equally over the limits
    Args:
        signal : List containing signal values over a single period
        N : Nth Harmonic """
    # Apply Fourier Transformation on the signal to obtain the Fourier Coefficients
    x = np.fft.fft(signal)
    # Initiate a blank list with the same length as the coefficients array "x"
    Harmonic_list = [0] * len(x)
    # The Nth list element of "x" will correspond to the coefficient of the Nth harmonic.
    # Hence isolating only the Nth element by assigning null to the rest
    Harmonic_list[N] = 1
    Specific_Harmonic = Harmonic_list * x
    # Applying inverse FFT to the isolated harmonic Coefficient to get back the curve that was contributed by the specific Harmonic
    Harmonic_Curve = np.fft.ifft(Specific_Harmonic)*2
    Harmonic_Curve = Harmonic_Curve.real
    c = x[N]
    a = c.real
    b = c.imag
    phi = math.degrees(math.atan2(b,a))%360 # Phase angle
    hp = ((360-phi)/N)%360 # First higher peak position angle
    Magnitude = max(Harmonic_Curve) # Magnitude of the harmonic curve
    return Magnitude, hp
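For the per-barcode loop asked about above, one possible approach is to reshape the long signal into rows of 512 samples and let NumPy transform every row at once. This is only a sketch: the function name `fft_per_barcode`, the `period` parameter and the synthetic demo signal are made up for illustration, and it assumes the total length is a multiple of 512 (any trailing partial block is dropped).

```python
import numpy as np

def fft_per_barcode(signal, period=512):
    """Split a 1-D signal into consecutive blocks of `period` samples and FFT each block."""
    signal = np.asarray(signal, dtype=float)
    n_barcodes = len(signal) // period            # trailing partial block is ignored
    blocks = signal[:n_barcodes * period].reshape(n_barcodes, period)
    coeffs = np.fft.fft(blocks, axis=1)           # one row of complex coefficients per barcode
    # The 2/period scaling gives the sinusoid amplitude for bins other than
    # the DC bin (0) and the Nyquist bin (period/2).
    amplitudes = np.abs(coeffs) * 2 / period
    phases = np.degrees(np.angle(coeffs)) % 360   # phase angles in degrees, 0..360
    return coeffs, amplitudes, phases

# Synthetic demo: two "barcodes", a 4th harmonic of amplitude 3 and a 7th of amplitude 5.
t = np.arange(512) / 512
demo = np.concatenate([3 * np.cos(2 * np.pi * 4 * t), 5 * np.cos(2 * np.pi * 7 * t)])
coeffs, amplitudes, phases = fft_per_barcode(demo)
print(amplitudes[0, 4], amplitudes[1, 7])
```

Row `i` of each returned array belongs to barcode `i`, so `coeffs[i, N]` plays the role of `x[N]` in the function above.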
I am trying to implement A-Chao version of weighted reservoir sampling as shown in https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_A-Chao
But the pseudo-code described on the wiki seems to be wrong, especially in the initialization part. I read the paper, which mentions that over-weighted data points need special handling, but I still cannot work out how to initialize correctly.
In my understanding, at the initialization step we want to make sure that all initially chosen data points have the same probability*weight of being chosen. However, I don't understand how the over-weighted points are related to that.
Here is the code I implemented according to the wiki, but the results show it is incorrect.
const reservoirSampling = <T>(dataList: T[], k: number, getWeight: (point: T) => number): T[] => {
  const sampledList = dataList.slice(0, k);
  let currentWeightSum: number = sampledList.reduce((sum, item) => sum + getWeight(item), 0);
  for (let i = k; i < dataList.length; i++) {
    const currentItem = dataList[i];
    currentWeightSum += getWeight(currentItem);
    const probOfChoosingCurrentItem = getWeight(currentItem) / currentWeightSum;
    const rand = Math.random();
    if (rand <= probOfChoosingCurrentItem) {
      sampledList[getRandomInt(0, k - 1)] = currentItem;
    }
  }
  return sampledList;
};
The best way to get the distribution that Chao's algorithm produces is to implement VarOptk sampling as in the pseudocode labeled Algorithm 1 from the paper that introduced VarOptk sampling by Cohen et al.
That's an arXiv link and hence very stable, but to summarize, the idea is to separate the items into "heavy" (weight high enough to guarantee inclusion in the sample so far) and "light" (the others). Keep the heavy items in a priority queue where it is easy to remove the lightest of them. When a new item comes in, we have to determine whether it is heavy or light, and which heavy items became light (if any). Then there's a sampling procedure for dropping an item that treats the heavy → light items specially using weighted sampling and then falls back to choosing a uniform random light item (as in the easy case of Chao's algorithm).
The one trick with the pseudocode is that, if you use floating-point arithmetic, you have to be a little careful about "impossible" cases. Post your finished code on Code Review and ping me here if you would like feedback.
You will find a python implementation of Chao's strategy below. Here is a plot of 10000 samples from 0,..,99 with weights indicated by the yellow lines. The y-coordinate denotes how many times a given item was sampled.
I first implemented the pseudocode on Wikipedia, and agree completely with the OP that it is dead wrong. It then took me more than a day to understand Chao's paper. I also found the section of Tillé's book on Chao's method (see Algorithm 6.14 on page 120) helpful. (I don't know what the OP means by the issues with initialization.)
Disclaimer: I am new to python, and just tried to do my best. I think posting code might be more helpful than posting pseudocode. (Mainly I want to save someone a day's work getting to the bottom of Chao's paper!) If you do end up using this, I'd appreciate any feedback. Standard health warnings apply!
First, Chao's computation of inclusion probabilities:
import numpy as np
import random

def compute_Chao_probs(weights, total_weight, sample_size):
    """
    Consider a weighted population, some of its members, and their weights.
    This function returns a list of probabilities that these members are selected
    in a weighted sample of sample_size members of the population.
    Example 1: If all weights are equal, this probability is sample_size /(size of population).
    Example 2: If the size of our population is sample_size then these probabilities are all 1.
    Naively we expect these probabilities to be given by sample_size*weight/total_weight, however
    this may lead to a probability greater than 1. For example, consider a population
    of 3 with weights [3,1,1], and suppose we want to select 2 elements. The naive
    probability of selecting the first element is 2*3/5 > 1.
    We follow Chao's description: compute naive guess, set any probs which are bigger
    than 1 to 1, rinse and repeat.
    We expect to call this routine many times, so we avoid for loops, and try to make numpy do the work.
    """
    assert all(w > 0 for w in weights), "weights must be strictly positive."
    # heavy_items is a True / False array with one entry per weight.
    # True indicates items deemed "heavy" (i.e. assigned probability 1)
    # At the outset, no items are heavy:
    heavy_items = np.zeros(len(weights),dtype=bool)
    while True:
        new_probs = (sample_size - np.sum(heavy_items))/(total_weight - np.sum(heavy_items*weights))*weights
        valid_probs = np.less_equal(np.logical_not(heavy_items) * new_probs, np.ones((len(weights))))
        if all(valid_probs): # we are done
            return np.logical_not(heavy_items)*new_probs + heavy_items
        else: # we need to declare some more items heavy
            heavy_items = np.logical_or(heavy_items, np.logical_not(valid_probs))
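To make the "rinse and repeat" step concrete, here is a compact standalone restatement of the same capping idea, run on the docstring's [3,1,1] example. The function name `capped_inclusion_probs` is made up for this sketch, and it assumes the whole population is known and that `sample_size` does not exceed the number of weights.

```python
import numpy as np

def capped_inclusion_probs(weights, sample_size):
    """Inclusion probabilities proportional to weight, with any value above 1 capped at 1."""
    weights = np.asarray(weights, dtype=float)
    assert sample_size <= len(weights), "cannot sample more items than exist"
    heavy = np.zeros(len(weights), dtype=bool)    # items already fixed at probability 1
    while True:
        remaining_slots = sample_size - heavy.sum()
        light_weight = weights[~heavy].sum()      # total weight of the non-heavy items
        probs = np.where(heavy, 1.0, remaining_slots * weights / light_weight)
        too_big = probs > 1.0
        if not too_big.any():
            return probs
        heavy |= too_big                          # declare more items heavy and retry

# Naive guess is [1.2, 0.4, 0.4]; after capping the first item the rest share one slot.
print(capped_inclusion_probs([3, 1, 1], 2))       # → [1.  0.5 0.5]
```

The returned probabilities sum to `sample_size`, which is a quick sanity check for any implementation of this step.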
Then Chao's rejection rule:
def update_sample(current_sample, new_item, new_weight):
"""
We have a weighted population, from which we have selected n items.
We know their weights, the total_weight of the population, and the
probability of their inclusion in the sample when we selected them.
Now new_item arrives, with a new_weight. Should we take it or not?
current_sample is a dictionary, with keys 'items', 'weights', 'probs'
and 'total_weight'. This function updates current_sample according to
Chao's recipe.
"""
items = current_sample['items']
weights = current_sample['weights']
probs = current_sample['probs']
total_weight = current_sample['total_weight']
assert len(items) == len(weights) and len(weights) == len(probs)
fixed_sample_size = len(weights)
total_weight = total_weight + new_weight
new_Chao_probs = compute_Chao_probs(np.hstack((weights,[new_weight])),total_weight,fixed_sample_size)
if random.random() <= new_Chao_probs[-1]: # we should take new_item
#
# Now we need to decide which element should be replaced.
# Fix an index i in items, and let P denote probability. We have:
# P(i is selected in previous step) = probs[i]
# P(i is selected at current step) = new_Chao_probs[i]
# Hence (by law of conditional probability)
# P(i is selected at current step | i is selected at previous step) = new_Chao_probs[i] / probs[i]
# Thus:
# P(i is not selected at current step | i is selected at previous step) = 1 - new_Chao_probs[i] / probs[i]
# Now if we condition this on the assumption that the new element is taken, we get
# 1/new_Chao_probs[-1]*(1 - new_Chao_probs[i] / probs[i]).
#
# (*I think* this is what Chao is talking about in the two paragraphs just before Section 3 in his paper.)
rejection_weights = 1/new_Chao_probs[-1]*(np.ones((fixed_sample_size)) - (new_Chao_probs[0:-1]/probs))
# assert np.isclose(np.sum(rejection_weights),1)
# In examples we see that np.sum(rejection_weights) is not necessarily 1.
# I am a little confused by this, but ignore it for the moment.
rejected_index = random.choices(range(fixed_sample_size), rejection_weights)[0]
#make the changes:
current_sample['items'][rejected_index] = new_item
current_sample['weights'][rejected_index] = new_weight
current_sample['probs'] = new_Chao_probs[0:-1]
current_sample['probs'][rejected_index] = new_Chao_probs[-1]
current_sample['total_weight'] = total_weight
Finally, code to test and plot:
# Now we test Chao on some different distributions.
#
# This also illustrates how to use update_sample.
#
from collections import Counter
import matplotlib.pyplot as plt

n = 10 # number of samples
items_in = list(range(100))
weights_in = [random.random() for _ in range(10)]
# other possible tests:
weights_in = [i+1 for i in range(10)] # staircase
#weights_in = [9-i+1 for i in range(10)] # upside down staircase
#weights_in = [(i+1)**2 for i in range(10)] # parabola
#weights_in = [10**i for i in range(10)] # a very heavy tailed distribution (to check numerical stability)
random.shuffle(weights_in) # sometimes it is fun to shuffle
weights_in = np.array([w for w in weights_in for _ in range(10)])
count = Counter({})
for j in range(10000):
    # we take the first n with probability 1:
    current_sample = {}
    current_sample['items'] = items_in[:n]
    current_sample['weights'] = np.array(weights_in[:n])
    current_sample['probs'] = np.ones((n))
    current_sample['total_weight'] = np.sum(current_sample['weights'])
    for i in range(n,len(items_in)):
        update_sample(current_sample, items_in[i], weights_in[i])
    count.update(current_sample['items'])
plt.figure(figsize=(20,10))
plt.plot(100000*np.array(weights_in)/np.sum(weights_in), 'yo')
plt.plot(list(count.keys()), list(count.values()), 'ro')
plt.show()
I was persuaded some time ago to drop my comfortable MATLAB programming and start programming in Julia. I have been working with neural networks for a long time, and I thought that, now with Julia, I could get things done faster by parallelising the calculation of the gradient.
The gradient need not be calculated on the entire dataset in one go; instead one can split the calculation. For instance, by splitting the dataset in parts, we can calculate a partial gradient on each part. The total gradient is then calculated by adding up the partial gradients.
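The splitting idea can be checked numerically. The following is a minimal NumPy sketch rather than the Julia code from this question: the sizes, the random data and the helper name `grad_on_rows` are made up for illustration, and the gradient expression mirrors the least-squares one used further down. It only confirms that the per-chunk gradients add up to the full gradient; it says nothing about speed.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 5                       # illustrative sizes, smaller than in the question
X = rng.normal(size=(N, D))          # data items appear row-wise
w_true = rng.normal(size=(D, 1))
Y = X @ w_true                       # targets
W = rng.normal(size=(D, 1))          # current parameter estimate

def grad_on_rows(Y, X, W, first, last):
    """Gradient contribution of rows first..last-1 (0-based, half-open range)."""
    Xc, Yc = X[first:last], Y[first:last]
    return Xc.T @ (Yc - Xc @ W)

# Gradient over the whole dataset in one go.
full = grad_on_rows(Y, X, W, 0, N)

# Same gradient as the sum of partial gradients over 4 consecutive chunks.
bounds = np.linspace(0, N, 5).astype(int)
partial_sum = sum(grad_on_rows(Y, X, W, a, b) for a, b in zip(bounds[:-1], bounds[1:]))

print(np.allclose(full, partial_sum))
```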
Though the principle is simple, when I parallelise with Julia I get a performance degradation, i.e. one process is faster than two processes! I am obviously doing something wrong... I have consulted other questions asked in the forum but I could still not piece together an answer. I think my problem is that there is a lot of unnecessary data movement going on, but I can't fix it properly.
In order to avoid posting messy neural network code, I am posting below a simpler example that replicates my problem in the setting of linear regression.
The code-block below creates some data for a linear regression problem. The code explains the constants, but X is the matrix containing the data inputs. We randomly create a weight vector w which when multiplied with X creates some targets Y.
######################################
## CREATE LINEAR REGRESSION PROBLEM ##
######################################
# This code implements a simple linear regression problem
MAXITER = 100 # number of iterations for simple gradient descent
N = 10000 # number of data items
D = 50 # dimension of data items
X = randn(N, D) # create random matrix of data, data items appear row-wise
Wtrue = randn(D,1) # create arbitrary weight matrix to generate targets
Y = X*Wtrue # generate targets
The next code-block below defines functions for measuring the fitness of our regression (i.e. the negative log-likelihood) and the gradient of the weight vector w:
####################################
## DEFINE FUNCTIONS ##
####################################
@everywhere begin
#-------------------------------------------------------------------
function negative_loglikelihood(Y,X,W)
#-------------------------------------------------------------------
# number of data items
N = size(X,1)
# accumulate here log-likelihood
ll = 0
for nn=1:N
ll = ll - 0.5*sum((Y[nn,:] - X[nn,:]*W).^2)
end
return ll
end
#-------------------------------------------------------------------
function negative_loglikelihood_grad(Y,X,W, first_index,last_index)
#-------------------------------------------------------------------
# number of data items
N = size(X,1)
# accumulate here gradient contributions by each data item
grad = zeros(similar(W))
for nn=first_index:last_index
grad = grad + X[nn,:]' * (Y[nn,:] - X[nn,:]*W)
end
return grad
end
end
Note that the above functions are on purpose not vectorised! I choose not to vectorise, as the final code (the neural network case) will also not admit any vectorisation (let us not get into more details regarding this).
Finally, the code-block below shows a very simple gradient descent that tries to recover the parameter weight vector w from the given data Y and X:
####################################
## SOLVE LINEAR REGRESSION ##
####################################
# start from random initial solution
W = randn(D,1)
# learning rate, set here to some arbitrary small constant
eta = 0.000001
# the following for-loop implements simple gradient descent
for iter=1:MAXITER
# get gradient
ref_array = Array(RemoteRef, nworkers())
# let each worker process part of matrix X
for index=1:length(workers())
# first index of subset of X that worker should work on
first_index = (index-1)*int(ceil(N/nworkers())) + 1
# last index of subset of X that worker should work on
last_index = min((index)*(int(ceil(N/nworkers()))), N)
ref_array[index] = @spawn negative_loglikelihood_grad(Y,X,W, first_index,last_index)
end
# gather the gradients calculated on parts of matrix X
grad = zeros(similar(W))
for index=1:length(workers())
grad = grad + fetch(ref_array[index])
end
# now that we have the gradient we can update parameters W
W = W + eta*grad;
# report progress, monitor optimisation
@printf("Iter %d neg_loglikel=%.4f\n",iter, negative_loglikelihood(Y,X,W))
end
As is hopefully visible, I tried to parallelise the calculation of the gradient in the easiest possible way here. My strategy is to break the calculation of the gradient in as many parts as available workers. Each worker is required to work only on part of matrix X, which part is specified by first_index and last_index. Hence, each worker should work with X[first_index:last_index,:]. For instance, for 4 workers and N = 10000, the work should be divided as follows:
worker 1 => first_index = 1, last_index = 2500
worker 2 => first_index = 2501, last_index = 5000
worker 3 => first_index = 5001, last_index = 7500
worker 4 => first_index = 7501, last_index = 10000
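The split above follows directly from the `first_index` / `last_index` formulas in the loop. As a quick check, here is the same arithmetic in a short Python sketch; the function name `worker_ranges` is made up, and the indices are kept 1-based and inclusive to match the Julia code.

```python
import math

def worker_ranges(n_items, n_workers):
    """1-based inclusive (first_index, last_index) per worker, using the post's formula."""
    chunk = math.ceil(n_items / n_workers)        # rows per worker, rounded up
    return [((i - 1) * chunk + 1, min(i * chunk, n_items))
            for i in range(1, n_workers + 1)]

print(worker_ranges(10000, 4))
# → [(1, 2500), (2501, 5000), (5001, 7500), (7501, 10000)]
```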
Unfortunately, this entire code runs faster if I have only one worker. If I add more workers via addprocs(), the code runs slower. One can aggravate this issue by creating more data items, for instance by using N=20000 instead.
With more data items, the degradation is even more pronounced.
In my particular computing environment with N=20000 and one core, the code runs in ~9 secs. With N=20000 and 4 cores it takes ~18 secs!
I tried many many different things inspired by the questions and answers in this forum but unfortunately to no avail. I realise that the parallelisation is naive and that data movement must be the problem, but I have no idea how to do it properly. It seems that the documentation is also a bit scarce on this issue (as is the nice book by Ivo Balbaert).
I would appreciate your help as I have been stuck for quite some while with this and I really need it for my work. For anyone wanting to run the code, to save you the trouble of copying-pasting you can get the code here.
Thanks for taking the time to read this very lengthy question! Help me turn this into a model answer that anyone new in Julia can then consult!
I would say that GD is not a good candidate for parallelizing it using any of the proposed methods: either SharedArray or DistributedArray, or own implementation of distribution of chunks of data.
The problem does not lie in Julia, but in the GD algorithm.
Consider the code:
Main process:
for iter = 1:iterations #iterations: "the more the better"
δ = _gradient_descent_shared(X, y, θ)
θ = θ - α * (δ/N)
end
The problem is in the above for-loop which is a must. No matter how good _gradient_descent_shared is, the total number of iterations kills the noble concept of the parallelization.
After reading the question and the above suggestion I've started implementing GD using SharedArray. Please note, I'm not an expert in the field of SharedArrays.
The main process parts (simple implementation without regularization):
run_gradient_descent(X::SharedArray, y::SharedArray, θ::SharedArray, α, iterations) = begin
N = length(y)
for iter = 1:iterations
δ = _gradient_descent_shared(X, y, θ)
θ = θ - α * (δ/N)
end
θ
end
_gradient_descent_shared(X::SharedArray, y::SharedArray, θ::SharedArray, op=(+)) = begin
if size(X,1) <= length(procs(X))
return _gradient_descent_serial(X, y, θ)
else
rrefs = map(p -> (@spawnat p _gradient_descent_serial(X, y, θ)), procs(X))
return mapreduce(r -> fetch(r), op, rrefs)
end
end
The code common to all workers:
#= Returns the range of indices of a chunk for every worker on which it can work.
The function splits data examples (N rows into chunks),
not the parts of the particular example (features dimensionality remains intact).=#
@everywhere function _worker_range(S::SharedArray)
idx = indexpids(S)
if idx == 0
return 1:size(S,1), 1:size(S,2)
end
nchunks = length(procs(S))
splits = [round(Int, s) for s in linspace(0,size(S,1),nchunks+1)]
splits[idx]+1:splits[idx+1], 1:size(S,2)
end
#Computations on the chunk of the all data.
@everywhere _gradient_descent_serial(X::SharedArray, y::SharedArray, θ::SharedArray) = begin
prange = _worker_range(X)
pX = sdata(X[prange[1], prange[2]])
py = sdata(y[prange[1],:])
tempδ = pX' * (pX * sdata(θ) .- py)
end
The data loading and training. Let me assume that we have:
features in X::Array of the size (N,D), where N - number of examples, D-dimensionality of the features
labels in y::Array of the size (N,1)
The main code might look like this:
X=[ones(size(X,1)) X] #adding the artificial coordinate
N, D = size(X)
MAXITER = 500
α = 0.01
initialθ = SharedArray(Float64, (D,1))
sX = convert(SharedArray, X)
sy = convert(SharedArray, y)
X = nothing
y = nothing
gc()
finalθ = run_gradient_descent(sX, sy, initialθ, α, MAXITER);
After implementing and running this (on 8 cores of my Intel Core i7) I got only a very slight speed-up over serial GD (1 core) on my multiclass (19 classes) training data: 715 sec for serial GD versus 665 sec for shared GD.
If my implementation is correct (please check it, I'm counting on that), then parallelizing the GD algorithm is not worth the effort. You might well get a better speed-up using stochastic GD on 1 core.
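To make the chunking idea concrete outside Julia, here is a minimal pure-Python sketch (not the Julia code above; the names chunk_gradient and chunked_gradient are made up for illustration). It only shows that the per-chunk contributions X_c' * (X_c * θ - y_c) add up to the full gradient, which is what each worker computes on its own range of rows. It does not run anything in parallel.

```python
import math
import random

def chunk_gradient(rows, targets, theta):
    """Unnormalised least-squares gradient contribution of one chunk of examples."""
    grad = [0.0] * len(theta)
    for x, t in zip(rows, targets):
        residual = sum(xj * thj for xj, thj in zip(x, theta)) - t
        for j, xj in enumerate(x):
            grad[j] += residual * xj
    return grad

def chunked_gradient(rows, targets, theta, n_chunks):
    """Sum the per-chunk contributions; each chunk could be handled by one worker."""
    n = len(rows)
    bounds = [round(k * n / n_chunks) for k in range(n_chunks + 1)]
    total = [0.0] * len(theta)
    for lo, hi in zip(bounds, bounds[1:]):
        part = chunk_gradient(rows[lo:hi], targets[lo:hi], theta)
        total = [a + b for a, b in zip(total, part)]
    return total

# Synthetic data, only for the comparison below
rng = random.Random(0)
rows = [[1.0, rng.random(), rng.random()] for _ in range(1000)]
targets = [rng.random() for _ in range(1000)]
theta = [0.1, -0.2, 0.3]

serial = chunk_gradient(rows, targets, theta)              # all rows at once
chunked = chunked_gradient(rows, targets, theta, n_chunks=4)
print(all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
          for a, b in zip(serial, chunked)))
```

The split itself is cheap and exact; the cost that remains is that this sum has to be recomputed, and its result sent back, once per iteration.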
If you want to reduce the amount of data movement, you should strongly consider using SharedArrays. You could preallocate just one output vector, and pass it as an argument to each worker. Each worker sets a chunk of it, just as you suggested.
I'm working on a game for iOS and I'm stuck on something.
The idea is simple: I have a player that must go forward in order to win points (and avoid the yellow bricks).
Here is (in green) the ideal path for going forward inside the rectangle (from the bottom to the top).
I'm adding a new row each time, so each new row has to let the player move forward (it can move left, right and forward, not diagonally).
The idea is to have some 'parasite' empty spots, so the user must think about his next move.
So, my question: how can I generate something like this (for any number of columns)?
Thanks.
C.C.
Personally, I would tackle this backward:
Generate the "right" path then randomize the remaining cells with some heuristic that prevents it from generating a "wrong" path.
The heuristic could be something similar to this:
Let line 1 be the closest row, and line 10 the furthest.
Let a path be a series of contiguous 0s.
If line 1 contains only the "right" path and all lines from 2 to 9 contain at least one "wrong" path, have line 10 contain only the "right" path.
This heuristic might not be perfect, it's just an idea off the top of my head.
You could maintain a list of decoy paths, each one with a position (X) and a count down (C) that starts at a random number less than the number of lines visible.
At each step you mark the cells for the good path and each of the decoy paths. You decrement each decoy path counter and remove any that are at zero. If the number of decoy paths is below a certain threshold you add a new decoy path with a random C and a position that's either adjacent to the good path or randomly elsewhere if you want non-connected decoy paths too.
At each step the good path and the decoy paths can each increment, decrement or maintain their X position. If two paths touch they merge. When they merge you keep the lowest C value to ensure that you can't have a branch on a branch that goes beyond the number of visible rows.
This approach doesn't require any advance planning or creation of a graph.
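A rough sketch of the decoy-path idea in Python, under a few assumptions of mine: generate_board and its parameters are hypothetical names, "touching" is simplified to "ending up in the same column", and each path opens both its old and its new column in the new row so that the player never needs a diagonal move.

```python
import random

def generate_board(width, n_rows, max_decoys=2, visible_rows=8, seed=None):
    """Return (rows, good_xs): rows of booleans (True = free cell) and the
    good path's column in each row."""
    rng = random.Random(seed)

    def drift(x):
        # Move one column left or right, or stay, without leaving the board.
        return min(width - 1, max(0, x + rng.choice((-1, 0, 1))))

    good_x = rng.randrange(width)
    decoys = {}  # column -> remaining countdown C
    rows, good_xs = [], []
    for _ in range(n_rows):
        row = [False] * width

        # Advance the good path; open the old and the new column in this row.
        new_x = drift(good_x)
        row[good_x] = row[new_x] = True
        good_x = new_x

        # Advance the decoys, decrementing counters and dropping expired ones.
        moved = {}
        for x, c in decoys.items():
            if c - 1 <= 0:
                continue  # this decoy ends here as a dead end
            nx = drift(x)
            row[x] = row[nx] = True
            if nx == good_x:
                continue  # merged into the good path
            # Two decoys in the same column merge and keep the lower counter.
            moved[nx] = min(c - 1, moved.get(nx, c - 1))
        decoys = moved

        # Top up with a new decoy branching off next to the good path.
        if len(decoys) < max_decoys:
            x = drift(good_x)
            if x != good_x:
                row[x] = True
                decoys.setdefault(x, rng.randint(1, visible_rows - 1))

        rows.append(row)
        good_xs.append(good_x)
    return rows, good_xs

rows, good_xs = generate_board(width=8, n_rows=15, seed=1)
for row in reversed(rows):  # print the furthest row first
    print(" ".join("-" if free else "#" for free in row))
```

Because every countdown stays below visible_rows, no decoy can run for longer than the visible part of the board, which is the point of keeping the lowest C on a merge.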
You can think of the board as a directed graph. If you want to check whether the last row can be reached, you can use well-known algorithms such as DFS or BFS.
OK, that works, but it is potentially slow. Therefore you shouldn't run such an algorithm only after the whole board has been generated. Run it after every generated row! If every node in the new row is unreachable, regenerate the row.
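The row-by-row check could look roughly like this in Python. This is only a sketch: the function names, the free_prob parameter and the assumption that the player may start in any column are mine. Since the player can never move backwards, it is enough to remember which columns of the previous row are reachable and run a small BFS inside the new row.

```python
from collections import deque
import random

def reachable_in_next_row(prev_reachable, row):
    """Columns of `row` (list of bools, True = free) that can be reached by
    stepping forward from a reachable column and then moving left or right."""
    width = len(row)
    reached = {x for x in prev_reachable if row[x]}
    queue = deque(reached)
    while queue:
        x = queue.popleft()
        for nx in (x - 1, x + 1):
            if 0 <= nx < width and row[nx] and nx not in reached:
                reached.add(nx)
                queue.append(nx)
    return reached

def generate_checked_board(width, n_rows, free_prob=0.4, seed=None):
    """Generate random rows, regenerating any row that cannot be entered."""
    rng = random.Random(seed)
    reachable = set(range(width))  # assumption: any starting column is allowed
    board = []
    while len(board) < n_rows:
        row = [rng.random() < free_prob for _ in range(width)]
        next_reachable = reachable_in_next_row(reachable, row)
        if not next_reachable:
            continue  # dead end: throw this row away and try again
        board.append(row)
        reachable = next_reachable
    return board

board = generate_checked_board(width=8, n_rows=15, seed=3)
for row in reversed(board):
    print(" ".join("-" if free else "#" for free in row))
```

Each check only touches one row, so the cost per generated row is proportional to the board width.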
If you don't want a "generate and check" algorithm, add an empty row, check how many of its nodes are reachable, and pick at random how many of them will allow the player to go forward. Then randomly choose a subset of that size.
You can also write a simple random generator which returns i from [0,n). If the drawn column is to the left, you go left; if to the right, right; if it is the same, forward. It's nice, because when you are close to the edge you will probably turn back towards the other side, so the path gets a nice shape. But for wide maps it loses its benefits.
Better, but based on the same idea, is a smart use of the number distributions offered by random generators (e.g. look at the C++ random library, or other languages' generators).
Code based on this idea (but very, very simple) in C++11:
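#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

int main() {
    constexpr std::size_t width = 8, height = 15;
    // board[row][col] == true marks a free cell
    std::vector<std::vector<bool>> board;
    board.emplace_back(width, false);
    board.front().front() = true;  // the path starts in the first column
    std::size_t position = 0;
    std::random_device gen;
    while (board.size() < height) {
        // The mean sits one column to the right of the path in the left half
        // of the board and one column to the left of it in the right half
        std::poisson_distribution<> d(position + (position < width / 2 ? 1 : -1));
        const std::size_t random = d(gen);
        if (random == position) {
            // go forward: start a new row with the same column free
            board.emplace_back(width, false);
            board.back()[position] = true;
            continue;
        }
        if (random < position && board.back()[position - 1] == false)
            board.back()[--position] = true;  // go left
        else if (position + 1 < width)
            board.back()[++position] = true;  // go right
    }
    // print free cells as '-' and bricks as '#'
    for (const auto& row : board) {
        for (bool is_free : row) std::cout << (is_free ? "- " : "# ");
        std::cout << '\n';
    }
}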
Example of output:
- # # # # # # #
- - - - - - - -
# # # # # # # -
# # # - - - - -
# # # - # # # #
# # # - - - # #
# # # # # - - #
# # # # # - - #
# # # # # - - -
# # # - - - - -
# # # - - # # #
# # # - - # # #
# # # - - # # #
# - - - - # # #
# - # # # # # #
I'm interested in a way (algorithm) of distributing a predefined number of points over a four-sided surface like a square.
The main issue is that each point has to keep a minimum and maximum distance from the others (random between two predefined values). Basically, any two points should be no closer than, let's say, 2 and no further apart than 3.
My code will be implemented in Ruby (the points are locations, the surface is a map), but any ideas or snippets are definitely welcome, as all my ideas include a fair amount of brute force.
Try this paper. It has a nice, intuitive algorithm that does what you need.
In our modelization, we adopted another model: we consider each center to be related to all its neighbours by a repulsive string.
At the beginning of the simulation, the centers are randomly distributed, as well as the strengths of the
strings. We choose randomly to move one center; then we calculate the resulting force caused by all
neighbours of the given center, and we calculate the displacement which is proportional and oriented
in the sense of the resulting force.
After a certain number of iterations (which depends on the number of
centers and the degree of initial randomness) the system becomes stable.
In case it is not clear from the figures, this approach generates uniformly distributed points. You may use instead a force that is zero inside your bounds (between 2 and 3, for example) and non-zero otherwise (repulsive if the points are too close, attractive if too far).
This is my Python implementation (sorry, I don't know Ruby). Just import this and call uniform() to get an array of points.
import numpy as np
from numpy.linalg import norm

# find the n nearest neighbors (brute force)
def neighbors(x, X, n=10):
    dX = X - x
    d = dX[:,0]**2 + dX[:,1]**2
    idx = np.argsort(d)
    # idx[0] is the point itself (distance 0), so skip it
    return X[idx[1:n+1]]

# repulsion force, normalized to 1 when d == rmin
def repulsion(neib, x, d, rmin):
    if d == 0:
        return np.array([1,-1])
    return 2*(x - neib)*rmin/(d*(d + rmin))

def attraction(neib, x, d, rmax):
    return rmax*(neib - x)/(d**2)

def uniform(n=25, rmin=0.1, rmax=0.15):
    # Generate randomly distributed points
    X = np.random.random_sample( (n, 2) )
    # Constants
    # step is how much each point is allowed to move
    # set to a lower value when you have more points
    step = 1./50.
    # maxk is the maximum number of iterations
    # if step is too low, then maxk will need to increase
    maxk = 100
    k = 0
    # Force applied to the points
    F = np.zeros(X.shape)
    # Repeat for maxk iterations or until all forces are zero
    maxf = 1.
    while maxf > 0 and k < maxk:
        maxf = 0
        for i in range(n):
            # Force calculation for the i-th point
            x = X[i]
            f = np.zeros(x.shape)
            # Interact with at most 10 neighbors
            Neib = neighbors(x, X, 10)
            # dmin is the distance to the nearest neighbor
            dmin = norm(Neib[0] - x)
            for neib in Neib:
                d = norm(neib - x)
                if d < rmin:
                    # feel repulsion from points that are too near
                    f += repulsion(neib, x, d, rmin)
                elif dmin > rmax:
                    # feel attraction if there are no neighbors closer than rmax
                    f += attraction(neib, x, d, rmax)
            # save all forces and the maximum force to normalize later
            F[i] = f
            if norm(f) != 0:
                maxf = max(maxf, norm(f))
        # update all positions using the forces
        if maxf > 0:
            X += (F/maxf)*step
        k += 1
    if k == maxk:
        print("warning: iteration limit reached")
    return X
I presume that one of your brute force ideas includes just repeatedly generating points at random and checking to see if the constraints happen to be satisfied.
Another way is to take a configuration that satisfies the constraints and repeatedly perturb a small part of it, chosen at random - for instance move a single point - to move to a randomly chosen nearby configuration. If you do this often enough you should move to a random configuration that is almost independent of the starting point. This could be justified under http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm or http://en.wikipedia.org/wiki/Gibbs_sampling.
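A minimal sketch of this perturbation idea in Python, with several assumptions of mine: the constraint is read as "every point's nearest neighbour is between rmin and rmax away", the start configuration is a regular grid, and the function names and the max_move parameter are made up. A proposed move is kept only if the constraints still hold, which is in the spirit of Metropolis-Hastings with a symmetric proposal but is not a tuned sampler.

```python
import math
import random

def nearest_distances_ok(points, rmin, rmax):
    """True if every point's nearest neighbour is between rmin and rmax away."""
    for i, p in enumerate(points):
        nearest = min(math.hypot(p[0] - q[0], p[1] - q[1])
                      for j, q in enumerate(points) if j != i)
        if not (rmin <= nearest <= rmax):
            return False
    return True

def perturb(points, rmin, rmax, size, steps=500, max_move=0.5, seed=None):
    """Repeatedly move one random point a little; undo moves that break the constraints."""
    rng = random.Random(seed)
    points = list(points)
    for _ in range(steps):
        i = rng.randrange(len(points))
        old = points[i]
        new = (old[0] + rng.uniform(-max_move, max_move),
               old[1] + rng.uniform(-max_move, max_move))
        if not (0 <= new[0] <= size and 0 <= new[1] <= size):
            continue  # proposal leaves the map: skip it
        points[i] = new
        if not nearest_distances_ok(points, rmin, rmax):
            points[i] = old  # reject the move
    return points

# Start from a valid configuration: a regular grid with spacing 2.5
start = [(2.5 * i, 2.5 * j) for i in range(5) for j in range(5)]
result = perturb(start, rmin=2.0, rmax=3.0, size=10.0, seed=0)
print(nearest_distances_ok(result, 2.0, 3.0))
```

The full constraint check after every proposal is O(n²), which is fine for a few dozen points; for more points you would only re-check the moved point and its neighbours.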
I might try just doing it at random, then going through and dropping points that are too close to other points. You can compare the squares of the distances to save some math time.
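For example, a small Python sketch of "generate at random, drop what is too close" (the function name and parameters are made up for illustration). Note that it only enforces the minimum distance; the maximum distance is not handled here.

```python
import random

def random_points_min_distance(n_candidates, width, height, min_dist, seed=None):
    """Draw candidates uniformly and keep only those that are at least
    min_dist away from every point kept so far."""
    rng = random.Random(seed)
    min_dist_sq = min_dist * min_dist  # compare squared distances, no sqrt needed
    kept = []
    for _ in range(n_candidates):
        p = (rng.uniform(0, width), rng.uniform(0, height))
        if all((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 >= min_dist_sq for q in kept):
            kept.append(p)
    return kept

points = random_points_min_distance(200, width=20, height=20, min_dist=2.0, seed=0)
print(len(points), "points kept")
```

The number of points you end up with is not known in advance, so you may have to keep drawing candidates until you have enough.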
Or create cells with borders and place a point in each one. This is less random, so whether it is good enough depends on whether this is a "just for looks" thing or not. But it could be very fast.
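The cell idea could be sketched like this in Python (jittered_grid, cell and border are my own names). Each point is placed at random inside its cell but stays at least `border` away from the cell's edges, so two points can never be closer than 2 * border.

```python
import math
import random

def jittered_grid(cols, rows, cell, border, seed=None):
    """One random point per grid cell, at least `border` away from the cell edges."""
    if 2 * border > cell:
        raise ValueError("border must be at most half the cell size")
    rng = random.Random(seed)
    points = []
    for j in range(rows):
        for i in range(cols):
            x = i * cell + rng.uniform(border, cell - border)
            y = j * cell + rng.uniform(border, cell - border)
            points.append((x, y))
    return points

points = jittered_grid(cols=4, rows=4, cell=2.5, border=1.0, seed=0)
closest = min(math.hypot(p[0] - q[0], p[1] - q[1])
              for i, p in enumerate(points) for q in points[i + 1:])
print(len(points), "points, closest pair:", round(closest, 3))
```

The larger the border relative to the cell, the more regular (and less random) the result looks.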
I made a compromise and ended up using the Poisson Disk Sampling method.
The result was fairly close to what I needed, especially with a lower number of tries (which also drastically reduces cost).
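For anyone landing here later, a compact Python sketch of Poisson disk sampling in the style of Bridson's algorithm. This is not the code I ended up with; the function name and parameters are only for illustration. It guarantees the minimum distance r. New points are proposed between r and 2r from an existing one, so neighbours tend to stay close, but the maximum distance is not strictly enforced.

```python
import math
import random

def poisson_disk(width, height, r, tries=30, seed=None):
    """Points in [0, width) x [0, height) with pairwise distance >= r."""
    rng = random.Random(seed)
    cell = r / math.sqrt(2)  # a cell this small holds at most one point
    gw = int(math.ceil(width / cell))
    gh = int(math.ceil(height / cell))
    grid = [[None] * gw for _ in range(gh)]

    def cell_of(p):
        return min(gw - 1, int(p[0] / cell)), min(gh - 1, int(p[1] / cell))

    def fits(p):
        # Only points in the surrounding 5x5 block of cells can be closer than r.
        gx, gy = cell_of(p)
        for y in range(max(0, gy - 2), min(gh, gy + 3)):
            for x in range(max(0, gx - 2), min(gw, gx + 3)):
                q = grid[y][x]
                if q is not None and (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 < r * r:
                    return False
        return True

    first = (rng.uniform(0, width), rng.uniform(0, height))
    points, active = [first], [first]
    gx, gy = cell_of(first)
    grid[gy][gx] = first
    while active:
        idx = rng.randrange(len(active))
        base = active[idx]
        for _ in range(tries):
            angle = rng.uniform(0, 2 * math.pi)
            dist = rng.uniform(r, 2 * r)
            p = (base[0] + dist * math.cos(angle), base[1] + dist * math.sin(angle))
            if 0 <= p[0] < width and 0 <= p[1] < height and fits(p):
                points.append(p)
                active.append(p)
                gx, gy = cell_of(p)
                grid[gy][gx] = p
                break
        else:
            active.pop(idx)  # no room left around this point
    return points

points = poisson_disk(20, 20, r=2.0, seed=0)
print(len(points), "points")
```

The tries parameter is the "number of tries" mentioned above: lowering it makes the sampling cheaper and leaves a sparser, less tightly packed result.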