Training a neural network in Ruby

I'm a total beginner when it comes to neural networks. I've been wrestling with ruby-fann and ai4r all day and unfortunately I don't have anything to show for it, so I figured I would come onto Stack Overflow and ask the knowledgeable people here.
I have a set of samples -- each day has one data point, but they don't fit any clear pattern that I've been able to figure out (I tried a couple of regressions). Still, I think it would be neat to see if there was any way to predict the data going into the future just from the date, and I thought a neural network would be a good way to generate a function that could hope to express that relationship.
The dates are DateTime objects and the data points are decimal numbers, like 7.68. I've been converting the DateTime objects to floats and then dividing by 10,000,000,000 to get a number between 0 and 1, and I've been dividing the decimal numbers by 10,000 to also get a number between 0 and 1. I have over a thousand samples... here's what a short excerpt looks like:
[
["2012-03-15", "7.68"],
["2012-03-14", "4.221"],
["2012-03-13", "12.212"],
["2012-03-12", "42.1"]
]
Which, when transformed, looks like this:
[
[0.13317696, 0.000768],
[0.13316832, 0.0004221],
[0.13315968, 0.0012212],
[0.13315104, 0.00421]
]
I kind of wish this transformation weren't necessary, but I digress. The problem is that both ai4r and ruby-fann return one constant number, generally something in the middle of the range of the samples, when I run them. Here's the code for ruby-fann:
require 'ruby-fann'
require 'date'

@fann = RubyFann::Standard.new(:num_inputs=>1, :hidden_neurons=>[3, 3], :num_outputs=>1)
training_data = RubyFann::TrainData.new(:inputs => formatted_data.collect{|d| [d.first]}, :desired_outputs => formatted_data.collect{|d| [d.last]})
@fann.train_on_data(training_data, 1000, 1, 0.0001)
@fann.run([DateTime.now.to_time.to_f / 10000000000.0]) # Always something random, and always the same number no matter what date I request it for
And for ai4r:
require 'ai4r'

@ai4r = Ai4r::NeuralNetwork::Backpropagation.new([1, 3, 3, 1])
1000.times do
  formatted_data.each do |data|
    @ai4r.train(data.first, data.last)
  end
end
@ai4r.eval([DateTime.now.to_time.to_f / 10000000000.0]) # A different result from above, but always seemingly random and the same for any requested date
I feel like I'm missing something really basic here. I know this is a rather open-ended question but if anyone could help me figure out how I'm improperly teaching my neural networks, I'd really appreciate it!

alfa has a good point in his comment: an alternative way of using the NN might be more appropriate. It depends on the problem, but if the day's value is even partly a function of the previous days' values, treating this as a time series might yield better results.
You would then instead teach the NN to produce the day's value as a function of a window of, say, the previous ten days' values; you could also keep the date parameter as a real input scaled to [0, 1], as you believe it has a significant effect on the day's value.
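As a rough illustration, here is a minimal sketch of that windowed setup with ruby-fann (values is assumed to be your array of scaled daily points, oldest first; the hidden layer size and epoch counts are placeholders to tune):

require 'ruby-fann'

# values: daily data points, oldest first, already scaled into [0, 1]
window = 10
inputs, outputs = [], []
values.each_cons(window + 1) do |slice|
  inputs << slice.first(window)   # the previous ten days
  outputs << [slice.last]         # the day to predict
end

training_data = RubyFann::TrainData.new(:inputs => inputs, :desired_outputs => outputs)
fann = RubyFann::Standard.new(:num_inputs => window, :hidden_neurons => [8], :num_outputs => 1)
fann.train_on_data(training_data, 1000, 100, 0.0001)

fann.run(values.last(window))   # predict the next day from the last ten known values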

Related

How to detect the numerical value of a text?

We have data for a survey question (e.g. "rate us between 1 and 5") that's supposed to be numerical. However, we find that the responses also include
šŸ‘ repeated 5 times
ā¤ļø repeated 4 times
Great!
four
3 and a half
I'd like a way to turn the user response into a numerical value. For example, the text above should translate into 5, 4, 5, 4, and 3.5 respectively. Obviously this won't work 100% of the time, so I'm looking for an optimal solution (perhaps a text analysis approach) that gets me over 80%.
If you are solely looking to turn THESE SPECIFIC responses into numerical values, then you can pass them through a series of if statements in a function:
def inputToNumber(string)
  # thumbs up emoji
  if string == "\u{1f44d}"
    return 5
  # the word four
  elsif string == "four"
    return 4
  # etc., etc. with if statements for your other cases
  end
end
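If you do want to handle the fuzzier cases like the repeated emoji or "3 and a half", here is a slightly more general sketch (the patterns and word list are illustrative, not exhaustive, and a response like "Great!" would still need an actual text/sentiment classifier):

WORDS = { "one" => 1, "two" => 2, "three" => 3, "four" => 4, "five" => 5 }

def response_to_number(text)
  t = text.strip.downcase
  # count repeated rating emoji (thumbs up / heart), capped at 5
  emoji = t.scan(/[\u{1f44d}\u{2764}]/).size
  return [emoji, 5].min if emoji > 0
  # "3 and a half" => 3.5
  return $1.to_f + 0.5 if t =~ /(\d+)\s*and\s*a\s*half/
  # plain numbers, e.g. "4" or "3.5"
  return t.to_f if t =~ /\A\d+(\.\d+)?\z/
  # single number words, e.g. "four"
  WORDS[t]
end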
But it might make more sense for you to only allow numeric answers to begin with, because:
You can't predict every possible written response
Someone could input malicious code
You didn't provide your code showing how you are accepting input, so I can't really offer you specific solutions, but you can look here for some suggestions: Accept only numeric input
Good luck with your project.

Statistics/Algorithm: How do I compare a weekly graph with its own history to see when in the past it was almost the same?

I've got a statistical/mathematical problem I'm stumped on and I was really hoping to get some help. I'm working on a research project where I need to compare a weekly graph with its own history to see when in the past it was almost the same. Think of this as "finding the closest match". The information is displayed as a line graph, but it's readily available as raw data:
Date...................Result
08/10/18......52.5
08/07/18......60.2
08/06/18......58.5
08/05/18......55.4
08/04/18......55.2
and so on...
What I really want as output is a form of correlation between the current data points and each other set of 5 consecutive data points in history. So, something like:
Date range.....................Correlation
07/10/18-07/15/18....0.98
We'll be getting code written in Python for the software to do this automatically (so that as new data is added, it automatically runs and finds the closest set of numbers to match the current one).
Here's where the difficulty sets in: since the numbers are on a general upward trend over time, we don't want it to compare the absolute values (since the numbers might never really match). One suggestion has been to compare the deltas (the rate of change as a percentage over the previous day), or to use a log scale.
I'm wondering: how do I go about this? What kind of calculation can I use to get the desired results? I've looked at the different kinds of correlation equations, but they don't account for the "shape" of the data, and they generally just average it out. The shape of the line chart is the important thing.
Thanks very much in advance!
I would simply divide each week's data by its average (i.e., normalize it to an average of 1), then sum the squares of the day-by-day differences for each pair of weeks. This sum is what you want to minimize.
If you don't care about how much a graph oscillates relative to its mean, you can also normalize the variance: for each week, calculate the mean and variance, then subtract the mean and divide by the square root of the variance. Each week will then have mean 0 and variance 1. Then minimize the sum of squared differences as before.
If normalizing the data is all you can change in your workflow, just leave out the sum-of-squared-differences minimization part.
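A minimal sketch of the whole procedure (written in Ruby for consistency with the rest of this page; the arithmetic ports directly to the Python implementation mentioned in the question; weeks is assumed to be an array of [label, values] pairs, one entry per 5-point window, newest last):

def normalize(values)
  mean = values.sum / values.size.to_f
  sd = Math.sqrt(values.map { |v| (v - mean)**2 }.sum / values.size)
  values.map { |v| (v - mean) / sd }   # mean 0, variance 1
end

def distance(a, b)
  a.zip(b).map { |x, y| (x - y)**2 }.sum   # sum of squared differences
end

current = normalize(weeks.last[1])
closest = weeks[0..-2].min_by { |_label, vals| distance(current, normalize(vals)) }
puts "Closest match: #{closest[0]}"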

Good algorithm for maximum likelihood estimation

I have a problem. I need to estimate some statistics with a GARCH/ARCH model. In Matlab I use something like this:
spec = garchset('P', 1, 'Q', 1)
[fit01,~,LogL01] =garchfit(spec, STAT);
This returns the three parameters of the GARCH model fitted by maximum likelihood.
But I really need to know which algorithm is used in garchfit, because I need to write a program which does the same work of estimating the parameters automatically.
My program currently runs very slowly and is sometimes incorrect.
So the questions are:
How can I get the code of garchfit or the MLE routine in Matlab?
Does anyone know a good, fast algorithm for MLE?
(MLE = maximum likelihood estimation)
To see the code (if possible) you can type edit garchfit.
From the documentation of garchfit I have found some recommendations:
garchfit will be removed in a future release. Use estimate, estimate, estimate, or estimate instead.
My guess is that you want to look into garch.estimate.

How to find interesting points in a time series

I have an array of date => value pairs, like this:
"2010-10-12 14:58:36" =>13.4
"2010-10-17 14:58:36" =>12
"2010-10-22 14:58:36" =>17.6
"2010-10-27 14:58:36" =>22
"2010-11-01 14:58:36" =>10
[...]
I use these date-value pairs to paint a graph in JavaScript.
Now I'd like to mark those dates that are "very special".
My problem (and question) is: which aspects should I consider to find those specific dates?
As a human, I would pick the date "2010-10-17 14:58:36", because "something" must have happened on that date: the value rises by 5.6 points to the next date, which is the biggest step up, followed by one more big step up. On the other hand, the date "2010-10-27 14:58:36" is also a "highlight", because it is
the top of all values, and
after this date comes the biggest step down.
So as a human, I would choose both dates.
My problem is: what could such an algorithm look like?
I tried averaging the values for n dates before and after the current value, which results in an accumulation of those specific dates at the beginning and at the end of the graph.
So I tried to find the biggest percentage step up (relative to the previous date), but I'm not sure that really finds the specific dates I'm looking for.
How would you tackle the problem?
Thank you.
Looks like a financial stock issue :-) You are looking for time series analysis - this is a statistical problem. I'd recommend using the R programming language to play with it (you can do complex statistical things very fast). There are dozens of specialized packages, certainly financial ones too. Once you know what you want, you may implement the solution in any other language.
Just try to google time series analysis r.
EDIT: note that R is very powerful - I'd bet there are tools for using R packages from other languages.
If you have information over a timeline, you could use interpolation.
Polynomial interpolation will give you an approximating polynomial that goes through the points.
What's nice about this is that you can then use mathematical analysis, which is easy on polynomials, to find interesting points (large gradients, min/max points, etc.).
You also get an approximation of how the function behaves, so you could extrapolate "future" points and see what may happen in the near future.
Of course, looking into the future isn't so accurate, but forms of interpolation are used in analytics to see trends and behaviors.
And of course, it's easy to plot a polynomial, which is always nice.
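For instance, a minimal Lagrange interpolation sketch in Ruby (assuming dates have been mapped to numeric day indices; in practice you would interpolate small windows of points rather than the whole series):

# points: [[x0, y0], [x1, y1], ...] with x as a numeric day index
def lagrange(points, x)
  points.each_with_index.sum do |(xi, yi), i|
    term = yi.to_f
    points.each_with_index do |(xj, _), j|
      term *= (x - xj) / (xi - xj).to_f unless i == j
    end
    term
  end
end

# numeric derivative of the interpolant; large |slope| marks steep regions
def slope(points, x, h = 0.01)
  (lagrange(points, x + h) - lagrange(points, x - h)) / (2 * h)
end

Scanning slope across the x range then flags the steep step-ups and step-downs the question describes.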
This is really a question of statistics (http://en.wikipedia.org/wiki/Statistics) and of the context of your data and what you're looking to highlight. For example, the fact that between 12/10 and 17/10 the data moved down 1.4 units may be more useful in some scenarios than a larger positive step change.
You need sample data from which to build up a function that can calculate an expected value for any given date; for instance, averaging the values of the day before, the same weekday of the previous week, of the previous month, and so on. After that, decide on a threshold: interesting dates are those for which the real value falls outside the expected value ± the threshold.
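A minimal sketch of that idea (in Ruby for consistency with this page; series is assumed to be an array of [date, value] pairs, oldest first, and the expected value here is just the mean of up to three prior days):

THRESHOLD = 5.0   # tune against your data

values = series.map { |_, v| v }
interesting = []
series.each_with_index do |(date, value), i|
  next if i == 0
  window = values[[0, i - 3].max...i]        # up to three prior values
  expected = window.sum / window.size.to_f   # crude expected value
  interesting << date if (value - expected).abs > THRESHOLD
end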

Algorithm for deviations

Given a week full of data as integers (40, 30, 25, 55, 5, 40, etc.), I have to raise an alert when a deviation from the norm happens (the '5' in the above case). An extra nice thing would be to actually learn whether 5 is a normal event for that day of the week.
Do you know of an implementation in Ruby that is meant for this issue? In case this is a classic problem, what's the name of the problem/algorithm?
It's a very easy thing to calculate, but you will need to tune one parameter. You want to know if any given value is X standard deviations from the mean. To figure this out, calculate the standard deviation (see Wikipedia), then compare each value's deviation from the mean, abs(mean - value), against it. If a value's deviation is, say, more than two standard deviations from the mean, flag it.
Edit:
To track deviations by weekday, keep an array of integers, one for each day. Every time you encounter a deviation, increment that day's counter by one. You could also use doubles and instead maintain a percentage of deviations for that day (num_friday_deviations/num_fridays) for example.
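A minimal Ruby sketch of the flagging step (the sigma multiplier is the parameter to tune; with the sample week above, 1.5 sigma flags the 5, while 2 sigma does not):

def deviants(values, n_sigma = 2.0)
  mean = values.sum / values.size.to_f
  sd = Math.sqrt(values.map { |v| (v - mean)**2 }.sum / values.size)
  values.select { |v| (v - mean).abs > n_sigma * sd }
end

deviants([40, 30, 25, 55, 5, 40], 1.5)   # => [5]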
This is often referred to as "anomaly detection" and there is a lot of work out there if you google for it. The paper Mining Deviants in Time Series Data Streams may help you with your specific needs.
From the abstract:
We present first-known algorithms for identifying deviants on massive data streams. Our algorithms monitor streams using very small space (polylogarithmic in data size) and are able to quickly find deviants at any instant, as the data stream evolves over time.
http://en.wikipedia.org/wiki/Control_chart describes classical ways of doing this sort of thing. As Jonathan Feinberg commented, there are different approaches.
The name of the algorithm could be as simple as "calculate standard deviation."
http://en.wikipedia.org/wiki/Standard_deviation
However, any analysis you do should be specific to the data set. You should inspect historical data to get at the right algorithm. Standard deviation won't be a good measure at all unless your data is normally distributed. Your data might even be such that you just want to look for numbers above a certain max value... it really depends.
So, my advice to you is:
1) Google for statistics overview and read up on basic statistics.
2) Inspect any historical data you have.
3) Come up with some reasonable measure of an odd number.
4) Test your measure against your historical data and see if it highlights the numbers you think it should.
5) Repeat steps 2-4 as necessary to refine your algorithm.
