Building an activity chart combining two kinds of data - algorithm

Suppose I'm building an application that monitors your health. Each day you jog and do push-ups, and you enter the information on a web site.
What I would like to do is build a chart combining the hours you jogged and the number of push-ups/sit-ups you did. Say on the first day you jogged 1 hour and did 10 push-ups, and on the second day you jogged 50 minutes and did 20 push-ups; you would see a progression in your training.
I know it may sound strange, but I want an overall view of your health, not separate views for jogging and push-ups. I don't want a double y-axis chart because if I have, for example, 6 runners, I will end up with 12 lines on the chart.

First I would redefine your terms. You are not tracking "health" here, you are tracking level of exertion through exercise.
Max Exertion != Max Health. If you exert yourself to the max and don't eat or drink, you will actually damage your health. :-)
To combine and plot your total "level of exertion" for multiple exercises you need to convert them to a common unit ... something like "calories burned".
I'm pretty sure there are many sources for reference tables with rough conversion factors for how many calories various exercises burn.
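As a rough sketch of that idea (the conversion factors below are placeholders I made up for illustration, not real physiological values - take them from a proper reference table):

    # Rough sketch: collapse each day's mixed activities into a single
    # "calories burned" figure so they can share one y-axis.
    # The conversion factors are illustrative placeholders only.

    CALORIES_PER_MINUTE_JOGGING = 10.0   # assumption, varies with pace/weight
    CALORIES_PER_PUSHUP = 0.5            # assumption

    def daily_exertion(minutes_jogged, pushups):
        """Collapse one day's activities into a single comparable number."""
        return (minutes_jogged * CALORIES_PER_MINUTE_JOGGING
                + pushups * CALORIES_PER_PUSHUP)

    # Day 1: 60 min jog + 10 push-ups, Day 2: 50 min jog + 20 push-ups
    days = [(60, 10), (50, 20)]
    series = [daily_exertion(m, p) for m, p in days]
    print(series)   # one value per day, ready to plot as a single line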
Does that help any?

Then you need a model of how push-ups and jogging affect yourself, and for this you should be asking a doctor or fitness expert, not a programmer :-). This question should probably be taken elsewhere.

Sounds like a double y-axis chart.

You can just do a regular Excel-type chart with 2 lines, scaled appropriately, one for push-ups and one for jogging time. There are graphics libraries that let you do that in the back-end language of your choice. The x-axis is the date.
You may want to have 2 scaled graphs, one for the last week and one for the last year (à la Yahoo Finance charts for different intervals).

Show the first set of values as a line graph above the x axis, and the second set below the x axis. If both sets of values increase over time this will show as an "expansion" of the graph; should be easy to recognize if one set is growing but the other is not.
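A minimal matplotlib sketch of this layout, with made-up data; the second series is simply negated so it hangs below the axis:

    # Jogging minutes go above the x-axis, push-ups are mirrored below it;
    # growth in either series shows as the band getting wider.
    import matplotlib.pyplot as plt

    days = [1, 2, 3, 4, 5]
    jog_minutes = [60, 50, 55, 65, 70]      # illustrative data
    pushups = [10, 20, 25, 22, 30]

    plt.plot(days, jog_minutes, label="jogging (min)")
    plt.plot(days, [-p for p in pushups], label="push-ups (mirrored)")
    plt.axhline(0, color="black", linewidth=0.5)
    plt.xlabel("day")
    plt.legend()
    plt.show()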

Because the two quantities have no intrinsic relationship, you're stuck with either displaying them independently, such as two curves with two y-axes, or making up a measure that combines them, such as an estimate of calories burned, muscles used, mental anguish from exercising, etc. But it's tricky... taking from your example, I suspect one will never approach the calories burned from a 50 mile run by doing push-ups. Combining these in a meaningful way depends not on mathematics but on approximations and knowledge of the quantities that you start with and are interested in.
One compromise might be a graph with a single y-axis that shows some combined quantity, but where the independent values at each point are also graphically represented, for example, by a line where the local color represents the ratio of miles to pushups, or any of the many variants that display information in the shapes or colors in the plot.
Another option is to do a 3D plot, and then rotate it around and look for trends or whatever interests you.

If you want one overall measure of exercise levels, you could try using total exercise time. Another alternative is to define a points system, whereby users score points for each exercise.
I do think that there is virtue in letting the users see how much of each individual exercise they have done - in this case use a different graph for different exercises rather than using dual y-axes, if the scales are not comparable (e.g. time jogging and number of push-ups). There is a very good article on the problems with dual y-axes by business intelligence guru Stephen Few, here (pdf).
If you want to know more about presenting data well, I can also recommend his book "Now you see it", and the classic "The Visual Display of Quantitative Information" by Edward Tufte.

Related

Dividing the world in a thousand or so locations

Background: I want to create a weather service, and since most available APIs limit the number of daily calls, I want to divide the planet in a thousand or so areas.
Obviously, internet users are not uniformly distributed, so the sampling should be finer around densely populated regions.
How should I go about implementing this?
Where can I find data regarding geographical internet user density?
The algorithm will probably be something similar to k-means. However, implementing it on a sphere with oceans may be a bit tricky. Any insight?
Finally, maybe there is a way I can avoid doing all of this?
Very similar to k-means is the centroidal Voronoi diagram (it is the continuous version of k-means). However, this would produce a uniform tessellation of your sphere that does not account for user density as you wish.
So a similar solution is the same technique but used with a Power Diagram: a Power Diagram is a Voronoi Diagram that accounts for a density (by assigning a weight to each Voronoi seed). Such a diagram can be computed using an embedding in a 3D space (instead of 2D) that consists of the first two (x, y) coordinates plus a third one equal to sqrt(C − w_i), where C is any large positive constant and w_i is the weight of the given point.
Using that, you can obtain a tessellation of your domain that accounts for the user density.
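Computing a true power diagram on the sphere takes some machinery. As a simpler, closely related sketch, here is density-weighted k-means (Lloyd iterations) on unit vectors, where the per-point weights stand in for user density; the function names and the use of cosine similarity for nearest-centroid assignment are my assumptions, not part of the answer above:

    import numpy as np

    def latlon_to_unit(lat_deg, lon_deg):
        """Convert latitude/longitude in degrees to 3D unit vectors."""
        lat, lon = np.radians(lat_deg), np.radians(lon_deg)
        return np.stack([np.cos(lat) * np.cos(lon),
                         np.cos(lat) * np.sin(lon),
                         np.sin(lat)], axis=-1)

    def weighted_kmeans_sphere(points, weights, k=1000, iters=50, seed=0):
        """Lloyd iterations on the unit sphere with per-point weights.

        points  : (N, 3) unit vectors (e.g. locations users asked about)
        weights : (N,) user-density weights
        Returns (k, 3) unit-vector centroids -- your forecast areas.
        """
        rng = np.random.default_rng(seed)
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iters):
            # Nearest centroid by cosine similarity, which orders points the
            # same way as great-circle distance on the unit sphere.
            labels = np.argmax(points @ centroids.T, axis=1)
            for j in range(k):
                mask = labels == j
                if not mask.any():
                    continue
                mean = np.average(points[mask], axis=0, weights=weights[mask])
                centroids[j] = mean / np.linalg.norm(mean)
        return centroids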
You don't care about internet user density in general. You care about the density of users using your service - and you don't care where those users are, you care where they ask about. So once your site has been going for more than a day you can use the locations people ask about the previous day to work out what the areas should be for the next day.
Dynamic programming on a tree is easy. What I would do for an algorithm is to build a tree of successively more finely divided cells. More cells mean a smaller error, because people get predictions for points closer to them, and you can work out the error, or at least the relative error between more cells and fewer cells. Starting from the bottom up, work out the smallest possible total error contributed by each subtree when it is allowed to be divided into 1, 2, 3, ..., N cells. For each node and each k = 1..N, you can work out the best possible division and smallest possible error by looking at the smallest possible errors you have already calculated for each of its children, and working out how best to share the available k divisions between them.
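A rough sketch of that bottom-up calculation, assuming a quadtree-like structure where each node already knows err_self, the error of leaving its whole region as a single cell; both the Cell class and best_error are hypothetical names:

    class Cell:
        """One node of the spatial tree: its own single-cell error plus children."""
        def __init__(self, err_self, children=()):
            self.err_self = err_self        # error if this region stays one cell
            self.children = list(children)

    def best_error(node, k, memo=None):
        """Smallest total error achievable for this subtree using at most k cells."""
        if memo is None:
            memo = {}
        key = (id(node), k)
        if key in memo:
            return memo[key]
        best = node.err_self                    # option 1: don't subdivide at all
        if node.children and k >= len(node.children):
            # Option 2: subdivide, and share the k cells among the children.
            # alloc maps "cells used so far" -> best error for the children
            # processed so far (each child must get at least one cell).
            alloc = {0: 0.0}
            for child in node.children:
                new_alloc = {}
                for used, err in alloc.items():
                    for give in range(1, k - used + 1):
                        cand = err + best_error(child, give, memo)
                        tot = used + give
                        if tot not in new_alloc or cand < new_alloc[tot]:
                            new_alloc[tot] = cand
                alloc = new_alloc
            best = min(best, min(alloc.values()))
        memo[key] = best
        return best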
I would try to avoid doing this by thinking of a different idea. Depending on the way you look at life, there are at least two disadvantages of this:
1) You don't seem to be adding anything to the party. It looks like you are interposing yourself between organizations that actually make weather forecasts and their clients. Organizations lose direct contact with their clients, which might for instance lose them advertising revenue. Customers get a poorer weather forecast.
2) Most sites have legal terms of service, which most clients can ignore without worrying. My guess is that you would be breaking those terms of service, and if your service gets popular enough to be noticed they will be enforced against you.

combinatorial optimization - maximalize profit when creating furniture

A firm is supplied with large wooden panels. These panels are cut into the required pieces. To make, for example, a bookshelf, they have to cut the pieces from a large panel. In most cases the big panel is not used 100%; there will be some loss, some leftover pieces that cannot be used. So to minimize the loss, they have to find an optimal layout of the separate pieces on the big panel(s). I think this is called the "two-dimensional rectangle bin packing problem".
Now it gets more interesting.
Not all panels are the same; they can have slightly different colour tones. An ideal bookshelf is made from pieces all cut from one panel, or from multiple panels with the same colour tone. But a bookshelf can be produced in different qualities (ideal; one piece with a different tone; two pieces...; three different colour tones used; etc.). Each quality has its own price (the higher the quality, the higher the price).
Now we have some wooden panels in stock and an order for some furniture (e.g. 100 bookshelves). The goal is to maximize the profit (e.g. make some in ideal quality and some in lower quality to keep the material loss low).
How do I solve this problem? How do I combine it with the bin packing problem? Any hints, papers or articles would be appreciated. I know I can minimize/maximize a function subject to inequalities with integer linear programming, but I really do not know how to solve this.
(Please do not consider the real scenario, where for example it might be best to make only ideal ones. Imagine that the loss from leftover material is X money per cm² and Y is the price for a specific product quality, and that X and Y can be "arbitrary".)
I can give an idea of how these problems are solved and why yours is particularly difficult.
In a typical optimization problem, you want to maximize or minimize a function (e.g. energy) with respect to a set number of variables (e.g. length). For example, how long should a spring be in order to minimize the stored energy. The answer is just a number, the equilibrium length of the spring. Another example would be "what price should we set our product to maximize profit?" (Too expensive and no-one will buy anything; too cheap and you won't cover your costs.) Again, the answer is just a number, the optimal price. Optimizations like that are handled with ordinary calculus.
A much more difficult optimization problem is where the answer isn't a number, but a function, like a shape. An example is: what shape will a hanging chain make in order to minimize its gravitational potential energy. Or: what shape should we cut out of these boards in order to maximize profit? This type of problem is solved using variational calculus, which is very difficult.
In any case, when solving optimization problems numerically, there are a few basic steps to follow. First you have to define a function, for example profit(cuts, params), that you want to maximize with respect to some variables 'cuts', with other parameters 'params' fixed. 'params' stores information like the amount and type of wood that you have, and how much money the different types of furniture are worth.
The second step is to come up with a guess for the best set of cuts, we'll call it cuts_guess. In order to do this you need to come up with an algorithm that will suggest a set of furniture you could actually make using the supplies that you have. For example, if you can make at least one bookshelf from each board, then that could be your initial guess for the best way to use the wood.
The third phase is the optimization. For the initialization, set cuts_best = cuts_guess and profit_best = profit_guess = profit(cuts_guess, params). Then you need an algorithm that makes small pseudo-random changes to 'cuts' and checks whether the profit increases or decreases. Record the best set of cuts that you find, and the corresponding profit. Usually it's best if there is some randomness involved, in order to explore the largest number of possibilities and not get 'stuck' on a poor choice. You'll find examples of this if you look up 'Monte Carlo algorithm'.
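A generic sketch of that search loop, assuming you supply the profit(cuts, params) function described above plus a random_tweak(cuts) function (my hypothetical name) that returns a slightly modified layout; the acceptance rule is simulated-annealing style so the search can escape poor local optima:

    import math
    import random

    def monte_carlo_maximize(profit, random_tweak, cuts_guess, params,
                             steps=100_000, temp=1.0, cooling=0.9999):
        """Random local search / simulated annealing over cutting layouts.

        profit(cuts, params) -> float   (you must supply this)
        random_tweak(cuts)   -> cuts'   (a small random modification, you supply it)
        """
        cuts = cuts_best = cuts_guess
        score = score_best = profit(cuts, params)
        for _ in range(steps):
            candidate = random_tweak(cuts)
            cand_score = profit(candidate, params)
            # Always accept improvements; occasionally accept downhill moves
            # so the search does not get stuck on a poor layout.
            if (cand_score >= score
                    or random.random() < math.exp((cand_score - score) / temp)):
                cuts, score = candidate, cand_score
                if score > score_best:
                    cuts_best, score_best = cuts, score
            temp *= cooling
        return cuts_best, score_best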
Anyway, all of this will be very difficult for your problem. It's easy to come up with a guess for a single variable (e.g. a length), and then to change that guess (e.g. increase or decrease the length a bit). It's not at all obvious how to make a 'guess' for how to place cut-outs on a board, or how to make a small change to such a layout.

Neural network and algorithm(s), predicting future outcome from past

I was working on an algorithm where I am given some inputs and the corresponding outputs, and given roughly 3 months of such outputs I need a way to calculate what the future output might be.
This problem can be related to the stock exchange: we are given certain constraints and certain outcomes, and we need to predict the next one.
I stumbled upon neural network stock market prediction; you can Google it, or you can read about it here, here and here.
To get started on the algorithm, I couldn't figure out what the structure of the layers should be.
The given constraint are:
The output would always be integer.
The output would always be between 1 and 100.
There is no exact input per se; just like the stock market, we only know that the price fluctuates between 1 and 100, so we might (or might not?) consider this as the only input.
We have record for last 3 months (or more).
Now, my first question is, how many nodes do I take for input?
The output is just one node, fine. But as I said, should I take 100 nodes for the input layer (given that the price is always an integer and always between 1 and 100)?
What about the hidden layer? How many nodes should it have? Say I take 100 nodes there too; I don't think that would train the network well, because I think that for each input we also need to take all the previous inputs into account.
Say we are calculating the output for the 1st day of the 4th month; we should have 90 nodes in the hidden/middle layer (imagining each month has 30 days, for simplicity). Now there are two cases:
Our prediction was correct and outcome was same as we predicted.
Our prediction failed, and the outcome was different than what we predicted.
Whatever the case, when we are calculating the output for the 2nd day of the 4th month, we need not only those 90 inputs but also that last actual result (not the prediction, even if they happen to be the same), so we now have 91 nodes in our middle/hidden layer.
And so on; the number of nodes would keep increasing each day, AFAICT.
So, my other question is: how do I define/set the number of nodes in the hidden/middle layer if it is changing dynamically?
My last question is: is there any other particular algorithm out there (for this kind of problem) that I am not aware of, and that I should be using instead of messing around with all this neural network stuff?
Lastly, is there anything I might be missing that could cause my algorithm's predictions to go wrong, i.e. any caveats I should be aware of?
There is much to say in answer to your question. In fact, it addresses the problem of time series forecasting in general, and the application of neural networks to this task. I'll only cover the most important points here; after reading this you should dig into Google's results for the query "time series prediction neural network". There are many works where the principles are covered in detail. A variety of software implementations (with source code) also exist (here is just one example, with code in C++).
1) I must say that the problem is 99% about data preprocessing and choosing the right input/output factors, and only 1% about the concrete instrument to use, whether neural networks or something else. Just as a side note, neural networks can internally implement most other data analysis methods. For example, you can use a neural network for Principal Component Analysis (PCA), which is closely related to the SVD mentioned in another answer.
2) It's very rare that input/output values fit strictly within a specific range. Real-life data should be treated as unbounded in absolute value (even if the series seems to move within a channel, that channel can break at any moment), and a neural network only operates well under stable conditions. This is why the data is normally converted into increments first (by calculating the deltas between the i-th point and the (i-1)-th, or by taking the log of their ratio). I suggest you do this with your data anyway, even though you declare it stays within the [1, 100] range. If you don't, the neural network will most likely degenerate into a so-called naive predictor, which forecasts each next value to be equal to the previous one.
The data is then normalized into [0, 1] or [-1, +1]. The latter is appropriate for time series prediction, where +1 denotes a move up and -1 a move down. Use the tanh (hyperbolic tangent) activation function for the neurons in your net.
3) You should feed the NN with input data obtained from a sliding window of dates. For example, if you have data for a year and every point is a day, choose a window size (say, a month) and slide it day by day from the past to the future. The day just past the right edge of the window is the target output for the NN. This is a very simple approach (there are much more complicated ones); I mention it because you ask how to handle data that arrives continuously. The answer is: you don't need to change or enlarge your NN every day. Just use a fixed structure with a fixed window size and "forget" (do not feed to the NN) the oldest points. It's important that you do not treat all the data you have as a single input, but divide it into many small vectors and train the NN on them, so the net can generalize and find regularities.
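To make points 2 and 3 concrete, here is a minimal sketch in Python/NumPy; the 30-day window, the scaling choice and the function name are my assumptions, not part of the answer above:

    import numpy as np

    def make_training_set(series, window=30):
        """Turn a raw price-like series into NN training pairs.

        1. Convert to increments (deltas) so the net sees changes, not levels.
        2. Normalize the deltas into [-1, +1] (matches a tanh output).
        3. Slide a fixed-size window over the deltas; the value just past the
           window's right edge is the training target.
        """
        series = np.asarray(series, dtype=float)
        deltas = np.diff(series)                       # step 1
        scale = np.max(np.abs(deltas))                 # step 2
        scale = scale if scale > 0 else 1.0
        deltas = deltas / scale
        X, y = [], []
        for i in range(len(deltas) - window):          # step 3
            X.append(deltas[i:i + window])
            y.append(deltas[i + window])
        return np.array(X), np.array(y), scale

    # Example: 90 days of integer "prices" between 1 and 100
    prices = np.random.randint(1, 101, size=90)
    X, y, scale = make_training_set(prices, window=30)
    print(X.shape, y.shape)   # (59, 30) and (59,) -- fixed input size, output size 1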
4) The size of the sliding window is your NN input size. The output size is 1. You should play with the hidden layer size to find the best performance; start with a value somewhere between the input and output sizes, for example sqrt(in*out).
According to recent research, Recurrent Neural Networks seem to perform better for time series forecasting tasks.
I agree with Stan when he says
1) I must say that the problem is 99% about data preprocessing
I've applied Neural Networks for 25+ years to various aerospace applications including helicopter flight control - setting up the input/output data set is everything - all else is secondary.
I'm amazed by smirkman's comment that Neural Networks were quickly dropped "as they produced nothing worthwhile" - that tells me that whoever was working with those Neural Networks had little experience with them.
Given that the topic discusses neural network stock market prediction - I'll say that I've made it work. Test results are downloadable from my website at www.nwtai.com.
I don't give away how it was done but there's enough interesting data that should make you want to explore using Neural Networks more seriously.
This kind of problem was particularly well researched by the thousands of people who wanted to win the $1M Netflix Prize.
Earlier submissions were often based on K Nearest Neighbours. Later submissions were made using Singular Value Decomposition, Support Vector Machines and Stochastic Gradient Descent. The winner used a blend of several techniques.
Reading the excellent Community forums will give you many insights about the best methods to predict the future from the past. You'll also find loads of source code for the different methods.
Amusingly, neural networks were quickly dropped, as they produced nothing worthwhile (and I personally have yet to see a non-trivial NN produce anything of value).
If you are starting out, I'd suggest SVD as a first path; it's quite easy to make and often produces surprising insights into data.
Good luck!

Techniques to evaluate the "twistiness" of a road in Google Maps?

As per the title: I want to, given a Google Maps URL, generate a twistiness rating based on how winding the roads are. Are there any techniques available I can look into?
What do I mean by twistiness? Well, I'm not sure exactly. I suppose it's characterized by a high turn-to-distance ratio, as well as a high angle-change-per-turn number. I'd also say that the elevation change of a road comes into it as well.
I think that once you know exactly what you want to measure, the implementation is quite straightforward.
I can think of several measurements:
the ratio of the road length to the distance between start and end (this would make a long single curve "twisty", so it is most likely not the complete answer)
the number of inflection points per unit length (this would make an almost straight road with a lot of little swaying "twisty", so it is most likely not the complete answer)
These two could be combined by multiplication, so that you would have:
road-length * inflection-points
--------------------------------------
start-end-distance * road-length
You can see that this can be shortened to "inflection-points per start-end-distance", which does seem like a good indicator for "twistiness" to me.
As for taking elevation into account, I think that making the whole calculation in three dimensions is enough for a first attempt.
You might want to handle left-right inflections separately from up-down inflections, though, in order to make it possible to scale the elevation inflections by some factor.
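A small Python sketch of the "inflection points per start-to-end distance" measure in 2D, assuming you have already extracted the road centreline as a list of planar (x, y) vertices (getting those out of a Google Maps URL is a separate problem); the function name is mine:

    import math

    def twistiness(points):
        """Inflection points per unit start-to-end distance.

        points: list of (x, y) road centreline vertices in planar coordinates.
        An inflection is counted whenever the turn direction (sign of the
        cross product of consecutive segment vectors) flips.
        """
        def cross(p, q, r):
            return (q[0] - p[0]) * (r[1] - q[1]) - (q[1] - p[1]) * (r[0] - q[0])

        signs = []
        for a, b, c in zip(points, points[1:], points[2:]):
            z = cross(a, b, c)
            if z != 0:
                signs.append(1 if z > 0 else -1)
        inflections = sum(1 for s, t in zip(signs, signs[1:]) if s != t)

        start, end = points[0], points[-1]
        straight = math.hypot(end[0] - start[0], end[1] - start[1])
        return inflections / straight if straight > 0 else float("inf")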
Try http://www.hardingconsultants.co.nz/transportationconference2007/images/Presentations/Technical%20Conference/L1%20Megan%20Fowler%20Canterbury%20University.pdf as a starting point.
I'd assume that you'd have to somehow capture the road centreline from Google Maps as a vectorised dataset & analyse using GIS software to do what you describe. Maybe do a screen grab then a raster-to-vector conversion to start with.
Cumulative turn angle per Km is a commonly-used measure in road assessment. Vertex density is also useful. Note that these measures depend upon an assumption that vertices have been placed at some form of equal density along the line length whilst they were captured, rather than being manually placed. Running a GIS tool such as a "bendsimplify" algorithm on the line should solve this. I have written scripts in Python for ArcGIS 10 to define these measures if anyone wants them.
Sinuosity is sometimes used for measuring bends in rivers - see the help pages for Hawths Tools for ArcGIS for a good description. It could be misleading for roads that have major changes in course along their length, though.
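A sketch of the cumulative-turn-angle-per-km and sinuosity measures mentioned above, again assuming the centreline is a list of planar (x, y) vertices in metres; the function name is mine:

    import math

    def road_measures(points):
        """Cumulative turn angle per km, plus sinuosity (length / straight-line).

        points: (x, y) vertices in metres along the road centreline.
        """
        length = 0.0
        turn = 0.0
        for i in range(1, len(points)):
            (x0, y0), (x1, y1) = points[i - 1], points[i]
            length += math.hypot(x1 - x0, y1 - y0)
            if i >= 2:
                xp, yp = points[i - 2]
                h_prev = math.atan2(y0 - yp, x0 - xp)
                h_curr = math.atan2(y1 - y0, x1 - x0)
                d = h_curr - h_prev
                # wrap the heading change into (-pi, pi]
                d = (d + math.pi) % (2 * math.pi) - math.pi
                turn += abs(d)
        straight = math.hypot(points[-1][0] - points[0][0],
                              points[-1][1] - points[0][1])
        turn_per_km = math.degrees(turn) / (length / 1000.0) if length else 0.0
        sinuosity = length / straight if straight else float("inf")
        return turn_per_km, sinuosity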

How to rank a million images with a crowdsourced sort

I'd like to rank a collection of landscape images by making a game whereby site visitors can rate them, in order to find out which images people find the most appealing.
What would be a good method of doing that?
Hot-or-Not style? I.e. show a single image, ask the user to rank it from 1-10. As I see it, this allows me to average the scores, and I would just need to ensure that I get an even distribution of votes across all the images. Fairly simple to implement.
Pick A-or-B? I.e. show two images, ask user to pick the better one. This is appealing as there is no numerical ranking, it's just a comparison. But how would I implement it? My first thought was to do it as a quicksort, with the comparison operations being provided by humans, and once completed, simply repeat the sort ad-infinitum.
How would you do it?
If you need numbers: I'm talking about one million images, on a site with 20,000 daily visits. I'd imagine a small proportion might play the game; for the sake of argument, let's say I can generate 2,000 human sort operations a day! It's a non-profit website, and the terminally curious will find it through my profile :)
As others have said, ranking 1-10 does not work that well because people have different levels.
The problem with the Pick A-or-B method is that the system is not guaranteed to be transitive (A can beat B, B beats C, yet C beats A). Having non-transitive comparison operators breaks sorting algorithms. In quicksort, given such an example, whichever items are not chosen as the pivot will end up incorrectly ranked against each other.
At any given time, you want an absolute ranking of all the pictures (even if some/all of them are tied). You also want your ranking not to change unless someone votes.
I would use the Pick A-or-B (or tie) method, but determine ranking similar to the Elo ratings system which is used for rankings in 2 player games (originally chess):
The Elo player-rating system compares players' match records against their opponents' match records and determines the probability of the player winning the matchup. This probability factor determines how many points a player's rating goes up or down based on the results of each match. When a player defeats an opponent with a higher rating, the player's rating goes up more than if he or she defeated a player with a lower rating (since players should defeat opponents who have lower ratings).
The Elo System:
All new players start out with a base rating of 1600
WinProbability = 1 / (10^((Opponent's Current Rating − Player's Current Rating) / 400) + 1)
ScoringPt = 1 point if they win the match, 0 if they lose, and 0.5 for a draw
Player's New Rating = Player's Old Rating + K-Value * (ScoringPt − Player's Win Probability)
Replace "players" with pictures and you have a simple way of adjusting both pictures' rating based on a formula. You can then perform a ranking using those numeric scores. (K-Value here is the "Level" of the tournament. It's 8-16 for small local tournaments and 24-32 for larger invitationals/regionals. You can just use a constant like 20).
With this method, you only need to keep one number for each picture which is a lot less memory intensive than keeping the individual ranks of each picture to each other picture.
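A direct translation of those formulas into Python, using a constant K of 20 as suggested above:

    def elo_update(rating_a, rating_b, score_a, k=20):
        """One Elo update after picture A faces picture B.

        score_a: 1.0 if A was picked, 0.0 if B was picked, 0.5 for a draw.
        Returns the two new ratings.  New pictures start at 1600.
        """
        expected_a = 1.0 / (10 ** ((rating_b - rating_a) / 400.0) + 1)
        expected_b = 1.0 - expected_a
        new_a = rating_a + k * (score_a - expected_a)
        new_b = rating_b + k * ((1.0 - score_a) - expected_b)
        return new_a, new_b

    # User preferred picture A over picture B:
    print(elo_update(1600, 1600, 1.0))   # A gains 10 points, B loses 10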
EDIT: Added a little more meat based on comments.
Most naive approaches to the problem have some serious issues. The worst is how bash.org and qdb.us display quotes - users can vote a quote up (+1) or down (-1), and the list of best quotes is sorted by the total net score. This suffers from a horrible time bias - older quotes have accumulated huge numbers of positive votes via simple longevity even if they're only marginally humorous. This algorithm might make sense if jokes got funnier as they got older, but - trust me - they don't.
There are various attempts to fix this - looking at the number of positive votes per time period, weighting more recent votes, implementing a decay system for older votes, calculating the ratio of positive to negative votes, etc. Most suffer from other flaws.
The best solution - I think - is the one that the websites The Funniest, The Cutest, The Fairest, and Best Thing use - a modified Condorcet voting system:
The system gives each one a number based on, out of the things that it has faced, what percentage of them it usually beats. So each one gets the percentage score NumberOfThingsIBeat / (NumberOfThingsIBeat + NumberOfThingsThatBeatMe). Also, things are barred from the top list until they've been compared to a reasonable percentage of the set.
If there's a Condorcet winner in the set, this method will find it. Since that's unlikely, given the statistical nature, it finds the one that's the "closest" to being a Condorcet winner.
For more information on implementing such systems the Wikipedia page on Ranked Pairs should be helpful.
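A bare-bones sketch of that scoring rule, using raw win/loss counts rather than per-opponent percentages (a simplification of what those sites actually do) and barring under-compared images from the list:

    from collections import defaultdict

    wins = defaultdict(int)    # image id -> how many comparisons it won
    losses = defaultdict(int)  # image id -> how many comparisons it lost

    def record_comparison(winner, loser):
        wins[winner] += 1
        losses[loser] += 1

    def ranking(min_comparisons=10):
        """Score = wins / (wins + losses); hide under-compared images."""
        scored = []
        for img in set(wins) | set(losses):
            n = wins[img] + losses[img]
            if n >= min_comparisons:
                scored.append((wins[img] / n, img))
        return sorted(scored, reverse=True)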
The algorithm requires people to compare two objects (your Pick-A-or-B option), but frankly, that's a good thing. I believe it's very well accepted in decision theory that humans are vastly better at comparing two objects than they are at abstract ranking. Millions of years of evolution make us good at picking the best apple off the tree, but terrible at deciding how closely the apple we picked hews to the true Platonic Form of appleness. (This is, by the way, why the Analytic Hierarchy Process is so nifty...but that's getting a bit off topic.)
One final point to make is that SO uses an algorithm to find the best answers which is very similar to bash.org's algorithm to find the best quote. It works well here, but fails terribly there - in large part because an old, highly rated, but now outdated answer here is likely to be edited. bash.org doesn't allow editing, and it's not clear how you'd even go about editing decade-old jokes about now-dated internet memes even if you could... In any case, my point is that the right algorithm usually depends on the details of your problem. :-)
I know this question is quite old but I thought I'd contribute
I'd look at the TrueSkill system developed at Microsoft Research. It's like ELO but has a much faster convergence time (looks exponential compared to linear), so you get more out of each vote. It is, however, more complex mathematically.
http://en.wikipedia.org/wiki/TrueSkill
I don't like the Hot-or-Not style. Different people would pick different numbers even if they all liked the image exactly the same. Also I hate rating things out of 10, I never know which number to choose.
Pick A-or-B is much simpler and more fun. You get to see two images, and comparisons are made between the images on the site.
These equations from Wikipedia make it simpler and more effective to calculate Elo ratings; the algorithm for images A and B would be:
Get Ne, mA, mB and the current ratings RA, RB from your database.
Calculate KA, KB, QA, QB using the total number of comparisons performed (Ne), the number of times each image has been compared (mA, mB), and the current ratings.
Calculate the expected scores EA and EB.
Score the winner's S: 1 for the winner, 0 for the loser, and 0.5 for a draw.
Calculate the new ratings for both from the old ratings, K, S and E.
Update the new ratings RA, RB and the counts mA, mB in the database.
You may want to go with a combination.
First phase:
Hot-or-Not style (although I would go with a 3-option vote: Sucks, Meh/OK, Cool!)
Once you've sorted the set into the 3 buckets, I would then select two images from the same bucket and go with the "Which is nicer?" comparison.
You could then use an English football system of promotion and relegation to move the top few "Sucks" into the Meh/OK region, in order to refine the edge cases.
Ranking 1-10 won't work, everyone has different levels. Someone who always gives 3-7 ratings would have his rankings eclipsed by people who always give 1 or 10.
a-or-b is more workable.
Wow, I'm late in the game.
I like the ELO system very much, but as Owen says, it seems to me that you'd be slow to build up any significant results.
I believe humans have much greater capacity than just comparing two images, but you want to keep interactions to the bare minimum.
So how about you show n images (n being any number you can comfortably display on screen - maybe 10, 20 or 30 depending on the user's preference) and get them to pick the one they think is best in that lot. Now back to ELO. You need to modify your rating system, but keep the same spirit. You have in fact compared one image to n-1 others, so you do your ELO rating update n-1 times, but you should divide each change of rating by n-1 (so that results with different values of n are coherent with one another).
You're done. You've now got the best of all worlds. A simple rating system working with many images in one click.
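A sketch of that modified update, reusing the Elo formulas quoted earlier; the winner of the n-way pick is treated as beating each of the other n-1 images, with each rating change divided by n-1, and the function name is mine:

    def elo_update_n_way(ratings, winner, k=20):
        """Elo update when one image is picked as best out of n shown.

        ratings: dict image_id -> current rating, for the n images displayed.
        winner:  the image_id the user picked.
        Each rating change is divided by n-1 so one n-way click carries
        roughly the same weight as one ordinary pairwise vote.
        """
        n = len(ratings)
        scale = k / (n - 1)
        new_ratings = dict(ratings)
        for img, r in ratings.items():
            if img == winner:
                continue
            expected_winner = 1.0 / (10 ** ((r - ratings[winner]) / 400.0) + 1)
            delta = scale * (1.0 - expected_winner)
            new_ratings[winner] += delta
            new_ratings[img] -= delta
        return new_ratings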
If you prefer using the Pick A or B strategy I would recommend this paper: http://research.microsoft.com/en-us/um/people/horvitz/crowd_pairwise.pdf
Chen, X., Bennett, P. N., Collins-Thompson, K., & Horvitz, E. (2013, February). Pairwise ranking aggregation in a crowdsourced setting. In Proceedings of the sixth ACM international conference on Web search and data mining (pp. 193-202). ACM.
The paper describes the Crowd-BT model, which extends the famous Bradley-Terry pairwise comparison model to the crowdsourcing setting. It also gives an adaptive learning algorithm to improve the time and space efficiency of the model. You can find a Matlab implementation of the algorithm on GitHub (but I'm not sure if it works).
The defunct web site whatsbetter.com used an Elo style method. You can read about the method in their FAQ on the Internet Archive.
Pick A-or-B is the simplest and the least prone to bias; however, each human interaction gives you substantially less information. I think that because of the bias reduction, Pick A-or-B is superior, and in the limit it provides you with the same information.
A very simple scoring scheme is to have a count for each picture. When someone gives a positive comparison increment the count, when someone gives a negative comparison, decrement the count.
Sorting a 1-million integer list is very quick and will take less than a second on a modern computer.
That said, the problem is rather ill-posed - with 2,000 comparisons a day it will take you months just to show each image once.
I bet though you are more interested in the most highly ranked images? So, you probably want to bias your image retrieval by predicted rank - so you are more likely to show images that have already achieved a few positive comparisons. This way you will more quickly just start showing 'interesting' images.
I like the quick-sort option but I'd make a few tweaks:
Keep the "comparison" results in a DB and then average them.
Get more than one comparison per view by giving the user 4-6 images and having them sort them.
Select what images to display by running qsort and recording and trimming anything that you don't have enough data on. Then when you have enough items recorded, spit out a page.
The other fun option would be to use the crowd to teach a neural-net.
