Machine Learning: Which algorithm am I looking for?

Let's say I'm scraping a website of apples, and all of them have different prices, and different properties. The example data extracted could be:
Apple 1 -> color: red, days_old: 15, flavour: sweet -> 10$
Apple 2 -> color: green -> 3$
Apple 3 -> flavour: sweet -> 5$
Apple 4 -> color: red, days_old: 10 -> 15$
So in this example dataset I have 4 different apples with different properties (not only different values, but a different number of properties), and I would like to know how each property and value affects the final price of the apple. For example: color is red -> +5$, flavour is sweet -> +3$, more than 10 days_old -> -2$...
Which algorithm am I looking for?

Related

How to implement least cost path through matrix in Haskell

Hello, I have a particular question I can't find any resources on for Haskell. I'm looking to create a function that takes a matrix as a parameter and returns an array. Something like:
returnPossiblePaths :: [[Int]] -> [Int]
The condition, though, is that I return the array with the 'least cost path', that is, the path that has the lowest sum. So if I have the matrix:
[6 9 3
2 5 7]
I want to iterate from the head to the tail, add the numbers up in that path and return the array with the smallest sum. e.g:
6 -> 9 -> 3 -> 7 = 25
6 -> 9 -> 5 -> 7 = 27
6 -> 2 -> 5 -> 7 = 20
6 -> 2 -> 5 -> 9 -> 3 -> 7 = 32
So here my result array would be [6, 2, 5, 7]. I need help with how to go about this. I have no idea how to iterate from head to tail along different 'paths' without going through all the elements. My general plan was to collect all the paths as arrays, map sum over all of them, then compare the results and return the array with the smallest sum. So I would first get all the arrays (paths) from the matrix, then apply this function to them:
addm :: [Int] -> Int
addm xs = sum xs
store those values in a variable, compare them, then return the lowest one. I know Haskell has amazing functions that make this way easier and I was wondering if I could get help on how to go about doing this. Any advice is greatly appreciated, thanks!
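The plan described above (enumerate every path, sum each one, keep the smallest) can be sketched as follows. This is only an illustration, written in Python rather than Haskell, and it rests on assumptions read off the four example paths: a path starts at the top-left cell, ends at the bottom-right cell, may step right, down, or up, and never revisits a cell. The name least_cost_path is hypothetical. The enumeration is exponential, so it only suits small matrices.

```python
def least_cost_path(matrix):
    """Return the cheapest path (as a list of cell values) by brute force."""
    rows, cols = len(matrix), len(matrix[0])
    goal = (rows - 1, cols - 1)

    def paths(r, c, seen):
        # Yield every simple path from (r, c) to the goal.
        if (r, c) == goal:
            yield [matrix[r][c]]
            return
        # Assumed moves: right, down, up (taken from the example paths).
        for dr, dc in ((0, 1), (1, 0), (-1, 0)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in seen:
                for tail in paths(nr, nc, seen | {(nr, nc)}):
                    yield [matrix[r][c]] + tail

    # Compare the candidate paths by their sums and keep the smallest.
    return min(paths(0, 0, {(0, 0)}), key=sum)


print(least_cost_path([[6, 9, 3], [2, 5, 7]]))  # [6, 2, 5, 7]
```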

Calculating total number of mismatched socks combinations

This is one of the problems from the 2015 ICPC NorthWest Programming Contest and I was wondering if there is any easier or more efficient way of doing it.
Here's the problem:
"Fred likes to wear mismatched socks. This sometimes means he has to plan ahead.
Suppose his sock drawer has one red, one blue, and two green socks. If he wears the
red with the blue, he's stuck with matching green socks. He put together two
mismatched pairs if he pairs red with green and then blue with green. Given the
contents of his sock drawer, how many pairs of mismatched socks can he put together?"
Here's a sample input:
Color 1 -> 4 socks
Color 2 -> 3 socks
Color 3 -> 7 socks
Color 4 -> 11 socks
Color 5 -> 4 socks
The way I'm doing it is to first read the input into an array and sort it in increasing order. That way I'll have the highest number of socks at the end of the array. From there I basically compare arr[i] and arr[i-1] and take the min of the two. I add that to the total, save the remainder, and repeat the process going down the array. For example, using the sample input it looks something like this:
Sorted array: [3,4,4,7,11]
1:3 socks ---> 1:3 socks ---> 1:0 socks ---> 1:0 socks
2:4 socks ---> 2:4 socks ---> 2:1 socks ---> 2:1 socks
3:4 socks ---> 3:4 socks ---> 3:0 socks ---> 3:0 socks
4:7 socks ---> 4:0 socks ---> 4:0 socks ---> 4:0 socks
5:11 socks ---> 5:4 socks ---> 5:0 socks ---> 5:0 socks
------> total = 14 possible combinations of mismatched socks. This seems way too naive an approach. Does anyone have any ideas on how to optimize it? I can post my code for this if necessary.
I think a solution can be found by examining all possible groupings of the different sock colors into 2 piles. Each grouping yields p mismatched pairs, where p is the number of socks in the smaller pile, and you want the grouping that gives the maximum p. You can generate all possible groupings into 2 piles recursively. Note that this only ever pairs socks across the two piles, so it can undercount: for three colors with 2 socks each it finds 2 pairs, although 3 are possible (one pair for each two colors).
Here's some Java code to illustrate:
public class Socks
{
    public static void main(String[] args)
    {
        int[] socks = {3, 4, 4, 7, 11};
        System.out.println(count(0, 0, socks, 0));
    }

    // Put colour i into pile a or pile b; a full grouping yields min(a, b) pairs.
    static int count(int a, int b, int[] socks, int i)
    {
        if (i == socks.length)
        {
            return Math.min(a, b);
        }
        return Math.max(count(a + socks[i], b, socks, i + 1),
                        count(a, b + socks[i], socks, i + 1));
    }
}
Output:
14
Build a graph in which every sock is a vertex, and add an edge between every two socks of different colors.
Then find the size of a maximum matching, that is, the largest set of edges with no common vertices.
You can build a maximum matching with Edmonds' algorithm in polynomial time, O(V^2*E). The graph for this task will be dense, so the complexity tends to O(V^4). There is also the Micali and Vazirani algorithm, with lower complexity (I don't know how hard it is to implement).
If your task needs only the number of edges in the matching and not the matching itself, that value can be calculated with the randomized Lovász algorithm based on Tutte's matrix theorem. (I did not find a concise English description, perhaps the terms differ; a short one in Russian is here)
"HOPEFULLY CORRECT THIS TIME!" METHOD
Step 1: Check if there's 2 or more colours left. If there's none or one colour left you're finished (can't find more pairs).
Step 2: Find one colour that has the lowest non-zero count
Step 3: Excluding the colour with the lowest count (in case all colours have the same count); find the highest count and determine how many colours share the highest count
Step 4: Excluding the colour with the lowest count and all colours with the highest count; try to find the second highest count.
Step 5a: If there is a second highest count, calculate amount_to_pair = min(highest_count - second_highest_count, lowest_count).
Step 5b: If there isn't a second highest count, calculate amount_to_pair = lowest_count.
Step 6: Create amount_to_pair pairs by pairing socks from the colour with the lowest count with socks with colours that have the highest count as evenly as possible (e.g. if there's 9 red socks, 20 blue socks and 20 green socks; then create 5 "red and blue" pairs and 4 "red and green" pairs).
Step 7: Goto step 1.
Example (the pathological case mentioned in comments):
Initial condition
Color 1 -> 1 socks
Color 2 -> 20 socks
Color 3 -> 80 socks
Color 4 -> 81 socks
First iteration:
Color 1 -> 1 socks (lowest non-zero count)
Color 2 -> 20 socks
Color 3 -> 80 socks (2nd highest count)
Color 4 -> 81 socks (highest count)
Amount to remove = min(81-80, 1) = 1
Color 1 -> 1-1=0 socks (lowest non-zero count)
Color 2 -> 20 socks
Color 3 -> 80 socks
Color 4 -> 81-1=80 socks (highest count)
Results so far:
(1 pair of colour 1 and colour 4)
Second iteration:
Color 1 -> 0 socks
Color 2 -> 20 socks (lowest non-zero count)
Color 3 -> 80 socks (highest count)
Color 4 -> 80 socks (highest count)
Amount to remove = 20
Color 1 -> 0 socks
Color 2 -> 20-20=0 socks (lowest non-zero count)
Color 3 -> 80-(20/2)=70 socks (highest count)
Color 4 -> 80-(20-20/2)=70 socks (highest count)
Results so far:
(1 of colour 1 and colour 4)
(10 of colour 2 and colour 3)
(10 of colour 2 and colour 4)
Third iteration:
Color 1 -> 0 socks
Color 2 -> 0 socks
Color 3 -> 70 socks (lowest non-zero count)
Color 4 -> 70 socks (highest count)
Amount to remove = 70
Color 1 -> 0 socks
Color 2 -> 0 socks
Color 3 -> 70-70=0 socks (lowest non-zero count)
Color 4 -> 70-70=0 socks (highest count)
Results so far:
(1 of colour 1 and colour 4)
(10 of colour 2 and colour 3)
(10 of colour 2 and colour 4)
(70 of colour 3 and colour 4)
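The steps and the worked example above can be sketched in Python as follows. This is a minimal sketch of my reading of the steps, not a proof that the method is always optimal. The name pair_mismatched is hypothetical, colours are zero-based list indices, and ties for the lowest count are broken by taking the first such colour, which is an assumption the steps leave open.

```python
def pair_mismatched(counts):
    """Return (total_pairs, [(low_colour, high_colour, n), ...])."""
    counts = list(counts)
    pairs = []
    while True:
        live = [i for i, c in enumerate(counts) if c > 0]
        if len(live) < 2:                                   # step 1
            break
        low = min(live, key=lambda i: counts[i])            # step 2
        others = [i for i in live if i != low]
        highest = max(counts[i] for i in others)            # step 3
        top = [i for i in others if counts[i] == highest]
        rest = [counts[i] for i in others if counts[i] < highest]  # step 4
        if rest:                                            # step 5a
            amount = min(highest - max(rest), counts[low])
        else:                                               # step 5b
            amount = counts[low]
        # Step 6: spread the pairs over the highest colours as evenly as possible.
        share, extra = divmod(amount, len(top))
        for k, i in enumerate(top):
            n = share + (1 if k < extra else 0)
            if n:
                counts[i] -= n
                counts[low] -= n
                pairs.append((low, i, n))
    return sum(n for _, _, n in pairs), pairs


total, pairs = pair_mismatched([1, 20, 80, 81])
print(total)  # 91
print(pairs)  # [(0, 3, 1), (1, 2, 10), (1, 3, 10), (2, 3, 70)]
```

On the sample input from the question (4, 3, 7, 11, 4) the same function returns 14 pairs.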
ORIGINAL METHOD
WARNING: The method below gives wrong results in various pathological cases (and has been updated/replaced by the algorithm above). I've left it here to give context to some of the comments.
Starting condition:
Color 1 -> 4 socks
Color 2 -> 3 socks
Color 3 -> 7 socks
Color 4 -> 11 socks
Color 5 -> 4 socks
Find the highest count and lowest count; and cancel out whichever has lowest count so it ceases to exist:
Color 1 -> 4 socks
Color 3 -> 7 socks
Color 4 -> 11-3=8 socks
Color 5 -> 4 socks
Results so far:
(3 of colour 2 and colour 4)
Do it again:
Color 3 -> 7 socks
Color 4 -> 8-4=4 socks
Color 5 -> 4 socks
Results so far:
(3 of colour 2 and colour 4)
(4 of colour 1 and colour 4)
Do it again:
Color 3 -> 7-4=3 socks
Color 5 -> 4 socks
Results so far:
(3 of colour 2 and colour 4)
(4 of colour 1 and colour 4)
(4 of colour 4 and colour 3)
Do it again:
Color 5 -> 4-3=1 sock
Results so far:
(3 of colour 2 and colour 4)
(4 of colour 1 and colour 4)
(4 of colour 4 and colour 3)
(3 of colour 3 and colour 5)
Stop because there's only one colour left.

Algorithm to calculate the price volatility of a commodity

I am trying to design an algorithm to calculate how volatile the price fluctuations of a commodity are.
The way I would like this to work is that if the price of a commodity constantly goes up and down, it should have a higher score than if the price of the commodity gradually increases and then falls in price rapidly.
Here is an example of what I mean:
Commodity A: 1 -> 2 -> 3 -> 2 -> 1 -> 3 -> 4 -> 2 -> 1
Commodity B: 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7 -> 8 -> 2
Commodity C: 1 -> 2 -> 3 -> 4 -> 5 -> 4 -> 3 -> 2-> 1
Commodity A has a 'wave' like pattern in that its price goes up and falls down on a regular basis.
Commodity B has a 'cliff' like pattern in that the price goes up gradually and then falls steeply.
Commodity C has a 'hill' like pattern in that the price rises gradually and then falls gradually.
A should receive the highest ranking, followed by C, followed by B. The more of a wave pattern the price of the commodity follows, the higher a ranking it should have.
Does anyone have any suggestions for an algorithm that could do this?
Thanks!
My approach looks something like this.
For my algorithm, I am considering the above example.
A: 1 -> 2 -> 3 -> 2 -> 1 -> 3 -> 4 -> 2 -> 1
B: 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7 -> 8 -> 2
C: 1 -> 2 -> 3 -> 4 -> 5 -> 4 -> 3 -> 2-> 1
Now I will squash these lists. By squash I mean keeping only the start and end values of each increasing or decreasing run.
After squashing, the lists look like this:
A: 1 -> 3 -> 1 -> 4 -> 1
B: 1 -> 8 -> 2
C: 1 -> 5 -> 1
Once this is done, I take the absolute difference between elements i and i+1, average those differences, and rank the commodities by that average.
The differences between elements i and i+1 look like this (shown in parentheses on each arrow):
A: 1 -(2)-> 3 -(2)-> 1 -(3)-> 4 -(3)-> 1
B: 1 -(7)-> 8 -(6)-> 2
C: 1 -(4)-> 5 -(4)-> 1
Now let's sum these differences and take the average.
A: (2+2+3+3)/4 = 2.5
B: (7+6)/2 = 6.5
C: (4+4)/2 = 4
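Now we can assign ranks based on this average, treating a lower average as a more wave-like series:
A < C < B
So A ranks highest, followed by C, then B, which is the order asked for.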
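The squash-and-average approach above can be sketched in Python as follows. This is a minimal sketch, assuming each series has at least two prices and no two consecutive prices are equal (flat stretches are not handled). The names squash and average_swing are hypothetical.

```python
def squash(prices):
    """Keep the first price, every turning point, and the last price."""
    turning = [prices[0]]
    for prev, cur, nxt in zip(prices, prices[1:], prices[2:]):
        # The direction changes at cur when the two steps have opposite signs.
        if (cur - prev) * (nxt - cur) < 0:
            turning.append(cur)
    turning.append(prices[-1])
    return turning


def average_swing(prices):
    """Average absolute difference between neighbours of the squashed list."""
    s = squash(prices)
    diffs = [abs(b - a) for a, b in zip(s, s[1:])]
    return sum(diffs) / len(diffs)


series = {
    "A": [1, 2, 3, 2, 1, 3, 4, 2, 1],
    "B": [1, 2, 3, 4, 5, 6, 7, 8, 2],
    "C": [1, 2, 3, 4, 5, 4, 3, 2, 1],
}
# Lower average swing is treated as more wave-like, so sort ascending.
ranking = sorted(series, key=lambda name: average_swing(series[name]))
print(ranking)  # ['A', 'C', 'B']
```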
Hope this helps!

merging linear lists - reconstruct railway network

I need to reconstruct the sequence of stations in a railway network from the sequences of single trips requested from an arbitrary station. No direction is given in the data, but every request returns a terminal stop. The sequences of single trips can have gaps.
The (end-) result is always a linear list - forking is not allowed.
For example:
Result trips from requested station "4" :
4 - 3 - 2 - 1
4 - 1
4 - 5 - 6
4 - 8 - 9
4 - 6 - 7 - 8 - 9
manually reordered:
1 - 2 - 3 - 4
1 - 4
- 4 - 5 - 6
- 4 - 8 - 9
- 4 - 6 - 7 - 8 - 9
After merging result should be:
1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9
start/stop: 1, 9
Is there an algorithm to calculate the resulting "rope of pearls" list? I tried to figure it out with Perl's Graph module, but had no luck. My books on algorithms don't help either.
I think there are pathological cases where multiple solutions are possible, depending on the input data.
Maybe someone has an idea to solve it!
As you see in the answers, there is more than one solution. So here's a real-world dataset:
2204236 -> 2200007 -> 2200001
2204236 -> 2203095 -> 2203976 -> 2200225 -> 2200007 -> 2200001
2204236 -> 2204805 -> 2204813 -> 2204401 -> 2219633 -> 2204476 -> 2202024 -> 2202508 -> 2202110 -> 2202026
2204236 -> 2204813 -> 2204401 -> 2219633 -> 2202508 -> 2202110 -> 2202026 -> 3011047 -> 3011048 -> 3011049
2204236 -> 2204813 -> 2204401 -> 2219633 -> 2204476 -> 2202024 -> 2202508 -> 2202110 -> 2202352 -> 2202026
2204236 -> 2204813 -> 2204401 -> 2219633 -> 2204476 -> 2202024 -> 2202508 -> 2209637 -> 2202110
solution of the example data with perl:
use Graph::Directed;
use Graph::Traversal::DFS;
my $g = Graph::Directed->new;
$g->add_path(1,2,3,4);
$g->add_path(1,4);
$g->add_path(4,5,6);
$g->add_path(4,8,9);
$g->add_path(4,6,7,8,9);
print "The graph is $g\n";
my @topo = $g->toposort;
print "g toposorted = @topo\n";
Output
> The graph is 1-2,1-4,2-3,3-4,4-5,4-6,4-8,5-6,6-7,7-8,8-9
> g toposorted = 1 2 3 4 5 6 7 8 9
Using the other direction
$g->add_path(4,3,2,1);
$g->add_path(4,1);
$g->add_path(4,5,6);
$g->add_path(4,8,9);
$g->add_path(4,6,7,8,9);
reveals the second solution
The graph is 2-1,3-2,4-1,4-3,4-5,4-6,4-8,5-6,6-7,7-8,8-9
g toposorted = 4 3 2 1 5 6 7 8 9
Treat the lists as node links in a graph. 4-3-2-1 means 4 must come before 3, 3 before 2, and 2 before 1, so add arcs from 4 to 3, 3 to 2, and 2 to 1.
Once you have all of those, run a topological sort (look it up on Wikipedia) on the resulting graph. This guarantees that the order you get always respects the partial orderings you are given.
The only case in which you will not find a solution is when the data contradicts itself (if you have 4-3-2 and 4-2-3, no ordering is possible).
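The arcs-plus-topological-sort idea above can be sketched in Python as follows. This is a minimal sketch using Kahn's algorithm. The name merge_trips is hypothetical, and picking the smallest available station first is an assumption added only to make the output reproducible; any other tie-break gives another valid order.

```python
import heapq


def merge_trips(trips):
    """Topologically sort the stations, treating each trip as a chain of arcs."""
    succ, indeg = {}, {}
    for trip in trips:
        for node in trip:
            succ.setdefault(node, set())
            indeg.setdefault(node, 0)
        for a, b in zip(trip, trip[1:]):
            if b not in succ[a]:          # count each arc only once
                succ[a].add(b)
                indeg[b] += 1
    ready = [n for n, d in indeg.items() if d == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        n = heapq.heappop(ready)          # assumed tie-break: smallest first
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                heapq.heappush(ready, m)
    if len(order) != len(indeg):
        raise ValueError("trips contradict each other (cycle in the graph)")
    return order


trips = [[1, 2, 3, 4], [1, 4], [4, 5, 6], [4, 8, 9], [4, 6, 7, 8, 9]]
print(merge_trips(trips))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Contradictory input such as 4-3-2 together with 4-2-3 leaves a cycle, which this sketch reports with a ValueError.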
You are right, there are multiple cases. Another good solution is 4-5-6-7-8-9-3-2-1, for your example.
The terminal stop station is an articulation node, and it splits the graph into multiple partitions: all nodes inside a partition are reachable from one another, while nodes in different partitions are reachable only via the known terminal stop station. The number of partitions is 2 in your example, but it may be much larger; consider the star-like structure 1 - 2, 1 - 3, 1 - 4, 1 - 5.
First of all you need to enumerate the partitions. Treat your graph as undirected and run a DFS from the stop station in each direction. The first run discovers partition #1, the second run partition #2, and so on.
Then treat your graph as directed, with the stop station as the root node for all partitions, and run a topological sort (TS) for each partition.
Possible outcomes:
TS for one of partitions fails. This means there is no solution.
Number of partitions is one and TS for it succeeds. Solution is unique.
Number of partitions is more than one and TS succeeds for all of them. This means there are multiple solutions. To get any single valid result, you choose some partition and declare that it contains another terminal station. All other partitions are inserted into the first one in between arbitrary pair of nodes.
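The partition-enumeration step described above can be sketched in Python as follows. This is a minimal sketch, assuming the trips are given as lists of station ids and that the requested station is the root. The name partitions is hypothetical, and the topological sort per partition is left out.

```python
def partitions(trips, root):
    """Split the stations into groups that are connected without passing root."""
    adj = {}
    for trip in trips:
        for a, b in zip(trip, trip[1:]):
            adj.setdefault(a, set()).add(b)   # undirected: store both directions
            adj.setdefault(b, set()).add(a)
    seen = {root}                             # never walk through the root
    parts = []
    for start in sorted(adj.get(root, ())):
        if start in seen:
            continue
        part, stack = set(), [start]
        seen.add(start)
        while stack:                          # iterative DFS
            n = stack.pop()
            part.add(n)
            for m in adj[n]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        parts.append(part)
    return parts


trips = [[4, 3, 2, 1], [4, 1], [4, 5, 6], [4, 8, 9], [4, 6, 7, 8, 9]]
print(partitions(trips, 4))  # [{1, 2, 3}, {5, 6, 7, 8, 9}]
```

For the star-like structure 1 - 2, 1 - 3, 1 - 4, 1 - 5 with root 1, the same function returns four single-station partitions.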

Sort on a separate table without joining the result in pandas?

I have the following data:
fruit = pd.DataFrame({'fruit': ['apple', 'orange', 'apple', 'blueberry'],
'colour': ['red', 'orange', 'green', 'black']})
costs = pd.DataFrame({'fruit': ['apple', 'orange', 'blueberry'],
'cost': [1.7, 1.4, 2.1]})
I want a copy of the fruit table sorted by cost from the costs table, but without the cost column included. What's the best way to do this? It's fine if there's a join in an intermediate step - I'm mostly worried about long-term memory waste.
I would do a left merge and then argsort:
In [11]: fruit.merge(costs, how="left")
Out[11]:
colour fruit cost
0 red apple 1.7
1 orange orange 1.4
2 green apple 1.7
3 black blueberry 2.1
Note: if you used a different index (for fruit), it will be ignored/replaced with range(0, len(fruit)).
In [12]: fruit.merge(costs, how="left")["cost"].argsort()
Out[12]:
0 1
1 0
2 2
3 3
Name: cost, dtype: int64
Now reorder using iloc (by position) rather than loc (by label).
In [13]: fruit.iloc[fruit.merge(costs, how="left")["cost"].argsort()]
Out[13]:
colour fruit
1 orange orange
0 red apple
2 green apple
3 black blueberry
Note: It's important to left merge as an ordinary merge will change the order (!!). It's also more efficient.
An alternative, cleaner, but less efficient way:
In [21]: fruit.merge(costs).sort_values("cost").loc[:, fruit.columns]
Out[21]:
colour fruit
2 orange orange
0 red apple
1 green apple
3 black blueberry
Note: DataFrame.sort was deprecated in favour of sort_values and has since been removed, so use sort_values in current pandas.
Why don't you merge the tables and then drop the unneeded column?
pd.merge(fruit, costs).sort_values(by='cost').drop('cost', axis=1)
