In k-fold cross-validation we have this:
you divide the data into k subsets of
(approximately) equal size. You train the net k times, each time leaving
out one of the subsets from training, but using only the omitted subset to
compute whatever error criterion interests you. If k equals the sample
size, this is called "leave-one-out" cross-validation. "Leave-v-out" is a
more elaborate and expensive version of cross-validation that involves
leaving out all possible subsets of v cases.
What do the terms "training" and "testing" mean? I can't understand them.
Would you please point me to some references where I can learn this algorithm through an example?
Train classifier on folds: 2 3 4 5 6 7 8 9 10; Test against fold: 1
Train classifier on folds: 1 3 4 5 6 7 8 9 10; Test against fold: 2
Train classifier on folds: 1 2 4 5 6 7 8 9 10; Test against fold: 3
Train classifier on folds: 1 2 3 5 6 7 8 9 10; Test against fold: 4
Train classifier on folds: 1 2 3 4 6 7 8 9 10; Test against fold: 5
Train classifier on folds: 1 2 3 4 5 7 8 9 10; Test against fold: 6
Train classifier on folds: 1 2 3 4 5 6 8 9 10; Test against fold: 7
Train classifier on folds: 1 2 3 4 5 6 7 9 10; Test against fold: 8
Train classifier on folds: 1 2 3 4 5 6 7 8 10; Test against fold: 9
Train classifier on folds: 1 2 3 4 5 6 7 8 9; Test against fold: 10
In short:
Training is the process of providing feedback to the algorithm in order to adjust the predictive power of the classifier(s) it produces.
Testing is the process of determining the realistic accuracy of the classifier(s) which were produced by the algorithm. During testing, the classifier(s) are given never-before-seen instances of data to do a final confirmation that the classifier's accuracy is not drastically different from that during training.
However, you're missing a key step in the middle: validation (which is what you're referring to with 10-fold/k-fold cross-validation).
Validation is (usually) performed after each training step, in order to help determine whether the classifier is being overfitted. The validation step does not provide any feedback to the algorithm to adjust the classifier, but it helps determine whether overfitting is occurring, and it signals when training should be terminated.
Think about the process in the following manner:
1. Train on the training data set.
2. Validate on the validation data set.
3. If validation accuracy is still improving, repeat steps 1 and 2; otherwise, stop training.
4. Test on the testing data set.
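A minimal sketch of this loop in Python with scikit-learn; the dataset, the 60/20/20 split, and the choice of SGDClassifier are illustrative assumptions, not part of the answer above:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Illustrative split into training, validation, and testing sets.
X, y = load_digits(return_X_y=True)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

clf = SGDClassifier(random_state=0)
best_val = 0.0
for epoch in range(100):
    clf.partial_fit(X_train, y_train, classes=np.unique(y))  # step 1: train
    val_acc = clf.score(X_val, y_val)                        # step 2: validate
    if val_acc <= best_val:                                  # step 3: no improvement,
        break                                                # so stop training
    best_val = val_acc
print("test accuracy:", clf.score(X_test, y_test))           # step 4: test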
In the k-fold method, you divide the data into k segments; k-1 of them are used for training, while one is left out and used for testing. This is done k times: the first time, the first segment is used for testing and the rest for training; the second time, the second segment is used for testing and the rest for training; and so on. This is clear from your 10-fold example above.
Now about what training is and what testing is:
Training in classification is the part where a classification model is created using some algorithm; popular algorithms for building such models include ID3 and C4.5.
Testing means evaluating the classification model by running it over the test data, building a confusion matrix, and then calculating the model's accuracy and error rate.
In the k-fold method, k models are created (as is clear from the description above), and the most accurate model is selected.
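A minimal sketch of this procedure, again assuming scikit-learn; the iris data and the decision-tree model are illustrative stand-ins for the ID3/C4.5-style learners mentioned above:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
best_model, best_acc = None, 0.0
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])  # train on k-1 folds
    acc = model.score(X[test_idx], y[test_idx])                       # test on the held-out fold
    if acc > best_acc:
        best_model, best_acc = model, acc                             # keep the best of the k models
print("best fold accuracy:", best_acc)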
Related
I have some input data like this.
unique ID  Q1  Q2  Q3
1          1   1   2
2          1   1   2
3          1   0   3
4          2   0   1
5          3   1   2
6          4   1   3
My goal is to extract a subset of rows that satisfies the following conditions:
total count: 4
Q1=1 count: 2
Q1=2 count: 1
Q2=1 count: 1~3
Q3=1 count: 1
In this case, both data set with ids [1, 2, 4, 5] or [2, 3, 4, 5] are acceptable answers.
In reality, I will possibly have 6000+ rows of data and up to 12 count constraints like those above. Each count might vary from 1 to 50.
I've written a solution that first groups all the IDs by each condition, then uses depth-first search to exhaustively try all possible combinations between the groups. (I believe this is a brute-force solution...)
However, I always run out of memory and time before I can get an answer.
My questions are:
What is the lowest possible time complexity for this problem? (I believe it is a kind of subset-sum problem, but I am not sure.)
How can I solve this problem without brute force? I'm considering dynamic programming or a decision tree, but I suspect I would run out of memory with either of them. Or can I solve it using each row's probabilities/entropy? (I would appreciate more details on this.)
My brute-force sample code is not worth reading at all, so I'll skip posting my snippets...
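For concreteness, here is a small illustrative backtracking sketch (not the asker's code) over the six-row example above; the exact-count and range checks mirror the listed conditions, and all names are made up:

rows = [  # (id, Q1, Q2, Q3), from the table above
    (1, 1, 1, 2),
    (2, 1, 1, 2),
    (3, 1, 0, 3),
    (4, 2, 0, 1),
    (5, 3, 1, 2),
    (6, 4, 1, 3),
]
TOTAL = 4           # total count: 4

def feasible(sel):
    q1 = [r[1] for r in sel]
    q2 = [r[2] for r in sel]
    q3 = [r[3] for r in sel]
    return (q1.count(1) == 2 and q1.count(2) == 1      # Q1=1 count 2, Q1=2 count 1
            and 1 <= q2.count(1) <= 3                  # Q2=1 count 1~3
            and q3.count(1) == 1)                      # Q3=1 count 1

def search(i, sel):
    if len(sel) == TOTAL:
        return sel[:] if feasible(sel) else None
    if i == len(rows) or len(sel) + (len(rows) - i) < TOTAL:
        return None                                    # prune: not enough rows left
    sel.append(rows[i])                                # branch: take row i
    found = search(i + 1, sel)
    sel.pop()
    return found or search(i + 1, sel)                 # branch: skip row i

print([r[0] for r in search(0, [])])                   # prints [1, 2, 4, 5]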
I have thought long about this but couldn't figure it out. I am looking for an algorithm (in any language) to group a bunch of people by following these 2 rules:
Group by ascending skill level, which is represented by a number (the higher, the more skilled). The best and weakest in a group should not differ by more than 1 point, where possible.
Spread out people from the same country as far as possible, i.e. don't put people from the same country in the same group, while at the same time not breaking rule 1 above. A group should not consist of people from a single country, where possible.
Each group can have at most 4 people (where possible) or 3 people; e.g., if there are 18 people, they are split into 3 groups of 4 and 2 groups of 3.
Sample data (Skill level followed by country) :
5 US
5 US
5 US
5 US
6 GB
6 GB
6 GB
7 CN
7 CN
7 CN
7 CN
7 HK
8 US
8 US
8 US
8 CA
8 CN
8 CN
..to be grouped into 3 groups of 4 and 2 groups of 3 (18 people in total)
Please help if you have any ideas.
Thank you in advance.
I would suggest the following.
First, aggregate the data by country and skill level, so the data looks more like (country, skill level, count):
US 5 4
GB 6 3
. . .
Sort this by the highest ranking first.
Then use a greedy algorithm.
Determine the number of members in the group (either size or size - 1)
Take one from the first group (highest ranking).
Continue taking one from each subsequent group meeting the country condition (so you might need to skip the US).
That defines the first group.
Then repeat.
This is not guaranteed to be optimal. But then again, optimality is not defined for the problem. Which is more important? Country diversity or skill sameness?
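A rough Python sketch of this greedy idea; the sample data comes from the question and the group sizes follow its 18-person rule, while the helper names, tie-breaking, and the fallback pass (for when distinct countries run out, per the "where possible" wording) are my own assumptions:

from collections import Counter

people = [(5, "US")] * 4 + [(6, "GB")] * 3 + [(7, "CN")] * 4 + [(7, "HK")] \
       + [(8, "US")] * 3 + [(8, "CA")] + [(8, "CN")] * 2

pool = Counter(people)                       # aggregate by (skill, country)
order = sorted(pool, key=lambda p: -p[0])    # highest skill level first

def next_group(size):
    group, used = [], set()
    for key in order:                        # first pass: one per country, best first
        if len(group) == size:
            break
        if pool[key] and key[1] not in used:
            group.append(key)
            pool[key] -= 1
            used.add(key[1])
    for key in order:                        # fallback: fill remaining slots
        while len(group) < size and pool[key]:
            group.append(key)
            pool[key] -= 1
    return group

for size in [4, 4, 4, 3, 3]:                 # 18 people -> 3 groups of 4, 2 of 3
    print(next_group(size))

As the answer says, this is not guaranteed to be optimal; the fallback pass will sometimes repeat a country or widen the skill spread when no other choice remains.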
I have a bunch of data where the first column represents users, the second column is movies, and the third is a ten-point rating.
0 0 9
0 1 8
1 1 4
1 2 6
2 2 7
And I have to predict the third number for another set of data (user, movie, ?):
0 2
1 0
2 0
2 1
I use this video for finding bias values, https://youtube.com/watch?v=dGM4bNQcVKI, and this one for predicting, https://www.youtube.com/watch?v=4RSigTais8o.
Bias value for user number 0: (9 + 8) / 2 - 1.5 = 8.5 - 1.5 = 7.
Bias value for movie number 2: (6 + 7) / 2 - 1.5 = 6.5 - 1.5 = 5.
And baseline predictors:
1.5 + 7 + 5 = 13.5, but the contest result is 7.052009.
But the problem description says the result of my Recommendation system should be:
0 2 7.052009
1 0 6.687943
2 0 6.995272
2 1 6.687943
Where is my mistake?
The raw average is the average of ALL the present scores: (9+8+4+6+7) / 5 = 6.8. I don't see that number anywhere in your calculations, so I guess that's your error.
In the video, the professor used a raw average of 3.5 in all the calculations, including the bias calculations. He skipped over how that number is reached, but if you add up all the numbers in the video's table and divide, you get 3.5.
0 2 8.2 is the answer for the first one, using your videos as a guide. The videos claim to have avoided calculus; the different final answers of the contest probably come from using the "full" method.
0 2 ?, user 0 (row 0: 9 8 x), movie 2 (column 2: x 6 7)
raw average = 6.8
bias user 0: (9+8) / 2 - 6.8 = 1.7
bias movie 2: (6+7) / 2 - 6.8 = -0.3
prediction: 6.8+1.7-0.3 = 8.2
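A small Python sketch of this baseline calculation; the ratings are the ones given in the question, and everything else (names, layout) is illustrative:

ratings = {(0, 0): 9, (0, 1): 8, (1, 1): 4, (1, 2): 6, (2, 2): 7}  # (user, movie) -> score

mu = sum(ratings.values()) / len(ratings)        # raw average = 6.8

def bias(index, value):
    # Mean of the ratings in that user's row (index 0) or movie's column (index 1), minus mu.
    vals = [r for k, r in ratings.items() if k[index] == value]
    return sum(vals) / len(vals) - mu

def predict(user, movie):
    return mu + bias(0, user) + bias(1, movie)   # baseline predictor

print(round(predict(0, 2), 1))                   # 8.2, matching the hand calculation above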
The problem looks like a variation of the Netflix Contest. The contest's host knows the actual answers (the ratings) but doesn't give them to you; you are expected to guess/predict them, and the winner of the contest is the one who gets closest to the actual answers.
The winner of your contest got the closest, but he got there using an unknown method, or his own variation of a known method. If your goal is to match his answer exactly, you are better off asking him what method he used and how he modified it, and then trying to replicate his results.
If this were homework and not a contest, the teacher would expect you to use the "correct" method he taught you (there's no single set method, just many methods with different accuracy), exactly as he taught it. But this is a contest: your goal is to find a base method that approximates well (the one you used is very low on accuracy) and tinker with it to get even better results.
If you want to understand the link, I suggest you do some research and later ask a statistics question, because it's just plain statistics. You can try to understand the link or research matrix factorization on your own. Remember that to get contest-winning results (or close), you won't be able to use a simple method like the one in the YouTube video; you'll need a method with a lot more math.
I am looking for a solution to a task similar to the Tower of Hanoi; however, it differs from Hanoi in that the disks are not constrained by size. The Tower of London task I am creating has 8 disks, instead of the traditional 3 or 5 (as shown in the Wikipedia link). I am using PEBL software, which is "programmed primarily in C++ (although you do not need to know C++ to use PEBL), but also uses flex and bison (GNU versions of lex and yacc) to handle parsing."
Here is a video of what the task looks like in action: http://www.youtube.com/watch?v=IiBJ94HRpeM&noredirect=1
Each disk is represented by a number, e.g., blue disk = 1, red disk = 2, etc.
1 \
2 ----\
3 ----/ 3 1
4 5 / 2 4 5
========= =========
The left side consists of the disks you have to move, to match the right side. There are 3 columns.
So if I am making it with 8 disks, I would create a trial to look like this:
1 \
2 ----\ 7 8
6 3 8 ----/ 3 6 1
7 4 5 / 2 4 5
========= =========
How do I figure out the minimum number of moves needed to make the left side look like the right? I don't need to use PEBL to code this, but I need to know the minimum because I am calculating how close to it a person gets on each trial.
The principle is simple, and it's called breadth-first search:
Each state has a certain number of successor states (defined by the possible moves).
1. Start with a set of states containing the initial state, and a step number of 0.
2. If the end state is in the set of states, return the step number.
3. Increment the step number.
4. Rebuild the set of states by replacing the current states with each of their successor states.
5. Go to step 2.
So, in each step, compute the successor states of your currently available states and look if you reached the target state.
BUT, be warned, this can take a while and eat up a lot of memory!
You can optimize a bit in your case, since you can leave out the predecessor state.
Still, you will have 5 possible moves in most states, which means you will have about 5^N states to consider after N steps.
For example, your second example will need 10 moves, if I'm not mistaken. This gives you about 10 million states. Most contemporary computers will not be able to search beyond depth 15.
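A toy Python sketch of this search, under a simplified move rule I'm assuming (any top disk may move to any other column; real Tower of London boards also limit column capacity, which is omitted here), with a made-up 3-disk start/goal rather than the 8-disk trial above:

def moves(state):
    # Yield every state reachable by moving one top disk to another column.
    for src in range(3):
        if not state[src]:
            continue
        for dst in range(3):
            if src == dst:
                continue
            cols = [list(c) for c in state]
            cols[dst].insert(0, cols[src].pop(0))    # move top disk: src -> dst
            yield tuple(tuple(c) for c in cols)

def min_moves(start, goal):
    frontier, seen, steps = {start}, {start}, 0
    while frontier:
        if goal in frontier:                         # BFS: first hit is shortest
            return steps
        steps += 1
        frontier = {s for st in frontier for s in moves(st)} - seen
        seen |= frontier
    return None                                      # goal unreachable

start = ((1, 2), (3,), ())     # columns as tuples of disks, top first
goal = ((), (3,), (1, 2))
print(min_moves(start, goal))  # 3 moves for this toy position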
I think an algorithm to find some solution would be easy and fast, but we would have no proof that this solution is the shortest one.
I'm lost here. Here's the problem and I think it's NP-hard. A center is staffed with a finite number of workers with the following conditions:
There are 3 shifts per day with 2 people in each shift
Each employee works 5 days straight and then has 2 days off, with only one shift per day
So the problem is: how many workers do we need to keep the center active every day, and what does a feasible schedule look like?
Update:
Thanks for all the great answers. The closest I've come (with a randomized brute-force algorithm) is the following:
X 3 0
1 0 3
2 3 1
2 1 3
0 1 2
0 2 1
3 0 2
I've simplified the problem into batches of 2 people (0-3 represent 4 batches) in the hope of getting a feasible solution. X refers to a shift that has not been assigned (which was not the initial goal, but it looks like there may not be an alternative).
The constraints cannot be respected exactly as expressed in the question.
That's because the numbers don't add up (or rather "divide up").
Consequently, the problem should be reworded to require
exactly 3 shifts per day
exactly 2 workers per shift
workers work a maximum of 5 consecutive days
workers rest a minimum of 2 consecutive days
With the introduction of the minimum and maximum qualifiers, the minimum number of workers required is 9 (again assuming no part-time workers).
Note that although 9 appears to be an absolute minimum, given the need to cover 42 worker-shifts per week (3 shifts * 2 workers * 7 days) with workers who can each cover a maximum of 5 shifts per week (5 work days + 2 rest days = a week), there is no a priori assurance that 9 would be sufficient once the consecutive work and/or rest day requirements are taken into account.
This is how I figure...
8 workers aren't enough, and the following 9-worker line-up is an example of a schedule that works.
To make things easy, I assigned all workers except #1 and #9 an optimal schedule of exactly 5 days on and 2 days off; #1 and #9 work less. Of course, many other arrangements would work (maybe this is what the OP sensed when hinting at an NP-complete problem). Also, each week's schedule is exactly the same for everyone, but that could be changed (perhaps introducing some fairness by giving every worker a lighter week once in a while, though this can make it harder to respect the 5-consecutive-work-day maximum).
The sample schedule shows two consecutive weeks to help see the consecutive work and rest days, but, as said, all weeks are the same for everyone.
Max Conseq Ws Min Conseq Rs
Worker #1 RRWWWRW RRWWWRW 3 2
Worker #2 WWWWWRR WWWWWRR 5 2
Worker #3 WWWRRWW WWWRRWW 5 2
Worker #4 WWWRRWW WWWRRWW 5 2
Worker #5 WRRWWWW WRRWWWW 5 2
Worker #6 WRRWWWW WRRWWWW 5 2
Worker #7 RWWWWWR RWWWWWR 5 2
Worker #8 RWWWWWR RWWWWWR 5 2
Worker #9 WWRRRRW WWRRRRW 3 3
Nb of Ws 6666666 6666666
The tally at the bottom shows exactly 6 workers per day (respecting the need to cover 3 shifts with 2 workers each), the max and min columns on the right show that the maximum consecutive work and minimum consecutive rest requirements are respected.
3 shifts per day * 2 people per shift * (7 days per week / 5 working days per person) = 8.4 people (9 if part time is not an option).
3 shifts x 7 days = 21 slots per week; 21 does not divide evenly by either 5 or 2, so your constraints will not allow a complete filling of the slots.
OK - even though you have an answer, let me take a shot.
Let's take the general problem: 7 days x 3 shifts = 21 different shifts to fill
There are 7 possible employee schedules expressed as days on (1) & days off (0)
MTWTFSS
0011111
1001111
1100111
1110011
1111001
1111100
0111110
We want to minimize the number of scheduled employees while meeting the staffing requirement for every shift.
I have a matrix of the number of employees of each schedule type per shift, and each of those numbers is an integer variable. My optimization model is:
Minimize: the number of employees
Subject to: sum of (# of employees on each schedule * that schedule's vector) = staff required for each shift
and
the number of employees on each schedule is a non-negative integer
You can change the = sign in the first constraint to a >=; then you'll get a feasible solution with extra staff. You can solve this in Excel with the basic SOLVER add-in.
Let's say I need four employees on shift each day, but I'm willing to tolerate extra staff.
A solution using the schedules above is:
Number of staff by schedule type: 0,2,0,2,0,2,0
Schedule types 0011111,1001111,1100111,1110011,1111001,1111100,0111110
(In other words, 2 with schedule 1001111, 2 with schedule 1110011, and 2 more with schedule 1111100.)
This results in one day (Monday) with two extra staff and 4 employees on all the other days.
Of course, this isn't a unique solution. There are at least 6 other solutions with two extra staff members. Constraint programming would be a better and much faster approach since there will often be many feasible schedules.
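A small Python sketch of this model, solved here by brute-force enumeration rather than the Excel SOLVER add-in (fine at this size), using the >= form of the constraint as suggested above:

from itertools import product

scheds = ["0011111", "1001111", "1100111", "1110011",
          "1111001", "1111100", "0111110"]   # the 7 schedule types above
need = 4                                     # staff required each day

best = None
for counts in product(range(need + 1), repeat=len(scheds)):  # 5^7 combinations
    daily = [sum(c * int(s[d]) for c, s in zip(counts, scheds))
             for d in range(7)]
    if all(day >= need for day in daily):                    # cover every day
        if best is None or sum(counts) < sum(best):
            best = counts

print(best, "total employees:", sum(best))   # prints a minimum-size assignment (6 employees here)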