Suppose you have a column like
[8 8 8 8 8 1 4 4 4 1 1]'
What code could I write to find the values that are not part of a consecutive run, i.e. whose neighbours above and below are both different? In this case, what code would find row 6? This is for big data.
--Dwight
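A minimal sketch of one way to do this, written in Python rather than in the array notation of the question. The function name isolated_rows is hypothetical; it reports 1-based row numbers whose value differs from both neighbours, in a single pass over the column.

```python
def isolated_rows(values):
    """Return 1-based row numbers of values that are not part of a run."""
    n = len(values)
    rows = []
    for i, v in enumerate(values):
        same_as_prev = i > 0 and values[i - 1] == v
        same_as_next = i < n - 1 and values[i + 1] == v
        if not same_as_prev and not same_as_next:
            rows.append(i + 1)  # convert 0-based index to 1-based row
    return rows


column = [8, 8, 8, 8, 8, 1, 4, 4, 4, 1, 1]
result = isolated_rows(column)
print(result)  # [6]
```

The trailing pair of 1s is a run of length two, so only row 6 is reported.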
Related
I have some input data like this.
unique ID  Q1  Q2  Q3
1          1   1   2
2          1   1   2
3          1   0   3
4          2   0   1
5          3   1   2
6          4   1   3
My goal is to extract a subset of rows that satisfies the following conditions:
total count: 4
Q1=1 count: 2
Q1=2 count: 1
Q2=1 count: 1~3
Q3=1 count: 1
In this case, the data sets with ids [1, 2, 4, 5] and [2, 3, 4, 5] are both acceptable answers.
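To make the conditions concrete, here is a small sketch that checks a candidate set of ids against them and, for the six sample rows only, enumerates every combination. The names ROWS, CONDITIONS and satisfies are illustrative assumptions, and the exhaustive loop is only practical on tiny inputs like this one; it is not a solution for 6000+ rows.

```python
from itertools import combinations

# Sample rows from the table above: id -> answers.
ROWS = {
    1: {"Q1": 1, "Q2": 1, "Q3": 2},
    2: {"Q1": 1, "Q2": 1, "Q3": 2},
    3: {"Q1": 1, "Q2": 0, "Q3": 3},
    4: {"Q1": 2, "Q2": 0, "Q3": 1},
    5: {"Q1": 3, "Q2": 1, "Q3": 2},
    6: {"Q1": 4, "Q2": 1, "Q3": 3},
}

TOTAL = 4
# (column, value, minimum count, maximum count)
CONDITIONS = [
    ("Q1", 1, 2, 2),
    ("Q1", 2, 1, 1),
    ("Q2", 1, 1, 3),
    ("Q3", 1, 1, 1),
]


def satisfies(ids):
    """True if the chosen ids meet the total and every count condition."""
    if len(set(ids)) != TOTAL:
        return False
    for column, value, low, high in CONDITIONS:
        count = sum(1 for i in ids if ROWS[i][column] == value)
        if not low <= count <= high:
            return False
    return True


print(satisfies([1, 2, 4, 5]))  # True
print(satisfies([2, 3, 4, 5]))  # True

# Exhaustive check, feasible only because there are six rows here.
solutions = [list(c) for c in combinations(sorted(ROWS), TOTAL) if satisfies(c)]
print(solutions)
```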
In reality, I may have 6000+ rows of data and up to 12 count conditions like the ones above. Each count may vary from 1 to 50.
I've written a solution which first groups all ids by condition, then uses depth-first search to exhaustively try all possible combinations between the groups. (I believe this is a brute-force solution...)
However, I always run out of memory and time before I get a possible answer.
My questions are:
What is the lowest possible time complexity for this problem? (I believe this is a kind of subset sum problem, but I am not sure.)
How can I solve this problem without brute force? I'm considering dynamic programming or a decision tree. However, I suspect I will run out of memory with either of these. Or can I solve this problem using each data row's probabilities/entropy (I would appreciate more details on this)?
My brute-force sample code is not worth reading at all, so I'll skip posting code snippets...
I'm doing some theoretical examples with different page replacement algorithms, in order to get a better understanding for when I actually write the code. I'm kind of confused about this example.
Given below is a physical memory with 4 tiles (4 sections?). The following pages are visited one after the other:
R = 1, 2, 3, 2, 4, 5, 3, 6, 1, 4, 2, 3, 1, 4
Run the optimal page replacement algorithm on R with 4 tiles.
I know that when a page needs to be swapped in, the operating system swaps out the page whose next use will occur farthest in the future. In practice I'll have:
Time     1  2  3  4  5  6  7  8  9 10 11 12 13 14
Page     1  2  3  2  4  5  3  6  1  4  2  3  1  4
Tile 1   1  1  1
Tile 2      2  2
Tile 3         3
Tile 4
I'm not sure what happens at time 4, because we get page 2, but that's already present in memory. Normally, if it were another number like 6, it would go in Tile 4, but I'm lost in this case.
At time t=4, page 2 is already present, so there is nothing to do. You can skip it and move on to the next time step.
If it were another number like 6, you would place it in a free slot if one is available; otherwise you would find the page that won't be used for the longest time in the future and swap that one out.
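The rule described above can be sketched as a short simulation. This is a minimal illustration, assuming the reference string R and 4 tiles from the question; the function name optimal_replacement is hypothetical.

```python
def optimal_replacement(refs, frames):
    """Simulate optimal page replacement; return (fault count, final tiles)."""
    memory = []  # pages currently held, one entry per tile
    faults = 0
    for t, page in enumerate(refs):
        if page in memory:
            continue  # hit: page already present, nothing to do
        faults += 1
        if len(memory) < frames:
            memory.append(page)  # a free tile is available
            continue

        def next_use(p):
            # Position of the next reference to p, or infinity if never used again.
            for k in range(t + 1, len(refs)):
                if refs[k] == p:
                    return k
            return float("inf")

        # Evict the page whose next use lies farthest in the future.
        victim = max(memory, key=next_use)
        memory[memory.index(victim)] = page
    return faults, memory


R = [1, 2, 3, 2, 4, 5, 3, 6, 1, 4, 2, 3, 1, 4]
faults, tiles = optimal_replacement(R, 4)
print(faults, tiles)  # 7 [1, 2, 3, 4]
```

On this reference string the faults occur at times 1, 2, 3, 5, 6, 8 and 11; time 4 is a hit and changes nothing.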
Let's say I'm storing some ordered strings like this:
1 apple
2 banana
3 pear
4 mango
5 cantaloupe
Now I need to insert strawberry, which should show up at position 4.
Of course, I can easily do that by updating the numeric index, e.g.:
1 apple
2 banana
3 pear
4 strawberry
5 mango
6 cantaloupe
But the issue is that if I need to store this position update in the database, I now need to store 3 operations:
a) UPDATE index = 6, WHERE index=5
b) UPDATE index = 5, WHERE index=4
c) insert strawberry at position 4
Which is fine for small lists, but in large lists I would end up with a large number of position update operations.
Is there a more efficient approach? Maybe using something other than numbers?
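To illustrate the cost described above, here is a small sketch that lists the operations needed to insert into a list stored with integer positions. The function name insert_with_shift and the tuple format of the operations are assumptions made for illustration, not a real database API.

```python
def insert_with_shift(items, position, value):
    """Insert value at a 1-based position; return (new list, operations)."""
    ops = []
    # Shift from the end downwards so no two rows ever share an index.
    for index in range(len(items), position - 1, -1):
        ops.append(("update", index, index + 1))
    ops.append(("insert", position, value))
    new_items = items[:position - 1] + [value] + items[position - 1:]
    return new_items, ops


fruits = ["apple", "banana", "pear", "mango", "cantaloupe"]
new_fruits, ops = insert_with_shift(fruits, 4, "strawberry")
for op in ops:
    print(op)
```

For a list of n items and an insert at position p, this produces n - p + 1 updates plus the insert, which is why the cost grows with the list length.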
I need a way to find patterns in a list of values. Specifically, every second I get a value in a range (e.g. 1-3), and I want to find the recurring pattern in this list of values.
If I plot these values in an x,y system, I get something that looks like Nyquist–Shannon sampling. It could be very interesting to work on this.
I could also plot these values and work on visual pattern recognition (neural networks...).
input:
instant value
1 1
2 2
3 3
4 1
5 2
6 3
7 1
output->1,2,3
What would be the best way to proceed?
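For exactly periodic input like the sample above, one simple starting point is to look for the shortest period that reproduces the whole sequence. This is only a sketch under that assumption (no noise, no missing samples); the function name shortest_repeating_pattern is hypothetical.

```python
def shortest_repeating_pattern(values):
    """Return the shortest prefix that, repeated, reproduces the sequence."""
    n = len(values)
    for period in range(1, n + 1):
        # Every value must equal the value one period earlier.
        if all(values[i] == values[i - period] for i in range(period, n)):
            return values[:period]
    return values  # only reached for an empty input


samples = [1, 2, 3, 1, 2, 3, 1]
pattern = shortest_repeating_pattern(samples)
print(pattern)  # [1, 2, 3]
```

If the values never repeat, the function falls back to returning the whole sequence, since a period equal to the full length always matches.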
For k-fold cross-validation we have this description:
you divide the data into k subsets of
(approximately) equal size. You train the net k times, each time leaving
out one of the subsets from training, but using only the omitted subset to
compute whatever error criterion interests you. If k equals the sample
size, this is called "leave-one-out" cross-validation. "Leave-v-out" is a
more elaborate and expensive version of cross-validation that involves
leaving out all possible subsets of v cases.
What do the terms training and testing mean? I can't understand them.
Could you please point me to some references where I can learn this algorithm with an example?
Train classifier on folds: 2 3 4 5 6 7 8 9 10; Test against fold: 1
Train classifier on folds: 1 3 4 5 6 7 8 9 10; Test against fold: 2
Train classifier on folds: 1 2 4 5 6 7 8 9 10; Test against fold: 3
Train classifier on folds: 1 2 3 5 6 7 8 9 10; Test against fold: 4
Train classifier on folds: 1 2 3 4 6 7 8 9 10; Test against fold: 5
Train classifier on folds: 1 2 3 4 5 7 8 9 10; Test against fold: 6
Train classifier on folds: 1 2 3 4 5 6 8 9 10; Test against fold: 7
Train classifier on folds: 1 2 3 4 5 6 7 9 10; Test against fold: 8
Train classifier on folds: 1 2 3 4 5 6 7 8 10; Test against fold: 9
Train classifier on folds: 1 2 3 4 5 6 7 8 9; Test against fold: 10
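The schedule listed above can be generated with a few lines of code. This is a minimal sketch that only produces the fold numbers; the function name k_fold_schedule is hypothetical, and no classifier is actually trained.

```python
def k_fold_schedule(k):
    """Return a list of (training folds, test fold) pairs for k folds."""
    schedule = []
    for test_fold in range(1, k + 1):
        train_folds = [f for f in range(1, k + 1) if f != test_fold]
        schedule.append((train_folds, test_fold))
    return schedule


schedule = k_fold_schedule(10)
for train_folds, test_fold in schedule:
    folds = " ".join(str(f) for f in train_folds)
    print(f"Train classifier on folds: {folds}; Test against fold: {test_fold}")
```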
In short:
Training is the process of providing feedback to the algorithm in order to adjust the predictive power of the classifier(s) it produces.
Testing is the process of determining the realistic accuracy of the classifier(s) which were produced by the algorithm. During testing, the classifier(s) are given never-before-seen instances of data to do a final confirmation that the classifier's accuracy is not drastically different from that during training.
However, you're missing a key step in the middle: the validation (which is what you're referring to in the 10-fold/k-fold cross validation).
Validation is (usually) performed after each training step and it is performed in order to help determine if the classifier is being overfitted. The validation step does not provide any feedback to the algorithm in order to adjust the classifier, but it helps determine if overfitting is occurring and it signals when the training should be terminated.
Think about the process in the following manner:
1. Train on the training data set.
2. Validate on the validation data set.
3. If the validation accuracy improved (change in validation accuracy > 0), repeat steps 1 and 2; otherwise, stop training.
4. Test on the testing data set.
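The train/validate/stop loop above can be sketched as follows. This is an illustration only: train_step and validate are hypothetical callables standing in for a real training pass and a real validation pass, and the validation accuracies in the demo are made-up placeholder values, not results.

```python
def train_with_early_stopping(train_step, validate, max_epochs=100):
    """Repeat train/validate until validation accuracy stops improving."""
    best_accuracy = float("-inf")
    epochs_run = 0
    for _ in range(max_epochs):
        train_step()                       # step 1: train on the training set
        epochs_run += 1
        accuracy = validate()              # step 2: validate on the validation set
        if accuracy - best_accuracy <= 0:  # step 3: no improvement, stop training
            break
        best_accuracy = accuracy
    return epochs_run, best_accuracy


# Demo with stand-in callables (placeholder accuracies, not real measurements).
fake_accuracies = iter([0.60, 0.70, 0.75, 0.74, 0.80])
epochs_run, best_accuracy = train_with_early_stopping(
    train_step=lambda: None,
    validate=lambda: next(fake_accuracies),
)
print(epochs_run, best_accuracy)  # 4 0.75
# Step 4, testing on the held-out test set, would happen once after this loop.
```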
In the k-fold method, you divide the data into k segments; k-1 of them are used for training, while one is left out and used for testing. This is done k times: the first time, the first segment is used for testing and the remaining ones for training; then the second segment is used for testing and the remaining ones for training, and so on. Your 10-fold example shows exactly this, so it should be simple if you read it again.
Now about what training is and what testing is:
Training in classification is the part where a classification model is created using some algorithm. Popular algorithms for creating such models are ID3, C4.5, etc.
Testing means to evaluate the classification model by running the model over the test data, and then creating a confusion matrix and then calculating the accuracy and error rate of the model.
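As a small illustration of the testing step, the sketch below builds a binary confusion matrix and derives accuracy and error rate from it. The labels are made-up placeholder data, and the function name confusion_matrix is hypothetical.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) counts for a binary classification."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


# Placeholder test labels, for illustration only.
actual = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 1]

tp, fp, fn, tn = confusion_matrix(actual, predicted)
accuracy = (tp + tn) / len(actual)
error_rate = (fp + fn) / len(actual)
print(tp, fp, fn, tn, accuracy, error_rate)
```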
In the k-fold method, k models are created (as is clear from the description above), and the most accurate model for classification is then selected.