cross validated predictions using glmnet - cross-validation

Does anyone know whether glmnet produces cross-validated predictions, i.e. predictions for each observation from the model that was built with that observation's fold left out (what one usually means by cross-validated), rather than predictions that all come from a single model whose optimal lambda was chosen by cross-validation?

predict.cv.glmnet just passes the 'glmnet' fit for all of the data to predict.glmnet, as you suspect.
However, setting the argument keep = TRUE in cv.glmnet returns predictions for the training data (fitted values) based on the left-out folds, stored in the element fit.preval with one column per lambda value. The fold each record was assigned to is recorded in the element foldid.
> library(glmnet)
> # keep prevalidated array
> cvf1 <- cv.glmnet(x = as.matrix(mtcars[, c("disp", "hp", "mpg")]),
+ y = mtcars$am, family = "binomial", keep = TRUE)
> dim(mtcars)
# [1] 32 11
> length(cvf1$lambda)
# [1] 84
> # leave-n out fitted predictions
> # 84 columns, 2 columns padded with NAs
> dim(cvf1$fit.preval)
# [1] 32 86
> # performance of cross-validated model predictions
> round(mtcars$am - cvf1$fit.preval[, cvf1$lambda == cvf1$lambda.min])
# [1] 1 1 0 0 0 0 0 0 -1 0 0 0 0 0 0
# [16] 0 0 0 0 0 -1 0 0 0 0 0 0 0 1 0
# [31] 0 0
> cvf1$foldid
# [1] 1 6 6 1 1 8 9 6 2 5 9 4 4 2 2
# [16] 10 5 2 3 4 10 3 1 3 10 9 7 8 7 8
# [31] 7 5
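The idea behind the prevalidated fits can be sketched without glmnet. The following is a minimal illustration, not glmnet's implementation: it assumes a single predictor and an ordinary least-squares line in place of a penalised model, and the names fit_line and out_of_fold_predictions are made up for this example. Each observation is predicted by a model that never saw its fold.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def out_of_fold_predictions(xs, ys, foldid):
    """Predict every observation from a model fitted without its own fold."""
    preds = [None] * len(xs)
    for fold in sorted(set(foldid)):
        train = [i for i, f in enumerate(foldid) if f != fold]
        held_out = [i for i, f in enumerate(foldid) if f == fold]
        intercept, slope = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        for i in held_out:
            preds[i] = intercept + slope * xs[i]
    return preds


# Hypothetical toy data; foldid plays the same role as cvf1$foldid above.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
foldid = [1, 2, 3, 1, 2, 3]
print(out_of_fold_predictions(xs, ys, foldid))
```

Comparing these predictions with ys gives an honest error estimate, which is what the fit.preval column at lambda.min is used for in the R session above.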


Understanding Spark MLlib LDA input format

I am trying to implement LDA using Spark MLlib.
But I am having difficulty understanding the input format. I was able to run the sample implementation, which takes its input from a file containing only numbers, as shown:
1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0
I followed
http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda
I understand the output format of this as explained here.
My use case is very simple: I have one data file with some sentences.
I want to convert this file into a corpus so that I can pass it to org.apache.spark.mllib.clustering.LDA.run().
My question is what the numbers in the input represent before they go through zipWithIndex and are passed to LDA. Does the number 1 represent the same word wherever it appears, or is it some kind of count?
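First you need to convert your sentences into vectors. In the sample file, each line is one document and each number is the count of one vocabulary term in that document, so the values are counts rather than word identifiers.
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
// sc is an existing SparkContext; each line of the file is one document
val documents: RDD[Seq[String]] = sc.textFile("yourfile").map(_.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
// LDA expects (document id, vector) pairs
val corpus = tfidf.zipWithIndex.map(_.swap).cache()
// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)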
Read more about TF_IDF vectorization here
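To make the meaning of the numbers concrete, here is a small plain-Python sketch of how sentences turn into rows like those in the sample file. It is an illustration only: it builds an explicit vocabulary instead of hashing words as HashingTF does, and build_vocabulary and count_vector are hypothetical helper names, not Spark APIs.

```python
from collections import Counter


def build_vocabulary(documents):
    """Give each distinct word a column index, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def count_vector(doc, vocab):
    """Return one count per vocabulary word (dicts keep insertion order)."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocab]


documents = ["the cat sat", "the cat saw the dog"]
vocab = build_vocabulary(documents)
# Pair each document id with its count vector, like zipWithIndex + swap.
corpus = [(doc_id, count_vector(doc, vocab))
          for doc_id, doc in enumerate(documents)]
for doc_id, vector in corpus:
    print(doc_id, vector)
```

Each position in a vector always refers to the same word, and the value at that position is how often the word occurs in that document.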

Sorting rows and columns of adjacency matrix to reveal cliques

I'm looking for a reordering technique to group connected components of an adjacency matrix together.
For example, I've made an illustration with two groups, blue and green. Initially the '1' entries are distributed across the rows and columns of the matrix. By reordering the rows and columns, all the '1's can be located in two contiguous sections of the matrix, revealing the blue and green components more clearly.
I can't remember what this reordering technique is called. I've searched for many combinations of adjacency matrix, clique, sorting, and reordering.
The closest hits I've found are:
symrcm, which moves the elements closer to the diagonal but does not make groups.
"Is there a way to reorder the rows and columns of matrix to create a dense corner, in R?", which focuses on removing completely empty rows and columns.
Please either provide the common name for this technique so that I can google more effectively, or point me in the direction of a Matlab function.
I don't know whether there is a better alternative which should give you direct results, but here is one approach which may serve your purpose.
Your input:
>> A
A =
0 1 1 0 1
1 0 0 1 0
0 1 1 0 1
1 0 0 1 0
0 1 1 0 1
Method 1
Use the first column to build the row mask (maskRow) and the first row to build the column mask (maskCol).
maskRow marks the rows that have a one in the first column; maskCol marks the columns that have a zero in the first row.
maskRow = A(:,1)==1;
maskCol = A(1,:)~=1;
Rearrange the Rows (according to the Row-mask)
out = [A(maskRow,:);A(~maskRow,:)];
Gives something like this:
out =
1 0 0 1 0
1 0 0 1 0
0 1 1 0 1
0 1 1 0 1
0 1 1 0 1
Rearrange columns (according to the column-mask)
out = [out(:,maskCol),out(:,~maskCol)]
Gives the desired results:
out =
1 1 0 0 0
1 1 0 0 0
0 0 1 1 1
0 0 1 1 1
0 0 1 1 1
As a check that the elements end up where they are supposed to be, or if you want the corresponding rearranged indices, apply the same reordering to an index matrix ;)
Before Re-arranging:
idx = reshape(1:25,5,[])
idx =
1 6 11 16 21
2 7 12 17 22
3 8 13 18 23
4 9 14 19 24
5 10 15 20 25
After re-arranging (same process we did before)
outidx = [idx(maskRow,:);idx(~maskRow,:)];
outidx = [outidx(:,maskCol),outidx(:,~maskCol)]
Output:
outidx =
2 17 7 12 22
4 19 9 14 24
1 16 6 11 21
3 18 8 13 23
5 20 10 15 25
Method 2
For the generic case, where you don't know the matrix beforehand, here is a procedure to find maskRow and maskCol.
Logic used:
Take the first row and use it as the initial column mask (maskCol).
For each row from the 2nd to the last, repeat the following:
Compare the current row with maskCol.
If the row shares at least one 1 with maskCol, replace maskCol with the element-wise logical OR of the two.
After the last row, negate maskCol so that it follows the same convention as in Method 1.
Find maskRow by the same process, iterating over the columns instead (and without the final negation).
Code:
%// If you have a square matrix, you can combine both these loops into a single loop.
%// logical() keeps the masks usable as indices even if no row or column gets merged.
maskCol = logical(A(1,:));
for ii = 2:size(A,1)
    if sum(A(ii,:) & maskCol)>0
        maskCol = maskCol | A(ii,:);
    end
end
maskCol = ~maskCol;
maskRow = logical(A(:,1));
for ii = 2:size(A,2)
    if sum(A(:,ii) & maskRow)>0
        maskRow = maskRow | A(:,ii);
    end
end
Here is an example to try that:
%// Here I removed some 'ones' from first, last rows and columns.
%// Compare it with the original example.
A = [0 0 1 0 1
0 0 0 1 0
0 1 1 0 0
1 0 0 1 0
0 1 0 0 1];
Then, repeat the procedure you followed before:
out = [A(maskRow,:);A(~maskRow,:)]; %// same code used
out = [out(:,maskCol),out(:,~maskCol)]; %// same code used
Here is the result:
>> out
out =
0 1 0 0 0
1 1 0 0 0
0 0 0 1 1
0 0 1 1 0
0 0 1 0 1
Note: This approach works in most cases but can still fail in some rare ones.
Here is an example:
%// this works well.
A = [0 0 1 0 1 0
1 0 0 1 0 0
0 1 0 0 0 1
1 0 0 1 0 0
0 0 1 0 1 0
0 1 0 0 1 1];
%// This may not
%// Second col, last row changed to zero from one
A = [0 0 1 0 1 0
1 0 0 1 0 0
0 1 0 0 0 1
1 0 0 1 0 0
0 0 1 0 1 0
0 0 0 0 1 1];
Why does it fail?
The loop visits each row once, in order, to build the column mask. When it reaches the 3rd row, none of that row's ones coincide with the current maskCol, so the row is skipped and the only new information it carries (the 1 in the 2nd column) is lost, even though a later row ties it to the same group.
This is rare because some other row often carries the same information. In the first example, the 3rd row does not match the 1st row either, but the last row also has a 1 in the 2nd column, so the result is still correct. It is nevertheless good to be aware of this limitation.
Method 3
This one is a brute-force alternative that can be applied if you think the previous method might fail. Here a while loop repeats the previous code (finding the row and column masks) several times, each pass starting from the updated maskCol, so that rows skipped in an earlier pass can still be merged in a later one.
Procedure:
maskCol = A(1,:);
count = 1;
while(count<3)
    for ii = 2:size(A,1)
        if sum(A(ii,:) & maskCol)>0
            maskCol = maskCol | A(ii,:);
        end
    end
    count = count+1;
end
The previous example (where Method 2 fails) is run below without and with the while loop.
Without Brute force:
>> out
out =
1 0 1 0 0 0
1 0 1 0 0 0
0 0 0 1 1 0
0 1 0 0 0 1
0 0 0 1 1 0
0 0 0 0 1 1
With Brute-Forcing while loop:
>> out
out =
1 1 0 0 0 0
1 1 0 0 0 0
0 0 0 1 1 0
0 0 1 0 0 1
0 0 0 1 1 0
0 0 0 0 1 1
The number of passes required to get the correct result varies with the matrix, so it is safest to allow a generous number. A pass that leaves maskCol unchanged means no further passes are needed.
Good Luck!
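A different way to avoid guessing the number of passes is to treat the problem as finding connected components, where a row and a column are linked whenever the entry between them is 1. The sketch below uses that swapped-in technique rather than the masks above; it is written in Python as an illustration, and component_order and reorder are hypothetical names. It handles any number of groups and places all-zero columns last.

```python
def component_order(A):
    """Return (row_order, col_order) that groups rows/columns by block."""
    n_rows, n_cols = len(A), len(A[0])
    row_label = [None] * n_rows
    col_label = [None] * n_cols
    label = 0
    for start in range(n_rows):
        if row_label[start] is not None:
            continue
        # Flood-fill from this row, alternating between rows and columns.
        row_label[start] = label
        stack = [("r", start)]
        while stack:
            kind, k = stack.pop()
            if kind == "r":
                for j in range(n_cols):
                    if A[k][j] and col_label[j] is None:
                        col_label[j] = label
                        stack.append(("c", j))
            else:
                for i in range(n_rows):
                    if A[i][k] and row_label[i] is None:
                        row_label[i] = label
                        stack.append(("r", i))
        label += 1
    row_order = sorted(range(n_rows), key=lambda i: row_label[i])
    # Columns without any 1 have no label and are sorted to the end.
    col_order = sorted(range(n_cols),
                       key=lambda j: (col_label[j] is None, col_label[j] or 0))
    return row_order, col_order


def reorder(A, row_order, col_order):
    """Apply the row and column permutations to A."""
    return [[A[i][j] for j in col_order] for i in row_order]


# The example on which the single-pass method fails.
A = [[0, 0, 1, 0, 1, 0],
     [1, 0, 0, 1, 0, 0],
     [0, 1, 0, 0, 0, 1],
     [1, 0, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 0],
     [0, 0, 0, 0, 1, 1]]
rows, cols = component_order(A)
for line in reorder(A, rows, cols):
    print(*line)
```

Because the group containing the first row is emitted first, the blocks appear in a different order than in the MATLAB output, but the ones are confined to the same two diagonal blocks.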

APL find frequency of elements in a matrix

I have this piece of code
((⍳3)∘.+(⍳2))
which generates the following matrix
2 3
3 4
4 5
I want to find the number of occurrences of each unique element in the result, i.e. the occurrences of 2, 3, 4 and 5.
I tried using "∘.=" on the matrix with itself and then reshaping so that the elements of each sub-matrix become a row,
using
6 6⍴ ((⍳3)∘.+(⍳2))∘.=((⍳3)∘.+(⍳2))
which gives the following result
1 0 0 0 0 0 for 2
0 1 1 0 0 0 for 3
0 1 1 0 0 0 for 3
0 0 0 1 1 0 for 4
0 0 0 1 1 0 for 4
0 0 0 0 0 1 for 5
As you can see, it still contains a separate row for each duplicate item, and I'm lost as of now.
Any help will be appreciated.
You should do ∘.= between the unique elements in the matrix and a flat vector of all elements, like:
m ← ((⍳3)∘.+(⍳2))
(∪,m) ∘.= ,m
1 0 0 0 0 0
0 1 1 0 0 0
0 0 0 1 1 0
0 0 0 0 0 1
Then just do +/ on it to get the frequencies of ∪,m
+/ (∪,m) ∘.= ,m
1 2 2 1
∪,m
2 3 4 5
(Tested on GNU APL.)
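For readers less familiar with APL, the same steps can be mirrored in Python. This is only an illustrative translation: it assumes index origin 1 for ⍳, and the function name frequencies is made up for the example.

```python
def frequencies(matrix):
    """Count occurrences of each distinct element via a comparison table."""
    flat = [v for row in matrix for v in row]              # ,m
    unique = list(dict.fromkeys(flat))                     # ∪,m
    table = [[int(u == v) for v in flat] for u in unique]  # (∪,m) ∘.= ,m
    return unique, [sum(row) for row in table]             # +/ over each row


# (⍳3)∘.+(⍳2) with index origin 1
m = [[i + j for j in (1, 2)] for i in (1, 2, 3)]
unique, freqs = frequencies(m)
print(unique)
print(freqs)
```

Each row of the table compares one distinct value against every element, so summing along the row gives that value's frequency.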
Dyalog APL version 14.0 has the Key operator ⌸ for exactly this; you just need to ravel your data:
{≢⍵}⌸ ,((⍳3)∘.+(⍳2))
1 2 2 1
You can even use the left argument of ⌸'s operand function to create a table:
{⍺,≢⍵}⌸ ,((⍳3)∘.+(⍳2))
2 1
3 2
4 2
5 1

Finding all subsets of a multiset

Suppose I have a bag which contains 6 balls (3 white and 3 black). I want to find all possible subsets of a given length, disregarding the order. In the case above, there are only 4 combinations of 3 balls I can draw from the bag:
2 white and 1 black
2 black and 1 white
3 white
3 black
I already found a library in my language of choice that does exactly this, but I find it slow for greater numbers. For example, with a bag containing 15 white, 1 black, 1 blue, 1 red, 1 yellow and 1 green, there are only 32 combinations of 10 balls, but it takes 30 seconds to yield the result.
Is there an efficient algorithm which can find all those combinations that I could implement myself? Maybe this problem is not as trivial as I first thought...
Note: I'm not even sure of the right technical terms to express this, so feel free to correct the title of my post.
You can do significantly better than a general choose algorithm. The key insight is to treat each color of balls at the same time, rather than each of those balls one by one.
I created an unoptimized implementation of this algorithm in Python that correctly finds the count of 32 for your test case in milliseconds:
def multiset_choose(items_multiset, choose_items):
    if choose_items == 0:
        return 1  # always one way to choose zero items
    elif choose_items < 0:
        return 0  # always no ways to choose less than zero items
    elif not items_multiset:
        return 0  # always no ways to choose some items from a set of no items
    elif choose_items > sum(item[1] for item in items_multiset):
        return 0  # always no ways to choose more items than are in the multiset
    current_item_name, current_item_number = items_multiset[0]
    max_current_items = min(choose_items, current_item_number)
    return sum(
        multiset_choose(items_multiset[1:], choose_items - c)
        for c in range(0, max_current_items + 1)
    )
And the tests:
print(multiset_choose([("white", 3), ("black", 3)], 3))
# output: 4
print(multiset_choose([("white", 15), ("black", 1), ("blue", 1), ("red", 1), ("yellow", 1), ("green", 1)], 10))
# output: 32
No, you don't need to search through all possible alternatives. A simple recursive algorithm (like the one given by @recursive) will give you the answer. If you are looking for a function that actually outputs all of the combinations, rather than just how many there are, here is a version written in R. I don't know what language you are working in, but it should be pretty straightforward to translate this into anything, although the code might be longer, since R is good at this kind of thing.
allCombos <- function(len,                 ## number of items to sample
                      x,                   ## array of quantities of balls, by color
                      names = 1:length(x)  ## names of the colors (defaults to "1","2",...)
                      ) {
  if (length(x) == 0)
    return(c())
  r <- c()
  for (i in max(0, len - sum(x[-1])):min(x[1], len))
    r <- rbind(r, cbind(i, allCombos(len - i, x[-1])))
  colnames(r) <- names
  r
}
Here's the output:
> allCombos(3,c(3,3),c("white","black"))
white black
[1,] 0 3
[2,] 1 2
[3,] 2 1
[4,] 3 0
> allCombos(10,c(15,1,1,1,1,1),c("white","black","blue","red","yellow","green"))
white black blue red yellow green
[1,] 5 1 1 1 1 1
[2,] 6 0 1 1 1 1
[3,] 6 1 0 1 1 1
[4,] 6 1 1 0 1 1
[5,] 6 1 1 1 0 1
[6,] 6 1 1 1 1 0
[7,] 7 0 0 1 1 1
[8,] 7 0 1 0 1 1
[9,] 7 0 1 1 0 1
[10,] 7 0 1 1 1 0
[11,] 7 1 0 0 1 1
[12,] 7 1 0 1 0 1
[13,] 7 1 0 1 1 0
[14,] 7 1 1 0 0 1
[15,] 7 1 1 0 1 0
[16,] 7 1 1 1 0 0
[17,] 8 0 0 0 1 1
[18,] 8 0 0 1 0 1
[19,] 8 0 0 1 1 0
[20,] 8 0 1 0 0 1
[21,] 8 0 1 0 1 0
[22,] 8 0 1 1 0 0
[23,] 8 1 0 0 0 1
[24,] 8 1 0 0 1 0
[25,] 8 1 0 1 0 0
[26,] 8 1 1 0 0 0
[27,] 9 0 0 0 0 1
[28,] 9 0 0 0 1 0
[29,] 9 0 0 1 0 0
[30,] 9 0 1 0 0 0
[31,] 9 1 0 0 0 0
[32,] 10 0 0 0 0 0
>
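The same enumeration can be sketched in Python as a generator, which avoids building the whole result in memory. This is an illustrative translation of the idea above (pick a count for the first colour, then recurse on the rest), not code from either answer; multiset_combinations is a made-up name and it takes only the quantities, in colour order.

```python
def multiset_combinations(counts, k):
    """Yield tuples of per-colour counts that sum to k within each supply."""
    if not counts:
        if k == 0:
            yield ()
        return
    first, rest = counts[0], counts[1:]
    # Take at least enough of this colour that the rest can cover the remainder.
    low = max(0, k - sum(rest))
    high = min(first, k)
    for take in range(low, high + 1):
        for tail in multiset_combinations(rest, k - take):
            yield (take,) + tail


for combo in multiset_combinations([3, 3], 3):
    print(dict(zip(["white", "black"], combo)))

print(len(list(multiset_combinations([15, 1, 1, 1, 1, 1], 10))))
```

The lower bound on each count prunes branches that could never reach the requested size, so the work grows with the number of results rather than with the number of individual balls.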

Evaluating the distribution of words in a grid

I'm creating a word search and am trying to measure the quality of the generated puzzles by verifying that the word set is "distributed evenly" throughout the grid. For example, placing the words consecutively and filling the grid row by row is not particularly interesting, because there will be clusters and the user will quickly notice a pattern.
How can I measure how 'evenly distributed' the words are?
What I'd like to do is write a program that takes a word search as input and outputs a score that evaluates the 'quality' of the puzzle. I'm wondering if anyone has seen a similar problem and could refer me to some resources. Perhaps there is some concept in statistics that might help? Thanks.
The basic problem is the distribution of lines in a square or rectangle. You can either do this geometrically or using integer arrays. I will try the integer arrays here.
Let M be a matrix of your puzzle,
A B C D
E F G H
I J K L
M N O P
Let "EFGH" be a word in the puzzle, as well as "CGKO". Then create a matrix that holds, for each cell, the number of words the cell belongs to:
0 0 1 0
1 1 2 1
0 0 1 0
0 0 1 0
Apply a rule: each cell's new value is the sum of its 4-way neighbours' original values, multiplied by the cell's own original value if that value is 2 or higher.
0 0 1 0       1 2 2 2
1 1 2 1  -\   1 3 8 2
0 0 1 0  -/   1 2 3 2
0 0 1 0       0 1 1 1
And sum up all values in the rows and columns of the matrix:
 1  2  2  2  =  7
 1  3  8  2  = 14
 1  2  3  2  =  8
 0  1  1  1  =  3
 |  |  |  |
 3  8 14  7
Then calculate the average of both result sets:
(7 + 14 + 8 + 3) / 4 = 32 / 4 = 8
(3 + 8 + 14 + 7) / 4 = 32 / 4 = 8
And calculate the average difference from the average for each result set:
columns            rows
 3 <-> 8 = 5        7 <-> 8 = 1
 8 <-> 8 = 0       14 <-> 8 = 6
14 <-> 8 = 6        8 <-> 8 = 0
 7 <-> 8 = 1        3 <-> 8 = 5
avg = 3            avg = 3
And multiply them together:
3 * 3 = 9
Which you treat as a distribution score. You might need to tweak it a little to make it work better, but this should calculate distribution scores quite nicely.
Here is an example of a bad distribution:
1 0 0 0       1 1 0 0  = 2
1 0 0 0  -\   2 1 0 0  = 3
1 0 0 0  -/   2 1 0 0  = 3
1 0 0 0       1 1 0 0  = 2
              | | | |
              6 4 0 0
Row sums: average 2.5, average difference from the average 0.5
Column sums: average 2.5, average difference from the average 2.5
Score: 0.5 * 2.5 = 1.25
Edit: calc. errors fixed.
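The whole procedure is short enough to sketch in code. The following Python version is one possible reading of the steps above, under the assumption that the input is the per-cell word-membership matrix; the function names are invented for this example.

```python
def neighbour_transform(grid):
    """Sum the 4-way neighbours; multiply by the cell's own value if >= 2."""
    n_rows, n_cols = len(grid), len(grid[0])
    out = [[0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            total = 0
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    total += grid[rr][cc]
            if grid[r][c] >= 2:
                total *= grid[r][c]
            out[r][c] = total
    return out


def mean_abs_deviation(values):
    """Average absolute difference between the values and their mean."""
    avg = sum(values) / len(values)
    return sum(abs(v - avg) for v in values) / len(values)


def distribution_score(grid):
    """Product of the row-sum and column-sum deviations after the transform."""
    transformed = neighbour_transform(grid)
    row_sums = [sum(row) for row in transformed]
    col_sums = [sum(col) for col in zip(*transformed)]
    return mean_abs_deviation(row_sums) * mean_abs_deviation(col_sums)


crossing = [[0, 0, 1, 0],
            [1, 1, 2, 1],
            [0, 0, 1, 0],
            [0, 0, 1, 0]]
clustered = [[1, 0, 0, 0],
             [1, 0, 0, 0],
             [1, 0, 0, 0],
             [1, 0, 0, 0]]
print(distribution_score(crossing))
print(distribution_score(clustered))
```

Running it on the two membership matrices from the answer makes it easy to check the hand calculations and to experiment with tweaks to the rule.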
