Understanding Spark MLlib LDA input format


I am trying to implement LDA using Spark MLlib.
However, I am having difficulty understanding the input format. I was able to run the sample implementation, which takes its input from a file containing only numbers, as shown:
1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0
I followed
http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda
I understand the output format as explained there.
My use case is very simple: I have one data file containing some sentences.
I want to convert this file into a corpus that I can pass to org.apache.spark.mllib.clustering.LDA.run().
My question is what the numbers in the input represent before they are passed through zipWithIndex and handed to LDA. Does the number 1 represent the same word wherever it appears, or is it some kind of count?

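The numbers are counts. Each line of the sample file is one document, written as a vector of word counts: the value in column j is the number of times term j of the vocabulary occurs in that document. zipWithIndex then attaches a unique document ID to each vector.
To use your own sentences, you first need to convert them into vectors.
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
// sc is an existing SparkContext; each line of the file is treated as one document
val documents: RDD[Seq[String]] = sc.textFile("yourfile").map(_.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
// LDA expects (document ID, vector) pairs
val corpus = tfidf.zipWithIndex.map(_.swap).cache()
// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)
Read more about TF-IDF vectorization in the Spark MLlib feature extraction guide.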
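To make the meaning of the numeric rows concrete, here is a minimal pure-Python sketch (not Spark code) of how sentences could be turned into word-count vectors paired with document IDs. The helper names build_vocabulary and to_count_vector and the two sentences are made up for illustration; Spark's HashingTF assigns column indices by hashing rather than by first appearance.

```python
def build_vocabulary(documents):
    """Assign each distinct word a column index, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def to_count_vector(doc, vocab):
    """Return a dense vector whose j-th entry counts occurrences of term j."""
    vec = [0] * len(vocab)
    for word in doc.split():
        vec[vocab[word]] += 1
    return vec


# Made-up example data: one sentence per document.
sentences = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(sentences)

# Pair each vector with a document ID, analogous to zipWithIndex followed by swap.
corpus = [(doc_id, to_count_vector(s, vocab)) for doc_id, s in enumerate(sentences)]
for doc_id, vec in corpus:
    print(doc_id, vec)
```

Each printed row has the same shape as a line of the sample file: one document, one count per vocabulary term.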

Related

plotting multiple graphs and animation from a data file in gnuplot

Suppose I have the following sample data file.
0 1 2
0 3 4
0 1 9
0 9 2
0 19 0
0 6 1
0 11 0
1 3 2
1 3 4
1 1 6
1 9 2
1 15 0
1 6 6
1 11 1
2 3 2
2 4 4
2 1 6
2 9 6
2 15 0
2 6 6
2 11 1
The first column gives the time, the second the value of x, and the third the value of y. I wish to plot y as a function of x from this data file at different times,
i.e., for t=0 I plot using 2:3 with lines, taking only the rows where t=0. Then I do the same for the rows where t=1, and so on.
In the end, I want to get a GIF, i.e., an animation of how the y vs. x graph changes shape as time goes on. How can I do this in gnuplot?

What have you tried so far? (Check help ternary and help gif)
You need to filter your data with the ternary operator and then create the animation.
Code:
### plot filtered data and animate
reset session
$Data <<EOD
0 1 2
0 3 4
0 1 9
0 9 2
0 19 0
0 6 1
0 11 0
1 3 2
1 3 4
1 1 6
1 9 2
1 15 0
1 6 6
1 11 1
2 3 2
2 4 4
2 1 6
2 9 6
2 15 0
2 6 6
2 11 1
EOD
set terminal gif animate delay 50 optimize
set output "myAnimation.gif"
set xrange[0:20]
set yrange[0:10]
do for [i=0:2] {
plot $Data u 2:($1==i?$3:NaN) w lp pt 7 ti sprintf("Time: %g",i)
}
set output
### end of code
Addition:
The meaning of $1==i?$3:NaN in words:
If the value in the first column is equal to i then the result is the value in the third column else it will be NaN ("Not a Number").
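The same filtering logic can be sketched in Python to show which points end up in each frame. This is only an illustration of the ternary expression, not a replacement for the gnuplot script; the function name frame and the shortened row list are assumptions.

```python
import math

# (t, x, y) rows; a shortened subset in the same layout as the data file.
rows = [(0, 1, 2), (0, 3, 4), (1, 3, 2), (1, 9, 2), (2, 4, 4)]


def frame(rows, i):
    """Mimic 2:($1==i?$3:NaN): keep x, and replace y by NaN unless t equals i."""
    return [(x, y if t == i else math.nan) for t, x, y in rows]


for i in range(3):
    # Points whose y is NaN are not drawn, so only the rows with t == i remain visible.
    visible = [(x, y) for x, y in frame(rows, i) if not math.isnan(y)]
    print(i, visible)
```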

how to find number of paths between 2 nodes of a certain length [closed]

Given a graph G and its adjacency matrix, how can I find the number of paths of a certain length between two given nodes?
I've thought of multiplying the matrix by itself k times and then reading off A^k[i,j], but I don't know how to build the algorithm. Is this the best solution in terms of complexity?

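If you want to find the number of paths of length k between two nodes, raise the adjacency matrix A to the k-th power, i.e. compute the product of k copies of A. Note that this counts walks, so a path may visit the same node more than once.
The reason for this is simple:
If there is an edge ij and an edge js, then there is a path from i to s through j, so the entry [i, s] of A * A counts the paths of length 2 from i to s. In an undirected graph without self-loops, the diagonal entries [i, i] of A * A are the degrees of the nodes i.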
Here is an adjacency matrix for a graph:
0 1 1 0 0 0 0 0 0 0
0 1 1 0 0 1 0 0 0 0
0 0 0 1 1 0 0 0 0 0
1 0 0 1 1 0 0 0 0 0
0 0 0 0 1 1 1 0 1 0
1 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 1 1 1 0
0 0 0 0 0 0 0 1 0 1
0 0 0 0 0 0 0 1 0 1
0 0 0 1 0 0 0 0 0 1
Let's say we want to find the number of length 3 paths from node 2 to node 5. For this we need to find A^3[2, 5], with rows and columns numbered from 0.
There are plenty of algorithms for matrix multiplication, and certain languages have these built in.
So if our adjacency matrix is called A, we want A * A * A.
This gives us:
2 1 1 2 3 2 1 1 1 0
2 2 2 2 3 2 1 2 1 1
2 1 1 1 3 2 3 3 3 1
2 2 2 2 4 3 3 3 3 1
1 1 1 1 1 1 3 8 3 6
0 1 1 2 1 1 0 1 0 2
0 0 0 2 0 0 1 5 1 6
1 0 0 3 1 0 0 1 0 3
1 0 0 3 1 0 0 1 0 3
2 1 1 3 3 1 1 0 1 1
When we look up A^3[2, 5] (counting rows and columns from 0) we get 2, which is the number of length 3 paths from node 2 to node 5.
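As a rough sketch of how this could be coded without a matrix library, the following Python computes the k-th power by repeated squaring and reads off the entry for the example above. The function names mat_mul and mat_pow are made up for illustration. With the naive multiplication shown here, the cost is about O(n^3 log k) for an n-node graph.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def mat_pow(a, k):
    """Return a**k for a square matrix a and an integer k >= 0 (repeated squaring)."""
    n = len(a)
    result = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    base = a
    while k > 0:
        if k & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        k >>= 1
    return result


# Adjacency matrix from the example; rows and columns are numbered from 0.
A = [
    [0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
]

A3 = mat_pow(A, 3)
print(A3[2][5])  # number of length-3 walks from node 2 to node 5
```

The two walks counted here are 2 -> 3 -> 4 -> 5 and 2 -> 4 -> 4 -> 5; the second one uses the self-loop at node 4, which illustrates that repeated nodes are included in the count.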

APL find frequency of elements in a matrix

I have this piece of code
((⍳3)∘.+(⍳2))
which generates the following matrix
2 3
3 4
4 5
I want to find the number of occurrences of each unique element in the result, i.e. the occurrences of 2, 3, 4 and 5.
I tried using "∘.=" on the matrix with itself and then reshaping so that the elements of each sub-matrix are transformed into a row,
using
6 6⍴ ((⍳3)∘.+(⍳2))∘.=((⍳3)∘.+(⍳2))
which gives the following result
1 0 0 0 0 0 for 2
0 1 1 0 0 0 for 3
0 1 1 0 0 0 for 3
0 0 0 1 1 0 for 4
0 0 0 1 1 0 for 4
0 0 0 0 0 1 for 5
As you can see, it still contains duplicate rows for the repeated items, and I'm lost at this point.
Any help will be appreciated.

You should do ∘.= between the unique elements in the matrix and a flat vector of all elements, like:
m ← ((⍳3)∘.+(⍳2))
(∪,m) ∘.= ,m
1 0 0 0 0 0
0 1 1 0 0 0
0 0 0 1 1 0
0 0 0 0 0 1
Then just do +/ on it to get the frequencies of ∪,m
+/ (∪,m) ∘.= ,m
1 2 2 1
∪,m
2 3 4 5
(Tested on GNU APL.)
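For readers who do not know APL, the same steps can be sketched in Python: ravel the matrix, take the unique elements, build the equality table, then sum each row. The variable names are made up, and the ranges start at 1 on the assumption that ⍳ counts from 1, which matches the matrix shown in the question.

```python
# (⍳3)∘.+(⍳2): outer sum of 1..3 and 1..2
m = [[i + j for j in range(1, 3)] for i in range(1, 4)]

flat = [x for row in m for x in row]        # ,m   (ravel)
unique = list(dict.fromkeys(flat))          # ∪,m  (unique, in order of first appearance)

# (∪,m) ∘.= ,m : one row per unique element, 1 wherever it matches
table = [[int(u == x) for x in flat] for u in unique]

# +/ : sum along each row to get the frequencies
freqs = [sum(row) for row in table]

print(unique)
print(freqs)
```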

Dyalog APL version 14.0 has the ⌸ Key operator exactly for this, you just need to ravel your data:
{≢⍵}⌸ ,((⍳3)∘.+(⍳2))
1 2 2 1
You can even use the left argument of ⌸'s operand function to create a table:
{⍺,≢⍵}⌸ ,((⍳3)∘.+(⍳2))
2 1
3 2
4 2
5 1

Finding all subsets of a multiset

Suppose I have a bag which contains 6 balls (3 white and 3 black). I want to find all possible subsets of a given length, disregarding the order. In the case above, there are only 4 combinations of 3 balls I can draw from the bag:
2 white and 1 black
2 black and 1 white
3 white
3 black
I already found a library in my language of choice that does exactly this, but I find it slow for greater numbers. For example, with a bag containing 15 white, 1 black, 1 blue, 1 red, 1 yellow and 1 green, there are only 32 combinations of 10 balls, but it takes 30 seconds to yield the result.
Is there an efficient algorithm which can find all those combinations that I could implement myself? Maybe this problem is not as trivial as I first thought...
Note: I'm not even sure of the right technical terms to express this, so feel free to correct the title of my post.

You can do significantly better than a general choose algorithm. The key insight is to treat each color of balls at the same time, rather than each of those balls one by one.
I created an unoptimized implementation of this algorithm in Python that correctly finds the count of 32 for your test case in milliseconds:
def multiset_choose(items_multiset, choose_items):
    if choose_items == 0:
        return 1  # always one way to choose zero items
    elif choose_items < 0:
        return 0  # always no ways to choose less than zero items
    elif not items_multiset:
        return 0  # always no ways to choose some items from a set of no items
    elif choose_items > sum(item[1] for item in items_multiset):
        return 0  # always no ways to choose more items than are in the multiset
    current_item_name, current_item_number = items_multiset[0]
    max_current_items = min([choose_items, current_item_number])
    return sum(
        multiset_choose(items_multiset[1:], choose_items - c)
        for c in range(0, max_current_items + 1)
    )
And the tests:
print(multiset_choose([("white", 3), ("black", 3)], 3))
# output: 4
print(multiset_choose([("white", 15), ("black", 1), ("blue", 1), ("red", 1), ("yellow", 1), ("green", 1)], 10))
# output: 32

No, you don't need to search through all possible alternatives. A simple recursive algorithm (like the one given by @recursive) will give you the answer. If you are looking for a function that actually outputs all of the combinations, rather than just how many there are, here is a version written in R. I don't know what language you are working in, but it should be pretty straightforward to translate this into anything, although the code might be longer, since R is good at this kind of thing.
allCombos<-function(len, ## number of items to sample
                    x, ## array of quantities of balls, by color
                    names=1:length(x) ## names of the colors (defaults to "1","2",...)
                    ){
  if(length(x)==0)
    return(c())
  r<-c()
  for(i in max(0,len-sum(x[-1])):min(x[1],len))
    r<-rbind(r,cbind(i,allCombos(len-i,x[-1])))
  colnames(r)<-names
  r
}
Here's the output:
> allCombos(3,c(3,3),c("white","black"))
white black
[1,] 0 3
[2,] 1 2
[3,] 2 1
[4,] 3 0
> allCombos(10,c(15,1,1,1,1,1),c("white","black","blue","red","yellow","green"))
white black blue red yellow green
[1,] 5 1 1 1 1 1
[2,] 6 0 1 1 1 1
[3,] 6 1 0 1 1 1
[4,] 6 1 1 0 1 1
[5,] 6 1 1 1 0 1
[6,] 6 1 1 1 1 0
[7,] 7 0 0 1 1 1
[8,] 7 0 1 0 1 1
[9,] 7 0 1 1 0 1
[10,] 7 0 1 1 1 0
[11,] 7 1 0 0 1 1
[12,] 7 1 0 1 0 1
[13,] 7 1 0 1 1 0
[14,] 7 1 1 0 0 1
[15,] 7 1 1 0 1 0
[16,] 7 1 1 1 0 0
[17,] 8 0 0 0 1 1
[18,] 8 0 0 1 0 1
[19,] 8 0 0 1 1 0
[20,] 8 0 1 0 0 1
[21,] 8 0 1 0 1 0
[22,] 8 0 1 1 0 0
[23,] 8 1 0 0 0 1
[24,] 8 1 0 0 1 0
[25,] 8 1 0 1 0 0
[26,] 8 1 1 0 0 0
[27,] 9 0 0 0 0 1
[28,] 9 0 0 0 1 0
[29,] 9 0 0 1 0 0
[30,] 9 0 1 0 0 0
[31,] 9 1 0 0 0 0
[32,] 10 0 0 0 0 0
>
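A possible Python translation of the same idea is sketched below. It is a generator that yields one tuple of per-color counts for each combination, in the same order as the R output. The function name multiset_combinations is made up for illustration, and color names are left out for brevity.

```python
def multiset_combinations(counts, k):
    """Yield tuples (c_0, c_1, ...) with 0 <= c_i <= counts[i] and sum(c_i) == k."""
    if not counts:
        if k == 0:
            yield ()
        return
    first, rest = counts[0], counts[1:]
    # Take at least enough of the first color that the rest can still fill up to k,
    # and at most as many as are available or needed.
    lowest = max(0, k - sum(rest))
    highest = min(first, k)
    for c in range(lowest, highest + 1):
        for tail in multiset_combinations(rest, k - c):
            yield (c,) + tail


# 3 white and 3 black balls, draw 3
for combo in multiset_combinations([3, 3], 3):
    print(combo)

# 15 white and one each of five other colors, draw 10
print(len(list(multiset_combinations([15, 1, 1, 1, 1, 1], 10))))
```

Because the lower bound prunes branches that cannot reach k, the recursion only visits combinations that actually exist, which is why the 32-result case finishes quickly.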

Adding zeros between every 2 elements of a matrix in matlab/octave

I am interested in how I can add rows and columns of zeros to a matrix so that it looks like this:
               1 0 2 0 3
1 2 3          0 0 0 0 0
2 3 4    =>    2 0 3 0 4
5 4 3          0 0 0 0 0
               5 0 4 0 3
More precisely, I am interested in how to do this efficiently, because walking the matrix and adding zeros takes a lot of time if you work with a big matrix.
Update:
Thank you very much.
Now I'm trying to replace the zeros with the sum of their neighbors:
               1 0 2 0 3          1 3 2 5 3
1 2 3          0 0 0 0 0          3 8 5 12 ... and so on
2 3 4    =>    2 0 3 0 4    =>
5 4 3          0 0 0 0 0
               5 0 4 0 3
As you can see, I'm considering all 8 neighbors of an element, but again, using for loops to walk the matrix slows me down quite a bit. Is there a faster way?

Let your little matrix be called m1. Then:
m2 = zeros(5)
m2(1:2:end,1:2:end) = m1(:,:)
Obviously this is hard-wired to your example; I'll leave it to you to generalise.
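One way the generalisation could look, sketched in Python with plain lists rather than MATLAB: an r-by-c matrix becomes a (2r-1)-by-(2c-1) matrix of zeros, and the original values are written to every second row and column. The function name insert_zeros is made up for illustration.

```python
def insert_zeros(m):
    """Return a copy of m with a row and a column of zeros between neighbouring entries."""
    rows, cols = len(m), len(m[0])
    out = [[0] * (2 * cols - 1) for _ in range(2 * rows - 1)]
    for i, row in enumerate(m):
        # Equivalent of m2(1:2:end, 1:2:end) = m1: fill every second position.
        out[2 * i][::2] = row
    return out


m1 = [[1, 2, 3], [2, 3, 4], [5, 4, 3]]
for row in insert_zeros(m1):
    print(row)
```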

Here are two ways to do part 2 of the question. The first does the shifts explicitly, and the second uses conv2. The second way should be faster.
M=[1 2 3; 2 3 4 ; 5 4 3];
% this matrix (M expanded) has zeros inserted, but also an extra row and column of zeros
Mex = kron(M,[1 0 ; 0 0 ]);
% The sum matrix is built from shifts of the original matrix
Msum = Mex + circshift(Mex,[1 0]) + ...
circshift(Mex,[-1 0]) +...
circshift(Mex,[0 -1]) + ...
circshift(Mex,[0 1]) + ...
circshift(Mex,[1 1]) + ...
circshift(Mex,[-1 1]) + ...
circshift(Mex,[1 -1]) + ...
circshift(Mex,[-1 -1]);
% trim the extra line
Msum = Msum(1:end-1,1:end-1)
% another version, a bit more fancy:
MexTrimmed = Mex(1:end-1,1:end-1);
MsumV2 = conv2(MexTrimmed,ones(3),'same')
Output:
Msum =
1 3 2 5 3
3 8 5 12 7
2 5 3 7 4
7 14 7 14 7
5 9 4 7 3
MsumV2 =
1 3 2 5 3
3 8 5 12 7
2 5 3 7 4
7 14 7 14 7
5 9 4 7 3
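To spell out what the conv2 call with ones(3) and 'same' computes, here is a plain Python sketch that sums each 3x3 neighbourhood, treating positions outside the matrix as zero. It uses explicit loops, so it illustrates the arithmetic rather than the speed-up; the function name neighbour_sums is made up for illustration.

```python
def neighbour_sums(grid):
    """For every cell, sum the cell and its up to 8 neighbours (zero padding at the edges)."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[r][c] = sum(
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            )
    return out


# The 3x3 example matrix with zeros already inserted.
expanded = [
    [1, 0, 2, 0, 3],
    [0, 0, 0, 0, 0],
    [2, 0, 3, 0, 4],
    [0, 0, 0, 0, 0],
    [5, 0, 4, 0, 3],
]
for row in neighbour_sums(expanded):
    print(row)
```

The original entries stay unchanged because every neighbour of an original entry is one of the inserted zeros.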
