Spark RDD into Matrix

I have an RDD like:
(A,AA,1)
(A,BB,0)
(A,CC,0)
(B,AA,2)
(B,BB,1)
(B,CC,4)
and I want to convert it into the following RDD:
([1,0,0],[2,1,4])
The order is important for me, since my main purpose is to use RowMatrix to convert the second RDD into a matrix.

You need to be careful with the wording: when you ask for a Matrix, do you mean something like the spark.mllib Matrix? If so, you will need to follow very specific instructions to create one. However, it seems to me that your problem can be solved in a much easier way, just using zipWithIndex with groupBy:
// Here is how I see it
import org.apache.spark.mllib.linalg.Vectors

val test = sc.parallelize(Array(("A","AA",1),("A","BB",0),("A","CC",0),("B","AA",2),("B","BB",1),("B","CC",4))).zipWithIndex
val grouptest = test.groupBy(_._1._1)
  .map(x => Vectors.dense(x._2.map(y => (y._2, y._1._3)).toArray.sortBy(_._1).map(_._2.toDouble)))
In your example, you seem to want the result as a vector, so I used Spark's Vector (which, by the way, only allows Doubles). The result looks like:
[1.0,0.0,0.0]
[2.0,1.0,4.0]
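Since the stated goal is to feed this into RowMatrix, here is a minimal follow-up sketch (assuming the grouptest RDD[Vector] from above and the standard spark.mllib distributed-matrix import):
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// RowMatrix wraps an RDD[Vector] directly; each Vector becomes one row.
val mat = new RowMatrix(grouptest)
println(s"${mat.numRows()} x ${mat.numCols()}")  // 2 x 3 for the sample data above
One caveat: a RowMatrix does not carry meaningful row indices, so if the row order (A before B) must survive later operations, IndexedRowMatrix may be the better fit.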

Related

Leveraging Spring to reduce DB calls

I have a piece of data that looks like:
foo {
    string: one
    string: two
    list<string>: listOne
    list<string>: listTwo
}
such that in the DB one is associated with multiple entries of listOne.
Not much background; I'm at a loss as to where to even look for answers. During a code review I received feedback to try to eliminate a jdbcTemplate.query call, with a hint that "there may be a way to reduce this using #autowire".
I have no code to share; I just need a place to start looking for answers. I've been on the Spring website and I don't see anything that looks like it would help, and I didn't see any Google results that resemble what I'm looking for.
I should probably preface this with the fact that I'm a new dev, so even a simple answer is likely not something I've tried. This came about because my queries for listOne and listTwo return columns. I first tried using a mapper with jdbcTemplate.query() that returned a String, but JDBC didn't like that, so I ended up returning a list from the mapper. JDBC then turns those results into a list of lists, which I afterwards loop through to convert into a single list and store in foo. In my mind, an ideal solution lets me combine the two queries, with a mapper that looks like this (pseudocode):
public class FooMapper implements RowMapper<Foo> {
    public Foo mapRow(ResultSet resultSet, int rowNum) {
        foo.one = resultSet.get("thingOne");
        foo.two = resultSet.get("thingTwo");
        foo.listOne = resultSet.get("[a portion of the column]listThingOne");
        foo.listTwo = resultSet.get("[a portion of the column]listThingTwo");
        return foo;
    }
}
It should be noted that the result set is mono-directional (forward-only), which I found out when I tried using a String[] instead of a list.

(Using Julia) How can I reduce my data matrix by averaging values from the same hour?

I am trying to reduce the size of my data and I cannot make it work. I have data points taken every minute over one month, and I want to reduce this data to one sample for every hour. The problem is that some of my runs have "NA" values, so I delete those rows, and as a result there are not exactly 60 points for every hour; it varies.
I have a 'Timestamp' column, which I have used to make a 'datehour' column that has the same value for rows with the same date and hour. I want to average all the values that share the same 'datehour' value.
How can I do this? I have tried the if and for loops below, but they take very long to run.
Thanks for all your help! I am new to Julia and come from a Matlab background.
======= CODE ==========
uniquedatehour = unique(datehour, 1)
index = []
avedata = reshape([], 0, length(alldata[1,:]))
for j in uniquedatehour
    for i in 1:length(datehour)
        if datehour[i] == j
            index = vcat(index, i)
        else
            rows = alldata[index,:]
            rows = convert(Array{Float64,2}, rows)
            avehour = mean(rows, 1)
            avedata = vcat(avedata, avehour)
            index = []
            continue
        end
    end
end
There are several layers to optimizing this code. I am assuming that your data is sorted on datehour (your code assumes this).
Layer one: general recommendation
Wrap your code in a function. Executing code in global scope in Julia is much slower than within a function. When wrapping it, make sure to either pass the data to your function as arguments or, if the data stays in global scope, qualify it with const;
Layer two: recommendations to your algorithm
A statement like index=[] creates an array of element type Any, which is slow; use a type qualifier like index=Int[] to make it fast;
Using vcat as in index=vcat(index,i) is inefficient; it is better to do push!(index, i) in place;
It is better to preallocate avedata with e.g. fill(NA, length(uniquedatehour), size(alldata, 2)) and assign values to an existing matrix than to do vcat on it;
Your code will produce incorrect results, if I am not mistaken, as it will not catch the last entry of the uniquedatehour vector (assume it has only one element and check what happens: avedata will have zero rows);
The line rows=convert(Array{Float64,2},rows) is probably not needed at all. If alldata is not a Matrix{Float64}, it is better to convert it once at the beginning with Matrix{Float64}(alldata);
You can change the line rows=alldata[index,:] to a view, view(alldata, index, :), to avoid an allocation;
In general you can avoid creating the index vector altogether: it is enough to remember the start position s and end position e of a run of equal values and then use the range s:e to select the rows you want.
If you correct those things, please post your updated code and maybe I can help further, as there is still room for improvement, but it requires a somewhat different algorithmic approach (though you may prefer the option below for simplicity).
Layer three: how I would do it
I would use DataFrames package to handle this problem like this:
using DataFrames
df = DataFrame(alldata) # assuming alldata is Matrix{Float64}, otherwise convert it here
df[:grouping] = datehour
agg = aggregate(df, :grouping, mean) # maybe this is all you need, if a DataFrame is OK for you
Matrix(agg[2:end]) # here is how you can convert DataFrame back to a matrix
This is not the fastest solution (as it converts to a DataFrame and back), but it is much simpler for me.

Spark RDD Persistence and Partitions

When a certain RDD is created in Spark for example:
lines = sc.textFile("README.md")
And then a transformation is called on this RDD:
pythonLines = lines.filter(lambda line: "Python" in line)
If you call an action on this transformed filter RDD (such as pythonLines.first()), what does it mean when they say an RDD will be recomputed once again each time you run an action on it? I thought the original RDD that you created using the textFile method is not persisted after you called the filter transformation on it. So will it just recompute the most recent transformed RDD, which in this case is the RDD I made using the filter transformation? I don't really see why that would be necessary if my assumption is correct.
In Spark, RDDs are lazily evaluated. This means that if you simply write
val lines = sc.textFile("README.md").map(xxx)
Your program will exit without reading the file since you never used the result. If you write something like:
val linesLength = sc.textFile("README.md").map(line => line.split(" ").length)
val sumLinesLength = linesLength.reduce(_ + _) // <-- the Scala way
val maxLineLength = linesLength.max()
The computations needed to obtain linesLength will be performed twice, since you are reusing it in two different places. To avoid that, you should persist the resulting RDD before using it in both:
val linesLength = sc.textFile("README.md").map(line => line.split(" ").length)
linesLength.persist()
// ...
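For completeness, here is how the two actions would then reuse the persisted RDD; this is an illustrative sketch (not part of the original snippet), using the same names as above:
// Both actions reuse the persisted linesLength: the file is read and the
// lengths computed only once, when the first action runs.
val sumLinesLength = linesLength.reduce(_ + _)
val maxLineLength  = linesLength.max()

linesLength.unpersist()  // optionally free the cached blocks when you are done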
You can also take a look at https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence. Hope my explanation isn't too confusing!

Pig: FLATTEN keyword

I am a little confused about the use of the FLATTEN keyword in Pig.
Consider the below dataset:
tuple_record: {details: (firstname: chararray,lastname: chararray,age: int,sex: chararray)}
Without using FLATTEN, I can access a field (say, firstname) like this:
display_firstname = FOREACH tuple_record GENERATE details.firstname;
Now, using the FLATTEN keyword:
flatten_record = FOREACH tuple_record GENERATE FLATTEN(details);
DESCRIBE gives me this:
flatten_record: {details::firstname: chararray,details::lastname: chararray,details::age: int,details::sex: chararray}
Hence I can access the fields directly, without dereferencing, like this:
display_record = FOREACH flatten_record GENERATE firstname;
My questions about the FLATTEN keyword are:
1) Which of the two ways (i.e. with or without FLATTEN) is the more efficient way of achieving the same output?
2) Are there scenarios where the desired output cannot be achieved without using FLATTEN?
I am totally confused; please clarify its use and the scenarios in which I should use it.
Sometimes you have data in a bag or a tuple and you want to remove that level of nesting. When you want to rearrange your data on the fly and group by a particular field, you need a way to pull those entries out of the bag.
As per Pig documentation:
The FLATTEN operator looks like a UDF syntactically, but it is
actually an operator that changes the structure of tuples and bags in
a way that a UDF cannot. Flatten un-nests tuples as well as bags. The
idea is the same, but the operation and result is different for each
type of structure.
For more details, check the Pig documentation referenced above; it explains the usage of FLATTEN clearly, with examples.

Extract ordered tuple values from a bag

In Pig I massaged my data into something like:
(a,{(b,c),(d,e),(f,g)})
(h,{(i,j),(k,l)})
where the first item is the group and the bag holds other values related to the group. I would like to get it into the following format:
(a,b,c,d,e,f,g)
(h,i,j,k,l)
I got to where I am now with
grunt> j = foreach G {
>> o = order myvar by second;
>> generate group, o.(first,second);
>> };
So the tuples in the bag are currently ordered. If I do something like mystuff = foreach j generate group, flatten($1); I get it all flattened and un-grouped.
Is this possible in pig, and if so what command should I be looking at?
There is no way I know of to do what you want out of the box. You really need to use a user-defined function for this. I know it sucks because you have to write Java or Python code, but you'll find several situations where Pig just doesn't go far enough. Pig can be considered a data flow language and not so much a programming language, which is why UDFs play such an important role: they bridge the gap.
My suggestion is you write a UDF that takes in the group and value bag as parameters. Do the ordering/sorting in the UDF and also the flattening.
The other thing you want to be careful about is that your rows will now have different numbers of columns, and Pig doesn't really like this. If you are just immediately outputting it, you can probably get away with it. You might want to consider having your UDF write out the list as a tab-delimited string or something that is preformatted. This isn't that big of a deal... feel free to ignore my advice here.
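To make the UDF suggestion more concrete: the answer assumes Java or Python, but since a Pig UDF is just a JVM class extending EvalFunc, here is a rough, untested sketch of that idea in Scala (the class name and field positions are made up for illustration):
import org.apache.pig.EvalFunc
import org.apache.pig.data.{DataBag, Tuple}
import scala.collection.JavaConverters._

// Takes (group, bag{(first, second)}) and returns the group plus the bag's
// fields, ordered by `second`, joined into a single tab-delimited string.
class SortAndFlattenBag extends EvalFunc[String] {
  override def exec(input: Tuple): String = {
    if (input == null || input.size() < 2) return null
    val group = input.get(0)
    val bag   = input.get(1).asInstanceOf[DataBag]
    val pairs = bag.iterator().asScala
      .map(t => (t.get(0), t.get(1)))
      .toSeq
      .sortBy(_._2.toString)                 // ordering done inside the UDF
    (group +: pairs.flatMap { case (a, b) => Seq(a, b) }).mkString("\t")
  }
}
You would then REGISTER the jar and call the UDF inside the FOREACH in place of flatten($1); writing a tab-delimited string also sidesteps the variable-column-count issue mentioned above.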
