VB.NET filtering/grouping a datatable - linq

Sample Scenario
I have a table in a database with the following fields: SerialNo, GroupNo, Description, Quantity. At the moment I am looping over a DataTable that has been populated from an ADO.NET DataSet and adding the fields to a List as follows...
' Gets the items from the database and creates a DataSet
' The DataSet has a named DataTable called MyTable
Dim ds As DataSet = GetItems()
' Item is a model in my MVC project
Dim Item As Item
' I am creating a List of items...
Dim ai As New List(Of Item)
For Each row As DataRow In ds.Tables("MyTable").Rows
Item = New Item() With {
.SerialNo = If(Not IsDBNull(row("SerialNo")), CInt(row("SerialNo")), 0),
.GroupNo = If(Not IsDBNull(row("GroupNo")), CStr(row("GroupNo")), ""),
.Description = If(Not IsDBNull(row("Description")), CStr(row("Description")), ""),
.Quantity = If(Not IsDBNull(row("Quantity")), CInt(row("Quantity")), 0)
}
ai.Add(Item)
Next
Requirement
Instead of getting every row, I want to get just the first occurrence of every GroupNo and return this result in a List. For example...
SerialNo = 1 GroupNo = 1 Description = Item A Quantity = 100
SerialNo = 2 GroupNo = 1 Description = Item B Quantity = 100
SerialNo = 3 GroupNo = 1 Description = Item C Quantity = 100
SerialNo = 4 GroupNo = 2 Description = Item D Quantity = 100
SerialNo = 5 GroupNo = 2 Description = Item E Quantity = 100
SerialNo = 6 GroupNo = 3 Description = Item F Quantity = 100
... should actually be modified to return...
SerialNo = 1 GroupNo = 1 Description = Item A Quantity = 100
SerialNo = 4 GroupNo = 2 Description = Item D Quantity = 100
SerialNo = 6 GroupNo = 3 Description = Item F Quantity = 100
I am using Visual Studio 2010 (VB.NET) with .NET 4.0.
I've tried to research various ways, but I either get stuck trying to extract all 4 columns or the result doesn't seem to group correctly. Note: I don't want to modify the query to only return the subset of data. I need to filter/group it in code.

So you just want to take the first DataRow of each group to initialize your Item:
Dim items = From row In ds.Tables("MyTable").AsEnumerable()
Let GroupNo = row.Field(Of Int32)("GroupNo")
Group row By GroupNo Into Group
Select New Item() With {
.GroupNo = GroupNo,
.SerialNo = Group.First().Field(Of Int32)("SerialNo"),
.Quantity = Group.First().Field(Of Int32)("Quantity"),
.Description = Group.First().Field(Of String)("Description")
}
If you want to copy it into a List(Of Item) you only have to call items.ToList().
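For readers working in C#, the same first-row-per-group idea can be sketched as below. This is a minimal in-memory sketch rather than the answerer's code: the sample rows are taken from the question, GroupNo is assumed to be a string column (the question converts it with CStr), and a tuple stands in for the Item model.

```csharp
using System;
using System.Data;
using System.Linq;

// Hypothetical in-memory stand-in for the "MyTable" DataTable from the question.
var table = new DataTable("MyTable");
table.Columns.Add("SerialNo", typeof(int));
table.Columns.Add("GroupNo", typeof(string));   // assumption: GroupNo is a string column
table.Columns.Add("Description", typeof(string));
table.Columns.Add("Quantity", typeof(int));
table.Rows.Add(1, "1", "Item A", 100);
table.Rows.Add(2, "1", "Item B", 100);
table.Rows.Add(3, "1", "Item C", 100);
table.Rows.Add(4, "2", "Item D", 100);
table.Rows.Add(5, "2", "Item E", 100);
table.Rows.Add(6, "3", "Item F", 100);

// Group by GroupNo and keep only the first row of each group.
// LINQ to Objects yields the groups, and the rows inside them, in source order.
var firstPerGroup = table.AsEnumerable()
    .GroupBy(row => row.Field<string>("GroupNo"))
    .Select(g => g.First())
    .Select(row => (
        SerialNo: row.Field<int>("SerialNo"),
        GroupNo: row.Field<string>("GroupNo"),
        Description: row.Field<string>("Description"),
        Quantity: row.Field<int>("Quantity")))
    .ToList();

foreach (var item in firstPerGroup)
    Console.WriteLine($"SerialNo = {item.SerialNo} GroupNo = {item.GroupNo} Description = {item.Description} Quantity = {item.Quantity}");
```

If the columns can contain DBNull, as the original loop anticipates, Field<int?> with a fallback value is one way to avoid the cast exception that Field<int> would throw.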

Related

Spark 1.6: How to process extremely big DataFrame without for loop?

I have a Hive table and I created an extremely large Spark DataFrame of the following shape:
------------------------------------------------------
| Category |Subcategory | Purchase_ID | Product_ID |
|------------+------------+-------------+------------|
| a | a_1 | purchase 1 | product 1 |
| a | a_1 | purchase 1 | product 2 |
| a | a_1 | purchase 1 | product 3 |
| a | a_1 | purchase 4 | product 1 |
| a | a_2 | purchase 5 | product 4 |
| b | b_1 | purchase 6 | product 5 |
| b | b_2 | purchase 7 | product 6 |
------------------------------------------------------
Please note that this matrix is extremely big: tens of millions of purchases for each subcategory and 50M+ purchases for each category.
My tasks are as follows:
Group all purchases by 'subcategory' and compute cosine similarity between all products (meaning that if 2 products are always found together in purchases, their cosine similarity is 1.0; if they never appear together, their cosine similarity is 0.0)
Group all purchases by 'category' and compute cosine similarity between all products
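For reference, the description above matches the cosine similarity between binary purchase-incidence columns. The formula below is a sketch under the assumption that a product is counted at most once per purchase. The symbols are introduced here for illustration only: P_i is the set of purchases containing product i.

```latex
\[
\cos(i, j) = \frac{n_{ij}}{\sqrt{n_i \, n_j}},
\qquad n_i = |P_i|, \quad n_{ij} = |P_i \cap P_j|
\]
```

If two products always appear together, then n_ij = n_i = n_j and the similarity is 1.0. If they never appear together, n_ij = 0 and the similarity is 0.0.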
My current solution so far:
First I collect all unique 'Subcategory' values onto the driver machine from Hive using SQL, and then I loop through each subcategory, load the data again for that particular subcategory, and compute the cosine similarity. Computing the cosine similarity for each pair of products requires building an NxN matrix. I thought (please correct me if I am wrong) that loading the entire DataFrame, grouping by Subcategory, and computing an NxN matrix for each subcategory might lead to an out-of-memory error, so I compute them sequentially as follows:
val subcategories = hiveContext.sql(s"SELECT DISTINCT Subcategory FROM $table_name")
val subcategory_ids = subcategories.select("Subcategory").collect()
// for each context, sequentially compute models
for ((arr_subcategory_id, index) <- subcategory_ids.zipWithIndex) {
println("Loading current context")
val subcategory_id = arr_subcategory_id(0)
println("subcategory id: ".concat(subcategory_id.toString))
val context_data = hiveContext.sql(s"SELECT Purchase_ID, Product_ID FROM $table_name WHERE Subcategory = '$subcategory_id'")
//UDF to concatenate column values into single long string
class ConcatenatedGroupItems() extends UserDefinedAggregateFunction {
// Input Data Type Schema
def inputSchema: StructType = StructType(Array(StructField("item", StringType)))
// Intermediate Schema
def bufferSchema = StructType(Array(StructField("items", StringType)))
// Returned Data Type .
def dataType: DataType = StringType
// Self-explaining
def deterministic = true
// This function is called whenever key changes
def initialize(buffer: MutableAggregationBuffer) = {
buffer(0) = "" // initialize to empty string
}
// Iterate over each entry of a group
def update(buffer: MutableAggregationBuffer, input: Row) = {
var tempString:String = buffer.getString(0)
// add space in between the items unless it is the first element
if (tempString.length() != 0){
tempString = tempString + " "
}
buffer(0) = tempString.concat(input.getString(0))
}
// Merge two partial aggregates
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
var tempString = buffer1.getString(0)
// add space in between the items unless it is the first element
if (tempString.length() != 0){
tempString = tempString + " "
}
buffer1(0) = tempString.concat(buffer2.getString(0))
}
// Called after all the entries are exhausted.
def evaluate(buffer: Row) = {
buffer.getString(0)
}
}
// ========================================================================================
println("Concatenating grouped items")
val itemConcatenator = new ConcatenatedGroupItems()
val sents = context_data.groupBy("Purchase_ID").agg(itemConcatenator(context_data.col("Product_ID")).as("items"))
// ========================================================================================
println("Tokenizing purchase items")
val tokenizer = new Tokenizer().setInputCol("items").setOutputCol("words")
val tokenized = tokenizer.transform(sents)
// ========================================================================================
// fit a CountVectorizerModel from the corpus
println("Creating sparse incidence matrix")
val cvModel: CountVectorizerModel = new CountVectorizer().setInputCol("words").setOutputCol("features").fit(tokenized)
val incidence = cvModel.transform(tokenized)
// ========================================================================================
// create dataframe of mapping from indices into the item id
println("Creating vocabulary")
val vocabulary_rdd = sc.parallelize(cvModel.vocabulary)
val rows_vocabulary_rdd = vocabulary_rdd.zipWithIndex.map{ case (s,i) => Row(s,i)}
val vocabulary_field1 = StructField("Product_ID", StringType, true)
val vocabulary_field2 = StructField("Product_Index", LongType, true)
val schema_vocabulary = StructType(Seq(vocabulary_field1, vocabulary_field2))
val df_vocabulary = hiveContext.createDataFrame(rows_vocabulary_rdd, schema_vocabulary)
// ========================================================================================
println("Computing similarity matrix")
val myvectors = incidence.select("features").rdd.map(r => r(0).asInstanceOf[Vector])
val mat: RowMatrix = new RowMatrix(myvectors)
val sims = mat.columnSimilarities(0.0)
// ========================================================================================
// Convert records of the Matrix Entry RDD into Rows
println("Extracting paired similarities")
val rowRdd = sims.entries.map{case MatrixEntry(i, j, v) => Row(i, j, v)}
// ========================================================================================
// create dataframe schema
println("Creating similarity dataframe")
val field1 = StructField("Product_Index", LongType, true)
val field2 = StructField("Neighbor_Index", LongType, true)
val field3 = StructField("Similarity_Score", DoubleType, true)
val schema_similarities = StructType(Seq(field1, field2, field3))
// create the dataframe
val df_similarities = hiveContext.createDataFrame(rowRdd, schema_similarities)
// ========================================================================================
println("Register vocabulary and correlations as spark temp tables")
df_vocabulary.registerTempTable("df_vocabulary")
df_similarities.registerTempTable("df_similarities")
// ========================================================================================
println("Extracting Product_ID")
val temp_corrs = hiveContext.sql(
"SELECT T1.Product_ID, T2.Neighbor_ID, T1.Similarity_Score " +
"FROM " +
"(SELECT Product_ID, Neighbor_Index, Similarity_Score " +
"FROM df_similarities LEFT JOIN df_vocabulary " +
"ON df_similarities.Product_Index = df_vocabulary.Product_Index) AS T1 " +
"LEFT JOIN " +
"(SELECT Product_ID AS Neighbor_ID, Product_Index AS Neighbor_Index FROM df_vocabulary) AS T2 " +
"ON " +
"T1.Neighbor_Index = T2.Neighbor_Index")
// ========================================================================================
val context_corrs = temp_corrs.withColumn("Context_ID", lit(subcategory_id))
// ========================================================================================
context_corrs.registerTempTable("my_temp_table_correlations")
hiveContext.sql(s"INSERT INTO TABLE $table_name_correlations SELECT * FROM my_temp_table_correlations")
// ========================================================================================
// clean up environment
println("Cleaning up temp tables")
hiveContext.dropTempTable("my_temp_table_correlations")
hiveContext.dropTempTable("df_similarities")
hiveContext.dropTempTable("df_vocabulary")
}
}
Problems:
This is extremely slow. Computing each subcategory takes about 1 minute.
After processing ~30 subcategories, I get out-of-memory errors.
What is the right logic to solve such a problem?
I could partition the DataFrame by subcategory first, then group by subcategory and compute the similarity for each one, but if the data for one partition is too big, I might get an OOM error when I build the NxN matrix.

Linq Sql Query Error for Get value

Here is my LINQ query to get Qty and Number, computed as the first collection's Qty minus the second collection's Qty and the first collection's Number minus the second collection's Number. Sometimes an RM in the first collection has no match in the second collection.
var summary = (from r in firstCollection
join s in secondCollection
on new { r.RM, r.Size } equals new { s.RM, s.Size }
group new { r, s } by new { RM = r.RM, Size = s.Size, Qty = (r.Qty - s.Qty), Number = (r.Number - s.Number) }
into grp
select new
{
RM = grp.Key.RM,
RMsize = grp.Key.Size,
Qty = grp.Key.Qty,
Number = grp.Key.Number
}).ToList();
There is an error like this:
Additional information: A group by expression can only contain
non-constant scalars that are comparable by the server. The expression
with type 'Manufacturing.DataAccess.tbl_RawMaterial' is not
comparable.
How can I solve this?
You can project to anonymous type first and then do a grouping. Try this:
var summary = (from r in firstCollection
join s in secondCollection
on new { r.RM, r.Size } equals new { s.RM, s.Size }
select new
{
RM = r.RM,
Size = s.Size,
Qty = (r.Qty - s.Qty),
Number = (r.Number - s.Number)
} into tmp
group tmp by new
{
tmp.RM,
tmp.Size,
tmp.Qty,
tmp.Number
} into grp
select new
{
RM = grp.Key.RM,
RMsize = grp.Key.Size,
Qty = grp.Key.Qty,
Number = grp.Key.Number
}).ToList();
Looks like the problem is RM member which I assume is some navigation property of type Manufacturing.DataAccess.tbl_RawMaterial. As the exception message states, you can only group by simple properties.
Let's say the primary key of your entity Manufacturing.DataAccess.tbl_RawMaterial is called Id (you can replace it with the actual name). Then the query could be something like this:
var summary =
(from r in firstCollection
join s in secondCollection
on new { r.RM.Id, r.Size } equals new { s.RM.Id, s.Size }
group new { r, s }
by new { Id = r.RM.Id, Size = s.Size, Qty = (r.Qty - s.Qty), Number = (r.Number - s.Number) }
into grp
select new
{
RM = grp.Select(e => e.r.RM).FirstOrDefault(),
RMsize = grp.Key.Size,
Qty = grp.Key.Qty,
Number = grp.Key.Number
}).ToList();
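The question also mentions that an RM in the first collection sometimes has no match in the second one, which an inner join silently drops. A left join is one way to keep those rows. The following is a minimal in-memory sketch with hypothetical sample data, where RM is reduced to a scalar key and tuples stand in for the entities:

```csharp
using System;
using System.Linq;

// Hypothetical sample data; RM is a plain int key here, not a navigation property.
var firstCollection = new[]
{
    (RM: 1, Size: 10, Qty: 50, Number: 5),
    (RM: 2, Size: 20, Qty: 40, Number: 4),   // no matching row in secondCollection
};
var secondCollection = new[]
{
    (RM: 1, Size: 10, Qty: 20, Number: 2),
};

var summary = (from r in firstCollection
               join s in secondCollection
                   on new { RM = r.RM, Size = r.Size } equals new { RM = s.RM, Size = s.Size }
                   into matches
               // DefaultIfEmpty() yields the default tuple (all zeros) when there is no match,
               // so the subtraction leaves the first collection's values unchanged.
               from s in matches.DefaultIfEmpty()
               select (
                   RM: r.RM,
                   RMsize: r.Size,
                   Qty: r.Qty - s.Qty,
                   Number: r.Number - s.Number)).ToList();

foreach (var row in summary)
    Console.WriteLine($"RM = {row.RM} RMsize = {row.RMsize} Qty = {row.Qty} Number = {row.Number}");
```

Against a database provider the unmatched side would be null rather than a zero-filled tuple, so the projection would need an explicit null check before subtracting.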

LINQ getting distinct records based on an item value

I have following LINQ query
var unallocatedOrders = (from orderLine in context.OrderLineItemDboes
where (orderLine.Status == unallocated || orderLine.Status == null)
&& orderLine.orderline.order.order_status_fk == verified
group orderLine
by new { orderLine.orderline.ol_id,orderLine.orderline.order.order_id }
into g
select new { OrderLineId = g.Key.ol_id, Count = g.Count(), OrderId = g.Key.order_id })
.ToList();
The above query gives me results in the following way:
Order1 ol1 2
order1 ol2 3
order1 ol3 1
order2 ol1 1
order2 ol2 2
order3 ol1 4
order3 ol2 3
order3 ol3 2
I need to iterate through the above list by order ID and fetch the corresponding lines and quantities.
I need to put each line ID and quantity into a Dictionary.
Can somebody suggest how I can get this done?
Thanks
Here's how you can select the items using GroupBy. (Your question doesn't really specify how you want to use the lines, so I just output them to the Debug console.)
// group by the OrderId
foreach (var group in unallocatedOrders.GroupBy(row => row.OrderId))
{
Debug.WriteLine(
// for each line, output "Order x has lines y1, y2, y3..."
string.Format("Order {0} has lines {1}",
// here the key is the OrderId
group.Key,
// comma-delimited output
string.Join(", ",
// select each value in the group, and output its OrderLineId, and quantity
group.Select(item =>
string.Format("{0} (quantity {1})", item.OrderLineId, item.Count)
)
)
)
);
}
You can get a dictionary lookup by using ToDictionary.
// two-level lookup: 1) OrderId 2) OrderLineId
var lookup = new Dictionary<int, Dictionary<int, int>>();
foreach (var group in unallocatedOrders.GroupBy(row => row.OrderId))
{
// add each order to the lookup
lookup.Add(group.Key, group.ToDictionary(
// key selector
keySelector: item => item.OrderLineId,
// value selector: Count is a property of the projected item, not a method
elementSelector: item => item.Count
));
}
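The two-level lookup can also be built in a single expression by nesting ToDictionary calls. The following is a minimal sketch with hypothetical in-memory data that mirrors the shape of the query result; the IDs are assumed to be ints.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample mirroring the result rows shown in the question.
var unallocatedOrders = new[]
{
    new { OrderId = 1, OrderLineId = 1, Count = 2 },
    new { OrderId = 1, OrderLineId = 2, Count = 3 },
    new { OrderId = 1, OrderLineId = 3, Count = 1 },
    new { OrderId = 2, OrderLineId = 1, Count = 1 },
    new { OrderId = 2, OrderLineId = 2, Count = 2 },
    new { OrderId = 3, OrderLineId = 1, Count = 4 },
    new { OrderId = 3, OrderLineId = 2, Count = 3 },
    new { OrderId = 3, OrderLineId = 3, Count = 2 },
}.ToList();

// Outer dictionary: OrderId -> inner dictionary of OrderLineId -> quantity.
Dictionary<int, Dictionary<int, int>> lookup = unallocatedOrders
    .GroupBy(row => row.OrderId)
    .ToDictionary(
        g => g.Key,
        g => g.ToDictionary(item => item.OrderLineId, item => item.Count));

foreach (var order in lookup)
    foreach (var line in order.Value)
        Console.WriteLine($"Order {order.Key}, line {line.Key}: quantity {line.Value}");
```

ToDictionary throws an ArgumentException on duplicate keys, so this relies on each (OrderId, OrderLineId) pair appearing once, which the grouping in the original query should guarantee.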

Update existing list values with values from another query

I have a linq statement which calls a stored proc and returns a list of items and descriptions.
Like so:
var q = from i in doh.usp_Report_PLC()
where i.QTYGood == 0
orderby i.PartNumber
select new Parts() { PartNumber = i.PartNumber, Description = i.Descritpion.TrimEnd() };
I then have another SQL statement which returns the quantities on order and delivery date for each of those items. The Parts class has two other properties to store these. How do I update the existing Parts list with the other two values so that there is one Parts list with all four values?
UPDATE
The following code now brings out results.
var a = from a1 in db.usp_Optos_DaysOnHand_Report_PLC()
where a1.QTYGood == 0
orderby a1.PartNumber
select new Parts() { PartNumber = a1.PartNumber, Description = a1.Descritpion.TrimEnd() };
var b = from b1 in db.POP10110s
join b2 in db.IV00101s on b1.ITEMNMBR equals b2.ITEMNMBR
//from b3 in j1.DefaultIfEmpty()
where b1.POLNESTA == 2 && b1.QTYCANCE == 0
group b1 by new { itemNumber = b2.ITMGEDSC } into g
select new Parts() { PartNumber = g.Key.itemNumber.TrimEnd(), QtyOnOrder = g.Sum(x => Convert.ToInt32(x.QTYORDER)), DeliveryDue = g.Max(x => x.REQDATE).ToShortDateString() };
var joinedList = a.Join(b,
usp => usp.PartNumber,
oss => oss.PartNumber,
(usp, oss) =>
new Parts
{
PartNumber = usp.PartNumber,
Description = usp.Description,
QtyOnOrder = oss.QtyOnOrder,
DeliveryDue = oss.DeliveryDue
});
return joinedList.ToList();
Assuming your "other SQL statement" returns PartNumber, Quantity and DeliveryDate, you can join the lists into one:
var joinedList = q.Join(OtherSQLStatement(),
usp => usp.PartNumber,
oss => oss.PartNumber,
(usp, oss) =>
new Parts
{
PartNumber = usp.PartNumber,
Description = usp.Description,
Quantity = oss.Quantity,
DeliveryDate = oss.DeliveryDate
}).ToList();
You can actually combine the queries and do this in one join and projection:
var joinedList = doh.usp_Report_PLC().
Where(i => i.QTYGood == 0).
OrderBy(i => i.PartNumber).
Join(OtherSQLStatement(),
i => i.PartNumber,
o => o.PartNumber,
(i, o) =>
new Parts
{
PartNumber = i.PartNumber,
Description = i.Description,
Quantity = o.Quantity,
DeliveryDate = o.DeliveryDate
}).ToList();
And again: I assume you have PartNumber in both returned collections to identify which item belongs to which.
Edit
In this case the LINQ Query syntax would probably be more readable:
var joinedList = from aElem in a
join bElem in b
on aElem.PartNumber equals bElem.PartNumber into joinedAB
from abElem in joinedAB.DefaultIfEmpty()
select new Parts
{
PartNumber = aElem.PartNumber,
Description = aElem.Description,
DeliveryDue = abElem == null ? null : abElem.DeliveryDue,
QtyOnOrder = abElem == null ? null : abElem.QtyOnOrder
};
Your DeliveryDue and QtyOnOrder are probably nullable. If not, replace the nulls by your default values. E.g. if you don't have the element in b and want QtyOnOrder to be 0 in the resulting list, change the line to
QtyOnOrder = abElem == null ? 0 : abElem.QtyOnOrder

LINQ | How do I get SUM without grouping?

Crazy question...however, I want the sum of all the rows in a table for a column (without using the group by clause)
Example:
Table = Survey
Columns = Answer1, Answer2, Answer3
1 1 1
4 3 5
3 3 2
I want the sums for each column.
Final results should look like:
Answer1Sum Answer2Sum Answer3Sum
8 7 8
This doesn't work:
from survey in SurveyAnswers
select new
{
Answer1Sum = survey.Sum(),
Answer2Sum = survey.Sum(),
Answer3Sum = survey.Sum()
}
Would this work?
var answer1Sum = SurveyAnswers.Sum( survey => survey.Answer1 );
var answer2Sum = SurveyAnswers.Sum( survey => survey.Answer2 );
var answer3Sum = SurveyAnswers.Sum( survey => survey.Answer3 );
A VB.NET solution for anyone who needs it is as follows:
Dim Answer1Sum = SurveyAnswers.Sum(Function(survey) survey.Answer1)
Dim Answer2Sum = SurveyAnswers.Sum(Function(survey) survey.Answer2)
Dim Answer3Sum = SurveyAnswers.Sum(Function(survey) survey.Answer3)
SurveyAnswers.Sum(r => r.Answer1);
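Each Sum call above walks the sequence separately. When the rows are already in memory, all three totals can be accumulated in one pass with Aggregate. The following is a minimal sketch using the sample values from the question, with tuples standing in for the Survey rows; it targets LINQ to Objects and is not meant as a query a database provider would translate.

```csharp
using System;
using System.Linq;

// Sample rows from the question; a tuple stands in for the Survey entity.
var surveyAnswers = new[]
{
    (Answer1: 1, Answer2: 1, Answer3: 1),
    (Answer1: 4, Answer2: 3, Answer3: 5),
    (Answer1: 3, Answer2: 3, Answer3: 2),
};

// Fold every row into a running tuple of the three column totals.
var sums = surveyAnswers.Aggregate(
    (Answer1Sum: 0, Answer2Sum: 0, Answer3Sum: 0),
    (acc, s) => (
        Answer1Sum: acc.Answer1Sum + s.Answer1,
        Answer2Sum: acc.Answer2Sum + s.Answer2,
        Answer3Sum: acc.Answer3Sum + s.Answer3));

Console.WriteLine($"{sums.Answer1Sum} {sums.Answer2Sum} {sums.Answer3Sum}"); // prints "8 7 8"
```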
