duplicated rows in RDD - caching

I encountered the following problem in Spark:
...
while(...){
  key = filtersIterator.next()
  pricesOverLeadTimesKeyFiltered = pricesOverLeadTimesFilteredMap_cached
    .filter(x => x._1.equals(key))
    .values
  resultSRB = processLinearBreakevens(pricesOverLeadTimesKeyFiltered)
  resultsSRB = resultsSRB.union(resultSRB)
}
....
This way, I end up accumulating the same resultSRB in resultsSRB on every iteration.
However, either of the following "tricks" lets me add a different, correct resultSRB on each iteration:
Call resultSRB.collect(), resultSRB.foreach(println), or println(resultSRB.count) after each processLinearBreakevens(..) call.
Perform the same operation on pricesOverLeadTimesKeyFiltered at the beginning of processLinearBreakevens(..).
It seems I need to ensure that all operations are "flushed" before performing the union. I have already tried doing the union through a temporary variable, and persisting resultSRB or pricesOverLeadTimesKeyFiltered, but the problem remains.
Could you help me please?
Michael

If my assumption is correct and all of these are vars, then the problem is closures. key needs to be a val, because it is captured lazily by your filter. When the filter is finally executed, it always uses the last value of key.
My example:
def process(filtered : RDD[Int]) = filtered.map(x => x + 1)
var i = 1
var key = 1
var filtered = sc.parallelize(List[Int]())
var result = sc.parallelize(List[Int]())
var results = sc.parallelize(List[Int]())
val cached = sc.parallelize(1 to 1000).map(x => (x, x)).persist
while(i <= 3){
  key = i * 10
  val filtered = cached
    .filter(x => x._1.equals(key))
    .values
  val result = process(filtered)
  results = results.union(result)
  i = i + 1
}
results.collect
//expect 11,21,31 but get 31, 31, 31
To fix it, declare key as a val inside the while loop and you will get the expected 11, 21, 31.
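The same effect can be reproduced without Spark. The following is only a minimal JavaScript sketch of the idea, assuming that deferred closures are a fair stand-in for Spark's lazy transformations; buildLazyFilters, sharedKey and freshKey are hypothetical names that do not appear in the original code.

```javascript
// Deferred closures stand in for Spark's lazy transformations (an analogy, not Spark code).
function buildLazyFilters(useFreshBinding) {
  const data = [10, 20, 30];
  const pending = [];
  let sharedKey; // one mutable binding, like a `var key` declared outside the loop
  for (let i = 1; i <= 3; i++) {
    sharedKey = i * 10;
    const freshKey = i * 10; // a new binding per iteration, like a `val` inside the loop
    pending.push(useFreshBinding
      ? () => data.filter(x => x === freshKey)
      : () => data.filter(x => x === sharedKey));
  }
  // Nothing has been filtered yet; evaluation happens here, after the loop has finished.
  return pending.flatMap(run => run());
}

const withSharedVar = buildLazyFilters(false); // every filter sees the last key
const withFreshVal = buildLazyFilters(true);   // each filter keeps its own key
console.log(withSharedVar, withFreshVal);
```

With the shared binding, all three filters run against the final key; with a fresh binding per iteration, each filter keeps the key it was built with.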

Related

App Script For Loop is Slow Need to optimize code for faster update

This code works perfectly, but it is extremely slow. I would not normally take this approach, but I was not able to find any other option.
My requirement is this: wherever the code finds a 1 in the column, it should get the key and value from that row of the defined range, then go to the To sheet, find the key, and paste the value into the required columns.
Ideally I would jump straight to the cells in the selected column that contain 1, use the index to get the key and value, and then jump to the matching key in the To sheet's key column, instead of visiting every cell and checking whether the key is there.
I am new to Apps Script, and a little help would be great.
Thank you in advance.
function Data_Update(){
  //assigning sheet name to variables
  var ss = SpreadsheetApp.getActive()
  var from = ss.getSheetByName("From Sheet")//update here from sheet name(Updated)
  var To = ss.getSheetByName("To Sheet")//update here to sheet name(Updated)
  // Creating Loop
  for (var i = 4; i <= 7000; i++){//Update here from which row to start
    //assigning values to check if there is 1 in column K
    var udaterang = from.getRange("K"+i).getValue()//update here from which column to check for
    Logger.log(i)
    // Checking condition if the value is different from the value already there
    if (udaterang == 1) {
      //creating key to find the value
      var name1 = from.getRange("A"+i).getValue()//update here the key column 1
      var name2 = from.getRange("B"+i).getValue()//update here the key column 2
      var name3 = from.getRange("C"+i).getValue()//update here the key column 3
      var name = name1.trim()+name2.trim()+name3.trim()
      var rng = from.getSheetValues(i,4,1,7)//start row, start column, # rows, # columns
      // Looping through each cell to check if the data needs update
      for(var j = 2; j <= 12500; j++){
        var key = To.getRange("AP"+j).getValue()
        if(key == name){ //[1] because column B
          To.getRange("AI"+j+":"+"AO"+j).setValues(rng)
          break
        }
      }
    }
  }
}
Try this:
function Data_Update() {
  const ss = SpreadsheetApp.getActive();
  const fsh = ss.getSheetByName("From Sheet");
  const tsh = ss.getSheetByName("To Sheet");
  // Read each range once instead of calling getValue() per cell
  const fkvs = fsh.getRange(4, 11, fsh.getLastRow() - 3).getValues().flat(); // column K flags, from row 4
  const fvs = fsh.getRange(4, 1, fsh.getLastRow() - 3, fsh.getLastColumn()).getValues(); // From rows, from row 4
  const tapvs = tsh.getRange(2, 42, tsh.getLastRow() - 1).getValues().flat(); // column AP keys, from row 2
  fkvs.forEach((k, i) => {
    if (k == 1) {
      let name = fvs[i][0].toString().trim() + fvs[i][1].toString().trim() + fvs[i][2].toString().trim();
      let idx = tapvs.indexOf(name);
      if (~idx) {
        // Copy columns D:J of the matching From row into AI:AO of the To sheet
        tsh.getRange(idx + 2, 35, 1, 7).setValues([fvs[i].slice(3, 10)]);
      }
    }
  });
}
This may need some tweaking: I have not tested it, as I don't have the data to do so.
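If the To sheet is large, the repeated indexOf scan could in turn be replaced by a key-to-row map built once. The following is a hedged sketch in plain JavaScript that works on ordinary arrays rather than on sheets, so it can be run outside Apps Script; planUpdates, rowByKey and the sample data are hypothetical, and the column positions (A to C for the key, D to J for the values, K for the flag) are taken from the question.

```javascript
// Plain arrays stand in for the sheet data (hypothetical sample values).
// fromRows: columns A..K of the From sheet; toKeys: column AP of the To sheet.
function planUpdates(fromRows, toKeys, toFirstRow = 2) {
  // Index every key once; keep the first occurrence, like the original `break`.
  const rowByKey = new Map();
  toKeys.forEach((key, j) => {
    if (!rowByKey.has(key)) rowByKey.set(key, j + toFirstRow);
  });
  const updates = [];
  for (const row of fromRows) {
    if (row[10] != 1) continue; // column K flag, loose comparison as in the question
    const name = row.slice(0, 3).map(v => String(v).trim()).join("");
    const target = rowByKey.get(name);
    if (target !== undefined) {
      updates.push({ row: target, values: row.slice(3, 10) }); // columns D..J
    }
  }
  return updates;
}

const fromRows = [
  ["a ", "b", "c", 1, 2, 3, 4, 5, 6, 7, 1],
  ["x", "y", "z", 8, 8, 8, 8, 8, 8, 8, 0],
  ["q", "r", "s", 9, 9, 9, 9, 9, 9, 9, 1],
];
const toKeys = ["xyz", "abc", "abc"];
const updates = planUpdates(fromRows, toKeys);
console.log(updates);
```

In a real script, each planned update would still need a setValues call on the To sheet; the sketch only covers the lookup step.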

How to request with random row linq

I am slow today
There is a request
"Take random child and put it into another garden."
I changed the code, but I get an error on the last line of code, "Does not contain a definition…and no extension method":
var query = db.Child.Where(x => x.Garden != null);
int count = query.Count();
int index = new Random().Next(count);
var ch = db.Child.OrderBy(x => query.Skip(index).FirstOrDefault());
ch.Garden_Id = "1";
What am I doing wrong?
It's hard to tell what you're doing wrong, because you didn't say why the results you're getting do not satisfy you.
But I can see two possible mistakes.
You're counting items with x.Garden != null condition, but taking from all children.
Take returns IEnumerable<T> even when you specify it to return only 1 item, you should probably use First instead.
I think your k should be
var k = db.Child.Where(x => x.Garden != null).Skip(rnd.Next(0,q)).First();
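The first point, counting and picking from the same filtered set, can be sketched outside LINQ as well. This is a minimal plain JavaScript illustration over an in-memory array, not Entity Framework code; pickRandomWithGarden and the sample children are hypothetical, and the random source is passed in only so that the result can be reproduced.

```javascript
// Apply the filter once, then count and pick from the SAME filtered set.
function pickRandomWithGarden(children, random = Math.random) {
  const candidates = children.filter(c => c.garden != null);
  if (candidates.length === 0) return undefined; // nothing to pick from
  const index = Math.floor(random() * candidates.length);
  return candidates[index]; // a single item, not a sequence
}

const children = [
  { name: "Ann", garden: "2" },
  { name: "Bob", garden: null },
  { name: "Cid", garden: "3" },
];
// A fixed "random" value keeps the example deterministic.
const picked = pickRandomWithGarden(children, () => 0.75);
picked.garden = "1"; // move the chosen child to another garden
console.log(picked);
```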

Unable to create a constant value - only primitive types or Enumeration types allowed

I have seen some questions related to this Exception here but none made me understand the root cause of the problem. So here we have one more...
var testquery =
    ((from le in context.LoanEMIs.Include("LoanPmnt")
      join lp in context.LoanPmnts on le.Id equals lp.LoanEMIId
      where lp.PmntDtTm < date && lp.IsPaid == false
            && le.IsActive == true && lp.Amount > 0
      select new ObjGetAllPendingPmntDetails
      {
          Id = lp.Id,
          Table = "LoanEMI",
          loanEMIId = lp.LoanEMIId,
          Name = le.AcHead,
          Ref = SqlFunctions.StringConvert((double)le.FreqId),
          PmntDtTm = lp.PmntDtTm,
          Amount = lp.Amount,
          IsDiscard = lp.IsDiscarded,
          DiscardRemarks = lp.DiscardRemarks
      }).DefaultIfEmpty(ObjNull));
List<ObjGetAllPendingPmntDetails> test = testquery.ToList();
This query gives the following Exception Message -
Unable to create a constant value of type CashVitae.ObjGetAllPendingPmntDetails. Only primitive types or enumeration types are supported in this context.
I got this Exception after I added the SQL function statement to convert le.FreqId which is a byte to a string as ToString() is not recognized in the LINQ Expression Store.
ObjGetAllPendingPmntDetails is a partial class in my model which is added as it is used too many times in the code to bind data to tables.
It has both IDs as long, 'Amount' as decimal, PmntDtTm as DateTime, IsDiscard as bool, and all the remaining properties, including 'Ref', as string.
I get no results as currently no data satisfies the condition. While trying to handle null, I added DefaultIfEmpty(ObjNull) and ObjNull has all properties initialized as follows.
ObjGetAllPendingPmntDetails ObjNull = new ObjGetAllPendingPmntDetails()
{ Id = 0, Table = "-", loanEMIId = 0, Name = "-", Ref = "-",
PmntDtTm = Convert.ToDateTime("01-01-1900"),
Amount = 0, IsDiscard = false, DiscardRemarks = "" };
I need this query to work, as Union() is called on it with 5 other queries, all returning the same ObjGetAllPendingPmntDetails columns. But there is a problem: this query has no data satisfying the conditions, and it throws the exception shown above.
Any suggestions are appreciated as I am unable to understand the root cause of the problem.
@AndrewCoonce is right, the .DefaultIfEmpty(ObjNull) is the culprit here. Entity Framework turns DefaultIfEmpty into something like...
CASE WHEN ([Project1].[C1] IS NULL) THEN @param ELSE [Project1].[Value] END AS [C1]
...but there's no way to coerce an instance of ObjGetAllPendingPmntDetails into something that can take the place of @param, so you get an exception.
If you move the DefaultIfEmpty call to after the ToList it should work correctly (although you'll need to call ToList again after that if you really want a concrete list instance).

Row number in LINQ

I have a linq query like this:
var accounts =
from account in context.Accounts
from guranteer in account.Gurantors
where guranteer.GuarantorRegistryId == guranteerRegistryId
select new AccountsReport
{
recordIndex = ?
CreditRegistryId = account.CreditRegistryId,
AccountNumber = account.AccountNo,
}
I want to populate recordIndex with the current row number in the collection returned by the LINQ query. How can I get the row number?
Row number is not supported in linq-to-entities. You must first retrieve records from database without row number and then add row number by linq-to-objects. Something like:
var accounts =
    (from account in context.Accounts
     from guranteer in account.Gurantors
     where guranteer.GuarantorRegistryId == guranteerRegistryId
     select new
     {
         CreditRegistryId = account.CreditRegistryId,
         AccountNumber = account.AccountNo,
     })
    .AsEnumerable() // Moving to linq-to-objects
    .Select((r, i) => new AccountsReport
    {
        recordIndex = i,
        CreditRegistryId = r.CreditRegistryId,
        AccountNumber = r.AccountNumber, // property name defined by the anonymous type above
    });
LINQ to objects has this builtin for any enumerator:
http://weblogs.asp.net/fmarguerie/archive/2008/11/10/using-the-select-linq-query-operator-with-indexes.aspx
Edit: Although IQueryable supports it too, it has been mentioned that this unfortunately does not work for LINQ to SQL/Entities.
new []{"aap", "noot", "mies"}
.Select( (element, index) => new { element, index });
Will result in:
{ { element = aap, index = 0 },
{ element = noot, index = 1 },
{ element = mies, index = 2 } }
There are other LINQ Extension methods (like .Where) with the extra index parameter overload
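The same pattern, filtering first and then numbering the rows that survive in memory, exists outside LINQ too. As a hedged illustration only, here it is in plain JavaScript, where Array.prototype.map also passes the index as a second argument; numberRows and the sample accounts are hypothetical and only mirror the property names from the question.

```javascript
// Hypothetical in-memory rows standing in for the query result.
const accounts = [
  { creditRegistryId: 7, accountNo: "A-1", guarantorRegistryId: 42 },
  { creditRegistryId: 8, accountNo: "A-2", guarantorRegistryId: 99 },
  { creditRegistryId: 9, accountNo: "A-3", guarantorRegistryId: 42 },
];

// Filter first, then number the remaining rows with the index that map provides.
function numberRows(rows, guarantorRegistryId) {
  return rows
    .filter(a => a.guarantorRegistryId === guarantorRegistryId)
    .map((a, i) => ({
      recordIndex: i, // zero-based position within the filtered result
      creditRegistryId: a.creditRegistryId,
      accountNumber: a.accountNo,
    }));
}

const report = numberRows(accounts, 42);
console.log(report);
```

Because the index is assigned after filtering, it counts rows in the result, not rows in the source.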
Try using let like this:
int[] ints = new[] { 1, 2, 3, 4, 5 };
int counter = 0;
var result = from i in ints
where i % 2 == 0
let number = ++counter
select new { I = i, Number = number };
foreach (var r in result)
{
Console.WriteLine(r.Number + ": " + r.I);
}
I cannot test it with actual LINQ to SQL or Entity Framework right now. Note that the above code will retain the value of the counter between multiple executions of the query.
If this is not supported with your specific provider you can always foreach (thus forcing the execution of the query) and assign the number manually in code.
Because the query in the question filters by a single id, I think the answers given won't help. Of course you can do it all in memory on the client side, but depending on how large the dataset is, and whether a network is involved, this could be an issue.
If you need a SQL ROW_NUMBER [..] OVER [..] equivalent, the only way I know is to create a view in your SQL server and query against that.
This is tested and works.
Amend your code as follows:
int counter = 0;
var accounts =
    from account in context.Accounts
    from guranteer in account.Gurantors
    where guranteer.GuarantorRegistryId == guranteerRegistryId
    select new AccountsReport
    {
        recordIndex = counter++,
        CreditRegistryId = account.CreditRegistryId,
        AccountNumber = account.AccountNo,
    };
Hope this helps, though it's late :)

Why is this LINQ so slow?

Can anyone please explain why the third query below is orders of magnitude slower than the others when it oughtn't to take any longer than doing the first two in sequence?
var data = Enumerable.Range(0, 10000).Select(x => new { Index = x, Value = x + " is the magic number"}).ToList();
var test1 = data.Select(x => new { Original = x, Match = data.Single(y => y.Value == x.Value) }).Take(1).Dump();
var test2 = data.Select(x => new { Original = x, Match = data.Single(z => z.Index == x.Index) }).Take(1).Dump();
var test3 = data.Select(x => new { Original = x, Match = data.Single(z => z.Index == data.Single(y => y.Value == x.Value).Index) }).Take(1).Dump();
EDIT: I've added a .ToList() to the original data generation because I don't want any repeated generation of the data clouding the issue.
I'm just trying to understand why this code is so slow by the way, not looking for faster alternative, unless it sheds some light on the matter. I would have thought that if Linq is lazily evaluated and I'm only looking for the first item (Take(1)) then test3's:
data.Select(x => new { Original = x, Match = data.Single(z => z.Index == data.Single(y => y.Value == x.Value).Index) }).Take(1);
could reduce to:
data.Select(x => new { Original = x, Match = data.Single(z => z.Index == 1) }).Take(1)
in O(N) as the first item in data is successfully matched after one full scan of the data by the inner Single(), leaving one more sweep of the data by the remaining Single(). So still all O(N).
It's evidently being processed in a more long winded way but I don't really understand how or why.
Test3 takes a couple of seconds to run by the way, so I think we can safely assume that if your answer features the number 10^16 you've made a mistake somewhere along the line.
The first two "tests" are identical, and both slow. The third adds another entire level of slowness.
The first two LINQ statements here are quadratic in nature when fully enumerated. Since Single must scan the entire "data" sequence to find each "Match" (and to confirm that there is only one), every one of the 10000 elements costs a full pass over all 10000 elements, making this an O(N^2) operation.
The "test3" operation takes this to an entirely new level of pain: the predicate of the outer Single itself calls data.Single(...), so the inner scan is repeated for every element the outer scan examines.
A single "Match" in the third test therefore costs O(N^2) comparisons instead of O(N), and enumerating the whole sequence would cost O(N^3) instead of O(N^2). With Take(1) only the first "Match" is computed, but that is still about 10^8 comparisons for test3 against 10^4 for the first two, which is orders of magnitude slower.
Fixed.
var data = Enumerable.Range(0, 10000)
    .Select(x => new { Index = x, Value = x + " is the magic number"})
    .ToList();
var forward = data.ToLookup(x => x.Index);
var backward = data.ToLookup(x => x.Value);
var test1 = data.Select(x => new { Original = x,
    Match = backward[x.Value].Single()
} ).Take(1).Dump();
var test2 = data.Select(x => new { Original = x,
    Match = forward[x.Index].Single()
} ).Take(1).Dump();
var test3 = data.Select(x => new { Original = x,
    Match = forward[backward[x.Value].Single().Index].Single()
} ).Take(1).Dump();
In the original code, if the whole sequence is enumerated:
data.ToList() generates 10,000 instances (10^4).
data.Select( data.Single() ).ToList() evaluates the predicate 100,000,000 times (10^8).
data.Select( data.Single( data.Single() ) ).ToList() evaluates the inner predicate 1,000,000,000,000 times (10^12).
Single and First are different. Single throws if multiple instances are encountered, so it must fully enumerate its source to check for them.
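Counting predicate calls for one Match makes the cost of the nesting concrete. The following is a rough plain JavaScript sketch rather than LINQ: single is a hypothetical helper written here to imitate the full scan described above, and N is reduced to 100 so that the numbers stay small.

```javascript
let predicateCalls = 0;

// Imitates the full scan described above: examines every item and
// throws unless exactly one item matches.
function single(items, predicate) {
  let found;
  let matches = 0;
  for (const item of items) {
    predicateCalls++;
    if (predicate(item)) {
      found = item;
      matches++;
    }
  }
  if (matches !== 1) throw new Error("expected exactly one match");
  return found;
}

// Runs fn and reports how many predicate calls it caused.
function countCalls(fn) {
  predicateCalls = 0;
  fn();
  return predicateCalls;
}

const n = 100;
const data = Array.from({ length: n }, (_, i) => ({ index: i, value: i + " is the magic number" }));
const x = data[0];

// One Match in the style of test1: a single scan.
const flatCalls = countCalls(() => single(data, y => y.value === x.value));
// One Match in the style of test3: the inner scan is repeated for every outer element.
const nestedCalls = countCalls(() =>
  single(data, z => z.index === single(data, y => y.value === x.value).index));
console.log(flatCalls, nestedCalls);
```

For one Match, the flat version makes N predicate calls, while the nested version makes N outer calls plus N times N inner calls.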

Resources