How to configure MaxDegreeOfParallelism, BoundedCapacity, MaxMessagesPerTask - tpl-dataflow

I made a test following Walkthrough: Using BatchBlock and BatchedJoinBlock to Improve Efficiency, and I got confused by MaxDegreeOfParallelism, BoundedCapacity, and MaxMessagesPerTask, and how to use them to improve performance.
I tested with the code below, increasing BoundedCapacity from -1 to 30. I found the time is longest when BoundedCapacity equals -1 and shortest when it equals 1; increasing it from 2 upwards, the time grows again and stays larger than with 1.
When I tested with BoundedCapacity and MaxMessagesPerTask, the time did not seem to change.
var insertEmployee = new ActionBlock<Employee>(e =>
    AdoDbContext.InsertEmployees(new Employee[] { e }, connectionString),
    new ExecutionDataflowBlockOptions
    {
        BoundedCapacity = 30
        //MaxDegreeOfParallelism = 6
        //MaxMessagesPerTask = 5
    });
Edit 1:
I tested with and without BatchBlock, and it did not make as much of a difference as the document's output shows. The times are very close.
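For reference, a minimal sketch of how the three options are usually combined on a single ActionBlock (the option values are only illustrative, and InsertAllAsync and the employees parameter are made-up names, not from the walkthrough):

// Requires: using System.Threading.Tasks.Dataflow; (plus the usual System usings)
// Illustrative values only - not recommendations.
async Task InsertAllAsync(IEnumerable<Employee> employees, string connectionString)
{
    var options = new ExecutionDataflowBlockOptions
    {
        BoundedCapacity = 30,        // buffer at most 30 queued employees inside the block
        MaxDegreeOfParallelism = 6,  // run up to 6 inserts concurrently
        MaxMessagesPerTask = 5       // recycle the underlying task after 5 messages (a fairness knob, rarely a throughput win)
    };

    var insertEmployee = new ActionBlock<Employee>(
        e => AdoDbContext.InsertEmployees(new Employee[] { e }, connectionString),
        options);

    foreach (var e in employees)
    {
        // With BoundedCapacity set, use SendAsync rather than Post so the
        // producer waits for free space instead of having items rejected.
        await insertEmployee.SendAsync(e);
    }

    insertEmployee.Complete();
    await insertEmployee.Completion;
}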

Related

Why is there consistent variation in execution time on my timed trigger?

I have a timed trigger that runs every 15 minutes. A simplified partial version of the script is shown below. The script compiles data from about 50 other spreadsheets and records a row for each spreadsheet, then writes that summary data to the active spreadsheet.
I noticed that in the logs, there is an alternating pattern in the execution times for this script: half the executions take 200-400 seconds, and the other half typically take 700-900 seconds. It's a pretty significant difference, and the pattern persists over the past several days of logs.
There's nothing in the script itself that changes from one execution to the next, so I'm curious if anyone can suggest a reason this would happen (even better if it's a documented reason). For example, is there some sort of caching of the spreadsheet reads so that the next execution gets those values faster?
// The triggered function.
function updateRankings()
{
  var rankingSheet = SS.getSheetByName(RANKING_SHEET_NAME) // SS is the active spreadsheet
  // Read the IDs of the target spreadsheets, which are stored on an external spreadsheet
  var gyms = getRowsData(SpreadsheetApp.openById(ADMIN_PANEL_ID).getSheetByName(ADMIN_PANEL_SHEET_NAME))
  // Iterate over gyms
  gyms.forEach(getGymStats)
  // Write the compiled data back to the active sheet
  setRowsData(rankingSheet, gyms)
}

function getGymStats(gym)
{
  var gymSpreadsheet = SpreadsheetApp.openById(gym.spreadsheetId)
  // Force spreadsheet formulas to calculate before reading values
  SpreadsheetApp.flush()
  var metricsSheet = gymSpreadsheet.getSheetByName('Detailed Metrics')
  var statsColumn = metricsSheet.getRange('E:E').getValues()
  var roasColumn = metricsSheet.getRange('J:J').getValues()
  // Get stats
  var gymStats = {
    facebookAdSpend: getFacebookAdSpend(gymSpreadsheet),
    scheduling: statsColumn[8][0],
    showup: statsColumn[9][0],
    closing: statsColumn[10][0],
    costPerLead: statsColumn[25][0],
    costPerAppointment: statsColumn[26][0],
    costPerShow: statsColumn[27][0],
    costPerAcquisition: statsColumn[28][0],
    leadCount: statsColumn[13][0],
    frontEndRoas: (roasColumn[21][0] / statsColumn[5][0]) || 0,
    totalRoas: (roasColumn[35][0] / statsColumn[5][0]) || 0,
    totalProjectedRoas: (roasColumn[36][0] / statsColumn[5][0]) || 0,
    conversionRate: (gym.currency ?
      '=IFS(ISBLANK(INDIRECT("R[0]C[-4]", FALSE)),,ISBLANK(INDIRECT("R[0]C[-2]", FALSE)), 1,TRUE, IFERROR(GOOGLEFINANCE("Currency:"&INDIRECT("R[0]C[-2]", FALSE)&"USD")))' :
      1)
  }
  Object.assign(gym, gymStats)
}

function getFacebookAdSpend(spreadsheet)
{
  var range = spreadsheet.getRangeByName('FacebookAdSpend')
  if (!range) return ''
  return range.getValue()
}

Best way to get max value in LINQ?

I'm a newbie to LINQ. I would like to know the highest value for 'Date'; which method is preferred?
var max1 = spResult.Where(p => p.InstrumentId == instrument).OrderByDescending(u => int.Parse(u.Date)).FirstOrDefault();
var max2 = spResult.Where(p => p.InstrumentId == instrument).Max(u => int.Parse(u.Date));
Max or OrderByDescending ?
Max is better for both the developer and the computer.
Max will always be better, because Max is semantic and meaningful.
Enumerable.Max Method
Returns the maximum value in a sequence of values.
MSDN
You want the max value? Use Max. You want to order? Use OrderBy. The next developer will thank you. To quote Martin Fowler:
Any fool can write code that a computer can understand. Good programmers write code that humans can understand.
If you really want to use OrderBy to do the job of Max, at least wrap the OrderBy and the First in a method with a meaningful name. Something like ... Max. Great, now you have a meaningful OrderBy.
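For illustration, such a wrapper might look like this (MaxByOrdering and EnumerableExtensions are made-up names; it is just OrderByDescending plus First behind a more meaningful signature):

using System;
using System.Collections.Generic;
using System.Linq;

public static class EnumerableExtensions
{
    // Hypothetical wrapper: OrderByDescending + First behind a name
    // that says what the caller actually wants.
    public static TSource MaxByOrdering<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    {
        return source.OrderByDescending(keySelector).First();
    }
}

// Usage:
// var latest = spResult.Where(p => p.InstrumentId == instrument)
//                      .MaxByOrdering(u => int.Parse(u.Date));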
Let's see how this custom Max will do.
Enumerable.Max is O(n), while OrderBy uses a quicksort, which is O(n log n) on average and O(n^2) in the worst case. So the custom Max is worse than the standard one...
Enjoy the performance bonus and go for Enumerable.Max. It is better for both the developer and the computer.
Edit:
Check Marco's answer to see how they perform in practice. A horse race is always a nice way to find out which one is faster.
.Max() should be faster. First of all, the semantics of the method are clearer, and your colleagues will know what your call does.
I've compared both your options on the AdventureWorks2014 database, with the following calls in LINQPad:
var times = new List<long>();
for (var i = 0; i < 1000; i++) {
    Stopwatch sw = Stopwatch.StartNew();
    var max2 = SalesOrderHeaders.Max(u => u.OrderDate);
    long elapsed = sw.ElapsedMilliseconds;
    times.Add(elapsed);
}
var averageElapsed = times.Sum(t => t) / times.Count();
averageElapsed.Dump(" ms");
Generated SQL:
SELECT MAX([t0].[OrderDate]) AS [value]
FROM [Sales].[SalesOrderHeader] AS [t0]
GO
Result:
5 ms
var times = new List<long>();
for (var i = 0; i < 1000; i++) {
    Stopwatch sw = Stopwatch.StartNew();
    var max1 = SalesOrderHeaders.OrderByDescending(u => u.OrderDate).FirstOrDefault();
    long elapsed = sw.ElapsedMilliseconds;
    times.Add(elapsed);
}
var averageElapsed = times.Sum(t => t) / times.Count();
averageElapsed.Dump(" ms");
Generated SQL:
SELECT TOP (1) [t0].[SalesOrderID], [t0].[RevisionNumber], [t0].[OrderDate], [t0].[DueDate], [t0].[ShipDate], [t0].[Status], [t0].[OnlineOrderFlag], [t0].[SalesOrderNumber], [t0].[PurchaseOrderNumber], [t0].[AccountNumber], [t0].[CustomerID], [t0].[SalesPersonID], [t0].[TerritoryID], [t0].[BillToAddressID], [t0].[ShipToAddressID], [t0].[ShipMethodID], [t0].[CreditCardID], [t0].[CreditCardApprovalCode], [t0].[CurrencyRateID], [t0].[SubTotal], [t0].[TaxAmt], [t0].[Freight], [t0].[TotalDue], [t0].[Comment], [t0].[rowguid] AS [Rowguid], [t0].[ModifiedDate]
FROM [Sales].[SalesOrderHeader] AS [t0]
ORDER BY [t0].[OrderDate] DESC
GO
Result:
28 ms
Conclusion: Max() is more concise and faster!
Purely speculative, but I'd imagine max2. It just loops through each item and checks whether the value is higher than the best seen so far.
max1, on the other hand, has to work out which is higher and reorder. Even if it's just moving pointers around (rather than moving values), that is still more work.
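As a rough sketch of that single pass (simplified, with made-up names; this is not the actual Enumerable.Max implementation):

using System;
using System.Collections.Generic;

static class MaxSketch
{
    // Simplified idea of what Max(selector) does: one pass,
    // remembering only the largest value seen so far.
    public static int MaxValue<T>(IEnumerable<T> source, Func<T, int> selector)
    {
        bool any = false;
        int best = 0;
        foreach (var item in source)
        {
            int value = selector(item);
            if (!any || value > best)
            {
                best = value;
                any = true;
            }
        }
        if (!any) throw new InvalidOperationException("Sequence contains no elements");
        return best;
    }
}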
The Max method is better than OrderByDescending followed by FirstOrDefault; both of them return the correct result, but the performance of Max is better.
This code:
var max1 = spResult.Where(p => p.InstrumentId == instrument).OrderByDescending(u => int.Parse(u.Date)).FirstOrDefault();
first checks your condition, then sorts the results by your ordering, and only after that selects the first element, so it does more work to find your result.

ElasticSearch pagination--Does ES count from 0 or from 1?

I'm using pagination in a series of ES filters. On the first page I set from = 0, size = 10000. My question is: on the next page do I use from = 10000, size = 20000, or do I use from = 10001? I suspect it's from = 10000, but I don't want to duplicate or drop a hit.
GET /_search?size=5
GET /_search?size=5&from=5
GET /_search?size=5&from=10
When size=5 and from=5, it skips the first 5 hits and returns hits 6, 7, 8, 9, 10.
More detail: https://www.elastic.co/guide/en/elasticsearch/guide/current/pagination.html
So in your case, with from=10000, the results actually start at the 10001st hit.
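Applied to your numbers, keeping size as the page size rather than a running total, the requests would look like this (illustrative only):
GET /_search?size=10000&from=0
GET /_search?size=10000&from=10000
from is the zero-based offset of the first hit to return and size is how many hits to return from that offset, so the second request returns hits 10001-20000 with no overlap and nothing dropped.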

Finding minimum time from record and splitting it in LINQ

We have a database of swimmers with their times.
To create a ranking we want to get the fastest time of each athlete.
var rankings = (
    from r in _db.Results
    orderby r.swimtime
    group r by r.athleteid into rg
    select new
    {
        AthleteId = rg.Key,
        FirstName = rg.Min(f2 => f2.Athlete.firstname),
        Swimtime = rg.Min(f8 => f8.swimtime),
        hours = rg.Min(f9 => f9.swimtime.Hours),
        minutes = rg.Min(f10 => ("00" + f10.swimtime.Minutes.ToString()).Substring(("00" + f10.swimtime.Minutes.ToString()).Length - 2)), // to get 2 digits in minutes
        seconds = rg.Min(f11 => ("00" + f11.swimtime.Seconds.ToString()).Substring(("00" + f11.swimtime.Seconds.ToString()).Length - 2)), // to get 2 digits in seconds
        milliseconds = rg.Min(f12 => (f12.swimtime.Milliseconds.ToString() + "00").Substring(0, 2)), // because milliseconds are not always filled
    }
);
Now the ranking is created correctly, however the time shown is not.
I know what the problem is, but don't know how to fix it:
In the database we have a swimmer that has 2 times: 00:01:02:10 (1 min 2 sec 10) and 00:00:56:95 (56 sec 95).
The result we get is the minimum for the minutes (=00), the minimum for the seconds (=02) and the minimum for the milliseconds (=10)
Resulting in a time of 00:00:02:10.
What we should get is the hours,minutes,seconds and milliseconds of the fastest time (=00:00:56:95)
Does anyone have any ideas on how to fix this?
This should do the trick:
from result in db.Results
group result by result.AthleteId into g
let bestResult = (
    from athleteResult in g
    orderby athleteResult.SwimTime
    select athleteResult).First()
orderby bestResult.SwimTime
select new
{
    AthleteId = bestResult.Athlete.Id,
    FirstName = bestResult.Athlete.FirstName,
    BestTime = bestResult.SwimTime,
}
The query fetches the best result from a group (all results from a single athlete), orders by that result, and uses that result to populate the final result.
Stay away from calling .First on groups, as that can cause automatic requerying due to the difference between LINQ's group (key and elements) vs SQL's group (key and aggregates).
Instead, get the minSwimTime once and let it be.
var rankings =
    from r in _db.Results
    group r by r.athleteid into rg
    let minSwimTime = rg.Min(x => x.swimtime)
    select new
    {
        AthleteId = rg.Key,
        FirstName = rg.Min(f2 => f2.Athlete.firstname),
        Swimtime = minSwimTime,
        hours = minSwimTime.Hours,
        minutes = ("00" + minSwimTime.Minutes.ToString()).Substring(("00" + minSwimTime.Minutes.ToString()).Length - 2), // to get 2 digits in minutes
        seconds = ("00" + minSwimTime.Seconds.ToString()).Substring(("00" + minSwimTime.Seconds.ToString()).Length - 2), // to get 2 digits in seconds
        milliseconds = (minSwimTime.Milliseconds.ToString() + "00").Substring(0, 2), // because milliseconds are not always filled
    };
Also - don't do string formatting in the database. Database servers have better things to do than turn datetimes into text.
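As a sketch of that idea (assuming swimtime maps to a TimeSpan, which the .Hours/.Minutes usage above suggests), fetch only the minimum per athlete and build the display string in memory:

var rankings = (
    from r in _db.Results
    group r by r.athleteid into rg
    select new
    {
        AthleteId = rg.Key,
        FirstName = rg.Min(f => f.Athlete.firstname),
        Swimtime = rg.Min(x => x.swimtime)
    })
    .AsEnumerable() // switch to LINQ to Objects; formatting happens on the client
    .Select(x => new
    {
        x.AthleteId,
        x.FirstName,
        x.Swimtime,
        // Custom TimeSpan format: two-digit hours, minutes, seconds and hundredths
        Display = x.Swimtime.ToString(@"hh\:mm\:ss\.ff")
    });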

Why is this LINQ so slow?

Can anyone please explain why the third query below is orders of magnitude slower than the others when it oughtn't to take any longer than doing the first two in sequence?
var data = Enumerable.Range(0, 10000).Select(x => new { Index = x, Value = x + " is the magic number"}).ToList();
var test1 = data.Select(x => new { Original = x, Match = data.Single(y => y.Value == x.Value) }).Take(1).Dump();
var test2 = data.Select(x => new { Original = x, Match = data.Single(z => z.Index == x.Index) }).Take(1).Dump();
var test3 = data.Select(x => new { Original = x, Match = data.Single(z => z.Index == data.Single(y => y.Value == x.Value).Index) }).Take(1).Dump();
EDIT: I've added a .ToList() to the original data generation because I don't want any repeated generation of the data clouding the issue.
I'm just trying to understand why this code is so slow, by the way, not looking for a faster alternative, unless it sheds some light on the matter. I would have thought that if LINQ is lazily evaluated and I'm only looking for the first item (Take(1)), then test3's:
data.Select(x => new { Original = x, Match = data.Single(z => z.Index == data.Single(y => y.Value == x.Value).Index) }).Take(1);
could reduce to:
data.Select(x => new { Original = x, Match = data.Single(z => z.Index == 1) }).Take(1)
in O(N) as the first item in data is successfully matched after one full scan of the data by the inner Single(), leaving one more sweep of the data by the remaining Single(). So still all O(N).
It's evidently being processed in a more long-winded way, but I don't really understand how or why.
Test3 takes a couple of seconds to run by the way, so I think we can safely assume that if your answer features the number 10^16 you've made a mistake somewhere along the line.
The first two "tests" are identical, and both slow. The third adds another entire level of slowness.
The first two LINQ statements here are quadratic in nature. Since your "Match" element potentially requires iterating through the entire "data" sequence in order to find the match, as you progress through the range, the length of time for that element will get progressively longer. The 10000th element, for example, will force the engine to iterate through all 10000 elements of the original sequence to find the match, making this an O(N^2) operation.
The "test3" operation takes this to an entirely new level of pain, since it's "squaring" the O(N^2) operation in the second single - forcing it to do another quadratic operation on top of the first one - which is going to be a huge number of operations.
Each time you do data.Single(...) with the match, you're doing an O(N^2) operation - the third test basically becomes O(N^4), which will be orders of magnitude slower.
Fixed.
var data = Enumerable.Range(0, 10000)
.Select(x => new { Index = x, Value = x + " is the magic number"})
.ToList();
// Build each lookup once: an O(N) pass, after which finding an element by
// Index or Value is a cheap hash lookup instead of a full scan per Single call.
var forward = data.ToLookup(x => x.Index);
var backward = data.ToLookup(x => x.Value);
var test1 = data.Select(x => new { Original = x,
Match = backward[x.Value].Single()
} ).Take(1).Dump();
var test2 = data.Select(x => new { Original = x,
Match = forward[x.Index].Single()
} ).Take(1).Dump();
var test3 = data.Select(x => new { Original = x,
Match = forward[backward[x.Value].Single().Index].Single()
} ).Take(1).Dump();
In the original code,
data.ToList() generates 10,000 instances (10^4).
data.Select( data.Single() ).ToList() generates 100,000,000 instances (10^8).
data.Select( data.Single( data.Single() ) ).ToList() generates 1,000,000,000,000 instances (10^12).
Single and First are different. Single throws if multiple instances are encountered. Single must fully enumerate its source to check for multiple instances.
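A tiny standalone example of that difference (made-up values, unrelated to the query above):

using System;
using System.Linq;

class SingleVsFirst
{
    static void Main()
    {
        var values = new[] { 1, 2, 2, 3 };

        // First stops at the first match; it never looks at the second 2.
        Console.WriteLine(values.First(v => v == 2));   // 2

        // Single keeps scanning to make sure there is exactly one match,
        // so it finds the duplicate and throws.
        try
        {
            Console.WriteLine(values.Single(v => v == 2));
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine(ex.Message); // "Sequence contains more than one matching element"
        }
    }
}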
