SSIS Data Flow: Aggregate Task 'hides' my other columns downstream

I'm hoping someone can help me with this.
I have an SSIS package that updates a table, and one of its columns is a 'current average'. In my Data Flow Task, I need to look at 'tickets sold' for each record, compute the average, and add it as part of my insert.
The problem is that the Aggregate Task hides all my other columns. I've got 'tickets', 'avg tickets', 'venue' and 'time'. When I go to map the record to my DB Destination, the three source columns (venue, time, tickets) aren't visible, and the only one available is my aggregate. I need all four for my insert. How do I get those other columns to 'pass through' the Aggregate Task so I can use them?
Source: Excel sheet
Venue, Tickets Sold, Show Time
Royal Oak Music Theatre, 300, 7:00 PM
Saint Andrew's Hall, 200, 9:00 PM
Fox Music Theatre, 700, 8:00 PM
Destination: SQL Table
Venue, Tickets Sold, Show Time, Average Tickets Sold Per Show
Royal Oak Music Theatre, 300, 7:00PM, 300
Saint Andrew's Hall, 200, 9:00PM, 250
Fox Music Theatre, 700, 8:00PM, 400

Given your sample data, it appears you're creating a running average. Each incoming row adds a new value to be factored into the average.
The challenge is that the Aggregate component in SSIS doesn't do that. It gives you one average per grouping (or a single overall average when there is no grouping, as in your case).
You're going to need a Script Component to compute this.
Check the "Tickets Sold" column as an input for the script (it will likely be named TicketsSold or Tickets_Sold or some permutation thereof).
You'll also need to define a new column in your Output Buffer; I'll assume it is named runningAverage and its type is DT_NUMERIC.
I'm freehanding this code, so any syntax errors are mine, but the logic is sound ;)
public class ScriptMain : UserComponent
{
    int itemCount;
    int total;

    /// Initialize members
    public override void PreExecute()
    {
        base.PreExecute();
        this.itemCount = 0;
        this.total = 0;
    }

    /// Process all the rows, one at a time
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // accumulate (assumes TicketsSold arrives as an integer type;
        // add a cast if the source delivers it as a double)
        this.itemCount++;
        this.total += Row.TicketsSold;
        // populate the new column
        // force decimal division lest we truncate with integer division
        // (a DT_NUMERIC column surfaces as decimal in the buffer)
        Row.runningAverage = (decimal)this.total / this.itemCount;
    }
}
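To sanity-check the accumulation logic outside SSIS, here is a minimal JavaScript sketch of the same running average. The function name runningAverages is just an illustration and not part of any SSIS API; the input is the "Tickets Sold" column from the question.

```javascript
// Running average: after each value, divide the total so far by the count so far.
function runningAverages(values) {
  let total = 0;
  return values.map((value, index) => {
    total += value;
    return total / (index + 1);
  });
}

// Tickets sold from the sample data: 300, 200, 700
console.log(runningAverages([300, 200, 700])); // [ 300, 250, 400 ]
```

The output matches the "Average Tickets Sold Per Show" column in the desired destination table.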

I need to find a faster solution to iterate rows in Google Apps Script

I'm trying to save row values for multiple columns on multiple tabs in Google Apps Script (GAS), but it takes a lot of time and I'd like to find a faster way of doing it, if there is one.
Each project (e.g. 'Project1', used as a key) has a value associated with it, which is the column where its data is stored. The tabs are 600+ rows long.
The script first opens a tab called 'Person1' and goes through all the rows of the column that corresponds to each project in the 'projects' dictionary (the format is the same for every tab, but more projects will be added in the future).
Right now I'm iterating through the 'members' dictionary (length m), then through the 'projects' dictionary (length p), and finally through the rows (length r). In the meantime, the script accesses the other spreadsheet where I want to save all those rows.
This means that the current time complexity of my algorithm is O(m·p·r), and it's WAY too slow.
For 15 people with 6 projects each, the number of iterations would be 15 × 6 × 600+ = 54,000 at least (more people, more projects and more rows will be added).
Is there any way to make my algorithm faster?
const members = {'Person1':'P1', 'Person2':'P2'};
const projects = {'Project1':'L','Project2':'R'}
function saveRowValue() {
  let sourceSpreadsheet = SpreadsheetApp.getActiveSpreadsheet();
  let targetSpreadsheet = SpreadsheetApp.openById('-SPREADSHEET-');
  let targetSheet = targetSpreadsheet.getSheetByName('Tracking time');
  let rowsToWrite = [];
  rowsToWrite.push(['Project', 'Initials', 'Date', 'Tracking time'])
  var rowsToSave = 1;
  for(m in members){
    Logger.log(m +' initials:'+ members[m]);
    let sourceSheet = sourceSpreadsheet.getSheetByName(m);
    for(p in projects){
      let values = sourceSheet.getRange(projects[p]+"1:"+projects[p]).getValues();
      Logger.log(values)
      let list = [null, 0,''];
      for(var i=0; i<values.length; i++){
        try{
          date = sourceSheet.getRange('B'+i).getValue();
          let val = sourceSheet.getRange(projects[p]+i)
          val = Utilities.formatDate(val.getValue(), "GMT", val.getNumberFormat())
          Logger.log(val);
          if(!(list.includes(val)) && date instanceof Date){
            //rowsToWrite.push();
            rowsToSave++;
            targetSheet.getRange(rowsToSave,1,1,4).setValues([[p, members[m], date, val]]);
          }
        }catch(e){
          Logger.log(e)
        }
      }
    }
  }
  Logger.log(rowsToWrite);
}
[Here you can see how much time it takes to iterate 600 rows for a single project and a single member after changing what Yuri Khristich told me to change][1]
[1]: https://i.stack.imgur.com/CnRZY.png
The first step is to get rid of getValue() and setValue() calls inside loops. All data should be read at once as 2D arrays in one step and written to the sheet in one step as well. No single-cell or single-row operations.
The next trick depends on your workflow. It's unlikely that all 54,000+ cells need to be checked every time; there are probably ranges that have not changed. You can figure out some way to flag the changes and process only the changed ranges. The flagging could probably be done with an onChange() trigger. For example, you can add * to the names of the sheets and columns where changes have occurred, and remove these * whenever you run your script.
Reference:
Use batch operations
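As a rough sketch of the batch idea above: read each tab once into a 2D array, build every output row in memory, then write them all with a single call. The helpers columnIndex and collectRows are hypothetical names, and the filtering only approximates the original script (it skips the Utilities.formatDate step). The Apps Script calls appear only in comments because they cannot run outside Apps Script.

```javascript
// Convert a column letter such as 'B' or 'AA' to a zero-based array index.
function columnIndex(letters) {
  let n = 0;
  for (const ch of letters.toUpperCase()) {
    n = n * 26 + (ch.charCodeAt(0) - 64);
  }
  return n - 1;
}

// Build all output rows for one member from a single 2D array of sheet values.
// sheetValues stands in for what one getValues() call on the whole tab returns.
function collectRows(sheetValues, initials, projects, dateColumn) {
  const dateIdx = columnIndex(dateColumn);
  const rows = [];
  for (const [project, column] of Object.entries(projects)) {
    const valueIdx = columnIndex(column);
    for (const row of sheetValues) {
      const date = row[dateIdx];
      const val = row[valueIdx];
      // Keep only rows with a real date and a non-empty, non-zero value.
      const isEmpty = val === undefined || val === null || val === 0 || val === '';
      if (date instanceof Date && !isEmpty) {
        rows.push([project, initials, date, val]);
      }
    }
  }
  return rows;
}

// Inside Apps Script, the surrounding code might look roughly like this:
//   const values = sourceSheet.getDataRange().getValues();         // one read per tab
//   const rows = collectRows(values, members[m], projects, 'B');
//   if (rows.length > 0) {
//     targetSheet.getRange(2, 1, rows.length, 4).setValues(rows);  // one write
//   }
```

With this shape the inner loops touch only in-memory arrays, so the number of spreadsheet calls grows with the number of tabs rather than with the number of cells.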

Year over Year Stats from a Crossfilter Dataset

Summary
I want to pull out Year over Year stats in a Crossfilter-DC driven dashboard
Year over Year (YoY) Definition
2017 YoY is the total units in 2017 divided by the total units in 2016.
Details
I'm using DC.js (and therefore D3.js & Crossfilter) to create an interactive Dashboard that can also be used to change the data it's rendering.
I have data, that though wider (has ~6 other attributes in addition to date and quantity: size, color, etc...sales data), boils down to objects like:
[
{ date: 2017-12-7, quantity: 56, color: blue ...},
{ date: 2017-2-17, quantity: 104, color: red ...},
{ date: 2016-12-7, quantity: 60, color: red ...},
{ date: 2016-4-15, quantity: 6, color: blue ...},
{ date: 2017-2-17, quantity: 10, color: green ...},
{ date: 2016-12-7, quantity: 12, color: green ...}
...
]
I'm displaying one row chart per attribute so that you can see the totals by color, size, etc. People would use each of these charts to see the totals by that attribute and drill into the data by filtering by just a color, or a color and a size, or a size, etc. This setup is all (relatively) straightforward and kind of what DC is made for.
However, now I'd like to add some YoY stats such that I can show a barchart with x-axis as the years, and the y-axis as the YoY values (ex. YoY-2019 = Units-2019 / Units-2018). I'd also like to do the same by quarter and month such that I could see YoY Mar-2019 = Units-Mar-2019 / Units-Mar-2018 (and the same for quarter).
I have a year dimension and sum quantity
var yearDim = crossfilterObject.dimension(_ => _.date.getFullYear());
var quantityGroup = yearDim.group().reduceSum(_ => _.quantity);
I can't figure out how to do the Year over Year calc though in the nice, beautiful DC.js-way.
Attempted Solutions
Year+1
Add another dimension that's year + 1. I didn't really get any further, though, because all I get out of it are two dimensions whose year groups I want to divide ... but I'm not sure how.
var yearPlusOneDim = crossfilterObject.dimension(_ => _.date.getFullYear() + 1);
Visually I can graph the two separately and I know, conceptually, what I want to do: divide the 2017 number in yearDim by the 2017 number in yearPlusOneDim (which, in reality, is the 2016 number). But "as a concept" is as far as I got on this one.
Abandon DC Graphing
I could always use the yearDim's quantity group to get the array of values, which I could then feed into a normal D3.js graph.
var annualValues = quantityGroup.all();
console.log(annualValues);
// output = [{key: 2016, value: 78}, {key: 2017, value: 170}]
// example data from the limited rows listed above
But this feels like a hacky solution that's bound to fail and not benefit from all the rapid and dynamic DC updating.
I'd use a fake group, in order to solve this in one pass.
As @Ethan says, you could also use a value accessor, but then you'd have to look up the previous year each time a value is accessed - so you'd probably have to keep an extra table around. With a fake group, you only need this table in the body of your .all() function.
Here's a quick sketch of what the fake group might look like:
function yoy_group(group) {
    return {
        all: function() {
            // index all values by date
            var bydate = group.all().reduce(function(p, kv) {
                p[kv.key.getTime()] = kv.value;
                return p;
            }, {});
            // for any key/value pair which had a value one year earlier,
            // produce a new pair with the ratio between this year and last
            return group.all().reduce(function(p, kv) {
                var date = d3.timeYear.offset(kv.key, -1);
                if(bydate[date.getTime()])
                    p.push({key: kv.key, value: kv.value / bydate[date.getTime()]});
                return p;
            }, []);
        }
    };
}
The idea is simple: first index all the values by date. Then when producing the array of key/value pairs, look each one up to see if it had a value one year earlier. If so, push a pair to the result (otherwise drop it).
This should work for any date-keyed group where the dates have been rounded.
Note the use of Array.reduce in a couple of places. This is the spiritual ancestor of crossfilter's group.reduce - it takes a function which has the same signature as the reduce-add function, and an initial value (not a function) and produces a single value. Instead of reacting to changes like the crossfilter one does, it just loops over the array once. It's useful when you want to produce an object from an array, or produce an array of different size from the original.
Also, when indexing an object by a date, I use Date.getTime() to fetch the numeric representation of the date. Otherwise the date coerces to a string representation which may not be exact. Probably for this application it would be okay to skip .getTime() but I'm in the habit of always comparing dates exactly.
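To see the fake-group idea run without dc.js or d3 loaded, here is a self-contained sketch. It is not the exact code above: mockGroup is a hand-written stand-in for a crossfilter group whose keys are Date objects, and oneYearEarlier is a hypothetical helper that replaces d3.timeYear.offset(date, -1) for dates already rounded to the start of a period.

```javascript
// Hypothetical stand-in for d3.timeYear.offset(date, -1).
function oneYearEarlier(date) {
  const d = new Date(date.getTime());
  d.setFullYear(d.getFullYear() - 1);
  return d;
}

// Wrap a group so that all() returns this-year / last-year ratios.
function yoyGroup(group) {
  return {
    all() {
      // Index every value by the numeric timestamp of its key.
      const byDate = new Map(group.all().map(kv => [kv.key.getTime(), kv.value]));
      // Keep only keys that have a non-zero value one year earlier.
      return group.all().flatMap(kv => {
        const previous = byDate.get(oneYearEarlier(kv.key).getTime());
        return previous ? [{ key: kv.key, value: kv.value / previous }] : [];
      });
    }
  };
}

// Mock of a crossfilter group keyed by year-start dates (totals from the question).
const mockGroup = {
  all: () => [
    { key: new Date(2016, 0, 1), value: 78 },
    { key: new Date(2017, 0, 1), value: 170 }
  ]
};

const yoy = yoyGroup(mockGroup).all();
console.log(yoy.length, yoy[0].key.getFullYear(), yoy[0].value);
```

Only 2017 appears in the result, because 2016 has no earlier year to compare against. In a real dashboard the wrapped group would be handed to the chart in place of the original group, and the dimension would need to key on dates rather than plain year numbers.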
Demo fiddle of YOY trade volume in the data set used by the stock example on the main dc.js page.
I've rewritten @Gordon's code below. All the credit for the solution (answered above) is his; I've just written down my own, far more verbose version of the code and of the explanation (likely only useful for beginners like me) to retrace my thinking in bridging the gap from my near-nothing starting point to @Gordon's really clever answer.
var yoyGroup = function(group) {
    return { all: function() {
        // For every key-value pair in the group, iterate across it, indexing it by its time-value
        var valuesByDate = group.all().reduce(function(outputArray, thisKeyValuePair) {
            outputArray[thisKeyValuePair.key.getTime()] = thisKeyValuePair.value;
            return outputArray;
        }, []);
        return group.all().reduce(function(newAllArray, thisKeyValuePair) {
            var dateLastYear = d3.timeYear.offset(thisKeyValuePair.key, -1);
            if (valuesByDate[dateLastYear.getTime()]) {
                newAllArray.push({
                    key: thisKeyValuePair.key,
                    value: thisKeyValuePair.value / valuesByDate[dateLastYear.getTime()] - 1
                });
            }
            return newAllArray;
        }, []); // closing reduce() and a function(...)
    }}; // closing the return object & a function
};
Why are we overwriting the all() function?
When DC.js goes to create a graph based on a grouping, the only function from Crossfilter it uses is the all() function. So if we want to do something custom to a grouping to affect a DC graph, we only have to overwrite that one function: all().
What does the all() function need to return?
A group's all function must return an array of objects and each object must have two properties: key & value.
So what exactly are we doing here?
We're starting with an existing group which shows some values over time (Important Assumption: keys are date objects) and then creating a wrapper around it so that we can take advantage of the work that crossfilter has already done to aggregate at a certain level (ex. year, month, etc.).
We start by using reduce to manipulate the array of objects into a more simple array where the keys and values that were in the objects are now directly in the array. We do this to make it easier to look up values by keys.
before / output structure of group.all()
[ {key: k1, value: v1},
{key: k2, value: v2},
{key: k3, value: v3}
]
after
[ k1: v1,
k2: v2,
k3: v3
]
Then we move on to creating the correct all() structure again: an array of objects each of which has a key & value property. We start with the existing group's all() array (once again), but this time we have the advantage of our valuesByDate array which will make it easy to look up other dates.
So we iterate (via reduce) over the original group.all() output and look up, in the array we generated earlier (valuesByDate), whether there's an entry from one year ago (valuesByDate[dateLastYear.getTime()]). (We use getTime() so that we're indexing off simple integers rather than objects.) If there is an element of the array from one year ago, then we add a key-value object-pair to our soon-to-be-returned array with the current key (date), and for the value we divide the "now" value (thisKeyValuePair.value) by the value one year ago: valuesByDate[dateLastYear.getTime()]. Lastly we subtract 1 so that it's (the most traditional definition of) YoY. Ex. This year = 110 and last year = 100 ... YoY = +10% = 110/100 - 1.

Calculated Measure in Analysis Services

The AdventureWorksDW database has the construct of the Financial Reporting fact table. I have a similar fact table, where the fact contains only the FKs to the dimension tables and a value. The measure gets its context from a DimAccount dimension. Are there any code samples that show how to do a simple ratio in a calculated member between two measures of the AdventureWorks Financial Reporting sample?
So basically I would like to see say Total Long term Debt / Total Assets from AdventureWorksDW? What I need is the expression or MDX.
Thanks in advance.
Use a query like this:
with member [Account].[Accounts].[Balance Sheet].[Debt by Assets] as
    IIf([Account].[Accounts].[Assets] <> 0,
        [Account].[Accounts].[Long Term Liabilities] / [Account].[Accounts].[Assets],
        null
    )
    , format_string = "0.00%"
select {
    [Account].[Accounts].[Assets],
    [Account].[Accounts].[Long Term Liabilities],
    [Account].[Accounts].[Debt by Assets]
}
on columns,
{ [Measures].[Amount] }
on rows
from [Adventure Works]
You can define members in any hierarchy, not only in the measures. In the definition, you should use the parent member before the name of the new member, to tell AS the position in the hierarchy. This is more important for CREATE MEMBER in the cube calculation script than for WITH MEMBER, as it influences the position where the client tool will display it.

Solving: How many shops are open at a certain time?

I was posed this question at an interview and never really came up with a great solution. Does anyone have an "optimum" solution, where the target is efficiency and being able to deal with large input?
Material Provided:
I am given a long list of shops and their opening/closing times (say 1000).
The Problem:
For a given time of day, return how many of the shops are open
Example Data:
Sainsburys 10:00 23:00
Asda 02:00 18:00
Walmart 17:00 22:00
Example In/Out
Input | Output
12:00 | 2
22:00 | 1 (Walmart shut @ 22:00)
17:30 | 3
The two parts of the problem are how to store the data and how to efficiently get the answer, I guess how you're reading the input etc doesn't really matter.
Thanks for your time and insight!
Let's take a stab:
// First, we turn the time input into an int and then compare it to the opening and
// closing time ints to determine if the shop is open. We'll use 10:00 in this example.
// get magic time int
int magicTimeInt = int.Parse("10:00".Replace(":", ""));
int openStoreCount = 0;
foreach (var shopTime in ShopTimesList) // ShopTimesList is the list of shop names and times
{
    string[] theShop = shopTime.Split(' ');
    if (int.Parse(theShop[1].Replace(":", "")) < magicTimeInt
        && int.Parse(theShop[2].Replace(":", "")) > magicTimeInt)
    {
        openStoreCount++;
    }
}
Console.WriteLine("10:00 | " + openStoreCount);
I would use a database:
CREATE TABLE shops (
    name VARCHAR,
    open TIME,
    close TIME
);
SELECT count(*) AS number_of_shops FROM shops WHERE [input_time] BETWEEN open AND close;
To prevent the query from counting Walmart at 22:00 (in your example), you could add a second to open and subtract a second from close (or some minutes, to give everyone the chance to buy something).
I would do it in Java:
import java.sql.Time;

class Shop {
    String name;
    Time opening, closing;

    Shop(String name, Time opening, Time closing) {
        this.name = name;
        this.opening = opening;
        this.closing = closing;
    }

    public boolean isOpen(Time time) {
        return opening.before(time) && closing.after(time);
    }
}
Add code to trim date information from the time values, then make a collection of all the stores, iterate through it, and increment your count for each open one.
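The iterate-and-count approach from these answers can be sketched end to end as follows. The names toMinutes and countOpen are made up for illustration, and the sketch assumes no shop stays open past midnight. It treats a shop as open strictly between its opening and closing times, which matches the Walmart case in the example.

```javascript
// Convert "HH:MM" to minutes since midnight so times compare as plain numbers.
function toMinutes(hhmm) {
  const [hours, minutes] = hhmm.split(':').map(Number);
  return hours * 60 + minutes;
}

// Count the shops whose opening time is before, and closing time after, the given time.
function countOpen(shops, time) {
  const t = toMinutes(time);
  return shops.filter(s => toMinutes(s.open) < t && t < toMinutes(s.close)).length;
}

// Example data from the question.
const shops = [
  { name: 'Sainsburys', open: '10:00', close: '23:00' },
  { name: 'Asda', open: '02:00', close: '18:00' },
  { name: 'Walmart', open: '17:00', close: '22:00' }
];

console.log(countOpen(shops, '12:00')); // 2
console.log(countOpen(shops, '22:00')); // 1
console.log(countOpen(shops, '17:30')); // 3
```

Each query is a single pass over the list, so the cost per lookup grows linearly with the number of shops.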

achieving a complex sort via Linq to Objects

I've been asked to apply conditional sorting to a data set and I'm trying to figure out how to achieve this via LINQ. In this particular domain, purchase orders can be marked as primary or secondary. The exact mechanism used to determine primary/secondary status is rather complex and not germane to the problem at hand.
Consider the data set below.
Purchase Order  Ship Date  Shipping Address  Total
6               1/16/2006  Tallahassee FL    500.45
19.1            2/25/2006  Milwaukee WI      255.69
5.1             4/11/2006  Chicago IL        199.99
8               5/16/2006  Fresno CA         458.22
19              7/3/2006   Seattle WA        151.55
5               5/1/2006   Avery UT          788.52
5.2             8/22/2006  Rice Lake MO      655.00
Secondary POs are those with a decimal number and primary POs are those with an integer number. The requirement I'm dealing with stipulates that when a user chooses to sort on a given column, the sort should only be applied to primary POs. Secondary POs are ignored for the purposes of sorting, but should still be listed below their primary PO in ship date descending order.
For example, let's say a user sorts on Shipping Address ascending. The data would be sorted as follows. Notice that if you ignore the secondary POs, the data is sorted by Address ascending (Avery, Fresno, Seattle, Tallahassee).
Purchase Order  Ship Date  Shipping Address  Total
5               5/1/2006   Avery UT          788.52
--5.2           8/22/2006  Rice Lake MO      655.00
--5.1           4/11/2006  Chicago IL        199.99
8               5/16/2006  Fresno CA         458.22
19              7/3/2006   Seattle WA        151.55
--19.1          2/25/2006  Milwaukee WI      255.69
6               1/16/2006  Tallahassee FL    500.45
Is there a way to achieve the desired effect using the OrderBy extension method? Or am I stuck (better off) applying the sort to the two data sets independently and then merging into a single result set?
public IList<PurchaseOrder> ApplySort(bool sortAsc)
{
    var primary = purchaseOrders.Where(po => po.IsPrimary)
                                .OrderBy(po => po.ShippingAddress).ToList();
    var secondary = purchaseOrders.Where(po => !po.IsPrimary)
                                  .OrderByDescending(po => po.ShipDate).ToList();
    //merge 2 lists somehow so that secondary POs are inserted after their primary
}
Have you seen ThenBy and ThenByDescending methods?
purchaseOrders.Where(po => po.IsPrimary).OrderBy(po => po.ShippingAddress).ThenByDescending(x=>x.ShipDate).ToList();
I'm not sure if this is going to fit your needs, because I don't quite understand what the final list should look like (po.IsPrimary and !po.IsPrimary are confusing me).
The solution for your problem is GroupBy.
First order your objects according to the selected column:
var ordered = purchaseOrders.OrderBy(po => po.ShippingAddress);
Then you need to group your orders by their primary order. I assumed the order number is a string, so I created a string IEqualityComparer like so:
class OrderComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        x = x.Contains('.') ? x.Substring(0, x.IndexOf('.')) : x;
        y = y.Contains('.') ? y.Substring(0, y.IndexOf('.')) : y;
        return x.Equals(y);
    }

    public int GetHashCode(string obj)
    {
        return obj.Contains('.') ? obj.Substring(0, obj.IndexOf('.')).GetHashCode() : obj.GetHashCode();
    }
}
and use it to group the orders:
var grouped = ordered.GroupBy(po => po.Order, new OrderComparer());
The result is a tree-like structure, ordered by the ShippingAddress column and grouped by the primary order ID.
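For comparison, here is one possible way to produce the exact ordering described in the question, sketched in JavaScript rather than LINQ. It is a variation on the grouping idea, not the code above: it groups by primary order ID first, sorts the groups by the primary's key, then lists each primary's secondaries by ship date descending. The field names (order, shipDate, address) and the function sortWithSecondaries are assumptions made for this illustration.

```javascript
// The primary ID is the part of the order number before the decimal point.
function primaryId(po) {
  return po.order.split('.')[0];
}

// Sort primaries by keyFn; list each primary's secondaries below it, newest ship date first.
function sortWithSecondaries(orders, keyFn) {
  const groups = new Map();
  for (const po of orders) {
    const id = primaryId(po);
    if (!groups.has(id)) groups.set(id, { primary: null, secondaries: [] });
    const group = groups.get(id);
    if (po.order.includes('.')) group.secondaries.push(po);
    else group.primary = po;
  }
  return [...groups.values()]
    .filter(g => g.primary !== null) // secondaries without a primary are dropped in this sketch
    .sort((a, b) => {
      const ka = keyFn(a.primary), kb = keyFn(b.primary);
      return ka < kb ? -1 : ka > kb ? 1 : 0;
    })
    .flatMap(g => [g.primary, ...g.secondaries.sort((a, b) => b.shipDate - a.shipDate)]);
}

// Sample data from the question (months are zero-based in the Date constructor).
const purchaseOrders = [
  { order: '6', shipDate: new Date(2006, 0, 16), address: 'Tallahassee FL' },
  { order: '19.1', shipDate: new Date(2006, 1, 25), address: 'Milwaukee WI' },
  { order: '5.1', shipDate: new Date(2006, 3, 11), address: 'Chicago IL' },
  { order: '8', shipDate: new Date(2006, 4, 16), address: 'Fresno CA' },
  { order: '19', shipDate: new Date(2006, 6, 3), address: 'Seattle WA' },
  { order: '5', shipDate: new Date(2006, 4, 1), address: 'Avery UT' },
  { order: '5.2', shipDate: new Date(2006, 7, 22), address: 'Rice Lake MO' }
];

const sorted = sortWithSecondaries(purchaseOrders, po => po.address);
console.log(sorted.map(po => po.order).join(', ')); // 5, 5.2, 5.1, 8, 19, 19.1, 6
```

Sorting the groups by the primary's own key, rather than sorting all rows before grouping, keeps a secondary's address from influencing where its group lands.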
