I am a senior developer but new to Pig.
We have a use case to construct a metric in Pig Latin as follows:
Count of customers who (purchased items this month AND purchased items in the prior month) / count of customers who purchased items in the prior month
The first step would seem to be to generate the customer counts with FOREACH GROUP GENERATE COUNT(Purchases);, write them to a file, and then read the file back in again.
When I read the data back in, is there a way in a FOREACH to compare the current row (which would now be an aggregate count by month) with the previous row?
Possibly the data should be pivoted before it is written out and read back in, so that each column is compared to the 'previous' one going left to right instead of row by row?
Can a CASE statement in Pig have something like this?
case (customerboolean_has_sales_february + customerboolean_has_sales_january)
2 countsalesfeb + countsalesjan / countsalesjanuary
1 null
0 null
Customers who rode in month AND rode in prior month / Total customers who rode in prior month
1)
Load the file twice into relation A and relation B.
Sort relations A and B by month.
Use RANK to generate row numbers.
From relation B, remove the top record.
Use RANK on relation B to get a new relation, B_New.
Join A and B_New on the row numbers.
For each row, generate the result you want from the columns (see the sketch below).
I've answered a similar question here.
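A minimal Pig Latin sketch of those steps, assuming the file holds one row per month with a month key and a customer count (the file name, field names, and storage format are assumptions):

-- load the monthly counts twice (adjust the LOAD for your storage format)
A = LOAD 'monthly_counts' AS (month:chararray, cust_count:long);
B = LOAD 'monthly_counts' AS (month:chararray, cust_count:long);
-- sort both relations by month and attach row numbers
A_sorted = ORDER A BY month;
B_sorted = ORDER B BY month;
A_ranked = RANK A_sorted;   -- prepends field rank_A_sorted
B_ranked = RANK B_sorted;   -- prepends field rank_B_sorted
-- drop the first (earliest) row of B so row n of A pairs with row n+1 of B
B_tail = FILTER B_ranked BY rank_B_sorted > 1;
B_proj = FOREACH B_tail GENERATE month, cust_count;
B_New  = RANK B_proj;       -- prepends field rank_B_proj
-- join on the row numbers: A now carries the prior month, B_New the current month
J = JOIN A_ranked BY rank_A_sorted, B_New BY rank_B_proj;
RESULT = FOREACH J GENERATE A_ranked::month      AS prior_month,
                            B_New::month         AS month,
                            A_ranked::cust_count AS prior_count,
                            B_New::cust_count    AS month_count;
DUMP RESULT;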
2)
can a case statement in pig have something like this?
case (customerboolean_has_sales_february + customerboolean_has_sales_january)
Yes.
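For reference, a sketch of what that could look like with Pig's CASE syntax (available from Pig 0.12); the field names come from the question, and the parenthesization of the ratio is a guess:

-- 'data' is assumed to be a relation holding the boolean flags and count fields
out = FOREACH data GENERATE
    (CASE (customerboolean_has_sales_february + customerboolean_has_sales_january)
        WHEN 2 THEN (double)(countsalesfeb + countsalesjan) / countsalesjanuary
     END) AS metric;
-- with no ELSE branch, the unmatched cases (a sum of 0 or 1) come out as null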
I have 2 columns: Matches (Integer) and Accounts_type (String). I want to create a third column with the proportion of matches played by each account type. I am new to Talend and have been stuck on this for the past 2 days; I did a lot of research but to no avail. Please help.
You can do it like this:
You need to read your source data twice (I used tFixedFlowInput_1 and tFixedFlowInput_2 with the same data). The idea is to calculate the total of your matches in tAggregateRow_1 (it simply does a sum of all Matches without a group-by column) and then use that as a lookup.
The tMap then joins your source data with the calculated total. Since the total will always be one record, you don't need any join column. You then simply divide Matches by Total as required.
This is supposing you have unique values in Account_type; if you don't, you need to add another tAggregateRow between your source and tMap_1, in order to get sum of Matches for each Account_type (group by Account_type).
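As a rough sketch, the output expression in tMap_1 could then be something along these lines (Talend expressions are Java; the row and column names here are assumptions based on the description above):

// row1 is the main flow (Matches, Accounts_type); row2 is the lookup total from tAggregateRow_1
(double) row1.Matches / row2.Total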
Context:
I have a data set of the weights of truck and trailer combinations coming into my site over the span of a few years. I have organized my data by season, as I am trying to prove that the truck:trailer combinations in winter are noticeably heavier due to ice, snow, and mud. The theory is that if the tare weight in this season (the weight of the truck after it empties its load) is higher than its average tare weight (which I need to calculate from the data), it can be deduced that the truck:trailer combinations are coming in with extra weight that we pay for in part, as some snow/ice/mud falls off in the trailer-emptying process.
What I've done so far:
I've defined a custom date range for my seasons
I've grouped by Truck:Trailer using Count (to get a duplicates column) and All Rows (to keep all my details)
I've filtered out every combination I've seen fewer than 50 times, as I want good representation for each truck:trailer combo so that I can better emphasize repeated patterns
I've added an index column to better keep track of the individuals before expanding the details
What I need to do:
I only want to work with truck:trailer combinations which have weighed in for all four seasons at least once
I need to find the average tare weight of the truck:trailer combinations over the extended range for both summer and autumn (the dry time of the year), while preserving the raw tare data for all seasons, as I eventually need to compare the winter tare values to this average.
Example of my data (screenshot).
When I'm finished I'd like the data to look something like this (pivot chart and query data screenshots).
For your first question (all seasons), you can add a column that holds the distinct count of the values in [Season] for each [Driver:Trailer]. Then filter your table on that column, keeping only the 4's. To achieve this, add the following M code to your script in the Advanced Editor. Change the part after "in" to #"DistinctCount Season".
#"DistinctCount Season" = Table.Join(#"insert name previous step","Driver:Trailer",
Table.Group(#"insert name previous step", {"Driver:Trailer"},
{{"DistinctCountSeasons", each Table.RowCount(Table.Distinct(_,"Season")),
type number}}),"Driver:Trailer")
Insert the name of your previous step where indicated.
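After that, a follow-up filter step that keeps only the combinations seen in all four seasons could look like this (the step name is just illustrative):

#"Filtered All Seasons" = Table.SelectRows(#"DistinctCount Season", each [DistinctCountSeasons] = 4)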
For the second question:
You can use a matrix visual for that in your report. First create a measure:
[AverageTare] = AVERAGE('table'[Tare])
Then put [Season] on Rows and [AverageTare] on Values. You can create a group (right-click on [Season] in the FIELDS pane) called [DrySeason] to combine the values for Summer and Autumn.
If that doesn't work for you, explore the AVERAGEX function.
EDIT
In Excel you can use a PivotTable. Put [Season] on Rows and [AverageTare] on Values. Right-click a value in the PivotTable, select Value Field Settings, and choose Average. Then select the Seasons you want to group, right-click, and select Group.
EDIT 2
To add a column in the Power Query Editor that holds the average [Tare] for the [Season] in each row, add the following steps to your script in the Advanced Editor:
#"GroupedSeasonAvg" = Table.Group(#"Insert name previous step", {"Season"}, {{"AVG", each List.Average([Tare]), type number}}),
#"JoinOnSeason" = Table.NestedJoin(#"Insert name previous step",{"Season"},GroupedSeasonAvg,{"Season"},"AVGGrouped"),
#"ExtractSeasonAVG" = Table.ExpandTableColumn(JoinOnSeason, "AVGGrouped", {"AVG"}, {"SeasonAVG"})
It works something like this:
#"GroupedSeasonAvg": Creates a table with the averages for each [Season].
#"JoinOnSeason": Creates a new column of tables, joining the [Season] value of each row to [Season] in the grouped table.
#"ExtractSeasonAVG": Expands each table and keeps only [AVG].
Crosstab report works 99%.
About 20 rows, all but one are ok.
5 columns - Company Division.
The rows are things like cost, revenue, revenue 2, etc.
All the rows that work have three attributes I'm using to select them:
Fiscal Year
Period
Solution.
The problem is that there is a table that lists a YTD rate for each period. This table is not division-specific; it's company-wide.
All the tables are linked to the accounting period table that has fiscal year and period. So the overall query limits data to fiscal year (?pFiscalYear?) and period <= ?pPeriod?, based on prompt page results.
The source table has this:
FY_CD PD_NO ACT_CURR_RT ACT_YTD_RT
2018 1 0.36121715 0.36121715
2018 2 0.32471476 0.34255512
2018 3 0.25240906 0.31210183
2018 4 0.33154745 0.31925874
Note the YTD rate is not an average of any of the other numbers.
When I select the ACT_YTD_RT, as a row, I want the ACT_YTD_RT that matches the selected period.
What I get is the average if I set the aggregation to average, or the lowest if I set it to other aggregations. So sometimes it looks right (if I run for periods 1, 2, 3, since the rate kept falling), and sometimes it's wrong (period 4 returns .3121 instead of .3192).
I've tried a number of different methods and can generate garbage data (totals, min, max, average) and crossjoins but can't figure out how to get the value I'm looking for.
I want ACT_YTD_RT where fiscal year = ?pFiscalYear? and period = ?pPeriod?.
I tried a straight if then clause:
if (sourcetable.fiscalYear = ?pFiscalYear?) and (sourcetable.Period = ?pPeriod?) then (ACT_YTD_RT)
but I get an error like this:
'ACT_YTD_RT' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. (SQLSTATE=42000, SQLERRORCODE=8120)
If I create another query that generates the right response and try to include it, I get a crossjoin error that the query I'm referencing is trying to crossjoin several other items in the crosstab query.
A union doesn't work (different number of columns).
Not sure how a join would work since the division doesn't exist in the rate table.
Maybe I could create a view in the database that does a cross join of the division table and the rate table, add that to the framework, and then I wouldn't have a crossjoin issue since the solution would be in the rate "table" (really a view), but that seems wrong somehow.
If I could just write a freaking parameterized query direct to the database I'd be done. But in Cognos 11 crosstabs I can't find a place for a SQL query object. And that shouldn't be necessary.
I've spent hours and hours chasing this in circles.
Anybody have any ideas?
Thanks
Paul
So the earlier problem was that this:
if (sourcetable.fiscalYear = ?pFiscalYear?) and (sourcetable.Period = ?pPeriod?) then (ACT_YTD_RT)
Generated an error like this:
'ACT_YTD_RT' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. (SQLSTATE=42000, SQLERRORCODE=8120)
To fix the above, I had to add a cross join of the division table and the rate table as a view in the database. Then add that to the framework. Then build the data item this way:
total (
if (sourcetable.fiscalYear = ?pFiscalYear?) and (sourcetable.Period = ?pPeriod?) then (ACT_YTD_RT)
)
And now the "total" provides the missing group by. And the crossjoin in the database provides the division information so the crosstab is happy.
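For reference, a cross-join view like the one described might look roughly like this in the database (the view, table, and division column names here are assumptions, not the actual schema):

-- pair every division with every fiscal year/period rate row
CREATE VIEW DIVISION_RATE AS
SELECT d.DIVISION_CD,
       r.FY_CD,
       r.PD_NO,
       r.ACT_CURR_RT,
       r.ACT_YTD_RT
FROM DIVISION_TABLE d
CROSS JOIN RATE_TABLE r;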
I still think there should have been an easier way to do this, but I have a functioning hammer at the moment.
Is it possible to calculate the moving average for 4 weeks in a separate column within the pivot shown here?
This is what I have:
http://i58.tinypic.com/2rraweo.jpg
Needs to have additional columns like:
http://i59.tinypic.com/30ti8ok.jpg
Side note: This is a union report and the "Total # of Submitted SRs" column is a sum measure from a fact table.
The answer was the function MAVG("column", 4) as long as the rows are defined by each week's count of "Total # of Submitted SRs".
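For example, assuming the measure is referenced the same way as elsewhere in the report, the extra pivot column's formula would be something like the following, where "Fact Table" stands in for the actual presentation folder:

MAVG("Fact Table"."Total # of Submitted SRs", 4)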
I have to create a matrix in SSRS to detail the number of users leaving an organisation.
The columns will all represent time spans of 1 week and the rows will all represent departments in the organisation. The detail portion will be a count of people who have left that area in that week.
I have a leaving date field in the DB but nothing that flags the specific intervals I have been told to use. That means that, as the matrix stands, it counts the users that have left a specific department, but the date range of the columns is 1 day, not 1 week. Is there a way to force the column headers to respect the week intervals I want, given that they are currently coming from the dataset and are not hard coded?
First, try to manage your data in SQL itself by using GROUP BY on the date and making each group a one-week period. That way you can get all the data in your required format.
I don't know what your columns are, so I am just showing a way to get the week groups from the table and the count of the people:
SELECT DATEPART(wk, datevaluecolumn) weekno
, SUM(peopleleavingcolumn) totalvalue
FROM yourTable
GROUP BY DATEPART(wk, datevaluecolumn)
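If the matrix also needs the department breakdown, a sketch closer to that shape might be (table and column names are assumptions):

SELECT Department
     , DATEPART(wk, LeavingDate) AS WeekNo
     , COUNT(*) AS Leavers
FROM yourUsersTable
WHERE LeavingDate IS NOT NULL
GROUP BY Department, DATEPART(wk, LeavingDate)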