Access Report: How to group on one field, but sort by another?

I've read through similar questions and they don't seem to quite fit my issue or they're in a different environment.
I'm working in MS-Access 2016.
I have a customer complaints report which has fields: year, month, count([complaint #]), complaint_desc.
(complaint # is the literal ID number we assign to each complaint entered into the table)
I grouped the report by year and month, and then by complaint_desc. For each desc I did a count of complaint number, and then did a second count of complaint # to add up the total complaints for the month and put it in the month footer. That gives a result something like this:
2020 03 <= (this is the month group header)
complaint desc | count of complaints/desc
---------------------------------------------
electrical | 2 {This section is
cosmetic | 6 {in the Complaint_desc
mechanical | 1 {group footer
---------------------------------------------
9 <= (this is month group footer)
repeating the group for each month
This is all good. What I want to do is to sort the records within the complaint desc group in descending order of count(complaint#) so that it looks like:
2020 03
complaint desc | count of complaints/category
---------------------------------------------
cosmetic | 6
electrical | 2
mechanical | 1
---------------------------------------------
9
However, nothing I do seems to work. The desc group's built-in sort ("with A on top") overrides the sorting in the query, and adding a sort by complaint# is ignored as well. I tried to add a sort by count(complaint#), and Access told me I can't have an aggregate function in an ORDER BY (but I think it would have been overridden anyway). I also tried to group by count(complaint#), which was also shot down as an aggregate in a GROUP BY. I tried moving complaint_desc and count(complaint#) to the complaint# group header, but that screwed up the total count in the month footer and also split up the complaint descs, defeating its original purpose...
I really didn't think this change was going to be a big deal, but a solution has evaded me for a while now. I've read similar questions and tried to follow their examples, but they didn't lead to my intended result.
Any ideas?

I figured it out! Thank you to #UnhandledException who got me thinking on the right track.
So here's what I did:
The original query the report was based on contained the following:
Design mode:
Field | Year | Month | Complaint_Desc | Complaint# |
Total | Group By | Group By | Group By | Group By |
Sort | | | | |
or in SQL:
SELECT Year, Month, [tbl Failure Mode].[Code description], [Complaint Data Table].[Complaint #]
FROM [tbl Failure Mode] RIGHT JOIN [Complaint Data Table] ON [tbl Failure Mode].[ID code] = [Complaint Data Table].[Failure Mode]
GROUP BY Year, Month, [tbl Failure Mode].[Code description], [Complaint Data Table].[Complaint #];
And then I was using the report's group and sort functions to make it show how I wanted except for the hiccup I mentioned.
I made another query based upon that query:
Design mode:
Field | Year | Month | Complaint_Desc | Complaint# |
Total | Group By | Group By | Group By | Count |
Sort | Descending | Descending | | Descending |
or in SQL:
SELECT [qry FailureMode].Year, [qry FailureMode].Month, [qry FailureMode].[Complaint_description], Count([qry FailureMode].[Complaint #]) AS [CountOfComplaint #], [qry FailureMode].Complaint
FROM [qry FailureMode]
GROUP BY [qry FailureMode].Year, [qry FailureMode].Month, [qry FailureMode].[Code description], [qry FailureMode].Complaint
ORDER BY [qry FailureMode].Year DESC , [qry FailureMode].Month DESC , Count([qry FailureMode].[Complaint #]) DESC;
Then I changed the report structure:
I eliminated the Complaint_Desc group and moved complaint_desc and CountofComplaint# (which is now not a function but its own calculated field from my new query) to the DETAIL section of the report. Then I deleted my 2nd count(complaint#), which was in the month footer as a total for each month, and replaced it with the "AccessTotalsCountOfComplaint #", which is =Sum([CountOfComplaint #]). I had Access auto-create it by right-clicking on the CountofComplaint_Desc in details, scrolling to "Total" and clicking on "Sum". (I deleted the extra AccessTotalsCountOfComplaint#'s that were outside of the Month Group Footer that I needed it for...)
Et voilà!
I hope this helps someone else, and thank you again to Unhandled Exception who pointed me in the right direction.
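The idea behind the fix (aggregate in the query, sort on the aggregate there, and let the report only sum the pre-computed counts) can be sketched outside Access. This is a minimal illustration using Python's built-in sqlite3 module rather than Access; the table complaints and its column names are hypothetical stand-ins, and the data is the 2020-03 example from the question.

```python
import sqlite3

# Hypothetical, simplified stand-in for the complaint data; the table and
# column names are illustrative, not the Access names from the post.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE complaints ("
    "complaint_no INTEGER, yr INTEGER, mon INTEGER, complaint_desc TEXT)"
)
rows = (
    [(n, 2020, 3, "electrical") for n in range(1, 3)]   # 2 complaints
    + [(n, 2020, 3, "cosmetic") for n in range(3, 9)]   # 6 complaints
    + [(9, 2020, 3, "mechanical")]                      # 1 complaint
)
conn.executemany("INSERT INTO complaints VALUES (?, ?, ?, ?)", rows)


def counts_by_desc(conn):
    """Aggregate per description first, then sort on the aggregate."""
    return conn.execute(
        """
        SELECT yr, mon, complaint_desc, COUNT(complaint_no) AS complaint_count
        FROM complaints
        GROUP BY yr, mon, complaint_desc
        ORDER BY yr DESC, mon DESC, complaint_count DESC
        """
    ).fetchall()


result = counts_by_desc(conn)
# The month footer then only needs a sum of the pre-computed counts.
month_total = sum(row[3] for row in result)

for row in result:
    print(row)
print("month total:", month_total)
```

Because the counts arrive in the report as ordinary field values, the report no longer needs a group on the description, so nothing overrides the query's sort order.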

Related

In HiveQL, what is the most elegant/performant way of calculating an average value if some of the data is implicitly not present?

In HiveQL, what is the most elegant and performant way of calculating an average value when there are 'gaps' in the data, with implicit repeated values between them? For example, consider a table with the following data:
+----------+----------+----------+
| Employee | Date | Balance |
+----------+----------+----------+
| John | 20181029 | 1800.2 |
| John | 20181105 | 2937.74 |
| John | 20181106 | 3000 |
| John | 20181110 | 1500 |
| John | 20181119 | -755.5 |
| John | 20181120 | -800 |
| John | 20181121 | 1200 |
| John | 20181122 | -400 |
| John | 20181123 | -900 |
| John | 20181202 | -1300 |
+----------+----------+----------+
If I try to calculate a simple average of the November rows, it will return ~722.78, but the average should take into account that the days that are not shown have the same balance as the previous record. In the above data, John had 1800.2 between 20181101 and 20181104, for example.
Assuming that the table always has exactly one row for each date/balance, and given that I cannot change how this data is stored (and probably shouldn't, since it would be a waste of storage to write rows for days with unchanged balances), I've been tinkering with getting the average from a select with subqueries for all the days in the queried month, returning a NULL for the absent days, and then using CASE to get the balance from the previous available date in reverse order. All of this just to avoid writing temporary tables.
Step 1: Original Data
The 1st step is to recreate a table with the original data. Let's say the original table is called daily_employee_balance.
daily_employee_balance
use default;
drop table if exists daily_employee_balance;
create table if not exists daily_employee_balance (
employee_id string,
employee string,
iso_date date,
balance double
);
Insert Sample Data in original table daily_employee_balance
insert into table daily_employee_balance values
('103','John','2018-10-25',1800.2),
('103','John','2018-10-29',1125.7),
('103','John','2018-11-05',2937.74),
('103','John','2018-11-06',3000),
('103','John','2018-11-10',1500),
('103','John','2018-11-19',-755.5),
('103','John','2018-11-20',-800),
('103','John','2018-11-21',1200),
('103','John','2018-11-22',-400),
('103','John','2018-11-23',-900),
('103','John','2018-12-02',-1300);
Step 2: Dimension Table
You will need a dimension table that holds a calendar (a table with all the possible dates); call it dimension_date. Having a calendar table is standard industry practice, and you can probably download sample data for one from the internet.
use default;
drop table if exists dimension_date;
create external table dimension_date(
date_id int,
iso_date string,
year string,
month string,
month_desc string,
end_of_month_flg string
);
Insert some sample data for entire month of Nov 2018:
insert into table dimension_date values
(6880,'2018-11-01','2018','2018-11','November','N'),
(6881,'2018-11-02','2018','2018-11','November','N'),
(6882,'2018-11-03','2018','2018-11','November','N'),
(6883,'2018-11-04','2018','2018-11','November','N'),
(6884,'2018-11-05','2018','2018-11','November','N'),
(6885,'2018-11-06','2018','2018-11','November','N'),
(6886,'2018-11-07','2018','2018-11','November','N'),
(6887,'2018-11-08','2018','2018-11','November','N'),
(6888,'2018-11-09','2018','2018-11','November','N'),
(6889,'2018-11-10','2018','2018-11','November','N'),
(6890,'2018-11-11','2018','2018-11','November','N'),
(6891,'2018-11-12','2018','2018-11','November','N'),
(6892,'2018-11-13','2018','2018-11','November','N'),
(6893,'2018-11-14','2018','2018-11','November','N'),
(6894,'2018-11-15','2018','2018-11','November','N'),
(6895,'2018-11-16','2018','2018-11','November','N'),
(6896,'2018-11-17','2018','2018-11','November','N'),
(6897,'2018-11-18','2018','2018-11','November','N'),
(6898,'2018-11-19','2018','2018-11','November','N'),
(6899,'2018-11-20','2018','2018-11','November','N'),
(6900,'2018-11-21','2018','2018-11','November','N'),
(6901,'2018-11-22','2018','2018-11','November','N'),
(6902,'2018-11-23','2018','2018-11','November','N'),
(6903,'2018-11-24','2018','2018-11','November','N'),
(6904,'2018-11-25','2018','2018-11','November','N'),
(6905,'2018-11-26','2018','2018-11','November','N'),
(6906,'2018-11-27','2018','2018-11','November','N'),
(6907,'2018-11-28','2018','2018-11','November','N'),
(6908,'2018-11-29','2018','2018-11','November','N'),
(6909,'2018-11-30','2018','2018-11','November','Y');
Step 3: Fact Table
Create a fact table from the original table. In normal practice, you ingest the data into HDFS/Hive, then process the raw data and create a table with historical data that you keep inserting into incrementally. You can look into data warehousing to get the proper definition, but I call this a fact table - f_employee_balance.
This will re-create the original table with the missing dates and populate each missing balance with the earlier known balance.
--inner query to get all the possible dates
--outer self join query will populate the missing dates and balance
drop table if exists f_employee_balance;
create table f_employee_balance
stored as orc tblproperties ("orc.compress"="SNAPPY") as
select q1.employee_id, q1.iso_date,
nvl(last_value(r.balance, true) --initial dates to be populated with 0 balance
over (partition by q1.employee_id order by q1.iso_date rows between unbounded preceding and current row),0) as balance,
month, year from (
select distinct
r.employee_id,
d.iso_date as iso_date,
d.month, d.year
from daily_employee_balance r, dimension_date d )q1
left outer join daily_employee_balance r on
(q1.employee_id = r.employee_id) and (q1.iso_date = r.iso_date);
Step 4: Analytics
The query below will give you the true average by month:
select employee_id, monthly_avg, month, year from (
select employee_id,
row_number() over (partition by employee_id,year,month) as row_num,
avg(balance) over (partition by employee_id,year,month) as monthly_avg, month, year from
f_employee_balance)q1
where row_num = 1
order by year, month;
Step 5: Conclusion
You could have just combined steps 3 and 4; this would save you from creating an extra table. When you are in the big data world, you don't worry much about wasting extra disk space or development time. You can easily add another disk or node and automate the process using workflows. For more information, please look into data warehousing concepts and Hive analytical queries.
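To check the numbers on a small sample without a cluster, the fill-forward logic can be sketched in plain Python. This is only an illustration of the arithmetic, not a replacement for the Hive queries; the function names are made up for this sketch, and note one deliberate difference from step 3: the days at the start of the month are seeded with the last balance recorded before the month (as the question describes) instead of 0.

```python
from datetime import date, timedelta

# Sample rows from the answer's daily_employee_balance table (one employee).
balances = {
    date(2018, 10, 25): 1800.2,
    date(2018, 10, 29): 1125.7,
    date(2018, 11, 5): 2937.74,
    date(2018, 11, 6): 3000.0,
    date(2018, 11, 10): 1500.0,
    date(2018, 11, 19): -755.5,
    date(2018, 11, 20): -800.0,
    date(2018, 11, 21): 1200.0,
    date(2018, 11, 22): -400.0,
    date(2018, 11, 23): -900.0,
    date(2018, 12, 2): -1300.0,
}


def naive_average(balances, year, month):
    """Average only the rows that exist in the month (ignores the gaps)."""
    values = [b for d, b in balances.items() if (d.year, d.month) == (year, month)]
    return sum(values) / len(values)


def monthly_average(balances, year, month):
    """Average over every calendar day, carrying the last known balance forward."""
    first = date(year, month, 1)
    next_month = date(year + month // 12, month % 12 + 1, 1)
    # Assumption: seed with the latest balance before the month, or 0.0 if none.
    earlier = [d for d in balances if d < first]
    current = balances[max(earlier)] if earlier else 0.0
    total, days = 0.0, 0
    day = first
    while day < next_month:
        current = balances.get(day, current)  # keep the previous balance on gaps
        total += current
        days += 1
        day += timedelta(days=1)
    return total / days


print(naive_average(balances, 2018, 11))
print(monthly_average(balances, 2018, 11))
```

The two results differ noticeably, which is the whole point of filling the gaps before averaging.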

LOOKUPVALUE based upon aggregate function in DAX

I need a calculated column (because this will be used in a slicer) that returns the employee's most recent supervisor.
Data sample (table 'Performance'):
EMPLOYEE | DATE | SUPERVISOR
--------------------------------------------
Jim | 2018-11-01 | Bob
Jim | 2018-11-02 | Bob
Jim | 2018-11-03 | Bill
Mike | 2018-11-01 | Steve
Mike | 2018-11-02 | Gary
Desired Output:
EMPLOYEE | DATE | SUPERVISOR | LAST SUPER
---------------------------------------------------------------
Jim | 2018-11-01 | Bob | Bill
Jim | 2018-11-02 | Bob | Bill
Jim | 2018-11-03 | Bill | Bill
Mike | 2018-11-01 | Steve | Gary
Mike | 2018-11-02 | Gary | Gary
I tried to use
LAST SUPER =
LOOKUPVALUE (
Performance[SUPERVISOR],
Performance[DATE], MAXX ( Performance, [DATE] )
)
but I get the error:
Calculation error in column 'Performance'[]: A table of multiple
values was supplied where a single value was expected.
After doing more research, it appears this approach was doomed from the start. According to this, the search value cannot refer to any column in the same table being searched. However, even when I changed the search value to TODAY() or a static date as a test, I got the same error about multiple values. MAXX() is also returning the maximum date in the entire table, not just for that employee.
I wondered if it was a many to many issue, so I went back into Power Query, duplicated the original query, grouped by EMPLOYEE to get MAX(DATE), matched both fields against the original query to get the SUPERVISOR on MAX(DATE), and can treat this like a regular lookup table. While it does work, unsurprisingly the refresh is markedly slower.
I can't decide if I'm over-complicating, over-simplifying, or just wildly off base with either approach, but I would be grateful for any suggestions.
What I'd like to know is:
Is it possible to use a simple function like LOOKUPVALUE() to achieve the desired output?
If not, is there a more efficient approach than duplicating the query?
The reason LOOKUPVALUE is giving that particular error is that it's doing a lookup on the whole table, not just the rows associated with that particular employee. So if you have multiple supervisors matching the same maximal date, then you've got a problem.
If you want to use the LOOKUPVALUE function for this, I suggest the following:
Last Super =
VAR EmployeeRows =
FILTER( Performance, Performance[Employee] = EARLIER( Performance[Employee] ) )
VAR MaxDate = MAXX( EmployeeRows, Performance[Date] )
RETURN
LOOKUPVALUE(
Performance[Supervisor],
Performance[Date], MaxDate,
Performance[Employee], Performance[Employee]
)
There are two key differences here:
I'm taking the maximal date over only the rows for that particular employee (EmployeeRows).
I'm including Employee in the lookup function, so that it only matches for the appropriate employee.
For other possible solutions, please see this question:
Return top value ordered by another column
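For readers who want to check the expected output independently of DAX, the same logic (find each employee's latest date, then take the supervisor on that date) can be sketched in Python. The function name is hypothetical, and the rows are the sample data from the question.

```python
# Sample rows from the question's 'Performance' table: (employee, date, supervisor).
rows = [
    ("Jim", "2018-11-01", "Bob"),
    ("Jim", "2018-11-02", "Bob"),
    ("Jim", "2018-11-03", "Bill"),
    ("Mike", "2018-11-01", "Steve"),
    ("Mike", "2018-11-02", "Gary"),
]


def last_supervisor_by_employee(rows):
    """Return {employee: supervisor on that employee's most recent date}."""
    latest = {}
    for employee, day, supervisor in rows:
        # ISO-formatted dates compare correctly as plain strings.
        if employee not in latest or day > latest[employee][0]:
            latest[employee] = (day, supervisor)
    return {employee: sup for employee, (_, sup) in latest.items()}


last_super = last_supervisor_by_employee(rows)
# Equivalent of the calculated column: every row gets its employee's last supervisor.
with_last = [(e, d, s, last_super[e]) for e, d, s in rows]

for row in with_last:
    print(row)
```

The key point mirrors the DAX answer: the maximum date is taken per employee, never over the whole table.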

Column that sums values once per unique ID, while filtering on type (Oracle Fusion Transportation Intelligence)

I realize that this has been discussed before, but I haven't seen a solution in a simple CASE expression for adding a column in Oracle FTI, which is as far as my experience goes at the moment, unfortunately. My end goal is to have a total Weight for each Category, counting only the null Type entries and only one Weight per ID (I don't know why null was chosen as the default Type). I need to break the data apart by Type for a total Cost column, which is working fine, so I didn't include that in the example data below. But because I have to break the data up by Type, I am having trouble eliminating redundant values in my Total Weight results.
My original column which included redundant weights was as follows:
SUM(CASE Type
WHEN null
THEN 'Weight'
ELSE null
END)
Some additional info:
Each ID can have multiple Types (additionally each ID may not always have A or B but should always have null)
Each ID can only have one Weight (But when broken apart by type the value just repeats and messes up results)
Each ID can only have one Category (This doesn't really matter since I already separate this out with a Category column in the results)
Example Data:
ID |Categ. |Type | Weight
1 | Old | A | 1600
1 | Old | B | 1600
1 | Old |(null) | 1600
2 | Old | B | 400
2 | Old |(null) | 400
2 | Old |(null) | 400
3 | New | A | 500
3 | New | B | 500
3 | New |(null) | 500
4 | New | A | 500
4 | New |(null) | 500
4 | New |(null) | 500
Desired Results:
Categ. | Total Weight
Old | 2000
New | 1000
I was trying to figure out how to include a DISTINCT based on ID in the column, but when I put DISTINCT in front of CASE it just eliminates redundant weights so I would just get 500 for Total Weight New.
Additionally, I thought it would be possible to divide the weight by the count of weights before aggregating them, but this didn't seem to work either:
SUM(CASE Type
WHEN null
THEN 'Weight'/COUNT(CASE Type
WHEN null
THEN 'Weight'
ELSE null
END)
ELSE null
END)
I am very appreciative of any help that can be offered, please let me know if there is a simple way to create a column that achieves the desired results. As it may be apparent, I am pretty new to Oracle, so please let me know if there is any additional information that is needed.
Thanks,
Will
You don't need a case statement here. You were on the right track with distinct, but you also need to use an inline view (a subquery in the FROM clause).
The subquery in the from clause, selecting all distinct combinations of (id, categ, weight), allows you to then select from the result set, whereby you select only categ, sum of weight, grouping by categ. The subquery in the from clause has no repeated weights for a given id (unlike the table itself, which is why this is needed).
This would have to be done a little differently if an id were ever to have more than one category, but you noted that an id only ever has one category.
select categ,
sum(weight)
from (select distinct id,
categ,
weight
from tbl)
group by categ;
Fiddle: http://sqlfiddle.com/#!4/11a56/1/0
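The same query can be tried locally against the example data from the question. This is a small sketch using Python's built-in sqlite3 module instead of Oracle, so it only demonstrates the technique; tbl and its column names follow the answer, and the null Types are stored as NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER, categ TEXT, type TEXT, weight INTEGER)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [
        (1, "Old", "A", 1600), (1, "Old", "B", 1600), (1, "Old", None, 1600),
        (2, "Old", "B", 400), (2, "Old", None, 400), (2, "Old", None, 400),
        (3, "New", "A", 500), (3, "New", "B", 500), (3, "New", None, 500),
        (4, "New", "A", 500), (4, "New", None, 500), (4, "New", None, 500),
    ],
)

# The inline view keeps one (id, categ, weight) row per ID, so each weight is
# summed once even though it repeats for every Type in the base table.
totals = conn.execute(
    """
    SELECT categ, SUM(weight) AS total_weight
    FROM (SELECT DISTINCT id, categ, weight FROM tbl)
    GROUP BY categ
    ORDER BY categ DESC
    """
).fetchall()

for categ, total_weight in totals:
    print(categ, total_weight)
```

Including id in the DISTINCT is what keeps IDs 3 and 4 (both 500) from collapsing into a single row.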

How do I retrieve a specific row in Hive?

I have a dataset that looks like this:
-------------------------------
cust | cost | cat   | name
-------------------------------
1    | 2.5  | apple | pkLady
1    | 3.5  | apple | greenGr
1    | 1.2  | pear  | yelloPear
1    | 4.5  | pear  | greenPear
-------------------------------
My Hive query should now compare the cheapest price of each item the customer bought. So I want to get the 2.5 and the 1.2 into one row to get their difference. Since I am new to Hive, I don't know how to ignore everything else until I reach the next category of item while still keeping the cheapest price from the previous category.
You can use something like the following:
select cat, min(cost) from table group by cat;
Given your options (brickhouse UDFs, hive windowing functions or a self-join) in Hive, a self-join is the worst way to do this.
select *
, (cost - min(cost) over (partition by cust)) cost_diff
from table
You could create a subquery containing the minimum cost for each customer, and then join it to the original table:
select
mytable.*,
minCost.minCost,
cost - minCost as costDifference
from mytable
inner join
(select
cust,
min(cost) as minCost
from mytable
group by cust) minCost
on mytable.cust = minCost.cust
I created an interactive SQLFiddle example using MySQL, but it should work just fine in Hive.
I think this is really a SQL question rather than a Hive question. If you just want the cheapest cost per customer, you can do:
select cust, min(cost)
from yourtable
group by cust
Otherwise, if you want the cheapest cost per customer per category, you can do:
select cust, cat, min(cost)
from yourtable
group by cust, cat
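A quick way to sanity-check the grouping is to run it on the four sample rows. The sketch below uses Python's built-in sqlite3 module rather than Hive, and the table name purchases is made up. The second query is an assumption about what "into one row" means: it uses conditional aggregation, with the two category names hard-coded, to put both minimums side by side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (cust INTEGER, cost REAL, cat TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [
        (1, 2.5, "apple", "pkLady"),
        (1, 3.5, "apple", "greenGr"),
        (1, 1.2, "pear", "yelloPear"),
        (1, 4.5, "pear", "greenPear"),
    ],
)

# Cheapest cost per customer per category (one row per category).
per_category = conn.execute(
    "SELECT cust, cat, MIN(cost) FROM purchases GROUP BY cust, cat ORDER BY cust, cat"
).fetchall()

# Both minimums in a single row per customer via conditional aggregation;
# MIN ignores the NULLs produced for rows of the other category.
one_row = conn.execute(
    """
    SELECT cust,
           MIN(CASE WHEN cat = 'apple' THEN cost END) AS min_apple,
           MIN(CASE WHEN cat = 'pear' THEN cost END) AS min_pear
    FROM purchases
    GROUP BY cust
    """
).fetchall()

print(per_category)
for cust, min_apple, min_pear in one_row:
    print(cust, min_apple, min_pear, round(min_apple - min_pear, 2))
```

If the set of categories is not known in advance, the per-category result is the more general form and the difference can be computed by the caller.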

Will this type of pagination scale?

I need to paginate on a set of models that can/will become large. The results have to be sorted so that the latest entries are the ones that appear on the first page (and then, we can go all the way to the start using 'next' links).
The query to retrieve the first page is the following, 4 is the number of entries I need per page:
SELECT "relationships".* FROM "relationships" WHERE ("relationships".followed_id = 1) ORDER BY created_at DESC LIMIT 4 OFFSET 0;
Since this needs to be sorted and since the number of entries is likely to become large, am I going to run into serious performance issues?
What are my options to make it faster?
My understanding is that an index on 'followed_id' will simply help the where clause. My concern is on the 'order by'
Create an index that contains these two fields in this order (followed_id, created_at)
Now, how large is the "large" we are talking about here? If it will be on the order of millions of rows, how about something like the following?
Create an index on keys followed_id, created_at, id (This might change depending upon the fields in select, where and order by clause. I have tailor-made this to your question)
SELECT relationships.*
FROM relationships
JOIN (SELECT id
FROM relationships
WHERE followed_id = 1
ORDER BY created_at
LIMIT 10 OFFSET 10) itable
ON relationships.id = itable.id
ORDER BY relationships.created_at
An explain would yield this:
+----+-------------+---------------+------+---------------+-------------+---------+------+------+-----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+------+---------------+-------------+---------+------+------+-----------------------------------------------------+
| 1 | PRIMARY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Impossible WHERE noticed after reading const tables |
| 2 | DERIVED | relationships | ref | sample_rel2 | sample_rel2 | 5 | | 1 | Using where; Using index |
+----+-------------+---------------+------+---------------+-------------+---------+------+------+-----------------------------------------------------+
If you examine carefully, the sub-query containing the order, limit and offset clauses will operate on the index directly instead of the table and finally join with the table to fetch the 10 records.
It makes a difference once your query makes a call like limit 10 offset 10000. A plain query has to read the first 10000 records from the table and discard them before it can return the next 10. This trick should restrict that traversal to just the index.
An important note: I tested this in MySQL. Other database might have subtle differences in behavior, but the concept holds good no matter what.
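The two query shapes can be compared on a toy table. This is a sketch with Python's built-in sqlite3 module, not MySQL, so it only shows that the subquery form returns the same page; it says nothing about MySQL's execution plan. The follower_id column, the index name and the generated data are assumptions, and the ordering is DESC to match the question's newest-first requirement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relationships ("
    "id INTEGER PRIMARY KEY, follower_id INTEGER, followed_id INTEGER, created_at INTEGER)"
)
# 300 illustrative rows; created_at is a unique counter so the order is unambiguous.
conn.executemany(
    "INSERT INTO relationships (follower_id, followed_id, created_at) VALUES (?, ?, ?)",
    [(i, i % 3, i) for i in range(1, 301)],
)
# Composite index matching the WHERE column first, then the ORDER BY column.
conn.execute(
    "CREATE INDEX idx_followed_created ON relationships (followed_id, created_at)"
)


def page_plain(conn, followed_id, limit, offset):
    """Straightforward pagination, as in the question."""
    return conn.execute(
        "SELECT * FROM relationships WHERE followed_id = ? "
        "ORDER BY created_at DESC LIMIT ? OFFSET ?",
        (followed_id, limit, offset),
    ).fetchall()


def page_deferred(conn, followed_id, limit, offset):
    """Pick the page's ids in a subquery first, then join back for the full rows."""
    return conn.execute(
        """
        SELECT r.*
        FROM relationships AS r
        JOIN (SELECT id
              FROM relationships
              WHERE followed_id = ?
              ORDER BY created_at DESC
              LIMIT ? OFFSET ?) AS page
          ON r.id = page.id
        ORDER BY r.created_at DESC
        """,
        (followed_id, limit, offset),
    ).fetchall()


print(page_plain(conn, 1, 4, 8))
print(page_deferred(conn, 1, 4, 8))
```

Whether the second form is actually faster depends on the database and the data, so it is worth checking with EXPLAIN on the real table.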
You can index these fields, but it depends:
You can (mostly) assume that created_at is already ordered, so an index on it might be unnecessary. That depends on your app, though.
In any case, you should index followed_id (unless it's the primary key).
