I have a long list of rows with a date in the first column and a text field after it:
01/01/2019 | ABC | ...
The list is ordered by date, and may have between 1 and 4 rows per date
01/01/2019 | ABC | ...
01/01/2019 | DEF | ...
05/01/2019 | ABC | ...
05/01/2019 | DEF | ...
05/01/2019 | ABC | ...
05/01/2019 | GHI | ...
10/01/2019 | ABC | ...
10/01/2019 | XYZ | ...
I can happily run a QUERY() which groups by the date and COUNT()s the number of rows matching that date
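For reference, one possible form of that query (a sketch; it assumes the dates are in column A, the text field in column B and the data starts in row 2):
=QUERY(A2:B, "select A, count(B) where A is not null group by A label count(B) ''", 0)
which returns: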
01/01/2019 | 2
05/01/2019 | 4
10/01/2019 | 2
I'm trying to build something in standard Google Sheets functions that will group the items by date and then return only the Nth row of each group; returning just the EVEN or ODD rows of each group would also work.
Importantly, I don't want EVEN/ODD based on the actual spreadsheet ROW(); I need EVEN/ODD/Nth based on the row's position within its date group.
So I would like this output:
EVENS
01/01/2019 | DEF | (row 2 in group)
05/01/2019 | DEF | (row 2 in group)
05/01/2019 | GHI | (row 4 in group)
10/01/2019 | XYZ | (row 2 in group)
ODDS
01/01/2019 | ABC | (row 1 in group)
05/01/2019 | ABC | (row 1 in group)
05/01/2019 | ABC | (row 3 in group)
10/01/2019 | ABC | (row 1 in group)
Ultimately, my aim is to count all the occurrences of the text field (ABC/DEF/GHI/etc.) that happen as the FIRST, SECOND, THIRD or FOURTH event on any particular day, then sort descending, but only include an occurrence if it fell on (for example) an EVEN or ODD row of its group (e.g. row 2 of the group, ignoring the fact that in the whole spreadsheet it happens to be on row 35):
ABC | 156
DEF | 30
GHI | 10
JKL | 8
MNO | 7
XYZ | 1
You could do it with one formula if you wanted to. ROW(A2:A)-MATCH(A2:A,A2:A,0) gives each row's 1-based position within its date group, so for the even rows of each group:
=filter(A2:B,ISEVEN(row(A2:A)-match(A2:A,A2:A,0)))
and for the odd rows:
=filter(A2:B,ISODD(row(A2:A)-match(A2:A,A2:A,0)))
assuming the data starts in row 2.
If the data started in a different row, you could do a lookup on the row instead (here ROW minus the looked-up first row of the group gives a 0-based position, hence the swapped ISODD/ISEVEN). For the even rows of each group:
=filter(A2:B,ISODD(row(A2:A)-vlookup(A2:A,{A2:A,row(A2:A)},2,false)))
and for the odd rows:
=filter(A2:B,ISEVEN(row(A2:A)-vlookup(A2:A,{A2:A,row(A2:A)},2,false)))
You can add a helper column (say in C) that numbers each row within its date group, like:
=ARRAYFORMULA(IF(LEN(A1:A), COUNTIFS(A1:A, A1:A, ROW(A1:A), "<="&ROW(A1:A)), ))
and then filter for even and odd like:
=FILTER(A1:B, ISEVEN(C1:C))
=FILTER(A1:B, ISODD(C1:C))
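If you then need the final counts from the question (how often each text value falls on, say, an even row of its date group, sorted descending), one option is to query the filtered result. This is a sketch assuming the text field is in column B, the helper column above is in C, and the data starts in row 2:
=QUERY(FILTER(B2:C, LEN(C2:C), ISEVEN(C2:C)), "select Col1, count(Col1) group by Col1 order by count(Col1) desc label count(Col1) ''", 0)
Swap ISEVEN for ISODD (or compare C2:C to a specific N) for the other variants.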
I'm trying to get some precise row counts for all tables, given that some have deleted rows. I have been using sys.storage.count. But this seems to count the deleted ones also.
I assume using sys.storage would be simpler and faster than looping through count(*) queries, though both strategies may be fine in practice.
Maybe there is some column that counts modifications so I could just subtract the two counts?
If all you need to know is the number of actual rows in a table, I'd recommend just using a count(*) query. It's very fast. Even if you have N tables, it's easy to do a count(*) for each table.
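If you do want one count per table without typing each statement by hand, you could generate them from the catalog. A sketch, assuming the usual sys.tables and sys.schemas catalog views:
-- generates a COUNT(*) statement for every non-system table
SELECT 'SELECT ''' || s.name || '.' || t.name || ''' AS table_name, COUNT(*) AS row_count FROM "' || s.name || '"."' || t.name || '";'
FROM sys.tables AS t
JOIN sys.schemas AS s ON t.schema_id = s.id
WHERE NOT t.system;
You would then run the generated statements (e.g. by pasting them back into mclient).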
sys.storage gives you information from the raw storage. With that, you can get pretty low-level information, but it has some rough edges. sys.storage.count returns the count in the storage, so it indeed includes the deleted rows, since they are not actually removed from storage. As of the Jul2021 version of MonetDB, deleted rows are automatically overwritten by new inserts (i.e. auto-vacuuming). So, to get the actual row count, you need to look up the 'deletes' from sys.deltas('<schema>', '<table>'). For instance:
sql>create table tbl (id int, city string);
operation successful
sql>insert into tbl values (1, 'London'), (2, 'Paris'), (3, 'Barcelona');
3 affected rows
sql>select * from tbl;
+------+-----------+
| id | city |
+======+===========+
| 1 | London |
| 2 | Paris |
| 3 | Barcelona |
+------+-----------+
3 tuples
sql>select schema, table, column, count from sys.storage where table='tbl';
+--------+-------+--------+-------+
| schema | table | column | count |
+========+=======+========+=======+
| sys | tbl | city | 3 |
| sys | tbl | id | 3 |
+--------+-------+--------+-------+
2 tuples
sql>select id, deletes from sys.deltas ('sys', 'tbl');
+-------+---------+
| id | deletes |
+=======+=========+
| 15569 | 0 |
| 15570 | 0 |
+-------+---------+
2 tuples
After we delete one row, the actual row count is sys.storage.count - sys.deltas ('sys', 'tbl').deletes:
sql>delete from tbl where id = 2;
1 affected row
sql>select * from tbl;
+------+-----------+
| id | city |
+======+===========+
| 1 | London |
| 3 | Barcelona |
+------+-----------+
2 tuples
sql>select schema, table, column, count from sys.storage where table='tbl';
+--------+-------+--------+-------+
| schema | table | column | count |
+========+=======+========+=======+
| sys | tbl | city | 3 |
| sys | tbl | id | 3 |
+--------+-------+--------+-------+
2 tuples
sql>select id, deletes from sys.deltas ('sys', 'tbl');
+-------+---------+
| id | deletes |
+=======+=========+
| 15569 | 1 |
| 15570 | 1 |
+-------+---------+
2 tuples
After we insert a new row, the deleted row is overwritten:
sql>insert into tbl values (4, 'Praag');
1 affected row
sql>select * from tbl;
+------+-----------+
| id | city |
+======+===========+
| 1 | London |
| 4 | Praag |
| 3 | Barcelona |
+------+-----------+
3 tuples
sql>select schema, table, column, count from sys.storage where table='tbl';
+--------+-------+--------+-------+
| schema | table | column | count |
+========+=======+========+=======+
| sys | tbl | city | 3 |
| sys | tbl | id | 3 |
+--------+-------+--------+-------+
2 tuples
sql>select id, deletes from sys.deltas ('sys', 'tbl');
+-------+---------+
| id | deletes |
+=======+=========+
| 15569 | 0 |
| 15570 | 0 |
+-------+---------+
2 tuples
So, the formula to compute the actual row count (sys.storage.count - sys.deltas('sys', 'tbl').deletes) is generally applicable. sys.deltas() keeps stats for every column of a table, but the count and deletes are table-wide, so you only need to check one column.
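Putting the two together, a sketch of a single query for the example table above (picking the id column arbitrarily, since deletes is table-wide, and double-quoting the reserved words):
SELECT (SELECT "count"
        FROM sys.storage
        WHERE "schema" = 'sys' AND "table" = 'tbl' AND "column" = 'id')
     - (SELECT MAX(deletes)
        FROM sys.deltas('sys', 'tbl')) AS actual_row_count;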
I have a table, and I have run an SCD Type 2 job that loaded the data below:
no | name | loc
---+------+---------
1  | abc  | hyd
2  | def  | bang
3  | ghi  | chennai
Then I ran the job a second time, which loaded the data given below:
no | name | loc
---+------+---------
1  | abc  | hyd
2  | def  | bang
3  | ghi  | chennai
1  | abc  | bang
There are no dates, flags, or run IDs here.
How do I find the second, updated record in this situation?
Thanks
I don't think you'll be able to distinguish between the updated record and the original record.
A dimension table using Type 2 SCD requires additional columns that describe the period in which each record is valid (or current), exactly for this reason.
The solution is to ensure your dimension table has these columns (Typically ValidFrom and ValidTo dates or date/times, and sometimes an IsCurrent flag for good measure). Your ETL process would then populate these columns as part of making the Type 2 updates.
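As an illustration only (hypothetical table and column names, not your actual schema), the dimension and the second-run update might look something like this:
-- Type 2 dimension with validity columns
CREATE TABLE dim_employee (
    no        INT,
    name      VARCHAR(50),
    loc       VARCHAR(50),
    ValidFrom DATE,
    ValidTo   DATE,       -- e.g. '9999-12-31' while the row is current
    IsCurrent CHAR(1)     -- 'Y' / 'N'
);

-- When the second run sees no = 1 move from 'hyd' to 'bang':
-- 1) close off the old version
UPDATE dim_employee
SET ValidTo = CURRENT_DATE, IsCurrent = 'N'
WHERE no = 1 AND IsCurrent = 'Y';

-- 2) insert the new version
INSERT INTO dim_employee (no, name, loc, ValidFrom, ValidTo, IsCurrent)
VALUES (1, 'abc', 'bang', CURRENT_DATE, DATE '9999-12-31', 'Y');
With those columns in place, the updated record is simply the current row for no = 1 (IsCurrent = 'Y'), and the original version is the closed-off row.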
I have a table called "data rules". It's given below with some explanation.
Data|GroupNum|GroupType|GroupMinOcc|GroupMaxOcc|DataStatus|DataMinOccWithinGroup|DataMaxOccurenceWithinGroup|IDX
ABC |GroupA |Mandatory| 1 | 1 | Mandatory| 1 | 1 |1
DEF |GroupB |Mandatory| 1 | 1 |Mandatory | 1 | 1 |2
GHI |GroupC |Mandatory| 1 | 1 |Mandatory | 1 | 1 |3
JKL |GroupD |Optional | 0 | 1 |Optional | 0 | 1 |4
FFF |Group1 |Optional | 0 | 1 |Mandatory | 1 | 1 |5
RRR |Group1 |Optional | 0 | 1 |Optional | 0 | 2 |6
MMM |Group2 |Optional | 0 | 2 |Mandatory | 1 | 1 |7
PPP |Group2 |Optional | 0 | 2 |Optional | 0 | 1 |8
CCC |Group3 |Optional | 0 | 2 |Optional | 0 | 2 |9
SSS |Group4 |Mandatory| 1 | 2 |Mandatory | 1 | 1 |10
TTT |Group4 |Mandatory| 1 | 2 |Mandatory | 0 | 2 |11
Let me explain these data rules first.
1) A group can have multiple data records.
As you can see, GroupA has only the ABC data, while Group1 has the FFF and RRR data.
2) A group can be mandatory or optional. A mandatory group will definitely appear, and its data records in turn each have a mandatory or optional status.
For example, check Group4:
This group is mandatory, and its first data, SSS, is also mandatory, so whenever the group occurs this data must also occur. The second data in this group, TTT, is optional within the group: even though the group itself is mandatory, TTT can occur from 0 to 2 times.
Let's say this group occurs two times. It would look like this:
Group4 Example: Valid
SSS
TTT
TTT
SSS
TTT
Invalid Group4 occurrence
SSS
SSS
TTT
TTT
TTT
It's invalid because in the second occurrence of the group, TTT occurs 3 times, but it must not occur more than 2 times.
3) If a group is optional, it may or may not appear.
As you can see, GroupD, Group2 and Group3 are optional, so Group4 data can also come directly after GroupC in the input data, like this:
ABC
DEF
GHI
SSS
TTT
I want to capture the exact IDX number from the data rules table (from the respective group's rows only) whenever the input data doesn't follow the rules defined in the data rules table.
For example, 1st input data example:
ABC
DEF
GHI
JKL
JKL
SSS
As you can see, JKL is optional data in an optional group, but if this optional group occurs, JKL should come only one time. Here it came twice, so I want to return its IDX number, 4.
2nd data example:
ABC
DEF
GHI
TTT
Here it should return IDX number 10, because the mandatory data SSS of the mandatory Group4 is missing, and in the data rules its IDX is 10.
3rd example:
ABC
DEF
GHI
SSS
SSS
TTT
In this case the IDX returned for SSS should also be 10, because it occurred twice. As you can see in the data rules, the whole of Group4 can repeat only once, and whenever it occurs SSS should come only one time, so that's an error.
Many errors can occur at the same time as well, so the IDX numbers need to be returned for the respective group data only, from the data rules table.
The input data will contain only one column, holding just the data records.
Note: group data will appear only in the sequence given in the data rules, from top to bottom, and each group may or may not appear according to its definition in the data rules table.
Any suggestions on how I can achieve this?
In Stata, if I have a list of groups:
XYZ
ABC
ABC
BCH
JSA
BCH
XYZ
How do I get each group to have a unique ID in a second column after sorting, for example:
ABC 1
BCH 2
JSA 3
XYZ 4
You need sort, then group(), which is part of egen.
sysuse auto,clear
sort make
egen make_gp = group(make)
This yields:
. list make make_gp in 1/5
+-------------------------+
| make make_gp |
|-------------------------|
1. | AMC Concord 1 |
2. | AMC Pacer 2 |
3. | AMC Spirit 3 |
4. | Buick Century 7 |
5. | Buick Electra 8 |
+-------------------------+
This is a bit hard to explain in words ... I'm trying to calculate a sum of grouped distinct values in a matrix. Let's say I have the following data returned by a SQL query:
------------------------------------------------
| Group | ParentID | ChildID | ParentProdCount |
| A | 1 | 1 | 2 |
| A | 1 | 2 | 2 |
| A | 1 | 3 | 2 |
| A | 1 | 4 | 2 |
| A | 2 | 5 | 3 |
| A | 2 | 6 | 3 |
| A | 2 | 7 | 3 |
| A | 2 | 8 | 3 |
| B | 3 | 9 | 1 |
| B | 3 | 10 | 1 |
| B | 3 | 11 | 1 |
------------------------------------------------
There's some other data in the query, but it's irrelevant. ParentProdCount is specific to the ParentID.
Now, I have a matrix in the MS Report Designer in which I'm trying to calculate a sum for ParentProdCount (grouped by "Group"). If I just add the expression
=Sum(Fields!ParentProdCount.Value)
I get a result 20 for Group A and 3 for Group B, which is incorrect. The correct values should be 5 for group A and 1 for group B. This wouldn't happen if there wasn't ChildID involved, but I have to use some other child-specific data in the same matrix.
I tried to nest FIRST() and SUM() aggregate functions but apparently it's not possible to have nested aggregation functions, even when they have scopes defined.
I'm pretty sure there is some way to calculate the grouped distinct sum without needing to create another SQL query. Anyone got an idea how to do that?
OK, I got this sorted out by adding a ROW_NUMBER() function to my SQL query:
SELECT [Group], ParentID, ROW_NUMBER() OVER (PARTITION BY ParentID ORDER BY ChildID ASC) AS Position, ChildID, ParentProdCount FROM [Table]
and then I replaced the SSRS SUM expression with
=SUM(IIF(Fields!Position.Value = 1, Fields!ParentProdCount.Value, 0))
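For reference, the same per-parent 'distinct' sum can also be produced directly in SQL by collapsing to one row per parent before summing (a sketch reusing the placeholder names above):
SELECT t.[Group], SUM(t.ParentProdCount) AS GroupProdCount
FROM (
    SELECT DISTINCT [Group], ParentID, ParentProdCount
    FROM [Table]
) AS t
GROUP BY t.[Group];
With the sample data this gives 5 for group A and 1 for group B, matching the expected values.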
Put a grouping on ParentID and use a summation scoped to that group.
E.g., if the group on ParentID is named "ParentIDGroup", then the column sum of ParentProdCount is:
=SUM(Fields!ParentProdCount.Value, "ParentIDGroup")