sqlite: wide vs. long format performance
I'm considering whether I should format a table in my sqlite database in "wide" or "long" format. Examples of both formats are included at the end of the question.
I anticipate that the majority of my queries will be of the form:

SELECT * FROM table
WHERE series IN ('series1', 'series100');
or the analogous selection of columns in wide format.
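In wide format, the analogous request would presumably be a column selection, along the lines of:

SELECT date, series1, series100 FROM table;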
I also anticipate a large number of columns in the wide layout, possibly enough to need to raise SQLite's column limit (SQLITE_MAX_COLUMN defaults to 2000).
Are there any general guidelines for selecting a table layout that will optimize query performance for this sort of case?
(Examples of each)
"Wide" format:
| date | series1 | series2 | ... | seriesN |
| ---------- | ------- | ------- | ---- | ------- |
| "1/1/1900" | 15 | 24 | 43 | 23 |
| "1/2/1900" | 15 | null | null | 23 |
| ... | 15 | null | null | 23 |
| "1/2/2019" | 12 | 12 | 4 | null |
"Long" format:
| date | series | value |
| ---------- | ------- | ----- |
| "1/1/1900" | series1 | 15 |
| "1/2/1900" | series1 | 15 |
| ... | series1 | 43 |
| "1/2/2019" | series1 | 12 |
| "1/1/1900" | series2 | 15 |
| "1/2/1900" | series2 | 15 |
| ... | series2 | 43 |
| "1/2/2019" | series2 | 12 |
| ... | ... | ... |
| "1/1/1900" | seriesN | 15 |
| "1/2/1900" | seriesN | 15 |
| ... | seriesN | 43 |
| "1/2/2019" | seriesN | 12 |
The "long" format is the preferred way to go here, for so many reasons. First, if you use the "wide" format and there is ever a need to add more series, then you would have to add new columns to the database table. While this is not too much of a hassle, in general once you put a schema into production, you want to avoid further schema changes.
Second, the "long" format makes reporting and querying much easier. For example, suppose you wanted to get a count of rows/data points for each series. Then you would only need something like:
SELECT series, COUNT(*) AS cnt
FROM yourTable
GROUP BY series;
Getting the same report from the "wide" format takes considerably more code, roughly as verbose as your sample data above, as sketched below.
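To illustrate, a sketch of the wide-format equivalent (one branch per series column; COUNT(col) counts only the non-NULL values in that column):

SELECT 'series1' AS series, COUNT(series1) AS cnt FROM yourTable
UNION ALL
SELECT 'series2', COUNT(series2) FROM yourTable
-- ... one UNION ALL branch for every series column ...
UNION ALL
SELECT 'seriesN', COUNT(seriesN) FROM yourTable;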
The thing to keep in mind here is that SQL databases are built to operate on sets of records (read: across rows). They can also process things column-wise, but they are not generally set up to do this.
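One practical consequence for the stated workload: with the long layout, the filter-by-series query benefits from an index whose leading column is series, so SQLite can seek directly to the requested series instead of scanning the whole table. A sketch, using the names from the question:

CREATE INDEX idx_series_date ON yourTable (series, date);

-- the anticipated query can then be answered from the index
SELECT date, series, value
FROM yourTable
WHERE series IN ('series1', 'series100');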
Related
Query Hive JSON SerDe table having nested ARRAY & STRUCT combination
I'm trying to query a JSON Hive table built on top of JSON data. Using json2Hive, I was able to generate the DDL and create the table after removing unnecessary fields:

create external table user_tables.sample_json_table (
  `apps` struct<
    `app`: array<struct<
      `id`: string,
      `queue`: string,
      `finalstatus`: string,
      `trackingurl`: string,
      `applicationtype`: string,
      `applicationtags`: string,
      `startedtime`: string,
      `launchtime`: string,
      `finishedtime`: string,
      `memoryseconds`: string,
      `vcoreseconds`: string,
      `resourcesecondsmap`: struct<
        `entry`: struct<
          `key`: string,
          `value`: string
        >
      >
    >>
  >
)
row format serde 'org.apache.hadoop.hive.serde2.JsonSerDe'
location '/xyz/location/';

Now I'm stuck trying to figure out how to query each field from the schema below. I've checked several articles, but all of them are case specific; I need a generic explanation or example of how to query each field under an array/struct. :) I only care about the multiple 'app' subsection entries and would like them imported into another table with a separate field for each field.

Sample JSON data:

{"apps":{"app":[{"id":"application_282828282828_12717","user":"xyz","name":"xyz-4b6bdae2-1a0c-4772-bd8e-0d7454268b82","queue":"root.users.dummy","state":"finished","finalstatus":"succeeded","progress":100.0,"trackingui":"history","trackingurl":"http://dang:8088/proxy/application_282828282828_12717/","diagnostics":"session stats:submitteddags=1, successfuldags=1, faileddags=0, killeddags=0\n","clusterid":282828282828,"applicationtype":"aquaman","applicationtags":"ABC,xyz_20221107070124_2beb5d90-24c7-4b1b-b977-3c9af1397195,userid=dummy","priority":0,"startedtime":1667822485626,"launchtime":1667822485767,"finishedtime":1667822553365,"elapsedtime":67739,"amcontainerlogs":"http://dingdong:8042/node/containerlogs/container_e65_282828282828_12717_01_000001/xyz","amhosthttpaddress":"dingdong:8042","amrpcaddress":"dingdong:46457","masternodeid":"dingdong:8041","allocatedmb":-1,"allocatedvcores":-1,"reservedmb":-1,"reservedvcores":-1,"runningcontainers":-1,"memoryseconds":1264304,"vcoreseconds":79,"queueusagepercentage":0.0,"clusterusagepercentage":0.0,"resourcesecondsmap":{"entry":{"key":"memory-mb","value":"1264304"},"entry":{"key":"vcores","value":"79"}},"preemptedresourcemb":0,"preemptedresourcevcores":0,"numnonamcontainerpreempted":0,"numamcontainerpreempted":0,"preemptedmemoryseconds":0,"preemptedvcoreseconds":0,"preemptedresourcesecondsmap":{},"logaggregationstatus":"succeeded","unmanagedapplication":false,"amnodelabelexpression":"","timeouts":{"timeout":[{"type":"lifetime","expirytime":"unlimited","remainingtimeinseconds":-1}]}},{"id":"application_282828282828_12724","user":"xyz","name":"xyz-94962a3e-d230-4fd0-b68b-01b59dd3299d","queue":"root.users.dummy","state":"finished","finalstatus":"succeeded","progress":100.0,"trackingui":"history","trackingurl":"http://dang:8088/proxy/application_282828282828_12724/","diagnostics":"session stats:submitteddags=1, successfuldags=1, faileddags=0, 
killeddags=0\n","clusterid":282828282828,"applicationtype":"aquaman","applicationtags":"ZZZ_,xyz_20221107070301_e6f788db-e39c-49b6-97d5-6a02ff994c00,userid=dummy","priority":0,"startedtime":1667822585231,"launchtime":1667822585437,"finishedtime":1667822631435,"elapsedtime":46204,"amcontainerlogs":"http://ding:8042/node/containerlogs/container_e65_282828282828_12724_01_000002/xyz","amhosthttpaddress":"ding:8042","amrpcaddress":"ding:46648","masternodeid":"ding:8041","allocatedmb":-1,"allocatedvcores":-1,"reservedmb":-1,"reservedvcores":-1,"runningcontainers":-1,"memoryseconds":5603339,"vcoreseconds":430,"queueusagepercentage":0.0,"clusterusagepercentage":0.0,"resourcesecondsmap":{"entry":{"key":"memory-mb","value":"5603339"},"entry":{"key":"vcores","value":"430"}},"preemptedresourcemb":0,"preemptedresourcevcores":0,"numnonamcontainerpreempted":0,"numamcontainerpreempted":0,"preemptedmemoryseconds":0,"preemptedvcoreseconds":0,"preemptedresourcesecondsmap":{},"logaggregationstatus":"time_out","unmanagedapplication":false,"amnodelabelexpression":"","timeouts":{"timeout":[{"type":"lifetime","expirytime":"unlimited","remainingtimeinseconds":-1}]}},{"id":"application_282828282828_12736","user":"xyz","name":"xyz-1a9c73ef-2992-40a5-aaad-9f0688bb04f4","queue":"root.users.dummy","state":"finished","finalstatus":"succeeded","progress":100.0,"trackingui":"history","trackingurl":"http://dang:8088/proxy/application_282828282828_12736/","diagnostics":"session stats:submitteddags=1, successfuldags=1, faileddags=0, killeddags=0\n","clusterid":282828282828,"applicationtype":"aquaman","applicationtags":"BLAHBLAH,xyz_20221107070609_8d261352-3efa-46c5-a5a0-8a3cd745d180,userid=dummy","priority":0,"startedtime":1667822771170,"launchtime":1667822773663,"finishedtime":1667822820351,"elapsedtime":49181,"amcontainerlogs":"http://dong:8042/node/containerlogs/container_e65_282828282828_12736_01_000001/xyz","amhosthttpaddress":"dong:8042","amrpcaddress":"dong:34266","masternodeid":"dong:8041","allocatedmb":-1,"allocatedvcores":-1,"reservedmb":-1,"reservedvcores":-1,"runningcontainers":-1,"memoryseconds":1300011,"vcoreseconds":89,"queueusagepercentage":0.0,"clusterusagepercentage":0.0,"resourcesecondsmap":{"entry":{"key":"memory-mb","value":"1300011"},"entry":{"key":"vcores","value":"89"}},"preemptedresourcemb":0,"preemptedresourcevcores":0,"numnonamcontainerpreempted":0,"numamcontainerpreempted":0,"preemptedmemoryseconds":0,"preemptedvcoreseconds":0,"preemptedresourcesecondsmap":{},"logaggregationstatus":"succeeded","unmanagedapplication":false,"amnodelabelexpression":"","timeouts":{"timeout":[{"type":"lifetime","expirytime":"unlimited","remainingtimeinseconds":-1}]}},{"id":"application_282828282828_12735","user":"xyz","name":"xyz-d5f56a0a-9c6b-4651-8f88-6eaff5953777","queue":"root.users.dummy","state":"finished","finalstatus":"succeeded","progress":100.0,"trackingui":"history","trackingurl":"http://dang:8088/proxy/application_282828282828_12735/","diagnostics":"session stats:submitteddags=1, successfuldags=1, faileddags=0, 
killeddags=0\n","clusterid":282828282828,"applicationtype":"aquaman","applicationtags":"HAHAHA_,xyz_20221107070605_a082d9d8-912f-4278-a2ef-5dfe66089fd7,userid=dummy","priority":0,"startedtime":1667822766897,"launchtime":1667822766999,"finishedtime":1667822796759,"elapsedtime":29862,"amcontainerlogs":"http://dung:8042/node/containerlogs/container_e65_282828282828_12735_01_000001/xyz","amhosthttpaddress":"dung:8042","amrpcaddress":"dung:42765","masternodeid":"dung:8041","allocatedmb":-1,"allocatedvcores":-1,"reservedmb":-1,"reservedvcores":-1,"runningcontainers":-1,"memoryseconds":669695,"vcoreseconds":44,"queueusagepercentage":0.0,"clusterusagepercentage":0.0,"resourcesecondsmap":{"entry":{"key":"memory-mb","value":"669695"},"entry":{"key":"vcores","value":"44"}},"preemptedresourcemb":0,"preemptedresourcevcores":0,"numnonamcontainerpreempted":0,"numamcontainerpreempted":0,"preemptedmemoryseconds":0,"preemptedvcoreseconds":0,"preemptedresourcesecondsmap":{},"logaggregationstatus":"succeeded","unmanagedapplication":false,"amnodelabelexpression":"","timeouts":{"timeout":[{"type":"lifetime","expirytime":"unlimited","remainingtimeinseconds":-1}]}}]}} sample query output : id | queue | finalStatus | trackingurl |.... ----------------------------------------------------------- application_282828282828_12717 | root.users.dummy | succeeded | ... application_282828282828_12724 | root.users.dummy2 | failed | ....
For anyone looking to do something similar, I found this article very helpful, with a clear explanation: https://community.cloudera.com/t5/Support-Questions/Complex-Json-transformation-using-Hive-functions/m-p/236476

Below is the query that parses the array using LATERAL VIEW with the inline UDTF, in case anyone is in the same boat:

select ex1.*
from user_tables.sample_json_table cym
LATERAL VIEW OUTER inline(cym.apps.app) ex1;

| id | queue | finalstatus | trackingurl | applicationtype | applicationtags | startedtime | launchtime | finishedtime | memoryseconds | vcoreseconds | resourcesecondsmap |
| -- | ----- | ----------- | ----------- | --------------- | --------------- | ----------- | ---------- | ------------ | ------------- | ------------ | ------------------ |
| application_1667627410794_12717 | root.users.dummy2 | succeeded | http://dang:8088/proxy/application_1667627410794_12717/ | tez | \_xyz,test-app-24c7-4b1b-b977-3c9af1397195,userid=dummy1 | 1667822485626 | 1667822485767 | 1667822553365 | 1264304 | 79 | {"entry":{"key":"vcores","value":"79"}} |
| application_1667627410794_12724 | root.users.dummy3 | succeeded | http://dang:8088/proxy/application_1667627410794_12724/ | tez | \_generate_stuff,hive_20221107070301_e6f788db-e39c-49b6-97d5-6a02ff994c00,userid=dummy3 | 1667822585231 | 1667822585437 | 1667822631435 | 5603339 | 430 | {"entry":{"key":"vcores","value":"430"}} |
| application_1667627410794_12736 | root.users.dummy1 | succeeded | http://dang:8088/proxy/application_1667627410794_12736/ | tez | \_sample_job,test-zzz-3efa-46c5-a5a0-8a3cd745d180,userid=dummy1 | 1667822771170 | 1667822773663 | 1667822820351 | 1300011 | 89 | {"entry":{"key":"vcores","value":"89"}} |
| application_1667627410794_12735 | root.users.dummy2 | succeeded | http://dang:8088/proxy/application_1667627410794_12735/ | tez | \_mixed_article,placebo_2-912f-4278-a2ef-5dfe66089fd7,userid=dummy2 | 1667822766897 | 1667822766999 | 1667822796759 | 669695 | 44 | {"entry":{"key":"vcores","value":"44"}} |

Additional note: although my requirement no longer needs it, if anyone can suggest how to further parse the last field, resourcesecondsmap, into its key/value pairs, that would be great to know! Basically, use the map key as the field name and the map value as the actual value in that field.

Desired output:

| id | queue | finalstatus | trackingurl | applicationtype | applicationtags | startedtime | launchtime | finishedtime | memoryseconds | vcoreseconds | vcores-value |
| -- | ----- | ----------- | ----------- | --------------- | --------------- | ----------- | ---------- | ------------ | ------------- | ------------ | ------------ |
| application_1667627410794_12717 | root.users.dummy2 | succeeded | http://dang:8088/proxy/application_1667627410794_12717/ | tez | \_xyz,test-app-24c7-4b1b-b977-3c9af1397195,userid=dummy1 | 1667822485626 | 1667822485767 | 1667822553365 | 1264304 | 79 | 79 |
| application_1667627410794_12724 | root.users.dummy3 | succeeded | http://dang:8088/proxy/application_1667627410794_12724/ | tez | \_generate_stuff,hive_20221107070301_e6f788db-e39c-49b6-97d5-6a02ff994c00,userid=dummy3 | 1667822585231 | 1667822585437 | 1667822631435 | 5603339 | 430 | 430 |
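Regarding the closing note, one possible direction (an untested sketch; it assumes Hive's dot notation can dereference the struct exactly as declared in the DDL above, with backticks guarding `key` and `value`):

select ex1.id,
       ex1.resourcesecondsmap.entry.`key`   as rs_key,
       ex1.resourcesecondsmap.entry.`value` as rs_value
from user_tables.sample_json_table cym
LATERAL VIEW OUTER inline(cym.apps.app) ex1;

Note that the DDL declares resourcesecondsmap as a struct with a single entry field, so only one key/value pair survives per record; capturing both the memory-mb and vcores entries would presumably require declaring entry as an array<struct<...>> instead.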
How to pivot data in Hive?
First, I've checked other topics on the subject, like How to transpose/pivot data in hive?, but they don't match what I want. This is the table I have:

| ID | Day | Status |
| -- | --- | ------ |
| 1  | 1   | A      |
| 2  | 10  | B      |
| 3  | 101 | A      |
| 3  | 322 | B      |
| 3  | 102 | C      |
| 3  | 354 | D      |

I'd like to concatenate the different Status values for each ID, ordered by Day, to get this:

| ID | Status  |
| -- | ------- |
| 1  | A       |
| 2  | B       |
| 3  | A,C,B,D |

The thing is that I don't know how many statuses there can be, so I can't create a fixed number of columns, and I don't know how to adapt the group_map-style answers from the other topics to my problem. Thanks for helping me ^^
Use collect_set (for distinct values) or collect_list to aggregate the statuses into an array, then concatenate it using concat_ws:

select ID, concat_ws(',', collect_list(Status)) as Status
from table
group by ID;
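One caveat: the question asks for the statuses ordered by Day, and collect_list does not guarantee any ordering. A hedged, untested workaround sketch: zero-pad Day so that a lexicographic sort_array matches numeric order, sort, then strip the prefixes:

select ID,
       regexp_replace(
         -- '0000000101:A' style entries sort by Day, then the numeric prefix is removed
         concat_ws(',', sort_array(collect_list(concat(lpad(Day, 10, '0'), ':', Status)))),
         '[0-9]+:', ''
       ) as Status
from table
group by ID;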
Converting Column headings into Row data
I have a table in an Access database with columns that I would like to convert into row data. I found code here: converting database columns into associated row data. I am new to VBA and I just don't know how to use this code. I have attached some sample data. The table is currently set up with 14 columns, like this:

+-----+--------+------------+------------+------------+-------------+
| ID  | Name   | 2019-10-31 | 2019-11-30 | 2019-12-31 | ... etc ... |
+-----+--------+------------+------------+------------+-------------+
| 555 | Fred   | 1          | 4          | 12         |             |
| 556 | Barney | 5          | 33         | 24         |             |
| 557 | Betty  | 4          | 11         | 76         |             |
+-----+--------+------------+------------+------------+-------------+

I would like the output to be:

+-----+------------+------+
| ID  | Date       | HOLB |
+-----+------------+------+
| 555 | 2019-10-31 | 1    |
| 555 | 2019-11-30 | 4    |
| 555 | 2019-12-31 | 12   |
| 556 | 2019-10-31 | 5    |
| 556 | 2019-11-30 | 33   |
| 556 | 2019-12-31 | 24   |
+-----+------------+------+

How can I turn this code into a module and call the module from a query? Or any other idea you may have.
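One common no-VBA approach in Access is a union query that unpivots each date column by hand; once saved, it can be used like a table. A sketch (tblHOLB is a hypothetical table name, and with 14 columns there would be one SELECT branch per date column):

SELECT ID, #2019-10-31# AS [Date], [2019-10-31] AS HOLB FROM tblHOLB
UNION ALL
SELECT ID, #2019-11-30#, [2019-11-30] FROM tblHOLB
UNION ALL
SELECT ID, #2019-12-31#, [2019-12-31] FROM tblHOLB;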
DAX query with multiple filters in Power BI
I have two tables, 'locations' and 'markets', with a many-to-many relationship between them on the column 'market_id'. A report-level filter has been applied on the column 'entity' from the 'locations' table. Now I need to distinctly count 'location_id' from the 'markets' table where 'active = TRUE'. How can I write a DAX measure such that the distinct count of location_id changes dynamically with the selection made in the report-level filter? Below is an example of the tables:

locations:

| location_id | market_id | entity | active |
|-------------|-----------|--------|--------|
| 1           | 10        | nyc    | true   |
| 2           | 20        | alaska | true   |
| 2           | 20        | alaska | true   |
| 2           | 30        | miami  | false  |
| 3           | 40        | dallas | true   |

markets:

| location_id | market_id | active |
|-------------|-----------|--------|
| 2           | 20        | true   |
| 2           | 20        | true   |
| 5           | 20        | true   |
| 6           | 20        | false  |

I'm fairly new to Power BI. Any help will be appreciated.
Here you go:

DistinctLocations = CALCULATE(DISTINCTCOUNT(markets[location_id]), markets[active] = TRUE())
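If the report-level filter on locations[entity] does not propagate to the markets table through the many-to-many relationship, one variation to try (a hedged sketch, not verified against this model) is to push the selected market_ids across explicitly with TREATAS:

DistinctLocations =
CALCULATE (
    DISTINCTCOUNT ( markets[location_id] ),
    markets[active] = TRUE (),
    TREATAS ( VALUES ( locations[market_id] ), markets[market_id] )
)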
BIRT crosstab with empty columns
I'm a BIRT beginner, and I just tried to build a really simple report from one of the tables in a Postgres DB. I defined a flat table as the data source, which looks like:

+----------------+--------+----------+-------+--------+
| date           | store  | product  | value | color  |
+----------------+--------+----------+-------+--------+
| 20160101000000 | store1 | productA | 5231  | red    |
| 20160101000000 | store1 | productB | 3213  | green  |
| 20160101000000 | store2 | productX | 4231  | red    |
| 20160101000000 | store3 | productY | 3213  | green  |
| 20160101000000 | store4 | productZ | 1223  | green  |
| 20160101000000 | store4 | productK | 3113  | yellow |
| 20160101000000 | store4 | productE | 213   | green  |
| ....           |        |          |       |        |
| 20160109000000 | store1 | productA | 512   | green  |
+----------------+--------+----------+-------+--------+

I would like to add a table/crosstab to my BIRT report that creates a table (followed by a page break) for EVERY store, like this:

**Store 1**

+----------------+----------+----------+----------+-----+
|                | productA | productB | productC | ... |
+----------------+----------+----------+----------+-----+
| 20160101000000 | 3120     | 1231     | 6433     | ... |
| 20160102000000 | 6120     | 1341     | 2121     | ... |
| 20160103000000 | 1120     | 5331     | 1231     | ... |
+----------------+----------+----------+----------+-----+

--- PAGE BREAK ---

....

What I tried first was to get the standard crosstab tutorial template of BIRT working. I defined the data source and created a data cube with dimension groups for 'store' and 'product', with 'value' as the SUM/detail data, and for this example I just selected ONE day. But the result looks like this:

+--------+----------+----------+----------+----------+-----+----------+
|        | productA | productC | productD | productE | ... | productZ |
+--------+----------+----------+----------+----------+-----+----------+
| Store1 | 213      |          | 3234     | 897      | ... | 6767     |
| Store2 | 513      | 2213     | 1233     |          | ... | 845      |
| Store3 | 21       |          |          | 32       | ... |          |
| Store4 | 123      | 222      | 142      |          | ... |          |
+--------+----------+----------+----------+----------+-----+----------+

This happens because not every product is sold in every store, and the crosstab builds its columns from ALL available products. So I have no idea how to dynamically generate separate tables, each with its own dynamic set of columns. The second step would then be to get the dates (days) working. Thanks in advance for every hint or tutorial link for question one ;-)
You can just add a table bound to the complete data source. Select the table and add a group, grouping by store. You can set the page-break options for each grouping: set the "after" property to "always excluding last". BIRT will add a group header, and you can add multiple group-header rows to get the layout you're after.

For crosstabs it works in a similar way. After you have added the crosstab to your page, set up the groups on rows and columns, and added summaries, you can view the data. Select the crosstab, open the Row Area properties, select the page-group settings, and add a new page break. You can choose which group to break on: pick your store group and set after to "always excluding last".