I am trying to extract a table to a flat file using Python and the ssas_api package, which lets me run DAX queries from Python code.
The table is fairly big, and because of that a simple EVALUATE tablename query times out after one hour.
I want to split the query into smaller ones, iterating over the table in chunks of, say, 20k rows.
I could fetch the first chunk using TOPN, but what about the next ones?
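The kind of chunked loop I have in mind looks roughly like the sketch below, assuming the table has a numeric key column such as 'tablename'[Id] to filter on, and using set_conn_string/get_DAX the way the ssas_api README shows, but I am not sure it is the right approach:

import pandas as pd
import ssas_api

# Connection details are placeholders
conn = ssas_api.set_conn_string(
    server="my_server", db_name="my_model", username="", password=""
)

chunk_size = 20000
last_id = -1        # assumes 'tablename'[Id] is numeric and starts above -1
chunks = []

while True:
    dax = f"""
    EVALUATE
    TOPN(
        {chunk_size},
        FILTER('tablename', 'tablename'[Id] > {last_id}),
        'tablename'[Id], 1
    )
    """
    df = ssas_api.get_DAX(connection_string=conn, dax_string=dax)
    if df.empty:
        break
    chunks.append(df)
    # result columns typically come back named like "tablename[Id]"
    last_id = df["tablename[Id]"].max()
    if len(df) < chunk_size:
        break

pd.concat(chunks).to_csv("tablename.csv", index=False)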
I need to tune a job that looks like the one below.
import pyspark.sql.functions as F
dimensions = ["d1", "d2", "d3"]
measures = ["m1", "m2", "m3"]
expressions = [F.sum(m).alias(m) for m in measures]
# Aggregation
aggregate = (spark.table("input_table")
             .groupBy(*dimensions)
             .agg(*expressions))
# Write out summary table
aggregate.write.format("delta").mode("overwrite").save("output_table")
The input table contains transactions, partitioned by date, 8 files per date.
It has 108 columns and roughly half a billion records. The aggregated result has 37 columns and ~20 million records.
Whatever I do, I am unable to make any improvement in the runtime, so I would like to understand what affects the performance of this aggregation, i.e. what I can potentially change.
The only thing that seems to help is manually partitioning the work, i.e. starting multiple concurrent copies of the same code, each covering a different date range.
To the best of my understanding, the groupBy clause currently doesn't include the 'date' column, so you are aggregating across all dates in the query and not taking advantage of the input table's partitioning at all.
You can add the "date" column to the groupBy (and to partitionBy when writing the output), and then you will sum up the measures for each date.
Also, when the input_table is built, you can additionally partition it by d1, d2, d3 (or at least by those of them that don't have high cardinality), if possible.
Finally, the input_table will benefit from a columnar file format such as Parquet, so you won't have to do I/O on all 108 columns the way you would with something like CSV. You are probably already on Parquet or Delta, but just in case.
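Putting these suggestions together, a sketch of what the job could look like (assuming the partition column is literally named "date" and that spark is an existing SparkSession; note that grouping by "date" changes the grain of the summary table to per-date totals):

import pyspark.sql.functions as F

dimensions = ["d1", "d2", "d3"]
measures = ["m1", "m2", "m3"]
expressions = [F.sum(m).alias(m) for m in measures]

aggregate = (
    spark.table("input_table")
    # read only the columns the aggregation needs, not all 108
    .select("date", *dimensions, *measures)
    # include the partition column so each date is aggregated independently
    .groupBy("date", *dimensions)
    .agg(*expressions)
)

# partition the output by date as well, so each date's summary is written separately
(aggregate.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("date")
    .save("output_table"))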
I'm trying to figure out how to write a parquet file where the columns do not contain the same number of rows per Row Group. For example, my first column might be a value sampled at 10Hz, while my second column may be a value sampled at only 5Hz. I'd rather not repeat values in the slower column since this can lead to computational errors. However, I cannot write columns of two different sizes to the same Row Group, so how can I accomplish this?
I'm attempting to do this with ParquetSharp.
It is not possible for the columns in a Parquet file to have different row counts.
It is not explicit in the documentation, but if you look at https://parquet.apache.org/documentation/latest/#metadata, you will see that a RowGroup has a num_rows, and several ColumnChunks that do not themselves have individual row counts.
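The same constraint exists in every Parquet writer, not just ParquetSharp. As a quick illustration (in Python with pyarrow, to match the other examples here), trying to build a table from a 10 Hz and a 5 Hz series of different lengths fails before anything is written:

import pyarrow as pa

# one second of data: 10 samples at 10 Hz, 5 samples at 5 Hz
fast_10hz = pa.array([0.1 * i for i in range(10)])
slow_5hz = pa.array([0.2 * i for i in range(5)])

try:
    table = pa.table({"fast_10hz": fast_10hz, "slow_5hz": slow_5hz})
except (pa.ArrowInvalid, ValueError) as err:
    # pyarrow refuses to build a table with unequal column lengths,
    # just as the format requires
    print(err)

A common way around it is to write each sample rate to its own file (or one file per signal group), or to pad the slower column with nulls rather than repeated values.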
I would like to know if it is possible to achieve the steps below in PL/SQL.
Please note that I use the word "partition" when I mean "put rows that meet a certain condition together", because (a) I would like to avoid the word "group", since it combines rows in SQL, and (b) my research so far leads me to think that the "PARTITION BY" clause may be what I want:
1. Select rows using a long query with many joins, then partition the results based on the value of a certain column of type LONG.
2. Loop through each row of a partition and partition again, based on another column of type VARCHAR. Do that for every partition.
3. Loop through each row of the resulting sub-partition, compare multiple columns with predefined values, and set a boolean column to true or false based on the result. Do that for every sub-partition.
This would be really easy for me to do in a normal programming language such as Java; the sketch below shows the kind of logic I mean. But can I do that in PL/SQL? If so, what would be a good approach?
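To illustrate, the logic I have in mind would look roughly like this in Python (the column names long_col, varchar_col and the flag are placeholders):

from collections import defaultdict

def flag_rows(rows, expected_values):
    """rows: list of dicts from the big join; expected_values: column name -> predefined value."""
    # Steps 1 and 2: "partition" by the LONG column, then sub-partition by the VARCHAR column
    partitions = defaultdict(lambda: defaultdict(list))
    for row in rows:
        partitions[row["long_col"]][row["varchar_col"]].append(row)

    # Step 3: flag each row of each sub-partition by comparing columns to the predefined values
    for sub_partitions in partitions.values():
        for sub_rows in sub_partitions.values():
            for row in sub_rows:
                row["flag"] = all(row[col] == val for col, val in expected_values.items())
    return rows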
I have a huge Excel file with more than a million rows and a bunch of columns (300), which I've imported into an Access database. I'm trying to run an inner join query on it that matches on a numeric field in a relatively small dataset. I would like to capture all the columns of data from the huge dataset if possible. I was able to get the query to run in about half an hour when I selected just one column from the huge dataset. However, when I select all the columns from the larger dataset and have the query write to a table, it just never finishes.
One consideration is that the smaller dataset's join field is a number, while the larger one's is text. To get around this, I created a query on the larger dataset which converts the text field to a number using the Val function. The text field in question is indexed, but I'm thinking I should convert the field on the table itself to a numeric type to match the smaller dataset's type. Maybe that would make the lookup more efficient.
Other than that, I would greatly appreciate suggestions for a good strategy to get this query to run in a reasonable amount of time.
Access is a relational database. It is designed to work efficiently if your structure respects the relational model. Volume is not the issue.
Step 1: normalize your data. If you don't have a clue what that means, there is a wizard in Access that can help you with this (Database Tools > Analyze Table), or search for "database normalization".
Step 2: index the join fields
Step 3: enjoy fast results
Your idea of having both sides of the join be the same type IS a must. If you don't do that, indexes and optimisation won't be able to operate.
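If you prefer to script it, here is a rough sketch of the type change and the indexes via Python and pyodbc (the database path, table and column names are made up; the same SQL statements can also be run directly in an Access query window):

import pyodbc

conn = pyodbc.connect(
    r"Driver={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\data\mydb.accdb;"
)
cur = conn.cursor()

# Add a numeric copy of the text join field so both sides of the join have the same type,
# instead of converting with Val() at query time
cur.execute("ALTER TABLE big_table ADD COLUMN join_id_num LONG")
cur.execute("UPDATE big_table SET join_id_num = Val(join_id_text)")

# Index the join fields on both sides
cur.execute("CREATE INDEX idx_big_join_num ON big_table (join_id_num)")
cur.execute("CREATE INDEX idx_small_join ON small_table (join_id)")

conn.commit()
conn.close()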
I'm currently doing some data loading for a kind of warehouse solution. I get a data export from production each night, which then must be loaded. There are no other updates on the warehouse tables. To load only the new items for a given table, I currently do the following steps:
get the current max value y for a specific column (id for journal tables and time for event tables)
load the data via a query like where x > y
To avoid performance issues (I load around 1 million rows per day) I removed most indices from the tables (they are only needed in production, not in the warehouse). But that way, retrieving the max value takes some time... so my question is:
What is the best way to get the current max value for a column without an index on that column? I just read about using the stats, but I don't know how to handle columns of type 'timestamp with time zone'. Disabling the index before the load and recreating it afterwards takes much too long...
The minimum and maximum values that are computed as part of column-level statistics are estimates. The optimizer only needs them to be reasonably close, not completely accurate. I certainly wouldn't trust them as part of a load process.
Loading a million rows per day isn't terribly much. Do you have an extremely small load window? I'm a bit hard-pressed to believe that you can't afford the cost of indexing the column(s) you need for a min/max index scan.
If you want to avoid indexes, however, you probably want to store the last max value in a separate table that you maintain as part of the load process. After you load rows 1-1000 in table A, you'd update the row in this summary table for table A to indicate that the last row you've processed is row 1000. The next time in, you would read the value from the summary table and start at 1001.
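A minimal sketch of that bookkeeping, assuming a Python loader with a DB-API connection (psycopg2-style %s placeholders) and a hypothetical one-row-per-table watermark table load_watermark(table_name, last_value):

def incremental_load(conn, source_table, key_column):
    cur = conn.cursor()

    # Read the last value this loader processed; maintained here, not computed from the data
    cur.execute(
        "SELECT last_value FROM load_watermark WHERE table_name = %s",
        (source_table,),
    )
    (last_value,) = cur.fetchone()

    # Pull only the new rows; the key column is selected first so we can track the new high-water mark
    cur.execute(
        f"SELECT {key_column}, payload_col_1, payload_col_2 "  # hypothetical payload columns
        f"FROM {source_table} WHERE {key_column} > %s",
        (last_value,),
    )
    rows = cur.fetchall()
    if not rows:
        return

    # ... insert `rows` into the corresponding warehouse table here ...

    # Advance the watermark so the next run starts after the rows just loaded
    new_max = max(row[0] for row in rows)
    cur.execute(
        "UPDATE load_watermark SET last_value = %s WHERE table_name = %s",
        (new_max, source_table),
    )
    conn.commit()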
If there is no index on the column, the only way for the DBMS to find the maximum value in the column is a complete table scan, which takes a long time for large tables.
I suppose a DBMS could try to keep track of the minimum and maximum values in the column (storing the values in the system catalog) as it does inserts, updates and deletes, but deletes are the reason no DBMS I know of tries to keep statistics up to date with per-row operations. If you delete the maximum value, finding the new maximum requires a table scan if the column is not indexed (and if it is indexed, the index makes it trivial to find the maximum value, so the information does not have to be stored in the system catalog). This is why they're called 'statistics': they're an approximation to the current values. But when you run 'SELECT MAX(somecol) FROM sometable', you aren't asking for the statistical maximum; you're asking for the actual current maximum.
Have the process that creates the extract file also produce a single-row file with the min/max you want. I assume that piece is scripted on some cron or scheduler, so it shouldn't be too much to ask to add the min/max calculations to that script ;)
If not, just do a full scan. A million rows isn't much, really, especially in a data warehouse environment.
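For example, if the nightly extract lands as a CSV, the sidecar file could be produced with something as small as this (file and column names are made up):

import csv

max_id = None
with open("nightly_extract.csv", newline="") as extract:
    for row in csv.DictReader(extract):
        value = int(row["id"])          # hypothetical key column
        if max_id is None or value > max_id:
            max_id = value

# One-line sidecar file the loader can read instead of running MAX() on an unindexed column
with open("nightly_extract.maxid", "w") as sidecar:
    sidecar.write(f"{max_id}\n")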
This code was written for Oracle (ROWNUM is Oracle-specific; other databases use LIMIT or FETCH FIRST instead), but the approach translates to most SQL dialects.
It gets the key of the max(high_val) in the table within the selected range.
select high_val, my_key
from (select high_val, my_key
        from mytable
       where something = 'avalue'
       order by high_val desc)
where rownum <= 1
What this says is: sort mytable by high_val descending for rows where something = 'avalue', then grab only the top row, which gives you the max(high_val) in the selected range and the corresponding my_key.