Improve performance of wide GroupBy + write - performance

I need to tune a job that looks like the one below.
import pyspark.sql.functions as F
dimensions = ["d1", "d2", "d3"]
measures = ["m1", "m2", "m3"]
expressions = [F.sum(m).alias(m) for m in measures]
# Aggregation
aggregate = (spark.table("input_table")
    .groupBy(*dimensions)
    .agg(*expressions))
# Write out summary table
aggregate.write.format("delta").mode("overwrite").save("output_table")
The input table contains transactions, partitioned by date, 8 files per date.
It has 108 columns and roughly half a billion records. The aggregated result has 37 columns and ~20 million records.
Nothing I have tried improves the runtime, so I would like to understand what affects the performance of this aggregation, i.e. what can I potentially change?
The only thing that seems to help is to manually partition the work, i.e. starting multiple concurrent copies of the same code but with different date ranges.

To the best of my understanding, the groupBy clause currently doesn't include the 'date' column, so you are aggregating across all dates in the query and not using the input table's partitioning at all.
You can add the "date" column to the groupBy clause, and then you will sum up the measures for each date.
Also, as for input_table: when it is built, if possible, you can additionally partition it by d1, d2, d3 (or at least some of them) if they don't have high cardinality.
Finally, input_table will benefit from a columnar file format such as Parquet, so you won't have to read all 108 columns as you would with something like CSV. I assume you are already using something like Parquet, but just in case.
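To illustrate why summing per date first is safe: per-date partial sums roll up to exactly the same totals as one global aggregation. This is a minimal sketch in plain Python, not Spark; the rows are made up, and only the names d1 and m1 come from the question.

```python
from collections import defaultdict

# Hypothetical rows of (date, d1, m1); the values are invented for illustration.
rows = [
    ("2024-01-01", "a", 10),
    ("2024-01-01", "b", 5),
    ("2024-01-02", "a", 7),
    ("2024-01-02", "a", 3),
    ("2024-01-02", "b", 1),
]

def sum_by(records, key_fn, value_fn):
    """Group records by key_fn and sum value_fn within each group."""
    totals = defaultdict(int)
    for record in records:
        totals[key_fn(record)] += value_fn(record)
    return dict(totals)

# One global aggregation over all dates (the shape of the original job).
global_totals = sum_by(rows, lambda r: r[1], lambda r: r[2])

# Step 1: aggregate within each date, so every date can be handled on its own.
per_date = sum_by(rows, lambda r: (r[0], r[1]), lambda r: r[2])

# Step 2: roll the per-date partial sums up across dates.
rolled_up = sum_by(per_date.items(), lambda kv: kv[0][1], lambda kv: kv[1])

print(global_totals)
print(rolled_up)
```

Because sums decompose this way, running the job per date range (as the question already does by hand) and combining the results afterwards gives the same answer as the single wide aggregation.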

Related

Bind variables results in full table scan in Oracle

Checking the query cost on a table with 1 million records shows a full table scan when bind variables are used, while the same query with actual values has a significantly lower cost.
Is this expected behaviour from Oracle ?
Is there a way to tell Oracle not to scan the full table ?
The query is scanning the full table when bind variables are used:
The query cost reduces significantly with actual variables:
This is a pagination query. You want to retrieve a handful of records from the table, filtering on their position in the filtered set. Your projection includes all the columns of the table, so you need to query the table to get the whole row. The question is, why do the two query variants have different plans?
Let's consider the second query. You are passing hard values for the offsets, so the optimizer knows that you want the eleven most recent rows in the sorted set. The set is sorted by an indexed column. The most important element is that the optimizer knows you want 11 rows. 11 is a very small sliver of one million, so using an indexed read to get the required rows is an efficient way of doing things. The path starts at the far end of the index, reads the last eleven entries and retrieves the rows.
Now, your first query has bind variables for the starting and finishing offsets and also for the number of rows to be returned. This is crucial: the optimizer doesn't know whether you want to return eleven rows or eleven thousand rows. So it opts for a very high cardinality. The reason for this is that index reads perform very badly for retrieving large numbers of rows. Full table scans are the best way of handling big slices of our tables.
Is this expected behaviour from Oracle ?
Now that you understand this, you can see that the answer to this question is yes. The optimizer makes the best decision it can with the information we give it. When we provide hard values it can be very clever. When we provide vague data it has to guess; sometimes its guesses aren't the ones we expected.
Bind variables are very useful for running the same query with different values when the expected result set is similar. But using bind variables to specify ranges means the result sets can potentially vary tremendously in size.
Is there a way to tell Oracle not to scan the full table ?
If you can fix the pagesize, thus removing the :a2 parameter, that would allow the optimizer to produce a much more accurate plan. Alternatively, if you need to vary the pagesize within a small range (say 10 - 100) then you could try a /*+ cardinality (100) */ hint in the query; provided the cardinality value is within the right order of magnitude it doesn't have to be the precise value.
As with all performance questions, the devil is in the specifics. So you need to benchmark various performance changes and choose the best fit for your particular use case(s).
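The reasoning above can be sketched as a toy cost comparison. This is not Oracle's actual costing: the cost constants, the 5% guess for the bind-variable case and the function name choose_plan are all assumptions made up for illustration. Only the table size (1 million rows), the 11-row page and the cardinality value of 100 come from the discussion.

```python
TABLE_ROWS = 1_000_000      # table size from the question
FULL_SCAN_COST = 5_000      # made-up flat cost of reading the whole table
COST_PER_INDEXED_ROW = 3    # made-up cost of one index probe plus row fetch

def choose_plan(estimated_rows):
    """Pick the cheaper access path for the estimated number of rows (toy model)."""
    index_cost = estimated_rows * COST_PER_INDEXED_ROW
    if index_cost < FULL_SCAN_COST:
        return "index range scan"
    return "full table scan"

# Literal offsets: the optimizer can see that only 11 rows are wanted.
print(choose_plan(11))

# Bind variables: the range size is unknown, so a pessimistic guess is used
# (5% of the table here, purely an assumption for this sketch).
print(choose_plan(int(TABLE_ROWS * 0.05)))

# A cardinality hint of 100 puts the estimate back in the right order of magnitude.
print(choose_plan(100))
```

The point is only that the plan flips once the row estimate crosses some threshold, which is why an estimate in the right order of magnitude is enough.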

How to bucket a Hive table with ORC for a complex query?

Maybe this question is too generic but I think it is worth a try.
I am working with a table that has 270 fields. It is partitioned by the date (like dt=20180101). However, when we hit this table with queries we are essentially doing a whole table scan, because we use fields in the where clause that are not dt. I was wondering what the right approach is to enable bucketing for this table. I could pick one of the where clause fields and enable bucketing for that. For example:
PARTITIONED BY (
    dt INT
)
CLUSTERED BY (
    class
)
INTO 16 BUCKETS
Another approach is to use more than 1 field for bucketing:
PARTITIONED BY (
    dt INT
)
CLUSTERED BY (
    class, other_field, other_field_2
)
INTO 128 BUCKETS
Is it worth bucketing by multiple fields? I guess it will only speed up queries when the same exact fields are present in the select.
Another question: is it worth at least sorting by multiple fields, so that when the file is read it is a sequential read? Like this:
PARTITIONED BY (
    dt INT
)
CLUSTERED BY (
    class
)
SORTED BY (
    other_field, other_field_2
)
INTO 16 BUCKETS
First, if you don't usually query on date and your queries span many dates, then you might want to change your partitioning strategy.
You don't have to always query for just one or a few dates, but if your queries are usually NOT related to 'date' filtering at all, then you should change that!
Second, bucketing basically splits your data based on a hash of your bucketing columns. So it helps you split your data into roughly equal-sized files in the file system and helps the MapReduce program running over it manage the partitions efficiently. But bucketing into a large number of buckets can also have negative effects, as all such metadata is also stored in the Hive metastore. This metadata is read first when you execute a query, and based on the result of the metadata query, the actual data (or part of it) is read from the file system.
So in practice there's no specific rule for bucketing, either for how many buckets there should be or for which columns you should bucket on.
So you should look into your queries and plan accordingly!
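As a rough illustration of the hashing idea, here is a minimal Python sketch. It assumes zlib.crc32 as a stand-in hash (Hive uses its own hash function), and the rows and the helper name bucket_for are made up; only the column name class and the 16 buckets come from the question.

```python
import zlib

NUM_BUCKETS = 16  # matches "INTO 16 BUCKETS" in the question

def bucket_for(value, num_buckets=NUM_BUCKETS):
    """Map a bucketing-column value to a bucket number.

    zlib.crc32 is only a stand-in here; Hive's real hash function differs.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

# Hypothetical rows of (class, payload).
rows = [("A", 1), ("B", 2), ("A", 3), ("C", 4), ("B", 5)]

# "Write" each row into the bucket its class value hashes to.
buckets = {}
for cls, payload in rows:
    buckets.setdefault(bucket_for(cls), []).append((cls, payload))

# A filter on the bucketing column only has to look at one bucket,
# because every row with class == "A" hashed to the same place.
target = bucket_for("A")
matches = [row for row in buckets[target] if row[0] == "A"]
print(matches)  # [('A', 1), ('A', 3)]
```

With several bucketing columns the hash is computed over the combination, so a filter on class alone would no longer pin down a single bucket. Whether the engine actually skips the other buckets depends on its support for bucket pruning.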
Third, sorting does help at query time, as it's easier for the engine to push down filtering and sorting criteria. But when you enable sorting on a table, ingestion of data becomes a little slower than when sorting isn't enabled! In a system with a high query load, though, it is bound to pay off.
So all in all, these three are all optimization techniques and don't come with hard rules for their application. It purely depends on your use case!
Hope this helps!!

Partitioning or bucketing hive table based on only month/year to optimize queries

I'm building a table that contains about 400k rows of a messaging app's data.
The current table's columns look something like this:
message_id (int)| sender_userid (int)| other_col (string)| other_col2 (int)| create_dt (timestamp)
A lot of queries I would be running in the future will rely on a where clause involving the create_dt column. Since I expect this table to grow, I would like to try and optimize it right now. I'm aware that partitioning is one way, but when I partition it based on create_dt the result is too many partitions since I have every single date spanning back to Nov 2013.
Is there a way to instead partition by a range of dates? How about partition for every 3 months? or even every month? If this is possible - Could I possibly have too many partitions in the future making it inefficient? What are some other possible partition methods?
I've also read about bucketing, but as far as I'm aware that's only useful if you would be doing joins on a column that the bucket is based on. I would most likely be doing joins only on column sender_userid (int).
Thanks!
I think this might be a case of premature optimization. I'm not sure what your definition of "too many partitions" is, but we have a similar use case. Our tables are partitioned by date and customer column. We have data that spans back to Mar 2013. This created approximately 160k+ partitions. We also use a filter on date and we haven't seen any performance problems with this schema.
On a side note, Hive is getting better at scaling up to 100s of thousands of partitions and tables.
On another side note, I'm curious as to why you're using Hive in the first place for this. 400k rows is a tiny amount of data and is not really suited for Hive.
Check out Hive's built-in UDFs. With the right combination of them you can achieve what you want. Here's an example to partition on every month (it produces a "YEAR-MONTH" string that you can use as the partition column value):
select concat(cast(year(to_date(create_dt)) as string),'-',cast(month(to_date(create_dt)) as string))
But when partitioning on dates it is usually useful to have multiple levels of the date dimension, so in this case you should have two partition columns, the first for year and the second for month:
select year(to_date(create_dt)),month(to_date(create_dt))
Keep in mind that timestamps and dates are strings, and that functions like month() or year() return integers as values of date fields. You can use simple mathematical operations to figure out the right partition.
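The same derivation can be sketched outside Hive. This minimal Python example assumes create_dt arrives as a 'YYYY-MM-DD HH:MM:SS' string; the sample values and the helper names partition_key and quarter_key are made up for illustration.

```python
from datetime import datetime

def partition_key(create_dt):
    """Return (year, month) partition values for a 'YYYY-MM-DD HH:MM:SS' string."""
    ts = datetime.strptime(create_dt, "%Y-%m-%d %H:%M:%S")
    return ts.year, ts.month

def quarter_key(create_dt):
    """Return (year, quarter): simple integer arithmetic on the month."""
    year, month = partition_key(create_dt)
    return year, (month - 1) // 3 + 1

# Made-up sample timestamps.
samples = ["2013-11-05 08:15:00", "2013-11-28 23:59:59", "2014-01-02 00:00:01"]

monthly = [partition_key(s) for s in samples]
quarterly = [quarter_key(s) for s in samples]

print(monthly)    # [(2013, 11), (2013, 11), (2014, 1)]
print(quarterly)  # [(2013, 4), (2013, 4), (2014, 1)]
```

Monthly partitions give 12 per year and quarterly ones give 4, so even many years of data stay well below one partition per day.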

performance for sum oracle

I have to sum a huge amount of data with an aggregation and a where clause, using the query below.
What I am doing is this: I have three tables. The first contains terms, the second contains user terms, and the third contains the correlation factor between a term and a user term.
I want to calculate the similarity between the sentence that the user inserted and the already existing sentences, and take the results greater than 0.5, by summing the correlation factor between the sentences' terms.
The problem is that this query takes more than 15 minutes, because I have huge tables.
Any suggestions to improve performance, please?
insert into PLAG_SENTENCE_SIMILARITY
SELECT plag_TERMS.SENTENCE_ID,
       plag_User_TERMS.SENTENCE_ID,
       least(sum(plag_TERM_CORRELATIONS3.CORRELATION_FACTOR) / plag_terms.sentence_length,
             sum(plag_TERM_CORRELATIONS3.CORRELATION_FACTOR) / plag_user_terms.sentence_length),
       plag_TERMs.isn,
       plag_user_terms.isn
FROM plag_TERM_CORRELATIONS3,
     plag_TERMS,
     Plag_User_TERMS
WHERE Plag_TERMS.TERM_ROOT = Plag_TERM_CORRELATIONS3.TERM1
  AND Plag_User_TERMS.TERM_ROOT = Plag_TERM_CORRELATIONS3.TERM2
  AND Plag_User_Terms.ISN = 123
group by plag_User_TERMS.SENTENCE_ID, plag_TERMS.SENTENCE_ID, plag_TERMs.isn,
         plag_terms.sentence_length, plag_user_terms.sentence_length, plag_user_terms.isn
having least(sum(plag_TERM_CORRELATIONS3.CORRELATION_FACTOR) / plag_terms.sentence_length,
             sum(plag_TERM_CORRELATIONS3.CORRELATION_FACTOR) / plag_user_terms.sentence_length) > 0.5;
plag_terms contains more than 50 million records and plag_correlations3 contains 500,000.
If you have a sufficient amount of free disk space, then create a materialized view:
- over the join of the three tables
- fast-refreshable on commit (don't use the ANSI join syntax here, even if tempted to do so, or the mview won't be fast-refreshable ... a strange bug in Oracle)
- with query rewrite enabled
- properly physically organized for quick calculations
The query rewrite is optional. If you can modify the above insert-select, then you can just select from the materialized view instead of selecting from the join of the three tables.
As for the physical organization, consider:
- hash partitioning by Plag_User_Terms.ISN (with a sufficiently high number of partitions; don't hesitate to partition your table with e.g. 1024 partitions, if it seems reasonable) if you want to do a bulk calculation over all values of ISN
- single-table hash clustering by Plag_User_Terms.ISN if you want to retain your calculation over a single ISN
If you don't have spare disk space, then just hint your query to:
- either use nested loops joins, since the number of rows processed seems to be quite low (assumed from the estimates in the execution plan)
- or full-scan the plag_correlations3 table in parallel
Bottom line: Constrain your tables with foreign keys, check constraints, not-null constraints, unique constraints, everything! The Oracle optimizer is capable of using most of this information to its advantage, as are the people who tune SQL queries.

Poor Performance of Mondrian w/ Degenerate Dimensions

I have an application that collects performance metrics and stores them in a datamart. I then use Mondrian to enable analysis and ad-hoc exploration of the data. I'm collecting about 5e6 rows per day and total size of the METRIC table is about 300M rows.
We "color" our data based on the metrics comparison to an SLA. There are exactly 5 distinct values for color. When we do simple MDX queries to get, for example, a color distribution of the data for a specific date range, say 1 day, we see queries like below:
2014-06-11 23:17:08,042 DEBUG [sql] - 223: SqlTupleReader.readTuples
[[Color].[Color]]: executing sql [select "METRIC"."COLOR" as "c0"
from "METRIC" "METRIC" group by "METRIC"."COLOR" order by
"METRIC"."COLOR" ASC NULLS LAST] 2014-06-11 23:17:58,747 DEBUG [sql] -
223: , exec 50704 ms
In order to improve performance, the datamart includes aggregate tables at the hour and day levels, and both aggregate tables include the COLOR column.
I understand that Mondrian is very dependent on the underlying database performance, but there is really no way to tune this. I can create an index on COLOR (because a full scan of the index will be marginally faster than a full scan of the table), but it seems silly to create an index with 5 distinct value on a 300M row table. The day aggregate table has about 500K rows and would be significantly faster executing virtually the same query against this table, but Mondrian always seems to go to the base fact table for these dimension queries.
My question is, is there some way to avoid this query? If I can't avoid it, is it possible to get Mondrian to use the aggregate tables for this type of query? I have specified approxRowCount in the single level of this dimension/hierarchy and that eliminated the similar query to get the count of values. I haven't dug into the source of Mondrian yet to determine if there is a possibility of using the aggregate table or if there is some configuration on my part that is preventing it.
Edit for Clarification:
I probably didn't do a good job of asking my question, so let me try to clarify. My MDX query looks something like:
select [Color].[Color].Members on columns,
{[Measures].[Metric Value], [Measures].[Count]} on rows
from [Metric]
where [Time].[2014].[June].[11]
I can look at this and hand-write a SQL query that answers it:
select COLOR, avg(VALUE), sum(FACT_COUNT)
from AGG_DAY_METRIC
where YEAR = 2014
and MONTH = 6
and DAY_OF_MONTH = 11
group by COLOR
The database answers this query in about 100ms scanning approx 4K rows. It takes Mondrian several minutes to answer the query because it does several queries that don't answer the MDX query directly, but rather get information about the dimension. In the case above, the database has to scan 300M rows, taking 50 seconds, to return that there are 5 possible colors. If color was in a normal dimension table there would only be 5 rows, but in a degenerate dimension there can be 100s of millions of rows.
So my questions are:
a) Is there a way to tell Mondrian the values of a degenerate dimension and avoid these queries?
b) Is there a way to have Mondrian answer these queries from aggregate tables?
This problem was solved, not by modifying anything in the Mondrian schema or the application, but the database. The database in this case was Oracle and we were able to create a materialized view with query rewrite enabled.
The materialized view is created from the exact query issued by Mondrian. Since the color values don't change very frequently (almost never in our case), the materialized view does a full refresh once a day.
In this case the queries went from taking minute(s) to milliseconds. If you're facing an issue like this and your database is Oracle, this is a good approach to speeding up the tuples resolution for degenerate dimensions with low cardinality.
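The idea can be sketched with SQLite from Python. SQLite has no materialized views or query rewrite, so this example uses a plain summary table named METRIC_COLORS (a hypothetical name) and reads from it explicitly, whereas Oracle would redirect the original query on its own. The color names, the row count and the helper refresh_color_summary are made up; the member query is a simplified form of the one in the log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE METRIC (COLOR TEXT, VALUE REAL)")

# Hypothetical color names and a small stand-in for the 300M-row fact table.
colors = ["GREEN", "YELLOW", "ORANGE", "RED", "BLACK"]
conn.executemany(
    "INSERT INTO METRIC VALUES (?, ?)",
    [(colors[i % 5], float(i)) for i in range(10_000)],
)

# Simplified form of the member-list query seen in the Mondrian log.
MEMBER_QUERY = "SELECT COLOR AS c0 FROM METRIC GROUP BY COLOR ORDER BY COLOR"

def refresh_color_summary(connection):
    """Full refresh: rebuild the summary table from the fact table."""
    connection.execute("DROP TABLE IF EXISTS METRIC_COLORS")
    connection.execute(f"CREATE TABLE METRIC_COLORS AS {MEMBER_QUERY}")

refresh_color_summary(conn)  # would run once a day, as described above

from_fact = conn.execute(MEMBER_QUERY).fetchall()  # scans every fact row
from_summary = conn.execute(
    "SELECT c0 FROM METRIC_COLORS ORDER BY c0"
).fetchall()  # reads only 5 rows

print(from_fact == from_summary)
print(len(from_summary))
```

The summary is only as fresh as its last refresh, which is acceptable here because the set of colors almost never changes.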
It's hard to give any specific directions without knowing more about your schema, but it looks to me like you have to make sure that the number of rows with certain colours (count) is defined as an aggregate measure (Count or Max Number).
Please note that these aggregates are not calculated continuously (I think it would be too heavy for the backing data-store, and Mondrian won't keep a flowing set in memory for incoming facts).
The aggregation can be specified to be run/rebuilt at specific times (nightly, hourly...). This would make Mondrian a bit unsuitable for real-time analysis, but you should be able to do almost instant queries on historical data.
If your dimension has 5 distinct values in a 300M fact table it should not be a degenerate dimension. It should be in a separate dimension table. A degenerate dimension should ONLY be used if its cardinality is close to the full fact table row count, making a separate table pointless, as there would be no significant storage savings and joining the dimension results in a lot of data being read.
If you put the colors on a separate dim table, any "Read Tuples" query will return results in a few ms, and your problem is solved.
However, more to the point of your question, Mondrian should be able to pick the dim values from the agg tables. Unless you have distinct-count aggregators in the cube, in which case you're in a tricky situation (unless there's an agg table that exactly matches the level of detail you need, Mondrian will very likely scan the fact table).
You should also set the highCardinality attribute of this degenerate dimension to True. Even with only 5 distinct values, having highCardinality=false tells Mondrian it's safe to scan the whole dimension to populate the list of members. Setting it to true stops this scan.
You should also add an index to this column. It's always a good idea to add indexes to every key and degenerate dimension column in a fact table. With an index the DB should answer that SQL query much faster.
Finally, you have a 300M row fact table. What DBMS are you using? Is it a column-oriented DB? If not, you should try one as a possible alternative for your data store. Column-oriented DBs offer a significant performance increase over row-oriented DBs for Mondrian-like queries. There are a few good options out there; you should test drive them.

Resources