I am trying to join two DataFrames that are read from S3 as Parquet files. One of the DataFrames is huge, with a deserialized size of about 10 GB, and the other is about 1 GB (deserialized). I am just doing a left join on two columns, but the join is taking forever to complete. Below is the code snippet.
from pyspark.sql import functions as F

pk_populated_df = left_df.join(
    right_df.select(
        ID1,
        ID2,
        F.struct('*').alias(COLLECTION_DF)
    ),
    on=[ID1, ID2],
    how='left'
).persist()

pk_populated_df = pk_populated_df.join(
    second_right_df.select(
        ID1,
        ID2,
        F.struct('*').alias(COLLECTION_DF2)
    ),
    on=[ID1, ID2],
    how='left'
)

pk_populated_df.write.parquet("s3://")
The first join doesn't take much time, as left_df holds only a small amount of data (about 10 MB), but the result is 12 million rows.
A few of the things I have noticed:
The second join is the problem here; it might produce a billion rows.
When I look at the task output, I can see two tasks taking the same input but needing different times to complete, i.e., one task processing 7 MB of data takes 2 minutes, whereas another task processing almost the same amount of data takes 8 minutes.
I thought it might be due to data skew, so I tried salting (a rough sketch of what I did is shown after these observations), but it still did not work.
Each task reads only a few MB of data from the shuffle, and all the tasks run as "PROCESS_LOCAL".
I have tried increasing the shuffle partitions up to 1000, but each task still runs for at least 2 minutes and some tasks can go up to 1 hour.
I have also noticed that CPU utilization is constantly hitting 100%, which might be causing the tasks to complete more slowly, though I am not sure.
I have also tried repartitioning the data by the join columns, but it did not help.
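For reference, a rough sketch of the salting I tried on the second join (illustrative only; I am assuming ID1 and ID2 hold plain column-name strings, and the bucket count is just an example):

from pyspark.sql import functions as F

SALT_BUCKETS = 32  # example value, not the one from the real job

# Add a random salt to the large left side of the problematic join.
salted_left = pk_populated_df.withColumn(
    "salt", (F.rand() * SALT_BUCKETS).cast("int")
)

# Replicate each right-side row once per salt value so every (key, salt)
# combination on the left can still find its match.
salted_right = second_right_df.withColumn(
    "salt", F.explode(F.array(*[F.lit(i) for i in range(SALT_BUCKETS)]))
)

# Join on the original keys plus the salt, then drop the helper column.
pk_populated_df = salted_left.join(
    salted_right, on=[ID1, ID2, "salt"], how="left"
).drop("salt")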
Below is a sample of the time taken by the tasks:
[screenshots of per-task durations]
Can someone guide me on how to tune this job? Thanks for your help.
Data skew is something that happens often and should be detected and treated correctly. I'm able to detect data skew in a specific table using a groupBy/count query on the joining key, but I have multiple joins in my application, and doing that for each join takes time.
So is it possible to detect data skew directly in the Spark web UI, which would save me time?
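For context, the groupBy/count check I currently run for a single join looks roughly like this (a sketch; df and join_key are placeholders for the real table and joining column):

from pyspark.sql import functions as F

# Count rows per join key and look at the heaviest keys; a handful of keys
# holding most of the rows means the join on this key will be skewed.
key_counts = (
    df.groupBy("join_key")
      .count()
      .orderBy(F.desc("count"))
)
key_counts.show(20, truncate=False)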
Data skew means that some partitions are significantly bigger than others.
For me, I usually check two things. In the Stages tab, sort by decreasing duration, then click on the stages that are slow:
1- Check Summary Metrics, which is one of the most important parts of the Spark UI. It gives you information about how your data is distributed among your partitions.
To detect skew you can compare the duration in the Median and Max columns; ideally the two values should be close. When the difference between the two is large, there is definitely data skew, for example in the picture below:
This means some tasks in that stage take far too long (31 min) compared to others that take only 1.1 minutes, because of partition size imbalance; the Min duration is also low, which indicates that some partitions are nearly empty.
2- At the bottom of the stage page you can find all the tasks related to that stage. Sort them by decreasing duration, then by increasing duration, and make sure that the min and max durations are close; if they are not, your partitions are skewed, as in the picture below:
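Beyond the UI, you can approximate the same median-vs-max check programmatically; a minimal sketch in PySpark, assuming the DataFrame you are about to join is called df:

from pyspark.sql import functions as F

# Row count per Spark partition; a max far above the median (and a min near
# zero) is the same imbalance the Summary Metrics table shows in the UI.
partition_sizes = (
    df.withColumn("pid", F.spark_partition_id())
      .groupBy("pid")
      .count()
)
partition_sizes.select("count").summary("min", "50%", "max").show()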
See these two Snowflake query profile images. They are doing similar work (updating the same 370M-row table, joined with small tables: 21k rows in one case, 9k in the other), but the performance differs by 5x.
The first query finished in around 15 minutes, using one X-Small VDW:
[query profile screenshot: fast query, finished in about 15 minutes]
The second query updates the same table of 370M rows and joins with an even smaller DIM table of 9k rows, but was still running after 1 hour 30 minutes:
[query profile screenshot: slow query, still running after 90 minutes]
From the query profiles, I cannot explain why the second query runs so much slower than the first one. The second one was run right after the first.
Any idea? Thanks
In the second query you can see that bytes spilled to local storage is 272 GB. This means that the working set was too large to fit in the cluster's memory and so had to spill to locally attached SSD. From a performance perspective this is a costly operation, and it is probably why the second query took so long to run (query 1 had only 2 GB of spilling). The easiest solution is to increase the size of the VDW, or you could rewrite the query:
https://docs.snowflake.net/manuals/user-guide/ui-query-profile.html#queries-too-large-to-fit-in-memory
Note also that query 1 managed to read 100% of its data set from VDW memory, which is very efficient, whereas query 2 could only find about half of its data set there and so had to perform remote IO (reads from cloud storage) to get the rest. Queries/work performed prior to running queries 1 and 2 had pulled that data into the local VDW cache, which retains it on an LRU basis.
The join in the slow query is producing more rows than are flowing into it. This can be what you want, but it is often caused by duplicate values in the joined tables. I'd do a sanity check on whether that's expected here.
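One way to run that sanity check, sketched here with the Snowflake Python connector (the connection parameters, dim_table, and join_key are placeholders for your real names):

import snowflake.connector

# Placeholder credentials; substitute your own account details.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="my_schema",
)

cur = conn.cursor()
try:
    # List join-key values that appear more than once in the small DIM table;
    # any hits explain why the join emits more rows than it reads.
    cur.execute(
        "SELECT join_key, COUNT(*) AS cnt "
        "FROM dim_table "
        "GROUP BY join_key "
        "HAVING COUNT(*) > 1 "
        "ORDER BY cnt DESC"
    )
    for key, cnt in cur.fetchall():
        print(key, cnt)
finally:
    cur.close()
    conn.close()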
I want to ask about the number of Hive partitions and how it will impact performance.
Let me illustrate this with a real example.
I have an external table that is expected to receive around 500M rows per day from multiple sources, and it will have 5 partition columns.
For one day, that resulted in 250 partitions, and with the expected 1-year retention it will reach around 75K partitions, which I suppose is a huge number: from what I checked, Hive can go up to 10K, but after that the performance is going to be bad (and someone told me that partitions should not exceed 1K per table).
The queries that will select from this table break down roughly as follows:
50% of them will use the exact order of the partition columns.
25% will use only the first 1-3 partition columns and not the other 2.
25% will use only the 1st partition column.
So do you think this may work well even with 1-month retention? Or would partitioning by the start date alone be enough, assuming the other 4 columns are evenly distributed (say 500M rows / 250 partitions, which gives about 2M rows per partition)?
I would go with 3 partition columns, since that will a) exactly match ~50% of your query profiles, and b) substantially reduce (prune) the number of scanned partitions for the other 50%. At the same time, you won't be pressured to increase your Hive MetaStore (HMS) heap memory and beef up HMS backend database to work efficiently with 250 x 364 = 91,000 partitions.
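Purely to make "3 partition columns" concrete, here is a sketch of such a table definition (expressed through Spark SQL with Hive support enabled; the table name, column names, and S3 location are all invented):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical schema: keep start_date first, since every query filters on it.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
        event_id  STRING,
        payload   STRING
    )
    PARTITIONED BY (
        start_date    STRING,
        source_system STRING,
        region        STRING
    )
    STORED AS PARQUET
    LOCATION 's3://my-bucket/events/'
""")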
Since the time the 10K limit was introduced, significant efforts have been made to improve partition-related operations in HMS. See for example JIRA HIVE-13884, which provides the motivation to keep that number low and describes the way high numbers are being addressed:
The PartitionPruner requests either all partitions or partitions based
on filter expression. In either scenarios, if the number of partitions
accessed is large there can be significant memory pressure at the HMS
server end.
... PartitionPruner [can] first fetch the partition names (instead of
partition specs) and throw an exception if number of partitions
exceeds the configured value. Otherwise, fetch the partition specs.
Note that partition specs (mentioned above) and the statistics gathered per partition (always recommended for efficient querying) are what constitute the bulk of the data HMS has to store and cache for good performance.
I am trying to execute this query using Hive, but it takes forever to run, especially after it gets to the reducer step. It says mappers: 451, reducers: 1.
create table mb.ref201501_nontarget as select * from adv.raf_201501 where target=0 limit 200000;
My motivation to change the query came from this answer:
Hive unable to manually set number of reducers
I tried changing the query to:
create table mb.ref201501_nontarget as select * from (select * from adv.raf_201501 limit 200000) where target=0;
but it's throwing an error.
This question is very vague. If you think the last query produces the proper result (note that it is not the same as the first one!), this should do the trick:
create table mytmptbl as select * from adv.raf_201501 limit 200000;
create table mb.ref201501_nontarget as select * from mytmptbl where target=0;
After which you probably want to delete the temporary table again.
Hadoop is a framework for distributed computing. Some data processing actions are a good fit because they are "embarrassingly parallel". Some data processing actions are a bad fit because they cannot be distributed. Most real-life cases are somewhere in between.
I strongly suspect that what you want to do is get a sample of the raw data with approximately 200k items. But your query requires exactly 200k items.
The simplest way for Hive to do that would be to run the WHERE clause in parallel (451 Mappers on 451+ file blocks), then dump all partial results into a single "sink" (1 Reducer) that lets the first 200k rows pass through and ignores the rest. But all records will be processed, even the ones to be ignored.
Bottom line: you have a very inefficient sampler, and the result will probably have a strong bias -- smaller file blocks will be Mapped faster and processed earlier by the Reducer, hence larger file blocks have almost no chance to be represented in the sample.
I guess you know how many records match the WHERE clause, so you would be better off with some kind of random sampling that retrieves approx. 500K or 1M records -- that can be done up front, inside each Mapper -- then a second query with the LIMIT if you really want an arbitrary number of records -- a single Reducer will be OK for this kind of smallish volume.
OK, this is what worked for me. It now takes only 2-5 minutes for about 27M records:
create table mb.ref201501_nontarget as SELECT * FROM adv.raf_201501 TABLESAMPLE(0.02 PERCENT) where target=0;
When using LIMIT or rand(), it uses at least 1 reducer, the process takes more than 2 hours, and it more or less freezes at the 33% reduce step.
With TABLESAMPLE and no LIMIT, it assigned only 1 mapper and 0 reducers.
BACKGROUND
I have a binary classification task where the data is highly imbalanced. Specifically, there is way more data with label 0 than with label 1. To address this, I plan to subsample the data with label 0 to roughly match the size of the data with label 1. I did this in a Pig script. Instead of sampling only one chunk of training data, I did this 10 times to generate 10 data chunks and train 10 classifiers, similar to bagging, to reduce variance.
SAMPLE PIG SCRIPT
---------------------------------
-- generate training chunk i
---------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
-- join the two tables to get all the features back from table 'dataFeatures'
trainingChunkiFeatures = JOIN trainingChunkiRaw BY id, dataFeatures BY id;
-- in order to shuffle the data, give a random number to each record
trainingChunki = FOREACH trainingChunkiFeatures GENERATE
    trainingChunkiRaw::id AS id,
    trainingChunkiRaw::label AS label,
    dataFeatures::features AS features,
    RANDOM() AS r;
-- shuffle the data
trainingChunkiShuffledRandom = ORDER trainingChunki BY r;
-- store this chunk of data into S3
trainingChunkiToStore = FOREACH trainingChunkiShuffledRandom GENERATE
    id AS id,
    label AS label,
    features AS features;
STORE trainingChunkiToStore INTO '$training_data_i_s3_path' USING PigStorage(',');
In my real Pig script, I do this 10 times to generate 10 data chunks.
PROBLEM
The problem I have is that if I choose to generate 10 chunks of data, there are a huge number of mapper/reducer tasks, more than 10K. The majority of the mappers do very little work (they run for less than 1 minute), and at some point the whole Pig script gets jammed: only one mapper/reducer task can run, and all the other mapper/reducer tasks are blocked.
WHAT I'VE TRIED
In order to figure out what was happening, I first reduced the number of chunks to generate to 3. The situation was less severe: there were roughly 7 or 8 mappers running at the same time, and again these mappers did very little work (about 1 minute each).
Then I increased the number of chunks to 5. At this point I observed the same problem as with 10 chunks: at some point there was only one mapper or reducer running, and all the other mappers and reducers were blocked.
I removed part of the script so that it only stores id and label, without the features:
--------------------------------------------------------------------------
-- generate training chunk i
--------------------------------------------------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
STORE trainingChunkiRaw INTO '$training_data_i_s3_path' USING PigStorage(',');
This worked without any problem.
Then I added the shuffling back:
--------------------------------------------------------------------------
-- generate training chunk i
--------------------------------------------------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
trainingChunki = FOREACH trainingChunkiRaw GENERATE
    id,
    label,
    features,
    RANDOM() AS r;
-- shuffle data
trainingChunkiShuffledRandom = ORDER trainingChunki BY r;
trainingChunkiToStore = FOREACH trainingChunkiShuffledRandom GENERATE
    id AS id,
    label AS label,
    features AS features;
STORE trainingChunkiToStore INTO '$training_data_i_s3_path' USING PigStorage(',');
The same problem reappeared. Even worse, at some point there was no mapper/reducer running at all; the whole program hung without making any progress. I added another machine, and the program ran for a few minutes before it jammed again. It looks like there are some dependency issues here.
WHAT'S THE PROBLEM
I suspect there is some dependency that leads to a deadlock. The confusing thing is that the data chunks are already generated before the shuffling, so I was expecting the shuffles to be executed in parallel, since the data chunks are independent of each other.
I also noticed that many mappers/reducers do very little work (they exist for less than 1 minute). In such cases, I would imagine the overhead of launching mappers/reducers is high; is there any way to control this?
What's the problem? Any suggestions?
Is there a standard way to do this sampling? I would imagine there are many cases where this kind of subsampling is needed, for example for bootstrapping or bagging, so there might be some standard way to do it in Pig, but I couldn't find anything useful online.
Thanks a lot
ADDITIONAL INFO
The table 'labelZeroTrainingData' is really small, around 16 MB gzipped.
The table 'labelZeroTrainingData' is also generated in the same Pig script, by filtering.
I ran the Pig script on 3 AWS c3.2xlarge machines.
The table 'dataFeatures' can be large, around 15 GB gzipped.
I didn't modify any default Hadoop configuration.
I checked disk space and memory usage: disk usage is around 40% and memory usage is around 90%. I'm not sure memory is the problem, since I was told that if memory were the issue, the whole task should fail.
After a while, I think I figured out something. The problem is likely the multiple STORE statements. It looks like a Pig script runs in batch mode by default, so there is a job running for each chunk of data, which leads to a lack of resources, e.g. mapper and reducer slots. None of the jobs can finish because each needs more mapper/reducer slots than are available.
SOLUTION
Use PiggyBank. There is a storage function called MultiStorage which might be useful in this case. I had some version incompatibility issues between PiggyBank and Hadoop, but it might work for you.
Disable Pig's batch execution of operations. Pig tries to optimize the execution; I simply disabled this multiquery feature by adding -M, so running the script looks something like pig -M -f pig_script.pig, which executes one statement at a time without any optimization. This might not be ideal because no optimization is done, but for me it's acceptable.
Use EXEC in Pig to enforce a certain execution order, which is helpful in this case.