We have time series data (timestamp in us since 1970 and integer data value):
# load data and cache it
df_cache = readInData() # read data from several files (partitioned by hour)
df_cache.persist(pyspark.StorageLevel.MEMORY_AND_DISK)
df_cache.agg({"data": "max"}).collect()
# now data is cached
df_cache.show()
+--------------------+---------+
| time| data|
+--------------------+---------+
|1.448409599861109E15|1551.7468|
|1.448409599871109E15|1551.7463|
|1.448409599881109E15|1551.7468|
Now we want to calculate some non-trivial things on top of 10-minute time windows using an external Python library. In order to do so, we need to load the data of each time window into memory, apply the external function, and store the result. Therefore a User Defined Aggregate Function (UDAF) is not an option.
Now the problem is, when we apply the GroupBy to the RDD, it is very slow.
(df_cache.rdd
    .groupBy(lambda x: int(x.time / 600e6))  # create 10-minute groups
    .map(lambda x: 1)                        # do some calculations, e.g. with the external library
    .collect())                              # get results
For 120 million samples (100 Hz data) on two nodes with 6 GB RAM, this operation takes around 14 minutes. Spark details for the groupBy stage:
Total Time Across All Tasks: 1.2 h
Locality Level Summary: Process local: 8
Input Size / Records: 1835.0 MB / 12097
Shuffle Write: 1677.6 MB / 379
Shuffle Spill (Memory): 79.4 GB
Shuffle Spill (Disk): 1930.6 MB
If I use a simple python script and let it iterate over the input files, it takes way less time to finish.
How can this job be optimized in spark?
The groupBy is your bottleneck here: it needs to shuffle the data across all partitions, which is time-consuming and takes a hefty amount of memory, as you can see from your metrics.
The way to go here is to use the reduceByKey operation, chaining it as follows:
df_cache.rdd.map(lambda x: (int(x.time / 600e6), (x.time, x.data))).reduceByKey(lambda x, y: 1).collect()  # placeholder reduction; replace with the real aggregation
The key takeaway here is that groupBy needs to shuffle all of your data across all partitions, whereas reduceByKey first reduces within each partition and then across partitions, drastically reducing the size of the global shuffle. Notice how I organized the input into (key, value) pairs to take advantage of the reduceByKey operation.
As I mentioned in the comments, you might also want to try your program using Spark SQL's DataFrame abstraction, which can potentially give you an extra boost thanks to its optimizer.
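For the DataFrame route, a minimal sketch could look like the following (assuming the time column holds microseconds since the epoch; the max("data") aggregation is only a stand-in for the external computation):
from pyspark.sql import functions as F
# Convert microseconds to a timestamp and group into 10-minute tumbling windows.
windowed = (
    df_cache
    .withColumn("ts", (F.col("time") / 1e6).cast("timestamp"))
    .groupBy(F.window("ts", "10 minutes"))
    .agg(F.max("data").alias("max_data"))  # placeholder for the real computation
)
windowed.show(truncate=False)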
Related
Data skew is something that happens often and should be detected and treated correctly. I'm able to detect data skew in a specific table using a groupBy/count query on the joining key, but I have multiple joins in my application and doing that for each join takes time.
So is it possible to detect data skew directly in the Spark web UI, which would save me time?
Data skew means that some partitions are significantly bigger than others.
I usually check 2 things. In the Stages tab, sort by decreasing duration, then click on the tasks that are slow:
1- Check Summary Metrics, which is one of the most important parts of the Spark UI. It gives you information about how your data is distributed among your partitions.
To detect skew, compare the duration in the Median and Max columns. Ideally the two values should be about the same; when the difference between them is large, there is definitely data skew, for example in the picture below:
This means that some tasks in that stage take far longer (31 min) than others that take only 1.1 minutes, because of partition size imbalance; the Min duration is also low, which indicates that some partitions are nearly empty.
2- At the bottom of the stage page you can find all tasks related to that stage. Sort them by decreasing duration, then by increasing duration, and make sure that the min and max durations are close; if not, your partitions are skewed, like in the picture below:
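As a programmatic complement to the UI check, here is a small sketch (df stands for whatever DataFrame you are about to join) that counts records per partition; a max count far above the median points to skew:
from pyspark.sql import functions as F
partition_sizes = (
    df.withColumn("partition_id", F.spark_partition_id())  # id of the partition holding each row
      .groupBy("partition_id")
      .count()
      .orderBy(F.desc("count"))
)
partition_sizes.show(5)  # compare the largest counts with the rest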
I have a pyspark application running on EMR for which I'd like to monitor some metrics.
For example, counting loaded and saved rows. Currently I use the count operation to extract these values, which, obviously, slows down the application. I was wondering whether there are better options to extract those kinds of metrics from a dataframe?
I'm using pyspark 2.4.5
If you're counting the full dataframe, try persisting the dataframe first, so that you don't have to run the computation twice.
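For example, a rough sketch of that pattern (output_path is just a placeholder):
from pyspark import StorageLevel
df = df.persist(StorageLevel.MEMORY_AND_DISK)    # keep the data around after the first action
loaded_rows = df.count()                         # the first action materializes the cache
df.write.mode("overwrite").parquet(output_path)  # later actions reuse the cached data
df.unpersist()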
If an approximate count is acceptable, you can sample before counting to speed things up.
sample_fraction = 0.01 # take a roughly 1% sample
sample_count = df.sample(fraction=sample_fraction).count() # count the sample
extrapolated_count = sample_count / sample_fraction # estimate the total count
There's also an approx_count_distinct function, if you need a count of distinct values for a particular column.
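For instance (the column name is hypothetical; rsd caps the relative error of the estimate):
from pyspark.sql import functions as F
df.agg(F.approx_count_distinct("user_id", rsd=0.05).alias("approx_distinct_users")).show()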
If you need an exact count, then use the Parquet or Delta Lake format to store the data. Both store statistics, so count results are fast (seconds).
If you can live without an exact count, you can use DataFrame.isEmpty, DataFrame.first, DataFrame.head(<number of rows>), etc., depending on your needs.
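A hedged sketch of both ideas (the Parquet path is a placeholder; len(df.head(1)) serves as a cheap emptiness check):
# Store as Parquet once; subsequent counts can use the file-level statistics
# instead of scanning every row.
df.write.mode("overwrite").parquet("s3://my-bucket/my-table/")  # placeholder path
fast_count = spark.read.parquet("s3://my-bucket/my-table/").count()
# Fetch at most one row instead of counting all of them.
is_empty = len(df.head(1)) == 0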
I want to ask about the number of Hive partitions and how it impacts performance.
Let me illustrate this with a real example:
I have an external table that is expected to receive around 500M rows per day from multiple sources, and it has 5 partition columns.
For one day, that resulted in 250 partitions, and with a 1 year retention it will reach around 75K, which I suppose is a huge number; when I checked, Hive can go to 10K partitions, but after that performance degrades (and someone told me that partitions should not exceed 1K per table).
Mainly, the queries that will select from this table:
50% of them will use the exact order of partitions.
25% will use only 1-3 partitions and not the other 2.
25% will use only the 1st partition.
So do you think that even with 1 month retention this may work well? Or would only the start date be enough, assuming a normal distribution of the other 4 columns (let's say 500M / 250 partitions, which gives about 2M rows per partition)?
I would go with 3 partition columns, since that will a) exactly match ~50% of your query profiles, and b) substantially reduce (prune) the number of scanned partitions for the other 50%. At the same time, you won't be pressured to increase your Hive MetaStore (HMS) heap memory and beef up the HMS backend database to work efficiently with 250 x 364 = 91,000 partitions.
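As an illustration only (column names and location are hypothetical, and a Hive-enabled SparkSession is assumed), three partition columns ordered by how often queries filter on them let the common queries prune partitions directly:
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
        id      BIGINT,
        payload STRING
    )
    PARTITIONED BY (event_date STRING, source STRING, region STRING)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/events/'
""")
# A query filtering on the leading partition columns only scans the
# matching partition directories (partition pruning).
spark.sql("""
    SELECT COUNT(*)
    FROM events
    WHERE event_date = '2023-01-01' AND source = 'web'
""").show()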
Since the time the 10K limit was introduced, significant efforts have been made to improve partition-related operations in HMS. See, for example, JIRA HIVE-13884, which provides the motivation to keep that number low and describes the way high numbers are being addressed:
The PartitionPruner requests either all partitions or partitions based
on filter expression. In either scenarios, if the number of partitions
accessed is large there can be significant memory pressure at the HMS
server end.
... PartitionPruner [can] first fetch the partition names (instead of
partition specs) and throw an exception if number of partitions
exceeds the configured value. Otherwise, fetch the partition specs.
Note that partition specs (mentioned above) and the statistics gathered per partition (always recommended for efficient querying) are what constitute the bulk of the data HMS must store and cache for good performance.
I have code similar to what follows:
val fileContent = sc.textFile("file:///myfile")
val dataset = fileContent.map(row => {
val explodedRow = row.split(",").map(s => s.toDouble)
new LabeledPoint(explodedRow(13), Vectors.dense(
Array(explodedRow(10), explodedRow(11), explodedRow(12))
))})
val algo = new LassoWithSGD().setIntercept(true)
val lambda = 0.0
algo.optimizer.setRegParam(lambda)
algo.optimizer.setNumIterations(100)
algo.optimizer.setStepSize(1.0)
val model = algo.run(dataset)
I'm running this in the cloud on my virtual server with 20 cores. The file is a "local" (i.e. not in HDFS) file with a few million rows. I run this in local mode, with sbt run (i.e. I don't use a cluster, I don't use spark-submit).
I would have expected this to get increasingly faster as I increase the spark.master=local[*] setting from local[8] to local[40]. Instead, it takes the same amount of time regardless of which setting I use (though I notice from the Spark UI that my executor has a maximum number of active tasks at any given time equal to the expected amount, i.e. ~8 for local[8], ~40 for local[40], etc., so it seems the parallelization works).
By default, the number of partitions of my dataset RDD is 4. I tried forcing the number of partitions to 20, without success; in fact, it slows the Lasso algorithm down even more...
Is my expectation of the scaling process incorrect? Can somebody help me troubleshoot this?
Is my expectation of the scaling process incorrect?
Well, kind of. I hope you don't mind if I use a little bit of Python to prove my point.
Let's be generous and say a few million rows is actually ten million. With 40,000,000 values (intercept + 3 features + label per row) this gives around 380 MB of data (a Java Double is a double-precision 64-bit IEEE 754 floating point). Let's create some dummy data:
import numpy as np
n = 10 * 1000**2
X = np.random.uniform(size=(n, 4)) # Features
y = np.random.uniform(size=(n, 1)) # Labels
theta = np.random.uniform(size=(4, 1)) # Estimated parameters
Each step of gradient descent (since the default miniBatchFraction for LassoWithSGD is 1.0, it is not really stochastic), ignoring regularization, requires an operation like this:
def step(X, y, theta):
return ((X.dot(theta) - y) * X).sum(0)
So let's see how long it takes locally on our data:
%timeit -n 15 step(X, y, theta)
## 15 loops, best of 3: 743 ms per loop
Less than a second per step, without any additional optimizations. Intuitively this is pretty fast and it won't be easy to match. Just for fun, let's see how long it takes to get a closed-form solution for data like this:
%timeit -n 15 np.linalg.inv(X.transpose().dot(X)).dot(X.transpose()).dot(y)
## 15 loops, best of 3: 1.33 s per loop
Now let's go back to Spark. Residuals for a single point can be computed in parallel, so this is the part that scales linearly when you increase the number of partitions processed in parallel.
The problem is that you have to aggregate the data locally, serialize it, transfer it to the driver, deserialize it, and reduce locally to get the final result after each step. Then you have to compute the new theta, serialize it, send it back, and so on.
All of that can be improved by proper use of mini batches and some further optimizations, but at the end of the day you are limited by the latency of the whole system. It is worth noting that when you increase parallelism on the worker side, you also increase the amount of work that has to be performed sequentially on the driver, and vice versa. One way or another, Amdahl's law will bite you.
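To make that round trip concrete, here is a hedged sketch of a single distributed gradient step over an RDD of (features, label) NumPy pairs (the helper names are hypothetical); the reduce() call is the aggregate-and-return-to-driver part described above:
import numpy as np

def point_gradient(row, theta):
    x, y = row                       # x: features including the intercept, y: label
    return (x.dot(theta) - y) * x    # gradient contribution of a single point

def distributed_step(points_rdd, theta, step_size, n):
    # Partial gradients are computed on the executors in parallel,
    # then summed and shipped back to the driver by reduce().
    grad = points_rdd.map(lambda row: point_gradient(row, theta)) \
                     .reduce(lambda a, b: a + b)
    # The driver updates theta and ships it out again with the next closure.
    return theta - step_size * grad / n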
Also all of the above ignores actual implementation.
Now let's perform another experiment. First, some dummy data:
nCores = 8 # Number of cores on local machine I use for tests
rdd = sc.parallelize([], nCores)
and benchmark:
%timeit -n 40 rdd.mapPartitions(lambda x: x).count()
## 40 loops, best of 3: 82.3 ms per loop
This means that with 8 cores, without any real processing or network traffic, we reach the point where we cannot do much better by increasing parallelism in Spark (743 ms / 8 = 92.875 ms per partition, assuming linear scalability of the parallelized part).
To summarize the above:
if the data can easily be processed locally with a closed-form solution, using gradient descent is just a waste of time. If you want to increase parallelism / reduce latency, use a good linear algebra library
Spark is designed to handle large amounts of data, not to reduce latency. If your data fits in the memory of a few-years-old smartphone, it is a good sign that Spark is not the right tool
if computations are cheap, then constant costs become the limiting factor
Side notes:
a relatively large number of cores per machine is, generally speaking, not the best choice unless you can match it with I/O throughput
BACKGROUND
I have a binary classification task where the data is highly imbalanced. Specifically, there is far more data with label 0 than with label 1. To address this, I plan to subsample the data with label 0 to roughly match the size of the data with label 1. I did this in a Pig script. Instead of sampling only one chunk of training data, I did this 10 times to generate 10 data chunks and train 10 classifiers, similar to bagging, to reduce variance.
SAMPLE PIG SCRIPT
---------------------------------
-- generate training chunk i
---------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData '$RATIO';
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
-- join two tables to get all the features back from table 'dataFeatures'
trainingChunkiFeatures = JOIN trainingChunkiRaw BY id, dataFeatures BY id;
-- in order to shuffle the data, I assign a random number to each record
trainingChunki = FOREACH trainingChunkiFeatures GENERATE
trainingChunkiRaw::id AS id,
trainingChunkiRaw::label AS label,
dataFeatures::features AS features,
RANDOM() AS r;
-- shuffle the data
trainingChunkiShuffledRandom = ORDER trainingChunki BY r;
-- store this chunk of data into s3
trainingChunkiToStore = FOREACH trainingChunkiShuffledRandom GENERATE
id AS id,
label AS label,
features AS features;
STORE trainingChunkiToStore INTO '$training_data_i_s3_path' USING PigStorage(',');
In my real pig script, I do this 10 times to generate 10 data chunks.
PROBLEM
The problem I have is that if I choose to generate 10 chunks of data, there are a huge number of mapper/reducer tasks, more than 10K. The majority of mappers do very little work (running less than 1 min). And at some point the whole Pig script gets jammed: only one mapper/reducer task can run and all the other mapper/reducer tasks are blocked.
WHAT I'VE TRIED
In order to figure out what was happening, I first reduced the number of chunks to generate to 3. The situation was less severe: there were roughly 7 or 8 mappers running at the same time. Again, these mappers did very little work (running about 1 min).
Then I increased the number of chunks to 5. At this point I observed the same problem as when I set the number of chunks to 10: at some point there was only one mapper or reducer running and all the other mappers and reducers were blocked.
I removed part of the script to store only id and label, without features:
--------------------------------------------------------------------------
-- generate training chunk i
--------------------------------------------------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
STORE trainingChunkiRaw INTO '$training_data_i_s3_path' USING PigStorage(',');
This worked without any problem.
Then I added the shuffling back
--------------------------------------------------------------------------
-- generate training chunk i
--------------------------------------------------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
trainingChunki = FOREACH trainingChunkiRaw GENERATE
id,
label,
features,
RANDOM() AS r;
-- shuffle data
trainingChunkiShuffledRandom = ORDER trainingChunki BY r;
trainingChunkiToStore = FOREACH trainingChunkiShuffledRandom GENERATE
id AS id,
label AS label,
features AS features;
STORE trainingChunkiToStore INTO '$training_data_i_s3_path' USING PigStorage(',');
The same problem reappeared. Even worse, at some point there was no mapper/reducer running at all; the whole program hung without making any progress. I added another machine and the program ran for a few minutes before it jammed again. It looks like there are some dependency issues here.
WHAT'S THE PROBLEM
I suspect there is some dependency that leads to a deadlock. The confusing thing is that I already generate the data chunks before shuffling. I was expecting the shuffling to be executed in parallel since these data chunks are independent of each other.
Also, I noticed that many mappers/reducers do very little work (existing for less than 1 min). In that case, I would imagine the overhead of launching mappers/reducers is high; is there any way to control this?
What's the problem? Any suggestions?
Is there a standard way to do this kind of sampling? I would imagine there are many cases where we need such subsampling, like bootstrapping or bagging, so there might be a standard way to do this in Pig. I couldn't find anything useful online.
Thanks a lot
ADDITIONAL INFO
The size of table 'labelZeroTrainingData' is really small, around 16 MB gzipped.
Table 'labelZeroTrainingData' is also generated in the same Pig script by filtering.
I ran the Pig script on 3 AWS c3.2xlarge machines.
Table 'dataFeatures' could be large, around 15 GB gzipped.
I didn't modify any default Hadoop configuration.
I checked the disk space and memory usage. Disk space usage is around 40% and memory usage is around 90%. I'm not sure memory is the problem, since I was told that if memory were the issue, the whole task should fail.
After a while, I think I figured something out. The problem is likely the multiple STORE statements. It looks like a Pig script runs in batch mode by default, so for each chunk of data there is a job running, which leads to a lack of resources, e.g. mapper and reducer slots. None of the jobs could finish because each needed more mapper/reducer slots.
SOLUTION
Use piggybank. There is a storage function called MultiStorage which might be useful in this case. I had a version incompatibility issue between piggybank and Hadoop, but it might work for you.
Disable Pig's batch execution of operations. Pig tries to optimize the execution; I simply disabled this multiquery feature by adding -M. So when you run the Pig script, it looks something like pig -M -f pig_script.pg, which executes one statement at a time without any optimization. This might not be ideal because no optimization is done, but for me it's acceptable.
Use EXEC in Pig to enforce a certain execution order, which is helpful in this case.