higher SAS voluntary wait time reading observations - performance

I have a unique requirement. I have 4 SAS datasets in a SAS library (p4, A, B, C). I have to read 1 observation from A, do some processing, and store it in a temp dataset; read 1 observation from B, do some processing, and store it in a temp dataset; read 1 observation from C, do some processing, and store it in a temp dataset; then write all the temp datasets out in JSON format. I am using multiple DATA steps and PROC JSON. It is taking an unusually long 'wall clock' time. Below is the sample code. Is there a better way to do this that avoids the program load delay and voluntary wait?

Related

Spring batch to process huge data

I have around 10 million files stored in my database in blob format which I need to convert and save in PDF format. Each file is around 0.5 to 10 MB, and the combined size of all files is around 20 TB. I'm trying to implement the functionality using Spring Batch. However, my question is: when I run the batch, can the server memory hold that much data? I'm trying to use chunk-based processing and a thread pool task executor. Please suggest whether this is the best approach to process that much data in less time.
Each file is 0.5 to 10 MB, so the chunk-based approach you mentioned is a good fit. You can get more control, and monitor the processing, with the following:
Create partitions from the file table based on the thread pool count (sized to your system resources).
Have each partition's reader step select only 1 file at a time.
Calculate the memory needed from the number of parallel steps and pass it as a JVM argument.
Configure the commit chunk based on the memory calculation for the total parallel steps (a rough sizing sketch follows below).
Please refer to the following for example code.
Spring Batch multiple process for heavy load with multiple thread under every process
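As a rough, back-of-the-envelope illustration of the memory sizing mentioned above (the partition count, chunk size, and overhead factor here are purely assumed numbers, not from the answer):
# Hypothetical figures; substitute your own measurements.
partitions = 8           # parallel partition steps (roughly the thread pool size)
chunk_size = 10          # items held per transaction in each step
max_file_mb = 10         # worst-case blob size from the question
overhead_factor = 3      # rough allowance for in-memory PDF conversion buffers

working_set_mb = partitions * chunk_size * max_file_mb * overhead_factor
print(f"Approximate peak working set: {working_set_mb} MB")  # size -Xmx with headroom above this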

What is the best way to count rows in spark data frame for monitoring?

I have a pyspark application running on EMR for which I'd like to monitor some metrics.
For example, the number of rows loaded and saved. Currently I use the count operation to extract these values, which obviously slows down the application. I was wondering whether there are better options for extracting these kinds of metrics from a dataframe?
I'm using pyspark 2.4.5.
If you're counting the full dataframe, try persisting the dataframe first, so that you don't have to run the computation twice.
If an approximate count is acceptable, you can sample before counting to speed things up.
sample_fraction = 0.01 # take a roughly 1% sample
sample_count = df.sample(fraction=sample_fraction).count() # count the sample
extrapolated_count = sample_count / sample_fraction # estimate the total count
There's also an approx_count_distinct function, if you need a count of distinct values for a particular column.
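A minimal usage sketch of that function (the column name is just a placeholder):
from pyspark.sql import functions as F
# Approximate distinct count for one column; rsd is the allowed relative error.
approx_distinct = df.agg(F.approx_count_distinct("some_column", rsd=0.05)).collect()[0][0]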
If you need an exact count, store the data in Parquet or Delta Lake format. These formats store statistics, so count results come back fast (in seconds).
If you can live without an exact count, you can use Dataframe.isEmpty, Dataframe.first, Dataframe.head(<number of rows>), etc., depending on what you actually need.
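For instance, if all you really need is an emptiness check rather than a full count, something like this only fetches a single row (a sketch that sticks to head, since the question is on pyspark 2.4):
# True if the dataframe has at least one row; only one row is pulled to the driver.
has_rows = len(df.head(1)) > 0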

pig script to sample 10 chunks of training data, pig script is jammed

BACKGROUND
I have a binary classification task where the data is highly imbalanced: there is far more data with label 0 than with label 1. To deal with this, I plan to subsample the data with label 0 to roughly match the size of the data with label 1. I did this in a Pig script. Instead of sampling only one chunk of training data, I did this 10 times to generate 10 data chunks and train 10 classifiers, similar to bagging, to reduce variance.
SAMPLE PIG SCRIPT
---------------------------------
-- generate training chunk i
---------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
-- join two tables to get all the features back from table 'dataFeatures'
trainingChunkiFeatures = JOIN trainingChunkiRaw BY id, dataFeatures BY id;
-- in order to shuffle the data, I give a random number to each record
trainingChunki = FOREACH trainingChunkiFeatures GENERATE
trainingChunkiRaw::id AS id,
trainingChunkiRaw::label AS label,
dataFeatures::features AS features,
RANDOM() AS r;
-- shuffle the data
trainingChunkiShuffledRandom = ORDER trainingChunki BY r;
-- store this chunk of data into s3
trainingChunkiToStore = FOREACH trainingChunkiShuffledRandom GENERATE
id AS id,
label AS label,
features AS features;
STORE trainingChunkiToStore INTO '$training_data_i_s3_path' USING PigStorage(',');
In my real pig script, I do this 10 times to generate 10 data chunks.
PROBLEM
The problem is that if I choose to generate 10 chunks of data, there are a huge number of mapper/reducer tasks, more than 10K. The majority of the mappers do very little work (each runs for less than a minute). And at some point the whole Pig script gets jammed: only one mapper/reducer task can run, and all the other mapper/reducer tasks are blocked.
WHAT I'VE TRIED
To figure out what was happening, I first reduced the number of chunks to 3. The situation was less severe: there were roughly 7 or 8 mappers running at the same time, and again these mappers did very little work (each ran for about a minute).
Then I increased the number of chunks to 5, and at that point I observed the same problem as with 10 chunks: at some point there was only one mapper or reducer running, and all the other mappers and reducers were blocked.
I removed part of the script so that it only stores id and label, without the features:
--------------------------------------------------------------------------
-- generate training chunk i
--------------------------------------------------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
STORE trainingChunkiRaw INTO '$training_data_i_s3_path' USING PigStorage(',');
This worked without any problem.
Then I added the shuffling back:
--------------------------------------------------------------------------
-- generate training chunk i
--------------------------------------------------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
trainingChunki = FOREACH trainingChunkiRaw GENERATE
id,
label,
features,
RANDOM() AS r;
-- shuffle data
trainingChunkiShuffledRandom = ORDER trainingChunki BY r;
trainingChunkiToStore = FOREACH trainingChunkiShuffledRandom GENERATE
id AS id,
label AS label,
features AS features;
STORE trainingChunkiToStore INTO '$training_data_i_s3_path' USING PigStorage(',');
The same problem reappeared. Even worse, at some point there was no mapper/reducer running at all; the whole program hung without making any progress. I added another machine and the program ran for a few minutes before it jammed again. It looks like there is some dependency issue here.
WHAT'S THE PROBLEM
I suspect there is some dependency that leads to a deadlock. The confusing thing is that the data chunks are already generated before the shuffling, so I was expecting the shuffling to be executed in parallel, since these data chunks are independent of each other.
Also, I noticed that many mappers/reducers do very little work (each exists for less than a minute). In that case I would imagine the overhead of launching mappers/reducers is high; is there any way to control this?
What's the problem, any suggestions?
Is there a standard way to do this kind of sampling? I would imagine there are many cases that need this sort of subsampling, such as bootstrapping or bagging, so there might be a standard way to do it in Pig. I couldn't find anything useful online.
Thanks a lot
ADDITIONAL INFO
The size of the table 'labelZeroTrainingData' is really small, around 16 MB gzipped.
The table 'labelZeroTrainingData' is also generated in the same Pig script, by filtering.
I ran the Pig script on 3 AWS c3.2xlarge machines.
The table 'dataFeatures' can be large, around 15 GB gzipped.
I didn't modify any default Hadoop configuration.
I checked the disk space and memory usage: disk space usage is around 40% and memory usage is around 90%. I'm not sure memory is the problem, since I was told that if memory were the issue, the whole task would fail.
After a while, I think I figured out something. The problem is likely the multiple STORE statements. It seems a Pig script runs in batch (multi-query) mode by default, so for each chunk of data there is a job running, which leads to a shortage of resources, e.g. mapper and reducer slots. None of the jobs can finish because each needs more mapper/reducer slots.
SOLUTION
Use piggybank. There is a storage function called MultiStorage which might be useful in this case. I had some version incompatibility issues between piggybank and Hadoop, but it might work for you.
Disable Pig's batch execution. Pig tries to optimize execution across the whole script; I simply disabled this multi-query feature by adding -M, so the script is run as something like pig -M -f pig_script.pg, which executes one statement at a time without any cross-query optimization. This might not be ideal because no optimization is done; for me, it's acceptable.
Use EXEC in Pig to enforce a certain execution order, which is helpful in this case.

Scripting and rrdtool

I have a case where I have collected a lot of SNMP data and stored it via rrdtool (using OpenNMS).
My goal is to determine, among hundreds of servers, which ones haven't had memory usage exceed a certain amount (say, 64 GB) within the past six months.
My plan was to write a bash script to extract and process the data from rrdtool, but I'm unsure how to start. This seems like a common enough task that I felt I should ask here in case anyone has any ideas.
Thanks!
RRDTool terminology:
RRD : The RRDTool database file
DS : Data Source. One of the variables being measured
RRA : Round-Robin Archive. A consolidation archive defined in the RRD file
CF : Consolidation Function. MAX, MIN, AVERAGE or LAST; how the RRA consolidates the data
DP : Data Point. A sample of data as stored into the RRD before consolidation
CDP : Consolidated Data Point. A data point in an RRA, which corresponds to one or more DPs merged using the CF of that RRA.
I would suggest doing this in two parts.
First, extract the maximum of the values over the time period for each DS. This step is simplified considerably if you create an RRA with a MAXIMUM CF and an appropriate granularity, such as 1 day. How you do the extract will depend on whether you have a single RRD with many DSs, or many RRDs with one DS each; however, you will need to use rrdtool xport and NOT rrdtool fetch to retrieve the data, so that you get a single data value for each DS. The xport function of rrdtool will let you further consolidate your 1 CDP == 1 day RRA down to a single CDP; do this by setting the step to 6 months and forcing your DEF to use a MAX CF. The reason we use a 1-day RRA rather than a 6-month one is so that we can run the calculation on any date, not just once every 6 months.
Assuming your file is data1.rrd, containing a single DS dsname for host host1:
rrdtool xport --end now --start "end - 6 months" --step 15552000 --maxrows 1 \
    DEF:x=data1.rrd:dsname:MAX \
    XPORT:x:host1
Next, you will need to threshold and filter these to get the list of DS that have a MAX value below your threshold. This would be a simple process in bash that shouldn't tax you!
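The answer leaves that filtering step to the reader; as one possible sketch (written in Python rather than bash, and with the RRD path layout, DS name, and threshold all invented for illustration), you could loop over the RRD files, call rrdtool xport as above, and pull the single consolidated MAX value out of the XML it prints:
import glob
import re
import subprocess

THRESHOLD_MB = 64 * 1024   # hypothetical 64 GB threshold, in MB

for rrd in glob.glob("/var/rrd/*/mem_usage.rrd"):   # made-up path layout
    out = subprocess.run(
        ["rrdtool", "xport", "--end", "now", "--start", "end-6months",
         "--step", "15552000", "--maxrows", "1",
         f"DEF:x={rrd}:dsname:MAX", "XPORT:x"],
        capture_output=True, text=True, check=True,
    ).stdout
    # xport prints XML; grab the single consolidated value from its <v> element
    match = re.search(r"<v>([^<]+)</v>", out)
    if match and match.group(1) not in ("NaN", "nan"):
        if float(match.group(1)) < THRESHOLD_MB:
            print(rrd)   # this host's MAX stayed below the threshold for 6 months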

Spring Batch - Performance Tuning

We developed a Spring Batch application in which we have two flows: 1. Forward, 2. Backward. We only use file read/write; no DB is involved.
Forward scenario: the input file has records with 22 fields. The 22 fields are converted into 32 fields by doing some operations, like sequence number generation and adding a few filler fields. Based on country codes, the output is split into at most 3 files. Each chunk has 250K records (if there are millions of records, multiple files are generated for the same country).
For 8 million records it takes 36 minutes.
The 8 million records are in a single file.
We are using Spring Batch with 1000 threads.
Backward flow: the input file has 82 fields for each record. These 82 fields are converted into 86 fields. Two fields are added in between, taken from the forward flow's input file; the other fields are simply copied across. Error records also have to be written to an error file; the error records are just the actual input records that came in for the forward flow. To track them, we persist the sequence number and the actual record in a file; this is done in the forward flow itself. In the backward flow we take that persisted file and compare the sequence numbers; anything missing is written to the error records as key/value pairs. This is done after the backward flow completes.
The maximum size of an input file is 250K records.
For 8 million records it takes 1 hour 8 minutes, which is far too slow.
32 files (each with 250K records) form the input for this flow.
No threads are used in the backward flow; I don't know how threading would behave here. I tried, but the process hung.
Server configuration:
12 CPUs and 64 GB RAM, Linux.
Can you help improve the performance, given that we have 12 CPUs / 64 GB RAM?
You are already using 1000 threads, and that is a very high number. I have fine-tuned Spring Batch jobs, and this is what I have done:
1. Reduce network traffic: try to reduce the number of calls to the database or file system in each process. Can you get all the info you need in one shot and keep it in memory for the life of the thread? I have used org.apache.commons.collections.map.MultiKeyMap for storage and retrieval of data.
For example, in your case you need the sequence number comparison, so load all the sequence numbers into one map before you start the process (see the sketch after this list). You can store the ids (if there aren't too many) in the step execution context.
2. Write less frequently: keep accumulating the info you need to write for a while and then write it all out at the end.
3. Set unused objects to null at the end of processing to expedite GC.
4. Check your GC frequency through VisualVM or JConsole. You should see frequent GC happening while your process is running, which means objects are being created and garbage collected. If your memory graph keeps increasing, something is wrong.
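As a small, language-agnostic sketch of the 'load everything once, then compare in memory' idea in point 1 (the answer itself used a Java MultiKeyMap; the file name and layout here are invented for illustration):
# Hypothetical sketch: read the forward flow's persisted sequence numbers once,
# then do every comparison in memory instead of re-reading the file per record.
def find_missing(persisted_path, processed_sequence_numbers):
    with open(persisted_path) as f:
        expected = {line.strip() for line in f if line.strip()}
    # Anything persisted by the forward flow but never processed backward is an error record.
    return expected - set(processed_sequence_numbers)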

Resources