I have a program written to parallelize the process; cache has been applied after certain transformations on DataFrames. Let's say:
df1 = df.filter()
df3 = df1.join(df2, join_cond, "left")
df3.cache() #ex: it has col1, col2, col3, col4 columns
After the cache, we have some other steps to take care of:
#1
df4 = df3.select(df3.col1, df3.col2)
df4.filter(df3.col1 > 500).show()
#2
df5 = df3.select(df3.col3, df3.col4)
df5.filter(df3.col4 > 2000)
df3.unpersist()
So, in this process, if any issue or error occurs, do we have to uncache the DataFrame, or will the old cache be destroyed automatically when we rerun the program?
Could you please help me understand how cache() behaves if there is any kind of failure in the program at a certain point in time.
Thanks
cache() persists the lazily evaluated result in memory, so after caching, any downstream transformation can start directly from scanning the DataFrame in memory.
Action vs. transformation: an action produces a non-RDD, non-DataFrame result, like .show() in your code; a transformation produces another RDD/DataFrame, like .filter, .select, and .join in your code.
Based on your code snippet alone, there is no problem: df4's only dependency is the cached df3 (df3 -> df4), and there is just one action. But if you call df5.filter().show() or df4.show() again, it becomes a problem. Because you unpersisted df3, there is no data in memory, so in order to regenerate df4 the Spark application needs to start over from df1 -> df2 -> df3 -> df4.
Does unpersist() break your code? No, but it definitely influences the performance of your application. I would double-check that a persisted DataFrame is no longer needed by any further downstream job, and only then unpersist it.
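For illustration, here is a minimal sketch of keeping the cache alive until every downstream action has run; the tiny stand-in DataFrames, column names and filter conditions are made up, only the cache()/unpersist() placement is the point. Note also that cached blocks live only inside the running Spark application: if the program fails before unpersist(), the cache simply disappears when the application stops, so a rerun starts with a fresh, empty cache.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# stand-ins for your df and df2; the column names are made up
df = spark.createDataFrame([(1, 600, 10, 3000), (2, 400, 20, 1000)],
                           ["id", "col1", "col2", "col3"])
df2 = spark.createDataFrame([(1, 4000), (2, 1500)], ["id", "col4"])

df3 = df.filter("col1 is not null").join(df2, "id", "left")
df3.cache()  # materialized by the first action below

df3.select("col1", "col2").filter("col1 > 500").show()   # action 1: fills and reads the cache
df3.select("col3", "col4").filter("col4 > 2000").show()  # action 2: reads the cache again

df3.unpersist()  # only after all downstream actions are done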
I'm trying to find a more efficient and speedier way (if possible) to pull subsets of observations that meet certain criteria from multiple hospital claims datasets in SAS. A simplified but common type of data pull would look like this:
data out.qualifying_patients;
  set in.state1_2017
      in.state1_2018
      in.state1_2019
      in.state1_2020
      in.state2_2017
      in.state2_2018
      in.state2_2019
      in.state2_2020;
  array prcode{*} I10_PR1-I10_PR25;
  do i=1 to 25;
    if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort=1;
  end;
  if cohort=1 then output;
run;
Now imagine that instead of 2 states and 4 years we have 18 states and 9 years -- each about 1GB in size. The code above works fine but it takes FOREVER to run on our non-optimized server setup. So I'm looking for alternate methods to perform the same task but hopefully at a faster clip.
I've tried including (KEEP=) or (DROP=) statements for each dataset included in the SET statement to limit the variables being scanned, but this really didn't have much of an impact on speed -- and, for non-coding-related reasons, we pretty much need to pull all the variables.
I've also experimented a bit with hash tables but it's too much to store in memory so that didn't seem to solve the issue. This also isn't a MERGE issue which seems to be what hash tables excel at.
Any thoughts on other approaches that might help? Every data pull we do contains customized criteria for a given project, but we do these pulls a lot and it seems really inefficient to constantly be processing thru the same datasets over and over but not benefitting from that. Thanks for any help!
I happened to have a 1GB dataset on my computer. I tried several times, and it takes SAS no more than 25 seconds to SET the dataset 8 times. I think the SET statement is too simple and basic to offer much room for efficiency gains.
I think the issue may be located in the DO loop. Your program runs the loop 25 times for each record and may assign to cohort more than once, which is not necessary. You can change it like this:
do i=1 to 25 until(cohort=1);
  if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort=1;
end;
This can save a lot of loop iterations.
First, parallelization will help immensely here. Instead of running one job, one dataset after the next, run one job per state, or one job per year, or whatever makes sense for your dataset size and CPU count. (You don't want more than one job per CPU.) If your server has 32 cores, then you can easily run all the jobs you need here - one per state, say - and then, after that's done, combine the results together.
Look up SAS MP Connect for one way to do multiprocessing, which basically uses rsubmits to submit code to your own machine. You can also do this by using XCMD to literally launch SAS sessions - add a state parameter to the SAS program, then run 18 of them, have them output their results to a known location tagged with the state name or number, and then have your program collect them.
Second, you can optimize the DO loop further - in addition to the suggestions above, you may be able to optimize using pointers. SAS stores character array variables in memory in adjacent spots (assuming they all come from the same place) - see From Obscurity to Utility: ADDR, PEEK, POKE as DATA Step Programming Tools by Paul Dorfman for more details. On page 10, he shows the method I describe here: you use PEEKC to get the concatenated values and then INDEXW to find the thing you want.
data want;
  set have;
  array prcode{*} $8 I10_PR1-I10_PR25;
  found = (^^ indexw(peekc(addr(prcode[1]), 200), '0DTJ0ZZ')) or
          (^^ indexw(peekc(addr(prcode[1]), 200), '0DTJ4ZZ'));
run;
Something like that should work. It avoids the loop.
You also could, if you want to keep the loop, exit the loop once you run into an empty procedure code. Usually these things don't go all the way to 25, at least in my experience - they're left-filled, so I10_PR1 is always filled, then some of them - say, 5 or 10 - are filled, and I10_PR11 onward are empty; once you hit an empty one, you're done for that row. So leaving the loop not only when you hit what you're looking for, but also when you hit an empty value, saves you a lot of processing time.
You probably should consider a hardware upgrade or find someone who can tune your server. This paper suggests tips to improve the processing of large datasets.
Your code is pretty straightforward. The only suggestion is to exit the loop as soon as the criterion is met, to avoid wasting resources.
do i=1 to 25;
  if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then do;
    output; * cohort criteria met so output the row;
    leave;  * exit the loop immediately;
  end;
end;
I've been reading about PySpark caching and how execution works. It is clear to me how using .cache() helps when multiple actions trigger the same computation:
df = sc.sql("select * from table")
df.count()
df = df.where({something})
df.count()
can be improved by doing:
df = sc.sql("select * from table").cache()
df.count()
df = df.where({something})
df.count()
However, it is not clear to me whether and why it would be advantageous without intermediate actions:
df = sc.sql("select * from table")
df2 = sc.sql("select * from table2")
df = df.where({something})
df2 = df2.where({something})
df3 = df.join(df2).where({something})
df3.count()
In this type of code (where we have only one final action), is cache() useful?
Being straight to the point: no, in that case it would not be useful.
Transformations have lazy evaluation in Spark. I.e., they are recorded but the execution needs to be triggered by an Action (such as your count).
So, when you execute df3.count() it will evaluate all the transformations up to that point.
If you do not perform another action, then it is certain that adding .cache() anywhere will not provide any performance improvement.
However, even if you do more than one action, .cache() [or .checkpoint(), depending on your problem] sometimes does not provide any performance increase. It highly depends on your problem and on the cost of your transformations - e.g., a join can be very costly.
Also, if you are running Spark in its interactive shell, .checkpoint() can sometimes be better suited after costly transformations.
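For contrast, here is a minimal sketch of the case where cache() does pay off, i.e. where more than one action reuses the same costly result. The in-memory stand-in DataFrame and its column names are made up; in real code the source would be spark.sql(...):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# stand-in for "select * from table"; the columns are illustrative
df = spark.createDataFrame([(1, "US", 150), (2, "DE", 80), (3, "US", 300)],
                           ["id", "country", "amount"])

df3 = df.where("amount > 100")           # pretend this is the expensive join + filter
df3.cache()                              # worth it only because two actions reuse df3

df3.count()                              # action 1: computes df3 and fills the cache
df3.groupBy("country").count().show()    # action 2: served from the cache, no recomputation

df3.unpersist()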
So I have a project idea that requires me to process incoming realtime data and constantly track some metrics about the realtime data. Then every now and then I want to be able to request for the metrics I am calculating and do some stuff with that data.
Currently I have a simple Python script that uses the socket library to get the realtime data. It is basically just...
metric1 = 0
metric2 = ''
while True:
    response = sock.recv(512).decode('utf-8')
    if response.startswith('PING'):
        sock.send("PONG\n".encode('utf-8'))
    else:
        process(response)
In the above, process(response) will update metric1 and metric2 with data from each response. (For example, they might be the mean of len(response) and the most common response, respectively.)
What I want to do is run the above script constantly after starting up the project and occasionally query for metric1 and metric2 in a script I have running locally. I am guessing that I will have to look into running code on a server which I have very little experience with.
What are the most accessible tools to do what I want? I am pretty comfortable with a variety of languages so if there is a library or tool in another language that is better suited for all of this, please tell me about it
Thanks!
I worked on a similar project, not sure if it specifically can be applied to your case, but maybe it can give you a starting point.
Although I am very aware it's not best practice to use pandas DataFrames for real-time purposes, in my case it's just fast enough (I am actually open to suggestions on how to improve my workflow!). Here is my code:
import pandas as pd
from io import StringIO

all_prices = pd.DataFrame()

def readprice():
    global all_prices
    msg = mysock.recv(16384)  # mysock is the already-connected socket
    msg_stringa = str(msg, 'utf-8')
    new_price = pd.read_csv(StringIO(msg_stringa), sep=";", error_bad_lines=False,
                            index_col=None, header=None, engine='c', names=range(33),
                            decimal='.')
    ...
    ...
    all_prices = all_prices.append(new_price, ignore_index=True).copy()
So the 'all_prices' pandas DataFrame is global; new prices get appended to it. This global DF can be used by other functions in order to read the content, etc. Be very careful about sharing the variable between two or more threads; it can lead to errors.
More info here: http://www.laurentluce.com/posts/python-threads-synchronization-locks-rlocks-semaphores-conditions-events-and-queues/
In my case, I don't share the DF with a parallel thread; other threads are launched after the append, not in the meantime.
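If you do end up reading the metrics (or the DataFrame) from another thread while the receive loop keeps updating them, one common pattern is to guard the shared state with a lock. Here is a minimal sketch; the names are illustrative and not from the code above:
import threading

metrics_lock = threading.Lock()
metrics = {"metric1": 0, "metric2": ""}

def update_metrics(response):
    # called from the receive loop for every message
    with metrics_lock:
        metrics["metric1"] += len(response)
        metrics["metric2"] = response

def read_metrics():
    # called from whichever thread serves the occasional queries
    with metrics_lock:
        return dict(metrics)  # return a copy so the caller never sees a half-updated state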
I was wondering how I can configure HBase to store just the first version of each cell. Suppose the following HTable:
row_key cf1:c1 timestamp
----------------------------------------
1 x t1
After putting ("1","cf1:c2",t2) in the scenario of ColumnDescriptor.DEFAULT_VERSIONS = 2 the mentioned Htable becomes:
row_key cf1:c1 timestamp
----------------------------------------
1 x t1
1 x t2
where t2>t1.
My question is how I can change this scenario so that the first version of the cell is the only version that can be stored and retrieved. In the provided example, the only version kept would be the 't1' one. In other words, I want HBase to ignore insertions of duplicates.
I know that setting VERSIONS to 1 for the HTable and putting based on Long.MAX_VALUE - System.currentTimeMillis() would solve my problem, but I don't know whether it is the best solution or not. What is the concern with changing the timestamp to Long.MAX_VALUE - System.currentTimeMillis()? Does it have any performance implications?
There are two strategies that I can think of:
1. One version + inverted timestamp
Setting VERSIONS to 1 for Htable and putting based on Long.MAX_VALUE - System.currentTimeMillis() will generally work and does not have any major performance issues.
On write:
When multiple versions of the same cell are written to HBase, at any point in time, all versions will be written (without any impact on performance). After compaction only the cell with the highest timestamp will survive.
The cell with the highest timestamp in this scheme is the one written by the client with the lowest value for System.currentTimeMillis(). It should be noted that this might not actually be the machine that tried to write to the cell first, since HBase clients might be out of sync.
On read:
When multiple versions of the same cell are found pruning will occur at that time. This can happen at any time, since your writes can occur at any time, even after compaction. This has a very slight impact on performance.
2. checkAndPut
To get true ordering through atomicity, meaning only the first write to reach the region server will succeed, you can use the checkAndPut operation:
From the docs:
public boolean checkAndPut(byte[] row, byte[] family, byte[] qualifier, byte[] value, Put put) throws IOException
Atomically checks if a row/family/qualifier value matches the expected
value. If it does, it adds the put. If the passed value is null, the
check is for the lack of column (ie: non-existance)
So by setting value to null your Put will only succeed if the cell did not exist. If your Put succeeded then the return value will be true. This gives true atomicity, but at a write performance cost.
On write:
A row lock is set and a Get is issued internally before existence is checked. Once non-existence is confirmed, the Put is issued. As you can imagine this has a pretty big performance impact for each write, since each write now also involves a read and a lock.
During compaction nothing needs to happen, because only one Put will ever make it to HBase - always the first Put to reach the region server.
It should be noted that there is no way to batch these kinds of checkAndPut operations by using checkAndMutate, since each Put needs its own check. This means each Put needs to be a separate request, which means you will be paying a latency cost as well when writing in batches.
On read:
Only one version will ever make it to HBase, so there is no impact here.
Picking between strategies:
If true ordering really matters, or you need to read each row after or before you write to HBase anyway (for example, to find out whether your write succeeded or not), you're better off with strategy 2. Otherwise, in all other cases, I'd recommend strategy 1, since its write performance is much better. In that case just make sure your clients are properly time-synced.
You can insert the Put with Long.MAX_VALUE - timestamp and configure the table to store only 1 version (max versions => 1). This way only the first (earliest) Put will be returned by the Scan, because all successive Puts will have a smaller timestamp value.
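For completeness, here is a minimal sketch of the inverted-timestamp approach. It uses the Python happybase client purely as an illustration (my assumption; the question is about the Java API, where the same idea applies by setting the timestamp on each Put):
import time
import happybase

LONG_MAX = 2**63 - 1  # Java's Long.MAX_VALUE

connection = happybase.Connection('localhost')  # hypothetical Thrift gateway
# create once: column family 'cf1' keeps only one version per cell
connection.create_table('mytable', {'cf1': dict(max_versions=1)})
table = connection.table('mytable')

def put_first_wins(row_key, column, value):
    # inverted timestamp: the earlier wall-clock write gets the larger HBase timestamp,
    # so after pruning/compaction only the first write survives
    ts = LONG_MAX - int(time.time() * 1000)
    table.put(row_key, {column: value}, timestamp=ts)

put_first_wins(b'1', b'cf1:c1', b'x')   # first write wins
put_first_wins(b'1', b'cf1:c1', b'y')   # later write gets a smaller timestamp, effectively ignored
print(table.row(b'1'))                  # expected: {b'cf1:c1': b'x'}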
My problem is that I have a 5-node Hadoop cluster, and the files on the cluster take up 350 GB. I am running a Pig script which loads three different files and joins them.
Every time, the job takes less than 30 minutes to complete all the map tasks, and then 6 hours for the reduce tasks; in the best case all of these reduce tasks fail at the end. In the worst case my Hadoop gets stuck, caused by the namenode going into safe mode because it does not have enough space (quota exceeded).
The problem is caused by the tmp directory, which takes up all the available space (7 TB!).
My script looks like this :
info_file = LOAD '$info' as (name, size, type,generation,streamId);
chunks_file = LOAD '$chunk' as (fp, size);
relation_file = LOAD '$relation' as (fp, filename);
chunks_relation = JOIN chunks_file BY fp, relation_file BY fp;
chunks_files= JOIN chunks_relation BY $3, info_file BY $0;
result = FOREACH chunks_files GENERATE $0,$1,$3,$5,$6,$7,$8;
STORE result INTO '$out';
Any ideas?
Your script looks fine. What is the size of the files that you are joining?
Join is a costly operator anywhere. You can optimize the joins by using replicated, skewed, or merge joins in Pig. Go through the documentation for these joins once and apply whichever fits your file sizes and requirements.
https://bluewatersql.wordpress.com/category/Pig/