I'm parsing data from one table and writing it back to another one. The input is a set of characteristics written as text; the output is a boolean field that needs to be updated. For example, a characteristic would be "has 4 wheel drive" and I want to set a boolean has_4weeldrive to true.
I go through all the characteristics that belong to a car and set the flag to true if the characteristic is found, else to null. The filter after the tmap_1 keeps the rows for which the attribute is true, and those rows are then updated in the target table. I want to do that for all the different characteristics (around 10).
If I do it for one characteristic the job runs fine, but as soon as I have more than one it only loads one record and waits indefinitely. I can of course make 10 jobs and it will run, but then I have to go through all the characteristics 10 times, which doesn't feel right. Is this a locking issue? Is there a better way to do this? The target and source database is PostgreSQL, if that makes a difference.
Shared connections could cause problems like this.
Also make sure you're committing after each update. Talend uses one thread for execution (except in the enterprise version), so multiple shared outputs can cause problems.
Setting the commit interval to 1 should eliminate the problem.
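For illustration only, committing after every row in plain JDBC looks roughly like the sketch below (connection details, table name, and id column are hypothetical; in Talend you would get the same effect from the commit interval on the output/commit component rather than hand-written code):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class CommitEveryRow {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details and table/column names, for illustration only.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/cars", "user", "secret")) {
            conn.setAutoCommit(false); // commit explicitly, once per row
            String sql = "UPDATE car SET has_4weeldrive = true WHERE id = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                long[] matchingCarIds = {1L, 2L, 3L}; // ids whose characteristics matched
                for (long id : matchingCarIds) {
                    ps.setLong(1, id);
                    ps.executeUpdate();
                    conn.commit(); // commit interval of 1: no uncommitted row keeps its lock while another output waits
                }
            }
        }
    }
}

With a large commit interval, one output can hold uncommitted row locks that another output needs, which would match the "loads 1 record and waits indefinitely" symptom.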
I'm trying to find a more efficient and speedier way (if possible) to pull subsets of observations that meet certain criteria from multiple hospital claims datasets in SAS. A simplified but common type of data pull would look like this:
data out.qualifying_patients;
  set in.state1_2017
      in.state1_2018
      in.state1_2019
      in.state1_2020
      in.state2_2017
      in.state2_2018
      in.state2_2019
      in.state2_2020;
  array prcode{*} I10_PR1-I10_PR25;
  do i=1 to 25;
    if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort=1;
  end;
  if cohort=1 then output;
run;
Now imagine that instead of 2 states and 4 years we have 18 states and 9 years -- each about 1GB in size. The code above works fine but it takes FOREVER to run on our non-optimized server setup. So I'm looking for alternate methods to perform the same task but hopefully at a faster clip.
I've tried including (KEEP=) or (DROP=) statements for each dataset included in the SET statement to limit the variables being scanned, but this really didn't have much of an impact on speed -- and, for non-coding-related reasons, we pretty much need to pull all the variables.
I've also experimented a bit with hash tables but it's too much to store in memory so that didn't seem to solve the issue. This also isn't a MERGE issue which seems to be what hash tables excel at.
Any thoughts on other approaches that might help? Every data pull we do contains customized criteria for a given project, but we do these pulls a lot and it seems really inefficient to constantly be processing thru the same datasets over and over but not benefitting from that. Thanks for any help!
I happen to have a 1GB dataset on my computer. I tried several times, and it takes SAS no more than 25 seconds to SET the dataset 8 times. I think the SET statement is too simple and basic to offer much room for efficiency improvement.
I think the issue may be in the DO loop. Your program runs the loop 25 times for each record and may assign cohort more than once, which is not necessary. You can change it like this:
do i=1 to 25 until(cohort=1);
  if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort=1;
end;
This can save a lot of loop iterations.
First, parallelization will help immensely here. Instead of running one job, one dataset after the next, run one job per state, or one job per year, or whatever makes sense for your dataset size and CPU count (you don't want more than one job per CPU). If your server has 32 cores, then you can easily run all the jobs you need here, say one per state, and then combine the results together once they're done.
Look up SAS MP Connect for one way to do multiprocessing; it basically uses rsubmits to submit code to your own machine. You can also do this by using xcmd to literally launch SAS sessions: add a state parameter to the SAS program, run 18 of them, have them write their results to a known location keyed by state name or number, and then have your program collect them.
Second, you can optimize the DO loop further; in addition to the suggestions above, you may be able to optimize using pointers. SAS stores character array variables in adjacent spots in memory (assuming they all come from the same place); see "From Obscurity to Utility: ADDR, PEEK, POKE as DATA Step Programming Tools" by Paul Dorfman for more details. On page 10 he shows the method I describe here: you PEEKC to get the concatenated values and then use INDEXW to find the thing you want.
data want;
  set have;
  array prcode{*} $8 I10_PR1-I10_PR25;
  /* PEEKC reads the 25 adjacent 8-byte values (25 x 8 = 200 bytes) starting at the
     address of the first array element, and INDEXW searches that concatenated,
     blank-delimited string for the code; ^^ turns the position into a 0/1 flag. */
  found = (^^ indexw(peekc(addr(prcode[1]), 200), '0DTJ0ZZ')) or
          (^^ indexw(peekc(addr(prcode[1]), 200), '0DTJ4ZZ'));
run;
Something like that should work. It avoids the loop.
You could also, if you want to keep the loop, exit it once you run into an empty procedure code. In my experience these records usually don't use all 25 slots: they're left-filled, so I10_PR1 is always filled, then maybe 5 or 10 of them are filled, and I10_PR11 onward are empty. If you hit an empty one, you're done for that row. So leaving the loop not just when you hit what you're looking for, but also when you hit an empty value, saves you a lot of processing time.
You probably should consider a hardware upgrade or find someone who can tune your server. This paper suggests tips to improve the processing of large datasets.
Your code is pretty straightforward. The only suggestion is to exit the loop as soon as the criterion is met, to avoid wasting resources.
do i=1 to 25;
  if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then do;
    output; * cohort criteria met, so output the row;
    leave;  * exit the loop immediately;
  end;
end;
I have a list of 96 times in 15-minute intervals that I'm trying to use for data validation. It turns out that it only works up to a certain point: when selecting times from 00:00 to 18:30 it works fine, but when selecting times from 18:45 to 23:45 it throws an error.
The data that you entered in cell B1 violates the data validation
rules set on this cell.
I made a test file to see if the same thing would happen if I used a different type of data, but when using a list of numbers or text with the same number of items it seems to work without issues.
https://docs.google.com/spreadsheets/d/1ClBhZk6fysOq0wqIrE5IxJgPMs_A05gJsN_kEiXoFGI/edit?usp=sharing
Does anyone know the reason why it doesn't work with the list of times? More importantly, does anyone know a way I could get it to work?
Edit:
The steps provided by player0 worked to fix it in my test file linked above, but didn't make a difference in the real file I'm working on. Here's a copy of the real file which exhibits the same problem.
https://docs.google.com/spreadsheets/d/1cOMB0BpzSBR7ZM7fJQQfhA56fniKTBJb-3TePUky6gM/edit?usp=sharing
Please try setting a time greater than 18:30 in a couple of cells. I keep getting the same errors.
I suppose a workaround could be to not reject input on invalid data, but I feel this should just work with rejection enabled, and I would like to know where I went wrong.
The issue is formatting. Select columns A & B and force the Time format (or Plain text).
I am part of a tractor pulling team and we have a Beckhoff CX8190-based PLC for data logging. The system works most of the time, but every now and then saving the sensor values (collected every 10 ms) to CSV fails, mostly in the middle of a CSV row. The guy who built the code is new to TwinCAT and does not know how to find what causes this. Any ideas where to look for the reason?
Writing to a file is always an asynchronous action in TwinCAT. That is to say, it is not a real-time action, and there is no guarantee that the write completes within the 10 ms task cycle time. Therefore these function blocks always have a BUSY output which has to be evaluated, and the function block has to be called repeatedly until the BUSY output returns to FALSE. Only then can a new write command be executed.
I normally tackle this task with a two-sided-buffer algorithm. Let's say the buffer array has 2x100 entries. Fill up the first 100 entries with sample values, then write them all together to the file with one command. When that's done, clear that half of the buffer. In the meantime the other half of the buffer can be filled with sample values; when the second half is full, write it all together to the file, and so on. This gives you much more time for the filesystem access (in the example above, 100 x 10 ms = 1 s) than the 10 ms task cycle time.
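A rough sketch of that double-buffer idea, written in Java purely to illustrate the control flow (the file name is made up, and a real TwinCAT solution would be structured text polling the BUSY output instead of spawning a thread):

import java.io.FileWriter;
import java.io.IOException;

// Minimal sketch of a two-sided sample buffer: one half is filled by the
// 10 ms sampling task while the other, full half is written out in the background.
public class DoubleBufferedLogger {
    private static final int HALF = 100;           // 100 samples x 10 ms = 1 s per flush
    private final double[][] halves = new double[2][HALF];
    private int active = 0;                        // half currently being filled
    private int count = 0;

    public synchronized void addSample(double value) {
        halves[active][count++] = value;
        if (count == HALF) {
            final double[] full = halves[active];
            active = 1 - active;                   // switch halves and keep sampling
            count = 0;
            // A real implementation would also check that the previous flush has finished.
            new Thread(() -> flush(full)).start(); // write the full half asynchronously
        }
    }

    private void flush(double[] samples) {
        try (FileWriter out = new FileWriter("log.csv", true)) { // hypothetical file name
            StringBuilder sb = new StringBuilder();
            for (double s : samples) {
                sb.append(s).append('\n');
            }
            out.write(sb.toString());              // one file access per 100 samples
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}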
But this is just a suggestion from my experience. I agree with the others: some actual code from your project would really help.
I was wondering how I can configure HBase to store just the first version of each cell. Suppose the following HTable:
row_key cf1:c1 timestamp
----------------------------------------
1 x t1
After putting ("1","cf1:c2",t2) in the scenario of ColumnDescriptor.DEFAULT_VERSIONS = 2 the mentioned Htable becomes:
row_key cf1:c1 timestamp
----------------------------------------
1 x t1
1 x t2
where t2>t1.
My question is how I can change this scenario so that the first version of the cell is the only version that can be stored and retrieved. In the provided example the only version would be the 't1' one. In other words, I want HBase to ignore inserts of duplicates.
I know that setting VERSIONS to 1 for the HTable and putting with Long.MAX_VALUE - System.currentTimeMillis() as the timestamp would solve my problem, but I don't know whether that is the best solution. What are the concerns with changing the timestamp to Long.MAX_VALUE - System.currentTimeMillis()? Does it have any performance impact?
There are two strategies that I can think of:
1. One version + inverted timestamp
Setting VERSIONS to 1 for Htable and putting based on Long.MAX_VALUE - System.currentTimeMillis() will generally work and does not have any major performance issues.
On write:
When multiple versions of the same cell are written to hbase, at any point in time, all versions will be written (without any impact on performance). After compaction only the cell with the highest timestamp will survive.
The cell with the highest timestamp in this scheme is the one written by the client with the lowest value for System.currentTimeMillis(). It should be noted that this might not actually be the machine who tried to write to the cell first, since hbase clients might be out of sync.
On read:
When multiple versions of the same cell are found pruning will occur at that time. This can happen at any time, since your writes can occur at any time, even after compaction. This has a very slight impact on performance.
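A minimal sketch of strategy 1 with the HBase Java client (the table name is a placeholder, and the admin calls shown use the HBase 1.x style, which may differ in other client versions):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FirstVersionWins {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            // "mytable" is a placeholder; keep only one version per cell.
            TableName name = TableName.valueOf("mytable");
            HTableDescriptor desc = new HTableDescriptor(name);
            desc.addFamily(new HColumnDescriptor("cf1").setMaxVersions(1));
            admin.createTable(desc);

            // Inverted timestamp: the earliest writer gets the *highest* timestamp,
            // so it is the version that survives compaction and is returned by reads.
            try (Table table = conn.getTable(name)) {
                Put put = new Put(Bytes.toBytes("1"));
                put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("c1"),
                        Long.MAX_VALUE - System.currentTimeMillis(),
                        Bytes.toBytes("x"));
                table.put(put);
            }
        }
    }
}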
2. checkAndPut
To get true ordering through atomicity, meaning only the first write to reach the region server will succeed, you can use the checkAndPut operation:
From the docs:
public boolean checkAndPut(byte[] row, byte[] family, byte[] qualifier, byte[] value, Put put) throws IOException
Atomically checks if a row/family/qualifier value matches the expected
value. If it does, it adds the put. If the passed value is null, the
check is for the lack of column (ie: non-existance)
So by setting value to null your Put will only succeed if the cell did not exist. If your Put succeeded then the return value will be true. This gives true atomicity, but at a write performance cost.
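For example, a put-if-absent call with the Table API matching the signature quoted above (the table name is assumed; row, family, qualifier, and value are taken from the question):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutIfAbsent {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("mytable"))) { // placeholder table name

            Put put = new Put(Bytes.toBytes("1"));
            put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("c1"), Bytes.toBytes("x"));

            // Passing null as the expected value means "only put if the cell does not exist yet".
            boolean wrote = table.checkAndPut(
                    Bytes.toBytes("1"), Bytes.toBytes("cf1"), Bytes.toBytes("c1"),
                    null, put);
            System.out.println(wrote ? "first write won" : "cell already existed, put skipped");
        }
    }
}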
On write:
A row lock is taken and a Get is issued internally before existence is checked. Once non-existence is confirmed, the Put is issued. As you can imagine this has a pretty big performance impact for each write, since each write now also involves a read and a lock.
During compaction nothing needs to happen, because only one Put will ever make it to hbase. Which is always the first Put to reach the region server.
It should be noted that there is no way to batch these kinds of checkAndPut operations by using checkAndMutate, since each Put needs its own check. This means each Put has to be a separate request, so you will also be paying a latency cost when writing in batches.
On read:
Only one version will ever make it to HBase, so there is no impact here.
Picking between strategies:
If true ordering really matters, or you need to read each row before or after you write to HBase anyway (for example to find out whether your write succeeded), you're better off with strategy 2. In all other cases I'd recommend strategy 1, since its write performance is much better. In that case just make sure your clients are properly time-synced.
You can insert the Put with Long.MAX_VALUE - timestamp and configure the table to store only 1 version (max versions => 1). This way only the first (earliest) Put will be returned by a Scan, because all successive Puts will have a smaller timestamp value.
I'm part of a team writing an application for embedded systems. The application often suffers from data corruption caused by power loss. I thought that implementing some kind of transactions would stop this from happening. One scenario would involve copying the area of a file to some additional storage (a transaction log) before writing to it. What are other possibilities?
Databases use a variety of techniques to assure that the state is properly persisted.
The DBMS often retains a replicated control file: several synchronized copies on several devices. Two is enough, more if you're paranoid. The control file provides a few key parameters used to locate the other files and their expected states. The control file can include a "database version number".
Each file has a "version number" in several forms. Often it is stored both in plain form and in an XOR complement, so that the two version numbers can be trivially checked to have the correct relationship and to match the control file version number.
All transactions are written to a transaction journal. The transaction journal is then written to the database files.
Before writing to database files, the original data block is copied to a "before image journal", or rollback segment, or some such.
When the block is written to the file, the sequence numbers are updated, and the block is removed from the transaction journal.
You can read up on RDBMS techniques for reliability.
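As a minimal illustration of the before-image idea described above, here is a sketch (the block size, file names, and single-block journal are simplifying assumptions, not how a real DBMS lays things out):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch: copy the block we are about to overwrite into a journal first,
// force the journal to disk, then write the data file in place. If power is
// lost mid-write, recovery copies the journaled block back before anything
// else touches the file.
public class BeforeImageJournal {
    private static final int BLOCK = 4096; // assumed block size

    static void writeBlock(Path dataFile, long blockNo, byte[] newBlock) throws IOException {
        Path journal = Paths.get(dataFile + ".journal");
        try (RandomAccessFile data = new RandomAccessFile(dataFile.toFile(), "rw");
             RandomAccessFile log = new RandomAccessFile(journal.toFile(), "rw")) {
            // 1. Save the before-image (block number + old contents) and sync it.
            byte[] oldBlock = new byte[BLOCK];
            data.seek(blockNo * BLOCK);
            data.readFully(oldBlock);
            log.writeLong(blockNo);
            log.write(oldBlock);
            log.getFD().sync();
            // 2. Only now overwrite the block in place.
            data.seek(blockNo * BLOCK);
            data.write(newBlock);
            data.getFD().sync();
        }
        // 3. The write is durable, so the before-image is no longer needed.
        Files.deleteIfExists(journal);
    }

    static void recover(Path dataFile) throws IOException {
        Path journal = Paths.get(dataFile + ".journal");
        if (!Files.exists(journal)) return; // clean shutdown, nothing to do
        try (RandomAccessFile data = new RandomAccessFile(dataFile.toFile(), "rw");
             RandomAccessFile log = new RandomAccessFile(journal.toFile(), "r")) {
            long blockNo = log.readLong();
            byte[] oldBlock = new byte[BLOCK];
            log.readFully(oldBlock);
            data.seek(blockNo * BLOCK);
            data.write(oldBlock); // roll the interrupted write back
            data.getFD().sync();
        }
        Files.delete(journal);
    }
}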
There are a number of ways to do this; generally the only assumption required is that small writes (<4k) are atomic. For example, here's how CouchDB does it:
A 4k header contains, amongst other things, the file offset of the root of the BTree containing all the data.
The file is append-only. When updates are required, write the update to the end of the file, followed by any modified BTree nodes, up to and including the root. Then, flush the data, and write the new address of the root node to the header.
If the program dies while writing an update but before writing the header, the extra data at the end of the file is discarded. If it fails after writing the header, the write is complete and all is well. Because the file is append-only, these are the only failure scenarios. This also has the advantage of providing multi-version concurrency control with no read locks.
When the file grows too long, simply read out all the 'live' data and write it to a new file, then delete the original.
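A stripped-down sketch of that append-only pattern (no real BTree here, just a single root record; the 4 KB header and the exact file layout are assumptions for illustration):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

// Sketch of the CouchDB-style layout: a fixed-size header at offset 0 holds the
// offset of the current root record; updates are appended and only become
// visible once the header is rewritten, so a torn append is simply ignored.
public class AppendOnlyStore {
    private static final int HEADER_SIZE = 4096; // small enough to assume an atomic write

    static void update(RandomAccessFile file, byte[] newRootRecord) throws IOException {
        // 1. Append the new record (a real store would also append modified BTree nodes).
        long offset = Math.max(file.length(), HEADER_SIZE);
        file.seek(offset);
        file.writeInt(newRootRecord.length);
        file.write(newRootRecord);
        file.getFD().sync();                 // data is durable before we point at it

        // 2. Publish it by rewriting the header with the new root offset.
        ByteBuffer header = ByteBuffer.allocate(HEADER_SIZE);
        header.putLong(offset);
        file.seek(0);
        file.write(header.array());
        file.getFD().sync();
    }

    static byte[] readRoot(RandomAccessFile file) throws IOException {
        file.seek(0);
        long offset = file.readLong();       // whatever the header points at is complete
        file.seek(offset);
        byte[] record = new byte[file.readInt()];
        file.readFully(record);
        return record;
    }
}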
You can avoid implementing such transaction logs yourself by using existing transaction managers around file-systems, e.g. XADisk.
The old link is no longer available; a GitHub repo is here.