making scan works like Get in HBase - hadoop

I am just writing a simple big data application on my existing data o HBase, sometime I feel that Scan could work faster than one Get, so I want to experiment it and convert my Get commands to the exact scan
Therefore if I have below keys and would like to Get(12)
row keys
12
123
21
22
what I eed to put as Startrow and Stoprow of my scan or I might configure other paramter in scan?

If the direction of the scan is default (i.e. not reversed), then what normally works for me is to set the start row as 12 and the stop row as 12x, where x is some trailing special character that you know doesn't occur in your row key space and is likely to be lexicographically later after all possible characters in your row key range. For example i usually use '~' as the trailing symbol, but maybe something else might work better for you.
Also, scan has the .setLimit(int) parameter, which can limit your scan to just 1. You can use both elements together. However I'm not sure why this should work faster than Get.
If you feel that your scans work faster than gets, maybe it has something to do with Call Queue configuration of your cluster. For example maybe your cluster is configured to allocate more handlers to Scans rather than Gets. That's not default behavior, but it's possible that someone may have configured it that way, and if your cluster is very busy, maybe that's why you are feeling it that way.

Related

Looking for a more efficient way to pull data from multiple datasets in SAS

I'm trying to find a more efficient and speedier way (if possible) to pull subsets of observations that meet certain criteria from multiple hospital claims datasets in SAS. A simplified but common type of data pull would look like this:
data out.qualifying_patients;
set in.state1_2017
in.state1_2018
in.state1_2019
in.state1_2020
in.state2_2017
in.state2_2018
in.state2_2019
in.state2_2020;
array prcode{*} I10_PR1-I10_PR25;
do i=1 to 25;
if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort=1;
end;
if cohort=1 then output;
run;
Now imagine that instead of 2 states and 4 years we have 18 states and 9 years -- each about 1GB in size. The code above works fine but it takes FOREVER to run on our non-optimized server setup. So I'm looking for alternate methods to perform the same task but hopefully at a faster clip.
I've tried including (KEEP=) or (DROP=) statements for each dataset included the SET statement to limit the variables being scanned, but this really didn't have much of an impact on speed -- and, for non-coding-related reasons, we pretty much need to pull all the variables.
I've also experimented a bit with hash tables but it's too much to store in memory so that didn't seem to solve the issue. This also isn't a MERGE issue which seems to be what hash tables excel at.
Any thoughts on other approaches that might help? Every data pull we do contains customized criteria for a given project, but we do these pulls a lot and it seems really inefficient to constantly be processing thru the same datasets over and over but not benefitting from that. Thanks for any help!
I happend to have a 1GB dataset on my compute, I tried several times, it takes SAS no more than 25 seconds to set the dataset 8 times. I think the set statement is too simple and basic to improve its efficient.
I think the issue may located at the do loop. Your program runs do loop 25 times for each record, may assigns to cohort more than once, which is not necessary. You can change it like:
do i=1 to 25 until(cohort=1);
if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort=1;
end;
This can save a lot of do loops.
First, parallelization will help immensely here. Instead of running 1 job, 1 dataset after the next; run one job per state, or one job per year, or whatever makes sense for your dataset size and CPU count. (You don't want more than 1 job per CPU.). If your server has 32 cores, then you can easily run all the jobs you need here - 1 per state, say - and then after that's done, combine the results together.
Look up SAS MP Connect for one way to do multiprocessing, which basically uses rsubmits to submit code to your own machine. You can also do this by using xcmd to literally launch SAS sessions - add a parameter to the SAS program of state, then run 18 of them, have them output their results to a known location with state name or number, and then have your program collect them.
Second, you can optimize the DO loop more - in addition to the suggestions above, you may be able to optimize using pointers. SAS stores character array variables in memory in adjacent spots (assuming they all come from the same place) - see From Obscurity to Utility:
ADDR, PEEK, POKE as DATA Step Programming Tools from Paul Dorfman for more details here. On page 10, he shows the method I describe here; you PEEKC to get the concatenated values and then use INDEXW to find the thing you want.
data want;
set have;
array prcode{*} $8 I10_PR1-I10_PR25;
found = (^^ indexw (peekc (addr(prcode[1]), 200 ), '0DTJ0ZZ')) or
(^^ indexw (peekc (addr(prcode[1]), 200 ), '0DTJ4ZZ'))
;
run;
Something like that should work. It avoids the loop.
You also could, if you want to keep the loop, exit the loop once you run into an empty procedure code. Usually these things don't go all 25, at least in my experience - they're left-filled, so I10_PR1 is always filled, and then some of them - say, 5 or 10 of them - are filled, then I10_PR11 and on are empty; and if you hit an empty one, you're all done for that round. So not just leaving when you hit what you are looking for, but also leaving when you hit an empty, saves you a lot of processing time.
You probably should consider a hardware upgrade or find someone who can tune your server. This paper suggests tips to improve the processing of large datasets.
Your code is pretty straightforward. The only suggestion is to kill the loop as soon as the criteria is met to avoid wasting unnecessary resources.
do i=1 to 25;
if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then do;
output; * cohort criteria met so output the row;
leave; * exit the loop immediately;
end;
end;

How to implement the equivalent of the Aggregator EIP in Nifi

I'm very experienced with Apache Camel and EIPs and am struggling to understand how to implement equivalents in Nifi. I understand that Nifi uses a different paradigm (flow based programming) but I don't think what I'm trying to do is unreasonable.
In a nutshell I want the contents of each file to be sent to many rest services and I want to aggregate the responses into a single document which will stored in elasticsearch. I might also do some further processing and cleanup to improve what is stored (but this isn't my immediate issue)
The screenshot is a quick mock-up of what I'm trying to achieve but I don't understand enough about Nifi to know how to implement this pattern correctly.
If you are going to take a single piece of data and then fork to multiple parts of the flow and then converge back, there needs to be a way for MergeContent to know which pieces go together.
There are generally two ways this can be done...
The first is using MergeContent in "defragment mode". Think of this as reversing a split operation that was performed by one of the split processors like SplitText. For example, you split a file of 100 lines into 100 flow files of 1 line each, then do some stuff to each one, then want to converge back. The split processors produce a standard set of split attributes (described in the docs of the processors) and the defragment mode knows how to bin the splits accordingly and merge them back together. This probably doesn't apply to your example since you didn't start with a split processor.
The second approach is the "Correlation Attribute" in MergeConent. This tells merge content to only merge flow files together that have the same value for the attribute specified. In your example, when a file gets picked up by GetFile and sent to 3 InvokeHttp processors, there are 3 flow files created, and they all should have their "filename" attribute set to the name of the file picked up from disk. So telling MergeContent to correlate on filename should do the trick, and probably setting the min and max number of entries to the number you expect like 3, and a maximum time in case one of them fails or hangs.

Apache Nifi MergeContent output data inconsistent?

Fairly new to using nifi. Need help with the design.
I am trying to create a simple flow with dummy csv files(for now) in HDFS dir and prepend some text data to each record in each flowfile.
Incoming files:
dummy1.csv
dummy2.csv
dummy3.csv
contents:
"Eldon Base for stackable storage shelf, platinum",Muhammed MacIntyre,3,-213.25,38.94,35,Nunavut,Storage & Organization,0.8
"1.7 Cubic Foot Compact ""Cube"" Office Refrigerators",BarryFrench,293,457.81,208.16,68.02,Nunavut,Appliances,0.58
"Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl",Barry French,293,46.71,8.69,2.99,Nunavut,Binders and Binder Accessories,0.39
...
Desired output:
d17a3259-0718-4c7b-bee8-924266aebcc7,Mon Jun 04 16:36:56 EDT 2018,Fellowes Recycled Storage Drawers,Allen Rosenblatt,11137,395.12,111.03,8.64,Northwest Territories,Storage & Organization,0.78
25f17667-9216-4f1d-b69c-23403cd13464,Mon Jun 04 16:36:56 EDT 2018,Satellite Sectional Post Binders,Barry Weirich,11202,79.59,43.41,2.99,Northwest Territories,Binders and Binder Accessories,0.39
ce0b569f-5d93-4a54-b55e-09c18705f973,Mon Jun 04 16:36:56 EDT 2018,Deflect-o DuraMat Antistatic Studded Beveled Mat for Medium Pile Carpeting,Doug Bickford,11456,399.37,105.34,24.49,Northwest Territories,Office Furnishings,0.61
the flow
splitText-
ReplaceText-
MergeContent-
(this may be a poor way to achieve what I am trying to get, but I saw somewhere that uuid is best bet when it comes to generating unique session id. So thought of extracting each line from incoming data to flowfile and generating uuid)
But somehow, as you can see the order of data is messing up. The first 3 rows are not the same in output. However, the test data I am using (50000 entries) seems to have the data in some other line. Multiple tests show usually the data order changes after 2001st line.
And yes, I did search similar issues here and tried using defragment method in merge but it didnt work. I would appreciate if someone can explain what is happening here and how can I get the data in the same way with unique session_id,timestamp for each record. Is there some parameter I need to change or modify to get the correct output? I am open to suggestions if there is a better way as well.
First of all thank you for such an elaborate and detailed response. I think you cleared a lot of doubts I had as to how the processor works!
The ordering of the merge is only guaranteed in defragment mode because it will put the flow files in order according to their fragment index. I'm not sure why that wouldn't be working, but if you could create a template of a flow with sample data that showed the problem it would be helpful to debug.
I will try to replicate this method using a clean template again. Could be some parameter problem and the HDFS writer not able to write.
I'm not sure if the intent of your flow is to just re-merge the original CSV that was split, or to merge together several different CSVs. Defragment mode will only re-merge the original CSV, so if ListHDFS picked up 10 CSVs, after splitting and re-merging, you should again have 10 CSVs.
Yes, that is exactly what I need. Split and join data to their corresponding files. I dont specifically (yet) need to join the outputs again.
The approach of splitting a CSV down to 1 line per flow file to manipulate each line is a common approach, however it won't perform very well if you have many large CSV files. A more efficient approach would be to try and manipulate the data in place without splitting. This can generally be done with the record-oriented processors.
I used this approach purely instinctively and did not realize this is a common method. Sometimes the datafile could be very large, that means more than a million records in a single file. Wont that be an issue with the i/o in the cluster? coz that would mean each record=one flowfile=one unique uuid. What is a comfortable number of flowfiles that nifi can handle? (i know it depends on cluster config and will try to get more info about the cluster from hdp admin)
What do you suggest by "try and manipulate the data in place without splitting" ? can you give an example or template or processor to use?
In this case you would need to define a schema for your CSV which included all the columns in your data, plus the session id and timestamp. Then using an UpdateRecord processor you would use record path expressions like /session_id = ${UUID()} and /timestamp = ${now()}. This would stream the content line by line and update each record and write it back out, keeping it all as one flow file.
This looks promising. Can you share a simple template pulling files from hdfs>processing>write hdfs files but without splitting?
I am reluctant to share the template due to restrictions. But let me see if I can create a generic templ and I will share
Thank you for your wisdom! :)

Hbase: Having just the first version of each cell

I was wondering how can I configure Hbase in a way to store just the first version of each cell? Suppose the following Htable:
row_key cf1:c1 timestamp
----------------------------------------
1 x t1
After putting ("1","cf1:c2",t2) in the scenario of ColumnDescriptor.DEFAULT_VERSIONS = 2 the mentioned Htable becomes:
row_key cf1:c1 timestamp
----------------------------------------
1 x t1
1 x t2
where t2>t1.
My question would be how can I change this scenario in a way that the first version of cell would be the only version that could be store and retrieve. I mean in the provided example the only version would be 't1' one! Thus, I want to change hbase in a way that ignore insertion on duplicates.
I know that setting VERSIONS to 1 for Htable and putting based on Long.MAX_VALUE - System.currentTimeMillis() would solve my problem but I dont know is it the best solution or not?! What is the concern of changing tstamp to Long.MAX_VALUE - System.currentTimeMillis()? Does it has any performance issue?
There are two strategies that I can think of:
1. One version + inverted timestamp
Setting VERSIONS to 1 for Htable and putting based on Long.MAX_VALUE - System.currentTimeMillis() will generally work and does not have any major performance issues.
On write:
When multiple versions of the same cell are written to hbase, at any point in time, all versions will be written (without any impact on performance). After compaction only the cell with the highest timestamp will survive.
The cell with the highest timestamp in this scheme is the one written by the client with the lowest value for System.currentTimeMillis(). It should be noted that this might not actually be the machine who tried to write to the cell first, since hbase clients might be out of sync.
On read:
When multiple versions of the same cell are found pruning will occur at that time. This can happen at any time, since your writes can occur at any time, even after compaction. This has a very slight impact on performance.
2. checkAndPut
To get true ordering through atomicity, meaning only the first write to reach the region server will succeed, you can use the checkAndPut operation:
From the docs:
public boolean checkAndPut(byte[] row, byte[] family, byte[] qualifier, byte[] value, Put put) throws IOException
Atomically checks if a row/family/qualifier value matches the expected
value. If it does, it adds the put. If the passed value is null, the
check is for the lack of column (ie: non-existance)`
So by setting value to null your Put will only succeed if the cell did not exist. If your Put succeeded then the return value will be true. This gives true atomicity, but at a write performance cost.
On write:
A row lock is set and a Get is issued internally before existance is checked. Once non-existance is confirmed the Put is issued. As you can imagine this has a pretty big performance impact for each write, since each write now also involves a read and a lock.
During compaction nothing needs to happen, because only one Put will ever make it to hbase. Which is always the first Put to reach the region server.
It should be noted that there is no way to batch these kind of checkAndPut operations by using checkAndMutate, since each Put needs it own check. This means each put needs to be a separate request, which means you will be paying a latency cost as well when writing in batches.
On read:
Only ever one version will make it to Hbase, so there is no impact here.
Picking between strategies:
If true ordering really matters or you may need to read each row after or before you write to hbase anyway (for example to find out if your write succeeded or not), you're better of with strategy 2, otherwise, in all other cases, I'd recommend strategy 1, since its write performance is much better. In that case just make sure your clients are properly time synced.
You can insert the Put with Long.MAX_VALUE - timestampand configure the table to store only 1 version (max versions => 1). This way only the first (earliest) Put will be returned by the Scan because all successive Puts will have a smaller timestamp value.

Using Hadoop to "bucket" data out with a single run

Is it possible to use one Hadoop job run to output data to different directories based on keys?
My use case is server access logs. Say I have them all together, but I want to split them out based on some common URL patterns.
For example,
Anything that starts with /foo/ should go to /year/month/day/hour/foo/file
Anything that starts with /bar/ should go to /year/month/day/hour/bar/file
Anything that doesn't match should go to /year/month/day/hour/other/file
There are two problems here (from my understanding of Map Reduce): first, I'd prefer to just iterate over my data one time, instead of running one "grep" job per URL type I'd like to match. How would I split up the output, though? If I key the first with "foo", second with "bar", and rest with "other" then don't they all still go to the same reducers? How do I tell Hadoop to output them into different files?
The second problem is related (maybe the same?), I need to break output up by the timestamp in the access log line.
I should note that I'm not looking for code to solve this, but rather the proper terminology and high level solution to look into. If I have to do it with multiple runs, that's alright, but I can't run one "grep" for each possible hour (to make a file for that hour), there must be another way?
You need to partition the data just as you describe. Then you need to have multiple output files. See here (Generating Multiple Output files with Hadoop 0.20+).

Resources