Store nested entity in HBase and read it as rows in Hive - hadoop

My requirement is to write a nested entity (an array of POJO objects) from Java to HBase and to read its elements as individual records in Hive.
That is, when written from Java it is just a single string (the array), but from Hive the array represents the table as a whole, so Hive should expose the individual elements of the array as individual records.
Any help on this will be appreciated.
Thanks,
GK

Perhaps you should take a look at Hive UDTFs such as explode. Depending on what you store and what you need to retrieve, they may work for you, but be aware that they have some important limitations:
No other expressions are allowed in SELECT: SELECT pageid, explode(adid_list) AS myCol... is not supported
UDTFs can't be nested: SELECT explode(explode(adid_list)) AS myCol... is not supported
GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY is not supported: SELECT explode(adid_list) AS myCol ... GROUP BY myCol is not supported
If the standard UDTFs don't fit your case and you're in the mood, you can also do this:
Store each item of your array as a JSON string in a different column: i0, i1, i2 ... iN
Write your own UDTF to process the columns of each row and emit one row per column.
IMHO, I would just write one row per element of the array, appending the index of each array item to the rowkey. It will be faster when processing the data and you'll have far fewer headaches. You shouldn't worry about writing billions of rows if that's the case.
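The one-row-per-element layout can be sketched as follows. This is only an illustration of the rowkey scheme, not HBase client code: the function name, the parent key "order42", and the zero-padding width are all assumptions. Zero-padding the index keeps the rows in array order, because HBase sorts rowkeys lexicographically.

```python
import json


def rows_for_entity(parent_key, items, width=6):
    """Yield (rowkey, json_value) pairs, one per array element.

    The index is zero-padded so that lexicographic rowkey order
    matches the original array order (hypothetical key layout).
    """
    for index, item in enumerate(items):
        rowkey = f"{parent_key}-{index:0{width}d}"
        yield rowkey, json.dumps(item, sort_keys=True)


# Example: two elements of one nested entity become two rows.
rows = list(rows_for_entity("order42", [{"sku": "a"}, {"sku": "b"}]))
```

Each pair would then be written as its own HBase row, and a Hive table mapped onto the HBase table sees one record per element without any UDTF.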

Related

Inject a list into SpringData to be used as a kind of virtual database table

I have a database table on which a sorted query needs to be run.
The sorting requires a join on another table. The problem is that this other table does not exist in the database, because we read the required data from a CSV file on the service's startup and keep it as an in-memory list.
Is it possible to somehow inject this list as a kind of virtual database table into Spring Data, so that it could use this list for the required join and sorting?
As far as I know, the only other options I have would be to create a real database table from this in-memory list or load the whole table and do the sorting in the service itself.
You can add a special order by expression through e.g. Spring Data Specification, but that is going to be very ugly. In HQL it looks like this:
case rootAlias.attribute when 'value1' then 1 when 'value2' then 2 ... else null end
which will return some integer value by which you can sort ascending or descending, based on the mapping you have.
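As a rough sketch of how such an expression could be generated from the in-memory list, the helper below builds the CASE text from the values in their desired order. The function name is hypothetical, and it inlines the values as literals (with single quotes doubled), which is an assumption made for brevity; in real code you would more likely build the expression through the criteria API.

```python
def order_case_expression(attribute, ordered_values):
    """Build an HQL-style CASE expression mapping each value to its rank.

    ordered_values lists the values in the order they should sort.
    """
    if not ordered_values:
        raise ValueError("ordered_values must not be empty")
    branches = " ".join(
        "when '{}' then {}".format(value.replace("'", "''"), rank)
        for rank, value in enumerate(ordered_values, start=1)
    )
    return f"case {attribute} {branches} else null end"


expression = order_case_expression("rootAlias.attribute", ["value1", "value2"])
```

The resulting string has the same shape as the expression above and can be used as the sort key in either direction.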
Even if you have lots of values, I would recommend that you don't do a join at all, and instead try to make this attribute of your main table sortable, so that you don't need this mapping. You could create a trigger that maintains a column based on the mapping, which can then be used for sorting directly. If you make all your changes through JPA/Hibernate, you could also use a @PreUpdate/@PrePersist listener to maintain this column.

Different Columns for Each Row in HBase?

In my HBase table, each row may have different columns than other rows. For example:
ROW COLUMN
1-1040 cf:s1
1-1040 cf:s2
1-1043 cf:s2
2-1040 cf:s5
2-1045 cf:s99
3-1040 cf:s75
3-1042 cf:s135
As seen above, each row has different columns than other rows. So, when I run a scan query like this:
scan 'tb', {COLUMNS=>'cf:s2', STARTROW=>'1-1040', ENDROW=>'1-1044'}
I want to get the cf:s2 values using the query above. But does any performance issue occur because each row has different columns?
Another option:
ROW COLUMN
1-1040-s1 cf:value
1-1040-s2 cf:value
1-1043-s2 cf:value
2-1040-s5 cf:value
2-1045-s99 cf:value
3-1040-s75 cf:value
3-1042-s135 cf:value
In this option, when I want to get the s2 values between 1-1040 and 1-1044, I run this query:
scan 'tb', {STARTROW=>'1-1040-s2', ENDROW=>'1-1044', FILTER=>"RowFilter(=, 'substring:s2')"}
When I want to get the s2 values, which option is better for read performance?
HBase stores all records for a given column family in the same file, and so the scan has to run over all key-value pairs, even if you apply a filter. This is true of both ways you suggest for storing the data.
For optimal performance of this specific scan, you should consider storing your s2 data in a different column family. Under the hood, HBase will store your data in the following way:
One file:
1-1040 cf1:s1
2-1040 cf1:s5
2-1045 cf1:s99
3-1040 cf1:s75
3-1042 cf1:s135
Another file:
1-1040 cf2:s2
1-1043 cf2:s2
Then you can run a scan over just cf2, and HBase will only read data containing s2, making the operation much faster.
scan 'tb', {COLUMNS => 'cf2', STARTROW=>'1-1040', ENDROW=>'1-1044'}
Considerations:
It's recommended to have only two or three column families per table, so you shouldn't implement this if you also want to run this query for s5, s75, etc. In that case, your composite rowkey option is better, as HBase only needs to look at the rowkey and not at the column qualifiers.
It depends on exactly which queries you'll be running, and how often you'll be running them. This is the fastest way for you to get values associated with s2, but might not be fastest for other queries.
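For the composite-rowkey option, a toy model of the range scan plus substring row filter may help when choosing the start and stop rows. This is plain Python, not the HBase API; the scan function is a hypothetical stand-in that compares keys as byte strings, with an inclusive start and exclusive stop, which is how HBase orders and bounds rowkeys.

```python
def scan(rowkeys, start_row, stop_row, substring=None):
    """Mimic an HBase range scan over byte-ordered rowkeys.

    start_row is inclusive, stop_row is exclusive; the optional
    substring plays the role of RowFilter(=, 'substring:...').
    """
    selected = []
    for key in sorted(rowkeys, key=lambda k: k.encode()):
        if start_row.encode() <= key.encode() < stop_row.encode():
            if substring is None or substring in key:
                selected.append(key)
    return selected


keys = [
    "1-1040-s1", "1-1040-s2", "1-1043-s2", "2-1040-s5",
    "2-1045-s99", "3-1040-s75", "3-1042-s135",
]
in_range = scan(keys, "1-1040", "1-1044", "s2")
```

Note that a start row written without the separator, such as '1-1040s2', sorts after '1-1040-s2' (the byte for 's' is greater than the byte for '-'), so it would silently skip that row.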

Oracle compressed/b-tree index how and when to use

I would like to add a compressed index to the Oracle Applications workflow table hr.pqh_ss_transaction_history in order to access specific types of workflows (process_name) and workflows for specific people (selected_person_id).
There are lots of repeating values in process_name although the data is skewed. I would however want to access the TFG_HR_NEW_HIRE_PLACE_JSP_PRC and TFG_HR_TERMINATION_JSP_PRC process types.
"PROCESS_NAME","CNT"
"HR_GENERIC_APPROVAL_PRC",40347
"HR_PERSONAL_INFO_JSP_PRC",39284
"TFG_HR_NEW_HIRE_PLACE_JSP_PRC",18117
"TFG_HREMPSTS_TERMS_CHG_JSP_PRC",14076
"TFG_HR_TERMINATION_JSP_PRC",8764
"HR_ADV_INDIVIDUAL_COMP_PRC",4907
"TFG_HR_SIT_NOAPP",3979
"TFG_YE_TAX_PROV",2663
"HR_TERMINATION_JSP_PRC",1310
"HR_CHANGE_PAY_JSP_PRC",953
"TFG_HR_SIT_EXIT_JSP_PRC",797
"HR_SIT_JSP_PRC",630
"HR_QUALIFICATION_JSP_PRC",282
"HR_CAED_JSP_PRC",250
"TFG_HR_EMP_TERM_JSP_PRC",211
"PER_DOR_JSP_PRC",174
"HR_AWARD_JSP_PRC",101
"TFG_HR_SIT_REP_MOT",32
"TFG_HR_SIT_NEWPOS_NIB_JSP_PRC",30
"TFG_HR_SIT_NEWPOS_INBU_JSP_PRC",28
"HR_NEW_HIRE_PLACE_JSP_PRC",22
"HR_NEWHIRE_JSP_PRC",6
selected_person_id would obviously be more selective. Unfortunately there are 3774 nulls for this column and the highest count after that is 73 for one person. A lot of people would only have 1 row. The total row count is 136963.
My query would be in this format:
select psth.item_key,
psth.creation_date,
psth.last_update_date
from hr.pqh_ss_transaction_history psth
where nvl(psth.selected_person_id, :p_person_id) = :p_person_id
and psth.process_name = 'HR_TERMINATION_JSP_PRC'
order by psth.last_update_date
I am on Oracle 12c release 1.
I assume it would be a good idea to put a non-compressed b-tree index on selected_person_id since the values returned would fall in the less than 5% of the total rows scenario, but how do you handle the nulls in the column which would not go into the index when you select using nvl(psth.selected_person_id, :p_person_id) = :p_person_id? Is there a more efficient way to write the sql and how should you create this index?
For process_name I would like to use a compressed b-tree index. I am assuming that the statement is
CREATE INDEX idxname ON pqh_ss_transaction_history(process_name) COMPRESS
where there would be an implicit second column for rowid. Is it safe for it to use rowid here, since normally it is not advised to use rowid? Is the skewed data an issue (most of the time I would be selecting on the high volume side)? I don't understand how compressed indexes would be efficient. For b-tree indexes you would normally want to return 5% of the data, otherwise a full table scan is actually more efficient. How does the compressed index return so many rowids and then do lookup into the full table using those rowids, faster than a full table scan?
Or since the optimizer will only be able to use one of the two indexes should I rather create an uncompressed function based index with selected_person_id and process_name concatenated?
Perhaps you could create this index:
CREATE INDEX idxname ON pqh_ss_transaction_history
(process_name, NVL(selected_person_id,-1)) COMPRESS 1
Then change your query to:
select psth.item_key,
psth.creation_date,
psth.last_update_date
from hr.pqh_ss_transaction_history psth
where nvl(psth.selected_person_id, -1) in (:p_person_id,-1)
and psth.process_name = 'HR_TERMINATION_JSP_PRC'
order by psth.last_update_date
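To check that the rewritten predicate selects the same rows as the original one, here is a small sketch using an in-memory SQLite table. It is only a stand-in: the table name and sample rows are made up, COALESCE replaces Oracle's NVL, and SQLite has no equivalent of COMPRESS, so this says nothing about index usage. It also assumes that -1 is never a real selected_person_id.

```python
import sqlite3


def build_history():
    """Create a tiny in-memory stand-in for the transaction history table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table history ("
        "item_key text, selected_person_id integer, process_name text)"
    )
    conn.executemany(
        "insert into history values (?, ?, ?)",
        [
            ("k1", 101, "HR_TERMINATION_JSP_PRC"),
            ("k2", None, "HR_TERMINATION_JSP_PRC"),   # null person id
            ("k3", 202, "HR_TERMINATION_JSP_PRC"),    # a different person
            ("k4", 101, "HR_GENERIC_APPROVAL_PRC"),   # a different process
        ],
    )
    return conn


ORIGINAL = "coalesce(selected_person_id, :p) = :p"
REWRITTEN = "coalesce(selected_person_id, -1) in (:p, -1)"


def matching_keys(conn, predicate, person_id):
    """Run the query shape from the question with the given predicate."""
    sql = (
        "select item_key from history where " + predicate +
        " and process_name = 'HR_TERMINATION_JSP_PRC' order by item_key"
    )
    return [row[0] for row in conn.execute(sql, {"p": person_id})]
```

Both predicates keep the rows for the requested person plus the rows with a null person id; the rewritten one does so through an expression that matches the second column of the suggested index.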

Hadoop map-reduce : Order of records while grouping

I have a record in each line of input and each record has around 10 fields. First, I group the records by three fields (field1, field2, field3) thus one mapper/reducer is responsible for one unique group (based on the three fields). Within each group, I sort the records based on another integer field timestamp and I tag each record in the group with the same tag aTag by adding another field.
Let's say that in mapper#1, I tag a sorted group as aTag and in mapper#2, I tag another group (a different group, because I initially grouped the records based on the three fields) with the same tag aTag.
Now, if I group the records based on the tag field (i.e., grouping the groups in different mappers), I notice that the ordering within each group is no longer preserved. I was expecting that since each mapper has a group with all records having the same tag, grouping by the tag name should just involve getting the relevant groups from other mappers and concatenating them without re-ordering each individual group.
Is it because I am trying to store the records in gzip format and hence it tries to re-order the records for better compression? Also I would like to know how to preserve the order after grouping by the tag name.
It seems that you are trying to implement the sort step of MapReduce yourself in local memory, but the framework then completely ignores what you did and re-sorts the items in each group anyway. The proper way to fix this is to specify a comparator on the keys, so that within each partition the merged input to the reducer is ordered according to that comparison function. This means that:
You don't have to do the sorting yourself
You don't run out of memory on one machine trying to sort a really large group.
In your case, it seems that you'd want to add the timestamp to the set of keys, tell the framework to partition on the first three keys, and tell it to sort on the timestamp.
For more information, see Where is Sort used in MapReduce phase and why?
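The effect of such a composite key can be imitated in a few lines of plain Python. This is a local simulation under assumed record shapes (field1, field2, field3, timestamp, payload), not Hadoop code: sorting on all four key fields plays the role of the sort comparator, and grouping on the first three plays the role of the partitioner and grouping comparator.

```python
from itertools import groupby
from operator import itemgetter


def reduce_groups(records):
    """Yield (group_key, payloads) with payloads in timestamp order.

    records: iterable of (field1, field2, field3, timestamp, payload).
    """
    # "Sort comparator": order by the full composite key.
    ordered = sorted(records, key=itemgetter(0, 1, 2, 3))
    # "Grouping comparator": one reduce call per (field1, field2, field3).
    for group_key, group in groupby(ordered, key=itemgetter(0, 1, 2)):
        yield group_key, [record[4] for record in group]


records = [
    ("a", "x", "p", 3, "r3"),
    ("a", "x", "p", 1, "r1"),
    ("b", "y", "q", 2, "s2"),
    ("a", "x", "p", 2, "r2"),
]
grouped = dict(reduce_groups(records))
```

Each group arrives already ordered by timestamp, so the reducer never has to hold and sort a whole group in memory.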

assigning IDs to hadoop/PIG output data

I'm working on a Pig script which performs heavy-duty data processing on raw transactions and comes up with various transaction patterns.
Say one of the patterns is: find all accounts that received cross-border transactions in a day (with the total number and amount of transactions).
My expected output should be two data files:
1) Rollup data - like account A1 received 50 transactions from country AU.
2) Raw transactions - all above 50 transactions for A1.
My Pig script currently creates its output in the following format:
Account Country TotalTxns RawTransactions
A1 AU 50 [(Txn1), (Txn2), (Txn3)....(Txn50)]
A2 JP 30 [(Txn1), (Txn2)....(Txn30)]
Now the question is: when I get this data out of the Hadoop system (into some DB), I want to establish a link between my rollup record (A1, AU, 50) and all 50 raw transactions (for example, ID 1 for the rollup record used as a foreign key for all 50 associated Txns).
I understand that Hadoop, being distributed, should not be used for assigning IDs, but are there any options where I can assign unique IDs (they need not be sequential) or some other way to link this data?
EDIT (after using Enumerate from DataFu)
Here is the Pig script:
register /UDF/datafu-0.0.8.jar
define Enumerate datafu.pig.bags.Enumerate('1');
data_txn = LOAD './txndata' USING PigStorage(',') AS (txnid:int, sndr_acct:int,sndr_cntry:chararray, rcvr_acct:int, rcvr_cntry:chararray);
data_txn1 = GROUP data_txn ALL;
data_txn2 = FOREACH data_txn1 GENERATE flatten(Enumerate(data_txn));
dump data_txn2;
After running this, I get:
ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: java.lang.NullPointerException
at datafu.pig.bags.Enumerate.enumerateBag(Enumerate.java:89)
at datafu.pig.bags.Enumerate.accumulate(Enumerate.java:104)
....
I often assign random ids in Hadoop jobs. You just need to ensure you generate ids which contain a sufficient number of random bits to ensure the probability of collisions is sufficiently small (http://en.wikipedia.org/wiki/Birthday_problem).
As a rule of thumb I use 3*log(n) random bits where n = # of ids that need to be generated.
In many cases Java's UUID.randomUUID() will be sufficient.
http://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates
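The rule of thumb can be checked with the usual birthday approximation, P(collision) ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n ids of b random bits. The sketch below assumes that "log" in the rule means log base 2; under that reading 2^b = n^3, so the collision probability is roughly n^2 / (2 n^3) = 1/(2n). The function names are illustrative only.

```python
import math


def bits_needed(n):
    """Rule of thumb from the answer, reading log as log base 2."""
    return math.ceil(3 * math.log2(n))


def collision_probability(n, bits):
    """Birthday approximation for n random ids of the given bit length."""
    return -math.expm1(-n * (n - 1) / 2 ** (bits + 1))


# One million ids: the rule asks for 60 random bits.
bits = bits_needed(10 ** 6)
p_rule = collision_probability(10 ** 6, bits)

# A random (version 4) UUID carries 122 random bits.
p_uuid = collision_probability(10 ** 9, 122)
```

For one million ids this gives a collision probability below one in a million, and a random UUID, with its 122 random bits, leaves far more headroom even for a billion ids.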
What is unique in your rows? It appears that account ID and country code are what you have grouped by in your Pig script, so why not make a composite key with those? Something like
CONCAT(CONCAT(account, '-'), country)
Of course, you could write a UDF to make this more elegant. If you need a numeric ID, try writing a UDF which will create the string as above, and then call its hashCode() method. This will not guarantee uniqueness of course, but you said that was all right. You can always construct your own method of translating a string to an integer that is unique.
But that said, why do you need a single ID key? If you want to join the fields of two tables later, you can join on more than one field at a time.
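A minimal sketch of the composite-key idea, in plain Python rather than Pig: each rollup record is split into one summary row and one row per raw transaction, all sharing the same account-country key. The function name and tuple layouts are assumptions for illustration, and the key is only unique as long as there is one rollup record per account and country (per day, a date would have to be added).

```python
def split_rollup(rollups):
    """Split rollup records into summary rows and linked transaction rows.

    rollups: iterable of (account, country, transactions).
    Returns (summary_rows, txn_rows); both carry the same link id.
    """
    summary_rows, txn_rows = [], []
    for account, country, txns in rollups:
        link_id = f"{account}-{country}"  # composite key used as foreign key
        summary_rows.append((link_id, account, country, len(txns)))
        txn_rows.extend((link_id, txn) for txn in txns)
    return summary_rows, txn_rows


summary, transactions = split_rollup([
    ("A1", "AU", ["Txn1", "Txn2"]),
    ("A2", "JP", ["Txn3"]),
])
```

Loaded into two database tables, the link id column then serves as the foreign key between the rollup row and its raw transactions.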
DataFu had a bug in Enumerate which was fixed in 0.0.9, so use 0.0.9 or later.
If your IDs must be numbers and you cannot use UUIDs or other string-based IDs, LinkedIn's DataFu library of UDFs has a very useful UDF called Enumerate. You can group all records into a bag and pass the bag to Enumerate. Here is the code off the top of my head:
-- first register the DataFu jar and define the Enumerate UDF
inpt = load '....' ....;
allGrp = group inpt all;
withIds = foreach allGrp generate flatten(Enumerate(inpt));
