How do I diff two tables in HBase - hadoop

I am trying to compare two different tables in HBase so that I can automate the validation of some ETL processes that we use to move data in HBase. What's the best way to compare two tables in HBase?
My use case is below:
What I am trying to do is create one table that will be my expected output. This table will contain all of the data that I expect to be created by executing the team's code against an input file. I will then take the diff between the actual output table and the expected output table to verify the integrity of the component under test.

I don't know of anything out of the box but you can write a multi-table map/reduce.
The mappers will just emit the keys from each table (with the value being all of the HBase KeyValues for the row plus the table name).
The reducer can check that it has two records for each key and compare the key-values. When there is only one record for a key, it can tell which table is out of sync.
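A rough sketch of that job, assuming the HBase 1.x Java API and MultiTableInputFormat wired up through TableMapReduceUtil.initTableMapperJob. The table names expected_table and actual_table, the class names, and the "family:qualifier=value" cell-serialization format are placeholders of mine, not anything standard:

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TableDiff {

  // Scans both tables via MultiTableInputFormat and emits, for every row,
  // row key -> "tableName|family:qualifier=value;..."
  public static class DiffMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable row, Result result, Context context)
        throws IOException, InterruptedException {
      TableSplit split = (TableSplit) context.getInputSplit();
      String table = Bytes.toString(split.getTableName());
      StringBuilder cells = new StringBuilder();
      for (Cell c : result.listCells()) {
        cells.append(Bytes.toString(CellUtil.cloneFamily(c))).append(':')
             .append(Bytes.toString(CellUtil.cloneQualifier(c))).append('=')
             .append(Bytes.toStringBinary(CellUtil.cloneValue(c))).append(';');
      }
      String key = Bytes.toStringBinary(row.get(), row.getOffset(), row.getLength());
      context.write(new Text(key), new Text(table + "|" + cells));
    }
  }

  // A key with a single record is missing from one table; two records with
  // different cell payloads are a content mismatch.
  public static class DiffReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text rowKey, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      Map<String, String> byTable = new HashMap<>();
      for (Text v : values) {
        String[] parts = v.toString().split("\\|", 2);
        byTable.put(parts[0], parts[1]);
      }
      if (byTable.size() == 1) {
        context.write(rowKey, new Text("only present in " + byTable.keySet()));
      } else if (new HashSet<>(byTable.values()).size() > 1) {
        context.write(rowKey, new Text("cells differ: " + byTable));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(HBaseConfiguration.create(), "hbase-table-diff");
    job.setJarByClass(TableDiff.class);
    List<Scan> scans = new ArrayList<>();
    for (String name : new String[] {"expected_table", "actual_table"}) {
      Scan scan = new Scan();
      scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes(name));
      scans.add(scan);
    }
    TableMapReduceUtil.initTableMapperJob(scans, DiffMapper.class, Text.class, Text.class, job);
    job.setReducerClass(DiffReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0])); // the diff report lands here
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The reducer's output is effectively the diff report: row keys present in only one table, or present in both tables but with different cells.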

I know this question is a little old, but how large are the tables? If they will both fit into memory you could load them into Pig using HBaseStorage, then use Pig's built-in DIFF function to compare the resulting bags.
This will work even with large tables that don't fit into memory, according to the docs, but it will be extremely slow.

dataset1 = LOAD '/path/to/dataset1' USING PigStorage('<your delimiter>') AS (a:chararray, b:chararray, c:chararray, d:chararray);
dataset2 = LOAD '/path/to/dataset2' USING PigStorage('<your delimiter>') AS (a:chararray, b:chararray, c:chararray, d:chararray);
dataset3 = COGROUP dataset1 BY (a, b, c, d), dataset2 BY (a, b, c, d);
dataset4 = FOREACH dataset3 GENERATE DIFF(dataset1,dataset2);

Related

Hive partitioned column doesn't appear in rdd via sc.textFile

The Hive partition column is not part of the underlying saved data, and I need to know how it can be pulled in via the sc.textFile(filePath) syntax and loaded into an RDD.
I know the other way of creating a HiveContext and so on, but I was wondering whether there is a way to get it directly via the sc.textFile(filePath) syntax and use it.
When you partition the data by a column at save time, that column's data is stored in the directory structure and not in the actual files. Since sc.textFile(filePath) is made for reading single files, I do not believe it supports reading partitioned data.
I would recommend reading the data as a DataFrame, for example:
val df = hiveContext.read.format("orc").load("path/to/table/")
The wholeTextFiles() method could also be used. Then you would get a tuple of (file path, file data), and from that it should be possible to parse out the partition column and add it as a new column.
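A rough sketch of that approach, shown here with Spark's Java API; the path, the colA=... directory pattern and the class name are placeholders, not from the answer above:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ReadPartitioned {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("read-partitioned"));

    // (file path, file contents) pairs for every file under the table directory
    JavaPairRDD<String, String> files = jsc.wholeTextFiles("path/to/table/*");

    // Recover the partition value from the directory name, e.g.
    // ".../colA=foo/part-00000" -> "foo", and key each file's contents by it.
    Pattern p = Pattern.compile("colA=([^/]+)");
    JavaPairRDD<String, String> byPartition = files.mapToPair(pair -> {
      Matcher m = p.matcher(pair._1());
      String colA = m.find() ? m.group(1) : "";
      return new Tuple2<>(colA, pair._2());
    });

    byPartition.take(5).forEach(t ->
        System.out.println(t._1() + " -> " + t._2().length() + " bytes"));
    jsc.stop();
  }
}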
If storage size is not a problem, an alternative solution would be to store the information of the partition column twice: once in the file structure (done by partitioning on that column), and once more in the data itself. This is achieved by duplicating the column before saving it to file. Say the column in question is named colA,
val df2 = df.withColumn("colADup", $"colA")
df2.write.partitionBy("colADup").orc("path/to/save/")
This can also easily be extended to multiple columns.

Bulk Loading Key-value pair data into HBASE

I am evaluating HBASE for dealing with a very wide dataset with a variable number of columns per row. In its raw form, my data has a variable list of parameter names and values for each row. In its transformed form, it is available in key-value pairs.
I want to load this data into HBASE. It is very easy to translate my key-value pair processed data into individual "put" statements to get the data in. However I need to bulkload as I have 1000s of columns and millions of rows, leading to billions of individual key-value pairs, requiring billions of "put" statements. Also, the list of columns (a,b,c,d,...) is not fully known ahead of time. I investigated the following options so far:
importtsv: Cannot be used because that requires the data to be pivoted from rows to columns ahead of time, with a fixed set of known columns to import.
HIVE to generate HFile: This option also requires the column names to be specified ahead of time and each column in the Hive table to be mapped to a column in HBase.
My only option seems to be to parse a chunk of the data once, pivot it into a set of known columns, and bulk load that. This seems wasteful, as HBASE is going to break it back down into key-value pairs anyway. Surely there is a simpler, more efficient way of bulk loading the key-value pairs?
Raw data format:
rowkey1, {a:a1, b:b1}
rowkey2, {a:a2, c:c2}
rowkey3, {a:a3, b:b3, c:c3, d:d3}
Processed data format:
rowkey1, a, a1
rowkey1, b, b1
rowkey2, a, a2
rowkey2, c, c2
rowkey3, a, a3
rowkey3, b, b3
rowkey3, c, c3
rowkey3, d, d3
You almost assuredly want to use a custom M/R job + incremental loading (aka bulk loading).
The general process will be:
Submit an M/R job that has been configured using HFileOutputFormat.configureIncrementalLoad
Map over the raw data and write PUTs for HBase
Load the output of the job into the table using the following:
sudo -u hdfs hdfs dfs -chown -R hbase:hbase /path/to/job/output
sudo -u hbase hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /path/to/job/output table-name-here
There are ways to do the load from Java but it means impersonating HBase. The tricky part here is making sure that the files are owned by HBase and that the user running the incremental load is also HBase. This Cloudera Blog Post talks a bit more about those details.
In general I recommend taking a peek at this GH Repo which seems to cover the basics of the process.
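For reference, a minimal sketch of steps 1 and 2, assuming the HBase 1.x API (HFileOutputFormat2 is the 1.x replacement for HFileOutputFormat) and input lines in the processed "rowkey, column, value" format shown above. The table name my_table and the column family f are placeholders:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadPrepare {

  // Turns each "rowkey, column, value" line into a Put keyed by row key.
  public static class KeyValueMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split(",\\s*");
      byte[] row = Bytes.toBytes(parts[0]);
      Put put = new Put(row);
      // column family "f" is a placeholder; the qualifier comes from the data itself
      put.addColumn(Bytes.toBytes("f"), Bytes.toBytes(parts[1]), Bytes.toBytes(parts[2]));
      context.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "prepare-hfiles");
    job.setJarByClass(BulkLoadPrepare.class);
    job.setMapperClass(KeyValueMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    TableName name = TableName.valueOf("my_table");
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(name);
         RegionLocator locator = conn.getRegionLocator(name)) {
      // Sets up total-order partitioning, the PutSortReducer and HFile output
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
    }
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The output directory of this job is what gets chown'd to hbase and handed to LoadIncrementalHFiles in the commands above.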

Filtering AVRO Data from 2 datasets

Use-case:
I have 2 datasets/filesets: Machine (parent) and Alerts (child).
Their data is stored in two Avro files, machine.avro and alert.avro.
The Alert schema has a machineId column of type int.
How can I filter data from Machine when there is a dependency on Alert too (one-to-many)?
e.g. get all machines where the alert time is between two timestamps.
Any example with source code would be a great help...
Thanks in advance...
I got the answer in another thread:
Mapping through two data sets with Hadoop
Posting the relevant answer from that thread below.
According to the documentation, the MapReduce framework includes the following steps:
Map
Sort/Partition
Combine (optional)
Reduce
You've described one way to perform your join: loading all of Set A into memory in each Mapper. You're correct that this is inefficient.
Instead, observe that a large join can be partitioned into arbitrarily many smaller joins if both sets are sorted and partitioned by key. MapReduce sorts the output of each Mapper by key in step (2) above. Sorted Map output is then partitioned by key, so that one partition is created per Reducer. For each unique key, the Reducer will receive all values from both Set A and Set B.
To finish your join, the Reducer needs only to output the key and either the updated value from Set B, if it exists; otherwise, output the key and the original value from Set A. To distinguish between values from Set A and Set B, try setting a flag on the output value from the Mapper.
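A minimal sketch of that pattern, using plain text inputs (rather than Avro) where each line is key<TAB>value; the class names, paths and the "A|"/"B|" tags are assumptions for illustration, not from the original thread:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {

  // Tag each record with its source so the reducer can tell the sets apart.
  public static class SetAMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] kv = line.toString().split("\t", 2);
      ctx.write(new Text(kv[0]), new Text("A|" + kv[1]));
    }
  }

  public static class SetBMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] kv = line.toString().split("\t", 2);
      ctx.write(new Text(kv[0]), new Text("B|" + kv[1]));
    }
  }

  // For each key, output the updated value from Set B if present, else Set A's.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String fromA = null, fromB = null;
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("B|")) fromB = s.substring(2);
        else fromA = s.substring(2);
      }
      ctx.write(key, new Text(fromB != null ? fromB : fromA));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "reduce-side-join");
    job.setJarByClass(ReduceSideJoin.class);
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, SetAMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, SetBMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}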

How to optimize Hive queries with external table and SerDe

Part 1: my environment
I have the following files uploaded to Hadoop:
They are plain text
Each line contains JSON like:
{code:[int], customerId:[string], data:{[something more here]}}
code values are numbers from 1 to 3000
customerId values total up to 4 million, with up to 0.5 million daily
All files are gzipped
In Hive I created an external table with a custom JSON SerDe (let's call it CUSTOMER_DATA)
All files from each date are stored in a separate directory, and I use those directories as partitions in the Hive table
Most of the queries I run filter by date, code and customerId. I also have a second file (let's call it CUSTOMER_ATTRIBUTES) with the format:
[customerId] [attribute_1] [attribute_2] ... [attribute_n]
which contains data for all my customers, so there are up to 4 million rows.
I query and filter my data in the following way:
Filtering by date - the partitions do the job here, using WHERE partitionDate IN (20141020,20141020)
Filtering by code using a statement like, for example, WHERE code IN (1,4,5,33,6784)
Joining the CUSTOMER_ATTRIBUTES table with CUSTOMER_DATA with a condition like
SELECT customerId
FROM CUSTOMER_DATA
JOIN CUSTOMER_ATTRIBUTES ON (CUSTOMER_ATTRIBUTES.customerId=CUSTOMER_DATA.customerId)
WHERE CUSTOMER_ATTRIBUTES.attribute_1=[something]
Part 2: question
Is there any efficient way to optimize my queries? I have read about indexes and buckets, but I don't know whether I can use them with external tables and whether they will optimize my queries.
Performance on search:
Internal vs. external table makes no difference as far as performance is concerned. You can build indexes on both. Either way, building indexes on large data sets is counterintuitive.
Bucketing the data on the columns you search would give a lot of performance gains, but whether you can bucket your data or not depends on your use case.
You can also consider more partitioning (if possible) on code/customerId to get more gains. Hopefully you don't have too many unique codes or customer ids.
Rather than trying these things out on your textual JSON-formatted data, I would strongly suggest you move away from JSON text data. Parsing JSON (text) is a big performance killer.
These days there are a lot of file formats which work pretty well. If you can't change the component which produces the data, you can use a series of queries and tables to convert to another file format. This will be a one-time job for each partition of data. After that your search queries will run faster on the newer file format.
For example, the RCFile format is supported by Hive. If you pull out code and customerId as separate columns in the RCFile, then the query engine can completely skip the data column for rows not matching code IN (1,4,5,33,6784), reducing IO heavily.
Also, storing data in RCFile, i.e. columnar storage, will help your joins. With RCFile, when you run a query with a join, the Hive execution engine will only read in the required columns, again significantly reducing IO. On top of this, if you bucketed the columns which are part of the JOIN keys, it will lead to even more performance gains.
If you need to keep JSON due to the nested nature of the data, then I would suggest you look at Parquet.
It will give you the performance gains of RCFile plus a binary format (Avro, Thrift, etc).
At my work we had 2 columns of heavily nested JSON data. We tried storing this as compressed text and as sequence file format. We then broke up the complex nested JSON columns into multiple, less nested columns and pulled out some frequently searched keys into other columns. We stored this as RCFile, and the performance gains we observed on searching were huge.
Right now, with a further burst in data, we need to improve more. After trying a few more things and talking to the Cloudera guys, there is only one big area left to improve: move away from JSON parsing. Parquet seems to be the ideal candidate for this.
Yes, you can use indexes with external tables. Indexes do optimize search queries.
CREATE INDEX your_index_name ON TABLE your_table_name(field_you_want_to_index) AS 'COMPACT' WITH DEFERRED REBUILD;
Indexing takes a lot of time for a huge dataset, so we can do a deferred rebuild, i.e. after production hours :)
ALTER INDEX your_index_name ON your_table_name REBUILD;
you can even rebuild a specific partition.
ALTER INDEX your_index_name ON your_table_name PARTITION(your_field = 'any_thing') REBUILD;
When you JOIN two tables, BUCKETING is the best option to go with; it does a lot of optimization.

Not able to store the data into hbase using pig when I don't know the number of columns in a file

I have a text file with N columns (not sure how many; in the future I may have N+1).
Example:
1|A
2|B|C
3|D|E|F
I want to store the above data into HBase using Pig without writing a UDF. How can I store this kind of data without knowing the number of columns in the file?
Put the values into a map, and then you can use cf1:* (where cf1 is your column family) when storing with HBaseStorage.
