HBase: Need to export data from one cluster and import it into another with a slight modification to the row key

I am trying to export data from the HBase table 'mytable' whose row key starts with 'abc':
scan 'mytable', {ROWPREFIXFILTER => 'abc'}
The exported data needs to be imported into another cluster with the row key prefix changed from 'abc' to 'def'.
Old Data:
hbase(main):002:0> scan 'mytable', {ROWPREFIXFILTER => 'abc'}
ROW COLUMN+CELL
abc-6535523 column=track:aid, timestamp=1339121507633, value=some stream/pojos
New Data: (In another cluster)
hbase(main):002:0> get 'mytable', 'def-6535523'
ROW COLUMN+CELL
def-6535523 column=track:aid, timestamp=1339121507633, value=some stream/pojos
Only part of the row key needs to be modified; all other data must stay the same.
I tried to use bin/hbase org.apache.hadoop.hbase.mapreduce.Export table_name file:///tmp/db_dump/, but Export has no provision to specify a start row and an end row, and I don't know how to import the dump with a changed row key.
Also, is there anything built into HBase/Hadoop to achieve this?
Please help.
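There does not appear to be an Export/Import option that rewrites row keys, so one client-side option (a minimal sketch, not a definitive solution) is to scan the source table with a PrefixFilter, rewrite the key prefix, and write the cells into the destination cluster through the HBase Java client API (used here from Scala). The table name and prefixes follow the question; the ZooKeeper quorum hosts are placeholders.
import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put, Scan}
import org.apache.hadoop.hbase.filter.PrefixFilter
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

// Separate configurations for the two clusters; the quorum hosts are placeholders.
val srcConf = HBaseConfiguration.create()
srcConf.set("hbase.zookeeper.quorum", "source-zk-host")
val dstConf = HBaseConfiguration.create()
dstConf.set("hbase.zookeeper.quorum", "dest-zk-host")

val srcConn = ConnectionFactory.createConnection(srcConf)
val dstConn = ConnectionFactory.createConnection(dstConf)
val srcTable = srcConn.getTable(TableName.valueOf("mytable"))
val dstTable = dstConn.getTable(TableName.valueOf("mytable"))

val oldPrefix = "abc-"
val newPrefix = "def-"

// Scan only the rows whose key starts with the old prefix.
val scanner = srcTable.getScanner(new Scan().setFilter(new PrefixFilter(Bytes.toBytes(oldPrefix))))
for (result <- scanner.asScala) {
  val newKey = newPrefix + Bytes.toString(result.getRow).stripPrefix(oldPrefix)
  val put = new Put(Bytes.toBytes(newKey))
  // Copy every cell unchanged: family, qualifier, timestamp and value are preserved.
  result.rawCells().foreach { cell =>
    put.addColumn(CellUtil.cloneFamily(cell), CellUtil.cloneQualifier(cell),
      cell.getTimestamp, CellUtil.cloneValue(cell))
  }
  dstTable.put(put)
}

scanner.close()
srcTable.close(); dstTable.close()
srcConn.close(); dstConn.close()
For very large tables the same per-row rewrite could be run inside a MapReduce or Spark job for better throughput, but the key-rewriting logic stays the same.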

Related

How to create external tables from parquet files in s3 using hive 1.2?

I have created an external table in Qubole (Hive) which reads parquet (compressed: snappy) files from S3, but on performing a SELECT * FROM table_name I am getting null values for all columns except the partitioned column.
I tried using different serialization.format values in SERDEPROPERTIES, but I am still facing the same issue.
And on removing the property 'serialization.format' = '1' I am getting ERROR: Failed with exception java.io.IOException:Can not read value at 0 in block -1 in file s3://path_to_parquet/.
I checked the parquet files and was able to read the data using parquet-tools:
file_01.snappy.parquet:
{"col_2":1234,"col_3":ABC}
{"col_2":124,"col_3":FHK}
{"col_2":12515,"col_3":UPO}
External table stmt:
CREATE EXTERNAL TABLE parquet_test
(
col2 int,
col3 string
)
PARTITIONED BY (col1 date)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS PARQUET
LOCATION 's3://path_to_parquet'
TBLPROPERTIES ('parquet.compress'='SNAPPY');
Result:
col_1 col_2 col_3
5/3/19 NULL NULL
5/4/19 NULL NULL
5/5/19 NULL NULL
5/6/19 NULL NULL
Expected Result:
col_1 col_2 col_3
5/3/19 1234 ABC
5/4/19 124 FHK
5/5/19 12515 UPO
5/6/19 1234 ABC
I'm writing the answer below assuming that the table was created using Hive and read using Spark (since the question is tagged apache-spark-sql).
How was the data created?
Spark supports case-sensitive schemas.
When we use the DataFrame API, it is possible to write with a case-sensitive schema.
Example:
scala> case class Employee(iD: Int, NaMe: String )
defined class Employee
scala> spark.range(10).map(x => Employee(x.toInt, s"name$x")).write.save("file:///tmp/data/")
scala> spark.read.parquet("file:///tmp/data/").printSchema
root
|-- iD: integer (nullable = true)
|-- NaMe: string (nullable = true)
Notice that in the above example the casing is preserved.
When we create a Hive table on top of the data written by Spark, Hive can read it fine, since it is not case sensitive.
Whereas when the same data is read using Spark, it uses the schema from Hive, which is lowercase by default, and the rows returned are null.
To overcome this, Spark has introduced a config spark.sql.hive.caseSensitiveInferenceMode.
object HiveCaseSensitiveInferenceMode extends Enumeration {
val INFER_AND_SAVE, INFER_ONLY, NEVER_INFER = Value
}
val HIVE_CASE_SENSITIVE_INFERENCE = buildConf("spark.sql.hive.caseSensitiveInferenceMode")
.doc("Sets the action to take when a case-sensitive schema cannot be read from a Hive " +
"table's properties. Although Spark SQL itself is not case-sensitive, Hive compatible file " +
"formats such as Parquet are. Spark SQL must use a case-preserving schema when querying " +
"any table backed by files containing case-sensitive field names or queries may not return " +
"accurate results. Valid options include INFER_AND_SAVE (the default mode-- infer the " +
"case-sensitive schema from the underlying data files and write it back to the table " +
"properties), INFER_ONLY (infer the schema but don't attempt to write it to the table " +
"properties) and NEVER_INFER (fallback to using the case-insensitive metastore schema " +
"instead of inferring).")
.stringConf
.transform(_.toUpperCase(Locale.ROOT))
.checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
.createWithDefault(HiveCaseSensitiveInferenceMode.INFER_AND_SAVE.toString)
INFER_AND_SAVE - Spark infers the schema and stores it in the metastore as part of the table's TBLPROPERTIES (desc extended <table name> should reveal this).
If the value of the property is NOT INFER_AND_SAVE or INFER_ONLY, then Spark uses the schema from the metastore table and will not be able to read the parquet files.
The default value of the property is INFER_AND_SAVE since Spark 2.2.0.
We can check the following to see whether the problem is related to schema case sensitivity (a Spark-shell sketch of these checks follows the list):
1. Value of spark.sql.hive.caseSensitiveInferenceMode (spark.sql("set spark.sql.hive.caseSensitiveInferenceMode") should reveal this)
2. Whether the data was created using Spark
3. If 2 is true, check whether the schema is case sensitive (spark.read.parquet("<location>").printSchema)
4. If 3 uses a case-sensitive schema and the output from 1 is not INFER_AND_SAVE/INFER_ONLY, run spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE"), drop the table, recreate the table, and try to read the data from Spark.
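A rough Spark-shell sketch of checks 1, 3 and 4 (the S3 path and table name are taken from the question; the rest is illustrative):
// 1. Current value of the inference mode.
spark.sql("set spark.sql.hive.caseSensitiveInferenceMode").show(false)

// 3. Schema actually stored in the Parquet files; case-sensitive field names show up here.
spark.read.parquet("s3://path_to_parquet/").printSchema

// 4. If the files carry case-sensitive names and the mode is NEVER_INFER,
//    switch the mode, then drop and recreate the table before reading it again.
spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE")
spark.sql("select * from parquet_test limit 10").show(false)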

hbase rowkey filter on multiple values

I have created an HBase table whose row key is a combination of multiple column values.
My row key data in HBase looks like this:
'123456~raja~ram~45000~mgr~20170116'
'123456~raghu~rajan~65000~mgr~20150106'
I am trying to apply a filter condition just like in SQL, e.g.
select * from table
where deptid = 123456 and name = 'rajan'
How can I apply OR and AND conditions?
I am using the code below to filter:
scan 'tablename', {FILTER => (org.apache.hadoop.hbase.filter.RowFilter.new(CompareFilter::CompareOp.valueOf('EQUAL'),SubstringComparator.new("123456"))) && (org.apache.hadoop.hbase.filter.RowFilter.new(CompareFilter::CompareOp.valueOf('EQUAL'),SubstringComparator.new("rajan")))}
If I use the same code with the values swapped, I get different results:
scan 'tablename', {FILTER => (org.apache.hadoop.hbase.filter.RowFilter.new(CompareFilter::CompareOp.valueOf('EQUAL'),SubstringComparator.new("rajan"))) && (org.apache.hadoop.hbase.filter.RowFilter.new(CompareFilter::CompareOp.valueOf('EQUAL'),SubstringComparator.new("123456")))}
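The && in the FILTER expression is plain JRuby boolean logic, so it most likely just evaluates to the second RowFilter and drops the first one, which would explain why swapping the two filters changes the result. To combine filters explicitly, wrap them in a FilterList; below is a minimal sketch against the classic (HBase 1.x-style) Java client API from Scala, where MUST_PASS_ALL behaves like AND and MUST_PASS_ONE like OR. The table name and values come from the question.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.filter.{CompareFilter, FilterList, RowFilter, SubstringComparator}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

val conf = HBaseConfiguration.create()
val conn = ConnectionFactory.createConnection(conf)
val table = conn.getTable(TableName.valueOf("tablename"))

// MUST_PASS_ALL = AND; use FilterList.Operator.MUST_PASS_ONE for OR.
val filters = new FilterList(FilterList.Operator.MUST_PASS_ALL,
  new RowFilter(CompareFilter.CompareOp.EQUAL, new SubstringComparator("123456")),
  new RowFilter(CompareFilter.CompareOp.EQUAL, new SubstringComparator("rajan")))

val scanner = table.getScanner(new Scan().setFilter(filters))
scanner.asScala.foreach(r => println(Bytes.toString(r.getRow)))
scanner.close(); table.close(); conn.close()
In the shell, the string form of the filter language should express the same AND, e.g. FILTER => "RowFilter(=, 'substring:123456') AND RowFilter(=, 'substring:rajan')".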

Hive cannot read ORC if set "orc.create.index"="false" when loading table

Hive version: 1.2.1. I create a table as below:
CREATE TABLE ORC_NONE(
millisec bigint,
...
)
stored as orc tblproperties ("orc.create.index"="false");
insert into table ORC_NONE select * from ex_test_convert;
But when I query it, it always returns NULL. For example:
Select * from ORC_NONE limit 10; // return blank
Select min(millisec), max(millisec) from ORC_NONE; // return NULL, NULL
I checked the size of ORC_NONE: 2 GB, so it is not an empty table, and if the table is created with "orc.create.index"="true", the queries work.
I meant to test Hive performance on ORC with and without row indexes, more exactly, to test the skipping power of row indexes. However, it seems that Hive cannot read the data when the row index is unavailable.
Is this a bug, or is something wrong with my loading?
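As a quick diagnostic (not part of the original question), the ORC files behind the table can be read directly, for example with Spark, to confirm the data itself is intact; if that works, the problem is on the Hive read path rather than in the files. The warehouse location below is an assumption.
import org.apache.spark.sql.functions.{max, min}

// Read the files that back ORC_NONE directly; the path is a guess at the default warehouse location.
val df = spark.read.orc("/user/hive/warehouse/orc_none")
df.printSchema
df.agg(min("millisec"), max("millisec")).show()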

How to retrieve information from specific rows from table in Hbase?

I have a table in HBase and the key of this table is "user_name" + "id", for example "username123".
I want to retrieve all rows for a specific user_name; for example, if I have rows with keys "john1", "john2", ..., I want to retrieve all rows for john.
How can I do it?
Use PrefixFilter. For the Java API, the answer is here: Hbase Java API: Retrieving all rows that match a Partial Row Key
In the HBase shell, use PrefixFilter too:
scan 'tablename', {FILTER => "(PrefixFilter ('username'))"}
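A minimal Java-API sketch of the same prefix scan, written in Scala and assuming an already open Connection named conn, with the prefix 'john' from the example:
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.PrefixFilter
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

val table = conn.getTable(TableName.valueOf("tablename"))
// All rows whose key starts with "john": "john1", "john2", ...
val scan = new Scan().setFilter(new PrefixFilter(Bytes.toBytes("john")))
val scanner = table.getScanner(scan)
scanner.asScala.foreach(r => println(Bytes.toString(r.getRow)))
scanner.close(); table.close()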

How transfer a Table from HBase to Hive?

How can I transfer an HBase table into Hive correctly?
You can read what I tried before in this question:
How insert overwrite table in hive with diffrent where clauses?
(I made one table to import all the data. The problem is that the data is still in rows and not in columns, so I made three tables for news, social and all, each with a specific WHERE clause. After that I made two joins on those tables, which gives me the result table. So I had six tables in total, which is not really performant!)
To sum up my problem: in HBase there are column families which are saved as rows, like this:
count verpassen news 1
count verpassen social 0
count verpassen all 1
What I want to achieve in Hive is a datastructure like this:
name news social all
verpassen 1 0 1
How am I supposed to do this?
Below is the approach you can use.
Use the HBase storage handler to create the table in Hive.
Example script:
CREATE TABLE hbase_table_1(key string, value string) STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH
SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:val")
TBLPROPERTIES ("hbase.table.name" = "test");
I loaded the sample data you have given into a Hive external table.
select name,collect_set(concat_ws(',',type,val)) input from TESTTABLE
group by name ;
I am grouping the data by name. The resulting output of the above query is one row per name with a collected set such as ["all,1","social,0","news,1"].
Now I wrote a custom mapper which takes that collected set as its input parameter and emits the values:
from (select '["all,1","social,0","news,1"]' input from TESTTABLE group by name) d
MAP d.input USING 'python test.py' AS all, social, news
Alternatively, you can use the output to insert into another table which has the column names name, all, social, news (a Spark sketch of this pivot idea follows).
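If Spark is available, an alternative sketch (not from the original answer) that skips the collect_set plus custom-mapper step: read the HBase-backed Hive table and pivot it, which produces the name/news/social/all layout directly. The table name TESTTABLE and the columns name, type and val are assumptions carried over from the query above.
import org.apache.spark.sql.functions.first

val pivoted = spark.table("TESTTABLE")
  .groupBy("name")
  .pivot("type", Seq("news", "social", "all"))
  .agg(first("val"))

pivoted.show()
// For the sample data this yields a single row: verpassen, 1, 0, 1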
Hope this helps
