I've created an HBase table like this:
create 'student','personal'
and I've put some data into it like this.
ROW COLUMN+CELL
1 column=personal:age, timestamp=1456224023454, value=20
1 column=personal:name, timestamp=1456224008188, value=pesronA
2 column=personal:age, timestamp=1456224891317, value=13
2 column=personal:name, timestamp=1456224868967, value=pesronB
3 column=personal:age, timestamp=1456224935178, value=21
3 column=personal:name, timestamp=1456224921246, value=personC
4 column=personal:age, timestamp=1456224951789, value=20
4 column=personal:name, timestamp=1456224961845, value=personD
5 column=personal:age, timestamp=1456224983240, value=20
5 column=personal:name, timestamp=1456224972816, value=personE
I want to import this data into a Hive table. I wrote a Hive query for that, like this:
CREATE TABLE hbaseStudent(key INT, name STRING, age INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,personal:age,personal:name")
TBLPROPERTIES("hbase.table.name" = "student")
But when I execute the query, an error comes out like this:
Driver returned: 1. Errors: OK
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/hbase/HBaseConfiguration
What should I do?
I tried this and it worked: replace all the double quotes (") with single quotes ('), and add a terminating ; on the last line.
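For reference, a minimal sketch of the rewritten statement (same table and column family as in the question, just with single quotes and a terminating semicolon; note that hbase.columns.mapping entries are positional, so you may also want personal:name and personal:age swapped so they line up with the name and age columns):

CREATE TABLE hbaseStudent(key INT, name STRING, age INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,personal:name,personal:age')
TBLPROPERTIES('hbase.table.name' = 'student');

A quick check afterwards would be select * from hbaseStudent;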
I am trying to export data from Excel into a Hive table. While doing so, I have a column 'ABC' which has values like '1,2,3'.
I used the lateral view explode function, but it does not do anything to my data.
The following is my code snippet:
CREATE TABLE table_name
(
id string,
brand string,
data_name string,
name string,
address string,
country string,
flag string,
sample_list array<string> )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
;
LOAD DATA LOCAL INPATH 'location' INTO TABLE
table_name ;
output sample:
id brand data_name name address country flag sample_list
19 1 ABC SQL ABC Cornstarch IN 1 ["[1,2,3]"]
Then I do:
select * from franchise_unsupress LATERAL VIEW explode(SEslist) SEslist as final_SE;
output sample:
id brand data_name name address country flag sample_list
19 1 ABC SQL ABC Cornstarch IN 1 [1,2,3]
I also tried:
select * from franchise_unsupress lateral view explode(split(SEslist,',')) SEslist AS final_SE ;
but got an error:
FAILED: ClassCastException org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
whereas what I need is:
id brand data_name name address country flag sample_list
19 1 ABC SQL ABC Cornstarch IN 1 1
19 1 ABC SQL ABC Cornstarch IN 1 2
19 1 ABC SQL ABC Cornstarch IN 1 3
Any help will be greatly appreciated! Thank you.
The problem is that the array is recognized in the wrong way and loaded as a single-element array ["[1,2,3]"]. It should be [1,2,3] or ["1","2","3"] (if it is array<string>).
When creating the table, specify the delimiter for collections:
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
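Putting it together, a sketch of the full DDL from the question with the collection delimiter added (assuming sample_list is meant to be array<string>):

CREATE TABLE table_name
(
    id string,
    brand string,
    data_name string,
    name string,
    address string,
    country string,
    flag string,
    sample_list array<string>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE;

With that, a value like 1,2,3 in the last field loads as ["1","2","3"], and the LATERAL VIEW explode query from the question then returns one row per element.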
I wanted to provide my own answer. The issue was with the input that was being provided: my input txt file had [] around the input values. Once they were removed, it worked.
In an Oracle database, I can read this table containing a CLOB type (note the newlines):
ID MY_CLOB
001 500,aaa,bbb
500,ccc,ddd
480,1,2,bad
500,eee,fff
002 777,0,0,bad
003 500,yyy,zzz
I need to process this, and import into an HDFS table with new rows for each MY_CLOB line starting with "500,". In this case, the hive table should look like:
ID C_1 C_2 C_3
001 500 aaa bbb
001 500 ccc ddd
001 500 eee fff
003 500 yyy zzz
This solution to my previous question succeeds in producing this on Oracle. But writing the result to HDFS with a Python driver is very slow, or never succeeds.
Following this solution, I've tested a similar regex + pyspark solution that might work for my purposes:
import cx_Oracle
#... query = """SELECT ID, MY_CLOB FROM oracle_table"""
#... cx_oracle_results <--- fetchmany results (batches) from query
import re
from pyspark.sql import Row
from pyspark.sql.functions import col
def clob_to_table(clob_lines):
    m = re.findall(r"^(500),(.*),(.*)",
                   clob_lines, re.MULTILINE)
    return Row(C_1 = m.group(1), C_2 = m.group(2), C_3 = m.group(3))

# Process each batch of results and write to hive as parquet
for batch in cx_oracle_results():
    # batch is like [(1,<cx_oracle object>), (2,<cx_oracle object>), (3,<cx_oracle object>)]
    # When `.read()` looks like [(1,"500,a,b\n500c,d"), (2,"500,e,e"), (3,"500,z,y\n480,-1,-1")]
    df = sc.parallelize(batch).toDF(["ID", "MY_CLOB"])\
        .withColumn("clob_as_text", col("MY_CLOB")\
        .read()\  # Converts cx_oracle CLOB object to text.
        .map(clob_to_table)
    df.write.mode("append").parquet("myschema.pfile")
But reading Oracle cursor results and feeding them into PySpark this way doesn't work well.
I'm trying to run a Sqoop job generated by another tool, importing the CLOB as text, and hoping I can process the sqooped table into a new Hive table like the above in reasonable time, perhaps with PySpark and a solution similar to the above.
Unfortunately, this Sqoop job doesn't work.
sqoop import -Doraoop.timestamp.string=false -Doracle.sessionTimeZone=America/Chicago
-Doraoop.import.hint=" " -Doraoop.oracle.session.initialization.statements="alter session disable parallel query;"
-Dkite.hive.tmp.root=/user/hive/kite_tmp/wassadamo --verbose
--connect jdbc:oracle:thin:#ldap://connection/string/to/oracle
--num-mappers 8 --split-by date_column
--query "SELECT * FROM (
SELECT ID, MY_CLOB
FROM oracle_table
WHERE ROWNUM <= 1000
) WHERE \$CONDITIONS"
--create-hive-table --hive-import --hive-overwrite --hive-database my_db
--hive-table output_table --as-parquetfile --fields-terminated-by \|
--delete-target-dir --target-dir $HIVE_WAREHOUSE --map-column-java=MY_CLOB=String
--username wassadamo --password-file /user/wassadamo/.oracle_password
But I get an error (snippet below):
20/07/13 17:04:08 INFO mapreduce.Job: map 0% reduce 0%
20/07/13 17:05:08 INFO mapreduce.Job: Task Id : attempt_1594629724936_3157_m_000001_0, Status : FAILED
Error: java.io.IOException: SQLException in nextKeyValue
...
Caused by: java.sql.SQLDataException: ORA-01861: literal does not match format string
This seems to have been caused by mapping the CLOB column to string. I did this based on this answer.
How can I fix this? I'm open to a different PySpark solution as well.
Partial answer: the Oracle error seems to have been due to
--split-by date_column
This date_column is an Oracle DATE type, and it turns out it doesn't work as a split column when sqooping from Oracle. It would be nice to be able to split on it, but splitting on ID (a VARCHAR2) seems to be working.
The issue of efficiently parsing the text MY_CLOB field and creating new rows for each line remains.
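Not from the original posts, but one hedged sketch for that remaining step, assuming the sqooped Hive table is called output_table and that MY_CLOB arrived as a string with its newlines intact, is to split the CLOB text on newlines, explode it, and keep only the lines starting with 500:

-- Hypothetical table and column names; adjust to the actual sqooped schema.
CREATE TABLE parsed_clob AS
SELECT t.ID,
       split(line, ',')[0] AS C_1,
       split(line, ',')[1] AS C_2,
       split(line, ',')[2] AS C_3
FROM output_table t
LATERAL VIEW explode(split(t.MY_CLOB, '\n')) l AS line
WHERE line LIKE '500,%';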
I have to put the below 2 rows in my HBase table:
put 'TABLE', 'ABC::ABC::NLOC','data:document','myvalue'
put 'TABLE', 'ABC::ABC::NLOC','data:meta:test','values'
But after executing these commands, I am unable to see the 2nd command creating a column data:meta:test.
hbase(main):003:0> get 'TABLE', 'ABC::ABC::NLOC'
COLUMN CELL
data:document timestamp=1528398479692, value=profile data - POST!
data:meta timestamp=1528398532570, value=values
2 row(s) in 0.0220 seconds
How can I see the column as data:meta:test? Should I use HBase put in a different way? Any help please.
I am trying to create a table in Hive using complex data types.
One of my columns is an array of strings and the other is an array of maps.
After I have loaded the data into the table, when I try to query the data, I don't get the desired result in the third column, which is an array of maps.
The following is my Hive query:
Step 1:
create table transactiondb2(order_id int, billtype array<string>, paymenttype array<map<string,int>>) ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY '#';
Step 2:
load data local inpath '/home/xyz/data.txt' overwrite into table transactiondb2;
Step 3:
select * from transactiondb2;
And my output is as follows:
OK
1 ["A","B"] [{"credit":null,"10":null},{"cash":null,"25":null},{"emi":null,"30":null}]
2 ["C","D"] [{"credit":null,"157":null},{"cash":null,"45":null},{"emi":null,"35":null}]
3 ["X","Y"] [{"credit":null,"25":null},{"cash":null,"38":null},{"emi":null,"50":null}]
4 ["E","F"] [{"credit":null,"89":null},{"cash":null,"105":null},{"emi":null,"85":null}]
5 ["Z","A"] [{"credit":null,"7":null},{"cash":null,"79":null},{"emi":null,"105":null}]
6 ["D","Y"] [{"credit":null,"30":null},{"cash":null,"100":null},{"emi":null,"101":null}]
7 ["A","Z"] [{"credit":null,"50":null},{"cash":null,"9":null},{"emi":null,"85":null}]
8 ["B","Z"] [{"credit":null,"70":null},{"cash":null,"38":null},{"emi":null,"90":null}]
And my input file data is as follows:
1 A|B credit#10|cash#25|emi#30
2 C|D credit#157|cash#45|emi#35
3 X|Y credit#25|cash#38|emi#50
4 E|F credit#89|cash#105|emi#85
5 Z|A credit#7|cash#79|emi#105
6 D|Y credit#30|cash#100|emi#101
7 A|Z credit#50|cash#9|emi#85
8 B|Z credit#70|cash#38|emi#90
I solved it myself.
We need not declare an array of maps explicitly; by default it takes values from one map after the other.
Create the table as shown below and load the data, and then you will get the desired output.
create table complex(id int,bill array<string>,paytype map<string,int>)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY '#';
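As a quick check (a sketch, assuming the same data file from the question is loaded into this table), the map values now come through as ints:

load data local inpath '/home/xyz/data.txt' overwrite into table complex;
select * from complex;
-- expected first row:
-- 1    ["A","B"]    {"credit":10,"cash":25,"emi":30}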
I have a single file with a structure like:
A 1 2 3
A 4 5 6
A 5 8 12
B abc cde
B and fae
B bsd oio
C 1
C 2
C 3
and would like to load the data into 3 simple tables (A(int, int, int), B(string, string), C(int)).
Is it possible, and how?
It's also fine for me if it is A(string, int, int, int) etc., with the first column of the file included in the table.
I'd go with option 1, as Praveen suggests. I'd create an external table of only a string, and use the FROM ( ... ) syntax to insert into multiple tables at once. I think something like the following would work:
create external table source_table( line string )
stored as textfile
location '/myfile';
from ( select split( line , " ") as col_array from source_table ) cols
insert overwrite table A select col_array[1], col_array[2], col_array[3] where col_array[0] = 'A'
insert overwrite table B select col_array[1], col_array[2] where col_array[0] = 'B'
insert overwrite table C select col_array[1] where col_array[0] = 'C';
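For completeness, a sketch of the target tables this multi-insert assumes already exist (column names are illustrative; depending on the Hive version you may need explicit cast(col_array[1] as int) etc. in the inserts above, since split() yields strings):

-- Hypothetical target tables matching the A/B/C layout in the question.
create table A (c1 int, c2 int, c3 int);
create table B (c1 string, c2 string);
create table C (c1 int);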
Option 1) Map the entire data to a Hive table and then use the insert overwrite table .... option to map the appropriate data to the target tables.
Option 2) Develop an MR program to split the file into multiple files, and then map those files to the target tables in Hive.