Using spark/scala, I use saveAsTextFile() to HDFS, but hiveql("select count(*) from...) return 0 - hadoop

I created external table as follows...
hive -e "create external table temp_db.temp_table (a char(10), b int) PARTITIONED BY (PART_DATE VARCHAR(10)) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/work/temp_db/temp_table'"
And I use saveAsTextFile() with scala in IntelliJ IDEA as follows...
itemsRdd.map(_.makeTsv).saveAsTextFile("hdfs://work/temp_db/temp_table/2016/07/19")
So the file(fields terminated by '\t') was in the /work/temp_db/temp_table/2016/07/19.
hadoop fs -ls /work/temp_db/temp_table/2016/07/19/part-00000 <- data file..
But, I checked with hiveql, there are no datas as follows.
hive -e "select count(*) from temp_db.temp_table" -> 0.
hive -e "select * from temp_db.temp_table limit 5" -> 0 rows fetched.
Help me what to do. Thanks.

you are saving at wrong location from spark. Partition dir name follows part_col_name=part_value.
In Spark: save file at directory part_date=2016%2F07%2F19 under temp_table dir
itemsRdd.map(_.makeTsv)
.saveAsTextFile("hdfs://work/temp_db/temp_table/part_date=2016%2F07%2F19")
add partitions: You will need to add partition that should update hive table's metadata (partition dir we have created from spark as hive expected key=value format)
alter table temp_table add partition (PART_DATE='2016/07/19');
[cloudera#quickstart ~]$ hadoop fs -ls /user/hive/warehouse/temp_table/part*|awk '{print $NF}'
/user/hive/warehouse/temp_table/part_date=2016%2F07%2F19/part-00000
/user/hive/warehouse/temp_table/part_date=2016-07-19/part-00000
query partitioned data:
hive> alter table temp_table add partition (PART_DATE='2016/07/19');
OK
Time taken: 0.16 seconds
hive> select * from temp_table where PART_DATE='2016/07/19';
OK
test1 123 2016/07/19
Time taken: 0.219 seconds, Fetched: 1 row(s)
hive> select * from temp_table;
OK
test1 123 2016/07/19
test1 123 2016-07-19
Time taken: 0.199 seconds, Fetched: 2 row(s)
For Everyday process: you can run saprk job like this - just add partitions right after saveAsTextFile(), aslo note the s in alter statement. it is need to pass variable in hive sql from spark:
val format = new java.text.SimpleDateFormat("yyyy/MM/dd")
vat date = format.format(new java.util.Date())
itemsRDD.saveAsTextFile("/user/hive/warehouse/temp_table/part=$date")
val hive = new HiveContext(sc)
hive.sql(s"alter table temp_table add partition (PART_DATE='$date')")
NOTE: Add partition after saving the file or else spark will throw directory already exist exception as hive creates dir (if not exist) when adding partition.

Related

Hive select * shows 0 but count(1) show returns millions of rows

I create a hive external table and add a partition by
create external table tbA (
colA int,
colB string,
...)
PARTITIONED BY (
`day` string)
stored as parquet;
alter table tbA add partition(day= '2021-09-04');
Then I put a parquet file to the target HDFS directory by hdfs dfs -put ...
I can get expected results using select * from tbA in IMPALA.
In Hive, I can get correct result when using
select count(1) from tbA
However, when I use
select * from tbA limit 10
It returns no result at all.
If anything is wrong with the parquet file or the directory, IMPALA should not get the correct result and Hive can count out row numbers... Why select * from ... shows nothing? Any help is appreciated.
In addition,
running select distinct day from tbA, it returns 2021-09-04
running select * from tbA, it returns data with day = 2021-09-04
It seems this partition is not recognized correctly? I retried to drop the partition and use msck repair table but still not working...

Unable to map the HBase row key in HIve table effectively

I have a HBase table where the rowkey looks like this.
08:516485815:2013 1
06:260070837:2014 1
00:338289200:2014 1
I create a Hive link table using the below query.
create external table hb
(key string,value string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping"=":key,e:-1")
tblproperties("hbase.table.name"="hbaseTable");
When I query the table I get the below result
select * from hb;
08:516485815 1
06:260070837 1
00:338289200 1
This is very strange to me. Why the serde is not able to map the whole content of the HBase key? The hive table is missing everything after the second ':'
Has anybody faced a similar kind of issue?
I tried by recreating your scenario on Hbase 1.1.2 and Hive 1.2.1000,it works as expected and i am able to get the whole rowkey from hive.
hbase> create 'hbaseTable','e'
hbase> put 'hbaseTable','08:516485815:2013','e:-1','1'
hbase> scan 'hbaseTable'
ROW COLUMN+CELL
08:516485815:2013 column=e:-1, timestamp=1519675029451, value=1
1 row(s) in 0.0160 seconds
As i'm having 08:516485815:2013 as rowkey and i have created hive table
hive> create external table hb
(key string,value string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping"=":key,e:-1")
tblproperties("hbase.table.name"="hbaseTable");
hive> select * from hb;
+--------------------+-----------+--+
| hb.key | hb.value |
+--------------------+-----------+--+
| 08:516485815:2013 | 1 |
+--------------------+-----------+--+
Can you once make sure your hbase table rowkey having the data after second :.

Insert into bucketed table produces empty table

I`m trying to do insert into bucketed table. When I run the query everything looks fine and I see in reports some amount of wrote bytes. No any errors in Hive logs also.
But when I look into table I have nothing :(
CREATE TABLE test(
test_date string,
test_id string,
test_title string,)
CLUSTERED BY (
text_date)
INTO 100 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION
'hdfs://myserver/data/hive/databases/test.db/test'
TBLPROPERTIES (
'skip.header.line.count'='1',
'transactional' = 'true')
INSERT INTO test.test
SELECT 'test_date', 'test_id', 'test_title' from test2.green
Result
Ended Job = job_148140234567_254152
Loading data to table test.test
Table test.teststats: [numFiles=100, numRows=1601822, totalSize=9277056, rawDataSize=0]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 6 Reduce: 100 Cumulative CPU: 423.34 sec
HDFS Read: 148450105
HDFS Write: 9282219
SUCCESS
hive> select * from test.test limit 2;
OK
Time taken: 0.124 seconds
hive>
Is this query really working? You have extra comma after in line
test_title string,)
also coulmn text_date isnt in your you column definition. May be you meant test_date?
CLUSTERED BY (text_date)

update query in hive/hbase

I have already created a table in hbase using hive:
hive> CREATE TABLE hbase_table_emp(id int, name string, role string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:name,cf1:role")
TBLPROPERTIES ("hbase.table.name" = "emp");
and created another table to load data on it :
hive> create table testemp(id int, name string, role string) row format delimited fields terminated by '\t';
hive> load data local inpath '/home/user/sample.txt' into table testemp;
and finally insert data into the hbase table:
hive> insert overwrite table hbase_table_emp select * from testemp;
hive> select * from hbase_table_emp;
OK
123 Ram TeamLead
456 Silva Member
789 Krishna Member
time taken: 0.160 seconds, Fetched: 3 row(s)
the table looks like this in hbase:
hbase(main):002:0> scan 'emp'
ROW COLUMN+CELL
123 column=cf1:name, timestamp=1422540225254, value=Ram
123 column=cf1:role, timestamp=1422540225254, value=TeamLead
456 column=cf1:name, timestamp=1422540225254, value=Silva
456 column=cf1:role, timestamp=1422540225254, value=Member
789 column=cf1:name, timestamp=1422540225254, value=Krishna
789 column=cf1:role, timestamp=1422540225254, value=Member
3 row(s) in 2.1230 seconds
Now I'm trying to update a value in this table
for exemple, I want to change the "role" of "Ram" from "Teamlead" to "Member",
which query should I use?
Assuming that you are trying to overwrite the previous value, from the hbase shell you can run the following:
put 'emp', 123, 'cf1:role', Member', 1422540225254
It's important to use the same timestamp as the previous entry if your goal is to overwrite.
You can do it with HIVE since v0.14+:
INSERT INTO TABLE hbase_table_emp VALUES (123, null, "Member");
You've got to provide the key you want to update and null values on fields you don't want to update... yeah, it's odd, like having to compile and run a MapReduce job for a single update, but it's also odd to pretend HIVE+HBase to work like a regular RDBMS and provide full ACID support in the process :)
For updating data I'd stick to the APIs provided by HBase (Stargate, Thrift, Native or even the hbase shell) and use HIVE just for mass imports and data analysis.

Data not getting loaded into Partitioned Table in Hive

I am trying to create partition for my Table inorder to update a value.
This is my sample data
1,Anne,Admin,50000,A
2,Gokul,Admin,50000,B
3,Janet,Sales,60000,A
I want to update Janet's Department to B.
So for doing that I created a table with Department as partition.
create external table trail (EmployeeID Int,FirstName
String,Designation String,Salary Int) PARTITIONED BY (Department
String) row format delimited fields terminated by "," location
'/user/sreeveni/HIVE';
But while doing the above command.
No data are inserted into trail table.
hive>select * from trail;
OK
Time taken: 0.193 seconds
hive>desc trail;
OK
employeeid int None
firstname string None
designation string None
salary int None
department string None
# Partition Information
# col_name data_type comment
department string None
Am I doing anything wrong?
UPDATE
As suggested I tried to insert data into my table
load data inpath '/user/aibladmin/HIVE' overwrite into table trail
Partition(Department);
But it is showing
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode
requires at least one static partition column. To turn this off set
hive.exec.dynamic.partition.mode=nonstrict
After setting set hive.exec.dynamic.partition.mode=nonstrict also didnt work fine.
Anything else to do.
Try both below properties
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
And while writing insert statement for a partitioned table make sure that you specify the partition columns at the last in select clause.
You cannot directly insert data(Hdfs File) into a Partitioned hive table.
First you need to create a normal table, then you will insert that table data into partitioned table.
set hive.exec.dynamic.partition.mode=strict means when ever you are populating hive table it must have at least one static partition column.
set hive.exec.dynamic.partition.mode=nonstrict In this mode you don't need any static partition column.
Try the following:
Start by creating the table:
create external table test23 (EmployeeID Int,FirstName String,Designation String,Salary Int) PARTITIONED BY (Department String) row format delimited fields terminated by "," location '/user/rocky/HIVE';
Create a directory in hdfs with partition name :
$ hadoop fs -mkdir /user/rocky/HIVE/department=50000
Create a local file abc.txt by filtering records having department equal to 50000:
$ cat abc.txt
1,Anne,Admin,50000,A
2,Gokul,Admin,50000,B
Put it into HDFS:
$ hadoop fs -put /home/yarn/abc.txt /user/rocky/HIVE/department=50000
Now alter the table:
ALTER TABLE test23 ADD PARTITION(department=50000);
And check the result:
select * from test23 ;
just set those 2 properties BEFORE you getOrCreate() the spark session:
SparkSession
.builder
.config(new SparkConf())
.appName(appName)
.enableHiveSupport()
.config("hive.exec.dynamic.partition","true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.getOrCreate()
I ran into the same problem and yes these two properties are needed. However, I used JDBC driver with Scala to set these properties before executing Hive statements. The problem, however, was that I was executing a bunch of properties (SET statements) in one execution statement like this
conn = DriverManager.getConnection(conf.get[String]("hive.jdbc.url"))
conn.createStatement().execute(
"SET spark.executor.memory = 2G;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.other.statements =blabla ;")
For some reason, the driver was not able to interpret all these as separate statements, so I needed to execute each one of them separately.
conn = DriverManager.getConnection(conf.get[String]("hive.jdbc.url"))
conn.createStatement().execute("SET spark.executor.memory = 2G;")
conn.createStatement().execute("SET hive.exec.dynamic.partition.mode=nonstrict;")
conn.createStatement().execute("SET hive.other.statements =blabla ;")
Can you try running
MSCK REPAIR TABLE table_name;
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)

Resources