Hive cannot read ORC if "orc.create.index"="false" is set when loading table - hadoop

Hive version: 1.2.1. I create a table as below:
CREATE TABLE ORC_NONE(
millisec bigint,
...
)
stored as orc tblproperties ("orc.create.index"="false");
insert into table ORC_NONE select * from ex_test_convert;
But when querying it, the result is always empty or NULL. For example:
Select * from ORC_NONE limit 10; -- returns blank
Select min(millisec), max(millisec) from ORC_NONE; -- returns NULL, NULL
I checked the size of ORC_NONE: it is 2 GB, so it is not an empty table, and if I create the table with "orc.create.index"="true", the queries work.
I meant to test Hive performance on ORC with and without row indexes, more exactly, to test the skipping power of row indexes. However, it seems that Hive cannot read the data when row indexes are unavailable.
Is this a bug, or is something wrong with my loading?
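For reference, one way to check whether row indexes were actually written is the ORC file dump utility; a sketch, where the warehouse path is a hypothetical example:
$ hive --orcfiledump /apps/hive/warehouse/orc_none/000000_0
# the per-stripe stream list in the dump shows whether ROW_INDEX streams are present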

Related

HIVE - Cannot partition a table: semantic exception failure

I'm not able to import data into a partitioned table in Hive.
Here is how I create the table:
CREATE TABLE IF NOT EXISTS title_ratings
(
tconst STRING,
averageRating DOUBLE,
numVotes INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");
And then I load the data into it: LOAD DATA INPATH '/title.ratings.tsv.gz' INTO TABLE eval_hive_db.title_ratings;
It works fine till here. Now I want to create a dynamically partitioned table. First of all, I set up these params:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
I now create my partitioned table:
CREATE TABLE IF NOT EXISTS title_ratings_part
(
tconst STRING,
numVotes INT
)
PARTITIONED BY (averageRating DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE;
insert into title_ratings_part partition(title_ratings) select tconst, averageRating, numVotes from title_ratings;
(I also tried with numVotes instead by the way)
And I receive this error: FAILED: ValidationFailureSemanticException eval_hive_db.title_ratings_part: Partition spec {title_ratings=null} contains non-partition columns
Can someone help me, please?
Ideally, I want to partition my table by averageRating (less than 2, between 2 and 4, and greater than 4)
You can run this command to check whether there are null values in the partition column:
select count(*) from title_ratings where averageRating is null;
If the count is non-zero, you have to fill those null values first and then apply the partitioning again.
The partition column is stored as the last column in a table, so while inserting you need to maintain the correct order in the select statement.
Please change the order of columns in the select, and name the actual partition column (averageRating) in the partition spec:
insert into title_ratings_part partition(averageRating)
select
tconst,
numVotes,
averageRating -- order-wise this should always be the last column
from title_ratings
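Since the stated goal is to partition by rating bands (less than 2, between 2 and 4, greater than 4) rather than by every distinct DOUBLE value, a derived string partition column may serve better. A sketch, reusing the source table; the table name title_ratings_band, the column rating_band, and the band labels are my own, not from the question:
CREATE TABLE IF NOT EXISTS title_ratings_band
(
tconst STRING,
averageRating DOUBLE,
numVotes INT
)
PARTITIONED BY (rating_band STRING)
STORED AS TEXTFILE;

-- requires the two dynamic-partition settings shown earlier
INSERT INTO TABLE title_ratings_band PARTITION(rating_band)
SELECT
tconst,
averageRating,
numVotes,
CASE
  WHEN averageRating < 2 THEN 'low'
  WHEN averageRating <= 4 THEN 'mid'
  ELSE 'high'
END AS rating_band -- partition column must come last
FROM title_ratings;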

updating a table using hive

Right now I run the following Hive query
CREATE TABLE dwo_analysis.exp_shown AS
SELECT
MIN(sc.date_time) as first_shown_time,
SUBSTR(sc.post_evar12,1,24) as guid,
sc.post_evar238 as experiment_name,
sc.post_evar239 as variant_name
FROM test
WHERE report_suite='adbemmarvelweb.prod'
AND date >= DATE_SUB(CURRENT_DATE,90) AND date < DATE_SUB(CURRENT_DATE, 2)
AND post_prop5 = 'experiment:standard:authenticated:shown'
AND post_evar238 NOT LIKE 'control%'
AND post_evar238 <> ''
AND post_evar239 <> ''
The table test is large. I would like to optimize this query by running it once, and from then on updating the table by getting the last 2 days of data and adding it to the table.
So basically, run the above query once, and every time after that run it again but with the condition
WHERE click_date >= DATE_SUB(CURRENT_DATE, 2) AND click_date < CURRENT_DATE
How do I update the table using Hive to populate the rows as mentioned in the condition above?
First, your queries would be quicker if the Hive table were partitioned based on date. Your CREATE TABLE statement isn't inserting into any partitions, so I suspect your table is not partitioned. It would also be quicker if the source data were Parquet/ORC.
In any case, you can overwrite the table for a date range like so:
INSERT OVERWRITE TABLE dwo_analysis.exp_shown
SELECT * FROM test
WHERE click_date
BETWEEN DATE_SUB(CURRENT_DATE, 2) AND CURRENT_DATE;
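If you recreate dwo_analysis.exp_shown as a table partitioned by date, the overwrite touches only the affected partitions instead of the whole table. A sketch, assuming a dt partition column (my name, not from the question) and the original select list; note that MIN() also needs the GROUP BY the original query leaves implicit:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- rewrites only the partitions for the last 2 days
INSERT OVERWRITE TABLE dwo_analysis.exp_shown PARTITION (dt)
SELECT
MIN(sc.date_time) AS first_shown_time,
SUBSTR(sc.post_evar12, 1, 24) AS guid,
sc.post_evar238 AS experiment_name,
sc.post_evar239 AS variant_name,
sc.`date` AS dt -- partition column last
FROM test sc
WHERE sc.`date` >= DATE_SUB(CURRENT_DATE, 2)
AND sc.`date` < CURRENT_DATE
AND report_suite = 'adbemmarvelweb.prod'
AND post_prop5 = 'experiment:standard:authenticated:shown'
AND post_evar238 NOT LIKE 'control%'
AND post_evar238 <> ''
AND post_evar239 <> ''
GROUP BY SUBSTR(sc.post_evar12, 1, 24), sc.post_evar238, sc.post_evar239, sc.`date`;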

hive add columns on partitioned table does not work

I am sharing my experience with adding columns on a partitioned Hive table.
As you can see, despite the CASCADE clause, the ALTER breaks my table :(
add columns on partitioned table
table description
CREATE TABLE test (
a string,
b string,
c string
)
PARTITIONED BY (
x string,
y string,
z string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
'orc.compress'='SNAPPY'
);
duplicate the table
CREATE TABLE test_tmp...
hadoop distcp hdfs://.../test/* hdfs://.../test_tmp
MSCK REPAIR TABLE test_tmp;
SELECT * FROM test_tmp
LIMIT 100
check: OK (I get results)
modify the table
ALTER TABLE test_tmp
ADD COLUMNS(
aa timestamp,
bb string,
cc int,
dd string
) CASCADE;
SELECT * FROM test_tmp
LIMIT 100
...
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:19, Vertex vertex_1502459312997_187854_4_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
... 1 statement(s) executed, 0 rows affected, exec/fetch time: 21.655/0.000 sec [0 successful, 1 errors]
check: KO (I get this error)
If you are using Hive 0.x or 1.x then you are probably a victim of...
HIVE-10598 Vectorization borks when column is added to table.
...which is specific to ORC format, even if it's not apparent from the JIRA label.
There is a partial fix as of Hive 2.0 (i.e. ADD is fixed, but DROP / RENAME / CHANGE are still crippled) thanks to
HIVE-11981 ORC Schema Evolution Issues (Vectorized, ACID, and Non-Vectorized)
And another related fix as of Hive 2.1.1 for CHANGE:
HIVE-14355 Schema evolution for ORC in llap is broken for Int to String conversion
To be continued...
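Until you can upgrade, a common workaround for vectorization-specific bugs like the above is to disable vectorized execution for queries that touch the altered table; a sketch (it trades scan speed for correct results):
-- fall back to the non-vectorized ORC read path
SET hive.vectorized.execution.enabled=false;

SELECT * FROM test_tmp
LIMIT 100;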

How to transfer a table from HBase to Hive?

How can I transfer an HBase table into Hive correctly?
You can read what I tried before in this question:
How insert overwrite table in hive with diffrent where clauses?
(I made one table to import all the data. The problem here is that the data is still in rows and not in columns. So I made 3 tables for news, social and all, each with a specific where clause. After that I made 2 joins on those tables, which gave me the result table. So I had 6 tables in all, which is not really performant!)
To sum my problem up: in HBase there are column families which are saved as rows, like this:
count verpassen news 1
count verpassen social 0
count verpassen all 1
What I want to achieve in Hive is a datastructure like this:
name news social all
verpassen 1 0 1
How am I supposed to do this?
Below is the approach you can use.
Use the HBase storage handler to create the table in Hive.
Example script:
CREATE TABLE hbase_table_1 (key string, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:val")
TBLPROPERTIES ("hbase.table.name" = "test");
I loaded the sample data you have given into a Hive external table.
select name,collect_set(concat_ws(',',type,val)) input from TESTTABLE
group by name ;
I am grouping the data by name. The resultant output for the above query will be an array like ["all,1","social,0","news,1"].
Now I wrote a custom mapper which takes that input as a parameter and emits the values:
FROM (SELECT '["all,1","social,0","news,1"]' AS input FROM TESTTABLE GROUP BY name) d
MAP d.input USING 'python test.py' AS all, social, news
Alternatively, you can use the output to insert into another table which has the column names name, all, social, news.
Hope this helps
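As a simpler alternative to the custom mapper, the same pivot can be expressed in plain HiveQL with conditional aggregation. A sketch, assuming TESTTABLE has the columns name, type, and val used in the grouped query above:
SELECT
name,
MAX(CASE WHEN type = 'news' THEN val END) AS news,
MAX(CASE WHEN type = 'social' THEN val END) AS social,
MAX(CASE WHEN type = 'all' THEN val END) AS `all` -- all is a reserved word, hence the backticks
FROM TESTTABLE
GROUP BY name;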

Data not getting loaded into Partitioned Table in Hive

I am trying to create a partition for my table in order to update a value.
This is my sample data:
1,Anne,Admin,50000,A
2,Gokul,Admin,50000,B
3,Janet,Sales,60000,A
I want to update Janet's department to B.
So to do that, I created a table with Department as the partition column.
create external table trail (EmployeeID Int, FirstName String, Designation String, Salary Int)
PARTITIONED BY (Department String)
row format delimited fields terminated by ","
location '/user/sreeveni/HIVE';
But after running the above command, no data is inserted into the trail table.
hive>select * from trail;
OK
Time taken: 0.193 seconds
hive>desc trail;
OK
employeeid int None
firstname string None
designation string None
salary int None
department string None
# Partition Information
# col_name data_type comment
department string None
Am I doing anything wrong?
UPDATE
As suggested, I tried to insert data into my table:
load data inpath '/user/aibladmin/HIVE' overwrite into table trail
Partition(Department);
But it shows:
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode
requires at least one static partition column. To turn this off set
hive.exec.dynamic.partition.mode=nonstrict
Setting hive.exec.dynamic.partition.mode=nonstrict also didn't work.
Is there anything else I should do?
Try setting both of the properties below:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
And while writing the insert statement for a partitioned table, make sure that you specify the partition columns last in the select clause.
You cannot directly insert data (an HDFS file) into a partitioned Hive table.
First you need to create a normal table, then you insert that table's data into the partitioned table (a sketch of this two-step load follows below).
set hive.exec.dynamic.partition.mode=strict means that whenever you populate the Hive table, it must have at least one static partition column.
set hive.exec.dynamic.partition.mode=nonstrict means that in this mode you don't need any static partition column.
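A minimal sketch of that two-step load, reusing the file path from the question; the staging table name trail_stage is my own:
-- 1. staging table that matches the raw file layout (all 5 fields)
create table trail_stage (EmployeeID Int, FirstName String, Designation String, Salary Int, Department String)
row format delimited fields terminated by ",";

load data inpath '/user/aibladmin/HIVE' overwrite into table trail_stage;

-- 2. dynamic-partition insert; the partition column goes last in the select
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table trail partition(Department)
select EmployeeID, FirstName, Designation, Salary, Department
from trail_stage;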
Try the following:
Start by creating the table:
create external table test23 (EmployeeID Int,FirstName String,Designation String,Salary Int) PARTITIONED BY (Department String) row format delimited fields terminated by "," location '/user/rocky/HIVE';
Create a directory in HDFS named after the partition (the partition column is Department, whose values here are A and B):
$ hadoop fs -mkdir /user/rocky/HIVE/department=A
Create a local file abc.txt by filtering records whose department equals A; the partition column itself is not stored in the data file:
$ cat abc.txt
1,Anne,Admin,50000
3,Janet,Sales,60000
Put it into HDFS:
$ hadoop fs -put /home/yarn/abc.txt /user/rocky/HIVE/department=A
Now alter the table:
ALTER TABLE test23 ADD PARTITION(department='A');
And check the result:
select * from test23;
Just set those 2 properties BEFORE you getOrCreate() the Spark session:
val spark = SparkSession
  .builder
  .config(new SparkConf())
  .appName(appName)
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .getOrCreate()
I ran into the same problem, and yes, these two properties are needed. However, I used the JDBC driver with Scala to set these properties before executing the Hive statements. The problem, however, was that I was executing a bunch of properties (SET statements) in one execute call, like this:
conn = DriverManager.getConnection(conf.get[String]("hive.jdbc.url"))
conn.createStatement().execute(
  """SET spark.executor.memory = 2G;
     SET hive.exec.dynamic.partition.mode = nonstrict;
     SET hive.other.statements = blabla;""")
For some reason, the driver was not able to interpret all these as separate statements, so I needed to execute each one of them separately.
conn = DriverManager.getConnection(conf.get[String]("hive.jdbc.url"))
conn.createStatement().execute("SET spark.executor.memory = 2G;")
conn.createStatement().execute("SET hive.exec.dynamic.partition.mode=nonstrict;")
conn.createStatement().execute("SET hive.other.statements =blabla ;")
Can you try running
MSCK REPAIR TABLE table_name;
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
