CRUD operations in Hive - hadoop

I'm trying to do CRUD operations in Hive. Insert queries run successfully, but when I run update and delete I get the exception below.
FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.
List of the queries I ran
CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2))
CLUSTERED BY (age) INTO 2 BUCKETS STORED AS ORC;
INSERT INTO TABLE students
VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
CREATE TABLE pageviews (userid VARCHAR(64), link STRING, came_from STRING)
PARTITIONED BY (datestamp STRING) CLUSTERED BY (userid) INTO 256 BUCKETS STORED AS ORC;
INSERT INTO TABLE pageviews PARTITION (datestamp = '2014-09-23')
VALUES ('jsmith', 'mail.com', 'sports.com'), ('jdoe', 'mail.com', null);
INSERT INTO TABLE pageviews PARTITION (datestamp)
VALUES ('tjohnson', 'sports.com', 'finance.com', '2014-09-23'), ('tlee', 'finance.com', null, '2014-09-21');
Source : https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Delete
Update and delete queries I'm trying to run
update students1 set age = 36 where name ='barney rubble';
update students1 set name = 'barney rubble1' where age =36;
delete from students1 where age=32;
Hive Version : 2.1(Latest)
Note: I'm aware that Hive is not meant for update and delete commands on big data sets, but I'm trying them anyway to learn how Hive CRUD operations work.
Can someone point out where I'm going wrong with the update/delete queries?

Make sure you are setting the properties listed here:
https://community.hortonworks.com/questions/37519/how-to-activate-acid-transactions-in-hive-within-h.html
I tested the same example you provided in your comment on Hive 1.1.0 (CDH 5.8.3) and it works.
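For reference, the ACID settings described in the Hive transactions documentation look roughly like this. Treat it as a sketch, not a verified configuration for your cluster: the table name students_acid is made up for illustration, and the compactor settings normally belong in hive-site.xml on the metastore side rather than in a session.

```sql
-- Session-level settings (can also be placed in hive-site.xml)
SET hive.support.concurrency=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- Hive 1.x only: also SET hive.enforce.bucketing=true;
-- (the property was removed in Hive 2.0, where bucketing is always enforced)

-- Metastore-side settings for compaction (hive-site.xml):
--   hive.compactor.initiator.on=true
--   hive.compactor.worker.threads=1

-- Hypothetical table name; the important part is 'transactional'='true'
-- on a bucketed ORC table.
CREATE TABLE students_acid (name VARCHAR(64), age INT, gpa DECIMAL(3, 2))
CLUSTERED BY (age) INTO 2 BUCKETS STORED AS ORC
TBLPROPERTIES ('transactional'='true');

INSERT INTO TABLE students_acid
VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);

UPDATE students_acid SET gpa = 2.50 WHERE name = 'barney rubble';
DELETE FROM students_acid WHERE age = 35;
```

One more thing to watch for: Hive does not allow updating a bucketing or partition column, so an update that sets age on a table clustered by age will still be rejected even with these settings in place.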

Related

HIVE - Cannot partition a table: semantic exception failure

I'm not able to import data into a partitioned table in Hive.
Here is how I create the table
CREATE TABLE IF NOT EXISTS title_ratings
(
tconst STRING,
averageRating DOUBLE,
numVotes INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");
And then I load the data into it: LOAD DATA INPATH '/title.ratings.tsv.gz' INTO TABLE eval_hive_db.title_ratings;
It works fine up to here. Now I want to create a dynamically partitioned table. First of all, I set these parameters:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
I now create my partitioned table:
CREATE TABLE IF NOT EXISTS title_ratings_part
(
tconst STRING,
numVotes INT
)
PARTITIONED BY (averageRating DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE;
insert into title_ratings_part partition(title_ratings) select tconst, averageRating, numVotes from title_ratings;
(I also tried partitioning by numVotes instead, by the way.)
And I receive this error: FAILED: ValidationFailureSemanticException eval_hive_db.title_ratings_part: Partition spec {title_ratings=null} contains non-partition columns
Can someone help me, please?
Ideally, I want to partition my table by averageRating (less than 2, between 2 and 4, and greater than 4).
You can run this query to check whether the column contains null values:
select averageRating, count(*) from title_ratings group by averageRating;
If a row with a NULL averageRating shows up in the result, fill in those values and then apply the partitioning again.
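The partition column is stored as the last column of the table, so the select statement must list the columns in that order, with the partition column last.
The partition clause must also name the partition column (averageRating), not the source table; that is what the "contains non-partition columns" error is complaining about.
Please change the insert as follows:
insert into title_ratings_part partition(averageRating)
select
tconst,
numVotes,
averageRating -- the partition column must always come last
from title_ratings;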
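For the three ranges mentioned in the question (less than 2, between 2 and 4, greater than 4), one option is to partition by a derived label instead of the raw rating, which also avoids creating one partition per distinct value. This is only a sketch: the table name title_ratings_banded, the column rating_band and the label strings are invented for illustration.

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical table: averageRating stays a regular column,
-- rating_band is the (derived) partition column.
CREATE TABLE IF NOT EXISTS title_ratings_banded
(
  tconst STRING,
  averageRating DOUBLE,
  numVotes INT
)
PARTITIONED BY (rating_band STRING)
STORED AS TEXTFILE;

INSERT INTO TABLE title_ratings_banded PARTITION (rating_band)
SELECT
  tconst,
  averageRating,
  numVotes,
  CASE
    WHEN averageRating < 2  THEN 'lt_2'
    WHEN averageRating <= 4 THEN '2_to_4'
    WHEN averageRating > 4  THEN 'gt_4'
    ELSE 'unknown'            -- rows with a NULL rating land here
  END AS rating_band          -- partition column goes last
FROM title_ratings;
```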

clickhouse alter MATERIALIZED VIEW add column

env: ClickHouse version: 22.3.3.44; database engine: Atomic
I have a raw table and mv, schema like this:
CREATE TABLE IF NOT EXISTS test.Income_Raw on cluster '{cluster}' (
Id Int64,
DateNum Date,
Cnt Int64,
LoadTime DateTime
) ENGINE = MergeTree
PARTITION BY toYYYYMMDD(LoadTime)
ORDER BY (Id, DateNum);
CREATE MATERIALIZED VIEW test.Income_MV on cluster '{cluster}'
ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/{shard}/{database}/{table}', '{replica}')
PARTITION BY toYYYYMM(DateNum)
ORDER BY (Id, DateNum)
TTL DateNum+ INTERVAL 100 DAY
AS SELECT
DateNum,
Id,
argMaxState(Cnt, LoadTime) as Cnt ,
maxState( LoadTime) as latest_loadtime
FROM test.Income_Raw
GROUP BY Id, DateNum;
Now I want to add a column named 'Price' to the raw table and the MV,
so I ran the following statements step by step:
// first I alter raw table
1. alter table test.Income_Raw on cluster '{cluster}' add column Price Int32
// below sqls, I run to alter MV
2. detach test.Income_MV on cluster '{cluster}'
3. alter test.`.inner_id.{uuid}` on cluster '{cluster}' add column Price Int32
// step 4, basically I just use 'attach' replace 'create' and add 'Price' to select query
4. attach MATERIALIZED VIEW test.Income_MV on cluster '{cluster}'
ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/{shard}/{database}/{table}', '{replica}')
PARTITION BY toYYYYMM(DateNum)
ORDER BY (Id, DateNum)
TTL DateNum+ INTERVAL 100 DAY
AS SELECT
DateNum,
Id,
Price,
argMaxState(Cnt, LoadTime) as Cnt ,
maxState( LoadTime) as latest_loadtime
FROM test.Income_Raw
GROUP BY Id, DateNum, Price;
But at step 4, I got this error:
Code: 80. DB::Exception: Incorrect ATTACH TABLE query for Atomic database engine. Use one of the following queries instead:
1. ATTACH TABLE Income_MV;
2. CREATE TABLE Income_MV <table definition>;
3. ATTACH TABLE Income_MV FROM '/path/to/data/' <table definition>;
4. ATTACH TABLE Income_MVUUID '<uuid>' <table definition>;. (INCORRECT_QUERY) (version 22.3.3.44 (official build))
I ran these statements following the references below.
https://kb.altinity.com/altinity-kb-schema-design/materialized-views/
Clickhouse altering materialized view's select
So my question is: how do I modify the MV's select query, and which step did I get wrong?
I figured out that this is all that is needed:
Preparation: use an explicit target table instead of the inner table for the MV.
1. Alter the MV target table.
2. Drop the MV.
3. Re-create the MV with the new query.
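As a rough sketch of those steps, using the schema from the question: the target table name test.Income_Agg is made up, and storing Price as an argMax state (rather than grouping by it) is my own assumption, chosen because Price is not part of the ORDER BY key of the aggregating table.

```sql
-- Preparation (once): explicit target table, and an MV that writes TO it.
CREATE TABLE IF NOT EXISTS test.Income_Agg ON CLUSTER '{cluster}' (
    DateNum Date,
    Id Int64,
    Cnt AggregateFunction(argMax, Int64, DateTime),
    latest_loadtime AggregateFunction(max, DateTime)
) ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/{shard}/{database}/{table}', '{replica}')
PARTITION BY toYYYYMM(DateNum)
ORDER BY (Id, DateNum)
TTL DateNum + INTERVAL 100 DAY;

CREATE MATERIALIZED VIEW test.Income_MV ON CLUSTER '{cluster}'
TO test.Income_Agg
AS SELECT
    DateNum,
    Id,
    argMaxState(Cnt, LoadTime) AS Cnt,
    maxState(LoadTime) AS latest_loadtime
FROM test.Income_Raw
GROUP BY Id, DateNum;

-- 1. Alter the MV target table (after adding Price to the raw table).
ALTER TABLE test.Income_Agg ON CLUSTER '{cluster}'
    ADD COLUMN Price AggregateFunction(argMax, Int32, DateTime);

-- 2. Drop the MV. The target table and its data are not affected.
DROP VIEW test.Income_MV ON CLUSTER '{cluster}';

-- 3. Re-create the MV with the new query.
CREATE MATERIALIZED VIEW test.Income_MV ON CLUSTER '{cluster}'
TO test.Income_Agg
AS SELECT
    DateNum,
    Id,
    argMaxState(Price, LoadTime) AS Price,
    argMaxState(Cnt, LoadTime) AS Cnt,
    maxState(LoadTime) AS latest_loadtime
FROM test.Income_Raw
GROUP BY Id, DateNum;
```

Rows inserted into the raw table between steps 2 and 3 are not picked up by the view, so it is worth pausing inserts or backfilling that window afterwards.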

updating a table using hive

Right now I run the following Hive query
CREATE TABLE dwo_analysis.exp_shown AS
SELECT
MIN(sc.date_time) as first_shown_time,
SUBSTR(sc.post_evar12,1,24) as guid,
sc.post_evar238 as experiment_name,
sc.post_evar239 as variant_name
FROM test sc
WHERE report_suite='adbemmarvelweb.prod'
AND date >= DATE_SUB(CURRENT_DATE,90) AND date < DATE_SUB(CURRENT_DATE, 2)
AND post_prop5 = 'experiment:standard:authenticated:shown'
AND post_evar238 NOT LIKE 'control%'
AND post_evar238 <> ''
AND post_evar239 <> ''
GROUP BY SUBSTR(sc.post_evar12,1,24), sc.post_evar238, sc.post_evar239;
The table test is large. I would like to optimize this by running the query above once and, on every later run, fetching only the last 2 days of data and adding it to the table.
So basically: run the above query once, and on each later run use this condition instead:
WHERE click_date >= DATE_SUB(CURRENT_DATE, 2) AND click_date < DATE_SUB(CURRENT_DATE)
How do I update the table using Hive so that it is populated with the rows matching the condition above?
First, your queries would be quicker if the Hive table were partitioned by date. Your create table statement isn't inserting into any partitions, so I suspect your table is not partitioned. It would also be quicker if the source data were stored as Parquet or ORC.
In any case, you can overwrite the table with the rows from a date range like so:
INSERT OVERWRITE TABLE dwo_analysis.exp_shown
SELECT * FROM test
WHERE click_date
BETWEEN DATE_SUB(CURRENT_DATE, 2) AND CURRENT_DATE;
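Note that INSERT OVERWRITE on an unpartitioned table replaces all existing rows. If the goal is to add the latest rows to what is already there, INSERT INTO is the appending form. A rough sketch, assuming click_date is the date column of test (the question uses both date and click_date) and reusing the filters from the original query:

```sql
-- Append only the last two days instead of rebuilding the 90-day window.
INSERT INTO TABLE dwo_analysis.exp_shown
SELECT
  MIN(sc.date_time)             AS first_shown_time,
  SUBSTR(sc.post_evar12, 1, 24) AS guid,
  sc.post_evar238               AS experiment_name,
  sc.post_evar239               AS variant_name
FROM test sc
WHERE sc.report_suite = 'adbemmarvelweb.prod'
  AND sc.click_date >= DATE_SUB(CURRENT_DATE, 2)   -- assumed column name
  AND sc.click_date <  CURRENT_DATE
  AND sc.post_prop5 = 'experiment:standard:authenticated:shown'
  AND sc.post_evar238 NOT LIKE 'control%'
  AND sc.post_evar238 <> ''
  AND sc.post_evar239 <> ''
GROUP BY SUBSTR(sc.post_evar12, 1, 24), sc.post_evar238, sc.post_evar239;
```

Because a guid may already have a row from an earlier run, first_shown_time is only the minimum within each load. Taking MIN(first_shown_time) per guid, experiment and variant when reading exp_shown is one way to collapse the duplicates.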

Hive cannot read ORC if set "orc.create.index"="false" when loading table

Hive version: 1.2.1. I create a table as follows:
CREATE TABLE ORC_NONE(
millisec bigint,
...
)
stored as orc tblproperties ("orc.create.index"="false");
insert into table ORC_NONE select * from ex_test_convert;
But when I query it, it always returns NULL. For example:
Select * from ORC_NONE limit 10; -- returns blank
Select min(millisec), max(millisec) from ORC_NONE; -- returns NULL, NULL
I checked the size of ORC_NONE: it is 2G, so the table is not empty. If I create the table with "orc.create.index"="true", the queries work.
I meant to test Hive performance on ORC with and without row indexes, more exactly, to test the skipping power of row indexes. However, it seems that Hive cannot read the data when the row index is unavailable.
Is this a bug? Or is something wrong with my loading?

Indexing data from HDFS to Elasticsearch via Hive

I'm using the Elasticsearch for Hadoop plugin to read and index documents in Elasticsearch via Hive.
I followed the documentation on this page:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html
To index documents in Elasticsearch with Hadoop, you need to create a properly configured table in Hive.
I ran into a problem when inserting data into that Hive table.
This is the script I used to create the table for writing:
CREATE EXTERNAL TABLE es_names_w
(
firstname string,
lastname string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'hive_test/names', 'es.index.auto.create' = 'true')
Then I tried to insert data:
INSERT OVERWRITE TABLE es_names_w
SELECT firstname,lastname
FROM tmp_names_source;
The error I get from Hive is:
"Job submission failed with exception 'org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException: org.xml.sax.SAXParseException; systemId: file:////hdfs_data/mapred/jt/jobTracker/job_201506091622_0064.xml; lineNumber: 607; columnNumber: 51; Character reference "&#..."
However, this error occurs ONLY when the Hive table that I create has more than one column.
For example, this code works:
CREATE EXTERNAL TABLE es_names_w
(
firstname string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'hive_test/names', 'es.index.auto.create' = 'true')
INSERT OVERWRITE TABLE es_names_w
SELECT firstname
FROM tmp_names_source;
Everything went well: Hive created a new type in the Elasticsearch index and the data was indexed in Elasticsearch.
I really don't know why my first attempt doesn't work.
I would appreciate some help.
Thanks
Can you add 'es.mapping.id' = 'key' to the table properties? The key can be your firstname column.
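As a sketch of that suggestion applied to the table from the question (es.mapping.id is an elasticsearch-hadoop setting that names the field used as the document id; whether it resolves the error above is not confirmed here):

```sql
CREATE EXTERNAL TABLE es_names_w
(
  firstname string,
  lastname string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource' = 'hive_test/names',
  'es.index.auto.create' = 'true',
  -- use the firstname column as the Elasticsearch document id
  'es.mapping.id' = 'firstname'
);
```

Keep in mind that with firstname as the id, two rows with the same first name would write to the same document.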
Try
'es.index.auto.create' = 'false'
Try it with a SerDe; that should work. For example:
CREATE EXTERNAL TABLE elasticsearch_es (
name STRING, id INT, country STRING )
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES ('es.resource'='elasticsearch/demo');
Also, make sure that the index and type you create in ES use exactly the same mapping as the Hive table's columns.
