How do I INSERT OVERWRITE with a struct in HIVE?
I have a Hive table tweets stored as text that I am trying to write to another table tweetsORC that is ORC. Both have the same structure:
col_name data_type comment
racist boolean from deserializer
contributors string from deserializer
coordinates string from deserializer
created_at string from deserializer
entities struct<hashtags:array<string>,symbols:array<string>,urls:array<struct<display_url:string,expanded_url:string,indices:array<tinyint>,url:string>>,user_mentions:array<string>> from deserializer
favorite_count tinyint from deserializer
favorited boolean from deserializer
filter_level string from deserializer
geo string from deserializer
id bigint from deserializer
id_str string from deserializer
in_reply_to_screen_name string from deserializer
in_reply_to_status_id string from deserializer
in_reply_to_status_id_str string from deserializer
in_reply_to_user_id string from deserializer
in_reply_to_user_id_str string from deserializer
is_quote_status boolean from deserializer
lang string from deserializer
place string from deserializer
possibly_sensitive boolean from deserializer
retweet_count tinyint from deserializer
retweeted boolean from deserializer
source string from deserializer
text string from deserializer
timestamp_ms string from deserializer
truncated boolean from deserializer
user struct<contributors_enabled:boolean,created_at:string,default_profile:boolean,default_profile_image:boolean,description:string,favourites_count:tinyint,follow_request_sent:string,followers_count:tinyint,following:string,friends_count:tinyint,geo_enabled:boolean,id:bigint,id_str:string,is_translator:boolean,lang:string,listed_count:tinyint,location:string,name:string,notifications:string,profile_background_color:string,profile_background_image_url:string,profile_background_image_url_https:string,profile_background_tile:boolean,profile_image_url:string,profile_image_url_https:string,profile_link_color:string,profile_sidebar_border_color:string,profile_sidebar_fill_color:string,profile_text_color:string,profile_use_background_image:boolean,protected:boolean,screen_name:string,statuses_count:smallint,time_zone:string,url:string,utc_offset:string,verified:boolean> from deserializer
When I try to insert from tweets to tweetsORC I get:
INSERT OVERWRITE TABLE tweetsORC SELECT * FROM tweets;
FAILED: NoMatchingMethodException No matching method for class org.apache.hadoop.hive.ql.udf.UDFToString with (struct<hashtags:array<string>,symbols:array<string>,urls:array<struct<display_url:string,expanded_url:string,indices:array<tinyint>,url:string>>,user_mentions:array<string>>). Possible choices: _FUNC_(bigint) _FUNC_(binary) _FUNC_(boolean) _FUNC_(date) _FUNC_(decimal(38,18)) _FUNC_(double) _FUNC_(float) _FUNC_(int) _FUNC_(smallint) _FUNC_(string) _FUNC_(timestamp) _FUNC_(tinyint) _FUNC_(void)
The only help I have found on this kind of problem says to make a UDF use primitive types, but I am not using a UDF! Any help is much appreciated!
FYI: Hive version:
Hive 1.2.1000.2.4.2.0-258
Subversion git://u12-slave-5708dfcd-10/grid/0/jenkins/workspace/HDP-build-ubuntu12/bigtop/output/hive/hive-1.2.1000.2.4.2.0 -r 240760457150036e13035cbb82bcda0c65362f3a
EDIT: Create tables and sample data:
create table tweets (
contributors string,
coordinates string,
created_at string,
entities struct <
hashtags: array <string>,
symbols: array <string>,
urls: array <struct <
display_url: string,
expanded_url: string,
indices: array <tinyint>,
url: string>>,
user_mentions: array <string>>,
favorite_count tinyint,
favorited boolean,
filter_level string,
geo string,
id bigint,
id_str string,
in_reply_to_screen_name string,
in_reply_to_status_id string,
in_reply_to_status_id_str string,
in_reply_to_user_id string,
in_reply_to_user_id_str string,
is_quote_status boolean,
lang string,
place string,
possibly_sensitive boolean,
retweet_count tinyint,
retweeted boolean,
source string,
text string,
timestamp_ms string,
truncated boolean,
`user` struct <
contributors_enabled: boolean,
created_at: string,
default_profile: boolean,
default_profile_image: boolean,
description: string,
favourites_count: tinyint,
follow_request_sent: string,
followers_count: tinyint,
`following`: string,
friends_count: tinyint,
geo_enabled: boolean,
id: bigint,
id_str: string,
is_translator: boolean,
lang: string,
listed_count: tinyint,
location: string,
name: string,
notifications: string,
profile_background_color: string,
profile_background_image_url: string,
profile_background_image_url_https: string,
profile_background_tile: boolean,
profile_image_url: string,
profile_image_url_https: string,
profile_link_color: string,
profile_sidebar_border_color: string,
profile_sidebar_fill_color: string,
profile_text_color: string,
profile_use_background_image: boolean,
protected: boolean,
screen_name: string,
statuses_count: smallint,
time_zone: string,
url: string,
utc_offset: string,
verified: boolean>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/home/ed/Downloads/hive-json-master/1abbo.txt' OVERWRITE INTO TABLE tweets;
create table tweetsORC (
racist boolean,
contributors string,
coordinates string,
created_at string,
entities struct <
hashtags: array <string>,
symbols: array <string>,
urls: array <struct <
display_url: string,
expanded_url: string,
indices: array <tinyint>,
url: string>>,
user_mentions: array <string>>,
favorite_count tinyint,
favorited boolean,
filter_level string,
geo string,
id bigint,
id_str string,
in_reply_to_screen_name string,
in_reply_to_status_id string,
in_reply_to_status_id_str string,
in_reply_to_user_id string,
in_reply_to_user_id_str string,
is_quote_status boolean,
lang string,
place string,
possibly_sensitive boolean,
retweet_count tinyint,
retweeted boolean,
source string,
text string,
timestamp_ms string,
truncated boolean,
`user` struct <
contributors_enabled: boolean,
created_at: string,
default_profile: boolean,
default_profile_image: boolean,
description: string,
favourites_count: tinyint,
follow_request_sent: string,
followers_count: tinyint,
`following`: string,
friends_count: tinyint,
geo_enabled: boolean,
id: bigint,
id_str: string,
is_translator: boolean,
lang: string,
listed_count: tinyint,
location: string,
name: string,
notifications: string,
profile_background_color: string,
profile_background_image_url: string,
profile_background_image_url_https: string,
profile_background_tile: boolean,
profile_image_url: string,
profile_image_url_https: string,
profile_link_color: string,
profile_sidebar_border_color: string,
profile_sidebar_fill_color: string,
profile_text_color: string,
profile_use_background_image: boolean,
protected: boolean,
screen_name: string,
statuses_count: smallint,
time_zone: string,
url: string,
utc_offset: string,
verified: boolean>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS ORC tblproperties ("orc.compress"="ZLIB");
data here.
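Instead of using SELECT *, I listed the fields by name and the error went away. Presumably this is because INSERT ... SELECT matches columns by position rather than by name: tweetsORC starts with the racist column, so with SELECT * the entities struct from tweets lined up with a string column in tweetsORC, and Hive tried (and failed) to cast the struct to a string with UDFToString.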
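Listing the fields by name might look like the sketch below. It assumes the DDL posted in the question, where tweetsORC has a leading racist column that tweets does not have; the NULL placeholder for that column is an assumption, not something taken from the original answer.

```sql
-- Sketch only: assumes the CREATE TABLE statements from the question.
-- If your tweets table does contain `racist`, select that column
-- instead of the NULL placeholder.
INSERT OVERWRITE TABLE tweetsORC
SELECT
  CAST(NULL AS BOOLEAN) AS racist,  -- placeholder for the column tweets lacks
  contributors,
  coordinates,
  created_at,
  entities,
  favorite_count,
  favorited,
  filter_level,
  geo,
  id,
  id_str,
  in_reply_to_screen_name,
  in_reply_to_status_id,
  in_reply_to_status_id_str,
  in_reply_to_user_id,
  in_reply_to_user_id_str,
  is_quote_status,
  lang,
  place,
  possibly_sensitive,
  retweet_count,
  retweeted,
  source,
  text,
  timestamp_ms,
  truncated,
  `user`                            -- backticks as in the question's DDL
FROM tweets;
```

The select list follows the column order of tweetsORC, so each expression lands in a target column of the same type.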
Data type mismatch: the type of the data you are inserting does not match the type of the corresponding column in the target table. For example, if a column is declared as string when the table is created, but the value you insert into it is actually a list (array) type, this error is thrown.
Related
Non-string values showing as NULL in Hive
I'm new to Hive and creating my first table. For some reason all non-string values are showing as NULL (including INT, BOOLEAN, etc.). My data looks like this sample row:
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
I used this to create the table:
create external table bank_dataset( age TINYINT, job string, education string, default BOOLEAN, balance INT, housing BOOLEAN, loan BOOLEAN, contact STRING, day STRING, month STRING, duration INT, campaign INT, pdays INT, previous INT, poutcome STRING, y BOOLEAN) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u003B' STORED AS TEXTFILE location '/user/marchenrisaad_gmail/Bank_Project' tblproperties("skip.header.line.count"="1");
Thanks for the comments, it worked! But I have one issue: for every row I get all the data correctly, and then I get extra columns of NULL values. My code is below:
create external table bank_dataset(age TINYINT, job string, education string, default BOOLEAN, balance INT, housing BOOLEAN, loan BOOLEAN, contact STRING,day INT, month STRING, duration INT,campaign INT, pdays INT, previous INT, poutcome STRING,y BOOLEAN) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = "\u003B", "quoteChar" = '"' ) STORED AS TEXTFILE location '/user/marchenrisaad_gmail/Bank_Project' tblproperties("skip.header.line.count"="1");
Any suggestions?
Unable to insert data in Hive for JSON file
Failed!! Create the table for below schema (schema = {"type":"record","name":"topLevelRecord","fields":[{"name":"MESSAGE_ID","type":["string","null"]},{"name":"MSGNAME","type":["string","null"]},{"name":"SOURCE","type":["string","null"]},{"name":"EVENT_DATETIME","type":["string","null"]},{"name":"CUSTOMER_ORDER_ID","type":["string","null"]},{"name":"SP_ORGANISATION_NAME","type":["string","null"]},{"name":"CUSTOMER_ACCOUNT_ID","type":["string","null"]},{"name":"ORDER_TYPE_NAME","type":["string","null"]},{"name":"ORDER_SUBTYPE_NAME","type":["string","null"]},{"name":"ORDER_REASON_NAME","type":["string","null"]},{"name":"ORDER_CREATED_DATE","type":["string","null"]},{"name":"ORDER_CREATED_CHANNEL_NAME","type":["string","null"]},{"name":"ORDER_CREATED_RETAILER_ID","type":["string","null"]},{"name":"ORDER_CREATED_DEALER_ID","type":["string","null"]},{"name":"ORDER_CREATED_AFFILIATE_ID","type":["string","null"]},{"name":"ORDER_CREATED_EMPLOYEE_ID","type":["string","null"]},{"name":"ORDER_CREATED_CONTACT_CENTRE_AGENT_ID","type":["string","null"]},{"name":"ORDER_SUBMITTED_DATE","type":["string","null"]},{"name":"ORDER_SUBMITTED_CHANNEL_NAME","type":["string","null"]},{"name":"ORDER_DUE_DATE","type":["string","null"]},{"name":"ONE_TIME_CHARGE_AMT","type":["string","null"]},{"name":"RECURRING_CHARGE_AMT","type":["string","null"]},{"name":"ORDER_STATUS_NAME","type":["string","null"]},{"name":"ORDER_STATUS_CHANGE_REASON_NAME","type":["string","null"]},{"name":"CREATE_JOB_RUN_ID","type":"int"},{"name":"CREATE_DATE_TIME","type":"string"},{"name":"SYSTEM_ID","type":"int"},{"name":"SRC_FILE_NAME","type":"string"}]} I am new to hive just tried out by just looking around and came up with below Query CREATE EXTERNAL TABLE governed_data.customer_order( message_id string, msgname string, source string, event_datetime string, customer_order_id string, sp_organisation_name string, customer_account_id string, order_type_name string, order_subtype_name string, order_reason_name string, 
order_created_date string, order_created_channel_name string, order_created_retailer_id string, order_created_dealer_id string, order_created_affiliate_id string, order_created_employee_id string, order_created_contact_centre_agent_id string, order_submitted_date string, order_submitted_channel_name string, order_due_date string, one_time_charge_amt string, recurring_charge_amt string, order_status_name string, order_status_change_reason_name string, create_job_run_id int, create_date_time string, system_id int, src_file_name string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS AVRO location 'adl://rbsitbinsighstdlt001.azuredatalakestore.net/insights/governed_data/'; In the i want to insert data in the hive Database
You specified stored as AVRO and serde is JsonSerde, these properties are conflicting. If you need AVRO, then specify the serde as org.apache.hadoop.hive.serde2.avro.AvroSerDe, specify the inputformat as org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat, and the outputformat as org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat. Also provide a location from which the AvroSerde will pull the most current schema for the table. See example here: Creating Avro-backed Hive tables Or simply specify STORED AS AVRO, without SerDe, Input and Output format. Try to remove ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' in your DDL. And if you want JsonSerDe to parse attributes then create table like this: CREATE EXTERNAL TABLE governed_data.customer_order(message_id string, msgname string, source string, event_datetime string, customer_order_id string, sp_organisation_name string, customer_account_id string, order_type_name string, order_subtype_name string, order_reason_name string, order_created_date string, order_created_channel_name string, order_created_retailer_id string, order_created_dealer_id string, order_created_affiliate_id string, order_created_employee_id string, order_created_contact_centre_agent_id string, order_submitted_date string, order_submitted_channel_name string, order_due_date string, one_time_charge_amt string, recurring_charge_amt string, order_status_name string, order_status_change_reason_name string, create_job_run_id int, create_date_time string, system_id int, src_file_name string) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION 'adl://rbsitbinsighstdlt001.azuredatalakestore.net/insights/governed_data/' ; Read also docs about JsonSerDe
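The two Avro variants described above could be sketched roughly as follows. The table names, the LOCATION paths, and the avro.schema.url value are hypothetical placeholders, and the column list in the second statement is shortened for illustration; only the SerDe and input/output format class names come from the answer.

```sql
-- Variant 1 (sketch): explicit Avro SerDe and container formats.
-- The columns are derived from the Avro schema file, so none are listed.
-- Table name, location and schema path are placeholders.
CREATE EXTERNAL TABLE customer_order_avro_explicit
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/tmp/customer_order_avro_explicit'
TBLPROPERTIES ('avro.schema.url' = 'hdfs:///tmp/schemas/customer_order.avsc');

-- Variant 2 (sketch): STORED AS AVRO with no ROW FORMAT SERDE clause.
-- Only two of the original columns are shown here to keep the example short.
CREATE EXTERNAL TABLE customer_order_avro_short (
  message_id string,
  create_job_run_id int
)
STORED AS AVRO
LOCATION '/tmp/customer_order_avro_short';
```

In both variants the JsonSerDe clause is gone, which removes the conflict between the storage format and the SerDe.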
Hive insert query failing with error return code -101
I am trying to run a simple insert statement as below: insert into table `bwc_test` partition(call_date) select * from `bwc_master`; Then it fails with the below error: INFO : Loading data to table dtc.bwc_test partition (call_date=null) from /apps/hive/warehouse/dtc.db/bwc_test/.hive-staging_hive_2018-11-13_19-10-37_084_8697431764330812894-1/-ext-10000 Error: Error while processing statement: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.MoveTask. HIVE_LOAD_DYNAMIC_PARTITIONS_THREAD_COUNT (state=08S01,code=-101) Table definition for bwc_master: CREATE TABLE `bwc_master`( unique_id bigint, customer_id string, direction string, call_date_time timestamp, duration int, billed_duration int, retail_rate decimal(9,7), retail_cost decimal(19,7), billed_tier smallint, call_type tinyint, record_status tinyint, aggregate_id bigint, originating_ipaddress string, originating_number string, destination_number string, lrn string, ocn string, destination_rate_center string, destination_lata int, billed_prefix string, rate_id string, wholesale_rate decimal(9,7), wholesale_cost decimal(19,7), cnam_dipped boolean, billed_number_type tinyint, source_lata int, source_ocn string, location_id string, sippeer_id int, rate_attempts tinyint, source_state string, source_rc string, destination_country string, destination_state string, destination_ip string, carrier_id string, rated_date_time timestamp, partition_id smallint, encryption_rate decimal(9,7), encryption_cost decimal(19,7), trans_coding_rate decimal(9,7), trans_coding_cost decimal(19,7), file_name string, call_id string, from_tag string, to_tag string, unique_record_id string) PARTITIONED BY ( `call_date` date) CLUSTERED BY ( customer_id) INTO 10 BUCKETS ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' LOCATION 
'hdfs://*****/apps/hive/warehouse/dtc.db/bwc_master' Can someone help me debug this? I didn't find anything in the logs.
You are missing the "table" keyword before bwc_test:
insert into table `bwc_test` partition(call_date) select * from `bwc_master`;
Query on Bucketized Table
I created a bucketized table as follows:
drop table if exists bi_st.st_usr_member_active_day_test; CREATE TABLE `bi_st.st_usr_member_active_day_test`( `cal_dt_from` string, `cal_dt_to` string, `memberid` string, `vipcode` string, `vipleavel` string, `cityid` string, `cityname` string, `groupid` int, `groupname` string, `storeid` int, `storename` string, `sectionid` int, `sectionname` string, `promotionid` string, `promotionname` string, `moduleid` string, `modulename` string, `activeness_today` string, `new_vip_class` string ) clustered by (storeid) into 2 buckets row format delimited fields terminated by '\t' stored as orc TBLPROPERTIES('transactional'='true');
I then inserted some data into it and ran:
select * from bi_st.st_usr_member_active_day_test where storeid = 193;
The query failed with an array index out of bounds error. Can anybody explain this? Thanks
elasticsearch hadoop integration- java.lang.ClassCastException
I downloaded the elasticsearch2.1.2 JAR and followed the guide to configure it in Hadoop (v5.4.4). Everything looks OK, but I am getting a cast error while reading from the Elasticsearch source. The error message is:
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.elasticsearch.hadoop.mr.WritableArrayWritable cannot be cast to org.apache.hadoop.io.Text
This is the table created in Hive:
CREATE EXTERNAL TABLE Log_Event_ICS_ES( product_version string, agent_host string, product_name string, temp_time_stamp bigint, log_message string, org_id string, log_datetime timestamp, message string, log_source_provider string, log_source_name string, log_message_for_trending string, index_only_message string, log_level string, code_source string, log_type string, full_message string, session_log_operation string, source_received_time timestamp ) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.resource' = 'log_event_2015-05-11/log_event', 'es.nodes' = '', 'es.port' = '' )
The select query is:
select * from log_event_ics_es
Any idea?