HIVE order by messes up data - hadoop

In Hive 0.8 with Hadoop 1.03 consider this table:
CREATE TABLE table (
key int,
date timestamp,
name string,
surname string,
height int,
weight int,
age int)
CLUSTERED BY(key) INTO 128 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Then I tried:
select *
from table
where key=xxx
order by date;
The result is sorted but everything after the column name is wrong. In fact, all the rows have the exact same values in the respective fields and the surname column is missing. I also have a bitmap index on name and surname and an index on key.
Is there something wrong with my query, or should I be looking into bugs in ORDER BY? (I can't find anything specific.)

It looks like something went wrong when the data was loaded into Hive. Make sure your CSV file doesn't contain any special characters (for example, stray commas inside a field) that might interfere with the load.
Also, the table is clustered by key. Where does this key come from: the CSV, or some other source? Are you sure that it is unique?
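One way to look for rows whose delimiters do not line up with the seven declared columns is to load the same CSV into a single-column staging table and count the fields per line. This is only a sketch: raw_lines is a hypothetical table name, and the LOAD step is left out because the file path is not given in the question.

```sql
-- Hypothetical staging table: each CSV line lands unparsed in one STRING column
-- (the default field delimiter is Ctrl-A, which is assumed not to occur in the data).
CREATE TABLE raw_lines (line STRING);

-- Load the same CSV into raw_lines here (path depends on your setup).

-- The table in the question declares 7 columns, so any line that does not
-- split into exactly 7 comma-separated fields deserves a closer look.
SELECT line
FROM raw_lines
WHERE size(split(line, ',')) <> 7
LIMIT 20;
```

If this returns rows, the column shift seen in the query result probably comes from the data rather than from ORDER BY.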

Related

HIVE: Empty buckets getting created after partitioning in HDFS

I was trying to create partitions and buckets using Hive.
I set the following properties:
set hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Below is the code for creating the table:
CREATE TABLE transactions_production
( id string,
dept string,
category string,
company string,
brand string,
date1 string,
productsize int,
productmeasure string,
purchasequantity int,
purchaseamount double)
PARTITIONED BY (chain string) clustered by(id) into 5 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Below is the code for inserting data into the table:
INSERT OVERWRITE TABLE transactions_production PARTITION (chain)
select id, dept, category, company, brand, date1, productsize, productmeasure,
purchasequantity, purchaseamount, chain from transactions_staging;
What went wrong:
Partitions and buckets are getting created in HDFS but the data is present only in the 1st bucket of all the partitions; all the remaining buckets are empty.
Please let me know what I did wrong and how to resolve this issue.
When you use bucketing, Hive hashes the CLUSTERED BY value (here, id) and uses that hash to assign each row to one of the buckets, so every partition is split into that many flat files (5 in your table).
Because rows are distributed by the hash of id, the size of each bucket depends on the values in your table.
If none of your id values hash to a bucket other than the first one, all the other files will be empty.
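To check how the id values would spread over the five buckets, a query along these lines can be run against the staging table. It is a sketch that assumes Hive's built-in hash() UDF produces the same hash that bucketing uses, which to my understanding holds for older Hive versions but is not guaranteed everywhere. The & 2147483647 mask mirrors the way Hive makes the hash non-negative before taking it modulo the bucket count.

```sql
-- Approximate bucket number per row: (hash & Integer.MAX_VALUE) % number_of_buckets.
-- Assumption: hash() matches the hash Hive applies when bucketing.
SELECT (hash(id) & 2147483647) % 5 AS bucket,
       COUNT(*)                    AS row_count
FROM transactions_staging
GROUP BY (hash(id) & 2147483647) % 5;
```

If the counts are spread across several bucket numbers but the files on HDFS are still empty, it is worth confirming that hive.enforce.bucketing was set in the same session that ran the INSERT.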

Hive, Bucketing for the partitioned table

This is my script:
--table without partition
drop table if exists ufodata;
create table ufodata ( sighted string, reported string, city string, shape string, duration string, description string )
row format delimited
fields terminated by '\t'
Location '/mapreduce/hive/ufo';
--load my data in ufodata
load data local inpath '/home/training/downloads/ufo_awesome.tsv' into table ufodata;
--create partition table
drop table if exists partufo;
create table partufo ( sighted string, reported string, city string, shape string, duration string, description string )
partitioned by ( year string )
clustered by (year) into 6 buckets
row format delimited
fields terminated by '/t';
--by default dynamic partition is not set
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--by default bucketing is false
set hive.enforcebucketing=true;
--loading mydata
insert overwrite table partufo
partition (year)
select sighted, reported, city, shape, min, description, SUBSTR(TRIM(sighted), 1,4) from ufodata;
Error message:
FAILED: Error in semantic analysis: Invalid column reference
I tried bucketing my partitioned table. If I remove "clustered by (year) into 6 buckets", the script works fine. How do I bucket the partitioned table?
There is an important restriction to keep in mind when bucketing in Hive.
The same column cannot be used for both bucketing and partitioning. The reason is as follows:
Clustering and sorting happen within a partition. Inside each partition there is only one value of the partition column (in your case, year), so clustering or sorting on it would have no effect. That is the reason for your error.
You can use the syntax below to create a bucketed table with a partition:
CREATE TABLE bckt_movies
(mov_id BIGINT , mov_name STRING ,prod_studio STRING, col_world DOUBLE , col_us_canada DOUBLE , col_uk DOUBLE , col_aus DOUBLE)
PARTITIONED BY (rel_year STRING)
CLUSTERED BY(mov_id) INTO 6 BUCKETS;
When you use dynamic partitioning, first create a temporary table with all the columns (including the partition column) and load the data into it.
Then create the actual partitioned table with the partition column. When you load data from the temporary table, the partition column must come last in the SELECT clause.
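Applied to the script in the question, the advice could look like the sketch below. Bucketing on city is an assumption made purely for illustration; any non-partition column would do. The sketch also uses '\t' as the delimiter, spells the property hive.enforce.bucketing, and selects duration where the original script selected min, which is not a column of ufodata.

```sql
drop table if exists partufo;
create table partufo ( sighted string, reported string, city string, shape string, duration string, description string )
partitioned by ( year string )
-- assumption: city is chosen only as an example of a non-partition column
clustered by (city) into 6 buckets
row format delimited
fields terminated by '\t';

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.enforce.bucketing=true;

-- the dynamic partition column (year) comes last in the select list
insert overwrite table partufo
partition (year)
select sighted, reported, city, shape, duration, description, SUBSTR(TRIM(sighted), 1, 4) from ufodata;
```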

Hive Hadoop : Need to LOAD data into a table based on conditions on the input file

I am new to Hadoop and Hive and have just started doing basic querying in Hive.
I have an input text file with a large number of fields per line. The format of the file is something like this:
1;23;0;;;;1;3;2;1;1;4;5;6;;;;
1;43;6;;;;1;3;2;1;1;4;5;5;;;;
1;53;7;;;;1;3;2;1;1;4;5;2;;;;
(Each integer before a ";" has a meaning, and I intend to store each one in its own column of a Hive table. Each line contains about 400 fields.)
To load this, I created a table "test" using the following query:
CREATE TABLE test (field1 INT, field2 INT, field3 INT, field4 INT, ... field390 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "\073";
And I load my text file with the records using the LOAD query as below:
LOAD DATA LOCAL INPATH '/tmp/test.txt'
OVERWRITE INTO TABLE test;
For now, the first 50 fields are loaded into the table accurately. After that I get mismatches.
In my input format, the 50th field in test.txt is an INT that determines how many of the following fields to read.
Example:
50th field: 2 -> Hive has to take the next 2*10 INT field values and insert them into the table.
50th field: 1 -> Hive has to take the next 1*10 INT field values and insert them into the table. The remaining 10 fields can be set to NULL.
(The maximum value of the 50th field is 2, so I have reserved 2*10 fields for this in the table.)
After the 50th + (2*10) fields, the data should be read normally in sequence, as it was before the 50th field.
Is there a way to put a condition on the input so that the data gets inserted accordingly in Hive?
Any help would be appreciated. I need a solution that does not require pre-processing test.txt before loading it into the table.
I have tried to answer it at http://www.knowbigdata.com/page/hive-hadoop-need-load-data-table-based-conditions-input-file#comment-85
Does it make sense?
You can use a WHERE clause in Hive.
First load the data into a raw Hive table or into HDFS, then create a second table and load data into it based on a WHERE clause.
For example:
SELECT * FROM table_reference
WHERE name like "%venu%";
Resource: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select
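For the variable-length layout described in the question, one option that stays inside Hive is to load every line into a single STRING column and pick the fields by position with split(). The sketch below shows only a few representative columns. test_raw and the output column names are hypothetical, and the separator is written as '\073' (the octal escape for ";", as in the question's DDL) on the assumption that a literal ";" inside a string may confuse the Hive CLI.

```sql
-- Hypothetical staging table: one unparsed input line per row.
CREATE TABLE test_raw (line STRING);

LOAD DATA LOCAL INPATH '/tmp/test.txt'
OVERWRITE INTO TABLE test_raw;

-- split() returns a 0-based array, so the 50th field is f[49].
SELECT
  CAST(f[0]  AS INT) AS field1,
  CAST(f[49] AS INT) AS field50,
  -- first value of the first optional group of 10 (present when field50 >= 1)
  IF(CAST(f[49] AS INT) >= 1, CAST(f[50] AS INT), NULL) AS group1_field1,
  -- first value of the second optional group of 10 (present only when field50 = 2)
  IF(CAST(f[49] AS INT) = 2, CAST(f[60] AS INT), NULL) AS group2_field1,
  -- first regular field after the variable block: its position depends on field50
  CAST(CASE CAST(f[49] AS INT)
         WHEN 2 THEN f[70]
         WHEN 1 THEN f[60]
         ELSE f[50]
       END AS INT) AS first_field_after_block
FROM (SELECT split(line, '\073') AS f FROM test_raw) t;
```

The same pattern, repeated for the remaining columns, could feed an INSERT into the wide table. Only constant array indexes are used because some older Hive versions reject non-constant index expressions.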

Hive static partitions issue

I have a CSV file with 600 records, 300 each for male and female.
I created Table_Temp and loaded all these records into it. Then I created Table_Main with gender as the partition column.
The query for Temp_Table is:
Create table if not exists Temp_Table
(id string, age int, gender string, city string, pin string)
row format delimited
fields terminated by ',';
Then I write the below query:
Insert into Table_Main
partitioned (gender)
select a,b,c,d,gender from Table_Temp
Problem: I get a file at /user/hive/warehouse/mydb.db/Table_Main/gender=Male/000000_0
This file contains all 600 records. I am not sure what's happening; I expected only 300 records in this file (only Male).
Q1. Where am I mistaken?
Q2. Shouldn't I get one more folder for all the other values (those not in the static partition)? If not, what happens to them?
With a static partition, the INSERT must include a WHERE condition that selects only the rows belonging to that partition (which I had not done).
Alternatively, dynamic partitioning can be used, which needs no WHERE condition.
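The two variants could be written as follows. This is a sketch: the DDL of Table_Main is not shown in the question, so it is assumed here to have the columns id, age, city and pin plus the partition column gender.

```sql
-- Static partition: the value is fixed in the PARTITION clause, the partition
-- column is left out of the select list, and WHERE keeps only matching rows.
INSERT INTO TABLE Table_Main PARTITION (gender = 'Male')
SELECT id, age, city, pin
FROM Temp_Table
WHERE gender = 'Male';

-- Dynamic partition: Hive creates one partition per distinct gender value;
-- the partition column must be the last column in the select list.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE Table_Main PARTITION (gender)
SELECT id, age, city, pin, gender
FROM Temp_Table;
```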

Elastic Map Reduce JSON export to DynamoDB error AttributeValue may not contain an empty string

I'm trying to import data using an EMR job from JSON files in S3 that contain sparse fields, e.g. an ios_os field and an android_os field of which only one contains data. Sometimes the data is null and sometimes it's an empty string. When trying to insert into DynamoDB I get an error (although I am able to insert some records that are sparsely populated):
"AttributeValue may not contain an empty string"
{"created_at_timestamp":1358122714,...,"data":null,"type":"e","android_network_carrier":""}
I filtered out the columns that had the empty string "", but I'm still getting that error. I'm assuming it's the "property":null values that are causing this (or both). I assume that for it to work properly those values shouldn't exist when going to DynamoDB?
Is there any way to tell Hive, through the JSONSerde or through Hive's interaction with the DynamoDB table, to ignore empty string attribute values?
Here's an example of the Hive SQL schema and insert command:
CREATE EXTERNAL TABLE IF NOT EXISTS json_events (
-- Common
created_at BIGINT,
data STRING,
type STRING,
android_network_carrier STRING
)
PARTITIONED BY (created_at BIGINT, type STRING)
ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde"
WITH SERDEPROPERTIES (
-- Common
"created_at"="$.created_at",
"data"="$.data",
"android_network_carrier"="$.anw",
"type"="$.dt"
)
LOCATION s3://test.data/json_events;
CREATE EXTERNAL TABLE IF NOT EXISTS dynamo_events (
-- Common
created_at BIGINT,
data STRING,
type STRING,
android_network_carrier STRING
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "test-events",
"dynamodb.column.mapping" = "created_at:created_at,data:data,type:type,android_network_carrier:android_network_carrier");
ALTER TABLE json_events RECOVER PARTITIONS;
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
data,
android_network_carrier,
type
FROM json_events
WHERE created_at = 20130114 AND type = 'e';
The nulls shouldn't be a problem as long as they are not in the primary key.
However, DynamoDB allows neither empty strings nor empty sets, as described in its data model.
To work around this, I think you have a couple of options:
Define a constant for empty strings, like "n/a", and make sure that your data extraction process treats missing values as such.
You could also filter out these records, but that means losing data. This could be done like this:
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
data,
android_network_carrier,
type
FROM json_events
WHERE created_at = 20130114 AND type = 'e' AND android_network_carrier != "";
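The first option, substituting a constant for empty strings, can also be done in the INSERT itself instead of in the extraction process. The sketch below uses "n/a" as the placeholder, which is only the example value suggested above. It also lists the columns in the order in which dynamo_events declares them, because Hive matches inserted columns by position rather than by name.

```sql
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
       data,
       type,
       -- replace missing or empty carrier values with the placeholder constant
       CASE
         WHEN android_network_carrier IS NULL OR android_network_carrier = '' THEN 'n/a'
         ELSE android_network_carrier
       END
FROM json_events
WHERE created_at = 20130114 AND type = 'e';
```

The same CASE expression would be needed for every STRING column that can arrive empty, such as data.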
