Does a Hive query always do a full scan, even with an equals condition? - hadoop

I am trying to create an external Hive table that maps to a DynamoDB table like the official documentation says.
CREATE EXTERNAL TABLE dynamodb(hashKey STRING, recordTimeStamp BIGINT, fullColumn map<String, String>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
"dynamodb.table.name" = "myTable",
"dynamodb.column.mapping" = "hashKey:HashKey,recordTimeStamp:RangeKey");
But when I query on the hash key, it seems to be doing a full table scan.
hive> select * from dynamodb where hashKey="test";
Any suggestions on that? Thanks
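One way to check what Hive actually plans to do, before assuming a full scan, is to look at the query plan; this uses only the table and key from the question:
EXPLAIN SELECT * FROM dynamodb WHERE hashKey = "test";
If the plan shows a plain TableScan with the predicate applied as a filter afterwards, the condition is being evaluated on the Hive side rather than pushed down to DynamoDB.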

Related

Creating external hive table from dynamodb table in sparkSQL 2.2.1

I am having issues executing the following Hive query from sparkSQL 2.2.1. The query works fine in the Hive editor in Hue. I have loaded the jars (emr-dynamodb-hadoop-4.2.0.jar and emr-dynamodb-hive-4.2.0.jar) for DynamoDB.
The problem appears to be in the syntax. I have tried to use this, but with no luck:
CREATE EXTERNAL TABLE schema.table
(name string, surname string, address string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
"dynamodb.table.name" = "dynamodb-table",
"dynamodb.column.mapping" = "name:name,surname:surname,address:address"
);
I was able to run this:
CREATE TABLE IF NOT EXISTS surge.threshold3 (name string, surname string, address string) USING hive
OPTIONS(
INPUTFORMAT 'org.apache.hadoop.dynamodb.read.DynamoDBInputFormat',
OUTPUTFORMAT 'org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat'
)
TBLPROPERTIES (
"dynamodb.table.name" = "dynamodb-table",
"dynamodb.column.mapping" = "name:name,surname:surname,address:address"
)
LOCATION ''
However, I can't find a way to determine the location of the DynamoDB table. It's quite different from locating data on HDFS or S3.
Help is much appreciated!
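As far as I know, sparkSQL 2.2 does not accept the STORED BY clause, so a workaround sketch (assuming Spark and Hive share the same metastore) is to run the original DDL through the Hive editor or beeline, where it already works, and then reference the table from sparkSQL by name. There is no LOCATION to fill in, because the storage handler resolves the data through "dynamodb.table.name" rather than through a path:
-- run in the Hive editor / beeline, not in sparkSQL
-- (schema.table is the placeholder name used in the question)
CREATE EXTERNAL TABLE schema.table
(name string, surname string, address string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
"dynamodb.table.name" = "dynamodb-table",
"dynamodb.column.mapping" = "name:name,surname:surname,address:address"
);
Whether sparkSQL can then read through the storage handler still depends on the emr-dynamodb jars being on Spark's classpath, so treat this as a sketch rather than a guaranteed fix.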

Hive cannot read ORC if "orc.create.index"="false" is set when loading the table

Hive version: 1.2.1. I create a table as below:
CREATE TABLE ORC_NONE(
millisec bigint,
...
)
stored as orc tblproperties ("orc.create.index"="false");
insert into table ORC_NONE select * from ex_test_convert;
But when I query it, it always returns NULL. For example:
Select * from ORC_NONE limit 10; // return blank
Select min(millisec), max(millisec) from ORC_NONE; // return NULL, NULL
I checked the size of ORC_NONE: 2 GB, so it is not an empty table, and if I create the table with "orc.create.index"="true", the queries work.
I meant to test Hive performance on ORC with and without row indexes, more exactly, to test the skipping power of the row indexes. However, it seems that Hive cannot read the data when the row index is unavailable.
Is this a bug? Or something wrong with my loading?
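If the goal is only to measure the skipping power of the row index, one sketch of an alternative (the table name ORC_WITH_INDEX and the WHERE clause below are made up for illustration) is to keep the index in the files and toggle whether the reader uses it via hive.optimize.index.filter, instead of writing files without an index:
-- write the table with the row index present (the default)
CREATE TABLE ORC_WITH_INDEX(
millisec bigint
)
stored as orc tblproperties ("orc.create.index"="true");
-- trimmed to the one column shown in the question
insert into table ORC_WITH_INDEX select millisec from ex_test_convert;
-- read once with row-index predicate pushdown enabled ...
SET hive.optimize.index.filter=true;
SELECT min(millisec), max(millisec) FROM ORC_WITH_INDEX WHERE millisec > 0;
-- ... and once with it disabled, to compare scan cost
SET hive.optimize.index.filter=false;
SELECT min(millisec), max(millisec) FROM ORC_WITH_INDEX WHERE millisec > 0;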

How to transfer a table from HBase to Hive?

How can I transfer an HBase table into Hive correctly?
You can read what I tried before in this question:
How insert overwrite table in hive with different where clauses?
(I made one table to import all the data. The problem is that the data is still in rows and not in columns. So I made three tables for news, social and all, each with a specific WHERE clause. After that I made two joins on those tables, which gives me the result table. So I had six tables in total, which is not really performant!)
To sum my problem up: in HBase there are column families which are stored as rows, like this:
count verpassen news 1
count verpassen social 0
count verpassen all 1
What I want to achieve in Hive is a datastructure like this:
name news social all
verpassen 1 0 1
How am I supposed to do this?
Below is the approach you can use.
Use the HBase storage handler to create the table in Hive.
Example script:
CREATE TABLE hbase_table_1(key string, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:val")
TBLPROPERTIES ("hbase.table.name" = "test");
I loaded the sample data you have given into a Hive external table.
select name,collect_set(concat_ws(',',type,val)) input from TESTTABLE
group by name ;
I am grouping the data by name. The resultant output of the above query will be an array like ["all,1","social,0","news,1"] for each name.
Now I wrote a custom mapper which takes that column as its input parameter and emits the values:
FROM (SELECT '["all,1","social,0","news,1"]' input FROM TESTTABLE GROUP BY name) d
MAP d.input USING 'python test.py' AS `all`, social, news;
Alternatively, you can use the output to insert into another table which has the columns name, all, social, news.
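A related sketch that avoids the streaming step entirely: a conditional-aggregation pivot in plain HiveQL, assuming TESTTABLE has the columns name, type and val used in the query above:
SELECT name,
       max(CASE WHEN type = 'news'   THEN val END) AS news,
       max(CASE WHEN type = 'social' THEN val END) AS social,
       max(CASE WHEN type = 'all'    THEN val END) AS `all`
FROM TESTTABLE
GROUP BY name;
This produces the name / news / social / all layout asked for without a custom mapper.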
Hope this helps

Indexing data from HDFS to Elasticsearch via Hive

I'm using the Elasticsearch for Hadoop plugin in order to read and index documents in Elasticsearch via Hive.
I followed the documentation on this page:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html
In order to index documents in Elasticsearch with Hadoop, you need to create a properly configured table in Hive.
I encountered a problem with inserting data into that Hive table.
This is the script I used to create the table for writing:
CREATE EXTERNAL TABLE es_names_w
(
firstname string,
lastname string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'hive_test/names', 'es.index.auto.create' = 'true')
Then I tried to insert data:
INSERT OVERWRITE TABLE es_names_w
SELECT firstname,lastname
FROM tmp_names_source;
The error I get from Hive is:
"Job submission failed with exception 'org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException: org.xml.sax.SAXParseException; systemId: file:////hdfs_data/mapred/jt/jobTracker/job_201506091622_0064.xml; lineNumber: 607; columnNumber: 51; Character reference "&#..."
However, this error occurs ONLY when the Hive table that I create has more than one column.
For example, this code works:
CREATE EXTERNAL TABLE es_names_w
(
firstname string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'hive_test/names', 'es.index.auto.create' = 'true')
INSERT OVERWRITE TABLE es_names_w
SELECT firstname
FROM tmp_names_source;
Everything went well: Hive created a new type in the Elasticsearch index and the data was indexed in Elasticsearch.
I really don't know why my first attempt doesn't work.
I would appreciate some help, thanks.
Can you add 'es.mapping.id' = 'key' to the TBLPROPERTIES? The key can be your firstname.
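For example, a sketch of the asker's table with that property added (es.mapping.id is an es-hadoop property; using firstname as the id is just the suggestion above, not something from the question):
CREATE EXTERNAL TABLE es_names_w
(
firstname string,
lastname string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'hive_test/names',
'es.index.auto.create' = 'true',
'es.mapping.id' = 'firstname')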
Try:
'es.index.auto.create' = 'false'
Try it with a SerDe; it will work out. For example:
CREATE EXTERNAL TABLE elasticsearch_es (
name STRING, id INT, country STRING )
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES ('es.resource'='elasticsearch/demo');
Also, make sure that when you create the index and type in ES, you create exactly the same mapping as the Hive columns.

HIVE External Table - Set Empty Strings to NULL

Currently I have a HIVE 0.7 instance on Amazon EMR. I am trying to create a duplicate of this instance on a new EMR cluster using Hive 0.11.
In my 0.7 instance I have an external table that will set empty strings to NULL. Here is how I create the table:
CREATE EXTERNAL TABLE IF NOT EXISTS tablename
(column1 string,
column2 string)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
TBLPROPERTIES ('serialization.null.format' = '');
Data is added to the table like this:
ALTER TABLE tablename
ADD PARTITION (year = '2013', month = '10', day='01')
LOCATION '/location_in_hdfs';
This works great in 0.7, but in 0.11 it doesn't seem to be evaluating my empty strings as NULLs. Interestingly, creating a normal table with the same data and table definition does evaluate empty strings as NULLs as expected.
Is there a different way to do this with an external table in 0.11?
Hive's default partition properties are overriding the table properties. Set the SerDe property on the partition itself after adding it:
ALTER TABLE tablename ADD PARTITION (year = '2013', month = '10', day = '01') LOCATION '/location_in_hdfs';
ALTER TABLE tablename PARTITION (year = '2013', month = '10', day = '01') SET SERDEPROPERTIES ('serialization.null.format' = '');
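A quick way to check whether the partition now reads empty strings as NULL (column and partition values taken from the question):
SELECT count(*) FROM tablename
WHERE year = '2013' AND month = '10' AND day = '01'
AND column1 IS NULL;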
