Creating an external Hive table from a DynamoDB table in Spark SQL 2.2.1 - hadoop

I am having issues executing the following Hive query from Spark SQL 2.2.1. The query works fine in the Hive editor in Hue. I have loaded the DynamoDB jars (emr-dynamodb-hadoop-4.2.0.jar and emr-dynamodb-hive-4.2.0.jar).
The problem appears to be in the syntax. I have tried the following, but with no luck:
CREATE EXTERNAL TABLE schema.table
(name string, surname string, address string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
"dynamodb.table.name" = "dynamodb-table",
"dynamodb.column.mapping" = "name:name,surname:surname,address:address"
);
I was able to run this:
CREATE TABLE IF NOT EXISTS surge.threshold3 (name string, surname string, address string) USING hive
OPTIONS(
INPUTFORMAT 'org.apache.hadoop.dynamodb.read.DynamoDBInputFormat',
OUTPUTFORMAT 'org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat'
)
TBLPROPERTIES (
"dynamodb.table.name" = "dynamodb-table",
"dynamodb.column.mapping" = "name:name,surname:surname,address:address"
)
LOCATION ''
However, I can't find a way to determine the location of the DynamoDB table. It's quite different from locating an HDFS or S3 path.
Help is much appreciated!

Related

Does a Hive query always do a full scan, even with an equality condition?

I am trying to create an external Hive table that maps to a DynamoDB table, as the official documentation describes.
CREATE EXTERNAL TABLE dynamodb(hashKey STRING, recordTimeStamp BIGINT, fullColumn map<String, String>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
"dynamodb.table.name" = "myTable",
"dynamodb.column.mapping" = "hashKey:HashKey,recordTimeStamp:RangeKey");
But when I query on the hash key, it seems to do a full table scan.
hive> select * from dynamodb where hashKey="test";
Any suggestions on that? Thanks

Indexing data from HDFS to Elasticsearch via Hive

I'm using the Elasticsearch for Hadoop plugin in order to read and index documents in Elasticsearch via Hive.
I followed the documentation in this page:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html
In order to index documents in Elasticsearch with Hadoop, you need to create a properly configured table in Hive.
I encountered a problem when inserting data into that Hive table.
This is the script I used to create the table for writing:
CREATE EXTERNAL TABLE es_names_w
(
firstname string,
lastname string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'hive_test/names', 'es.index.auto.create' = 'true')
Then I tried to insert data:
INSERT OVERWRITE TABLE es_names_w
SELECT firstname,lastname
FROM tmp_names_source;
The error I get from Hive is:
"Job submission failed with exception 'org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException: org.xml.sax.SAXParseException; systemId: file:////hdfs_data/mapred/jt/jobTracker/job_201506091622_0064.xml; lineNumber: 607; columnNumber: 51; Character reference "&#..."
However, this error occurs ONLY when the hive table that I create has more than one column.
For example, this code works:
CREATE EXTERNAL TABLE es_names_w
(
firstname string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'hive_test/names', 'es.index.auto.create' = 'true');

INSERT OVERWRITE TABLE es_names_w
SELECT firstname
FROM tmp_names_source;
Everything went well: Hive created a new type in the Elasticsearch index and the data was indexed in Elasticsearch.
I really don't know why my first attempt doesn't work.
I would appreciate some help.
Thanks
Can you add 'es.mapping.id' = 'key' to the TBLPROPERTIES? The key can be your firstname.
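A minimal sketch of what that might look like, reusing the table above and treating firstname as the document id (that choice is only illustrative; any suitably unique column works):
-- es.mapping.id tells ES-Hadoop which column to use as the Elasticsearch document _id
CREATE EXTERNAL TABLE es_names_w
(
firstname string,
lastname string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.resource' = 'hive_test/names',
'es.index.auto.create' = 'true',
'es.mapping.id' = 'firstname'
);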
Try
'es.index.auto.create' = 'false'
Try with the SerDe; it will work. For example:
CREATE EXTERNAL TABLE elasticsearch_es (
name STRING, id INT, country STRING )
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES ('es.resource'='elasticsearch/demo');
Also, make sure that when you create the index and type in ES, the mapping exactly matches the Hive table's columns.

insert into hbase table using hive (Hadoop)

I successfully created this table in HBase using Hive:
CREATE TABLE hbase_trades(key string, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "trades");
Now I want to insert values into this table. What is the HiveQL query?
Just insert into your Hive table (hbase_trades); since you have integrated the two, the data will be queryable from the HBase table (trades). Hope this helps.
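A minimal sketch of such an insert, assuming a plain Hive staging table named trades_staging with matching columns (that staging table is hypothetical):
-- Each inserted row becomes a put into the underlying HBase table 'trades'
-- (row key taken from 'key', column cf1:val taken from 'value')
INSERT INTO TABLE hbase_trades
SELECT key, value
FROM trades_staging;
On newer Hive versions (0.14+) an INSERT INTO ... VALUES statement may also work, but INSERT ... SELECT is the safer pattern for storage-handler tables.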

Error while creating an external table using JSON jars in Hive 13.0

I am doing some testing on Hive 13.0. I am trying to create an external table and use JSON jars to read the JSON-formatted data file, but I am getting errors. Below is my create table statement:
'$response = Invoke-Hive -Query #"
add jar wasb://path/json-serde-1.1.9.2.jar;
add jar wasb://path/json-serde-1.1.9.2-jar-with-dependencies.jar;
CREATE EXTERNAL TABLE IF NOT EXISTS table_name (col1 string, col2 string...coln int)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ()
STORED AS TEXTFILE;
"#'
Below is the error I am getting:
'FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.org.apache.hadoop.hive.serde2.objectinspector.primitive.AbstractPrimitiveJavaObjectInspector.<init>(Lorg/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils$PrimitiveTypeEntry;)V'
Any suggestions?
There are some changes needed to that SerDe for Hive 0.13 - you can see a list here: https://github.com/rcongiu/Hive-JSON-Serde/pull/64

AWS EMR Hive partitioning unable to recognize any type of partitions

I am trying to process some log files in a bucket in Amazon S3.
I create the table:
CREATE EXTERNAL TABLE apiReleaseData2 (
messageId string, hostName string, timestamp string, macAddress string, apiKey string,
userAccountId string, userAccountEmail string, numFiles string)
ROW FORMAT
serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties ( 'paths'='messageId, hostName, timestamp, macAddress, apiKey, userAccountId, userAccountEmail, numFiles')
LOCATION 's3://apireleasecandidate1/regression/transferstatistics/2013/12/31/';
Then I run the following HiveQL statement and get my desired output in the file without any issues. My directories are set up in the following manner:
s3://apireleasecandidate1/regression/transferstatistics/2013/12/31/ < All the log files for this day >
What I want to do is specify the LOCATION only up to 's3://apireleasecandidate1/regression/transferstatistics/' and then call the
ALTER TABLE <Table Name> ADD PARTITION (<path>)
statement or the
ALTER TABLE <Table Name> RECOVER PARTITIONS ;
statement to access the files in the subdirectories. But when I do this there is no data in my table.
I tried the following :
CREATE EXTERNAL TABLE apiReleaseDataUsingPartitions (
messageId string, hostName string, timestamp string, macAddress string, apiKey string,
userAccountId string, userAccountEmail string, numFiles string)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT
serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties ( 'paths'='messageId, hostName, timestamp, macAddress, apiKey, userAccountId, userAccountEmail, numFiles')
LOCATION 's3://apireleasecandidate1/regression/transferstatistics/';
and then I run the following ALTER command:
ALTER TABLE apiReleaseDataUsingPartitions ADD PARTITION (year='2013', month='12', day='31');
But running a SELECT statement on the table returns no results.
Can someone please guide me on what I am doing wrong?
Am I missing something important?
Cheers
Tanzeel
In HDFS anyway, the partitions manifest in a key/value format like this:
hdfs://apireleasecandidate1/regression/transferstatistics/year=2013/month=12/day=31
I can't vouch for S3 but an easy way to check would be to write some data into a dummy partition and see where it creates the file.
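For instance, something along these lines (the dummy partition values are only placeholders):
-- Write one row into a throwaway partition and let Hive create it
INSERT OVERWRITE TABLE apiReleaseDataUsingPartitions
PARTITION (year='1970', month='01', day='01')
SELECT messageId, hostName, `timestamp`, macAddress, apiKey,
userAccountId, userAccountEmail, numFiles
FROM apiReleaseData2
LIMIT 1;
-- Then inspect where that partition's data actually landed
DESCRIBE FORMATTED apiReleaseDataUsingPartitions PARTITION (year='1970', month='01', day='01');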
ADD PARTITION supports an optional LOCATION parameter, so you might be able to deal with this by saying
ALTER TABLE apiReleaseDataUsingPartitions ADD PARTITION (year='2013', month='12', day='31') LOCATION 's3://apireleasecandidate1/regression/transferstatistics/2013/12/31/';
Again I've not dealt with S3 but would be interested to hear if this works for you.
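If it does, two quick checks with standard Hive commands can confirm the partition is wired up:
-- List the partitions the metastore knows about
SHOW PARTITIONS apiReleaseDataUsingPartitions;
-- Query just that partition to make sure rows come back
SELECT * FROM apiReleaseDataUsingPartitions
WHERE year='2013' AND month='12' AND day='31'
LIMIT 10;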
