Indexing data from HDFS to Elasticsearch via Hive - hadoop

I'm using the Elasticsearch for Hadoop plugin to read and index documents in Elasticsearch via Hive.
I followed the documentation in this page:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html
In order to index documents in Elasticsearch with Hadoop, you need to create a table in Hive that is configured properly.
I encountered a problem when inserting data into that Hive table.
This is the script I used to create the table for writing:
CREATE EXTERNAL TABLE es_names_w
(
firstname string,
lastname string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'hive_test/names', 'es.index.auto.create' = 'true')
Then I tried to insert data:
INSERT OVERWRITE TABLE es_names_w
SELECT firstname,lastname
FROM tmp_names_source;
The error I get from Hive is:
"Job submission failed with exception 'org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException: org.xml.sax.SAXParseException; systemId: file:////hdfs_data/mapred/jt/jobTracker/job_201506091622_0064.xml; lineNumber: 607; columnNumber: 51; Character reference "&#..."
However, this error occurs ONLY when the Hive table that I create has more than one column.
For example, this code works:
CREATE EXTERNAL TABLE es_names_w
(
firstname string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'hive_test/names', 'es.index.auto.create' = 'true')
INSERT OVERWRITE TABLE es_names_w
SELECT firstname
FROM tmp_names_source;
Everything went well: Hive created a new type in the Elasticsearch index and the data was indexed in Elasticsearch.
I really don't know why my first attempt doesn't work.
I would appreciate some help.
Thanks

Can you add 'es.mapping.id' = 'key' to the table's TBLPROPERTIES? The key can be your firstname.
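A minimal sketch of what that would look like, reusing the table from the question (using firstname as the document id is an assumption; any column with unique values would do):
-- assumption: firstname is unique enough to serve as the Elasticsearch document id
CREATE EXTERNAL TABLE es_names_w
(
  firstname string,
  lastname string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource' = 'hive_test/names',
  'es.index.auto.create' = 'true',
  'es.mapping.id' = 'firstname'
);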

Try
'es.index.auto.create' = 'false'
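A sketch of the same table definition from the question with only that property changed; this assumes the hive_test/names index and its mapping already exist in Elasticsearch, since automatic index creation is turned off:
-- assumption: hive_test/names already exists in Elasticsearch with a mapping matching these columns
CREATE EXTERNAL TABLE es_names_w
(
  firstname string,
  lastname string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'hive_test/names', 'es.index.auto.create' = 'false');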

Try it with a SerDe; it should work. For example:
CREATE EXTERNAL TABLE elasticsearch_es (
name STRING, id INT, country STRING )
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES ('es.resource'='elasticsearch/demo');
Also, make sure that when you create the index and type in ES, the mapping matches the Hive table's columns exactly.

Related

How to get default values of table properties in Hive?

I created an internal table using HiveQL:
CREATE TABLE city (
id INT,
city VARCHAR(15)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS TEXTFILE;
I inserted one record:
INSERT INTO city SELECT 1, null;
I want to know which default values are used, but Hive returns 'Table default.city does not have property' for these:
SHOW TBLPROPERTIES city('serialization.format');
SHOW TBLPROPERTIES city('serialization.null.format');
SHOW TBLPROPERTIES city('serialization.encoding');
SHOW TBLPROPERTIES city('serialization.escape.crlf');
I also don't see them using the describe command:
DESCRIBE FORMATTED city;
I found out which values are used by analyzing the files on HDFS, but I want to know if there is an easy way to get the default values using HiveQL.

Creating external hive table from dynamodb table in sparkSQL 2.2.1

I am having issues executing the following Hive query from SparkSQL 2.2.1. The query works fine in the Hive editor in Hue. I have loaded the jars (emr-dynamodb-hadoop-4.2.0.jar and emr-dynamodb-hive-4.2.0.jar) for DynamoDB.
The problem appears to be in the syntax. I have tried to use this, but no luck:
CREATE EXTERNAL TABLE schema.table
(name string, surname string, address string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
"dynamodb.table.name" = "dynamodb-table",
"dynamodb.column.mapping" = "name:name,surname:surname,address:address"
);
I was able to run this:
CREATE TABLE IF NOT EXISTS surge.threshold3 (name string, surname string, address string) USING hive
OPTIONS(
INPUTFORMAT 'org.apache.hadoop.dynamodb.read.DynamoDBInputFormat',
OUTPUTFORMAT 'org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat'
)
TBLPROPERTIES (
"dynamodb.table.name" = "dynamodb-table",
"dynamodb.column.mapping" = "name:name,surname:surname,address:address"
)
LOCATION ''
However, I can't find a way to determine the location of the DynamoDB table. It's quite different from locating data on HDFS or S3.
Help is much appreciated!

Does a Hive query always do a full scan, even with an equals condition?

I am trying to create an external Hive table that maps to a DynamoDB table like the official documentation says.
CREATE EXTERNAL TABLE dynamodb(hashKey STRING, recordTimeStamp BIGINT, fullColumn map<String, String>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
"dynamodb.table.name" = "myTable",
"dynamodb.column.mapping" = "hashKey:HashKey,recordTimeStamp:RangeKey");
But when doing a query using the hash key, it seems it is doing a full table scan.
hive> select * from dynamodb where hashKey="test";
Any suggestions on that? Thanks

Import multiple column families from hbase to hive

I am trying to move an HBase table that has two column families into a Hive table. I am able to move one column family, but how can I move the other one into the same Hive table?
Edit:
I moved one column family using the code below:
CREATE TABLE hbase_hive(key string, firstname string, age string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "id:firstname,id:age")
TBLPROPERTIES("hbase.table.name" = "hl");
But I have one more column family named hb with three columns. How can I achieve this?
Update:
I also tried adding a column name from a different column family; below is my code:
CREATE TABLE hbase_hive(key string, firstname string, age string, testname string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "id:firstname,id:age,pd:name")
TBLPROPERTIES("hbase.table.name" = "hl");
But I am getting the result below:
819215975 19391121 625678921720 NULL
819617215 19570622 625116365890 NULL
820333876 19640303 623221670810 NULL
824794938 19531211 625278010070 NULL
828093442 19420803 625284904860 NULL
828905771 19320209 625078004220 NULL
829832017 19630722 625178010070 NULL
Instead of values, I am getting NULL.
Update:
I tried creating the HBase table using the command below in the HBase shell:
create 'hl','id'
Then I created one more column family using the command below:
alter 'hl','pd'
In your HiveQL, you select two columns in column family "id" from HBase table "hl" into the Hive table. If you want to add more columns (even from other column families), you just need to add them to the table schema and to hbase.columns.mapping. For example:
CREATE TABLE hbase_hive(key string, firstname string, age string, a string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "id:firstname,id:age,hb:a")
TBLPROPERTIES("hbase.table.name" = "hl");
see https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-MultipleColumnsandFamilies
I see a couple of issues (more or less serious) with what you wrote:
First of all, I would create an EXTERNAL TABLE instead
You are creating a Hive table with only 3 columns but expecting 4 in the end
You are not explicitly mapping the :key
Your data for 'firstname' and 'age' looks like wild random numbers! :|
I could not test it but the following should be a better starting point:
CREATE EXTERNAL TABLE hbase_hive_hl(key string, firstname string, age string, name string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,id:firstname,id:age,pd:name")
TBLPROPERTIES("hbase.table.name" = "hl");

Elastic Map Reduce JSON export to DynamoDB error AttributeValue may not contain an empty string

I'm trying to import data using an EMR job from JSON files in S3 that contain sparse fields, e.g. an ios_os field and an android_os field, where only one contains data. Sometimes the data is null and sometimes it's an empty string. When trying to insert into DynamoDB I'm getting an error (although I am able to insert some records that are sparsely populated):
"AttributeValue may not contain an empty string"
{"created_at_timestamp":1358122714,...,"data":null,"type":"e","android_network_carrier":""}
I filtered out the columns that had the empty string "", but I'm still getting that error. I'm assuming it's the "property":null values that are causing this (or both). I assume that for it to work properly those values shouldn't exist when going to DynamoDB?
Is there any way to tell Hive, through the JSONSerde or Hive's interaction with the DynamoDB table, to ignore empty string attribute values?
Here's an example of the Hive SQL schema and insert command:
CREATE EXTERNAL TABLE IF NOT EXISTS json_events (
-- Common
created_at BIGINT,
data STRING,
type STRING,
android_network_carrier STRING
)
PARTITIONED BY (created_at BIGINT, type STRING)
ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde"
WITH SERDEPROPERTIES (
-- Common
"created_at"="$.created_at",
"data"="$.data",
"android_network_carrier"="$.anw",
"type"="$.dt"
)
LOCATION 's3://test.data/json_events';
CREATE EXTERNAL TABLE IF NOT EXISTS dynamo_events (
-- Common
created_at BIGINT,
data STRING,
type STRING,
android_network_carrier STRING
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "test-events",
"dynamodb.column.mapping" = "created_at:created_at,data:data,type:type,android_network_carrier:android_network_carrier");
ALTER TABLE json_events RECOVER PARTITIONS;
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
data,
android_network_carrier,
type
FROM json_events
WHERE created_at = 20130114 AND type = 'e';
The nulls shouldn't be a problem as long as they are not used for the primary key.
However, DynamoDB does not allow empty strings or empty sets, as described in the data model.
To work around this, I think you have a couple options:
Define a constant for empty strings, like "n/a", and make sure that your data extraction process treats missing values as such (a sketch of this is shown after the filtering example below).
You could also filter these records, but that will mean losing data. This could be done like this:
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
data,
android_network_carrier,
type
FROM json_events
WHERE created_at = 20130114 AND type = 'e' AND android_network_carrier != "";
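For the first option, a hedged sketch of replacing missing or empty values with a placeholder during the insert (columns reused from the question; the 'n/a' constant is just an example):
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
       -- replace NULL or empty strings with a placeholder that DynamoDB accepts
       CASE WHEN data IS NULL OR data = '' THEN 'n/a' ELSE data END,
       CASE WHEN android_network_carrier IS NULL OR android_network_carrier = '' THEN 'n/a' ELSE android_network_carrier END,
       type
FROM json_events
WHERE created_at = 20130114 AND type = 'e';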
