I am using CDH 5.5 and Elasticsearch 2.4.1.
I have created a Hive table and am trying to push its data to Elasticsearch using the statements below.
CREATE EXTERNAL TABLE test1_es(
id string,
timestamp string,
dept string)
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
LOCATION
'hdfs://quickstart.cloudera:8020/user/cloudera/elasticsearch/test1_es'
TBLPROPERTIES ( 'es.nodes'='localhost',
'es.resource'='sample/test1',
'es.mapping.names' = 'timestamp:#timestamp',
'es.port' = '9200',
'es.input.json' = 'false',
'es.write.operation' = 'index',
'es.index.auto.create' = 'yes'
);
INSERT INTO TABLE default.test1_es select id,timestamp,dept from test1_hive;
I'm getting the error below at the JobTracker URL:
"
Failed while trying to construct the redirect url to the log server. Log Server url may not be configured. <br>
java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all. "
The Hive terminal then shows "FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask".
I tried all the steps mentioned in forums, such as including /usr/lib/hive/bin/elasticsearch-hadoop-2.0.2.jar in hive-site.xml, adding the ES-Hadoop jar to HIVE_AUX_JARS_PATH, and copying the YARN jar to /usr/lib/hadoop/elasticsearch-yarn-2.1.0.Beta3.jar. Please suggest how to fix this error.
Thanks in Advance,
Sreenath
I'm dealing with the same problem, and I found that the execution error thrown by Hive is caused by a timestamp field of string type that could not be parsed. I'm wondering whether timestamp fields of string type can be properly mapped to ES; if not, this could be the root cause.
BTW, you should check the Hadoop MapReduce logs to find more details about the error.
CREATE EXTERNAL TABLE test1_es(
id string,
timestamp string,
dept string)
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES ...........
You don't need the LOCATION clause.
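To make that concrete, here is a rough sketch of the DDL I would try. It keeps the names and properties from the question (es.mapping.names is copied as-is), drops ROW FORMAT SERDE and LOCATION (a storage-handler table normally takes neither), and backquotes the reserved word timestamp just to be safe:

CREATE EXTERNAL TABLE test1_es (
  id string,
  `timestamp` string,
  dept string)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.nodes' = 'localhost',
  'es.port' = '9200',
  'es.resource' = 'sample/test1',
  'es.mapping.names' = 'timestamp:#timestamp',
  'es.index.auto.create' = 'yes'
);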
We have an on-premises Oracle database from which we need to extract data and store it in a Synapse dedicated SQL pool. I have created a Synapse pipeline which first copies the data from Oracle to a data lake as a parquet file, which should then be imported into Synapse using a second copy task.
The data from Oracle is extracted through a dynamically created query. This query has 2 hard-coded INT values which are generated at runtime. The query runs fine and the parquet file is created correctly, but if I use PolyBase or the COPY command to import the file into Synapse, it fails with the following error:
"errorCode": "2200",
"message": "ErrorCode=UserErrorSqlDWCopyCommandError,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=SQL DW Copy Command operation failed with error 'HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: ',Source=Microsoft.DataTransfer.ClientLibrary,''Type=System.Data.SqlClient.SqlException,Message=HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: ,Source=.Net SqlClient Data Provider,SqlErrorNumber=106000,Class=16,ErrorCode=-2146232060,State=1,Errors=[{Class=16,Number=106000,State=1,Message=HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: ,},],'",
Bulk insert works but is less efficient on large quantities of data, so I don't want to use it.
The mapping for the copy activities is created dynamically based on the target database table definition. However, when I created a separate copy task and imported the mapping to check what is going on, I noticed that the 2 INT columns are mapped as UTF8 on the parquet source side, while the sink table columns are INT32. When I exclude both columns, the copy task completes successfully. It seems that the copy activity fails because it cannot implicitly cast a string to an integer.
The 2 columns are explicitly cast as integers in the Oracle query that is the source for the parquet file.
SELECT t.*
, CAST(419 AS INT) AS "Execution_id"
, CAST(4832 AS INT) AS "Task_id"
, TO_DATE('2022-07-05 14:40:34', 'YYYY-MM-DD HH24:MI:SS') AS "ProcessedDTS"
, t.DEMUTDT AS "EffectiveDTS"
FROM CBO.DRKASTR t
WHERE DEMUTDT >= TO_DATE('2022-07-05 13:37:35', 'YYYY-MM-DD HH24:MI:SS');
Adding an explicit mapping for Oracle to parquet mapping them as INT also doesn't solve the problem.
How do I prevent these 2 columns from being interpreted as strings instead of integers?
We ended up resolving this by first importing the data as strings in the database and casting to the correct data types during further processing.
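As a rough sketch of that workaround in the dedicated pool (the table and column names below are illustrative, not the real pipeline objects, and only the columns added by the extraction query are shown):

-- Hypothetical staging table: land the parquet columns as strings first.
CREATE TABLE stg.DRKASTR_load
(
    Execution_id  varchar(20),
    Task_id       varchar(20),
    ProcessedDTS  varchar(30),
    EffectiveDTS  varchar(30)
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Cast to the proper types while moving the rows into the target table.
INSERT INTO dbo.DRKASTR (Execution_id, Task_id, ProcessedDTS, EffectiveDTS)
SELECT CAST(Execution_id AS int),
       CAST(Task_id AS int),
       CAST(ProcessedDTS AS datetime2),
       CAST(EffectiveDTS AS datetime2)
FROM stg.DRKASTR_load;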
I am trying to do the following operation:
import hiveContext.implicits._
val productDF=hivecontext.sql("select * from productstorehtable2")
println(productDF.show())
The error I am getting is
org.apache.spark.sql.AnalysisException: Table or view not found:
productstorehtable2; line 1 pos 14
I am not sure why that is occurring.
I have used this in spark configuration
set("spark.sql.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse")
and the location shown when I run describe formatted productstorehtable2 is
hdfs://quickstart.cloudera:8020/user/hive/warehouse/productstorehtable2
I have used this code for creating the table
create external table if not exists productstorehtable2
(
device string,
date string,
word string,
count int
)
row format delimited fields terminated by ','
location 'hdfs://quickstart.cloudera:8020/user/cloudera/hadoop/hive/warehouse/VerizonProduct2';
I use sbt (with Spark dependencies) to run the application. My OS is CentOS and I have Spark 2.0.
Could someone help me out in spotting where I am going wrong?
Edit: when I run println(hivecontext.sql("show tables")), it just outputs a blank line.
Thanks
I am trying to execute Hive commands on a JSON file using JSON SerDes, but I always get null values instead of the actual data. I have used the SerDes provided at code.google.com/p/hive-json-serde/downloads/list. I have tried multiple ways, but none of my attempts were successful. Can someone please share the exact steps to follow and the SerDes to use in order to work with JSON files in the latest Apache Hive version (0.14)?
BR,
San
Here are the simple steps to play around with JSON in Hive
Create a hive table
CREATE EXTERNAL TABLE IF NOT EXISTS json_table (
field1 string COMMENT 'This is a field1',
field2 int COMMENT 'This is a field2',
field3 string COMMENT 'This is a field3',
field4 double COMMENT 'This is a field4'
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/path/to/json_table';
Sample data for your table: copy the content below into a JSON file and store it in the location pointed to by json_table.
{"field1":"data1","field2":100,"field3":"more data1","field4":123.001}
{"field1":"data2","field2":200,"field3":"more data2","field4":123.002}
{"field1":"data3","field2":300,"field3":"more data3","field4":123.003}
{"field1":"data4","field2":400,"field3":"more data4","field4":123.004}
Make sure the JSON SerDe jar file is added to the Hive classpath. For this example we have used the OpenX JSON SerDe. It can be downloaded from here.
Command to add the jar
ADD JAR /path-to/json-serde-1.3.6-jar-with-dependencies.jar;
Now we can query the entries from json_table
select * from json_table;
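Individual fields can then be queried in the usual way; for example, against the sample rows above:

select field1, field4 from json_table where field2 >= 200;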
Let's import a simple table in Hive:
hive> CREATE EXTERNAL TABLE tweets (id BIGINT, id_str STRING, user STRUCT<id:BIGINT, screen_name:STRING>)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/projets/tweets';
OK
Time taken: 2.253 seconds
hive> describe tweets.user;
OK
id bigint from deserializer
screen_name string from deserializer
Time taken: 1.151 seconds, Fetched: 2 row(s)
I cannot figure out where the syntax error is here:
hive> select user.id from tweets limit 5;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating user.id
Time taken: 0.699 seconds
I am using the version 1.2.1 of Hive.
I finally found the answer. It seems to be a problem with the JAR used to serialize/deserialize the JSON. The default one (Apache) is not able to do a good job on the data I have.
I tried all these typical JARs (in parentheses, the class used for 'ROW FORMAT SERDE'):
hive-json-serde-0.2.jar (org.apache.hadoop.hive.contrib.serde2.JsonSerde)
hive-serdes-1.0-SNAPSHOT.jar (com.cloudera.hive.serde.JSONSerDe)
hive-serde-1.2.1.jar (org.apache.hadoop.hive.serde2.DelimitedJSONSerDe)
hive-serde-1.2.1.jar (org.apache.hadoop.hive.serde2.avro.AvroSerDe)
All of them gave me different kinds of errors. I list them here so the next person can Google them:
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating user.id
java.lang.ClassCastException: org.json.JSONObject cannot be cast to [Ljava.lang.Object;
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
Failed with exception
java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: DelimitedJSONSerDe cannot deserialize.
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Expecting a AvroGenericRecordWritable
Finally, the working JAR is json-serde-1.3-jar-with-dependencies.jar, which can be found here. This one works with 'STRUCT' and can even ignore some malformed JSON. I also had to use this class for the creation of the table:
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")
LOCATION ...
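Put together with the original column list, the full statement looks roughly like this (the backquotes around user are just a precaution; the path is the one from the question):

CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  id_str STRING,
  `user` STRUCT<id:BIGINT, screen_name:STRING>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")
LOCATION '/projets/tweets';

After recreating the table this way, the query from the question (select user.id from tweets limit 5;) works for me.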
If needed, it is possible to recompile it from here or here. I tried the first repository and it compiles fine for me after adding the necessary libs. The repository has also been updated recently.
I am trying to execute the query below:
INSERT OVERWRITE TABLE nasdaq_daily
PARTITION(stock_char_group)
select exchage, stock_symbol, date, stock_price_open,
stock_price_high, stock_price_low, stock_price_close,
stock_volue, stock_price_adj_close,
SUBSTRING(stock_symbol,1,1) as stock_char_group
FROM nasdaq_daily_stg;
I have already set hive.exec.dynamic.partition=true and hive.exec.dynamic.partition.mode=nonstrict.
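That is, roughly as issued from the Hive shell:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;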
The nasdaq_daily_stg table contains proper data in the form of a number of CSV files. When I execute this query, I get this error message:
Caused by: java.lang.SecurityException: sealing violation: package org.apache.derby.impl.jdbc.authentication is sealed.
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.MapRedTask
The MapReduce job didn't start at all, so there are no logs for this error in the JobTracker web UI. I am using Derby to store the metastore information.
Can someone help me fix this?
Please try this; it may be the issue. You may have the Derby classes twice on your classpath:
"SecurityException: sealing violation" when starting Derby connection