Unable to load data into parquet file format? - hadoop

I am trying to parse log data into parquet file format in hive , the separator used is "||-||".
The sample row is
"b8905bfc-dc34-463e-a6ac-879e50c2e630||-||syntrans1||-||CitBook"
After performing the data staging I am able to get the result
"b8905bfc-dc34-463e-a6ac-879e50c2e630 syntrans1 CitBook ".
While converting the data to parquet file format I got error :
`
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2185)
at org.apache.hadoop.hive.ql.plan.PartitionDesc.getDeserializer(PartitionDesc.java:137)
at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:297)
... 24 more
This is what I have tried
create table log (a String ,b String ,c String)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES (
"field.delim"="||-||",
"collection.delim"="-",
"mapkey.delim"="#"
);
create table log_par(
a String ,
b String ,
c String
) stored as PARQUET ;
insert into logspar select * from log_par ;
`

Aman kumar,
To resolve this issue, run the hive query after adding the following jar:
hive> add jar hive-contrib.jar;
To add the jar permanently, do the following:
1.On Hive Server host, create a /usr/hdp//hive/auxlib directory.
2.Copy /usr/hdp//hive/lib/hive-contrib-.jar to /usr/hdp//hive/auxlib.
3.Restart the HS2 server.
Please check further reference.
https://community.hortonworks.com/content/supportkb/150175/errororgapachehadoophivecontribserde2multidelimits.html.
https://community.hortonworks.com/questions/79075/loading-data-to-hive-via-pig-orgapachehadoophiveco.html
Let me know,if you face any issues

Related

Confluent Kafka-connect-JDBC connector showing hexa decimal data in the kafka topic

I'm trying to copy the data from a table in the oracle db and trying to put that data in a kafka topic. I've used the following JDBC source connector for that :
name=JDBC-DB-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.password = *******
connection.url = jdbc:oracle:thin:#1.1.1.1:1111/ABCD
connection.user = *****
table.types=TABLE
query= select * from (SELECT * FROM JENNY.WORKFLOW where ID = '565231')
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
mode=timestamp+incrementing
incrementing.column.name=ID
timestamp.column.name=MODIFIED
topic.prefix=workflow_data12
poll.interval.ms=6000
timestamp.delay.interval.ms=60000
transforms:createKey
transforms.createKey.type:org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields:ID
So far good. I'm able to get the data into my kafka topic. But the output looks like the following :
key - {"ID":"\u0001"}
value - {"ID":"\u0001","MODIFIED":1874644537368}
You can observer that my key "ID" is being printed as Hexadecimal format, despite I'm using Avro in my JDBC properties file.
(I'm using kafka-avro-console consumer to view the data on the command line)
(And the column "ID" is of type "NUMBER" in the oracle db.)
Could anyone help me to point out if I'm missing some property? to print the data properly in Avro format.
Thanks in advance!!
Add this property to your .properties file e.g before query:
numeric.mapping=best_fit
Detail Explanation can be found here

How to merge orc files for external tables?

I am trying to merge multiple small ORC files. Came across ALTER TABLE CONCATENATE command but that only works for managed tables.
Hive gave me the following error when I try to run it :
FAILED: SemanticException
org.apache.hadoop.hive.ql.parse.SemanticException: Concatenate/Merge
can only be performed on managed tables
Following are the table parameters :
Table Type: EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
EXTERNAL TRUE
numFiles 535
numRows 27051810
orc.compress SNAPPY
rawDataSize 20192634094
totalSize 304928695
transient_lastDdlTime 1512126635
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
I believe your table is an external table,then there are two ways:
Either you can change it to Managed table (ALTER TABLE <table> SET
TBLPROPERTIES('EXTERNAL'='FALSE') and run the ALTER TABLE
CONCATENATE.Then you can convert the same back to external changing
it to TRUE.
Or you can create a managed table using CTAS and insert the data. Then run the merge query and import the data back to external table
From my previous answer to this question, here is a small script in Python using PyORC to concatenate the small ORC files together. It doesn't use Hive at all, so you can only use it if you have direct access to the files and are able to run a Python script on them, which might not always be the case in managed hosts.
import pyorc
import argparse
def main():
parser = argparse.ArgumentParser()
parser.add_argument('-o', '--output', type=argparse.FileType(mode='wb'))
parser.add_argument('files', type=argparse.FileType(mode='rb'), nargs='+')
args = parser.parse_args()
schema = str(pyorc.Reader(args.files[0]).schema)
with pyorc.Writer(args.output, schema) as writer:
for i, f in enumerate(args.files):
reader = pyorc.Reader(f)
if str(reader.schema) != schema:
raise RuntimeError(
"Inconsistent ORC schemas.\n"
"\tFirst file schema: {}\n"
"\tFile #{} schema: {}"
.format(schema, i, str(reader.schema))
)
for line in reader:
writer.write(line)
if __name__ == '__main__':
main()

Invalid path because of filename when hive load data local inpath

The file '/home/hadoop/_user_active_score_small' exactly exists. But when run load data local as below, get a SemanticException:
hive> load data local inpath '/home/hadoop/_user_active_score_small' overwrite into table user_active_score_tmp ;
FAILED: SemanticException Line 1:24 Invalid path ''/home/hadoop/_user_active_score_small'': No files matching path file:/home/hadoop/_user_active_score_small
But, cp /home/hadoop/_user_active_score_small /home/hadoop/user_active_score_small, and then run load data again:
hive> load data local inpath '/home/hadoop/user_active_score_small' overwrite into table user_active_score_tmp ;
Loading data to table user_bg_action.user_active_score_tmp
OK
Time taken: 0.368 seconds
The files' access type are the same, in the same directory:
-rw-rw-r-- 1 hadoop hadoop 614 7月 5 13:49 _user_active_score_small
-rw-rw-r-- 1 hadoop hadoop 614 7月 5 11:48 user_active_score_small
I don't know how does this happen. Is file name which starts with '_' not allowed by hive?
Files and directories that starts with underscore _ are considered hidden in MapReduce, that's probably the reason of the observed behavior.
If you look at FileInputFormat source code you can find this:
protected static final PathFilter hiddenFileFilter = new PathFilter(){
public boolean accept(Path p){
String name = p.getName();
return !name.startsWith("_") && !name.startsWith(".");
}
};

SPARK SQL (1.5.1) connect to Oracle and write to Avro

I am using spark-sql to connect to oracle databse and getting data as dataframes. I would like to write this retrieved data into avro file. While writing to avro I am seeing multiple issues, could you help us.
Here is the code -
val df = sqlContext.read.format("jdbc")
.options(Map( "driver"->"oracle.jdbc.driver.OracleDriver",
"url" -> "jdbc:oracle:thin:user/password#host/service"
, "numPartitions" -> "1", "dbtable"-> "
(Select * from schema.table WHERE STAGE_NUM <=39 and
guid='I284ba1f9cdba11dea82ab9f4ee295c21')"))
.load()
df.write.format("com.databricks.spark.avro").save("Outputfile")
Dependencies that are there in my project -
<dependency><br> <groupId>org.apache.spark</groupId><br> <artifactId>spark-sql_2.10</artifactId><br> <version>1.5.1</version><br></dependency><br><dependency><br> <groupId>com.databricks</groupId><br> <artifactId>spark-avro_2.10</artifactId><br> <version>2.0.1</version><br></dependency><br><dependency><br> <groupId>org.apache.avro</groupId><br> <artifactId>avro</artifactId><br> <version>1.7.7</version><br></dependency><br><dependency><br> <groupId>org.apache.avro</groupId><br> <artifactId>avro-mapred</artifactId><br> <version>1.7.7</version><br></dependency>
Here is the exception information -
java.lang.RuntimeException: com.databricks.spark.avro.DefaultSource does not allow create table as select
If I use - df.write.avro("headnotes"), I get the following exception.
java.lang.IllegalAccessError: tried to access class org.apache.avro.SchemaBuilder$FieldDefault from class com.databricks.spark.avro.SchemaConverters$$anonfun$convertStructToAvro$1

Need javax.jdo.option.ConnectionURL for cassandra

Are the below properties in hive-site.xml correct for Hive access to cassandra??
(I HAVE COPIED ENTIRE HIVE-DEFAULT.XML CONTENT BUT HAVE CHANGED ONLY THE BELOW PROPERTIES)
javax.jdo.option.ConnectionURL : cassandra://localhost:9160
javax.jdo.option.ConnectionDriverName:org.apache.cassandra.cql.jdbc.CassandraDriver
hive.stats.dbclass: jdbc:cassandra
hive.stats.jdbcdriver: org.apache.cassandra.cql.jdbc.CassandraDriver
hive.stats.dbconnectionstring: jdbc:cassandra:;databaseName=TempStatsStore;create=true
I am running 1-node Cassandra. But, later would make it a minimum 2 node cluster.
When I run the below table creation command I get an error:
CREATE EXTERNAL TABLE MyHiveTable
(m string, n string, o string, p string)
STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
TBLPROPERTIES ( "cassandra.ks.name" = "cql3ks",
"cassandra.cf.name" = "test",
"cassandra.cql3.type" = "text, text, text, text");
Error:
FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
NestedThrowables:
java.lang.reflect.InvocationTargetException
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
don't know about jdo settings but you could try this link which is far better option for integrating hive with cassandra -
https://github.com/milliondreams/hive/tree/cas-support-cql/cassandra-handler

Resources