Insert data into Cassandra from Pig using list datatype fails - hadoop

I have the following scenario:
Table in Cassandra:
CREATE TABLE tb_st_test (
id int,
email list<text>,
PRIMARY KEY ((id))
);
PIG Code:
teste = LOAD 'cql://main/tb_st_test' USING CqlStorage();
testing = FOREACH teste GENERATE $0 AS cod, ['emailtest@test.com'] AS field;
insert_test =
FOREACH testing GENERATE
TOTUPLE(
TOTUPLE('id',cod)
),
TOTUPLE(field);
STORE insert_test INTO 'cql://main/tb_st_test?output_query=UPDATE tb_st_test set email %3D%3F' USING CqlStorage();
The idea here is to read the table tb_st_test, get the key values, and update the field email.
But when I run the script I get the following error:
Backend error message
java.io.IOException: org.apache.thrift.transport.TTransportException
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:256)
Caused by: org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:284)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:191)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.cassandra.thrift.Cassandra$Client.recv_execute_prepared_cql3_query(Cassandra.java:1820)
at org.apache.cassandra.thrift.Cassandra$Client.execute_prepared_cql3_query(Cassandra.java:1805)
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:240)
Does anyone know what is happening?

The insert_test format is wrong; for a list collection the format should be TOTUPLE(TOTUPLE('some email', 'email2')). Check https://issues.apache.org/jira/browse/CASSANDRA-5867
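For reference, a minimal corrected script might look like this; it is a sketch based on the format described in that JIRA, keeping the asker's columns and example address:
teste = LOAD 'cql://main/tb_st_test' USING CqlStorage();
insert_test = FOREACH teste GENERATE
    TOTUPLE(TOTUPLE('id', $0)),              -- key: (column name, value) pair
    TOTUPLE(TOTUPLE('emailtest@test.com'));  -- value: the list elements wrapped in a nested tuple
STORE insert_test INTO 'cql://main/tb_st_test?output_query=UPDATE tb_st_test set email %3D%3F' USING CqlStorage();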

Related

Unable to load data into parquet file format?

I am trying to parse log data into the Parquet file format in Hive; the separator used is "||-||".
A sample row is:
"b8905bfc-dc34-463e-a6ac-879e50c2e630||-||syntrans1||-||CitBook"
After performing the data staging I am able to get this result:
"b8905bfc-dc34-463e-a6ac-879e50c2e630 syntrans1 CitBook"
While converting the data to the Parquet file format I got this error:
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2185)
at org.apache.hadoop.hive.ql.plan.PartitionDesc.getDeserializer(PartitionDesc.java:137)
at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:297)
... 24 more
This is what I have tried:
create table log (a String ,b String ,c String)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES (
"field.delim"="||-||",
"collection.delim"="-",
"mapkey.delim"="#"
);
create table log_par(
a String ,
b String ,
c String
) stored as PARQUET ;
insert into log_par select * from log;
Aman kumar,
To resolve this issue, run the Hive query after adding the following jar:
hive> add jar hive-contrib.jar;
To add the jar permanently, do the following (a shell sketch follows the steps):
1. On the Hive Server host, create a /usr/hdp/<version>/hive/auxlib directory.
2. Copy /usr/hdp/<version>/hive/lib/hive-contrib-<version>.jar to /usr/hdp/<version>/hive/auxlib.
3. Restart the HiveServer2 (HS2) server.
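For example (a sketch; <version> stands for the actual HDP/Hive version, which is elided in the paths above):
mkdir -p /usr/hdp/<version>/hive/auxlib
cp /usr/hdp/<version>/hive/lib/hive-contrib-<version>.jar /usr/hdp/<version>/hive/auxlib/
# restart HiveServer2 afterwards so the auxlib jar is picked up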
For further reference, please check:
https://community.hortonworks.com/content/supportkb/150175/errororgapachehadoophivecontribserde2multidelimits.html
https://community.hortonworks.com/questions/79075/loading-data-to-hive-via-pig-orgapachehadoophiveco.html
Let me know if you face any issues.

Invalid format: "19690321" is too short

I am trying to convert the yyyyMMdd format to the yyyy/MM/dd format using Pig; for that I have written the code below.
Code:
STOCK_A = LOAD '/user/root/xxxx/*' USING PigStorage('|');
data = FILTER STOCK_A BY ($1 matches '.*ID.*');
MSH_DATA = FOREACH data GENERATE ToDate($8,'yyyy/MM/dd','UTC') AS dob;
When I try to dump the result I get the error below.
ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 0:
Exception while executing [POUserFunc (Name: POUserFunc(org.apache.pig.builtin.ToDate3ARGS)[datetime] - scope-209 Operator Key: scope-209) children: null at []]:
java.lang.IllegalArgumentException: Invalid format: "19690321" is too short
Sample:
EXVORV##PDULD21F|ID|1|483|1020783||EXVORV##PDULD||19690321|F|
$8 seems valid to me, and I am not able to locate the reason the issue occurs. Any help would be really appreciated.
You use:
ToDate($8,'yyyy/MM/dd','UTC')
but the format is
19690321
so you should have
ToDate($8,'yyyyMMdd','UTC')
The issue is most likely because of the load statement. Since you are not specifying a schema, the datatype will be bytearray by default. You will have to cast it to chararray before passing the field to ToDate:
STOCK_A = LOAD '/user/root/xxxx/*' USING PigStorage('|');
data = FILTER STOCK_A BY ($1 matches '.*ID.*');
MSH_DATA = FOREACH data GENERATE ToDate((chararray)$8,'yyyyMMdd','UTC') AS dob;
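Note that ToDate only parses the string into a datetime; since the stated goal is a yyyy/MM/dd representation, here is a sketch that also formats the value back using the builtin ToString:
STOCK_A = LOAD '/user/root/xxxx/*' USING PigStorage('|');
data = FILTER STOCK_A BY ($1 matches '.*ID.*');
-- parse with the format the data actually has, then render the target layout
MSH_DATA = FOREACH data GENERATE ToString(ToDate((chararray)$8, 'yyyyMMdd', 'UTC'), 'yyyy/MM/dd') AS dob;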

SPARK SQL (1.5.1) connect to Oracle and write to Avro

I am using spark-sql to connect to an Oracle database and fetch data as DataFrames. I would like to write the retrieved data to an Avro file. While writing to Avro I am seeing multiple issues; could you help?
Here is the code -
val df = sqlContext.read.format("jdbc")
.options(Map( "driver"->"oracle.jdbc.driver.OracleDriver",
"url" -> "jdbc:oracle:thin:user/password#host/service"
, "numPartitions" -> "1", "dbtable"-> "
(Select * from schema.table WHERE STAGE_NUM <=39 and
guid='I284ba1f9cdba11dea82ab9f4ee295c21')"))
.load()
df.write.format("com.databricks.spark.avro").save("Outputfile")
Dependencies in my project:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.10</artifactId>
  <version>1.5.1</version>
</dependency>
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-avro_2.10</artifactId>
  <version>2.0.1</version>
</dependency>
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.7.7</version>
</dependency>
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-mapred</artifactId>
  <version>1.7.7</version>
</dependency>
Here is the exception information -
java.lang.RuntimeException: com.databricks.spark.avro.DefaultSource does not allow create table as select
If I use df.write.avro("headnotes") instead, I get the following exception:
java.lang.IllegalAccessError: tried to access class org.apache.avro.SchemaBuilder$FieldDefault from class com.databricks.spark.avro.SchemaConverters$$anonfun$convertStructToAvro$1

Need javax.jdo.option.ConnectionURL for cassandra

Are the properties below in hive-site.xml correct for Hive access to Cassandra?
(I have copied the entire hive-default.xml content but have changed only the properties below.)
javax.jdo.option.ConnectionURL : cassandra://localhost:9160
javax.jdo.option.ConnectionDriverName:org.apache.cassandra.cql.jdbc.CassandraDriver
hive.stats.dbclass: jdbc:cassandra
hive.stats.jdbcdriver: org.apache.cassandra.cql.jdbc.CassandraDriver
hive.stats.dbconnectionstring: jdbc:cassandra:;databaseName=TempStatsStore;create=true
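In hive-site.xml property syntax, an entry such as the first one would look like this (the others follow the same pattern):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>cassandra://localhost:9160</value>
</property>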
I am running a 1-node Cassandra cluster, but would later make it a minimum 2-node cluster.
When I run the below table creation command I get an error:
CREATE EXTERNAL TABLE MyHiveTable
(m string, n string, o string, p string)
STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
TBLPROPERTIES ( "cassandra.ks.name" = "cql3ks",
"cassandra.cf.name" = "test",
"cassandra.cql3.type" = "text, text, text, text");
Error:
FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
NestedThrowables:
java.lang.reflect.InvocationTargetException
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
I don't know about the JDO settings, but you could try this link, which is a far better option for integrating Hive with Cassandra:
https://github.com/milliondreams/hive/tree/cas-support-cql/cassandra-handler

H2 database inserting data exception after adding new column with ALTER TABLE

I am adding a new column to my table with the code below:
String sql = "ALTER TABLE PROJE_ALAN ADD NEWCOLUMN VARCHAR(30)";
PreparedStatement ps = conn.prepareStatement(sql.toString());
ps.execute();
conn.close();
ps.close();
This adds the new column to my table, but when I then try to insert new data it throws an exception:
Caused by: org.h2.jdbc.JdbcSQLException: General error: "net.sourceforge.hatbox.RTreeInternalException: Unable to select meta node"; SQL statement: INSERT INTO "PROJE_ALAN" ( "THE_GEOM","JJ","KK","NEWCOLUMN" ) VALUES ( ST_GeomFromText ('MULTIPOLYGON (((-244856.06897661195 4166022.019422841, 189248.78294214187 4442270.561552957, 778743.439809086 4301679.785647452, 662817.7123080553 4101892.893571207, 83189.0748029009 3707252.1190996123, -244856.06897661195 4166022.019422841)))',23036),'','','') [50000-172]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:329)
at org.h2.message.DbException.get(DbException.java:158)
at org.h2.message.DbException.convert(DbException.java:281)
at org.h2.schema.TriggerObject.fireRow(TriggerObject.java:215)
at org.h2.table.Table.fireRow(Table.java:904)
at org.h2.table.Table.fireAfterRow(Table.java:895)
at org.h2.command.dml.Insert.insertRows(Insert.java:128)
at org.h2.command.dml.Insert.update(Insert.java:86)
at org.h2.command.CommandContainer.update(CommandContainer.java:79)
at org.h2.command.Command.executeUpdate(Command.java:235)
at org.h2.jdbc.JdbcStatement.executeInternal(JdbcStatement.java:180)
at org.h2.jdbc.JdbcStatement.execute(JdbcStatement.java:155)
at org.apache.commons.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)
at org.apache.commons.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)
at org.geotools.jdbc.JDBCDataStore.insert(JDBCDataStore.java:1447)
... 17 more
Caused by: net.sourceforge.hatbox.RTreeInternalException: Unable to select meta node
at net.sourceforge.hatbox.Lock.<init>(Lock.java:88)
at net.sourceforge.hatbox.RTreeSessionDb.<init>(RTreeSessionDb.java:75)
at net.sourceforge.hatbox.jts.InsertTrigger.fire(InsertTrigger.java:43)
at org.h2.schema.TriggerObject.fireRow(TriggerObject.java:201)
... 28 more
Caused by: org.h2.jdbc.JdbcSQLException: Table "PROJE_ALAN_COPY_11_5_HATBOX" not found; SQL statement: select node_data, id from "PUBLIC"."PROJE_ALAN_COPY_11_5_HATBOX" where id = ? FOR UPDATE [42102-172]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:329)
at org.h2.message.DbException.get(DbException.java:169)
at org.h2.message.DbException.get(DbException.java:146)
at org.h2.schema.Schema.getTableOrView(Schema.java:419)
at org.h2.command.Parser.readTableOrView(Parser.java:4808)
at org.h2.command.Parser.readTableFilter(Parser.java:1099)
at org.h2.command.Parser.parseSelectSimpleFromPart(Parser.java:1705)
at org.h2.command.Parser.parseSelectSimple(Parser.java:1813)
at org.h2.command.Parser.parseSelectSub(Parser.java:1699)
at org.h2.command.Parser.parseSelectUnion(Parser.java:1542)
at org.h2.command.Parser.parseSelect(Parser.java:1530)
at org.h2.command.Parser.parsePrepared(Parser.java:405)
at org.h2.command.Parser.parse(Parser.java:279)
at org.h2.command.Parser.parse(Parser.java:251)
at org.h2.command.Parser.prepareCommand(Parser.java:218)
at org.h2.engine.Session.prepareLocal(Session.java:425)
at org.h2.engine.Session.prepareCommand(Session.java:374)
at org.h2.jdbc.JdbcConnection.prepareCommand(JdbcConnection.java:1138)
at org.h2.jdbc.JdbcPreparedStatement.<init>(JdbcPreparedStatement.java:70)
at org.h2.jdbc.JdbcConnection.prepareStatement(JdbcConnection.java:644)
at net.sourceforge.hatbox.Lock.<init>(Lock.java:72)
... 31 more
If I restart my application, then I can add new data to the table. I think the problem may be that the indexes are not refreshed without restarting the application. Maybe it is related to HatBox.
So what am I missing?
It's not HatBox's fault; it's a bug in H2. The H2 engine recreates the triggers after an ALTER TABLE statement, but calls trigger.init with the temporary table name. Because of that, the trigger is initialized with the wrong table name. Only after creating the triggers does H2 rename the table back to its original name.
My workaround for this bug (it's buggy too, but working) is to change the init methods of the net.sourceforge.hatbox.jts Insert, Update, and Delete triggers to:
public void init(Connection con, String schema, String trigger, String table,
        boolean before, int type) throws SQLException {
    this.schema = schema;
    this.table = table;
    // H2 passes the temporary "_COPY_" name while recreating triggers
    // during ALTER TABLE; strip the suffix to recover the real table name.
    if (this.table.contains("_COPY_")) {
        this.table = table.substring(0, table.indexOf("_COPY_"));
    }
}
You have to be careful while using this: if your table name itself contains _COPY_, it will not work. You could make the check stricter by matching H2's temporary-name pattern with a regex instead, as sketched below.
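A sketch of that stricter check, assuming the temporary names look like PROJE_ALAN_COPY_11_5:
// hypothetical: only strip a suffix matching H2's temporary
// copy-table pattern (NAME_COPY_<digits>_<digits>)
if (this.table.matches(".*_COPY_\\d+_\\d+$")) {
    this.table = table.substring(0, table.lastIndexOf("_COPY_"));
}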
