Hadoop Spark SQL insert failing

I'm trying to insert around 13 million rows into a new table, but I'm getting the following error:
22/12/09 19:33:56 ERROR Utils: Aborting task
java.lang.AssertionError: assertion failed: Created file counter 11 is beyond max value 10
at scala.Predef$.assert(Predef.scala:223)
at org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.$anonfun$increaseCreatedFileAndCheck$1(FileFormatDataWriter.scala:191)
at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.increaseCreatedFileAndCheck(FileFormatDataWriter.scala:188)
at org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.write(FileFormatDataWriter.scala:277)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:280)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:288)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:211)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/12/09 19:33:57 ERROR FileFormatWriter: Job job_202212091917352650741377131539872_0020 aborted.
22/12/09 19:33:57 ERROR Executor: Exception in task 0.1 in stage 20.0 (TID 26337)
org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:298)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:211)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.AssertionError: assertion failed: Created file counter 11 is beyond max value 10
at scala.Predef$.assert(Predef.scala:223)
at org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.$anonfun$increaseCreatedFileAndCheck$1(FileFormatDataWriter.scala:191)
at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.increaseCreatedFileAndCheck(FileFormatDataWriter.scala:188)
at org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.write(FileFormatDataWriter.scala:277)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:280)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:288)
The insert operation looks like the following:
insert overwrite table fake_table_txt partition(partition_name)
select id, name, type, description from ( inner query )
I'm a Hadoop beginner and I have no idea what may be causing this.
Could anybody please point me in the right direction?

After struggling a little, I was told that increasing the "files per task" property would do the trick:
set spark.sql.maxCreatedFilesPerTask = 15;
It previously defaulted to 10.
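For reference, here is a minimal sketch (in Scala, assuming a SparkSession named spark) of applying that setting before re-running the insert. The property name is the one quoted above, which may be specific to the Spark distribution in use, and source_table is a hypothetical stand-in for the original inner query:
// Sketch only: raise the per-task created-file limit, then re-run the insert.
// "fake_table_txt" and "partition_name" come from the question; "source_table"
// is a hypothetical placeholder for the original inner query.
spark.sql("SET spark.sql.maxCreatedFilesPerTask = 15")

spark.sql("""
  INSERT OVERWRITE TABLE fake_table_txt PARTITION (partition_name)
  SELECT id, name, type, description
  FROM source_table
""")
Clustering the output on the partition column before the insert (for example with a DISTRIBUTE BY partition_name clause) is another approach sometimes used to reduce the number of files a single task has to write, though the fix that worked here was the property change.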

Related

Exception 'Cannot get a connection, pool error Timeout waiting for idle object' when using 'DBCPConnectionPoolLookup' service in Nifi

I'm trying to use the 'DBCPConnectionPoolLookup' service in 'ExecuteGroovyScript' to dynamically query the required database based on the 'database.name' parameter in the input flow file.
The processor successfully gets the corresponding 'DBCPConnectionPool' service for querying, but I'm getting the exception java.sql.SQLException: Cannot get a connection, pool error Timeout waiting for idle object. In contrast, if I use the 'DBCPConnectionPool' service directly, without the 'Lookup' service and without changing any configuration, it works fine.
I access the service as follows:
def clientDb = CTL.SQLLookupService.getConnection(flowFile.getAttributes())
Then use the 'clientDb' object to query as:
clientDb.rows(timseriesSqlCountQuery).eachWithIndex { row, idx ->numRowsTimeSeries= row.c}
I have tried increasing Max Wait Time and Max Total Connections in the 'DBCPConnectionPool' service, but it does not help.
Please find below the script code and the exception from the NiFi logs.
Script Code
import org.apache.nifi.distributed.cache.client.Deserializer
import org.apache.nifi.distributed.cache.client.Serializer
import org.apache.nifi.distributed.cache.client.exception.DeserializationException
import org.apache.nifi.distributed.cache.client.exception.SerializationException
import groovy.sql.Sql
import java.time.*
try {
def flowFile = session.get()
def isBootstrap=flowFile."isBootstrap"
def timseriesSqlQuery='SELECT id FROM [dbo].[Points] where ([MappedToEquipment] = \'Mapped\' or PointStatus = \'Mapped\')'
def timseriesSqlCountQuery='SELECT count(id) as c FROM [dbo].[Points] where ([MappedToEquipment] = \'Mapped\' or PointStatus = \'Mapped\')'
def spaceSqlQuery='select id from (select id from dbo.organization union select id from dbo.facility union select id from dbo.building union select id from dbo.floor union select id from dbo.wing union select id from dbo.room union select id from dbo.systems) tmp'
def spaceSqlCountQuery='select count(id) as c from (select id from dbo.organization union select id from dbo.facility union select id from dbo.building union select id from dbo.floor union select id from dbo.wing union select id from dbo.room union select id from dbo.systems) tmp'
def cache = CTL.lastIngestTimeMap
def clientDb = CTL.SQLLookupService.getConnection(flowFile.getAttributes())//SQL.staticService
int numRowsTimeSeries=0
int numRowsSpace=0
clientDb.rows(timseriesSqlCountQuery).eachWithIndex { row, idx ->numRowsTimeSeries= row.c}
clientDb.rows(spaceSqlCountQuery).eachWithIndex { row, idx ->numRowsSpace= row.c}
}
Exception from NiFi logs
2019-09-12 06:18:33,629 ERROR [Timer-Driven Process Thread-3] o.a.n.p.groovyx.ExecuteGroovyScript ExecuteGroovyScript[id=b435c079-ee6c-3c42-a6ea-020968267ecf] ExecuteGroovyScript[id=b435c079-ee6c-3c42-a6ea-020968267ecf] failed to process session due to java.lang.ClassCastException; Processor Administratively Yielded for 1 sec: java.lang.ClassCastException
java.lang.ClassCastException: null
2019-09-12 06:18:33,629 WARN [Timer-Driven Process Thread-3] o.a.n.controller.tasks.ConnectableTask Administratively Yielding ExecuteGroovyScript[id=b435c079-ee6c-3c42-a6ea-020968267ecf] due to uncaught Exception: java.lang.ClassCastException
java.lang.ClassCastException: null
2019-09-12 06:18:33,629 ERROR [Timer-Driven Process Thread-9] o.a.n.p.groovyx.ExecuteGroovyScript ExecuteGroovyScript[id=9b81ca15-93a5-3953-9f40-d0874cfe2531] ExecuteGroovyScript[id=9b81ca15-93a5-3953-9f40-d0874cfe2531] failed to process session due to java.lang.ClassCastException; Processor Administratively Yielded for 1 sec: java.lang.ClassCastException
java.lang.ClassCastException: null
2019-09-12 06:18:33,629 WARN [Timer-Driven Process Thread-9] o.a.n.controller.tasks.ConnectableTask Administratively Yielding ExecuteGroovyScript[id=9b81ca15-93a5-3953-9f40-d0874cfe2531] due to uncaught Exception: java.lang.ClassCastException
java.lang.ClassCastException: null
2019-09-12 06:18:33,708 ERROR [Timer-Driven Process Thread-10] o.a.n.p.groovyx.ExecuteGroovyScript ExecuteGroovyScript[id=a1ec4496-dca3-38ab-a47b-43d7ff95e40f] org.apache.nifi.processor.exception.ProcessException: java.sql.SQLException: Cannot get a connection, pool error Timeout waiting for idle object: org.apache.nifi.processor.exception.ProcessException: java.sql.SQLException: Cannot get a connection, pool error Timeout waiting for idle object
org.apache.nifi.processor.exception.ProcessException: java.sql.SQLException: Cannot get a connection, pool error Timeout waiting for idle object
at org.apache.nifi.dbcp.DBCPConnectionPool.getConnection(DBCPConnectionPool.java:308)
at org.apache.nifi.dbcp.DBCPService.getConnection(DBCPService.java:49)
at sun.reflect.GeneratedMethodAccessor106.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.nifi.controller.service.StandardControllerServiceInvocationHandler.invoke(StandardControllerServiceInvocationHandler.java:84)
at com.sun.proxy.$Proxy89.getConnection(Unknown Source)
at org.apache.nifi.processors.groovyx.ExecuteGroovyScript.onInitSQL(ExecuteGroovyScript.java:339)
at org.apache.nifi.processors.groovyx.ExecuteGroovyScript.onTrigger(ExecuteGroovyScript.java:439)
at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1165)
at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:203)
at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:117)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.sql.SQLException: Cannot get a connection, pool error Timeout waiting for idle object
at org.apache.commons.dbcp2.PoolingDataSource.getConnection(PoolingDataSource.java:142)
at org.apache.commons.dbcp2.BasicDataSource.getConnection(BasicDataSource.java:1563)
at org.apache.nifi.dbcp.DBCPConnectionPool.getConnection(DBCPConnectionPool.java:305)
... 19 common frames omitted
Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:451)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:365)
at org.apache.commons.dbcp2.PoolingDataSource.getConnection(PoolingDataSource.java:134)
... 21 common frames omitted
2019-09-12 06:18:33,708 ERROR [Timer-Driven Process Thread-2] o.a.n.p.groovyx.ExecuteGroovyScript ExecuteGroovyScript[id=54d1e251-88f2-33f3-0489-722879a802bd] org.apache.nifi.processor.exception.ProcessException: java.sql.SQLException: Cannot get a connection, pool error Timeout waiting for idle object: org.apache.nifi.processor.exception.ProcessException: java.sql.SQLException: Cannot get a connection, pool error Timeout waiting for idle object
org.apache.nifi.processor.exception.ProcessException: java.sql.SQLException: Cannot get a connection, pool error Timeout waiting for idle object
at org.apache.nifi.dbcp.DBCPConnectionPool.getConnection(DBCPConnectionPool.java:308)
at org.apache.nifi.dbcp.DBCPService.getConnection(DBCPService.java:49)
at sun.reflect.GeneratedMethodAccessor106.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.nifi.controller.service.StandardControllerServiceInvocationHandler.invoke(StandardControllerServiceInvocationHandler.java:84)
at com.sun.proxy.$Proxy89.getConnection(Unknown Source)
at org.apache.nifi.processors.groovyx.ExecuteGroovyScript.onInitSQL(ExecuteGroovyScript.java:339)
at org.apache.nifi.processors.groovyx.ExecuteGroovyScript.onTrigger(ExecuteGroovyScript.java:439)
at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1165)
at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:203)
at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:117)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.sql.SQLException: Cannot get a connection, pool error Timeout waiting for idle object
at org.apache.commons.dbcp2.PoolingDataSource.getConnection(PoolingDataSource.java:142)
at org.apache.commons.dbcp2.BasicDataSource.getConnection(BasicDataSource.java:1563)
at org.apache.nifi.dbcp.DBCPConnectionPool.getConnection(DBCPConnectionPool.java:305)
... 19 common frames omitted
Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:451)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:365)
at org.apache.commons.dbcp2.PoolingDataSource.getConnection(PoolingDataSource.java:134)
... 21 common frames omitted
Finally, after bringing NiFi down twice, I found the solution. The problem was in the code I was using: I queried the SQL table directly with the object returned by CTL.index.getConnection(flowFile.getAttributes()), but that object is actually a connection. Because of this, NiFi used up all available connections to SQL, so even after I reverted to the 'DBCPConnectionPool' service instead of the 'Lookup' service I still got the above error; only restarting NiFi made it work again.
The correct code to use in your script with the 'Lookup' service is:
def connectionObj = CTL.index.getConnection(flowFile.getAttributes())
def clientDb = new Sql(connectionObj)
Now use the 'clientDb' object to query your table
clientDb.rows(timseriesSqlCountQuery).eachWithIndex { row, idx ->numRowsTimeSeries= row.c}

Aggregate query in Presto on Redshift's view failing with error org.postgresql.util.PSQLException: ERROR: transaction is read-only

When running an aggregate query from Presto against a Redshift view, we are getting an org.postgresql.util.PSQLException: ERROR: transaction is read-only exception.
We have a view in our Redshift cluster that we are trying to access from Presto with:
select count(col1) from redshift.public.my_view
Below is the full exception stack trace:
com.facebook.presto.spi.PrestoException: ERROR: transaction is read-only
Detail:
at com.facebook.presto.plugin.jdbc.JdbcRecordCursor.handleSqlException(JdbcRecordCursor.java:236)
at com.facebook.presto.plugin.jdbc.JdbcRecordCursor.<init>(JdbcRecordCursor.java:95)
at com.facebook.presto.plugin.jdbc.JdbcRecordSet.cursor(JdbcRecordSet.java:59)
at com.facebook.presto.spi.RecordPageSource.<init>(RecordPageSource.java:37)
at com.facebook.presto.split.RecordPageSourceProvider.createPageSource(RecordPageSourceProvider.java:42)
at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:239)
at com.facebook.presto.operator.Driver.processInternal(Driver.java:379)
at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:283)
at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:675)
at com.facebook.presto.operator.Driver.processFor(Driver.java:276)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1077)
at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:483)
at com.facebook.presto.$gen.Presto_0_219____20190520_105106_1.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.postgresql.util.PSQLException: ERROR: transaction is read-only
Detail:
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2198)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1927)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:561)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:419)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:304)
at com.facebook.presto.plugin.jdbc.JdbcRecordCursor.<init>(JdbcRecordCursor.java:92)
... 16 more
select count(mycol) from redshift.public.my_view
The result we need is simply the count from the view.

IllegalDataException from DateUtil.java when saving spark streaming dataframe to phoenix

I am using Kafka + Spark Streaming to stream messages and do analytics, then saving the results to Phoenix. Some Spark jobs fail several times per day with the following error message:
org.apache.phoenix.schema.IllegalDataException:
java.lang.IllegalArgumentException: Invalid format: ""
at org.apache.phoenix.util.DateUtil$ISODateFormatParser.parseDateTime(DateUtil.java:297)
at org.apache.phoenix.util.DateUtil.parseDateTime(DateUtil.java:163)
at org.apache.phoenix.util.DateUtil.parseTimestamp(DateUtil.java:175)
at org.apache.phoenix.schema.types.PTimestamp.toObject(PTimestamp.java:95)
at org.apache.phoenix.expression.LiteralExpression.newConstant(LiteralExpression.java:194)
at org.apache.phoenix.expression.LiteralExpression.newConstant(LiteralExpression.java:172)
at org.apache.phoenix.expression.LiteralExpression.newConstant(LiteralExpression.java:159)
at org.apache.phoenix.compile.UpsertCompiler$UpsertValuesCompiler.visit(UpsertCompiler.java:979)
at org.apache.phoenix.compile.UpsertCompiler$UpsertValuesCompiler.visit(UpsertCompiler.java:963)
at org.apache.phoenix.parse.BindParseNode.accept(BindParseNode.java:47)
at org.apache.phoenix.compile.UpsertCompiler.compile(UpsertCompiler.java:832)
at org.apache.phoenix.jdbc.PhoenixStatement$ExecutableUpsertStatement.compilePlan(PhoenixStatement.java:578)
at org.apache.phoenix.jdbc.PhoenixStatement$ExecutableUpsertStatement.compilePlan(PhoenixStatement.java:566)
at org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:331)
at org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:326)
at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
at org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:324)
at org.apache.phoenix.jdbc.PhoenixStatement.execute(PhoenixStatement.java:245)
at org.apache.phoenix.jdbc.PhoenixPreparedStatement.execute(PhoenixPreparedStatement.java:172)
at org.apache.phoenix.jdbc.PhoenixPreparedStatement.execute(PhoenixPreparedStatement.java:177)
at org.apache.phoenix.mapreduce.PhoenixRecordWriter.write(PhoenixRecordWriter.java:79)
at org.apache.phoenix.mapreduce.PhoenixRecordWriter.write(PhoenixRecordWriter.java:39)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1113)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1111)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1111)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1251)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1091)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: Invalid format: ""
at org.joda.time.format.DateTimeFormatter.parseDateTime(DateTimeFormatter.java:673)
at org.apache.phoenix.util.DateUtil$ISODateFormatParser.parseDateTime(DateUtil.java:295)
My code:
val myDF = sqlContext.createDataFrame(myRows, myStruct)
myDF.write
.format(sourcePhoenixSpark)
.mode("overwrite")
.options(Map("table" -> (myPhoenixNamespace + myTable), "zkUrl" -> myPhoenixZKUrl))
.save()
I am using phoenix-spark version 4.7.0-HBase-1.1. Any suggestions for solving the problem would be appreciated. Thanks.
You are trying to process dirty data.
That error comes from here:
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/util/DateUtil.java#L301
Where it tries to parse a string that is expected to be a date in ISO format, and the provided string is empty ("").
You need to prepare and clean your data before attempting to write it to storage.
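As an illustration, here is a minimal sketch (in Scala) of cleaning the frame from the question before writing it, reusing the variables from that snippet; the column name event_ts is hypothetical and stands in for whichever timestamp column is receiving the empty strings:
import org.apache.spark.sql.functions.{col, lit, trim, when}

// Sketch only: replace empty timestamp strings with null so Phoenix never
// tries to parse "" as an ISO date. "event_ts" is a hypothetical column name.
val cleanedDF = myDF.withColumn(
  "event_ts",
  when(trim(col("event_ts")) === "", lit(null)).otherwise(col("event_ts"))
)

cleanedDF.write
  .format(sourcePhoenixSpark)
  .mode("overwrite")
  .options(Map("table" -> (myPhoenixNamespace + myTable), "zkUrl" -> myPhoenixZKUrl))
  .save()
Alternatively, rows whose timestamp is empty can be dropped entirely with a filter before the write, depending on whether such rows are worth keeping.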

Groovy Script Execution Exception

Can anyone tell me what approach I need to take to solve the below exception?
[2017-05-23 15:28:48,783][DEBUG][action.search.type ] [Mister Buda] [2217] Failed to execute fetch phase
org.elasticsearch.script.groovy.GroovyScriptExecutionException: NullPointerException[null]
at org.elasticsearch.script.groovy.GroovyScriptEngineService$GroovyScript.run(GroovyScriptEngineService.java:274)
at org.elasticsearch.search.fetch.script.ScriptFieldsFetchSubPhase.hitExecute(ScriptFieldsFetchSubPhase.java:74)
at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:194)
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:516)
at org.elasticsearch.search.action.SearchServiceTransportAction$17.call(SearchServiceTransportAction.java:452)
at org.elasticsearch.search.action.SearchServiceTransportAction$17.call(SearchServiceTransportAction.java:449)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:559)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I see this exception when querying an index; the query contains a script_field.
I am using Elasticsearch version 1.5.
This is the script:
(( _source.rating.'${amount}'.costRating.value * costWeighting) + ( _source.rating.flexibilityRating.value * (1 - costWeighting))) * 10

Finding the Max value in JavaPairDStream

Is there a way to find the max from a JavaPairDStream? My key is a String and the value is an ArrayList<Row>. I need to find the tuple that has the max value of a certain column of the rows in the ArrayList. I am using Spark 1.6.
I want to implement something similar to javaRDD.max() for a JavaPairDStream.
Using updateStateByKey is not an option, as I need the state across all the keys and not per key.
I have tried using accumulators in the following way:
if (accumulator.value() < max) {
    accumulator.setValue(max);
}
I do this inside reduceByKey, but I get the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.UnsupportedOperationException: Can't read accumulator value in task
at org.apache.spark.Accumulable.value(Accumulators.scala:98)
at sample.sample.SampleJdd$2.call(SampleJdd.java:82)
at sample.sample.SampleJdd$2.call(SampleJdd.java:74)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction2$1.apply(JavaPairRDD.scala:996)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:200)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:199)
at org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:138)
at org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:205)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
