Scala: exception handling in anonymous function - debugging

If I pass an anonymous function as an argument, as in this code sample:
val someMap = someData.map(line => (line.split("\\|")(0), // key
line.split("\\|")(1) + "|" + // value as string concat
line.split("\\|")(4) + "|" +
line.split("\\|")(9)))
I could catch, e.g. an ArrayIndexOutOfBoundsException like this:
try {
val someMap = someData.map(line => (line.split("\\|")(0), // key
line.split("\\|")(1) + "|" + // value as string concat
line.split("\\|")(4) + "|" +
line.split("\\|")(9)))
} catch {
case e1: ArrayIndexOutOfBoundsException => println("exception in line " )
}
The problem with this is that I do not have access to the inner function's scope. In this case I would like to print the line (from the anonymous function) which caused the exception.
How can I do this? Is there some way of catching an exception within an anonymous function? Is there a way to access the scope of an anonymous function from the outside for debugging purposes?
edit: I'm using Scala 2.9.3

You could use Either
val result =
someData.map {
line =>
try {
val values = (line.split("\\|")(0), // key
line.split("\\|")(1) + "|" + // value as string concat
line.split("\\|")(4) + "|" +
line.split("\\|")(9))
Right(values)
} catch {
case e1: ArrayIndexOutOfBoundsException =>
Left(s"exception in line $line")
}
}
result.foreach {
case (Right(values)) => println(values)
case (Left(msg)) => println(msg)
}
But if you are importing data from a text file, I would try to do it without exceptions (because it's not really exceptional to get invalid data in that case):
val result =
someData.map {
line =>
val fields = line.split("\\|")
if (fields.length < 10) {
Left(s"Error in line $line")
} else {
val values = (fields(0), Seq(fields(1), fields(4), fields(9)))
Right(values)
}
}
result.foreach {
case (Right((key, values))) => println(s"$key -> ${values.mkString("|")}")
case (Left(msg)) => println(msg)
}

Perhaps this will give you some ideas:
try {
val someMap = someData.map { line =>
try {
(line.split("\\|")(0), // key
line.split("\\|")(1) + "|" + // value as string concat
line.split("\\|")(4) + "|" +
line.split("\\|")(9)))
} catch {
case inner: ArrayIndexOutOfBoundsException => {
println("exception in " + line)
throw inner;
}
}
}
} catch {
case outer: ArrayIndexOutOfBoundsException => ...
}

The other answers give nice functional solutions using Either etc. If you were using Scala 2.10, you could also use Try as
val lines = List("abc", "ef");
println(lines.map(line => Try(line(3))));
to get a List[Try[Char]], where you can examine whether each element succeeded or failed. (I haven't tried to compile this.)
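For illustration, a minimal sketch of inspecting such a result (hypothetical variable names; assumes Scala 2.10+ and scala.util.Try):
import scala.util.{Try, Success, Failure}
val lines = List("abc", "ef")
// Each element becomes Success(char) or Failure(exception),
// since line(3) throws StringIndexOutOfBoundsException for short strings.
val results: List[Try[Char]] = lines.map(line => Try(line(3)))
results.zip(lines).foreach {
  case (Success(c), line)  => println(s"'${line}'(3) = $c")
  case (Failure(ex), line) => println(s"'${line}' failed: $ex")
}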
If for any reason you prefer exceptions, you need to catch the exception inside the mapping function and rethrow it with information about the line. For example:
// Your own exception class holding a line that failed:
case class LineException(line: String, nested: Exception)
extends Exception(nested);
// Computes something on a line and throws a proper `LineException`
// if the processing fails:
def lineWorker[A](worker: String => A)(line: String): A =
try {
worker(line)
} catch {
case (e: Exception) => throw LineException(line, e);
}
def getNth(lines: List[String], i: Int): List[Char]
= lines.map(lineWorker(_.apply(i)));
val lines = List("abc", "ef");
println(getNth(lines, 1));
println(getNth(lines, 2));
You can also express it using Catch from scala.util.control.Exception:
case class LineException(line: String, nested: Throwable)
extends Exception(nested); // we need Throwable here ^^
import scala.util.control.Exception._
// Returns a `Catch` that wraps any exception to a proper `LineException`.
def lineExceptionCatch[T](line: String): Catch[T]
= handling[T](classOf[Exception]).by(e => throw LineException(line, e));
def lineWorker[A](worker: String => A)(line: String): A =
lineExceptionCatch[A](line)(worker(line))
// ...

First, your outer try/catch is useless. If your List (or other structure) is empty, the map function won't do anything, so no ArrayIndexOutOfBoundsException will be thrown.
As for the inner loop, I would suggest another solution with Scalaz Either:
import scalaz._
import EitherT._
import Id.Id
val someMap = someData.map { line =>
fromTryCatch[Id, (String, String)] {
(line.split("\\|")(0), // key
line.split("\\|")(1) + "|" + // value as string concat
line.split("\\|")(4) + "|" +
line.split("\\|")(9))
}
}
and then chain your operations on the resulting List[EitherT[...]].

Related

In Rx instead of only getting the last debounced object, can I get the complete sequence?

I want to know if one of the debounced objects was a green ball. Filtering for only green balls before or after the debounce leads to incorrect behavior.
You can use the buffer operator together with the debounce operator. Here is a very basic example:
// This is our event stream. In this example we only track mouseup events on the document
const move$ = Observable.fromEvent(document, 'mouseup');
// We want to create a debounced version of the initial stream
const debounce$ = move$.debounceTime(1000);
// Now create the buffered stream from the initial move$ stream.
// The debounce$ stream can be used to emit the values that are in the buffer
const buffered$ = move$.buffer(debounce$);
// Subscribe to your buffered stream
buffered$.subscribe(res => console.log('Buffered Result: ', res));
If I understand correctly what you want to achieve, you probably need to build an Observable which emits some sort of object which contains both the source value (i.e. blue, red, green in your case) as well as a flag that indicates whether or not there was a green in the debounced values.
If this is true, you can try to code along these lines
const s = new Subject<string>();
setTimeout(() => s.next('B'), 100);
setTimeout(() => s.next('G'), 1100);
setTimeout(() => s.next('B'), 1200);
setTimeout(() => s.next('G'), 1300);
setTimeout(() => s.next('R'), 1400);
setTimeout(() => s.next('B'), 2400);
let hasGreen = false;
s
.do(data => hasGreen = hasGreen || data === 'G')
.debounceTime(500)
.map(data => ({data, hasGreen})) // this map has to come before the following do
.do(() => hasGreen = false)
.subscribe(data => console.log(data))
Be careful about the sequence. In particular you have to put the map operator which creates the object you want to emit before the do that resets your variable.
This could be done with a non-trivial set of operators and side-effecting a flow by introducing extra channels:
import java.util.Queue;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;
import org.junit.Test;
import io.reactivex.*;
import io.reactivex.functions.Consumer;
import io.reactivex.schedulers.*;
import io.reactivex.subjects.PublishSubject;
public class DebounceTimeDrop {
@Test
public void test() {
PublishSubject<Integer> source = PublishSubject.create();
TestScheduler scheduler = new TestScheduler();
source.compose(debounceTime(10, TimeUnit.MILLISECONDS, scheduler, v -> {
System.out.println(
"Dropped: " + v + " # T=" + scheduler.now(TimeUnit.MILLISECONDS));
}))
.subscribe(v -> System.out.println(
"Passed: " + v + " # T=" + scheduler.now(TimeUnit.MILLISECONDS)),
Throwable::printStackTrace,
() -> System.out.println(
"Done " + " # T=" + scheduler.now(TimeUnit.MILLISECONDS)));
source.onNext(1);
scheduler.advanceTimeBy(10, TimeUnit.MILLISECONDS);
scheduler.advanceTimeBy(20, TimeUnit.MILLISECONDS);
source.onNext(2);
scheduler.advanceTimeBy(1, TimeUnit.MILLISECONDS);
source.onNext(3);
scheduler.advanceTimeBy(1, TimeUnit.MILLISECONDS);
source.onNext(4);
scheduler.advanceTimeBy(1, TimeUnit.MILLISECONDS);
source.onNext(5);
scheduler.advanceTimeBy(10, TimeUnit.MILLISECONDS);
scheduler.advanceTimeBy(20, TimeUnit.MILLISECONDS);
source.onNext(6);
scheduler.advanceTimeBy(10, TimeUnit.MILLISECONDS);
scheduler.advanceTimeBy(20, TimeUnit.MILLISECONDS);
source.onComplete();
}
public static <T> ObservableTransformer<T, T> debounceTime(
long time, TimeUnit unit, Scheduler scheduler,
Consumer<? super T> dropped) {
return o -> Observable.<T>defer(() -> {
AtomicLong index = new AtomicLong();
Queue<Timed<T>> queue = new ConcurrentLinkedQueue<>();
return o.map(v -> {
Timed<T> t = new Timed<>(v,
index.getAndIncrement(), TimeUnit.NANOSECONDS);
queue.offer(t);
return t;
})
.debounce(time, unit, scheduler)
.map(v -> {
while (!queue.isEmpty()) {
Timed<T> t = queue.peek();
if (t.time() < v.time()) {
queue.poll();
dropped.accept(t.value());
} else
if (t == v) {
queue.poll();
break;
}
}
return v.value();
})
.doOnComplete(() -> {
while (!queue.isEmpty()) {
dropped.accept(queue.poll().value());
}
});
});
}
}
prints
Passed: 1 # T=10
Dropped: 2 # T=43
Dropped: 3 # T=43
Dropped: 4 # T=43
Passed: 5 # T=43
Passed: 6 # T=73
Done # T=93

Where does internal exception in Kafka Streams come from?

I am encountering a problem when trying to aggregate a KGroupedStream< String, TsdbObject >
where TsdbObject is a POJO that has a method Double getValue(). The following statements
show the groupBy and attempted aggregation:
KGroupedStream< String, TsdbObject > assets_grouped_by_parents =
kstream.groupBy( group_by_parent_mapper, Serialized.with( Serdes.String(), tsdb_object_serde ) );
KTable< String, Double > sums_of_groups_by_parents =
assets_grouped_by_parents.aggregate( new SummerInitializer(), new SummerAggregator() );
The aggregation is done by the following classes:
private class SummerAggregator implements Aggregator< String, TsdbObject, Double > {
@Override
public Double apply(String key, TsdbObject value, Double aggregate) {
System.out.println( "SummerAggregator.apply: key is " + key + ", value is " + value +
", aggregate is " + aggregate );
return aggregate + value.getValue();
}
}
private class SummerInitializer implements Initializer< Double > {
@Override
public Double apply() {
// TODO Auto-generated method stub
System.out.println( "SummerInitializer" );
return 0.0;
}
}
When I execute the application, I get the following exception:
Encountered the following error during processing:
java.lang.ClassCastException: [B cannot be cast to java.lang.Double
at com.ui.kafka.experiments.metrics.TsdbObjectRollUp$SummerAggregator.apply(TsdbObjectRollUp.java:1)
at org.apache.kafka.streams.kstream.internals.KStreamAggregate$KStreamAggregateProcessor.process(KStreamAggregate.java:79)
The referenced line in KStreamAggregate is:
// try to add the new new value
if (value != null) {
newAgg = aggregator.apply(key, value, newAgg);
}
The strange thing is that the value of newAgg, which is supposed to be a Double, is:
[0, 0, 0, 0, 0, 0, 0, 21]
which certainly isn't castable to a Double. Where did this weird value come from?
That byte array is the aggregation result read back from the state store with the default serde instead of a Double serde. You need to pass in a DoubleSerde for the result value type of assets_grouped_by_parents.aggregate(...) using the optional parameter Materialized.withValueSerde():
KTable<String, Double> sums_of_groups_by_parents =
assets_grouped_by_parents.aggregate(
new SummerInitializer(),
new SummerAggregator(),
Materialized.withValueSerde(Serdes.Double()));
You might also need to specify the StringSerde for the key, if it's not set as default serde in the config.

How to throw error in map for converting string to int array for number format exception

I have a string
var str = "1 2 3 4"
and I want to convert it into [Int]. It can be done as follow
let intArray = str.characters.split {$0 == " "}.map(String.init).map { Int($0)!}
Now what if my string is
var invalid = " 1 a 4"
Then, the program will crash with
fatal error: unexpectedly found nil while unwrapping an Optional value
I need to be able to check the number and throw a number format error in map.
You can use throws, try/throw, do-try-catch, and guard (or if) for that. Here is the code:
var invalid = " 1 a 4"
let intArray: [Int]
do {
intArray = try getIntArray(invalid, delimiter: " ")
}catch let error as NSError {
print(error)
intArray = []
}
func getIntArray(input:String, delimiter:Character ) throws -> [Int] {
let strArray = input.characters.split {$0 == delimiter}.map(String.init)
let intArray = try strArray.map {
(int:String)->Int in
guard Int(int) != nil else {
throw NSError.init(domain: " \(int) is not digit", code: -99, userInfo: nil)
}
return Int(int)!
}
return intArray
}
In the getIntArray function, we first convert the input string to a string array.
Then, when converting the string array to an int array, we expand the map closure parameter to include number format checking, throwing the error using "guard".
"guard" can be replaced with "if", too, if it is not available:
if Int(int) == nil {
throw NSError.init(domain: " \(int) is not digit", code: -99, userInfo: nil)
}
Rather than throwing NSError types, you can create your own Swift native enum conforming to ErrorType where your enumeration contains the error case you would like to explicitly handle. E.g.
enum MyErrors : ErrorType {
case NumberFormatError(String)
}
/* throwing function attempting to initialize an
integer given a string (if failure: throw error) */
func strToInt(str: String) throws -> Int {
guard let myInt = Int(str) else { throw MyErrors.NumberFormatError(str) }
return myInt
}
Example usage within a do-try-catch construct:
func strAsIntArr(str: String) -> [Int]? {
var intArr: [Int] = []
do {
intArr = try str.characters.split {$0 == " "}
.map(String.init)
.map { try strToInt($0) }
} catch MyErrors.NumberFormatError(let faultyString) {
print("Format error: '\(faultyString)' is not number convertible.")
// naturally you could rethrow here to propagate the error
} catch {
print("Unknown error.")
}
return intArr
}
/* successful example */
let myStringA = "1 2 3 4"
let intArrA = strAsIntArr(myStringA)
/*[1, 2, 3, 4] */
/* error throwing example */
let myStringB = "1 b 3 4"
let intArrB = strAsIntArr(myStringB)
/* [], Format error: 'b' is not number convertible. */

SPARK SQL - update MySql table using DataFrames and JDBC

I'm trying to insert and update some data on MySql using Spark SQL DataFrames and JDBC connection.
I've succeeded to insert new data using the SaveMode.Append. Is there a way to update the data already existing in MySql Table from Spark SQL?
My code to insert is:
myDataFrame.write.mode(SaveMode.Append).jdbc(JDBCurl,mySqlTable,connectionProperties)
If I change to SaveMode.Overwrite it deletes the full table and creates a new one, I'm looking for something like the "ON DUPLICATE KEY UPDATE" available in MySql
It is not possible. As of now (Spark 1.6.0 / 2.2.0 SNAPSHOT), Spark DataFrameWriter supports only four writing modes:
SaveMode.Overwrite: overwrite the existing data.
SaveMode.Append: append the data.
SaveMode.Ignore: ignore the operation (i.e. no-op).
SaveMode.ErrorIfExists: the default option, throws an exception at runtime.
You can insert manually, for example using mapPartitions (since an UPSERT operation should be idempotent and as such easy to implement), write to a temporary table and execute the upsert manually, or use triggers.
In general, achieving upsert behavior for batch operations while keeping decent performance is far from trivial. You have to remember that in the general case there will be multiple concurrent transactions in place (one per partition), so you have to ensure that there will be no write conflicts (typically by using application-specific partitioning) or provide appropriate recovery procedures. In practice it may be better to perform batch writes to a temporary table and resolve the upsert part directly in the database.
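For illustration, a rough sketch of the temporary-table variant (myDataFrame, JDBCurl and connectionProperties come from the question; the table and column names mytable, mytable_staging, id and value are made up, and the MySQL JDBC driver is assumed to be on the classpath):
import java.sql.DriverManager
import org.apache.spark.sql.SaveMode
// 1. Bulk-load the new rows into a staging table using the plain JDBC writer.
myDataFrame.write
  .mode(SaveMode.Overwrite)
  .jdbc(JDBCurl, "mytable_staging", connectionProperties)
// 2. Resolve the upsert inside the database with a single statement.
val conn = DriverManager.getConnection(JDBCurl, connectionProperties)
try {
  conn.createStatement().executeUpdate(
    """INSERT INTO mytable (id, value)
      |SELECT id, value FROM mytable_staging
      |ON DUPLICATE KEY UPDATE value = VALUES(value)""".stripMargin)
} finally {
  conn.close()
}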
It's a pity that there is no SaveMode.Upsert mode in Spark for such a common case as upserting.
zero323 is right in general, but I think it should be possible (with compromises in performance) to offer such a replace feature.
I also wanted to provide some java code for this case.
Of course it is not as performant as the built-in one from Spark, but it should be a good basis for your requirements. Just modify it to your needs:
myDF.repartition(20); //one connection per partition, see below
myDF.foreachPartition((Iterator<Row> t) -> {
Connection conn = DriverManager.getConnection(
Constants.DB_JDBC_CONN,
Constants.DB_JDBC_USER,
Constants.DB_JDBC_PASS);
conn.setAutoCommit(true);
Statement statement = conn.createStatement();
final int batchSize = 100000;
int i = 0;
while (t.hasNext()) {
Row row = t.next();
try {
// better than REPLACE INTO, less cycles
statement.addBatch(("INSERT INTO mytable " + "VALUES ("
+ "'" + row.getAs("_id") + "',
+ "'" + row.getStruct(1).get(0) + "'
+ "') ON DUPLICATE KEY UPDATE _id='" + row.getAs("_id") + "';"));
//conn.commit();
if (++i % batchSize == 0) {
statement.executeBatch();
}
} catch (SQLIntegrityConstraintViolationException e) {
//should not occur, nevertheless
//conn.commit();
} catch (SQLException e) {
e.printStackTrace();
} finally {
//conn.commit();
statement.executeBatch();
}
}
int[] ret = statement.executeBatch();
System.out.println("Ret val: " + Arrays.toString(ret));
System.out.println("Update count: " + statement.getUpdateCount());
//conn.commit();
statement.close();
conn.close();
});
Overwrite org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala, changing its INSERT INTO to REPLACE INTO:
import java.sql.{Connection, Driver, DriverManager, PreparedStatement, ResultSet, SQLException}
import scala.collection.JavaConverters._
import scala.util.control.NonFatal
import com.typesafe.scalalogging.Logger
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.jdbc.{DriverRegistry, DriverWrapper, JDBCOptions}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Row}
/**
* Util functions for JDBC tables.
*/
object UpdateJdbcUtils {
val logger = Logger(this.getClass)
/**
* Returns a factory for creating connections to the given JDBC URL.
*
* @param options - JDBC options that contains url, table and other information.
*/
def createConnectionFactory(options: JDBCOptions): () => Connection = {
val driverClass: String = options.driverClass
() => {
DriverRegistry.register(driverClass)
val driver: Driver = DriverManager.getDrivers.asScala.collectFirst {
case d: DriverWrapper if d.wrapped.getClass.getCanonicalName == driverClass => d
case d if d.getClass.getCanonicalName == driverClass => d
}.getOrElse {
throw new IllegalStateException(
s"Did not find registered driver with class $driverClass")
}
driver.connect(options.url, options.asConnectionProperties)
}
}
/**
* Returns a PreparedStatement that inserts a row into table via conn.
*/
def insertStatement(conn: Connection, table: String, rddSchema: StructType, dialect: JdbcDialect)
: PreparedStatement = {
val columns = rddSchema.fields.map(x => dialect.quoteIdentifier(x.name)).mkString(",")
val placeholders = rddSchema.fields.map(_ => "?").mkString(",")
val sql = s"REPLACE INTO $table ($columns) VALUES ($placeholders)"
conn.prepareStatement(sql)
}
/**
* Retrieve standard jdbc types.
*
* @param dt The datatype (e.g. [[org.apache.spark.sql.types.StringType]])
* @return The default JdbcType for this DataType
*/
def getCommonJDBCType(dt: DataType): Option[JdbcType] = {
dt match {
case IntegerType => Option(JdbcType("INTEGER", java.sql.Types.INTEGER))
case LongType => Option(JdbcType("BIGINT", java.sql.Types.BIGINT))
case DoubleType => Option(JdbcType("DOUBLE PRECISION", java.sql.Types.DOUBLE))
case FloatType => Option(JdbcType("REAL", java.sql.Types.FLOAT))
case ShortType => Option(JdbcType("INTEGER", java.sql.Types.SMALLINT))
case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT))
case BooleanType => Option(JdbcType("BIT(1)", java.sql.Types.BIT))
case StringType => Option(JdbcType("TEXT", java.sql.Types.CLOB))
case BinaryType => Option(JdbcType("BLOB", java.sql.Types.BLOB))
case TimestampType => Option(JdbcType("TIMESTAMP", java.sql.Types.TIMESTAMP))
case DateType => Option(JdbcType("DATE", java.sql.Types.DATE))
case t: DecimalType => Option(
JdbcType(s"DECIMAL(${t.precision},${t.scale})", java.sql.Types.DECIMAL))
case _ => None
}
}
private def getJdbcType(dt: DataType, dialect: JdbcDialect): JdbcType = {
dialect.getJDBCType(dt).orElse(getCommonJDBCType(dt)).getOrElse(
throw new IllegalArgumentException(s"Can't get JDBC type for ${dt.simpleString}"))
}
// A `JDBCValueGetter` is responsible for getting a value from `ResultSet` into a field
// for `MutableRow`. The last argument `Int` means the index for the value to be set in
// the row and also used for the value in `ResultSet`.
private type JDBCValueGetter = (ResultSet, InternalRow, Int) => Unit
// A `JDBCValueSetter` is responsible for setting a value from `Row` into a field for
// `PreparedStatement`. The last argument `Int` means the index for the value to be set
// in the SQL statement and also used for the value in `Row`.
private type JDBCValueSetter = (PreparedStatement, Row, Int) => Unit
/**
* Saves a partition of a DataFrame to the JDBC database. This is done in
* a single database transaction (unless isolation level is "NONE")
* in order to avoid repeatedly inserting data as much as possible.
*
* It is still theoretically possible for rows in a DataFrame to be
* inserted into the database more than once if a stage somehow fails after
* the commit occurs but before the stage can return successfully.
*
* This is not a closure inside saveTable() because apparently cosmetic
* implementation changes elsewhere might easily render such a closure
* non-Serializable. Instead, we explicitly close over all variables that
* are used.
*/
def savePartition(
getConnection: () => Connection,
table: String,
iterator: Iterator[Row],
rddSchema: StructType,
nullTypes: Array[Int],
batchSize: Int,
dialect: JdbcDialect,
isolationLevel: Int): Iterator[Byte] = {
val conn = getConnection()
var committed = false
var finalIsolationLevel = Connection.TRANSACTION_NONE
if (isolationLevel != Connection.TRANSACTION_NONE) {
try {
val metadata = conn.getMetaData
if (metadata.supportsTransactions()) {
// Update to at least use the default isolation, if any transaction level
// has been chosen and transactions are supported
val defaultIsolation = metadata.getDefaultTransactionIsolation
finalIsolationLevel = defaultIsolation
if (metadata.supportsTransactionIsolationLevel(isolationLevel)) {
// Finally update to actually requested level if possible
finalIsolationLevel = isolationLevel
} else {
logger.warn(s"Requested isolation level $isolationLevel is not supported; " +
s"falling back to default isolation level $defaultIsolation")
}
} else {
logger.warn(s"Requested isolation level $isolationLevel, but transactions are unsupported")
}
} catch {
case NonFatal(e) => logger.warn("Exception while detecting transaction support", e)
}
}
val supportsTransactions = finalIsolationLevel != Connection.TRANSACTION_NONE
try {
if (supportsTransactions) {
conn.setAutoCommit(false) // Everything in the same db transaction.
conn.setTransactionIsolation(finalIsolationLevel)
}
val stmt = insertStatement(conn, table, rddSchema, dialect)
val setters: Array[JDBCValueSetter] = rddSchema.fields.map(_.dataType)
.map(makeSetter(conn, dialect, _))
val numFields = rddSchema.fields.length
try {
var rowCount = 0
while (iterator.hasNext) {
val row = iterator.next()
var i = 0
while (i < numFields) {
if (row.isNullAt(i)) {
stmt.setNull(i + 1, nullTypes(i))
} else {
setters(i).apply(stmt, row, i)
}
i = i + 1
}
stmt.addBatch()
rowCount += 1
if (rowCount % batchSize == 0) {
stmt.executeBatch()
rowCount = 0
}
}
if (rowCount > 0) {
stmt.executeBatch()
}
} finally {
stmt.close()
}
if (supportsTransactions) {
conn.commit()
}
committed = true
Iterator.empty
} catch {
case e: SQLException =>
val cause = e.getNextException
if (cause != null && e.getCause != cause) {
if (e.getCause == null) {
e.initCause(cause)
} else {
e.addSuppressed(cause)
}
}
throw e
} finally {
if (!committed) {
// The stage must fail. We got here through an exception path, so
// let the exception through unless rollback() or close() want to
// tell the user about another problem.
if (supportsTransactions) {
conn.rollback()
}
conn.close()
} else {
// The stage must succeed. We cannot propagate any exception close() might throw.
try {
conn.close()
} catch {
case e: Exception => logger.warn("Transaction succeeded, but closing failed", e)
}
}
}
}
/**
* Saves the RDD to the database in a single transaction.
*/
def saveTable(
df: DataFrame,
url: String,
table: String,
options: JDBCOptions) {
val dialect = JdbcDialects.get(url)
val nullTypes: Array[Int] = df.schema.fields.map { field =>
getJdbcType(field.dataType, dialect).jdbcNullType
}
val rddSchema = df.schema
val getConnection: () => Connection = createConnectionFactory(options)
val batchSize = options.batchSize
val isolationLevel = options.isolationLevel
df.foreachPartition(iterator => savePartition(
getConnection, table, iterator, rddSchema, nullTypes, batchSize, dialect, isolationLevel)
)
}
private def makeSetter(
conn: Connection,
dialect: JdbcDialect,
dataType: DataType): JDBCValueSetter = dataType match {
case IntegerType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setInt(pos + 1, row.getInt(pos))
case LongType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setLong(pos + 1, row.getLong(pos))
case DoubleType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setDouble(pos + 1, row.getDouble(pos))
case FloatType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setFloat(pos + 1, row.getFloat(pos))
case ShortType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setInt(pos + 1, row.getShort(pos))
case ByteType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setInt(pos + 1, row.getByte(pos))
case BooleanType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setBoolean(pos + 1, row.getBoolean(pos))
case StringType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setString(pos + 1, row.getString(pos))
case BinaryType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setBytes(pos + 1, row.getAs[Array[Byte]](pos))
case TimestampType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setTimestamp(pos + 1, row.getAs[java.sql.Timestamp](pos))
case DateType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setDate(pos + 1, row.getAs[java.sql.Date](pos))
case t: DecimalType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setBigDecimal(pos + 1, row.getDecimal(pos))
case ArrayType(et, _) =>
// remove type length parameters from end of type name
val typeName = getJdbcType(et, dialect).databaseTypeDefinition
.toLowerCase.split("\\(")(0)
(stmt: PreparedStatement, row: Row, pos: Int) =>
val array = conn.createArrayOf(
typeName,
row.getSeq[AnyRef](pos).toArray)
stmt.setArray(pos + 1, array)
case _ =>
(_: PreparedStatement, _: Row, pos: Int) =>
throw new IllegalArgumentException(
s"Can't translate non-null value for field $pos")
}
}
usage:
val url = s"jdbc:mysql://$host/$database?useUnicode=true&characterEncoding=UTF-8"
val parameters: Map[String, String] = Map(
"url" -> url,
"dbtable" -> table,
"driver" -> "com.mysql.jdbc.Driver",
"numPartitions" -> numPartitions.toString,
"user" -> user,
"password" -> password
)
val options = new JDBCOptions(parameters)
for (d <- data) {
UpdateJdbcUtils.saveTable(d, url, table, options)
}
PS: pay attention to deadlocks; don't update data frequently, and use this only for re-runs in case of emergency. I think that's why Spark doesn't support this officially.
If your table is small, you can read the existing SQL data, do the upsert in a Spark DataFrame, and then overwrite the existing SQL table.
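A hedged sketch of that idea (the names spark, newData, JDBCurl, connectionProperties, the table mytable and its key column id are assumptions; it also assumes both DataFrames share the same column order and Spark 2.0+ for the left_anti join):
import org.apache.spark.sql.SaveMode
// Read the current contents of the table.
val existing = spark.read.jdbc(JDBCurl, "mytable", connectionProperties)
// Keep every new row, plus the old rows whose key is not being replaced.
val merged = newData
  .union(existing.join(newData.select("id"), Seq("id"), "left_anti"))
  .cache()
merged.count() // materialize before overwriting the table we just read from
// Overwrite drops and recreates the table, so only do this for small tables.
merged.write.mode(SaveMode.Overwrite).jdbc(JDBCurl, "mytable", connectionProperties)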
zero323's answer is right; I just wanted to add that you could use the JayDeBeApi package to work around this:
https://pypi.python.org/pypi/JayDeBeApi/
to update data in your MySQL table. It might be low-hanging fruit since you already have the MySQL JDBC driver installed.
The JayDeBeApi module allows you to connect from Python code to
databases using Java JDBC. It provides a Python DB-API v2.0 to that
database.
We use the Anaconda distribution of Python, and the JayDeBeApi Python package comes standard.
See examples in that link above.
In PySpark I was not able to do that, so I decided to use ODBC.
url = "jdbc:sqlserver://xxx:1433;databaseName=xxx;user=xxx;password=xxx"
df.write.jdbc(url=url, table="__TableInsert", mode='overwrite')
import pyodbc

cnxn = pyodbc.connect('Driver={ODBC Driver 17 for SQL Server};Server=xxx;Database=xxx;Uid=xxx;Pwd=xxx;', autocommit=False)
try:
    crsr = cnxn.cursor()
    # DO UPSERTS OR WHATEVER YOU WANT
    crsr.execute("DELETE FROM Table")
    crsr.execute("INSERT INTO Table (Field) SELECT Field FROM __TableInsert")
    cnxn.commit()
except:
    cnxn.rollback()
cnxn.close()

Saving / Loading Images in Postgres using Anorm (Scala/PlayFramework 2)

I think I'm saving the image to Postgres correctly, but I get unexpected results when trying to load the image. I don't really know if the error is in the save or the load.
Here is my Anorm code for saving the image:
def storeBadgeImage(badgeHandle: String, imgFile: File) = {
val cmd = """
|update badge
|set img={imgBytes}
|where handle = {badgeHandle}
"""
var fis = new FileInputStream(imgFile)
var imgBytes: Array[Byte] = Resource.fromInputStream(fis).byteArray
// at this point I see the image in my browser if I return the imgBytes in the HTTP response, so I'm good so far.
DB.withConnection { implicit c =>
{
try {
SQL(cmd stripMargin).on("badgeHandle" -> badgeHandle, "imgBytes" -> imgBytes).executeUpdate() match {
case 0 => "update failed for badge " + badgeHandle + ", image " + imgFile.getCanonicalPath
case _ => "Update Successful"
}
} catch {
case e: SQLException => e.toString()
}
}
}
}
...I get "update succesful", so I presume the save is working (I could be wrong). Here is my code for loading the image:
def fetchBadgeImage(badgeHandle: String) = {
val cmd = """
|select img from badge
|where handle = {badgeHandle}
"""
DB.withConnection { implicit c =>
SQL(cmd stripMargin).on("badgeHandle" -> badgeHandle)().map {
case Row(image: Array[Byte]) => {
"image = " + image
}
case Row(Some(unknown: Any)) => {
println(unknown + " unknown type is " + unknown.getClass.getName) //[B#11be1c6 unknown type is [B
"unknown"
}
}
}
}
...rather than going into the case "Row(image: Array[Byte])" as hoped, it goes into the "Row(Some(unknown: Any))" case. My println outputs "[B#11be1c6 unknown type is [B"
I don't know what type [B is or where I may have gone wrong...
Regarding "I don't know what type [B is": it's an array of bytes in Java (byte[]).
You can also write match { case Row(Some(image: Array[Byte])) => } in this case, and that might be better.
Or you might be able to do it as follows.
val results: Stream[Array[Byte]] = SQL(cmd stripMargin)
.on("badgeHandle" -> "name")().map { row => row[Array[Byte]]("img") }
...Oops, I got the following compile error:
<console>:43: error: could not find implicit value for parameter c: anorm.Column[Array[Byte]]
val res: Stream[Array[Byte]] = SQL(cmd stripMargin).on("badgeHandle" -> "name")().map { row => row[Array[Byte]]("img") }
Unfortunately, scala.Array is not supported by default. If you imitate the way other column types are handled, it works:
implicit def rowToByteArray: Column[Array[Byte]] = {
Column.nonNull[Array[Byte]] { (value, meta) =>
val MetaDataItem(qualified, nullable, clazz) = meta
value match {
case bytes: Array[Byte] => Right(bytes)
case _ => Left(TypeDoesNotMatch("..."))
}
}
}
val results: Stream[Array[Byte]] = SQL(cmd stripMargin)
.on("badgeHandle" -> "name")().map { row => row[Array[Byte]]("img") }
https://github.com/playframework/Play20/blob/master/framework/src/anorm/src/main/scala/anorm/Anorm.scala
