How to map on failure of ValidationNel?

I've learned that Validation has methods :-> and <-: to map on the success and failure.
scala> val failure: Validation[String, Unit] = "xxx".failure
failure: scalaz.Validation[String,Unit] = Failure(xxx)
scala> failure.<-:("!!!" + _)
res17: scalaz.Validation[String,Unit] = Failure(!!!xxx)
Unfortunately <-: does not exist for ValidationNel:
scala> val failure: ValidationNel[String, Unit] = "xxx".failureNel
failure: scalaz.ValidationNel[String,Unit] = Failure(NonEmptyList(xxx))
scala> failure.<-:(nel => nel + "!!!")
<console>:15: error: value <-: is not a member of scalaz.ValidationNel[String,Unit]
failure.<-:(nel => nel + "!!!")
Interestingly, it does compile if I define my failure as Validation[NonEmptyList[String], Unit].
Now I wonder if there is another way to use <-: for ValidationNel.

Right, you won't be able to conjure a Bifunctor instance for ValidationNel, but ValidationNel is really just a type alias for Validation[NonEmptyList[?], ?], and that does, as you know, have a Bifunctor instance. If you coerce the type of your value from ValidationNel to just Validation, things will start to work:
scala> import scalaz._, Scalaz._
import scalaz._
import Scalaz._
scala> val failure: Validation[String, Unit] = "xxx".failure
failure: scalaz.Validation[String,Unit] = Failure(xxx)
scala> failure.<-:("!!!" + _)
res0: scalaz.Validation[String,Unit] = Failure(!!!xxx)
scala> val failure: ValidationNel[String, Unit] = "xxx".failureNel
failure: scalaz.ValidationNel[String,Unit] = Failure(NonEmptyList(xxx))
scala> failure.<-:("!!!" <:: _)
<console>:15: error: value <-: is not a member of scalaz.ValidationNel[String,Unit]
failure.<-:("!!!" <:: _)
scala> val failure2: Validation[NonEmptyList[String], Unit] = failure
failure2: scalaz.Validation[scalaz.NonEmptyList[String],Unit] = Failure(NonEmptyList(xxx))
scala> failure2.<-:("!!!" <:: _)
res2: scalaz.Validation[scalaz.NonEmptyList[String],Unit] = Failure(NonEmptyList(!!!, xxx))
However, you can also just call the leftMap method on Validation:
scala> failure2.leftMap("!!!" <:: _)
res3: scalaz.Validation[scalaz.NonEmptyList[String],Unit] = Failure(NonEmptyList(!!!, xxx))
scala> failure.leftMap("!!!" <:: _)
res4: scalaz.Validation[scalaz.NonEmptyList[String],Unit] = Failure(NonEmptyList(!!!, xxx))
Or use "swapped" to swap failure and success, rightMap the swapped value, then swap them back:
scala> failure.swapped(_ :-> ("!!!" <:: _))
res5: scalaz.Validation[scalaz.NonEmptyList[String],Unit] = Failure(NonEmptyList(!!!, xxx))


Can't find "window" function in Spark Structured Streaming

I'm coding a small example in Spark Structured Streaming where I'm trying to process the output of the netstat command, and I can't figure out how to invoke the window function.
These are the relevant lines of my build.sbt:
scalaVersion := "2.11.4"
scalacOptions += "-target:jvm-1.8"
libraryDependencies ++= {
  val sparkVer = "2.3.0"
  Seq(
    "org.apache.spark" %% "spark-streaming" % sparkVer % "provided",
    "org.apache.spark" %% "spark-streaming-kafka-0-8" % sparkVer % "provided",
    "org.apache.spark" %% "spark-core" % sparkVer % "provided" withSources(),
    "org.apache.spark" %% "spark-hive" % sparkVer % "provided"
  )
}
And the code:
case class NetEntry(val timeStamp: java.sql.Timestamp, val sourceHost: String, val targetHost: String, val status: String)
def convertToNetEntry(x: String): NetEntry = {
// tcp 0 0 eselivpi14:icl-twobase1 TIME_WAIT
val array = x.replaceAll("\\s+"," ").split(" ").slice(3,6)
NetEntry(java.sql.Timestamp.valueOf(, array(0),array(1),array(2))
def main(args: Array[String]) {
// Initialize spark context
val spark: SparkSession = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
val lines = spark.readStream
.option("host", args(0))
.option("port", args(1).toInt)
import spark.implicits._
val df = lines.as[String].map(x => convertToNetEntry(x))
val wordsArr: Dataset[NetEntry] = df.as[NetEntry]
// Never get past this point
val windowColumn = window($"timestamp", "10 minutes", "5 minutes")
val windowedCounts = wordsArr.groupBy( windowColumn, $"targetHost").count()
val query = windowedCounts.writeStream.outputMode("complete").format("console").start()
I have tried this with Spark 2.1, 2.2 and 2.3 with the same results. What is really bizarre is that when I log in to the Spark shell on my Spark cluster and paste in all the lines, it works! Any idea what I am doing wrong?
The error at compilation time:
[error] C:\code_legacy\edos-dp-mediation-spark-consumer\src\main\scala\com\ericsson\streaming\structured\StructuredStreamingMain.scala:39: not found: value window
[error] val windowColumn = window($"timestamp", "10 minutes", "5 minutes")
[error] ^
[warn] 5 warnings found
[error] one error found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 19 s, completed 16-mar-2018 20:13:40
Update: To make things weirder, I have checked the API docs and could not find a valid reference there either.
You need to import the window function for this to compile; spark-shell has already imported it for you, which is why the same lines work there.
Add this import statement:
import org.apache.spark.sql.functions.window

Spark LinearRegressionSummary "normal" summary

According to LinearRegressionSummary (Spark 2.1.0 JavaDoc), p-values are only available for the "normal" solver.
This value is only available when using the "normal" solver.
What the hell is the "normal" solver?
I'm doing this:
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, SparkSession}
val (trainingData, testData): (DataFrame, DataFrame) =
com.acme.pta.accuracy.Util.splitData(output, testProportion)
val lr =
val pipeline = new Pipeline()
val paramGrid = new ParamGridBuilder()
.addGrid(lr.elasticNetParam, Array(0.2, 0.4, 0.8, 0.9))
.addGrid(lr.regParam, Array(0.6, 0.3, 0.1, 0.01))
val cv = new CrossValidator()
.setNumFolds(numFolds) // Use 3+ in practice
val cvModel: CrossValidatorModel =
val pipelineModel: PipelineModel = cvModel.bestModel.asInstanceOf[PipelineModel]
val lrModel: LinearRegressionModel =
val modelSummary = lrModel.summary"lrModel.summary: " + modelSummary)
try {"feature p values: ")
// Exception occurs on line below.
val featuresAndPValues =
(featureAndPValue: (String, Double)) =>
"feature: " + featureAndPValue._1 + ": " + featureAndPValue._2))
} catch {
case _: java.lang.UnsupportedOperationException
=> Holder.log.error("Cannot compute p-values")
I am still getting the UnsupportedOperationException.
The exception message is:
No p-value available for this LinearRegressionModel
Is there something else I need to be doing? I'm using
"org.apache.spark" %% "spark-mllib" % "2.1.1"
Is pValues supported in that version?
Solution 1
In the regular LinearRegression, pValues and the other "normal" statistics are only present when at least one of the parameters elasticNetParam or regParam is zero. So you can change the grid to, for example:
.addGrid( lr.elasticNetParam, Array( 0.0 ) )
.addGrid( lr.regParam, Array( 0.0 ) )
Solution 2
Make a custom version of LinearRegression which explicitly uses:
the "normal" solver for regression,
the Cholesky solver for WeightedLeastSquares.
I made this class as an extension to the ml.regression package.
import scala.collection.mutable
import org.apache.spark.SparkException
import org.apache.spark.internal.Logging
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
import org.apache.spark.mllib.linalg.VectorImplicits._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.functions._
class CholeskyLinearRegression ( override val uid: String )
extends Regressor[ Vector, CholeskyLinearRegression, LinearRegressionModel ]
with LinearRegressionParams with DefaultParamsWritable with Logging {
import CholeskyLinearRegression._
def this() = this(Identifiable.randomUID("linReg"))
def setRegParam(value: Double): this.type = set(regParam, value)
setDefault(regParam -> 0.0)
def setFitIntercept(value: Boolean): this.type = set(fitIntercept, value)
setDefault(fitIntercept -> true)
def setStandardization(value: Boolean): this.type = set(standardization, value)
setDefault(standardization -> true)
def setElasticNetParam(value: Double): this.type = set(elasticNetParam, value)
setDefault(elasticNetParam -> 0.0)
def setMaxIter(value: Int): this.type = set(maxIter, value)
setDefault(maxIter -> 100)
def setTol(value: Double): this.type = set(tol, value)
setDefault(tol -> 1E-6)
def setWeightCol(value: String): this.type = set(weightCol, value)
def setSolver(value: String): this.type = set(solver, value)
setDefault(solver -> Auto)
def setAggregationDepth(value: Int): this.type = set(aggregationDepth, value)
setDefault(aggregationDepth -> 2)
override protected def train(dataset: Dataset[_]): LinearRegressionModel = {
// Extract the number of features before deciding optimization solver.
val numFeatures = dataset.select(col($(featuresCol))).first().getAs[Vector](0).size
val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
val instances: RDD[Instance] = dataset
.select( col( $(labelCol) ), w, col( $(featuresCol) ) ).rdd.map {
case Row(label: Double, weight: Double, features: Vector) =>
Instance(label, weight, features)
// if (($(solver) == Auto &&
// numFeatures <= WeightedLeastSquares.MAX_NUM_FEATURES) || $(solver) == Normal) {
// For low dimensional data, WeightedLeastSquares is more efficient since the
// training algorithm only requires one pass through the data. (SPARK-10668)
val optimizer = new WeightedLeastSquares(
elasticNetParam = $(elasticNetParam),
solverType = WeightedLeastSquares.Cholesky,
maxIter = $(maxIter),
tol = $(tol)
val model =
val lrModel = copyValues(new LinearRegressionModel(uid, model.coefficients, model.intercept))
val (summaryModel, predictionColName) = lrModel.findSummaryModelAndPredictionCol()
val trainingSummary = new LinearRegressionTrainingSummary(
.setSummary( Some( trainingSummary ) )
override def copy(extra: ParamMap): CholeskyLinearRegression = defaultCopy(extra)
object CholeskyLinearRegression
extends DefaultParamsReadable[CholeskyLinearRegression] {
override def load(path: String): CholeskyLinearRegression = super.load(path)
/** String name for "auto". */
private[regression] val Auto = "auto"
/** String name for "normal". */
private[regression] val Normal = "normal"
/** String name for "l-bfgs". */
private[regression] val LBFGS = "l-bfgs"
/** Set of solvers that LinearRegression supports. */
private[regression] val supportedSolvers = Array(Auto, Normal, LBFGS)
All you have to do is paste it into a separate file in your project and change LinearRegression to CholeskyLinearRegression in your code.
val lr = new CholeskyLinearRegression() // new LinearRegression()
.setSolver( "normal" )
.setMaxIter( maxIter )
It works with non-zero params and gives pValues. Tested on the following parameter grid.
val paramGrid = new ParamGridBuilder()
.addGrid( lr.elasticNetParam, Array( 0.2, 0.4, 0.8, 0.9 ) )
.addGrid( lr.regParam, Array( 0.6, 0.3, 0.1, 0.01 ) )
Full investigation
I initially thought that the main issue was that the model is not fully preserved: the trained model is not kept after fitting in CrossValidator, which is understandable because of memory consumption. There is an ongoing debate on how this should be resolved. Issue in JIRA.
You can see in the commented section that I tried to extract the parameters from the best model in order to fit it again. Then I found out that the model summary is fine; it is just that for some parameters diagInvAtWA has a length of 1 and is basically a zero.
For ridge regression, or Tikhonov regularization (elasticNet = 0), with any regParam, pValues and the other "normal" statistics can be computed, but for the Lasso method and anything in between (elastic net) they cannot. The same goes for regParam = 0: pValues were computed with any elasticNet.
Why is that
LinearRegression uses the Weighted Least Squares optimizer for the "normal" solver, with solverType = WeightedLeastSquares.Auto. This optimizer has two options for solvers: QuasiNewton or Cholesky. The former is selected only when both regParam and elasticNetParam are non-zero.
val solver = if (
( solverType == WeightedLeastSquares.Auto &&
elasticNetParam != 0.0 &&
regParam != 0.0 ) ||
( solverType == WeightedLeastSquares.QuasiNewton ) ) {
new QuasiNewtonSolver(fitIntercept, maxIter, tol, effectiveL1RegFun)
} else {
new CholeskySolver
So with your parameter grid the QuasiNewtonSolver will always be used, because there is no combination of regParam and elasticNetParam in which one of them is zero.
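The selection rule quoted above can be modelled in a few lines of plain Scala to check which solver a given grid would end up with. This is only a sketch: SolverChoice, chooseSolver and the string labels are hypothetical names made up for illustration, nothing here calls Spark, and the condition simply mirrors the quoted snippet for the case solverType = Auto.

```scala
// Hypothetical stand-alone model of the Auto solver choice quoted above.
// Nothing here is Spark API; the names are illustrative only.
object SolverChoice {
  // Mirrors the quoted condition for solverType = Auto.
  def chooseSolver(elasticNetParam: Double, regParam: Double): String =
    if (elasticNetParam != 0.0 && regParam != 0.0) "QuasiNewton" else "Cholesky"

  // Every (elasticNetParam, regParam) pair of a grid, with the solver it would get.
  def solversFor(elasticNet: Seq[Double], reg: Seq[Double]): Seq[((Double, Double), String)] =
    for (e <- elasticNet; r <- reg) yield ((e, r), chooseSolver(e, r))

  // A grid with only non-zero values: every pair maps to QuasiNewton.
  val nonZeroGrid = solversFor(Seq(0.2, 0.4, 0.8, 0.9), Seq(0.6, 0.3, 0.1, 0.01))

  // elasticNetParam fixed at 0.0: every pair maps to Cholesky.
  val ridgeGrid = solversFor(Seq(0.0), Seq(0.6, 0.3, 0.1, 0.01))
}
```

Under this model, a single zero in either parameter is enough to end up with the Cholesky solver, which is the case where the p-values can be computed.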
We know that in order to get pValues and the other "normal" statistics, such as the t-statistics or the standard errors of the coefficients, the diagonal of the matrix (A^T * W * A)^-1 (diagInvAtWA) must not be a vector containing only a single zero. This condition is checked in the definition of pValues.
diagInvAtWA is the vector of diagonal elements of the packed upper triangular matrix (solution.aaInv).
val diagInvAtWA = solution.aaInv.map { inv => ...
It is calculated by the Cholesky solver but not by the QuasiNewton solver. This matrix is the second parameter of NormalEquationSolution.
You technically could make your own version of LinearRegression with
In this example I used data sample_linear_regression_data.txt from here.
Full code of reproduction
import org.apache.spark._
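import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.{RegressionEvaluator, BinaryClassificationEvaluator}
import org.apache.spark.ml.regression.{LinearRegressionModel, LinearRegression}
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}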
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, SparkSession}
object Main {
def main( args: Array[ String ] ): Unit = {
val spark =
.appName( "SO" )
.master( "local[*]" )
.config( "", "localhost" )
import spark.implicits._
val data =
.format( "libsvm" )
.load( "./sample_linear_regression_data.txt" )
val Array( training, test ) =
.randomSplit( Array( 0.9, 0.1 ), seed = 12345 )
val maxIter = 10;
val lr = new LinearRegression()
.setSolver( "normal" )
.setMaxIter( maxIter )
val paramGrid = new ParamGridBuilder()
// .addGrid( lr.elasticNetParam, Array( 0.2, 0.4, 0.8, 0.9 ) )
.addGrid( lr.elasticNetParam, Array( 0.0 ) )
.addGrid( lr.regParam, Array( 0.6, 0.3, 0.1, 0.01 ) )
val pipeline = new Pipeline()
.setStages( Array( lr ) )
val cv = new CrossValidator()
.setEstimator( pipeline )
.setEvaluator( new RegressionEvaluator )
.setEstimatorParamMaps( paramGrid )
.setNumFolds( 2 ) // Use 3+ in practice
val cvModel =
.fit( training )
val pipelineModel: PipelineModel =
.asInstanceOf[ PipelineModel ]
val lrModel: LinearRegressionModel =
.stages( 0 )
.asInstanceOf[ LinearRegressionModel ]
// Technically there is a way to use exact ParamMap
// to build a new LR but for the simplicity I'll
// get and set them explicitly
// lrModel.params.foreach( ( param ) => {
// println( param )
// } )
// val bestLr = new LinearRegression()
// .setSolver( "normal" )
// .setMaxIter( maxIter )
// .setRegParam( lrModel.getRegParam )
// .setElasticNetParam( lrModel.getElasticNetParam )
// val bestLrModel = training )
val modelSummary =
println( "lrModel pValues: " + modelSummary.pValues.mkString( ", " ) )
There are three solver algorithms available:
l-bfgs - Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm which is a limited-memory quasi-Newton optimization method.
normal - using Normal Equation as an analytical solution to the linear regression problem. It is basically a weighted least squares approach or reweighted least squares approach.
auto - solver algorithm is selected automatically. The Normal Equations solver will be used when possible, but this will automatically fall back to iterative optimization methods when needed
The coefficientStandardErrors, tValues and pValues are only available when using the "normal" solver because they are all based on diagInvAtWA - a diagonal of matrix (A^T * W * A)^-1.

How to compose function to applicatives with scalaz

While learning Scalaz 6, I'm trying to write type-safe readers returning validations. Here are my new types:
type ValidReader[S,X] = (S) => Validation[NonEmptyList[String],X]
type MapReader[X] = ValidReader[Map[String,String],X]
and I have two functions creating map-readers for ints and strings (*):
def readInt( k: String ): MapReader[Int] = ...
def readString( k: String ): MapReader[String] = ...
Given the following map:
val data = Map( "name" -> "Paul", "age" -> "8" )
I can write two readers to retrieve the name and age:
val name = readString( "name" )
val age = readInt( "age" )
println( name(data) ) //=> Success("Paul")
println( age(data) ) //=> Success(8)
Everything works fine, but now I want to compose both readers to build a Boy instance:
case class Boy( name: String, age: Int )
My best take is:
val boy = ( name |@| age ) {
  (n,a) => ( n |@| a ) { Boy(_,_) }
}
println( boy(data) ) //=> Success(Boy(Paul,8))
It works as expected, but the expression is awkward, with two levels of applicative builders. Is there a way to get the following syntax to work?
val boy = ( name |@| age ) { Boy(_,_) }
(*) Full and runnable implementation in:
Update: Here is the compiler error message that I get when trying the line above or Daniel's suggestion:
[error] ***/MapReader.scala:114: type mismatch;
[error] found : scalaz.Validation[scalaz.NonEmptyList[String],String]
[error] required: String
[error] val boy = ( name |@| age ) { Boy(_,_) }
[error] ^
How about this?
val boy = (name |@| age) {
  (Boy.apply _).lift[({type V[X]=ValidationNEL[String,X]})#V]
}
or using a type alias:
type VNELStr[X] = ValidationNEL[String,X]
val boy = (name |@| age) apply (Boy(_, _)).lift[VNELStr]
This is based on the following error message at the console:
scala> name |@| age apply Boy.apply
<console>:22: error: type mismatch;
found : (String, Int) => MapReader.Boy
required: (scalaz.Validation[scalaz.NonEmptyList[String],String],
scalaz.Validation[scalaz.NonEmptyList[String],Int]) => ?
So I just lifted Boy.apply to take the required type.
Note that since Reader and Validation (with a semigroup E) are both Applicative, their composition is also Applicative. Using scalaz 7 this can be expressed as:
import scalaz.Reader
import scalaz.Reader.{apply => toReader}
import scalaz.{Validation, ValidationNEL, Applicative, Kleisli, NonEmptyList}
//type IntReader[A] = Reader[Int, A] // has some ambiguous implicit resolution problem
type IntReader[A] = Kleisli[scalaz.IdInstances#Id, Int, A]
type ValNEL[A] = ValidationNEL[Throwable, A]
val app = Applicative[IntReader].compose[ValNEL]
Now we can use a single |@| operation on the composed Applicative:
val f1 = toReader((x: Int) => Validation.success[NonEmptyList[Throwable], String](x.toString))
val f2 = toReader((x: Int) => Validation.success[NonEmptyList[Throwable], String]((x+1).toString))
val f3 = app.map2(f1, f2)(_ + ":" + _) should be_==(Validation.success("5:6"))
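To see why the composition removes the second level of builders, here is a minimal sketch that uses only the standard library. It is not scalaz: Either[List[String], X] stands in for ValidationNEL, map2 is a hand-written stand-in for a single applicative combination on the composed type, and stringAt and intAt are hypothetical readers invented for the example (the bodies of readString and readInt are not shown in the question).

```scala
// Sketch only: Either[List[String], X] replaces scalaz's ValidationNEL,
// and map2 plays the role of one combination step on "reader of validation".
object ReaderValidationSketch {
  type Errors = List[String]
  type ValidReader[S, X] = S => Either[Errors, X]
  type MapReader[X] = ValidReader[Map[String, String], X]

  // Run both readers on the same input and accumulate errors from both sides.
  def map2[S, A, B, C](ra: ValidReader[S, A], rb: ValidReader[S, B])(f: (A, B) => C): ValidReader[S, C] =
    s => (ra(s), rb(s)) match {
      case (Right(a), Right(b)) => Right(f(a, b))
      case (Left(e1), Left(e2)) => Left(e1 ++ e2)
      case (Left(e), _)         => Left(e)
      case (_, Left(e))         => Left(e)
    }

  // Hypothetical reader: looks up a key, failing with one message if it is absent.
  def stringAt(k: String): MapReader[String] =
    m => m.get(k).toRight(List("missing key: " + k))

  // Hypothetical reader: looks up a key and parses it as an Int.
  def intAt(k: String): MapReader[Int] =
    m => stringAt(k)(m) match {
      case Right(v) =>
        try Right(v.toInt)
        catch { case _: NumberFormatException => Left(List("not an int: " + k)) }
      case Left(e) => Left(e)
    }

  case class Boy(name: String, age: Int)

  // One combination step, with the plain constructor as the function.
  val boy: MapReader[Boy] = map2(stringAt("name"), intAt("age"))(Boy(_, _))
}
```

Because map2 handles both layers at once (passing the input to each reader, then merging the results), the function given to it can be the plain Boy constructor.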

How can I define a method that takes an Ordered[T] Array in Scala?

I'm building some basic algorithms in Scala (following Cormen's book) to refresh my mind on the subject and I'm building the insertion sort algorithm. Doing it like this, it works correctly:
class InsertionSort extends Sort {
  def sort ( items : Array[Int] ) : Unit = {
    if ( items.length < 2 ) {
      throw new IllegalArgumentException( "Array must be bigger than 1" )
    }
    1.until( items.length ).foreach( ( currentIndex ) => {
      val key = items(currentIndex)
      var loopIndex = currentIndex - 1
      while ( loopIndex > -1 && items(loopIndex) > key ) {
        items.update( loopIndex + 1, items(loopIndex) )
        loopIndex -= 1
      }
      items.update( loopIndex + 1, key )
    } )
  }
}
But this is for Int only and I would like to use generics and Ordered[A] so I could sort any type that is ordered. When I change the signature to be like this:
def sort( items : Array[Ordered[_]] ) : Unit
The following spec doesn't compile:
"sort correctly with merge sort" in {
val items = Array[RichInt](5, 2, 4, 6, 1, 3)
insertionSort.sort( items )
items.toList === Array[RichInt]( 1, 2, 3, 4, 5, 6 ).toList
And the compiler error is:
Type mismatch, expected: Array[Ordered[_]], actual Array[RichInt]
But isn't RichInt an Ordered[RichInt]? How should I define this method signature in a way that it would accept any Ordered object?
In case anyone is interested, the final source is available here.
Actually, RichInt is not an Ordered[RichInt] but an Ordered[Int]. It is true that scala.runtime.RichInt <: Ordered[_], but class Array is invariant in its type parameter T, so Array[RichInt] is not an Array[Ordered[_]]. A view bound on the element type works instead:
scala> def f[T <% Ordered[T]](arr: Array[T]) = { arr(0) < arr(1) }
f: [T](arr: Array[T])(implicit evidence$1: T => Ordered[T])Boolean
scala> f(Array(1,2,3))
res2: Boolean = true
You can do this with a context bound on the type parameter:
scala> def foo[T : Ordering](arr: Array[T]) = {
| import math.Ordering.Implicits._
| arr(0) < arr(1)
| }
foo: [T](arr: Array[T])(implicit evidence$1: Ordering[T])Boolean
Such that usage is:
scala> foo(Array(2.3, 3.4))
res1: Boolean = true
The advantage to this is that you don't need the default order of the type if you don't want it:
scala> foo(Array("z", "bc"))
res4: Boolean = false
scala> foo(Array("z", "bc"))(
res3: Boolean = true
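Applied to the insertion sort from the question, the context bound could look like the sketch below. GenericInsertionSort is a hypothetical name, and as a simplifying assumption the length check is dropped, so arrays with fewer than two elements are simply left unchanged.

```scala
// Sketch: in-place insertion sort for any element type with an Ordering in scope.
object GenericInsertionSort {
  def sort[T: Ordering](items: Array[T]): Unit = {
    val ord = implicitly[Ordering[T]]
    for (currentIndex <- 1 until items.length) {
      val key = items(currentIndex)
      var loopIndex = currentIndex - 1
      // Shift elements greater than key one slot to the right.
      while (loopIndex > -1 && ord.gt(items(loopIndex), key)) {
        items(loopIndex + 1) = items(loopIndex)
        loopIndex -= 1
      }
      items(loopIndex + 1) = key
    }
  }
}
```

Calling GenericInsertionSort.sort(Array(5, 2, 4, 6, 1, 3)) picks up the default Ordering[Int]; a different order can be supplied explicitly in a second argument list, for example an ordering built with Ordering.by.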

How to avoid overhead to pass a Map[Integer, String] where a Map[Number, String] is expected?

The problem:
I have a mutable.Map[Integer, String], I want to pass it to two methods:
def processNumbers(nums: Map[Number, String])
def processIntegers(nums: mutable.Map[Integer, String])
After getting a compile error, I ended up with this:
val ints: mutable.Map[Integer, String] = mutable.Map.empty[Integer, String]
//init of ints
val nums: Map[Number, String] = ints.toMap[Number, String]
With a little experiment, I figured out that this approach has significant overhead: the type conversion step multiplies the execution time by 10.
All in all, the type conversion is really just there to please the compiler, so how can I do it without any overhead?
For info, the code of my experiment:
package qndTests
import scala.collection.mutable
object TypeTest {
var hashNums = 0
var hashIntegers = 0
def processNumbers(nums: Map[Number, String]): Unit = {
nums.foreach(num =>{
def processNumbers2(nums: mutable.Map[Integer, String]): Unit = {
nums.foreach(num =>{
def processIntegers(nums: mutable.Map[Integer, String]): Unit = {
nums.foreach(num =>{
def test(ints: mutable.Map[Integer, String], convertType: Boolean): Unit = {
println("run test with type conversion")
println("run test without type conversion")
val start = System.nanoTime
hashNums = 0
hashIntegers = 0
val nTest = 10
for(i <- 0 to nTest) {
val nums: Map[Number, String] = ints.toMap[Number, String] //how much does that cost ?
val end= System.nanoTime
println("nums: "+hashNums)
println("ints: "+hashIntegers)
def main(args: Array[String]): Unit = {
val ints: mutable.Map[Integer, String] = mutable.Map.empty[Integer, String]
val testSize = 1000000
println("creating a map of "+testSize+" elements")
for(i <- 0 to testSize) ints.put(i, i.toBinaryString)
test(ints, false)
test(ints, true)
and its output:
creating a map of 1000000 elements
run test without type conversion
nums: -1650117013
ints: -1650117013
run test with type conversion
nums: -1650117013
ints: -1650117013
--> about 2 seconds in the first case against 25 seconds in the second one !
As you've seen, Map[A,B] is invariant in the key type A, so you'll need a conversion of some kind to assign a Map[A1,B], where A1 <: A, to a variable of type Map[A,B]. However, if you can change the definition of def processNumbers(nums: Map[Number, String]), you could try something like:
def processNumbers[T <: Number](nums: Map[T, String])
and pass the Map[Integer, String] without conversion.
Would that help solve your problem?
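A minimal sketch of that suggestion follows. The method bodies are placeholders invented for the example, BoundedKeys is a hypothetical name, and the parameter type is written as scala.collection.Map on the assumption that the method should accept the mutable map directly: an unqualified Map in a signature normally refers to the immutable one, which a mutable.Map does not extend.

```scala
import scala.collection.mutable

// Sketch: bounding the key type by Number lets the same map be passed without copying.
object BoundedKeys {
  // scala.collection.Map is a supertype of both mutable and immutable maps.
  // The body is a placeholder: it sums the keys and the value lengths.
  def processNumbers[T <: Number](nums: scala.collection.Map[T, String]): Int =
    nums.foldLeft(0) { case (acc, (k, v)) => acc + k.intValue() + v.length }

  // Placeholder body: sums the keys.
  def processIntegers(nums: mutable.Map[Integer, String]): Int =
    nums.keysIterator.map(_.intValue()).sum
}
```

With this signature T is inferred as Integer at the call site, so processNumbers(ints) and processIntegers(ints) both receive the very same map instance and no toMap copy is made.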
