Can't find "window" function in Spark Structured Streaming - spark-streaming

I'm coding a small example in Spark Structured Streaming where I'm trying to process the output of the netstat command, and I can't figure out how to invoke the window function.
These are the relevant lines of my build.sbt:
scalaVersion := "2.11.4"
scalacOptions += "-target:jvm-1.8"
libraryDependencies ++= {
  val sparkVer = "2.3.0"
  Seq(
    "org.apache.spark" %% "spark-streaming" % sparkVer % "provided",
    "org.apache.spark" %% "spark-streaming-kafka-0-8" % sparkVer % "provided",
    "org.apache.spark" %% "spark-core" % sparkVer % "provided" withSources(),
    "org.apache.spark" %% "spark-hive" % sparkVer % "provided"
  )
}
And the code:
case class NetEntry(val timeStamp: java.sql.Timestamp, val sourceHost: String, val targetHost: String, val status: String)

def convertToNetEntry(x: String): NetEntry = {
  // tcp 0 0 eselivpi14:icl-twobase1 eselivpi149.int.e:48442 TIME_WAIT
  val array = x.replaceAll("\\s+", " ").split(" ").slice(3, 6)
  NetEntry(java.sql.Timestamp.valueOf(LocalDateTime.now()), array(0), array(1), array(2))
}

def main(args: Array[String]) {
  // Initialize the Spark session
  val spark: SparkSession = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
  spark.sparkContext.setLogLevel("ERROR")

  val lines = spark.readStream
    .format("socket")
    .option("host", args(0))
    .option("port", args(1).toInt)
    .load()

  import spark.implicits._
  val df = lines.as[String].map(x => convertToNetEntry(x))
  val wordsArr: Dataset[NetEntry] = df.as[NetEntry]
  wordsArr.printSchema()

  // Never get past this point
  val windowColumn = window($"timestamp", "10 minutes", "5 minutes")
  val windowedCounts = wordsArr.groupBy(windowColumn, $"targetHost").count()

  val query = windowedCounts.writeStream.outputMode("complete").format("console").start()
  query.awaitTermination()
}
I have tried with Spark 2.1, 2.2 and 2.3, with the same results. What is really bizarre is that I have a Spark cluster, and if I log in to the Spark shell and copy in all the lines... it works! Any idea what I am doing wrong?
The error at compilation time:
[error] C:\code_legacy\edos-dp-mediation-spark-consumer\src\main\scala\com\ericsson\streaming\structured\StructuredStreamingMain.scala:39: not found: value window
[error] val windowColumn = window($"timestamp", "10 minutes", "5 minutes")
[error] ^
[warn] 5 warnings found
[error] one error found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 19 s, completed 16-mar-2018 20:13:40
Update: To make things weirder, I have checked the API docs and could not find a valid reference here either:
https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.SparkSession$implicits$

You need to import the window function for the code to compile; it is already imported for you in the spark-shell, which is why the same lines work there.
Add this import statement:
import org.apache.spark.sql.functions.window
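For reference, here is a minimal sketch showing where that import sits relative to the rest of the code (the wrapper object name is illustrative, and it assumes the spark-sql classes are on the classpath, as they must be for the rest of the question's code to compile):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window // the function the compiler could not find

// Illustrative wrapper, not part of the original question.
object WindowImportSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("WindowImportSketch").getOrCreate()
    import spark.implicits._

    // With the import in scope, the windowing expression from the question compiles:
    // 10-minute windows sliding every 5 minutes over the event-time column.
    val windowColumn = window($"timeStamp", "10 minutes", "5 minutes")
    println(windowColumn)

    spark.stop()
  }
}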

Related

How to map on failure of ValidationNel?

I've learned that Validation has the methods :-> and <-: to map over the success and the failure sides, respectively.
scala> val failure: Validation[String, Unit] = "xxx".failure
failure: scalaz.Validation[String,Unit] = Failure(xxx)
scala> failure.<-:("!!!" + _)
res17: scalaz.Validation[String,Unit] = Failure(!!!xxx)
Unfortunately <-: does not exist for ValidationNel:
scala> val failure: ValidationNel[String, Unit] = "xxx".failureNel
failure: scalaz.ValidationNel[String,Unit] = Failure(NonEmptyList(xxx))
scala> failure.<-:(nel => nel + "!!!")
<console>:15: error: value <-: is not a member of scalaz.ValidationNel[String,Unit]
failure.<-:(nel => nel + "!!!")
Interestingly, it does compile if I define my failure as Validation[NonEmptyList[String], Unit].
Now I wonder if there is another way to use <-: for ValidationNel.
Right, you won't be able to conjure a Bifunctor instance for ValidationNel, but ValidationNel is really just a type alias for Validation[NonEmptyList[?], ?], and that does, as you know, have a Bifunctor instance. If you coerce the type of your value from ValidationNel to just Validation, things will start to work:
scala> import scalaz._, Scalaz._
import scalaz._
import Scalaz._
scala> val failure: Validation[String, Unit] = "xxx".failure
failure: scalaz.Validation[String,Unit] = Failure(xxx)
scala> failure.<-:("!!!" + _)
res0: scalaz.Validation[String,Unit] = Failure(!!!xxx)
scala> val failure: ValidationNel[String, Unit] = "xxx".failureNel
failure: scalaz.ValidationNel[String,Unit] = Failure(NonEmptyList(xxx))
scala> failure.<-:("!!!" <:: _)
<console>:15: error: value <-: is not a member of scalaz.ValidationNel[String,Unit]
failure.<-:("!!!" <:: _)
^
scala> val failure2: Validation[NonEmptyList[String], Unit] = failure
failure2: scalaz.Validation[scalaz.NonEmptyList[String],Unit] = Failure(NonEmptyList(xxx))
scala> failure2.<-:("!!!" <:: _)
res2: scalaz.Validation[scalaz.NonEmptyList[String],Unit] = Failure(NonEmptyList(!!!, xxx))
However, you can also just call the leftMap method on Validation:
scala> failure2.leftMap("!!!" <:: _)
res3: scalaz.Validation[scalaz.NonEmptyList[String],Unit] = Failure(NonEmptyList(!!!, xxx))
scala> failure.leftMap("!!!" <:: _)
res4: scalaz.Validation[scalaz.NonEmptyList[String],Unit] = Failure(NonEmptyList(!!!, xxx))
or use "swapped" to swap failure and success, rightMap the swapped value, then swap them back:
scala> failure.swapped(_ :-> ("!!!" <:: _))
res5: scalaz.Validation[scalaz.NonEmptyList[String],Unit] = Failure(NonEmptyList(!!!, xxx))
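For convenience, here is the leftMap approach from the transcript collected into a single compilable snippet (the object name is made up for this sketch, and it assumes the same scalaz version as the REPL session above):
import scalaz._, Scalaz._

// Illustrative object name, not part of the original answer.
object ValidationNelLeftMapExample extends App {
  val failure: ValidationNel[String, Unit] = "xxx".failureNel

  // leftMap is defined on Validation itself, so it works on the aliased
  // ValidationNel type without any ascription.
  val annotated = failure.leftMap("!!!" <:: _)

  println(annotated) // Failure(NonEmptyList(!!!, xxx))
}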

Spark RDD map in yarn mode does not allow access to variables?

I have a brand-new install of Spark 1.2.1 on a MapR cluster, and while testing it I find that it works fine in local mode, but in YARN mode it seems unable to access variables, even when they are broadcast. To be precise, the following test code
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object JustSpark extends App {
  val conf = new org.apache.spark.SparkConf().setAppName("SimpleApplication")
  val sc = new SparkContext(conf)

  val a = List(1, 3, 4, 5, 6)
  val b = List("a", "b", "c")

  val bBC = sc.broadcast(b)
  val data = sc.parallelize(a)

  val transform = data map (t => { "hi" })
  transform.take(3) foreach (println _)

  val transformx2 = data map (t => { bBC.value.size })
  transformx2.take(3) foreach (println _)

  //val transform2 = data map (t => { b.size })
  //transform2.take(3) foreach (println _)
}
works in local mode but fails in YARN mode. More precisely, both transformations, transform2 and transformx2, fail, and all of them work with --master local[8].
I am compiling it with sbt and submitting it with the spark-submit tool:
/opt/mapr/spark/spark-1.2.1/bin/spark-submit --class JustSpark --master yarn target/scala-2.10/simulator_2.10-1.0.jar
Any idea what is going on? The failure message just reports a Java NullPointerException at the point where the variable should be accessed. Is there another way to pass variables into the RDD maps?
I'm going to take a pretty good guess: it's because you're using App. See https://issues.apache.org/jira/browse/SPARK-4170 for details. Write a main() method instead.
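For illustration, here is a minimal sketch of the test program restructured around an explicit main() method rather than extends App (the logic mirrors the question's code; only the entry point changes):
import org.apache.spark.{SparkConf, SparkContext}

object JustSpark {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SimpleApplication")
    val sc = new SparkContext(conf)

    val b = List("a", "b", "c")
    val bBC = sc.broadcast(b)
    val data = sc.parallelize(List(1, 3, 4, 5, 6))

    // Everything defined inside main is a local variable, so the closure
    // captures only what it uses (the broadcast handle), not the singleton.
    val transformx2 = data.map(_ => bBC.value.size)
    transformx2.take(3).foreach(println)

    sc.stop()
  }
}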
I presume the culprit was:
val transform2 = data map ( t => { b.size })
In particular, the access to the local variable b. You may actually see java.io.NotSerializableException in your log files.
What is supposed to happen: Spark will attempt to serialize any referenced object. In this case that means the entire JustSpark object, since one of its members is referenced.
Why does this fail? Your object is not Serializable, so Spark is unable to send it over the wire. In particular, it holds a reference to the SparkContext, which does not extend Serializable:
class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient {
So your first piece of code, which broadcasts only the variable's value, is the correct way.
This is the original broadcast example from the Spark sources, altered to use lists instead of arrays:
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object MultiBroadcastTest {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Multi-Broadcast Test")
    val sc = new SparkContext(sparkConf)

    val slices = if (args.length > 0) args(0).toInt else 2
    val num = if (args.length > 1) args(1).toInt else 1000000

    val arr1 = (1 to num).toList
    val arr2 = (1 to num).toList

    val barr1 = sc.broadcast(arr1)
    val barr2 = sc.broadcast(arr2)

    val observedSizes: RDD[(Int, Int)] = sc.parallelize(1 to 10, slices).map { _ =>
      (barr1.value.size, barr2.value.size)
    }

    observedSizes.collect().foreach(i => println(i))

    sc.stop()
  }
}
I compiled it in my environment and it works.
So what is the difference?
The problematic example uses extends App, while the original example is a plain singleton with an explicit main() method.
So I demoted the code to a doIt() function:
object JustDoSpark extends App {
  def doIt() {
    ...
  }
  doIt()
}
and guess what: it worked.
So the problem is indeed related to serialization, but in a different way: having the code in the body of the object itself seems to cause the problem.

SBT / Ivy: xmlbeans library not detected even after it has been added to the dependency list

I couldn't access xmlbeans even after it had been added as a dependency in SBT.
Here's my build.sbt
name := "xmlbeans"
version := "1.0"
scalaVersion := "2.11.5"
libraryDependencies += "stax" % "stax-api" % "1.0.1"
libraryDependencies += "org.apache.xmlbeans" % "xmlbeans" % "2.6.0"
And here's my code:
import org.apache.xmlbeans._

object Main {
  def main(args: Array[String]) = {
    println("hi")
  }
}
And the error message I got when running sbt run right after I performed sbt update:
src/main/scala/Main.scala:5: object apache is not a member of package org
[error] import org.apache.xmlbeans._
Are there some steps that I missed?
It turns out this was a simple ivy2 cache issue.
I simply deleted the ~/.ivy2/cache/org.apache.xmlbeans folder and re-ran sbt update and sbt run. This time it worked like a charm.

Graphite / Carbon / Ceres node overlap

I'm working with Graphite monitoring using Carbon and Ceres as the storage method. I have some problems with correcting bad data. It seems that (due to various problems) I've ended up with overlapping files. That is, since Carbon / Ceres stores the data as timestamp#interval.slice, I can have two or more files with overlapping time ranges.
There are two kinds of overlaps:
File A:   +------------+          orig file
File B:      +-----+              subset
File C:            +---------+    overlap
This is causing problems because the existing tools (ceres-maintenance defrag and rollup) don't cope with these overlaps; instead, they skip the directory and move on, which is obviously a problem.
I've created a script that fixes this problem, as follows:
For subsets, just delete the subset file.
For overlaps, use the filesystem 'truncate' on the orig file at the point where the next file starts. While it is possible to cut off the start of the overlapping file and rename it properly, I would suggest that this is fraught with danger.
I've found that it's possible to do this in two ways:
Walk the dirs and iterate over the files, fixing as you go, finding the subset files and removing them; or
Walk the dirs and fix all the problems in a dir before moving on. This is BY FAR the faster approach, since the dir walk is hugely time-consuming.
Code:
#!/usr/bin/env python2.6
################################################################################
import io
import os
import time
import sys
import string
import logging
import unittest
import datetime
import random
import zmq
import json
import socket
import traceback
import signal
import select
import simplejson
import cPickle as pickle
import re
import shutil
import collections

from pymongo import Connection
from optparse import OptionParser
from pprint import pprint, pformat
################################################################################

class SliceFile(object):
    def __init__(self, fname):
        self.name = fname
        basename = fname.split('/')[-1]
        fnArray = basename.split('#')
        self.timeStart = int(fnArray[0])
        self.freq = int(fnArray[1].split('.')[0])
        self.size = None
        self.numPoints = None
        self.timeEnd = None
        self.deleted = False

    def __repr__(self):
        out = "Name: %s, tstart=%s tEnd=%s, freq=%s, size=%s, npoints=%s." % (
            self.name, self.timeStart, self.timeEnd, self.freq, self.size, self.numPoints)
        return out

    def setVars(self):
        self.size = os.path.getsize(self.name)
        self.numPoints = int(self.size / 8)
        self.timeEnd = self.timeStart + (self.numPoints * self.freq)

################################################################################

class CeresOverlapFixup(object):
    def __del__(self):
        import datetime
        self.writeLog("Ending at %s" % (str(datetime.datetime.today())))
        self.LOGFILE.flush()
        self.LOGFILE.close()

    def __init__(self):
        self.verbose = False
        self.debug = False
        self.LOGFILE = open("ceresOverlapFixup.log", "a")
        self.badFilesList = set()
        self.truncated = 0
        self.subsets = 0
        self.dirsExamined = 0
        self.lastStatusTime = 0

    def getOptionParser(self):
        return OptionParser()

    def getOptions(self):
        parser = self.getOptionParser()
        parser.add_option("-d", "--debug", action="store_true", dest="debug", default=False, help="debug mode for this program, writes debug messages to logfile.")
        parser.add_option("-v", "--verbose", action="store_true", dest="verbose", default=False, help="verbose mode for this program, prints a lot to stdout.")
        parser.add_option("-b", "--basedir", action="store", type="string", dest="basedir", default=None, help="base directory location to start converting.")
        (options, args) = parser.parse_args()
        self.debug = options.debug
        self.verbose = options.verbose
        self.basedir = options.basedir
        assert self.basedir, "must provide base directory."

    # Examples:
    # ./updateOperations/1346805360#60.slice
    # ./updateOperations/1349556660#60.slice
    # ./updateOperations/1346798040#60.slice
    def getFileData(self, inFilename):
        ret = SliceFile(inFilename)
        ret.setVars()
        return ret

    def removeFile(self, inFilename):
        os.remove(inFilename)
        #self.writeLog("removing file: %s" % (inFilename))
        self.subsets += 1

    def truncateFile(self, fname, newSize):
        if self.verbose:
            self.writeLog("Truncating file, name=%s, newsize=%s" % (pformat(fname), pformat(newSize)))
        IFD = None
        try:
            IFD = os.open(fname, os.O_RDWR | os.O_CREAT)
            os.ftruncate(IFD, newSize)
            os.close(IFD)
            self.truncated += 1
        except:
            self.writeLog("Exception during truncate: %s" % (traceback.format_exc()))
            try:
                os.close(IFD)
            except:
                pass
        return

    def printStatus(self):
        now = self.getNowTime()
        if ((now - self.lastStatusTime) > 10):
            self.writeLog("Status: time=%d, Walked %s dirs, subsetFilesRemoved=%s, truncated %s files." % (now, self.dirsExamined, self.subsets, self.truncated))
            self.lastStatusTime = now

    def fixupThisDir(self, inPath, inFiles):
        # self.writeLog("Fixing files in dir: %s" % (inPath))
        if not '.ceres-node' in inFiles:
            # self.writeLog("--> Not a slice directory, skipping.")
            return
        sortedFiles = sorted(inFiles)
        sortedFiles = [x for x in sortedFiles if ((x != '.ceres-node') and (x.count('#') > 0))]
        lastFile = None
        fileObjList = []
        for thisFile in sortedFiles:
            wholeFilename = os.path.join(inPath, thisFile)
            try:
                curFile = self.getFileData(wholeFilename)
                fileObjList.append(curFile)
            except:
                self.badFilesList.add(wholeFilename)
                self.writeLog("ERROR: file %s, %s" % (wholeFilename, traceback.format_exc()))
        # name is timeStart, really.
        fileObjList = sorted(fileObjList, key=lambda thisObj: thisObj.name)
        while fileObjList:
            self.printStatus()
            changes = False
            firstFile = fileObjList[0]
            removedFiles = []
            for curFile in fileObjList[1:]:
                if (curFile.timeEnd <= firstFile.timeEnd):
                    # have subset file. elim.
                    self.removeFile(curFile.name)
                    removedFiles.append(curFile.name)
                    changes = True
                    if self.verbose:
                        self.writeLog("Subset file situation. First=%s, overlap=%s" % (firstFile, curFile))
            fileObjList = [x for x in fileObjList if x.name not in removedFiles]
            if (len(fileObjList) < 2):
                break
            secondFile = fileObjList[1]
            # LT is right. FirstFile's timeEnd is always the first open time after first is done.
            # so, first starts#100, len=2, end=102, positions used=100,101. second start#102 == OK.
            if (secondFile.timeStart < firstFile.timeEnd):
                # truncate first file.
                # file_A (last): +---------+
                # file_B (curr):       +----------+
                # solve by truncating previous file at startpoint of current file.
                newLenFile_A_seconds = int(secondFile.timeStart - firstFile.timeStart)
                newFile_A_datapoints = int(newLenFile_A_seconds / firstFile.freq)
                newFile_A_bytes = int(newFile_A_datapoints) * 8
                if (not newFile_A_bytes):
                    fileObjList = fileObjList[1:]
                    continue
                assert newFile_A_bytes, "Must have size. newLenFile_A_seconds=%s, newFile_A_datapoints=%s, newFile_A_bytes=%s." % (newLenFile_A_seconds, newFile_A_datapoints, newFile_A_bytes)
                self.truncateFile(firstFile.name, newFile_A_bytes)
                if self.verbose:
                    self.writeLog("Truncate situation. First=%s, overlap=%s" % (firstFile, secondFile))
                fileObjList = fileObjList[1:]
                changes = True
            if not changes:
                fileObjList = fileObjList[1:]

    def getNowTime(self):
        return time.time()

    def walkDirStructure(self):
        startTime = self.getNowTime()
        self.lastStatusTime = startTime
        updateStatsDict = {}
        self.okayFiles = 0
        emptyFiles = 0
        for (thisPath, theseDirs, theseFiles) in os.walk(self.basedir):
            self.printStatus()
            self.fixupThisDir(thisPath, theseFiles)
            self.dirsExamined += 1
        endTime = time.time()
        # time.sleep(11)
        self.printStatus()
        self.writeLog("now = %s, started at %s, elapsed time = %s seconds." % (endTime, startTime, endTime - startTime))
        self.writeLog("Done.")

    def writeLog(self, instring):
        print instring
        print >> self.LOGFILE, instring
        self.LOGFILE.flush()

    def main(self):
        self.getOptions()
        self.walkDirStructure()
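# Hypothetical entry point, not shown in the original post; assuming the script
# is meant to be invoked directly, something like this would run it:
if __name__ == "__main__":
    CeresOverlapFixup().main()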

Specs2 on OSX - error: object specs2 is not a member of package org

I am trying to run some Scala code on my OSX machine and keep on getting an error that says
error: object specs2 is not a member of package org
I have version 2.9.1-1 of Scala installed.
I am also using version 0.7.7 of sbt.
My build.sbt file looks like this
name := "Comp-338-Web-App"
version := "1.0"
scalaVersion := "2.9.1"
scalacOptions += "-deprecation"
libraryDependencies ++= Seq(
"junit" % "junit" % "4.7",
"org.specs2" %% "specs2" % "1.8.2" % "test",
"org.mockito" % "mockito-all" % "1.9.0",
"org.hamcrest" % "hamcrest-all" % "1.1"
)
resolvers ++= Seq("snapshots" at "http://oss.sonatype.org/content/repositories/snapshots",
"releases" at "http://oss.sonatype.org/content/repositories/releases")
I've tried a bunch of different things but can't get it to run the test correctly.
Any advice?
Let me know if you need more information about the settings on my computer.
The solution looks simple: Please use the latest release of sbt, currently 0.11.2.
The version 0.7.x you are using does not know how to use build.sbt, which was only introduced in sbt 0.9 or so.
In addition to moving to sbt 0.11.2, I recommend going to the full configuration, even though the sbt author recommends using the .sbt descriptor for the majority of tasks and the .scala descriptor only when you cannot achieve something with the .sbt syntax or when you use subprojects (I use the full configuration for all of my projects, to clearly separate the different parts of the application).
Below is the sample project setup I use for a project I've just started, so it only has the specs2 dependency:
import sbt._
import Keys._

object BuildSettings {
  val buildOrganization = "net.batyuk"
  val buildScalaVersion = "2.9.1"
  val buildVersion = "0.1"

  val buildSettings = Defaults.defaultSettings ++ Seq(
    organization := buildOrganization,
    scalaVersion := buildScalaVersion,
    version := buildVersion)
}

object Resolvers {
  val typesafeRepo = "Typesafe Repository" at "http://repo.typesafe.com/typesafe/releases/"
  val sonatypeRepo = "Sonatype Releases" at "http://oss.sonatype.org/content/repositories/releases"
  val scalaResolvers = Seq(typesafeRepo, sonatypeRepo)
}

object Dependencies {
  val specs2Version = "1.8.2"
  val specs2 = "org.specs2" %% "specs2" % specs2Version
}

object IdefixBuild extends Build {
  import Resolvers._
  import Dependencies._
  import BuildSettings._

  val commonDeps = Seq(specs2)

  lazy val idefix = Project("idefix", file("."),
    settings = buildSettings ++ Seq(resolvers := scalaResolvers, libraryDependencies := commonDeps))
}
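For completeness, if you prefer to stay with the lighter .sbt style after upgrading to sbt 0.11.2, a minimal build.sbt covering just the specs2 setup might look like the sketch below (same versions and resolvers as above; adjust to taste):
name := "Comp-338-Web-App"

version := "1.0"

scalaVersion := "2.9.1"

resolvers ++= Seq(
  "Typesafe Repository" at "http://repo.typesafe.com/typesafe/releases/",
  "Sonatype Releases" at "http://oss.sonatype.org/content/repositories/releases")

libraryDependencies += "org.specs2" %% "specs2" % "1.8.2" % "test"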
