Error in Post Request (Kusto & Databricks) - azure-databricks

I'm trying to follow the example code here. Here is my code:
import com.microsoft.kusto.spark.datasource.KustoOptions
import com.microsoft.kusto.spark.sql.extension.SparkExtension._
import org.apache.spark.SparkConf
import org.apache.spark.sql._
val cluster = dbutils.secrets.get(scope = "key-vault-secrets", key = "ClusterName")
val client_id = dbutils.secrets.get(scope = "key-vault-secrets", key = "ClientId")
val client_secret = dbutils.secrets.get(scope = "key-vault-secrets", key = "ClientSecret")
val authority_id = dbutils.secrets.get(scope = "key-vault-secrets", key = "TenantId")
val database = "db"
val table = "tablename"
val conf: Map[String, String] = Map(
  KustoOptions.KUSTO_AAD_CLIENT_ID -> client_id,
  KustoOptions.KUSTO_AAD_CLIENT_PASSWORD -> client_secret,
  KustoOptions.KUSTO_QUERY -> s"$table | top 100"
)
// Simplified syntax flavor
import org.apache.spark.sql._
import com.microsoft.kusto.spark.sql.extension.SparkExtension._
import org.apache.spark.SparkConf
val df = spark.read.kusto(cluster, database, "", conf)
display(df)
However, this gives me this error:
com.microsoft.azure.kusto.data.exceptions.DataServiceException: Error in post request
at com.microsoft.azure.kusto.data.Utils.post(Utils.java:106)
at com.microsoft.azure.kusto.data.ClientImpl.execute(ClientImpl.java:89)
at com.microsoft.azure.kusto.data.ClientImpl.execute(ClientImpl.java:45)
at com.microsoft.kusto.spark.utils.KustoDataSourceUtils$.getSchema(KustoDataSourceUtils.scala:103)
at com.microsoft.kusto.spark.datasource.KustoRelation.getSchema(KustoRelation.scala:102)
at com.microsoft.kusto.spark.datasource.KustoRelation.schema(KustoRelation.scala:36)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:450)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:283)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:201)
at com.microsoft.kusto.spark.sql.extension.SparkExtension$DataFrameReaderExtension.kusto(SparkExtension.scala:19)
at linef172a4a7eaa6435fa4ff9fec071cf03535.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1810687702746193:25)
at linef172a4a7eaa6435fa4ff9fec071cf03535.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-1810687702746193:86)
at linef172a4a7eaa6435fa4ff9fec071cf03535.$read$$iw$$iw$$iw$$iw.<init>(command-1810687702746193:88)
at linef172a4a7eaa6435fa4ff9fec071cf03535.$read$$iw$$iw$$iw.<init>(command-1810687702746193:90)
at linef172a4a7eaa6435fa4ff9fec071cf03535.$read$$iw$$iw.<init>(command-1810687702746193:92)
at linef172a4a7eaa6435fa4ff9fec071cf03535.$read$$iw.<init>(command-1810687702746193:94)
at linef172a4a7eaa6435fa4ff9fec071cf03535.$read.<init>(command-1810687702746193:96)
at linef172a4a7eaa6435fa4ff9fec071cf03535.$read$.<init>(command-1810687702746193:100)
at linef172a4a7eaa6435fa4ff9fec071cf03535.$read$.<clinit>(command-1810687702746193)
at linef172a4a7eaa6435fa4ff9fec071cf03535.$eval$.$print$lzycompute(<notebook>:7)
Any ideas?

Make sure that the format of your cluster name matches the expected format; the expected format is clustername.region.
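For illustration only (the alias and region below are made up), the cluster value passed to spark.read.kusto should have that clustername.region shape rather than being a full URI:

// Hypothetical values; the point is the clustername.region shape of the first argument.
val cluster = "mycluster.westeurope"
val df = spark.read.kusto(cluster, database, "", conf)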

In the file you are reading your query from, make sure you append a line break ( \n ) at the end of each line.
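For example, if the query is read from a file line by line, the lines can be re-joined with explicit line breaks before being put into KUSTO_QUERY (a sketch; the file name is hypothetical):

// Hypothetical query file; joining with "\n" ensures each line ends with a line break.
val queryText = scala.io.Source.fromFile("query.kql").getLines().mkString("\n")
val confWithQuery = conf + (KustoOptions.KUSTO_QUERY -> queryText)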

Related

Why is asynchronous slower than synchronous?

Why is it that when I load-test the server, the synchronous style works hundreds of times faster than the asynchronous one? I'm just trying my hand at asynchrony; what am I doing wrong?
ab -t 10 -n 10000 -c 100 http://127.0.0.1:8888/api/db/sync/
Number of requests per second: 35
ab -t 10 -n 10000 -c 100 http://127.0.0.1:8888/api/db/
Number of requests per second: 0.37
import asyncio
import json
import time
from asyncio import sleep
from typing import Optional, Union

import tornado.ioloop
import tornado.web
from jinja2 import Environment, select_autoescape, FileSystemLoader
from sqlalchemy import text, create_engine
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.orm import sessionmaker, Session
from tornado import gen
from tornado.httpserver import HTTPServer

env = Environment(
    loader=FileSystemLoader('templates'),
    autoescape=select_autoescape()
)

async_engine = create_async_engine(
    "postgresql+asyncpg://***:***@localhost:5431/***",
    echo=True,
)
async_session = sessionmaker(
    async_engine, expire_on_commit=False, class_=AsyncSession
)

engine = create_engine(
    "postgresql://***:***@localhost:5431/***",
    echo=True,
)
sync_session = sessionmaker(
    engine, expire_on_commit=False
)


class ApiDbHandler(tornado.web.RequestHandler):
    def prepare(self):
        header = "Content-Type"
        body = "application/json"
        self.set_header(header, body)

    async def get(self):
        async with async_session() as session:
            session: AsyncSession
            for result in await asyncio.gather(
                session.execute(text('SELECT * from *** where client_id = :bis_id').bindparams(bis_id='bis1')),
                session.execute(text('SELECT * from *** where bis_id = :bis_id').bindparams(bis_id='bis1')),
                session.execute(text('SELECT * from *** where abs_id = :abs_id').bindparams(abs_id='abs1')),
                session.execute(text('SELECT * from *** where abs_id = :abs_id').bindparams(abs_id='abs1')),
            ):
                self.write(
                    json.dumps(result.fetchall(), default=str)
                )


class ApiDbSyncHandler(tornado.web.RequestHandler):
    def prepare(self):
        header = "Content-Type"
        body = "application/json"
        self.set_header(header, body)

    def get(self):
        with sync_session() as session:
            session: Session
            assets = session.execute(text('SELECT * from *** where client_id = :bis_id').bindparams(bis_id='bis1'))
            self.write(
                json.dumps(assets.fetchall(), default=str)
            )
            ili = session.execute(text('SELECT * from *** where bis_id = :bis_id').bindparams(bis_id='bis1'))
            self.write(
                json.dumps(ili.fetchall(), default=str)
            )
            securities = session.execute(text('SELECT * from *** where abs_id = :abs_id').bindparams(abs_id='abs1'))
            self.write(
                json.dumps(securities.fetchall(), default=str)
            )
            abs_accounts = session.execute(text('SELECT * from *** where abs_id = :abs_id').bindparams(abs_id='abs1'))
            self.write(
                json.dumps(abs_accounts.fetchall(), default=str)
            )


def make_app():
    return tornado.web.Application([
        (r"/api/db/", ApiDbHandler),
        (r"/api/db/sync/", ApiDbSyncHandler),
    ])


if __name__ == "__main__":
    app = make_app()
    server = HTTPServer(app)
    server.listen(8888)
    tornado.ioloop.IOLoop.current().start()

How to convert an RDD (that read in a directory of text files) into dataFrame in Apache Spark in Scala?

I'm developing a Scala feature-extraction app using Apache Spark TF-IDF. I need to read in from a directory of text files. I'm trying to convert an RDD to a DataFrame but I'm getting the error "value toDF() is not a member of org.apache.spark.rdd.RDD[streamedRDD]". This is what I have right now ...
I have spark-2.2.1 & Scala 2.1.11. Thanks in advance.
Code:
// Creating the Spark context that will interface with Spark
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("TextClassification")
val sc = new SparkContext(conf)

// Load documents (one per line)
val data = sc.wholeTextFiles("C:/Users/*")
val text = data.map { case (filepath, text) => text }
val id = data.map { case (filepath, text) => text.split("#").takeRight(1)(0) }

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

case class dataStreamed(id: String, input: String)

val tweetsDF = data
  .map { case (filepath, text) =>
    val id = text.split("#").takeRight(1)(0)
    val input = text.split(":").takeRight(2)(0)
    dataStreamed(id, input) }
  .as[dataStreamed]
  .toDF()
  .cache()

// -------------------- TF-IDF --------------------
// From spark.apache.org
// URL http://spark.apache.org/docs/latest/ml-features.html#tf-idf
val tokenizer = new Tokenizer().setInputCol("input").setOutputCol("words")
val wordsData = tokenizer.transform(tweetsDF)
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
val tf = hashingTF.transform(wordsData).cache() // Hashed words

// Compute for the TFxIDF
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val tfidf = idf.fit(tf)
Data: (Text files like these in a folder is what I need read in)
https://www.dropbox.com/s/cw3okhaosu7i1md/cars.txt?dl=0
https://www.dropbox.com/s/29tgqg7ifpxzwwz/Italy.txt?dl=0
The problem here is that the map function returns a Dataset[Row], which you assign to tweetsDF. It should be:
case class dataStreamed(id: String, input: String)

def test() = {
  val sparkConf = new SparkConf().setAppName("TextClassification").setMaster("local")
  val spark = SparkSession.builder().config(sparkConf).getOrCreate()
  val sqlContext = spark.sqlContext
  import sqlContext.implicits._

  // Load documents (one per line)
  val data = spark.sparkContext.wholeTextFiles("C:\\tmp\\stackoverflow\\*")
  val dataset = spark.createDataset(data)

  val tweetsDF = dataset
    .map { case (id: String, input: String) =>
      val file = id.split("#").takeRight(1)(0)
      val content = input.split(":").takeRight(2)(0)
      dataStreamed(file, content) }
    .as[dataStreamed]

  tweetsDF.printSchema()
  tweetsDF.show(10)
}
First, data will be an RDD[(String, String)]; then I create a new Dataset with spark.createDataset in order to be able to use map properly together with the case class. Please note that you must define the dataStreamed class outside of your method (test in this case).
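If you then need a DataFrame for the TF-IDF stages from the question, the Dataset converts directly (a short sketch that assumes the tweetsDF value from the snippet above):

// Continue into the question's TF-IDF pipeline; tweetsDF is the Dataset[dataStreamed] built above.
val tweetsFrame = tweetsDF.toDF().cache()
val tokenizer = new org.apache.spark.ml.feature.Tokenizer().setInputCol("input").setOutputCol("words")
val wordsData = tokenizer.transform(tweetsFrame)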
Good luck
We can do this with a couple of commands/functions:
Invoke the Spark/Scala shell; you can use driver-memory, executor-memory, executor-cores, etc. as suits your job:
spark-shell
Read the text file from HDFS
val text_rdd = sc.textFile("path/to/file/on/hdfs")
Convert the text rdd to DataFrame
val text_df = text_rdd.toDF
Save as plain text format in HDFS (a DataFrame has no saveAsTextFile, so go through the writer)
text_df.write.text("path/to/hdfs")
Save as splittable compressed format in HDFS
text_df.coalesce(1).write.parquet("path/to/hdfs")
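Put together, the steps above look roughly like this inside spark-shell (the paths are placeholders):

// sc and the implicits needed by toDF are already in scope in spark-shell.
val text_rdd = sc.textFile("path/to/file/on/hdfs")
val text_df = text_rdd.toDF("line")                           // DataFrame with a single string column
text_df.write.text("path/to/plain/text/output")               // plain text
text_df.coalesce(1).write.parquet("path/to/parquet/output")   // compressed, splittable format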

Spark LinearRegressionSummary "normal" summary

According to LinearRegressionSummary (Spark 2.1.0 JavaDoc), p-values are only available for the "normal" solver.
This value is only available when using the "normal" solver.
What the hell is the "normal" solver?
I'm doing this:
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegressionModel
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, SparkSession}
.
.
.
val (trainingData, testData): (DataFrame, DataFrame) =
com.acme.pta.accuracy.Util.splitData(output, testProportion)
.
.
.
val lr =
  new org.apache.spark.ml.regression.LinearRegression()
    .setSolver("normal").setMaxIter(maxIter)

val pipeline = new Pipeline()
  .setStages(Array(lr))

val paramGrid = new ParamGridBuilder()
  .addGrid(lr.elasticNetParam, Array(0.2, 0.4, 0.8, 0.9))
  .addGrid(lr.regParam, Array(0.6, 0.3, 0.1, 0.01))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(numFolds) // Use 3+ in practice

val cvModel: CrossValidatorModel = cv.fit(trainingData)

val pipelineModel: PipelineModel = cvModel.bestModel.asInstanceOf[PipelineModel]

val lrModel: LinearRegressionModel =
  pipelineModel.stages(0).asInstanceOf[LinearRegressionModel]

val modelSummary = lrModel.summary
Holder.log.info("lrModel.summary: " + modelSummary)

try {
  Holder.log.info("feature p values: ")
  // Exception occurs on line below.
  val featuresAndPValues = features.zip(lrModel.summary.pValues)
  featuresAndPValues.foreach(
    (featureAndPValue: (String, Double)) =>
      Holder.log.info(
        "feature: " + featureAndPValue._1 + ": " + featureAndPValue._2))
} catch {
  case _: java.lang.UnsupportedOperationException =>
    Holder.log.error("Cannot compute p-values")
}
I am still getting the UnsupportedOperationException.
The exception message is:
No p-value available for this LinearRegressionModel
Is there something else I need to be doing? I'm using
"org.apache.spark" %% "spark-mllib" % "2.1.1"
Is pValues supported in that version?
Updated
tl;dr
Solution 1
In the standard LinearRegression, pValues and the other "normal" statistics are only present when one of the parameters elasticNetParam or regParam is zero. So you can change your grid to
.addGrid( lr.elasticNetParam, Array( 0.0 ) )
or
.addGrid( lr.regParam, Array( 0.0 ) )
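For example, pinning regParam to zero while keeping the question's elasticNetParam grid routes every fit through the Cholesky path (a sketch based on the grid above):

val paramGrid = new ParamGridBuilder()
  .addGrid(lr.elasticNetParam, Array(0.2, 0.4, 0.8, 0.9))
  .addGrid(lr.regParam, Array(0.0)) // regParam = 0.0 selects the Cholesky solver, so pValues are available
  .build()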
Solution 2
Make a custom version of LinearRegression which explicitly uses the "normal" solver for regression and the Cholesky solver for WeightedLeastSquares.
I made this class as an extension to ml.regression package.
package org.apache.spark.ml.regression
import scala.collection.mutable
import org.apache.spark.SparkException
import org.apache.spark.internal.Logging
import org.apache.spark.ml.feature.Instance
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.optim.WeightedLeastSquares
import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
import org.apache.spark.ml.util._
import org.apache.spark.mllib.linalg.VectorImplicits._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.functions._
class CholeskyLinearRegression(override val uid: String)
  extends Regressor[Vector, CholeskyLinearRegression, LinearRegressionModel]
  with LinearRegressionParams with DefaultParamsWritable with Logging {

  import CholeskyLinearRegression._

  def this() = this(Identifiable.randomUID("linReg"))

  def setRegParam(value: Double): this.type = set(regParam, value)
  setDefault(regParam -> 0.0)

  def setFitIntercept(value: Boolean): this.type = set(fitIntercept, value)
  setDefault(fitIntercept -> true)

  def setStandardization(value: Boolean): this.type = set(standardization, value)
  setDefault(standardization -> true)

  def setElasticNetParam(value: Double): this.type = set(elasticNetParam, value)
  setDefault(elasticNetParam -> 0.0)

  def setMaxIter(value: Int): this.type = set(maxIter, value)
  setDefault(maxIter -> 100)

  def setTol(value: Double): this.type = set(tol, value)
  setDefault(tol -> 1E-6)

  def setWeightCol(value: String): this.type = set(weightCol, value)

  def setSolver(value: String): this.type = set(solver, value)
  setDefault(solver -> Auto)

  def setAggregationDepth(value: Int): this.type = set(aggregationDepth, value)
  setDefault(aggregationDepth -> 2)

  override protected def train(dataset: Dataset[_]): LinearRegressionModel = {

    // Extract the number of features before deciding optimization solver.
    val numFeatures = dataset.select(col($(featuresCol))).first().getAs[Vector](0).size
    val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))

    val instances: RDD[Instance] =
      dataset
        .select(col($(labelCol)), w, col($(featuresCol)))
        .rdd.map {
          case Row(label: Double, weight: Double, features: Vector) =>
            Instance(label, weight, features)
        }

    // if (($(solver) == Auto &&
    //   numFeatures <= WeightedLeastSquares.MAX_NUM_FEATURES) || $(solver) == Normal) {
    // For low dimensional data, WeightedLeastSquares is more efficient since the
    // training algorithm only requires one pass through the data. (SPARK-10668)

    val optimizer = new WeightedLeastSquares(
      $(fitIntercept),
      $(regParam),
      elasticNetParam = $(elasticNetParam),
      $(standardization),
      true,
      solverType = WeightedLeastSquares.Cholesky,
      maxIter = $(maxIter),
      tol = $(tol)
    )

    val model = optimizer.fit(instances)
    val lrModel = copyValues(new LinearRegressionModel(uid, model.coefficients, model.intercept))
    val (summaryModel, predictionColName) = lrModel.findSummaryModelAndPredictionCol()

    val trainingSummary = new LinearRegressionTrainingSummary(
      summaryModel.transform(dataset),
      predictionColName,
      $(labelCol),
      $(featuresCol),
      summaryModel,
      model.diagInvAtWA.toArray,
      model.objectiveHistory
    )

    lrModel.setSummary(Some(trainingSummary))
    lrModel
  }

  override def copy(extra: ParamMap): CholeskyLinearRegression = defaultCopy(extra)
}

object CholeskyLinearRegression
  extends DefaultParamsReadable[CholeskyLinearRegression] {

  override def load(path: String): CholeskyLinearRegression = super.load(path)

  val MAX_FEATURES_FOR_NORMAL_SOLVER: Int = WeightedLeastSquares.MAX_NUM_FEATURES

  /** String name for "auto". */
  private[regression] val Auto = "auto"

  /** String name for "normal". */
  private[regression] val Normal = "normal"

  /** String name for "l-bfgs". */
  private[regression] val LBFGS = "l-bfgs"

  /** Set of solvers that LinearRegression supports. */
  private[regression] val supportedSolvers = Array(Auto, Normal, LBFGS)
}
All you have to do is paste it into a separate file in the project and change LinearRegression to CholeskyLinearRegression in your code.
val lr = new CholeskyLinearRegression() // new LinearRegression()
  .setSolver("normal")
  .setMaxIter(maxIter)
It works with non-zero params and gives pValues. Tested on the following parameter grid:
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.elasticNetParam, Array(0.2, 0.4, 0.8, 0.9))
  .addGrid(lr.regParam, Array(0.6, 0.3, 0.1, 0.01))
  .build()
Full investigation
I initially thought that the main issue was the model not being fully preserved: the trained model is not preserved after fitting in CrossValidator, which is understandable given the memory consumption, and there is an ongoing debate on how it should be resolved (see the issue in JIRA).
You can see in the commented section that I tried to extract the parameters from the best model in order to run it again. Then I found out that the model summary is fine; it's just that for some parameters diagInvAtWA has a length of 1 and is basically zero.
For ridge regression, i.e. Tikhonov regularization (elasticNetParam = 0), pValues and the other "normal" statistics can be computed with any regParam, but not for the Lasso method or anything in between (elastic net). The same goes for regParam = 0: pValues were computed with any elasticNetParam.
Why is that
LinearRegression uses the WeightedLeastSquares optimizer for the "normal" solver, with solverType = WeightedLeastSquares.Auto. This optimizer has two options for the underlying solver: QuasiNewton or Cholesky. The former is selected only when both regParam and elasticNetParam are non-zero.
val solver = if (
  (solverType == WeightedLeastSquares.Auto &&
    elasticNetParam != 0.0 &&
    regParam != 0.0) ||
  (solverType == WeightedLeastSquares.QuasiNewton)) {
  ...
  new QuasiNewtonSolver(fitIntercept, maxIter, tol, effectiveL1RegFun)
} else {
  new CholeskySolver
}
So with your parameter grid the QuasiNewtonSolver will always be used, because there is no combination of regParam and elasticNetParam where one of them is zero.
We know that in order to get pValues and other "normal" statistics such as the t-statistic or the standard errors of the coefficients, the diagonal of the matrix (A^T * W * A)^-1 (diagInvAtWA) must not be a vector with only a single zero. This condition is set in the definition of pValues.
diagInvAtWA is a vector of the diagonal elements of the packed upper triangular matrix (solution.aaInv).
val diagInvAtWA = solution.aaInv.map { inv => ...
For the Cholesky solver it is calculated, but for QuasiNewton it is not; the second parameter of NormalEquationSolution is this matrix.
You technically could make your own version of LinearRegression that forces the Cholesky solver, which is what Solution 2 above does.
Reproduction
In this example I used data sample_linear_regression_data.txt from here.
Full code of reproduction
import org.apache.spark._
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.{RegressionEvaluator, BinaryClassificationEvaluator}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.{LinearRegressionModel, LinearRegression}
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.ml.param.ParamMap

object Main {

  def main(args: Array[String]): Unit = {

    val spark =
      SparkSession
        .builder()
        .appName("SO")
        .master("local[*]")
        .config("spark.driver.host", "localhost")
        .getOrCreate()

    import spark.implicits._

    val data =
      spark
        .read
        .format("libsvm")
        .load("./sample_linear_regression_data.txt")

    val Array(training, test) =
      data.randomSplit(Array(0.9, 0.1), seed = 12345)

    val maxIter = 10

    val lr = new LinearRegression()
      .setSolver("normal")
      .setMaxIter(maxIter)

    val paramGrid = new ParamGridBuilder()
      // .addGrid( lr.elasticNetParam, Array( 0.2, 0.4, 0.8, 0.9 ) )
      .addGrid(lr.elasticNetParam, Array(0.0))
      .addGrid(lr.regParam, Array(0.6, 0.3, 0.1, 0.01))
      .build()

    val pipeline = new Pipeline()
      .setStages(Array(lr))

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new RegressionEvaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(2) // Use 3+ in practice

    val cvModel = cv.fit(training)

    val pipelineModel: PipelineModel =
      cvModel
        .bestModel
        .asInstanceOf[PipelineModel]

    val lrModel: LinearRegressionModel =
      pipelineModel
        .stages(0)
        .asInstanceOf[LinearRegressionModel]

    // Technically there is a way to use the exact ParamMap
    // to build a new LR but for simplicity I'll
    // get and set them explicitly

    // lrModel.params.foreach( ( param ) => {
    //   println( param )
    // } )

    // val bestLr = new LinearRegression()
    //   .setSolver( "normal" )
    //   .setMaxIter( maxIter )
    //   .setRegParam( lrModel.getRegParam )
    //   .setElasticNetParam( lrModel.getElasticNetParam )
    // val bestLrModel = bestLr.fit( training )

    val modelSummary = lrModel.summary

    println("lrModel pValues: " + modelSummary.pValues.mkString(", "))

    spark.stop()
  }
}
Original
There are three solver algorithms available:
l-bfgs - Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm which is a limited-memory quasi-Newton optimization method.
normal - uses the Normal Equation as an analytical solution to the linear regression problem. It is basically a weighted least squares or reweighted least squares approach.
auto - the solver algorithm is selected automatically. The Normal Equations solver will be used when possible, but it will automatically fall back to iterative optimization methods when needed.
The coefficientStandardErrors, tValues and pValues are only available when using the "normal" solver because they are all based on diagInvAtWA - a diagonal of matrix (A^T * W * A)^-1.
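For illustration, with the "normal" solver those statistics can be read straight off the training summary (a short sketch assuming the fitted lrModel from above):

// Summary statistics available only when the "normal" solver (Cholesky path) was used.
val s = lrModel.summary
println("std. errors: " + s.coefficientStandardErrors.mkString(", "))
println("t-values:    " + s.tValues.mkString(", "))
println("p-values:    " + s.pValues.mkString(", "))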

How to pass command line input in Gatling using Scala script?

I want the user to be able to input userCount, flowRepeatCount, testServerUrl, and definitionId from the command line when executing with Gatling. From the command line I execute:
> export JAVA_OPTS="-DuserCount=1 -DflowRepeatCount=1 -DdefinitionId=10220101 -DtestServerUrl='https://someurl.com'"
> sudo bash gatling.sh
But it gives the following error:
url null/api/workflows can't be parsed into a URI: scheme
Basically a null value gets passed there. The same happens with definitionId. Following is the code; you can try it with any URL, you just have to check whether the value you provide on the command line shows up or not.
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class TestCLI extends Simulation {

  val userCount = Integer.getInteger("userCount", 1).toInt
  val holdEachUserToWait = 2
  val flowRepeatCount = Integer.getInteger("flowRepeatCount", 2).toInt
  val definitionId = java.lang.Long.getLong("definitionId", 0L)
  val testServerUrl = System.getProperty("testServerUrl")

  val httpProtocol = http
    .baseURL(testServerUrl)
    .inferHtmlResources()
    .acceptHeader("""*/*""")
    .acceptEncodingHeader("""gzip, deflate""")
    .acceptLanguageHeader("""en-US,en;q=0.8""")
    .authorizationHeader(envAuthenticationHeaderFromPostman)
    .connection("""keep-alive""")
    .contentTypeHeader("""application/vnd.v7811+json""")
    .userAgentHeader("""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36""")

  val headers_0 = Map(
    """Cache-Control""" -> """no-cache""",
    """Origin""" -> """chrome-extension://faswwegilgnpjigdojojuagwoowdkwmasem""")

  val scn = scenario("testabcd")
    .repeat(flowRepeatCount) {
      exec(http("asdfg")
        .post("""/api/workflows""")
        .headers(headers_0)
        .body(StringBody("""{"definitionId":$definitionId}"""))) // I also want to get this value dynamic from CLI and put here
      .pause(holdEachUserToWait)
    }

  setUp(scn.inject(atOnceUsers(userCount))).protocols(httpProtocol)
}
No main method is defined here, so I think it would be difficult to pass command-line arguments directly. As a workaround, what you can do is read the properties from environment variables.
For that you can find some help here:
How to read environment variables in Scala
In the case of Gatling, see here: http://gatling.io/docs/2.2.2/cookbook/passing_parameters.html
I think this will get it done:
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class TestCLI extends Simulation {

  val count = Integer.getInteger("users", 50)
  val wait = 2
  val repeatCount = Integer.getInteger("repeatCount", 2)
  val testServerUrl = System.getProperty("testServerUrl")
  val definitionId = java.lang.Long.getLong("definitionId", 0L)

  val scn = scenario("testabcd")
    .repeat(repeatCount) {
      exec(http("asdfg")
        .post("""/xyzapi""")
        .headers(headers_0)
        .body(StringBody("""{"definitionId":$definitionId}"""))) // I also want to get this value dynamic from CLI and put here
      .pause(wait)
    }

  setUp(scn.inject(atOnceUsers(count))).protocols(httpProtocol)
}
On the command line, first export the JAVA_OPTS environment variable by using this command directly in the terminal:
export JAVA_OPTS="-DuserCount=50 -DflowRepeatCount=2 -DdefinitionId=10220301 -DtestServerUrl='something'"
Windows 10 solution:
Create a simple my_gatling_with_params.bat file with content, e.g.:
@ECHO OFF
@REM You could pass JAVA_OPTS to this script as command-line arguments, e.g. '-Dusers=2 -Dgames=1'
set JAVA_OPTS=%*
@REM Define this variable if you want to autoclose your .bat file after the script is done
set "NO_PAUSE=1"
@REM To have a pause, uncomment this line and comment the previous one
rem set "NO_PAUSE="
gatling.bat -s computerdatabase.BJRSimulation_lite -nr -rsf c:\Work\gatling-charts-highcharts-bundle-3.3.1\_mydata\
exit
where:
computerdatabase.BJRSimulation_lite - your .scala script
users and games - the params that you want to pass to the script
So in your computerdatabase.BJRSimulation_lite file you could use the variables users and games in the following way:
package computerdatabase

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._
import scala.util.Random
import java.util.concurrent.atomic.AtomicBoolean

class BJRSimulation_lite extends Simulation {

  val httpProtocol = ...

  val nbUsers = Integer.getInteger("users", 1).toInt
  val nbGames = Integer.getInteger("games", 1).toInt

  val scn = scenario("MyScen1")
    .group("Play") {
      // Set count of games
      repeat(nbGames) {
        ...
      }
    }

  // Set count of users
  setUp(scn.inject(atOnceUsers(nbUsers)).protocols(httpProtocol))
}
After that you can just invoke 'my_gatling_with_params.bat -Dusers=2 -Dgames=1' to pass your params into the test.

Spark RDD map in yarn mode does not allow access to variables?

I have a brand new install of Spark 1.2.1 on a MapR cluster, and while testing it I find that it works fine in local mode, but in yarn mode it seems unable to access variables, not even broadcast ones. To be precise, the following test code
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object JustSpark extends App {

  val conf = new org.apache.spark.SparkConf().setAppName("SimpleApplication")
  val sc = new SparkContext(conf)

  val a = List(1, 3, 4, 5, 6)
  val b = List("a", "b", "c")

  val bBC = sc.broadcast(b)
  val data = sc.parallelize(a)

  val transform = data map (t => { "hi" })
  transform.take(3) foreach (println _)

  val transformx2 = data map (t => { bBC.value.size })
  transformx2.take(3) foreach (println _)

  //val transform2 = data map ( t => { b.size })
  //transform2.take(3) foreach (println _)
}
works in local mode but fails in yarn mode. More precisely, both methods, transform2 and transformx2, fail, and all of them work with --master local[8].
I am compiling it with sbt and sending with the submit tool
/opt/mapr/spark/spark-1.2.1/bin/spark-submit --class JustSpark --master yarn target/scala-2.10/simulator_2.10-1.0.jar
Any idea what is going on? The failure message just reports a Java NullPointerException at the place where it should be accessing the variable. Is there another way to pass variables into the RDD maps?
I'm going to take a pretty good guess: it's because you're using App. See https://issues.apache.org/jira/browse/SPARK-4170 for details. Write a main() method instead.
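A minimal sketch of that fix, reusing the code from the question with an explicit main() instead of extends App:

import org.apache.spark.{SparkConf, SparkContext}

object JustSpark {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SimpleApplication")
    val sc = new SparkContext(conf)
    val b = List("a", "b", "c")
    val bBC = sc.broadcast(b)                      // broadcast the list, as in the question
    val data = sc.parallelize(List(1, 3, 4, 5, 6))
    data.map(_ => bBC.value.size).take(3).foreach(println)
    sc.stop()
  }
}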
I presume the culprit was
val transform2 = data map ( t => { b.size })
in particular, the access to the local variable b. You may actually see java.io.NotSerializableException in your log files.
What is supposed to happen: Spark will attempt to serialize any referenced object. That means, in this case, the entire JustSpark class, since one of its members is referenced.
Why did this fail? Your class is not Serializable, so Spark is unable to send it over the wire. In particular, you have a reference to SparkContext, which does not extend Serializable:
class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient {
So your first code, which broadcasts only the variable value, is the correct way.
This is the original example of broadcast, from spark sources, altered to use lists instead of arrays:
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object MultiBroadcastTest {

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Multi-Broadcast Test")
    val sc = new SparkContext(sparkConf)

    val slices = if (args.length > 0) args(0).toInt else 2
    val num = if (args.length > 1) args(1).toInt else 1000000

    val arr1 = (1 to num).toList
    val arr2 = (1 to num).toList

    val barr1 = sc.broadcast(arr1)
    val barr2 = sc.broadcast(arr2)

    val observedSizes: RDD[(Int, Int)] = sc.parallelize(1 to 10, slices).map { _ =>
      (barr1.value.size, barr2.value.size)
    }

    observedSizes.collect().foreach(i => println(i))

    sc.stop()
  }
}
I compiled it in my environment and it works.
So what is the difference?
The problematic example uses extends App while the original example is a plain singleton.
So I demoted the code to a "doIt()" function
object JustDoSpark extends App {
  def doIt() {
    ...
  }
  doIt()
}
and guess what. It worked.
So the problem is indeed related to serialization, but in a different way: having the code in the body of the object seems to cause problems.
