Spark: Measure performance of UDF on large dataset

I want to measure the performance of a UDF on a large dataset. The Spark SQL is:
spark.sql("SELECT my_udf(value) as results FROM my_table")
The UDF returns an array. The issue I'm facing is how to execute this without returning the data to the driver. I need an action, but anything that returns the full dataset (e.g. collect) will crash the driver, while actions like show or take(n) don't run the calculation for all rows. So how can I trigger the calculation without returning all the data to the driver?

I think the closest you can get to running only your UDF for timing purposes is something like the following. The general idea is to use caching to remove data-loading time from the measurement as much as possible, and then use a foreach that does nothing to force Spark to run your UDF.
import org.apache.spark.sql.functions.udf
import spark.implicits._ // spark is the SparkSession (e.g. in spark-shell); needed for toDF and $"..."

val myFunc: String => Int = _.length
val myUdf = udf(myFunc)
val data = Seq("a", "aa", "aaa", "aaaa")
val df = sc.parallelize(data).toDF("text")
// Cache to remove data loading from the measurement as much as possible,
// and run a no-op foreach action to force the data to load and cache before the test
df.cache()
df.foreach(row => {})
// Run the test, grabbing timestamps before and after
val start = System.nanoTime()
val udfDf = df.withColumn("udf_column", myUdf($"text"))
// Force Spark to run the UDF and do nothing with the result,
// so no write time is included in the measurement
udfDf.rdd.foreach(row => {})
// Total elapsed time
val elapsedNs = System.nanoTime() - start
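If you are on Spark 3.x, another way to force full execution without collecting anything to the driver is the built-in noop data source; this is a minimal sketch, assuming a Spark 3.x runtime, that runs the whole plan and simply discards the rows:
spark.sql("SELECT my_udf(value) AS results FROM my_table")
  .write
  .format("noop")      // benchmark sink in Spark 3.x: materializes every row, writes nothing
  .mode("overwrite")
  .save()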

Related

MLlib RandomForest (Spark 2.0): predict a single vector

After training a RandomForestRegressor in a PipelineModel using the DataFrame-based MLlib API (Spark 2.0), I loaded the saved model into my real-time environment in order to predict with it. Each request is handled and transformed through the loaded PipelineModel, but in the process I have to convert the single request vector into a one-row DataFrame using spark.createDataFrame, and all of this takes around 700 ms, compared to 2.5 ms if I use the RDD-based MLlib RandomForestRegressor.predict(VECTOR).
Is there any way to use the new MLlib to predict a single vector without converting to a DataFrame, or to do something else to speed things up?
The DataFrame-based org.apache.spark.ml.regression.RandomForestRegressionModel also takes a Vector as input, so I don't think you need to convert a vector to a DataFrame for every call.
Here is how I think your code could work:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.regression.RandomForestRegressionModel
import spark.implicits._ // needed for the encoder when mapping over a Dataset

// Load the trained RF model
val rfModel = RandomForestRegressionModel.load("path")
// A DataFrame containing a column "feature" of type Vector
val predictionData = ...
predictionData.map { row =>
  val feature = row.getAs[Vector]("feature")
  val result = rfModel.predict(feature)
  (feature, result)
}
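For the per-request path, here is a minimal sketch of calling the model on a single vector directly, assuming your Spark version exposes predict(Vector) publicly (it is public in recent releases but protected in some 2.x versions, so check yours):
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.RandomForestRegressionModel

// Load once at startup and keep the model in memory
val rfModel = RandomForestRegressionModel.load("path")

// Per request: build the feature vector (the values here are made up) and predict directly
val features = Vectors.dense(0.5, 1.2, 3.4)
val prediction: Double = rfModel.predict(features)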

Spark not able to retrieve all HBase data in a specific column

My HBase table has 30 million records. Each record has the column raw:sample, where raw is the column family and sample is the column qualifier. This column is very big, ranging from a few KB to 50 MB in size. When I run the following Spark code, it can only get 40 thousand records, but I should get 30 million:
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "10.1.1.15:2181")
conf.set(TableInputFormat.INPUT_TABLE, "sampleData")
conf.set(TableInputFormat.SCAN_COLUMNS, "raw:sample")
conf.set("hbase.client.keyvalue.maxsize","0")
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
var arrRdd: RDD[Map[String, Object]] = hBaseRDD.map(tuple => tuple._2).map(...)
Right now I work around this by getting the ID list first, then iterating over the ID list and fetching the column raw:sample with the plain HBase Java client inside a Spark foreach.
Any idea why I cannot get all of the column raw:sample through Spark? Is it because the column is too big?
A few days ago one of my ZooKeeper nodes and datanodes went down, but I fixed it soon since the replication factor is 3; could this be the reason? Do you think running hbck -repair would help? Thanks a lot!
Internally, TableInputFormat creates a Scan object in order to retrieve the data from HBase.
Try to create a Scan object (without using Spark), configured to retrieve the same column from HBase, and see if the error repeats:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Instantiating the Configuration class
Configuration config = HBaseConfiguration.create();
// Instantiating the HTable class
HTable table = new HTable(config, "emp");
// Instantiating the Scan class
Scan scan = new Scan();
// Scanning the required columns
scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"));
scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"));
// Getting the scan result
ResultScanner scanner = table.getScanner(scan);
// Reading values from the scan result
for (Result result = scanner.next(); result != null; result = scanner.next()) {
    System.out.println("Found row : " + result);
}
// Closing the scanner and the table
scanner.close();
table.close();
In addition, by default TableInputFormat is configured to request a very small chunk of data from the HBase server, which causes a large overhead. Set the following to increase the chunk size:
scan.setCacheBlocks(false);
scan.setCaching(2000);
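If you stay inside Spark, the same Scan settings can also be passed through the Hadoop configuration that newAPIHadoopRDD hands to TableInputFormat. A sketch continuing the snippet from the question (the constant names should correspond to hbase.mapreduce.scan.cacheblocks and hbase.mapreduce.scan.cachedrows, but verify them against your HBase version; with cells of up to 50 MB you will likely want a much smaller caching value than 2000):
conf.setBoolean(TableInputFormat.SCAN_CACHEBLOCKS, false) // don't pollute the block cache with a full scan
conf.setInt(TableInputFormat.SCAN_CACHEDROWS, 2000)       // rows fetched per RPC; tune to your cell size
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])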
For a high throughput like yours, Apache Kafka is a good solution for integrating the data flow and keeping the data pipeline alive. See http://kafka.apache.org/08/uses.html for some Kafka use cases.
One more: http://sites.computer.org/debull/A12june/pipeline.pdf

Pig LOADER for SPLUNK like records

I am trying to use PIG to read data from HDFS where the files contain rows that look like:
"key1"="value1", "key2"="value2", "key3"="value3"
"key1"="value10", "key3"="value30"
In a way the rows of the data are essentially dictionaries:
{"key1":"value1", "key2":"value2", "key3":"value3"}
{"key1":"value10", "key3":"value30"}
I can read and dump a portion of this data easily enough with something like:
data = LOAD '/hdfslocation/weirdformat*' USING PigStorage(',');
sampled = SAMPLE data 0.00001;
dump sampled;
My problem is that I can't parse it efficiently. I have tried to use
org.apache.pig.piggybank.storage.MyRegExLoader
but it seems extremely slow.
Could someone recommend a different approach?
It seems like one way is to use a Python UDF.
This solution is heavily inspired by bag-to-tuple.
In myudfs.py write:
#!/usr/bin/python
def FieldPairsGenerator(dataline):
    for x in dataline.split(','):
        k, v = x.split('=')
        yield (k.strip().strip('"'), v.strip().strip('"'))

@outputSchema("foo:map[]")
def KVDataToDict(dataline):
    return dict(kvp for kvp in FieldPairsGenerator(dataline))
then write the following Pig script:
REGISTER 'myudfs.py' USING jython AS myfuncs;
data = LOAD 'whereyourdatais*.gz' AS (foo:chararray);
A = FOREACH data GENERATE myfuncs.KVDataToDict(foo);
A now has the data stored as a Pig map.

How to make script execution slow?

I have a task: I need to select data from "TABLE_FROM", modify it, and insert it into "TABLE_TO". The main problem is that the script must run in production and must not hurt the live site's performance, but "TABLE_FROM" contains hundreds of millions of rows. I'm going to run the script using Node.js. What techniques are used to solve this kind of problem, i.e. how can I make this script run "slowly", or in other words "softly", so that it doesn't overload the DB and CPU?
The execution time of the script is irrelevant. I use Cassandra.
Sample code:
var OFFSET = 0;
var BATCHSIZE = 100;
var TIMEOUT = 1000;

function fetchPush() {
  // fetch the next batch from TABLE_FROM (fetch and push are placeholders here)
  var rows = fetch(OFFSET, BATCHSIZE);
  if (rows.length === 0) return; // stop when there is nothing left to copy
  // push the batch to TABLE_TO
  push(rows);
  OFFSET += BATCHSIZE;
  // schedule the next batch after a pause to keep the load on the DB and CPU low
  setTimeout(fetchPush, TIMEOUT);
}

fetchPush();
Here I'm assuming fetch and push are blocking calls; for asynchronous processing you could (obviously) use the async library.

SQLAlchemy - when I iterate on a query, do I get a list or an iterator?

I'm starting to learn how to use SQLAlchemy and I'm running into some efficiency problems.
I created an object mapping an existing big table on our Oracle database:
from sqlalchemy import create_engine, MetaData, Table
from sqlalchemy.orm import mapper, sessionmaker

engine = create_engine(connectionString, echo=False)

class POI(object):
    def __repr__(self):
        return "{poi_id} - {title}, {city} - {uf}".format(**self.__dict__)

def loadSession():
    metadata = MetaData(engine)
    _poi = Table('tbl_ourpois', metadata, autoload=True)
    mapper(POI, _poi)
    Session = sessionmaker(bind=engine)
    session = Session()
    return session
This table has millions of records. When I do a simple query and try to iterate over it:
session = loadSession()
for poi in session.query(POI):
print poi
I noticed two things: (1) it takes a few minutes before it starts printing objects on the screen, and (2) memory usage grows like crazy. So my conclusion was that this code fetches the whole result set into a list and then iterates over it. Is this correct?
With cx_Oracle, when I do a query like:
conn = cx_Oracle.connect(connectionString)
cursor = conn.cursor()
cursor.execute("select * from tbl_ourpois")
for poi in cursor:
print poi
the resulting cursor behaves as an iterator that fetches results into a buffer and returns them as they are needed, instead of loading the whole thing into a list. This loop starts printing results almost instantly, and memory usage stays low and constant.
Can I get this kind of behavior with SQLAlchemy? Is there a way to get a constant-memory iterator out of session.query(POI) instead of a list?

Resources