Do I need caching after repartitioning?

I have made a DataFrame which I repartitioned on its primary key across the nodes:
val config = new SparkConf().setAppName("MyHbaseLoader").setMaster("local[10]")
val context = new SparkContext(config)
val sqlContext = new SQLContext(context)
val rows = "sender,time,time(utc),reason,context-uuid,rat,cell-id,first-pkt,last-pkt,protocol,sub-proto,application-id,server-ip,server-domain-name, http-proxy-ip,http-proxy-domain-name, video,packets-dw, packets-ul, bytes-dw, bytes-ul"
val scheme = new StructType(rows.split(",").map(e => new StructField(e.trim, StringType, true)))
val dFrame = sqlContext.read
  .schema(scheme)
  .format("csv")
  .load("E:\\Users\\Mehdi\\Downloads\\ProbDocument\\ProbDocument\\ggsn_cdr.csv")
dFrame.registerTempTable("GSSN")
dFrame.persist(StorageLevel.MEMORY_AND_DISK)
val distincCount = sqlContext.sql("select count(distinct sender) as SENDERS from GSSN").collectAsList().get(0).get(0).asInstanceOf[Long]
dFrame.repartition(distincCount.toInt / 3, dFrame("sender"))
Do I need to call the persist method again after repartitioning, before the next reduce jobs on the DataFrame?

Yes, repartition returns a new DataFrame so you would need to cache it again.

While the answer provided by Dikei addresses your direct question, it is important to note that in a case like this there is typically no reason to cache explicitly at all.
Every shuffle in Spark (here the repartition) serves as an implicit caching point. If some part of the lineage has to be re-executed and no executors have been lost, it won't have to go further back than the last shuffle; it can simply read the shuffle files.
It means that caching just before or just after a shuffle is typically a waste of time and resources, especially if you're not interested in in-memory-only caching or some non-standard caching mechanism.
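A minimal RDD-level illustration of this point (made-up data; the DataFrame case rests on the same mechanism): re-running an action on an already shuffled RDD reads the existing shuffle files, and the Spark UI shows the pre-shuffle stages as skipped.
val pairs = context.parallelize(1 to 1000000).map(i => (i % 100, i))
val sums = pairs.reduceByKey(_ + _)   // shuffle boundary
sums.count()                          // computes the map side and writes shuffle files
sums.count()                          // map stages are skipped; the shuffle files are read instead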

You would need to persist the repartitioned DataFrame, since DataFrames are immutable and repartition returns a new DataFrame.
An approach you could follow is to persist dFrame and, after the repartition, also persist the new DataFrame that is returned, dFrameRepart. At that stage you can unpersist dFrame to free up the memory, provided you won't be using dFrame again. If you do use dFrame after the repartition operation, both DataFrames can stay persisted.
dFrame.registerTempTable("GSSN")
dFrame.persist(StorageLevel.MEMORY_AND_DISK)
val distincCount = sqlContext.sql("select count(distinct sender) as SENDERS from GSSN").collectAsList().get(0).get(0).asInstanceOf[Long]
val dFrameRepart = dFrame.repartition(distincCount.toInt / 3, dFrame("sender")).persist(StorageLevel.MEMORY_AND_DISK)
dFrame.unpersist()

Related

How can write performance be improved for RecordWriter?

Can anyone help me find the correct API to improve write performance?
We use the MultipleOutputs<ImmutableBytesWritable, Result> class to write data we read from a table, and we use the newly created file as a backup. We face a performance issue when writing with MultipleOutputs: it takes nearly 5 seconds for every 10000 records we write.
This is the code we use:
Result[] results = // result from another table
MultipleOutputs<ImmutableBytesWritable, Result> mos = new MultipleOutputs<ImmutableBytesWritable, Result>(context); // context: the mapper/reducer task context
for (Result res : results) {
    mos.write(new ImmutableBytesWritable(res.getRow()), res, baseoutputpath);
}
We get a batch of 10000 rows and write them in a loop, with baseoutputpath changing depending on the Result content.
We are seeing a performance dip when writing through MultipleOutputs, and we suspect it might be due to writing in a loop.
Is there any other API in MapR-DB or HBase that pushes data to the database with fewer RPC calls by buffering up to a certain limit?
We write data as records, so no plain file-system writer class would work for us.
Please note that we use a MapReduce job to do all of the above.
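If the target is an HBase (or MapR-DB binary) table rather than a file, one candidate is HBase's BufferedMutator, which buffers Puts client-side and flushes them in batches, so many rows go out in far fewer RPCs than one put per record. A minimal sketch in Scala, not the asker's code; the table name, buffer size, and the reuse of the results array above are assumptions:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{BufferedMutatorParams, ConnectionFactory, Put}

val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val params = new BufferedMutatorParams(TableName.valueOf("backup_table"))
  .writeBufferSize(8L * 1024 * 1024)              // flush roughly every 8 MB of buffered Puts
val mutator = connection.getBufferedMutator(params)

for (res <- results) {                            // the Result[] batch from above
  val put = new Put(res.getRow)
  res.rawCells().foreach(cell => put.add(cell))   // copy the cells of each Result
  mutator.mutate(put)                             // buffered client-side, not one RPC per row
}
mutator.flush()                                   // push anything still buffered
mutator.close()
connection.close()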

Getting duplicates with NiFi HBase_1_1_2_ClientMapCacheService

I need to remove duplicates from a flow I've developed; it can receive the same ${filename} multiple times. I tried using HBase_1_1_2_ClientMapCacheService with DetectDuplicate (I am using NiFi v1.4), but found that it lets a few duplicates through. If I use DistributedMapCache (ClientService and Server), I do not get any duplicates. Why would I get some duplicates with the HBase cache?
As a test, I listed a directory (ListSFTP) with 20,000 files on all cluster nodes (4 nodes) and passed to DetectDuplicate (using the HBase Cache service). It routed 20,020 to "non-duplicate", and interestingly the table actually has 20,000 rows.
Unfortunately I think this is due to a limitation in the operations that are offered by HBase.
The DetectDuplicate processor relies on an operation "getAndPutIfAbsent" which is expected to return the original value, and then set the new value if it wasn't there. For example, first time through it would return null and set the new value, indicating it wasn't a duplicate.
HBase doesn't natively support this operation, so the implementation of this method in the HBase map cache client does this:
V got = get(key, keySerializer, valueDeserializer);
boolean wasAbsent = putIfAbsent(key, value, keySerializer, valueSerializer);
if (! wasAbsent) return got;
else return null;
So because it is two separate calls there is a possible race condition...
Imagine node 1 calls the first line and gets null, but then node 2 performs both the get and the putIfAbsent. Now when node 1 calls putIfAbsent it gets false, because node 2 just populated the cache, so node 1 returns the null value from the original get... both nodes look like they saw a non-duplicate to DetectDuplicate.
In the DistributedMapCacheServer, it locks the entire cache per operation so it can provide an atomic getAndPutIfAbsent.
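By way of contrast, here is a toy sketch of what an atomic getAndPutIfAbsent looks like when the whole cache sits behind one lock. This is illustrative only, not NiFi's actual implementation:
import scala.collection.mutable

class ToyMapCache[K, V] {
  private val map = mutable.Map[K, V]()
  // Atomic because the get and the conditional put happen under a single lock.
  def getAndPutIfAbsent(key: K, value: V): Option[V] = this.synchronized {
    val previous = map.get(key)
    if (previous.isEmpty) map.put(key, value)
    previous            // None means "not seen before", i.e. not a duplicate
  }
}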

Creating single object DataFrame for predictions

Once I have my classification models trained, I'd like to use them in my web application to make classification predictions on the data that has been collected for a given session.
That is:
1) I have some session data structure that I need to map to a DataFrame row
2) feed the DataFrame row into my ML model to predict the classification
3) use the prediction with the originating session to show it to the user in the browser.
The examples I've seen so far for creating a DataFrame as input to a Spark pipeline create it from a data source like a file. It seems a bit unwieldy to first create a single POJO or JsonNode, serialize it to a file containing just one record, and then use that file to create the DataFrame to feed the model.
Writing this I also get the feeling that it might not be a great idea to create and tear down the ML pipeline for each request, which seems to follow from this approach.
So maybe I should be thinking "Spark Streaming" instead?
Feed the mapped session data into some kind of message queue and feed that into my Spark pipeline? What kind of "stream" would be appropriate here?
I read somewhere that Spark streaming consumes the stream in micro batches and not record by record - that implies some delay until enough records have been collected to fill up the micro batch (or some preconfigured delay to wait until the micro batch is considered to be "full enough"). What does that mean for the responsiveness of the web application? Can I trigger the micro batches like every 100 milliseconds?
I would appreciate if someone could point me in the right direction.
Maybe Spark is not a good fit here and I should switch to Apache Flink?
Thanks in advance, Bernd
Ok, by now I have found some ways to solve my problem, maybe that
helps someone else:
Use a Sequence containing one tuple and name the columns separately
val df = spark.createDataFrame(
  Seq(("val1", "val2"))
).toDF("label1", "label2")
Using a JSON-String
val sqlContext = spark.sqlContext
val jsonData= """{ "label1": "val1", "label2": "val2" }"""
val rdd = spark.sparkContext.parallelize(Seq(jsonData))
val df= sqlContext.read.json(rdd)
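Either of the single-row DataFrames above can then be handed to the trained model. A minimal sketch, assuming the fitted pipeline was saved to disk after training and that the column names match the training schema; the path and column handling here are made up:
import org.apache.spark.ml.PipelineModel

val model = PipelineModel.load("/models/my-classifier")  // load once at startup, reuse per request
val prediction = model.transform(df)                     // df = the single-row DataFrame from above
  .select("prediction")
  .head()
  .getDouble(0)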
NOT working: create from a Sequence of case class objects:
val sqlContext = spark.sqlContext
import sqlContext.implicits._
val myData= Seq(Feat("value1", "value2"))
val ds: Dataset[Feat]= myData.toDS()
ds.show(10, false)
This compiles ok, but yields an Exception at runtime:
[error] a.a.OneForOneStrategy - java.lang.RuntimeException:
Error while encoding: java.lang.ClassCastException:
es.core.recommender.Feat cannot be cast to es.core.recommender.Feat
I'd love to include more of the stacktrace, but this glorious editor
won't let me...
It would be nice to know why this alternative did not work...

spark map(func).cache slow

When I use cache to store data, I find that Spark runs very slowly. However, when I don't use the cache method, the speed is very good. My main configuration is as follows:
SPARK_JAVA_OPTS+="-Dspark.local.dir=/home/wangchao/hadoop-yarn-spark/tmp_out_info
-Dspark.rdd.compress=true -Dspark.storage.memoryFraction=0.4
-Dspark.shuffle.spill=false -Dspark.executor.memory=1800m -Dspark.akka.frameSize=100
-Dspark.default.parallelism=6"
And my test code is:
val file = sc.textFile("hdfs://10.168.9.240:9000/user/bailin/filename")
val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).cache().reduceByKey(_+_)
count.collect()
Any answers or suggestions on how I can resolve this are greatly appreciated.
cache is useless in the context you are using it in. In this situation cache says: save the result of the map, .map(word => (word, 1)), in memory. Whereas if you didn't call it, the reducer could be chained to the end of the map and the map results discarded after they are used. cache is better used in a situation where multiple transformations/actions will be called on the RDD after it is created. For example, if you create a data set you want to join to 2 different datasets, it is helpful to cache it, because if you don't, the whole RDD will be recalculated for the second join (there is a short sketch of this case at the end of this answer). Here is an easily understandable example from Spark's website.
val file = spark.textFile("hdfs://...")
val errors = file.filter(line => line.contains("ERROR")).cache() //errors is cached to prevent recalculation when the two filters are called
// Count all the errors
errors.count()
// Count errors mentioning MySQL
errors.filter(line => line.contains("MySQL")).count()
// Fetch the MySQL errors as an array of strings
errors.filter(line => line.contains("MySQL")).collect()
What cache does internally is keep the RDD's data in memory or on disk (depending on the storage level) so that its ancestors do not have to be recomputed every time the RDD is reused. An RDD keeps track of its ancestors so it can be recalculated on demand; this lineage is the recovery mechanism of RDDs.
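And the join case mentioned above as a short, self-contained sketch (the data is made up): without cache(), the second join would recompute users from scratch.
val users = sc.parallelize(Seq(("u1", "Alice"), ("u2", "Bob")))
  .map(identity)                 // stands in for an expensive transformation
  .cache()                       // computed once, reused by both joins below
val orders = sc.parallelize(Seq(("u1", "order-1")))
val payments = sc.parallelize(Seq(("u2", "payment-1")))
users.join(orders).count()       // first action: materializes and caches users
users.join(payments).count()     // second action: reads users from the cache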

Cache consistency when using memcached and a rdbms like MySQL

I have taken a database class this semester and we are studying about maintaining cache consistency between the RDBMS and a cache server such as memcached. The consistency issues arise when there are race conditions. For example:
Suppose I do a get(key) from the cache and there is a cache miss. Because I get a cache miss, I fetch the data from the database, and then do a put(key,value) into the cache.
But, a race condition might happen, where some other user might delete the data I fetched from the database. This delete might happen before I do a put into the cache.
Thus, ideally the put into the cache should not happen, since the data is no longer present in the database.
If the cache entry has a TTL, the entry in the cache might expire. But still, there is a window where the data in the cache is inconsistent with the database.
I have been searching for articles/research papers that discuss this kind of issue, but I could not find any useful resources.
This article gives you an interesting note on how Facebook (tries to) maintain cache consistency : http://www.25hoursaday.com/weblog/2008/08/21/HowFacebookKeepsMemcachedConsistentAcrossGeoDistributedDataCenters.aspx
Here's a gist from the article.
1. I update my first name from "Jason" to "Monkey".
2. We write "Monkey" into the master database in California and delete my first name from memcache in California, but not in Virginia.
3. Someone goes to my profile in Virginia.
4. We find my first name in memcache and return "Jason".
5. Replication catches up and we update the slave database with my first name as "Monkey." We also delete my first name from the Virginia memcache because that cache object showed up in the replication stream.
6. Someone else goes to my profile in Virginia.
7. We don't find my first name in memcache, so we read from the slave and get "Monkey".
How about using a value saved in memcache as a lock signal?
every single memcache command is atomic
after you retrieve data from the db, turn the lock on
after you put the data into memcache, turn the lock off
before deleting from the db, check the lock state
The code below gives some idea of how to use Memcached's operations add, gets and cas to implement optimistic locking to ensure consistency of cache with the database.
Disclaimer: I do not guarantee that it's perfectly correct or that it handles all race conditions. Also, consistency requirements may vary between applications.
def read(k):
    loop:
        get(k)
        if cache_value == 'updating':
            handle_too_many_retries()
            sleep()
            continue
        if cache_value == None:
            add(k, 'updating')
            gets(k)
            get_from_db(k)
            if cache_value == 'updating':
                cas(k, 'value:' + version_index(db_value) + ':' + extract_value(db_value))
            return db_value
        return extract_value(cache_value)

def write(k, v):
    set_to_db(k, v)
    loop:
        gets(k)
        if cache_value != 'updating' and cache_value != None and version_index(cache_value) >= version_index(db_value):
            break
        if cas(k, v):
            break
        handle_too_many_retries()
    # for deleting we can use some 'tombstone' as a cache value
When you read, the following happens:
if (key is not in cache) {
    fetch data from db
    put(key, value)
} else {
    return get(key)
}
When you write, the following happens:
1. delete/update the data in the DB
2. clear the cache
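For illustration, here is the same read/write pattern as a small Scala sketch. It uses the spymemcached client; the in-memory map stands in for the real database, and it still has the race window discussed above:
import java.net.InetSocketAddress
import scala.collection.concurrent.TrieMap
import net.spy.memcached.MemcachedClient

val db = TrieMap[String, String]()                 // stand-in for the RDBMS
val cache = new MemcachedClient(new InetSocketAddress("localhost", 11211))

def read(key: String): String =
  Option(cache.get(key)).map(_.toString).getOrElse {
    val value = db.getOrElse(key, "")              // cache miss: read from the "DB"
    cache.set(key, 300, value)                     // populate the cache with a 300 s TTL
    value
  }

def write(key: String, value: String): Unit = {
  db.put(key, value)                               // 1. delete/update the data in the DB
  cache.delete(key)                                // 2. clear the cache
}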
