I am writing a spark streaming app with online streaming data compared to basic data which i broadcast into each computing node. However, since the basic data is updated daily, i need to update the broadcasted variable daily too. The basic data resides on hdfs.
Is there a way to do this? The update is not related to any online streaming results, just say at 12:00 am everyday. Moreover, if there is such a way, will the updating process block spark streaming computing jobs?
Refer to the last answer in the thread you referred. Summary - instead of sending the data, send the caching code to update data at the needed interval
Create CacheLookup object that updates daily#12 am
Wrap that in Broadcast variable
Use CacheLookup as part of streaming logic
Related
I have persisted machine learning model in hdfs via spark batch job and i am consuming this in my spark streaming. Basically, the ML model is broadcasted to all executors from the spark driver.
Can some one suggest how i can update the model in real time without stopping the spark streaming job? Basically a new ML model will get created as and when more data points are available but not have any idea how the NEW model will need to be sent to the spark executors.
Request to post some sample code as well.
Regards,
Deepak.
The best approach is probably updating the model on each batch. Since you would probably rather not update too often, you probably want to check if you actually need to load the model and skip that if possible.
In your case of a model stored on hdfs, you can just check for a new timestamp on the model file (or a new model present in a directory) before updating the value of the variable holding the loaded model.
Following 1 and 2:
Different types of files enter my NFS directory from time to time. I would like to use OOZIE or any other HDFS solution to trigger the file arrival event and to copy the file into specific location at the HDFS in accordance to its type. What is the best way to do it?
Best way is very subjective term. It largely depends on, what kind of data, frequency and what sorts of things should happen once the data arrive at specific location.
Apache flume can monitor specific folder for data availability and push it down to any sink like HDFS as-is. Flume is good for streaming data.But it does only one specific job- just moving data from place to place.
But on other hand, look up Oozie Coordinators. Coordinators have data availability trigger and with oozie you can perform all sort of ETL operations after data arrives using tools like spark,hive,pig etc and push it down to hdfs using shell actions. You can schedule jobs to run during specific times,frequency or have job send you an email if something goes wrong...
I am developing a Spark application. The application takes data from Kafka queue and processes that data. After processing it stores data in Hbase table.
Now I want to monitor some of the performance attributed such as,
Total count of input and output records.(Not all records will be persisted to Hbase, some of the data may be filtered out in processing)
Average processing time per message
Average time taken to persist the messages.
I need to collect this information and send it to a different Kafka queue for monitoring.
Considering that the monitoring should not incur a significant delay in the processing.
Please suggest some ideas for this.
Thanks.
I am trying to use spark streaming to deal with some order stream, I have some previous computed features for maybe a buyer_id for order in the stream.
I need to get these features while the Spark Streaming is running.
Now, I stored the buyer_id features in a hive table and load it into and RDD and
val buyerfeatures = loadBuyerFeatures()
orderstream.transform(rdd => rdd.leftOuterJoin(buyerfeatures))
to get the pre-computed features.
another way to deal with this is maybe save the features in to a hbase table. and fire a get on every buyer_id.
which one is better ? or maybe I can solve this in another way.
From my short experience:
Loading the necessary data for the computation should be done BEFORE starting the streaming context:
If you are loading inside a DStream operation, this operation will be repeated at each Batch Inteverval time.
If you load each time from Hive, you should seriously consider overhead costs and possible problems during data transfer.
So, if your data is already computed and "small" enough, load it at the beginning of the program in a Broadcast variable or,even better, in a final variable. Either this, or create an RDD before the DStream and keep it as reference (which looks like what you are doing now), although remember to cache it (always if you have enough space).
If you actually do need to read it at streaming time (maybe you receive your query key from the stream), then try to do it once in a foreachPartition and save it in a local variable.
Is there a possibility to perform some action at the end of each micro-batch inside the DStream in Spark Streaming? My aim is to compute number of the events processed by Spark. Spark Streaming gives me some numbers, but the average also seems to sum up zero values (as some micro-batches are empty).
e.g. I do collect some statistics data and want to send them to my server, but the object that collects the data only exists during a certain batch and is initialized from the scratch for the next batch. I would love to be able to call my "finish" method before the batch is done and the object is gone. Otherwise I loose the data that has not been sent to my server.
Maybe you can use StreamingListener:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener