ML model update in Spark Streaming

I have persisted a machine learning model in HDFS via a Spark batch job, and I am consuming it in my Spark Streaming job. Basically, the ML model is broadcast to all executors from the Spark driver.
Can someone suggest how I can update the model in real time without stopping the Spark Streaming job? A new ML model will be created as more data points become available, but I have no idea how the new model should be sent to the Spark executors.
Please post some sample code as well.
Regards,
Deepak.

The best approach is probably to refresh the model at the start of each batch. Since you would rather not reload it every time, first check whether a reload is actually needed and skip it otherwise.
In your case of a model stored on HDFS, you can check for a new timestamp on the model file (or for a new model file in a directory) before updating the variable that holds the loaded model.
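The timestamp check described above can be sketched as follows. This is a minimal illustration, not Spark code: it assumes the model is a pickled object on a local path standing in for the HDFS file, and `ReloadingModel` is a hypothetical name. In a real job, the modification time would come from the HDFS file status, and `get()` would be called once per batch on the driver before the model is used.

```python
import os
import pickle


class ReloadingModel:
    """Holds a model loaded from a file and reloads it only when the file changes."""

    def __init__(self, path):
        self.path = path          # assumption: local path standing in for the HDFS file
        self.model = None
        self.loaded_mtime = None  # modification time of the version currently loaded

    def get(self):
        # Cheap metadata check; the expensive load runs only when the timestamp differs.
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self.loaded_mtime:
            with open(self.path, "rb") as f:
                self.model = pickle.load(f)
            self.loaded_mtime = mtime
        return self.model
```

Calling `get()` on every batch keeps the common case down to a single metadata lookup, so the streaming job only pays the load cost when the batch job has actually written a new model.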

Related

Can we use cached RDD across batches on an executor

I have a case where I want to download some data from a remote store every hour and store it as key-value pairs in an RDD on an executor/worker. I want to cache this RDD so that all future jobs/tasks/batches running on this executor/worker can use the cached RDD to do a lookup. Is this possible in Spark Streaming?
Some relevant code or pointers to relevant code will be helpful.
Alluxio is a memory-centric distributed storage system. Alluxio can be used to cache Spark RDDs in memory, for multiple and future Spark applications and jobs to access.
Spark can store RDDs in Alluxio memory, and future Spark jobs can read them from Alluxio memory. That blog post has more details on how that works. Here is information on how to set up and configure Alluxio with Spark.
Given your requirements, here is what I would propose:
Run a Spark application every hour that fetches the data from the external data source and appends it to a Hive table.
Use the Spark Thrift Server to access the data.
Note: Your notion of "caching within an executor to use across applications" is not correct. Executors belong to a single Spark application, and so does any RDD within that application.
If you really need to invest in caching data on distributed nodes, you may want to consider an off-heap, in-memory storage system such as Alluxio (formerly Tachyon).
If you just need a giant, distributed map, and you want to use Spark, write a standalone job that downloads the data every hour and caches the RDD thus obtained (you can unpersist the old RDD). Let us call this job DataRefresher.
You can then expose a REST API (if you are on Scala, consider using Scalatra) that wraps the DataRefresher and returns the value for a given key. Something like: http://localhost:9191/lookup/key, which can be used by other jobs to do a relatively fast lookup.
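A rough sketch of the DataRefresher idea follows, written in Python with only the standard library rather than Spark and Scalatra. A plain dict stands in for the cached RDD, `download` is a hypothetical callable that fetches the key-value data, and `make_server` is an illustrative helper; only the `/lookup/key` path and port 9191 come from the answer above.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class DataRefresher:
    """Keeps the latest key-value snapshot and swaps in a new one on refresh."""

    def __init__(self, download):
        self._download = download  # hypothetical callable returning a dict
        self._lock = threading.Lock()
        self._data = {}

    def refresh(self):
        fresh = dict(self._download())  # build the new snapshot first
        with self._lock:
            self._data = fresh          # then replace the old one in one step

    def lookup(self, key):
        with self._lock:
            return self._data.get(key)


def make_server(refresher, port=9191):
    """Wrap a DataRefresher in a small HTTP server answering GET /lookup/<key>."""

    class LookupHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            prefix = "/lookup/"
            if not self.path.startswith(prefix):
                self.send_error(404)
                return
            value = refresher.lookup(self.path[len(prefix):])
            if value is None:
                self.send_error(404)
                return
            body = json.dumps({"value": value}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet

    return ThreadingHTTPServer(("localhost", port), LookupHandler)
```

To use it, call `refresh()` from an hourly timer or scheduler and run `make_server(refresher).serve_forever()` in the same process. Lookups keep hitting the old snapshot until the new one is complete, so readers never see a half-loaded map.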

spark streaming broadcast variable daily update

I am writing a Spark Streaming app that compares online streaming data against basic data, which I broadcast to each computing node. However, since the basic data is updated daily, I need to update the broadcast variable daily too. The basic data resides on HDFS.
Is there a way to do this? The update is not related to any online streaming results; it just happens at, say, 12:00 am every day. Moreover, if there is such a way, will the update block the Spark Streaming computing jobs?
Refer to the last answer in the thread you referred to. Summary: instead of sending the data, send the caching code that updates the data at the needed interval.
Create a CacheLookup object that updates daily at 12 am
Wrap that in a broadcast variable
Use CacheLookup as part of the streaming logic
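The CacheLookup step could look roughly like this. It is a sketch under assumptions: `loader` is a hypothetical function that reads the basic data from HDFS and returns a dict, and the injectable `now` clock exists only to make the behavior easy to check. The object reloads lazily on the first lookup after the calendar date changes, which approximates "daily at 12 am" without a separate timer.

```python
from datetime import datetime


class CacheLookup:
    """Lookup table that reloads itself on the first access of each new day."""

    def __init__(self, loader, now=datetime.now):
        self._loader = loader    # hypothetical: reads the basic data, returns a dict
        self._now = now          # injectable clock, defaults to the system time
        self._data = None
        self._loaded_on = None   # calendar date of the last successful load

    def get(self, key, default=None):
        today = self._now().date()
        if self._loaded_on != today:
            # First lookup after midnight (or first lookup ever): reload.
            self._data = self._loader()
            self._loaded_on = today
        return self._data.get(key, default)
```

With this design only the lookup that triggers the reload waits for the load to finish; every other lookup on that day is a plain dict access.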

Loading data into HIVE to support front end application

We have a data warehousing application which we are planning to convert to Hadoop.
Currently, we receive 20 feeds on a daily basis and load this data into a MySQL database.
As the data is getting large, we are planning to move to Hadoop for faster query processing.
As a first step, we are planning to load the data into Hive on a daily basis instead of MySQL.
Questions:
1. Can I set up Hadoop like a DWH application to process files on a daily basis?
2. When I load the data on the master node, will it be synced automatically?
It really depends on the size of your data. The question is a bit complex, but in general you will have to design your own pipeline.
If you are analyzing raw logs, HDFS will be a good place to start. You can use Java, Python or Scala to schedule the Hive jobs on a daily basis, and use Sqoop if you still need some MySQL data.
In Hive you will have to create a partitioned table so that newly loaded data is synced and available at query time. Partition creation can also be scheduled.
I would suggest going with Impala instead of Hive, as it is more tunable, fault tolerant and easier to use.
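As an illustration of the scheduled partition creation mentioned above, a daily job could generate a statement like the one below and submit it to Hive. This is only a sketch: the table name `feeds`, the partition column `dt`, and the directory layout are assumptions, not details from the question.

```python
from datetime import date


def add_partition_sql(table, day, base_path):
    """Build the HiveQL that registers one day's directory as a partition."""
    dt = day.isoformat()
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt='{dt}') LOCATION '{base_path}/dt={dt}'"
    )


# Hypothetical table and path, for illustration only.
print(add_partition_sql("feeds", date(2024, 1, 15), "/data/feeds"))
```

Because of `IF NOT EXISTS`, the statement is safe to rerun, so a retried daily job does not fail on a partition that was already registered.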

Processing data in couchdb with hadoop + mapreduce

I have a very very large quantity of data in CouchDB, but I have very recently found out how crippled the mapreduce functions in couch are (no chaining).
So I had this idea of running MapReduce queries against the CouchDB database using Hadoop, and hopefully storing the final result in another CouchDB database.
Is this too crazy? I know I can set up HBase to do this, but I do not want to migrate my data from CouchDB to HBase. And I love Couch as a data store.
Apparently CouchDB is supposed to be able to stream data to Hadoop via Sqoop, but I didn't see any other information than that link. Worst case, you can write your own input reader to read from CouchDB, or export your data regularly and throw it onto HDFS and run it from there.
The MapReduce functions in CouchDB are constrained to simplify caching of the results. Rather than having to search for views that are impacted by a change, views were designed to be self-contained.
This means that if you have complex MapReduce code, you can use a tool like CouchApp to embed functions within a MapReduce function. I'm having trouble finding the reference for this, but you can use the macro !code to embed JavaScript functions in views. See the question "Using require() or // !json, !code in CouchDB?".
This could help to get some of the productivity benefit of chaining without chaining, by putting most of the code in shared functions, and merely calling the function in the different views. For the performance benefit of chaining, if that's what you're after, you may be better off just moving to HBase.

Can Hadoop MapReduce run over other filesystems?

I heard that the input for MapReduce jobs need not be in HDFS; it can be on another file system. Can someone please give me more details on this?
I am a little confused about this. In standalone mode, data can be on the local file system. But in cluster mode, how can we point MapReduce jobs to some other file system?
No, it does not need to be in HDFS. For instance, jobs that target HBase through its TableInputFormat pull records over the network from HBase nodes as input to their map tasks. DBInputFormat can be used to pull data from a SQL database into a job. You could also build an input format that reads data off an NFS mount.
In practice you want to avoid pulling data over the network if you can. MR performance is much better if you can have your data locally on the nodes where the job is being run since Disk Throughput > Network Throughput.
Based on the InputFormat set on the job, Hadoop can read from any source. Hadoop provides a couple of InputFormats. It is also not difficult to write a custom InputFormat, say to feed a proprietary format into a job.
On the same lines Hadoop provides a couple of OutputFormats and it shouldn't be difficult to write a custom OutputFormat also.
Here is a nice article on the DBInputFormat.
Another way to achieve this is to put files into HDFS that describe where the real data is. The mapper reads this information and pulls the real data for processing.
For example, we can have several files containing URLs of the data to be processed.
What we lose in this case is data locality - otherwise it is fine.
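The pointer-file approach can be sketched as a mapper that treats each input line as a URL and fetches the real data itself. This is a hedged illustration in plain Python, in the style of a Hadoop Streaming mapper: `fetch_url` and `map_pointers` are hypothetical names, and emitting the payload size is just a placeholder for the real processing.

```python
import urllib.request


def fetch_url(url):
    """Pull the real data for one pointer line over the network."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def map_pointers(lines, fetch=fetch_url):
    """Map step: each input line is a URL; emit (url, payload size in bytes)."""
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines in the pointer file
        payload = fetch(url)
        # Placeholder processing: a real mapper would parse the payload here.
        yield url, len(payload)
```

As a streaming mapper, this would be driven with `sys.stdin` as `lines` and would print each pair as a tab-separated line. Every record costs a network fetch, which is the loss of data locality noted above.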

Resources