Spark saveToEs asynchronously - elasticsearch

We have a Spark Streaming job which writes its output to Elasticsearch. When Elasticsearch is slow for any reason, the Spark job waits indefinitely, which causes it to accumulate data and has a snowballing effect on the streaming job itself. The only way to make the streaming job stable again is to restart it.
Is there a way to make Spark's writes to Elasticsearch asynchronous? I looked at https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html and do not see any option for async writes.
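For reference, the linked configuration page has no async-write switch, but the es.batch.* settings are the knobs es-hadoop does expose for bulk sizing and retry behaviour when Elasticsearch is slow. Below is a minimal sketch of how those settings are passed to a streaming job; the host, index name, batch values, and the socket source are all placeholder assumptions, not a recommendation:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.elasticsearch.spark._   // adds saveToEs to RDDs

    object SaveToEsSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("streaming-to-es")
          .set("es.nodes", "es-host:9200")           // placeholder host
          .set("es.batch.size.entries", "1000")      // docs per bulk request
          .set("es.batch.size.bytes", "1mb")         // bytes per bulk request
          .set("es.batch.write.retry.count", "3")    // bulk retries before failing
          .set("es.batch.write.retry.wait", "10s")   // wait between retries

        val ssc = new StreamingContext(conf, Seconds(10))
        val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source

        lines.foreachRDD { rdd =>
          rdd.map(line => Map("message" -> line))
             .saveToEs("logs/doc")                   // placeholder index/type
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }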

Related

Is it possible to use Spot instances for a Flink batch job on EMR?

I have a Flink streaming job running in batch mode (mode=Batch) and I want to optimize costs, so I wonder whether anyone has experience using Spot instances for Flink on EMR.
What should I be cautious about?
What should I take into consideration?
Can Flink schedule a job manager on a Task node?
What happens if one of the instances that holds the state of a previous stage's computation results fails? Would both failover regions (the previous one and the running one) have to be recomputed?
Currently I am using only on-demand instances in the EMR cluster.

Apache NiFi job is not terminating automatically

I am new to the Apache NiFi tool. I am trying to import data from MongoDB and put that data into HDFS. I have created two processors, one for MongoDB and a second for HDFS, and configured them correctly. The job runs successfully and stores the data into HDFS, but it should terminate automatically on success. Instead it keeps running and creates too many files in HDFS. I want to know how to make an on-demand job in NiFi and how to determine that a job is successful.
GetMongo will continue to pull data from MongoDB based on the provided properties such as Query, Projection, and Limit. It has no way of tracking the execution process, at least for now. What you can do, however, is change the Run Schedule and/or Scheduling Strategy. You can find them by right-clicking the processor and selecting Configure. By default, the Run Schedule is 0 sec, which means the processor runs continuously. Changing it to, say, 60 min will make the processor run once every hour. It will still re-read the same documents from MongoDB each hour, but since you mentioned that you only want to run the job once, this is the closest approach.

Apache Spark job complete but Hadoop job still running

I'm running a large Spark job (about 20 TB in, stored to HDFS) alongside Hadoop. The Spark console shows the job as complete, but Hadoop still thinks the job is running: both the console and the logs keep reporting 'running'.
How long should I be waiting until I should be worried?
You can try to stop the Spark context cleanly. If you haven't closed it, add a SparkContext stop() call at the end of the job. For example:
sc.stop()
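A minimal sketch of that suggestion, with the stop in a finally block so the Hadoop side sees the application finish even if the job body throws; the app name, input path, and job body are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object JobWithCleanStop {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("clean-stop-example"))
        try {
          // placeholder job body
          val count = sc.textFile("hdfs:///input/path").count()
          println(s"processed $count lines")
        } finally {
          sc.stop()   // always close the context so the application is marked finished
        }
      }
    }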

What is the best way to minimize the initialization time for Apache Spark jobs on Google Dataproc?

I am trying to use a REST service to trigger Spark jobs using the Dataproc API client. However, each job inside the Dataproc cluster takes 10-15 s to initialize the Spark driver and submit the application. I am wondering if there is an effective way to eliminate the initialization time for Spark Java jobs triggered from a JAR file in a GCS bucket. Some solutions I am thinking of are:
Pooling a single instance of JavaSparkContext that can be used for every Spark job
Starting a single job and running all Spark-based processing inside that single job
Is there a more effective way? How would I implement the above ways in Google Dataproc?
Instead of writing this logic yourself, you may want to investigate the Spark Job Server (https://github.com/spark-jobserver/spark-jobserver), as it should allow you to reuse Spark contexts.
You can also write a driver program for Dataproc which accepts RPCs from your REST server and reuses the SparkContext yourself, then submit this driver via the Jobs API, but I would personally look at the job server first.
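A rough sketch of that second option, assuming a long-running driver submitted once via the Jobs API that keeps a single SparkContext alive and serves work requests. The blocking queue stands in for whatever RPC mechanism the REST server would actually use, and all names and paths are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import java.util.concurrent.LinkedBlockingQueue

    object LongRunningDriver {
      // An RPC handler (not shown) would call requests.put(inputPath) for each request.
      private val requests = new LinkedBlockingQueue[String]()

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("shared-context-driver"))
        try {
          while (true) {
            val inputPath = requests.take()              // block until a request arrives
            val lineCount = sc.textFile(inputPath).count()
            println(s"$inputPath -> $lineCount lines")   // reuse the same context per request
          }
        } finally {
          sc.stop()
        }
      }
    }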

Event-based triggering of a Pig job

I have a system which ingests application log data into Hadoop using Flume. I am indexing this data in Elasticsearch by running a Pig script that loads data from Hadoop into ES. Now I need to automate this task so that the script is triggered every time new lines are appended, or so that whenever it is triggered it loads only the newly written lines. Can anyone tell me how to achieve this?
