What services can I turn off to trigger a second application attempt? - hadoop

Hadoop YARN includes a configuration to modify how many times an application can be started: yarn.resourcemanager.am.max-attempts.
I am interested in hitting this limit to observe how the system may fail, and I want to be able to do it without modifying code. To mimic production scenarios, I would like to turn off other Hadoop services to cause a second attempt of the application.
What services can I turn off during the application run to trigger another application attempt?

The simplest option is to stop the storage services that host your source or target data, for example the HDFS service or the Hive service.


How do I run a cron inside a kubernetes pod/container which has a running spring-boot application?

I have a spring-boot application running on a container. One of the APIs is a file upload API and every time a file is uploaded it has to be scanned for viruses. We have uvscan to scan the uploaded file. I'm looking at adding uvscan to the base image but the virus definitions need to be updated on a daily basis. I've created a script to update the virus definitions. The simplest way currently is to run a cron inside the container which invokes the script. Is there any other alternative to do this? Can the uvscan utility be isolated from the app pod and invoked from the application?
There are many ways to solve the problem. I hope I can help you find the one that suits you best.
From my perspective, it would be pretty convenient to have a CronJob that builds and pushes the new docker image with uvscan and the updated virus definition database on a daily basis.
In your file processing sequence, you can create a scan Job through the Kubernetes API and give it access to a shared volume containing the file you need to scan.
The scan Job will use the :latest image, so when a new image appears in the registry, the Job will pull it and create its pod from it.
The downside is that building images daily consumes "some" amount of disk space, so you may need to set up a process for removing old images from the registry and from the Docker cache on each node of the Kubernetes cluster.
Alternatively, you can put the AV database on a shared volume, or use Mount Propagation, and update it independently of the pods. This should be possible if uvscan opens the AV database in read-only mode.
On the other hand, it usually takes time to load virus definitions into memory, so it might be better to run the virus scanner as a Deployment rather than as a Job, and restart it daily after the new image has been pushed to the registry.
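As a rough illustration of the scan Job idea above, here is a minimal sketch that only builds the Job manifest as a plain Python dictionary. The function name, the image name, the claim name, the /uploads mount path, and the bare "uvscan <file>" invocation are all assumptions made for illustration; submitting the manifest to the cluster (for example with a Kubernetes client library) is left out.

```python
import json
from typing import Any, Dict


def build_scan_job(job_name: str, image: str, claim_name: str,
                   file_path: str) -> Dict[str, Any]:
    """Build a batch/v1 Job manifest that scans one file on a shared volume.

    Assumptions (not from the original answer): the shared volume is a
    PersistentVolumeClaim mounted at /uploads, and the scanner is started
    as `uvscan <file>`.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed scan automatically
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "uvscan",
                        "image": image,
                        # Always pull, so a freshly pushed image is picked up
                        "imagePullPolicy": "Always",
                        "command": ["uvscan", file_path],
                        "volumeMounts": [{
                            "name": "uploads",
                            "mountPath": "/uploads",
                            "readOnly": True,
                        }],
                    }],
                    "volumes": [{
                        "name": "uploads",
                        "persistentVolumeClaim": {"claimName": claim_name},
                    }],
                }
            },
        },
    }
```

Calling json.dumps(build_scan_job("scan-1", "registry.example.com/uvscan:latest", "uploads-pvc", "/uploads/a.bin")) gives a document that could be handed to the Kubernetes API; all four argument values here are hypothetical.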
At my place of work, we also run our dockerized services within EC2 instances. If you only need to update the definitions once a day, I would recommend using an AWS Lambda function. It's relatively affordable and you don't need to worry about the overhead of a scheduler, etc. If you need help setting up the Lambda, I can provide more context. That said, I'm only offering another solution within the AWS ecosystem.
So basically I simply added a cron to the application running inside the container to update the virus definitions.

Apache NIFI Job is not terminating automatically

I am new to the Apache NIFI tool. I am trying to import data from MongoDB and put that data into HDFS. I have created two processors, one for MongoDB and one for HDFS, and I configured them correctly. The job runs successfully and stores the data in HDFS, but it should terminate automatically on success. It does not, and it keeps creating too many files in HDFS. I want to know how to make an on-demand job in NIFI and how to determine that a job is successful.
GetMongo will continue to pull data from MongoDB based on the provided properties such as Query, Projection, and Limit. It has no way of tracking the execution process, at least for now. What you can do, however, is change the Run Schedule and/or Scheduling Strategy. You can find them by right-clicking the processor and clicking Configure. By default, Run Schedule is 0 sec, which means the processor runs continuously. Changing it to, say, 60 min will make the processor run once every hour. This will still read the same documents from MongoDB again every hour, but since you mentioned that you want to run it only once, I'm suggesting this approach.

Running spark cluster on standalone mode vs Yarn/Mesos

Currently I am running my Spark cluster in standalone mode. I am reading data from flat files or Cassandra (depending on the job) and writing the processed data back to Cassandra itself.
I was wondering if I switch to Hadoop and start using a Resource manager like YARN or mesos, does it give me an additional performance advantage like execution time and better resource management?
Currently, when I process a huge chunk of data, a stage sometimes fails during shuffling. If I migrate to YARN, can the resource manager address this issue?
Spark standalone cluster manager can also give you cluster mode capabilities.
Spark standalone cluster will provide almost all the same features as the other cluster managers if you are only running Spark.
When you submit your application in cluster mode, all your job-related files are copied onto one of the machines in the cluster, which then submits the job on your behalf. If you submit the application in client mode, the machine from which the job is submitted takes care of the driver-related activities. This means that in client mode the submitting machine cannot go offline, whereas in cluster mode it can.
Having a Cassandra cluster would also not change any of these behaviors, except that it can save you network traffic if you can get the nearest contact point for the Spark executor (just like data locality).
Failed stages get rescheduled whichever of the cluster managers you use.
In the standalone cluster model, each application by default uses all the available nodes in the cluster.
From spark-standalone documentation page:
The standalone cluster mode currently only supports a simple FIFO scheduler across applications. However, to allow multiple concurrent users, you can control the maximum number of resources each application will use. By default, it will acquire all cores in the cluster, which only makes sense if you just run one application at a time.
In other cases (when you are running multiple applications in the cluster), you may prefer YARN.
Not sure since your application logic is not known. But you can give a try with YARN.
Have a look at related SE question for benefits of YARN over Standalone and Mesos:
Which cluster type should I choose for Spark?
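The resource cap mentioned in the documentation excerpt above can be set at submit time through the spark.cores.max property. The following is a minimal sketch that only assembles a spark-submit command line for a standalone master; the function name is made up, and the master URL, main class, and jar path passed to it are placeholders you would replace with your own.

```python
from typing import List


def standalone_submit_command(master_url: str, main_class: str, app_jar: str,
                              max_cores: int,
                              deploy_mode: str = "cluster") -> List[str]:
    """Assemble a spark-submit command that caps the cores one app may take.

    spark.cores.max limits how many cores the application requests from a
    standalone cluster, so that several applications can run side by side.
    """
    if max_cores < 1:
        raise ValueError("max_cores must be at least 1")
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", master_url,
        "--deploy-mode", deploy_mode,
        "--class", main_class,
        "--conf", f"spark.cores.max={max_cores}",
        app_jar,
    ]
```

For example, standalone_submit_command("spark://master-host:7077", "com.example.MyJob", "/path/to/my-job.jar", 8) returns an argument list that can be passed to subprocess.run; the host, class, and jar names are hypothetical.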

Manually start HDFS every time I boot?

Laconically: Should I start HDFS every time I come back to the cluster after a power-off operation?
I have successfully created a Hadoop cluster (after losing some battles) and now I want to be very careful about proceeding with this.
Should I execute start-dfs.sh every time I power on the cluster, or it's ready to execute my application's code? Same for start-yarn.sh.
I am afraid that if I run it without everything being fine, it might leave garbage directories after execution.
Just from playing around with the Hortonworks and Cloudera sandboxes, I can say turning them on and off doesn't seem to demonstrate any "side-effects".
However, it is necessary to start the needed services every time the cluster starts.
As far as power cycling goes in a real cluster, it is recommended to stop the services running on the respective nodes before powering them down (stop-dfs.sh and stop-yarn.sh). That way there are no weird problems and any errors on the way to stopping the services will be properly logged on each node.
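The start and stop routine described above can be wrapped in a small helper. This is only a sketch under assumptions: it assumes the four scripts are on the PATH (they normally live in the Hadoop sbin directory), and it follows the common convention of bringing HDFS up before YARN and taking YARN down before HDFS. The function names are invented for illustration.

```python
import subprocess
from typing import List

# Assumed convention: HDFS comes up before YARN and goes down after it.
START_ORDER = ["start-dfs.sh", "start-yarn.sh"]
STOP_ORDER = ["stop-yarn.sh", "stop-dfs.sh"]


def scripts_for(action: str) -> List[str]:
    """Return the Hadoop scripts to run, in order, for 'start' or 'stop'."""
    if action == "start":
        return list(START_ORDER)
    if action == "stop":
        return list(STOP_ORDER)
    raise ValueError("action must be 'start' or 'stop'")


def run_scripts(action: str) -> None:
    """Run each script in order and abort on the first failure."""
    for script in scripts_for(action):
        # check=True raises CalledProcessError if a script exits non-zero
        subprocess.run([script], check=True)
```

Calling run_scripts("stop") before powering the nodes down, and run_scripts("start") after they are back up, mirrors the manual procedure.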

How to collect Hadoop userlogs?

I am running M/R jobs and logging errors when they occur, rather than making the job fail. There are only a few errors, but the job is run on a hadoop cluster with hundreds of nodes. How to search in task logs without having to manually open each task log in the web ui (jobtaskhistory)? In other words, how to automatically search in M/R task logs that are spread all over the cluster, stored in each node locally?
Side note first: 2.0.0 is oldy moldy (that's the "beta" version of 2.0); you should consider upgrading to a newer stack (e.g. 2.4, 2.5, 2.6).
Starting with 2.0, Hadoop implemented what's called "log aggregation" (though it's not what you would think: the logs are just stored on HDFS). There are a bunch of command line tools that you can use to get the logs and analyze them without having to go through the UI. This is, in fact, much faster than the UI.
Check out this blog post for more information.
Unfortunately, even with the command line tool, there's no way to get all task logs at the same time and pipe them to something like grep. You'll have to fetch each task log with a separate command. However, this is at least scriptable.
The Hadoop community is working on a more robust log analysis tool that will not only store the job logs on HDFS, but will also give you the ability to perform search and other analyses on these logs. However, this tool is still a ways out.
This is how we did it (large internet company): we made sure that only very critical messages were logged, and for those messages we actually did use System.err.println. Please keep the aggregate messages per tracker/reducer to only a few KB.
The majority of messages should still use the standard log4j mechanism (which goes to the System logs area)
Go to your http://sandbox-hdp.hortonworks.com:8088/cluster/apps
There look for the instantiation of the execution you are interested in, and for that entry click the History link (in the Tracking UI column),
then look for the Logs link (in the Logs column), and click on it
yarn logs -applicationId <myAppId> | grep ...
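The command above can be scripted across several applications. Here is a minimal sketch that runs yarn logs -applicationId once per application and keeps only the lines matching a regular expression. It assumes log aggregation is enabled and the yarn command is on the PATH; the function names are invented for illustration.

```python
import re
import subprocess
from typing import Iterable, Iterator, List


def yarn_logs_command(app_id: str) -> List[str]:
    """Build the `yarn logs` command line for one application."""
    return ["yarn", "logs", "-applicationId", app_id]


def matching_lines(log_text: str, pattern: str) -> Iterator[str]:
    """Yield every line of log_text that matches the regular expression."""
    regex = re.compile(pattern)
    for line in log_text.splitlines():
        if regex.search(line):
            yield line


def search_app_logs(app_ids: Iterable[str], pattern: str) -> Iterator[str]:
    """Fetch the aggregated logs of each application and yield matches."""
    for app_id in app_ids:
        result = subprocess.run(
            yarn_logs_command(app_id),
            capture_output=True, text=True, check=False,
        )
        for line in matching_lines(result.stdout, pattern):
            # Prefix each hit with the application it came from
            yield f"{app_id}: {line}"
```

Iterating over search_app_logs(my_app_ids, "ERROR") then prints or collects the error lines of every listed application, where my_app_ids is your own list of application IDs.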
