I have a flow that fetches a file, extracts text, and runs a shell script which in turn runs a Hive script. (I am just pulling a date from the file and passing it as -hivevar.) My shell script looks something like this:
#!/bin/bash
endDate=$1
CONNECT="jdbc:hive2://master2:10000/default"
beeline -u "$CONNECT" -n root -hivevar endDate="$endDate" -f /pathToScript/Hive_scipt.hql
The Hive script completes and data is inserted into my table, but the ExecuteStreamCommand stays running (the 1 stays in the top corner) indefinitely and I have to restart NiFi (is there a better way to handle this?).
I've noticed a few things:
- If I reduce the size of the query (my Hive query is a number of UNIONs), the ExecuteStreamCommand won't hang.
- When the job hangs, the AM on the Resource Manager stays in the RUNNING state for quite some time (~10 min), sort of like when you create a Hive CLI Tez session with 1 container. When I reduce the query size and the job doesn't hang, the AM goes to the FINISHED state right away.
- Running the full query or the script manually via the command line works fine.
- Behavior is not consistent. Sometimes it won't hang, sometimes it will (most of the time it will).
Any ideas? I couldn't find anything in app.log or the application log.
When you run that from the command line, does it generate lots of output (on standard out and/or standard error), such as the Tez/MR progress?
Try beeline with the --silent=true option (unless you really need the output for some reason), or the (albeit deprecated) "hive" client with -S. If the output is the problem, this should alleviate it.
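For reference, the script from the question with that flag added might look something like this (same connection string and paths as above, just with --silent=true to suppress the progress chatter):
#!/bin/bash
endDate=$1
CONNECT="jdbc:hive2://master2:10000/default"
# --silent=true stops beeline from streaming the Tez/MR progress to stdout/stderr
beeline -u "$CONNECT" -n root --silent=true -hivevar endDate="$endDate" -f /pathToScript/Hive_scipt.hql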
I think your case could be related to the issue I solved in https://issues.apache.org/jira/browse/NIFI-5024.
If your script is logging enough to stderr (I can reproduce the bug with 1 MB, but it could be less), then the Unix process running the Hive script and the ExecuteStreamCommand processor will enter a deadlock. The details are in the Jira issue above.
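If upgrading to a build that contains that fix isn't an option, one possible workaround is to keep the verbose output out of the processor's streams entirely by redirecting it inside the script. A rough sketch, with the log path as a placeholder:
beeline -u "$CONNECT" -n root -hivevar endDate="$endDate" -f /pathToScript/Hive_scipt.hql >> /tmp/hive_script.log 2>&1
That way ExecuteStreamCommand never has to drain a large stderr stream from the child process.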
Related
I am currently reindexing Elasticsearch indices in a shell script using curl to do REST API calls.
To improve performance, I am running this script with xargs over 10 processes.
However, all of my scripts write into a single log, which makes that log useless/incomprehensible, since the output is written asynchronously and ends up interleaved and unordered.
The client would like to see the progress of the reindexing by looking at the logs (i.e. they want to know whether 50/100 indices are done when they look at the log). We would also like to keep the curl output, etc., for debugging purposes.
What are some ways I can make comprehensible logs?
You could try specifying a separate log file for each process, then concatenating the logs at the end.
But if you really want them all to access the same logfile, then I'm afraid your only way to do that is with a lock, which is not easy with bash/xargs.
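To illustrate the first suggestion, here is a rough sketch of the separate-log-per-process approach; reindex_one.sh, indices.txt, and the log paths are placeholders for your actual curl calls and index list:
mkdir -p logs
xargs -P 10 -I{} sh -c 'reindex_one.sh "$1" > "logs/$1.log" 2>&1; echo "finished $1" >> progress.log' _ {} < indices.txt
cat logs/*.log > reindex-full.log    # merge the per-index logs once everything has finished
The short single-line appends to progress.log are effectively atomic, so it doubles as a readable "N of 100 done" progress log, while the full curl output stays in the per-index files for debugging.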
I need to write a shell (bash) script that will be executing several Hive queries.
Each of the queries will produce a directory with a lot of files.
After all queries are finished I need to process all these files in a specific order.
I want to run the Hive queries in parallel as background processes, as each one might take a couple of hours.
I would also like to parallelize the processing of the resulting files, but there are some complications that I don't know how to handle. For example, I can start processing the results of the first and second queries as soon as they are finished, but for the third, I need to hold off until the first two processors are done. Similarly for the fourth and fifth.
I won't have any problems writing such a program in Java, but how to do it in shell beats me.
If someone can give me a hint on how to monitor the execution of these components in the shell script, I would appreciate it greatly.
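For what it's worth, the shell mechanism for this is to start each query in the background, remember its PID via $!, and then wait on specific PIDs to express the dependencies. A rough sketch, assuming placeholder query files and a process_results command standing in for the real file processing:
#!/bin/bash
hive -f query1.hql & q1=$!
hive -f query2.hql & q2=$!
hive -f query3.hql & q3=$!

wait "$q1"                          # returns as soon as query 1 has finished
process_results results_dir1 & p1=$!

wait "$q2"
process_results results_dir2 & p2=$!

# the third processor holds until its own query and the first two processors are done
wait "$q3" "$p1" "$p2"
process_results results_dir3 & p3=$!

# queries/processors 4 and 5 would follow the same pattern
wait                                # block until everything still running has finished
One simplification in this sketch: the second processor is only launched after the first query has finished, because the main shell waits on the PIDs in order; for a handful of multi-hour jobs that is usually acceptable.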
I'm writing a bash script that essentially fires off a python script that takes roughly 10 hours to complete, followed by an R script that checks the outputs of the python script for anything I need to be concerned about. Here is what I have:
ProdRun="python scripts/run_prod.py"
echo "Commencing Production Run"
$ProdRun #Runs python script
wait
DupCompare="R CMD BATCH --no-save ../dupCompareTD.R" #Runs R script
$DupCompare
Now my issue is that the python script can often generate a whole heap of different processes on our Linux server depending on its input, with lots of different PIDs, AND we have heaps of workers using the same server firing off scripts. As far as I can tell from reading, the 'wait' command waits either for all processes to finish or for a specific PID to finish; but when I cannot tell what or how many PIDs will be assigned, or how many processes will run, how exactly do I use it?
EDIT: Thank you to all who helped; here is what caused my dilemma, for anyone Google-searching this. I broke the ProdRun python script up into the individual scripts it was itself calling, but still had the issue. I then found that one of those scripts was also calling another, smaller script with a "&" at the end of the call, which made it ignore any attempt to wait on it inside the python script itself. Simply removing the "&" and invoking it with a line of "os.system()" allowed all the code to run sequentially.
It sounds like you are trying to implement a job scheduler, possibly with some complex dependencies between different tasks. I recommend using a job scheduler instead. It lets you specify how those jobs should run, while also giving you features like monitoring and the handling of exceptional cases and errors.
Examples are the open-source Rundeck (https://github.com/rundeck/rundeck) or the commercial Control-M (http://www.bmcsoftware.uk/it-solutions/control-m.html).
Make your Python program wait on the children it spawns. That's the proper way to fix this scenario. Then you don't have to wait for Python after it finishes (sic).
(Also, don't put your commands in variables.)
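A minimal sketch of the wrapper with both points applied, reusing the commands from the question (and assuming the python script itself now waits on anything it spawns):
#!/bin/bash
echo "Commencing Production Run"
python scripts/run_prod.py               # runs in the foreground, so the script only
                                         # continues once the python process has exited
R CMD BATCH --no-save ../dupCompareTD.R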
I have a script named program.rb and would like to write a script named main.rb that would do the following:
system("ruby", "program.rb")
constantly check if program.rb is running until it is done
if program.rb has reached completion
exit main.rb
end
otherwise keep doing this until program.rb reaches completion{
if program.rb is not running and stopped before completing
restart program.rb from where it left off
end}
I've looked into Pidify but could not find a way to apply it that fits this in exactly the right way...
Any help in how to approach this script would be greatly appreciated!
Update:
I could figure out how to resume the script from where it left off inside program.rb itself, if there's no way to do it from main.rb.
It's impossible to "restart the script from where it left off" without full cooperation from program.rb. That is, it should be able to advertise its progress (by writing its current state to a file, maybe?) and be able to start correctly from a step specified in ARGV. There's no external Ruby magic that can replace this functionality.
Also, if a program terminated abnormally, it means one of two things:
the error is (semi-)permanent (disk is full, no appropriate access rights to a file, etc). In this case, simply restarting the program would cause it to fail again. And again. Infinite fail loop.
the error is temporary (a shaky internet connection, say). In this case, the program should do a better job with exception handling and retry on its own (instead of terminating).
In either case, there's no need for restarting, IMHO.
Well, here is one way.
Modify program.rb to take an optional flag argument --restart or something.
When program.rb starts up without this argument it will initialize a file to record its current state. Periodically, it will write whatever it needs into this file to record some kind of checkpoint.
When program.rb starts up with the restart flag, it will read its checkpoint file and start processing at that point. For this to work, it must either checkpoint all state changes or arrange for all processing between checkpoints to be idempotent so it can be repeated without ill effect.
There are lots of ways to monitor the health of program.rb. The best is some sort of ping, perhaps something like GET /health_check or a dummy message via a socket or pipe. You could also just use a lock file and detect whether the lock is still held, or record the PID on startup and check that the process still exists.
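Whichever language the supervisor ends up in, the PID check itself is trivial. As an illustration only, in shell it might look like this (program.pid is a placeholder path that program.rb would write on startup):
if [ -f program.pid ] && kill -0 "$(cat program.pid)" 2>/dev/null; then
    echo "program.rb is still running"
else
    ruby program.rb --restart   # relaunch from the last checkpoint
fi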
Hi, I am trying to use screen as part of a cron job.
Currently I have the following command:
screen -fa -d -m -S mapper /home/user/cron
Is there any way I can make this command do nothing if the mapper screen already exists? The mapper runs on a half-hour cron schedule, but sometimes the mapping takes more than half an hour to complete, so the runs end up overlapping, slowing each other down and sometimes even making the next one slow, and I end up with lots of mapper screens running.
Thanks for your time,
ls /var/run/screen/S-"$USER"/*.mapper >/dev/null 2>&1 || screen -S mapper ...
This will check if any screen sessions named mapper exist for the current user, and only if none do, will launch the new one.
Why would you want a job run by cron, which (by definition) does not have a terminal attached to it, to do anything with the screen? According to Wikipedia, 'GNU Screen is a software application which can be used to multiplex several virtual consoles, allowing a user to access multiple separate terminal sessions'.
However, assuming there is some reason for doing it, then you probably need to create a lock file which the process checks before proceeding. At this point, you need to run a shell script from the cron entry (which is usually a good technique anyway), and the shell script can check whether the previous run of the task has completed, exiting if not. If the previous incarnation is complete, then the current incarnation creates a lock file containing its PID and runs the job. When it completes, it removes the lock file. You should review the shell trap command, and make sure that the lock file is removed if the shell script exits as a result of a trappable signal (you can't trap KILL and some process-control signals).
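A rough sketch of that wrapper, with the lock path as a placeholder and the mapping job run in the foreground so the lock lives exactly as long as the job:
#!/bin/bash
LOCK=/tmp/mapper.lock                    # placeholder path for the lock file

# if the previous run is still alive, do nothing
if [ -e "$LOCK" ] && kill -0 "$(cat "$LOCK")" 2>/dev/null; then
    exit 0
fi

echo $$ > "$LOCK"
trap 'rm -f "$LOCK"' EXIT                # remove the lock when the script ends
trap 'exit 1' INT TERM HUP               # so the EXIT trap also fires on these signals

/home/user/cron                          # the actual mapping job
Where it is available, wrapping the job in flock(1) achieves the same thing with less ceremony and without the small window between the liveness check and the lock-file write.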
Judging from another answer, the screen program already creates lock files; you may not have to do anything special to create them - but will need to detect whether they exist. Also check the GNU manual for screen.