Producing comprehensible logs for shell scripts run through xargs - bash

I am currently reindexing Elasticsearch indices in a shell script using curl to do REST API calls.
To improve performance, I am running this script with xargs over 10 processes.
However, all of the processes write to a single log, which makes the log useless/incomprehensible: the output is written asynchronously, so the lines end up interleaved and unordered.
The client would like to track reindexing progress by looking at the logs (i.e. they want to see that 50 of 100 indices are done), and we would also like to keep the curl output for debugging purposes.
What are some ways I can make comprehensible logs?

You could try specifying a separate log file for each process, then concatenating the logs at the end.
But if you really want them all to access the same logfile, then I'm afraid your only way to do that is with a lock, which is not easy with bash/xargs.
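A minimal sketch of the separate-log-per-process approach (the reindex.sh script, the indices.txt list and the logs/ directory are illustrative assumptions, not from the question):

# Each invocation writes to its own log file, keyed by the index it handles.
mkdir -p logs
xargs -P 10 -I {} sh -c './reindex.sh "{}" > "logs/{}.log" 2>&1' < indices.txt

# A rough progress indicator: count how many per-index logs exist so far.
echo "$(ls logs | wc -l)/$(wc -l < indices.txt) indices started"

# When everything has finished, stitch the pieces into one readable log.
cat logs/*.log > reindex-full.log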

Related

How to monitor and control background processes in shell script

I need to write a shell (bash) script that will be executing several Hive queries.
Each of the queries will produce a directory with a lot of files.
After all queries are finished I need to process all these files in a specific order.
I want to run the Hive queries in parallel as background processes, as each one might take a couple of hours.
I would also like to parallelize the processing of the resulting files, but there are some snags that I don't know how to handle. For example, I can start processing the results of the first and second queries as soon as they are finished, but for the third I need to wait until the first two processors are done, and similarly for the fourth and fifth.
I won't have any problems writing such a program in Java, but how to do it in shell beats me.
If someone can give me a hint on how I can monitor the execution of these components in the shell script, I would appreciate it greatly.
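A minimal sketch of the usual bash pattern for this kind of dependency (the hive invocations and the process_results_N steps are placeholder names, not taken from the question):

#!/bin/bash
# Run each query together with its own result processing in a background
# subshell, so processing starts as soon as that particular query finishes.
( hive -f query1.hql && process_results_1 ) & p1=$!
( hive -f query2.hql && process_results_2 ) & p2=$!
# Queries 3-5 can be launched the same way, capturing $! for each.

# The third processing step depends on processors 1 and 2, so wait only
# for those two PIDs before starting it.
wait "$p1" "$p2"
process_results_3

# 'wait' with no arguments blocks until every remaining background job exits.
wait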

ExecuteStreamCommand Hangs when executing Shell Script (that execute Hive)

I have a flow that fetches a file, extracts text, and runs a shell script which in turn runs a Hive script (I am just pulling a date from the file and passing it as a -hivevar). My shell script looks something like this:
#!/bin/bash
endDate=$1
CONNECT="jdbc:hive2://master2:10000/default"
beeline -u "$CONNECT" -n root -hivevar endDate="$endDate" -f /pathToScript/Hive_scipt.hql
The Hive script completes and data is inserted into my table, but the ExecuteStreamCommand stays running (the "1" stays in the top corner) indefinitely and I have to restart NiFi (is there a better way to handle this?).
I've noticed a few things:
If I reduce the size of the query (my Hive query is a number of unions), the ExecuteStreamCommand won't hang.
When the job hangs, the AM on the Resource Manager stays in the Running state for quite some time (~10 min), sort of like when you create a Hive CLI Tez session with one container. When I reduce the query size and the job doesn't hang, the AM goes to the finished state right away.
Running the full query or the script manually via the command line works fine.
The behavior is not consistent: sometimes it won't hang, sometimes it will (most of the time it will).
Any ideas? I couldn't find anything in app.log or the application log.
When you run that from the command line, does it generate a lot of output (on standard out and/or standard error), such as the Tez/MR progress?
Try beeline with the --silent=true option (unless you really need the output for some reason), or the (albeit deprecated) "hive" client with -S. If the output is the problem, this should alleviate it.
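Applied to the script in the question, that would look something like this (only the --silent flag is added; everything else is the original invocation):

# Same call as before, with beeline's progress/log output suppressed.
beeline -u "$CONNECT" -n root --silent=true -hivevar endDate="$endDate" -f /pathToScript/Hive_scipt.hql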
I think your case could be related to the issue I solved in https://issues.apache.org/jira/browse/NIFI-5024.
If your script logs enough to stderr (I can reproduce the bug with 1 MB, but it could be less), then the Unix process running the Hive script and the ExecuteStreamCommand processor will deadlock. The details are in the Jira issue above.
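Until that fix is available, one workaround (my assumption, not something stated in the Jira ticket) is to keep the script's stderr out of the pipe that ExecuteStreamCommand reads, for example by redirecting it to a file:

# Redirect beeline's stderr to a file so the processor's stderr pipe never fills up.
beeline -u "$CONNECT" -n root -hivevar endDate="$endDate" -f /pathToScript/Hive_scipt.hql 2>/tmp/beeline_stderr.log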

Spark: Silently execute sc.wholeTextFiles

I am loading about 200k text files in Spark using val input = sc.wholeTextFiles("hdfs://path/*")
I then run a println(input.count)
It turns out that my spark-shell outputs a ton of text (the path of every file) and after a while it just hangs without returning my result.
I believe this may be due to the amount of text output by wholeTextFiles. Do you know of any way to run this command silently, or is there a better workaround?
Thanks!
How large are your files?
From the wholeTextFiles API:
Small files are preferred, large files are also allowable, but may
cause bad performance.
In conf/log4j.properties, you can suppress excessive logging, like this:
# Only log errors to the console
log4j.rootCategory=ERROR, console
That way, you'll get back only the res result in the REPL, just like in the Scala (the language) REPL.
Here are all the other logging levels you can play with: the log4j API.

Bash piping output and input of a program

I'm running a Minecraft server on my Linux box in a detached screen session. I'm not very fond of screen and would like to continuously pipe the server's output to a file (like a pipe) and pipe input from a file to the server (so that I can read from and write to the server from remote programs, like a Python script). I'm not very experienced in bash, so could somebody tell me how to do this?
Thanks, NikitaUtiu.
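For the "pipe input from a file to the server" part, a named pipe (FIFO) is one common approach. A minimal sketch, assuming the server jar is called minecraft_server.jar and reads console commands from standard input (both of these are assumptions, not details from the question):

# Create a named pipe for commands; the server reads from it and logs everything to a file.
mkfifo /tmp/mc_input
java -jar minecraft_server.jar nogui < /tmp/mc_input >> /tmp/mc_output.log 2>&1 &

# Keep one writer attached so the server does not see EOF when other writers disconnect.
sleep infinity > /tmp/mc_input &

# Any remote program (e.g. a Python script) can now send console commands:
echo "say hello from a script" > /tmp/mc_input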
It's not clear if you need screen at all. I don't know the minecraft server, but generally for server software, you can run it from a crontab entry and redirect output to log files.
Assuming your server kills itself at midnight on Sunday night (we can discuss changing this if restarting once per week is too little or too much, or if you require ad-hoc restarts), here, for a basic idea of what to do, is a crontab entry that starts the server each Monday at 1 minute after midnight.
01 00 * * 1 dtTm=`/bin/date +\%Y\%m\%d.\%H\%M\%S`; export dtTm; { /usr/bin/mineserver -o ..... your_options_to_run_mineserver_here ... ; } > /tmp/mineserver_trace_log.${dtTm} 2>&1
Consult the crontab man page to confirm that the day-of-week range is 0-6 (0 = Sunday), and change the day-of-week value if 0 != Sunday on your system.
Normally I would break the code up so it is easier to read, but crontab entries have to be all on one line (with some weird exceptions), and there is usually a limit of 1024 bytes to 8 KB on how long the line can be. Note that the ';' just before the closing '}' is critical; if it is left out, you'll get undecipherable error messages, or no error messages at all.
Basically, you're redirecting all output into a file (including stderr output). Now you can do a lot with the output: use more or less to look at the file, grep ERR ${logFile}, write scripts that grep for error messages and then email you when errors have been found, and so on.
You may have some sysadmin work on your hands to allow the mineserver user to run crontab entries. Also, if you're not comfortable using the vi or emacs editors, creating a crontab file may require help from others. Post to superuser.com to get answers for Linux admin issues.
Finally, there are two points I'd like to make about dated logfiles.
Good:
a. If your app dies, you never have to rerun it just to capture the output and figure out why something stopped working. For long-running programs this can save you a lot of time.
b. Keeping dated files gives you the ability to prove to yourself, your boss, or others that it used to work just fine: see, here are the log files.
c. Keeping the log files, assuming there is useful information in them, gives you the opportunity to mine them for facts, e.g. the program used to take 1 second for processing and now it is taking 1 hour.
Bad:
a. You'll need to set up a mechanism to sweep old log files; otherwise, at some point everything will have stopped, and when you finally figure out what the problem was, you discover that /tmp (or whatever directory you chose to use) is completely full.
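A minimal sketch of such a sweep, itself runnable from cron (the 14-day retention and the /tmp path are illustrative, matching the crontab entry above):

# Delete dated mineserver logs older than 14 days; run this daily from cron.
find /tmp -name 'mineserver_trace_log.*' -type f -mtime +14 -delete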
There is a self-maintaining solution to using dates on the log files that I can tell you about if you find this approach useful. It would take a little explaining, so I don't want to spend the time writing it up unless you find the crontab solution useful.
I hope this helps!

Locking output file for shell script invoked multiple times in parallel

I have close to a million files over which I want to run a shell script and append the result to a single file.
For example, suppose I just want to run wc on the files.
To make it run fast, I can parallelize it with xargs, but I do not want the scripts to step over each other when writing the output. It is probably better to write to a few separate files rather than one and then cat them later, but I still want the number of such temporary output files to be significantly smaller than the number of input files. Is there a way to get the kind of locking I want, or is this always ensured by default?
Is there any utility that will recursively cat two files in parallel?
I can write a script to do that, but I would have to deal with the temporaries and clean-up, so I was wondering if there is a utility that does this.
GNU parallel claims that it:
makes sure output from the commands is
the same output as you would get had
you run the commands sequentially
If that's the case, then I presume it should be safe to simply pipe the output to your file and let parallel handle the intermediate data.
Use the -k option to maintain the order of the output.
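For example, something along these lines (the input list, job count and output file name are placeholders):

# Run wc on every file, 10 jobs at a time, keeping the output in input order
# and appending everything to a single result file.
find . -type f -print0 | parallel -0 -k -j 10 wc >> wc_results.txt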
Update: (non-Perl solution)
Another alternative would be prll, which is implemented with shell functions with some C extensions. It is less feature-rich compared to GNU parallel but should do the job for basic use cases.
The feature listing claims:
Does internal buffering and locking to
prevent mangling/interleaving of
output from separate jobs.
so it should meet your needs, as long as the order of the output is not important.
However, note the following statement on its page:
prll generates a lot of status
information on STDERR which makes it
harder to use the STDERR output of the
job directly as input for another
program.
Disclaimer: I've tried neither of the tools and am merely quoting from their respective docs.
