ignore but recall malformed data: iterate & process folder with bash script + .jar - bash

There is a folder full of files, each of which contains some data that I need to convert to a single output file.
I've built a conversion script; it can be run like so:
java -jar tableGenerator.jar -inputfile more-adzuna-jobs-type-9.rdf -skillNames skillNames.ttl -countries countries_europe.rdf -outputcsv out.csv
The problem is that some of the files contain characters that my .jar file regards as invalid. Is there a way to create a bash script that runs this command simultaneously on a folder full of these files (many hundreds) and, for each one that generates an error, will:
ignore it, i.e. not let it halt the process
remember it, so that later it can be dealt with appropriately
It seems like this would be possible, but my bash-fu is quite weak. What would be a logical way to execute this task?

If your Java program in fact exits with an error status then it should be fairly easy to write a bash script that processes all the files in a folder and tracks which had errors. I emphasize that the Java program must exit with an error (non-zero) status for this to be easy. For example, it should terminate execution by invoking System.exit(1).
If your program does report its success or failure to the system via its exit status, then you might do something like this:
#!/bin/bash
# The name of the directory to process is expected as the first argument.
if [ $# -lt 1 ]; then
    echo "usage: $0 directory"
    exit 1
fi

# The first argument to the script is $1.
# Start with an empty failures list.
if [ -e failures.txt ]; then
    rm failures.txt
fi
touch failures.txt

for f in "$1"/*; do
    if ! java -jar /path/to/tableGenerator.jar \
            -inputfile "$f" \
            -skillNames /path/to/skillNames.ttl \
            -countries /path/to/countries_europe.rdf \
            -outputcsv "$f.out.csv"
    then
        echo "$f" >> failures.txt
    fi
done
That iterates over all the files in the directory specified by the first script argument, assigning each path in turn to the shell variable $f, and runs your Java program for each one, passing the path as the argument following -inputfile. If the program exits with a non-zero status, the script appends the name of the failing file to failures.txt in the script's current working directory (which is not necessarily the data directory) and continues.
Note that it does not run the command simultaneously on all the files, but rather processes them one at a time. I am uncertain whether simultaneity was a key component of your request. Since the system you run this on is unlikely to have a separate core to dedicate to each of hundreds of instances of your program, and since the storage medium on which the files reside probably has only one data channel, you cannot effectively run the command hundreds of times simultaneously anyway.
If you do want to run multiple jobs in parallel then bash has ways to do that, but I recommend getting the serial script working first. If processing the files serially is not good enough then you can explore ways to achieve some parallelism. However, to the extent that Java VM startup time may present an issue with starting up hundreds of JVMs, you might be better off building multiple-file handling directly into your Java program, so that you can process all the files in the same VM.
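If you later do want limited parallelism, here is a minimal sketch that assumes GNU xargs with the -P option is available; the process_one helper function is purely illustrative. It keeps a few JVMs busy at a time while still recording failures:
#!/bin/bash
# Run the converter on every file in the directory given as $1, at most
# four at a time, recording failed inputs in failures.txt.
process_one() {
    f=$1
    if ! java -jar /path/to/tableGenerator.jar \
            -inputfile "$f" \
            -skillNames /path/to/skillNames.ttl \
            -countries /path/to/countries_europe.rdf \
            -outputcsv "$f.out.csv"
    then
        echo "$f" >> failures.txt
    fi
}
export -f process_one

printf '%s\0' "$1"/* | xargs -0 -n 1 -P 4 bash -c 'process_one "$1"' _
Appending one short line at a time to failures.txt from several jobs is unlikely to interleave on a local filesystem, but if in doubt you can have each job write its own file and concatenate them afterwards.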

Related

Only partial creation of csv file using crontab command

I have a problem automating the generation of a csv file. The bash code used to produce the csv works in parallel, using 3 cores in order to reduce the time consumption; initially several csv files are produced, which are subsequently combined to form a single csv file. The core of the code is this cycle:
...
waitevery=3
for j in `seq 1 24`; do
    if ((j==1)); then
        printf '%s\n' A B C D E | paste -sd ',' >> code${namefile}01${rr}.csv
    fi
    j=$(printf "%02d" $j)
    ../src/thunderstorm --mask-file=mask.grib const_${namefile}$j${rr}.grib surf_${namefile}$j${rr}.grib ua_${namefile}$j${rr}.grib hl_const.grib out &
    if ! ((c % waitevery)); then
        wait
    fi
    c=$((c+1))
done
...
where ../src/thunderstorm is a .F90 program which produces the second and subsequent files.
If I run this code manually it produces the right csv file, but if I run it from a scheduled crontab entry it generates a csv file containing only the header A B C D E.
Any suggestions?
Thanks!
cron runs your script in an environment that often does not match your expectations.
Check that the PATH is correct and that the script is called from the correct location: ../src is obviously relative, but relative to what?
I find cron scripts to be much more reliable when they use full paths for input, output and programs.
As @umläute points out, cron runs your scripts but does not perform the typical initializations that happen when you open a terminal session, which means you should make no assumptions regarding your environment.
For scripts that may be invoked from the shell and may be invoked from cron I usually add at the beginning something like this:
BIN_DIR=/home/myhome/bin
PATH=$PATH:$BIN_DIR
Also, make sure you do not use relative paths to executables like ../src/thunderstorm. The working directory of the script invoked by cron may not be what you think. You may use $BIN_DIR/../src/thunderstorm. If you want to save typing add the relevant directories to the PATH.
The same logic goes for all other shell variables.
Doing a good initialization at the beginning of your script will allow you to run it from the shell for testing (or manual execution) and then run it as a cron job too.
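For instance, a minimal cron-safe preamble might look like this (all paths are examples, to be adjusted to your layout):
#!/bin/bash
# Explicit PATH, pinned working directory, and absolute program paths, so the
# script behaves the same under cron as in an interactive shell.
PATH=/usr/bin:/bin:/home/myhome/bin
export PATH
cd /home/myhome/work || exit 1               # never rely on cron's working directory
THUNDERSTORM=/home/myhome/src/thunderstorm   # instead of the relative ../src/thunderstorm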

Define an increment variable in a shell script that increments on every cron job

I have searched the forum and couldn't find one. Can we define a variable that only increments on every cron job run?
For example:
I have a script that runs every 5 minutes, so I need a variable that increments based on the cron run.
Say the job ran every 5 minutes for half an hour, so the script got executed 6 times; my counter variable should be 6 now.
I'm expecting this in bash/shell.
Apologies if this is a duplicate question.
I tried:
((count+1))
You can do it this way:
create two scripts: counter.sh and increment_counter.sh
add execution of increment_counter.sh to your cron job
add . /path/to/counter.sh into /etc/profile or /etc/bash.bashrc or wherever you need it
counter.sh
declare -i COUNTER
COUNTER=1
export COUNTER
increment_counter.sh
#!/bin/bash
# Appends another arithmetic assignment; "declare -i" in counter.sh makes
# COUNTER=$COUNTER+1 evaluate numerically when the file is sourced.
echo "COUNTER=\$COUNTER+1" >> /path/to/counter.sh
The shell that you've run the command in has exited; any variables it has set have gone away. You can't use variables for this purpose.
What you need is some sort of permanent data store. This could be a database, or a remote network service, or a variety of things, but by far the simplest solution is to store the value in a file somewhere on disk. Read the file in when the script starts and write out the incremented value afterwards.
You should think about what to do if the file is missing and what happens if multiple copies of the script are run at the same time, and decide whether those are situations you care about at all. If they are, you'll need to add appropriate error handling and locking, respectively, in your script.
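As a minimal sketch of that file-based approach (the file names are examples, and it assumes flock from util-linux is available to handle concurrent runs):
#!/bin/bash
# Keep the run count in a file; flock serializes concurrent invocations so
# two overlapping cron runs cannot both read the same old value.
COUNTER_FILE=/var/tmp/myjob.counter
(
    flock -x 9                                   # wait for an exclusive lock on fd 9
    count=$(cat "$COUNTER_FILE" 2>/dev/null || echo 0)
    count=$((count + 1))
    echo "$count" > "$COUNTER_FILE"
    echo "This is run number $count"
) 9>"$COUNTER_FILE.lock"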
Wouldn't this be a better solution?
...to define a file under /tmp, such that a command like:
echo -n "." >> "$MyCounterFilename"
tracks the number of times something is invoked (note the append, >>, so the dots accumulate). In my particular case of application:
#!/bin/bash
xterm [ Options ] -T "$(wc -c < "$MyCounterFilename")" &   # title shows the count so far
echo -n "." >> "$MyCounterFilename"                        # append one dot per invocation
I had to modify the way xterm is invoked for my purposes, and I found that when many of these are open concurrently you waste less time if you know exactly what is running in each one by its number (without having to cycle through everything with alt+tab and inspect it by eye).
NOTE: /etc/profile, or better yet ~/.profile or ~/.bash_profile, only needs an environment variable defined containing the full path to your counter file.
Anyway, if you don't like the idea above, you could experiment to determine a) when /etc/profile is first executed after the machine is powered on and the system boots, and b) whether /etc/profile is executed at all, and how many times (each time we open an xterm, for instance), and then do the same sort of testing for the less general per-user files.

bash script rsync itself from remote host - how to?

I have multiple remote sites which run a bash script, initiated by cron (running VERY frequently: every 10 minutes or less), and one of that script's jobs is to sync a "scripts" directory. The idea is for me to be able to edit the scripts in one location (a server in a data center) rather than having to log into each remote site and make edits manually. The question is: what are the best options for syncing the script that is currently running the sync? (I hope that's clear.)
I would imagine syncing a script that is currently running would be very bad. Does the following look feasible if I run it as the last statement of my script? Pros? Cons? Other options?
if [ -e "${newScriptPath}" ]; then
    echo "mv ${newScriptPath} ${permanentPath}" | at "now + 1 minute"
fi
One problem I see is that "1 minute" is at's smallest increment, so if the script ends and cron initiates the next job before at replaces the script, at could end up replacing it during the next run of the script....
Changing the script file during execution is indeed dangerous (see this previous answer), but there's a trick that (at least with the versions of bash I've tested with) forces bash to read the entire script into memory, so if it changes during execution there won't be any effect. Just wrap the script in {}, and use an explicit exit (inside the {}) so if anything gets added to the end of the file it won't be executed:
#!/bin/bash
{
# Actual script contents go here
exit
}
Warning: as I said, this works on the versions of bash I have tested it with. I make no promises about other versions, or other shells. Test it with the shell(s) you'll be using before putting it into production use.
Also, is there any risk that any of the other scripts will be running during the sync process? If so, you either need to use this trick with all of them, or else find some general way to detect which scripts are in use and defer updates on them until later.
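One rough way to implement that kind of detection, sketched here with made-up directory names and assuming fuser (from psmisc) is installed: skip replacing any script that some process currently has open, and let a later sync run retry it.
#!/bin/bash
# Hypothetical deferral logic: only replace scripts nobody currently has open.
for s in /usr/local/scripts/*.sh; do
    new="/usr/local/scripts/staging/$(basename "$s")"
    [ -e "$new" ] || continue                  # nothing staged for this script
    if fuser -s "$s"; then
        echo "in use, deferring update: $s"    # the next sync run will retry
    else
        mv "$new" "$s"
    fi
done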
So I ended up using the at utility, but only if the file changed. I keep a ".cur" and a ".new" version of the script on the local machine. If the MD5 of both is the same, I do nothing. If they are different, I wait until after the main script completes, then force-copy the ".new" over the ".cur" in a different script.
I create the same lock file (name) for update_script.sh so another instance of the first script won't run while I'm changing it.
The relevant part in the main script:
file1=$(md5sum < script_cur.sh)
file2=$(md5sum < script_new.sh)
if [ "$file1" == "$file2" ]; then
    echo "Files have the same content"
else
    echo "Files are different, scheduling update_script.sh via at"
    at -f update_script.sh now + 1 minute
fi
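A possible shape for update_script.sh, assuming the lock is taken with flock and that the main script opens the same lock file (the paths and lock name are examples, not the poster's actual setup):
#!/bin/bash
# Take the lock so a new run of the main script cannot start mid-update
# (and wait if one currently holds it), then swap the new version into place.
LOCKFILE=/var/lock/mainscript.lock
exec 9>"$LOCKFILE"
flock -x 9
cp /path/to/script_new.sh /path/to/script_cur.sh
chmod +x /path/to/script_cur.sh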

Is there a way to use qsub and source together?

I wrote a shell script to process a bunch of files separately, like this:
#!/bin/bash
#set parameters (I have many)
...
# find the files and process them iteratively
for File in $FileList; do
    source MyProcessing.sh
done
MyProcessing.sh is the script being sourced; the variables and functions from the main script are used inside it.
Now I'd like to move my shell script to a cluster and use qsub in the iteration. I tried:
# find the files and process them iteratively
for File in $FileList; do
    echo "source MyProcessing.sh" | qsub
done
But it does not work this way. Can anyone help? Thank you in advance.
Variables and functions are local to a script. This means source MyProcessing.sh will work but bash MyProcessing.sh won't: the second form starts a new shell in a separate Unix process, and Unix processes are isolated from one another.
The same is true for qsub, since you invoke it via a pipe: bash creates a new process for qsub and sets its stdin to source MyProcessing.sh. That only passes those 23 bytes to qsub and nothing else.
If you want this to work, then you will have to write a new script that is 100% independent of the main script (i.e. it must not use any variables or functions). Then you must read the documentation of qsub to find out how to set it up. Usually, tools like that only work after you distributed a copy of MyProcessing.sh on every node of the cluster.
Also, the tool probably won't try to figure out what other data the script needs, so you will have to copy the files to the cluster nodes as well (probably by putting them on a shared file system).
Use:
(set; echo "source MyProcessing.sh") | qsub
You need to set your current variables in the qsub shell.
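As a concrete illustration of the first suggestion (a job script that is 100% independent of the main script), one hedged sketch passes the file name through an environment variable with qsub -v, which PBS/Torque- and SGE-style qsub implementations accept; MyProcessingJob.sh is a made-up name:
#!/bin/bash
# MyProcessingJob.sh -- self-contained per-file job: everything it needs comes
# from the FILE environment variable, not from variables sourced elsewhere.
set -u
echo "Processing $FILE on $(hostname)"
The submission loop in the main script then becomes:
for File in $FileList; do
    qsub -v FILE="$File" MyProcessingJob.sh
done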

Run a list of bash scripts consecutively

I have a load of bash scripts that back up different directories to different locations. I want each one to run every day. However, I want to make sure they don't run simultaneously.
I've written a script that basically just calls each script in succession and sits in cron.daily, but I want this script to keep working even if I add and remove backup scripts, without having to edit it manually.
So what I need to do is generate a list of the scripts (e.g. "dir -1 /usr/bin/backup*.sh") and then run each script it finds in turn.
Thanks.
#!/bin/sh
for script in /usr/bin/backup*.sh
do
    "$script"
done
#!/bin/bash
for SCRIPT in /usr/bin/backup*.sh
do
    [ -x "$SCRIPT" ] && [ -f "$SCRIPT" ] && "$SCRIPT"
done
If your system has run-parts then that will take care of it for you. You can name your scripts like "10script", "20anotherscript" and they will be run in order in a manner similar to the rc*.d hierarchy (which is run via init or Upstart, however). On some systems it's a script. On mine it's a binary executable.
It is likely that your system is using it to run hourly, daily, etc., cron jobs just by dropping scripts into directories such as /etc/cron.hourly/
Pay particular attention, though, to how you name your scripts. (Don't use dots, for example.) Check the man page specific to your system, since file naming restrictions may vary.
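For example, with a Debian-style run-parts you can preview what would be executed (and confirm that your script names pass its naming rules) using the --test option; /etc/cron.daily is just the usual example directory:
run-parts --test /etc/cron.daily    # lists what would run, without running anything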
