I've been handed a project that consists of several dozen (probably over 100, I haven't counted) bash scripts. Most of the scripts make at least one call to another one of the scripts. I'd like to get the equivalent of a call graph where the nodes are the scripts instead of functions.
Is there any existing software to do this?
If not, does anybody have clever ideas for how to do this?
The best plan I could come up with was to enumerate the scripts and check whether the basenames are unique (they span multiple directories). If there are duplicate basenames, then cry, because the script paths are usually held in variables, so you may not be able to disambiguate. If they are unique, then grep for the names in the scripts and use those results to build up a graph. Use some tool (suggestions?) to visualize the graph.
Suggestions?
Wrap the shell itself with your own implementation: log who called your wrapper, then exec the original shell.
Yes, you have to actually run the scripts in order to identify which scripts are really used. Otherwise you would need a tool with the same knowledge as the shell engine itself to handle all the variable expansion, PATH lookups, etc.; I've never heard of such a tool.
To visualize the call graph, use GraphViz's dot format.
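A minimal sketch of such a wrapper (the log path, where the wrapper sits on PATH, and the location of the real shell are all assumptions to adapt):
#!/bin/bash
# Hypothetical wrapper: record caller -> callee, then hand off to the real shell.
LOG=/var/tmp/call_graph.log
caller=$(ps -o args= -p "$PPID" 2>/dev/null)   # command line of whoever invoked us
printf '%s -> %s\n' "$caller" "$*" >> "$LOG"
exec /bin/bash "$@"                            # the real shell; adjust if you moved it aside
Each logged caller/callee pair becomes one edge, which maps directly onto a dot file.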
Here's how I wound up doing it (disclaimer: a lot of this is hack-ish, so you may want to clean up if you're going to use it long-term)...
Assumptions:
- Current directory contains all scripts/binaries in question.
- Files for building the graph go in subdir call_graph.
Created the script call_graph/make_tgf.sh:
#!/bin/bash
# Run from dir with scripts and subdir call_graph
# Parameters:
# $1 = sources (default is call_graph/sources.txt)
# $2 = targets (default is call_graph/targets.txt)
SOURCES=$1
if [ "$SOURCES" == "" ]; then SOURCES=call_graph/sources.txt; fi
TARGETS=$2
if [ "$TARGETS" == "" ]; then TARGETS=call_graph/targets.txt; fi
if [ ! -d call_graph ]; then echo "Run from parent dir of call_graph" >&2; exit 1; fi
(
  # cat call_graph/targets.txt
  for file in `cat $SOURCES`
  do
    for target in `grep -v -E '^ *#' $file | grep -o -F -w -f $TARGETS | grep -v -w $file | sort | uniq`
    do
      echo $file $target
    done
  done
)
Then, I ran the following (I wound up doing the scripts-only version):
cat /dev/null | tee call_graph/sources.txt > call_graph/targets.txt
for file in *
do
  if [ -d "$file" ]; then continue; fi
  echo $file >> call_graph/targets.txt
  if file $file | grep text >/dev/null; then echo $file >> call_graph/sources.txt; fi
done
# For scripts only:
bash call_graph/make_tgf.sh call_graph/sources.txt call_graph/sources.txt > call_graph/scripts.tgf
# For scripts + binaries (binaries will be leaf nodes):
bash call_graph/make_tgf.sh > call_graph/scripts_and_bin.tgf
I then opened the resulting tgf file in yEd, and had yEd do the layout (Layout -> Hierarchical). I saved as graphml to separate the manually-editable file from the automatically-generated one.
I found that there were certain nodes that were not helpful to have in the graph, such as utility scripts/binaries that were called all over the place. So, I removed these from the sources/targets files and regenerated as necessary until I liked the node set.
Hope this helps somebody...
Insert a line at the beginning of each shell script, after the #! line, which logs a timestamp, the full pathname of the script, and the argument list.
Over time, you can mine this log to identify likely candidates, i.e. two lines logged very close together have a high probability of the first script calling the second.
This also allows you to focus on the scripts which are still actually in use.
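The logging line itself can be a single echo placed right after the shebang; for example (the log location is an assumption, and readlink -f is GNU-specific):
echo "$(date '+%F %T') $(readlink -f "$0") $*" >> /var/tmp/script_usage.log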
You could use an ed script like this:
1a
log blah blah blah
.
wq
and run it like so (each ed invocation needs its own copy of the edscript on stdin):
find / -perm +x -exec sh -c 'ed "$1" < edscript' sh {} \;
Make sure you test the find command with -print instead of the -exec clause first. And / is probably not the path that you want to use. If you have to include bin directories, then you will probably need to switch to grep to identify the pathnames to include; once you have a file full of the right names, use xargs instead of find to run the script.
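A hedged sketch of that grep/xargs refinement (the search root, the filter pattern, and GNU xargs's -a option are all assumptions):
find /opt/project -type f -perm -u+x > all_exec.txt        # candidate executables
grep -v -E '/(s?bin|lib)/' all_exec.txt > to_edit.txt      # drop binary/library directories; adjust to taste
# one ed per file, each with the edscript on its own stdin
xargs -a to_edit.txt -n 1 sh -c 'ed "$1" < edscript' sh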
Related
I have a three OSX machine setup that was using syncthing to keep shared drives synchronized remotely. Someone made some mistakes and a lot of files ended up getting renamed.
So all throughout this drive I have situations where there's a file of size 0KB named, for example, file.jpg and another file with real size named
file.sync-confilct201705-4528.jpg. I need to search the entire drive recursively and, whenever I find a file with the sync-conflict string in it, check to see if there is the same file without the 'sync-conflict' string along with a size of 0KB. If there is, I need to rename the sync-conflict file to overwrite the 0KB file.
I have considered tackling this with a bash script or a Perl script. Using bash, I think the 'find' command with -regex would get me started, but I don't really know how to process the results and run the next test. I am studying and working on it.
Same problem with Perl. I can get through the first step using File::Find::find and select what I need using a regex to filter the files, but there again I am stuck getting to the next step, which would be finding the original file in the same directory and performing the necessary file move.
In both of these cases I am willing to put in the time to figure it out, but I wonder what the caveats will be? Can both of these scenarios handle recursing a large number of files without exception? Is there perhaps a better approach anyone can recommend?
One good tool in Perl for this is File::Find::Rule.
Find all sync-conflict files, then test whether corresponding files exist and are zero size
use warnings;
use strict;
use FindBin qw($RealBin);
use File::Copy qw(move);
use File::Find::Rule;
my $dir = shift || '.'; # top of hierarchy to search (from command line, or ./)
my @conflict_files = File::Find::Rule
    ->file->name('*sync-conflict*.jpg')->in($dir);
foreach my $conflict (@conflict_files)
{
my ($file) = $conflict =~ m|(.*)\.sync-conflict|;
$file .= '.jpg';
if (-z "$RealBin/$file") {
print "Rename $conflict to $file\n"
#move($conflict, $file) or warn "Can't move $conflict to $file: $!";
}
}
This builds the expected original name (file.jpg) for each file.sync-conflict file and applies the -z file test (see -X in perldoc), which is true only if the file exists and has zero size. Then it renames the conflict file using move from the core File::Copy.
Note that file-test operators need the full path while File::Find::Rule returns the path relative to the $dir it searches. I use $RealBin provided by FindBin, which is the path to the directory where the script was started with all links resolved, to build the full path for -z.
Uncomment the move line after sufficient testing (and with having made a backup first).
The code makes some assumptions about file names, please adjust as needed.
The $dir supplied on the command line is expected to be relative to the script's directory.
find is great. But as you've noted, you need more.
What find gets you in this scenario is the ability to search recursively and match certain patterns. As it happens, as of Bash version 4 you can do that right in the shell.
(Note that macOS ships with bash version 3, so for this solution, you'll need to install bash 4 from Macports, Homebrew or Fink.)
$ shopt -s globstar nullglob
$ for file in **/*sync-confilct2017*.*; do echo mv -v "$file" "${file%sync-conf*}${file##*.}"; done
mv -v file.sync-confilct201705-4528.jpg file.jpg
mv -v foo/bar.sync-confilct201705-4528.ext foo/bar.ext
You can remove the echo to actually run the mv command.
The way this works is that the double asterisk, **, is treated by bash like a * that recurses. We're using parameter expansion to strip the parts of the filename we want in order to construct the "target" filename.
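To see the two expansions in isolation on one of the sample names:
file='foo/bar.sync-confilct201705-4528.ext'
echo "${file%sync-conf*}"              # foo/bar.  (shortest suffix starting at "sync-conf" removed)
echo "${file##*.}"                     # ext       (longest prefix through the last dot removed)
echo "${file%sync-conf*}${file##*.}"   # foo/bar.ext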
Create a function to fix the name:
$ function fixname() { file="$1"; newname=$( echo "$file" | sed "s/\.sync-conflict.*\.jpg$/.jpg/" ); if [ -f "$newname" -a ! -s "$newname" ]; then mv "$file" "$newname"; fi; }
Or, spread out a bit:
function fixname() {
    file="$1"
    newname=$( echo "$file" | sed "s/\.sync-conflict.*\.jpg$/.jpg/" )
    # If empty file exists
    if [ -f "$newname" -a ! -s "$newname" ]; then
        mv "$file" "$newname"
    fi
}
Export the function:
$ export -f fixname
Run find to execute the function:
$ find . -type f -name \*sync-conflict\*.jpg -exec bash -c 'fixname {}' bash \;
Caveat: It will not work with spaces or funky characters in the filenames.
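If spaces matter, one common variation (my assumption, not part of the answer above) is to pass the name as a positional parameter instead of splicing {} into the -c string:
$ find . -type f -name '*sync-conflict*.jpg' -exec bash -c 'fixname "$1"' bash {} \;
Names containing newlines would still trip up the echo | sed pipeline inside fixname.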
I have a question on how to approach a problem I've been trying to tackle at multiple points over the past month. The scenario is like so:
I have a base directory with multiple sub-directories all following the same sub-directory format:
A/{B1,B2,B3} where all B* have a pipeline/results/ directory structure under them.
All of these results directories have multiple *.xyz files in them. These *.xyz files have a certain hierarchy based on their naming prefixes. The naming prefixes in turn depend on how far they've been processed. They could be, for example, select.xyz, select.copy.xyz, and select.copy.paste.xyz, where the operations are select, copy and paste. What I wish to do is write a ls | grep or a find that picks these files based on their processing levels.
EDIT:
The processing pipeline goes select -> copy -> paste. The "most processed" file would be the one with the most of those stages as prefixes in its filename, i.e. select.copy.paste.xyz is more processed than select.copy.xyz, which in turn is more processed than select.xyz.
For example, let's say
B1/pipeline/results/ has select.xyz and select.copy.xyz,
B2/pipeline/results/ has select.xyz
B3/pipeline/results/ has select.xyz, select.copy.xyz, and select.copy.paste.xyz
How can I write a ls | grep/find that picks the most processed file from each subdirectory? This should give me B1/pipeline/results/select.copy.xyz, B2/pipeline/results/select.xyz and B3/pipeline/results/select.copy.paste.xyz.
Any pointer on how I can think about an approach would help. Thank you!
For this answer, we will ignore the upper part A/B{1,2,3} of the directory structure. All files in some .../pipeline/results/ directory will be considered, even if the directory is A/B1/doNotIncludeMe/forbidden/pipeline/results. We assume that the file extension xyz is constant.
A simple solution would be to loop over the directories and check whether the files exist from back to front. That is, check if select.copy.paste.xyz exists first. In case the file does not exist, check if select.copy.xyz exists and so on. A script for this could look like the following:
#! /bin/bash
# print paths of the most processed files
shopt -s globstar nullglob
for d in **/pipeline/results; do
    if [ -f "$d/select.copy.paste.xyz" ]; then
        echo "$d/select.copy.paste.xyz"
    elif [ -f "$d/select.copy.xyz" ]; then
        echo "$d/select.copy.xyz"
    elif [ -f "$d/select.xyz" ]; then
        echo "$d/select.xyz"
    else
        : # there is no file at all
    fi
done
It does the job, but is not very nice. We can do better!
#! /bin/bash
# print paths of the most processed files
shopt -s globstar nullglob
for dir in **/pipeline/results; do
    for file in "$dir"/select{.copy{.paste,},}.xyz; do
        [ -f "$file" ] && echo "$file" && break
    done
done
The second script does exactly the same thing as the first one, but is easier to maintain, adapt, and so on. Both scripts work with file and directory names that contain spaces or even newlines.
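Printing the brace expansion makes the ordering (most processed first) explicit:
$ echo select{.copy{.paste,},}.xyz
select.copy.paste.xyz select.copy.xyz select.xyz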
In case you don't have whitespace in your paths, the following (hacky, but loop-free) script can also be used.
#! /bin/bash
# print paths of the most processed files
shopt -s globstar nullglob
files=(**/pipeline/results/select{.copy{.paste,},}.xyz)
printf '%s\n' "${files[@]}" | sed -r 's#(.*/)#\1 #' | sort -usk1,1 | tr -d ' '
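To illustrate what the pipeline does to a single path (hypothetical example):
echo 'B1/pipeline/results/select.copy.xyz' | sed -r 's#(.*/)#\1 #'
# prints: B1/pipeline/results/ select.copy.xyz
# sort -usk1,1 then keeps only the first line per directory key, and tr -d ' ' removes the marker space.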
I have some pseudocode below and would like to know if it would work / is the best method to tackle the problem before I begin developing the code.
I need to dynamically search through a directory on one server and find out if it exists on another server or not. The path will be different so I use basename and save it as a temporary variable.
for $FILE in $CURRENT_DIRECTORY
$TEMP=$(basename "$FILE" )
if [ssh user@other_serverip find . -name '$TEMP']; then
//write code here
fi
Would this if statement return true if the file existed on the other server?
Here is a functioning, cleaner implementation of your logic:
for FILE in *; do
    if ssh user@other_serverip test -e "$FILE"; then
        : # write code here
    fi
done
(There won't be a path on files when the code is composed this way, so you don't need basename.) test -e "$FILE" will silently exit 0 (true) if the file exists and 1 (false) if the file does not, though ssh will also exit with a false code if the connection fails.
However, that is a very expensive way to solve your issue. It will fail if your current directory has too many files in it and it runs ssh once per file.
You're better off getting a list of the remote files first and then checking against it:
#!/bin/sh
if [ "$1" != "--xargs" ]; then  # this is an internal flag
    (
        ssh user@other_serverip find . -maxdepth 1  # remote file list
        find . -maxdepth 1                          # local file list
    ) | awk '++seen[$0]==2' | xargs -d "\n" sh "$0" --xargs  # keep duplicates
else
    shift  # remove the --xargs marker
    for FILE in "$@"; do
        : # write code here using "$FILE" (with quotes)
    done
fi
This does two things. First, since the internal --xargs is not given when you run the script, it connects to the remote server and gets a list of all files in the home directory there. These will be listed as ./.bashrc for example. Then the same list is generated locally, and the results are passed to awk.
The awk command builds an associative array (a hash) from each item it sees, incrementing it and then checking the total against the number two. It prints the second instance of any line it sees. Those are then passed on to xargs, which is instructed to use \n (a line break) as its delimiter rather than any space character.
Note: this code will break if you have any files that have a line break in their name. Don't do that.
xargs then re-invokes this script, which this time takes the else branch, and we loop through each file. If you have too many files, be aware that there may be more than one instance of this script (see man xargs).
This code requires GNU xargs. If you're on BSD or some other system that doesn't support xargs -d "\n", you can use perl -pe 's/\n/\0/' |xargs -0 instead.
It would return true if ssh exits successfully.
Have you tried command substitution and parsing find's output instead?
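A minimal reading of that suggestion (the host and the find options are assumptions):
match=$(ssh user@other_serverip find . -maxdepth 1 -name "$TEMP" 2>/dev/null)
if [ -n "$match" ]; then
    : # the file exists on the other server
fi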
So I am going to post a question about shell scripting again.
Problem Definition: For all files under a dir, ex.:
A_anything.txt, B_anything.txt, ......
I want to execute a script, say 'CMD', on each of them, with the output files named like:
A_result.txt, B_result.txt, ......
In addition, on the first line of each of these output files, I want to have the file name of the original one.
The 'find -exec' util seems to me unable to extract part of the file name.
Does someone know a solution to this problem, by any means (shell, python, find, etc.)? Thank you!
cd /directory
for file in *.txt ; do
    newfilename=`echo "$file" | sed 's/\(.\+\)_.*/\1_result.txt/'`
    echo "$file" > "$newfilename"
    your-command "$file" >> "$newfilename"
done
HTH
Well, there's more than one way to do it (including using Perl, where that's the motto), but probably I'd write it like this:
find . -name '[A-Z]_*.txt' -type f -print0 |
xargs -0 modify_rename.sh
And then I'd write the script modify_rename.sh like this:
#!/bin/sh
for file in "$@"
do
    dirname=$(dirname "$file")
    basename=$(basename "$file" .txt)
    leadname=${basename%_*}
    outname="$dirname/${leadname}_result.txt"
    # Optionally check for pre-existence of $outname
    {
        # Optionally echo "$basename.txt" instead of "$file"
        echo "$file"
        # Does this invocation of CMD write to standard output?
        # If not, adjust invocation appropriately.
        CMD "$file"
    } > "$outname"
done
The advantage of this separation into separate scripting operations is that the rename/modify operation can be checked out separately from the search process - which runs less risk of zapping your entire directory structure with bad commands.
Bash has the tools to avoid invoking basename and dirname, but the notation is moderately excruciating; I find the clarity of the command names worth having. I'd be happy if bash implemented them as built-ins. There are plenty of other ways to get the prefix of the file; this should be safe, though, even in the presence of spaces (tabs, newlines) in file or directory names, because of the careful use of double quotes.
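For reference, the pure parameter-expansion equivalents would be (with the caveat that ${file%/*} misbehaves when the name contains no slash, which dirname handles by printing "."):
dirname=${file%/*}          # everything before the last /
basename=${file##*/}        # everything after the last /
basename=${basename%.txt}   # strip the extension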
I have a perl script (or any executable) E which will take a file foo.xml and write a file foo.txt. I use a Beowulf cluster to run E for a large number of XML files, but I'd like to write a simple job server script in shell (bash) which doesn't overwrite existing txt files.
I'm currently doing something like
#!/bin/sh
PATTERN="[A-Z]*0[1-2][a-j]"; # this matches foo in all cases
todo=`ls *.xml | grep $PATTERN -o`;
isdone=`ls *.txt | grep $PATTERN -o`;
whatsleft=todo - isdone; # what's the unix magic?
#tack on the .xml prefix with sed or something
#and then call the job server;
jobserve E "$whatsleft";
and then I don't know how to get the difference between $todo and $isdone. I'd prefer using sort/uniq to something like a for loop with grep inside, but I'm not sure how to do it (pipes? temporary files?)
As a bonus question, is there a way to do lookahead search in bash grep?
To clarify/extend the problem:
I have a bunch of programs that take input from sources like (but not necessarily) data/{branch}/special/{pattern}.xml and write output to another directory results/special/{branch}-{pattern}.txt (or data/{branch}/intermediate/{pattern}.dat, e.g.). I want to check in my jobfarming shell script if that file already exists.
So E transforms data/{branch}/special/{pattern}.xml->results/special/{branch}-{pattern}.dat, for instance. I want to look at each instance of the input and check if the output exists. One (admittedly simpler) way to do this is just to touch *.done files next to each input file and check for those results, but I'd rather not manage those, and sometimes the jobs terminate improperly so I wouldn't want them marked done.
N.B. I don't need to check concurrency yet or lock any files.
So a simple, clear way to solve the above problem (in pseudocode) might be
for i in `/bin/ls *.xml`
do
replace xml suffix with txt
if [that file does not exist]
add to whatsleft list
end
done
but I'm looking for something more general.
#!/bin/bash
shopt -s extglob # allow extended glob syntax, for matching the filenames
LC_COLLATE=C # use a sort order comm is happy with
IFS=$'\n' # so filenames can have spaces but not newlines
# (newlines don't work so well with comm anyhow;
# shame it doesn't have an option for null-separated
# input lines).
files_todo=( **([A-Z])0[1-2][a-j]*.xml )
files_done=( **([A-Z])0[1-2][a-j]*.txt )
files_remaining=( \
$(comm -23 --nocheck-order \
<(printf "%s\n" "${files_todo[#]%.xml}") \
<(printf "%s\n" "${files_done[#]%.txt}") ))
echo jobserve E $(for f in "${files_remaining[@]}"; do printf "%s\n" "${f}.xml"; done)
This assumes that you want a single jobserve E call with all the remaining files as arguments; it's rather unclear from the specification if such is the case.
Note the use of extended globs rather than parsing ls (which is considered very poor practice).
To transform input to output names without using anything other than shell builtins, consider the following:
if [[ $in_name =~ data/([^/]+)/special/([^/]+)\.xml ]] ; then
out_name=results/special/${BASH_REMATCH[1]}-${BASH_REMATCH[2]}.dat
else
: # ...handle here the fact that you have a noncompliant name...
fi
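For a hypothetical input the pieces fall out like this:
in_name=data/trunk/special/sample01a.xml
# BASH_REMATCH[1] -> trunk
# BASH_REMATCH[2] -> sample01a
# out_name        -> results/special/trunk-sample01a.dat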
The question title suggests that you might be looking for:
set -o noclobber
The question content indicates a wholly different problem!
It seems you want to run 'jobserve E' on each '.xml' file without a matching '.txt' file. You'll need to assess the TOCTOU (Time of Check, Time of Use) problems here because you're in a cluster environment. But the basic idea could be:
todo=""
for file in *.xml
do [ -f ${file%.xml}.txt ] || todo="$todo $file"
done
jobserve E $todo
This will work with Korn shell as well as Bash. In Bash you could explore making 'todo' into an array; that will deal with spaces in file names better than this will.
If you have processes still generating '.txt' files for '.xml' files while you run this check, you will get some duplicated effort (because this script cannot tell that the processing is happening). If the 'E' process creates the corresponding '.txt' file as it starts processing, that minimizes the chance of duplicated effort. Or, maybe consider separating the processed files from the unprocessed files, so the 'E' process moves the '.xml' file from the 'to-be-done' directory to the 'done' directory (and writes the '.txt' file to the 'done' directory too). If done carefully, this can avoid most of the multi-processing problems. For example, you could link the '.xml' to the 'done' directory when processing starts, and ensure appropriate cleanup with an 'atexit()' handler (if you are moderately confident your processing program does not crash). Or other trickery of your own devising.
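A rough sketch of that move-between-directories idea (the directory names, and the assumption that E can simply be pointed at the moved file, are mine, not part of the answer):
for file in to-be-done/*.xml
do
    base=${file##*/}
    # mv of the same source by two workers: only one succeeds, the other skips it
    mv "$file" "in-progress/$base" 2>/dev/null || continue
    E "in-progress/$base" && mv "in-progress/$base" "done/$base"
done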
whatsleft=$( ls *.xml *.txt | grep $PATTERN -o | sort | uniq -u )
Note this actually gets a symmetric difference.
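If only the one-way difference (.xml stems with no matching .txt) is wanted, comm gives that directly (a sketch reusing $PATTERN from the question):
whatsleft=$( comm -23 <(ls *.xml | grep -o "$PATTERN" | sort) <(ls *.txt | grep -o "$PATTERN" | sort) )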
I am not exactly sure what you want, but you can check for the existence of the file first and, if it exists, create a new name. (Or do this check in your E Perl script.)
if [ -f "$file" ];then
newname="...."
fi
...
jobserve E .... > $newname
If it's not what you want, describe more clearly in your question what you mean by "don't overwrite files".
For posterity's sake, this is what I found to work:
TMPA='neverwritethis.tmp'
TMPB='neverwritethat.tmp'
ls *.xml | grep $PATTERN -o > $TMPA;
ls *.txt | grep $PATTERN -o > $TMPB;
whatsleft=`sort $TMPA $TMPB | uniq -u | sed 's/$/.xml/'`;
rm $TMPA $TMPB;