Batch converting Markdown files with Markdown.pl - bash

Referencing a previous answer, but still having difficulties converting my folder of .md files:
for i in src/*.md; do perl markdown/Markdown.pl --html4tags $i > output/${i%.*}.html; done;
Unfortunately (for my test file "index.md") it's throwing the error:
line 11: output/src/index.html: No such file or directory
I'm not sure how to get it to direct output to just "output/index.html".
Any thoughts? (I'm not interested in using another solution like pandoc, just trying to do this in bash)

The expansion of src/*.md will yield elements that all start with src/. You can strip the directory part of a path, leaving only the filename, with basename.
Since you're using the ${variable%match} replacement pattern to replace .md with .html, it would probably be easiest to create a new variable, here $j, to hold the results of basename.
for i in src/*.md; do j="$(basename "$i")"; perl markdown/Markdown.pl --html4tags "$i" > "output/${j%.*}.html"; done
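The same path arithmetic can also be done with parameter expansion alone, without calling basename. The following is only a sketch: html_target is a hypothetical helper name, and the sample paths stand in for whatever src/*.md expands to.

```bash
#!/bin/bash
# html_target is a hypothetical helper, not part of Markdown.pl.
# It maps an input path such as src/index.md to output/index.html.
html_target() {
    local name=${1##*/}                     # drop everything up to the last slash
    printf 'output/%s.html\n' "${name%.*}"  # drop the extension, add .html
}

# Sample paths for illustration only.
for i in src/index.md "src/release notes.md"; do
    printf '%s -> %s\n' "$i" "$(html_target "$i")"
done
```

In the real loop the redirection target would then be "$(html_target "$i")" in place of the hand-built output path.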

The error message means that the directory output/src, relative to the working directory in which the command is executed, does not exist. You can do a
mkdir -p output/src; for i in ....

You can avoid the for loop and run the conversions in parallel across your CPU cores with GNU Parallel. With --dry-run you can check what it's going to do in advance without actually doing anything:
parallel --dry-run perl markdown/Markdown.pl --html4tags {} \> output/{/.}.html ::: src/*.md
Sample Output
perl markdown/Markdown.pl --html4tags src/a.md > output/a.html
If that looks correct, run it again but without the --dry-run to do it for real.

Related

BASH Shell Find Multiple Files with Wildcard and Perform Loop with Action

I have a script that I call with an application, I can't run it from command line. I derive the directory where the script is called and in the next variable go up 1 level where my files are stored. From there I have 3 variables with the full path and file names (with wildcard), which I will refer to as "masks".
I need to find and "do something with" (copy/write their names to a new file, whatever else) to each of these masks. The do something part isn't my obstacle as I've done this fine when I'm working with a single mask, but I would like to do it cleanly in a single loop instead of duplicating loop and just referencing each mask separately if possible.
Assume in my $FILESFOLDER directory below that I have 2 existing files, aaa0.csv & bbb0.csv, but no file matching the ccc*.csv mask.
#!/bin/bash
SCRIPTFOLDER=${0%/*}
FILESFOLDER="$(dirname "$SCRIPTFOLDER")"
ARCHIVEFOLDER="$FILESFOLDER"/archive
LOGFILE="$SCRIPTFOLDER"/log.txt
FILES1="$FILESFOLDER"/"aaa*.csv"
FILES2="$FILESFOLDER"/"bbb*.csv"
FILES3="$FILESFOLDER"/"ccc*.csv"
ALLFILES="$FILES1
$FILES2
$FILES3"
#here as an example I would like to do a loop through $ALLFILES and copy anything that matches to $ARCHIVEFOLDER.
for f in $ALLFILES; do
cp -v "$f" "$ARCHIVEFOLDER" > "$LOGFILE"
done
echo "$ALLFILES" >> "$LOGFILE"
The thing that really spins my head is when I run something like this (I haven't done it with the copy command in place) that log file at the end shows:
filesfolder/aaa0.csv filesfolder/bbb0.csv filesfolder/ccc*.csv
Where I would expect echoing $ALLFILES just to show me the masks
filesfolder/aaa*.csv filesfolder/bbb*.csv filesfolder/ccc*.csv
In my "do something" area, I need to be able to use whatever method to find the files by their full path/name with the wildcard if at all possible. Sometimes my network is down for maintenance and I don't want to risk failing a change directory. I rarely work in linux (primarily SQL background) so feel free to poke holes in everything I've done wrong. Thanks in advance!
Here's a light refactoring with significantly fewer distracting variables.
#!/bin/bash
script=${0%/*}
folder="$(dirname "$script")"
archive="$folder"/archive
log="$folder"/log.txt # you would certainly want this in the folder, not $script/log.txt
shopt -s nullglob
all=()
for prefix in aaa bbb ccc; do
cp -v "$folder/$prefix"*.csv "$archive" >>"$log" # append, don't overwrite
all+=("$folder/$prefix"*.csv)
done
echo "${all[@]}" >> "$log"
The change in the loop to append the output of cp -v instead of overwriting it is a bug fix; otherwise the log would only contain the output from the last loop iteration.
I would probably prefer to have the files echoed from inside the loop as well, one per line, instead of collecting them all on one humongous line. Then you can remove the array all and instead simply
printf '%s\n' "$folder/$prefix"*.csv >>"$log"
shopt -s nullglob is a Bash extension (so won't work with sh) which says to discard any wildcard which doesn't match any files (the default behavior is to leave globs unexpanded if they don't match anything). If you want a different solution, perhaps see Test whether a glob has any matches in Bash
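To see the difference nullglob makes, here is a small sketch that expands a non-matching pattern with the option off and then on. The scratch directory and the variable names are made up for the demonstration.

```bash
#!/bin/bash
# Scratch directory with one aaa file and no ccc file (illustrative names).
dir=$(mktemp -d)
touch "$dir/aaa0.csv"

shopt -u nullglob
default=("$dir"/ccc*.csv)        # no match: the pattern itself is kept as one word

shopt -s nullglob
with_nullglob=("$dir"/ccc*.csv)  # no match: the pattern expands to nothing

echo "default: ${#default[@]} word(s), nullglob: ${#with_nullglob[@]} word(s)"
rm -rf "$dir"
```

With the default behavior a later cp would be handed the literal ccc*.csv string and complain that the file does not exist; with nullglob the array is simply empty.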
You should use lower case for your private variables so I changed that, too. Notice also how the script variable doesn't actually contain a folder name (or "directory" as we adults prefer to call it); fixing that uncovered a bug in your attempt.
If your wildcards are more complex, you might want to create an array for each pattern.
tmpspaces=(/tmp/*\ *)
homequest=("$HOME"/*\?*)
for file in "${tmpspaces[@]}" "${homequest[@]}"; do
: stuff with "$file", with proper quoting
done
The only robust way to handle file names which could contain shell metacharacters is to use an array variable; using string variables for file names is notoriously brittle.
Perhaps see also https://mywiki.wooledge.org/BashFAQ/020
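The brittleness of string variables can be reproduced with a single file name containing a space. This is a sketch with made-up names, and it assumes the temporary directory path itself contains no whitespace.

```bash
#!/bin/bash
# One file whose name contains a space (illustrative).
dir=$(mktemp -d)
touch "$dir/two words.csv"

# String variable: the unquoted expansion is split on whitespace.
as_string=$(printf '%s ' "$dir"/*.csv)
string_count=0
for f in $as_string; do
    string_count=$((string_count + 1))
done

# Array variable: each file name stays a single element.
as_array=("$dir"/*.csv)
array_count=${#as_array[@]}

echo "string: $string_count word(s), array: $array_count element(s)"
rm -rf "$dir"
```

The string version sees two words for the one file, while the array holds exactly one element.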

How do I insert a period in all filenames of a certain pattern?

In a bash shell, I have a directory with files:
my.file.name0000.h5
my.file.name0001.h5
...
my.file.name0100.h5
How can I batch rename them all to insert a period before the digits?
my.file.name.0000.h5
my.file.name.0001.h5
...
my.file.name.0100.h5
I've tried looking into regular expression, and while I'm familiar with how some individual commands work, I am unfamiliar with how to put them together for my task.
Why regular expressions? If you have util-linux rename (watch out! some have perl rename, which is different), just:
rename file.name file.name. *.h5
I found an example using this source. Applying that to your problem this should work.
for filename in my.file.name*; do echo mv \"$filename\" \"${filename//my.file.name/my.file.name.}\"; done | /bin/bash
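If util-linux rename is not available, the rename can also be done with mv and parameter expansion directly, without generating commands and piping them to a second shell. The sketch below works in a scratch directory with sample files so it can be tried safely; in practice you would run only the loop inside the real directory.

```bash
#!/bin/bash
# Scratch directory with sample files (illustrative only); left in place
# afterwards so the result can be inspected.
cd "$(mktemp -d)" || exit 1
touch my.file.name0000.h5 my.file.name0001.h5

# Match only names where a digit directly follows the prefix, so files
# that already have the period are left alone if the loop is re-run.
for f in my.file.name[0-9]*.h5; do
    mv -- "$f" "my.file.name.${f#my.file.name}"
done

ls
```

Putting echo in front of mv first is a cheap way to preview the renames.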

how to make separate temp directories for each processes in a batch job

I have just started learning bioinformatics in my lab and I am a complete newbie.
I am using a genome annotation tool called Kofamscan from NCBI and I am getting an error that could be due to the fact results of multiple processes are being stored in the same temp directory and the files are collapsing.
So I want to create separate temp directories per process (temp1 for process1, temp2 for process2,...etc) but I don't know how to write the code that enables it.
files=(`cat kofam_files`) #input files
TASK_ID = `expr ${SGE_TASK_ID} -1`
~/kofamscan/bin/exec_annotation -o marine_kofam.txt --tmp-dir **** ${files[$TASK_ID]}
I probably need to write something in the **** section of the above code but I don't know how to write them.
Thank you in advance.
Ryohei
This is exactly why mktemp command exists :)
The following snippet shows the minimal changes you would have to make to yours:
files=(`cat kofam_files`) #input files
TASK_ID=`expr ${SGE_TASK_ID} - 1`
~/kofamscan/bin/exec_annotation -o marine_kofam.txt --tmp-dir `mktemp -d` ${files[$TASK_ID]}
Note that the temp directory would be created in /tmp though. You could use the flags for mktemp to create temp subdirectories in the current directory.
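One way to keep the temporary directory next to the job instead of in /tmp is to pass mktemp a template with a path. This is a sketch: the kofam_tmp prefix is an arbitrary name chosen here, and the trap is an optional addition that cleans up when the script exits.

```bash
#!/bin/bash
# Create a uniquely named directory under the current directory.
# The trailing XXXXXX is replaced by random characters; "kofam_tmp" is
# a made-up prefix for illustration.
tmpdir=$(mktemp -d ./kofam_tmp.XXXXXX) || exit 1

# Optional: remove the directory again when the script exits.
trap 'rm -rf "$tmpdir"' EXIT

echo "using $tmpdir"
```

"$tmpdir" would then be the argument given to --tmp-dir. Because every task gets its own random suffix, concurrent tasks do not share a directory.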
EDIT: Following the best practices for bash, one would also:
Use the newer mapfile or readarray commands (in bash 4+) instead of using cat to create arrays in bash
Use $(...) instead of `...` since it supports nesting
Use $((...)) instead of the archaic expr syntax (see this thread)
The final snippet would then look like:
readarray -t files < kofam_files #input files
TASK_ID=$((SGE_TASK_ID - 1))
~/kofamscan/bin/exec_annotation -o marine_kofam.txt --tmp-dir "$(mktemp -d)" "${files[$TASK_ID]}"

for loop in a bash script

I am completely new to bash script. I am trying to do something really basic before using it for my actual requirement. I have written a simple code, which should print test code as many times as the number of files in the folder.
My code:
for variable in `ls test_folder`; do
echo test code
done
"test_folder" is a folder which exist in the same directory where the bash.sh file lies.
PROBLEM: If there is one file, it prints once, but if there is more than one file, it prints a different count. For example, if there are 2 files in "test_folder", then test code gets printed 3 times.
Just use a shell pattern (aka glob):
for variable in test_folder/*; do
# ...
done
You will have to adjust your code to compensate for the fact that variable will contain something like test_folder/foo.txt instead of just foo.txt. Luckily, that's fairly easy; one approach is to start the loop body with
variable=${variable#test_folder/}
to strip the leading directory introduced by the glob.
Never loop over the output of ls! Because of word splitting, files having spaces in their names will be a problem. Sure, you could set IFS to $'\n', but files in UNIX can also have newlines in their names.
Use find instead:
find test_folder -maxdepth 1 -mindepth 1 -exec echo test \;
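The miscount described in the question can be reproduced with a file name that contains a space. This is a sketch using a scratch directory and made-up file names, with the default IFS.

```bash
#!/bin/bash
# Two files, one with a space in its name (illustrative).
dir=$(mktemp -d)
touch "$dir/one.txt" "$dir/two words.txt"

# Looping over ls output: the name with a space is split into two words.
ls_count=0
for v in $(ls "$dir"); do
    ls_count=$((ls_count + 1))
done

# Looping over a glob: one iteration per file, whatever the name contains.
glob_count=0
for v in "$dir"/*; do
    glob_count=$((glob_count + 1))
done

echo "ls loop: $ls_count, glob loop: $glob_count"   # ls loop: 3, glob loop: 2
rm -rf "$dir"
```

Two files producing three iterations is exactly the symptom reported, so a space in one of the file names is a likely cause.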
This should work:
cd "test_folder"
for variable in *; do
#your code here
done
cd ..
variable will contain only the file names

Bash: find references to filenames in other files

Problem:
I have a list of filenames, filenames.txt:
Eg.
/usr/share/important-library.c
/usr/share/youneedthis-header.h
/lib/delete/this-at-your-peril.c
I need to rename or delete these files and I need to find references to these files in a project directory tree: /home/noob/my-project/ so I can remove or correct them.
My thought is to use bash to extract the filename: basename filename, then grep for it in the project directory using a for loop.
FILELISTING=listing.txt
PROJECTDIR=/home/noob/my-project/
for f in $(cat "$FILELISTING"); do
extension=$(basename ${f##*.})
filename=$(basename ${f%.*})
pattern="$filename"\\."$extension"
grep -r "$pattern" "$PROJECTDIR"
done
I could royally screw up this project -- does anyone see a flaw in my logic; better: do you see a more reliable scalable way to do this over a huge directory tree? Let's assume that revision control is off the table ( it is, in fact ).
A few comments:
Instead of
for f in $(cat "$FILELISTING") ; do
...
done
it's somewhat safer to write
while IFS= read -r f ; do
...
done < "$FILELISTING"
That way, your code will have no problem with spaces, tabs, asterisks, and so on in the filenames (though it still won't support newlines).
Your goal in separating f into extension and filename, and then reassembling them with \., seems to be that you want the filename to be treated as a literal string; right? Like, you're worried that grep will treat the . as meaning "any character" rather than as "one dot". A more general solution is to use grep's -F option, which tells it to treat the pattern as a fixed string rather than a regex:
grep -r -F "$f" "$PROJECTDIR"
Your introduction mentions using basename, but then you don't actually use it. Is that intentional?
If your non-use of basename is intentional, then filenames.txt really just contains a list of patterns to search for; you don't even need to write a loop, in this case, since grep's -f option tells it to take a newline-separated list of patterns from a file:
grep -r -F -f "$FILELISTING" "$PROJECTDIR"
You should back up your project, using something like tar -czf backup.tar.gz "$PROJECTDIR". "Revision control is off the table" doesn't mean you can't have a rollback strategy!
Edited to add:
To pass all your base-names to grep at once, in the hope that it can do something smarter with them than a loop of separate calls could, you can write something like:
grep -r -F "$(sed 's#.*/##g' "$FILELISTING")" "$PROJECTDIR"
(I used sed rather than while+basename for brevity's sake, but you can put an entire loop inside the "$(...)" if you prefer.)
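The base-name search can be tried on a throwaway tree before pointing it at the real project. The sketch below builds a tiny made-up project and listing, strips the directories with sed, and feeds the base names to grep as fixed-string patterns on standard input (-f -). The -l option, which prints only the names of matching files, is an addition for readability.

```bash
#!/bin/bash
# Throwaway project and listing (all names are illustrative).
work=$(mktemp -d)
mkdir -p "$work/project/src"
printf '%s\n' /usr/share/important-library.c /usr/share/youneedthis-header.h \
    > "$work/listing.txt"
printf '#include "youneedthis-header.h"\n' > "$work/project/src/main.c"
printf 'int unrelated;\n' > "$work/project/src/other.c"

# Strip the directory part of each line, then search for the remaining
# base names as literal strings in every file under the project.
hits=$(sed 's#.*/##' "$work/listing.txt" | grep -r -l -F -f - "$work/project")

echo "$hits"
rm -rf "$work"
```

Note that an empty line in the listing would act as an empty pattern and match every line, so it is worth checking the listing for blank lines first.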
This is a job for an IDE.
You're right that this is a perilous task, and unless you know the build process and the search directories and the order of the directories, you really can't say what header is with which file.
Let's take something as simple as this:
# include "sql.h"
You have a file in the project headers/sql.h. Is that file needed? Maybe it is. Maybe not. There's also a /usr/include/sql.h. Maybe that's the one that's actually used. You can't tell without looking at the Makefile and seeing the order of the include directories which is which.
Then, there are the libraries that get included and may need their own header files in order to be able to compile. And, once you get to the C preprocessor, you really will have a hard time.
This is a task for an IDE (Integrated Development Environment). An IDE builds the project and tracks file and other resource dependencies. In the Java world, most people use Eclipse, and there is a C/C++ plugin for those developers. However, there are over 2 dozen listed in Wikipedia and almost all of them are open source. The best one will depend upon your environment.
