How can I accomplish this `cat` usage more tersely? - shell

Open-ended question (be creative!) for a real use case. Essentially I want to cat (1) an existing file, (2) the output of a program, and (3) a specific bit of text. Between pipes, echo and redirects, I feel like I should be able to do better than this!
pandoc -t latex -o mydoc.tex mydoc.rst
echo '\end{document}' > footer.tex
cat header.tex mydoc.tex footer.tex > fulldoc.tex

One option is a command group, which redirects the combined output of several commands in one go:
{
    cat header.tex
    pandoc -t latex mydoc.rst
    echo '\end{document}'
} > fulldoc.tex

If you're using bash, you can use process substitution and a here string:
cat header.tex <(pandoc -t latex mydoc.rst) <<<'\end{document}' > fulldoc.tex

Related

Automate command lines

I'm executing the following command lines and now I need to put them in a script that I can call, passing file1, file2 and file3 as arguments.
sort file1.csv > file1.csv.sorted
sort file2.csv > file2.csv.sorted
diff --speed-large-files \
    file1.csv.sorted \
    file2.csv.sorted \
    > file3.difftmp
rm file1.csv.sorted
rm file2.csv.sorted
I have tried to create a bash script, but the following eval was not working:
s="diff --speed-large-files $file1.csv.sorted $file2.csv.sorted > $file3"
eval s
I do not necessarily need to create a bash script, but I need to automate this process so that other processes could call it and pass arguments.
Don't use eval here; a simple function will suffice:
filediff() {
    sort "$1".csv > "$1".csv.sorted
    sort "$2".csv > "$2".csv.sorted
    diff --speed-large-files "$1".csv.sorted "$2".csv.sorted > "$3".difftmp
    rm "$1".csv.sorted
    rm "$2".csv.sorted
}
As Tom Fenech suggested, you can also use process substitution and avoid creating temporary files.
As you're using bash, you can take advantage of process substitution:
#!/bin/bash
diff --speed-large-files <(sort "$1") <(sort "$2")
You can pass the two file names to the script as arguments. This avoids the creation of temporary files and the need for manual cleanup.
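Usage might then look like this (a sketch; filediff.sh is just an assumed name for the script above):
./filediff.sh file1.csv file2.csv > file3.difftmp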
I think you could just do
k=$(s)
So k is a variable where the result returned from the command s is stored. If you want to use a string as the command, do $("cmd").

Read from STDIN and output to a file

I'm having trouble with what I thought would be a very basic script, but which has turned out to be more complicated than I imagined. I want to read data from STDIN and then write the data out to a file.
After much mucking about, I have a script which kind of works; it seems to work fine for text files (at least the MD5 sums match) but creates an unparseable file if you try it with a JPEG image.
# Start with a clean slate
rm file1
# http://unix.stackexchange.com/q/194388/5769
IFS=
#while read -r -N 8192 data; do
while read -r -N 40 data; do # Reduced bytesize for debugging
echo -n "$data" >> file1
done;
# Some data still remains because of how 'read' uses exit codes
echo -n "$data" >> file1
And the usage*:
$ curl -s "http://loripsum.net/api/plaintext/5/" | ./save.sh # Sucess
$ curl -s "http://lorempixel.com/400/200/food/" | ./save.sh # Failure: No error messages, but the file can't be opened with an image viewer
What's wrong with my code, and why doesn't it work for binary files?
* Yes, in this example, I could just use > to redirect the data directly to a file, but I'm eventually using this code to save POST data from an HTTP form coming in from busybox's httpd through STDIN.
If you want to accept from STDIN and output to a file, this works pretty well...
#!/usr/bin/bash
cat >file1
echo does not deal with binary data properly. See this answer for details. You may be better advised to use a scripting language like perl if you want to do anything more than simple redirection.
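For the busybox httpd POST case mentioned in the question, a CGI handler can usually avoid a read loop entirely, since the server sets CONTENT_LENGTH for POST requests. A minimal sketch, assuming a CGI context (the /tmp/upload.bin path is made up):
#!/bin/sh
# Copy exactly CONTENT_LENGTH bytes of the request body from STDIN to a file,
# byte for byte, so binary uploads survive intact (head -c is in busybox too).
head -c "$CONTENT_LENGTH" > /tmp/upload.bin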

read multiple files in bash

I have two .txt files that I want to read line by line, simultaneously, in a .sh script. Both .txt files have the same number of lines. Inside the loop I want to use the sed command to change the full_sample_name and sample_name in another file.
I know how this works if you just read one file, but I cannot get it work for two files.
#! /bin/bash
FULL_SAMPLE="file1.txt"
SAMPLE="file2.txt"
while read ... && ...
do
sed -e "s/\<full_sample_name\>/$FULL_SAMPLE/g" -e "s/\<sample_name\>/$SAMPLE/g" pipeline.sh > $SAMPLE.sh
done < ...?
Charles provided a very good answer.
You could use paste to join the lines of the files with some delimiter (that shouldn't appear in the files):
paste -d ":" file1.txt file2.txt | while IFS=":" read -r full samp; do
do_stuff_with "$full" and "$samp"
done
#!/bin/bash
full_sample_file="file1.txt"
sample_file="file2.txt"
while read -r -u 3 full_sample_name && read -r -u 4 sample_name; do
    sed -e "s/\<full_sample_name\>/$full_sample_name/g" \
        -e "s/\<sample_name\>/$sample_name/g" \
        pipeline.sh >"$sample_name.sh"
done 3<"$full_sample_file" 4<"$sample_file" # automatically closed on loop exit
In this case, I'm assigning file descriptor 3 to file1.txt and file descriptor 4 to file2.txt.
By the way, with bash 4.1 or newer, you no longer need to assign file descriptor numbers manually:
# opening explicitly, since even if opened on the loop, these need
# to be explicitly closed.
exec {full_sample_fd}<file1.txt
exec {sample_fd}<file2.txt
while read -r -u "$full_sample_fd" full_sample_name \
&& read -r -u "$sample_fd" sample_name; do
: do stuff here with "$full_sample_name" and "$sample_name"
done
# close the files explicitly
exec {full_sample_fd}>&- {sample_fd}>&-
One more note: you could make this a bit more efficient, and also more correct, by not using sed at all; just read the file to be converted into a shell variable and do the replacements there. (It is only "more correct" if your sample_name and full_sample_name values aren't guaranteed to evaluate to themselves when interpreted as regular expressions, if your input file contains no literal NULs [which, as a shell script, it shouldn't], and if the angle brackets are intended to be literal rather than word-boundary regex characters.)
exec {full_sample_fd}<file1.txt
exec {sample_fd}<file2.txt
IFS= read -r -d '' input_file <pipeline.sh
while read -r -u "$full_sample_fd" full_sample_name \
&& read -r -u "$sample_fd" sample_name; do
output=${input_file//'<full_sample_name>'/${full_sample_name}}
output=${output//'<sample_name>'/${sample_name}}
printf '%s' "$output" >"${sample_name}.sh"
done
# close the files explicitly
exec {full_sample_fd}>&- {sample_fd}>&-
With GNU Parallel it will look like this:
#! /bin/bash
do_sed() {
    sed -e "s/\<full_sample_name\>/$1/g" -e "s/\<sample_name\>/$2/g" pipeline.sh > "$2".sh
}
export -f do_sed
parallel --xapply do_sed {1} {2} :::: file1.txt file2.txt
The added benefit is that the jobs are run in parallel. Depending on your storage system this may speed up the processing: on a RAID6 I have seen a 6x speedup by running 10 jobs in parallel. YMMV, so the only way to know for sure is to test and measure.
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU.
GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
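As a rough illustration of the difference (a sketch; run_job is a hypothetical stand-in for one of the 32 jobs):
# Static split: 4 background batches of 8 jobs each; one slow job stalls
# the rest of its batch while the other CPUs sit idle.
for cpu in 0 1 2 3; do
    ( for i in $(seq $((cpu * 8 + 1)) $(( (cpu + 1) * 8 ))); do run_job "$i"; done ) &
done
wait

# GNU Parallel starts the next job as soon as any of the 4 slots frees up:
seq 32 | parallel -j 4 run_job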
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Bash: replace part of filename

I have a command I want to run on all of the files of a folder, and the command's syntax looks like this:
tophat -o <output_file> <input_file>
What I would like to do is a script that loops over all the files in an arbitrary folder and also uses the input file names to create similar, but different, output file names. The file names looks like this:
input name               desired output name
path/to/sample1.fastq    path/to/sample1.bam
path/to/sample2.fastq    path/to/sample2.bam
Getting the input to work seems simple enough:
for f in *.fastq
do
    tophat -o <output_file> $f
done
I tried using output=${f,.fastq,.bam} and using that as the output parameter, but that doesn't work. All I get is an error: line 3: ${f,.fastq,.bam}: bad substitution. Is this the way to do what I want, or should I do something else? If it's the correct way, what am I doing wrong?
[EDIT]:
Thanks for all the answers! A bonus question, though... What if I have files named like this, instead:
path/to/sample1_1.fastq
path/to/sample1_2.fastq
path/to/sample2_1.fastq
path/to/sample2_2.fastq
...
... where I can have an arbitrary number of samples (sampleX), but all of them have two files associated with them (_1 and _2). The command now looks like this:
tophat -o <output_file> <input_1> <input_2>
So, there's still just the one output, for which I could do something like "${f/_[1-2].fastq/.bam}", but I'm unsure how to get a loop that only iterates once over every sampleX at the same time as taking both the associated files... Ideas?
[EDIT #2]:
So, this is the final script that did the trick!
for f in *_1.fastq
do
    tophat -o "${f/_1.fastq/.bam}" "$f" "${f/_1.fastq/_2.fastq}"
done
You can use:
tophat -o "${f/.fastq/.bam}" "$f"
Testing:
f='path/to/sample1.fastq'
echo "${f/.fastq/.bam}"
path/to/sample1.bam
Not an answer but a suggestion: as a bioinformatician, you should use GNU make and its option -j (number of parallel jobs). The Makefile would be:
.PHONY: all
FASTQS=$(shell ls *.fastq)
%.bam: %.fastq
	tophat -o $@ $<
all: $(FASTQS:.fastq=.bam)
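A usage sketch, where -j sets how many tophat jobs make runs at once:
make -j 4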
Alternative to anubhava's concise solution,
d=$(dirname path/to/sample1.fastq)
b=$(basename path/to/sample1.fastq .fastq)
echo "$d/$b.bam"
path/to/sample1.bam
tophat -o "$d/$b.bam" "$f"

Prepend header to file without changing the file

Background
The enscript command can apply syntax highlighting to various types of source files, including SQL statements, shell scripts, PHP code, HTML files, and more. I am using enscript to generate 300dpi images of source code for a technical manual to:
Generate content for the book based on actual source code.
Distribute the source code along with the book, without any modification.
Run and test the scripts while writing the book.
Problem
The following shell script performs the conversion almost as desired:
#!/bin/bash
DIRNAME=$(dirname $1)
FILENAME=$(basename $1)
# Remove the extension from the filename.
BASENAME=${FILENAME%%.*}
FILETYPE=${FILENAME##*.}
LIGHTGRAY="#f3f3f3"
enscript --escapes --color -f Courier10 -X ps -B -1 --highlight=$FILETYPE \
$2 -h -o - $1 | \
gs -dSAFER -sDEVICE=pngalpha -dGraphicsAlphaBits=4 -dNOPAUSE -r300 \
-sOutputFile=$BASENAME.png -dBackgroundColor=16$LIGHTGRAY > /dev/null && \
convert -trim $BASENAME.png $BASENAME-trimmed.png && \
mv $BASENAME-trimmed.png $BASENAME.png
The problem is that the background is not a light gray colour. According to the enscript man page, the --escapes (-e) option indicates that the file (i.e., $1) has enscript-specific control sequences embedded within it.
Adding the control sequences means having to duplicate code, which defeats the purpose of having a single source.
Solution
The enscript documentation implies that it should be possible to concatenate two files together (the target and a "header") before running the script, to create a third file:
^#shade{0.85} -- header line
#!/bin/bash -- start of source file
Then delete the third file once the command completes.
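That temporary "third file" approach can be scripted directly (a minimal sketch; combined.tmp is an assumed name):
# Build the third file: escape header followed by the unmodified source.
printf '^#shade{0.85}\n' > combined.tmp
cat "$1" >> combined.tmp
# ... run the enscript | gs | convert pipeline above against combined.tmp ...
rm combined.tmp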
Questions
Q.1. What is a more efficient way to pipe the control sequences and the source file to the enscript program without using a third file?
Q.2. What other options are available to automate syntax highlighting for a book, while honouring the single source requirements I have described? (For example, write the book in LyX and use LaTeX commands for import and syntax highlighting.)
Q1 You can use braces '{}' to do I/O redirection:
{ echo "^#shade{0.85}"; cat $1; } |
enscript --color -f Courier10 -X ps -B -1 --highlight=$FILETYPE $2 -h -o - |
gs -dSAFER -sDEVICE=pngalpha -dGraphicsAlphaBits=4 -dNOPAUSE -r300 \
-sOutputFile=$BASENAME.png -dBackgroundColor=16$LIGHTGRAY > /dev/null &&
convert -trim $BASENAME.png $BASENAME-trimmed.png &&
mv $BASENAME-trimmed.png $BASENAME.png
This assumes that enscript reads its standard input when not given an explicit file name; if not, you may need to use an option (perhaps '-i -') or some more serious magic, possibly even 'process substitution' in bash.
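If enscript does turn out to need a file name, process substitution can stand in for one (a sketch under that assumption, with --escapes kept from the original script so the header line is interpreted; the gs and convert stages stay the same):
enscript --escapes --color -f Courier10 -X ps -B -1 --highlight="$FILETYPE" $2 -h -o - \
    <(echo "^#shade{0.85}"; cat "$1")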
You could also use parentheses to run a sub-shell:
(echo "^#shade{0.85}"; cat $1) | ...
Note that the semi-colon after cat is necessary with braces and not necessary with parentheses (and a space is necessary after the open brace) - such are the mysteries of shell scripting.
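A quick way to see the two forms side by side (a sketch; wc -l just stands in for the rest of the pipeline):
{ echo '^#shade{0.85}'; cat "$1"; } | wc -l    # brace group: space after '{' and ';' before '}' required
(echo '^#shade{0.85}'; cat "$1") | wc -l       # subshell: neither is required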
Q2 I don't have any alternatives to offer. When I produced a book (20 years ago now, using troff), I wrote a program to convert source into the necessary markup, so that the book was produced from the source code, but by an automated process.
(Is 300 dpi sufficiently high resolution?)
Edit
To work around the enscript program interpreting the escape sequence embedded in the conversion script itself, keep the header line in a separate file and cat it in front of the source instead of echoing it:
{ cat ../../enscript-header.txt $1; } |
Q2: Use LaTeX with the listings package.
