Piping Image stream to command line programs in Bash - bash

I am trying to pipe a JPEG file from ImageMagick's "convert" to other command-line programs in order to run a benchmark. Is there a way to force a program that doesn't have such built-in functionality to read from a pipe instead of reading the file from disk?
An example of a program that has such built-in functionality is "CJPEG":
convert INFILE.JPG tga:- | cjpeg -outfile OUTFILE.JPG -targa
An example of a program that doesn't:
jpegoptim --dest OUTFOLDER INFILE.JPG
Ideally it would work like this (but it doesn't):
convert INFILE.JPG jpg:- | jpegoptim --dest OUTFOLDER -
I managed to do this though:
BASE64_IMG=$(convert INFILE.JPG jpg:- | base64)
And:
JPG<(echo "${BASE64_IMG}" | base64 --decode) /dev/stdin | awk '{print $1}'
But I don't know what to do with that...
Thanks for any help.

You can use process substitution, if the reading program doesn't expect to be able to seek to random positions within the input. (That is, it can only read continuously from the beginning of the file.)
The incorrect version (programs need to be written to accept - as a filename meaning standard input; it's not something the shell does for you):
convert INFILE.JPG jpg:- | jpegoptim --dest OUTFOLDER -
becomes
jpegoptim --dest OUTFOLDER <(convert INFILE.JPG jpg:-)
The <(...) is special bash syntax to simplify the use of a named pipe.
mkfifo input # Create a named pipe called "input"
convert INFILE.JPG jpg:- > input & # Start writing to it in the background
jpegoptim --dest OUTFOLDER input # Read from the named pipe; convert blocks until jpegoptim opens the pipe for reading
rm input # Clean up after you are done
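To see why this works, note that bash substitutes a file descriptor path for the <(...) expression, so the reading program just receives a filename it can open. A minimal sketch (here "wc -c" stands in for any program that only accepts filename arguments):

```shell
#!/bin/bash
# <(...) expands to a path like /dev/fd/63; any program that takes a
# filename can read from it, as long as it doesn't need to seek.
echo <(:)                      # prints the substituted path, e.g. /dev/fd/63
wc -c <(printf 'hello world')  # reads 11 bytes through the pipe
```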

Related

How can you use multiple input streams for input parameters?

I want to run a command line script that requires several parameters. Specifically:
perl prinseq-lite.pl -fastq file1.fq.gz -fastq2 file2.fq.gz \
    -out_good goodfile.out -out_bad badfile.out -log prin.log \
    -ns_max_n 5 ... more_params ...
The problem is that the files are zipped, and must be processed without first unzipping and storing them, because the unzipped file sizes are very large and this command will be run on a large number of files.
So what I need to do is to unzip the input on the fly. Previously, user l0b0 suggested that multiple input streams might be a solution. I have tried the following, but I seem to be passing empty input streams, as the program claims the input files are empty.
perl prinseq-lite.pl -fastq <(zcat f1.gz) -fastq2 <(zcat f2.gz) ...
perl prinseq-lite.pl -fastq 1< <(zcat f1.gz) -fastq2 2< <(zcat f2.gz) ...
So what I need to do, in short, is provide unzipped input for multiple parameters to this program.
Can someone tell me the proper way to do this, and/or what I'm doing wrong with my current attempts? Thanks in advance for your input.
Well, I think the easiest might be to make named pipes for the output of gunzip, then use those named pipes in the command:
mkfifo file1.fq file2.fq file3.fq ...
gunzip -c file1.fq.gz > file1.fq &
gunzip -c file2.fq.gz > file2.fq &
gunzip -c file3.fq.gz > file3.fq &
Then call your program with those pipes as file names:
perl prinseq-lite.pl -fastq file1.fq -fastq2 file2.fq -fastq3 file3.fq ...
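The same idea can be wrapped in a loop with automatic cleanup. This is a sketch: the .fq.gz names are placeholders from the question, and the prinseq-lite.pl call assumes the tool is on your path.

```shell
#!/bin/bash
# Keep the named pipes in a temporary directory and remove them on exit.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT

for f in file1.fq.gz file2.fq.gz; do
    pipe="$tmpdir/${f%.gz}"        # file1.fq.gz -> $tmpdir/file1.fq
    mkfifo "$pipe"
    gunzip -c "$f" > "$pipe" &     # decompress in the background
done

perl prinseq-lite.pl -fastq "$tmpdir/file1.fq" -fastq2 "$tmpdir/file2.fq"
```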

ImageMagick convert tiffs to pdf with sequential file suffix

I have the following scenario and I'm not much of a coder (nor do I know bash well). I don't even have a base working bash script to share, so any help would be appreciated.
I have a file share that contains tiffs (thousands) from a document management system. The goal is to convert and combine documents stored as multiple tiff files into single-file pdfs (preferably in PDF/A-1a format).
The directory format:
/Document Management Root # This is root directory
./2009/ # each subdirectory represents a year
./2010/
./2011/
....
./2016/
./2016/000009.001
./2016/000010.001
# files are stored flat - just thousands of files per year directory
The document management system stores tiffs with sequential number file names along with sequential file suffixes:
000009.001
000010.001
000011.002
000012.003
000013.001
Where each page of a document is represented by the suffix. The suffix restarts when a new, non-related document is created. In the example above, 000009.001 is a single page tiff. Files 000010.001, 000011.002, and 000012.003 belong to the same document (i.e. the pages are all related). File 000013.001 represents a new document.
I need to preserve the file name for the first file of a multipage document so that the filename can be cross referenced with the document management system database for metadata.
The pseudo code I've come up with is:
for each file in {tiff directory}
while file extension is "001"
convert file to pdf and place new pdf file in {pdf directory}
else
convert multiple files to pdf and place new pdf file in {pdf directory}
But this seems like it will have the side effect of converting all 001 files regardless of what the next file is.
Any help is greatly appreciated.
EDIT - Both answers below work. The second answer worked, however it was my mistake in not realizing that the data set I tested against was different than my scenario above.
So, save the following script in your login ($HOME) directory as TIFF2PDF
#!/bin/bash
ls *[0-9] | awk -F'.' '
/001$/ { if(NR>1)print cmd,outfile; outfile=$1 ".pdf"; cmd="convert " $0;next}
{ cmd=cmd " " $0}
END { print cmd,outfile}'
and make it executable (necessary just once) by going in Terminal and running:
chmod +x TIFF2PDF
Then copy a few documents from any given year into a temporary directory to try things out... then go to the directory and run:
~/TIFF2PDF
Sample Output
convert 000009.001 000009.pdf
convert 000010.001 000011.002 000012.003 000010.pdf
convert 000013.001 000013.pdf
If that looks correct, you can actually execute those commands like this:
~/TIFF2PDF | bash
or, preferably if you have GNU Parallel installed:
~/TIFF2PDF | parallel
The script says... "Generate a listing of all files whose names end in a digit and send that list to awk. In awk, use the dot as the separator between fields, so if the file is called 000011.002, then $0 will be 000011.002, $1 will be 000011 and $2 will be 002. Now, if the filename ends in 001, print the accumulated command and append the output filename. Then save the filename prefix with a .pdf extension as the output filename of the next PDF and start building up the next ImageMagick convert command. On subsequent lines (which don't end in 001), add the filename to the list of filenames to include in the PDF. At the end, output any accumulated command and append the output filename."
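The same awk program with that bookkeeping spelled out as comments (functionally identical to the TIFF2PDF script above):

```shell
#!/bin/bash
ls *[0-9] | awk -F'.' '
/001$/ {                             # a .001 file starts a new document
    if (NR > 1) print cmd, outfile   # so flush the previous convert command
    outfile = $1 ".pdf"              # the PDF is named after the first page
    cmd = "convert " $0              # begin a new command with this page
    next
}
{ cmd = cmd " " $0 }                 # .002, .003, ...: append to the page list
END { print cmd, outfile }           # flush the final command
'
```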
As regards the ugly black block at the bottom of your image, it happens because there are some tiny white specks in there that prevent ImageMagick from removing the black area. I have circled them in red:
If you blur the picture a little (to diffuse the specks) and then get the size of the trim-box, you can apply that to the original, unblurred image like this:
trimbox=$(convert original.tif -blur x2 -bordercolor black -border 1 -fuzz 50% -format %@ info:)
convert original.tif -crop $trimbox result.tif
I would recommend you do that first to A COPY of all your images, then run the PDF conversion afterwards. As you will want to save a TIFF file but keep extensions like .001, .002, you will need to tell ImageMagick to trim and force the output filetype to TIF:
original=XYZ.001
trimbox=$(convert $original -blur x2 -bordercolor black -border 1 -fuzz 50% -format %@ info:)
convert $original -crop $trimbox TIF:$original
As @AlexP. mentions, there can be issues with globbing if there is a large number of files. On OSX, ARG_MAX is very high (262144) and your filenames are around 10 characters, so you may hit problems if there are more than around 26,000 files in one directory. If that is the case, simply change:
ls *[0-9] | awk ...
to
ls | grep "[0-9]$" | awk ...
The following command would convert the whole /Document Management Root tree (assuming it's the actual absolute path) properly processing all subfolders even with names including whitespace characters and properly skipping all other files not matching the 000000.000 naming pattern:
find '/Document Management Root' -type f -regextype sed -regex '.*/[0-9]\{6\}\.001$' -exec bash -c 'p="{}"; d="${p:0: -10}"; n=${p: -10:6}; m=10#$n; c[1]="$d$n.001"; for i in {2..999}; do k=$((m+i-1)); l=$(printf "%s%06d.%03d" "$d" $k $i); [[ -f "$l" ]] || break; c[$i]="$l"; done; echo -n "convert"; printf " %q" "${c[@]}" "$d$n.pdf"; echo' \; | bash
To do a dry run just remove the | bash in the end.
Updated to match the 00000000.000 pattern (and split to multiple lines for clarity):
find '/Document Management Root' -type f -regextype sed -regex '.*/[0-9]\{8\}\.001$' -exec bash -c '
pages[1]="{}"
p1num="10#${pages[1]: -12:8}"
for i in {2..999}; do
nextpage=$(printf "%s%08d.%03d" "${pages[1]:0: -12}" $((p1num+i-1)) $i)
[[ -f "$nextpage" ]] || break
pages[i]="$nextpage"
done
echo -n "convert"
printf " %q" "${pages[#]}" "${pages[1]:0: -3}pdf"
echo
' \; | bash

Send output from `split` utility to stdout

From this question, I found the split utility, which takes a file and splits it into evenly sized chunks. By default, it outputs these chunks to new files, but I'd like to get it to output them to stdout, separated by a newline (or an arbitrary delimiter). Is this possible?
I tried cat testfile.txt | split -b 128 - /dev/stdout
which fails with the error split: /dev/stdoutaa: Permission denied.
Looking at the help text, it seems this tells split to use /dev/stdout as a prefix for the filename, not to write to /dev/stdout itself. It does not indicate any option to write directly to a single file with a delimiter. Is there a way I can trick split into doing this, or is there a different utility that accomplishes the behavior I want?
It's not clear exactly what you want to do, but perhaps the --filter option to split will help out:
--filter=COMMAND
write to shell COMMAND; file name is $FILE
Maybe you can use that directly. For example, this will read a file 10 bytes at a time, passing each chunk through the tr command:
split -b 10 --filter "tr [:lower:] [:upper:]" afile
If you really want to emit a stream on stdout that has separators between chunks, you could do something like:
split -b 10 --filter 'dd 2> /dev/null; echo ---sep---' afile
If afile is a file in my current directory that looks like:
the quick brown fox jumped over the lazy dog.
Then the above command will result in:
the quick ---sep---
brown fox ---sep---
jumped ove---sep---
r the lazy---sep---
dog.
---sep---
From the info page:
`--filter=COMMAND'
With this option, rather than simply writing to each output file,
write through a pipe to the specified shell COMMAND for each
output file. COMMAND should use the $FILE environment variable,
which is set to a different output file name for each invocation
of the command.
split -b 128 --filter='cat ; echo ' inputfile
Here is one way of doing it. You will get each 128-character chunk in the variable "var".
You may use your preferred delimiter to print or use it for further processing.
#!/bin/bash
while read -r -n 128 var ; do
    printf '\n%s' "$var"
done < yourTextFile
You may use it as below at command line:
while read -r -n 128 var ; do printf '\n%s' "$var" ; done < yourTextFile
No, the utility will not write anything to standard output. Its standard specification says specifically that standard output is not used.
If you used split, you would need to concatenate the created files, inserting a delimiter in between them.
If you just want to insert a delimiter every N th line, you may use GNU sed:
$ sed '0~3a\-----' file
This inserts a line containing ----- every 3rd line.
To divide the file into chunks, separated by newlines, and write to stdout, use fold:
fold -w 128 yourfile.txt
...will write to stdout in "chunks" of 128 chars.
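A toy demonstration of the chunking, using a width of 4 instead of 128:

```shell
# fold inserts a newline after every N characters of input
printf 'abcdefghij' | fold -w 4
# prints:
# abcd
# efgh
# ij
```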

Modify file in memory while keeping directory

Is there a way to modify the contents of a file before a command receives it while maintaining its directory?
mpv 'https://example.com/directory/file.playlist'
but use sed to modify the contents in memory before it is read by mpv?
The issue is I can't just read the file straight in, it must maintain the directory it is in because the files in the playlist are relative to that directory.
I just need to replace .wav with .flac.
Generally you can use process substitution:
mplayer <(curl 'http://...' | sed 's/\.wav/.flac/')
However, mplayer supports the special option - (hyphen) for the filename argument which means read the file from stdin. This allows you to use a pipe:
curl 'http://...' | sed 's/\.wav/.flac/' | mplayer -
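A toy run of the substitution step, with "cat -" standing in for "mplayer -" as the stdin reader. Note that without the /g flag sed replaces only the first match per line, which is enough for a playlist with one path per line:

```shell
# rewrite .wav entries to .flac before the player sees them
printf 'track1.wav\ntrack2.wav\n' | sed 's/\.wav/.flac/' | cat -
# prints:
# track1.flac
# track2.flac
```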
So far I'm using this to achieve what I need, but it's not exactly ideal in that I lose my playlist control.
ssh example.com "tar czpf - 'files/super awesome music directory'" | tar xzpf - -O | mpv -

Regexp lines from file and run command

I have a file with output from the identify command, looks like this (following format: FILENAME FORMAT SIZE METADATA)
/foo/bar.jpg JPEG 2055x1381 2055x1381+0+0 8-bit DirectClass
/foo/ham spam.jpg JPEG 855x781 855x781+0+0 8-bit DirectClass
...
Note that the filenames can contain spaces! What I want to do is to basically run this on each of those lines:
convert -size <SIZE> -colors 1 xc:black <FILENAME>
In other words, creating blank images of existing ones. I've tried doing this with cat/sed/xargs but it's making my head explode. Any hints? And preferably a command-line solution..
Assuming that the filename is the string before " JPEG":
LINE="/foo/ham spam.jpg JPEG 855x781 855x781+0+0 8-bit DirectClass"
You can get file name as:
FILENAME=$(echo "$LINE" | sed 's/\(.*\) JPEG.*/\1/')
cat data_file | sed -e 's/\(.*\) JPEG \([^ ]*\) .*/convert -size \2 -colors 1 xc:black "\1"/' | bash
You can do what Michał suggests. Also, if the metadata has a fixed number of words, you could do this easily like the following (supposing you process every line):
FILENAME=`echo "$LINE" | rev | cut -d' ' -f6- | rev`
(that is, reverse the line, and take the name from the sixth parameter on, then you have to reverse to obtain the filename proper.)
If not, you can use the fact that all the images have an extension and that the extension itself doesn't have spaces, and search for the extension till the first space afterwards:
FILENAME=`echo "$LINE" | sed -e 's/\(.*\.[^ ]*\) .*$/\1/'`
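If you would rather not pipe generated commands into bash, the same parsing can drive a plain loop. This is a sketch: it assumes the identify output lives in "data_file" as above, with exactly one " JPEG " token per line:

```shell
#!/bin/bash
# Extract FILENAME and SIZE from each identify line, then call convert.
while IFS= read -r line; do
    filename=$(printf '%s\n' "$line" | sed 's/\(.*\) JPEG.*/\1/')
    size=$(printf '%s\n' "$line" | sed 's/.* JPEG \([^ ]*\) .*/\1/')
    convert -size "$size" -colors 1 xc:black "$filename"
done < data_file
```

Quoting "$size" and "$filename" keeps filenames with spaces intact without any extra escaping.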