Compress Makefile Intermediates (Two ways to create same target)

I have a Makefile I'm currently using for purposes other than compiling. Instead of deleting intermediate files, I'd like to keep them but gzip them, and later have Make detect that a compressed intermediate exists and simply unzip it instead of recomputing it.
Let's suppose I have a target target.txt that depends on an intermediate file called intermediate.txt, which itself depends on prereq.txt. So something like:
target.txt: intermediate.txt
intermediate.txt: prereq.txt
By default Make deletes the intermediate file, but computing intermediate.txt takes a long time, so I'll disable automatic deletion. However, intermediate.txt is also very large, so I'd like to compress it (gzip) to intermediate.txt.gz. Instead of recomputing the file, I'd like Make to unzip the existing compressed file, i.e. gunzip intermediate.txt.gz.
The larger question, I suppose, is this: I have two ways of making a target, based on two different prerequisites. I'd like Make to execute the rule whose prerequisite exists and ignore the other rule, but delete the zipped version and recompute it if the prerequisite of the intermediate has a newer timestamp. Does anyone have any suggestions?

If you are using GNU Make, you can do this with pattern rules (tac is used to represent whatever processing you're doing):
%.txt: %.i.txt
	tac $^ > $@   # make the .txt file the normal way
	gzip $^       # gzip the intermediate file
%.txt: %.i.txt.gz
	gunzip < $^ | tac > $@   # make .txt by streaming in the gzipped intermediate
%.i.txt: %.p.txt
	tac $^ > $@   # make the intermediate file from the prereq
This works for pattern rules because if the .i.txt file is not found, Make falls through to the next pattern and looks for the .i.txt.gz version. This does not work for explicit rules, because later rules simply replace earlier rules.

I would guess that you do NOT want to just uncompress the .gz version if prereq.txt is newer than the gzipped file. In that case, I would tend to just use shell tests to store off and restore the gzipped file and not get make directly involved:
target.txt: intermediate.txt
	...same as it ever was ....
intermediate.txt: prereq.txt
	if [ $@.gz -nt $< ]; then \
		gunzip <$@.gz >$@; \
	else \
		whatever >$@ <$<; \
		gzip <$@ >$@.gz; \
	fi
where 'whatever' is the command that creates intermediate.txt from prereq.txt
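To also reclaim the disk space of the uncompressed copy, the shell test above can be combined with GNU Make's .INTERMEDIATE special target. This is only a sketch under assumptions: `process` is a hypothetical placeholder for whatever builds target.txt, `whatever` is the placeholder from the answer above, and `[ -nt ]` is assumed to be supported by your /bin/sh (it is in bash and dash).

```makefile
# Sketch only: `process` and `whatever` are placeholders for the real commands.
target.txt: intermediate.txt
	process < $< > $@

# Reuse the compressed copy if it is newer than prereq.txt,
# otherwise recompute the intermediate and refresh the .gz file.
intermediate.txt: prereq.txt
	if [ $@.gz -nt $< ]; then \
		gunzip < $@.gz > $@; \
	else \
		whatever < $< > $@ && gzip < $@ > $@.gz; \
	fi

# Make deletes intermediate.txt at the end of the run that built target.txt.
# intermediate.txt.gz is not a make target, so it stays on disk.
.INTERMEDIATE: intermediate.txt
```

With this arrangement a missing intermediate.txt does not by itself trigger a rebuild; the rule only runs again when target.txt is missing or older than prereq.txt.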

Related

Processing multiple files generated from single input

I have a data file that is processed by a script to produce multiple output files. Each of these output files is then processed further. Which files are created depends on the contents of the input file, so I can't list them explicitly. I can't quite figure out how to refer to the various files that are generated in a makefile.
Currently, I have something like this:
final.out: *.out2
	merge_files final.out $(sort $^)
%.out2: %.out1
	convert_files $?
%.out1: data.in
	extract_data data.in
This fails with No rule to make target '*.out2', needed by 'final.out'. I assume this is because the .out2 files don't exist yet and therefore the wildcard expression isn't replaced the way I would like it to. I have tried to use the wildcard function but that fails because the list of prerequisites ends up being empty.
Any pointers would be much appreciated.

EDIT: fixed the list of prerequisites in second pass.
You apparently cannot compute the list of intermediate files before running the extract_data command. In that case, one solution is to run make twice: a first pass to generate the *.out1 files and a second pass to finish the job. You can use an empty dummy file to mark whether the extract_data command needs to be run again:
ifeq ($(FIRST_PASS_DONE),)
final.out: .dummy
	$(MAKE) FIRST_PASS_DONE=yes
.dummy: data.in
	extract_data $<
	touch $@
else
OUT1 := $(wildcard *.out1)
OUT2 := $(patsubst %.out1,%.out2,$(OUT1))
final.out: $(OUT2)
	merge_files $@ $(sort $^)
%.out2: %.out1
	convert_files $?
endif

Unfortunately your question is missing some details that I would ask about immediately if a software developer presented this makefile for review:
does extract_data provide the list of files?
does convert_files convert one file or multiple? The example seems to imply that it converts multiple.
then I have to question the decision to break up extract, convert and merge into separate rules, as you will not benefit from parallel builds anyway
The following is the approach I would choose. I'm going to use a tar file as an example of an input file that results in multiple output files:
generate a makefile fragment for the sorted list of files
  use the tar option v to print files while they are extracted
  convert each line into a makefile variable assignment
include the fragment to define $(DATA_FILES)
  if the fragment needs to be regenerated, make will restart after it has generated it
use a static pattern rule for the conversion
use the converted file list as dependency for the final target
# set -o pipefail needs bash; make's default shell is /bin/sh
SHELL := /bin/bash
.PHONY: all
all: final.out
# extract files and create a sorted list of files in $(DATA_FILES)
Makefile.data_files: data.tar
	set -o pipefail; tar xvf $< | sort | sed 's/^/DATA_FILES += /' >$@
DATA_FILES :=
include Makefile.data_files
CONVERTED_FILES := $(DATA_FILES:%.out1=%.out2)
$(CONVERTED_FILES): %.out2: %.out1
	convert_files $< >$@
final.out: $(CONVERTED_FILES)
	merge_files final.out $^
UPDATE: if extract_data doesn't provide the list of files, you could modify my example like this. Of course, this relies on there being no other files matching *.out1 in your directory.
# extract files and create a sorted list of files in $(DATA_FILES)
Makefile.data_files: data.in
	set -o pipefail; \
	extract_data $< && \
	(ls *.out1 | sort | sed 's/^/DATA_FILES += /') >$@

why does "make" delete target files only if implicit

Suppose I have a Makefile like this
B1.txt: A1.txt
	python big_long_program.py A1.txt > $@
correct1.txt: B1.txt reference.txt
	diff -q B1.txt reference.txt
	touch $@
Then the output when I make correct1.txt is pretty well what I would expect:
python big_long_program.py A1.txt > B1.txt
diff -q B1.txt reference.txt
touch correct1.txt
Now if I have lots of files (B1.txt, B2.txt, B3.txt, etc.), I create an implicit rule:
B%.txt: A%.txt
	python big_long_program.py A$*.txt > $@
correct%.txt: B%.txt reference.txt
	diff -q B$*.txt reference.txt
	touch $@
Instead this happens when I make correct1.txt:
python big_long_program.py A1.txt > B1.txt
diff -q B1.txt reference.txt
touch correct1.txt
rm B1.txt
i.e. the difference is that now the file B1.txt has been deleted, which in many cases is really bad.
So why are implicit rules different? Or am I doing something wrong?

You are not doing anything wrong. The behavior you observe and analyze is documented in 10.4 Chains of Implicit Rules. It states that intermediate files are indeed treated differently.
The second difference is that if make does create b in order to update
something else, it deletes b later on after it is no longer needed.
Therefore, an intermediate file which did not exist before make also
does not exist after make. make reports the deletion to you by
printing a rm -f command showing which file it is deleting.
The documentation does not explicitly explain why it behaves like this. Looking in the file ChangeLog.1, there is a reference to the remove_intermediates function as far back as 1988. At that time, disk space was expensive and at a premium.
If you do not want this behavior, either mention the targets you want to keep somewhere in the makefile as an explicit prerequisite or target, or use the .PRECIOUS or .SECONDARY special built-in targets.
With thanks to MadScientist for the additional comments, see below.
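As a minimal sketch of that advice applied to the makefile from the question (the rules are copied from the question; only the special targets are added):

```makefile
# Option 1: with no prerequisites, .SECONDARY treats every target as
# secondary, so make never auto-deletes intermediate files.
.SECONDARY:

# Option 2 (alternative): preserve only files built by the B%.txt rule.
# The pattern must match the target pattern of the implicit rule exactly.
# .PRECIOUS: B%.txt

B%.txt: A%.txt
	python big_long_program.py A$*.txt > $@

correct%.txt: B%.txt reference.txt
	diff -q B$*.txt reference.txt
	touch $@
```

With either option, `make correct1.txt` should no longer end with `rm B1.txt`.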

How to write a (GNU)makefile with output different than the target?

I have a script that takes in a filename and generates multiple files with the same name but different extensions. I want to write a makefile that depends on the files generated with different extensions but only specifies the filename. I have a dummy example to explain it below.
test_output: test_input
	genereate.py -i $^ -o $@
The above makefile dependency generates multiple files with same filename but different extension, but won't generate the actual target. For example, it generates
test_output.a test_output.b test_output.c
The way it's written above is not efficient: since the actual target is never created, the recipe runs even when the output is already there.
How would I write the makefile so that it takes the target name (test_output) but actually depends on the output files it generates, like test_output.a or any of the others?

If you use GNU make (you didn't say) you can use pattern rules to tell make about a rule that generates multiple outputs based on a single stem. So for example you can write:
%.a %.b %.c : test_input
	genereate.py -i $^ -o $*
(it would work a lot better if the input filename was related to the output filenames with the same stem, but the above will work although you'll have to write a different one for each input file).
Typically that's what you want, so that other targets that need these outputs can depend on them.
If you really want to have a target without any extension as well, just create it:
test_output : test_output.a test_output.b test_output.c
%.a %.b %.c : test_input
	genereate.py -i $^ -o $*
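If your GNU Make is version 4.3 or later, grouped explicit targets (the `&:` separator) are another way to say that one recipe invocation produces all three files. This is a sketch that assumes that version; it reuses the file names and the genereate.py call from the question, and marks test_output as phony because no file of that name is ever created.

```makefile
# Requires GNU Make 4.3 or later (grouped targets, "&:").
.PHONY: test_output
test_output: test_output.a test_output.b test_output.c

# One run of the recipe is understood to update all three files.
test_output.a test_output.b test_output.c &: test_input
	genereate.py -i $< -o test_output
```

Unlike the pattern rule, this form names the outputs explicitly, so you would write one such rule per input file.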

telling 'make' to ignore dependencies when the top target has been created

I'm running the following kind of pipeline:
digestA: hugefileB hugefileC
	cat $^ > $@
	rm $^
hugefileB:
	touch $@
hugefileC:
	touch $@
The targets hugefileB and hugefileC are very big and take a long time to compute (and need the power of Make). But once digestA has been created, there is no need to keep its dependencies: it deletes those dependencies to free up disk space.
Now, if I invoke 'make' again, hugefileB and hugefileC will be rebuilt, whereas digestA is already ok.
Is there any way to tell 'make' to avoid recompiling the dependencies?
NOTE: I don't want to build the two dependencies inside the rules for 'digestA'.

Use "intermediate files" feature of GNU Make:
Intermediate files are remade using their rules just like all other files. But intermediate files are treated differently in two ways.
The first difference is what happens if the intermediate file does not exist. If an ordinary file b does not exist, and make considers a target that depends on b, it invariably creates b and then updates the target from b. But if b is an intermediate file, then make can leave well enough alone. It won't bother updating b, or the ultimate target, unless some prerequisite of b is newer than that target or there is some other reason to update that target.
The second difference is that if make does create b in order to update something else, it deletes b later on after it is no longer needed. Therefore, an intermediate file which did not exist before make also does not exist after make. make reports the deletion to you by printing a rm -f command showing which file it is deleting.
Ordinarily, a file cannot be intermediate if it is mentioned in the makefile as a target or prerequisite. However, you can explicitly mark a file as intermediate by listing it as a prerequisite of the special target .INTERMEDIATE. This takes effect even if the file is mentioned explicitly in some other way.
You can prevent automatic deletion of an intermediate file by marking it as a secondary file. To do this, list it as a prerequisite of the special target .SECONDARY. When a file is secondary, make will not create the file merely because it does not already exist, but make does not automatically delete the file. Marking a file as secondary also marks it as intermediate.
So, adding the following line to the Makefile should be enough:
.INTERMEDIATE : hugefileB hugefileC
Invoking make for the first time:
$ make
touch hugefileB
touch hugefileC
cat hugefileB hugefileC > digestA
rm hugefileB hugefileC
And the next time:
$ make
make: `digestA' is up to date.

If you mark hugefileB and hugefileC as intermediate files, you will get the behavior you want:
digestA: hugefileB hugefileC
	cat $^ > $@
hugefileB:
	touch $@
hugefileC:
	touch $@
.INTERMEDIATE: hugefileB hugefileC
For example:
$ gmake
touch hugefileB
touch hugefileC
cat hugefileB hugefileC > digestA
rm hugefileB hugefileC
$ gmake
gmake: `digestA' is up to date.
$ rm -f digestA
$ gmake
touch hugefileB
touch hugefileC
cat hugefileB hugefileC > digestA
rm hugefileB hugefileC
Note that you do not need the explicit rm $^ command anymore -- gmake automatically deletes intermediate files at the end of the build.

I would recommend you to create pseudo-cache files that are created by the hugefileB and hugeFileC targets.
Then have digestA depend on those cache files, because you know they will not change again until you manually invoke the expensive targets.

See also .PRECIOUS:
.PRECIOUS : hugefileA hugefileB
.PRECIOUS
The targets which .PRECIOUS depends on are given the following special treatment: if make is killed or interrupted during the execution of their recipes, the target is not deleted. See Interrupting or Killing make. Also, if the target is an intermediate file, it will not be deleted after it is no longer needed, as is normally done. See Chains of Implicit Rules. In this latter respect it overlaps with the .SECONDARY special target.
You can also list the target pattern of an implicit rule (such as ‘%.o’) as a prerequisite file of the special target .PRECIOUS to preserve intermediate files created by rules whose target patterns match that file’s name.
Edit: On re-reading the question, I see that you don't want to keep the hugefiles; maybe do this:
digestA : hugefileA hugefileB
	grep '^Subject:' $^ > $@
	for n in $^; do echo > $$n; done
	sleep 1; touch $@
It truncates the hugefiles after using them, then touches the output file a second later, just to ensure that the output is newer than the input and this rule won't run again until the empty hugefiles are removed.
Unfortunately, if only the digest is removed, then running this rule will create an empty digest. You'd probably want to add code to block that.
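One possible guard, as a sketch only: refuse to build when an input has already been truncated. Note one deliberate change from the rule above: the files are truncated with `: > $$n` rather than `echo > $$n`, because `echo` leaves a one-byte file and `test -s` (true for non-empty files) would not catch it.

```makefile
# Fail early if any input is empty, i.e. was truncated by an earlier run.
digestA : hugefileA hugefileB
	@for n in $^; do \
		test -s $$n || { echo "$$n is empty; delete it so make rebuilds it" >&2; exit 1; }; \
	done
	grep '^Subject:' $^ > $@
	for n in $^; do : > $$n; done
	sleep 1; touch $@
```

If the check fails, digestA is not touched, so an existing digest is never overwritten with an empty one.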

The correct way is to not delete the files, as that removes the information that make uses to determine whether to rebuild the files.
Recreating them as empty does not help because make will then assume that the empty files are fully built.
If there is a way to merge digests, then you could create one from each of the huge files, which is then kept, and the huge file automatically removed as it is an intermediate.
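A rough sketch of that idea, assuming digests can be merged by simple concatenation as in the question. The `.digest` suffix is made up for illustration, and `wc -c` is only a stand-in for the real digest command:

```makefile
# digestA is assembled from small per-file digests that stay on disk.
digestA: hugefileB.digest hugefileC.digest
	cat $^ > $@

# Placeholder reduction: replace `wc -c` with the real digest command.
%.digest: %
	wc -c < $< > $@

hugefileB hugefileC:
	touch $@

# The huge files are deleted at the end of the run that built them,
# and are not rebuilt merely because they are missing.
.INTERMEDIATE: hugefileB hugefileC
```

Because the small .digest files remain, make still has timestamps to compare against, which is the information that deleting the huge files by hand would destroy.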

Makefile rule depending on change of number/titles of files instead of change in content of files

I'm using a makefile to automate some document generation. I have several documents in a directory, and one of my makefile rules will generate an index page of those files. The list of files itself is loaded on the fly using list := $(shell ls documents/*.txt) so I don't have to bother manually editing the makefile every time I add, delete, or rename a document. Naturally, I want the index-generation rule to trigger when number/title of files in the documents directory changes, but I don't know how to set up the prerequisites to work in this way.
I could use .PHONY or something similar to force the index-generation to run all the time, but I'd rather not waste the cycles. I tried piping ls to a file list.txt and using that as a prerequisite for my index-generation rule, but that would require either editing list.txt manually (trying to avoid it), or auto-generating it in the makefile (this changes the creation time, so I can't use list.txt in the prerequisite because it would trigger the rule every time).

If you need a dependency on the number of files, then... why not just depend on the number itself? The number will be represented as a dummy file that is created when the specified number of files is in the documents directory.
NUMBER=$(shell ls documents/*.txt | wc -l).files
# This yields name like 2.files, 3.files, etc...
# .PHONY $(NUMBER) -- NOT a phony target!
$(NUMBER):
	rm -f *.files # Remove previous trigger (-f: no error if there is none yet)
	touch $(NUMBER)
index.txt: $(NUMBER)
	...generate index.txt...
While the number of files is one property to track, you may instead depend on a hash of the directory listing. It's very unlikely that the hash will be the same for two different listings that occur in your workflow. Here's an example:
NUMBER=$(shell ls -l documents/*.txt | md5sum | sed 's/[[:space:]].*//').files
Note the use of -l: this way you'll depend on the full listing of files, which includes modification times, sizes and file names. But if you don't need that, you may drop the option.
Note: sed was needed because on my system md5sum yields some stuff after the hash-sum itself.

You can put a dependency on the directory itself for the list file, e.g.
list.txt: documents
	ls documents/*.txt 2>/dev/null > $@ || true
Every time you add or remove a file in the documents directory, the directory's timestamp will be altered and make will do the right thing.

Here's a solution that updates the index if and only if the set of files has changed:
list.txt.tmp: documents
	ls $</*.txt > $@
list.txt: list.txt.tmp
	cmp -s $< $@ || cp $< $@
index.txt: list.txt
	...generate index.txt...
Thanks to the "cmp || cp", the modification time of "list.txt" (which is what make compares) does not change unless the output of the "ls" has changed.
