How to make GNU make run in batches? - makefile

I'd like to use make to process a large number of inputs to outputs using a script (python, say.) The problem is that the script takes an incredibly short amount of time to run per input, but the initialization takes a while (python engine + library initialization.) So, a naive makefile that just has an input->output rule ends up being dominated by this initialization time. Parallelism doesn't help with that.
The Python script can accept multiple inputs and outputs, like so:
python my_process -i in1 -o out1 -i in2 -o out2 ...
and this is the recommended way to use the script.
How can I make a Makefile rule that best uses my_process, by sending in out of date input-output pairs in batches? Something like parallel but aware of which outputs are out of date.
I would prefer to avoid recursive make, if at all possible.

I don't completely grasp your problem: do you really want make to operate in batches or do you want a kind of perpetual make process checking the file system on the fly and feeding to the Python process whenever it finds necessary? If the latter, this is quite the opposite of a batch mode and rather a pipeline.
For the batch mode there is a workaround which needs a dummy file recording the last running time. In this case we are abusing make, because in this part the makefile is a one-trick pony, which is unintuitive and against the good rules:
SOURCES := $(wildcard in*)
lastrun : $(SOURCES)
	python my_process $(foreach src,$?,-i $(src) -o $(patsubst in%,out%,$(src)))
	touch lastrun
PS: please note that this solution has a substantial flaw in that it doesn't detect updates of in-files when they happen during the run of the makefile. All in all it is more advisable to simply collect the filenames of the in-files which were updated by the update process itself and avoid make altogether.
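A minimal sketch of that last suggestion, skipping make entirely: a shell function that compares each input against its output and prints the -i/-o pairs that need rebuilding. The function name list_stale_pairs is hypothetical; the in*/out* naming is taken from the makefile above, and file names are assumed to contain no whitespace.

```shell
#!/bin/sh
# Print "-i IN -o OUT" for every input whose output is missing or older,
# one pair per line. list_stale_pairs is a hypothetical helper name.
list_stale_pairs() (
    dir=$1
    for src in "$dir"/in*; do
        [ -e "$src" ] || continue            # the glob matched nothing
        name=${src##*/}                      # strip the directory part
        out="$dir/out${name#in}"             # in<suffix> -> out<suffix>
        # find prints $src only if it is newer than $out
        if [ ! -e "$out" ] || [ -n "$(find "$src" -prune -newer "$out")" ]; then
            printf '%s\n' "-i $src -o $out"
        fi
    done
)

# Possible usage: 128 words per call is 32 pairs, so pairs are not split
# by the -n limit. Note that GNU xargs runs the command once even on
# empty input unless -r is given.
#   list_stale_pairs . | xargs -n 128 python my_process
```

This only looks at modification times, like make does, so it shares the same blind spot for inputs that change while a batch is running.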

This is what I ended up going with, a makefile with one layer of recursion.
I tried using $? both with grouped and ungrouped targets, but couldn't get the exact behavior needed. If one of the output targets was deleted, the rule would be re-run, but $? would contain some input files and not necessarily the correct corresponding one, which was very strange.
Makefile:
all:
INDIR=in
OUTDIR=out
INFILES=$(wildcard in/*)
OUTFILES=$(patsubst in/%, out/%, $(INFILES))
ifdef FIRST_PASS
#Discover which input-output pairs are out of date
$(shell mkdir -p $(OUTDIR); echo -n > $(OUTDIR)/.needs_rebuild)
$(OUTFILES) : out/% : in/%
	@echo $@ $^ >> $(OUTDIR)/.needs_rebuild
all: $(OUTFILES)
	@echo -n
else
#Recurse to run FIRST_PASS, builds .needs_rebuild:
$(shell $(MAKE) -f $(CURDIR)/$(firstword $(MAKEFILE_LIST)) FIRST_PASS=1)
#Convert .needs_rebuild into batches, creates all_batches phony target for convenience
$(shell cat $(OUTDIR)/.needs_rebuild | ./make_batches.sh 32 > $(OUTDIR)/.batches)
-include $(OUTDIR)/.batches
batch%:
#In this rule, $^ is all inputs needing rebuild.
#The corresponding outputs can be computed using a patsubst:
	targets="$(patsubst in/%, out/%, $^)"; touch $$targets
clean:
	rm -rf $(OUTDIR)
all: all_batches
endif
make_batches.sh:
#!/bin/bash
set -beEu -o pipefail
batch_size=$1
function _make_batches {
batch_num=$1
shift 1
#echo ".PHONY: batch$batch_num"
echo "all_batches: batch$batch_num"
while (( $# >= 1 )); do
read out in <<< $1
shift 1
echo "batch$batch_num: $in"
echo "$out: batch$batch_num"
done
}
export -f _make_batches
echo ".PHONY: all_batches"
parallel -N$batch_size -- _make_batches {#} {} \;
Unfortunately, the makefile is a one trick pony and there's quite a bit of boilerplate to pull this recipe off.
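For readers without GNU parallel, the batching step can be sketched in plain shell. This is only an approximation of make_batches.sh under one assumption: .needs_rebuild holds one "out in" pair per line, as written by the FIRST_PASS rule. The function name make_batches is hypothetical.

```shell
#!/bin/sh
# Read "out in" pairs on stdin and emit a makefile fragment that groups
# them into numbered batch targets of at most $1 pairs each.
make_batches() (
    batch_size=$1
    count=0
    batch_num=0
    echo ".PHONY: all_batches"
    while read -r out src; do
        # Open a new batch every batch_size pairs
        if [ $(( count % batch_size )) -eq 0 ]; then
            batch_num=$(( batch_num + 1 ))
            echo "all_batches: batch$batch_num"
        fi
        echo "batch$batch_num: $src"     # the batch depends on the input
        echo "$out: batch$batch_num"     # the output is produced by the batch
        count=$(( count + 1 ))
    done
)

# Possible usage, mirroring the $(shell ...) line in the Makefile:
#   make_batches 32 < out/.needs_rebuild > out/.batches
```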

Related

How to have dependencies that result from another target in Makefile?

I've tweaked Make "a bit" so I can use it as a "kind of" CLI for some tasks.
MAKEFLAGS += --no-builtin-rules
MAKEFLAGS += --no-builtin-variables
MAKEFLAGS += --no-print-directory
SHELL := /bin/bash
.ONESHELL:
.PHONY: project_list
project_list: all_projects_info.json
	echo "Filtering project list with:" >&2
	echo " PROJECT_FILTER: $(PROJECT_FILTER)" >&2
	jq -r -S '.[] | select(
	(.projectId | test("$(PROJECT_FILTER)"))
	) | .projectId' $^ > $@
.PHONY: get_storage_info
get_storage_info: project_list
	PROJECT_LIST=$$(cat $<)
	$(MAKE) -f $(MKFILE) -j storage_info.json PROJECT_LIST="$$PROJECT_LIST"
all_projects_info.json:
	curl -X GET https://toto/get_all_my_projects_info >$@
# Defined before its first use: prerequisite lists are expanded as soon as the rule is read
STORAGE_INFO_JSON_FILES=$(foreach project_name,$(PROJECT_LIST),storage_info/$(project_name).json)
# here it's PHONY because we want to always rebuild it
.PHONY: storage_info.json
storage_info.json: $(STORAGE_INFO_JSON_FILES)
	jq -s -S '[.[]?.items?[]?]' $(STORAGE_INFO_JSON_FILES) > $@
storage_info/:
	mkdir -p $@
$(STORAGE_INFO_JSON_FILES): storage_info/%.json: | storage_info/
	curl \
	-X GET \
	"https://storage_api/list_s3?project=$*" \
	2> /dev/null > $@
As you can see here, I've got 2 "commands":
project_list, which lists all the projects I have access to,
get_storage_info, which lists all the buckets in those projects.
The trick here is that, because I've got a lot of projects and buckets, I may want to filter like this:
make get_storage_info PROJECT_FILTER="foo"
And it will print ONLY the buckets in projects with foo in their name.
It's quite handy and fast (only the first run may be slow, since it has to fetch all the information).
What bothers me is that I haven't found a better way than calling a sub-make (with the exact list of projects to take into account).
Is it possible to express dynamic dependencies of a target, specifically ones that result from building another target?
Thanks.
I don't see anything wrong with invoking a submake. That's IMO the best way to do it, especially if you want to add -j to it.
It's not really possible to get rid of this easily. It's not the fact that you want to express a dynamic dependency: that can be done. The problem is you want the list of dependencies to be extracted from the results of running another rule. But that's not how make works: make always starts with the final target and works its way backwards. So, by the time you get around to building the prerequisite file, the target that depended on it has already been processed (not its recipe of course, but all the prerequisites).

Make - Dependency building different than explicit build

I have a rule in my makefile:
$(OW_GROUP_ONE_C): $(OW_GROUP_ONE_PNG)
	for file in $^; \
	do \
		grit $$file -ftc -fh\! -fa -gt -gz\! -gB4 -m\! -p -pzl -pu16 -o $@; \
	done
It builds a single c file out of different images, those are iterated in a for loop (They are, I checked using an echo)
The rule which depends on that is
$(OW_GROUP_ONE_O): $(OW_GROUP_ONE_C)
	$(CC) $(CFLAGS) -c -o $@ $<
which is executed via
$(SPRITES_BINARY): $(NORMAL_PAL_OBJ) $(SHINY_PAL_OBJ) $(SPRITE_FRONT_OBJ) $(SPRITE_BACK_OBJ) $(NORMAL_CASTFORM_PAL_OBJ) $(SHINY_CASTFORM_PAL_OBJ) $(CASTFORM_FRONT_OBJ) $(CASTFORM_BACK_OBJ) $(OW_GROUP_ONE_O)
If I execute the rule by calling "make $(OW_GROUP_ONE_C)" everything works fine, but as soon as the rule is executed as a dependency of another rule, the loop seems to read only the first file. I again used echo to check, and the loop does go through all files in the list. I don't know what the deal is; the tool (GRIT - GBA raster image transmogrifier) should be able to handle that, but there must be some difference from calling the rule explicitly, since it works that way...
Thanks in advance for any hints!

Suppress "Clock skew" warning for future-times in Makefile

I have a Makefile that performs a task if it hasn't happened in the last hour. It does so like this:
HOUR_FROM_NOW = $(shell perl -e '($$s,$$m,$$h,$$d,$$M)=localtime(time()+3600); printf("%02d%02d%02d%02d\n",$$M+1,$$d,$$h,$$m);')
NOW_FILE = $(shell mkdir -p .make; touch .make/now; echo .make/now )
.PHONY: externals
externals: $(PROJECTS:%=.make/proj_%)
.make/proj_%: $(NOW_FILE)
	$(MAKE) -s $(*F)
	touch -t $(HOUR_FROM_NOW) $@
.PHONY: $(PROJECTS)
$(PROJECTS):
# do stuff, specifically, clone git-repo if not exists, else pull latest
That part works great, except that I now get warnings:
make: Warning: File `.make/proj' has modification time 3.5e+03 s in the future
make: Nothing to be done for `externals'.
make: warning: Clock skew detected. Your build may be incomplete.
Anyone know how to suppress those warnings? (Or to do a periodic task in a makefile)
Most versions of touch I have come across can do some date time maths which allows for setting the timestamp of a file directly via the --date option.
That and the fact that variables assigned with := are only "evaluated once" makes this a bit easier to read.
HOUR_AGO := .make/hour_ago
__UGLY := $(shell mkdir -p .make && touch --date='1hour ago' $(HOUR_AGO))
# The preceding line will be executed once
.make/proj_%: .make/hour_ago | .make
	$(MAKE) -s $(*F)
	@touch $@
.make:
	mkdir -p $@
I'm using something very similar to this to periodically refresh login tokens.
Never would have thought of it if it wasn't for Dave's answer though.
The directory is created by specifying it as an order-only prerequisite.
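The timestamp comparison behind this can be tried outside make. A minimal sketch, assuming GNU touch (for --date) and a hypothetical helper name needs_refresh: it creates a reference file dated one hour ago and reports whether the stamp file is older than that, which is the same test make performs between .make/hour_ago and .make/proj_%.

```shell
#!/bin/sh
# Succeed (exit 0) if the stamp file is missing or was last touched
# more than an hour ago. Requires GNU touch for the --date option.
needs_refresh() (
    stamp=$1
    [ -e "$stamp" ] || exit 0                    # never run before
    ref=$(mktemp) || exit 2
    touch --date='1 hour ago' "$ref"             # reference time: now - 1h
    newer=$(find "$ref" -prune -newer "$stamp")  # non-empty if ref is newer
    rm -f "$ref"
    [ -n "$newer" ]
)

# Possible usage:
#   if needs_refresh .make/proj_foo; then echo "refresh due"; fi
```

Because the reference time is always in the past, no file ever carries a future timestamp, so make has nothing to warn about.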
I suspect that the + 3600 is at fault. What happens if you remove it?
I thought and thought, and then the stupid-obvious solution hit me ...
Instead of setting timestamps in the future with HOUR_FROM_NOW, I use the real time and compare with HOUR_AGO_FILE ...
HOUR_AGO = $(shell perl -e '($$s,$$m,$$h,$$d,$$M)=localtime(time()-3600); printf("%02d%02d%02d%02d\n",$$M+1,$$d,$$h,$$m);')
HOUR_AGO_FILE = $(shell mkdir -p .make; touch -t $(HOUR_AGO) .make/hour_ago; echo .make/hour_ago )
.PHONY: externals
externals: $(PROJECTS:%=.make/proj_%)
.make/proj_%: $(HOUR_AGO_FILE)
	$(MAKE) -s $(*F)
	@touch $@

GNU Make (running program on multiple files)

I'm stuck trying to figure out how to run a program, on a set of files, using GNU Make:
I have a variable that loads some filenames like this:
FILES=$(shell ls *.pdf)
Now I'm wanting to run a program 'p' on each of the files in 'FILES', however I can't seem to figure how to do exactly that.
An example of the 'FILES' variable would be:
"a.pdf k.pdf omg.pdf"
I've tried the $(foreach,,) without any luck, and #!bin/bash like loops seem to fail.
You can do a shell loop within the command:
all:
	for x in $(FILES) ; do \
		p $$x ; \
	done
(Note that only the first line of the command must start with a tab, the others can have any old whitespace.)
Here's a more Make-style approach:
TARGETS = $(FILES:=_target)
all: $(TARGETS)
	@echo done
.PHONY: $(TARGETS)
$(TARGETS): %_target : %
	p $*

How can I write a makefile to auto-detect and parallelize the build with GNU Make?

Not sure if this is possible in one Makefile alone, but I was hoping to write a Makefile in a way such that trying to build any target in the file auto-magically detects the number of processors on the current system and builds the target in parallel for the number of processors.
Something like the below "pseudo-code" examples, but much cleaner?
all:
	@make -j$(NUM_PROCESSORS) all
Or:
all: .inparallel
... build all here ...
.inparallel:
	@make -j$(NUM_PROCESSORS) $(ORIGINAL_TARGET)
In both cases, all you would have to type is:
% make all
Hopefully that makes sense.
UPDATE: Still hoping for an example Makefile for the above. Not really interested in finding the number of processes, but interested in how to write a makefile to build in parallel without the -j command line option.
The detection part is going to be OS dependent. Here's a fragment that will work on Linux and Mac OS X:
NPROCS:=1
OS:=$(shell uname -s)
ifeq ($(OS),Linux)
NPROCS:=$(shell grep -c ^processor /proc/cpuinfo)
endif
ifeq ($(OS),Darwin) # Assume Mac OS X
NPROCS:=$(shell system_profiler | awk '/Number Of CPUs/{print $$4}{next;}')
endif
To get it working you are probably going to have to re-invoke make. Then your problem is preventing infinite recursion. You could manage that by having two makefiles (the first only resetting the -j value), but it is probably possible to finesse it.
I just added this to the top of my Makefile. It lets make create any number of jobs, but tries to keep the load average below the number of cpu cores.
MAKEFLAGS+="-j -l $(shell grep -c ^processor /proc/cpuinfo) "
Note this is Linux specific.
Here's what I went with:
ifeq ($(OS),Linux)
NUMPROC := $(shell grep -c ^processor /proc/cpuinfo)
else ifeq ($(OS),Darwin)
NUMPROC := $(shell sysctl hw.ncpu | awk '{print $$2}')
endif
# Only take half as many processors as available
NUMPROC := $(shell echo "$(NUMPROC)/2"|bc)
ifeq ($(NUMPROC),0)
NUMPROC = 1
endif
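The detection logic can be checked on its own before wiring it into $(shell ...). A minimal sketch with hypothetical function names detect_nprocs and half_nprocs: it uses the same /proc/cpuinfo and sysctl probes as the answers here, then falls back to getconf _NPROCESSORS_ONLN and finally to 1. The fallback chain is an addition for illustration, not part of the makefiles above.

```shell
#!/bin/sh
# Print the number of processors, or 1 if it cannot be determined.
detect_nprocs() (
    n=""
    case "$(uname -s)" in
        Linux)  n=$(grep -c '^processor' /proc/cpuinfo 2>/dev/null) ;;
        Darwin) n=$(sysctl -n hw.ncpu 2>/dev/null) ;;
    esac
    # If the probe gave nothing usable, try getconf
    case "$n" in
        ''|*[!0-9]*|0) n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) ;;
    esac
    # Last resort: assume a single processor
    case "$n" in
        ''|*[!0-9]*|0) n=1 ;;
    esac
    echo "$n"
)

# Print half the processor count, but never less than 1.
half_nprocs() (
    n=$(( $(detect_nprocs) / 2 ))
    [ "$n" -ge 1 ] || n=1
    echo "$n"
)

# Possible usage from a makefile:
#   NUMPROC := $(shell sh -c '. ./nprocs.sh; half_nprocs')
```

The file name nprocs.sh in the usage comment is likewise only a placeholder.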
After poking around the LDD3 chapter 2 a bit and reading dmckee's answer, I came up with this not so great answer of using two makefiles (I would prefer just one).
$ cat Makefile
MAKEFLAGS += -rR --no-print-directory
NPROCS := 1
OS := $(shell uname)
export NPROCS
ifeq ($J,)
ifeq ($(OS),Linux)
NPROCS := $(shell grep -c ^processor /proc/cpuinfo)
else ifeq ($(OS),Darwin)
NPROCS := $(shell system_profiler | awk '/Number of CPUs/ {print $$4}{next;}')
endif # $(OS)
else
NPROCS := $J
endif # $J
all:
	@echo "running $(NPROCS) jobs..."
	@$(MAKE) -j$(NPROCS) -f Makefile.goals $@
%:
	@echo "building in $(NPROCS) jobs..."
	@$(MAKE) -j$(NPROCS) -f Makefile.goals $@
$ cat Makefile.goals
MAKEFLAGS += -rR --no-print-directory
NPROCS ?= 1
all: subgoal
	@echo "$(MAKELEVEL) nprocs = $(NPROCS)"
subgoal:
	@echo "$(MAKELEVEL) subgoal"
What do you think about this solution?
The benefit I see is that people still type make to build, so there isn't some "driver" script doing the NPROCS and make -j$(NPROCS) work that people would have to know about instead of typing make.
Downside is that you'll have to explicitly use make -f Makefile.goals in order to do a serial build. And I'm not sure how to solve this problem...
UPDATED: added $J to the above code segment. Seems to work quite well. Even though it's two makefiles instead of one, it's still quite seamless and useful.
I'll skip over the $(NPROCS) detection stuff, but here's how you could do this in a single Makefile (this is probably GNU Make specific, but that looks like the variant you're running):
ifeq ($(NPROCS),)
# Code to set NPROCS...
%:
$(MAKE) -j$(NPROCS) NPROCS=$(NPROCS)
else
# All of your targets...
endif
See Defining Last-Resort Default Rules and Overriding Part of Another Makefile in the GNU Make Manual.
If I read the question correctly, the goal is to parallelize the build process as much as possible. The make man page states the following
If the -j option is given without an argument, make will not limit the number of jobs that can run simultaneously.
Isn't this basically the solution that you want? If your Makefile has enough parallel targets you will use all your CPUs, and if the targets are not parallel, that -j option won't help anyway.
If you want it to be automatic, then you can override your typical make command to be an alias to itself in your .bashrc in your home directory.
Example:
alias make="make -j"
Or you could do something like:
alias jmake="make -j"
in the case that you don't want to override it, but want a quick and easy (and memorable) way to run make in parallel.

Resources