gnu make dependencies for data processing - makefile

I'm trying to set up an ETL system using GNU Make 3.81. The idea is to transform and load only what is necessary after a change to my source data.
My project's directory layout looks like this:
${SCRIPTS}/ <- transform & load scripts
${DATA}/incoming/ <- storage for extracted data
${DATA}/processed/ <- transformed, soon-to-be-loaded data
My ${TRANSFORM_SCRIPTS}/Makefile is filled with statements like this:
A_step_1: ${SCRIPTS}/A/do_step_1.sh ${DATA}/incoming/A_files/*
	${SCRIPTS}/A/do_step_1.sh ${DATA}/incoming/A_files/* > ${DATA}/processed/A.step_1

A_step_2: ${SCRIPTS}/A/do_step_2.sh ${DATA}/processed/A.step_1
	${SCRIPTS}/A/do_step_2.sh ${DATA}/processed/A.step_1 > ${DATA}/processed/A.step_2

B_step_1: ${SCRIPTS}/B/do_step_1.sh ${DATA}/incoming/B_files/*
	${SCRIPTS}/B/do_step_1.sh ${DATA}/incoming/B_files/* > ${DATA}/processed/B.step_1

B_step_2: ${SCRIPTS}/B/do_step_2.sh ${DATA}/processed/B.step_1
	${SCRIPTS}/B/do_step_2.sh ${DATA}/processed/B.step_1 > ${DATA}/processed/B.step_2

joined: A_step_2 B_step_2
	join ${DATA}/processed/A.step_2 ${DATA}/processed/B.step_2 > ${DATA}/processed/joined
Calling `make joined' successfully produces the "joined" file I need, but it rebuilds every file every time, despite there being no changes to the dependency files.
I tried using the output file names as targets, but GNU Make doesn't seem to know how to cope:
${DATA}/processed/B.step_2: ${SCRIPTS}/B/do_step_2.sh ${DATA}/processed/B.step_1
	${SCRIPTS}/B/do_step_2.sh ${DATA}/processed/B.step_1 > ${DATA}/processed/B.step_2
Any suggestions other than dropping the output of each process in the current working directory? Make seems like a reasonable tool to perform this work because, in reality, there are tens of data sources and close to 100 steps altogether, and managing dependencies myself via script files is becoming too difficult.

You can do one of two things:
Either fix the target and its dependencies with something like:
JOINED=${DATA}/processed/joined
$(JOINED): ${DATA}/processed/A.step_2 ${DATA}/processed/B.step_2
or you can end each step's recipe with a
touch $@
for example:
A_step_2: ${SCRIPTS}/A/do_step_2.sh ${DATA}/processed/A.step_1
	${SCRIPTS}/A/do_step_2.sh ${DATA}/processed/A.step_1 > ${DATA}/processed/A.step_2 && touch $@ || $(RM) $@
including the joined step.
But this is ugly.
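For illustration, here is a minimal sketch of the first approach applied to the layout in the question (the variable names come from the question; only the A chain and the join are shown, and the B rules would follow the same pattern):
A_STEP_1 = ${DATA}/processed/A.step_1
A_STEP_2 = ${DATA}/processed/A.step_2
B_STEP_2 = ${DATA}/processed/B.step_2   # B rules would mirror the A rules
JOINED   = ${DATA}/processed/joined

# each target is the real output file, so make can compare timestamps
$(A_STEP_1): ${SCRIPTS}/A/do_step_1.sh ${DATA}/incoming/A_files/*
	${SCRIPTS}/A/do_step_1.sh ${DATA}/incoming/A_files/* > $@

$(A_STEP_2): ${SCRIPTS}/A/do_step_2.sh $(A_STEP_1)
	${SCRIPTS}/A/do_step_2.sh $(A_STEP_1) > $@

$(JOINED): $(A_STEP_2) $(B_STEP_2)
	join $(A_STEP_2) $(B_STEP_2) > $@

# optional phony alias so `make joined' still works
.PHONY: joined
joined: $(JOINED)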

Related

Shell script to verify data packages

I need to write a shell script to check my algorithms against loads of data (test packages saved in .in files; every package contains a folder with the .in file and another with the .out file that holds the expected result).
Sometimes there are about 1000 files in one package, so there's no point in doing it manually. I need some kind of loop that opens each .in file, redirects it to the input of my C++ program, and redirects the program's output (saving the result to .out files). The problem is that I can't pick up this language as quickly as I need to.
I would also like this script to compare the results of my algorithm to the .out files from the packages:
for f in ExternalIn/*.in; do
    # part of the code which runs my algorithm and compares its output to the .out file from the package
Skipping checks for missing files, whitespace-safety, etc., you probably need something like:
for f in ExternalIn/*.in; do
# diff the result of my_cpp_app eating file.in with file.out
# and store the comparison result in file.diff
diff ${f/.in/.out} <(my_cpp_app <$f 2>/dev/null) > ${f/.in/.diff}
done
Although I would probably do it with find / xargs pipeline which is not only safer but also allows parallel execution.
Or even write a Makefile for this and use make, which after all is a tool for exactly this kind of work.
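A rough sketch of the find/xargs variant mentioned above, reusing the hypothetical my_cpp_app and the ExternalIn layout from the loop (with GNU xargs, -P 4 runs four comparisons in parallel):
# one bash invocation per .in file; ${0%.in} strips the .in suffix
find ExternalIn -name '*.in' -print0 |
    xargs -0 -n 1 -P 4 bash -c 'diff "${0%.in}.out" <(my_cpp_app < "$0" 2>/dev/null) > "${0%.in}.diff"'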

Formatting multiple json files recursively

This is a theoretical question about minimizing side effects in bash scripting.
I recently used a simple mechanism for formatting a bunch of json files, in a nested directory structure...
for f in `find ./ -name *json`; do echo $f ; python -mjson.tool $f > /tmp/1 && cp /tmp/1 $f ; done.
The mechanism is simply to
format each file using python's mjson.tool,
write it to a tmp location, and
then rewrite it back in place.
Is there a way to do this which is more elegant, i.e. with minimal side effects? I'm assuming bash experts have a better way of doing this sort of thing.
Unix tools work on a streaming basis -- they don't store all of the contents of the files in memory at once. Therefore, you have to use an intermediary location, since you would otherwise be overwriting a file that is currently being read from.
You may consider that your snippet isn't fault tolerant. If you make a mistake, you would have just overwritten all your data. You should store the output in a new location, verify, then move to overwrite. :)
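A rough sketch of that "write elsewhere, then overwrite" idea, treating a successful exit status from python -mjson.tool as the verification (the *.json pattern is taken from the question; the rest is just one way to do it):
# only overwrite the original if python -mjson.tool succeeded;
# find -print0 / read -d '' keeps paths with spaces safe
find . -name '*.json' -print0 | while IFS= read -r -d '' f; do
    python -mjson.tool "$f" > "$f.tmp" && mv "$f.tmp" "$f" || rm -f "$f.tmp"
done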
Using the Eclipse IDE you can format multiple JSON files:
Import the files into Eclipse, select the files you wish to format (or a folder to cover all the files), then right click -> Source -> Format.
I was looking for something similar and just noticed I can select all the JSON files in my VSCode file panel and CTRL + Click > "Format". Works like magic for a one-off operation; it formats the files in place.
(Screenshot: VSCode format in action)

Why does make always update this target?

Using make on my Gentoo machine (which is GNU make 3.82) with the following Makefile, I wonder why the target data/spectra/o4_greenblatt_296K.dat gets updated every time I execute make data/spectra/o4_greenblatt_296K.dat, even though none of the files params/base/fwhm.dat, params/base/wavelength_grid.dat, and data/raw/o4green_gpp.dat has changed, and the file data/spectra/o4_greenblatt_296K.dat already exists:
FWHM = params/base/fwhm.dat
WLGRID = params/base/wavelength_grid.dat

$(WLGRID): code/create_wavelength_grid.py
	cp code/create_wavelength_grid.py params/base/wavelength_grid.dat

$(FWHM): code/create_fwhm_param.py
	cp code/create_fwhm_param.py params/base/fwhm.dat

data/raw/o4green_gpp.dat:
	echo 1 > data/raw/o4green_gpp.dat

input_spectra_o4_raw: data/raw/o4green_gpp.dat

data/spectra/o4_greenblatt_296K.dat: $(WLGRID) $(FWHM) input_spectra_o4_raw
	echo 1 > data/spectra/o4_greenblatt_296K.dat

input_spectra_o4: data/spectra/o4_greenblatt_296K.dat
Any help you guys can give a make newbie is greatly appreciated :)
I would guess that it's because there is no file named input_spectra_o4_raw, which is a prerequisite of your data/spectra/o4_greenblatt_296K.dat.
The decision looks kind of like this:
1. params/base/wavelength_grid.dat and params/base/fwhm.dat are both up to date.
2. Check input_spectra_o4_raw - the file does not exist, so build it first.
3. There is a target for input_spectra_o4_raw, and its prerequisite data/raw/o4green_gpp.dat is up to date, so run all the commands to build input_spectra_o4_raw (there are none, though, so we essentially just mark that we've done everything we need to for input_spectra_o4_raw and that we built it anew).
4. We just built input_spectra_o4_raw, so data/spectra/o4_greenblatt_296K.dat is out of date with respect to that prerequisite and needs to be rebuilt.
You should research how to use the .PHONY: pseudo-target.
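One way to read that hint (this is an assumption about the intended layout, not the only possible fix): keep input_spectra_o4_raw and input_spectra_o4 as .PHONY convenience aliases, but give the real output file only real files as prerequisites, so that timestamps alone drive rebuilds:
.PHONY: input_spectra_o4_raw input_spectra_o4
input_spectra_o4_raw: data/raw/o4green_gpp.dat
input_spectra_o4: data/spectra/o4_greenblatt_296K.dat

# the data file now depends only on real files, never on a phony alias
data/spectra/o4_greenblatt_296K.dat: $(WLGRID) $(FWHM) data/raw/o4green_gpp.dat
	echo 1 > data/spectra/o4_greenblatt_296K.dat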

Using Make to create DPKG .debs always rebuilds, even when files haven't changed

This is a slightly odd one where I'm sure I'm missing something perfectly straightforward.
I'm trying to cut some of the cruft off our build time. Part of that is rebuilding a set of .debs we use, which happens every time we've changed any aspect of the system because of the way an ant script has been configured. I was hoping to use Makefiles to monitor the folders that feed the dpkg process, so that only the directories with recent changes are repackaged, but:
build-printing:
	fakeroot dpkg -b printing printing.deb
Is constantly rerun, even though the files in that specific directory haven't changed. I'm sure I've missed something really simple, but I can't spot it in the man pages.
Your build-printing rule doesn't depend on anything - tell it which files it should watch the timestamps of, e.g.:
build-printing: directory/myfile.src
....
will cause build-printing to be run only if the timestamp on directory/myfile.src is newer than the timestamp of build-printing. Since the rule doesn't look like it actually creates build-printing as a file, you probably want to rename it to match the output file, e.g.
printing.deb: directory/myfile.src
....
If you want to use a rule named build-printing you can either make that rule touch a file called build-printing, or make that rule depend upon printing.deb.
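For example, a minimal sketch along those lines (the printing directory comes from the question; using $(shell find ...) to enumerate its contents is an assumption about what the package should depend on):
# rebuild printing.deb only when something under printing/ is newer than it
PRINTING_FILES := $(shell find printing -type f)

printing.deb: $(PRINTING_FILES)
	fakeroot dpkg -b printing printing.deb

.PHONY: build-printing
build-printing: printing.deb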

Join multiple Coffeescript files into one file? (Multiple subdirectories)

I've got a bunch of .coffee files that I need to join into one file.
I have folders set up like a rails app:
/src/controller/log_controller.coffee
/src/model/log.coffee
/src/views/logs/new.coffee
Coffeescript has a command that lets you join multiple coffeescripts into one file, but it only seems to work with one directory. For example this works fine:
coffee --output app/controllers.js --join --compile src/controllers/*.coffee
But I need to be able to include a bunch of subdirectories kind of like this non-working command:
coffee --output app/all.js --join --compile src/*/*.coffee
Is there a way to do this? Is there a UNIXy way to pass in a list of all the files in the subdirectories?
I'm using terminal in OSX.
They all have to be joined in one file because otherwise each separate file gets compiled & wrapped with this:
(function() { }).call(this);
Which breaks the scope of some function calls.
From the CoffeeScript documentation:
-j, --join [FILE] : Before compiling, concatenate all scripts together in the order they were passed, and write them into the specified file. Useful for building large projects.
So, you can achieve your goal at the command line (I use bash) like this:
coffee -cj path/to/compiled/file.js file1 file2 file3 file4
where file1 - fileN are the paths to the coffeescript files you want to compile.
You could write a shell script or Rake task to combine them together first, then compile. Something like:
find . -type f -name '*.coffee' -print0 | xargs -0 cat > output.coffee
Then compile output.coffee
Adjust the paths to your needs. Also make sure that the output.coffee file is not in the same path you're searching with find or you will get into an infinite loop.
http://man.cx/find
http://www.rubyrake.org/tutorial/index.html
Additionally you may be interested in these other posts on Stackoverflow concerning searching across directories:
How to count lines of code including sub-directories
Bash script to find a file in directory tree and append it to another file
Unix script to find all folders in the directory
I've just released an alpha version of CoffeeToaster; I think it may help you.
http://github.com/serpentem/coffee-toaster
The easiest way is to use the coffee command line tool:
coffee --output public --join --compile app
app is my working directory holding multiple subdirectories, and public is where the output .js file will be placed. It's easy to automate this process if you're writing your app in Node.js.
This helped me (-o output directory, -j join to project.js, -cw compile and watch coffeescript directory in full depth):
coffee -o web/js -j project.js -cw coffeescript
Use cake to compile them all in one (or more) resulting .js file(s). Cakefile is used as configuration which controls in which order your coffee scripts are compiled - quite handy with bigger projects.
Cake is quite easy to install and set up; invoking cake from vim while you are editing your project is then simply
:!cake build
and you can refresh your browser and see results.
As I'm also busy learning the best way of structuring the files and using coffeescript in combination with backbone and cake, I have created a small project on github to keep as a reference for myself; maybe it will help you too with cake and some basic things. All compiled files are in the www folder so that you can open them in your browser, and all source files (except for the cake configuration) are in the src folder. In this example, all .coffee files are compiled and combined into one output .js file which is then included in the html.
Alternatively, you could use the --bare flag, compile to JavaScript, and then perhaps wrap the JS if necessary. But this would likely create problems; for instance, if you have one file with the code
i = 0
foo = -> i++
...
foo()
then there's only one var i declaration in the resulting JavaScript, and i will be incremented. But if you moved the foo function declaration to another CoffeeScript file, then its i would live in the foo scope, and the outer i would be unaffected.
So concatenating the CoffeeScript is a wiser solution, but there's still potential for confusion there; the order in which you concatenate your code is almost certainly going to matter. I strongly recommend modularizing your code instead.
