Defer variable assignment until file present or rule executed in Makefile - makefile

I have a Makefile which downloads data from a biological database. Given a project number it should first download a file containing all the run information about that project, then extract accession numbers from the information, then download in parallel the FASTQ files associated with those accession numbers. My problem is that I cannot get the variable FASTQ to be deferred until after run.txt and sra.txt have been downloaded. I have tried combinations of order-only prerequisites and .SECONDEXPANSION but still cannot get it to work. Is it even possible?
# Project
PROJECT := PRJNA257197
# Download
.SECONDEXPANSION:
FASTQ = $(patsubst %, %.fastq, $(shell cat sra.txt))
download: $$(FASTQ) | run.txt sra.txt
%.fastq: sra.txt
	# Download FASTQ files
	fastq-dump $*
sra.txt: run.txt
	# Extract SRA accession numbers
	cat $^ | cut -f 1 -d ',' | grep SRR | tr '\n' ' ' > $@
run.txt:
	# Download run information
	esearch -db sra -query $(PROJECT) | efetch -format runinfo > $@

To do what you want, you need something more like this (comments inline):
# Project
PROJECT := PRJNA257197
# Include the fastqs.mk makefile.
include fastqs.mk
# Default target is all the fastq files.
all: $(FASTQS)
%.fastq: sra.txt
	# Download FASTQ files
	fastq-dump $*
# Create the fastqs.mk file from sra.txt.
# sra.txt holds the accessions space-separated on one line, so append
# .fastq to every word and prefix the line with the assignment.
fastqs.mk: sra.txt
	sed 's/[^ ][^ ]*/&.fastq/g; s/^/FASTQS += /' $< > $@
sra.txt: run.txt
	# Extract SRA accession numbers
	cat $^ | cut -f 1 -d ',' | grep SRR | tr '\n' ' ' > $@
run.txt:
	# Download run information
	esearch -db sra -query $(PROJECT) | efetch -format runinfo > $@
Assuming each .fastq file has a matching bare file (i.e. foo.fastq -> foo), you probably want this pattern rule instead:
%.fastq: % sra.txt
The magic here is in the included makefile: make is smart enough to notice when it needs to build an included makefile, and it restarts processing once that has been done. See "How Makefiles Are Remade" in the GNU make manual for more details.
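The restart behaviour can be tried in isolation. The following is a minimal sketch, not the answerer's code: it assumes GNU make is installed, uses two made-up accession names, and replaces fastq-dump with echo and touch so that nothing is downloaded. The leading "> " markers are only a device of this script for inserting the tab that recipe lines require.

```shell
#!/bin/sh
# Sketch: make builds a missing included makefile, then restarts and re-reads it.
set -eu
command -v make >/dev/null 2>&1 || { echo "make is not installed; skipping" >&2; exit 0; }

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical stand-in for sra.txt: two made-up accessions on one line.
printf 'SRR0000001 SRR0000002 ' > sra.txt

# Recipe lines are written with a "> " marker and converted to tabs below.
cat > Makefile.in <<'EOF'
include fastqs.mk

all: $(FASTQS)

%.fastq:
> @echo "fetch $*"
> @touch $@

fastqs.mk: sra.txt
> @sed 's/[^ ][^ ]*/&.fastq/g; s/^/FASTQS += /' $< > $@
EOF
tab=$(printf '\t')
sed "s/^> /${tab}/" Makefile.in > Makefile

# fastqs.mk does not exist yet, so make has to create it before it can
# know what $(FASTQS) contains.
out=$(make -s all)
printf '%s\n' "$out"
```

On the first pass make finds that fastqs.mk is missing, builds it from sra.txt, re-reads all makefiles, and only then expands $(FASTQS), so the script should print one "fetch" line per accession. Older versions of make may also print a "No such file or directory" warning for the include on standard error.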

Related

Listing existing files that are not present in a list using shell

How do I list files that exist, but not present in the list? More specifically, I'd like to remove *.cpp files not listed in Build. Something like this lists files that are present in both the current directory and the Build file:
ls *.cpp | xargs -I % bash -c 'grep % Build'
However, the following line is incorrect of course:
ls *.cpp | xargs -I % bash -c 'grep -v % Build'
Thus the question: how does one list the *.cpp files that are not present in the Build file using shell commands? I can do something like this, but this is ugly:
ls *.cpp | perl -e 'while(<>){chomp;my $l=`grep $_ Build`;chomp $l;if(length $l==0){print("rm $_\n");}}'
More specifically, I'd like to remove *.cpp files not listed in Build
You want comm or join to combine two sorted lists. I always mix up the comm arguments, but I think:
comm -23 <(find . -type f -name '*.cpp' | sort) <(sort Build) |
xargs -d '\n' echo rm
or if you want to depend on filename expansion:
shopt -s nullglob # at least
comm -23 <(printf "%s\n" *.cpp | sort) <(sort Build) | ...
Do not parse ls.
The <(...) syntax is bash-specific process substitution. In a non-bash shell, write the output of each command to a temporary file instead.
GNU grep already offers you this possibility with the -f switch:
printf '%s\n' *.cpp | grep -F -x -v -f Build
-F: no regex
-x: full-line match
-v: invert (not match)
-f Build: read the patterns from the file Build, one per line
In other words: filter out any line in Build
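To see the filter at work, here is a small sketch in a throwaway directory. The file names and the contents of Build are made up for illustration, and the script only prints the candidates rather than removing anything.

```shell
#!/bin/sh
# Sketch: list *.cpp files that are not named in Build (nothing is deleted).
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical sources; only a.cpp and main.cpp are listed in Build.
touch a.cpp b.cpp c.cpp main.cpp
printf '%s\n' a.cpp main.cpp > Build

# grep exits with status 1 when every file is listed, hence "|| true".
stale=$(printf '%s\n' *.cpp | grep -F -x -v -f Build || true)
printf '%s\n' "$stale"
```

Because -x demands a whole-line match, a Build entry such as main.cpp does not accidentally protect a file named xmain.cpp.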

Makefile, substitute paths in prerequisites automatic variable

I have following make rule:
$ ls files
a b c
$ cat Makefile
files.7z: ./files/*
	7z a $@ $?
The rule will be executed as follows:
7z a files.7z files/a files/b files/c
7z treats paths beginning with ./ specially: it includes the file in the archive without the path. What would be the shortest way to rewrite the paths in $? so that they begin with ./? (File names can have spaces.)
Make doesn't handle spaces in file names well at all (see here). The simplest thing is to make make fail if there are spaces, and then it's easy:
check:
	@ls ./files/* | grep -q " "; \
	if [ $$? -ne 1 ]; then \
		echo "WE DO NOT SUPPORT FILENAMES WITH SPACES"; \
		exit 1; \
	fi
files.7z: ./files/* | check
	7z a $@ $(addprefix ./,$^)
Note: there are workarounds on the web for filenames with spaces, but they're overly complex, not very maintainable and will cost you a lot more headaches than telling your customers/coworkers "We don't support file names with spaces"...

Bash looping over files in different directories and print output

I have *.vcf, *.vcf.vcfidx and *.vcf.idx files in directory /mypath/mydir/. I want to loop over .vcf files only using command below (for file 1):
command for one vcf file:
vcf-subset -c sample.txt vcffile1.vcf | bgzip -c > output_vcfile1.vcf_.vcf.gz
Can someone please help loop over all the .vcf (not vcf.vcfidx or vcf.idx) files and get the output for each file in designated directory /get/inthis/dir/ using the command shown above?
Just use glob pattern *.vcf:
for i in *.vcf; do echo "$i"; done
The glob pattern *.vcf will match only files ending in .vcf.
Your command:
for i in *.vcf; do
vcf-subset -c sample.txt "$i" | bgzip -c > /get/inthis/dir/output_"$i"_.vcf.gz
done
If you have to search for .vcf files in a specific directory e.g. /foo/bar/, do:
for i in /foo/bar/*.vcf; do
vcf-subset -c sample.txt "$i" | bgzip -c > /get/inthis/dir/output_"${i##*/}"_.vcf.gz
done
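The ${i##*/} expansion in the second loop strips everything up to and including the last slash, so the output name is built from the bare file name rather than the full input path. A tiny sketch with a made-up path:

```shell
#!/bin/sh
# Sketch: how the output name is derived from an input path.
path=/foo/bar/vcffile1.vcf      # hypothetical input file

name=${path##*/}                # remove the longest prefix matching */
out="/get/inthis/dir/output_${name}_.vcf.gz"

printf '%s\n' "$name"           # vcffile1.vcf
printf '%s\n' "$out"            # /get/inthis/dir/output_vcffile1.vcf_.vcf.gz
```

Without the expansion, "$i" would still contain /foo/bar/ and the redirection would point to a path that most likely does not exist under /get/inthis/dir/.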

How to perform parallel processes for different groups in a folder?

I have a folder containing a lot of images. I have a code which transforms these images into black and white format and then use tesseract to convert them into text files. I have been using the following code to split these files into subgroups:
i=0; for f in *; do d=dir_$(printf %03d $((i/(number of files in each folder+1)))); mkdir -p "$d"; mv "$f" "$d"; let i++; done
This command works well for splitting up the files (it moves each group into its own folder), but because I plan to use this procedure on a very large number of files, I would like to make it less time-consuming; moving the files into folders takes too long. Is there a way to specify a subgroup of files, run a process on it, and use & to run multiple instances at once? For example, I would like to run a process on the first 400 files in a folder and then use & to run the same process on files 401-800.
Here is the code that I am using for the conversion:
parallel -j 5 convert {} "-resample 200 -colorspace Gray" {.}BW.png ::: *.png ; parallel -j 5 tesseract {} {} -l tla -psm 6 ::: *BW.png ; rm *BW.png
By group I simply mean the first 400 files, the second group would be the following 400 files and so on...
I would let Make take care of the multiprocessing, using a Makefile like this:
Makefile:
EXT_IN := .jpg
EXT_OUT := .txt
FILES_IN := $(wildcard *$(EXT_IN))
FILES_OUT := $(addsuffix $(EXT_OUT), $(basename $(FILES_IN)))
.PHONY: all
$(FILES_OUT):
	@echo Generating $@ from $(addsuffix $(EXT_IN), $(basename $@))
	# Do your conversion here!
all: $(FILES_OUT)
	@echo "Processing finished!"
Running:
$ make all -j 8
Generating file1.txt from file1.jpg
Generating file2.txt from file2.jpg
Generating file3.txt from file3.jpg
Generating file4.txt from file4.jpg
Generating file5.txt from file5.jpg
Generating file6.txt from file6.jpg
Processing finished!
So my whole ordeal was with trying to use my code on a directory with a lot of files. To get rid of the error stating that there are too many arguments, I used this code, which I gathered from previous Ole Tange posts:
ls ./ | grep -v '\BW.png' | parallel -j 60 convert {} "-resample 100 -colorspace Gray" {.}BW.png; ls ./ | grep \BW.png | parallel -j 60 tesseract {} {} -l tla -psm 6; find . -name "*BW.png" -print0 | xargs -0 rm;
Thanks to everyone that contributed.

Make with dependencies from a file

I want to write a Makefile that reads a file list.txt and produces result.tar containing the contents. If there is a change in either the list.txt file, or any of the files it points at, then result.tar should be rebuilt. How can I express this in a Makefile? The closest I have come is:
result.tar : list.txt
	cat list.txt | xargs tar -cf result.tar
But this omits the dependency on the contents of list.txt.
I think there should be something like this:
result.tar : list.txt $(shell cat list.txt)
	cat list.txt | xargs tar -cf result.tar
Or, a bit better (extracting list.txt to a variable and using automatic variables):
LIST_FILE := list.txt
result.tar : $(LIST_FILE) $(shell cat $(LIST_FILE))
	cat $< | xargs tar -cf $@
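Whether the listed files really act as prerequisites can be checked with make -q, which exits with status 0 when the target is up to date and non-zero otherwise. The following is a rough sketch, assuming make and tar are installed; the file names are made up, the "> " markers stand in for recipe tabs, and the timestamps are backdated only so that the demonstration does not depend on the filesystem's timestamp resolution.

```shell
#!/bin/sh
# Sketch: a change to a file named in list.txt makes result.tar out of date.
set -eu
command -v make >/dev/null 2>&1 || { echo "make is not installed; skipping" >&2; exit 0; }

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical inputs.
printf 'a.txt\nb.txt\n' > list.txt
echo one > a.txt
echo two > b.txt

cat > Makefile.in <<'EOF'
LIST_FILE := list.txt
result.tar: $(LIST_FILE) $(shell cat $(LIST_FILE))
> cat $< | xargs tar -cf $@
EOF
tab=$(printf '\t')
sed "s/^> /${tab}/" Makefile.in > Makefile

# Backdate the inputs, build, then backdate the archive to one day later.
touch -t 200001010000 list.txt a.txt b.txt
make -s result.tar
touch -t 200001020000 result.tar

if make -q result.tar; then before="up to date"; else before="needs rebuild"; fi
touch b.txt                     # modify one of the listed files
if make -q result.tar; then after="up to date"; else after="needs rebuild"; fi

echo "before: $before"
echo "after:  $after"
```

Note that $(shell cat list.txt) is evaluated when the makefile is read, so list.txt must already exist at that point, and file names containing spaces will be split into separate prerequisites.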
