Merging yaml config files recursively with bash

Is it possible using some smart piping and coding, to merge yaml files recursively?
In PHP, I make an array of them (each module can add or update config nodes of/in the system).
The goal is an export shellscript that will merge all separate module folders' config files into big merged files. It's faster, efficient, and the customer does not need the modularity at the time we deploy new versions via FTP, for example.
It should behave like the PHP function: array_merge_recursive
The filesystem structure is like this:
mod/a/config/sys.yml
mod/a/config/another.yml
mod/b/config/sys.yml
mod/b/config/another.yml
mod/c/config/totally-new.yml
sys/config/sys.yml
Config looks like:
date:
  format:
    date_regular: "%d-%m-%Y"
And a module may, say, do this:
date:
  format:
    date_regular: regular dates are boring
    date_special: "!!!%d-%m-%Y!!!"
So far, I have:
#!/bin/bash
#........
cp -R "$dir_project/" "$dir_to/"
for i in "$dir_project"/mod/*/
do
    cp -R "${i}/." "$dir_to/sys/"
done
This of course overwrites the existing config files on every pass of the loop. (The rest of the system files are uniquely named.)
Basically, I need a YAML parser for the command line, an array_merge_recursive-like alternative, and then a YAML writer to output the merged result. I fear I'll have to start learning Python, because bash won't cut it on this one.

You can use perl, for example. The following one-liner:
perl -MYAML::Merge::Simple=merge_files -MYAML -E 'say Dump merge_files(@ARGV)' f1.yaml f2.yaml
for the following input files: f1.yaml
date:
  epoch: 2342342343
  format:
    date_regular: "%d-%m-%Y"
f2.yaml
date:
  format:
    date_regular: regular dates are boring
    date_special: "!!!%d-%m-%Y!!!"
prints the merged result...
---
date:
  epoch: 2342342343
  format:
    date_regular: regular dates are boring
    date_special: '!!!%d-%m-%Y!!!'
Because @Caleb pointed out that the module is now developer-only, here is a replacement. It is a bit longer and uses two (but commonly available) modules:
the YAML module
and Hash::Merge::Simple
perl -MYAML=LoadFile,Dump -MHash::Merge::Simple=merge -E 'say Dump(merge(map { LoadFile($_) } @ARGV))' f1.yaml f2.yaml
This produces the same output as above.
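To tie this back to the directory layout from the question, here is a minimal sketch of an export script built around that one-liner. The dir_project and dir_to names and the export target are hypothetical; Hash::Merge::Simple gives the right-most file precedence, so the base file goes first.
#!/bin/bash
# Sketch: merge each base config with the module files of the same name.
# Paths are hypothetical; adjust to your project layout.
shopt -s nullglob                          # modules without a matching file simply drop out

dir_project="."                            # project root (hypothetical)
dir_to="./export"                          # deployment target (hypothetical)

mkdir -p "$dir_to/sys/config"

for base in "$dir_project"/sys/config/*.yml; do
    name=$(basename "$base")
    # base file first, module files after it, so module values win (right-most precedence)
    perl -MYAML=LoadFile,Dump -MHash::Merge::Simple=merge \
         -E 'say Dump(merge(map { LoadFile($_) } @ARGV))' \
         "$base" "$dir_project"/mod/*/config/"$name" \
         > "$dir_to/sys/config/$name"
done
# Config files that exist only in a module (e.g. totally-new.yml) still need a plain copy.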

I recommend yq -m. yq is a swiss army knife for yaml, very similar to jq (for JSON).
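Note that yq's command-line syntax has changed between major versions, so check which yq you have installed. As a sketch, with mikefarah's yq v4 a deep merge of two files can be written with the multiply operator (later files win):
# mikefarah yq v4: deep-merge f2.yaml over f1.yaml
yq eval-all '. as $item ireduce ({}; . * $item)' f1.yaml f2.yaml > merged.yaml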

No.
Bash has no support for nested data structures (its maps are integer->string or string->string only), and thus cannot represent arbitrary YAML documents in-memory.
Use a more powerful language for this task.
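To make the limitation concrete, a tiny sketch (the key name is only an illustration): the best bash can do is flatten a nested path into a single string key.
# bash 4+: an associative array is one flat string->string map, no real nesting
declare -A cfg
cfg["date.format.date_regular"]='%d-%m-%Y'     # the "nesting" lives inside the key string
echo "${cfg[date.format.date_regular]}"        # prints %d-%m-%Y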

Late to the party, but I also wrote a tool for this:
https://github.com/benprofessionaledition/yamlmerge
It's almost identical to Ondra's JVM tool (they're even both called "yaml merge"), the key difference being that it's written in Go so it compiles to a ~3MB binary with no external dependencies. We use it in Gitlab-CI containers.

Bash is a bit of a stretch for this (it could be done but it would be error prone). If all you want to do is call a few things from a bash shell (as opposed to actually scripting the merge using bash functions) then you have a few options.
I noticed there is a Java-based yaml-merge tool, but that didn't suit my fancy very much, so I kept looking. In the end I cobbled together something using two tools: yaml2json and jq.
Warning: Since JSON's capabilities are only a subset of YAML's, this is not a lossless process for complex YAML structures. It will work for a lot of simple key/value/sequence scenarios but will muck things up if your input YAML is too fancy. Test it on your data types to see if it does what you expect.
Use yaml2json to convert your inputs to JSON:
yaml2json input1.yml > input1.json
yaml2json input2.yml > input2.json
Use jq to iterate over the objects and merge them recursively (see this question and its answers for details). List files in reverse order of importance, as values in later ones will clobber earlier ones:
jq -s 'reduce .[] as $item ({}; . * $item)' input1.json input2.json > merged.json
Take it back to YAML:
json2yaml merged.json > merged.yml
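If you prefer, the three steps collapse into a single command with bash process substitution (a sketch, assuming the same yaml2json/json2yaml tools are on PATH):
json2yaml <(jq -s 'reduce .[] as $item ({}; . * $item)' \
    <(yaml2json input1.yml) <(yaml2json input2.yml)) > merged.yml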
If you want to script this, of course the usual bash mechanisms are your friend. And if you happen to be in GNU-Make like I was, something like this will do the trick:
.SECONDEXPANSION:
SHELL := /bin/bash   # the <( ) process substitutions below need bash, not the default /bin/sh
merged.yml: input1.yml input2.yml
	json2yaml <(jq -s 'reduce .[] as $$item ({}; . * $$item)' $(foreach YAML,$^,<(yaml2json $(YAML)))) > $@

There is a tool that merges YAML files - merge-yaml.
It supports full YAML syntax, and is capable of expanding environment variable references.
I forked it and released it in the form of an executable .jar.
https://github.com/OndraZizka/yaml-merge
Usage:
./bin/yaml-merge.sh ./*.yml > result.yml
It is written in Java, so you need Java (I think 8 or newer) installed.
(Btw, if someone wants to contribute, that would be great.)
In general, merging YAML is not a trivial thing, in the sense that the tool doesn't always know what you really want it to do. You can merge structures in multiple ways. Consider this example:
foo:
  bar: bar2
  baz:
    - baz1
---
foo:
  bar: bar1
  baz:
    - baz2
  goo: gaz1
A few questions/unknowns arise:
Should the 2nd foo tree replace the first one?
Should the 2nd bar replace the first one, or be merged into an array?
Should the 2nd baz array replace the 1st, or be merged?
If merged, then how - should duplicates be allowed, or should the tool keep the values unique? Should the order be managed in some way?
Etc. One may object that there can be some sensible default, but often, real-world requirements call for different operations.
Other tools and libraries that deal with data structures handle this by defining a schema with metadata; for instance, JAXB or Jackson use Java annotations.
For this general tool, that is not an option, so the user would have to control this through a) the input data, or b) parameters. a) is impractical and sometimes impossible, b) is tedious and needs a fancy syntax like jq has.
That said, Caleb's answer might be what you need. Although that solution reduces your data to what JSON is capable of, so you will lose comments, the various ways to represent long strings, usage of JSON within YAML, etc., which is not too user friendly.

Related

What is the best way to create a configuration file in bash?

I use bash to configure many of my build tools.
I want to use variables to set certain things so that the build tools can be used in many environments.
Currently I do
export var=val
and then I do
$var
when I need to use it.
Is this the best way to go about it, as I know there are many ways to do things in bash.
Example
#!/bin/bash
path_bash="$HOME/root/config/bash/"
source "${path_bash}_private.sh"
source "${path_bash}config.sh"
source "${path_bash}utility.sh"
source "${path_bash}workflow.sh"
source "${path_bash}net.sh"
source "${path_bash}makeHTM.sh"
#
#
#
# Divider - Commands
#
#
#
cd ~/root
Skip the export unless you really need it (that is, unless you need that variable to propagate to the exec'ed processes that you execute).
If you do export, it's usually a good idea to capitalize the variable (export VAR=val) to make it apparent that it will spread to executed binaries.
When you refer to a shell variable, you usually want to double quote it ("$var") unless you need glob expansion and whitespace-splitting (splitting on $IFS characters, to be exact) done on that variable expansion.
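A small sketch of those three points (the variable names are made up):
build_dir="$HOME/my builds"      # not exported: visible to this script and anything it sources
export BUILD_MODE=release        # exported and capitalized: also visible to child processes

printf '%s\n' "$build_dir"       # quoted: the embedded space stays inside one argument
printf '%s\n' $build_dir         # unquoted: word-split into ".../my" and "builds"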
The sparrow automation framework gives you the opportunity to configure bash scripts in several ways:
passing named command-line parameters
configuration files in YAML format
configuration files in JSON format
configuration files in Config::General format
Of course you need to pack your bash scripts into sparrow plugins first to get this behavior, but this is not a big deal. Follow https://github.com/melezhik/sparrow#configuring-tasks for details.
PS: disclosure - I am the author of the sparrow tool.

How to add arguments to $@ without splitting the space-containing arguments that are already there?

My goal is to encapsulate the svn command to provide additional features. Here is a minimal (not) working example of the function used to do that:
function custom_svn {
    newargs="$@ my_additional_arg"
    svn $newargs
}
But this does not work with arguments containing spaces. For instance when called like this:
message="my commit text"
custom_svn ci some_file.txt -m "$message"
SVN tries to commit some_file.txt, commit and text, with the message "my" instead of committing only some_file.txt with message "my commit text".
I guess the issue lies in the erroneous use of $@, but I'm not sure how to proceed to keep the message whole.
In standard sh, the positional arguments are the only sort of array you've got. You can append to them with set:
set -- "$@" my_additional_arg
svn "$@"
In bash, you can create your own custom array variables too:
newargs=("$@" my_additional_arg)
svn "${newargs[@]}"
Of course, as DigitalRoss answered, in your specific example you can avoid using a variable entirely, but I'll guess that your example is a bit oversimplified.
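Applied to the wrapper from the question, the array form looks like this (sketch):
function custom_svn {
    local newargs=("$@" my_additional_arg)
    svn "${newargs[@]}"
}
custom_svn ci some_file.txt -m "my commit text"   # the message stays a single argument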
Do this:
svn "$@" my_additional_arg
The problem is that the construct "$@" is special only in that exact form.
It's interesting, this brings up the whole good-with-the-bad nature of shell programming. Because it's a macro processor, it's much easier to write simple things in bash than in full languages. But, it's harder to write complex things, because every time you try to go a level deeper in abstraction you need to change your code to properly expand the macros in the new level of evaluation.

Find and replace in file with script

I want to find and replace the VALUE in an XML file:
<test name="NAME" value="VALUE"/>
I have to filter by name (because there are a lot of lines like that).
Is it possible?
Thanks for your help.
Since you tagged the question "bash", I assume that you're not trying to use an XML library (although I think an XML expert might be able to give you something like an XSLT processor command that solves this question very robustly), but that you're simply interested in doing search & replace from the commandline.
I am using perl for this:
perl -pi -e 's#VALUE#replacement#g' *.xml
See the perlrun man page: very shortly put, the -p switch puts perl into text-processing mode, -i stands for "in-place", and -e lets you specify an expression to apply to all lines of input.
Also note (if you are not too familiar with that already) that you may use other characters than # (common ones are %, a comma, etc.) that don't clash with your search & replacement strings.
There is one small caveat: perl will read & write all files given on the command line, even those that did not change, so their modification times will be updated regardless. (I usually work around that with some more shell magic, e.g. using grep -l or grin -l to select files for perl to work on.)
EDIT: If I understand your comments correctly, you also need help with the regular expression to apply. Let me briefly suggest something like this then:
perl -pi -e 's,(name="NAME" value=)"[^"]*",\1"NEWVALUE",g' *.xml
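For example, the grep -l pre-filter mentioned above could look like this (a sketch; -r is GNU xargs and the filenames are assumed to contain no spaces):
grep -l 'name="NAME"' *.xml \
  | xargs -r perl -pi -e 's,(name="NAME" value=)"[^"]*",\1"NEWVALUE",g'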
Related: bash XHTML parsing using xpath
You can use sed:
sed 's/\(<test name="NAME"\) value="VALUE"/\1 value="YourValue"/' test.xml
where test.xml is the XML document containing the given node. This is very fragile, and you can work to make it more flexible if you need to do this substitution multiple times. For instance, the current statement is case sensitive, so it won't substitute the value on a node with name="name", but you can add a case-insensitivity flag to the end of the statement, like so:
sed 's/\(<test name="NAME"\) value="VALUE"/\1 value="YourValue"/I' test.xml
Another option would be to use XSLT, but it would require you to download an external library. It's pretty versatile, and could be a viable option for more complex modifications to an XML document.

Bash: Trying to append to a variable name in the output of a function

this is my very first post on Stackoverflow, and I should probably point out that I am EXTREMELY new to a lot of programming. I'm currently a postgraduate student doing projects involving a lot of coding in various programs, everything from LaTeX to bash, MATLAB etc etc.
If you could explicitly explain your answers, that would be much appreciated, as I'm trying to learn as I go. I apologise if there is an answer elsewhere that does what I'm trying to do, but I have spent a couple of days looking now.
So to the problem I'm trying to solve: I'm currently using a selection of bioinformatics tools to analyse a range of genomes, and I'm trying to somewhat automate the process.
I have a few sequences with names that look like this for instance (all contained in folders of their own currently as paired files):
SOL2511_S5_L001_R1_001.fastq
SOL2511_S5_L001_R2_001.fastq
SOL2510_S4_L001_R1_001.fastq
SOL2510_S4_L001_R2_001.fastq
...and so on...
I basically wish to automate the process by turning these in to variables and passing these variables to each of the programs I use in turn. So for example my idea thus far was to assign them as wildcards, using the R1 and R2 (which appears in all the file names, as they represent each strand of DNA) as follows:
#!/bin/bash
seq1=*R1_001*
seq2=*R2_001*
On a rudimentary level this works, as it returns the correct files, so now I pass these variables to my first function which trims the DNA sequences down by a specified amount, like so:
# seqtk is the program suite, trimfq is a function within it,
# and the options -b -e specify how many bases to trim from the beginning and end of
# the DNA sequence respectively.
seqtk trimfq -b 10 -e 20 $seq1 >
seqtk trimfq -b 10 -e 20 $seq2 >
So now my problem is I wish to be able to append something like "_trim" to the output file which appears after the >, but I can't find anything that seems like it will work online.
Alternatively, I've been hunting for a script that will take the name of the folder that the files are in, and create a variable for the folder name which I can then give to the functions in question so that all the output files are named correctly for use later on.
Many thanks in advance for any help, and I apologise that this isn't really much of a minimum working example to go on, as I'm only just getting going on all this stuff!
Joe
EDIT
So I modified @ghoti's for loop (it does the job wonderfully, I might add; rep for you :D) and now I append trim_, as the loop as it was before ended up giving me a .fastq.trim, which will cause errors later.
Is there any way I can append _trim to the end of the filename, but before the extension?
Explicit is usually better than implicit when matching filenames. Your wildcards may match more than you expect, especially if you have versions of the files with "_trim" appended to the end!
I would be more precise with the wildcards, and use for loops to process the files instead of relying on seqtk to handle multiple files. That way, you can do your own processing on the filenames.
Here's an example:
#!/bin/bash
# Define an array of sequences
sequences=(R1_001 R2_001)
# Step through the array...
for seq in "${sequences[@]}"; do
    # Step through the files in this sequence...
    for file in SOL*_${seq}.fastq; do
        seqtk trimfq -b 10 -e 20 "$file" > "${file}.trim"
    done
done
I don't know how your folders are set up, so I haven't addressed that in this script. But the basic idea is that if you want the script to be able to manipulate individual filenames, you need something like a for loop to handle that manipulation on a per-filename basis.
Does this help?
UPDATE:
To put _trim before the extension, replace the seqtk line with the following:
seqtk trimfq -b 10 -e 20 "$file" > "${file%.fastq}_trim.fastq"
This uses something documented in the Bash man page under Parameter Expansion if you want to read up on it. Basically, the ${file%.fastq} takes the $file variable and strips off a suffix. Then we add your extra text, along with the suffix.
You could also strip an extension using basename(1), but there's no need to call something external when you can use something built in to the shell.
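For comparison, both spellings give the same name (a small sketch):
file=SOL2511_S5_L001_R1_001.fastq
basename "$file" .fastq      # external command: prints SOL2511_S5_L001_R1_001
echo "${file%.fastq}"        # shell built-in, same result, no extra process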
Instead of setting variables with the filenames, you could pipe the output of ls to the command you want to run with these filenames, like this:
ls *R{1,2}_001* | xargs -I@ sh -c 'seqtk trimfq -b 10 -e 20 "$1" > "${1}_trim"' -- @
xargs -I@ will grab the output of the previous command and store it in @ to be used by seqtk

Opposite of Linux Split

I have a huge file and I split the big file into several small chunks and divide and conquer. Now I have a folder contains a list of files like below:
output_aa #(the output file done: cat input_aa | python parse.py > output_aa)
output_ab
output_ac
output_ad
...
I am wondering: is there a way to merge those files back together FOLLOWING THE INDEX ORDER?
I know I could do it by using
cat * > output.all
but I am more curious whether another magical command that comes with split already exists.
The magic command would be:
cat output_* > output.all
There is no need to sort the file names, as the shell already does it (*).
As its name suggests, cat's original design was precisely to conCATenate files, which is basically the opposite of split.
(*) Edit:
Should you use a (hypothetical?) locale with a collating order where the a-z order is not abcdefghijklmnopqrstuvwxyz, here is one way to overcome the issue:
LC_ALL=C sh -c 'cat output_* > output.all'
There are other ways to concat files together, but there is no magical "opposite of split" in "linux".
Of course, talking about "linux" in general is a bit far-fetched, as many distributions ship different tools (most of them already use a different default shell, like sh, bash, csh, zsh, ksh, ...), but if you're talking about Debian-based Linux at least, I don't know of any distribution which would provide such a tool.
For sorting you can use the "sort" command.
Also be aware that using ">" to redirect stdout will overwrite any existing contents, while ">>" will append to an existing file.
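For instance:
cat output_aa  > output.all   # ">" truncates output.all first (or creates it)
cat output_ab >> output.all   # ">>" appends the next chunk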
I don't want to copycat, but to still make this answer complete: what jlliagre said about the cat command should also be considered, of course (cat was made to con-"cat" files, effectively making it possible to reverse the split command - but only provided you use the same ordering of files, so it's not exactly the "opposite of split", though it will work that way in close to 100% of cases; see the comments under jlliagre's answer for specifics).
