Import a YAML (fixtures) file from another YAML file

For better organization (e.g. separating seed and test data), is it
possible to split the YAML file and import the first one from the second?
Of course, the variables in file1 should be available for use in the
second. I use a SnakeYAML-based parser in Java, if that matters.
Thanks.
Update 1 (example):
Seed file: seed.yaml
Priority(L1E1):
  level: 1
  priorityCode: E1
  description: Escalation
Test data file: test-data.yaml
Request(RER1):
  priority: L1E1
  title: Something
So I need to split the files, as they are becoming huge. Also, the variable/data (L1E1 in this case) defined in one file needs to be accessible in the second file.

YAML does not define an "include" directive.
What do you mean by "the variables in file1 should be available to use in the second"? Do you expect anchors and aliases to work across the files?
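Since YAML itself offers no include directive, one workaround is to do the combining on the application side: load both files with SnakeYAML and merge the resulting maps, so the data from seed.yaml and test-data.yaml ends up in one structure. A minimal sketch along those lines (the FixtureLoader class name is just for illustration):

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import org.yaml.snakeyaml.Yaml;

// Hypothetical helper: loads both fixture files and merges the top-level
// mappings, so a record from test-data.yaml can be resolved against the
// entries defined in seed.yaml (e.g. the L1E1 priority).
public class FixtureLoader {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws Exception {
        Yaml yaml = new Yaml();
        Map<String, Object> merged = new HashMap<>();
        for (String file : new String[] {"seed.yaml", "test-data.yaml"}) {
            try (InputStream in = new FileInputStream(file)) {
                merged.putAll((Map<String, Object>) yaml.load(in));
            }
        }
        // Both "Priority(L1E1)" and "Request(RER1)" now live in one map, so the
        // application can resolve the cross-file reference itself.
        System.out.println(merged.keySet());
    }
}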

Related

Helm split global section

I have a helm values.yaml file like below
global:
  foo: bar
  foo1: bar1
random-chart:
  fooo: baar
My use case is to append values in both global and random-chart during run time.
After appending values, the chart looks like this.
global:
  foo: bar
  foo1: bar1
random-chart:
  fooo: baar
global:
  secret: password
random-chart:
  secret1: password1
Since there are two different global keys and two different random-chart keys, will it work as intended, and is it good practice to do that?
This probably won't work as intended.
The YAML 1.2.2 spec notes (emphasis from original):
The content of a mapping node is an unordered set of key/value node pairs, with the restriction that each of the keys is unique.
And in discussing loading errors continues:
... mapping keys may not be unique ....
So the YAML file you show has a mapping with two keys both named global and two keys both named random-chart, and that's not valid. Depending on the specific YAML library that's being used, this might be interpreted as a loading error, or the library might just pick the last value of global.
In general, it's hard to work with YAML files using line-oriented shell tools, since there are so many syntactic variations. A dedicated library in a higher-level language will usually work better. For example, using the Python PyYAML library:
import yaml

with open('values.in.yaml', 'r') as f:
    values = yaml.safe_load(f)

values['global']['secret'] = 'password'
values['random-chart']['secret1'] = 'password1'

with open('values.out.yaml', 'w') as f:
    yaml.dump(values, f)
Two other possibilities to consider: you can pass multiple helm install -f options, so it's possible to write out a file with just the values you're adding, and those will be merged with the other settings (you do not need to repeat the values from the chart's values.yaml file). Depending on your environment, you may also find it easier to dynamically write out JSON files, particularly if you don't need to re-read the base chart; setups like Jenkins or JavaScript applications have built-in JSON support, and valid JSON turns out to be valid YAML.
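For example (the release, chart, and file names here are only illustrative), the appended values could live in a small file of their own and be layered on top of the chart's defaults at install time:

# extra-values.yaml: only the values being added
global:
  secret: password
random-chart:
  secret1: password1

helm install my-release ./my-chart -f extra-values.yaml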

How to concatenate many files using their basenames?

I study genetic data from 288 fish samples (Fish_one, Fish_two ...)
I have four files per fish, each with a different suffix.
eg. for sample_name Fish_one:
file 1 = "Fish_one.1.fq.gz"
file 2 = "Fish_one.2.fq.gz"
file 3 = "Fish_one.rem.1.fq.gz"
file 4 = "Fish_one.rem.2.fq.gz"
I would like to apply the following concatenate instructions to all my samples, using maybe a text file containing a list of all the sample_name, that would be provided to a loop?
cp sample_name.1.fq.gz sample_name.fq.gz
cat sample_name.2.fq.gz >> sample_name.fq.gz
cat sample_name.rem.1.fq.gz >> sample_name.fq.gz
cat sample_name.rem.2.fq.gz >> sample_name.fq.gz
In the end, I would have only one file per sample, ideally in a different folder.
I would be very grateful to receive a bit of help on this one, even though I'm sure the answer is quite simple for a non-novice!
Many thanks,
Noé
I would like to apply the following concatenate instructions to all my
samples, using maybe a text file containing a list of all the
sample_name, that would be provided to a loop?
In the first place, the name of the cat command is mnemonic for "concatenate". It accepts multiple command-line arguments naming sources to concatenate together to the standard output, which is exactly what you want to do. It is poor form to use a cp and three cats where a single cat would do.
In the second place, although you certainly could use a file of name stems to drive the operation you describe, it's likely that you don't need to go to the trouble of creating or maintaining such a file. Globbing will probably do the job satisfactorily. As long as there aren't any name stems that need to be excluded, I'd probably go with something like this:
for f in *.rem.1.fq.gz; do
    stem=${f%.rem.1.fq.gz}
    cat "$stem".{1,2,rem.1,rem.2}.fq.gz > "${other_dir}/${stem}.fq.gz"
done
That recognizes the groups present in the current working directory by the members whose names end with .rem.1.fq.gz. It extracts the common name stem from that member's name, then concatenates the four members to the correspondingly-named output file in the directory identified by ${other_dir}. It relies on brace expansion to form the arguments to cat, so as to minimize code and (IMO) improve clarity.

merge multiple YAML files under a new field

I have several YAML files representing the pieces of a whole. I want to merge them under a new field ("guests") that declares the whole.
file1.yml
name: johnny
age: 23
file2.yml
name: sally
age: 21
output.yml
guests:
  - name: johnny
    age: 23
  - name: sally
    age: 21
Tools like yq make merging/overwriting easy, but I can't find any that help me nest values under new fields.
The tools you are looking for come under several different names; they
are called programming languages or scripting languages. I recommend
you use Python with ruamel.yaml installed. (Disclaimer: I am the author of
that package.)
Once you have that you can do:
python -c "import sys, ruamel.yaml; yaml=ruamel.yaml.YAML(); yaml.indent(sequence=4, offset=2); yaml.dump(dict(guests=[yaml.load(open(f)) for f in sys.argv[1:]]), sys.stdout)" file*.yml > output.yml
To get the desired output.
A few notes:
YAML files should have the .yaml extension unless your filesystem doesn't support that.
By default, sequence elements are indented two spaces and the dash has no offset within that (i.e. it would align with the g of guests); hence the yaml.indent() call.
Any comments on the key-value pairs in your input files would be preserved, but they are not automatically pushed out to the right from their original starting column unless that is necessary because a mapping value gets in the way. Adjusting that is possible, but I would not recommend trying it in a one-liner.
If you need to preserve quotes add yaml.preserve_quotes = True; in the one-liner
If any of your YAML files contains multiple YAML documents, the above will fail. You would need to think about how to combine those documents, and use a try/except clause to fall back to yaml.load_all() for the files that do (it would be a good idea to abandon the one-liner in favour of a multi-line Python program at that point).
You can also do the above using the yaml commandline utility (installable with pip install ruamel.yaml.cmd>=0.5.0):
yaml from-dirs --sequence ./*.yml | yaml map --indent 2,4,2 guests - > output.yml
but this is a two-step process (first combining the multiple YAML files into a root-level sequence, then pushing that sequence down to be the value of a mapping), and thus twice as slow as the one-liner.

MapReduce One-to-one processing of multiple input files

Please clarify:
I have a set of input files (say 10) with specific names. I run a word count job on all the files at once (the input path is the folder). I am expecting 10 output files with the same names as the input files, i.e. the words in file1 should be counted and stored in a separate output file named "file1", and so on for all the files.
There are two approaches you can take to achieve multiple outputs:
Use the MultipleOutputs class. Refer to the MultipleOutputs documentation (https://hadoop.apache.org/docs/r2.6.3/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html); for more information about how to implement it, see http://appsintheopen.com/posts/44-map-reduce-multiple-outputs
Another option is using LazyOutputFormat; however, it is used in conjunction with MultipleOutputs. For more information about its implementation, see https://ssmolen.wordpress.com/2014/07/09/hadoop-mapreduce-write-output-to-multiple-directories-depending-on-the-reduce-key/
I feel that using LazyOutputFormat in conjunction with the MultipleOutputs class is the better approach.
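For reference, a rough driver-side sketch of that combination; the class name, the path handling, and the "counts" named output are illustrative, not taken from the linked posts:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Hypothetical driver showing only the output-side wiring; the mapper and
// reducer classes are assumed to be configured elsewhere.
public class PerFileWordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "per-file word count");
        job.setJarByClass(PerFileWordCountDriver.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // LazyOutputFormat suppresses the default empty part-r-* files, so only
        // the outputs the reducer actually writes to show up.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

        // Register a named output that the reducer can write to via
        // MultipleOutputs#write(...).
        MultipleOutputs.addNamedOutput(job, "counts", TextOutputFormat.class,
                Text.class, IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}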
Set the number of reduce tasks to be equal to the number of input files. This will create the given number of output files, as well.
Add a file prefix to each map output key (word). E.g., when you meet the word "cat" in a file named "file0.txt", you can emit the key "0_cat", or "file0_cat", or anything else that is unique for "file0.txt". Use the context to get the filename each time.
Override the default Partitioner, to make sure that all the map output keys with prefix "0_", or "file0_", go to the first partition, all the keys with prefix "1_", or "file1_", go to the second, etc.
In the reducer, remove the "x_" or "filex_" prefix from the output key and use it as the name of the output file (using MultipleOutputs); a rough sketch of these pieces follows this list. Otherwise, if you don't want MultipleOutputs, you can easily do the mapping between output files and input files by checking your Partitioner code (e.g., part-00000 will be partition 0's output).
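A rough sketch of steps 2-4 (prefixed keys, the custom Partitioner, and a reducer writing through MultipleOutputs); the class names and the "N_word" key format are illustrative, not prescribed by the answer:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Routes every key of the form "<fileIndex>_<word>" (e.g. "0_cat") to the
// partition matching its file index, so each reducer handles one input file.
class FilePrefixPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String fileIndex = key.toString().split("_", 2)[0];
        return Integer.parseInt(fileIndex) % numPartitions;
    }
}

// Strips the prefix again and writes each count under an output file named
// after the originating input file.
class PerFileReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        String[] parts = key.toString().split("_", 2); // e.g. ["0", "cat"]
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Written to e.g. "file0-r-00000" rather than the default part file.
        mos.write(new Text(parts[1]), new IntWritable(sum), "file" + parts[0]);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}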

Pig load files using tuple's field

I need help for following use case:
Initially we load some files and process those records (or, more technically, tuples). After this processing, we finally have tuples of the form:
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00001, some_field_3)
So basically, the tuples have a file path as the value of one of their fields. (We can obviously transform this into tuples having only one field whose value is the file path, OR into a single tuple having only one field with a delimiter-separated (say comma-separated) string of paths.)
So now I have to load these files in a Pig script, but I am not able to do so. Could you please suggest how to proceed? I thought of using the advanced foreach operator and tried the following:
data = foreach tuples_with_file_info {
    fileData = load $2 using PigStorage(',');
    ....
    ....
};
However, it's not working.
Edit:
For simplicity lets assume, I have single tuple with one field having file name:
(hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000)
You can't use Pig out of the box to do it.
What I would do is use some other scripting language (bash, Python, Ruby...) to read the file from HDFS and concatenate the file names into a single string that you can then pass as a parameter to a Pig script to use in your LOAD statement. Pig supports globbing, so you can do the following:
a = LOAD '{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}' ...
so all that's left to do is read the file that contains those file names, concatenate them into a glob such as:
{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}
and pass that as a parameter to Pig so your script would start with:
a = LOAD '$input'
and your pig call would look like this:
pig -f script.pig -param input='{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}'
First, store the tuples_with_file_info into some file:
STORE tuples_with_file_info INTO 'some_temporary_file';
then,
data = LOAD 'some_temporary_file' using MyCustomLoader();
where
MyCustomLoader is nothing but a Pig loader extending LoadFunc, which uses MyInputFormat as InputFormat.
MyInputFormat is an encapsulation over the actual InputFormat (e.g. TextInputFormat) which has to be used to read actual data from the files (e.g. in my case from file hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000).
In MyInputFormat, override the getSplits method: first read the actual file name(s) from some_temporary_file (you have to get this file's name from the Configuration's mapred.input.dir property), then update that same mapred.input.dir property with the retrieved file names, and finally return the result from the wrapped-up InputFormat (e.g. in my case TextInputFormat).
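A rough sketch of that getSplits override, assuming for simplicity that some_temporary_file is a single file (a real implementation would also have to handle the directory of part files that STORE actually produces):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical wrapper: the job's input path initially points at
// some_temporary_file; getSplits reads the real paths out of it, rewrites
// mapred.input.dir, and then delegates split computation to TextInputFormat.
public class MyInputFormat extends TextInputFormat {
    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        Configuration conf = context.getConfiguration();
        Path indirectFile = new Path(conf.get("mapred.input.dir"));

        List<String> realPaths = new ArrayList<>();
        FileSystem fs = indirectFile.getFileSystem(conf);
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(indirectFile)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    realPaths.add(line.trim());
                }
            }
        }

        // Point the job at the actual data files and let TextInputFormat
        // compute the splits as usual.
        conf.set("mapred.input.dir", String.join(",", realPaths));
        return super.getSplits(context);
    }
}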
Note: 1. You cannot use the setLocation API from the LoadFunc (or some other similar API) to read the contents of some_temporary_file, as its contents will be available only at run time.
2. You may wonder: what if the LOAD statement executes before the STORE? This will not happen, because if STORE and LOAD use the same file in the script, Pig ensures that the jobs are executed in the right sequence. For more detail you may read the section Store-load sequences on the Pig wiki.
