Opposite of Linux Split - shell

I have a huge file that I split into several small chunks to divide and conquer. Now I have a folder containing a list of files like below:
output_aa   # (produced by: cat input_aa | python parse.py > output_aa)
output_ab
output_ac
output_ad
...
I am wondering: is there a way to merge those files back together FOLLOWING THE INDEX ORDER?
I know I could do it by using
cat * > output.all
but I am more curious whether there is another magical command that already exists and comes with split..

The magic command would be:
cat output_* > output.all
There is no need to sort the file names, as the shell already does it (*).
As its name suggests, cat's original design was precisely to conCATenate files, which is basically the opposite of split.
(*) Edit:
Should you use a (hypothetical?) locale whose collating order for a-z is not abcdefghijklmnopqrstuvwxyz, here is one way to overcome the issue:
LC_ALL=C sh -c 'cat output_* > output.all'

There are other ways to concatenate files together, but there is no magical "opposite of split" in "linux".
Of course, talking about "linux" in general is a bit far-fetched, as many distributions ship different tools (most already use a different default shell, like sh, bash, csh, zsh, ksh, ...), but speaking for Debian-based distributions at least, I don't know of any that provides such a tool.
For sorting, you can use the linux command sort;
Also be aware that using ">" to redirect stdout will overwrite any existing contents, while ">>" will append to an existing file.
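For example:
printf 'one\n' > out.txt     # creates out.txt, or truncates it if it exists
printf 'two\n' >> out.txt    # appends; out.txt now holds both lines
printf 'three\n' > out.txt   # truncates again; only "three" remains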
I don't want to copycat, but to make this answer complete: what jlliagre said about the cat command should of course also be considered ("cat" was made to con-"cat"-enate files, effectively making it possible to reverse the split command). That only holds if you use the same ordering of files, so it's not exactly the "opposite of split", but it will work that way in close to 100% of the cases (see the comments under jlliagre's answer for specifics).

Related

Replace/sync only certain lines using Bash, SSH and rsync

I am looking for a quick and dirty one-liner to sync only certain settings in remote config files. Need to preserve what's unique and sync generic settings. Example:
Config1.conf:
HOSTNAME=COMP1
IP=10.10.13.10
LOCATION=SITE_A
BUILDING=DEPT_IT
ROOM=COMP_LAB1
Remote-Config2.txt:
HOSTNAME=COMP2
IP=10.10.13.11
LOCATION=FOO
BUILDING=BAR
ROOM=BAZ
I need to sync or copy/replace only the bottom 3 lines over ssh. The line numbers are predictable, by the way: always lines 4, 5 and 6 in this case.
Here's a working idea that is missing one piece (a standard replacement for the non-standard utility I used to replace the vars in the local conf):
for var in $(ssh root@10.10.8.12 'sed -n "4,6p" /etc/conf1.conf');do <missing piece> ${var/=*}=${var/*=} local-conf.conf; done
So this uses variable expansion and a non-standard utility, but it needs something like a sed or Perl routine to replace the info in the local conf.
Update
The last line of code actually works - tested! However, the missing piece is a custom, non-standard utility. I'm asking whether someone can think of something, using standard Linux tools, to replace it.
One solution would be to match on the left side and then replace the right side; this is basically what that utility does: it looks for the variable in the conf, then sets it. Using variable expansion is one way (shown above).
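A sketch of that idea using only standard tools (the host, file names, and line numbers are taken from the question; it assumes GNU sed's -i and values containing no characters special to sed):
for var in $(ssh root@10.10.8.12 'sed -n "4,6p" /etc/conf1.conf'); do
    key=${var%%=*}
    val=${var#*=}
    # replace the whole line that starts with this key
    sed -i "s|^${key}=.*|${key}=${val}|" local-conf.conf
done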
Here's an alternative solution that does not require the command to have special knowledge of the file contents:
Take a copy of the files you want to sync. Then, in the copy, deliberately vandalise (arbitrarily modify) the lines you do not want synced. It doesn't matter what they say as long as there are the same number of lines and they'll never match the actual file contents. Have some fun. This becomes your base version. Your example might look like this:
HOSTNAME=foo
IP=bar
LOCATION=SITE_A
BUILDING=DEPT_IT
ROOM=COMP_LAB1
rsync the remote files into a temporary location. This is the remote version.
For each file, take a three-way diff.
diff3 -3 <localfile> <basefile> <remotefile>
The output of diff3 is an "ed script" that describes what edits to make to the local file so that it would look like the remote file.
The -3 option tells it to only output the non-conflicting differences. This is why we vandalised the base files in the first place: so those lines would have conflicts.
Once you have the ed script for a file, you can visually check it, if you choose, and then apply the update using patch:
cat <ed-script> | patch --ed <localfile>
So, to do this recursively, you might have:
cd $localdir
for file in `find . -type f`; do
    diff3 -3 "$file" "$basedir/$file" "$remotedir/$file" | patch --ed "$file"
done
You probably need to add some checks that the base and remote files actually exist.
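Putting it all together, a sketch with those existence checks added (the three directory variables are assumptions):
#!/bin/bash
localdir=/etc/confs          # the live local files
basedir=/var/sync/base       # the deliberately vandalised base copies
remotedir=/tmp/remote-copy   # rsynced from the remote host
cd "$localdir" || exit 1
find . -type f | while read -r file; do
    if [ -f "$basedir/$file" ] && [ -f "$remotedir/$file" ]; then
        diff3 -3 "$file" "$basedir/$file" "$remotedir/$file" | patch --ed "$file"
    else
        echo "skipping $file: missing base or remote copy" >&2
    fi
done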

Merging yaml config files recursively with bash

Is it possible, using some smart piping and coding, to merge yaml files recursively?
In PHP, I make an array of them (each module can add or update config nodes of/in the system).
The goal is an export shell script that merges all the separate module folders' config files into big merged files. It's faster and more efficient, and the customer does not need the modularity by the time we deploy new versions (via FTP, for example).
It should behave like the PHP function: array_merge_recursive
The filesystem structure is like this:
mod/a/config/sys.yml
mod/a/config/another.yml
mod/b/config/sys.yml
mod/b/config/another.yml
mod/c/config/totally-new.yml
sys/config/sys.yml
Config looks like:
date:
  format:
    date_regular: %d-%m-%Y
And a module may, say, do this:
date:
  format:
    date_regular: regular dates are boring
    date_special: !!!%d-%m-%Y!!!
So far, I have:
#!/bin/bash
#........
cp -R $dir_project/ $dir_to/
for i in $dir_project/mod/*/
do
    cp -R "${i}/." $dir_to/sys/
done
This of course overwrites the existing config files on each pass of the loop (the rest of the system files are uniquely named).
Basically, I need a yaml parser for the command line, an array_merge_recursive-like alternative, and then a yaml writer to output the merged result. I fear I have to start learning Python because bash won't cut it on this one.
You can use perl, for example. The next one-liner:
perl -MYAML::Merge::Simple=merge_files -MYAML -E 'say Dump merge_files(@ARGV)' f1.yaml f2.yaml
for the following input files: f1.yaml
date:
  epoch: 2342342343
  format:
    date_regular: "%d-%m-%Y"
f2.yaml
date:
  format:
    date_regular: regular dates are boring
    date_special: "!!!%d-%m-%Y!!!"
prints the merged result...
---
date:
  epoch: 2342342343
  format:
    date_regular: regular dates are boring
    date_special: '!!!%d-%m-%Y!!!'
Because @Caleb pointed out that the YAML::Merge::Simple module is now developer-only, here is a replacement. It is a bit longer and uses two (but commonly available) modules: YAML and Hash::Merge::Simple.
perl -MYAML=LoadFile,Dump -MHash::Merge::Simple=merge -E 'say Dump(merge(map{LoadFile($_)}@ARGV))' f1.yaml f2.yaml
produces the same as above.
I recommend yq -m. yq is a Swiss Army knife for YAML, very similar to jq (for JSON).
No.
Bash has no support for nested data structures (its maps are integer->string or string->string only), and thus cannot represent arbitrary YAML documents in-memory.
Use a more powerful language for this task.
Late to the party, but I also wrote a tool for this:
https://github.com/benprofessionaledition/yamlmerge
It's almost identical to Ondra's JVM tool (they're even both called "yaml merge"), the key difference being that it's written in Go so it compiles to a ~3MB binary with no external dependencies. We use it in Gitlab-CI containers.
Bash is a bit of a stretch for this (it could be done but it would be error prone). If all you want to do is call a few things from a bash shell (as opposed to actually scripting the merge using bash functions) then you have a few options.
I noticed there is a Java-based yaml-merge tool, but that didn't suit my fancy very much, so I kept looking. In the end I cobbled something together using two tools: yaml2json and jq.
Warning: Since JSON's capabilities are only a subset of YAML's, this is not a lossless process for complex YAML structures. It will work for a lot of simple key/value/sequence scenarios but will muck things up if your input YAML is too fancy. Test it on your data types to see if it does what you expect.
Use yaml2json to convert your inputs to JSON:
yaml2json input1.yml > input1.json
yaml2json input2.yml > input2.json
Use jq to iterate over the objects and merge them recursively (see this question and answers for details). List files in reverse order of importance as values in later ones will clobber earlier ones:
jq -s 'reduce .[] as $item({}; . + $item)' input1.json input2.json > merged.json
Take it back to YAML:
json2yaml merged.json > merged.yml
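The three steps can also be chained without temporary files, assuming a shell with process substitution such as bash (mirroring the Make recipe below):
json2yaml <(jq -s 'reduce .[] as $item({}; . + $item)' \
    <(yaml2json input1.yml) <(yaml2json input2.yml)) > merged.yml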
If you want to script this, of course the usual bash mechanisms are your friend. And if you happen to be in GNU Make like I was, something like this will do the trick (note that the recipe line must be indented with a tab, and make's SHELL must support process substitution, e.g. bash):
.SECONDEXPANSION:
merged.yml: input1.yml input2.yml
	json2yaml <(jq -s 'reduce .[] as $$item({}; . + $$item)' $(foreach YAML,$^,<(yaml2json $(YAML)))) > $@
There is a tool that merges YAML files - merge-yaml.
It supports full YAML syntax, and is capable of expanding environment variables references.
I forked it and released it in the form of an executable .jar.
https://github.com/OndraZizka/yaml-merge
Usage:
./bin/yaml-merge.sh ./*.yml > result.yml
It is written in Java, so you need Java (I think 8 or newer) installed.
(Btw, if someone wants to contribute, that would be great.)
In general, merging YAML is not a trivial thing, in the sense that the tool doesn't always know what you really want to do. You can merge structures in multiple ways. Think of this example:
foo:
  bar: bar2
  baz:
    - baz1
---
foo:
  bar: bar1
  baz:
    - baz2
goo: gaz1
A few questions / unknowns arise:
Should the 2nd foo tree replace the first one?
Should the 2nd bar replace the first one, or merge to an array?
Should the 2nd baz array replace the 1st, or be merged?
If merged, then how? Should there be duplicates, or should the tool keep the values unique? Should the order be managed in some way?
Etc. One may object that there can be some sensible default, but often real-world requirements need different operations.
Other tools and libraries that deal with data structures handle this by defining a schema with metadata; for instance, JAXB or Jackson use Java annotations.
For this general tool, that is not an option, so the user would have to control this through a) the input data, or b) parameters. a) is impractical and sometimes impossible, b) is tedious and needs a fancy syntax like jq has.
That said, Caleb's answer might be what you need. Although that solution reduces your data to what JSON is capable of, so you will lose comments, the various ways to represent long strings, usage of JSON within YAML, etc., which is not too user-friendly.

Find and replace in file with script

I want to find and replace the VALUE in an XML file:
<test name="NAME" value="VALUE"/>
I have to filter by name (because there are lots of lines like that).
Is it possible?
Thanks for your help.
Since you tagged the question "bash", I assume that you're not trying to use an XML library (although I think an XML expert might be able to give you something like an XSLT processor command that solves this question very robustly), but that you're simply interested in doing search & replace from the commandline.
I am using perl for this:
perl -pi -e 's#VALUE#replacement#g' *.xml
See the perlrun man page. Very shortly put, -p switches perl into text-processing mode, -i stands for "in-place", and -e lets you specify an expression to apply to all lines of input.
Also note (if you are not already familiar with this) that you may use characters other than # (common ones are %, a comma, etc.) as long as they don't clash with your search & replacement strings.
There is one small caveat: perl will read & write all files given on the commandline, even those that did not change. Thus, the files' modification times will be updated even if they did not change. (I usually work around that with some more shell magic, e.g. using grep -l or grin -l to select files for perl to work on.)
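For instance, a sketch of that workaround, running perl only on the files that actually contain the search string:
grep -l 'VALUE' *.xml | xargs perl -pi -e 's#VALUE#replacement#g'
(For file names containing whitespace, use grep -lZ together with xargs -0.)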
EDIT: If I understand your comments correctly, you also need help with the regular expression to apply. Let me briefly suggest something like this then:
perl -pi -e 's,(name="NAME" value=)"[^"]*",\1"NEWVALUE",g' *.xml
Related: bash XHTML parsing using xpath
You can use sed:
sed 's/\(<test name="NAME"\) value="VALUE"/\1 value="YourValue"/' test.xml
where test.xml is the XML document containing the given node. This is very fragile, and you can work to make it more flexible if you need to do this substitution multiple times. For instance, the current statement is case-sensitive, so it won't substitute the value on a node with name="name", but you can add GNU sed's case-insensitivity flag to the end of the statement, like so:
sed 's/\(<test name="NAME"\) value="VALUE"/\1 value="YourValue"/I' test.xml
Another option would be to use XSLT, but it would require you to download an external library. It's pretty versatile, and could be a viable option for more complex modifications to an XML document.
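If installing a small external tool is an option, xmlstarlet (a suggestion, not something from the answers above) can do the name-filtered replacement in an XML-aware way:
xmlstarlet ed -u '//test[@name="NAME"]/@value' -v 'NEWVALUE' test.xml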

Bash: Trying to append to a variable name in the output of a function

this is my very first post on Stackoverflow, and I should probably point out that I am EXTREMELY new to a lot of programming. I'm currently a postgraduate student doing projects involving a lot of coding in various programs, everything from LaTeX to bash, MATLAB etc etc.
If you could explicitly explain your answers that would be much appreciated as I'm trying to learn as I go. I apologise if there is an answer else where that does what I'm trying to do, but I have spent a couple of days looking now.
So to the problem I'm trying to solve: I'm currently using a selection of bioinformatics tools to analyse a range of genomes, and I'm trying to somewhat automate the process.
I have a few sequences with names that look like this for instance (all contained in folders of their own currently as paired files):
SOL2511_S5_L001_R1_001.fastq
SOL2511_S5_L001_R2_001.fastq
SOL2510_S4_L001_R1_001.fastq
SOL2510_S4_L001_R2_001.fastq
...and so on...
I basically wish to automate the process by turning these into variables and passing those variables to each of the programs I use in turn. So for example, my idea thus far was to assign them as wildcards, using the R1 and R2 (which appear in all the file names, as they represent each strand of DNA) as follows:
#!/bin/bash
seq1=*R1_001*
seq2=*R2_001*
On a rudimentary level this works, as it returns the correct files, so now I pass these variables to my first function which trims the DNA sequences down by a specified amount, like so:
# seqtk is the program suite, trimfq is a function within it,
# and the options -b -e specify how many bases to trim from the beginning and end of
# the DNA sequence respectively.
seqtk trimfq -b 10 -e 20 $seq1 >
seqtk trimfq -b 10 -e 20 $seq2 >
So now my problem is that I wish to append something like "_trim" to the output file name which appears after the >, but I can't find anything online that seems like it will work.
Alternatively, I've been hunting for a script that will take the name of the folder that the files are in, and create a variable for the folder name which I can then give to the functions in question so that all the output files are named correctly for use later on.
Many thanks in advance for any help, and I apologise that this isn't really much of a minimum working example to go on, as I'm only just getting going on all this stuff!
Joe
EDIT
So I modified @ghoti's for loop (it does the job wonderfully, I might add - rep for you :D) and now I prepend trim_, as the loop as it was before ended up giving me a .fastq.trim extension, which will cause errors later.
Is there any way I can append _trim to the end of the filename, but before the extension?
Explicit is usually better than implied, when matching filenames. Your wildcards may match more than you expect, especially if you have versions of the files with "_trim" appended to the end!
I would be more precise with the wildcards, and use for loops to process the files instead of relying on seqtk to handle multiple files. That way, you can do your own processing on the filenames.
Here's an example:
#!/bin/bash
# Define an array of sequences
sequences=(R1_001 R2_001)
# Step through the array...
for seq in ${sequences[@]}; do
    # Step through the files in this sequence...
    for file in SOL*_${seq}.fastq; do
        seqtk trimfq -b 10 -e 20 "$file" > "${file}.trim"
    done
done
I don't know how your folders are set up, so I haven't addressed that in this script. But the basic idea is that if you want the script to be able to manipulate individual filenames, you need something like a for loop to handle that manipulation on a per-filename basis.
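If the folder names matter, a minimal sketch (assuming each pair of files sits in its own folder under the current directory) that turns the folder name into a variable might look like this:
for dir in SOL*/; do
    folder=${dir%/}                        # folder name as a variable, e.g. SOL2511
    for file in "$dir"SOL*_R?_001.fastq; do
        base=$(basename "$file" .fastq)
        seqtk trimfq -b 10 -e 20 "$file" > "$dir${folder}_${base}.trim"
    done
done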
Does this help?
UPDATE:
To put _trim before the extension, replace the seqtk line with the following:
seqtk trimfq -b 10 -e 20 "$file" > "${file%.fastq}_trim.fastq"
This uses something documented in the Bash man page under Parameter Expansion if you want to read up on it. Basically, the ${file%.fastq} takes the $file variable and strips off a suffix. Then we add your extra text, along with the suffix.
You could also strip an extension using basename(1), but there's no need to call something external when you can use something built in to the shell.
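A quick illustration of that expansion, using a file name from the question:
file=SOL2511_S5_L001_R1_001.fastq
echo "${file%.fastq}_trim.fastq"    # prints SOL2511_S5_L001_R1_001_trim.fastq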
Instead of setting variables with the filenames, you could pipe the output of ls to the command you want to run with these filenames, like this:
ls *R{1,2}_001* | xargs -I# sh -c 'seqtk trimfq -b 10 -e 20 "$1" > "${1}_trim"' -- #
xargs -I# will grab the output of the previous command and store it in # to be used by seqtk

Create a new sequence of files from an existing sequence, along with numbering

I know this question has been asked, but I can't find more than one solution, and it does not work for me. Essentially, I'm looking for a bash script that will take a file list that looks like this:
image1.jpg
image2.jpg
image3.jpg
And then make a copy of each one, but number it sequentially backwards. So, the sequence would have three new files created, being:
image4.jpg
image5.jpg
image6.jpg
And yet, image4.jpg would be an untouched copy of image3.jpg, image5.jpg an untouched copy of image2.jpg, and so on. I have already tried the solution outlined in this stackoverflow question with no luck. I am admittedly not very far down the bash scripting path; if I take the chunk of code in the first listed answer and make a script, I always get "2: Syntax error: "(" unexpected" over and over. I've tried changing the syntax around the ( a bit, but never with success. So either I am doing something wrong or there's a better script around.
Sorry for not posting this earlier, but the code I'm using is:
image=( image*.jpg )
MAX=${#image[*]}
for i in ${image[*]}
do
    num=${i:5:3} # grab the digits
    compliment=$(printf '%03d' $(echo $MAX-$num | bc))
    ln $i copy_of_image$compliment.jpg
done
And I'm taking this code, pasting it into a file with nano, adding !#/bin/bash as the first line, then doing chmod +x script and executing it via sh script. Of course, in my test runs I'm using files appropriately titled image1.jpg - but I was also wondering about a way to apply this script to a directory of jpegs not necessarily titled image(integer).jpg. In my file-keeping structure, most of these are a single word, followed by a number, then .jpg, and it would be nice not to have to rewrite the script for each use.
Perhaps something like this. It will work well for an invocation like script image*.jpg, where the wildcard matches a set of files following a regular pattern with monotonically increasing numbers of the same length, and less ideally with an irregular subset of the files in the current directory. It simply assumes that the new indices run from the last file's number plus the count of file names, counting down to the last file's number plus one.
#!/bin/sh
# Extract number from final file name
eval lastidx=\$$#
tmp=${lastidx#*[!0-9][0-9]}
lastidx=${lastidx#${lastidx%[0-9]$tmp}}
tmp=${lastidx%[0-9][!0-9]*}
lastidx=${lastidx%${lastidx#$tmp[0-9]}}
num=$(expr $lastidx + $#)
width=${#lastidx}
for f; do
    pref=${f%%[0-9]*}
    suff=${f##*[0-9]}
    # Maybe show a warning if pref, suff, or width changed since the previous file
    printf "cp '$f' '$pref%0${width}i$suff'\\n" $num
    num=$(expr $num - 1)
done |
sh
This is sh-compatible; the expr stuff and the substring extraction up front is ugly but Bourne-compatible. If you are fine with the built-in arithmetic and string manipulation constructs of Bash, converting to that form should be trivial.
(To be explicit, ${var%foo} returns the value of $var with foo trimmed off the end, and ${var#foo} does similar trimming from the beginning of the value. Regular shell wildcard matching operators are available in the expression for what to trim. ${#var} returns the length of the value of $var.)
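A quick illustration of those operators:
f=image042.jpg
echo "${f%%[0-9]*}"    # image  (strip the longest suffix starting at a digit)
echo "${f##*[0-9]}"    # .jpg   (strip the longest prefix ending at a digit)
echo "${#f}"           # 12     (length of the value)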
Maybe your real test data runs from 001 to 300, but here you have image1, 2, 3, and therefore you must extract one digit, not three, from the filename: num=${i:5:1}
Integer arithmetic can be done in bash without calling bc.
${#image[@]} is more robust than ${#image[*]}, but it shouldn't make a difference here.
I didn't consult a dictionary, but isn't a compliment something for your girlfriend? The word you want is complement, isn't it? :)
the other command made links - to make copies, call cp.
Code:
#!/bin/bash
image=( image*.jpg )
MAX=${#image[@]}
for i in ${image[@]}
do
    num=${i:5:1}
    complement=$((2*$MAX-$num+1))
    cp $i image$complement.jpg
done
Most important: if it is bash, call it with bash. Best: use a shebang (as you did), make the script executable, and call it as ./name. Calling it with sh name forces the wrong interpreter. If you don't make it executable, call it with bash name.
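To avoid rewriting the script for each name, a hedged generalisation (assuming names of the form <word><number>.jpg, a single shared <word> per directory, and numbers without leading zeros) might look like:
#!/bin/bash
# Sketch only: assumes <word><number>.jpg names, one shared <word>,
# and numbers without leading zeros (leading zeros would be read as octal).
image=( *[0-9].jpg )
MAX=${#image[@]}
for i in "${image[@]}"; do
    pref=${i%%[0-9]*}                  # e.g. "holiday" from holiday7.jpg
    num=${i#"$pref"}
    num=${num%.jpg}
    cp "$i" "${pref}$((2 * MAX - num + 1)).jpg"
done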
