Replace/sync only certain lines using Bash, SSH and rsync - bash

I am looking for a quick and dirty one-liner to sync only certain settings in remote config files. Need to preserve what's unique and sync generic settings. Example:
Config1.conf:
HOSTNAME=COMP1
IP=10.10.13.10
LOCATION=SITE_A
BUILDING=DEPT_IT
ROOM=COMP_LAB1
Remote-Config2.txt:
HOSTNAME=COMP2
IP=10.10.13.11
LOCATION=FOO
BUILDING=BAR
ROOM=BAZ
I need to sync (copy and replace) only the bottom 3 lines over ssh. The line numbers are predictable, by the way: always lines 4, 5 and 6 in this case.
Here's a working idea that is missing one piece (a standard replacement for the non-standard utility I used to replace the vars in the local conf):
for var in $(ssh root@10.10.8.12 'sed -n "4,6p" /etc/conf1.conf');do <missing piece> ${var/=*}=${var/*=} local-conf.conf; done
So this uses variable expansion and a non-standard utility but needs like a sed or Perl routine to replace the info in the local conf.
Update
The last line of code actually works. Tested and works! However -- the missing piece is a custom non-standard utility. I'm asking if someone can think of something, using standard Linux tools, to replace that.
One solution would be to take the left side and match, then replace the right side. This is basically what that utility does. Looks for the variable in the conf then sets it. Using variable expansion is one way (shown).

Here's an alternative solution that does not require the command to have special knowledge of the file contents:
Take a copy of the files you want to sync. Then, in the copy, deliberately vandalise (arbitrarily modify) the lines you do not want synced. It doesn't matter what they say as long as there are the same number of lines and they'll never match the actual file contents. Have some fun. This becomes your base version. Your example might look like this:
HOSTNAME=foo
IP=bar
LOCATION=SITE_A
BUILDING=DEPT_IT
ROOM=COMP_LAB1
rsync the remote files into a temporary location. This is the remote version.
For each file, take a three-way diff.
diff3 -3 <localfile> <basefile> <remotefile>
The output of diff3 is an "ed script" that describes what edits to make to the local file so that it would look like the remote file.
The -3 option tells it to only output the non-conflicting differences. This is why we vandalised the base files in the first place: so those lines would have conflicts.
Once you have the ed script for a file, you can visually check it, if you choose, and then apply the update using patch:
cat <ed-script> | patch --ed <localfile>
So, to do this recursively, you might have:
cd $localdir
for file in `find . -type f`; do
diff3 -3 "$file" "$basedir/$file" "$remotedir/$file" | patch --ed "$file"
done
You probably need to add some checks that the base and remote files actually exist.
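The key-matching idea from the question (take the left side, match it, replace the right side) can also be sketched with plain sed. This is a minimal sketch, not the asker's utility: the file names source.conf and target.conf are hypothetical, the "remote" file is simulated locally, and it assumes values contain no `|`, `&` or backslash characters.

```shell
#!/bin/bash
# Sketch: copy selected KEY=VALUE lines from one config into another.
workdir=$(mktemp -d)
source_conf="$workdir/source.conf"   # stands in for the remote file (assumption)
target_conf="$workdir/target.conf"   # stands in for the local file (assumption)

printf '%s\n' 'HOSTNAME=COMP1' 'IP=10.10.13.10' 'LOCATION=SITE_A' \
    'BUILDING=DEPT_IT' 'ROOM=COMP_LAB1' > "$source_conf"
printf '%s\n' 'HOSTNAME=COMP2' 'IP=10.10.13.11' 'LOCATION=FOO' \
    'BUILDING=BAR' 'ROOM=BAZ' > "$target_conf"

# In this sample data the generic settings are lines 3-5.
# Over ssh the first command would be something like:
#   ssh user@host 'sed -n "3,5p" /path/to/source.conf'
sed -n '3,5p' "$source_conf" |
while IFS='=' read -r key value; do
    # Rewrite the line that starts with KEY= and leave every other line alone.
    sed "s|^${key}=.*|${key}=${value}|" "$target_conf" > "$target_conf.new" &&
        mv "$target_conf.new" "$target_conf"
done

cat "$target_conf"
```

Writing to a temporary file and moving it into place avoids relying on `sed -i`, whose syntax differs between GNU and BSD sed.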

Related

BASH Shell Find Multiple Files with Wildcard and Perform Loop with Action

I have a script that I call with an application, I can't run it from command line. I derive the directory where the script is called and in the next variable go up 1 level where my files are stored. From there I have 3 variables with the full path and file names (with wildcard), which I will refer to as "masks".
I need to find and "do something with" (copy/write their names to a new file, whatever else) to each of these masks. The do something part isn't my obstacle as I've done this fine when I'm working with a single mask, but I would like to do it cleanly in a single loop instead of duplicating loop and just referencing each mask separately if possible.
Assume in my $FILESFOLDER directory below that I have 2 existing files, aaa0.csv & bbb0.csv, but no file matching the ccc*.csv mask.
#!/bin/bash
SCRIPTFOLDER=${0%/*}
FILESFOLDER="$(dirname "$SCRIPTFOLDER")"
ARCHIVEFOLDER="$FILESFOLDER"/archive
LOGFILE="$SCRIPTFOLDER"/log.txt
FILES1="$FILESFOLDER"/"aaa*.csv"
FILES2="$FILESFOLDER"/"bbb*.csv"
FILES3="$FILESFOLDER"/"ccc*.csv"
ALLFILES="$FILES1
$FILES2
$FILES3"
#here as an example I would like to do a loop through $ALLFILES and copy anything that matches to $ARCHIVEFOLDER.
for f in $ALLFILES; do
cp -v "$f" "$ARCHIVEFOLDER" > "$LOGFILE"
done
echo "$ALLFILES" >> "$LOGFILE"
The thing that really spins my head is when I run something like this (I haven't done it with the copy command in place) that log file at the end shows:
filesfolder/aaa0.csv filesfolder/bbb0.csv filesfolder/ccc*.csv
Where I would expect echoing $ALLFILES just to show me the masks
filesfolder/aaa*.csv filesfolder/bbb*.csv filesfolder/ccc*.csv
In my "do something" area, I need to be able to use whatever method to find the files by their full path/name with the wildcard if at all possible. Sometimes my network is down for maintenance and I don't want to risk failing a change directory. I rarely work in linux (primarily SQL background) so feel free to poke holes in everything I've done wrong. Thanks in advance!
Here's a light refactoring with significantly fewer distracting variables.
#!/bin/bash
script=${0%/*}
folder="$(dirname "$script")"
archive="$folder"/archive
log="$folder"/log.txt # you would certainly want this in the folder, not $script/log.txt
shopt -s nullglob
all=()
for prefix in aaa bbb ccc; do
cp -v "$folder/$prefix"*.csv "$archive" >>"$log" # append, don't overwrite
all+=("$folder/$prefix"*.csv)
done
echo "${all[@]}" >> "$log"
The change in the loop to append the output of cp -v instead of overwriting it is a bug fix; otherwise the log would only contain the output from the last loop iteration.
I would probably prefer to have the files echoed from inside the loop as well, one per line, instead of collect them all on one humongous line. Then you can remove the array all and instead simply
printf '%s\n' "$folder/$prefix"*.csv >>"$log"
shopt -s nullglob is a Bash extension (so won't work with sh) which says to discard any wildcard which doesn't match any files (the default behavior is to leave globs unexpanded if they don't match anything). If you want a different solution, perhaps see Test whether a glob has any matches in Bash
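To see the difference nullglob makes, here is a small sketch that counts how many words an unmatched pattern expands to with and without the option. The temporary directory and the array names are made up for the illustration.

```shell
#!/bin/bash
# Sketch: compare the default glob behavior with nullglob.
tmp=$(mktemp -d)
touch "$tmp/aaa0.csv" "$tmp/bbb0.csv"      # no ccc*.csv file exists

shopt -u nullglob
default=("$tmp"/ccc*.csv)    # no match: the pattern itself is kept as one word

shopt -s nullglob
nulled=("$tmp"/ccc*.csv)     # no match: the pattern expands to nothing
matched=("$tmp"/aaa*.csv "$tmp"/bbb*.csv)

echo "default:  ${#default[@]} word(s)"
echo "nullglob: ${#nulled[@]} word(s)"
echo "matched:  ${#matched[@]} file(s)"
```

This is also why the question's log showed filesfolder/ccc*.csv: without nullglob, the unmatched mask is passed through literally.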
You should use lower case for your private variables so I changed that, too. Notice also how the script variable doesn't actually contain a folder name (or "directory" as we adults prefer to call it); fixing that uncovered a bug in your attempt.
If your wildcards are more complex, you might want to create an array for each pattern.
tmpspaces=(/tmp/*\ *)
homequest=($HOME/*\?*)
for file in "${tmpspaces[@]}" "${homequest[@]}"; do
: stuff with "$file", with proper quoting
done
The only robust way to handle file names which could contain shell metacharacters is to use an array variable; using string variables for file names is notoriously brittle.
Perhaps see also https://mywiki.wooledge.org/BashFAQ/020
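As a rough illustration of why arrays are the safer container, the sketch below stores the same two file names once in an array and once in a plain string, then counts loop iterations. The file names are invented for the example.

```shell
#!/bin/bash
# Sketch: array vs. string variable for file names containing spaces.
tmp=$(mktemp -d)
touch "$tmp/plain.csv" "$tmp/with space.csv"

files=("$tmp"/*.csv)                                  # one array element per file
as_string="$tmp/plain.csv $tmp/with space.csv"        # file boundaries are lost

n_array=0
for f in "${files[@]}"; do n_array=$((n_array + 1)); done

n_string=0
for f in $as_string; do n_string=$((n_string + 1)); done   # unquoted on purpose

echo "array iterations:  $n_array"
echo "string iterations: $n_string"
```

The array loop sees two files; the string loop sees three words, because "with space.csv" is split in two.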

Bash: Trying to append to a variable name in the output of a function

this is my very first post on Stackoverflow, and I should probably point out that I am EXTREMELY new to a lot of programming. I'm currently a postgraduate student doing projects involving a lot of coding in various programs, everything from LaTeX to bash, MATLAB etc etc.
If you could explicitly explain your answers that would be much appreciated as I'm trying to learn as I go. I apologise if there is an answer else where that does what I'm trying to do, but I have spent a couple of days looking now.
So to the problem I'm trying to solve: I'm currently using a selection of bioinformatics tools to analyse a range of genomes, and I'm trying to somewhat automate the process.
I have a few sequences with names that look like this for instance (all contained in folders of their own currently as paired files):
SOL2511_S5_L001_R1_001.fastq
SOL2511_S5_L001_R2_001.fastq
SOL2510_S4_L001_R1_001.fastq
SOL2510_S4_L001_R2_001.fastq
...and so on...
I basically wish to automate the process by turning these in to variables and passing these variables to each of the programs I use in turn. So for example my idea thus far was to assign them as wildcards, using the R1 and R2 (which appears in all the file names, as they represent each strand of DNA) as follows:
#!/bin/bash
seq1=*R1_001*
seq2=*R2_001*
On a rudimentary level this works, as it returns the correct files, so now I pass these variables to my first function which trims the DNA sequences down by a specified amount, like so:
# seqtk is the program suite, trimfq is a function within it,
# and the options -b -e specify how many bases to trim from the beginning and end of
# the DNA sequence respectively.
seqtk trimfq -b 10 -e 20 $seq1 >
seqtk trimfq -b 10 -e 20 $seq2 >
So now my problem is I wish to be able to append something like "_trim" to the output file which appears after the >, but I can't find anything that seems like it will work online.
Alternatively, I've been hunting for a script that will take the name of the folder that the files are in, and create a variable for the folder name which I can then give to the functions in question so that all the output files are named correctly for use later on.
Many thanks in advance for any help, and I apologise that this isn't really much of a minimum working example to go on, as I'm only just getting going on all this stuff!
Joe
EDIT
So I modified @ghoti's for loop (does the job wonderfully I might add, rep for you :D ) and now I append trim_, as the loop as it was before ended up giving me a .fastq.trim which will cause errors later.
Is there any way I can append _trim to the end of the filename, but before the extension?
Explicit is usually better than implied, when matching filenames. Your wildcards may match more than you expect, especially if you have versions of the files with "_trim" appended to the end!
I would be more precise with the wildcards, and use for loops to process the files instead of relying on seqtk to handle multiple files. That way, you can do your own processing on the filenames.
Here's an example:
#!/bin/bash
# Define an array of sequences
sequences=(R1_001 R2_001)
# Step through the array...
for seq in "${sequences[@]}"; do
# Step through the files in this sequence...
for file in SOL*_${seq}.fastq; do
seqtk trimfq -b 10 -e 20 "$file" > "${file}.trim"
done
done
I don't know how your folders are set up, so I haven't addressed that in this script. But the basic idea is that if you want the script to be able to manipulate individual filenames, you need something like a for loop to handle that manipulation on a per-filename basis.
Does this help?
UPDATE:
To put _trim before the extension, replace the seqtk line with the following:
seqtk trimfq -b 10 -e 20 "$file" > "${file%.fastq}_trim.fastq"
This uses something documented in the Bash man page under Parameter Expansion if you want to read up on it. Basically, the ${file%.fastq} takes the $file variable and strips off a suffix. Then we add your extra text, along with the suffix.
You could also strip an extension using basename(1), but there's no need to call something external when you can use something built in to the shell.
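Here is a small sketch of the suffix-stripping expansion on its own, without seqtk, so the effect on the file name is easy to check. The variable names stem, trimmed and via_basename are only for illustration.

```shell
#!/bin/bash
# Sketch: build an output name with _trim inserted before the extension.
file="SOL2511_S5_L001_R1_001.fastq"

stem=${file%.fastq}               # strip the shortest suffix matching .fastq
trimmed="${stem}_trim.fastq"      # add the extra text, then the extension again

# The same result using the external basename(1) command.
via_basename="$(basename "$file" .fastq)_trim.fastq"

echo "$trimmed"
echo "$via_basename"
```

Both lines print SOL2511_S5_L001_R1_001_trim.fastq; the parameter expansion simply avoids starting an extra process for every file.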
Instead of setting variables with the filenames, you could pipe the output of ls to the command you want to run with these filenames, like this:
ls *R{1,2}_001* | xargs -I@ sh -c 'seqtk trimfq -b 10 -e 20 "$1" > "${1}_trim"' -- @
xargs -I@ will take each line of output from the previous command and substitute it for @, which is then passed to sh as $1 for seqtk to use.

Opposite of Linux Split

I have a huge file that I split into several small chunks to divide and conquer. Now I have a folder containing a list of files like the ones below:
output_aa #(the output file done: cat input_aa | python parse.py > output_aa)
output_ab
output_ac
output_ad
...
I am wondering whether there is a way to merge those files back together FOLLOWING THE INDEX ORDER.
I know I could do it by using
cat * > output.all
but I am curious whether a magical command for this already exists that comes with split.
The magic command would be:
cat output_* > output.all
There is no need to sort the file names as the shell already does it (*).
As its name suggests, cat's original design was precisely to conCATenate files, which is basically the opposite of split.
(*) Edit:
Should you use a (hypothetical?) locale whose collating order for a-z is not abcdefghijklmnopqrstuvwxyz, here is one way to overcome the issue:
LC_ALL=C sh -c "cat output_* > output.all"
There are other ways to concat files together, but there is no magical "opposite of split" in "linux".
Of course, talking about "linux" in general is a bit far fetched, as many distributions have different tools (most of them use a different shell already by default, like sh, bash, csh, zsh, ksh, ...), but if you're talking about debian based linux at least, I don't know of any distribution which would provide such a tool.
For sorting you can use the linux command "sort" ;
Also be aware that using ">" to redirect stdout will overwrite any existing contents, while ">>" will append to an existing file.
I don't want to copycat, but still make this answer complete, so what jlliagre said about the cat command should also be considered of course (that "cat" was made to con-"cat" files, effectively making it possible to reverse the split command - but that's only provided you use the same ordering of files, so it's not exactly the "opposite of split", but will work that way in close to 100% of the cases (see comments under jlliagre answer for specifics))
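A quick way to convince yourself that cat reverses split is a round trip on throwaway data. This sketch assumes the usual split and seq utilities are available; the file name input.txt and the chunk size are arbitrary.

```shell
#!/bin/bash
# Sketch: split a file into chunks, concatenate them again, compare with the original.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

seq 1 1000 > input.txt
split -l 100 input.txt output_       # creates output_aa, output_ab, ...

# Force the C locale so the glob sorts the suffixes in plain byte order.
LC_ALL=C sh -c 'cat output_* > output.all'

if cmp -s input.txt output.all; then
    echo "round trip OK"
else
    echo "files differ"
fi
```

Note that output.all does not match the output_* pattern, so the result file is not fed back into cat.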

Bash: find references to filenames in other files

Problem:
I have a list of filenames, filenames.txt:
Eg.
/usr/share/important-library.c
/usr/share/youneedthis-header.h
/lib/delete/this-at-your-peril.c
I need to rename or delete these files and I need to find references to these files in a project directory tree: /home/noob/my-project/ so I can remove or correct them.
My thought is to use bash to extract the filename: basename filename, then grep for it in the project directory using a for loop.
FILELISTING=listing.txt
PROJECTDIR=/home/noob/my-project/
for f in $(cat "$FILELISTING"); do
extension=$(basename ${f##*.})
filename=$(basename ${f%.*})
pattern="$filename"\\."$extension"
grep -r "$pattern" "$PROJECTDIR"
done
I could royally screw up this project -- does anyone see a flaw in my logic; better: do you see a more reliable scalable way to do this over a huge directory tree? Let's assume that revision control is off the table ( it is, in fact ).
A few comments:
Instead of
for f in $(cat "$FILELISTING") ; do
...
done
it's somewhat safer to write
while IFS= read -r f ; do
...
done < "$FILELISTING"
That way, your code will have no problem with spaces, tabs, asterisks, and so on in the filenames (though it still won't support newlines).
Your goal in separating f into extension and filename, and then reassembling them with \., seems to be that you want the filename to be treated as a literal string; right? Like, you're worried that grep will treat the . as meaning "any character" rather than as "one dot". A more general solution is to use grep's -F option, which tells it to treat the pattern as a fixed string rather than a regex:
grep -r -F "$f" "$PROJECTDIR"
Your introduction mentions using basename, but then you don't actually use it. Is that intentional?
If your non-use of basename is intentional, then filenames.txt really just contains a list of patterns to search for; you don't even need to write a loop, in this case, since grep's -f option tells it to take a newline-separated list of patterns from a file:
grep -r -F -f "$FILELISTING" "$PROJECTDIR"
You should back up your project, using something like tar -czf backup.tar.gz "$PROJECTDIR". "Revision control is off the table" doesn't mean you can't have a rollback strategy!
Edited to add:
To pass all your base-names to grep at once, in the hopes that it can do something smarter with them than just looping over them just as though the calls were separate, you can write something like:
grep -r -F "$(sed 's#.*/##g' "$FILELISTING")" "$PROJECTDIR"
(I used sed rather than while+basename for brevity's sake, but you can put an entire loop inside the "$(...)" if you prefer.)
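To make the fixed-string point concrete, here is a sketch that runs the base-name search against a tiny made-up project tree. All paths and file contents are invented; the Makefile line is a near miss that a regex search would match but -F does not.

```shell
#!/bin/bash
# Sketch: search a project tree for base names taken from a list of paths.
tmp=$(mktemp -d)
mkdir -p "$tmp/project/src"

printf '%s\n' '/usr/share/important-library.c' \
    '/usr/share/youneedthis-header.h' > "$tmp/filenames.txt"
printf '%s\n' '#include "youneedthis-header.h"' \
    'int main(void) { return 0; }' > "$tmp/project/src/main.c"
# Near miss: "X" where the dot should be. As a regex, "." would match it.
printf '%s\n' 'SRCS = main.c important-libraryXc' > "$tmp/project/Makefile"

# Reduce each path to its base name, one pattern per line.
sed 's#.*/##' "$tmp/filenames.txt" > "$tmp/basenames.txt"

# -F: patterns are fixed strings; -f: read them from a file; -n: show line numbers.
hits=$(grep -r -n -F -f "$tmp/basenames.txt" "$tmp/project")
echo "$hits"
```

Only the #include line in src/main.c is reported. Keeping the pattern file outside the project directory stops grep from matching the list against itself.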
This is a job for an IDE.
You're right that this is a perilous task, and unless you know the build process and the search directories and the order of the directories, you really can't say what header is with which file.
Let's take something as simple as this:
# include "sql.h"
You have a file in the project headers/sql.h. Is that file needed? Maybe it is. Maybe not. There's also a /usr/include/sql.h. Maybe that's the one that's actually used. You can't tell without looking at the Makefile and seeing the order of the include directories which is which.
Then, there are the libraries that get included and may need their own header files in order to be able to compile. And, once you get to the C preprocessor, you really will have a hard time.
This is a task for an IDE (Integrated Development Environment). An IDE builds the project and tracks file and other resource dependencies. In the Java world, most people use Eclipse, and there is a C/C++ plugin for those developers. However, there are over 2 dozen listed in Wikipedia and almost all of them are open source. The best one will depend upon your environment.

How to compare files with same names in two different directories using a shell script

Before moving on to use SVN, I used to manage my project by simply keeping a /develop/ directory and editing and testing files there, then moving them to the /main/ directory. When I decided to move to SVN, I needed to be sure that the directories were indeed in sync.
So, what is a good way to write a shell script [ bash ] to recursively compare files with the same name in two different directories?
Note: The directory names used above are for sample only. I do not recommend storing your code in the top level :).
The diff command has a -r option to recursively compare directories:
diff -r /develop /main
diff -rqu /develop /main
It will only give you a summary of changes that way :)
If you want to see only new/missing files
diff -rqu /develop /main | grep "^Only"
If you want to get them bare:
diff -rqu /develop /main | sed -rn "/^Only/s/^Only in (.+?): /\1/p"
The diff I have available allows recursive differences:
diff -r main develop
But with a shell script:
( cd main ; find . -type f -exec diff {} ../develop/{} ';' )
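For a feel of what the recursive comparison reports, here is a sketch on two small throwaway trees. The directory layout and file contents are invented, and LC_ALL=C is set only so the messages come out in English.

```shell
#!/bin/bash
# Sketch: summarize differences between two directory trees with diff -rq.
tmp=$(mktemp -d)
mkdir -p "$tmp/develop/lib" "$tmp/main/lib"

echo 'same'  > "$tmp/develop/a.txt";     echo 'same' > "$tmp/main/a.txt"
echo 'new'   > "$tmp/develop/lib/b.txt"; echo 'old'  > "$tmp/main/lib/b.txt"
echo 'extra' > "$tmp/develop/c.txt"      # exists only in develop

# -r recurses into subdirectories, -q reports only whether files differ.
report=$(LC_ALL=C diff -rq "$tmp/develop" "$tmp/main")
status=$?                                 # 0 = identical, 1 = differences found

echo "$report"
echo "exit status: $status"
```

The report lists c.txt as "Only in" develop and lib/b.txt as differing; a.txt is not mentioned because it is identical in both trees.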
[I read somewhere that answering your own questions is OK, so here goes :) ]
I tried this, and it worked pretty well
[/]$ cd /develop/
[/develop/]$ find | while read line; do diff -ruN "/main/$line" "$line"; done |less
You can choose to compare only specific files [e.g., only the .php ones] by editing the above line as
[/]$ cd /develop/
[/develop/]$ find -name "*.php" | while read line; do diff -ruN "/main/$line" "$line"; done |less
Any other ideas?
here is an example of a (somewhat messy) script of mine, dircompare.sh, which will:
sort files and directories in arrays depending on which directory they occur in (or both), in two recursive passes
The files that occur in both directories, are sorted again in two arrays, depending on if diff -q determines if they differ or not
for those files that diff claims are equal, show and compare timestamps
Hope it can be found useful - Cheers!
EDIT2: (Actually, it works fine with remote files - the problem was unhandled Ctrl-C signal during a diff operation between local and remote file, which can take a while; script now updated with a trap to handle that - however, leaving the previous edit below for reference):
EDIT: ... except it seems to crash my server for a remote ssh directory (which I tried using over ~/.gvfs)... So this is not bash anymore, but an alternative I guess is to use rsync, here's an example:
$ # get example revision 4527 as testdir1
$ svn co https://openbabel.svn.sf.net/svnroot/openbabel/openbabel/trunk/data@4527 testdir1
$ # get earlier example revision 2729 as testdir2
$ svn co https://openbabel.svn.sf.net/svnroot/openbabel/openbabel/trunk/data@2729 testdir2
$ # use rsync to generate a list
$ rsync -ivr --times --cvs-exclude --dry-run testdir1/ testdir2/
sending incremental file list
.d..t...... ./
>f.st...... CMakeLists.txt
>f.st...... MACCS.txt
>f..t...... SMARTS_InteLigand.txt
...
>f.st...... atomtyp.txt
>f+++++++++ babel_povray3.inc
>f.st...... bin2hex.pl
>f.st...... bondtyp.h
>f..t...... bondtyp.txt
...
Note that:
To get the above, you mustn't forget trailing slashes / at the end of directory names in rsync
--dry-run - simulate only, don't update/transfer files
-r - recurse into directories
-v - verbose (but not related to file changes info)
--cvs-exclude - ignore .svn files
-i - "--itemize-changes: output a change-summary for all updates"
Here is a brief excerpt of man rsync that explains the information shown by -i (for instance, the >f.st...... strings above):
The "%i" escape has a cryptic output that is 11 letters long.
The general format is like the string YXcstpoguax, where Y is
replaced by the type of update being done, X is replaced by the
file-type, and the other letters represent attributes that may
be output if they are being modified.
The update types that replace the Y are as follows:
o A < means that a file is being transferred to the remote
host (sent).
o A > means that a file is being transferred to the local
host (received).
o A c means that a local change/creation is occurring for
the item (such as the creation of a directory or the
changing of a symlink, etc.).
...
The file-types that replace the X are: f for a file, a d for a
directory, an L for a symlink, a D for a device, and a S for a
special file (e.g. named sockets and fifos).
The other letters in the string above are the actual letters
that will be output if the associated attribute for the item is
being updated or a "." for no change. Three exceptions to this
are: (1) a newly created item replaces each letter with a "+",
(2) an identical item replaces the dots with spaces, and (3) an
....
A bit cryptic, indeed - but at least it shows basic directory comparison over ssh. Cheers!
The classic (System V Unix) answer would be dircmp dir1 dir2, which was a shell script that would list files found in either dir1 but not dir2 or in dir2 but not dir1 at the start (first page of output, from the pr command, so paginated with headings), followed by a comparison of each common file with an analysis (same, different, directory were the most common results).
This seems to be in the process of vanishing - I have an independent reimplementation of it available if you need it. It's not rocket science (cmp is your friend).
