Find & Replace Multiple Sequence Headers in Multiple FASTA Files - bash

Here's my problem (using a Mac OS X):
I have about 35 FASTA files with 30 sequences in each one. Each FASTA file represents a gene, and they all contain the same individuals with the same sequence headers in each file. The headers are formatted as "####_G_species," with the numbers being non-sequential. I need to go through every file and change 4 specific headers, while also keeping the output as 35 discrete files with the same names as their corresponding input files, preferably depositing the outputs into a separate subdirectory.
For example: Every file contains a "6934_Sergia_sp," and I need to change
every instance of that name in all of the 35 files to "6934_R_robusta." I need to do the same with "8324_Sergestes_sp," changing every instance in every file to "8324_P_vigilax." Rinse and repeat 2 more times with different headers. After changing the headers, I need to have 35 discrete output files with the same names as their corresponding input files.
What I've found so far that seems to show the most promise is from the following link:
https://askubuntu.com/questions/84007/find-and-replace-text-within-multiple-files
using the following script:
find /home/user/directory -name \*.c -exec sed -i "s/cybernetnews/cybernet/g" {} \;
Changing the information to fit my needs, I get a script like this:
find Path/to/my/directory -name \*.fas -exec sed -i 's/6934_Sergia_sp/6934_R_robusta/g' {} \;
Running the script like that, I get an "undefined label" error. After researching,
https://www.mkyong.com/mac/sed-command-hits-undefined-label-error-on-mac-os-x/
I found that I should add '.fas' after -i giving:
find Path/to/my/directory -name \*.fas -exec sed -i '.fas' 's/6934_Sergia_sp/6934_R_robusta/g' {} \;
because on Macs you need to give -i a suffix for the duplicate files. Running the script like that, I get very nearly what I'm looking for: each input file is duplicated, the correct header in each is correctly substituted with the new name, and the outputs are placed in the same directory. However, this only substitutes one header at a time, and the output files have a .fas.fas extension.
Moving forward, I would have to rename the output files to remove the second ".fas" in the extension, and rewrite and rerun the script 3 more times to get everything changed how I want it, which wouldn't be the end of the world, but definitely wouldn't be ideal.
Is it possible to set up a script so that I can run all 4 substitutions at the same time, while also exporting the outputs to a new subdirectory?

Your approach is good, but I would prefer a more verbose approach where I don't have to fight so much with the quotes. Something like:
for fasta in $(find Path/to/my/directory -name "*.fas")
do
    new_fasta="$(basename "$fasta" .fas).new.fas"
    sed 's/6934_Sergia_sp/6934_R_robusta/g; s/Another_substitution/Another_result/g' "$fasta" > "$new_fasta"
done
Here, you feed in the list of FASTA files to loop over, compute a new FASTA name (and location, if needed), and finally run sed over the input, leaving the output in a new file. Observe that you can give more than one substitution to sed, separated by semicolons.
BTW, as @Ed Morton said, for your next question please include a concise description of the problem along with sample input and expected output.

Related

Randomly shuffling lines in multiple text files but keeping them as separate files using a command or bash script

I have several text files in a directory, all of them unrelated. Words don't repeat within a file. Each line has 1 to 3 words in it, such as:
apple
potato soup
vitamin D
banana
guinea pig
life is good
I know how to randomize each file:
sort -R file.txt > file-modified.txt
That's great, but I want to do this for 500+ files in a directory, and doing them one at a time would take ages. There must be something better.
I would like to do something like:
sort -R *.txt -o KEEP-SAME-NAME-AS-ORIGINAL-FILE-ADD-SUFFIX-TO-ALL.txt
Maybe this is possible with a script that goes through each file in the directory until finished.
Very importantly, every file should only randomize the lines within itself and not mix with the other files.
Thank you.
Something like this one-liner:
for file in !(*-modified).txt; do shuf "$file" > "${file%.txt}-modified.txt"; done
Just loop over the files and shuffle each one in turn.
The !(*-modified).txt pattern uses bash's extended pattern matching to not match .txt files that already have -modified at the end of the name so you don't shuffle a pre-existing already shuffled output file and end up with file-modified-modified.txt. Might require a shopt -s extglob first, though that's usually turned on already in an interactive shell session.

Understanding the sed command

I'm trying to change every first line in all files contained in a parent directory so that they inherit the pathname of the directory that they're in.
For example I have a file with the format:
2000-01-18
Tuesday
Livingston
42178
This particular file is in a directory named 18, inside another directory named 01, which is in another directory named 2000, which is in a directory called filesToSort.
I managed to use this code as a console command to change the first line of the file:
perl -pi -w -e 's/2000-01-18/Test/g;' ff_1177818640
This changed the file to
Test
Tuesday
Livingston
42178
Is it possible for me to change the "date" in this command to select all dates? I tried to use it like this:
perl -pi -w -e 's/*/Test/g;' ff_1177818640
But it didn't like that at all.
My current thought process is that if I can make this command select all dates in the initial input, then somehow find a way to insert the pathname into the second part where I currently have "Test", using something like this:
path=/filesToSort/2000/01/18/ff_1177818640
file=$(basename "$path")
I should in theory be able to run this entire code through my parent directory and all subdirectories, thereby changing every date value in the files, which appears on line 1 of every single file, to mirror the file path it sits in, effectively turning a file that looks like this:
2000-xx-18
Tuesday
Livingston
42178
Contained in directory /filesToSort/2000/01/18
into this:
2000/01/18
Tuesday
Livingston
42178
I'm not sure if I'm just using the sed command wrong here, or whether there is another command I should be using instead, but I've been trying to get this to work for 4 hours now and I can't seem to nail it.
Thanks in advance for the help!
It looks to me like what you want to do is basically translate "-" to "/". You could find the file, take a backup of it (always a good idea), and then use:
sed 's|-|/|g' <path/backup_file >path/modified_file
That said, if it is the same assignment, you could use that command to copy the file to its new directory and modify it at the same time.
You haven't posted a sed command, so it's hard to know what will work. Let's take this in small steps. Try this:
sed -i '1s/^/X/' ff_1177818640
and see if that modifies the file (adding 'X' to the beginning of the first line). If your version of sed doesn't like that, try this:
sed -i "" '1s/^/X/' ff_1177818640
Once you have the syntax working, we must tackle the problem of converting a path into a date. Try this:
echo some/path/filesToSort/2000/01/18/ff_1177818640 | sed 's|.*/filesToSort/||; s|/[^/]*$||'
If that produces "2000/01/18", post a comment, and we can put it all together.
EDIT: putting it all together. Abracadabra!
find . -type f -exec sed -i "1{s|.*|{}|;s|.*/filesToSort/||;s|/[^/]*$||;}" {} \;

Running a process on every combination between files in two folders

I have two folders where the 1st has 19 .fa files and the 2nd has 37096 .fa files
Files in the 1st folder are named BF_genomea[a-s].fa, and files in the 2nd are named [1-37096]ZF_genome.fa
I have to run this process where lastz filein1stfolder filein2ndfolder [arguments] > outputfile.axt, so that I run every file in the 1st folder against every file in the 2nd folder.
Any output naming scheme would serve, as long as it identifies which particular combination of parent files each output came from, and the files have the extension .axt.
This is what I have done so far
for file in /tibet/madzays/finch_data/BF_genome_split/*.fa; do for otherfile in /tibet/madzays/finch_data/ZF_genome_split/*.fa; name="${file##*/}"; othername="${otherfile##*/}"; lastz $file $otherfile --step=19 --hspthresh=2200 --gappedthresh=10000 --ydrop=3400 --inner=2000 --seed=12of19 --format=axt --scores=/tibet/madzays/finch_data/BFvsZFLASTZ/HoxD55.q > /home/madzays/qsub/test/"$name""$othername".axt; done; done
As I said in a comment, the inner loop is missing a do keyword (for otherfile in pattern; do <-- right there). Is this in the form of a script file? If so, you should add a shebang as the first line to tell the OS how to run the script. And break it into multiple lines and indent the contents of the loops, to make it easier to read (and easier to spot problems like the missing do).
Off the top of my head, I see one other thing I'd change: the output filenames are going to be pretty ugly, just the two input files mashed together with a ".axt" on the end (along the lines of "BF_genomeac.fa14ZF_genome.fa.axt"). I'd parse the IDs out of the input filenames and then use them to build a more reasonable output filename convention. Something like this:
#!/bin/bash
for file in /tibet/madzays/finch_data/BF_genome_split/*.fa; do
    for otherfile in /tibet/madzays/finch_data/ZF_genome_split/*.fa; do
        name="${file##*/}"
        tmp="${name#BF_genomea}"             # remove filename prefix
        id="${tmp%.*}"                       # remove extension to get the ID
        othername="${otherfile##*/}"
        otherid="${othername%ZF_genome.fa}"  # just have to remove a suffix here
        lastz "$file" "$otherfile" --step=19 --hspthresh=2200 --gappedthresh=10000 --ydrop=3400 --inner=2000 --seed=12of19 --format=axt --scores=/tibet/madzays/finch_data/BFvsZFLASTZ/HoxD55.q > "/home/madzays/qsub/test/BF${id}_${otherid}ZF.axt"
    done
done
The code can be translated almost directly from your requirements:
base=/tibet/madzays/finch_data
for b in {a..s}
do
    for z in {1..37096}
    do
        lastz "$base/BF_genome_split/BF_genomea${b}.fa" "$base/ZF_genome_split/${z}ZF_genome.fa" --hspthresh=2200 --gappedthresh=10000 --ydrop=3400 --inner=2000 --seed=12of19 --format=axt --scores="$base/BFvsZFLASTZ/HoxD55.q" > "/home/madzays/qsub/test/${b}-${z}.axt"
    done
done
Note that one-liners easily lead to errors, like missing "do"s, which are then hard to track down from the error message (error in line 1).

OSX / MacOs batch rename hexadecimal filenames to decimal filenames

I want to rename filenames with a hexadecimal part in the name to decimal. For example: MOV12B.MOD, MOV12C.MOD etc. to MOV299.MOD, MOV300.MOD.
Can this be done in terminal?
It is possible to rename the extension using:
find . -name "*.MOD" -exec rename 's/\.MOD$/.MPG/' '{}' \;
But how can I rename the files to decimal?
Sure, you can do it with rename, also known as Perl rename and prename, which is most simply installed on macOS with Homebrew using:
brew install rename
Then the command is:
rename --dry-run 's/[0-9A-F]+/hex($&)/e' *MOD
Sample Output
'MOV10.MOD' would be renamed to 'MOV16.MOD'
'MOV12B.MOD' would be renamed to 'MOV299.MOD'
'MOV12C.MOD' would be renamed to 'MOV300.MOD'
'MOVBEEF.MOD' would be renamed to 'MOV48879.MOD'
If you like what it does, remove the --dry-run part and do it for real.
I would recommend you make a backup before trying this anyway, because if your films are actually named "Film 23.MOD" rather than "MOV12B.MOD" you will get:
'Film 23.MOD' would be renamed to '15ilm 23.MOD'
If you want to put the date in too, you can do:
rename --dry-run 's/[0-9A-F]+/hex($&)/e; s|.MOD| 17/01/2018.MOD|' *MOD
Sample Output
'MOV12A.MOD' would be renamed to 'MOV298 17/01/2018.MOD'
Why couldn't you find it in the man-page? Well, there is a line in there that casually says you can pass a line of Perl code to modify the name. That means that the entire Perl language is available to you - so you could write several pages of code that access a database, run something on a remote machine, or fetch a URL in order to rename your file.
The only tricky thing in my code is the e lurking at the end:
s/search/replace/e
The e means that the second half of the search/replace is actually executed, so it is not a straight textual replacement: it is a little program that gets the matched text from the left-hand side in $& and can do maths or lookups on it.
I have done some other answers that involve similar techniques...
here,
here,
here.
If you want to put the modification time of the file into its name as well, you need to do a little more work. First, stat() the file before changing its name ;-) Remember you receive the original filename in $_. Then do the hex-to-decimal thing, then add in the mtime. Remember Perl uses a dot to concatenate strings.
So, the command is going to look like this:
rename --dry-run 'my $mtime=(stat($_))[9]; s/[0-9A-F]+/hex($&) . " " . $mtime/e;' *MOD
Sample Output
'MOV12A.MOD' would be renamed to 'MOV298 1516229449.MOD'
If all the substitution and evaluation gets too much, you can always do all your calculations and assign the result to Perl's $_ variable, through which you receive the input filename and in which you pass the desired name back to rename. So, for example:
rename --dry-run 'my $prefix="PREFIX "; my $middle=$_; my $suffix=" SUFFIX"; $_=$prefix . $middle . $suffix;' *MOD
'MOV12A.MOD' would be renamed to 'PREFIX MOV12A.MOD SUFFIX'
Only a real programmer would store his movies with hex names - kudos to you!

Recursively dumping the content of a file located in different folders

Still being a newbie at bash programming, I am fighting with another task I got. A specific file called ".dump" (yes, with a dot at the beginning) is located in each folder and always contains three numbers. I need to capture the third number in a variable if it is greater than 1000, and then print it along with the folder containing it. So the outcome should look like this:
"/dir1/ 1245"
"/dir1/subdir1/ 3434"
"/dir1/subdir2/ 10003"
"/dir1/subdir2/subsubdir3/ 4123"
"/dir2/ 45440"
(without the quotes, and each on a new line)
I was playing around with awk, find, and while, but the results are so bad that I'd rather not post them here, which I hope is understood. Any code snippet that helps is appreciated.
This could be cleaned up, but should work:
find /dir1 /dir2 -name .dump -exec sh -c 'k=$(awk "\$3 > 1000 {print \$3; exit 1}" "$0") ||
    echo "${0%.dump}" "$k"' {} \;
(I'm assuming that all three numbers in your .dump files appear on one line. The awk will need to be modified if the input is in a different format.)
