Rename genome FASTA files with part of sequence header - bash

I'd like to rename FASTA files using the organism name (stored in the file) and the identifier (part of the filename).
All files have the same format in both filename and content; each file has only one FASTA header and its corresponding sequence.
Original filename:
$ head GCF_000008205.1_ASM820v1_genomic.fna
>NC_007295.1 Mycoplasma hyopneumoniae J, complete genome
CCAAAATCAACTTTATTAAATGTGCTAAATAAAGTTGATAAAATGTTTGCAAAAACATTTTTGTTGTTTTAAACAAAACA
AATTGATTTAAAAATTATACTACAAAATTAAAGGAAAATTTATAAAATGCAAACAAATAAAAATAATTTAAAGGTTAGAA
CACAGCAAATTAGACAACAAATTGAAAATTTATTAAATGATCGAATGTTGTATAACAACTTTTTTAGCACAATTTATGTA
...
I'd like to rename only the filename, using the assembly identifier (GCF_000008205.1) in the filename, and the second and third words of the FASTA header (Mycoplasma hyopneumoniae):
Mycoplasma_hyopneumoniae_GCF_000008205.1.fna
I've tried this:
for fname in *.fna; do
mv -- "$fname" \
"$(awk 'NR==1{printf("%s_%s_%s\n",$2,$3,substr($1,2));exit}' "$fname")".fna
done
result:
Mycoplasma_hyopneumoniae_NC_007295.1.fna
But the result contains the sequence accession from the header (NC_007295.1) instead of the identifier that interests me (GCF_000008205.1), which is in the name of the original file.
Thanks!

The following idea works, but only if every single file is formatted like the one in your example.
In the directory that has all your files do the following:
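for i in *.fna
do
# organism name: second and third words of the header line, followed by "_"
name1=$(awk -v OFS='_' '/^>/{print $2,$3,""; exit}' "$i")
# assembly identifier: first two "_"-separated fields of the filename
name2=$(basename "$i" | cut -d_ -f 1,2)
mv -- "$i" "${name1}${name2}.fna"
done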
I suggest creating a backup folder first before trying it just in case you have some files formatted differently.
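The same renaming can be previewed before anything is moved. This is only a minimal sketch: it assumes every filename starts with the assembly identifier followed by an underscore (as in GCF_000008205.1_ASM820v1_genomic.fna) and that the genus and species are the second and third words of the header. new_fasta_name is a made-up helper name, not part of any tool.

```shell
#!/bin/bash
# new_fasta_name FILE: print the proposed new name for one FASTA file.
# Assumes a name like GCF_000008205.1_ASM820v1_genomic.fna and a first
# line like ">NC_007295.1 Genus species ...".
new_fasta_name() {
    local file=$1 base organism assembly
    base=${file##*/}                                   # strip any directory part
    assembly=$(printf '%s\n' "$base" | cut -d_ -f1,2)  # e.g. GCF_000008205.1
    organism=$(awk 'NR==1 {print $2 "_" $3; exit}' "$file")
    printf '%s_%s.fna\n' "$organism" "$assembly"
}

# Dry run: show what would be renamed without touching any file.
for f in *.fna; do
    [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
    echo "$f -> $(new_fasta_name "$f")"
done
```

Once the printed names look right, the echo line can be swapped for mv -- "$f" "$(new_fasta_name "$f")".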

Related

Prepending part of a filename to a .csv file using bash/sed

I have a couple of files in a directory that are named like this;
1_38OE983729JKHKJV.csv
an integer followed by an ID (the Integer and ID are both unique).
I need to prepend this ID to every line of the file for each file in the folder to prepare the files for import to a database (and discard the integer part of the filename). The contents of the file look something like this:
BW;20015;11,45;0,49;41;174856;4103399
BA;25340;11,41;0,55;40;222161;4599779
BB;800;7,58;0,33;42;10559;239887
HE;6301;9,11;0,39;40;69191;1614302
.
.
.
Total;112613;9,33;0,43;40;1207387;25897426
The end result should look something like this:
38OE983729JKHKJV;BW;20015;11,45;0,49;41;174856;4103399
38OE983729JKHKJV;BA;25340;11,41;0,55;40;222161;4599779
38OE983729JKHKJV;BB;800;7,58;0,33;42;10559;239887
38OE983729JKHKJV;HE;6301;9,11;0,39;40;69191;1614302
.
.
.
38OE983729JKHKJV;Total;112613;9,33;0,43;40;1207387;25897426
Thanks for the help!
EDIT: Spelling and vocabulary for clarity
Loop over the files with for, use parameter expansion to extract the id.
#!/bin/bash
for csv in *.csv ; do
prefix=${csv%_*}
id=${csv#*_}
id=${id%.csv}
sed -i~ "s/^/$id;/" "$csv"
done
If the ID can contain underscores, you might need to be more careful with the expansion.
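To see how the expansion behaves when the ID itself contains underscores, here is a small sketch. It assumes the integer part never contains an underscore, so everything after the first underscore belongs to the ID; csv_id is a hypothetical helper name and the second filename is invented for illustration.

```shell
#!/bin/bash
# csv_id NAME: print the ID part of a name of the form <integer>_<ID>.csv.
csv_id() {
    local name=${1##*/}   # drop any directory part
    name=${name#*_}       # "#" removes the shortest match: integer and first "_" only
    printf '%s\n' "${name%.csv}"
}

csv_id 1_38OE983729JKHKJV.csv   # prints 38OE983729JKHKJV
csv_id 12_AB_CD.csv             # prints AB_CD (underscore inside the ID is kept)
```

Using ## instead of # would strip up to the last underscore and cut such an ID short.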
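With awk:
for f in *.csv; do awk 'FNR==1{ id=FILENAME; sub(/^[^_]*_/,"",id); sub(/\.csv$/,"",id) } { print id ";" $0 }' "$f" > tmp && mv tmp "$f"; done
id is built from FILENAME by removing everything up to and including the first underscore, and then the trailing .csv, so it does not depend on how many digits the integer has.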
Try the following too. It is a single awk invocation over all the files, and it closes each file before moving on to the next one, to avoid the "too many open files" error. Note that it prints the result to standard output instead of changing the files.
awk 'FNR==1{close(val);val=FILENAME;split(FILENAME,a,"_");sub(/\..*/,"",a[2])} {print a[2]";"$0}' *.csv
With GNU awk for inplace editing and gensub() all you need is:
awk -i inplace '{print gensub(/.*_(.*)\..*/,"\\1;",1,FILENAME) $0}' *.csv
No shell loops or anything else necessary, just that command.

How to print all lines of a file that do not contain a *partial* pattern

We know grep -v pattern file prints lines that do not contain pattern.
My file to search is a table:
Sample File, Sample Name, Panel, Marker, Allele 1, Allele 2, GQ,
M090972.s-206_B01.fsa, M090972-206, Sample ID-1, SNPchr1, C, T,0.9933,
I want to weed out the lines that contain "M090972-206" and some more patterns like that.
My search patterns come from a directory of text files:
$ ls 20170227_snap_genotypes_1_VCF
M070370-208_S1.genome.vcf M170276-201_S20.genome.vcf
M170308-201_S5.genome.vcf
Only the part of these filenames up to the first "_" is in my table (or the first "." if I remove the ".s" in the example). It is not a constant number of characters. I could remove the characters after the first "." but could not find a way in the sed and awk documentation.
Alternatively I tried using agrep 3.441 with the "-f" option for reading the patterns from a temporary file made with
$ ls "directory" > temp.txt
$ ./agrep -v -f temp.txt $infile >> $outfile
But agrep -f does not find any match (or everything with -v).
What am I missing? Is there a better way, perhaps with sed or awk?
If you are deriving your patterns from the name of files (up to the first _) that exist in 20170227_snap_genotypes_1_VCF directory, then you could do this:
# run from the parent of 20170227_snap_genotypes_1_VCF directory
grep -vf <(cd 20170227_snap_genotypes_1_VCF; ls | cut -f1 -d_) file
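The same approach can be written without ls or cut, using parameter expansion to keep only the part of each filename before the first "_". That truncation is also why feeding the raw ls output to agrep found nothing: the full names (with _S1.genome.vcf) never occur in the table. This is a sketch only; sample_ids and filter_table are hypothetical names, and -F is added on the assumption that the IDs should be matched as fixed strings rather than as regular expressions.

```shell
#!/bin/bash
# sample_ids DIR: print the part of each file name in DIR before the first "_".
sample_ids() {
    local path name
    for path in "$1"/*; do
        [ -e "$path" ] || continue     # empty directory: nothing to print
        name=${path##*/}               # file name without the directory
        printf '%s\n' "${name%%_*}"    # e.g. M070370-208
    done
}

# filter_table DIR TABLE: print the lines of TABLE containing none of the IDs.
filter_table() {
    grep -vF -f <(sample_ids "$1") "$2"
}
```

It would be called as filter_table 20170227_snap_genotypes_1_VCF "$infile" > "$outfile".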

How do I change all filenames with a similar but not identical structure?

Due to a variety of complex photo library migrations that had to be done using a combination of manual copying and importing tools that renamed the files, it seems I wound up with a ton of files with a similar structure. Here's an example:
2009-05-05 - 2009-05-05 - IMG_0486 - 2009-05-05 at 10-13-43 - 4209 - 2009-05-05.JPG
What it should be:
2009-05-05 - IMG_0486.jpg
The other files have the same structure, but obviously the individual dates and IMG numbers are different.
Is there any way I can do some command line magic in Terminal to automatically rename these files to the shortened/correct version?
I assume you may have sub-directories and want to find all files inside this directory tree.
This first code block (which you could put in a script) is "safe" (does nothing), but will help you see what would be done.
datep="[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]"
dir="PUT_THE_FULL_PATH_OF_YOUR_MAIN_DIRECTORY"
while IFS= read -r file
do
name="$(basename "$file")"
[[ "$name" =~ ^($datep)\ -\ $datep\ -\ ([^[:space:]]+)\ -\ $datep.*[.](.+)$ ]] || continue
date="${BASH_REMATCH[1]}"
imgname="${BASH_REMATCH[2]}"
ext="${BASH_REMATCH[3],,}"
dir_of_file="$(dirname "$file")"
target="$dir_of_file/$date - $imgname.$ext"
echo "$file"
echo " would be moved to..."
echo " $target"
done < <(find "$dir" -type f)
Make sure the output is what you want and are expecting. I cannot test on your actual files, and if this script does not produce results that are entirely satisfactory, I do not take any responsibility for hair being pulled out. Do not blindly let anyone (including me) mess with your precious data by copy and pasting code from the internet if you have no reliable, checked backup.
Once you are sure, decide if you want to take a chance on some guy's code written without any opportunity for testing and replace the three consecutive lines beginning with echo with this :
mv "$file" "$target"
Note that file names have to match a pretty strict pattern to be considered for processing, so if you notice that some files are not being processed, then the pattern may need to be modified.
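To check how strict the pattern is without touching any file, the match can be tried on a single name. The sketch below reuses the regular expression from the script above on the example filename from the question; short_name is a hypothetical helper name, and the lowercase expansion needs bash 4 or later.

```shell
#!/bin/bash
datep="[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]"

# short_name NAME: print the shortened name, or fail if NAME does not match.
short_name() {
    local name=$1
    [[ "$name" =~ ^($datep)\ -\ $datep\ -\ ([^[:space:]]+)\ -\ $datep.*[.](.+)$ ]] || return 1
    # group 1: first date, group 2: image name, group 3: extension (lowercased)
    printf '%s - %s.%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3],,}"
}

short_name "2009-05-05 - 2009-05-05 - IMG_0486 - 2009-05-05 at 10-13-43 - 4209 - 2009-05-05.JPG"
# prints: 2009-05-05 - IMG_0486.jpg
```

A name that lacks the repeated dates, such as plain IMG_0486.JPG, makes the function return a non-zero status, which corresponds to the files the script skips.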
Assuming they are all the exact same structure, spaces and everything, you can use awk to split the names up using the spaces as break points. Here's a quick and dirty example:
#!/bin/bash
for file in /path/to/files/*; do
name="$(basename "$file")" #Work on the file name only, not the full path
output="$(echo "$name" | awk '{print $1}')" #Assign the first field (the date) to the output variable
output="$output"" - " #Append [space][dash][space]
output="$output""$(echo "$name" | awk '{print $5}')" #Append the IMG_* field
output="$output""." #Append a period
#Use -F '.' to split by period, and $NF to grab the last field (the extension, lowercased)
output="$output""$(echo "$name" | awk -F '.' '{print tolower($NF)}')"
done
From there, something like mv "$file" "/path/to/files/$output" as a final line in the file loop will rename the file. I'd copy a few files into another folder to test with first, since we're dealing with file manipulation.
All the output assigning lines can be consolidated into a single line, as well, but it's less easy to read.
output="$(echo "$name" | awk '{print $1 " - " $5 "."}')""$(echo "$name" | awk -F '.' '{print tolower($NF)}')"
You'll still want a file loop, though.
Assuming that you want to convert the filename with the first date and the IMG* name, you can run the following on the folder:
IFS=$'\n'
for file in *
do
printf "mv '%s' '" "$file"
printf '%s' $(cut -d" " -f1,4,5 <<< "$file")
printf "'.jpg\n"
done | sh

Add one text to multiple files using bash

I have many files with the extension .com, so the files are named 001.com, 002.com, 003.com, and so on.
And I have another file called headname which contains the following information:
abc=chd
dha=djj
cjas=FILENAME.chk
dhdh=hsd
I need to put the contents of the file headname at the beginning of the files 001.com, 002.com, 003.com and so on. But FILENAME needs to be replaced by the name of the file that receives the headname information (without the .com extension).
So the output needs to be:
For the 001.com:
abc=chd
dha=djj
cjas=001.chk
dhdh=hsd
For the 002.com:
abc=chd
dha=djj
cjas=002.chk
dhdh=hsd
For the 003.com:
abc=chd
dha=djj
cjas=003.chk
dhdh=hsd
And so on...
set -e
for f in *.com
do
cat <(sed "s#^cjas=FILENAME.chk\$#cjas=${f%.com}.chk#" headname) "$f" > "$f.new"
mv -f "$f.new" "$f"
done
Explanation:
for f in *.com -- this loops over all file names ending with .com.
sed is a program that can be used to replace text.
s#...#...# is the substitute command.
${f%.com} is the file name without the .com suffix.
cat <(...) "$f" -- this merges the new head with the body of the .com file.
The output of cat is stored into a file named 123.com.new -- mv -f "$f.new" "$f" is used to rename 123.com.new to 123.com.
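The same steps can also be wrapped in a function that takes the head file and the target files as arguments, using a command group in place of the process substitution. This is a sketch under the assumption that the file names contain no characters that are special to sed (such as / or &); prepend_head is a hypothetical name.

```shell
#!/bin/bash
# prepend_head HEADFILE FILE...: put HEADFILE at the top of each FILE, with
# FILENAME replaced by the file's name minus its .com extension.
prepend_head() {
    local head=$1 f base
    shift
    for f in "$@"; do
        base=${f##*/}        # drop any directory part
        base=${base%.com}    # drop the .com suffix, e.g. 001
        # new head first, then the original body; replace the file only on success
        { sed "s/FILENAME/$base/" "$head"; cat "$f"; } > "$f.new" &&
            mv -f "$f.new" "$f"
    done
}
```

In the directory from the question it would be called as prepend_head headname *.com.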
Something like this should work:
head=$(<headname) # read head file into variable
head=${head//$'\n'/\\n} # replace literal newlines with "\n" for sed
for f in *.com; do # loop over all *.com files
# make a backup copy of the file (named 001.com.bak etc).
# insert the contents of $head with FILENAME replaced by the
# part of the filename before ".com" at the beginning of the file
sed -i.bak "1i${head/FILENAME/${f%.com}}" "$f"
done

combining grep and find to search for file names from query file

I've found many similar examples but cannot find an example to do the following. I have a query file with file names (file1, file2, file3, etc.) and would like to find these files in a directory tree; these files may appear more than once in the dir tree, so I'm looking for the full path. This option works well:
find path/to/files/*/* -type f | grep -E "file1|file2|file3|fileN"
What I would like is to pass grep a file with filenames, e.g. with the -f option, but am not successful. Many thanks for your insight.
The query file contains a single column of filenames, one per line, and looks like this:
103128_seqs.fna
7010_seqs.fna
7049_seqs.fna
7059_seqs.fna
7077A_seqs.fna
7079_seqs.fna
grep -f FILE gets the patterns to match from FILE, one per line:
cat files_to_find.txt
n100079_seqs.fna
103128_seqs.fna
7010_seqs.fna
7049_seqs.fna
7059_seqs.fna
7077A_seqs.fna
7079_seqs.fna
Remove any whitespace (or do it manually):
perl -i -nle 'tr/ //d; print if length' files_to_find.txt
Create some files to test:
touch `cat files_to_find.txt`
Use it:
find ~/* -type f | grep -f files_to_find.txt
output:
/home/user/tmp/7010_seqs.fna
/home/user/tmp/103128_seqs.fna
/home/user/tmp/7049_seqs.fna
/home/user/tmp/7059_seqs.fna
/home/user/tmp/7077A_seqs.fna
/home/user/tmp/7079_seqs.fna
/home/user/tmp/n100079_seqs.fna
Is this what you want?
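One caveat with grep -f is that each line of the query file is used as a regular expression and may match anywhere in the path, so 7010_seqs.fna would also match a file called 17010_seqs.fna. If only exact file names should match, a sketch along these lines compares the last path component against the list. find_listed is a hypothetical name, and it assumes the query file is not empty and that no file name contains a newline.

```shell
#!/bin/bash
# find_listed LIST DIR: print the path of every regular file under DIR whose
# base name appears as a line in LIST.
find_listed() {
    local list=$1 dir=$2
    find "$dir" -type f | awk -F/ '
        NR == FNR { wanted[$0]; next }   # first input: the list of names
        $NF in wanted                    # second input (stdin): paths from find
    ' "$list" -
}
```

It would be called as find_listed files_to_find.txt path/to/files and prints one full path per match, including repeated names in different subdirectories.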
