Need a command or script to rename a list of files in Linux using a pattern match - bioinformatics

I have downloaded some 90 fasta files from NCBI for bacterial genomes. The downloaded files have default names given by NCBI, and I need to change them to my desired file names. Thus I have created two .txt files:
file1.txt - listing the default file names provided by NCBI.
file2.txt - listing the names that should replace the NCBI names.
Both files are ordered so that the 1st entry of file1.txt corresponds to the 1st entry of file2.txt.
All the downloaded files are in one folder,
and I need a script which reads file1.txt, matches each name with a file in the folder, and renames it to the corresponding name in file2.txt.
I am not a bioinformatician and am new to this field. I look forward to your help. Can this process be made simpler?

This can be done with a very small awk one-liner. For convenience, let's first combine your file1 and file2 to make processing easier. This can be done with paste file1.txt file2.txt > names.txt.
names.txt will be a text file with the old names in the first column and the new names in the second. Awk lets us conveniently run through a file line-by-line (or record-by-record in its terminology) and access each column/field.
Assuming you are in the directory with all these files, as well as names.txt, you can simply run awk '{system("mv " $1 " " $2)}' names.txt to rename them all. This will run through all the lines in names.txt, take the filename given in the first column, and move it to the name given in the second column. The system() function lets awk run basic file system operations through the shell, such as moving (mv), copying (cp), or removing (rm) files.
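If you want to check what will happen before renaming anything, you can print the generated commands instead of executing them (a dry run over the same names.txt):
awk '{print "mv " $1 " " $2}' names.txt
Each output line shows the mv command that the system() version would run. Note that this simple approach assumes the file names contain no spaces or shell-special characters.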

Use paste and xargs like so:
paste file1.txt file2.txt | xargs --verbose -n2 mv
The command is using paste to write lines from 2 files side by side, separated by TABs, to STDOUT. The STDOUT is read by xargs using a pipe (|). Option --verbose prints the command, and option -n2 specifies the max number of arguments for xargs to be 2, so that the resulting commands that are executed are something like mv old_file new_file.
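If you prefer to see what would be executed before actually renaming anything, you can prefix mv with echo as a dry run (a small sketch of the same pipeline):
paste file1.txt file2.txt | xargs -n2 echo mv
This prints one mv old_file new_file line per pair without touching any files; remove the echo to perform the renames.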
Alternatively, use the Perl one-liners below.
Print the commands to rename the files, without executing the commands ("dry run"):
paste file1.txt file2.txt | perl -lane '$cmd = "mv $F[0] $F[1]"; print $cmd;'
Print the commands to rename the files, then actually execute them:
paste file1.txt file2.txt | perl -lane '$cmd = "mv $F[0] $F[1]"; print $cmd; system $cmd;'
As before, paste writes lines from the 2 files side by side, separated by TABs, to STDOUT, which is piped (|) into the Perl one-liner's STDIN.
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in the -F option.
$F[0], $F[1] : first and second elements of the array @F into which the line is split. They are the old and new file names, respectively.
system executes the command $cmd, which actually moves the files.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches

Related

How to split a text file content by a string?

Suppose I've got a text file that consists of two parts separated by a delimiting string ---
aa
bbb
---
cccc
dd
I am writing a bash script to read the file and assign the first part to var part1 and the second part to var part2:
part1= ... # should be aa\nbbb
part2= ... # should be cccc\ndd
How would you suggest writing this in bash?
You can use awk:
foo="$(awk 'NR==1' RS='---\n' ORS='' file.txt)"
bar="$(awk 'NR==2' RS='---\n' ORS='' file.txt)"
This reads the file twice, but handling text files in the shell (i.e. storing their content in variables) should generally be limited to small files anyway. Given that your file is small, this shouldn't be a problem.
Note: Depending on your actual task, you may be able to just use awk for the whole thing. Then you don't need to store the content in shell variables or read the file twice.
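For example, if the end goal were just to process each part (say, count its lines), a single awk pass could do it without any shell variables. A minimal sketch, relying on the same multi-character RS (a GNU awk / mawk feature) as above:
awk -v RS='---\n' '{ printf "part %d has %d line(s)\n", NR, gsub(/\n/, "&") }' file.txt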
A solution using sed:
foo=$(sed '/^---$/q;p' -n file.txt)
bar=$(sed '1,/^---$/b;p' -n file.txt)
The -n command line option tells sed to not print the input lines as it processes them (by default it prints them). sed runs a script for each input line it processes.
The first sed script
/^---$/q;p
contains two commands (separated by ;):
/^---$/q - quit when you reach the line matching the regex ^---$ (a line that contains exactly three dashes);
p - print the current line.
The second sed script
1,/^---$/b;p
contains two commands:
1,/^---$/b - starting with line 1 until the first line matching the regex ^---$ (a line that contains only ---), branch to the end of the script (i.e. skip the second command);
p - print the current line;
Using csplit:
csplit --elide-empty-files --quiet --prefix=foo_bar file.txt "/---/" "{*}" && sed -i '/---/d' foo_bar*
If your version of coreutils is >= 8.22, the --suppress-matched option can be used and the sed post-processing is not required:
csplit --suppress-matched --elide-empty-files --quiet --prefix=foo_bar file.txt "/---/" "{*}".

Use grep only on specific columns in many files?

Basically, I have one file with patterns and I want every line of it to be searched for in all the text files in a certain directory. I also only want exact matches. The many files are zipped.
However, I have one more condition: I need the first two columns of a line in the pattern file to match the first two columns of a line in any given text file that is searched. If they match, the output I want is the pattern (the entire line), followed by the names of all the text files in which a match was found, each with its entire matching line (not just the first two columns).
An output such as:
pattern1
file23:"text from entire line in file 23 here"
file37:"text from entire line in file 37 here"
file156:"text from entire line in file 156 here"
pattern2
file12:"text from entire line in file 12 here"
file67:"text from entire line in file 67 here"
file200:"text from entire line in file 200 here"
I know that grep can take an input file, but the problem is that it takes every pattern in the pattern file and searches for them in a given text file before moving onto the next file, which makes the above output more difficult. So I thought it would be better to loop through each line in a file, print the line, and then search for the line in the many files, seeing if the first two columns match.
I thought about this:
cat pattern_file.txt | while read line
do
echo $line >> output.txt
zgrep -w -l $line many_files/*txt >> output.txt
done
But with this code, it doesn't search by the first two columns only. Is there a way to specify the first two columns for both the pattern line and for the lines that grep searches through?
What is the best way to do this? Would something other than grep, like awk, be better to use? There were other questions like this, but none that used columns for both the search pattern and the searched file.
Few lines from pattern file:
1 5390182 . A C 40.0 PASS DP=21164;EFF=missense_variant(MODERATE|MISSENSE|Aag/Cag|p.Lys22Gln/c.64A>C|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
1 5390200 . G T 40.0 PASS DP=21237;EFF=missense_variant(MODERATE|MISSENSE|Gcc/Tcc|p.Ala28Ser/c.82G>T|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
1 5390228 . A C 40.0 PASS DP=21317;EFF=missense_variant(MODERATE|MISSENSE|gAa/gCa|p.Glu37Ala/c.110A>C|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
Few lines from a file in searched files:
1 10699576 . G A 36 PASS DP=4 GT:GQ:DP 1|1:36:4
1 10699790 . T C 40 PASS DP=6 GT:GQ:DP 1|1:40:6
1 10699808 . G A 40 PASS DP=7 GT:GQ:DP 1|1:40:7
In reality, both are much larger.
It sounds like this might be what you want:
awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' patternfile anyfile
If it's not then update your question to provide a clear, simple statement of your requirements and concise, testable sample input and expected output that demonstrates your problem and that we could test a potential solution against.
If anyfile is actually a zip file then you'd do something like:
zcat anyfile | awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' patternfile -
Replace zcat with whatever command you use to produce text from your zip file if that's not what you use.
Per the question in the comments, if both input files are compressed and your shell supports it (e.g. bash) you could do:
awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' <(zcat patternfile) <(zcat anyfile)
otherwise just uncompress patternfile to a tmp file first and use that in the awk command.
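For reference, here is the same program written out with comments; the behaviour is identical to the one-liner above:
awk '
NR==FNR { a[$1,$2]; next }     # first file (patternfile): remember each col1/col2 pair
($1,$2) in a                   # remaining input: print lines whose first two columns were seen
' patternfile anyfile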
Use read to parse the pattern file's columns and add an anchor to the zgrep pattern:
while read -r column1 column2 rest_of_the_line
do
echo "$column1 $column2 $rest_of_the_line"
zgrep -w -l "^$column1\s\+$column2" many_files/*txt
done < pattern_file.txt >> output.txt
read is able to parse lines into multiple variables passed as parameters, the last of which gets the rest of the line. It separates fields on the characters of the $IFS Internal Field Separator (by default tabs, spaces and newlines; it can be overridden for the read command alone by using while IFS='...' read ...).
Using -r avoids unwanted backslash escapes and makes the parsing more reliable, and while ... do ... done < file performs a bit better since it avoids a useless use of cat. Since the output of all the commands inside the while is redirected, I also put the redirection on the while rather than on each individual command.
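As an illustration of overriding IFS for read only: if your pattern file were strictly tab-separated and you wanted to split on tabs alone (leaving spaces inside fields untouched), only the while line would change (a sketch):
while IFS=$'\t' read -r column1 column2 rest_of_the_line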

Concatenate awk-output, string, and text file

I have the following two tab-separated files in my current directory.
a.tsv
do not use this line
but this one
and that too
b.tsv
three fields here
not here
For each tsv file there is an associated txt file in the same directory, with the same filename but different suffix.
a.txt
This is the a-specific text.
b.txt
Text associated to b.
For each pair of files I want to create a new file with the same name but the suffix _new.txt. The new files should contain all lines from the respective tsv file that contain exactly 3 fields, afterwards the string \n####\n, and then the whole content of the respective txt file. Thus, the following output files should be created.
Desired output
a_new.txt
but this one
and that too
####
This is the a-specific text.
b_new.txt
three fields here
####
Text associated to b.
Working, but bad solution
for file in ./*.tsv
do awk -F'\t' 'NF==3' $file > ${file//.tsv/_3_fields.tsv}
done
for file in ./*_3_fields.tsv
do cat $file <(printf "\n####\n") ${file//_3_fields.tsv/.txt} > ${file//_3_fields.tsv/_new.txt}
done
Non-working code
I'd like to get the result with one script, and avoid creating the intermediate file with the suffix _3_fields.tsv.
I tried command substitution as follows:
for file in ./*.tsv
do cat <<< $(awk -F'\t' 'NF==3' $file) <(printf "\n####\n") ${file//.tsv/.txt} > ${file//.tsv/_new.txt}
done
But this doesn't write the awk-processed part into the new files.
Yet, the command substitution seems to work if I only write the awk-processed part into the new file like follows:
for file in ./*.tsv; do cat <<< $(awk -F'\t' 'NF==3' $file) > ${file//.tsv/_new.txt}; done
I'd be interested in why the second-to-last code doesn't work as expected, and what a good solution would be for this task.
Maybe you wanted to redirect a sequence of commands:
for file in ./*.tsv
do
{
awk -F'\t' 'NF==3' "$file"
printf "\n####\n"
cat "${file//.tsv/.txt}"
} > "${file//.tsv/_new.txt}"
done
Note that the space after the opening brace and the semicolon or newline before the closing brace are important.
It also seems you are confusing command substitution $() with process substitution <() or >(). Also, <<< redirects a string as standard input, whereas < redirects a file. That is why your non-working loop misses the awk part: cat is given file name operands (the <(printf ...) substitution and the .txt file), so it never reads its standard input, and the here-string produced by the command substitution is ignored.
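To make the distinction concrete, here is a small sketch using the files from the question:
cat <<< "$(awk -F'\t' 'NF==3' a.tsv)"
cat <(awk -F'\t' 'NF==3' a.tsv) a.txt
The first line feeds a single string (the result of the command substitution) to cat's standard input; the second makes the awk output available as a file name operand, so it can be concatenated with a.txt.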

How to fetch the file names present in a text file and delete those files using shell

I want to delete some files mentioned in a text file. The text is in a single line like the one below, along with some other data:
Cannot Handle File:C:\patches\BUG2\abc.javaCannot Handle File:C:\patches\BUG2\xyz.javaErrors .
So now I want to fetch the file names like abc.java and xyz.java from the text file and delete them. How can we proceed with this using shell? Please help to resolve this.
Perl to the rescue:
perl -lne 'unlink $1 while /File:(.*?)(?:Cannot|Errors)/g' input.txt
-l adds a newline after each print
-n processes the input line by line
(.*?) matches "frugally" (non-greedily), i.e. it finds the shortest possible match
/g matches globally, i.e. as many times as it can.
unlink removes a file.
So, the file name must be preceded by File: and followed by Cannot or Errors.
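If you would rather see which files would be removed before deleting anything, you can swap unlink for print in the same one-liner (a dry run):
perl -lne 'print $1 while /File:(.*?)(?:Cannot|Errors)/g' input.txt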
Using grep -o and xargs:
grep -Eo '[[:alnum:]_$-]+\.java' file | xargs rm
You will get this output from grep:
grep -Eo '[[:alnum:]_$-]+\.java' file
abc.java
xyz.java

Search recursively text from each line in file using standard cmd line commands

I have a file with variable names prepared by a grep call.
Now I want to do the following: grep a directory recursively and search each file for each variable entry (from the initially prepared file). How could I achieve this via awk/sed or any other console utility? I know how to do it with, for example, a python script, but now I'd like a pure console solution.
I am stuck on applying a command to the data: awk '{ print $0}' RS="/" settings_vars.txt. Is this right? But how do I call a command instead of printing the line content?
You can use recursive grep with -f option:
grep -rHf settings_vars.txt .
Options used are:
-f # Read one or more newline-separated patterns from file.
-H # Always print filename headers with output lines.
-r # Read all files under each directory, recursively, following symbolic links only if they are on the command line.
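If the entries in settings_vars.txt are literal variable names rather than regular expressions, adding -F (fixed strings) and -w (whole words) may give more precise matches; a sketch, assuming that is what you want:
grep -rHFwf settings_vars.txt .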
