Using a file with specific IDs to extract data from another file into separate files and then using them to get values - bash

I have a file with some IDs listed like this:
id1
id2
id3
etc
I want to use those IDs to extract data from files (the IDs occur in every file) and save the output for each ID to a separate file (the IDs are protein family names, and I want to get every protein belonging to a given family). Then, once I have the name of each protein, I want to use that name to fetch the protein sequences (in .fasta format), so that they stay grouped by family.
So I've tried to do it like this (I knew that it would dump all the IDs into one file):
#!/bin/bash
for file in *out
do grep -n -E 'id1|id2|id3' "/directory/$file" >> output; done
I would appreciate any help, and I will gladly clarify if anything is unclear.
EDIT: I will try to clarify, sorry for the inconvenience.
There's a file called "pfamacc" with the following content:
PF12312
PF43555
PF34923
and so on; those are the IDs I need to access other files, which are named like "something_something.faa.out" and have this structure:
<acc_number> <alignment_start> <alignment_end> <pfam_acc>
RXOOOA 5 250 PF12312
OC2144 6 200 PF34923
I need those accession numbers so I can then get protein sequences from files that look like this:
>RXOOOA
ASDBSADBASDGHH
>OC2144
SADHHASDASDCJHWINF

Assuming there is a file ids_file.txt in the same directory with the following content:
id1
id2
id3
id4
And in the same directory there is also a file called id1 with the following content:
Bla bla bla
id1
and id2
is
here id4
Then this script could help:
#!/bin/sh
IDS=$(cat ids_file.txt)
# join the IDs into one alternation pattern, e.g. "id1|id2|id3|id4";
# sed strips the trailing "|" that tr leaves behind
IDS_IN_ONE=$(tr '\n' '|' < ids_file.txt | sed 's/|$//')
echo "$IDS_IN_ONE"
for file in $IDS; do
    grep -n -E "$IDS_IN_ONE" "./$file" >> output
done
The file output then contains:
2:id1
3:and id2
5:here id4
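Since the IDs are literal strings, grep can also read the patterns straight from the file, which avoids building the alternation by hand; a minimal sketch of that variant:
# -F treats each line of ids_file.txt as a fixed string to match
for file in $IDS; do
    grep -nFf ids_file.txt "./$file" >> output
done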

Reading the question: a list needs to be cross-referenced to get a second list, which then needs to be used to gather FASTA sequences.
Starting with the following 3 files...
starting_values.txt
PF12312
PF43555
PF34923
cross_reference.txt
<acc_number> <alignment_start> <alignment_end> <pfam_acc>
RXOOOA 5 250 PF12312
OC2144 6 200 PF34923
find_from_file.fasta
>RXOOOA
ASDBSADBASDGHH
>OC2144
SADHHASDASDCJHWINF
SADHHASDASDCJHWINF
>NC11111
IURJCNKAERJKADSF
while read -r i; do awk -v var="$i" 'var==$4 {print $1}' cross_reference.txt; done < starting_values.txt > needed_accessions.txt
If the FASTA is multiline, convert it to single-line first (see https://www.biostars.org/p/9262/):
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' find_from_file.fasta > find_from_file.temp
# anchor the pattern so one accession cannot match inside a longer one
while read -r i; do grep -A 1 "^>$i$" find_from_file.temp; done < needed_accessions.txt > found_sequences.fasta
Final Output...
found_sequences.fasta
>RXOOOA
ASDBSADBASDGHH
>OC2144
SADHHASDASDCJHWINFSADHHASDASDCJHWINF
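An alternative sketch (not from the original answer): the three steps can be collapsed into a single awk pass over the three files, using the same file names as above. It also handles multiline FASTA directly, so the single-line conversion step is not needed:
awk '
    # file 1: remember the wanted family IDs
    FNR == NR { want[$1]; next }
    # file 2: remember accessions whose family is wanted
    FILENAME == "cross_reference.txt" { if ($4 in want) acc[">" $1]; next }
    # FASTA: start/stop printing at each header line
    /^>/ { keep = ($1 in acc) }
    keep
' starting_values.txt cross_reference.txt find_from_file.fasta > found_sequences.fasta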

Pass number of for loop elements to external command

I'm using a for loop to iterate through .txt files in a directory and grab specified rows from the files. Afterwards the output is passed to the pr command in order to print it as a table. Everything works fine; however, I'm manually specifying the number of columns the table should contain, which is cumbersome when the number of files is not constant.
The command I'm using:
for f in *txt; do awk -F"\t" 'FNR ~ /^(2|6|9)$/{print $2}' "$f"; done | pr -ts --column 4
How should I modify the command to replace '4' with the number of loop elements?
Edit:
The fundamental question was whether one can pass the number of matching files to a command outside the loop. Seeing the solutions, I guess there is no way around computing it separately. Until this conclusion the structure of the files was not really relevant.
However taking the above into account, I'm providing the files structure below.
Sample file.txt:
Irrelevant1 text
Placebo 1222327
Irrelevant1 text
Irrelevant2 text
Irrelevant3 text
Treatment1 105956
Irrelevant1 text
Irrelevant2 text
Treatment2 49271
Irrelevant1 text
Irrelevant2 text
The for loop generates the following from 4 *txt files:
1222327
105956
49271
969136
169119
9672
1297357
237210
11581
1189529
232095
13891
Expected pr output using a dynamically generated --column 4:
1222327 969136 1297357 1189529
105956 169119 237210 232095
49271 9672 11581 13891
Assumptions:
all input files generate the same number of output lines (otherwise we can add some code to keep track of the max number of lines and generate blank columns as needed)
Setup (columns are tab-delimited):
$ grep -n xxx f[1-4].txt
f1.txt:6:xxx 1222327
f1.txt:9:xxx 105956
f1.txt:24:xxx 49271
f2.txt:6:xxx 969136
f2.txt:9:xxx 169119
f2.txt:24:xxx 9672
f3.txt:6:xxx 1297357
f3.txt:9:xxx 237210
f3.txt:24:xxx 11581
f4.txt:6:xxx 1189529
f4.txt:9:xxx 232095
f4.txt:24:xxx 13891
One idea using awk to dynamically build the 'table' (replaces OP's current for loop):
awk -F'\t' '
FNR==1 { c=0 }                 # reset the row counter at the start of each file
FNR ~ /^(6|9|24)$/ {           # the rows of interest
    ++c
    arr[c] = arr[c] (FNR==NR ? "" : " ") $2   # append this file's value to row c
}
END { for (i=1;i<=c;i++) print arr[i] }
' f[1-4].txt | column -t -o ' '
NOTE: we let column take care of pretty-printing the table with a single space between columns; otherwise we could add more code to awk to right-pad the columns with spaces.
This generates:
1222327 969136 1297357 1189529
105956 169119 237210 232095
49271 9672 11581 13891
You could just run ls and pipe the output to wc -l. Once you've got that number, assign it to a variable and use the variable in your command:
num=$(ls *.txt | wc -l)
(To pass a bash variable into awk itself you can use awk -v name="$value", but here the count goes to pr, not awk.)
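Putting that together with the loop from the question (a sketch; the row numbers are the OP's):
num=$(ls *.txt | wc -l)    # one column per input file
for f in *txt; do awk -F'\t' 'FNR ~ /^(2|6|9)$/{print $2}' "$f"; done | pr -ts --column "$num"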

Replace a word using different files Bash

I'm looking to edit my 1.txt file: find a word and replace it with the corresponding word in 2.txt, and also add the rest of the string from file 2.
I'm interested in maintaining the order of my 1.txt file.
>title1
ID1 .... rest of string im not interested
>title2
ID2 .... rest of string im not interested
>title3
ID3 .... rest of string im not interested
>title....
But I want to add the information from my file 2:
>ID1 text i want to extract
>ID2 text i want to extract
>ID3 text i want to extract
>IDs....
In the end I'm looking to create a new file with this structure:
>title1
ID1 .... text I want
>title2
ID2 .... text I want
>title3
ID3 .... text I want
>title....
I have tried several sed commands, but most of them don't replace the ID exactly with the one that appears in both files. Hopefully it can be done in bash.
Thanks for your help.
Failed attempts:
File 1 = cog_anotations.txt, File 2 = Real.cog.txt
IDs = COG05764, COG015668, etc.
sed -e '/COG/{r Real.cog.txt' -e 'd}' cog_anotations.txt
sed "s/^.*COG.*$/$(cat Real.cog.txt)/" cog_anotations.txt
sed -e '/\$COG\$/{r Real.cog.txt' -e 'd}' cog_anotations.txt
grep -F -f cog_anotations.txt Real.cog.txt > newfile.txt
grep -F -f Real.cog.txt cog_anotations.txt > newfile.txt
file.awk:
BEGIN { RS=">" }
{
    if (FILENAME == "1.txt") {
        # records from 1.txt look like "title1\nID1 ...", so $1 is
        # the title and $2 the ID
        a[$2] = $1    # map ID -> title
        b[$2] = $2    # remember the ID itself
    }
    else if ($1 != "" && $1 == b[$1]) {
        # record from 2.txt whose first field is a known ID
        printf(">%s\n%s", a[$1], $0)
    }
}
call:
gawk -f file.awk 1.txt 2.txt
The order of files is important.
result:
>title1
ID1 text i want to extract
>title2
ID2 text i want to extract
>title3
ID3 text i want to extract
explanation:
The input is split into records at each ">". For the first file, two associative arrays are built. For the second file only the else branch runs: we check whether field 1 of the record is in table b and, if so, format and print the record.
DO NOT write some nested grep.
A simple one-pass-each approach with a lookup table:
declare -A lookup
# build the table from 2.txt, stripping the leading ">" so the keys
# (ID1, ID2, ...) match the first word of the ID lines in 1.txt
while read -r key txt; do
    lookup["${key#>}"]="$txt"
done < 2.txt

# print 1.txt, swapping the remainder of each known ID line for the
# looked-up text; lines with no entry (the >title lines) pass through
while read -r key txt; do
    echo "$key${txt:+ }${lookup[$key]:-$txt}"
done < 1.txt
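Run against the sample 1.txt and 2.txt above, this prints:
>title1
ID1 text i want to extract
>title2
ID2 text i want to extract
>title3
ID3 text i want to extract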

Take the first column from 1.csv and find it in the second file, 2.csv

I want to read two csv files (A.csv and B.csv) and write a new csv file new.csv with a status for each row. I want to do that with a shell script.
A.csv:
Inputfile_name,Date
abc.csv,2018/11/26 16.38.54
bbc.csv,2018/11/26 15.28.11
B.csv:
Outputfile_name,Date
abc_SUCCESS.csv,2018/11/26 17.20.11
bbc_FAIL.csv,2018/11/26 16.28.11
new.csv:
Inputfile_name,Date,Outputfile_name,Date,Status
abc.csv,2018/11/26 16.38.54,abc_SUCCESS.csv,2018/11/26 17.20.11,SUCCESS
bbc.csv,2018/11/26 15.28.11,bbc_FAIL.csv,2018/11/26 16.28.11,FAIL
Like so?
$ paste -d, A.csv B.csv | sed -e 's/\(SUCCESS\|FAIL\).*/&,\1/'
Inputfile_name,Date,Outputfile_name,Date
abc.csv,2018/11/26 16.38.54,abc_SUCCESS.csv,2018/11/26 17.20.11,SUCCESS
bbc.csv,2018/11/26 15.28.11,bbc_FAIL.csv,2018/11/26 16.28.11,FAIL
paste concatenates the contents of two files line by line, and with sed you can do a search-and-replace that appends SUCCESS or FAIL to the end of each line.
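If you also want the header line labelled (the sed version leaves it without a Status field), the same idea works in awk; a sketch using the _SUCCESS./_FAIL. markers from the sample file names:
paste -d, A.csv B.csv | awk '
    NR == 1      { print $0 ",Status"; next }   # label the header
    /_SUCCESS\./ { print $0 ",SUCCESS"; next }
    /_FAIL\./    { print $0 ",FAIL"; next }
    { print }
' > new.csv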

How to split a CSV file into multiple files based on column value

I have CSV file which could look like this:
name1;1;11880
name2;1;260.483
name3;1;3355.82
name4;1;4179.48
name1;2;10740.4
name2;2;1868.69
name3;2;341.375
name4;2;4783.9
There could be more or fewer rows, and I need to split it into multiple .dat files, each containing the rows that share the same value in the second column. (Then I will make a bar chart for each .dat file.) For this case it should be two files:
data1.dat
name1;1;11880
name2;1;260.483
name3;1;3355.82
name4;1;4179.48
data2.dat
name1;2;10740.4
name2;2;1868.69
name3;2;341.375
name4;2;4783.9
Is there any simple way of doing it with bash?
You can use awk to generate a file containing only the rows with a particular value in the second column:
awk -F ';' '($2==1){print}' data.dat > data1.dat
Just change the value in the $2== condition.
Or, if you want to do this automatically, just use:
awk -F ';' '{print > ("data"$2".dat")}' data.dat
which writes each row to a file whose name contains the value of the second column.
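One caveat: awk keeps every output file open, so with many distinct values in the second column you can hit the per-process limit on open files. A sketch that closes each file after writing (note the >> so earlier rows are kept):
awk -F ';' '{ out = "data" $2 ".dat"; print >> out; close(out) }' data.dat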
Try this:
while IFS=";" read -r a b c; do echo "$a;$b;$c" >> data${b}.dat; done <file

I want my bash script output in html format?

I am parsing a csv file using a bash script. My output is tabular, with a number of rows and columns, so when I redirect it to a text file the alignment is mismatched and it looks messy.
Can anyone guide me on how to redirect my output to HTML format, or suggest any other alternative way?
Thanks in advance
If you don't really need the output in HTML, but you're having trouble with column alignment using tabs, you can get good column alignment with printf.
By the way, it would help if your question included some sample input, the script that you're using to parse and output it, and some sample output.
Here is a simple demonstration of printf:
$ cat file
example text,123,word,23.12
more text,1004,long sequence of words,1.1
text,1,a,1000.42
$ cat script
#!/bin/bash
# "*" in a printf format takes the field width from the argument list
headformat='%-*s%-*s%*s%*s\n'
format='%-*s%-*s%*d%*.*f\n'
modwidth=16; descwidth=24; qtywidth=6; pricewidth=10
printf "$headformat" "$modwidth" Model "$descwidth" Desc. "$qtywidth" Qty "$pricewidth" Price
while IFS=, read -r model quantity description price
do
    printf "$format" "$modwidth" "$model" "$descwidth" "$description" "$qtywidth" "$quantity" "$pricewidth" 2 "$price"
done < file
$ ./script
Model           Desc.                      Qty     Price
example text    word                       123     23.12
more text       long sequence of words    1004      1.10
text            a                            1   1000.42
Write it out as TSV, then have an XSLT stylesheet convert it from TSV to XHTML. You can use $'\t' in bash to produce a tab character.
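If you'd rather skip XSLT, an HTML table is also easy to emit directly; a minimal sketch with awk (the input file name and the comma delimiter are assumptions):
awk -F, '
BEGIN { print "<table>" }
{
    # wrap each field of the row in a table cell
    printf "<tr>"
    for (i = 1; i <= NF; i++) printf "<td>%s</td>", $i
    print "</tr>"
}
END { print "</table>" }
' file > table.html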
A simple solution would use column(1):
column -t -s, <( echo "head1,head2,head3,head4"; cat csv.dat )
with a result like this one:
head1  head2     head3   head4
aaaa   33333     bbb     123
aaa    333333    bbbx    123
aa     3333333   bbbxx   123
a      33333333  bbbxxx  123
