Replace a word using different files - Bash

I'm looking to edit my 1.txt file: find a word and replace it with the corresponding word in 2.txt, and also add the rest of the string from file 2.
I'm interested in maintaining the order of my 1.txt file.
>title1
ID1 .... rest of string im not interested
>title2
ID2 .... rest of string im not interested
>title3
ID3 .... rest of string im not interested
>title....
But I want to add the information from my file 2:
>ID1 text i want to extract
>ID2 text i want to extract
>ID3 text i want to extract
>IDs....
In the end I'm looking to create a new file with this structure:
>title1
ID1 .... text I want
>title2
ID2 .... text I want
>title3
ID3 .... text I want
>title....
I have tried several sed commands, but most of them don't replace each ID exactly with its counterpart in the other file. Hopefully it can be done in bash.
Thanks for your help.
Failed attempts. My commands were:
File 1 = cog_anotations.txt, File 2 = Real.cog.txt
IDs = COG05764, COG015668, etc.
sed -e '/COG/{r Real.cog.txt' -e 'd}' cog_anotations.txt
sed "s/^.*COG.*$/$(cat Real.cog.txt)/" cog_anotations.txt
sed -e '/\$COG\$/{r Real.cog.txt' -e 'd}' cog_anotations.txt
grep -F -f cog_anotations.txt Real.cog.txt > newfile.txt
grep -F -f Real.cog.txt cog_anotations.txt > newfile.txt

file.awk:
BEGIN { RS=">" }
{
  if (FILENAME == "1.txt") {
    # records from 1.txt: remember the title ($1) keyed by the ID ($2)
    a[$2] = $1; b[$2] = $2
  } else {
    # records from 2.txt: if the ID was seen in 1.txt, print its title and the record
    if ($1 == b[$1] && $1 != "") { printf(">%s\n%s", a[$1], $0) }
  }
}
call:
gawk -f file.awk 1.txt 2.txt
The order of files is important.
result:
>title1
ID1 text i want to extract
>title2
ID2 text i want to extract
>title3
ID3 text i want to extract
Explanation:
The first file is split into records at each ">" and two associative arrays are built from it. Only the else branch runs for the second file: there we check whether field 1 of the record is a key in array b and, if so, print the corresponding title followed by the record.

DO NOT write nested greps. Simple one-pass-each logic with a lookup table:
declare -A lookup
while read -r key txt; do
    lookup["${key#>}"]="$txt"       # keys in 2.txt look like ">ID1"; strip the ">"
done < 2.txt
while read -r key txt; do
    if [[ -n "${lookup[$key]:-}" ]]; then
        echo "$key ${lookup[$key]}" # ID line: substitute the text from 2.txt
    else
        echo "$key${txt:+ $txt}"    # title line (or unknown ID): keep it unchanged
    fi
done < 1.txt

Related

Using a file with specific IDs to extract data from another file into separate files and then using them to get values

I have a file with some IDs listed like this:
id1
id2
id3
etc
I want to use those IDs to extract data from files (the IDs occur in every file) and save the output for each of these IDs to a separate file (the IDs are protein family names and I want to get each protein from a specific family). And, once I have the name of each protein, I want to use that name to get those proteins (in .fasta format), so that they end up grouped by their family (they'll stay in the same group).
So I've tried to do it like this (I knew that it would dump all the IDs into one file):
#! /bin/bash
for file in *out
do grep -n -E 'id1|id2|id3' /directory/$file >> output; done
I would appreciate any help and I will gladly specify if not everything is clear to you.
EDIT: I will try to clarify, sorry for the inconvenience.
So there's a file called "pfamacc" with the following content:
PF12312
PF43555
PF34923
and so on. Those are the IDs I need to access the other files, which are named like "something_something.faa.out" and have a structure like this:
<acc_number> <aligment_start> <aligment_end> <pfam_acc>
RXOOOA 5 250 PF12312
OC2144 6 200 PF34923
I need those accession numbers so I can then get the protein sequences from files which look like this:
>RXOOOA
ASDBSADBASDGHH
>OC2144
SADHHASDASDCJHWINF
Assuming there is a file ids_file.txt in the same directory with the following content:
id1
id2
id3
id4
And in the same directory there is also a file called id1 with the following content:
Bla bla bla
id1
and id2
is
here id4
Then this script could help:
#!/bin/sh
IDS=$(cat ids_file.txt)
IDS_IN_ONE=$(cat ids_file.txt | tr '\n' '|' | sed -r 's/(\|)?\|$//')
echo $IDS_IN_ONE
for file in $IDS; do
grep -n -E "$IDS_IN_ONE" ./$file >> output
done
The file output has then the following result:
2:id1
3:and id2
5:here id4
Reading this as: a list needs to be cross-referenced to get a second list, which then needs to be used to gather the FASTAs.
Starting with the following 3 files...
starting_values.txt
PF12312
PF43555
PF34923
cross_reference.txt
<acc_number> <aligment_start> <aligment_end> <pfam_acc>
RXOOOA 5 250 PF12312
OC2144 6 200 PF34923
find_from_file.fasta
>RXOOOA
ASDBSADBASDGHH
>OC2144
SADHHASDASDCJHWINF
SADHHASDASDCJHWINF
>NC11111
IURJCNKAERJKADSF
for i in `cat starting_values.txt`; do awk -v var=$i 'var==$4 {print $1}' cross_reference.txt; done > needed_accessions.txt
If the FASTA is multiline, convert it to single-line first (see https://www.biostars.org/p/9262/):
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' find_from_file.fasta > find_from_file.temp
for i in `cat needed_accessions.txt`; do grep -A 1 "$i" find_from_file.temp; done > found_sequences.fasta
Final Output...
found_sequences.fasta
>RXOOOA
ASDBSADBASDGHH
>OC2144
SADHHASDASDCJHWINFSADHHASDASDCJHWINF
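One caveat worth noting, though it is not part of the original answer: grep -A 1 "$i" matches substrings, so an accession such as RXOOOA would also pull in a hypothetical RXOOOA2. Anchoring the pattern to the whole header line avoids that:
for i in `cat needed_accessions.txt`; do grep -A 1 "^>$i$" find_from_file.temp; done > found_sequences.fasta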

Reading multiple lines using read line do

OK, I'm an absolute noob at this (I only started trying to code a few weeks ago for my job), so please go easy on me.
I'm on an AIX system.
I have file1, file2 and file3 and they all contain 1 column of data (text or numerical).
file1
VBDSBQ_KFGP_SAPECC_PRGX_ACCNT_WKLY
VBDSBQ_KFGP_SAPECC_PRGX_ADDRM_WKLY
VBDSBQ_KFGP_SAPECC_PRGX_COND_WKLY
VBDSBQ_KFGP_SAPECC_PRGX_CUSTM_WKLY
VBDSBQ_KFGP_SAPECC_PRGX_EPOS_DLY
VBDSBQ_KFGP_SAPECC_PRGX_INVV_WKLY
file2
MCMILS03
HGAHJK05
KARNEK93
MORROT32
LAWFOK12
LEMORK82
file3
8970597895
0923875
89760684
37960473
526238495
146407
There will be exactly the same number of lines in each of these files.
I have another file called "dummy_file", which is the template I want to pull in, replace parts of, and drop into a new file.
WORKSTATION#JOB_NAME
SCRIPTNAME "^TWSSCRIPTS^SCRIPT"
STREAMLOGON "^TWSUSER^"
-job JOB_NAME -user USER_ID -i JOB_ID
RECOVERY STOP
There are only 3 strings I care about in this file that I want replaced, and they will always be the same for the dummy files I use in future:
JOB_NAME
JOB_ID
USER_ID
There are 2 entries for JOB_NAME and only 1 for the others. What I want is to take the raw file, replace both JOB_NAME entries with line 1 from file1, then replace USER_ID with line 1 from file2 and JOB_ID with line 1 from file3, then throw this into a new file.
I want to repeat the process for all the lines in files 1, 2 and 3, so the next block will have its entries replaced by line 2 from the three files, the one after that by line 3 from the three files, and so on.
The raw file and the expected output are below:
WORKSTATION#JOB_NAME
SCRIPTNAME "^TWSSCRIPTS^SCRIPT"
STREAMLOGON "^TWSUSER^"
-job JOB_NAME -user USER_ID -i JOB_ID
RECOVERY STOP
WORKSTATION#VBDSBQ_KFGP_SAPECC_PRGX_ACCNT_WKLY
SCRIPTNAME "^TWSSCRIPTS^SCRIPT"
STREAMLOGON "^TWSUSER^"
-job VBDSBQ_KFGP_SAPECC_PRGX_ACCNT_WKLY -user MCMILS03 -i 8970597895
RECOVERY STOP
This is as far as I got (again, I know it's crap):
file="/dir/dir/dir/file1"
while IFS= read -r line
do
cat dummy_file | sed "s/JOB_NAME/$file1/" | sed "s/JOB_ID/$file2/" | sed "s/USER_ID/$file3" #####this is where i get stuck as i dont know how to reference file2 and file3##### >>new_file.txt
done
You really don't want a do/while loop in the shell. Just do:
awk '/^WORKSTATION/{
getline jobname < "file1";
getline user_id < "file2";
getline job_id < "file3"
}
{
gsub("JOB_NAME", jobname);
gsub("USER_ID", user_id);
gsub("JOB_ID", job_id)
}1' dummy_file
This might work for you (GNU parallel and sed):
parallel -q sed 's/JOB_NAME/{1}/;s/USER_ID/{2}/;s/JOB_ID/{3}/' templateFile >newFile :::: file1 ::::+ file2 ::::+ file3
This creates newFile by appending a copy of templateFile, with the substitutions applied, for each triple of corresponding lines in file1, file2 and file3.
N.B. the ::::+ operator links the lines of file1, file2 and file3 pairwise (like paste), rather than taking the default Cartesian product of the inputs.
Using GNU awk (ARGIND and 2d arrays):
$ gawk '
NR==FNR { # store the template file
t=t (t==""?"":ORS) $0 # to t var
next
}
{
a[FNR][ARGIND]=$0 # store filen records to 2d array
}
END { # in the end
for(i=1;i<=FNR;i++) { # for each record stored from filen
t_out=t # make a working copy of the template
gsub(/JOB_NAME/,a[i][2],t_out) # replace with data
gsub(/USER_ID/,a[i][3],t_out)
gsub(/JOB_ID/,a[i][4],t_out)
print t_out # output
}
}' template file1 file2 file3
Output:
WORKSTATION#VBDSBQ_KFGP_SAPECC_PRGX_ACCNT_WKLY
SCRIPTNAME "^TWSSCRIPTS^SCRIPT"
STREAMLOGON "^TWSUSER^"
-job VBDSBQ_KFGP_SAPECC_PRGX_ACCNT_WKLY -user MCMILS03 -i 8970597895
RECOVERY STOP
...
Bash variant
#!/bin/bash
exec 5<file1 # create a file descriptor for the file with job names
exec 6<file2 # create a file descriptor for the file with user ids
exec 7<file3 # create a file descriptor for the file with job ids
dummy=$(cat dummy_file) # load the template text
output () { # create the output by inserting the new values in a copy of the dummy var
    out=${dummy//JOB_NAME/$JOB_NAME}
    out=${out//USER_ID/$USER_ID}
    out=${out//JOB_ID/$JOB_ID}
    printf '\n%s\n' "$out"
}
while read -u5 JOB_NAME; do # read one line from each file per iteration and print the block
    read -u6 USER_ID
    read -u7 JOB_ID
    output
done
From read help
$ read --help
...
-u fd read from file descriptor FD instead of the standard input
...
And a variant with paste
#!/bin/bash
dummy=$(cat dummy_file)
while read -r JOB_NAME USER_ID JOB_ID; do   # paste gives: job name, user id, job id
    out=${dummy//JOB_NAME/$JOB_NAME}
    out=${out//USER_ID/$USER_ID}
    out=${out//JOB_ID/$JOB_ID}
    printf '\n%s\n' "$out"
done < <(paste file1 file2 file3)

Bash script to efficiently return two file names that both contain a string found in a list

I'm trying to find duplicates of a string ID across files. Each of these IDs are unique and should be used in only one file. I am trying to verify that each ID is only used once, and the script should tell me the ID which is duplicated and in which files.
This is an example of the set.csv file
"Read-only",,"T","ID6776","3.1.1","Text","?"
"Read-only",,"T","ID4294","3.1.1.1","Text","?"
"Read-only","ID","T","ID7294","a )","Text","?"
"Read-only","ID","F","ID8641","b )","Text","?"
"Read-only","ID","F","ID8642","c )","Text","?"
"Read-only","ID","T","ID9209","d )","Text","?"
"Read-only","ID","F","ID3759","3.1.1.2","Text","?"
"Read-only",,"F","ID2156","3.1.1.3","
This is the very inefficient code I wrote
for ID in $(grep 'ID\"\,\"[TF]' set.csv | cut -c 23-31);
do for FILE1 in *.txt; do for FILE2 in *.txt;
do if [[ $FILE1 -nt $FILE2 && `grep -E '$ID' $FILE1 $FILE2` ]];
then echo $ID + $FILE1 + $FILE2;
fi;
done;
done;
done
Essentially I'm only interested in the IDs that are flagged as "ID" in the CSV, which would be 7294, 8641, 8642, 9209 and 3759, but not the others. If File1 and File2 both contain the same ID from this set, then it should print out the duplicated ID and each file it is found in.
There might be thousands of IDs and files, so my exponential approach isn't at all preferred. If Bash isn't up to it I'll move to sets, hashmaps and a logarithmic searching algorithm in another language... but if the shell can do it I'd like to know how.
Thanks!
Edit: a bonus would be to find which IDs from set.csv aren't used at all. Pseudo-code in another language might be: create a set of all the IDs in the csv, then build another set of the IDs found in the files, then compare the two sets. Can bash accomplish something like this?
A linear option would be to use awk to store discovered identifiers with their corresponding filename, then report when an identifier is found again, assuming the *.txt files use the same CSV layout as set.csv:
awk -F, '$2 == "\"ID\"" && ($3 == "\"T\"" || $3 == "\"F\"") {
id=substr($4,4,4)
if(ids[id]) {
print id " is in " ids[id] " and " FILENAME;
} else {
ids[id]=FILENAME;
}
}' *.txt
The awk script looks through every *.txt file; it splits the fields based on commas (-F,). If field 2 is "ID" and field 3 is "T" or "F", then it extracts the numeric ID from field 4. If that ID has been seen before, it reports the previous file and the current filename; otherwise, it saves the id with an association to the current filename.
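For the bonus question (IDs from set.csv that are never used in any file), here is a rough sketch in the same spirit; it assumes the IDs look like "ID" followed by digits, reuses the field logic of the awk above, and the intermediate file names are only placeholders:
# IDs of interest from set.csv, kept as the full IDnnnn token
awk -F, '$2 == "\"ID\"" && ($3 == "\"T\"" || $3 == "\"F\"") { gsub(/"/,"",$4); print $4 }' set.csv | sort -u > defined_ids.txt
# every IDnnnn token that occurs anywhere in the *.txt files
grep -hoE 'ID[0-9]+' *.txt | sort -u > used_ids.txt
# lines present in defined_ids.txt but missing from used_ids.txt
comm -23 defined_ids.txt used_ids.txt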

Unix - Bash - How to split a file according to specific rules

I have thousands of files on Unix that I need to split into two parts, according to the following rules:
1) Find the first occurrence of the string ' JOB ' in the file
2) Find the first line after the occurrence found in point 1) which doesn't end with a comma ','
3) Split the file after the line found in point 2)
Below is a sample file; this one should be split after the line ending with the string 'DUMMY'.
//*%OPC SCAN
//*%OPC FETCH MEMBER=$BUDGET1,PHASE=SETUP
// TESTJOB JOB USER=TESTUSER,MSGLEVEL=5,
// CLASS=H,PRIORITY=10,
// PARAM=DUMMY
//*
//STEP1 EXEC DB2OPROC
//...
How can I achieve this?
Thanks
You can use sed for this task:
$ cat data1
//*%OPC SCAN
//*%OPC FETCH MEMBER=$BUDGET1,PHASE=SETUP
// TESTJOB JOB USER=TESTUSER,MSGLEVEL=5,
// CLASS=H,PRIORITY=10,
// PARAM=DUMMY
//*
//STEP1 EXEC DB2OPROC
//...
$ sed -n '0,/JOB/ p;/JOB/,/[^,]$/ p' data1 | uniq > part1
$ sed '0,/JOB/ d;0,/[^,]$/ d' data1 > part2
$ cat part1
//*%OPC SCAN
//*%OPC FETCH MEMBER=$BUDGET1,PHASE=SETUP
// TESTJOB JOB USER=TESTUSER,MSGLEVEL=5,
// CLASS=H,PRIORITY=10,
// PARAM=DUMMY
$ cat part2
//*
//STEP1 EXEC DB2OPROC
//...
$
My solution is:
find all files to be checked;
grep each file for the specified pattern with -n to get the line number of the match;
split the matching file with head or tail at the line number obtained in step two.
What's more, grep can handle regular expressions, such as grep -n "^.*JOB.*[^,]$" filename.
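A rough sketch of that approach (the *.jcl glob and the .part1/.part2 names are just placeholders):
for f in *.jcl; do
    job=$(awk '/ JOB /{ print NR; exit }' "$f")                          # 1) first line containing ' JOB '
    [ -n "$job" ] || continue                                            # no JOB card, nothing to split
    split=$(awk -v j="$job" 'NR > j && !/,$/ { print NR; exit }' "$f")   # 2) first later line not ending in ','
    [ -n "$split" ] || continue
    head -n "$split" "$f"          > "${f}.part1"                        # 3) split after that line
    tail -n +"$((split + 1))" "$f" > "${f}.part2"
done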
You can do this in a couple of steps using awk/sed:
line=`awk '/JOB/,/[^,]$/ {x=NR} END {print x}' filename`
next=`expr $line + 1`
sed -ne "1,$line p" filename > part_1
sed -ne "$next,\$ p" filename > part_2
where filename is the name of your file. This will create two files: part_1 and part_2.

combine lines of csv in bash

I want to create a new csv file for each city by combining several csv files with rows and columns; one column has the name of the city, which repeats in all the csv files.
For example,
I have files named by date, YYYYMMDD: 20140713.csv, 20140714.csv, 20140715.csv...
They have the same structure and the same number of rows and columns; for example, 20140713.csv:
City, Data, TMinreal, TMaxreal, TMinext, TMaxext, DiffTMin, DiffTMax
Milano,20140714,19.0,28.8,18,27,1,1.8
Rome,20140714,18.1,29.3,14,29,4.1,0.3
Pisa,20140714,10.8,27.5,8,29,2.8,-1.5
Venecia,20140714,21.1,29.1,16,27,5.1,2.1
I want to combine all these csv files and get one csv file per city, such as Milano.csv, containing the information about that city from all the combined csv files.
For example, if I combine 20140713.csv, 20140714.csv and 20140715.csv, then Milano.csv would be:
Milano,20140713,19.0,28.8,18,26,1,2.8
Milano,20140714,19.0,28.8,20,27,-1,1.8
Milano,20140715,21.0,26.8,19,27,2,-0.2
Any idea? Thank you.
untested, but this should work:
awk -F, 'FNR==1{next} {file = $1".csv"; print > file}' 20*.csv
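Also untested, and not part of the original answer: with a very large number of distinct cities, some awk implementations run out of simultaneously open output files. A variant of the same one-liner that closes each file after writing avoids that (note the >> so earlier writes are not truncated; remove any old per-city files before a rerun):
awk -F, 'FNR==1{next} {file = $1".csv"; print >> file; close(file)}' 20*.csv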
You can have this bash script:
#!/bin/bash
for FILE; do
    {
        read            ## Skip header
        while IFS=, read -r A B; do
            echo "$A,$B" >> "$A".csv
        done
    } < "$FILE"
done
Then run as:
bash script.sh file1.csv file2.csv ...
