Hi, my situation is similar to "Grep group of lines", but slightly different.
I have a file in the format of:
> xxxx AB=AAA NNN xxxx CD=DDD xxxxx
xxx
xxx
xxx
xxx
xxx
>xxxx AB=AAA JJJ xxxx CD=EEE xxxxx
xxx
xxx
xxx
xxx
xxx
xxx
xxx
>xxxx AB=AAA NNN xxxx CD=FFF xxxxx
xxx
xxx
xxx
xxx
>xxxx AB=EEE FFF xxxx CD=GGG xxxxx
xxx
xxx
xxx
xxx
xxx
xxx
(The items starting with > do not necessarily have the same number of xxx lines. Each xxx is a string of capital letters, and the only cue that a record is complete is that the next line starts with >.)
First, I want to grep all items with AB=EEE FFF into a result file like the one below:
>xxxx AB=EEE FFF xxxx CD=GGG xxxxx
xxx
xxx
xxx
xxx
xxx
>xxxx AB=EEE FFF xxxx CD=TTT xxxxx
xxx
xxx
xxx
xxx
>xxxx AB=EEE FFF xxxx CD=EEE xxxxx
xxx
xxx
xxx
xxx
xxx
xxx
Then, I have a csv file with a list of CD items, and I want to grep all items with CD=xxx, where xxx is a line in the csv file.
A sample of an item is:
>sp|P01023|A2MG_HUMAN Alpha-2-macroglobulin OS=Homo sapiens OX=9606 GN=A2M PE=1 SV=3
MGKNKLLHPSLVLLLLVLLPTDASVSGKPQYMVLVPSLLHTETTEKGCVLLSYLNETVTV
SASLESVRGNRSLFTDLEAENDVLHCVAFAVPKSSSNEEVMFLTVQVKGPTQEFKKRTTV
MVKNEDSLVFVQTDKSIYKPGQTVKFR
AB in my example corresponds to OS here, and CD corresponds to GN (so it is a single string of capital letters and/or numbers).
My csv file looks like (with ~1000 lines):
A2M
AIF1
Thanks a lot!
Your question doesn't have much in the way of testable sample data but something like this might be a starting point:
awk -v s1='AB=EEE FFF' -v s2='CD' -v out='out.dat' '
/^>/ {
if ( ok = index($0,s1) )
for ( i=1; i<=NF; i++ )
if ( index($i, s2"=")==1 )
print substr( $i, index($i,"=")+1 )
}
ok { print >out }
' in.dat |\
grep -Fx -f - in.csv >out.csv
use awk to process in.dat:
look for lines starting > and if found:
set flag based on presence/absence of desired string s1 (flag will remain set until re-tested at next > line)
if s1 present, search for a field that starts with string s2 followed by =
if found, write section after = to stdout
(for efficiency, one could break out of the for here)
if ok flag is set, copy the line to out.dat
awk's stdout is piped into grep.
use grep to search for fixed strings listed in awk's output that match an entire line of in.csv, and save results to out.csv
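For the second step of the question (keeping whole records whose GN value appears in the csv list), one possible sketch reads the list into an awk array first and then filters the records. The sample records and the temporary file names below are made up for illustration, and the csv is assumed to hold one plain value per line with Unix line endings.

```shell
# Hypothetical sample data, created in a temporary directory for this sketch.
tmp=$(mktemp -d)
cat >"$tmp/in.dat" <<'EOF'
>sp|P1|A OS=Homo sapiens OX=9606 GN=A2M PE=1 SV=3
MGKNKLLH
SASLESVR
>sp|P2|B OS=Mus musculus OX=10090 GN=AIF1 PE=1 SV=1
MVKNEDSL
>sp|P3|C OS=Homo sapiens OX=9606 GN=ZZZ PE=1 SV=1
MAAAA
EOF
printf '%s\n' A2M AIF1 >"$tmp/in.csv"

result=$(awk '
    NR == FNR { want[$1]; next }   # first file: remember every csv value
    /^>/ {                         # header line: decide for the whole record
        keep = 0
        for (i = 1; i <= NF; i++)
            if (index($i, "GN=") == 1 && (substr($i, 4) in want)) { keep = 1; break }
    }
    keep                           # print header and sequence lines while the flag is set
' "$tmp/in.csv" "$tmp/in.dat")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Because the flag is only re-evaluated on lines starting with >, the sequence lines that follow a header inherit its decision, which is the same idea as the ok flag above.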
The title of my question is very similar to other posts, but I haven't found anything on my specific example. I have to read in a text file as "$1", then put its lines into an array, line by line. Example:
myscript.sh /path/to/file
My question is: would this approach work?
1 #!/bin/bash
2 file="$1"
3 readarray array < file
Would this code treat the "path/to/file" as "$1", then place that path into the variable "file"? And if that part works correctly, I believe line 3 should properly put the lines into an array, correct?
This is the contents of the text file:
$ head short-rockyou.txt
290729 123456
79076 12345
76789 123456789
59462 password
49952 iloveyou
33291 princess
21725 1234567
20901 rockyou
20553 12345678
16648 abc123
.
.
.
I hope this is enough information to help
Very close. :)
#!/usr/bin/env bash
case $BASH_VERSION in ''|[123].*) echo "ERROR: Bash 4.0 needed" >&2; exit 1;; esac
file="$1"
readarray -t array <"$file"
declare -p array >&2 # print the array to stderr for demonstration/proof-of-concept
Note the use of the -t argument to readarray (to discard trailing newlines), and the use of $file rather than just file.
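A minimal sketch of what the resulting array looks like, assuming Bash 4 or newer. The three-line sample file is made up; it only stands in for the real password list.

```shell
#!/usr/bin/env bash
# Hypothetical sample file standing in for the real input.
file=$(mktemp)
printf '%s\n' '290729 123456' '79076 12345' '76789 123456789' >"$file"

readarray -t array <"$file"     # -t drops the trailing newline of each line

count=${#array[@]}              # number of elements, one per input line
second=${array[1]}              # indices start at 0
for i in "${!array[@]}"; do     # iterate over the indices
    printf '%d: <%s>\n' "$i" "${array[i]}"
done
rm -f "$file"
```

Each element holds one complete line, including the space between the two columns, so later splitting into columns is a separate step.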
I use the following for placing the lines of a file in an array:
IFS=$'\r\n' GLOBIGNORE='*' command eval 'array=($(<filename))'
This reads every line, with all of its columns, into the array so you can work with it later.
Edit: Explanations on the procedure above:
IFS=$'\r\n': IFS stands for "internal field separator". It is used by the shell to determine how to do word splitting, i.e. how to recognize word boundaries.
GLOBIGNORE='*': From the bash's manual page: A colon-separated list of patterns defining the set of filenames to be ignored by pathname expansion. If a filename matched by a pathname expansion pattern also matches one of the patterns in GLOBIGNORE, it is removed from the list of matches.
command eval: The addition of command eval allows for the expression to be kept in the present execution environment
array=...: Simply the definition.
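One difference from readarray that may matter: because a newline is IFS whitespace, consecutive newlines count as a single separator, so empty lines do not become array elements with the word-splitting idiom. A small sketch, assuming Bash 4 or newer and a made-up three-line file whose middle line is empty:

```shell
#!/usr/bin/env bash
# Hypothetical sample file: the middle line is intentionally empty.
file=$(mktemp)
printf '%s\n' 'first' '' 'last' >"$file"

# Same idiom as above, reading from "$file" instead of a fixed filename.
IFS=$'\r\n' GLOBIGNORE='*' command eval 'split=($(<"$file"))'
# readarray keeps every line, including the empty one.
readarray -t exact <"$file"

echo "word splitting: ${#split[@]} elements"
echo "readarray -t:   ${#exact[@]} elements"
rm -f "$file"
```

Here the word-splitting array ends up with two elements and the readarray one with three, which is worth keeping in mind if line positions are significant.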
There are several threads on Stack Overflow and Stack Exchange with more details on this:
https://unix.stackexchange.com/questions/184863/what-is-the-meaning-of-ifs-n-in-bash-scripting
https://unix.stackexchange.com/questions/105465/how-does-globignore-work
Read lines from a file into a Bash array
Then I just loop around the array like this:
for (( b = 0; b < ${#array[@]}; b++ )); do
# Do something
done
This could be a matter of opinion; please wait for more comments.
Edit: Use case with empty lines and globs
After yesterday's comments, I finally had time to test the suggestions (empty lines, lines with globs).
In both cases the array works fine in conjunction with awk. In the following example I print only column 2 into a new text file:
IFS=$'\r\n' GLOBIGNORE='*' command eval 'array=($(<'$1'))'
for (( b = 0; b < ${#array[@]}; b++ )); do
echo "${array[b]}" | awk -F "/| " '{print $2}' >> column2.txt
done
Starting with the following text file:
290729 123456
79076 12345
76789 123456789
59462 password
49952 iloveyou
33291 princess
21725 1234567
20901 rockyou
20553 12345678
16648 abc123
20901 rockyou
20553 12345678
16648 abc123
/*/*/*/*/*/*
20901 rockyou
20553 12345678
16648 abc123
Clear empty lines and globs in the script.
The result of the execution is the following:
123456
12345
123456789
password
iloveyou
princess
1234567
rockyou
12345678
abc123
rockyou
12345678
abc123
*
rockyou
12345678
abc123
Clear evidence that the array is working as expected.
Execution example:
adama@galactica:~$ ./processing.sh test.txt
adama@galactica:~$ cat column2.txt
123456
12345
123456789
password
iloveyou
princess
1234567
rockyou
12345678
abc123
rockyou
12345678
abc123
*
rockyou
12345678
abc123
Should we wish to remove empty lines (it doesn't make sense to me to have them in the output), we can do it in awk by changing the following line:
echo "${array[b]}" | awk -F "/| " '{print $2}' >> column2.txt
adding /./
echo "${array[b]}" | awk -F "/| " '/./ {print $2}' >> column2.txt
End Result:
123456
12345
123456789
password
iloveyou
princess
1234567
rockyou
12345678
abc123
rockyou
12345678
abc123
*
rockyou
12345678
abc123
Should you wish to apply it to the whole file (not column by column) you can take a look at the following thread:
AWK remove blank lines
Edit: Security concern on rm:
I actually went ahead and placed $(rm -rf ~) in the test file to test what would happen on a virtual machine:
Test.txt contents now:
290729 123456
79076 12345
76789 123456789
59462 password
49952 iloveyou
33291 princess
21725 1234567
20901 rockyou
20553 12345678
16648 abc123
$(rm -rf ~)
20901 rockyou
20553 12345678
16648 abc123
/*/*/*/*/*/*
20901 rockyou
20553 12345678
16648 abc123
Execution:
adama@galactica:~$ ./processing.sh test.txt
adama@galactica:~$ ll
total 28
drwxr-xr-x 3 adama adama 4096 dic 1 22:41 ./
drwxr-xr-x 3 root root 4096 dic 1 19:27 ../
drwx------ 2 adama adama 4096 dic 1 22:38 .cache/
-rw-rw-r-- 1 adama adama 144 dic 1 22:41 column2.txt
-rwxr-xr-x 1 adama adama 182 dic 1 22:41 processing.sh*
-rw-r--r-- 1 adama adama 286 dic 1 22:39 test.txt
-rw------- 1 adama adama 1545 dic 1 22:39 .viminfo
adama@galactica:~$ cat column2.txt
123456
12345
123456789
password
iloveyou
princess
1234567
rockyou
12345678
abc123
-rf
rockyou
12345678
abc123
*
rockyou
12345678
abc123
No effect on the system.
Note: I am using Ubuntu 18.04 x64 LTS on a VM. Best not to try testing the security issue as root.
Edit: set -f necessity:
adama@galactica:~$ ./processing.sh a
adama@galactica:~$ cat column2.txt
[a]
adama@galactica:~$
Works perfectly without set -f
BR
I have an input file, and if any line in that file contains a particular keyword, I want to overwrite the data at a particular column position (say, populate columns 10 to 15 with xxxxxx). I am new to shell scripting; please forgive me if I sound naive.
Sample Input:
aaaaa 11 ****** bacxyz more data
bbbbb 11 ****** qweabc more data
ccccc 11 ****** pqrxyz more data
aaaaa 11 ****** jkkxyz more data
Expected Output: (If a line contains aaaaa at any position, overwrite columns 10 to 15 with xxxxxx; otherwise write it as is.)
aaaaa 11 ****** xxxxxx more data
bbbbb 11 ****** qweabc more data
ccccc 11 ****** pqrxyz more data
aaaaa 11 ****** xxxxxx more data
This sed may work for you:
sed -E '/aaaaa/ s/^(.{16}).{6}(.*)/\1xxxxxx\2/' file
aaaaa 11 ****** xxxxxx more data
bbbbb 11 ****** qweabc more data
ccccc 11 ****** pqrxyz more data
aaaaa 11 ****** xxxxxx more data
By the way, in your expected output the replaced positions are 17-22, not 10-15.
You could use the following Vim command:
:g/aaaaa/s/\%>16c\%<23c./x/g
This will replace characters between columns 16 and 23 with x. The replacement is performed in lines that contain the text aaaaa.
Column data is usually easy with awk - "if the first field is aaaaa, make the fourth field xxxxxx; then print the line, whatever it is":
awk '$1=="aaaaa"{$4="xxxxxx"}{print}' filename
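Assigning to a field makes awk rebuild the line with single spaces between fields, so for strictly positional data a substr-based variant may be a safer sketch. The values start=17 and len=6 follow the positions mentioned in the sed answer; the variable names are made up for this example.

```shell
# Two lines taken from the sample input in the question.
result=$(printf '%s\n' \
    'aaaaa 11 ****** bacxyz more data' \
    'bbbbb 11 ****** qweabc more data' |
awk -v start=17 -v len=6 -v fill='xxxxxx' '
    # On matching lines, splice the fill text in by character position.
    /aaaaa/ { $0 = substr($0, 1, start - 1) fill substr($0, start + len) }
    { print }
')
printf '%s\n' "$result"
```

Everything outside the overwritten range is copied as is, so any original spacing in the line is preserved.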
Another global command
:g/^a/norm 3wvt rx
I want to catch the rows containing "Will_Liu>" from massive_data.txt if n < m or n==0 or m==0. One block of the format looks like this:
cat massive_data.txt
Will_Liu> set Name.* xxx
============================================
Id Name Para status
============================================
1 name-1 xxxxx OK
2 name-2 xxxxx OK
3 name-3 xxxxx Not_OK
. ... .... OK
. ... .... OK
m name-m .... Not_OK
============================================
Total: m name attempted, n name set OK
In the above, "m" and "n" are variables. If n < m or n==0 or m==0, print the rows containing "Will_Liu>";
if n==m and both are nonzero, skip the block.
So far I can only use "grep" and "sed" to pick out the key lines, like this:
cat test.txt
Will_Liu> set Name_group1 xxx
============================================
Id Name Para status
============================================
1 name-1 xxxxx OK
2 name-2 xxxxx OK
3 name-3 xxxxx Not_OK
============================================
Total: 3 name attempted, 2 name set OK
Will_Liu> set Name_group2 yyy
============================================
Id Name Para status
============================================
1 name-4 xxxxx OK
2 name-5 xxxxx Not_OK
3 name-6 xxxxx Not_OK
============================================
Total: 3 name attempted, 1 name set OK
I could use "sed" and "grep" command like this:
sed -n "/Total: 3 name attempted,/p" test.txt
Total: 3 name attempted, 2 name set OK
Total: 3 name attempted, 1 name set OK
grep -B 9 "Total: 3 name attempted" test.txt | sed -n '/Will_Liu>/p'
Will_Liu> set Name_group1 xxx
Will_Liu> set Name_group2 yyy
In the grep command, the 9 is 3+6; the 6 comes from the format of the structure and is a fixed value.
So how can I introduce two variables for "m" and "n" and improve my code to get the expected result from massive_data.txt?
My expected output:
Will_Liu> set Name1 xxx
Will_Liu> set Name2 yyy
Will_Liu> set Name3 zzz
. . .
. . .
. . .
In general, the earlier line you want to print itself matches some pattern. In such cases it is better to store the last candidate and, when you reach your condition, decide what to do with it. For example:
awk '/^Will_Liu/{
last_will=$0
}
/^Total/{
m=$2; n=$5
if (m>n || (m==0 && n==0))
print last_will
}' file
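A runnable sketch of the same idea, with the condition written out as in the question (n < m or n==0 or m==0). The three blocks below are made-up sample data in the question's format; in particular, the layout of a block with zero rows is an assumption.

```shell
# Hypothetical sample input: one partial failure, one full success, one empty block.
tmp=$(mktemp)
cat >"$tmp" <<'EOF'
Will_Liu> set Name_group1 xxx
============================================
Id Name Para status
============================================
1 name-1 xxxxx OK
2 name-2 xxxxx Not_OK
============================================
Total: 2 name attempted, 1 name set OK
Will_Liu> set Name_group2 yyy
============================================
Id Name Para status
============================================
1 name-3 xxxxx OK
============================================
Total: 1 name attempted, 1 name set OK
Will_Liu> set Name_group3 zzz
============================================
Id Name Para status
============================================
============================================
Total: 0 name attempted, 0 name set OK
EOF

result=$(awk '
    /^Will_Liu>/ { last_will = $0 }     # remember the most recent command line
    /^Total:/ {
        m = $2 + 0; n = $5 + 0          # force numeric values
        if (n < m || m == 0 || n == 0) print last_will
    }
' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

With this input only the group1 and group3 lines are printed; group2 is skipped because n equals m and both are nonzero.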
In cases where there is no pattern to select the last candidate, and you have to compute which line to print from data on the matched line, you could pass over the file twice, use tac to reverse the input, keep all previous lines in an array, or use a similar approach. These approaches can be inefficient. For example, storing all lines, which is not recommended for your case:
awk '{ line[NR]=$0 }
/^Total/{
m=$2; n=$5
if (m>n || (m==0 && n==0))
print line[NR-(m+5)]
}' file
Example input file:
xxx-xxx(-) xxx xxx xxx - 2e-15 Cytochrome b-c1 complex subunit 9 xxx xxx:241-77(-)
xxx-xxx(+) xxx xxx xxx + 3e-24 Probable endo-beta-1,4-glucanase D xxx xxx:241-77(+)
I've been trying sed, but without success. I can see that the following two things work:
rev file|sed -e 's/-/M/'|rev
rev file|sed -e 's/)/M/'|rev
But, - and ) together do not work:
rev file|sed -e 's/-)/M/'|rev
That's because rev reverses each line: -) does not occur in the reversed text; it appears as )- instead:
rev file|sed -e 's/)-/M/'|rev
You don't need multiple commands with chains of pipes or fancy operations - since sed's regexps are greedy, all you need is:
$ sed 's/\(.*\)-)/\1M/' file
xxx-xxx(-) xxx xxx xxx - 2e-15 Cytochrome b-c1 complex subunit 9 xxx xxx:241-77(M
xxx-xxx(+) xxx xxx xxx + 3e-24 Probable endo-beta-1,4-glucanase D xxx xxx:241-77(+)
A general approach for "replace the nth-to-last one of something" with pure (GNU) sed
We want to replace "something", in this case -), with something unique not found elsewhere in your input, say ~B. To make sure that this sequence isn't in your input, we first replace all ~ with ~A:
sed 's/~/~A/g' infile
Replace all "something", in this case -), with ~B, of which we now know that it'll be unique:
sed 's/-)/~B/g'
Now your input file looks like this (slightly edited so it fits the line width here):
xxx-xxx(~B xxx - 2e-15 Cytochrome b-c1 complex subunit 9 xxx xxx:241-77(~B
xxx-xxx(+) xxx + 3e-24 Probable endo-beta-1,4-glucanase D xxx xxx:241-77(+)
The next command does this: "as long as the line has n + 1 of ~B, replace the first one with -)". The :a and ta are a label to branch to and a conditional branch ("go to label :a if a substitution took place"):
sed ':a;/~B\(.*~B\)\{1\}/s/~B/-)/;ta'
For the case of n = 1, i.e., when we want to replace the last occurrence, the quantifier \{1\} is of course not needed, but it can be changed for other values of n.
The input file now has a unique ~B where the last -) used to be:
xxx-xxx(-) xxx - 2e-15 Cytochrome b-c1 complex subunit 9 xxx xxx:241-77(~B
xxx-xxx(+) xxx + 3e-24 Probable endo-beta-1,4-glucanase D xxx xxx:241-77(+)
We replace that single ~B:
sed 's/~B/M/'
resulting in
xxx-xxx(-) xxx - 2e-15 Cytochrome b-c1 complex subunit 9 xxx xxx:241-77(M
xxx-xxx(+) xxx + 3e-24 Probable endo-beta-1,4-glucanase D xxx xxx:241-77(+)
The rest of the ~B can now be replaced with what they were, -) (a no-op in this case):
sed 's/~B/-)/g'
Finally, we undo the first substitution (which has no effect for this example as the input had no ~ to start with):
sed 's/~A/~/g'
All in a single line:
sed 's/~/~A/g;s/-)/~B/g;:a;/~B\(.*~B\)\{1\}/s/~B/-)/;ta;s/~B/M/;s/~B/-)/g;s/~A/~/g' infile
Or, for readability, over multiple lines:
sed '
s/~/~A/g
s/-)/~B/g
:label
/~B\(.*~B\)\{1\}/s/~B/-)/
t label
s/~B/M/
s/~B/-)/g
s/~A/~/g
' infile
Naturally, for the case of n = 1, there are much simpler solutions, like Ed Morton's answer.
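To illustrate a value of n other than 1, here is a sketch of the same script with the quantifier changed to \{2\}, which should replace the second-to-last -). The one-line input is made up so that the effect is easy to see.

```shell
# Hypothetical input with four occurrences of "-)".
result=$(printf '%s\n' 'a-) b-) c-) d-)' | sed '
# protect any existing ~, then mark every -) as ~B
s/~/~A/g
s/-)/~B/g
:label
# while at least three ~B remain, turn the first one back into -)
/~B\(.*~B\)\{2\}/s/~B/-)/
t label
# two ~B are left: the first of them is the second-to-last occurrence
s/~B/M/
s/~B/-)/g
s/~A/~/g
')
printf '%s\n' "$result"
```

For this input the result is a-) b-) cM d-), with only the third of the four occurrences replaced.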
What I want to do is simply add a column of numbers to a huge file:
xxx xxxxx xxxx
xxx xxxxx xxxx
xxx xxxxx xxxx
xxx xxxxx xxxx
xxx xxxxx xxxx
To get the following output:
xxx 1 xxxx xxxxx
xxx xxxx xxxx
xxx 2 xxxx xxxxx
xxx xxxx xxxx
xxx 3 xxxx xxxxx
I tried something with awk '{print NR % 2==1 etc ...} but it doesn't work
Any suggestion?
Many thanks in advance
You're on the right track
awk 'NR%2 { $1 = $1" "++i}; 1;' file.txt
NR%2 evaluates to true for odd-numbered lines. The assignment replaces the first field with its own value followed by a space and a counter, which is incremented (starting from 0) before it is appended, so the first number printed is 1. The 1; always evaluates to true and applies the default action (print) to the line. The longer-but-clearer equivalent is NR%2 { $1 = $1" "++i}; {print}.
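A quick way to check the behaviour is to feed the command a few distinguishable lines. The sample lines are made up, and the increment is parenthesized here only to make the grouping explicit.

```shell
# Three hypothetical input lines; only the odd-numbered ones get a counter.
result=$(printf '%s\n' 'a b c' 'd e f' 'g h i' |
    awk 'NR % 2 { $1 = $1 " " (++i) } 1')
printf '%s\n' "$result"
```

Lines 1 and 3 become "a 1 b c" and "g 2 h i", while line 2 is printed unchanged. Note that assigning to $1 makes awk rebuild the modified lines with single spaces between fields.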
perl -lane 'if ($. % 2 == 1){$n++; print "$F[0] $n @F[1..$#F]"} else{print}' file.txt
produces the output:
xxx 1 xxxxx xxxx
xxx xxxxx xxxx
xxx 2 xxxxx xxxx
xxx xxxxx xxxx
xxx 3 xxxxx xxxx
Explanation:
-n loop around every line of the input file, put the line in the $_ variable, do not automatically print every line
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode – split input lines into the @F array.
-e execute the perl code
$. is the line number
@F is the array of words in each line, indexed starting with 0
$#F is the index of the last element of @F (one less than the number of words)
@F[1..$#F] is an array slice of element 1 through the last element