I'm a novice using grep/egrep/awk and have not wrapped my head around regular expressions (bonus: a link to an introduction to regex for someone who has zero programming experience would be great).
My question revolves around matching a number range within a flat file. I have values which are ten digits. Telephone numbers...
I'm attempting to match a range of numbers that crosses a tens boundary, for example:
5551212041 through 5551212050 (41, 42, 43, 44, 45, 46, 47, 48, 49, and 50).
I have been using grep to match the first value like this.
grep 555121204[1-9]
The next step is to grep for the final value:
grep 5551212050
I have not found the right way to write a regex that does this in a single grep.

Try the grep commands below. The first uses the -P (Perl-compatible regex) option; the second does the same with -E (extended regex).
grep -P '55512120(?:4[1-9]|50)' file
grep -E '555121204[1-9]|5551212050' file
Either command prints the lines that contain a number in the range 5551212041 to 5551212050.
If you want to print only the number, add the -o option to the grep command:
grep -oP '55512120(?:4[1-9]|50)' file
$ cat file
bar foo
5551212040 Don't match
5551212041 Match
5551212050 Match
foo bar
$ grep -P '55512120(?:4[1-9]|50)' file
5551212041 Match
5551212050 Match

For the general case, where the number range is not easy to express as a regex, Awk is probably better, as it has proper support for arithmetic.
awk '(($1 > 123) && ($1 < 1024)) || (($1 > 2048) && ($1 < 65536))' file
This prints the entire matching line; if you only want to print the second field, add { print $2 } etc.
You can learn enough Awk to figure this out on your own with a good tutorial and 30 minutes; see the Stack Overflow awk tag info page for pointers.
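Applied to the telephone-number range from the question, the arithmetic approach might look like the sketch below. It assumes the number is the first whitespace-separated field; the sample lines are the ones used in the grep demonstration earlier in this thread, and the variable names (sample, matches, lo, hi) are made up for illustration.

```shell
# Sample input (the same lines as in the grep demonstration).
sample=$(mktemp)
cat > "$sample" <<'EOF'
bar foo
5551212040 Don't match
5551212041 Match
5551212050 Match
foo bar
EOF

# Keep lines whose first field is all digits and lies in [lo, hi].
matches=$(awk -v lo=5551212041 -v hi=5551212050 \
    '$1 ~ /^[0-9]+$/ && $1 + 0 >= lo + 0 && $1 + 0 <= hi + 0' "$sample")

printf '%s\n' "$matches"
rm -f "$sample"
```

The explicit digit check keeps non-numeric first fields such as "bar" out of the numeric comparison.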


How can I replace a number with one higher

I have a bash script for macOS, which replaces the last character (a number) on the first line of a document with the number plus one. For example, if the number was 19, it would be replaced with 20 (I know that the code only replaces the last character). Here's the code that I currently have:
#For some reason the -e (not -E) is necessary here
RC01=$(sed -ne "1s/.*\(.\)$/\1/p" ./document)
let "RC02++"
sed -i '' -n "1s/${RC01}/${RC02}/" ./document
When the last line runs, if the document has a first line with anything on it, it clears the whole document.
What's going wrong, and how can I fix it?
Your immediate problem is that sed -n suppresses all printing of output lines. Combined with -i, this means the file is replaced with that (empty) output.
You can fix this easily by removing the -n from the last command:
RC01=$(sed -n "1s/.*\(.\)$/\1/p" ./document)
RC02=$((RC01 + 1))  # compute the new value from the extracted one
sed -i '' "1s/${RC01}\$/${RC02}/" ./document
This still has the problem that you only increment the very last digit; if your number has more than one digit, this will lead to weirdness like 18, 19, 110, 111 rather than 18, 19, 20, 21. To avoid that, extract all the trailing digits instead of just the last character:
RC01=$(sed -n "s/\(.*[^0-9]\)\([0-9]*\)$/\2/p;q" ./document)
RC02=$((RC01 + 1))  # compute the new value from the extracted one
sed -i '' "1s/${RC01}\$/${RC02}/" ./document
The -i feature of sed is hard to beat for replacing a file in place. If it weren't for that, I would definitely recommend Awk instead. Here's a quick attempt.
awk 'NR == 1 { $NF = 1+$NF }1' ./document >./document.tmp &&
mv ./document.tmp ./document
Perl has the best of both worlds, with sugar on top:
perl -pi -e 's/(\d+)$/1+$1/e if $. == 1' ./document
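One thing to keep in mind with the Awk version: assigning to $NF makes Awk rebuild the line with single spaces between the fields. If the whitespace on the first line matters, a variant that only touches the trailing digits could look like this sketch. The sample file contents and the variable names are invented for the demonstration.

```shell
doc=$(mktemp)
printf 'build   number 19\nsecond line 7\n' > "$doc"

# On line 1, find the run of digits at the end of the line and add one.
# match() sets RSTART to the position where that run begins.
awk 'NR == 1 && match($0, /[0-9]+$/) {
         $0 = substr($0, 1, RSTART - 1) (substr($0, RSTART) + 1)
     }
     { print }' "$doc" > "$doc.tmp" && mv "$doc.tmp" "$doc"

result=$(cat "$doc")
printf '%s\n' "$result"
rm -f "$doc"
```

Only the first line changes (19 becomes 20), and the spacing in front of the number is left alone.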

Script using sed and grep gives unintended output

I have a "source.fasta" file that contains information in the following format:
>TRINITY_DN80_c0_g1_i1 len=723 path=[700:0-350 1417:351-368 1045:369-722] [-1, 700, 1417, 1045, -2]
>TRINITY_DN83_c0_g1_i1 len=371 path=[1:0-173 152:174-370] [-1, 1, 152, -2]
>TRINITY_DN83_c0_g1_i2 len=218 path=[1:0-173 741:174-217] [-1, 1, 741, -2]
>TRINITY_DN99_c0_g1_i1 len=326 path=[1:0-242 221:243-243 652:244-267 246:268-325] [-1, 1, 221, 652, 246, -2]
>TRINITY_DN90_c0_g1_i1 len=1240 path=[1970:0-527 753:528-1239] [-1, 1970, 753, -2]
>TRINITY_DN84_c0_g1_i1 len=301 path=[1:0-220 358:221-300] [-1, 1, 358, -2]
>TRINITY_DN84_c0_g1_i2 len=301 path=[1:0-220 199:221-300] [-1, 1, 199, -2]
>TRINITY_DN72_c0_g1_i1 len=434 path=[412:0-247 847:248-271 661:272-433] [-1, 412, 847, 661, -2]
>TRINITY_DN75_c0_g1_i1 len=478 path=[456:0-477] [-1, 456, -2]
There are close to 400,000 sequences in this file.
I have another file ids.txt in the following format:
I have 100 sequence ids in this file. When I match these ids to the source file I want an output that gives me the match for each of these ids with the entire sequence.
For example, for an id:
I want my output to be:
I want all hundred sequences in this format.
I used this code:
while read p; do
echo ''$p >> out.fasta
grep -A 400000 -w $p source.fasta | sed -n -e '1,/>/ {/>/ !{'p''}} >> out.fasta
done < ids.txt
But my output is different, in that only the last id has a sequence and the rest don't have any sequence associated with them:
I am only getting the desired output for the 100th id from my ids.txt. Could someone help me find where my script is wrong? I would like to get all 100 sequences when I run the script.
Thank you
I have added Google Drive links to the files I am working with:
Repeatedly looping over a large file is inefficient; you really want to avoid running grep (or sed or awk) more than once if you can avoid it. Generally speaking, sed and Awk will often easily allow you to specify actions for individual lines in a file, and then you run the script on the file just once.
For this particular problem, the standard Awk idiom with NR==FNR comes in handy. This is a mechanism which lets you read a number of keys into memory (concretely, when NR==FNR it means that you are processing the first input file, because the overall input line number is equal to the line number within this file) and then check if they are present in subsequent input files.
Recall that Awk reads one line at a time and performs all the actions whose conditions match. The conditions are a simple boolean, and the actions are a set of Awk commands within a pair of braces.
awk 'NR == FNR { s[$0]; next }
# If we fall through to here, we have finished processing the first file.
# If we see a wedge and p is 1, reset it -- this is a new sequence
/^>/ && p { p = 0 }
# If the prefix of this line is in s, we have found a sequence we want.
($1$2 in s) || ($1 in s) || ((substr($1, 1, 1) " " substr($1, 2)) in s) {
if ($1 ~ /^>./) { print $1 } else { print $1 $2 }; p = 1; next }
# If p is true, we want to print this line
p' ids.txt source.fasta >out.fasta
So when we are reading ids.txt, the condition NR==FNR is true, and so we simply store each line in the array s. The next causes the rest of the Awk script to be skipped for this line.
On subsequent reads, when NR!=FNR, we use the variable p to control what to print. When we see the start of a new sequence (a line beginning with >), we set p to 0 (in case it was 1 from a previous iteration). Then we check whether its identifier is in s, and if so, we print the header and set p to 1. The last line simply prints the current line if p is not empty or zero. (A condition without an action is shorthand for the action { print }.)
The slightly complex condition to check if $1 is in s might be more complicated than necessary -- I put in some normalizations to make sure that a space between the > and the sequence identifier is tolerated, regardless of whether there was one in ids.txt or not. This can probably be simplified if your files are consistently formatted.
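For consistently formatted files, the same NR==FNR idea can be reduced to a few lines. The following is a minimal sketch, not the script above: it assumes each line of ids.txt is exactly the first field of a header (including the >), and the sample files and the array name want are invented for the demonstration.

```shell
work=$(mktemp -d)
printf '%s\n' '>seq2' '>seq3' > "$work/ids.txt"
cat > "$work/source.fasta" <<'EOF'
>seq1 len=4
ACGT
>seq2 len=8
GGCC
TTAA
>seq3 len=4
TTTT
EOF

# First file: remember every id. Second file: on each header line decide
# whether to print (p), then print every line while p is true.
out=$(awk 'NR == FNR { want[$1]; next }
     /^>/ { p = ($1 in want) }
     p' "$work/ids.txt" "$work/source.fasta")

printf '%s\n' "$out"
rm -r "$work"
```

Because p stays set until the next header, sequences that span several lines are printed in full, and the large file is still read only once.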
Only with GNU grep and sed:
grep -A 1 -w -F -f ids.txt source.fasta | sed 's/ .*//'
See: man grep
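A detail worth knowing about grep -A: when the matches are not adjacent, grep prints a -- separator line between the groups of output. The sketch below filters those lines out. It assumes every sequence occupies exactly one line, and the sample files are invented for the demonstration.

```shell
work=$(mktemp -d)
printf '%s\n' '>seq1' '>seq3' > "$work/ids.txt"
cat > "$work/source.fasta" <<'EOF'
>seq1 len=4
ACGT
>seq2 len=4
GGCC
>seq3 len=4
TTTT
EOF

# -F: fixed strings, -w: whole words, -f: read the patterns from ids.txt,
# -A 1: also print the line after each match (the sequence).
picked=$(grep -A 1 -w -F -f "$work/ids.txt" "$work/source.fasta" |
    grep -v -x -e '--' |
    sed 's/ .*//')

printf '%s\n' "$picked"
rm -r "$work"
```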
$ awk 'NR==FNR{a[$1];next} $1 in a{c=2} c&&c--' ids.txt source.fasta
>TRINITY_DN80_c0_g1_i1 len=723 path=[700:0-350 1417:351-368 1045:369-722] [-1, 700, 1417, 1045, -2]
The above was run using your posted source.fasta and this ids.txt:
$ cat ids.txt
First, join all the ids into one expression separated by |, like this:
tr '\n' '|' <ids.txt | awk '{print "\"" $0 "\""}'
Remove the last | symbol from the expression.
Now you can grep using the output you got from previous command like this
egrep -E ">TRINITY_DN14840_c10_g1_i1|>TRINITY_DN8506_c0_g1_i1|>TRINITY_DN12276_c0_g2_i1|>TRINITY_DN15434_c5_g3_i1|>TRINITY_DN9323_c8_g3_i5|>TRINITY_DN11957_c1_g7_i1|>TRINITY_DN15373_c1_g1_i1|>TRINITY_DN22913_c0_g1_i1|>TRINITY_DN13029_c4_g5_i1" source.fasta
This will print only the matching lines
Edit, as per tripleee's comments:
The following prints the output properly,
assuming the ID and the sequence are on separate lines:
tr '\n' '|' <ids.txt | sed 's/|$//' | grep -A 1 -E -f - source.fasta
This might work for you (GNU sed):
sed 's#.*#/^&/{s/ .*//;N;p}#' idFile | sed -nf - fastafile
Convert the idFile into a sed script and run it against the fastafile.
The best way to do this is using either Python or Perl. I was able to make a script for extracting the ids using Python as follows.
#script to extract sequences from a source file based on ids in another file
#the source is a fasta file with a header and a sequence that follows in one line
#the ids file contains one id per line
#both the id and source file should contain the character '>' at the beginning that signifies an id
#written for Python 2; under Python 3, use input() instead of raw_input()
def main():
    #asks the user for the ids file
    file1 = raw_input('ids file: ')
    #opens the ids file
    ids_file = open(file1, 'r')
    #asks the user for the fasta file
    file2 = raw_input('fasta file: ')
    #opens the fasta file; it is read into memory below, so you need more memory than the file size
    fasta_file = open(file2, 'r')
    #ask the user for the file name of output file
    file3 = raw_input('enter the output filename: ')
    #opens output file for writing; note that 'w' overwrites existing data (use 'a' to append instead)
    output_file = open(file3, 'w')
    #split the ids into an array
    ids_lines = ids_file.read().splitlines()
    #split the fasta file into an array, the first element will be the id followed by the sequence
    fasta_lines = fasta_file.read().splitlines()
    #initializing loop counters
    i = 0
    j = 0
    #while loop to iterate over the length of the ids file as this is the limiter for the program
    while j<len(fasta_lines) and i<len(ids_lines):
        #if statement to match ids from both files and bring matching sequences
        #(the statements in both branches were missing here and are filled in from the comments)
        if ids_lines[i] == fasta_lines[j]:
            #output statements including newline characters
            output_file.write(fasta_lines[j] + '\n')
            output_file.write(fasta_lines[j+1] + '\n')
            #increment i so that we go for the next id
            i = i + 1
            #reset j so we start all over for the new id
            j = 0
        else:
            #when there is no match check the next id, skipping the sequence in the middle which is j+1
            j = j + 2
    ids_file.close()
    fasta_file.close()
    output_file.close()

main()
The code is not perfect but works for any number of ids. I have tested it on my samples, one of which contained 5000 ids, and the program worked fine. If there are improvements to be made to the code, please make them; I am a relative newbie at programming, so the code is a bit crude.

awk sed backreference csv file

This question extends a previous one. (I preferred asking a new question rather than editing the first one. I may be wrong.)
EDIT: OK, I was wrong, I should have edited my first question. My bad (asking a good SO question is an art, difficult to master).
I have a CSV file with semicolons as the field delimiter. Here is an extract of the CSV file:
Here is the desired output :
...;field;(:);(n,d) 10000;(:);field;....
...;field;(b) 123.12;(a) 123;(:) 123.00;....
I am looking for a solution to swap 2 patterns in each field.
pattern 1 : any digits, with an optional decimal mark (.) and optional decimal digits
e.g : 1 / 1111.00 / 444444444.3 / 32 / 32.6666666 / 1.0 / ....
pattern 2 : any string that begins with a left parenthesis, followed by one or more characters, and ends with a right parenthesis
e.g : (n,a,p) / (:) / (llll) / (d) / (123) / (1;2;3) ...
The solutions provided in the first question are right for a simple file that contains only one column. If I try them on the CSV file, I face multiple failures.
So I tried a similar solution with awk, which is (I think) more "column-oriented".
I have tried:
awk -F";" '{print gensub(/([[:digit:].]*)(\(.*\))/, "\\2 \\1", "g")}' file
I thought that by setting the field delimiter (;), my regex swap would succeed in every field. It was a mistake.
Here is an example of a failure:
desired output --> ;(:);(n,d) 7320000;(:)
My questions (finally): why does awk fail here when it succeeds with a one-column file? What is the best tool for this challenge?
sed with very long regex ?
awk with very long regex ?
for loop ?
other tools ?
PS : I know I am not clear. I have 2 problems (English language, technical limitations). Sorry.
Your "question" is far too long, cluttered, and containing too many separate questions to wade through but here's how to get the output you want from the input you provided with any sed:
$ sed 's/\([0-9][0-9.]*\)\(([^)]*)\)/\2 \1/g' file
...;field;(:);(n,d) 10000;(:);field;....
...;field;(b) 123.12;(a) 123;(:) 123.00;....
Well, when parsing simple delimited files without any quoted values, awk usually comes to the rescue:
awk -v FS=';' -v OFS=';' '{
    for (i = 1; i <= NF; i++) {
        split($i, t, "(")
        if (length(t[1]) != 0 && length(t[2]) != 0) {
            $i="("t[2]" "t[1]
        }
    }
    print
}' file
However, this will fail if fields are quoted, i.e. if the separator ; occurs inside the values.
First we set the input and output separators to ;
We iterate through all the fields in the line with for (i = 1; i <= NF; i++)
We split the field on the ( character
If both the part before the ( and the part after it have nonzero length,
we swap the two parts and add a space between them (remembering to restore the ( that split removed from the beginning).
And then the line gets printed.
A solution using sed and xargs, but you need to know the number of fields in advance:
sed 's/;/\n/g' file |
sed 's/\([^(]\{1,\}\)\((.*)\)/\2 \1/' |
xargs -d '\n' -n7 -- printf "%s;%s;%s;%s;%s;%s;%s\n"
Each ; is replaced with a newline, so every field ends up on its own line.
On each line, a string of at least one character before the ( is swapped with the parenthesized string that follows it.
Then xargs and printf merge every 7 lines back into one, using ; as the separator.
This might work for you (GNU sed):
sed -r 's/([0-9]+(\.[0-9]+)?)(\([^)]*\))/\3 \1/g' file
Look for a group of numbers (possibly with a decimal point) followed by a pair of parens, and rearrange them in the desired fashion, globally throughout each line.
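The regex-swap approach can be checked against a small sample. In this sketch the input lines are an assumption, reconstructed by reversing the swap in the desired output shown in the question; the substitution itself is the POSIX sed command from the first answer.

```shell
# Assumed input: the desired output with each swap undone.
input='...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....'

# Group 1: a number (digits and dots, starting with a digit).
# Group 2: a parenthesized string without a closing paren inside it.
swapped=$(printf '%s\n' "$input" |
    sed 's/\([0-9][0-9.]*\)\(([^)]*)\)/\2 \1/g')

printf '%s\n' "$swapped"
```

Using [^)]* rather than .* inside the parentheses is what keeps a match from running across several fields on the same line.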

How to grep a pattern followed by a number, only if the number is above a certain value

I actually need to grep the entire line. I have a file with a bunch of lines that look like this
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
4 223152 D L . stuff=1.122;otherstuf=4;morestuff=41;AF=0.02;laststuff=RV
and I want to keep all the lines where AF>0.1. So for the lines above I only want to keep the first line.
Using gnu-awk you can do this:
awk 'gensub(/.*;AF=([^;]+).*/, "\\1", "1", $NF)+0 > 0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
This gensub call parses AF=<number> out of the last field of the input and captures the number in capture group #1, which is then compared with 0.1.
PS: +0 will convert parsed field to a number.
You could use awk with multiple delimiters to extract the value and compare it:
$ awk -F';|=' '$8 > 0.1' file
Assuming that AF is always of the form 0.NN, you can simply match values where the tenths place is 1-9, e.g.:
grep ';AF=0\.[1-9][0-9];' your_file.csv
You could add a + after the second character group (with grep -E) to support additional digits (i.e. 0.NNNNN). Note that this pattern also matches exactly 0.10, and if the values could be outside the range [0, 1) you shouldn't try to match the field with regular expressions.
$ awk -F= '$5+0 > 0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
If that doesn't do what you want when run against your real data then edit your question to provide more truly representative sample input/output.
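I would use awk. Since awk supports string comparisons, you can simply use this:
awk -F';' '$(NF-1) > "AF=0.1"' file.txt
-F';' splits the line into fields on ;. $(NF-1) addresses the second-to-last field in the line (NF is the number of fields). Because this compares strings character by character, it is only reliable when every AF value uses the same fixed-point notation; a value written as 1e-3 or 0.10 would wrongly be kept.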
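If the position of AF within the last column cannot be relied on, the value can be looked up by its key instead. The following is a sketch under that assumption; the function name af_above and the variable names are hypothetical, and the sample lines are the two from the question.

```shell
af_above() {
    # $1 is the threshold; input lines are read from stdin.
    awk -v min="$1" '{
        n = split($NF, kv, ";")          # key=value pairs of the last field
        for (i = 1; i <= n; i++)
            if (kv[i] ~ /^AF=/ && substr(kv[i], 4) + 0 > min + 0) {
                print                    # numeric comparison succeeded
                next
            }
    }'
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
4 223152 D L . stuff=1.122;otherstuf=4;morestuff=41;AF=0.02;laststuff=RV
EOF

kept=$(af_above 0.1 < "$sample")
printf '%s\n' "$kept"
rm -f "$sample"
```

Adding 0 to both sides forces a numeric comparison, so values such as 0.10 or 1e-3 are handled as numbers rather than as strings.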

I'm trying to get the current track running from 'cmus-remote -Q'
It's always underneath this line
tag genre Various
<some track>
Now, I need to keep it simple because I want to add it to my i3 bar. I used
cmus-remote -Q | grep -A 1 "tag genre"
but that greps the 'tag' line AND the line underneath.
I want ONLY the line underneath.
With sed:
sed -n '/tag genre/{n;p}'
$ cmus-remote -Q | sed -n '/tag genre/{n;p}'
<some track>
If you want to use grep as the tool for this, you can achieve it by adding another segment to your pipeline:
cmus-remote -Q | grep -A 1 "tag genre" | grep -v "tag genre"
This will fail in cases where the string you're searching for is on two lines in a row. You'll have to define what behaviour you want in that case if we're going to program something sensible for it.
Another possibility would be to use a tool like awk, which allows for greater complexity in the line selection:
cmus-remote -Q | awk '/tag genre/ { getline; print }'
This searches for the string, then gets the next line, then prints it.
Another possibility would be to do this in bash alone:
while IFS= read -r line; do
[[ $line =~ tag\ genre ]] && IFS= read -r line && echo "$line"
done < <(cmus-remote -Q)
This implements the same functionality as the awk script, only using no external tools at all. It's likely slower than the awk script.
You can use awk instead of grep:
awk 'p{print; p=0} /tag genre/{p=1}' file
<some track>
/tag genre/{p=1} - sets a flag p=1 when it encounters tag genre in a line.
p{print; p=0} when p is non-zero then it prints a line and resets p to 0.
I'd suggest using awk:
awk 'seen && seen--; /tag genre/ { seen = 1 }'
when seen is true, print the line.
when seen is true, decrement the value, so it will no longer be true after the desired number of lines have been printed
when the pattern matches, set seen to the number of lines to be printed
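The counter idiom generalizes to any number of following lines. Here is a small sketch; fake_status is a hypothetical stand-in for cmus-remote -Q with invented output, and lines_after is a made-up helper name.

```shell
# Hypothetical stand-in for the output of cmus-remote -Q.
fake_status() {
    printf '%s\n' 'status playing' 'tag genre Various' \
        'Some Track' 'Another Line' 'set repeat false'
}

# Print the n lines that follow each line matching "tag genre".
lines_after() {
    awk -v n="$1" 'seen && seen--; /tag genre/ { seen = n + 0 }'
}

one=$(fake_status | lines_after 1)
two=$(fake_status | lines_after 2)
printf '%s\n' "$one"
```

With n set to 1 this prints only the line directly underneath the match, which is what the question asks for.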
