I have a text file containing numbers like +12345678912 (each starts with + followed by 11 digits), separated from the other data on the line by what appears to be tab whitespace.
How can I match only the entries that start with a + and capture the 11 characters after it, provided they are all digits?
Updated:
This is the input
+12345678912 http://google.com 2012-05-07 11:30:06
+12345678913 http://google.com 2012-05-07 19:26:21
And the output should be an array with matching results
[12345678912, 12345678913]
Use this...
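matches = str.scan(/^\+(\d{11})/).flatten
# scan returns one array per match; flatten yields a flat array of digit strings ([] when nothing matches)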
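For readers not using Ruby, the same idea can be sketched in Python. This is a minimal sketch, not the answer's own code: the sample text is copied from the question, the columns are assumed to be tab-separated, and the `(?!\d)` check (which rejects numbers longer than 11 digits) is an extra assumption.

```python
import re

# Sample input copied from the question; tab separators are an assumption.
text = (
    "+12345678912\thttp://google.com\t2012-05-07 11:30:06\n"
    "+12345678913\thttp://google.com\t2012-05-07 19:26:21\n"
)

# re.MULTILINE makes ^ match at the start of every line.
# (?!\d) is an added assumption: it rejects a 12th digit after the 11 captured ones.
pattern = re.compile(r"^\+(\d{11})(?!\d)", re.MULTILINE)

# findall returns the captured group for each match; int() converts the digit strings.
matches = [int(number) for number in pattern.findall(text)]
print(matches)  # [12345678912, 12345678913]
```

Dropping the `int()` conversion keeps the results as strings, which is what the Ruby `scan` call returns.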
I need to match all the letters and digits in a string str.
This is my code.
str.match(/^(AB)(\d+)([A-Za-z][0-9])?/)
When str = AB57933A, it matches only AB57933 and not the character appended after the digits.
If I try str = AB57933AbC, it still matches only AB57933: the match stops at the last digit and excludes the characters after it.
In the way you have written it:
/^(AB)(\d+)([A-Za-z][0-9])/
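the last group requires exactly one letter followed by one digit (between 0 and 9), so a trailing letter with no digit after it is not matched. Depending on your needs, you can replace it with the following if you do not expect digits after the letters: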
/^(AB)(\d+)([A-Za-z]+)/
or by
/^(AB)(\d+)([A-Za-z0-9]+)/
if inputs such as AB57933AbC12 should also be accepted as valid.
Last but not least, if you do not use backreferences you can omit the parentheses, as you do not need capturing groups.
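To see how the three patterns differ, here is a small Python sketch (the thread uses Ruby, but these patterns behave the same way in Python's `re` module). The helper name `matched_text` is hypothetical and exists only for this illustration.

```python
import re

# The pattern from the question and the two alternatives suggested above.
original = re.compile(r"^(AB)(\d+)([A-Za-z][0-9])?")
letters_only = re.compile(r"^(AB)(\d+)([A-Za-z]+)")
letters_and_digits = re.compile(r"^(AB)(\d+)([A-Za-z0-9]+)")


def matched_text(pattern, s):
    """Return the text matched from the start of s, or None if there is no match."""
    m = pattern.match(s)
    return m.group(0) if m else None


# The optional group needs a letter AND a digit, so it is skipped here.
print(matched_text(original, "AB57933A"))                 # AB57933
print(matched_text(original, "AB57933AbC"))               # AB57933
# One or more letters after the digits are now included.
print(matched_text(letters_only, "AB57933AbC"))           # AB57933AbC
# Letters and digits after the first run of digits are included.
print(matched_text(letters_and_digits, "AB57933AbC12"))   # AB57933AbC12
```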
I have a long text file that reads like:
where the last element of :math:`\pmb{x}_i\in\mathbf{R}^{p}` is 1 and
the first :math:`p-1` elements of :math:`\pmb{x_i}` and
I would like to replace every span running from :math: to the closing "`" with whitespace. For example, the text above should become:
where the last element of is 1 and
the first elements of and
I tried this:
sed $'/end/ {r exceptions\n} ; /:math:/,/`/ {d}' input_text.text > output_text.text
but this removes the whole line containing the guard strings. I just want to remove what is between the guard strings.
Try this:
sed -E 's/:math:`[^`]*`//g'
Given your input, as output I get
where the last element of is 1 and
the first elements of and
It's worth noting that this assumes that the ` character cannot be used inside the :math: tag.
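If sed is not available, the same substitution can be sketched in Python with `re.sub`. This is only an illustration under the same assumption (no backtick inside a :math: span); the second substitution, which collapses the doubled spaces left behind, is an optional extra and not part of the sed command above.

```python
import re

# Sample input copied from the question.
text = (
    r"where the last element of :math:`\pmb{x}_i\in\mathbf{R}^{p}` is 1 and" "\n"
    r"the first :math:`p-1` elements of :math:`\pmb{x_i}` and" "\n"
)

# Same pattern as the sed command: ":math:", a backtick,
# any run of non-backtick characters, then the closing backtick.
cleaned = re.sub(r":math:`[^`]*`", "", text)

# Optional extra step: removing a span leaves two adjacent spaces, so collapse them.
tidy = re.sub(r" {2,}", " ", cleaned)
print(tidy, end="")
```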
I have a multiple sequence alignment file in which the lines from the different sequences are interspersed, as in the format output by Clustal and other popular multiple sequence alignment tools. It looks like this:
TGFb3_human_used_for_docking ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|B3KVH9|B3KVH9_HUMAN ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|G3UBH9|G3UBH9_LOXAF ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|G3WTJ4|G3WTJ4_SARHA ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
TGFb3_human_used_for_docking LRSADTTHST-
tr|B3KVH9|B3KVH9_HUMAN LRSADTTHST-
tr|G3UBH9|G3UBH9_LOXAF LRSTDTTHST-
tr|G3WTJ4|G3WTJ4_SARHA LRSADTTHST-
Each line begins with a sequence identifier, and then a sequence of characters (in this case describing the amino acid sequence of a protein). Each sequence is split into several lines, so you see that the first sequence (with ID TGFb3_human_used_for_docking) has two lines. I want to convert this to a format in which each sequence has a single line, like this:
TGFb3_human_used_for_docking ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
tr|B3KVH9|B3KVH9_HUMAN ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
tr|G3UBH9|G3UBH9_LOXAF ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSTDTTHST-
tr|G3WTJ4|G3WTJ4_SARHA ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
(In this particular example the sequences are almost identical, but in general they aren't!)
How can I convert from multi-line multiple sequence alignment format to single-line?
It looks like you need to write a script of some sort to achieve this. Here's a quick example I wrote in Python. It won't line up the whitespace as neatly as in your example (if you care about that, you'll have to adjust the formatting), but it gets the rest of the job done:
# Create a dictionary to accumulate full sequences
full_sequences = {}
# Loop through the original file (replace test.txt with your file name)
# and append each fragment to the appropriate dictionary entry
with open("test.txt") as infile:
    for line in infile:
        fields = line.split()
        # Skip blank lines and lines without a sequence fragment
        if len(fields) < 2:
            continue
        full_sequences[fields[0]] = full_sequences.get(fields[0], "") + fields[1]
# Now loop through the dictionary and write each entry as a single line
# (note: this overwrites test.txt; dicts keep insertion order in Python 3.7+)
with open("test.txt", "w") as outfile:
    for seq_id, sequence in full_sequences.items():
        outfile.write(seq_id + "\t\t" + sequence + "\n")
How do I find repeated characters using a regular expression?
If I have aaabbab, I would like to match only characters which have three repetitions:
aaa
Try string.scan(/((.)\2{2,})/).map(&:first), where string is your string of characters.
The way this works: the dot matches any character and the inner group captures it, then the \2 backreference matches repeats of that character 2 or more times (the {2,} quantifier means "at least 2 times, with no upper bound"), for a run of at least three in total. scan returns an array of arrays (one entry per capture group for each match), so we map the first element out of each to get the full runs.
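The same backreference technique can be sketched in Python, shown here only as an illustration for readers outside Ruby. The function name `runs_of_three_or_more` is hypothetical. Because `re.finditer` exposes the whole match directly, a single capture group is enough and the backreference becomes `\1`.

```python
import re


def runs_of_three_or_more(s):
    """Return every run in s where one character repeats at least three times."""
    # (.) captures one character; \1{2,} requires at least two more copies of it.
    return [m.group(0) for m in re.finditer(r"(.)\1{2,}", s)]


print(runs_of_three_or_more("aaabbab"))  # ['aaa']
```

To match runs of exactly three characters rather than three or more, the pattern would need additional checks around the run; the sketch above follows the "three or more" behaviour of the Ruby answer.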
I need a regex to match something like
"4f0f30500be4443126002034"
and
"4f0f30500be4443126002034>4f0f31310be4443126005578"
but not like
"4f0f30500be4443126002034>4f0f31310be4443126005578>4f0f31310be4443126005579"
Try:
^[\da-f]{24}(>[\da-f]{24})?$
[\da-f]{24} is exactly 24 characters consisting only of 0-9, a-f. The whole pattern is one such number optionally followed by a > and a second such number.
I think you want something like:
/^[0-9a-f]{24}(>[0-9a-f]{24})?$/
That matches 24 characters in the 0-9a-f range (which matches your first string), optionally followed by a single > and another 24 characters in the 0-9a-f range (which matches your second string).
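As a quick check of the pattern against the three strings from the question, here is a minimal Python sketch. The names `HEX_ID`, `PATTERN` and `is_valid` are hypothetical, and `re.fullmatch` is used in place of the explicit ^ and $ anchors.

```python
import re

# One identifier: exactly 24 lowercase hexadecimal characters.
HEX_ID = r"[0-9a-f]{24}"
# One identifier, optionally followed by ">" and a second identifier.
PATTERN = re.compile(HEX_ID + "(>" + HEX_ID + ")?")


def is_valid(s):
    """True when s is one identifier or two identifiers joined by '>'."""
    return PATTERN.fullmatch(s) is not None


print(is_valid("4f0f30500be4443126002034"))  # True
print(is_valid("4f0f30500be4443126002034>4f0f31310be4443126005578"))  # True
print(is_valid(
    "4f0f30500be4443126002034>4f0f31310be4443126005578>4f0f31310be4443126005579"
))  # False
```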
Don't need a regex.
str = "4f0f30500be4443126002034>4f0f31310be4443126005578"
match = str.count('>') < 2
match will be true when the string contains at most one '>', and false otherwise. Note that this checks only the number of separators, not that each part is 24 hexadecimal characters.