Checking if a text file is formatted in a specific way - ruby

I have a text file which contains instructions. I'm reading it using File.readlines(filename). I want to check that the file is formatted as follows:
Has 3 lines
Line 1: two integers (including negatives) separated by a space
Line 2: two integers (including negatives) separated by a space and 1 capitalised letter of the alphabet also separated by a space.
Line 3: capitalised letters of the alphabet without any spaces (or punctuation).
This is what the file should look like:
8 10
1 2 E
MMLMRMMRRMML
So far I have calculated the number of lines using File.readlines(filename).length. How do I check the format of each line, do I need to loop through the file?
EDIT:
I solved the problem by creating three methods, each containing a regular expression. I then passed each line into its method and used a conditional statement to check that the output was true.

Suppose IO::read is used to return the following string str.
str = <<~END
8 10
1 2 E
MMLMRMMRRMML
END
#=> "8 10\n1 2 E\nMMLMRMMRRMML\n"
You can then test the string with a single regular expression:
r = /\A(-?\d+) \g<1>\n\g<1> \g<1> [A-Z]\n[A-Z]+\n\z/
str.match?(r)
#=> true
I could have written
r = /\A-?\d+ -?\d+\n-?\d+ -?\d+ [A-Z]\n[A-Z]+\n\z/
but then the integer pattern -?\d+ is written four times. It's slightly shorter, and reduces the chance of error, to put the first of the four in capture group 1 and then reuse it as a subexpression by calling it with \g<1> (not to be confused with a back-reference, which is written \k<1>). Alternatively, I could have used named capture groups:
r = /\A(?<int>-?\d+) \g<int>\n\g<int> \g<int> (?<cap>[A-Z])\n\g<cap>+\n\z/
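The line-by-line approach mentioned in the question's edit can be sketched as follows. This is only an illustration, not the asker's actual code: the names LINE_PATTERNS and valid_instructions? are made up here, and it assumes the lines come from File.readlines(filename), so each one may still carry a trailing newline.

```ruby
# One pattern per expected line, in order (hypothetical names).
LINE_PATTERNS = [
  /\A-?\d+ -?\d+\z/,        # line 1: two integers
  /\A-?\d+ -?\d+ [A-Z]\z/,  # line 2: two integers and one capital letter
  /\A[A-Z]+\z/              # line 3: capital letters only
].freeze

# Returns true only if there are exactly three lines and each one
# matches the pattern for its position.
def valid_instructions?(lines)
  lines = lines.map(&:chomp)
  lines.length == LINE_PATTERNS.length &&
    lines.zip(LINE_PATTERNS).all? { |line, pattern| line.match?(pattern) }
end
```

It could be called as valid_instructions?(File.readlines(filename)). Checking the length first also answers the "do I need to loop" part: the loop is hidden inside all?.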

Related

Filter sequences with more than 8 same consecutive nucleotides in a fastq file?

I want to filter out the sequences in my fastq files that have more than 8 identical consecutive nucleotides, like "GGGGGGGG", "CCCCCCCC", etc.
How should I do that?
The quick and incorrect way, which might be close enough: grep -E -B1 -A2 'A{8}|C{8}|G{8}|T{8}' yourfile.fastq.
This will miss runs that are split across two lines (e.g. the first sequence line ends with AAAA and the next one starts with AAAA). It also assumes that each record in the file occupies exactly 4 lines.
The proper way: write a little program (in Python, or a language of your choice) which buffers one FASTQ block (e.g. 4 lines) and checks that the concatenation of the previous (buffered) block's sequence and the current block's sequence do not have an 8-mer as above. If that's the case, then output the buffered block.
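A minimal sketch of such a program, written in Ruby here, might look like the following. It assumes the common layout of exactly four lines per record with the whole sequence on the second line, so it does not handle sequences wrapped over several lines. The names HOMOPOLYMER_RUN and filter_fastq are invented for this example.

```ruby
# A run of eight identical nucleotides, e.g. "GGGGGGGG".
HOMOPOLYMER_RUN = /([ACGT])\1{7}/

# Takes the lines of a FASTQ file and returns the lines of all records
# whose sequence (second line of each 4-line record) has no such run.
def filter_fastq(lines, pattern = HOMOPOLYMER_RUN)
  lines.each_slice(4).reject do |record|
    record[1].to_s.match?(pattern)
  end.flatten
end
```

For a real file, something like File.write("filtered.fastq", filter_fastq(File.readlines("yourfile.fastq")).join) would write the kept records back out; the file names are placeholders.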
I ended up using the code below in R, which solved my problem.
library(ShortRead)
fq <- FastqFile("/Users/path/to/file")
reads_fq <- readFastq(fq)
trimmed_fq <- reads_fq[grep("GGGGGGGG|TTTTTTTT|AAAAAAAA|CCCCCCCC",
                            sread(reads_fq), invert = TRUE)]
writeFastq(trimmed_fq, "new_name_for_fq.fastq", compress = FALSE)
You can use the Python package biotite for it (https://www.biotite-python.org).
Let's say you have the following FASTQ file:
@Read:01
CCCAAGGGCCCCCCCCCACTGCGATCACCTGGTTGCTGCCGGGAAAGGAGACCCAGGAGGTGAAACGGACTGGTGAATTG
CGGGGGTAGATATGGCGGGTGACACAAAAACATATAATCGGGCC
+
.+.+:'-FEAC-4'4CA-3-5#/4+?*G#?,<)<E&5(*82C9FH4G315F*DF8-4%F"9?H5535F7%?7#+6!FDC&
+4=4+,#2A)8!1B#,HA18)1*D1A-.HGAED%?-G10'6>:2
@Read:02
AACACTACTTCGCTGTCGCCAAAGGTTGGTGTAGGTCGGACTTCGAATTATCGATACTAGTTAGTAGTACGTCGCGTGGC
GTCAGCTCGTATGCTCTCAGAACAGGGAGAACTAGCACCGTAAGTAACCTAGCTCCCAAC
+
6%9,#'4A0&%.19,1E)E?!9/$.#?(!H2?+E"")?6:=F&FE91-*&',,;;$&?#2A"F.$1)%'"CB?5$<.F/$
7055E>#+/650B6H<8+A%$!A=0>?'#",8:#5%18&+3>'8:28+:5F0);E9<=,+
This script should do the job:
import biotite.sequence.io.fastq as fastq
import biotite.sequence as seq

# 'AAAAAAAA', 'CCCCCCCC', etc.
consecutive_nucs = [seq.NucleotideSequence(nuc * 8) for nuc in "ACGT"]

fastq_file = fastq.FastqFile("Sanger")
fastq_file.read("example.fastq")
# Iterate over sequence entries in file
for header in fastq_file:
    sequence = fastq_file.get_sequence(header)
    # Iterate over each of the consecutive sequences
    for consecutive_nuc in consecutive_nucs:
        # Find all indices where a match was found
        matches = seq.find_subsequence(sequence, consecutive_nuc)
        if len(matches) > 0:
            # If any match was found, report the first one
            print(
                f"Found '{consecutive_nuc}' "
                f"in sequence '{header}' at position {matches[0]}"
            )
This is the output:
Found 'CCCCCCCC' in sequence 'Read:01' at position 8

How do I concatenate lines from a text file into one big string?

I have an input file that looks like this (without the blank lines in between):
3 4
ATCGA
GACTTACA
AACTGTA
ATC
...and I need to concatenate all lines except for the first "3 4" line. Is there a simple solution? I've tried manipulating getline() somehow, but that has not worked for me.
Edit: The number of lines will not be known in advance, so it will have to be done recursively.
If you want to concatenate two strings into one, you can simply use the "+" operator,
e.g.:
String a = "WAQAR MUGHAL";
String b = "check";
System.out.println(a + b);
System.out.println("WAQAR MUGHAL" + "CHECK");
Output:
WAQAR MUGHALcheck
WAQAR MUGHALCHECK
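For the question as asked (join every line except the first, without knowing the number of lines in advance), the idea is the same in any language: skip the first line, strip the line endings, and append the rest. A minimal sketch in Ruby, where concatenate_body is a made-up name and the input is assumed to be an array of lines such as File.readlines returns:

```ruby
# Drops the first line, removes trailing newlines, and joins the rest
# into one string. Works for any number of lines; no recursion needed.
def concatenate_body(lines)
  lines.drop(1).map(&:chomp).join
end
```

With the sample input above this would be called as concatenate_body(File.readlines("input.txt")), where the file name is a placeholder.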

Splitting a string on variable numbers of words

The following question was posted by @ruhroe about an hour ago. I was about to post an answer when it was taken down. That's unfortunate, as I thought it was rather interesting. I'm putting it back up in case the OP sees this and also to give others an opportunity to post solutions.
The original question (which I've edited):
The problem is to split a string on some spaces in the string, based on criteria which depend in part on a number given by the user. If that number were, say, 5, each substring would contain either:
one word having 5 or more characters or
as many consecutive words (separated by spaces) as possible, provided the resulting string has at most 5 characters.
For example, if the string were:
"abcdefg fg hijkl mno pqrs tuv wx yz"
the result would be:
["abcdefg", "fg", "hijkl", "mno", "pqrs", "tuv", "wx yz"]
"abcdefg" is on a separate line because it has at least five characters.
"fg" is on a separate line because "fg" contains 5 or fewer characters and, when combined with the following word with a space between them, the resulting string, "fg hijkl", contains more than 5 characters.
"hijkl" is on a separate line because it satisfies both criteria.
How can I do that?
I believe this does it:
str = "abcdefg fg hijkl e mn pqrs tuv wx yz"
str.scan(/\b(?:\w{5,}|\w[\w\s]{0,3}\w|\w)\b/)
#=> ["abcdefg", "fg", "hijkl", "e mn", "pqrs", "tuv", "wx yz"]
As you iterate through the words in your collection (splitting the original string up into words should be trivial), it seems like there are three possible scenarios:
It's a blank line, and we should insert the current word into the line
It's a non-blank line, and the word can fit
It's a non-blank line, and the word can't fit and it should go into a new line
Something like this should work (note - I haven't tested this much outside of your solution. You'll definitely want to do that):
words.each do |word|
  if line.blank?
    # this is a new line, so start it with the current word
    line << word
  elsif word_can_fit_line?(line, word, length)
    # the word fits, so append it to the current line
    line << " #{word}"
  else
    # the word doesn't fit, so keep this line and start a new one with
    # the current word
    lines << line
    line = word
  end
end
# add the last line and we're done
lines << line
lines
Note that the implementation of word_can_fit_line? should be trivial - you just want to see if the current line length, plus a space, plus the word length, is less than or equal to your desired line length.
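For reference, a self-contained version of this loop could look like the sketch below. It is one possible reading of the outline above, not the answerer's tested code: split_into_lines is a made-up wrapper name, String#empty? stands in for blank? (which needs ActiveSupport), and word_can_fit_line? is implemented exactly as described in the note.

```ruby
# True if appending a space and the word keeps the line within length.
def word_can_fit_line?(line, word, length)
  line.length + 1 + word.length <= length
end

# Greedily packs words into lines of at most `length` characters;
# a word longer than `length` gets a line of its own.
def split_into_lines(str, length)
  lines = []
  line = +""
  str.split.each do |word|
    if line.empty?
      line = word.dup
    elsif word_can_fit_line?(line, word, length)
      line << " " << word
    else
      lines << line
      line = word.dup
    end
  end
  lines << line unless line.empty?
  lines
end
```

Calling split_into_lines("abcdefg fg hijkl mno pqrs tuv wx yz", 5) gives the array from the question, ending in "wx yz".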

How does this gsub and regex work?

I'm trying to learn ruby and having a hard time figuring out what each individual part of this code is doing. Specifically, how does the global subbing determine whether two sequential numbers are both one of these values [13579] and how does it add a dash (-) in between them?
def DashInsert(num)
  num_str = num.to_s
  num_str.gsub(/([13579])(?=[13579])/, '\1-')
end
num_str.gsub(/([13579])(?=[13579])/, '\1-')
() is called a capturing group; it captures the characters matched by the pattern inside it. Here that pattern is [13579], which matches a single digit from the given set of digits. The matched digit is captured and stored in group 1.
(?=[13579]) is a positive lookahead. It asserts that the match must be followed by a character matched by the pattern inside the lookahead, without consuming that character. The replacement occurs only if this condition is satisfied.
\1 refers to the characters captured by group 1.
Example:
> "13".gsub(/([13579])(?=[13579])/, '\1-')
=> "1-3"
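To see why the lookahead matters, it may help to compare it with a pattern that consumes the second digit instead. The variable names in this small sketch are made up for illustration.

```ruby
# The lookahead only peeks at the next digit, so that digit is still
# available as the start of the next match.
odd_then_lookahead = /([13579])(?=[13579])/
with_lookahead = "1357".gsub(odd_then_lookahead, '\1-')

# Here the second odd digit is consumed by the match, so it cannot
# start a new match and every other dash is lost.
odd_then_odd = /([13579])([13579])/
consuming = "1357".gsub(odd_then_odd, '\1-\2')

puts with_lookahead  # 1-3-5-7
puts consuming       # 1-35-7
```

The single quotes around the replacement string matter as well: inside double quotes, "\1" would be read as an escape sequence rather than a reference to group 1.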
You may start with some random tests:
def DashInsert(num)
  num_str = num.to_s
  num_str.gsub(/([13579])(?=[13579])/, '\1-')
end

10.times {
  x = rand(10000)
  puts "%6i: %6s" % [x, DashInsert(x)]
}
Example:
  9633:  963-3
  7774: 7-7-74
  6826:   6826
  7386:  7-386
  2145:   2145
  7806:   7806
  9499:  949-9
  4117: 41-1-7
  4920:   4920
    14:     14
And now to check the regex.
([13579]) Take any odd digit and remember it (it can be used later with \1).
(?=[13579]) Check whether the next digit is also odd, but don't take it (it remains in the string).
'\1-' Output the first odd digit and add a - to it.
In other words:
It puts a - between every two adjacent odd digits.

Regular expression to match my pattern of words, wild chars

can you help me with this:
I want a regular expression for my Ruby program to match a word with the below pattern
Pattern has
List of letters ( For example. ABCC => 1 A, 1 B, 2 C )
N Wild Card Characters ( N can be 0 or 1 or 2)
A fixed word (for example “XY”).
Rules:
Regarding the List of letters, it should match words with
a. 0 or 1 A
b. 0 or 1 B
c. 0 or 1 or 2 C
Based on the value of N, there can be 0 or 1 or 2 wild chars
Fixed word is always in the order it is given.
The combination of all these can be in any order and should match words like below
ABWXY ( if wild char = 1)
BAXY
CXYCB
But not words with 2 A’s or 2 B’s
I am using the pattern like ^[ABCC]*.XY$
But it also matches words with more than 1 A, more than 1 B or more than 2 C's, and it only matches words which end with XY. I want all words which have XY in any place and the letters and wild chars in any position.
If it HAS to be a regex, the following could be used:
if subject =~
  /^                            # start of string
  (?!(?:[^A]*A){2})             # assert that there are fewer than two As
  (?!(?:[^B]*B){2})             # and fewer than two Bs
  (?!(?:[^C]*C){3})             # and fewer than three Cs
  (?!(?:[ABCXY]*[^ABCXY]){3})   # and fewer than three non-ABCXY characters
  (?=.*XY)                      # and that XY is contained in the string.
  /x
  # Successful match
else
  # Match attempt failed
end
This assumes that none of the characters A, B, C, X, or Y are allowed as wildcards.
I consider myself to be fairly good with regular expressions and I can't think of a way to do what you're asking. Regular expressions look for patterns and what you seem to want is quite a few different patterns. It might be more appropriate in your case to write a function which splits the string into characters and counts what you have, so you can check your criteria.
Just to give an example of your problem, a regex like /[abc]/ will match every single occurrence of a, b and c regardless of how many times those letters appear in the string. You can try /c{1,2}/ and it will match "c", "cc", and "ccc". It matches the last case because the pattern is not anchored: "ccc" contains a run of one or two c's.
One thing I have found invaluable when developing and debugging regular expressions is rubular.com. Try some examples and I think you'll see what you're up against.
I don't know if this is really any help but it might help you choose a direction.
You need to break out your pattern properly. In regexp terms, [ABCC] means "any one of A, B or C" where the duplicate C is ignored. It's a set operator, not a grouping operator like () is.
What you seem to be describing is creating a regexp based on parameters. You can do this by passing a string to Regexp.new and using the result.
An example is roughly:
def match_for_options(options)
  pattern = '^'
  pattern << 'A' * options[:a] if (options[:a])
  pattern << 'B' * options[:b] if (options[:b])
  pattern << 'C' * options[:c] if (options[:c])
  Regexp.new(pattern)
end
You'd use it something like this:
if (match_for_options(:a => 1, :c => 2).match('ACC'))
  # ...
end
Since you want to allow these "elements" to appear in any order, you might be better off writing a bit of Ruby code that goes through the string from beginning to end and counts the number of As, Bs, and Cs, finds whether it contains your desired substring. If the number of As, Bs, and Cs, is in your desired limits, and it contains the desired substring, and its length (i.e. the number of characters) is equal to the length of the desired substring, plus # of As, plus # of Bs, plus # of Cs, plus at most N characters more than that, then the string is good, otherwise it is bad. Actually, to be careful, you should first search for your desired substring and then remove it from the original string, then count # of As, Bs, and Cs, because otherwise you may unintentionally count the As, Bs, and Cs that appear in your desired string, if there are any there.
You can do what you want with a regular expression, but it would be a long ugly regular expression. Why? Because you would need a separate "case" in the regular expression for each of the possible orders of the elements. For example, the regular expression "^ABC..XY$" will match any string beginning with "ABC" and ending with "XY" and having two wild card characters in the middle. But only in that order. If you want a regular expression for all possible orders, you'd need to list all of those orders in the regular expression, e.g. it would begin something like "^(ABC..XY|ACB..XY|BAC..XY|BCA..XY|" and go on from there, with about 5! = 120 different orders for that list of 5 elements, then you'd need more for the cases where there was no A, then more for cases where there was no B, etc. I think a regular expression is the wrong tool for the job here.
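The counting approach described above can be sketched roughly as follows. The method name matches_pattern? and its keyword arguments are invented for this example. It also makes two assumptions that the question leaves open: a surplus listed letter (e.g. a second A) rejects the word rather than counting as a wild card, and N is treated as an upper bound on the number of wild cards.

```ruby
# letters:   the allowed letters with multiplicity, e.g. "ABCC"
# wildcards: maximum number of other characters allowed (N)
# fixed:     the word that must appear as-is, e.g. "XY"
def matches_pattern?(word, letters:, wildcards:, fixed:)
  index = word.index(fixed)
  return false if index.nil?

  # Remove the fixed word first so its characters are not counted.
  rest = word[0...index] + word[(index + fixed.length)..]

  limits = letters.chars.tally   # {"A"=>1, "B"=>1, "C"=>2}
  counts = rest.chars.tally

  # Reject if any listed letter occurs more often than allowed.
  return false if limits.any? { |ch, max| counts.fetch(ch, 0) > max }

  # Everything that is not a listed letter counts as a wild card.
  rest.chars.count { |ch| !limits.key?(ch) } <= wildcards
end
```

For example, matches_pattern?("CXYCB", letters: "ABCC", wildcards: 0, fixed: "XY") is true, while "AABXY" is rejected. Array#tally needs Ruby 2.7 or later.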
