How do I concatenate lines from a text file into one big string? - text-files

I have an input file that looks like this (without such big spaces between the lines):
3 4
ATCGA
GACTTACA
AACTGTA
ATC
...and I need to concatenate all lines except for the first "3 4" line. Is there a simple solution? I've tried manipulating getline(), but that has not worked for me.
Edit: The number of lines will not be known initially, so it will have to be done recursively.

If you want to concatenate two lines into one, you can simply concatenate the strings with "+", e.g. (in Java):
String a = "WAQAR MUGHAL";
String b = "check";
System.out.println(a + b);
System.out.println("WAQAR MUGHAL" + "CHECK");
Output:
WAQAR MUGHALcheck
WAQAR MUGHALCHECK
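The snippet above only demonstrates string concatenation with + in Java. For the original question (read the file, skip the first "3 4" line, and join the remaining lines into one string), a plain loop is enough; here is a minimal sketch of the idea in Python, assuming the input file is named input.txt (the question mentions getline(), so the original code is presumably C++, but the approach translates directly):
with open("input.txt") as f:          # "input.txt" is a placeholder name
    next(f)                           # skip the first "3 4" line
    big_string = "".join(line.strip() for line in f)
print(big_string)                     # ATCGAGACTTACAAACTGTAATC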

Related

Checking if a text file is formatted in a specific way

I have a text file which contains instructions. I'm reading it using File.readlines(filename). I want to check that the file is formatted as follows:
Has 3 lines
Line 1: two integers (including negatives) separated by a space
Line 2: two integers (including negatives) separated by a space and 1 capitalised letter of the alphabet also separated by a space.
Line 3: capitalised letters of the alphabet without any spaces (or punctuation).
This is what the file should look like:
8 10
1 2 E
MMLMRMMRRMML
So far I have calculated the number of lines using File.readlines(filename).length. How do I check the format of each line? Do I need to loop through the file?
EDIT:
I solved the problem by creating three methods containing regular expressions, then passing each line to its method and using a conditional statement to check whether the output was true.
Suppose IO::read is used to return the following string str.
str = <<~END
8 10
1 2 E
MMLMRMMRRMML
END
#=> "8 10\n1 2 E\nMMLMRMMRRMML\n"
You can then test the string with a single regular expression:
r = /\A(-?\d+) \g<1>\n\g<1> \g<1> [A-Z]\n[A-Z]+\n\z/
str.match?(r)
#=> true
I could have written
r = /\A-?\d+ -?\d+\n-?\d+ -?\d+ [A-Z]\n[A-Z]+\n\z/
but matching an integer (-?\d+) is done three times. It's slightly shorter, and reduces the chance of error, to put the first of the three in capture group 1, and then treat that as a subexpression by calling it with \g<1> (not to be confused with a back-reference, which is written \k<1>). Alternatively, I could have used named capture groups:
r = /\A(?<int>-?\d+) \g<int>\n\g<int> \g<int> (?<cap>[A-Z])\n\g<cap>+\n\z/

Filter sequences with more than 8 same consecutive nucleotides in a fastq file?

I want to filter out the sequences in my fastq files that have more than 8 identical consecutive nucleotides, like "GGGGGGGG", "CCCCCCCC", etc.
How should I do that?
The quick and incorrect way, which might be close enough: grep -E -B1 -A2 'A{8}|C{8}|G{8}|T{8}' yourfile.fastq.
This will miss blocks where the 8-mer is split across two lines (e.g. the first line ends with AAAA and the second starts with AAAA). It also assumes the output has blocks of 4 lines each.
The proper way: write a little program (in Python, or a language of your choice) which buffers one FASTQ block (e.g. 4 lines) and checks that the concatenation of the previous (buffered) block's sequence and the current block's sequence do not have an 8-mer as above. If that's the case, then output the buffered block.
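A minimal sketch of that idea in Python, assuming plain 4-line FASTQ records with the whole sequence on one line (file names are placeholders); it simply drops records whose sequence contains a run of 8 identical nucleotides, and does not handle the wrapped-sequence case described above:
import itertools
import re

run_of_8 = re.compile(r"A{8}|C{8}|G{8}|T{8}")

with open("input.fastq") as infile, open("filtered.fastq", "w") as outfile:
    while True:
        record = list(itertools.islice(infile, 4))   # one 4-line FASTQ record
        if len(record) < 4:
            break
        header, sequence, separator, quality = record
        if not run_of_8.search(sequence):            # keep records with no 8-mer run
            outfile.writelines(record)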
I ended up using the R code below, which solved my problem.
library(ShortRead)
fq <- FastqFile("/Users/path/to/file")
reads_fq <- readFastq(fq)
trimmed_fq <- reads_fq[grep("GGGGGGGG|TTTTTTTT|AAAAAAAA|CCCCCCCC",
                            sread(reads_fq), invert = TRUE)]
writeFastq(trimmed_fq, "new_name_for_fq.fastq", compress = FALSE)
You can use the Python package biotite for it (https://www.biotite-python.org).
Let's say you have the following FASTQ file:
@Read:01
CCCAAGGGCCCCCCCCCACTGCGATCACCTGGTTGCTGCCGGGAAAGGAGACCCAGGAGGTGAAACGGACTGGTGAATTG
CGGGGGTAGATATGGCGGGTGACACAAAAACATATAATCGGGCC
+
.+.+:'-FEAC-4'4CA-3-5#/4+?*G#?,<)<E&5(*82C9FH4G315F*DF8-4%F"9?H5535F7%?7#+6!FDC&
+4=4+,#2A)8!1B#,HA18)1*D1A-.HGAED%?-G10'6>:2
@Read:02
AACACTACTTCGCTGTCGCCAAAGGTTGGTGTAGGTCGGACTTCGAATTATCGATACTAGTTAGTAGTACGTCGCGTGGC
GTCAGCTCGTATGCTCTCAGAACAGGGAGAACTAGCACCGTAAGTAACCTAGCTCCCAAC
+
6%9,#'4A0&%.19,1E)E?!9/$.#?(!H2?+E"")?6:=F&FE91-*&',,;;$&?#2A"F.$1)%'"CB?5$<.F/$
7055E>#+/650B6H<8+A%$!A=0>?'#",8:#5%18&+3>'8:28+:5F0);E9<=,+
Here is a script that should do the job:
import biotite.sequence.io.fastq as fastq
import biotite.sequence as seq

# 'GGGGGGGG', 'CCCCCCCC', etc.
consecutive_nucs = [seq.NucleotideSequence(nuc * 8) for nuc in "ACGT"]

fastq_file = fastq.FastqFile("Sanger")
fastq_file.read("example.fastq")
# Iterate over sequence entries in file
for header in fastq_file:
    sequence = fastq_file.get_sequence(header)
    # Iterate over each of the consecutive sequences
    for consecutive_nuc in consecutive_nucs:
        # Find all indices where a match was found
        matches = seq.find_subsequence(sequence, consecutive_nuc)
        if len(matches) > 0:
            # If any match was found, report it
            print(
                f"Found '{consecutive_nuc}' "
                f"in sequence '{header}' at position {matches[0]}"
            )
This is the output:
Found 'CCCCCCCC' in sequence 'Read:01' at position 8

Find lines that have partial matches

So I have a text file that contains a large number of lines. Each line is one long string with no spacing; however, each line contains several pieces of information. The program knows how to differentiate the important information in each line. The program identifies the first 4 numbers/letters of the line as corresponding to a specific instrument. Here is a small example portion of the text file.
example text file
1002IPU3...
POIPIPU2...
1435IPU1...
1812IPU3...
BFTOIPD3...
1435IPD2...
As you can see, there are two lines in this text file that contain 1435, which corresponds to a specific instrument. However, these lines are not identical. The program I'm using cannot do its calculation if there are duplicates of the same station (i.e., two 1435* stations). I need a way to search through my text files and identify any duplicates of the partial strings that represent the stations, so that I can delete one or both of the duplicates. If a Bash script could output the line numbers of the duplicates and what the duplicate lines say, that would be appreciated. I think there might be an easy way to do this, but I haven't been able to find any examples. Your help is appreciated.
If all you want to do is detect if there are duplicates (not necessarily count or eliminate them), this would be a good starting point:
awk '{ if (++seen[substr($0, 1, 4)] > 1) printf "Duplicates found : %s\n",$0 }' inputfile.txt
For that matter, it's a good starting point for counting or eliminating, too, it'll just take a bit more work...
If you want the count of duplicates:
awk '{a[substr($0,1,4)]++} END {for (i in a) {if(a[i]>1) print i": "a[i]}}' test.in
1435: 2
or:
{
    a[substr($0,1,4)]++            # put prefixes to array and count them
}
END {                              # in the end
    for (i in a) {                 # go thru all indexes
        if(a[i]>1) print i": "a[i] # and print out the duplicate prefixes and their counts
    }
}
Slightly roundabout, but this should work:
cut -c 1-4 file.txt | sort -u > list
for i in `cat list`;
do
    echo -n "$i "
    grep -c ^"$i" file.txt    # this tells you how many occurrences of each 'station'
done
Then you can do whatever you want with the ones that occur more than once.
Use the following Python script (Python 2.7 syntax):
#!/usr/bin/python
file_name = "device.txt"
f1 = open(file_name, 'r')
device = {}
line_count = 0
for line in f1:
    line_count += 1
    if device.has_key(line[:4]):
        device[line[:4]] = device[line[:4]] + "," + str(line_count)
    else:
        device[line[:4]] = str(line_count)
f1.close()
print device
Here the script reads each line, treats the first 4 characters of each line as the device name, and builds a dictionary device whose keys are device names and whose values are the line numbers where each device name is found.
The output would be:
{'POIP': '2', '1435': '3,6', '1002': '1', '1812': '4', 'BFTO': '5'}
This might help you out!
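For Python 3, a minimally adapted sketch of the same idea (dict.has_key() no longer exists and print is a function; the file name is the same placeholder as above):
file_name = "device.txt"
device = {}
with open(file_name) as f1:
    for line_count, line in enumerate(f1, start=1):
        key = line[:4]                      # first 4 characters identify the station
        if key in device:
            device[key] += "," + str(line_count)
        else:
            device[key] = str(line_count)
print(device)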

Multiple sequence alignment. Convert multi-line format to single-line format?

I have a multiple sequence alignment file in which the lines from the different sequences are interspersed, as in the format output by Clustal and other popular multiple sequence alignment tools. It looks like this:
TGFb3_human_used_for_docking ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|B3KVH9|B3KVH9_HUMAN ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|G3UBH9|G3UBH9_LOXAF ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|G3WTJ4|G3WTJ4_SARHA ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
TGFb3_human_used_for_docking LRSADTTHST-
tr|B3KVH9|B3KVH9_HUMAN LRSADTTHST-
tr|G3UBH9|G3UBH9_LOXAF LRSTDTTHST-
tr|G3WTJ4|G3WTJ4_SARHA LRSADTTHST-
Each line begins with a sequence identifier, and then a sequence of characters (in this case describing the amino acid sequence of a protein). Each sequence is split into several lines, so you see that the first sequence (with ID TGFb3_human_used_for_docking) has two lines. I want to convert this to a format in which each sequence has a single line, like this:
TGFb3_human_used_for_docking ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
tr|B3KVH9|B3KVH9_HUMAN ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
tr|G3UBH9|G3UBH9_LOXAF ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSTDTTHST-
tr|G3WTJ4|G3WTJ4_SARHA ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
(In this particular example the sequences are almost identical, but in general they aren't!)
How can I convert from multi-line multiple sequence alignment format to single-line?
Looks like you need to write a script of some sort to achieve this. Here's a quick example I wrote in Python. It won't line the whitespace up prettily like in your example (if you care about that, you'll have to mess around with formatting), but it gets the rest of the job done.
# Create a dictionary to accumulate full sequences
full_sequences = {}

# Loop through the original file (replace test.txt with your file name)
# and add each line to the appropriate dictionary entry
with open("test.txt") as infile:
    for line in infile:
        line = [element.strip() for element in line.split()]
        if len(line) < 2:
            continue
        full_sequences[line[0]] = full_sequences.get(line[0], "") + line[1]

# Now loop through the dictionary and write each entry as a single line
outstr = ""
with open("test.txt", "w") as outfile:
    for seq in full_sequences:
        outstr += seq + "\t\t" + full_sequences[seq] + "\n"
    outfile.write(outstr)

Regex that doesn't accept spaces in between numbers

I am trying to parse lines like this:
0: abc 0.5
1: a 16.1,3
2: b 0.9,2.3
3: c -19.645
7:
which are in the format:
Number:[space][up to 4 letters from the range ABCD][space][comma separated numbers that could be decimal and/or negative]
with the Ruby command below:
if (line =~ /^(\d*): [abcd]{0,4} ((\-?)(\d*).(\d*))*/) then
do x
else
do y
However, it also matches the strings below, which I don't want it to, since they have " " or ":" in between numbers instead of ",".
4: d 0.8 16.56
5: d 0.9:5.0
How can I modify my regex to make it work for only comma separators?
Edit: The Rubular link if you would like to edit the Regular Expression is as follows: http://rubular.com/r/8Z9Eeu27i5
If I understand correctly, this one should work:
^\d+:\s[a-zA-Z]+\s(-?(\d+\.)?\d+,)*(-?(\d+\.)?\d+)$
EDIT:
If 7:[space][space] is valid too, then use this one:
^\d+:\s[a-zA-Z]*\s(-?(\d+\.)?\d+,)*(-?(\d+\.)?\d+)?$
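If you want to sanity-check the pattern outside Rubular, here is a minimal sketch using Python's re module (the regex itself is not Ruby-specific; in Ruby you would keep using line =~ /.../ as in the question):
import re

pattern = re.compile(r'^\d+:\s[a-zA-Z]+\s(-?(\d+\.)?\d+,)*(-?(\d+\.)?\d+)$')

tests = [
    "0: abc 0.5",      # should match
    "3: c -19.645",    # should match
    "4: d 0.8 16.56",  # should not match (space between numbers)
    "5: d 0.9:5.0",    # should not match (colon between numbers)
]
for line in tests:
    print(line, "->", bool(pattern.match(line)))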
