Matching all lines between two lines recursively in ruby - ruby

I would like to match all lines (including the first line) between two lines that start with 'SLX-', convert them to a comma separated line and then append them to a text file.
A truncated version of the original text file looks like:
SLX-9397._TC038IV_L_FLD0214.Read1.fq.gz
Sequences: 1406295
With index: 1300537
Sufficient length: 1300501
Min index: 0
Max index: 115
0 1299240
1 71
2 1
4 1
Unique: 86490
# reads processed: 86490
# reads with at least one reported alignment: 27433 (31.72%)
# reads that failed to align: 58544 (67.69%)
# reads with alignments suppressed due to -m: 513 (0.59%)
Reported 27433 alignments to 1 output stream(s)
SLX-9397._TC044II_D_FLD0197.Read1.fq.gz
Sequences: 308905
With index: 284599
Sufficient length: 284589
Min index: 0
Max index: 114
0 284290
1 16
Unique: 32715
# reads processed: 32715
# reads with at least one reported alignment: 13114 (40.09%)
# reads that failed to align: 19327 (59.08%)
# reads with alignments suppressed due to -m: 274 (0.84%)
Reported 13114 alignments to 1 output stream(s)
SLX-9397._TC047II_D_FLD0220.Read1.fq.gz
I imagine the Ruby would look something like this:
Convert all \n between two lines starting with SLX- to commas
Save the converted text as a new text file (or, even better, a CSV file).
I think I specifically have a problem with how to find and replace between two specific lines.
I guess I could do this without using ruby, but seeing as I'm trying to get into Ruby...

Assuming that you have your string in str:
require 'csv'
CSV.open("/tmp/file.csv", "wb") do |csv|
  str.scan(/^(SLX-.*?)(?=\R+SLX-)/m).map do |s| # one match per block that is followed by another SLX- line
    s.first.split($/).map do |el| # split the block into lines
      "'#{el}'" # wrap each value in single quotes
    end
  end.each do |line| # iterate over the rows
    csv << line # append the row to the CSV file
  end
end
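One caveat with the pattern above: its lookahead only succeeds when another SLX- line follows, so the final record of a file is never captured. A minimal sketch of one way around that is shown here; the helper name slx_records and the sample data are made up for illustration. It adds an end-of-string alternative to the lookahead:

```ruby
# Hypothetical helper (not from the original answer): returns one array of
# lines per record, where a record starts at a line beginning with "SLX-".
def slx_records(str)
  # `\R*\z` is the added alternative: it lets the last record match even
  # though no further SLX- line follows it.
  str.scan(/^(SLX-.*?)(?=\R+SLX-|\R*\z)/m).flatten.map do |record|
    record.split(/\R/)
  end
end

# Made-up sample data in the same shape as the question's file.
sample = "SLX-a.fq.gz\nSequences: 10\nUnique: 3\n" \
         "SLX-b.fq.gz\nSequences: 20\nUnique: 5\n"

slx_records(sample).each { |fields| puts fields.join(',') }
```

Each returned array can be appended with csv << fields inside a CSV.open block, which leaves quoting to the CSV library instead of adding quotes by hand.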

I don't know much about Ruby, but this should work. Read the entire file into a String. Use this regex - (\RSLX-) - to match every SLX- except the first one and replace it with ,SLX-. For an explanation of the regex, see https://regex101.com/r/pP3pP3/1
This question - Ruby replace string with captured regex pattern - might help you to understand how to replace in ruby

Related

Mnetgen: bash-files and syntax errors - where's the bug?

I want to create Multicommodity Min Cost Flow instances with the help of
Mnetgen but I have problems with the file called "batch" whose first lines are given by
# Batch file for generating MMCF problems with the mnetgen random generator
#
# For each n in {64, 128, 256} generates 12 instances for each pair (n, k)
# with k in {4, 8, 16 , ... , n}, using as input the parameters found
# in pr{n}.{k}/{n}-{k}-{h}.inp for h in {1, ... , 12}. The instances
# are left in the directory pr{n}.{k}
#
# At the end of the file, commented out, there are the instructions for
# generating the groups with n = 512 and n = 768: in the latter case,
# however, only 6 instances for each group are generated.
#
# In a Unix environment, simply type "source batch" or "csh < batch"
foreach i ( 64 )
foreach j ( 4 8 16 32 64 )
foreach h ( 1 2 3 4 5 6 7 8 9 10 11 12 )
mnetgen pr$i.$j/$i-$j-$h.inp pr$i.$j/$i-$j-$h
end
end
end
...
What have I done so far? First I added #include <cstring> to mnetgen.c to avoid errors. Then I ran make to build the mnetgen executable. The last step would be to generate the instances by using the batch file.
Using the hint in the last comment line, I get either
bash: batch: line 14: syntax error near unexpected token `('
bash: batch: line 14: `foreach i ( 64 )'
or
mnetgen: Command not found.
How can I fix that?
You are trying to run a csh script in bash.
To fix that, either run
csh batch
or add this as the first line of the script:
#!/bin/csh
When you execute the script directly, Unix/Linux reads this first line (the "shebang", which acts as a kind of magic number) and runs the script with /bin/csh. This only applies when the script is executed as a program (for example ./batch after chmod +x); typing "source batch" in bash still makes bash interpret it.
(A more portable form may be #!/usr/bin/env csh.)
As for the mnetgen command that is not found, I suggest that you use the full path of the executable in your script or add its directory to your PATH.

Transpose a list based on specific text in ruby

I have one file which is one long list of different patient samples. Each sample always starts with "SLX" as below:
I would like to transpose each sample into a CSV with the output shown below. I know that the CSV library might be able to do this but I don't know how to approach it as I would have to transpose only when the line starting with SLX is matched.
Input:
SLX.1767356.fdfsIH.fq.gz
Sequences: 160220
With index: 139019
Sufficient length: 139018
Min index: 0
Max index: 83
Unique: 48932
# reads processed: 48932
# reads with at least one reported alignment: 21172 (43.27%)
# reads that failed to align: 27022 (55.22%)
# reads with alignments suppressed due to -m: 738 (1.51%)
Reported 21172 alignments to 1 output stream(s)
SLX.94373.GHDUA_.fq.gz
Sequences: 28232
With index: 24875
Sufficient length: 24875
Min index: 3
Max index: 41
Unique: 14405
# reads processed: 14405
# reads with at least one reported alignment: 8307 (57.67%)
# reads that failed to align: 5776 (40.10%)
# reads with alignments suppressed due to -m: 322 (2.24%)
Reported 8307 alignments to 1 output stream(s)
SLX.73837.BLABLA_Control.fq.gz
Sequences: 248466
With index: 230037
Sufficient length: 230036
Min index: 0
Max index: 98
Unique: 64883
# reads processed: 64883
# reads with at least one reported alignment: 24307 (37.46%)
# reads that failed to align: 39764 (61.29%)
# reads with alignments suppressed due to -m: 812 (1.25%)
Reported 24307 alignments to 1 output stream(s)
Output
SLX.10456.FastSeqI_Control_OC_AH_094.fq.gz Sequences: 160220 With index: 139019 Sufficient length: 139018 Min index: 0 Max index: 83 Unique: 48932 # reads processed: 48932 # reads with at least one reported alignment: 21172 (43.27%) # reads that failed to align: 27022 (55.22%) # reads with alignments suppressed due to -m: 738 (1.51%) Reported 21172 alignments to 1 output stream(s) mv: /Volumes/SeagateBackupPlusDriv1/SequencingRawFiles/TumourOesophagealOCCAMS/MetaOCCAMSTumoursRawFiles/LCMDysplasiaAndCancer_LCM_PS14_1105_1F/SLX.10456.FastSeqI_Control_OC_AH_094.fq.gz and /Volumes/SeagateBackupPlusDriv1/SequencingRawFiles/TumourOesophagealOCCAMS/MetaOCCAMSTumoursRawFiles/LCMDysplasiaAndCancer_LCM_PS14_1105_1F/SLX.10456.FastSeqI_Control_OC_AH_094.fq.gz are identical
SLX.10456.FastSeqI_Control_OC_ED_008_F1_.fq.gz Sequences: 28232 With index: 24875 Sufficient length: 24875 Min index: 3 Max index: 41 Unique: 14405 # reads processed: 14405 # reads with at least one reported alignment: 8307 (57.67%) # reads that failed to align: 5776 (40.10%) # reads with alignments suppressed due to -m: 322 (2.24%) Reported 8307 alignments to 1 output stream(s)
SLX.10456.FastSeqJ_OC_AH_086_F1_Control.fq.gz Sequences: 248466 With index: 230037 Sufficient length: 230036 Min index: 0 Max index: 98 Unique: 64883 # reads processed: 64883 # reads with at least one reported alignment: 24307 (37.46%) # reads that failed to align: 39764 (61.29%) # reads with alignments suppressed due to -m: 812 (1.25%) Reported 24307 alignments to 1 output stream(s)
OK, it’s so easy that I will post an answer.
input.scan(/^SLX.*?(?=^SLX|\z)/m)
     .map { |p| p.split($/).map { |e| %Q|"#{e}"| }.join(', ') }
     .join($/)
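Since the question mentions the CSV library, the same grouping can also be done line by line without a regex. The following is only a sketch under assumptions: the helper name samples_to_csv is hypothetical, and it assumes every record starts with a line beginning with "SLX". It uses Enumerable#slice_before to start a new row at each such line and lets CSV handle the quoting:

```ruby
require 'csv'

# Hypothetical helper: groups lines into records, starting a new record at
# every line that begins with "SLX", and returns the result as CSV text.
def samples_to_csv(text)
  records = text.lines
                .map(&:chomp)
                .reject(&:empty?)
                .slice_before { |line| line.start_with?('SLX') }

  CSV.generate do |csv|
    records.each { |record| csv << record }
  end
end

# Made-up sample in the same shape as the input above.
sample = "SLX.1.fq.gz\nSequences: 1\nUnique: 2\nSLX.2.fq.gz\nSequences: 3\n"
puts samples_to_csv(sample)
```

For a file on disk, File.read(path) could be passed as text and the returned string written out with File.write.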

Weird output characters (Chinese characters) when using Ruby to read / write CSV

I'm trying to print the first 5 lines from a set of large (>500MB) csv files into small headers in order to inspect the content more easily.
I'm using Ruby code to do this but am getting each line padded out with extra Chinese characters, like this:
week_num type ID location total_qty A_qty B_qty count਍㌀㐀ऀ猀漀爀琀愀戀氀攀ऀ㄀㤀㜀ऀ䐀䔀开伀渀氀礀ऀ㔀㐀㜀㈀ ㌀ऀ㔀㐀㜀㈀ ㌀ऀ ऀ㤀㄀㈀㔀㌀ഀ
44 small 14 A 907859 907859 0 550360਍㐀㄀ऀ猀漀爀琀愀戀氀攀ऀ㐀㈀㄀ऀ䐀䔀开伀渀氀礀ऀ㌀ ㈀㄀㜀㐀ऀ㌀ ㈀㄀
The first few lines of input file are like so:
week_num type ID location total_qty A_qty B_qty count
34 small 197 A 547203 547203 0 91253
44 small 14 A 907859 907859 0 550360
41 small 421 A 302174 302174 0 18198
The strange characters appear to be Line 1 and Line 3 of the data.
Here's my Ruby code:
num_lines=ARGV[0]
fh = File.open(file_in,"r")
fw = File.open(file_out,"w")
until (line=fh.gets).nil? or num_lines==0
fw.puts line if outflag
num_lines = num_lines-1
end
Any idea what's going on and what I can do to simply stop at the line end character?
Looking at the input/output files in hex (useful suggestion by @user1934428):
Input file - each character looks to be two bytes.
Output file - notice the NULL (00) between each single byte character...
Ruby version 1.9.1
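The problem is an encoding mismatch, which happens because the encoding is not specified explicitly when reading and writing. The input is UTF-16LE (two bytes per character). The stray byte is most likely the result of newline translation in text mode (on Windows a "\n" byte is written as "\r\n"), which pushes every following two-byte character out of step until the next newline. Read the input CSV in binary mode ("rb") with the utf-16le encoding, and write the output the same way.
num_lines = ARGV[0].to_i # ARGV values are strings, so convert before comparing with 0
# ****** Specifying the right encodings <<<< this is the key
fh = File.open(file_in, "rb:utf-16le")
fw = File.open(file_out, "wb:utf-16le")
until (line = fh.gets).nil? or num_lines == 0
  fw.puts line
  num_lines = num_lines - 1
end
fh.close
fw.close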
Useful references:
Working with encodings in Ruby 1.9
CSV encodings
Determining the encoding of a CSV file
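If the small header files do not need to stay in UTF-16LE, another option is to transcode to UTF-8 while reading. This is a sketch rather than the answer's method; the helper name head_utf16le and its parameters are hypothetical, and it assumes the input really is UTF-16LE:

```ruby
# Hypothetical helper: copy the first `num_lines` lines of a UTF-16LE file
# into a UTF-8 file.
def head_utf16le(path_in, path_out, num_lines)
  # "rb:UTF-16LE:UTF-8" means: binary mode, external encoding UTF-16LE,
  # strings handed to the program transcoded to UTF-8.
  File.open(path_in, 'rb:UTF-16LE:UTF-8') do |fh|
    File.open(path_out, 'wb:UTF-8') do |fw|
      # Enumerator#first stops reading after num_lines lines, so a large
      # input file is not loaded in full.
      fh.each_line.first(num_lines).each { |line| fw.write(line) }
    end
  end
end
```

Binary mode on both sides keeps the original line endings untouched, and the block form of File.open closes both files automatically.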

How to concatenate MP3 files with Ruby

I have a few MP3 files as binary strings with the same number of channels and the same sample rate. I need to concatenate them in memory without using command line tools.
Currently I just do string concatenation, like this:
out = ''
mp3s.each { |mp3| out << mp3 }
Audio players can play the result, but with some warnings, because the MP3 headers were not handled correctly, as far as I understand.
Is there a more correct way to do the concatenation?
After reading an article about MP3 (in Russian), I came up with a solution.
The complete ID3 specification should be available at http://id3.org/ but the site seems to be down at the moment.
An MP3 file usually has the following layout:
[ID3 head(10 bytes) | ID3 tags | MP3 frames ]
ID3 is not part of the MP3 format; it is a kind of container used to store information such as artist, album, etc.
The audio data itself is stored in MP3 frames. Every frame starts with a 4-byte header that provides meta information (codec details, bitrate, etc.).
Every frame holds a fixed number of samples, so if there are not enough samples to fill the last frame, the encoder adds silence to pad it. I also found chunks like
LAME3.97 (the name and version of the encoder).
So all we need to do is get rid of the ID3 container. The following solution works perfectly for me: no more warnings, and the output file is smaller:
# Length of the header that describes the ID3 container
ID3_HEADER_SIZE = 10
# Get the size of the ID3 container.
# The length is stored in 4 bytes, and the most significant bit of every byte is ignored.
#
# Example:
# Hex: 00 00 07 76
# Bin: 00000000 00000000 00000111 01110110
# Real bin: 111 1110110
# Real dec: 1014
#
def get_id3_size(header)
  result = 0
  str = header[6..9]
  # Read the 4 size bytes from left to right, masking out the top bit
  # of every byte.
  4.times do |i|
    result += (str[i].ord & 0x7F) * (2 ** (7 * (3 - i)))
  end
  result
end
def strip_mp3!(raw_mp3)
  # Nothing to strip if the data does not start with an ID3 tag.
  return raw_mp3 unless raw_mp3.start_with?('ID3')
  # 10 bytes that describe the ID3 container.
  id3_header = raw_mp3[0...ID3_HEADER_SIZE]
  id3_size = get_id3_size(id3_header)
  # Offset at which the MP3 frames start
  offset = id3_size + ID3_HEADER_SIZE
  # Get rid of the ID3 container
  raw_mp3.slice!(0...offset)
  raw_mp3
end
# Read raw mp3s
hi = File.binread('hi.mp3')
bye = File.binread('bye.mp3')
# Get rid of ID3 tags
strip_mp3!(hi)
strip_mp3!(bye)
# Concatenate mp3 frames
hi << bye
# Save result to disk
File.binwrite('out.mp3', hi)
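The same idea can be applied to an array of in-memory strings, as in the question's mp3s variable. The sketch below is an illustration, not the answer's code: the names syncsafe_to_int, strip_tags and concat_mp3s are hypothetical. In addition to the leading ID3v2 container it also drops a trailing ID3v1 tag, which, when present, occupies the last 128 bytes of a file and begins with the ASCII marker TAG. An ID3v2.4 footer is not handled.

```ruby
ID3V2_HEADER_SIZE = 10
ID3V1_TAG_SIZE = 128

# Decode a 4-byte "syncsafe" size: 7 useful bits per byte, top bit ignored.
def syncsafe_to_int(bytes)
  bytes.each_byte.reduce(0) { |acc, byte| (acc << 7) | (byte & 0x7F) }
end

# Hypothetical helper: returns a binary copy of `raw` without a leading
# ID3v2 container and without a trailing ID3v1 tag.
def strip_tags(raw)
  data = raw.b
  if data.bytesize >= ID3V2_HEADER_SIZE && data.start_with?('ID3')
    size = syncsafe_to_int(data[6, 4])
    data = data[(ID3V2_HEADER_SIZE + size)..-1] || ''.b
  end
  if data.bytesize >= ID3V1_TAG_SIZE && data[-ID3V1_TAG_SIZE, 3] == 'TAG'
    data = data[0...-ID3V1_TAG_SIZE]
  end
  data
end

# Hypothetical helper: concatenate several MP3 binary strings in memory.
def concat_mp3s(mp3s)
  mp3s.map { |mp3| strip_tags(mp3) }.join
end
```

With the question's variable this would be used as out = concat_mp3s(mp3s). Unlike strip_mp3!, it leaves the input strings unmodified.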

Replace the n-th byte in a file with another byte

In Ruby, how do I replace, say, the 7th byte of a file with another byte?
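Use the binwrite method from the IO class:
IO.binwrite("testfile", [0x0D].pack("C"), 7) # => 1
# If the file began with "This is line one\n", it now begins with "This is\rline one\n"
0x0D is 13 (a carriage return). The offset is zero-based, so 7 addresses the eighth byte; pass 6 if you count the 7th byte from one. Because an offset is given, the file is not truncated and only that byte is overwritten.
You may also want to read about the Array#pack method.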
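The same replacement can be written with an explicit open, seek and write, which may be easier to extend when several bytes need patching. This is a sketch; the helper name replace_byte is hypothetical and the demo file is a temporary file created only for illustration:

```ruby
require 'tempfile'

# Hypothetical helper: overwrite the byte at zero-based `offset` with
# `byte` (an Integer in 0..255), leaving the rest of the file untouched.
def replace_byte(path, offset, byte)
  File.open(path, 'r+b') do |f| # read/write, binary, no truncation
    f.seek(offset)
    f.write([byte].pack('C'))
  end
end

Tempfile.create('demo') do |tmp|
  tmp.write('ABCDEFGHIJ')
  tmp.flush
  replace_byte(tmp.path, 6, 0x2A) # 7th byte (offset 6) becomes "*"
  puts File.binread(tmp.path)     # prints "ABCDEF*HIJ"
end
```

The 'r+b' mode matters here: opening with 'wb' would truncate the file before the write.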
