count specific lines in specific files in a folder - ruby

I'm fairly new to ruby but this is testing me
I want to count all the lines in any file that ends in bowtie.txt in a folder
The lines have to start with a number of varying length followed by a '+' or a '-' (with or without whitespace inbetween. Sometimes the lines are wrapped but I don't know if this matters).
I want to then create a hash that stores the filename with the count associated with it.
I've got as far I think as looping through the directory to select the files out and then counting the number of lines in that file but how do I then create the hash and return it?
The file data looks like:
0 + chr12 129402816 ACACAGGGAGGGGAATAACACACACTGGGACCTGTCAGGAGAGGGTAGGGCTGGGGGCATCAGGAGAGCATCAGGAAAAATAGCTAATGCATGCTGGGCT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0
2 - chr5 93625939 TCAACCTGTCATCTACATTAGGTATTTCTCCTAATGCTATCCCTCCCCTAGCCCCCCACCACCCAACAGACCCTGGTGTGTGATGTTCCCCTCCCTGTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 5:T>C
5 + chr3 155023119 ACACAGGGAGGGGAACATCACACACCGGGGCCTGTAGTGGGGGTGAGGGGCAAGAGGAGGAATAGCATTAGGAGAAATACCTAATGTAGATGACCGGTTG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0
7 + chr2 22818055 ACACAGGGAGGGGAAAAACACACACTGGGGCTTCTCAGGGGTGGTGGGGGGAGAGCATCAGGATAAATAGCTAATGCATGCAGGGCTTAATACCTAGGTG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0
8 + chr3 131206106 ACACAGGGAGGGGAACATCACACACCAGGCCCTGTCAGCGGTGAGGGGCTGGGGGAGGGATAGCATTAAGAGAAATACCTAATATAAATGACGAGTTGAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 8:C>A
10 + chrX 108455592 ACACAGGGAGGGGAACATCACACACCAGGGCCTGTCGGGCAGTGGGGGGGCAAAGGGAGGGATTAAGTCATACACCCAATGCATGTGGGGCTTAAAACCC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 7:A>G
11 - chr2 31936302 ACCCATTAACTCGTCATTTACATTAGGTATATCTCCTAATGCTATCCCTCCCCCCACCCCACAACAGGCCCCCCGGTGTGTGATGTTCCCCTCCCTGTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 7:T>C
This is what I am trying to get at the end
blablabla.bowtie.txt : 27998
blablafsfds.bowtie.txt : 25987
etc
This is my attempt at the code:
Dir[File.join('/Volumes/SeagateBackupPlusDriv/SequencingRawFiles/TumourOesophagealOCCAMS/SequencingScripts/3finalcounts', '*.bowtie.txt')].each |file| do
puts File.open(file) { |f| f.grep(/^[0-9]*.\+|\-/).count }
end

Untested, since I have no input files, but likely working:
# `Dir[]` expects it’s own format
# ⇓ will inject results into hash
Dir['/Volumes/.../*.bowtie.txt'].inject({}) do |memo, file|
memo[file] = File.readlines(file).select do |line|
line =~ /^[0-9]+\s*(\+|\-)/ # only those, matching
end.count
memo
end
Additional references: IO#readlines, Enumerable#select, Enumerable#inject.

Related

Get line number where first occurrence of a value appears?

I have a CSV file like below:
E Run 1 Run 2 Run 3 Run 4 Run 5 Run 6 Mean
1 0.7019 0.6734 0.6599 0.6511 0.701 0.6977 0.680833333
2 0.6421 0.6478 0.6095 0.608 0.6525 0.6285 0.6314
3 0.6039 0.6096 0.563 0.5539 0.6218 0.5716 0.5873
4 0.5564 0.5545 0.5138 0.4962 0.5781 0.5154 0.535733333
5 0.5056 0.4972 0.4704 0.4488 0.5245 0.4694 0.485983333
I'm trying to use find the row number where the final column has a value below a certain range. For example, below 0.6.
Using the above CSV file, I want to return 3 because E = 3 is the first row where Mean <= 0.60. If there is no value below 0.6 I want to return 0. I am in effect returning the value in the first column based on the final column.
I plan to initialize this number as a constant in gnuplot. How can this be done? I've tagged awk because I think it's related.
In case you want a gnuplot-only version... if you use a file remove the datablock and replace $Data by your filename in " ".
Edit: You can do it without a dummy table, it can be done shorter with stats (check help stats). Even shorter than the accepted solution (well, we are not at code golf here), but additionally platform-independent because it's gnuplot-only.
Furthermore, in case E could be any number, i.e. 0 as well, then it might be better
to first assign E = NaN and then compare E to NaN (see here: gnuplot: How to compare to NaN?).
Script:
### conditional extraction into a variable
reset session
$Data <<EOD
E Run 1 Run 2 Run 3 Run 4 Run 5 Run 6 Mean
1 0.7019 0.6734 0.6599 0.6511 0.701 0.6977 0.680833333
2 0.6421 0.6478 0.6095 0.608 0.6525 0.6285 0.6314
3 0.6039 0.6096 0.563 0.5539 0.6218 0.5716 0.5873
4 0.5564 0.5545 0.5138 0.4962 0.5781 0.5154 0.535733333
5 0.5056 0.4972 0.4704 0.4488 0.5245 0.4694 0.485983333
EOD
E = NaN
stats $Data u ($8<=0.6 && E!=E? E=$1 : 0) nooutput
print E
### end of script
Result:
3.0
Actually, OP wants to return E=0 if the condition was not met. Then the script would be like this:
E=0
stats $Data u ($8<=0.6 && E==0? E=$1 : 0) nooutput
Another awk. You could initialize the default return value to var ret in BEGIN but since it's 0 there is really no point as empty var+0 produces the same effect. If the threshold value of 0.6 is not met before the ENDis reached, that is returned. If it is met, exit invokes the END and ret is output:
$ awk '
NR>1 && $NF<0.6 { # final column has a value below a certain range
ret=$1 # I want to return 3 because E = 3
exit
}
END {
print ret+0
}' file
Output:
3
Something like this should do the trick:
awk 'NR>1 && $8<.6 {print $1;fnd=1;exit}END{if(!fnd){print 0}}' yourfile

Ruby - How to subtract numbers of two files and save the result in one of them on a specified position?

I have 2 txt files with different strings and numbers in them splitted with ;
Now I need to subtract the
((number on position 2 in file1) - (number on position 25 in file2)) = result
Now I want to replace the (number on position 2 in file1) with the result.
I tried my code below but it only appends the number in the end of the file and its not the result of the calculation which got appended.
def calc
f1 = File.open("./file1.txt", File::RDWR)
f2 = File.open("./file2.txt", File::RDWR)
f1.flock(File::LOCK_EX)
f2.flock(File::LOCK_EX)
f1.each.zip(f2.each).each do |line, line2|
bg = line.split(";").compact.collect(&:strip)
bd = line2.split(";").compact.collect(&:strip)
n = bd[2].to_i - bg[25].to_i
f2.print bd[2] << n
#puts "#{n}" Only for testing
end
f1.flock(File::LOCK_UN)
f2.flock(File::LOCK_UN)
f1.close && f2.close
end
Use something like this:
lines1 = File.readlines('file1.txt').map(&:to_i)
lines2 = File.readlines('file2.txt').map(&:to_i)
result = lines1.zip(lines2).map do |value1, value2| value1 - value2 }
File.write('file1.txt', result.join(?\n))
This code load all files in memory, then calculate result and write it to first file.
FYI: If you want to use your code just save result to other file (i.e. result.txt) and at the end copy it to original file.

How to write some value to a text file in ruby based on position

I need some help is some unique solution. I have a text file in which I have to replace some value based on some position. This is not a big file and will always contain 5 lines with fixed number of length in all the lines at any given time. But I have to specficaly replace soem text in some position only. Further, i can also put in some text in required position and replace that text with required value every time. I am not sure how to implement this solution. I have given the example below.
Line 1 - 00000 This Is Me 12345 trying
Line 2 - 23456 This is line 2 987654
Line 3 - This is 345678 line 3 67890
Consider the above is the file I have to use to replace some values. Like in line 1, I have to replace '00000' with '11111' and in line 2, I have to replace 'This' with 'Line' or any require four digit text. The position will always remain the same in text file.
I have a solution which works but this is for reading the file based on position and not for writing. Can someone please give a solution similarly for wrtiting aswell based on position
Solution for reading the file based on position :
def read_var file, line_nr, vbegin, vend
IO.readlines(file)[line_nr][vbegin..vend]
end
puts read_var("read_var_from_file.txt", 0, 1, 3) #line 0, beginning at 1, ending at 3
#=>308
puts read_var("read_var_from_file.txt", 1, 3, 6)
#=>8522
I have also tried this solution for writing. This works but I need it to work based on position or based on text present in the specific line.
Explored solution to wirte to file :
open(Dir.pwd + '/Files/Try.txt', 'w') { |f|
f << "Four score\n"
f << "and seven\n"
f << "years ago\n"
}
I made you a working sample anagraj.
in_file = "in.txt"
out_file = "out.txt"
=begin
=>contents of file in.txt
00000 This Is Me 12345 trying
23456 This is line 2 987654
This is 345678 line 3 67890
=end
def replace_in_file in_file, out_file, shreds
File.open(out_file,"wb") do |file|
File.read(in_file).each_line.with_index do |line, index|
shreds.each do |shred|
if shred[:index]==index
line[shred[:begin]..shred[:end]]=shred[:replace]
end
end
file << line
end
end
end
shreds = [
{index:0, begin:0, end:4, replace:"11111"},
{index:1, begin:6, end:9, replace:"Line"}
]
replace_in_file in_file, out_file, shreds
=begin
=>contents of file out.txt
11111 This Is Me 12345 trying
23456 Line is line 2 987654
This is 345678 line 3 67890
=end

ruby multiple loop sets but with limited rows per set

Alrightie, so I'm building an CSV file this time with ruby. The outer loop will run up to length of num_of_loops, but it runs for an entire set rather than up to the specified row. I want to change the first column of a CSV file to a new name for each row.
If I do this:
class_days = %w[Wednesday Thursday Friday]
num_of_loops = (num_of_loops / class_days.size).ceil
num_of_loops.times {
["Wednesday","Thursday","Friday"].each do |x|
data[0] = x
data[4] = classname()
# Write all to file
#
csv << data
end
}
Then the loop will run only 3 times for a 5 row request.
I'd like it to run the full 5 rows such that instead of stopping at Wed/Thurs/Fri it goes to Wed/Thurs/Fri/Wed/Thurs instead.
class_days = %w[Wednesday Thursday Friday]
num_of_loops.times do |i|
data[0] = class_days[i % class_days.size]
data[4] = classname
csv << data
end
The interesting part is here:
class_days[i % class_days.size]
We need an index into class_days that is between 0 and class_days.size - 1. We can get that with the % (modulo) operator. That operator yields the remainder after dividing i by class_days.size. This table shows how it works:
i i % 3
0 0
1 1
2 2
3 0
4 1
5 2
...
The other key part is that the times method yields indices starting with 0.

Ruby data extraction from a text file

I have a relatively big text file with blocks of data layered like this:
ANALYSIS OF X SIGNAL, CASE: 1
TUNE X = 0.2561890123390808
Line Frequency Amplitude Phase Error mx my ms p
1 0.2561890123391E+00 0.204316425208E-01 0.164145385871E+03 0.00000000000E+00 1 0 0 0
2 0.2562865535359E+00 0.288712798671E-01 -.161563284233E+03 0.97541196785E-04 1 0 0 0
(they contain more lines and then are repeated)
I would like first to extract the numerical value after TUNE X = and output these in a text file. Then I would like to extract the numerical value of LINE FREQUENCY and AMPLITUDE as a pair of values and output to a file.
My question is the following: altough I could make something moreorless working using a simple REGEXP I'm not convinced that it's the right way to do it and I would like some advices or examples of code showing how I can do that efficiently with Ruby.
Generally, (not tested)
toggle=0
File.open("file").each do |line|
if line[/TUNE/]
puts line.split("=",2)[-1].strip
end
if line[/Line Frequency/]
toggle=1
next
end
if toggle
a = line.split
puts "#{a[1]} #{a[2]}"
end
end
go through the file line by line, check for /TUNE/, then split on "=" to get last item.
Do the same for lines containing /Line Frequency/ and set the toggle flag to 1. This signify that the rest of line contains the data you want to get. Since the freq and amplitude are at fields 2 and 3, then split on the lines and get the respective positions. Generally, this is the idea. As for toggling, you might want to set toggle flag to 0 at the next block using a pattern (eg SIGNAL CASE or ANALYSIS)
file = File.open("data.dat")
#tune_x = #frequency = #amplitude = []
file.each_line do |line|
tune_x_scan = line.scan /TUNE X = (\d*\.\d*)/
data_scan = line.scan /(\d*\.\d*E[-|+]\d*)/
#tune_x << tune_x_scan[0] if tune_x_scan
#frequency << data_scan[0] if data_scan
#amplitude << data_scan[0] if data_scan
end
There are lots of ways to do it. This is a simple first pass at it:
text = 'ANALYSIS OF X SIGNAL, CASE: 1
TUNE X = 0.2561890123390808
Line Frequency Amplitude Phase Error mx my ms p
1 0.2561890123391E+00 0.204316425208E-01 0.164145385871E+03 0.00000000000E+00 1 0 0 0
2 0.2562865535359E+00 0.288712798671E-01 -.161563284233E+03 0.97541196785E-04 1 0 0 0
ANALYSIS OF X SIGNAL, CASE: 1
TUNE X = 1.2561890123390808
Line Frequency Amplitude Phase Error mx my ms p
1 1.2561890123391E+00 0.204316425208E-01 0.164145385871E+03 0.00000000000E+00 1 0 0 0
2 1.2562865535359E+00 0.288712798671E-01 -.161563284233E+03 0.97541196785E-04 1 0 0 0
ANALYSIS OF X SIGNAL, CASE: 1
TUNE X = 2.2561890123390808
Line Frequency Amplitude Phase Error mx my ms p
1 2.2561890123391E+00 0.204316425208E-01 0.164145385871E+03 0.00000000000E+00 1 0 0 0
2 2.2562865535359E+00 0.288712798671E-01 -.161563284233E+03 0.97541196785E-04 1 0 0 0
'
require 'stringio'
pretend_file = StringIO.new(text, 'r')
That gives us a StringIO object we can pretend is a file. We can read from it by lines.
I changed the numbers a bit just to make it easier to see that they are being captured in the output.
pretend_file.each_line do |li|
case
when li =~ /^TUNE.+?=\s+(.+)/
print $1.strip, "\n"
when li =~ /^\d+\s+(\S+)\s+(\S+)/
print $1, ' ', $2, "\n"
end
end
For real use you'd want to change the print statements to a file handle: fileh.print
The output looks like:
# >> 0.2561890123390808
# >> 0.2561890123391E+00 0.204316425208E-01
# >> 0.2562865535359E+00 0.288712798671E-01
# >> 1.2561890123390808
# >> 1.2561890123391E+00 0.204316425208E-01
# >> 1.2562865535359E+00 0.288712798671E-01
# >> 2.2561890123390808
# >> 2.2561890123391E+00 0.204316425208E-01
# >> 2.2562865535359E+00 0.288712798671E-01
You can read your file line by line and cut each by number of symbol, for example:
to extract tune x get symbols from
10 till 27 on line 2
to extract LINE FREQUENCY get
symbols from 3 till 22 on line 6+n

Resources