How to extract lines from text file "block" in loop - bash

I have a huge text file and I want to use grep to check whether some "blocks" in my text file exist in another file. So, I need to extract these blocks first.
This is my file:
>gi|60117238|gb|AY897435.1| Wolbachia endosymbiont of Drosophila mojavensis, genomic survey sequence
TCTGTTGCGAGTGTGCTGATAACTACTGAATCTATGATAGTTGATGTACCAAGCAAAGAAAATGCTTCATCTCCTATGGG
TGCAGGAGAAATGAGTGGCATGGGTGGATTCTAAGTAGAATGAAACCGTGGAGCAATTGCTCCACGGTAGTTCCAAAAAA
TCTCACATTTTACTATTCGTTAAAGGTAATACGTTTGGTGCAGAAATGCACTACTGTTTGCATCCGTTTCGCTCCTTTAT
ATTGTGGTTGTCTAATAACAAAAAGGCAGCATAAGAAAACTATAACACCTAGTATATTTATACTATAGCTGACCCAAGCA
ACACGTCATACCGCGATTCATTCCACAACTGTACGAACATTACAATATGGCACATAGTAAACGATGTCATGAAAGTAGCT
GACACTGGAATTCAGAAAAAAGGATTATGTCATTCCAGTGCTTGACACTGGAATCCAGCATTTCCATAATCATCAAAACA
TTGTATTTTAACAAAAAACATGTATTTTTATGCTTGCCAACTTAATAAAATTCCTGGATCCCAGTGTCAAGCACTGGGAT
GACAC
>gi|60117239|gb|AY897436.1| Wolbachia endosymbiont of Drosophila mojavensis, genomic survey sequence
TTTTCATCGCTCATGTCCTTAGTTTACCCCCTGTTTCACCATTACATTAATATCTACAGAACCTCCCACTGGGGAGTAGT
AATCTAGGATAGTTTCTATCACTAAAACGCGTGGTATTCCTTTATTTTTTACCAATTTTAAATAAGACAATACCTTATTA
TCATCATAATGCTGCAGAAAGCGGCAAAAGACACCTAATTCATAATTTGTAGCTGATAATTCTTCTTGAGTTATGAGTTT
AATTTTTAAATCTTCTACTGCCTGCCTAGGCACTTTATGTTCGTTGTAATAATATAAGCCTATAGAACCTTTATTGTGTA
TATCAGAATAAGCAAGAAATAAAGAGTGTACGCCAAATAGCAATATATTTTTAGCACCATCTATATTAACCCTAGAATTA
AACTCTTTAGTGTCAAACCTGGAATATCCTAGCAATGCTTGGTAAAACGCTATTTTCCTGTCTTCTGATGTTTCTTTCTC
CTTAAAAAGAATCAAATGAAAATATTGACTCCTGCCTTAAAATATCCGGCATTTTTAACCAATTCTTTTCAGCGGCAACC
CTTGCCCACATTGCTGCTGCTTTAGGAAAAATGGTATTTCTTTAAACACTTACCTTTTGATGAAAGTTGCCCAAAATCCT
TTGTTCTATCCGAATCCAAAACCCCTATTTCCCAAACGCCCCTTAAAACCTTTTTTAAAATTGGAACAAAAAATATTTAA
TTTTTAAAAAAAAACG
>gi|60117240|gb|AY897437.1| Wolbachia endosymbiont of Drosophila mojavensis, genomic survey sequence
TTGNCCATCAATTGGCCACCAGAAAAGTTGCGTCCGTTTACTTCTACACCATGTATAAATGCACCTAAAATCATGCCTTG
GCAAAATGCAGCACCAAGTGACCCAAAATGAAAGGCATAATCCCATAATCGCCTGTATTTTCCTTCTGCCTTAAAACGAA
ACTCAAAGGATACTCCGCGCACTATAAGGCCAAGCAGCATAATAATGATTGGAATATAAAAAGCAGGCATTAATATTGAA
TATGCAAGAGGAAAAGCAGCAAACAACCCTCCACCACCTAGTACCAACCATGTTTCGTTTCCATCCCAAAATGGTGCAAT
TGAGCTTATCATGTGATCACGGCATTTATCTGACGGTGCAAAAGGAAGTAAAATACCAATACCTAAATCAAACCCATCCA
TTAAAATATACAGTAAAACAGCTATGGCAATTAGTAATCCCCAGATTAGGGGTAAATTAATTAAGGAAGAAAAATCAAAC
ATGATTGTTGTCCTTTCCAGATGTACCAGCATCAATCACTGAAGCTCCAATACCGTGTTTATAAAATTGCTCTTCTTCTT
TAATGACAGGAATTCCTTTGTATATAAGTTTCAGAATATAGTATCTACCTGCTCCAAATATAAGGGTATACATAAACGAT
AAATGCAATCAAAGACCATGCAACCTGAGGACCGGTAATCGCAGATGAAAATGATTCAATTGTGCCGTTAATTCCATATA
CAGTGTAAAGTTGACGGCCAATTTCATAGTAAACCAAACTGCAAGTAACGCTATGGACCCCGACGGCATCTTTGAAATCC
ACAATCCTTTGAAAACACAACTTTGGAATAATTTGCCCCGAAAAATACTGAAAAAAAATTTACTGGACCCATTTTGGATT
ATTAAAATTTCAACTCCAACCATTTATACGGG
A block starts at a > and runs to the last letter before the next >.
So, 1st block is:
TCTGTTGCGAGTGTGCTGATAACTACTGAATCTATGATAGTTGATGTACCAAGCAAAGAAAATGCTTCATCTCCTATGGG
TGCAGGAGAAATGAGTGGCATGGGTGGATTCTAAGTAGAATGAAACCGTGGAGCAATTGCTCCACGGTAGTTCCAAAAAA
TCTCACATTTTACTATTCGTTAAAGGTAATACGTTTGGTGCAGAAATGCACTACTGTTTGCATCCGTTTCGCTCCTTTAT
ATTGTGGTTGTCTAATAACAAAAAGGCAGCATAAGAAAACTATAACACCTAGTATATTTATACTATAGCTGACCCAAGCA
ACACGTCATACCGCGATTCATTCCACAACTGTACGAACATTACAATATGGCACATAGTAAACGATGTCATGAAAGTAGCT
GACACTGGAATTCAGAAAAAAGGATTATGTCATTCCAGTGCTTGACACTGGAATCCAGCATTTCCATAATCATCAAAACA
TTGTATTTTAACAAAAAACATGTATTTTTATGCTTGCCAACTTAATAAAATTCCTGGATCCCAGTGTCAAGCACTGGGAT
GACAC
2nd block is:
TTTTCATCGCTCATGTCCTTAGTTTACCCCCTGTTTCACCATTACATTAATATCTACAGAACCTCCCACTGGGGAGTAGT
AATCTAGGATAGTTTCTATCACTAAAACGCGTGGTATTCCTTTATTTTTTACCAATTTTAAATAAGACAATACCTTATTA
TCATCATAATGCTGCAGAAAGCGGCAAAAGACACCTAATTCATAATTTGTAGCTGATAATTCTTCTTGAGTTATGAGTTT
AATTTTTAAATCTTCTACTGCCTGCCTAGGCACTTTATGTTCGTTGTAATAATATAAGCCTATAGAACCTTTATTGTGTA
TATCAGAATAAGCAAGAAATAAAGAGTGTACGCCAAATAGCAATATATTTTTAGCACCATCTATATTAACCCTAGAATTA
AACTCTTTAGTGTCAAACCTGGAATATCCTAGCAATGCTTGGTAAAACGCTATTTTCCTGTCTTCTGATGTTTCTTTCTC
CTTAAAAAGAATCAAATGAAAATATTGACTCCTGCCTTAAAATATCCGGCATTTTTAACCAATTCTTTTCAGCGGCAACC
CTTGCCCACATTGCTGCTGCTTTAGGAAAAATGGTATTTCTTTAAACACTTACCTTTTGATGAAAGTTGCCCAAAATCCT
TTGTTCTATCCGAATCCAAAACCCCTATTTCCCAAACGCCCCTTAAAACCTTTTTTAAAATTGGAACAAAAAATATTTAA
TTTTTAAAAAAAAACG
Third block:
TTGNCCATCAATTGGCCACCAGAAAAGTTGCGTCCGTTTACTTCTACACCATGTATAAATGCACCTAAAATCATGCCTTG
GCAAAATGCAGCACCAAGTGACCCAAAATGAAAGGCATAATCCCATAATCGCCTGTATTTTCCTTCTGCCTTAAAACGAA
ACTCAAAGGATACTCCGCGCACTATAAGGCCAAGCAGCATAATAATGATTGGAATATAAAAAGCAGGCATTAATATTGAA
TATGCAAGAGGAAAAGCAGCAAACAACCCTCCACCACCTAGTACCAACCATGTTTCGTTTCCATCCCAAAATGGTGCAAT
TGAGCTTATCATGTGATCACGGCATTTATCTGACGGTGCAAAAGGAAGTAAAATACCAATACCTAAATCAAACCCATCCA
TTAAAATATACAGTAAAACAGCTATGGCAATTAGTAATCCCCAGATTAGGGGTAAATTAATTAAGGAAGAAAAATCAAAC
ATGATTGTTGTCCTTTCCAGATGTACCAGCATCAATCACTGAAGCTCCAATACCGTGTTTATAAAATTGCTCTTCTTCTT
TAATGACAGGAATTCCTTTGTATATAAGTTTCAGAATATAGTATCTACCTGCTCCAAATATAAGGGTATACATAAACGAT
AAATGCAATCAAAGACCATGCAACCTGAGGACCGGTAATCGCAGATGAAAATGATTCAATTGTGCCGTTAATTCCATATA
CAGTGTAAAGTTGACGGCCAATTTCATAGTAAACCAAACTGCAAGTAACGCTATGGACCCCGACGGCATCTTTGAAATCC
ACAATCCTTTGAAAACACAACTTTGGAATAATTTGCCCCGAAAAATACTGAAAAAAAATTTACTGGACCCATTTTGGATT
ATTAAAATTTCAACTCCAACCATTTATACGGG
How can I loop over my file and extract one block in each iteration, so that I can grep for it in the other file?
Edit 1:
For more clarification:
I want to do some operation on each block. First, I run diff on two files and put the result in a new file. For each block in that new file, I want to check whether it is included in the first file or in the second file. If it is included in the first file, I want to extract it to another new file. If it is included in the second file, I want to skip it and go to the next block.
I hope you get my point.
Thanks,

Do you want to create a separate file for each block and then operate on those files? Or do you just want to perform some operation (say, search/grep) on each block in each loop iteration? Please clarify your requirement.
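For what it's worth, here is a minimal sketch of the "one block per iteration" idea, written in Ruby rather than bash. It assumes every block begins with a header line starting with > and that the sequence lines are wrapped identically in both files (otherwise a verbatim match will fail). The names each_block and block_in_file? are made up for this illustration.

```ruby
# Yield each record as (header, sequence). The sequence lines are joined
# with newlines, so the block looks exactly as it does in the file.
def each_block(path)
  return enum_for(:each_block, path) unless block_given?

  header = nil
  lines = []
  File.foreach(path, chomp: true) do |line|
    if line.start_with?(">")
      # A new header closes the previous block, if there was one.
      yield(header, lines.join("\n")) if header
      header = line
      lines = []
    elsif header
      lines << line
    end
  end
  # Emit the last block, which has no following header to close it.
  yield(header, lines.join("\n")) if header
end

# Hypothetical helper: is the block present verbatim in another file?
# This reads the whole other file into memory, which may be too costly
# for very large inputs.
def block_in_file?(block, other_path)
  File.read(other_path).include?(block)
end
```

Inside the loop you could then call block_in_file? for the first and second file and decide whether to write the block out or skip it.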

Related

Update a matrix into a text file without appending the results

I have a Fortran 77 code like this, more or less:
nMaxRow=100
nMaxStep=100
! initialization of the matrix if step=1
do step=1,nMaxStep
  if (step.eq.1) then
    do ii=1,nMaxRow
      do jj=1,nMaxStep
        A(ii,jj)=0
      end do
    end do
  end if
  ! now for each step and for each row update the cell of the matrix
  do ii=1,nMaxRow
    A(ii,step)=X(ii) ! X(ii) is a number associated with the specific row at that specific step
  end do
  ! Now I want to write the updated matrix at this step into a text file,
  ! How can I do that????
end do ! close the do step...
Is it possible to update the values of the matrix and write the updated matrix at that specific step into a text file? I mean, without appending the results each step...
I found out that for Fortran 90 the 'REPLACE' command exists... but I couldn't find anything similar for Fortran 77.
One simple idea would be deleting the file just before writing a new one... but I don't like it and I don't know how to do it anyway.
If the file is already open (from the previous write), you can just go to the start of the file using
rewind(unitnumber)
and start writing again. It will delete the original content of the file and start again. If you want to go back by just a few records, you can use backspace, but you probably don't want that here.
If it isn't open, just open it and start writing. Unless you open it for appending, it will overwrite the original content.

Deleting contents of file after a specific line in ruby

Probably a simple question, but how do I delete the contents of a file after a specific line number? I want to keep the first, e.g., 5 lines and delete the rest of the contents of the file. I have been searching for a while and can't find a way to do this. I am an iOS developer, so Ruby is not a language I am very familiar with.
That is called truncating. File#truncate needs the byte position after which everything gets cut off, and IO#pos delivers just that:
File.open("test.csv", "r+") do |f|
  f.each_line.take(5)
  f.truncate( f.pos )
end
The "r+" mode from File.open is read and write, without truncating existing files to zero size, like "w+" would.
The block form of File.open ensures that the file is closed when the block ends.
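As a rough sketch, the same technique can be wrapped in a small helper that also copes with files shorter than the requested number of lines. The name keep_first_lines is an assumption made for this example, not an existing Ruby method.

```ruby
# Keep only the first +count+ lines of the file at +path+, in place.
def keep_first_lines(path, count)
  File.open(path, "r+") do |f|
    # Read past each line we want to keep; stop early at end of file.
    count.times { break unless f.gets }
    # Cut off everything after the current read position.
    f.truncate(f.pos)
  end
end
```

Calling keep_first_lines("test.csv", 5) should leave the first five lines and drop the rest. A file with fewer than five lines is left unchanged.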
I'm not aware of any methods to delete from a file so my first thought was to read the file and then write back to it. Something like this:
path = '/path/to/thefile'
start_line = 0
end_line = 4
File.write(path, File.readlines(path)[start_line..end_line].join)
File.readlines reads the file and returns an array of strings, where each element is one line of the file. You can then use the subscript operator with a range to select the lines you want.
This isn't going to be very memory efficient for large files, so you may want to optimise if that's something you'll be doing.

Editing a CSV file in place, row by row

I have a long CSV file with two columns of numbers:
1,2
2,5
7,3
etc...
I would like to add a third column equal to the sum of the first two:
1,2,3
2,5,7
7,3,10
The following code is a solution to the problem, but it makes a copy of the input file with the third column appended. Instead, I would like to operate on the input file line by line, writing the third column to each line as I go along. If the process failed partway through for some reason, the answers for the first half of the file would already be saved and would not need to be recalculated.
I can't come up with a good way to do this using ruby's CSV class. Here's my current solution with the copied file:
require 'csv'
CSV.open("big_file.csv", "w") do |csv|
  csv << %w{1 2}
  csv << %w{2 5}
  csv << %w{3 8}
end
big_csv_file = CSV.open("big_file.csv", 'r')
# I'm creating a copy of big_file.csv here
# I'd rather edit it in place
CSV.open("copy_with_extra_column.csv", "w") do |csv|
  big_csv_file.each do |row|
    # Fields are read as strings, so convert them before adding
    row << (row[0].to_i + row[1].to_i)
    csv << row
  end
end
To put this another way, there is no way, at the fundamental file level, to "insert" the sum into the file. In your example:
1,2
2,5
7,3
If we ignore the whole notion of a "CSV" file (which is really just a concept layered on top of a plain text stream), then to "insert" the text ,3 at the end of the first line we need to do all of these things:
move the "\n" after the 2, and all the following text two positions later in the file (leaving some junk in its place)
overwrite the junk with ",3"
Then you would repeat this process for each additional row.
This is obviously very inefficient. In simple terms, the CSV file format is not designed for efficient insertion of data.
Your two options are:
Load the file into memory (as, e.g., an array of lines), operate on it there, and then write it all back out over the existing file. Assuming your file only grows, this will work fine, but you'll need to be willing to allocate enough memory to read and operate on the whole file.
Write to a temporary file as you work through the data, and then move the temporary file in place of the original when you're done.
Updating the file "in place" is not practical.
A file is like one long string, for example:
1,2\n2,5
However, unlike a string, you can only overwrite characters in a file. In the example above, there are 7 characters. You can overwrite any of those characters with any characters you choose. So for instance, if you put the sum of the numbers at position 0 and position 2 into position 3, the result is:
1,232,5
That's probably not what you want because it looks like the first two numbers are 1 and 232 and their sum is 5. However, that is all you can do when editing a file in place: you can only overwrite characters with other characters.
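To see the overwrite behaviour for yourself, here is a small sketch. It assumes the file holds exactly the seven characters from the example, and overwrite_at is a name invented for this illustration.

```ruby
# Overwrite the bytes starting at +offset+ with +text+.
# Nothing is inserted: the bytes that follow stay where they are.
def overwrite_at(path, offset, text)
  File.open(path, "r+") do |f|
    f.seek(offset)
    f.write(text)
  end
end
```

With a file containing 1,2\n2,5, calling overwrite_at(path, 3, "3") replaces the newline at position 3, so the file then reads 1,232,5.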
For a large file, you can read in one line, then write the altered line to a new file. When you are done, you can delete the original file, and then you can rename the new file to the old file name. You can use the Tempfile class to avoid name clashes for the new file name.
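A minimal sketch of that temp-file approach follows, assuming two integer columns and no header row. The name add_sum_column is made up for this example, and the temporary file is created next to the original so that the final move is a plain rename.

```ruby
require "csv"
require "tempfile"
require "fileutils"

# Append the sum of the first two columns to every row of the CSV at +path+,
# writing to a temporary file and then moving it over the original.
def add_sum_column(path)
  # Without a block, Tempfile.create returns a File that is not deleted
  # automatically, which is what we want since we move it into place.
  tmp = Tempfile.create(["sum", ".csv"], File.dirname(path))
  begin
    CSV.foreach(path) do |row|
      sum = row[0].to_i + row[1].to_i
      tmp.write((row + [sum]).to_csv)
    end
  ensure
    tmp.close
  end
  FileUtils.mv(tmp.path, path)
end
```

If an error is raised partway through, the original file is untouched and the partial temporary file is left behind, so a real script would probably want to clean that up or resume from it.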
Instead of CSV.open(), try CSV.read(). For example, it's obviously a little ugly, but:
big_csv_file = CSV.read("big_file.csv")
# Fields are read as strings, so convert them before adding
big_csv_file[0] << (big_csv_file[0][0].to_i + big_csv_file[0][1].to_i)
CSV.open("copy_with_extra_column.csv", "w") do |csv|
  big_csv_file.each do |row|
    csv << row
  end
end
If you need the file to always be up to date, the alterations and the writing will need to happen inside a loop.

Ruby .count operation truncates input file

I want to read a file in and show how large it is. .count is acting like .count! and changing the size of my input file buffer, so now logfile.each doesn't iterate. What's going on?
logfile = open(input_fspec)
puts "logfile size: #{logfile.count} lines"
count will read all the lines from the input in order to do the counting. If you want to read the lines again (e.g. using readline or each) then you will need to call logfile.rewind to move back to the start of the file.
In fact, what count is actually returning is the number of lines that have not been read yet. For example, if you had already read through the file and called count afterwards then it would return 0.
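As a small sketch of the count-then-rewind pattern (the helper name count_lines_and_rewind is hypothetical):

```ruby
# Count the lines remaining in an open file, then rewind it so that it
# can be read again from the start.
def count_lines_and_rewind(io)
  total = io.count   # Enumerable#count reads through to the end of the file
  io.rewind          # move the read position back to the beginning
  total
end
```

After calling this on logfile, a following logfile.each should iterate over every line again.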
If you only need the size in bytes rather than the number of lines, you could do this instead, before you even open it:
File.size(input_fspec)

Get line count before looping over data in ruby

I need to get the total number of lines that an IO object contains before looping through each line in the IO object. How can I do this in ruby?
You can't really, unless you want to shell out to wc and parse the result of that - otherwise you'll need to do two passes - one to get the line count, and another to do your actual work.
(assuming we're talking about a File IO instance - neither of those approaches work for network sockets etc)
In Rails (the only difference is how I generate the file object instance):
file = File.open(File.join(Rails.root, 'lib', 'assets', 'file.json'))
linecount = file.readlines.size
io.each_line.count would give you the number of lines (each_line replaces the deprecated io.lines, which recent Ruby versions no longer provide).
io.each_line.with_index {|line, index|} would give you each line and which line number it is (starting at 0).
But I don't know if it's possible to count the number of lines without reading a file.
You may want to read a file, and then use io.rewind to read it again.
If your file is not humongous, slurp it into memory (an array) and count the number of items (i.e. lines).
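If the file is too large to slurp, the two-pass approach mentioned above can be sketched like this. It assumes a regular file on disk (not a socket or pipe), and each_line_with_total is a name invented for the example.

```ruby
# First pass: count the lines without holding them all in memory.
# Second pass: yield each line with its 0-based index and the total.
def each_line_with_total(path)
  total = File.foreach(path).count
  File.foreach(path).with_index do |line, index|
    yield line, index, total
  end
  total
end
```

For example, each_line_with_total("file.json") { |line, i, n| puts "#{i + 1}/#{n}" } would print a simple progress counter while you work through the lines.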
