Most efficient data structure for a nested loop? - data-structures

I am iterating through each line in the first file (3000 lines total) to find its corresponding label in the second file (~2 million lines; 47 MB), going through that second file line by line.
Currently, I have a nested loop structure, with the outer loop grabbing a line (converted into a list) and the inner loop iterating through the 2 million lines (line by line):
for row in read_FIMO:  # read_FIMO is the first file; 3000 lines long
    with open("chr8labels.txt") as label:  # 2 million lines long
        for line in csv.reader(label, delimiter="\t"):  # list
            for i in range(int(row[3]), int(row[4])):
                if i in range(int(line[1]) - 50, int(line[1])):  # compare the ranges in each list
                    line1 = str(line)
                    row1 = str(row)
                    outF.append(row1 + "\t" + line1)
-I realize this is horribly inefficient, but I need to find all instances where the first range overlaps with the ranges of the other file
-Is reading each file line by line the fastest way? If not, what would the best data structure be for the entire file?
-Should the lines be in a different data structure other than lists?
THANK YOU if you have any feedback!
Aside: the purpose is to label a range of numbers if the numbers are found in the ranges of the other file (long story; maybe not relevant?)

Your goal seems to be to find whether one range (row[3] to row[4]) overlaps with another (line[1]-50 to line[1]). For this, it is sufficient to check that either line[1]-50 or row[3] lies inside the other range. This eliminates the third nested loop.
Also, take the 2-million-line file and sort it once, then use the sorted list inside your 3000-line loop to do a binary search, cutting the algorithm from O(nm) to O(n log m).
(My Python is far from perfect, but this should get you going in the right direction.)
with open("...") as label:
reader = csv.reader(label, delimiter="\t")
lines = list(reader)
lines.sort(key=lambda line: int(line[1]))
for row in read_FIMO:
# Find the nearest, lesser value in lines smaller than row[3]
line = binary_search(lines, int(row[3]))
# If what you're after is multiple matches, then
# instead of getting a single line, get the smallest and
# largest indexes whose ranges overlap (which you can do with a
# binary search for row[3] and row[4]+50)
# e.g.,
# smallestIndex = binary_search(lines, int(row[3]))
# largestIndex = binary_search(lines, int(row[4])+50)
# for index in range(smallestIndex, largestIndex+1):
lower1 = int(line[1] - 50)
lower2 = int(row[3])
upper1 = int(line[1])
upper2 = int(row[4])
if (lower1 > lower2 and lower1 < upper1) or (lower2 > lower1 and lower2 < upper2):
line1=str(line)
row1=str(row)
outF.append(row1+"\t"+line1)
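The binary_search helper above is left undefined; a minimal sketch using Python's standard bisect module, assuming lines is already sorted by int(line[1]) as in the snippet above and returning an index into lines (so the call above indexes back into lines), could be:

import bisect

def binary_search(lines, value):
    # In practice, build this key list once, outside the row loop,
    # so you don't rescan the 2-million-entry list on every lookup.
    keys = [int(line[1]) for line in lines]
    # Index of the right-most entry whose int(line[1]) is <= value,
    # clamped to 0 when every key is larger than value.
    return max(bisect.bisect_right(keys, value) - 1, 0)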

I think an efficient way would be...
Going through the destination file first and recording all the labels in a dictionary (hash): [LabelName] -> [Line number]
Going through each line of the source, looking up its label in the dictionary, and printing the result (or printing something else if it is not found) -- see the sketch below
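A minimal sketch of those two steps in Python; the FIMO.txt filename and the assumption that the label sits in the first tab-separated column are placeholders, so adjust them to your real layout:

import csv

# Pass 1: record every label in the destination file -> its line number
labels = {}
with open("chr8labels.txt") as destination:
    for line_number, fields in enumerate(csv.reader(destination, delimiter="\t")):
        labels[fields[0]] = line_number

# Pass 2: look each source label up in the dictionary
with open("FIMO.txt") as source:  # placeholder name for the 3000-line file
    for fields in csv.reader(source, delimiter="\t"):
        label = fields[0]  # whichever column actually holds the label
        if label in labels:
            print(label, labels[label], sep="\t")
        else:
            print(label, "NOT FOUND", sep="\t")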
Notes / Tips
I think the above would be O(n).
You could also go through the source file first and record its labels, then go through the destination file. This would create a smaller dictionary (and use less memory), but the output may not be in the order you would like. (Smaller data sets often have better cache hit ratios, which is why I bring this up.)
Also, only because you mentioned most efficient, and if you want to get crazy: I would try skipping the line-by-line processing and just going through the input as one big string, searching for a newline char + whitespace + a label name. Whether this tip helps depends on what the input data looks like; if csv.reader already parses by line, then this would not be good. Also, you may need a lower-level language with more control to do this. (Caution: 90% chance this tip would just lead you down a rabbit hole that goes nowhere.)

Related

How to transform one column into a matrix with many columns using Fortran 90

I have one column (im = 160648) and one row (jm = 1). I want to transform that into a matrix with sizes (im = 344) and (jm = 467).
My program code is:
program matrix
  parameter (im=160648, jm=1)
  dimension h(im,jm)
  integer::h
  open (1,file="Hasil.txt", status='old')
  open (2,file="HasilNN.txt", status='unknown')
  do i=1,jm
    read(1,*)(h(i,j)),j=1,jm)
  end do
  do i=1,im
    write(2,33)(h(i,j),j=1,jm)
  end do
33 format(1x, 344f10.6)
end program matrix
The error appears at the line read(1,*)(h(i,j)),j=1,jm).
The data type is floating-point data.
Your read loop is:
do i=1,jm
  read(1,*)(h(i,j)),j=1,jm)
end do
Shouldn't do i=1,jm be do i=1,im ?
This would imply there are "im" records (lines) in the formatted text file Hasil.txt, which your question suggests.
read(1,*)(h(i,j)),j=1,jm) implies each record (line of text) has "jm" values, which is 1 value per line. Is this what the file looks like ? (An unknown number of blank lines will be skipped with this read (lu,*) ... statement.)
You appear to want to write this information to another file, HasilNN.txt, using format 33 (1x, 344f10.6), which implies 3441 characters per line, although your write statement will write only 1 value per line (as jm=1). This would be a very long line for a text file and probably difficult to manage outside the program. If you did wish to do this, you could achieve it with an implied do loop, such as:
write(2,33) ((h(i,j),j=1,jm),I=1,im)
A few comments:
using jm = 1 implies each row has only one value, which could be equivalently represented as a 1d vector "dimension h(im)", negating the need for j
File unit numbers 1 and 2 are typically reserved unit numbers for screen/keyboard. You would be better using units 11 and 12.
When devising this code, you need to address the record structure in the 2 files, as a simple vector could be used. You can control the line length with the format. A format of (1x,8f10.6) would create a record of 81 characters, which would be much easier to manage.
Format descriptor f10.6 also limits the range of values you can manage in the files. Values >= 1000 or <= -100 will overflow this format, while values smaller than 1.e-6 will be zero.
As @francescalus has noted, you have declared "h" as integer, but use a real format descriptor. This will produce an "Error: format-data mismatch" and has to be changed to match what is expected in the file.
You should consider what you wish to achieve and adjust the code.

Format statement with unknown columns

I am attempting to use fortran to write out a comma-delimited file for import into another commercial package. The issue is that I have an unknown number of data columns. My output needs to look like this:
a_string,a_float,a_different_float,float_array_elem1,float_array_elem2,...,float_array_elemn
which would result in something that might look like this:
L1080,546876.23,4325678.21,300.2,150.125,...,0.125
L1090,563245.1,2356345.21,27.1245,...,0.00983
I have three issues. One, I would prefer the elements to be tightly grouped (variable column width); two, I do not know how to define a variable number of array elements in the format statement; and three, the array elements can span a large range--maybe 12 orders of magnitude. The following code conceptually does what I want, but the variable 'n' and the lack of column-width definition throw an error (of course):
WRITE(50,900) linenames(ii),loc(ii,1:2),recon(ii,1:n)
900 FORMAT(A,',',F,',',F,n(',',F))
(I should note that n is fixed at run-time.) The write statement does what I want it to when I do WRITE(50,*), except that it's width-delimited.
I think this thread almost answered my question, but I got quite confused: SO. Right now I have a shell script with awk fixing the issue, but that solution is...inelegant. I could do some manipulation to make the output a string, and then just write it, but I would rather like to avoid that option if at all possible.
I'm doing this in Fortran 90 but I like to try to keep my code as backwards-compatible as possible.
The format closest to what you want is f0.3; this will give no leading spaces and a fixed number of decimal places. I think if you also want to lop off trailing zeros you'll need to do a good bit of work.
The 'n' in your format can be larger than the number of data values, so one (old school) approach is to put a big repeat count there, e.g. 100000. Modern Fortran does have some syntax to specify indefinite repeat; I'm sure someone will offer that up.
---- edit
The unlimited repeat count is, as you might guess, an asterisk, and is evidently "brand new" in Fortran 2008.
In order to make sure that no space occurs between the entries in your line, you can write them separately into character variables and then print them out using the adjustl() function in Fortran:
program csv
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nn = 3
  real(dp), parameter :: floatarray(nn) = [ -1.0_dp, -2.0_dp, -3.0_dp ]
  integer :: ii
  character(30) :: buffer(nn+2), myformat

  ! Create format string with appropriate number of fields.
  write(myformat, "(A,I0,A)") "(A,", nn + 2, "(',',A))"

  ! You should execute the following lines in a loop for every line you want to output
  write(buffer(1), "(F20.2)") 1.0_dp ! a_float
  write(buffer(2), "(F20.2)") 2.0_dp ! a_different_float
  do ii = 1, nn
    write(buffer(2+ii), "(F20.3)") floatarray(ii)
  end do
  write(*, myformat) "a_string", (trim(adjustl(buffer(ii))), ii = 1, nn + 2)
end program csv
The demonstration above is only for one output line, but you can easily write a loop around the appropriate block to execute it for all your output lines. Also, you can choose different numerical formats for the different entries, if you wish.

Get line count before looping over data in ruby

I need to get the total number of lines that an IO object contains before looping through each line in the IO object. How can I do this in ruby?
You can't really, unless you want to shell out to wc and parse the result of that. Otherwise you'll need to do two passes: one to get the line count, and another to do your actual work.
(This assumes we're talking about a File IO instance; neither of those approaches works for network sockets etc.)
In Rails (the only difference is how I generate the file object instance):
file = File.open(File.join(Rails.root, 'lib', 'assets', 'file.json'))
linecount = file.readlines.size
io.lines.count would give you the number of lines.
io.lines.each_with_index {|line, index|} would give you each line and which line number it is (starting at 0).
But I don't know if it's possible to count the number of lines without reading the file.
You may want to read the file once and then use io.rewind to read it again.
If your file is not humongous, slurp it into memory (an array) and count the number of items (i.e. lines).

Selecting random phrase from a list

I've been playing around with a .lua file which passes a random phrase using the following line:
SendChatMessage(GetRandomArgument("text1", "text2", "text3", "text4"), "RAID")
My problem is that I have a lot of phrases and the one line of code is very long indeed.
Is there a way to hold
text1
text2
text3
text4
in a list somewhere else in the code (or externally) and call a random value from the main code? It would make maintaining the list of text options easier.
For lists up to a few hundred elements, then the following will work:
messages = {
  "text1",
  "text2",
  "text3",
  "text4",
  -- ...
}
SendChatMessage(GetRandomArgument(unpack(messages)), "RAID")
For longer lists, you would be well served to replace GetRandomArgument with GetRandomElement that would take a single table as its argument and return a random entry from the table.
Edit: Olle's answer shows one way that something like GetRandomElement might be implemented. But it uses table.getn on every call, which is deprecated in Lua 5.1, and its replacement (table.maxn) has a runtime cost proportional to the number of elements in the table.
The function table.maxn is only required if the table in use might have missing elements in its array part. However, in this case of a list of items to choose among, there is likely to be no reason to need to allow holes in the list. If you need to edit the list at run time, you can always use table.remove to remove an item since it will also close the gap.
With a guarantee of no gaps in the array of text, then you can implement GetRandomElement like this:
function GetRandomElement(a)
  return a[math.random(#a)]
end
So that you send the message like this:
SendChatMessage(GetRandomElement(messages), "RAID")
You want a table to contain your phrases like
phrases = { "tex1", "text2", "text3" }
table.insert(phrases ,"text4") -- alternative syntax
SendChatMessage(phrases[math.random(table.getn(phrases))], "RAID")
Note: getn gets the size of the table; math.random gets a random number (with a max of the size of the phrases table) and the phrases[] syntax returns the table element at the index inside [].

sed optimization (large file modification based on smaller dataset)

I have to deal with very large plain text files (over 10 gigabytes; yeah, I know it depends on what we call large), with very long lines.
My most recent task involves some line editing based on data from another file.
The data file (which should be modified) contains 1500000 lines, each of them e.g. 800 chars long. Each line is unique and contains only one identity number; each identity number is unique.
The modifier file is e.g. 1800 lines long; each line contains an identity number, and an amount and a date which should be modified in the data file.
I just transformed the modifier file (with Vim regex) into sed commands, but it's very inefficient.
Let's say I have a line like this in the data file:
(some 500 character)id_number(some 300 character)
And I need to modify data in the 300 char part.
Based on the modifier file, I come up with sed lines like this:
/id_number/ s/^\(.\{650\}\).\{20\}/\1CHANGED_AMOUNT_AND_DATA/
So I have 1800 lines like this.
But I know, that even on a very fast server, if I do a
sed -i.bak -f modifier.sed data.file
It's very slow, because it has to check every pattern against every line.
Isn't there a better way?
Note: I'm not a programmer, had never learnt (in school) about algorithms.
I can use awk, sed, and an outdated version of Perl on the server.
My suggested approaches (in order of desirability) would be to process this data as:
A database (even a simple SQLite-based DB with an index will perform much better than sed/awk on a 10GB file)
A flat file containing fixed record lengths
A flat file containing variable record lengths
Using a database takes care of all those little details that slow down text-file processing (finding the record you care about, modifying the data, storing it back to the DB). Take a look at DBD::SQLite in the case of Perl.
If you want to stick with flat files, you'll want to maintain an index manually alongside the big file so you can more easily look up the record numbers you'll need to manipulate. Or, better yet, perhaps your ID numbers are your record numbers?
If you have variable record lengths, I'd suggest converting to fixed record lengths (since it appears only your ID is variable length). If you can't do that, perhaps any existing data will never move around in the file? Then you can maintain that previously mentioned index and add new entries as necessary, with the difference that instead of the index pointing to a record number, you now point to the absolute position in the file.
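Purely as an illustration of the index-plus-fixed-records idea (sketched in Python rather than the tools named above; the 650/20 column offsets are borrowed from the sed example earlier, and extract_id is a placeholder for however the ID is pulled out of a record):

# Build an ID -> byte-offset index over the big file once,
# then overwrite the fixed-width field in place for each modification.
FIELD_START, FIELD_LEN = 650, 20   # assumed layout, as in the sed substitution

def build_index(path, extract_id):
    index = {}
    with open(path, "rb") as f:
        offset = f.tell()
        for line in iter(f.readline, b""):
            index[extract_id(line)] = offset
            offset = f.tell()
    return index

def patch_field(handle, record_offset, replacement):
    # handle must be opened with open(path, "r+b"); replacement is bytes
    assert len(replacement) == FIELD_LEN
    handle.seek(record_offset + FIELD_START)
    handle.write(replacement)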
I suggest a program written in Perl (as I am not a sed/awk guru and I don't know exactly what they are capable of).
Your "algorithm" is simple: first of all, you need to construct a hashmap which gives you the new data string to apply for each ID. This is achieved by reading the modifier file, of course.
Once this hashmap is populated, you may go through each line of your data file, read the ID in the middle of the line, and generate the new line as you've described above.
I am not a Perl guru either, but I think the program is quite simple. If you need help writing it, ask for it :-)
With Perl you should use substr to get id_number, especially if id_number has a constant width.
my $id_number = substr($str, 500, id_number_length);
After that, if $id_number is in range, you should use substr to replace the remaining text.
substr($str, -300, 300, $new_text);
Perl's regular expressions are very fast, but not in this case.
My suggestion is: don't use a database. A well-written Perl script will outperform a database by an order of magnitude for this sort of task. Trust me, I have a lot of practical experience with it. You will not even have finished importing the data into the database by the time the Perl script is done.
When you write 1500000 lines with 800 chars, that comes to about 1.2 GB. With a very slow disk (30 MB/s) you will read it in about 40 seconds; with a better one, 50 MB/s -> 24 s, 100 MB/s -> 12 s, and so on. But Perl hash lookup (like a db join) speed on a 2 GHz CPU is above 5M lookups/s. It means that your CPU-bound work will take seconds and your IO-bound work will take tens of seconds. If it is really 10 GB the numbers will change, but the proportion stays the same.
You have not specified whether the data modification changes the size or not (whether the modification can be done in place), so we will not assume it and will work as a filter. You have not specified the format of your "modifier file" or what sort of modification you need. Assume that it is tab-separated, something like:
<id><tab><position_after_id><tab><amount><tab><data>
We will read data from stdin and write to stdout and script can be something like this:
my $modifier_filename = 'modifier_file.txt';

open my $mf, '<', $modifier_filename or die "Can't open '$modifier_filename': $!";
my %modifications;
while (<$mf>) {
    chomp;
    my ($id, $position, $amount, $data) = split /\t/;
    $modifications{$id} = [$position, $amount, $data];
}
close $mf;

# make matching regexp (use quotemeta to escape regexp-meaningful characters)
my $id_regexp = join '|', map quotemeta, keys %modifications;
$id_regexp = qr/($id_regexp)/;    # compile regexp

while (<>) {
    next unless m/$id_regexp/;
    next unless $modifications{$1};
    my ($position, $amount, $data) = @{ $modifications{$1} };
    substr $_, $+[1] + $position, $amount, $data;
}
continue { print }
On my laptop it takes about half a minute for 1.5 million rows, 1800 lookup ids, and 1.2 GB of data. For 10 GB it should not take more than 5 minutes. Is that reasonably quick for you?
If you start to think you are not IO-bound (for example if you use some NAS) but CPU-bound, you can sacrifice some readability and change it to this:
my $mod;
while (<>) {
    next unless m/$id_regexp/;
    $mod = $modifications{$1};
    next unless $mod;
    substr $_, $+[1] + $mod->[0], $mod->[1], $mod->[2];
}
continue { print }
You should almost certainly use a database, as MikeyB suggested.
If you don't want to use a database for some reason, then if the list of modifications will fit in memory (as it currently will at 1800 lines), the most efficient method is a hashtable populated with the modifications as suggested by yves Baumes.
If you get to the point where even the list of modifications becomes huge, you need to sort both files by their IDs and then perform a list merge -- basically:
Compare the ID at the "top" of the input file with the ID at the "top" of the modifications file
Adjust the record accordingly if they match
Write it out
Discard the "top" line from whichever file had the (alphabetically or numerically) lowest ID and read another line from that file
Goto 1.
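Purely to illustrate that merge loop (sketched in Python, not tied to the tools available on the server; extract_id and apply_modification stand in for the real record layout), assuming both files are already sorted by ID:

def merge_update(data_path, mods_path, out_path, extract_id, apply_modification):
    # Single pass over two ID-sorted files: patch matching records, copy the rest.
    with open(data_path) as data, open(mods_path) as mods, open(out_path, "w") as out:
        mod = next(mods, None)
        for record in data:
            # Discard modifications whose ID sorts before the current record's ID.
            while mod is not None and extract_id(mod) < extract_id(record):
                mod = next(mods, None)
            # Adjust the record if the IDs match, then write it out either way.
            if mod is not None and extract_id(mod) == extract_id(record):
                record = apply_modification(record, mod)
            out.write(record)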
Behind the scenes, a database will almost certainly use a list merge if you perform this alteration using a single SQL UPDATE command.
