Format statement with unknown columns - format

I am attempting to use fortran to write out a comma-delimited file for import into another commercial package. The issue is that I have an unknown number of data columns. My output needs to look like this:
a_string,a_float,a_different_float,float_array_elem1,float_array_elem2,...,float_array_elemn
which would result in something that might look like this:
L1080,546876.23,4325678.21,300.2,150.125,...,0.125
L1090,563245.1,2356345.21,27.1245,...,0.00983
I have three issues. One, I would prefer the elements to be tightly grouped (variable column width), two, I do not know how to define a variable number of array elements in the format statement, and three, the array elements can span a large range--maybe 12 orders of magnitude. The following code conceptually does what I want, but the variable 'n' and the lack of column-width definition throws an error (of course):
WRITE(50,900) linenames(ii),loc(ii,1:2),recon(ii,1:n)
900 FORMAT(A,',',F,',',F,n(',',F))
(I should note that n is fixed at run-time.) The write statement does what I want it to when I do WRITE(50,*), except that it's width-delimited.
I think this thread almost answered my question, but I got quite confused: SO. Right now I have a shell script with awk fixing the issue, but that solution is...inelegant. I could do some manipulation to make the output a string, and then just write it, but I would rather like to avoid that option if at all possible.
I'm doing this in Fortran 90 but I like to try to keep my code as backwards-compatible as possible.

the format close to what you want is f0.3, this will give no spaces and a fixed number of decimal places. I think if you want to also lop off trailing zeros you'll need to do a good bit of work.
The 'n' in your write statement can be larger than the number of data values, so one (old school) approach is to put a big number there, eg 100000. Modern fortran does have some syntax to specify indefinite repeat, i'm sure someone will offer that up.
----edit
the unlimited repeat is as you might guess an asterisk..and is evideltly "brand new" in f2008

In order to make sure that no space occurs between the entries in your line, you can write them separately in character variables and then print them out using theadjustl() function in fortran:
program csv
implicit none
integer, parameter :: dp = kind(1.0d0)
integer, parameter :: nn = 3
real(dp), parameter :: floatarray(nn) = [ -1.0_dp, -2.0_dp, -3.0_dp ]
integer :: ii
character(30) :: buffer(nn+2), myformat
! Create format string with appropriate number of fields.
write(myformat, "(A,I0,A)") "(A,", nn + 2, "(',',A))"
! You should execute the following lines in a loop for every line you want to output
write(buffer(1), "(F20.2)") 1.0_dp ! a_float
write(buffer(2), "(F20.2)") 2.0_dp ! a_different_float
do ii = 1, nn
write(buffer(2+ii), "(F20.3)") floatarray(ii)
end do
write(*, myformat) "a_string", (trim(adjustl(buffer(ii))), ii = 1, nn + 2)
end program csv
The demonstration above is only for one output line, but you can easily write a loop around the appropriate block to execute it for all your output lines. Also, you can choose different numerical format for the different entries, if you wish.

Related

Most efficient data structure for a nested loop?

I am iterating through each line in the first file (3000 lines total) to find it's corresponding label in the second file, line by line (which is ~2 million lines; 47 MB)
Currently, I have a nested loop structure with the outer loop grabbing a line (converting into a list) and the inner loop iterating through the 2 million lines (line by line):
for row in read_FIMO: #read_FIMO is first file; 3000 lines long
with open("chr8labels.txt") as label: #2 million lines long
for line in csv.reader(label, delimiter="\t"): #list
for i in range(int(row[3]),int(row[4])):
if i in range((int(line[1])-50),int(line[1])):#compare the ranges in each list
line1=str(line)
row1=str(row)
outF.append(row1+"\t"+line1)
-I realize this is horribly inefficient, but I need to find all instances of when the first range overlaps with the ranges of the other file
-Is reading in each file line by line the fastest way? if not, what would the best data structure be for entire file
-should the lines be in a different data structure other than lists?
THANK YOU if you have any feedback!
aside: the purpose is to label a range of numbers if the numbers are found in the ranges of the other file(long story; maybe not relevant?)
Your goal seems to be to find whether one range (row[3] to row[4]) overlaps with another (line[1]-50 to line[1]). For this, it is sufficient to that that either line[1]-50 or row[3] lies inside the other range. This eliminates the third nested loop.
Also, take the 2-million-line file and sort it, once, and then use the sorted list inside your 3000-line loop to do a binary search, cutting from a O(nm) algorithm to an O(nlogm) one.
(My Python is far from perfect, but this should get you going in the right direction.)
with open("...") as label:
reader = csv.reader(label, delimiter="\t")
lines = list(reader)
lines.sort(key=lambda line: int(line[1]))
for row in read_FIMO:
# Find the nearest, lesser value in lines smaller than row[3]
line = binary_search(lines, int(row[3]))
# If what you're after is multiple matches, then
# instead of getting a single line, get the smallest and
# largest indexes whose ranges overlap (which you can do with a
# binary search for row[3] and row[4]+50)
# e.g.,
# smallestIndex = binary_search(lines, int(row[3]))
# largestIndex = binary_search(lines, int(row[4])+50)
# for index in range(smallestIndex, largestIndex+1):
lower1 = int(line[1] - 50)
lower2 = int(row[3])
upper1 = int(line[1])
upper2 = int(row[4])
if (lower1 > lower2 and lower1 < upper1) or (lower2 > lower1 and lower2 < upper2):
line1=str(line)
row1=str(row)
outF.append(row1+"\t"+line1)
I think am efficient way would be...
Going through the destination file first and record all the labels
to a dictionary hash. [LabelName] -> [Line number]
Go through each line of the source and lookup the line from the
dictionary and print (or print something else if not found)
Notes / Tips
I think the above would be O(n)
You could also go through the source file first and record then go through the destination file. This would create a smaller dictionary (and use less memory) but the output may not be in the order you would like. (smaller data sets often have better cache hits ratios so this is why I bring this up.)
Also, only because you mentioned most efficient, and if you want to get crazy, I would try skipping the line by line processing, and just go through the input like it is one big string. Search for a new line char + white-space + a label name. Using this tip kind of depends what the input data looks like though. If the "csv.reader" already does parse by line then this would not be good. Also, you may need a lower level language with more control to do this method. (Caution: 90% chance this tip would just lead you to a rabbit hole that goes no-where)

Extracting data from text file in AMPL without adding indexes

I'm new to AMPL and I have data in a text file in matrix form from which I need to use certain values. However, I don't know how to use the matrices directly without having to manually add column and row indexes to them. Is there a way around this?
So the data I need to use looks something like this, with hundreds of rows and columns (and several more matrices like this), and I would like to use it as a parameter with index i for rows and j for columns.
t=1
0.0 40.95 40.36 38.14 44.87 29.7 26.85 28.61 29.73 39.15 41.49 32.37 33.13 59.63 38.72 42.34 40.59 33.77 44.69 38.14 33.45 47.27 38.93 56.43 44.74 35.38 58.27 31.57 55.76 35.83 51.01 59.29 39.11 30.91 58.24 52.83 42.65 32.25 41.13 41.88 46.94 30.72 46.69 55.5 45.15 42.28 47.86 54.6 42.25 48.57 32.83 37.52 58.18 46.27 43.98 33.43 39.41 34.0 57.23 32.98 33.4 47.8 40.36 53.84 51.66 47.76 30.95 50.34 ...
I'm not aware of an easy way to do this. The closest thing is probably the table format given in section 9.3 of the AMPL Book. This avoids needing to give indices for every term individually, but it still requires explicitly stating row and column indices.
AMPL doesn't seem to do a lot with position-based input formats, probably because it defaults to treating index sets as unordered so the concept of "first row" etc. isn't meaningful.
If you really wanted to do it within AMPL, you could probably put together a work-around along these lines:
declare a single-index param with length equal to the total size of your matrix (e.g. if your matrix is 10 x 100, this param has length 1000)
edit the beginning and end of your "matrix" data file to turn it into appropriate format for a single-index parameter indexed from 1 to n
then define your matrix something like this:
param m{i in 1..nrows,j in 1..ncols} := x[j+i*(ncols-1)];
(not tested, I won't promise that I have rows and columns the right way around there!)
But you're probably better off editing the input file into one of the standard AMPL matrix formats. AMPL isn't really designed for data wrangling - you can do it in a pinch but if you're doing this kind of thing repeatedly it may be less trouble to code it in a general-purpose language e.g. Python.

How to transform one column to much column on matrix using Fortran 90

I have one column (im = 160648) and row (jm = 1). I want to transform that to a matrix with sizes (im = 344) and (jm=467)
my program code is
program matrix
parameter (im=160648, jm=1)
dimension h(im,jm)
integer::h
open (1,file="Hasil.txt", status='old')
open (2,file="HasilNN.txt", status='unknown')
do i=1,jm
read(1,*)(h(i,j)),j=1,jm)
end do
do i=1,im
write(2,33)(h(i,j),j=1,jm)
end do
33 format(1x, 344f10.6)
end program matrix
the error code that appears when read(1,*)(h(i,j)),j=1,jm)
the data type is floating data.
Your read loop is:
do i=1,jm
read(1,*)(h(i,j)),j=1,jm)
end do
Shouldn't do i=1,jm be do i=1,im ?
This would imply there are "im" records (lines) in the formatted text file Hasil.txt, which your question suggests.
read(1,*)(h(i,j)),j=1,jm) implies each record (line of text) has "jm" values, which is 1 value per line. Is this what the file looks like ? (An unknown number of blank lines will be skipped with this read (lu,*) ... statement.)
You appear to be wanting to write this information to another file; HasilNN.txt using 33 format (1x, 344f10.6) which suggests 3441 characters per line, although your write statement will write only 1 value per line (as jm=1). This would be a very long line for a text file and probably difficult to manage outside the program. If you did wish to do this, you could achieve this with an implied do loop, such as:
write(2,33) ((h(i,j),j=1,jm),I=1,im)
A few comments:
using jm = 1 implies each row has only one value, which could be equivalently represented as a 1d vector "dimension h(im)", negating the need for j
File unit numbers 1 and 2 are typically reserved unit numbers for screen/keyboard. You would be better using units 11 and 12.
When devising this code, you need to address the record structure in the 2 files, as a simple vector could be used. You can control the line length with the format. A format of (1x,8f10.6) would create a record of 81 characters, which would be much easier to manage.
Format descriptor f10.6 also limits the range of values you can manage in the files. Values >= 1000 or <= -100 will overflow this format, while values smaller than 1.e-6 will be zero.
As #francescalus has noted, you have declared "h" as integer, but use a real format descriptor. This will produce an "Error : format-data mismatch" and has to be changed to what is expected in the file.
You should consider what you wish to achieve and adjust the code.

confirm conditional statement applies to >0 observations in Stata

This is something that has puzzled me for some time and I have yet to find an answer.
I am in a situation where I am applying a standardized data cleaning process to (supposedly) similarly structured files, one file for each year. I have a statement such as the following:
replace field="Plant" if field=="Plant & Machinery"
Which was a result of the original code-writing based on the data file for year 1. Then I generalize the code to loop through the years of data. The problem becomes if in year 3, the analogous value in that variable was coded as "Plant and MachInery ", such that the code line above would not make the intended change due to the difference in the text string, but not result in an error alerting the change was not made.
What I am after is some sort of confirmation that >0 observations actually satisfied the condition each instance the code is executed in the loop, otherwise return an error. Any combination of trimming, removing spaces, and standardizing the text case are not workaround options. At the same time, I don't want to add a count if and then assert statement before every conditional replace as that becomes quite bulky.
Aside from going to the raw files to ensure the variable values are standardized, is there any way to do this validation "on the fly" as I have tried to describe? Maybe just write a custom program that combines a count if, assert and replace?
The idea has surfaced occasionally that replace should return the number of observations changed, but there are good reasons why not, notably that it is not a r-class or e-class command any way and it's quite important not to change the way it works because that could break innumerable programs and do-files.
So, I think the essence of any answer is that you have to set up your own monitoring process counting how many values have (or would be) changed.
One pattern is -- when working on a current variable:
gen was = .
foreach ... {
...
replace was = current
replace current = ...
qui count if was != current
<use the result>
}

sed optimization (large file modification based on smaller dataset)

I do have to deal with very large plain text files (over 10 gigabytes, yeah I know it depends what we should call large), with very long lines.
My most recent task involves some line editing based on data from another file.
The data file (which should be modified) contains 1500000 lines, each of them are e.g. 800 chars long. Each line is unique, and contains only one identity number, each identity number is unique)
The modifier file is e.g. 1800 lines long, contains an identity number, and an amount and a date which should be modified in the data file.
I just transformed (with Vim regex) the modifier file to sed, but it's very inefficient.
Let's say I have a line like this in the data file:
(some 500 character)id_number(some 300 character)
And I need to modify data in the 300 char part.
Based on the modifier file, I come up with sed lines like this:
/id_number/ s/^\(.\{650\}\).\{20\}/\1CHANGED_AMOUNT_AND_DATA/
So I have 1800 lines like this.
But I know, that even on a very fast server, if I do a
sed -i.bak -f modifier.sed data.file
It's very slow, because it has to read every pattern x every line.
Isn't there a better way?
Note: I'm not a programmer, had never learnt (in school) about algorithms.
I can use awk, sed, an outdated version of perl on the server.
My suggested approaches (in order of desirably) would be to process this data as:
A database (even a simple SQLite-based DB with an index will perform much better than sed/awk on a 10GB file)
A flat file containing fixed record lengths
A flat file containing variable record lengths
Using a database takes care of all those little details that slow down text-file processing (finding the record you care about, modifying the data, storing it back to the DB). Take a look for DBD::SQLite in the case of Perl.
If you want to stick with flat files, you'll want to maintain an index manually alongside the big file so you can more easily look up the record numbers you'll need to manipulate. Or, better yet, perhaps your ID numbers are your record numbers?
If you have variable record lengths, I'd suggest converting to fixed-record lengths (since it appears only your ID is variable length). If you can't do that, perhaps any existing data will not ever move around in the file? Then you can maintain that previously mentioned index and add new entries as necessary, with the difference is that instead of the index pointing to record number, you now point to the absolute position in the file.
I suggest you a programm written in Perl (as I am not a sed/awk guru and I don't what they are exactly capable of).
You "algorithm" is simple: you need to construct, first of all, an hashmap which could give you the new data string to apply for each ID. This is achieved reading the modifier file of course.
Once this hasmap in populated you may browse each line of your data file, read the ID in the middle of the line, and generate the new line as you've described above.
I am not a Perl guru too , but I think that the programm is quite simple. If you need help to write it, ask for it :-)
With perl you should use substr to get id_number, especially if id_number has constant width.
my $id_number=substr($str, 500, id_number_length);
After that if $id_number is in range, you should use substr to replace remaining text.
substr($str, -300,300, $new_text);
Perl's regular expressions are very fast, but not in this case.
My suggestion is, don't use database. Well written perl script will outperform database in order of magnitude in this sort of task. Trust me, I have many practical experience with it. You will not have imported data into database when perl will be finished.
When you write 1500000 lines with 800 chars it seems 1.2GB for me. If you will have very slow disk (30MB/s) you will read it in a 40 seconds. With better 50 -> 24s, 100 -> 12s and so. But perl hash lookup (like db join) speed on 2GHz CPU is above 5Mlookups/s. It means that your CPU bound work will be in seconds and you IO bound work will be in tens of seconds. If it is really 10GB numbers will change but proportion is same.
You have not specified if data modification changes size or not (if modification can be done in place) thus we will not assume it and will work as filter. You have not specified what format of your "modifier file" and what sort of modification. Assume that it is separated by tab something like:
<id><tab><position_after_id><tab><amount><tab><data>
We will read data from stdin and write to stdout and script can be something like this:
my $modifier_filename = 'modifier_file.txt';
open my $mf, '<', $modifier_filename or die "Can't open '$modifier_filename': $!";
my %modifications;
while (<$mf>) {
chomp;
my ($id, $position, $amount, $data) = split /\t/;
$modifications{$id} = [$position, $amount, $data];
}
close $mf;
# make matching regexp (use quotemeta to prevent regexp meaningful characters)
my $id_regexp = join '|', map quotemeta, keys %modifications;
$id_regexp = qr/($id_regexp)/; # compile regexp
while (<>) {
next unless m/$id_regexp/;
next unless $modifications{$1};
my ($position, $amount, $data) = #{$modifications{$1}};
substr $_, $+[1] + $position, $amount, $data;
}
continue { print }
On mine laptop it takes about half minute for 1.5 million rows, 1800 lookup ids, 1.2GB data. For 10GB it should not be over 5 minutes. Is it reasonable quick for you?
If you start think you are not IO bound (for example if use some NAS) but CPU bound you can sacrifice some readability and change to this:
my $mod;
while (<>) {
next unless m/$id_regexp/;
$mod = $modifications{$1};
next unless $mod;
substr $_, $+[1] + $mod->[0], $mod->[1], $mod->[2];
}
continue { print }
You should almost certainly use a database, as MikeyB suggested.
If you don't want to use a database for some reason, then if the list of modifications will fit in memory (as it currently will at 1800 lines), the most efficient method is a hashtable populated with the modifications as suggested by yves Baumes.
If you get to the point where even the list of modifications becomes huge, you need to sort both files by their IDs and then perform a list merge -- basically:
Compare the ID at the "top" of the input file with the ID at the "top" of the modifications file
Adjust the record accordingly if they match
Write it out
Discard the "top" line from whichever file had the (alphabetically or numerically) lowest ID and read another line from that file
Goto 1.
Behind the scenes, a database will almost certainly use a list merge if you perform this alteration using a single SQL UPDATE command.
Good deal on the sqlloader or datadump decision. That's the way to go.

Resources