I have a large CSV with a large number of columns. I am trying to count the number of lines using
File.open(file).readlines.to_a.compact.count.to_i
It displays 57 although there are only 56 rows. Upon close examination I found that a part of one line is wrapped to form the next line. How to get the correct count?
Upon close examination I found that a part of one line is wrapped to form the next line. How to get the correct count?
You need to show an example of the incoming data if you want us to help beyond generic answers.
To fix the problem, you have to be able to identify the line. We can't help you there because it could look like anything. Making a wild guess, I'd say that one of the columns had an embedded new-line in it, which forces the line to wrap.
It the file is a true CSV file, that column should be wrapped in double-quotes, so you could search the file for lines that do NOT end with whatever data type should be in the last column, then read the next line, join them, then rewrite the file. But, again, we have nothing to work with, because your file's format could be a huge number of different things.
Your best bet is to use the CSV class that comes with Ruby, and let it read the file, instead of trying to treat it like a text file. CSV files are text, but they are formatted to maintain the columns and rows, so using the CSV class will give you a better chance of getting at the data.
Looking at your code:
There are a number of ways to count the number of lines in a file, including the easiest which is:
`wc -l /path/to/file`.to_i
if you're using *nix.
Using File.open(file).readlines.to_a is horribly redundant and not fast or scalable if your file is big.
readlines returns an array.
to_a returns an array.
Why turn the array into an array?
readlines loads an entire file into memory, then splits it on line ends into an array. That process can be a lot slower than simply reading the file line-by-line and incrementing a counter, plus "slurping" can make your program crawl if the file is larger than available memory.
See "Why is "slurping" a file not a good practice?" for more information.
compact removes nils from an array. readlines should never return any nils so compact will iterate over the array looking for something that shouldn't exist.
count returns an integer.
to_i converts the receiver to an integer.
In other words, to_i is turning an integer into an integer. Why?
If you want to do it in Ruby instead of using wc -l, do something simple and fast:
lines_in_file = 0
File.foreach(some_file) { lines_in_file += 1 }
After running that, lines_in_file will contain the number of lines read. Memory won't be impacted and it'll run like blue blazes on huge files.
Related
I have 2 (large) files. The first one is about 200k lines, the second one about 30 millions lines.
I want to check if each line of the first one is in the second one using Perl.
Is it faster to compare directly each line of the first to each line of the second or is it better to store them all in two different arrays and then manipulate arrays?
You have File A and File B. You want to check if lines in File A appear in File B.
If you have enough memory to hold the contents of File B in a hash using one entry per line, that's the simplest. Go ahead.
However, if you do not, I recommend you put both files in tables in an SQL database. SQLite might be enough to start. Then, your problem is reduced to a simple JOIN. If line length is an issue, use a fast hash such as xxHash. If implemented correctly, the 64-bit version is blazing fast on a 64-bit machine, especially if you enabled optimizations in your Perl. Store two columns, hash and the actual line. If hashes match, check if the lines match. Make sure to index on the hash column.
You say:
In fact, my files are like : File A : name number (per line) File B : name date location number (per line) And I have to check if File B contains the lines matching datas of File A (ignoring date and location for example) So it's not an exact match ...
In that case, you are set. You do not even have to worry about the hash stuff (which I am leaving here for reference). Put the interesting bits of data on which you need to match against in separate columns in an SQLite database. Write a join. ... Profit.
Alternatively, you could use BerkeleyDB which gives you the conceptual simplicity of having an in memory hash while storing the table on disk. If you have multiple attributes on which to match, this will not scale well.
Store the first file's lines in a hash, then iterate through the second file without storing it in memory.
It might be counterintuitive to store the first file and iterate the second file as opposed to vice-versa, but it allows you to avoid creating a 30 million element hash.
use feature 'say';
my ($path_1, $path_2) = #ARGV;
open my $fh1,"<",$path_1;
my %f1;
$f1{$_} = $. while (<$fh1>);
open my $fh2,"<",$path_2;
while (<$fh2>) {
if (my $f1_line = $f1{$_}) {
say "file 1 line $f1_line appears in file 2 line $.";
}
}
Note that without further processing, the duplicated lines will display in the order they appear in the second file, not first.
Also, this assumes file 1 does not have duplicate lines, but that can be handled if necessary.
I am reverse engineering some old database files. It's going pretty good. All the files I have worked with so far have fixed width records and the width is defined in the header. Pretty straight forward.. I know the header length, so I can start reading the file right after the header and then I know that X bytes later I get to the end of the record. If the record is 30 bytes and the header is 100 I can do something like this:
file = IO.binread(path + file_name, end_of_header, end_of_file)
read_file(file[0, 30]) #This calls a function that parses the data..
However, there are several tables with dynamic width records. So, one record can be 100 bytes and the next could be 20 bytes. The records are as big as the amount of text the user saved. There does not seem to be anything that notes the record length on the record..
Each record is separated by a delimiter (FEFE). I am scanning for the next delimiter and pulling the record that way, but it takes forever to read the entire file byte by byte looking for matches. Is there a better way than scanning to find the next match OR get a list of all the indexes of each occurrence of the byte array?
RUBY...
You can specify a separator for readline
file.readline(sep="FEFE")
or if you mean the 2 char hex string:
file.readline(sep="\xFE\xFE")
Gets you one record (including the delimiter)
Or you can pass to a code block
file.readlines(sep="\xFE\xFE").each{|line|...}
I am looking for a way to check if a PDF is missing an end of file character. So far I have found I can use the pdf-reader gem and catch the MalformedPDFError exception, or of course I could simply open the whole file and check if the last character was an EOF. I need to process lots of potentially large PDF's and I want to load as little memory as possible.
Note: all the files I want to detect will be lacking the EOF marker, so I feel like this is a little more specific scenario then detecting general PDF "corruption". What is the best, fast way to do this?
TL;DR
Looking for %%EOF, with or without related structures, is relatively speedy even if you scan the entirety of a reasonably-sized PDF file. However, you can gain a speed boost if you restrict your search to the last kilobyte, or the last 6 or 7 bytes if you simply want to validate that %%EOF\n is the only thing on the last line of a PDF file.
Note that only a full parse of the PDF file can tell you if the file is corrupted, and only a full parse of the File Trailer can fully validate the trailer's conformance to standards. However, I provide two approximations below that are reasonably accurate and relatively fast in the general case.
Check Last Kilobyte for File Trailer
This option is fairly fast, since it only looks at the tail of the file, and uses a string comparison rather than a regular expression match. According to Adobe:
Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.
Therefore, the following will work by looking for the file trailer instruction within that range:
def valid_file_trailer? filename
File.open filename { |f| f.seek -1024, :END; f.read.include? '%%EOF' }
end
A Stricter Check of the File Trailer via Regex
However, the ISO standard is both more complex and a lot more strict. It says, in part:
The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets (<< … >>) (using LESS-THAN SIGNs (3Ch) and GREATER-THAN SIGNs (3Eh)).
Without actually parsing the PDF, you won't be able to validate this with perfect accuracy using regular expressions, but you can get close. For example:
def valid_file_trailer? filename
pattern = /^startxref\n\d+\n%%EOF\n\z/m
File.open(filename) { |f| !!(f.read.scrub =~ pattern) }
end
VBA question
There is a large log file (around 500,000 lines), I need to read it line by line in reverse order, i.e. from the last line to the first line.
I know I can use FileSystemObject in the Microsoft Scripting Runtime reference, but there is no such option like reverse for ReadLine Method in TextStream
Now, the only way I can think of is like this, has a counter and skip previous lines for each of the line I read, but definitely this is not good enough. Any suggestion code/algo will be much appreciated.
If your log is a kind of database with field which allows to determine the order (is there a date field or line number field), if so you could try to use ADO solution with SQL query to read the log in reverse order (ORDER BY ... DESC). So, you will be able to read from last to first. Or generally- try to use ADO.
A file is not line based, or even character based, it's just bytes so there is no way to read lines in reverse order in a file. How the text is separated into lines is only determined by where there are line break characters in the text.
You can read lines from the beginning and store them in a rotating buffer, so that you have for example the last 1000 lines in the buffer when you reach the end of the file. That way you have a certain number of lines that you can access from your buffer without having to read the entire file for every single line.
After that you know how many lines there are in the file, so when you need to refill the buffer you can just skip a certain number of lines and read the following lines into the buffer.
Help me brainstorm how I would solve this problem.
I have a file of dates with corresponding data, the format looks like this:
Date,data,data,data,data,data
Date,data,data,data,data,data
It's a plain csv file, only commas being used.
I need to be able to select a beginning date. And then get the data for the next 20 days beginning with the date selected.
Date format:
2007.05.21 (y,m,d)
So I think it would be best to search for the date. Either loading the entire file first into memory or read line by line. The file is only 1 megabyte, however I might want to do this with a 100 megabyte file as well. Is that still little?
Also I will want to do this very many times. I think I may want to keep the file in memory for the entire run of the program. So I can repeatedly access it.
After finding the date. I need to be able to get column 2 day 1, column 4 day 4. Ect. However there is always the same amount of columns for each day. So I guess if this is loaded into some kind of array I can always know in what array number the next and next day starts.
Any help would be greatly appreciated. Also any code examples provided would really help. This is not a homework problem or anything like that and I'm really new to programming.
You can user csv library to parse your file like this line by line
require 'csv'
date_to_search = Date(2009, 10, 10)
CSV.read('yourfilename.txt', :col_sep => ',') do |row|
# row will be an array of strings which you can parse
cur_date = Date.parse(row[0])
if cur_date == date_to_search
# you are set to read next 19 lines
# you can keep a counter and increment it after parsing each line (row here)
end
# compare and check if you need this line (and next 19)
# other calculations
end
As your data is sorted, Binary Search is what you want to use.
Simply put, you look up an element near the middle of your CSV, compare its date to the one you're looking for, and continue recursively in the matching half of the file (See the Wikipedia link for details).
Binary search has a runtime complexity of O(log n), which means that the number of read operations on a file containing 1,000,000 lines (Reasonable estimation for 100 MB) will never (under normal circumstances, that is, lines of different length are equally distributed) exceed 20.
Therefore, there is no need to keep the file in memory, quite the contrary. The operating system's disk cache will do the task of accelerating consecutive operations for you without running into memory shortage.
To read and process a line, you first need to find its first character, which is either the first letter after a newline character (\n) or the beginning of the file. Reading multiple lines can be achieved similar.
To parse a line, I suggest you split the line at the separation characters and/or the date's dots. This is, of course, only appropriate if the CSV comes from a trustworthy source and never changes its layout.