I have problems reading a large JSON file (2.9 GB) in Ruby. I am using this code:
json_file = File.read(filename)
results = JSON.parse(json_file)
and when I try to read the file I get the error:
Errno::EINVAL: Invalid argument - <filename>
I have tested the same code with smaller files and it works fine. To verify that the file is written correctly, I have tried to read it with Python, and it works.
Is there a limitation on the size of the file for JSON.parse? If so, could you recommend an alternative?
I have looked into msgpack to reduce the size of the files, but unfortunately I am constrained by the fact that I cannot install gems.
This is a limitation of IO.read.
You may split your file into smaller parts (for example, 1 gigabyte) and read them separately:
dirname = File.dirname(filename)

# Split the file into 1 GB pieces next to the original file
`split -b 1024m #{filename} #{filename}.parts.`
Dir.chdir(dirname)

# Read the pieces back one by one, concatenating and deleting as we go
parts = Dir["#{filename}.parts.*"]
json = ''
parts.each do |partname|
  json << File.read(partname)
  File.delete(partname)
end

results = JSON.parse(json)
Be patient, this could take a while.
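If shelling out to split isn't an option, the same idea can be done in pure Ruby by reading the file in fixed-size chunks. This is only a sketch, and the 1 GB chunk size is an arbitrary choice:

require 'json'

# Read the file in 1 GB chunks and concatenate them before parsing.
# The chunk size is arbitrary; pick something your platform handles comfortably.
chunk_size = 1024 * 1024 * 1024
json = ''
File.open(filename, 'rb') do |f|
  while (chunk = f.read(chunk_size))
    json << chunk
  end
end
results = JSON.parse(json)

Either way, the whole document still ends up in memory; the parsing step itself is unchanged.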
I have a 30 MB XML file that contains some gibberish at the beginning, so typically I have to remove that in order for Nokogiri to be able to parse the XML document properly.
Here's what I currently have:
contents = File.open(file_path).read
if contents[0..123].include? 'authenticate_response'
  fixed_contents = File.open(file_path).read[123..-1]
  File.open(file_path, 'w') { |f| f.write(fixed_contents) }
end
However, this actually causes the Ruby script to open the large XML file twice: once to read the first 123 characters, and again to read everything but the first 123 characters.
To solve the first issue, I was able to accomplish this:
contents = File.open(file_path).read(123)
However, now I need to remove these characters from the file without reading the entire file. How can I "trim" the beginning of this file without having to open the entire thing in memory?
You can open the file once, then read and check the "garbage", and finally pass the opened file directly to Nokogiri for parsing. That way, you only need to read the file once and don't need to write it at all.
require 'nokogiri'

File.open(file_path) do |xml_file|
  if xml_file.read(123).include? 'authenticate_response'
    # header found; the read already skipped past it, so nothing to do
  else
    # no header found: rewind and let Nokogiri parse the whole file
    xml_file.rewind
  end

  xml = Nokogiri::XML.parse(xml_file)
  # Now do whatever you want with the parsed XML document
end
Please refer to the documentation of IO#read, IO#rewind and Nokogiri::XML::Document.parse for details about those methods.
I am using Ruby to parse a CSV file. Lines look something like this:
NOT_PROCESSED:COLUMN1:COLUMN2:COLUMN3....
NOT_PROCESSED:COLUMN1:COLUMN2:COLUMN3....
NOT_PROCESSED:COLUMN1:COLUMN2:COLUMN3....
I parse the whole line, and after processing I want to change the first column from NOT_PROCESSED to PASSED or FAILED accordingly. After the file is processed, it should look like this:
PASSED :COLUMN1:COLUMN2:COLUMN3....
FAILED :COLUMN1:COLUMN2:COLUMN3....
PASSED :COLUMN1:COLUMN2:COLUMN3....
How do I do that correctly with Ruby?
Years ago I used code that looked like this:
CSV.open("File.csv", 'r+', { :col_sep => ':' } ) do |line|
old_pos = line.pos
account = line.shift
# PROCESSING CODE....
line.seek(old_pos)
account[0] = '%-13s' % status # status can be "PASSED" or "FAILED"
line << account
line.flush()
old_pos = line.pos
end
Now I am running this code on a newer machine with a new OS version as well as newer Ruby and gem versions. The code above no longer runs reliably; in some cases it even corrupts the input CSV file.
Most of the lines are still OK after processing, but there are also multiple malformed lines, something like this (for example, part of a line is missing):
LUMN1:COLUMN2:COLUMN3....
There can also be empty lines inserted, or two lines merged into one, etc.
What would be the best way to achieve what I want?
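One approach that avoids in-place seeks entirely is to write the rewritten rows to a temporary file and swap it in afterwards. This is a sketch, not code from the original question; process_line is a hypothetical stand-in for your processing:

require 'csv'
require 'fileutils'

source   = 'File.csv'
tmp_path = "#{source}.tmp"

# Hypothetical placeholder: return true for PASSED, false for FAILED.
def process_line(row)
  true
end

File.open(tmp_path, 'w') do |out|
  CSV.foreach(source, col_sep: ':') do |row|
    status = process_line(row) ? 'PASSED' : 'FAILED'
    row[0] = '%-13s' % status
    out.write(row.to_csv(col_sep: ':'))
  end
end

FileUtils.mv(tmp_path, source)

Rewriting the whole file is slower than patching it in place, but it cannot leave the file half-updated the way seek-based rewrites can.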
I am downloading an XML record from Musicbrainz.org, applying an XSLT transformation and outputting a new and different XML record.
I am running into one issue, and I wonder if it is a limitation of my approach, of XSLT transformations, or of applying Ruby to text.
I download the record:
require 'open-uri'
mb_metadata = open('http://musicbrainz.org/ws/2/release/?query=barcode:744861082927', 'User-Agent' => 'MarcBrainz marc4brainz#gmail.com').read
File.open('mb_record.xml', 'w').write(mb_metadata)
This works fine.
Then I want to transform that record. First I tried using Nokogiri:
# mb_metadata to transformed record
mb_record = Nokogiri::XML(File.read('mb_record.xml'))
#if we have the xslt document locally this introduces it
template = Nokogiri::XSLT(File.read('mb_to_marc.xsl'))
# this transforms the input document with the template.xslt
puts template.transform(mb_record)
If I run this on its own it works; however, if I download the record and then run this in the same script, it doesn't. It produces a transformed record that just contains some inserts; no element from the original XML file is transformed.
So I thought this might be an issue with Nokogiri and then I tried using the Ruby/XSLT gem:
xslt = XML::XSLT.new()
xslt.xml = 'mb_record.xml'
xslt.xsl = 'mb_to_marc.xsl'
out = xslt.serve()
print out;
Again, if I run this on a file that already exists locally it works, but if I download the file and then try to transform it in the same run, it doesn't work; it produces an error at the following line:
xslt.xml = 'mb_record.xml'
Both methods work fine if I just run them on a file which has been downloaded already.
So what's the issue? Is it a naming problem, an XSLT issue, or something else?
Here's the whole script:
#!/usr/bin/env ruby
# encoding: UTF-8
require 'rubygems' if RUBY_VERSION >= '1.9'
require 'pathname'
require 'httpclient'
require 'xml/xslt'
require 'nokogiri'
require 'open-uri'
# DOWNLOAD RECORD FROM MusicBrainz.org - this works
mb_metadata = open('http://musicbrainz.org/ws/2/release/?query=barcode:744861082927', 'User-Agent' => 'MarcBrainz marc4brainz#gmail.com').read
#puts record
File.open('mb_record.xml', 'w').write(mb_metadata)
# mb_metadata to transformed record - this works on a saved file, but not if the file is created earlier in this script.
#
#mb_record = Nokogiri::XML(File.read('mb_record.xml'))
#if we have the xslt document locally this introduces it
#template = Nokogiri::XSLT(File.read('mb_to_marc.xsl'))
# this is supposed to transform the input document with the template.xslt
#puts template.transform(mb_record)
# TRYING ANOTHER TACK
# This works if acting on a saved file, i.e. if I comment out the Nokogiri lines above and just run the lines below, then 'print out' shows the XML correctly transformed by the XSLT into more XML.
# I added 'sleep 3' to see if that would help but it doesn't make a difference.
xslt = XML::XSLT.new()
xslt.xml = 'mb_record.xml'
xslt.xsl = 'mb_to_marc.xsl'
out = xslt.serve()
print out;
File.open('mb_record.xml', 'w').write(mb_metadata)
is better written as
File.write('mb_record.xml', mb_metadata)
The first will result in a file that hasn't been closed, and possibly not flushed to the disk, which can mean the file has no contents, or only partial contents.
The second writes the file and immediately flushes and closes it.
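If you prefer to keep the explicit open, the block form achieves the same thing, since the file is flushed and closed when the block returns:

File.open('mb_record.xml', 'w') { |f| f.write(mb_metadata) }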
I have a very large text file (958 MB), and I have created the following script:
f = IO.read("Playback.xml").encode("utf-8", replace: nil)
separate_files_array = f.scan(/<Bla>.*?<\/Bla>/)
counter = 0
separate_files_array.each do |x|
  # ...
end
The code above only iterates over the first 31 occurrences of that regex, and I have no idea why.
There is no way these are all the occurrences; I can see that they are not, and the script only runs for a few seconds, which makes no sense for a file that size.
The problem is that IO.read buffers by default and loads only part of the file into memory. In the end I used the approach from this question to solve it:
Regexp search through a very large file
File.read, by contrast, does not buffer by default, which with too big a file can cause the program to crash.
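For what it's worth, here is a streaming sketch of my own (not from the linked question) that never holds the whole file in memory: iterate the file in chunks delimited by the closing tag, then match within each chunk.

# Iterate the file in chunks ending at "</Bla>", so each chunk contains
# at most one complete <Bla>...</Bla> element. /m lets . match newlines.
counter = 0
File.foreach("Playback.xml", "</Bla>") do |chunk|
  match = chunk[/<Bla>.*?<\/Bla>/m]
  next unless match
  counter += 1
  # ... process match ...
end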
I'm still learning Ruby, so I'm sure I'm doing something wrong here, but using Ruby 1.9.3 on Windows, I'm having a problem writing files of random ASCII garbage at a specific size. I need to be able to write these files for a test on an application I'm QAing. On Mac and on *nix, the file size is written correctly every time. But on Windows, it generates files of varying size, generally between 1,024 and 1,031 bytes.
I'm sure the problem is that one of the characters rstr is generating is being counted as two characters, but... it seems like this shouldn't happen.
Here is my code:
num = 10
k = 1
for i in 1..num
  fname = "f#{i}.txt"
  f = File.new(fname, "w")
  for k in 1..size
    rstr = "#{(1..1024).map{rand(255).chr}.join}"
    f.write rstr
    print " #{rstr.size} " # this returns 1024 every time.
    rstr = ""
  end
  f.close
end
Also tried:
opts = {}
opts[:encoding] = "UTF-8"
fname = "f#{i}.txt"
f = File.new(fname, "w", opts)
By default, files on Windows are opened in text mode, meaning that line endings and other details are adjusted.
If you want the files written byte-for-byte exactly as you intend, you need to open them in binary mode:
File.new("foo", "wb") do |f|
# ...
end
The b is ignored on POSIX operating systems, so your script remains cross-platform compatible.
Note: I used the block syntax to manage the file so it is properly closed and the file handle disposed of once the block finishes. You no longer need to worry about closing the file ;-)
Hope this helps.
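If the goal is simply files of an exact byte size filled with random data, a simpler sketch (not from the original answer) sidesteps character encodings entirely by writing raw bytes in binary mode:

num  = 10
size = 1024 # bytes per file; adjust as needed

(1..num).each do |i|
  # "wb" (binary mode) prevents any newline or encoding translation on Windows.
  File.open("f#{i}.txt", "wb") do |f|
    f.write(Random.new.bytes(size))
  end
end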
There is no ASCII character 255; rand(255) produces values from 0 to 254.
If you try to print 255.chr, you'll get a multibyte character.
Since Windows does not default to UTF-8, you'll get incorrect values, hence the problem you're facing!
Try adding # coding: utf-8 at the top of your file. It should get things working.