How to save StringIO (PDF) data to a file in Ruby

I want to save a PDF file that is located on an external remote server, using Ruby. The PDF data arrives as a StringIO. I tried saving the data with File.write, but it is not working. I received the error below:
ArgumentError: string contains null byte
How can I save it?

require 'stringio'
sio = StringIO.new("he\x00llo")
File.open('data.txt', 'w') do |f|
f.puts(sio.read)
end
$ cat data.txt
hello
Response to comment:
Okay, try this:
require 'stringio'
sio = StringIO.new("\xc2\xb5") # The UTF-8 bytes for "µ".
sio.set_encoding('ASCII-8BIT') # Apparently, this is what you have.
File.open('data.txt', 'w:utf-8') do |f|
  f.puts(sio.read)
end
--output:--
1.rb:7:in `write': "\xC2" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
To get rid of that error, you can set the encoding of the StringIO to UTF-8:
require 'stringio'
sio = StringIO.new("\xc2\xb5") # The UTF-8 bytes for "µ".
sio.set_encoding('ASCII-8BIT') #Apparently, this is what you have.
sio.set_encoding('UTF-8') #Change the encoding to what it should be.
File.open('data.txt', 'w:UTF-8') do |f|
f.puts(sio.read)
end
Or, you can use the File.open modes:
require 'stringio'
sio = StringIO.new("\xc2\xb5") # The UTF-8 bytes for "µ".
sio.set_encoding('ASCII-8BIT') #Apparently, this is what you have.
File.open('data.txt', 'w:UTF-8:ASCII-8BIT') do |f|
f.puts(sio.read)
end
But, that assumes the data is encoded in UTF-8. If you actually have binary data, i.e. data that isn't encoded because it represents a .jpg file for instance, then that won't work.
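For data that really is binary, such as a PDF, one option is to skip encodings altogether and write the bytes in binary mode. This is a minimal sketch, assuming the StringIO holds the raw bytes; the sample bytes and the output path are made up for illustration:

```ruby
require 'stringio'
require 'tmpdir'

# Hypothetical stand-in for the downloaded PDF: arbitrary bytes, including a null byte.
sio = StringIO.new("%PDF-1.4\x00\xC2\xB5".b)

# Hypothetical output path inside a fresh temporary directory.
path = File.join(Dir.mktmpdir, 'data.pdf')

# 'wb' opens the file in binary mode, so no encoding conversion is attempted.
File.open(path, 'wb') do |f|
  f.write(sio.read)
end

# Read the bytes back, also in binary mode, to compare with the source.
written = File.binread(path)
```

In Ruby, "string contains null byte" is typically raised when a string with a null byte is used as a path, so it is also worth checking that the filename comes first and the data second in the File.write call.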

Related

Ruby CSV BOM|UTF-8 encoding for StringIO

Ruby 2.6.3.
I have been trying to parse a StringIO object into a CSV instance with the bom|utf-8 encoding, so that the BOM character (undesired) is stripped and the content is encoded to UTF-8:
require 'csv'
CSV_READ_OPTIONS = { headers: true, encoding: 'bom|utf-8' }.freeze
content = StringIO.new("\xEF\xBB\xBFid\n123")
first_row = CSV.parse(content, CSV_READ_OPTIONS).first
first_row.headers.first.include?("\xEF\xBB\xBF") # This returns true
Apparently the bom|utf-8 encoding does not work for StringIO objects, but I found that it does work for files, for instance:
require 'csv'
CSV_READ_OPTIONS = { headers: true, encoding: 'bom|utf-8' }.freeze
# File content is: "\xEF\xBB\xBFid\n12"
first_row = CSV.read('bom_content.csv', CSV_READ_OPTIONS).first
first_row.headers.first.include?("\xEF\xBB\xBF") # This returns false
Considering that I need to work with StringIO directly, why does CSV ignore the bom|utf-8 encoding? Is there any way to remove the BOM character from the StringIO instance?
Thank you!
Ruby 2.7 added the set_encoding_by_bom method to IO. This method consumes the byte order mark and sets the encoding.
require 'csv'
require 'stringio'
CSV_READ_OPTIONS = { headers: true }.freeze
content = StringIO.new("\xEF\xBB\xBFid\n123")
content.set_encoding_by_bom
first_row = CSV.parse(content, **CSV_READ_OPTIONS).first # Ruby 3.0+ needs ** to pass the hash as keyword options.
first_row.headers.first.include?("\xEF\xBB\xBF")
#=> false
Ruby doesn't like BOMs. It only handles them when reading a file, never anywhere else, and even then it only reads them so that it can get rid of them. If you want a BOM for your string, or a BOM when writing a file, you have to handle it manually.
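There are probably gems for doing this, though it's easy to do yourself:
# Compare raw bytes so the check does not depend on the string's current encoding.
bytes = string.b
if bytes.start_with?("\xEF\xBB\xBF".b)
  string = string.byteslice(3..-1).force_encoding('UTF-8')
elsif bytes.start_with?("\xFF\xFE".b)
  string = string.byteslice(2..-1).force_encoding('UTF-16LE')
# etc.
end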
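The other direction mentioned above, adding a BOM when writing a file, can be sketched as follows. The file name and CSV content are made up for illustration, and reading the file back with the 'r:bom|utf-8' mode is only there to show that the BOM is stripped again on the way in:

```ruby
require 'tmpdir'

# The UTF-8 byte order mark as raw bytes.
UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical output file inside a fresh temporary directory.
path = File.join(Dir.mktmpdir, 'with_bom.csv')

# Write the BOM by hand, followed by the UTF-8 content.
File.open(path, 'wb') do |f|
  f.write(UTF8_BOM)
  f.write("id\n123\n".encode('UTF-8'))
end

# The raw bytes on disk start with the BOM.
raw = File.binread(path)

# Opening with 'r:bom|utf-8' consumes the BOM and tags the text as UTF-8.
text = File.open(path, 'r:bom|utf-8') { |f| f.read }
```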
I found out that forcing encoding to utf8 on the StringIO string and removing the BOM to generate a new StringIO worked:
require 'csv'
CSV_READ_OPTIONS = { headers: true}.freeze
content = StringIO.new("\xEF\xBB\xBFid\n123")
csv_file = StringIO.new(content.string.force_encoding('utf-8').sub("\xEF\xBB\xBF", ''))
first_row = CSV.parse(csv_file, **CSV_READ_OPTIONS).first # Ruby 3.0+ needs ** to pass the hash as keyword options.
first_row.headers.first.include?("\xEF\xBB\xBF") # => false
The encoding option is no longer needed. It may not be the best option memory-wise, but it works.

I am getting (eval):1: invalid Unicode codepoint error while trying to scrape instagram

I am trying to scrape data from instagram. Here is my code
require 'open-uri'
require 'nokogiri'
require 'json'
require "unicode/emoji"
def get_html
url = 'https://www.instagram.com/muriithi_kabogo/'
html = open(url)
end
def pass_data
html = get_html
doc = Nokogiri::HTML(html)
end
def get_data
profiles = []
body = pass_data.at('body')
script = body.at('script').text
myText = script
json_object_data = eval(myText)
end
get_data()
When I try to change the text into json format, I get an error:
(eval):1: invalid Unicode codepoint (SyntaxError)
usinessmen #beautiful #smile\ud83d\ude0a #teambringit #shebr
How do I move past this error?
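JSON, like JavaScript, escapes characters outside the Basic Multilingual Plane as UTF-16 surrogate pairs such as \ud83d\ude0a, which Ruby chokes on when it evaluates them as a Ruby string literal.
Do not use eval. For one thing, Ruby will detect \ud83d\ude0a as invalid codepoints, as it should; for another, running eval on text fetched from the web is a security hole; and lastly, it slows down your code.
Use JSON.parse, which is safer, faster, and knows how to decode surrogate pairs: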
require 'json'
json_str = '"usinessmen #beautiful #smile\ud83d\ude0a #teambringit #shebr"'
JSON.parse(json_str)
# => "usinessmen #beautiful #smile😊 #teambringit #shebr"

`scan': invalid byte sequence in UTF-8 (ArgumentError)

I'm trying to read a .txt file in ruby and split the text line-by-line.
Here is my code:
def file_read(filename)
File.open(filename, 'r').read
end
puts f = file_read('alice_in_wonderland.txt')
This works perfectly. But when I add the method line_cutter like this:
def file_read(filename)
File.open(filename, 'r').read
end
def line_cutter(file)
file.scan(/\w/)
end
puts f = line_cutter(file_read('alice_in_wonderland.txt'))
I get an error:
`scan': invalid byte sequence in UTF-8 (ArgumentError)
I found this online, written for untrusted websites, and tried to adapt it to my own code, but it's not working. How can I remove this error?
Link to the file: File
The linked text file contains the following line:
Character set encoding: ISO-8859-1
If converting it isn't desired or possible then you have to tell Ruby that this file is ISO-8859-1 encoded. Otherwise the default external encoding is used (UTF-8 in your case). A possible way to do that is:
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1')
s.encoding # => #<Encoding:ISO-8859-1>
Or even like this if you prefer your string UTF-8 encoded (see utf8everywhere.org):
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1:UTF-8')
s.encoding # => #<Encoding:UTF-8>
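To see the difference without the original text file, here is a small self-contained sketch. The sample file and its content are invented for illustration; only the two File.read calls mirror the ones above:

```ruby
require 'tmpdir'

# Hypothetical sample file inside a fresh temporary directory.
path = File.join(Dir.mktmpdir, 'latin1.txt')

# In ISO-8859-1, "é" is the single byte 0xE9, which is not valid UTF-8 on its own.
File.binwrite(path, "caf\xE9 au lait\n".b)

# Reading the bytes as if they were UTF-8 yields a string with a broken encoding.
mislabeled = File.read(path, encoding: 'UTF-8')

# Declaring the real encoding and asking for UTF-8 transcodes the text on read.
transcoded = File.read(path, encoding: 'ISO-8859-1:UTF-8')

# With a valid string, scan runs without the "invalid byte sequence" error.
letters = transcoded.scan(/\w/)
```

Calling mislabeled.valid_encoding? is a quick way to confirm that a file was read with the wrong encoding before any regular expression is applied to it.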
It seems to work if you read the file directly from the page; maybe there's something funny about the local copy you have. Try this:
require 'net/http'
uri = 'http://www.ccs.neu.edu/home/vip/teach/Algorithms/7_hash_RBtree_simpleDS/hw_hash_RBtree/alice_in_wonderland.txt'
scanned = Net::HTTP.get_response(URI.parse(uri)).body.scan(/\w/)

Ruby Build Hash from file

I'm consuming a web service and using Savon to make about 1000 (paid) requests and parse the responses into a CSV file.
I save the XML/hash response to a file if the parsing fails.
How can I initialize a hash from one that was saved to a file? (Or should I save the XML and then let Savon turn it into a hash again?)
Extra info:
client = Savon.client do
wsdl "url"
end
response = client.call(:read_request) do
message "dat:number" => number
end
I use the response.hash to build/parse my csv data. Ex:
name = response.hash[:description][:name]
If the building failed I'm thinking about saving the response.hash to a file. But the problem is I don't know how to reuse the saved response (XML/Hash) so that an updated version of the building/parsing can be run using the saved response.
You want to serialize the Hash to a file then deserialize it back again.
You can do it as text with YAML or JSON, or in binary via Marshal.
Marshal
def serialize_marshal filepath, object
File.open( filepath, "wb" ) {|f| Marshal.dump object, f }
end
def deserialize_marshal filepath
File.open( filepath, "rb") {|f| Marshal.load(f)}
end
Marshaled data has a major and minor version number written with it, so it's not guaranteed to always load in another Ruby if the Marshal data version changes.
YAML
require 'yaml'
def serialize_yaml filepath, object
File.open( filepath, "w" ) {|f| YAML.dump object, f }
end
def deserialize_yaml filepath
File.open( filepath, "r") {|f| YAML.load(f) }
end
JSON
require 'json'
def serialize_json filepath, object
File.open( filepath, "w" ) {|f| JSON.dump object, f }
end
def deserialize_json filepath
File.open( filepath, "r") {|f| JSON.load(f)}
end
Anecdotally, YAML is slow, Marshal and JSON are quick.
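One difference between these formats matters for code that reads the hash with symbol keys, as in response.hash[:description][:name]. The sketch below uses an invented hash as a stand-in for the Savon response and round-trips it in memory rather than through a file:

```ruby
require 'json'

# Hypothetical stand-in for response.hash, with symbol keys.
original = { description: { name: 'Widget', id: 7 } }

# Marshal restores the object as it was, symbol keys included.
from_marshal = Marshal.load(Marshal.dump(original))

# JSON has no symbols, so keys come back as strings by default.
from_json = JSON.parse(JSON.generate(original))

# symbolize_names: true turns the keys back into symbols while parsing.
from_json_symbols = JSON.parse(JSON.generate(original), symbolize_names: true)
```

With the default JSON.parse, from_json[:description] is nil and the value lives under from_json['description'], so existing parsing code needs either string keys or the symbolize_names option.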
If your code expects to manipulate a Ruby hash as demonstrated above and you want to save the Savon response, use the json gem and do something like:
require 'json'
File.open("responseX.json","w") do |f|
f << response.hash.to_json
end
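Then, if you need to read that file to recreate your response hash:
# to_json wrote the whole hash as one JSON document, so parse the file in one go.
# Note that JSON.parse returns string keys, not symbols.
response_hash = JSON.parse(File.read('responseX.json'))
# do something with response_hash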

Using aws-sdk to download files from s3. Encoding not right

I am trying to use aws-sdk to download S3 files to local disk, and I wonder why my PDF file (which just contains the text SAMPLE PDF) ends up with apparently empty content.
I guess it has something to do with the encoding, but how can I fix it?
Here is my code:
require 'aws-sdk'
bucket_name = "****"
access_key_id = "***"
secret_access_key = "**"
s3=AWS::S3.new(
access_key_id: access_key_id,
secret_access_key: secret_access_key)
b = s3.buckets[bucket_name]
filen = File.basename("Sample.pdf")
path = "original/90/#{filen}"
o = b.objects[path]
require 'tempfile'
ext= File.extname(filen)
file = File.open("test.pdf","w", encoding: "ascii-8bit")
# streaming download from S3 to a file on disk
begin
file.write(o.read) do |chunk|
file.write(chunk)
end
end
file.close
If I take out the encoding: "ascii-8bit", I just get the error message Encoding::UndefinedConversionError: "\xC3" from ASCII-8BIT to UTF-8
After some research and a tip from a cousin of mine, I finally got this to work.
Instead of using the AWS solution to load the file from Amazon and write it to disk (which was generating a strange PDF file: apparently equal to the original, but with blank content, and Adobe Reader "fixing" it when opening), I am now using open-uri with SSL verification ignored.
Here is the final code, which made my day:
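require 'open-uri'
require 'openssl'
# 'wb' writes the bytes unchanged; URI.open replaces Kernel#open for URLs on Ruby 3.0+.
File.open('test.pdf', 'wb') do |file|
  # VERIFY_NONE turns off certificate checking; use it only if you accept that risk.
  file << URI.open('https://s3.amazon.com/mybucket/Sample.pdf', ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE).read
end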
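The conversion error quoted in the question can be reproduced without S3. This is a minimal sketch, assuming the download returns an ASCII-8BIT string; the sample bytes and file names are invented for illustration:

```ruby
require 'tmpdir'

# Hypothetical stand-in for the bytes returned by the download.
data = "%PDF\xC3\xA9".b

dir = Dir.mktmpdir

# Text mode with a UTF-8 external encoding: Ruby tries to transcode the
# ASCII-8BIT string to UTF-8 and fails on the first byte above 0x7F.
error = begin
  File.open(File.join(dir, 'text_mode.pdf'), 'w:UTF-8') { |f| f.write(data) }
  nil
rescue Encoding::UndefinedConversionError => e
  e
end

# Binary mode: the bytes are written unchanged, with no transcoding step.
binary_path = File.join(dir, 'binary_mode.pdf')
File.open(binary_path, 'wb') { |f| f.write(data) }
round_trip = File.binread(binary_path)
```

This suggests that the 'wb' mode, rather than the switch to open-uri itself, is what keeps the bytes of the PDF intact on disk.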

Resources