Read a Zip::Entry object after unzipping an xml file - ruby

I have an external XML file download that needs to be unzipped and parsed. I have downloaded and unzipped it, but now it is stuck as a Zip::Entry object and I am unable to parse it with Nokogiri.
require 'open-uri'
require 'zip'
require 'nokogiri'
url = 'https://download.api.bingads.microsoft.com/ReportDownload/Download.aspx?xmlfile'
zip_file = open(url)
# file pulled down successfully => tmp/localpath
unzippedxml = Zip::File.open(zip_file.path) do |z|
  xml_file = z.first
end
#output is my xml file => myxml.xml
unzippedxml.class => Zip::Entry
Nokogiri::XML("unzippedxml")
=> #<Nokogiri::XML::Document:0x212b2c0 name="document")
How do I parse this file? I've created a dummy XML file that didn't need to be unzipped, and I was able to parse it in the console, but I am unable to get this one open.
Any help would be greatly appreciated!

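Zip::ZipFile represents the entire Zip container; what you need instead is inside this container, an object of class Zip::ZipEntry (in rubyzip 1.0 and later these classes are named Zip::File and Zip::Entry, and the library is loaded with require 'zip'). You could, for example, use Zip::ZipFile#read to get the contents of a file with a specific name: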
require 'zip/zip'
zip = Zip::ZipFile.open('some.zip') # open zip
xml_source = zip.read('filename_inside_zip.xml') # read file contents
# now use the contents of xml_source with Nokogiri
Or, if you don't know the name but there's always only one file in the Zip, you can just take the first one:
require 'zip/zip'
zip = Zip::ZipFile.open('some.zip') # open zip
entry = zip.entries.reject(&:directory?).first # take first non-directory
xml_source = entry.get_input_stream{|is| is.read } # read file contents
# now use the contents of xml_source with Nokogiri

Related

Collect all the zip files having same name and zip it again Ruby

I have a directory that contains multiple timestamped logs for the same script. I want to collect all the zip files for each script and make a new zip from them.
Directory Structure:
Test_1_Run_Logs_06-12-2018_10_15_35.zip
Test_1_Integration_Logs_06-12-2018_10_15_35.zip
Test_1_Interface_Logs_06-12-2018_10_15_35.zip
Test_2_Run_Logs_06-12-2018_10_30_35.zip
Test_2_Integration_Logs_06-12-2018_10_30_35.zip
Test_2_Interface_Logs_06-12-2018_10_30_35.zip
I have separated out all the files that share the same name, but the new zip file does not end up containing the zip files. How can I do this in Ruby?
Code
require 'fileutils'
require 'zip'
scriptNameArr = []
logFolder = 'C:/Users/Desktop/logs/'
copyFolder = "C:/Users/admin/Desktop/Test/Ruby Test/copyFolder/"
# Collect the script names of all files in logFolder, dropping the timestamp
Dir.entries("#{logFolder}/").each do |fName|
  unless File.directory?("#{logFolder}#{fName}")
    scriptNameArr << fName.split("/").last.split(/_\d+-\d+-/)[0]
  end
end
scriptNameArr.uniq!
# Create a new zip in copyFolder for each script name
scriptNameArr.each do |scriptName|
  zipName = "#{copyFolder}#{scriptName}.zip"
  Dir.mkdir(copyFolder) unless Dir.exist?(copyFolder)
  FileUtils.rm(zipName) if File.exist?(zipName)
  Zip::File.open(zipName, Zip::File::CREATE) do |zip|
    Dir.glob("#{logFolder}#{scriptName}*") { |file|
      fileName = file.split("/").last
      zip.add(fileName, logFolder)
    }
  end
end
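It creates an empty zip every time. What should I do to copy the zip files into the new zip at the new location?
I think you are missing the file name in zip.add(fileName, logFolder): the second argument must be the path to the source file, not just the folder.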
Correct:
zip.add(fileName, File.join(logFolder, fileName))
Happy Coding! :)
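The grouping step and the corrected source path can be checked without touching any archives. The following is a minimal sketch using only the standard library; group_logs_by_script and source_path are hypothetical helper names (not part of rubyzip), and the sample file names are modeled on the listing in the question, with one extra made-up timestamp so that a group holds more than one file.

```ruby
# Hypothetical helpers, not part of rubyzip: they only prepare the two
# arguments that would be passed to zip.add(entry_name, source_path).

# Group log archive names by the part before the timestamp,
# e.g. "Test_1_Run_Logs_06-12-2018_10_15_35.zip" => "Test_1_Run_Logs".
def group_logs_by_script(file_names)
  file_names.group_by { |name| name.split(/_\d+-\d+-/).first }
end

# Build the full path of the file to add; the folder alone is not enough.
def source_path(log_folder, file_name)
  File.join(log_folder, file_name)
end

# Sample names; the 07-12-2018 entry is invented for illustration.
names = [
  'Test_1_Run_Logs_06-12-2018_10_15_35.zip',
  'Test_1_Run_Logs_07-12-2018_09_00_00.zip',
  'Test_1_Integration_Logs_06-12-2018_10_15_35.zip'
]

group_logs_by_script(names).each do |script_name, files|
  files.each do |file_name|
    puts "#{script_name}.zip <- #{source_path('C:/Users/Desktop/logs/', file_name)}"
  end
end
```

Inside the Zip::File.open block, each pair would then be passed as zip.add(file_name, source_path(logFolder, file_name)).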

Using aws-sdk to download files from s3. Encoding not right

I am trying to use aws-sdk to download S3 files to local disk, and I wonder why my PDF file (which just contains the text SAMPLE PDF) ends up with apparently empty content.
I guess it has something to do with the encoding, but how can I fix it?
Here is my code :
require 'aws-sdk'
bucket_name = "****"
access_key_id = "***"
secret_access_key = "**"
s3=AWS::S3.new(
access_key_id: access_key_id,
secret_access_key: secret_access_key)
b = s3.buckets[bucket_name]
filen = File.basename("Sample.pdf")
path = "original/90/#{filen}"
o = b.objects[path]
require 'tempfile'
ext= File.extname(filen)
file = File.open("test.pdf","w", encoding: "ascii-8bit")
# streaming download from S3 to a file on disk
begin
file.write(o.read) do |chunk|
file.write(chunk)
end
end
file.close
If I take out the encoding: "ascii-8bit", I just get the error message Encoding::UndefinedConversionError: "\xC3" from ASCII-8BIT to UTF-8.
After some research and a tip from a cousin of mine, I finally got this to work.
Instead of using the AWS solution to load the file from Amazon and write it to disk (which was generating a strange PDF file: apparently equal to the original, but with blank content, and with Adobe Reader "fixing" it when opening), I am now using open-uri, with SSL verification disabled.
Here is the final code which made my day:
require 'open-uri'
open('test.pdf', 'wb') do |file|
file << open('https://s3.amazon.com/mybucket/Sample.pdf',:ssl_verify_mode => OpenSSL::SSL::VERIFY_NONE).read
end
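One difference between the two versions is the file mode: the first opens the target with "w" (text mode, which on Windows translates line endings), while the working one uses 'wb'. Whether that was the actual cause here is an assumption, but the binary round trip can be checked independently of S3 and open-uri. A minimal sketch using only the standard library; save_binary is a hypothetical helper and the payload is a made-up stand-in for a downloaded PDF body.

```ruby
require 'tmpdir'

# Hypothetical helper: write raw bytes to disk with no newline or
# encoding translation. Mode 'wb' opens the file in binary mode.
def save_binary(path, bytes)
  File.open(path, 'wb') { |file| file.write(bytes) }
end

# Stand-in for a downloaded body: a PDF-like header plus bytes that are
# not valid UTF-8, tagged as binary with String#b.
payload = "%PDF-1.4\n\xC3\xFF\x00\r\n".b

Dir.mktmpdir do |dir|
  target = File.join(dir, 'test.pdf')
  save_binary(target, payload)
  copy = File.binread(target)   # read back without translation
  puts copy.bytesize            # number of bytes on disk
  puts copy == payload          # whether the bytes survived the round trip
end
```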

Edit docx using nokogiri and rubyzip

Here, I'm using rubyzip and nokogiri to modify a .docx file.
RubyZip -> Unzip the .docx file
Nokogiri -> Parse and change the content of the body of word/document.xml
The sample code below modifies the file, but the other files in the archive are disturbed. In other words, the updated file does not open; the word processor shows an error and crashes. How can I resolve this issue?
require 'zip/zipfilesystem'
require 'nokogiri'
zip = Zip::ZipFile.open("SecurityForms.docx")
doc = zip.find_entry("word/document.xml")
xml = Nokogiri::XML.parse(doc.get_input_stream)
wt = xml.root.xpath("//w:t", {"w" => "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}).first
wt.content = "FinalStatement"
zip.get_output_stream("word/document.xml") {|f| f << xml.to_s}
zip.close
According to the official GitHub documentation, you should use write_buffer instead of open. There's also a code example at the link.
The following code edits the content of a .docx template file. It first creates a new copy of your template.docx. You need to create this template file yourself and keep it in the same folder as your Ruby class; for example, create My_Class.rb and copy the following code into it. It works perfectly in my case. Remember that you need to install the rubyzip and nokogiri gems in a gemset (Google how to install them). Thanks.
require 'rubygems'
require 'fileutils'
require 'zip/zipfilesystem'
require 'nokogiri'
class Edit_docx
  def initialize
    # Build a random 50-letter name for the working copy of the template
    coupling = [('a'..'z'), ('A'..'Z')].map { |i| i.to_a }.flatten
    secure_string = (0...50).map { coupling[rand(coupling.length)] }.join
    FileUtils.cp 'template.docx', "#{secure_string}.docx"
    zip = Zip::ZipFile.open("#{secure_string}.docx")
    doc = zip.find_entry("word/document.xml")
    xml = Nokogiri::XML.parse(doc.get_input_stream)
    wt = xml.root.xpath("//w:t", {"w"=>"http://schemas.openxmlformats.org/wordprocessingml/2006/main"})
    #puts wt
    # Replace the text of every w:t node with its index
    wt.each_with_index do |tag, i|
      tag.content = i.to_s
    end
    zip.get_output_stream("word/document.xml") { |f| f << xml.to_s }
    zip.close
    puts secure_string
    #FileUtils.rm("#{secure_string}.docx")
  end
end
Edit_docx.new

Can't read files from amazon s3 bucket using aws_s3 (ruby gem) in correct encoding?

I have a problem when creating a file in 'utf-8' encoding and reading it back from an Amazon S3 bucket.
I create a file.
file = File.open('new_file', 'w', :encoding => 'utf-8')
string = "Some ££££ sings"
file.write(string)
file.close
When read from local everything is ok.
open('new_file').read
=> "Some ££££ sings"
Now I upload the file to amazon s3 using aws_s3.
AWS::S3::S3Object.store('new_file', open('new_file'), 'my_bucket')
=> #<AWS::S3::S3Object::Response:0x2214462560 200 OK>
When I read from amazon s3
AWS::S3::S3Object.find('new_file', 'my_bucket').value
=> "Some \xC2\xA3\xC2\xA3\xC2\xA3\xC2\xA3 sings"
open(AWS::S3::S3Object.find('new_file','my_bucket').url).read
=> "Some \xC2\xA3\xC2\xA3\xC2\xA3\xC2\xA3 sings"
I've been trying many things and still can't find a solution.
Many Thanks for all the help
M
I found a solution on a different forum.
The way to do it is to make sure we are uploading the text file in 'utf-8' in the first place. This by itself will not solve the problem, but it lets you safely force the encoding of the string that is streamed back:
open(AWS::S3::S3Object.find('new_file','my_bucket').url).read.force_encoding('utf-8')
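The effect of force_encoding can be reproduced without S3. A minimal sketch, assuming the downloaded bytes come back tagged as binary (ASCII-8BIT), which is what the escaped output in the question suggests; as_utf8 is a hypothetical helper name.

```ruby
# Hypothetical helper: relabel bytes as UTF-8 without changing them.
# force_encoding only changes the encoding tag; it does not transcode.
def as_utf8(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8)
end

# Same bytes as in the question, tagged as binary like a raw download.
raw = "Some \xC2\xA3\xC2\xA3\xC2\xA3\xC2\xA3 sings".b

text = as_utf8(raw)
puts raw.encoding          # the binary encoding of the raw bytes
puts text.encoding         # the encoding after relabeling
puts text.valid_encoding?  # whether the bytes form valid UTF-8
puts text                  # the pound signs are readable again
```

Because the bytes are only relabeled, this is safe only when the uploaded file really was UTF-8, which is why the answer insists on that first.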
I think there is a better solution. Put the file you are writing to in binmode.
require 'tempfile'
file = File.open("test.txt", "wb")
# or use File#binmode on a file opened for writing
file = File.open("test.txt", "w")
file.binmode
# binmode also works with Tempfile
file = Tempfile.new
file.binmode
# then proceed to downloading
s3 = AWS::S3.new
s3.buckets["foo"]["test.txt"].read do |chunk|
  file.write(chunk)
end

Ruby Load multiple xml from a directory in a program to parse them

I want to load a set of XML files from a directory and use REXML to parse them all in a loop.
I can't seem to create a File object after I start reading from a directory.
i=1
filearray=Array.new
documentarray=Array.new
directory = 'xml'
Dir.foreach(directory).each { |file|
  next if file == '.' or file == '..'
  filearray[i]=File.open(directory +"/"+file)
  i=i+1
}
Please help
You are opening the file, but not reading it. This is ugly, but will work:
require 'find'
files = []
directory = 'xml'
# Return the lines of the file as an array
def get_contents(file)
  File.readlines(file)
end
Find.find(directory) do |file|
  next if FileTest.directory?(file)
  files << get_contents(file)
end
Hope it helps
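Since the question asks for REXML, the reading step can also feed the parser directly. The following is a minimal sketch, not the answerer's code; load_xml_documents is a hypothetical helper, and the demo files and their contents are made up.

```ruby
require 'rexml/document'
require 'tmpdir'

# Hypothetical helper: parse every .xml file in a directory with REXML and
# return a hash of file name => REXML::Document.
def load_xml_documents(directory)
  Dir.glob(File.join(directory, '*.xml')).sort.each_with_object({}) do |path, docs|
    docs[File.basename(path)] = REXML::Document.new(File.read(path))
  end
end

# Demo with two throwaway files in a temporary directory.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'a.xml'), '<item id="1">first</item>')
  File.write(File.join(dir, 'b.xml'), '<item id="2">second</item>')

  load_xml_documents(dir).each do |name, doc|
    puts "#{name}: #{doc.root.name} #{doc.root.attributes['id']} #{doc.root.text}"
  end
end
```

File.read returns the contents and closes the file, so no File objects are left open, and the glob pattern skips '.' and '..' without an explicit check.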

Resources