I've been asked to write some tests to confirm that certain text is contained within a PDF file. I've come across the PDF::Reader gem, which extracts text from the file well enough, except that the spacing in the output is unreliable. For example, a piece of text that should read "Date of first registration of the product" is seen by PDF::Reader as "Date offirstregistrationoftheproduct". So when I run my assertion, it fails because of the spacing of the text.
My code:
expected_text = 'Date of first registration of the product'
file = File.open(my_pdf, "rb")
PDF::Reader.open(file) do |reader|
  reader.pages.each do |page|
    expect(page).to have_text expected_text
  end
end
The result is an RSpec expectation not met error.
Is there a way I can get this text properly formatted so that my assertion can read it?
A PDF::Reader page object is not a string. To get the text of a page, call page.text on it. Matching that text against a regex may then solve your problem.
Try something like below.
expected_text = 'Date of first registration of the product'
file = File.open(my_pdf, "rb")
PDF::Reader.open(file) do |reader|
  reader.pages.each do |page|
    # String#match returns a MatchData or nil, never true, so use the match matcher.
    expect(page.text).to match(/#{Regexp.escape(expected_text)}/)
  end
end
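Note that a regex built directly from the expected string still requires the spaces to be present. If the extracted text has lost some of them, one option is to compare both strings with all whitespace removed. This is only a minimal sketch, assuming that ignoring whitespace entirely is acceptable for the test; the helper names are made up for illustration and are not part of PDF::Reader or RSpec.

```ruby
# Hypothetical helpers, not part of PDF::Reader or RSpec.

# Remove every whitespace character so spacing differences do not matter.
def strip_whitespace(text)
  text.gsub(/\s+/, '')
end

# True when needle occurs in haystack once whitespace is ignored in both.
def includes_ignoring_whitespace?(haystack, needle)
  strip_whitespace(haystack).include?(strip_whitespace(needle))
end

extracted = 'Date offirstregistrationoftheproduct'
expected  = 'Date of first registration of the product'

puts includes_ignoring_whitespace?(extracted, expected)
```

In the spec, the result could be asserted with something like expect(includes_ignoring_whitespace?(page.text, expected_text)).to be true.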
I'm trying to access http://www.orimi.com/pdf-test.pdf to test if "PDF Test File" exists.
This is my code:
it 'pdf test' do
visit 'http://www.orimi.com/pdf-test.pdf'
puts page.title
sleep 5
convert_pdf_to_page
expect(page).to have_content 'PDF Test File'
end
def convert_pdf_to_page
temp_pdf = Tempfile.new('pdf')
temp_pdf << page.source.force_encoding('UTF-8')
reader = PDF::Reader.new(temp_pdf)
pdf_text = reader.pages.map(&:text)
temp_pdf.close
page.driver.response.instance_variable_set('@body', pdf_text)
end
But I got:
PDF::Reader::MalformedPDFError: PDF does not contain EOF marker
I searched and found that the problem can be the PDF file itself. I checked the temp_pdf variable and it contains just HTML with an empty body.
Is there something wrong in my code?
PDF is a tricky format, and different readers react differently to unexpected content in the PDF files. Some would crash, others would make assumptions to not crash.
I'd guess this is what happens here. When you open the file in a browser or PDF viewer it works, but PDF::Reader can't handle whatever is non-standard in it.
Try using a different gem; Origami seems to be well regarded. I tried it with your file, and it seems to work:
> require 'origami'
> pdf = Origami::PDF.read '/tmp/pdf-test.pdf'
> pdf.grep(/Not existing/).any?
=> false
> pdf.grep(/PDF Test File/).any?
=> true
For reference (how I came up with this answer):
I googled the error message, PDF::Reader::MalformedPDFError: PDF does not contain EOF marker, and found a thread suggesting that this is a fairly common problem with PDFs that otherwise "work". One of the last messages in it suggests Origami, which (after checking) seems able to handle the PDF in question.
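The asker's observation that temp_pdf held only HTML suggests it is also worth checking whether the bytes handed to the parser are a PDF at all. The sketch below is a rough sanity check, not a validator: it only looks for the "%PDF-" header at the start and an "%%EOF" marker near the end. The helper name and the 1024-byte window are assumptions chosen for illustration.

```ruby
# Hypothetical helper: a rough sanity check, not a PDF validator.
def looks_like_pdf?(bytes)
  # Work on a binary copy so arbitrary byte values cause no encoding errors.
  data = bytes.dup.force_encoding(Encoding::BINARY)
  # Only inspect the tail of the data for the end-of-file marker.
  tail = data[[data.length - 1024, 0].max..-1]
  data.start_with?('%PDF-') && tail.include?('%%EOF')
end

puts looks_like_pdf?("%PDF-1.4\n1 0 obj\nendobj\n%%EOF\n")
puts looks_like_pdf?('<html><head></head><body></body></html>')
```

If the second case is what the test receives, the fix lies in how the file is fetched rather than in which PDF library parses it.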
I have an input file and a batch file. When the batch file is executed using the System command,
a corresponding outfile is generated.
Now I want a particular text (position 350 to 357) from that outfile to be displayed on to my lineedit widget
Here is that part of my code:
system("C:/ORG_Class0178.bat")
Now the outfile will be generated
File.open("C:/ORG_Class0178_out.txt", 'r').each do |line|
var = line[350..357]
puts var
# To test whether the file is being read.
@responseLineEdit = Qt::LineEdit.new(self)
@responseLineEdit.setFont Qt::Font.new("Times New Roman", 12)
@responseLineEdit.resize 100,20
@responseLineEdit.move 210,395
@responseLineEdit.setText("#{var}")
end
When I do test whether the file is being read using puts statement, I get the exact required output in editor. However, the same text is not being displayed on LineEdit. Suggestions are welcome.
EDIT: A weird observation here. It works fine when I read the input file and display it, but it does not work with the output file. The puts statement does print the answer in the editor, confirming that the output file contains the required text. I am confused by this scenario.
There is nothing wrong with the code fragments shown.
Note that var is a local variable. Are the second and third code fragments in the same context? If they are in the same method, and var is not touched in-between, it will work.
If the fragments belong to different methods of the same class, then an instance variable (@var) will solve the problem.
If all that does not help, use Pry to chase the problem; its documentation lists the prerequisites and explains how to use it. Place binding.pry in your code, and your program will stop at that line. Then inspect what your variables are doing.
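The difference between a local and an instance variable can be sketched as follows. The class below is a hypothetical stand-in for the Qt widget class, with no GUI library involved, and the temporary file replaces the real output file; it only illustrates that a value stored in an instance variable by one method is visible to another method of the same object.

```ruby
require 'tempfile'

# Hypothetical class standing in for the Qt widget class.
class ResponseReader
  # Read the file and keep the slice in an instance variable.
  # As in the original loop, the last line read wins.
  def load(path)
    File.foreach(path) do |line|
      @field = line[350..357]
    end
  end

  # A different method can still see @field; a local variable
  # assigned inside load would not be visible here.
  def display_text
    @field.to_s
  end
end

Tempfile.create('out') do |f|
  # 350 filler characters, then the 8 characters at positions 350..357.
  f.write(('x' * 350) + 'RESPONSE' + "\n")
  f.flush
  reader = ResponseReader.new
  reader.load(f.path)
  puts reader.display_text
end
```

Note that line[350..357] returns nil for a line shorter than 351 characters, which is why display_text calls to_s before handing the value on.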
Try 'rb' instead of 'r':
File.open("C:/ORG_Class0178_out.txt", 'rb').each do |line|
  var = line[350..357]
  puts var
end
I'm currently issuing a GET request to the PivotalTracker API to get all of the bugs for a given project, by bug severity. All I really need is a count of the bugs (i.e. 10 critical bugs), but I'm currently getting all of the raw data for each bug in XML format. The XML data has a bug count at the top, but I have to scroll up tons of data to get to that count.
To solve this issue, I'm trying to parse the XML to only display the bug count, but I'm not sure how to do that. I've experimented with Nokogiri and REXML, but it seems like they can only parse actual XML files, not XML from an HTTP GET request.
Here is my code (the access token has been replaced with *'s for security reasons):
require 'net/http'
require 'rexml/document'
prompt = '> '
puts "What is the id of the Project you want to get data from?"
print prompt
project_id = STDIN.gets.chomp()
puts "What type of bugs do you want to get?"
print prompt
type = STDIN.gets.chomp()
def bug(project_id, type)
net = Net::HTTP.new("www.pivotaltracker.com")
request = Net::HTTP::Get.new("/services/v3/projects/#{project_id}/stories?filter=label%3Aqa-#{type}")
request.add_field("X-TrackerToken", "*******************")
net.read_timeout = 10
net.open_timeout = 10
response = net.start do |http|
http.request(request)
end
puts response.code
print response.read_body
end
bug(project_id, type)
Like I said, the GET request is successfully printing a bug count and all of the raw data for each individual bug to my Terminal window, but I only want it to print the bug count.
The API documentation shows the total bug count is an attribute of the XML response's top-level node, stories.
Using Nokogiri as an example, try replacing print response.read_body with
xml = Nokogiri::XML.parse(response.body)
puts "Bug count: #{xml.xpath('/stories/@total')}"
Naturally you'll need to add require 'nokogiri' at the top of your code as well.
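XML parsers work on strings as well as files, so the body of an HTTP response can be passed to them directly. As a sketch of the same idea with REXML, which the question already requires, the code below reads the total attribute from the root node. The sample XML is invented for illustration and only mimics the shape described above; the real string would come from response.body.

```ruby
require 'rexml/document'

# Hypothetical response body; the real one would come from response.body.
xml_string = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <stories type="array" count="2" total="2">
    <story><id>1</id></story>
    <story><id>2</id></story>
  </stories>
XML

# Parse an XML string and return the root node's total attribute as a string.
def total_from(xml_string)
  doc = REXML::Document.new(xml_string)
  doc.root.attributes['total']
end

puts "Bug count: #{total_from(xml_string)}"
```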
I want to scrape data from specific divs on a CarFax report. However, when I search for divs, I always get this weird garbage output.
I tried search(#divId), search(.divClass), and even tried to grab all divs with search('div'). Each time I get similar results: the div's content is partially truncated and the tags are all messed up.
This is the URL I am loading into my agent: https://gist.github.com/atkolkma/8024287
This is the code (user and pass omitted):
require "rubygems"
require "mechanize"
scraper = Mechanize.new
scraper.user_agent_alias = 'Mac Safari'
scraper.follow_meta_refresh = true
scraper.redirect_ok = true
scraper.get("http://www.carfaxonline.com")
form = scraper.page.forms.first
form.j_username = "******"
form.j_password = "*****"
scraper.submit(form)
scraper.get("http://www.carfaxonline.com/api/report?vin=1G1AT58H697144202&track=true")
puts scraper.page.search("#headerBodyType")
This is what the file returns when I run it:
</div>4 DRderBodyType">
What I expect is:
<div id="headerBodyType"> SEDAN 4 DR </div>
The strangest thing is, if I copy the HTML source, save it as a new file, upload it and search it, I get the correct output! I've uploaded the copied HTML to my chevy-pics dot com domain and run the following code:
scraper2 = Mechanize.new
scraper2.get("http://www.chevy-pics.com/test.html")
puts scraper2.page.search("#headerBodyType")
I get this as output, as expected:
<div id="headerBodyType"> SEDAN 4 DR </div>
I can reproduce this by changing the line endings on the file in my editor to Mac OS 9 style, which uses a single \r (carriage return) character. When you use puts on the resulting string, the console returns to the start of the line each time this character is seen, but doesn't start a new line. Each line therefore overwrites what was there before, and you end up with the corrupted output you are seeing.
You should be able to tell if this is the case by using p instead of puts. You should see something like "<div id=\"headerBodyType\">\r SEDAN 4 DR\r </div>" as the output. Notice the \r characters used as newlines.
The actual result you get from the query is correct; it is only displaying it that causes the problem. The solution is probably just to use gsub on the text to convert \r to the more usual \n. I don't know the best place to do this. It might be possible to change the entire document before Mechanize hands it over to Nokogiri for parsing, but I don't know how you'd do that.
You may need to just change any results you get, as a start try:
puts scraper.page.search("#headerBodyType").to_s.gsub("\r", "\n")
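The effect can be tried without Mechanize. The sketch below uses the sample string from the p output described above and a hypothetical helper, normalize_newlines, that also covers Windows-style \r\n pairs so they do not turn into double newlines.

```ruby
# Sample string taken from the p output described above.
raw = "<div id=\"headerBodyType\">\r SEDAN 4 DR\r </div>"

# Hypothetical helper: convert lone \r and \r\n pairs to \n
# so that a terminal prints each part on its own line.
def normalize_newlines(text)
  text.gsub(/\r\n?/, "\n")
end

p raw                         # shows the \r characters explicitly
puts normalize_newlines(raw)  # prints three separate lines
```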
In my application, the user must upload a text document, the contents of which are then parsed by the receiving controller action. I've gotten the document to upload successfully, but I'm having trouble reading its contents.
There are several threads on this issue. I've tried more or less everything recommended on these threads, and I'm still unable to resolve the problem.
Here is my code:
file_data = params[:file]
contents = ""
if file_data.respond_to?(:read)
contents = file_data.read
else
if file_data.respond_to?(:path)
File.open(file_data, 'r').each_line do |line|
elts = line.split
#
#
end
end
end
So here are my problems:
file_data doesn't 'respond_to?' either :read or :path. According to some other threads on the topic, if the uploaded file is less than a certain size, it's interpreted as a string and will respond to :read. Otherwise, it should respond to :path. But in my code, it responds to neither.
If I try to take out the if statements and straight away attempt File.open(file_data, 'r'), I get an error saying that the file wasn't found.
Can someone please help me find out what's wrong?
PS, I'm really sorry that this is a redundant question, but I found the other threads unhelpful.
Are you actually storing the file? Because if you are not, of course it can't be found.
First, find out what you're actually getting for file_data by adding debug output of file_data.inspect. It may be something you don't expect, especially if the form isn't set up correctly (i.e. :multipart => true).
Rails should wrap the uploaded file in a special object that provides a uniform interface, so something as simple as this should work:
file_data.read.each_line do |line|
elts = line.split
#
#
end
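The same pattern can be tried outside Rails with any object that responds to read. In the sketch below a StringIO stands in for the uploaded-file object, and split_lines is a hypothetical helper name; it only shows that reading the whole upload and iterating over its lines needs no file path.

```ruby
require 'stringio'

# Hypothetical helper; file_data is anything that responds to #read,
# such as the uploaded-file object the answer describes.
def split_lines(file_data)
  # Read everything, then split each line on whitespace.
  file_data.read.each_line.map { |line| line.split }
end

# StringIO stands in for the upload in this sketch.
upload = StringIO.new("alpha beta\ngamma delta\n")
p split_lines(upload)
```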