"syntax error, unexpected tIDENTIFIER, expecting $end" - ruby

I put together this script based on this tutorial.
require 'nokogiri'
require 'open-uri'
url = "http://sfbay.craigslist.org/sby/jjj/"
data = Nokogiri::HTML(open(url))
puts data.at_css('.itempn').text
puts data.at_css('.itemcg').text
I keep getting this error:
Macintosh:nokogiri rgrush$ ruby aaa.rb
aaa.rb:1: syntax error, unexpected tIDENTIFIER, expecting $end
url = "http://sf...
^
Any ideas? Could it be that one of my dependencies is out of date?

Most likely you have a non-ASCII character in the URL.
Try adding
# encoding: UTF-8
as the first line of aaa.rb,
so it will look like:
# encoding: UTF-8
require 'nokogiri'
require 'open-uri'
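To check whether a stray non-ASCII character (a curly quote pasted from a tutorial, for example) is really in the file, something like the following could help. This is only a sketch: non_ascii_lines is a made-up helper name and the sample string is invented for illustration.

```ruby
# Hypothetical helper: return [line_number, line] pairs for every line
# that contains at least one non-ASCII byte.
def non_ascii_lines(source)
  source.b.each_line.with_index(1)
        .reject { |line, _number| line.ascii_only? }
        .map { |line, number| [number, line.chomp] }
end

# Invented sample: line 2 uses curly quotes instead of straight ones.
sample = "require 'open-uri'\nurl = \u201Chttp://example.org\u201D\n"

non_ascii_lines(sample).each do |number, line|
  puts "line #{number}: #{line.inspect}"
end

# For a real file, read it as raw bytes first:
#   non_ascii_lines(File.binread('aaa.rb'))
```

If this reports a line, retyping the quotes on that line by hand is usually simpler than changing the file encoding.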

Related

I am getting (eval):1: invalid Unicode codepoint error while trying to scrape instagram

I am trying to scrape data from Instagram. Here is my code:
require 'open-uri'
require 'nokogiri'
require 'json'
require "unicode/emoji"
def get_html
  url = 'https://www.instagram.com/muriithi_kabogo/'
  html = open(url)
end
def pass_data
  html = get_html
  doc = Nokogiri::HTML(html)
end
def get_data
  profiles = []
  body = pass_data.at('body')
  script = body.at('script').text
  myText = script
  json_object_data = eval(myText)
end
get_data()
When I try to convert the text to JSON, I get an error:
(eval):1: invalid Unicode codepoint (SyntaxError)
usinessmen #beautiful #smile\ud83d\ude0a #teambringit #shebr
How do I move past this error?
JSON, like JavaScript, escapes characters outside the Basic Multilingual Plane as UTF-16 surrogate pairs (such as \ud83d\ude0a), which Ruby chokes on when the text is evaluated as Ruby source.
Do not use eval. For one thing, Ruby will reject \ud83d\ude0a as invalid codepoints, as it should; for another, it is a security hole; and lastly, it slows down your code.
Use JSON.parse, which is safer, faster, and knows how to deal with surrogate pairs:
require 'json'
json_str = '"usinessmen #beautiful #smile\ud83d\ude0a #teambringit #shebr"'
JSON.parse(json_str)
# => "usinessmen #beautiful #smile😊 #teambringit #shebr"
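A small sketch of the same idea with error handling, in case the scraped script text turns out not to be JSON at all. The parse_json helper and both sample strings are made up for illustration; they are not taken from the Instagram page.

```ruby
require 'json'

# Hypothetical helper: parse text as JSON, returning nil instead of raising
# when the text is not valid JSON. Nothing in the text is ever executed.
def parse_json(text)
  JSON.parse(text)
rescue JSON::ParserError => e
  warn "not valid JSON: #{e.message}"
  nil
end

# Single quotes keep \ud83d\ude0a as a literal escape for JSON.parse to decode.
data = parse_json('{"caption": "#smile\ud83d\ude0a", "likes": 3}')
puts data['caption']

# Ruby code that eval would have run is simply rejected here.
puts parse_json('system("echo pwned")').inspect
```

The second call shows the security difference: eval would run the command, while JSON.parse only reports a parse error.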

How to search in page body and force an encoding conversion

I have pretty simple code:
require 'rubygems'
require 'mechanize'
URL = 'http://yandex.ru'
agent = Mechanize.new
page = agent.get(URL)
# page.encoding => UTF-8
# page.body.encoding => ASCII-8BIT
page.body.include?("Карты")
And on the last line of that code Ruby returned an error:
in `include?': incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
Solutions from "How to get Mechanize to auto-convert body to UTF8?" don't help. What should I do to fix it?
You can use the force_encoding method like this:
agent.page.body.force_encoding('utf-8')
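A minimal sketch of what force_encoding changes, using a hand-built binary string in place of a real Mechanize response (the sample text and the utf8_body helper are made up for illustration). Note that force_encoding only relabels the bytes without converting them, so it helps only when the bytes really are UTF-8.

```ruby
# Stand-in for page.body: UTF-8 bytes labelled as ASCII-8BIT (binary).
body = "Яндекс Карты".b

begin
  body.include?("Карты")
rescue Encoding::CompatibilityError => e
  puts "before: #{e.class}"
end

# Hypothetical helper: relabel a copy of the bytes as UTF-8 and, if some
# bytes are not valid UTF-8, replace them rather than fail later.
def utf8_body(raw)
  text = raw.dup.force_encoding(Encoding::UTF_8)
  text.valid_encoding? ? text : text.scrub
end

puts utf8_body(body).include?("Карты")
```

Working on a copy leaves the original body untouched, which may matter if other code still expects the raw bytes.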

Ruby Escaping Arguments Inside Backticks Shell

I want to pass a Ruby variable holding an HTML file, which I grabbed with open-uri and Nokogiri, into a backtick shell command to tidy it up. The content of the variable is confusing the shell. I think I need to escape it, but I am not sure. Any advice appreciated.
require 'open-uri'
require 'nokogiri'
url = 'http://www.wikihow.com/Bathe-a-Cat'
page = Nokogiri::HTML(open(url))
pagestring = page.to_s
result = `tidy --break-before-br no --char-encoding utf8 --clean yes --drop-empty-paras yes ' #{pagestring}'`
puts result.length
Here is the error I get:
sh: -c: line 144: syntax error near unexpected token `"Search","Search","Custom_search"'
sh: -c: line 144: ` <input type="submit" id="cse_sa" value="Search" class="search_button" onmouseover="button_swap(this);" onmouseout="button_unswap(this);" onclick='gatTrack("Search","Search","Custom_search");'>'
Cheers
You might want to use IO.popen instead. Then you can invoke the command with an array instead of stringifying it:
cmd = %w{ tidy --break-before-br no --char-encoding utf8 --clean yes --drop-empty-paras yes }
result = IO.popen(cmd, 'r+') {|io|
  io.puts pagestring
  io.close_write
  io.read
}
assuming tidy reads HTML from stdin.
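Here is a sketch of the same "command as an array, HTML on stdin" idea that can be run without tidy installed. Two assumptions to flag: cat stands in for tidy, and Open3.capture2 from the standard library is swapped in for IO.popen, since it handles writing stdin and reading stdout together. The pipe_through name is made up.

```ruby
require 'open3'

# Hypothetical helper: run cmd (an array of program and arguments) and send
# input to its stdin. Returns whatever the program wrote to stdout.
def pipe_through(cmd, input)
  output, _status = Open3.capture2(*cmd, stdin_data: input)
  output
end

# The kind of markup that broke the backtick version: mixed quote characters.
html = %q{<input onclick='gatTrack("Search","Search","Custom_search");'>}

# cat is only a stand-in; with tidy the array would hold tidy and its options.
puts pipe_through(['cat'], html)
```

Because the HTML never appears on a command line, its quotes do not need escaping at all.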
Instead of dumping all that HTML onto the command line, why not make a file?
require 'open-uri'
require 'nokogiri'
require 'tempfile'
url = 'http://www.wikihow.com/Bathe-a-Cat'
page = Nokogiri::HTML(open(url))
pagestring = page.to_s
file = Tempfile.new('blah')
file.write(pagestring)
file.close
result = `tidy --break-before-br no --char-encoding utf8 --clean yes --drop-empty-paras yes #{file.path}`
puts result.length
file.unlink
Seems to work with a quick test here...
For normal arguments such as file paths, you could use "str".shellescape from Shellwords (http://apidock.com/ruby/Shellwords/shellescape).
require 'shellwords'
args_array = [ ... ]
`tidy #{args_array.map(&:shellescape).join(' ')}`
However, to pass a complete HTML file as a command-line argument, something like what was suggested above might be better. I just thought I'd mention this here for reference, for others with normal CLI arguments.
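To see what shellescape does to an awkward argument, here is a small sketch. The argument list, including the file name, is invented for illustration, and the command is only built and printed, not run.

```ruby
require 'shellwords'

# Invented arguments; the last one contains spaces and a single quote.
args = ['--char-encoding', 'utf8', "my file's name.html"]

# Escape each argument separately, then join with spaces.
command = "tidy #{args.map(&:shellescape).join(' ')}"
puts command

# Splitting the command the way a shell would gives back the original words.
puts Shellwords.split(command).inspect
```

Shellwords.join(args) should build the same escaped argument string in one call.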

hpricot-invalid byte sequence in UTF-8

I have already done some searching, but nothing I found solves this peculiar, unexpected problem.
Just look at the code below:
require 'open-uri'
require 'hpricot'
doc = Hpricot(open("http://www.baidu.com/")) #this web page's encoding is GB2312
I don't know what's going on here. You can try this in irb to see if you get the same problem.
It just raises "ArgumentError: invalid byte sequence in UTF-8".
I have tried converting the original HTML to UTF-8 with Iconv, but it still doesn't work.
I really don't know what to do now. Please help.
Hpricot - UTF-8 issues
invalid byte sequence in UTF-8 (ArgumentError)
require 'hpricot'
require 'open-uri'
doc = open('http://www.amazon.co.jp/') {|f| Hpricot(f.read) }
puts doc.to_html
open('http://www.amazon.co.jp/') {|f| Hpricot(f.read.encode("UTF-8")) }
I know how it could work with Net::HTTP (Ruby 1.9.2):
require 'net/http'
require 'uri'
url = URI.parse('http://www.baidu.com')
res = Net::HTTP.start(url.host, url.port) {|http|
  http.get('/')
}
str = res.body.force_encoding('GB2312')
puts str
puts str.encoding.name # => GB2312
Does that help?
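If the goal is a UTF-8 string rather than a GB2312-labelled one, the relabel step can be followed by a real conversion. A sketch without any network access: the four bytes below are "百度" as a GB2312 page would send them, the to_utf8 helper is a made-up name, and GBK is used as the label on the assumption that it is acceptable here because it is a superset of GB2312.

```ruby
# Hypothetical helper: label raw bytes with their real encoding, then
# transcode a copy to UTF-8.
def to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end

# Stand-in for res.body: "百度" encoded as GB2312, arriving as binary.
raw = "\xB0\xD9\xB6\xC8".b

text = to_utf8(raw, 'GBK')
puts text
puts text.encoding.name
```

force_encoding alone would leave the bytes as they are; encode is the step that rewrites them as UTF-8.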

quotes issue (ruby)

Any idea how I can pass the correct argument to xpath? It must be something about how to use single and double quotes. When I use the variable
parser_xpath_identificator = "'//table/tbody[@id=\"threadbits_forum_251\"]/tr'" it gives me an incorrect value, and
parser_xpath_identificator = "'//table/tbody[@id="threadbits_forum_251"]/tr'" gives me the error: syntax error, unexpected tIDENTIFIER, expecting $end
require 'rubygems'
require 'mechanize'
parser_xpath_identificator = "'//table/tbody[@id=\"threadbits_forum_251\"]/tr'"
# parser_xpath_identificator = "'//table/tbody[@id="threadbits_forum_251"]/tr'"
#gives an error: syntax error, unexpected tIDENTIFIER, expecting $end
agent = WWW::Mechanize.new
page = agent.get("http://www.vbulletin.org/forum/index.php")
page = page.link_with(:text=>'vB4 General Discussions').click
puts "Page title: #{page.title}"
puts "\nfrom variable: #{page.parser.xpath(parser_xpath_identificator).length}"
puts "directly: #{page.parser.xpath('//table/tbody[@id="threadbits_forum_251"]/tr').length}"
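In both cases you're nesting single quotes directly inside double quotes, which I don't think is correct: the inner single quotes become part of the XPath string itself, and in the second version the unescaped double quotes end the Ruby string early, which causes the syntax error. Try this:
parser_xpath_identificator = '//table/tbody[@id="threadbits_forum_251"]/tr'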
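A quick way to see the difference is to print the strings. This sketch uses plain Ruby only (no Mechanize needed); the variable names are made up.

```ruby
# Double-quoted Ruby string that also contains literal single quotes.
wrapped = "'//table/tbody[@id=\"threadbits_forum_251\"]/tr'"

# Single-quoted Ruby string: the quotes delimit it and are not part of it.
plain = '//table/tbody[@id="threadbits_forum_251"]/tr'

# %q{} is another option when the text contains both kinds of quotes.
alternate = %q{//table/tbody[@id="threadbits_forum_251"]/tr}

puts wrapped
puts plain
puts wrapped[1..-2] == plain
puts alternate == plain
```

The first line printed starts and ends with a ' character, so xpath receives a quoted string literal instead of a path expression.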

Resources