Converting JSONP to JSON using different methods - ruby

I have been trying to use JSONP data as JSON in a Ruby project.
In your experience, how did you address this?

JSONP is easy to handle. It's just JSON in a minor wrapper, and that wrapper is easy to strip off:
require 'open-uri'
require 'json'
URL = 'http://www.google.com/dictionary/json?callback=a&sl=en&tl=en&q=epitome'
# Ruby 3.0+ no longer lets Kernel#open fetch URLs; use URI.open from open-uri.
jsonp = URI.open(URL).read
jsonp now contains the result in JSONP format:
jsonp[0, 3] # => "a({"
jsonp[-11 ... -1] # => "},200,null"
Those extraneous parts, a( and ,200,null), are the trouble spots when passing the data to JSON for parsing, so we strip them.
A simple greedy regex is all that's needed. /{.+}/ finds everything from the first { to the last } and returns it, which is all the JSON parser needs:
data = JSON.parse(jsonp[/{.+}/])
data['query'] # => "epitome"
data['primaries'].size # => 1

From my experience, one way is to use this regex to filter out the function callback name:
/(\{.*\})/m
Or the lazy way is to find the index of the first "(" and take the substring from there up to the last character, which should be a ")".
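The index-based approach can be sketched roughly as below. This is only a sketch under assumptions: jsonp_args is a hypothetical helper name, the sample payload is made up, and the response is assumed to be a single callback call whose name contains no parentheses. Wrapping the callback arguments in [ ] lets the JSON parser cope with responses that pass extra arguments, such as the ,200,null seen earlier.

```ruby
require 'json'

# Hypothetical helper: returns the callback's arguments as a Ruby array.
def jsonp_args(jsonp)
  first = jsonp.index('(')   # end of the callback name
  last  = jsonp.rindex(')')  # closing paren of the call
  raise ArgumentError, 'not a JSONP response' unless first && last && first < last

  # The text between the parens is a comma-separated list of JSON values,
  # so wrapping it in brackets turns it into one JSON array.
  JSON.parse("[#{jsonp[(first + 1)...last]}]")
end

# Made-up sample in the same shape as the response discussed above.
sample = 'a({"query":"epitome","primaries":[{"type":"headword"}]},200,null)'
data, status, _error = jsonp_args(sample)
data['query'] # => "epitome"
status        # => 200
```

For the common case of a single argument, jsonp_args(text).first gives the parsed object.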
I had been looking for answers on here and didn't find a solid one, so I hope this helps.
Cheers

Related

Parsing XML document missing enclosing parent entity

I want to process an XML document that lacks an overarching enclosing entity. (Yes, that's the file I'm given. No, I didn't create it.) For example:
<DeviceInfo>
<Greeting>Crunchy bacon!</Greeting>
</DeviceInfo>
<InstantaneousDemand>
<TimeStamp>0x1c722845</TimeStamp>
</InstantaneousDemand>
<InstantaneousDemand>
<TimeStamp>0x1c72284a</TimeStamp>
</InstantaneousDemand>
When I parse the file using Nokogiri's XML method, it (predictably) only reads the first entity:
>> doc = Nokogiri::XML(File.open("x.xml"))
>> doc.children.count
=> 1
>> doc.text
=> "\n Crunchy bacon!\n"
I could read the file as a string and wrap a fake enclosing entity around the whole thing, but that seems heavy handed. Is there a better way to get Nokogiri to read in all the entities?
You might create a DocumentFragment rather than Document (especially taking into account that your content is actually a document fragment):
▶ doc = Nokogiri::XML::DocumentFragment.parse File.read("x.xml")
#⇒ #<Nokogiri::XML::DocumentFragment:0x14efa38 name="#document-fragment"
# ...
# #<Nokogiri::XML::Element:0x14ef68c name="InstantaneousDemand"
# ...
▶ doc.children.count
#⇒ 6
Hope it helps.

Disable HTML within XML escaping with Nokogiri

I'm trying to parse an XML document from the Google Directions API.
This is what I've got so far:
x = Nokogiri::XML(GoogleDirections.new("48170", "48104").xml)
x.xpath("//DirectionsResponse//route//leg//step").each do |q|
q.xpath("html_instructions").each do |h|
puts h.inner_html
end
end
The output looks like this:
Head &lt;b&gt;south&lt;/b&gt; on &lt;b&gt;Hidden Pond Dr&lt;/b&gt; toward &lt;b&gt;Ironwood Ct&lt;/b&gt;
Turn &lt;b&gt;right&lt;/b&gt; onto &lt;b&gt;N Territorial Rd&lt;/b&gt;
Turn &lt;b&gt;left&lt;/b&gt; onto &lt;b&gt;Gotfredson Rd&lt;/b&gt;
...
I would like the output to be:
Turn <b>right</b> onto <b>N Territorial Rd</b>
The problem seems to be Nokogiri escaping the HTML within the XML.
I trust Google, but I think it would also be good to sanitize it further to:
Turn right onto N Territorial Rd
But I can't (using sanitize perhaps) without the raw xml. Ideas?
Because I don't have the Google Directions API installed I can't access the XML, but I have a strong suspicion the problem is the result of telling Nokogiri you're dealing with XML. As a result it's going to return you the HTML encoded like it should be in XML.
You can unescape the HTML using something like:
require 'cgi'
CGI::unescape_html('Head &lt;b&gt;south&lt;/b&gt; on &lt;b&gt;Hidden Pond Dr&lt;/b&gt; toward &lt;b&gt;Ironwood Ct&lt;/b&gt;')
=> "Head <b>south</b> on <b>Hidden Pond Dr</b> toward <b>Ironwood Ct</b>"
unescape_html is an alias to unescapeHTML:
Unescape a string that has been HTML-escaped
CGI::unescapeHTML("Usage: foo &quot;bar&quot; &lt;baz&gt;")
# => "Usage: foo \"bar\" <baz>"
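To get from the escaped string all the way to the plain text the question asks for, unescaping can be followed by stripping the tags. The snippet below is a rough sketch: the regex tag stripper is my own assumption and is only reasonable for simple, trusted markup like these driving directions; a real sanitizer is a better fit for untrusted input.

```ruby
require 'cgi'

# Escaped text as it appears inside the XML element.
escaped = 'Turn &lt;b&gt;right&lt;/b&gt; onto &lt;b&gt;N Territorial Rd&lt;/b&gt;'

# Step 1: turn the entities back into real markup.
html = CGI.unescapeHTML(escaped)
# => "Turn <b>right</b> onto <b>N Territorial Rd</b>"

# Step 2 (assumption: simple, well-formed tags only): drop anything that looks like a tag.
plain = html.gsub(/<[^>]+>/, '')
# => "Turn right onto N Territorial Rd"
```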
I had to think about this a bit more. It's something I've run into, but it was one of those things that escaped me during the rush at work. The fix is simple: You're using the wrong method to retrieve the content. Instead of:
puts h.inner_html
Use:
puts h.text
I proved this using:
require 'httpclient'
require 'nokogiri'
# This URL comes from: https://developers.google.com/maps/documentation/directions/#XML
url = 'http://maps.googleapis.com/maps/api/directions/xml?origin=Chicago,IL&destination=Los+Angeles,CA&waypoints=Joplin,MO|Oklahoma+City,OK&sensor=false'
clnt = HTTPClient.new
doc = Nokogiri::XML(clnt.get_content(url))
doc.search('html_instructions').each do |html|
puts html.text
end
Which outputs:
Head <b>south</b> on <b>S Federal St</b> toward <b>W Van Buren St</b>
Turn <b>right</b> onto <b>W Congress Pkwy</b>
Continue onto <b>I-290 W</b>
[...]
The difference is that inner_html is reading the content of the node directly, without decoding. text decodes it for you. text, to_str and inner_text are aliased to content internally in Nokogiri::XML::Node for our parsing pleasure.
Wrap your nodes in CDATA:
def wrap_in_cdata(node)
# Using Nokogiri::XML::Node#content instead of #inner_html (which
# escapes HTML entities) so nested nodes will not work
node.inner_html = node.document.create_cdata(node.content)
node
end
Nokogiri::XML::Node#inner_html escapes HTML entities except in CDATA sections.
fragment = Nokogiri::HTML.fragment "<div>Here is an unescaped string: <span>Turn left > right > straight & reach your destination.</span></div>"
puts fragment.inner_html
# <div>Here is an unescaped string: <span>Turn left &gt; right &gt; straight &amp; reach your destination.</span></div>
fragment.xpath(".//span").each {|node| node.inner_html = node.document.create_cdata(node.content) }
fragment.inner_html
# <div>Here is an unescaped string: <span>Turn left > right > straight & reach your destination.</span>\n</div>
This is not a great or DRY solution, but it works:
puts h.inner_html.gsub("&lt;b&gt;", "").gsub("&lt;/b&gt;", "").gsub("&lt;div style=\"font-size:0.9em\"&gt;", "").gsub("&lt;/div&gt;", "")

Ruby: replace a given URL in an HTML string

In Ruby, I want to replace a given URL in an HTML string.
Here is my unsuccessful attempt:
escaped_url = url.gsub(/\//,"\/").gsub(/\./,"\.").gsub(/\?/,"\?")
path_regexp = Regexp.new(escaped_url)
html.gsub!(path_regexp, new_url)
Note: url is actually a Google Chart request URL I wrote, which will not have more special characters than /?|.=%:
The gsub method can take a string or a Regexp as its first argument; the same goes for gsub!. For example:
>> 'here is some ..text.. xxtextxx'.gsub('..text..', 'pancakes')
=> "here is some pancakes xxtextxx"
So you don't need to bother with a regex or escaping at all, just do a straight string replacement:
html.gsub!(url, new_url)
Or better, use an HTML parser to find the particular node you're looking for and do a simple attribute assignment.
I think you're looking for something like:
path_regexp = Regexp.new(Regexp.escape(url))
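A short comparison of the two suggestions, as a sketch: the chart URL, the replacement path, and the img markup are all made up for illustration. Both the plain-string gsub and the Regexp.escape version replace the whole URL, while an unescaped Regexp.new(url) treats ?, | and . as regex operators and matches something else.

```ruby
# Hypothetical values, chosen to contain several regex metacharacters.
url     = 'http://chart.example.com/chart?cht=p3|x.y=50%25'
new_url = '/images/chart.png'
html    = %(<img src="#{url}" alt="chart">)

# A String pattern is matched literally, so no escaping is needed.
by_string = html.gsub(url, new_url)

# Regexp.escape produces a pattern that matches the URL literally.
by_regexp = html.gsub(Regexp.new(Regexp.escape(url)), new_url)

# Without escaping, "?" and "|" change what the pattern means.
naive = html.gsub(Regexp.new(url), new_url)

by_string              # => "<img src=\"/images/chart.png\" alt=\"chart\">"
by_string == by_regexp # => true
by_string == naive     # => false
```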

Ruby Regular Expression: Setting $1 variable in a hash

Everything in this code works properly, except the contents of the $1 variable aren't being properly displayed. According to my tests, all the matching is being done properly, I am just having trouble figuring out how to actually output the contents of $1.
codeTags = {
/\[b\](.+?)\[\/b\]/m => "<strong>#{$1}</strong>",
/\[i\](.+?)\[\/i\]/m => "<em>#{$1}</em>"
}
regexp = Regexp.new(/(#{Regexp.union(codeTags.keys)})/)
message = (message).gsub(/#{regexp}/) do |match|
codeTags[codeTags.keys.select {|k| match =~ Regexp.new(k)}[0]]
end
return message.html_safe
Thank you!
As soon as you do this:
codeTags = {
/\[b\](.+?)\[\/b\]/m => "<strong>#{$1}</strong>",
/\[i\](.+?)\[\/i\]/m => "<em>#{$1}</em>"
}
The #{$1} bits in the values are interpolated using whatever happens to be in $1 at the time. The values will most likely be "<strong></strong>" and "<em></em>" and those aren't very useful.
And regexp is already a regular expression object, so gsub(/#{regexp}/) should be just gsub(regexp). The same applies to the keys of codeTags: they're already regular expression objects, so you don't need Regexp.new(k).
I'd change the whole structure; you're overcomplicating things. Just something simple like this would be fine for only two replacements:
message = message.gsub(/\[b\](.*?)\[\/b\]/) { '<strong>' + $1 + '</strong>' }
message = message.gsub(/\[i\](.*?)\[\/i\]/) { '<em>' + $1 + '</em>' }
If you try to do it all at once you'll have problems with nesting in something like this:
message = 'Where [b]is[/b] pancakes [b]house [i]and[/i] more[/b] stuff?'
You'd end up having to use a recursive gsub and possibly some lambdas if you wanted to properly handle things like that with a single expression.
There are better things to spend your time on than trying to be clever on something like this.
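If the list of tags grows, the one-gsub-per-tag idea can be driven from a small table. This is a sketch rather than a drop-in replacement: BB_TAGS and bbcode_to_html are hypothetical names, and the code assumes the message has already been HTML-escaped elsewhere. Because $1 is read inside the block, it is evaluated after each match, not when the table is built. Running one pass per tag also copes with the nested example above.

```ruby
# Hypothetical tag table: BBCode tag name => HTML element name.
BB_TAGS = { 'b' => 'strong', 'i' => 'em' }.freeze

def bbcode_to_html(message)
  BB_TAGS.reduce(message) do |text, (bb, html)|
    pattern = /\[#{bb}\](.*?)\[\/#{bb}\]/m
    # The block runs once per match, so $1 holds that match's capture.
    text.gsub(pattern) { "<#{html}>#{$1}</#{html}>" }
  end
end

message = 'Where [b]is[/b] pancakes [b]house [i]and[/i] more[/b] stuff?'
bbcode_to_html(message)
# => "Where <strong>is</strong> pancakes <strong>house <em>and</em> more</strong> stuff?"
```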
Response to comments: If you have more bb-tags and some smilies to worry about and several messages per page then you should HTMLify each message when you create it. You could store only the HTML version or both HTML and BB-Code versions if you want the BB-Code stuff around for some reason. This way you'd only pay for the HTMLification once per message and producing your big lists would be nearly free.

Extract URLs from text using Ruby while handling matched parens

URI.extract claims to do this, but it doesn't handle matched parens:
>> URI.extract("text here (http://foo.example.org/bla) and here")
=> ["http://foo.example.org/bla)"]
What's the best way to extract URLs from text without breaking parenthesized URLs (which users like to use)?
If the URLs are always enclosed in parentheses, a regular expression might be a better solution.
text = "text here (http://foo.example.org/bla) and here and here is (http://yet.another.url/with/parens) and some more text"
text.scan(/\(([^\)]*)\)/)
Before using this:
>> URI.extract("text here (http://foo.example.org/bla) and here")
=> ["http://foo.example.org/bla)"]
you need to add this:
require 'uri'
You could use this regexp to extract URLs from a string:
"some thing http://abcd.com/ and http://google.com are great".scan(/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix)
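Another option is to keep URI.extract and trim only the closing parentheses that have no partner inside the URL. The sketch below makes some assumptions: extract_urls is a hypothetical helper name, only http and https URLs are wanted, and a trailing ")" belongs to the URL only when a matching "(" appears earlier in it. Note that URI.extract is marked obsolete in the standard library, although it is still available.

```ruby
require 'uri'

# Hypothetical helper: URI.extract plus a fix-up for unbalanced trailing parens.
def extract_urls(text)
  URI.extract(text, %w[http https]).map do |url|
    # Trim trailing ")" while the URL has more ")" than "(".
    url = url[0...-1] while url.end_with?(')') && url.count(')') > url.count('(')
    url
  end
end

extract_urls('text here (http://foo.example.org/bla) and here')
# => ["http://foo.example.org/bla"]
```

URLs that contain their own balanced parentheses, such as Wikipedia disambiguation links, are left intact by this check.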
