Extract Image from Instagram using Nokogiri? - ruby

I'm trying to extract an image from Instagram using Nokogiri. I've tried so many things that I don't even think it's a good idea to show what I've done so far.
I'm starting with:
image_url = Nokogiri::HTML(open('http://instagram.com/p/g3mXJ1p109/'))
And I've noticed the picture on Instagram.com is in the following div:
<div class="Image iLoaded iWithTransition Frame" src="http://distilleryimage9.ak.instagram.com/b711daf4508c11e385ff1234c61f9f0f_8.jpg"></div>
Ok, just one thing I've tried:
Nokogiri::HTML(open(pic)).css('body script').children.first
and it gives me this:
#<Nokogiri::XML::CDATA:0x767c904 "\nwindow._csrf_token = '88b78a58e333056bcc67e338f06ce786';\nwindow._jscalls = [\n\n['bluebar', 'init', []],\n\n['framework/config', 'init', [{staticRoot: '//d36xtkk24g8jdx.cloudfront.net/bluebar/89c8068'}]],\n\n [\"lib\\/fullpage\\/transitions\",\"bootstrap\",[{\"componentName\":null,\"moduleName\":\"lib\\/ui\\/media\\/DesktopPPage\",\"props\":{\"viewer\":null,\"shortcode\":\"g3mXJ1p109\",\"prerelease\":false,\"staticRoot\":\"\\/\\/d36xtkk24g8jdx.cloudfront.net\\/bluebar\\/89c8068\",\"media\":{\"code\":\"g3mXJ1p109\",\"comments\":{\"nodes\":[]},\"date\":1384805104.0,\"likes\":{\"count\":0,\"viewer_has_liked\":false,\"nodes\":[]},\"owner\":{\"username\":\"marunbai\",\"requested_by_viewer\":false,\"profile_pic_url\":\"http:\\/\\/images.ak.instagram.com\\/profiles\\/anonymousUser.jpg\",\"id\":\"549577518\",\"followed_by_viewer\":false},\"is_video\":false,\"id\":\"592110592901733693\",\"display_src\":\"http:\\/\\/distilleryimage9.ak.instagram.com\\/b711daf4508c11e385ff1234c61f9f0f_8.jpg\"}}}]],\n\n];\n">

Nokogiri does not execute JavaScript. The div you see in the browser is built by a script after the page loads, so it is not in the HTML that Nokogiri receives.
However, the page does contain a <meta> element with the image source. You can extract the desired value from there:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('http://instagram.com/p/g3mXJ1p109/')) # Kernel#open no longer accepts URLs in Ruby 3.0+
p doc.at_css('meta[property="og:image"]')['content']
#=> "http://distilleryimage9.ak.instagram.com/b711daf4508c11e385ff1234c61f9f0f_8.jpg"

Related

Excluding contents of <span> from text using Watir

Watir
mytext = browser.element(:xpath => '//*[@id="gold"]/div[1]/h1').text
Html
<h1>
This is the text I want
<span> I do not want this text </span>
</h1>
When I run my Watir code, it selects all the text, including what is in the spans. How do I just get the text "This is the text I want", and no span text?
If your HTML is more complicated, I find it can be easier to deal with this using Nokogiri, as it provides more methods for parsing the HTML:
require 'nokogiri'
h1 = browser.element(:xpath => '//*[@id="gold"]/div[1]/h1')
doc = Nokogiri::HTML.fragment(h1.html)
mytext = doc.at('h1').children.select(&:text?).map(&:text).join.strip
Ideally start by trying to avoid using XPath. One of the most powerful features of Watir is the ability to create complicated locators without XPath syntax.
The issue is that calling text on a node gets all content within that node. You'd need to do something like:
top_level = browser.element(id: 'gold')
h1_text = top_level.h1.text
span_text = top_level.h1.span.text
desired_text = h1_text.chomp(span_text)
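One caveat: chomp only removes a suffix, so this works when the span comes last in the h1. A quick sketch with plain strings standing in for the Watir calls (the string values are assumptions based on the question's HTML):

```ruby
# Stand-ins for top_level.h1.text and top_level.h1.span.text (assumptions).
h1_text   = "This is the text I want I do not want this text"
span_text = "I do not want this text"

# chomp strips the span text from the end; strip drops the leftover space.
desired_text = h1_text.chomp(span_text).strip

# If the span text is not at the end, chomp leaves the string untouched.
unchanged = "I do not want this text This is the text I want".chomp(span_text)

puts desired_text
```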
This is useful for getting only the top-level text.
If there is only one h1, you can omit the id:
@b.h1.text.remove(@b.h1.children.collect(&:text).join(' '))
Or specify it if there are more (String#remove comes from ActiveSupport; in plain Ruby use sub or gsub with an empty replacement):
@b.h1(id: 'gold').text.remove(@b.h1(id: 'gold').children.collect(&:text).join(' '))
Make it a method and call it from your script with get_top_text(@b.h1):
def get_top_text(el)
  # Strip the children's combined text from the element's own full text
  el.text.chomp(el.children.collect(&:text).join(' '))
end

How to search two paths but get the results in order using Nokogiri

I am trying to search for elements with prefix w and also t or br using Nokogiri.
For example if this is the core of the doc returned from parsing:
<w:t></w:t><w:br></w:br><w:t></w:t>
This search
doc.search('.//w:t','.//w:br')
Results in:
['<w:t></w:t>','<w:t></w:t>','<w:br></w:br>']
Instead I want (the elements are in the order of the original doc):
['<w:t></w:t>','<w:br></w:br>','<w:t></w:t>']
Using CSS selectors you can do this:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<xml>
<t></t><br></br><t></t>
</xml>
EOT
doc.search('t, br')
# => [#<Nokogiri::XML::Element:0x3c name="t">, #<Nokogiri::XML::Element:0x50 name="br">, #<Nokogiri::XML::Element:0x64 name="t">]
doc.search('t, br').map(&:to_html)
# => ["<t></t>", "<br>", "<t></t>"]
CSS selectors are recommended by Nokogiri's authors because they're generally easier and less noisy.
Using XPath, this'd work:
doc.search('//t | //br')
# => [#<Nokogiri::XML::Element:0x3c name="t">, #<Nokogiri::XML::Element:0x50 name="br">, #<Nokogiri::XML::Element:0x64 name="t">]
doc.search('//t | //br').map(&:to_html)
# => ["<t></t>", "<br>", "<t></t>"]
However, your XML has namespaces, and you didn't show us the appropriate namespace declaration so that's left for you to figure out.
See Nokogiri's Namespaces documentation for more information.
Thanks to the Tin Man's response, the answer I was looking for is this
doc.search('.//w:t | .//w:br')

How to find an element's text in Capybara while ignoring inner element text

In the HTML example below I am trying to grab the $16.95 text in the outer span.sale element and exclude the text from the inner span.sale-text one.
<div class="price">
<span class="sale">
<span class="sale-text">"Low price!"</span>
"$16.95"
</span>
</div>
If I was using Nokogiri this wouldn't be too difficult.
price = doc.css('.sale')
price.search('.sale-text').remove
price.text
However, Capybara navigates rather than removes nodes. I knew something like price.text would grab text from all sub elements, so I tried to use XPath to be more specific: p.find(:xpath, "//span[@class='sale']", :match => :first).text. However, this grabs text from the inner element as well.
Finally, I tried looping through all spans to see if I could separate the results but I get an Ambiguous error.
p.find(:css, 'span').each { |result| puts result.text }
Capybara::Ambiguous: Ambiguous match, found 2 elements matching css "span"
I am using Capybara/Selenium as this is for a web scraping project with authentication complications.
There is no single-statement way to do this with Capybara, since the DOM's concept of innerText doesn't really support what you want to do. Assuming p is the '.price' element, two ways you could get what you want are as follows:
Since you know the node you want to ignore just subtract that text from the whole text
p.find('span.sale').text.sub(p.find('span.sale-text').text, '')
Grab the innerHTML string and parse that with Nokogiri or Capybara.string (which just wraps Nokogiri elements in the Capybara DSL)
doc = Capybara.string(p['innerHTML'])
nokogiri_fragment = doc.native
#do whatever you want with the nokogiri fragment

select a word in a text blob in ruby based on a pattern

I have a text blob and I would like to select URLs based on whether they contain .png or .jpg. I would like to select the entire word based on a pattern.
For example in this blob:
width='17'></a> <a href='http://click.e.groupon.com/? qs=94bee0ddf93da5b3903921bfbe17116f859915d3a978c042430abbcd51be55d8df40eceba3b1c44e' style=\"text-decoration: none;\">\n<img alt='Facebook' border='0' height='18' src='http://s3.grouponcdn.com/email/images/gw-email/facebook.jpg' style='display: i
I'd like to select the image:
http://s3.grouponcdn.com/email/images/gw-email/facebook.jpg
Can I use nokogiri on an html text blob?
Using Nokogiri and XPath:
frag = Nokogiri::HTML.fragment(str) # Don't construct an entire HTML document
images = frag.xpath('.//img/@src').map(&:text).grep(/\.(png|jpg|jpeg)\z/)
The XPath says:
.// - anywhere in this fragment
img - find all the <img> elements
/@src - now find the src attribute of each
Then we:
map(&:text) - convert each Nokogiri::XML::Attr to the value of the attribute.
grep - keep only those strings in the array that end with the appropriate extension.
Yes, you can use nokogiri, and you should!
Here's a simple snippet:
require "nokogiri"
str = "....your blob"
html_doc = Nokogiri::HTML(str)
html_doc.css("img").map { |e| e["src"] }.compact.select { |src| src.include?(".png") || src.include?(".jpg") }
If you only want to find URLs ending in .jpg or .png, a pattern like this should do it:
https?:\/\/.*?\.(?:jpg|png)
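Be aware that .*? can run across quotes and spaces, so on a single-line blob a match may start at an earlier URL and end at the image extension. A slightly stricter sketch with String#scan follows. The character class and the shortened blob are my own assumptions, not part of the answer above:

```ruby
# Shortened sample blob (assumption).
blob = %q{width='17'></a> <a href='http://click.e.groupon.com/?qs=94bee0' style="text-decoration: none;"><img alt='Facebook' border='0' height='18' src='http://s3.grouponcdn.com/email/images/gw-email/facebook.jpg' style='display: i}

# A URL may not contain whitespace, quotes, or angle brackets,
# so a match cannot spill from one attribute into the next.
IMAGE_URL = %r{https?://[^\s'"<>]+\.(?:png|jpe?g)\b}i

urls = blob.scan(IMAGE_URL)
puts urls.inspect
```

Because the pattern has no capture groups, scan returns the full matches as a flat array of strings.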

Hpricot: How to extract inner text without other html subelements

I'm working on a vim rspec plugin (https://github.com/skwp/vim-rspec) - and I am parsing some html from rspec. It looks like this:
doc = %{
<dl>
<dt id="example_group_1">This is the heading text</dt>
Some puts output here
</dl>
}
I can get the entire inner HTML of the <dl> using:
(Hpricot.parse(doc)/:dl).first.inner_html
I can get just the dt by using
(Hpricot.parse(doc)/:dl).first/:dt
But how can I access the "Some puts output here" area? If I use inner_html, there is way too much other junk to parse through. I've looked through hpricot docs but don't see an easy way to get essentially the inner text of an html element, disregarding its html children.
I ended up figuring out a route by myself, by manually parsing the children:
(@context/"dl").each do |dl|
  dl.children.each do |child|
    if child.is_a?(Hpricot::Elem) && child.name == 'dd'
      # do stuff with the element
    elsif child.is_a?(Hpricot::Text)
      # bare text sitting directly inside the <dl>
      text = child.to_s.strip
      puts text unless text.empty?
    end
  end
end
Note that this is bad HTML you have there. If you have control over it, you should wrap the content you want in a <dd>.
In XML terms what you are looking for is the TextNode following the <dt> element. In my comment I showed how you can select this node using XPath in Nokogiri.
However, if you must use Hpricot, and cannot select text nodes using it, then you could hack this by getting the inner_html and then stripping out the unwanted <dt> (the pattern has to allow for attributes such as id on the opening tag):
(Hpricot.parse(doc)/:dl).first.inner_html.sub(%r{<dt\b[^>]*>.+?</dt>}m, '').strip
