How to scrape data from a website using Nokogiri

How to scrape data from a website using Nokogiri - ruby

When I try to scrape the table data from the following link it displays nothing.. `
I write the following code but it gives nothing..I want the table data i.e last Update, weather, temperature from that link which is i given please help me..
url = "http://w1.weather.gov/xml/current_obs/KM89.xml"
docs = Nokogiri::HTML(open(url))
puts docs.css("table")

Go to that page, open your development tools and when you find the response of the request to KM89.xml under Network tab you'll see that it's not returning HTML, but XML like this one:
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet href="latest_ob.xsl" type="text/xsl"?>
<current_observation version="1.0"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="http://www.weather.gov/view/current_observation.xsd">
<credit>NOAA's National Weather Service</credit>
<credit_URL>http://weather.gov/</credit_URL>
<image>
<url>http://weather.gov/images/xml_logo.gif</url>
<title>NOAA's National Weather Service</title>
<link>http://weather.gov</link>
</image>
<suggested_pickup>15 minutes after the hour</suggested_pickup>
<suggested_pickup_period>60</suggested_pickup_period>
<location>Dexter B Florence Memorial Field Airport, AR</location>
<station_id>KM89</station_id>
<latitude>34.1</latitude>
<longitude>-93.07</longitude>
<observation_time>Last Updated on Nov 23 2012, 7:56 am CST</observation_time>
<observation_time_rfc822>Fri, 23 Nov 2012 07:56:00 -0600</observation_time_rfc822>
<weather>Light Rain</weather>
<temperature_string>57.0 F (13.8 C)</temperature_string>
<temp_f>57.0</temp_f>
<temp_c>13.8</temp_c>
<relative_humidity>87</relative_humidity>
<wind_string>Northeast at 8.1 MPH (7 KT)</wind_string>
<wind_dir>Northeast</wind_dir>
<wind_degrees>30</wind_degrees>
<wind_mph>8.1</wind_mph>
<wind_kt>7</wind_kt>
<pressure_string>1027.5 mb</pressure_string>
<pressure_mb>1027.5</pressure_mb>
<pressure_in>30.30</pressure_in>
<dewpoint_string>52.9 F (11.6 C)</dewpoint_string>
<dewpoint_f>52.9</dewpoint_f>
<dewpoint_c>11.6</dewpoint_c>
<windchill_string>55 F (13 C)</windchill_string>
<windchill_f>55</windchill_f>
<windchill_c>13</windchill_c>
<visibility_mi>10.00</visibility_mi>
<icon_url_base>http://forecast.weather.gov/images/wtf/small/</icon_url_base>
<two_day_history_url>http://www.weather.gov/data/obhistory/KM89.html</two_day_history_url>
<icon_url_name>ra1.png</icon_url_name>
<ob_url>http://www.weather.gov/data/METAR/KM89.1.txt</ob_url>
<disclaimer_url>http://weather.gov/disclaimer.html</disclaimer_url>
<copyright_url>http://weather.gov/disclaimer.html</copyright_url>
<privacy_policy_url>http://weather.gov/notice.html</privacy_policy_url>
</current_observation>
So you can scrape it like this:
require 'open-uri'
require 'nokogiri'
url = 'http://w1.weather.gov/xml/current_obs/KM89.xml'
doc = Nokogiri::HTML(open(url))
p doc.at_css('station_id').text

Related

Ruby nokogiri attribute selector in XML file

this is the xml file:
<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"
xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/">
<SOAP-ENV:Body>
<ns1:putResponse
xmlns:ns1="urn:DmsManagerClient">
<result xsi:type="xsd:string">
<?xml version="1.0" encoding="ISO-8859-1"?>
<MESSAGE ID="11c73b9e-687c-4300-baba-b743c26f7c83" TYPE="CUSDMS">
<DELIVERY>
<FROM>
<SENDER>0072000</SENDER>
<SERVICE>eService</SERVICE>
<DATE>2019-03-08T12:27:25</DATE>
</FROM>
<TO>
<DEALER DEALERCODE="0072000" MARKETCODE="1000"/>
</TO>
</DELIVERY>
<CONTENT>
<dms:ComplexResponse ErrorCode="430" ErrorDescription="null : PrivacyUE Mancante" Return="false"
xmlns:dms="http://dmsmanagerservice">
<dms:Element Name="DMSVERSION">2.7</dms:Element>
</dms:ComplexResponse>
</CONTENT>
</MESSAGE>
</result>
</ns1:putResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
I am coding with Ruby and I used Nokogiri and the method xpath to extrapole the "CONTENT" of the file
this is the code:
def extrapolate_error(xml)
doc = Nokogiri::XML(File.open(xml))
doc.xpath('//CONTENT')
end
and this is the result:
[#<Nokogiri::XML::Element:0x1c5ba78 name="CONTENT" children=[
#<Nokogiri::XML::Text:0x1c5b940 "\n">,
#<Nokogiri::XML::Element:0x1c5b8bc name="ComplexResponse" namespace=#<Nokogiri::XML::Namespace:0x1c5b88c prefix="dms" href="http://dmsmanagerservice">
attributes=[
#<Nokogiri::XML::Attr:0x1c5b874 name="ErrorCode" value="430">,
#<Nokogiri::XML::Attr:0x1c5b868 name="ErrorDescription" value="null : PrivacyUE Mancante">,
#<Nokogiri::XML::Attr:0x1c5b85c name="Return" value="false">]
children=[#<Nokogiri::XML::Text:0x1c5b118 "\n">,
#<Nokogiri::XML::Element:0x1c5b094 name="Element" namespace=#<Nokogiri::XML::Namespace:0x1c5b88c prefix="dms" href="http://dmsmanagerservice">
attributes=[#<Nokogiri::XML::Attr:0x1c5b058 name="Name" value="DMSVERSION">]
children=[#<Nokogiri::XML::Text:0x1c5abe4 "2.7">]>,
#<Nokogiri::XML::Text:0x1c5aaac "\n">]>,
#<Nokogiri::XML::Text:0x1c5a974 "\n">]>]
Now I need to enter in it and select some attributes.
In the specific I need this:
name="ErrorCode" value="430"
name="ErrorDescription" value="null : PrivacyUE Mancante"
I do not know how to procceed. Can you help me?

The following should work for you assuming the dms namespace is always the same
doc.xpath('//CONTENT/dms:ComplexResponse', dms: 'http://dmsmanagerservice')
.xpath('#ErrorCode | #ErrorDescription')
.each_with_object({}) do |e,obj|
obj[e.name] = e.text
end
#=> {"ErrorCode"=>"430", "ErrorDescription"=>"null : PrivacyUE Mancante"}
You already understand how you got to //CONTENT so from there we use dms:ComplexResponse to navigate deeper but since this is namespaced we have to provide the namespace reference e.g. dms: 'http://dmsmanagerservice'.
Then we select the attributes we are interested in #ErrorCode and #ErrorDescription.
In XPath the pipe | means UNION (think AND) so we want to select both.
Then we are just building a Hash using the name as the key and the text as the value.
XPath Cheatsheet - Useful resource if you need additional reference
Update
You asked about conditionals so this is what I would propose
ndoc = Nokogiri::XML(doc)
namespaces = ndoc.collect_namespaces
response = ndoc.xpath("//CONTENT/dms:ComplexResponse", namespaces)
if response.xpath("self::node()[#ErrorCode != '' and #ErrorDescription != '']").any?
response.xpath("#ErrorCode | #ErrorDescription")
.each_with_object({}) do |e,obj|
obj[e.name] = e.text
end
else
response.xpath('dms:Element/#Name | dms:Element/text()',namespaces)
.each_slice(2)
.map {|s| s.map(&:text)}.to_h
end
This checks to see if there is an ErrorCode and and ErrorDescription if so then Hash as originally proposed. If Not then it returns all the dms:Elements as a Hash so {"DMSVERSION"=>"2.7"} in this case Functional Example

Can't address XML attribute thought XPath in Ruby (using Nokogiri)

I'm trying to filter xml file to get nodes with certain attribute. I can successfully filter by node (ex. \top_manager), but when I try \\top_manager[#salary='great'] I get nothing.
<?xml version= "1.0"?>
<employee xmlns="http://www.w3schools.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="employee.xsd">
<top_manager>
<ceo salary="great" respect="enormous" type="extra">
<fname>
Vasya
</fname>
<lname>
Pypkin
</lname>
<hire_date>
19
</hire_date>
<descr>
Big boss
</descr>
</ceo>
<cio salary="big" respect="great" type="intro">
<fname>
Petr
</fname>
<lname>
Pypkin
</lname>
<hire_date>
25
</hire_date>
<descr>
Resposible for information security
</descr>
</cio>
</top_manager>
......
How I need to correct this code to get what I need?
require 'nokogiri'
f = File.open("employee.xml")
doc = Nokogiri::XML(f)
doc.xpath("//top_manager[#salary='great']").each do |node|
puts node.text
end
thank you.

That's because salary is not attribute of <top_manager> element, it is the attribute of <top_manager>'s children elements :
//xmlns:top_manager[*[#salary='great']]
Above XPath select <top_manager> element having any of it's child element has attribute salary equals "great". Or if you meant to select the children (the <ceo> element in this case) :
//xmlns:top_manager/*[#salary='great']

Parsing XML into CSV using Nokogiri

I am trying to figure out how to get Make and Model out of XML returned from a URL and put them into a CSV. Here is the XML returned from the URL:
<VINResult xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://basicvalues.pentondata.com/">
<Vehicles>
<Vehicle>
<ID>131497</ID>
<Product>TRUCK</Product>
<Year>1993</Year>
<Make>Freightliner</Make>
<Model>FLD12064T</Model>
<Description>120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes & Power Steering 6x4 (SBA - Set Back Axle)</Description>
</Vehicle>
<Vehicle>
<ID>131497</ID>
<Product>TRUCK</Product>
<Year>1993</Year>
<Make>Freightliner</Make>
<Model>FLD12064T</Model>
<Description>120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes & Power Steering 6x4 (SBA - Set Back Axle)</Description>
</Vehicle>
</Vehicles>
<Errors/>
<InvalidVINMsg/>
</VINResult>
Here is the code I have so far:
require 'csv'
require 'rubygems'
require 'nokogiri'
require 'open-uri'
vincarriercsv = 'vincarrier.csv'
vindetails = 'vindetails.csv'
vinurl = 'http://redacted/LookUp_VIN?key=redacted&vin='
CSV.open(vindetails, "wb") do |details|
CSV.foreach(vincarriercsv) do |row|
vinxml = Nokogiri::HTML(vinurl + row[1])
make = vinxml.xpath('//VINResult//Vehicles//Vehicle//Make').text
model = vinxml.xpath('//VINResult//Vehicles//Vehicle//Model').text
details << [ row[0], row[1], make, model ]
end
end
For some reason the URL returns the same data twice but I only need the first result. So far my attempts to grab the Make and Model from the XML has failed...any ideas?

Here's how to get at the make and model data. How to convert it to CSV is left to you:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<VINResult xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://basicvalues.pentondata.com/">
<Vehicles>
<Vehicle>
<ID>131497</ID>
<Product>TRUCK</Product>
<Year>1993</Year>
<Make>Freightliner</Make>
<Model>FLD12064T</Model>
<Description>120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes & Power Steering 6x4 (SBA - Set Back Axle)</Description>
</Vehicle>
<Vehicle>
<ID>131497</ID>
<Product>TRUCK</Product>
<Year>1993</Year>
<Make>Freightliner</Make>
<Model>FLD12064T</Model>
<Description>120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes & Power Steering 6x4 (SBA - Set Back Axle)</Description>
</Vehicle>
</Vehicles>
<Errors/>
<InvalidVINMsg/>
</VINResult>
EOT
vehicle_make_and_models = doc.search('Vehicle').map{ |vehicle|
[
'make', vehicle.at('Make').content,
'model', vehicle.at('Model').content
]
}
This results in:
vehicle_make_and_models # => [["make", "Freightliner", "model", "FLD12064T"], ["make", "Freightliner", "model", "FLD12064T"]]
If you don't want the field names:
vehicle_make_and_models = doc.search('Vehicle').map{ |vehicle|
[
vehicle.at('Make').content,
vehicle.at('Model').content
]
}
vehicle_make_and_models # => [["Freightliner", "FLD12064T"], ["Freightliner", "FLD12064T"]]
Note: You have XML, not HTML. Don't assume that Nokogiri treats them the same, or that the difference is insignificant. Nokogiri parses XML strictly, since XML is a strict standard.
I use CSS selectors unless I absolutely have to use XPath. CSS results in a much clearer selector most of the time, which results in easier to read code.
vinxml.xpath('//VINResult//Vehicles//Vehicle//Make').text doesn't work, because // means "start at the top of the document". Each time it's encountered Nokogiri starts at the top, searches down, and finds all matching nodes. xpath returns all matching nodes as a NodeSet, not just a particular Node, and text will return the text of all Nodes in the NodeSet, resulting in a concatenated string of the text, which is probably not what you want.
I prefer to use search instead of xpath or css. It returns a NodeSet like the other two, but it also lets us use either CSS or XPath selectors. If your particular selector was ambiguous and could be interpreted as either CSS or XPath, then you can use the explicit form. Likewise, you can use at or xpath_at or css_at to find just the first matching node, which is equivalent to search('foo').first.

You could also do the following which will place all of the vehicles in an Array and all of the vehicle attributes into a Hash
require 'nokogiri'
doc = Nokogiri::XML(open(YOUR_XML_FILE))
vehicles = doc.search("Vehicle").map do |vehicle|
Hash[
vehicle.children.map do |child|
[child.name, child.text] unless child.text.chomp.strip == ""
end.compact
]
end
#=>[{"ID"=>"131497", "Product"=>"TRUCK", "Year"=>"1993", "Make"=>"Freightliner", "Model"=>"FLD12064T", "Description"=>"120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes Power Steering 6x4 (SBA - Set Back Axle)"}, {"ID"=>"131497", "Product"=>"TRUCK", "Year"=>"1993", "Make"=>"Freightliner", "Model"=>"FLD12064T", "Description"=>"120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes Power Steering 6x4 (SBA - Set Back Axle)"}]
Then you can access all the attributes for an individual vehicle i.e.
vehicles.first["ID"]
#=> "131497"
vehicles.first["Year"]
#=> "1993"
etc.

Parsing an XML feed using Nokogiri isn't working

This is my code:
doc= Nokogiri::HTML(open("http://www.cincinnatisun.com/index.php?rss/90d24f4ad98a2793", 'User-Agent' => 'ruby'))
search=doc.css('item')
if !search.blank?
search.each do |data|
title=data.css("title").text
link=data.css("link").text
end
end
but I did not get the link.

Several things are wrong:
if !search.blank?
won't work because search would be a NodeSet returned by doc.css. NodeSet's don't have a blank? method. Perhaps you meant empty??
title=data.css("title").text
isn't the correct way to find the title because, like in the above problem, you're getting a NodeSet instead of a Node. Getting text from a NodeSet can return a lot of garbage you don't want. Instead do:
title=data.at("title").text
Changing the code to this:
require 'nokogiri'
require 'open-uri'
doc= Nokogiri::HTML(open("http://www.cincinnatisun.com/index.php?rss/90d24f4ad98a2793", 'User-Agent' => 'ruby'))
search=doc.css('item')
if !search.empty?
search.each do |data|
title=data.at("title").text
link=data.at("link").text
puts "title: #{ title } link: #{ link }"
end
end
Outputs:
title: Ex-Bengals cheerleaders lawsuit trial to begin link:
title: Freedom Center Offering Free Admission Monday link:
title: Miami University Band Performing in the Inaugural Parade link:
title: Northern Kentucky Man To Present Colors At Inauguration link:
title: John Gumms Monday Forecast link:
title: President Obama VP Biden sworn in officially begin second terms link:
title: Colerain Township Pizza Hut Robbed Saturday Night link:
title: Cold Snap Coming to Tri-State link:
title: 2 Men Arrested After Police Chase in Northern Kentucky link:
The link won't work because the XML is malformed, which, in my experience, is unbelievably common on the internet because people don't take the time to check their work.
The fix is going to take massaging the XML prior to Nokogiri receiving the content, or to modify your accessors. Luckily, this particular XML is easy to work around so this should help:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.cincinnatisun.com/index.php?rss/90d24f4ad98a2793", 'User-Agent' => 'ruby'))
search = doc.css('item')
if !search.empty?
search.each do |data|
title = data.at("title").text
link = data.at("link").next_sibling.text
puts "title: #{ title } link: #{ link }"
end
end
Which outputs:
title: Ex-Bengals cheerleaders lawsuit trial to begin link: http://www.cincinnatisun.com/index.php/sid/212072454/scat/90d24f4ad98a2793
title: Freedom Center Offering Free Admission Monday link: http://www.cincinnatisun.com/index.php/sid/212072914/scat/90d24f4ad98a2793
title: Miami University Band Performing in the Inaugural Parade link: http://www.cincinnatisun.com/index.php/sid/212072915/scat/90d24f4ad98a2793
title: Northern Kentucky Man To Present Colors At Inauguration link: http://www.cincinnatisun.com/index.php/sid/212072913/scat/90d24f4ad98a2793
title: John Gumms Monday Forecast link: http://www.cincinnatisun.com/index.php/sid/212070535/scat/90d24f4ad98a2793
title: President Obama VP Biden sworn in officially begin second terms link: http://www.cincinnatisun.com/index.php/sid/212060033/scat/90d24f4ad98a2793
title: Colerain Township Pizza Hut Robbed Saturday Night link: http://www.cincinnatisun.com/index.php/sid/212057132/scat/90d24f4ad98a2793
title: Cold Snap Coming to Tri-State link: http://www.cincinnatisun.com/index.php/sid/212057131/scat/90d24f4ad98a2793
title: 2 Men Arrested After Police Chase in Northern Kentucky link: http://www.cincinnatisun.com/index.php/sid/212057130/scat/90d24f4ad98a2793
All that done, you can write your code more clearly like:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.cincinnatisun.com/index.php?rss/90d24f4ad98a2793", 'User-Agent' => 'ruby'))
doc.css('item').each do |data|
title = data.at("title").text
link = data.at("link").next_sibling.text
puts "title: #{ title } link: #{ link }"
end
Interestingly enough, now the sample page appears to have its links fixed.

According to http://nokogiri.org/tutorials/searching_a_xml_html_document.html something like:
#doc = Nokogiri::XML(File.read("feed.xml"))
#doc.xpath('//xmlns:link')
should do the job. But be aware, that your provided xml snippet isn't a valid xml feed at all (no root element, item tag not opened - only closed etc.). The code assumes the xml feed looks i.e.
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<item>
<title>Atom-Powered Robots Run Amok</title>
<link>http://example.org/2003/12/13/atom03</link>
</item>
</feed>
And extracts:
<link>http://example.org/2003/12/13/atom03</link>
as result. Please try to to look at the documentation/reference first, if you have problems like this. If you tried something and it didn't worked like you would expect, than you can consult stackoverflow with actual code - that makes it easier to understand your problem & provide help.

Trying to parse a XML using Nokogiri with Ruby

I am new to programming so bear with me. I have an XML document that looks like this:
File name: PRIDE1542.xml
<ExperimentCollection version="2.1">
<Experiment>
<ExperimentAccession>1015</ExperimentAccession>
<Title>**Protein complexes in Saccharomyces cerevisiae (GPM06600002310)**</Title>
<ShortLabel>GPM06600002310</ShortLabel>
<Protocol>
<ProtocolName>**None**</ProtocolName>
</Protocol>
<mzData version="1.05" accessionNumber="1015">
<cvLookup cvLabel="RESID" fullName="RESID Database of Protein Modifications" version="0.0" address="http://www.ebi.ac.uk/RESID/" />
<cvLookup cvLabel="UNIMOD" fullName="UNIMOD Protein Modifications for Mass Spectrometry" version="0.0" address="http://www.unimod.org/" />
<description>
<admin>
<sampleName>**GPM06600002310**</sampleName>
<sampleDescription comment="Ho, Y., et al., Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002 Jan 10;415(6868):180-3.">
<cvParam cvLabel="NEWT" accession="4932" name="Saccharomyces cerevisiae (Baker's yeast)" value="Saccharomyces cerevisiae" />
</sampleDescription>
</admin>
</description>
<spectrumList count="0" />
</mzData>
</Experiment>
</ExperimentCollection>
I want to take out the text in between <Title>, <ProtocolName>, and <SampleName> and put into a text file (I tried bolding them to making it easier to see). I have the following code so far (based on posts I saw on this site), but it seems not to work:
>> require 'rubygems'
>> require 'nokogiri'
>> doc = Nokogiri::XML(File.open("PRIDE_Exp_Complete_Ac_10094.xml"))
>> #ExperimentCollection = doc.css("ExperimentCollection Title").map {|node| node.children.text }
Can someone help me?

Try to access them using xpath expressions. You can enter the path through the parse tree using slashes.
puts doc.xpath( "/ExperimentCollection/Experiment/Title" ).text
puts doc.xpath( "/ExperimentCollection/Experiment/Protocol/ProtocolName" ).text
puts doc.xpath( "/ExperimentCollection/Experiment/mzData/description/admin/sampleName" ).text

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

How to scrape data from a website using Nokogiri - ruby

Related

Ruby nokogiri attribute selector in XML file

Can't address XML attribute thought XPath in Ruby (using Nokogiri)

Parsing XML into CSV using Nokogiri

Parsing an XML feed using Nokogiri isn't working

Trying to parse a XML using Nokogiri with Ruby

Categories

Resources