Extracting value from complex hash in Ruby

I am using an API (zillow) which returns a complex hash. A sample result is
{"xmlns:xsi"=>"http://www.w3.org/2001/XMLSchema-instance",
"xsi:schemaLocation"=>"http://www.zillow.com/static/xsd/SearchResults.xsd http://www.zillowstatic.com/vstatic/5985ee4/static/xsd/SearchResults.xsd",
"xmlns:SearchResults"=>"http://www.zillow.com/static/xsd/SearchResults.xsd", "request"=>[{"address"=>["305 Vinton St"], "citystatezip"=>["Melrose, MA 02176"]}],
"message"=>[{"text"=>["Request successfully processed"], "code"=>["0"]}],
"response"=>[{"results"=>[{"result"=>[{"zpid"=>["56291382"], "links"=>[{"homedetails"=>["http://www.zillow.com/homedetails/305-Vinton-St-Melrose-MA-02176/56291382_zpid/"],
"graphsanddata"=>["http://www.zillow.com/homedetails/305-Vinton-St-Melrose-MA-02176/56291382_zpid/#charts-and-data"], "mapthishome"=>["http://www.zillow.com/homes/56291382_zpid/"],
"comparables"=>["http://www.zillow.com/homes/comps/56291382_zpid/"]}], "address"=>[{"street"=>["305 Vinton St"], "zipcode"=>["02176"], "city"=>["Melrose"], "state"=>["MA"], "latitude"=>["42.466805"],
"longitude"=>["-71.072515"]}], "zestimate"=>[{"amount"=>[{"currency"=>"USD", "content"=>"562170"}], "last-updated"=>["06/01/2014"], "oneWeekChange"=>[{"deprecated"=>"true"}], "valueChange"=>[{"duration"=>"30", "currency"=>"USD", "content"=>"42749"}], "valuationRange"=>[{"low"=>[{"currency"=>"USD",
"content"=>"534062"}], "high"=>[{"currency"=>"USD", "content"=>"590278"}]}], "percentile"=>["0"]}], "localRealEstate"=>[{"region"=>[{"id"=>"23017", "type"=>"city",
"name"=>"Melrose", "links"=>[{"overview"=>["http://www.zillow.com/local-info/MA-Melrose/r_23017/"], "forSaleByOwner"=>["http://www.zillow.com/melrose-ma/fsbo/"],
"forSale"=>["http://www.zillow.com/melrose-ma/"]}]}]}]}]}]}]}
I can extract a specific value using the following:
result = result.to_hash
p result["response"][0]["results"][0]["result"][0]["zestimate"][0]["amount"][0]["content"]
It seems odd to have to specify the index of each element in this fashion. Is there a simpler way to obtain a named value?

It looks like this should be parsed as XML. According to the Zillow API docs, it returns XML by default. Apparently, "to_hash" was able to turn this into a hash (albeit a very ugly one), but you are really trying to swim upstream by using it this way. I would recommend consuming it as intended (XML) at the start, and then maybe parsing it into an easier-to-use format (like a JSON/hash structure) later.
Nokogiri is GREAT at parsing XML! You can use XPath syntax for grabbing elements from the DOM, or even CSS selectors.
For example, to get an array of the "content" in every result:
response = #get xml response from zillow
results = Nokogiri::XML(response).remove_namespaces!
#using css
content_array = results.css("result content")
#same thing using xpath:
content_array = results.xpath("//result//content")
If you just want the content from the first result, you can do this as a shortcut:
content = results.at_css("result content").content
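Putting it all together, here is a minimal end-to-end sketch. It assumes Zillow's GetSearchResults endpoint and a placeholder ZWSID; check the API docs for the exact URL and parameters you need:
require 'open-uri'
require 'nokogiri'

#hypothetical request; substitute your own ZWSID and address parameters
url = "http://www.zillow.com/webservice/GetSearchResults.htm" \
      "?zws-id=YOUR_ZWSID&address=305+Vinton+St&citystatezip=Melrose%2C+MA+02176"
doc = Nokogiri::XML(URI.open(url)).remove_namespaces!

#the zestimate amount the original question was digging for
amount = doc.at_css("result zestimate amount")
puts amount.text if amount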

Since it is indeed XML dumped into a JSON-like hash, you could use JSONPath to query it.
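For example, a minimal sketch using the jsonpath gem (an assumption; any JSONPath implementation would do), where result is the hash from the question:
require 'jsonpath'

#recursive descent straight to the zestimate amount
amounts = JsonPath.new('$..zestimate..amount..content').on(result)
#=> ["562170"]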

Related

How to pull data using XPath from a script tag?

Just started learning scraping, and for my test project I'm trying to retrieve the quantity of a certain product in the Scrapy shell by using
response.xpath('//script[contains("quantity")]/text()').extract()
This doesn't work.
Help me understand what the right convention is to retrieve data from attributes like quantity, category_path, etc.:
<script>
window.dataLayer = window.dataLayer || [];
dataLayer.push({"event":"datalayer-initialized","region":"India","account_type":"ecom","customer":{"id":""},"page_type":"Product","product":{"ffr":"csddfas","name":"tote bag by singh","materials":"100% polyester","specs":"Dimensions: 18.5\" x 6.75\"; 24L","color":null,"size":null,"upc":null,"new":false,"brand":null,"season":"HOLIDAY 2017","on_sale":false,"quantity":"158","original_price":100,"price":100,"category_path":
["Mens","Accessories","Backpacks \/ Bags"],"created":"2016-09-07","modified":"2018-02-12",
"colors":["BLACK"],"sizes":["S","M","L","XS","XL","XXL"]}});
</script>
Scrapy selectors have built-in support for regular expressions, and they can help in this case:
response.xpath('//script[contains(text(),"quantity")]/text()').re(r'"quantity":"(\d+)"')
(You need to update your XPath to select the script content, since your original expression is not valid: contains() takes two arguments.)
Another way: you can also use regular expressions to collect the JSON content of the script, parse it into a JSON object, and work with it more easily.
You are using the css method and giving it an XPath.
Try
response.xpath('//script[contains(text(), "quantity")]').extract()
or
response.css('script::contains(quantity)').extract()
And you will need a Regex to extract that JSON string
re.findall(r'(?<=dataLayer\.push\().*(?=\)\;)', your_script_tag_data, re.DOTALL)
import json
import re

javascript = response.xpath('//script[contains(text(), "quantity")]/text()').extract_first()
json_string = re.search(r'dataLayer\.push\((.+?)\);', javascript, re.DOTALL).group(1)
data = json.loads(json_string)
print("Quantity: {0}".format(data["product"]["quantity"]))
In my experience, there is no way to get quantity, category_path, etc. with XPath alone, because they are embedded in JSON; XPath can only address the XML/HTML structure.
I assume that you already have the XML data; use
data = yourXML.xpath('//script//text()')
Now data holds the text of the script elements. Then you need to extract the string inside dataLayer.push(...) and convert it to JSON. With JSON, it's easy to get your information.

simplexml_load_file with XPath returns empty array

Getting XML from this URL:
$xml = simplexml_load_file('http://geocode-maps.yandex.ru/1.x/?geocode=37.71677,55.75208&kind=metro&spn=1,1&rspn=1');
print_r($xml) shows that the XML loaded, but xpath always returns an empty array. I tried:
$xml->xpath('/');
$xml->xpath('/ymaps');
$xml->xpath('/GeoObjectCollection');
$xml->xpath('/ymaps/GeoObjectCollection');
$xml->xpath('//GeoObjectCollection');
$xml->xpath('precision');
Why do I get an empty array? I hope I'm just missing something easy.
It might be rather easy, but I guess it is also the most common mistake in the history of XML: You are forgetting namespaces!
A lot of elements in the given XML are changing the default namespace and you have to consider that in your XPath.
You can first register your namespace like so:
$xml->registerXPathNamespace('y', 'http://maps.yandex.ru/ymaps/1.x');
$xml->registerXPathNamespace('a', 'http://maps.yandex.ru/attribution/1.x');
and then you can query your data:
$xml->xpath('//y:ymaps/y:GeoObjectCollection');

How to extract data using JTidy and XPath

I have to extract the company name and face value from
http://money.rediff.com/companies/20-microns-ltd/15110088
I noticed that this task could be accomplished using an XPath API.
Since this is an HTML page, I am using the JTidy parser.
This is the XPath for the face value which I have to extract:
/html/body/div[4]/div[6]/div[9]/div/table/tbody/tr[4]/td[2]
This is my code
URL oracle = new URL("http://money.rediff.com/companies/20-microns-ltd/15110088");
URLConnection yc = oracle.openConnection();
InputStream is = yc.getInputStream();
Tidy tidy = new Tidy();
tidy.setQuiet(true);
tidy.setShowWarnings(false);
Document tidyDOM = tidy.parseDOM(is, null);
XPathFactory xPathFactory = XPathFactory.newInstance();
XPath xPath = xPathFactory.newXPath();
String expression = "/html";
XPathExpression xPathExpression = xPath.compile(expression);
Object result = xPathExpression.evaluate(tidyDOM,XPathConstants.NODESET);
System.out.println(result.toString());
Please guide me further, because I cannot find the right solution for the above.
Try not to use "full" xpaths.
//div[@id='leftcontainer']//div[9]//table//tr[4]/td[2]
is better than
/html/body/.../.../.../.../.../...
Most HTML pages are not valid or even well-formed. So the DOM structure may change when processed by "real-world HTML parsers". For example, a <tbody> may be inserted under <table> if there isn't one. Things are worse when different HTML parsers generate different DOM trees so one XPath may be valid for one parser, but not the other. I would rather use "wildcards" like table//tr[4] instead of table/tbody/tr[4] or table/tr[4] so that I can forget about <tbody>. Such expressions are more robust when used against the messy real-world HTML pages.
You can use Firepath, a plugin for Firebug which is then a plugin for Firefox, to debug XPath expressions.
p.s. You can try my JHQL (http://github.com/wks/jhql) project for exactly this task. You will like it if you have more pages to extract data from.

Most efficient way to parse and reformat data with Nokogiri & Sinatra

I'm working on reformatting HTML output from a search query for an inventory manager used by a number of car dealers. There's no direct DB access and no information available from the service creators, so I decided to attempt to parse and reformat the data with Nokogiri and generate new pages of results based on the search query.
On first load of the page, I'm just using a default search to generate the first results.
For the search to work, I'm sending the query to a URL like this:
post '/search/?:search_query' do
url = "http://domain.com/v/?DealerId=" + settings.dealer_id + "&maxrows=10&#{params[:search_query]}"
doc = Nokogiri::HTML(open(url))
doc.css("td:nth-child(5) .ForeColor4").each do |msrp|
session["msrp"] = msrp.inner_html
end
doc.css("td:nth-child(4) .ForeColor4").each do |price|
session["price"] = price.inner_html
end
erb :index
end
I know there's got to be a smarter way to do this.
Edit:
An example URL to request data:
http://domain.com/?DealerId=1234&object=list&lang=en&MAKE=&MODEL=&maxrows=50&MinYear=&MaxYear=2011&Type=N&MinPrice=&MaxPrice=&STYLE=&ExtColor=&MaxMiles=&StockNo=
A description of the HTML generated:
Unfortunately, it's old code that's almost entirely table-based, has inline-styles and lacks classes or ids in most areas.
An example of a CSS selector:
td:nth-child(5) .ForeColor4
An XPath selector:
//td[(((count(preceding-sibling::*) + 1) = 5) and parent::*)]//*[contains(concat( " ", @class, " " ), concat( " ", "ForeColor4", " " ))]
I've also looked at Mechanize and Hpricot as possibilities, but I'm not aware of the best tools for the job, as I haven't attempted screen-scraping before.
Summary: I want to pull the data from the HTML, temporarily store it in a variable / session / cookie (data changes several times per day), and then be able to reformat the output into my own HTML/CSS styling.
Personally, I'd decouple the scraping from the user action. Have an independent process scrape and fill your database (see the sketch at the end of this answer). This will improve performance drastically, as fetching, creating a DOM, parsing, and then rendering output on every action is going to be slow.
doc.css("td:nth-child(5) .ForeColor4").each do |msrp|
session["msrp"] = msrp.inner_html
end
doc.css("td:nth-child(4) .ForeColor4").each do |price|
session["price"] = price.inner_html
end
You might want to use Nokogiri's at_css() method instead of the regular css(). at_css() finds the first occurrence of your target and only returns that one node, similar to doing a .first against the nodeset that .css() returns.
That would simplify your lookups to this form:
session["msrp"] = doc.at_css("td:nth-child(5) .ForeColor4").inner_html
I'd probably add something like rescue 'msrp lookup failed' to the end of the lookups while testing, just in case you've got bad accessors. Or you could let the code fail when inner_html() is called on nil. The rescue is just a bit friendlier way to debug.
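For example, a quick sketch of that defensive lookup, using the selector from the question:
session["msrp"] = (doc.at_css("td:nth-child(5) .ForeColor4").inner_html rescue 'msrp lookup failed')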
Otherwise your lookups seem to be decent.
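A minimal sketch of the decoupling mentioned above, assuming an in-memory cache refreshed by a background thread. The URL shape and selectors come from the question; the cache, refresh interval, and helper names are illustrative:
require 'open-uri'
require 'nokogiri'

CACHE = { listings: [], updated_at: nil }
CACHE_MUTEX = Mutex.new

def refresh_cache(dealer_id)
  url = "http://domain.com/v/?DealerId=#{dealer_id}&maxrows=10"
  doc = Nokogiri::HTML(URI.open(url))
  msrps  = doc.css("td:nth-child(5) .ForeColor4").map(&:text)
  prices = doc.css("td:nth-child(4) .ForeColor4").map(&:text)
  CACHE_MUTEX.synchronize do
    CACHE[:listings]   = msrps.zip(prices).map { |m, p| { msrp: m, price: p } }
    CACHE[:updated_at] = Time.now
  end
end

#refresh a few times a day; request handlers read CACHE instead of scraping
Thread.new do
  loop do
    refresh_cache(settings.dealer_id) rescue nil
    sleep 4 * 60 * 60
  end
end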

How do I parse data from a JSON object?

I'm just starting to dabble in consuming a JSON web service, and I am having a little trouble working out the best way to get to the actual data elements.
I am receiving a response which has been converted into a Ruby hash using the JSON.parse method. The hash looks like this:
{"response"=>{"code"=>2002, "payload"=>{"topic"=>[{"name"=>"Topic Name", "url"=>"http://www.something.com/topic", "hero_image"=>{"image_id"=>"05rfbwV0Nggp8", "hero_image_id"=>"0d600BZ7MZgLJ", "hero_image_url"=>"http://img.something.com/imageserve/0d600BZ7MZgLJ/60x60.jpg"}, "type"=>"PERSON", "search_score"=>10.0, "topic_id"=>"0eG10W4e3Aapo"}]}, "message"=>"Success"}}
What I would like to know is: what is the easiest way to get to the "topic" data, so I can do something like:
topic.name = json_resp.name
topic.img = json_resp.hero_image_url
etc
You can use Hashie's Mash. One of the best Twitter clients for Ruby uses it, and the resulting interface is very clean and easy to use. I've wrapped the Delicious RSS API with it in less than 60 lines.
As usual, the specs show very clearly how to use it.
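For example, a minimal sketch, assuming the parsed hash from the question is in json_resp:
require 'hashie'

mash  = Hashie::Mash.new(json_resp)
topic = mash.response.payload.topic.first

topic.name                      #=> "Topic Name"
topic.hero_image.hero_image_url #=> "http://img.something.com/imageserve/0d600BZ7MZgLJ/60x60.jpg"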
