XML API response to JSON, or hash? - Ruby

So, I'm using an API which happens to only return XML, which sucks. What I want to do is create a database entry for each record that gets returned from the API, but I'm not sure how.
The XML that gets returned is huge and has lots of whitespace characters in it... is that normal? Here is a sample of some of the XML.
<!-- ... -->
<attribute name="item_date">May 17, 2011</attribute>
<attribute name="external_url">http://missionlocal.org/2011/05/rain-camioneta-part-i/</attribute>
<attribute name="source" id="2478">Mission Loc#l</attribute>
<attribute name="excerpt"></attribute>
</attributes>
</newsitem>
<newsitem
id="5185807"
title="Lost Chrome messenger PBR bag and contents (marina / cow hollow)"
url="http://sf.everyblock.com/lost-and-found/by-date/2011/5/17/5185807/"
location_name="Van Ness and Filbert"
schema="lost-and-found"
schema_id="7"
pub_date="May 17, 2011, 12:15 p.m."
longitude="-122.424129925"
latitude="37.7995100578"
>
<attributes>
<attribute name="item_date">May 17, 2011</attribute>
<attribute name="external_url">http://sfbay.craigslist.org/sfc/laf/2386709187.html</attribute>
</attributes>
</newsitem>
<newsitem
id="5185808"
title="Plywood Update: Dumplings & Buns Aims To Be "Beard Papa Of Chinese Buns""
url="http://sf.everyblock.com/news-articles/by-date/2011/5/17/5185808/"
location_name="2411 California Street"
schema="news-articles"
schema_id="5"
pub_date="May 17, 2011, 12:15 p.m."
longitude="-122.434000442"
latitude="37.7888985667"
>
<attributes>
<attribute name="item_date">May 17, 2011</attribute>
<attribute name="external_url">http://sf.eater.com/archives/2011/05/17/dumplings_buns_aims_to_be_beard_papa_of_chinese_buns.php</attribute>
<attribute name="source" id="2155">Eater SF</attribute>
<attribute name="excerpt"></attribute>
</attributes>
</newsitem>
<newsitem
id="5185809"
title="Freebies: This week, Piazza D'Angelo (22 Miller..."
url="http://sf.everyblock.com/news-articles/by-date/2011/5/17/5185809/"
location_name="22 Miller"
schema="news-articles"
schema_id="5"
pub_date="May 17, 2011, 12:15 p.m."
longitude="-122.408894997"
latitude="37.7931966922"
>
<attributes>
<attribute name="item_date">May 17, 2011</attribute>
<attribute name="external_url">http://sf.eater.com/archives/2011/05/17/freebies_24.php</attribute>
<attribute name="source" id="2155">Eater F</attribute>
<attribute name="excerpt"></attribute>
<!-- ... -->
Any ideas?

That's not quite valid XML. That's some sort of escaped-string representation of XML, perhaps console output. It also doesn't seem to be complete. Other than that, it's fairly normal XML. Here's a smaller excerpt, unescaped and formatted:
<newsitem
id="5185807"
title="Lost Chrome messenger PBR bag and contents (marina / cow hollow)"
url="http://sf.everyblock.com/lost-and-found/by-date/2011/5/17/5185807/"
location_name="Van Ness and Filbert"
schema="lost-and-found"
schema_id="7"
pub_date="May 17, 2011, 12:15 p.m."
longitude="-122.424129925"
latitude="37.7995100578">
<attributes>
<attribute name="item_date">May 17, 2011</attribute>
<attribute name="external_url">http://sfbay.craigslist.org/sfc/laf/2386709187.html</attribute>
</attributes>
</newsitem>
You'll just need to determine what you want to extract and put in the database, and let that drive your DB design decision. Do you need multiple models with relationships intact, or are you just concerned with a subset of the data?

XML can contain whitespace without it affecting the quality of the data. A good parser, which is how you should be processing the XML, won't care, and will give you access to the data whether there is whitespace or not.
Nokogiri is a favorite of mine, and seems to be the de facto standard for Ruby nowadays. It is very easy to use, but you will have to learn how to tell it which nodes you want.
To get you going, here is some of the terminology:
Node is the term for a tag after it has been parsed.
Nodes have attributes, which can be accessed using node_var['attribute'].
Node text can be accessed using node_var.text or node_var.content or node_var.inner_text.
NodeSet is basically an array of Nodes.
at returns the first node matching the accessor you give the parser. % is an alias.
search returns a NodeSet of nodes matching the accessor you give the parser. / is an alias.
Here's how we can parse the snippet of XML:
require 'nokogiri'
xml = <<EOT
<newsitem
id="5185807"
title="Lost Chrome messenger PBR bag and contents (marina / cow hollow)"
url="http://sf.everyblock.com/lost-and-found/by-date/2011/5/17/5185807/"
location_name="Van Ness and Filbert"
schema="lost-and-found"
schema_id="7"
pub_date="May 17, 2011, 12:15 p.m."
longitude="-122.424129925"
latitude="37.7995100578">
<attributes>
<attribute name="item_date">May 17, 2011</attribute>
<attribute name="external_url">http://sfbay.craigslist.org/sfc/laf/2386709187.html</attribute>
</attributes>
</newsitem>
EOT
doc = Nokogiri::XML(xml)
doc.at('newsitem').text # => "\n \n May 17, 2011\n http://sfbay.craigslist.org/sfc/laf/2386709187.html\n \n"
(doc % 'attribute').content # => "May 17, 2011"
doc.at('attribute[name="external_url"]').inner_text # => "http://sfbay.craigslist.org/sfc/laf/2386709187.html"
doc.at('newsitem')['id'] # => "5185807"
newsitem = doc.at('newsitem')
newsitem['title'] # => "Lost Chrome messenger PBR bag and contents (marina / cow hollow)"
attributes = doc.search('attribute').map{ |n| n.text }
attributes # => ["May 17, 2011", "http://sfbay.craigslist.org/sfc/laf/2386709187.html"]
attributes = (doc / 'attribute').map{ |n| n.text }
attributes # => ["May 17, 2011", "http://sfbay.craigslist.org/sfc/laf/2386709187.html"]
All accesses use CSS selectors, just like you'd write in web pages. It's simpler, and usually clearer, but Nokogiri also supports XPath, which is very powerful and lets you offload a lot of processing to the underlying libxml2 library, which runs very fast.
Nokogiri works very nicely with Ruby's Open-URI, so if you're retrieving the XML from a website, you can do it like this:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(URI.open('http://www.example.com'))
doc.to_html.size # => 2825
That's parsing HTML, which Nokogiri excels at too, but the process is the same for XML, just replace Nokogiri::HTML with Nokogiri::XML.
See "How to avoid joining all text from Nodes when scraping" also.

Related

Different results for the same address in DistanceMatrix API

I have a destination address of: Rob 25, 1314 Rob, Slovenija
If I put this address in Distance Matrix API call like this:
https://maps.googleapis.com/maps/api/distancematrix/xml?destinations=Rob%2025,1314%20Rob,Slovenija&origins=45.8931535156444,14.6671678125858&key=
API response:
<DistanceMatrixResponse>
<status>OK</status>
<origin_address>Neimenovana cesta, 1290 Grosuplje, Slovenia</origin_address>
<destination_address>Slovenia</destination_address>
<row>
<element>
<status>OK</status>
<duration>
<value>4414</value>
<text>1 hour 14 mins</text>
</duration>
<distance>
<value>75726</value>
<text>75.7 km</text>
</distance>
</element>
</row>
</DistanceMatrixResponse>
I get a distance of 75.7 km, which is wrong given the real-world locations of the destination and origin.
However, if I remove the space between the street name and the house number:
https://maps.googleapis.com/maps/api/distancematrix/xml?destinations=Rob25,%201314%20Rob,%20Slovenija&origins=45.8931535156444,14.6671678125858&key=
API Response:
<DistanceMatrixResponse>
<status>OK</status>
<origin_address>Neimenovana cesta, 1290 Grosuplje, Slovenia</origin_address>
<destination_address>1314 Rob, Slovenia</destination_address>
<row>
<element>
<status>OK</status>
<duration>
<value>1757</value>
<text>29 mins</text>
</duration>
<distance>
<value>18288</value>
<text>18.3 km</text>
</distance>
</element>
</row>
</DistanceMatrixResponse>
I get the correct result of 18.3 km.
I also get the correct result if I use coordinates for the destination.
Is the problem the spaces in the address when a number follows a space (the browser inserts %20)?
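Since the rest of this page is Ruby: one way to avoid the browser's guesswork is to percent-encode the query yourself with the standard library. `URI.encode_www_form` encodes spaces as `+` and commas as `%2C`. Whether the geocoder treats `+` and `%20` identically is its own question, but you at least send a well-formed, unambiguous query (`YOUR_KEY` is a placeholder):

```ruby
require 'uri'

# Percent-encode the address parameters instead of relying on the browser;
# encode_www_form turns spaces into "+" and commas into "%2C".
params = URI.encode_www_form(
  destinations: 'Rob 25, 1314 Rob, Slovenija',
  origins:      '45.8931535156444,14.6671678125858'
)
url = "https://maps.googleapis.com/maps/api/distancematrix/xml?#{params}&key=YOUR_KEY"

params # => "destinations=Rob+25%2C+1314+Rob%2C+Slovenija&origins=45.8931535156444%2C14.6671678125858"
```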

Replacing translations using ActiveAdmin

I'm trying to use Mobility in my Rails application, with ActiveAdmin as the administration panel.
I use the Container backend with a JSONB column.
I also have the activeadmin_json_editor gem installed, so it's not possible to produce bad JSON. Inside my admin resource I permit the :translations attribute using StrongParams.
When editing translations using ActiveAdmin and submitting the form I get the following parameters:
2.5.3 (#<Admin::QuestionsController:0x00007fd466a9a690>):0 > permitted_params
=> <ActionController::Parameters {"utf8"=>"✓", "_method"=>"patch", "authenticity_token"=>"DwSuN9M9cD27dR7WmitBSMKKgVjhW1om3xwxOJUhK41no8RWH1Xh6L9QNIhOc1NhPYtm5QnKJWh7KEIUvuehUQ==", "commit"=>"Update Question", "id"=>"37", "question"=><ActionController::Parameters {"translations"=>"{\"en\":{\"body\":\"dupa\"}}", "dimension_id"=>"6"} permitted: true>} permitted: true>
However once the update query gets processed my model has no translations at all:
2.5.3 (#<Admin::QuestionsController:0x00007fd466a9a690>):0 > resource.update(permitted_params["question"])
(0.4ms) BEGIN
↳ (pry):18
Dimension Load (0.4ms) SELECT "dimensions".* FROM "dimensions" WHERE "dimensions"."id" = $1 LIMIT $2 [["id", 6], ["LIMIT", 1]]
↳ (pry):18
(0.3ms) COMMIT
↳ (pry):18
=> true
2.5.3 (#<Admin::QuestionsController:0x00007fd466a9a690>):0 > resource
=> #<Question:0x00007fd45c284d98
id: 37,
body: nil,
translations: {},
created_at: Wed, 16 Jan 2019 12:17:38 UTC +00:00,
updated_at: Fri, 08 Feb 2019 12:07:00 UTC +00:00,
dimension_id: 6>
What am I doing wrong? Should I parse the JSON from the params and use resource.<attribute_name>_backend.write for each locale?
Since I didn't get any answers, I dug around and came up with the following solution. In your resource admin model, add:
controller do
  def update
    translations = JSON.parse(permitted_params.dig(resource.class.name.downcase, "translations"))
    translations.each do |locale, attributes|
      supported_attributes = attributes.select { |attribute_name, _| resource.class.mobility_attributes.include?(attribute_name) }
      supported_attributes.each do |attribute_name, translation|
        resource.send("#{attribute_name}_backend").send(:write, locale.to_sym, translation.to_s)
      end
    end
    resource.save
    redirect_to admin_questions_path
  end
end
This is probably not the proper way to mass-update the translations, but I couldn't figure out a better one. Please keep in mind that this doesn't check whether the locale key is valid.

Nokogiri extract object that is not wrapped in tags

I'm loading regulations into a database and putting them in a tree hierarchy. When the XML I'm scraping is set up as below, it is trivial to scrape:
<CHAPTER>
<PART>
<SUBPART>
<SECTION>
<HD>Section Title</HD>
</SECTION>
<SECTION> ... </SECTION>
<APPENDIX>
<HD>Appendix Title</HD>
<P>Appendix content...</P>
<FOO>More content in unexpected tags</FOO>
</APPENDIX>
</SUBPART>
<SUBPART> ... </SUBPART>
</PART>
</CHAPTER>
Since I have to know what the parent ID is, I do something along these lines:
parent_id = 1
doc.xpath("//chapter/part/subpart").each do |subpart|
  title = subpart.xpath("hd").first.text
  # add is a method that creates an object and saves it to the database, returning its id
  id = add(title, 'SUBPART', parent_id)
  subpart.xpath('section').each do |section|
    title = section.xpath('hd').first.text
    add(title, 'SECTION', id)
  end
  subpart.xpath('appendix').each do |app|
    title = app.xpath('hd').first.text
    content = app.to_s
    add(title, 'APPENDIX', id, content) # content is an optional argument
  end
end
However, the XML is not always set up in such a logical way. Sometimes, the appendices are not wrapped in tags :(
When this is the case, the XML looks like this:
<EXTRACT>
<HD SOURCE="HD1">Appendix A to § 1926.60—Substance Data Sheet, for 4-4′ Methylenedianiline</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix A are identical to those set forth in appendix A to § 1910.1050 of this chapter.</P>
</NOTE>
<HD SOURCE="HD1">Appendix B to § 1926.60—Substance Technical Guidelines, MDA</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix B are identical to those set forth in appendix B to § 1910.1050 of this chapter.</P>
</NOTE>
<HD SOURCE="HD1">Appendix C to § 1926.60—Medical Surveillance Guidelines for MDA</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix C are identical to those set forth in appendix C to § 1910.1050 of this chapter.</P>
</NOTE>
<HD SOURCE="HD1">Appendix D to § 1926.60—Sampling and Analytical Methods for MDA Monitoring and Measurement Procedures</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix D are identical to those set forth in appendix D to § 1910.1050 of this chapter.</P>
</NOTE>
</EXTRACT>
<CITA>
[57 FR 35681, Aug. 10, 1992, as amended at 57 FR 49649, Nov. 3, 1992; 61 FR 5510, Feb. 13, 1996; 61 FR 31431, June 20, 1996; 63 FR 1296, Jan. 8, 1998; 69 FR 70373, Dec. 6, 2004; 70 FR 1143, Jan. 5, 2005; 71 FR 16674, Apr. 3, 2006; 71 FR 50191, Aug. 24, 2006; 73 FR 75588, Dec. 12, 2008; 76 FR 33611, June 8, 2011; 77 FR 17889, Mar. 26, 2012]
</CITA>
The only way I can think of to extract these appendices is to iterate through the <EXTRACT> node's children and check each tag to see whether its name is "HD" and its text contains "Appendix", then save everything after it until I hit the next <HD> with "Appendix" in the text.
That feels like a pretty clunky solution. Is there a better way to do this?

Splitting a String with Pig

I have a String in the following format:
Sat, 09 Jul 2011 05:38:24 GMT
I would like an output like this:
09 Jul 2011
05:38:24
Thanks.
[EDIT]
I have tried many solutions and I have had errors, so I will re-explain the problem. I have an XML file with a node containing: Tue, 05 Jul 2011 10:10:30 GMT, from which I would like to extract two separate Strings as illustrated above.
I have tried this code:
register /usr/lib/pig/piggybank.jar;
items = LOAD ' depeche/2011_7_10_12_30_rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);
source_name = FOREACH items GENERATE REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
REGEX_EXTRACT(item, '<pubDate>(.*)</pubDate>', 1) AS pubdate:chararray,
sortie = FOREACH pubdate GENERATE SUBSTRING((chararray)$0, 4, 25);
illustrate sortie;
error:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 21, column 333> mismatched input '=' expecting SEMI_COLON
EDITED ANSWER:
That example is a bit clearer. I grabbed an RSS feed example and did a quick test. The code below worked using a sample which contained all of the elements in your example above. I used REGEX_EXTRACT instead of SUBSTRING to get the pubdate, however.
--rss.pig
REGISTER piggybank.jar;
items = LOAD 'rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);
data = FOREACH items GENERATE REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
REGEX_EXTRACT(item, '<pubDate>.*(\\d{2}\\s[a-zA-Z]{3}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}).*</pubDate>', 1) AS pubdate:chararray;
dump data;
--rss.txt
<rss version="2.0">
<channel>
<title>News</title>
<link>http://www.hannonhill.com</link>
<description>Hannon Hill News</description>
<language>en-us</language>
<pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
<generator>Cascade Server</generator>
<webMaster>webmaster#hannonhill.com</webMaster>
<item>
<title>News Item 1</title>
<link>http://www.hannonhill.com/news/item1.html</link>
<description>Description of news item 1 here.</description>
<pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item1.html</guid>
</item>
<item>
<title>News Item 2</title>
<link>http://www.hannonhill.com/news/item2.html</link>
<description>Description of news item 2 here.</description>
<pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item2.html</guid>
</item>
<item>
<title>News Item 3</title>
<link>http://www.hannonhill.com/news/item3.html</link>
<description>Description of news item 3 here.</description>
<pubDate>Tue, 20 May 2003 08:56:02 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item3.html</guid>
</item>
</channel>
</rss>
Results for rss.pig:
(http://www.hannonhill.com/news/item1.html,News Item 1,Description of news item 1 here.,03 Jun 2003 09:39:21)
(http://www.hannonhill.com/news/item2.html,News Item 2,Description of news item 2 here.,30 May 2003 11:06:42)
(http://www.hannonhill.com/news/item3.html,News Item 3,Description of news item 3 here.,20 May 2003 08:56:02)
ORIGINAL ANSWER:
There are several methods that would work here, so I'll cover two: SUBSTRING and REGEX_EXTRACT.
If your string length is constant, then you can use the builtin SUBSTRING function. Think of it like the cut command in Linux.
OUTPUT = FOREACH INPUT GENERATE SUBSTRING((chararray)$0, 4, 25);
Otherwise, you can use the builtin REGEX_EXTRACT to pull the string that you're looking for. Given the example, the easiest regex match that I came up with was to begin the string with the first digit, and end with the last digit, capturing all characters in between.
OUTPUT = FOREACH INPUT GENERATE REGEX_EXTRACT((chararray)$0, '([\\d].*[\\d])', 1);
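Since the rest of this page is Ruby, the capture is easy to sanity-check in irb before wiring it into Pig. A sketch of the same first-digit-to-last-digit idea, plus the split into the two desired strings:

```ruby
s = 'Sat, 09 Jul 2011 05:38:24 GMT'

# Same idea as REGEX_EXTRACT: grab from the first digit through the last digit.
datetime = s[/\d.*\d/]
datetime  # => "09 Jul 2011 05:38:24"

# Split on the last space to get the two desired strings.
date, time = datetime.rpartition(' ').values_at(0, 2)
date  # => "09 Jul 2011"
time  # => "05:38:24"
```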

Ruby Regex to match multiple portions of a string

Using Ruby: ruby 1.9.3dev (2011-09-23 revision 33323) [i686-linux]
I have the following string:
str = 'Message relates to activity <a href="/activities/35">TU4 Sep 5 Activity 1</a> <img src="/images/layout/placeholder.png" width="222" height="149"/><br/><br/>First question from Manager on TU4 Sep 5 Activity 1.'
I want to match the following:
35 (a number which is part of the href attribute value)
TU4 Sep 5 Activity 1 (the text of the <a> tag)
First question from Manager on TU4 Sep 5 Activity 1. (the remaining text after the last <br/><br/> tags)
For achieving the same I have written the following regex
result = str.match(/<a href="\/activities\/(?<activity_id>\d+)">(?<activity_title>.*)<\/a>.*<br\/><br\/>(?<message>.*)/)
This produces following result:
#<MatchData "<a href=\"/activities/35\">TU4 Sep 5 Activity 1</a> <img src=\"/images/layout/placeholder.png\" width=\"222\" height=\"149\"/><br/><br/>First question from Manager on TU4 Sep 5 Activity 1."
activity_id:"35"
activity_title:"TU4 Sep 5 Activity 1"
message:"First question from Manager on TU4 Sep 5 Activity 1.">
But I guess this is not efficient.
Is it possible that somehow only the required values (as listed above under what I want to match) are returned in the matched result, with the following full-match value excluded:
"<a href=\"/activities/35\">TU4 Sep 5 Activity 1</a> <img src=\"/images/layout/placeholder.png\" width=\"222\" height=\"149\"/><br/><br/>First question from Manager on TU4 Sep 5 Activity 1."
Thanks,
Jignesh
The appropriate way to do this is NOT to use regexen. Instead, use the Nokogiri library to easily parse your HTML:
require 'nokogiri'

doc = Nokogiri::HTML.parse(str)
activity_id    = doc.at_css('a[href^="/activities"]')['href'][/\d+$/]
activity_title = doc.at_css('a[href^="/activities"]').inner_text
message        = doc.search('//text()').last.text
This will do exactly what your regexp was attempting, with much lower chance of random failure.
