Different results for the same address in DistanceMatrix API

I have a destination address of: Rob 25, 1314 Rob, Slovenija
If I put this address into a Distance Matrix API call like this:
https://maps.googleapis.com/maps/api/distancematrix/xml?destinations=Rob%2025,1314%20Rob,Slovenija&origins=45.8931535156444,14.6671678125858&key=
API response:
<DistanceMatrixResponse>
<status>OK</status>
<origin_address>Neimenovana cesta, 1290 Grosuplje, Slovenia</origin_address>
<destination_address>Slovenia</destination_address>
<row>
<element>
<status>OK</status>
<duration>
<value>4414</value>
<text>1 hour 14 mins</text>
</duration>
<distance>
<value>75726</value>
<text>75.7 km</text>
</distance>
</element>
</row>
</DistanceMatrixResponse>
I get a distance of 75.7 km, which is wrong given the real-world locations of the destination and origin.
However, if I remove the space between the street name and the house number:
https://maps.googleapis.com/maps/api/distancematrix/xml?destinations=Rob25,%201314%20Rob,%20Slovenija&origins=45.8931535156444,14.6671678125858&key=
API Response:
<DistanceMatrixResponse>
<status>OK</status>
<origin_address>Neimenovana cesta, 1290 Grosuplje, Slovenia</origin_address>
<destination_address>1314 Rob, Slovenia</destination_address>
<row>
<element>
<status>OK</status>
<duration>
<value>1757</value>
<text>29 mins</text>
</duration>
<distance>
<value>18288</value>
<text>18.3 km</text>
</distance>
</element>
</row>
</DistanceMatrixResponse>
I get the correct result of 18.3 km.
I also get the correct result if I use coordinates for the destination.
Is the problem the space in the address before the house number, which the browser encodes as %20?
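For what it's worth, %20 is just the percent-encoded form of a space and is legal in a URL, so the encoding itself should not be the problem; the geocoder is simply interpreting the two free-form strings differently. A minimal Ruby sketch (the helper name is hypothetical) that lets the standard library encode each parameter, so hand-built encoding can at least be ruled out:
require 'uri'
require 'net/http'

# Hypothetical helper: build the Distance Matrix request URL with every
# query parameter encoded by the standard library rather than by hand.
def distance_matrix_url(origin, destination, key)
  params = URI.encode_www_form(
    origins:      origin,
    destinations: destination,
    key:          key
  )
  URI("https://maps.googleapis.com/maps/api/distancematrix/xml?#{params}")
end

url = distance_matrix_url('45.8931535156444,14.6671678125858',
                          'Rob 25, 1314 Rob, Slovenija', 'YOUR_KEY')
# encode_www_form writes the space as "+", which is equivalent to %20 in
# a query string. If the API still geocodes the destination as plain
# "Slovenia", the issue is address ambiguity, not the encoding.
response = Net::HTTP.get(url)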

Related

XPATH for all child nodes with different names

I have a parent element with various child elements that I need to keep a count of. The problem I'm having is that each child element has a different name, so every time I use count(*) the numbering restarts. I need the numbering to go 1.1, 1.2, 1.3...
The parent tag <application> would be 1, <ident> would be 1.1, <kitapplic> would be 1.2, and <tctoproof> would be 1.3.
I thought I could just do a count(child::application) but that didn't work. Your help is appreciated.
<application>
<ident>
<para>This Technical Order is applicable.</para>
</ident>
<kitapplic>
<kitapptbl>
<kitapptblrow>
<model>Model</model>
<serialno>Serial Number</serialno>
<kitno>Kit Required</kitno>
</kitapptblrow>
</kitapptbl>
</kitapplic>
<tctoproof>
<para>Time Compliance Technical Order (TCTO) verification, in accordance
with TO 00-5-15, was accomplished 28 August 2019 at Nellis Air Force
Base with GCS serial number 5147.</para>
</tctoproof>
</application>
With XPath, you can use count() over preceding-sibling together with concat() to get the desired numbers. Example with kitapplic:
concat("1.",count(application/kitapplic/preceding-sibling::*)+1)
Output: 1.2
If you need a list with 1.1, 1.2, 1.3 for each child of the application element, you can do it like this (example in Python):
data = """<application>
<ident>
<para>This Technical Order is applicable.</para>
</ident>
<kitapplic>
<kitapptbl>
<kitapptblrow>
<model>Model</model>
<serialno>Serial Number</serialno>
<kitno>Kit Required</kitno>
</kitapptblrow>
</kitapptbl>
</kitapplic>
<tctoproof>
<para>Time Compliance Technical Order (TCTO) verification, in accordance
with TO 00-5-15, was accomplished 28 August 2019 at Nellis Air Force
Base with GCS serial number 5147.</para>
</tctoproof>
</application>"""
import lxml.html
tree = lxml.html.fromstring(data)
for el in tree.xpath("//application/*"):
    print(el.xpath("concat(name(.),' 1.',count(./preceding-sibling::*)+1)"))
Output:
ident 1.1
kitapplic 1.2
tctoproof 1.3

Nokogiri extract object that is not wrapped in tags

I'm loading regulations into a database and putting them in a tree hierarchy. When the XML I'm scraping is set up as below, it is trivial to scrape:
<CHAPTER>
<PART>
<SUBPART>
<SECTION>
<HD>Section Title</HD>
</SECTION>
<SECTION> ... </SECTION>
<APPENDIX>
<HD>Appendix Title</HD>
<P>Appendix content...</P>
<FOO>More content in unexpected tags</FOO>
</APPENDIX>
</SUBPART>
<SUBPART> ... </SUBPART>
</PART>
</CHAPTER>
Since I have to know what the parent ID is, I do something along these lines:
parent_id = 1
doc.xpath("//chapter/part/subpart").each do |subpart|
  title = subpart.xpath("hd").first.text
  # add is a method that creates an object and saves it to the database, returning its id
  id = add(title, 'SUBPART', parent_id)
  subpart.xpath('section').each do |section|
    title = section.xpath('hd').first.text
    add(title, 'SECTION', id)
  end
  subpart.xpath('appendix').each do |app|
    title = app.xpath('hd').first.text
    content = app.to_s
    add(title, 'APPENDIX', id, content) # content is an optional argument
  end
end
However, the XML is not always set up in such a logical way. Sometimes, the appendices are not wrapped in tags :(
When this is the case, the XML looks like this:
<EXTRACT>
<HD SOURCE="HD1">Appendix A to § 1926.60—Substance Data Sheet, for 4-4′ Methylenedianiline</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix A are identical to those set forth in appendix A to § 1910.1050 of this chapter.</P>
</NOTE>
<HD SOURCE="HD1">Appendix B to § 1926.60—Substance Technical Guidelines, MDA</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix B are identical to those set forth in appendix B to § 1910.1050 of this chapter.</P>
</NOTE>
<HD SOURCE="HD1">Appendix C to § 1926.60—Medical Surveillance Guidelines for MDA</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix C are identical to those set forth in appendix C to § 1910.1050 of this chapter.</P>
</NOTE>
<HD SOURCE="HD1">Appendix D to § 1926.60—Sampling and Analytical Methods for MDA Monitoring and Measurement Procedures</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix D are identical to those set forth in appendix D to § 1910.1050 of this chapter.</P>
</NOTE>
</EXTRACT>
<CITA>
[57 FR 35681, Aug. 10, 1992, as amended at 57 FR 49649, Nov. 3, 1992; 61 FR 5510, Feb. 13, 1996; 61 FR 31431, June 20, 1996; 63 FR 1296, Jan. 8, 1998; 69 FR 70373, Dec. 6, 2004; 70 FR 1143, Jan. 5, 2005; 71 FR 16674, Apr. 3, 2006; 71 FR 50191, Aug. 24, 2006; 73 FR 75588, Dec. 12, 2008; 76 FR 33611, June 8, 2011; 77 FR 17889, Mar. 26, 2012]
</CITA>
The only way I can think of to extract these appendices is to iterate through the children of the <EXTRACT> node, checking whether each tag's name is "HD" and its text contains "Appendix", then saving everything after it until I hit the next <HD> with "Appendix" in the text.
That feels like a pretty clunky solution. Is there a better way to do this?
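One possibility, sketched below with Nokogiri (doc, add, and id refer to the question's existing code; the grouping logic is just the approach described above): walk the children of <EXTRACT> once, starting a new appendix at each <HD> whose text contains "Appendix" and attaching every other node to the appendix currently open.
current_title = nil
current_nodes = []
doc.xpath('//EXTRACT/*').each do |node|
  if node.name == 'HD' && node.text.include?('Appendix')
    # flush the previous appendix before starting a new one
    add(current_title, 'APPENDIX', id, current_nodes.map(&:to_s).join) if current_title
    current_title = node.text
    current_nodes = []
  elsif current_title
    current_nodes << node
  end
end
# flush the final appendix
add(current_title, 'APPENDIX', id, current_nodes.map(&:to_s).join) if current_title
Note that Nokogiri::XML is case-sensitive, so the selector matches the uppercase tag names in the sample.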

Splitting a String with Pig

I have a String in the following format:
Sat, 09 Jul 2011 05:38:24 GMT
I would like an output like this:
09 Jul 2011
05:38:24
Thanks.
[EDIT]
I have tried many solutions and have had errors, so I will re-explain the problem. I have an XML file containing a node with the value Tue, 05 Jul 2011 10:10:30 GMT, from which I would like to extract two separate strings as illustrated above.
I have tried this code:
register /usr/lib/pig/piggybank.jar;
items = LOAD ' depeche/2011_7_10_12_30_rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);
source_name = FOREACH items GENERATE REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
REGEX_EXTRACT(item, '<pubDate>(.*)</pubDate>', 1) AS pubdate:chararray,
sortie = FOREACH pubdate GENERATE SUBSTRING((chararray)$0, 4, 25);
illustrate sortie;
error:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 21, column 333> mismatched input '=' expecting SEMI_COLON
EDITED ANSWER:
That example is a bit clearer ... I grabbed an RSS feed example and did a quick test. The code below worked on a sample which contained all of the elements in your example above. I used REGEX_EXTRACT instead of SUBSTRING to get the pubdate, however.
--rss.pig
REGISTER piggybank.jar;
items = LOAD 'rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);
data = FOREACH items GENERATE REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
REGEX_EXTRACT(item, '<pubDate>.*(\\d{2}\\s[a-zA-Z]{3}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}).*</pubDate>', 1) AS pubdate:chararray;
dump data;
--rss.txt
<rss version="2.0">
<channel>
<title>News</title>
<link>http://www.hannonhill.com</link>
<description>Hannon Hill News</description>
<language>en-us</language>
<pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
<generator>Cascade Server</generator>
<webMaster>webmaster#hannonhill.com</webMaster>
<item>
<title>News Item 1</title>
<link>http://www.hannonhill.com/news/item1.html</link>
<description>Description of news item 1 here.</description>
<pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item1.html</guid>
</item>
<item>
<title>News Item 2</title>
<link>http://www.hannonhill.com/news/item2.html</link>
<description>Description of news item 2 here.</description>
<pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item2.html</guid>
</item>
<item>
<title>News Item 3</title>
<link>http://www.hannonhill.com/news/item3.html</link>
<description>Description of news item 3 here.</description>
<pubDate>Tue, 20 May 2003 08:56:02 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item3.html</guid>
</item>
</channel>
</rss>
Results for rss.pig:
(http://www.hannonhill.com/news/item1.html,News Item 1,Description of news item 1 here.,03 Jun 2003 09:39:21)
(http://www.hannonhill.com/news/item2.html,News Item 2,Description of news item 2 here.,30 May 2003 11:06:42)
(http://www.hannonhill.com/news/item3.html,News Item 3,Description of news item 3 here.,20 May 2003 08:56:02)
ORIGINAL ANSWER:
There are several methods that would work here, so I'll cover two: SUBSTRING and REGEX_EXTRACT.
If your string length is constant, then you can use the builtin SUBSTRING function. Think of it like the cut command in Linux.
OUTPUT = FOREACH INPUT GENERATE SUBSTRING((chararray)$0, 4, 25);
Otherwise, you can use the builtin REGEX_EXTRACT to pull the string that you're looking for. Given the example, the easiest regex match that I came up with was to begin the string with the first digit, and end with the last digit, capturing all characters in between.
OUTPUT = FOREACH INPUT GENERATE REGEX_EXTRACT((chararray)$0, '([\\d].*[\\d])', 1);

Using variables in gsub

I have a variable address which for now is a long string containing some unnecessary info, e.g.: "Aboriginal Relations 11th Floor Commerce Place 10155 102 Street Edmonton AB T5J 4G8 Phone 780 427-9658 Fax 780 644-4939 Email gerry.kushlyk#gov.ab.ca"
Aboriginal Relations is in a variable called title, and I'm trying to call address.gsub!(title,''), but it's returning the original string.
I've also tried address.gsub!(/#{title}/,'') and address.gsub!("#{title}",'') but those won't work either. Any ideas?
Sorry, the typo occurred when I typed it into Stack Overflow; here's the code and the output, copied and pasted:
(this is within a loop, so there will be multiple outputs)
p title
address.gsub!(title,'')
p address
output
"Aboriginal Relations "
"Aboriginal Relations 11th Floor Commerce Place 10155 102 Street Edmonton AB T5J 4G8 Phone 780 427-9658 Fax 780 644-4939 Email gerry.kushlyk#gov.ab.ca"
"Aboriginal Tourism Advisory Council "
"Aboriginal Tourism Advisory Council 5th Floor Terrace Building 9515 107 Street Edmonton AB T5K 2C3 Phone 780 427-9687 Fax 780 422-7235 Email foip.fintprccs#gov.ab.ca"
"Acadia Foundation "
"Acadia Foundation PO Box 96 Oyen AB T0J 2J0 Phone 403 664-3384 Fax 403 664-3316 Email acadiafoundation#telus.net"
"Access Advisory Council "
"Access Advisory Council 12th Floor Centre West Building 10035 108 Street Edmonton AB T5J 3E1 Phone 780 427-2805 Fax 780 422-3204 Email barb.joyner#gov.ab.ca"
"ACCM Benevolent Association "
"ACCM Benevolent Association Suite 100 9403 95 Avenue Edmonton AB T6C 4M7 Phone 780 468-4648 Fax 780 468-4648 Email accmmanor#shaw.ca"
"Acme Municipal Library "
"Acme Municipal Library PO Box 326 Acme AB T0M 0A0 Phone 403 546-3845 Fax 403 546-2248 Email aamlibrary#marigold.ab.ca"
Likewise, if I try address.match(/#{title}/) I get nil.
I'm assuming you're using Ruby 1.9 or higher.
It's possible that the trailing whitespace is a non-breaking space:
p "Relations\u00a0" # looks like a trailing space, but strip won't remove it
to get rid of it:
"Relations\u00a0".gsub!(/^\u00a0|\u00a0$/, '') # => "Relations"
A more generic solution for all unicode whitespace:
"Relations\u00a0".gsub!(/^[[:space:]]|[[:space:]]$/, '') # => "Relations"
To see what the character is in your case:
title[-1].ord # => 160 (example only)
'%x' % title[-1].ord # => "a0" (hex equivalent; example only)
title = title[0..-2] seemed to solve it. For some reason strip and chomp wouldn't work.
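Putting the diagnosis together, here is a short sketch of the safer pattern: trim Unicode whitespace from both ends of title before substituting (this assumes the stray trailing character really is whitespace, as diagnosed above).
# \A and \z anchor the start and end of the whole string;
# [[:space:]] also matches non-breaking spaces such as \u00a0.
clean_title = title.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')
address = address.sub(clean_title, '').strip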

XML API response to JSON, or hash?

So, I'm using an API which happens to only return XML, which sucks. What I want to do is create a database entry for each record that gets returned from the API, but I'm not sure how.
The XML that gets returned is huge and has lots of whitespace characters in it... is that normal? Here is a sample of some of the XML.
<!-- ... -->
<attribute name="item_date">May 17, 2011</attribute>
<attribute name="external_url">http://missionlocal.org/2011/05/rain-camioneta-part-i/</attribute>
<attribute name="source" id="2478">Mission Loc#l</attribute>
<attribute name="excerpt"></attribute>
</attributes>
</newsitem>
<newsitem
id="5185807"
title="Lost Chrome messenger PBR bag and contents (marina / cow hollow)"
url="http://sf.everyblock.com/lost-and-found/by-date/2011/5/17/5185807/"
location_name="Van Ness and Filbert"
schema="lost-and-found"
schema_id="7"
pub_date="May 17, 2011, 12:15 p.m."
longitude="-122.424129925"
latitude="37.7995100578"
>
<attributes>
<attribute name="item_date">May 17, 2011</attribute>
<attribute name="external_url">http://sfbay.craigslist.org/sfc/laf/2386709187.html</attribute>
</attributes>
</newsitem>
<newsitem
id="5185808"
title="Plywood Update: Dumplings & Buns Aims To Be "Beard Papa Of Chinese Buns""
url="http://sf.everyblock.com/news-articles/by-date/2011/5/17/5185808/"
location_name="2411 California Street"
schema="news-articles"
schema_id="5"
pub_date="May 17, 2011, 12:15 p.m."
longitude="-122.434000442"
latitude="37.7888985667"
>
<attributes>
<attribute name="item_date">May 17, 2011</attribute>
<attribute name="external_url">http://sf.eater.com/archives/2011/05/17/dumplings_buns_aims_to_be_beard_papa_of_chinese_buns.php</attribute>
<attribute name="source" id="2155">Eater SF</attribute>
<attribute name="excerpt"></attribute>
</attributes>
</newsitem>
<newsitem
id="5185809"
title="Freebies: This week, Piazza D'Angelo (22 Miller..."
url="http://sf.everyblock.com/news-articles/by-date/2011/5/17/5185809/"
location_name="22 Miller"
schema="news-articles"
schema_id="5"
pub_date="May 17, 2011, 12:15 p.m."
longitude="-122.408894997"
latitude="37.7931966922"
>
<attributes>
<attribute name="item_date">May 17, 2011</attribute>
<attribute name="external_url">http://sf.eater.com/archives/2011/05/17/freebies_24.php</attribute>
<attribute name="source" id="2155">Eater F</attribute>
<attribute name="excerpt"></attribute>
<!-- ... -->
Any ideas?
That's not quite valid XML. That's some sort of escaped-string representation of XML, perhaps console output. It also doesn't seem to be complete. Other than that, it's fairly normal XML. Here's a smaller excerpt, unescaped and formatted:
<newsitem
id="5185807"
title="Lost Chrome messenger PBR bag and contents (marina / cow hollow)"
url="http://sf.everyblock.com/lost-and-found/by-date/2011/5/17/5185807/"
location_name="Van Ness and Filbert"
schema="lost-and-found"
schema_id="7"
pub_date="May 17, 2011, 12:15 p.m."
longitude="-122.424129925"
latitude="37.7995100578">
<attributes>
<attribute name="item_date">May 17, 2011</attribute>
<attribute name="external_url">http://sfbay.craigslist.org/sfc/laf/2386709187.html</attribute>
</attributes>
</newsitem>
You'll just need to determine what you want to extract and put in the database, and let that drive your DB design decision. Do you need multiple models with relationships intact, or are you just concerned with a subset of the data?
XML can contain whitespace without it affecting the quality of the data. A good parser, which is what you should be using to process the XML, won't care, and will give you access to the data whether the whitespace is there or not.
Nokogiri is a favorite of mine, and seems to be the de facto standard for Ruby nowadays. It is very easy to use, but you will have to learn how to tell it which nodes you want.
To get you going, here is some of the terminology:
Node is the term for a tag after it has been parsed.
Nodes have attributes, which can be accessed using node_var['attribute'].
Node text can be accessed using node_var.text or node_var.content or node_var.inner_text.
NodeSet is basically an array of Nodes.
at returns the first node matching the accessor you give the parser. % is an alias.
search returns a NodeSet of nodes matching the accessor you give the parser. / is an alias.
Here's how we can parse the snippet of XML:
require 'nokogiri'
xml =<<EOT
<newsitem
id="5185807"
title="Lost Chrome messenger PBR bag and contents (marina / cow hollow)"
url="http://sf.everyblock.com/lost-and-found/by-date/2011/5/17/5185807/"
location_name="Van Ness and Filbert"
schema="lost-and-found"
schema_id="7"
pub_date="May 17, 2011, 12:15 p.m."
longitude="-122.424129925"
latitude="37.7995100578">
<attributes>
<attribute name="item_date">May 17, 2011</attribute>
<attribute name="external_url">http://sfbay.craigslist.org/sfc/laf/2386709187.html</attribute>
</attributes>
</newsitem>
EOT
doc = Nokogiri::XML(xml)
doc.at('newsitem').text # => "\n \n May 17, 2011\n http://sfbay.craigslist.org/sfc/laf/2386709187.html\n \n"
(doc % 'attribute').content # => "May 17, 2011"
doc.at('attribute[name="external_url"]').inner_text # => "http://sfbay.craigslist.org/sfc/laf/2386709187.html"
doc.at('newsitem')['id'] # => "5185807"
newsitem = doc.at('newsitem')
newsitem['title'] # => "Lost Chrome messenger PBR bag and contents (marina / cow hollow)"
attributes = doc.search('attribute').map{ |n| n.text }
attributes # => ["May 17, 2011", "http://sfbay.craigslist.org/sfc/laf/2386709187.html"]
attributes = (doc / 'attribute').map{ |n| n.text }
attributes # => ["May 17, 2011", "http://sfbay.craigslist.org/sfc/laf/2386709187.html"]
All accesses here use CSS selectors, just like you'd use when writing web pages. That's simpler and usually clearer, but Nokogiri also supports XPath, which is very powerful and lets you offload a lot of processing to the underlying libxml2 library, which runs very fast.
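For comparison, here are two of the earlier lookups expressed with Nokogiri's XPath methods, using the same doc as above:
doc.at_xpath('//attribute[@name="external_url"]').text
# => "http://sfbay.craigslist.org/sfc/laf/2386709187.html"
doc.xpath('//newsitem').first['id'] # => "5185807"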
Nokogiri works very nicely with Ruby's Open-URI, so if you're retrieving the XML from a website, you can do it like this:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open('http://www.example.com'))
doc.to_html.size # => 2825
That's parsing HTML, which Nokogiri excels at too, but the process is the same for XML, just replace Nokogiri::HTML with Nokogiri::XML.
See "How to avoid joining all text from Nodes when scraping" also.
