Nokogiri removing XML encoding

I am using Nokogiri to decode some XML. This XML has some HTML as values. I am seeing some strange behavior when parsing it: Nokogiri appears to be removing some of the HTML-encoded tags, so when I parse the HTML I am unable to decode it properly. See the examples below:
require 'nokogiri'

doc = Nokogiri::XML '<?xml version="1.0"?><manifest
xmlns="http://www.imsglobal.org/xsd/imscp_v1p1"
identifier="Manifest-eaf97d26-aa83-4399-8e9b-ae9f6f5fc6a2"
xmlns="http://www.imsglobal.org/xsd/imscp_v1p1"
xmlns:imsmd="http://www.imsglobal.org/xsd/imsmd_v1p2"
xmlns:imsqti="http://www.imsglobal.org/xsd/imsqti_v2p1">
<imsmd:langstring><p>
 These are the<strong>instructions</strong> for the pool</p></imsmd:langstring>'
This yields the following value:
"<?xml version=\"1.0\"?>\n<manifest xmlns=\"http://www.imsglobal.org/xsd/imscp_v1p1\" xmlns:imsmd=\"http://www.imsglobal.org/xsd/imsmd_v1p2\" xmlns:imsqti=\"http://www.imsglobal.org/xsd/imsqti_v2p1\" identifier=\"Manifest-eaf97d26-aa83-4399-8e9b-ae9f6f5fc6a2\">\n<imsmd:langstring>p
 These are thestrong instructions/strong for the pool/p</imsmd:langstring></manifest>\n"
Notice how the < and > characters are missing. However, the following works as expected:
doc = Nokogiri::XML '<?xml version="1.0"?><imsmd:langstring><p>
 These are the<strong> instructions</strong> for the pool</p></imsmd:langstring>'
and gives the following result:
"<?xml version=\"1.0\"?>\n<imsmd:langstring><p>
 These are the<strong> instructions</strong> for the pool</p></imsmd:langstring>\n"
I am sure I am missing something but can't figure out what is causing this.
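One thing worth checking (a diagnostic sketch, not from the original post; xml_string stands for the manifest snippet above) is whether the parser reported errors while recovering from the markup, since Nokogiri collects them on the document:

require 'nokogiri'

doc = Nokogiri::XML(xml_string)
# Nokogiri parses in recovering mode by default; any syntax errors it
# worked around (unknown tags, unclosed elements, etc.) are listed here
doc.errors.each { |error| puts error }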

Related

Searching savon response as nokogiri document returns an empty array

I am trying to parse Savon's response as a Nokogiri document:
c = Savon.client(wsdl: 'http://test.fedresurs.ru/MessageService/WebService.svc?wsdl', digest_auth: ['demowebuser', 'Ax!761BN'], namespace: "http://tempuri.org/", namespace_identifier: :tem, log: true)
r = c.call(:get_trade_messages, message: {'tem:startFrom' => DateTime.now-1})
r.doc.search("TradePlace")
and it returns an empty array.
What am I doing wrong? Maybe I should deal with namespaces somehow, but how? The examples I found in the Nokogiri documentation use XPath, not search, and even with XPath it returns an empty array.
XML-response:
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
<s:Body xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<GetTradeMessagesResponse xmlns="http://tempuri.org/">
<GetTradeMessagesResult>
<TradePlace INN="7606055642" Name="Первая электронная площадка " Site="1torgi.ru " OwnerName="ООО "Промтех"">
<TradeList>
<Trade ID_External="ЗКОФЦП-17136" ID_EFRSB="653476">
<MessageList>
<TradeMessage ID="4851134"/>
<TradeMessage ID="4851135"/>
</MessageList>
</Trade>
</TradeList>
</TradePlace>
</GetTradeMessagesResult>
</GetTradeMessagesResponse>
</s:Body>
</s:Envelope>
You can use Nokogiri to break apart the XML response.
A (now non-functional) example is this:
doc = Nokogiri::XML(response.to_hash[:get_quote_response][:get_quote_result])
print doc.to_xml(indent: 2)
print "Date : ", doc.at_css("Date").text, "\n"
print "Last price: ", doc.at_css("Last").text
A more complete example is in my pastebin: https://pastebin.com/W0RUuaHU. The WebserviceX service was unfortunately discontinued.
As I expected, the answer was in the namespaces; the code below works fine:
r.doc.search("a|TradePlace", {"a" => "http://tempuri.org/"})
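Two other common ways to handle the namespace (a sketch, not from the original answer) are to register it for an XPath query or to strip namespaces entirely before searching:

# Register the default namespace under a prefix and query with XPath
r.doc.xpath("//t:TradePlace", "t" => "http://tempuri.org/")

# Or drop all namespaces first, then search by plain element name
doc = r.doc
doc.remove_namespaces!
doc.search("TradePlace")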

Parsing out contents of XML tag in Ruby

I have an XML document that, as I understand it, is already structured by tags. My goal is to parse all the information inside the <GetResidentsContactInfoResult> tag. In the sample XML below, this tag contains two records, each beginning with the Lease PropertyId key. How can I iterate over the <GetResidentsContactInfoResult> tag and print out the key/value pairs for each record? I'm new to Ruby and to working with XML files; is this something I can do with Nokogiri?
<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<soap:Body>
<GetResidentsContactInfoResponse xmlns="http://tempuri.org/">
<GetResidentsContactInfoResult><PropertyResidents><Lease PropertyId="21M" BldgID="00" UnitID="0903" ResiID="3" occustatuscode="P" occustatuscodedescription="Previous" MoveInDate="2016-01-07T00:00:00" MoveOutDate="2016-02-06T00:00:00" LeaseBeginDate="2016-01-07T00:00:00" LeaseEndDate="2017-01-31T00:00:00" MktgSource="DBY" PrimaryEmail="noemail1#fake.com"><Occupant PropertyId="21M" BldgID="00" UnitID="0903" ResiID="3" OccuSeqNo="3444755" OccuFirstName="Efren" OccuLastName="Cerda" Phone2No="(832) 693-9448" ResponsibleFlag="Responsible" /></Lease><Lease PropertyId="21M" BldgID="00" UnitID="0908" ResiID="2" occustatuscode="P" occustatuscodedescription="Previous" MoveInDate="2016-02-20T00:00:00" MoveOutDate="2016-04-25T00:00:00" LeaseBeginDate="2016-02-20T00:00:00" LeaseEndDate="2017-02-28T00:00:00" MktgSource="PW" PrimaryEmail="noemail1#fake.com"><Occupant PropertyId="21M" BldgID="00" UnitID="0908" ResiID="2" OccuSeqNo="3451301" OccuFirstName="Donna" OccuLastName="Mclean" Phone2No="(713) 785-4240" ResponsibleFlag="Responsible" /></Lease></PropertyResidents></GetResidentsContactInfoResult>
</GetResidentsContactInfoResponse>
</soap:Body>
</soap:Envelope>
This uses Nokogiri to find all the GetResidentsContactInfoResponse elements, and then Active Support to convert the inner text to a hash of key-value pairs.
Read "sparklemotion/nokogiri" and "Tutorials" regarding installing and using Nokogiri.
Read "Active Support Core Extensions" about more capabilities of Active Support (though the guide does not include Hash.from_xml). To install it simply do gem install activesupport.
I assume you're fine with Nokogiri as you mentioned it in your question.
If you don't want to use Active Support, consider looking into "Convert a Nokogiri document to a Ruby Hash" as an alternative to the line Hash.from_xml(elm.text):
# Needed in order to use `Hash.from_xml`
require 'active_support/core_ext/hash/conversions'
require 'nokogiri'

def find_key_values(str)
  doc = Nokogiri::XML(str)
  # Ignore namespaces for easier traversal
  doc.remove_namespaces!
  doc.css('GetResidentsContactInfoResponse').map do |elm|
    Hash.from_xml(elm.text)
  end
end
Usage:
# Option 1: if your XML above is stored in a variable called `string`
find_key_values string
# Option 2: if your XML above is stored in a file
find_key_values File.open('/path/to/file')
Which returns:
[{"PropertyResidents"=>
{"Lease"=>
[{"PropertyId"=>"21M",
"BldgID"=>"00",
"UnitID"=>"0903",
"ResiID"=>"3",
"occustatuscode"=>"P",
"occustatuscodedescription"=>"Previous",
"MoveInDate"=>"2016-01-07T00:00:00",
"MoveOutDate"=>"2016-02-06T00:00:00",
"LeaseBeginDate"=>"2016-01-07T00:00:00",
"LeaseEndDate"=>"2017-01-31T00:00:00",
"MktgSource"=>"DBY",
"PrimaryEmail"=>"noemail1#fake.com",
"Occupant"=>
{"PropertyId"=>"21M",
"BldgID"=>"00",
"UnitID"=>"0903",
"ResiID"=>"3",
"OccuSeqNo"=>"3444755",
"OccuFirstName"=>"Efren",
"OccuLastName"=>"Cerda",
"Phone2No"=>"(832) 693-9448",
"ResponsibleFlag"=>"Responsible"}},
{"PropertyId"=>"21M",
"BldgID"=>"00",
"UnitID"=>"0908",
"ResiID"=>"2",
"occustatuscode"=>"P",
"occustatuscodedescription"=>"Previous",
"MoveInDate"=>"2016-02-20T00:00:00",
"MoveOutDate"=>"2016-04-25T00:00:00",
"LeaseBeginDate"=>"2016-02-20T00:00:00",
"LeaseEndDate"=>"2017-02-28T00:00:00",
"MktgSource"=>"PW",
"PrimaryEmail"=>"noemail1#fake.com",
"Occupant"=>
{"PropertyId"=>"21M",
"BldgID"=>"00",
"UnitID"=>"0908",
"ResiID"=>"2",
"OccuSeqNo"=>"3451301",
"OccuFirstName"=>"Donna",
"OccuLastName"=>"Mclean",
"Phone2No"=>"(713) 785-4240",
"ResponsibleFlag"=>"Responsible"}}]}}]

Testing Nokogiri XML generation with blank nodes

I'm having a bit of trouble testing some XML generation using Nokogiri when the node is blank. I'm using Minitest to compare the generated XML string with a template fixture file. My test fails with the blank node as Minitest is comparing <Node></Node> with <Node />.
XML Generation
builder = Nokogiri::XML::Builder.new encoding: "UTF-8" do |xml|
  xml.Header
  xml.FileName @object.filename
end
Template file
This is the file I'm using as a fixture in my tests
<?xml version="1.0" encoding="UTF-8"?>
<Header/>
<FileName></FileName>
Minitest output
3) Failure:
--- expected
+++ actual
@@ -25,7 +25,7 @@
<Header />
- <FileName/>
+ <FileName></FileName>
As you can see, Minitest is trying to compare a self-closing tag with a non-self-closing tag and making the test fail. Changing the fixture tag to a self-closing one results, strangely, in exactly the same error message.
It's because sometimes @object.filename is nil. If I have a blank XML node (as in xml.Header above), using a self-closing tag in my fixture works with no problem.
I would use an XML schema in this case:
def test_that_xml_data_conforms_to_schema
  xml_data = ...
  schema_data = ...
  fragment = Nokogiri::XML.parse(xml_data)
  schema = Nokogiri::XML::Schema(schema_data)
  assert schema.valid?(fragment)
end
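If maintaining a schema is more than you need, another option (a sketch, not from the original answer; expected_xml and generated_xml are placeholder names) is to parse both sides with Nokogiri and compare a normalized serialization, which sidesteps the <Node/> versus <Node></Node> difference:

def assert_same_xml(expected_xml, generated_xml)
  expected = Nokogiri::XML(expected_xml) { |config| config.noblanks }
  actual   = Nokogiri::XML(generated_xml) { |config| config.noblanks }
  # Canonicalization (C14N) serializes equivalent documents the same way,
  # so empty elements no longer differ by self-closing style
  assert_equal expected.canonicalize, actual.canonicalize
end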

I always get an UndefinedConversionError in Ruby 2.0 while scraping with Mechanize

When I try to submit a textarea with Mechanize and Ruby 2.0, I always get an
Encoding::UndefinedConversionError: U+0151 from UTF-8 to ISO-8859-1
Then I tried to convert the text with Iconv and got a similar result:
Iconv.iconv("LATIN1", "UTF-8", text)
I get this error message:
Iconv::IllegalSequence: "őzködik, melyet "...
The text contains East European characters. What can I do to avoid this kind of inconvenience, or how can I convert properly between different encodings?
I have found an elegant solution:
replacements = [["À", "À"], ["Á", "Á"], ["Â", "Â"], ["Ã", "Ã"], ["Ä", "Ä"], ["Å", "Å"], ["Æ", "Æ"], ["Ç", "Ç"], ["È", "È"], ["É", "É"], ["Ê", "Ê"], ["Ë", "Ë"], ["Ì", "Ì"], ["Í", "Í"], ["Î", "Î"], ["Ï", "Ï"], ["Ð", "Ð"], ["Ñ", "Ñ"], ["Ò", "Ò"], ["Ó", "Ó"], ["Ô", "Ô"], ["Õ", "Õ"], ["Ö", "Ö"], ["Ø", "Ø"], ["Ù", "Ù"], ["Ú", "Ú"], ["Û", "Û"], ["Ü", "Ü"], ["Ý", "Ý"], ["Þ", "Þ"], ["ß", "ß"], ["à", "à"], ["á", "á"], ["â", "â"], ["ã", "ã"], ["ä", "ä"], ["å", "å"], ["æ", "æ"], ["ç", "ç"], ["è", "è"], ["é", "é"], ["ê", "ê"], ["ë", "ë"], ["ì", "ì"], ["í", "í"], ["î", "î"], ["ï", "ï"], ["ð", "ð"], ["ñ", "ñ"], ["ò", "ò"], ["ó", "ó"], ["ô", "ô"], ["õ", "õ"], ["ö", "ö"], ["ø", "ø"], ["ù", "ù"], ["ú", "ú"], ["û", "û"], ["ü", "ü"], ["ý", "ý"], ["þ", "þ"], ["ÿ", "ÿ"]]
def replace(str, replacements)
  replacements.each { |replacement| str.gsub!(replacement[0], replacement[1]) }
  return str
end

my_string = replace(my_string, replacements)
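On Ruby 2.0 and later, a simpler route (a sketch, not part of the original answer) is usually String#encode with explicit fallback options, which avoids both Iconv and a hand-written replacement table:

text = "őzködik, melyet"                  # contains U+0151, which ISO-8859-1 cannot represent
latin1 = text.encode("ISO-8859-1",
                     invalid: :replace,   # replace bytes invalid in the source encoding
                     undef:   :replace,   # replace characters undefined in the target encoding
                     replace: "?")        # substitution character
# => "?zködik, melyet" (no Encoding::UndefinedConversionError is raised)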

Base64.decode64 in ruby returning strange results

I'm having problems decoding a string using Base64.decode64 in Ruby. For comparison, I'm using this site, which decodes such strings in PHP: https://rnd.feide.no/simplesaml/module.php/saml2debug/debug.php.
As a test, I'm using this string:
fZJNT%2BMwEIbvSPwHy%2Fd8tMvHympSdUGISuwS0cCBm%2BtMUwfbk%2FU4zfLvSVMq2Euv45n3fd7xzOb%2FrGE78KTRZXwSp5yBU1hpV2f8ubyLfvJ5fn42I2lNKxZd2Lon%2BNsBBTZMOhLjQ8Y77wRK0iSctEAiKLFa%2FH4Q0zgVrceACg1ny9uMy7rCdaM2%2Bs0BWrtppK2UAdeoVjW2ruq1bevGImcvR6zpHmtJ1MHSUZAuDKU0vY7Si2h6VU5%2BiMuJuLx65az4dPql3SHBKaz1oYnEfVkWUfG4KkeBna7A%2Fxm6M14j1gZihZazBRH4MODcoKPOgl%2BB32kFz08PGd%2BG0JJIkr7v46%2BhRCaEpod17DCRivYZCkmkd4N28B3wfNyrGKP5bws9DS6PKDz%2FMpsl36Tyz%2F%2Fax1jeFmi0emcLY7C%2F8SDD0Z7dobcynHbbV3QVbcZW0TlqQemNhoqzJD%2B4%2Fn8Yw7l8AA%3D%3D
The output should be:
<?xml version="1.0" encoding="UTF-8"?>
<samlp:AuthnRequest xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol" ID="agdobjcfikneommfjamdclenjcpcjmgdgbmpgjmo" Version="2.0" IssueInstant="2007-04-26T13:51:56Z" ProtocolBinding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-POST" ProviderName="google.com" AssertionConsumerServiceURL="https://www.google.com/a/solweb.no/acs" IsPassive="true"><saml:Issuer xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion">google.com</saml:Issuer><samlp:NameIDPolicy AllowCreate="true" Format="urn:oasis:names:tc:SAML:2.0:nameid-format:unspecified" /></samlp:AuthnRequest>
But in Ruby I keep getting a strange output:
}\222MO?0\206?H?\a??|\264???jRuA\210J????\233?LS\aۓ?8???IS*?K\257??}????\377\254a;??e|\022\247\234\201SXiWg????~?y~~6#iM+\026]غ'??6L:\022?C?;?J?$\234\264#\"(\261Z?~?8\255ǀ\n\rg??\214˺?u\2436??Z?i\244\255\224\001רV5\266\256?m??"g/G\254?kI???Q\220.\f\2454\275\216ҋhzUN~\210ˉ\270\274z??t???!?)\254????}Y\026Q?*G\201\235\256??\031\2723^#?b\205\226\263\005\021?0?ܠ\243΂_\201?i\005?O\017\031߆ВH\222\276??\241D&\204\246\207u?0?\212?
I\244w\203v??|ܫ\030\243?o
.\217(<\3772\233%ߤ?????X?h\264zg\vc\260\277? ??\236ݡ\2672\234v?Wt\025m?V?9jA鍆\212\263$?\270\376\177\030ù|\000
The code I use is:
require 'cgi'
require 'base64'
Base64::decode64(CGI::unescape('fZJNT%2BMwEIbvSPwHy%2Fd8tMvHympSdUGISuwS0cCBm%2BtMUwfbk%2FU4zfLvSVMq2Euv45n3fd7xzOb%2FrGE78KTRZXwSp5yBU1hpV2f8ubyLfvJ5fn42I2lNKxZd2Lon%2BNsBBTZMOhLjQ8Y77wRK0iSctEAiKLFa%2FH4Q0zgVrceACg1ny9uMy7rCdaM2%2Bs0BWrtppK2UAdeoVjW2ruq1bevGImcvR6zpHmtJ1MHSUZAuDKU0vY7Si2h6VU5%2BiMuJuLx65az4dPql3SHBKaz1oYnEfVkWUfG4KkeBna7A%2Fxm6M14j1gZihZazBRH4MODcoKPOgl%2BB32kFz08PGd%2BG0JJIkr7v46%2BhRCaEpod17DCRivYZCkmkd4N28B3wfNyrGKP5bws9DS6PKDz%2FMpsl36Tyz%2F%2Fax1jeFmi0emcLY7C%2F8SDD0Z7dobcynHbbV3QVbcZW0TlqQemNhoqzJD%2B4%2Fn8Yw7l8AA%3D%3D'))
What could possibly be wrong? Thanks in advance.
I don't know where you got the idea that your string is a Base64-encoded version of your XML. If you pass the first bit of it (<?x) through Base64.encode64() and then CGI.escape(), you get:
PD94
at the start, which is nothing like your string. In fact, your first four characters "fZJN" are values 31, 25, 9 and 13 in base 64, so they will give you:
011111 011001 001001 001101
then, grouping them in octets instead of sextets (I guess that's the right word):
01111101 10010010 01001101
7D 92 4D
which are not the characters you're expecting to see.
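You can verify this byte arithmetic directly in irb (a quick check, not part of the original answer):

require 'base64'

Base64.decode64("fZJN").unpack("C*").map { |byte| format("%02X", byte) }
# => ["7D", "92", "4D"]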
Putting the whole string in gives you:
PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4gPHNhbWxw
OkF1dGhuUmVxdWVzdCB4bWxuczpzYW1scD0idXJuOm9hc2lzOm5hbWVzOnRj
OlNBTUw6Mi4wOnByb3RvY29sIiBJRD0iYWdkb2JqY2Zpa25lb21tZmphbWRj
bGVuamNwY2ptZ2RnYm1wZ2ptbyIgVmVyc2lvbj0iMi4wIiBJc3N1ZUluc3Rh
bnQ9IjIwMDctMDQtMjZUMTM6NTE6NTZaIiBQcm90b2NvbEJpbmRpbmc9InVy
bjpvYXNpczpuYW1lczp0YzpTQU1MOjIuMDpiaW5kaW5nczpIVFRQLVBPU1Qi
IFByb3ZpZGVyTmFtZT0iZ29vZ2xlLmNvbSIgQXNzZXJ0aW9uQ29uc3VtZXJT
ZXJ2aWNlVVJMPSJodHRwczovL3d3dy5nb29nbGUuY29tL2Evc29sd2ViLm5v
L2FjcyIgSXNQYXNzaXZlPSJ0cnVlIj48c2FtbDpJc3N1ZXIgeG1sbnM6c2Ft
bD0idXJuOm9hc2lzOm5hbWVzOnRjOlNBTUw6Mi4wOmFzc2VydGlvbiI+Z29v
Z2xlLmNvbTwvc2FtbDpJc3N1ZXI+PHNhbWxwOk5hbWVJRFBvbGljeSBBbGxv
d0NyZWF0ZT0idHJ1ZSIgRm9ybWF0PSJ1cm46b2FzaXM6bmFtZXM6dGM6U0FN
TDoyLjA6bmFtZWlkLWZvcm1hdDp1bnNwZWNpZmllZCIgLz48L3NhbWxwOkF1
dGhuUmVxdWVzdD4=
When you escape that, you get:
PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4gPHNhbWxw%0AOkF1dGhuUmVxdWVzdCB4bWxuczpzYW1scD0idXJuOm9hc2lzOm5hbWVzOnRj%0AOlNBTUw6Mi4wOnByb3RvY29sIiBJRD0iYWdkb2JqY2Zpa25lb21tZmphbWRj%0AbGVuamNwY2ptZ2RnYm1wZ2ptbyIgVmVyc2lvbj0iMi4wIiBJc3N1ZUluc3Rh%0AbnQ9IjIwMDctMDQtMjZUMTM6NTE6NTZaIiBQcm90b2NvbEJpbmRpbmc9InVy%0AbjpvYXNpczpuYW1lczp0YzpTQU1MOjIuMDpiaW5kaW5nczpIVFRQLVBPU1Qi%0AIFByb3ZpZGVyTmFtZT0iZ29vZ2xlLmNvbSIgQXNzZXJ0aW9uQ29uc3VtZXJT%0AZXJ2aWNlVVJMPSJodHRwczovL3d3dy5nb29nbGUuY29tL2Evc29sd2ViLm5v%0AL2FjcyIgSXNQYXNzaXZlPSJ0cnVlIj48c2FtbDpJc3N1ZXIgeG1sbnM6c2Ft%0AbD0idXJuOm9hc2lzOm5hbWVzOnRjOlNBTUw6Mi4wOmFzc2VydGlvbiI%2BZ29v%0AZ2xlLmNvbTwvc2FtbDpJc3N1ZXI%2BPHNhbWxwOk5hbWVJRFBvbGljeSBBbGxv%0Ad0NyZWF0ZT0idHJ1ZSIgRm9ybWF0PSJ1cm46b2FzaXM6bmFtZXM6dGM6U0FN%0ATDoyLjA6bmFtZWlkLWZvcm1hdDp1bnNwZWNpZmllZCIgLz48L3NhbWxwOkF1%0AdGhuUmVxdWVzdD4%3D%0A
So, the bottom line is that you're getting junk from the decode because the data is not of the correct format.
It appears that the data is also deflated/compressed.
require 'cgi'
require 'base64'
require 'zlib'
# URL-decode and Base64-decode to get the raw deflated bytes
deflated = Base64::decode64(CGI::unescape('fZJNT%2BMwEIbvSPwHy%2Fd8tMvHympSdUGISuwS0cCBm%2BtMUwfbk%2FU4zfLvSVMq2Euv45n3fd7xzOb%2FrGE78KTRZXwSp5yBU1hpV2f8ubyLfvJ5fn42I2lNKxZd2Lon%2BNsBBTZMOhLjQ8Y77wRK0iSctEAiKLFa%2FH4Q0zgVrceACg1ny9uMy7rCdaM2%2Bs0BWrtppK2UAdeoVjW2ruq1bevGImcvR6zpHmtJ1MHSUZAuDKU0vY7Si2h6VU5%2BiMuJuLx65az4dPql3SHBKaz1oYnEfVkWUfG4KkeBna7A%2Fxm6M14j1gZihZazBRH4MODcoKPOgl%2BB32kFz08PGd%2BG0JJIkr7v46%2BhRCaEpod17DCRivYZCkmkd4N28B3wfNyrGKP5bws9DS6PKDz%2FMpsl36Tyz%2F%2Fax1jeFmi0emcLY7C%2F8SDD0Z7dobcynHbbV3QVbcZW0TlqQemNhoqzJD%2B4%2Fn8Yw7l8AA%3D%3D'))
# Negative window bits tell Zlib to expect a raw deflate stream (no zlib header)
zlib = Zlib::Inflate.new(-Zlib::MAX_WBITS)
zlib.inflate(deflated)
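For context, this is the standard SAML HTTP-Redirect binding encoding: the AuthnRequest XML is compressed with raw DEFLATE, Base64-encoded, and then URL-encoded. That is also why the window-bits argument is negative; it tells Zlib to inflate a raw deflate stream that has no zlib header. Calling puts zlib.inflate(deflated) should print the AuthnRequest XML shown earlier.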
