I want to open this file and get all elements that start with us-gaap.
ftp://ftp.sec.gov/edgar/data/916789/0001558370-15-001143.txt
To get elements I tried like this:
str = '<html><body><us-gaap:foo>foo</us-gaap:foo></body></html>'
doc = Nokogiri::XML(str)
doc.xpath('//us-gaap:*')
Nokogiri::XML::XPath::SyntaxError: Undefined namespace prefix: //us-gaap:*
from /Users/ironsand/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/searchable.rb:165:in `evaluate'
doc.namespaces returns {}, so I think I have to add namespace us-gaap.
There are some questions about "adding namespace with Nokogiri", but they seem to be about how to create a new XML document, not how to add a namespace to an existing document.
How can I add a namespace to an existing document?
I know I can remove the namespaces with Nokogiri::XML::Document#remove_namespaces!, but I don't want to use it because it also removes necessary information.
You have asked an XY Problem. You think that the problem is that you need to add a missing namespace; the real problem is that the file you're trying to parse is not valid XML.
require 'nokogiri'
doc = Nokogiri.XML( IO.read('0001558370-15-001143.txt') )
doc.errors.length
#=> 5716
For example, the <ACCEPTANCE-DATETIME> 'element' opened on line 3 is never closed, and on line 16 there is a raw ampersand in the text:
STANDARD INDUSTRIAL CLASSIFICATION: ELECTRIC HOUSEWARES & FANS [3634]
which ought to be escaped as an entity.
However, the document has valid XML fragments within it! In particular, there is one XML document that defines the xmlns:us-gaap namespace, on lines 27243-49312. Let's extract just that, using only the knowledge that the root element defines the namespace we want, and the assumptions that no element with the same name is nested within the document, and that the root element does not have an unescaped > character in any attribute. (These assumptions are valid for this file, but may not be valid for every XML file.)
txt = IO.read('0001558370-15-001143.txt')
gaap_finder = %r{(<(\w+) [^>]+xmlns:us-gaap\s*=.+?</\2>)}m
txt.scan(gaap_finder) do |xml,_|
  doc = Nokogiri.XML( xml )
  gaaps = doc.xpath('//us-gaap:*')
  p gaaps.length
  #=> 569
end
The code above handles the case where there may be more than one XML document in the txt file, though in this case there is only one.
Decoded, the gaap_finder regex says this:
%r{...}m — this is a regular expression (that allows slashes in it, unescaped) with "multiline mode", where a period will match newline characters
(...) — capture everything we find
< — start with a literal "less-than" symbol
(\w+) — find one or more word characters (the tag name), and save them
" " — the word characters must be followed by a space (important to avoid capturing the <xsd:xbrl ...> element in this file)
[^>]+ — followed by one or more characters that is NOT a "greater-than" symbol (to ensure that we stay in the same element that we started in)
xmlns:us-gaap\s*= — followed by this literal namespace declaration (which may have whitespace separating it from the equals sign)
.+? — followed by anything (as little as possible)...
</\2> — ...up until you see a closing tag with the same name as what we captured for the name of the starting tag
Because of the way scan works when the regex has capturing groups, each result is a two-element array, where the first element is the entire captured XML and the second element is the name of the tag that we captured (which we "discard" by assigning it to the _ variable).
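To make the two-element results concrete, here is a minimal sketch that runs the same pattern over a tiny inline sample instead of the real filing. The sample text, its namespace URLs, and the constant names are made up for illustration; only the regex comes from the answer above.

```ruby
# Hypothetical miniature of the filing: invalid text wrapped around one XML fragment.
SAMPLE = <<~TXT
  <ACCEPTANCE-DATETIME>20150302
  ELECTRIC HOUSEWARES & FANS [3634]
  <xbrl xmlns="http://example.com/instance" xmlns:us-gaap="http://example.com/us-gaap">
    <us-gaap:Assets>100</us-gaap:Assets>
  </xbrl>
  trailing text
TXT

GAAP_FINDER = %r{(<(\w+) [^>]+xmlns:us-gaap\s*=.+?</\2>)}m

# With capture groups, scan returns one array per match: [whole fragment, tag name].
MATCHES = SAMPLE.scan(GAAP_FINDER)

MATCHES.each do |xml, tag_name|
  puts "root tag: #{tag_name}"
  puts "fragment starts with: #{xml[0, 20]}"
end
```

Note that the pattern needs at least one other attribute before xmlns:us-gaap (because of [^>]+), which is why the sample root element declares a default namespace first.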
If you want to be less magic about your capturing, the text file format appears to always wrap each XML document in <XBRL>...</XBRL>. So, you could do this to process every XML file (there are seven, five of which do not happen to have any us-gaap namespaces):
txt = IO.read('0001558370-15-001143.txt')
xbrls = %r{(?<=<XBRL>).+?(?=</XBRL>)}m # find text inside <XBRL>…</XBRL>
txt.scan(xbrls) do |xml|
  doc = Nokogiri.XML( xml )
  if doc.namespaces["xmlns:us-gaap"]
    gaaps = doc.xpath('//us-gaap:*')
    p gaaps.length
  end
end
#=> 569
#=> 0 (for the XML Schema document that defines the namespace)
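The wrapper-based extraction can be tried without the real file. The sketch below uses a made-up two-section sample and a plain substring check as a stand-in for the doc.namespaces lookup, so it only illustrates how scan behaves when the pattern has no capture groups (it yields plain strings rather than arrays). All sample content and constant names here are assumptions.

```ruby
# Hypothetical stand-in for the filing: two <XBRL> sections separated by invalid text.
SECTIONS = <<~TXT
  <XBRL>
  <a xmlns="urn:example:a" xmlns:us-gaap="urn:example:gaap"><us-gaap:Cash>1</us-gaap:Cash></a>
  </XBRL>
  noise & more noise
  <XBRL>
  <b xmlns="urn:example:b"><item>2</item></b>
  </XBRL>
TXT

XBRLS = %r{(?<=<XBRL>).+?(?=</XBRL>)}m

# No capture groups, so each result is the matched String itself.
FRAGMENTS = SECTIONS.scan(XBRLS)

# Cheap pre-filter (an assumption, not part of the original answer):
# keep only fragments whose text mentions the namespace declaration.
WITH_GAAP = FRAGMENTS.select { |xml| xml.include?("xmlns:us-gaap") }

puts "fragments found: #{FRAGMENTS.length}"
puts "fragments declaring us-gaap: #{WITH_GAAP.length}"
```

Each surviving fragment could then be handed to Nokogiri.XML exactly as in the answer.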
I couldn't figure out how to update an existing doc with a new namespace, but since Nokogiri will recognize namespaces on the root element, and those namespaces are, syntactically, just attributes, you can update the document with a new namespace declaration, serialize the doc to a string, and re-parse it:
str = '<html><body><us-gaap:foo>foo</us-gaap:foo></body></html>'
doc_without_ns = Nokogiri::XML(str)
doc_without_ns.root['xmlns:us-gaap'] = 'http://your/actual/ns/here'
doc = Nokogiri::XML(doc_without_ns.to_xml)
doc.xpath("//us-gaap:*")
# Returns [#<Nokogiri::XML::Element:0x3ff375583f9c name="foo" namespace=#<Nokogiri::XML::Namespace:0x3ff375583f24 prefix="us-gaap" href="http://your/actual/ns/here"> children=[#<Nokogiri::XML::Text:0x3ff375583768 "foo">]>]
I'm trying to confirm that, of all the schemas in the head of a page, exactly 3 contain a specific string. These schemas have no tags or subclasses to differentiate them from each other, only the text within them. I can confirm that the text exists within at least one of the schemas:
cy.get('head > script[type="application/ld+json"]').should('contain', '"@type":"Product"')
But what I need is to confirm that the string appears in exactly 3 of them, something like this:
cy.get('head > script[type="application/ld+json"]').contains('"@type":"Product"').should('have.length', 3)
And I can't seem to find a way to get this to work since .filter, .find, .contains, etc don't filter down the way I need them to. Any suggestions? At this point it seems like I either need to import a custom library or get someone to add ids to these specific schema. Thanks!
The first thing to note is that .contains() always yields a single result, even when many elements match.
It's not very explicit in the docs, but this is what it says:
Yields
.contains() yields the new DOM element it found.
If you run
cy.get('head > script[type="application/ld+json"]')
.contains('"@type":"Product"')
.then(console.log) // logs an object with length: 1
and open up the object logged in devtools, you'll see length: 1, but if you remove the .contains('"@type":"Product"') the log will show a higher length.
You can avoid this by using the jQuery :contains() selector
cy.get('script[type="application/ld+json"]:contains("@type\": \"Product")')
.then(console.log) // logs an object with length: 3
.should('have.length', 3);
Note the inner parts of the search string have escape chars (\) for quote marks that are part of the search string.
If you want to avoid escape chars, use a bit of JavaScript inside a .then() to filter
cy.get('script[type="application/ld+json"]')
.then($els => $els.filter((index, el) => el.innerText.includes('"@type": "Product"')) )
.then(console.log) // logs an object with length: 3
.should('have.length', 3);
I am parsing a Wiki text from an XML dump, for a string named 'section' which includes templates in double braces, including some arguments, which I want to reorganize.
This has an example named TextTerm:
section="Sample of a text with a first template {{TextTerm|arg1a|arg2a|arg3a...}} and then a second {{TextTerm|arg1b|arg2b|arg3b...}} etc."
I can use scan and a regex to get each template and work on it on a loop using:
section.scan(/\{\{(TextTerm)\|(.*?)\|(.*?)\}\}/i).each do |item|
  puts "1=" + item[1] # arg1a etc.
end
With that, I have been able to build a database of the first arguments of the template.
Now I also want to rename the template to "NewTextTerm" and reorder its arguments by putting the second argument in place of the first.
Can I do it in the same loop? For example, by replacing scan with gsub(regexp) { block }:
section.gsub!(/\{\{(TextTerm)\|(.*?)\|(.*?)\}\}/) { |item| '{{NewTextTerm|\2|\1}}'}
I get:
"Sample of a text with a first template {{NewTextTerm|\\2|\\1}} and then a second {{NewTextTerm|\\2|\\1}} etc."
meaning that the arguments of the regexp are not recognized. Even if it worked, I would like to have some place within the gsub block to work on the arguments. For example, I can't have a puts in the gsub block similar to the scan().each block but only a string to be substituted.
Any ideas are welcome.
PS: Some editing: braces and "section=" added; the code is now complete.
When you have the replacement as a string argument, you can use '\1', etc. like this:
string.gsub!(regex, '...\1...\2...')
When you have the replacement as a block, you can use "#$1", etc. like this:
string.gsub!(regex){"...#$1...#$2..."}
You are mixing the uses. Stick to either one.
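As a small illustration of the two forms side by side, here is a sketch using a simplified two-argument template. The sample text and the method names are invented for this example. The third method shows that a block can do arbitrary work (here, collecting the first argument of each match) before returning the replacement, using Regexp.last_match instead of the $1-style globals.

```ruby
TEMPLATE = /\{\{TextTerm\|(.*?)\|(.*?)\}\}/

# String replacement: backreferences are written as \1, \2 in single quotes.
def swap_with_string(text)
  text.gsub(TEMPLATE, '{{NewTextTerm|\2|\1}}')
end

# Block replacement: the captures are available as $1, $2 inside the block.
def swap_with_block(text)
  text.gsub(TEMPLATE) { "{{NewTextTerm|#{$2}|#{$1}}}" }
end

# The block is ordinary Ruby, so it can also record or transform the captures.
def swap_and_collect(text, seen)
  text.gsub(TEMPLATE) do
    m = Regexp.last_match
    seen << m[1]
    "{{NewTextTerm|#{m[2]}|#{m[1]}}}"
  end
end

text = "See {{TextTerm|alpha|beta}} and {{TextTerm|gamma|delta}}."
puts swap_with_string(text)
puts swap_with_block(text)
```

Both calls print the same line, with the two arguments of each template swapped.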
Yes, changing the single quotes to double quotes isn't enough; #$1 is the answer. Here is the complete code:
section="Sample of a text with a first template {{TextTerm|arg1a|arg2a|arg3a...}} and then a second {{TextTerm|arg1b|arg2b|arg3b...}} etc."
section.gsub(/\{\{(TextTerm)\|(.*?)\|(.*?)\}\}/) { |item| "{{New#$1|#$3|#$2}}"}
"Sample of a text with a first template {{NewTextTerm|arg2a|arg3a...|arg1a}} and then a second {{NewTextTerm|arg2b|arg3b...|arg1b}} etc."
Thus, it works. Thanks.
But now I need to replace the string with a "function" that returns the changed string:
def stringreturn(arg1, arg2, arg3)
  "{{New" + arg1 + arg3 + arg2 + "}}"
end
and
section.gsub(/\{\{(TextTerm)\|(.*?)\|(.*?)\}\}/) { |item| stringreturn("#$1","|#$2","|#$3") }
will return:
"Sample of a text with a first template {{NewTextTerm|arg2a|arg3a...|arg1a}} and then a second {{NewTextTerm|arg2b|arg3b...|arg1b}} etc."
Thanks to all!
There is probably a better way to manipulate arguments in MediaWiki templates using Ruby.
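One possible direction, offered only as a sketch: capture the whole argument list once and split it on the pipe character, so each argument can be handled individually. The helper name rewrite_templates is hypothetical, and the approach assumes templates are not nested and that no argument contains a pipe (for example inside a wiki link). Unlike the lazy three-group regex above, this swaps only the first two arguments and leaves the rest in place.

```ruby
# Hypothetical helper: rename a template and swap its first two arguments.
# Assumes no nested templates and no "|" inside an argument.
def rewrite_templates(text, old_name, new_name)
  pattern = /\{\{#{Regexp.escape(old_name)}\|(.*?)\}\}/m
  text.gsub(pattern) do
    # -1 keeps trailing empty arguments such as "a||"
    args = Regexp.last_match(1).split("|", -1)
    args[0], args[1] = args[1], args[0] if args.length >= 2
    "{{#{new_name}|#{args.join('|')}}}"
  end
end

section = "First {{TextTerm|arg1a|arg2a|arg3a}} then {{TextTerm|arg1b|arg2b|arg3b}}."
puts rewrite_templates(section, "TextTerm", "NewTextTerm")
```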
I want to check whether an XML document contains a node with the value "Hotel Hafen Hamburg".
But I get the error.
SimpleXMLElement::xpath(): Invalid predicate on line 25
You can view the xml here.
http://de.sourcepod.com/dkdtrb22-19748
Until now I have written the following code.
$apiUmgebungUrl = "xml.xml";
$xml_umgebung = simplexml_load_file($apiUmgebungUrl);
echo $nameexist = $xml_umgebung->xpath('boolean(//result/name[@Hotel Hafen Hamburg');
It seems that your parentheses and brackets are not closed properly at the end of your XPath expression - it should end with ]).
Also, what is Hotel Hafen Hamburg? If it is an attribute called value, your value check should look like this:
[@value="Hotel Hafen Hamburg"]
You cannot just write @ and then a value, without specifying where that value is supposed to be.
EDIT: Looking at the XML document, it seems that Hotel Hafen Hamburg is supposed to be the text content of the <name> element. Therefore, try looking for a text node with that value rather than an attribute:
boolean(//result/name[text() = "Hotel Hafen Hamburg"])
I want to replace each HTTP link in a text with the actual HTML markup for that link.
Here is my Ruby code
url_check = Regexp.new( '(\A|[\n ])([\w]+?://[\w]+[^ \"\r\n\t<]*)', Regexp::MULTILINE | Regexp::IGNORECASE )
self.gsub!(url_check, '\1\2')
to_s
Here is a test case:
This is entrance page for the service (using HTML):
http://foobar.org/resources?format=html
Let us pick the "contributions" namespace: http://foobar.org/
The link is created only for the second case, but not for the first (which has several line breaks before it).
I suggest using \b (word boundary) instead of new-line/start-of-the-line detection:
.gsub!(/\b([\w]+?:\/\/[\w]+[^ \"\r\n\t<]*)/i, '\1')
You don't need "http:" in the replacement, as you already match the protocol.
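A minimal sketch of the word-boundary approach on text shaped like the test case follows. The anchor markup in the replacement string and the method name linkify are assumptions made for this example, since the exact markup used in the thread is not shown; only the pattern follows the suggestion above.

```ruby
# \b lets the match start wherever a word begins, regardless of what
# whitespace (or how many line breaks) came before the URL.
URL_PATTERN = %r{\b(\w+?://\w+[^ "\r\n\t<]*)}i

# Hypothetical helper: wrap every matched URL in an anchor tag.
def linkify(text)
  text.gsub(URL_PATTERN, '<a href="\1">\1</a>')
end

sample = "Entrance page (using HTML):\n\n\n" \
         "http://foobar.org/resources?format=html\n" \
         "Pick a namespace: http://foobar.org/"

puts linkify(sample)
```

Because the protocol is part of capture group 1, the replacement reuses it as-is and does not prepend anything.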
How can I get H1,H2,H3 contents in one single xpath expression?
I know I could do this.
//html/body/h1/text()
//html/body/h2/text()
//html/body/h3/text()
and so on.
Use:
/html/body/*[self::h1 or self::h2 or self::h3]/text()
The following expression is incorrect:
//html/body/*[local-name() = "h1"
or local-name() = "h2"
or local-name() = "h3"]/text()
because it may select text nodes that are children of unwanted:h1, different:h2, someWeirdNamespace:h3.
Another recommendation: Always avoid using // when the structure of the XML document is statically known. Using // most often results in significant inefficiencies because it causes the complete document (sub)tree rooted in the context node to be traversed.