MSXML / XPath: special Characters - xpath

I'm reading and writing XML files with Microsoft XML Core Services 6.0 (MSXML).
When writing element content with "special" characters that have to be escaped
in the context of XML, like writing "&" as "&amp;", I don't have to care
about this because MSXML does the conversion. This means that if I assign a text
to an element, e.g. oXMLElement.Text = "1 & 2", MSXML actually writes
1 &amp; 2 when I create an XML file. That's pretty nice
and saves me some work.
Now, what I want to do is "de-mask" XML strings
automatically. I read from an XML file with the selectNodes method, which
takes an XPath statement, e.g. //ns:element/text(). Unfortunately,
the result string I get looks like 1 &amp; 2 and not like 1 & 2. Is there
a way to tell the MSXML object or maybe the XPath statement to give me a
"de-masked" string? I'm using MSXML with ObjectPal / Paradox, so the best
solution would be a method from the MSXML library or a "special" XPath
statement.

What you're seeing is the "escaped" XML notation for the text. This is what you should see if you use the .xml property to retrieve the string.
To get the string without the escapes, use .nodeValue.
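The difference between the serialized markup and the node's value can be illustrated outside MSXML. The following is a minimal sketch in Python, where the standard library's xml.etree.ElementTree stands in for MSXML; it is not ObjectPal or MSXML code, and the element name is only an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Build an element whose text contains a character that XML must escape.
element = ET.Element("element")
element.text = "1 & 2"

# Serialized markup: the ampersand is escaped (comparable to reading .xml).
serialized = ET.tostring(element, encoding="unicode")

# Parsed value: the parser undoes the escaping (comparable to .nodeValue).
parsed_text = ET.fromstring(serialized).text

print(serialized)
print(parsed_text)
```

The escaping only exists in the serialized form; the value held by the node is the plain string.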

Related

How to remove "amp;" from getting sent in XML?

I have a string like "Travel & Hospitality".
When I send it through the XML API, the output looks like this:
"Travel &amp; Hospitality"
I tried removing it with Ruby code like the following before sending it through XML:
"Travel &amp; Hospitality".gsub("&amp;","&")
<Specialization__c>Travel &amp; Hospitality</Specialization__c>
Even though gsub removes "amp;", it comes back when the string is sent inside the XML tags.
How can I remove it? My desired output is
"Travel & Hospitality"
XML doesn't allow such a thing. & is not allowed to appear unescaped.
You could have an XML file like this instead:
<Specialization__c><![CDATA[Travel & Hospitality]]></Specialization__c>
That would work, but the problem is how to convince your XML output library to do something like that. It might not even be possible at all. (I might be wrong about that last part. I know nothing about Ruby.)
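Whether the ampersand is escaped or wrapped in CDATA, a conforming parser on the receiving side should hand back the same text. Here is a small Python sketch of that point, using the standard library's xml.etree.ElementTree as an assumed stand-in for whatever parser consumes the XML:

```python
import xml.etree.ElementTree as ET

escaped = "<Specialization__c>Travel &amp; Hospitality</Specialization__c>"
cdata = "<Specialization__c><![CDATA[Travel & Hospitality]]></Specialization__c>"
unescaped = "<Specialization__c>Travel & Hospitality</Specialization__c>"

# Both well-formed variants yield the same element text after parsing.
escaped_text = ET.fromstring(escaped).text
cdata_text = ET.fromstring(cdata).text

# A bare ampersand is not well-formed XML, so the parser rejects it.
try:
    ET.fromstring(unescaped)
    unescaped_is_rejected = False
except ET.ParseError:
    unescaped_is_rejected = True
```

In other words, the "amp;" seen on the wire is part of the encoding, not part of the value the receiver ends up with.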

Processing form input in a Joomla component

I am creating a Joomla component and one of the pages contains a form with a text input for an email address.
When a < character is typed in the input field, that character and everything after is not showing up in the input.
I tried $_POST['field'] and JFactory::getApplication()->input->getCmd('field')
I also tried alternatives for getCmd like getVar, getString, etc. but no success.
E.g. John Doe <j.doe@mail.com> returns only John Doe.
When the < is left out, like John Doe j.doe@mail.com> the value is coming in correctly.
What can I do to also have the < character in the posted variable?
BTW. I had to use & lt; in this question to display it as I want it. This form suffers from the same problem!!
You actually need to set the filtering that you want when you grab the input. Otherwise, you will get some heavy filtering. (Typically, I will also lose @ symbols.)
Replace this line:
JFactory::getApplication()->input->getCmd('field');
with this line:
JFactory::getApplication()->input->getRaw('field');
The name after the get part of the function is the filtering that you will use. Cmd strips everything but alphanumeric characters and ., -, and _. String will run through the HTML tag-cleaning feature of Joomla and, depending on your settings, will clean out <>. (That usually doesn't happen for me, but my settings are generally pretty open, to the point of no filtering on super admins and such.)
getRaw should definitely work, but note that there is no filtering at all, which can open security holes in your application.
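To picture what the two filters do to the example input, here is a rough Python sketch. The functions cmd_filter and raw_filter are hypothetical stand-ins written from the description in this answer; they are not Joomla's implementation.

```python
import re

def cmd_filter(value: str) -> str:
    """Keep only letters, digits, '.', '-' and '_' (assumed Cmd-style behaviour)."""
    return re.sub(r"[^A-Za-z0-9._-]", "", value)

def raw_filter(value: str) -> str:
    """Return the input untouched (assumed Raw-style behaviour, no filtering)."""
    return value

posted = "John Doe <j.doe@mail.com>"
print(cmd_filter(posted))
print(raw_filter(posted))
```

Because the raw variant passes everything through, the value has to be validated or escaped before it is stored or echoed back.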
The default text filter trims html from the input for your field. You should set the property
filter="raw"
in your form's manifest (xml) file, and then use getRaw() to retrieve the value. getCmd removes the non-alphanumeric characters.

C# MVC3 and non-latin characters

I have my database results (áéíóúàâêô...) and when I display any of this characters I get codes like:
&#225;
My controller is like this:
ViewBag.EstadosDeAlma = (from e in db.EstadosDeAlma select e.Title).ToList();
My cshtml page is like this:
var data = '@foreach (dynamic item in ViewBag.EstadosDeAlma){ @(item + " ") }';
In addition, if I use any rich text editor such as TinyMCE, all non-latin characters come out like this too.
What should I do to avoid this problem?
What output encoding are you using on your web pages? I would suggest using UTF-8 since you want a lot of non-ascii characters to work.
I think you should HTML encode/decode the values before comparing them.
Since you are using jQuery you can take advantage of the encoding functions built-in into it. For example:
$('<div/>').html('& #225;gil').html()
gives you "ágil" (notice that I added an extra space between the & and the # so that stackoverflow does not encode it, you won't need it)
This other question has more information about this.
HTML-encoding lost when attribute read from input field
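The jQuery trick above decodes a numeric character reference back into the character it stands for. The same round trip can be sketched in Python with the standard library's html module (an illustration of the decoding step only, not of the MVC or jQuery code):

```python
import html

encoded = "&#225;gil"

# Decode character references back into the characters they represent.
decoded = html.unescape(encoded)

# Encoding the markup-significant characters goes the other way.
reencoded = html.escape("a < b & c")

print(decoded)
print(reencoded)
```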

What's the most reliable way to parse a piece of text out into paragraphs in RealBasic that will work on Windows, Mac, and Linux?

I'm writing a piece of software using RealBASIC 2011r3 and need a reliable, cross-platform way to break a string out into paragraphs. I've been using the following but it only seems to work on Linux:
dim pTemp() as string
pTemp = Split(txtOriginalArticle.Text, EndOfLine + EndOfLine)
When I try this on my Mac it returns it all as a single paragraph. What's the best way to make this work reliably on all three build targets that RB supports?
EndOfLine changes depending on the platform you are running on, and the line breaks in a string depend on the platform that created it. You'll need to check for the type of EndOfLine in the string. I believe it's sMyString.EndOfLineType. Once you know what it is, you can split on it.
There are further properties for the EndOfLine. It can be EndOfLine.Macintosh/Windows/Unix.
EndOfLine docs: http://docs.realsoftware.com/index.php/EndOfLine
I almost always search for and replace the combinations of line break characters before continuing. I'll usually do a few lines of:
yourString = replaceAll(yourString,chr(13)+chr(10),"<someLineBreakHolderString>")
yourString = replaceAll(yourString,chr(10)+chr(13),"<someLineBreakHolderString>")
yourString = replaceAll(yourString,chr(10),"<someLineBreakHolderString>")
yourString = replaceAll(yourString,chr(13),"<someLineBreakHolderString>")
The order here matters: do the two-character combinations before an individual 10 or 13, because you don't want to end up replacing a line break that contains a 10 and a 13 with two of your line break holders. Do 13+10 (the Windows line break) before 10+13, otherwise two consecutive Windows line breaks are matched across their boundary.
It's a bit cumbersome and I wouldn't recommend using it to actually modify the original string, but it definitely helps to convert all of the line breaks to the same item before attempting to further parse the string.
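The normalize-then-split idea can be sketched as follows. This is Python rather than RealBasic, and split_paragraphs is a hypothetical helper name; it assumes paragraphs are separated by one blank line.

```python
def split_paragraphs(text: str) -> list:
    """Normalize Windows, classic Mac and Unix line breaks, then split on blank lines."""
    # Handle the two-character CRLF sequence first, then any remaining lone CR.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # Two consecutive line breaks (a blank line) separate paragraphs.
    return [paragraph for paragraph in normalized.split("\n\n") if paragraph]

print(split_paragraphs("First paragraph.\r\n\r\nSecond paragraph."))
```

Working on a normalized copy leaves the original string untouched, which matches the advice above.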

Problem With Regular Expression to Remove HTML Tags

In my Ruby app, I've used the following method and regular expression to remove all HTML tags from a string:
str.gsub(/<\/?[^>]*>/,"")
This regular expression did just about all I was expecting it to, except it caused all quotation marks to be transformed into &#8220; and all single quotes to be changed to &#8221;.
What's the obvious thing I'm missing to convert the messy codes back into their proper characters?
Edit: The problem occurs with or without the Regular Expression, so it's clear my problem has nothing to do with it. My question now is how to deal with this formatting error and correct it. Thanks!
Use CGI::unescapeHTML after you perform your regular expression substitution:
CGI::unescapeHTML(str.gsub(/<\/?[^>]*>/,""))
See http://www.ruby-doc.org/core/classes/CGI.html#M000547
In the above code snippet, gsub removes all HTML tags. Then, unescapeHTML() converts all HTML entities (such as &lt;, &#8220;) back to their actual characters (<, quotes, etc.).
With respect to another post on this page, note that you will never ever be passed HTML such as
<tag attribute="<value>">2 + 3 < 6</tag>
(which is invalid HTML); what you may receive is, instead:
<tag attribute="&lt;value&gt;">2 + 3 &lt; 6</tag>
The call to gsub will transform the above to:
2 + 3 &lt; 6
And unescapeHTML will finish the job:
2 + 3 < 6
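The two steps (strip the tags, then decode the entities) can be sketched in Python, where re.sub plays the role of gsub and html.unescape the role of CGI::unescapeHTML. The function name strip_tags_and_unescape is made up for this example.

```python
import html
import re

# Same pattern as in the question: an optional slash, then anything up to the next '>'.
TAG_PATTERN = re.compile(r"</?[^>]*>")

def strip_tags_and_unescape(markup: str) -> str:
    """Remove tags first, then turn entities back into characters."""
    without_tags = TAG_PATTERN.sub("", markup)
    return html.unescape(without_tags)

sample = '<tag attribute="&lt;value&gt;">2 + 3 &lt; 6</tag>'
result = strip_tags_and_unescape(sample)
quotes = strip_tags_and_unescape("<p>&#8220;quoted&#8221;</p>")
print(result)
print(quotes)
```

The order matters: decoding first would turn &lt; into a literal < and the tag pattern could then remove real text.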
You're going to run into more trouble when you see something like:
<doohickey name="<foobar>">
You'll want to apply something like:
gsub(/<[^<>]*>/, "")
...for as long as the pattern matches.
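Applying the pattern "for as long as it matches" can be sketched like this in Python. The helper strip_nested_tags is a hypothetical name, and the loop simply repeats the substitution until nothing is removed.

```python
import re

# Matches only tags that contain no further angle brackets.
INNERMOST_TAG = re.compile(r"<[^<>]*>")

def strip_nested_tags(markup: str) -> str:
    """Remove innermost tags repeatedly until the pattern no longer matches."""
    while True:
        markup, removed = INNERMOST_TAG.subn("", markup)
        if removed == 0:
            return markup

malformed = '<doohickey name="<foobar>">text'
single_pass = re.sub(r"</?[^>]*>", "", malformed)
repeated = strip_nested_tags(malformed)
print(single_pass)
print(repeated)
```

A single pass with the original pattern stops at the first > and leaves the tail of the attribute behind, which is the trouble this answer warns about.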
This regular expression did just about
all I was expecting it to, except it
caused all quotation marks to be
transformed into &#8220; and all
single quotes to be changed to &#8221;.
This doesn't sound as if the RegExp would be doing this. Are you sure the text is any different before the substitution runs?
See this question here for information about the problem, it has got an excellent answer:
Get non UTF-8 form fields as UTF-8 in php.
I've run into a similar problem with character changes. It happened when my code ran through another module that enforced UTF-8 encoding, and when it came back I had a different file (slurped array of lines) on my hands.
You could use a multi-pass system to get the results you are looking for.
After running your regular expression, run an expression to convert &#8220; to quotes and another to convert &#8221; to single quotes.

Resources