Not sure why the output of my PHP scripts contains random embedded spaces within character strings - joomla

I have written several PHP scripts to read the contents of a database and output those contents in an email message. Every once in a while, I will see a SPACE (0x20) character embedded in the output where there shouldn't be any. For example, in one script, I reference a PHP global variable containing exactly "n" non-space characters, and sometimes (not always), when that variable is dumped to an email message, the string will appear with an embedded blank (making the total length of the string "n+1"). Other times, an HTML tag (such as <BR>) will appear as < BR> (note the SPACE before the "B").
Because the behavior of the script is not consistent (some emails are affected, and others aren't), I can't seem to find the problem.
I am enclosing a link to the PHP script that is occasionally embedding a space into the BREAK tag. I have removed the lines that provide specific login information to the databases. Otherwise, everything else is intact. In the code file you can find at the link below, line 281 is the one that contained the BREAK command with the embedded SPACE (as described above). This has happened only once!
http://jem-software.com/temptest.txt
I guess the only other potentially relevant information is that this script file is taken from code entered into a JUMI code block contained within a Joomla! based website.
Edit 1:
Thank you, Riccardo, for your suggestions. Here is some more clarification:
I am not reading an email and parsing the results in order to insert into a database. Just the opposite, I am reading from a database and using the results to create an email. I will check the database to see what character set was used, and explicitly pass the character set to see if that makes a difference.
I don't use Joomla functions to access the database because the database I am referencing is external to the Joomla! environment. It is a pre-existing database created from PHP scripts written several years prior. When my old website was re-written using Joomla, I wanted to "port" the PHP database access code intact, so I installed the JUMI plugin to make this possible.
I will check out the character coding in the database and synchronize it to the character code of the email message.
I don't understand how an issue with character coding would result in the insertion of a SPACE into the hard-coded HTML tag - this tag did not come from any database, but was typed into the email as a literal string.

This is a strange issue, but here are my two cents:
The first is you're not using Joomla functions to access the db and the mail subsystem. While this could work, it's not really nice.
The second is, this smells like a character set / codepage issue.
Here are a few considerations on the character set issue:
I read your code quickly, and I didn't notice anything wrong. But Joomla uses UTF-8, and your queries don't specify it (mysql_set_charset() is missing!) which could be a first issue.
The second is that the emails you read will have different character sets, depending on the senders' settings. Make sure you handle the codepage issues properly: the following is a snippet of a function I use for parsing email:
$mime = imap_fetchmime($this->connection, $this->messageNumber, $partNumber);
return $this->decodeMailBody($data, $mime); // $data holds the QUOTED_PRINTABLE body
function decodeMailBody($string, $mime) {
    $str = quoted_printable_decode($string);
    // Example $mime values:
    // Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8
    // Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=windows-1252
    // Keep whatever follows the last "charset=" as the charset name.
    $mimes = explode('charset=', $mime);
    foreach ($mimes as $mimepiece) {
        $charset = $mimepiece;
    }
    $charset = strtolower(trim($charset));
    // Debug output, printed only once $charset has been determined.
    echo "<h3>mime: $mime; charset $charset</h3>";
    if ($charset == 'utf-8') {
        return $str;
    } else {
        return iconv($charset, 'UTF-8', $str);
    }
}
Last, make sure you use utf-8 when you insert the mail into the db after parsing it.


How to Escape Double Quotes from Ruby Page Object text

In using the Page Object gem, I'm trying to pull text from a page to verify error messages. One of these error messages contains double-quotes, but when the page object pulls the text from the page, it pulls some other characters.
expected ["Please select a category other than the Default â?oEMSâ?? before saving."]
to include "Please select a category other than the Default \"EMS\" before saving."
(RSpec::Expectations::ExpectationNotMetError)
I'm not quite sure how to escape these - I'm not sure where I could use Regexs and be able to escape these odd characters.
Honestly you are over complicating your validation.
I would recommend simplifying what you are trying to do, start by asking yourself: Is the part in quotes a critical part of your validation?
If it is, isolate it by doing error_text.include?("EMS"), where error_text is the string pulled from the page.
If it is not, then you are probably doing too much work; only check for exactly what you need in validation:
error_text.start_with?("Please select a category other than the Default")
With respect to the actual issue you are having, on a technical level you have an encoding issue. Encode your result string with utf-8 before you pass it to your validation and you will be fine.
Good luck
It's pretty likely that something along the line encoded the string improperly. (A tipoff is the accented characters followed by ?.) It seems pretty likely that the quotes were converted to "smart quotes" somewhere. This table compares Windows-1252 to UTF-8:
Unicode     Windows-1252   Expected    Actual      UTF-8 bytes
code point  byte           character   characters
----------  -------------  ----------  ----------  -----------
U+201C      0x93           “           â€œ         %E2 %80 %9C
U+201D      0x94           ”           â€          %E2 %80 %9D
What you'll want to do is spot check various places in the code to find the first place the string is encoded in something other than UTF-8:
puts error_str.encoding
(For clarity, error_str is the variable that holds the string you are testing. I'm using puts, but you might want to have another way to log diagnostic messages.)
Once you find the string that's not encoded UTF-8, you can convert it:
error_str.encode('UTF-8')
Or, if the string is hardcoded somewhere, just replace the string.
For more debugging advice, see: "3 Steps to Fix Encoding Problems in Ruby" and "How to Get From Theyâ€™re to They’re".
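The garbling in the table can be reproduced in a few lines. This is only a minimal sketch, not the asker's code: it deliberately mislabels the UTF-8 bytes of a left smart quote as Windows-1252 to produce the garbled text, then reverses the mistake. The variable names are made up for illustration, and it assumes the damage came from a single wrong-charset decode.

```ruby
# A left smart quote (U+201C) followed by EMS, as a UTF-8 string.
original = "\u201CEMS"

# Simulate the bug: treat the UTF-8 bytes as if they were Windows-1252,
# then transcode that (wrong) interpretation to UTF-8.
garbled = original.dup.force_encoding("Windows-1252").encode("UTF-8")

# Undo it: go back to the raw bytes and relabel them as UTF-8.
repaired = garbled.encode("Windows-1252").force_encoding("UTF-8")

puts original.encoding         # UTF-8
puts garbled.length            # 6: three wrong characters plus "EMS"
puts repaired == original      # true
puts repaired.valid_encoding?  # true
```

Note that plain encode('UTF-8') transcodes from whatever encoding the string is labelled with; when the label itself is wrong, force_encoding is the call that corrects the label without touching the bytes.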

Mac computers aren't processing mailto: links correctly when they have // in them (mailto://)

Sorry for the question title, it's a little difficult to phrase in my opinion. Here is the full question:
The WYSIWYG HTML editor we use on our websites includes a // in the mailto: link when inserted into the text editor box (mailto://). We are a web firm and use this editor on many, many websites. For example, all the mail links inserted appear like this:
<a href="mailto://email@domain.com">Text Here</a>
We just noticed this morning that Windows computers do not include the // in the To: field when the link is clicked, regardless of the email client it's opened with. The address is filled in as normal (email@domain.com).
Mac computers, however, do include the //, so whenever someone tries to send an email using these links, it goes to //email@domain.com, which isn't delivered because the address is invalid with the slashes.
Does anyone have any knowledge as to why this is happening? The WYSIWYG editor we are using is obout. If we have to go back and remove these // from every single website we've built, it would be a tremendous task. I'm just wondering why Macs seem to not process the link correctly, while Windows computers do.
The Macs are processing the link correctly. Windows is incorrectly removing data and your editor is incorrectly encoding the data.
The mailto: URL scheme is defined by RFC 2368. It defines it as:
    mailtoURL  =  "mailto:" [ to ] [ headers ]
    to         =  #mailbox
    headers    =  "?" header *( "&" header )
    header     =  hname "=" hvalue
    hname      =  *urlc
    hvalue     =  *urlc
"#mailbox" is as specified in RFC 822 [RFC822]. This means that it
consists of zero or more comma-separated mail addresses, possibly
including "phrase" and "comment" components. Note that all URL
reserved characters in "to" must be encoded: in particular,
parentheses, commas, and the percent sign ("%"), which commonly occur
in the "mailbox" syntax.
There is no provision for removing characters such as /.
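If the stored markup does have to be corrected in bulk, the rewrite itself is small. The following is a rough sketch under stated assumptions: strip_mailto_slashes is a hypothetical helper, someone@example.com is a placeholder address, and the content is assumed to be available as plain HTML strings. It only drops the two slashes that directly follow the scheme.

```ruby
# Hypothetical helper: turn "mailto://addr" into "mailto:addr" in an HTML fragment.
# The match is case-insensitive; the scheme is written back in lowercase.
def strip_mailto_slashes(html)
  html.gsub(%r{mailto://}i, "mailto:")
end

before = '<a href="mailto://someone@example.com">Text Here</a>'
puts strip_mailto_slashes(before)
# <a href="mailto:someone@example.com">Text Here</a>
```

Links that are already in the correct mailto: form pass through unchanged, so the helper can be run over content more than once.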

Characters in string within post are being auto escaped

I am sending tokens via a POST request, but when I see them on the server it doesn't match up with what was sent.
"U2FsdGVkX1+pxBHFdSU4NiSIOdR2GCCBr/WF7AOSF5zQjRqjSoTeOKR0Dzwm\nNT+g\n" <-- Original
"U2FsdGVkX1+pxBHFdSU4NiSIOdR2GCCBr/WF7AOSF5zQjRqjSoTeOKR0Dzwm\\nNT+g\\n" <-- Result
Notice that the \n has been replaced with \\n. When I do the token lookup verification, of course, no result is found because the string I'm looking for is not the proper string anymore!
I'm not sure why this string is being auto changed like this or quite how to correct it. I'm just accessing this through the standard params like so.
token.verify(params["token"])
EDIT for further clarity
I'm viewing this from the terminal using the debugger gem. I have autoeval enabled and display with params["token"] without p or puts. I am not trying to create newline characters with \n. The literal \n is an actual part of the string that is received in the post. I randomly generate a token using a hashing and encryption library and the strings sometimes end up with these characters in them. If I run token.verify(params["token"]) from the debugger terminal I get nil back from the database as there is no match due to the extra backslash characters being added into the string.
If I directly run token.verify("U2FsdGVkX1+pxBHFdSU4NiSIOdR2GCCBr/WF7AOSF5zQjRqjSoTeOKR0Dzwm\nNT+g\n") from the debugger terminal I get the correct record back from the database. This leaves me thinking that either Rack or Sinatra is auto escaping the "special" characters in the string before I get a chance to even touch it.
This has something to do with the way Ruby is handling special characters. From irb you can see this with a quick check like this.
"\\n" == '\n'
Unexpectedly, at least to me, this returns true: both literals denote the same two-character string, a backslash followed by n. Rather than trying to deal with special characters coming across the wire, I ended up just Base64 encoding everything.
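The difference between a real newline and a literal backslash-n is easy to check directly. This is a small sketch with made-up variable names. It also rests on the assumption that the debugger displays values in their inspect form, which would show a single literal backslash as two. The second half shows Base64 output with and without line wrapping. The first \n in the token above falls after exactly 60 characters, which is consistent with line-wrapped Base64, although that is an inference rather than something the post states.

```ruby
newline = "\n"   # double quotes: one character, a line feed
literal = '\n'   # single quotes: two characters, backslash and n

puts newline.length      # 1
puts literal.length      # 2
puts literal == "\\n"    # true
puts literal.inspect     # "\\n" (inspect escapes the backslash)

# Base64 with and without line wrapping, using core Array#pack.
data = "x" * 48
wrapped   = [data].pack("m")   # newline after every 60 encoded characters
unwrapped = [data].pack("m0")  # no newlines at all

puts wrapped.count("\n")       # 2
puts unwrapped.include?("\n")  # false
```

Producing tokens without embedded newlines in the first place avoids the question of how they survive a form POST.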

Processing form input in a Joomla component

I am creating a Joomla component and one of the pages contains a form with a text input for an email address.
When a < character is typed in the input field, that character and everything after is not showing up in the input.
I tried $_POST['field'] and JFactory::getApplication()->input->getCmd('field')
I also tried alternatives for getCmd like getVar, getString, etc. but no success.
E.g. John Doe <j.doe@mail.com> returns only John Doe.
When the < is left out, like John Doe j.doe@mail.com> the value is coming in correctly.
What can I do to also have the < character in the posted variable?
BTW. I had to use & lt; in this question to display it as I want it. This form suffers from the same problem!!
You actually need to set the filtering that you want when you grab the input. Otherwise, you will get some heavy filtering. (Typically, I will also lose @ symbols.)
Replace this line:
JFactory::getApplication()->input->getCmd('field');
with this line:
JFactory::getApplication()->input->getRaw('field');
The name after the get part of the function is the filtering that you will use. Cmd strips everything but alphanumeric characters and ., -, and _. String will run through the HTML tag-cleaning feature of Joomla and, depending on your settings, will clean out <>. (That usually doesn't happen for me, but my settings are generally pretty open, to the point of no filtering on super admins and such.)
getRaw should definitely work, but note that there is no filtering at all, which can open security holes in your application.
The default text filter trims html from the input for your field. You should set the property
filter="raw"
in your form's manifest (xml) file, and then use getRaw() to retrieve the value. getCmd removes the non-alphanumeric characters.
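To make the effect of the different filters concrete, here is a rough illustration in Ruby. It is not Joomla's implementation: cmd_like and strip_tags_like are hypothetical stand-ins that only mimic the behaviour described above (keeping alphanumerics plus ., - and _, or removing anything that looks like a tag), and the real filters may differ in detail.

```ruby
# Rough stand-in for a Cmd-style filter: keep only letters, digits, ".", "-" and "_".
def cmd_like(value)
  value.gsub(/[^A-Za-z0-9._-]/, "")
end

# Crude stand-in for tag stripping: drop anything that looks like <...>.
def strip_tags_like(value)
  value.gsub(/<[^>]*>/, "")
end

input = "John Doe <j.doe@mail.com>"
puts cmd_like(input)         # JohnDoej.doemail.com
puts strip_tags_like(input)  # "John Doe " (the address is gone)
puts input                   # a raw getter would hand this back unchanged
```

The tag-stripping case shows why the address vanishes: an email in angle brackets is indistinguishable from an HTML tag to that kind of filter.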

Bug in Chrome, or Stupidity in User? Sanitising inputs on forms?

I've written a more detailed post about this on my blog at:
http://idisposable.co.uk/2010/07/chrome-are-you-sanitising-my-inputs-without-my-permission/
but basically, I have a string which is:
||abcdefg
hijklmn
opqrstu
vwxyz
||
The pipes I've added to give an indication of where the string starts and ends; in particular, note the final carriage return on the last line.
I need to put this into a hidden form variable to post off to a supplier.
In basically any browser except Chrome, I get the following:
<input type="hidden" id="pareqMsg" value="abcdefg
hijklmn
opqrstu
vwxyz
" />
but in chrome, it seems to apply a .Trim() or something else that gives me:
<input type="hidden" id="pareqMsg" value="abcdefg
hijklmn
opqrstu
vwxyz" />
Notice it's cut off the last carriage return. These carriage returns (when Encoded) come up as %0A if that helps.
Basically, in any browser except chrome, the whole thing just works and I get the desired response from the third party. In Chrome, I get an 'invalid pareq' message (which suggests to me that those last carriage returns are important to the supplier).
Chrome version is 5.0.375.99
Am I going mad, or is this a bug?
Cheers,
Terry
You can't rely on form submission to preserve the exact character data you include in the value of a hidden field. I've had issues in the past with Firefox converting CRLF (\r\n) sequences into bare LFs, and your experience shows that Chrome's behaviour is similarly confusing.
And it turns out, it's not really a bug.
Remember that what you're supplying here is an HTML attribute value - strictly, the HTML 4 DTD defines the value attribute of the <input> element as of type CDATA. The HTML spec has this to say about CDATA attribute values:
User agents should interpret attribute values as follows:
Replace character entities with characters,
Ignore line feeds,
Replace each carriage return or tab with a single space.
User agents may ignore leading and trailing white space in CDATA attribute values (e.g., " myval " may be interpreted as "myval"). Authors should not declare attribute values with leading or trailing white space.
So whitespace within the attribute value is subject to a number of user agent transformations - conforming browsers should apparently be discarding all your linefeeds, not only the trailing one - so Chrome's behaviour is indeed buggy, but in the opposite direction to the one you want.
However, note that the browser is also expected to replace character entities with characters - which suggests you ought to be able to encode your CRs and LFs as &#13; and &#10;, and even spaces as &#32;, eliminating any actual whitespace characters from your value field altogether.
However, browser compliance with these SGML parsing rules is, as you've found, patchy, so your mileage may certainly vary.
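Generating such an attribute value on the server could look roughly like this. It is a sketch under stated assumptions: encode_whitespace is a hypothetical helper, it rewrites only CR, LF, tab and space as numeric character references (for example &#13; for CR and &#10; for LF), and it assumes the value contains no other characters that need HTML escaping.

```ruby
# Hypothetical helper: replace CR, LF, tab and space with numeric character references.
def encode_whitespace(value)
  value.gsub(/[\r\n\t ]/) { |ch| format("&#%d;", ch.ord) }
end

pareq = "abcdefg\nhijklmn\n"
puts %(<input type="hidden" id="pareqMsg" value="#{encode_whitespace(pareq)}" />)
# <input type="hidden" id="pareqMsg" value="abcdefg&#10;hijklmn&#10;" />
```

Whether a given browser then preserves the decoded characters on submission still has to be checked against that browser.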
Confirmed it here. It trims trailing CRLFs, they don't get parsed into the browser's DOM (I assume for all HTML attributes).
If you append CRLF with script, e.g.
var pareqMsg = document.forms[0]['pareqMsg'];
// Re-append the trailing CRLF if the browser trimmed it.
if (!/\r\n$/.test(pareqMsg.value)) {
    pareqMsg.value += '\r\n';
}
...they do get maintained and POSTed back to the server. Although the hidden <textarea> idea suggested by Gaby might be easier!
Normally you cannot enter a newline into an input box from the keyboard, so perhaps Chrome enforces this even for values embedded through attributes.
Try using a textarea (with display:none).
