Will errant whitespace break an EDI document?

We had some EDI files that were clogging our system. I found some data that was outdated and most likely confused the shipping system. I also found some spaces where there shouldn't be any:
. . . |LB~TD5||2| FEDEX|T| FEDEX~REF|2I| . . .
Note the spaces at the beginning of the FEDEX values. I am having trouble finding good documentation for X12 856 EDIs. Would this whitespace matter? I have removed it since it is not required, but should I bother looking out for this as an issue in the future?

The whitespace shouldn't matter (as long as it doesn't appear in the ISA segment), as most translators base their processing on delimiters. Most translators worth their salt will also support some kind of Trim() mechanism with which you can strip the whitespace. It is certainly bad practice, and you can ask your trading partner to knock it off, but if they can't, you should be able to strip it during translation.
The whitespace could be an issue if you're cross-referencing the FEDEX value (TD503 and TD505, it looks like) to a code in the shipping system, but shouldn't "clog" any system.
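For example, a delimiter-based parse with a trim pass is only a line or two in most languages. A minimal sketch in Python, using the TD5 fragment from the question:

segment = "TD5||2| FEDEX|T| FEDEX"  # fragment from the question

# Split on the element delimiter, then trim each element the way a
# translator's Trim() step would.
elements = [e.strip() for e in segment.split("|")]
print(elements)  # ['TD5', '', '2', 'FEDEX', 'T', 'FEDEX']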

Related

Search using Xapian Omega - with Wild Cards or Regular Expressions

We are comparing different search engines for our research archives, and having browsed the Xapian-Omega documentation, we decided to try it out, since the Omega option appears to be an appropriate solution with several interesting search options.

We installed Xapian-Omega on a Linux server (Debian 7) and tested the setup with success. However, we are unsure how one can employ, or perhaps even enable, the use of wildcards or regular expressions with Xapian-Omega. We read that for Xapian one has to enable the wildcard option via "QueryParser flags". Could someone clarify this? I.e. explain, or indicate a page with an example or two. We did not see much information regarding examples with the Omega CGI, and although the latter runs well, wildcard options (such as * for the general wildcard and ? for a single character) do not seem to work as expected by default. They would be useful, even though stemming, substrings etc. may be functional.

E.g. it would be interesting to be able to employ standard simple wildcard searches with a certain precision, such as:
medic* for medicine, medical, medicament
or with ? for single characters.

Can regexps be recognised with Omega? E.g.:
sep[ae]r[ae]te(\w+)?
or searching for structured formats such as email addresses, credit card numbers, or certain formula types in research papers, etc.

In an old note from Olly Betts (on the dev mailing list) regarding this, one suggestion was to grep the index file, but this would defeat the RAD advantage of Omega.

Any examples of searches using Omega with wildcards or regular expressions would be most appreciated ... even an indication of a page where information regarding this theme is well presented, with examples illustrating how to develop advanced searches using Xapian alone, would be most welcome (PHP or Python perhaps). (We are not concerned for the moment about the eventual substantial increase in the size of the index or in the time to index the archive.)
You can enable right-wildcards (such as "medic*") in Omega using $set{flag_wildcard,1} (covered in the Omegascript documentation), which enables FLAG_WILDCARD. There's a section in the user manual on using wildcards.
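If you want to drive Xapian directly rather than through the Omega CGI (the question mentions Python), the same flag is available on QueryParser. A minimal sketch with the Python bindings; the index path here is hypothetical:

import xapian

db = xapian.Database("/path/to/omega/index")  # hypothetical path

qp = xapian.QueryParser()
qp.set_database(db)  # needed so wildcards can expand against the term list

flags = xapian.QueryParser.FLAG_DEFAULT | xapian.QueryParser.FLAG_WILDCARD
query = qp.parse_query("medic*", flags)  # matches medicine, medical, ...

enquire = xapian.Enquire(db)
enquire.set_query(query)
for match in enquire.get_mset(0, 10):
    print(match.docid, match.document.get_data())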
Xapian doesn't provide support for regular expression searching, although in theory I believe it would be possible to support, if potentially costly (depending on the regex). It would have to run the regular expression against unstemmed terms in the database, and then feed them into the search. Where it becomes difficult is if the regex expands to a lot of terms (eg just 'a' as a regex). There's also some subtlety in making it efficient; it's easy to jump through the term list to something with a constant prefix, and you'd want to take advantage of that if possible.
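To make that concrete, here is a rough sketch (an illustration only, not a supported Xapian feature) of running a regex over the term list yourself and ORing the matches into a query, again with the Python bindings:

import re
import xapian

db = xapian.Database("/path/to/omega/index")  # hypothetical path
pattern = re.compile(r"sep[ae]r[ae]te(\w+)?")

# Walk the full term list and collect every term the regex matches.
# In an Omega-built index, stemmed forms carry a "Z" prefix, so skip
# those and only look at the unstemmed terms.
matching = []
for item in db.allterms():
    term = item.term.decode("utf-8")
    if not term.startswith("Z") and pattern.fullmatch(term):
        matching.append(term)

query = xapian.Query(xapian.Query.OP_OR, [xapian.Query(t) for t in matching])

Note that allterms() also accepts a constant prefix (e.g. db.allterms("sep")), which is exactly the jump-to-prefix optimisation mentioned above.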
For your example of sep[ae]r[ae]te(\w+)?, it sounds like you actually want a combination of spelling correction (for the a-e substitutions, which you can enable using $set{flag_spelling_correction,1}) and stemming (for the trailing letters after 'te'; Omega defaults to English stemming, but that can be changed), or either wildcard or partial match support.
If you do need regular expressions for your use case, then I'd suggest bringing it up on the xapian-discuss mailing list. Xapian has moved on since the last discussion, and I believe it would be easier to build such support now than it was then.
James Ayatt: Thank you for your answer and help, and my apologies for this belated reply; I was distracted by other work.
We had already seen the Omegascript page, but it was not clear to us how to employ these options with the CGI interface. Also, the use of * seems to be for trailing characters, is that correct? I.e. not for internal groups of characters, e.g. omeg*ipt; there are cases where the stemming option would not be sufficient. We did not see an option for single wildcard characters, sometimes represented by ? in certain search engines. Could you comment here?
Regarding the use of regular expressions, we had imagined that it might not be quite as simple as one could hope. The examples mentioned in the preceding post were of course simple possible uses; there are of course many more. Your comment on using the stemming option seems appropriate.
In certain cases it could be interesting to enable some type of regexp option for the extraction of text forms, such as those mentioned. The quick extraction of such text, perhaps together with some surrounding text, could be very useful.
We will certainly try your proposal with the mailing list.
Thank you again.

What are the valid characters in the domain part of an e-mail address?

Intention
I'm trying to do some minimal (very minimal) validation of e-mail addresses, despite seeing a lot of advice against doing that. The reason I'm doing this is that the spec I am implementing requires e-mail addresses to be in this format:
mailto:<uri-encoded local part>#<domain part>
I'd like to simply split on the starting mailto: and the final #, and assume the "local part" is between these. I'll verify that the "local part" is URI encoded.
I don't want to do much more than this, and the spec allows for me to get away with "best effort" validation for most of this, but is very specific on the URI encoding and the mailto: prefix.
Problem
From everything I've read, splitting on the # seems risky to me.
I've seen a lot of conflicting advice on the web and in Stack Overflow answers, most of it saying "read the RFCs", and some of it saying that the domain part can only contain certain characters, i.e. 0-9 a-z A-Z -., maybe a couple of other characters, but not much more than this. E.g.:
What characters are allowed in an email address?
When I read various RFCs on domain names, I see that "any CHAR" (dtext) or "any character between ASCII 33 and 90" (dtext) are allowed, which implies # symbols are allowed. This is further compounded because "comments" are allowed in parens ( ) and can contain characters in the ASCII ranges 33-39 and 42-91, which include #.
RFC1035 seems to support the letters+digits+dashes+periods requirement, but "domain literal" syntax in RFC5322 seems to allow more characters.
Am I misunderstanding the RFC, or is there something I'm missing that disallows a # in the domain part of an e-mail address? Is "domain literal" syntax something I don't have to worry about?
The most recent RFC for email on the Internet is RFC 5322 and it specifically addresses addresses. Its grammar is as follows (with your spec's "#" standing in where the RFC itself uses "@" as the separator):
addr-spec = local-part "#" domain
local-part = dot-atom / quoted-string / obs-local-part
The dot-atom is a highly restricted set of characters defined in the spec. However, the quoted-string is where you can run into trouble. It's not often used, but in terms of the possibility that you'll run into it, you could well get something in quotation marks that could itself contain a # character.
However, if you split the string from the last #, you should safely have located the local-part and the domain, which is well defined in the specification in terms of how you can verify it.
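As a sketch of that splitting approach in Python (the function name and error handling here are mine, not from the spec):

from urllib.parse import unquote

def split_mailto(value: str):
    # Best-effort split of 'mailto:<uri-encoded local part>#<domain part>'.
    # Splitting on the *last* '#' means a quoted local part that itself
    # contains '#' still parses correctly.
    if not value.startswith("mailto:"):
        raise ValueError("missing mailto: prefix")
    local_enc, sep, domain = value[len("mailto:"):].rpartition("#")
    if not sep or not local_enc or not domain:
        raise ValueError("expected <local part>#<domain part>")
    return unquote(local_enc), domain

split_mailto("mailto:john.doe#example.org")  # → ('john.doe', 'example.org')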
The problem comes with punycode, whereby almost any Unicode character can be mapped into a valid DNS name. If the system you are front-ending can understand and interpret punycode, then you have to handle almost anything that has valid unicode characters in it. If you know you're not going to work with punycode, then you can use a more restricted set, generally letters, digits, and the hyphen character.
To quote the late, great Jon Postel:
TCP implementations should follow a general principle of robustness: be conservative in what you do, be liberal in what you accept from others.
Side note on the local part:
Keeping in mind, of course, that there are probably lots of systems on the internet that don't require strict adherence to the specs and therefore might allow things outside of the spec to work due to the long standing liberal-acceptance/conservative-transmission philosophy.

Right single apostrophe vs. apostrophe?

Right single quotation mark (U+2019)
vs.
Apostrophe (U+0027)
What is the difference between these two characters?
I ran into this issue where I use CAtlString to load a string from a resource file, and on some Windows installations, the LoadString fails when trying to load a string that contains U+2019, but it works on some other Windows installations. The U+2019 character appears in strings in my resource file that I copied from Word, and U+0027 appears in strings that I hand coded. Why does LoadString (sometimes) choke on this?
What is the difference between these two characters?
Arguable!
Going by the names, one would imagine that the curly ‹’› is only for use as a quotation mark, and that the straight ‹'› is only for use as a real apostrophe, an indicator of omitted letters.
However, traditional typesetting practice in English is always to use a curly ‹’› to render an apostrophe. Personally (and I may be alone here) I don't like this. It can make for more ambiguous reading:
“He said, ‘It’s fish ’n’ chips’...”
With the apostrophes straight, it's (marginally) clearer where the quotation ends:
“He said, ‘It's fish 'n' chips’...”
and the apostrophe being ‘straight’ makes more sense to me because its purpose of indicating omitted letters has no inherent directionality, whereas quotation marks are clearly asymmetrical in purpose.
In traditional ASCII, of course, there are no smart quotes, so the apostrophe is always used for both...
on some Windows installations, the LoadString fails when trying to load a string that contains U+2019, but it works on some other Windows installations.
Here you are meeting the horror of the ‘ANSI’ code page. This is a default character encoding that differs across Windows install locales. So on a machine in a Western locale, you get different results when you read a resource than you do when you read it on a Japanese Windows install.
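You can see the core of the problem without Windows at all; this is just an illustration in Python, not the actual LoadString code path. The byte that encodes ’ in the Western code page is only the first half of a character in the Japanese one:

raw = b"\x92"  # the Windows-1252 byte for U+2019 (’)

print(raw.decode("cp1252"))                   # Western locale: '’'
print(raw.decode("cp932", errors="replace"))  # Japanese locale: 0x92 is a
                                              # lead byte with nothing after
                                              # it, so you get U+FFFD '�'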
It is highly unfortunate that Windows has varying default code pages instead of using a single global encoding like UTF-8, but it's too late to fix now. If you compile your whole application as a Unicode app (so you'll be using LoadStringW rather than LoadStringA) then you can cope with non-ASCII characters like the smart quotes much better.
If you can't move to a Unicode application you're a bit stuck. You won't be able to handle non-ASCII characters like the smart quotes globally, so stick with ASCII characters like the straight apostrophe ‹'› alone.
The U+2019 character appears in strings in my resource file that I copied from Word
Yes, Word has an annoying AutoCorrect feature that replaces all apostrophes you type with smart quotes. This is especially undesirable when you are dealing with code, where ‹’› will break the program; but it's also wrong even for plain old English, as it's not possible to correctly guess the desired direction of the quote. (It'll get one of the apostrophes in “fish 'n' chips” the wrong way round, for example.)
I suggest turning off the automatic-replace-with-smart-quotes feature. If you want the smart quotes, it's better to type them deliberately. Unfortunately they are inconvenient to type on most keyboard layouts, often requiring obscure Alt+numpad sequences. Personally I use this one to drop them onto Alt+[] keys.
Historically, single-quote and double-quote come in pairs, left (open) and right (close).
For many years the character sets of computers were limited, having a single form of each.
Now, with the advent of Unicode, the full forms are available, but support for them is still limited. Programming languages still use the simple forms, and the full forms can still cause problems.

Why should tags be space-delimited?

I'm working on a Web application that uses tags like Stack Overflow tags. I've noticed that a lot of sites that use tags make them space-delimited, which disallows a tag like...
"favorite recipes"
Instead they enforce this...
"favorite-recipes" | "favorite_recipes" | "FavoriteRecipes"
If the tags were comma-delimited, an item could have a set of tags like...
"cats, birds, favorite recipes, horses"
I have to decide on the policy for my app.
I guess I like the idea of space-delimited, but if my users aren't programmers they might be more comfortable with the familiar idea of commas denoting a list.
Why are comma-delimited tags unusual? Is there a major downside to them?
I think space-delimited tags feel more like typing, and you aren't syntactically bound to commas or pipes etc...
I also think that not allowing spaces in tags is kind of nice because you don't have to worry about turning them into URLs; they are already ready to go.
Online tools like Delicious use space delimiting for tags, but other tools like Wordpress allow you to have spaces in tags, so they require commas. If you do things this way, you will have to create similar tag "slugs" to make sure they work cleanly in a URL (i.e. "my tag" would become my-tag).
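A slug of that sort is straightforward to generate; here is a hypothetical helper (not from any particular framework) in Python:

import re

def slugify(tag: str) -> str:
    # Lowercase, then collapse runs of non-alphanumerics into single hyphens.
    return re.sub(r"[^a-z0-9]+", "-", tag.lower()).strip("-")

slugify("My Tag")  # → 'my-tag'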
Remember, allowing too much freedom in your tag creation can result in some pretty crazy tags like "Things I like to do on a Saturday afternoon..." etc.
I think the idea is that a tag should be short and to the point.
If you go to type in "favorite recipes", when it doesn't work, you think to yourself "Oh, I should redo this." So you make "favorites" "recipes" instead.
If it was something you really needed, like "pork roast", then it makes sense to make it "pork-roast", but only after you've thought about really joining those words. Perhaps you shouldn't though - perhaps it should be "pork" "roast" so it shows up under searches for pork and searches for roast.
tl;dr It's for user experience and searching so they don't enter something that can't be easily searched for.
Yes, I also think that space-separated tags reduce complexity. And you can enter them faster, because you have the big space key to separate them. Maybe the brain is also less loaded, because it doesn't need to think about where to put a comma delimiter and where not.
I built an app with tags once, and used commas. The only downside to commas is that you have to do more thorough checking for empty tags. For instance, in the example:
"George Bush, Bill Clinton, Barack Obama, "
if someone posts the tags with a trailing comma followed by a space, that empty "tag" generally will get added to the database.
This is because you can't simply delete every space to fix it: that would turn Bill Clinton into BillClinton.
However, you can require that each tag be at least a certain number of characters to ensure there are no empty tags. That still won't catch a tag made of three or four spaces in a row, though.
Another thing to note is that humans generally put spaces before words and after commas.
So in the example above there is a space after each comma and before each president. This space will be included in the database, making the tag:
" Bill Clinton"
instead of
"Bill Clinton"
as the user probably attempted.
Once again, you can trim both leading and trailing spaces, but that is more server-side code to implement in this case.
If you just use spaces, then you can strip any unwanted characters like commas etc., and split the tags into an array using the space characters that separate them.
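Either way, the trimming and filtering amounts to a couple of lines. A minimal sketch in Python (the helper name is mine):

def parse_tags(raw: str, delimiter: str = ",") -> list[str]:
    # Split on the delimiter, trim surrounding whitespace, and drop
    # empties so trailing delimiters don't produce blank tags.
    return [t.strip() for t in raw.split(delimiter) if t.strip()]

parse_tags("George Bush, Bill Clinton, Barack Obama, ")
# → ['George Bush', 'Bill Clinton', 'Barack Obama']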
I actually prefer tags that are comma- or semicolon-delimited, for two basic reasons:
1. It's more natural (user-centric, not techie-centric)
2. It reduces redundancy, as the example noted: "favorite-recipes" "favorite_recipes" "FavoriteRecipes"
And I'm afraid some of the other answers sort of make it sound as if web developers are (how shall I say this) less willing to do the work than, for instance, database developers - who've been addressing some of these same issues for years. (I've coded for web and client-server databases.)

Regex to strip symbols but not foreign characters in Ruby

Does anyone have a good regex for stripping all symbols (';.,_\$#!% the carriage return etc) from a string, without damaging any foreign characters (é 多 فا etc)? Non-regex would be even better, I suppose, but I don't see any Ruby or Rails methods that do this.
What is a symbol? This seems like a fuzzy requirement. Is & a symbol, even though it's just shorthand for the word "and"? Is ! a symbol, even though it's used as an alphabetic character in transliterating some African languages? If $ is a symbol, does that mean 円 is as well? I think answering this question will go a long way to suggesting a course of action.
I think the closest you are likely to get with a regexp is /[^\w\s]/. Ruby 1.9's Regexp engine is meant to understand foreign languages well enough to correctly know which are "word" characters, so this will leave those and spaces. In my tests, this correctly removes punctuation from English, Japanese and German sentences while leaving the surrounding characters. But dollars to doughnuts there will be edge cases that trip up just about any solution — dealing with the huge variety of languages in the world (some of which don't even have words as we know them) is an incredibly complex task.
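For comparison, the same approach in Python 3, where re treats \w as Unicode-aware by default (this is just a parallel illustration, not the Ruby answer itself):

import re

text = "l'été, 多くの人々!? (fān)"
print(re.sub(r"[^\w\s]", "", text))  # strips punctuation, keeps letters in any script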
The good way to do this would be to use the new(ish) Unicode character classes in regex, such as \P{L} to match anything that is not a letter (in any language) according to Unicode. Unfortunately, it seems that Ruby doesn't support this, even in 1.9.
Perhaps the 1.9 regex parser is smart enough not to match the bytes that make up special symbols in Unicode characters, so simply enumerating all the characters to strip could work. That assumes, though, that you really can enumerate all the characters you wish to filter out, which might be a lot more than the symbols in ASCII: logical not, aeroplane, etc...
