What is the maximum length of a valid email address? - validation

What is the maximum length of a valid email address? Is it defined by any standard?

An email address must not exceed 254 characters.
This was accepted by the IETF following a submitted erratum. A full diagnosis of any given address is available online. The original version of RFC 3696 described 320 as the maximum length, but John Klensin subsequently accepted that this was an incorrect value, since a Path is defined as
Path = "<" [ A-d-l ":" ] Mailbox ">"
So the Mailbox element (i.e., the email address) has angle brackets around it to form a Path, and must therefore have a maximum length of 254 characters to restrict the Path length to 256 characters or fewer.
RFC 5321 specifies the maximum length:
The maximum total length of a reverse-path or forward-path is 256 characters.
RFC 3696 was corrected here.
People should be aware of the errata against RFC 3696 in particular. Three of the canonical examples are in fact invalid addresses.
I've collated a couple hundred test addresses, which you can find at http://www.dominicsayers.com/isemail

320
And the segments look like this
{64}@{255}
64 + 1 + 255 = 320
You should also read this if you are validating emails: I Knew How To Validate An Email Address Until I Read The RFC

user
The maximum total length of a user name is 64 characters.
domain
Maximum of 255 characters in the domain part (the one after the “@”)
However, there is a restriction in RFC 2821 reading:
“The maximum total length of a reverse-path or forward-path is 256
characters, including the punctuation and element separators”. Since
addresses that don’t fit in those fields are not normally useful, the
upper limit on address lengths should normally be considered to be
256, but a path is defined as: Path = “<” [ A-d-l “:” ] Mailbox “>”
The forward-path will contain at least a pair of angle brackets in
addition to the Mailbox, which limits the email address to 254
characters.
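The limits quoted above can be combined into a simple length check. This is only a minimal sketch, not a full validator: the function name `within_length_limits` and the constant names are hypothetical, and the code splits on the last "@" without checking the syntax of either part.

```python
# Limits as quoted above; the names are illustrative, not from any library.
MAX_PATH = 256               # reverse-path / forward-path, including "<" and ">"
MAX_ADDRESS = MAX_PATH - 2   # 254: the path minus the two angle brackets
MAX_LOCAL_PART = 64
MAX_DOMAIN = 255


def within_length_limits(address: str) -> bool:
    """Return True if the address respects the length limits only.

    The limits are counted in octets, so each part is encoded first.
    """
    local_part, sep, domain = address.rpartition("@")
    if not sep or not local_part or not domain:
        return False
    return (
        len(address.encode("utf-8")) <= MAX_ADDRESS
        and len(local_part.encode("utf-8")) <= MAX_LOCAL_PART
        and len(domain.encode("utf-8")) <= MAX_DOMAIN
    )


print(within_length_limits("me@example.com"))            # True
print(within_length_limits("a" * 64 + "@" + "b" * 189))  # True: exactly 254
print(within_length_limits("a" * 64 + "@" + "b" * 190))  # False: 255 in total
```

Passing this check says nothing about whether the address is otherwise valid; it only rules out addresses that could not fit in a path.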

According to the below article:
https://www.rfc-editor.org/rfc/rfc3696 (Page 6, Section 3)
It's mentioned that:
"There is a length limit on
email addresses. That limit is a maximum of 64 characters (octets)
in the "local part" (before the "@") and a maximum of 255 characters
(octets) in the domain part (after the "@") for a total length of 320
characters. Systems that handle email should be prepared to process
addresses which are that long, even though they are rarely
encountered."
So, the maximum total length for an email address is 320 characters
("local part": 64 + "@": 1 + "domain part": 255 which sums to 320)

To help the confused rookies like me, the answer to "What is the maximum length of a valid email address?" is 254 characters.
If your application uses an email, just set your field to accept 254 characters or less and you are good to go.
You can run a bunch of tests on an email to see if it is valid here. http://isemail.info/
An RFC (Request for Comments) is a type of publication from the Internet Engineering Task Force (IETF). RFC 5321 limits a path to 256 characters, which leaves 254 characters for the address itself. It is located here - https://www.rfc-editor.org/rfc/rfc5321#section-4.5.3

The other answers muddy the water a bit.
Simple answer: 254 total characters are under our control for an email address.
256 characters are for the ENTIRE path, which includes the implied "<" at the beginning and ">" at the end. Therefore, 254 are left over for our use.

TLDR Answer
Given an email address like...
me@example.com
The length limits are as follows:
Entire Email Address (aka: "The Path"): i.e., me@example.com -- 256 characters maximum.
Local-Part: i.e., me -- 64 character maximum.
Domain: i.e., example.com -- 254 characters maximum.
Source — TLDR;
The RFC standards are constantly evolving, but if you want a 2009 IETF source in a single line:
...the upper limit on address lengths should normally be considered to be 256. (Source: RFC3696.)
Source — The History
SMTP originally defined what a path was in RFC821, published August 1982, which is an official Internet Standard (most RFCs are only proposals). To quote it...
...a reverse-path, specifies who the mail is from.
...a forward-path, which specifies who the mail is to.
RFC2821, published in April 2001, is the Obsoleted Standard that defined our present maximum values for local-parts, domains, and paths. A new Draft Standard, RFC5321, published in October 2008, keeps the same limits. In between these two dates, RFC3696 was published, in February 2004. It mistakenly cites the maximum email address limit as 320 characters, but this document is "Informational" only, and states: "This memo provides information for the Internet community. It does not specify an Internet standard of any kind." So, we can disregard it.
To quote RFC2821, the modern, accepted standard as confirmed in RFC5321...
4.5.3.1.1. Local-part
The maximum total length of a user name or other local-part is 64
characters.
4.5.3.1.2. Domain
The maximum total length of a domain name or number is 255 characters.
4.5.3.1.3. Path
The maximum total length of a reverse-path or forward-path is 256
characters (including the punctuation and element separators).
You'll notice that I indicate a domain maximum of 254 and the RFC indicates a domain maximum of 255. It's a matter of simple arithmetic. A 255-character domain, plus the "@" sign, is a 256-character path, which is the max path length. An empty or blank name is invalid, though, so the domain actually has a maximum of 254.

Sadly, all the other answers are wrong. Most of them cite RFC 2821 or newer, which does not even define e-mail addresses. What it does do is define paths. E-mail addresses are defined by RFC 2822 (or newer) and can be much longer. Examples of valid addresses that are not valid paths are:
(Firstname Lastname) user@domain
Firstname Lastname <user@domain>
Both of these are the same mailbox written differently. So if your goal is to store e-mail addresses in a database, a limit of 254, 256 or 320 octets might be too low, although in practice, this is rarely going to be a problem.
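To illustrate the difference between the written address and the mailbox inside it, here is a small sketch using `email.utils.parseaddr` from the Python standard library. The helper name `split_address` and the example.com address are assumptions made for illustration only.

```python
from email.utils import parseaddr


def split_address(written: str) -> tuple[str, str]:
    """Split a written address into (display name, mailbox)."""
    return parseaddr(written)


written = "Firstname Lastname <user@example.com>"
name, mailbox = split_address(written)
print(name)      # Firstname Lastname
print(mailbox)   # user@example.com

# The written form is longer than the mailbox it contains, so a column
# sized for the mailbox alone may be too short for the written form.
print(len(written) > len(mailbox))   # True
```

Whether to store the written form or only the mailbox is a design choice; the length limits discussed in the other answers apply to the mailbox.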

Related

Clarifications for FHIR R4 string element

There are a couple of things that I am having trouble with regarding HL7 FHIR R4 strings (https://www.hl7.org/fhir/datatypes.html#string):
The specification mentions: Note that strings SHALL NOT exceed 1MB (1024*1024 characters) in size. The trouble I am having with this is that 1024x1024 Unicode characters are not always 1MB in size. Besides that, it is unclear to me which Unicode encoding is meant here, and I will assume UTF-8, since that is the default for both XML and JSON. For example, the character '🦁' needs 4 bytes to encode, therefore 1024x1024 of such characters would be 4MB in size. The regexes in the notes, though not normative, make this a bit clearer, but not much. They state that codes up to FFFF are OK, which means a maximum of 3 bytes per character, which would still exceed the 1MB limit by a factor of 3. My interpretation is that we would like a reasonable limit that doesn't open the door to denial-of-service attacks. Therefore I would like to suggest keeping the meaningful 1MB limit but either dropping the number-of-characters requirement OR adding it as a separate requirement.
The specification mentions: Therefore strings SHOULD always contain non-whitespace content. It does not mention what it considers whitespace. Is this just the three codes mentioned earlier representing horizontal tab, carriage return and line feed or are more exotic whitespace characters also prohibited, like next line or no-break space?
Ok, that about sums up my questions about the string specifications. Hope that someone can help me out.
Best,
Dirk
The rule is clearly expressed in characters explicitly because Unicode characters have variable length. There is no maximum in bytes, only in characters (though given Unicode rules, you could calculate what the maximum possible length in bytes might be). If you feel this isn't sufficiently clear, feel free to submit a change request.
The expectation is a string SHOULD always have textual content. If you have nothing to say, omit the element. Trying to work around the "no empty string" limitation by transmitting a non-breaking space or some other non-visible character to meet the non-empty requirement while not actually conveying any human-readable information would be contrary to the intent of the specification. We don't demand that systems enforce this because trying to figure out all the creative ways implementers might have of conveying "no useful text" with Unicode isn't terribly practical. I believe the Java code just does a trim() and compares the result to empty string.
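The difference between counting characters and counting bytes, and the trim-style emptiness check mentioned above, can be sketched as follows. The helper names `utf8_size` and `has_content` are hypothetical, and UTF-8 is assumed as the encoding, as in the question. Note that Python's `str.strip()` removes Unicode whitespace such as the no-break space, which may differ from trim functions in other languages.

```python
LIMIT_CHARS = 1024 * 1024  # the limit quoted above, counted in characters


def utf8_size(text: str) -> int:
    """Number of bytes needed to encode the text as UTF-8."""
    return len(text.encode("utf-8"))


def has_content(text: str) -> bool:
    """True if anything is left after stripping leading/trailing whitespace."""
    return bool(text.strip())


lion = "\U0001F981"                 # the lion face character from the question
print(len(lion), utf8_size(lion))   # 1 4

# Worst case in UTF-8: every character needs 4 bytes.
print(LIMIT_CHARS * 4)              # 4194304

print(has_content("\u00a0\t\r\n"))  # False
print(has_content(" text "))        # True
```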

maximum field number in protobuf message

The official document for protocol buffers https://developers.google.com/protocol-buffers/docs/proto3 says the maximum field number for fields in protobuf message is 2^29-1. But why is this limit?
Can anyone please explain in some detail? I am a newbie to this.
I read the answers to the question "why 2^29-1 is the biggest key in protocol buffers".
But it is still not clear to me.
Each field in an encoded protocol buffer has a header (called key or tag) prefixed to the actual encoded value. The encoding spec defines this key:
Each key in the streamed message is a varint with the value (field_number << 3) | wire_type – in other words, the last three bits of the number store the wire type.
Here the spec says the tag is a varint whose lowest 3 bits are used to encode the wire type. A varint can encode a 64-bit value, so going by this definition alone the limit would be 2^61-1.
In addition to this, the Language Guide narrows this down to a 32 bit value at max.
The smallest field number you can specify is 1, and the largest is 2^29 - 1, or 536,870,911.
The reasons for this are not given. I can only speculate for the reasons behind this:
Artificial limit as no one is expecting a message to have that many fields. Just think about fitting a message with that many fields into memory.
As the key is a varint, it isn't simply the next 4 bytes in the raw buffer, but rather a variable number of bytes (Java code reading a varint32). Each byte has 7 bits of actual data and 1 bit indicating whether the end is reached. It could be that for performance reasons it was deemed better to limit the range.
Since proto3 is the 3rd version of protocol buffers, it could be that either proto1 or proto2 defined the tag to be a varint32. To keep backwards compatibility this limit is still true in proto3 today.
Because of this line:
#define GOOGLE_PROTOBUF_WIRE_FORMAT_MAKE_TAG(FIELD_NUMBER, TYPE) \
static_cast<uint32>((static_cast<uint32>(FIELD_NUMBER) << 3) | (TYPE))
This line creates a "tag", which leaves only 29 (32 - 3) bits to store the field number.
I don't know why Google uses uint32 instead of uint64, though, since the field number is a varint; maybe they think 2^29-1 fields is large enough for a single message declaration.
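The arithmetic behind the macro can be checked with a short Python sketch. The function `make_tag` below is a hypothetical mirror of the macro, with the 32-bit truncation of `uint32` emulated by a mask; it is not part of any protobuf library.

```python
MAX_FIELD_NUMBER = 2**29 - 1  # 536,870,911, the limit quoted from the Language Guide


def make_tag(field_number: int, wire_type: int) -> int:
    """Shift the field number left by 3, OR in the wire type, keep 32 bits."""
    return ((field_number << 3) | wire_type) & 0xFFFFFFFF


print(MAX_FIELD_NUMBER)                     # 536870911
print(hex(make_tag(MAX_FIELD_NUMBER, 5)))   # 0xfffffffd
print(make_tag(MAX_FIELD_NUMBER + 1, 0))    # 0: the field number no longer fits
```

With 3 bits taken by the wire type, the largest field number fills the remaining 29 bits exactly, and anything larger is lost when the value is truncated to 32 bits.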
I suspect this is simply so that a field-header (wire-type and tag-number) can be decoded and handled as a 32-bit value. The wire-type is always the 3 least significant bits, leaving 29 bits for the tag number. Technically "varint" should support 64 bits, but it makes sense to limit it to reasonable numbers, not least because "varint" encoding means that larger numbers take more bytes to encode.
Edit: I realise now that this is similar to the linked post, but... it remains true! Each field in protobuf is prefixed by a "varint" that expresses what field (tag-number) follows, and what data type it is (wire-type). The latter is important especially so that unexpected fields (version differences) can be stored or skipped correctly. It is convenient for that field-header to be trivially processed by most frameworks, and most frameworks are fine with 32-bit integers.
This is another question rather than a comment. The documentation says:
Field numbers in the range 16 through 2047 take two bytes. So you
should reserve the numbers 1 through 15 for very frequently occurring
message elements. Remember to leave some room for frequently occurring
elements that might be added in the future.
Since in the first byte the top 5 bits are used for the field number and the bottom 3 bits for the field type, shouldn't it be field numbers from 31 (because zero is not used) to 2047 that take two bytes? (I also guess the second byte's lower 3 bits are used for the field type as well. I'm in the middle of reading it, so I'll fix this when I know.)
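The byte counts in the quoted documentation can be checked by encoding the key as a varint. In a varint, the top bit of every byte is a continuation flag, so the first byte carries only 7 bits of the key: 3 for the wire type and 4 for the field number. The sketch below uses hypothetical helper names (`encode_varint`, `key_size`) and is not taken from any protobuf library.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F      # low 7 bits of what is left
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte, continuation bit clear
            return bytes(out)


def key_size(field_number: int, wire_type: int = 0) -> int:
    """Number of bytes used by the key (field_number << 3) | wire_type."""
    return len(encode_varint((field_number << 3) | wire_type))


for number in (1, 15, 16, 2047, 2048):
    print(number, key_size(number))
# 1 1
# 15 1
# 16 2
# 2047 2
# 2048 3
```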

What is the maximum SMS message length?

What is the maximum SMS message length when sent through the Clickatell API for English and Spanish messages?
Is there a difference between English and Spanish message lengths, since Spanish may contain Unicode characters?
From the SMS wikipedia page:
Messages are sent with the MAP MO- and MT-ForwardSM operations, whose payload length is limited by the constraints of the signaling protocol to precisely 140 octets (140 octets = 140 * 8 bits = 1120 bits).
Depending on which alphabet the subscriber has configured in the handset, this leads to the maximum individual short message sizes of 160 7-bit characters, 140 8-bit characters, or 70 16-bit characters.
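The three figures follow directly from dividing the 1120-bit payload by the character width. A minimal sketch of that arithmetic (the function name `max_characters` is made up for illustration):

```python
PAYLOAD_OCTETS = 140
PAYLOAD_BITS = PAYLOAD_OCTETS * 8  # 1120 bits


def max_characters(bits_per_character: int) -> int:
    """Whole characters of the given width that fit in one message."""
    return PAYLOAD_BITS // bits_per_character


for bits in (7, 8, 16):
    print(bits, max_characters(bits))
# 7 160
# 8 140
# 16 70
```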
As for your other question:
Is there a difference between English and Spanish message lengths, since Spanish may contain Unicode characters?
No, there is no difference, as both English and Spanish are completely covered in the 8-bit Latin 1 character set.
SMS allows multiple SMS messages to be strung together (with the length of each reduced to allow for "joining" data). I have experience of sending messages of 612 characters (4 SMS messages); there is a reduction of 7 characters per message segment. On the receiving system the parts may be received out of sequence, with the message only making sense once all parts have been received. The Clickatell API allows this: although their API guide at https://www.clickatell.com/downloads/http/Clickatell_HTTP.pdf recommends a practical maximum of 3 messages, it allows up to 35 (see section 4.2.7). So (ignoring Unicode for the moment) you can send a message of 35 * 153 = 5355 characters via the Clickatell API. If you are sending Unicode characters (which the OP is not), the character count for a single message is 70, reduced by 7 characters for each segment in a concatenated message, giving 63 * 35 = 2205 Unicode characters.
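For 7-bit text, the segment arithmetic described above can be sketched like this. The constants come from the figures in this answer (160 characters for a single message, 7 characters lost per segment when concatenating); the function name `segments_needed` is hypothetical.

```python
SINGLE_LIMIT = 160                                # one unconcatenated 7-bit message
SEGMENT_OVERHEAD = 7                              # characters lost per segment
SEGMENT_LIMIT = SINGLE_LIMIT - SEGMENT_OVERHEAD   # 153


def segments_needed(length: int) -> int:
    """Number of SMS segments needed for a 7-bit message of this length."""
    if length <= SINGLE_LIMIT:
        return 1
    # Integer ceiling division by the per-segment capacity.
    return (length + SEGMENT_LIMIT - 1) // SEGMENT_LIMIT


print(segments_needed(160))    # 1
print(segments_needed(161))    # 2
print(segments_needed(612))    # 4
print(segments_needed(5355))   # 35
```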
SMS messages can contain data of 140 bytes. However, SMS data is sent as a bitstream. This means if you are sending 7 bit ASCII, you can send 160 characters.

SMS Text message - do line breaks count as characters?

I'm developing a system to integrate with an SMS API, and I was wondering whether or not newlines count towards the character limit, I can't find any documentation on this.
Thanks
Yes, it does. ANY character you embed in the message counts against the limit, whether it's "visible" or not. Even if it's not a printable character, you've still sent that character across the cell network and caused the cell network to use up (as they claim) vast amounts of bandwidth that must be charged ludicrous rates to handle.
Any character counts toward your SMS limit. Line breaks, included.
I actually can't find a standard or anything. But I do know the message size is limited to 160 7-bit characters, 140 8-bit characters, or 70 16-bit characters, depending on the alphabet used.
Sorry, I can't give you any sources.
I just tested it on my phone. Yes, a line break counts
Yes, each line break, space and carriage return adds to your character count.
1 space = 1 character
1 line break = 2 characters
Yes, it does. Also, as a side note, I noticed that if there are any Unicode characters in the text, the entire message is treated as Unicode and the length of the message is multiplied by three.

SMS PDU and User Data Length

I'm writing code for handling SMS PDUs based on the ETSI GSM documentation. There is one thing I need to ask about. A PDU contains a User Data Length field followed by User Data. According to GSM 03.40, the UDL field is the number of septets of user data when the uncompressed GSM default alphabet is used. However, it also says that when the data is compressed, the UDL is the number of octets of user data.
See the quotes:
If the TP User Data is coded using the
GSM 7 bit default alphabet, the TP
User Data Length field gives an
integer representation of the number
of septets within the TP User Data
field to follow.
[...]
If the TP User Data is coded using
compressed GSM 7 bit default alphabet
or compressed 8 bit data or compressed
UCS2 [24] data, the TP User Data
Length field gives an integer
representation of the number of octets
after compression within the TP User
Data field to follow.
The problem is that when the 7-bit data is compressed and the number of octets of the compressed user data is a multiple of 7, you don't know whether the last 7 bits in the last octet are fill bits or a real character. I.e. 7 octets may contain either 7 or 8 7-bit characters when compression is on. And when the UDL field is the number of octets, how can you know whether those 7 octets contain 7 or 8 characters?? If UDL contained the number of septets, everything would be clear, right? So have I misunderstood the documentation or does it really work this way?
Could anyone please explain to me how it really works? Thanks in advance!
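The ambiguity described in the question comes from plain 7-bit packing, where both 7 and 8 septets occupy 7 octets. The following is a minimal sketch of that packing, assuming septets are packed starting from the least significant bit of each octet; the function name `pack_septets` is made up, and ASCII codes stand in for GSM default alphabet codes (the two coincide for basic Latin letters).

```python
def pack_septets(septets: list[int]) -> bytes:
    """Pack 7-bit values into octets, least significant bits first."""
    bits = 0      # bit accumulator
    nbits = 0     # number of valid bits in the accumulator
    out = bytearray()
    for septet in septets:
        bits |= (septet & 0x7F) << nbits
        nbits += 7
        while nbits >= 8:
            out.append(bits & 0xFF)
            bits >>= 8
            nbits -= 8
    if nbits:
        out.append(bits & 0xFF)   # final partial octet, padded with zero bits
    return bytes(out)


print(pack_septets([ord(c) for c in "hello"]).hex())   # e8329bfd06
print(len(pack_septets([0x41] * 7)))                   # 7
print(len(pack_septets([0x41] * 8)))                   # 7
```

Because both lengths give 7 octets, an octet count alone cannot tell the two cases apart, which is why the septet count is used for uncompressed 7-bit data.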
As you are already aware, creating a multi-part SMS message requires you to add a UDH before your text message. The UDH becomes part of your payload, thus reducing the number of characters you can send per segment.
As it has become part of your payload, it needs to conform to your payload's data requirement, which is 7-bit. The UDH, however, is 8-bit, which clearly complicates things.
Consider the UDH of the following as an example (It's a UDH for a concatenated message):
050003000302
05 is the length of the UDH (the 5 octets which follow)
00 is the IEI
03 is the IEDL (3 more octets)
00 is a reference (this number must be the same in each of your concatenated message UDH's)
03 is the maximum number of messages
02 is the current message number.
This is 6 octets in total, equating to 48 bits. This is all well and good, but since the UDH is actually part of your SMS message, what you have to do is add more bits so that the actual message starts on a septet boundary. A septet boundary is every 7 bits, so in this case we will have to add 1 more bit of data to make the UDH 49 bits, and then we can add our standard GSM-7 encoded characters.
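The breakdown and the padding calculation above can be reproduced with a few lines of Python. This is a sketch for this one example UDH only; the variable names and the helper `fill_bits` are made up for illustration.

```python
udh = bytes.fromhex("050003000302")

# Unpack the six octets in the order described above.
udhl, iei, iedl, reference, total_parts, part_number = udh
print(udhl, iei, iedl, reference, total_parts, part_number)   # 5 0 3 0 3 2


def fill_bits(udh_octets: int) -> int:
    """Bits of padding needed so the text starts on a septet boundary."""
    return (7 - (udh_octets * 8) % 7) % 7


padding = fill_bits(len(udh))
septets_used = (len(udh) * 8 + padding) // 7

print(padding)              # 1
print(septets_used)         # 7
print(160 - septets_used)   # 153 septets left for text in this segment
```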
You can read up more about this from Here
So, the thing is that I misunderstood the meaning of the compression bit in the Data Coding Scheme byte. I thought it referred to the 7-bit alphabet packing method (where at least one character is stored within one byte) but it refers to Huffman compression.
Therefore, the question above was kind of stupid. Sorry for that :-).

Resources