SMS PDU and User Data Length

I'm writing code for handling SMS PDUs based on the ETSI GSM specifications. There is one thing I need to ask about. A PDU contains a User Data Length field followed by the User Data. According to GSM 03.40, the UDL field is the number of septets of user data when the uncompressed GSM default alphabet is used. However, it also says that when the data is compressed, the UDL is the number of octets of user data.
See the quotes:
If the TP User Data is coded using the GSM 7 bit default alphabet, the TP User Data Length field gives an integer representation of the number of septets within the TP User Data field to follow.
[...]
If the TP User Data is coded using compressed GSM 7 bit default alphabet or compressed 8 bit data or compressed UCS2 [24] data, the TP User Data Length field gives an integer representation of the number of octets after compression within the TP User Data field to follow.
The problem is that when the 7-bit data is compressed and the number of octets of compressed user data is a multiple of 7, you don't know whether the last 7 bits in the last octet are fill bits or a real character. In other words, 7 octets may contain either 7 or 8 seven-bit characters when compression is on. And when the UDL field is the number of octets, how can you tell whether those 7 octets contain 7 or 8 characters? If the UDL contained the number of septets, everything would be clear, right? So have I misunderstood the documentation, or does it really work this way?
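To make the ambiguity concrete, here is a small sketch of the arithmetic behind my question (my own illustration, not something from the spec):

#include <cstdio>

int main() {
    // For n octets of packed 7-bit data, floor(8*n / 7) septets fit completely.
    // When n is a multiple of 7 there are no leftover bits, so the final septet
    // could be either a real character or fill bits.
    for (int octets = 1; octets <= 14; ++octets) {
        int septets = (octets * 8) / 7;
        bool ambiguous = (octets * 8) % 7 == 0;
        std::printf("%2d octets -> %2d septets%s\n",
                    octets, septets, ambiguous ? " (or one fewer plus fill bits)" : "");
    }
    return 0;
}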
Could anyone please explain how it really works? Thanks in advance!

As you are already aware, creating a concatenated SMS message requires you to add a UDH before your text message. The UDH becomes part of your payload, thus reducing the number of characters you can send per segment.
As it has become part of your payload, it needs to conform to your payload's data requirement, which is 7-bit. The UDH, however, is 8-bit, which clearly complicates things.
Consider the following UDH as an example (it's a UDH for a concatenated message):
050003000302
05 is the length of the UDH (the 5 octets which follow)
00 is the IEI
03 is the IEDL (3 more octets)
00 is a reference (this number must be the same in each of your concatenated message UDHs)
03 is the maximum number of messages
02 is the current message number.
This is 6 octets in total, equating to 48 bits. This is all well and good, but since the UDH is actually part of your SMS message, what you have to do is add fill bits so that the actual message starts on a septet boundary. A septet boundary falls every 7 bits, so in this case we have to add 1 more bit to make the UDH 49 bits, and then we can add our standard GSM-7 encoded characters.
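A minimal sketch of that padding arithmetic (my own illustration; the variable names are made up):

#include <cstdio>

int main() {
    // UDHL byte plus the 5 octets that follow: 05 00 03 00 03 02
    int udh_octets = 6;
    int udh_bits = udh_octets * 8;                 // 48 bits
    int fill_bits = (7 - (udh_bits % 7)) % 7;      // 1 fill bit, so the text starts at bit 49
    int udh_septets = (udh_bits + fill_bits) / 7;  // 7 septets consumed by the header
    int text_septets = 160 - udh_septets;          // 153 GSM-7 characters left per segment
    std::printf("fill bits: %d, septets used by UDH: %d, characters left: %d\n",
                fill_bits, udh_septets, text_septets);
    return 0;
}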
You can read up more about this from here.

So, the thing is that I misunderstood the meaning of the compression bit in the Data Coding Scheme byte. I thought it referred to the 7-bit alphabet packing method (where the septets are packed so that more than one character can be stored per octet), but it actually refers to Huffman compression.
Therefore, the question above was kind of stupid. Sorry for that :-).

Related

maximum field number in protobuf message

The official documentation for protocol buffers, https://developers.google.com/protocol-buffers/docs/proto3, says the maximum field number for fields in a protobuf message is 2^29-1. But why this limit?
Can anyone please explain in some detail? I am a newbie to this.
I read the answers to this question about why 2^29-1 is the biggest key in protocol buffers, but I am still not clear on the reason.
Each field in an encoded protocol buffer has a header (called key or tag) prefixed to the actual encoded value. The encoding spec defines this key:
Each key in the streamed message is a varint with the value (field_number << 3) | wire_type – in other words, the last three bits of the number store the wire type.
Here the spec says the tag is a varint whose lowest 3 bits encode the wire type. A varint can encode a 64-bit value, so going by this definition alone the limit would be 2^61-1.
In addition to this, the Language Guide narrows this down to at most a 32-bit value.
The smallest field number you can specify is 1, and the largest is 2^29 - 1, or 536,870,911.
The reasons for this are not given. I can only speculate about the reasons behind it:
Artificial limit as no one is expecting a message to have that many fields. Just think about fitting a message with that many fields into memory.
As the key is a varint, it isn't simply the next 4 bytes in the raw buffer, but rather a variable-length run of bytes (Java code reading a varint32). Each byte has 7 bits of actual data and 1 bit indicating whether the end has been reached. It could be that for performance reasons it was deemed better to limit the range.
Since proto3 is the 3rd version of protocol buffers, it could be that either proto1 or proto2 defined the tag to be a varint32. To keep backwards compatibility this limit is still true in proto3 today.
Because of this line:
#define GOOGLE_PROTOBUF_WIRE_FORMAT_MAKE_TAG(FIELD_NUMBER, TYPE) \
static_cast<uint32>((static_cast<uint32>(FIELD_NUMBER) << 3) | (TYPE))
This line creates a "tag", which leaves only 29 (32 - 3) bits to store the field number.
I don't know why Google used uint32 instead of uint64 here; since the field number is a varint, maybe they decided 2^29-1 fields is large enough for a single message declaration.
I suspect this is simply so that a field-header (wire-type and tag-number) can be decoded and handled as a 32-bit value. The wire-type is always the 3 least significant bits, leaving 29 bits for the tag number. Technically "varint" should support 64 bits, but it makes sense to limit it to reasonable numbers, not least because "varint" encoding means that larger numbers take more bytes to encode.
Edit: I realise now that this is similar to the linked post, but... it remains true! Each field in protobuf is prefixed by a "varint" that expresses which field (tag-number) follows, and what data type it is (wire-type). The latter is important especially so that unexpected fields (version differences) can be stored or skipped correctly. It is convenient for that field-header to be trivially processed by most frameworks, and most frameworks are fine with 32-bit integers.
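To illustrate the arithmetic (a sketch of my own, not protobuf library code), the field header packs and unpacks like this:

#include <cstdint>
#include <cstdio>

int main() {
    // Same shape as the MAKE_TAG macro quoted above: low 3 bits = wire type,
    // the rest = field number. A 32-bit key leaves 29 bits, hence 2^29 - 1.
    uint32_t field_number = 536870911;              // 2^29 - 1, the documented maximum
    uint32_t wire_type = 0;                         // 0 = varint
    uint32_t key = (field_number << 3) | wire_type;

    uint32_t decoded_type = key & 0x7;              // last three bits
    uint32_t decoded_number = key >> 3;             // remaining 29 bits
    std::printf("key=0x%08X -> field %u, wire type %u\n",
                key, decoded_number, decoded_type);
    return 0;
}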
This is another question rather than an answer (more of a comment). In the documentation it says:
Field numbers in the range 16 through 2047 take two bytes. So you should reserve the numbers 1 through 15 for very frequently occurring message elements. Remember to leave some room for frequently occurring elements that might be added in the future.
Since, for the first byte, the top 5 bits are used for the field number and the bottom 3 bits for the field type, isn't it field numbers from 31 (because zero is not used) to 2047 that take two bytes? (I also guess the second byte's lower 3 bits are used for the field type as well. I'm in the middle of reading it, so I'll fix this when I know.)

Why does MQTT use such a strange encoding scheme for remaining length?

I've recently started writing an MQTT library for a microcontroller. I've been following the specification document. Section 2.2.3 explains how the remaining length field (part of the fixed header) encodes the number of bytes to follow in the rest of the packet.
It uses a slightly odd encoding scheme:
Byte 0 = a mod 128, a /= 128, if a > 0, set top bit and add byte 1
Byte 1 = a mod 128, a /= 128, if a > 0, set top bit... etc
This variable length encoding seems odd in this application. You could easily transmit the same number using fewer bytes, especially once you get into numbers that take 2-4 bytes using this scheme. MQTT was designed to be simple to use and implement. So why did they choose this scheme?
For example, decimal 15026222 would be encoded as 0xae 0x90 0x95 0x07, whereas as a plain binary number it's 0xE5482E, i.e. 3 bytes instead of four. The overhead of calculating the encoding and decoding it at the other end seems to contradict the idea that MQTT is supposed to be fast and simple to implement on an 8-bit microcontroller.
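For reference, here is a minimal sketch of the encode side of that scheme (my own code, reproducing the example above; the function name is made up):

#include <cstdint>
#include <cstdio>

// Each byte carries 7 bits of the value; the top bit means "more bytes follow".
// Returns the number of bytes written (at most 4 for MQTT's Remaining Length).
static int encode_remaining_length(uint32_t value, uint8_t out[4]) {
    int i = 0;
    do {
        uint8_t byte = value % 128;
        value /= 128;
        if (value > 0) byte |= 0x80;   // continuation bit
        out[i++] = byte;
    } while (value > 0 && i < 4);
    return i;
}

int main() {
    uint8_t buf[4];
    int n = encode_remaining_length(15026222, buf);
    for (int i = 0; i < n; ++i) std::printf("0x%02x ", buf[i]);  // 0xae 0x90 0x95 0x07
    std::printf("\n");
    return 0;
}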
What are the benefits to this encoding scheme? Why is it used? The only blog post I could find that even mentions any motivation is this one, which says:
The encoding of the remaining length field requires a bit of additional bit and byte handling, but the benefit is that only a single byte is needed for most messages while preserving the capability to send larger message up to 268’435’455 bytes.
But that doesn't make sense to me. You could have even more messages be only a single byte if you used the entire first byte to represent 0-255 instead of 0-127. And if you used a plain fixed-length 4-byte field, you could represent a number as large as 4 294 967 295 instead of only 268 435 455.
Does anyone have any idea why this was used?
As the post you cited explains, it works under the assumption that "only a single byte is needed for most messages"; in other words, under the assumption that most of the time a <= 127, a single byte is all that is needed to represent the value.
The alternatives are:
Use a field that explicitly indicates how many bytes (or bits) are used for a. This would require dedicating at least 2 bits in every message just to support a up to "4 bytes" in size.
Dedicate a fixed size to a, probably 4 bytes, in every message. This is inferior if many (read: most) messages don't need that size, and it can't support larger values if that ever becomes a requirement.

What is the maximum SMS message length?

What is the maximum SMS message length when sent through the Clickatell API for English and Spanish messages?
Is there a difference between English and Spanish message lengths, since Spanish may contain Unicode characters?
From the SMS wikipedia page:
Messages are sent with the MAP MO- and MT-ForwardSM operations, whose payload length is limited by the constraints of the signaling protocol to precisely 140 octets (140 octets = 140 * 8 bits = 1120 bits).
Depending on which alphabet the subscriber has configured in the handset, this leads to the maximum individual short message sizes of 160 7-bit characters, 140 8-bit characters, or 70 16-bit characters.
As for your other question:
Is there a difference between English and Spanish message lengths, since Spanish may contain Unicode characters?
No, there is no difference, as both English and Spanish are completely covered in the 8-bit Latin 1 character set.
SMS allows multiple messages to be strung together (with the length of each segment reduced slightly to allow for the "joining" data). I have experience of sending messages 612 characters long (4 SMS segments); there is a reduction of 7 characters per message segment. On the receiving system the parts may arrive out of sequence, and the message only makes sense once all parts have been received. The Clickatell API supports this: although their API guide at https://www.clickatell.com/downloads/http/Clickatell_HTTP.pdf recommends a practical maximum of 3 messages, it allows up to 35 (see section 4.2.7). So (ignoring Unicode for the moment) you can send a message of 35 * 153 = 5355 characters via the Clickatell API. If you are sending Unicode characters (which the OP is not), the character count for a single message is 70, reduced by 7 characters for each segment of a concatenated message, or 63 * 35 = 2205 Unicode characters.
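A small sketch of the GSM-7 capacity arithmetic behind those numbers (my own illustration; the constants are the ones discussed above):

#include <cstdio>

int main() {
    int payload_octets = 140;                                // per-SMS payload
    int single_message_chars = payload_octets * 8 / 7;       // 160 packed 7-bit characters
    int per_segment_chars = (payload_octets - 6) * 8 / 7;    // 153 once a 6-octet UDH is accounted for
    int max_segments = 35;                                   // Clickatell's stated maximum
    std::printf("single: %d, per segment: %d, %d segments: %d\n",
                single_message_chars, per_segment_chars, max_segments,
                per_segment_chars * max_segments);           // 35 * 153 = 5355
    return 0;
}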
SMS messages can contain 140 bytes of data. However, SMS data is sent as a bitstream, so if you are sending 7-bit characters you can send 160 of them (140 octets * 8 bits / 7 bits per character = 160).

Handling 32-bit ifindex values inside of SNMP

In developing my own SNMP poller, I've come across the problem of polling devices with 32-bit interface indexes. I can't find anything out there explaining how to convert the hex (5 bytes) into the 32-bit integer, or from the integer into hex, as it doesn't use a simple hex conversion. For example, the interface index is 436219904. While capturing a pcap of an snmpget, I see the hex for this is 81 d0 80 e0 00, which makes no sense to me. I cannot for the life of me figure out how that converts to an integer value. I've tried to find an RFC dealing with this and have had no luck. The 16-bit interface values convert as they should, 0001 = 1 and so on. Only the 32-bit ones seem to be giving me this problem. Any help is appreciated.
SNMP uses ASN.1 syntax to encode data. Thus, you need to learn the BER rules,
http://en.wikipedia.org/wiki/X.690
For your case, I can say you looked at the wrong data: if 436219904 were encoded as an Integer32 in SNMP, the bytes would be 1A 00 30 00.
I guess you have missed some details in the analysis, so you might want to do it once again and add more descriptions (screenshot and so on) to enrich your question.
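For illustration, a minimal sketch of how a positive INTEGER value gets BER-encoded (my own code, not from any SNMP library), which is where the 1A 00 30 00 above comes from:

#include <cstdint>
#include <cstdio>
#include <vector>

// BER INTEGER: tag 0x02, a length octet, then the value in big-endian
// two's complement with leading zero octets stripped.
static std::vector<uint8_t> ber_encode_positive_int(uint32_t v) {
    std::vector<uint8_t> content;
    for (int shift = 24; shift >= 0; shift -= 8) {
        uint8_t byte = (v >> shift) & 0xFF;
        if (!content.empty() || byte != 0 || shift == 0) content.push_back(byte);
    }
    // Prepend 0x00 if the top bit is set, so the value is not read as negative.
    if (content.front() & 0x80) content.insert(content.begin(), 0x00);
    std::vector<uint8_t> tlv = {0x02, static_cast<uint8_t>(content.size())};
    tlv.insert(tlv.end(), content.begin(), content.end());
    return tlv;
}

int main() {
    for (uint8_t b : ber_encode_positive_int(436219904)) std::printf("%02X ", b); // 02 04 1A 00 30 00
    std::printf("\n");
    return 0;
}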
I suspect the key piece of information missing from your question is that the ifIndex value you are polling with is used as an index into the table being polled (you don't mention which, but let's assume ifTable). That means it is encoded as a subidentifier of the OID being polled (give me [some property] for [this ifIndex]) rather than as a returned value (give me [the ifIndex] for [some other row of some other table]).
Per X.209 (the version of the ASN.1 Basic Encoding Rules used by SNMP), subidentifiers in OIDs (other than the first two) are encoded in one or more octets (8 bits each), with the highest-order bit used as a continuation bit (i.e. "the next octet is part of this subidentifier too") and the remaining 7 bits used for the actual value.
In other words, in your value "81 d0 80 e0 00", the highest bit is set in each of the first 4 octets and cleared in the last octet: this is how you know there are 5 octets in the subidentifier. The remaining 7 bits of each of those octets are concatenated to arrive at the integer value.
The converse of course is that to encode an integer value into a subidentifier of an OID, you have to build it 7 bits at a time.
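A minimal sketch of both directions (my own code; the function names are made up), using the value from the question:

#include <cstdint>
#include <cstdio>
#include <vector>

// 7 bits of value per octet, high bit set on every octet except the last.
static std::vector<uint8_t> encode_subidentifier(uint32_t v) {
    std::vector<uint8_t> out;
    do {
        out.insert(out.begin(), static_cast<uint8_t>(v & 0x7F));
        v >>= 7;
    } while (v > 0);
    for (size_t i = 0; i + 1 < out.size(); ++i) out[i] |= 0x80;  // continuation bits
    return out;
}

static uint32_t decode_subidentifier(const std::vector<uint8_t>& bytes) {
    uint32_t v = 0;
    for (uint8_t b : bytes) v = (v << 7) | (b & 0x7F);
    return v;
}

int main() {
    std::vector<uint8_t> enc = encode_subidentifier(436219904);
    for (uint8_t b : enc) std::printf("%02x ", b);                // 81 d0 80 e0 00
    std::printf("-> %u\n", decode_subidentifier(enc));            // 436219904
    return 0;
}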

Efficient encoding of basic numeric datatypes in protobuffers

The Protocol Buffers documentation says:
"For historical reasons, repeated fields of basic numeric types aren't encoded as efficiently as they could be. New code should use the special option [packed=true] to get a more efficient encoding. For example:
repeated int32 samples = 4 [packed=true];"
Can someone clearly explain how the option [packed=true] improves the efficiency of encoding basic numeric datatypes?
Basically, under the original encoding the field header (which is composed of the wire type combined with the field-number, bit-shifted and or'd) occurs for every element. Because the header is varint encoded, it is at least one byte per element, but possibly more. So 10 4-byte floats would be at least 50 bytes and quite possibly 90 bytes if the header takes 5 bytes (large field numbers take more space than small field numbers).
With the packed encoding, the field header occurs only once, followed by a varint that indicates the number of bytes to follow. So for 10 floats, the payload length is 40, which is varint-encoded in a single byte for the length prefix. At deserialization time it simply consumes that-many bytes, reading elements as it does so. Therefore for the same data (50 to 90 bytes previously) we are now using 42 to 46 bytes (again, for the range of field numbers that take 1 to 5 bytes each).
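As a worked sketch of that arithmetic (assuming field number 4 and ten 4-byte floats, as in the example above):

#include <cstdio>

int main() {
    int elements = 10;
    int element_size = 4;        // 32-bit float
    int header_bytes = 1;        // field 4 with its wire type fits in a single varint byte

    int unpacked = elements * (header_bytes + element_size);            // 10 * 5 = 50 bytes
    int packed = header_bytes + 1 /* length prefix: 40 fits in 1 byte */
                 + elements * element_size;                             // 1 + 1 + 40 = 42 bytes
    std::printf("unpacked: %d bytes, packed: %d bytes\n", unpacked, packed);
    return 0;
}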
These 2 layouts are very different on the wire, and code expecting one can not usually decode the other. As such, it needs to be explicitly enabled to prevent breaking existing messages.
