Scrambling GBT data to identify counterpart - VHDL

This question is FPGA-design-and-language agnostic. I use bidirectional gigabit optical transmission lines (GBT) for communication between two distant counterparts. The GBT frame payload is 80 bits, of which I use 64, so I have 16 additional bits for spare usage.
The master sends a packet of 80 bits every 25 ns, consecutively, and the same happens in the other direction. I need to ensure that the master sends the data to a client that has a specific firmware implemented, so during the communication I need to identify that the client is equipped with the firmware version required to digest the data I'm sending. The communication is not transaction based; the optical link rather realizes a seamless 80-bit-register to 80-bit-register pass-through. Unfortunately, I cannot simply set the 16 spare bits at my disposal to a constant value and somehow encode the target firmware into that constant. Such a method is quite common, and it could not guarantee a 100% firmware match.
I was thinking whether there exists some sort of symmetric data scrambling which could be used to send the needed constants from slave to master over such a scrambled channel. I was wondering if there is some 'standard' solution for identifying two counterparts in a hardware communication channel by reasonably simple means in terms of required logic elements.
I do not want to encrypt the data. I just want to ensure that the firmwares match.
What is the recommended way to handle this?

You could simply barrel-rotate your data by one bit. This would result in essentially a Caesar cipher - not cryptographically secure, but no one is going to accidentally implement the same thing in a different firmware. Also, it gives you the ability to differentiate between 79 different firmware versions (shift one bit, two bits, three ...).
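As a rough illustration of the idea, here is what a rotate-by-N scrambler looks like in C, applied to the 64-bit part of the payload for simplicity (in the actual VHDL design this is just a fixed rewiring of the bus, so it costs essentially no logic; the constant N chosen per firmware version and the 80-bit extension are up to you):

    #include <stdint.h>

    /* Rotate the 64-bit payload left by n bits (1 <= n <= 63).
       Each firmware version picks its own fixed n; a receiver that
       un-rotates by the same n and sees sensible contents knows the
       counterpart was built with the matching version. */
    static uint64_t scramble64(uint64_t payload, unsigned n)
    {
        n &= 63;                    /* keep the shift amount in range      */
        if (n == 0)
            return payload;         /* avoid an undefined 64-bit shift     */
        return (payload << n) | (payload >> (64 - n));
    }

    /* Descrambling is simply the rotation in the opposite direction. */
    static uint64_t descramble64(uint64_t payload, unsigned n)
    {
        return scramble64(payload, (64 - (n & 63)) & 63);
    }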
However, this seems overly complicated to me. I can't think of a time when either transmitting a version code in the 16 spare bits or exchanging version numbers at the beginning of communication in a handshake-style pattern wouldn't make more sense.

Related

in ESP32 / ESP-IDF - when to use EEPROM vs NVS vs SPIFFS?

I'm fairly new to doing production work on ESP32 microcontrollers, and I'm wanting a little context and nuance from people who've been around the block a few times. So this question is a bit more on that kind of thing rather than a "how do I code X" kind of question.
I have lots of data storage needs on my current project.
larger blobs of data that need to be stored less often
smaller blobs of data that need to be updated more often
factory settings (like serial number, board revision, etc) that are particular to a given device, but aren't going to be encoded in C.
etc
I'm familiar with storing data in "blobs", and I'm familiar with encoding / decoding data with protocol buffers.
So given all that, I'm trying to gain context on the differences between my various storage options on the ESP32, and when to use each.
EEPROM
NVS
SPIFFS / LittleFS
other options...
What use cases make you pick one of these options over another?
There's no EEPROM on the ESP32, just the flash.
NVS is a simple non-volatile key-value store with different data types (integers 8-64 bits, strings, blobs). It's reasonably convenient to use, does wear levelling and supports flash encryption (although that's a bit of a hassle). I'd use it for storing factory settings and anything else which is reasonably small (there's a 4000 byte limit on strings, 508,000 byte limit on blobs). If the device needs to write often, you might want to create a separate, dedicated, read-only NVS partition for storing device attributes (serial, hw info) so it's guaranteed to not get clobbered by power failures during write.
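As a minimal sketch of what that looks like with ESP-IDF's NVS API (v4-style function names; the "factory" partition label, the "device" namespace and the key names are hypothetical, and error handling is abbreviated):

    #include "esp_err.h"
    #include "nvs_flash.h"
    #include "nvs.h"

    void read_factory_settings(void)
    {
        /* Assuming a dedicated, read-only NVS partition labelled "factory"
           was written at manufacturing time (hypothetical label). */
        ESP_ERROR_CHECK(nvs_flash_init_partition("factory"));

        nvs_handle_t handle;
        ESP_ERROR_CHECK(nvs_open_from_partition("factory", "device",
                                                NVS_READONLY, &handle));

        char serial[32];
        size_t len = sizeof(serial);
        if (nvs_get_str(handle, "serial", serial, &len) == ESP_OK) {
            /* use the serial number */
        }

        uint8_t board_rev = 0;
        nvs_get_u8(handle, "board_rev", &board_rev);

        nvs_close(handle);
    }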
ESP IDF supports SPIFFS and FAT file systems.
SPIFFS is light-weight and much better in terms of wear levelling and reliability. I'd use this for storing any larger files. It doesn't support flash encryption, unfortunately.
FAT file system is probably the worst choice because it's not really natively Flash-friendly, nor reliable. Espressif has built some kind of a layer between FAT and flash to accommodate wear levelling. The only critical advantage of FAT is that it supports flash encryption.
Then there are third party options which I haven't used, unfortunately.
As always, consider the number of page erases your writes are going to cause in the flash - this gives you an estimate of how many times you can write before the chip's lifetime is reached.
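As a back-of-the-envelope sketch of that estimate (the sector size, partition size, endurance rating and write rate below are typical datasheet-style assumptions, not ESP32-specific guarantees):

    #include <stdio.h>

    /* With wear levelling, writes are spread over the whole partition, so the
       total erase budget is roughly (number of sectors) x (endurance rating). */
    int main(void)
    {
        const double sector_bytes   = 4096.0;     /* assumed erase sector size   */
        const double partition_kb   = 512.0;      /* assumed data partition size */
        const double erase_cycles   = 100000.0;   /* assumed endurance rating    */
        const double writes_per_day = 10000.0;    /* assume one erase per write  */

        double sectors      = partition_kb * 1024.0 / sector_bytes;
        double total_erases = sectors * erase_cycles;
        printf("~%.0f days before the erase budget is spent\n",
               total_erases / writes_per_day);
        return 0;
    }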

Are bytes real?

I know that this question may sound stupid, but let me just explain. So...
Everyone knows that a byte is 8 bits. Simple, right? But where exactly is it specified? I mean, physically you don't really use bytes, but bits. For example, drives. As I understand, it's just a reaaaaly long string of ones and zeros and NOT bytes. Sure, there are sectors, but, as far as I know, they are programmed at the software level (at least in SSDs, I think). Also RAM, which is again a long stream of ones and zeros. Another example is the CPU. It doesn't process 8 bits at a time, but only one.
So where exactly is it specified? Or is it just a general rule which everyone follows? If so, could I make a system (either an operating system or even something at a lower level) that would use, let's say, 9 bits in a byte? Or wouldn't I have to? Also - why can't you use less than a byte of memory? Or maybe you can? For example: is it possible for two applications to use the same byte (e.g. the first one uses 4 bits and the second one uses the other 4)? And last, but not least - do computer drives really use bytes? Or is it that, for example, bits 1-8 belong to something, next to them there are some 3 random bits, and bits 12-20 belong to something different?
I know that there are a lot of questions, and knowing the answers to them doesn't change anything, but I was just wondering.
EDIT: OK, I might not have expressed myself clearly enough. I know that a byte is just a concept (well, even a bit is just a concept that we make real). I'm NOT asking why there are 8 bits in a byte and why bytes exist as a term. What I'm asking is where in a computer a byte is defined, or if it even is defined. If bytes really are defined somewhere, at what level (the hardware level, OS level, programming language level or just the application level)? I'm also asking whether computers even care about bytes (in that concept that we've made real) and whether they use bytes consistently (like, in between two bytes, can there be some 3 random bits?).
Yes, they’re real insofar as they have a definition and a standardised use/understanding. The Wikipedia article for byte says:
The modern de-facto standard of eight bits, as documented in ISO/IEC 2382-1:1993, is a convenient power of two permitting the values 0 through 255 for one byte (2^8 = 256, where zero signifies a number as well).[7] The international standard IEC 80000-13 codified this common meaning. Many types of applications use information representable in eight or fewer bits and processor designers optimize for this common usage. The popularity of major commercial computing architectures has aided in the ubiquitous acceptance of the eight-bit size.[8] Modern architectures typically use 32- or 64-bit words, built of four or eight bytes.
The full article is probably worth reading. No one set out their stall 50+ years ago, banged a fist on the desk and said 'a byte shall be 8 bits'; it became that way over time, with popular microprocessors being able to carry out operations on 8 bits at a time. Subsequent processor architectures carry out ops on multiples of this. While I'm sure Intel could make their next chip 100-bit capable, I think the next bitness revolution we'll encounter will be 128.
Everyone knows that a byte is 8 bits?
These days, yes
But where exactly is it specified?
See above for the ISO code
I mean, physically you don't really use bytes, but bits.
Physically we don’t use bits either, but a threshold of detectable magnetic field strength on a rust coated sheet of aluminium, or an amount of electrical charge storage
As I understand, it's just a reaaaaly long string of ones and zeros and NOT bytes.
True, everything to a computer is a really long stream of 0 and 1. What is important in defining anything else is where to stop counting this group of 0 or 1, and start counting the next group, and what you call the group. A byte is a group of 8 bits. We group things for convenience. It’s a lot more inconvenient to carry 24 tins of beer home than a single box containing 24 tins
Sure, there are sectors, but, as far as I know, they are programmed at the software level (at least in SSDs, I think)
Sectors and bytes are analogous in that they represent a grouping of something, but they aren’t necessarily directly related in the way that bits and bytes are because sectors are a level of grouping on top of bytes. Over time the meaning of a sector as a segment of a track (a reference to a platter number and a distance from the centre of the platter) has changed as the march of progress has done away with positional addressing and later even rotational storage. In computing you’ll typically find that there is a base level that is hard to use, so someone builds a level of abstraction on top of it, and that becomes the new “hard to use”, so it’s abstracted again, and again.
Also RAM, which is again - a long stream of ones and zeros
Yes, and it is consequently hard to use, so it's abstracted, and abstracted again. Your program doesn't concern itself with raising the charge level of some capacitive area of a memory chip; it uses the abstractions it has access to, and that abstraction fiddles with the next level down, and so on until the magic happens at the bottom of the hierarchy. Where you stop on this downward journey is largely a question of definition and arbitrary choice. I don't usually consider my RAM chips as something like ice cube trays full of electrons, or the subatomic quanta, but I could, I suppose. We normally stop when it ceases to be useful for solving the problem.
Another example is CPU. It doesn't process 8 bits at a time, but only one.
That largely depends on your definition of 'at a time' - most of this question is about the definitions of various things. If we arbitrarily decide that 'at a time' is the unit block of the multiple picoseconds it takes the CPU to complete a single cycle, then yes, a CPU can operate on multiple bits of information at once - that's the whole idea of having a multiple-bit CPU that can add two 32-bit numbers together and not forget bits. If you want to slice time up so precisely that we can determine that enough charge has flowed to here but not there, then you could say which bit the CPU is operating on right at this pico (or smaller) second, but it's not useful to go so fine-grained, because nothing will happen until the end of the time slice the CPU is waiting for.
Suffice to say, when we divide time just enough to observe a single cpu cycle from start to finish, we can say the cpu is operating on more than one bit.
If you write at one letter per second, and I close my eyes for 2 out of every 3 seconds, I'll see you write a whole 3-letter word "at the same time" - you write "the cat sat on the mat" and, to the observer, you generated each word simultaneously.
CPUs run cycles for similar reasons, they operate on the flow and buildup of electrical charge and you have to wait a certain amount of time for the charge to build up so that it triggers the next set of logical gates to open/close and direct the charge elsewhere. Faster CPUs are basically more sensitive circuitry; the rate of flow of charge is relatively constant, it’s the time you’re prepared to wait for input to flow from here to there, for that bucket to fill with just enough charge, that shortens with increasing MHz. Once enough charge has accumulated, bump! Something happens, and multiple things are processed “at the same time”
So where exactly is it specified? Or is it just a general rule which everyone follows?
it was the general rule, then it was specified to make sure it carried on being the general rule
If so, could I make a system (either an operating system or even something at a lower level) that would use, let's say, 9 bits in a byte? Or wouldn't I have to?
You could, but you’d essentially have to write an adaptation (abstraction) of an existing processor architecture, and you’d use nine 8-bit bytes to achieve your presentation of eight 9-bit bytes. You’re creating an abstraction on top of an abstraction, and the boundaries of the basic building blocks don’t align. You’d have a lot of work to do to see the system out to completion, and you wouldn’t bother.
In the real world, if ice cube trays made 8 cubes at a time but you thought the optimal number for a person to have in the freezer was 9, you’d buy 9 trays, freeze them and make 72 cubes, then divvy them up into 8 bags and sell them that way. If someone turned up with 9 cubes’ worth of water (it melted), you’d have to split it over 2 trays, freeze it, and give it back... this constant adaptation between your industry-provided 8-slot trays and your desire to process 9 cubes is the adaptive abstraction.
If you do do it, maybe call it a nyte? :)
Also - why can't you use less than a byte of memory? Or maybe you can?
You can, you just have to work with the limitations of the existing abstraction being 8 bits. If you have 8 Boolean values to store, you can code things up so you flip bits of the byte on and off, so even though you’re stuck with your 8-cube ice tray you can selectively fill and empty each cube. If your program only ever needs 7 Booleans, you might have to accept the wastage of the other bit. Or maybe you’ll use it in combination with a regular 32-bit int to keep track of a 33-bit integer value. A lot of work though, writing an adaptation that knows to progress onto the 33rd bit rather than just throw an overflow error when you try to add 1 to 4,294,967,295. Memory is plentiful enough that you’d waste the bit, and waste another 31 bits using a 64-bit integer to hold your 4,294,967,296 value.
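For example, a small sketch in C of packing 8 Booleans into one byte by flipping individual bits (the names here are just for illustration):

    #include <stdint.h>

    /* One byte holds up to 8 independent flags, one per bit. */
    static uint8_t flags = 0;

    static void set_flag(int n)   { flags |=  (uint8_t)(1u << n); }   /* fill cube n     */
    static void clear_flag(int n) { flags &= (uint8_t)~(1u << n); }   /* empty cube n    */
    static int  get_flag(int n)   { return (flags >> n) & 1u; }       /* is cube n full? */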
Generally, resource is so plentiful these days that we don’t care to waste a few bits.. It isn’t always so, of course: take credit card terminals sending data over slow lines. Every bit counts for speed, so the ancient protocols for info interchange with the bank might well use different bits of the same byte to code up multiple things
For example: is it possible for two applications to use the same byte (e.g. the first one uses 4 bits and the second one uses the other 4)?
No, because hardware and OS memory management these days keep programs separate for security and stability. In the olden days, though, one program could write to another program’s memory (it’s how we cheated at games: see the lives counter go down, just overwrite a new value), so in those days, if two programs could behave, and one would only write to the 4 high bits and the other to the 4 low bits, then yes, they could have shared a byte. Access would probably be whole-byte though, so each program would have to read the whole byte, change only its own bits of it, then write the entire result back.
And last, but not least - do computer drives really use bytes? Or is it that, for example, bits 1-8 belong to something, next to them there are some 3 random bits, and bits 12-20 belong to something different?
Probably not, but you’ll never know, because you don’t get to peek at that level of abstraction often enough to see the disk laid out as a sequence of bits and know where the byte boundaries are, or sector boundaries, and whether this logical sector follows that logical sector, or whether a defect in the disk surface means the sectors don’t follow on from each other. You don’t typically care though, because you treat the drive as a contiguous array of bytes (etc.) and let its controller worry about where the bits are.

Array of values loaded through UART in VHDL

I am working on a project in VHDL which includes multiplying matrices. I would like to be able to load data from a PC into arrays on the FPGA using UART. I am only making my first bigger steps in VHDL and I am not sure if I am taking the right approach.
I wanted to declare an array of integer signals, and then implement UART to receive data from the PC and load it into those signals. However, I can't use a for-loop for that, as it will be synthesised to load the data in parallel (which is impossible, because the values will be coming from the PC one after another, over the serial port). And because the matrices may be of various sizes, in order to assign the signals one by one I would need to write lots of specific code (which seems like bad practice to me).
Is the idea to use an array of signals and load data to those signals through UART realizable? And if my approach is entirely wrong, how could I achieve that?
What you want is doable, but you will probably need to design a kind of hardware monitor to act as an intermediary between your UART and your storage (your array of integer signals). This hardware monitor will interpret commands coming from the UART and perform read/write operations in your storage. It will have one interface with the storage and another with the UART. You will have to define a kind of protocol, with a syntax for your commands and a sequence of operations for each command.
Example: the monitor waits for commands coming from the UART. The first received character indicates whether it is a read (0) or a write (1). The next four characters are the target address, least significant byte first. If the command is a read, the monitor reads the data at the specified address in your storage and sends it to the UART, one byte at a time, least significant byte first. If the command is a write, the address is followed by the data to write into your storage at the specified address, least significant byte first; your monitor waits until the data is received and writes it into your storage.
Optionally, the monitor could send an exit status byte at the end of each command to indicate potential errors (protocol errors, unmapped addresses, write attempts in read-only regions...)
Of course, depending on the characteristics of your application, you will probably define a completely different protocol, simpler or more complex, but the principle will be the same.
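To make the example concrete, here is a rough software model of that command format in C (assuming 32-bit addresses and 32-bit data words; the stand-in UART functions and sizes are hypothetical, and in the FPGA this would become a small finite-state machine in VHDL driven by the UART's byte-received strobe):

    #include <stdint.h>
    #include <stdio.h>

    #define CMD_READ  0
    #define CMD_WRITE 1

    /* The storage the monitor reads and writes (size chosen arbitrarily). */
    static uint32_t storage[1024];

    /* Stand-ins for the UART receive/transmit side. */
    static uint8_t get_byte(void)     { return (uint8_t)getchar(); }
    static void    put_byte(uint8_t b){ putchar(b); }

    static uint32_t get_word_lsb_first(void)
    {
        uint32_t w = 0;
        for (int i = 0; i < 4; i++)
            w |= (uint32_t)get_byte() << (8 * i);    /* least significant byte first */
        return w;
    }

    void monitor_step(void)
    {
        uint8_t  cmd  = get_byte();                  /* 0 = read, 1 = write          */
        uint32_t addr = get_word_lsb_first();        /* 4 address bytes, LSB first   */

        if (addr >= sizeof(storage) / sizeof(storage[0]))
            return;   /* a real monitor would send an error status here */

        if (cmd == CMD_READ) {
            uint32_t data = storage[addr];
            for (int i = 0; i < 4; i++)
                put_byte((uint8_t)(data >> (8 * i)));   /* send data, LSB first */
        } else if (cmd == CMD_WRITE) {
            storage[addr] = get_word_lsb_first();       /* 4 data bytes, LSB first */
        }
        /* Optionally: put_byte(status) to report protocol errors, bad addresses... */
    }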
All this is usually implemented in software and runs on a CPU that has the UART as peripheral and the storage in its memory space. But if you do not have a CPU...
Warning: this is quite complex. The UART itself is quite complex. Not sure you should start with this if you are a VHDL beginner.
Your approach is not entirely wrong, but you have a software-oriented way of expressing it, which indicates you are missing the fundamentals. People with strong software backgrounds tend to think in terms of the programming language and not in terms of the actual FPGA-specific structures they want to achieve. It is important to unlearn this if you want to be successful in designing for FPGAs.
Based on what I just wrote you should consider in what type of FPGA structure you would like to store the data. The speed, resource and power requirements govern this choice. One suitable way to store the data would be in either a single or an array of either Block RAM or LUTRAM. Both of these structures can be inferred by using a signal of an array type in the hardware description language which is why I said you are not entirely off track. Consult the manual of your synthesis tool to find templates for how to infer these structures. An alternative is to use a vendor IP block or to instantiate a primitive directly but both those methods are clumsier in my opinion.
Important parameters to consider are the total number of words you need to store, the size of a word and the number of read/write operations per clock cycle. For a higher number of reads per cycle, an array of memories must be used, since most FPGA memories only support two reads per cycle.

bit-wise matrix transposition in VHDL using blockram

I've been trying to figure out a nice way of transposing a large amount of data in VHDL using a block ram (or similar).
Using a vector of vectors it's relatively easy, but it gets icky for large amounts of data.
I want to use a dual-channel block RAM so that I can write to the one block and read out of the other: write in 8-bit std_logic_vectors, read out 32-bit std_logic_vectors, where the 32 bits are (for the first rotation at least) the MSBs of input vectors 0 - 31, then 32 - 63, all the way to 294911; then MSB-1, etc.
The case described above is my ideal scenario. Is this even possible? I can't seem to find a nice way of doing this.
In this answer, I'm assuming 18kbit Xilinx-style BRAMs. I'm most familiar with the Virtex-4, so I'm referring to UG070 for the following. The answer will map trivially to other Xilinx FPGAs, and probably other vendors' parts as well.
The first thing to note is that Virtex-4 BRAMs can be used in dual-port mode with different geometries for each port. However, with 1-bit-wide ports, these RAMs' parity bits aren't useful. Thus, an 18kbit BRAM is only effectively 16kbits here.
Next, consider the number of BRAMs you need just for storage (irrespective of your design): 294912 x 8 bits is 2,359,296 bits, which at 16 kbits of usable data per BRAM maps to 144 BRAMs - a pretty large resource commitment. There's always a balance between throughput, design complexity, and resource requirements; if you need to squeeze every Mbit/sec out of the design, maybe a BRAM-based approach is ideal. If not, you should consider whether your throughput and latency requirements allow you to use off-chip RAM instead of BRAM.
If you do plan on using BRAMs, then you should consider an array of 18x8 BRAMs. The 8 columns each store a single input bit. Each BRAM is written via a 1-bit port, and read out with a 32-bit port.
Every 8-bit write is mapped to eight one-bit writes (one write to a BRAM in each column.)
Every 32-bit read is mapped to a single 32-bit read from a single BRAM.
You should need very little sequencing logic (around a RAMB16 primitive) to get this working. The major complexity is how to map address bits between the 1-bit port and the 32-bit port.
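As a sketch of that address mapping (written in C rather than VHDL for brevity, and assuming 16 kbits of usable data per BRAM, a x1 write port and a x32 read port; the exact ordering of bits within the wide word is device-specific, so check the BRAM user guide):

    #include <stdint.h>

    /* Location of a bit inside the 18x8 BRAM array. */
    typedef struct {
        int column;   /* which of the 8 columns = which bit of the input byte */
        int bram;     /* which of the 18 BRAMs in that column                 */
        int addr;     /* address on the port in question                      */
    } loc_t;

    /* Where bit 'b' of 8-bit input word 'i' (0 <= i < 294912) is written,
       using the 1-bit-wide write port. */
    loc_t write_loc(uint32_t i, int b)
    {
        loc_t l = { .column = b, .bram = (int)(i / 16384), .addr = (int)(i % 16384) };
        return l;
    }

    /* Which 32-bit word on the read port contains that same bit, and at
       which position inside the word (assumed ordering). */
    loc_t read_loc(uint32_t i, int b, int *bit_pos)
    {
        loc_t l = { .column = b, .bram = (int)(i / 16384), .addr = (int)((i % 16384) / 32) };
        *bit_pos = (int)(i % 32);
        return l;
    }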
After a bit of research, and thought, this is my answer to this problem:
Due to the nature of block RAM addressing, the ideal scenario mentioned in the OP is not possible with the current block RAM addressing implementation. In order to perform a bitwise matrix transposition in the manner described, the block RAM addressing would need to be able to switch between horizontal and vertical. That is, the RAM would have to be accessible both row-wise and column-wise, and the addressing mode would have to be switchable in real time. Since a bit-wise data transposition is not really a particularly "useful" transform, there wouldn't really be a reason to implement such a switching scheme, especially since the whole point of block RAM is to store data in chunks of more than 1 bit, and such a transform would scramble the data.
I have discovered a way of changing my design such that the 294912 x 8 bits do not need to be transformed at once; instead it can be done in stages using a process. This does not rely on block RAM to perform the transform.

Symmetric Encryption: Performance Questions

Does the performance of a symmetric encryption algorithm depend on the amount of data being encrypted? Suppose I have about 1000 bytes I need to send over the network rapidly, is it better to encrypt 50 bytes of data 20 times, or 1000 bytes at once? Which will be faster? Does it depend on the algorithm used? If so, what's the highest performing, most secure algorithm for amounts of data under 512 bytes?
The short answers are:
You want to encrypt all your data in one go. You actually want to hand it all to the encryption code in a single function call, so that it can run even faster.
With a proper encryption algorithm, encryption will be substantially faster than the network itself. It takes a very bad implementation, a very old PC or a very fast network to make encryption a bottleneck.
When in doubt, use a ready-made protocol such as SSL/TLS. If given the choice of the encryption algorithm within a protocol, use AES.
Now the longer answers:
There are block ciphers and stream ciphers. A stream cipher begins with an initialization phase, in which the key is input into the system (often called the "key schedule"), and then encrypts data bytes "on the fly". With a stream cipher, the encrypted message has the same length as the input message, and the encryption time is proportional to the input message length, save for the computational cost of the key schedule. We are not talking big numbers here; key schedule time is below 1 microsecond on a not-so-new PC. Yet, for optimal performance, you want to do the key schedule once, not 50 times.
Block ciphers also have a key schedule. When the key schedule has been performed, a block cipher can encrypt blocks, i.e. chunks of data of a fixed length. The block length depends on the algorithm, but it is typically 8 or 16 bytes. AES is a block cipher. In order to encrypt a "message" of arbitrary length, the block cipher must be invoked several times, and this is trickier than it seems (there are many security issues). The part which decides how those invocations are assembled together is called the chaining mode. A well-known chaining mode is called CBC. Depending on the chaining mode, there may be a need for an extra step called padding in which a few extra bytes are added to the input message, so that its length becomes compatible with the chosen chaining. Padding must be such that, upon decryption, it can be removed unambiguously. A common padding scheme is called "PKCS#5".
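For instance, the PKCS#5/PKCS#7 padding rule mentioned above can be sketched like this in C (every padding byte holds the number of bytes added, which is what makes it unambiguous to strip after decryption):

    #include <string.h>
    #include <stddef.h>

    /* Pad the message up to a multiple of the block size (1..block bytes are
       always added, never zero). Assumes out has room for inlen + block bytes. */
    size_t pkcs7_pad(const unsigned char *in, size_t inlen,
                     unsigned char *out, size_t block)
    {
        size_t pad = block - (inlen % block);
        memcpy(out, in, inlen);
        memset(out + inlen, (int)pad, pad);
        return inlen + pad;
    }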
There is a chaining mode called "CTR" which effectively turns a block cipher into a stream cipher. It has some good points; in particular, CTR mode requires no padding, and the encrypted message will have the same length as the input message. AES with CTR mode is good.
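As a hedged sketch of what that looks like with OpenSSL's EVP API (AES-256 in CTR mode; key and IV generation, and the integrity protection discussed below, are deliberately left out):

    #include <openssl/evp.h>

    /* Encrypt inlen bytes with AES-256-CTR in one call sequence.
       Returns the ciphertext length (equal to inlen in CTR mode), or -1.
       key must be 32 bytes; iv is the 16-byte counter block, fresh per message. */
    int encrypt_aes_ctr(const unsigned char *key, const unsigned char *iv,
                        const unsigned char *in, int inlen, unsigned char *out)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int outlen = 0, tmplen = 0;

        if (ctx == NULL)
            return -1;
        if (EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), NULL, key, iv) != 1 ||
            EVP_EncryptUpdate(ctx, out, &outlen, in, inlen) != 1 ||
            EVP_EncryptFinal_ex(ctx, out + outlen, &tmplen) != 1) {
            EVP_CIPHER_CTX_free(ctx);
            return -1;
        }
        EVP_CIPHER_CTX_free(ctx);
        return outlen + tmplen;   /* no padding: same length as the input */
    }

Calling this once over the whole message, rather than 20 times over 50-byte pieces, keeps the per-message overhead (key schedule, IV, and any higher-level integrity check) to a single occurrence, which is the point made above.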
Encryption speed will be about 100 MB/s on a typical PC (e.g. a 2.4 GHz Intel Core2, using a single core).
Warning: there are issues with regard to encrypting several messages with the same key. With chaining modes, those issues hide under the name of "IV" (as in "initial value"). With most chaining modes, the IV is a random value of the same size as the cipher block. The IV need not be secret (it is often transmitted along with the encrypted message, because the decrypting party must also know it), but it must be chosen randomly and uniformly, and each message needs a new IV. Some chaining modes (e.g. CTR) can tolerate a non-uniform IV, but only for the first message ever sent with a given key. With CBC, even the first message needs a fully random IV. If this paragraph does not make full sense to you, then, please, do not try to design an encryption protocol; it is more complex than you imagine. Instead, use an already specified protocol, such as SSL (for an encryption tunnel) or CMS (for encrypted messages). The development of such protocols was a long and painful history of attacks and countermeasures, with much grinding of teeth. Do not reenact that history...
Warning 2: if you use encryption, then you are worrying about security: there may be adverse entities bent on attacking your system. Most of the time, simple encryption will not fully deter them; by itself, (properly applied) encryption defeats only passive attackers, those who observe transmitted bytes but do not alter them. A generic attacker is also active, i.e. he removes some data bytes, moves and duplicates others, or adds extra bytes of his own design. To defeat active attackers you need more than encryption; you also need integrity checks. There again, protocols such as SSL and CMS already take care of the details.
AES should be a good choice.
Good implementations of AES (e.g. the one included in the OpenSSL library) require about 10-20 CPU cycles per byte to encrypt. When you encrypt small messages, you also have to consider the time for the key setup. E.g., a typical implementation of DES requires a few thousand cycles for a key setup; i.e., you might actually finish encrypting a small message with AES before you even start encrypting with other ciphers.
Newer processors (e.g. Westmere-based CPUs) have an instruction set extension that supports AES, allowing encryption at speeds of 1.5-4 cycles per byte. That is almost impossible to beat with any other cipher.
If possible, try encrypting longer messages rather than splitting them into small pieces. The main reason is not encryption speed, but security: a secure encryption mode usually requires an initialization vector (IV) and a message authentication code (MAC), which together add about 32 bytes to each ciphertext part. If you divide your message into small parts, this overhead becomes significant: splitting 1000 bytes into 20 pieces costs roughly 20 x 32 = 640 bytes of overhead, versus 32 bytes for a single message.
Symmetric encryption algorithms are typically block ciphers. For any given algorithm, the block size is fixed. You then pick from several different methods of making subsequent blocks depend on earlier blocks (e.g. cipher block chaining) to create a stream cipher. But the stream cipher invariably caches incoming data and submits it to the block cipher in full blocks.
So by doing 50 bytes 20 times, all you're doing is pounding on your cache logic.
If you're not operating in stream mode, then datagrams smaller than the native block size of your cipher will get significantly less protection than complete blocks, since there are fewer possible messages for an attacker to consider.
Obviously, performance depends on the amount of data as every part of the data has to be encrypted. You'll get much better information by doing a test in your specific environment (language, platform, encryption algorithm implementation) than anyone here could possibly provide out of the blue: I don't think it would take more than half an hour to set up a basic performance measurement.
As for security, you should be fine with Triple DES or AES.
