Counter Size in AES Driver in CTR Mode in the Linux Kernel

I have seen multiple open source drivers for AES (CTR) mode for different crypto hardware engines, but I am not really sure about the counter size, nonce, etc.
Can anyone provide some info on the following?
Q1:
How does the AES driver identify the counter size during CTR mode of operation?
It looks like AES in CTR mode supports a "counter size" of multiple lengths, as below:
1: The first is a counter block made up of a nonce and a counter. The nonce is random, and the remaining bytes are counter bytes (which are incremented).
For example, a 16 byte block cipher might use the high 8 bytes as a nonce, and the low 8 bytes as a counter.
2: Second is a counter block, where all bytes are counter bytes and can be incremented as carries are generated.
For example, in a 16 byte block cipher, all 16 bytes are counter bytes
Q2:
Does the Linux kernel crypto subsystem increment the counter value for every block of input, or does that need to be taken care of by the kernel driver for the respective crypto H/W?
Q3:
Counters and nonces are something that will be extracted from the IV, i.e., IV = nonce + counter. Note: if "l" is the length of the IV, then the first "l/2" is the length of the nonce and the next "l/2" is the length of the counter. Please let me know if my understanding regarding the IV, counter, and nonce is correct or not.
Any information regarding the above is really appreciated.
BR,
Sanumala

How does the AES driver identify the counter size during CTR mode of operation?
It most likely doesn't. As long as it treats the IV as one big 128-bit counter, there isn't a problem. If the counter were 64 bits and initialized to all zeros, you would only have a problem after 2^64 = 18,446,744,073,709,551,616 (16-byte) blocks of data; that's not likely to happen.
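For illustration, here is a minimal software sketch of that view, assuming the PyCryptodome library:

```python
# Minimal sketch (PyCryptodome assumed) of treating the whole 16-byte
# IV as one big 128-bit counter, as described above.
from Crypto.Cipher import AES
from Crypto.Util import Counter
import os

key = os.urandom(16)    # AES-128 key
iv = os.urandom(16)     # initial counter value

# The library increments the 128-bit value once per 16-byte block,
# with carries propagating across the full width.
ctr = Counter.new(128, initial_value=int.from_bytes(iv, "big"))
cipher = AES.new(key, AES.MODE_CTR, counter=ctr)
ciphertext = cipher.encrypt(b"plaintext spanning multiple 16-byte blocks")
```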
Does the Linux kernel crypto subsystem increment the counter value for every block of input, or does that need to be taken care of by the kernel driver for the respective crypto H/W?
It needs to be taken care of by the kernel driver. I only see an IV as input in the API, which is commonly the case for crypto APIs. You cannot get any performance if you have to update the counter for each 16 bytes you want to encrypt.
Counters and nonces are something that will be extracted from the IV, i.e., IV = nonce + counter. Note: if "l" is the length of the IV, then the first "l/2" is the length of the nonce and the next "l/2" is the length of the counter. Please let me know if my understanding regarding the IV, counter, and nonce is correct or not.
Yes, you understand correctly. You would only have a problem if the protocol uses a separate nonce and counter and both are generated randomly. In that case you may have a problem with the carry from the counter to the nonce field.
Note that it may be a good idea to limit the data size to, say, ~68 GB and use the top 12 bytes as a random nonce to avoid being bitten by the birthday problem.
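A sketch of that nonce/counter split, again assuming PyCryptodome:

```python
# Sketch of the layout suggested above: 12-byte random nonce in the
# top of the IV, 4-byte counter in the bottom. A 4-byte counter caps
# the data per nonce at 2^32 blocks of 16 bytes, about 68 GB.
from Crypto.Cipher import AES
from Crypto.Util import Counter
import os

key = os.urandom(16)
nonce = os.urandom(12)                   # top 12 bytes: random nonce

ctr = Counter.new(32, prefix=nonce)      # bottom 4 bytes: the counter
cipher = AES.new(key, AES.MODE_CTR, counter=ctr)
ciphertext = nonce + cipher.encrypt(b"message")   # ship the nonce along
```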

Related

Parallel CRC-32 calculation Ethernet 10GE MAC

I have generated an Ethernet 10GE MAC design in VHDL. Now I am trying to implement CRC. I have a 64-bit parallel CRC-32 generator in VHDL.
Specification:
- Data bus is of 64-bits
- Control bus is of 8-bits (which validates the data bytes)
Issue:
Let's say my incoming packet length is 14 bytes (assuming no padding).
The CRC is calculated for the first 8 bytes in one clock cycle, but when I try to calculate the CRC over the remaining 6 bytes the results are wrong due to zeros being appended.
Is there a way I can generate the CRC for a packet of any byte length using the 64-bit parallel CRC generator?
What I've tried:
I used different parallel CRC generators (8-bit parallel CRC, 16-bit parallel CRC, and so on), but that consumes a lot of FPGA resources. I want to conserve resources by using just the 64-bit parallel CRC generator.
Start with a constant 64-bit data word that brings the effective CRC register to all zeros. Then prepend the message with zero bytes, instead of appending them, putting those zeros on the end of the 64-bit word that is processed first. (You did not provide the CRC definition, so this depends on whether the CRC is reflected or not. If the CRC is reflected, then put the zero bytes in the least-significant bit positions. If the CRC is not reflected, then put them in the most-significant bit positions.) Then exclusive-or the result with a 32-bit constant.
So for the example, you would first feed a 64-bit constant to the parallel CRC generator, then feed two zero bytes and six bytes of message in the first word, and then eight message bytes in the second word. Then exclusive-or the result with the 32-bit constant.
For the standard PKZIP CRC, the 64-bit constant is 0x00000000ffffffff, the 32-bit constant is 0x2e448638, and the prepended zero bytes go in the bottom of the 64-bit word.
If you are in control of the implementation of the CRC generator, then you can probably modify it to initialize the effective CRC register to all zeros when you reset the generator, avoiding the need to feed the 64-bit constant.
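As a software sanity check of the idea (Python rather than VHDL), the sketch below demonstrates the core property that makes prepending safe; the quoted 64-bit and 32-bit constants correspond to presetting the register and applying the final exclusive-or in hardware:

```python
# With the effective CRC register at zero, *prepended* zero bytes pass
# through without changing the state, so they do not affect the result.
import zlib

def raw_crc32(reg, data):
    """Bit-at-a-time reflected CRC-32, no preset and no final XOR."""
    for b in data:
        reg ^= b
        for _ in range(8):
            reg = (reg >> 1) ^ (0xEDB88320 if reg & 1 else 0)
    return reg

msg = bytes(range(14))   # the 14-byte packet from the example

# Zero bytes prepended to a zero register change nothing:
assert raw_crc32(0, b"\x00" * 2 + msg) == raw_crc32(0, msg)

# The standard (PKZIP) CRC-32 is the raw CRC with the register preset
# to 0xffffffff and the result complemented:
assert raw_crc32(0xFFFFFFFF, msg) ^ 0xFFFFFFFF == zlib.crc32(msg)
```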
I can't speak for certain, but if you can pad zeros at the start of your packet instead of at the end, then you should get the right answer. It does depend on the polynomial and the initializer...
See this answer: Best way to generate CRC8/16 when input is odd number of BITS (not byte)? C or Python

An algorithm to determine if a number belongs to a group or not

I'm not even sure if this is possible but I think it's worth asking anyway.
Say we have 100 devices in a network. Each device has a unique ID.
I want to tell a group of these devices to do something by broadcasting only one packet (A packet that all the devices receive).
For example, if I wanted to tell devices 2, 5, 75, 116 and 530 to do something, I would have to broadcast this: 2-5-75-116-530
But this packet can get pretty long if I wanted (for example) 95 of the devices to do something!!!
So I need a method to reduce the length of this packet.
After thinking for a while, I came up with an idea:
What if I used only prime numbers as device IDs? Then I could send the product of the device IDs of the group I need as the packet, and every device would check whether the remainder of the received number divided by its device ID is 0.
For example, if I wanted devices 2, 3, 5 and 7 to do something, I would broadcast 2*3*5*7 = 210, and then each device would calculate "210 mod own ID"; only devices with IDs 2, 3, 5 and 7 would get 0, so they would know that they should do something.
But this method is not efficient, because the 100th prime number is 541, so the broadcast number may get really big and the "mod" calculation may get really hard (the devices have 8-bit processors).
So I just need a method for the devices to determine if they should do something or ignore the received packet. And I need the packet to be as short as possible.
I tried my best to explain the question. If it's still vague, please tell me and I will explain more.
You can just use a bit string in which every bit represents a device. Then, you just need a bitwise AND to tell if a given machine should react.
You'd need one bit per device, which would be, for example, 32 bytes for 256 devices. Admittedly, that's a little wasteful if you only need one machine to react, but it's pretty compact if you need, say, 95 devices to respond.
You mentioned that you need the device id to be <= 4 bytes, but that's no problem: 4 bytes = 32 bits = enough space to store 2^32 device ids. For example, the device id for the 101st machine (if you start at 0) could just be 100 (0b01100100) = 1 byte. You would just need to use that to figure out which byte of the packet to use (ceil(100 / 8) = 13, i.e., the 13th byte) and bitwise AND that byte against the mask 1 << (100 % 8) = 1 << 4 = 0b00010000.
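A minimal sketch of the scheme; the helper names are illustrative, not any real API:

```python
# Bit-string addressing: bit i of the packet is set iff device i
# should act. 256 devices fit in a 32-byte packet.
def make_packet(device_ids, num_devices=256):
    packet = bytearray((num_devices + 7) // 8)   # 32 bytes for 256 devices
    for dev in device_ids:
        packet[dev // 8] |= 1 << (dev % 8)
    return bytes(packet)

def should_act(packet, my_id):
    # Each device does one index and one AND against its own bit.
    return bool(packet[my_id // 8] & (1 << (my_id % 8)))

packet = make_packet([2, 5, 75, 116])
assert should_act(packet, 75) and not should_act(packet, 76)
```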
As cobarzan said, you can also use a hybrid scheme allowing for individual addressing. In that scenario, you could use the first bit as a signal to indicate multiple- or single-machine addressing. As cobarzan also noted, that requires more processing, and it means the first byte can only store 7 machine signals, rather than 8.
Like Ed Cottrell suggested, a bit string would do the job. If the machines are labeled {1,..,n}, there are 2^n - 1 possible subsets (assuming you do not send requests with no intended target). So you need a data structure able to hold every possible signature of such a subset, whatever you decide the signature to be. And n bits (one for each machine) is the best one can do regarding the size of such a data structure. The evaluation performed on the machines takes constant time (on the machine with label l, just look at the l-th bit).
But one could go for some hybrid scheme. Say you have a task for one device only; then it would be a pity to send n bits (all 0s, except one). So you can take one additional bit T which indicates the type of packet. The value of T is set to 0 if you are sending a bit string of length n as described above, or set to 1 if you are using a more appropriate scheme (i.e., fewer bits). In the case of just one machine that needs to perform the task, you could send the label of the machine directly (which is O(log n) bits long). This approach reduces the size of the packet if fewer than O(n/log n) machines need to perform the task. Evaluation on the machines is more expensive, though.
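A sketch of the hybrid packet, reusing make_packet and should_act from the sketch above (a whole flag byte is used instead of a single bit, purely for simplicity):

```python
# Hybrid scheme: byte 0 carries the type flag T.
def encode(device_ids, num_devices=256):
    if len(device_ids) == 1:                 # single target: send its label
        return bytes([1, device_ids[0]])     # 2 bytes instead of 33
    return bytes([0]) + make_packet(device_ids, num_devices)

def must_act(packet, my_id):
    if packet[0] == 1:
        return packet[1] == my_id            # direct label comparison
    return should_act(packet[1:], my_id)     # fall back to the bit string
```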

Cache calculating block offset and index

I've read several topics on this theme, but I could not get the answer. So my question is:
1) How is the block offset calculated?
I want to know not the formula but the concept of it. As I understand it, it is the number of positions at which a block can store an address. For example, if there is a block with 8 bytes of storage that has to store 2-byte addresses, is its block offset 2 bits (so that there are 4 positions to store an address)?
The block offset is simply calculated as log2(cache_line_size).
The reason is that all systems that I know of are byte-addressable, so you need enough bits to index any byte in the block. Although most systems have a word size larger than a single byte, they still support offsets with single-byte granularity, even if that is not the common case.
So for the example you mentioned of an 8-byte block size with a 2-byte word, you would still need 3 bits in order to allow accessing any byte. If you had a system that was not byte-addressable, then you could use just 2 bits for the block offset. But in practice, all systems that I know of are byte-addressable.
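A small worked example; the 8-byte line matches the example above, and the 64-line cache size is an illustrative assumption:

```python
# Address split for a byte-addressable, direct-mapped cache.
import math

LINE_SIZE = 8      # bytes per cache line -> log2(8) = 3 offset bits
NUM_LINES = 64     # lines in the cache  -> log2(64) = 6 index bits

offset_bits = int(math.log2(LINE_SIZE))
index_bits = int(math.log2(NUM_LINES))

addr = 0x1A2B
offset = addr & (LINE_SIZE - 1)                    # which byte in the line
index = (addr >> offset_bits) & (NUM_LINES - 1)    # which line
tag = addr >> (offset_bits + index_bits)           # identifies the block
print(f"offset={offset}, index={index}, tag={tag:#x}")
```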

Can the AES algorithm work the same over plain text and over byte sequences?

It's clear how the algorithm handles plain text: the characters' byte values fill the state matrix.
But what about AES encryption of binary files?
How does the algorithm manage files larger than 16 bytes, given that the state is standardized to be 4x4 bytes?
The AES primitive is the basis of constructions that allow encryption/decryption of arbitrary binary streams.
AES-128 takes a 128-bit key and a 128-bit data block and "encrypts" or "decrypts" this block. 128 bit is 16 bytes. Those 16 bytes can be text (e.g. ASCII, one character per byte) or binary data.
A naive implementation would just break a file longer than 16 bytes into groups of 16 bytes and encrypt each of these with the same key. You might also need to "pad" the file to make it a multiple of 16 bytes. The problem with that is that it exposes information about the file, because every time you encrypt the same block with the same key you'll get the same ciphertext.
There are different ways to build on the AES function to encrypt/decrypt more than 16 bytes securely. For example, you can use CBC or counter (CTR) mode.
Counter mode is a little easier to explain, so let's look at that. If AES_e(k, b) encrypts block b with key k, we do not want to re-use the same key to encrypt the same block more than once. So the construction we'll use is something like this:
Calculate AES_e(k, 0), AES_e(k, 1), ..., AES_e(k, n)
Now we can take arbitrary input, break it into 16-byte blocks, and XOR it with this sequence. Since the attacker does not know the key, they cannot regenerate this sequence and decode our (longer) message. The XOR is applied bit by bit between the blocks generated above and the cleartext. The receiving side can generate the same sequence, XOR it with the ciphertext, and retrieve the cleartext.
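A minimal sketch of this construction, assuming PyCryptodome and using its ECB primitive only to obtain the single-block AES_e(k, b) operation; the fixed counter start of 0 with no nonce is for illustration only:

```python
# WARNING: counter starts at a fixed 0 with no nonce -- fine for
# illustration, unsafe if a key is ever reused.
from Crypto.Cipher import AES
import os

def ctr_xor(key, data):
    ecb = AES.new(key, AES.MODE_ECB)        # AES_e(k, .)
    out = bytearray()
    for i in range(0, len(data), 16):
        counter_block = (i // 16).to_bytes(16, "big")   # 0, 1, 2, ...
        keystream = ecb.encrypt(counter_block)          # AES_e(k, n)
        chunk = data[i:i + 16]
        out += bytes(p ^ s for p, s in zip(chunk, keystream))
    return bytes(out)

key = os.urandom(16)
ct = ctr_xor(key, b"an arbitrary-length binary message")
# XOR is its own inverse, so the same function decrypts:
assert ctr_xor(key, ct) == b"an arbitrary-length binary message"
```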
In applications you also want to combine this with some sort of authentication mechanism, so you use something like AES-GCM or AES-CCM.
Imagine you have a 17-byte plain text.
The state matrix will be filled with the first 16 bytes, and that block will be encrypted.
The next block will consist of the 1 byte that is left, and the state matrix will be padded with data in order to fill the 16 bytes AES needs.
It works well with bytes/binary files because AES always operates on byte units. It does not matter whether that is a chunk of ASCII or anything else; everything in a computer is binary/bytes/bits. As long as the data is a stream of bytes (chunks of information in bytes), it will work fine.
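To make the padding step concrete: the answer above does not name a scheme, so PKCS#7 is assumed here as a common choice:

```python
# Padding a 17-byte input up to two full 16-byte AES blocks.
def pkcs7_pad(data, block_size=16):
    n = block_size - len(data) % block_size   # 15 pad bytes for 17 bytes
    return data + bytes([n]) * n

plaintext = bytes(range(17))                  # a 17-byte message
padded = pkcs7_pad(plaintext)                 # 32 bytes = two AES blocks
assert len(padded) == 32 and padded[-15:] == bytes([15]) * 15
```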

PyCrypto compatibility with CommonCrypto in CFB mode?

I'm trying to get some Python code to decrypt data that was encrypted using the OS X CommonCrypto APIs. There is little to no documentation on the exact options that CommonCrypto uses, so I need some help figuring out which options to set in PyCrypto.
Specifically, my CommonCrypto decryption setup call is:
CCCryptorCreateWithMode(kCCDecrypt, kCCModeCFB, kCCAlgorithmAES128, ccDefaultPadding, NULL, key, keyLength, NULL, 0, 0, 0, &mAESKey);
My primary questions are:
Since there is both a kCCModeCFB and kCCModeCFB8, what is CommonCrypto's definition of CFB mode - what segment size, etc?
What block size is the CommonCrypto AES128 using? 16 or 128?
What is the default padding, and does it even matter in CFB mode?
Currently, the first 4 bytes of data decrypt successfully with PyCrypto *as long as I set the segment_size to 16*.
Ideas?
Without knowing CommonCrypto or PyCrypto, some partial answers:
AES (in all three variants) has a block size of 128 bits, which are 16 bytes.
CFB (cipher feedback mode) would actually also work without padding (i.e., with a partial last block), since for each block the ciphertext is created as the XOR of the plaintext with some keystream block, which only depends on previous blocks. (You can still use any padding you want.)
If you can experiment with some known data, first have a look at the ciphertext size. If it is not a multiple of a full block (and the same as the plaintext + IV), then it is quite likely no padding. Otherwise, decrypt it with noPadding mode, have a look at the result, and compare with the different known padding modes.
From a glance at the source code, it might be PKCS#5-padding.
CFB8 is a variant of CFB which uses only the top 8 bits (= one byte) of each block cipher call output (which takes the previous 128 bits (= 16 bytes) of ciphertext (or IV) as input). This needs 16 times as many block cipher calls, but allows partial sending of a stream without having to worry about block boundaries.
There is another definition of CFB which includes a segment size - here the segment size is the number of bits (or bytes) to be used from each cipher output. In this definition, the "plain" CFB would have a segment size of 128 bits (= 16 bytes), and CFB8 would have a segment size of 8 bits (one byte).
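In PyCrypto terms, a hedged sketch (PyCryptodome's API shown; segment_size is given in bits, and I believe this API requires the data length to be a multiple of the segment size, so an aligned message is used):

```python
# Selecting between full-block CFB and CFB8 via segment_size.
from Crypto.Cipher import AES
import os

key, iv = os.urandom(16), os.urandom(16)
msg = b"0123456789abcdef0123456789abcdef"   # 32 bytes

# Full-block CFB (128-bit segments) -- the likely match for kCCModeCFB:
ct1 = AES.new(key, AES.MODE_CFB, iv=iv, segment_size=128).encrypt(msg)

# CFB8 (8-bit segments), matching kCCModeCFB8; 16x more cipher calls:
ct2 = AES.new(key, AES.MODE_CFB, iv=iv, segment_size=8).encrypt(msg)

assert len(ct1) == len(ct2) == len(msg)     # CFB adds no length overhead
```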
