How do I calculate PCIe x1, 2.0, 3.0 speeds properly?

I am honestly very lost with the speed calculations of PCIe devices.
I can understand the 33 MHz and 66 MHz clocks of PCI and PCI-X devices, but PCIe confuses me.
Could anyone explain how to calculate the transfer speeds of PCIe?

To understand the table pointed to by Paebbels, you should know how PCIe transmission works. Unlike PCI and PCI-X, PCIe is a point-to-point serial bus with link aggregation (meaning that several serial lanes are bonded together to increase transfer bandwidth).
For PCIe 1.0, a single lane transmits symbols on every edge of a 1.25 GHz clock (clock rate). This yields a transmission rate of 2.5G transfers (or symbols) per second. The protocol encodes 8 bits of data into 10 symbols (8b/10b encoding) for DC balance and clock recovery. Therefore the raw transfer rate of a lane is
2.5 Gsymb/s × (8 bits / 10 symbols) = 2 Gbit/s = 250 MB/s
The raw transfer rate can be multiplied by the number of lanes available to get the full link transfer rate.
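As a rough sketch of that arithmetic in C (the per-generation line rates and encodings below are the commonly published figures rather than anything stated above, and the result is the raw link rate, not the packetized payload rate):

    #include <stdio.h>

    /* Rough per-link bandwidth estimate.
     * line_rate_gtps: transfers (symbols) per second, in GT/s
     * enc_num/enc_den: encoding efficiency, e.g. 8/10 for 8b/10b, 128/130 for 128b/130b
     */
    static double link_bandwidth_MBps(double line_rate_gtps, int enc_num, int enc_den, int lanes)
    {
        double data_bits_per_s = line_rate_gtps * 1e9 * enc_num / enc_den; /* per lane */
        return data_bits_per_s / 8.0 / 1e6 * lanes;                        /* MB/s for the link */
    }

    int main(void)
    {
        printf("PCIe 1.0 x1:  %.0f MB/s\n", link_bandwidth_MBps(2.5, 8, 10, 1));      /* 250    */
        printf("PCIe 2.0 x4:  %.0f MB/s\n", link_bandwidth_MBps(5.0, 8, 10, 4));      /* 2000   */
        printf("PCIe 3.0 x16: %.0f MB/s\n", link_bandwidth_MBps(8.0, 128, 130, 16));  /* ~15754 */
        return 0;
    }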
Note that the useful transfer rate is actually lower than that, because the data is packetized, similar to Ethernet protocol-layer packetization.
A more detailed explanation can be found in this Xilinx white paper.

Related

STM32F411 I need to send a lot of data by USB with high speed

I'm using an STM32F411 with the USB CDC library, and the maximum speed for this library is ~1 Mb/s.
I'm building a project where I have 8 microphones connected to ADC lines (this part works fine). I need a 16-bit signal, so I increase the accuracy by summing the first 16 readings from one line (the ADC gives only a 12-bit signal). In my project I need 96k 16-bit samples for one line, so it's 0.768M samples for all 8 lines. This data needs 12000 Kb of space, but the STM32 has only 128 KB of SRAM, so I decided to send about 120 packets of 100 Kb each per second.
The conclusion is that I need ~11.72 Mb/s to send this.
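For reference, a quick sanity check of that arithmetic (a sketch; the numbers are simply the ones given above):

    #include <stdio.h>

    int main(void)
    {
        const double samples_per_line_per_s = 96000.0; /* 16-bit samples per second, one line */
        const double bits_per_sample        = 16.0;
        const double lines                  = 8.0;

        double bits_per_second = samples_per_line_per_s * bits_per_sample * lines;
        printf("Required throughput: %.3f Mbit/s\n", bits_per_second / 1e6);                 /* ~12.288 */
        printf("                  or %.2f Mibit/s\n", bits_per_second / (1024.0 * 1024.0));  /* ~11.72  */
        return 0;
    }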
The problem is that I'm unable to do that, because CDC USB limits me to ~1 Mb/s.
The question is how to increase the USB speed to 12 Mb/s on the STM32F4. I need some hint or library.
Or maybe should I set up an "audio device" in CubeMX?
If the small b in your question means bytes, the answer is: it is not possible, as your micro has a full-speed (FS) USB peripheral, whose maximum speed is 12 Mbit/s.
If it means bits, your ~1 Mb/s speed assumption is wrong, but you will still not reach 12 Mbit/s of payload transfer.
You may try to write your own USB class (only if b means bits), but I am afraid you will not find a ready-made library. You will also need to write the device driver on the host computer.

PWM transistor heating - Raspberry Pi

I have a Raspberry Pi and an auxiliary PCB with transistors for driving some LED strips.
The strips' datasheet says 12 V, 13.3 W/m. I'll use 3 strips in parallel, 1.8 m each, so 13.3 × 1.8 × 3 = 71.82 W, which at 12 V is almost 6 A.
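A quick check of that load calculation (a sketch; the figures are simply those given above):

    #include <stdio.h>

    int main(void)
    {
        const double watts_per_metre  = 13.3;
        const double metres_per_strip = 1.8;
        const double strips           = 3.0;
        const double supply_v         = 12.0;

        double power = watts_per_metre * metres_per_strip * strips;
        printf("Total LED power:   %.2f W\n", power);            /* 71.82 W */
        printf("Total LED current: %.2f A\n", power / supply_v); /* ~5.99 A */
        return 0;
    }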
I'm using an 8A transistor, E13007-2.
In the project I have 5 channels of different LEDs: RGB and 2 types of white.
R, G, B, W1 and W2 are connected directly to the Pi's pins.
The LED strips are connected to 12 V, and to CN3 and CN4 for GND (through the transistors).
Transistor schematic.
I know that's a lot of current passing through the transistors, but is there a way to reduce the heating? I think they're reaching 70-100°C. I already had a problem with one Raspberry Pi, and I think it's getting dangerous for the application. I have some large traces on the PCB, so that's not the problem.
Some thoughts:
1 - A resistor driving the base of the transistor. Maybe it won't reduce heating, but I think it's advisable for short-circuit protection. How can I calculate this?
2 - The PWM has a frequency of 100 Hz. Is there any difference if I reduce this frequency?
The BJT you're using has a current gain hFE of roughly 20. This means that the collector current is roughly 20 times the base current; in other words, the base current needs to be 1/20 of the collector current, i.e. 6 A / 20 = 300 mA.
A Raspberry Pi certainly can't supply 300 mA from its I/O pins, so you're operating the transistor in the linear region, which causes it to dissipate a lot of heat.
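To put rough numbers on that (a sketch; the ~16 mA GPIO figure and the 2 V collector-emitter drop are illustrative assumptions, not values from the question or a datasheet):

    #include <stdio.h>

    int main(void)
    {
        const double collector_current = 6.0;    /* A, total LED strip load */
        const double hfe               = 20.0;   /* rough current gain at high Ic */
        const double gpio_limit        = 0.016;  /* A, assumed safe GPIO source current */

        double base_needed = collector_current / hfe;  /* base current needed for saturation */
        printf("Base current needed: %.0f mA\n", base_needed * 1000.0);  /* ~300 mA */
        printf("GPIO can supply:     %.0f mA\n", gpio_limit * 1000.0);   /* ~16 mA  */

        /* With far too little base drive the BJT stays in the linear region.
           If it drops, say, 2 V at 6 A, it has to dissipate: */
        const double vce_linear = 2.0;  /* V, illustrative only */
        printf("Dissipation example: %.0f W\n", vce_linear * collector_current);  /* 12 W */
        return 0;
    }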
Change your transistors to MOSFETs with a low enough threshold voltage (e.g. 2.0 V, to have enough conduction at a 3.3 V I/O voltage) to keep it simple.
An N-channel MOSFET will run much cooler if you supply enough gate voltage to fully enhance it. Since this is not a high-volume item, why not simply use a MOSFET gate-driver chip? Then you can use a low-RDS(on) device. Another option is the Siemens BTS660 (BTS50085B, TO-220). It is a high-side driver that you will need to drive with an open-collector or open-drain device. It will switch 5 A at room temperature with no heat sink; it is rated for much more current and is available in a TO-220-style package. It is obsolete but still available, as is its replacement. MOSFETs are voltage controlled, while BJTs are current controlled.

ATTiny85 Internal Clock and One-Wire

Is the internal clock on the ATTiny85 sufficiently accurate for one-wire timing?
Per https://learn.sparkfun.com/tutorials/ws2812-breakout-hookup-guide, one-wire timing seems to need accuracy in the 0.05 µs range, so a 10% clock error on the AVR at 8 MHz would cause 0.0125 µs timing differences per cycle (assuming the 10% error figure is accurate, and that it's a 10% error on frequency, not ±10% variance on each pulse).
Not a ton of margin - but is it good enough?
First of all, WS2812 LEDs do not use the 1-Wire protocol.
The control protocol of the WS2812 is described in its datasheet.
The short answer is yes, the ATtiny85, and the whole AVR family, have enough clock accuracy to control a WS2812 chain. But the routine should be written in assembler, and no interrupts should be allowed, to guarantee that the timing requirements are met. When the programming is done well, the 8 MHz internal oscillator is even enough to output different data to two WS2812 chains simultaneously.
So, when running at 8 MHz ±10%, one clock cycle is approximately 114...139 ns.
The datasheet requires (with ±150 ns tolerance):
When transmitting a "one": high level 550...850 ns; - 6 clock cycles (682...833 ns) match this range (5 clock cycles (568...694 ns) also match)
following low level 450...750 ns; - 5 cycles (568...694 ns)
When transmitting a "zero": high level 200...500 ns; - 3 cycles (341...417 ns)
following low level 650...950 ns; - 6 cycles (682...833 ns).
So, as you can see, even with a ±10% tolerance on the clock source, you can find an integer number of cycles that is guaranteed to fall within the required intervals.
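A quick way to check those windows is to recompute them (a sketch; the datasheet limits are the ones quoted above):

    #include <stdio.h>

    int main(void)
    {
        /* 8 MHz internal RC oscillator with +/-10% tolerance */
        const double t_min_ns = 1e9 / (8.0e6 * 1.1);  /* shortest cycle, ~114 ns */
        const double t_max_ns = 1e9 / (8.0e6 * 0.9);  /* longest cycle,  ~139 ns */

        /* Datasheet windows quoted above and the cycle counts to try */
        struct { const char *name; double lo, hi; int cycles; } win[] = {
            { "'one'  high", 550, 850, 6 },
            { "'one'  low ", 450, 750, 5 },
            { "'zero' high", 200, 500, 3 },
            { "'zero' low ", 650, 950, 6 },
        };

        for (int i = 0; i < 4; i++) {
            double lo = win[i].cycles * t_min_ns;  /* shortest pulse the MCU can produce */
            double hi = win[i].cycles * t_max_ns;  /* longest pulse the MCU can produce  */
            printf("%s: %d cycles -> %.0f...%.0f ns (window %.0f...%.0f ns) %s\n",
                   win[i].name, win[i].cycles, lo, hi, win[i].lo, win[i].hi,
                   (lo >= win[i].lo && hi <= win[i].hi) ? "OK" : "out of window");
        }
        return 0;
    }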
Speaking from experience, it will still work if the low level that follows a pulse is extended by a couple of hundred nanoseconds.
There are known issues with using the internal oscillator with a UART - the UART should be timed to about 2% accuracy, while the internal oscillator can be up to 10% off with the factory setting. While it can be calibrated (the AVR has the OSCCAL register for that purpose), its frequency is influenced by temperature.
It is worth a try, but it might not be reliable with temperature changes or a fluctuating operating voltage.
References: ATmega's internal oscillator - how bad is it, Timing accuracy on tiny2313, Tuning the internal oscillator
The timing requirements of NeoPixels (WS2812B) are wide enough that the only really critical part is the minimum width of a 1 bit. The ATtiny85 at 16 MHz is plenty fast to drive a string of them from a GPIO pin. At 8 MHz it may not work (I haven't tried yet). I just released a small Arduino sketch which allows you to control NeoPixel strings of any length on an ATtiny85 without using any RAM.
https://github.com/bitbank2/NeoPixel
For devices with hardware SPI (e.g. ATMega328p), it's better to use SPI to shift out the bits (also included in my code).

ACP and DMA, how do they work?

I'm using an ARM Cortex-A53 platform. It has an ACP component, and I'm trying to use DMA to transfer data through the ACP.
According to the ARM TRM, if I understand it correctly, the DMA transfer data size is limited to 64 bytes for each DMA transfer when using the ACP.
If so, does this limitation make DMA unusable? It seems pointless to configure a DMA descriptor just to transfer only 64 bytes each time.
Or should the DMA automatically divide its transfer length into many ACP-size-limited (64-byte) packets, without any software intervention?
I need an expert to explain how the ACP and DMA work together.
Somewhere in the interface from the DMA to the ACP's AXI port, the transfer length should be automatically divided as needed into transfers of the appropriate length. For the Cortex-A53 ACP, AXI transfers are limited to 64 B (perhaps intentionally one cache line).
From https://developer.arm.com/documentation/ddi0500/e/level-2-memory-system/acp/transfer-size-support :
An x-byte INCR request characterized by: (some list of limitations)
Note the use of INCR instead of FIXED. INCR automatically increments the address according to the size of the transfer, while FIXED does not. This makes it simple for the peripheral to break a large transfer into a series of multiple INCR transfers.
However, do note that on the Cortex-A53, the transfer size (x in the quote) is fixed at 16- or 64-byte aligned transfers. If the DMA sends an inappropriately sized transfer (because it is misconfigured, or the correct size is unsupported), the AXI port will respond with a SLVERR. If the buffer is not appropriately aligned, I think this also causes a SLVERR.
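If the DMA engine or interconnect does not do this splitting for you, the driver has to. A minimal sketch of chunking a buffer into 64-byte, 64-byte-aligned transfers before programming descriptors (dma_queue_descriptor is a made-up placeholder; any real DMA driver API will differ):

    #include <stdint.h>
    #include <stddef.h>

    #define ACP_CHUNK 64u  /* Cortex-A53 ACP accepts 16- or 64-byte aligned transfers */

    /* Hypothetical descriptor-programming hook, for illustration only. */
    extern void dma_queue_descriptor(uintptr_t src, uintptr_t dst, size_t len);

    /* Split a copy into ACP-friendly 64-byte chunks.
     * Returns 0 on success, -1 if the addresses or length are not 64-byte aligned
     * (such requests would otherwise be rejected with a SLVERR). */
    int acp_dma_copy(uintptr_t src, uintptr_t dst, size_t len)
    {
        if ((src | dst | len) & (ACP_CHUNK - 1))
            return -1;

        while (len) {
            dma_queue_descriptor(src, dst, ACP_CHUNK);
            src += ACP_CHUNK;
            dst += ACP_CHUNK;
            len -= ACP_CHUNK;
        }
        return 0;
    }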
Lastly, the on-chip network routing must support connecting the DMA to the ACP at chip design time. In my experience this is more commonly done for network accelerators and FPGA fabric glue, but tends to be less often connected for low speed peripherals like UART/SPI/I2C.

AXI4 (Lite) Narrow Burst vs. Unaligned Burst Clarification/Compatibility

I'm currently writing an AXI4 master that is supposed to support AXI4 Lite (AXI4L) as well.
My AXI4 master is receiving data from a 16-bit interface. This is on a Xilinx Spartan 6 FPGA and I plan on using the EDK AXI4 Interconnect IP, which has a minimum WDATA width of 32 bits.
At first I wanted to use narrow bursts, i.e. AWSIZE = x"01" (2 bytes per transfer). However, I found that Xilinx's AXI Reference Guide UG761 states that "narrow bursts [are] supported but [...] not recommended." Unaligned transactions are supposed to be supported.
This had me thinking. Say I start an unaligned burst:
AWLEN = x"01" (2 beats)
AWSIZE = x"02" (4 bytes in transfer")
And do the following:
AX (32-bit word #0: send hi16)
XB (32-bit word #1: send lo16)
where A and B are my 16-bit words, which start at an unaligned (2-byte-aligned) address, and X means WSTRB is deasserted for the indicated 16 bits.
Is this supported, or does this fall under the category "narrow burst" even though AWSIZE = x"02" (4 bytes per transfer) as opposed to AWSIZE = x"01" (2 bytes per transfer)?
Now, if this was just for AXI4, I would probably not care as much about this use case, because AXI4 peripherals are required to use the WSTRB signals. However, the AXI Reference Guide UG761 states "[AXI4L] Slaves interface can elect to ignore WSTRB (assume all bytes valid)."
I read here that many (but not all; and there is no list?) Xilinx AXI4L peripherals do elect to ignore WSTRB.
Does this mean that I'm essentially barred from doing narrow bursts ("not recommended") as well as unaligned bursts ("WSTRB can be ignored"), or is there an easy way to offload some of the implementation work from my master onto the interconnect while guaranteeing proper system behavior when accessing AXI4L peripherals?
Your example is not a narrow burst, and should work.
The reason narrow bursts are not recommended is that they give sub-optimal performance. Both narrow bursts and data realignment cost area and are not recommended IMHO. However, DRE (data realignment) has minimal bandwidth cost, while narrow bursts do have one. If your AXI port is 32 bits at 100 MHz, you have 3.2 Gbit/s maximum throughput; if you use 16-bit narrow bursts 50% of the time, your maximum throughput is reduced to 2.4 Gbit/s (32 bits × 50 MHz + 16 bits × 50 MHz). Also, I'm not sure AXI4-Lite supports narrow bursts or data realignment.
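The throughput penalty is easy to reproduce (a sketch; the clock and widths are just the example figures above):

    #include <stdio.h>

    int main(void)
    {
        const double clk_hz      = 100e6;  /* interconnect clock from the example */
        const double full_bits   = 32.0;   /* full data-bus width */
        const double narrow_bits = 16.0;   /* narrow-burst width */

        printf("Full-width beats only: %.1f Gbit/s\n", full_bits * clk_hz / 1e9);                              /* 3.2 */
        printf("50%% narrow bursts:     %.1f Gbit/s\n", (0.5 * full_bits + 0.5 * narrow_bits) * clk_hz / 1e9); /* 2.4 */
        return 0;
    }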
Your example has 2 major flaws. First, it requires 3 data beats to transfer 32 bits, which is worse than a narrow burst (I don't think AXI is smart enough to cancel the last beat when WSTRB is all 0). Second, you can't burst more than two 16-bit words at a time, which will hamper your AXI infrastructure's performance if you have a lot of data to transfer.
The best way to deal with this is to concatenate the 16-bit words together to form 32-bit words inside your block. Then you buffer these 32-bit words and burst them out when you have enough. This is the high-performance AXI way to do it.
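As a sketch of the packing idea (in the real design this would be a small shift register plus a valid flag in your HDL master; this C model just illustrates the intended ordering, assuming the first halfword lands in bits [15:0], i.e. the lower addresses of a little-endian bus):

    #include <stdint.h>
    #include <stdbool.h>

    /* Pack incoming 16-bit beats into 32-bit words for full-width AXI bursts. */
    typedef struct {
        uint16_t low;       /* first halfword, waiting for its partner */
        bool     have_low;
    } packer_t;

    /* Returns true when a complete 32-bit word is available in *word_out. */
    bool packer_push(packer_t *p, uint16_t halfword, uint32_t *word_out)
    {
        if (!p->have_low) {
            p->low = halfword;
            p->have_low = true;
            return false;
        }
        *word_out = ((uint32_t)halfword << 16) | p->low;
        p->have_low = false;
        return true;
    }

Packed words would then go into a FIFO and be written out as full-length INCR bursts once enough of them have accumulated.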
However, if you receive data as 16-bit words, it seems you would be better off using AXI-Stream, which supports 16-bit data but has no notion of addresses. You can map an AXI-Stream onto AXI4 using Xilinx IP cores. Either the AXI Datamover or the AXI DMA can do that. Both do the same thing (in fact, the AXI DMA includes a Datamover), but the AXI DMA is controlled through an AXI-Lite interface while the Datamover is controlled through additional AXI-Streams.
As a final note, the Xilinx cores never require narrow bursts or DRE. If you need DRE with the AXI DMA, it's done by the AXI DMA core and not by the AXI Interconnect. Also, these cores are delivered as clear source, so you can easily check how they operate.
