Can POWER8 use atomic operations to communicate with an ASIC/FPGA connected by PCI Express?

As is known, POWER8 supports the Coherent Accelerator Processor Interface (CAPI): https://www.nextplatform.com/2015/06/22/the-secret-of-power8-capi-is-addressing/
Hardware Managed Cache Coherence
Enables the accelerator to participate in “Locks” as a normal thread
Lowers Latency over IO communication model
https://www.microway.com/download/presentation/IBM_POWER8_CPU_Architecture.pdf
What does “Locks” mean here? Does it mean that we can use a spin-lock to protect shared memory, so that it can be accessed safely from both the CPU cores and PCIe devices (ASIC, FPGA, ...)?
That is, does it mean that we can use spin-locks, atomic operations, and even LL/SC atomic operations across the PCI Express bus?

So P8 doesn't support the PCIe atomics defined by the PCIe SIG (an optional feature of PCIe).
It does support some proprietary atomic primitives that are used by CAPI. I don't know whether it is possible to exploit them from a non-CAPI adapter on P8.
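To make concrete what a host-side lock on ordinary cacheable shared memory looks like, here is a minimal sketch using C11 atomics. On POWER8 the compiler lowers these operations to lwarx/stwcx. (LL/SC) loops; a CAPI-attached accelerator that participates in cache coherence could in principle contend for the same lock word, while a plain PCIe device could not. The lock layout is an illustrative assumption, not a CAPI API.

    /* Host-side spin-lock on a shared, cacheable lock word (illustrative only;
     * nothing here is CAPI-specific).  On POWER8, GCC compiles these builtins
     * to lwarx/stwcx. (LL/SC) loops. */
    #include <stdatomic.h>

    typedef struct {
        atomic_uint locked;     /* 0 = free, 1 = held */
    } spinlock_t;

    static void spin_lock(spinlock_t *l)
    {
        /* atomic test-and-set with acquire semantics; spin while it was held */
        while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
            ;                   /* a real implementation would back off here */
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }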

Related

What is a practical way of GUI control for FPGA logic?

I have one of the Zynq development boards (Z7020), on whose hardware cores I am running Linux. I want to be able to control the logic that I will program into the FPGA portion of the Zynq from a GUI running on the hardware cores and displayed on the connected touch screen.
Would I just send interrupts to the FPGA as I select options or start/stop a task from the GUI?
How do I also return either an indication that the task is finished, or possibly some data, from the FPGA back to the hardware cores?
The most direct communication path between the CPUs and the programmable logic is the AXI memory interconnect, which enables the processors to send read and write requests to the programmable logic.
You can implement registers or FIFOs in your programmable logic and control the logic by writing to the registers or enqueuing data into the FIFOs. The programmable logic can return data to the processors via registers or by enqueuing into memory-mapped FIFOs that are dequeued by the processors.
It can be helpful for the programmable logic to interrupt the CPU when there is something for the CPU to do.
Interrupts and AXI interconnect between the processors and the programmable logic are documented in the Zynq Technical Reference Manual.
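As a concrete (hedged) illustration of the register approach, the sketch below pokes a memory-mapped control register in the PL from Linux on the PS via /dev/mem. The base address 0x43C00000 and the register layout (offset 0 = control, offset 4 = status) are assumptions for the example; a real design would use the address assigned to the AXI slave in Vivado's address editor, and typically a UIO or custom driver rather than /dev/mem.

    /* Map one page of the PL's AXI register space and access it from user space. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define PL_BASE  0x43C00000UL   /* hypothetical AXI slave base address */
    #define PAGE_SZ  4096UL

    int main(void)
    {
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        volatile uint32_t *regs = mmap(NULL, PAGE_SZ, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, PL_BASE);
        if (regs == MAP_FAILED) { perror("mmap"); return 1; }

        regs[0] = 1;                                        /* e.g. "start task" control bit */
        printf("status = 0x%08x\n", (unsigned)regs[1]);     /* e.g. "task done" status reg   */

        munmap((void *)regs, PAGE_SZ);
        close(fd);
        return 0;
    }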

Optimize socket data transfer over loopback wrt NUMA

I was looking over the Linux loopback and IP network data handling, and it seems that there is no code to cover the case where 2 CPUs on different sockets are passing data via the loopback.
I think it should be possible to detect this condition and then use a hardware DMA engine, when available, to copy the data to the receiver and avoid NUMA contention.
My questions are:
Am I correct that this is not currently done in Linux?
Is my thinking that this is possible on the right track?
What kernel APIs or existing drivers should I study to help complete such a version of the loopback?
There are several projects/attempts to add interfaces to memory-to-memory DMA engines intended for use in HPC (MPI):
KNEM kernel module - High-Performance Intra-Node MPI Communication - http://knem.gforge.inria.fr/
Cross Memory Attach (CMA) - New syscalls process_vm_readv, process_vm_writev: http://man7.org/linux/man-pages/man2/process_vm_readv.2.html
KNEM can use the Intel I/OAT DMA engine on some microarchitectures and for some transfer sizes:
I/OAT copy offload through DMA Engine
One interesting asynchronous feature is certainly I/OAT copy offload.
    icopy.flags = KNEM_FLAG_DMA;
Some authors say that the hardware DMA engine has no benefit on newer Intel microarchitectures:
http://www.ipdps.org/ipdps2010/ipdps2010-slides/CAC/slides_cac_Mor10OptMPICom.pdf
“I/OAT only useful for obsolete architectures”
CMA was announced as a project similar to KNEM: http://www.open-mpi.org/community/lists/devel/2012/01/10208.php
These system calls were designed to permit fast message passing by
allowing messages to be exchanged with a single copy operation
(rather than the double copy that would be required when using, for
example, shared memory or pipes).
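A minimal sketch of the CMA path, using the documented process_vm_readv() signature; the peer's PID and remote buffer address are placeholders that would normally be exchanged over a side channel (a pipe, a socket, a shared file):

    /* Copy a buffer directly out of another process's address space with a
     * single copy (Cross Memory Attach).  remote_pid and remote_addr are
     * placeholders for this sketch. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    int main(void)
    {
        pid_t remote_pid  = 1234;                       /* placeholder: peer PID         */
        void *remote_addr = (void *)0x7f0000000000UL;   /* placeholder: peer buffer addr */
        char  local_buf[4096];

        struct iovec local  = { .iov_base = local_buf,   .iov_len = sizeof(local_buf) };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = sizeof(local_buf) };

        ssize_t n = process_vm_readv(remote_pid, &local, 1, &remote, 1, 0);
        if (n < 0)
            perror("process_vm_readv");
        else
            printf("copied %zd bytes with a single copy\n", n);
        return 0;
    }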
If you can, you should not use sockets (especially TCP sockets) to transfer data: they have high software overhead that is not needed when you are working on a single machine. The standard skb size limit may also be too small for I/OAT to be used effectively, so the network stack probably will not use I/OAT.

What is the use of the DMA controller in a processor?

DMA controllers are present on disks and networking devices, so these can transfer data to main memory directly. Then what is the use of the DMA controller inside the processor chip? I would also like to know: if there are different buses (I2C, PCI, SPI) outside the processor chip and only one bus (AXI) inside the processor, how does this work? (Shouldn't it result in some bottleneck?)
The on-chip DMA controller can take over the task of copying data between devices and memory for simple devices that cannot implement a DMA engine of their own. Such devices might be a mouse, a keyboard, a sound card, a Bluetooth device, etc. These devices have simple logic, and their requests are multiplexed and sent to a single general-purpose DMA controller on the chip.
Peripherals with high bandwidth, such as GPU cards, network adapters, and hard disks, implement their own DMA engines, which talk to the chip's bus in order to initiate uploads and downloads to system memory.
if there are different buses (I2C, PCI, SPI) outside the processor chip and only one bus (AXI) inside the processor, how does this work? (Shouldn't it result in some bottleneck?)
That's actually simple. The internal on-chip AXI bus is much faster: it runs at a much higher frequency (equal to, or in the same range as, the CPU's frequency) and therefore has a much higher bandwidth than the aggregate bandwidth of I2C + PCI + SPI. Of course, multiple hardware elements compete for the AXI bus, but usually priorities and various arbitration/optimization techniques are implemented.
From Wikipedia:
Direct memory access (DMA) is a feature of computerized systems that allows certain hardware subsystems to access main system memory independently of the central processing unit (CPU). [...] A DMA controller can generate memory addresses and initiate memory read or write cycles. It contains several processor registers that can be written and read by the CPU. These include a memory address register, a byte count register, and one or more control registers.
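To make the register set from that description concrete, here is a sketch of programming a hypothetical memory-mapped DMA controller with exactly those registers (memory address, byte count, control). The base address, register offsets, and bit meanings are invented for the illustration; a real controller's data sheet defines them.

    /* Programming a hypothetical DMA controller: set the memory address and
     * byte count, start the transfer, and wait for completion.  All offsets
     * and bits are invented for this sketch. */
    #include <stdint.h>

    #define DMA_BASE       ((uintptr_t)0x40001000u)  /* hypothetical controller base */
    #define DMA_ADDR_REG   (*(volatile uint32_t *)(DMA_BASE + 0x0))  /* memory address */
    #define DMA_COUNT_REG  (*(volatile uint32_t *)(DMA_BASE + 0x4))  /* byte count     */
    #define DMA_CTRL_REG   (*(volatile uint32_t *)(DMA_BASE + 0x8))  /* control/status */

    #define DMA_CTRL_START (1u << 0)     /* hypothetical "start transfer" bit    */
    #define DMA_CTRL_DONE  (1u << 1)     /* hypothetical "transfer complete" bit */

    /* Move nbytes from the device into main memory at dest_addr. */
    static void dma_copy_to_memory(uint32_t dest_addr, uint32_t nbytes)
    {
        DMA_ADDR_REG  = dest_addr;       /* where in main memory to write          */
        DMA_COUNT_REG = nbytes;          /* how many bytes to move                 */
        DMA_CTRL_REG  = DMA_CTRL_START;  /* let the controller generate the cycles */

        while (!(DMA_CTRL_REG & DMA_CTRL_DONE))
            ;                            /* real code would sleep or take an interrupt */
    }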

How does hypervisor design change when hardware support is provided in the processor?

I would like to know how hypervisor design changes, or how its functionality improves, when hardware support is provided by the processor. As far as I know, the ARM Cortex-A9 series doesn't have processor support for virtualization; this is expected from the ARM Cortex-A15 onwards. My question is: how does this differ in implementation, what does this hardware support mean in general, and which components of the hypervisor software does the hardware take care of?
Thanks,
R
Basically, understanding the changes in hypervisor implementation when a processor supports virtualization requires understanding what hardware virtualization is. The hardware extensions make the non-privileged but sensitive instructions of the ISA (e.g. popf) privileged, in the sense that these instructions now trap to the hypervisor. That is the basic requirement for virtualizable hardware. Over time, vendors introduced additional virtualization functionality; the most important was nested paging (EPT/NPT), which virtualizes memory efficiently. Today's hardware has been revolutionized compared to the late 90s.
When there were no hardware facilities, the VMware team still managed to virtualize the hardware: they used binary translation and a dynamic interpreter for the non-virtualizable part of the ISA, and shadow page tables (sPT) to virtualize memory. With sPT, the processor walks the shadow tables instead of the guest tables (since the MMU can only walk one set of tables). With EPT/NPT, the MMU takes two rounds through the tables: first the guest tables and then the EPT/NPT. This increases efficiency in most use cases. When a hypervisor uses the hardware virtualization extensions, it must use the hardware-defined structures (second-level page tables, the VMCS) in the prescribed manner. It is difficult to give a complete answer because of the broad scope of the question, but I hope this provides sufficient material to start with.
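As a small, x86-specific illustration of what "hardware support" means at the lowest level (the answer above is mostly framed in x86 terms), the sketch below checks whether the processor advertises a virtualization extension via CPUID; only then can a hypervisor rely on traps, VMCS/VMCB structures, and EPT/NPT instead of binary translation and shadow page tables. The bit positions are the architecturally documented ones (Intel VMX: CPUID.1:ECX[5], AMD SVM: CPUID.80000001h:ECX[2]); ARM exposes its virtualization extensions differently (e.g. the Hyp mode added with the Cortex-A15).

    /* Detect x86 hardware virtualization support with gcc/clang's <cpuid.h>. */
    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 5)))
            puts("Intel VT-x (VMX) available");
        else if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 2)))
            puts("AMD-V (SVM) available");
        else
            puts("no hardware virtualization extension reported");

        return 0;
    }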

Difference between I/O ports and I/O memory

I just want to know the difference between I/O ports and I/O memory, because I am quite confused. And if someone could explain the use of each, that would be great. By use I mean: when are I/O ports preferred and when is I/O memory preferred?
There is no conceptual difference between memory regions and I/O regions: both of them are accessed by asserting electrical signals on the address bus and control bus.
While some CPU manufacturers implement a single address space in their chips, others decided that peripheral devices are different from memory and, therefore, deserve a separate address space. Some processors (most notably the x86 family) have separate read and write electrical lines for I/O ports and special CPU instructions to access ports.
Linux implements the concept of I/O ports on all computer platforms it runs on, even on platforms where the CPU implements a single address space. The implementation of port access sometimes depends on the specific make and model of the host computer (because different models use different chipsets to map bus transactions into memory address space).
Even if the peripheral bus has a separate address space for I/O ports, not all devices map their registers to I/O ports. While use of I/O ports is common for ISA peripheral boards, most PCI devices map registers into a memory address region. This I/O memory approach is generally preferred, because it doesn't require the use of special-purpose processor instructions; CPU cores access memory much more efficiently, and the compiler has much more freedom in register allocation and addressing-mode selection when accessing memory.
More details at http://www.makelinux.net/ldd3/chp-9-sect-1
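A kernel-driver-flavoured sketch in the spirit of LDD3, contrasting the two access styles with the standard kernel APIs. The port base 0x378 and the MMIO base address/length are placeholders; a real driver takes them from the bus (for example from pci_resource_start()).

    #include <linux/errno.h>
    #include <linux/io.h>
    #include <linux/ioport.h>
    #include <linux/types.h>

    /* I/O ports: separate address space, accessed with special instructions
     * (in/out on x86) via the inb()/outb() helpers. */
    static int talk_via_ports(void)
    {
        unsigned long port = 0x378;                 /* placeholder port base */

        if (!request_region(port, 4, "demo-ports"))
            return -EBUSY;
        outb(0x01, port);                           /* write a command byte  */
        (void)inb(port + 1);                        /* read a status byte    */
        release_region(port, 4);
        return 0;
    }

    /* I/O memory: device registers mapped into the memory address space and
     * accessed through ioremap() plus ioread32()/iowrite32(). */
    static int talk_via_mmio(void)
    {
        phys_addr_t base = 0xfe000000;              /* placeholder BAR address */
        void __iomem *regs;

        if (!request_mem_region(base, 0x100, "demo-mmio"))
            return -EBUSY;
        regs = ioremap(base, 0x100);
        if (!regs) {
            release_mem_region(base, 0x100);
            return -ENOMEM;
        }
        iowrite32(0x1, regs + 0x00);                /* control register */
        (void)ioread32(regs + 0x04);                /* status register  */
        iounmap(regs);
        release_mem_region(base, 0x100);
        return 0;
    }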

Resources