I am writing a USB audio playback driver using the ALSA APIs. To that end I was trying to understand the existing audio drivers in the Linux kernel, but I am confused about when to update the kernel audio buffer pointer. As we know, the kernel puts new audio data in a ring buffer, and our driver's task is to take the new data from the ring buffer, pass it over USB and update the kernel buffer pointer.
The drivers I was looking at take care of this in the URB completion function. They have a predefined macro for the USB transfer size, which is around 4096 bytes in almost all cases. When the URB transfer finishes and execution reaches the URB completion handler, they copy another 4096 bytes from the kernel buffer into the URB buffer, submit the URB to the USB controller again and advance the kernel buffer pointer by 4096 bytes.
What I don't understand is: how can they be sure that, by the time a URB transfer is finished, there are 4096 bytes of new data in the kernel buffer? The amount of new data in the kernel buffer might be smaller than 4096 bytes, so why do they always advance the buffer pointer by 4096 bytes? I think there should be some way of knowing how many new bytes are in the kernel buffer, and the driver should only advance by that amount. Or maybe I misunderstood something? Any suggestion or guideline is appreciated.
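A minimal userspace model of the completion-handler pattern described above, assuming a byte-addressed ring buffer and a fixed 4096-byte transfer size. The names `urb_refill` and `XFER_SIZE` are hypothetical; a real driver copies from `runtime->dma_area` and tracks a position inside its URB completion handler. The sketch shows that the copy wraps at the end of the ring and the position always advances by the full transfer size, regardless of how much the application has written.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define XFER_SIZE 4096 /* hypothetical fixed USB transfer size in bytes */

/* Copy XFER_SIZE bytes starting at *hw_pos out of a ring of ring_bytes
 * bytes into urb_buf, wrapping at the end of the ring, then advance
 * *hw_pos by XFER_SIZE (modulo the ring size). The copy is unconditional:
 * nothing here checks how much of the ring the application has filled. */
static void urb_refill(const unsigned char *ring, size_t ring_bytes,
                       size_t *hw_pos, unsigned char *urb_buf)
{
    size_t first, pos;

    assert(ring_bytes >= XFER_SIZE && *hw_pos < ring_bytes);
    pos = *hw_pos;
    first = ring_bytes - pos;            /* bytes left before the end of the ring */
    if (first >= XFER_SIZE) {
        memcpy(urb_buf, ring + pos, XFER_SIZE);
    } else {
        memcpy(urb_buf, ring + pos, first);               /* tail of the ring */
        memcpy(urb_buf + first, ring, XFER_SIZE - first); /* wrap to the start */
    }
    *hw_pos = (pos + XFER_SIZE) % ring_bytes;
}
```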
These USB audio drivers behave exactly like a PCI sound card, i.e., when the device needs some samples, those samples are just read from the ring buffer.
A PCI chip has no way of knowing what part of the buffer actually contains valid samples.
A buffer underrun is detected later by software (the device informs the driver about the current position with an interrupt; the interrupt handler then raises the underrun error if the position is too far ahead).
USB audio drivers use exactly the same mechanism for detecting underruns, i.e., the snd_pcm_period_elapsed() function checks whether the current position (as returned by your .pointer callback) is too far ahead.
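A simplified model of that check, assuming monotonically increasing frame counters `hw_ptr` (frames the device has consumed, i.e. what the `.pointer` callback reports) and `appl_ptr` (frames the application has written). This is not ALSA's actual code, which keeps positions modulo a boundary and compares the available space against `stop_threshold`; it only illustrates why the driver never needs to know how much valid data is in the ring.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of the playback underrun check. */
struct pcm_model {
    uint64_t hw_ptr;      /* frames consumed by the device so far */
    uint64_t appl_ptr;    /* frames written by the application so far */
    uint64_t buffer_size; /* ring size in frames */
};

/* Frames written but not yet played (negative if the device overtook the app). */
static int64_t frames_queued(const struct pcm_model *p)
{
    return (int64_t)(p->appl_ptr - p->hw_ptr);
}

/* The device has read `frames` more frames from the ring without checking
 * their validity. With a stop threshold of one full buffer, "available
 * space >= buffer_size" is the same as "nothing queued", which is an underrun. */
static bool advance_and_check_xrun(struct pcm_model *p, uint64_t frames)
{
    assert(frames <= p->buffer_size);
    p->hw_ptr += frames;
    return frames_queued(p) <= 0;
}
```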
I'm using an ARM Cortex-A53 platform that has an ACP (Accelerator Coherency Port), and I'm trying to use DMA to transfer data through the ACP.
According to the ARM TRM, if I understand it correctly, the data size of each DMA transfer is limited to 64 bytes when using the ACP.
If so, does this limitation make DMA unusable? It seems pointless to configure a DMA descriptor only to transfer 64 bytes at a time.
Or should the DMA automatically divide its transfer into many ACP-sized (64-byte) packets, without any software intervention?
Could an expert explain how the ACP and DMA work together?
Somewhere in the interface between the DMA and the ACP's AXI port, the transfer should be automatically divided as needed into transfers of appropriate length. For the Cortex-A53 ACP, AXI transfers are limited to 64 B (perhaps intentionally one cache line).
From https://developer.arm.com/documentation/ddi0500/e/level-2-memory-system/acp/transfer-size-support :
x-byte INCR request characterized by: (some list of limitations)
Note the use of INCR instead of FIXED. INCR automatically increments the address according to the size of the transfer, while FIXED does not. This makes it simple for the peripheral to break a large transfer into a series of INCR transfers.
However, note that on the Cortex-A53 the transfer size (x in the quote) is fixed at 16- or 64-byte aligned transfers. If the DMA sends an inappropriately sized transfer (because it is misconfigured or the correct size is unsupported), the AXI port will respond with a SLVERR. If the buffer is not appropriately aligned, I think this also causes a SLVERR.
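To illustrate the splitting, here is a small model of what the DMA engine or interconnect is expected to do, assuming (per the TRM page quoted above) that only 16-byte and 64-byte INCR transfers aligned to their own size are accepted. The function `split_for_acp` is hypothetical and its `-1` return stands in for the SLVERR; treat it as a sketch, not the TRM's exact rule set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct burst {
    uint64_t addr;
    uint32_t len;   /* 16 or 64 bytes */
};

/* Split [addr, addr + len) into ACP-sized INCR bursts: 64 bytes where the
 * address is 64-byte aligned and at least 64 bytes remain, otherwise 16.
 * Returns the number of bursts written to out, or -1 if address or length
 * is not a multiple of 16 (the SLVERR case) or out is too small. */
static int split_for_acp(uint64_t addr, size_t len,
                         struct burst *out, size_t max_bursts)
{
    size_t n = 0;

    assert(out != NULL || max_bursts == 0);
    if ((addr % 16) != 0 || (len % 16) != 0)
        return -1;
    while (len > 0) {
        uint32_t sz = ((addr % 64) == 0 && len >= 64) ? 64 : 16;
        if (n == max_bursts)
            return -1;
        out[n].addr = addr;
        out[n].len = sz;
        n++;
        addr += sz;    /* INCR: each burst starts where the previous one ended */
        len -= sz;
    }
    return (int)n;
}
```

A request starting at 0x1030 for 160 bytes, for example, becomes one 16-byte burst up to the next 64-byte boundary, two 64-byte bursts, and a final 16-byte burst.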
Lastly, the on-chip interconnect must support connecting the DMA to the ACP, which is decided at chip design time. In my experience this is more commonly done for network accelerators and FPGA fabric glue, and is less often done for low-speed peripherals like UART/SPI/I2C.
I need to pass a direct ATA request to a hard drive (0x25, READ DMA EXT), to disobey the max sector count (long story), and to bypass all possible OS caches, buffers, reorderings, etc.
The HDIO_DRIVE_TASKFILE IOCTL is no longer available due to libata.
I accomplished the goal with an SG_IO IOCTL using ATA pass-through (SG_ATA_16). It works perfectly except for one problem: I can read a maximum of 8192 sectors in one command. I need to read a full 32767 sectors.
max_hw_sectors_kb is 32767, so the drive supports a transfer this large.
max_sectors_kb was low, but I raised it to 32767, to no avail.
The scheduler is set to noop; no change.
I tried a gather buffer (iovec_count > 0, with the iovecs set to consecutive buffer slices); no change.
Environment: Ubuntu 16.04/16.10/17.04 with standard kernels, SATA drive connected to standard AHCI interface on Intel chipset.
No matter what I do, starting with 8193 sectors, IOCTL bails out with "Invalid argument" error.
Where to look? What else can cause a 4MB data transfer cap?
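For reference, a sketch of how READ DMA EXT can be packed into the 16-byte ATA PASS-THROUGH CDB (opcode 0x85) defined in the SCSI/ATA Translation (SAT) specification. The field choices (protocol 6 = DMA, EXTEND set, T_DIR from device, BYTE_BLOCK set, T_LENGTH taken from the sector count field) are my reading of SAT and should be checked against the spec; the function name is hypothetical and the `sg_io_hdr` submission is omitted. Note that 8192 sectors × 512 bytes is exactly 4 MiB, while 32767 sectors is just under 16 MiB, and the 16-bit count field of the EXT command can express both, so the CDB itself is not what caps the transfer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ATA_16           0x85  /* ATA PASS-THROUGH(16) opcode */
#define ATA_READ_DMA_EXT 0x25

/* Fill a 16-byte ATA PASS-THROUGH(16) CDB for READ DMA EXT.
 * lba is a 48-bit LBA; count is the number of 512-byte sectors (nonzero). */
static void build_read_dma_ext_cdb(uint8_t cdb[16], uint64_t lba,
                                   uint16_t count)
{
    assert(lba < (1ULL << 48) && count > 0);
    memset(cdb, 0, 16);
    cdb[0]  = ATA_16;
    cdb[1]  = (6 << 1) | 1;      /* PROTOCOL = 6 (DMA), EXTEND = 1 (48-bit) */
    cdb[2]  = 0x0E;              /* T_DIR = 1 (in), BYTE_BLOCK = 1, T_LENGTH = 2 */
    cdb[5]  = count >> 8;        /* sector count 15:8 */
    cdb[6]  = count & 0xFF;      /* sector count 7:0 */
    cdb[7]  = (lba >> 24) & 0xFF;
    cdb[8]  = lba & 0xFF;
    cdb[9]  = (lba >> 32) & 0xFF;
    cdb[10] = (lba >> 8) & 0xFF;
    cdb[11] = (lba >> 40) & 0xFF;
    cdb[12] = (lba >> 16) & 0xFF;
    cdb[13] = 0x40;              /* device register: LBA mode */
    cdb[14] = ATA_READ_DMA_EXT;
}
```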
My application on a PC sends a file (2 MB) in chunks of 1 KB to an embedded device.
I use the FTDI Windows driver with the classic FT_Write() API function, as my code is cross-platform.
Note: the issues below appear when I use a 1 KB chunk size. A smaller chunk (I tried 64 bytes) works fine.
The problem is that the function returns "0 bytes sent" every couple of hundred packets and gets stuck. I found a workaround: purging both TX and RX followed by a ResetDevice() call recovers the chip. It still happens every couple of hundred packets, but at least I can send the whole file (2 MB).
But when I use a USB isolator (http://www.bb-elec.com/Products/USB-Connectivity/USB-Isolators/Compact-USB-Port-Guardian.aspx),
the workaround fails.
In any case, I believe my workaround is not a graceful solution.
Note: I use a large chunk because of the following suggestion in an FTDI application note:
When writing data to an FTDI device, as much data as possible should
be buffered in the application and written to the device in a single
write function call (either WriteFile for a VCP application using the
Win32 API, FT_Write if using the D2XX classic interface or
FT_WriteFile if using the D2XX FT_W32 interface). The result of this
is that the data will be written to the device with 64 bytes per USB
packet.
Any idea what the proper fix for these issues is? Is it related to FTDI initialization? My driver version is 2.12.16.0 (3/9/2016).
I also saw the same problem of FT_Write() not working right when too much data was passed, while working on the library for my USB device Nusbio.
I mostly work in synchronous bit-bang mode rather than UART, but after all it is the same hardware, driver and API.
There is the USB 2.0 specification and the FTDI FT232RL specification, and then there is the reality of electrons and bits. The expected transfer speeds never really match, at least at first. In other words, it is complicated (see more below in my referenced blog post).
In 2015 I was under the impression that with the FTDI FT232RL chip a size of 384 bytes worked well; the number comes from the chip datasheet (128-byte receive buffer and 256-byte transmit buffer).
A size of 500 bytes would still work, but above 600 bytes things would not work.
I later used the FT231X chip, which has a larger buffer (1 KB: 512-byte receive buffer and 512-byte transmit buffer),
and was able to transfer 1 KB and 2 KB buffers of data with FT_Write(), more than doubling my transfer speed.
But above 2 KB things would not work.
In 2016, after reading everything I could find about the FTDI USB 2.0 full-speed chips, I came to the conclusion that FT_Write should support up to 64 KB (see the datasheets for the FT232RL, FT231X, FT232H, FT260 and FT4222).
I also did some research on serial-port communication faster than 115200 baud from .NET.
Somehow I was able to update my C# library to send data in 32 KB buffers with FT_Write(), and it works with the FT232RL and FT231X chips, but I can't tell you what changed.
I was probably not completely understanding the ins and outs of the USB 2.0 full-speed FTDI technology.
For example, let's say you are using the FT232RL and transferring 384 bytes at a time with FT_Write(). Knowing that there is at least 1 millisecond of latency in USB 2.0 full speed whatever you do, from a USB point of view you are transferring 384*1000/1024, that is 375 KB/s in theory (that would be the max). That said, what baud rate does your embedded device support?
What baud rate are you using?
The FT232RL max baud rate is 900,000 baud, which would give you only 900000/(1+8+1) = 90,000 bytes/s, about 88 KB/s.
Right away you can tell there is going to be a problem; maybe the FTDI driver takes care of it, maybe not. I can't tell.
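The arithmetic above as two small helpers, assuming 8N1 framing (1 start bit, 8 data bits, 1 stop bit, so 10 bits on the wire per byte) and one chunk per 1 ms full-speed frame, as in the example. Both are simplifications, and the function names are illustrative only.

```c
#include <assert.h>

/* Upper bound on USB-side throughput if one chunk of chunk_bytes is
 * written per 1 ms full-speed frame, in bytes per second. */
static double usb_bytes_per_sec(unsigned chunk_bytes)
{
    return chunk_bytes * 1000.0;
}

/* UART-side throughput for 8N1 framing: 10 bits per byte on the wire. */
static double uart_bytes_per_sec(unsigned baud)
{
    assert(baud > 0);
    return baud / 10.0;
}
```

With the numbers above, 384 bytes per millisecond is 384,000 bytes/s on the USB side against 90,000 bytes/s on a 900,000-baud UART, a ratio of roughly 4 to 1.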
Redo the math based on the baud rate supported by your embedded device and a 384-byte buffer sent 1000 times per second, then slow down your USB writes with a sleep() to match your baud rate.
That is where I would start.
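A sketch of that approach: write in fixed-size chunks, resend the remainder when the driver reports fewer bytes written than requested, and sleep long enough for each chunk to drain at the UART baud rate. The `write_fn` callback is hypothetical and stands in for FT_Write(), which reports the count actually written through its last argument; the 8N1 framing and the use of POSIX nanosleep() are assumptions as well (on Windows you would use Sleep()). The in-memory `mem_sink` writer is only there to try the loop without hardware.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical writer: tries to send len bytes, stores how many were
 * accepted in *written, returns 0 on success, nonzero on error. */
typedef int (*write_fn)(void *ctx, const unsigned char *buf,
                        size_t len, size_t *written);

/* Microseconds a chunk needs on the wire at 8N1 (10 bits/byte), rounded up. */
static unsigned long chunk_drain_us(size_t chunk, unsigned long baud)
{
    assert(baud > 0);
    return (unsigned long)((chunk * 10ULL * 1000000ULL + baud - 1) / baud);
}

/* Send len bytes in chunks of at most chunk bytes, pacing to the baud
 * rate. Returns 0 on success, -1 on an error or a write with no progress. */
static int paced_write(write_fn w, void *ctx, const unsigned char *buf,
                       size_t len, size_t chunk, unsigned long baud)
{
    assert(chunk > 0);
    while (len > 0) {
        size_t n = len < chunk ? len : chunk;
        size_t written = 0;
        unsigned long us;
        struct timespec ts;

        if (w(ctx, buf, n, &written) != 0 || written == 0)
            return -1;
        us = chunk_drain_us(written, baud);
        ts.tv_sec = (time_t)(us / 1000000UL);
        ts.tv_nsec = (long)(us % 1000000UL) * 1000L;
        nanosleep(&ts, NULL);   /* let the UART drain before the next chunk */
        buf += written;         /* a short write just resends the remainder */
        len -= written;
    }
    return 0;
}

/* Test stand-in: accepts at most max_per_call bytes per call. */
struct mem_sink {
    unsigned char data[4096];
    size_t used;
    size_t max_per_call;
};

static int mem_sink_write(void *ctx, const unsigned char *buf,
                          size_t len, size_t *written)
{
    struct mem_sink *s = ctx;
    size_t n = len < s->max_per_call ? len : s->max_per_call;

    if (n > sizeof s->data - s->used)
        n = sizeof s->data - s->used;
    memcpy(s->data + s->used, buf, n);
    s->used += n;
    *written = n;
    return 0;
}
```

For a 384-byte chunk at 900,000 baud, chunk_drain_us() gives 4267 µs, so the loop would issue roughly 234 chunks per second instead of 1000.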
I'm using Ångström embedded Linux with kernel v2.6.37, based on the Technexion distribution.
DM3730 SoC, TDM3730 module, custom baseboard.
CodeSourcery toolchain v. 2010-09.50
Here is the dataflow in my system:
http://i.stack.imgur.com/kPhKw.png
The FPGA generates incrementing data, and the kernel reads it via GPMC DMA. GPMC pack size = 512 data samples. Buffer size = 61440 32-bit samples (= 60 RAM pages).
The DMA buffer is allocated with dma_alloc_coherent() and mapped to userspace with an mmap() call. The user application reads data directly from the DMA buffer and saves it to NAND using fwrite(). The user reads data 4096 samples at a time.
And what do I see in my file? http://i.stack.imgur.com/etzo0.png
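For reference, a sketch of the consumer side as described: the application reads the mmap()ed ring in 4096-sample blocks. The buffer and block sizes come from the question; how the application learns the DMA write position (`dma_pos` here) is a hypothetical stand-in, since the question does not say, and `read_block` is an illustrative name. Because 61440 = 15 × 4096, a block that starts on a block boundary never straddles the end of the ring, so no wrap-around split is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SAMPLES  61440u   /* 60 pages of 32-bit samples */
#define BLOCK_SAMPLES 4096u    /* user read size */

/* Copy the next BLOCK_SAMPLES samples from the ring into dst if the DMA
 * has produced them. rd and dma_pos are monotonically increasing sample
 * counters. Returns 1 if a block was copied, 0 if not enough data is
 * available yet, -1 if the producer lapped the reader (data lost). */
static int read_block(const volatile uint32_t *ring, uint64_t *rd,
                      uint64_t dma_pos, uint32_t *dst)
{
    size_t off;

    assert(dma_pos >= *rd && *rd % BLOCK_SAMPLES == 0);
    if (dma_pos - *rd > RING_SAMPLES)
        return -1;
    if (dma_pos - *rd < BLOCK_SAMPLES)
        return 0;
    off = (size_t)(*rd % RING_SAMPLES);
    for (size_t i = 0; i < BLOCK_SAMPLES; i++)
        dst[i] = ring[off + i];   /* block stays inside the ring: no wrap */
    *rd += BLOCK_SAMPLES;
    return 1;
}
```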
The red line marks the first border (wrap-around point) of the ring buffer. Oops! Small packs (~16 samples) of stale data start to appear after the border. Their values are exactly the "old" values of the corresponding buffer positions. But WHY? 16 samples is much smaller than both the DMA pack size and the user read size, so there cannot be a pointer mismatch.
I guess there is some mmap() feature hiding somewhere. I have tried different flags for mmap(), such as MAP_LOCKED, MAP_POPULATE and MAP_NONBLOCK, with no success. I completely misunderstand this behaviour :(
P.S. When I'm using copy_to_user() from the kernel instead of mmap() and zero-copy access, there is no such behaviour.
I have a block device driver which is working, after a fashion. It is for a PCIe device, and I am handling the bios directly with a make_request_fn rather than use a request queue, as the device has no seek time. However, it still has transaction overhead.
When I read consecutively from the device, I get bios with many segments (generally my maximum of 32), each consisting of 2 hardware sectors (2 × 2 KB), and this is then handled as one scatter-gather transaction to the device, saving a lot of signaling overhead. However, on a write, each bio has just one segment of 2 sectors, so the operations take a lot longer in total. What I would like is to somehow make the incoming bios consist of many segments, or to merge bios sensibly together myself. What is the right approach here?
The current content of the make_request_fn is something along the lines of:
Determine read/write of the bio
For each segment in the bio, make an entry in a scatterlist* with sg_set_page
Map this scatterlist to PCI with pci_map_sg
For every segment in the scatterlist, add to a device-specific structure defining a multiple-segment DMA scatter-gather operation
Map that structure to DMA
Carry out transaction
Unmap structure and SG DMA
Call bio_endio with -EIO if failed and 0 if succeeded.
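The step that fills the device-specific scatter-gather structure could look roughly like the following, written as plain C so it can be tried outside the kernel. The descriptor layout (`struct mydev_sg_desc`), the `mapped_seg` input type and the merging of bus-contiguous segments are hypothetical; in the driver the input would come from sg_dma_address()/sg_dma_len() on the list mapped by pci_map_sg(). The sketch assumes the hardware has no per-descriptor length limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MYDEV_BLOCK_MAX_SEGS 32

/* Hypothetical hardware descriptor entry. */
struct mydev_sg_desc {
    uint64_t dma_addr;
    uint32_t len;
};

/* A DMA-mapped segment as it would come out of the mapped scatterlist. */
struct mapped_seg {
    uint64_t dma_addr;
    uint32_t len;
};

/* Build the descriptor table from n mapped segments, merging segments
 * that are contiguous in bus address space. Returns the number of
 * descriptors used, or -1 if more than MYDEV_BLOCK_MAX_SEGS are needed. */
static int build_desc_table(const struct mapped_seg *segs, size_t n,
                            struct mydev_sg_desc *tbl)
{
    int used = 0;

    for (size_t i = 0; i < n; i++) {
        assert(segs[i].len > 0);
        if (used > 0 &&
            tbl[used - 1].dma_addr + tbl[used - 1].len == segs[i].dma_addr) {
            tbl[used - 1].len += segs[i].len;   /* extend previous entry */
            continue;
        }
        if (used == MYDEV_BLOCK_MAX_SEGS)
            return -1;
        tbl[used].dma_addr = segs[i].dma_addr;
        tbl[used].len = segs[i].len;
        used++;
    }
    return used;
}
```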
The request queue is set up like:
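#define MYDEV_BLOCK_MAX_SEGS 32
#define MYDEV_SECTOR_SIZE 2048
/* Bios go straight to mydev_make_req; no request queue processing. */
blk_queue_make_request(mydev->queue, mydev_make_req);
/* Non-rotational device: no seek time. */
set_bit(QUEUE_FLAG_NONROT, &mydev->queue->queue_flags);
/* At most 32 scatter-gather segments per bio. */
blk_queue_max_segments(mydev->queue, MYDEV_BLOCK_MAX_SEGS);
blk_queue_physical_block_size(mydev->queue, MYDEV_SECTOR_SIZE);
blk_queue_logical_block_size(mydev->queue, MYDEV_SECTOR_SIZE);
/* Tell the block layer the device needs no FLUSH/FUA handling. */
blk_queue_flush(mydev->queue, 0);
/* Segments may cross any address boundary. */
blk_queue_segment_boundary(mydev->queue, -1UL);
/* Data buffers must be 8-byte aligned (alignment mask 0x7). */
blk_queue_dma_alignment(mydev->queue, 0x7);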
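If you merge bios yourself, the core decision is whether the next bio continues the previous one on the device. A minimal model of that decision, assuming writes are first collected into a small pending list (the `pending_io` type, `contiguous_run` and the batching itself are hypothetical; a real make_request_fn would also have to complete every merged bio with bio_endio()): consecutive writes whose start sector equals the previous end sector are grouped into one transaction of at most MYDEV_BLOCK_MAX_SEGS segments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MYDEV_BLOCK_MAX_SEGS 32

/* Hypothetical record of one pending single-segment write bio. */
struct pending_io {
    uint64_t sector;   /* start, in 512-byte kernel sectors */
    uint32_t nsect;    /* length, in 512-byte kernel sectors */
};

/* Starting at index start, count how many pending writes form one
 * contiguous run on the device, capped at MYDEV_BLOCK_MAX_SEGS (one
 * segment per bio). The caller would issue them as one scatter-gather
 * transaction and continue from start + the returned count. */
static size_t contiguous_run(const struct pending_io *io, size_t n,
                             size_t start)
{
    size_t count = 1;

    assert(start < n);
    while (start + count < n && count < MYDEV_BLOCK_MAX_SEGS) {
        const struct pending_io *prev = &io[start + count - 1];
        if (prev->sector + prev->nsect != io[start + count].sector)
            break;              /* gap on the device: start a new transaction */
        count++;
    }
    return count;
}
```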