AXI4 delay transactions - VHDL

I am just looking for advice. I currently have a custom IP implemented in VHDL which has an AXI4 slave input and an AXI4 master output, and currently the signals are directly tied together.
I would like to add a configurable latency to the AXI signals, so that transactions are delayed for a particular amount of time as they pass through the IP rather than being connected straight through.
My question is: can I delay read and write transactions through the IP merely through the use of the AxVALID and AxREADY (and maybe the RVALID/RREADY and WVALID/WREADY) signals?
If, for instance, I wanted a 20-clock-cycle delay, could I wait for the external master to assert VALID, then wait 20 clocks before having the IP's slave interface assert READY? Is this logic correct?
Thanks in advance for any advice.

Yes, that can be done. Depending on your infrastructure, though, withholding READY like this can cause bus congestion upstream. To avoid stalling the master, you can also insert a FIFO to buffer the delayed transactions.
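As a minimal sketch of that handshake-delay idea, assuming one outstanding transaction per channel and illustrative names (none of these come from the actual IP), a delay stage for a single AXI channel might look like this:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Hypothetical delay stage for one AXI handshake channel. It withholds
    -- READY/VALID for DELAY cycles after the upstream VALID is first seen.
    entity axi_handshake_delay is
      generic (DELAY : natural := 20);   -- 8-bit counter below: max 255
      port (
        clk     : in  std_logic;
        rst_n   : in  std_logic;
        s_valid : in  std_logic;   -- from the external master
        s_ready : out std_logic;   -- back to the external master
        m_valid : out std_logic;   -- to the downstream slave
        m_ready : in  std_logic    -- from the downstream slave
      );
    end entity;

    architecture rtl of axi_handshake_delay is
      signal count : unsigned(7 downto 0) := (others => '0');
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if rst_n = '0' then
            count <= (others => '0');
          elsif s_valid = '1' and count < DELAY then
            count <= count + 1;        -- count cycles while VALID is held
          elsif s_valid = '1' and m_ready = '1' and count = DELAY then
            count <= (others => '0');  -- handshake done, rearm for next beat
          end if;
        end if;
      end process;

      -- Pass the handshake through only once the delay has elapsed; the
      -- address/data payload can stay wired straight through as today.
      m_valid <= s_valid when count = DELAY else '0';
      s_ready <= m_ready when count = DELAY else '0';
    end architecture;

Note this only delays each channel's handshake; a FIFO as suggested above would let the master keep issuing transactions while earlier ones sit in the delay stage.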

Related

Configuring the stm32f3discovery, with (or without) FreeRTOS, for short delays (µs)

I'm a newcomer here. I come to you because for my student project I need to establish communication between the stm32f3 and the DHT11 sensor.
The communication is very timing-specific, and I need good control over time.
But I have never worked at the microsecond level and I don't know how to do it.
Could someone help me, please?
The communication interface the DHT11 uses is OneWire, which is a standardized interface. The microsecond delays it needs can be implemented using one of the MCU's timers: set up the prescaler to divide the clock down to 1 MHz (for 1 µs resolution), load the delay value in µs into the period register, and start the timer. Then you just wait for the timer update event. For the whole OneWire communication you can port one of the many libraries available on the web.

Bus arbitration on CAN bus

Hello, I have a question concerning communication/arbitration on a CAN bus.
Say more than one master on the CAN bus wants to send simultaneously. The one with the lowest message identifier will win arbitration in the end and start to send its payload; the others lose arbitration, switch to receive mode, and wait until the bus is free again.
Now my question:
Do the masters that lost arbitration in the previous attempt immediately arbitrate for the bus again (i.e. as soon as the bus is free)? Do they wait for their next activation cycle as defined in the CAN matrix? Or can that be defined individually in the CAN matrix?
Thanks in advance,
Florian
I don't know what you mean by this "CAN matrix", but yes: as soon as the bus is idle, the nodes are allowed to try again to get on the bus by starting the arbitration process, sending the Start of Frame bit and the CAN ID.
CAN does not know masters or slaves; it is called a multi-master system, and every node has the same rights on the bus. Higher-layer CAN protocols like CANopen define a master role for some kinds of network management.
I kind of found the answer here:
CAN bus arbitration backoff time
It's written there that the nodes are free to arbitrate again after the frame of the "arbitration winner" has been sent. Does this mean that this decision is coded in the CAN matrix?

Space efficient data bus implementations [closed]

I'm writing a microcontroller in VHDL and have essentially got the core of the actual microcontroller section done. I'm now getting to the point, however, of starting to include memory-mapped peripherals. I'm using a very simple bus consisting of a single master (the CPU) and multiple slaves (the peripherals/RAM). My bus works through an acknowledge CPU->perip and an acknowledge perip->CPU. The CPU also has separate input and output data buses to avoid tristates.
I've chosen this method because I want peripherals to be able to stall the CPU. A bus transaction proceeds as follows: the master places the data, address and read/write bit on the bus, bringing its ack (c->p) high. Once the slave has successfully received the information and placed its response on the data (p->c) bus, the slave sets its ack (p->c) high. The master notes that the slave has placed the data, takes the data for processing and releases its ack (c->p). The bus is now in the idle state again, ready for further transactions.
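As a minimal sketch of that four-phase handshake from the master's side (all names here are illustrative assumptions, not from the actual design):

    library ieee;
    use ieee.std_logic_1164.all;

    -- Hypothetical master-side controller for the handshake described
    -- above: raise ack(c->p), wait for ack(p->c), capture the response,
    -- release, then wait for the slave to release in turn.
    entity bus_master_handshake is
      port (
        clk     : in  std_logic;
        start   : in  std_logic;                      -- begin a transaction
        ack_pc  : in  std_logic;                      -- ack from peripheral
        data_pc : in  std_logic_vector(15 downto 0);  -- response data
        ack_cp  : out std_logic;                      -- ack to peripheral
        rdata   : out std_logic_vector(15 downto 0);  -- captured response
        done    : out std_logic
      );
    end entity;

    architecture rtl of bus_master_handshake is
      type state_t is (IDLE, REQUEST, RELEASE);
      signal state : state_t := IDLE;
      signal ack_r : std_logic := '0';
    begin
      ack_cp <= ack_r;

      process (clk)
      begin
        if rising_edge(clk) then
          done <= '0';
          case state is
            when IDLE =>
              if start = '1' then      -- address/data assumed already driven
                ack_r <= '1';
                state <= REQUEST;
              end if;
            when REQUEST =>
              if ack_pc = '1' then     -- slave has placed its response
                rdata <= data_pc;      -- take the data for processing
                ack_r <= '0';          -- release our ack
                state <= RELEASE;
              end if;
            when RELEASE =>
              if ack_pc = '0' then     -- slave released too: bus idle again
                done  <= '1';
                state <= IDLE;
              end if;
          end case;
        end if;
      end process;
    end architecture;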
Obviously this is a very simple bus protocol and doesn't include burst features, variable word sizes or other more complex features. My question, however, is: what space-efficient methods can be used to connect peripherals to a master CPU?
I've looked into three different methods so far. I'm currently using a single output data bus from the master to all of the peripherals, with the data outputs from all the peripherals being OR'd together, along with their ack (p->c) outputs. Each peripheral contains a small address mux which only allows a slave to respond if the address is within a predefined range. This reduces the logic for switching between peripherals, but will obviously infer a lot of logic per peripheral for the address muxes, which leads me to believe that future scalability will be impacted.
Another method I thought of was having a single large address mux connected to the master which decodes the address and sends it, along with the data and ack signals, to each slave. The output data is then muxed back into the master. This seems a slightly more efficient method, though I always seem to end up with ridiculously long data vectors and it's a bit of a chore to keep track of.
A third method I thought of was to arrange things in a ring-like fashion. The master address goes to all of the slaves, with a smaller mux which merely chooses which ack signals to send out. The data output from the master then travels serially through each slave. Each slave contains a mux which can either let the data coming into it pass through unaffected OR allow the slave to place its own data on the bus. I feel this would work best for slow systems, as there is only one small mux per slave required to choose between the incoming data and that slave's own data, along with a small mux that decodes the address and sends out the ack signals. The issue here, I believe, is that with lots of peripherals the propagation delay from the output of the master back to the input of the master would be pretty large, as the data has to travel through every slave!
Could anybody give me suitable reasoning for the different methods? I'm using Quartus to synthesize and route for an Altera EP4CE10E22C8 FPGA and I'm looking for the smallest implementation in terms of FPGA LUTs. My system uses a 16-bit address and data bus. I'm looking to achieve at minimum ~50 MHz under ideal memory conditions (i.e. no wait states) and would be looking to have around 12 slaves, each with between 8 and 16 bits of addressable space.
Thanks!
I suggest that you download the AMBA specification from the ARM web site (http://www.arm.com/) and look at the AXI4-Lite bus or the much older APB bus. In most bus standards with a single master there is no multiplexer on the addresses, only an address decoder that drives the peripheral selection signals; it is only the response data from the slaves that is multiplexed back to the master, thanks to the "response valid" signals from the slaves. It is scalable if you pipeline it when the number of slaves increases and you can no longer reach your target clock frequency. The hardware cost is mainly due to the read-data multiplexing, that is, an N-bit P-to-one multiplexer.
This is almost your second option.
The first option is a variant of the second where the read-data multiplexers are replaced by OR gates. I do not think it will change the hardware cost much: OR gates are less complex than multiplexers, but each slave will now have to zero its read data bus, which adds as many AND gates. A good point is, maybe, reduced activity and thus lower power consumption: slaves that are not accessed by the master will keep their read data bus low. But as you synthesize all this with a logic synthesizer and place and route it with a CAD tool, I am almost sure that you will end up with the same results (area, power, frequency) as for the more classical second option.
Your third option reminds me of the principles of the daisy chain or the token ring. But as you want to avoid tri-states, I doubt it will bring any benefit in terms of hardware cost. If you pipeline it correctly (each slave samples the incoming master requests and either processes them or passes them to the next) you will probably reach higher clock frequencies than with the classical bus, especially with a large number of slaves, but as a complete transaction will on average take more clock cycles, you will not improve the performance either.
For really small (but slow) interconnection networks you could also have a look at the Serial Peripheral Interface (SPI) protocols. This is what they are made for: drive several slaves from a single master with few wires.
Considering your target hardware (Altera Cyclone IV), your target clock frequency (50 MHz) and your other specifications, I would first try the classical bus. The address decoder will produce one select signal for each of your 12 slaves, based on the 8 most significant bits of your 16-bit address bus; its cost will be negligible. Apart from these individual select signals, all slaves will receive all the other signals (address bus, write data bus, read enable, write enable(s)). The 16-bit read data bus of your master will be the output of a 16-bit 12-to-1 multiplexer that selects one slave response among the 12. This will be the part that consumes most of the resources of your interconnect, but it should be OK and run at 50 MHz without problem... if you avoid combinatorial paths between master requests and slave responses.
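A minimal sketch of that decoder and read-data multiplexer (the entity, port names and address map here are illustrative assumptions, not from your design):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Hypothetical single-master interconnect for 12 slaves with 16-bit
    -- buses. Slave i is selected when the top 8 address bits equal i.
    entity simple_interconnect is
      port (
        addr      : in  std_logic_vector(15 downto 0);
        sel       : out std_logic_vector(11 downto 0);         -- one per slave
        rdata_in  : in  std_logic_vector(12*16 - 1 downto 0);  -- slave read buses
        rdata_out : out std_logic_vector(15 downto 0)          -- back to master
      );
    end entity;

    architecture rtl of simple_interconnect is
      signal slave_idx : integer range 0 to 255;
    begin
      slave_idx <= to_integer(unsigned(addr(15 downto 8)));

      -- Address decoder: one-hot select signals, negligible cost.
      gen_sel : for i in 0 to 11 generate
        sel(i) <= '1' when slave_idx = i else '0';
      end generate;

      -- Read-data multiplexer: this is where most of the LUTs go.
      mux : process (slave_idx, rdata_in)
      begin
        rdata_out <= (others => '0');
        for i in 0 to 11 loop
          if slave_idx = i then
            rdata_out <= rdata_in(16*i + 15 downto 16*i);
          end if;
        end loop;
      end process;
    end architecture;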
A good starting point is the WISHBONE SoC Interconnect from OpenCores.org. The classic read and write cycles are easy to implement; beyond that, burst transfers are also specified for high throughput, and much more. The website also hosts a lot of WISHBONE-compatible projects providing a wide range of I/O devices.
And last but not least, the WISHBONE standard is in the public domain.

Asynchronous asymmetric FIFO in VHDL synthesis issue

I have designed an asynchronous asymmetric FIFO in VHDL. It is a generic FIFO with depth and prog_full as parameters. It has a 32-bit input and a 16-bit output data width.
You can find the FIFO design link here.
The top-level asymmetric FIFO (fifo_wrapper.vhd) is built upon a 32-bit asynchronous FIFO (async_fifo.vhd). This internal FIFO (async_fifo) is built using the logic from the generic FIFO on OpenCores (http://opencores.org/project,generic_fifos). I have added a simple testbench to try out this FIFO design.
BUT there is some issue with this design that I am not able to figure out. The FIFO design works perfectly fine when I simulate it, but when I synthesize it and run it along with the rest of my design on hardware, I sometimes get erroneous data. Maybe there is some corner case that I am not able to simulate, or is it something else?
That's why I would like anyone who needs this design to try it and let me know if they encounter any issues during simulation or after synthesis.
Thanks.
PS: kindly let me know if there is some other forum where I can put my design for public use. Thanks.
There are a number of issues to point out in relation to this asynchronous FIFO design, based on the assumption that the write and read clocks are fully asynchronous.
A (and probably THE) major problem is that the write-side pointer (wp in async_fifo), which is a normal binary counter, is transferred and synchronized to the read-side clock without any Gray encoding. The different bits in the vector may therefore arrive at different times in the read clock domain, so the received write pointer value can (and most likely will, from time to time) differ from the write-side value, and the comparison with the read pointer (rp) will then make no sense. Binary values that are transferred across clock domains should be Gray encoded before transfer and decoded on arrival. Also, use synchronization with two flip-flop levels.
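A minimal sketch of that Gray encode/synchronize/decode path (generic width and illustrative names, not taken from the posted code):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Hypothetical pointer synchronizer: Gray-encode in the source
    -- domain, double-register in the destination domain, decode back.
    entity gray_sync is
      generic (W : natural := 4);
      port (
        src_clk : in  std_logic;
        dst_clk : in  std_logic;
        src_bin : in  unsigned(W-1 downto 0);  -- pointer, source domain
        dst_bin : out unsigned(W-1 downto 0)   -- pointer, destination domain
      );
    end entity;

    architecture rtl of gray_sync is
      signal src_gray, sync1, sync2 : unsigned(W-1 downto 0) := (others => '0');

      function gray2bin (g : unsigned) return unsigned is
        variable b : unsigned(g'range);
      begin
        b(g'high) := g(g'high);
        for i in g'high - 1 downto g'low loop
          b(i) := b(i + 1) xor g(i);  -- each bit folds in the one above
        end loop;
        return b;
      end function;
    begin
      -- Register the Gray-encoded pointer in the source domain, so that
      -- only one bit changes per increment.
      process (src_clk)
      begin
        if rising_edge(src_clk) then
          src_gray <= src_bin xor shift_right(src_bin, 1);
        end if;
      end process;

      -- Two flip-flop levels in the destination domain.
      process (dst_clk)
      begin
        if rising_edge(dst_clk) then
          sync1 <= src_gray;
          sync2 <= sync1;
        end if;
      end process;

      dst_bin <= gray2bin(sync2);
    end architecture;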
The two clocks (rd_clk and wr_clk) are assumed to be asynchronous, but there is only a single reset (rst), so timing may be violated when reset is deasserted, unless there are some additional requirements on the clocks at the time of reset deassertion.
A similar problem exists with clear, where there is only one signal for use in two different clock domains.
A suggestion would be to use a port naming convention where the clock domain relationship of each port is clearly indicated in its name, like naming all the ports in the write clock domain wr_* (e.g. wr_clk_i, wr_clk_we_i, etc.) and all the ports in the read clock domain rd_*.
Reset is asserted low, so naming it rst_n would be nice.
I can't access your code (firewall), so I'll just mention the general points of designing these, which might be of help to you and others.
To be completely clock-safe, the write side should exchange its pointer with the read side using a fully safe asynchronous handshaking method, built from two metastability signalling chains (sketched in code below).
The construct for this is a double-buffered register.
The write side registers its write pointer into a buffer, and asserts a valid signal high.
A meta-stability chain reclocks the buffer valid signal to the read clock domain
On the read clock side, once a transition to high of valid is seen at the output of the meta chain, the data in the write-side buffer is re-registered into another register in the read domain. This is OK because the data in the buffer is known to be stable (thanks to the meta chain).
The read domain asserts an ack signal high.
Another meta-stability chain reclocks the ack signal to the write clock domain.
The write side awaits a transition of the ack signal at the output of the meta chain, once seen it deasserts its valid signal.
The read side awaits a transition of the valid signal at the output of the meta chain to low, once seen it deasserts its ack signal.
The write side awaits a transition of the ack signal at the output of the meta chain to low. The cycle is now complete.
The current write pointer, which may have moved on quite a bit now, may now be transferred again.
A similar approach is taken for transferring the read pointer to the write domain.
It can be seen that although this approach introduces a latency between the write pointer as seen on the write and read sides, and the read pointer as seen on the read and write sides, this latency can never lead to overflow. Instead it leads to a premature full on the write side and a premature empty on the read side, which resolve once the pointers are next exchanged.
This approach is the only completely clock-safe design for a FIFO that doesn't depend on a priori knowledge of the clock speeds. Gray coding is not required at all.
The other thing to note is that the logic for addressing / empty / full etc. needs to be duplicated in each clock domain.
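A minimal sketch of the write-to-read pointer exchange described above (generic width and illustrative names, not from the posted design; the read-to-write direction is symmetrical):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Hypothetical four-phase handshake transferring the write pointer
    -- to the read clock domain via a double-buffered register.
    entity ptr_handshake is
      generic (W : natural := 4);
      port (
        wr_clk : in  std_logic;
        rd_clk : in  std_logic;
        wr_ptr : in  unsigned(W-1 downto 0);   -- live write pointer
        rd_wp  : out unsigned(W-1 downto 0)    -- write pointer, read domain
      );
    end entity;

    architecture rtl of ptr_handshake is
      signal buf                : unsigned(W-1 downto 0) := (others => '0');
      signal valid, ack         : std_logic := '0';
      signal valid_m1, valid_m2 : std_logic := '0';  -- meta chain, rd domain
      signal ack_m1, ack_m2     : std_logic := '0';  -- meta chain, wr domain
    begin
      -- Write side: buffer the pointer, raise valid, wait for ack, drop
      -- valid, wait for ack to drop; then the cycle may start again.
      process (wr_clk)
      begin
        if rising_edge(wr_clk) then
          ack_m1 <= ack;                 -- 2-FF synchronizer for ack
          ack_m2 <= ack_m1;
          if valid = '0' and ack_m2 = '0' then
            buf   <= wr_ptr;             -- stable snapshot of the pointer
            valid <= '1';
          elsif valid = '1' and ack_m2 = '1' then
            valid <= '0';
          end if;
        end if;
      end process;

      -- Read side: when the synchronized valid goes high, buf is known
      -- to be stable, so it can be re-registered safely.
      process (rd_clk)
      begin
        if rising_edge(rd_clk) then
          valid_m1 <= valid;             -- 2-FF synchronizer for valid
          valid_m2 <= valid_m1;
          if valid_m2 = '1' and ack = '0' then
            rd_wp <= buf;
            ack   <= '1';
          elsif valid_m2 = '0' and ack = '1' then
            ack <= '0';
          end if;
        end if;
      end process;
    end architecture;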

pic32 uart issue

We have a system with a group of PIC32 MCUs on a shared UART bus, plus a couple of I/Os used as handshakes akin to chip select: one master, multiple slaves. The transmit from master to slave is direct and works well. The response signal from slave to master goes through a 1 k resistor with a 10 k pullup on the master side. Each slave must disable its UART when not selected so as not to disturb the other slaves' transmissions. The master is always active and allows a 400 µs delay between two slave communication sequences. Transmissions are made in 4-byte chunks.
One out of five systems we build has an issue where the start bit from one of the slaves is incomplete: a glitch of about 1/4 bit width. When this happens, the master fails to recognise the chunk and times out the transmission. So far we have worked around the problem by swapping the faulty MCU, but that is a development-time fix, not good for production.
Has anyone seen something similar? What could the issue be?
We are using the PIC32MX320F064H-80 for both master and slave devices.
Thank you.
A 1:10 ratio between the resistors can be dodgy, and the low level may not be well recognized by the master.
For your circuit, I assume the 1 k resistor is there to protect the slaves if two manage to get enabled at the same time. For that purpose 120 Ohms is enough at 3.3 V (two such resistors in series give about 14 mA of short-circuit current).
On a previous project I found that 10 k pullups tend to be weak, depending on the fan-in (the number of slaves, in your case). I would suggest you reduce it to 4.7 k.
With those values the ratio is now 120/4700 ≈ 0.025, so the low level will be solid.
Either the selected slave is not enabled soon enough before its transmission,
- or -
the previously selected slave is not disabled soon enough.
Knowing the baud rate would help, as that would put the "400 µs" into perspective.
