Bus arbitration on CAN bus - matrix

Hello, I have a question concerning communication/arbitration on a CAN bus.
Say more than one master on the CAN bus wants to send at the same time. The one with the lowest message identifier wins arbitration and starts to send its payload. The others lose arbitration, switch to receive mode and wait until the bus is free again.
Now my question:
Do the masters that lost arbitration in the previous attempt immediately arbitrate for the bus again (i.e. as soon as the bus is free)? Do they wait for their next activation cycle as defined in the CAN matrix? Or can that be defined individually in the CAN matrix?
Thanks in advance,

I don't know what you mean by "CAN matrix", but yes: as soon as the bus is idle, the nodes are allowed to try again to get on the bus by starting the arbitration process, i.e. by sending the Start of Frame bit and the CAN ID.
CAN has no masters or slaves at the protocol level. It is a multi-master system: every node has the same rights on the bus. Higher-layer CAN protocols such as CANopen define a master role for network management.
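To illustrate the retry behaviour described above, here is a minimal sketch in Python. It assumes 11-bit identifiers, unique IDs per frame, and that every node with a pending frame starts arbitration at the same instant whenever the bus goes idle. The function names are made up for the illustration and are not part of any CAN library.

```python
def arbitrate(ids, id_bits=11):
    """Return the identifier that wins bitwise arbitration.

    The bus behaves like a wired-AND: a dominant bit (0) overrides a
    recessive bit (1). A node that sends recessive but reads dominant
    has lost and stops transmitting.
    """
    contenders = set(ids)
    for bit in range(id_bits - 1, -1, -1):  # identifier is sent MSB first
        bus_level = min((i >> bit) & 1 for i in contenders)
        contenders = {i for i in contenders if (i >> bit) & 1 == bus_level}
    (winner,) = contenders  # exactly one is left because IDs are unique
    return winner


def drain(pending_ids):
    """Order in which frames reach the bus if all losers retry at every idle."""
    pending = set(pending_ids)
    order = []
    while pending:
        winner = arbitrate(pending)
        order.append(winner)
        pending.remove(winner)
    return order


print([hex(i) for i in drain([0x123, 0x0A5, 0x7FF, 0x124])])
# → ['0xa5', '0x123', '0x124', '0x7ff']
```

The losers do not wait for any schedule here: they simply take part in the next arbitration round, and the lowest remaining identifier wins each time.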

I kind of found the answer here:
CAN bus arbitration backoff time
It's written that the masters are free to arbitrate again after the frame of the "arbitration winner" was sent. Does this mean that this decision is coded in the CAN matrix?


AXI4 delay transactions

I am just looking for advice. I currently have a custom IP written in VHDL which has an AXI4 slave input and an AXI4 master output, and currently the signals are directly tied together.
I would like to add a configurable latency to the AXI signals, so that they can be delayed for a particular amount of time on their way through the IP, rather than being connected straight through.
My question is: can I delay read and write transactions through the IP merely through the use of the AxVALID and AxREADY (and maybe the RVALID/RREADY and WVALID/WREADY) signals?
If, for instance, I wanted a 20 clock cycle delay, could I wait for an external master to assert VALID, and then wait 20 clocks before having the IP slave assert READY? Is this logic correct?
Thanks in advance for any advice.
Yes, that can be done. Depending on your infrastructure, it can cause bus congestion, because the upstream master is stalled on that channel for the whole delay. Alternatively, you could insert a FIFO to buffer the delayed bus transactions.
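The handshake timing can be checked with a small behavioural model. This is only a sketch in Python, not VHDL and not a full AXI model: it looks at a single channel, assumes the master keeps VALID high until the handshake (as AXI requires), and relies on the fact that AXI permits a slave to wait for VALID before asserting READY. The function name and parameters are invented for the example.

```python
def handshake_cycle(valid_from, delay, max_cycles=1000):
    """Return the clock cycle in which VALID and READY are both high.

    valid_from: first cycle in which the master drives VALID high.
    delay:      cycles the slave waits after first seeing VALID
                before it raises READY.
    """
    waited = 0
    for cycle in range(max_cycles):
        valid = cycle >= valid_from
        ready = valid and waited >= delay  # READY stays low while counting
        if valid and ready:
            return cycle                   # transfer happens in this cycle
        if valid:
            waited += 1
    raise RuntimeError("no handshake within max_cycles")


print(handshake_cycle(valid_from=3, delay=20))  # → 23
```

Each transfer on that channel then occupies delay + 1 cycles, which is where the congestion mentioned above comes from.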

Space efficient data bus implementations [closed]

I'm writing a microcontroller in VHDL and have essentially got a core for my actual microcontroller section down. I'm now getting to the point however of starting to include memory mapped peripherals. I'm using a very simple bus consisting of a single master (the CPU) and multiple slaves (the peripherals/RAM). My bus works through an acknowledge CPU->perip and acknowledge perip->CPU. The CPU also has separate input and output data buses to avoid tristates.
I've chosen this method as I wish to have the ability for peripherals to stall the CPU. A bus transaction is achieved by: The master places the data, address and read/write bit on the bus, bringing the ack(c->p) high. Once the slave has successfully received the information and has placed the response back on the data (p->c) bus, the slave sets its ack(p->c) high. The master notes the slave has successfully placed the data, takes the data for processing and releases the ack(c->p). The bus is now in idle state again, ready for further transactions.
Obviously this is a very simple bus protocol and doesn't include burst features, variable word sizes or other more complex features. My question however is what space efficient methods can be used to connect peripherals to a master CPU?
I've looked into 3 different methods as of yet. I'm currently using a single output data bus from the master to all of the peripherals, with the data outputs from all the peripherals being or'd, along with their ack(p->c) outputs. Each peripheral contains a small address mux which only allows a slave to respond if the address is within a predefined range. This reduces the logic for switching between peripherals but obviously will infer lots of logic/peripheral for the address muxes which leads me to believe that future scalability will be impacted.
Another method I thought of was having a single large address mux connected to the master which decodes the address and sends it, along with the data and ack signals, to each slave. The output data is then muxed back into the master. This seems a slightly more efficient method, though I always seem to end up with ridiculously long data vectors and it's a bit of a chore to keep track of.
A third method I thought of was to have it arranged in a ring like fashion. The master address goes to all of the slaves, with a smaller mux which merely chooses which ack signals to send out. The data output from the master then travels serially through each slave. Each slave contains a mux which can allow it to either let the data coming into it pass through unaffected OR to allow the slave to place its own data on the bus. I feel this will work best for slow systems as there is only one small mux/slave required to mux between the incoming data and that slave's data, along with a small mux that decodes the address and sends out the ack signals. The issue here I believe however is that with lots of peripherals, the propagation delay from the output of the master to the input of the master would be pretty large as it has to travel through each slave!
Could anybody give me suitable reasoning for the different methods? I'm using Quartus to synthesize and route for an Altera EP4CE10E22C8 FPGA and I'm looking for the smallest implementation with regards to FPGA LUTs. My system uses a 16bit address and data bus. I'm looking to achieve at minimum ~50MHz under ideal memory conditions (i.e no wait states) and would be looking to have around 12 slaves, each with between 8 and 16bits of addressable space.
I suggest that you download the AMBA specification from the ARM web site (http://www.arm.com/) and look at the AXI4-Lite bus or the much older APB bus. In most bus standards with a single master there is no multiplexer on the addresses, only an address decoder that drives the peripheral selection signals. Only the response data from the slaves is multiplexed to the master, using the "response valid" signals from the slaves. It is scalable if you pipeline it when the number of slaves increases and you cannot reach your target clock frequency any more. The hardware cost is mainly due to the read data multiplexing, that is, an N-bit P-to-1 multiplexer.
This is almost your second option.
The first option is a variant of the second where the read data multiplexers are replaced by OR gates. I do not think it will change the hardware cost much: OR gates are less complex than multiplexers, but each slave will now have to zero its read data bus when it is not selected, which adds as many AND gates. A possible advantage is reduced switching activity and thus lower power consumption: slaves that are not accessed by the master keep their read data bus low. But as you synthesize all this with a logic synthesizer and place and route it with a CAD tool, I am almost sure that you will end up with the same results (area, power, frequency) as for the more classical second option.
Your third option reminds me of the principles of the daisy chain or the token ring. But as you want to avoid tri-states, I doubt that it will bring any benefit in terms of hardware cost. If you pipeline it correctly (each slave samples the incoming master requests and either processes them or passes them to the next slave), you will probably reach higher clock frequencies than with the classical bus, especially with a large number of slaves. But since, on average, a complete transaction will take more clock cycles, you will not improve the performance either.
For really small (but slow) interconnection networks you could also have a look at the Serial Peripheral Interface (SPI) protocols. This is what they are made for: drive several slaves from a single master with few wires.
Considering your target hardware (Altera Cyclone IV), your target clock frequency (50 MHz) and your other specifications, I would first try the classical bus. The address decoder will produce one select signal for each of your 12 slaves, based on the 8 most significant bits of your 16-bit address bus. The cost will be negligible. Apart from these individual select signals, all slaves will receive all other signals (address bus, write data bus, read enable, write enable(s)). The 16-bit read data bus of your master will be the output of a 16-bit 12-to-1 multiplexer that selects one slave response among 12. This will be the part that consumes most of the resources of your interconnect. But it should be OK and run at 50 MHz without problem... if you avoid combinatorial paths between master requests and slave responses.
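As a rough behavioural illustration of the classical bus (not synthesizable, and written in Python rather than VHDL), the sketch below decodes the upper address byte into one-hot select lines and then compares the multiplexed read path (option 2) with the OR-ed read path (option 1). The memory map with three slaves is purely hypothetical.

```python
# Hypothetical memory map: upper address byte -> slave index.
SLAVE_PAGES = {0x00: 0, 0x10: 1, 0x20: 2}


def decode(address):
    """One-hot select lines derived from the 8 most significant address bits."""
    page = (address >> 8) & 0xFF
    selected = SLAVE_PAGES.get(page)  # None if no slave is mapped there
    return [i == selected for i in range(len(SLAVE_PAGES))]


def read_mux(selects, read_data):
    """Option 2: an N-to-1 multiplexer picks the selected slave's data."""
    for sel, data in zip(selects, read_data):
        if sel:
            return data
    return 0


def read_or(selects, read_data):
    """Option 1: every slave gates its output to zero, the bus ORs them all."""
    result = 0
    for sel, data in zip(selects, read_data):
        result |= data if sel else 0  # the AND gating inside each slave
    return result


selects = decode(0x1034)
slave_outputs = [0xAAAA, 0xBEEF, 0x1234]
print(hex(read_mux(selects, slave_outputs)), hex(read_or(selects, slave_outputs)))
# → 0xbeef 0xbeef
```

Both read paths compute the same function, which is why a synthesizer is likely to map them to similar logic.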
A good starting point is the WISHBONE SoC Interconnect from OpenCores.org. The classic read and write cycles are easy to implement. Beyond that, burst transfers for high throughput and much more are also specified. The website also hosts a lot of WISHBONE-compatible projects providing a wide range of I/O devices.
And last but not least, the WISHBONE standard is in the public domain.

protect shared memory region in multiprocessors

The situation is that I have 2 boards connected together via PCIE bus. One board is the rootport and one board is the endpoint. The endpoint side exported a memory region to the rootport side.
The communication between two boards is implemented via software message queue. The queue meta data and buffer are all located inside the exported memory region.
Both sides can access the memory region at the same time (the rootport via its PCIE bus, and the endpoint via its local bus). This may cause problems when both sides try to update the queue meta data.
At first, I tried to allocate a spinlock_t in the same exported memory region, but because each board is a uniprocessor, the spinlock_t is not actually allocated.
Could anyone please suggest a mechanism to protect the shared region, or recommend another approach to communicate between the two boards? Any recommendations are appreciated. Thanks a lot!
Thank you for your interest so far.
We finally implemented the shared memory communication with a circular queue. The implementation can be referenced from this link. We reduced the problem to a single producer and a single consumer, so the circular queue does not require a lock to protect it. The limitation of this approach is that we have to create a queue for each peer connection.
The PCIE spec also has sections describing Atomic Operations. Unfortunately our PCIE host controller does not support this feature, so we cannot make use of it.
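For readers who want to see why a single producer / single consumer queue needs no lock, here is a minimal sketch in Python. It is not the implementation we used; the class and field names are invented. The key property is that only the producer ever writes `head` and only the consumer ever writes `tail`. On real shared memory across PCIE you would additionally need to make sure the payload write is visible before the index update (memory barriers, posted write ordering), which this sketch does not model.

```python
class SpscRing:
    """Single-producer single-consumer ring buffer.

    Only the producer writes `head`; only the consumer writes `tail`.
    One slot is left unused so that full and empty can be told apart.
    """

    def __init__(self, slots):
        self.buf = [None] * slots
        self.head = 0  # next slot to write (owned by the producer)
        self.tail = 0  # next slot to read (owned by the consumer)

    def push(self, item):
        nxt = (self.head + 1) % len(self.buf)
        if nxt == self.tail:
            return False            # queue is full
        self.buf[self.head] = item  # write the payload first ...
        self.head = nxt             # ... then publish it
        return True

    def pop(self):
        if self.tail == self.head:
            return None             # queue is empty
        item = self.buf[self.tail]
        self.tail = (self.tail + 1) % len(self.buf)
        return item


q = SpscRing(4)
print([q.push(n) for n in (1, 2, 3, 4)])  # → [True, True, True, False]
print(q.pop(), q.pop())                   # → 1 2
```

For two-way communication one such ring is needed per direction, which matches the "one queue per peer connection" limitation mentioned above.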

Efficient Overlapped I/O for a socket server

Which of these two models would be more efficient (consider thrashing, utilization of processor cache, overall design, everything)?
1. One IOCP and X threads (where X is the number of processors the computer has). This would mean that my "server" would only have 1 IOCP (queue) for all requests and X threads to serve/handle them. I have read many articles discussing the efficiency of this design. With this model I would have 1 listener that would also be associated with the IOCP. Let's assume that I could figure out how to keep the packets/requests synchronized.
2. X IOCPs (where X is the number of processors the computer has), each with 1 thread. This would mean that each processor has its own queue and 1 thread to serve/handle it. With this model I would have a separate listener (not using IOCP) that would handle incoming connections and would assign the SOCKET to the proper IOCP (one of the X that were created). Let's assume that I could figure out the load balancing.
Using an overly simplified analogy for the two designs (a bank):
1. One line with several cashiers to handle the transactions. Each person is in the same line and each cashier takes the next available person in line.
2. Each cashier has their own line and the people are "placed" into one of those lines.
Between these two designs, which one is more efficient? In each model the Overlapped I/O structures would be allocated using VirtualAlloc with MEM_COMMIT (as opposed to "new") so the swap file should not be an issue (no paging). Based on how it has been described to me, with VirtualAlloc and MEM_COMMIT the memory is reserved and is not paged out. This would allow the SOCKETS to write the incoming data right to my buffers without going through intermediate layers. So I don't think thrashing should be a factor, but I might be wrong.
Someone was telling me that #2 would be more efficient but I have not heard of this model. Thanks in advance for your comments!
I assume that for #2 you plan to manually associate your sockets with an IOCP that you decide is 'best' based on some measure of 'goodness' at the time the socket is accepted? And that somehow this measure of 'goodness' will persist for the life of the socket?
With IOCP used the 'standard' way, i.e. your option number 1, the kernel works out how best to use the threads you have and allows more to run if any of them block. With your method, assuming you somehow work out how to distribute the work, you are going to end up with more threads running than with option 1.
Your #2 option also prevents you from using AcceptEx() for overlapped accepts and this is more efficient than using a normal accept loop as you remove a thread (and the resulting context switching and potential contention) from the scene.
Your analogy breaks down; it's actually more a case of either having 1 queue with X bank tellers where you join the queue and know that you'll be seen in an efficient order as opposed to each teller having their own queue and you having to guess that the queue you join doesn't contain a whole bunch of people who want to open new accounts and the one next to you contains a whole bunch of people who only want to do some paying in. The single queue ensures that you get handled efficiently.
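The teller comparison can be made concrete with a tiny deterministic simulation. This is only a sketch in Python with made-up job durations, not an IOCP benchmark: all jobs are assumed to arrive at time 0, the "single queue" hands each job to whichever teller frees up first, and the "own queue" variant deals jobs to tellers round-robin without knowing how long they take.

```python
import heapq


def single_queue_wait(durations, servers):
    """Total waiting time when each job goes to the first server that is free."""
    free_at = [0] * servers
    heapq.heapify(free_at)
    total_wait = 0
    for d in durations:                  # jobs are served in arrival order
        start = heapq.heappop(free_at)   # earliest time any server is free
        total_wait += start
        heapq.heappush(free_at, start + d)
    return total_wait


def per_server_queue_wait(durations, servers):
    """Total waiting time when jobs are dealt round-robin to private queues."""
    free_at = [0] * servers
    total_wait = 0
    for i, d in enumerate(durations):
        s = i % servers
        total_wait += free_at[s]
        free_at[s] += d
    return total_wait


jobs = [10, 1, 1, 1, 1, 1]  # one slow "open an account" job, five quick ones
print(single_queue_wait(jobs, 2), per_server_queue_wait(jobs, 2))
# → 10 24
```

With the private queues, the quick jobs that happen to land behind the slow one are stuck, while the shared queue lets the other teller absorb them.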
I think you're confused about MEM_COMMIT. It doesn't mean that the memory isn't in the paging file and won't be paged. The usual reason for using VirtualAlloc for overlapped buffers is to ensure alignment on page boundaries and so reduce the number of pages that are locked for I/O (a page sized buffer can be allocated on a page boundary and so only take one page rather than happening to span two due to the memory manager deciding to use a block that doesn't start on a page boundary).
In general I think you're attempting to optimise something way ahead of schedule. Get an efficient server working using IOCP the normal way first and then profile it. I seriously doubt that you'll even need to worry about building your #2 version ... Likewise, use new to allocate your buffers to start with and then switch to the added complexity of VirtualAlloc() when you find that your server fails due to ENOBUFS and you're sure that's caused by the I/O locked page limit and not lack of non-paged pool (you do realise that you have to allocate in 'allocation granularity' sized chunks for VirtualAlloc()?).
Anyway, I have a free IOCP server framework that's available here: http://www.serverframework.com/products---the-free-framework.html which might help you get started.
Edited: The complex version that you suggest could be useful in some NUMA architectures where you use NIC teaming to have the switch split your traffic across multiple NICs, bind each NIC to a different physical processor and then bind your IOCP threads to the same processor. You then allocate memory from that NUMA node and effectively have your network switch load balance your connections across your NUMA nodes. I'd still suggest that it's better, IMHO, to get a working server which you can profile using the "normal" method of using IOCP first and only once you know that cross NUMA node issues are actually affecting your performance move towards the more complex architecture...
Queuing theory tells us that a single queue has better characteristics than multiple queues. You could possibly get around this with work-stealing.
The multiple queues method should have better cache behavior. Whether it is significantly better depends on how many received packets are associated with a single transaction. If a request fits in a single incoming packet, then it'll be associated to a single thread even with the single IOCP approach.

What is the best way for "Polling"?

This question is related to microcontroller programming, but anyone may suggest a good algorithm to handle this situation.
I have one central console and a set of remote sensors. The central console has a receiver and each sensor has a transmitter; all transmitters operate on the same frequency. So we can only implement simplex communication.
Since the transmitters work on the same frequency, we cannot have 2 sensors sending data to the central console at the same time.
Now I want to program the sensors to perform some "polling". The central console should get some idea about the existence of the sensors (whether each sensor is responding or not).
I can imagine several ways:
1. Use the same interval between poll messages for every sensor and start the sensors at random times, so that they will not transmit at the same time.
2. Use some round-robin mechanism: sensor 1 starts polling at 5 seconds, the second at 10 seconds, etc. This is a more formal version of method 1.
The maximum data transfer rate is around 4800 bps, so we need to consider that as well.
Can someone think of a good way to solve this with less usage of the communication link? Note that we can use different poll intervals for each sensor if necessary.
I presume what you describe is that the sensors and the central unit are connected to a bus that can deliver only one message at a time.
A normal way to handle this is to have collision detection. This is e.g. how Ethernet operates as far as I know. You try to send a message; then attempt to detect collision. If you detect a collision, wait for a random amount (to break ties) and then re-transmit, of course with collision check again.
If you can't detect collisions, the different sensors could have polling intervals that are all distinct prime numbers. This would guarantee that every sensor has dedicated slots for successful polling. Of course there would still be collisions, but they wouldn't need to be detected. Here is an example with primes 5, 7 and 11:
----|----|----|----|----|----|----|----| (5)
------|------|------|------|------|----- (7)
----------|----------|----------|-:----- (11)
Notably, it doesn't matter whether the sensors start "in phase" or "out of phase".
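A quick way to check the prime-interval idea is to count, over one full cycle, the slots in which each sensor transmits alone. The sketch below is in Python and assumes time is divided into equal slots and that one transmission fits into one slot; the function name and the offsets are invented for the example.

```python
from math import gcd


def solo_transmissions(periods, offsets):
    """Count, per sensor, the slots in one full cycle where it transmits alone."""
    cycle = 1
    for p in periods:                      # least common multiple of all periods
        cycle = cycle * p // gcd(cycle, p)
    solo = [0] * len(periods)
    for t in range(cycle):
        active = [i for i, (p, o) in enumerate(zip(periods, offsets))
                  if (t - o) % p == 0]     # sensors transmitting in slot t
        if len(active) == 1:
            solo[active[0]] += 1
    return solo


print(solo_transmissions([5, 7, 11], [0, 0, 0]))  # → [60, 40, 24]
print(solo_transmissions([5, 7, 11], [3, 1, 6]))  # → [60, 40, 24]
```

In a cycle of 385 slots the three sensors transmit 77, 55 and 35 times, so most transmissions get through, and the counts are the same for both sets of start offsets.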
I think you need to look into a collision detection system (a la Ethernet). If you have time-based synchronization, you rely on the clocks on the console and sensors never drifting out of sync. This might be ok if they are connected to an external, reliable time reference, or if you go to the expense of having a battery backed RTC on each one (expensive).
Consider using all or part of an existing protocol, unless protocol design is an end in itself - apart from saving time you reduce the probability that your protocol will have a race condition that causes rare irreproducible bugs.
A lot of protocols for this situation have the sensors keeping quiet until the master specifically asks them for the current value. This makes it easy to avoid collisions, and it makes it easy for the master to request retransmissions if it thinks it has missed a packet, or if it is more interested in keeping up to date with one sensor than with others. This may even give you better performance than a system based on collision detection, especially if commands from the master are much shorter than sensor responses. If you end up with something like Alohanet (see http://en.wikipedia.org/wiki/ALOHAnet#The_ALOHA_protocol) you will find that the tradeoff between not transmitting very often and having too many collisions forces you to use less than 50% of the available bandwidth.
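To put a number on the ALOHA tradeoff: under the usual textbook assumption of Poisson-distributed transmission attempts with offered load G (attempts per frame time), the throughput is G·e^(-2G) for pure ALOHA and G·e^(-G) for slotted ALOHA. The Python sketch below just evaluates these two formulas; whether the Poisson assumption fits a handful of periodic sensors is a separate question.

```python
import math


def pure_aloha_throughput(g):
    """Fraction of channel time carrying successful frames, pure ALOHA."""
    return g * math.exp(-2 * g)


def slotted_aloha_throughput(g):
    """Same for slotted ALOHA, where transmissions start on slot boundaries."""
    return g * math.exp(-g)


# The maxima are at G = 0.5 (pure) and G = 1.0 (slotted).
print(round(pure_aloha_throughput(0.5), 3))     # → 0.184
print(round(slotted_aloha_throughput(1.0), 3))  # → 0.368
```

So the usable fraction is well below 50%: roughly 18% for pure ALOHA and 37% for the slotted variant.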
Is it possible to assign a unique address to each sensor?
In that case you can implement a Master/Slave protocol (like Modbus or similar), with all devices sharing the same communication link:
Master is the only device which can initiate communication. It can poll each sensor separately (one by one), by broadcasting its address to all slaves.
Only the slave device which was addressed will reply.
If there is no response after a certain period of time (timeout), device is not available and Master can poll the next device.
See also: List of automation protocols
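The master side of such a scheme is a simple loop. The following Python sketch only shows the control flow; `request` stands for whatever function sends the address over the shared link and waits for a reply, and `fake_request` is a stand-in invented here so the example runs without hardware. It assumes the link is bidirectional, which a receive-only console would not satisfy.

```python
def poll_all(addresses, request, timeout_s=0.2):
    """Poll each address once; return {address: reply, or None on timeout}."""
    results = {}
    for addr in addresses:
        # Only the addressed slave may answer; None means it stayed silent
        # for the whole timeout, so the master moves on to the next one.
        results[addr] = request(addr, timeout_s)
    return results


def fake_request(addr, timeout_s):
    """Stand-in for the radio link: sensors 1 and 3 answer, sensor 2 is silent."""
    replies = {1: b"\x01ok", 3: b"\x03ok"}
    return replies.get(addr)


status = poll_all([1, 2, 3], fake_request)
print([addr for addr, reply in status.items() if reply is None])  # → [2]
```

Because only the master initiates traffic, collisions cannot occur and a missing sensor costs exactly one timeout per polling round.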
I worked with some Zigbee systems a few years back. It only had two sensors so we just hard-coded them with different wait times and had them always respond to requests. But since Zigbee has systems However, we considered something along the lines of this:
Start out with an announcement from the console 'Hey everyone, let's make a network!'
Nodes all attempt to respond with something like 'I'm hardware address x, can I join?'
At first it's crazy, but with some random retry times, eventually the console responds to all nodes: 'Yes hardware address x, you can join. You are node #y and you will have a wait time of z milliseconds from the time you receive your request for data'
Then it should be easy. Every time the console asks for data the nodes respond in their turn. Assuming transmission of all of the data takes less time than the polling period you're set. It's best not to acknowledge the messages. If the console fails to respond, then very likely the node will try to retransmit just when another node is trying to send data, messing both of them up. Then it snowballs into complete failure...
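The wait time z handed to each node can be derived from the link speed. Here is a rough Python sketch assuming the 4800 bps link from the question, 10 bits on the wire per byte (start bit, 8 data bits, stop bit) and a fixed guard interval between replies; the frame size, guard time and function names are assumptions for illustration only.

```python
BIT_RATE = 4800  # bits per second, from the question


def reply_delay_ms(node_index, frame_bytes, guard_ms=5.0, bits_per_byte=10):
    """Delay after the console's request before node `node_index` may reply."""
    frame_ms = frame_bytes * bits_per_byte * 1000 / BIT_RATE
    return node_index * (frame_ms + guard_ms)


def fits_in_poll_period(node_count, frame_bytes, period_ms, **kwargs):
    """True if all replies finish before the console sends its next request."""
    return reply_delay_ms(node_count, frame_bytes, **kwargs) <= period_ms


print(reply_delay_ms(3, 12))                 # → 90.0
print(fits_in_poll_period(10, 12, 1000))     # → True
print(fits_in_poll_period(40, 12, 1000))     # → False
```

A 12-byte reply takes 25 ms at this rate, so with a 5 ms guard each node needs a 30 ms window and about 33 nodes fit into a one-second polling period.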
