Are two JESD204B FPGA masters able to form a high-speed serial link composed of multiple serial lines?

I need to realize a point-to-point multi-gigabit connection between two FPGAs. For that I can use 4 x 6.25 Gbps serial lanes and transport the data over an optical link. The problem is that the connection, realized over those 4 optical lanes, must look like a single point-to-point channel (so a single channel with >20 Gbps).
This lane assembly is exactly what, for example, JESD204B/C does when connecting a fast ADC or DAC to the FPGA through HSSI.
I was wondering whether a JESD204B IP core instantiated on two distant FPGAs is able to assemble a channel composed of 4 lanes and function as a transport channel for the data.
I suspect that a problem might occur during the synchronization phase, because the two IP cores always act as masters and expect a component (ADC or DAC) to be attached for the synchronization.
The link shall be established between Intel and Xilinx FPGAs.
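A quick check of the bandwidth budget may help here. The sketch below assumes that 6.25 Gbps is the line rate of each lane and that the only overhead is the line code: 8b/10b as used by JESD204B, or 64b/66b as introduced with JESD204C. The function name `payload_gbps` is made up for illustration, and any further protocol overhead (alignment characters, framing) is ignored.

```python
def payload_gbps(lanes, line_rate_gbps, payload_bits, coded_bits):
    """Aggregate payload rate after line-code overhead (illustrative helper)."""
    return lanes * line_rate_gbps * payload_bits / coded_bits


raw = 4 * 6.25                              # raw line rate over all lanes
rate_8b10b = payload_gbps(4, 6.25, 8, 10)   # JESD204B-style 8b/10b coding
rate_64b66b = payload_gbps(4, 6.25, 64, 66) # JESD204C-style 64b/66b coding

print(raw, rate_8b10b, round(rate_64b66b, 2))
```

Under these assumptions an 8b/10b link over four lanes carries exactly 20 Gbps of payload, so a strict ">20 Gbps" requirement would need a more efficient line code or a higher line rate.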

Related

Linking (two) bidirectional ports between (two) modules in VHDL

I have an FPGA which accepts an 8-bit address and data bus (one bus is used for both) from two microcontrollers.
Using a 2:1 multiplexer, my FPGA only selects one device's inputs at a time (address and data) and the selection is based on an external signal to the FPGA.
I also have a separate decoder and register module which the microcontroller reads and writes to. How do I link the bidirectional output signal from my multiplexer to the decoder/register bidirectional input module at the higher level using the port map assignment?
Using a std_logic_vector(7 downto 0) does not work: I get the error "this signal is connected to multiple drivers". I think I need to tri-state the two, but I'm not sure. In the image below, the green circle marks what I'm trying to glue together.
[Image: My FPGA project]

You are right in thinking you need to tri-state, but this is needed at the edge of the FPGA, i.e., on the I/O pins.
You cannot have bidirectional ports inside an FPGA. So for each bidirectional pin you have three signals: incoming, outgoing, and a direction. If all pins always have the same direction, you can use the same direction signal for all of them.
For you this means you don't need to multiplex the incoming signals, as they can be split (one signal to multiple instances), but you do need a multiplexer for the outgoing signals (multiple instances to one signal).

Space efficient data bus implementations [closed]

I'm writing a microcontroller in VHDL and have essentially got a core for my actual microcontroller section down. I'm now getting to the point however of starting to include memory mapped peripherals. I'm using a very simple bus consisting of a single master (the CPU) and multiple slaves (the peripherals/RAM). My bus works through an acknowledge CPU->perip and acknowledge perip->CPU. The CPU also has separate input and output data buses to avoid tristates.
I've chosen this method as I wish to have the ability for peripherals to stall the CPU. A bus transaction is achieved by: The master places the data, address and read/write bit on the bus, bringing the ack(c->p) high. Once the slave has successfully received the information and has placed the response back on the data (p->c) bus, the slave sets its ack(p->c) high. The master notes the slave has successfully placed the data, takes the data for processing and releases the ack(c->p). The bus is now in idle state again, ready for further transactions.
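The handshake described above can be sketched as a small cycle-based model. This is only an illustration under assumptions: the `Slave` class, its `latency` parameter (how many cycles it stalls the master) and the rule that the slave drops ack(p->c) once the master releases ack(c->p) are not part of the description and are made up here.

```python
class Slave:
    """Toy slave with a small memory; `latency` is a hypothetical stall length."""

    def __init__(self, latency=0):
        self.mem = {}
        self.latency = latency
        self.ack_pc = False   # ack(p->c)
        self.data_pc = 0      # data bus p->c
        self._waited = 0

    def step(self, ack_cp, addr, data_cp, write):
        """Advance one clock cycle given the master's outputs."""
        if not ack_cp:
            # Master released its ack: return to idle (assumed behaviour).
            self.ack_pc = False
            self._waited = 0
        elif not self.ack_pc:
            if self._waited < self.latency:
                self._waited += 1      # not ready yet: the master stays stalled
            else:
                if write:
                    self.mem[addr] = data_cp
                else:
                    self.data_pc = self.mem.get(addr, 0)
                self.ack_pc = True     # response is now valid


def transaction(slave, addr, data=0, write=False, max_cycles=100):
    """Run one bus transaction; returns (value on the p->c data bus, cycles used)."""
    cycles = 0
    while not slave.ack_pc:                # master holds ack(c->p) high
        if cycles >= max_cycles:
            raise TimeoutError("slave never acknowledged")
        slave.step(True, addr, data, write)
        cycles += 1
    result = slave.data_pc                 # master takes the data
    slave.step(False, addr, data, write)   # master releases ack(c->p)
    return result, cycles + 1


ram = Slave(latency=2)
transaction(ram, 0x10, 0xAB, write=True)
print(transaction(ram, 0x10))
```

For a write, the first element of the returned tuple is simply whatever was last on the p->c data bus and carries no meaning.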
Obviously this is a very simple bus protocol and doesn't include burst features, variable word sizes or other more complex features. My question however is what space efficient methods can be used to connect peripherals to a master CPU?
I've looked into 3 different methods so far. I'm currently using a single output data bus from the master to all of the peripherals, with the data outputs from all the peripherals being ORed together, along with their ack(p->c) outputs. Each peripheral contains a small address mux which only allows a slave to respond if the address is within a predefined range. This reduces the logic for switching between peripherals but will obviously infer a lot of logic per peripheral for the address muxes, which leads me to believe that future scalability will be impacted.
Another method I thought of was having a single large address mux connected to the master, which decodes the address and sends it, along with the data and ack signals, to each slave. The output data is then muxed back into the master. This seems a slightly more efficient method, though I always seem to end up with ridiculously long data vectors and it's a bit of a chore to keep track of them.
A third method I thought of was to have it arranged in a ring-like fashion. The master address goes to all of the slaves, with a smaller mux which merely chooses which ack signals to send out. The data output from the master then travels serially through each slave. Each slave contains a mux which allows it either to let the incoming data pass through unaffected or to place its own data on the bus. I feel this will work best for slow systems, as only one small mux per slave is required to choose between the incoming data and that slave's data, along with a small mux that decodes the address and sends out the ack signals. The issue here, I believe, is that with lots of peripherals the propagation delay from the output of the master to the input of the master would be pretty large, as the data has to travel through each slave!
Could anybody give me suitable reasoning for the different methods? I'm using Quartus to synthesize and route for an Altera EP4CE10E22C8 FPGA and I'm looking for the smallest implementation with regards to FPGA LUTs. My system uses a 16-bit address and data bus. I'm looking to achieve at minimum ~50 MHz under ideal memory conditions (i.e., no wait states) and would be looking to have around 12 slaves, each with between 8 and 16 bits of addressable space.
Thanks!

I suggest that you download the AMBA specification from the ARM web site (http://www.arm.com/) and look at the AXI4-Lite bus or the much older APB bus. In most bus standards with a single master there is no multiplexer on the addresses, only an address decoder that drives the peripheral selection signals. Only the response data from the slaves are multiplexed to the master, using the "response valid" signals from the slaves. It is scalable if you pipeline it when the number of slaves increases and you can no longer reach your target clock frequency. The hardware cost is mainly due to the read data multiplexing, that is, an N-bit P-to-1 multiplexer.
This is almost your second option.
The first option is a variant of the second where the read data multiplexers are replaced by OR gates. I do not think it will change the hardware cost much: OR gates are less complex than multiplexers, but each slave will now have to zero its read data bus when it is not selected, which adds as many AND gates. A good point is, maybe, reduced activity and thus lower power consumption: slaves that are not accessed by the master will keep their read data bus low. But as you synthesize all this with a logic synthesizer and place and route it with a CAD tool, I am almost sure that you will end up with the same results (area, power, frequency) as for the more classical second option.
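To make the comparison concrete, here is a minimal behavioral sketch (Python, not synthesizable code) of the two read paths. The function names are made up for illustration, and it assumes one-hot select signals, i.e. at most one slave is selected at a time.

```python
from functools import reduce
from operator import or_


def or_bus(selects, read_data):
    """Option 1: each slave ANDs its data with its select, results are ORed."""
    gated = [d if sel else 0 for sel, d in zip(selects, read_data)]
    return reduce(or_, gated, 0)


def mux_bus(selects, read_data):
    """Option 2: classical multiplexer picking the selected slave (0 if none)."""
    for sel, d in zip(selects, read_data):
        if sel:
            return d
    return 0


data = [0x1234, 0xBEEF, 0x00FF]
for i in range(len(data)):
    one_hot = [j == i for j in range(len(data))]
    print(hex(or_bus(one_hot, data)), hex(mux_bus(one_hot, data)))
```

With one-hot selects both functions return the same value, which is why a synthesizer is free to map either description to similar logic.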
Your third option reminds me of the principles of the daisy chain or the token ring. But as you want to avoid tri-states I doubt that it will bring any benefit in terms of hardware cost. If you pipeline it correctly (each slave samples the incoming master requests and either processes them or passes them to the next slave) you will probably reach higher clock frequencies than with the classical bus, especially with a large number of slaves. But since, on average, a complete transaction will take more clock cycles, you will not improve the performance either.
For really small (but slow) interconnection networks you could also have a look at the Serial Peripheral Interface (SPI) protocols. This is what they are made for: drive several slaves from a single master with few wires.
Considering your target hardware (Altera Cyclone IV), your target clock frequency (50 MHz) and your other specifications, I would first try the classical bus. The address decoder will produce one select signal for each of your 12 slaves, based on the 8 most significant bits of your 16-bit address bus. The cost will be negligible. Apart from these individual select signals, all slaves will receive all other signals (address bus, write data bus, read enable, write enable(s)). The 16-bit read data bus of your master will be the output of a 16-bit 12-to-1 multiplexer that selects one slave response among 12. This will be the part that consumes most of the resources of your interconnect. But it should be OK and run at 50 MHz without problem... if you avoid combinatorial paths between master requests and slave responses.
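A behavioral sketch of this classical bus could look as follows. The address map is an assumption made purely for illustration: slave i is taken to own the 256-byte page where the upper address byte equals i. The names `decode` and `bus_read` are hypothetical as well.

```python
NUM_SLAVES = 12


def decode(addr):
    """Return the selected slave index (or None) from the 8 MSBs of the address."""
    page = (addr >> 8) & 0xFF
    return page if page < NUM_SLAVES else None


def bus_read(addr, slave_read_data):
    """16-bit 12-to-1 read multiplexer driven by the address decoder."""
    sel = decode(addr)
    return 0 if sel is None else slave_read_data[sel] & 0xFFFF


# Each slave presents some read data; here slave i simply returns 0x100 + i.
responses = [0x100 + i for i in range(NUM_SLAVES)]
print(decode(0x0312), hex(bus_read(0x0312, responses)))
```

Only the upper byte feeds the decoder, so its cost stays small; the multiplexer in `bus_read` is the part that grows with the number of slaves.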

A good starting point is the WISHBONE SoC Interconnect from OpenCores.org. The classic read and write cycles are easy to implement. Beyond that, burst transfers for high throughput and much more are also specified. The website also hosts a lot of WISHBONE-compatible projects providing a wide range of I/O devices.
And last but not least, the WISHBONE standard is in the public domain.

Connect a stack of Parallella boards and a Raspberry Pi via an FPGA and I/O pins

I want to connect my Pi and Parallella such that the Pi does the GPU side and the Parallella stack; this is to be controlled by a third Parallella.
I think the best way to do this is through an FPGA. Is this possible and a good way to do it?
Also what structure should I use and how should I start to implement it?
I know little VHDL and Verilog and do not want to use paid software.
I am eager to learn and have a lot of time to do it though so no "simple but bad solutions".
I will upload the project to Git when done.

The solution depends on the bandwidth and latency requirements. You are right that an FPGA provides the largest bandwidth and lowest latency. However, do you really need such good performance? Maybe USB or Ethernet connections are good enough.
For the FPGA solution, consider the secondary Pi and the Parallella as two peripherals of the primary Pi, and assign different address spaces to them. Communication among the three devices is based on polling initiated by the primary Pi. The FPGA should pass the signaling on the data/address bus to the two peripherals with compatible I/O timing. The peripherals treat the FPGA as a RAM and should listen for any data/control signals on a best-effort basis. The FPGA should buffer the data/control signals if the peripherals cannot respond in real time.
Overall, it's very tough work. I'd like to see the source code if the FPGA solution works.

Bus protocol for a microcontroller in VHDL

I am designing a microcontroller in VHDL. I am at the point where I understand the role of each component (ALU/Memory...), and have some ideas on how to realise them. I basically want to implement a Von Neumann architecture.
But here is what I don't get: how do the components communicate? I don't know how to design my bus (buses?). I am therefore looking for a simple bus implementation and protocol.
My unresolved questions:
Is it simpler to have one bus for everything or to separate the different kinds of data?
How does each component know when to "listen" and when to "write"?
The emphasis is on the simplicity of the design (and thus of the implementation). I do not care about speed. I want to do everything from scratch (i.e., no pre-made softcore).
I don't know if this is of importance at this stage, but it will not need to run "real" compiled code, or have any kind of compatibility with anything existing. Also, at which point do I begin to think about my 'assembly' instructions? I think that I will load them directly into the memory.
Thank you for your help.
EDIT:
I ended up drawing (a lot of) inspiration from the Picoblaze, because it is:
simple to understand
under a BSD Licence
Specifically, I started by adding a few instructions to it.

Since your main concern seems to be learning about microcontroller design, a good approach could be taking a look into some of the earlier microprocessor models. Take for instance the Z80:
Source: http://landley.net/history/mirror/cpm/z80.html
Another good Z80 HW description: http://www.msxarchive.nl/pub/msx/mirrors/msx2.com/zaks/z80prg02.htm
To answer your first question (single vs. multiple buses), this chip uses a single bus for everything, and it has a very simple design. You could probably use something similar. To make the terminology clear, a single system bus may be composed of sub-buses (and they are also called buses). The figure shows a system bus composed of a bidirectional data bus (8 bits wide) and an address bus (16 bits wide).
To answer your second question (how do components know when they are active), in the image above you see two distinct signals, memory request and I/O request. Only one will be active at a time, and when I/O request is active, that's when a peripheral could potentially be accessed.
If you don't have many peripherals, you don't need to use all 16 address lines (some Z80's have an 8-bit I/O space). Each peripheral would be accessed through some addresses in this space. For instance, in a very simple system:
a timer peripheral could use addresses from 00h to 03h
a UART could use addresses from 08h to 0Fh
In this simple example, you need to provide two circuits: one would detect when the address is within the range 00-03h, and another would do the same for 08-0Fh. If you do a logic "and" between the output of each detector and the I/O request signal, then you would have two signals indicating when each of the peripherals is being accessed. Your peripheral hardware should primarily listen to this signal.
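As a minimal behavioral sketch of those two detectors (Python rather than VHDL, with made-up function names), note that both example ranges are aligned to a power of two, so each range check reduces to comparing the upper address bits:

```python
def timer_select(addr, iorq):
    """Timer at 00h-03h: address bits 7..2 must all be zero."""
    return bool(iorq) and (addr & 0xFC) == 0x00


def uart_select(addr, iorq):
    """UART at 08h-0Fh: address bits 7..3 must equal 00001."""
    return bool(iorq) and (addr & 0xF8) == 0x08


for addr in (0x02, 0x04, 0x0A):
    print(hex(addr), timer_select(addr, True), uart_select(addr, True))
```

In hardware each detector would be a small comparator on the upper address bits, ANDed with the I/O request signal.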
Finally, regarding your question about instructions, the dataflow inside your microprocessor would have several stages. This is usually called a processor's datapath. It is common to divide the stages into:
FETCH: read an instruction from program memory
DECODE: check specific bits within the instructions, and decide what type of instruction it is
EXECUTE: take the actions required by the instruction (e.g., ALU operations)
MEMORY: for some instructions, you need to do a data read or write
WRITE BACK: update your CPU registers with new values affected by the instruction
Source: https://www.cs.umd.edu/class/fall2001/cmsc411/projects/DLX/proj.html
Most of your job of dealing with individual instructions would be done in the DECODE and EXECUTE stages. As for the datapath control, you will need a state machine that controls the sequence of operations through the 5 stages. This functional block is usually called a Control Unit. Here you have a few choices:
Your state machine could go through all stages sequentially, one at a time. An instruction would take several clock cycles to execute.
Similar to the choice above, but combining two or more stages in a single cycle if you want to make things simpler and faster.
Pipeline the execution of instructions. This can give a great speed boost, but maybe it's better left for later because things can get quite complex.
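The first choice (one stage per clock cycle) can be sketched as a tiny state machine. This is a behavioral illustration in Python with hypothetical names; it only models the sequencing of the five stages, not what each stage does.

```python
from enum import Enum


class Stage(Enum):
    FETCH = 0
    DECODE = 1
    EXECUTE = 2
    MEMORY = 3
    WRITE_BACK = 4


def next_stage(stage):
    """Advance to the following stage, wrapping from WRITE_BACK to FETCH."""
    return Stage((stage.value + 1) % len(Stage))


def run(cycles):
    """Clock the control unit; returns (stage trace, completed instructions)."""
    stage, trace, done = Stage.FETCH, [], 0
    for _ in range(cycles):
        trace.append(stage)
        if stage is Stage.WRITE_BACK:
            done += 1                  # an instruction retires in this cycle
        stage = next_stage(stage)
    return trace, done


trace, done = run(10)
print([s.name for s in trace[:5]], done)
```

In this model every instruction costs five cycles, which is what makes merging stages or pipelining attractive later on.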
As for the implementation, I recommend keeping the functional blocks as separate entities, and make sure you write a testbench for each block. Your job will go faster if you write those testbenches.
As for the blocks, the Register File is pretty easy to code. The Instruction Decoder is also easy if you have a clear idea of your instruction layout and opcodes. And the ALU is also easy if you know the operations it needs to perform.
I would start by writing testbenches for the Instruction Decoder and the Register File. Then I would write a script that runs all the testbenches and checks their results automatically. Only then would I focus on the implementation of the functional blocks themselves.

Basically, on-chip buses use parallel buses for address and data input and output. Usually there is some kind of arbiter which decides which component is allowed to write to the bus. So a common approach is:
The component that wants to write sets a request line connected to the arbiter high or low to signal that it wants to access the bus.
The arbiter decides who gets access to the bus.
The arbiter sets the chip select of the component that should be allowed next to access the bus.
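The three steps above can be sketched as a simple arbiter model. The arbitration policy is not specified here, so this sketch assumes fixed priority (the lowest index wins); the function name is made up, and a round-robin scheme would be an equally valid choice.

```python
def arbitrate(requests):
    """Fixed-priority arbiter: grant the lowest-numbered active request."""
    grants = [False] * len(requests)
    for i, req in enumerate(requests):
        if req:
            grants[i] = True   # this component's select is asserted
            break              # everyone else has to wait
    return grants


print(arbitrate([False, True, True]))
```

At most one grant is ever active, which guarantees a single driver on the bus.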
Usually your on-chip bus will use a master/slave concept, so only masters have active access to the bus. The slaves only wait for requests from the master.
I for one like the AMBA AHB/APB design but this might be a little over the top for your application. You can have a look at this book looking for ideas on how to implement your bus

Synchronizing a counter across a network

I have two computers that can talk to each other over a serial connection. The connection is made over a wireless network. There is a variable, changing delay in communications between the two systems. On both systems I have a counter runtime that increments by 1 every ms. They both start as soon as the applications start. Say each computer is started at a different time. How can I use the serial connection to synchronize the counters so that systemA.counter equals systemB.counter and both counters increment at the same time (or as close to it as possible)?
Ideally, once synchronized, the counters would drift apart only slowly, so that I could re-synchronize once every 3 or 4 thousand increments.
I'm looking for good resources on the topic, example algorithms, example code (C/C++), anything to point me in the right direction.
Update
This is a closed system, no internet. For all intents and purposes there is no real protocol at all besides an open serial line over the wireless link. That link is Bluetooth at the moment, but I'm thinking of moving it to a ZigBee mesh. There are currently 2 nodes, but if I have 30 nodes all running this same application I would want them all to synchronize. There is no client/server designation, just a couple of devices running the same program with a counter. I don't have access to anything like time, just this counter that increments once a millisecond and whatever algorithm I can put in place.
Once I get this working, I would like to put in place a positioning and mapping system, but to figure out distances between nodes I need accurate, synchronized timing on the devices.

If you use these counters to order events in a system, you should look at vector clocks or Lamport timestamps.
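For reference, a Lamport clock is small enough to sketch in a few lines. The class and method names are chosen for illustration; the update rules (increment on every local event, take the maximum plus one on receive) are the standard ones. Note that this orders events but does not align the millisecond counters themselves.

```python
class LamportClock:
    """Logical clock: orders events without any notion of wall-clock time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: advance the clock by one."""
        self.time += 1
        return self.time

    def send(self):
        """Timestamp to attach to an outgoing message."""
        return self.tick()

    def receive(self, timestamp):
        """Merge the sender's timestamp on message arrival."""
        self.time = max(self.time, timestamp) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()
stamp = a.send()
print(stamp, b.receive(stamp))
```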

The obvious resource is NTP, which is documented for example at http://www.eecis.udel.edu/~mills/ntp.html and with links off there. Basically, this uses timestamps to adjust the frequency at which local clocks run. The protocol has been around for years and been the subject of continuous research - I can't see any pack of slides there which immediately makes it clear how it works. You might be better off checking whether an NTP implementation is already available than trying to re-implement it yourself.
It appears (e.g. from searching) that there is a small industry of people working on time synchronisation algorithms, especially in the context of wireless sensor networks. One jumping-off point, apart from searches, is the survey paper at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.85.2012 - Time synchronization in sensor networks: A survey (2004)
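The core of an NTP-style exchange fits in a few lines, shown here as a rough sketch applied to the millisecond counters from the question. The function names are made up. The offset formula assumes the delay is the same in both directions; because the real delay varies, the second helper applies a common heuristic and trusts the exchange with the smallest round-trip delay.

```python
def estimate(t0, t1, t2, t3):
    """One request/reply exchange.

    t0: local counter when the request is sent
    t1: remote counter when the request arrives
    t2: remote counter when the reply is sent
    t3: local counter when the reply arrives
    Returns (offset of remote relative to local, round-trip delay).
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


def best_offset(samples):
    """Pick the offset from the exchange with the smallest round-trip delay."""
    return min((estimate(*s) for s in samples), key=lambda od: od[1])[0]


# Remote counter is 500 ms ahead; the second exchange has a slower, asymmetric link.
samples = [(1000, 1520, 1525, 1045), (2000, 2560, 2565, 2105)]
print([estimate(*s) for s in samples], best_offset(samples))
```

The local node would then add the chosen offset to its counter (or slew towards it gradually) and repeat the exchange periodically to track drift.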
