Having several processes with the same sensitivity list - VHDL

Can there be any unwanted effects by having several processes with the same sensitivity list in one architecture?
I have several processes that run in parallel in an architecture: one process for reading input as a slave from a master which writes the input, one for writing output back to the master when the master asks for it, and one for a calculation. All the processes are clocked processes and their sensitivity list contains only the reset and clock signals.
Each process writes into its own signals, which the other processes may read from, i.e. there isn't an instance where two processes write into the same signal.
It's possible to implement everything in one single big process, but it'll be more cumbersome.
Can there be any adverse effects with such an implementation?
Are there any reasons to favor the less elegant one big process over several smaller ones?

Not at all. Most of my processes have the same sensitivity list: (reset, clock), and this is not unusual.
If a single big process really is less elegant, and an implementation using multiple smaller processes is genuinely clearer and easier to understand, then go ahead and design that way.
I tend towards fewer, larger processes in my designs, but lay them out in discrete sections to make it easier to separate functionality within a process. That's not a hard rule, though: I wouldn't implement several independent state machines in a single process, for example.
What I would advise against are two styles that are often seen:
The 2-process (or 3-process) state machine, where one process is purely combinational, with a complex sensitivity list that's difficult to get right. A single-process state machine is simpler, shorter, and at least as easy to understand; see the sketch after this list.
A huge number of tiny processes, each controlling one or two signals which are inputs to other tiny processes, so that you can spend all day tracing signals from process to process without ever finding the code that does the actual work!
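As a minimal sketch of the single-process style, with only (clock, reset) in the sensitivity list (the entity, ports and state names here are invented for illustration, not taken from the question):

    library ieee;
    use ieee.std_logic_1164.all;

    entity single_process_fsm is
        port (
            clock, reset, start : in  std_logic;
            busy                : out std_logic
        );
    end entity single_process_fsm;

    architecture rtl of single_process_fsm is
        type state_t is (idle, running, done);
        signal state : state_t;
    begin
        -- One clocked process holds the state register, the next-state logic
        -- and the registered output; no separate combinational process and no
        -- hand-maintained sensitivity list of data signals.
        fsm : process (clock, reset)
        begin
            if reset = '1' then
                state <= idle;
                busy  <= '0';
            elsif rising_edge(clock) then
                case state is
                    when idle =>
                        busy <= '0';
                        if start = '1' then
                            state <= running;
                            busy  <= '1';
                        end if;
                    when running =>
                        -- the actual work would go here
                        state <= done;
                    when done =>
                        busy  <= '0';
                        state <= idle;
                end case;
            end if;
        end process fsm;
    end architecture rtl;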
If I understand your description you have something like three blocks: receiver, data processor, transmitter; and this sounds like a good separation of functions to me.
For example you can more easily replace the receiver or transmitter, or re-use them with a different data processor, if they are separate processes (or even separate entities).
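For illustration, the receiver / processor / transmitter split described in the question might look roughly like the sketch below (all entity, port and signal names are invented, and the calculation is a placeholder). Every process has the same (clock, reset) sensitivity list, and each one drives only its own signals:

    library ieee;
    use ieee.std_logic_1164.all;

    entity example_slave is
        port (
            clock, reset     : in  std_logic;
            bus_write_strobe : in  std_logic;
            bus_write_data   : in  std_logic_vector(7 downto 0);
            bus_read_strobe  : in  std_logic;
            bus_read_data    : out std_logic_vector(7 downto 0)
        );
    end entity example_slave;

    architecture rtl of example_slave is
        signal rx_data  : std_logic_vector(7 downto 0) := (others => '0');
        signal rx_valid : std_logic := '0';
        signal result   : std_logic_vector(7 downto 0) := (others => '0');
    begin
        -- Receiver: drives rx_data and rx_valid only.
        receiver : process (clock, reset)
        begin
            if reset = '1' then
                rx_valid <= '0';
            elsif rising_edge(clock) then
                rx_valid <= bus_write_strobe;
                if bus_write_strobe = '1' then
                    rx_data <= bus_write_data;
                end if;
            end if;
        end process receiver;

        -- Calculation: reads rx_*, drives result only.
        processor : process (clock, reset)
        begin
            if reset = '1' then
                result <= (others => '0');
            elsif rising_edge(clock) then
                if rx_valid = '1' then
                    result <= not rx_data;  -- placeholder for the real calculation
                end if;
            end if;
        end process processor;

        -- Transmitter: reads result, drives bus_read_data only.
        transmitter : process (clock, reset)
        begin
            if reset = '1' then
                bus_read_data <= (others => '0');
            elsif rising_edge(clock) then
                if bus_read_strobe = '1' then
                    bus_read_data <= result;
                end if;
            end if;
        end process transmitter;
    end architecture rtl;

Because no two processes drive the same signal, there are no multiple-driver conflicts, and the behaviour is the same as if the three blocks had been merged into one large process.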

What is the overall impact of ignoring data at FIFO input in an FPGA?

I understand the operation of a FIFO, but I think I am missing something about its utility.
When implementing a FIFO in an FPGA, let's say to cross clock domains, it seems that you would frequently run into the situation where the FIFO is full, but there is still data that should be clocking in every cycle. This might happen if the writing mechanism is clocking data in faster than the reading mechanism is reading data out. Obviously, once the FIFO is full it will start ignoring data until it has room to continue storing data.
My question is, isn't this a big deal? We are basically just losing data? Sure, the FIFO is doing its job, but the overall system is just throwing away data.
I have drawn two possible conclusions:
1) In this scenario (where the input data rate is greater than the output data rate), if we really care about not losing any data, maybe a FIFO isn't the best way to cross these domains (especially if the writing mechanism has a much faster clock than the reading domain). If this is true, is there conventionally a better way to cross clock domains than with a FIFO? Maybe the answer is that you need to use another element, such as a decimator, before the FIFO?
2) We put a constraint on the system that says "you can only write X amount of data (or cycles, or time, etc.)" before the FIFO needs time to clear its data. It seems unsatisfactory that we must turn off the data stream for a little while and wait for the FIFO to free some room before we continue writing. But then again, I'm new to digital systems and maybe this is just the harsh reality that I am not used to :)
It seems then that the best use for a FIFO when crossing clock domains is simply one where the data rate into the FIFO and the data rate out of the FIFO are the same, because then it can keep up with itself.
It seems you're mixing two problems into one.
There's clock domain crossing, and there's input data buffering. It just happens that a FIFO combines implementations of these two tasks in one entity.
If the receiver can't keep up with transmitter, and there's no flow control, then the data will be lost, and it doesn't matter if data was crossing the clock domains or not. You can't solve the data loss problem without adding some kind of handshake or flow control lines.
Without flow control you must ensure that the input buffer size is enough for handling load peaks in your specific case.
As for the impact: it's either nonexistent, if your design is OK with data loss, or you'll have a nonfunctional device if the data loss is not tolerated by the design.
FIFOs can also have different input and output widths. That means, for example, you can have a 100 MHz 32-bit input and a 50 MHz 64-bit output: the output clock rate is half, but the data width is double, so the throughput into and out of the FIFO is the same (100 MHz x 32 bits = 50 MHz x 64 bits = 3.2 Gbit/s).

MPI shared memory access

In a parallel MPI program running on, for example, 100 processors:
Suppose there is a global counter that should be known by all MPI processes; each of them can add to this number, and the others should see the change instantly and add to the changed value.
Synchronization is not possible and would have a lot of latency issues.
Would it be OK to open a shared memory region among all the processes and use this memory for reading this number and also changing it?
Would it be OK to use MPI_WIN_ALLOCATE_SHARED or something like that, or is this not a good solution?
Your question suggests to me that you want to have your cake and eat it too. This will end in tears.
I write that you want to have your cake and eat it too because you state that you want to synchronise the activities of 100 processes without synchronisation. You want to have 100 processes incrementing a shared counter, (presumably) to have all the updates applied correctly and consistently, and to have increments propagated to all processes instantly. No matter how you tackle this problem, it is one of synchronisation; either you write synchronised code or you offload the task to a library or run-time which does it for you.
Is it reasonable to expect MPI RMA to provide automatic synchronisation for you? No, not really. Note first that mpi_win_allocate_shared is only valid if all the processes in the communicator which make the call are in shared memory. Given that you have the hardware to support 100 processes in the same shared memory, you still have to write code to ensure synchronisation; MPI won't do it for you. If you do have 100 processes, any or all of which may increment the shared counter, there is nothing in the MPI standard, or in any implementation I am familiar with, which will prevent a data race on that counter.
Even shared-memory parallel programs (as opposed to MPI providing shared-memory-like parallel programs) have to take measures to avoid data races and other similar issues.
You could certainly write an MPI program to synchronise accesses to the shared counter but a better approach would be to rethink your program's structure to avoid too-tight synchronisation between processes.

Would threading be beneficial for this situation?

I have a CSV file with over 1 million rows. I also have a database that contains such data in a formatted way.
I want to check and verify the data in the CSV file and the data in the database.
Would it be beneficial (i.e. reduce time) to use threads to read from the CSV file, and to use a connection pool for the database?
How well does Ruby handle threading?
I am using MongoDB, also.
It's hard to say without knowing more about the specifics of what you want the app to feel like when someone initiates this comparison. So, to answer, here is some general advice that should apply fairly well regardless of the problem you might want to thread.
Threading does NOT make something computationally less costly
Threading doesn't make things less costly in terms of computation time. It just lets two things happen in parallel. So beware of falling into the common misconception that "threading makes my app faster because the user doesn't wait for things" - this isn't true, and threading actually adds quite a bit of complexity.
So, if you kick off this DB vs. CSV comparison task, threading isn't going to make that comparison take any less time. What it might do is allow you to tell the user, "Ok, I'm going to check that for you," right away, while doing the comparison in a separate thread of execution. You still have to figure out how to get back to the user when the comparison is done.
Think about WHY you want to thread, rather than simply approaching it as whether threading is a good solution for long tasks
Like I said above, threading doesn't make things faster. At best, it uses computing resources in a way that is either more efficient, or gives a better user experience, or both.
If the user of the app (maybe it's just you) doesn't mind waiting for the comparison to run, then don't add threading because you're just going to add complexity and it won't be any faster. If this comparison takes a long time and you'd rather "do it in the background" then threading might be an answer for you. Just be aware that if you do this you're then adding another concern, which is, how do you update the user when the background job is done?
Threading involves extra overhead and app complexity, which you will then have to manage within your app - tread lightly
There are other concerns as well, such as: how do I schedule that worker thread to make sure it doesn't hog the computing resources? Is setting thread priorities an option in my environment, and if so, how will adjusting them affect the use of computing resources?
Threading and the extra overhead involved will almost definitely make your comparison take LONGER (in terms of absolute time it takes to do the comparison). The real advantage is if you don't care about completion time (the time between when the comparison starts and when it is done) but instead the responsiveness of the app to the user, and/or the total throughput that can be achieved (e.g. the number of simultaneous comparisons you can be running, and as a result the total number of comparisons you can complete within a given time span).
Threading doesn't guarantee that your available CPU cores are used efficiently
See Green Threads vs. native threads - some languages (depending on their threading implementation) can schedule threads across CPUs.
Threading doesn't necessarily mean your threads wind up getting run in multiple physical CPU cores - in fact in many cases they definitely won't. If all your app's threads run on the same physical core, then they aren't truly running in parallel - they are just splitting CPU time in a way that may make them look like they are running in parallel.
For these reasons, depending on the structure of your app, it's often less complicated to send background tasks to a separate worker process (process, not thread), which can easily be scheduled onto available CPU cores at the OS level. Separate processes (as opposed to separate threads) also remove a lot of the scheduling concerns within your app, because you essentially offload the decision about how to schedule things onto the OS itself.
This last point is pretty important. OS schedulers are extremely likely to be smarter and more efficiently designed than whatever algorithm you might come up with in your app.

Moving data between processes in Spartan 3

I have two processes A and B, each with its own clock input.
The clock frequencies are a little different, and therefore not synchronized.
Process A samples data from an IC, this data needs to be passed to process B, which then needs to write this data to another IC.
My current solution is using some simple handshake signals between process A and B.
The memory has been declared as distributed RAM (128 bytes as an array of std_logic_vector(7 downto 0)) inside process A (not block memory).
I'm using a Spartan 3AN from Xilinx and the ISE Webpack.
But is this the right way to do it?
I read somewhere that the Spartan 3 has dual-port block memory supporting two clocks, so would this be more correct?
The reason I'm asking is that my design behaves unpredictably, and in cases like this I just hate magic. :-)
Except for very specific exceptional cases, the only correct way to move data between two independent clock domains is to use an asynchronous FIFO (also more correctly called a multi-rate FIFO).
In almost all FPGAs (including the Xilinx parts you are using), you can use FIFOs created by the vendor -- in Xilinx's case, you do this by generating yourself a FIFO using the CoreGen tool.
You can also construct such a FIFO yourself using a dual-port RAM and appropriate handshaking logic, but like most things, this is not something you ought to go reinvent on your own unless you have a very good reason to do so.
You also might consider whether your design really needs to have multiple clock domains. Sometimes it's absolutely necessary, but that's much, MUCH less often than most people just starting out believe. For instance, even if you need logic that runs at multiple rates, you can often handle this by using a single clock and appropriately generated synchronous clock enables.
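For instance, a rough sketch of a synchronous clock enable that fires once every N cycles of the single clock (the entity name, the generic N and the port names are all assumptions for illustration):

    library ieee;
    use ieee.std_logic_1164.all;

    entity clock_enable_gen is
        generic (
            N : positive := 4  -- enable pulses once every N clock cycles
        );
        port (
            clock, reset : in  std_logic;
            enable       : out std_logic
        );
    end entity clock_enable_gen;

    architecture rtl of clock_enable_gen is
        signal count : natural range 0 to N - 1 := 0;
    begin
        process (clock, reset)
        begin
            if reset = '1' then
                count  <= 0;
                enable <= '0';
            elsif rising_edge(clock) then
                if count = N - 1 then
                    count  <= 0;
                    enable <= '1';  -- single-cycle pulse at clock/N rate
                else
                    count  <= count + 1;
                    enable <= '0';
                end if;
            end if;
        end process;
    end architecture rtl;

The slower logic then stays in the same clock domain and simply wraps its work in "if enable = '1' then ..." inside its ordinary clocked process, so no clock domain crossing is needed at all.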
The magic you are experiencing is most likely because either you haven't constrained your design correctly in synthesis, or you haven't done your handshaking properly. You have two options:
FIFO
Use a multi-rate FIFO, as stated by wjl. This is a very common solution; it always works (when done properly), but it is heavy in terms of resources. The big plus of this solution is that you don't have to take care of the actual clock domain crossing issues yourself, and you'll get maximum bandwidth between the two domains. Never try to build up an asynchronous FIFO yourself in VHDL, because that won't work reliably; even in VHDL there are some things that you simply can't do properly. Use the appropriate generators from Xilinx; I think that's CoreGen.
Handshaking
Have at least two registers for your data in each of the two domains and build complete request/acknowledge handshaking logic; it won't work properly without it. Make sure that the handshaking logic is properly synchronized by adding at least two registers for the handshaking signals in the receiving domain (see the sketch below), because otherwise you will most likely get unpredictable behaviour due to metastability issues.
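As a rough sketch of the "two registers in the receiving domain" part (the entity and signal names are invented for illustration), each single-bit handshake flag would pass through something like this before being used:

    library ieee;
    use ieee.std_logic_1164.all;

    entity sync_2ff is
        port (
            clk_b    : in  std_logic;  -- receiving-domain clock
            reset_b  : in  std_logic;
            async_in : in  std_logic;  -- flag generated in the other clock domain
            sync_out : out std_logic   -- safe to use in clk_b's domain
        );
    end entity sync_2ff;

    architecture rtl of sync_2ff is
        signal meta, stable : std_logic := '0';
    begin
        process (clk_b, reset_b)
        begin
            if reset_b = '1' then
                meta   <= '0';
                stable <= '0';
            elsif rising_edge(clk_b) then
                meta   <= async_in;  -- first flop may go metastable
                stable <= meta;      -- second flop gives it a cycle to settle
            end if;
        end process;
        sync_out <= stable;
    end architecture rtl;

Only the one-bit request and acknowledge flags go through such synchronizers; the data bus itself is held stable by the sending domain until the acknowledge comes back, so it does not need to be synchronized bit by bit.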
For getting a "valid/ack" pair of flags across clock domains, you may wish to look at the Flancter; here's an application of it.
But in the general case, using a dual-clock FIFO is the order of the day. Writing your own will be an interesting exercise, but validating it across all the potential clock timing cases is a nightmare. This is one of the few places I'll instantiate a CoreGen block.

MPI Large Data all to all transfer

My MPI application has some processes that generate large data. Say we have N+1 processes (one for master control, the others are workers); each of the worker processes generates large data, which is currently simply written to normal files, named file1, file2, ..., fileN. The size of each file may be quite different. Now I need to send each fileM to the rank M process to do the next job, so it's essentially an all-to-all data transfer.
My problem is how should I use the MPI API to send these files efficiently? I used to use a Windows shared folder to transfer them before, but I don't think that is a good idea.
I have thought about MPI_File and MPI_Alltoall, but these functions don't seem to be well suited to my case. Simple MPI_Send and MPI_Recv seem hard to use because every process needs to transfer a large amount of data, and I don't want to use a distributed file system for now.
It's not possible to answer your question precisely without a lot more data, data that only you have right now. So here are some generalities, you'll have to think about them and see if and how to apply them in your situation.
If your processes are generating large data sets they are unlikely to be doing so instantaneously. Instead of thinking about waiting until the whole data set is created, you might want to think about transferring it chunk by chunk.
I don't think that MPI_Send and _Recv (or the variations on them) are hard to use for large amounts of data. But you need to give some thought to finding the right amount to transfer in each communication between processes. With MPI it is not a simple case of there being a message startup time plus a message transfer rate which apply to all messages sent. Some IBM implementations, for example, on some of their hardware had different latencies and bandwidths for small and large messages. However, you have to figure out for yourself what the tradeoffs between bandwidth and latency are for your platform. The only general advice I would give here is to parameterise the message sizes and experiment until you maximise the ratio of computation to communication.
As an aside, one of the tests you should already have done is measured message transfer rates for a wide range of sizes and communications patterns on your platform. That's kind of a basic shake-down test when you start work on a new system. If you don't have anything more suitable, the STREAMS benchmark will help you get started.
I think that all-to-all transfers of large amounts of data are an unusual scenario in the kinds of programs for which MPI is typically used. You may want to give some serious thought to redesigning your application to avoid such transfers. Of course, only you know if that is feasible or worthwhile. From what little information you provide, it seems as if you might be implementing some kind of pipeline; in such cases the usual pattern of communication is from process 0 to process 1, process 1 to process 2, 2 to 3, etc.
Finally, if you happen to be working on a computer with shared memory (such as a multicore PC) you might think about using a shared memory approach, such as OpenMP, to avoid passing large amounts of data around.
