Vivado routing metrics - fpga

I'm trying to gather metrics to measure routing utilization on a set of different designs. Any pointers would help a lot!
In the Router Utilization Summary, what does "Global Vertical/Horizontal Routing Utilization" measure?
Global Vertical Routing Utilization = 15.3424 %
Global Horizontal Routing Utilization = 16.3981 %
Routable Net Status*
*Does not include unroutable nets such as driverless and loadless.
Run report_route_status for detailed report.
Number of Failed Nets = 0
Number of Unrouted Nets = 0
Number of Partially Routed Nets = 0
Number of Node Overlaps = 0
Is there any way to access the per-CLB metrics mentioned here (especially the "Horizontal/Vertical routing congestion per CLB") through Tcl? I've searched far and wide to no avail.

Routing report
CLBs in an FPGA are connected through configurable switches that redirect traffic, as shown in this source, which also states:
A vertical (horizontal) channel is defined as a set of tracks between two consecutive columns (rows) of CLBs; wire segments connecting CLB pins are aligned into tracks running in the channel.
So it seems that the Vivado report measures how much the switches are used in the vertical and horizontal directions. I don't know how useful this information is to the end user. A big disproportion between these percentages might indicate that some particular hard IPs are overutilized and all the connections run in one direction; other than that, I would expect the percentages to be quite similar, and together they give an indication of how "crowded" your design is.
Metrics
For the second question, I believe you can't access the metrics because the link you have shown is just a heat map that Vivado draws over the device.
You can, however, access the underlying data used to generate the map, for instance by running the timing report:
report_timing_summary -delay_type min_max -report_unconstrained -check_timing_verbose -max_paths 10 -input_pins -routable_nets -name timing_1
This gives you access to the min slack per placed BEL.

Related

How to change the behavior of nodes (cars) and RSUs in an OMNeT++ Veins project

I have set up my environment using OMNeT++, SUMO and Veins in Ubuntu. I want to reduce packet loss in an emergency situation among vehicles and improve packet delivery time and cost. My project is about choosing the suitable processing position among cluster heads (nodes), road side units (RSUs) and the cloud. There are certain tasks I need to implement in my Veins project. I have configured 50 nodes and 4 RSUs, with a data rate of about 6 Mbps and packet sizes up to 2 MB.
Therefore, how can I change the behavior of vehicles (nodes), road side unit (rsu) and cloud in order to implement the following parameters?
processing rate of clusters (nodes) = 3 Mbps.
processing rate of RSUs = 7 Mbps.
processing rate of cloud = 10 Mbps.
the range of clusters (nodes) = 60 m.
the range of RSU = 120 m.
the range of cloud = 500 m.
If you could help with setting these parameters, I would appreciate it.
Thank you.
If you are talking about transmission rate, you can set the bit rate in the ini file (check the Veins example). If you mean processing delay, it is usually simulated by scheduling self-messages (check the tictoc example). In terms of transmission range, Veins uses the Free Space Propagation model and the related parameters are set in the ini file, so you can change them to get the required range. Finally, I recommend reading more about Veins and how it deals with the parameters you asked about. There are a lot of answered questions on Stack Overflow about these topics.
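If you model processing as a store-and-forward delay at the rates from the question, the delay for a 2 MB packet is just size / rate; in OMNeT++ you would then schedule a self-message that fires after that interval. A minimal sketch of the arithmetic (the function name and the 1 MB = 8e6 bits convention are my assumptions, not Veins API):

```go
package main

import "fmt"

// processingTime returns the time in seconds to process a payload of
// sizeMB megabytes at rateMbps megabits per second (1 MB = 8e6 bits).
func processingTime(sizeMB, rateMbps float64) float64 {
	return sizeMB * 8.0 / rateMbps
}

func main() {
	// Rates from the question: nodes 3 Mbps, RSUs 7 Mbps, cloud 10 Mbps.
	for _, tc := range []struct {
		name string
		rate float64
	}{{"node", 3}, {"RSU", 7}, {"cloud", 10}} {
		fmt.Printf("2 MB at %-5s (%2.0f Mbps): %.2f s\n",
			tc.name, tc.rate, processingTime(2, tc.rate))
	}
}
```

In Veins terms, each computed duration would be the delay you pass to `scheduleAt(simTime() + delay, msg)` for the self-message.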

NiFi ingestion with 10,000+ sensors?

I am planning to use NiFi to ingest data from more than 10,000 sensors. There are 50-100 types of sensors, each of which will send a specific metric to NiFi.
I am pondering whether I should assign one port to listen to all the sensors, or one port per type of sensor to simplify my data pipeline. Which is the better option?
Is there an upper limit to the number of ports I can "listen" on using NiFi?
NiFi is such a powerful tool. You can do either of your ideas, but I would recommend doing what is easier for you. If you have data source sensors that need different data flows, use different ports. However, if you can fire everything at a single port, I would do that. It makes the system easier to implement, more consistent, easier to support later, and easier to scale.
In a large-scale, highly available NiFi deployment, you may want a load balancer to handle the inbound data. This would push the sensor data toward a single host:port on the LB appliance, which then directs it to a NiFi cluster of 3, 5, 10+ nodes.
I agree with the other answer that once scaling comes into play, an external load balancer in front of NiFi would be helpful.
In regards to the flow design, I would suggest using a single exposed port to ingest all the data, and then use RouteOnAttribute or RouteOnContent processors to direct specific sensor inputs into different flow segments.
One of the strengths of NiFi is the generic nature of flows given sufficient parameterization, so taking advantage of flowfile attributes to handle different data types dynamically scales and performs better than duplicating a lot of flow segments to statically handle slightly differing data.
Running multiple ingestion ports carries substantial overhead compared to a single port with routed flowfiles, so the single-port design will give you a large performance improvement. You can also organize your flow segments into hierarchically nested groups using the Process Group features, to keep different flow segments cleanly organized and to enforce access controls.
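The "single ingest point, route by attribute" idea can be illustrated outside NiFi. The sketch below is a loose, hypothetical model of a flowfile and a RouteOnAttribute-style dispatch; it is not NiFi's actual API, just the shape of the pattern:

```go
package main

import "fmt"

// FlowFile loosely models a NiFi flowfile: content plus key/value
// attributes. This is a hypothetical simplification, not NiFi's API.
type FlowFile struct {
	Attributes map[string]string
	Content    []byte
}

// route dispatches a flowfile on its "sensor.type" attribute -- the same
// idea as a RouteOnAttribute processor feeding per-family process groups.
func route(ff FlowFile, handlers map[string]func(FlowFile)) {
	if h, ok := handlers[ff.Attributes["sensor.type"]]; ok {
		h(ff)
		return
	}
	// Analogous to RouteOnAttribute's "unmatched" relationship.
	fmt.Println("unmatched flowfile")
}

func main() {
	handlers := map[string]func(FlowFile){
		"temperature": func(ff FlowFile) { fmt.Println("temperature flow segment") },
		"pressure":    func(ff FlowFile) { fmt.Println("pressure flow segment") },
	}
	ff := FlowFile{Attributes: map[string]string{"sensor.type": "temperature"}}
	route(ff, handlers)
}
```

Adding a new sensor family means registering one more handler (in NiFi: one more relationship and process group), not duplicating a whole ingestion pipeline.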
2020-06-02 Edit to answer questions in comments
Yes, you would have a lot of relationships coming out of the initial RouteOnAttribute processor at the ingestion port. However, you can segment these (route all flowfiles with X attribute in "family" X here, Y here, etc.) and send each to a different process group which encapsulates more specific logic.
Think of it like a physical network: at a large organization, you don't buy 1000 external network connections and hook each individual user's machine directly to the internet. Instead, you obtain one (plus redundancy/backup) large connection to the internet and use a router internally to direct the traffic to the appropriate endpoint. This has management benefits as well as cost, scalability, etc.
The overhead of multiple ingestion ports is that you have additional network requirements (S2S is very efficient when communicating, but there is overhead on a connection basis), multiple ports to be opened and monitored, and CPU to schedule & run each port's ingestion logic.
I've observed this pattern in practice at scale in multinational commercial and government organizations, and the performance improvement was significant when switching to a "single port; route flowfiles" pattern vs. "input port per flow" design. It is possible to accomplish what you want with either design, but I think this will be much more performant and easier to build & maintain.

Why is TSync (Time Synchronization) needed in Adaptive AUTOSAR?

I'm a rookie in Adaptive AUTOSAR.
I can't imagine why Time Synchronization (TSync) is needed. The system time of ECUs can be synchronized by PTP.
Could you explain why TSync is needed even though PTP synchronizes time across a distributed system? I would also welcome any documents or materials that would help me understand TSync's usages or use cases.
The reason for the existence of time sync, along with the definition of time domains, is that you need to be able to define different time domains across different bus systems within the vehicle. One example of a not directly obvious time domain is the metering of operating hours.
On top of that, time domains can cross AUTOSAR platforms, i.e. a time domain may consist of both CP and AP nodes.
You can find explanations for time sync in (e.g) the AUTOSAR documents TPS Manifest and TPS System Template.
There need to be different time bases in a vehicle.
Examples of Time Bases in vehicles are:
• Absolute, which is based on a GPS based time.
• Relative, which represents the accumulated overall operating time of a vehicle, i.e. this Time Base does not start with a value of zero whenever the vehicle starts operating.
• Relative, starting at zero when the ECU begins its operation.

What is the right way to do model parallelism in tensorflow?

I have multiple 4 GB GPU nodes, so I want them to run a huge model in parallel. I hoped that splitting layers into several pieces with appropriate device scopes would enable model parallelism, but it turns out that it doesn't reduce the memory footprint of the master node (task 0). (10-node configuration: master 20 GB, followers 2 GB; 1-node configuration: master 6-7 GB.)
My suspicion is that the gradients are not distributed because I didn't set up the right device scope for them.
My model is available on GitHub: https://github.com/nakosung/tensorflow-wavenet/tree/model_parallel_2
The device placement log is here: https://gist.github.com/nakosung/a38d4610fff09992f7e5569f19eefa57
So the good news is that you are using colocate_gradients_with_ops, which means that you are ensuring that the gradients are computed on the same device as the ops they belong to. (https://github.com/nakosung/tensorflow-wavenet/blob/model_parallel_2/train.py#L242)
Reading the device placement log is a little difficult, so I would suggest using TensorBoard to try visualizing the graph. It has options to be able to visualize how nodes are being placed on devices.
Second, you can try to see how the sizes of your operations map down to devices -- it is possible that the largest layers (largest activations, or largest weights) are disproportionately placed on some nodes. You might use https://github.com/tensorflow/tensorflow/blob/6b1d4fd8090d44d20fdadabf06f1a9b178c3d80c/tensorflow/python/tools/graph_metrics.py to analyze your graph and get a better picture of where resources are required.
Longer term we'd like to try to solve some of these placement problems automatically, but so far model parallelism requires a bit of care to place things precisely.

Determine Request Latency

I'm working on creating a version of Pastry natively in Go. From the design [PDF]:
It is assumed that the application
provides a function that allows each Pastry node to determine the “distance” of a node
with a given IP address to itself. A node with a lower distance value is assumed to be
more desirable. An application is expected to implement this function depending on its
choice of a proximity metric, using network services like traceroute or Internet subnet
maps, and appropriate caching and approximation techniques to minimize overhead.
I'm trying to figure out what the best way to determine the "proximity" (i.e., network latency) between two EC2 instances programmatically from Go. Unfortunately, I'm not familiar enough with low-level networking to be able to differentiate between the different types of requests I could use. Googling did not turn up any suggestions for measuring latency from Go, and general latency techniques always seem to be Linux binaries, which I'm hoping to avoid in the name of fewer dependencies. Any help?
Also, I note that the latency should be on the scale of 1ms between two EC2 instances. While I plan to use the implementation on EC2, it could hypothetically be used anywhere. Is latency generally so bad that I should expend the effort to ensure the network proximity of two nodes? Keep in mind that most Pastry requests can be served in log base 16 of the number of servers in the cluster (so for 10,000 servers, it would take approximately 3 requests, on average, to find the key being searched for). Is the latency from, for example, EC2's Asia-Pacific region to EC2's US-East region enough to justify the increased complexity and the overhead introduced by the latency checks when adding nodes?
A common distance metric in networking is to count the number of hops (node-hops in-between) a packet needs to reach its destination. This metric was also mentioned in the text you quoted. This could give you adequate distance values even for the low-latency environment you mentioned (EC2 “local”).
For the Go logic itself, one would think the net package is what you are looking for. And indeed, for latency tests (ICMP ping) you could use it to create an IP connection:
conn, err := net.Dial("ip4:icmp", "127.0.0.1")
create your ICMP packet structure and data, and send it. (See the Wikipedia page on ICMP; IPv6 needs a different format.) Unfortunately you can't create an ICMP connection directly, like you can with TCP and UDP, so you will have to handle the packet structure yourself.
As conn, of type Conn, is a Writer, you can then pass it your data: the ICMP packet you defined.
In the ICMP Type field you can specify the message type. Values 8, 1 and 30 are the ones you are looking for: 8 is your echo request, the reply will be of type 1, and maybe 30 gives you some more information.
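Building that packet by hand looks roughly like the sketch below: an ICMPv4 echo request header (type 8, code 0) followed by the RFC 1071 Internet checksum over the whole message. The resulting byte slice is what you would write to the conn above (the function names are my own):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// checksum computes the 16-bit one's-complement Internet checksum
// (RFC 1071) over b.
func checksum(b []byte) uint16 {
	var sum uint32
	for i := 0; i+1 < len(b); i += 2 {
		sum += uint32(binary.BigEndian.Uint16(b[i:]))
	}
	if len(b)%2 == 1 {
		sum += uint32(b[len(b)-1]) << 8 // pad the odd trailing byte
	}
	for sum>>16 != 0 { // fold the carries back in
		sum = (sum >> 16) + (sum & 0xffff)
	}
	return ^uint16(sum)
}

// echoRequest builds an ICMPv4 echo request (type 8, code 0) with the
// given identifier, sequence number and payload.
func echoRequest(id, seq uint16, payload []byte) []byte {
	pkt := make([]byte, 8+len(payload))
	pkt[0] = 8 // type: echo request
	pkt[1] = 0 // code
	binary.BigEndian.PutUint16(pkt[4:], id)
	binary.BigEndian.PutUint16(pkt[6:], seq)
	copy(pkt[8:], payload)
	// Checksum is computed with the checksum field itself zeroed.
	binary.BigEndian.PutUint16(pkt[2:], checksum(pkt))
	return pkt
}

func main() {
	fmt.Printf("% x\n", echoRequest(1, 1, nil))
}
```

A quick sanity check on the checksum: recomputing it over the finished packet (checksum field included) must yield zero, which is how receivers validate ICMP messages.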
Unfortunately, for counting the network hops, you will need the IP packet header fields. This means, you will have to construct your own IP packets, which net does not seem to allow.
Checking the source of Dial(), it uses internetSocket, which is not exported/public. I'm not really sure if I'm missing something, but it seems there is no simple way to construct your own IP packets with customizable header values. You'd have to check further how DialIP sends packets with internetSocket and duplicate and adapt that code/concept. Alternatively, you could use cgo and a system library to construct your own packets (this would add yet more complexity, though).
If you are planning on using IPv6, you will (also) have to look into ICMPv6. Both packet types have a different structure from their v4 versions.
So, I'd suggest using simple latency (a timed ping) as a simple(r) implementation and then adding node hops later, if you need them. If you have both in place, you might also want to combine the two (fewer hops does not automatically mean better; think of long overseas cables, etc.).
