Horizontal Scalability - OpenDaylight

Is there any way in ODL to split the same namespace into a few shards? The current performance of our application for topology discovery does not satisfy us, and it would be good if we could use a few controllers to do it. For this purpose, we would divide our controllers into two sets: frontend and backend controllers. Each frontend controller would collect data only from some subnet of the network and store it in the data store. Each frontend controller should use the same namespace but a different shard, because we want each of them to be a shard leader for high performance. Backend controllers would use the collected data; they should receive all data from all frontend controllers. Currently we don't care about HA.
Is this possible with the current ODL?
If not, are there any plans for horizontal scalability in the near future?

Currently ODL clustering is centered around HA. Some building blocks have been put in place to do sub-sharding for scale-out, but none of the stakeholders that contribute to ODL are currently investing resources in horizontal scalability. If you have a strong interest and a use case, the ODL community would welcome your contributions.

Related

Highload data update architecture

I'm developing a parcel tracking system and thinking about how to improve its performance.
Right now we have one table in Postgres named parcels containing things like id, last known position, etc.
Every day about 300,000 new parcels are added to this table. The parcel data is taken from an external API. We need to track all parcel positions as accurately as possible and reduce the time between API calls for a specific parcel.
Given such requirements, what could you suggest about the project architecture?
Right now the only solution I can think of is the producer-consumer pattern, i.e. having one process selecting all records from the parcels table in an infinite loop and then distributing the data-fetching tasks with something like Celery.
Major downsides of this solution are:
possible deadlocks, as fetching data about the same parcel can be executed at the same time on different machines
the need to control the queue size
This is a very broad topic, but I can give you a few pointers. Once you reach the limits of vertical scaling (scaling by picking more powerful machines) you have to scale horizontally (scaling by adding more machines to the same task). So to be able to design scalable architectures you have to learn about distributed systems. Here are some topics to look into:
Infrastructure & processes for hosting distributed systems, such as Kubernetes, containers, CI/CD.
Scalable forms of persistence, for example different kinds of distributed NoSQL such as key-value stores, wide-column stores, in-memory databases, and novel scalable SQL stores.
Scalable forms of data flows and processing, for example event-driven architectures using distributed message/event queues.
For your specific problem with parcels I would recommend considering a key-value store for your position data. These can scale to billions of insertions and retrievals per day (when querying by key).
It also sounds like your data is somewhat temporary and could be kept in in-memory hot storage while the parcel has not been delivered yet (and archived afterwards). A distributed in-memory DB can scale even further in terms of insertions and queries.
You probably also want to decouple data extraction (through your API) from processing and persistence. For that you could consider introducing stream processing systems.
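To make the decoupling idea a bit more concrete, here is a minimal Java sketch, assuming an in-memory hot store and a bounded work queue as stand-ins for a real key-value store and message broker; names such as ParcelPosition and fetchFromExternalApi are hypothetical placeholders, not part of any library:

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParcelPipelineSketch {
    // Hypothetical record for the latest known position of a parcel.
    record ParcelPosition(String parcelId, double lat, double lon, long timestamp) {}

    // Hot store for undelivered parcels; a distributed key-value store would replace this map.
    private final Map<String, ParcelPosition> hotStore = new ConcurrentHashMap<>();

    // Bounded queue decouples extraction from processing and keeps the queue size under control.
    private final BlockingQueue<String> workQueue = new ArrayBlockingQueue<>(10_000);

    // Producer: enqueue parcel ids that need a refresh (e.g. read from the parcels table).
    public void enqueue(String parcelId) throws InterruptedException {
        workQueue.put(parcelId); // blocks when the queue is full, giving natural back-pressure
    }

    // Consumers: fetch from the external API and upsert the hot store, keyed by parcel id.
    public void startConsumers(int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        String id = workQueue.take();
                        ParcelPosition pos = fetchFromExternalApi(id); // placeholder API call
                        hotStore.put(id, pos);                         // idempotent upsert, no row locks
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
    }

    // Placeholder for the real external tracking API client.
    private ParcelPosition fetchFromExternalApi(String parcelId) {
        return new ParcelPosition(parcelId, 0.0, 0.0, System.currentTimeMillis());
    }
}
```

Because every worker only upserts the latest position keyed by parcel id, two workers handling the same parcel cannot deadlock the way concurrent row updates in Postgres might, and the bounded queue addresses the queue-size concern.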

Microservices interdependency

One of the benefits of microservice architecture is that you can scale heavily used parts of the application without scaling the other parts, which supposedly provides benefits around cost.
However, my question is: if a heavily used microservice depends on another microservice to do its work, wouldn't you have to scale the other service as well, seemingly defeating the purpose? And if a microservice calls another microservice in real time to do its job, does that mean the microservice boundaries are not established correctly?
There's no rule of thumb for that.
Scaling usually depends on some metrics, and when certain thresholds are reached new instances are created. The same goes for the case when they are no longer needed.
Some services do simple, fast tasks, like taking an input and writing it to the database, while others may run longer tasks that can take any amount of time.
If a service that needs to scale is calling a service that can easily handle heavy loads in a reliable way, then there is no need to scale that other service.
The idea behind scaling is to scale up when needed in order to support the load, and then scale down whenever the load returns to its regular range in order to reduce costs.
There are two topics to discuss here.
The first is that it is usually not good practice for two microservices to communicate synchronously, because you are coupling them in time: one service has to wait for the other to finish its task. So it is normally a better approach to use a message queue to decouple the producer and consumer; this way the load of one service doesn't affect the other.
However, there are situations in which synchronous communication between two services is necessary, but that doesn't necessarily mean both have to scale the same way. For example, if one service has to make several calls to other services, query a database, or perform other heavy computational tasks, while the other service only sorts an array, the first service will probably have to scale much more than the second in order to process the same number of requests, because its threads will be occupied for a longer time.
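As a rough, hedged illustration of that asymmetry, the Java sketch below decouples a "heavy" service from a "light" sorting service with in-process queues standing in for a message broker, and gives the heavy one eight workers versus a single worker for the light one; the pool sizes play the role of instance counts, and all names are invented for the example:

```java
import java.util.Arrays;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class UnevenScalingSketch {
    // Queues stand in for a message broker between the two services.
    static final BlockingQueue<int[]> heavyWork = new LinkedBlockingQueue<>();
    static final BlockingQueue<int[]> lightWork = new LinkedBlockingQueue<>();

    public static void main(String[] args) {
        // The heavy service gets many more workers ("instances") than the light one.
        ExecutorService heavyService = Executors.newFixedThreadPool(8);
        ExecutorService lightService = Executors.newFixedThreadPool(1);

        // Incoming requests would be put(...) onto heavyWork by the API edge.
        for (int i = 0; i < 8; i++) {
            heavyService.submit(UnevenScalingSketch::heavyWorker);
        }
        lightService.submit(UnevenScalingSketch::lightWorker);
    }

    // Stands in for a service that calls other services, queries a database, etc.
    static void heavyWorker() {
        try {
            while (true) {
                int[] payload = heavyWork.take();
                Thread.sleep(200);        // simulated expensive work
                lightWork.put(payload);   // asynchronous hand-off; no waiting for a reply
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // The "cheap" service from the example above: it only sorts an array.
    static void lightWorker() {
        try {
            while (true) {
                Arrays.sort(lightWork.take());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Both services keep up with the same request rate, but only the slow one needed to be scaled out.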

Use case of service bus in microservice architecture

I am trying to learn how to architect a business application adhering to microservices fundamentals and their considerations. I have come across a question about which I am a bit confused.
In a microservice architecture with multiple microservices that each have their own DB, if data needs to be shared among them, what should be the preferred way: a service bus, or calling them via HttpClient?
I know that with a message queue over a service bus, whenever a message needs to be shared one microservice can publish it and all subscribers can then retrieve it, but if that information also needs to be stored in the other microservice application's DB, wouldn't that become redundant data?
So isn't it enough to simply read the data via HttpClient whenever needed?
Looking forward to the replies; thanks in advance for the help.
It depends upon other factors like latency, redundancy and availability. Both options work: keeping redundant data, or making a REST call whenever you need the data.
Points that work against direct HTTP client calls:
It impacts availability, reducing the overall availability of the system.
It impacts performance and latency. Suppose there is an operation in service A that needs data from service B, and the frequency of that operation is very high. In that case, it reduces performance and increases latency as well as response time.
It doesn't support JOINs, so you have to combine the data yourself, which also impacts performance.
Points that work against the message bus / event-driven approach:
Duplicate data, which increases the complexity of the system in order to keep the copies in sync.
It reduces the consistency of the system; the system is now eventually consistent.
In system design, no option is incorrect. All options have pros and cons, so choose wisely according to your requirements and system.
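As a hedged illustration of the event-driven option, here is a small Java sketch in which the subscribing service keeps its own local read copy of data published by the owning service; ProductUpdated and the in-memory map are made up for the example and stand in for a real bus message and the subscriber's own DB table:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical event published on the service bus by the service that owns the data.
record ProductUpdated(String productId, String name, double price) {}

// Subscriber side: a denormalized read model kept inside the consuming microservice.
class ProductReadModel {
    // Stands in for the subscriber's own database table.
    private final Map<String, ProductUpdated> localCopy = new ConcurrentHashMap<>();

    // Called by the bus client whenever the owning service publishes a change.
    public void onProductUpdated(ProductUpdated event) {
        localCopy.put(event.productId(), event); // redundant, eventually consistent copy
    }

    // Other use cases in this service read locally -- no HTTP call, no cross-service JOIN.
    public Optional<ProductUpdated> findById(String productId) {
        return Optional.ofNullable(localCopy.get(productId));
    }
}
```

Reads stay local and keep working even when the owning service is down, at the price of exactly the duplication and eventual consistency listed above.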

NiFi ingestion with 10,000+ sensors?

I am planning to use NiFi to ingest data from more than 10,000 sensors. There are 50-100 types of sensors, each of which will send a specific metric to NiFi.
I am pondering whether I should assign one port to listen to all the sensors or one port per type of sensor to facilitate my data pipeline. Which is the better option?
Is there an upper limit to the number of ports I can "listen" on using NiFi?
NiFi is such a powerful tool. You can do either of your ideas, but I would recommend doing what is easier for you. If you have sensor data sources that need different data flows, use different ports. However, if you can fire everything at a single port, I would do that: it is easier to implement, consistent, easier to support later, and easier to scale.
In a large-scale, highly available NiFi deployment, you may want a load balancer to handle the inbound data. It would push the sensor data toward a single host:port on the LB appliance, which then directs it to a NiFi cluster of 3, 5, 10 or more nodes.
I agree with the other answer that once scaling comes into play, an external load balancer in front of NiFi would be helpful.
In regards to the flow design, I would suggest using a single exposed port to ingest all the data, and then use RouteOnAttribute or RouteOnContent processors to direct specific sensor inputs into different flow segments.
One of the strengths of NiFi is the generic nature of flows given sufficient parameterization, so taking advantage of flowfile attributes to handle different data types dynamically scales and performs better than duplicating a lot of flow segments to statically handle slightly differing data.
The performance overhead of running multiple ingestion ports vs. a single port with routed flowfiles is substantial, so the single-port approach will give you a large performance improvement. You can also organize your flow segments into hierarchical nested groups using the Process Group features, to keep different flow segments cleanly organized and to enforce access controls as well.
2020-06-02 Edit to answer questions in comments
Yes, you would have a lot of relationships coming out of the initial RouteOnAttribute processor at the ingestion port. However, you can segment these (route all flowfiles whose attribute places them in "family" X here, family Y there, etc.) and send each to a different process group which encapsulates more specific logic.
Think of it like a physical network: at a large organization, you don't buy 1000 external network connections and hook each individual user's machine directly to the internet. Instead, you obtain one (plus redundancy/backup) large connection to the internet and use a router internally to direct the traffic to the appropriate endpoint. This has management benefits as well as cost, scalability, etc.
The overhead of multiple ingestion ports is that you have additional network requirements (S2S is very efficient when communicating, but there is overhead on a connection basis), multiple ports to be opened and monitored, and CPU to schedule & run each port's ingestion logic.
I've observed this pattern in practice at scale in multinational commercial and government organizations, and the performance improvement was significant when switching to a "single port; route flowfiles" pattern vs. "input port per flow" design. It is possible to accomplish what you want with either design, but I think this will be much more performant and easier to build & maintain.
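Outside of NiFi itself, the "single ingestion point, route on an attribute" idea can be sketched roughly as follows in Java; the sensor types and handler methods are invented for illustration, and in a real flow the routing would be done by a RouteOnAttribute processor rather than code:

```java
import java.util.Map;
import java.util.function.Consumer;

public class SensorRoutingSketch {
    // A flowfile-like message: raw payload plus attributes such as the sensor type.
    record SensorMessage(Map<String, String> attributes, byte[] payload) {}

    // One handler per sensor "family", analogous to a process group behind RouteOnAttribute.
    private final Map<String, Consumer<SensorMessage>> routes = Map.of(
            "temperature", this::handleTemperature,
            "vibration", this::handleVibration);

    // Single ingestion point: every sensor sends here, and routing happens on an attribute.
    public void ingest(SensorMessage message) {
        String type = message.attributes().getOrDefault("sensor.type", "unknown");
        routes.getOrDefault(type, this::handleUnknown).accept(message);
    }

    private void handleTemperature(SensorMessage m) { /* temperature-specific flow segment */ }

    private void handleVibration(SensorMessage m) { /* vibration-specific flow segment */ }

    private void handleUnknown(SensorMessage m) { /* dead-letter / failure relationship */ }
}
```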

In-Memory Data Grid for Java Project [closed]

I'm looking to use an in-memory data grid for my Java project. I know there are a few relevant products such as VMware GemFire, GigaSpaces XAP, IBM eXtreme Scale and others. Can someone elaborate from their experience with any of these tools and how they compare to one another? Thanks, Alex
(disclaimer - I work for GigaSpaces)
Hi Alex
There are many criteria to compare by; it really depends on what you're trying to do. In-memory data grids have a lot of use cases, e.g. caching, OLTP, high-throughput event processing, etc.
In general, the main criteria you should be looking at are:
Programming model: support for popular Java frameworks such as Spring (XAP and GemFire support it natively).
Querying and indexing: if you want more than trivial key/value data access. Most people need SQL-like semantics, or even full-text search, and if the data grid can provide that out of the box it's a big advantage.
Ability to execute code on the grid nodes, and even colocate your code with them and handle events that are injected into the grid (e.g. objects written or updated). This is a massive scalability benefit and allows you to implement very efficient shared-nothing architectures.
Languages and API support: most data grids support at least Java and JVM-based languages (e.g. Scala), but a lot of them also support other languages and allow you to access the same data from various programming languages. For example, XAP natively supports Java, .NET and C++, and other languages through its REST and memcached interfaces. As far as APIs go, some grids support more than one API. At GigaSpaces we support Map, Spring/POJO, JPA, JDBC and others.
Transactions: this is also a big one if you want to go anywhere beyond caching. When using memory as your system of record, you should be able to roll back state in case you have an error or a bug; otherwise you end up with corrupt data. Another important thing is what types of transactions are supported. A lot of data grids only support "local" transactions, i.e. within the boundaries of a single node / partition / shard (which is probably what you want in most cases for performance reasons). But more advanced grids also support distributed transactions and know how to seamlessly upgrade from local to distributed when needed.
Replication: there are various models here (synchronous, asynchronous, hybrid) and you need to decide which of them is best for your use case. Some grids also have explicit support for cross-cluster replication over WAN, which is important if you're implementing DR.
Data partitioning and scalability: how the grid partitions data (fixed vs. consistent hashing), what level of control the user has over it, and whether it supports dynamic addition of servers to the grid to increase capacity (see the sketch after this list).
Administration and monitoring: last but not least, what kind of facilities are provided out of the box, such as monitoring and administration hooks (JMX or another administrative API), user interfaces and integration with third-party systems.
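To make the partitioning criterion a little more concrete, here is a hedged Java sketch of a consistent-hash ring, the kind of scheme several grids use internally; it is illustrative only and not the API of any particular product:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring: adding a server only remaps the keys on its arcs of the ring.
public class ConsistentHashRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodesPerServer = 128; // smooths out the key distribution

    public void addServer(String server) {
        for (int i = 0; i < virtualNodesPerServer; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    public void removeServer(String server) {
        for (int i = 0; i < virtualNodesPerServer; i++) {
            ring.remove(hash(server + "#" + i));
        }
    }

    // A key is owned by the first virtual node clockwise from its hash position.
    public String serverFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no servers in the ring");
        }
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {       // first 8 bytes as a 64-bit ring position
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Compared to fixed hashing (key mod number-of-partitions), adding a server here only moves a small fraction of keys, which is what makes dynamic scale-out cheap.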
The following links are a good place to start:
http://gojko.net/2009/06/01/oracle-coherence-vs-gigaspaces-xap/. Also read the comments
http://www.neovise.com/neovise-data-caching-performance-technical-white-paper - a recent comparison between GigaSpaces and GemFire which I think speaks for itself :)
HTH,
Uri
There's a summary of the In-Memory Data Grid market by Gartner called "Competitive Landscape: In-Memory Data Grids". You can see a copy at: http://www.gartner.com/technology/reprints.do?id=1-1HCCIMJ&ct=130718&st=sb
Oracle Coherence (previously Tangosol Coherence) is and has long been the market leader in this space, but it is a commercial product. As I explained elsewhere recently:
Pros:
Elastic. Just add nodes. Auto-discovery. Auto-load-balancing. No data loss. No interruption. Every time you add a node, you get more data capacity and more throughput.
Use both RAM and flash. Transparently. Easily handle 10s or even 100s of gigabytes per Coherence node (e.g. up to a TB or more per physical server).
Automatic high availability (HA). Kill a process, no data loss. Kill a server, no data loss.
Datacenter continuous availability (CA). Kill a data center, no data loss.
RESTful APIs available from any language. Native APIs and client libraries for C/C++, C#, .NET and Java.
In addition to simple key-value (K/V) caching, it also supports queries (including some SQL), parallel queries, indexes (including custom indexes), a rich eventing model (for event-driven systems like exchanges), transactions (including MVCC), parallel execution of both scalar (EntryProcessor) and aggregate (ParallelAwareAggregator) functions, cache triggers, etc.
Easy to integrate with a database via read-through, read-ahead, write-through and write-behind caching; a generic sketch of the read-through/write-behind pattern appears at the end of this answer. Automatically refreshes just the changed data when changes occur in the database (leveraging Oracle GoldenGate technology).
Coherence Incubator: The Coherence Incubator
memcache protocol support: Memcached interface for Oracle Coherence project on github
Thousands of customers, some using Coherence in production now for over a decade.
Cons:
As of Coherence 12.1.2, the cache itself is not persistent.
It costs money.
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.
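For readers unfamiliar with the read-through / write-behind integration mentioned in the pros above, here is a generic, hedged Java sketch of the pattern; it deliberately uses made-up interfaces rather than Coherence's actual CacheStore API, just to show the shape of the idea:

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical backing-store interface; in a real grid this would be your database adapter.
interface BackingStore {
    String load(String key);
    void store(String key, String value);
}

// Read-through: a cache miss loads from the database. Write-behind: writes are acknowledged
// immediately in memory and flushed to the database asynchronously in the background.
class ReadThroughWriteBehindCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final BlockingQueue<Map.Entry<String, String>> pendingWrites = new LinkedBlockingQueue<>();
    private final BackingStore store;

    ReadThroughWriteBehindCache(BackingStore store) {
        this.store = store;
        // Single background flusher; a real grid batches and coalesces these writes.
        Executors.newSingleThreadExecutor().submit(this::flushLoop);
    }

    public String get(String key) {
        return cache.computeIfAbsent(key, store::load); // read-through on a miss
    }

    public void put(String key, String value) {
        cache.put(key, value);                      // in-memory copy updated right away
        pendingWrites.offer(Map.entry(key, value)); // database write happens later
    }

    private void flushLoop() {
        try {
            while (true) {
                Map.Entry<String, String> entry = pendingWrites.take();
                store.store(entry.getKey(), entry.getValue());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```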
