Application architecture for scalable Hyperledger v1.4 with IoT data - amazon-ec2

I am working on a Hyperledger application that stores sensor data from IoT devices.
I am using HLF v1.4 with Raft. Each IoT device will provide JSON data at fixed intervals, which gets stored in Hyperledger. I have worked with HLF v1.3, which doesn't scale very well.
With v1.4, I am planning to start with a 2-organization setup with 5 peers per organization.
But the limiting factor seems to be that, as the number of blocks grows with new transactions, querying the network takes longer.
What steps can be taken to scale HLF with v1.4 onwards?
What server specs (RAM, CPUs) should be used for good performance when selecting a server, e.g. EC2?

You can change your block size: if you increase the block size (via the orderer's BatchSize settings in configtx.yaml: MaxMessageCount, AbsoluteMaxBytes and PreferredMaxBytes), the number of blocks will be reduced. For better query and invoke performance, you can limit the data you store in the blockchain, for example by keeping only a hash or reference on-chain (sketched below). Yes, computation speed also matters in blockchain: with better hardware, TPS will improve. Try instance types like t3.medium, or larger ones like t3.large.
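To illustrate the second point (limiting what goes on-chain), here is a minimal chaincode sketch using the fabric-chaincode-java contract API. The contract name, key scheme and the idea of storing an off-chain URI are assumptions for illustration, not part of the question; the raw sensor JSON would live in an off-chain store, and only a digest plus a pointer is written to the ledger.

```java
// Hypothetical chaincode sketch using the fabric-chaincode-java contract API.
// Instead of writing the full sensor JSON on-chain, it stores only a digest and
// an off-chain reference, which keeps blocks small and queries fast.
package org.example;

import org.hyperledger.fabric.contract.Context;
import org.hyperledger.fabric.contract.ContractInterface;
import org.hyperledger.fabric.contract.annotation.Contract;
import org.hyperledger.fabric.contract.annotation.Default;
import org.hyperledger.fabric.contract.annotation.Transaction;

@Contract(name = "SensorReadingContract")
@Default
public final class SensorReadingContract implements ContractInterface {

    // Record a reading: deviceId/timestamp form the composite key, and only the
    // hash of the JSON payload plus an off-chain URI are written to world state.
    @Transaction()
    public void recordReading(Context ctx, String deviceId, String timestamp,
                              String payloadHash, String offChainUri) {
        String key = ctx.getStub()
                        .createCompositeKey("reading", deviceId, timestamp)
                        .toString();
        ctx.getStub().putStringState(key, payloadHash + "|" + offChainUri);
    }

    // Fetch a single reading reference by device and timestamp.
    @Transaction()
    public String getReading(Context ctx, String deviceId, String timestamp) {
        String key = ctx.getStub()
                        .createCompositeKey("reading", deviceId, timestamp)
                        .toString();
        return ctx.getStub().getStringState(key);
    }
}
```

Queries then only touch small state entries, which keeps block size and query latency down as the chain grows.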

Related

Load 600+ million records in Synapse Dedicated Pool with Oracle as Source

I am trying to do a full load of a very large table (600+ million records) which resides in an Oracle on-premises database. My destination is an Azure Synapse Dedicated Pool.
I have already tried the following:
Using the ADF Copy activity with source partitioning, as the source table has 22 partitions
I increased the copy parallelism and DIU to a very high level
Still, I am only able to fetch 150 million records in 3 hrs, whereas the ask is to complete the full load in around 2 hrs (roughly 6x the current throughput), as the source would be frozen to users during that time frame so that Synapse can copy the data.
How can a full copy of the data be done from Oracle to Synapse in that time frame?
For a change, I tried loading data from Oracle to ADLS Gen 2, but it's slow as well
There are a number of factors to consider here. Some ideas:
how fast can the table be read? What indexing / materialized views are in place? Is there any contention at the database level to rule out?
Recommendation: ensure database is set up for fast read on the table you are exporting
as you are on-premises, what is the local network card setup and throughput?
Recommendation: ensure local network setup is as fast as possible
as you are on-premises, you must be using a Self-hosted Integration Runtime (SHIR). What is the spec of this machine? eg 8GB RAM, SSD for spooling etc as per the minimum specification. Where is this located? eg 'near' the datasource (in the same on-premises network) or in the cloud. It is possible to scale out SHIRs by having up to four nodes but you should ensure via the metrics available to you that this is a bottleneck before scaling out.
Recommendation: consider locating the SHIR 'close' to the datasource (ie in the same network)
is the SHIR software version up-to-date? This gets updated occasionally so it's good practice to keep it updated.
Recommendation: keep the SHIR software up-to-date
do you have ExpressRoute or are you going across the internet? ExpressRoute would probably be faster
Recommendation: consider ExpressRoute. Alternatively, consider Data Box for a large one-off export.
you should almost certainly land directly in ADLS Gen 2 or blob storage. Going straight into the database could result in contention there, and you would be dealing with Synapse concepts such as transaction logging, DWU, resource classes and queuing contention, among others. View the metrics for the storage in the Azure portal to determine whether it is under stress. If it is under stress (which I think unlikely), consider multiple storage accounts
Recommendation: load data to ADLS Gen 2. Although this might seem like an extra step, it provides a recovery point and avoids the contention issues caused by attempting to do the extract and load all at the same time. I would only load directly to the database if you can prove it goes faster and you definitely don't need the recovery point
what format are you landing in the lake? Converting to parquet is quite compute intensive for example. Landing to the lake does leave an audit trail and give you a position to recover from if things go wrong
Recommendation: use parquet for a compressed format. You may need to optimise the file size.
ultimately the best thing to do would be one big bulk load (say taking the weekend) and then do incremental upserts using a CDC mechanism. This would allow you to meet your 2 hour window.
Recommendation: consider a one-off big bulk load and CDC / incremental loads to stay within the timeline
In summary, it's probably your network but you have a lot of investigation to do first, and then a number of options I've listed above to work through.
wBob provided a good summary of things you could look at to increase your transfer speed. In addition to that, you could try to bulk export your data into chunks of data files and transfer the files in parallel to Azure Data Lake or Azure Blob Storage; this way you can maximize your network throughput.
Once the data is on the data lake, you can scale up your Synapse instance and take advantage of fast loads using the COPY command (see the sketch below).
I faced the same problem in our organization, and the fastest way to get the data out of SQL Server was using bcp into a fast storage layer.
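To make the COPY step above concrete, here is a hedged sketch of issuing Synapse's COPY INTO statement over JDBC from Java once the parquet files are staged in ADLS Gen 2. The table name, storage path, credential method and connection details are placeholders rather than values from the question, and the mssql-jdbc driver is assumed to be on the classpath.

```java
// Hedged sketch: after the parquet files have been landed in ADLS Gen 2, run the
// Synapse COPY INTO statement from a small Java client over JDBC. The table name,
// storage URL and authentication method below are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SynapseCopyLoad {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://<workspace>.sql.azuresynapse.net:1433;"
                   + "databaseName=<dedicated_pool>;encrypt=true";
        try (Connection conn = DriverManager.getConnection(url, "<user>", "<password>");
             Statement stmt = conn.createStatement()) {
            // COPY INTO reads the staged parquet files directly from the lake,
            // which is the fast path into a dedicated SQL pool.
            String copySql =
                "COPY INTO dbo.BigTable "
              + "FROM 'https://<account>.dfs.core.windows.net/<container>/bigtable/*.parquet' "
              + "WITH (FILE_TYPE = 'PARQUET', CREDENTIAL = (IDENTITY = 'Managed Identity'))";
            stmt.execute(copySql);
        }
    }
}
```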

EC2 host type for a DynamoDB batchWrite call

I have a requirement to bulk upload an Excel sheet to a DynamoDB table, and the maximum number of rows is 200,000. The website for bulk upload will be used infrequently, so we can assume there are only 1 - 2 bulk uploads being processed at a given time. In the backend, I am using the Apache POI API to parse the Excel sheet into DynamoDB items.
Because we can only send up to 25 items in a batchWriteItem call, that means 8,000 sequential calls, and the current latency is around 15 minutes (900 seconds) to completely upload all 200,000 items. Hence I am planning to implement multi-threading to execute multiple batchWriteItem API calls in parallel. Can you help me understand which EC2 host types are best suited for multi-threading for this purpose?
Any references will be really helpful.
Normally, multi-threading would be helped by using an Instance Type that has multiple CPUs.
However, you are describing behaviour that is waiting on network rather than CPU. Therefore, it is likely that the operation you describe is not being heavily impacted by CPU Utilization.
The best way to answer your question is to recommend that you experiment with different instance types to find the one that is best for your application's combination of needs:
Pick an instance family (eg m5) and try a few different sizes
Compare this against another family (eg c5) to see whether the improved performance is worth the extra cost
Monitor the application to find the bottleneck, which would either be RAM, CPU, Network or Disk access
Please note that smaller instances have less Network bandwidth, so you might need to choose a larger instance type to avoid being throttled on network bandwidth. This might result in excess CPU that isn't being fully utilized.
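Whichever instance type you pick, the parallel write loop itself is simple. Below is a sketch using the AWS SDK for Java v2; the table name, the pool size of 16 and the lack of backoff between retries are simplifications, and in practice the thread count should be tuned against the table's write capacity and the rate of unprocessed items rather than the instance's vCPU count.

```java
// Sketch of parallelising batchWriteItem with the AWS SDK for Java v2.
// Table name, attribute maps and the thread count are assumptions; tune the pool
// size against the table's write capacity rather than the number of CPUs.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.BatchWriteItemRequest;
import software.amazon.awssdk.services.dynamodb.model.BatchWriteItemResponse;
import software.amazon.awssdk.services.dynamodb.model.PutRequest;
import software.amazon.awssdk.services.dynamodb.model.WriteRequest;

public class ParallelBatchWriter {
    private static final String TABLE = "BulkUploadTable";   // placeholder table name
    private final DynamoDbClient dynamo = DynamoDbClient.create();
    private final ExecutorService pool = Executors.newFixedThreadPool(16);

    // Split the parsed rows into 25-item batches and submit each batch to the pool.
    public void writeAll(List<Map<String, AttributeValue>> items) {
        for (int i = 0; i < items.size(); i += 25) {
            List<Map<String, AttributeValue>> batch =
                items.subList(i, Math.min(i + 25, items.size()));
            pool.submit(() -> writeBatch(batch));
        }
        pool.shutdown();   // caller should awaitTermination before reporting success
    }

    // Write one batch, resubmitting any unprocessed items (ideally with backoff).
    private void writeBatch(List<Map<String, AttributeValue>> batch) {
        List<WriteRequest> requests = new ArrayList<>();
        for (Map<String, AttributeValue> item : batch) {
            requests.add(WriteRequest.builder()
                .putRequest(PutRequest.builder().item(item).build())
                .build());
        }
        Map<String, List<WriteRequest>> pending = new HashMap<>();
        pending.put(TABLE, requests);
        while (!pending.isEmpty()) {
            BatchWriteItemResponse resp = dynamo.batchWriteItem(
                BatchWriteItemRequest.builder().requestItems(pending).build());
            pending = resp.unprocessedItems();   // retry anything DynamoDB rejected
        }
    }
}
```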

Azure Redis cache latency

I am working on an application with a WebJob and an Azure Function app. The WebJob generates the Redis cache for the Function app to consume. The cache size is around 10 MB. I am using lazy loading and so on, as per the recommendations. I still find that the overall cache operation is slow. Depending upon the size of the file I am processing, I may end up calling the Redis cache up to 100,000 times. I am wondering if I need to hold the cache data in a local variable instead of reading it every time from Redis. Has anyone experienced any latency in accessing Redis? Does it make sense to create a singleton object in the C# function app and refresh it based on a timer or other logic?
Could you consider these points in your usage? These are some good practices for Azure Redis Cache:
Redis works best with smaller values, so consider chopping up bigger data into multiple keys. In this Redis discussion, 100kb is considered "large". Read this article for an example problem that can be caused by large values.
Use Standard or Premium Tier for Production systems. The Basic Tier is a single node system with no data replication and no SLA. Also, use at least a C1 cache. C0 caches are really meant for simple dev/test scenarios since they have a shared CPU core, very little memory, are prone to "noisy neighbor", etc.
Remember that Redis is an in-memory data store, so be aware of the scenarios where data loss can occur.
Reuse connections - Creating new connections is expensive and increases latency, so reuse connections as much as possible. If you choose to create new connections, make sure to close the old connections before you release them (even in managed memory languages like .NET or Java).
Locate your cache instance and your application in the same region. Connecting to a cache in a different region can significantly increase latency and reduce reliability. Connecting from outside of Azure is supported, but not recommended especially when using Redis as a cache (as opposed to a key/value store where latency may not be the primary concern).
Configure your maxmemory-reserved setting to improve system responsiveness under memory pressure conditions, especially for write-heavy workloads or if you are storing larger values (100KB or more) in Redis. I would recommend starting with 10% of the size of your cache, then increase if you have write-heavy loads. See some considerations when selecting a value.
Avoid Expensive Commands - Some redis operations, like the "KEYS" command, are VERY expensive and should be avoided.
Configure your client library to use a "connect timeout" of at least 10 to 15 seconds, giving the system time to connect even under higher CPU conditions. If your client or server tend to be under high load, use an even larger value. If you use a large number of connections in a single application, consider adding some type of staggered reconnect logic to prevent a flood of connections hitting the server at the same time.
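On the connection-reuse point above, here is a minimal sketch of the pattern. The asker's app is C#, where the usual approach is a single shared, lazily created ConnectionMultiplexer; the sketch below shows the same idea with the Java Lettuce client, and the cache host name, access key and key names are placeholders.

```java
// Minimal sketch of connection reuse with the Java Lettuce client. The cache host,
// access key and TLS port are placeholders; the point is that one connection is
// created lazily and shared, instead of connecting on every call.
import io.lettuce.core.RedisClient;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.sync.RedisCommands;

public final class RedisHolder {
    private static final RedisClient CLIENT =
        RedisClient.create("rediss://:<access-key>@<cache-name>.redis.cache.windows.net:6380");
    private static volatile StatefulRedisConnection<String, String> connection;

    // Lazily create one shared, thread-safe connection.
    public static RedisCommands<String, String> commands() {
        if (connection == null) {
            synchronized (RedisHolder.class) {
                if (connection == null) {
                    connection = CLIENT.connect();
                }
            }
        }
        return connection.sync();
    }
}
```

Callers then use RedisHolder.commands().get("someKey") on every request instead of opening a new connection each time.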

Frequent Updates on Apache Ignite

I hope someone experienced with Apache Ignite can help guide my team towards the answer regarding a new setup with Apache Ignite.
Overall Setup
Data is continuously generated from many distributed sensors and streamed into our database. Each sensor may deliver many updates every second, but generally generates <10 updates/sec.
Daily the magnitude of the data is approx. 50 million records, per site.
Data Description
Each record consists of the following values
Sensor ID
Point ID
Timestamp
Proximity
where Sensor ID is our ID of the sensor, Point ID is an ID of some point on the site, and Proximity is a proximity measurement from the sensor to the point.
Each second there is approx. 1000 such new records. A record is never updated.
Query Workload
Queries are fairly complex with significant (and dynamic) look-back in time. A query may require data from several sensors in one site, but the required sensors are determined dynamically. Most continuous queries only require data from the last few hours, but frequently it is necessary to query over many days.
Generally, we therefore have a write-once query-many scenario.
Initial Strategy
If we load data into primitive integer arrays in, e.g., java, the space consumption for a week approaches 5 GB. Because that is "peanuts" in the platforms of today, we intend to load all data onto all nodes in the Ignite cluster/distributed cache. In other words, use a replicated cache.
However, the continuous updates keep puzzling me. If I update the entire cache, I imagine quite substantial amounts of data need to be transferred across the network every second.
Creating chunks for, say, each minute/hour is not necessarily going to work (well) either as each sensor can be temporarily offline, which will make it deliver stale data at some later point in time.
My question is therefore how to efficiently handle this stream of updates, while maintaining a consistent view of the data for the last 7-10 days.
My current, local, implementation is chunking the data into 1-hour chunks. When a new record for a given chunk arrives, the chunk is replaced with an updated chunk. This works well on a single machine but is likely too expensive in terms of network overhead in a cluster. I do not have an Ignite implementation, yet, so I have not been able to test this.
Ideally, each node in the ignite cluster would maintain its own copy of all data within the last X days, and apply the small update workload continuously.
So my question is, how would fellow Igniters approach this problem?
It sounds like you want to scale the load across multiple servers, but that's not possible with replicated caches, because each update will always update all nodes, and the more nodes you have, the more network traffic you will get. I think you should use partitioned caches instead and try adding nodes until the system is capable of handling the load.
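As a starting point, here is a minimal sketch of a partitioned cache with one backup plus an IgniteDataStreamer for the continuous inserts. The cache name, key format and value type are assumptions; affinity (for example, collocating each sensor's records on one node) would still need to be designed around the query patterns described above.

```java
// Sketch of a partitioned cache for the sensor readings, with one backup copy and
// an IgniteDataStreamer for the continuous inserts. Names and the key format are
// illustrative only.
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class SensorIngest {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        // PARTITIONED spreads the data (and the update traffic) across nodes;
        // one backup keeps a second copy so a node failure loses nothing.
        CacheConfiguration<String, double[]> cfg = new CacheConfiguration<>("readings");
        cfg.setCacheMode(CacheMode.PARTITIONED);
        cfg.setBackups(1);
        IgniteCache<String, double[]> cache = ignite.getOrCreateCache(cfg);

        // The data streamer batches updates to primary/backup nodes, which suits
        // the ~1000 new records per second described above.
        try (IgniteDataStreamer<String, double[]> streamer = ignite.dataStreamer("readings")) {
            streamer.allowOverwrite(false);                       // records are never updated
            String key = "sensor42|point7|2024-01-01T00:00:00Z";  // placeholder key
            streamer.addData(key, new double[] { 1.23 });         // proximity value
        }
    }
}
```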

Geoserver and number of threads

We're using Geoserver, and we have performance problems in production with a large number of users.
We've run some load tests with 250, 150, and 20 threads. We've noticed that Geoserver works better with 20 threads than with 150, and when the thread count increases (150 or 250), performance decreases.
Is this normal? How does Geoserver manage user requests? Does Geoserver use an asynchronous strategy to manage user requests?
Thanks in advance.
Sounds pretty normal. Threads (and CPU context switches) aren't free, and at some point you are going to spend more time thrashing around switching threads than actually doing anything useful. It is often better to have a much smaller number of threads (number of cores * 2 is often reasonable) combined with some sort of front-end queue that will accept a connection and hold it until a worker is free.
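For illustration only, here is the "small worker pool plus front-end queue" idea as a generic Java sketch. It is not Geoserver's internal code; in practice the same effect comes from the servlet container's thread pool settings or Geoserver's control-flow extension, and the queue size and rejection policy below are assumptions.

```java
// Generic illustration of the "few workers + front-end queue" idea: a bounded pool
// sized at roughly 2x the CPU cores, with extra requests waiting in a queue instead
// of all running (and context-switching) at once.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedWorkerPool {
    public static ThreadPoolExecutor create() {
        int workers = Runtime.getRuntime().availableProcessors() * 2;
        return new ThreadPoolExecutor(
            workers, workers,                         // fixed number of worker threads
            0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(200),            // waiting requests queue here
            new ThreadPoolExecutor.CallerRunsPolicy() // back-pressure when the queue is full
        );
    }
}
```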
Here are some real-world use case statistics for you: in production, for mobile/web apps serving 'google-maps' style users for the outdoor market, my company has tested various configurations (several of these discussed by theonlysandman, a contributor to this question), which also support the observation by Tyler Evans, another contributor to this question.
We need loads of greater than 5000 requests/second (qps), and as our Geoserver instances ubiquitously topped out at nearly 100 qps each, we'd need to scale horizontally and vertically to over 50 Geoserver instances.
Parameters: mostly vector sources, local PostGIS databases all less than 2 TB each and no table > 1M records (or if greater than 1M, simplified geometry at nodes > 1 m apart), 60%-40%-10% WMS/WMTS/WFS requests, Google-cloud-hosted servers, each 32 cores, SSD drive cluster up to 4 TB.
The bottleneck of qps appears to be Geoserver itself. (Styling, reprojection, all the niceties that come with it). I'm not advocating it is poorly written, but the heavier a car gets the slower it might drive.
If we replicate the WFS requests using Go or Python (with or without GDAL) to directly access the PostGIS data, we get faster throughput than Geoserver (up to 1000 qps per instance or more, at which point PostGIS becomes the bottleneck).
The same goes for our homemade Java microservice based on PostGIS that creates pbf/mvt tiles from PostGIS: it, too, was very quick, at about 1000 qps.
Nginx for us performed slightly better than php (~110 qps vs ~89 qps), but this could be a result of the Apache configuration.
Where do we go from here? In all of our production use cases, for our users, serving miniature sharded sqlite/mbtile databases (vector or raster)... and maintaining them with custom code... was far more performant and scalable.
We may write a Java plugin for geoserver that pushes GeoWebCache TMS tiles into a Google Storage Bucket designed for slippy z/x/y calls... this way we could more easily maintain a tile pyramid with updates etc., using Geoserver tools.
The more threads, the greater the load on the server. See the Wikipedia article on thrashing.
Geoserver performance is affected by many things. My advice is to look at each one and see where the bottleneck is occurring.
Here is a list of questions to set you on the correct path:
What are the specs of your machine? It should have an SSD.
Are you generating your tiles on the fly? Or are they pre-seeded?
If you are pre-seeding, is that running?
NOTE: pre-seeding helps but hammers the system, so it is best done out of production.
What is the source of your data? If PostGIS, are you using spatial indexes?
Is PostgreSQL/postgis on the same machine?
How many types of tiles are you generating?
NOTE: you could be generating extra tiles which you don't need/use.
Do you use GeoWebCache?
With some more details, I can help you out.
