Form Recognizer Heavy Workload - python-asyncio

My use case is the following :
Once every day I upload 1000 single page pdf to Azure Storage and process them with Form Recognizer via python azure-form-recognizer latest client.
So far I’m using the Async version of the client and I send the 1000 coroutines concurrently.
tasks = {asyncio.create_task(analyse_async(doc)): doc for doc in documents}
pending = set(tasks)
# Handle retry
while pending:
# backoff in case of 429
time.sleep(1)
# concurrent call return_when all completed
finished, pending = await asyncio.wait(
pending, return_when=asyncio.ALL_COMPLETED
)
# check if task has exception and register for new run.
for task in finished:
arg = tasks[task]
if task.exception():
new_task = asyncio.create_task(analyze_async(doc))
tasks[new_task] = doc
pending.add(new_task)
Now I’m not really comfortable with this setup. The main reason being the unpredictable successive states of the service in the same iteration. Can be up then throw 429 then up again. So not enough deterministic for me. I was wondering if another approach was possible. Do you think I should rather increase progressively the transactions. Start with 15 (default TPS) then 50 … 100 until the queue is empty ? Or another option ?
Thx

We need to enable the CORS and make some changes to that CORS to make it available to access the heavy workload.
Follow the procedure to implement the heavy workload in form recognizer.
Make it for page blobs here for higher and best performance.
Redundancy is also required. Make it ZRS for better implementation.
Create a storage account to upload the files.
Go to CORS and add the URL required.
Set the Allowed origins to https://formrecognizer.appliedai.azure.com
Go to containers and upload the documents.
Upload the documents. Use the container and blob information to give as the input for the recognizer. If the case is from Form Recognizer studio, the size of the total documents is considered and also the number of characters limit is there. So suggested to use the python code using the container created as the input folder.

Related

Dataflow job has high data freshness and events are dropped due to lateness

I deployed an apache beam pipeline to GCP dataflow in a DEV environment and everything worked well. Then I deployed it to production in Europe environment (to be specific - job region:europe-west1, worker location:europe-west1-d) where we get high data velocity and things started to get complicated.
I am using a session window to group events into sessions. The session key is the tenantId/visitorId and its gap is 30 minutes. I am also using a trigger to emit events every 30 seconds to release events sooner than the end of session (writing them to BigQuery).
The problem appears to happen in the EventToSession/GroupPairsByKey. In this step there are thousands of events under the droppedDueToLateness counter and the dataFreshness keeps increasing (increasing since when I deployed it). All steps before this one operates good and all steps after are affected by it, but doesn't seem to have any other problems.
I looked into some metrics and see that the EventToSession/GroupPairsByKey step is processing between 100K keys to 200K keys per second (depends on time of day), which seems quite a lot to me. The cpu utilization doesn't go over the 70% and I am using streaming engine. Number of workers most of the time is 2. Max worker memory capacity is 32GB while the max worker memory usage currently stands on 23GB. I am using e2-standard-8 machine type.
I don't have any hot keys since each session contains at most a few dozen events.
My biggest suspicious is the huge amount of keys being processed in the EventToSession/GroupPairsByKey step. But on the other, session is usually related to a single customer so google should expect handle this amount of keys to handle per second, no?
Would like to get suggestions how to solve the dataFreshness and events droppedDueToLateness issues.
Adding the piece of code that generates the sessions:
input = input.apply("SetEventTimestamp", WithTimestamps.of(event -> Instant.parse(getEventTimestamp(event))
.withAllowedTimestampSkew(new Duration(Long.MAX_VALUE)))
.apply("SetKeyForRow", WithKeys.of(event -> getSessionKey(event))).setCoder(KvCoder.of(StringUtf8Coder.of(), input.getCoder()))
.apply("CreatingWindow", Window.<KV<String, TableRow>>into(Sessions.withGapDuration(Duration.standardMinutes(30)))
.triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))))
.discardingFiredPanes()
.withAllowedLateness(Duration.standardDays(30)))
.apply("GroupPairsByKey", GroupByKey.create())
.apply("CreateCollectionOfValuesOnly", Values.create())
.apply("FlattenTheValues", Flatten.iterables());
After doing some research I found the following:
regarding constantly increasing data freshness: as long as allowing late data to arrive a session window, that specific window will persist in memory. This means that allowing 30 days late data will keep every session for at least 30 days in memory, which obviously can over load the system. Moreover, I found we had some ever-lasting sessions by bots visiting and taking actions in websites we are monitoring. These bots can hold sessions forever which also can over load the system. The solution was decreasing allowed lateness to 2 days and use bounded sessions (look for "bounded sessions").
regarding events dropped due to lateness: these are events that on time of arrival they belong to an expired window, such window that the watermark has passed it's end (See documentation for the droppedDueToLateness here). These events are being dropped in the first GroupByKey after the session window function and can't be processed later. We didn't want to drop any late data so the solution was to check each event's timestamp before it is going to the sessions part and stream to the session part only events that won't be dropped - events that meet this condition: event_timestamp >= event_arrival_time - (gap_duration + allowed_lateness). The rest will be written to BigQuery without the session data (Apparently apache beam drops an event if the event's timestamp is before event_arrival_time - (gap_duration + allowed_lateness) even if there is a live session this event belongs to...)
p.s - in the bounded sessions part where he demonstrates how to implement a time bounded session I believe he has a bug allowing a session to grow beyond the provided max size. Once a session exceeded the max size, one can send late data that intersects this session and is prior to the session, to make the start time of the session earlier and by that expanding the session. Furthermore, once a session exceeded max size it can't be added events that belong to it but don't extend it.
In order to fix that I switched the order of the current window span and if-statement and edited the if-statement (the one checking for session max size) in the mergeWindows function in the window spanning part, so a session can't pass the max size and can only be added data that doesn't extend it beyond the max size. This is my implementation:
public void mergeWindows(MergeContext c) throws Exception {
List<IntervalWindow> sortedWindows = new ArrayList<>();
for (IntervalWindow window : c.windows()) {
sortedWindows.add(window);
}
Collections.sort(sortedWindows);
List<MergeCandidate> merges = new ArrayList<>();
MergeCandidate current = new MergeCandidate();
for (IntervalWindow window : sortedWindows) {
MergeCandidate next = new MergeCandidate(window);
if (current.intersects(window)) {
if ((current.union == null || new Duration(current.union.start(), window.end()).getMillis() <= maxSize.plus(gapDuration).getMillis())) {
current.add(window);
continue;
}
}
merges.add(current);
current = next;
}
merges.add(current);
for (MergeCandidate merge : merges) {
merge.apply(c);
}
}

How to performance test workflow execution?

I have 2 APIs
Create a workflow (http POST request)
Check workflow status (http GET
request)
I want to performance test on how much time does workflow takes to complete.
Tried two ways:
Option 1 Created a java test that triggers workflow create API and then poll status API to check if status turns to CREATED. I check the time taken in this process which gives me performance results.
Option 2 Was using Gatling to do the same
val createWorkflow = http("create").post("").body(ElFileBody("src/main/resources/weather.json")).asJson.check(status.is(200))
.check(jsonPath("$.id").saveAs("id"))
val statusWorkflow = http("status").get("/${id}")
.check(jsonPath("$.status").saveAs("status")).asJson.check(status.is(200))
val scn = scenario("CREATING")
.exec(createWorkflow)
.repeat(20){exec(statusWorkflow)}
Gatling one didn't really work (or I am doing it in some wrong way). Is there a way in Gatling I can merge multiple requests and do something similar to Option 1
Is there some other tool that can help me out to performance test such scenarios?
I think something like below should work when using Gatling's tryMax
.tryMax(100) {
pause(1)
.exec(http("status").get("/${id}")
.check(jsonPath("$.status").saveAs("status")).asJson.check(status.is(200))
)
}
Note: I didn't try this out locally. More information about tryMax:
https://medium.com/#vcomposieux/load-testing-gatling-tips-tricks-47e829e5d449 (Polling: waiting for an asynchronous task)
https://gatling.io/docs/current/advanced_tutorial/#step-05-check-and-failure-management

Why does ZeroMQ not receive a string when it becomes too large on a PUSH/PULL MT4 - Python setup?

I have an EA set in place that loops history trades and builds one large string with trade information. I then send this string every second from MT4 to the python backend using a plain PUSH/PULL pattern.
For whatever reason, the data isn't received on the pull side when the string transferred becomes too long. The backend PULL-socket slices each string and further processes it.
Any chance that the PULL-side is too slow to grab and process all the data which then causes an overflow (so that a delay arises due to the processing part)?
Talking about file sizes we are well below 5kb per second.
This is the PULL-socket, which manipulates the data after receiving it:
while True:
# check 24/7 for available data in the pull socket
try:
msg = zmq_socket.recv_string()
data = msg.split("|")
print(data)
# if data is available and msg is account info, handle as follows
if data[0] == "account_info":
[...]
except zmq.error.Again:
print("\nResource timeout.. please try again.")
sleep(0.000001)
I am a bit curious now since the pull socket seems to not even be able to process a string containing 40 trades with their according information on a single MT4 client - Python connection. I actually planned to set it up to handle more than 5.000 MT4 clients - python backend connections at once.
Q : Any chance that the pull side is too slow to grab and process all the data which then causes an overflow (so that a delay arises due to the processing part)?
Zero chance.
Sending 640 B each second is definitely no showstopper ( 5kb per second - is nowhere near a performance ceiling... )
The posted problem formulation is otherwise undecidable.
Step 1) POSACK/NACK prove whether a PUSH side accepts the payload for sending error-free.
Step 2) prove the PULL side is not to be blamed - [PUSH.send(640*chr(64+i)) for i in range( 10 )] via a python-2-python tcp://-transport-class solo-channel crossing host-to-host hop, over at least your local physical network ( no VMCI/emulated vLAN, no other localhost colocation )
Step 3) if either steps above got POSACK-ed, your next chances are the ZeroMQ configuration space and/or the MT4-based PUSH-side incompatibility, most probably "hidden" inside a (not mentioned) third party ZeroMQ wrapper used / first-party issues with string handling / processing ( which you must have already read about, as it has been so many times observed and mentioned in the past posts about this trouble with well "hidden" MQL4 internal eco-system changes ).
Anyway, stay tuned. ZeroMQ is a sure bet and a truly horsepower for professional and low-latency designs in distributed-system's domain.

Concurrency in Redis for flash sale in distributed system

I Am going to build a system for flash sale which will share the same Redis instance and will run on 15 servers at a time.
So the algorithm of Flash sale will be.
Set Max inventory for any product id in Redis
using redisTemplate.opsForValue().set(key, 400L);
for every request :
get current inventory using Long val = redisTemplate.opsForValue().get(key);
check if it is non zero
if (val == null || val == 0) {
System.out.println("not taking order....");
}
else{
put order in kafka
and decrement using redisTemplate.opsForValue().decrement(key)
}
But the problem here is concurrency :
If I set inventory 400 and test it with 500 request thread,
Inventory becomes negative,
If I make function synchronized I cannot manage it in distributed servers.
So what will be the best approach to it?
Note: I can not go for RDBMS and set isolation level because of high request count.
Redis is monothreaded, so running a Lua Script on it is always atomic.
You can define then a Lua script on your Redis instance and running it from your Spring instances.
Your Lua script would just be a sequence of operations to execute against your redis instance (the only one to have the correct value of your stock) and returns the new value for instance or an error if the value is negative.
Your Lua script is basically a Redis transaction, there are other methods to achieve Redis transaction but IMHO Lua is the simplest above all (maybe the least performant, but I have found that in most cases it is fast enough).

administrative limit exceeded, REST

Using rest I got this exception
http://localhost:8080/customgroups?_queryFilter=(members/uid+co+%22test%22)
{"code":413,"reason":"Request Entity Too Large","message":"Administrative Limit Exceeded"}
I turned all limits off:
ds-cfg-lookthrough-limit: 0
ds-cfg-size-limit: 0
Is there another constrain? The result should be 1-3 entries. Other requests like get all customGroups = 83 or users = 1300 works fine, so why does the query_filter making problems?
Thank You
There are a few things you might try out:
can you check the ds-rlim-lookthrough-limit operational attribute is correctly set? Especially for cn=Directory Manager if you are using it to make requests.
I can see there is a special config for collective attributes ds-rlim-lookthrough-limit;collective: 0. Maybe does it apply to your request?
References:
http://ludopoitou.com/2012/04/10/tips-resource-limits-in-opendj/
http://opendj.forgerock.org/opendj-server/configref/global.html#lookthrough-limit
http://docs.forgerock.org/en/opendj/2.6.0/admin-guide/index/chap-resource-limits.html

Resources