Using RxJS 5 (beta) to cap HTTP requests

I'm using RxJS 5 (https://github.com/ReactiveX/RxJS) and I'm trying to access the Riot API, which has a rate cap of 500 requests every 10 minutes and 10 requests every 10 seconds.
I set up a stream of request objects and I have a subscriber ready to take them and actually perform the requests, but I'm fairly new to RxJS and not sure which operator I should use to cap the requests.

If you want to space out your requests you can use sample:
const newRequests = requestStream.sample(Observable.interval(1200)); // 10 min * 60 s * 1000 ms / 500 = 1200 ms per request, which also satisfies 10 per 10 s
Sample emits the most recent element from the source each time the given observable emits; note that requests arriving in between are dropped rather than queued.

Not sure if it's the best way, but I ended up zipping the request object stream with an interval observable, so it emits events only when the interval ticks.
Again, not sure if it's the best way to do this, but it works. Here's what it looks like:
raw_stream = Rx.Observable.fromEvent EventEmitter, 'event'
interval = Rx.Observable.interval(1000)
timed_events = Rx.Observable.zip interval, raw_stream
If you have a better way, please feel free to answer.
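For reference, here is a minimal sketch of the same zip idea in RxJS 5-style TypeScript, with the interval sized to the Riot caps; requestStream and makeRequest are hypothetical stand-ins for the app's own queue and HTTP call:

import { Observable } from 'rxjs/Rx';

// Hypothetical request descriptor, queue, and HTTP helper.
interface ApiRequest { url: string; }
declare const requestStream: Observable<ApiRequest>;
declare function makeRequest(req: ApiRequest): Observable<any>;

// zip pairs each queued request with an interval tick, so requests are
// released at most once every 1200 ms (600 000 ms / 500 = 1200 ms, which
// also satisfies 10 per 10 s) and, unlike sample, none are dropped.
const timedRequests = Observable.zip(
  requestStream,
  Observable.interval(1200),
  (req: ApiRequest, _tick: number) => req
);

// concatMap keeps at most one request in flight and preserves order.
timedRequests.concatMap(req => makeRequest(req)).subscribe();

One caveat of the zip approach: if requests arrive slower than the ticks, unused ticks queue up, so a later burst can be released faster than the cap allows; delaying inside concatMap instead avoids that.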

Related

Form Recognizer Heavy Workload

My use case is the following:
Once every day I upload 1,000 single-page PDFs to Azure Storage and process them with Form Recognizer via the latest azure-form-recognizer Python client.
So far I'm using the async version of the client and I send the 1,000 coroutines concurrently.
tasks = {asyncio.create_task(analyze_async(doc)): doc for doc in documents}
pending = set(tasks)
# Handle retries
while pending:
    # Back off in case of 429 (use asyncio.sleep, not time.sleep,
    # so the event loop is not blocked)
    await asyncio.sleep(1)
    # Wait until all the concurrent calls have completed
    finished, pending = await asyncio.wait(
        pending, return_when=asyncio.ALL_COMPLETED
    )
    # Check each task for an exception and register it for a new run
    for task in finished:
        doc = tasks[task]
        if task.exception():
            new_task = asyncio.create_task(analyze_async(doc))
            tasks[new_task] = doc
            pending.add(new_task)
Now I'm not really comfortable with this setup. The main reason is the unpredictable succession of service states within the same iteration: it can be up, then throw 429, then be up again, which is not deterministic enough for me. I was wondering if another approach is possible. Do you think I should instead progressively increase the number of transactions, starting with 15 (the default TPS), then 50, then 100, until the queue is empty? Or is there another option?
Thanks
You need to enable CORS on the storage account and configure it so that Form Recognizer can access the uploaded documents.
Follow this procedure to handle the heavy workload in Form Recognizer:
Create a storage account to upload the files. Use page blobs for the best performance.
Redundancy is also required; make it ZRS for a better implementation.
Go to CORS and add the required URL: set the Allowed origins to https://formrecognizer.appliedai.azure.com
Go to Containers and upload the documents.
Use the container and blob information as the input for the recognizer. If you work from Form Recognizer Studio instead, the total document size and a character-count limit apply, so for a batch this large it is better to use Python code with the container you created as the input folder.
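On the throttling question itself: instead of launching all 1,000 coroutines at once, one option is to cap concurrency and back off only on 429s. A minimal TypeScript sketch of the idea (not the author's Python stack), where analyze is a hypothetical wrapper around the Form Recognizer call and 15 workers mirror the default 15 TPS:

// Minimal sketch: cap concurrency and retry 429s with exponential backoff.
// `analyze` is a hypothetical wrapper around the Form Recognizer call,
// assumed to surface the HTTP status as `statusCode` on failure.
declare function analyze(doc: string): Promise<void>;

async function processAll(documents: string[], concurrency = 15): Promise<void> {
  const queue = [...documents];

  async function worker(): Promise<void> {
    // Each worker pulls from the shared queue until it is drained.
    while (queue.length > 0) {
      const doc = queue.shift();
      if (doc === undefined) break;
      for (let attempt = 0, delay = 1000; ; attempt++, delay *= 2) {
        try {
          await analyze(doc);
          break;
        } catch (err: any) {
          // Retry only throttling errors; give up after 5 attempts.
          if (err?.statusCode !== 429 || attempt >= 5) throw err;
          await new Promise((resolve) => setTimeout(resolve, delay));
        }
      }
    }
  }

  // `concurrency` workers drain the queue in parallel.
  await Promise.all(Array.from({ length: concurrency }, () => worker()));
}

Ramping the load up gradually (15, then 50, then 100), as the question suggests, maps to raising the worker count whenever a full pass completes without 429s.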

DISCORD.JS SendPing to Host

I'm developing a bot in Discord.js. Because I use Lavalink, I hosted the Lavalink server on a free host, and to keep it online I need to ping it constantly. Is there any way to make my bot (which currently runs on my VPS) send a ping at a fixed interval to the url/host where my Lavalink runs? If you have any solution I will be grateful!
You have two options:
Using UptimeRobot (fastest way)
UptimeRobot is an online service that can make HTTP requests every 5 minutes.
Very simple and fast to use; see more here.
Making the request from your bot VPS
Installing node-fetch
Type this in your terminal:
npm i node-fetch
Making the request
Insert this wherever you want in the bot code:
const fetch = require('node-fetch');
const intervalTime = 300000; // interval between pings in milliseconds; 300000 = 5 minutes
const lavalinkURL = 'insert here the lavalink process url';
setInterval(() => {
  fetch(lavalinkURL);
}, intervalTime);
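If the bot runs on Node 18 or newer, the same loop works without node-fetch, since fetch is available globally there; a variant with basic error handling so a temporarily unreachable host doesn't crash the bot with an unhandled rejection:

// Node 18+ ships a global fetch, so node-fetch is not needed there.
const intervalTime = 300000; // 5 minutes
const lavalinkURL = 'insert here the lavalink process url';

setInterval(() => {
  fetch(lavalinkURL).catch((err) => console.error('ping failed:', err));
}, intervalTime);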

KafkaConsumer poll() behavior understanding

Trying to understand (I'm new to Kafka) how the poll loop in Kafka works.
Use case: 25 records on the topic, max poll size is set to 5.
max.poll.interval.ms = 5000 // 5 seconds
max.poll.records = 5
Sequence of tasks:
Poll the records from the topic.
Process the records in a for loop.
Some processing logic where each record either passes or fails.
If the logic passes, the record's offset is added to a map.
Then it is committed using a commitSync call.
If it fails, the loop breaks and whatever succeeded before this point is committed. The problem starts after this.
The next poll just keeps moving in batches of 5 even after the error; is that expected?
What we basically expect is that the loop breaks, the offsets up to the last successfully processed message get committed, and the next poll continues from the failed message.
For example, in the first poll of 5 messages, offsets 1 and 2 succeed and are committed, then the 3rd fails. Subsequent polls keep moving to the next batches (5-10, 10-15); if there is an error in between, we expect polling to stop at that point and the next poll to start from offset 3 in the first case, or, if it fails at offset 8 in the second batch, from offset 8 rather than from the next max.poll.records batch. If it matters, this is a Spring Boot project and enable.auto.commit is false.
I have tried finding this in the documentation, but no help.
I tried tweaking max.poll.interval.ms, but it didn't help.
EDIT: Not accepting an answer because there is no direct solution for a custom consumer. Keeping this for informational purposes.
max.poll.interval.ms is in milliseconds, not seconds, so it should be 5000.
Once the records have been returned by the poll (and offsets not committed), they won't be returned again unless you restart the consumer or perform seek() operations on the consumer to reset the offset to the unprocessed ones.
The Spring for Apache Kafka project provides a SeekToCurrentErrorHandler to perform this task for you.
If you are using the consumer yourself (which it sounds like), you must do the seeks.
You can manually seek to the beginning offset of the poll for all the assigned partitions on failure. I am not sure how to do it with the Spring consumer.
Sample code for seeking the offset back to the start of the poll with a plain consumer.
In the code below I get the record list per partition and then seek to the offset of the first record.
import scala.collection.JavaConverters._

def seekBack(records: ConsumerRecords[String, String]): Unit = {
  records.partitions().asScala.foreach { partition =>
    // Offset of the first record returned for this partition
    val partitionedRecords = records.records(partition)
    val offset = partitionedRecords.get(0).offset()
    consumer.seek(partition, offset)
  }
}
One problem: doing this all the time in production is bad. You only want to seek back on transient errors; otherwise you will end up retrying infinitely.
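For completeness, a sketch of the same seek-on-failure idea with kafkajs, a different (Node.js) client than the Java/Scala consumer above; handle is a hypothetical processing function and the broker/topic names are placeholders:

import { Kafka } from 'kafkajs';

declare function handle(message: unknown): Promise<void>;

const kafka = new Kafka({ brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'demo-group' });

await consumer.connect();
await consumer.subscribe({ topics: ['events'] });

await consumer.run({
  autoCommit: false,
  eachMessage: async ({ topic, partition, message }) => {
    try {
      await handle(message);
      // Commit the offset of the *next* record after each success.
      await consumer.commitOffsets([
        { topic, partition, offset: (Number(message.offset) + 1).toString() },
      ]);
    } catch (err) {
      // Rewind to the failed record so the next fetch starts there
      // instead of moving on to the next batch.
      consumer.seek({ topic, partition, offset: message.offset });
    }
  },
});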

FB Messenger API - Receiving double requests

I have a working FB Bot built with Ruby which allows players to play a scavenger hunt.
Sometimes, though, when I have multiple players in a team, FB sends me a player's 'Answer' webhook twice. I looked into it and at first thought it had to do with the 20-second timeout if FB gets no 200 OK response (docs here). After checking the logs, though, I am receiving the second webhook from FB only 14 seconds later. See below:
# Webhook #1
{"object"=>"page", "entry"=>[{"id"=>"252445748474312", "time"=>1532153642358, "messaging"=>[{"sender"=>{"id"=>"1709242109154907"}, "recipient"=>{"id"=>"252445748474312"}, "timestamp"=>1532153641935, "message"=>{"mid"=>"0FeOChulGjuPgg3YJqEgajNsY8kMfNRt_bpIdeegEeE54h-KB8szcd-EQ-UHUT3850RwHgH4TxVYFkoFwxqhtg", "seq"=>402953, "text"=>"Larrikins"}}]}]}
# Webhook #2 (14 seconds later)
{"object"=>"page", "entry"=>[{"id"=>"252445748474312", "time"=>1532153656901, "messaging"=>[{"sender"=>{"id"=>"1709242109154907"}, "recipient"=>{"id"=>"252445748474312"}, "timestamp"=>1532153641935, "message"=>{"mid"=>"0FeOChulGjuPgg3YJqEgajNsY8kMfNRt_bpIdeegEeE54h-KB8szcd-EQ-UHUT3850RwHgH4TxVYFkoFwxqhtg", "seq"=>402953, "text"=>"Larrikins"}}]}]}
Notice both are exactly the same apart from the first "time" attribute (14 secs later).
Due to a number of methods and calls that I process after receiving the first webhook, the 200 OK response is only being sent back to FB once I have finished sending my messages in response (hence the 14 second delay).
So I have two questions:
Is the 14-second delay too long, and is that why FB is resending? If so, how can I send a 200 OK response straight away (head :ok)?
Is it another issue entirely?
Also ensure that "Echo" is disabled: go to Settings > Webhooks and edit the events.
An asynchronous runtime like NodeJS is recommended. In my case I work with AWS SQS: I have workers that process the requests without blocking (they don't wait), and I return 200 "ok" to FB right away to stop FB from sending the message to my webhook again.
Another approach may be to store the mid in a database and check on each request whether that mid already exists; if it does, don't process the message. I used DynamoDB (AWS) with TTL enabled, so the database cleans itself every hour by erasing old requests.
I think it is the 15-second wait before replying; it was also happening to me, as Facebook auto-retries when you don't reply fast enough. Te EEe Te's idea is solid: write some mechanism to cache mids and check whether an incoming one is a duplicate before processing.
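A minimal sketch of both suggestions combined (acknowledge immediately, then dedupe by mid), here as a TypeScript/Express handler rather than the question's Ruby stack; the in-memory Set is a stand-in for a real TTL store like the DynamoDB setup above:

import express from 'express';

const app = express();
app.use(express.json());

// Hypothetical in-memory mid cache; production would use a store with a TTL.
const seenMids = new Set<string>();

app.post('/webhook', (req, res) => {
  // Acknowledge immediately so Facebook does not retry the delivery.
  res.sendStatus(200);

  for (const entry of req.body.entry ?? []) {
    for (const event of entry.messaging ?? []) {
      const mid: string | undefined = event.message?.mid;
      if (mid && seenMids.has(mid)) continue; // duplicate delivery, skip
      if (mid) seenMids.add(mid);
      // ...kick off the actual (slow) processing asynchronously here...
    }
  }
});

app.listen(3000);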

Regulating / rate limiting ruby mechanize

I need to regulate how often a Mechanize instance connects with an API (at most once every 2 seconds, i.e. a gap of 2 seconds or more between connections).
So this:
instance.pre_connect_hooks << Proc.new { sleep 2 }
I had thought this would work, and it sort of does, BUT now every method in that class sleeps for 2 seconds, as if the Mechanize instance is told to hold for 2 seconds whenever it is touched. I'm going to try a post-connect hook, but it's obvious I need something a bit more elaborate; I just don't know what at this point.
The code explains it better, so if you are interested, follow along: https://github.com/blueblank/reddit_modbot. Otherwise, my question is how to efficiently and effectively rate limit a Mechanize instance to the time frame specified by an API (where overstepping the limit results in dropped requests and bans). I'm also guessing I need to better integrate the Mechanize instance into my class, so any pointers on that are appreciated as well.
Pre- and post-connect hooks are called on every connect, so if there is some redirection they could trigger many times for one request. Try history_added, which only gets called once:
instance.history_added = Proc.new {sleep 2}
I use SlowWeb to rate limit calls to a specific URL.
require 'slowweb'
SlowWeb.limit('example.com', 10, 60)
In this case, calls to the example.com domain are limited to 10 requests every 60 seconds.
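The same "minimum gap between connects" idea, independent of Mechanize or SlowWeb, as a small TypeScript sketch; doRequest is a hypothetical HTTP helper:

// Minimal sketch of a "minimum interval" limiter: each call waits until
// at least `intervalMs` has passed since the previous request started.
declare function doRequest(url: string): Promise<string>;

function rateLimited(intervalMs: number): (url: string) => Promise<string> {
  let nextSlot = 0; // earliest timestamp the next request may start
  return async (url: string) => {
    const now = Date.now();
    const wait = Math.max(0, nextSlot - now);
    nextSlot = Math.max(now, nextSlot) + intervalMs;
    if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));
    return doRequest(url);
  };
}

// Usage: const get = rateLimited(2000); every get(url) call is then
// spaced at least 2 seconds apart, even when fired concurrently.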
