Correct use of streamz with websocket

I am trying to figure out a correct way of processing streaming data using streamz. My streaming data is loaded using websocket-client, after which I do this:
# open a stream and push updates into the stream
from streamz import Stream
from websocket import create_connection
from tornado import gen
from tornado.ioloop import IOLoop

stream = Stream()
# establish a connection
ws = create_connection("ws://localhost:8765")

# get continuous updates
async def f():
    while True:
        await gen.sleep(0.001)
        data = ws.recv()
        stream.emit(data)

IOLoop.current().add_callback(f)
While this works, I find that my stream is not able to keep pace with the streaming data (the data I see in the stream is several seconds behind the source, which is both high volume and high frequency). I tried reducing the gen.sleep(0.001) value (removing it completely halts Jupyter Lab), but the problem remains.
Is this a correct way of connecting streamz with streaming data using websocket?

I don't think websocket-client provides an async API, so its blocking recv() call is stalling the event loop.
You should use an async websocket client, such as the one Tornado provides:
from tornado import gen
from tornado.websocket import websocket_connect

async def f():
    # websocket_connect returns a Future, so await it to get the connection
    ws = await websocket_connect("ws://localhost:8765")
    while True:
        data = await ws.read_message()
        if data is None:
            break
        else:
            await stream.emit(data)
            # considering you're receiving data from a localhost
            # socket, it will be really fast, and the `await`
            # statement above won't pause the while-loop for
            # enough time for the event loop to have a chance to
            # run other things.
            # Therefore, sleep for a small time to suspend the
            # while-loop.
            await gen.sleep(0.0001)
You don't need to sleep if you're receiving/sending data from/to a remote connection which will be slow enough to suspend the while loop at await statements.
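For completeness, here is a hedged sketch of how the pieces above might be wired up, reusing the question's streamz Stream and IOLoop scheduling. The asynchronous=True flag and the print sink are assumptions for illustration, not part of the original post:
from streamz import Stream
from tornado.ioloop import IOLoop

# an asynchronous stream lets the coroutine await stream.emit(...)
stream = Stream(asynchronous=True)
stream.sink(print)  # illustrative sink; replace with the real processing

# schedule the reader coroutine on the running IOLoop, as in the question
IOLoop.current().add_callback(f)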

Related

asyncio: can a task only start when the previous task reaches a pre-defined stage?

I am starting with asyncio, which I wish to apply to the following problem:
Data is split into chunks.
A chunk is first compressed.
Then the compressed chunk is written to the file.
A single file is used for all chunks, so I need to process them one by one.
with open('my_file', 'w+b') as f:
    for chunk in chunks:
        compressed_chunk = compress_chunk(chunk)
        f.write(compressed_chunk)
From this context, to run this process faster: as soon as the write step of the current iteration starts, could the compress step of the next iteration be triggered as well?
Can I do that with asyncio, keeping a similar for-loop structure? If yes, could you share some pointers about this?
I am guessing another way to run this in parallel is by using ProcessPoolExecutor and fully splitting the compress phase from the write phase. This means first compressing all chunks in different executors.
Only when all chunks are compressed would the writing step start.
But I would like to investigate the first approach with asyncio first, if it makes sense.
Thanks in advance for any help.
Best
You can do this with a producer-consumer model. As long as there is one producer and one consumer, you will keep the correct order. For your use case, that's all you'll benefit from. Also, you should use the aiofiles library: standard file IO will mostly block your main compression/producer thread, and you won't see much speedup. Try something like this:
import asyncio
import aiofiles

async def produce(queue, chunks):
    for chunk in chunks:
        # compression happens here (CPU-bound, runs in the producer task)
        compressed_chunk = compress_chunk(chunk)
        await queue.put(compressed_chunk)

async def consume(queue):
    async with aiofiles.open('my_file', 'wb') as f:
        while True:
            compressed_chunk = await queue.get()
            await f.write(compressed_chunk)
            queue.task_done()

async def main():
    queue = asyncio.Queue()
    producer = asyncio.create_task(produce(queue, chunks))
    consumer = asyncio.create_task(consume(queue))
    # wait for the producer to finish
    await producer
    # wait for the consumer to finish processing and cancel it
    await queue.join()
    consumer.cancel()

asyncio.run(main())
https://github.com/Tinche/aiofiles
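The question also floats a ProcessPoolExecutor alternative (compress every chunk first, only then write them all). A hedged sketch of that variant, reusing the question's compress_chunk and chunks names (compress_chunk must be a picklable, top-level function for this to work):
import concurrent.futures

def compress_all_then_write(chunks, path='my_file'):
    # compress chunks in worker processes; executor.map preserves input order
    with concurrent.futures.ProcessPoolExecutor() as executor:
        compressed = list(executor.map(compress_chunk, chunks))
    # only once everything is compressed, write sequentially to the single file
    with open(path, 'wb') as f:
        for compressed_chunk in compressed:
            f.write(compressed_chunk)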
Using asyncio.Queue for producer-consumer flow

Broadcasting message from grpc server to all/some connected clients in python

I am learning how to use gRPC streams to exchange messages between clients and a server in Python. I found a base example that enables simple message sending between server and client. I am trying to modify it so that I can keep track of all the clients connected to the gRPC server (on the server side) and can do two things: 1) broadcast from the server to all clients, 2) send a message to a particular connected client.
Here is the .proto file
syntax = 'proto3';

service Scenario {
  rpc Chat(stream DPong) returns (stream DPong) {}
}

message DPong {
  string name = 1;
}
And here is the client.py that creates a daemon thread to listen for incoming messages and waits on stdin for any outgoing messages:
import threading
import queue
import grpc
import time
import scenario_pb2_grpc, scenario_pb2

# new changes
msgQueue = queue.Queue()

def run():
    channel = grpc.insecure_channel('localhost:50052')
    stub = scenario_pb2_grpc.ScenarioStub(channel)
    print('client connected')
    global queue

    def inputStream():
        while 1:
            msg = input('>>Enter message\n>>')
            yield scenario_pb2.DPong(name=msg)

    input_stream = stub.Chat(inputStream())

    def read_incoming():
        while 1:
            print('receivedFromServer: {}\n>>'.format(next(input_stream).name))

    thread = threading.Thread(target=read_incoming)
    thread.daemon = True
    thread.start()

    while 1:
        time.sleep(1)

if __name__ == '__main__':
    print('client starting ...')
    run()
Below is the server.py
import random
import string
import threading
import grpc
import scenario_pb2_grpc
import scenario_pb2
import time
from concurrent import futures

clientList = []

class Scenario(scenario_pb2_grpc.ScenarioServicer):
    def Chat(self, request_iterator, context):
        clients = []

        def stream():
            while 1:
                time.sleep(1)
                msg = input('>>Enter message\n>>')
                for i in clientList:
                    yield msg

        output_stream = stream()

        def read_incoming():
            while 1:
                received = next(request_iterator).name
                if (context, request_iterator) not in clientList:
                    clientList.append((context, request_iterator))
                print('receivedFromClient: {}'.format(received), len(clientList))

        thread = threading.Thread(target=read_incoming)
        thread.daemon = True
        thread.start()

        while 1:
            msg = output_stream
            yield scenario_pb2.DPong(name=next(msg))

if __name__ == '__main__':
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    scenario_pb2_grpc.add_ScenarioServicer_to_server(
        Scenario(), server)
    server.add_insecure_port('[::]:50052')
    server.start()
    print('listening ...')
    while 1:
        time.sleep(1)
So far, I have tried to maintain a list object clientList that contains the context & request_iterator objects of each client, and is updated every time a new client joins the server. But how do I use these objects from the clientList before sending out an outgoing message? I have tried iterating the list, but the server sends the message to the same client (the last client heard from) a number of times instead of sending it to all the clients once.
Any help is highly appreciated!
This is certainly possible. The problem that you're running into here is that each call to Scenario.Chat on the server side corresponds to a single client connection. That is, this function is called when the streaming RPC starts and as soon as the function exits, the RPC ends.
So if you want n connected clients, you'll need n instances of Scenario.Chat running concurrently, each on its own thread. This does mean that the number of concurrently connected clients is limited by the size of the threadpool with which you instantiate your server.
So, let's say you have n threads in your server process dedicated to maintaining client connections. Then you need another n+1th thread (perhaps the main thread) determining when the server will broadcast a message to all clients (maybe by looking for input from STDIN?). When this extra thread determines that a message should be broadcast, it needs to communicate this intent to all of the threads maintaining connections to a client. There are many ways to make this happen. A threading.Condition and a global collections.deque, or a collections.deque per client connection (somewhat like channels between goroutines) would be two ways. The tricky bit here is ensuring that each client connection will receive the message regardless of how long the client connection thread takes to wake up and how many messages the n+1th thread decides to send in the interim.
If this is still unclear, I can follow up with some actual code demonstrating the idea.
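Roughly along those lines, here is a hedged sketch of the per-client-queue idea. It substitutes a queue.Queue per client for the deque/Condition pair described above, and the broadcast helper (and whoever calls it) is an assumption, not code from the original post:
import queue
import threading

import scenario_pb2
import scenario_pb2_grpc

client_queues = []                  # one queue.Queue per connected client
client_queues_lock = threading.Lock()

def broadcast(text):
    # called from the extra (n+1th) thread, e.g. after reading STDIN
    with client_queues_lock:
        for q in client_queues:
            q.put(text)

class Scenario(scenario_pb2_grpc.ScenarioServicer):
    def Chat(self, request_iterator, context):
        my_queue = queue.Queue()
        with client_queues_lock:
            client_queues.append(my_queue)
        try:
            # drain this client's incoming messages on a background thread
            def read_incoming():
                for msg in request_iterator:
                    print('receivedFromClient: {}'.format(msg.name))
            threading.Thread(target=read_incoming, daemon=True).start()
            # the RPC thread blocks on its own queue and streams out
            # whatever the broadcaster puts there
            while True:
                yield scenario_pb2.DPong(name=my_queue.get())
        finally:
            with client_queues_lock:
                client_queues.remove(my_queue)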
You can spin up multiple ports in one application.
gRPC can be running on port 50011 and Flask with Socket.IO can be running on port 8080.
With Python, you can use the Flask framework and the flask_socketio library in your server.py,
e.g. server.py:
from flask import Flask
from flask_socketio import SocketIO, emit

app = Flask(__name__)
socketio = SocketIO(app)

@app.route('/')
def index():
    return "Hello, World!"

if __name__ == '__main__':
    socketio.run(app, port=8080, debug=True)
Instead of using the gRPC streaming API, use WebSockets to broadcast to all connected clients, and to specific/selected clients using rooms.
e.g.
@socketio.on('message')
def handle_message(data):
    # logic to send large data in chunks: the handler should call
    # Socket.IO's emit function and emit an event that sends the
    # large data in chunks, e.g. emit('my response', chunkData)
    pass
gRPC is primarily built for one client request and response and WebSocket is for multiple clients.
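A hedged sketch of the broadcast/rooms idea mentioned above (the event names and payload shapes here are illustrative assumptions, not from the original post), building on the socketio object defined in server.py:
from flask_socketio import join_room, emit

@socketio.on('join')
def on_join(room):
    # place the requesting client into a named room
    join_room(room)

@socketio.on('broadcast_message')
def on_broadcast(data):
    # send to every connected client
    emit('my response', data, broadcast=True)

@socketio.on('room_message')
def on_room_message(payload):
    # send only to clients that joined the given room
    emit('my response', payload['data'], room=payload['room'])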

Non-blocking socket connect()'s within asyncio

Here is an example of how to do non-blocking socket connects (as a client) within asyncore. Since this module is deprecated with the recommendation 'Deprecated since version 3.6: Please use asyncio instead.', how is this possible within asyncio? Creating a socket and calling its connect inside a coroutine works synchronously and creates a problem like the one described in the linked question.
A connect inside a coroutine appears synchronous to that coroutine, but is in fact asynchronous with respect to the event loop. This means that you can create any number of coroutines working in parallel without blocking each other, and yet all running inside a single thread.
If you are doing http, look at examples of parallel downloads using aiohttp. If you need low-level TCP connections, look at the examples in the documentation and use asyncio.gather to run them in parallel:
import asyncio

async def talk(host):
    # wait until connection is established, but without blocking
    # other coroutines
    r, w = await asyncio.open_connection(host, 80)
    # use the streams r, w to talk to the server - for example, echo:
    while True:
        line = await r.readline()
        if not line:
            break
        w.write(line)
    w.close()

async def talk_many(hosts):
    coros = [talk(host) for host in hosts]
    await asyncio.gather(*coros)

asyncio.run(talk_many(["host1", "host2", ...]))

aiohttp Error Rate Increases with Number of Connections

I am trying to get the status code from millions of different sites. I am using asyncio and aiohttp, and I run the code below with different numbers of connections (yet the same timeout on each request) but get very different results, specifically a much higher number of the following exception:
'concurrent.futures._base.TimeoutError'
The code
import pandas as pd
import asyncio
import aiohttp

out = []
CONNECTIONS = 1000
TIMEOUT = 10

async def fetch(url, session, loop):
    try:
        async with session.get(url, timeout=TIMEOUT) as response:
            res = response.status
            out.append(res)
            return res
    except Exception as e:
        _exception = 'Error: ' + str(type(e))
        out.append(_exception)
        return _exception

async def bound_fetch(sem, url, session, loop):
    async with sem:
        await fetch(url, session, loop)

async def run(urls, loop):
    tasks = []
    sem = asyncio.Semaphore(value=CONNECTIONS, loop=loop)
    _connector = aiohttp.TCPConnector(limit=CONNECTIONS, loop=loop)
    async with aiohttp.ClientSession(connector=_connector, loop=loop) as session:
        for url in urls:
            task = asyncio.ensure_future(bound_fetch(sem, url, session, loop))
            tasks.append(task)
        responses = await asyncio.gather(*tasks, return_exceptions=True)
    return responses

## BEGIN ##
tlds = open('data/sample_1k.txt').read().splitlines()
urls = ['http://{}'.format(x) for x in tlds[1:]]
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(urls, loop))
ans = loop.run_until_complete(future)
print(str(pd.Series(out).value_counts()))
Results (the value_counts output for CONNECTIONS=1000 and CONNECTIONS=100 is not reproduced here)
Is this a bug? These sites do respond with a status code, and when run sequentially or with fewer connections there is no timeout error, so why is this happening? The other exceptions seem stable as you change the number of connections. The ClientOSErrors are from sites that actually time out or respond; honestly, I don't really know where the concurrent.futures._base.TimeoutError errors are coming from.
Imagine you opened 1000 URLs in a browser simultaneously. I bet you'd notice many of them aren't loaded after 10 seconds. It's not a bug, it's a limit of your machine's resources.
The more parallel requests you make, the less network capacity, CPU time, and RAM each one gets, and the higher the chance that a given request won't finish before its timeout.
If you see many timeouts with 1000 connections, make fewer connections (and maybe increase the timeout). Based on the aiohttp documentation, using different ClientSession instances may also help:
Unless you are connecting to a large, unknown number of different
servers over the lifetime of your application, it is suggested you use
a single session for the lifetime of your application
I've had the same issue; have a look at the details of the ClientOSErrors and you might see Too many open files. If so, you need to increase the OS's limit on open file descriptors.
Either way, you'll get more information if you print the whole exceptions, not just their types.
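For example, a small tweak to the question's fetch coroutine (a sketch, not the answerer's code, reusing the question's out list and TIMEOUT) records the full exception text instead of only its type:
async def fetch(url, session, loop):
    try:
        async with session.get(url, timeout=TIMEOUT) as response:
            res = response.status
            out.append(res)
            return res
    except Exception as e:
        # repr(e) keeps the exception class and its message, e.g.
        # "ClientOSError(24, 'Too many open files')", which makes a
        # file-descriptor limit immediately visible in the value counts
        _exception = 'Error: ' + repr(e)
        out.append(_exception)
        return _exception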

Why is gevent.sleep(0.1) necessary in this example to prevent the app from blocking?

I'm pulling my hair out over this one. I'm trying to get the simplest of examples working with zeromq and gevent. I changed this script to use PUB/SUB sockets and when I run it the 'server' socket loops forever. If I uncomment the gevent.sleep(0.1) line then it works as expected and yields to the other green thread, which in this case is the client.
The problem is, why should I have to manually add a sleep call? I thought when I import the zmq.green version of zmq that the send and receive calls are non blocking and underneath do the task switching.
In other words, why should I have to add the gevent.sleep() call to get this example working? In Jeff Lindsey's original example, he's doing REQ/REP sockets and he doesn't need to add sleep calls...but when I changed this to PUB/SUB I need it there for this to yield to the client for processing.
#Notes: Code taken from slide: http://www.google.com/url?sa=t&rct=j&q=zeromq%20gevent&source=web&cd=27&ved=0CFsQFjAGOBQ&url=https%3A%2F%2Fraw.github.com%2Fstrangeloop%2F2011-slides%2Fmaster%2FLindsay-DistributedGeventZmq.pdf&ei=JoDNUO6OIePQiwK8noHQBg&usg=AFQjCNFa5g9ZliRVoN_yVH7aizU_fDMtfw&bvm=bv.1355325884,d.cGE
#Jeff Lindsey talk on gevent and zeromq
import gevent
from gevent import spawn
import zmq.green as zmq

context = zmq.Context()

def serve():
    print 'server online'
    socket = context.socket(zmq.PUB)
    socket.bind("ipc:///tmp/jeff")
    while True:
        print 'send'
        socket.send("World")
        #gevent.sleep(0.1)

def client():
    print 'client online'
    socket = context.socket(zmq.SUB)
    socket.connect("ipc:///tmp/jeff")
    socket.setsockopt(zmq.SUBSCRIBE, '')
    while True:
        print 'recv'
        message = socket.recv()

cl = spawn(client)
server = spawn(serve)
print 'joinall'
gevent.joinall([cl, server])
print 'end'
I thought when I import the zmq.green version of zmq that the send and receive calls are non blocking and underneath do the task switching.
zmq.green will only yield if these calls would block, it does not yield if they are ready (there's nothing to wait for). In your case the sender is always ready, so it never has a reason to yield.
Some pointers:
a minimal explicit yield is gevent.sleep(0); it doesn't need to be finite.
zmq.green only yields on blocking calls. That is, if a socket is always ready to send/recv when you ask it to, it will never yield.
socket.send only blocks when the socket is not ready to send (not (socket.events & zmq.POLLOUT)), which can never actually be true of a PUB socket (you will see it at HWM for PUSH, DEALER, etc.).
in general, don't trust send to yield, because of the way zeromq works this will rarely be the case unless you are exceeding the capacity of your configuration.
unlike send, recv regularly blocks in normal usage, so it yields on most calls. But if a peer is flooding your incoming buffer, repeated recv calls will not yield until there is nothing ready to receive, so you may again need to explicitly yield every so often to prevent starvation.
What zmq.green amounts to is turning send/recv into:
try:
    socket.send(msg, zmq.NOBLOCK)  # or recv
except zmq.ZMQError as e:
    if e.errno == zmq.EAGAIN:
        yield  # and wait for socket to be ready, then try again
so if send/recv with NOBLOCK are always succeeding, the socket never yields.
To put it another way: If a socket has nothing to wait for, it won't wait.
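Applied to the question's PUB loop, the minimal explicit yield mentioned above would look like this sketch:
def serve():
    print 'server online'
    socket = context.socket(zmq.PUB)
    socket.bind("ipc:///tmp/jeff")
    while True:
        print 'send'
        socket.send("World")
        # explicitly hand control back to the gevent hub so the
        # subscriber greenlet gets a chance to run; sleep(0) suffices
        gevent.sleep(0)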

Resources