Handling concurrent neo4j connections with goroutines - go

We've been tasked with migrating quite a bit of xml data (1.27 million xml files, one node with properties per file) into a Neo graph, and we've been using go routines to chew through files, parse xml, prepare cypher queries for inserts etc. Due to the architecture of having to process the xml, we are using go routines concurrently with channels to process each file in threads, throttling the number of workers going on at one time.
The issue i'm having is that I'm running into errors like "tcp connection reset by peer" and also "panic: Couldn't read expected bytes for message length. Read: 0 Expected: 2." and I can only imagine this is due to running connections and statements concurrently in our workers. Our throttling has us at 100 concurrent workers, and I wouldn't think this would be a major problem for Neo, but I just can't figure out why its choking on us.
Are there any architecture recommendations out there for handling a use case like this, where we have to run single cypher statements in large numbers of worker routines (in our case 100 at a time)?
Currently, we're walking a file tree to build up a queue of files to process, then after the walk is done, we iterate that queue and fire off go routines to process each file, using a buffered throttle channel to block the firing off of new routines until previous routines have finished. Within each routine i'm spinning up a new connection, prepare statement, execute, close, etc..
I see this package offers Pipelines, but i'm just not sure how to use that within the processing/queue/channel architecture that we've got going currently:
https://github.com/johnnadratowski/golang-neo4j-bolt-driver
I've also tried using:
https://github.com/go-cq/cq
But keep getting tcp connection reset by peer errors when trying to connect to Neo concurrently.

It is possible you are using thread unsafe functionality in neo4j-bolt-driver.
There are 2 versions of drivers provided by Neo4j-bolt-driver:
Driver Plain driver
DriverPool Driver which manages connection pool
The driver objects themselves are thread-safe, but the Conn object which represent the underlying connection are not. You maybe using the Conn object in a way that its not meant to be.
With goroutines, its best to create Conn objects using DriverPool methods. When Close is called on the connection, it doesn't necessarily close the underlying connection and reclaims the connection for reuse.

Related

IO callbacks on goroutine

I'm a beginner at golang. Looking at all golang tutorials, it looks you should create goroutines for everything. Coming from something like libuv in C where you can define callbacks for socket read/write on a single thread, is the right way to achieve that in golang to create nested goroutines for any IO tasks needed?
As an example, take something like nginx where a single thread will handle multiple connections. To do something like that in golang, we would need a goroutine for every connection?
Go stands out in the area of tools to write networked services specifically because of the fact it has I/O-awareness integrated right into the runtime scheduler powering any running GO program.
The basic idea is roughly like this: a goroutine performs normal, sequential, callback-free operations on sockets — that is, plain reads and plain writes, — and as soon as the next I/O operation would block (yes, the relevant syscall on a Unix-like kernel returns EWOULDBLOCK), the goroutine is suspended, its socket is handed out into a component of the runtime called "netpoller", which is implemented using the platform-native socket I/O multiplexor such as epoll, kqueue or IOCP, and the OS thread the goroutine was running on is handed off to another goroutine which wants to run. As soon as the netpoller signals the I/O on the socket caused the goroutine to suspend can proceed, the scheduler queues that goroutine for execution and then it contnues to run exactly where it left off.
Because of this, the usual model employed when writing networking services in Go is to have one goroutine per socket. When you're writing plain TCP server, you should create a goroutine yourself (and hand it the socket returned by the listener once it accepted a client's connection).
net/http.Server has this behaviour built-in as it creates a goroutine to serve each incoming client request (actually, for HTTP/1.x, two or even three goroutines are created per connection, but it's invisible to HTTP request handlers).
Now, we've just covered the basics. Of course, there might exist legitimate reasons to have extra goroutines to handle tasks needed to be carried out to complete a request, and that's what #Volker referred to.
More info:
"What color is your function?" — a classical essay dealing with I/O multiplexing implemented as a library vs it being implemented in the core.
"Go's work-stealing scheduler"; also see this and this and this design doc.
State threads library which implements the approach quite similar to that of Go, just on much lower level. Its documentation is quite insightful on the approach implemented in Go.
libtask is a much more recent stab at
the same problem, by one of Go's creators.

Broadcast Server

I am writing a TCP Server that accepts connections from multiple clients, this server gathers data from the system that it's running on and transmits it to every connected client.
What design patterns would be best for this situation?
Example
Put all connections in an array, then loop through the array and send the data to each client one by one. Advantage: very easy to implement. Disadvantage: not very efficient when handling large amounts of data.
An easier way is to use some existing software to do this ... For example use https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityUuid=d5bedadd-e46f-4c97-af89-22d65ffee070 .
In case you want to write on your own you will need a list(linked list) to manage the connections.
Here is an example of a server http://examples.oreilly.com/jenut/Server.java
If you want to handle large amounts of data, one of the techniques is to have a queue associated with each of the subscribers at the server end. A multi-threaded program can send the data to the clients from those queues.
A number of patterns have been developed for distributed processing and servers, for instance in the ACE project: http://www.cs.wustl.edu/~schmidt/patterns-ace.html. The design might be focused around the events which announce either that data has been received and may be read, or that buffers have been emptied and more data may now be written. At least in the days when a 32-bit address space was the rule, you could have many more open connections than you had threads, so you would typically have a small number of threads waiting for events which announced that they could safely read or write without stalling until the other side co-operated. This may come from events, or from calls such as select() or poll() terminating. Patterns are also associated with http://zguide.zeromq.org/page:all#-MQ-in-a-Hundred-Words.

Client to server connection only sending not receiving

This is my case, I have a server listening for connections, and a client that I'm programming now. The client has nothing to receive from the server, yet it has to be sending status updates every 3 minutes.
I have the following at the moment:
WSAStartup(0x101,&ws);
sock = socket(AF_INET,SOCK_STREAM,0);
sa.sin_family = AF_INET;
sa.sin_port = htons(PORT_NET);
sa.sin_addr.s_addr = inet_addr("127.0.0.1");
connect(sock,(SOCKADDR*)&sa,sizeof(sa));
send(sock,(const char*)buffer,128,NULL);
How should my approach be? Can I avoid looping recv?
That's rather dependant on what behaviour you want and your program structure.
By default a socket will block on any read or write operations, which means that if your try and have your server's main thread poll the connection, you're going to end up with it 'freezing' for 3 minutes or until the client closes the connection.
The absolute simplest functional solution (no multithreadding) is to set the socket to non-blocking, and poll in in the main thread. It sounds like you want to avoid doing that though.
The most obvious way around that is to make a dedicated thread for every connection, plus the main listener socket. Your server listens for incoming connections and spawns a thread for each stream socket it creates. Then each connection thread blocks on it's socket until it receives data, and either handles it itself or shunts it onto a shared queue.
That's a bulky and complex solution - multiple threads which need opening and closing, shared resources which need protecting.
Another option is to set the socket to non-blocking (Under win32 use setsockopt so set a timeout, under *nix pass it the O_NONBLOCK flag). That way it will return control if there's no data available to read. However that means you need to poll the socket at reasonable intervals ("reasonable" being entirely up to you, and how quickly you need the server to act on new data.)
Personally, for the lightweight use you're describing I'd use a combination of the above: A single dedicated thread which polls a socket (or an array of nonblocking sockets) every few seconds, sleeping in between, and simply pushed the data onto a queue for the main thread to act upon during it's main loop.
There are a lot of ways to get into a mess with asynchronous programs, so it's probably best to keep it simple and get it working, until you're comfortable with the control flow.

Is this a scaleable named pipe server implementation?

Looking at this example of named pipes using Overlapped I/O from MSDN, I am wondering how scaleable it is? If you spin up a dozen clients and have them hit the server 10x/sec, with each batch of 10 being immediately after the other, before sleeping for a full second, it seems that eventually some of the instances of the client are starved.
Server implementation:
http://msdn.microsoft.com/en-us/library/aa365603%28VS.85%29.aspx
Client implementation (assuming call is made 10x/sec, and there are a dozen instances).
http://msdn.microsoft.com/en-us/library/aa365603%28VS.85%29.aspx
The fact that the web page points out that:
The pipe server creates a fixed number of pipe instances.
and
Although the example shows simultaneous operations on different pipe instances, it avoids simultaneous operations on a single pipe instance by using the event object in the OVERLAPPED structure. Because the same event object is used for read, write, and connect operations for each instance, there is no way to know which operation's completion caused the event to be set to the signaled state for simultaneous operations using the same pipe instance
you can probably safely assume that it's not as scalable as it could be; it's an API usage example after all; demonstration of functionality is usually the most important design constraint for such code.
If you need 12 clients making 10 connections per second then I'd personally have the server able to handle MORE than just 12 clients to allow for the period when the server is preparing for a new client to connect... Personally I'd switch to using sockets but that's just me (and I'm skewed that way because I've done lots of high performance socket's work and so have all the code)...

How does a non-forking web server work?

Non-forking (aka single-threaded or select()-based) webservers like lighttpd or nginx are
gaining in popularity more and more.
While there is a multitude of documents explaining forking servers (at
various levels of detail), documentation for non-forking servers is sparse.
I am looking for a bird eyes view of how a non-forking web server works.
(Pseudo-)code or a state machine diagram, stripped down to the bare
minimum, would be great.
I am aware of the following resources and found them helpful.
The
World of SELECT()
thttpd
source code
Lighttpd
internal states
However, I am interested in the principles, not implementation details.
Specifically:
Why is this type of server sometimes called non-blocking, when select() essentially blocks?
Processing of a request can take some time. What happens with new requests during this time when there is no specific listener thread or process? Is the request processing somehow interrupted or time sliced?
Edit:
As I understand it, while a request is processed (e.g file read or CGI script run) the
server cannot accept new connections. Wouldn't this mean that such a server could miss a lot
of new connections if a CGI script runs for, let's say, 2 seconds or so?
Basic pseudocode:
setup
while true
select/poll/kqueue
with fd needing action do
read/write fd
if fd was read and well formed request in buffer
service request
other stuff
Though select() & friends block, socket I/O is not blocking. You're only blocked until you have something fun to do.
Processing individual requests normally involved reading a file descriptor from a file (static resource) or process (dynamic resource) and then writing to the socket. This can be done handily without keeping much state.
So service request above typically means opening a file, adding it to the list for select, and noting that stuff read from there goes out to a certain socket. Substitute FastCGI for file when appropriate.
EDIT:
Not sure about the others, but nginx has 2 processes: a master and a worker. The master does the listening and then feeds the accepted connection to the worker for processing.
select() PLUS nonblocking I/O essentially allows you to manage/respond to multiple connections as they come in a single thread (multiplexing), versus having multiple threads/processes handle one socket each. The goal is to minimize the ratio of server footprint to number of connections.
It is efficient because this single thread takes advantage of the high level of active socket connections required to reach saturation (since we can do nonblocking I/O to multiple file descriptors).
The rationale is that it takes very little time to acknowledge bytes are available, interpret them, then decide on the appropriate bytes to put on the output stream. The actual I/O work is handled without blocking this server thread.
This type of server is always waiting for a connection, by blocking on select(). Once it gets one, it handles the connection, then revisits the select() in an infinite loop. In the simplest case, this server thread does NOT block any other time besides when it is setting up the I/O.
If there is a second connection that comes in, it will be handled the next time the server gets to select(). At this point, the first connection could still be receiving, and we can start sending to the second connection, from the very same server thread. This is the goal.
Search for "multiplexing network sockets" for additional resources.
Or try Unix Network Programming by Stevens, Fenner, Rudoff

Resources