What is the best way to pass slice and map structure over channel that is distributed over network? I need to distribute the application running over several EC2 instances and wonder how I can achieve this by communicating each application by Go channel.
Here's the workflow that I would like to run:
1. Process data in one application
2. Distribute the data into 10 replica applications
3. Each 10 application does its job in a separate EC2 instance
4. Once they are all done, they send the result back to the original program
5. This is sent over the channel
Please let me know. Thanks!
If depends on the format you will chose for the serialization.
One well-suited for over-the-network communication is MessagePack (an efficient binary serialization format. It lets you exchange data among multiple languages like JSON. But it's faster and smaller)
A Go library like philhofer/msgp can serializaze any struct (like one with a map), including composite types like maps and arrays.
However, it uses Go1.4 go generate command. (go 1.4rc1 is already out)
From there, a library like docker/libchan can help: Libchan is an ultra-lightweight networking library which lets network services communicate in the same way that goroutines communicate using channels.
Related
Consider two application.
"A" application receives data from internet like player position and other details
"B" application which also needs the player position but it will be blocked from accessing internet. So the only way is to use SQLite sync player position (these frequently updates in milliseconds).
I can't even use socket or any other plugins too. So do you think SQLite can handle read and write in milliseconds without using CPU heavily ?
If you wish to share the data in anything like real-time then I would use something like inter-process pipes or file mapping (memory) for this.
Writing data to and reading it back from any form of hardware storage will add quite a delay to the data passing, which will only become worse as the hardware data cache is filled.
Hardware is okay for historic data.
Both are supported by Win32 and should be accessible even if you use .NET to produce a UWP application.
See here
I am designing a package where I want to provide an API based on the observer pattern: that is, there are points where I'd like to emit a signal that will trigger zero or more interested parties. Those interested parties shouldn't necessarily need to know about each other.
I know I can implement an API like this from scratch (e.g. using a collection of channels or callback functions), but was wondering if there was a preferred way of structuring such APIs.
In many of the languages or frameworks I've played with, there has been standard ways to build these APIs so that they behave the way users expect: e.g. the g_signal_* functions for glib based applications, events and addEventListener() for JavaScript DOM apps, or multicast delegates for .NET.
Is there anything similar for Go? If not, is there some other way of structuring this type of API that is more idiomatic in Go?
I would say that a goroutine receiving from a channel is an analogue of an observer to a certain extent. An idiomatic way to expose events in Go would be thus IMHO to return channels from a package (function). Another observation is that callbacks are not used too often in Go programs. One of the reasons is also the existence of the powerful select statement.
As a final note: some people (me too) consider GoF patterns as Go antipatterns.
Go gives you a lot of tools for designing a signal api.
First you have to decide a few things:
Do you want a push or a pull model? eg. Does the publisher push events to the subscribers or do the subscribers pull events from the publisher?
If you want a push system then having the subscribers give the publisher a channel to send messages on would work really well. If you want a pull method then just a message box guarded with a mutex would work. Other than that without knowing more about your requirements it's hard to give much more detail.
I needed an "observer pattern" type thing in a couple of projects. Here's a reusable example from a recent project.
It's got a corresponding test that shows how to use it.
The basic theory is that an event emitter calls Submit with some piece of data whenever something interesting occurs. Any client that wants to be aware of that event will Register a channel it reads the event data off of. This channel you registered can be used in a select loop, or you can read it directly (or buffer and poll it).
When you're done, you Unregister.
It's not perfect for all cases (e.g. I may want a force-unregister type of event for slow observers), but it works where I use it.
I would say there is no standard way of doing this because channels are built into the language. There is no channel library with standard ways of doing things with channels, there are simply channels. Having channels as built in first class objects frees you from having to work with standard techniques and lets you solve problems in the simplest most natural way.
There is a basic Golang version of Node EventEmitter at https://github.com/chuckpreslar/emission
See http://itjumpstart.wordpress.com/2014/11/21/eventemitter-in-go/
I've got a project to create a roguelike that in some way abstracts the UI from the engine and the engine from map creation, line-of-site, etc. To narrow the focus, i first want to just get the UI (player's client) and engine working.
My current idea is to make the client basically a program that decides what one character (player, monsters) will do for its turn and waits until it can move again. So each monster has a client, and so does the player. The player's client prints the map, waits for input, sends it to the engine, and tells the player what happened. The monster's client does the same except without printing the map and using AI instead of keyboard input.
Before i go any futher, if this seems somehow an obfuscated way of doing things, my goal is to learn, not write a roguelike. It's the journy, not the destination.
And so i need to choose what form of ipc fits this model best.
My first attempt used pipes because they're simplest and i wrote a
UI for the player and a program to pipe in instructions such as
where to put the map and player. While this works, it only allows
one client--communicating through stdin and out.
I've thought about making the engine a daemon that looks in a spool
where clients, when started, create unique-per-client temp files to
give instructions to the engine and recieve feedback.
Lastly, i've done a little introductory programing with sockets.
They seem like they might be the way to go, and would allow the game
to perhaps someday be run over a net. I'd like to, if possible, use
a simpler solution, and since i'm unfamiliar with them, it's more
error prone.
I'm always open to suggestions.
I've been playing around with using these combinations for a similar problem (multiple clients talking via a single daemon on the local box, with much of the intelligence shoved off into the clients).
mmap for sharing large data blobs, with unix domain sockets, messages queues, or named pipes for notification
same, but using individual files per blob instead of munging them all together in an mmap
same, but without the files or mmap (in other words, more like conventional messaging)
In general I like the idea of breaking things up into separate executables this way -- it certainly makes testing easier, for instance. I think the choice of method comes down to usage patterns -- how large are messages, how persistent does the data in them need to be, can you afford the cost of multiple trips through the network stack for a socket-based message, that sort of thing. The fact that you're sticking to Linux makes things easy in terms of what's available -- you don't need to worry about portability of message queues, for instance.
This one's also applicable: https://stackoverflow.com/a/1428542/1264797
I would like to make a system whereby users can upload and download files. The system will have a centralized topography but will rely heavily on peers to transfer relevant data through the central node to other peers. Instead of peers holding entire files I would like for them to hold a compressed an encrypted portion of the whole data set.
Some client uploads file to server anonymously
I would like for the client to be able to upload using some sort of NAT (random ip), realizing that the server would not be able to send confirmation packets back to the client. Is ensuring data integrity feasible with a header relaying the total content length, and disregarding the entire upload if there is a mismatch?
Server indexes, compresses and splits the data into chunks adding identifying bytes to each chunk, encrypts it, and splits the data over the network while mapping the locations of each chunk.
The server will also update the file index for peers upon request. As more data is added to the system, I imagine that the compression can become more efficient. I would like to be able to push these new dictionary entries to peers so they can update both their chunks and the decompression system in the client software, without causing overt network strain. If encrypted, the chunks can be large without any client being aware of having part of x file.
Some client requests a file
The central node performs a lookup to determine the location of the chunks within the network and requests these chunks from peers. Once the chunks have been assembled, they are sent (still encrypted and compressed) to the client, who then translates the content into the decompressed file. It would be nice if an encrypted request could be made through a peer and relayed to a server, and onion routed through multiple paths with end-to-end encryption.
In the background, the server will be monitoring the stability and redundancy of the chunks, and if necessary will take on chunks that near extinction, and either hold them in it's own bank or redistribute them over the network if there are willing clients. In this way, the central node can shrink and grow as appropriate.
The goal is to have a network within which any client can upload or download data with no single other peer knowing who has done either, but with free and open access to all.
The system must be able to handle a massive amount of simultaneous connections while managing the peers and data library without loosing it's head.
What would be your optimal implementation?
Edit : Bounty opened.
Over the weekend, I implemented a system that does basically the above, minus part 1. For the upload, I just implemented SSL instead of forging the IP address. The system is weak in several areas. Files are split into 1MB chunks and encrypted, and sent to registered peers at random. The recipient(s) for each chunk are stored in the database. I fear that this will quickly grow too large to be manageable, but I also want to avoid having to flood the network with chunk requests. When a file is requested, the central node informs peers possessing the chunks that they need to send the chunks to x client (in p2p mode) or to the server (in direct mode), which then transfers the file down. The system is just one big hack, and written in ruby, which I imagine is not really up to the task. For the rewrite, I am considering using C++ with Boost.Asio.
I am looking for general suggestions regarding architecture and system design. I am not at all attached to my current implementation.
Current Topography
Server Handling client uploads,indexing, and content propagation
Server Handling client requests
Client for upload files and requesting files
Client Server accepting chunks and requests
I would like for the client not to have to have a persistent server running, but I can't think of a good way around it.
I would post some of the code but its embarassing. Thanks. Please ask any questions, the basic idea is to have a decent anonymous file sharing model combining the strengths of both the distributed and centralized model of content distribution. If you have a totally different idea, please feel free to post it if you want.
I would like for the client to be able
to upload using some sort of NAT
(random ip), realizing that the server
would not be able to send confirmation
packets back to the client. Is
ensuring data integrity feasible with
a header relaying the total content
length, and disregarding the entire
upload if there is a mismatch?
No, that's not feasible. If your packets are 1500 bytes, and you have 0.1% packetloss, the chance of a one megabyte file being uploaded without any lost packets is .999 ^ (1048576 / 1500) = 0.497, or under 50%. Further, it's not clear how the client would even know if the upload succeeded if the server has no way to send acknowledgements back to the client.
One way around the acknowledgement issue would be to use a rateless code, which allows the client to compute and send an effectively infinite number of unique blocks, such that any sufficiently large subset is enough to reconstruct the original file. This adds a large amount of complexity to both the client and server, however, and still requires some way to notify the client that the server has received the complete file.
It seems to me you're confusing several issues here. If your system has a centralized component to which your clients upload, why do you need to do NAT traversal at all?
For parts two and three of your question, you probably want to research Distributed Hash Tables and content-based addressing (but with major caveats explained here). Preventing the nodes from knowing the content of the files they store could be accomplished by, for example, encrypting the files with the first hash of their content, and storing them keyed by the second hash - this means that anyone who knows the hash of the file can retrieve it, but clients cannot decrypt the files they host.
In general, I would suggest starting by writing down a solid list of goals for the system you're designing, then looking for an architecture that suits those goals. In contrast, it sounds like you have some implicit goals, and have already picked a basic system architecture - which may not suit your full goals - based on that.
Sorry for arriving late at the generous 500 reputation party, but even if i am too late i would like to add a little of my research to your discussion.
Yes such a system would be nice, like Bittorrent but with encrypted files and hashes of the un-encrypted data. In BT you can add encrypted files of course, but then the hashes would be of the encrypted data and thus not possible to identify retrieval-sources without a centralized queryKey->hashCollection storage, i.e. a server that does all the work of identifying package-sources for every client. A similar system was attempted by Freenet (http://freenetproject.org/), although more limited than what you attempt.
For the NAT consideration let's first look at: aClient -> yourServer (and aClient->aClient later)
For the communication between a client and your server the NATs (and firewalls that shield the clients) are not an issue! Since the clients initiate the connection to your server (which has either fixed ip-address or a dns-entry (or dyndns)) you dont even have to think about NATs, the server can respond without an issue since, even if multiple clients are behind a single corporate firewall the firewall (its NAT) will look up with which client the server wants to communicate and forwards accordingly (without you having to tell it to).
Now the "hard" part: client -> client communication through firewalls/NAT: The central technique you can use is hole-punching (http://en.wikipedia.org/wiki/UDP_hole_punching). It works so well it is what Skype uses (from one client behind a corporate firewall to another; (if it does not succeed it uses a mirroring-server)). For this you need both clients to know the address of the other and then shoot some packets at eachother so how do they get eachother's addresses?: Your server gives the addresses to the clients (this requires that not only a requester but also every distributer open a connection to your server periodically).
Before i talk about your concern about data-integrity, the general distinction between packages and packets you could (and i guess should) make:
You just have to consider that you can separate between your (application-domain) packages (large) and the packets used for internet-transmission (small, limited by MTU among other things): It would be slow to have both be the same size, the size of a maximum tcp/ip packet is 576 (minus overhead; take a look here: http://www.comsci.us/datacom/ippacket.html ); you could do some experiments about what a good size for your packages is, my best guess is that 50k-1M would all be fine (but profiling would optimize that since we dont if most of the files you want to distribute are large or small).
About data-integrity: For your packages you definitely need a hash, i would recommend to directly take a cryptographic hash since this prevents tampering (in addition to corruption); you dont need to record the size of the packages since if the hash is faulty you have to re-transmit the package anyways. Bear in mind, that this kind of package-corruption is not very frequent if you use TCP/IP for packet transmission (yes, you can use TCP/IP even in your scenario), it automatically corrects (re-requests) transmission-errors. The huge advantage is that all computers and routers in between know TCP/IP and check for corruption automatically on every step in between the source and destination computer, so they can re-request the packet themselves which makes it very fast. They would not know about a packet-integrity-protocol you implement yourself so with that custom protocol the packet has to arrive at the destination before the re-request can even start.
For the next thought let's call the client which publishes a file the "publisher", i know this is kind of obvious, however it is important to distinguish this from "uploader", since the client does not need to upload the file to your server (just some info about it, see below).
Implementing the central indexing-server should be no problem, the problem would be that you plan to have it encrypt all the files itself instead of making the publisher do that heavy work (good encryption is very heavy lifting). The only problem with having the publisher (not the server) encrypt the data is, that you have to trust the publisher to give you reasonable search-keywords: theoretically it could give you a very attractive search-keyword every client desires together with a reference to bogus data (encrypted data is hard to distinguish from random data). But the solution to this problem is crowd-sourcing: make your server store a user-rating so downloaders can vote on files. The implementation of the table you need could be a regular-old hash-table of individual search-keywords to client-ID's (see below) who have that package. The publisher is at first the only client that holds the data, but every client that downloaded at least one of the packages should then be added to the hash-table's entry, so if the publisher goes offline and every package has been downloaded by at least one client everything continues working. Now critically the mapping client-ID->IP-Addresses is non-trivial because it changes often (e.g. every 24 hours for many clients); to compensate, you have to have another table on your server that makes this mapping and make the clients contact the server periodically (e.g. every hour) to tell it its IP-address. I would recommend using a crypto-hash for client-ID's so that it is impossible for one client to trash this table by telling you fake ID's.
For any questions and criticism, please comment.
I am not sure having one central point of attack (the central server) is a very good idea. That of course depends on the kind of problems you want to be able to handle. It also limits your scalability a lot
Does google protocol buffers support stl vectors, maps and boost shared pointers? I have some objects that make heavy use of stl containers like maps, vectors and also boost::shared_ptr. I want to use google protocol buffers to serialize these objects across the network to different machines.
I want to know does google protobuf support these containers? Also if I use apache thrift instead, will it be better? I only need to serialize/de-serialize data and don't need the network transport that apache thrift offers. Also apache thrift not having proper documentation puts me off.
Protocol buffers directly handles an intentionally small number of constructs; vectors map nicely to the "repeated" element type, but how this is presented in C++ is via "add" methods - you don't (AFAIK) just hand it a vector. See "Repeated Embedded Message Fields" here for more info.
Re maps; there is no inbuilt mechanism for that, but a key/value pair is easily represented in .proto (typically key = 1, value = 2) and then handled via "repeated".
A shared_ptr itself would seem to have little meaning in a serialized file. But the object may be handled (presumably) as a message.
Note that in the google C++ version the DTO layer is generated, so you may need to map between them and any existing object model. This is usually pretty trivial.
For some languages/platforms there are protobuf variants that work against existing object models.
(sorry, I can't comment on thrift - I'm not familiar with it)