centralized / distributed sharing - algorithm

I would like to make a system whereby users can upload and download files. The system will have a centralized topography but will rely heavily on peers to transfer relevant data through the central node to other peers. Instead of peers holding entire files I would like for them to hold a compressed an encrypted portion of the whole data set.
Some client uploads file to server anonymously
I would like for the client to be able to upload using some sort of NAT (random ip), realizing that the server would not be able to send confirmation packets back to the client. Is ensuring data integrity feasible with a header relaying the total content length, and disregarding the entire upload if there is a mismatch?
Server indexes, compresses and splits the data into chunks adding identifying bytes to each chunk, encrypts it, and splits the data over the network while mapping the locations of each chunk.
The server will also update the file index for peers upon request. As more data is added to the system, I imagine that the compression can become more efficient. I would like to be able to push these new dictionary entries to peers so they can update both their chunks and the decompression system in the client software, without causing overt network strain. If encrypted, the chunks can be large without any client being aware of having part of x file.
Some client requests a file
The central node performs a lookup to determine the location of the chunks within the network and requests these chunks from peers. Once the chunks have been assembled, they are sent (still encrypted and compressed) to the client, who then translates the content into the decompressed file. It would be nice if an encrypted request could be made through a peer and relayed to a server, and onion routed through multiple paths with end-to-end encryption.
In the background, the server will be monitoring the stability and redundancy of the chunks, and if necessary will take on chunks that near extinction, and either hold them in it's own bank or redistribute them over the network if there are willing clients. In this way, the central node can shrink and grow as appropriate.
The goal is to have a network within which any client can upload or download data with no single other peer knowing who has done either, but with free and open access to all.
The system must be able to handle a massive amount of simultaneous connections while managing the peers and data library without loosing it's head.
What would be your optimal implementation?
Edit : Bounty opened.
Over the weekend, I implemented a system that does basically the above, minus part 1. For the upload, I just implemented SSL instead of forging the IP address. The system is weak in several areas. Files are split into 1MB chunks and encrypted, and sent to registered peers at random. The recipient(s) for each chunk are stored in the database. I fear that this will quickly grow too large to be manageable, but I also want to avoid having to flood the network with chunk requests. When a file is requested, the central node informs peers possessing the chunks that they need to send the chunks to x client (in p2p mode) or to the server (in direct mode), which then transfers the file down. The system is just one big hack, and written in ruby, which I imagine is not really up to the task. For the rewrite, I am considering using C++ with Boost.Asio.
I am looking for general suggestions regarding architecture and system design. I am not at all attached to my current implementation.
Current Topography
Server Handling client uploads,indexing, and content propagation
Server Handling client requests
Client for upload files and requesting files
Client Server accepting chunks and requests
I would like for the client not to have to have a persistent server running, but I can't think of a good way around it.
I would post some of the code but its embarassing. Thanks. Please ask any questions, the basic idea is to have a decent anonymous file sharing model combining the strengths of both the distributed and centralized model of content distribution. If you have a totally different idea, please feel free to post it if you want.

I would like for the client to be able
to upload using some sort of NAT
(random ip), realizing that the server
would not be able to send confirmation
packets back to the client. Is
ensuring data integrity feasible with
a header relaying the total content
length, and disregarding the entire
upload if there is a mismatch?
No, that's not feasible. If your packets are 1500 bytes, and you have 0.1% packetloss, the chance of a one megabyte file being uploaded without any lost packets is .999 ^ (1048576 / 1500) = 0.497, or under 50%. Further, it's not clear how the client would even know if the upload succeeded if the server has no way to send acknowledgements back to the client.
One way around the acknowledgement issue would be to use a rateless code, which allows the client to compute and send an effectively infinite number of unique blocks, such that any sufficiently large subset is enough to reconstruct the original file. This adds a large amount of complexity to both the client and server, however, and still requires some way to notify the client that the server has received the complete file.
It seems to me you're confusing several issues here. If your system has a centralized component to which your clients upload, why do you need to do NAT traversal at all?
For parts two and three of your question, you probably want to research Distributed Hash Tables and content-based addressing (but with major caveats explained here). Preventing the nodes from knowing the content of the files they store could be accomplished by, for example, encrypting the files with the first hash of their content, and storing them keyed by the second hash - this means that anyone who knows the hash of the file can retrieve it, but clients cannot decrypt the files they host.
In general, I would suggest starting by writing down a solid list of goals for the system you're designing, then looking for an architecture that suits those goals. In contrast, it sounds like you have some implicit goals, and have already picked a basic system architecture - which may not suit your full goals - based on that.

Sorry for arriving late at the generous 500 reputation party, but even if i am too late i would like to add a little of my research to your discussion.
Yes such a system would be nice, like Bittorrent but with encrypted files and hashes of the un-encrypted data. In BT you can add encrypted files of course, but then the hashes would be of the encrypted data and thus not possible to identify retrieval-sources without a centralized queryKey->hashCollection storage, i.e. a server that does all the work of identifying package-sources for every client. A similar system was attempted by Freenet (http://freenetproject.org/), although more limited than what you attempt.
For the NAT consideration let's first look at: aClient -> yourServer (and aClient->aClient later)
For the communication between a client and your server the NATs (and firewalls that shield the clients) are not an issue! Since the clients initiate the connection to your server (which has either fixed ip-address or a dns-entry (or dyndns)) you dont even have to think about NATs, the server can respond without an issue since, even if multiple clients are behind a single corporate firewall the firewall (its NAT) will look up with which client the server wants to communicate and forwards accordingly (without you having to tell it to).
Now the "hard" part: client -> client communication through firewalls/NAT: The central technique you can use is hole-punching (http://en.wikipedia.org/wiki/UDP_hole_punching). It works so well it is what Skype uses (from one client behind a corporate firewall to another; (if it does not succeed it uses a mirroring-server)). For this you need both clients to know the address of the other and then shoot some packets at eachother so how do they get eachother's addresses?: Your server gives the addresses to the clients (this requires that not only a requester but also every distributer open a connection to your server periodically).
Before i talk about your concern about data-integrity, the general distinction between packages and packets you could (and i guess should) make:
You just have to consider that you can separate between your (application-domain) packages (large) and the packets used for internet-transmission (small, limited by MTU among other things): It would be slow to have both be the same size, the size of a maximum tcp/ip packet is 576 (minus overhead; take a look here: http://www.comsci.us/datacom/ippacket.html ); you could do some experiments about what a good size for your packages is, my best guess is that 50k-1M would all be fine (but profiling would optimize that since we dont if most of the files you want to distribute are large or small).
About data-integrity: For your packages you definitely need a hash, i would recommend to directly take a cryptographic hash since this prevents tampering (in addition to corruption); you dont need to record the size of the packages since if the hash is faulty you have to re-transmit the package anyways. Bear in mind, that this kind of package-corruption is not very frequent if you use TCP/IP for packet transmission (yes, you can use TCP/IP even in your scenario), it automatically corrects (re-requests) transmission-errors. The huge advantage is that all computers and routers in between know TCP/IP and check for corruption automatically on every step in between the source and destination computer, so they can re-request the packet themselves which makes it very fast. They would not know about a packet-integrity-protocol you implement yourself so with that custom protocol the packet has to arrive at the destination before the re-request can even start.
For the next thought let's call the client which publishes a file the "publisher", i know this is kind of obvious, however it is important to distinguish this from "uploader", since the client does not need to upload the file to your server (just some info about it, see below).
Implementing the central indexing-server should be no problem, the problem would be that you plan to have it encrypt all the files itself instead of making the publisher do that heavy work (good encryption is very heavy lifting). The only problem with having the publisher (not the server) encrypt the data is, that you have to trust the publisher to give you reasonable search-keywords: theoretically it could give you a very attractive search-keyword every client desires together with a reference to bogus data (encrypted data is hard to distinguish from random data). But the solution to this problem is crowd-sourcing: make your server store a user-rating so downloaders can vote on files. The implementation of the table you need could be a regular-old hash-table of individual search-keywords to client-ID's (see below) who have that package. The publisher is at first the only client that holds the data, but every client that downloaded at least one of the packages should then be added to the hash-table's entry, so if the publisher goes offline and every package has been downloaded by at least one client everything continues working. Now critically the mapping client-ID->IP-Addresses is non-trivial because it changes often (e.g. every 24 hours for many clients); to compensate, you have to have another table on your server that makes this mapping and make the clients contact the server periodically (e.g. every hour) to tell it its IP-address. I would recommend using a crypto-hash for client-ID's so that it is impossible for one client to trash this table by telling you fake ID's.
For any questions and criticism, please comment.

I am not sure having one central point of attack (the central server) is a very good idea. That of course depends on the kind of problems you want to be able to handle. It also limits your scalability a lot

Related

Detecting duplicates in a data integration system

I am looking for ways to avoid the transfer of duplicate files when transferring through HTTP and SFTP. My system stores the state of the transfer each time a transfer is performed into an external cache.
Before each transfer, I look up the external cache and if there is an entry for the current file with the status SUCCESS, the file will be skipped. This works well as long as my system is able to store the status in the cache each time the transfer happens. But in cases when the transfer is done and before writing the status of the transfer, the service dies, the service has no clue about the transfer and the next time the same file comes, I will re-transfer the file.
One way to improve this is to update the cache before and after the transfer is done so that I will have some clue about the file. But is there any other way to avoid this? Because once the file is transferred to the external system, there is no way to undo it when the writing of the status fails. Any thoughts?
I routinely synchronize external data and have written enough mastering processes to speak on the subject. You are asking for logistics solutions without even mentioning the context of the data and its purpose in being delivered to another location.
Are you trying to mirror a master copy of the file to another location? If so, then you need to simply deliver the file with a unique delivery number attached, allowing the recipient to independently synchronize both data sets and handle any detected differences in the files. If you are forcibly doing this work on behalf of the recipient, you may be destroying data. I consistently recommend having the recipient pull the data themselves as needed and synchronize/master it themselves, rather than pushing it. That way these business rules are organized where they should be. Push processes are bad.
Are you trying to allow users to overwrite a master file with their own copies, asking how to coordinate their uploads so that the file isn't overwritten? If so, you need to take away their direct control to overwrite that file. You need to separately synchronize each file according to a user-defined process, because each can have its own business rules.
When you say "look up the external cache and if there is an entry for the current file with the status SUCCESS, the file will be skipped", you have given far too much responsibility to the deliverer. I say that, but how do you know? In manufacturing, no deliverer would be expected to do more than carry the load. Consumers are responsible for allocating that space. If the consumer truly needs the file, let it make the decision to order it and handle receiving it, rather than having the deliverer juggle such decisions.

Save file from POST request to disk without storing in memory with Python's BaseHTTPServer

I'm writing an HTTP server in Python 2 with BaseHTTPServer, and it's assumed that it accepts multiple connections at the same time, on each connection the user can send a large file through a POST request. However my understanding is that the whole request will be stored in the server's memory before being processed, and multiple uploaded file at the same time can exceed the amount of memory on the server. Is there any way to, instead of storing the file/request in memory, stream it to a file on disk directly?
BaseHTTPServer doesn't come with a POST handler out of the box, so you'll have to implement it yourself or find an implementation that works for you. (These are easy to search for; here's one I found that looked straightforward.
Your question is similar to this question about limiting the max-size of POST; the answer points out you'll need to read through all that data in order to ensure proper browser functionality. The comments to that answer seem to indicate the use of other techniques ("e.g. AJAX and realtime notifications via WebSocket." #dmitry-nedbaylo)

Techniques/algorithms used in WAN optimization

What are the techniques/algorithms used in WAN optimization? I am looking for a reference which can give a good theory supported with code examples, I have taken a look in Steelhead manual from Riverbed and I found the following main techniques used in:
SDR (Scalable Data Referencing): which breaks up TCP data into
unique data chunks, each chunk has a reference number, where when the
same byte sequence occurs in future transmission, the reference
number is only sent across the LAN instead of raw data chunks.
Connection pooling: The product creates pools of idle TCP
connection (for HTTP as example), where when a client tries to create
a new connection to a previously visited target, it uses one from its
pool, which, in turns, overcomes three-way TCP handshake.
The product reduces the number of round trips over WAN for common
actions (opening/editing remote shared files/folders), it supports
most of intended protocols: CIFS, MAPI, HTTP … etc.
Data compression.
Through my search I found 3 open source projects aim to do WAN optimization, these are:
TrafficSqueezer
WANProxy
OpenNOP
TrafficSqueezer seems to have more features but the comments in its page in sorceforge do not give a good sense about it. I tried to find a document within these projects with good info but I couldn't.
the techniques that can reduce the traffic amount most - are of course compression and data deduplication (both WAN optimisers built up the same data based on a algorithm on memory or HDD - as soon as there is again the same traffic pattern - the pattern is replaced with a pointer to the data and a length - therefore you can save up to 99% when you transfer the same file twice, but even different files have a lot of common data where deduplication can optimise a lot!).
(you will find a lot of sources on the web: e.g. http://www.computerweekly.com/feature/How-data-deduplication-works)
in your example this is technique called SDR.
Riverbed has also a lot of protocol support - which makes for e.g. CIFS, SMB and MAPI more delay aware (e.g. a lot of packages are buffered and sent once - so save roundtrips)
Also F5 does e.g. FTP and HTTP optimisations to get those more performant.
when there is a lot delay on the WAN link - of course you can also save time with connection pooling - so pre-established TCP sessions (you can save the time that would be needed for a tcp 3way handshake)
so at a glance:
-data deduplication
-connection pooling
-compression
-protocol optimisation
i am sure you can find a lot in the f5 doku (F5 WOM is the product), bluecoat does offer WAN optimisation as well and of course Riverbed. also silverpeak might be worth a try.
for the opensouce ones i only have experiences on traffic squeezer, but there hasn't been a comparable feature-set to commercial products this time.

How can one detect changes in a directory across program executions?

I am making a protocol, client and server which provide file transfer functionality similar to FTP (among other features). One difference between my protocol and FTP is that I would like to store a copy of the remote server's directory structure in a local cache. The server will only be running on Windows (written in C++) so any applicable Win32 API calls would be appreciated (if any). When initially connected, the client requests the immediate children (both files and directories, just like "ls" or "dir" with no options), then when a user navigates into a directory, this step repeats with the new parent like you might expect.
Of course, most of the time, if the same directory of a given server is requested twice by a client, the directory's contents will be the same. Therefore I would like to cache the results of each directory listing on the client. I would like a simple way of implementing this, but it would need to take into account expiring cache entries because of file/directory access and modification time and name changes, which is the tricky part. I would ideally like something which would enable almost instant directory listings by the client, with something like a hash which takes into account not only file contents, but also changes in subdirectories' contents' filenames, data, modification and access dates, etc.
This is NOT something that could completely rely on FileSystemWatcher (or similar) objects because it would need to maintain this cache even if the program is only run occasionally. Of course these would be nice to help maintain the cache, but that's only part of the problem.
My best(?) idea so far is using FindFirstFile() and FindNextFile(), and sorting (somehow), concatenating and hashing values found in the WIN32_FIND_DATA structs (with file contents maybe), and using that as a token for expiration (just to indicate change in any of these fields). Then I would have one of these tokens for each directory. When a directory is requested, the server would hash everything and compare that to the cached hash provided by the client, and if it's different, return the normal data, otherwise an HTTP 304 equivalent. Is there a less elaborate way of doing something like this? Does "directory last modified date" take into account every one of its subdirectories' files' modification dates under all circumstances? I'm sure that the built-in Windows indexing service has something like this but ideally I wouldn't need to rely on it.
Because this service is for file sharing, something involving hashes would be especially nice so that I could automatically and efficiently find other people who are sharing a given file, but that's less of a concern then hosing the disk during the hash calculation.
I'm wondering what others who are more experienced than I am with programming would do to solve this problem (rsync and subversion have solved similar problems but not identical).
You're asking a lot of a File System Implementation of Very Little Brain (with apologies to A. A. Milne).
This is actually well-trammeled ground and you'd do well to look at the existing literature on distributed filesystems. AFS comes to mind as an example of a very well studied approach.
I doubt you'll be able to come up with something useful and accurate without doing some serious homework. Put another way, 'twould be folly to ignore all the prior art.

Techniques to reduce data harvesting from AJAX/JSON services

I was wondering if anyone had come across any techniques to reduce the chances of data exposed through JSON type services on the server (intended to supply AJAX functions) from being harvested by external agents.
It seems to me that the problem is not so difficult if you had say a Flash client consuming the data. Then you could send encrypted data to the client, which would know how to decrypt it. The same method seems impossible with AJAX though, due to the open nature of the Javascript source.
Has anybody implemented a clever technique here?
Whatever the method, it should still allow a genuine AJAX function to consume the data.
Note that I'm not really talking about protecting 'sensitive' information here, the odd record leaking out is not a problem. Rather I am thinking about stopping a situation where the whole DB is hoovered up by bots (either in one go, or gradually over time).
Thanks.
First, I would like to clear on this:
It seems to me that the problem is not
so difficult if you had say a Flash
client consuming the data. Then you
could send encrypted data to the
client, which would know how to
decrypt it. The same method seems
impossible with AJAX though, due to
the open nature of the Javascrip
source.
It will be pretty obvious the information is being sent encrypted to the flash client & it won't be that hard for the attacker to find out from your flash compiled program what's being used for this - replicate & get all that data.
If the data does happens to have the value you are thinking, you can count on the above.
If this is public information, embrace that & don't combat it - instead find ways to capitalize on it.
If this is information that you are only exposing to a set of users, make sure you have the corresponding authentication / secure communication. Track usage as others have said, and have measures that act on it,
The first thing to prevent bots from stealing your data is not technological, it's legal. First, make sure you have the right language in your site's Terms of Use that what you're trying to prevent is actually disallowed and defensible from a legal standpoint. Second, make sure you design your technical strategy with legal issues in mind. For example, in the US, if you put data behind an authentication barrier and an attacker steals it, it's likely a violation of the DMCA law. Third, find a lawyer who can advise you on IP and DMCA issues... nice folks on StackOverflow aren't enough. :-)
Now, about the technology:
A reasonable solution is to require that users be authenticated before they can get access to your sensitive Ajax calls. This allows you to simply monitor per-user usage of your Ajax calls and (manually or automatically) cancel the account of any user who makes too many requests in a particular time period. (or too many total requests, if you're trying to defend against a trickle approach).
This approach of course is vulnerable to sophisticated bots who automatically sign up new "users", but with a reasonably good CAPTCHA implementation, it's quite hard to build this kind of bot. (see "circumvention" section at http://en.wikipedia.org/wiki/CAPTCHA)
If you are trying to protect public data (no authentication) then your options are much more limited. As other answers noted, you can try IP-address-based limits (and run afoul of large corporate proxy users) but sophisticated attackers can get around this by distributing the load. There's also likley sophisticated software which watches things like request timing, request patterns, etc. and tries to spot bots. Poker sites, for example, spend a lot of time on this. But don't expect these kinds of systems to be cheap. One easy thing you can do is to mine your web logs (e.g. using Splunk) and find the top N IP addresses hitting your site, and then do a reverse-IP lookup on them. Some will be legitimate corporate or ISP proxies. But if you recognize a compeitor's domain name among the list, you can block their domain or follow up with your lawyers.
In addition to pre-theft defense, you might also want to think about inserting a "honey pot": deliberately fake information that you can track later. This is how, for example, maps manufacturers catch plaigarism: they insert a fake street in their maps and see which other maps show the same fake street. While this doesn't prevent determined folks from sucking out all your data, it does let you find out later who's re-using your data. This can be done by embedding unique text strings in your text output, and then searching for those strings on Google later (assuming your data is re-usable on another public website). If your data is HTML or images, you can include an image which points back to your site, and you can track who is downloading it, and look for patterns you can use to bust the freeloaders.
Note that the javascript encryption approach noted in one of the other answers won't work for non-authenticated sessions-- an attacker can simply download the javascript and run it just like a regular browser would. Moral of the story: public data is essentially indefensible. If you want to keep data protected, put it behind an authentication barrier.
This is obvious, but if your data is publicly searchable by search engines, you'll both need a non-AJAX solution for them (Google won't read your ajax data!) and you'll want to mark those pages NOARCHIVE so your data doesn't show up in Google's cache. You'll also probably want a white list of search engine crawler IP addreses which you allow into your search-engine-crawlable pages (you can work with Google, Bing, Yahoo, etc. to get these), otherwise malicious bots could simply impersonate Google and get your data.
In conclusion, I want to echo #kdgregory above: make sure that the threat is real enough that it's worth the effort required. Many companies overestimate the interest that other people (both legitimate customers and nefarious actors) have in their business. It might be that yours is an oddball case where you have particularly important data, it's particularly valuable to obtain, it must be publicly accessible without authentication, and your legal recourses will be limited if someone steals your data. But all those together is admittedly an unusual case.
P.S. - another way to think about this problem which may or may not apply in your case. Sometimes it's easier to change how your data works which obviates securing it. For example, can you tie your data in some way to a service on your site so that the data isn't very useful unless it's being used in conjunction with your code. Or can you embed advertising in it, so that wherever it's shown you get paid? And so on. I don't know if any of these mitigations apply to your case, but many businesses have found ways to give stuff away for free on the Internet (and encourage rather than prevent wide re-distribution) and still make money, so a hybrid free/pay strategy may (or may not) be possible in your case.
If you have an internal Memcached box, you could consider using a technique where you create an entry for each IP that hits your server with an hour expiration. Then increment that value each time the IP hits your AJAX endpoint. If the value gets over a particular threshold, fry the connection. If the value expires in Memcached, you know it isn't getting "hoovered away".
This isn't a concrete answer with a proof of concept, but maybe a starting point for you. You could create a javascript function that provides encryption/decryption functions. The javascript would need to be built dynamically, and you would include an encryption key that is unique to the session. On the server side, you'd have an encryption service that uses the key from the session to encrypt your JSON before delivering it.
This would at least prevent someone from listening to your web traffic, pulling information out of your database.
I'm with kdgergory though, it sounds like your data is too open.
Some techniques are listed in Further thoughts on hindering screen scraping.
If you use PHP, Bad behavior is a nice tool to help. If you don't use PHP, it can give some ideas on how to filter (see How it works page).
Incredibill's blog is giving nice tips, lists of User-agents/IP ranges to block, etc...
Here are a variety of suggestions:
Issue tokens required for redemption along with each AJAX request. Expire the tokens.
Track how many queries are coming from each client, and throttle excessive usage based on expected normal usage of your site.
Look for patterns in usage such as sequential queries, spikes in requests, or queries that occur faster than a human could conduct.
Check user-agents. Many bots don't completely replicate the user agent info of a browser, and you can eliminate programatic scraping of your data using this method.
Change the front-end component of your website to redirect to a captcha (or some other human verifying mechanism) once a request threshold is exceeded.
Modify your logic so the respsonse data is returned in a few different ways to complicate the code required to parse.
Obsfucate your client-side javascript.
Block IPs of offending clients.
Bots usually doesn't parse Javascript, so your ajax code won't be instantly executed. And if they even do, bots usually doesn't maintain sessions/cookies as well. Knowing that, you could reject the request if it is invoked without a valid session/cookie (which is obviously set on the server side beforehand by the request on the parent page).
This does not protect you from human hazard though. The safest way is to restrict access to users with a login/password. If that is not your intent, well, then you have to live with the fact that it's a public application. You could of course scan logs and maintian blacklists with IP addresses and useragents, but that goes extreme.

Resources