Detecting duplicates in a data integration system


I am looking for ways to avoid the transfer of duplicate files when transferring through HTTP and SFTP. My system stores the state of the transfer each time a transfer is performed into an external cache.
Before each transfer, I look up the external cache, and if there is an entry for the current file with the status SUCCESS, the file is skipped. This works well as long as my system is able to store the status in the cache each time a transfer happens. But if the service dies after the transfer completes and before the status is written, it has no record of the transfer, and the next time the same file arrives it will be transferred again.
One way to improve this is to update the cache before and after the transfer is done so that I will have some clue about the file. But is there any other way to avoid this? Because once the file is transferred to the external system, there is no way to undo it when the writing of the status fails. Any thoughts?
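The "update the cache before and after" idea could look roughly like the sketch below. It is only an illustration under assumptions: the `Cache` map stands in for the external cache, and the `PENDING` status, the `Decision` values and the `Send`/`Decide` helpers are hypothetical names, not part of any existing system. The point is that a `PENDING` entry left behind by a crash marks the file as "possibly delivered", so the next run can check with the receiver instead of blindly re-sending.

```go
package main

import "fmt"

// Status is the transfer state recorded in the cache.
type Status string

const (
	StatusPending Status = "PENDING" // written before the transfer starts
	StatusSuccess Status = "SUCCESS" // written after the transfer completes
)

// Cache is a stand-in for the external cache (hypothetical; a real one
// would be a network service, which is why the second write can be lost).
type Cache map[string]Status

// Decision says what to do with an incoming file.
type Decision string

const (
	Skip     Decision = "skip"     // already transferred
	Transfer Decision = "transfer" // never seen before
	Verify   Decision = "verify"   // outcome unknown: check with the receiver
)

// Decide inspects the cache entry for a file key.
func Decide(c Cache, key string) Decision {
	switch c[key] {
	case StatusSuccess:
		return Skip
	case StatusPending:
		return Verify
	default:
		return Transfer
	}
}

// Send records PENDING, runs the transfer, then records SUCCESS.
// If the process dies or the transfer fails in between, PENDING remains,
// so the next run knows the outcome is uncertain.
func Send(c Cache, key string, transfer func() error) error {
	c[key] = StatusPending
	if err := transfer(); err != nil {
		return err
	}
	c[key] = StatusSuccess
	return nil
}

func main() {
	c := Cache{}
	_ = Send(c, "a.csv", func() error { return nil })
	fmt.Println("a.csv:", Decide(c, "a.csv"))

	c["b.csv"] = StatusPending // simulate a crash after the transfer started
	fmt.Println("b.csv:", Decide(c, "b.csv"))

	fmt.Println("c.csv:", Decide(c, "c.csv"))
}
```

This does not remove the uncertainty window; it only makes it visible, so the `verify` case still needs some way to ask the receiving side whether the file arrived.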

I routinely synchronize external data and have written enough mastering processes to speak on the subject. You are asking for logistics solutions without even mentioning the context of the data and its purpose in being delivered to another location.
Are you trying to mirror a master copy of the file to another location? If so, then you need to simply deliver the file with a unique delivery number attached, allowing the recipient to independently synchronize both data sets and handle any detected differences in the files. If you are forcibly doing this work on behalf of the recipient, you may be destroying data. I consistently recommend having the recipient pull the data themselves as needed and synchronize/master it themselves, rather than pushing it. That way these business rules are organized where they should be. Push processes are bad.
Are you trying to allow users to overwrite a master file with their own copies, asking how to coordinate their uploads so that the file isn't overwritten? If so, you need to take away their direct control to overwrite that file. You need to separately synchronize each file according to a user-defined process, because each can have its own business rules.
When you say "look up the external cache and if there is an entry for the current file with the status SUCCESS, the file will be skipped", you have given far too much responsibility to the deliverer. I say that, but how do you know? In manufacturing, no deliverer would be expected to do more than carry the load. Consumers are responsible for allocating that space. If the consumer truly needs the file, let it make the decision to order it and handle receiving it, rather than having the deliverer juggle such decisions.

Related

Store the state inside golang binary

I am developing an on-premise solution for a client, and I have no control over the machine and no internet connection on it.
The solution is to be monetized based on the number of allowed requests (REST API calls) for a purchased license. Currently we store the request count in an encrypted file on the file system itself. But this solution is not perfect, as the file can be copied somewhere and then put back when the request quota is used up. Also, if the file is deleted, manual intervention from support is needed.
I'm looking for a way to store the state/data in the binary and update it at runtime (think of a usage count that is updated in the binary itself).
Looking for a better approach.
Also, the binary should start from the previously stored state.
Is there a way to do it?
P.S. I know writing to the binary won't solve the issue, but I think it will increase the difficulty by increasing the number of places where the state could be stored. Since it is not common knowledge that you can change the executable, it would be the last place to look for the state if someone is trying to mess with the system (security by obscurity).

Is there a way to do it?
No.
(At least no official, portable way. Of course you can modify a binary and change e.g. the data or BSS segment, but this is hard, OS-dependent, and does not solve your problem, because it has the same weakness as an external file: you can just keep the original executable and start over with that one. Some things simply cannot be solved technically.)

If your REST API is within your control and is the part that you are monetizing, surely this is the point at which you would enforce the license, perhaps with some kind of certificate authentication or an API key. You can then keep the count on the API side, which you control, and it won't matter whether it is in a flat file or a DB, because you control it.

Here is a solution to what you are trying to do (though not by writing to the executable) that will defeat casual copying of files.
A possible approach is to regularly write the request count and the current system time to file. This file does not even have to be encrypted - you just need to generate a hash of the data (eg using SHA2) and sign it with a private key then append to the file.
Then when you (re)start the service read and verify the file using your public key and check that it has not been too long since the time that was written to the file. Note that some initial file will have to be written on installation and your service will need to be running continually - only allowing for brief restarts. You also would probably verify that the time is not in the future as this would indicate an attempt to circumvent the system.
Of course this approach has problems such as the client fiddling with the system time or even debugging your code to find the private key and probably others. Hopefully these are hard enough to act as a deterrent. Also if the service or system is shut down for an extended period of time then some sort of manual intervention would be required.
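As a rough sketch of the signed count-plus-timestamp idea above, assuming ECDSA P-256 over a SHA-256 digest as the concrete signature scheme (the answer only says "SHA2" and "a private key", so that choice is mine). The `State` layout and the names `NewKey`, `Sign` and `Verify` are hypothetical, and times are passed as Unix seconds to keep the example small.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"encoding/binary"
	"errors"
	"fmt"
)

// State is the data the service would write out periodically.
type State struct {
	Count    uint64 // requests served so far
	UnixTime int64  // wall-clock time of this write, in seconds
}

// digest returns the SHA-256 hash of a fixed-width encoding of s.
func digest(s State) []byte {
	buf := make([]byte, 16)
	binary.BigEndian.PutUint64(buf[:8], s.Count)
	binary.BigEndian.PutUint64(buf[8:], uint64(s.UnixTime))
	sum := sha256.Sum256(buf)
	return sum[:]
}

// NewKey generates a key pair; a real deployment would ship a fixed key.
func NewKey() (*ecdsa.PrivateKey, error) {
	return ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
}

// Sign produces the signature that would be appended to the state file.
func Sign(priv *ecdsa.PrivateKey, s State) ([]byte, error) {
	return ecdsa.SignASN1(rand.Reader, priv, digest(s))
}

// Verify checks the signature, rejects timestamps in the future, and
// rejects files older than maxGapSec seconds relative to nowUnix.
func Verify(pub *ecdsa.PublicKey, s State, sig []byte, nowUnix, maxGapSec int64) error {
	if !ecdsa.VerifyASN1(pub, digest(s), sig) {
		return errors.New("state does not match signature")
	}
	if s.UnixTime > nowUnix {
		return errors.New("timestamp is in the future")
	}
	if nowUnix-s.UnixTime > maxGapSec {
		return errors.New("state file is too old")
	}
	return nil
}

func main() {
	priv, err := NewKey()
	if err != nil {
		panic(err)
	}
	s := State{Count: 42, UnixTime: 1000}
	sig, err := Sign(priv, s)
	if err != nil {
		panic(err)
	}
	fmt.Println("fresh file:", Verify(&priv.PublicKey, s, sig, 1060, 300))

	s.Count = 0 // simulate someone resetting the counter
	fmt.Println("tampered file:", Verify(&priv.PublicKey, s, sig, 1060, 300))
}
```

As the answer notes, the private key has to live on the client machine for the service to keep signing, so this only deters casual copying and editing.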

Making a journal file in golang

I have a small project in Go that receives text lines over TCP to process. However, to ensure robustness, I want to create some sort of journal so that nothing is lost in case of power failure (e.g. a frame of data is received by my app, but is not yet processed).
I have googled for guides on how a journal file should be implemented, but the search results are heavily polluted by Oracle RDBMS documentation and such.
My thought was something like: immediately after receiving a line, write it to a file with a "not processed" flag. After processing, update the file so that this flag is cleared, opening the slot for overwrites. At the same time as this flag is cleared, send a "processed ack" to the data sender. Perhaps it's easiest to deal with fixed-size "slots" in the journal and maintain a "free list" of unused slots, so that I can reuse freed slots rather than having an ever-increasing file.
Is there any "best practice" for implementing such files in custom code, e.g. with regard to file structure, padding and locking? Are there any concerns doing so in Go, as it is cross-platform, rather than using native file-system APIs?

You shouldn't rewrite a journal. Just append the operations to it so that you can recreate them, and then control the strictness level you want.
The logic should simply be:
1. Receive the message.
2. Write it to the journal.
3. Optionally do an fsync on the journal now, depending on your consistency requirements.
4. Optionally then send a "received ack", depending on your needs.
5. Process the message.
6. Optionally write another "processed" record to the file with the id of the record. You don't always need that, but this is where you don't rewrite the old record. Alternatively you can write a separate file with the "top transaction id" you've processed, so you'll automatically know where to begin processing again in case of a failure. This will reduce the journal size.
7. Send a "processed ack" or "processing failure"; again, this depends on what you want.
Databases usually let you control the fsync behavior - every write, every N seconds, when the os decides - it's a matter of speed vs. durability.
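The append-only logic above could be sketched in Go roughly as follows. The line format (`R <id> <payload>` for received, `P <id>` for processed), the type `Journal` and the helpers `Replay` and `demo` are all made up for illustration, and the sketch assumes ids contain no spaces and payloads contain no newlines. `(*os.File).Sync` is the fsync step, controlled per write by the `sync` argument.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// Journal is an append-only log of "R <id> <payload>" and "P <id>" lines.
type Journal struct{ f *os.File }

// OpenJournal opens (or creates) the journal for appending only.
func OpenJournal(path string) (*Journal, error) {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return nil, err
	}
	return &Journal{f: f}, nil
}

// write appends one record and, if sync is true, forces it to disk.
func (j *Journal) write(line string, sync bool) error {
	if _, err := j.f.WriteString(line + "\n"); err != nil {
		return err
	}
	if sync {
		return j.f.Sync()
	}
	return nil
}

func (j *Journal) Received(id, payload string, sync bool) error {
	return j.write("R "+id+" "+payload, sync)
}

func (j *Journal) Processed(id string, sync bool) error {
	return j.write("P "+id, sync)
}

func (j *Journal) Close() error { return j.f.Close() }

// Replay returns, in arrival order, the payloads that were received but
// never marked processed. This is what a restart would have to redo.
func Replay(path string) ([]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var order []string
	pending := map[string]string{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		parts := strings.SplitN(sc.Text(), " ", 3)
		switch {
		case len(parts) == 3 && parts[0] == "R":
			order = append(order, parts[1])
			pending[parts[1]] = parts[2]
		case len(parts) == 2 && parts[0] == "P":
			delete(pending, parts[1])
		}
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	var out []string
	for _, id := range order {
		if p, ok := pending[id]; ok {
			out = append(out, p)
		}
	}
	return out, nil
}

// demo journals two messages, processes only the first, then replays.
func demo() ([]string, error) {
	dir, err := os.MkdirTemp("", "journal")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "journal.log")

	j, err := OpenJournal(path)
	if err != nil {
		return nil, err
	}
	if err := j.Received("1", "first line", true); err != nil {
		return nil, err
	}
	if err := j.Processed("1", false); err != nil {
		return nil, err
	}
	if err := j.Received("2", "second line", true); err != nil {
		return nil, err
	}
	if err := j.Close(); err != nil { // pretend the process stopped here
		return nil, err
	}
	return Replay(path)
}

func main() {
	pending, err := demo()
	fmt.Println(pending, err)
}
```

Syncing on every received record is the slow but durable end of the trade-off; batching several records per `Sync` would trade some durability for throughput.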
A good read on the subject might be this post on redis persistence:
http://oldblog.antirez.com/post/redis-persistence-demystified.html
[EDIT] another great read on the subject - http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
As for the Go aspect of it: there are a few options for writing to files, from a low-level file handle to a buffered writer. Of course a file handle will keep you most in control of what's going on under the hood. I'm not sure how much caching a normal file writer in Go does behind the scenes; I'd suggest you read the code if you intend to use it.

Is it possible to associate data with a running process?

As the title says, I want to associate a random bit of data (ULONG) with a running process on the local machine. I want that data persisted with the process it's associated with, not the process that's reading and writing the data. Is this possible in Win32?

Yes but it can be tricky. You can't access an arbitrary memory address of another process and you can't count on shared memory because you want to do it with an arbitrary process.
The tricky way
What you can do is to create a window (with a special and known name) inside the process you want to decorate. See the end of the post for an alternative solution without windows.
First of all you have to get a handle to the process with OpenProcess.
Allocate memory with VirtualAllocEx in the other process to hold a short method that will create a (hidden) window with a special known name.
Copy that function from your own code with WriteProcessMemory.
Execute it with CreateRemoteThread.
Now you need a way to identify and read back this memory from another process other than the one that created that. For this you simply can find the window with that known name and you have your holder for a small chunk of data.
Please note that this technique may be used to inject code in another process so some Antivirus may warn about it.
Final notes
If Address Space Randomization is disabled you may not need to inject code into the process memory: you can call CreateRemoteThread with the address of a kernel32.dll function whose signature matches a thread start routine (for example LoadLibrary). You can't do this with native applications (not linked to kernel32.dll).
You can't inject into system processes unless you have debug privileges for your process (with AdjustTokenPrivileges).
As an alternative to the fake window you may create a suspended thread with a local variable, a TLS slot or a stack entry used as the data chunk. To find this thread you have to give it a name using, for example, this (but it's seldom applicable).
The naive way
A poor man's solution (but probably much easier to implement and perhaps even more robust) is to use NTFS alternate data streams (ADS) to hide a small data file for each process you want to monitor. The ADS is associated with the process image, so this is not applicable to services and rundll'ed processes unless you make it much more complicated.
Iterate all processes and for each one create an ADS with a known name (and the process ID).
Inside it you have to store the system startup time and all the data you need.
To read back that information:
Iterate all processes and check for that ADS, read it and compare the system startup time (if they mismatch, you have found a widow ADS and it should be deleted).
Of course you have to take care of these widows, so you may need to check for them periodically. You can avoid this by storing all these small chunks of data in a well-known location; your "reader" can then check them all each time, deleting files no longer associated with a running process.

centralized / distributed sharing

I would like to make a system whereby users can upload and download files. The system will have a centralized topology but will rely heavily on peers to transfer relevant data through the central node to other peers. Instead of peers holding entire files I would like for them to hold a compressed and encrypted portion of the whole data set.
1. Some client uploads file to server anonymously
I would like for the client to be able to upload using some sort of NAT (random ip), realizing that the server would not be able to send confirmation packets back to the client. Is ensuring data integrity feasible with a header relaying the total content length, and disregarding the entire upload if there is a mismatch?
2. Server indexes, compresses and splits the data into chunks adding identifying bytes to each chunk, encrypts it, and splits the data over the network while mapping the locations of each chunk.
The server will also update the file index for peers upon request. As more data is added to the system, I imagine that the compression can become more efficient. I would like to be able to push these new dictionary entries to peers so they can update both their chunks and the decompression system in the client software, without causing overt network strain. If encrypted, the chunks can be large without any client being aware of having part of x file.
3. Some client requests a file
The central node performs a lookup to determine the location of the chunks within the network and requests these chunks from peers. Once the chunks have been assembled, they are sent (still encrypted and compressed) to the client, who then translates the content into the decompressed file. It would be nice if an encrypted request could be made through a peer and relayed to a server, and onion routed through multiple paths with end-to-end encryption.
In the background, the server will be monitoring the stability and redundancy of the chunks, and if necessary will take on chunks that near extinction, and either hold them in its own bank or redistribute them over the network if there are willing clients. In this way, the central node can shrink and grow as appropriate.
The goal is to have a network within which any client can upload or download data with no single other peer knowing who has done either, but with free and open access to all.
The system must be able to handle a massive number of simultaneous connections while managing the peers and data library without losing its head.
What would be your optimal implementation?
Edit : Bounty opened.
Over the weekend, I implemented a system that does basically the above, minus part 1. For the upload, I just implemented SSL instead of forging the IP address. The system is weak in several areas. Files are split into 1MB chunks and encrypted, and sent to registered peers at random. The recipient(s) for each chunk are stored in the database. I fear that this will quickly grow too large to be manageable, but I also want to avoid having to flood the network with chunk requests. When a file is requested, the central node informs peers possessing the chunks that they need to send the chunks to x client (in p2p mode) or to the server (in direct mode), which then transfers the file down. The system is just one big hack, and written in Ruby, which I imagine is not really up to the task. For the rewrite, I am considering using C++ with Boost.Asio.
I am looking for general suggestions regarding architecture and system design. I am not at all attached to my current implementation.
Current Topology
Server Handling client uploads,indexing, and content propagation
Server Handling client requests
Client for upload files and requesting files
Client Server accepting chunks and requests
I would like for the client not to have to have a persistent server running, but I can't think of a good way around it.
I would post some of the code but it's embarrassing. Thanks. Please ask any questions; the basic idea is to have a decent anonymous file sharing model combining the strengths of both the distributed and centralized model of content distribution. If you have a totally different idea, please feel free to post it if you want.

"I would like for the client to be able to upload using some sort of NAT (random ip), realizing that the server would not be able to send confirmation packets back to the client. Is ensuring data integrity feasible with a header relaying the total content length, and disregarding the entire upload if there is a mismatch?"
No, that's not feasible. If your packets are 1500 bytes and you have 0.1% packet loss, the chance of a one-megabyte file being uploaded without any lost packets is 0.999 ^ (1048576 / 1500) = 0.497, or just under 50%. Further, it's not clear how the client would even know if the upload succeeded if the server has no way to send acknowledgements back to the client.
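The arithmetic can be reproduced with a few lines of Go. The function name `SuccessProbability` is just for illustration, and the model assumes packet losses are independent, as the calculation above does.

```go
package main

import (
	"fmt"
	"math"
)

// SuccessProbability returns the chance that every packet of a transfer
// arrives, assuming each packet is lost independently with lossRate.
func SuccessProbability(fileBytes, packetBytes, lossRate float64) float64 {
	packets := fileBytes / packetBytes
	return math.Pow(1-lossRate, packets)
}

func main() {
	// One megabyte in 1500-byte packets at 0.1% packet loss.
	p := SuccessProbability(1048576, 1500, 0.001)
	fmt.Printf("%.3f\n", p)
}
```

Larger files make this worse quickly, since the exponent grows linearly with file size.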
One way around the acknowledgement issue would be to use a rateless code, which allows the client to compute and send an effectively infinite number of unique blocks, such that any sufficiently large subset is enough to reconstruct the original file. This adds a large amount of complexity to both the client and server, however, and still requires some way to notify the client that the server has received the complete file.
It seems to me you're confusing several issues here. If your system has a centralized component to which your clients upload, why do you need to do NAT traversal at all?
For parts two and three of your question, you probably want to research Distributed Hash Tables and content-based addressing (but with major caveats explained here). Preventing the nodes from knowing the content of the files they store could be accomplished by, for example, encrypting the files with the first hash of their content, and storing them keyed by the second hash - this means that anyone who knows the hash of the file can retrieve it, but clients cannot decrypt the files they host.
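A minimal sketch of the "encrypt with the first hash, store under the second hash" idea, assuming SHA-256 for both hashes and AES-256-GCM with a random nonce for the encryption (the answer does not name specific algorithms, so these are my choices). The names `Keys`, `Seal` and `Open` are hypothetical.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// Keys derives the encryption key (hash of the content) and the storage
// key (hash of that hash). A host only ever sees the storage key.
func Keys(content []byte) (encKey [32]byte, storeKey string) {
	encKey = sha256.Sum256(content)
	second := sha256.Sum256(encKey[:])
	return encKey, hex.EncodeToString(second[:])
}

func newGCM(key [32]byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key[:])
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// Seal encrypts content and prepends the random nonce to the ciphertext.
func Seal(encKey [32]byte, content []byte) ([]byte, error) {
	gcm, err := newGCM(encKey)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, content, nil), nil
}

// Open reverses Seal; it fails if the key is wrong or the blob was altered.
func Open(encKey [32]byte, blob []byte) ([]byte, error) {
	gcm, err := newGCM(encKey)
	if err != nil {
		return nil, err
	}
	if len(blob) < gcm.NonceSize() {
		return nil, errors.New("blob too short")
	}
	nonce, ciphertext := blob[:gcm.NonceSize()], blob[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	content := []byte("some file content")
	encKey, storeKey := Keys(content)

	blob, err := Seal(encKey, content)
	if err != nil {
		panic(err)
	}
	hostStore := map[string][]byte{storeKey: blob} // what a peer would hold

	plain, err := Open(encKey, hostStore[storeKey])
	fmt.Println(string(plain), err)
}
```

Anyone holding the first hash can recompute the storage key and decrypt; a peer holding only the storage key and the blob cannot work backwards to the encryption key.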
In general, I would suggest starting by writing down a solid list of goals for the system you're designing, then looking for an architecture that suits those goals. In contrast, it sounds like you have some implicit goals, and have already picked a basic system architecture - which may not suit your full goals - based on that.

Sorry for arriving late at the generous 500 reputation party, but even if I am too late I would like to add a little of my research to your discussion.
Yes such a system would be nice, like Bittorrent but with encrypted files and hashes of the un-encrypted data. In BT you can add encrypted files of course, but then the hashes would be of the encrypted data and thus not possible to identify retrieval-sources without a centralized queryKey->hashCollection storage, i.e. a server that does all the work of identifying package-sources for every client. A similar system was attempted by Freenet (http://freenetproject.org/), although more limited than what you attempt.
For the NAT consideration let's first look at: aClient -> yourServer (and aClient->aClient later)
For the communication between a client and your server, the NATs (and the firewalls that shield the clients) are not an issue! Since the clients initiate the connection to your server (which has either a fixed IP address or a DNS entry (or dyndns)), you don't even have to think about NATs. The server can respond without any problem: even if multiple clients are behind a single corporate firewall, the firewall (its NAT) looks up which client the server wants to communicate with and forwards accordingly (without you having to tell it to).
Now the "hard" part: client -> client communication through firewalls/NAT. The central technique you can use is hole punching (http://en.wikipedia.org/wiki/UDP_hole_punching). It works so well that it is what Skype uses (from one client behind a corporate firewall to another; if it does not succeed, it uses a mirroring server). For this, both clients need to know each other's address and then shoot some packets at each other. So how do they get each other's addresses? Your server gives the addresses to the clients (this requires that not only a requester but also every distributor opens a connection to your server periodically).
Before I talk about your concern about data integrity, here is the general distinction between packages and packets that you could (and I guess should) make:
Distinguish between your (application-domain) packages (large) and the packets used for internet transmission (small, limited by the MTU among other things). It would be slow to have both be the same size. Every IPv4 host must accept datagrams of at least 576 bytes, and the usual Ethernet MTU is 1500 bytes (minus overhead; take a look here: http://www.comsci.us/datacom/ippacket.html ). You could do some experiments to find a good size for your packages; my best guess is that 50k-1M would all be fine (but profiling would optimize that, since we don't know whether most of the files you want to distribute are large or small).
About data integrity: for your packages you definitely need a hash. I would recommend using a cryptographic hash directly, since this prevents tampering (in addition to corruption). You don't need to record the size of the packages, since if the hash is wrong you have to retransmit the package anyway. Bear in mind that this kind of package corruption is not very frequent if you use TCP/IP for packet transmission (yes, you can use TCP/IP even in your scenario); it automatically corrects (re-requests) transmission errors. The big advantage is that the TCP stacks on both endpoints already checksum every segment and retransmit lost or corrupted ones without any help from your application, and link layers such as Ethernet discard corrupted frames at each hop. A packet-integrity protocol you implement yourself would have to redo all of that in application code at the destination before a re-request could even start.
For the next thought let's call the client which publishes a file the "publisher", i know this is kind of obvious, however it is important to distinguish this from "uploader", since the client does not need to upload the file to your server (just some info about it, see below).
Implementing the central indexing server should be no problem. The problem would be that you plan to have it encrypt all the files itself instead of making the publisher do that heavy work (good encryption is very heavy lifting). The only problem with having the publisher (not the server) encrypt the data is that you have to trust the publisher to give you reasonable search keywords: theoretically it could give you a very attractive search keyword every client desires, together with a reference to bogus data (encrypted data is hard to distinguish from random data). The solution to this problem is crowd-sourcing: make your server store a user rating so downloaders can vote on files. The table you need could be implemented as a regular old hash table mapping individual search keywords to the client IDs (see below) that have that package. The publisher is at first the only client that holds the data, but every client that has downloaded at least one of the packages should then be added to the hash table's entry, so if the publisher goes offline and every package has been downloaded by at least one client, everything continues working. Critically, the mapping client ID -> IP address is non-trivial because it changes often (e.g. every 24 hours for many clients). To compensate, you need another table on your server that holds this mapping, and the clients must contact the server periodically (e.g. every hour) to report their IP addresses. I would recommend using a crypto hash for client IDs so that it is impossible for one client to trash this table by telling you fake IDs.
For any questions and criticism, please comment.

I am not sure having one central point of attack (the central server) is a very good idea. That of course depends on the kind of problems you want to be able to handle. It also limits your scalability a lot.

How can one detect changes in a directory across program executions?

I am making a protocol, client and server which provide file transfer functionality similar to FTP (among other features). One difference between my protocol and FTP is that I would like to store a copy of the remote server's directory structure in a local cache. The server will only be running on Windows (written in C++) so any applicable Win32 API calls would be appreciated (if any). When initially connected, the client requests the immediate children (both files and directories, just like "ls" or "dir" with no options), then when a user navigates into a directory, this step repeats with the new parent like you might expect.
Of course, most of the time, if the same directory of a given server is requested twice by a client, the directory's contents will be the same. Therefore I would like to cache the results of each directory listing on the client. I would like a simple way of implementing this, but it would need to take into account expiring cache entries because of file/directory access and modification time and name changes, which is the tricky part. I would ideally like something which would enable almost instant directory listings by the client, with something like a hash which takes into account not only file contents, but also changes in subdirectories' contents' filenames, data, modification and access dates, etc.
This is NOT something that could completely rely on FileSystemWatcher (or similar) objects because it would need to maintain this cache even if the program is only run occasionally. Of course these would be nice to help maintain the cache, but that's only part of the problem.
My best(?) idea so far is using FindFirstFile() and FindNextFile(), and sorting (somehow), concatenating and hashing values found in the WIN32_FIND_DATA structs (with file contents maybe), and using that as a token for expiration (just to indicate change in any of these fields). Then I would have one of these tokens for each directory. When a directory is requested, the server would hash everything and compare that to the cached hash provided by the client, and if it's different, return the normal data, otherwise an HTTP 304 equivalent. Is there a less elaborate way of doing something like this? Does "directory last modified date" take into account every one of its subdirectories' files' modification dates under all circumstances? I'm sure that the built-in Windows indexing service has something like this but ideally I wouldn't need to rely on it.
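The per-directory token idea could be sketched like this. It is written in Go rather than against FindFirstFile()/FindNextFile(), so treat it as a portable analogue and not as the Win32 implementation; `DirToken` and `demo` are hypothetical names. It hashes the name, type, size and modification time of each immediate child only, so it does not look at file contents or descend into subdirectories.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// DirToken hashes the metadata of every immediate child of dir.
// os.ReadDir returns entries sorted by filename, so the token is stable
// as long as nothing in the listing changes.
func DirToken(dir string) (string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return "", err
	}
	h := sha256.New()
	for _, e := range entries {
		info, err := e.Info()
		if err != nil {
			return "", err
		}
		// NUL separators keep adjacent fields from running together.
		fmt.Fprintf(h, "%s\x00%t\x00%d\x00%d\n",
			e.Name(), e.IsDir(), info.Size(), info.ModTime().UnixNano())
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// demo computes the token twice on an unchanged directory and once more
// after a file is added.
func demo() (before, again, after string, err error) {
	dir, err := os.MkdirTemp("", "dirtoken")
	if err != nil {
		return "", "", "", err
	}
	defer os.RemoveAll(dir)

	if err = os.WriteFile(filepath.Join(dir, "a.txt"), []byte("one"), 0o644); err != nil {
		return "", "", "", err
	}
	if before, err = DirToken(dir); err != nil {
		return "", "", "", err
	}
	if again, err = DirToken(dir); err != nil {
		return "", "", "", err
	}
	if err = os.WriteFile(filepath.Join(dir, "b.txt"), []byte("two"), 0o644); err != nil {
		return "", "", "", err
	}
	after, err = DirToken(dir)
	return before, again, after, err
}

func main() {
	before, again, after, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("unchanged listing, same token:", before == again)
	fmt.Println("new file, different token:", before != after)
}
```

The server would compare the client's cached token with a freshly computed one and send the full listing only on a mismatch, which is the 304-style exchange described above.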
Because this service is for file sharing, something involving hashes would be especially nice so that I could automatically and efficiently find other people who are sharing a given file, but that's less of a concern than hosing the disk during the hash calculation.
I'm wondering what others who are more experienced than I am with programming would do to solve this problem (rsync and subversion have solved similar problems but not identical).

You're asking a lot of a File System Implementation of Very Little Brain (with apologies to A. A. Milne).
This is actually well-trodden ground and you'd do well to look at the existing literature on distributed filesystems. AFS comes to mind as an example of a very well studied approach.
I doubt you'll be able to come up with something useful and accurate without doing some serious homework. Put another way, 'twould be folly to ignore all the prior art.
