I'm attempting to stream several large, symmetrically encrypted .csv.gpg files (40GB+ each) from S3 through gnupg to an output stream.
I'd like to process the files in chunks using streams so that we never need to hold an entire encrypted or decrypted file on disk or in memory.
Here's an example that uses the AWS Ruby S3 SDK to download chunks of the object and pass them to gnupg for decryption, on Ruby 2.5.0 with the ruby-gpgme gem:
crypto = GPGME::Crypto.new
s3_client.get_object(bucket: BUCKET, key: KEY) do |chunk|
crypto.decrypt(chunk, password: PASSWORD, output: $stdout)
end
When running this, I see valid decrypted CSV data in STDOUT (good!) up until it fails at the end of the first chunk:
~/.rvm/gems/ruby-2.5.0/gems/gpgme-2.0.14/lib/gpgme/ctx.rb:435:in `decrypt_verify': Decryption failed (GPGME::Error::DecryptFailed)
This is where I'm stuck.
Can gnupg decrypt chunks at a time or must it read the entire file before writing the output?
Do the chunks need to be of certain size and/or delimited in some way?
Any feedback would be greatly appreciated.
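For illustration, a pipe-based variant of the same chunked download is sketched below. It assumes GPGME::Crypto#decrypt accepts an IO object (ruby-gpgme wraps IO-like inputs via GPGME::Data), so that gpg sees one continuous ciphertext stream instead of independently decrypted chunks:
crypto = GPGME::Crypto.new
reader, writer = IO.pipe

# Decrypt in a separate thread; GPGME reads from the pipe until EOF.
decryptor = Thread.new do
  crypto.decrypt(reader, password: PASSWORD, output: $stdout)
end

# Feed each S3 chunk into the pipe as it arrives.
s3_client.get_object(bucket: BUCKET, key: KEY) do |chunk|
  writer.write(chunk)
end

writer.close   # signals EOF to gpg
decryptor.join
reader.close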
Related
I can read files encrypted with ccrypt in Bash using ccat file or ccrypt -c file.
How can I append to an encrypted file without going through the decryption process?
You can program this, but you probably cannot do it from the command line.
The description of the protocol can be found here. It uses full-block CFB, where the previous ciphertext block is encrypted again to create a key stream that is XORed with the plaintext.
A quick look at the Wikipedia page shows that you can just grab the last full ciphertext block, use it as the IV, skip the key-stream bytes already consumed by any partial final ciphertext block (if present), and then continue encrypting.
However, you'll have to program that yourself. Good luck!
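As a minimal sketch of that idea, here is full-block CFB appending with OpenSSL's AES-256-CFB. This is for illustration only: ccrypt actually uses Rijndael with a 256-bit block size and its own key setup, so the snippet is not ccrypt-compatible, and it assumes the existing ciphertext ends on a block boundary.
require 'openssl'

BLOCK = 16  # AES block size in bytes (ccrypt's Rijndael uses 32-byte blocks)

def append_cfb(path, key, extra_plaintext)
  ciphertext = File.binread(path)
  # A partial final block would require skipping the key-stream bytes it
  # already consumed; this sketch does not handle that case.
  unless ciphertext.bytesize >= BLOCK && (ciphertext.bytesize % BLOCK).zero?
    raise "partial or missing final block not handled"
  end

  cipher = OpenSSL::Cipher.new("aes-256-cfb")
  cipher.encrypt
  cipher.key = key                         # 32-byte key
  cipher.iv  = ciphertext[-BLOCK, BLOCK]   # last ciphertext block acts as the IV

  File.open(path, "ab") { |f| f.write(cipher.update(extra_plaintext)) }
end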
I exported log data from PostgreSQL with a Ruby script and stored it in Google Cloud Storage.
Each file contains 10,000 user records, and there are over 100,000 files in total. Below is a partial listing of the files.
I downloaded each file to my local machine and converted the gzipped data into newline-delimited JSON for BigQuery, like jq -c ".[]" ~/Downloads/1-10000 > ~/Downloads/1-10000.json,
and then loaded it into BigQuery by hand with bq load --source_format=NEWLINE_DELIMITED_JSON userdata.user_logs_1-10000 gs://user-logs/1-10000 schema.json. It succeeded, but this is not a smart way to do it, and I can't repeat it for every file.
What is the best way to convert a huge number of gzip files into JSON and load them into BigQuery in one pass?
I am open to all suggestions. Thank you.
As I understand it, there are three steps (please let me know if I'm wrong):
Download the gzip files
Decompress the gzip files into JSON
Upload the JSON into BigQuery
You can try the yajl-ruby gem to handle the first two steps.
require 'uri'
require 'yajl/gzip'
require 'yajl/deflate'
require 'yajl/http_stream'
# Yajl::HttpStream fetches and parses JSON over HTTP; the yajl/gzip and
# yajl/deflate requires let it decompress encoded responses on the fly.
url = URI.parse("http://example.com/foo.json")
results = Yajl::HttpStream.get(url)
And have a look at the BigBroda and BigQuery gems. I've never used Google BigQuery before, so I'm not sure which one works better; you'll have to try them yourself.
This is an example:
bq = BigQuery::Client.new(opts)
bq.insert('table_name', results)
Since you have a huge number of files, it would also help to use multithreading or multiprocessing, as in the sketch below.
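A minimal sketch of that multithreaded fan-out, where object_names and the download, to_ndjson, and load_into_bigquery helpers are placeholders for whatever your download, convert, and load steps end up being:
require 'thread'

queue = Queue.new
object_names.each { |name| queue << name }   # list of files in Google Cloud Storage

workers = 8.times.map do
  Thread.new do
    loop do
      name = queue.pop(true) rescue break    # stop when the queue is drained
      gzip_data = download(name)             # e.g. fetch from GCS
      ndjson    = to_ndjson(gzip_data)       # decompress + flatten, like jq -c ".[]"
      load_into_bigquery(ndjson, name)       # e.g. bq.insert(...) as above
    end
  end
end
workers.each(&:join)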
I have thousands (or more) of gzipped files in a directory (on a Windows system), and one of my tools consumes those gzipped files. If it encounters a corrupt gzip file, it conveniently ignores it instead of raising an alarm.
I have been trying to write a Perl program that loops through each file and makes a list of files which are corrupt.
I am using the Compress::Zlib module and have tried reading the first 1 KB of each file, but that did not work: some of the files are corrupted towards the end (verified during a manual extract, where the alarm is raised only near the end), so reading the first 1 KB shows no problem. I am wondering if a CRC check of these files would be of any help.
Questions:
Will CRC validation work in this case? If yes, how does it work? Is the true CRC part of the gzip header, so that we compare it with a CRC calculated from the file we have? How do I accomplish this in Perl?
Are there any other simpler ways to do this?
In short, the only way to check a gzip file is to decompress it until you get an error, or get to the end successfully. You do not however need to store the result of the decompression.
The CRC stored at the end of a gzip file is the CRC of the uncompressed data, not the compressed data. To use it for verification, you have to decompress all of the data. This is what gzip -t does, decompressing the data and checking the CRC, but not storing the uncompressed data.
Often a corruption in the compressed data will be detected before getting to the end. But if not, then the CRC, together with a check against the uncompressed length that is also stored at the end, will detect a corrupted file with probability very close to one.
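For illustration of that decompress-and-discard check (shown here in Ruby's standard zlib wrapper for brevity; the same loop maps directly onto Perl's Compress::Zlib gzopen/gzread), reading the file to the end forces the trailing CRC and length checks without keeping any output:
require 'zlib'

def gzip_ok?(path)
  Zlib::GzipReader.open(path) do |gz|
    gz.read(1 << 20) until gz.eof?   # read 1 MB at a time and discard it
  end
  true
rescue Zlib::Error                   # covers bad headers, bad data, CRC/length mismatch
  false
end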
The Archive::Zip FAQ gives some very good guidance on this.
It looks like the best option for you is to check the CRC of each member of the archives, and a sample program that does this -- ziptest.pl -- comes with the Archive::Zip module installation.
It should be easy to test whether a file is corrupt just by using the gunzip -t command; gunzip is available for Windows as well and should come with the gzip package.
Is it true WebHDFS does not support SequenceFiles?
I can't find anything that says it does. I have the usual small file problem and believe SequenceFiles would work well enough, but I need to use WebHDFS. I need to create and then append to a SequenceFile via WebHDFS.
I think it's true. There is no web API to append to a sequence file.
However, you can append binary data, and if your sequence file is not block-compressed, you should be able to format the data on the client with relatively little effort: run your input through a sequence file writer on the client, then use the output for uploading (either the whole file, or a slice representing the delta since the last append).
You can read more about sequence file format here.
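For the upload side, a rough sketch of the raw WebHDFS two-step APPEND call (the host name, port, and user are placeholders; the NameNode answers the first POST with a 307 redirect to a DataNode, which receives the actual bytes):
require 'net/http'
require 'uri'

def webhdfs_append(hdfs_path, data, user)
  # Step 1: ask the NameNode where to send the data (no body yet).
  nn = URI("http://namenode.example.com:50070/webhdfs/v1#{hdfs_path}?op=APPEND&user.name=#{user}")
  redirect = Net::HTTP.start(nn.host, nn.port) { |http| http.post(nn.request_uri, '') }

  # Step 2: POST the bytes to the DataNode named in the Location header.
  dn = URI(redirect['Location'])
  Net::HTTP.start(dn.host, dn.port) do |http|
    http.post(dn.request_uri, data, 'Content-Type' => 'application/octet-stream')
  end
end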
I am trying to write a Ruby video transcoder script (using ffmpeg) that depends on .mov files being FTPed to a server.
The problem I've run into is that when a large file is uploaded by a user, the watch script (using rb-inotify) fires (and runs the transcoder) before the .mov is completely uploaded.
I'm a complete noob, but I'm trying to work out whether there is a way to ensure my watch script doesn't run until the file(s) is/are completely uploaded.
My watch script is here:
watch_me = INotify::Notifier.new
watch_me.watch("/directory_to_my/videos", :close_write) do |directories|
load '/directory_to_my/videos/.transcoder.rb'
end
watch_me.run
Thank you for any help you can provide.
Just relying on inotify(7) to tell you when a file has been updated isn't a great fit for telling when an upload is 'complete': an FTP session might time out and be re-started, for example, allowing a user to upload a file in chunks over several days, whenever connectivity is cheap, reliable, or available. inotify(7) only ever sees file opens, closes, renames, and accesses, never the higher-level event "I'm done modifying this file" as the user would understand it.
There are two mechanisms I can think of: one is to have uploads go initially into one directory and ask the user to move the file into another directory when the upload is complete. The other creates some file meta-data on the client and uses that to "know" when the upload is complete.
Move completed files manually
If your users upload into the directory ftp/incoming/temporary/, they can upload the file over as many connections as required. Once the file is "complete", they rename it (rename ftp/incoming/temporary/hello.mov ftp/incoming/complete/hello.mov), your rb-inotify interface looks for file renames in the ftp/incoming/complete/ directory, and it starts the ffmpeg(1) command.
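A sketch of that watcher, with the directory paths and the output naming as placeholders:
require 'rb-inotify'

watch_me = INotify::Notifier.new
# :moved_to fires when a file is renamed into the watched directory.
watch_me.watch("/ftp/incoming/complete", :moved_to) do |event|
  system("ffmpeg", "-i", event.absolute_name, "/transcoded/#{event.name}.mp4")
end
watch_me.run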
Generate metadata
For a transfer to be "complete", you're really looking for two things:
The file is the same size on both systems.
The file is identical on both systems.
Since "identical" is otherwise difficult to check, most people content themselves with checking if the contents of the file, when run through a cryptographic hash function such as MD5 or SHA-1 (or better, SHA-224, SHA-256, SHA-384, or SHA-512) functions. MD5 is quite fine if you're guarding against incomplete transmission but if you intend on using the output of the function for other means, using a stronger function would be wise.
MD5 is really tempting though, since tools to create and validate MD5 hashes are very widespread: md5sum(1) on most Linux systems, md5(1) on most BSD systems (including OS X).
$ md5sum /etc/passwd
c271aa0e11f560af419557ef49a27ac8 /etc/passwd
$ md5sum /etc/passwd > /tmp/sums
$ md5sum -c /tmp/sums
/etc/passwd: OK
The md5sum -c command asks the md5sum(1) program to check the file of hashes and filenames for correctness. It looks a little silly when used on just a single file, but when you've got dozens or hundreds of files, it's nice to let the software do the checking for you. For example: http://releases.mozilla.org/pub/mozilla.org/firefox/releases/3.0.19-real-real/MD5SUMS -- Mozilla has published such files with 860 entries -- checking them by hand would get tiring.
Because checking hashes can take a long time (five minutes on my system to check a high-definition hour-long video that wasn't recently used), it'd be a good idea to only check the hashes when the filesizes match. Modify your upload tool to send along some metadata about how long the file is and what its cryptographic hash is. When your rb-inotify script sees file close requests, check the file size, and if the sizes match, check the cryptographic hash. If the hashes match, then start your ffmpeg(1) command.
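A sketch of that size-then-hash gate, where expected_size and expected_md5 stand in for whatever metadata your upload tool sends along:
require 'digest'

def ready_to_transcode?(path, expected_size, expected_md5)
  return false unless File.size(path) == expected_size   # cheap check first
  Digest::MD5.file(path).hexdigest == expected_md5        # expensive check second
end

# Inside the rb-inotify handler:
# run_ffmpeg(path) if ready_to_transcode?(path, meta[:size], meta[:md5])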
It seems easier to upload the file to a temporary directory on the server and move it to the location your script is watching once the transfer is completed.