I thought that when downloading an S3 object into a file, the SDK would write it in chunks to avoid loading the whole file into memory. But apparently this is not the case. This is my code:
puts("Memory (before file loaded): #{((`ps -o rss= -p #{Process.pid}`.to_i) / 1024.0).round(2)} MB")
my_s3_object.get(response_target: file_path)
puts("Memory (after file loaded): #{((`ps -o rss= -p #{Process.pid}`.to_i) / 1024.0).round(2)} MB")
Output:
Memory (before file loaded): 191.08 MB
Memory (after file loaded): 259.41 MB
Where my_s3_object is a 130 MB zip archive. OK, so it's not fully loaded into memory, but almost half of it is.
Is there a way to improve memory usage by passing some params to the get method? Or how else can I do it?
I think you are looking for ranged requests, which are a general HTTP pattern supported by the AWS SDK.
The documentation provides the following examples. They let you download one part of the object at a time, write it out, discard those bytes from memory, and fetch the next range until the whole file is downloaded. Peak memory usage then depends on the size of the range you download with each request.
Example: To retrieve a byte range of an object
# The following example retrieves an object for an S3 bucket.
# The request specifies the range header to retrieve a specific
# byte range.
resp = client.get_object({
bucket: "examplebucket",
key: "SampleFile.txt",
range: "bytes=0-9",
})
resp.to_h outputs the following:
{
accept_ranges: "bytes",
content_length: 10,
content_range: "bytes 0-9/43",
content_type: "text/plain",
etag: "\"0d94420ffd0bc68cd3d152506b97a9cc\"",
last_modified: Time.parse("Thu, 09 Oct 2014 22:57:28 GMT"),
metadata: {
},
version_id: "null",
}
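Putting that together, here is a minimal sketch of the loop described above, assuming the v3 aws-sdk-s3 client; the bucket, key, chunk size, and target path are illustrative:
require 'aws-sdk-s3'

s3 = Aws::S3::Client.new
bucket = 'examplebucket'        # illustrative
key = 'SampleFile.zip'          # illustrative
chunk_size = 5 * 1024 * 1024    # 5 MB per request

# Ask S3 for the object's total size so we know when to stop.
total = s3.head_object(bucket: bucket, key: key).content_length

File.open('/path/to/file', 'wb') do |file|
  offset = 0
  while offset < total
    last = [offset + chunk_size - 1, total - 1].min
    part = s3.get_object(bucket: bucket, key: key, range: "bytes=#{offset}-#{last}")
    file.write(part.body.read)  # only this chunk is held in memory
    offset = last + 1
  end
end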
Streaming data to a block
# WARNING: yielding data to a block disables retries of networking errors
# However truncation of the body will be retried automatically using a range request
File.open('/path/to/file', 'wb') do |file|
s3.get_object(bucket: 'bucket-name', key: 'object-key') do |chunk, headers|
# headers['content-length']
file.write(chunk)
end
end
I'm trying to inspect a CSV file, but no findings are being returned (I'm using the EMAIL_ADDRESS info type, and the addresses I'm using come up with positive hits here: https://cloud.google.com/dlp/demo/#!/). I'm sending the CSV file into inspect_content with a byte_item as follows:
byte_item: {
type: :CSV,
data: File.open('/xxxxx/dlptest.csv', 'r').read
}
In looking at the supported file types, it looks like CSV/TSV files are inspected via Structured Parsing.
For CSV/TSV, does that mean one can't just send in the file and needs to use the table attribute instead of byte_item, as per https://cloud.google.com/dlp/docs/inspecting-structured-text?
What about XLSX files, for example? They're an unspecified file type, so I tried a configuration like the one below, but it still returned no findings:
byte_item: {
type: :BYTES_TYPE_UNSPECIFIED,
data: File.open('/xxxxx/dlptest.xlsx', 'rb').read
}
I'm able to do inspection and redaction with images and text fine, but having a bit of a problem with other file types. Any ideas/suggestions welcome! Thanks!
Edit: The contents of the CSV in question:
$ cat ~/Downloads/dlptest.csv
dylans#gmail.com,anotehu,steve#example.com
blah blah,anoteuh,
aonteuh,
$ file ~/Downloads/dlptest.csv
~/Downloads/dlptest.csv: ASCII text, with CRLF line terminators
The full request:
parent = "projects/xxxxxxxx/global"
inspect_config = {
info_types: [{name: "EMAIL_ADDRESS"}],
min_likelihood: :POSSIBLE,
limits: { max_findings_per_request: 0 },
include_quote: true
}
request = {
parent: parent,
inspect_config: inspect_config,
item: {
byte_item: {
type: :CSV,
data: File.open('/xxxxx/dlptest.csv', 'r').read
}
}
}
dlp = Google::Cloud::Dlp.dlp_service
response = dlp.inspect_content(request)
The CSV file I was testing with was something I created using Google Sheets and exported as a CSV; however, the file showed locally as "text/plain; charset=us-ascii". I downloaded a CSV off the internet and it had a MIME type of "text/csv; charset=utf-8". This is the one that worked. So it looks like my issue was specifically due to the file having an incorrect MIME type.
xlsx is not yet supported. Coming soon. (Maybe that part of the question should be split out from the CSV debugging issue.)
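For the structured-parsing route mentioned in the question, a minimal sketch of sending the data as a table item instead of a byte_item might look like the following; the header name and cell values are illustrative:
require "google/cloud/dlp"

dlp = Google::Cloud::Dlp.dlp_service

request = {
  parent: "projects/xxxxxxxx/global",
  inspect_config: {
    info_types: [{ name: "EMAIL_ADDRESS" }],
    min_likelihood: :POSSIBLE,
    include_quote: true
  },
  item: {
    table: {
      headers: [{ name: "contact" }],
      rows: [
        { values: [{ string_value: "someone@example.com" }] },
        { values: [{ string_value: "blah blah" }] }
      ]
    }
  }
}
response = dlp.inspect_content(request)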
I am testing adding a watermark to a video once it is uploaded. I am running into an issue where Lambda wants me to specify which file to process on upload, but I want it to trigger when any file (really, any file that ends in .mov, .mp4, etc.) is uploaded.
To clarify, this all works manually in creating a pipeline and job.
Here's my code:
require 'json'
require 'aws-sdk-elastictranscoder'
def lambda_handler(event:, context:)
client = Aws::ElasticTranscoder::Client.new(region: 'us-east-1')
resp = client.create_job({
pipeline_id: "15521341241243938210-qevnz1", # required
input: {
key: File, # this is where my issue is
},
output: {
key: "CBtTw1XLWA6VSGV8nb62gkzY",
# thumbnail_pattern: "ThumbnailPattern",
# thumbnail_encryption: {
# mode: "EncryptionMode",
# key: "Base64EncodedString",
# key_md_5: "Base64EncodedString",
# initialization_vector: "ZeroTo255String",
# },
# rotate: "Rotate",
preset_id: "1351620000001-000001",
# segment_duration: "FloatString",
watermarks: [
{
preset_watermark_id: "TopRight",
input_key: "uploads/2354n.jpg",
# encryption: {
# mode: "EncryptionMode",
# key: "zk89kg4qpFgypV2fr9rH61Ng",
# key_md_5: "Base64EncodedString",
# initialization_vector: "ZeroTo255String",
# },
},
],
}
})
end
How do I specify just any file that is uploaded, or files of a specific format, for the input key?
My issue is that I am using Active Storage, so the key doesn't end in .jpg or .mov, etc.; it is just a randomly generated string (they have reasons for doing this). I am trying to justify using Active Storage, and this is my final step in making it work like the alternatives I used before it.
The extension field is optional. If you don't specify anything in it, the Lambda will be triggered no matter what file is uploaded. You can then check whether it's the type of file you want and proceed, as in the sketch below.
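Here is a minimal sketch of that check inside the handler, assuming a standard S3 put event payload; the extension allow-list and output key are illustrative. (With Active Storage's extensionless keys you would need a different signal, such as the blob's content type, instead of the extension.)
require 'json'
require 'aws-sdk-elastictranscoder'

VIDEO_EXTENSIONS = %w[.mov .mp4].freeze  # illustrative allow-list

def lambda_handler(event:, context:)
  # The S3 trigger delivers the uploaded object's key in the event payload.
  key = event['Records'][0]['s3']['object']['key']

  # Ignore anything that isn't one of the video types we care about.
  return unless VIDEO_EXTENSIONS.include?(File.extname(key).downcase)

  client = Aws::ElasticTranscoder::Client.new(region: 'us-east-1')
  client.create_job(
    pipeline_id: "15521341241243938210-qevnz1",
    input: { key: key },  # the key comes from the event, not a hard-coded value
    output: {
      key: "#{File.basename(key, '.*')}-watermarked.mp4",  # illustrative output key
      preset_id: "1351620000001-000001",
      watermarks: [
        { preset_watermark_id: "TopRight", input_key: "uploads/2354n.jpg" }
      ]
    }
  )
end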
I'm having issues getting data from GitHub Archive.
The main issue is my problem with encoding {} and .. in my URL. Maybe I am misreading the GitHub Archive instructions or not understanding encoding correctly.
require 'open-uri'
require 'faraday'
conn = Faraday.new(:url => 'http://data.githubarchive.org/') do |faraday|
faraday.request :url_encoded # form-encode POST params
faraday.response :logger # log requests to STDOUT
faraday.adapter Faraday.default_adapter # make requests with Net::HTTP
end
#query = '2015-01-01-15.json.gz' #this one works!!
query = '2015-01-01-{0..23}.json.gz' #this one doesn't work
encoded_query = URI.encode(query)
response = conn.get(encoded_query)
p response.body
The GitHub Archive example for retrieving a range of files is:
wget http://data.githubarchive.org/2015-01-01-{0..23}.json.gz
The {0..23} part is being expanded by the shell before wget ever runs, producing one URL for each value in the range 0..23. You can test this by executing that command with the -v flag, which returns:
wget -v http://data.githubarchive.org/2015-01-01-{0..1}.json.gz
--2015-06-11 13:31:07-- http://data.githubarchive.org/2015-01-01-0.json.gz
Resolving data.githubarchive.org... 74.125.25.128, 2607:f8b0:400e:c03::80
Connecting to data.githubarchive.org|74.125.25.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2615399 (2.5M) [application/x-gzip]
Saving to: '2015-01-01-0.json.gz'
2015-01-01-0.json.gz 100%[===========================================================================================================================================>] 2.49M 3.03MB/s in 0.8s
2015-06-11 13:31:09 (3.03 MB/s) - '2015-01-01-0.json.gz' saved [2615399/2615399]
--2015-06-11 13:31:09-- http://data.githubarchive.org/2015-01-01-1.json.gz
Reusing existing connection to data.githubarchive.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 2535599 (2.4M) [application/x-gzip]
Saving to: '2015-01-01-1.json.gz'
2015-01-01-1.json.gz 100%[===========================================================================================================================================>] 2.42M 867KB/s in 2.9s
2015-06-11 13:31:11 (867 KB/s) - '2015-01-01-1.json.gz' saved [2535599/2535599]
FINISHED --2015-06-11 13:31:11--
Total wall clock time: 4.3s
Downloaded: 2 files, 4.9M in 3.7s (1.33 MB/s)
In other words, the shell substitutes each value into the URL, and wget then fetches each resulting URL in turn. This isn't obvious behavior, nor is it well documented, but you can find mention of it "out there". For instance, in "All the Wget Commands You Should Know":
7. Download a list of sequentially numbered files from a server
wget http://example.com/images/{1..20}.jpg
To do what you want, you need to iterate over the range in Ruby using something like this untested code:
0.upto(23) do |i|
response = conn.get("/2015-01-01-#{ i }.json.gz")
p response.body
end
To get a better idea of what's going wrong, let's start with the example given in the GitHub documentation:
wget http://data.githubarchive.org/2015-01-01-{0..23}.json.gz
The thing to note here is that {0..23} is automagically getting expanded by bash. You can see this by running the following command:
echo {0..23}
> 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
This means wget doesn't get called just once, but instead gets called a total of 24 times. The problem you're having is that Ruby doesn't automagically expand {0..23} like bash does, and instead you're making a literal call to http://data.githubarchive.org/2015-01-01-{0..23}.json.gz, which doesn't exist.
Instead you will need to loop through 0..23 yourself and make a single call every time:
(0..23).each do |n|
query = "2015-01-01-#{n}.json.gz"
encoded_query = URI.encode(query)
response = conn.get(encoded_query)
p response.body
end
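Since each response body is a gzip archive, you will probably want to write it to disk rather than print it. A small variation on the loop above, reusing the Faraday connection from the question (the local filenames are illustrative):
(0..23).each do |n|
  name = "2015-01-01-#{n}.json.gz"
  response = conn.get("/#{name}")
  # Write the raw gzip bytes to a local file with the same name.
  File.binwrite(name, response.body)
end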
I'm trying to upload an image using POST. Then, on the server, to get the POST data I use:
data: read system/ports/input
...but it seems that the data is truncated.
There doesn't seem to be any specific boundary where the data is truncated. I'm uploading images ranging from circa 15 to 200 kB, and the resulting data is anywhere from a few hundred bytes to a few tens of kB long, so there's no artificial boundary like 32'000 bytes.
Does anyone have experience with getting data from POST?
The read action on system/ports/input works at a low level, like a stream.
Continuous reads will return partial data until the end of input is reached.
The problem is that system/ports/input will return an error at the end of input instead of none! or an empty string.
The following code works for me to read large POST input:
image: make binary! 200'000
while [
not error? try [data: read system/ports/input]
][
append image data
]
With r3-64-view-2014-02-14-1926d8.exe I used:
while [
all [
not error? try [data: read system/ports/input]
0 < probe length? data
]
][
append image data
]
print length? image
Then I ran:
D:\own\Rebol>r3-64-view-2014-02-14-1926d8.exe read-img.r < r3-64-view-2014-02-14-1926d8.exe > err.txt
and got:
.
.
16384
16384
16384
2048
0
1181696
When attempting to load a page (a CSV served with a Content-Encoding of UTF-8) using Mechanize v2.5.1, I used the following code:
a.content_encoding_hooks << lambda{|httpagent, uri, response, body_io|
response['Content-Encoding'] = 'none' if response['Content-Encoding'].to_s == 'UTF-8'
}
p4 = a.get(redirect_url, nil, ['accept-encoding' => 'UTF-8'])
but I find that the content encoding hook is not being called and I get the following error and traceback:
/Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:787:in `response_content_encoding': unsupported content-encoding: UTF-8 (Mechanize::Error)
from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:274:in `fetch'
from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:949:in `response_redirect'
from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:299:in `fetch'
from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:949:in `response_redirect'
from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:299:in `fetch'
from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize.rb:407:in `get'
from prototype/test1.rb:307:in `<main>'
Does anyone have an idea why the content hook code is not firing and why I am getting the error?
but I find that the content encoding hook is not being called
What makes you think that?
The error message references this code:
def response_content_encoding response, body_io
...
...
out_io = case response['Content-Encoding']
when nil, 'none', '7bit', "" then
body_io
when 'deflate' then
content_encoding_inflate body_io
when 'gzip', 'x-gzip' then
content_encoding_gunzip body_io
else
raise Mechanize::Error,
"unsupported content-encoding: #{response['Content-Encoding']}"
So Mechanize only recognizes the content encodings nil, 'none', '7bit', and '' (all treated as no encoding), plus 'deflate', 'gzip', and 'x-gzip'; anything else raises the error you are seeing.
From the HTTP/1.1 spec:
14.11 Content-Encoding
The Content-Encoding entity-header field is used as a modifier to the
media-type. When present, its value indicates what additional content
codings have been applied to the entity-body, and thus what decoding
mechanisms must be applied in order to obtain the media-type
referenced by the Content-Type header field. Content-Encoding is
primarily used to allow a document to be compressed without losing the
identity of its underlying media type.
Content-Encoding = "Content-Encoding" ":" 1#content-coding
Content codings are defined in section 3.5. An example of its use is
Content-Encoding: gzip
The content-coding is a characteristic of the entity identified by the
Request-URI. Typically, the entity-body is stored with this encoding
and is only decoded before rendering or analogous usage. However, a
non-transparent proxy MAY modify the content-coding if the new coding
is known to be acceptable to the recipient, unless the "no-transform"
cache-control directive is present in the message.
...
...
3.5 Content Codings
Content coding values indicate an encoding transformation that has
been or can be applied to an entity. Content codings are primarily
used to allow a document to be compressed or otherwise usefully
transformed without losing the identity of its underlying media type
and without loss of information. Frequently, the entity is stored in
coded form, transmitted directly, and only decoded by the recipient.
content-coding = token
All content-coding values are case-insensitive. HTTP/1.1 uses
content-coding values in the Accept-Encoding (section 14.3) and
Content-Encoding (section 14.11) header fields. Although the value
describes the content-coding, what is more important is that it
indicates what decoding mechanism will be required to remove the
encoding.
The Internet Assigned Numbers Authority (IANA) acts as a registry for
content-coding value tokens. Initially, the registry contains the
following tokens:
gzip      An encoding format produced by the file compression program
          "gzip" (GNU zip) as described in RFC 1952 [25]. This format is a
          Lempel-Ziv coding (LZ77) with a 32 bit CRC.

compress  The encoding format produced by the common UNIX file compression
          program "compress". This format is an adaptive Lempel-Ziv-Welch
          coding (LZW).

          Use of program names for the identification of encoding formats
          is not desirable and is discouraged for future encodings. Their
          use here is representative of historical practice, not good
          design. For compatibility with previous implementations of HTTP,
          applications SHOULD consider "x-gzip" and "x-compress" to be
          equivalent to "gzip" and "compress" respectively.

deflate   The "zlib" format defined in RFC 1950 [31] in combination with
          the "deflate" compression mechanism described in RFC 1951 [29].

identity  The default (identity) encoding; the use of no transformation
          whatsoever. This content-coding is used only in the
          Accept-Encoding header, and SHOULD NOT be used in the
          Content-Encoding header.
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.5
In other words, an HTTP content encoding has nothing to do with ASCII vs. UTF-8 vs. Latin-1; a character set belongs in the charset parameter of the Content-Type header instead.
In addition the source code for Mechanize::HTTP::Agent has this in it:
# A list of hooks to call after retrieving a response. Hooks are called with
# the agent and the response returned.
attr_reader :post_connect_hooks
# A list of hooks to call before making a request. Hooks are called with
# the agent and the request to be performed.
attr_reader :pre_connect_hooks
# A list of hooks to call to handle the content-encoding of a request.
attr_reader :content_encoding_hooks
So it doesn't even look like you are calling the right hook.
Here is an example I got to work:
require 'mechanize'
a = Mechanize.new
p a.content_encoding_hooks
func = lambda do |a, uri, resp, body_io|
puts body_io.read
puts "The Content-Encoding is: #{resp['Content-Encoding']}"
if resp['Content-Encoding'].to_s == 'UTF-8'
resp['Content-Encoding'] = 'none'
end
puts "The Content-Encoding is now: #{resp['Content-Encoding']}"
end
a.content_encoding_hooks << func
a.get(
'http://localhost:8080/cgi-bin/myprog.rb',
[],
nil,
"Accept-Encoding" => 'gzip, deflate' #This is what Firefox always uses
)
myprog.rb:
#!/usr/bin/env ruby
require 'cgi'
cgi = CGI.new('html3')
headers = {
"type" => 'text/html',
"Content-Encoding" => "UTF-8",
}
cgi.out(headers) do
cgi.html() do
cgi.head{ cgi.title{"Content-Encoding Test"} } +
cgi.body() do
cgi.div(){ "The Accept-Encoding was: #{cgi.accept_encoding}" }
end
end
end
--output:--
[]
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"><HTML><HEAD><TITLE>Content-Encoding Test</TITLE></HEAD><BODY><DIV>The Accept-Encoding was: gzip, deflate</DIV></BODY></HTML>
The Content-Encoding is: UTF-8
The Content-Encoding is now: none
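Applying the same idea back to the original code: the hook rewrites the bogus Content-Encoding before Mechanize tries to decode the body. Note also that in Mechanize#get the header hash is the fourth argument (uri, parameters, referer, headers), not the third as in the question. A sketch, with a placeholder standing in for the question's redirect_url:
require 'mechanize'

a = Mechanize.new

# Rewrite the server's bogus Content-Encoding before Mechanize decodes the body.
a.content_encoding_hooks << lambda do |agent, uri, response, body_io|
  response['Content-Encoding'] = 'none' if response['Content-Encoding'].to_s == 'UTF-8'
end

redirect_url = 'http://example.com/report.csv'  # placeholder for the question's URL
p4 = a.get(redirect_url, [], nil, 'Accept-Encoding' => 'gzip, deflate')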