Handling chunked requests in Sinatra (Ruby)

In Sinatra I have a route set up similar to the following:
put '/test' do
  begin
    logger.info 'In put begin block'
    write_to_file(request.body.read)
    [200, '']
  rescue RuntimeError => e
    [500, 'some error']
  end
end

def write_to_file(data)
  logger.info "writing data with size #{data.size}"
  # Code to save file...
end
When I send data that is < ~500 MBytes everything seems to work correctly, but when I attempt to send data that is >= 500 MBytes I get some weird log output and the client eventually errors out with the following error: Excon::Errors::SocketError: EOFError (EOFError)
The logs from the server (Sinatra) look like the following:
For data < 500 MBytes:
I, [2013-01-07T21:33:59.386768 #17380] INFO -- : In put begin block
I, [2013-01-07T21:34:01.279922 #17380] INFO -- : writing data with size 209715200
xxx.xxx.xxx.xxx - - [07/Jan/2013 21:34:22] "PUT /test " 200 - 22.7917
For data > 500 MBytes:
I, [2013-01-07T21:47:37.434022 #17386] INFO -- : In put begin block
I, [2013-01-07T21:47:41.152932 #17386] INFO -- : writing data with size 524288000
I, [2013-01-07T21:48:16.093683 #17380] INFO -- : In put begin block
I, [2013-01-07T21:48:20.300391 #17380] INFO -- : writing data with size 524288000
xxx.xxx.xxx.xxx - - [07/Jan/2013 21:48:39] "PUT /test " 200 - 62.4515
I, [2013-01-07T21:48:54.718971 #17386] INFO -- : In put begin block
I, [2013-01-07T21:49:00.381725 #17386] INFO -- : writing data with size 524288000
I, [2013-01-07T21:49:33.980043 #17267] INFO -- : In put begin block
I, [2013-01-07T21:49:41.990671 #17267] INFO -- : writing data with size 524288000
xxx.xxx.xxx.xxx - - [07/Jan/2013 21:50:06] "PUT /test " 200 - 110.2076
xxx.xxx.xxx.xxx - - [07/Jan/2013 21:51:22] "PUT /test " 200 - 108.5339
Not entirely sure what's going on here, so I guess my question is twofold: A. What is fundamentally different between these two cases that would cause them to behave this way? B. Is there a better way to handle the data to mitigate against this?

The problem turned out to be with Apache/Passenger. Running the server with WEBrick alleviated the issue. Note the repeated "In put begin block" lines from different pids in the large-upload log: the same request is apparently being handled more than once, which points at the server stack in front of the app rather than at the Sinatra code itself.
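As for question B, one possible mitigation (an untested sketch, not the poster's final code): instead of pulling the whole body into memory with request.body.read, stream it to disk in fixed-size chunks. The destination path and chunk size below are hypothetical.

put '/test' do
  begin
    logger.info 'In put begin block'
    File.open('/tmp/upload.dat', 'wb') do |f|       # hypothetical destination
      while (chunk = request.body.read(64 * 1024))  # read 64 KB at a time
        f.write(chunk)
      end
    end
    [200, '']
  rescue RuntimeError => e
    [500, 'some error']
  end
end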

Related

Sending logs to Datadog with a custom formatter shows only the first of multiple logs

I am trying to send logs to DD (Datadog) in such a way that they are received as JSON and therefore shown properly in the portal through attributes.
My logger is a simple Logger.new(STDOUT, level: Logger::INFO).
If I stick to its standard output, it will be in the form
I, [2022-07-30T22:43:35.216846 #1] INFO -- my-app: {"user":"1234"}
which is not really parsable by DD since it is not proper JSON. In this case, however, all the logs at least appear on the DD portal.
Now I am trying to format the logs as JSON in this way:
def self.logger
  @logger ||= Logger.new(STDOUT, level: Logger::INFO)
  @logger.progname = 'my-app'
  @logger.formatter = proc do |severity, datetime, progname, msg|
    { timestamp: datetime.to_s, progname: progname, severity: severity, correlation: Datadog::Tracing.log_correlation, message: msg }.to_json
  end
  @logger
end
This is my logger, and thanks to it the logs are seen properly in DD and parsed correctly, because they are formatted as proper JSON in my app.
The problem with this approach, though, seems to be that the logs are sent in one full block, meaning that only the very first log is visible. Let's say that I want to log this:
my_hash = {"message" => '1', "prop" => '1234'}.to_json
logger.info(my_hash)
my_hash = {"message" => '2', "prop" => '12345'}.to_json
logger.info(my_hash)
only the first log will be shown correctly on the DD portal, parsed correctly with its message and prop attributes, but nothing about the second log.
Here is the thing: if I look at the output of my app locally in the console, I see this:
{"timestamp":"2022-07-31 01:15:39 +0200","progname":"my-app","severity":"INFO","correlation":"dd.service=my-app dd.trace_id=2976451780376429536 dd.span_id=0","message":"{"message":"1","prop":"1234"}"}{"timestamp":"2022-07-31 01:15:39 +0200","progname":"my-app","severity":"INFO","correlation":"dd.service=my-app dd.trace_id=2976451780376429536 dd.span_id=0","message":"{"message":"2","prop":"12345"}"}127.0.0.1 - - [31/Jul/2022:01:15:39 +0200] "GET /controller/test_controller HTTP/1.1" 200 - 0.0024
so the second log actually does get output! But DD somehow sees only the first log.
(I know there is even a third one shown in this output, but that's just Sinatra's automatic logging for every HTTP call reaching the API.) What do you guys think is the problem?
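A likely culprit (an assumption, not a confirmed answer): a custom Logger formatter must supply its own line separator. The proc above returns bare JSON with no trailing newline, which is exactly why the console output shows the JSON objects run together on one line, and a shipper that splits records on newlines will see them as a single log. A minimal sketch of the fix:

@logger.formatter = proc do |severity, datetime, progname, msg|
  # Append "\n" explicitly; Ruby's default formatter does this too.
  { timestamp: datetime.to_s, progname: progname, severity: severity,
    correlation: Datadog::Tracing.log_correlation, message: msg }.to_json + "\n"
end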

Elastic Beanstalk and Rails – initializing multiple times?

When I deploy my RoR (4.2.6) application to Elastic Beanstalk, it appears that the initialization process is getting run four times. This is impacting the way I rely on a singleton instance of a job Scheduler object (using Rufus Scheduler).
In a couple of initializer files and in application.rb, I added a few log statements:
Here:
# /config/initializers/scheduler.rb
require 'rufus-scheduler'

::Rufus_lockfile = "/tmp/.rufus-scheduler.lock"
::Scheduler = Rufus::Scheduler.singleton(
  :lockfile => Rufus_lockfile
)
Rails.logger.info "1: started Scheduler #{Scheduler.object_id}"
And here:
# /config/initializers/cookies_serializer.rb
Rails.logger.info "2: some other initializer"
And here:
# /config/application.rb
require File.expand_path('../boot', __FILE__)
require 'rails/all'

Bundler.require(*Rails.groups)

module MyApp
  class Application < Rails::Application
    ...
    config.after_initialize do
      Rails.logger.info "3: after app is initialized"
    end
  end
end
After I run eb deploy and it completes, this is what I see at the top of app/log/production.log:
I, [2016-05-31T10:31:08.756302 #18753] INFO -- : 2: some other initializer
I, [2016-05-31T10:31:08.757252 #18753] INFO -- : 1: started Scheduler 47235057343600
I, [2016-05-31T10:31:08.896353 #18753] INFO -- : 3: after app is initialized
I, [2016-05-31T10:31:23.669517 #18817] INFO -- : 2: some other initializer
I, [2016-05-31T10:31:23.670380 #18817] INFO -- : 1: started Scheduler 46989489069800
I, [2016-05-31T10:31:23.806154 #18817] INFO -- : 3: after app is initialized
D, [2016-05-31T10:31:23.969103 #18817] DEBUG -- : ActiveRecord::SchemaMigration Load (1.3ms) SELECT "schema_migrations".* FROM "schema_migrations"
I, [2016-05-31T10:31:33.108449 #18897] INFO -- : 2: some other initializer
I, [2016-05-31T10:31:33.109513 #18897] INFO -- : 1: started Scheduler 47156425207060
I, [2016-05-31T10:31:33.116500 #18901] INFO -- : 2: some other initializer
I, [2016-05-31T10:31:33.117374 #18901] INFO -- : 1: started Scheduler 47156425216940
I, [2016-05-31T10:31:33.790266 #18901] INFO -- : 3: after app is initialized
I, [2016-05-31T10:31:33.844517 #18897] INFO -- : 3: after app is initialized
So it looks like the initializer files and even the code in my after_initialize block are getting run four times... and I can't figure out why.
The question was also raised on the rufus-scheduler issue tracker and answered there: https://github.com/jmettraux/rufus-scheduler/issues/208
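Worth noticing in the log above: the four initializations carry four different process ids (#18753, #18817, #18897, #18901), so four separate server processes are each running the initializers, which is exactly the scenario the :lockfile option guards against. A small diagnostic sketch (hypothetical, not from the linked answer) makes this visible:

# Log the pid alongside object_id to confirm that "duplicate"
# initializations are really separate OS processes.
Rails.logger.info "1: started Scheduler #{Scheduler.object_id} in process #{Process.pid}"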

URL encoding issues with curly braces

I'm having issues getting data from GitHub Archive.
The main issue is my problem with encoding {} and .. in my URL. Maybe I am misreading the GitHub API or not understanding encoding correctly.
require 'open-uri'
require 'faraday'

conn = Faraday.new(:url => 'http://data.githubarchive.org/') do |faraday|
  faraday.request :url_encoded            # form-encode POST params
  faraday.response :logger                # log requests to STDOUT
  faraday.adapter Faraday.default_adapter # make requests with Net::HTTP
end

# query = '2015-01-01-15.json.gz' # this one works!!
query = '2015-01-01-{0..23}.json.gz' # this one doesn't work
encoded_query = URI.encode(query)
response = conn.get(encoded_query)
p response.body
The GitHub Archive example for retrieving a range of files is:
wget http://data.githubarchive.org/2015-01-01-{0..23}.json.gz
The {0..23} part is brace expansion, performed by the shell before wget ever sees the URL, so wget gets invoked once per value in the range 0..23. You can test this by executing that command with the -v flag, which returns:
wget -v http://data.githubarchive.org/2015-01-01-{0..1}.json.gz
--2015-06-11 13:31:07-- http://data.githubarchive.org/2015-01-01-0.json.gz
Resolving data.githubarchive.org... 74.125.25.128, 2607:f8b0:400e:c03::80
Connecting to data.githubarchive.org|74.125.25.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2615399 (2.5M) [application/x-gzip]
Saving to: '2015-01-01-0.json.gz'
2015-01-01-0.json.gz 100%[===========================================================================================================================================>] 2.49M 3.03MB/s in 0.8s
2015-06-11 13:31:09 (3.03 MB/s) - '2015-01-01-0.json.gz' saved [2615399/2615399]
--2015-06-11 13:31:09-- http://data.githubarchive.org/2015-01-01-1.json.gz
Reusing existing connection to data.githubarchive.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 2535599 (2.4M) [application/x-gzip]
Saving to: '2015-01-01-1.json.gz'
2015-01-01-1.json.gz 100%[===========================================================================================================================================>] 2.42M 867KB/s in 2.9s
2015-06-11 13:31:11 (867 KB/s) - '2015-01-01-1.json.gz' saved [2535599/2535599]
FINISHED --2015-06-11 13:31:11--
Total wall clock time: 4.3s
Downloaded: 2 files, 4.9M in 3.7s (1.33 MB/s)
In other words, the shell substitutes each value into the URL and wget then retrieves each resulting URL. This isn't obvious behavior, nor is it well documented, but you can find mention of it "out there". For instance in "All the Wget Commands You Should Know":
7. Download a list of sequentially numbered files from a server
wget http://example.com/images/{1..20}.jpg
To do what you want, you need to iterate over the range in Ruby using something like this untested code:
0.upto(23) do |i|
  response = conn.get("/2015-01-01-#{ i }.json.gz")
  p response.body
end
To get a better idea of what's going wrong, let's start with the example given in the GitHub documentation:
wget http://data.githubarchive.org/2015-01-01-{0..23}.json.gz
The thing to note here is that {0..23} is automagically getting expanded by bash. You can see this by running the following command:
echo {0..23}
> 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
This means wget doesn't get called just once, but instead gets called a total of 24 times. The problem you're having is that Ruby doesn't automagically expand {0..23} like bash does, and instead you're making a literal call to http://data.githubarchive.org/2015-01-01-{0..23}.json.gz, which doesn't exist.
Instead you will need to loop through 0..23 yourself and make a single call every time:
(0..23).each do |n|
  query = "2015-01-01-#{n}.json.gz"
  encoded_query = URI.encode(query)
  response = conn.get(encoded_query)
  p response.body
end
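A side note on the URI.encode call (beyond what either answer covers): these file names contain nothing that needs percent-encoding, and URI.encode was long deprecated and removed in Ruby 3.0. If you do need to escape a path segment in modern Ruby, ERB::Util.url_encode is one replacement:

require 'erb'

# Percent-encodes everything outside the unreserved set, including { and }.
ERB::Util.url_encode('2015-01-01-{0..23}.json.gz')
# => "2015-01-01-%7B0..23%7D.json.gz"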

Substring to hash key issue?

I have a log file and need to create a hash key for each URL in the record. Each line from the record has been placed into an array and I am looping through the array assigning hash keys.
I need to get from this:
"2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"
to this:
"/logschecks/scripts/setup1.php"
I have tried using match, scan and split, but they have all failed to get me where I need to go.
My method currently looks like:
def pathHistogram (rowsInFile)
  i = 0
  urlHash = Hash.new
  while i <= rowsInFile.length - 1
    urlKey = rowsInFile[i].scan(/<"GET ">/).last.first
    if urlHash.has_key?(urlKey) == true
      # get the number of stars already in there and add one.
      urlHash[urlKey] = urlHash[urlKey] + '*'
      i = i + 1
    else
      urlHash[urlKey] = '*'
      i = i + 1
    end
  end
end
I know that just scanning for "GET " won't complete the job, but I was trying to baby-step through it. The match and split versions that I tried were fairly epic fails, but I was likely using them incorrectly and they are long gone.
Running this script gives me an undefined method error on "first", though I have gotten other errors when I vary the way this is handled.
I should also say I am not married to using scan. If another method would work better, I would be more than happy to switch.
Any help would be greatly appreciated.
You state in a comment to the other answer that the pattern is basically "GET ... HTTP", where you are interested in the ... part. That can be extracted very easily:
line = '2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"'
line[/"GET (.*?) HTTP/, 1]
# => "/logschecks/scripts/setup1.php"
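Putting that together with the question's star-counting convention, the whole method might be rewritten like this (an untested sketch assuming rowsInFile is an array of lines like the sample):

def path_histogram(rows_in_file)
  url_hash = Hash.new('')            # missing keys start with zero stars
  rows_in_file.each do |row|
    url_key = row[/"GET (.*?) HTTP/, 1]
    next unless url_key              # skip lines without a GET request
    url_hash[url_key] += '*'         # one star per hit
  end
  url_hash                           # the original method never returned the hash
end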
Assuming each of your input lines contains /logschecks/...:
x = "2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: \"GET /logschecks/scripts/setup1.php HTTP/1.1\", host: \"www.example.com\""
x[%r(/logscheck[/\w\.]+)] # => "/logschecks/scripts/setup1.php"
Scanning HTTP logs isn't hard, but how you go about it will vary depending on the format. In the sample you're giving it's easier than a standard log because you have some landmarks you can look for:
Search for request: " using something like:
/request: "\S+ (\S+)/i
That pattern will skip over GET, POST, HEAD or whatever method was used for the request.
log_line[/request: "\S+ (\S+)/i, 1] # => "/logschecks/scripts/setup1.php"
You might want to know that if you're mining your logs. In that case...
Search for request: "[GET|POST|HEAD|...] using something like:
/request: "(\S+) (\S+)/i
You'd use it like:
method, url = log_line.match(/request: "(\S+) (\S+)/i).captures # => ["GET", "/logschecks/scripts/setup1.php"]
method # => "GET"
url # => "/logschecks/scripts/setup1.php"
You can also grab whatever is inside the double-quotes, then split it to get at the parts:
/request: "([^"]+)"/i
For instance:
log_line = %[2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"]
method, url, http_ver = log_line[/request: "([^"]+)"/i, 1].split # => ["GET", "/logschecks/scripts/setup1.php", "HTTP/1.1"]
method # => "GET"
url # => "/logschecks/scripts/setup1.php"
http_ver # => "HTTP/1.1"
Or use a bit more complex pattern, using some of the modern extensions and reduce the code:
log_line = %[2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"]
/request: "(?<method>\S+) (?<url>\S+) (?<http_ver>\S+)"/i =~ log_line
method # => "GET"
url # => "/logschecks/scripts/setup1.php"
http_ver # => "HTTP/1.1"

Calling Sinatra from Sinatra produces results different from external request

Following my question here.
So I am trying to do a smart redirect using this:
get "/category/:id/merge" do
#... setting #catalog_id and category
call env.merge("PATH_INFO" => "/catalog/#{#catalog_id}/category/#{category.id}", "REQUEST_METHOD"=>"PATCH","QUERY_STRING"=>"merge=1")
status 200
end
But when I look in logs, I see something that is not only frustrating but also completely absurd:
# this one is from internal call
I, [2013-03-21T15:55:54.382153 #29569] INFO -- : Processing GET /catalog/1/category/2686/merge
I, [2013-03-21T15:55:54.382239 #29569] INFO -- : Parameters: {}
...
I, [2013-03-21T15:55:54.394992 #29569] INFO -- : Processing PATCH /catalog/1/category/2686
I, [2013-03-21T15:55:54.395041 #29569] INFO -- : Parameters: {"merge"=>"1"}
I, [2013-03-21T15:55:54.395560 #29569] INFO -- : Processed PATCH /catalog/1/category/2686?merge=1 with status code 404
I, [2013-03-21T15:55:54.395669 #29569] INFO -- : Processed GET /catalog/1/category/2686/merge with status code 200
# this one is a direct request
I, [2013-03-21T15:56:36.246535 #29588] INFO -- : Processing PATCH /catalog/1/category/2686
I, [2013-03-21T15:56:36.246629 #29588] INFO -- : Parameters: {"merge"=>"1"}
...
I, [2013-03-21T15:56:36.286216 #29588] INFO -- : Processed PATCH /catalog/1/category/2686?merge=1 with status code 204
And the body of the internal 404 response is just Sinatra's standard 404 error page. How can it look me straight in the eye and tell me it doesn't know this route, when I just caught it serving exactly the same URL with an acceptable 204?
UPDATE
It gets even more exciting: when I change REQUEST_METHOD to GET, it works like a charm.
I, [2013-03-21T17:09:37.718756 #3141] INFO -- : Processing GET /catalog/1/category/2686/merge
I, [2013-03-21T17:09:37.718838 #3141] INFO -- : Parameters: {}
...
I, [2013-03-21T17:09:37.735632 #3141] INFO -- : Processing GET /catalog/1/category/2686
I, [2013-03-21T17:09:37.735678 #3141] INFO -- : Parameters: {"merge"=>"1"}
...
I, [2013-03-21T17:09:37.773033 #3141] INFO -- : Processed GET /catalog/1/category/2686?merge=1 with status code 200
I, [2013-03-21T17:09:37.773143 #3141] INFO -- : Processed GET /catalog/1/category/2686/merge with status code 200
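One thing worth noting about the route above (an observation, not a confirmed answer to the 404): call returns the downstream response as a rack [status, headers, body] triple, and the handler discards it and forces status 200, which is why the outer GET logs 200 even when the inner PATCH 404s. Sinatra's documented pattern for triggering another route keeps that triple, roughly:

get '/foo' do
  status, headers, body = call env.merge("PATH_INFO" => '/bar')
  [status, headers, body.map(&:upcase)]
end

get '/bar' do
  "bar"
end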
