Robots.txt on Heroku Rails app unreachable by Google crawl [closed] - heroku

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 9 years ago.
Google cannot access our robots.txt file when trying to crawl our site: www.spokenvote.org
In Google's Webmaster tools, Fetch as Google, the error reads: Unreachable robots.txt
We have the exact same code deployed on another Heroku app, staging.spokenvote.org, where Google is able to access the robots.txt fine.
Tailing the Heroku logs, we see that the error thrown by Heroku is an H13:
2014-01-04T21:11:32.042768+00:00 app[web.1]: !! Rack application returned nil body. Probably you wanted it to be an empty string?
2014-01-04T21:11:32.042879+00:00 app[web.1]: !! Unexpected error while processing request: undefined method `each' for nil:NilClass
2014-01-04T21:11:32.053281+00:00 heroku[router]: at=error code=H13 desc="Connection closed without response" method=GET path=/robots.txt host=www.spokenvote.org fwd="66.249.66.196" dyno=web.1 connect=2ms service=9ms status=503 bytes=0
Any ideas what is happening?
Spokenvote Team
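A quick way to reproduce what the crawler sees is to fetch the file yourself with a bot-like User-Agent and watch for the 503. A minimal Python sketch (the User-Agent string here is illustrative, not Googlebot's exact one):

import urllib.request

# Fetch robots.txt the way a crawler would; a 503 or a dropped connection
# here reproduces the "Unreachable robots.txt" report from Webmaster Tools.
req = urllib.request.Request(
    "http://www.spokenvote.org/robots.txt",
    headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"},
)
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(resp.status, resp.read(200))
except Exception as exc:
    print("fetch failed:", exc)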

Related

Heroku Request blocked

In my application (deployed on Heroku), there is a GET request that is blocked at the infrastructure layer: the request never gets to execute my application's code. It returns status=400 with connect=0ms and carries no associated Heroku error code or description. The request never reaches the application.
It only happens with this one GET request, and only when it comes from the production server. If I make the same request from Postman it is received correctly with status=200.
All other requests have no problem and execute correctly from the production server.
This is an example:
2021-08-20T10:27:02.217551+00:00 heroku[router]: at=info method=GET path="/api/get" host=myapp.herokuapp.com request_id=2920634e-87f2-4b2c-be60-b38497c53e58 dyno=web.1 connect=0ms service=1ms status=400 bytes=47 protocol=https
The problem was identified and corrected: one of the headers of the GET request was being sent as null, so the request was rejected with a Bad Request exception before it ever entered the app.
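A sketch of the defensive fix in Python (the header name is hypothetical, standing in for whichever header was null): strip unset values before the request leaves the client, so nothing malformed ever reaches the router.

import requests

# "X-Custom-Trace" is a made-up stand-in for the header that was being
# sent as null in this question.
headers = {
    "Authorization": "Bearer abc123",
    "X-Custom-Trace": None,
}
# Drop headers whose value is unset. (The requests library happens to omit
# None-valued headers on its own, but other clients may serialize the null
# literally, which can get the request rejected before it reaches the app.)
clean = {k: v for k, v in headers.items() if v is not None}
resp = requests.get("https://myapp.herokuapp.com/api/get", headers=clean)
print(resp.status_code)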

Heroku throws an error when retrieving JSON response with 40 elements

So I have an app deployed on Heroku, a Go backend with an Angular 8 frontend, using Hobby dynos. Today I noticed that one of my endpoints stops working if the result contains more than 40 elements. I tested this locally and it doesn't happen, so it must be a problem when running on Heroku. Any idea what is going on here? Heroku throws the following error message:
sock=backend at=error code=H18 desc="Server Request Interrupted" method=POST path="/invoices/g/range?from=2020-08-01T00:00:00.000+00:00&to=2020-08-28T00:00:00.000+00:00" host=xxx-prod.herokuapp.com request_id=d113ba1c-f51a-4f57-8f02-31195da1b5f8 fwd="xx.xxx.xxx.xxx" dyno=web.1 connect=1ms service=60ms status=503 bytes= protocol=https
So I finally figured out what was going on here. It has to do with an unresolved Node issue, see: https://github.com/nodejs/node/issues/12339
The POST handler was receiving a large response that triggered that issue, so I rebuilt my logic to retrieve the response with a GET instead. Now it works as expected.
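The workaround, sketched from the client side in Python for illustration (the original stack was Go and Angular; the endpoint and parameters are taken from the log line above):

import requests

base = "https://xxx-prod.herokuapp.com/invoices/g/range"
params = {
    "from": "2020-08-01T00:00:00.000+00:00",
    "to": "2020-08-28T00:00:00.000+00:00",
}
# Before: the date range was POSTed in a request body, and large responses
# triggered the H18. After: the same range travels in the query string of
# a plain GET, which sidesteps the issue.
resp = requests.get(base, params=params)
print(resp.status_code, len(resp.content))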

Heroku - prevent scraping

I am looking at my drain log and I see this:
327 <158>1 2018-04-17T22:03:27.578702+00:00 heroku router - - at=info method=GET path="/{url}" host={my_host} request_id=11bb9b05-dea3-42c2-b57a-9be6fb9b93d2 fwd="80.6.26.72,141.101.107.25" dyno=web.1 connect=0ms service=1ms status=200 bytes=6265 protocol=http
I am certain that this request doesn't come from a legitimate user. How can I dig in further and get the remote client's IP? I used https://stackoverflow.com/a/6837689/2513428 inside my script to check the IPs, but I assume it returned the proxy address of Heroku's servers.
Heroku makes the IP address that made the request available in the fwd log field: https://devcenter.heroku.com/articles/http-routing#heroku-router-log-format
You can also read it within your code by looking at the X-Forwarded-For HTTP header.
So in your case, the IP of the client making this request was 80.6.26.72.
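If the app behind the router happened to be, say, Flask (the question doesn't name the framework), reading that header looks like this; the leftmost entry is the original client, and anything after it is an intermediate proxy:

from flask import Flask, request

app = Flask(__name__)

@app.route("/<path:url>")
def show_client_ip(url):
    # Heroku's router appends each hop to X-Forwarded-For, so the original
    # client is the first comma-separated entry.
    forwarded = request.headers.get("X-Forwarded-For", "")
    if forwarded:
        return forwarded.split(",")[0].strip()
    return request.remote_addr or "unknown"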

How do I suppress http request logs on heroku (using flask, if it matters)?

Very simple question: I'm running a simple Flask app on Heroku with no changes to the default logging settings, but my logs are filled with all kinds of terrible HTTP request noise.
For example, I don't have a favicon or anything like that set up on my app. I don't need one. But every browser, of course, requests one, so whenever I try to look at my logs I get a flood of 404s for the favicon and the like, which is totally useless information to me.
Example garbage logs (with sensitive information stripped):
2018-02-01T04:11:32.538658+00:00 heroku[router]: at=info method=GET path="/apple-touch-icon-precomposed.png" host=[MY_HOSTNAME_CENSORED] request_id=[A_UUID] fwd="[AN_IP_ADDRESS]" dyno=web.1 connect=0ms service=17ms status=404 bytes=386 protocol=https
2018-02-01T04:11:32.675406+00:00 heroku[router]: at=info method=GET path="/favicon.ico" host=[MY_HOSTNAME_CENSORED] request_id= fwd="[AN_IP_ADDRESS]" dyno=web.1 connect=0ms service=2ms status=404 bytes=386 protocol=https
I think that these logs are generated by heroku itself rather than the application (that's what the bit after the timestamp means, right?), but I can't find any documentation anywhere on how to change that.
There's an earlier related SO question, but the latest relevant answer, which says you can't disable these logs, is from 2014, so I like to think this might have changed.
Alternatively, is there some way to instruct browsers not to request favicons and such?
You could easily do this kind of filtering in whatever tool you are using for reading your logs.
For example, if you attach the Papertrail add-on to your Heroku app, you can easily configure it to filter out any log patterns you want, even if you are using their free plan.
Such configuration is done via the Papertrail "Settings" menu, under "Filter logs".
See Log Filtering for details.
There isn't any way to get rid of it entirely. But if what you're really annoyed by is the router noise showing up when you're live-tailing your logs (which is what annoyed me), you can add --source app to the tail command to hide the router lines, like this:
heroku logs --tail --source app --remote whateveryounamedit
Then you'll only see logs generated by your app.
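On the asker's alternative idea: you can't reliably stop browsers from requesting favicons, but you can stop those requests from 404ing. A minimal Flask sketch that answers them with an empty 204, so the router line still appears but with a quieter status:

from flask import Flask

app = Flask(__name__)

@app.route("/favicon.ico")
@app.route("/apple-touch-icon-precomposed.png")
def no_icon():
    # An empty 204 satisfies the browser without shipping an actual icon;
    # the router still logs the request, but as a 204 rather than 404 noise.
    return "", 204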

Heroku 500 Error but works after reload

I'm currently running an app on Heroku with Python/Flask as the main back-end. I've managed to successfully launch the site on Heroku (here's the site link). When I load the site in the browser after a certain period of time, I get an error as follows:
Internal Server Error
The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
Essentially I am getting a 500 Error, with the following in the heroku logs:
2014-01-11T22:08:01.423860+00:00 heroku[router]: at=info method=GET path=/ host=outlet-beta.herokuapp.com fwd="207.38.157.121" dyno=web.1 connect=2ms service=92ms status=500 bytes=291
After I reload the page, the site works fine. I'm not 100% sure how to proceed on this front.
I have the Sentry addon that lets me see what's going on with the errors:
OperationalError: (OperationalError) (2006, 'MySQL server has gone away')
Sentry has logged this as a trending error which happens pretty often, and is logged each time this internal service error occurs.
I'm running the site with ClearDB and Cloudinary. Is there any chance the MySQL server isn't being connected to quickly enough, and that's why it works after a reload? If so, how do I fix it?
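The usual cause of "MySQL server has gone away" on ClearDB is the server closing connections that have sat idle in the app's pool, so the first request after a quiet period fails and the retry succeeds on a fresh connection. A common mitigation, assuming the app uses SQLAlchemy (the connection URI below is a placeholder), is to recycle and pre-ping pooled connections:

from sqlalchemy import create_engine

# pool_recycle closes and reopens any connection older than N seconds, so
# the app never reuses one MySQL has already dropped; pool_pre_ping
# (SQLAlchemy 1.2+) additionally tests each connection on checkout.
engine = create_engine(
    "mysql://user:password@cleardb-host/dbname",  # placeholder URI
    pool_recycle=280,
    pool_pre_ping=True,
)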
