Submitting login fields during a scraping process with ruby? - ruby

I need to scrape some financial data from a system called NetTeller.
An example can be found here.
Note the initial ID field prompt:
Then, once you submit, you have to enter your password:
As you can see, login is a two-step process: you first enter an ID number, and after submitting it you are presented with a password field. I'm hitting some roadblocks getting through these two steps before reaching the data I actually want. How would one handle a scenario like this, where you need to pass through the authentication fields before getting to the data you want to scrape?
I assumed I could just jump in with httpclient and Nokogiri, but I'm curious whether there are any tricks for dealing with a two-page login like this before reaching the target.

I would use Mechanize. The first page is "tricky" because the login form is inside an iframe, so you can simply request the URL that the iframe loads. Here is how:
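require 'mechanize'

agent = Mechanize.new
# Get the first page (the URL that the iframe loads)
iframe_url = 'https://www.banksafe.com/sfonline/'
page = agent.get(iframe_url)
login_form = page.forms.first
# Fill in the ID. 'ID' is a placeholder field name; check the page source for the real one
login_form.field_with(:name => 'ID').value = '12345678'
# Get the second page
response = login_form.submit
second_login_form = response.forms.first
# Fill in the password. 'PASSWORD' is also a placeholder field name
second_login_form.field_with(:name => 'PASSWORD').value = 'xxxxx'
# Get the page to scrape
response = second_login_form.submit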
This is how you could handle a scenario like this. Obviously you will need to adapt it to how those forms and fields are actually written, and to other page-specific details, but I would go for this approach.
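For readers not using Mechanize, the part that usually needs adapting is collecting the form's existing inputs (including hidden ones) before adding your own values. Below is a minimal Python sketch of that step using only the standard library. The sample HTML, the field names ID and session, and the helper names are all hypothetical; the real pages will differ, and you would still need an HTTP client that keeps cookies between the two steps.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FirstFormParser(HTMLParser):
    """Collects the action and input fields of the first <form> on a page."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._in_form and not self._done:
            self._in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Keep pre-filled values (e.g. hidden session fields) as they are.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_submission(html, overrides):
    """Return (action, urlencoded body) for the first form, with overrides applied."""
    parser = FirstFormParser()
    parser.feed(html)
    fields = dict(parser.fields)
    fields.update(overrides)
    return parser.action, urlencode(fields)


# Hypothetical first login page: one hidden field plus the ID prompt.
sample_page = (
    '<form action="/login/step2" method="post">'
    '<input type="hidden" name="session" value="abc">'
    '<input type="text" name="ID">'
    "</form>"
)
action, body = build_submission(sample_page, {"ID": "12345678"})
```

The same function can be applied again to the response of the first step to build the password submission.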

Related

How does the Djoser account verification system really work under the hood?

I'm currently trying to build my own account verification system, using parts of Djoser as a reference. Let me walk you through to my question.
Say you create a new account in a Djoser app:
you enter the information for the new account, including an email address
you submit the form to the backend
you get an email at that address asking you to verify your account
you click the link in the email
you land on the verify-account page
On this page there's a button that submits a UID and a token, both of which are contained in the URL.
My question is:
What are those tokens? Are they JWTs?
How do they work?
How can I implement that in my own projects without djoser?
The answers to your questions can be found in Djoser's own source code.
Look at the djoser.email module: several of the classes there define a get_context_data() method.
def get_context_data(self):
    context = super().get_context_data()
    user = context.get("user")
    context["uid"] = utils.encode_uid(user.pk)
    context["token"] = default_token_generator.make_token(user)
    context["url"] = settings.ACTIVATION_URL.format(**context)
    return context
So each email class builds its context and adds three things to it: the 'uid' (basically str(pk) encoded in URL-safe base64; check encode_uid()), the 'token' (not a JWT and not a random string: Django's default_token_generator derives it from a timestamp and the user's current state, keyed with your SECRET_KEY, so it can be verified later without being stored; the PASSWORD_RESET_TIMEOUT setting controls how long it stays valid, which is what makes the links temporary), and finally the URL for the action being performed (in this case, email activation).
Another point to consider is that each of these classes has a template assigned, and you can override it.
Now, in the views, specifically in UserViewSet and its methods perform_create(), perform_update() and resend_activation(): if the Djoser setting SEND_ACTIVATION_EMAIL is True, they call ActivationEmail to send an email to the user's address.
def perform_create(self, serializer):
    user = serializer.save()
    signals.user_registered.send(
        sender=self.__class__, user=user, request=self.request
    )
    context = {"user": user}
    to = [get_user_email(user)]
    if settings.SEND_ACTIVATION_EMAIL:
        settings.EMAIL.activation(self.request, context).send(to)
    ...
The email is sent, and when the user clicks the link, the activation() action of the same view is executed, provided the token is still valid and the uid matches (see UidAndTokenSerializer). It sets the user's 'is_active' flag to True and may send another email to confirm the activation.
If you want to code your own version, as you can see, you only need to create a hard-to-guess token, generate some uid that identifies the user in whatever way you prefer, and write a pair of views that send the templated emails and handle the activation.
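To make the idea concrete, here is a minimal sketch of a signed, time-limited activation token written with the Python standard library only. It is not Djoser's or Django's actual implementation; SECRET_KEY, TOKEN_TIMEOUT, the state argument and all function names are assumptions made for illustration. The state value stands for anything that changes once the link has been used (for example "inactive" versus "active"), so an old link stops working after activation.

```python
import base64
import hashlib
import hmac
import time

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical; keep it out of source control
TOKEN_TIMEOUT = 3 * 24 * 60 * 60            # hypothetical lifetime: three days, in seconds


def encode_uid(pk):
    """URL-safe base64 of the primary key, without padding."""
    return base64.urlsafe_b64encode(str(pk).encode()).decode().rstrip("=")


def decode_uid(uid):
    """Inverse of encode_uid; restores the padding before decoding."""
    padded = uid + "=" * (-len(uid) % 4)
    return int(base64.urlsafe_b64decode(padded).decode())


def _sign(pk, state, timestamp):
    message = f"{pk}:{state}:{timestamp}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def make_token(pk, state, now=None):
    """Token of the form '<timestamp>-<hex signature>'."""
    timestamp = int(time.time() if now is None else now)
    return f"{timestamp}-{_sign(pk, state, timestamp)}"


def check_token(pk, state, token, now=None):
    """True only if the signature matches and the token has not expired."""
    ts_text, _, signature = token.partition("-")
    try:
        timestamp = int(ts_text)
    except ValueError:
        return False
    expected = _sign(pk, state, timestamp)
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    current = time.time() if now is None else now
    return current - timestamp <= TOKEN_TIMEOUT
```

The activation link would then carry encode_uid(user_pk) and make_token(user_pk, "inactive"), and the activation view would call decode_uid() and check_token() before flipping the flag. Nothing needs to be stored in the database, because the token can be recomputed from the user's state.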

Safely save data to django model using AJAX

I have a model say TestModel as follows:
from django.db import models

class TestModel(models.Model):
    name = models.CharField(max_length=255)  # CharField requires max_length
    description = models.TextField()
Now I can use a ModelForm to save data to this model. However, say I want to use Ajax and send a url as follows savethis?name=ABC&desc=SomeRandomDescription to a view that handles it as follows:
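from django.http import HttpResponse

def savethis(request):
    # .get() returns None instead of raising when a parameter is missing
    name = request.GET.get('name')
    desc = request.GET.get('desc')
    if name and desc:
        test = TestModel(name=name, description=desc)
        test.save()  # save() must be called, not just referenced
        return HttpResponse('Ok')
    else:
        return HttpResponse('Fail')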
What's to stop someone from running a script that can easily hit this url with valid data and thus save data to my model? How do I ensure that incoming data is sent only from the right source?
One option is sending the data as JSON in a POST request, but even that's not too hard to emulate.
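What you are describing is not really Cross-Site Scripting; it is closer to Cross-Site Request Forgery (CSRF), combined with the more basic problem that the view accepts writes from anyone. There are several ways to deal with it, and going into all of them in one answer would be fruitless, but the usual starting points are to accept writes only via POST, keep Django's CSRF protection enabled for that view, and require the user to be authenticated. I suggest you search for those terms and do some poking around; you will find several different methods for protecting your site better.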
Django has a security page dedicated to talking about how to protect your site.
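As a rough illustration of what a CSRF check does, here is a framework-free Python sketch. In a real Django project you would rely on the built-in CSRF middleware and the csrf_token template tag rather than writing this yourself; the session dictionary and the function names below are hypothetical stand-ins. Note that this only stops other sites from submitting on a user's behalf. It does not stop a logged-in user from scripting their own requests, which is why authentication and server-side validation are still needed.

```python
import hmac
import secrets


def issue_token(session):
    """Create a per-session token to embed in the form or send in an AJAX header."""
    token = secrets.token_urlsafe(32)
    session["csrf_token"] = token
    return token


def is_trusted_submission(session, method, submitted_token):
    """Accept only POSTs that carry the token previously issued to this session."""
    if method != "POST":
        return False
    expected = session.get("csrf_token")
    if not expected or not submitted_token:
        return False
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(expected.encode(), submitted_token.encode())
```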

How do I search then parse results on a webpage with Ruby?

How would you use Ruby to open a website, do a search in the search field, and then parse the results? For example, entering something into a search engine and then parsing the results page. I know how to use Nokogiri to find the webpage and open it, but I am lost on how to enter text into the search field and move forward to the results. Also, on the page that I am actually searching I have to click on an Enter button; I can't simply hit the Enter key to move forward. Thank you so much for your help.
Use Mechanize - a library used for automating interaction with websites.
Something like Mechanize will work, but interacting with the front-end UI code is always going to be slower and more problematic than making requests directly against the back end.
Your best bet would be to look at the request that is being made to the server (probably an HTTP GET or POST request with some associated parameters). You can do this with Firebug, or with Fiddler 2 on Windows. Then, once you know the parameters that the server will accept, just make the request yourself.
For example, if you were doing this with the duckduckgo.com search engine, you could either get Mechanize to go to duckduckgo.com, input text into the search box, and click submit, or you could just create a GET request to http://www.duckduckgo.com?q=search_term_here.
You can use Mechanize for something like this but it might be overkill. I would take a look at RestClient, especially if you don't need to manage cookies.
Edit:
If you can determine the specific URL that the form submits to, for example 'example.com/search', and you know the request is a POST (check the form's method attribute), you could construct something like this with Mechanize:
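require 'mechanize'

string_to_search_for = '12345'  # placeholder search value
agent = Mechanize.new
# POST the fields directly; the keys are the inputs' name attributes
page = agent.post 'http://example.com/search', {
  "_id0:Number" => string_to_search_for,
  "_id0:submitButton" => "Enter"
}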
Notice how the 'name' attribute of a form element becomes a key in the POST and the 'value' attribute becomes its value. A text 'input' element takes its value directly from the text you would have typed in. The browser turns this into a request and submits it to the server when you push the submit button (in this case, of course, you are making the request directly). The result of the POST should be some HTML that you can parse for the info you need.
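The name/value mapping described above is just URL-encoding of key/value pairs. As a small sketch in Python's standard library (the URL and field names are carried over from the example; build_search_request is a hypothetical helper), this builds the request without sending it so you can inspect the body:

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(url, number):
    """Build (but do not send) the POST a browser would make for the form."""
    fields = {
        "_id0:Number": number,         # text input: name -> typed value
        "_id0:submitButton": "Enter",  # submit button: name -> value attribute
    }
    body = urlencode(fields).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    # Supplying data makes urllib treat the request as a POST.
    return Request(url, data=body, headers=headers)


req = build_search_request("http://example.com/search", "12345")
```

Passing req to urllib.request.urlopen() would actually send it; the response body is the HTML to parse.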

Django Forms - Processing GET Requests

We have an existing Django form that accepts GET requests to allow users to bookmark their resulting query parameters. The form contains many fields, most of which are required. The form uses semi-standard boilerplate for handling the request, substituting GET for POST:
if request.method == 'GET':
    form = myForm(request.GET)
    if form.is_valid():
        # Gather fields together into query.
        pass
else:
    form = myForm()
The problem is that the first time the form is loaded, there's nothing in the GET request to populate the required fields with, so most of the form lights up with 'missing field' errors.
Setting initial values doesn't work; apparently, the non-existent values in the GET request override them.
How can we avoid this? I'm pretty certain we're simply not processing things correctly, but I can't find an example of a form that handles GET requests. We want errors to show up if the user hits the "Submit" button while fields are blank or otherwise invalid, but don't want these errors showing up when the form is initially displayed.
Passing data as the first positional argument to a forms.Form subclass tells Django that you intend to process (bind and validate) the form rather than just display a blank/default one. Your if request.method == 'GET' check isn't making the distinction you want, because ordinary page loads (typing a URL in a browser or clicking a link) are also GET requests, so request.method is 'GET' either way.
You need some differentiating mechanism such that you can tell the difference between a form display and a form process.
Ideas:
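If your processing is done via AJAX, you could use if request.is_ajax() as your conditional. Note that request.is_ajax() was deprecated in Django 3.1 and removed in 4.0; on newer versions, check request.headers.get('x-requested-with') == 'XMLHttpRequest' instead.
Alternatively, you could include a GET parameter that signals that the request is a form submission. For this approach, you'd first need something in your form: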
<input type="hidden" name="action" value="process_form" />
And then you can look for that value in your view:
if 'action' in request.GET and request.GET['action'] == 'process_form':
    form = myForm(request.GET)
    if form.is_valid():
        # form processing code
        pass
else:
    form = myForm()
I'll also give you the standard, boilerplate point that it's generally preferable not to use GET for form processing if you can help it (precisely because you run into difficulties like this since you're using an anomalous pattern), but if you have a use case where you really need it, then you really need it. You know your needs better than I do. :-)
If your clean page load doesn't have any non form GET params, you can differentiate between a clean page load and a form submit in your view. Instead of the usual
form = YourForm()
if request.POST:
you can do
if request.GET:  # an empty QueryDict is falsy; .items() would always be truthy
    form = YourForm(request.GET)
    if form.is_valid():
        ...
else:
    form = YourForm()
If your clean page load could have other params (e.g. email link tracking params), you'll need to use the QueryDict methods to test whether any of your form params are in the request.
request.GET is an empty QueryDict when you first load a clean form. Once you have submitted the form, request.GET will be populated with your fields' data, even if the fields contain only empty values.
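The check for "are any of my form params present" can be sketched outside Django with the standard library. FORM_FIELDS and is_form_submission are hypothetical names; in a Django view the equivalent test would be roughly any(name in request.GET for name in FORM_FIELDS).

```python
from urllib.parse import parse_qs

# Hypothetical names of the fields defined on the form.
FORM_FIELDS = frozenset({"name", "start_date", "end_date"})


def is_form_submission(query_string, form_fields=FORM_FIELDS):
    """True if the query string contains at least one of the form's fields.

    keep_blank_values=True matters: a submitted-but-empty field ("name=")
    must still count as a submission so that validation errors are shown.
    """
    params = parse_qs(query_string, keep_blank_values=True)
    return any(key in form_fields for key in params)
```

With this, a clean page load that only carries tracking parameters renders an unbound form, while a submission with blank fields is bound and validated.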
My first question is this, which I posted as a comment:
Why not just use request.POST and the standard way of processing form data?
After considering everything here, perhaps what you are looking for is a way of processing data in your query string to populate a form. You can do that without using request.GET as your form.data.
In my own views, I take advantage of a utility function I created to add initial data to the form from request.GET, but I am not going to share that function here. Here's the signature, though. initial_dict is typically request.GET. model_forms is either a single ModelForm or a list of ModelForm.
def process_initial_data(model_forms, initial_dict):
Nevertheless, I am able to process the form through the standard practice of using request.POST when the form is POSTed. And I don't have to pass around all kinds of information in the URL query string or modify it with JavaScript as the user enters information.

Manual POST request

Scenario: I have logged into a website, obtained cookies and so on, and reached a particular page containing a form with hidden fields. I now want to send my own HTTP POST with my own hidden form data, instead of what is on the page, and verify the response to that request rather than the one the page's form would produce.
Reason: We are testing against pre-existing data (I know, I know), which can differ between environments, so there is no predictable way to use it. We need a workaround.
Is there any way to do this without manually editing the existing form and submitting that? That feels a little 'hacky'.
Ideally, I would like to say something like:
browser.post 'url', 'field1=test&field2=abc'
I would probably switch to Mechanize to muck around at the protocol level. Something like this, added to your script:
require 'mechanize'

b = Mechanize.new  # older releases used WWW::Mechanize
b.get('http://yoursite.com/current_page') do |page|
  # Submit the login form
  my_form = page.form_with(:action => '/post/url') do |f|
    f.form_loginname = 'tim'
    f.form_pw = 'password'
  end.click_button
end

Resources