Difficulties using UNIX cURL to scrape Ajax Wicket Information

I have been asked to write UNIX shell scripts that scrape certain websites. We use Fiddler to trace the HTTP requests, then write the cURL commands accordingly. Scraping most websites is fairly simple, but I've run into a situation where I'm having difficulty capturing certain information.
I need to be somewhat generic: I cannot provide the address of the website I am actually looking at, but I can post some of the requests and responses to provide context.
Here's the situation:
The website starts with a search screen. You enter your search query and the website returns a list of results.
I need to choose the first result from the result page.
I need to capture EVERYTHING on the page from the first result.
Everything up until this point is working fine.
Here's the problem:
The page returned has hyperlinks that are wickets. When one of these links is clicked, a window pops up within the page. It is not an actual window like a pop-up created by JavaScript; it is more comparable to what you see when you 'compose a message' or 'poke' someone on Facebook (am I the only one who still does that?).
I need to capture the contents of that pop up window. There are usually multiple wicket links on a given page. Handling that should be easy enough with a loop, but I need to figure out the proper way to cURL those wickets first.
Here is the cURL command I'm currently using to attempt to scrape the wickets.
(I'm explicitly setting the Referer, Accept, and Wicket-Ajax headers, as these were the items sent in the header when I traced the site.) $link is the URL, which looks like this:
http://www.someDomainName.com/searches/?x=as56f1sa65df1&random=0.121345151
(I believe the random value is populated by some JavaScript; I'm not sure whether it is needed or even possible to recreate. I'm currently sending one of the random values that I received on one particular occasion.)
/bin/curl -v3 -b COOKIE -c COOKIE -H "Accept: text/xml" -H "Referer: $URL$x" -H "Wicket-Ajax: true" -sLf "$link"
Here is the response I get:
<ajax-response><redirect><![CDATA[home.page;jsessionid=6F45DF769D527B98DD1C7FFF3A0DF089]]></redirect>
</ajax-response>
I am expecting an XML document with actual content to be returned. Any insight into this issue would be greatly appreciated. Please let me know if you need more information.
Thanks,
Paul
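For scripting around this response, it helps to tell a redirect apart from a content response before looping over the wicket links. The following is a minimal Python sketch (not part of the original shell script) that parses the response body shown above; the function name is hypothetical, and it only assumes the `<ajax-response>`/`<redirect>` layout visible in the trace.

```python
import xml.etree.ElementTree as ET


def wicket_redirect_target(body):
    """Return the redirect target of an <ajax-response> body, or None if there is none."""
    root = ET.fromstring(body)
    redirect = root.find("redirect")  # direct child of <ajax-response>
    return redirect.text if redirect is not None else None


# The response body captured in the question.
sample = (
    "<ajax-response><redirect><![CDATA[home.page;jsessionid="
    "6F45DF769D527B98DD1C7FFF3A0DF089]]></redirect>\n</ajax-response>"
)
target = wicket_redirect_target(sample)
print(target)
```

A non-None result means the server answered with a redirect instead of the pop-up content, so the request (cookies, URL, or headers) still differs from what the browser sent.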

Related

playwright-python Is there a way to navigate pages with post_data?

I learned how to navigate pages using
page.goto('someurl')
Some pages need post data to be sent in order to browse them, but there seems to be no argument for the POST method in the goto function.
So I tried this:
page.request.post(url='someurl', data={'paramname':value})
but it doesn't navigate; it just gives me the response.
I don't think I fully understand route either. I thought I could set the method and post data using route.continue_(method="POST", post_data={'k':somevalue}),
but that does not seem to affect the page.goto function. (When I print page.request.post_data, though, it prints the dictionary that I set in route.continue_.)
I'm stuck at this point.
Is there a way to navigate pages with post data in playwright-python?

Extract value from javascript object in site using xpath and import.io

I want to extract a number provided by a JavaScript object on a site, but I really don't understand what I am doing.
I tried different versions based on similar examples and guidelines from the import.io site and other tutorial sites, but I got only one of two results: either all the numbers on the given page were extracted, or nothing at all.
I tried e.g. //[contains(.,"Unikālo apmeklējumu skaits:")]@type ; //[contains(.,"Unikālo apmeklējumu skaits:")] . Most likely something else needs to be added, but I don't know what.
The link I want to extract from is: https://www.ss.lv/msg/lv/clothes-footwear/womens-clothes/trousers/ikcbb.html and the information I need is the number after the text "Unikālo apmeklējumu skaits:" (Latvian for "number of unique visits"), which is supplied by JavaScript.
Hopefully someone will be able to help me with this problem.

For someone who is new to web scraping this is a hard task, so I'll try to explain it. First of all, the XPath to get to that location could be something like this:
'//td[@class="msg_footer" and contains(text(), "Unik")]'
Now you have that tag (and what it contains), but if you check, it doesn't contain the number you need. That content is loaded dynamically by JavaScript, and the script responsible is this one:
<script type="text/javascript"><!--
var ss_w='rādīt numuru';
document.write( '<scr'+'ipt id="contacts_js" src="/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t='+new Date()+'"></scr'+'ipt>' );
--></script>
which can be extracted from the response with this XPath:
'//script[contains(text(), "contacts_js")]/text()'
From that string, you should reconstruct the URL that appears in src, for example:
/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t=
and append the current date, since the JavaScript builds it with new Date(). Then make a request to that URL (prefixed with the domain of the previous response), so something like:
https://www.ss.lv/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t=Wed%20Oct%2028%202015%2020:56:42%20GMT-0500%20(PET)
Note that the date is URL-encoded. The request should return a response like:
var PHONE_CNT=-1;var PHONE_CNT2=-1;var PHONE_CNT3=-1;var EMAIL_CNT=-1;var SHOW_CNT=22;var PH_c="";var PH_1=0;var PH_2=0;var PH_3=0;
pcc_id=0;PH_1=gpzd("JTg3aCU3QyU1QnolN0MlN0JYcWh6JTVCdCU5NSU4QyU5MnV4ayU5QXElN0IlOTQlNUNweiU5MGtvJTdCJThFJTVF","55937369");
where you can check that the value inside SHOW_CNT is the number you want.
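The steps above can be sketched in Python using only the standard library. This is a rough illustration, not a tested scraper for the site: the function names are made up, the date string only approximates what JavaScript's new Date() produces (the trailing time zone name such as "(PET)" is omitted), and whether the server actually checks the t parameter is an assumption.

```python
import re
from datetime import datetime, timedelta, timezone
from urllib.parse import quote, urljoin


def counter_script_url(page_url, script_text, now):
    """Build the absolute URL of the dynamically loaded script from the inline JavaScript."""
    match = re.search(r'src="([^"]+\?t=)', script_text)
    if match is None:
        raise ValueError("script src not found")
    # Approximation of JavaScript's `new Date()` string, URL-encoded (colons left as-is).
    stamp = now.strftime("%a %b %d %Y %H:%M:%S GMT%z")
    return urljoin(page_url, match.group(1)) + quote(stamp, safe=":")


def show_count(js_body):
    """Extract the integer assigned to SHOW_CNT in the returned JavaScript."""
    match = re.search(r"\bSHOW_CNT=(-?\d+);", js_body)
    if match is None:
        raise ValueError("SHOW_CNT not found")
    return int(match.group(1))


# Inline script text and response body copied from the answer above.
script_text = """document.write( '<scr'+'ipt id="contacts_js" src="/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t='+new Date()+'"></scr'+'ipt>' );"""
page_url = "https://www.ss.lv/msg/lv/clothes-footwear/womens-clothes/trousers/ikcbb.html"
when = datetime(2015, 10, 28, 20, 56, 42, tzinfo=timezone(timedelta(hours=-5)))

url = counter_script_url(page_url, script_text, when)
js_body = 'var PHONE_CNT=-1;var PHONE_CNT2=-1;var PHONE_CNT3=-1;var EMAIL_CNT=-1;var SHOW_CNT=22;var PH_c="";'
print(url)
print(show_count(js_body))
```

In a real run, the text returned by fetching `url` would be passed to `show_count` in place of the pasted `js_body`.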
If you want to know how I figured out which request and which script populate that tag: I used Firebug to search for SHOW_CNT in all of the responses triggered by loading your URL, which pointed to the request I described, and then checked what was issuing that request.
Hope it helped.

support@import.io are the people to speak to; they give free advice and help troubleshoot problems just like this all the time.
There are all kinds of tips and tricks you can use. For example, import.io provides an (undocumented, beta) JavaScript pre-render service that would likely work for you in this scenario. API publish failures are sometimes caused by timeouts while waiting for sites to render JS; this would fix that.
http://support.import.io/knowledgebase/articles/623235-infinite-scroll-and-javascript-prerender-beta
I hope this helps.

How do I search then parse results on a webpage with Ruby?

How would you use Ruby to open a website, run a search in the search field, and then parse the results? For example, entering something into a search engine and then parsing the results page. I know how to use Nokogiri to find the webpage and open it. I am lost on how to enter text into the search field and move forward to the results. Also, on the page that I am actually searching I have to click an Enter button; I can't simply press the Enter key to move forward. Thank you so much for your help.

Use Mechanize - a library used for automating interaction with websites.

Something like Mechanize will work, but interacting with the front-end UI code is always going to be slower and more problematic than making requests directly against the back end.
Your best bet would be to look at the request that is being made to the server (probably an HTTP GET or POST request with some associated params). You can do this with Firebug, or with Fiddler 2 on Windows. Then, once you know the parameters that the server will accept, just make the request yourself.
For example, if you were doing this with the duckduckgo.com search engine, you could either get Mechanize to go to duckduckgo.com, enter text into the search box, and click submit, or you could just create a GET request to http://www.duckduckgo.com?q=search_term_here.
You can use Mechanize for something like this, but it might be overkill. I would take a look at RestClient, especially if you don't need to manage cookies.
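To illustrate the direct-request approach, here is a minimal sketch of building such a GET URL (shown in Python rather than Ruby; the helper name is hypothetical). The point is only that the search term must be URL-encoded before it goes into the q parameter.

```python
from urllib.parse import urlencode


def search_url(base, term):
    """Return a GET URL with the search term URL-encoded in the `q` parameter."""
    return base + "?" + urlencode({"q": term})


url = search_url("http://www.duckduckgo.com", "ruby & mechanize")
print(url)
```

The resulting URL can then be fetched with any HTTP client and the returned HTML handed to a parser.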
Edit:
If you can determine the specific URL that the form submits to, say for example 'example.com/search', and you know the request is a POST (which it usually is if you are submitting a form), you could construct something like this with Mechanize:
require 'mechanize'

agent = Mechanize.new
# string_to_search_for holds the text you would have typed into the search field
agent.post 'http://example.com/search', {
  "_id0:Number" => string_to_search_for,
  "_id0:submitButton" => "Enter"
}
Notice how the 'name' attribute of a form element becomes a key in the POST data and the 'value' attribute becomes the value. A text 'input' element gets its value directly from the text you would have entered. This gets transformed into a request and submitted to the server when you push the submit button (of course, in this case you are making the request directly). The result of the POST should be some HTML that you can parse for the info you need.

Scraping pages with asynchronous responses with Hpricot

I'm trying to scrape a page but the initial response has nothing in the body as the content is pumped in asynchronously, e.g. the results from a search on the apple website: http://www.apple.com/uk/search/?q=searching+for+something&sec=global
Any ideas on how I can successfully grab the results from the search with hpricot?
Thanks.

When the search page you refer to is loaded, it makes a request via javascript/ajax to some other location, then populates the search results. This is what you're seeing in the page. Hpricot itself can't help you here because it has no way to interpret the javascript that comes with the page in order to fetch the actual search results list.
Now, if what you're interested in are the search results, you'd need to analyze a bit what happens when you enter that page and type a search query. Some javascript in the page takes your query, and calls (via XMLHttpRequest or similar, AJAX techniques) some other script in Apple's server. This is the one that actually does the search in a database and returns the result.
I suggest you install Firefox with the Firebug plugin, or find some other way of seeing the actual requests a page and its JavaScript components send and receive. You'll see that, for the search page you referred to, it fetches two parts. First, the "featured" results that come from this URL:
http://www.apple.com/global/scripts/search_featured.php?q=mac+mini&section=global&geo=uk
Notice the search string is in the "q" parameter.
Second, a long results list comes from here:
http://www.apple.com/search/service/nph-search10?site=uk_www&filter=1&snum=50&q=mac+mini
Both of these are XML documents; you might have better luck parsing these URLs with Hpricot.

GET vs POST in Ajax

What is the difference between GET and POST for Ajax requests?
I don't see any difference between the two, except that when I use GET the parameters are sent in the URL, which for me doesn't really make any difference, since all requests are made in the background and the user doesn't notice any difference.
edit:
What are PUT and DELETE methods used for?

GET is designed for getting data from the server. POST (and lesser-known friends PUT and DELETE) are designed for modifying data on the server.
A GET request should never cause data to be removed from an application. If you have a link you can click on with a GET to remove data, then Google spidering your site could click on all your "Delete" links.
The canonical answer can be found here, which quotes the HTML 2.0 spec:
If the processing of a form is idempotent (i.e. it has no lasting observable effect on the state of the world), then the form method should be GET. Many database searches have no visible side-effects and make ideal applications of query forms.
If the service associated with the processing of a form has side effects (for example, modification of a database or subscription to a service), the method should be POST.
In your AJAX call, you need to use whatever method your server supports. You should always design your server so that operations that modify data are called by POST/PUT/DELETE. Other comments have links to REST, which generally maps C/R/U/D to "POST or PUT"(Create)/GET(Read)/PUT(Update)/DELETE(Delete).

If you're sending large amounts of data, or sensitive data over HTTPS, you will want to use POST. If it's just a simple parameter, I would use GET.
GET requests have a limit to the amount of data that can be sent. I forget the exact number, but this can cause issues if you're sending anything substantial.
Basically, the difference between GET and POST is that in a GET request the parameters are passed in the URL, whereas in a POST the parameters are included in the message body.
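A small sketch of that difference, using Python's standard library to build (but not send) the two kinds of request. The endpoint http://example.com/item and the id parameter are placeholders chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = urlencode({"id": "123"})  # -> "id=123"

# GET: the parameters travel in the URL; there is no message body.
get_req = Request("http://example.com/item?" + params)

# POST: the URL stays clean; the same parameters travel in the message body.
post_req = Request("http://example.com/item", data=params.encode("ascii"), method="POST")

for req in (get_req, post_req):
    print(req.get_method(), req.full_url, req.data)
```

Because the GET parameters are part of the URL, they also end up wherever URLs are recorded, which the POST body does not.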

Whether it's AJAX or not is irrelevant. It's about the action that you're taking. I'd recommend following the principles of REST, which have further provisions for updating, deleting, etc.

GET requests are easier to exploit in CSRF (cross-site request forgery) attacks. Namely, fake POST requests require JavaScript to be enabled on the user's side, while fake GET requests are possible with just img or script tags.

Many web servers limit the length of the data that can be passed as part of the URL, so the GET request may break in odd ways that are hard to debug.
Also, most server software logs URLs in the access logs, so if you pass sensitive information (such as passwords) in a GET request, this will in all likelihood be written to disk in plaintext.
From a REST perspective, GET requests should have no side-effects -- they shouldn't modify data. So, if you're just GETting a resource by ID, this makes sense, but if you're committing changes to a resource, you should be using PUT, POST, or DELETE as the HTTP verb.

Both are used to send some data and receive a response based on that data.
GET: gets information stored on the server, e.g. a search, a tweet, or a person's information. If you send information with a GET request, it goes in the URL, as in process.php?name=subroto
So it basically sends information through the URL. A URL cannot be arbitrarily long (Internet Explorer, for example, caps it at 2083 characters), so sending something like a whole blog post this way is not possible.
POST: does the same thing as GET, but the data does not go through the URL. Typical uses are user registration, user login, sending large amounts of data, and blog posts.
If you need to send sensitive information or a large amount of data, use POST, as it does not go through the URL.
AJAX: $.get() and $.post() offer subsets of the features of $.ajax(), which has many more configuration options.
$.get() is a kind of shorthand for $.ajax(). When using $.get(), instead of passing in an object, you pass in arguments. At minimum, you'll need the first two arguments: the URL of the file you want to retrieve (e.g. 'test.txt') and a success callback.
Summary:
$.get( url [, data ] [, success ] [, dataType ] )
$.post( url [, data ] [, success ] [, dataType ] ) // for sending sensitive or large data
$.ajax( url [, settings ] ) // more configuration

First, general information: use GET if you only read data, and use POST if you change something in a database, text files, etc.
But the problem is that some browsers cache GET results. I had problems with AJAX requests in IE7, and eventually found out that the browser was caching GET results. I rethought the flow and changed my request to POST.
So, don't use GET if you don't want caching.
(Of course, you can disable caching for GET operations, but I preferred not to.)
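One common way to keep a GET request from being served from cache is to make each URL unique with a throwaway query parameter; jQuery's cache: false option does this by appending _={timestamp}. The sketch below shows the idea in Python with a hypothetical helper name; it is an illustration of the technique, not what the answer's author used.

```python
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_cache_buster(url, stamp=None):
    """Append a unique `_` query parameter so each GET URL differs from the last."""
    if stamp is None:
        stamp = int(time.time() * 1000)  # milliseconds since the epoch
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("_", str(stamp)))
    return urlunsplit(parts._replace(query=urlencode(query)))


busted = with_cache_buster("http://example.com/data?id=123", stamp=42)
print(busted)
```

Sending suitable Cache-Control headers from the server is the other usual option, and avoids changing the URL at all.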

As for me, I prefer POST. I reserve GET for cases where I know the value sent is limited to data I control, for example retrieving an item by its ID: "getitem?id=123", "deleteItem?id=123", ... For the other cases, when there is a form a user can fill in, I prefer POST.
As Ryan Smith said, it's better to use POST to send a large amount of data, and there are fewer worries about other languages or special characters (generally all major JavaScript frameworks shouldn't have any problem dealing with that, but I think POST is less of a worry).
As for the REST perspective, in my opinion you can adopt it for a new project (to stay consistent across the entire project).
Finally, some programs used on a network (URL loggers, e.g. to see whether employees are wasting time on unauthorised sites, proxies, ...) or other kinds of tools can intercept the query. Some will show in their reports the parameters you sent with GET, treating each as a different web page. Whether that is a problem for you varies from one project to another! ;)

The difference is the same between GET and POST whether you're using Ajax, HTML forms, or curl. Here are the relevant definitions:
GET
POST

If you are passing any arguments with characters that can get mangled in the URL (such as spaces), use POST. Otherwise you can use GET.
Generally, if you're just passing a few tiny arguments you would use GET. But for passing user-submitted information such as blog entries, text, etc., it's good practice to use POST.
There are also certain frameworks that rely completely on segment-based URLs (such as site.com/products/133 rather than site.com/products.php?id=333), and these frameworks unset the GET variables for security. In such cases you would use POST all the time.
