This sound like pretty silly question (even to me) but since this is all very far from my area of competence (if any...), I have to ask those who know more:
I was trying to help someone to scrape data from this web page. Using the Developer tab, I quickly found out that the page makes a get() request to
https://api-cdn.embed.ly/1/card-details
Usually, the target data is in the response, but it this case it was in the parameters themselves:
params = (
('card', '1'),
('key', 'fd92ebbc52fc43fb98f69e50e7893c13'),
('native', 'true'),
('scheme', 'https'),
('urls', 'https://www.youtube.com/watch?v=8f82p6Do9cc,https://youtu.be/ILUHWQqlGy0,https://youtu.be/lZPlvh55P1g,https://www.youtube.com/watch?v=8aDSmK_gLwE,https://youtu.be/vAnF-Hvlc0w,https://youtu.be/ZUI4m8cDFMQ,https://youtu.be/ILUHWQqlGy0'),
('v', 'MTcyMDEw'),
('youtube_showinfo', '0'),
)
So, it turns out, I don't need to capture the response, but I do need to (programmatically, not by copying from the developer tab) capture the request. What I can't figure out is where in the page (or anywhere else) are the request parameters located? For example, as can be seen above, one of the target parameters (urls) consists of a bunch of urls; these urls aren't in the page itself (that I can find). So, to generate the request, these must be read from somewhere - where is that somewhere located?
Again, sorry for the newbie-type question, but there must be something fundamental I'm missing.
Related
I'm developing a web app with two languages, German and English. I have implemented searching on my webpage, and I want to keep track of the user's locale when searching.
How can I achieve this:
http://localhost:8080/user/search?search=pax?lang=de
instead of:
http://localhost:8080/user/search?search=pax
In my form I have:
action="/user/search"
I tried
action="<spring:message code="user.search.movie.link"/>
user.search.movie.link = /user/search or /user/search?lang=de
but it doesn't work.
Putting information in URL parameters is good in some cases*, but probably not this one. It seems likely that a user chooses their language setting once, around login time, and then rarely if ever changes it. Or it might even be set automatically. If so, language is something you might want to store in the user's session, or a persistent store like a database if you're using one. You seem to be using Spring, and I don't know a lot about their session handling, but their docs are at https://docs.spring.io/spring-session/docs/current/reference/html5/.
*: for more on this, you might want to read up on the differences between GET requests and POST requests (here's one of many SO posts on the topic). The most relevant part for you is that GETs are the ones that have visible parameters in the URL, but there are lots of other reasons to use one over the other.
I want extract a number provided by javascript object in site, but I really don't understand that I am doing.
I tried different versions using alike examples and guidelines in import.io site and other tutorial sites, but I got only 1 of two results: extracted all numbers on given page or nothing at all.
I tried e.g. //[contains(.,"Unikālo apmeklējumu skaits:")]#type ; //[contains(.,"Unikālo apmeklējumu skaits:")] . Most likely it's necessary to add there something else, but I just don't know that.
Link I am interested in to extract from is: https://www.ss.lv/msg/lv/clothes-footwear/womens-clothes/trousers/ikcbb.html and information necessary is a number after text "Unikālo apmeklējumu skaits:" which is given by javascript.
Hopefully someone will be able to help me with this problem.
For someone who is new in web-scraping this should be a hard task, I'll ty to explain it. First of all, the xpath to get to that location could be something like this:
'//td[#class="msg_footer" and contains(text(), "Unik")]'
Now you have that tag (and what it contains), but if you check it doesn't contain the number you need, that content is being dynamically loaded with a javascript, and the javascript is this one:
<script type="text/javascript"><!--
var ss_w='rādīt numuru';
document.write( '<scr'+'ipt id="contacts_js" src="/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t='+new Date()+'"></scr'+'ipt>' );
--></script>
which could be gotten from the response with this xpath:
'//script[contains(text(), "contacts_js")]/text()'
from that string, you should replicate the url that comes in src, so this url for example:
/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t=
and add to the end the current date, as javascript creates it with new Date(). Then you should make a request to that url (adding the previous response domain), so something like:
https://www.ss.lv/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t=Wed%20Oct%2028%202015%2020:56:42%20GMT-0500%20(PET)
check that the date is urlencoded. it should return a response like:
var PHONE_CNT=-1;var PHONE_CNT2=-1;var PHONE_CNT3=-1;var EMAIL_CNT=-1;var SHOW_CNT=22;var PH_c="";var PH_1=0;var PH_2=0;var PH_3=0;
pcc_id=0;PH_1=gpzd("JTg3aCU3QyU1QnolN0MlN0JYcWh6JTVCdCU5NSU4QyU5MnV4ayU5QXElN0IlOTQlNUNweiU5MGtvJTdCJThFJTVF","55937369");
where you can check that the value inside SHOW_CNT is the number you want.
If you want to know how I figured out which request and which script was populating that response tag, well that I did using firebug, searching for SHOW_CNT inside all of the responses that involve calling to your URL, which pointed to the request I specified, and then trying to check who was requesting that.
Hope it helped.
support#import.io are the guys to speak to, they give free advice and help trouble shoot problems just like this all the time.
There are all kinds of tips and tricks you can use... for example import.io provide (an undocumented beta) JavaScript Pre-render service that would likely work for you in this scenario. API publish failures are sometimes caused by timeouts while waiting for sites to render JS, this would fix that.
http://support.import.io/knowledgebase/articles/623235-infinite-scroll-and-javascript-prerender-beta
I hope this helps.
I need help understanding AJAX. I am going through the tutorial on W3C schools ( creating a button that opens text file on the server and displays the result in a div)
One part of the tutorials seems abstract to me, without sufficient explanation. I am sure its a pre-requisite that I have missed or not aware of, detailing below
To avoid getting a cached result in response to an XMLHttpRequest made to the server, the tutorial says one needs to ADD A UNIQUE ID to the URL parameter in the XMLHttp Open Method which has been done (in the tutorial) by adding a ?, another character (t) and an = after the file extention followed by joining a random number to the URL (using Math.random()). See code below.
A simple GET request would be like:
xmlhttp.open("GET","demo_get.asp,true); \\I can understand this
Unique ID added to URL
xmlhttp.open("GET","demo_get.asp?t=" + Math.random(),true); \\ I can't undersatnd this
'?' , 't' & a random number generator added to demo_get.asp - Why T, why not P Q R Z ?? Why "?" after .asp
Should the compiler not go bonkers and report an error if arbitary characters are added to the file location. How is the part of the URL after the file extention handled as in this case ?t= + Math.random()
This has been a case of much agony and frustration for the last 3 days cause I don't get which part of JS i have missed here, what do you call this concept and where can I read it ??
This apart, specifying message headers while sending data - What are HTTP headers and what do they mean. How do I decide what the parameters of the setRequestHeader() method shall be ?
Please help. Rest of Ajax is clear to me.
(I haven't read on the second part - the message headers. I have asked that query here to avoiding posting another question later, just in case it turns out to be as eluding and enigmatic as the UNIQUE ID concept - Apologies in case its a direct simple question I ought to read up myself)
Cache compares the requested URL with those present with it, if a unique id is added to the URL, it does not match and the browser treats it as a fresh GET request, which then is forwarded to the server. This is a standard way to bypass / disable browser caching.
Please refer this document for a better understanding of browser Caching.
See Page No 4, which explains the same thing as stated above.
http://www.f5.com/pdf/white-papers/browser-behavior-wp.pdf
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Why do people put code like “throw 1; <dont be evil>” and “for(;;);” in front of json responses?
I found this kind of syntax being used on Facebook for Ajax calls. I'm confused on the for (;;); part in the beginning of response. What is it used for?
This is the call and response:
GET http://0.131.channel.facebook.com/x/1476579705/51033089/false/p_1524926084=0
Response:
for (;;);{"t":"continue"}
I suspect the primary reason it's there is control. It forces you to retrieve the data via Ajax, not via JSON-P or similar (which uses script tags, and so would fail because that for loop is infinite), and thus ensures that the Same Origin Policy kicks in. This lets them control what documents can issue calls to the API — specifically, only documents that have the same origin as that API call, or ones that Facebook specifically grants access to via CORS (on browsers that support CORS). So you have to request the data via a mechanism where the browser will enforce the SOP, and you have to know about that preface and remove it before deserializing the data.
So yeah, it's about controlling (useful) access to that data.
Facebook has a ton of developers working internally on a lot of projects, and it is very common for someone to make a minor mistake; whether it be something as simple and serious as failing to escape data inserted into an HTML or SQL template or something as intricate and subtle as using eval (sometimes inefficient and arguably insecure) or JSON.parse (a compliant but not universally implemented extension) instead of a "known good" JSON decoder, it is important to figure out ways to easily enforce best practices on this developer population.
To face this challenge, Facebook has recently been going "all out" with internal projects designed to gracefully enforce these best practices, and to be honest the only explanation that truly makes sense for this specific case is just that: someone internally decided that all JSON parsing should go through a single implementation in their core library, and the best way to enforce that is for every single API response to get for(;;); automatically tacked on the front.
In so doing, a developer can't be "lazy": they will notice immediately if they use eval(), wonder what is up, and then realize their mistake and use the approved JSON API.
The other answers being provided seem to all fall into one of two categories:
misunderstanding JSONP, or
misunderstanding "JSON hijacking".
Those in the first category rely on the idea that an attacker can somehow make a request "using JSONP" to an API that doesn't support it. JSONP is a protocol that must be supported on both the server and the client: it requires the server to return something akin to myFunction({"t":"continue"}) such that the result is passed to a local function. You can't just "use JSONP" by accident.
Those in the second category are citing a very real vulnerability that has been described allowing a cross-site request forgery via tags to APIs that do not use JSONP (such as this one), allowing a form of "JSON hijacking". This is done by changing the Array/Object constructor, which allows one to access the information being returned from the server without a wrapping function.
However, that is simply not possible in this case: the reason it works at all is that a bare array (one possible result of many JSON APIs, such as the famous Gmail example) is a valid expression statement, which is not true of a bare object.
In fact, the syntax for objects defined by JSON (which includes quotation marks around the field names, as seen in this example) conflicts with the syntax for blocks, and therefore cannot be used at the top-level of a script.
js> {"t":"continue"}
typein:2: SyntaxError: invalid label:
typein:2: {"t":"continue"}
typein:2: ....^
For this example to be exploitable by way of Object() constructor remapping, it would require the API to have instead returned the object inside of a set of parentheses, making it valid JavaScript (but then not valid JSON).
js> ({"t":"continue"})
[object Object]
Now, it could be that this for(;;); prefix trick is only "accidentally" showing up in this example, and is in fact being returned by other internal Facebook APIs that are returning arrays; but in this case that should really be noted, as that would then be the "real" cause for why for(;;); is appearing in this specific snippet.
Well the for(;;); is an infinite loop (you can use Chrome's JavaScript console to run that code in a tab if you want, and then watch the CPU-usage in the task manager go through the roof until the browser kills the tab).
So I suspect that maybe it is being put there to frustrate anyone attempting to parse the response using eval or any other technique that executes the returned data.
To explain further, it used to be fairly commonplace to parse a bit of JSON-formatted data using JavaScript's eval() function, by doing something like:
var parsedJson = eval('(' + jsonString + ')');
...this is considered unsafe, however, as if for some reason your JSON-formatted data contains executable JavaScript code instead of (or in addition to) JSON-formatted data then that code will be executed by the eval(). This means that if you are talking with an untrusted server, or if someone compromises a trusted server, then they can run arbitrary code on your page.
Because of this, using things like eval() to parse JSON-formatted data is generally frowned upon, and the for(;;); statement in the Facebook JSON will prevent people from parsing the data that way. Anyone that tries will get an infinite loop. So essentially, it's like Facebook is trying to enforce that people work with its API in a way that doesn't leave them vulnerable to future exploits that try to hijack the Facebook API to use as a vector.
I'm a bit late and T.J. has basically solved the mystery, but I thought I'd share a great paper on this particular topic that has good examples and provides deeper insight into this mechanism.
These infinite loops are a countermeasure against "Javascript hijacking", a type of attack that gained public attention with an attack on Gmail that was published by Jeremiah Grossman.
The idea is as simple as beautiful: A lot of users tend to be logged in permanently in Gmail or Facebook. So what you do is you set up a site and in your malicious site's Javascript you override the object or array constructor:
function Object() {
//Make an Ajax request to your malicious site exposing the object data
}
then you include a <script> tag in that site such as
<script src="http://www.example.com/object.json"></script>
And finally you can read all about the JSON objects in your malicious server's logs.
As promised, the link to the paper.
This looks like a hack to prevent a CSRF attack. There are browser-specific ways to hook into object creation, so a malicious website could use do that first, and then have the following:
<script src="http://0.131.channel.facebook.com/x/1476579705/51033089/false/p_1524926084=0" />
If there weren't an infinite loop before the JSON, an object would be created, since JSON can be eval()ed as javascript, and the hooks would detect it and sniff the object members.
Now if you visit that site from a browser, while logged into Facebook, it can get at your data as if it were you, and then send it back to its own server via e.g., an AJAX or javascript post.
What is the difference between GET and POST for Ajax requests?
I don't see any difference between those two, except that when I use GET, the parameters are send in URL, which for me don't really make any difference, since all requests are made on background and user doesn't find any difference.
edit:
What are PUT and DELETE methods used for?
GET is designed for getting data from the server. POST (and lesser-known friends PUT and DELETE) are designed for modifying data on the server.
A GET request should never cause data to be removed from an application. If you have a link you can click on with a GET to remove data, then Google spidering your site could click on all your "Delete" links.
The canonical answer can be found here, which quotes the HTML 2.0 spec:
If the processing of a form is idempotent (i.e. it has no lasting
observable effect on the state of the
world), then the form method should be
GET. Many database searches have no
visible side-effects and make ideal
applications of query forms.
If the service associated with the processing of a form has side effects
(for example, modification of a
database or subscription to a
service), the method should be POST.
In your AJAX call, you need to use whatever method your server supports. You should always design your server so that operations that modify data are called by POST/PUT/DELETE. Other comments have links to REST, which generally maps C/R/U/D to "POST or PUT"(Create)/GET(Read)/PUT(Update)/DELETE(Delete).
If you're sending large amounts of data, or sensitive data over HTTPS, you will want to use POST. If it's just a simple parameter, I would use GET.
GET requests have a limit to the amount of data that can be sent. I forget the exact number, but this can cause issues if you're sending anything substantial.
Basically the difference between GET and POST is that in a GET request, the parameters are passed in the URL where as in a POST, the parameters are included in the message body.
Whether its AJAX or not is irrelevant. Its about the action that you're taking. I'd recommend following the principles of REST. Which have further provisions for updating, deleting, etc...
GET requests are easier to exploit in CSRF (cross site request forgery) attacks. Namely fake POST requests require Javascript to be enabled on the user side, while fake GET requests are still possible just with img, script tags.
Many web servers limit the length of the data that can be passed as part of the URL, so the GET request may break in odd ways that are hard to debug.
Also, most server software logs URLs in the access logs, so if you pass sensitive information (such as passwords) in a GET request, this will in all likelihood be written to disk in plaintext.
From a REST perspective, GET requests should have no side-effects -- they shouldn't modify data. So, if you're just GETting a resource by ID, this makes sense, but if you're committing changes to a resource, you should be using PUT, POST, or UPDATE for the http verb.
Both are used to send some data and receive some response using that data.
GET: Get information store in server. Ie. Search, tweet, Person Information. If you want to send information then get request send request using process.php?name=subroto
So it basically send information through url. Url cannot handle more than 2083 char. So for blog post can you remember it is not possible?
POST: Post do same thing as get. User registration, User login, Big data send, Blog Post.
If you need to send secure information then use post or for big data as it not go through url.
AJAX: $.get() and $.post() contain features that are subsets of $.ajax(). It has much configuration.
$.get () method, which is a kind of shorthand for $.Ajax (). When using $.get (), instead of passing in an object, you pass in arguments. At minimum, you’ll need the first two arguments, which are the URL of the file you want to retrieve (i.e. ‘test.txt’) and a success callback.
Summary:
$.get( url [, data ] [, success ] [, dataType ] )
$.post( url [, data ] [, success ] [, dataType ] ) // for sending secure or Large information
$.ajax( url [, settings ] ) // More Configaration
First, general information. Use GET if you only read data, use POST if you change something on database, txt files etc.
But the problem is, some browsers cache GET results. I had problems with AJAX requests in IE7, but at last I found out that browser caches GET results. I rethought the flow and changes my request to POST.
So, don't use GET if you don't want caching.
(Of course you can disable caching in GET operations. But I didn't prefer it)
About me, i prefer POST. I reserve get to the events i know the sent value is limited to data i have the "control", for example, to retreive an item with an id. Example, "getitem?id=123", "deleteImtem?id=123", ... For the other cases, when i have a form fillable by a user, i prefer POST.
Like Ryan Smith have said, it's better to use POST to send a large amount of data, and less wories in cases of the use in others language/special chars (generally all majors javascript framework should'nt have any problems to deal with that but i think is less wories to use POST).
For the REST perspective, in my opinion, you can use this with a new project (to keep a consistency with the entire project).
Finally, maybee some programs used in a network (URL loguers (ie.: to see if the employees lost their time on non-autorised sites, ...) proxys, ... ) or any other kind of tool can intercept the query. Somes will show in the reports the params you have sent with GET, considering it like a different web page. But in this situation, is could be not your problem it's changes from a project to an other! ;)
The difference is the same between GET and POST whether you're using Ajax, HTML forms, or curl. Here are the relevant definitions:
GET
POST
If you are passing on any arguments with characters that can get messed up in the URL (such as spaces), you use POST. Otherwise you can use GET.
Generally, if you're just passing on a few tiny arguments you would use GET. But for passing on user submitted information such as blog entries, text, etc, its a good practice to use POST.
There are also certain frameworks that rely completely on segment based urls (such as site.com/products/133 rather than site.com/products.php?id=333 and these frameworks unset the GET variables for security. In such cases you would use POST allt the time.