Crawling the Google Play store - google-apps-marketplace

I am crawling the Google Play store. I use Firefox + Firebug to review the requests and responses, but there is one parameter I don't understand.
For example: the URL ""
When loading the next page, it POSTs a parameter pagTok whose value is "EgIIKA==:S:ANO1ljJ4wWQ"
I don't know where this value comes from. Can anyone help?

Investigation
Since Google recently changed their paging logic so that it now requires a token, I found myself investigating how to either generate those tokens manually or scrape them out of the HTML returned with each response. So, let's get our hands dirty.
Using Fiddler2, I was able to isolate some token samples by looking at the requests issued for each "paging" of the Play Store.
Here's the whole request:
POST https://play.google.com/store/search?q=a&c=apps HTTP/1.1
Host: play.google.com
Connection: keep-alive
Content-Length: 123
Origin: https://play.google.com
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36
Content-Type: application/x-www-form-urlencoded;charset=UTF-8
Accept: */*
X-Client-Data: CIe2yQEIpLbJAQiptskBCMG2yQEInobKAQjuiMoBCImSygE=
Referer: https://play.google.com/store/search?q=a&c=apps
Accept-Encoding: gzip, deflate
Accept-Language: pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4,es;q=0.2
** Post Body **
start=0&num=0&numChildren=0&pagTok=GAEiAggU%3AS%3AANO1ljLtUJw&ipf=1&xhr=1&token=bH2MlNeViIRJA8dT-zhaKrfNH7Q%3A1420660393029
Now that we know what the request looks like, the next step is to track more requests and try to isolate the token-formation logic.
Here are 3 request tokens I could find:
"GAEiAggU%3AS%3AANO1ljLtUJw", "GAEiAggo%3AS%3AANO1ljIeRQQ", "GAEiAgg8%3AS%3AANO1ljIM1CI"
Finding the Patterns
One thing our brains are really good at is finding patterns. Here's what mine found about the token formation:
1 - Starts with: "GAEiAg"
2 - Followed by: two variable characters
3 - Followed by: "%3AS%3A" (the URL-encoded form of ":S:")
4 - Followed by: 11 variable characters
Browser Javascript tricks x Manual HTTP Requests
Doing the same request in a browser will, most of the time, not yield the same results as manually issuing an HTTP request from code. Why? Because of JavaScript.
Google is a heavy JS user, so it will use its own tricks to try to fool you.
If you look at the HTML, you will see no token that matches the pattern described above, instead, you will find something like:
u0026c\\u003dapps\42,\42GAEiAghQ:S:ANO1ljLxWBY\42,\0420\42,\0420\42,\0420\42]\n
If you look carefully, you will see that your token is within this "random string". All you have to do is replace ":S:" with its URL-encoded form, "%3AS%3A".
Regular Expressions For the Win
If you apply a regex to the page, you will be able to find the token, and then manually replace the ":S:" string with "%3AS%3A".
Here's the one I ended up using (powered by an online regex builder):
Generated regular expression:
/GAEi+.+:S:.{11}\42/
Textual meaning of regular expression:
Match a string which contains the string GAE
followed by the character i 1 or more times
followed by any character 1 or more times
followed by the string :S:
followed by any character exactly 11 times
followed by the string \42
TL;DR
The token is present in the HTML, but it is "masked" by Google, which "unmasks" it using JavaScript (which only runs if you use a browser engine such as Selenium or similar).
To fetch the pagTok of the next page: read the current page's HTML, scrape the token out of it (logic above), use it in the next request, and repeat.
I hope this helps. Sorry about the wall of text; I wanted to be as clear as possible.
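To make the scraping step concrete, here is a minimal Python sketch. The HTML sample and variable names are illustrative, and the regex is a slight variation of the one above (it excludes backslashes instead of matching the trailing \42):

```python
import re

# Illustrative fragment of the escaped HTML described above; the token sits
# between \42 sequences (octal escapes for the '"' character).
html = r'u0026c\\u003dapps\42,\42GAEiAghQ:S:ANO1ljLxWBY\42,\0420\42'

# "GAEi" + some characters + ":S:" + exactly 11 characters
match = re.search(r'(GAEi[^\\]+:S:.{11})', html)
if match:
    # URL-encode the colons before reusing the token in the next POST body
    pag_tok = match.group(1).replace(':S:', '%3AS%3A')
```

With this sample, pag_tok comes out as GAEiAghQ%3AS%3AANO1ljLxWBY, ready to be sent as the pagTok form field of the next request.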

Related

How can I get request headers in their original format from Rack?

I'm trying to get request headers in their original format from Rack using Ruby, but haven't been able to figure it out yet. The hash I get back from request.env isn't what I want. In that hash, the header keys are upcased and have underscores instead of dashes, like so:
"CONTENT_TYPE"=>"application/json; charset=utf-8"
What I want is the headers before they get processed. I'm looking for:
"Content-Type"=>"application/json; charset=utf-8"
I can easily enough loop through request.env looking for headers that start with HTTP_, split them, capitalize each word, and gsub underscores into dashes to get them into the format I want. But it becomes trickier to retain the original format this way when dealing with headers such as:
"X-BT-RequestId"
I feel that I ought to be able to get at the pre-processed headers somehow.
I'm writing a HTTP listener that will wrap a request and forward it on to another service and I want to preserve headers in their original format. I know headers are supposed to be case insensitive, but if I can forward them in their original format, I can hopefully prevent case-sensitive issues later on when my database users are querying for values based on these headers.
Any ideas?
You can get the raw headers in webrick/httpserver.rb from the raw_header instance variable of WEBrick::HTTPRequest:
p req.instance_variable_get("@raw_header")
si.service(req, res)
You can also get it from inside the service method in handler/webrick.rb.

Disabling visually ambiguous characters in Google URL-shortener output

Is there a way to tell the Google URL shortener (programmatically, via their API) not to produce short URLs with characters like:
0 O
1 l
Because people often make mistakes when reading those characters from a display and typing them elsewhere.
You cannot request the API to use a custom charset, so no.
Not a proper solution, but you could check the URL for unwanted characters and request another short URL for the same long URL until you get one you like. The Google URL shortener issues a unique short URL for an already-shortened URL if you provide an OAuth token with the request. However, I am not sure whether a user is limited to one unique short URL per specific long URL, in which case this won't work either.
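The check in that retry loop could look something like this Python sketch (the character set and function name are my own, chosen for illustration):

```python
AMBIGUOUS = set("0O1lI")  # digits/letters that are easy to confuse

def has_ambiguous_chars(short_url):
    """Return True if the short URL's slug contains ambiguous characters."""
    slug = short_url.rsplit("/", 1)[-1]  # check only the slug, not the domain
    return any(ch in AMBIGUOUS for ch in slug)

# Keep requesting new short URLs until one passes this check.
```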
Since you're doing it programmatically, you could swap out those chars for their percent-encoded values, '%6F' for the letter o, for instance. In this case, just warn the users that, when in doubt, it's a numeral.
Alternatively, use a font that distinguishes ambiguous chars, or better yet, color-code them (or underline numerals, or use some other visual mark).

ExpressionEngine template will not output empty JSON array

I'm creating JSON in an ExpressionEngine template and pointing the Ruby JSON library at the relevant URL. The template looks like this:
[
{exp:mylib:mytag channel="mychannel" backspace="1"}
{"entry_id":"{entry_id}","title":"{title}"},
{/exp:mylib:mytag}
]
When the tag returns data, everything is fine; my Ruby code works perfectly with the array of objects. However, when the tag returns no data (because there are no matching entries), Ruby complains that the JSON string is not the required 2 octets in length. I would expect the output to be [], i.e. an empty but valid JSON array. However, visiting the URL in Firefox/Firebug and wget confirms that the response coming back from the URL is zero bytes long, with status 200 OK.
I tested further by creating a template without tags and just a pair of empty square brackets, with the same result: zero bytes.
Is a pair of empty square brackets somehow a reserved token in the EE template language? Is there some clever optimisation going on that assumes no-one could ever want a pair of square brackets in an HTML page?
Are you developing your own add-on, or using the built-in ExpressionEngine tags?
Using the native channel entries tag, you can use an {if no_results} conditional to control what gets output when there are no matching results:
{exp:channel:entries channel="channel_name"}
{if no_results} ...{/if}
{/exp:channel:entries}
Many third-party add-ons also support the same type of {if no_results} conditional.
You might also have a look at the third-party ExpressionEngine JSON add-on, which may be able to give you some inspiration on how to approach your situation.

Handling difference between strings returned in history change handler

I've got an app that receives URLs after the # sign and responds to them with a History ValueChangeHandler. Serious problem: the URLs are escaped differently in different browsers.
For example, when I go to #riley%2Blark%40gmail.com, Chrome sends my ValueChangeHandler riley%2Blark%40gmail.com while Firefox sends riley+lark@gmail.com. This is a terrible difference if I want to run URL.decodeQueryString on them, because I'll end up with an extra space in Firefox.
How can I handle this, short of writing separate implementations for different browsers?
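For context, the asymmetry can be reproduced outside the browser with Python's urllib, assuming URL.decodeQueryString behaves like unquote_plus (decoding %XX escapes and treating '+' as a space):

```python
from urllib.parse import unquote_plus

chrome_hash = "riley%2Blark%40gmail.com"  # Chrome: still percent-encoded
firefox_hash = "riley+lark@gmail.com"     # Firefox: already decoded

decoded_chrome = unquote_plus(chrome_hash)    # "riley+lark@gmail.com"
decoded_firefox = unquote_plus(firefox_hash)  # "riley lark@gmail.com" - extra space!
```

Decoding the Firefox value a second time turns the literal '+' into a space, which is exactly the extra-space problem described above.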
I can think of two possible solutions:
You could try adding another parameter to the token, so that the token takes the form #riley%2Blark%40gmail.com/%2B-a-space. On receiving the token, check the second part of the token: if it contains a %2B, urldecode the token; else replace '+' with
You could also try using Location.hash through JSNI. I reckon the results ought to be uniform.

MIME RFC "Content-Type" parameter confusion? Unclear RFC specification

I'm trying to implement a basic MIME parser for multipart/related content in C++/Qt.
So far I've been writing some basic parser code for headers, and I'm reading the RFCs to get an idea of how to do everything as close to the specification as possible. Unfortunately, there is a part in the RFC that confuses me a bit:
From RFC 822 Section 3.1.1:
Each header field can be viewed as a single, logical line of
ASCII characters, comprising a field-name and a field-body.
For convenience, the field-body portion of this conceptual
entity can be split into a multiple-line representation; this
is called "folding". The general rule is that wherever there
may be linear-white-space (NOT simply LWSP-chars), a CRLF
immediately followed by AT LEAST one LWSP-char may instead be
inserted. Thus, the single line
Alright, so I simply parse a header field, and if a CRLF followed by linear whitespace comes next, I concatenate the continuation lines to recover a single logical header line. Let's proceed...
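That unfolding step can be sketched in a couple of lines (Python here for brevity; the regex assumes CRLF line endings, as the RFC specifies):

```python
import re

def unfold_headers(raw):
    """Remove each CRLF that is immediately followed by whitespace (RFC 822 folding)."""
    return re.sub(r'\r\n(?=[ \t])', '', raw)
```

The lookahead keeps the leading whitespace of the continuation line, so the folded parts stay separated by at least one space or tab.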
From RFC2045 Section 5.1:
In the Augmented BNF notation of RFC 822, a Content-Type header field
value is defined as follows:
content := "Content-Type" ":" type "/" subtype
*(";" parameter)
; Matching of media type and subtype
; is ALWAYS case-insensitive.
[...]
parameter := attribute "=" value
attribute := token
; Matching of attributes
; is ALWAYS case-insensitive.
value := token / quoted-string
token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
or tspecials>
Okay. So it seems if you want to specify a Content-Type header with parameters, simply do it like this:
Content-Type: multipart/related; foo=bar; something=else
... and a folded version of the same header would look like this:
Content-Type: multipart/related;
foo=bar;
something=else
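For comparison, Python's standard email package parses such a header along exactly these grammar rules (semicolon-delimited parameters, case-insensitive type):

```python
from email.message import Message

# Build a message carrying the example Content-Type header from above
msg = Message()
msg['Content-Type'] = 'multipart/related; foo=bar; something=else'

content_type = msg.get_content_type()        # 'multipart/related'
params = dict(msg.get_params()[1:])          # {'foo': 'bar', 'something': 'else'}
```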
Correct? Good. As I kept reading the RFCs, I came across the following in RFC 2387 Section 5.1 (Examples):
Content-Type: Multipart/Related; boundary=example-1
start="<950120.aaCC@XIson.com>";
type="Application/X-FixedRecord"
start-info="-o ps"
--example-1
Content-Type: Application/X-FixedRecord
Content-ID: <950120.aaCC@XIson.com>
[data]
--example-1
Content-Type: Application/octet-stream
Content-Description: The fixed length records
Content-Transfer-Encoding: base64
Content-ID: <950120.aaCB@XIson.com>
[data]
--example-1--
Hmm, this is odd. Do you see the Content-Type header? It has a number of parameters, but not all of them are delimited by a ";".
Maybe I just didn't read the RFCs correctly, but if my parser works strictly as the specification defines, the type and start-info parameters would end up in a single string or, worse, cause a parser error.
Guys, what's your take on this? Just a typo in the RFC? Or did I miss something?
Thanks!
Thanks!
It is a typo in the example. Parameters must always be delimited with semicolons, even when folded. Folding is not meant to change the semantics of a header, only to allow for readability and to accommodate systems with line-length restrictions.
Quite possibly a typo, but in general (and from experience) you should be able to handle this kind of thing "in the wild" as well. In particular, mail clients vary wildly in their ability to generate valid messages and follow all of the relevant specifications (if anything, it's even worse in the email/SMTP world than it is in the WWW world!).
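A tolerant parameter scanner in that spirit can match attribute=value pairs directly instead of splitting on semicolons, so a missing ";" does not break parsing. A rough Python sketch (the regex and helper name are my own, and it ignores edge cases like escaped quotes inside quoted strings):

```python
import re

# attribute=token or attribute="quoted-string", regardless of the delimiter
PARAM = re.compile(r'([A-Za-z][\w-]*)=("[^"]*"|[^;\s]+)')

def parse_params(value):
    """Scan Content-Type parameters leniently (tolerates missing semicolons)."""
    return {k.lower(): v.strip('"') for k, v in PARAM.findall(value)}
```

Applied to the buggy header from the RFC 2387 example, this still recovers boundary, start, type, and start-info as separate parameters.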

Resources