Can I make a Wikidata request faster using AJAX/PHP?

I am wondering if it's possible to make my requests faster.
I have read that people recommend improving the PHP or JS code, shortening queries, etc.
Currently I am requesting all of the data below in order to get the enwiki URL and the image name.
URL:
https://www.wikidata.org/wiki/Special:EntityData/Q42.json
After fetching this data, my output filters the response. Would it be better to create a separate PHP file for each request, to reduce the amount of data returned by the query?
OUTPUT:
$output['image'] = $decode['entities'][$_REQUEST['id']]['claims']['P18'][0]['mainsnak']['datavalue']['value'];
$output['url'] = $decode['entities'][$_REQUEST['id']]['sitelinks']['enwiki']['url'];
With the image name I am getting the image using:
AJAX/JS:
https://upload.wikimedia.org/wikipedia/commons/${a}/${ab}/${imageName}
The average time to finish the request is about 30-40 seconds. Can I make it faster?
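One likely win is to request only the slices of the entity you actually need, and to cache the result so repeat lookups skip the network entirely. Below is a minimal PHP sketch using the wbgetentities API module with its props and sitefilter parameters (these are real MediaWiki API parameters; the cache path and one-hour TTL are hypothetical choices, and this is only a sketch, not a drop-in replacement):
<?php
// Sketch: fetch only claims and sitelink URLs for one entity, with a naive file cache.
$id    = preg_replace('/[^Q0-9]/', '', $_REQUEST['id']); // sanitize the entity ID
$cache = "/tmp/wd_$id.json";                             // hypothetical cache location

if (is_file($cache) && time() - filemtime($cache) < 3600) {
    $json = file_get_contents($cache);                   // serve from cache for 1 hour
} else {
    // Unlike Special:EntityData, wbgetentities can return just the parts you ask for.
    $url = 'https://www.wikidata.org/w/api.php?action=wbgetentities'
         . '&ids=' . $id
         . '&props=claims|sitelinks/urls'                // skip labels, descriptions, aliases
         . '&sitefilter=enwiki'                          // only the English Wikipedia sitelink
         . '&format=json';
    $json = file_get_contents($url);
    file_put_contents($cache, $json);
}

$decode = json_decode($json, true);
$output['image'] = $decode['entities'][$id]['claims']['P18'][0]['mainsnak']['datavalue']['value'];
$output['url']   = $decode['entities'][$id]['sitelinks']['enwiki']['url'];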

Related

AJAX request aborts on large query string in Elixir Plug

I am sending two large query strings in AJAX requests, which are basically Base64 encodings of JPEGs. When the camera is not a high-resolution one, the AJAX request doesn't abort.
At first I thought it was an Nginx issue, because I was getting a request entity too large error. I resolved that, then made changes to my Plug as follows:
plug Plug.Parsers,
  parsers: [
    :urlencoded,
    {:multipart, length: 20_000_000},
    :json
  ],
  pass: ["*/*"],
  query_string_length: 1_000_000,
  json_decoder: Poison
After defining query_string_length, I no longer get errors like the one above, but the AJAX request still aborts.
The Base64-encoded string is 546,591 bytes at most.
I have also tried increasing the AJAX request timeout to a very large timespan, but it still fails, and I don't have any clue where the problem is right now.
How can we receive long strings in Plug?
A few answers on Stack Overflow about this issue, where people used AJAX and PHP, suggest changing post_max_size. How can we do that in Elixir Plug?
Since you are sending the AJAX request with JSON data, you should set the length option for the :json parser in the plug:
plug Plug.Parsers,
  parsers: [
    :urlencoded,
    {:multipart, length: 20_000_000},
    {:json, length: 80_000_000}
  ],
  pass: ["*/*"],
  json_decoder: Poison
I suppose you are not putting the data in the query string of the POST, so query_string_length (the maximum allowed size for query strings) is not needed.
---Original answer---
This applies to Plug versions around 1.4.3, which have no query_string_length option.
When you post the data as a string, you are using Plug.Parsers.
If you are willing to process larger requests, please give a :length
to Plug.Parsers.
You should change query_string_length: 1_000_000 to length: 20_000_000.

Array index out of bounds exception while downloading an Elasticsearch index

I am trying to download a complete Elasticsearch index using:
curl -o output_filename -m 600 -XGET 'http://ip/index/_search?q=*&size=7000000'
But it's giving this error:
{"error":"ArrayIndexOutOfBoundsException[-131072]","status":500}
How can I download the complete index data?
The scroll API is what you're looking for, which supports proper pagination:
Scrolling is not intended for real time user requests, but rather for processing large amounts of data
It's the same /_search endpoint, but it additionally gets passed the ?scroll=<timeout> parameter.
Be sure you understand what the timeout in e.g. scroll=1m means: it keeps your scroll context alive until you request the next batch/page.
Use the scroll_id from each response to request the next batch/page.
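That prose translates to a small loop: open the scroll with the first search request, then keep posting the returned scroll_id to /_search/scroll until no hits come back. Here is a minimal PHP sketch (the host http://ip and index name index are taken from the question; the 1000-document batch size is an arbitrary choice):
<?php
// Sketch: page through an entire index with the scroll API.
$base = 'http://ip'; // as in the question; usually host:9200
$body = json_encode(['size' => 1000, 'query' => ['match_all' => new stdClass()]]);
$resp = json_decode(es_post("$base/index/_search?scroll=1m", $body), true);

while (!empty($resp['hits']['hits'])) {
    foreach ($resp['hits']['hits'] as $hit) {
        fwrite(STDOUT, json_encode($hit['_source']) . "\n"); // dump each document
    }
    // Ask for the next page; this also renews the 1m keep-alive.
    $next = json_encode(['scroll' => '1m', 'scroll_id' => $resp['_scroll_id']]);
    $resp = json_decode(es_post("$base/_search/scroll", $next), true);
}

function es_post(string $url, string $json): string {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $json,
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    ]);
    $out = curl_exec($ch);
    curl_close($ch);
    return $out;
}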

AJAX query weird delay between DNS lookup and initial connection on Chrome but not FF, what is it?

I have an AJAX query on my client that passes two parameters to a server:
var url = window.location.origin + "/instanceStats";
$.getJSON(url, { "unit": unit, "stat": stat }, function(data) {
    instanceData[key] = data;
    var count = showInstanceStats(targetElement, unit, stat, limiter);
});
The server itself is a very simple Python Flask application. On that particular URL, it grabs the "unit" and "stat" parameters from the query to determine the name of a CSV file and line within that file, grabs the line, and sends the data back to the client formatted as JSON (roughly 1KB).
Here is the funny thing: when I measure the time it takes for the data to come back, I observe that some queries are fast (between 20 and 40 ms) and some are slow (between 320 and 350 ms). Varying the "stat" parameter (i.e. selecting a different line in the CSV) doesn't seem to have any impact. The fast and slow queries usually alternate (i.e. all even queries are fast, all odd ones are slow). The Python server itself reports roughly the same time for each query.
AJAX itself doesn't seem to have any impact either, as I can take the URL constructed in the JS, paste it into the browser myself, and get the same behavior. Here are some measurements from two subsequent queries:
Fast: http://i.imgur.com/VQ7qopd.png
Slow: http://i.imgur.com/YuG0ROM.png
This seems to be Chrome-specific: when I try it on Firefox, the same experiment yields roughly the same query time every time (between 30 and 50 ms). This is unfortunate, as I want to deploy on both Chrome and Firefox.
What's causing this behavior, and how can I fix it?
I've run into this also. It only seems to happen when using localhost. If you use 127.0.0.1 (or even the computer name), it will not have the extra delay.
I'm having it too, and it's exactly the same: my Node.js application serves AJAX requests, and no matter which URL I request, it's either 30 ms or 300 ms, switching back and forth: odd requests are long, even requests are short.
The thing I see in Chrome Web Inspector (aka Chrome DevTools) is that there is a long gap between "DNS lookup" and "Initial Connection".
They say it's OCSP related here:
http://www.webpagetest.org/forums/showthread.php?tid=12357
OCSP is some kind of certificate validation protocol:
https://en.wikipedia.org/wiki/Online_Certificate_Status_Protocol
Moving from localhost to 127.0.0.1 seems to fix it: response times are 30ms now.

Web Scraping using simplehtmldom on multiple sites

I am using the simplehtmldom parser for my recent web scraping project: a price-comparison website built with CodeIgniter. The website has to fetch product names, descriptions, and prices from different shopping websites. Here is my code:
$this->dom->load_file('http://www.site1.com');
$price1 = $this->dom->find("span[itemprop=price]");
$this->dom->load_file('http://www.site2.com');
$price2 = $this->dom->find("div.price");
$this->dom->load_file('http://www.site3.com');
$price3 = $this->dom->find("div.priceBold");
$this->dom->load_file('http://www.site4.com');
$price4 = $this->dom->find("span.fntBlack");
$this->dom->load_file('http://www.site5.com');
$price5 = $this->dom->find("div.price");
The above code takes approximately 15-20 seconds to load the results onto the screen. When I try with only one site, it takes just 2 seconds. Is this how simplehtmldom works with multiple domains, or is there a way to optimize it?
PHP Simple HTML DOM Parser has a memory leak issue, so before trying to load a new page, clear the previous one using:
$this->dom->clear();
unset($this->dom);
If this doesn't change anything, then one of the websites is taking a long time to respond; you'll have to check them one by one to find the culprit.
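A quick way to do that check is to time each fetch separately. Here is a sketch reusing the selectors from the question (simple_html_dom is the class the library provides; the include path is an assumption about your project layout):
<?php
// Sketch: time each site's fetch+parse to locate the slow responder.
require_once 'simple_html_dom.php'; // adjust to wherever the library lives

$sites = [
    'http://www.site1.com' => 'span[itemprop=price]',
    'http://www.site2.com' => 'div.price',
    'http://www.site3.com' => 'div.priceBold',
    'http://www.site4.com' => 'span.fntBlack',
    'http://www.site5.com' => 'div.price',
];

foreach ($sites as $url => $selector) {
    $start = microtime(true);
    $dom = new simple_html_dom();
    $dom->load_file($url);
    $price = $dom->find($selector);
    printf("%s: %.2f s\n", $url, microtime(true) - $start);
    $dom->clear();  // release simplehtmldom's internal references
    unset($dom);    // then drop the object itself
}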

Scraping Real Time Visitors from Google Analytics

I have a lot of sites and want to build a dashboard showing the number of real time visitors on each of them on a single page. (would anyone else want this?) Right now the only way to view this information is to open a new tab for each site.
Google doesn't have a real-time API, so I'm wondering if it is possible to scrape this data. Eduardo Cereto found out that Google transfers the real-time data over the realtime/bind network request. Anyone more savvy have an idea of how I should start? Here's what I'm thinking:
Figure out how to authenticate programmatically
Inspect all of the realtime/bind requests to see how they change. Does each request have a unique key? Where does that come from? Below is my breakdown of the request:
https://www.google.com/analytics/realtime/bind?VER=8
&key= [What is this? Where does it come from? 21 character lowercase alphanumeric, stays the same each request]
&ds= [What is this? Where does it come from? 21 character lowercase alphanumeric, stays the same each request]
&pageId=rt-standard%2Frt-overview
&q=t%3A0%7C%3A1%3A0%3A%2Ct%3A11%7C%3A1%3A5%3A%2Cot%3A0%3A0%3A4%2Cot%3A0%3A0%3A3%2Ct%3A7%7C%3A1%3A10%3A6%3D%3DREFERRAL%3B%2Ct%3A10%7C%3A1%3A10%3A%2Ct%3A18%7C%3A1%3A10%3A%2Ct%3A4%7C5%7C2%7C%3A1%3A10%3A2!%3Dzz%3B%2C&f
The q variable URI decodes to this (what the?):
t:0|:1:0:,t:11|:1:5:,ot:0:0:4,ot:0:0:3,t:7|:1:10:6==REFERRAL;,t:10|:1:10:,t:18|:1:10:,t:4|5|2|:1:10:2!=zz;,&f
&RID=rpc
&SID= [What is this? Where does it come from? 16 character uppercase alphanumeric, stays the same each request]
&CI=0
&AID= [What is this? Where does it come from? integer, starts at 1, increments weirdly to 150 and then 298]
&TYPE=xmlhttp
&zx= [What is this? Where does it come from? 12 character lowercase alphanumeric, changes each request]
&t=1
Inspect all of the realtime/bind responses to see how they change. How does the data come in? It looks like some altered JSON. How many times do I need to connect to get the data? Where is the active-visitors-on-site number in there? Here is a dump of sample data:
19
[[151,["noop"]
]
]
388
[[152,["rt",[{"ot:0:0:4":{"timeUnit":"MINUTES","overTimeData":[{"values":[49,53,52,40,42,55,49,41,51,52,47,42,62,82,76,71,81,66,81,86,71,66,65,65,55,51,53,73,71,81],"name":"Total"}]},"ot:0:0:3":{"timeUnit":"SECONDS","overTimeData":[{"values":[0,1,1,1,1,0,1,0,1,1,1,0,2,0,2,2,1,0,0,0,0,0,2,1,1,2,1,2,0,5,1,0,2,1,1,1,2,0,2,1,0,5,1,1,2,0,0,0,0,0,0,0,0,0,1,1,0,3,2,0],"name":"Total"}]}}]]]
]
388
[[153,["rt",[{"ot:0:0:4":{"timeUnit":"MINUTES","overTimeData":[{"values":[52,53,52,40,42,55,49,41,51,52,47,42,62,82,76,71,81,66,81,86,71,66,65,65,55,51,53,73,71,81],"name":"Total"}]},"ot:0:0:3":{"timeUnit":"SECONDS","overTimeData":[{"values":[2,1,1,1,1,1,0,1,0,1,1,1,0,2,0,2,2,1,0,0,0,0,0,2,1,1,2,1,2,0,5,1,0,2,1,1,1,2,0,2,1,0,5,1,1,2,0,0,0,0,0,0,0,0,0,1,1,0,3,2],"name":"Total"}]}}]]]
]
388
[[154,["rt",[{"ot:0:0:4":{"timeUnit":"MINUTES","overTimeData":[{"values":[53,53,52,40,42,55,49,41,51,52,47,42,62,82,76,71,81,66,81,86,71,66,65,65,55,51,53,73,71,81],"name":"Total"}]},"ot:0:0:3":{"timeUnit":"SECONDS","overTimeData":[{"values":[0,3,1,1,1,1,1,0,1,0,1,1,1,0,2,0,2,2,1,0,0,0,0,0,2,1,1,2,1,2,0,5,1,0,2,1,1,1,2,0,2,1,0,5,1,1,2,0,0,0,0,0,0,0,0,0,1,1,0,3],"name":"Total"}]}}]]]
]
Let me know if you can help with any of the items above!
Google has since launched the new Real Time Reporting API, which does exactly this. With it you can easily retrieve the number of real-time online visitors, as well as several other Google Analytics dimensions and metrics: https://developers.google.com/analytics/devguides/reporting/realtime/dimsmets/
It is quite similar to the Google Analytics API. To start developing with it, see:
https://developers.google.com/analytics/devguides/reporting/realtime/v3/devguide
With Google Chrome I can see the data on the Network Panel.
The request endpoint is https://www.google.com/analytics/realtime/bind
It seems like the connection stays open for 2.5 minutes, and during this time it just keeps receiving more and more data.
After about 2.5 minutes the connection is closed and a new one is opened.
On the Network panel you can only see the data for connections that have terminated, so leave it open for 5 minutes or so and you can start to see the data.
I hope that can give you a place to start.
Having Google in the loop seems pretty redundant. I suggest you serve a common element on demand from the dashboard server and include it by absolute URL on all pages to be monitored for a given site. The script outputting the element can read the IP of the requesting browser; these can all be logged into a database and filtered for uniqueness, giving a real-time head count.
<?php
// Log the visitor's IP, then serve the tracking image itself.
$user_ip = $_SERVER["REMOTE_ADDR"];
// Some MySQL to insert $user_ip into the database table for website XXX goes here
$file = 'tracking_image.gif';
$type = 'image/gif';
header('Content-Type: ' . $type);
header('Content-Length: ' . filesize($file));
readfile($file);
?>
Addendum:
The database can also add a timestamp to every row of data it stores. This can be used to further filter results and report the number of visitors in the last hour or minute.
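For example, counting distinct IPs seen in the last five minutes gives the live head count. A hypothetical sketch (the hits table and its ip/created_at columns are made-up names; any schema with one timestamped row per request works):
<?php
// Sketch: unique visitors in the last 5 minutes, for the dashboard.
// Table/column names (hits, ip, created_at) are hypothetical.
$pdo = new PDO('mysql:host=localhost;dbname=tracking', 'user', 'pass');
$stmt = $pdo->query(
    "SELECT COUNT(DISTINCT ip) AS live_visitors
       FROM hits
      WHERE created_at > NOW() - INTERVAL 5 MINUTE"
);
echo $stmt->fetchColumn(); // real-time head count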
Client-side JavaScript with AJAX, for fine tuning (or overkill)
The onblur and onfocus JavaScript events can be used to tell whether the page is visible, and the result can be passed back to the dashboard server via AJAX. http://www.thefutureoftheweb.com/demo/2007-05-16-detect-browser-window-focus/
When a visitor closes a page, this can also be detected by the JavaScript onunload handler on the body tag, and AJAX can be used to send data back to the server one last time before the browser finally closes the page.
If you also wish to collect some information about the visitor, as Google Analytics does, the page https://panopticlick.eff.org/ has a lot of JavaScript that can be examined and adapted.
I needed/wanted realtime data for personal use so I reverse-engineered their system a little bit.
Instead of binding to /bind I get data from /getData (no pun intended).
At /getData the minimum request is apparently: https://www.google.com/analytics/realtime/realtime/getData?pageId&key={{propertyID}}&q=t:0|:1
Here's a short explanation of the possible query parameters and syntax, please remember that these are all guesses and I don't know all of them:
Query Syntax: pageId&key=propertyID&q=dataType:dimensions|:page|:limit:filters
Values:
pageId: Required, but it seems to be used only for internal analytics.
propertyID: a{{accountID}}w{{webPropertyID}}p{{profileID}}, as specified at the Documentation link below. You can also find this in the URL of all analytics pages in the UI.
dataType:
t: Current data
ot: Overtime/Past
c: Unknown, returns only a "count" value
dimensions (| separated or alone), most values are only applicable for t:
1: Country
2: City
3: Location code?
4: Latitude
5: Longitude
6: Traffic source type (Social, Referral, etc.)
7: Source
8: ?? Returns (not set)
9: Another location code? longer.
10: Page URL
11: Visitor Type (new/returning)
12: ?? Returns (not set)
13: ?? Returns (not set)
14: Medium
15: ?? Returns "1"
page:
At first this seems to work for pagination but after further analysis it looks like it's also used to specify which of the 6 pages (Overview, Locations, Traffic Sources, Content, Events and Conversions) to return data for.
For some reason, 0 returns an impossibly high metric total.
limit: Result limit per page, maximum of 50
filters:
Syntax is as specified at the Documentation 2 link below, except that OR is specified using | instead of a comma, e.g. 6==CUSTOM;1==United%20States
You can also combine multiple queries in one request by comma-separating them (e.g. q=t:1|2|:1|:10,t:6|:1|:10).
Following the above "documentation", if you wanted to build a query that requests the page URL and city of the top 10 active visitors with a traffic source type of CUSTOM located in the US you would use this URL: https://www.google.com/analytics/realtime/realtime/getData?key={{propertyID}}&pageId&q=t:10|2|:1|:10:6==CUSTOM;1==United%20States
Documentation
Documentation 2
I hope that my answer is readable and (although it's a little late) sufficiently answers your question and helps others in the future.
