Web Scraping using simplehtmldom on multiple sites - codeigniter

I am using the simplehtmldom parser for a web scraping project: a price comparison website built with CodeIgniter. The website has to fetch product names, descriptions and prices from different shopping websites. Here is my code:
$this->dom->load_file('http://www.site1.com');
$price1 = $this->dom->find("span[itemprop=price]");
$this->dom->load_file('http://www.site2.com');
$price2 = $this->dom->find("div.price");
$this->dom->load_file('http://www.site3.com');
$price3 = $this->dom->find("div.priceBold");
$this->dom->load_file('http://www.site4.com');
$price4 = $this->dom->find("span.fntBlack");
$this->dom->load_file('http://www.site5.com');
$price5 = $this->dom->find("div.price");
The above code takes approximately 15-20 seconds to load the result onto the screen. When I try with only one site, it takes just 2 seconds. Is this how simplehtmldom works with multiple domains? Or is there a way to optimize it?

PHP Simple HTML DOM Parser has a known memory leak issue, so before loading a new page, clear the previous one using:
$this->dom->clear();
unset($this->dom);
If this doesn't change anything, then one of the websites is taking a long time to respond. You'll have to check them one by one to find the culprit.
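Checking the sites one by one can be automated by timing each request on its own. The sketch below is in JavaScript rather than PHP and is only an illustration: `timeAll` and the fake tasks are hypothetical names, and each task stands in for "fetch one shop page". It also runs the tasks concurrently, so the total wait is roughly that of the slowest site instead of the sum of all of them. In PHP, the `curl_multi_*` functions serve a similar purpose.

```javascript
// Run every task concurrently and record how long each one takes.
// `tasks` maps a label (e.g. a site name) to a function returning a Promise.
async function timeAll(tasks) {
  const pending = Object.entries(tasks).map(async ([name, task]) => {
    const start = Date.now();
    try {
      const value = await task();
      return { name, ms: Date.now() - start, value };
    } catch (error) {
      // A failing site should not hide the timings of the others.
      return { name, ms: Date.now() - start, error };
    }
  });
  const results = await Promise.all(pending);
  // Slowest first, so the culprit is at the top of the list.
  return results.sort((a, b) => b.ms - a.ms);
}

// Hypothetical stand-in for an HTTP request that takes `ms` milliseconds.
function fakeRequest(ms, body) {
  return () => new Promise((resolve) => setTimeout(() => resolve(body), ms));
}

timeAll({
  site1: fakeRequest(20, "<span itemprop='price'>10</span>"),
  site2: fakeRequest(120, "<div class='price'>12</div>"),
}).then((results) => {
  for (const r of results) {
    console.log(r.name, r.ms + " ms");
  }
});
```

With real requests, each task would return the page HTML, which can then be handed to the parser as a string instead of letting the parser download it.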

Related

Drupal 9 - custom module caching issue

Longtime D7 user, first time with D9. I am writing my first custom module and having a devil of a time. My routing calls a controller that simply does this:
\Drupal::service('page_cache_kill_switch')->trigger();
die("hello A - ". rand());
I can refresh the page over and over and get a new random number each time. But, when I change the code to:
\Drupal::service('page_cache_kill_switch')->trigger();
die("hello B - ". rand());
I still get "hello A 34234234" for several minutes. Clearing the cache doesn't help; all I can do is wait, normally about two minutes. I am at my wits' end.
I thought it might be an issue with my Docker instance, so I generated a simple HTML file, but if I edit and then reload that file, changes are reflected immediately.
In my settings.local.php I have disabled the render cache, caching for migrations, Internal Page Cache, and Dynamic Page Cache.
In my mymod.routing.yml I have:
options:
  _admin_route: TRUE
  no_cache: TRUE
Any hint on what I am missing would be deeply appreciated.
thanks,
summer

Drupal 7 ignoring $_SESSION in template

I'm working on a simple script in a custom theme in Drupal 7 that is supposed to rotate through a different background image each time a user loads the page. This is my code in [view].tpl.php that picks which image to use.
$img_index = (!isset($_SESSION["img_index"]) || is_null($_SESSION["img_index"])) ? 1 : $_SESSION["img_index"] + 1;
if ($img_index > 2) {
$img_index = 0;
}
$_SESSION["img_index"] = $img_index;
Pretty simple stuff, and it works fine as long as Drupal starts up a session. However, if I delete my session cookie, it always shows the same image, because a session is never started.
I'm assuming that since this code is in the view file that the view code is being cached for anonymous users and hence the session is never started, but I can't figure out how to otherwise do what I want.
Don't mess with session like /u/maiznieks mentioned on Reddit. It's going to affect performance.
I've had to do something similar in the past and went with an approach like /u/maiznieks mentions. It's something like this,
Return all the URLs in an array via JS on Drupal.settings.
Check if a cookie is set.
If it's not, set it and set its value to 0.
If it's set, get the value, increase the value by one, save it to the cookie.
With that value, now you have an index.
Check if image[index] exists
If it does, show that to the user.
If it doesn't, reset index to 0 and show that. Save 0 to the cookie.
You keep caching. You keep showing the user new images on every page load.
You could set your current view to do a random sort every 5 mins. You would then only have to update the logic above to replace that image. That way you can keep something similar working for users with no JS but still keep this functionality for the rest.
You can replace cookies above with HTML5 local storage if you'd like.
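The cookie steps above can be sketched as follows. This is a minimal illustration, not the answerer's actual code: `nextImageIndex`, the cookie name `img_index` and the `Drupal.settings.rotatingImages` key are all assumptions chosen for the example.

```javascript
// Decide which image to show, given the raw cookie value (or undefined
// when no cookie exists yet) and the list of image URLs.
function nextImageIndex(cookieValue, images) {
  const previous = Number.parseInt(cookieValue, 10);
  // No cookie yet, or an unreadable value: start at 0.
  if (Number.isNaN(previous)) {
    return 0;
  }
  const candidate = previous + 1;
  // If image[candidate] does not exist, wrap around to 0.
  return candidate < images.length ? candidate : 0;
}

// Browser wiring; skipped when run outside a Drupal page.
if (typeof document !== "undefined" && typeof Drupal !== "undefined") {
  // Hypothetical settings key holding the image URLs.
  const images = Drupal.settings.rotatingImages || [];
  const match = document.cookie.match(/(?:^|; )img_index=(\d+)/);
  const index = nextImageIndex(match ? match[1] : undefined, images);
  document.cookie = "img_index=" + index + "; path=/";
  if (images.length > 0) {
    document.body.style.backgroundImage = "url(" + images[index] + ")";
  }
}
```

Because the choice is made in the browser, the page itself stays identical for every anonymous visitor and can still be served from the cache.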
@hobberwickey, I suggest creating a custom module and implementing hook_boot() in it. In the Drupal bootstrap process, the session layer always runs after the cache layer, but hook_boot() is called even for cached pages, early in the bootstrap. You can find more information here.

Can't seem to run a process in background of Sinatra app

I'm trying to display a number from an api, but I want my page to load faster. So, I'd like to get the number from the api every 5 minutes, and just load that number to my page. This is what I have.
get '/' do
  x = Numbersapi.new
  @number = x.number
  erb :home
end
This works fine, but getting that number from the api takes a while so that means my page takes a while to load. I want to look up that number ahead of time and then every 5 minutes. I've tried using threads and processes, but I can't seem to figure it out. I'm still pretty new to programming.
Here's a pretty simple way to get data in a separate thread. Somewhere outside of the controller action, fire off the async loop:
Data = {}
numbers_api = Numbersapi.new
Thread.new do
  loop do
    Data[:number] = numbers_api.number # refresh the cached value
    sleep 300 # 5 minutes
  end
end
Then in your controller action, you can simply refer to the Data[:number], and you'll get the latest value.
However, if you're deploying this, you should use a gem like Resque or Sidekiq; it will track failures and is probably better optimized.

Can I reduce my amount of requests in Google Maps JavaScript API v3?

I look up 2 locations. From an XML file I get the longitude and latitude of a location: first the closest cafe, then the closest school.
$.get('https://maps.googleapis.com/maps/api/place/nearbysearch/xml?location='+home_latitude+','+home_longtitude+'&rankby=distance&types=cafe&sensor=false&key=X', function(xml)
{
  var loc = $(xml).find("result:first").find("geometry:first").find("location:first");
  verander(loc.find("lat").text(), loc.find("lng").text());
}
);
$.get('https://maps.googleapis.com/maps/api/place/nearbysearch/xml?location='+home_latitude+','+home_longtitude+'&rankby=distance&types=school&sensor=false&key=X', function(xml)
{
  var loc = $(xml).find("result:first").find("geometry:first").find("location:first");
  verander(loc.find("lat").text(), loc.find("lng").text());
}
);
But as you can see, I call the function verander(latitude,longtitude) twice.
function verander(google_lat, google_lng)
{
var bryantPark = new google.maps.LatLng(google_lat, google_lng);
var panoramaOptions =
{
position:bryantPark,
pov:
{
heading: 185,
pitch:0,
zoom:1,
},
panControl : false,
streetViewControl : false,
mapTypeControl: false,
overviewMapControl: false ,
linksControl: false,
addressControl:false,
zoomControl : false,
}
map = new google.maps.StreetViewPanorama(document.getElementById("map_canvas"), panoramaOptions);
map.setVisible(true);
}
Would it be possible to push these 2 locations into only one request (perhaps via an array)? I know it sounds silly, but I really want to know if there isn't a backdoor to reduce these Google Maps requests.
FTR: This is what a request is for Google:
What constitutes a 'map load' in the context of the usage limits that apply to the Maps API? A single map load occurs when:
a. a map is displayed using the Maps JavaScript API (V2 or V3) when loaded by a web page or application;
b. a Street View panorama is displayed using the Maps JavaScript API (V2 or V3) by a web page or application that has not also displayed a map;
c. a SWF that loads the Maps API for Flash is loaded by a web page or application;
d. a single request is made for a map image from the Static Maps API.
e. a single request is made for a panorama image from the Street View Image API.
So I'm afraid it isn't possible, but hey, suggestions are always welcome!
You're calling the Places API twice and loading Street View twice. So that's four calls, but I think they only count those two Street Views as one if you're loading them on one page. Also, your Places calls will be client side, so they won't count towards your limits.
But to answer your question there's no loop hole to get around the double load since you want to show the users two streetviews.
What I would do is not load anything until the client asks. Instead, have a couple of call-to-action buttons like <button onclick="loadStreetView('cafe')">Click here to see Nearby Cafe</button>; when clicked, they call the nearby search and load the Street View. And since it only happens on client request, your page loads will never increment the usage counts, for example when your site gets crawled by search engines.
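A rough sketch of that on-demand approach is below. `buildNearbySearchUrl` and `makeLazyLoader` are hypothetical helper names; the URL format is taken from the question. The loader only issues a request the first time a given type is asked for and reuses the result afterwards, so repeated clicks do not add requests.

```javascript
// Build the nearby-search URL used in the question for one place type.
function buildNearbySearchUrl(lat, lng, type, key) {
  return "https://maps.googleapis.com/maps/api/place/nearbysearch/xml" +
    "?location=" + lat + "," + lng +
    "&rankby=distance&types=" + encodeURIComponent(type) +
    "&sensor=false&key=" + encodeURIComponent(key);
}

// Wrap a fetch function so each place type is requested at most once,
// and only when somebody actually asks for it.
function makeLazyLoader(fetchLocation) {
  const cache = new Map();
  return function load(type) {
    if (!cache.has(type)) {
      cache.set(type, fetchLocation(type));
    }
    return cache.get(type);
  };
}

// Possible wiring with jQuery and verander() from the question:
//
// var loadLocation = makeLazyLoader(function (type) {
//   return $.get(buildNearbySearchUrl(home_latitude, home_longtitude, type, "X"));
// });
// function loadStreetView(type) {
//   loadLocation(type).then(function (xml) {
//     var loc = $(xml).find("result:first").find("geometry:first").find("location:first");
//     verander(loc.find("lat").text(), loc.find("lng").text());
//   });
// }
```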
More on those usage limits
The Google Places API has different usage limits than Maps: https://developers.google.com/places/policies#usage_limits
Users with an API key are allowed 1 000 requests per 24 hour period
Users who have verified their identity through the APIs console are allowed 100 000 requests per 24 hour period. A credit card is required for verification, by enabling billing in the console. We ask for your credit card purely to validate your identity. Your card will not be charged for use of the Places API.
100,000 requests a day if you verify yourself. That's pretty decent.
As for Google Maps, https://developers.google.com/maps/faq#usagelimits
You get 25,000 map loads per day, and it says:
In order to accommodate sites that experience short term spikes in usage, the usage limits will only take effect for a given site once that site has exceeded the limits for more than 90 consecutive days.
So if you go over a bit now and then, it seems like they won't mind.
P.S. You have an extra comma after zoom:1 and after zoomControl : false; they shouldn't be there and will cause errors in some browsers such as IE. You are also missing a semicolon after var panoramaOptions = { ... } and before map = new

Scraping Real Time Visitors from Google Analytics

I have a lot of sites and want to build a dashboard showing the number of real time visitors on each of them on a single page. (would anyone else want this?) Right now the only way to view this information is to open a new tab for each site.
Google doesn't have a real-time API, so I'm wondering if it is possible to scrape this data. Eduardo Cereto found out that Google transfers the real-time data over the realtime/bind network request. Anyone more savvy have an idea of how I should start? Here's what I'm thinking:
Figure out how to authenticate programmatically
Inspect all of the realtime/bind requests to see how they change. Does each request have a unique key? Where does that come from? Below is my breakdown of the request:
https://www.google.com/analytics/realtime/bind?VER=8
&key= [What is this? Where does it come from? 21 character lowercase alphanumeric, stays the same each request]
&ds= [What is this? Where does it come from? 21 character lowercase alphanumeric, stays the same each request]
&pageId=rt-standard%2Frt-overview
&q=t%3A0%7C%3A1%3A0%3A%2Ct%3A11%7C%3A1%3A5%3A%2Cot%3A0%3A0%3A4%2Cot%3A0%3A0%3A3%2Ct%3A7%7C%3A1%3A10%3A6%3D%3DREFERRAL%3B%2Ct%3A10%7C%3A1%3A10%3A%2Ct%3A18%7C%3A1%3A10%3A%2Ct%3A4%7C5%7C2%7C%3A1%3A10%3A2!%3Dzz%3B%2C&f
The q variable URI decodes to this (what the?):
t:0|:1:0:,t:11|:1:5:,ot:0:0:4,ot:0:0:3,t:7|:1:10:6==REFERRAL;,t:10|:1:10:,t:18|:1:10:,t:4|5|2|:1:10:2!=zz;,&f
&RID=rpc
&SID= [What is this? Where does it come from? 16 character uppercase alphanumeric, stays the same each request]
&CI=0
&AID= [What is this? Where does it come from? integer, starts at 1, increments weirdly to 150 and then 298]
&TYPE=xmlhttp
&zx= [What is this? Where does it come from? 12 character lowercase alphanumeric, changes each request]
&t=1
Inspect all of the realtime/bind responses to see how they change. How does the data come in? It looks like some altered JSON. How many times do I need to connect to get the data? Where is the active visitors on site number in there? Here is a dump of sample data:
19
[[151,["noop"]
]
]
388
[[152,["rt",[{"ot:0:0:4":{"timeUnit":"MINUTES","overTimeData":[{"values":[49,53,52,40,42,55,49,41,51,52,47,42,62,82,76,71,81,66,81,86,71,66,65,65,55,51,53,73,71,81],"name":"Total"}]},"ot:0:0:3":{"timeUnit":"SECONDS","overTimeData":[{"values":[0,1,1,1,1,0,1,0,1,1,1,0,2,0,2,2,1,0,0,0,0,0,2,1,1,2,1,2,0,5,1,0,2,1,1,1,2,0,2,1,0,5,1,1,2,0,0,0,0,0,0,0,0,0,1,1,0,3,2,0],"name":"Total"}]}}]]]
]
388
[[153,["rt",[{"ot:0:0:4":{"timeUnit":"MINUTES","overTimeData":[{"values":[52,53,52,40,42,55,49,41,51,52,47,42,62,82,76,71,81,66,81,86,71,66,65,65,55,51,53,73,71,81],"name":"Total"}]},"ot:0:0:3":{"timeUnit":"SECONDS","overTimeData":[{"values":[2,1,1,1,1,1,0,1,0,1,1,1,0,2,0,2,2,1,0,0,0,0,0,2,1,1,2,1,2,0,5,1,0,2,1,1,1,2,0,2,1,0,5,1,1,2,0,0,0,0,0,0,0,0,0,1,1,0,3,2],"name":"Total"}]}}]]]
]
388
[[154,["rt",[{"ot:0:0:4":{"timeUnit":"MINUTES","overTimeData":[{"values":[53,53,52,40,42,55,49,41,51,52,47,42,62,82,76,71,81,66,81,86,71,66,65,65,55,51,53,73,71,81],"name":"Total"}]},"ot:0:0:3":{"timeUnit":"SECONDS","overTimeData":[{"values":[0,3,1,1,1,1,1,0,1,0,1,1,1,0,2,0,2,2,1,0,0,0,0,0,2,1,1,2,1,2,0,5,1,0,2,1,1,1,2,0,2,1,0,5,1,1,2,0,0,0,0,0,0,0,0,0,1,1,0,3],"name":"Total"}]}}]]]
]
Let me know if you can help with any of the items above!
For this, Google has since launched a Real Time Reporting API. With it you can retrieve the number of real-time visitors, along with several other Google Analytics dimensions and metrics: https://developers.google.com/analytics/devguides/reporting/realtime/dimsmets/
This is quite similar to Google Analytics API. To start development on this,
https://developers.google.com/analytics/devguides/reporting/realtime/v3/devguide
With Google Chrome I can see the data on the Network Panel.
The request endpoint is https://www.google.com/analytics/realtime/bind
Seems like the connection stays open for 2.5 minutes, and during this time it just keeps getting more and more data.
After about 2.5 minutes the connection is closed and a new one is opened.
On the Network panel you can only see the data for the connections that are terminated. So leave it open for 5 minutes or so and you can start to see the data.
I hope that can give you a place to start.
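Judging by the sample dump in the question, each message appears to be a line holding a character count, followed by that many characters of JSON (the first chunk, `[[151,["noop"]` plus its closing lines, is exactly 19 characters including newlines). Under that assumption, which is a guess and not documented behavior, the stream could be split like this; `parseChunks` is a hypothetical name.

```javascript
// Split a length-prefixed stream into parsed JSON messages.
// Assumed format: "<length>\n<length characters of JSON>" repeated.
function parseChunks(stream) {
  const chunks = [];
  let pos = 0;
  while (pos < stream.length) {
    const eol = stream.indexOf("\n", pos);
    if (eol === -1) break; // length line not complete yet
    const length = Number.parseInt(stream.slice(pos, eol), 10);
    if (Number.isNaN(length)) break; // not the format we assumed
    const start = eol + 1;
    const body = stream.slice(start, start + length);
    if (body.length < length) break; // chunk still arriving
    chunks.push(JSON.parse(body));
    pos = start + length;
  }
  return chunks;
}

// First chunk from the question's dump.
const sample = '19\n[[151,["noop"]\n]\n]\n';
console.log(JSON.stringify(parseChunks(sample)));
```

Each parsed chunk is an array of `[id, payload]` pairs; in the dump, payloads starting with "rt" carry the data and "noop" looks like a keep-alive.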
Having Google in the loop seems pretty redundant. I suggest serving a common element (for example, a small tracking image) on demand from the dashboard server and including it by absolute URL on every page to be monitored for a given site. The script that outputs the item can read the IP address of the requesting browser; these addresses can be logged to a database and filtered for uniqueness, giving a real-time head count.
<?php
$user_ip = $_SERVER["REMOTE_ADDR"];
/// Some MySQL to insert $user_ip to the database table for website XXX goes here
$file = 'tracking_image.gif';
$type = 'image/gif';
header('Content-Type:'.$type);
header('Content-Length: ' . filesize($file));
readfile($file);
?>
Addendum:
A database can also add a timestamp to every row of data it stores. This can be used to further filter results and provide the number of visitors in the last hour or minute.
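The filtering step can be sketched as below. This is only an illustration in JavaScript: in practice the same thing would usually be a database query over the logged rows, and `countActiveVisitors` and the shape of the `hits` records are assumptions.

```javascript
// Count distinct IP addresses seen within the last `windowMs` milliseconds.
// `hits` is a list of { ip, time } records, with `time` in milliseconds.
function countActiveVisitors(hits, now, windowMs) {
  const recent = new Set();
  for (const hit of hits) {
    const age = now - hit.time;
    if (age >= 0 && age <= windowMs) {
      recent.add(hit.ip); // a Set keeps each IP only once
    }
  }
  return recent.size;
}

const now = Date.now();
const hits = [
  { ip: "203.0.113.5", time: now - 10 * 1000 },
  { ip: "203.0.113.5", time: now - 20 * 1000 }, // same visitor again
  { ip: "198.51.100.7", time: now - 30 * 1000 },
  { ip: "192.0.2.9", time: now - 2 * 60 * 60 * 1000 }, // two hours ago
];
console.log(countActiveVisitors(hits, now, 60 * 1000)); // visitors in the last minute
```

Note that counting by IP treats everyone behind the same NAT or proxy as one visitor.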
Client side Javascript with AJAX for fine tuning or overkill
The onblur and onfocus JavaScript events can be used to tell whether the page is visible, and that data can be passed back to the dashboard server via Ajax. http://www.thefutureoftheweb.com/demo/2007-05-16-detect-browser-window-focus/
When a visitor closes a page, this can also be detected with the JavaScript onunload handler in the body tag, and Ajax can be used to send data back to the server one last time before the browser finally closes the page.
You may also wish to collect some information about the visitor, as Google Analytics does. The page https://panopticlick.eff.org/ has a lot of JavaScript that can be examined and adapted.
I needed/wanted realtime data for personal use so I reverse-engineered their system a little bit.
Instead of binding to /bind I get data from /getData (no pun intended).
At /getData the minimum request is apparently: https://www.google.com/analytics/realtime/realtime/getData?pageId&key={{propertyID}}&q=t:0|:1
Here's a short explanation of the possible query parameters and syntax, please remember that these are all guesses and I don't know all of them:
Query Syntax: pageId&key=propertyID&q=dataType:dimensions|:page|:limit:filters
Values:
pageID: Required but seems to only be used for internal analytics.
propertyID: a{{accountID}}w{{webPropertyID}}p{{profileID}}, as specified at the Documentation link below. You can also find this in the URL of all analytics pages in the UI.
dataType:
t: Current data
ot: Overtime/Past
c: Unknown, returns only a "count" value
dimensions (| separated or alone), most values are only applicable for t:
1: Country
2: City
3: Location code?
4: Latitude
5: Longitude
6: Traffic source type (Social, Referral, etc.)
7: Source
8: ?? Returns (not set)
9: Another location code? longer.
10: Page URL
11: Visitor Type (new/returning)
12: ?? Returns (not set)
13: ?? Returns (not set)
14: Medium
15: ?? Returns "1"
page:
At first this seems to work for pagination but after further analysis it looks like it's also used to specify which of the 6 pages (Overview, Locations, Traffic Sources, Content, Events and Conversions) to return data for.
For some reason 0 returns an impossibly high metrictotal
limit: Result limit per page, maximum of 50
filters:
Syntax is as specified at the Documentation 2 link below, except that OR is specified using | instead of a comma. Example: 6==CUSTOM;1==United%20States
You can also combine multiple queries in one request by comma separating them (i.e. q=t:1|2|:1|:10,t:6|:1|:10).
Following the above "documentation", if you wanted to build a query that requests the page URL and city of the top 10 active visitors with a traffic source type of CUSTOM located in the US you would use this URL: https://www.google.com/analytics/realtime/realtime/getData?key={{propertyID}}&pageId&q=t:10|2|:1|:10:6==CUSTOM;1==United%20States
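To make the guessed syntax easier to experiment with, here is a small sketch that assembles the q value from its parts. It encodes only the guesses listed above, nothing official, and `buildRealtimeQuery` and its option names are hypothetical.

```javascript
// Assemble one query in the guessed form
//   dataType:dimensions|:page|:limit:filters
// The limit and filters parts are optional, as in the minimal "t:0|:1".
function buildRealtimeQuery(options) {
  const opts = options || {};
  const dataType = opts.dataType || "t";
  const dimensions = opts.dimensions || [0];
  const page = opts.page === undefined ? 1 : opts.page;
  let query = dataType + ":" + dimensions.join("|") + "|:" + page;
  if (opts.limit !== undefined) {
    query += "|:" + opts.limit;
    if (opts.filters) {
      query += ":" + opts.filters;
    }
  }
  return query;
}

// Several queries can be combined with commas.
const q = [
  buildRealtimeQuery({ dimensions: [1, 2], limit: 10 }),
  buildRealtimeQuery({ dimensions: [6], limit: 10 }),
].join(",");
console.log(q);
```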
Documentation
Documentation 2
I hope that my answer is readable and (although it's a little late) sufficiently answers your question and helps others in the future.
