Scrape Instagram Web Hashtag Posts - xpath

I'm trying to scrape the number of posts to a given hashtag (#castles) and populate a Google Sheet cell using ImportXML.
I tried copying the XPath from Chrome and pasting it into the ImportXML parameter in the cell like this:
=ImportXML("https://www.instagram.com/explore/tags/castels/", "//*[@id="react-root"]/section/main/header/div[2]/div/div[2]/span/span")
I saw there was a problem with the quotation marks, so I also tried:
=ImportXML("https://www.instagram.com/explore/tags/castels/", "//*[@id='react-root']/section/main/header/div[2]/div/div[2]/span/span")
Nevertheless, both return an error.
What am I doing wrong?
P.S. I am aware of the XPath to the meta description tag, "//meta[@name='description']/@content"; however, I would like to scrape the exact number of posts, not an abbreviated number.

Try this -
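function hashCount() {
  // Full URL, including the scheme.
  var url = 'https://www.instagram.com/explore/tags/cats/';
  var response = UrlFetchApp.fetch(url, {muteHttpExceptions: true}).getContentText();
  // Capture group 2 holds the digits between the two JSON markers.
  var regex = /(edge_hashtag_to_media":{"count":)(\d+)(,"page_info":)/gm;
  var match = regex.exec(response);
  // match is null if the page markup changed or the request was blocked.
  var count = match ? match[2] : null;
  Logger.log(count);
}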
I've added muteHttpExceptions: true which was not added in my comment above. Hope this helps.
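The extraction step in the answer above can be tried without any network access. The following is a minimal sketch in plain JavaScript: the function name extractHashtagCount and the sample string are made up for illustration, and the sketch assumes the page still embeds a JSON fragment shaped like the one the regex in the answer targets.

```javascript
// Hypothetical helper: pulls the post count out of already-fetched page text.
// Returns null instead of throwing when the expected fragment is missing.
function extractHashtagCount(html) {
  const regex = /edge_hashtag_to_media":\{"count":(\d+),"page_info":/;
  const match = regex.exec(html);
  return match ? Number(match[1]) : null;
}

// Made-up sample that imitates the embedded JSON fragment.
const sample = '{"edge_hashtag_to_media":{"count":1234567,"page_info":{}}}';
console.log(extractHashtagCount(sample));                     // 1234567
console.log(extractHashtagCount('<html>login page</html>'));  // null
```

Returning null for a missing match keeps a changed page layout or a login redirect from surfacing as a TypeError.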

Related

Google Sheet ImportXml converts value of numbers wrong

I'm trying to get coordinates from the GeoNames API using the Google Sheets IMPORTXML function. When I try this formula: IMPORTXML(G2;"//lat"), where G2 holds the API URL,
I get 4.747.104,00 but the actual value is 47.47104. As the values differ, I cannot simply solve this by multiplying by 10 or similar.
To be honest, it was a mystery to me why the German locale reads the number 47.47104 as 4.747.104. But you can work around this problem by using one custom function instead of three ImportXml functions:
function parseXml(url) {
  let xml = UrlFetchApp.fetch(url).getContentText(),
      document = XmlService.parse(xml),
      root = document.getRootElement();
  return [[root.getChild('name').getText(), root.getChild('lat').getText(), root.getChild('lng').getText()]];
}
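One plausible explanation for the misread value is that a German-locale parser treats "." as a thousands separator and "," as the decimal mark. The sketch below only imitates that behaviour in plain JavaScript. Both function names are made up, and it is not a description of how Google Sheets parses numbers internally.

```javascript
// Illustration only: reads text the way a parser would if '.' were a
// thousands separator and ',' the decimal mark.
function readWithGermanSeparators(text) {
  return Number(text.replace(/\./g, '').replace(',', '.'));
}

// Locale-independent reading: '.' is the decimal mark.
function readCoordinate(text) {
  return Number(text);
}

console.log(readWithGermanSeparators('47.47104')); // 4747104
console.log(readCoordinate('47.47104'));           // 47.47104
```

Under that assumption, 4747104 shown with German formatting is the 4.747.104,00 from the question.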

how to get value of a tag that has no class or id in html agility pack?

I am trying to get the text value of this a tag:
<a href="item?id=22513425">67 comments</a>
So I'm trying to get '67' from this. However, there are no defining classes or IDs.
I've managed to get this far:
IEnumerable<HtmlNode> commentsNode = htmlDoc.DocumentNode.Descendants(0).Where(n => n.HasClass("subtext"));
var storyComments = commentsNode.Select(n =>
n.SelectSingleNode("//a[3]")).ToList();
Annoyingly enough, this only gives me "comments".
I can't use the href id because there are many of these items, so I can't hardcode the href.
How can I extract the number as well?
Just use the @href attribute and a dedicated string function:
substring-before(//a[@href="item?id=22513425"],"comments")
returns 67.
EDIT: Since you can't hardcode all the content of @href, maybe you can use starts-with. XPath 1.0 solution.
Shortest form (+ text has to contain "comments"):
substring-before(//a[starts-with(@href,"item?") and text()[contains(.,"comments")]],"c")
More restrictive (+ text has to finish with "comments"):
substring-before(//a[starts-with(@href,"item?")][substring(//a, string-length(//a) - string-length('comments')+1) = 'comments'],"c")
I am using the ScrapySharp NuGet package, which adds the CssSelect method used in my sample below. (It's possible HtmlAgilityPack offers the same functionality built in; I am just used to ScrapySharp from years ago.)
var doc = new HtmlDocument();
doc.Load(@"C:\desktop\anchor.html"); // I created an HTML file with your <a> element as the body
var anchor = doc.DocumentNode.CssSelect("a").FirstOrDefault();
if (anchor == null) return;
var digits = anchor.InnerText.ToCharArray().Where(c => Char.IsDigit(c));
Console.WriteLine($"anchor text: {anchor.InnerText} - digits only: {new string(digits.ToArray())}");
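Once the anchor text is in hand, the digit-extraction idea is independent of the HTML library. Here is a minimal sketch in plain JavaScript, with a made-up function name, that assumes the text has the form "<number> comments", possibly with a non-breaking space between the two parts:

```javascript
// Hypothetical helper: returns the number in front of "comments", or null
// when the text does not have that shape (e.g. a "discuss" link).
function leadingCommentCount(anchorText) {
  // \s in JavaScript also matches the non-breaking space (U+00A0).
  const match = /^\s*(\d+)\s+comments\s*$/.exec(anchorText);
  return match ? Number(match[1]) : null;
}

console.log(leadingCommentCount('67 comments'));      // 67
console.log(leadingCommentCount('67\u00a0comments')); // 67
console.log(leadingCommentCount('discuss'));          // null
```

Anchoring the pattern on the word "comments" avoids picking up digits from unrelated links, which a plain digits-only filter would do.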

From Google Maps CID to Place ID

I have a bunch of links to google maps in the form of https://maps.google.com/?cid=<identifier>.
How can you go from that CID to a Place ID that can be used with the Google Places API? Is there any API endpoint that you can use to convert these URLs into the new places?
Got lucky and found an undocumented api. If you have the CID, you can call the api like https://maps.googleapis.com/maps/api/place/details/json?cid=${CID}&key=YOUR_API_KEY.
I just changed place_id=${the_end_val_i_needed} to cid=${CID} and in the json response is the place_id.
You can use this hidden API parameter to get the Place ID. Usage: https://maps.googleapis.com/maps/api/place/details/json?cid=YOUR_CID&key=YOUR_KEY
It returns a result that contains the formatted address, the place_id, the name of the address, and the GPS coordinates.
Please see my blog to see more detail: https://leonbbs.blogspot.com/2018/03/google-map-cid-to-placeid-or-get.html
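Building the request URL described in the two answers above can be sketched as follows. The function name is made up, and the cid parameter is undocumented according to those answers, so it may stop working at any time. The CID is kept as a string because values such as 15792901685599310349 exceed the range JavaScript numbers can represent exactly.

```javascript
// Hypothetical helper: builds the Place Details URL with the undocumented
// cid parameter mentioned in the answers above.
function placeDetailsUrlFromCid(cid, apiKey) {
  const url = new URL('https://maps.googleapis.com/maps/api/place/details/json');
  url.searchParams.set('cid', String(cid)); // keep the CID as text, not a number
  url.searchParams.set('key', apiKey);
  return url.toString();
}

console.log(placeDetailsUrlFromCid('15792901685599310349', 'YOUR_API_KEY'));
```

If the call succeeds, the answers above say the JSON response contains the place_id.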
The only way that I am aware of is scraping the regular HTML response. This can of course change at any time at Google's discretion, and I would be interested in a better solution as well.
As of now e.g. the source of http://maps.google.com/?cid=15792901685599310349 has a json response embedded. It starts with cacheResponse([[[ and ends with ]);:
<!DOCTYPE html>
<html dir=ltr>
<head>
<script nonce="W6BFnVsL4HLAdFCwYXZY">
mapslite = {
START_PERF: (window.performance && window.performance.now) ?
window.performance.now() : +(new Date())
};
mapslite.getBasePageResponse = function(cacheResponse) {
delete mapslite.getBasePageResponse;
cacheResponse([[[2556.225486744883,-122.3492774,47.6205063], ... removed ... ]);
};
executeOgJs = function() {
Within this JSON array, element 8 contains a sub-array; element 27 of that sub-array is the place_id.
In case of the Space Needle with CID 15792901685599310349 the place id is ChIJ-bfVTh8VkFQRDZLQnmioK9s.
Only Google knows why they don't provide a better way.
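The scraping approach can be sketched as below. It rests on several assumptions: that the argument of cacheResponse(...) is strict JSON, that the first ");" after it closes the call, and that the indices 8 and 27 from this answer still apply. The function name and the sample page are made up, and the real payload may need a more tolerant parser.

```javascript
// Hypothetical helper: reads the place_id from page source that embeds
// cacheResponse([...]); returns null if anything looks different.
function extractPlaceId(html) {
  const startMarker = 'cacheResponse(';
  const start = html.indexOf(startMarker);
  if (start === -1) return null;
  const end = html.indexOf(');', start);
  if (end === -1) return null;
  let data;
  try {
    data = JSON.parse(html.slice(start + startMarker.length, end));
  } catch (e) {
    return null; // payload was not strict JSON
  }
  const sub = Array.isArray(data) ? data[8] : null;
  return Array.isArray(sub) && typeof sub[27] === 'string' ? sub[27] : null;
}

// Made-up page that only mimics the structure described above.
const inner = new Array(28).fill(null);
inner[27] = 'ChIJ-bfVTh8VkFQRDZLQnmioK9s';
const outer = new Array(9).fill(null);
outer[8] = inner;
const samplePage = 'mapslite.getBasePageResponse = function(cacheResponse) {\n' +
  'cacheResponse(' + JSON.stringify(outer) + ');\n};';

console.log(extractPlaceId(samplePage)); // ChIJ-bfVTh8VkFQRDZLQnmioK9s
```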

HtmlAgilityPack sanitizing string issue

I'm using HtmlAgilityPack to sanitize user-entered rich text and strip any harmful or unwanted text. A problem occurs, though, when plain text is also treated as an HTML node.
If I enter
a<b, c>d
and try to sanitize it, the output generated is
a<b, c="">d</b,>
The code I used was
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(value);
// Sanitizing Logic
var result = doc.DocumentNode.WriteTo();
I tried setting different options on HtmlDocument ('OptionCheckSyntax', 'OptionAutoCloseOnEnd', 'OptionWriteEmptyNodes') so that the text would not be treated as a node, but nothing worked. Is this a known issue, or is there any workaround?
IMO, there's no way to tell HAP not to treat every '<' as the start of a new HTML node. But you can check whether your HTML is valid by using
string html = "your-html";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
if (doc.ParseErrors.Count() > 0)
{
//here you can ignore or do whatever you want
}

CodeIgniter Pagination with number in the middle of the query string

I searched the whole day for a solution but did not find any.
I have the same problem as this guy here: Codeigniter Pagination having page number in the middle of url but the "uri_segment" param doesn't work.
My Urls look like:
localhost/controller/0/some/filter/here/
The pagination returns the correct links for the next page and for page 2.
But once I go there, I get a wrong link back to the first page.
I did something like this:
/* Pagination Config */
$pagconfig['base_url'] = base_url($this->uri->segment(1).'/0/'.$this->uri->segment(3).'/'.$this->uri->segment(4).'/'.$this->uri->segment(5).'/'.$this->uri->segment(6).'/'.$this->uri->segment(7).'/'.$this->uri->segment(8));
$pagconfig['total_rows'] = $ress->num_rows;
$pagconfig['per_page'] = 10;
$pagconfig['uri_segment'] = 2;
$pagconfig['prefix'] = '/'.$this->uri->segment(1).'/';
$pagconfig['suffix'] = '/'.$this->uri->segment(3).'/'.$this->uri->segment(4).'/'.$this->uri->segment(5).'/'.$this->uri->segment(6).'/'.$this->uri->segment(7).'/'.$this->uri->segment(8);
I also tried using current_url() as the base_url config param, and of course I also tried uri_segment = 2 without the prefix and suffix.
It never worked properly.
Routes look like this:
$route['map/(:num)/(:any)'] = 'map/index/$1/$2';
$route['map/(:num)/(:any)/(:any)'] = 'map/index/$1/$2/$3';
$route['map/(:num)/(:any)/(:any)/(:any)'] = 'map/index/$1/$2/$3/$4';
$route['karte/(:num)/(:any)'] = 'map/index/$1/$2';
$route['karte/(:num)/(:any)/(:any)'] = 'map/index/$1/$2/$3';
$route['karte/(:num)/(:any)/(:any)/(:any)'] = 'map/index/$1/$2/$3/$4';
As you can see, I use two kinds of routes, translated for Google.
The routes and the controller also work!
If I type the URL in by hand, I get the correct paginated content:
For example: localhost/controller/10/some/filter/here returns every row beginning with 11 (it skips the first 10, as it should).
It is very important that the number always appears, even on the first page, where it is 0, as you can see above.
It would be great to get any help with this one...
Best Regards
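To make the target URL scheme concrete, here is a small sketch in plain JavaScript of the links the pagination is expected to produce, with the offset always in the second segment. It does not use or imitate CodeIgniter's Pagination library. The function names are made up, and the total of 25 rows is an arbitrary example value.

```javascript
// Hypothetical helper: builds one page URL with the offset as segment 2.
function pageUrl(firstSegment, offset, filterSegments) {
  return '/' + [firstSegment, String(offset), ...filterSegments].join('/') + '/';
}

// Lists the URL of every page for a given row count and page size.
function pageLinks(firstSegment, filterSegments, totalRows, perPage) {
  const links = [];
  for (let offset = 0; offset < totalRows; offset += perPage) {
    links.push(pageUrl(firstSegment, offset, filterSegments));
  }
  return links;
}

console.log(pageLinks('controller', ['some', 'filter', 'here'], 25, 10));
// [ '/controller/0/some/filter/here/',
//   '/controller/10/some/filter/here/',
//   '/controller/20/some/filter/here/' ]
```

The first page keeps the explicit 0, matching the requirement that the number is always present.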
