How does Facebook grab the text of the article when pasting the URL?

I'm a bit curious about this useful Facebook functionality. When I paste a URL on the 'What's on your mind?' box, it almost perfectly gets the body of the article. How does Facebook do this?
Thanks!

It's part of how Facebook Share works.
The URL Linter is pretty helpful as well. For example, if we test it with this very question, you can scroll down and see where it's getting the data from:
"Hello, Im a bit curious about this
Facebook's useful functionality. When
I paste a URL on the 'What's on your
mind?' box, it almost perfectly gets
the body of the article. How does
Facebook do this?" extracted from
<description> or first <p>

I can't speak for Facebook specifically, but there are entire companies dedicated to providing that kind of service. For example, Reddit recently outsourced preview generation to a 3rd party.
So, essentially, there's a certain amount of automation and a large amount of manual tweaking and configuration.
You might also look at the Readability tool, which extracts the main content of a web page - that might provide some insight into the processes involved.

You can put your own entries into the shared content by using the Open Graph protocol markup described on the Facebook developer website.

It basically goes to the page and sniffs for IDs in the HTML such as content or main, plus probably a few other common names people use when building a site to mark where the menus, main body, sidebars, and main article are placed in the page, whether it's assembled dynamically or not.
For example, look at the source of this page itself. You'll see an area that begins with <div id="content">.
Bingo. That's where the Facebook sniffer begins. It then grabs what is probably the first picture it finds within that area, as well as the first bit of text.
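For the curious, here's a rough sketch in Ruby (using Nokogiri) of that kind of heuristic. It's only an approximation of what a scraper might do, not Facebook's actual code, and the URL is a placeholder:

require 'open-uri'
require 'nokogiri'

# Return the content attribute of the first node matching the selector, if any.
def meta_content(doc, selector)
  node = doc.at(selector)
  node && node['content']
end

# Prefer Open Graph / description meta tags, then fall back to a "content"-style
# container and take its first paragraph and first image.
def preview_for(url)
  doc = Nokogiri::HTML(URI.open(url))

  paragraph = doc.at('#content p, #main p, p')
  image     = doc.at('#content img, #main img, img')

  {
    title: meta_content(doc, 'meta[property="og:title"]') || (doc.at('title') && doc.at('title').text),
    text:  meta_content(doc, 'meta[property="og:description"]') ||
           meta_content(doc, 'meta[name="description"]') ||
           (paragraph && paragraph.text),
    image: meta_content(doc, 'meta[property="og:image"]') || (image && image['src'])
  }
end

puts preview_for('https://example.com/some-article').inspect  # hypothetical URL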

Related

SEO with AngularJS and an ASP.NET RESTful service

I have developed a website using AngularJS and Web API.
The problem is that the AJAX-rendered content is not crawlable by Google, so no one can find the website using Google search.
After reading many articles on this issue, including one article that links out to all the relevant explanations, the Google AJAX crawling protocol, and a Stack Overflow question, I couldn't find a proper solution. The ones that mention ASP.NET solutions talk about MVC, and I only need simple REST via Web API; the other articles don't cover ASP.NET at all.
Is there any simple explanation?
I'm the one who asked this same question long ago, so I will answer from my experience:
Firstly, if all your content is accessible via unique URIs (including the hashbang, if you use it), modern search engines should index it just fine. In fact, Google can now index JavaScript-generated content. You can try that via Google Webmaster Tools and see how your site is indexed.
Secondly, there are libraries that help you serve pre-rendered content to search engines if you need to, but in my case I didn't bother much with it since Google was indexing the JavaScript content nicely.
I've seen others ask this question, and maybe I'm missing something or this is outdated, but I don't see why AngularJS needs to be an issue with SEO.
Say you have a landing page and it has a bunch of links. Assuming you're using HTML5 mode in AngularJS (and I'm not sure that's 100% necessary) and something like ngRoute, then the links on the landing page can work both as "Angular" (JavaScript) links and as "old school" (full page load) links.
If you're a human user you can click a link and it will do angular magic and adjust the content without loading the full page. Ok, all fine.
But if you instead copy the link and paste it in a new tab or new browser, it will still work - assuming you've set up routes correctly.
I'm not an SEO expert by any stretch of the imagination, but as I understand it, having links that load pages, and having those pages contain real and useful content, is the core of SEO. Done this way, AngularJS should work fine. The key thing to check is that a link still works when you copy and paste it (not just click it).

Scraping pages that do not seem to have URLs

I'm trying to scrape these listings and provide more exposure for these job listings on a site that belongs to a client of mine. The issue is that I need to be able to link to the specific job listing in order for the job seeker to apply. This is the page I'm trying to save listing links from.
It would be ideal if I could save an address for the job seeker to click on to see the original listing and then apply.
What is this website doing to not feature a URL for these pages?
Is it possible to provide a listing-specific address?
If that's possible, how could I generate that address?
If I can't get a specific address I think I could get it so that the user clicks a link that triggers an internal script on my client's site which takes the listing ID and searches the site I found that listing on, and then redirects the user to that specific listing.
The downside to this is that the user will have to wait a little while depending on how far back the listing is on a directory. I could put some kind of progress bar with a pleasant "Searching for your listing! Thanks for being patient" message.
If I can avoid having to do this, though, that'd be great!
I'm using Nokogiri and Mechanize.
The page you refer to appears to be generated by an Oracle product, so one would think they'd be willing to construct a web form properly (and with reference to accessibility concerns). They haven't, so it occurs to me that either their engineer was having a bad day, or they are deliberately making it (slightly) harder to scrape.
The reason your browser shows no href when you hover over those links is that there isn't one. What the page does instead is to use JavaScript to capture the click event, populate a POST form with some hidden values, and call the submit method programmatically. This can cause problems with screen-readers and other accessibility devices, as well as causing problems with the way in which back buttons have to re-submit the page.
The good news is that constructions of this kind can usually be scraped by creating a form yourself, either using a real one on a third party page, or via a crawler library. If you post the right values to the target URI, reverse-engineered from examining the page's script, the resulting document should be the "linked" page you expect.
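Since the question already mentions Mechanize, here's a hedged sketch of that approach: load the listings page, set the hidden fields that the page's JavaScript click handler would have populated, and submit the form directly. The URL, form name, and field names/values below are placeholders; the real ones have to be reverse-engineered from the page source and its script.

require 'mechanize'

agent = Mechanize.new
page  = agent.get('https://example.com/job-search-results')  # hypothetical listings URL

# The page's onclick handler fills hidden inputs and calls submit(); do the same
# thing directly. 'listingForm' and 'selectedListingId' are placeholder names.
form = page.form_with(name: 'listingForm') || page.forms.first
form['selectedListingId'] = '12345'  # hypothetical hidden field the click handler sets

detail_page = agent.submit(form)
puts detail_page.title

The resulting document should be the "linked" listing page, which you can then parse with Nokogiri as usual.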

Can a URI fragment (hash) affect SEO? Will such pages get indexed by search engines?

I am building a forum site where the post is retrieved on the same page as the listing via AJAX. When a new post is shown, the URI fragment is changed (ex: .php#1_This-is-the-first-post). Also the title and meta tags are changed.
My question is this: I have read that search engines aren't able to use #these-words, so my entire site won't be indexable (it will look like one page).
What can I do to get around this, or at least make my sub-pages indexable?
NOTE: I have built almost all of the site, so radical changes would be hard. SEO is my weakest geek-skill.
Add non-AJAX versions of every page, and link to them from your popups as "permalinks" (or whatever you want to call them). Not only aren't your pages available to search engines, they can't be bookmarked or emailed to friends. I recently worked with some designers on a site and talked them out of using an AJAX-only design. They ended up putting article "teasers" in popups and making users go to a page with a bookmarkable URL to read the complete texts.
As difficult as it may be, the "best" answer may be to re-architect your site to use the hash-based URL scheme more sparingly.
Short of that, I'd suggest the following:
Create an alternative, non-hash based URL scheme. This is a must.
Create a site-map that allows search engines to find your existing pages through the new URL scheme (see the sketch after this list).
Slowly port your site over. You might consider adding these deeper links on the page, or encourage users to share those links instead of the hash-based ones, etc.
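As a rough illustration of the site-map step, here's a minimal Ruby sketch that writes a sitemap.xml for the new non-hash URLs. The URLs are hypothetical; in practice they would come from your forum's database:

# Hypothetical permalinks generated from your forum database.
post_urls = [
  'https://example.com/posts/1-this-is-the-first-post',
  'https://example.com/posts/2-another-post'
]

entries = post_urls.map { |url| "  <url><loc>#{url}</loc></url>" }.join("\n")

sitemap = %(<?xml version="1.0" encoding="UTF-8"?>\n) +
          %(<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n) +
          entries + "\n</urlset>\n"

File.write('sitemap.xml', sitemap)

Submit the generated file through Google Webmaster Tools (or reference it in robots.txt) so the crawler can discover the non-hash pages.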
Hope this helps!

Content Watermarking

We have members-only paid content that is frequently copied and republished without our permission.
We are trying to 'watermark' our content by including each customer's user id in a fake CSS class, for example <p class='userid_1234'> (except not so obvious, of course :), which would help us track the source of the copying; we then place that class somewhere in the article body.
The problem is, by including user-specific information into an article, it makes it so that the article content is ineligible for caching because it is now unique to each user.
This bumps the page load time from ~.8ms to ~2.5sec for each article page view.
Does anyone know of any watermarking strategies that can still be used with caching?
Alternatively, what can be done to speed up database access? (Ha, ha, that's just a tiny topic, I'm sure...)
We're using the CMS Expression Engine, but I'd like to hear about any strategies. They don't have to be EE-specific.
If you're talking about images then you could use PHP to add a watermark to the images.
How can I add an image onto an image in PHP like a watermark
It's a tool to help track down the lazy copiers who just copy the source code as-is. This is not preventative, nor is it a deterrent. – Ian
Going by your comment above, you are happy with users copying your content, just not without the formatting etc. So what you could do is provide users with an embed-style snippet for that particular content, just like YouTube does with videos. Into that embed code you could add your own links back to your site, use your own CSS, and so on.
That way you can still allow the members to use the content, but it will always come out the way you intended, with links back to your site.
Thanks
You could always cache a version that uses a special string, like #!username!#, and then later fill it in with PHP based on which user is viewing it.
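The same placeholder idea works in any language; here's a minimal Ruby sketch of it (the token and helper names are just illustrative), assuming the article HTML is cached once with the literal token in it and the token is swapped for the viewer's id at serve time:

# The cached article HTML contains this literal token, so the expensive
# query/render step stays fully cacheable and shared across users.
WATERMARK_TOKEN = '#!username!#'

def render_article(cached_html, current_user_id)
  # The only per-user work is a cheap string substitution.
  cached_html.gsub(WATERMARK_TOKEN, "userid_#{current_user_id}")
end

cached = %(<p class="#!username!#">Members-only article body...</p>)  # fetched from your cache
puts render_article(cached, 1234)
# => <p class="userid_1234">Members-only article body...</p>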
Another approach is to switch from caching on the server to letting the browser cache the page locally for a while. That way it is only cached per user, and it still reduces the calls to your database. Because an article is pretty static, you could let the browser cache it and pull in the comments via JavaScript.
This last one is probably not the answer you are really looking for, but I'm going to come out and say it anyway: you could stop treating your users like thieves and instead treat the actual thieves as thieves. Go to whoever is hosting the servers your content is on and send them an email telling them copyrighted premium content is being hosted on their servers without your permission. You can even automate that process.
How do you find out which sites are posting your content? Put a link back to your site in the body content, then do a Google Search/Blog Search for articles linking to that site. To automate it, use Google Blog Search, because it offers RSS feeds. Anything with a link back to your site could go into a database along with a link to the offending page; someone could review it, and if it reproduces the entire article, do a Whois lookup and send them an email.
What makes you think adding CSS to something is going to stop people from copying it without that CSS? It's more likely that they are just copying the source of the content you are showing them and ignoring all the styling around it. For example, I use Tamper Data to look at all HTTP requests made by Firefox; if I can see it on the page, I can see it in the logs. Even with all the "protection" some sites try to put in place, it generally never works. I can grab what I want without using any screen capture/recording.
If you were serving FLVs, for example, I would easily be able to grab the source of those even if you overlaid them with some CSS. I think the best approach would be to find the sites publishing your premium content and ask them to remove it. It's either that or watermark the actual content on the fly while sending it to the browser.

Screen scraping an ASP.NET web page to retrieve data displayed in a GridView

I am using Ruby to screen-scrape a web page (created in ASP.NET) which uses a GridView to display data. I am able to read the data displayed on page 1 of the grid, but I can't figure out how to move to the next page of the grid to read all the data.
The problem is that the page-number hyperlinks are not normal hyperlinks (with a URL); instead they are JavaScript hyperlinks that cause a postback to the same page.
An example of such a hyperlink is a pager link whose visible text is just "6" and whose href calls a JavaScript postback function rather than pointing to a URL.
I recommend using Watir, a Ruby library designed for browser testing, if you're already using Ruby for processing. For one thing, it gives you a much nicer interface to the DOM elements on the page, and it makes clicking links like this easier:
ie.link(:text, '6').click
Then, of course you have easier methods for navigating the table as well. It's easy enough to automate this process:
(1..total_number_of_pages).each do |next_page|
  ie.link(:text, next_page.to_s).click
  # table processing goes here
end
I don't know your use case, but this approach has its advantages and disadvantages. For one thing, it actually runs a browser instance, so if this is something you need to frequently run quietly in the background in completely automated way, this may not be the best approach. On the other hand, if it's ok to launch a browser instance, then you don't have to worry about all that postback nonsense, and you can just click the link as if you were a user.
Watir: http://wtr.rubyforge.org/
You'll need to figure out the actual URL.
Option 1a: Open the page in a browser with good developer support (e.g. Firefox with the web development tools) and look through the source to find where __doPostBack is defined. Figure out what URL it's constructing. Note that it might not be in the main page source, but instead in something that the page loads.
Option 1b: Ditto, but have Ruby do it. If you're fetching the page with Net::HTTP you've got the tools to find the definition of __doPostBack already (the body as a string, Ruby's grep, and the ability to request additional files, such as those in script tags).
Option 2: Monitor the traffic between a browser and the page (e.g. with a logging proxy) to find out what the URL is.
Option 3: Ask the owner of the web page.
Option 4: Guess. This may not be as bad as it sounds (e.g. if the original URL ends with "...?page=1" or something) but in general this is the least likely to work.
Edit (in response to your comment on the other question):
Assuming you're using the Net::HTTP library, you can perform the postback by replacing your get with a post and supplying the form data, e.g. my_http.post(path, form_data) instead of my_http.get(path).
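For example, a minimal sketch using Net::HTTP.post_form; the URL and form fields are placeholders for whatever you reverse-engineered from the page:

require 'net/http'
require 'uri'

uri = URI('http://example.com/SomePage.aspx')  # hypothetical page URL

# Plain GET (what a normal fetch does):
puts Net::HTTP.get_response(uri).code

# Postback: POST the fields that the page's JavaScript would have submitted.
# 'page' => '2' is a placeholder for the real reverse-engineered form data.
puts Net::HTTP.post_form(uri, 'page' => '2').code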
Edit (in response to danieltalsky's answer):
Watir may be a really good solution for you (I'm kicking myself for not having thought of it), but be aware that you may have to manually fire the event or jump through other hoops to get what you want. As a specific gotcha, with any asynchronous fetch like this you need to make sure the full response has come back before you scrape it; that isn't a problem when you're doing the request inline yourself.
You will have to perform the postback. The data is passed with a form POST back to the server. As Markus said, use something like Firebug or the Developer Tools in IE 8, plus Fiddler, to watch the traffic. But honestly, this is a Web Forms page using the bloated GridView, so you're in for a fun adventure. ;)
You'll need to do some investigation to figure out what HTTP request the JavaScript execution is performing. I've used the Mozilla browser with the Firebug plugin and also the "Live HTTP Headers" plugin to help determine what is going on. It will likely become clear which requests you need to make in order to traverse to the next page. Make sure you pay attention to any cookies getting set.
I've had really good success using Mechanize for scraping. It wraps all of the HTTP communication, HTML parsing and searching (using Nokogiri), redirection, and holding on to cookies. But it doesn't know how to execute JavaScript, which is why you will need to figure out what HTTP request to perform on your own.
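To make that concrete, here's a hedged Mechanize sketch for an ASP.NET GridView pager. __doPostBack works by filling the hidden __EVENTTARGET and __EVENTARGUMENT fields and re-submitting the page's single form, and Mechanize carries the existing __VIEWSTATE along because it re-submits that same form. The URL, control ID, and page count below are assumptions to be read from the actual page source:

require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://example.com/Listings.aspx')  # hypothetical URL

current = page
2.upto(5) do |n|  # pages 2..5; adjust to however many pages the grid reports
  form = current.forms.first  # the single ASP.NET <form>, ViewState fields included

  # Mimic __doPostBack('ctl00$GridView1', 'Page$n') by setting the hidden fields
  # it would set. The control ID is an assumption -- copy the real one from the
  # pager link's href in the page source.
  form['__EVENTTARGET']   = 'ctl00$GridView1'
  form['__EVENTARGUMENT'] = "Page$#{n}"

  current = agent.submit(form)
  # ...parse the grid rows on `current` here, e.g. current.search('table tr')...
end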
