I have a search bar on my unpublished website, and I was hoping there's some kind of code that could make it search my own website and return results from it. (As of now, using the search bar takes me to Google.)
<form id="tfnewsearch" method="get" action="http://www.google.com">
<input type="text" class="tftextinput" name="q" size="21" maxlength="120"><input type="submit" value="search" class="tfbutton">
</form>
<div class="tfclear"></div>
Any suggestions?
You have pointed the form's action at google.com; that is why you are redirected to Google.
For your own search you have to build a search page for your site and write database queries that select matching data from your database and display the result. Replace google.com in the action attribute with the URL of your own search page.
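For example, assuming you create such a search page or script on your own server at a (hypothetical) /search URL, the form would become:
<form id="tfnewsearch" method="get" action="/search">
<input type="text" class="tftextinput" name="q" size="21" maxlength="120"><input type="submit" value="search" class="tfbutton">
</form>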
It really depends on how your site is built, and whether it's fully accessible to the public.
For example, if it's completely public, you can still use Google to search your site by using the Google Custom Search API.
Otherwise, there's no magic potion. You will likely have to write some code to index your documents. Many sites achieve this by storing the information in a database, creating a full-text index of the site, and then querying the database. But this will require more than just CSS and HTML.
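Just to make the database route concrete, here is a rough sketch of what a tiny search endpoint could look like. It assumes a Node.js/Express server and an SQLite full-text index; the pages table, its columns, and the /search URL are all invented for the example and are not implied by your HTML.
// Hypothetical search endpoint: Express + better-sqlite3 with an FTS5
// full-text index over the site's pages. All names are illustrative only.
const express = require('express');
const Database = require('better-sqlite3');

const app = express();
const db = new Database('site.db');

// One-off setup: a full-text index holding each page's url, title and text.
db.exec("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, title, content)");

app.get('/search', (req, res) => {
  const q = (req.query.q || '').trim();
  if (!q) return res.json([]);
  // MATCH runs a full-text query against the index.
  const rows = db.prepare('SELECT url, title FROM pages WHERE pages MATCH ?').all(q);
  res.json(rows);
});

app.listen(3000);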
I want to scrape all the names of the users who commented below a youtube video.
I'm using ruby and nokogiri.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "https://www.youtube.com/watch?v=tntOCGkgt98"
doc = Nokogiri::HTML(open(url))
doc.css(".comment-thread-renderer > .comment-renderer").each do |comment|
name = comment.css("#comment-section-renderer-items .g-hovercard").text
puts name
end
But it's not working, I'm not getting any output, no error either.
I won't be able to give you a solution, but at least I can give you a couple of hints that may help you to move forward.
The code you have is not working because the comments section is loaded via an ajax call after the page is loaded. If you do a hard reload in your browser, you will see that there is a spinner icon and a Loading... text in the comments section while the content is being loaded. When Nokogiri gets the page via the http request, it gets the html content that you see before the comments are loaded. As a matter of fact, the place where the contents will later be added looks like:
<div id="watch-discussion" class="branded-page-box yt-card">
<div id="comment-section-renderer"
class="comment-section-renderer vve-check"
data-visibility-tracking="CCsQuy8iEwjr3P3u1uzNAhXIepAKHRV9D8Ao-B0=">
<div class="action-panel-loading">
<p class="yt-spinner ">
<span class="yt-spinner-img yt-sprite" title="Loading icon">
</span>
<span class="yt-spinner-message">Loading...</span>
</p>
</div>
</div>
</div>
That is the reason why you won't find the divs you are looking for, because they aren't part of the html you have.
Looking at the network console in the browser, it seems that the ajax request to get the comments data is being sent to https://www.youtube.com/watch_fragments_ajax?v=tntOCGkgt98&tr=time&distiller=1&ctoken=EhYSC3RudE9DR2tndDk4wAEAyAEA4AEBGAY%253D&frags=comments&spf=load. As you can see the v parameter is the video id, however there are a couple of caveats:
There is a ctoken param, which you can get by scraping the original page contents. It is inside a <script> tag, in the form of
'COMMENTS_TOKEN': "<token>".
However, you still need to send a session_token as form data in the body of the AJAX request (which is a POST). I don't know where that one comes from :(.
I think you will be pushing the limits of Nokogiri here, as AFAIK it is not intended to follow ajax requests or handle JavaScript. Maybe the ruby Selenium driver is better suited for this.
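If you go that route, a minimal, untested sketch could look roughly like this. It needs the selenium-webdriver gem plus a browser driver installed, and it reuses the selectors from your question, which may well have changed on YouTube's side since.
require 'selenium-webdriver'

driver = Selenium::WebDriver.for :firefox
driver.get 'https://www.youtube.com/watch?v=tntOCGkgt98'

# Scroll down so the page's own JavaScript fires the ajax call that loads the comments.
driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')

# Wait until the comment renderers actually show up in the live DOM.
wait = Selenium::WebDriver::Wait.new(timeout: 15)
wait.until { driver.find_elements(css: '.comment-thread-renderer > .comment-renderer').any? }

driver.find_elements(css: '.comment-thread-renderer > .comment-renderer').each do |comment|
  comment.find_elements(css: '.g-hovercard').each { |el| puts el.text }
end

driver.quit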
HTH
I think the issue is with the call name = comment.css("#comment-section-renderer-items .g-hovercard").
The each statement will iterate over the matched elements, yielding each one to the block variable.
You may want to use node as that variable instead of name:
doc.css(".comment-thread-renderer > .comment-renderer").each do |node|
name = node.css("#comment-section-renderer-items .g-hovercard").text
puts name
end
I wrote this rails app using nokogiri to see all the tags that a page has before any javascript is run in the browser. The source code is here, so you can adjust it if you need to add more info about the node in the view.
That can easily tell you if the particular tag element that you are looking for is something you can retrieve without having to do some JS eval.
Most web crawlers don't support client-side rendering, which gives you an idea that it's not a trivial task to execute JS when scraping content.
YouTube is a dynamically rendered JavaScript website, though it can be parsed with Nokogiri without using Selenium or another package. Try opening the Network tab in dev tools, scroll to the comment section, and see what request is being sent.
You need to make a POST request in order to fetch the comments data. You can preview the output in the "Preview" tab.
Note: since this answer brings very little value as it stands, it will be updated with the attached code once a working solution is available.
I'm working on a news publishing site that needs to load in stories from an RSS feed below the current news page. I've been using InfiniteAjaxScroll (http://infiniteajaxscroll.com/) with some success; however, I've hit a brick wall. There is no way for me to dynamically change which story should load next as you scroll down the page.
Does anyone know of any other plugins, tutorials, or examples that replicate behavior like this? I've searched but come up with nothing that meets these requirements.
I'm trying to create something similar to what the Daily Beast has implemented on their site.
http://www.thedailybeast.com/articles/2014/11/05/inside-the-democrats-godawful-midterm-election-wipeout.html
How do they know what stories to load in?
Thanks!
If you're using the InfiniteAjaxScroll library, the "next story" is whatever link you define as the next URL, and that link can be different for each story you load.
Imagine your first story's HTML as something like this:
<div class="stories">
<div class="story">
...
</div>
</div>
<div id="pagination">
<a href="storyC.html">next</a>
</div>
Then in the storyC.html you have
...
<div id="pagination">
<a href="storyD.html">next</a>
</div>
Assuming you're using some sort of dynamic backend, you would use some sort of logic to grab a related story and just set that URL as the "next" URL.
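Purely as a sketch of that idea (assuming a hypothetical Node.js/Express backend and made-up data helpers, since the question doesn't say what the server side looks like), each story page would simply be rendered with its own pagination link:
// Hypothetical server-side sketch: every story page gets a "next" link chosen
// by the backend, here via a stubbed "related story" lookup.
const express = require('express');
const app = express();

// Stub data layer: in a real app these would query your CMS or the RSS feed.
async function findStory(slug) { return { title: slug.replace(/-/g, ' ') }; }
async function findRelatedStory(story) { return { url: '/stories/another-story' }; }

app.get('/stories/:slug', async (req, res) => {
  const story = await findStory(req.params.slug);
  const next = await findRelatedStory(story);
  // The pagination link is different for each story served.
  res.send(`
    <div class="stories">
      <div class="story"><h1>${story.title}</h1> ...</div>
    </div>
    <div id="pagination">
      <a href="${next.url}">next</a>
    </div>
  `);
});

app.listen(3000);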
I've removed all post centered markup from my Tumblr theme and instead I'm using ajax to fetch the data. So far, so good. Now I want to add a like button to each post, but I can't seem to find any documents on how to do this (without resorting to their api, which needs oauth to work).
Is there no way to include like buttons when you use ajax to fetch the posts and you would rather not go full-fledged API with OAuth?
Tumblr's new implementation of the "Like button" for individual posts uses an <iframe> element to function. The URL for this iframe is obtainable only through your Theme code.
For example:
{Block:Posts}
<div class="like-button">{LikeButton}</div>
{/Block:Posts}
What is rendered for the {LikeButton} will look something like this:
<iframe id="like_iframe_84714330251" src="http://assets.tumblr.com/assets/html/like_iframe.html?_v=fa292ab73ee80893ffdf1edfabaa185a#name=blog-name-&post_id=84814329251&rk=reKNyFfj" scrolling="no" width="20" height="20" frameborder="0" class="like_toggle" allowtransparency="true"></iframe>
There does not seem to be any way to obtain this without including {LikeButton} inside of a {Block:Posts}
If you are using ajax, you could include a hidden element on the page that renders this information, and parse it out when loading each page of posts via ajax.
So if in your theme you included something like:
<div id="posts-info" style="display: none;">
{Block:Posts}
<div class="post-info" data-postid="{PostID}">{LikeButton}</div>
{/Block:Posts}
</div>
When you load your posts with AJAX, you would also have to load the correct page of your Tumblr (with this code in the Theme).
You could then parse this information by matching the post IDs to the posts you fetched with AJAX and insert that <iframe> code.
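A rough sketch of that parsing step, assuming jQuery, and assuming your ajax-loaded posts expose a data-postid attribute and a .like-button placeholder (both of those are up to your own markup):
// Fetch a themed page of the blog, pull each rendered like button out of the
// hidden #posts-info block, and attach it to the matching ajax-loaded post.
$.get('/page/2', function (html) {
  var page = $('<div>').append($.parseHTML(html));
  page.find('#posts-info .post-info').each(function () {
    var postId = $(this).data('postid');
    var likeButton = $(this).html(); // the {LikeButton} iframe markup
    // Assumes each ajax-loaded post carries the same data-postid value.
    $('.post[data-postid="' + postId + '"] .like-button').html(likeButton);
  });
});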
This is a really round-about solution but it should work.
I have this problem I'm facing. I have been working on a project using Grails based on the advice from a friend. I'm still a novice in using Grails, so any down to earth explanation would be highly welcomed.
My project is a web application which scans for broken or dead links and displays them on a screen. The main application is written in Java, and it prints the output (good links, bad links, pages scanned) continuously to the system console as the scan goes on. I've finished implementing my UI, controllers, views, and database using Grails. Now I would like to actively display, in a section of my GSP page (say forager.gsp), the current link being scanned, the current number of bad links found, and the current page being scanned.
The approach I have tried for this active display is to store the output my application writes to the console in a table in my database. This table has a single row which is constantly updated as the current page scanned, the number of good links found, and the number of bad links found change. Since this table is updated constantly, I've written an action in my controller which reads this single row and renders the result to my UI. The problem I'm now facing is that I need a way of refreshing the displayed result at regular intervals in my UI. I want the final output to look like:
scanning: This page, Bad links: 8, good links: 200
So basically here is my controller action which reads the table from the database
import groovy.sql.Sql

class PHPController {

    def index() {}

    def dataSource

    def ajax = {
        def sql = new Sql(dataSource)
        def errors = sql.rows("SELECT * FROM links")
        render(view: 'index', template: 'test', model: [errors: errors])
    }
}
Here is the template I render, test.gsp:
<table border="0">
    <g:each in="${ errors }" var="error">
        <tr><td>${ error.address }</td><td>${ error.error }</td><td>${ error.pageLink }</td></tr>
    </g:each>
</table>
For now I'm working with a test UI, which means this is not my UI but one I use for testing purposes, say index.gsp
<html>
<body>
<div><p>Pleaseeee, update only the ones below</p></div>
<script type="text/javascript">
function ClickMe(){
setInterval('document.getElementById("auto").click()',5000);
alert("Function works");
}
</script>
<div id="dont-touch">
<g:formRemote url="[controller:'PHP', action:'ajax']" update="ajaxDiv"
asynchronous="true" name="Form" onComplete="ClickMe()" after="ClickMe()">
<div>
<input id="auto" type="button" value="Click" />
</div>
</g:formRemote>
<div id="ajaxDiv">
<g:render template="/PHP/test"/>
</div>
</div>
</body>
</html>
The div I'm trying to update is "ajaxDiv". Anyone trying to answer this question can just assume that I don't have an index.gsp and can propose a solution from scratch. This is the first time I'm using Grails, and also the first time I'm dealing with ajax in any form. The aim is to dynamically fetch data from my database and display the result. Or, if someone knows how to directly mirror output from the system console onto the UI, that would also be great.
It sounds like a form would be appropriate for your needs. Check out the Grails documentation on forms. You should be able to render a form with the values you would like without too much trouble. Be sure to pay attention to your mapping and let me know if you have any questions after you have set index.gsp up to render a form for your values.
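For the periodic refresh itself you don't necessarily need formRemote at all. Building on the setInterval you already have, and assuming the default URL mapping exposes your action at /PHP/ajax and that the action renders only the template, a plain JavaScript poll is one way to sketch it:
<script type="text/javascript">
// Poll the controller action every 5 seconds and replace the div's contents
// with the freshly rendered template. Adjust the URL to match your URL mappings.
setInterval(function () {
    var xhr = new XMLHttpRequest();
    xhr.open('GET', '/PHP/ajax', true);
    xhr.onload = function () {
        if (xhr.status === 200) {
            document.getElementById('ajaxDiv').innerHTML = xhr.responseText;
        }
    };
    xhr.send();
}, 5000);
</script>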
Each blog post on my site -- http://www.correlated.org -- is archived at its own permalinked URL.
On each of these archived pages, I'd like to display not only the archived post but also the 10 posts that were published before it, so that people can get a better sense of what sort of content the blog offers.
My concern is that Google and other search engines will consider those other posts to be duplicate content, since each post will appear on multiple pages.
On another blog of mine -- http://coding.pressbin.com -- I had tried to work around that by loading the earlier posts as an AJAX call, but I'm wondering if there's a simpler way.
Is there any way to signal to a search engine that a particular section of a page should not be indexed?
If not, is there an easier way than an AJAX call to do what I'm trying to do?
Caveat: this hasn't been tested in the wild, but should work based on my reading of the Google Webmaster Central blog and the schema.org docs. Anyway...
This seems like a good use case for structuring your content using microdata. This involves marking up your content as a Rich Snippet of the type Article, like so:
<div itemscope itemtype="http://schema.org/Article" class="item first">
<h3 itemprop="name">August 13's correlation</h3>
<p itemprop="description" class="stat">In general, 27 percent of people have never had any wisdom teeth extracted. But among those who describe themselves as pessimists, 38 percent haven't had wisdom teeth extracted.</p>
<p class="info">Based on a survey of 222 people who haven't had wisdom teeth extracted and 576 people in general.</p>
<p class="social"><a itemprop="url" href="http://www.correlated.org/153">Link to this statistic</a></p>
</div>
Note the use of itemscope, itemtype and itemprop to define each article on the page.
Now, according to schema.org, which is supported by Google, Yahoo and Bing, the search engines should respect the canonical url described by the itemprop="url" above:
Canonical references
Typically, links are specified using the <a> element. For example, the
following HTML links to the Wikipedia page for the book Catcher in the
Rye.
<div itemscope itemtype="http://schema.org/Book">
  <span itemprop="name">The Catcher in the Rye</span>—
  by <span itemprop="author">J.D. Salinger</span>
  Here is the book's <a itemprop="url"
  href="http://en.wikipedia.org/wiki/The_Catcher_in_the_Rye">Wikipedia
  page</a>.
</div>
http://schema.org/docs/gs.html#advanced_enum
So when marked up in this way, Google should be able to correctly ascribe which piece of content belongs to which canonical URL and weight it in the SERPs accordingly.
Once you've finished marking up your content, you can test it using the Rich Snippets testing tool, which should give you a good indication of what Google thinks about your pages before you roll it into production.
p.s. the most important thing you can do to avoid a duplicate content penalty is to fix the titles on your permalink pages. Currently they all read 'Correlated - Discover surprising correlations' which will cause your ranking to take a massive hit.
I'm afraid it is not possible to tell a search engine that a specific area of your web page (for example, a div in your HTML source) should not be indexed. A workaround would be to put the content you do not want search engines to index into an iframe, and then use a robots.txt file with an appropriate Disallow rule to deny access to the file loaded by that iframe.
You can't tell Google to ignore portions of a web page but you can serve up that content in such a way that the search engines can't find it. You can either place that content in an <iframe> or serve it up via JavaScript.
I don't like those two approaches because they're hackish. Your best bet is to completely block those pages from the search engines since all of the content is duplicated anyway. You can accomplish that a few ways:
Block your archives using robots.txt. If your archives are in their own directory then you can block the entire directory easily. You can also block individual files and use wildcards to match patterns.
Use the <META NAME="ROBOTS" CONTENT="noindex"> tag to block each page from being indexed.
Use the X-Robots-Tag: noindex HTTP header to block each page from being indexed by the search engines. This is identical in effect to using the meta robots tag, although it can be easier to implement since you can set it in a .htaccess file and apply it to an entire directory (see the examples below).
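For example (the /archive/ directory and the file pattern below are hypothetical; adjust them to match how your archive pages are actually organized):
# robots.txt: keep crawlers out of a hypothetical archive directory
User-agent: *
Disallow: /archive/

# .htaccess (requires mod_headers): send the noindex header for matching pages
<FilesMatch "archive-.*\.html$">
    Header set X-Robots-Tag "noindex"
</FilesMatch>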