Nokogiri strategy for identifying largest text on a page?

I'm doing a comparison of a bunch of landing pages in the wild. I'm trying to pull out the main header and the call to action, but of course the HTML formatting of the pages varies wildly.
I started by looking for H1, H2, etc., assuming that the heading tags correspond to primacy, but this is often not the case. Rendered font-size* might be a better indicator, but this seems messy and wouldn't handle cases where images with alt text are used.
What's a good strategy to identify the main heading of 100 wild landing pages using Nokogiri?
*Also, is there a clever selector for rendered font-size?

You can't do it unless you have an AI running that can determine the most semantically important section of a document.
You can't count on the tags, such as headers or meta-tags, because those can be missing entirely.
You can't count on location in the source because CSS can move things anywhere.
And even if you think you've got it nailed by looking at the CSS, the JavaScript can rip that reality from you because it can override everything. In the end, it takes a human's eyes and brain to make sense of the final rendered page.
So, basically, you're going to be mostly shooting in the dark unless you have code that can understand the content of the page and determine how often a word occurs, along with its synonyms and their root words, and then determine their placement on the page after CSS and JavaScript have been run.
It's really a tough task that a lot of big companies are spending a lot of money on.

Related

Is it possible to code this website?

I'm a Graphic Designer.
I was wondering whether it's possible for a programmer to code this website, or whether I should redesign it.
I have doubts about how hard the header and footer are, and I think it would be really hard work for a programmer to code a website like this.
If it's not, please let me know then I will find a developer.

This is not impossible. It just takes a moment to think it through.
The footer can be made as a background image of the three colour splats which wraps three separate divs (Projects, Products, Contact Us). The header is a series of images absolutely positioned within a relative parent.
This is actually a very simple layout.

The design is a bit complex and unusual, but not impossible.
It's possible to convert this design into code with some script and CSS hacks, mainly position and z-index based hacks.

Coordinating graphic elements with streaming media

If you were watching the State of the Union Address (http://www.whitehouse.gov/state-of-the-union-2013) you would have seen graphic supplements that appeared alongside the video stream of the President and served to illustrate his key points.
The video on the site is a composite of this, but during the live streaming these were handled separately.
My question is: what is the best approach for doing this, especially if one wants very tight control over when the graphics appear (i.e. right when the point is made, not before and not long after)?
I'm wondering if any tools exist to facilitate this. I've been scouring Google, but I don't think that I have the correct technical vocabulary for what I'm describing, because I'm coming up blank.
I imagine AJAX would be a good starting point, but I'm not sure how to achieve the level of control that they had, or how to handle the back end of things.

For anyone who might encounter this challenge, we devised two ways to solve it:
The first is a bit Mickey Mouse: it requires that you know beforehand how many images, etc. you want to use (which in most cases you would). We wrote a script that repeatedly requests an image and, once it finds that image, inserts it into the page and starts requesting the next image in the chain.
I.e. display default image -> request image 1
then, displaying image 1 -> request image 2
etc
From your end you can simply drop the images into a folder on your server when you are ready for them to go in. An advantage of this is that the images can be interactive, with links to other content, etc.
The big disadvantage, of course, is a lot of unnecessary requests to your page. In our case we anticipated enough traffic that it didn't seem wise. Also, there are plenty of opportunities for mistakes, and depending on how frequently your timer fires there are likely to be timing discrepancies.
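The chain logic of this first method can be sketched roughly as follows. This is only an illustration of the idea, not the script we actually ran: the class name `ImageChainPoller`, the `total_images` parameter, and the block that answers "has image n been dropped into the folder yet?" are all hypothetical, and in a real page the `tick` call would be driven by a timer making HTTP requests.

```ruby
# Hypothetical sketch of the "request the next image in the chain" idea.
# The block passed to the constructor stands in for the real check
# (e.g. an HTTP request for image n that succeeds once the file exists).
class ImageChainPoller
  attr_reader :current_index

  def initialize(total_images:, &available)
    @total_images = total_images
    @available = available
    @current_index = 0 # 0 means the default image is still showing
  end

  # True once the last image in the chain has been displayed.
  def finished?
    @current_index >= @total_images
  end

  # One timer tick: ask only for the next image in the chain.
  # Returns the newly displayed image number, or nil if nothing changed.
  def tick
    return nil if finished?

    candidate = @current_index + 1
    return nil unless @available.call(candidate)

    @current_index = candidate
  end
end
```

Because each tick asks only for the next image, an image dropped into the folder out of order is not shown until its predecessors have appeared, and polling stops once the last image is displayed.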
The second costs money: we found the program Ustream (http://www.ustream.tv/producer), which gives us all the control over image timing that we require, with the added advantage of supporting media clips, etc. It also allows you to record everything streamed.
The disadvantage is that what the user sees is an integrated video on your site, so that you have to handle links to related content and provide images (if you want your users to have access to them) separately.
Hope this comes in handy for someone.
I would still welcome any suggestions on how to make the first method more effective.

CMS WYSIWYG Editors - What techniques do you use to client-proof these types of pages?

This is a topic that may be considered something not necessarily "programming related"; however, I feel it is since I'm asking for specific techniques.
Essentially, as a web developer, I work with a variety of platforms that include a WYSIWYG editor in the backend (TinyMCE, WYGWAM, etc) and one of the selling points of such systems is that it becomes easier to manage your own content because of these tools.
In theory, sounds great, in practice, not so much.
It can be way too easy for a client to break a layout by using many of the advanced features of a WYSIWYG editor. They can start floating things, setting too much margin/padding, etc.
Generally, I have tried to build any of these types of pages with only some sensible default styles applied to a few of the most common tags, such as setting a font size, colors, some margins, and some text decorations.
I would like to know if anyone has used anything more advanced to essentially turn the output of:
$cms->getContent();
...or equivalent into something that is effectively sandboxed and operates entirely agnostic of any other style/layout elements being used.
As often as possible, I express to clients that they should purchase an HTML/CSS book for Dummies and read it so that they aren't deer in headlights when they click "code view" in a WYSIWYG. But I know they don't do this, nor do they hire anyone who has experience, and it ends up allowing a client more control than they should responsibly have.
Plus, it sucks when you are using their sites as work samples to show others knowing they have the ability to take your beautiful design and development and make it look like crap.

A few things:
I have a standard WYGWAM config that I reuse on new sites by importing the exp_wygwam_configs table.
I keep options very limited in the editors.
Areas of the page delineated for images should use a File field, with an image resizer like CE Image used to ensure proper size.
Client training. Make videos with Camtasia or a similar tool if you have to.
Use a custom stylesheet for WYGWAM that has a small subset of styles, so they can choose h2...h4, for example, but not h1 or h5.

After encountering a lot of issues with WYSIWYG editors (which, by the way, never reflect accurately what you "get" in the end), I now prefer to leave only the most basic formatting features in the editor's configuration. For example, take a look at stackoverflow's editor.
It's got the following features: bold, italic, link, quote, pictures, lists, and alignments. The only special features here are code samples and HTML, which are targeted at this site's audience. Most of your clients don't need them.
I think it's the best approach, because if you give your clients the feeling that they can do whatever they want in the page, but this content is then filtered when the page is rendered, they are going to be really frustrated. Not to mention the fact that the site will be slowed by the filtering process and the need to put the filtered content in cache.
Sometimes the client does want a special layout in a page, but I think that is best done by customizing the CMS so that it fits the client's needs.

Splitting Ruby strings into pages

I've been thinking about this problem for a while, and I'm not quite sure of the best way to go about it.
In a Rails app I have books, which have many chapters, which have many sections. Chapters are basically just containers for sections, though they may contain strings of text themselves. The sections hold most of the book text.
I'm planning to build an HTML 5 ebook reader that works in a mobile browser, and I don't want the user to have to scroll down -- I want the text to break at the end of the page.
I'd assumed using split might be the way to go, but I'm not sure there's a way to break at regular intervals? Would a javascript option work better here?
I'd looked at this: Dividing text article to smaller parts with paging in Ruby on Rails but can't feasibly insert manual break marks in the text, some of which are 90,000+ words.
Any ideas would be appreciated.

I think the main problem here is that the page length will depend on the device (and possibly the text size, if that is a feature of your app). You should probably send large chunks at a time, each sure to be at least, say, 5 pages long, and then let the JavaScript do the paging. Rails has no access to the size of the display, nor should it.
Text requires very little data, you shouldn't worry about transmitting more than you need or keeping too much in memory.
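A minimal sketch of the server-side chunking, assuming a plain-text section and a chunk size picked generously enough to cover several pages on any device. The method name `chunk_by_words` and its parameter are made up for illustration; note that this simple version collapses runs of whitespace, including paragraph breaks.

```ruby
# Hypothetical helper: split text into chunks of at most words_per_chunk
# words, breaking only on whitespace so no word is cut in half.
# Note: String#split with no arguments collapses all runs of whitespace,
# so paragraph breaks are not preserved by this simple version.
def chunk_by_words(text, words_per_chunk)
  raise ArgumentError, "words_per_chunk must be positive" unless words_per_chunk.positive?

  text.split.each_slice(words_per_chunk).map { |words| words.join(" ") }
end
```

Each chunk would then be sent to the client, which decides where the actual page breaks fall.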

You may use a blank line ("\n" or "") as the separator.
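As a rough sketch of that idea, assuming paragraphs in the stored text are separated by one or more blank lines (the method name `split_paragraphs` is hypothetical):

```ruby
# Hypothetical helper: split text into paragraphs wherever a blank line
# (possibly containing only spaces or tabs) occurs, then drop empty pieces.
def split_paragraphs(text)
  text.split(/\n\s*\n/).map(&:strip).reject(&:empty?)
end
```

This gives natural break points, but paragraphs vary in length, so it does not by itself guarantee that a piece fits on one screen.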

I'd send enough of the page content down to easily fill a page and more, then use JavaScript on the client side to remove sentences from the page until the scroll-bar disappears.
Resize.js is something similar I wrote a while ago. I wanted to enlarge/reduce the font size used on a screen until the screen was just full (for a dashboard monitor). Yours would be similar, but instead of changing the font size, you are trimming off sentences.
Let me know if you can't see how to adapt this code.
Note: I would also make the javascript note the amount of text it ends up displaying, and pass that to the server in the 'next page' request, so the server knows where to start the next page from.
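On the server, handling that 'next page' request could look something like the sketch below. It assumes the client reports how many characters it ended up displaying, so `offset` is the running total of characters already shown; the method name `next_chunk` and both parameters are hypothetical.

```ruby
# Hypothetical helper: return up to max_chars of text starting at offset,
# where offset is the number of characters the client reports it has
# already displayed. Backs up to the last whitespace so that the chunk
# does not end in the middle of a word.
def next_chunk(text, offset, max_chars)
  return "" if offset >= text.length

  chunk = text[offset, max_chars]
  more_text_follows = offset + max_chars < text.length
  if more_text_follows && text[offset + max_chars] !~ /\s/
    cut = chunk.rindex(/\s/)
    # Keep the full chunk if the only whitespace is at position 0,
    # so the reader always makes progress.
    chunk = chunk[0, cut] if cut && cut > 0
  end
  chunk
end
```

The client would trim this chunk to fit the screen, then add the number of characters it actually displayed to the offset it sends with the following request.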

Best way to fit a lot of stuff onto one page

I have created a webpage, but my boss came back saying that the page is too busy. I just wanted some ideas for how to split up the page, e.g. accordion, tabs, etc. What tactics have you used to break up a page into different sections?

You already named 2 of the most popular ones: accordions and tabs. The other one you're missing is "rotators".
Here's an example of one: http://www.zurb.com/playground/jquery_image_slider_plugin (also happens to be a good jQuery plugin).
Keep in mind that you can also reduce clutter by using more vertical space and embracing scrolling. Not everything has to be above the fold.
