Body Text extraction from websites e.g. extract only article heading and text not all text in site - algorithm

I am looking for algorithms that allow text extraction from websites. I do not mean "strip html", or any of the hundreds of libraries that allow this.
So for example for a news article I would like to identify the heading and all the text, but not the comments section and so on.
Are there any algorithms for that out there? Thank you!

In the computer science literature this problem is usually referred to as page segmentation or boilerplate detection. See the report "Boilerplate Detection using Shallow Text Features" and its related blog post. I also have a few reports and software sites bookmarked that address the problem; see also this Stack Overflow question.
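To give a feel for what "shallow text features" can mean, here is a minimal sketch of a word-count and link-density heuristic. It is not the classifier from the report; the tag sets, the thresholds (`min_words`, `max_link_density`) and the names `BlockSplitter` and `extract_content` are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; real pages need a longer list.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "td", "section", "article", "ul", "ol"}
SKIP_TAGS = {"script", "style"}


class BlockSplitter(HTMLParser):
    """Split a page into text blocks and count the words inside links per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # (text, word_count, link_word_count)
        self._words = []
        self._link_words = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, if it holds any text.
        if self._words:
            self.blocks.append((" ".join(self._words), len(self._words), self._link_words))
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_content(html, min_words=10, max_link_density=0.33):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return [text for text, n, linked in parser.blocks
            if n >= min_words and linked / n <= max_link_density]
```

Navigation bars and "log in to comment" boxes tend to be short or mostly link text, so they are dropped. Note that short headings also fail the word threshold in this sketch, so the heading would need separate handling (for example, always keeping the first `h1`).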

There are a few open source tools available that do similar article extraction tasks.
One is https://github.com/jiminoc/goose, which was open-sourced by Gravity.com.
It has info on the wiki, as well as the source you can view. There are dozens of unit tests that show the text extracted from various articles.

"Content extraction" is a very difficult topic. There are no common standards to identify the "main article" content (there are several approaches to make HTML easier for crawlers to read, e.g. schema.org, but none of these is widely used).
So it turns out that if you want good results, it's probably best to define your own XPath selectors for each (news) website you want to scrape. There are some APIs for HTML content extraction, but as I said, it's very hard to develop an algorithm which works for every site.
Some APIs you could use:
alchemyapi.com
diffbot.com
boilerpipe-web.appspot.com
aylien.com
textracto.com
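The per-site selector approach could be sketched roughly like this. The host names, the selectors in `SITE_RULES` and the function `extract_article` are hypothetical, and the standard-library `xml.etree.ElementTree` used here only supports a subset of XPath and only parses well-formed markup; for real-world HTML you would need a lenient parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site rules: host -> (headline XPath, paragraph XPath).
SITE_RULES = {
    "news.example.com": (".//h1[@class='headline']", ".//div[@class='story']/p"),
    "blog.example.org": (".//h1", ".//article/p"),
}


def extract_article(host, markup):
    """Apply the selectors registered for `host` to well-formed markup."""
    title_xpath, body_xpath = SITE_RULES[host]
    root = ET.fromstring(markup)
    title_el = root.find(title_xpath)
    title = "".join(title_el.itertext()).strip() if title_el is not None else None
    # itertext() flattens inline tags such as <b> inside a paragraph.
    paragraphs = ["".join(p.itertext()).strip() for p in root.findall(body_xpath)]
    return {"title": title, "text": "\n\n".join(paragraphs)}
```

The trade-off is maintenance: every site needs its own entry, and an entry breaks whenever the site changes its layout.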

What you're trying to do is called "content extraction". It turns out to be a surprisingly hard problem to solve well, and many naive solutions do quite badly.
Instapaper and Readability both have to solve this, and you may learn something from looking at their solutions. They also both provide services that you may be able to take advantage of - perhaps you can outsource your problem to them and let their API take care of it. :)
Failing that, a search for "html content extraction" returns a great many useful results, including a number of papers on the subject.

I compared a few different libraries, and had really great luck with Mozilla's Readability library (Node), or its Python wrapper.
For example, take this CNN article: https://edition.cnn.com/2022/06/01/tech/elon-musk-tesla-ends-work-from-home/index.html
Readability successfully returns only the relevant data:
New York (CNN Business) Elon Musk is demanding that Tesla office workers return to in-person work or leave the company. The policy, disclosed in leaked emails Musk sent to Tesla's executive staff Tuesday, was first reported by electric vehicle news site Electrek. "Anyone who wishes to do remote work must be in the office for a minimum (and I mean *minimum*) of 40 hours per week or depart Tesla. This is less than we ask of factory workers," Musk wrote, adding that the office must be the employee's primary workplace where the other workers they regularly interact with are based — "not a remote branch office unrelated to the job duties." Musk said he would personally review any request for exemption from the policy, but that for the most part, "If you don't show up, we will assume you have resigned."
etc.

I think your best shot is to study what information you can get from the metadata and write a good HTML parser. oEmbed could be a good standard =)
https://oembed.com/#section7
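A minimal sketch of the metadata idea, using only the standard library. It looks for the oEmbed discovery link (a `link` element with type `application/json+oembed`) and, as an addition of mine that the answer does not mention, Open Graph `og:` meta tags. The names `MetadataParser` and `read_metadata` are made up for this example.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect Open Graph properties and the oEmbed discovery URL from a page."""

    def __init__(self):
        super().__init__()
        self.open_graph = {}
        self.oembed_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and "content" in attrs:
                self.open_graph[prop[3:]] = attrs["content"]
        elif tag == "link":
            # oEmbed discovery: <link rel="alternate" type="application/json+oembed" href="...">
            if attrs.get("type") == "application/json+oembed" and attrs.get("href"):
                self.oembed_url = attrs["href"]


def read_metadata(html):
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return parser.open_graph, parser.oembed_url
```

Metadata like this usually gives you a title, author or embed snippet rather than the full article body, so it complements a content extractor instead of replacing it.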

Related

Multi-Language Websites

Can anyone recommend a good option to translate websites into Spanish? We tried using the Google translate plugin but the translation was so rough (very inaccurate, bordering on embarrassing the company) we had to hire a company to refine the translation so that it was much more accurate which makes for an extremely inefficient process for updating the site moving forward.
We're in health insurance, so the language we're translating is very specialized in nature and needs to be accurate for our members. To make it even more complicated, the Google Translate plugin happens instantly, so the translation is live before we have a chance to refine it before users can see it. In other words, there's no way to refine the translation before you make the content visible to users in the production environment. This is a legal regulatory requirement for Covered California and the Affordable Care Act, so it has to be a top notch implementation.
Short of a proxy solution that intercepts the content before it hits the production site or a separate site coded in Spanish, I'm not sure what other solutions exist if any. Ideas? The separate site solution is also problematic because it requires a bilingual staff and it doubles the work because both environments have to mirror each other exactly at all times.
Recommendations? Ideas? Any suggestions based on experience are most welcome!
Hire a developer; he will spell out everything you need. You will never manage it on your own. If you already have one, hire a new one who knows how to do it. The question is very specific, but any engine or framework (take PHP, for example), or even a custom PHP engine, can be adapted the way you want.
Preview before publishing? Easy! Letting a moderator or admin change the translation values? Easy! The main thing is that you will have to provide each sentence (or even paragraph) yourself... I don't want to describe the whole mechanism; hire a developer and he will do all you need. $)
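To make the "preview and moderate" idea concrete, here is a rough sketch of an approval gate for translated strings. It is written in Python rather than PHP, and the class `TranslationStore` and its methods are invented for illustration; a real system would persist this in a database and tie `preview` to a logged-in reviewer.

```python
from dataclasses import dataclass, field


@dataclass
class TranslationStore:
    """Spanish strings keyed by their English source text, with an approval gate."""
    drafts: dict = field(default_factory=dict)
    approved: dict = field(default_factory=dict)

    def submit(self, source, translation):
        # A translator (human or machine) proposes a translation.
        self.drafts[source] = translation

    def approve(self, source):
        # A reviewer signs off; only now does the text become public.
        self.approved[source] = self.drafts.pop(source)

    def render(self, source, preview=False):
        # Reviewers see drafts; everyone else sees approved text or the English original.
        if preview and source in self.drafts:
            return self.drafts[source]
        return self.approved.get(source, source)
```

Because strings are keyed by the source text, editing an English sentence makes its Spanish version fall back to English until a new translation is approved, so unreviewed text never reaches members.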

software for organizing text

(I suspect that the question may not belong here as it's about software and not about programming. However, this is my computer community, and I trust you guys to refer me elsewhere if you think it's not appropriate to answer it here.)
So,
I'm writing a lot. Text. For myself. Diaries, ideas, insights, observations. It always comes in the form of passages, one passage at a time.
Until now I have been writing in Word documents, organizing them by rough categories divided into different documents, and by chronological order.
I have figured out that this is way suboptimal. I can have more, and I do need more.
I'm looking for software that will allow me to:
1 - tag passages
2 - store date and time automatically (created and edited)
3 - powerful full text search
4 - besides the above, I'd like it to have as many word processing capabilities as possible
Any recommendations for software that has this?
Now, I don't need this to be online. I'm doing this for myself, and don't want it to be published anywhere. I realize, however, that many web platforms may have much of what I need, so I don't automatically reject the possibility of using one for my offline needs.
Thanks guys
Gidi
You could install WordPress or any other suitable blogging software locally and have your own private blog. It lets you write passages as short as you like, and you can tag, categorize and search them. It keeps track of when each one was created and edited. And you can probably add a fair amount of word processing capabilities to it via plugins. And you could put it online when you wanted to.
It does require a bit of install overhead (probably XAMPP), though.

What kind of specs, documents, analysis do you get from superiors when starting a project? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 3 years ago.
I currently work in a small business (15-20 employees, 5 programmers) where most projects are custom-built CMSs and a few web application products.
Since I started working there, I have worked on many projects, but specifications for each project vary a lot. Sometimes we get a little detail, a Word document telling what the client wants, and what we are suggesting (suggested form fields, a short description of display, etc.). Sometimes almost nothing except "do what you think is the best approach for this project/module/request".
My question to you guys, who might work in different kinds of businesses, is: How (huge pile of paper? Word docs? Visios?) and what kind of information do you get from your superiors, managers, teammates when starting a project (plenty of analysis, drawings, etc.)? How much detail do you get on this?
Hope my question is clear enough, thank you.
Specs..that's kind of funny...how about never :(.
Seriously, a lot of companies assume specs aren't needed. It's absolutely unacceptable, but this is how it is in a LOT of companies. They assume a one-liner is enough and the programmer knows what the program should do, the inputs / outputs and so on.
Unfortunately in my case I have to actually help write the specs..and I'm the programmer :(.
I mostly get a lot of verbal direction and I use a voice recorder to record the conversation and transcribe it when I am done. I write my own specs from my customers' words.
Then, as a good consultant should, I take the writeup back to the customer and verify it, and get a signature and build it, and they live happily ever after! (No they don't, they change their mind 100 times.)
It can vary depending on what group the work falls under:
Support request - If the change will take a short period of time and is fixing something broken, there is this group. This could be as simple as, "Add Bob to the list of authorized users for that ancient form" where the form is something written years ago and aside from adding and removing users, it isn't touched for fear of breaking things.
Service Advisory Committee request - Items that are up to a few days are in this group as these are kind of like mini-projects as the request may be to create a new form or portal for a group. This could be upgrading some 3rd party software where we have some customizations that make the upgrade not necessarily a simple thing for Operations to do.
Project - In this case there are usually a few Word documents and/or e-mail threads that help nail down requirements in terms of scope, budget, and time. These can take months, though there is something to be said for having a prototype to change rather than creating the initial prototype to tell if requirements are really met or not. Of course, my current project is over a year old, still has a few more months on the timeline, and already has a successor coming after it is done, i.e. there is a Phase II to go after Phase I.
Uber project - These merit their own group of documentation and are the million-dollar, multiple-company projects. Trying to document everything up front rarely works out well here. Thus, there is some adoption of agile for these, but there are still some growing pains to go through as how we use agile matures. Think installing a dozen modules of some off-the-shelf software that requires both internal and external developers to customize the suite for our specific needs, as the software is supposed to be very robust, flexible and help save lots of time and money on how people otherwise do their jobs generally. Think ERP or CRM for a couple of examples here.
We are a 16-person company that creates and supports customized software for small retail shop owners.
The projects we get fall into three general categories (as related to specs):
"Here, automate this form." A sales person explains that our customer only wants this form to appear where they can fill it out and print it to make it look professional to their customer. Our spec is a single piece of paper that looks something like an order form or report. This is always false; they want pop-up lookups, automatic updating from other sources, and "while you're at it" add-ons that more than double the time. With these, we've learned to just live in the moment and let the project take its course. By the time we're done, the program doesn't look anything like their original form.
Small changes. Like a simple e-mail explaining that the background color is stale, or a request to sort a report by a different column. These, we just do as time allows.
Big company integrations, where we're tasked with making our software work with some big outfit like Intuit (QuickBooks) or FedEx (shipping rates). These often have well thought out documentation and sample code. We get hundreds of pages in Word documents or PDFs. The problem with these is when their specs are wrong. We find out about inaccuracies when we try to test or certify our integration. In these instances, we usually take longer in certification than we did to originally develop the processes.
In all cases, the real trouble is when a sales person promises a solution to the customer before even asking a programmer what it would take. As recently as 2 weeks ago, a sales person got into real trouble and had to issue a refund (that person is no longer with the company).
None - at least not from management.
Instead, as a developer (and particularly one leading a software project right now), I'm expected to contact my users/customers/etc and work directly with them to come up with our specifications and requirements. The documentation I do request from my team is only what will be useful to the team. I am lucky in that management rarely requests a document that doesn't make sense or won't provide some use to our project.
I currently have a half-dozen or so specs each 60-80 pages. One of them is 80 pages with no table of contents. Good times.
Our Product Managers and senior engineers prepare three planning docs for our data management software projects.
High-level requirements: 1-to-3 sentence descriptions of hardware/software supported or specific feature for this project. (10-15 pages of Excel-like grids)
Technical details: Engineering implementation of each high-level requirements. Up to a page for each, depending on amount of detail. (30-40 pages of filled-in feature details)
Business agreement: Summary of 1 & 2 with engineering schedule and Product Mgmt's market analysis. Everyone signs off on this. (5 pages analysis, 20 technical)
I haven't seen work flows or other Visio-like details in our specs. The prioritized requirements and schedule prove critical, so we understand when to lop things off to save development and testing time.

What are good/bad ways of providing help for an application..?

I'm in the process of developing various applications whose end users are both engineers and salespeople. Some of the operations and options may not be immediately obvious to all users. All applications are delivered with a PDF and paper manual - but of course nobody reads them!
I would like to improve the usability of the applications by including dynamic context-sensitive help. One option would be à la MSDN: have F1 call up a web page. However, internet access will not always be available, and even this will be too much effort for some.
Another idea is to have descriptions pop up when an option is hovered over - like a tooltip.
I'm interested in other people's views on this and what the best practices are in this situation. Along a similar theme to the post "What are common UI misconceptions and annoyances?", I'd like to start a discussion regarding these two points:
What would be the best way to go about it?
What help features in existing applications you use either delight or annoy you..?
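For what it's worth, offline context-sensitive help does not need a web page: the texts can ship with the application and be looked up by control ID, whether the trigger is F1 or a hover tooltip. Here is a minimal sketch in Python; the dotted ID scheme, the sample texts in `HELP_TOPICS` and the function `help_for` are all assumptions made up for this example.

```python
# Hypothetical help texts bundled with the application, keyed by dotted control ID.
HELP_TOPICS = {
    "export": "Export the current project to a file.",
    "export.format": "Choose the file format for the export.",
}


def help_for(control_id, topics=HELP_TOPICS):
    """Return the most specific help text, falling back to parent controls."""
    parts = control_id.split(".")
    while parts:
        text = topics.get(".".join(parts))
        if text:
            return text
        parts.pop()  # try the enclosing dialog or panel next
    return "No help available for this item."
```

The fallback to the parent means a control without its own entry still shows the help of its dialog, so coverage can be added incrementally.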
In my experience nobody but programmers reads the help. So when you have a technical and non-technical target audience you end up providing 2 ways of doing everything:
A Wizard with a few options.
A property editor with lots of options.
In either case, pictures are usually better than words for documentation. So a screenshot or 3 with big green arrows and circles calling out what does what will go a lot further than an indexed, exhaustive help file.
In my experience it would be very helpful to have a tooltip on each option that provides a little more definition/clarity for each option. Additionally, you can improve usability by having the default screen contain a few common, simple options and providing an advanced section that provides more control.
I'm currently working on a similar side-project. We have an existing product that's used by people as part of their day job. There is an inherent learning curve on the product, so users receive some degree of training and have people they can turn to for assistance. Even so, we know it needs more help and user documentation in general.
We are starting this help enhancement project by running a quick survey on the end users, (offering a prize draw as an incentive). We will also speak to the support staff who have to deal with help requests. This will uncover some of the pain points, and will give us a clear idea of how to focus our time & resources.
Guidelines on when to use inline tips vs tool tips etc can be found in various style guides, e.g. here:
http://developers.sun.com/docs/web-app-guidelines/uispec4_0/11-help.htm
Bear in mind that it's probably a bad idea to just copy & paste the text from your existing manuals into contextual help tips. You're going to need help writing completely new content. See if you can get some time from a technical writer / copywriter.

What are some great web based interfaces that you use on a day to day basis?

I definitely appreciate a good interface, and as a developer I try to create them for my users. But appreciating a good interface and designing one are different things. I'm looking for good interfaces (such as, IMHO, StackOverflow and Gmail) as examples of good UI on which I can model my own UIs.
I personally think that Netflix has an excellent web UI. Responsive, easy to navigate. Not much CRUD going on, but I find it very comfortable.
Pretty much anything by Google, really. They're all very simple and to the point, focusing on usability.
You should get yourself a copy of both Don't Make Me Think and The Non-Designer's Design Book for your base knowledge/insight.
From there, it's much easier for you to dissect and analyze the layouts you already know and like, and recreate them for your own amusement.
edit: To mitigate misunderstanding, the point I'm trying to make is that you probably don't need as many good examples of nice layouts, if you know what to look for. For example, I can be shown a thousand haute couture dresses, and I still couldn't make one myself, because I don't know what to look for.
My favorites
Stack Overflow: This is a WIKI so it's not a rep point grab. I just really love the interface on this site. Been to too many crappy Q/A sites
Google Reader
MSDN: It's gotten a ton better in recent years and is a great way to grab little esoteric details about various APIs
iStockPhoto.com: it's simple, effective and handles a large amount of information and data without getting bogged down. It also doesn't get in the way of the info you are looking for.
A good user interface fulfills a specific need of its users effectively.
As an example, here is a site (translation) that I have created for finding out what food is available in the cafeterias of the University of Helsinki. The typical use case is that when a student is hungry, he needs to know what food is available in the neighborhood student cafeterias (which are cheap for students), so that he can choose where to eat and what. He knows where each of those cafeterias is, but does not know what food they have today.
That site shows all the needed information at once. Because the students typically have a couple of cafeterias where they go, they can either bookmark the page with those cafeterias selected, or save the selection as a cookie. After that they can reach their goal without any navigation on the web site.
I don't use it on a day-to-day basis, but I'm very impressed with the Perseus Project digital library.
Here's a link to a poem from Catullus' Carmina in Latin as an example of the interface. Some features that I really like:
Click on the bar near the top to jump to any poem in the work. Larger chunks of the bar represent larger sections of the work (poems, chapters, however that particular work is logically broken up by the author).
Click on a Latin word in the poem to bring up a window (be patient; it seems to take a while) with lexicon entries, user voting and statistics on the word form (i.e. what the inflection means in the context of the sentence; it can be ambiguous in Latin) and so forth.
There are a number of resources down the right column, including various English translations, notes, references, etc. Any of them can be either shown in the right column, or swapped out with whatever is in the main content area in the center.
One of my personal favs: newspond.com
