There are quite a lot of tools for extracting text from PDF files [1-4]. The problem with most scientific papers, however, is that it is hard to access the PDF directly, usually because you have to pay for it. There are tools that make it easy to get a paper's information, such as its metadata or BibTeX entry [5-6]. What I want is to take a step forward and go beyond just the BibTeX/metadata:
Assuming there is no direct access to a publication's PDF file, is there any way to obtain at least the abstract of a scientific paper given its DOI or title? Through my search I found that there have been some attempts [7] at something similar. Does anyone know a website or tool that can help me obtain or extract the abstract or full text of scientific papers? If there are no such tools, can you give me some suggestions for how I should go about solving this problem?
Thank you
[1] http://stackoverflow.com/questions/1813427/extracting-information-from-pdfs-of-research-papers
[2] https://stackoverflow.com/questions/6731735/extracting-the-actual-in-text-title-from-a-pdf
[3] http://stackoverflow.com/questions/6731735/extracting-the-actual-in-text-title-from-a-pdf?lq=1
[4] http://stackoverflow.com/questions/14291856/extracting-article-contents-from-pdf-magazines?rq=1
[5] https://stackoverflow.com/questions/10507049/get-metadata-from-doi
[6] https://github.com/venthur/gscholar
[7] https://stackoverflow.com/questions/15768499/extract-text-from-google-scholar
You can have a look at the Crossref Text and Data Mining (TDM) service (http://tdmsupport.crossref.org/). Crossref provides a RESTful API for free, and more than 4,000 publishers contribute to the TDM service.
You can find some examples from the link below:
https://github.com/CrossRef/rest-api-doc/blob/master/rest_api_tour.md
But to give a very simple example:
If you go to the link
http://api.crossref.org/works/10.1080/10260220290013453
you will see that, besides some basic metadata, there are two other fields, license and link: the former tells you under what kind of license the publication is provided, and the latter gives the URL of the full text. For our example, the license metadata shows a Creative Commons (CC) license, which means the text is free to use for TDM purposes. By searching for publications with CC licenses within Crossref, you can access hundreds of thousands of publications with their full texts. From my latest research I can say that Hindawi is the most TDM-friendly publisher; they alone provide more than 100K publications with a CC license. One last thing: full texts may be provided in either XML or PDF format. The XML versions are highly structured and therefore easy to extract data from.
To sum it up, you can automatically access many full texts through crossref tdm service by employing their API and simply writing a GET request. If you have further questions do not hesitate to ask.
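To illustrate, here is a minimal Python sketch of such a GET request against the works endpoint (standard library only; the license and link field names follow the Crossref JSON response as I understand it, so double-check them against the API documentation):
import json
import urllib.request
doi = "10.1080/10260220290013453"
url = "https://api.crossref.org/works/" + doi
with urllib.request.urlopen(url, timeout=30) as resp:
    message = json.loads(resp.read().decode("utf-8"))["message"]
# license and link are lists of dicts when the publisher has deposited them
for lic in message.get("license", []):
    print("license:", lic.get("URL"))
for link in message.get("link", []):
    print("full text:", link.get("content-type"), link.get("URL"))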
Cheers.
Crossref may be worth checking. They allow members to include abstracts with the metadata, but it's optional, so the coverage isn't comprehensive. According to their help desk when I asked, they had abstracts available for around 450,000 DOIs registered as of June 2016.
If an abstract exists in their metadata, you can get it using their UNIXML format. Here's one specific example:
curl -LH "Accept:application/vnd.crossref.unixref+xml" http://dx.crossref.org/10.1155/2016/3845247
If the article is on PubMed (which contains around 25 million documents), you can use Biopython's Entrez module (Bio.Entrez) to retrieve the abstract.
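A small sketch of that route, assuming Biopython is installed (the contact email and the paper title below are placeholders):
from Bio import Entrez
Entrez.email = "you@example.org"  # NCBI asks for a contact address
# Look up the PubMed ID by title (placeholder title)
handle = Entrez.esearch(db="pubmed", term='"Some paper title"[Title]')
ids = Entrez.read(handle)["IdList"]
handle.close()
if ids:
    # Fetch the citation, including the abstract, as plain text
    handle = Entrez.efetch(db="pubmed", id=ids[0], rettype="abstract", retmode="text")
    print(handle.read())
    handle.close()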
Using curl (this works on my Linux):
curl http://api.crossref.org/works/10.1080/10260220290013453 2>&1 | # doi after works
grep -o -P '(?<=abstract":").*?(?=","DOI)' | # get text between abstract":" and ","DOI
sed -E 's/<jats:p>|<\\\/jats:p>/\n/g' | # substitute paragraph tags
sed 's/<[^>]*>/ /g' # remove other tags
# add "echo" to show unicode characters
echo -e $(curl http://api.crossref.org/works/10.1155/2016/3845247 2>&1 | # doi after works
grep -o -P '(?<=abstract":").*?(?=","DOI)' | # get text between abstract":" and ","DOI
sed -E 's/<jats:p>|<\\\/jats:p>/\n/g' | # substitute paragraph tags
sed 's/<[^>]*>/ /g') # remove other tags
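The same extraction is less fragile if you parse the JSON instead of grepping it. A rough Python equivalent (the abstract field holds JATS-flavoured XML and is only present when the publisher deposited it):
import json
import re
import urllib.request
doi = "10.1155/2016/3845247"
url = "https://api.crossref.org/works/" + doi
with urllib.request.urlopen(url, timeout=30) as resp:
    message = json.loads(resp.read().decode("utf-8"))["message"]
abstract = message.get("abstract", "")            # JATS XML string, possibly empty
abstract = re.sub(r"</?jats:p>", "\n", abstract)  # paragraph tags -> newlines
abstract = re.sub(r"<[^>]*>", " ", abstract)      # strip any remaining tags
print(abstract.strip())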
using R:
library(rcrossref)
cr_abstract(doi = '10.1109/TASC.2010.2088091')
I want to read out and later process a value from a website (Facebook Ads) in a bash script that runs daily. Unfortunately, I need to be logged in to get this value.
So far I've figured out how to log into the website in Firefox and save the HTML file from which the value could theoretically be read out.
The only unique identifier in this file is the first instance of "Gesamtausgaben" (German for "total spend"). Is there any way, with this information, to cut out everything besides "100,10"?
I'd also be happy with a different way of getting this value. And no, I don't have any API access.
I appreciate all ideas.
Thanks,
Patrick
How to Parse HTML (Badly) with PCRE
You can't reliably parse HTML with just regular expressions, so you'll need an XML/HTML or XPath parser to do this properly. That said, if you have a PCRE-compatible grep, then the following will likely work, provided the HTML is minified and the class isn't reused elsewhere on your page.
$ pcregrep -o 'span class=".*_3df[ij].*>\K[^<]+' foo.html
100,10 €
If your target HTML spreads across multiple lines, or if you have multiple spans with the same classes assigned, then you'll have to do some work to refine the regular expression and differentiate between which matches are important to you. Context lines or subsequent matches may be helpful, but your mileage will definitely vary.
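If you would rather do it properly with a real parser, here is a rough Python sketch using the third-party lxml package (the _3df class fragment is taken from the regex above and may not match your saved page exactly):
from lxml import html  # third-party: pip install lxml
tree = html.parse("foo.html")
# first span whose class contains the fragment used in the pcregrep pattern
spans = tree.xpath('//span[contains(@class, "_3df")]/text()')
if spans:
    print(spans[0].strip())  # e.g. "100,10 €"
Anchoring on the first "Gesamtausgaben" label instead of the class name would work the same way, just with a different XPath expression.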
In markdown I can write:
[example1][myid]
[example2][myid]
[myid]: http://example.com
so I don't have to retype the full external link multiple times.
Is there an analogous feature in AsciiDoc? I'm especially interested in the Asciidoctor implementation.
So far I could only find:
internal cross references with <<>>
I think I saw a replacement feature of the form :myid:, but I can't find it anymore, and I didn't see how to use different link text for each link.
Probably you mean something like this:
Userguide Chapter 28.1. Setting configuration entries
...
Attribute entries promote clarity and eliminate repetition
URLs and file names in AsciiDoc3 macros are often quite long — they break paragraph flow and readability suffers. The problem is compounded by redundancy if the same name is used repeatedly. Attribute entries can be used to make your documents easier to read and write, here are some examples:
:1: http://freshmeat.net/projects/asciidoc3/
:homepage: http://asciidoc3.org[AsciiDoc3 home page]
:new: image:./images/smallnew.png[]
:footnote1: footnote:[A meaningless latin term]
Using previously defined attributes: See the {1}[Freshmeat summary]
or the {homepage} for something new {new}. Lorem ispum {footnote1}.
...
BTW, there is a 100% Python3 port available now: https://asciidoc3.org
I think you are looking for this (both of the following will work just fine):
https://www.google.com[Google]
or
link:https://google.com[Google]
Reference:
AsciiDoc User Manual
Update #1: Using link together with attributes (variables) in AsciiDoc
Declare the attribute:
:url: https://www.google.com
Then use it with either of the formats mentioned above.
Using a link with a label:
{url}[Google]
Using the link: macro (also used for relative links):
link:{url}[Google]
I have a directory (Confidential) which contains a bunch of text files.
Confidential
:- Secret-file1.txt
:- Secret-file2.txt
:- Secret-file3.txt
I want to produce another text file (Summary.txt) with a text width of, say, 80 and with the following formatting:
Secret-file1 - This file describes various secret activities of
organization Secret-Organization-1
Secret-file2 - This file describes various secret activities of
organization Secret-Organization-2. This summarizes
their activities from year 2001.
Secret-file3 - This file describes various secret activities of
organization Secret-Organization-3. This summarizes
their activities from year 2024.
Where the second column is wrapped with a hanging indent and is copied from the first line of the corresponding text file. For example, "Secret-file1.txt" looks like this:
This file describes various secret activities of organization Secret-Organization-1.
XXXXXXXXXXXXXXXXX BUNCH of TEXT TILL EOF XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
How can I do that? I am looking at various options in bash (e.g., sed, awk, grep, or your preferred bash built-in).
Thanks
A
This is the simplest thing that came to mind. Since you didn't write what you tried, I'm leaving possible tweaks to you, but I believe this is a good start ;)
for file in *.txt; do printf '%s\t\t%s\n' "$file" "$(head -n 1 "$file")"; done
You can do this cleanly with a few lines of Python:
#!/usr/bin/env python3
import glob
import textwrap
from os.path import basename

INDENT = ' ' * 22

for filename in glob.glob("Confidential/*.txt"):
    with open(filename, 'r') as secret:
        # Wrap the first line of each file to 74 columns with a hanging indent,
        # then print it next to the padded file name.
        print("{:20s}- {}\n".format(
            basename(filename),
            '\n'.join(textwrap.wrap(secret.readline(),
                                    width=74,
                                    initial_indent=INDENT,
                                    subsequent_indent=INDENT)).strip()),
            end="")
prints
Secret-file1.txt - This file describes various secret activities of
organization Secret-Organization-1
Secret-file2.txt - This file describes various secret activities of
organization Secret-Organization-2. This summarizes
their activities from year 2001.
Secret-file3.txt - This file describes various secret activities of
organization Secret-Organization-3. This summarizes
their activities from year 2024.
It’s not shell, but it’s going to be faster because you’re not forking a bunch of processes, and you’re not going to spend a ton of time with string-formatting and writing loops to indent the text when the textwrap module can do it for you.
Take a look at the fmt command in Unix. It can reformat your document to a specific width and even control indentation.
It's been a long while since I used it. However, it can follow indents, set the width, etc. I have a feeling it may do what you want.
Another command to look at is pr. By default, pr breaks text into pages and adds page numbers, but you can turn all of that off. This is another command that may be able to munge your text the way you want.
I want to find novel and known RNAs and transcripts in a sequence of about 10 kb. What is the easiest way to start, using bioinformatics tools, if that sequence is not well annotated in the Ensembl and UCSC browsers? Are spliced ESTs and RNA sequencing data an option? I am new to bioinformatics, so your suggestions are useful to me.
Thanks in advance
I am a bit unclear on what exactly your desired end product or output would look like, but I might suggest doing multiple sequence alignments and looking for those with high scores. Chances are this 10 kb sequence will contain some of those known sequences, but they won't match exactly, so I think you want a program that gives you alignment scores and not just simple matches. I use Perl in combination with Clustal to make alignments. Basically, you will need to make .fasta or .aln files containing both the 10 kb sequence and a known sequence of interest, according to those file formats' respective conventions. You can use the GUI version of Clustal if you are not too programming savvy. If you want to use Perl, here is a script I wrote for aligning a whole directory of .fasta files; it can perform many alignments in one fell swoop. NOTE: you must edit the Clustal executable path in the last line (the system call) to match its location on your computer for this script to work.
#!/usr/bin/perl
use strict;
use warnings;

print "Please type the directory of protein fasta files to align (end the directory path with a / or this will fail!): ";
my $directory = <STDIN>;
chomp $directory;

opendir(DIR, $directory) or die $!;
my @files = readdir DIR;
closedir DIR;

my $add = "_align.fasta";

foreach my $file (@files) {
    next unless $file =~ /\.fasta$/;    # skip . and .. and anything that isn't a fasta file
    my $infile = "$directory$file";
    (my $fileprefix = $infile) =~ s/\.[^.]+$//;
    my $outfile = "$fileprefix$add";
    system "/Users/Wes/Desktop/eggNOG_files/clustalw-2.1-macosx/clustalw2 -INFILE=$infile -OUTFILE=$outfile -OUTPUT=FASTA";
}
Do you have a Linux server or computer, or are you relying on web- and Windows-based programs?
To align RNA-seq reads, people generally use splice-aware aligners like TopHat, although BLAST would probably work too.
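If you just want to screen the 10 kb sequence itself against known sequences, one low-effort option is a remote BLAST through Biopython; a sketch, assuming Biopython is installed and the sequence sits in a file (my_sequence.fasta is a placeholder name):
from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML
record = SeqIO.read("my_sequence.fasta", "fasta")   # the ~10 kb sequence
# remote blastn against the nt database; this can take a while
result_handle = NCBIWWW.qblast("blastn", "nt", str(record.seq))
blast_record = NCBIXML.read(result_handle)
for alignment in blast_record.alignments[:10]:
    best = alignment.hsps[0]
    print(alignment.title[:80], "e-value:", best.expect)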
Initially I wrote a long response explaining how to do this in Linux, but I've just realised that Galaxy might be a much easier solution for a beginner. Galaxy is an online bioinformatics tool with a very user-friendly interface; it's particularly designed for beginners. You can sign up and log in at this website: https://main.g2.bx.psu.edu/
There are tutorials on how to do things (see 'Help' menu) but my basic workflow for your experiment would go something like this:
Log into Galaxy
Upload RNA-seq reads, EST reads and 10K genome sequence
In the menu on the left, click to expand "NGS-RNA sequencing", then click "Tophat for Illumina" (assuming your RNA-seq reads are Illumina fastq reads).
Align your RNA-seq reads using Tophat, make sure to select your 10K sequence as the reference genome.
Try aligning your EST reads with one of the programs. I'm not sure how successful this will be; Tophat isn't designed to work with long sequences, so you might have to play around or be a bit creative to get this working.
Use Cufflinks to create annotation for novel gene models, based on your RNA-seq reads and/or EST sequences.
Regarding viewing the output, I'm not sure what is available for a custom reference sequence on Windows; you might have to do a bit of research. For Linux/Mac, I'd recommend IGV.
I have a very large PDF File (200,000 KB or more) which contains a series of pages containing nothing but tables. I'd like to somehow parse this information using Ruby, and import the resultant data into a MySQL database.
Does anyone know of any methods for pulling this data out of the PDF? The data is formatted in the following manner:
Name | Address | Cash Reported | Year Reported | Holder Name
Sometimes the Name field overflows into the address field, in which case the remaining columns are displayed on the following line.
Due to the irregular format, I've been stuck on figuring this out. At the very least, could anyone point me to a Ruby PDF library for this task?
UPDATE: I accidentally provided incorrect information! The actual size of the file is 300 MB, or 300,000 KB. I made the change above to reflect this.
I assume you can copy and paste text snippets without problems when your PDF is opened in Acrobat Reader or some other PDF viewer?
Before trying to parse and extract text from such monster files programmatically (even if it's only 200 MB -- for simple text in tables that's huge, unless you have 200,000 pages...), I would proceed like this:
1. Try to sanitize the file first by re-distilling it.
2. Try different CLI tools to extract the text into a .txt file.
This is a matter of minutes. Writing a Ruby program to do this is certainly a matter of hours, days or weeks (depending on your knowledge of the PDF file format internals... I suspect you don't have much experience with that yet).
If "2." works, you may be halfway done already. You will also know that doing it programmatically with Ruby is a job that can in principle be solved. If "2." doesn't work, you know it may be extremely hard to achieve programmatically.
Sanitize the 'Monster.pdf':
I suggest using Ghostscript. You can also use Adobe Acrobat Distiller if you have access to it.
gswin32c.exe ^
-o Monster-PDF-sanitized.pdf ^
-sDEVICE=pdfwrite ^
-f Monster.pdf
(I'm curious how much that single command will make your output PDF shrink if compared to the input.)
Extract text from PDF:
I suggest first trying pdftotext.exe (from the Xpdf folks). There are other, slightly less convenient methods available too, but this might already do the job:
pdftotext.exe ^
-f 1 ^
-l 10 ^
-layout ^
-eol dos ^
-enc Latin1 ^
-nopgbrk ^
Monster-PDF-sanitized.pdf ^
first-10-pages-from-Monster-PDF-sanitized.txt
This will not extract all pages, only pages 1-10 (as a proof of concept, to see if it works at all). To extract from every page, just leave off the -f 1 -l 10 parameters. You may need to tweak the encoding by changing the parameter to -enc ASCII7 (or UTF-8, UCS-2).
If this doesn't work the quick'n'easy way (because, as sometimes happens, some font in the original PDF uses a custom encoding vector), you should ask a new question describing the details of your findings so far. Then you'll need to resort to bigger calibres to shoot down the problem.
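Once you do have plain text, the downstream work is ordinary text munging. Here is a very rough sketch of the idea, in Python rather than Ruby purely to illustrate the logic; it assumes that -layout kept the five columns separated by runs of two or more spaces and that an overflowing Name row leaves the rest of the record on the next line (you will need to adapt both assumptions to the real output before loading anything into MySQL):
import re
rows, pending = [], None
with open("first-10-pages-from-Monster-PDF-sanitized.txt", encoding="latin-1") as fh:
    for line in fh:
        parts = [p for p in re.split(r"\s{2,}", line.strip()) if p]
        if not parts:
            continue
        if len(parts) == 5:        # Name, Address, Cash, Year, Holder on one line
            rows.append(parts)
            pending = None
        elif pending is None:      # probably the first half of an overflowing record
            pending = parts
        else:                      # stitch the overflow back together
            merged = pending + parts
            if len(merged) >= 5:
                rows.append(merged[:5])
            pending = None
for row in rows[:5]:               # eyeball a few rows before touching the database
    print(" | ".join(row))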
At the very least, could anyone point me to a Ruby PDF library for this task?
If you haven't done so, you should check out the two previous questions: "Ruby: Reading PDF files," and "ruby pdf parsing gem/library." PDF::Reader, PDF::Toolkit, and Docsplit are some of the relatively popular suggested libraries. There is even a suggestion of using JRuby and some Java PDF library parser.
I'm not sure if any of these solutions is actually suitable for your problem, especially since you are dealing with such huge PDF files. So unless someone offers a more informative answer, perhaps you should pick a library or two and take them for a test drive.
This will be a difficult task, as rendered PDFs have no concept of tabular layout, just lines and text at predetermined locations. It may not be possible to determine what the rows and columns are, but that will depend on the PDF itself.
The Java libraries are the most robust and may do more than just extract text, so I would look into JRuby with iText or PDFBox.
Check whether there is any structured content in the PDF. I wrote a blog article explaining this at http://www.jpedal.org/PDFblog/?p=410
If not, you will need to build it.
Maybe the Prawn Ruby library?