Retrieve bibtex data from crossref by sending DOI from matlab: translation from ruby - ruby

I want to retrieve bibtex data (for building a bibliography) by sending a DOI (Digital Object Identifier) to http://www.crossref.org from within matlab.
The crossref API suggests something like this:
curl -LH "Accept: text/bibliography; style=bibtex" http://dx.doi.org/10.1038/nrd842
based on this source.
Another example from here suggests the following in ruby:
open("http://dx.doi.org/10.1038/nrd842","Accept" => "text/bibliography; style=bibtex"){|f| f.each {|line| print line}}
Although I've heard Ruby rocks, I want to do this in MATLAB, and I have no clue how to translate the Ruby statement or interpret the curl command that crossref suggests.
The following is what I have so far to send a doi to crossref and retrieve data in xml (in variable retdat), but not bibtex, format:
clear
clc
doi = '10.1038/nrd842';
URL_PATTERN = 'http://dx.doi.org/%s';
fetchurl = sprintf(URL_PATTERN,doi);
numinputs = 1;
www = java.net.URL(fetchurl);
is = www.openStream;
%Read stream of data
isr = java.io.InputStreamReader(is);
br = java.io.BufferedReader(isr);
%Parse return data
retdat = [];
next_line = toCharArray(br.readLine)'; %First line contains headings, determine length
%Loop through data
while ischar(next_line)
retdat = [retdat, 13, next_line];
tmp = br.readLine;
try
next_line = toCharArray(tmp)';
if strcmp(next_line,'M END')
next_line = [];
break
end
catch
break;
end
end
%Cleanup java objects
br.close;
isr.close;
is.close;
I would greatly appreciate help translating the Ruby statement into something MATLAB can send, using a script like the one above to communicate with crossref.
Edit:
Additional constraints include backward compatibility of the code (back at least to R14) :>(. Also, no use of ruby, since that solves the problem but is not a "matlab" solution, see here for how to invoke ruby from matlab via system('ruby script.rb').

You can easily edit urlread for what you need. I won't post my modified urlread function code due to copyright.
The least elegant solution is to edit urlread.m directly (mine is at C:\Program Files\MATLAB\R2012a\toolbox\matlab\iofun\urlread.m).
Right before the comment "% Read the data from the connection." I added:
urlConnection.setRequestProperty('Accept','text/bibliography; style=bibtex');

The answer from user2034006 lays the path to a solution.
The following script works when urlread is modified:
URL_PATTERN = 'http://dx.doi.org/%s';
doi = '10.1038/nrd842';
fetchurl = sprintf(URL_PATTERN,doi);
method = 'post';
params= {};
[string,status] = urlread(fetchurl,method,params);
The modification in urlread is not identical to the suggestion of user2034006. Things worked when the line
urlConnection.setRequestProperty('Content-Type','application/x-www-form-urlencoded');
in urlread was replaced with
urlConnection.setRequestProperty('Accept','text/bibliography; style=bibtex');
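For readers unsure what the curl and Ruby one-liners actually do: both send an ordinary GET request to the DOI resolver with one extra Accept header, which is the same header the modified urlread sets. A minimal Python sketch of that request follows, for illustration only (it is not a MATLAB solution). The function names build_bibtex_request and fetch_bibtex are hypothetical, the UTF-8 decoding is an assumption, and nothing is sent over the network unless fetch_bibtex is called.

```python
import urllib.request

ACCEPT_BIBTEX = "text/bibliography; style=bibtex"  # header value from the curl example


def build_bibtex_request(doi):
    """Build (but do not send) a GET request asking the DOI resolver for BibTeX."""
    url = "http://dx.doi.org/%s" % doi
    return urllib.request.Request(url, headers={"Accept": ACCEPT_BIBTEX})


def fetch_bibtex(doi):
    """Send the request; urlopen follows redirects, like `curl -L`. Needs network access."""
    with urllib.request.urlopen(build_bibtex_request(doi)) as response:
        return response.read().decode("utf-8")  # ASSUMPTION: the reply is UTF-8 text


req = build_bibtex_request("10.1038/nrd842")
print(req.get_method(), req.full_url)
print(req.get_header("Accept"))
```

The only difference from a plain page fetch is the Accept header, so any MATLAB route that lets you set a request property on the connection should behave the same way.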


How to use entrezpy and Biopython Entrez libraries to access ClinVar data from genomic position of variant

[Disclaimer: I posted this question on biostars 3 weeks ago and have had no answers yet. I would really like some ideas or discussion toward a solution, so I am also posting it here.
biostars post link: https://www.biostars.org/p/447413/]
For one of my PhD projects, I would like to access all variants in the ClinVar db that are at the same genomic position as the variant in each row of the input GSVar file. The language constraint is Python.
Up to now I have used entrezpy module: entrezpy.esearch.esearcher. Please see more for entrezpy at: https://entrezpy.readthedocs.io/en/master/
From the entrezpy docs I have followed this guide to access UIDs using the genomic position of a variant: https://entrezpy.readthedocs.io/en/master/tutorials/esearch/esearch_uids.html in code:
# first get UIDs for clinvar records of the same position
# credits: https://entrezpy.readthedocs.io/en/master/tutorials/esearch/esearch_uids.html
chr = variants["chr"].split("chr")[1]
start, end = str(variants["start"]), str(variants["end"])
es = entrezpy.esearch.esearcher.Esearcher('esearcher', self.entrez_email)
genomic_pos = chr + "[chr]" + " AND " + start + ":" + end # + "[chrpos37]"
entrez_query = es.inquire(
{'db': 'clinvar',
'term': genomic_pos,
'retmax': 100000,
'retstart': 0,
'rettype': 'uilist'}) # 'usehistory': False
entrez_uids = entrez_query.get_result().uids
Then I have used Entrez from BioPython to get the available ClinVar records:
# process each VariationArchive of each UID
handle = Entrez.efetch(db='clinvar', id=current_entrez_uids, rettype='vcv')
clinvar_records = {}
tree = ET.parse(handle)
root = tree.getroot()
This approach works. However, it has two main drawbacks:
entrezpy fills up my log file by recording every interaction with Entrez, making the log file too big to be read by the hospital collaborator, who is the variant curator.
The entrezpy call entrez_query.get_result().uids returns all UIDs retrieved so far across all requests (say, one request per variant in the GSVar file), so retrieval is space-inefficient. That is, the entrez_uids list grows quickly as I process all the variants in a GSVar file. The simple solution I have implemented is to check which UIDs are new in the current request and keep only those for Entrez.efetch(). However, I still need to keep all UIDs seen for previous variants in order to know which UIDs are new. I do this in code by:
# first snippet's first lines go here
entrez_uids = entrez_query.get_result().uids
current_entrez_uids = [uid for uid in entrez_uids if uid not in self.all_entrez_uids_gsvar_file]
self.all_entrez_uids_gsvar_file += current_entrez_uids
Does anyone have suggestion(s) on how to address these two presented drawbacks?
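One possible way to approach both drawbacks, sketched under assumptions rather than tested against entrezpy: keep the already-seen UIDs in a set so each membership test stays cheap (the seen UIDs still have to be stored, but lookups no longer scan a growing list), and raise the threshold of the chatty logger through the standard logging module. The names UidTracker and quiet_logger are hypothetical, and whether entrezpy's loggers actually live under a logger named "entrezpy" is an assumption to check against its documentation.

```python
import logging


class UidTracker:
    """Remember every UID seen so far and report only the new ones."""

    def __init__(self):
        self._seen = set()

    def new_uids(self, uids):
        """Return the UIDs not seen before (order preserved) and mark them as seen."""
        fresh = []
        for uid in uids:
            if uid not in self._seen:
                self._seen.add(uid)
                fresh.append(uid)
        return fresh


def quiet_logger(name, level=logging.WARNING):
    """Raise a named logger's threshold so DEBUG/INFO records are dropped."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    return logger


tracker = UidTracker()
first = tracker.new_uids([101, 102, 103])
second = tracker.new_uids([102, 103, 104, 104])
print(first, second)

# ASSUMPTION: "entrezpy" is the parent logger name; verify before relying on it.
quiet_logger("entrezpy")
```

In the question's code, tracker.new_uids(entrez_query.get_result().uids) would take the place of the list comprehension and the += on self.all_entrez_uids_gsvar_file.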

Is it possible to read pdf/audio/video files(unstructured data) using Apache Spark?

Is it possible to read pdf/audio/video files(unstructured data) using Apache Spark?
For example, I have thousands of pdf invoices and I want to read data from those and perform some analytics on that. What steps must I do to process unstructured data?
Yes, it is. Use sparkContext.binaryFiles to load the files in binary format, then use map to convert each value to some other format, for example by parsing the binary content with Apache Tika or Apache POI.
Pseudocode:
val rawFile = sparkContext.binaryFiles(...)
val ready = rawFile.map { case (path, stream) => ... } // parse stream.open() here with the other framework
Importantly, the parsing itself must be done by another framework, such as those mentioned above. The function passed to map receives (path, PortableDataStream) pairs; calling open() on the stream yields an InputStream.
We had a scenario where we needed to use a custom decryption algorithm on the input files. We didn't want to rewrite that code in Scala or Python. Python-Spark code follows:
import socket
import subprocess
from pyspark import SparkContext, SparkConf, HiveContext, AccumulatorParam
def decryptUncompressAndParseFile(filePathAndContents):
    '''each line of the file becomes an RDD record'''
    global acc_errCount, acc_errLog
    proc = subprocess.Popen(['custom_decrypt_program','--decrypt'],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    (unzippedData, err) = proc.communicate(input=filePathAndContents[1])
    if len(err) > 0: # problem reading the file
        acc_errCount.add(1)
        # proc.returncode holds the exit status of the decrypt program
        acc_errLog.add('Error: '+str(err)+' in file: '+filePathAndContents[0]+
                       ', on host: '+ socket.gethostname()+' return code:'+str(proc.returncode))
        return [] # this is okay with flatMap
    records = list()
    iterLines = iter(unzippedData.splitlines())
    for line in iterLines:
        #sys.stderr.write('Line: '+str(line)+'\n')
        values = [x.strip() for x in line.split('|')]
        ...
        records.append( (... extract data as appropriate from values into this tuple ...) )
    return records
class StringAccumulator(AccumulatorParam):
    ''' custom accumulator to hold strings '''
    def zero(self,initValue=""):
        return initValue
    def addInPlace(self,str1,str2):
        return str1.strip()+'\n'+str2.strip()
def main():
    ...
    global acc_errCount, acc_errLog
    acc_errCount = sc.accumulator(0)
    acc_errLog = sc.accumulator('',StringAccumulator())
    binaryFileTup = sc.binaryFiles(args.inputDir)
    # use flatMap instead of map, to handle corrupt files
    linesRdd = binaryFileTup.flatMap(decryptUncompressAndParseFile, True)
    df = sqlContext.createDataFrame(linesRdd, ourSchema())
    df.registerTempTable("dataTable")
    ...
The custom string accumulator was very useful in identifying corrupt input files.
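The per-file function passed to flatMap can be exercised without a cluster. Below is a minimal sketch of just the parsing step, assuming the decrypted payload is UTF-8 text with pipe-delimited fields. The name parse_pipe_delimited is hypothetical. Under Python 3 the bytes coming back from a subprocess need decoding before split('|') works, which is the step this sketch adds.

```python
def parse_pipe_delimited(path_and_bytes):
    """Turn one (path, raw_bytes) pair into a list of field tuples, one per line.

    Returns an empty list on undecodable input, which flatMap simply skips.
    """
    path, raw = path_and_bytes
    try:
        text = raw.decode("utf-8")  # ASSUMPTION: the payload is UTF-8 text
    except UnicodeDecodeError:
        return []
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        records.append(tuple(field.strip() for field in line.split("|")))
    return records


good = parse_pipe_delimited(("a.dat", b"1 | alice | 3.5\n2|bob|4.0\n"))
bad = parse_pipe_delimited(("b.dat", b"\xff\xfe\x00"))
print(good)
print(bad)
```

Because PySpark's binaryFiles yields (path, bytes) pairs, a function of this shape can be handed to flatMap directly, and returning an empty list keeps a corrupt file from failing the whole job.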

Ruby Simple Read/Write File (Copy File)

I am practicing Ruby, and I am trying to copy the contents of file "from" to file "to". Can you tell me what I did wrong?
Thanks!
from = "1.txt"
to = "2.txt"
data = open(from).read
out = open(to, 'w')
out.write(data)
out.close
data.close
Maybe I am missing the point, but I think writing it like so is more 'ruby'
from = "1.txt"
to = "2.txt"
contents = File.read(from) # reads the whole file and closes it
File.write(to, contents) # writes the contents and closes the file
Personally, however, I like to use the operating system's shell for file operations like this. Here is an example on Linux.
from = "1.txt"
to = "2.txt"
system("cp #{from} #{to}")
And for Windows I believe you would use:
from = "1.txt"
to = "2.txt"
system("copy #{from} #{to}")
Finally, if you need the output of the command for logging or some other reason, I would use backticks.
#A nice one liner
`cp 1.txt 2.txt`
Here is the system and backtick methods documentation.
http://ruby-doc.org/core-1.9.3/Kernel.html
You can't call data.close: data.class would show you that you have a String, and close is not a String method. By opening from the way you did, you lost the File reference after the read. One way to fix that would be:
from = "1.txt"
to = "2.txt"
infile = open(from) # Retain the File reference
data = infile.read # Use it to do the read
out = open(to, 'w')
out.write(data)
out.close
infile.close # And finally, close it

I am trying to use Curl::Easy.http_put but have some issues with the data argument

I'm struggling with a Ruby script to upload some pictures to Moodstocks using their HTTP interface.
Here is the code that I have so far:
curb = Curl::Easy.new
curb.http_auth_types = :digest
curb.username = MS_API
curb.password = MS_SECRET
curb.multipart_form_post = true
Dir.foreach(images_directory) do |image|
if image.include? '.jpg'
path = images_directory + image
filename = File.basename(path, File.extname(path))
puts "Upload #{path} with id #{filename}"
raw_url = 'http://api.moodstocks.com/v2/ref/' + filename
encoded_url = URI.parse URI.encode raw_url
curb.url = encoded_url
curb.http_put(Curl::PostField.file('image_file', path))
end
end
And this is the error that I get:
/Library/Ruby/Gems/2.0.0/gems/curb-0.8.5/lib/curl/easy.rb:57:in `add': no implicit conversion of nil into String (TypeError)
from /Library/Ruby/Gems/2.0.0/gems/curb-0.8.5/lib/curl/easy.rb:57:in `perform'
from upload_moodstocks.rb:37:in `http_put'
from upload_moodstocks.rb:37:in `block in <main>'
from upload_moodstocks.rb:22:in `foreach'
from upload_moodstocks.rb:22:in `<main>'
I think the problem is in how I pass the argument to the http_put method, but I have looked for examples of Curl::Easy.http_put and found nothing so far.
Could anyone point me to some documentation on it, or help me out with this?
Thank you in advance.
There are several problems here:
1. URI::HTTP instead of String
First, the TypeError you encounter comes from the fact that you pass a URI::HTTP instance (encoded_url) as curb.url instead of a plain Ruby string.
You may want to use encoded_url.to_s, but the question is why do you do this parse/encode here?
2. PUT w/ multipart/form-data
The second problem is related to curb. At the time of writing (v0.8.5), curb does NOT support performing an HTTP PUT request with multipart/form-data encoding.
If you refer to the source code you can see that:
the multipart_form_post setting is only used for POST requests,
the put_data setter does not support Curl::PostField-s
To solve your problem you need an HTTP client library that can combine Digest Authentication, multipart/form-data and HTTP PUT.
In Ruby you can use rufus-verbs, but you will need to use rest-client to build the multipart body.
There is also HTTParty but it has issues with Digest Auth.
That is why I strongly recommend using Python with Requests instead:
import requests
from requests.auth import HTTPDigestAuth
import os
MS_API_KEY = "kEy"
MS_API_SECRET = "s3cr3t"
filename = "sample.jpg"
with open(filename, "rb") as f:  # binary mode: the image must be read as raw bytes
    base = os.path.basename(filename)
    uid = os.path.splitext(base)[0]  # file name without extension is used as the id
    r = requests.put(
        "http://api.moodstocks.com/v2/ref/%s" % uid,
        auth = HTTPDigestAuth(MS_API_KEY, MS_API_SECRET),
        files = {"image_file": (base, f.read())}
    )
    print(r.status_code)
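On the first point, if the goal of the parse/encode step was to make the file name safe to put in the URL, only the id segment needs escaping, and the result should stay a plain string. A minimal sketch using the standard library follows. The helper name ref_url is hypothetical, and the base URL is the one from the question.

```python
import os
from urllib.parse import quote

BASE_URL = "http://api.moodstocks.com/v2/ref/"  # endpoint taken from the question


def ref_url(image_path):
    """Build the reference URL for an image, percent-encoding only the id segment."""
    base = os.path.basename(image_path)
    uid = os.path.splitext(base)[0]
    return BASE_URL + quote(uid, safe="")  # safe="" also escapes "/"


print(ref_url("images/my photo.jpg"))
print(ref_url("images/plain.jpg"))
```

Escaping only the last segment leaves the scheme and host untouched, so there is no need to round-trip the whole URL through a URI object.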

Using Ruby and Node crypto library together

I've got a string encrypted using aes-128-cbc encryption using Ruby and the EzCrypto library.
Here's my encryption code in Ruby:
require 'rubygems'
require 'ezcrypto'
@pwd = 'hello'; @salt = 'salt'
key = EzCrypto::Key.with_password @pwd,@salt, :algorithm=>"aes-128-cbc"
File.open('key.txt','w') do |file|
file.write(key.to_s)
end
File.open('secret.txt','w') do |file|
file.write(key.encrypt("hello"))
end
Now I'd like to decrypt that string with Node, but I'm getting nothing back. I must be doing something wrong here. Below is my Node code.
var crypto = require('crypto');
var fs = require('fs');
var secret = fs.readFileSync('secret.txt', 'binary');
var key = fs.readFileSync('key.txt', 'base64');
var decipher = crypto.createDecipher('aes-128-cbc', key);
var string = decipher.update(secret, 'binary', 'utf8');
string += decipher.final('utf8');
console.log("STRING: ", string)
Which returns: STRING:
Any help would be much appreciated.
The secret.txt contains binary instead of the expected UTF-8/HEX.
This turned out to be an issue with Ruby's implementation of OpenSSL. If you dig deep into Ruby's source you find this:
https://github.com/ruby/ruby/blob/trunk/ext/openssl/ossl_cipher.c#L210
Ruby always sets the IV (initialization vector) to "OpenSSL for Ruby rulez!", which IMHO is ridiculous. Out of the box, Ruby's OpenSSL encryption will never work with other languages.
Meaning EzCrypto won't work with Node :-(
I wrote my own cipher wrapper for Ruby in which I set the IV manually. Everything else fell into place once that was fixed.
I really hope this helps someone else out. It took me forever to figure it out.
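A further interoperability detail on the Node side: crypto.createDecipher(algorithm, password) treats its second argument as a password and derives both the key and the IV from it using OpenSSL's EVP_BytesToKey with MD5, one iteration and no salt, so it never sees an IV chosen elsewhere. The derivation can be sketched in Python with hashlib. The helper name evp_bytes_to_key is hypothetical, and the password below is just the string from the question. When the key and IV are fixed explicitly on the Ruby side, the matching Node call is crypto.createDecipheriv.

```python
import hashlib


def evp_bytes_to_key(password, key_len, iv_len, salt=b""):
    """Derive (key, iv) the way OpenSSL's EVP_BytesToKey does with MD5 and count=1."""
    derived = b""
    block = b""
    while len(derived) < key_len + iv_len:
        # Each round hashes the previous digest followed by the password and salt.
        block = hashlib.md5(block + password + salt).digest()
        derived += block
    return derived[:key_len], derived[key_len:key_len + iv_len]


# aes-128-cbc needs a 16-byte key and a 16-byte IV.
key, iv = evp_bytes_to_key(b"hello", 16, 16)
print(key.hex())
print(iv.hex())
```

For a 16-byte key and IV this reduces to key = MD5(password) and iv = MD5(key + password), which makes it easy to compare against whatever key and IV the Ruby side actually used.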
