How do I download all the abstract data from PubMed (NCBI)?

I want to download all the PubMed abstracts.
Does anyone know how I can easily download all of the PubMed article abstracts?
I found the source of the data:
ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/af/12/
Is there any way to download all these tar files?
Thanks in advance.

There is a package called rentrez (https://ropensci.org/packages/). Check it out. You can retrieve abstracts by specific keywords, PMID, etc. I hope it helps.
UPDATE: You can download all the abstracts by passing your list of IDs to the following code.
library(rentrez)
library(XML)
your.ids <- c("26386083","26273372","26066373","25837167","25466451","25013473")
# rentrez function to get the data from the pubmed db
fetch.pubmed <- entrez_fetch(db = "pubmed", id = your.ids,
                             rettype = "xml", parsed = TRUE)
# Extract the abstracts for the respective IDs.
abstracts = xpathApply(fetch.pubmed, '//PubmedArticle//Article', function(x)
  xmlValue(xmlChildren(x)$Abstract))
# Name the abstracts with the IDs.
names(abstracts) <- your.ids
abstracts
col.abstracts <- do.call(rbind.data.frame, abstracts)
dim(col.abstracts)
write.csv(col.abstracts, file = "test.csv")

I appreciate that this is a somewhat old question.
If you wish to get all the PubMed entries with Python, I wrote the following script a while ago:
import requests
import json

search_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&mindate=1800/01/01&maxdate=2016/12/31&usehistory=y&retmode=json"
search_r = requests.post(search_url)
search_data = search_r.json()
webenv = search_data["esearchresult"]['webenv']
total_records = int(search_data["esearchresult"]['count'])
fetch_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmax=9999&query_key=1&webenv="+webenv

# step by 9999 to match retmax=9999, so consecutive batches do not skip records
for i in range(0, total_records, 9999):
    this_fetch = fetch_url+"&retstart="+str(i)
    print("Getting this URL: "+this_fetch)
    fetch_r = requests.post(this_fetch)
    f = open('pubmed_batch_'+str(i)+'_to_'+str(i+9999)+".json", 'w')
    f.write(fetch_r.text)
    f.close()

print("Number of records found :"+str(total_records))
It starts off by making an entrez/eutils search request between two dates chosen to be guaranteed to capture all of PubMed. From that response, the 'webenv' (which saves the search history) and total_records are retrieved. Using the webenv capability saves having to hand the individual record IDs to the efetch call.
Fetching records (efetch) can only be done in batches of at most 10,000; the for loop grabs batches of 9,999 records and saves them to labelled files until all the records have been retrieved.
Note that requests can fail (non-200 HTTP responses, errors); in a more robust solution you should wrap each requests.post() in a try/except, and before dumping the data to a file you should ensure that the HTTP response has a 200 status.
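For example, a minimal sketch of such a wrapper (the helper name, retry count, and sleep interval are my own choices, not part of the original script) that could replace the bare requests.post(this_fetch) call:
import time

def fetch_with_retry(url, attempts=3, wait_seconds=5):
    # retry a few times on exceptions or non-200 responses before giving up
    for attempt in range(attempts):
        try:
            r = requests.post(url)
            if r.status_code == 200:
                return r.text
            print("Got HTTP "+str(r.status_code)+", retrying: "+url)
        except requests.exceptions.RequestException as e:
            print("Request failed ("+str(e)+"), retrying: "+url)
        time.sleep(wait_seconds)
    raise RuntimeError("Giving up on "+url)
In the loop above you would then write the return value of fetch_with_retry(this_fetch) to the batch file instead of fetch_r.text.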

Related

How to store test data for Automation Frameworks

group_data = {
  "name" => @utility.generate_random_string(8),
}
result = @groups.create(group_data)
The create function:
def create(data)
  res = @client.post($ENDPOINTS['admin']['groups'], data.to_json, Api.header)
  return JSON.parse(res.body)
end
Let's take the above example: this is used in almost all the automation scripts, mostly to create a group (in 100+ spec files). Let's say in the future four new mandatory fields are introduced; there are only two possible solutions:
1. Update all the 100 spec files with the new fields (hectic to do).
2. Add a patch in the create method (see the sketch below).
Question: Is there a better way to do data storage and retrieval here, other than Apache POI getting data from Excel files?
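For concreteness, option 2 usually boils down to keeping the mandatory defaults in one place and merging per-spec overrides in the helper. A rough sketch of that shape, written in Python purely as an illustration since the pattern is language-agnostic; all names here are made up:
DEFAULT_GROUP_DATA = {
    "name": None,  # each spec supplies its own value
    # any future mandatory fields are added here once, not in 100+ spec files
}

def build_group_data(overrides):
    # merge the caller's fields over the shared defaults
    data = dict(DEFAULT_GROUP_DATA)
    data.update(overrides)
    return data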

Web.Content calling API service and merging pages with List.Transform started to fail

I created a Power BI report which connects to a data source via an API service. The returned JSON contains thousands of entities. The API service is called via the Web.Contents function. The API service always returns the total record count, so we are able to calculate the number of pages that have to be called to obtain the whole dataset. This report displays data from our service desk app, which is deployed on many servers for many customers, and uses query parameters to connect to any of these servers.
The detail of the Power Query is below.
Why am I writing here: this report had been working without any issue for more than 1.5 years, but on August 17th one of the servers started causing errors in the Pages step, where some random lines (pages) contain errors - see the attached picture labeled "Errors in step Pages". This is the reason the next step, Entities (List.Union), stops the refresh and generates errors with the message:
Expression.Error: We cannot apply field access to the type List. Details: Value=[List] Key=requests
What is notable:
The API service returns records in the same order, but the faulty lists are random when calling with the same parameters.
Sometimes the refresh completes without any error.
The same Power Query run against another server works correctly; the problem is only with one specific server.
This problem started without notice on the most important server after 1.5 years without any problem.
Here is the full text of the Power Query for this main source, which is used later in other queries to extract all the necessary data. The JSON is really complicated and I extract from it a list of requests, a list of solvers, a list of solver groups, and so on; this base query and its output is the input for many referenced queries.
(Screenshot: Errors in step Pages)
let
    BaseAPIUrl = apiurl&"apiservice?", /*apiurl is a parameter - the name of the server, e.g. https://xxxx.xxxxxx.sk/ */
    EntitiesPerPage = RecordsPerPage, /*RecordsPerPage is a parameter and defines the number of records per page - we used 200-400 records per page as the optimum, but it also works with 4000 records per page*/
    ApiToken = FnApiToken(), /*this function returns the apitoken value, which is the return value of another API service apiurl&"api/auth/login" that uses username and password in the body of the call to get the apitoken*/
    GetJson = (QParm) => /*definition of a general function to get data from the data source*/
        let
            Options =
                [ Query= QParm,
                  Headers=
                  [
                    Accept="application/json",
                    ApiKeyName="apitoken",
                    Authorization=ApiToken
                  ]
                ],
            RawData = Web.Contents(BaseAPIUrl, Options),
            Json = Json.Document(RawData)
        in Json,
    GetEntityCount = () => /*function called once to get the number of records using GetJson; the count is returned as part of each call*/
        let
            QParm = [pp="1", pg="1" ],
            Json = GetJson(QParm),
            Count = Json[totalRecord]
        in
            Count,
    GetPage = (Index) => /*function called repeatedly to get each page of the JSON using GetJson*/
        let
            PageNr = Text.From(Index+1),
            PerPage = Text.From(EntitiesPerPage),
            QParm = [pg = PageNr, pp=PerPage],
            Json = GetJson(QParm),
            Value = Json[data][requests]
        in Value,
    EntityCount = List.Max({ EntitiesPerPage, GetEntityCount() }), /*store the number of records in a variable*/
    PageCount = Number.RoundUp(EntityCount / EntitiesPerPage), /*number of pages*/
    PageIndices = { 0 .. PageCount - 1 },
    Pages = List.Transform(PageIndices, each GetPage(_) /*Function.InvokeAfter(()=>GetPage(_),#duration(0,0,0,1))*/), /*here we call GetPage for each page to get the whole dataset - the commented-out part is a test with a delay between GetPage calls, but it was not necessary*/
    Entities = List.Union(Pages),
    Table = Table.FromList(Entities, Splitter.SplitByNothing(), null, null, ExtraValues.Error)
I also tried another way of appending the pages to a list, using List.Generate. This also brings random errors into the list, but
it makes it possible to transform the result into a table, in contrast to the original approach using List.Transform; however, the other referenced queries still fail and contain errors on the last row.
When I explore the content of a faulty page/list by extracting it via Add as New Query, all records are always there without any failure...
    Source = List.Generate( /*another way to generate the list of all pages*/
        () => [Page = 0, ReqPageData = GetPage(0) ],
        each [Page] < PageCount,
        each [ReqPageData = GetPage( [Page] ),
              Page = [Page] + 1 ],
        each [ReqPageData]
    ),
    #"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error), /*here I am able to generate a table from the list, in contrast to the List.Transform approach*/
    #"Expanded Column1" = Table.ExpandListColumn(#"Converted to Table", "Column1"), /*here I can expand the list into a column*/
    #"Removed Errors" = Table.RemoveRowsWithErrors(#"Expanded Column1", {"Column1"}) /*here I try to exclude errors, but I don't know what happens and which records (if any) are excluded*/
(Screenshot: Extracting the errored page)
And finally, I am totally clueless and unable to find the cause of this behavior on this specific server. I tested calling the errored pages via POSTMAN, and I discussed this issue with the author of the API service; he also tried to call the API service with all the parameters, but the server returns every page OK. Only Power Query is not able to List.Transform them...
I will be grateful for and appreciate any tips or advice, or hearing from anybody who has solved the same issue in the past...
Kuby
No, each errored line of the list in the List.Transform step could be extracted as a new query, and there all records from that page are OK. Hmmm.
Finally: the problem described in this issue was caused by "corrupted" content in the returned JSON. The provider of the core system informed me that they found a bug, and after the fix on the service desk side everything is OK again. I tried to find the problem in Power Query, but the problem was in the service desk. :(

Max number of classroom IDs retrieved

I have approx. 520 classrooms archived in my account. If I try to select them with
var courseList = Classroom.Courses.list({"courseStates":["ARCHIVED"]}).courses;
I get only 300 of them. Is this normal?
How can I select them all? Actually I'm writing a script to delete the oldest, but if I can't retrieve them, I can't delete them.
I understand that you have so many courses that the Courses.list() response is split into separate pages. In that case you can very easily navigate them by using tokens. First of all, make sure that you specify pageSize in your request. That sets the desired number of results per page. Please keep in mind that the server may return fewer than the specified number of results, as stated in the docs. If your response got divided into pages, the response will include the nextPageToken field. Then, to obtain the rest of the courses, you have to repeat your request, including that nextPageToken in the pageToken property. Please don't hesitate to ask me about any doubts you have with this approach.
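The same nextPageToken loop, sketched here in Python with the google-api-python-client purely to show the general pattern (the service object, pageSize value, and function name are assumptions, not taken from the question):
def list_all_archived_courses(service):
    # keep requesting pages until the API stops returning a nextPageToken
    courses = []
    request_params = {'courseStates': ['ARCHIVED'], 'pageSize': 100}
    while True:
        response = service.courses().list(**request_params).execute()
        courses.extend(response.get('courses', []))
        next_token = response.get('nextPageToken')
        if not next_token:
            return courses
        request_params['pageToken'] = next_token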
Thanks a lot Jaques, I found the solution:
var parametri = {"courseStates": "ARCHIVED"};
var page = Classroom.Courses.list(parametri);
var listaClassi = page.courses;
while (page.nextPageToken) {
  parametri.pageToken = page.nextPageToken;
  page = Classroom.Courses.list(parametri);
  listaClassi = listaClassi.concat(page.courses);
}
Anyway, I didn't need to change the pageSize, nor did I find any tutorial about it.

Check if data already exists before inserting into BigQuery table (using Python)

I am setting up a daily cron job that appends a row to a BigQuery table (using Python); however, duplicate data is being inserted. I have searched online and I know that there is a way to manually remove duplicate data, but I wanted to see if I could avoid this duplication in the first place.
Is there a way to check a BigQuery table to see if a data record already exists first in order to avoid inserting duplicate data? Thanks.
CODE SNIPPET:
import logging
import uuid

import webapp2
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

PROJECT_ID = 'foo'
DATASET_ID = 'bar'
TABLE_ID = 'foo_bar_table'

class UpdateTableHandler(webapp2.RequestHandler):
    def get(self):
        credentials = GoogleCredentials.get_application_default()
        bigquery_service = discovery.build('bigquery', 'v2', credentials=credentials)
        try:
            the_fruits = Stuff.query(Stuff.fruitTotal >= 5).filter(Stuff.fruitColor == 'orange').fetch()
            for fruit in the_fruits:
                # some code here
                basket = dict()
                basket['id'] = fruit.fruitId
                basket['Total'] = fruit.fruitTotal
                basket['PrimaryVitamin'] = fruit.fruitVitamin
                basket['SafeRaw'] = fruit.fruitEdibleRaw
                basket['Color'] = fruit.fruitColor
                basket['Country'] = fruit.fruitCountry
                body = {
                    'rows': [
                        {
                            'json': basket,
                            'insertId': str(uuid.uuid4())
                        }
                    ]
                }
                response = bigquery_service.tabledata().insertAll(projectId=PROJECT_ID,
                                                                  datasetId=DATASET_ID,
                                                                  tableId=TABLE_ID,
                                                                  body=body).execute(num_retries=5)
                logging.info(response)
        except Exception, e:
            logging.error(e)

app = webapp2.WSGIApplication([
    ('/update_table', UpdateTableHandler),
], debug=True)
The only way to test whether the data already exists is to run a query.
If you have lots of data in the table, that query could be expensive, so in most cases we suggest you go ahead and insert the duplicate, and then merge duplicates later on.
As Zig Mandel suggests in a comment, you can query over a date partition if you know the date when you expect to see the record, but that may still be expensive compared to inserting and removing duplicates.
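For completeness, a minimal sketch of the "run a query first" approach, using the same discovery-based BigQuery client as in the question (the helper name and the assumption that id is the deduplication column are mine):
def record_exists(bigquery_service, fruit_id):
    # look the id up before calling insertAll; returns True if at least one row matches
    body = {
        'query': 'SELECT COUNT(*) FROM [{0}:{1}.{2}] WHERE id = "{3}"'.format(
            PROJECT_ID, DATASET_ID, TABLE_ID, fruit_id)
    }
    result = bigquery_service.jobs().query(projectId=PROJECT_ID, body=body).execute()
    return int(result['rows'][0]['f'][0]['v']) > 0
As noted above, this costs a query per row, so on a large table it is usually cheaper to keep inserting and deduplicate afterwards.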

How can I change the column name of an existing Class in the Parse.com Web Browser interface?

I couldn't find a way to change the name of a column I just created, either in the browser interface or via an API call. It looks like all the object-related API calls manipulate instances, not the class definition itself?
Does anyone know if this is possible, without having to delete and re-create the column?
This is how I did it in Python:
import json, httplib, urllib

connection = httplib.HTTPSConnection('api.parse.com', 443)
params = urllib.urlencode({"limit": 1000})
connection.connect()
connection.request('GET', '/1/classes/Object?%s' % params, '', {
    "X-Parse-Application-Id": "yourID",
    "X-Parse-REST-API-Key": "yourKey"
})
result = json.loads(connection.getresponse().read())
objects = result['results']

for object in objects:
    connection = httplib.HTTPSConnection('api.parse.com', 443)
    connection.connect()
    objectId = object['objectId']
    objectData = object['data']
    connection.request('PUT', ('/1/classes/Object/%s' % objectId), json.dumps({
        "clonedData": objectData
    }), {
        "X-Parse-Application-Id": "yourID",
        "X-Parse-REST-API-Key": "yourKEY",
        "Content-Type": "application/json"
    })
This is not optimized - you can batch 50 of these operations together at once, but since I was just running it once I didn't do that. Also, since Parse has a 1000-result query limit, you will need to run the load multiple times with a skip parameter, like:
params = urllib.urlencode({"limit":1000, "skip":1000})
From this Parse forum answer: https://www.parse.com/questions/how-can-i-rename-a-column
Columns cannot be renamed. This is to avoid breaking an existing app.
If your app is still under development, you can just query for all the
objects in your class and copy the value of the old column to the new
column. The REST API is very useful for this. You may then drop the
old column in the Data Browser.
Hope it helps
Yes, it's not a feature provided by Parse (yet). But there are some third-party API management tools that you can use to rename the fields in the response. One free tool is called apibond.com.
It's a workaround, but I hope it helps.
