How to fetch all records using NCBI Batch Entrez - bioinformatics

I have over 200,000 accessions in a flat file, which need to retrieve relevant entry from NBCI.
I use Batch Entrez (http://www.ncbi.nlm.nih.gov/sites/batchentrez) to do the job. But encountered several problems:
The initial file was splitted into multiple sub-files, each containing 4000 lines. But it seems Batch Entrez has some size limitation on the returned file. For example: if the first 1000 accessions all have tens of thousands lines which reach the size limitation, then the rest 3000 accessions will be rejected and won't be searched.
One possible solution in my head is to split the file into more sub-files and search individually. However this requires too much manual effort.
So I am just wondering if there is any other solution, or any code could be used.
Thanks in advance

Your problem sounds a good fit for a Bio-star toolkit. This is a solution using BioSmalltalk
| giList gbReader |
giList := (BioObject openFullFileNamed: 'd:\Batch_entrez_1.txt') contents lines.
gbReader := BioNCBIGenBankReader new.
gbReader
genBankRecordsFrom: 'nuccore'
format: #setModeXML
uids: giList.
(BioGBSeqCollection newFromXMLCollection: gbReader searchResults)
collect: [: e | BioParser
tokenizeNcbiXmlBlast: e contents
nodes: #('GBAuthor' 'GBSeq_definition') ]
To execute/debug the script, just select it and a right-click will open the Smalltalk world-menu.
The API automatically split and fetch your accession list (in the script contained in Batch_entrez_1.txt) maintaining the NCBI Entrez post limits to avoid penalities.
The result format is XML (which is an "easy" format to parse or filter specific fields) although it could be any of the retrieval modes supported by Entrez, for example setting #setModeText will answer an ASN.1 representation. Replace 'nuccore' for the database you want to query. Finally choose the interesting fields, in the script I have choosed 'GBAuthor' and 'GBSeq_definition', but you are free to choose anyone of the available nodes.

Related

Alternating information between two textfiles into a third textfile via stringlists

I have 2 files with information that looks like this
File 1
Show1
Show2
Show3
Show4
Show5
File2
Advert1
Advert2
I want to merge these two files into a 3rd textfile so that the output looks like this
Show1
Advert1
Show2
Advert2
Show3
Advert1
Show4
Advert2
Show5
The amount of information can vary, but file two will never have more lines than file 1
I have tried putting the information of the two text files into two different string lists and then reading them alternatively into a third string list via the indexes, I have tried hopping the information across different textiles and string lists and even arrays, but no matter what I try I always end up with an error, most often a list index error...
My latest attempt at this looks like this
For k := 0 to StringlistAdvert.Count - 1 do
Begin
StringListFinal.Add(StringlistShow[k]);
StringListFinal.Add(StringListAdvert[k]);
StringlistAdvert.Delete(k)
End;
I realise that this was a stupid idea as I'm deleting the items in the string list which I later would have to use in the loop again. My other attempts were not much better
This is a relatively simple program compared to others I have coded before yet the solution to this keeps eluding me, it is for a school project so any help would be greatly appreciated

Logic to compare rows in pig

I need logic for below scenario which needs to be implemented using Pig scripts. Can anyone please help in providing some ideas on how to do this.
Input contains a column groupName with some data like others and unknown. This data needs to be replaced by its previous record data.
Input:
id,groupName
123,casc0001
124,casc0002
125,sale0001
126,unknown
127,nave9876
128,casc0001
129,sale0002
130,others
131,casc0004
132,unknown
133,unknown
134,others
135,nave1234
output:
123,casc0001
124,casc0002
125,sale0001
126,sale0001
127,nave9876
128,casc0001
129,sale0002
130,sale0002
131,casc0004
132,casc0004
133,casc0004
134,casc0004
135,nave1234
In the above input 126,unknown to be replaced with 125,sale0001. 130,others need to be replaced by 129,sale0002. 132,unknown 133,unknown 134,others to be replaced with 131,casc0004.
--Edit--
I tried lead function in Pig. But it is used only to compare n rows at a time. Which cannot solve this completely.
Another logic which is working, but looking for optimized one.
Cogroup for the same data set (like Dataset and Dataset_self)
-Filter Dataset.id=Dataset_self.id or Dataset_self.groupname='others' or Dataset_self.groupname='unknown'
-Generate IdDiff like (Dataset_self.id-Dataset.id), CASE when id=id then ( id, group) else (id_self,group)
-Foreach (group id){
ordered = order by id,diff,group;
limited = ordered limit 1;
generate limited ;
}
This is going to be a complicated problem on a distributed system like hadoop, especially that your file is going to be split between nodes. In your case what if 126 happens to be the first record in a new split. Then you will need to trace the previous file split which is most likely on a different node. Lets say you come up with a MapReduce program to do this, in all likelyhood it would an extremely slow and inefficient way to do it. The solution might be simpler if you are in a single node system where the splittable property of your input format is false, and the nuber of reducers is set to 1.
In that case you could almost make the argument that a traditional database like Oracle or Terra data might be a better fit for your problem as you have lead or lag functions readily available which could be used to do exactly what u need.

Yahoo Pipes: Extracting number from feed item for use in URL builder

Been looking all over the place for a solution to this issue. I have a Yahoo Pipe (http://pipes.yahoo.com/pipes/pipe.info?_id=e5420863cfa494ee40e4c9be43f0e812) that I've created to pull back image content from the Bing Search API. The URL builder includes a $skip attribute that takes an integer and uses it to select the starting (index) point for the result set that the query returns.
My initial plan had been to use the math engine in the Wolfram Alpha API to generate a random number (randomInteger[1000]) that I could use to seed the $skip value each time that the pipe is run. I have an earlier version of the pipe where I was able to get the query / result steps working using either "XPath Fetch" and "Fetch Data". However, regardless of how I Fetch the result, the response returns as an attribute / value pair in a list item.Even when I use "Emit items as string" in XPath Fetch, I still get a list with a single item, when what I really want is the integer that I can plug into my $skip attribute.
I've tried everything in Pipes I can think of, and spent a lot of time online looking for an answer. Is there anyway to extract text (in this case, a number) from a single list item and then use the output as input to "wire" a text parameter in another Pipes block? Any suggestions / ideas welcome. In the meantime, I'm generating a sorta-random number by manipulating a timecode hash, but it just feels tacky :-)
Thanks!
All the sources are for repeated items. You can't have a source that just makes a single number.
I'm not really clear what you're trying to do. You want to put a random number into part of the URL string that gets an RSS feed?

Pass data from workspace to a function

I created a GUI and used uiimport to import a dataset into matlab workspace, I would like to pass this imported data to another function in matlab...How do I pass this imported dataset into another function....I tried doing diz...but it couldnt pick diz....it doesnt pick the data on the matlab workspace....any ideas??
[file_input, pathname] = uigetfile( ...
{'*.txt', 'Text (*.txt)'; ...
'*.xls', 'Excel (*.xls)'; ...
'*.*', 'All Files (*.*)'}, ...
'Select files');
uiimport(file_input);
M = dlmread(file_input);
X = freed(M);
I think that you need to assign the result of this statement:
uiimport(file_input);
to a variable, like this
dataset = uiimport(file_input);
and then pass that to your next function:
M = dlmread(dataset);
This is a very basic feature of Matlab, which suggests to me that you would find it valuable to read some of the on-line help and some of the documentation for Matlab. When you've done that you'll probably find neater and quicker ways of doing this.
EDIT: Well, #Tim, if all else fails RTFM. So I did, and my previous answer is incorrect. What you need to pass to dlmread is the name of the file to read. So, you either use uiimport or dlmread to read the file, but not both. Which one you use depends on what you are trying to do and on the format of the input file. So, go RTFM and I'll do the same. If you are still having trouble, update your question and provide details of the contents of the file.
In your script you have three ways to read the file. Choose one on them depending on your file format. But first I would combine file name with the path:
file_input = fullfile(pathname,file_input);
I wouldn't use UIIMPORT in a script, since user can change way to read the data, and variable name depends on file name and user.
With DLMREAD you can only read numerical data from the file. You can also skip some number of rows or columns with
M = dlmread(file_input,'\t',1,1);
skipping the first row and one column on the left.
Or you can define a range in kind of Excel style. See the DLMREAD documentation for more details.
The filename you pass to DLMREAD must be a string. Don't pass a file handle or any data. You will get "Filename must be a string", if it's not a string. Easy.
FREAD reads data from a binary file. See the documentation if you really have to do it.
There are many other functions to read the data from file. If you still have problems, show us an example of your file format, so we can suggest the best way to read it.

sed optimization (large file modification based on smaller dataset)

I do have to deal with very large plain text files (over 10 gigabytes, yeah I know it depends what we should call large), with very long lines.
My most recent task involves some line editing based on data from another file.
The data file (which should be modified) contains 1500000 lines, each of them are e.g. 800 chars long. Each line is unique, and contains only one identity number, each identity number is unique)
The modifier file is e.g. 1800 lines long, contains an identity number, and an amount and a date which should be modified in the data file.
I just transformed (with Vim regex) the modifier file to sed, but it's very inefficient.
Let's say I have a line like this in the data file:
(some 500 character)id_number(some 300 character)
And I need to modify data in the 300 char part.
Based on the modifier file, I come up with sed lines like this:
/id_number/ s/^\(.\{650\}\).\{20\}/\1CHANGED_AMOUNT_AND_DATA/
So I have 1800 lines like this.
But I know, that even on a very fast server, if I do a
sed -i.bak -f modifier.sed data.file
It's very slow, because it has to read every pattern x every line.
Isn't there a better way?
Note: I'm not a programmer, had never learnt (in school) about algorithms.
I can use awk, sed, an outdated version of perl on the server.
My suggested approaches (in order of desirably) would be to process this data as:
A database (even a simple SQLite-based DB with an index will perform much better than sed/awk on a 10GB file)
A flat file containing fixed record lengths
A flat file containing variable record lengths
Using a database takes care of all those little details that slow down text-file processing (finding the record you care about, modifying the data, storing it back to the DB). Take a look for DBD::SQLite in the case of Perl.
If you want to stick with flat files, you'll want to maintain an index manually alongside the big file so you can more easily look up the record numbers you'll need to manipulate. Or, better yet, perhaps your ID numbers are your record numbers?
If you have variable record lengths, I'd suggest converting to fixed-record lengths (since it appears only your ID is variable length). If you can't do that, perhaps any existing data will not ever move around in the file? Then you can maintain that previously mentioned index and add new entries as necessary, with the difference is that instead of the index pointing to record number, you now point to the absolute position in the file.
I suggest you a programm written in Perl (as I am not a sed/awk guru and I don't what they are exactly capable of).
You "algorithm" is simple: you need to construct, first of all, an hashmap which could give you the new data string to apply for each ID. This is achieved reading the modifier file of course.
Once this hasmap in populated you may browse each line of your data file, read the ID in the middle of the line, and generate the new line as you've described above.
I am not a Perl guru too , but I think that the programm is quite simple. If you need help to write it, ask for it :-)
With perl you should use substr to get id_number, especially if id_number has constant width.
my $id_number=substr($str, 500, id_number_length);
After that if $id_number is in range, you should use substr to replace remaining text.
substr($str, -300,300, $new_text);
Perl's regular expressions are very fast, but not in this case.
My suggestion is, don't use database. Well written perl script will outperform database in order of magnitude in this sort of task. Trust me, I have many practical experience with it. You will not have imported data into database when perl will be finished.
When you write 1500000 lines with 800 chars it seems 1.2GB for me. If you will have very slow disk (30MB/s) you will read it in a 40 seconds. With better 50 -> 24s, 100 -> 12s and so. But perl hash lookup (like db join) speed on 2GHz CPU is above 5Mlookups/s. It means that your CPU bound work will be in seconds and you IO bound work will be in tens of seconds. If it is really 10GB numbers will change but proportion is same.
You have not specified if data modification changes size or not (if modification can be done in place) thus we will not assume it and will work as filter. You have not specified what format of your "modifier file" and what sort of modification. Assume that it is separated by tab something like:
<id><tab><position_after_id><tab><amount><tab><data>
We will read data from stdin and write to stdout and script can be something like this:
my $modifier_filename = 'modifier_file.txt';
open my $mf, '<', $modifier_filename or die "Can't open '$modifier_filename': $!";
my %modifications;
while (<$mf>) {
chomp;
my ($id, $position, $amount, $data) = split /\t/;
$modifications{$id} = [$position, $amount, $data];
}
close $mf;
# make matching regexp (use quotemeta to prevent regexp meaningful characters)
my $id_regexp = join '|', map quotemeta, keys %modifications;
$id_regexp = qr/($id_regexp)/; # compile regexp
while (<>) {
next unless m/$id_regexp/;
next unless $modifications{$1};
my ($position, $amount, $data) = #{$modifications{$1}};
substr $_, $+[1] + $position, $amount, $data;
}
continue { print }
On mine laptop it takes about half minute for 1.5 million rows, 1800 lookup ids, 1.2GB data. For 10GB it should not be over 5 minutes. Is it reasonable quick for you?
If you start think you are not IO bound (for example if use some NAS) but CPU bound you can sacrifice some readability and change to this:
my $mod;
while (<>) {
next unless m/$id_regexp/;
$mod = $modifications{$1};
next unless $mod;
substr $_, $+[1] + $mod->[0], $mod->[1], $mod->[2];
}
continue { print }
You should almost certainly use a database, as MikeyB suggested.
If you don't want to use a database for some reason, then if the list of modifications will fit in memory (as it currently will at 1800 lines), the most efficient method is a hashtable populated with the modifications as suggested by yves Baumes.
If you get to the point where even the list of modifications becomes huge, you need to sort both files by their IDs and then perform a list merge -- basically:
Compare the ID at the "top" of the input file with the ID at the "top" of the modifications file
Adjust the record accordingly if they match
Write it out
Discard the "top" line from whichever file had the (alphabetically or numerically) lowest ID and read another line from that file
Goto 1.
Behind the scenes, a database will almost certainly use a list merge if you perform this alteration using a single SQL UPDATE command.
Good deal on the sqlloader or datadump decision. That's the way to go.

Resources