Processing CEL files: are all CEL files raw data? - bioinformatics

Using Affymetrix data, I have always thought that "CEL files" represent raw data that needs to be processed (normalized, for example) before being really used.
Nevertheless, on the GEO Omnibus website, when you look at the CEL files for some studies (not all; please look at the example at this link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM467778),
you can read that the values of GSM467778_50.CEL correspond to "RMA normalised signal intensities" (in the example above), which is not what I call "raw values".
Therefore I don't really know whether, in this example, I should normalize the CEL files or whether it has already been done.
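For what it's worth, one rough sanity check is the scale of the reported values: RMA produces log2-scale intensities (roughly 2-16), whereas raw CEL probe intensities run into the tens of thousands. Below is a minimal sketch of that check in Python; it assumes the third-party GEOparse and pandas packages are installed, and the range test is only a heuristic, not a guarantee.

# Sketch: inspect the VALUE column reported for the GSM from the question
# (assumes the GEOparse package; the threshold is a rough heuristic).
import GEOparse

gsm = GEOparse.get_GEO(geo="GSM467778", destdir="/tmp")
values = gsm.table["VALUE"].astype(float)
print(values.describe())

if values.max() < 25:
    # RMA outputs log2-scale expression values, typically about 2-16.
    print("Values look log2-scale: most likely already RMA normalised.")
else:
    # Raw (linear-scale) probe intensities reach into the tens of thousands.
    print("Values look like raw intensities: normalization still needed.")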

Related

What dimensions are required in an IOAPI NetCDF file?

When I do an ncdump -h on some old I/O API NetCDF files, I always get the same dimensions:
dimensions:
TSTEP
DATE-TIME
LAY
VAR
ROW
COL
Are these exact dimensions and names always required in an I/O API NetCDF file? (If this is the convention, why does it require TSTEP and DATE-TIME? They sound redundant.)
The most recent documentation seems to be included in https://github.com/cjcoats/ioapi-3.2
There is a section called FILES: Variables, Layers, and Time Steps where you can read that "There are eight types of data currently supported by the I/O API." Without knowing your files exactly, I cannot tell what format they should precisely correspond to.
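As a quick way to compare your files against that documentation, you can list the dimensions and global attributes programmatically, similar to ncdump -h. Here is a minimal sketch assuming the Python netCDF4 package; the filename is a placeholder.

# Sketch: print the dimensions and global attributes of a NetCDF file
# (assumes the netCDF4 package; replace the filename with your own file).
from netCDF4 import Dataset

with Dataset("example_ioapi_file.nc") as ds:
    for name, dim in ds.dimensions.items():
        print(f"{name}: size={len(dim)}, unlimited={dim.isunlimited()}")
    # I/O API files also carry global attributes describing the time step,
    # grid, and variables, which is where much of the metadata lives.
    print(ds.ncattrs())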

Searching through the protocol buffer file

I'm new to protocol buffers and I was wondering whether it is possible to search a protocol buffers binary file and read the data back in a structured format. For example, if a message in my .proto file has 4 fields, I would like to serialize the message, write multiple messages into a file, and then search for a particular field in the file. If I find the field, I would like to read back the message in the same structured format as it was written. Is this possible with protocol buffers? If so, any sample code or examples would be very helpful. Thank you.
You should treat the protobuf library as a serialization protocol, not an all-in-one library that supports complex operations (such as querying, indexing, or picking out particular data). Google has various libraries on top of the open-sourced portion of protobuf to do so, but they are not released as open source, as they are tied to their unique infrastructure. That being said, what you want is certainly possible, but you need to write some code.
Anyhow, some of your requirements are:
one file contains various serialized binaries.
search a particular field in each serialized binary and extract that chunk.
There are several ways to achieve them.
The most popular way for serial read/write is for the file to contain a series of [size, type, serialization output]. That is, one serialized output is always prefixed by its size and type (either 4/8-byte or variable-length) to help reading and parsing. So you just repeat this procedure: 1) read size and type, 2) read the binary with the given size, 3) parse it with the given type, 4) go to 1). If you use a union type, or the whole file shares the same type, you may skip the type. You cannot drop the size, as there is no way to know the end of one output by itself. If you want random read/write, another kind of data structure is necessary.
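As an illustration of that framing (this is not part of the protobuf library itself), here is a minimal Python sketch. It assumes a generated module person_pb2 compiled with protoc from a hypothetical .proto file shown in the comment.

# Sketch of length-prefixed framing with the official Python protobuf runtime.
# Assumes person_pb2 was generated from a hypothetical .proto such as:
#
#   syntax = "proto2";
#   message Person {
#     optional string name  = 1;
#     optional int32  id    = 2;
#     optional string email = 3;
#   }
import struct
import person_pb2  # hypothetical generated module

def write_message(f, msg):
    data = msg.SerializeToString()
    f.write(struct.pack("<I", len(data)))  # 4-byte little-endian size prefix
    f.write(data)

def read_messages(f, msg_type):
    while True:
        header = f.read(4)
        if len(header) < 4:
            return  # end of file
        (size,) = struct.unpack("<I", header)
        msg = msg_type()
        msg.ParseFromString(f.read(size))
        yield msg

with open("people.bin", "wb") as f:
    write_message(f, person_pb2.Person(name="Alice", id=1, email="alice@example.com"))
    write_message(f, person_pb2.Person(name="Bob", id=2))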
'Searching a field' in a binary file is trickier. One way is to read and parse the outputs one by one and check for the existence of the field with HasField(). It is the most obvious and slowest, yet most straightforward, way to do so. If you want to search a field by number (say, you want to search for 'optional string email = 3;'), and thus search for a binary blob (like 0x1A: field number 3, wire type 2), it is not possible. In a serialized binary stream, field information is saved as merely a number. Without the exact context (the .proto scheme or the binary file's structure), the number alone doesn't mean anything. There is no guarantee that 0x1A comes from field information, or from field information of another message type, or is actually the number 26, or part of another number, etc. That is, you need to maintain that information yourself. You may create another file or database with the necessary information to fetch a particular message (like the location of the serialization output with a given field).
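Continuing the same hypothetical sketch from above, the read-and-check-HasField() search would look roughly like this:

# Scan every framed message and keep only those with the optional email field set.
# (Uses the hypothetical person_pb2 module and read_messages() from the sketch above.)
with open("people.bin", "rb") as f:
    matches = [p for p in read_messages(f, person_pb2.Person) if p.HasField("email")]

for p in matches:
    print(p.name, p.email)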
Long story short, what you ask for is beyond what the open-sourced protobuf library itself does, yet you can write it yourself to fit your requirements.
I hope this is what you are looking for:
http://temk.github.io/protobuf-utils/
This is a command-line utility for searching within a protobuf file.

Why are files returned by a For Each loop sorted, but not always?

I'm not sure if this is the correct place to post this question because I have a hunch that the behavior I witness will also be observed using other methods. But anyway, here it goes.
I have a VBScript that contains code like this:
For Each objFile In colFiles
...
Next
I've been running this code for quite some time on many different systems. I never bothered to order the files alphabetically. But today I found out by accident that the logic of my program depends on it. I ran the code on a new system (under Citrix) and the files were returned in a seemingly random order.
Does anybody know why Windows sometimes returns the files sorted alphabetically while sometimes it doesn't?
Added note: It might be relevant that both the script and the input folder are on a network share (this is where my script gets the files in random order).
Ordering is not supported for FileSystemObject. See KB 189751 http://support.microsoft.com/kb/189751/en-us
Also check out an answer on SO about how to deal with that: Order of Files collection in FileSystemObject
The docs do not specify an ordering, so you cannot depend on the collection having one. The Files property asks the underlying file system for the files and then gives them to you as is, without any processing. If that file system happens to return the files in order, that's great. If not, you'll have to sort them yourself. Regardless of whether they happen to come back in order, you should always sort them if you expect a certain order, because the implementation may change tomorrow (as you've just witnessed).
It depends on what data structure you are looping through.
You will obviously get a different order if you use a foreach loop over an array versus a hash set, for example.
Personally, I don't know anything about VB. But it does work this way in C#.

Marking or Tagging Non-structured Data

I'm not entirely sure how to term this, but I've searched several phrases and haven't found what I need.
I have a whole lot of unstructured data that I need to get into a database. I used to do the heavy lifting with Needlebase and just clean up the data from there. But now that it's no more, I'm looking for a good way to quickly grab pieces of text beyond select, copy, paste, lather, rinse, repeat.
Ideally something where I could select some text and a popup would ask what it is (from a user-defined list: title, start time, image path, etc.) and then mark it as such. Naturally I would need to be able to mark the beginning and end of a record (all row data is consecutive, just not in an easily parseable format).
I could probably write something in a few hours that would do this, but I don't want to reinvent the wheel if something exists. I'm on OS X, but I'd be interested in software for any platform.
Is your data in HTML format? If yes, you can use Jsoup.

Searching a list of keywords from text files in folders

I have compiled a list of db object names, one name per line, in a text file. I want to know, for each name, where it is being used. The target of the search is a group of folders containing sub-folders of source code.
Before I give up looking for a tool to do this and start creating my own, perhaps you can help point me to an existing one.
Ideally, it should be a Windows desktop application. I have not used grep before.
Use grep (there are tons of ports of this command to Windows; search the web).
Alternatively, use AgentRansack.
See our Source Code Search Engine. It indexes a large code base according to the atoms (tokens) of the language(s) of interest, and then uses that index to quickly execute structured queries stated in terms of language elements. It is a kind of super-grep, but it isn't fooled by comments or string literals, and it automatically ignores whitespace. This means you get far fewer false positive hits than you get with grep.
If you had an identifier "foo", the following query would find all mentions:
I=foo
For C and Java, you can constrain the types of identifier accesses to Use, Read, Write or Defines.
D=bar*
would find only declarations of identifiers which started with the letters "bar".
You can write more complex queries using sequences of language tokens:
'int' I=*baz* '['
for C, would find declarations of any variable name that contained the letters "baz" and apparently declared an array.
You can see the hits in a GUI, and one-click navigate to a source code view of any hit.
It is a Windows application. It handles a wide variety of languages: C#, C++, Java, ... and many more.
I created an SSIS package to load my 500+ source code files, which are distributed across nested folders belonging to several projects, into a table, with one row per line from the files (10K+ lines in total).
I then wrote a select statement against it, cross-applying the table that keeps the list of 5K+ db object keywords, with the help of RegEx for MS-SQL (http://www.simple-talk.com/sql/t-sql-programming/clr-assembly-regex-functions-for-sql-server-by-example/). The query took almost 1.5 hours to complete.
I know it's long-winded, but this is exactly what I need. I thank you for your efforts in guiding me. I would be happy to explain the details further, should anyone be interested in using my method.
insert into dbo.DbObjectUsage
select
    do.Id as DbObjectId,
    fl.Id as FileLineId
from
    dbo.FileLine as fl -- 10K+
    cross apply dbo.DbObject as do -- 5K+
where
    dbo.RegExIsMatch('\b' + do.name + '\b', fl.Line, 0) != 0
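For comparison, the same word-boundary matching idea can be sketched as a plain script instead of SSIS plus SQL CLR. The sketch below is Python with placeholder file and folder names; it is an illustration of the approach, not the exact code I ran.

# Sketch: search every line of every file under a folder tree for any keyword
# from a list, using \b word boundaries (mirrors the RegExIsMatch call above).
import os
import re

with open("db_objects.txt", encoding="utf-8") as f:          # placeholder keyword file
    keywords = [line.strip() for line in f if line.strip()]

pattern = re.compile(r"\b(" + "|".join(map(re.escape, keywords)) + r")\b",
                     re.IGNORECASE)

for root, _dirs, files in os.walk("source_code_root"):        # placeholder folder
    for name in files:
        path = os.path.join(root, name)
        with open(path, encoding="utf-8", errors="ignore") as src:
            for lineno, line in enumerate(src, 1):
                for match in pattern.finditer(line):
                    print(f"{path}:{lineno}: {match.group(1)}")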
