How to change long gene names to abbreviated in some automatic way (microarray data processing)? - bioinformatics

Is there any automatic way to convert a list of long gene names (like Cadherin_3453) to its abbreviations, like CDHRN_3453?
Are there any abbreviation name convention in Genomics, Bioinformatics?
Sorry, no code herein

There is the HUGO database which tries to standardize gene names. Depending on your use case you can either try to access their online search every time or download the data and use your own database.

Since you did not post a programming language that you want, I am guessing that this is just a simple, one-time exercise that you would like to do.
Though it is not a true abbreviation, you could just remove all of the vowels in the gene name (as you may have done by accident in your example).
You should use:
http://www.togglecase.com/convert_to_disemvowelled_text.php
It was able to change Cadherin_3453 to Cdhrn_3453.
If you a looking to be able to do this with a program that you can tailor to your specific needs, you can look into this SO question: String replace vowels in Python?

Related

Why does protobuf's FieldMask use field names instead of field numbers?

In the docs for FieldMask the paths use the field names (e.g., foo.bar.buzz), which means renaming the message field names can result in a breaking change.
Why doesn't FieldMask use the field numbers to define the path?
Something like 1.3.1?
You may want to consider filing an issue on the GitHub protocolbuffers repo for a definitive answer from the code's authors.
Your proposal seems logical. Using names may be a historical artifact. There's a possibly relevant comment on an issue thread in that repo:
https://github.com/protocolbuffers/protobuf/issues/3793#issuecomment-339734117
"You are right that if you use FieldMasks then you can't safely rename fields. But for that matter, if you use the JSON format or text format then you have the same issue that field names are significant and can't be changed easily. Changing field names really only works if you use the binary format only and avoid FieldMasks."
The answer for your question lies in the fact FieldMasks are a convention/utility developed on top of the proto3 schema definition language, and not a feature of it (and that utility is not present in all of the language bindings)
While you’re right in your observation that it can break easily (as schemas tend evolve and change), you need to consider this design choice from a user friendliness POV:
If you’re building an API and want to allow the user to select the field set present inside the response payload (the common use case for field masks), it’ll be much more convenient for you to allow that using field paths, rather then binary fields indices, as the latter would force the user of the gRPC/protocol generated code to be “aware” of the schema. That’s not always the desired case when providing API as a code software packages.
While implementing this as a proto schema feature can allow the user to have the best of both worlds (specify field paths, have them encoded as binary indices) for binary encoding, it would also:
Complicate code generation requirements
Still be an issue for plain text encoding.
So, you can understand why it was left as an “external utility”.

Arranging First and Last Names in data

Are there guidelines or best practice for formatting first and last names displayed in gridviews? Is it better to display the first name first or the last name first?
There is no universally accepted convention; the ordering of parts of personal names depends on the culture.
You can read more about this here: http://www.w3.org/International/questions/qa-personal-names
Then again, from a user-interface perspective, if the user is more likely to sort by the last name it should come before the first name.
In general, I've always found it better to show the surname first, followed by given name. Perhaps it's possible for you to have two separate columns if sorting on either one could be useful.
Bear in mind, if this is appropriate to your task, that the first/last names are not always the given name/surname in certain cultures so it's better to refer to them as given/surnames.

associate multiple strings to only one

I'm trying to make an algorithm that easily simplifies and groups synonyms (with mismatches, capitals, acronims, etc) into only one. I supose there should exist a standard way to build such a structure that, looking for a string with possible mismatches, if the string exists in the structure, it returns a normalized string key. In short, sometimes the same concept could be written in several ways, but I only want to keep the concept.
For instance: Supose I want to normalize or simplify the appearances of
"General Director", "General Manager", "G, Dtor", "Gen Dir", ...
into
"GEN_DIR"
and keep only this result for further reference.
By the way, I suppose that building a Hash with key/value pairs like
hash["General Director"]="GEN_DIR"
hash["General Manager"]="GEN_DIR"
hash["G, Dtor"]="GEN_DIR"
hash["G, Dir"]="GEN_DIR"
could be a solution, but I suspect that there are more elegant or adequate solutions to that.
I would also need the way to persist this associative structure easily without any database because it should grow as I find more mismatches of the same word or sentence. A possible approach I think is to define this structure by means of a DSL, but I'm open to suggestions.
Well, there is no rule, at least a clear one.
My aim is to scrap from web some "structured" data that sometimes is incorrectly or incompletely typed. Some fields are descriptions and can be left as is. But some fields are suposedly to be "sets" but aren's correctly typed (as in my example). As a human can read that, he immediatelly knows what it means and can associate that with its meaning.
But I would like to automate as much as possible the process of reducing those possible mismatches to only one "string" (or symbol) before, for instance, saving it into a database. So, what I would need is a kindof hash or dictionary, as sawa correctly stated, that I can use to lookup any of such dirty strings to get the normalized string or symbol.
Also, of course, it would be desirable a way to make this hash (or whatelse it could be) to learn from new mismatches in some way and add a new association automatically (possibly it could be based on a distance measure between mismatched string and normalized string that, if lower than X, a new association is built). The whole association (i.e, hash) should grow as new mismatches and concepts arise and, though, it should be kept anywhere (possibly in an xml file, or something like what Mori answered below) for future uses.
Any new Idea?

Can sorting Japanese kanji words be done programmatically?

I've recently discovered, to my astonishment (having never really thought about it before), machine-sorting Japanese proper nouns is apparently not possible.
I work on an application that must allow the user to select a hospital from a 3-menu interface. The first menu is Prefecture, the second is City Name, and the third is Hospital. Each menu should be sorted, as you might expect, so the user can find what they want in the menu.
Let me outline what I have found, as preamble to my question:
The expected sort order for Japanese words is based on their pronunciation. Kanji do not have an inherent order (there are tens of thousands of Kanji in use), but the Japanese phonetic syllabaries do have an order: あ、い、う、え、お、か、き、く、け、こ... and on for the fifty traditional distinct sounds (a few of which are obsolete in modern Japanese). This sort order is called 五十音順 (gojuu on jun , or '50-sound order').
Therefore, Kanji words should be sorted in the same order as they would be if they were written in hiragana. (You can represent any kanji word in phonetic hiragana in Japanese.)
The kicker: there is no canonical way to determine the pronunciation of a given word written in kanji. You never know. Some kanji have ten or more different pronunciations, depending on the word. Many common words are in the dictionary, and I could probably hack together a way to look them up from one of the free dictionary databases, but proper nouns (e.g. hospital names) are not in the dictionary.
So, in my application, I have a list of every prefecture, city, and hospital in Japan. In order to sort these lists, which is a requirement, I need a matching list of each of these names in phonetic form (kana).
I can't come up with anything other than paying somebody fluent in Japanese (I'm only so-so) to manually transcribe them. Before I do so though:
Is it possible that I am totally high on fire, and there actually is some way to do this sorting without creating my own mappings of kanji words to phonetic readings, that I have somehow overlooked?
Is there a publicly available mapping of prefecture/city names, from the government or something? That would reduce the manual mapping I'd need to do to only hospital names.
Does anybody have any other advice on how to approach this problem? Any programming language is fine--I'm working with Ruby on Rails but I would be delighted if I could just write a program that would take the kanji input (say 40,000 proper nouns) and then output the phonetic representations as data that I could import into my Rails app.
宜しくお願いします。
For Data, dig Google's Japanese IME (Mozc) data files here.
https://github.com/google/mozc/tree/master/src/data
There is lots of interesting data there, including IPA dictionaries.
Edit:
And you may also try Mecab, it can use IPA dictionary and can convert kanjis to katakana for most of the words
https://taku910.github.io/mecab/
and there is ruby bindings for that too.
https://taku910.github.io/mecab/bindings.html
and here is somebody tested, ruby with mecab with tagger -Oyomi
http://hirai2.blog129.fc2.com/blog-entry-4.html
just a quick followup to explain the eventual actual solution we used. Thanks to all who recommended mecab--this appears to have done the trick.
We have a mostly-Rails backend, but in our circumstance we didn't need to solve this problem on the backend. For user-entered data, e.g. creating new entities with Japanese names, we modified the UI to require the user to enter the phonetic yomigana in addition to the kanji name. Users seem accustomed to this. The problem was the large corpus of data that is built into the app--hospital, company, and place names, mainly.
So, what we did is:
We converted all the source data (a list of 4000 hospitals with name, address, etc) into .csv format (encoded as UTF-8, of course).
Then, for developer use, we wrote a ruby script that:
Uses mecab to translate the contents of that file into Japanese phonetic readings
(the precise command used was mecab -Oyomi -o seed_hospitals.converted.csv seed_hospitals.csv, which outputs a new file with the kanji replaced by the phonetic equivalent, expressed in full-width katakana).
Standardizes all yomikata into hiragana (because users tend to enter hiragana when manually entering yomikata, and hiragana and katakana sort differently). Ruby makes this easy once you find it: NKF.nkf("-h1 -w", katakana_str) # -h1 means to hiragana, -w means output utf8
Using the awesomely conveninent new Ruby 1.9.2 version of CSV, combine the input file with the mecab-translated file, so that the resulting file now has extra columns inserted, a la NAME, NAME_YOMIGANA, ADDRESS, ADDRESS_YOMIGANA, and so on.
Use the data from the resulting .csv file to seed our rails app with its built-in values.
From time to time the client updates the source data, so we will need to do this whenever that happens.
As far as I can tell, this output is good. My Japanese isn't good enough to be 100% sure, but a few of my Japanese coworkers skimmed it and said it looks all right. I put a slightly obfuscated sample of the converted addresses in this gist so that anybody who cared to read this far can see for themselves.
UPDATE: The results are in... it's pretty good, but not perfect. Still, it looks like it correctly phoneticized 95%+ of the quasi-random addresses in my list.
Many thanks to all who helped me!
Nice to hear people are working with Japanese.
I think you're spot on with your assessment of the problem difficulty. I just asked one of the Japanese guys in my lab, and the way to do it seems to be as you describe:
Take a list of Kanji
Infer (guess) the yomigana
Sort yomigana by gojuon.
The hard part is obviously step two. I have two guys in my lab: 高橋 and 高谷. Naturally, when sorting reports etc. by name they appear nowhere near each other.
EDIT
If you're fluent in Japanese, have a look here: http://mecab.sourceforge.net/
It's a pretty popular tool, so you should be able to find English documentation too (the man page for mecab has English info).
I'm not familiar with MeCab, but I think using MeCab is good idea.
Then, I'll introduce another method.
If your app is written in Microsoft VBA, you can call "GetPhonetic" function. It's easy to use.
see : http://msdn.microsoft.com/en-us/library/aa195745(v=office.11).aspx
Sorting prefectures by its pronunciation is not common. Most Japanese are used to prefectures sorted by 「都道府県コード」.
e.g. 01:北海道, 02:青森県, …, 13:東京都, …, 27:大阪府, …, 47:沖縄県
These codes are defined in "JIS X 0401" or "ISO-3166-2 JP".
see (Wikipedia Japanese) :
http://ja.wikipedia.org/wiki/%E5%85%A8%E5%9B%BD%E5%9C%B0%E6%96%B9%E5%85%AC%E5%85%B1%E5%9B%A3%E4%BD%93%E3%82%B3%E3%83%BC%E3%83%89

How to detect vulnerable/personal information in CVs programmatically (by means of syntax analysis/parsing etc...)

To make matter more specific:
How to detect people names (seems like simple case of named entity extraction?)
How to detect addresses: my best guess - find postcode (regexes); country and town names and take some text around them.
As for phones, emails - they could be probably caught by various regexes + preprocessing
Don't care about education/working experience at this point
Reasoning:
In order to build a fulltext index on resumes all vulnerable information should be stripped out from them.
P.S. any 3rd party APIs/services won't do as a solution.
The problem you're interested in is information extraction from semi structured sources. http://en.wikipedia.org/wiki/Information_extraction
I think you should download a couple of research papers in this area to get a sense of what can be done and what can't.
I feel it can't be done by a machine.
Every other resume will have a different format and layout.
The best you can do is to design an internal format and manually copy every resume content in there. Or ask candidates to fill out your form (not many will bother).
I think that the problem should be broken up into two search domains:
Finding information relating to proper names
Finding information that is formulaic
Firstly the information relating to proper names could probably be best found by searching for items that are either grammatically important or significant. I.e. English capitalizes only the first word of the sentence and proper nouns. For the gramatical rules you could look for all of the words that have the first letter of the word capitalized and check it against a database that contains the word and the type [i.e. Bob - Name, Elon - Place, England - Place].
Secondly: Information that is formulaic. This is more about the email addresses, phone numbers, and physical addresses. All of these have a specific formats that don't change. Use a regex and use an algorithm to detect the quality of the matches.
Watch out:
The grammatical rules change based on language. German capitalizes EVERY noun. It might be best to detect the language of the document prior to applying your rules. Also, another issue with this [and my resume sometimes] is how it is designed. If the resume was designed with something other than a text editor [designer tools] the text may not line up, or be in a bitmap format.
TL;DR Version: NLP techniques can help you a lot.

Resources