Can I use HISAT2 for alignment when doing ChIP-seq? - rna-seq

When I was doing RNA-seq, I used HISAT2 for the alignment. However, in ChIP-seq analysis, Bowtie2 is widely used for alignment. But HISAT2 has a parameter for DNA-seq-like analysis, --no-spliced-alignment. So why not use HISAT2 for ChIP-seq analysis? Thank you!!

Related

Difference between Gensim's FastText and Facebook's FastText

I came upon the realization that there exists the original implementation of FastText here, with which you can use fasttext.train_unsupervised to generate word vectors (see this link as an example). However, it turns out that gensim also supports FastText, and its API is similar to that of word2vec. See the example here.
I am wondering if there is a difference between the two implementations. The documentation was not clear: do they both mimic the paper Enriching Word Vectors with Subword Information? And if so, why would one use gensim's fasttext over fasttext?
Gensim intends to match the Facebook implementation, but with a few known or intentional differences. Specifically, Gensim doesn't implement:
the -supervised option, & specific-to-that-mode autotuning/quantization/pretrained-vectors options
word-multigrams (as controlled by the -wordNgrams parameter to fasttext)
the plain softmax option for loss-optimization
Regarding options to -loss, I'm relatively sure that despite Facebook's command-line options docs indicating that the fasttext default is softmax, it is actually ns except when in -supervised mode, just like word2vec.c & Gensim. See for example this source code.
I suspect a future contribution to Gensim that adds wordNgrams support would be welcome, if that mode is useful to some users, and to match the reference implementation.
So far the choice of Gensim has been to avoid any supervised algorithms, so the -supervised mode is less likely to appear in any future Gensim. (I'd argue for it, though, if a working implementation were contributed.)
The plain softmax mode is so much slower on typical large output vocabularies that few non-academic projects would want to use it over hs or ns. (It may still be practical with a smaller number of output labels, as in -supervised mode, though.)
I found one difference in gensim's documentation:
word_ngrams (int, optional) – In Facebook’s FastText, “max length of word ngram” -
but gensim only supports the default of 1 (regular unigram word handling).
This means that gensim only supports unigrams, but no bigrams or trigrams.

When you want to test something for efficiency in Erlang, which way do you use?

In Programming Erlang, Chapter 8,
Joe Armstrong uses
statistics(runtime)
statistics(wall_clock)
to test efficiency.
Is there any other way to test?
Thanks.
The timer:tc/1,2,3 functions are nicely wrapped for this kind of measurement; for example, timer:tc(Mod, Fun, Args) returns {Microseconds, Result}.
In those cases when timer:tc is not sufficient, I use fprof, the built-in profiling tool. It has minimal runtime performance impact and a nice, readable, and parseable output format.

arm-none-eabi-gcc: -march option v/s -mcpu option

I have been following the J. Lynch tutorial from Atmel for developing small programs for the AT91SAM7S256 (microcontroller). I have done a bit of tinkering and used arm-none-eabi instead of arm-elf (the old one). By default I found that gcc compiles assuming -march=armv4t even if one does not mention anything about the chip. How much difference would it make if I used -mcpu=arm7tdmi?
Even after searching a lot on Google, I could not find a detailed tutorial that explains all possible command-line options, including separate linker, assembler, and objcopy options like -Map, etc.
Can you provide any such material where all the possibilities are explained?
Providing information about the specific processor gives the compiler additional information for selecting the most efficient mix of instructions, and the most efficient way of scheduling those instructions. It depends very much on the specific processor how much performance difference explicitly specifying -mcpu makes. There could be no difference whatsoever - the only way to know is to measure.
But in general - if you are building a specific image for a specific device, then you should provide the compiler with as much information as possible.
Note: your current instance of gcc compiles assuming -march=armv4t; this is certainly not a universal guarantee for all ARM gcc toolchains.
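To make the difference concrete, here is a minimal sketch (the file and function are placeholder examples of my own; the toolchain and flags are the ones discussed above):

    /* blink.c - trivial placeholder translation unit.
     *
     * Default: the compiler only knows the architecture family:
     *   arm-none-eabi-gcc -O2 -march=armv4t -c blink.c
     *
     * Naming the exact core lets instruction selection and scheduling
     * be tuned for the ARM7TDMI pipeline:
     *   arm-none-eabi-gcc -O2 -mcpu=arm7tdmi -c blink.c
     */
    int blink_delay(int n)
    {
        volatile int i;
        for (i = 0; i < n; ++i)
            ;   /* busy-wait; 'volatile' keeps the loop from being removed */
        return n;
    }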

How do I force gcc to inline a function?

Does __attribute__((always_inline)) force a function to be inlined by gcc?
Yes.
From the documentation (v4.1.2)
From the latest documentation
always_inline
Generally, functions are not inlined unless optimization is specified. For functions declared inline, this attribute inlines the function even if no optimization level was specified.
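A minimal sketch of the attribute in use (the function name is my own; note that, per the quote above, the function is also declared inline):

    #include <stdio.h>

    /* With always_inline, GCC expands calls to this function even at -O0,
     * and reports an error if inlining is impossible. */
    static inline __attribute__((always_inline)) int square(int x)
    {
        return x * x;
    }

    int main(void)
    {
        printf("%d\n", square(7)); /* call expanded in place */
        return 0;
    }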
It should. I'm a big fan of manual inlining. Sure, used in excess it's a bad thing. But oftentimes when optimizing code, there will be one or two functions that simply have to be inlined or performance goes down the toilet. And frankly, in my experience C compilers typically do not inline those functions when merely using the inline keyword.
I'm perfectly willing to let the compiler inline most of my code for me. It's only those half dozen or so absolutely vital cases that I really care about. People say "compilers do a good job at this." I'd like to see proof of that, please. So far, I've never seen a C compiler inline a vital piece of code I told it to without using some sort of forced-inline syntax (__forceinline on MSVC, __attribute__((always_inline)) on gcc).
Yes, it will. That doesn't necessarily mean it's a good idea.
According to the gcc optimize options documentation, you can tune inlining with parameters:
-finline-limit=n
By default, GCC limits the size of functions that can be inlined. This flag
allows coarse control of this limit. n is the size of functions that can be
inlined in number of pseudo instructions.
Inlining is actually controlled by a number of parameters, which may be specified
individually by using --param name=value. The -finline-limit=n option sets some
of these parameters as follows:
max-inline-insns-single is set to n/2.
max-inline-insns-auto is set to n/2.
I suggest reading in more detail about all the parameters for inlining, and setting them appropriately.
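For example, a hypothetical invocation (the source file name and the numbers are arbitrary):

    /* Raise the overall inlining budget; per the quote above, this sets
     * max-inline-insns-single and max-inline-insns-auto to 600 each:
     *
     *   gcc -O2 -finline-limit=1200 mycode.c
     *
     * Or tune one parameter directly:
     *
     *   gcc -O2 --param max-inline-insns-auto=600 mycode.c
     */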
I want to add here that I have a SIMD math library where inlining is absolutely critical for performance. Initially I set all functions to inline but the disassembly showed that even for the most trivial operators it would decide to actually call the function. Both MSVC and Clang showed this, with all optimization flags on.
I did as suggested in other posts in SO and added __forceinline for MSVC and __attribute__((always_inline)) for all other compilers. There was a consistent 25-35% improvement in performance in various tight loops with operations ranging from basic multiplies to sines.
I didn't figure out why they had such a hard time inlining (perhaps templated code is harder?) but the bottom line is: there are very valid use cases for inlining manually and huge speedups to be gained.
If you're curious, this is where I implemented it: https://github.com/redorav/hlslpp
Yes. It will inline the function regardless of any other options set. See here.
One can also use __always_inline. I have been using that for C++ member functions with GCC 4.8.1, but I could not find a good explanation in the GCC docs.
Actually the answer is "no". All it means is that the function is a candidate for inlining even with optimizations disabled.

Maintainability Index

I have come across the recommended values for a Maintainability Index (MI) as follows:
85 and more: good maintainability
65-85: moderate maintainability
65 and below: difficult to maintain, with really bad pieces of code (big, uncommented, unstructured); the MI value can even be negative
Are these values dependent on technology? For example, is a value of 70 good for mainframes but difficult to maintain for Java?
Can one use the same yardstick independent of technology?
This is an explanation of the meaning of the Maintainability Index value.
In short, it is
MI = 171 - 5.2*ln(Halstead Volume) - 0.23*(Cyclomatic Complexity) - 16.2*ln(Lines of Code)
scaled between 0 and 100.
As is easy to see, this metric can be used for any procedural language.
The 65 and 85 thresholds come from the original paper introducing the Maintainability Index in 1992/1994.
Visual Studio has slightly adjusted the metric (multiplying by 100/171) to make it fit a 0-100 scale. Visual Studio uses 10 and 20 as thresholds.
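As a minimal sketch, here is the computation described above in C (the function names are mine; clamping to the 0-100 range is an assumption consistent with the scaling both answers mention):

    #include <math.h>

    /* Raw Maintainability Index, per the formula quoted above. */
    double mi_raw(double halstead_volume, double cyclomatic_complexity,
                  double lines_of_code)
    {
        return 171.0 - 5.2 * log(halstead_volume)
                     - 0.23 * cyclomatic_complexity
                     - 16.2 * log(lines_of_code);
    }

    /* Visual Studio's variant: multiply by 100/171, then clamp to 0..100. */
    double mi_visual_studio(double halstead_volume, double cyclomatic_complexity,
                            double lines_of_code)
    {
        double mi = mi_raw(halstead_volume, cyclomatic_complexity, lines_of_code)
                    * 100.0 / 171.0;
        return fmax(0.0, fmin(100.0, mi));
    }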
In general I would not take this metric and its thresholds too seriously: See also my blog post "Think Twice Before Using the Maintainability Index."
The Maintainability Index is an empirical formula; it was constructed as a model based on observation and adaptation. If you search for more detail, you will discover that the equation has to be calibrated for a specific language. The SEI version is calibrated for Pascal and C, using a bunch of programs, averaging 50 KLOC, maintained by Hewlett-Packard.
The calibration of the Visual Studio version is the same as the SEI version, but it was standardized to restrict the range from 0 to 100.
I don't think you can put a number on how easy a piece of code will be for a developer to maintain.
Different people will look at the same code and interpret it in their own way, depending on experience, culture, reading comprehension, etc.
That being said, metrics would definitely be different across technologies. You're looking at completely different syntax, conventions, terminology, etc. How can you quantify the difficulty difference between low level mainframe code and a high level language like Java or C#?
I think metrics are good for one thing, and one thing only: guidelines. In terms of code quality, I don't think they should be used for anything other than describing a code base. They should not be used as a determining factor of difficulty or "grok-ability".
It depends on how the "Maintainability Index" is calculated. This doesn't seem to me like something that would work across languages, simply because languages are so different.
It might seem reasonable to do a simple compare of "number of lines per function", but what happens when you try and compare C code which is full of pointers, or C++ code full of templates, or C# with LINQ queries, or Java with generics?
All of those things affect maintainability but can't be measured in any meaningful cross-language way, so how can you ever compare numbers between 2 languages?
