How to use htslib/samtools to transform SAM/BAM reads? - bioinformatics

I'm using the htslib library for reading SAM/BAM files, and it works perfectly. I can also write the alignments back to a new SAM/BAM file.
For example, the following code prints the DNA sequence of an alignment:
bam1_t *b = ...;
int i;
for (i = 0; i < b->core.l_qseq; ++i) {
printf("%c", seq_nt16_str[bam_seqi(bam_get_seq(b),i)]);
}
Question: How do I change the query sequence? Say, change the first letter to 'T'? bam_get_seq returns the sequence of a read, but there seems to be no bam_set_seq function. Ideally, I'm looking for something like:
bam_set_seq(b, "TTTT");  /* my new DNA sequence */
If I can figure out how to do the update, I know how to write the information to a new SAM/BAM file.
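For completeness, here is a minimal sketch of one way to do the update, assuming the new sequence has the same length as the old one (changing b->core.l_qseq would mean rebuilding the whole variable-length data block, quality string included). bam_get_seq returns the sequence packed two bases per byte, in the same layout that bam_seqi unpacks, and seq_nt16_table is htslib's lookup table from an ASCII base to its 4-bit code. The helper set_base below is hypothetical, not part of the htslib API; if your htslib version is recent enough, it may also provide a bam_set_seqi macro for the same purpose.
#include <stdint.h>
#include <htslib/sam.h>

/* Hypothetical helper: overwrite base i of the packed query sequence in place.
   Assumes the read length (b->core.l_qseq) is not being changed. */
static void set_base(bam1_t *b, int i, char base)
{
    uint8_t *s = bam_get_seq(b);                         /* packed, two bases per byte */
    uint8_t code = seq_nt16_table[(unsigned char)base];  /* ASCII base -> 4-bit code   */
    if (i % 2 == 0)
        s[i / 2] = (s[i / 2] & 0x0f) | (code << 4);      /* even index: high nibble */
    else
        s[i / 2] = (s[i / 2] & 0xf0) | code;             /* odd index: low nibble   */
}
For example, set_base(b, 0, 'T'); changes the first letter to 'T', after which the record can be written out with the existing SAM/BAM writing code.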

Related

Why should we use OutputStream.write(byte[] b, int off, int len) instead of OutputStream.write(byte[] b)?

Sorry, everybody. It's a Java beginner question, but I think it will be helpful for a lot of Java learners.
FileInputStream fis = new FileInputStream(file);
OutputStream os = socket.getOutputStream();
byte[] buffer = new byte[1024];
int len;
while((len=fis.read(buffer)) != -1){
    os.write(buffer, 0, len);
}
The code above is part of a FileSenderClient class, which sends files from a client to a server using java.io and java.net.Socket.
My question is: in the above code, why should we use
os.write(buffer, 0, len)
instead of
os.write(buffer)
To ask this another way: what is the point of having a "len" parameter for the "OutputStream.write()" method?
It seems both versions of the code work fine.
while((len=fis.read(buffer)) != -1){
    os.write(buffer, 0, len);
}
Because you only want to write data that you actually read. Consider the case where the input consists of N buffers plus one byte. Without the len parameter you would write (N+1)*1024 bytes instead of N*1024+1 bytes. Consider also the case of reading from a socket, or indeed the general case of reading: the actual contract of InputStream.read() is that it transfers at least one byte, not that it fills the buffer. Often it can't, for one reason or another.
It seems both versions of the code work fine.
No, they're not.
It actually does not work in the same way.
It is very likely that you used a very small text file to test. But if you look carefully, you will find that there are a lot of extra bytes at the end of the file you received, and that the received file is larger than the file you sent.
The reason is that you created a byte array of size 1024, but the last read() does not necessarily fill it. The unused tail of the array is left as NUL (zero) bytes (or stale bytes from a previous read), and when you write the whole buffer, those bytes are written into the file as well; they show up as spaces in Windows Notepad.
If you use a more capable text editor such as Notepad++ or Sublime Text to view the file you received, you will see these NUL characters.

How to modify this kind of file in Ruby

I have a file like:
Fruit.Store={
#order:123, order:345, order:456
#order:789
"customer-id:12345,item:store/apple" = 10;
"customer-id:23456,item:store/banana" = 10;
"customer-id:23456,item:store/watermelon" = 10;
#order:987
"customer-id:67890,item:store/pear" = 10;
}
Except for the comments, each line has the same format: customer-id and item:store/ are fixed, and customer-id is a 5-digit number. There are about 1000 unique lines in the file.
When a new order is placed for the same customer-id and fruit type with a different quantity, I want the order id added to the comment line above and the quantity updated. For example, if a new order 001 is placed with the information "customer-id:23456,item:store/watermelon" = 5; then we should have a new file:
Fruit.Store={
#order:123, order:345, order:456
#order:789, order:000
"customer-id:12345,item:store/apple" = 10;
"customer-id:23456,item:store/banana" = 10;
"customer-id:23456,item:store/watermelon" = 5;
#order:987
"customer-id:67890,item:store/pear" = 10;
}
Is it possible to do this in an efficient way? Since the file has to be read and written line by line, how can we detect the matching line and go back to the previous line to make the modification? Thank you.
In short: no, it is not possible to do so in an efficient way. Your best bet is to open separate files for reading and writing, but even then you'll effectively be rewriting the entire file over and over.
Ultimately, you should be using some sort of relational database, like SQL. Those databases were practically invented for this exact use case.
I really don't like to tell people that the only solution is to do something entirely different from what they're doing, but in this case I can't stress enough how poorly text files scale for managing data.

What's the Input key of MapReduce by default?

I'm using MapReduce based on Hadoop 2.6.0, and I want to skip the first six lines of my data file, so I use
if(key.get()<6)
return ;
else
{do ....}
in my map() function.
But it was not right. I found that the input key of map() is not the line number of the file; the key is the cumulative length of the preceding lines, i.e. a byte offset. Why? It doesn't match what many books say.
If you look at the code, it is the actual byte offset into the file, not the line number.
If you want to skip the first n lines of your file, you probably have to write your own input format / record reader, or keep a line counter in the mapper logic, like this:
int lines = 0;

public void map(LongWritable key, Text value, Context context) {
    if (++lines <= 6) { return; }  // skip the first six lines
    // process the remaining lines ...
}
This obviously doesn't work if the text file is split across multiple input splits (i.e., if there is more than one mapper), so writing a dedicated InputFormat is the cleanest way to solve this problem.
Another trick would be to measure how many bytes the first n lines occupy in that specific file and then simply skip that many bytes at the start.

How would I convert a txt file containing a lot of symbols into an array?

So I just have a quick question. The program is supposed to create a character array and get the content from a text file containing a lot of random symbols like &, ?, !, letters, and numbers. I am not allowed to create separate arrays and put them into the 2D array instead. How would I go about doing this? I already know the number of rows and columns because it is given at the top of the file, before all the symbols. Here's what I have so far:
char [][]charArray=new char[a][b];
for(int z=0;z<charArray.length;z++)
{
    for(int y=0;y<charArray[y].length;y++)
    {
        charArray[y]=fileReader.next();
    }
}
So a is the number of rows, and b is the number of columns to read from. When I run the program, it says that it is expecting a char[] but found a String, and the error is called an incompatible types error.
Also, PS: fileReader is my Scanner for reading from a file. Thanks!
First of all, you need to use more descriptive names for your variables. For example, why name a variable a when it really represents the number of rows in the file? Instead, use numRows (and likewise numCols instead of b). Also, you really should name your Scanner scanner: there is a FileReader class, and the name fileReader is misleading because it makes everyone think you're using a FileReader instead of a Scanner. Finally, the brackets used to declare an array type in Java are normally placed adjacent to the type name, as in char[][] instead of char [][]. This does not change the way the code executes, but it conforms better to common convention.
Now, to your problem. You stated that the number of rows and columns is declared at the beginning of the file, so this solution assumes the file does in fact contain numRows rows and numCols columns. Basically, next() returns a String. You can use String.toCharArray() to convert that String to a char[], and then simply copy the characters into the appropriate positions in your charArray.
Scanner scanner = new Scanner(theFile);
char[][] charArray = new char[numRows][numCols];
for (int i = 0; i < numRows; i++) {
    final char[] aLine = scanner.next().toCharArray();  // one row of symbols
    for (int j = 0; j < aLine.length; j++) {
        charArray[i][j] = aLine[j];
    }
}

Process a bunch of strings efficiently

I need to read some data from a file in chunks of 128 MB, and then do some processing on each line. The naive way is to use split to convert the string into a collection of lines and then process each line, but that may not be efficient, since it creates a collection that simply stores a temporary result, which could be costly. Is there a way with better performance?
The file is huge, so I kicked off several threads; each thread picks up a 128 MB chunk. In the following snippet, rawString is one 128 MB chunk.
randomAccessFile.seek(start)
randomAccessFile.read(byteBuffer)
val rawString = new String(byteBuffer)
val lines=rawString.split("\n")
for(line <- lines){
...
}
It'd be better to read text line by line:
import scala.io.Source
for(line <- Source.fromFile("file.txt").getLines()) {
...
}
I'm not sure what you're going to do with the trailing bits of lines at the beginning and end of the chunk; I'll leave that to you to figure out. This solution captures everything delimited on both sides by \n.
Anyway, assuming that byteBuffer is actually an array of bytes and not a java.nio.ByteBuffer, and that you're okay with handling only Unix line endings, you would want something like:
def lines(bs: Array[Byte]): Array[String] = {
  // record the index of every '\n' in the chunk
  val xs = Array.newBuilder[Int]
  var i = 0
  while (i < bs.length) {
    if (bs(i) == '\n') xs += i
    i += 1
  }
  val ix = xs.result
  // build one String from the bytes between each pair of consecutive newlines
  val ss = new Array[String](0 max (ix.length - 1))
  i = 1
  while (i < ix.length) {
    ss(i - 1) = new String(bs, ix(i - 1) + 1, ix(i) - ix(i - 1) - 1)
    i += 1
  }
  ss
}
Of course this is rather long and messy code, but if you're really worried about performance, this sort of thing (heavy use of low-level operations on primitives) is the way to go. It also takes only about 3x the memory of the chunk on disk instead of about 5x (for mostly or entirely ASCII data), since you don't need the full string representation around.
