How to demultiplex BCL data with two different barcoding systems? - bioinformatics

I was wondering if anyone has experience in demultiplexing BCL files from Illumina with two different barcoding systems in one go?
We would like to know whether it is possible to use different barcoded Tn5 adaptors during tagmentation and then use those barcodes, in combination with the barcodes on the sequencing primers, to demultiplex the samples.
Does anyone have experience with this kind of analysis?
Is it possible to simply append the Tn5 sequence to the primer sequence before demultiplexing and use the result in the bcl2fastq run?
Thanks
Assa

I'm not very familiar with Tn5 adaptor libraries, but I will give it a shot.
In theory, it should be possible. bcl2fastq can demultiplex anything as long as you give it a proper samplesheet and a correct --use-bases-mask argument.
If you need to demultiplex all samples with a combination of two barcode systems, you will have to create a samplesheet with one line for each combination that can occur. You also need to know at which cycles the Tn5 adaptors are sequenced.
For example, let's say my original samplesheet looks like this:
[Data]
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,Index_Plate_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
ID-1,ID-1,,,A01,UDP0001,GAACTG,UDP0001,TCGTGG,project,
ID-2,ID-2,,,B01,UDP0002,AGGTCA,UDP0002,CTACAA,project,
I would use a --use-bases-mask argument like Y*,I6,I6,Y* to tell bcl2fastq that it needs to read 6 bases for each barcode.
Now, if your Tn5 adaptors are located just after your Illumina barcodes, you will need a samplesheet like:
[Data]
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,Index_Plate_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
ID-1-1,ID-1-1,,,A01,UDP0001,GAACTGATGC,UDP0001,TCGTGGATGC,project,
ID-1-2,ID-1-2,,,A01,UDP0001,GAACTGCGAT,UDP0001,TCGTGGCGAT,project,
ID-2-1,ID-2-1,,,B01,UDP0002,AGGTCAATGC,UDP0002,CTACAAATGC,project,
ID-2-2,ID-2-2,,,B01,UDP0002,AGGTCACGAT,UDP0002,CTACAACGAT,project,
Note the 4 bases added after the (previously defined) Illumina barcodes, in different combinations. Here we would use a --use-bases-mask argument like Y*,I10,I10,Y*. This is a deliberately simplified example to illustrate how bcl2fastq works.
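Writing one line per combination by hand gets tedious quickly, so here is a minimal Python sketch that generates the extended samplesheet from the toy example above. The dictionary layout, the function names, and the assumption that the same Tn5 barcode is appended to both index reads are mine, purely for illustration; adapt them to where the Tn5 barcodes really sit in your reads.

```python
import csv
import io
from itertools import product

# Toy inputs mirroring the example above (assumed layout, not a bcl2fastq format).
illumina = [
    {"id": "ID-1", "well": "A01", "udp": "UDP0001", "i7": "GAACTG", "i5": "TCGTGG"},
    {"id": "ID-2", "well": "B01", "udp": "UDP0002", "i7": "AGGTCA", "i5": "CTACAA"},
]
tn5_barcodes = ["ATGC", "CGAT"]

HEADER = [
    "Sample_ID", "Sample_Name", "Sample_Plate", "Sample_Well",
    "Index_Plate_Well", "I7_Index_ID", "index", "I5_Index_ID",
    "index2", "Sample_Project", "Description",
]


def build_rows(illumina, tn5_barcodes, project="project"):
    """Return one samplesheet row per (Illumina index pair, Tn5 barcode)."""
    rows = []
    for sample, (n, tn5) in product(illumina, enumerate(tn5_barcodes, start=1)):
        sample_id = f"{sample['id']}-{n}"
        rows.append([
            sample_id, sample_id, "", "", sample["well"],
            sample["udp"], sample["i7"] + tn5,   # index  = i7 + Tn5 barcode
            sample["udp"], sample["i5"] + tn5,   # index2 = i5 + Tn5 barcode
            project, "",
        ])
    return rows


def to_samplesheet(rows):
    """Serialize the rows as the [Data] section of a samplesheet."""
    buf = io.StringIO()
    buf.write("[Data]\n")
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buf.getvalue()


rows = build_rows(illumina, tn5_barcodes)
sheet = to_samplesheet(rows)
# Derive the mask from the combined barcode lengths.
mask = f"Y*,I{len(rows[0][6])},I{len(rows[0][8])},Y*"
print(sheet)
print(mask)
```

With the toy inputs this reproduces the four lines shown above and a mask with 10 index bases per read.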
Two major difficulties:
You have to know all the possible combinations in order to put them in the samplesheet. If you have some kind of UMI barcoding (random bases), you cannot enumerate them, so this approach will not work.
You must know precisely at which cycles the barcodes are read to use a --use-bases-mask argument accordingly.
Maybe I could understand better what you're trying to achieve with an example of a samplesheet you're using and the bcl2fastq command you're running.

Related

Returnn Switchboard data processing

Could anybody give me pointers on how to process the Switchboard dataset for training with RETURNN? I did see the BlissDataset class, which seems to be designed for Switchboard, but it's not clear to me what I should include in the paths given in the example:
Example:
./tools/dump-dataset.py "
{'class':'BlissDataset',
'path': '/u/tuske/work/ASR/switchboard/corpus/xml/train.corpus.gz',
'bpe_file': '/u/zeyer/setups/switchboard/subwords/swb-bpe-codes',
'vocab_file': '/u/zeyer/setups/switchboard/subwords/swb-vocab'}"
The Switchboard dataset has several folders with audio files, e.g. swb1_d2/data/*.sph, and transcripts in swb1_LDC97S62/swb_ms98_transcriptions/**/*
I'm not quite sure how to proceed with this to get a dataset that can be used to train RETURNN.
At our group (RWTH Aachen University), we use the config as it was published on GitHub. As you can see, this one uses ExternSprintDataset.
That implementation uses Sprint (publicly called RWTH ASR (RASR)) as an external tool (run in a subprocess) to handle the data (feature extraction, etc.). Sprint gets a Bliss XML file which describes all the segments with the path to the audio, the audio offsets, and the transcriptions, and it also gets further configs for the feature extraction and possibly other things. There is an open source version of RASR which should work, but it might be a bit involved to get it running.
The BlissDataset was planned to be a simpler replacement for that. However, the implementation is incomplete. Also, you would still need to generate the Bliss XML yourself in some way (we have used our own internal scripts to prepare it from the official LDC data).
So, unfortunately, there is no simple way yet. Actually, I think the easiest way would be to come up with yet another custom format, which might be similar to the one used by the LibriSpeechDataset implementation, or maybe even identical, so that you could reuse LibriSpeechDataset, or at least parts of it. That dataset implementation takes the data in a zip format which contains the transcripts in txt files and the audio in ogg or wav files. It uses librosa to do MFCC feature extraction (or other feature types). I planned to implement that for Switchboard and then reproduce the results, but I have not had time yet and I am not sure when I will get to it. If you want to try it on your own, I will be happy to help however I can. The starting point would be to look at LibriSpeechDataset and understand what its format looks like.

Music data format for polyphonic music visualization with Processing

I am interested in visualizing melodic contours of polyphonic music with Processing. It is still unclear to me, though, what the most convenient format for imported data (pitch and onset/duration) would be: tabular (e.g. Humdrum), XML (e.g. MEI, MusicXML), or JSON? Maybe another format?
Any suggestions/thoughts on this would be really helpful! Thanks.
Using MIDI files would be optimal because of the combination of these four reasons:
1. MIDI is widely used. You can export a .midi file from practically any score editor, and you can create your own by recording the input from a MIDI instrument.
2. You can already find .midi files of iconic polyphonic music on the web (Bach's counterpoint, Renaissance vocal music, etc.).
3. It contains only music/playback information. It doesn't contain notation information the way MusicXML does. So if you just want to see pitches and note position/duration (like in this video), then .midi will contain just what you need.
4. You can use the Java MIDI package in Processing, and it already contains everything you need to read MIDI files.
While other formats might also satisfy 1, 2, 3, or 4 individually, only MIDI satisfies all of them.
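Whatever file format you end up with, the drawing code only needs pitch, onset, and duration per note. As a rough sketch, assuming the notes have already been extracted as MIDI note numbers with onsets and durations in beats, the mapping to a piano-roll rectangle could look like this. It is written in Python for brevity; the function names, canvas size, and pitch range are hypothetical, and the same arithmetic ports directly to a Processing sketch.

```python
def note_to_rect(pitch, onset, duration, *, width=800, height=400,
                 total_time=16.0, low=36, high=96):
    """Map one note to an (x, y, w, h) rectangle in pixels.

    pitch is a MIDI note number; onset and duration share the unit of
    total_time (e.g. beats). The origin is the top-left corner.
    """
    x = onset / total_time * width
    w = duration / total_time * width
    row_h = height / (high - low + 1)
    # Higher pitches are drawn nearer the top of the canvas.
    y = (high - pitch) * row_h
    return (x, y, w, row_h)


def pitch_to_hz(pitch):
    """Equal-tempered frequency of a MIDI note number (A4 = 69 = 440 Hz)."""
    return 440.0 * 2 ** ((pitch - 69) / 12)


# One rectangle per note; colour them per voice to show each contour.
notes = [(60, 0.0, 1.0), (64, 1.0, 1.0), (67, 2.0, 2.0)]
rects = [note_to_rect(p, on, dur) for p, on, dur in notes]
```

Connecting the rectangles of one voice with line segments then gives the melodic contour for that voice.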
The best answer I can give you is that you should put together a simple hello world program that tests out each format and see which one you like the best.
In the end, you're the one that has to deal with the code, so only you can really decide on the best format.

Which technique for locating many similar base pointer addresses (fast)?

I am trying to find some base pointer addresses in a Windows application that I want to write a bot for (it's not a game, it's an online broker). I know how to find base pointer addresses, but I do it with Cheat Engine (find addresses, set breakpoints, search for the output addresses, and so on), which takes a very long time for base pointers with 6+ offsets. Is there a much faster technique for scraping them out of memory?
Here is my assumption: as you can see in this picture, there are many similar entries (Forex entries), all structured alike. Each has an address for the Ask value and one for the Bid value, and these are the pointers I need. The values are stored as doubles. Maybe I can find multiple addresses at once by deriving one from another. I was thinking of object-oriented programming, where many instances have addresses close to each other. So, is there a way to find several at once, and quickly?
I tried a few things with OllyDbg and didn't find any nearby addresses (but my OllyDbg skills are not great, and I still don't know all of its functions). Do you have a better solution for finding them faster? I don't really want to write anything in assembler, but if necessary, I can. It would be great if you could help. There are 89 entries, and each one takes me something like 20 to 30 minutes, which would be awful.
Cheers!
Filthy Frank
Using pointers and offsets is not the correct way to go about this. On the back end they're just using HTTP and an API. You should either use that directly or hook the function that does it and then work with the data right after it is received by the client.

Program that keeps track of packages with barcode

I am currently implementing a web app with the goal of keeping track of the location of all the packages in a company I am working for. Our plan is to have a barcode for each package and scan that barcode at the different sectors of the company, indicating where they are. The problem is that I have no idea where to start. I've done some research on Google but haven't found much. My main questions are:
How do barcodes work in the first place?
How do you program with barcodes? Is there a specific language I should use? Do I have to buy anything?
How do you read barcodes and enter them in your program and how do you generate them in the first place?
Any hints on how I should proceed with my implementation?
I look forward to hearing back from you as I need to implement this as soon as possible.
This is a pretty broad question, but I'll do my best to answer:
How do barcodes work in the first place?
Essentially, for this type of project, you can think of the barcodes you're going to be implementing as merely serial numbers. If you really want to know how barcodes work, Wikipedia has a pretty good write-up, but at this level, just think of them as a serial number encoded in such a way that a machine can read it.
In your web app, you'd be taking a number (say, 42) that has no meaning on its own, and associating it with a package and a location.
How do you program with barcodes? Is there a specific language I should use? Do I have to buy anything?
You don't really "program" with barcodes per se... Again, it's just a machine-readable implementation of some kind of information. In terms of "specific language", just build your web app as you already are, and add, say, an extra integer field. The integer doesn't mean anything on its own; it's just going to be what's printed in the barcode. In this use case, you don't even have to have a barcode per se: you could just write it on the box! The usefulness of barcodes comes from the speed and accuracy of data entry, since a device scans the barcode and types it in instead of a human.
How do you read barcodes and enter them in your program and how do you generate them in the first place?
It doesn't sound like you're at the point where you're doing any kind of machine vision or anything, so the most common entry method would be to buy a basic USB barcode scanner, like a Symbol LS2208. Use the manual that comes with it (or you can download the manual) to configure it as a keyboard emulation device - that way, your user would just select a field in the web app, scan, and the scanner would type out whatever was stored in the barcode (in the example above, the number 42).
As far as generating, depending on your volume, you have lots of options. For low volumes, you can find a generator online and print them out onto Avery label-type sheets using an inkjet or laser printer. You could also find a barcode font and print right from, say, Word, onto a label sheet. For higher volumes, you could purchase specialized software and use a label printer, or you can even write this yourself. Personally, I have a Zebra LP2844 with a network interface, and I wrote some custom PHP to send commands in the printer's native language (EPL2) over a socket to print onto roll labels.
EDIT: You'd probably want to use either Code128 or Code39. These are two different "symbologies" (types of barcode) that are appropriate for what it sounds like you're doing. They're 1-dimensional (like UPC codes and not like QR codes), so a cheap reader can decode them, and they're pretty flexible and VERY common.
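One practical detail if you take the barcode-font route with Code 39: the symbology uses an asterisk as its start and stop character, so most Code 39 fonts expect the text to be wrapped in asterisks before you print it. A small Python sketch of that preparation step follows; the function name is made up for illustration, and you should check your font's documentation for its exact requirements.

```python
# The 43 data characters of standard Code 39 (the asterisk is reserved
# for start/stop and must not appear in the data itself).
CODE39_CHARS = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ-. $/+%")


def code39_text(serial):
    """Return the string to type in a Code 39 font for the given serial."""
    text = str(serial).upper()
    bad = sorted(set(text) - CODE39_CHARS)
    if bad:
        raise ValueError(f"characters not valid in Code 39: {bad}")
    return f"*{text}*"


print(code39_text(42))       # text to render in the barcode font
print(code39_text("box-7"))
```

Code 128 is denser and supports more characters, but it needs a checksum and is usually better left to a generator library or the label printer itself.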
Any hints on how I should proceed with my implementation?
Just think of barcodes, the way that it sounds like you want to use them, as arbitrary serial numbers that don't mean anything on their own. For example, doing this sort of box tracking in a previous warehouse environment, we printed THOUSANDS of unique serial numbered barcode labels. Those labels didn't have ANY value until they were attached to a box and a picker started to put stuff into that box. They were just numbers. Just remember to keep them unique.
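To make the "it's just a serial number" idea concrete, here is a minimal sketch of the data model using Python's built-in sqlite3 module. The table names, column names, and functions are hypothetical; a real web app would use whatever database and framework it is already built on. The scanner, acting as a keyboard, simply delivers the barcode as a string.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create the two tables: known packages and the scan history."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS packages (
            barcode     TEXT PRIMARY KEY,   -- uniqueness enforced here
            description TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS scans (
            id         INTEGER PRIMARY KEY AUTOINCREMENT,
            barcode    TEXT NOT NULL,
            sector     TEXT NOT NULL,
            scanned_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
    """)
    return conn


def register_package(conn, barcode, description):
    """Attach a meaning to a previously meaningless serial number."""
    with conn:
        conn.execute(
            "INSERT INTO packages (barcode, description) VALUES (?, ?)",
            (barcode, description),
        )


def record_scan(conn, barcode, sector):
    """Store one scan event; reject barcodes that were never registered."""
    known = conn.execute(
        "SELECT 1 FROM packages WHERE barcode = ?", (barcode,)
    ).fetchone()
    if known is None:
        raise KeyError(f"unknown barcode: {barcode}")
    with conn:
        conn.execute(
            "INSERT INTO scans (barcode, sector) VALUES (?, ?)",
            (barcode, sector),
        )


def current_sector(conn, barcode):
    """Return the sector of the most recent scan, or None if never scanned."""
    row = conn.execute(
        "SELECT sector FROM scans WHERE barcode = ? ORDER BY id DESC LIMIT 1",
        (barcode,),
    ).fetchone()
    return row[0] if row else None


conn = open_db()
register_package(conn, "42", "spare parts")
record_scan(conn, "42", "receiving")
record_scan(conn, "42", "warehouse-B")
print(current_sector(conn, "42"))
```

Keeping every scan as its own row, rather than overwriting a single location field, gives you the package's full movement history for free.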

Locating OEPs in Packed EXE Files

Are there any general rules for reliably locating OEPs (Original Entry Points) in packed .exe files? What OEP clues should I look for in the debugged assembly code?
Say there is a Windows .exe file packed with PC-Guard 5.06.0400 and I wish to unpack it. Therefore, the key condition is finding the OEP within the freshly extracted block of code.
I would use the common debugger OllyDBG to do that.
In the general case, no. It depends heavily on the packer. In the most common case, the packer may replace some of the code at the OEP with other code.
This depends solely on the packer and the algorithms it uses to pack and/or virtualize code. Seeing as you are using OllyDbg, I'd suggest checking out tuts4you, Woodmann, and OpenRCE. They have many plugins (IIRC there is one designed for finding OEPs in obfuscated code, but I have no clue how well it performs) and Olly scripts for unpacking various packers (from which you may be able to pick up hints for a certain type of packer). They also have quite a few papers and tutorials on the subject, which may or may not be of use.
PC Guard doesn't seem to get much attention, but the video link and info here should be of help (praise be to Google cache!)
It's hard to point out any simple strategy and claim that it will work in general, because the business of packer tools is to make OEP finding a very hard problem. Besides, with a good packer, finding the OEP is still not enough. That being said, I do have some suggestions.
I would suggest that you read this paper on the Justin unpacker. The authors use heuristics that were reasonably effective at the time and that you might be able to get some mileage from. These heuristics will at least reduce the number of candidate entry points to a manageable number:
A study of the packer problem and its solutions (2008)
by Fanglu Guo, Peter Ferrie, Tzi-cker Chiueh
There are also some web-analysis pages that can tell you a lot about your packed program. For example, the malware analyzer at:
http://eureka.cyber-ta.org/
Here's another one that is currently down but has done a reasonable job in the past, and I presume it will be up again soon:
http://bitblaze.cs.berkeley.edu/renovo.html

Resources