Data sorting approach

I have three rigs for motor testing. Their output gets saved in three folders, one rig per folder, and every rig outputs thousands of .txt files containing the parameters I'll later plot, plus other information like motor number, motor load and date. I would like to sort these .txt files into folders by motor number and motor load, like Motor 1 -> Motor load x -> all the .txt files that contain test parameters for motor 1 at load x. What would be the fastest approach?
I have to mention that I am a beginner in programming, and I do not expect a fully coded algorithm, but rather some generic methods to adapt and apply. Thanks!
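As a rough illustration of one generic approach (not a definitive solution), here is a short Python sketch. It assumes each .txt file contains lines like "Motor: 1" and "Load: 50" - adjust the parsing to whatever your rigs actually write - and the folder names rig1/rig2/rig3 and sorted/ are made up.
import shutil
from pathlib import Path

source_dirs = [Path('rig1'), Path('rig2'), Path('rig3')]   # hypothetical rig folders
target_root = Path('sorted')

def read_value(path, key):
    """Return the value after 'key:' in the file, or None if not found."""
    with open(path, errors='ignore') as f:
        for line in f:
            if line.startswith(key + ':'):
                return line.split(':', 1)[1].strip()
    return None

for rig in source_dirs:
    for txt in rig.glob('*.txt'):
        motor = read_value(txt, 'Motor')
        load = read_value(txt, 'Load')
        if motor is None or load is None:
            continue                                   # skip files we cannot parse
        dest = target_root / f'Motor {motor}' / f'Motor load {load}'
        dest.mkdir(parents=True, exist_ok=True)
        shutil.copy2(txt, dest / txt.name)             # use shutil.move to move instead of copy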

Related

How to create a .GTF file?

I am new to bioinformatics and programming. I would greatly appreciate some help with step-by-step instructions on how to create a .GTF file.
I have two cancer cell lines with different green fluorescent protein (GFP) variants knocked in to the genome of each cell line. The idea is that the expression of GFP can be used to distinguish cancer cells from non-cancer cells. I would like to count GFP reads in all cancer cells in a single cell RNA-seq experiment. The single cell experiment was performed on the 10X Chromium platform, on organoids composed of a mix of these cancer cells and non-cancer cells. Next generation sequencing was then performed, and the reference genome is the human genome sequence, GRCh38.
To 'map' and count GFP reads I was told to create a .GTF file which holds the location information, and this file will be used retrospectively to add GFP to the human genome sequence. I have the FASTA sequences for both GFP variants, which I can upload if requested.
Where do I start with creation of a .GTF file? Do I create this file in Excel, or with, for example, a BASH script in a terminal? I have a link to a Wellcome Trust genome website (https://www.ensembl.org/info/website/upload/gff.html?redirect=no) but it is not clear what practical/programming steps are needed. From my reading it seems a GFF (GFF3?) file is needed as an intermediate step. Step-by-step instructions to create the .GTF file would be very welcome. Thanks in advance.
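As a rough sketch only (not a definitive recipe): a GTF file is just a tab-separated text file with nine columns, so it can be written with a few lines of Python or in any text editor. The example below assumes each GFP sequence will be appended to the GRCh38 FASTA as its own contig; the contig names and lengths here are placeholders to replace with your real sequence names and lengths.
gfp_contigs = {'GFP_variant1': 720, 'GFP_variant2': 717}   # placeholder names and lengths

with open('gfp.gtf', 'w') as out:
    for name, length in gfp_contigs.items():
        # GTF is 9 tab-separated columns: seqname, source, feature, start, end,
        # score, strand, frame, attributes (coordinates are 1-based, inclusive)
        attributes = f'gene_id "{name}"; transcript_id "{name}";'
        out.write('\t'.join([name, 'custom', 'exon', '1', str(length),
                             '.', '+', '.', attributes]) + '\n')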

Algorithm to remove vocal from sound track with bytes in sample code

I want to remove the vocals from MP3 sound tracks (remove the singer's voice from the song file). I turned the song file into byte lists, but I don't know how to remove the vocals from those bytes.
Does anybody know an algorithm for removing vocals at the byte level? (I would be happy if you explained it with sample code in any language [I work with Dart].)
I read this article, but my bytes aren't split into left and right channels:
Algorithm to remove vocal from sound track
Removing a voice isn't so simple. Usually it's a combination of several tricks, like band-stop filters, spectrographic analysis (i.e. you'll need an FFT, Fast Fourier Transform, to switch to the frequency domain), and so on.
Simply "subtracting" the two channels (i.e. phase cancellation) can't work if the original song wasn't properly recorded in the studio with the voice being the ONLY centered track. If anything else (like drums or bass) is ALSO centered, you're dead.
Also, no algorithm would work "out-of-the-box": you'll need to set some parameters in order to let it work properly.
For example, to setup band-stop filters:
Common human voices (spoken): 500-1000Hz
Male reference range: 64-523Hz
Female reference range: 160-1200Hz
Male bass: 82-392Hz
Male baritone: 123-493Hz
Male tenor: 164-698Hz
Female bass: 82-392Hz
Female mezzo-soprano: 123-493Hz
Female soprano: 220-1100Hz
So if your song's singers are a male bass and a female soprano, you'll need to cut all frequencies from 82 to 392 Hz (male) and from 220 to 1100 Hz (female) - so, in the end, everything between 82 and 1100 Hz... That won't leave many instruments untouched!
So you'll need to put markers on your timeline for when each singer is singing, and cut those bands ONLY during these short periods - so you don't damage the instruments too much.
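As a rough sketch of such a band-stop filter with SciPy (assuming mono floating-point samples at a 44.1 kHz sample rate; the band edges are the 82-1100 Hz example above, and in practice you'd apply it only over the marked vocal segments):
import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_stop(samples, low_hz=82.0, high_hz=1100.0, sample_rate=44100):
    # 4th-order Butterworth band-stop filter applied forwards and backwards (zero phase)
    sos = butter(4, [low_hz, high_hz], btype='bandstop', fs=sample_rate, output='sos')
    return sosfiltfilt(sos, samples)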
The "right" way should be to try most of these tricks, on the tiniest possible durations (i.e. when a human is singing). You should first start to tag all these intervals so that you can try each algorithm on each sound sequence, and keep each time only the best one.
But if you're already lost by a "simple" phase cancellation, you may never be able to properly clean your song from its vocals. It's a quite advanced signal processing, and it will be even harder to apply if you don't know anything about signal processing.

Speed Up Gensim's Word2vec for a Massive Dataset

I'm trying to build a Word2vec (or FastText) model using Gensim on a massive dataset composed of 1000 files, each containing ~210,000 sentences, with each sentence containing ~1000 words. Training was done on a machine with 185 GB of RAM and 36 cores.
I validated that
gensim.models.word2vec.FAST_VERSION == 1
First, I've tried the following:
files = gensim.models.word2vec.PathLineSentences('path/to/files')
model = gensim.models.word2vec.Word2Vec(files, workers=-1)
But after 13 hours I decided it was running for too long and stopped it.
Then I tried building the vocabulary based on a single file, and train based on all 1000 files as follows:
files = os.listdir('path/to/files')
model = gensim.models.word2vec.Word2Vec(min_count=1, workers=-1)
model.build_vocab(corpus_file=files[0])
for file in files:
    model.train(corpus_file=file, total_words=model.corpus_total_words, epochs=1)
But I checked a sample of word vectors before and after the training, and there was no change, which means no actual training was done.
I could use some advice on how to run this quickly and successfully. Thanks!
Update #1:
Here is the code to check vector updates:
file = 'path/to/single/gziped/file'
total_words = 197264406 # number of words in 'file'
total_examples = 209718 # number of records in 'file'
model = gensim.models.word2vec.Word2Vec(iter=5, workers=12)
model.build_vocab(corpus_file=file)
wv_before = model.wv['9995']
model.train(corpus_file=file, total_words=total_words, total_examples=total_examples, epochs=5)
wv_after = model.wv['9995']
The vectors wv_before and wv_after turn out to be exactly the same.
There's no facility in gensim's Word2Vec to accept a negative value for workers. (Where'd you get the idea that would be meaningful?)
So, it's quite possible that's breaking something, perhaps preventing any training from even being attempted.
Was there sensible logging output (at level INFO) suggesting that training was progressing in your trial runs, either against the PathLineSentences or your second attempt? Did utilities like top show busy threads? Did the output suggest a particular rate of progress & let you project-out a likely finishing time?
I'd suggest using a positive workers value and watching INFO-level logging to get a better idea what's happening.
Unfortunately, even with 36 cores, using a corpus iterable sequence (like PathLineSentences) puts gensim Word2Vec in a mode where you'll likely get maximum throughput with a workers value in the 8-16 range, using far fewer than all your cores. But it will do the right thing on a corpus of any size, even one that's being assembled on the fly by the iterable sequence.
Using the corpus_file mode can saturate far more cores, but you should still specify the actual number of worker threads to use - in your case, workers=36 - and it is designed to work from a single file containing all the data.
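For illustration, a corpus_file call might look like this (assuming all the data has been concatenated into one plain-text file, one sentence per line with space-separated tokens; 'all_sentences.txt' is a made-up name):
model = gensim.models.word2vec.Word2Vec(corpus_file='all_sentences.txt', workers=36)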
Your code which attempts to train() many times with corpus_file has lots of problems, and I can't think of a way to adapt corpus_file mode to work on your many files. Some of the problems include:
you're only building the vocabulary from the 1st file, which means any words only appearing in other files will be unknown and ignored, and any of the word-frequency-driven parts of the Word2Vec algorithm may be working on unrepresentative statistics
the model builds its estimate of the expected corpus size (e.g. model.corpus_total_words) from the build_vocab() step, so every train() will behave as if that size is the total corpus size, both in its progress-reporting and in its management of the internal alpha learning-rate decay. So those logs will be wrong, and the alpha will be mismanaged in a fresh decay on each train(), resulting in the learning rate nonsensically sawtoothing up and down across all the files.
you're only iterating over each file's contents once, which isn't typical. (It might be reasonable in a giant 210-billion-word corpus, though, if every file's text is equally and randomly representative of the domain. In that case, one pass over the full corpus might be as good as iterating five times over a corpus 1/5th the size. But it would be a problem if some words/patterns-of-usage are all clumped in certain files - the best training interleaves contrasting examples throughout each epoch and across all epochs.)
min_count=1 is almost always unwise with this algorithm, and especially so in large corpora with typical natural-language word frequencies. Rare words, especially those appearing only once or a few times, make the model gigantic but never get good word-vectors, and keeping them in acts like noise that interferes with the improvement of other, more common words.
I recommend:
Try the corpus iterable sequence mode, with logging and a sensible workers value, to at least get an accurate read of how long it might take; a sketch follows this list. (The longest step will be the initial vocabulary scan, which is essentially single-threaded and must visit all data. But you can .save() the model after that step, then later re-.load() it, tinker with settings, and try different train() approaches without repeating the slow vocabulary survey.)
Try aggressively-higher values of min_count (discarding more rare words for a smaller model & faster training). Perhaps also try aggressively-smaller values of sample (like 1e-05, 1e-06, etc) to discard a larger fraction of the most-frequent words, for faster training that also often improves overall word-vector quality (by spending relatively more effort on less-frequent words).
If it's still too slow, consider whether a smaller subsample of your corpus might be enough.
Consider the corpus_file method if you can roll much or all of your data into the single file it requires.
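As a sketch of the first recommendation above (the parameter values are illustrative, not tuned for your data):
import logging
import gensim

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentences = gensim.models.word2vec.PathLineSentences('path/to/files')
model = gensim.models.word2vec.Word2Vec(workers=12, min_count=50, sample=1e-05)
model.build_vocab(sentences)             # the slow, essentially single-threaded vocabulary survey
model.save('w2v_vocab_only.model')       # checkpoint, so the survey never has to be repeated
model.train(sentences, total_examples=model.corpus_count, epochs=5)
model.save('w2v_trained.model')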

Understanding Perceptrons

I just started a Machine learning class and we went over Perceptrons. For homework we are supposed to:
"Choose appropriate training and test data sets of two dimensions (plane). Use 10 data points for training and 5 for testing. " Then we are supposed to write a program that will use a perceptron algorithm and output:
a comment on whether the training data points are linearly separable
a comment on whether the test points are linearly separable
your initial choice of the weights and constants
the final solution equation (decision boundary)
the total number of weight updates that your algorithm made
the total number of iterations made over the training set
the final misclassification error, if any, on the training data and also on the test data
I have read the first chapter of my book several times and I am still having trouble fully understanding perceptrons.
I understand that you change the weights when a point is misclassified, until none are misclassified anymore. I guess what I'm having trouble understanding is:
What do I use the test data for and how does that relate to the training data?
How do I know if a point is misclassified?
How do I go about choosing test points, training points, threshold or a bias?
It's really hard for me to know how to make up one of these without my book providing good examples. As you can tell, I am pretty lost; any help would be much appreciated.
What do I use the test data for and how does that relate to the training data?
Think of a Perceptron as a young child. You want to teach the child how to distinguish apples from oranges. You show it 5 different apples (all red/yellow) and 5 oranges (of different shapes) while telling it what it sees at every turn ("this is an apple, this is an orange"). Assuming the child has perfect memory, it will learn to understand what makes an apple an apple and an orange an orange if you show it enough examples. It will eventually start to use meta-features (like shape) without you explicitly telling it. This is what a Perceptron does. After you have shown it all the examples, you start again at the beginning; this is called a new epoch.
What happens when you want to test the child's knowledge? You show it something new: a green apple (not just yellow/red), a grapefruit, maybe a watermelon. Why not show the child the exact same data as during training? Because the child has perfect memory, it will only tell you what you told it. You won't see how well it generalizes from known to unseen data unless you test it on data that you never showed it during training. If the child performs horribly on the test data but gets 100% on the training data, you will know that it has learned nothing - it's simply repeating what it was told during training. You trained it too long; it only memorized your examples without understanding what makes an apple an apple because you gave it too many details - this is called overfitting. To prevent your Perceptron from only (!) recognizing training data, you'll have to stop training at a reasonable time and find a good balance between the sizes of the training and test sets.
How do I know if a point is misclassified?
If it's different from what it should be. Let's say an apple has class 0 and an orange has class 1 (here you should start reading about single/multi-layer Perceptrons and how neural networks of multiple Perceptrons work). The network will take your input. How it's coded is irrelevant here; let's say the input is a string "apple". Your training set then is {(apple1,0), (apple2,0), (apple3,0), (orange1,1), (orange2,1).....}. Since you know the class beforehand, if the network outputs 1 for the input "apple1" you compute (targetValue - actualValue) = (0 - 1) = -1. A non-zero result means the network gave a wrong output. Compare this to the delta rule and you will see that this small expression is part of the larger update equation. Whenever you get a non-zero result, you perform a weight update. If the target and actual values are the same, you always get 0 and you know that the network didn't misclassify.
How do I go about choosing test points, training points, threshold or a bias?
Practically, the bias and threshold aren't "chosen" per se. The bias is trained like any other weight using a simple "trick": treat the bias as an additional input unit with constant value 1. This means the actual bias value is encoded in that additional unit's weight, and the learning algorithm will learn the bias for us automatically.
Depending on your activation function, the threshold is predetermined. For a simple perceptron, classification works as follows: compute the weighted sum of the inputs and output 1 if it is at or above the threshold, otherwise output 0.
Since we use a binary output (between 0 and 1), it's a good start to put the threshold at 0.5, since that's exactly the middle of the range [0,1].
Now to your last question about choosing training and test points: this is quite difficult, and you learn it by experience. At your stage, you start off by implementing simple logical functions like AND, OR, XOR, etc. There it's trivial: you put everything in your training set and test with the same values as your training set (since for x XOR y etc. there are only 4 possible inputs: 00, 10, 01, 11). For complex data like images, audio, etc., you'll have to try and tweak your data and features until you feel the network can work with them as well as you want it to.
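To tie the pieces together, here is a small Python/NumPy sketch of the plain perceptron described above: bias as an extra always-1 input, a 0.5 threshold on a 0/1 output, and the (target - output) update rule. The data points are made up for illustration.
import numpy as np

X = np.array([[2.0, 1.0], [1.5, 2.0], [3.0, 2.5],      # class 0 ("apples")
              [6.0, 5.0], [7.0, 6.5], [5.5, 7.0]])     # class 1 ("oranges")
y = np.array([0, 0, 0, 1, 1, 1])

X = np.hstack([X, np.ones((len(X), 1))])   # append the constant bias input of 1
w = np.zeros(X.shape[1])                   # initial weights (could also be small random values)
learning_rate = 0.1

for epoch in range(1000):                  # each full pass over the training set is one epoch
    updates = 0
    for xi, target in zip(X, y):
        output = 1 if xi.dot(w) >= 0.5 else 0      # threshold at 0.5
        error = target - output                    # non-zero means the point is misclassified
        if error != 0:
            w += learning_rate * error * xi        # perceptron weight update
            updates += 1
    if updates == 0:                               # converged: nothing misclassified anymore
        break

print('weights (last one is the bias):', w)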
What do I use the test data for and how does that relate to the training data?
Usually, to assess how well a particular algorithm performs, one first trains it and then uses different data to test how well it does on data it has never seen before.
How do I know if a point is misclassified?
Your training data has labels, which means that for each point in the training set, you know what class it belongs to.
How do I go about choosing test points, training points, threshold or a bias?
For simple problems, you usually take all the data you have and split it roughly 80/20. You train on the 80% and test against the remaining 20%.
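For example, a minimal sketch of such a split (assuming X and y are parallel arrays that have already been shuffled):
split = int(0.8 * len(X))                  # 80% for training, 20% for testing
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]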

Public source of randomness

I want to set up a "public lottery" in which everyone can see that the selection is random and fair. If I only needed one bit, I would use, for example, the LSB of the closing Dow Jones index for that day. The problem is, I need 32 bits. I need a source that is:
available daily
visible to the public throughout the world
not manipulable (by me or anyone else)
unbiased
simple
I suppose I could just pick 32 stocks or stock indices and use the LSB of each - that would at least be difficult to manipulate - and run them through some hash to eliminate any bias toward 0, but that doesn't really qualify as "simple". Other thoughts: some feed of meteorological or seismological data. That would be more difficult to manipulate (much easier to buy a share of stock than to cause an earthquake) but harder to authenticate (since there aren't armies of auditors watching weather data).
Any suggestions?
Check out http://www.random.org/. They have a section for a Third-Party Draw Service:
The Third-Party Draw Service is useful for people who operate raffles,
sweepstakes, promotional giveaways and other lottery type services
professionally. In a similar fashion to a certified official,
RANDOM.ORG acts as an unbiased third party who conducts the drawings
in a manner that is guaranteed to be fair and truly random. The
drawings are made using true randomness that comes from atmospheric
noise, which for many purposes is better than the pseudo-random number
algorithms typically used in computer programs.
Check out the Public Records for details about recent drawings held
with the service.
This sounds like what you are looking for, but you would end up having to rely on random.org for the numbers.
The part "visible to the public throughout the world" is the trickiest part in my opinion.
An excellent source of really random numbers is the noise on a webcam (or any other CCD camera). This noise is caused by quantum fluctuation of electron temperature on the CCD plate, so it's truly random.
You could use a picture from a publicly available webcam, but it's hard to find one with a closed shutter... You could set one up and make it available yourself, or you could use one that monitors some meteorological event and subtract a time-averaged image every day.
I hope this is simple enough!
Look at the XKCD GeoHashing algorithm.
MD5(Date, Dow Jones Opening)
Depends how "simple" you want.
I would take a large set of unrelated inputs. You could include some or all of these:
Stock prices (preferably from multiple locations, e.g. Last digit of Dow Jones + last digit of FTSE)
Last digit of the reading from a publicly-visible digital thermometer (easy to find in large cities)
The date
MD5 sum of the current google.com logo image
Name of top-billed guest on today's episode of <insert name of TV talk show here>
Other public lotteries
Concatenate all of these into one large string and apply a cryptographic hash function to it.
The hash will not increase the total entropy, but what it will do is make the output harder to manipulate (because the attacker would need to manipulate many inputs simultaneously.)
Now just take the first 32 bits of the hash.
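As a sketch of that recipe (the input values are made up; any cryptographic hash works, SHA-256 is used here):
import hashlib

# hypothetical daily inputs, gathered from public sources
inputs = ['DJIA:33127.28', 'FTSE:7461.87', 'temp:17C', '2013-06-14', 'logo_md5:ab12cd34', 'guest:SomeName']
digest = hashlib.sha256('|'.join(inputs).encode()).digest()
draw = int.from_bytes(digest[:4], 'big')    # first 32 bits of the hash as an integer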
Separate the non-deterministic part from the random part: use a third-party service that streams sets of random numbers, with a serial number (SN) assigned to each set.
You set up the number of bits per set and the number of digits in the SN.
The service then streams in random sets with their assigned SNs, looping over the whole SN range. Save it all and you get a batch of number sets that you publish as a public record.
Now you only need to choose a smaller number - which doesn't need to be random, just non-deterministic - to pick a single set of numbers from that batch.
