Is there a way to combine this code to run for all .fa files instead of doing them individually? - rna-seq

buildindex(basename="chr1_hg19",reference="chr1.fa")
I have multiple files with names like this and I want to run the same code on each of them. I assume a loop would work, but I am not sure how to write one.

Related

Is there a way to have multiple programs without having multiple projects?

I'm using Codeblocks and don't want to create a new project every time I want to code something different. Is there any way to have something like a single project and then just open the files I want to work on?
In other words, for example: I have one .cpp file that works with some arrays and another file that reads and writes data from a text file. What I'm currently doing is keeping a separate project for that second file.
In a single project (with a single main function) you can have, for example, a menu that calls several functions, each executing one of the different pieces of code that interest you.
You can only have one int main() definition per project.
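Language aside, the structure described here is one entry point that shows a menu and dispatches to functions. Purely to illustrate that shape, here is a Python sketch with array_demo and text_file_demo as hypothetical stand-ins for the two pieces of code; in the C++ project they would just be ordinary functions called from the single main().
def array_demo():
    print("code that works with the arrays goes here")

def text_file_demo():
    print("code that reads and writes the text file goes here")

if __name__ == "__main__":
    choice = input("1) array demo  2) text file demo > ")
    if choice == "1":
        array_demo()
    elif choice == "2":
        text_file_demo()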

Choosing a random mp4 file from directory in processing

I have a directory with a Processing sketch and some .mp4 files; how do I choose a random one to display?
Break your problem down into smaller steps.
Can you write a program that simply lists all of the files in a directory? The File class might help, and the Java API is your best friend.
Can you write a program that takes that list of files and creates an array or ArrayList that contains all of them?
Can you write a program that takes an array or ArrayList and chooses a random element from it? Use hard-coded String values for testing.
When you get all of these individual steps working, you can combine them into a single program that chooses a random file from a directory. If you get stuck on a specific step, you can post an MCVE of just that step, and we'll go from there.
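To make these steps concrete, here is a rough sketch of the whole pipeline. It is written in Python only to show the shape of the three steps; in the Processing sketch itself you would use the Java File class as described above, and the directory name here is a placeholder.
import os
import random

movie_dir = "data"  # placeholder: the folder that holds the sketch and its .mp4 files

# Steps 1 and 2: list the directory and keep only the .mp4 files.
movies = [name for name in os.listdir(movie_dir) if name.endswith(".mp4")]

# Step 3: choose a random element from that list.
chosen = random.choice(movies)
print(chosen)  # hand this filename to whatever displays the video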

Handle single files while extracting tar.gz

I have a huge .tgz file which is structured internally like this:
./RandomFoldername1/file1
./RandomFoldername1/file2
./RandomFoldername2/file1
./RandomFoldername2/file2
etc
What I want is to extract each individual file to standard output so that I can pipe it afterwards to another command. While doing this, I also need to get the RandomFoldername and the file name so that I can deal with them properly from within the second command.
So far the options I have are
to either extract the whole tarball and deal with the resulting directory structure, which is not an option since the extracted tar doesn't fit on the hard drive,
or to make a loop that pattern-matches each file and extracts one file at a time. Although that option solves the problem, it is too slow because the tarball is swept each time for only one file.
While searching for a way to solve this, I've started to fear that there is no better alternative.
Using the tar tool itself, I don't believe you have any other options.
Using a tar library for a language of your choice, however, should let you do what you want: it allows you to iterate over the entries in the tarball one by one and to extract, pipe, etc. each file as necessary.
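For example, Python's built-in tarfile module can walk a .tgz entry by entry without ever unpacking it to disk; the archive name below and what happens to each file's bytes are placeholders.
import tarfile

with tarfile.open("archive.tgz", "r:gz") as tar:   # placeholder archive name
    for member in tar:                             # entries are visited one by one
        if not member.isfile():
            continue
        folder, _, filename = member.name.rpartition("/")  # e.g. ./RandomFoldername1 and file1
        data = tar.extractfile(member).read()              # contents of this single file
        # hand `data`, `folder` and `filename` to the second command here
For very large member files you would stream tar.extractfile(member) into the downstream command's stdin in chunks instead of calling read().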

S3DistCp Grouping by Folder

I'm trying to use S3DistCp to get around the small-files problem in Hadoop. It's working, but the output is a bit annoying to work with. The file paths I'm dealing with look like:
s3://test-bucket/test/0000eb6e-4460-4b99-b93a-469d20543bf3/201402.csv
and there can be multiple files within that folder. I want to group by the folder name, so I use the following groupBy argument in s3distcp:
--groupBy '.*(........-.........-....-............).*'
and it does group the files, but it still results in multiple output folders, with one file in each. Is there any way to output the grouped files into a single folder instead of multiple?
Thanks!
As of 2015-11-20, this is the behavior of S3DistCp. It will create multiple directories based on the source directories. It will not combine across directories.
I think you can try this:
--groupBy ".*/(........-.........-....-............)/.*"
In your example you should use something like: --src "s3://test-bucket/test/"
This way you will have multiple folders, with all the files inside each folder merged together.
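This is not S3DistCp itself, just a quick check of what that capture group picks out of the sample key. The pattern only uses dots, dashes and slashes, so Python's re treats it the same way as the Java regex engine S3DistCp uses.
import re

pattern = r".*/(........-.........-....-............)/.*"
key = "s3://test-bucket/test/0000eb6e-4460-4b99-b93a-469d20543bf3/201402.csv"

match = re.match(pattern, key)
print(match.group(1))  # 0000eb6e-4460-4b99-b93a-469d20543bf3 -- files sharing this group get merged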

Process 100K image files with bash

Here is the script to optimize jpg images: https://github.com/kormoc/imgopt/blob/master/imgopt
There is a CMS with image files (not mine).
I assume there is a complicated structure of subdirectories and the script just recursively finds all image files in a given folder.
The question is how to mark already-processed files so that on the next run the script won't touch them and will just skip them.
I don't know when the maintainers might add new files that need processing. I also think renaming the files is not a good choice.
I was thinking about a hash table or associative array that would be filled from a text file at startup. But is it OK to have a 100K-item array in bash? That seems complicated for a script.
Any other ideas about optimization are also welcome.
I think the easiest thing to do is to just write out a file with a similar name for each processed image file.
For example, after image1.jpg is processed, you would create an empty file with a similar name, e.g. .image1.jpg.processed.
Then when your script runs, it just checks, for the current image NAME.EXT, whether a file .NAME.EXT.processed exists. If that file doesn't exist, you know the image still needs to be processed. No memory issues and no hash table needed, granted you will end up with 100K extra empty files.
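Here is the same marker-file idea sketched in Python rather than bash; the image folder and the optimizer invocation are placeholders for however imgopt is actually called.
import pathlib
import subprocess

root = pathlib.Path("images")                      # placeholder: the CMS image tree
for image in root.rglob("*.jpg"):
    marker = image.with_name("." + image.name + ".processed")
    if marker.exists():                            # already handled on a previous run
        continue
    subprocess.run(["imgopt", str(image)], check=True)  # placeholder optimizer call
    marker.touch()                                 # empty marker, e.g. .image1.jpg.processed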
