I installed Mauve with the following:
apt-get install mauve-aligner
and then I ran it with the following:
mauveAligner --output=ery.mauve --output-alignment=ery.alignment ery_multiseq.fasta
where ery_multiseq.fasta is the FASTA file with the different sequences I want to align; each sequence has a unique name. The file is 13 MB and the average sequence length is 1.2 Mb. The output was:
Sequence loaded successfully.
ery_multiseq.fasta 1876490 base pairs.
Sequence loaded successfully.
ery_multiseq.fasta 1876490 base pairs.
Sequence loaded successfully.
ery_multiseq.fasta 1752910 base pairs.
Sequence loaded successfully.
ery_multiseq.fasta 1752910 base pairs.
Sequence loaded successfully.
ery_multiseq.fasta 1787941 base pairs.
Sequence loaded successfully.
ery_multiseq.fasta 1787941 base pairs.
Sequence loaded successfully.
ery_multiseq.fasta 346770 base pairs.
Sequence loaded successfully.
ery_multiseq.fasta 398993 base pairs.
Sequence loaded successfully.
ery_multiseq.fasta 454198 base pairs.
Sequence loaded successfully.
ery_multiseq.fasta 731061 base pairs.
Using 15-mers for initial seeds
Creating sorted mer list
Create time was: 1 seconds.
Creating sorted mer list
Create time was: 1 seconds.
Creating sorted mer list
Create time was: 0 seconds.
Creating sorted mer list
Create time was: 1 seconds.
Creating sorted mer list
Create time was: 1 seconds.
Creating sorted mer list
Create time was: 1 seconds.
Creating sorted mer list
Create time was: 0 seconds.
Creating sorted mer list
Create time was: 1 seconds.
Creating sorted mer list
Create time was: 0 seconds.
Creating sorted mer list
Create time was: 0 seconds.
0%..1%..2%..3%..4%..5%..6%..7%..8%..9%..10%..
11%..12%..13%..14%..15%..16%..17%..18%..19%..20%..
21%..22%..23%..24%..25%..26%..27%..28%..29%..30%..
31%..32%..33%..34%..35%..36%..37%..38%..39%..40%..
41%..42%..43%..44%..45%..46%..47%..48%..49%..50%..
51%..52%..53%..54%..55%..56%..57%..58%..59%..60%..
61%..62%..63%..64%..65%..66%..67%..68%..69%..70%..
71%..72%..73%..74%..75%..76%..77%..78%..79%..80%..
81%..82%..83%..84%..85%..86%..87%..88%..89%..90%..
91%..92%..93%..94%..95%..96%..97%..98%..99%..Starting with 43306 MUMs
Eliminating overlaps yields 43767 MUMs
Multiplicity filter gives 0 MUMs
The ery.mauve file contained:
FormatVersion 4
SequenceCount 10
Sequence0File ery_multiseq.fasta
Sequence0Length 1876490
Sequence1File ery_multiseq.fasta
Sequence1Length 1876490
Sequence2File ery_multiseq.fasta
Sequence2Length 1752910
Sequence3File ery_multiseq.fasta
Sequence3Length 1752910
Sequence4File ery_multiseq.fasta
Sequence4Length 1787941
Sequence5File ery_multiseq.fasta
Sequence5Length 1787941
Sequence6File ery_multiseq.fasta
Sequence6Length 346770
Sequence7File ery_multiseq.fasta
Sequence7Length 398993
Sequence8File ery_multiseq.fasta
Sequence8Length 454198
Sequence9File ery_multiseq.fasta
Sequence9Length 731061
IntervalCount 0
whereas the ery.alignment file was completely empty.
What did I do wrong?
Many thanks
I have a google sheet with 5 columns.
Column 1 is a Unique list of Names
Column 2 is a Unique list of Tasks
Column 3 is a Price For Each task
Column 4 is a List of Names - Log File Column
Column 5 is a List of Tasks - Log File Column
What I am trying to do is create a 6th column that gives me the total $ owed to each name. To get this value, it needs to count each time a task (column 5) occurs next to a name (column 4), multiply that count by the looked-up price of that task, and then sum the results across tasks. I know it's kind of a big formula to do in a sheet and might be something I can't do, but any advice would be appreciated.
It seems like an odd set-up, so there's a chance I'm not understanding it completely. Here's a formula to try:
=(COUNTIFS($D$2:$D$20,$A2,$E$2:$E$20,$B$2)*$C$2)+(COUNTIFS($D$2:$D$20,$A2,$E$2:$E$20,$B$3)*$C$3)+(COUNTIFS($D$2:$D$20,$A2,$E$2:$E$20,$B$4)*$C$4)
It looks at the unique name in Column 1, then counts the number of times that name was paired with each task and multiplies it by the listed price per task. Then, it sums up those figures.
Here's what it looks like in Excel; it should function the same in Google Sheets.
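If it helps to see the same logic outside the spreadsheet, here is a minimal Python sketch of what the formula computes (the column data below is made up for illustration):

# Hypothetical data standing in for the sheet's columns.
names = ["Alice", "Bob"]                               # Column 1: unique names
tasks = ["Mow", "Paint", "Wash"]                       # Column 2: unique tasks
price = {"Mow": 20, "Paint": 35, "Wash": 10}           # Column 3: price per task
log_names = ["Alice", "Bob", "Alice", "Alice", "Bob"]  # Column 4: log of names
log_tasks = ["Mow", "Mow", "Paint", "Mow", "Wash"]     # Column 5: log of tasks

# Same idea as the COUNTIFS formula: for each name, count how often each task
# appears next to that name in the log, multiply by the task's price, and sum.
owed = {}
for name in names:
    total = 0
    for task in tasks:
        count = sum(1 for n, t in zip(log_names, log_tasks) if n == name and t == task)
        total += count * price[task]
    owed[name] = total

print(owed)  # {'Alice': 75, 'Bob': 30}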
I need to store only, say, 10 documents under a particular index, and the 11th item should replace the oldest one, i.e. the 1st item, so that I have only 10 documents at any time.
I am using Elasticsearch in Golang.
If you want to store only 10 documents, then you should apply algo = (document no % 10) + 1.
The return value is your Elasticsearch _id field.
The algo returns only 1 to 10, and you always index with that _id (so the 11th document overwrites the slot the 1st one used).
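A minimal Python sketch of that formula (the actual Elasticsearch call is left out, since the question uses a Go client; the point is only how the computed _id cycles through 1 to 10):

# The computed slot is what you would pass as the document _id when indexing,
# so document 11 lands in the same slot as document 1, document 12 in the same
# slot as document 2, and so on.
def slot_for(document_no, size=10):
    return (document_no % size) + 1

for n in range(1, 23):  # pretend these are incoming documents 1..22
    print("document", n, "-> _id", slot_for(n))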
I'm assuming you will have fixed names for documents, like 1, 2, ..., 10 or whatever. So one approach could be to use Redis to implement a circular list https://redis.io/commands/rpoplpush#pattern-circular-list (you can also implement that circular list yourself in code).
So basically you should follow these steps:
Load those 10 values, in order, into the circular list, let's say 1, 2, 3, ..., 10
When you want to store a document, extract an element from the Redis list; for our list that element will be 1
Make a count query on your Elasticsearch index to get the number of documents in the index
If you get a count < 10, call an insert-document query with your data and with the number extracted from the list as the document name. If count = 10, call an update-document query on Elasticsearch instead
The circular list will progress in this way:
The initial state of the list is [1, 2, ...10]. You extract 1 and after extracting it, it goes to the end of the list: [2,3,..., 10,1]
Second extraction from the list, current state of the list: [2, 3, ...10, 1]. You extract 2 and after extracting it, it goes to the end of the list: [3,4,..., 1,2]
and so on
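Here is a hedged Python sketch of that flow, using collections.deque as an in-process stand-in for the Redis circular list (RPOPLPUSH src src has the same rotating effect); the print calls simply mark where the Elasticsearch count/insert/update would go:

from collections import deque

slots = deque(str(i) for i in range(1, 11))  # the circular list: '1', '2', ..., '10'
stored = 0                                   # stands in for the Elasticsearch count query

def next_slot():
    slot = slots[0]
    slots.rotate(-1)  # the extracted element goes to the end of the list
    return slot

def store(document):
    global stored
    slot = next_slot()
    if stored < 10:
        print("insert document", slot, "->", document)  # would be an ES insert
        stored += 1
    else:
        print("update document", slot, "->", document)  # would be an ES update

for i in range(12):
    store({"payload": i})  # documents 11 and 12 update the slots used by 1 and 2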
I'm building a pairing system that is supposed to create a pairing between two users and schedule them in a meeting. The selection is based on a criterion that I am having a hard time figuring out: an earlier match cannot have existed between the pair.
My input is a list of size n that contains email addresses. This list is supposed to be split into pairs. The restriction is that the pair hasn't occurred previously.
So for example, my list would contain a couple of user ids
list = {1,5,6,634,533,515,61,53}
At the same time I have database tables where the old pairs exist:
previous_pairs
---------------------
id date status
1 2016-10-14 12:52:24.214 1
2 2016-10-15 12:52:24.214 2
3 2016-10-16 12:52:24.214 0
4 2016-10-17 12:52:24.214 2
previous_pair_users
---------------------
id userid
1 1
1 5
2 634
2 553
3 515
3 61
4 53
4 1
What would be a good approach to solve this problem? My test solution right now is to pop two random users and check them for a previous match. If there exists no match, I pop a new random user (if possible) and push one of the incorrect users back onto the list. If the two people are the last ones left, they get matched anyhow. This doesn't sound good to me, since I should be able to predict which matches cannot occur based on my list of already existing pairs.
Do you have any ideas on how to get me going with building this procedure? Java 8 streams look interesting and might be a way to solve this, but I am unfortunately very new to them.
The solution here was to create a list of tuples containing the old matches, using the GROUP_CONCAT feature of MySQL:
SELECT group_concat(MatchProfiles.ProfileId) FROM Matches
INNER JOIN MatchProfiles ON Matches.MatchId = MatchProfiles.MatchId
GROUP BY Matches.MatchId
old_matches = ((42,52),(12,52),(19,52),(10,12))
After that I select the candidates and generate a new list of tuples using my pop_random()
new_matches = ((42,12),(19,48),(10,36))
When both lists are done I look at the intersection to find any duplicates
duplicates = list(set(new_matches) & set(old_matches))
If we have duplicates, we simply run the randomizer again, up to X attempts, until I find it impossible.
I know that this is not very effective when having a large set of numbers but my dataset will never be that large so I think it will be good enough.
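For reference, here is a small Python sketch of that randomize-and-check approach (the user list and old matches below are illustrative, and the pop_random() helper is replaced by shuffling the list and pairing adjacent elements; frozensets make the pair comparison order-independent):

import random

users = [1, 5, 6, 634, 533, 515, 61, 53]
old_matches = {frozenset(p) for p in ((1, 5), (634, 553), (515, 61), (53, 1))}

def try_pairing(users, old_matches, attempts=1000):
    for _ in range(attempts):
        shuffled = users[:]
        random.shuffle(shuffled)
        pairs = [frozenset(shuffled[i:i + 2]) for i in range(0, len(shuffled), 2)]
        if not any(p in old_matches for p in pairs):  # no previous match repeated
            return [tuple(p) for p in pairs]
    return None  # give up after X attempts, as described above

print(try_pairing(users, old_matches))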
I have some 'small' text files that contain about 500,000 entries/rows. Each row also has a 'key' column. I need to find these keys in a big file (8 GB, at least 219 million entries). When found, I need to append the 'Value' from the big file to the end of the row in the small file, as a new column.
The big file looks like this:
KEY VALUE
"WP_000000298.1" "abc"
"WP_000000304.1" "xyz"
"WP_000000307.1" "random"
"WP_000000307.1" "text"
"WP_000000308.1" "stuff"
"WP_000000400.1" "stuffy"
Simply put, I need to look up 'key' in the big file.
Obviously I need to load the whole table into RAM (but this is not a problem; I have 32 GB available). The big file seems to be already sorted. I have to check this.
The problem is that I cannot do a fast lookup using something like TDictionary because, as you can see, the key is not unique.
Note: This is probably a one-time computation. I will use the program once, then throw it away. So, it doesn't have to be the BEST algorithm (i.e. difficult to implement). It just needs to finish in a decent time (like 1-2 days). PS: I prefer doing this without a DB.
I was thinking of one possible solution: TList.BinarySearch. But it seems that TList is limited to only 134,217,727 (MaxInt div 16) items, so TList won't work.
Conclusion:
I chose Arnaud Bouchez's solution. His TDynArray is impressive! I totally recommend it if you need to process large files.
AlekseyKharlanov provided another nice solution, but TDynArray is already implemented.
Instead of re-inventing the wheel of binary search or a B-Tree, try an existing implementation.
Feed the content into a SQLite3 in-memory DB (with the proper index, and with a transaction every 10,000 INSERTs) and you are done. Ensure you target Win64, to have enough space in RAM. You may even use file-based storage: a bit slower to create, but with indexes, queries by Key will be instant. If you do not have SQLite3 support in your edition of Delphi (via the latest FireDAC), you may use our OpenSource unit and its associated documentation.
Using SQLite3 will definitely be faster, and use fewer resources, than a regular client-server SQL database - BTW the "free" edition of MS SQL is not able to handle as much data as you need, AFAIR.
Update: I've written some sample code to illustrate how to use SQLite3, with our ORM layer, for your problem - see this source code file in github.
Here is some benchmark info:
with index defined before insertion:
INSERT 1000000 rows in 6.71s
SELECT 1000000 rows per Key index in 1.15s
with index created after insertion:
INSERT 1000000 rows in 2.91s
CREATE INDEX 1000000 in 1.28s
SELECT 1000000 rows per Key index in 1.15s
without the index:
INSERT 1000000 rows in 2.94s
SELECT 1000000 rows per Key index in 129.27s
So for a huge data set, an index is worth it, and creating the index after the data insertion reduces the resources used! Even if the insertion is slower, the gain of an index is huge when selecting per key. You may try to do the same with MS SQL, or using another ORM, and I guess you will cry. ;)
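For readers outside Delphi, here is a short Python/sqlite3 sketch of the same pattern (not the mORMot sample mentioned above, just an illustration of batching the inserts and creating the index afterwards; the data is generated on the fly):

import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, as suggested above
conn.execute("CREATE TABLE entries (Key TEXT, Value TEXT)")

batch = []
for i in range(1_000_000):
    batch.append(("KEY%d" % (i % 1024), "VALUE%d" % i))
    if len(batch) == 10_000:  # commit a transaction every 10,000 INSERTs
        conn.executemany("INSERT INTO entries VALUES (?, ?)", batch)
        conn.commit()
        batch.clear()
if batch:
    conn.executemany("INSERT INTO entries VALUES (?, ?)", batch)
    conn.commit()

# Create the index after the bulk insertion, as recommended above.
conn.execute("CREATE INDEX idx_entries_key ON entries (Key)")
conn.commit()

# Queries by Key are now fast, and duplicate keys are handled naturally.
rows = conn.execute("SELECT Value FROM entries WHERE Key = ?", ("KEY42",)).fetchall()
print(len(rows), rows[:3])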
Another answer, since it presents another solution.
Instead of using a SQLite3 database, I used our TDynArray wrapper, and its sorting and binary search methods.
type
  TEntry = record
    Key: RawUTF8;
    Value: RawUTF8;
  end;
  TEntryDynArray = array of TEntry;

const
  // used to create some fake data, with some multiple occurrences of Key
  COUNT = 1000000;    // one million rows insertion!
  UNIQUE_KEY = 1024;  // should be a power of two

procedure Process;
var
  entry: TEntryDynArray;
  entrycount: integer;
  entries: TDynArray;

  procedure DoInsert;
  var i: integer;
      rec: TEntry;
  begin
    for i := 0 to COUNT-1 do begin
      // here we fill with some data
      rec.Key := FormatUTF8('KEY%',[i and pred(UNIQUE_KEY)]);
      rec.Value := FormatUTF8('VALUE%',[i]);
      entries.Add(rec);
    end;
  end;

  procedure DoSelect;
  var i,j, first,last, total: integer;
      key: RawUTF8;
  begin
    total := 0;
    for i := 0 to pred(UNIQUE_KEY) do begin
      key := FormatUTF8('KEY%',[i]);
      assert(entries.FindAllSorted(key,first,last));
      for j := first to last do
        assert(entry[j].Key=key);
      inc(total,last-first+1);
    end;
    assert(total=COUNT);
  end;

  // (the main body of Process - wrapping the entry array in the TDynArray,
  //  calling DoInsert, sorting, then DoSelect - is omitted in this excerpt;
  //  see the full source in the github repository mentioned below)
Here are the timing results:
one million rows benchmark:
INSERT 1000000 rows in 215.49ms
SORT ARRAY 1000000 in 192.64ms
SELECT 1000000 rows per Key index in 26.15ms
ten million rows benchmark:
INSERT 10000000 rows in 2.10s
SORT ARRAY 10000000 in 3.06s
SELECT 10000000 rows per Key index in 357.72ms
It is more than 10 times faster than the SQLite3 in-memory solution. The 10 million rows stay in memory of the Win32 process with no problem.
It is also a good sample of how the TDynArray wrapper works in practice, and how its SSE4.2-optimized string comparison functions give good results.
Full source code is available in our github repository.
Edit: with 100,000,000 rows (100 million rows), under Win64, using more than 10 GB of RAM during the process:
INSERT 100000000 rows in 27.36s
SORT ARRAY 100000000 in 43.14s
SELECT 100000000 rows per Key index in 4.14s
Since this is a one-time task, the fastest way is to load the whole file into memory, scan the memory line by line, parse the key, compare it with the search key(s) and print (save) the found positions.
UPD: If the source file is sorted, and assuming you have 411,000 keys to look up, you can use this trick: sort your search keys in the same order as the source file. Read the first key from both lists and compare them. If they differ, read the next key from the source until they are equal. Save the position; if the next key in the source is also equal, save it too, etc. If the next key differs, read the next key from the search-key list. Continue until EOF.
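A short Python sketch of that sorted-merge pass (the file names, the key-as-first-field layout and unique, sorted search keys are assumptions for illustration):

def key_of(line):
    # first whitespace-separated field (assumes non-empty lines)
    return line.split(None, 1)[0]

with open("small_keys_sorted.txt") as small, open("big_sorted.txt") as big:
    big_line = big.readline()
    for small_line in small:
        skey = key_of(small_line)
        # advance in the big file until its key is >= the search key
        while big_line and key_of(big_line) < skey:
            big_line = big.readline()
        # collect every matching line (keys in the big file are not unique)
        while big_line and key_of(big_line) == skey:
            print(small_line.rstrip(), "->", big_line.rstrip())
            big_line = big.readline()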
Use memory-mapped files. Just treat your file as if it were already read into memory in its entirety, and do the very binary search in memory that you wanted. Let Windows take care of reading portions of the file when you do your in-memory search.
https://en.wikipedia.org/wiki/Memory-mapped_file
https://msdn.microsoft.com/en-us/library/ms810613.aspx
https://stackoverflow.com/a/9609448/976391
https://stackoverflow.com/a/726527/976391
http://msdn.microsoft.com/en-us/library/aa366761%28VS.85.aspx
https://msdn.microsoft.com/en-us/library/windows/desktop/aa366537.aspx
You may take any of those sources as a start, just do not forget to update them for Win64:
http://torry.net/quicksearchd.php?String=memory+mapped+files&Title=No
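For illustration, here is a minimal Python sketch of the memory-mapped idea (the file name and key are made up; the OS pages the file in on demand while you search the mapping as if it were all in memory):

import mmap

with open("big_file.txt", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = mm.find(b'"WP_000000307.1"')  # plain scan; a binary search over the
        if pos >= 0:                        # mapping works too, since the file is sorted
            end = mm.find(b"\n", pos)
            if end < 0:
                end = len(mm)
            print(mm[pos:end].decode())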
A method that needs the file to be sorted, but avoids data structures entirely:
You only care about one line, so why even read the bulk of the file?
Open the file and move the "get pointer" (apologies for talking C) halfway through the file. You'll have to figure out whether you have landed in the middle of a key or a value, but a key should be close by. Once you know the closest key, you know if it's higher or lower than what you want, and you continue with the binary search.
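If it helps, here is a rough Python sketch of that binary search by seeking (the two-column layout and quoting follow the sample data in the question; the file must be sorted by key):

def key_after(f, offset):
    # Key of the first full line starting after `offset`, or None at end of file.
    f.seek(offset)
    f.readline()  # discard the (possibly partial) line we landed in
    line = f.readline()
    return line.split(None, 1)[0] if line else None

def find_matches(path, wanted):
    # Return every line whose key equals `wanted` (lower-bound binary search).
    with open(path, "rb") as f:
        f.seek(0, 2)
        lo, hi = 0, f.tell()
        while lo < hi:  # smallest offset whose following line has key >= wanted
            mid = (lo + hi) // 2
            k = key_after(f, mid)
            if k is None or k >= wanted:
                hi = mid
            else:
                lo = mid + 1
        f.seek(lo)
        if lo > 0:
            f.readline()  # the line starting exactly at `lo` still has a smaller key
        matches = []
        for line in f:
            k = line.split(None, 1)[0]
            if k < wanted:
                continue
            if k > wanted:
                break
            matches.append(line.rstrip())
        return matches

print(find_matches("big_file.txt", b'"WP_000000307.1"'))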
Idea based on Aleksey Kharlanov's answer (which I accepted).
I only copy his idea here because he did not elaborate on it (no pseudo-code or deeper analysis of the algorithm). I want to confirm it works before implementing it.
We sort both files (once).
We load the Big file in memory (once).
We read the Small file line by line from disk (once).
Code:
In the code below, sKey is the current key in the Small file and bKey is the current key in the Big file (i.e. the key at BigFile[CurPos]):

LastPos := 0
for sKey in SmallFile do
  for CurPos := LastPos to BigFile.Count do
    if sKey = bKey then
    begin
      SearchNext;  // search (down) the next entries for possible duplicate keys
      LastPos := CurPos;
    end
    else
      if sKey < bKey then
        break
It works because I know the last position (in the Big file) of the last key. The next key can only be somewhere BELOW that last position; ON AVERAGE it should be within the next 440 entries. However, I don't even have to always read 440 entries below LastPos, because if my sKey does not exist in the big file it will be smaller than bKey, so I quickly break the inner loop and move on.
Thoughts?
If I were doing this as a one-time thing, I'd create a set with all the keys I need to look up. Then read the file line by line, check whether the key exists in the set, and output the value if so.
Briefly, the algorithm is:
mySet = dictionary of keys to look up
for each line in the file
  key = parse key from line
  if key in mySet
    output key and value
end for
Since Delphi doesn't have a generic set, I'd use TDictionary and ignore the value.
The dictionary lookup is O(1), so should be very fast. Your limiting factor will be file I/O time.
I figure that'd take about 10 minutes to code up, and less than 10 minutes to run.
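A Python version of that pseudo-code, as a sketch (it assumes the key is the first whitespace-separated field in both files, and the file names are placeholders):

# Load the small file's keys into a set, then stream the big file once and
# print the matching key/value lines; the set lookup per line is O(1).
with open("small_file.txt", "rb") as small:
    wanted = {line.split(None, 1)[0] for line in small if line.strip()}

with open("big_file.txt", "rb") as big:
    for line in big:
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] in wanted:
            print(parts[0].decode(), parts[1].decode().rstrip())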
It looks like I'm missing something. The number of reducers I set creates that many files in HDFS, but my data is not split across all of them. What I noticed is that if I do a group by on a key that is in sequential order it works fine; for example, the data below is split nicely into two files based on the key:
1 hello
2 bla
1 hi
2 works
2 end
But this data doesn't split:
1 hello
3 bla
1 hi
3 works
3 end
The code I used, which works fine for the first input but not for the second, is:
InputData = LOAD 'above_data.txt';
GroupReq = GROUP InputData BY $0 PARALLEL 2;
FinalOutput = FOREACH GroupReq GENERATE flatten(InputData);
STORE FinalOutput INTO 'output/GroupReq' USING PigStorage ();
The above code creates two output part files for both inputs, but for the first input it splits the data nicely and puts key 1 in part-r-00000 and key 2 in part-r-00001. For the second input it also creates two part files, yet all the data ends up in part-r-00000. What am I missing, and what can I do to force the data to be split into multiple output files based on the unique keys?
Note: for the second input, if I use PARALLEL 3 (3 reducers), it creates three part files and adds all the data for key 1 to part-0 and all the data for key 3 to the part-3 file. I found this behavior strange. BTW, I'm using Cloudera CDH3B4.
That's because the reducer that a key goes to is determined as hash(key) % reducerCount. If the key is an integer, hash(key) == key. When you have more data, the keys will be distributed more or less evenly across the reducers, so you shouldn't worry about it.
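A tiny Python illustration of that arithmetic (Python's hash of a small integer is the integer itself, which matches the hash(key) == key point above):

def reducer_for(key, reducer_count):
    return hash(key) % reducer_count

print([(k, reducer_for(k, 2)) for k in (1, 2)])  # [(1, 1), (2, 0)] -> both part files used
print([(k, reducer_for(k, 2)) for k in (1, 3)])  # [(1, 1), (3, 1)] -> everything in one file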