How to print 2 consecutive lines from a file in Python? - bioinformatics

I have a list of words (PDB IDs) in a file. I want to compare the IDs against another file containing protein sequences in FASTA format. But the PDB ID and its sequence are on consecutive lines. How can I print both lines together?
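A minimal sketch of one way to do this in Python, assuming a hypothetical ids.txt with one PDB ID per line, and a FASTA file in which each ">" header line carrying the ID is immediately followed by its sequence line (both file names are assumptions, not from the question):

import sys

# Hypothetical input files; adjust to your actual names.
with open("ids.txt") as f:
    wanted = {line.strip() for line in f if line.strip()}

with open("sequences.fasta") as f:
    lines = f.readlines()

for i, line in enumerate(lines):
    if line.startswith(">"):
        # A FASTA header looks like ">1ABC ..."; take the token after ">".
        tokens = line[1:].split()
        if tokens and tokens[0] in wanted:
            print(line.rstrip())              # the header line with the PDB ID
            if i + 1 < len(lines):
                print(lines[i + 1].rstrip())  # the sequence on the next line

If your sequences wrap across multiple lines rather than sitting on exactly one, a proper FASTA parser (e.g. Biopython's SeqIO) would be the safer choice.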

Related

Append multiple CSVs into a single file with Apache NiFi

I have a folder with CSV files that have the same first 3 columns and different last N columns, where N ranges from 2 to 11.
The last N columns have numbers as headers, for example:
File 1:
AAA,BBB,CCC,0,10,15
1,India,c,0,28,54
2,Taiwan,c,0,23,52
3,France,c,0,26,34
4,Japan,c,0,27,46
File 2:
AAA,BBB,CCC,0,5,15,30,40
1,Brazil,c,0,20,64,71,88
2,Russia,c,0,20,62,72,81
3,Poland,c,0,21,64,78,78
4,Litva,c,0,22,66,75,78
Desired output:
AAA,BBB,CCC,0,5,10,15,30,40
1,India,c,0,null,28,54,null,null
2,Taiwan,c,0,null,23,52,null,null
3,France,c,0,null,26,34,null,null
4,Japan,c,0,null,27,46,null,null
1,Brazil,c,0,20,null,64,71,88
2,Russia,c,0,20,null,62,72,81
3,Poland,c,0,21,null,64,78,78
4,Litva,c,0,22,null,66,75,78
Is there a way to append these files together with NiFi such that a new column gets created (even if I do not know the column name beforehand) when a file with additional data is present in the folder?
I tried the MergeContent processor, but by default it just appends the content of all my files together without minding the headers (all the headers are always appended).
What you could do is write a script that combines the rows and columns and invoke it with the ExecuteStreamCommand processor. This allows you to write a custom script in whatever language you want.
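As a hedged sketch of what such a script could look like in Python (the input folder path, the output file name, and the assumption that the first three columns are shared while the remaining headers are numeric are mine, not from the question):

import csv
import glob

FIXED = 3  # number of shared leading columns (AAA,BBB,CCC)

tables = []
all_numeric = set()
for path in sorted(glob.glob("input/*.csv")):   # hypothetical input folder
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    nums = [int(h) for h in header[FIXED:]]     # the numeric column headers
    all_numeric.update(nums)
    tables.append((header[:FIXED], nums, body))

merged = sorted(all_numeric)                    # union of all numeric headers
with open("merged.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(tables[0][0] + [str(n) for n in merged])
    for _, nums, body in tables:
        pos = {n: i for i, n in enumerate(nums)}
        for row in body:
            fixed, rest = row[:FIXED], row[FIXED:]
            # Fill columns this file does not have with "null".
            writer.writerow(fixed + [rest[pos[n]] if n in pos else "null"
                                     for n in merged])

Run against the two example files above, this would produce the desired merged header 0,5,10,15,30,40 with "null" in the cells each file lacks.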

Does a single column CSV file have commas?

When I open my CSV file in Excel it looks like this:
Header
Value1
Value2
Value3
Value4
Value5
I want to know whether this file actually has commas in it. I am aware that if I have multiple columns I will see the commas.
You can easily test that by opening the file in a text editor (e.g. Notepad on Windows). It will show the file as it is in text form, i.e., with any commas present. I would say that if it is a single column it won't have commas (but rather line breaks between the rows), but if you need to be sure, just open it with a text editor.
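If you'd rather check programmatically, a one-off Python check does the same thing (the file name is assumed):

# Print whether the raw file contains any comma characters at all.
with open("data.csv") as f:
    print("has commas:", "," in f.read())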
https://www.ietf.org/rfc/rfc4180.txt
Given that there is only one value in each record, it should not contain a comma according to the spec:
Within the header and each record, there may be one or more
fields, separated by commas. Each line should contain the same
number of fields throughout the file. Spaces are considered part
of a field and should not be ignored. The last field in the
record must not be followed by a comma.

How to parse a CSV file into multiple CSVs based on row spacing

I'm trying to build an Airflow DAG and need to split out 7 tables contained in one CSV into seven separate CSVs.
dataset1
header_a
header_b
header_c
One
Two
Three
One
Two
Three
<- Always two blank rows between data sets
dataset N <- part of the CSV file giving details on the data
header_d
header_e
header_f
header_g
One
Two
Three
Four
One
Two
Three
Four
out:
dataset1.csv
datasetn.csv
Based on my research, I think my solution might lie in using awk to search for the double blank lines?
EDIT: In plain text as requested.
table1 details1,
table1 details2,
table1 details3,
header_a,header_b,header_c,
1,2,3
1,2,3
tableN details1,
tableN details2,
tableN details3,
header_a, header_b,header_c,header_N,
1,2,3,4
1,2,3,4
There are always two blank rows between the data sets.
If your CSV file contains blank lines, and your goal is to write out each chunk of records that is separated by those blank lines into individual files, then you could use awk with its record separator RS set to nothing, which then defaults to treating each "paragraph" as a record. Each of them can then be redirected to a file whose name is based on the record number NR:
awk -vRS= '{print $0 > ("output_" NR ".csv")}' input.csv
This reads from input.csv and writes the chunks to output_1.csv, output_2.csv, output_3.csv and so forth.
If my interpretation of your input file's structure (or your problem in general) is wrong, please provide more detail to clarify.
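Since the surrounding DAG is Airflow, a pure-Python equivalent of the awk one-liner might also be convenient, e.g. inside a PythonOperator (file names are assumed):

# Split input.csv into chunks separated by blank lines and write each
# chunk to its own numbered CSV, mirroring the awk command above.
with open("input.csv") as f:
    content = f.read()

chunks = [c for c in content.split("\n\n") if c.strip()]
for i, chunk in enumerate(chunks, start=1):
    with open(f"output_{i}.csv", "w") as out:
        out.write(chunk.strip() + "\n")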

How to take the columns from two TXT files and create a new TXT with the two columns?

I have two text files with only one column each.
I need to take the column from each of the text files and create a new text file with the two columns with tabs.
These columns have no relating ID, but they correspond to each other by row order.
I could do this in Excel, but there are more than 200 thousand lines, which Excel won't accept.
How can I do it in Pentaho?
Take two Text file input steps and read both files.
Then add an Add constants step to each stream, creating the same column with some value; make sure both constant values are the same.
Use a Stream lookup/Merge join to merge the streams on the constant column.
Then generate the file.
You can read both files with Text file input steps and add a "row number" field in each stream, which gives you two streams of two fields each. Then Merge join both streams on the row number, and finally use a Select values step to clean up the output so that only the two relevant fields are kept. Then write it with Text file output.
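Outside Pentaho, the same pairing-by-row-order can be sketched in a few lines of Python (file names are assumed), which handles 200 thousand lines without trouble:

# Pair the files line by line and write them out tab-separated.
with open("file1.txt") as a, open("file2.txt") as b, \
        open("merged.txt", "w") as out:
    for left, right in zip(a, b):
        out.write(f"{left.rstrip()}\t{right.rstrip()}\n")

On Unix-like systems, paste file1.txt file2.txt > merged.txt does the same thing.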

How to filter record values of files in Hadoop MapReduce?

I am working with a program in MapReduce. I have two files and I want to delete some information from file1 that exists in file2. Every line has an ID as its key and some numbers (separated by comma) as its value.
file1:
1 1,2,10
2 2,7,8,5
3 3,9,12
and
file2:
1 1
2 2,5
3 3,9
I want output like this:
output:
1 2,10
2 7,8
3 12
I want to delete the values of file1 that have the same key in file2. One way to do this is to take the two files as input and, in the map step, produce (ID, line). Then in the reduce step I can filter the values. But my files are very, very large, so I can't do it this way.
Alternatively, would it be efficient if file1 were the input file and, in the map, I opened file2, sought to that line, and compared the values? But as I have a million keys, and for every key I must open file2, I think it will have excessive I/O.
What can I do?
You can make both file1 and file2 inputs of your mapper. In the mapper you'd add the source (file1 or file2) to the records. Then use a secondary sort to make sure records from file2 always come first. So the combined input for your reducer would look like this:
1 file2,1
1 file1,1,2,10
2 file2,2,5
2 file1,2,7,8,5
3 file2,3,9
3 file1,3,9,12
You can take the design of the reducer from here.
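As a hedged sketch of what that reducer could look like, here is a Hadoop Streaming-style version in Python; it assumes the mapper emits tab-separated "key<TAB>source,value1,value2,..." lines and that the secondary sort delivers the file2 record for each key first:

#!/usr/bin/env python3
import sys

current_key = None
to_remove = set()

for line in sys.stdin:
    key, payload = line.rstrip("\n").split("\t", 1)
    source, *values = payload.split(",")
    if key != current_key:
        current_key, to_remove = key, set()
    if source == "file2":
        to_remove = set(values)           # values to drop for this key
    else:                                 # a file1 record: filter and emit
        kept = [v for v in values if v not in to_remove]
        print(f"{key}\t{','.join(kept)}")

On the sample input above, this prints 1 -> 2,10, 2 -> 7,8 and 3 -> 12, matching the desired output.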
