Vertically divide an array so we get minimum splits - algorithm

I am thinking about the following problem.
I can have an array of strings like
Col1 Col2 Col3 Col4
aa aa aa aa
aaa aaa aaaaa aaa
aaaa aaaaaaa aa a
...........................
Actually it is a CSV file, and I need to find a way to divide it vertically into one or more files. The condition for splitting is that no file may contain a row that exceeds a certain number of bytes. For simplicity we can rewrite that array with the field lengths:
Col1 Col2 Col3 Col4
2 2 2 2
3 3 5 3
4 7 2 1
...........................
And let's say the limit is 10, i.e. if a row's total exceeds 9 we should split. So if we split into 2 files, [Col1, Col2, Col3] and [Col4], this does not satisfy the condition, because the first file contains 3 + 3 + 5 > 9 in the second row and 4 + 7 + 2 > 9 in the third row. If we split into [Col1, Col2] and [Col3, Col4], this does not satisfy the condition either, because the first file contains 4 + 7 > 9 in the third row. So we split into 3 files: [Col1], [Col2, Col3] and [Col4]. Now every file is correct and looks like:
File1 | File2 | File3
------------------------------
Col1 | Col2 Col3 | Col4
2 | 2 2 | 2
3 | 3 5 | 3
4 | 7 2 | 1
...............................
So it should split from left to right, giving as many columns as possible to the leftmost file. The problem is that this file can be huge and I don't want to read it into memory, so I read the initial file line by line and somehow have to determine the set of indexes at which to split, if that is possible at all. I hope I described the problem well enough for you to understand it.

Generally awk is quite good at handling large csv files.
You could try something like this to retrieve the max length for each column and then decide how to split.
Let's say the file.txt contains
Col1;Col2;Col3;Col4
aa;aa;aa;aa
aaa;aaa;aaaaa;aaa
aaaa;aaaaaaa;aa;a
(Assuming Windows-style quoting.) Running the following:
> awk -F";" "NR>1{for (i=1; i<=NF; i++) max[i]=(length($i)>max[i]?length($i):max[i])} END {for (i=1; i<=NF; i++) printf \"%d%s\", max[i], (i==NF?RS:FS)}" file.txt
Will output:
4;7;5;3
Could you try this on your real data set?
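If the maxima look right, one way to turn them into split points is a greedy pass: keep adding columns to the current group while the sum of their maxima stays within the limit, then start a new group. Below is only a sketch (Unix-style quoting, the limit of 9 from your example, separator bytes ignored, and every row assumed to have the same number of columns). It is conservative: because it works from the column maxima rather than the exact per-row sums, it can never overflow a file, but it may occasionally split more than strictly necessary.
awk -F";" -v limit=9 '
NR > 1 { for (i = 1; i <= NF; i++) if (length($i) > max[i]) max[i] = length($i) }
END {
    start = 1; sum = 0
    for (i = 1; i <= NF; i++) {
        # close the current group when the next column no longer fits
        # (a single column wider than the limit still gets a group of its own)
        if (sum + max[i] > limit && i > start) {
            printf "columns %d-%d\n", start, i - 1
            start = i; sum = 0
        }
        sum += max[i]
    }
    printf "columns %d-%d\n", start, NF
}' file.txt
On the sample above (maxima 4;7;5;3) this prints columns 1-1, 2-2 and 3-4, i.e. [Col1], [Col2], [Col3, Col4]. Every row stays within the limit, although the grouping differs from the hand-derived [Col1], [Col2, Col3], [Col4].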

Related

Combining multiple awk output statements into one line

I have some ascii files I’m processing, with 35 columns each and a variable number of rows. I need to take the difference between two columns (N+1), and place the results into a duplicate ascii file on column number 36. Then, I need to take another column, and divide it (row by row) by column 36, and place that result into the same duplicate ascii file in column 37.
I’ve done similar processing in the past, but by outputting temp files for each awk command, reading each successive temp file in to eventually create a final ascii file. Then, I would delete the temp files after. I’m hoping there is an easier/faster method than having to create a bunch of temp files.
Below is an initial working processing step, that the above awk commands would need to follow and fit into. This step gets the data from foo.txt, removes the header, and processes only the rows containing a particular, but varying, string.
cat foo.txt | tail -n +2 | awk '$17 ~ /^[F][0-9][0-9][0-9]$/' >> foo_new.txt
There’s another processing step for different data files, that I would also need the 2 new columns discussed earlier. This is simply appending a unique file name from what’s being catted to the last column of every row in a new ascii file. This command is actually in a loop with varying input files, but I’ve simplified it here.
cat foo.txt | tail -n +2 | awk -v fname="$fname" '{print $0 OFS fname;}' >> foo_new.txt
An example of one of the foo.txt files.
20 0 5 F001
4 2 3 F002
12 4 8 F003
100 10 29 O001
Below is the desired example foo_new.txt, with the 2 requested columns of awk output as the last 2 columns. In this example, column 5 is the difference between columns 3 and 2, plus 1. Column 6 is the result of column 1 divided by column 5.
20 0 5 F001 6 3.3
4 2 3 F002 2 2.0
12 4 8 F003 5 2.4
For the second example of foo_new.txt, the last column is an example of fname. These are computed in the shell script and passed to awk. I don't care if the results in column 7 (fname) are at the end or placed between columns 4 and 5, so long as it gets along with the other awk statements.
20 0 5 F001 6 3.3 C1
4 2 3 F002 2 2.0 C2
12 4 8 F003 5 2.4 C3
This is my best attempt so far, but unfortunately it produces a file with the original output first and the added output below it. I'd like to have the added output appended as columns (#5 and #6).
cat foo.txt | tail -n +2 | awk '$17 ~ /^[F][0-9][0-9][0-9]$/' >> foo_new.txt
cat foo_new.txt | awk '{print $4=$3-$2+1, $5=$1/($3-$2+1)}' >> foo_new.txt
Consider an input file data with a header line like this (based closely on your minimal example):
Col1 Col2 Col3 Col4
20 0 5 F001
4 2 3 F002
12 4 8 F003
100 10 29 O001
You want the output to contain a column 5 that is the value of $3 - $2 + 1 (column 3 minus column 2 plus 1), and a column 6 that is the value of column 1 divided by column 5 (with 1 decimal place in the output), and a file name that is based on a variable fname passed to the script but that has a unique value for each line. And you only want lines where column 4 matches F and 3 digits, and you want to skip the first line. That can all be written directly in awk:
awk -v fname=C '
NR == 1 { next }
$4 ~ /^F[0-9][0-9][0-9]$/ {
    c5 = $3 - $2 + 1
    c6 = sprintf("%.1f", $1 / c5)
    print $0, c5, c6, fname NR
}' data
You could write that on one line too:
awk -v fname=C 'NR==1{next} $4~/^F[0-9][0-9][0-9]$/ { c5=$3-$2+1; print $0,c5,sprintf("%.1f",$1/c5), fname NR }' data
The output is:
20 0 5 F001 6 3.3 C2
4 2 3 F002 2 2.0 C3
12 4 8 F003 5 2.4 C4
Clearly, you could change the file name so that the counter starts from 0 or 1 by using counter++ or ++counter respectively in place of the NR in the print statement, and you could format it with leading zeros or whatever else you want with sprintf() again. If you want to drop the first line of each file, rather than just the first file, change the NR == 1 condition to FNR == 1 instead.
Note that this does not need the preprocessing provided by cat foo.txt | tail -n +2.
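For instance, a sketch of that variant, combining ++counter with sprintf() zero-padding and the FNR == 1 test (the C prefix and the %03d width are only placeholders):
awk -v fname=C 'FNR==1{next} $4~/^F[0-9][0-9][0-9]$/ { c5=$3-$2+1; print $0, c5, sprintf("%.1f",$1/c5), fname sprintf("%03d", ++counter) }' data
On the sample data this labels the three matching rows C001, C002 and C003 instead of C2, C3 and C4.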
I need to take the difference between two columns (N+1), and place the results into a duplicate ascii file on column number 36. Then, I need to take another column, and divide it (row by row) by column 36, and place that result into the same duplicate ascii file in column 37.
That's just:
awk -vN=9 -vanother_column=10 '{ v36 = $N - $(N+1); print $0, v36, $another_column / v36 }' input_file.tsv
I guess your file has some "header"/special "first line", so if it's the first line, then preserve it:
awk ... 'NR==1{print $0, "36_header", "37_header"} NR>1{ ... the script above ... }'
Taking the first 3 columns from the example you presented, and setting N to 2 and another_column to 1, we get the following script:
# recreate input file
cat <<EOF |
20 0 5
4 2 3
12 4 8
100 10 29
EOF
tr -s ' ' |
tr ' ' '\t' > input_file.tsv
awk -vOFS=$'\t' -vFS=$'\t' -vN=2 -vanother_column=1 '{ tmp = $(N + 1) - $N; print $0, tmp, $another_column / tmp }' input_file.tsv
and it will output:
20 0 5 5 4
4 2 3 1 4
12 4 8 4 3
100 10 29 19 5.26316
However, the following script:
awk -vOFS=$'\t' -vFS=$'\t' -vN=2 -vanother_column=1 '{ tmp = $(N + 1) - $N + 1; print $0, tmp, sprintf("%.1f", $another_column / tmp) }' input_file.tsv
produces, I think, output closer to what you want:
20 0 5 6 3.3
4 2 3 2 2.0
12 4 8 5 2.4
100 10 29 20 5.0
And I guess that by that (N+1) you meant "the difference between two columns with 1 added".

Combining lines that are tab delimited

I've got all the lines in Proteins_num.txt sorted numerically. I now need to combine the lines with an identical number in such a way that the new information is added to the upper line. Take for instance the lines with number 61:
Col 1 2 3 4 5 6 7 8 9 10 11
61 PTS... cyt 1bl.. 0,38 MONOMER homo-trimer FRUC... PER...Bac..
61 PTS... 3
becomes:
Col 1 2 3 4 5 6 7 8 9 10 11
61 PTS... cyt 1bl.. 0,38 MONOMER homo-trimer FRUC... PER...Bac.. 3
Sometimes there'll be information missing in some columns of the upper line that is found in the lower one, so the order of joining must be consistent. Is that doable if there is info in both lines?
The file is here with 1021 lines
https://www.dropbox.com/s/yuu46crp7ql4z65/Proteins_num.txt?dl=0
An awk/gawk solution could be:
gawk '
BEGIN { SEQ="" };
$1 == SEQ { $1=""; printf("%s\t",$0)};
$1 != SEQ { SEQ=$1; printf("\n%s",$0);}
' Proteins_num.txt
where SEQ holds the number at the beginning of the line. When a change in the number is detected, a newline is printed first (terminating the previous joined line) and then the new line is printed; if no change is detected, the line is printed without a line break, so it joins onto the current line. The file must be numerically sorted beforehand.
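Note that, as written, the script starts its output with a blank line and leaves the last joined line without a terminating newline. A possible tweak with the same joining logic, just different newline handling, could be:
gawk '
$1 == SEQ { $1 = ""; printf("%s\t", $0); next }
{ if (NR > 1) printf("\n"); SEQ = $1; printf("%s", $0) }
END { printf("\n") }
' Proteins_num.txt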

Shell/awk script to read a column of files and combining columns to make a TSV file

I have over 600 files and I need to extract a single column from each of them and write those columns to an output file. My current code does this, but it writes the columns one after another in the output file. However, I need two things in my output file:
Each column from the input files should be added as a new column in the output file (preferably as a TSV file), instead of the columns being appended one after another.
The column name should be replaced by the file name.
My example code:
for f in *; do cat "$f" | tr "\t" "~" | cut -d"~" -f2; done >out.txt
Example input:
file01.txt
col1 col2 col3
1 2 3
4 5 6
7 8 9
10 11 12
file02.txt
col4 col5 col6
11 12 13
14 15 16
17 18 19
110 111 112
My current output:
col2
2
5
8
11
col5
12
15
18
111
Expected output:
file01.txt file02.txt
2 12
5 15
8 18
11 111
You can use awk like this:
awk -v OFS='\t' 'BEGIN {
    # print the header: one column per input file, named after the file
    for (i = 1; i < ARGC - 1; i++)
        printf "%s%s", ARGV[i], OFS
    print ARGV[ARGC - 1]
}
FNR == 1 { next }                                # skip the header line of every file
{
    # append column 2 of the current file to the row collected so far
    a[FNR] = (a[FNR] == "" ? "" : a[FNR] OFS) $2
}
END {
    for (i = 2; i <= FNR; i++)
        print a[i]
}' file*.txt
file01.txt file02.txt
2 12
5 15
8 18
11 111

Mapping ids for 10 million records [duplicate]

This question already has answers here:
Efficient way to map ids
(2 answers)
Closed 9 years ago.
I have two text files,
File 1 with data like
User game count
A Rugby 2
A Football 2
B Volleyball 1
C TT 2
...
File 2
1 Basketball
2 Football
3 Rugby
...
90 TT
91 Volleyball
...
Now what I want to do is add another column to File 2 such that I have the corresponding index of the game from File 2 as an extra column in File 1.
I have 2 million entries in File 1. So I want to add another column specifying the index (basically the line number or order) of the game from file 2. How can I do this efficiently?
Right now I am doing this line by line: reading a line from file 1, grepping file 2 for the corresponding game to get its line number, and saving/writing that to a file.
This will take me ages. How can I speed this up if I have 10 million rows in file 2 and 3000 rows in file 1?
With awk, read field 1 from File2 into an array indexed by field 2, then look up the array using field 2 from File1 as you iterate through it:
awk 'NR == FNR{a[$2]=$1; next}; {print $0, a[$2]}' File2 File1
A Rugby 2 3
A Football 2 2
B Volleyball 1 91
C TT 2 90
You can construct an associative array from the second file, with the game names as keys and the game indexes as values. Then, for each line in file 1, look up the wanted id in the array and write it out.
Associative array lookups take O(1) time on average.
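A minimal sketch of that idea, essentially the same lookup as the awk answer above, just writing the result to a new file (File1_with_ids is only a placeholder name) and marking games missing from File2:
awk '
NR == FNR { id[$2] = $1; next }              # File2: game name -> index
{ print $0, ($2 in id ? id[$2] : "NA") }     # File1: append the index, NA if unknown
' File2 File1 > File1_with_ids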
Use the join command:
$ cat file1
A Rugby 2
A Football 2
B Volleyball 1
C TT 2
$ cat file2
1 Basketball
2 Football
3 Rugby
90 TT
91 Volleyball
$ join -1 2 -2 2 -o 1.1,1.2,1.3,2.1 \
<(sort -k 2 file1) <(sort -k 2 file2)
A Football 2 2
A Rugby 2 3
C TT 2 90
B Volleyball 1 91
The join is on the game name (field 2 of both files), which is why both inputs are sorted on that field first.
Here's another approach: only read the small file into memory, and then read the bigger file line-by-line. Once each ID has been found, bail out:
awk '
NR == FNR {
    f1[$2] = $0
    n++
    next
}
($2 in f1) {
    print f1[$2], $1
    delete f1[$2]
    if (--n == 0) exit
}
' file1 file2
Rereading your question, I'm not sure I've answered it: do you want the extra column appended to file1 or to file2?

Command line to sum frequency in concatenated file

I need to summarize the frequency of one column of several large tab-separated files.
An example of the content in the file is :
Blue table 3
Blue chair 2
Big cat 1
Small cat 2
After concatenating the files, the trouble is the following:
Column 2 is essentially a frequency count of the number of times the combination of Column 0 and Column 1 was seen together.
I need to add up the frequencies in Column 2 of all the identical combinations in the concatenated file.
For instance: If in File A the contents are as follows:
Blue table 3
Blue chair 2
Big cat 1
Small cat 2
and in File B the contents are as follows:
Blue table 3
Blue chair 2
Big cat 1
Small cat 2
the contents in the concatenated File C are as follows:
Blue table 3
Blue chair 2
Big cat 1
Small cat 2
Blue table 3
Blue chair 2
Big cat 1
Small cat 2
I want to sum the frequencies of all identical combos in Column 0 and Column 1 in a File D to get the following results:
Blue table 6
Blue chair 4
Big cat 2
Small cat 4
I tried to sort and count the info with the following command:
sort <input_file> | uniq -c <output_file>
but the result is the following:
2 Big cat 1
2 Blue chair 2
2 Blue table 3
2 Small cat 2
Does anyone have a suggestion of a terminal command that can produce my desired results?
Thank you in advance for any help.
You're close; you have all the numbers you need. The total for each row is the count of rows that you got from uniq (column 1) times the frequency count (column 4). You can calculate that with awk:
sort input.txt | uniq -c | awk ' { print $2 "\t" $3 "\t" $1*$4 } '
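Note that multiplying only works when identical combinations carry the same frequency in every file, as they do in this example where the files are exact copies. If the per-file counts can differ, an alternative is to sum column 2 directly, keyed on the first two columns (the question's Column 0, 1 and 2 are awk's $1, $2 and $3). A minimal sketch, assuming the concatenated file is named fileC:
awk '{ sum[$1 OFS $2] += $3 } END { for (k in sum) print k, sum[k] }' fileC
The output order is arbitrary; pipe it through sort if a fixed order matters.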
