How to merge rows from the same column using unix tools - bash

I have a text file that looks like the following:
1000000 45 M This is a line This is another line Another line
that breaks into that also breaks that has a blank
multiple rows into multiple rows - row below.
How annoying!
1000001 50 F I am another I am well behaved.
column that has
text spanning
multiple rows
I would like to convert this into a csv file that looks like:
1000000, 45, M, This is a line that breaks into multiple rows, This is another line that also breaks into multiple rows - How annoying!
1000001, 50, F, I am another column that has text spanning multiple rows, I am well behaved.
The text file output comes from a program that was written in 1984, and I have no way to modify the output. I want it in csv format so that I can convert it to Excel as painlessly as possible. I am not sure where to start, and rather than reinvent the wheel, was hoping someone could point me in the right direction. Thanks!
== EDIT ==
I've modified the text file to have \n between rows - maybe this will be helpful?
== EDIT 2 ==
I've modified the text file to have a blank row.

Using GNU awk
gawk '
BEGIN { FIELDWIDTHS="11 6 5 22 22" }
length($1) == 11 {
if ($1 ~ /[^[:blank:]]/) {
if (f1) print_line()
f1=$1; f2=$2; f3=$3; f4=$4; f5=$5
}
else {
f4 = f4" "$4; f5 = f5" "$5
}
}
function rtrim(str) {
sub(/[[:blank:]]+$/, "", str)
return str
}
function print_line() {
gsub(/[[:blank:]]{2,}/, " ", f4); gsub(/"/, "&&", f4)
gsub(/[[:blank:]]{2,}/, " ", f5); gsub(/"/, "&&", f5)
printf "%s,%s,%s,\"%s\",\"%s\"\n", rtrim(f1), rtrim(f2), rtrim(f3),f4,f5
}
END {if (f1) print_line()}
' file
1000000,45,M,"This is a line that breaks into multiple rows ","This is another line that also breaks into multiple rows - How annoying!"
1000001,50,F,"I am another column that has text spanning multiple rows","I am well behaved. "
I've quoted the last 2 columns in case they contain commas, and doubled any potential inner double quotes.
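The doubled-quote convention is what standard CSV parsers expect. As a quick sanity check (using Python's csv module; the data here is hypothetical, just to exercise a comma and inner quotes in a field):

```python
import csv
import io

# An output line in the same style as above, with a comma and doubled
# inner quotes added to the quoted field (made-up data for illustration)
line = '1000000,45,M,"He said ""hi"", twice"\n'
row = next(csv.reader(io.StringIO(line)))
print(row[3])  # prints: He said "hi", twice
```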

Here's a Perl script that does what you want. It uses unpack to split the fixed width columns into fields, adding to the previous fields if there is no data in the first column.
As you've mentioned that the widths vary between files, the script works out the widths for itself, based on the content of the first line. The assumption is that there are at least two space characters between each field. It creates a format string like A11 A6 A5 A22 A21, where "A" denotes a space-padded text field (unpack trims the trailing whitespace) and the numbers specify the width of each field.
Inspired by glenn's version, I have wrapped any field containing spaces in double quotes. Whether that's useful or not depends on how you're going to end up using the data. For example, if you want to parse it using another tool and there are commas within the input, it may be helpful. If you don't want it to happen, you can drop the map block in both places, keeping just grep { $_ ne "" }:
use strict;
use warnings;
chomp (my $first_line = <>);
my @fields = split /(?<=\s{2})(?=\S)/, $first_line;
my $format = join " ", map { "A" . length } @fields;
my @cols = unpack $format, $first_line;
while(<>) {
chomp( my $line = $_ );
my @tmp = unpack $format, $line;
if ($tmp[0] ne '') {
print join(", ", map { /\s/ ? qq/"$_"/ : $_ } grep { $_ ne "" } @cols), "\n";
@cols = @tmp;
}
else {
for (1..$#tmp) {
$cols[$_] .= " $tmp[$_]" if $tmp[$_] ne "";
}
}
}
print join(", ", map { /\s/ ? qq/"$_"/ : $_ } grep { $_ ne "" } @cols), "\n";
Output:
1000000, 45, M, "This is a line that breaks into multiple rows", "This is another line that also breaks into multiple rows - How annoying!"
1000001, 50, F, "I am another column that has text spanning multiple rows", "I am well behaved."

Using this awk:
awk -F ' {2,}' -v OFS=', ' 'NF==5{if (p) print a[1], a[2], a[3], a[4], a[5];
for (i=1; i<=NF; i++) a[i]=$i; p=index($0,$4)}
NF<4 {for(i=2; i<=NF; i++) if (index($0,$i) == p) a[4]=a[4] " " $i; else a[5]=a[5] " " $i}
END { print a[1], a[2], a[3], a[4], a[5] }' file
1000000, 45, M, This is a line that breaks into multiple rows, This is another line that also breaks into multiple rows - How annoying!
1000001, 50, F, I am another column that has text spanning multiple rows, I am well behaved.

You can write a script in Python that does that. Read each line: if it is blank, the next non-blank line starts a new row; otherwise call split on it and either append the fields to the result set as a new row or merge them into the previous row. Finally, use a csv writer to write the result set to file.
Something along the lines of :
import csv

results = []
isNewItem = True
# filename and csvOutputFileName are placeholders, as in the question
with open(filename) as inputFile:
    for line in inputFile:
        if not line.strip():
            # blank row separates records
            isNewItem = True
            continue
        temp = line.split()
        if isNewItem or not results:
            results.append(temp)
            isNewItem = False
        else:
            # continuation row: append each fragment to the previous row
            lastRow = results[-1]
            results[-1] = [left + ' ' + right
                           for left, right in zip(lastRow, temp)]

with open(csvOutputFileName, 'w', newline='') as outFile:
    csv.writer(outFile).writerows(results)

Related

Copy columns of a file to specific location of another pipe delimited file

I have a file suppose xyz.dat which has data like below -
a1|b1|c1|d1|e1|f1|g1
a2|b2|c2|d2|e2|f2|g2
a3|b3|c3|d3|e3|f3|g3
Due to some requirement, I am making two new files(aka m.dat and o.dat) from original xyz.dat.
M.dat contains columns 2|4|6 like below after running some logic on it -
b11|d11|f11
b22|d22|f22
b33|d33|f33
O.dat contains all the columns except 2|4|6 like below without any change in it -
a1|c1|e1|g1
a2|c2|e2|g2
a3|c3|e3|g3
Now I want to merge both M and O file to create back the original format xyz.dat file.
a1|b11|c1|d11|e1|f11|g1
a2|b22|c2|d22|e2|f22|g2
a3|b33|c3|d33|e3|f33|g3
Please note column positions can change for another file. I will get the columns positions like in above example it is 2,4,6 so need some generic command to run in loop to merge the new M and O file or one command in which I can pass the columns positions and it will copy the columns form M.dat file and past it in O.dat file.
I tried paste, sed, cut but not able to make any perfect command.
Please help.
To perform a column-wise merge of two files, it is better to use a scripting engine (Python, awk, Perl or even bash). Tools like paste, sed and cut do not have enough flexibility for these tasks (join comes close, but requires extra work).
Consider the following awk based script
awk -F'|' -v OFS='|' '
{
getline s < "o.dat"
n = split(s, a)
# Print output; add a[n], $n, ... as needed based on the actual number of fields.
print a[1], $1, a[2], $2, a[3], $3, a[4]
}
' m.dat
The print line can be customized to generate whatever column order
Based on clarification from the OP, it looks like the goal is: given two input files and a list of columns whose data should be merged in from the 2nd file, produce an output file that contains the merged data.
For example:
awk -f mergeCols COLS=2,4,6 M=b.dat a.dat
# If file is marked executable (chmod +x mergeCols)
mergeCols COLS=2,4,6 M=b.dat a.dat
This will insert the columns from b.dat into columns 2, 4 and 6, while the other columns will contain the data from a.dat.
Implementation, using awk: (create a file mergeCols).
#! /usr/bin/awk -f
BEGIN {
FS=OFS="|"
}
NR==1 {
# Set the column map
nc=split(COLS, c, ",")
for (i=1 ; i<=nc ; i++ ) {
cmap[c[i]] = i
}
}
{
# Read one line from merged file, split into tokens in 'a'
getline s < M
n = split(s, a)
# Merge columns using pre-set 'cmap'
k=0
for (i=1 ; i<=NF+nc ; i++ ) {
# Pick up a column
v = cmap[i] ? a[cmap[i]] : $(++k)
sep = (i<NF+nc) ? "|" : "\n"
printf "%s%s", v, sep
}
}
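If awk isn't a hard requirement, the same interleaving can be sketched in Python (merge_cols and all the names here are mine, not from the question):

```python
def merge_cols(o_row, m_row, cols):
    """Rebuild a full row: the 1-based positions listed in `cols` take
    their values from m_row (in order); every other position comes
    from o_row (in order)."""
    colset = set(cols)
    oi, mi = iter(o_row), iter(m_row)
    return [next(mi) if pos in colset else next(oi)
            for pos in range(1, len(o_row) + len(m_row) + 1)]

# a1|c1|e1|g1 merged with b11|d11|f11 at positions 2, 4, 6
print('|'.join(merge_cols(['a1', 'c1', 'e1', 'g1'],
                          ['b11', 'd11', 'f11'], [2, 4, 6])))
# prints: a1|b11|c1|d11|e1|f11|g1
```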

Deleting lines with more than 30% lowercase letters

I am trying to process some data but I'm unable to find a working solution for my problem. I have a file which looks like:
>ram
cacacacacacacacacatatacacatacacatacacacacacacacacacacacacaca
cacacacacacacaca
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
>sam
AATTGGCCAATTGGCAATTCCGGAATTCaattggccaattccggaattccaattccgg
and many lines more....
I want to filter out all the lines and the corresponding headers (a header starts with >) where the sequence string (the lines not starting with >) contains 30 percent or more lowercase letters. The sequence strings can span multiple lines.
So after command xy the output should look like:
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
I tried some mix of a while loop for reading the input file and then working with awk, grep, sed but there was no good outcome.
Here's one idea, which sets the record separator to ">" to treat each header with its sequence lines as a single record.
Because the input starts with a ">", which causes an initial empty record, we guard the computation with NR > 1 (record number greater than one).
To count the number of characters we add the lengths of all the lines after the header. To count the number of lower-case characters, we save the string in another variable and use gsub to replace all the lower-case letters with nothing --- just because gsub returns the number of substitutions made, which is a convenient way of counting them.
Finally we check the ratio and print or not (adding back the initial ">" when we do print).
BEGIN { RS = ">" }
NR > 1 {
total_cnt = 0
lower_cnt = 0
for (i=2; i<=NF; ++i) {
total_cnt += length($i)
s = $i
lower_cnt += gsub(/[a-z]/, "", s)
}
ratio = lower_cnt / total_cnt
if (ratio < 0.3) print ">"$0
}
$ awk -f seq.awk seq.txt
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
Or:
awk '{n=length(gensub(/[A-Z]/,"","g"));if(NF && n/length*100 < 30)print a $0;a=RT}' RS='>[a-z]+\n' file
RS='>[a-z]+\n' - Sets the record separator to the line containing '>' and name
RT - This value is set by what is matched by RS above
a=RT - save previous RT value
n=length(gensub(/[A-Z]/,"","g")); - get the length of lower case chars
if(NF && n/length*100 < 30)print a $0; - check we have a value and that the percentage is less than 30 for lower case chars
awk '/^>/{b=B;gsub( /[A-Z]/,"",b);
if( length( b) < length( B) * 0.3) print H "\n" B
H=$0;B="";next}
{B=( (B != "") ? B "\n" : "" ) $0}
END{ b=B;gsub( /[A-Z]/,"",b);
if( length( b) < length( B) * 0.3) print H "\n" B
}' YourFile
Quick and dirty; a function would suit the need for printing better.
Nowadays I would not use sed or awk anymore for anything longer than 2 lines.
#! /usr/bin/perl
use strict; # Force variable declaration.
use warnings; # Warn about dangerous language use.
sub filter # Declare a sub-routing, a function called `filter`.
{
my ($header, $body) = @_; # Give the first two function arguments the names header and body.
my $lower = $body =~ tr/a-z//; # Count the translation of the characters a-z to nothing.
print $header, $body, "\n" # Print header, body and newline,
unless $lower / length ($body) > 0.3; # unless lower characters have more than 30%.
}
my ($header, $body); # Declare two variables for header and body.
while (<>) { # Loop over all lines from stdin or a file given in the command line.
if (/^>/) { # If the line starts with >,
filter ($header, $body) # call filter with header and body,
if defined $header; # if header is defined, which is not the case at the beginning of the file.
($header, $body) = ($_, ''); # Assign the current line to header and an empty string to body.
} else {
chomp; # Remove the newline at the end of the line.
$body .= $_; # Append the line to body.
}
}
filter ($header, $body); # Filter the last record.
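For comparison, a sketch of the same filter in Python (filter_fasta is my own name; like the Perl above, it keeps records whose lowercase fraction is below 30%):

```python
def filter_fasta(lines, threshold=0.3):
    """Keep (header, sequence) records whose lowercase fraction is
    below threshold. Sequences may span multiple lines; a header
    starts with '>'."""
    records = []
    header, body = None, ''

    def flush():
        if header is not None and body:
            if sum(c.islower() for c in body) / len(body) < threshold:
                records.append((header, body))

    for line in lines:
        if line.startswith('>'):
            flush()                        # finish the previous record
            header, body = line.rstrip('\n'), ''
        else:
            body += line.strip()           # sequences may span lines
    flush()                                # finish the last record
    return records
```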

replace terms with associated abbreviations from other file, in case of matching

I have two files:
1. Pattern file = pattern.txt
2. File containing different terms = terms.txt
pattern.txt contain two columns, separated by ;
In the first column I have several terms and in the second column abbreviations,
associated to the first column, same line.
terms.txt contain single words and terms defined by single words but also
by a combination of words.
pattern.txt
Berlin;Brln
Barcelona;Barcln
Checkpoint Charly;ChckpntChrl
Friedrichstrasse;Fridrchstr
Hall of Barcelona;HllOfBarcln
Paris;Prs
Yesterday;Ystrdy
terms.txt
Berlin
The Berlinale ended yesterday
Checkpoint Charly is still in Friedrichstrasse
There will be a fiesta in the Hall of Barcelona
Paris is a very nice city
The target is to replace terms with standardised abbreviations and to find out which terms
have no abbreviation.
As result I would like to have two files.
The first file is a new terms file, with terms replaced by abbreviations where it could be replaced.
The second file containing a list with all terms that doesn't have an abbreviation.
The output is case insensitive, I don't make difference between "The" and "the".
new_terms.txt
Brln
The Berlinale ended Ystrdy
ChckpntChrl is still in Fridrchstr
There will be a fiesta in the HllOfBarcln
Prs is a very nice city
terms_without_abbreviations.txt
a
be
Berlinale
city
ended
fiesta
in
is
nice
of
still
The
There
very
will
I will appreciate your help and thanks in advance for your time and hints!
This is mostly what you need:
BEGIN { FS=";"; }
FNR==NR { dict[tolower($1)] = $2; next }
{
line = "";
count = split($0, words, / +/);
for (i = 1; i <= count; i++) {
key = tolower(words[i]);
if (key in dict) {
words[i] = dict[key];
} else {
result[key] = words[i];
}
line = line " " words[i];
}
print substr(line, 2);
}
END {
count = asorti(result, sorted);
for (i = 1; i <= count; i++) {
print result[sorted[i]];
}
}
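To double-check the logic, here is the same word-by-word lookup sketched in Python (abbreviate is my own name; like the awk above, it works word by word, so multi-word terms such as Hall of Barcelona are not matched):

```python
def abbreviate(pattern_lines, term_lines):
    """Replace words that have an abbreviation; collect those that don't.
    Returns (new_lines, sorted list of words lacking an abbreviation)."""
    abbr = {}
    for p in pattern_lines:
        term, _, short = p.strip().partition(';')
        abbr[term.lower()] = short
    missing = {}                 # lowercase key -> word as first seen
    out = []
    for line in term_lines:
        words = line.split()
        for i, w in enumerate(words):
            key = w.lower()
            if key in abbr:
                words[i] = abbr[key]
            else:
                missing[key] = w
        out.append(' '.join(words))
    return out, [missing[k] for k in sorted(missing)]
```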
Ok, so I had a bit of a crack, but will explain issues:
If multiple changes in pattern.txt can apply to a single line, only the first change will be made and the second will not (e.g. Barcelona;Barcln and Hall of Barcelona;HllOfBarcln: once Barcelona has already been replaced by the time you get to the longer phrase, that phrase will no longer exist, so no change is made)
Similar to above, there is no abbreviation for the word 'Hall' so again if we assume above is true and only the first change was made, your new file for changes will include hall as not having an abbreviation
#!/usr/bin/awk -f
BEGIN{
FS = ";"
IGNORECASE = 1
}
FNR == NR{
abbr[tolower($1)] = $2
next
}
FNR == 1{ FS = " " }
{
for(i = 1; i <= NF; i++){
item = tolower($i)
if(!(item in abbr) && !(item in twa)){
twa[item]
print item > "terms_without_abbreviations.txt"
}
}
for(i in abbr)
gsub("\\<"i"\\>", abbr[i])
print > "new_terms.txt"
}
There are probably other gotchas to look out for, but this gives a rough direction. I'm not sure how you would get around the points above.

Sort Markdown file by heading

Is it possible to sort a markdown file by level-1 heading? I'm looking for sed or a similar command-line solution
#B
a content of B
#A
b content of A
to...
#A
b content of A
#B
a content of B
A perl one-liner, split for readability
perl -0777 -ne '
(undef,@paragraphs) = split /^#(?=[^#])/m;
print map {"#$_"} sort @paragraphs;
' file.md
You'll want to end the file with a blank line, so there's a blank line before #B. Or you could change
map {"#$_"} to map {"#$_\n"}
to forcibly insert one.
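If Perl isn't at hand, the same split-and-sort idea can be sketched in Python (sort_by_h1 is my own name; it keeps any content before the first heading at the top):

```python
import re

def sort_by_h1(text):
    """Sort the level-1 sections of a markdown string alphabetically.
    Content before the first '#' heading stays in place."""
    head, *sections = re.split(r'^#(?=[^#])', text, flags=re.M)
    return head + ''.join('#' + s for s in sorted(sections))

print(sort_by_h1('#B\na content of B\n#A\nb content of A\n'))
# prints the #A section first, then #B
```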
You can use GNU Awk with PROCINFO["sorted_in"] = "#ind_str_asc":
gawk 'BEGIN { PROCINFO["sorted_in"] = "#ind_str_asc"; RS = ""; ORS = "\n\n" }
{ a[$1] = $0 } END { for (i in a) print a[i] }' file
Output:
#A
b content of A
#B
a content of B
Reference:
PROCINFO["sorted_in"]
If this element exists in PROCINFO,
then its value controls the order in
which array elements are traversed in
for loops. Supported values are
"#ind_str_asc", "#ind_num_asc",
"#val_type_asc", "#val_str_asc",
"#val_num_asc", "#ind_str_desc",
"#ind_num_desc", "#val_type_desc",
"#val_str_desc", "#val_num_desc", and
"#unsorted". The value can also be the
name of any comparison function defined
as follows:
You can also use this script to sort on 3 levels instead of just one. It also won't strip out the content before the first occurrence of the first heading.
#!/usr/bin/env perl
local $/;
my $text = <>;
my ($start, @chapters) = split/^#(?=[^#])/m, $text;
print $start;
for (sort @chapters) {
my ($level1, @subchapters) = split/^##(?=[^#])/m;
print "#$level1";
for (sort @subchapters) {
my ($level2, @subsubchapters) = split/^###(?=[^#])/m;
print "##$level2";
print map {"###$_"} sort @subsubchapters;
}
}

search (e.g. awk, grep, sed) for string, then look for X lines above and another string below

I need to be able to search for a string (let's use 4320101), print the 20 lines above the string, and print everything after it until another string is found
For example:
Random text I do not want or blank line
16 Apr 2013 00:14:15
id="4320101"
</eventUpdate>
Random text I do not want or blank line
I just want the following result outputted to a file:
16 Apr 2013 00:14:15
id="4320101"
</eventUpdate>
There are multiple examples of these groups of text in a file that I want.
I tried using this below:
cat filename | grep "</eventUpdate>" -A 20 4320101 -B 100 > greptest.txt
But it only ever shows for 20 lines either side of the string.
Notes:
- the line number the text is on is inconsistent, so I cannot go off that; hence why I am using -A 20.
- ideally I'd rather have it so that when it searches after the string, it stops when it finds </eventUpdate> and then carries on searching.
Summary: find 4320101, output the 20 lines above it (or back to the nearest blank line), and then output all lines below it up to
</eventUpdate>
Doing research I am unsure of how to get awk, nawk or sed to work in my favour to do this.
This might work for you (GNU sed):
sed ':a;s/\n/&/20;tb;$!{N;ba};:b;/4320101/!D;:c;n;/<\/eventUpdate>/!bc' file
EDIT:
:a;s/\n/&/20;tb;$!{N;ba}; this keeps a window of 20 lines in the pattern space (PS)
:b;/4320101/!D; this moves the above window through the file until the pattern 4320101 is found.
:c;n;/<\/eventUpdate>/!bc the 20 line window is printed and any subsequent line until the pattern <\/eventUpdate> is found.
Here is an ugly awk solution :)
awk 'BEGIN{last=1}
{if((length($0)==0) || ($0 ~ /Random/))last=NR}
/4320101/{flag=1;
if((NR-last)>20) last=NR-20;
cmd="sed -n \""last+1","NR-1"p \" input.txt";
system(cmd);
}
flag==1{print}
/eventUpdate/{flag=0}' <filename>
So basically it keeps track of the last blank line, or line containing the Random pattern, in the last variable. When 4320101 is found, it prints from that line or from 20 lines back, whichever is nearer, through a system sed command, and sets the flag. The flag causes the following lines to be printed until eventUpdate is found. I have not tested it, but it should work.
Look-behind in sed/awk is always tricky.. This self contained awk script basically keeps the last 20 lines stored, when it gets to 4320101 it prints these stored lines, up to the point where the blank or undesired line is found, then it stops. At that point it switches into printall mode and prints all lines until the eventUpdate is encountered, then it prints that and quits.
awk '
function store( line ) {
for( i=1; i <= 20; i++ ) {
last[i-1] = last[i];
};
last[20]=line;
};
function purge() {
for( i=20; i >= 0; i-- ) {
if( length(last[i])==0 || last[i] ~ "Random" ) {
stop=i;
break
};
};
for( i=(stop+1); i <= 20; i++ ) {
print last[i];
};
};
{
store($0);
if( /4320101/ ) {
purge();
printall=1;
next;
};
if( printall == 1) {
print;
if( /eventUpdate/ ) {
exit 0;
};
};
}' test
Let's see if I understand your requirements:
You have two strings, which I'll call KEY and LIMIT. And you want to print:
At most 20 lines before a line containing KEY, but stopping if there is a blank line.
All the lines between a line containing KEY and the following line containing LIMIT. (This ignores your requirement that there be no more than 100 such lines; if that's important, it's relatively straightforward to add.)
The easiest way to accomplish (1) is to keep a circular buffer of 20 lines, and print it out when you hit key. (2) is trivial in either sed or awk, because you can use the two-address form to print the range.
So let's do it in awk:
#file: extract.awk
# Initialize the circular buffer
BEGIN { count = 0; }
# When we hit an empty line, clear the circular buffer
length() == 0 { count = 0; next; }
# When we hit `key`, print and clear the circular buffer
index($0, KEY) { for (i = count < 20 ? 0 : count - 20; i < count; ++i)
print buf[i % 20];
count = 0;
}
# While we're between key and limit, print the line
index($0, KEY),index($0, LIMIT) { print; next; }
# Otherwise, save the line
{ buf[count++ % 20] = $0; }
In order to get that to work, we need to set the values of KEY and LIMIT. We can do that on the command line:
awk -v "KEY=4320101" -v "LIMIT=</eventUpdate>" -f extract.awk $FILENAME
Notes:
I used index($0, foo) instead of the more usual /foo/, because it avoids having to escape regex special characters, and nowhere in the requirements are regexen even needed. index(haystack, needle) returns the index of needle in haystack, with indices starting at 1, or 0 if needle is not found. Used as a true/false value, it is true if needle is found.
next causes processing of the current line to end. It can be quite handy, as this little program shows.
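The same circular-buffer idea can be sketched in Python with collections.deque (extract and the other names are mine; plain substring matching stands in for awk's index()):

```python
from collections import deque

def extract(lines, key, limit, before=20):
    """Collect up to `before` lines preceding a line containing `key`
    (the buffer is cleared at blank lines), then every line through
    the next line containing `limit`."""
    buf = deque(maxlen=before)   # circular buffer of recent lines
    out = []
    printing = False
    for line in lines:
        if printing:
            out.append(line)
            if limit in line:
                printing = False
            continue
        if not line.strip():
            buf.clear()          # blank line: discard the look-behind
            continue
        if key in line:
            out.extend(buf)      # flush the buffered context
            out.append(line)
            buf.clear()
            printing = True
        else:
            buf.append(line)
    return out
```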
You can try something like this -
awk '{
a[NR] = $0
}
/<\/eventUpdate>/ {
x = NR
}
END {
for (i in a) {
if (a[i]~/4320101/) {
for (j=i-20;j<=x;j++) {
print a[j]
}
}
}
}' file
The simplest way is to use 2 passes of the file - the first to identify the line numbers in the range within which your target regexp is found, the second to print the lines in the selected range, e.g.:
awk '
NR==FNR {
if ($0 ~ /\<4320101\>/) {
for (i=NR-20;i<NR;i++)
range[i]
inRange = 1
}
if (inRange) {
range[NR]
}
if ($0 ~ /<\/eventUpdate>/) {
inRange = 0
}
next
}
FNR in range
' file file
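The two-pass idea translates naturally to Python when the file fits in memory (two_pass_extract and the other names are mine; plain substring matching stands in for the word-boundary regexes):

```python
def two_pass_extract(lines, key, limit, before=20):
    """Pass 1: collect the numbers of the lines to output; pass 2:
    emit the lines whose numbers were collected."""
    wanted = set()
    in_range = False
    for nr, line in enumerate(lines, 1):
        if key in line:
            # the `before` lines preceding the match
            wanted.update(range(max(1, nr - before), nr))
            in_range = True
        if in_range:
            wanted.add(nr)
        if limit in line:
            in_range = False
    return [line for nr, line in enumerate(lines, 1) if nr in wanted]
```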
