Sort Markdown file by heading - sorting

Is it possible to sort a markdown file by level-1 heading? I'm looking for sed or a similar command-line solution to go from:
#B
a content of B
#A
b content of A
to:
#A
b content of A
#B
a content of B

A perl one-liner, split for readability
perl -0777 -ne '
   (undef, @paragraphs) = split /^#(?=[^#])/m;
   print map {"#$_"} sort @paragraphs;
' file.md
You'll want to end the file with a blank line, so there's a blank line before #B. Or you could change
map {"#$_"} to map {"#$_\n"}
to forcibly insert one.

You can use GNU Awk with PROCINFO["sorted_in"] = "#ind_str_asc":
gawk 'BEGIN { PROCINFO["sorted_in"] = "#ind_str_asc"; RS = ""; ORS = "\n\n" }
{ a[$1] = $0 } END { for (i in a) print a[i] }' file
Output:
#A
b content of A
#B
a content of B
Reference:
PROCINFO["sorted_in"]
If this element exists in PROCINFO,
then its value controls the order in
which array elements are traversed in
for loops. Supported values are
"#ind_str_asc", "#ind_num_asc",
"#val_type_asc", "#val_str_asc",
"#val_num_asc", "#ind_str_desc",
"#ind_num_desc", "#val_type_desc",
"#val_str_desc", "#val_num_desc", and
"#unsorted". The value can also be the
name of any comparison function defined
as follows:

You can also use this script to sort on three levels instead of just one. It also won't strip out the content before the first occurrence of the first heading.
#!/usr/bin/env perl
local $/;
my $text = <>;
my ($start, @chapters) = split /^#(?=[^#])/m, $text;
print $start;
for (sort @chapters) {
    my ($level1, @subchapters) = split /^##(?=[^#])/m;
    print "#$level1";
    for (sort @subchapters) {
        my ($level2, @subsubchapters) = split /^###(?=[^#])/m;
        print "##$level2";
        print map {"###$_"} sort @subsubchapters;
    }
}

Related

Simple one-liner to merge lines with common first field

In my work building an English language database, I often deal with text content from different sources, and need to merge lines that share the same first field. I often hack this in a text editor with a regex that captures a first field, searching across "\n", but often I have text files >10GB, so a command-line, streaming solution is preferred to in-memory.
Sample input:
apple|pear
apple|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson
cherry|ruddy
cherry|cerise
Desired output:
apple|pear|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy|cerise
The logic is to concatenate (joined by "|") all lines with the same first field.
The only delimiter is "|", and the delimiter only appears once per input line. i.e. it's effectively a 2-column text file. The file sorting does not matter, the only concern is consecutive lines with the identical first field.
I have lots of solutions and one-liners (often in awk or ruby) to process same-line content, but I run into knots when dealing with multiple lines, and would appreciate help. For some reason, multiline processing always bogs me down.
I'm sure this can be done succinctly with awk.
Assumptions/understandings:
overall file may not be sorted (by 1st field)
all lines with the same string in the 1st field will be listed consecutively; this should eliminate the need to maintain a large volume of data in memory with the tradeoff that we'll need a bit more typing
2nd field may contain trailing white space (per sample input); this will need to be removed
output does not need to be sorted (by 1st field)
One awk idea:
awk '
function print_line() {
if (prev != "")
print prev,data
}
BEGIN { FS=OFS="|" }
{ if ($1 != prev) {
print_line()
prev=$1
data=""
}
gsub(/[[:space:]]+$/,"",$2) # strip trailing white space
data= data (data=="" ? "" : OFS) $2 # concatenate 2nd fields with OFS="|"
}
END { print_line() } # flush last set of data to stdout
' pipe.dat
This generates:
apple|pear|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy|cerise
Using any awk in any shell on every Unix box and assuming your input is grouped by the first field as shown in your sample input and you don't really have trailing blanks at the end of some lines:
$ cat tst.awk
BEGIN { FS=OFS="|" }
$1 != prev {
if ( NR>1 ) {
print out
}
out = prev = $1
}
{ out = out OFS $2 }
END { print out }
$ awk -f tst.awk file
apple|pear|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy|cerise
If it's not grouped then do sort file | awk -f tst.awk and if there are trailing blanks then add { sub(/ +$/,"") } as the first line of the script.
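For reference, the trailing-blank variant would then look like this (the same tst.awk as above, with that one extra line at the top):
$ cat tst.awk
{ sub(/ +$/,"") }   # strip trailing blanks before anything else
BEGIN { FS=OFS="|" }
$1 != prev {
if ( NR>1 ) {
print out
}
out = prev = $1
}
{ out = out OFS $2 }
END { print out }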
Here is a Ruby solution that reads the file line-by-line. At the end I show how much simpler the solution could be if the file could be gulped into a string.
Let's first create an input file to work with.
str =<<~_
apple|pear
apple|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson
cherry|ruddy
cherry|cerise
_
file_name_in = 'file_in'
File.write(file_name_in, str)
#=> 112
Solution when file is read line-by-line
We can produce the desired output file with the following method.
def doit(file_name_in, file_name_out)
fin = File.new(file_name_in, "r")
fout = File.new(file_name_out, "w")
str = ''
until fin.eof?
s = fin.gets.strip
k,v = s.split(/(?=\|)/)
if str.empty?
str = s
key = k
elsif k == key
str << v
else
fout.puts(str)
str = s
key = k
end
end
fout.puts(str)
fin.close
fout.close
end
Let's try it.
file_name_out = 'file_out'
doit(file_name_in, file_name_out)
puts File.read(file_name_out)
prints the following.
apple|pear|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy|cerise
Note that
"apple|pear".split(/(?=\|)/)
#=> ["apple", "|pear"]
The regular expression contains the positive lookahead (?=\|) which matches the zero-width location between 'e' and '|'.
Solution when file is gulped into a string
The OP does not want to gulp the file into a string (hence my solution above) but I would like to show how much simpler the problem is if one could do so. Here is one of many ways of doing that.
def gulp_it(file_name_in, file_name_out)
File.write(file_name_out,
File.read(file_name_in).gsub(/^(.+)\|.*[^ ]\K *\r?\n\1/, ''))
end
gulp_it(file_name_in, file_name_out)
#=> 98
puts File.read(file_name_out)
prints
apple|pear|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy
cherry|cerise
Thinking about what the regex engine will be doing, this may be acceptably fast, depending on file size, of course.
Regex demo
While the link uses the PCRE engine the result would be the same using Ruby's regex engine (Onigmo). We can make the regular expression self-documenting by writing it in free-spacing mode.
/
^ # match the beginning of a line
(.+) # match one or more characters
\|.*[^ ] # match '|', then zero or more chars, then a non-space
\K # resets the starting point of the match and discards
# any previously-matched characters
[ ]* # match zero or more spaces
\r?\n # match the line terminator(s)
\1 # match the content of capture group 1
/x # invoke free-spacing mode
(.+) matches 'apple', 'banana' and 'cherry' because those words are at the beginnings of lines. One could alternatively write ([^|]*).
Assuming you have the following sample.txt
apple|pear
apple|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson
cherry|ruddy
cherry|cerise
I am not sure why you want the solution as a "one liner", but the following will do what you want.
cat sample.txt | ruby -e 'puts STDIN.readlines.map {_1.strip}.group_by {_1.split("|").first}.map{|k,v| v.reduce("#{k}") {"#{_1}|#{_2.split("|").last}"}}'
A more readable version with comments describing what's going on:
stripped_lines = STDIN.readlines.map { |l| l.strip } # remove leading and trailing whitespace
# create a hash where the keys are the value to the left of the |
# and the values are lines beginning with that key ie
# {
# "apple"=>["apple|pear", "apple|quince"],
# "apple cider"=>["apple cider|juice"],
# "banana"=>["banana|plantain"],
# "cherry"=>["cherry|cheerful, crimson", "cherry|ruddy", "cherry|cerise"]
# }
grouped_by_first_element = stripped_lines.group_by { |sl| sl.split('|').first }
# map to the desired result by starting with the key
# and then concatenating the part to the right of the | for each element
# ie start with apple then append |pear to get apple|pear then append quince to that to get
# apple|pear|quince
result = grouped_by_first_element.map do |key, values|
values.reduce("#{key}") do |memo, next_element|
"#{memo}|#{next_element.split('|').last}"
end
end
puts result
If we assume s is a string containing all of the lines in the file.
s.split("\n").inject({}) { |h, x| k, v = x.split('|'); h[k] ||= []; h[k] << v.strip; h }
Will yield:
{"apple"=>["pear", "quince"], "apple cider"=>["juice"], "banana"=>["plantain"], "cherry"=>["cheerful, crimson", "ruddy", "cerise"]}
Then:
s.split("\n").inject({}) { |h, x| k, v = x.split('|'); h[k] ||= []; h[k] << v.strip; h }.map { |k, v| "#{k}|#{v.join('|')}" }
Yields:
["apple|pear|quince", "apple cider|juice", "banana|plantain", "cherry|cheerful, crimson|ruddy|cerise"]
A pure bash solution could look like this:
unset out # make sure we start fresh (important if this is in a loop)
declare -A out # declare associative array
d='|' # delimiter
# append all values to the key
while IFS=${d} read -r key val; do
out[${key}]="${out[${key}]}${d}${val}"
done <file
# print desired output
for key in "${!out[@]}"; do
printf '%s%s\n' "${key}" "${out[$key]}"
done | sort -t"${d}" -k1
### actual output
apple cider|juice
apple|pear|quince
banana|plantain
cherry|cheerful, crimson|ruddy|cerise
Or you could do this with awk. As mentioned in a comment, pure bash is not a great option, mostly due to performance and portability.
awk -F'|' '{
sub(/[[:space:]]*$/,"") # only necessary if you wish to trim trailing whitespace, which existed in your example data
a[$1]=a[$1] "|" $2 # append value to string
} END {
for(i in a) print i a[i] # print all recreated lines
}' <file
### actual output
apple|pear|quince
banana|plantain
apple cider|juice
cherry|cheerful, crimson|ruddy|cerise

Deleting lines with more than 30% lowercase letters

I'm trying to process some data but I'm unable to find a working solution for my problem. I have a file which looks like:
>ram
cacacacacacacacacatatacacatacacatacacacacacacacacacacacacaca
cacacacacacacaca
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
>sam
AATTGGCCAATTGGCAATTCCGGAATTCaattggccaattccggaattccaattccgg
and many more lines...
I want to filter out all the lines and the corresponding headers (headers start with >) where the sequence strings (those not starting with >) contain 30 percent or more lowercase letters. The sequence strings can span multiple lines.
So after command xy the output should look like:
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
I tried a mix of a while loop for reading the input file and then working with awk, grep, and sed, but there was no good outcome.
Here's one idea, which sets the record separator to ">" to treat each header with its sequence lines as a single record.
Because the input starts with a ">", which causes an initial empty record, we guard the computation with NR > 1 (record number greater than one).
To count the number of characters we add the lengths of all the lines after the header. To count the number of lower-case characters, we save the string in another variable and use gsub to replace all the lower-case letters with nothing --- just because gsub returns the number of substitutions made, which is a convenient way of counting them.
Finally we check the ratio and print or not (adding back the initial ">" when we do print).
BEGIN { RS = ">" }
NR > 1 {
total_cnt = 0
lower_cnt = 0
for (i=2; i<=NF; ++i) {
total_cnt += length($i)
s = $i
lower_cnt += gsub(/[a-z]/, "", s)
}
ratio = lower_cnt / total_cnt
if (ratio < 0.3) print ">"$0
}
$ awk -f seq.awk seq.txt
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
Or:
awk '{n=length(gensub(/[A-Z]/,"","g"));if(NF && n/length*100 < 30)print a $0;a=RT}' RS='>[a-z]+\n' file
RS='>[a-z]+\n' - Sets the record separator to the line containing '>' and name
RT - This value is set by what is matched by RS above
a=RT - save previous RT value
n=length(gensub(/[A-Z]/,"","g")); - get the length of lower case chars
if(NF && n/length*100 < 30)print a $0; - check we have a value and that the percentage is less than 30 for lower case chars
awk '/^>/{ b=B; gsub(/[A-Z]/,"",b);
           if( length(b) < length(B) * 0.3 ) print H "\n" B
           H=$0; B=""; next }
     { B=( (B != "") ? B "\n" : "" ) $0 }
     END{ b=B; gsub(/[A-Z]/,"",b);
          if( length(b) < length(B) * 0.3 ) print H "\n" B
        }' YourFile
Quick and dirty; a function would suit the need for printing better.
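For example, a sketch of that refactor with the duplicated test moved into a function (the name emit and the file name filter.awk are purely illustrative; the logic is the same as above), to be run as awk -f filter.awk YourFile:
function emit(header, body,    b) {
    b = body
    gsub(/[A-Z]/, "", b)                   # strip uppercase; what remains counts as lowercase
    if (length(b) < length(body) * 0.3)    # keep records whose lowercase share is below 30%
        print header "\n" body
}
/^>/ { emit(H, B); H = $0; B = ""; next }
     { B = (B != "" ? B "\n" : "") $0 }
END  { emit(H, B) }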
Nowadays I would not use sed or awk anymore for anything longer than 2 lines.
#! /usr/bin/perl
use strict; # Force variable declaration.
use warnings; # Warn about dangerous language use.
sub filter # Declare a subroutine, a function called `filter`.
{
my ($header, $body) = @_; # Give the first two function arguments the names header and body.
my $lower = $body =~ tr/a-z//; # Count the a-z characters (tr returns the number of matches).
print $header, $body, "\n" # Print header, body and newline,
unless $lower / length ($body) > 0.3; # unless lowercase characters make up more than 30%.
}
my ($header, $body); # Declare two variables for header and body.
while (<>) { # Loop over all lines from stdin or a file given in the command line.
if (/^>/) { # If the line starts with >,
filter ($header, $body) # call filter with header and body,
if defined $header; # if header is defined, which is not the case at the beginning of the file.
($header, $body) = ($_, ''); # Assign the current line to header and an empty string to body.
} else {
chomp; # Remove the newline at the end of the line.
$body .= $_; # Append the line to body.
}
}
filter ($header, $body); # Filter the last record.

awk: Interpreting strings as mathematical expressions

Context: I have an input file that contains parameters with associated values followed by literal mathematical expressions such as:
PARAMETERS DEFINITION
A = 5; B = 2; C=1.5; D=7.5
MATHEMATICAL EXPRESSIONS
A*B
C/D
...
and I would like to get the strings of the second part to be interpreted as mathematical expressions so that I get the results of the expressions in my output file:
...
MATHEMATICAL EXPRESSIONS
10
0.2
...
What I did already: So far, using awk, I store all the parameters names and their corresponding values in two distinct arrays. I then replace each parameter with its value so that I am now in a similar situation as the author of this thread.
However, the answers s/he gets are not in awk except for the last one which is very specific to her/his situation, and hard to understand for me as a beginner with awk and shell scripting.
What I tried afterwards: As I have no clue how to do this in awk, the idea I had was to store the new field value in a variable, then use a shell command within the awk script like this:
#!/bin/awk -f
BEGIN{}
{
myExpression=$1
system("echo $myExpression | bc")
}
END{}
This, unfortunately does not work as the variable is somehow not recognized by the echo command.
What I would like:
I would prefer a solution using awk alone with no call to external functions, however, I am not against one using a shell command if it is simpler.
EDIT: Taking into account all the comments so far, I will be more precise; my input files look more like this:
PARAMETERS_DEFINITION
[param1] = 5
[param2] = 2
[param3] = 1.5
[param4] = 7.5
MATHEMATICAL_EXPRESSIONS
[param1]*[param2]
some text containing also numbers and formulas that I do not want to be affected.
e.g: 1.45*2.6 = x, de(x)/dx=e(x) ; blah,blah,blah
[param3]/[param4]
The names of the parameters are complex enough so that any match of the string: "[param#]" within the document corresponds to a parameter that I want changed for its value.
The way I manage to store the parameters and their values in arrays is the following:
{
if (match($2,/PARAMETERS_DEFINITION/) != 0) {paramSwitch = 1}
if (match($2,/MATHEMATICAL_EXPRESSIONS/) != 0) {paramSwitch = 0}
if (paramSwitch == 1)
{
parameterName[numOfParam] = $1 ;
parameterVal[numOfParam] = $3 ;
numOfParam += 1
}
}
Instead of this:
{
myExpression=$1
system("echo $myExpression | bc")
}
I think you'd want this:
{
myExpression=$1
system("echo " myExpression " | bc")
}
That's because in awk, assignments do not end up as environment variables, and putting strings next to each other concatenates them.
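As a side note (not part of the answer above), if you want bc's result back in an awk variable rather than just printed, awk's cmd | getline form can capture it; a minimal sketch along the same lines, assuming one expression per input line and bc on the PATH:
{
    myExpression = $1
    cmd = "echo " myExpression " | bc"   # build the shell command by concatenation
    if ((cmd | getline result) > 0)      # run it and read the first line of its output
        print myExpression " = " result
    close(cmd)                           # close the pipe so the next record starts a fresh bc
}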
You're asking awk to interpret strings as mathematical expressions. This functionality is usually called eval, and no, (AFAIK) awk doesn't know such a function. Therefore your question is a typical XY problem.
The right tool for this is bc, where you need to modify (nearly) nothing: simply feed bc your input, only ensuring that the variables are lowercase, as in the following input (your example, edited):
#PARAMETERS DEFINITION
a=5; b=2; c=1.5; d=7.5
#MATHEMATICAL EXPRESSIONS
a*b
c/d
using like
bc -l < inputfile
produces
10
.20000000000000000000
EDIT
For your edit, with the new input data, the following
grep '\[' inputfile | sed 's/[][]//g' | bc -l
for the input
PARAMETERS_DEFINITION
[param1] = 5
[param2] = 2
[param3] = 1.5
[param4] = 7.5
MATHEMATICAL_EXPRESSIONS
[param1]*[param2]
some text containing also numbers and formulas that I do not want to be affected.
e.g: 1.45*2.6 = x, de(x)/dx=e(x) ; blah,blah,blah
[param3]/[param4]
produces the following output:
10
.20000000000000000000
That is, grep out only the lines that contain [ (any parameter definition or expression) and remove any [ or ], creating the following bc program:
param1 = 5
param2 = 2
param3 = 1.5
param4 = 7.5
param1*param2
param3/param4
and send the whole "program" to bc...
Using BIDMAS as a basis, I have created this mathematical function in awk.
I have not included brackets (or indices) yet, as they will require some extra effort, but I may add them later.
This awk script effectively works as bc does.
No system call required; it's all in awk.
Generic version for all applications
awk '{
    split($0,a,"+")
    for(i in a){
        split(a[i],s,"-")
        for(j in s){
            split(s[j],m,"*")
            for(k in m){
                split(m[k],d,"/")
                for(l in d){
                    if(l>1)d[1]=d[1]/d[l]
                }
                m[k]=d[1]
                delete d
                if(k>1)m[1]=m[1]*m[k]
            }
            s[j]=m[1]
            delete m
            if(j>1)s[1]=s[1]-s[j]
        }
        a[i]=s[1]
        delete s
    }
    for(i in a)b=b+a[i]
    print b
}
{b=0}' file
For your specific example
awk '
/MATHEMATICAL_EXPRESSIONS/{z=1}
NR>1&&!z{split($0,y," = ");x[y[1]]=y[2]}
z&&/[\+\-\/\*]/{
    for (n in x)gsub(n,x[n])
    split($0,a,"+")
    for(i in a){
        split(a[i],s,"-")
        for(j in s){
            split(s[j],m,"*")
            for(k in m){
                split(m[k],d,"/")
                for(l in d){
                    if(l>1)d[1]=d[1]/d[l]
                }
                m[k]=d[1]
                delete d
                if(k>1)m[1]=m[1]*m[k]
            }
            s[j]=m[1]
            delete m
            if(j>1)s[1]=s[1]-s[j]
        }
        a[i]=s[1]
        delete s
    }
    for(i in a)b=b+a[i]
    print b
}
{b=0}' file
There's something like an eval in awk: a magical conversion when needed in the context; here, adding +0 would do the conversion.
Here is what I got for you (detailed version below) with a file named awkinput containing your example input:
awk '/[A-Z]=[0-9.]+;/ { for (i=1;i<=NF ;i++) { print "working on "$i; split($i,fields,"="); sub(/;/,"",fields[2]); params[fields[1]]=strtonum(fields[2]) } }; /[A-Z](*|\/|+|-)[A-Z]/ { for (p in params) { sub(p, params[p],$0); }; system("echo " $0 " | bc -ql") }' awkinput
Detailed:
/[A-Z]=[0-9.]+;?/ { # if we match something like A=4.2, with or without a ; at the end
    for (i=1;i<=NF ;i++) { # loop through the fields (separated by space, awk's default Field Separator)
        print "working on "$i; # inform on what we do
        split($i,fields,"="); # split into an array to get param and value
        sub(/;/,"",fields[2]); # remove the ; at the end, if any
        params[fields[1]]=strtonum(fields[2]) # new array of parameters where the values are numeric
    }
}
/[A-Z](*|\/|+|-)[A-Z]/ { # when the line matches a math operation with one param on each side (at least)
    for (p in params) { # loop over known params
        sub(p, params[p],$0); # replace each param with its value
    };
    system("echo " $0 " | bc -ql") # print the result (no way to get rid of the system call here)
}
Drawback:
An expression of the form AB*C would be resolved to 52*1.5
$ cat test
PARAMETERS DEFINITION
A=5; B=2; C=1.5; D=7.5
MATHEMATICAL EXPRESSIONS
A*B
C/D
$ awk -vRS='[= ;\n]' '{if ($0 ~ /[0-9]/){a[x] = $0; print x"="a[x]}else{x=$0}}/MATHEMATICAL/{print "MATHEMATICAL EXPRESSIONS"}{if ($0~"*") print a[substr($0,1,1)] * a[substr($0,3,1)]}{if ($0~"/") print a[substr($0,1,1)] / a[substr($0,3,1)]}' test
A=5
B=2
C=1.5
D=7.5
MATHEMATICAL EXPRESSIONS
10
0.2
Formatted nicely:
$ cat test.awk
# Store all variables in an array
{
if ($0 ~ /[0-9]/){
a[x] = $0;
print x " = " a[x] # Print the keys & values
}
else{
x = $0
}
}
# Print header
/MATHEMATICAL/ {print "MATHEMATICAL EXPRESSIONS"}
# Do the maths (case can work too, but it's not as widely available)
{
if ($0~"*")
print a[substr($0,1,1)] * a[substr($0,3,1)]
}
{
if ($0~"/")
print a[substr($0,1,1)] / a[substr($0,3,1)]
}
{
if ($0~"+")
print a[substr($0,1,1)] + a[substr($0,3,1)]
}
{
if ($0~"-")
print a[substr($0,1,1)] - a[substr($0,3,1)]
}
$ cat test
PARAMETERS DEFINITION
A=5; B=2; C=1.5; D=7.5
MATHEMATICAL EXPRESSIONS
A*B
C/D
D+C
C-A
$ awk -f test.awk -vRS='[= ;\n]' test
A = 5
B = 2
C = 1.5
D = 7.5
MATHEMATICAL EXPRESSIONS
10
0.2
9
-3.5

How to merge rows from the same column using unix tools

I have a text file that looks like the following:
1000000 45 M This is a line This is another line Another line
that breaks into that also breaks that has a blank
multiple rows into multiple rows - row below.
How annoying!
1000001 50 F I am another I am well behaved.
column that has
text spanning
multiple rows
I would like to convert this into a csv file that looks like:
1000000, 45, M, This is a line that breaks into multiple rows, This is another line that also breaks into multiple rows - How annoying!
1000001, 50, F, I am another column that has text spanning multiple rows, I am well behaved.
The text file output comes from a program that was written in 1984, and I have no way to modify the output. I want it in csv format so that I can convert it to Excel as painlessly as possible. I am not sure where to start, and rather than reinvent the wheel, was hoping someone could point me in the right direction. Thanks!
== EDIT ==
I've modified the text file to have \n between rows - maybe this will be helpful?
== EDIT 2 ==
I've modified the text file to have a blank row.
Using GNU awk
gawk '
BEGIN { FIELDWIDTHS="11 6 5 22 22" }
length($1) == 11 {
if ($1 ~ /[^[:blank:]]/) {
if (f1) print_line()
f1=$1; f2=$2; f3=$3; f4=$4; f5=$5
}
else {
f4 = f4" "$4; f5 = f5" "$5
}
}
function rtrim(str) {
sub(/[[:blank:]]+$/, "", str)
return str
}
function print_line() {
gsub(/[[:blank:]]{2,}/, " ", f4); gsub(/"/, "&&", f4)
gsub(/[[:blank:]]{2,}/, " ", f5); gsub(/"/, "&&", f5)
printf "%s,%s,%s,\"%s\",\"%s\"\n", rtrim(f1), rtrim(f2), rtrim(f3),f4,f5
}
END {if (f1) print_line()}
' file
1000000,45,M,"This is a line that breaks into multiple rows ","This is another line that also breaks into multiple rows - How annoying!"
1000001,50,F,"I am another column that has text spanning multiple rows","I am well behaved. "
I've quoted the last 2 columns in case they contain commas, and doubled any potential inner double quotes.
Here's a Perl script that does what you want. It uses unpack to split the fixed width columns into fields, adding to the previous fields if there is no data in the first column.
As you've mentioned that the widths vary between files, the script works out the widths for itself, based on the content of the first line. The assumption is that there are at least two space characters between each field. It creates a format string like A11 A6 A5 A22 A21, where "A" means any character and the numbers specify the width of each field.
Inspired by glenn's version, I have wrapped any field containing spaces in double quotes. Whether that's useful or not depends on how you're going to end up using the data. For example, if you want to parse it using another tool and there are commas within the input, it may be helpful. If you don't want it to happen, you can change the map/grep in both places to simply grep { $_ ne "" }:
use strict;
use warnings;
chomp (my $first_line = <>);
my @fields = split /(?<=\s{2})(?=\S)/, $first_line;
my $format = join " ", map { "A" . length } @fields;
my @cols = unpack $format, $first_line;

while (<>) {
    chomp( my $line = $_ );
    my @tmp = unpack $format, $line;
    if ($tmp[0] ne '') {
        print join(", ", map { /\s/ ? qq/"$_"/ : $_ } grep { $_ ne "" } @cols), "\n";
        @cols = @tmp;
    }
    else {
        for (1..$#tmp) {
            $cols[$_] .= " $tmp[$_]" if $tmp[$_] ne "";
        }
    }
}
print join(", ", map { /\s/ ? qq/"$_"/ : $_ } grep { $_ ne "" } @cols), "\n";
Output:
1000000, 45, M, "This is a line that breaks into multiple rows", "This is another line that also breaks into multiple rows - How annoying!"
1000001, 50, F, "I am another column that has text spanning multiple rows", "I am well behaved."
Using this awk:
awk -F ' {2,}' -v OFS=', ' 'NF==5{if (p) print a[1], a[2], a[3], a[4], a[5];
for (i=1; i<=NF; i++) a[i]=$i; p=index($0,$4)}
NF<4 {for(i=2; i<=NF; i++) index($0,$i) == p ? a[4]=a[4] " " $i : a[5]=a[5] $i}
END { print a[1], a[2], a[3], a[4], a[5] }' file
1000000, 45, M, This is a line that breaks into multiple rows, This is another line that also breaks into multiple rows - How annoying!
1000001, 50, F, I am another column that has text spanning multiple rows, I am well behaved.
You can write a script in Python that does that. Read each line and call split on it; if the line is not blank, append it to the previous row, and if it is blank, start a new row with the next non-blank line. Finally, use the csv writer to write the result set to a file.
Something along the lines of:
import csv
from itertools import zip_longest

inputFile = open(filename, 'r')   # filename and csvOutputFileName defined elsewhere
isNewItem = True
results = []
for line in inputFile:
    if line.strip() == '':        # a blank line means the next line starts a new record
        isNewItem = True
        continue
    temp = line.split()
    if isNewItem or len(results) == 0:
        results.append(temp)
        isNewItem = False
    else:                         # continuation line: merge its pieces into the previous row
        lastRow = results[-1]
        combinedRow = []
        for leftColumn, rightColumn in zip_longest(lastRow, temp, fillvalue=''):
            combinedRow.append((leftColumn + ' ' + rightColumn).strip())
        results[-1] = combinedRow

with open(csvOutputFileName, 'w') as outFile:
    csv.writer(outFile).writerows(results)

aggregate totals when key changes in Perl

I have an input file with the following format
ant,1
bat,1
bat,2
cat,4
cat,1
cat,2
dog,4
I need to aggregate the col2 for each key (column1) so the result is:
ant,1
bat,3
cat,7
dog,4
Other considerations:
Assume that the input file is sorted
The input file is pretty large (about 1M rows), so I don't want to use an array and take up memory
Each input line should be processed as we read it, and move to the next line
I need to write the results to an outFile
I need to do this in Perl, but a pseudo-code or algorithm would help just as fine
Thanks!
This is what I came up with... I want to see if this can be written better/more elegantly.
open infile, outFile
$prev_line = <infile>;
$print_line = $prev_line;
while(<>){
    $curr_line = $_;
    @prev_cols = split(',', $prev_line);
    @curr_cols = split(',', $curr_line);
    if ( $prev_cols[0] eq $curr_cols[0] ){
        $prev_cols[1] += $curr_cols[1];
        $print_line = "$prev_cols[0],$prev_cols[1]\n";
        $print_flag = 0;
    }
    else{
        print outFile "$print_line";
        $print_flag = 1;
        $print_line = $curr_line;
    }
    $prev_line = $curr_line;
}
if($print_flag == 1){
    print outFile "$curr_line";
}
else{
    print outFile "$print_line";
}
#!/usr/bin/perl
use warnings;
use strict;
use integer;
my %a;
while (<>) {
my ($animal, $n) = /^\s*(\S+)\s*,\s*(\S+)/;
$a{$animal} += $n if defined $n;
}
print "$_,${a{$_}}\n" for sort keys %a;
This short code affords you the chance to learn Perl's excellent hash facility, as %a. Hashes are central to Perl. One really cannot write fluent Perl without them.
Observe incidentally that the code exercises Perl's interesting autovivification feature. The first time a particular animal is encountered in the input stream, no count exists, so Perl implicitly assumes a pre-existing count of zero. Thus, the += operator does not fail, even though it seems that it should. It just adds to zero in the first instance.
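A tiny demonstration of that behaviour, separate from the script above (the hash and key names are just illustrative):
use strict;
use warnings;
my %count;
$count{cat} += 4;        # no entry for 'cat' yet, so Perl starts from zero
$count{cat} += 3;        # now adds to the existing 4
print "$count{cat}\n";   # prints 7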
On the other hand, it may happen that not only the number of data but the number of animals is so large that one would not like to store the hash %a. In this case, one can still calculate totals, provided only that the data are sorted by animal in the input, as they are in your example. In this case, something like the following might suit (though regrettably it is not nearly so neat as the above).
#!/usr/bin/perl
use warnings;
use strict;
use integer;
my $last_animal = undef;
my $total_for_the_last_animal = 0;
sub start_new_animal ($$) {
my $next_animal = shift;
my $n = shift;
print "$last_animal,$total_for_the_last_animal\n"
if defined $last_animal;
$last_animal = $next_animal;
$total_for_the_last_animal = $n;
}
while (<>) {
my ($animal, $n) = /^\s*(\S+)\s*,\s*(\S+)/;
if (
defined($n) && defined($animal) && defined($last_animal)
&& $animal eq $last_animal
) { $total_for_the_last_animal += $n; }
else { start_new_animal $animal, $n; }
}
start_new_animal undef, 0;
Use Perl’s awk mode.
-a
turns on autosplit mode when used with a -n or -p. An implicit split command to the @F array is done as the first thing inside the implicit while loop produced by the -n or -p.
perl -ane 'print pop(@F), "\n";'
is equivalent to
while (<>) {
    @F = split(' ');
    print pop(@F), "\n";
}
An alternate delimiter may be specified using -F.
All that’s left for you is to accumulate the sums in a hash and print them.
$ perl -F, -lane '$s{$F[0]} += $F[1];
END { print "$_,$s{$_}" for sort keys %s }' input
Output:
ant,1
bat,3
cat,7
dog,4
It's trivial in perl. Loop on the file input. Split the input line on comma. For each key in column one keep a hash to which you add the value in column two. At the end of the file print the list of hash keys and their values. It can be done in one line but that would obfuscate the algorithm.
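A minimal sketch of that description, written out long-hand rather than as a one-liner (reading from stdin or a file named on the command line):
#!/usr/bin/perl
use strict;
use warnings;
my %total;
while (<>) {                        # loop over the input line by line
    chomp;
    my ($key, $value) = split /,/;  # split the line on the comma
    $total{$key} += $value;         # accumulate column 2 per column-1 key
}
print "$_,$total{$_}\n" for sort keys %total;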
