I am attempting to extract the data between each occurrence of two patterns:
Pattern 1: CardDetail
Pattern 2: ]
The input file, input.txt, has thousands of lines that vary in content. The lines I'm concerned with will always contain CardDetail somewhere in the line. Finding the matching lines is easy enough with awk, but pulling the data between each match and placing it on separate lines is where I'm falling short.
input.txt contains data about network gear and any attached/child devices. It looks something like this:
DeviceDetail [baseProductId=router-5000, cardDetail=[CardDetail [baseCardId=router-5000NIC1, cardDescription=Router 5000 NIC, cardSerial=5000NIC1], CardDetail [baseCardId=router-5000NIC2, cardDescription=Router 5000 NIC, cardSerial=5000NIC2]], deviceSerial=5000PRIMARY, deviceDescription=Router 5000 Base Model]
DeviceDetail [baseProductId=router-100, cardDetail=[CardDetail [baseCardId=router-100NIC1, cardDescription=Router 100 NIC, cardSerial=100NIC1], CardDetail [baseCardId=router-100NIC2, cardDescription=Router 100 NIC, cardSerial=100NIC2]], deviceSerial=100PRIMARY, deviceDescription=Router 100 Base Model]
* UPDATE: I forgot to mention in the initial post that I also need each card's PARENT serial (deviceSerial) listed with it. *
What I would like the output.txt to look like is something like this:
"router-5000NIC1","Router 5000 NIC","5000NIC1","5000PRIMARY"
"router-5000NIC2","Router 5000 NIC","5000NIC2","5000PRIMARY"
"router-100NIC1","Router 100 NIC","100NIC1","100PRIMARY"
"router-100NIC2","Router 100 NIC","100NIC2","100PRIMARY"
The number of occurrences of CardDetail on a single line could vary from zero to hundreds, depending on the device. I need to extract all of the data, field by field, between each occurrence of CardDetail and the next occurrence of ], and move each match onto its own line in CSV format.
If you have gawk or mawk available, you can do this by (mis)using the record and field splitting capabilities:
awk -v RS='CardDetail *\\[' -v FS='[=,]' -v OFS=',' -v q='"' '
NR > 1 { sub("\\].*", ""); print q $2 q, q $4 q, q $6 q }'
Output:
"router-5000NIC1","Router 5000 NIC","5000NIC1"
"router-5000NIC2","Router 5000 NIC","5000NIC2"
"router-100NIC1","Router 100 NIC","100NIC1"
"router-100NIC2","Router 100 NIC","100NIC2"
Is it sufficient?
$> grep -P -o "(?<=CardDetail).*?(?=\])" input.txt | grep -P -o "(?<=\=).*?(?=\,)"
router-5000NIC1
Router 5000 NIC
router-5000NIC2
Router 5000 NIC
router-100NIC1
Router 100 NIC
router-100NIC2
Router 100 NIC
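A possible tweak, since that second grep stops at a comma and therefore drops the final cardSerial field: match everything after each = up to the next comma (or the end of the extracted chunk) instead, e.g.
$> grep -P -o "(?<=CardDetail).*?(?=\])" input.txt | grep -P -o "(?<=\=)[^,]*"
This still leaves the job of regrouping three lines per card into one CSV row, and it does not add the parent serial.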
Here is an example that uses regular expressions (it relies on GNU awk's three-argument match()). If there are minor variations in the text format, it will handle them. It also collects all the values in an array, so you could do further processing (sort values, remove duplicates, etc.) if you wish.
#!/usr/bin/awk -f
BEGIN {
i_result = 0
DQUOTE = "\""
}
{
line = $0
for (;;)
{
i = match(line, /CardDetail \[ *([^]]*) *\]/, a)
if (0 == i)
break
# a[1] has the text from the brackets
s = a[1]
# replace from this: a, b, c to this: "a","b","c"
gsub(/ *, */, "\",\"", s)
s = DQUOTE s DQUOTE
results[i_result++] = s
line = substr(line, RSTART + RLENGTH - 1)
}
}
END {
for (i = 0; i < i_result; ++i)
print results[i]
}
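Assuming the script above is saved as, say, cards.awk (an arbitrary name), run it with GNU awk, since the three-argument match() it uses is a gawk extension:
gawk -f cards.awk input.txt > output.txt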
P.S. Just for fun I made a Python version.
#!/usr/bin/python
import re
import sys
DQUOTE = "\""
pat_card = re.compile("CardDetail \[ *([^]]*) *\]")
pat_comma = re.compile(" *, *")
results = []
def collect_cards(line, results):
while True:
m = re.search(pat_card, line)
if not m:
return
len_matched = len(m.group(0))
s = m.group(1)
s = DQUOTE + re.sub(pat_comma, '","', s) + DQUOTE
results.append(s)
line = line[len_matched:]
if __name__ == "__main__":
for line in sys.stdin:
collect_cards(line, results)
for card in results:
print card
EDIT: Here's a new version that also looks for "deviceSerial" (the parent serial mentioned in the update) and puts the matched text as the first field.
In AWK you concatenate strings just by putting them next to each other in an expression; there is an implicit concatenation operator when two strings are side by side. So this gets the deviceSerial text into a variable called s0, using concatenation to put double quotes around it, and later uses concatenation to put s0 at the start of the matched string.
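As a stand-alone illustration of that implicit concatenation (not part of the solution):
awk 'BEGIN { DQUOTE = "\""; x = "5000PRIMARY"; s0 = DQUOTE x DQUOTE; print s0 }'
# prints: "5000PRIMARY"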
#!/usr/bin/awk -f
BEGIN {
i_result = 0
DQUOTE = "\""
COMMA = ","
}
{
line = $0
for (;;)
{
i = match(line, /deviceSerial=([A-Za-z_0-9]*),/, a)
s0 = DQUOTE a[1] DQUOTE
i = match(line, /CardDetail \[ *([^]]*) *\]/, a)
if (0 == i)
break
# a[1] has the text from the brackets
s = a[1]
# strip the key names, from this: foo=a, bar=b, other=c to this: a, b, c
gsub(/[A-Za-z_][^=,]*=/, "", s)
# replace from this: a, b, c to this: "a","b","c"
gsub(/ *, */, "\",\"", s)
s = s0 COMMA DQUOTE s DQUOTE
results[i_result++] = s
line = substr(line, RSTART + RLENGTH - 1)
}
}
END {
for (i = 0; i < i_result; ++i)
print results[i]
}
Try this
#awk -f myawk.sh temp.txt
BEGIN { RS="CardDetail"; FS="[=,]"; OFS=","; print "Begin Processing "}
$0 ~ /baseCardId/ {gsub("]","",$0);print $2, $4 , $6}
END {print "Process Complete"}
I have large files that each store results from very long calculations. Here's an example of a file where there are results for five time steps; there are problems with the output at the third, fourth, and fifth time steps.
(Please note that I have been lazy and have used the same numbers to represent the results at each time step in my example. In reality, the numbers would be unique at each time step.)
3
i = 1, time = 1.000, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709
O 23.8223149199 7.2029442290 0.4030956770
3
i = 2, time = 1.500, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709
O 23.8223149199 7.2029442290 0.4030956770
3
i = 3, time = 2.000, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709
O 23.8223149199 (<--Problem: calculation stopped and some numbers are missing)
3
i = 4, time = 2.500, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709 (Problem: calculation stopped and entire row is missing below)
3
i = 5, time = 3.000, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709
O 23.8223149199 7.2029442290 0.4030956770 sdffs (<--Problem: rarely, additional characters can be printed but I figured out how to identify the longest lines in the file and don't have this problem this time)
The problem is that the calculations can fail (and then need to be restarted) while a result is being written to the file. That means that when I try to use the results, I have problems.
My question is: how can I find out when something has gone wrong and the results file has been messed up? The most common problem is that there are not "3" lines of results (plus the header, which is the line containing i = ...). If I could find a problem line, I could then delete that time step.
Here is an example of error output I get when trying to use a messed-up file:
Traceback (most recent call last):
File "/mtn/storage/software/languages/anaconda/Anaconda3-2018.12/lib/python3.7/site-packages/aser/io/extxyz.py", line 593, in read_xyz
nentss = int(line)
ValueError: invalid literal for int() with base 10: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "pythonPostProcessingCode.py", line 25, in <module>
path = read('%s%s' % (filename, fileext) , format='xyz', index=':') # <--This line tells me that Python cannot read in a particular time step because the formatting is messed up.
I am not experienced with scripting/Awk, etc, so if anyone thinks I have not used appropriate question tags, a heads-up would be welcome. Thank you.
The header plus the 330 data lines mean 331 lines of text per record, and so:
awk 'BEGIN { RS="i =" } { split($0,bits,"\n");if (length(bits)-1==331) { print RS$0 } }' file > newfile
Explanation:
awk 'BEGIN {
RS="i ="
}
{
split($0,bits,"\n");
if (length(bits)-1==331) {
print RS$0
}
}' file > newfile
Before processing any lines from the file called file, set the record separator to "i =". Then, for each record, use split to split the record ($0) into an array bits, using a newline as the separator. Where the length of the array bits, less 1, is 331, print the record separator plus the record, redirecting the output to a new file called newfile.
It sounds like this is what you want:
$ cat tst.awk
/^ i =/ {
prt()
expNumLines = prev + 1
actNumLines = 2
rec = prev RS $0
next
}
NF == 4 {
rec = rec RS $0
actNumLines++
}
{ prev = $0 }
END { prt() }
function prt() {
if ( (actNumLines == expNumLines) && (rec != "") ) {
print "-------------"
print rec
}
}
$ awk -f tst.awk file
-------------
3
i = 3, time = 2.000, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709
-------------
3
i = 5, time = 3.000, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709
Just change the prt() function to do whatever it is you want to do with valid records.
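For example, a minimal variation of prt() (just a sketch; valid_records.txt is an arbitrary file name) that writes each kept record to a file instead of printing it with a separator line:
function prt() {
    if ( (actNumLines == expNumLines) && (rec != "") ) {
        print rec > "valid_records.txt"
    }
}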
This answer is not really bash-related, but may be of interest if performance is an issue, since you seem to handle very large files.
Considering that you can compile some very basic C programs, you may build this code:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
// Constants are hardcoded to make the program more readable
// But they could be passed as program argument
const char separator[]="i =";
const unsigned int requiredlines=331;
int main(void) {
char* buffer[331] = { NULL, };
ssize_t buffersizes[331] = { 0, };
size_t n = requiredlines+1; // Ignore lines until the separator is found
char* line = NULL;
size_t len = 0;
ssize_t nbread;
size_t i;
// Iterate through all lines
while ((nbread = getline(&line, &len, stdin)) != -1) {
// If the separator is found:
// - print the record (if valid)
// - reset the record (always)
if (strstr(line, separator)) {
if (n == requiredlines) {
for (i = 0 ; i < requiredlines ; ++i) printf("%s", buffer[i]);
}
n = 0;
}
// Add the line to the buffer, unless too many lines have been read
// (in which case we may discard lines until the separator is found again)
if (n < requiredlines) {
if (buffersizes[n] > nbread) {
strncpy(buffer[n], line, nbread);
buffer[n][nbread] = '\0';
} else {
free(buffer[n]);
buffer[n] = line;
buffersizes[n] = nbread+1;
line = NULL;
len = 0;
}
}
++n;
}
// Don't forget about the last record, if valid
if (n == requiredlines) {
for (i = 0 ; i < requiredlines ; ++i) printf("%s", buffer[i]);
}
free(line);
for (i = 0 ; i < requiredlines ; ++i) free(buffer[i]);
return 0;
}
The program can be compiled like this:
gcc -c prog.c && gcc -o prog prog.o
Then it may be executed like this:
./prog < infile > outfile
To simplify the code, it reads from stdin and outputs to stdout, but that’s more than enough in Bash considering all the options at your disposal to redirect streams. If need be, the code can be adapted to read/write directly from/to files.
I have tested it on a generated file with 10 million lines and compared it to the awk-based solution.
(time awk 'BEGIN { RS="i =" } { split($0,bits,"\n");if (length(bits)-1==331) { printf "%s",RS$0 } }' infile) > outfile
real 0m24.655s
user 0m24.357s
sys 0m0.279s
(time ./prog < infile) > outfile
real 0m1.414s
user 0m1.291s
sys 0m0.121s
With this example it runs approximately 18 times faster than the awk solution. Your mileage may vary (different data, different hardware) but I guess it should always be significantly faster.
I should mention that the awk solution is impressively fast (for a scripted solution, that is). I first tried to code the solution in C++, and it had performance similar to awk's; sometimes it was even slower.
It's a little more difficult to spread the match of a record across two lines to incorporate the i = ... header, but I don't think you actually need to. It looks like a new record can be distinguished by the occurrence of a line with only one column. If that is the case, you could do something like:
awk -v c=330 -v p=1 'function pr(n) {
if( n - p == c) printf "%s", buf;
buf = ""}
NF == 1 { pr(NR); p = NR; c = $1 }
{buf = sprintf("%s%s\n", buf, $0)}
END {pr(NR+1)}' input-file
In the above, whenever a line with a single column is seen, the expectation is that that many lines will make up the following record. If that number is not matched, the record is not printed. To avoid that logic, just remove the c = $1 near the end of line 4 of the script. The only reason you need -v c=330 is to enable the removal of that assignment; if you want the single-column line to be the line count of the record, you can omit -v c=330.
You can use csplit to grab each record in separate files, if you are allowed to write intermediate files.
csplit -k infile '/i = /' {*}
Then you can see which records are complete and which ones are not, using wc -l xx* (note: xx is the default prefix of the split files).
Then you can do whatever you want with those records, including listing files that have exactly 331 lines:
wc -l xx* | sed -n 's/^ *331 \(xx.*\)/\1/p'
Should you want to build a new file with all valid records, simply concatenate them:
wc -l xx* | sed -n 's/^ *331 \(xx.*\)/\1/p' | xargs cat > newfile
You can also, among other uses, archive failed records:
wc -l xx* | sed -e '$d' -e '/^ *331 \(xx.*\)/d' | xargs cat > failures
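Once you are done, the intermediate files produced by csplit can be removed, for example:
rm xx*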
I have a csv file named "ranges.csv", which contains:
start_range,stop_range
9702220000,9702220999
9702222000,9702222999
9702223000,9702223999
9750000000,9750000999
9750001000,9750001999
9750002000,9750002999
I am trying to combine ranges where one range's stop_range is the next range's start_range minus 1, and output the result in another csv file named "ranges2.csv". So the output will be:
9702220000,9702220999
9702222000,9702223999
9750000000,9750002999
Moreover, I need to know how many original ranges each compressed range contains (example: for the new range 9750000000,9750002999 I need to know that before the compression there were 3 ranges). This information will help me create a new csv file named "ranges3.csv", which should contain only the range with the most ranges inside it (the most comprehensive area):
9750000000,9750002999
I was thinking about something like this:
if (stop_range = start_range-1)
new_stop_range = start_range-1
But I am not very smart and I am new to bash scripting.
I know how to output the results to another file, but the logic I need gives me headaches.
I think this does the trick:
#!/bin/bash
awk '
BEGIN { FS = OFS = ","}
NR == 2 {
start = $1; stop = $2; i = 1
}
NR > 2 {
if ($1 == (stop + 1)) {
i++;
stop = $2
} else {
if (++i > max) {
maxr = start "," stop;
max = i
}
start = $1
i = 0
}
stop = $2
}
END {
if (++i > max) {
maxr = start "," stop;
}
print maxr
}
' ranges.csv
Assuming your ranges are sorted, this code gives you the merged ranges only (see the note after it if they are not already sorted):
awk 'BEGIN{FS=OFS=","}
(FNR>1) && ($1!=e+1){print b,e; b=e="" }
($1==e+1){ e=$2; next }
{ b=$1; e=$2 }
END { print b,e }' file
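If ranges.csv is not already sorted by start_range, a possible pre-step (which also drops the header line so it does not leak into the output; sorted_ranges.csv is just a scratch name) would be:
tail -n +2 ranges.csv | sort -t, -k1,1n > sorted_ranges.csv
Then feed sorted_ranges.csv to the scripts here instead of file.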
Below you get the same but with the range count:
awk 'BEGIN{FS=OFS=","}
(FNR>1) && ($1!=e+1){print b,e,c; b=e=c="" }
($1==e+1){ e=$2; c++; next }
{ b=$1; e=$2; c=1 }
END { print b,e,c }' file
If you want the largest one, you can sort on the third column, as shown below. I don't want to hard-code a rule that returns only the range with the most counts, as there might be more than one.
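For example, one possible pipeline (reusing the counting script above and assuming its three-column output) to keep a single widest range for ranges3.csv:
awk 'BEGIN{FS=OFS=","}
(FNR>1) && ($1!=e+1){print b,e,c; b=e=c="" }
($1==e+1){ e=$2; c++; next }
{ b=$1; e=$2; c=1 }
END { print b,e,c }' file |
sort -t, -k3,3nr | head -n 1 | cut -d, -f1,2 > ranges3.csv
Note that this keeps just one range even if several tie for the maximum; the variant below handles ties.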
If you really only want all the ranges with the maximum merge:
awk 'BEGIN{FS=OFS=","}
(FNR>1) && ($1!=e+1){
a[c] = a[c] (a[c]?ORS:"") b OFS e
m=(c>m?c:m)
b=e=c=""
}
($1==e+1){ e=$2; c++; next }
{ b=$1; e=$2; c=1 }
END { a[c] = a[c] (a[c]?ORS:"") b OFS e
m=(c>m?c:m)
print a[m]
}' file
I have a file, let's say files_190911.csv, whose contents are as follows.
EDR_MPU023_09_20190911080534.csv.gz
EDR_MPU023_10_20190911081301.csv.gz
EDR_MPU023_11_20190911083544.csv.gz
EDR_MPU023_14_20190911091405.csv.gz
EDR_MPU023_15_20190911105513.csv.gz
EDR_MPU023_16_20190911105911.csv.gz
EDR_MPU024_50_20190911235332.csv.gz
EDR_MPU024_51_20190911235400.csv.gz
EDR_MPU024_52_20190911235501.csv.gz
EDR_MPU024_54_20190911235805.csv.gz
EDR_MPU024_55_20190911235937.csv.gz
EDR_MPU025_24_20190911000050.csv.gz
EDR_MPU025_25_20190911000155.csv.gz
EDR_MPU025_26_20190911000302.csv.gz
EDR_MPU025_29_20190911000624.csv.gz
I want to make a list of the missing sequence numbers from those using a bash script.
Every MPUXXX has its own sequence. So there are multiple series of sequences in that file.
The datetime for each missing entry should be taken from the previous sequence number.
From the sample above, the result will be like this.
EDR_MPU023_12_20190911083544.csv.gz
EDR_MPU023_13_20190911083544.csv.gz
EDR_MPU024_53_20190911235501.csv.gz
EDR_MPU025_27_20190911000302.csv.gz
EDR_MPU025_28_20190911000302.csv.gz
It would be simpler if there were only a single sequence.
So I can use something like this.
awk '{for(i=p+1; i<$1; i++) print i} {p=$1}'
But I know this can't be used for multiple sequences.
EDITED (Thanks @Cyrus!)
AWK is your friend:
#!/usr/bin/awk -f
BEGIN {
FS="[^0-9]*"
last_seq = 0;
next_serial = 0;
}
{
cur_seq = $2;
cur_serial = $3;
if (cur_seq != last_seq) {
last_seq = cur_seq;
ts = $4
prev = cur_serial;
} else {
if (cur_serial == next_serial) {
ts = $4;
} else {
for (i = next_serial; i < cur_serial; i++) {
print "EDR_MPU" last_seq "_" i "_" ts ".csv.gz"
}
}
}
next_serial = cur_serial + 1;
}
And then you do:
$ < files_190911.csv awk -f script.awk
EDR_MPU023_12_20190911083544.csv.gz
EDR_MPU023_13_20190911083544.csv.gz
EDR_MPU024_53_20190911235501.csv.gz
EDR_MPU025_27_20190911000302.csv.gz
EDR_MPU025_28_20190911000302.csv.gz
The assignment to FS splits each line on the runs of non-digit characters, so the fields are the numeric parts of the file name. The rest of the program detects holes in the sequences and prints them with the appropriate timestamp.
I have a file with delimited integers, which I've extracted from elsewhere and dumped into the file. Some lines contain a range, as per the below:
Files 1,2,3,4,5,6,7,8,9,10 are OK
Users 1,2,3-9,10 have problems
Cars 1-5,5-10 are in the depot
Trains 1-10 are on time
Is there any way to expand the ranges in the text file so that each individual number is returned, with the , delimiter preserved? The text on either side of the integers could be anything, and I need it preserved.
Files 1,2,3,4,5,6,7,8,9,10 are OK
Users 1,2,3,4,5,6,7,8,9,10 have problems
Cars 1,2,3,4,5,6,7,8,9,10 are in the depot
Trains 1,2,3,4,5,6,7,8,9,10 are on time
I guess this can be done relatively easily with awk or any other scripting language. Any help is very much appreciated.
You haven't tagged the question with perl, but I'd recommend it in this case:
perl -pe 's/(\d+)-(\d+)/join(",", $1..$2)/ge' file
This substitutes all occurrences of one or more digits, followed by a hyphen, followed by one or more digits. It uses the numbers it has captured to create a list from the first number to the second and joins the list on a comma.
The e modifier is needed here so that an expression can be evaluated in the replacement part of the substitution.
To avoid repeated values and to sort the list, things get a little more complicated. At this point, I'd recommend using a script, rather than a one-liner:
use strict;
use warnings;
use List::MoreUtils qw(uniq);
while (<>) {
s/(\d+)-(\d+)/join(",", $1..$2)/ge;
if (/(.*\s)((\d+,)+\d+)(.*)/) {
my @list = sort { $a <=> $b } uniq split(",", $2);
$_ = $1 . join(",", @list) . $4 . "\n";
}
} continue {
print;
}
After expanding the ranges (like in the one-liner), I've re-parsed the line to extract the list of values. I've used uniq from List::MoreUtils to remove any duplicates and sorted the values.
Call the script like perl script.pl file.
A solution using GNU awk (it relies on the four-argument split() and on asort(), both gawk extensions):
{
result = "";
count = split($0, fields, /[ ,-]+/, seps);
for (i = 1; i <= count; i++) {
if (fields[i] ~ /[0-9]+/) {
if (seps[i] == ",") {
numbers[fields[i]] = fields[i];
} else if (seps[i] == "-") {
for (j = fields[i]; j <= fields[i+1]; j++) {
numbers[j] = j;
}
} else if (seps[i] == " ") {
numbers[fields[i]] = fields[i];
c = asort(numbers);
for (r = 1; r < c; r++) {
result = result numbers[r] ",";
}
result = result numbers[c] " ";
delete numbers; # clear the collected numbers so they do not leak into the next list
}
} else {
result = result fields[i] seps[i];
}
}
print result;
}
$ cat tst.awk
match($0,/[0-9,-]+/) {
split(substr($0,RSTART,RLENGTH),numsIn,/,/)
numsOut = ""
delete seen
for (i=1;i in numsIn;i++) {
n = split(numsIn[i],range,/-/)
for (j=range[1]; j<=range[n]; j++) {
if ( !seen[j]++ ) {
numsOut = (numsOut=="" ? "" : numsOut ",") j
}
}
}
print substr($0,1,RSTART-1) numsOut substr($0,RSTART+RLENGTH)
}
$ awk -f tst.awk file
Files 1,2,3,4,5,6,7,8,9,10 are OK
Users 1,2,3,4,5,6,7,8,9,10 have problems
Cars 1,2,3,4,5,6,7,8,9,10 are in the depot
Trains 1,2,3,4,5,6,7,8,9,10 are on time
another awk
$ awk '{while(match($0, /[0-9]+-[0-9]+/))
{k=substr($0, RSTART, RLENGTH);
split(k,a,"-");
f=a[1];
for(j=a[1]+1; j<=a[2]; j++) f=f","j;
sub(k,f)}}1' file
Files 1,2,3,4,5,6,7,8,9,10 are OK
Users 1,2,3,4,5,6,7,8,9,10 have problems
Cars 1,2,3,4,5,5,6,7,8,9,10 are in the depot
Trains 1,2,3,4,5,6,7,8,9,10 are on time
Note that Cars 1-5,5-10 ends up with two 5 values when expanded, due to the overlapping ranges.
I would like to transpose a list of items (key/value pairs) into a table format. The solution can be a bash script, awk, sed, or some other method.
Suppose I have a long list, such as this:
date and time: 2013-02-21 18:18 PM
file size: 1283483 bytes
key1: value
key2: value
date and time: 2013-02-21 18:19 PM
file size: 1283493 bytes
key2: value
...
I would like to transpose into a table format with tab or some other separator to look like this:
date and time file size key1 key2
2013-02-21 18:18 PM 1283483 bytes value value
2013-02-21 18:19 PM 1283493 bytes value
...
or like this:
date and time|file size|key1|key2
2013-02-21 18:18 PM|1283483 bytes|value|value
2013-02-21 18:19 PM|1283493 bytes||value
...
I have looked at solutions such as this: An efficient way to transpose a file in Bash, but it seems like I have a different case here. The awk solution there works only partially for me: it keeps outputting all the rows as one long list of columns, but I need the columns to be constrained to a unique list.
awk -F': ' '
{
for (i=1; i<=NF; i++) {
a[NR,i] = $i
}
}
NF>p { p = NF }
END {
for(j=1; j<=p; j++) {
str=a[1,j]
for(i=2; i<=NR; i++){
str=str" "a[i,j];
}
print str
}
}' filename
UPDATE
Thanks to all of you who provided your solutions. Some of them look very promising, but I think my versions of the tools might be outdated and I am getting some syntax errors. What I see now is that I did not start off with very clear requirements. Kudos to sputnick for being the first one to offer a solution before I spelled out the full requirements. I had had a long day when I wrote the question, and thus it was not very clear.
My goal is to come up with a very generic solution for parsing multiple lists of items into column format. I am thinking the solution does not need to support more than 255 columns. Column names are not going to be known ahead of time, so the solution will work for anyone, not just me. The two known things are the separator between key/value pairs (": ") and the separator between lists (an empty line). It would be nice to have variables for those, so that they are configurable for others reusing this.
From looking at the proposed solutions, I realize that a good approach is to do two passes over the input file. The first pass gathers all the column names, optionally sorts them, and prints the header. The second grabs the values of the columns and prints them; a sketch of that idea follows.
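Here is a rough sketch of that two-pass idea (GNU awk assumed, for length() on an array and delete of a whole array; the ": " and blank-line separators are hard-coded, and the file name is passed twice):
awk -F': ' '
NR == FNR {                       # pass 1: collect column names in first-seen order
    if (NF == 2 && !($1 in seen)) { seen[$1] = 1; cols[++n] = $1 }
    next
}
FNR == 1 {                        # pass 2: print the header once
    for (i = 1; i <= n; i++) printf "%s%s", cols[i], (i < n ? "|" : "\n")
}
NF == 2 { row[$1] = $2 }          # accumulate one record
NF == 0 { flush() }               # a blank line ends a record
END     { flush() }
function flush(   i) {
    if (!length(row)) return
    for (i = 1; i <= n; i++) printf "%s%s", row[cols[i]], (i < n ? "|" : "\n")
    delete row
}' file file
Swap the collection step for asorti() if sorted headers are preferred over first-seen order.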
Here's one way using GNU awk. Run like:
awk -f script.awk file
Contents of script.awk:
BEGIN {
# change this to OFS="\t" for tab delimited output
OFS="|"
# treat each record as a set of lines
RS=""
FS="\n"
}
{
# keep a count of the records
++i
# loop through each line in the record
for (j=1;j<=NF;j++) {
# split each line in two
split($j,a,": ")
# just holders for the first two lines in the record
if (j==1) { date = a[1] }
if (j==2) { size = a[1] }
# keep a tally of the unique key names
if (j>=3) { !x[a[1]] }
# the data in a multidimensional array:
# record number . key = value
b[i][a[1]]=a[2]
}
}
END {
# sort the unique keys
m = asorti(x,y)
# add the two strings to a numerically indexed array
c[1] = date
c[2] = size
# set a variable to continue from
f=2
# loop through the sorted array of unique keys
for (j=1;j<=m;j++) {
# build the header line from the file by adding the sorted keys
r = (r ? r : date OFS size) OFS y[j]
# continue to add the sorted keys to the numerically indexed array
c[++f] = y[j]
}
# print the header and empty the line buffer
print r
r = ""
# loop through the records ('i' is the number of records)
for (j=1;j<=i;j++) {
# loop through the subrecords ('f' is the number of unique keys)
for (k=1;k<=f;k++) {
# build the output line
r = (r ? r OFS : "") b[j][c[k]]
}
# and print and empty it ready for the next record
print r
r = ""
}
}
Here's the contents of a test file, called file:
date and time: 2013-02-21 18:18 PM
file size: 1283483 bytes
key1: value1
key2: value2
date and time: 2013-02-21 18:19 PM
file size: 1283493 bytes
key2: value2
key1: value1
key3: value3
date and time: 2013-02-21 18:20 PM
file size: 1283494 bytes
key3: value3
key4: value4
date and time: 2013-02-21 18:21 PM
file size: 1283495 bytes
key5: value5
key6: value6
Results:
2013-02-21 18:18 PM|1283483 bytes|value1|value2||||
2013-02-21 18:19 PM|1283493 bytes|value1|value2|value3|||
2013-02-21 18:20 PM|1283494 bytes|||value3|value4||
2013-02-21 18:21 PM|1283495 bytes|||||value5|value6
Here's a pure awk solution:
# split lines on ": " and use "|" for output field separator
BEGIN { FS = ": "; i = 0; h = 0; ofs = "|" }
# empty line - increment item count and skip it
/^\s*$/ { i++ ; next }
# normal line - add the item to the object and the header to the header list
# and keep track of first seen order of headers
{
current[i, $1] = $2
if (!($1 in headers)) {headers_ordered[h++] = $1}
headers[$1]
}
END {
h--
# print headers
for (k = 0; k <= h; k++)
{
printf "%s", headers_ordered[k]
if (k != h) {printf "%s", ofs}
}
print ""
# print the items for each object
for (j = 0; j <= i; j++)
{
for (k = 0; k <= h; k++)
{
printf "%s", current[j, headers_ordered[k]]
if (k != h) {printf "%s", ofs}
}
print ""
}
}
Example input (note that there should be a newline after the last item):
foo: bar
foo2: bar2
foo1: bar
foo: bar3
foo3: bar3
foo2: bar3
Example output:
foo|foo2|foo1|foo3
bar|bar2|bar|
bar3|bar3||bar3
Note: you will probably need to alter this if your data has ": " embedded in it.
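For instance, one way (a sketch) to split only on the first ": " so that the values themselves may contain ": " would be to use index() and substr() instead of FS:
awk '{
    idx = index($0, ": ")
    if (idx == 0) next                  # not a key/value line (e.g. the blank separator)
    key   = substr($0, 1, idx - 1)
    value = substr($0, idx + 2)
    print key " -> " value              # replace with whatever per-pair handling is needed
}' file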
This does not make any assumptions about the column structure, so it does not try to order the columns; however, all fields are printed in the same order for all records:
use strict;
use warnings;
my (@db, %f, %fields);
my $counter = 1;
while (<>) {
my ($field, $value) = (/([^:]*):\s*(.*)\s*$/);
if (not defined $field) {
push @db, { %f };
%f = ();
} else {
$f{$field} = $value;
$fields{$field} = $counter++ if not defined $fields{$field};
}
}
push @db, \%f;
#my @fields = sort keys %fields; # alphabetical order
my @fields = sort { $fields{$a} <=> $fields{$b} } keys %fields; # first-seen order
# print header
print join("|", @fields), "\n";
# print rows
for my $row (@db) {
print join("|", map { $row->{$_} ? $row->{$_} : "" } #fields), "\n";
}
Using perl
use strict; use warnings;
# read the file paragraph by paragraph
$/ = "\n\n";
print "date and time|file size|key1|key2\n";
# parsing the whole file with the magic diamond operator
while (<>) {
if (/^date and time:\s+(.*)/m) {
print "$1|";
}
if (/^file size:(.*)/m) {
print "$1|";
}
if (/^key1:(.*)/m) {
print "$1|";
}
else {
print "|";
}
if (/^key2:(.*)/m) {
print "$1\n";
}
else {
print "\n";
}
}
Usage
perl script.pl file
Output
date and time|file size|key1|key2
2013-02-21 18:18 PM| 1283483 bytes| value| value
2013-02-21 18:19 PM| 1283493 bytes|| value
For simple alignment, column -t can also lay out whitespace-separated output as a table; example:
> ls -aFd * | xargs -L 5 echo | column -t
bras.tcl# Bras.tpkg/ CctCc.tcl# Cct.cfg consider.tcl#
cvsknown.tcl# docs/ evalCmds.tcl# export/ exported.tcl#
IBras.tcl# lastMinuteRule.tcl# main.tcl# Makefile Makefile.am
Makefile.in makeRule.tcl# predicates.tcl# project.cct sourceDeps.tcl#
tclIndex