Nested getline in AWK script - shell

Please let me know if we can use nested getline within AWK scripts like:
while ( ("tail -f log" |& getline var0) > 0) {
while ( ("ls" | getline ) > 0) {
}
close("ls")
while ( ("date" | getline ) > 0) {
}
close("date")
}
close("tail -f log")
To what depth can getline calls be nested like this? Is there any risk of losing output at any level of the nesting, and what should we make sure of when implementing this style?
UPDATE
Requirement: Provide real-time statistical data and errors by probing the QA box and the webserver / services logs and system status. The report will be generated in the following format:
Local Date And Time | Category | Component | Condition
Assumption: An AWK script would execute faster than a shell script, with the added advantage of AWK's built-in parsing and other functionality.
Implementation: The main command loop is command0="tail -f -n 0 -s 5 ...........". This command starts an infinite loop extracting lines appended to the service / webserver logs of the QA box. Note the -f, -s and -n options, which make tail dump all appended data, sleep for 5 seconds between iterations, and start without printing any existing content from the logs.
After each iteration, capture and verify the system time and execute the various OS resource commands at a 10-second interval (5 seconds of sleep between iterations plus 4 seconds after processing the tail output, assuming that processing the tail output takes roughly 1 second, hence 10 seconds in all).
The commands I have used for extracting OS resources are:
I. command1="vmstat | nl | tr -s '\\t '"
II. command2="sar -W 0"
III. command3="top -b -n 1 | nl | tr -s '\\t '"
IV. command4="ls -1 /tmp | grep EXIT"
Search for the respective command in the script and step through its while loop to see how that command's output is processed. Note that I have used the 'nl' command for ease of development / coding.
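As an aside for anyone unfamiliar with nl: it simply prefixes each input line with its line number, which is what lets the script branch on $1 to tell header lines from data lines. A quick illustration:
$ printf 'first\nsecond\n' | nl
     1  first
     2  second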
Ultimately, the presence of a /tmp/EXIT file on the box makes the script exit, after removing that file from the box.
Below is my script - I have added as many comments as possible to make it self-explanatory:
#Usage - awk -f script.awk
BEGIN {
    command0="tail -f -n 0 -s 5 /x/web/webserver/*/logs/error_log /x/web/webserver/service/*/logs/log"
    command1="vmstat | nl | tr -s '\\t '"
    command2="sar -W 0"
    command3="top -b -n 1 | nl | tr -s '\\t '"
    command4="ls -1 /tmp | grep EXIT"
    format = "%a %b %e %H:%M:%S %Z %Y"
    split("", details)
    split("", fields)
    split("", data)
    split("", values)
    start_time=0
    printf "\n>%s:\n\n", command0   #dummy print for debugging the command being executed
    while ( (command0 |& getline var0) > 0) {   #get the command output
        if (start_time == 0) {   #block to reset the start_time variable
            start_time = systime() + 4
        }
        if (var0 ~ /==>.*<==/) {   #block to extract the file name from the tail output - emitted in '==>FileName<==' format
            gsub(/[=><]/, "", var0)
            len = split(var0, name, "/")
            if (len == 7) { file = name[5] } else { file = name[6] }
        }
        if (len == 7 && var0 ~ /[Ee]rror|[Ee]xception|ORA|[Ff]atal/) {   #extract the error statements from the logs
            print strftime(format,systime()) " | Error Log | " file " | Error :" var0
        }
        if (systime() >= start_time) {   #check whether the current system time has passed start_time as computed above
            start_time = 0   #reset the start_time variable and now execute the system resource commands
            printf "\n>%s:\n\n", command1
            while ( (command1 |& getline) > 0) {   #process output of the first command
                if ($1 <= 1)
                    continue   #not needed for processing, skip this one
                if ($1 == 2) {   #capture the field names and skip to the next line
                    for (i = 1; i <= NF; i++) { fields[$i] = i }
                    continue
                }
                if ($1 == 3)   #store the command output in the data array
                    split($0, data)
                print strftime(format,systime()) " | System Resource | System | Time spent running non-kernel code :" data[fields["us"]]
                print strftime(format,systime()) " | System Resource | System | Time spent running kernel code :" data[fields["sy"]]
                print strftime(format,systime()) " | System Resource | System | Amount of memory swapped in from disk :" data[fields["si"]]
                print strftime(format,systime()) " | System Resource | System | Amount of memory swapped to disk :" data[fields["so"]]
            }
            close(command1)
            printf "\n>%s:\n\n", command2   #start processing the second command
            while ( (command2 |& getline) > 0) {
                if ($4 ~ /[0-9]+[\.][0-9]+/) {   #check whether the 4th field is in "int.intint" format
                    if ($4 > 0.0)   #dummy check for now, to report if pages are being swapped
                        print strftime(format,systime()) " | System Resource | Disk | Page rate is > 0.0 reads/second: " $4
                }
            }
            close(command2)
            printf "\n>%s:\n\n", command3   #start processing the third command
            while ( (command3 |& getline ) > 0) {
                if ($1 == 1 && $0 ~ /load average:/) {   #get the load average from the first line of the output
                    split($0, arr, ",")
                    print strftime(format,systime()) " | System Resource | System |" arr[4]
                }
                if ($1 > 7 && $1 <= 12) {   #print the top 5 processes consuming the most CPU time
                    f = split($0, arr, " ")
                    if (f == 13)
                        print strftime(format,systime()) " | System Resource | System | CPU% " arr[10] " Process No: " arr[1] - 7 " Name: " arr[13]
                }
            }
            close(command3)
            printf "\n>%s:\n\n", command4   #process the fourth command to check for the presence of the EXIT file
            while ( (command4 |& getline var4) > 0) {
                system("rm -rf /tmp/EXIT")
                exit 0   #if the file is there, remove it and exit this script execution
            }
            close(command4)
        }
    }
    close(command0)
}
Output:
>tail -f -n 0 -s 5 /x/web/webserver/*/logs/error_log /x/web/webserver/service/*/logs/log:
>vmstat | nl | tr -s '\t ':
Sun Dec 16 23:05:12 PST 2012 | System Resource | System | Time spent running non-kernel code :9
Sun Dec 16 23:05:12 PST 2012 | System Resource | System | Time spent running kernel code :9
Sun Dec 16 23:05:12 PST 2012 | System Resource | System | Amount of memory swapped in from disk :0
Sun Dec 16 23:05:12 PST 2012 | System Resource | System | Amount of memory swapped to disk :2
>sar -W 0:
Sun Dec 16 23:05:12 PST 2012 | System Resource | Disk | Page rate is > 0.0 reads/second: 3.89
>top -b -n 1 | nl | tr -s '\t ':
Sun Dec 16 23:05:13 PST 2012 | System Resource | System | load average: 3.63
Sun Dec 16 23:05:13 PST 2012 | System Resource | System | CPU% 12.0 Process No: 1 Name: occworker
Sun Dec 16 23:05:13 PST 2012 | System Resource | System | CPU% 10.3 Process No: 2 Name: occworker
Sun Dec 16 23:05:13 PST 2012 | System Resource | System | CPU% 6.9 Process No: 3 Name: caldaemon
Sun Dec 16 23:05:13 PST 2012 | System Resource | System | CPU% 6.9 Process No: 4 Name: occmux
Sun Dec 16 23:05:13 PST 2012 | System Resource | System | CPU% 6.9 Process No: 5 Name: top
>ls -1 /tmp | grep EXIT:

This is your second post that I can recall about using getline this way. I mentioned last time that it was the wrong approach but it looks like you didn't believe me so let me try one more time.
Your question of "how do I use awk to execute commands with getline to read their output?" is like asking "how do I use a drill to cut glass?". You could get an answer telling you to tape over the part of the glass where you'll be drilling to avoid fracturing it and that WOULD answer your question but the more useful answer would probably be - don't do that, use a glass cutter.
Using awk as a shell from which to call commands is 100% the wrong approach. Simply use the right tool for the right job. If you need to parse a text file, use awk. If you need to manipulate files or processes or invoke commands, use shell (or your OS equivalent).
Finally, please read http://awk.freeshell.org/AllAboutGetline and don't even think about using getline until you fully understand all the caveats.
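For reference, if you do end up needing the cmd | getline form somewhere, the minimal defensive shape those caveats boil down to looks roughly like this (a sketch, not an endorsement): test the return value and always close() the exact command string:
BEGIN {
    cmd = "date"
    while ( (ret = (cmd | getline line)) > 0 ) {
        print "got:", line
    }
    if (ret == -1) {
        print "failed reading from \"" cmd "\"" > "/dev/stderr"
    }
    close(cmd)   # without this, a later use of the same command string resumes the old pipe
}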
EDIT: here's a shell script to do what your posted awk script does:
tail -f log |
while IFS= read -r var0; do
    ls
    date
done
Look simpler? Not saying it makes sense to do that, but if you did want to do it, THAT's the way to implement it, not in awk.
EDIT: here's how to write the first part of your awk script in shell (bash in this case), I ran out of enthusiasm for translating the rest of it for you and I think this shows you how to do the rest yourself:
format = "%a %b %e %H:%M:%S %Z %Y"
start_time=0
tail -f -n 0 -s 5 /x/web/webserver/*/logs/error_log /x/web/webserver/service/*/logs/log |
while IFS= read -r line; do
systime=$(date +"%s")
#block to reset the start_time variable
if ((start_time == 0)); then
start_time=(( systime + 4 ))
fi
#block to extract the file name from the tail output - outputted in '==>FileName<==' format
case $var0 in
"==>"*"<==" )
path="${var0%% <==}"
path="${path##==> }"
name=( ${path//\// } )
len="${#name[#]}"
if ((len == 7)); then
file=name[4]
else
file=name[5]
fi
;;
esac
if ((len == 7)); then
case $var0 in
[Ee]rror|[Ee]xception|ORA|[Ff]atal ) #extract the logs error statements
printf "%s | Error Log | %s | Error :%s\n" "$(date +"$format")" "$file" "$var0"
;;
esac
fi
#check if curernt system time is greater than start_time as computed above
if (( systime >= start_time )); then
start_time=0 #reset the start_time variable and now execute the system resource command
....
Note that this would execute slightly faster than your awk script but that absolutely does not matter at all since your tail is taking 5 second breaks between iterations.
Also note that all I'm doing above is translating your awk script into shell, it doesn't necessarily mean it'd be the best way to write this tool from scratch.
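To sketch one more piece of the translation: the command1 (vmstat) block could look roughly like the following in shell. This is only a sketch, assuming bash 4+ for associative arrays; the column names (us, sy) and the line layout are taken from the awk version above:
#process output of the first command (vmstat) - a sketch
lineno=0
declare -A col                      # column name -> position in the header line
while IFS= read -r vline; do
    (( ++lineno ))
    set -- $vline                   # word-split the line into positional parameters
    if (( lineno == 2 )); then      # header line: remember each column's position
        n=0
        for f in "$@"; do col[$f]=$n; n=$((n+1)); done
    elif (( lineno == 3 )); then    # data line: report the columns we care about
        data=( "$@" )
        now=$(date +"$format")
        echo "$now | System Resource | System | Time spent running non-kernel code :${data[${col[us]}]}"
        echo "$now | System Resource | System | Time spent running kernel code :${data[${col[sy]}]}"
    fi
done < <(vmstat)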

Related

How do you get items from txt into presentable table in bash?

I'm trying to retrieve items from Node01.pc and put them in a table.
Example:
echo ${NodeCPU[0]} is able to print the item from the line.
But when I use printf or echo on the array items, the table either breaks or does not display their values.
The formatting of the table seems to work, and it displays correctly as long as the arrays are not involved. Could it be that there's more in the file than I can see?
Node01.pc contains
192.168.0.99
2
70
16
80
4
4
100
4
VS122:NMAD:20:20:1:1
VS122:NAMD:20:20:1:1
RS123:FEM:10:20:1:1
QV999:BEM:20:20:1:1
But I only need lines 3,5,7,9
I'm not sure what the best way to do this is, or if I even need to store the items in arrays.
I thought about retrieving all the text from the text files and making a new file containing all the data, but I'm not sure how to do that.
This is the code that I have right now.
#!/bin/bash
Node01=($(cat Node01.pc))
Node02=($(cat Node02.pc))
Node03=($(cat Node03.pc))
Node04=($(cat Node04.pc))
Node05=($(cat Node05.pc))
NodeCPU=("${Node01[2]}" "${Node02[2]}" "${Node03[2]}" "${Node04[2]}" "${Node05[2]}")
NodeMEM=("${Node01[4]}" "${Node02[4]}" "${Node03[4]}" "${Node04[4]}" "${Node05[4]}")
NodeHDD=("${Node01[6]}" "${Node02[6]}" "${Node03[6]}" "${Node04[6]}" "${Node05[6]}")
NodeNET=("${Node01[8]}" "${Node02[8]}" "${Node03[8]}" "${Node04[8]}" "${Node05[8]}")
seperator=----------------------
seperator=$seperator$seperator
rows="%-10s| %-7s| %-7s| %-7s| %-7s\n"
TableWidth=140
printf "%-10s| %-7s| %-7s| %-7s| %-7s\n" NodeNumber CPU MEM HDD NET
printf "%.${TableWidth}s\n" "$seperator"
for((i=0;i<=4;i++))
do
    printf "$rows" "$(( $i+1 ))" "${NodeCPU[i]}" "${NodeMEM[i]}" "${NodeHDD[i]}" "${NodeNET[i]}"
done
read
This is an example of what I want to display
NodeNumber | CPU | MEM | HDD | NET
----------------------------------
1 | 10 | 20 | 20 | 40
2 | 10 | 20 | 20 | 40
3 | 10 | 20 | 20 | 40
4 | 10 | 20 | 20 | 40
5 | 10 | 20 | 20 | 40
EDIT: This is what I'm currently getting:
NodeNumber| CPU | MEM | HDD | NET
--------------------------------------------
| 4 | 70
| 5 | 90
| 6 | 100
| 6 | 70
| 40 | 40
Issue I'm having is with
printf "$rows" "$(( $i+1 ))" "${NodeCPU[i]}" "${NodeMEM[i]}" "${NodeHDD[i]}" "${NodeNET[i]}"
Why worry about all the separate arrays? Simply loop over all "Node*.pc" files in the current directory, read the contents of each file into an array with readarray, and then output the file count and element nos. 2, 4, 6, 8 of the array in the proper format (adjust the elements output as needed), e.g.
#!/bin/bash
cnt=1 ## file counter
## print heading
printf "NodeNumber | CPU | MEM | HDD | NET\n----------------------------------\n"
for i in Node*.pc; do ## loop over all Node*.pc files in directory
readarray -t node < "$i" ## read contents into array
## output count and elements 2, 4, 6, 8 in proper format
printf "%-11s| %-4s| %-4s| %-4s| %s\n" $((cnt++)) \
"${node[2]}" "${node[4]}" "${node[6]}" "${node[8]}"
done
Example Use/Output
With the example data shown copied to the file Node01.pc in the current directory, you would get:
$ bash node.sh
NodeNumber | CPU | MEM | HDD | NET
----------------------------------
1 | 70 | 80 | 4 | 4
(I called the script node.sh)
It would output the information from each file as separate lines numbered 1, 2, ... Look things over and let me know if this is what you intended. (You can also do the same thing faster with awk by setting FS="\n" and treating the lines as columns in a single record.)
You can do the same thing in awk with:
awk '
BEGIN {
RS=""; FS="\n"
printf "NodeNumber | CPU | MEM | HDD | NET\n----------------------------------\n"
}
NF >= 9 {
printf "%-11s| %-4s| %-4s| %-4s| %s\n",++cnt,$3,$5,$7,$9
}
' Node*.pc
(note: in awk the field numbers are 1-based, while in bash the array indexes are 0-based)
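For example, the CPU value taken from the third line of each file is $3 in the awk version but ${node[2]} in the bash version.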
Output is the same.

bash looping and extracting of the fragment of txt file

I am dealing with the analysis of a big number of dlg text files located within the workdir. Each file has a table (usually located at a different position in the log) in the following format:
File 1:
CLUSTERING HISTOGRAM
____________________
________________________________________________________________________________
     |           |     |           |     |
Clus | Lowest    | Run | Mean      | Num | Histogram
-ter | Binding   |     | Binding   | in  |
Rank | Energy    |     | Energy    | Clus|    5   10   15   20   25   30   35
_____|___________|_____|___________|_____|____:____|____:____|____:____|____:___
   1 |     -5.78 |  11 |     -5.78 |   1 |#
   2 |     -5.53 |  13 |     -5.53 |   1 |#
   3 |     -5.47 |  17 |     -5.44 |   2 |##
   4 |     -5.43 |  20 |     -5.43 |   1 |#
   5 |     -5.26 |  19 |     -5.26 |   1 |#
   6 |     -5.24 |   3 |     -5.24 |   1 |#
   7 |     -5.19 |   4 |     -5.19 |   1 |#
   8 |     -5.14 |  16 |     -5.14 |   1 |#
   9 |     -5.11 |   9 |     -5.11 |   1 |#
  10 |     -5.07 |   1 |     -5.07 |   1 |#
  11 |     -5.05 |  14 |     -5.05 |   1 |#
  12 |     -4.99 |  12 |     -4.99 |   1 |#
  13 |     -4.95 |   8 |     -4.95 |   1 |#
  14 |     -4.93 |   2 |     -4.93 |   1 |#
  15 |     -4.90 |  10 |     -4.90 |   1 |#
  16 |     -4.83 |  15 |     -4.83 |   1 |#
  17 |     -4.82 |   6 |     -4.82 |   1 |#
  18 |     -4.43 |   5 |     -4.43 |   1 |#
  19 |     -4.26 |   7 |     -4.26 |   1 |#
_____|___________|_____|___________|_____|______________________________________
The aim is to loop over all the dlg files and take from each table the single line corresponding to the widest cluster (the one with the biggest number of # marks in the Histogram column). In the above example this is the third line:
   3 |     -5.47 |  17 |     -5.44 |   2 |##
Then I need to add this line to final_log.txt together with the name of the log file (which should be specified before the line). So in the end I should have something in the following format (for 3 different log files):
"Name of the file 1": 3 | -5.47 | 17 | -5.44 | 2 |##
"Name_of_the_file_2": 1 | -5.99 | 13 | -5.98 | 16 |################
"Name_of_the_file_3": 2 | -4.78 | 19 | -4.44 | 3 |###
A possible model of my BASH workflow would be:
#!/bin/bash
for f in ./*.dlg   # assumed loop header: the original snippet used $f below but omitted this line
do
    file_name2=$(basename "$f")
    file_name="${file_name2/.dlg}"
    echo "Processing of $f..."
    # take the name of the file and save it in the log
    echo "$file_name" >> $PWD/final_results.log
    # search for the beginning of the table inside each file and save it after the name
    cat $f | grep 'CLUSTERING HISTOGRAM' >> $PWD/final_results.log
    # check whether it works
    gedit $PWD/final_results.log
done
Here I need to replace the combination of echo and grep with something that takes the selected parts of the table.
You can use this one, expected to be fast enough. Extra lines in your files, besides the tables, are not expected to be a problem.
grep "#$" *.dlg | sort -rk11 | awk '!seen[$1]++'
grep fetches all the histogram lines, which are then sorted in reverse order by the last field, putting the lines with the most # characters on top; finally awk removes the duplicates. Note that when grep is given more than one file, it prints the filename at the beginning of each line by default, so if you test this on a single file, use grep -H.
Result should be like this:
file1.dlg: 3 | -5.47 | 17 | -5.44 | 2 |##########
file2.dlg: 3 | -5.47 | 17 | -5.44 | 2 |####
file3.dlg: 3 | -5.47 | 17 | -5.44 | 2 |#######
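As a quick aside, !seen[$1]++ is the standard awk idiom for keeping only the first line per distinct value of $1 (here, per filename prefix), e.g.:
$ printf 'f1: a\nf1: b\nf2: c\n' | awk '!seen[$1]++'
f1: a
f2: c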
Here is a modification to get the first appearance in case a file contains many equal max lines:
grep "#$" *.dlg | sort -k11 | tac | awk '!seen[$1]++'
We replaced the reverse flag of sort with the tac command, which reverses the stream, so now for any equal lines the initial order is preserved.
Second solution
Here using only awk:
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
END {for (i in row) print i ":" row[i]}' *.dlg
Update: if you execute it from a different directory and want to keep only the basename of every file, remove the path prefix:
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
END {for (i in row) {sub(".*/","",i); print i ":" row[i]}}'
Probably makes more sense as an Awk script.
This picks the first line with the widest histogram in the case of a tie within an input file.
#!/bin/bash
awk 'FNR == 1 { if(sel) print sel; sel = ""; max = 0 }
FNR < 9 { next }
length($10) > max { max = length($10); sel = FILENAME ":" $0 }
END { if (sel) print sel }' ./"$prot"/*.dlg
This assumes the histograms are always the tenth field; if your input format is even messier than the lump you show, maybe adapt to taste.
In some more detail, the first line triggers on the first line of each input file. If we have collected a previous line (meaning this is not the first input file), print that, and start over. Otherwise, initialize for the first input file. Set sel to nothing and max to zero.
The second line skips lines 1-8 which contain the header.
The third line checks if the current line's histogram is longer than max. If it is, update max to this histogram's length, and remember the current line in sel.
The last line is spillover for when we have processed all files. We never printed the sel from the last file, so print that too, if it's set.
If you mean that we should find the lines between CLUSTERING HISTOGRAM and the end of the table, we would need more information about what the surrounding lines look like. Maybe something like this, though:
awk '/CLUSTERING HISTOGRAM/ { if (sel) print sel; looking = 1; sel = ""; max = 0 }
    !looking { next }
    looking > 1 && $1 != looking { looking = 0; nextfile }
    $1 == looking && length($10) > max { max = length($10); sel = FILENAME ":" $0 }
    $1 == looking { ++looking }   # count through the numbered table rows
    END { if (sel) print sel }' ./"$prot"/*.dlg
This sets looking to 1 when we see CLUSTERING HISTOGRAM, then counts up through the numbered rows; on the first line whose rank no longer matches the count, the table has ended and we move on to the next file.
I would suggest processing using awk:
for i in $FILES
do
    echo -n "\"$i\": "
    awk 'BEGIN {
        output=""
        outputlength=0
    }
    /(^ *[0-9]+)/ {   # process only lines that start with a number
        if (length(substr($10, 2)) > outputlength) {   # if the line has more hashes, store it
            output=$0
            outputlength=length(substr($10, 2))
        }
    }
    END {
        print output   # output the resulting line
    }' "$i"
done

How can I process date strings in bash?

Does anyone have any idea how I could process input like this with bash? I would like to convert absolute time to relative time. My approach works but is VERY messy. Can anyone do better? Is there a cleaner way to do this?
Input:
| 2020-08-01 15:35:47.446 | message 1 |
| 2020-08-01 15:35:48.446 | hi these |
| 2020-08-01 15:31:47.446 | do stuff now! |
Output: Shows the time difference in milliseconds
0 message 1
1000 hi these
60000 do stuff now!
Working (very dirty) approach:
while read line;
do echo $(echo "$(echo "$line" | cut -d' ' -f3 | cut -d':' -f2 | head -1) * 60000 + $(echo "$line" | cut -d' ' -f3 | cut -d':' -f3 | head -1) * 1000 - $baseval" | bc) $(echo "$line" | cut -d'|' -f3) ;
done < file.log
Looks like the question asks to convert a series of absolute timestamps to relative timestamps, using 'baseval' as the zero point in time.
It is possible to use the date command (with '+%s' to get seconds past the epoch) to simplify the calculation. If the file has many lines, this solution might not be ideal, as it spawns a 'date' process for each line.
Worth noting, some of the complexity is in parsing the input format - a combination of fixed and delimited columns. The code uses bash's IFS to split each line into components.
#! /bin/bash
baseval=0
function relative_time_ms {
    # Convert input into two tokens - seconds past the epoch + nanoseconds
    local dd=($(date '+%s %N' -d "$1"))
    echo $((dd[0]*1000 + dd[1]/1000000 - baseval))
}
while IFS='|' read -r x ts msg ; do
    if ((baseval == 0)); then
        baseval=$(relative_time_ms "$ts")   # the first timestamp becomes the zero point
    fi
    rel_time=$(relative_time_ms "$ts")
    echo "$rel_time | $msg"
done < file.log
Output:
0 | message 1
1000 | hi these
-240000 | do stuff now!
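If the per-line date calls ever become a bottleneck, one possible refactor (a sketch, assuming GNU date for the -f option and the %N format) is to convert every timestamp in a single date invocation and pair the results back up with the messages:
# convert all timestamps to milliseconds in one date call, then compute offsets
paste -d'|' <(cut -d'|' -f2 file.log | date -f - '+%s%3N') \
            <(cut -d'|' -f3 file.log) |
while IFS='|' read -r ms msg; do
    : "${baseval:=$ms}"             # the first timestamp becomes the zero point
    echo "$(( ms - baseval )) |$msg"
done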

Bash - Removing empty columns from .csv file

I have a large .csv file in which I have to remove columns which are empty. By empty, I mean that they have a header, but the rest of the column contains no data.
I've written a Bash script to try and do this, but am running into a few issues.
Here's the code:
#!/bin/bash
total="$(head -n 1 Reddit-cleaner.csv | grep -o ',' | wc -l)"
i=1
count=0
while [ $i -le $total ]; do
cat Reddit-cleaner.csv | cut -d "," -f$i | while read CMD; do if [ -n CMD ]; then count=$count+1; fi; done
if [ $count -eq 1 ]; then
cut -d "," -f$i --complement <Reddit-cleaner.csv >Reddit-cleanerer.csv
fi
count=0
i=$i+1
done
Firstly I find the number of columns and store it in total. Then, while the program has not reached the last column, I loop through the columns individually. The nested while loop counts the non-empty rows in the column, and if only the header row is non-empty, the script writes all the other columns to another file.
I recognise that there are a few problems with this script. Firstly, the count modification occurs in a subshell, so count is never modified in the parent shell. Secondly, the file I am writing to will be overwritten every time the script finds an empty column.
So my question then is how can I fix this. I initially wanted to have it so that it wrote to a new file column by column, based on count, but couldn't figure out how to get that done either.
Edit: People have asked for a sample input and output.
Sample input:
User, Date, Email, Administrator, Posts, Comments
a, 20201719, a@a.com, Yes, , 3
b, 20182817, b@b.com, No, , 4
c, 20191618, , No, , 4
d, 20190126, , No, , 2
Sample output:
User, Date, Email, Administrator, Comments
a, 20201719, a@a.com, Yes, 3
b, 20182817, b@b.com, No, 4
c, 20191618, , No, 4
d, 20190126, , No, 2
In the sample output, the column which has no data in it except for the header (Posts) has been removed, while the columns which are either entirely or partially filled remain.
I may be misinterpreting the question (due to its lack of example input and expected output), but this should be as simple as:
$ x="1,2,3,,4,field 5,,,six,7"
$ echo "${x//,+(,)/,}"
1,2,3,4,field 5,six,7
This requires bash with extglob enabled (shopt -s extglob). Otherwise, you can use an external call to sed:
$ echo "1,2,3,,4,field 5,,,six,7" |sed 's/,,,*/,/g'
1,2,3,4,field 5,six,7
There's a lot of redundancy in your sample code. You should really consider awk since it already tracks the current field count (as NF) and the number of lines (as NR), so you could add that up with a simple total+=NF on each line. With the empty fields collapsed, awk can just run on the field number you want.
$ echo "1,2,3,,4,field 5,,,six,7" |awk -F ',+' '
{ printf "line %d has %d fields, the 6th of which is <%s>\n", NR, NF, $6 }'
line 1 has 7 fields, the 6th of which is <six>
This uses printf to denote the number of records (NR, the current line number), the number of fields (NF) and the value of the sixth field ($6, can also be as a variable, e.g. $NF is the value of the final field since awk is one-indexed).
It is actually a job for a CSV parser, but you may use this awk script to get the job done:
cat removeEmptyCellsCsv.awk
BEGIN {
    FS = OFS = ", "
}
NR == 1 {
    for (i=1; i<=NF; i++)
        e[i] = 1   # initially all cols are marked empty
    next
}
FNR == NR {
    for (i=1; i<=NF; i++)
        e[i] = e[i] && ($i == "")
    next
}
{
    s = ""
    for (i=1; i<=NF; i++)
        s = s (i==1 || e[i-1] ? "" : OFS) (e[i] ? "" : $i)
    print s
}
Then run it as:
awk -f removeEmptyCellsCsv.awk file.csv{,}
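(The file.csv{,} is just brace expansion for file.csv file.csv: the file is passed twice, so the FNR == NR block gets a first pass to work out which columns are empty before the final block prints on the second pass.)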
Using the sample data provided in the question, it will produce the following output:
User, Date, Email, Administrator, Comments
a, 20201719, a@a.com, Yes, 3
b, 20182817, b@b.com, No, 4
c, 20191618, , No, 4
d, 20190126, , No, 2
Note that the Posts column has been removed because it is empty in every record.
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
if ( NR > 1 ) {
for (i=1; i<=NF; i++) {
if ( $i ~ /[^[:space:]]/ ) {
gotValues[i]
}
}
}
next
}
{
c=0
for (i=1; i<=NF; i++) {
if (i in gotValues) {
printf "%s%s", (c++ ? OFS : ""), $i
}
}
print ""
}
$ awk -f tst.awk file file
User, Date, Email, Administrator, Comments
a, 20201719, a@a.com, Yes, 3
b, 20182817, b@b.com, No, 4
c, 20191618, , No, 4
d, 20190126, , No, 2
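As with the previous script, passing file twice gives the NR==FNR block a first pass to record in gotValues which columns contain any non-blank data; the second pass then prints only those columns.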
See also What's the most robust way to efficiently parse CSV using awk? if you need to work with any more complicated CSVs than the one in your question.
You can use Miller (https://github.com/johnkerl/miller) and its remove-empty-columns verb.
Starting from
+------+----------+---------+---------------+-------+----------+
| User | Date     | Email   | Administrator | Posts | Comments |
+------+----------+---------+---------------+-------+----------+
| a    | 20201719 | a@a.com | Yes           | -     | 3        |
| b    | 20182817 | b@b.com | No            | -     | 4        |
| c    | 20191618 | -       | No            | -     | 4        |
| d    | 20190126 | -       | No            | -     | 2        |
+------+----------+---------+---------------+-------+----------+
and running
mlr --csv remove-empty-columns input.csv >output.csv
you will have
+------+----------+---------+---------------+----------+
| User | Date     | Email   | Administrator | Comments |
+------+----------+---------+---------------+----------+
| a    | 20201719 | a@a.com | Yes           | 3        |
| b    | 20182817 | b@b.com | No            | 4        |
| c    | 20191618 | -       | No            | 4        |
| d    | 20190126 | -       | No            | 2        |
+------+----------+---------+---------------+----------+

Transposing Rows to Columns in Bash

I've been trawling the internet to make sense of how to do what I want to do, but to no avail. I need to transpose the data below....
Caption=C:
Description=Local Fixed Disk
DriveType=3
FreeSpace=41265664000
ProviderName=
Size=146056146944
VolumeName=
Caption=D:
Description=Local Fixed Disk
DriveType=3
FreeSpace=125067259904
ProviderName=
Size=1073738674176
VolumeName=Data
Caption=E:
Description=Removable Disk
DriveType=2
FreeSpace=
ProviderName=
Size=
VolumeName=
To a table layout, like this...
Caption  Description       DriveType  FreeSpace     ProviderName  Size           VolumeName
C:       Local Fixed Disk  3          41265664000                 146056146944
D:       Local Fixed Disk  3          125067259904                1073738674176  Data
E:       Removable Disk    2
This needs to be done in bash. I've been exploring a myriad of awk scripts, but I don't seem to understand the logic behind them that well :|
Any help would be appreciated. Thank you!
with gawk:
awk -v RS="Caption" -F"[=\n\"]" '
NR==2{
printf RS;
for(i=3;i<=NF;i+=2){
printf ":"$i
};
print ""
};
{
for(i=2;i<=NF;i+=2){
if ($i != "" ) printf ":"$i;
else printf ": "
};
print ""}' file | column -s":" -t
You can use other tools to do this, such as Perl or Python.
In Perl, for example, you could use this:
#!/usr/bin/perl -n
use Text::ASCIITable;
next unless /^(\w+)\s*=\s*(.*)$/;
$data{$1} = [] if not $data{$1};
push @{$data{$1}}, $2;
END {
    $t = Text::ASCIITable->new();
    $t->setCols(keys %data);
    for my $i (0..@{$data{(keys %data)[0]}} - 1) {
        $t->addRow(map $data{$_}[$i], keys %data)
    }
    print $t;
}
With your data in data.txt, you can write:
$ ./myscript.pl data.txt
.----------------------------------------------------------------------------------------------------.
| Caption | Size          | DriveType | VolumeName | FreeSpace    | Description      | ProviderName |
+---------+---------------+-----------+------------+--------------+------------------+--------------+
| C:      | 146056146944  | 3         |            | 41265664000  | Local Fixed Disk |              |
| D:      | 1073738674176 | 3         | Data       | 125067259904 | Local Fixed Disk |              |
| E:      |               | 2         |            |              | Removable Disk   |              |
'---------+---------------+-----------+------------+--------------+------------------+--------------'
This script is generic for any number of columns given in any order. However, if the columns are known and always displayed in the same order, the code can be simplified to this:
#!/usr/bin/perl -n
use Text::ASCIITable;
BEGIN {
    @columns = qw/Caption Description DriveType FreeSpace ProviderName Size VolumeName/;
    $t = Text::ASCIITable->new();
    $t->setCols(@columns);
    $re = join "\n", map "$_=(?<$_>.*)", @columns;
    undef $/;
}
$t->addRow(map $+{$_}, @columns) while(/$re/g);
END { print $t; }
Or even in Python if you want:
#!/usr/bin/python
import sys
from terminaltables import AsciiTable
table_data = []
for row in ["Caption" + d for d in sys.stdin.read().split("Caption")[1:]]:
table_data.append([column.split('=')[1] for column in row.split("\n")[:-1]])
columns = [column.split('=')[0] for column in row.split("\n")[:-1]]
table_data.insert(0, columns)
print(AsciiTable(table_data).table)
$ cat tst.awk
BEGIN { FS=OFS="=" }
{ hdr = hdr sep $1; data = data sep $2; sep=OFS }
$1=="VolumeName" { if (!c++) print hdr; print data; data=sep="" }
$ awk -f tst.awk file | column -s= -t
Caption  Description       DriveType  FreeSpace     ProviderName  Size           VolumeName
C:       Local Fixed Disk  3          41265664000                 146056146944
D:       Local Fixed Disk  3          125067259904                1073738674176  Data
E:       Removable Disk    2
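The trick: with FS and OFS both "=", every Key=Value line appends $1 to a growing hdr string and $2 to a growing data string, and VolumeName (the last key in each block) triggers a print - the header once, then one row per block - before column -s= -t aligns the '='-separated rows.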
