Convert each integer to a simple ASCII graph - bash

I have a file with a bunch of integers like this:
6
2
3
4
3
The goal is to convert those integers into video-game-style stat bars: for example, if the number is 6 the result must be ******----, and if the number is 4 it must be ****------.
I tried the following piece of code but it doesn't work:
# Here I put all the ints in a variable.
intNumber=`cat /home/intNumbers.txt`
# This for is a loop to print as many * as the number says.
for i in `seq 1 $intNumber`
do
    echo -n "*"
    # This for loop is for printing - until reaching 10.
    for j in `seq $intNumber 10`
    do
        echo -n "-"
    done
done
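
For reference, the loop above fails for two reasons: intNumber slurps every number in the file into a single whitespace-joined string, and the dash loop is nested inside the star loop instead of running after it. A minimal corrected sketch (assuming one integer per line, each between 0 and 10):
#!/bin/bash
while IFS= read -r intNumber; do
    for i in $(seq 1 "$intNumber"); do    # one * per unit
        printf '*'
    done
    for j in $(seq "$intNumber" 9); do    # pad with - out to 10 characters
        printf '%s' '-'
    done
    printf '\n'
done < /home/intNumbers.txt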

With Perl:
perl -ne 'print("*" x $_, "-" x (10-$_), "\n")' file
Here $_ contains the current input line.
Output:
******----
**--------
***-------
****------
***-------

You may use this awk:
awk '{s = sprintf("%*s", $1, ""); gsub(/ /, "*", s); p = sprintf("%*s", 10-$1, ""); gsub(/ /, "-", p); print s p}' file
******----
**--------
***-------
****------
***-------
A more readable version (the %*s format takes its field width from the preceding argument, so sprintf("%*s", $1, "") produces a string of $1 spaces for gsub to fill):
awk '{
s = sprintf("%*s", $1, "")
gsub(/ /, "*", s)
p = sprintf("%*s", 10-$1, "")
gsub(/ /, "-", p)
print s p
}' file

Another awk, keepin' it simple, sir:
$ awk '
BEGIN {
s="**********----------"
}
{
print substr(s,11-$1,10)
}' file
Output:
******----
**--------
***-------
****------
***-------
Similar for bash:
#!/bin/bash
s="**********----------"
while IFS= read -r line
do
echo "${s:((10-$line)):10}"
done < file
A more generic approach for awk could be, for example:
$ awk -v m=10 '{ # desired maximum number of chars
t="" # temp var
for(i=1;i<=m;i++) # loop to max
if(i<=$1) # up to threshold value from file
sub(/^/,"*",t) # prepend a *
else # after threshold
sub(/$/,"-",t) # append a -
print t
}' file
Some input checking could be in order.
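For instance, a guard along these lines could precede the main block (a sketch; it assumes values should be plain integers no larger than the maximum m):
awk -v m=10 '
$1 !~ /^[0-9]+$/ || $1+0 > m {            # not a plain integer, or too large
    printf "bad value on line %d: %s\n", NR, $0 > "/dev/stderr"
    next                                  # skip it
}
{
    t = ""
    for (i = 1; i <= m; i++)              # build the bar as before
        t = (i <= $1) ? t "*" : t "-"
    print t
}' file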

Minimizing the work you have to do per input line for efficiency:
$ cat tst.awk
BEGIN {
lgth = 10
curr = base = sprintf("%*s",lgth,"")
gsub(/ /,"*",curr)
gsub(/ /,"-",base)
}
{ print substr(curr,1,$1) substr(base,$1+1) }
$ awk -f tst.awk file
******----
**--------
***-------
****------
***-------
or, borrowing @JamesBrown's idea of indexing into a single string:
$ cat tst.awk
BEGIN {
lgth = 10
curr = base = sprintf("%*s",lgth,"")
gsub(/ /,"*",curr)
gsub(/ /,"-",base)
line = curr base
}
{ print substr(line,(lgth-$1)+1,lgth) }
$ awk -f tst.awk file
******----
**--------
***-------
****------
***-------

Don't read the entire input file into memory. Instead, process one line at a time.
The following also demonstrates how to do this more succinctly in Bash.
#!/bin/bash
ten='----------'
while IFS='' read -r num; do
printf -v graph '%10.10s' "${ten:$num}"
echo "${graph// /\*}"
done < intNumbers.txt
printf -v graph places the output in the variable graph, and we then use a Bash parameter substitution to replace the space padding from printf with asterisks.
Demo: https://ideone.com/wE6fpm
Doing this entirely in Bash is attractive if you end up doing this a lot; you generally want to avoid external processes especially in repeated code (though of course don't repeat yourself; put this in a function then.)
If you genuinely want to convert a file of numbers into a bunch of graphs, a single Awk process is still much better; the shell isn't particularly good at that. I'm imagining you have a different application where you occasionally need to format a number as a graph in various places in a Bash script.

A riff on tripleee's answer:
# Repeat a character a specified number of times
#
# parameters
# - character
# - count
#
# usage: str_repeat "*" 60
#
str_repeat() {
local char=$1 count=$2
local result
# string of count spaces
printf -v result "%*s" "$count" ""
# replace spaces with the char
echo "${result// /$char}"
}
while read -r num; do
printf '%s%s\n' "$(str_repeat '*' "$num")" "$(str_repeat '-' $((10-num)))"
done < intNumbers.txt

Another GAWK solution:
awk -v OFS="*" '{NF=($1+1); $1=""; print gensub(/ /,"-","g",sprintf("%-10s",$0))}' file
# -v OFS="*"
# [Sets the output field separator to an asterisk]
# NF=($1+1)
# [Defines the number of fields in the line as 1 more
# than the value in the first (and only) input field.
# If you printed the first line, it would now be "6******"]
# $1=""
# [Deletes the first field in the line, leaving only asterisks]
# sprintf("%-10s",$0)
# [Formats the line as 10 characters, left-justified.
# The missing characters on the right are blank spaces]
# gensub(/ /,"-","g",sprintf("%-10s",$0))
# [Replaces all blank spaces in the sprintf-formatted
# line with hyphens]
# print
# [Prints the line as modified by gensub]

With Python:
python -c "import sys
> for s in sys.stdin: n=int(s); print('*'*n + '-'*(10-n))" < file

Related

Unix Parse Varying Named Value into separate rows

We are getting a varying-length input file, as shown below; the Text field varies in length.
Input file:
ID|Text
1|name1=value1;name3;name4=value2;name5=value5
2|name1=value1;name2=value2;name6=;name7=value7;name8=value8
The text here has name=value pairs as its content, and it is of varying length. Please note that the name in the text column can contain a semicolon. We are trying to parse the input, but we are not able to handle it via AWK or BASH.
Desired Output:
1|name1=value1
1|name3;name4=value2
1|name5=value5
2|name1=value1
2|name2=value2
2|name6=
2|name7=value7
2|name8=value8
The below snippet of code works for ID=2, but doesn't for ID=1:
echo "2|name1=value1;name2=value2;name6=;name7=value7;name8=value8" | while IFS="|"; read id text;do dsc=`echo $text|tr ';' '\n'`;echo "$dsc" >tmp;done
cat tmp
2|name1=value1
2|name2=value2
2|name6=
2|name7=value7
2|name8=value8
echo "1|name1=value1;name3;name4=value2;name5=value5" | while IFS="|"; read id text;do dsc=`echo $text|tr ';' '\n'`;echo "$dsc" >tmp;sed -i "s/^/${id}\|/g" tmp;done
cat tmp
1|name1=value1
1|name3
1|name4=value2
1|name5=value5
Any help is greatly appreciated.
You could try the following, written and tested with the shown samples in a recent version of GNU awk. If your awk version is old, try changing awk to awk --re-interval.
awk '
BEGIN{
FS=OFS="|"
}
FNR==1{ next }
{
first=$1
while(match($0,/(name[0-9]+;?){1,}=(value[0-9]+)?/)){
print first,substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
}' Input_file
Output will be as follows.
1|name1=value1
1|name3;name4=value2
1|name5=value5
2|name1=value1
2|name2=value2
2|name6=
2|name7=value7
2|name8=value8
Explanation: a detailed, commented version of the above (for explanation purposes only).
awk '                                     ##Start of the awk program.
BEGIN{
  FS=OFS="|"                              ##Setting FS and OFS to | here.
}
FNR==1{ next }                            ##Skip the header line; print nothing for it.
{
  first=$1                                ##Save the first field in variable first.
  while(match($0,/(name[0-9]+;?){1,}=(value[0-9]+)?/)){   ##Loop while the name=value regex (allowing semicolons inside names) still matches.
    print first,substr($0,RSTART,RLENGTH) ##Print first and the currently matched substring.
    $0=substr($0,RSTART+RLENGTH)          ##Keep only the unmatched rest of the line for the next pass.
  }
}' Input_file                             ##Input_file is the file to process.
Sample data:
$ cat name.dat
ID|Text
1|name1=value1;name3;name4=value2;name5=value5
2|name1=value1;name2=value2;name6=;name7=value7;name8=value8
One awk solution:
awk -F"[|;]" ' # use "|" and ";" as input field delimiters
FNR==1 { next } # skip header line
{ pfx=$1 "|" # set output prefix to field 1 + "|"
printpfx=1 # set flag to print prefix
for ( i=2 ; i<=NF ; i++ ) # for fields 2 to NF
{
if ( printpfx) { printf "%s", pfx ; printpfx=0 } # if print flag == 1 then print prefix and clear flag
if ( $(i) ~ /=/ ) { printf "%s\n", $(i) ; printpfx=1 } # if current field contains "=" then print it, end this line of output, reset print flag == 1
if ( $(i) !~ /=/ ) { printf "%s;", $(i) } # if current field does not contain "=" then print it and include a ";" suffix
}
}
' name.dat
The above generates:
1|name1=value1
1|name3;name4=value2
1|name5=value5
2|name1=value1
2|name2=value2
2|name6=
2|name7=value7
2|name8=value8
A Bash solution:
#!/usr/bin/env bash
while IFS=\| read -r id text || [ -n "$id" ]; do
IFS=\; read -r -a kv_arr < <(printf %s "$text")
printf "$id|%s\\n" "${kv_arr[#]}"
done < <(tail -n +2 a.txt)
A plain POSIX shell solution:
#!/usr/bin/env sh
# Chop the header line from the input file
tail -n +2 a.txt |
# While reading id and text Fields Separated by vertical bar
while IFS=\| read -r id text || [ -n "$id" ]; do
# Sets the separator to a semicolon
IFS=\;
# Print each semicolon separated field formatted on
# its own line with the ID
# shellcheck disable=SC2086 # Explicit split on semicolon
printf "$id|%s\\n" $text
done
Input a.txt:
ID|Text
1|name1=value1;name3;name4=value2;name5=value5
2|name1=value1;name2=value2;name6=;name7=value7;name8=value8
Output:
1|name1=value1
1|name3
1|name4=value2
1|name5=value5
2|name1=value1
2|name2=value2
2|name6=
2|name7=value7
2|name8=value8
You have some good answers and an accepted one already. Here is a much shorter gnu awk command that can also do the job:
awk -F '|' 'NR > 1 {
for (s=$2; match(s, /([^=]+=[^;]*)(;|$)/, m); s=substr(s, RLENGTH+1))
print $1 FS m[1]
}' file.txt
1|name1=value1
1|name3;name4=value2
1|name5=value5
2|name1=value1
2|name2=value2
2|name6=
2|name7=value7
2|name8=value8

Using awk in a bash script to select line range based on number of characters in specific line

This seems like something that might be possible in a single long awk command. But I don't know awk well enough to do it.
I want to count the total number of A, T, G and C characters in every 4th line of input, starting from line 2. If such a line has a character count in the range, say, 1000 to 3000, then I want to print that line as well as the line above it and the two lines below it.
I can break it down and do portions of this in separate lines of code. But when I have millions of lines, it takes too long to compute. I need a single powerful awk command here. There must be someone brilliant enough at awk to solve this one!
Very tiny example, with range 10 < character count < 40:
Input:
#d0aec33d-ba
TCAGTATGCTTCGTGCAATCAAG
+
-0(''$&"('
#ee487ad3-b71
ACAATGTG
+
""%#0&'+367<677
Output:
#d0aec33d-ba
TCAGTATGCTTCGTGCAATCAAG
+
-0(''$&"('
Here is a quick one:
$ awk '
NR%4==1 { b="" } # first record of four, reset buffer
NR%4==2 && length()>10 && length()<40 { f=1 } # 2/4 if length is right, flag up
{ b=b $0 ORS } # buffer records to b
NR%4==0 && f { # 4/4
printf "%s",b # print if flag is up
f=0 # and flag down
}' file
Output:
#d0aec33d-ba
TCAGTATGCTTCGTGCAATCAAG
+
-0(''$&"('
Edit:
A parameterized version (x=$min, y=$max):
$ awk -v x=$min -v y=$max '
NR%4==1 { b="" } # first record of four, reset buffer
NR%4==2 && length()>x && length()<y { f=1 } # 2/4 if length is right, flag up
{ b=b $0 ORS } # buffer records to b
NR%4==0 && f { # 4/4
printf "%s",b # # print if flag is up
f=0 # # # and flag down
# printf b; f=0 # # # # # # # # # # # # # # if commands on the same line
}' file # #
#
One-liner just in case:
$ awk -v x=$min -v y=$max 'NR%4==1{b=""} NR%4==2 && length()>x && length()<y{f=1} {b=b $0 ORS} NR%4==0 && f{printf "%s",b; f=0}' file
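
If this ends up in a script, the bounds can be passed as arguments (a sketch; filter.sh is a made-up name):
#!/bin/bash
# usage: ./filter.sh 10 40 file
min=$1 max=$2
awk -v x="$min" -v y="$max" '
NR%4==1 { b="" }
NR%4==2 && length()>x && length()<y { f=1 }
{ b=b $0 ORS }
NR%4==0 && f { printf "%s",b; f=0 }' "$3"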

awk to calculate average of field in multiple text files and merge into one

I am trying to calculate the average of $2 in multiple text files in a directory and merge the output into one tab-delimited file. The output has two fields: $1 is the file name with the _base.txt suffix stripped (pref), and $2 is the calculated average, rounded to one decimal place. There is also a header line, with Sample in $1 and Percent in $2. The below seems close, but I am missing a few things (adding the header to the output, merging into one tab-delimited file, and the rounding) that I do not know how to do yet, so I am not getting the desired output. Thank you :).
123_base.txt
AASS 99.81
ABAT 100.00
ABCA10 0.0
456_base.txt
ABL2 97.81
ABO 100.00
ACACA 99.82
desired output (tab-delimeted)
Sample Percent
123 66.6
456 99.2
Bash
for f in /home/cmccabe/Desktop/20x/percent/*.txt ; do
bname=$(basename $f)
pref=${bname%%_base_*.txt}
awk -v OFS='\t' '{ sum += $2 } END { if (NR > 0) print sum / NR }' $f /home/cmccabe/Desktop/NGS/bed/bedtools/IDP_total_target_length_by_panel/IDP_unix_trim_total_target_length.bed > /home/cmccabe/Desktop/20x/coverage/${pref}_average.txt
done
This one uses GNU awk, which provides handy BEGINFILE and ENDFILE events:
gawk '
BEGIN {print "Sample\tPercent"}
BEGINFILE {sample = FILENAME; sub(/_.*/,"",sample); sum = n = 0}
{sum += $2; n++}
ENDFILE {printf "%s\t%.1f\n", sample, sum/n}
' 123_base.txt 456_base.txt
If you're giving a pattern with the directory attached, I'd get the sample name like this:
match(FILENAME, /^.*\/([^_]+)/, m); sample = m[1]
and then, yes this is OK: gawk '...' /path/to/*_base.txt
And to guard against division by zero, inspired by James Brown's answer:
ENDFILE {printf "%s\t%.1f\n", sample, n==0 ? 0 : sum/n}
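
Put together, the directory-path variant with the zero guard would look something like this (a sketch; it assumes file names like /path/to/123_base.txt):
gawk '
BEGIN     { print "Sample\tPercent" }
BEGINFILE { match(FILENAME, /^.*\/([^_]+)/, m); sample = m[1]; sum = n = 0 }
          { sum += $2; n++ }
ENDFILE   { printf "%s\t%.1f\n", sample, n == 0 ? 0 : sum/n }
' /path/to/*_base.txt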
With Perl:
$ perl -ane '
BEGIN{ print "Sample\tPercent\n" }
$c++; $sum += $F[1];
if(eof)
{
($pref) = $ARGV=~/(.*)_base/;
printf "%s\t%.1f\n", $pref, $sum/$c;
$c = 0; $sum = 0;
}' 123_base.txt 456_base.txt
Sample Percent
123 66.6
456 99.2
print header using BEGIN block
-a option splits the input line on whitespace and saves the fields to the @F array
For each line, increment counter and add to sum variable
If end of file eof is detected, print in required format
$ARGV contains current filename being read
If full path of filename is passed but only filename should be used to get pref, then use this line instead
($pref) = $ARGV=~/.*\/\K(.*)_base/;
In awk (GNU awk, for BEGINFILE/ENDFILE). Notice the printf "%3.3s", which truncates the filename after the 3rd char:
$ cat ave.awk
BEGIN {print "Sample", "Percent"} # header
BEGINFILE {s=c=0} # at the start of every file reset
{s+=$2; c++} # sum and count hits
ENDFILE{if(c>0) printf "%3.3s%s%.1f\n", FILENAME, OFS, s/c}
# above output if more than 0 lines
Run it:
$ touch empty_base.txt # test for division by zero
$ awk -f ave.awk 123_base.txt 456_base.txt empty_base.txt
Sample Percent
123 66.6
456 99.2
Another awk:
$ awk -v OFS='\t' '{f=FILENAME;sub(/_.*/,"",f);
a[f]+=$2; c[f]++}
END{print "Sample","Percent";
for(k in a) print k, sprintf("%.1f",a[k]/c[k])}' {123,456}_base.txt
Sample Percent
456 99.2
123 66.6

Awk/sed replace newlines

Intro:
I have been given a CSV file in which the field delimiter is the pipe character (i.e., |).
This file has a pre-defined number of fields (say N). I can discover the value of N by reading the header of the CSV file, which we can assume to be correct.
Problem:
Some of the fields contain a newline character by mistake, which makes the line appear shorter than required (i.e., it has M fields, with M < N).
What I need to create is a sh script (not bash) to fix those lines.
Attempted solution:
I tried creating the following script to try fixing the file:
if [ $# -ne 1 ]
then
echo "Usage: $0 <filename>"
exit
fi
# get first line
first_line=$(head -n 1 $1)
# get number of fields
num_separators=$(echo "$first_line" | tr -d -c '|' | awk '{print length}')
cat $1 | awk -v numFields=$(( num_separators + 1 )) -F '|' '
{
totRecords = NF/numFields
# loop over lines
for (record=0; record < totRecords; record++) {
output = ""
# loop over fields
for (i=0; i<numFields; i++) {
j = (numFields*record)+i+1
# replace newline with question mark
sub("\n", "?", $j)
output = output (i > 0 ? "|" : "") $j
}
print output
}
}
'
However, the newline character is still present.
How can I fix that problem?
Example of the CSV:
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a
newline
Foo|Bar|Baz
Expected output:
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a * newline
Foo|Bar|Baz
* I don't care about the replacement, it could be a space, a question mark, whatever except a newline or a pipe (which would create a new field)
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==1 { reqdNF = NF; printf "%s", $0; next }
{ printf "%s%s", (NF < reqdNF ? " " : ORS), $0 }
END { print "" }
$ awk -f tst.awk file.csv
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a newline
Foo|Bar|Baz
If that's not what you want then edit your question to provide more truly representative sample input and associated output.
Based on the assumption that the last field may contain one newline. Using tac and sed:
tac file.csv | sed -n '/|/!{h;n;x;H;x;s/\n/ * /p;b};p' | tac
Output:
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a * newline
Foo|Bar|Baz
How it works: read the file backwards; sed is easier without forward references. If a line has no '|' separator (/|/!), run the block of code in curly braces {}; otherwise just print the line (p). The block of code:
h; stores the delimiter-less line in sed's hold buffer.
n; fetches another line, since we're reading backwards, this is the line that should be appended to.
x; exchange hold buffer and pattern buffer.
H; append pattern buffer to hold buffer.
x; exchange newly appended lines to pattern buffer, now there's two lines in one buffer.
s/\n/ * /p; replace the middle linefeed with a " * ", now there's only one longer line; and print.
b start again, leave the code block.
Re-reverse the file with tac; done.

Parsing a CSV file using gawk

How do you parse a CSV file using gawk? Simply setting FS="," is not enough, as a quoted field with a comma inside will be treated as multiple fields.
Example using FS="," which does not work:
file contents:
one,two,"three, four",five
"six, seven",eight,"nine"
gawk script:
BEGIN { FS="," }
{
for (i=1; i<=NF; i++) printf "field #%d: %s\n", i, $(i)
printf "---------------------------\n"
}
bad output:
field #1: one
field #2: two
field #3: "three
field #4: four"
field #5: five
---------------------------
field #1: "six
field #2: seven"
field #3: eight
field #4: "nine"
---------------------------
desired output:
field #1: one
field #2: two
field #3: "three, four"
field #4: five
---------------------------
field #1: "six, seven"
field #2: eight
field #3: "nine"
---------------------------
The gawk version 4 manual says to use FPAT = "([^,]*)|(\"[^\"]+\")"
When FPAT is defined, it disables FS and specifies fields by content instead of by separator.
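Applied to the example above, that looks something like this (a sketch; FPAT requires gawk 4.0 or newer):
gawk '
BEGIN { FPAT = "([^,]*)|(\"[^\"]+\")" }   # a field is unquoted text or a quoted string
{
    for (i = 1; i <= NF; i++) printf "field #%d: %s\n", i, $i
    print "---------------------------"
}' file.csv
This produces the desired output shown above, quotes included, since FPAT keeps whatever the pattern matched.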
The short answer is "I wouldn't use gawk to parse CSV if the CSV contains awkward data", where 'awkward' means things like commas in the CSV field data.
The next question is "What other processing are you going to be doing", since that will influence what alternatives you use.
I'd probably use Perl and the Text::CSV or Text::CSV_XS modules to read and process the data. Remember, Perl was originally written in part as an awk and sed killer - hence the a2p and s2p programs still distributed with Perl which convert awk and sed scripts (respectively) into Perl.
You can use a simple wrapper function called csvquote to sanitize the input and restore it after awk is done processing it. Pipe your data through it at the start and end, and everything should work out ok:
before:
gawk -f myprogram.awk input.csv
after:
csvquote input.csv | gawk -f myprogram.awk | csvquote -u
See https://github.com/dbro/csvquote for code and documentation.
If permissible, I would use the Python csv module, paying special attention to the dialect used and formatting parameters required, to parse the CSV file you have.
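A sketch of that in the same spirit as the one-liners above (using the default dialect; note that csv.reader strips the surrounding quotes, unlike the desired output in the question):
python -c "
import csv, sys
for row in csv.reader(sys.stdin):
    for i, field in enumerate(row, 1):
        print('field #%d: %s' % (i, field))
    print('---------------------------')
" < file.csv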
csv2delim.awk
# csv2delim.awk converts comma delimited files with optional quotes to delim separated file
# delim can be any character, defaults to tab
# assumes no repl characters in text, any delim in line converts to repl
# repl can be any character, defaults to ~
# changes two consecutive quotes within quotes to '
# usage: gawk -f csv2delim.awk [-v delim=d] [-v repl=c] input-file > output-file
# -v delim delimiter, defaults to tab
# -v repl replacement char, defaults to ~
# e.g. gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > test.txt
# abe 2-28-7
# abe 8-8-8 1.0 fixed empty fields, added replacement option
# abe 8-27-8 1.1 used split
# abe 8-27-8 1.2 inline rpl and "" = '
# abe 8-27-8 1.3 revert to 1.0 as it is much faster, split most of the time
# abe 8-29-8 1.4 better message if delim present
BEGIN {
if (delim == "") delim = "\t"
if (repl == "") repl = "~"
print "csv2delim.awk v.m 1.4 run at " strftime() > "/dev/stderr" ###########################################
}
{
#if ($0 ~ repl) {
# print "Replacement character " repl " is on line " FNR ":" lineIn ";" > "/dev/stderr"
#}
if ($0 ~ delim) {
print "Temp delimiter character " delim " is on line " FNR ":" lineIn ";" > "/dev/stderr"
print " replaced by " repl > "/dev/stderr"
}
gsub(delim, repl)
$0 = gensub(/([^,])\"\"/, "\\1'", "g")
# $0 = gensub(/\"\"([^,])/, "'\\1", "g") # not needed above covers all cases
out = ""
#for (i = 1; i <= length($0); i++)
n = length($0)
for (i = 1; i <= n; i++)
if ((ch = substr($0, i, 1)) == "\"")
inString = (inString) ? 0 : 1 # toggle inString
else
out = out ((ch == "," && ! inString) ? delim : ch)
print out
}
END {
print NR " records processed from " FILENAME " at " strftime() > "/dev/stderr"
}
test.csv
"first","second","third"
"fir,st","second","third"
"first","sec""ond","third"
" first ",sec ond,"third"
"first" , "second","th ird"
"first","sec;ond","third"
"first","second","th;ird"
1,2,3
,2,3
1,2,
,2,
1,,2
1,"2",3
"1",2,"3"
"1",,"3"
1,"",3
"","",""
"","""aiyn","oh"""
"""","""",""""
11,2~2,3
test.bat
rem test csv2delim
rem default is: -v delim={tab} -v repl=~
gawk -f csv2delim.awk test.csv > test.txt
gawk -v delim=; -f csv2delim.awk test.csv > testd.txt
gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > testdr.txt
gawk -v repl=` -f csv2delim.awk test.csv > testr.txt
I am not exactly sure whether this is the right way to do things. I would rather work on a CSV file in which either all values are quoted or none. By the way, awk allows regexes to be field separators; check if that is useful.
{
ColumnCount = 0
$0 = $0 "," # Assures all fields end with comma
while($0) # Get fields by pattern, not by delimiter
{
match($0, / *"[^"]*" *,|[^,]*,/) # Find a field with its delimiter suffix
Field = substr($0, RSTART, RLENGTH) # Get the located field with its delimiter
gsub(/^ *"?|"? *,$/, "", Field) # Strip delimiter text: comma/space/quote
Column[++ColumnCount] = Field # Save field without delimiter in an array
$0 = substr($0, RLENGTH + 1) # Remove processed text from the raw data
}
}
Patterns that follow this one can access the fields in Column[]. ColumnCount indicates the number of elements in Column[] that were found. If not all rows contain the same number of columns, Column[] contains extra data after Column[ColumnCount] when processing the shorter rows.
This implementation is slow, but it appears to emulate the FPAT/patsplit() feature found in gawk >= 4.0.0 mentioned in a previous answer.
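A follow-on rule using those fields might look like this (a sketch, printing in the question's format):
{
    for (i = 1; i <= ColumnCount; i++)
        printf "field #%d: %s\n", i, Column[i]
    print "---------------------------"
}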
Here's what I came up with. Any comments and/or better solutions would be appreciated.
BEGIN { FS="," }
{
for (i=1; i<=NF; i++) {
f[++n] = $i
if (substr(f[n],1,1)=="\"") {
while (substr(f[n], length(f[n]))!="\"" || substr(f[n], length(f[n])-1, 1)=="\\") {
f[n] = sprintf("%s,%s", f[n], $(++i))
}
}
}
for (i=1; i<=n; i++) printf "field #%d: %s\n", i, f[i]
print "----------------------------------\n"
}
The basic idea is that I loop through the fields, and any field which starts with a quote but does not end with a quote gets the next field appended to it.
Perl has the Text::CSV_XS module which is purpose-built to handle the quoted-comma weirdness.
Alternatively, try the Text::CSV module.
perl -MText::CSV_XS -ne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){@f=$csv->fields();for $n (0..$#f) {print "field #$n: $f[$n]\n"};print "---\n"}' file.csv
Produces this output:
field #0: one
field #1: two
field #2: three, four
field #3: five
---
field #0: six, seven
field #1: eight
field #2: nine
---
Here's a human-readable version.
Save it as parsecsv, chmod +x, and run it as "parsecsv file.csv"
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new();
open(my $data, '<', $ARGV[0]) or die "Could not open '$ARGV[0]' $!\n";
while (my $line = <$data>) {
if ($csv->parse($line)) {
my @f = $csv->fields();
for my $n (0..$#f) {
print "field #$n: $f[$n]\n";
}
print "---\n";
}
}
You may need to point to a different version of perl on your machine, since the Text::CSV_XS module may not be installed on your default version of perl.
Can't locate Text/CSV_XS.pm in @INC (@INC contains: /home/gnu/lib/perl5/5.6.1/i686-linux /home/gnu/lib/perl5/5.6.1 /home/gnu/lib/perl5/site_perl/5.6.1/i686-linux /home/gnu/lib/perl5/site_perl/5.6.1 /home/gnu/lib/perl5/site_perl .).
BEGIN failed--compilation aborted.
If none of your versions of Perl have Text::CSV_XS installed, you'll need to:
sudo apt-get install cpanminus
sudo cpanm Text::CSV_XS
