parse .csv data into matrix or 2 dimensional array bash/shell awk - bash

I have a comma-delimited CSV file named 'itrs.csv' which I want to parse into a matrix or 2D array using a bash or shell script:
Loads\PostDate,schedule,seta,eeta,2019-11-05,2019-11-06,2019-11-07,2019-11-08
BANAMEX,7,1:18:10,1:23:45,G,G,C,C
EMEA,5,0:21:00,1:01:00,G,G,G,C
I have tried the following:
declare -A matrix
eval matrix=($(awk -f, itrs.csv))
for ((i=0;i<=2;i++))
do
for ((j=0;j<=$6;j++))
do
echo ${matrix[$i,$j]}" "
done
echo
done
but the above code is throwing errors. I would also like to know how to check the number of rows and columns while parsing, because the CSV file size may change.

Well, you can do this: iterate over the lines while keeping a count of the current line, then iterate over the fields of each line and fill an associative array with the indexes as requested.
i=0
declare -A matrix
while IFS=, read -r -a line; do
    for ((j = 0; j < ${#line[@]}; ++j)); do
        matrix[$i,$j]=${line[$j]}
    done
    ((i++))
done < itrs.csv
After it declare -p matrix would output:
declare -A matrix=([1,5]="G" [1,4]="G" [1,7]="C" [1,6]="C" [1,1]="7" [1,0]="BANAMEX" [1,3]="1:23:45" [1,2]="1:18:10" [0,4]="2019-11-05" [0,5]="2019-11-06" [0,6]="2019-11-07" [0,7]="2019-11-08" [0,0]="Loads\\PostDate" [0,1]="schedule" [0,2]="seta" [0,3]="eeta" [2,6]="G" [2,7]="C" [2,4]="G" [2,5]="G" [2,2]="0:21:00" [2,3]="1:01:00" [2,0]="EMEA" [2,1]="5" )
See bashfaq How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
Don't use eval. eval is evil. Don't eval arr=($(..)) unless you know what you are doing. In your case, using eval makes little to no sense.
The error comes from awk. awk is invoked as awk [options] script [file]; you could run awk -F, '{print $0}' itrs.csv, but it would make no sense here. As written, itrs.csv is parsed by awk as the script itself - and since it makes no sense as an awk script, the tool throws an error.
To read, for example, just the first line split on commas into a bash array, you can do IFS=, line=($(head -n1 itrs.csv)). The -F, option affects how awk parses the file, not how bash creates arrays - for that, use IFS.
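To also answer the second part of the question: you can track the dimensions while filling the matrix. A minimal sketch building on the loop above (the nrows/ncols variable names are my own, not from the original code):
nrows=0
ncols=0
declare -A matrix
while IFS=, read -r -a line; do
    # remember the widest row seen so far
    (( ${#line[@]} > ncols )) && ncols=${#line[@]}
    for ((j = 0; j < ${#line[@]}; ++j)); do
        matrix[$nrows,$j]=${line[$j]}
    done
    ((nrows++))
done < itrs.csv
echo "rows=$nrows cols=$ncols"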

Related

how to assign each of multiple lines in a file as different variable?

This is probably a very simple question. I looked at other answers but couldn't come up with a solution. I have a 365-line date file, like this:
01-01-2000
02-01-2000
I need to read this file line by line and assign each day to a separate variable. like this,
d001=01-01-2000
d002=02-01-2000
I tried while read commands but couldn't get them to work. Assigning them one by one takes a lot of time. How can I do it quickly?
Trying to create individually named variables like this is a waste of time and not really supported. Better to use an associative array:
#!/bin/bash
declare -A array
while read -r line; do
    printf -v key 'd%03d' $((++c))
    array[$key]=$line
done < file
Output
for i in "${!array[#]}"; do echo "key=$i value=${array[$i]}"; done
key=d001 value=01-01-2000
key=d002 value=02-01-2000
Assumptions:
an array is acceptable
array index should start with 1
Sample input:
$ cat sample.dat
01-01-2000
02-01-2000
03-01-2000
04-01-2000
05-01-2000
One bash/mapfile option:
unset d # make sure variable is not currently in use
mapfile -t -O1 d < sample.dat # load each line from file into separate array location
This generates:
$ typeset -p d
declare -a d=([1]="01-01-2000" [2]="02-01-2000" [3]="03-01-2000" [4]="04-01-2000" [5]="05-01-2000")
$ for i in "${!d[@]}"; do echo "d[$i] = ${d[i]}"; done
d[1] = 01-01-2000
d[2] = 02-01-2000
d[3] = 03-01-2000
d[4] = 04-01-2000
d[5] = 05-01-2000
In OP's code, references to $d001 now become ${d[1]}.
A quick one-liner would be:
eval $(awk 'BEGIN{cnt=0}{printf "d%3.3d=\"%s\"\n",cnt,$0; cnt++}' your_file)
eval makes the shell variables known inside your script or shell. Use echo $d000 to show the first of the newly defined variables. There should be no shell special characters (like * and $) inside your_file. Remove the eval $( ) wrapper to see the raw output of the awk command. The \"-quoted %s allows spaces in the variable values; if you don't have any spaces in your_file, you can remove the \" before and after %s.
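For the two sample lines above, the awk command (run without the eval wrapper) emits plain assignments, which eval then executes:
d000="01-01-2000"
d001="02-01-2000"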

Extracting file content using a for loop [duplicate]

I'm working on a long Bash script. I want to read cells from a CSV file into Bash variables. I can parse lines and the first column, but not any other column. Here's my code so far:
cat myfile.csv|while read line
do
read -d, col1 col2 < <(echo $line)
echo "I got:$col1|$col2"
done
It's only printing the first column. As an additional test, I tried the following:
read -d, x y < <(echo a,b,)
And $y is empty. So I tried:
read x y < <(echo a b)
And $y is b. Why?
You need to use IFS instead of -d:
while IFS=, read -r col1 col2
do
    echo "I got:$col1|$col2"
done < myfile.csv
To skip a given number of header lines:
skip_headers=3
while IFS=, read -r col1 col2
do
    if ((skip_headers))
    then
        ((skip_headers--))
    else
        echo "I got:$col1|$col2"
    fi
done < myfile.csv
Note that for general-purpose CSV parsing you should use a specialized tool which can handle quoted fields with internal commas, among other issues that Bash can't handle by itself. Examples of such tools are csvtool and csvkit.
How to parse a CSV file in Bash?
Coming late to this question: since this question is about bash, since bash does offer new features, and since none of the already posted answers show this powerful and standards-compliant way of doing precisely this, here it is.
Parsing CSV files under bash, using a loadable module
Conforming to RFC 4180, a string like this sample CSV row:
12,22.45,"Hello, ""man"".","A, b.",42
should be split as
1 12
2 22.45
3 Hello, "man".
4 A, b.
5 42
Bash loadable C compiled modules.
Under bash, you can create, edit, and use loadable C compiled modules. Once loaded, they work like any other builtin! (You can find more information in the bash source tree.)
The current source tree (Oct 15 2021, bash V5.1-rc3) contains a bunch of samples:
accept listen for and accept a remote network connection on a given port
asort Sort arrays in-place
basename Return non-directory portion of pathname.
cat cat(1) replacement with no options - the way cat was intended.
csv process one line of csv data and populate an indexed array.
dirname Return directory portion of pathname.
fdflags Change the flag associated with one of bash's open file descriptors.
finfo Print file info.
head Copy first part of files.
hello Obligatory "Hello World" / sample loadable.
...
tee Duplicate standard input.
template Example template for loadable builtin.
truefalse True and false builtins.
tty Return terminal name.
uname Print system information.
unlink Remove a directory entry.
whoami Print out username of current user.
There is a full working CSV parser ready to use in the examples/loadables directory: csv.c!
On Debian GNU/Linux based systems, you may have to install the bash-builtins package:
apt install bash-builtins
Using loadable bash-builtins:
Then:
enable -f /usr/lib/bash/csv csv
From there, you could use csv as a bash builtin.
With my sample: 12,22.45,"Hello, ""man"".","A, b.",42
csv -a myArray '12,22.45,"Hello, ""man"".","A, b.",42'
printf "%s\n" "${myArray[#]}" | cat -n
1 12
2 22.45
3 Hello, "man".
4 A, b.
5 42
Then, in a loop, processing a file:
while IFS= read -r line; do
    csv -a aVar "$line"
    printf "First two columns are: [ '%s' - '%s' ]\n" "${aVar[0]}" "${aVar[1]}"
done < myfile.csv
This way is clearly quicker and more robust than any other combination of bash builtins, or than forking out to any external binary.
Unfortunately, depending on your system, if your version of bash was compiled without loadable builtin support, this may not work...
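A quick way to check whether the loadable is usable on your build (the path comes from the Debian package mentioned above; adjust it on other systems):
if enable -f /usr/lib/bash/csv csv 2>/dev/null; then
    echo "csv builtin enabled"
else
    echo "csv loadable not available on this build of bash" >&2
fi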
Complete sample with multiline CSV fields.
Conforming to RFC 4180, a string like this single CSV row:
12,22.45,"Hello ""man"",
This is a good day, today!","A, b.",42
should be split as
1 12
2 22.45
3 Hello "man",
This is a good day, today!
4 A, b.
5 42
Full sample script for parsing a CSV containing multiline fields
Here is a small sample file with one header line, 4 columns and 3 rows. Because two fields contain newlines, the file is 6 lines long.
Id,Name,Desc,Value
1234,Cpt1023,"Energy counter",34213
2343,Sns2123,"Temperatur sensor
to trigg for alarm",48.4
42,Eye1412,"Solar sensor ""Day /
Night""",12199.21
And a small script able to parse this file correctly:
#!/bin/bash
enable -f /usr/lib/bash/csv csv
file="sample.csv"
exec {FD}<"$file"
read -ru $FD line
csv -a headline "$line"
printf -v fieldfmt '%-8s: "%%q"\n' "${headline[@]}"
numcols=${#headline[@]}
while read -ru $FD line; do
    while csv -a row "$line"; (( ${#row[@]} < numcols )); do
        read -ru $FD sline || break
        line+=$'\n'"$sline"
    done
    printf "$fieldfmt\n" "${row[@]}"
done
This may render (I've used printf "%q" to represent non-printable characters like newlines as $'\n'):
Id : "1234"
Name : "Cpt1023"
Desc : "Energy\ counter"
Value : "34213"
Id : "2343"
Name : "Sns2123"
Desc : "$'Temperatur sensor\nto trigg for alarm'"
Value : "48.4"
Id : "42"
Name : "Eye1412"
Desc : "$'Solar sensor "Day /\nNight"'"
Value : "12199.21"
You can find a full working sample here: csvsample.sh.txt or csvsample.sh.
Note:
In this sample, I use the header line to determine the row width (number of columns). If your header line could hold newlines (or if your CSV uses more than one header line), you will have to pass the number of columns as an argument to your script (and the number of header lines).
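A rough sketch of that adaptation (my own variation, not part of the original script): take the column count as an argument and skip the header-detection step entirely.
#!/bin/bash
enable -f /usr/lib/bash/csv csv
numcols=${1:?usage: $0 numcols [file]}   # column count supplied by the caller
file=${2:-sample.csv}
exec {FD}<"$file"
while read -ru $FD line; do
    # keep appending physical lines until the row has the expected number of fields
    while csv -a row "$line"; (( ${#row[@]} < numcols )); do
        read -ru $FD sline || break
        line+=$'\n'"$sline"
    done
    printf '%s\n' "${row[@]}"
done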
Warning:
Of course, parsing CSV this way is not perfect! It works for many simple CSV files, but take care about encoding and security! For example, this module won't be able to handle binary fields!
Read carefully csv.c source code comments and RFC 4180!
From the man page:
-d delim
The first character of delim is used to terminate the input line,
rather than newline.
You are using -d, which will terminate the input line on the comma. It will not read the rest of the line. That's why $y is empty.
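A minimal side-by-side illustration of the difference, reusing the same input as above:
read -d, x y < <(echo a,b,)     # read stops at the first comma, so x=a and y is empty
IFS=, read -r x y < <(echo a,b) # the comma acts as a field separator, so x=a and y=b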
We can parse CSV files with quoted strings delimited by, say, | with the following code:
while read -r line
do
field1=$(echo "$line" | awk -F'|' '{printf "%s", $1}' | tr -d '"')
field2=$(echo "$line" | awk -F'|' '{printf "%s", $2}' | tr -d '"')
echo "$field1 $field2"
done < "$csvFile"
awk parses the string fields into variables and tr removes the quotes.
This is slightly slower, as awk is executed for each field.
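If forking awk and tr per field becomes a bottleneck, here is a sketch of the same idea using only bash built-ins (assuming the same simple quoting, i.e. no embedded | inside quoted fields):
while IFS='|' read -r field1 field2 _rest; do
    field1=${field1//\"/}   # strip double quotes
    field2=${field2//\"/}
    echo "$field1 $field2"
done < "$csvFile"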
In addition to the answer from @Dennis Williamson, it may be helpful to skip the first line when it contains the header of the CSV:
{
    read
    while IFS=, read -r col1 col2
    do
        echo "I got:$col1|$col2"
    done
} < myfile.csv
If you want to read a CSV file and skip its header line, this is one solution:
i=1
while IFS=, read -ra line
do
    # skip the first (header) line
    test $i -eq 1 && ((i=i+1)) && continue
    for col_val in "${line[@]}"
    do
        echo -n "$col_val|"
    done
    echo
done < "$csvFile"

Count columns with bash

I need to find a way to count columns from a file OR stdin without using anything except pure bash. What I have so far...
input="${1:-/dev/stdin}"
rows=0
while read -r myLine
do
a=($myLine)
cols=(${#a[*]})
rows=`expr $rows + 1`
done < $input
echo -e "$rows $cols"
I'm counting both rows and columns. Right now, my column count only works for files, not stdin.
Any advice?
I'm running the following commands:
echo -e "1\t2\n3\t4" > m1; ./matrix m1
echo -e "1\t2\n3\t4" | matrix
I would advise awk for this task, but if you need bash, you could use arrays:
let c=0
while read -r myLine; do
    a=($myLine)
    echo "Line $((++c)) has ${#a[*]} columns"
done < file
The array a is filled with the content of the line read by the read function.
The number of columns is the length of the array a.
Note this script assumes the input field separator IFS to be unset (and as a consequence to default to <space><tab><newline> for the read command).
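Since the answer suggests awk is better suited for this task, here is a minimal awk sketch that handles both a file argument and stdin, matching the OP's ${1:-/dev/stdin} pattern (like the original loop, it reports the column count of the last line read):
awk '{ cols = NF } END { print NR, cols }' "${1:-/dev/stdin}"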

Bash command to read a line based on the parameters I pass - perform column-based lookups

I have a file links.txt:
1 a.sh
3 b.sh
6 c.sh
4 d.sh
So, if I pass 1,4 as parameters to another file (master.sh), a.sh and d.sh should be stored in a variable.
sed '3!d' would print the 3rd line, but not the line that starts with 3. For that, you need sed '/^3 /!d'. The problem is you can't combine them for more lines, as this means "Delete everything that doesn't start with a 3", which means all other lines will be missed. So, use sed -n '/^3 /p' instead, i.e. don't print by default and tell sed what lines to print, not what lines to delete.
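For example, with the links.txt above:
sed -n '/^3 /p' links.txt                  # prints: 3 b.sh
sed -n -e '/^1 /p' -e '/^4 /p' links.txt   # prints: 1 a.sh and 4 d.sh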
You can loop over the arguments and create a sed script from them that prints the matching lines, then run sed using this generated script:
#!/bin/bash
file=$1
shift
for id in "$@" ; do
    echo "/^$id /p"
done | sed -nf- "$file"
Run as script.sh filename 3 4.
If you want to remove the id from the output, you can either use
cut -f2 -d' '
or you can modify the generated sed script to do the work
echo "/^$id /s/.* //p"
i.e. only print if the substitution was successful.
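To make the generated script concrete: for script.sh links.txt 3 4, the loop feeds sed a script equivalent to
/^3 /p
/^4 /p
or, with the substitution variant, /^3 /s/.* //p and /^4 /s/.* //p.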
This loops through each argument and greps for it in the links file. The result is piped into cut, where we specify the delimiter as a space with the -d flag and the field number as 2 with the -f flag. Finally, the result is appended to the array called files.
links="links.txt"
files=()
for arg in $#; do
files=("${files[#]}" `grep "^$arg" "$links" | cut -d" " -f2`)
done;
echo ${files[#]}
Usage:
$ ./master.sh 1 4
a.sh d.sh
Edit:
As pointed out by mklement0, the solution above reads the file once per arg. The following first builds the pattern then reads the file just once.
links="links.txt"
pattern="^$1\s"
for arg in ${#:2}; do
pattern+="|^$arg\s"
done
files=$(grep -E "$pattern" "$links" | cut -d" " -f2)
echo ${files[#]}
Usage:
$ ./master.sh 1 4
a.sh d.sh
Here is another example with grep and cut:
#!/bin/bash
for line in $(grep "$1\|$2" links.txt|cut -d' ' -f2)
do
echo $line
done
Example of usage:
./master.sh 1 4
a.sh
d.sh
Why not just store the values and recall them at will:
items=()
while read -r num file
do
    items[num]="$file"
done < links.txt
for arg
do
    echo "${items[arg]}"
done
Now you can use the items array any time you like :)
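With the links.txt shown in the question, a run would look like this (hypothetical invocation, same calling convention as the other answers):
$ ./master.sh 1 4
a.sh
d.sh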
The following awk solution:
preserves the argument order; that is, the results reflect the order in which the lookup values were specified (as opposed to the order in which the lookup values happen to occur in the file).
If that is not important (i.e., if outputting the results in file order is acceptable), the readarray technique below can be combined with this one-liner, which is a generalized variant of Panta's answer:
grep -f <(printf "^%s\n" "$#") links.txt | cut -d' ' -f2-
performs well, because the input file is only read once; the only requirement is that all key-value pairs fit into memory as a whole (as a single associative Awk array (dictionary)).
works with any lookup values that don't have embedded whitespace.
Similarly, the assumption is that the output column values (containing values such as a.sh in the sample input) have no embedded whitespace. awk doesn't handle quoted fields well, so more work would be needed.
#!/bin/bash
readarray -t files < <(
awk -v idList="$*" '
BEGIN { count=split(idList, idArr); for (i in idArr) idDict[idArr[i]]++ }
$1 in idDict { idDict[$1] = $2 }
END { for (i=1; i<=count; ++i) print idDict[idArr[i]] }
' links.txt
)
# Print results.
printf '%s\n' "${files[@]}"
readarray -t files reads stdin input (<) line by line into array variable files.
Note: readarray requires Bash v4+; on Bash 3.x, such as on macOS, replace this part with
IFS=$'\n' read -d '' -ra files
<(...) is a Bash process substitution that, loosely speaking, presents the output from the enclosed command as if it were a (self-deleting) temporary file.
This technique allows readarray to run in the current shell (as opposed to a subshell if a pipeline had been used), which is necessary for the files variable to remain defined in the remainder of the script.
The awk command breaks down as follows:
-v idList="$*" passes the space-separated list of all command-line arguments as a single string to Awk variable idList.
Note that this assumes that the arguments have no embedded spaces, which is indeed the case here and also generally the case with identifiers.
BEGIN { ... } is only executed once, before the individual lines are processed:
split(idList, idArr) splits the input ID list into an array by whitespace and stores the result in idArr.
for (i in idArr) idDict[idArr[i]]++ then converts the (conceptually regular) array into associative array idDict (a dictionary) whose keys are the input IDs - this enables efficient lookup by ID later, and also allows storing the lookup result for each ID.
$1 in idDict { idDict[$1] = $2 } is processed for every input line:
Pattern $1 in idDict returns true if the line's first whitespace-separated field ($1) - e.g., 6 - is among the keys (in) of associative array idDict, and, if so, executes the associated action ({...}).
Action { idDict[$1] = $2 } then assigns the second field ($2) - e.g., c.sh - to the idDict entry for key $1.
END { ... } is executed once, after all input lines have been processed:
for (i=1; i<=count; ++i) print idDict[idArr[i]] loops over all input IDs in order and prints each ID's lookup result, which is the value of the dictionary entry with that ID.
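For example, with the links.txt from the question, passing the IDs out of file order shows the order preservation described above (hypothetical run, assuming the script is saved as script.sh):
$ ./script.sh 6 1
c.sh
a.sh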

Remove partial duplicates from text file

My bash-foo is a little rusty right now so I wanted to see if there's a clever way to remove partial duplicates from a file. I have a bunch of files containing thousands of lines with the following format:
String1|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
Essentially it's a bunch of pipe delimited strings, with the final two columns being a timestamp and x. What I'd like to do is concatenate all of my files and then remove all partial duplicates. I'm defining partial duplicate as a line in the file that matches from String1 up to String22, but the timestamp can be different.
For example, a file containing:
String1|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
String1|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 12:12:12|x
String124|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
would become:
String1|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
String124|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
(It doesn't matter which timestamp is chosen).
Any ideas?
Using awk you can do this:
awk '{k=$0; gsub(/(\|[^|]*){2}$/, "", k)} !seen[k]++' file
String1|String2|String3|String4|String5|String6|String7|09-Apr-2016 05:28:03|x
String124|String2|String3|String4|String5|String6|String7|09-Apr-2016 05:28:03|x
The awk command first builds a variable k by removing the last 2 fields from each line. Then it uses an associative array seen keyed on k, printing only the first instance of each key by recording every processed key in the array.
If you have Bash version 4, which supports associative arrays, it can be done fairly efficiently in pure Bash:
declare -A found
while IFS= read -r line || [[ -n $line ]] ; do
strings=${line%|*|*}
if (( ! ${found[$strings]-0} )) ; then
printf '%s\n' "$line"
found[$strings]=1
fi
done < "$file"
Same idea as @anubhava's, but I think more idiomatic:
$ awk -F'|' '{line=$0;$NF=$(NF-1)=""} !a[$0]++{print line}' file
String1|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
String124|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
