Generate a column for each file matching a glob - bash

I'm having difficulties with something that sounds relatively simple. I have a few data files with single values in them as shown below:
data1.txt:
100
data2.txt:
200
data3.txt:
300
I have another file called header.txt; it's a template file that contains the header, as shown below:
Data_1 Data2 Data3
- - -
I'm trying to append the data from the data*.txt files as the last line of Master.txt.
The desired output would be something like this:
Data_1 Data2 Data3
- - -
100 200 300
I'm actively working on this but I'm not sure where to begin. This doesn't need to be implemented in pure shell -- use of standard UNIX tools such as awk or sed is entirely reasonable.

paste is the key tool:
#!/bin/bash
exec >>Master.txt
cat header.txt
paste -d '\n' data1.txt data2.txt data3.txt |
while read line1
do
read line2
read line3
printf '%-10s %-10s %-10s\n' "$line1" "$line2" "$line3"
done
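For reference, here is roughly what the paste step alone emits with the three sample files above (a quick sketch):
$ paste -d '\n' data1.txt data2.txt data3.txt
100
200
300
Each pass through the while loop therefore consumes one value per file, and printf lines them up as fixed-width columns.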

As a native-bash implementation:
#!/usr/bin/env bash
case $BASH_VERSION in ''|[123].*) echo "ERROR: Bash 4.0+ needed" >&2; exit 1;; esac
declare -A keys=( ) # define an associative array (a string->string map)
for f in data*.txt; do # iterate over data*.txt files
name=${f%.txt} # for each, remove the ".txt" extension to get our name...
keys[${name^}]=$(<"$f") # capitalize the first letter, and read the file to get the value
done
{ # start a group so we can redirect output just once
printf '%s\t' "${!keys[@]}"; echo # first line: keys in our associative array
printf '%s\t' "${keys[@]//*/-}"; echo # second line: convert values to dashes
printf '%s\t' "${keys[@]}"; echo # third line: print the values unmodified
} >>Master.txt # all the above with output redirected to Master.txt
Most of the magic here is performed by parameter expansions:
${f%.txt} trims the .txt extension from the end of $f
${name^} capitalizes the first letter of $name
"${keys[#]}" expands to all values in the array named keys
"${keys[#]//*/-} replaces * (everything) in each key with the fixed string -.
"${!keys[#]}" expands to the names of entries in the associative array keys.

Related

Extracting file content using a for loop [duplicate]

I'm working on a long Bash script. I want to read cells from a CSV file into Bash variables. I can parse lines and the first column, but not any other column. Here's my code so far:
cat myfile.csv|while read line
do
read -d, col1 col2 < <(echo $line)
echo "I got:$col1|$col2"
done
It's only printing the first column. As an additional test, I tried the following:
read -d, x y < <(echo a,b,)
And $y is empty. So I tried:
read x y < <(echo a b)
And $y is b. Why?
You need to use IFS instead of -d:
while IFS=, read -r col1 col2
do
echo "I got:$col1|$col2"
done < myfile.csv
To skip a given number of header lines:
skip_headers=3
while IFS=, read -r col1 col2
do
if ((skip_headers))
then
((skip_headers--))
else
echo "I got:$col1|$col2"
fi
done < myfile.csv
Note that for general purpose CSV parsing you should use a specialized tool which can handle quoted fields with internal commas, among other issues that Bash can't handle by itself. Examples of such tools are csvtool and csvkit.
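For instance, assuming csvkit is installed, a sketch of pulling out just the first two columns (with quoted commas handled for you) could look like:
csvcut -c 1,2 myfile.csv              # columns 1 and 2, still as CSV
csvcut -c 1,2 myfile.csv | tail -n +2 # the same, without the header row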
How to parse a CSV file in Bash?
Coming late to this question, and since bash now offers new features: because this question is about bash, and because none of the already-posted answers show this powerful and standards-compliant way of doing precisely this.
Parsing CSV files under bash, using a loadable module
Conforming to RFC 4180, a string like this sample CSV row:
12,22.45,"Hello, ""man"".","A, b.",42
should be split as
1 12
2 22.45
3 Hello, "man".
4 A, b.
5 42
Bash loadable C compiled modules.
Under bash, you can create, edit, and use loadable C compiled modules. Once loaded, they work like any other builtin! (You may find more information in the bash source tree.)
The current source tree (Oct 15 2021, bash V5.1-rc3) contains a bunch of samples:
accept listen for and accept a remote network connection on a given port
asort Sort arrays in-place
basename Return non-directory portion of pathname.
cat cat(1) replacement with no options - the way cat was intended.
csv process one line of csv data and populate an indexed array.
dirname Return directory portion of pathname.
fdflags Change the flag associated with one of bash's open file descriptors.
finfo Print file info.
head Copy first part of files.
hello Obligatory "Hello World" / sample loadable.
...
tee Duplicate standard input.
template Example template for loadable builtin.
truefalse True and false builtins.
tty Return terminal name.
uname Print system information.
unlink Remove a directory entry.
whoami Print out username of current user.
There is a full working CSV parser ready to use in the examples/loadables directory: csv.c!
On Debian GNU/Linux based systems, you may have to install the bash-builtins package:
apt install bash-builtins
Using loadable bash-builtins:
Then:
enable -f /usr/lib/bash/csv csv
From there, you could use csv as a bash builtin.
With my sample: 12,22.45,"Hello, ""man"".","A, b.",42
csv -a myArray '12,22.45,"Hello, ""man"".","A, b.",42'
printf "%s\n" "${myArray[#]}" | cat -n
1 12
2 22.45
3 Hello, "man".
4 A, b.
5 42
Then in a loop, processing a file.
while IFS= read -r line;do
csv -a aVar "$line"
printf "First two columns are: [ '%s' - '%s' ]\n" "${aVar[0]}" "${aVar[1]}"
done <myfile.csv
This approach is clearly quicker and more robust than any other combination of bash builtins or forking out to an external binary.
Unfortunately, depending on your system, if your version of bash was compiled without loadable builtin support, this may not work...
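If that is a concern, a small defensive check before using the builtin might look like this (a sketch; the library path is the Debian one shown above):
if ! enable -f /usr/lib/bash/csv csv 2>/dev/null; then
    echo "csv loadable builtin not available on this system" >&2
    exit 1
fi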
Complete sample with multiline CSV fields.
Conforming to RFC 4180, a string like this single CSV row:
12,22.45,"Hello ""man"",
This is a good day, today!","A, b.",42
should be split as
1 12
2 22.45
3 Hello "man",
This is a good day, today!
4 A, b.
5 42
Full sample script for parsing CSV containing multiline fields
Here is a small sample file with one header line, 4 columns and 3 rows. Because two fields contain newlines, the file is 6 lines long.
Id,Name,Desc,Value
1234,Cpt1023,"Energy counter",34213
2343,Sns2123,"Temperatur sensor
to trigg for alarm",48.4
42,Eye1412,"Solar sensor ""Day /
Night""",12199.21
And a small script able to parse this file correctly:
#!/bin/bash
enable -f /usr/lib/bash/csv csv
file="sample.csv"
exec {FD}<"$file"
read -ru $FD line
csv -a headline "$line"
printf -v fieldfmt '%-8s: "%%q"\\n' "${headline[@]}"
numcols=${#headline[@]}
while read -ru $FD line; do
    # keep appending physical lines until csv yields a full row of numcols fields
    while csv -a row "$line"; (( ${#row[@]} < numcols )); do
        read -ru $FD sline || break
        line+=$'\n'"$sline"
    done
    printf "$fieldfmt\\n" "${row[@]}"
done
This may render the following (I've used printf "%q" to represent non-printable characters, such as newlines, as $'\n'):
Id : "1234"
Name : "Cpt1023"
Desc : "Energy\ counter"
Value : "34213"
Id : "2343"
Name : "Sns2123"
Desc : "$'Temperatur sensor\nto trigg for alarm'"
Value : "48.4"
Id : "42"
Name : "Eye1412"
Desc : "$'Solar sensor "Day /\nNight"'"
Value : "12199.21"
You could find a full working sample there: csvsample.sh.txt or
csvsample.sh.
Note:
In this sample, I use the header line to determine the row width (number of columns). If your header line could hold newlines (or if your CSV uses more than one header line), you will have to pass the number of columns as an argument to your script (and the number of header lines).
Warning:
Of course, parsing CSV this way is not perfect! It works for many simple CSV files, but take care about encoding and security! For example, this module won't be able to handle binary fields!
Read the csv.c source code comments and RFC 4180 carefully!
From the man page:
-d delim
The first character of delim is used to terminate the input line,
rather than newline.
You are using -d, (a comma as the delimiter), which terminates the input line at the first comma. read will not consume the rest of the line. That's why $y is empty.
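A quick way to see the difference (a sketch):
$ read -d, x y < <(echo a,b,)    # read stops at the first comma; only "a" is consumed
$ declare -p x y
declare -- x="a"
declare -- y=""
$ IFS=, read -r x y <<< "a,b,c"  # IFS splits the whole line on commas instead
$ declare -p x y
declare -- x="a"
declare -- y="b,c"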
We can parse CSV files with quoted strings, delimited by say |, with the following code:
while read -r line
do
field1=$(echo "$line" | awk -F'|' '{printf "%s", $1}' | tr -d '"')
field2=$(echo "$line" | awk -F'|' '{printf "%s", $2}' | tr -d '"')
echo "$field1 $field2"
done < "$csvFile"
awk splits the line into fields and tr removes the quotes.
This is slightly slower, as awk is executed once for each field.
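If those per-field awk invocations become a bottleneck, one alternative (a sketch, assuming no field contains an embedded | or a quote that carries meaning) is to strip the quotes once with tr and let read do the splitting:
while IFS='|' read -r field1 field2
do
    echo "$field1 $field2"
done < <(tr -d '"' < "$csvFile")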
In addition to the answer from @Dennis Williamson, it may be helpful to skip the first line when it contains the header of the CSV:
{
read
while IFS=, read -r col1 col2
do
echo "I got:$col1|$col2"
done
} < myfile.csv
If you want to read a CSV file with an arbitrary number of columns per line, skipping the header line, this is one solution:
i=1
while IFS=, read -ra line
do
    test $i -eq 1 && ((i=i+1)) && continue
    for col_val in "${line[@]}"
    do
        echo -n "$col_val|"
    done
    echo
done < "$csvFile"

Reading CSV file in Shell Scripting

I am trying to read values from a CSV file dynamically based on the header. Here's what my input files can look like.
File 1:
name,city,age
john,New York,20
jane,London,30
or
File 2:
name,age,city,country
john,20,New York,USA
jane,30,London,England
I may not be following the best way to accomplish this but I tried the following code.
#!/bin/bash
{
read -r line
line=`tr ',' ' ' <<< $line`
while IFS=, read -r `$line`
do
echo $name
echo $city
echo $age
done
} < file.txt
I am expecting the above code to read the values of the header as the variable names. I know that the order of the columns can differ between input files, but I expect every file to have name, city and age columns. Is this the right approach? If so, what is the fix for the above code, which fails with the error "line 7: name: command not found"?
The issue is caused by the backticks. Bash will evaluate the contents and replace the backticks with the output from the command it just evaluated.
You can simply use the variable after the read command to achieve what you want:
#!/bin/bash
{
read -r line
line=`tr ',' ' ' <<< $line`
echo "$line"
while IFS=, read -r $line ; do
echo "person: $name -- $city -- $age"
done
} < file.txt
Some notes on your code:
The backtick syntax is legacy syntax, it is now preferred to use $(...) to evaluate commands. The new syntax is more flexible.
You can enable automatic script failure with set -euo pipefail. This will make your script stop if it encounters an error.
Your code is currently very sensitive to invalid header data:
with a file like
n ame,age,city,country
john,20,New York,USA
jane,30,London,England
your script (or rather the version in the beginning of my answer) will run without errors but with invalid output.
It is also good practice to quote variables to prevent unwanted splitting.
To make it much more robust, you can change it as follows:
#!/bin/bash
set -euo pipefail
# -e and -o pipefail will make the script exit
# in case of command failure (or piped command failure)
# -u will exit in case a variable is undefined
# (in you case, if the header is invalid)
{
read -r line
readarray -d, -t header < <(printf "%s" "$line")
# using an array makes it possible to detect whether one of the header
# entries contains an invalid character
# the printf is needed because bash would add a newline to the
# command input if using a here-string (<<<).
while IFS=, read -r "${header[@]}" ; do
echo "$name"
echo "$city"
echo "$age"
done
} < file.txt
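For illustration, if the script above is saved as robust_read.sh (a hypothetical name) and file.txt contains File 1 from the question, a run should print something like:
$ bash robust_read.sh   # robust_read.sh is just a placeholder name for the script above
john
New York
20
jane
London
30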
A slightly different approach lets awk handle the field separation and the ordering of the desired output, given either of the input files. Below, awk stores the desired output order in the f[] (field) array set in the BEGIN rule. Then, on the first line of each file (FNR==1), the a[] array is deleted and refilled with the headings from the current file. At that point you just loop over the field names in order in the f[] array and output the corresponding field from the current line, e.g.
awk -F, '
BEGIN { f[1]="name"; f[2]="city"; f[3]="age" } # desired order
FNR==1 { # on first line read header
delete a # clear a array
for (i=1; i<=NF; i++) # loop over headings
a[$i] = i # index by heading, val is field no.
next # skip to next record
}
{
print "" # optional newline between outputs
for (i=1; i<=3; i++) # loop over desired field order
if (f[i] in a) # validate field in a array
print $a[f[i]] # output fields value
}
' file1 file2
Example Use/Output
In your case with the content you show in file1 and file2, you would have:
$ awk -F, '
> BEGIN { f[1]="name"; f[2]="city"; f[3]="age" } # desired order
> FNR==1 { # on first line read header
> delete a # clear a array
> for (i=1; i<=NF; i++) # loop over headings
> a[$i] = i # index by heading, val is field no.
> next # skip to next record
> }
> {
> print "" # optional newline between outputs
> for (i=1; i<=3; i++) # loop over desired field order
> if (f[i] in a) # validate field in a array
> print $a[f[i]] # output fields value
> }
> ' file1 file2
john
New York
20
jane
London
30
john
New York
20
jane
London
30
Where both files are read and handled identically despite having different field orderings. Let me know if you have further questions.
If using Bash version ≥ 4.2, it is possible to use an associative array to capture an arbitrary number of fields with their names as keys:
#!/usr/bin/env bash
# Associative array to store column names as keys and their values
declare -A fields
# Array to store column names by index
declare -a column_name
# Array to store row's values
declare -a line
# Commands block consuming CSV input
{
# Read first line to capture column names
IFS=, read -r -a column_name
# Process records
while IFS=, read -r -a line; do
# Store each column value under its corresponding field name
for ((i=0; i<${#column_name[@]}; i++)); do
# Fills fields' associative array
fields["${column_name[i]}"]="${line[i]}"
done
# Dump fields for debug|demo purpose
# Processing of each captured value could go there instead
declare -p fields
done
} < file.txt
Sample output with File 2
declare -A fields=([country]="USA" [city]="New York" [age]="20" [name]="john" )
declare -A fields=([country]="England" [city]="London" [age]="30" [name]="jane" )
For older Bash versions without associative arrays, use indexed arrays of column names and values instead:
#!/usr/bin/env bash
# Array to store columns names with index
declare -a column_name
# Array to store values for a line
declare -a value
# Commands block consuming CSV input
{
# Read first line to capture column names
IFS=, read -r -a column_name
# Process records
while IFS=, read -r -a value; do
# Print record separator
printf -- '--------------------------------------------------\n'
# Print captured field name and value
for ((i=0; i<"${#column_name[#]}"; i++)); do
printf '%-18s: %s\n' "${column_name[i]}" "${value[i]}"
done
done
} < file.txt
Output:
--------------------------------------------------
name : john
age : 20
city : New York
country : USA
--------------------------------------------------
name : jane
age : 30
city : London
country : England

Split string in Array after specific delimited and New line

$string="name: Destination Administrator
description: Manage the destination configurations, certificates and subaccount trust.
readOnly:
roleReferences:
- roleTemplateAppId: destination-xsappname!b62
roleTemplateName: Destination_Administrator
name: Destination Administrator"
I have the above string, where each line is delimited by a newline character, and I'd like to create an array with two columns taken from the part after "-", as below:
Col1 col2
roleTemplateAppId destination-xsappname!b62
roleTemplateName Destination_Administrator
name Destination Administrator
I tried the below but it is not returning the correct array:
IFS='- ' read -r -a arrstring <<< "$string"
echo "${arrstring [1]}"
Assumptions:
OP is unable to use a yaml parser (per Léa's comment)
the input is guaranteed to have \n line endings (within the data)
the - only shows up in the one location (as depicted in OP's sample input); otherwise we need a better definition of where to start parsing the data
we're interested in parsing everything that follows the -
data is to be parsed based on a : delimiter, with the first field serving as the index in an associative array, while the 2nd field will be the value stored in the array
leading/trailing spaces to be removed from array indexes and values
One sed idea for pulling out just the lines we're interested in:
$ sed -n '/- /,${s/-//;p}' <<< "${string}"
roleTemplateAppId: destination-xsappname!b62
roleTemplateName: Destination_Administrator
name: Destination Administrator
Adding a few more bits to strip off leading/trailing spaces:
$ sed -n '/- /,${s/-//;s/^[ ]*//;s/[ ]*$//;s/[ ]*:[ ]*/:/;p}' <<< "${string}"
roleTemplateAppId:destination-xsappname!b62
roleTemplateName:Destination_Administrator
name:Destination Administrator
From here we'll feed this to a while loop where we'll populate the associative array
unset arrstring
declare -A arrstring # declare as an associative array
while IFS=':' read -r index value
do
arrstring["${index}"]="${value}"
done < <(sed -n '/- /,${s/-//;s/^[ ]*//;s/[ ]*$//;s/[ ]*:[ ]*/:/;p}' <<< "${string}")
Leaving us with:
$ typeset -p arrstring
declare -A arrstring=([roleTemplateAppId]="destination-xsappname!b62" [name]="Destination Administrator" [roleTemplateName]="Destination_Administrator" )
$ for i in "${!arrstring[@]}"
do
echo "$i : ${arrstring[$i]}"
done
roleTemplateAppId : destination-xsappname!b62
name : Destination Administrator
roleTemplateName : Destination_Administrator

split large text file into chunks by lines containing a specific character

I am trying to chunk a large text file (~27 Gb) into a series of smaller files, where the break points are defined by a subheader each of which contains the same symbol (in this case '#').
So the following large file:
#auniquestring
dataline1
dataline2
...
dataline33456
#aseconduniquestring
dataline33458
dataline33459
...
dataline124589
#athirdunqiuestring
dataline124591
dataline124592
...
...becomes:
1st file:
#auniquestring
dataline1
dataline2
...
dataline33456
2nd file:
#aseconduniquestring
dataline33458
dataline33459
...
dataline124589
3rd file:
#athirdunqiuestring
dataline124591
dataline124592
...
etc
I've tried things like sed -n '/#/,/#/p' myfile but it outputs everything at once, and misses the contents of every other subheader. Any help would be much appreciated
Using awk (NOTICE IT WILL CREATE FILES NAMED file[0-9]+.txt):
$ awk '
BEGIN {
file="file0.txt" # just in case
}
/^#/ { # when record starts with #
close(file) # close previous file
file=sprintf("file%d.txt",++f) # generate next filename
}
{
print > file # output to generated filename
}' file
Sample output:
$ cat file1.txt
#auniquestring
dataline1
dataline2
...
dataline33456
Modern Bash versions can match strings against regular expressions.
#! /bin/bash
n=1
while read -r line; do
if [[ $line =~ ^# ]]; then
exec >file$((n++))
fi
printf "%s\n" "$line"
done

How to iterate over text file having multiple-words-per-line using shell script?

I know how to iterate over lines of text when the text file has contents as below:
abc
pqr
xyz
However, what if the contents of my text file are as below,
abc xyz
cdf pqr
lmn rst
and I need to get the value "abc" stored in one variable and "xyz" stored in another variable. How would I do that?
read splits the line by $IFS as many times as you pass variables to it:
while read var1 var2 ; do
echo "var1: ${var1} var2: ${var2}"
done
You see, if you pass var1 and var2, the two columns go to separate variables. But note that if the line contained more columns, var2 would contain the whole remaining line, not just column 2.
Type help read for more info.
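A quick sketch of that remainder behaviour with a three-column line:
$ read -r var1 var2 <<< "abc xyz extra"
$ echo "var1: ${var1} var2: ${var2}"
var1: abc var2: xyz extra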
If the delimiter is a space then you can do:
#!/bin/bash
ALLVALUES=()
while read line
do
ALLVALUES+=( $line )
done < "/path/to/your/file"
So after, you can just reference an element by ${ALLVALUES[0]} or ${ALLVALUES[1]} etc
If you want to read every word in a file into a single array you can do it like this:
arr=()
while read -r -a _a; do
    arr+=("${_a[@]}")
done < infile
This uses -r to keep read from interpreting backslashes in the input and -a to have it split the words (on $IFS) into an array. It then appends all the elements of that array to the accumulating array while staying safe against globbing and other metacharacter expansion.
This awk command reads the input word by word:
awk -v RS='[[:space:]]+' '1' file
abc
xyz
cdf
pqr
lmn
rst
To populate a shell array, use the awk command in a process substitution:
arr=()
while read -r w; do
arr+=("$w")
done < <(awk -v RS='[[:space:]]+' '1' file)
And print the array content:
declare -p arr
declare -a arr='([0]="abc" [1]="xyz" [2]="cdf" [3]="pqr" [4]="lmn" [5]="rst")'
