Replacing a column in a CSV file with another in bash

I have a csv file with a number of columns. I am trying to replace the second column with the second to last column from the same file.
For example, if I have a file, sample.csv
1,2,3,4,5,6
a,b,c,d,e,f
g,h,i,j,k,l
I want to output:
1,5,3,4,5,6
a,e,c,d,e,f
g,k,i,j,k,l
Can anyone help me with this task? Also note that I will be discarding the last two columns afterwards with cut, so I am also open to splitting the csv file up front and replacing the column in one csv file with a column from another file, whichever is easier to implement. Thanks in advance for any help.

How about this simpler awk:
awk 'BEGIN{FS=OFS=","} {$2=$(NF-1)} 1' sample.csv
EDIT: Noticed that you also want to discard the last 2 columns. Use this awk one-liner:
awk 'BEGIN{FS=OFS=","} {$2=$(NF-1); NF=NF-2} 1' sample.csv
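For reference, running the second one-liner on sample.csv should produce output like this (shown here with GNU awk; decreasing NF is not reliably portable to very old awks):
1,5,3,4
a,e,c,d
g,k,i,j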

In bash
while IFS=, read -r -a arr; do
arr[1]="${arr[4]}";
printf -v output "%s," "${arr[@]}";
printf "%s\n" "${output%,}";
done < sample.csv

Pure bash solution, using IFS in a funny way:
# Set globally the IFS, you'll see it's funny
IFS=,
while read -ra a; do
a[1]=${a[@]: -2:1}
echo "${a[*]}"
done < file.csv
Setting the IFS variable globally means it is used twice: once in the read statement, so that each field is split on a comma, and once in the line echo "${a[*]}", where "${a[*]}" expands to the fields of the array a separated by IFS... which is a comma!
Another special thing: you mentioned the second-to-last field, and that's exactly what ${a[@]: -2:1} will expand to (mind the space between : and -2), so that you don't have to count your number of fields.
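As a quick standalone illustration of that negative-offset expansion (the values here are just made up):
a=(1 2 3 4 5 6)
echo "${a[@]: -2:1}"   # prints 5, the second-to-last element
echo "${a[@]: -1:1}"   # prints 6, the last element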
Caveat. CSV files really need a dedicated CSV parser, which is difficult to implement in shell. This answer (and I guess all the other answers that don't use a genuine CSV parser) might break if a field contains a comma, e.g.,
1,2,3,4,"a field, with a coma",5
If you want to discard the last two columns, don't use cut, but this instead:
IFS=,
while read -ra a; do
((${#a[@]}>=2)) || continue # skip lines that have fewer than two fields
a[1]=${a[@]: -2:1}
echo "${a[*]::${#a[@]}-2}"
done < file.csv
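Run against the sample data from the question (saved as file.csv), this loop should print:
1,5,3,4
a,e,c,d
g,k,i,j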

Related

IFS not parsing well CSV

I am trying to parse a file so I can obtain the first column. The command I'm using is:
while IFS=',' read -r a; do echo "$a"; done < test.csv
However it is still outputting the whole csv instead of the first column. An example of the csv is as follows:
NOM,CODI,DATA,SEXE,GRUP_EDAT,RESIDENCIA,CASOS_CONFIRMAT,PCR,INGRESSOS_TOTAL,INGRESSOS_CRITIC,INGRESSATS_TOTAL,INGRESSATS_CRITIC,EXITUS
MOIANÈS,42,24/08/2020,Home,Majors de 74,No,0,2,0,0,0,0,0
ALT CAMP,01,30/07/2020,Dona,Entre 15 i 64,Si,0,0,0,0,0,0,0
ALT CAMP,01,30/07/2020,Dona,Entre 65 i 74,No,0,1,0,0,0,0,0
ALT CAMP,01,30/07/2020,Dona,Entre 65 i 74,Si,0,0,0,0,0,0,0
I've been looking elsewhere and all seem to agree that this should be the correct approach to parsing csv with IFS. A thing I've noticed is that if I add a second variable to the read command, say b, it outputs the first column instead of everything.
while IFS=',' read -r a b; do echo "$a"; done < test.csv
I don't understand this behaviour and it does not seem to work beyond printing the first column. For example, if I were to add c and echo $c, it wouldn't print the third column, and so on.
Can you please explain this behaviour and why this is happening?
Thank you
read is working correctly. It splits on IFS and assigns each field to a variable, with the remainder of the line going to the last variable. If you only give one variable, the whole line goes to it.
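A tiny illustration of that behaviour, using a made-up line built from the header fields above:
line='NOM,CODI,DATA,SEXE'
IFS=',' read -r a <<< "$line"     # one variable: it gets the whole line
echo "$a"                         # NOM,CODI,DATA,SEXE
IFS=',' read -r a b <<< "$line"   # two variables: first field, then the remainder
echo "$a"                         # NOM
echo "$b"                         # CODI,DATA,SEXE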
bash is not the right tool for parsing a csv file and you should consider awk for this. e.g. to print the first 2 columns, use this super simple awk command:
awk -F, '{print $1, $2}' file.csv
Just to highlight your issue: regarding your bash loop, it is better to use an array and read all comma-separated columns into it:
while IFS=, read -ra arr; do
# print first 2 columns
echo "col1=${arr[0]}, col2=${arr[1]}"
done < file.csv
For simple CSV files, you can simply split on every comma, but unless you know the number of columns in every row, you want to read the input into an array.
For example, if you know there are going to be (at most) 10 columns, you can use
while IFS=, read -r f1 f2 f3 f4 f5 f6 f7 f8 f9 f10; do
However, in bash it is simpler to read the entire split line into a single array:
while IFS=, read -ra f; do
The first field would be "${f[0]}", the second "${f[1]}", etc.
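For example, a minimal sketch of the array form, assuming the test.csv from the question:
while IFS=, read -ra f; do
echo "first=${f[0]} second=${f[1]} fields=${#f[@]}"
done < test.csv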

How to use awk to split a file and store each filename in a Bash array

Input
A file called input_file.csv, which has 7 columns, and n rows.
Example header and row:
Date Location Team1 Team2 Time Prize_$ Sport
2016 NY Raptors Gators 12pm $500 Soccer
Output
n files, where the rows in each new file are grouped based on their values in column 7 of the original file. Each file is named after that shared value from column 7. Note: each file will have the same header. (The script currently does this.)
Example: if 2 rows in the original file had golf as their value for column 7, they would be grouped together in a file called golf.csv. If 3 other rows shared soccer as their value for column 7, they would be found in soccer.csv.
An array that has the name of each generated file in it. This array lives outside of the scope of awk. (This is what I need help with.)
Example: Array = [golf.csv, soccer.csv]
Situation
The following script produces the desired output. However, I want to run another script on each of the newly generated files and I don't know how.
Question:
My idea is to store the names of each new file in an array. That way, I can loop through the array and do what I want to each file. The code below passes a variable called myarray into awk, but I don't know how to add the name of each file to the array.
#!/bin/bash
ARRAY=()
awk -v myarray="$ARRAY" -F"\",\"" 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$7]) {print header >> ("" $7 ".csv"); files[$7]=1}; print $0 >> ("" $7 ".csv"); close("" $7 ".csv");}' input_file.csv
for i in "${ARRAY[#]}"
do
:
echo $i
done
Rather than struggling to get awk to fill your shell array variable, why not:
make sure that the *.csv files are created in a clean directory
use globbing to loop over all *.csv files in that directory?
awk -F'","' ... # your original Awk command
for i in *.csv # use globbing to loop over resulting *.csv files
do
:
echo $i
done
Just off the top of my head, untested because you haven't supplied very much sample data, what about this?
#!/usr/bin/awk -f
FNR==1 {
header=$0
next
}
!($7 in files) {
files[$7]=sprintf("sport-%s.csv", $7)
print header > files[$7]
}
{
print > files[$7]
}
END {
printf("declare -a sportlist=( ")
for (sport in files) {
printf("\"%s\"", sport)
}
printf(" )\n");
}
The idea here is that we index the array files[] by sport name and store the corresponding filename in it. (You can format the filename inside sprintf() as you see fit.) We step through the file, writing the header line whenever we hit a new sport with no recorded filename. Then every data line is printed to the file that matches its sport name.
For your second issue, exporting the array back to something outside of awk, the END block here will output a declare line which can be interpreted by bash. If you feel lucky, you can run this awk script inside command substitution and eval the result, so the declare command is effectively interpreted by your shell:
eval $(/path/to/awkscript inputfile.csv)
Or, if you subscribe to the school of thought that considers eval to be evil, you can redirect the awk script's standard output to a temporary file which you source:
/path/to/awkscript inputfile.csv > /tmp/yadda.$$
. /tmp/yadda.$$
(Don't use this temp file, make a real one with mktemp or the like.)
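A minimal sketch of that safer variant, assuming the awk script above is saved as /path/to/awkscript (the temp-file handling here is mine, not from the answer):
tmpfile=$(mktemp) || exit 1           # create a private temporary file
/path/to/awkscript inputfile.csv > "$tmpfile"
. "$tmpfile"                          # source the declare line into the current shell
rm -f -- "$tmpfile"
echo "${sportlist[@]}"                # the array is now available here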
There's no way for any program to modify the environment of the parent shell. Just have the awk script output the names of the files as standard output, and use command substitution to put them in an array.
filesArray=($(awk ... ))
If the files might have spaces in them, you need a different solution; assuming you're on bash 4, you can just be sure to print each file on a separate line and use readarray:
readarray -t filesArray < <( awk ... )
if the files might have newlines in them, too, then things get tricky...
If your file is not large, you can run another command to get the unique $7 values, for example
$ awk 'NR>1&&!a[$7]++{print $7}' sports
will print the values; you can change it to your file name format as well, such as
$ awk 'NR>1&&!a[$7]++{print tolower($7)".csv"}' sports
this can then be piped to your other process, here for example to wc
$ awk ... sports | xargs wc
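Putting those two pieces together as one concrete example (wc is just a stand-in for whatever you actually want to run, and the files must already exist from the splitting step):
awk 'NR>1&&!a[$7]++{print tolower($7)".csv"}' sports | xargs wc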
This will do what I THINK you want:
oIFS="$IFS"; IFS=$'\n'
array=( $(awk '{out=$7".csv"; print > out} !seen[out]++{print out}' input_file.csv) )
IFS="$oIFS"
If your input file really is comma-separated instead of space-separated as you show in the sample input in your question, then adjust the awk script to suit (you might want to look at GNU awk and FPAT).
If you don't have GNU awk then you'll need to add a bit more code to close the open output files as you go.
The above will fail if you have file names that contain newlines but will be fine for blank chars or other white space.
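As a rough sketch of the FPAT idea mentioned above, the awk part of the command could become something like this (GNU awk only; the pattern is the one from the GNU awk manual and only copes with simple double-quoted fields):
awk 'BEGIN{FPAT="([^,]+)|(\"[^\"]+\")"} {out=$7".csv"; print > out} !seen[out]++{print out}' input_file.csv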

Scripting username creation from text file?

I'm really new at Bash and scripting in general.
I have to create usernames formed of the first letter of the first name followed by the last name. To do it, I use a provided text file that looks like this:
doe,john
smith,mike
...
I declared the following variables:
fname=$(cut -d, -f2 "file.txt" | cut -c1)
lname=$(cut -d, -f1 "file.txt")
But how do I put the elements together to form the names jdoe and msmith? I tried the methods I know to concatenate strings and variables, but nothing works.
I think I found a method using awk that is supposed to work, but is there any other way to "concatenate" the elements of 2 lists?
Thank you
There's a million ways to do it, this is simplest:
$ awk -F, '{print substr($2,1,1) $1}' file
jdoe
msmith
Ed Morton's awk-based answer is simplest (and probably fastest), but since you asked for a different solution:
#!/usr/bin/env bash
while IFS=, read -r last first _; do
username=${first:0:1}${last}
echo "username: $username"
done < file.txt
IFS=, read -r last first _ reads the first 2 ,-separated fields from each input line (_ is a dummy variable that receives the rest of the input line, if any; -r prevents interpretation of \ chars. in the input, which is usually what you want).
username=${first:0:1}${last} concatenates the 1st char. of variable $first's value with variable $last's value, simply by placing the two variable references next to each other.
${first:0:1} - extract 1 character from $first at position 0 - is an example of parameter expansion, specifically: substring expansion
< file.txt is an input redirection that sends file.txt's contents via stdin to the while loop.
This looks a bit too much like homework, so I'll just drop some hints.
To read the lastname and firstname into separate variables for each line of the file, see BashFAQ 1. It should not involve cut.
To grab the first character of a variable, see BashFAQ 100.

First line of text file not read in shell

Okay, so I've compiled this code to read a text file; however, it successfully finds the sum of the needed column on every line except the first! Hence it gives me the wrong summation, which excludes the value on the first line it reads in. It sets the value $line = ddsdfj:jdskf:1:fjf but never extracts the 1 from the first line. Any clues would be appreciated.
FILE=$1
while read line
do
awk -F: '{summation += $3;}END{print summation;}'
done < $FILE
The while loop is completely superfluous. It looks like what you want is
awk -F: '{s+=$3}END{print s}' "$1"
quite simply.
The code you had would read the first line with read, then pass the remaining lines as standard input to awk; hence the behavior you were observing. Something like
while read line; do
awk -F: '{s+=$3}END{print s}' <<<"$line"
done <"$1"
would have actually used the value from line for something, but of course, that would just extract the third field from each line individually, rather than performing any actual addition of values across lines.

How to parse a CSV in a Bash script?

I am trying to parse a CSV containing potentially 100k+ lines. Here are the criteria I have:
The index of the identifier
The identifier value
I would like to retrieve all lines in the CSV that have the given value in the given index (delimited by commas).
Any ideas, taking in special consideration for performance?
As an alternative to cut- or awk-based one-liners, you could use the specialized csvtool aka ocaml-csv:
$ csvtool -t ',' col "$index" - < csvfile | grep "$value"
According to the docs, it handles escaping, quoting, etc.
See this youtube video: BASH scripting lesson 10 working with CSV files
CSV file:
Bob Brown;Manager;16581;Main
Sally Seaforth;Director;4678;HOME
Bash script:
#!/bin/bash
OLDIFS=$IFS
IFS=";"
while read user job uid location
do
echo -e "$user \
======================\n\
Role :\t $job\n\
ID :\t $uid\n\
SITE :\t $location\n"
done < $1
IFS=$OLDIFS
Output:
Bob Brown ======================
Role : Manager
ID : 16581
SITE : Main
Sally Seaforth ======================
Role : Director
ID : 4678
SITE : HOME
First prototype using plain old grep and cut:
grep "${VALUE}" inputfile.csv | cut -d, -f"${INDEX}"
If that's fast enough and gives the proper output, you're done.
CSV isn't quite that simple. Depending on the limits of the data you have, you might have to worry about quoted values (which may contain commas and newlines) and escaping quotes.
So if your data are restricted enough that you can get away with simple comma-splitting, a shell script can do that easily. If, on the other hand, you need to parse CSV ‘properly’, bash would not be my first choice. Instead I'd look at a higher-level scripting language, for example Python with a csv.reader.
In a CSV file, each field is separated by a comma. The problem is, a field itself might have an embedded comma:
Name,Phone
"Woo, John",425-555-1212
You really need a library package that offers robust CSV support instead of relying on using a comma as a field separator. I know that scripting languages such as Python have such support. However, I am comfortable with the Tcl scripting language so that is what I use. Here is a simple Tcl script which does what you are asking for:
#!/usr/bin/env tclsh
package require csv
package require Tclx
# Parse the command line parameters
lassign $argv fileName columnNumber expectedValue
# Subtract 1 from columnNumber because Tcl's list index starts with a
# zero instead of a one
incr columnNumber -1
for_file line $fileName {
set columns [csv::split $line]
set columnValue [lindex $columns $columnNumber]
if {$columnValue == $expectedValue} {
puts $line
}
}
Save this script to a file called csv.tcl and invoke it as:
$ tclsh csv.tcl filename indexNumber expectedValue
Explanation
The script reads the CSV file line by line and stores each line in the variable $line, then splits the line into a list of columns (variable $columns). Next, it picks out the specified column and assigns it to the $columnValue variable. If there is a match, it prints out the original line.
Using awk:
export INDEX=2
export VALUE=bar
awk -F, '$'$INDEX' ~ /^'$VALUE'$/ {print}' inputfile.csv
Edit: As per Dennis Williamson's excellent comment, this could be much more cleanly (and safely) written by defining awk variables using the -v switch:
awk -F, -v idx="$INDEX" -v value="$VALUE" '$idx == value {print}' inputfile.csv
Jeez...with variables, and everything, awk is almost a real programming language...
For situations where the data does not contain any special characters, the solution suggested by Nate Kohl and ghostdog74 is good.
If the data contains commas or newlines inside the fields, awk may not properly count the field numbers and you'll get incorrect results.
You can still use awk, with some help from a program I wrote called csvquote (available at https://github.com/dbro/csvquote):
csvquote inputfile.csv | awk -F, -v idx="$INDEX" -v value="$VALUE" '$idx == value {print}' | csvquote -u
This program finds special characters inside quoted fields, and temporarily replaces them with nonprinting characters which won't confuse awk. Then they get restored after awk is done.
index=1
value=2
awk -F"," -v i=$index -v v=$value '$(i)==v' file
I was looking for an elegant solution that supports quoting and wouldn't require installing anything fancy on my VMware vMA appliance. It turns out this simple Python script does the trick! (I named the script csv2tsv.py, since it converts CSV into tab-separated values - TSV.)
#!/usr/bin/env python
import sys, csv
with sys.stdin as f:
reader = csv.reader(f)
for row in reader:
for col in row:
print col+'\t',
print
Tab-separated values can be split easily with the cut command (no delimiter needs to be specified, tab is the default). Here's a sample usage/output:
> esxcli -h $VI_HOST --formatter=csv network vswitch standard list |csv2tsv.py|cut -f12
Uplinks
vmnic4,vmnic0,
vmnic5,vmnic1,
vmnic6,vmnic2,
In my scripts I'm actually going to parse tsv output line by line and use read or cut to get the fields I need.
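A minimal sketch of that line-by-line approach (column 12 is the Uplinks field in the example output above; adjust the index for your own data):
esxcli -h "$VI_HOST" --formatter=csv network vswitch standard list | csv2tsv.py |
while IFS=$'\t' read -r -a fields; do
echo "uplinks: ${fields[11]}"      # bash arrays are zero-based, so field 12 is index 11
done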
Parsing CSV with primitive text-processing tools will fail on many types of CSV input.
xsv is a lovely and fast tool for doing this properly. To search for all records that contain the string "foo" in the third column:
cat file.csv | xsv search -s 3 foo
A sed or awk solution would probably be shorter, but here's one for Perl:
perl -F/,/ -ane 'print if $F[<INDEX>] eq "<VALUE>"'
where <INDEX> is 0-based (0 for first column, 1 for 2nd column, etc.)
Awk (gawk) actually provides extensions, one of which is csv processing.
Assuming that extension is installed, you can use awk to show all lines where a specific csv field matches 123.
Assuming test.csv contains the following:
Name,Phone
"Woo, John",425-555-1212
"James T. Kirk",123
The following will print all lines where the Phone (aka the second field) is equal to 123:
gawk -l csv 'csvsplit($0,a) && a[2] == 123 {print $0}' test.csv
The output is:
"James T. Kirk",123
How does it work?
-l csv asks gawk to load the csv extension by looking for it in $AWKLIBPATH;
csvsplit($0, a) splits the current line, and stores each field into a new array named a
&& a[2] == 123 checks that the second field is 123
if both conditions are true, it { print $0 }, aka prints the full line as requested.
