Adding an extra value into CSV data, according to filename - shell

Let's say I have filenames in the following formats:
CO#ATH2000.dat, CO#MAR2000.dat
Each of these contains data like the following:
....
"12-02-1984",3.8,4.1,3.8,3.8,3.8,3.7,4.1,4.3,3.8,4.1,5.0,4.8,4.5,4.3,4.3,4.3,4.1,4.5,4.3,4.3,4.3,4.5,4.3,4.1
"13-02-1984",3.7,4.3,4.3,4.3,4.1,4.3,4.5,4.8,4.8,5.0,5.2,5.0,5.2,5.2,5.2,4.8,4.8,4.8,4.8,4.8,4.8,4.8,4.5,4.3
"14-02-1984",3.8,4.1,3.8,3.8,3.8,3.8,3.8,4.2,4.5,4.5,4.1,3.6,3.6,3.4,3.4,3.2,3.4,3.2,3.2,3.2,2.9,2.7,2.5,2.2
"15-02-1984",2.2,2.2,2.0,2.0,2.0,1.8,2.1,2.6,2.6,2.5,2.4,2.4,2.4,2.5,2.7,2.7,2.6,2.6,2.7,2.6,2.8,2.8,2.8,2.8
..........
Now I also have the following .sh file that merges ALL those .dat files into one single output .dat file.
for filename in CO#*; do
cat "$filename" >> CO#combined.dat
done
Now here is the problem. Inside CO#combined.dat I want each line, before the start of the values, to carry a 'standard' value that depends on the filename. For example, I want every line coming from a file with ATH in its filename to start with 3, and every line from a file with MAR in its filename to start with 22,.
So the CO#combined.dat should be something like this:
....
3,"12-02-1984",3.8,4.1,3.8,3.8,3.8,3.7,4.1,4.3,3.8,4.1,5.0,4.8,4.5,4.3,4.3,4.3,4.1,4.5,4.3,4.3,4.3,4.5,4.3,4.1
3,"13-02-1984",3.7,4.3,4.3,4.3,4.1,4.3,4.5,4.8,4.8,5.0,5.2,5.0,5.2,5.2,5.2,4.8,4.8,4.8,4.8,4.8,4.8,4.8,4.5,4.3
20,"14-02-1984",3.8,4.1,3.8,3.8,3.8,3.8,3.8,4.2,4.5,4.5,4.1,3.6,3.6,3.4,3.4,3.2,3.4,3.2,3.2,3.2,2.9,2.7,2.5,2.2
20,"15-02-1984",2.2,2.2,2.0,2.0,2.0,1.8,2.1,2.6,2.6,2.5,2.4,2.4,2.4,2.5,2.7,2.7,2.6,2.6,2.7,2.6,2.8,2.8,2.8,2.8
..........
So, in conclusion, I want the script to do the above procedure.
Thanks in advance!

With awk you can take advantage of the built-in FILENAME variable along with the fact that you can supply multiple files to a given invocation. awk processes each file in turn, setting FILENAME to the name of the file whose records are currently being read.
With that you can set your prefix according to whatever pattern you wish to search for in the file name. Finally you can print the prefix and the original record.
Here's a demonstration on simplified versions of your sample input:
$ cat CO\#ATH2000.dat
1
2
3
$ cat CO\#MAR2000.dat
A
B
C
$ awk 'FILENAME ~ /MAR/ {pre=22} FILENAME ~ /ATH/ {pre=3} { print pre "," $0 }' CO*.dat
3,1
3,2
3,3
22,A
22,B
22,C
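Applied to the original files, something along these lines should produce the prefixed combined file in one go (a sketch; the globs are spelled out so that CO#combined.dat itself is never picked up as input):
awk 'FILENAME ~ /MAR/ {pre=22} FILENAME ~ /ATH/ {pre=3} { print pre "," $0 }' CO#ATH*.dat CO#MAR*.dat > CO#combined.dat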

This can also be done simply with a short shell loop:
for f in CO#*; do
case ${f:3:3} in
ATH) k=3 ;;
*) k=22 ;;
esac
sed "s/^/$k,/" "$f" >> all
done
${f:3:3} extracts the code (ATH or MAR) from the filename using bash's substring expansion; the case statement converts the code to its numerical counterpart; sed inserts the numerical value and a comma at the beginning of each line.
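For instance, with one of the sample filenames:
$ f='CO#ATH2000.dat'
$ echo "${f:3:3}"
ATH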

Related

How can I use sort or another bash command to get one line from all the lines whose 1st, 2nd and 3rd fields are the same

I have a file named file.txt
$cat file.txt
1./abc/cde/go/ftg133333.jpg
2./abc/cde/go/ftg24555.jpg
3./abc/cde/go/ftg133333.gif
4./abt/cte/come/ftg24555.jpg
5./abc/cde/go/ftg133333.jpg
6./abc/cde/go/ftg24555.pdf
MY GOAL: To get only one line from lines whose first, second and third PATH components are the same and which have the same file EXTENSION.
Note that each PATH component is separated by a forward slash "/". E.g. in the first line of the list, the first PATH component is abc, the second is cde and the third is go.
The file EXTENSION is .jpg, .gif, .pdf, ... always at the end of the line.
HERE IS WHAT I TRIED
sort -u -t '/' -k1 -k2 -k3
My thoughts
Using / as a delimiter gives me 4 fields in each line. Sorting them with "-u" will remove all but 1 line with unique First, Second and 3rd field/PATH. But obviously, I didn't take into account the EXTENSION(jpg,pdf,gif) in this case.
MY QUESTION
I need a way to grep only 1 of the lines if the first, second and third fields are the same and have the same EXTENSION, using "/" as the delimiter to divide each line into fields. I want to output it to another file, say file2.txt.
In file2.txt, how do I add a word, say "KALI", before the extension in each line, so it will look something like /abc/cde/go/ftg133333KALI.jpg, using line 1 of file.txt above as an example.
Desired Output
/abc/cde/go/ftg133333KALI.jpg
/abt/cte/come/ftg24555KALI.jpg
/abc/cde/go/ftg133333KALI.gif
/abc/cde/go/ftg24555KALI.pdf
COMMENT
Line 1,2 & 5 have the same 1st,2nd and 3rd field, with same file extension
".jpg" so only line 1 should be in the output.
Line 3 is in the output even though it has same 1st,2nd and 3rd field with
1,2 and 5, because the extension is different ".gif".
Line 4 has different 1st, 2nd and 3rd fields, hence it is in the output.
Line 6 is in the output even though it has same 1st,2nd and 3rd field with
1,2 and 5, because the extension is different ".pdf".
$ awk '{ # using awk
n=split($0,a,/\//) # split by / to get all path components
m=split(a[n],b,".") # split last by . to get the extension
}
m>1 && !seen[a[2],a[3],a[4],b[m]]++ { # if ext exists and is unique with 3 1st dirs
for(i=2;i<=n;i++) # loop component parts and print
printf "/%s%s",a[i],(i==n?ORS:"")
}' file
Output:
/abc/cde/go/ftg133333.jpg
/abc/cde/go/ftg133333.gif
/abt/cte/come/ftg24555.jpg
/abc/cde/go/ftg24555.pdf
I split by / separately from .s in case there are .s in dir names.
I missed the KALI part; here is a version that adds it:
$ awk '{
n=split($0,a,/\//)
m=split(a[n],b,".")
}
m>1&&!seen[a[2],a[3],a[4],b[m]]++ {
for(i=2;i<n;i++)
printf "/%s",a[i]
for(i=1;i<=m;i++)
printf "%s%s",(i==1?"/":(i==m?"KALI.":".")),b[i]
print ""
}' file
Output:
/abc/cde/go/ftg133333KALI.jpg
/abc/cde/go/ftg133333KALI.gif
/abt/cte/come/ftg24555KALI.jpg
/abc/cde/go/ftg24555KALI.pdf
Using awk:
$ awk -F/ '{ split($5, ext, "\\.")
if (!(($2,$3,$4,ext[2]) in files)) files[$2,$3,$4,ext[2]]=$0
}
END { for (f in files) {
sub("\\.", "KALI.", files[f])
print files[f]
}}' input.txt
/abt/cte/come/ftg24555KALI.jpg
/abc/cde/go/ftg133333KALI.gif
/abc/cde/go/ftg24555KALI.pdf
/abc/cde/go/ftg133333KALI.jpg
Another awk:
$ awk -F'[./]' '!a[$2,$3,$4,$NF]++' file
/abc/cde/go/ftg133333.jpg
/abc/cde/go/ftg133333.gif
/abt/cte/come/ftg24555.jpg
/abc/cde/go/ftg24555.pdf
This assumes . doesn't occur in directory names (not necessarily true in general).
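If the KALI insertion is also needed with this shorter approach, one option (a sketch, assuming the extension is always the final dot-separated part of the line) is to post-process with sed:
awk -F'[./]' '!a[$2,$3,$4,$NF]++' file | sed 's/\.\([^.]*\)$/KALI.\1/'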

Grep list (file) from another file

I'm new to bash and trying to extract a list of patterns from a file:
File1.txt
ABC
BDF
GHJ
base.csv (tried comma separated and tab delimited)
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
line 3 .."himk,n,hn.ujj., BDF"
etc
The desired output is something like
ABC
line 1..
line 2..(whole lines)
BDF
line 3..
and so on for each pattern from File1.txt.
The code I tried was:
#!/bin/bash
for i in *.txt # cycle through all files containing pattern lists
do
for q in "$i"; # # cycle through list
do
echo $q >>output.${i};
grep -f "${q}" base.csv >>output.${i};
echo "\n";
done
done
But the output is only the filename and then a list of strings without pattern names, e.g.
File1.txt
line 1...
line 2...
line 3..
so I don't know which pattern each string belongs to, and have to check and assign manually. Can you please point out my errors? Thanks!
grep can process multiple files in one go, and then has the attractive added bonus of indicating which file it found a match in.
grep -f File1.txt base.csv >output.txt
It's not clear what you hope for the inner loop to do; it will just loop over a single token at a time, so it's not really a loop at all.
If you want the output to be grouped per pattern, here's a loop which looks for one pattern at a time:
while read -r pat; do
echo "$pat"
grep "$pat" *.txt
done <File1.txt >output.txt
But the most efficient way to tackle this is to write a simple Awk script which processes all the input files at once, and groups the matches before printing them.
An additional concern is anchoring. grep "ABC" will find a match in 123DEABCXYZ; is this something you want to avoid? You can improve the regex, or, again, turn to Awk which gives you more control over where exactly to look for a match in a structured line.
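For instance, simply requiring whole-word matches already rules out accidental substring hits like 123DEABCXYZ (a sketch; whether -w is the right constraint depends on how the patterns actually appear in base.csv):
grep -w -f File1.txt base.csv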
awk '# Read patterns into memory
NR==FNR { a[++i] = $1; next }
# Loop across patterns
{ for(j=1; j<=i; ++j)
if($0 ~ a[j]) {
print FILENAME ":" FNR ":" $0 >> ("output." a[j])
next }
}' File1.txt base.csv
You're not actually reading the files, you're just handling the filenames. Try this:
#!/bin/bash
for i in *.txt # cycle through all files containing pattern lists
do
while read -r q # read file line by line
do
echo "$q" >>"output.${i}"
grep -f "${q}" base.csv >>"output.${i}"
echo "\n"
done < "${i}"
done
Here is one that uses split to separate the words in file2 (comma-separated, with quotes and spaces stripped off) into an array (word[]) and stores the record names (line 1 etc.) against each word, comma-separated:
awk '
NR==FNR {
n=split($0,tmp,/[" ]*(,|$)[" ]*/) # split words
for(i=2;i<=n;i++) # after first
if(tmp[i]!="") # non-empties
word[tmp[i]]=word[tmp[i]] (word[tmp[i]]==""?"":",") tmp[1] # hash rownames
record[tmp[1]]=$0 # store records
next
}
($1 in word) { # word found
n=split(word[$1],tmp,",") # get record names
print $1 ":" # output word
for(i=1;i<=n;i++) # and records
print record[tmp[i]]
}' file2 file1
Output:
ABC:
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
BDF:
line 3 .."himk,n,hn.ujj., BDF"
Thank you for your kind help, my friends.
I tried both variants above but kept getting various errors ("do" expected) or misbehavior (it got the names of the pattern blocks, e.g. ABC, BDF, but no lines).
I gave up for a while and then eventually tried another way.
While the base goal was to cycle through the pattern list files, search for the patterns in a huge file and write out specific columns from the lines found, I simply wrote:
for i in *.txt # cycle through files w/ patterns
do
grep -F -f "$i" bigfile.csv >> ${i}.out1 #greps all patterns from current file
cut -f 2,3,4,7 ${i}.out1>> ${i}.out2 # cuts columns of interest and writes them out to another file
done
I'm aware that this code could be improved using some fancy pipeline features, but it works perfectly as is; I hope it'll help somebody in a similar situation. You can easily add some echoes to write out the pattern list names as I initially requested.
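For what it's worth, the intermediate .out1 file can be dropped by piping grep straight into cut (a sketch of the same loop; "${i}.out" is just an illustrative output name, and cut without -d assumes tab-delimited data, as in the original):
for i in *.txt; do
grep -F -f "$i" bigfile.csv | cut -f 2,3,4,7 >> "${i}.out"
done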

Bash command to read a line based on the parameters I pass - perform column-based lookups

I have a file links.txt:
1 a.sh
3 b.sh
6 c.sh
4 d.sh
So, if I pass 1,4 as parameters to another file (master.sh), a.sh and d.sh should be stored in a variable.
sed '3!d' would print the 3rd line, but not the line that starts with 3. For that, you need sed '/^3 /!d'. The problem is you can't combine them for more lines, as this means "Delete everything that doesn't start with a 3", which means all other lines will be missed. So, use sed -n '/^3 /p' instead, i.e. don't print by default and tell sed what lines to print, not what lines to delete.
You can loop over the argument and create a sed script from them that prints the lines, then run sed using this output:
#!/bin/bash
file=$1
shift
for id in "$#" ; do
echo "/^$id /p"
done | sed -nf- "$file"
Run as script.sh filename 3 4.
If you want to remove the id from the output, you can either use
cut -f2 -d' '
or you can modify the generated sed script to do the work
echo "/^$id /s/.* //p"
i.e. only print if the substitution was successful.
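For example, invoked as script.sh links.txt 1 4 with the modified echo, the loop feeds sed a generated script equivalent to:
/^1 /s/.* //p
/^4 /s/.* //p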
This loops through each argument and greps for it in the links file. The result is piped into cut where we specify the delimiter as a space with -d flag and the field number as 2 with -f flag. Finally this is appended to the array called files.
links="links.txt"
files=()
for arg in $@; do
files=("${files[@]}" `grep "^$arg" "$links" | cut -d" " -f2`)
done;
echo ${files[@]}
Usage:
$ ./master.sh 1 4
a.sh d.sh
Edit:
As pointed out by mklement0, the solution above reads the file once per arg. The following first builds the pattern then reads the file just once.
links="links.txt"
pattern="^$1\s"
for arg in ${@:2}; do
pattern+="|^$arg\s"
done
files=$(grep -E "$pattern" "$links" | cut -d" " -f2)
echo ${files[@]}
Usage:
$ ./master.sh 1 4
a.sh d.sh
Here is another example with grep and cut:
#!/bin/bash
for line in $(grep "$1\|$2" links.txt|cut -d' ' -f2)
do
echo $line
done
Example of usage:
./master.sh 1 4
a.sh
d.sh
Why not just store the values and recall them at will:
items=()
while read -r num file
do
items[num]="$file"
done<links.txt
for arg
do
echo "${items[arg]}"
done
Now you can use the items array any time you like :)
The following awk solution:
preserves the argument order; that is, the results reflect the order in which the lookup values were specified (as opposed to the order in which the lookup values happen to occur in the file).
If that is not important (i.e., if outputting the results in file order is acceptable), the readarray technique below can be combined with this one-liner, which is a generalized variant of Panta's answer:
grep -f <(printf "^%s\n" "$@") links.txt | cut -d' ' -f2-
performs well, because the input file is only read once; the only requirement is that all key-value pairs fit into memory as a whole (as a single associative Awk array (dictionary)).
works with any lookup values that don't have embedded whitespace.
Similarly, the assumption is that the output column values (containing values such as a.sh in the sample input) have no embedded whitespace. awk doesn't handle quoted fields well, so more work would be needed.
#!/bin/bash
readarray -t files < <(
awk -v idList="$*" '
BEGIN { count=split(idList, idArr); for (i in idArr) idDict[idArr[i]]++ }
$1 in idDict { idDict[$1] = $2 }
END { for (i=1; i<=count; ++i) print idDict[idArr[i]] }
' links.txt
)
# Print results.
printf '%s\n' "${files[@]}"
readarray -t files reads stdin input (<) line by line into array variable files.
Note: readarray requires Bash v4+; on Bash 3.x, such as on macOS, replace this part with
IFS=$'\n' read -d '' -ra files
<(...) is a Bash process substitution that, loosely speaking, presents the output from the enclosed command as if it were a (self-deleting) temporary file.
This technique allows readarray to run in the current shell (as opposed to a subshell if a pipeline had been used), which is necessary for the files variable to remain defined in the remainder of the script.
The awk command breaks down as follows:
-v idList="$*" passes the space-separated list of all command-line arguments as a single string to Awk variable idList.
Note that this assumes that the arguments have no embedded spaces, which is indeed the case here and also generally the case with identifiers.
BEGIN { ... } is only executed once, before the individual lines are processed:
split(idList, idArr) splits the input ID list into an array by whitespace and stores the result in idArr.
for (i in idArr) idDict[idArr[i]]++ then converts the (conceptually regular) array into associative array idDict (dictionary), whose keys are the input IDs - this enables efficient lookup by ID later, and also allows storing the lookup result for each ID.
$1 in idDict { idDict[$1] = $2 } is processed for every input line:
Pattern $1 in idDict returns true if the line's first whitespace-separated field ($1) - e.g., 6 - is among the keys (in) of associative array idDict, and, if so, executes the associated action ({...}).
Action { idDict[$1] = $2 } then assigns the second field ($2) - e.g., c.sh - to the idDict entry for key $1.
END { ... } is executed once, after all input lines have been processed:
for (i=1; i<=count; ++i) print idDict[idArr[i]] loops over all input IDs in order and prints each ID's lookup result, which is the value of the dictionary entry with that ID.
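Sample invocation (assuming the script is saved as master.sh next to links.txt):
$ ./master.sh 1 4
a.sh
d.sh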

Find Replace using Values in another File

I have a directory of files, myFiles/, and a text file values.txt in which one column is a set of values to find, and the second column is the corresponding replace value.
The goal is to replace all instances of find values (first column of values.txt) with the corresponding replace values (second column of values.txt) in all of the files located in myFiles/.
For example...
values.txt:
Hello Goodbye
Happy Sad
Running the command would replace all instances of "Hello" with "Goodbye" in every file in myFiles/, as well as replace every instance of "Happy" with "Sad" in every file in myFiles/.
I've made as many attempts with awk/sed and so on as I could think of, but have failed to produce a command that performs the desired action.
Any guidance is appreciated. Thank you!
Read each line from values.txt
Split that line into 2 words
Use sed for each line to replace the 1st word with the 2nd word in all files in the myFiles/ directory
Note: I've used bash parameter expansion to split the line (${line% *} etc.), assuming values.txt is a space-separated, two-column file. If that's not the case, you may use awk or cut to split the line.
while read -r line;do
sed -i "s/${line#* }/${line% *}/g" myFiles/* # '-i' edits files in place and 'g' replaces all occurrences of patterns
done < values.txt
You can do what you want with awk.
#! /usr/bin/awk -f
# snarf in first file, values.txt
FNR == NR {
subs[$1] = $2
next
}
# apply replacements to subsequent files
{
for( old in subs ) {
while( start = index($0, old) ) {
len = length(old)
# splice in the replacement (assumes the replacement does not itself contain the old text)
$0 = substr($0, 1, start - 1) subs[old] substr($0, start + len)
}
}
print
}
When you invoke it, put values.txt as the first file to be processed.
Option One:
create a python script
with open('filename', 'r') as infile, etc., read in the values.txt file into a python dict with 'from' as key, and 'to' as value. close the infile.
use os.listdir (or glob) to read in the directory wanted, iterate over the files, and for each either popen a sed 's/from/to/g' or read in the file, iterating over all the lines and doing the find/replace on each line.
Option Two:
bash script
read in a from/to pair
invoke
perl -p -i -e 's/from/to/g' dirname/*.txt
done
The second is probably easier to write but offers less exception handling.
It's called 'Perl PIE' and it's a relatively famous hack for doing find/replace in lots of files at once.
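A minimal sketch of the second option, assuming values.txt is space-separated with exactly two columns (the \Q...\E keeps regex metacharacters in the search value from being interpreted):
while read -r from to; do
FROM="$from" TO="$to" perl -p -i -e 's/\Q$ENV{FROM}\E/$ENV{TO}/g' myFiles/*
done < values.txt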

How to extract one column of a csv file

If I have a csv file, is there a quick bash way to print out the contents of only any single column? It is safe to assume that each row has the same number of columns, but each column's content would have different length.
You could use awk for this. Change '$2' to the nth column you want.
awk -F "\"*,\"*" '{print $2}' textfile.csv
Yes. cat mycsv.csv | cut -d ',' -f3 will print the 3rd column.
The simplest way I was able to get this done was to just use csvtool. I had other use cases as well to use csvtool and it can handle the quotes or delimiters appropriately if they appear within the column data itself.
csvtool format '%(2)\n' input.csv
Replacing 2 with the column number will effectively extract the column data you are looking for.
I landed here looking to extract from a tab-separated file, so I thought I would add this.
cat textfile.tsv | cut -f2 -s
Where -f2 extracts the second column (cut fields are 1-indexed) and -s suppresses lines that contain no delimiter.
Here is a csv file example with 2 columns
myTooth.csv
Date,Tooth
2017-01-25,wisdom
2017-02-19,canine
2017-02-24,canine
2017-02-28,wisdom
To get the first column, use:
cut -d, -f1 myTooth.csv
f stands for Field and d stands for delimiter
Running the above command will produce the following output.
Output
Date
2017-01-25
2017-02-19
2017-02-24
2017-02-28
To get the 2nd column only:
cut -d, -f2 myTooth.csv
And here is the output
Output
Tooth
wisdom
canine
canine
wisdom
Another use case:
Your csv input file contains 10 columns and you want columns 2 through 5 and column 8, using comma as the separator.
cut uses -f (meaning "fields") to specify columns and -d (meaning "delimiter") to specify the separator. You need to specify the latter because some files may use spaces, tabs, or colons to separate columns.
cut -f 2-5,8 -d , myvalues.csv
cut is a command utility and here is some more examples:
SYNOPSIS
cut -b list [-n] [file ...]
cut -c list [file ...]
cut -f list [-d delim] [-s] [file ...]
I think the easiest is using csvkit:
Gets the 2nd column:
csvcut -c 2 file.csv
However, there's also csvtool, and probably a number of other csv bash tools out there:
sudo apt-get install csvtool (for Debian-based systems)
This would return a column with the first row having 'ID' in it.
csvtool namedcol ID csv_file.csv
This would return the fourth column:
csvtool col 4 csv_file.csv
If you want to drop the header row:
csvtool col 4 csv_file.csv | sed '1d'
First we'll create a basic CSV
[dumb@one pts]$ cat > file
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
Then we get the 1st column
[dumb@one pts]$ awk -F , '{print $1}' file
a
1
a
1
Many answers for this question are great and some have even looked into the corner cases.
I would like to add a simple answer that can be of daily use... for when you don't run into those corner cases (like escaped commas or commas inside quotes, etc.).
FS (Field Separator) is the variable whose value defaults to a space, so awk by default splits each line on spaces.
So using BEGIN (Execute before taking input) we can set this field to anything we want...
awk 'BEGIN {FS = ","}; {print $3}'
The above code will print the 3rd column in a csv file.
The other answers work well, but since you asked for a solution using just the bash shell, you can do this:
AirBoxOmega:~ d$ cat > file #First we'll create a basic CSV
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
And then you can pull out columns (the first in this example) like so:
AirBoxOmega:~ d$ while IFS=, read -a csv_line;do echo "${csv_line[0]}";done < file
a
1
a
1
a
1
a
1
a
1
a
1
So there's a couple of things going on here:
while IFS=, - this is saying to use a comma as the IFS (Internal Field Separator), which is what the shell uses to know what separates fields (blocks of text). So saying IFS=, is like saying "a,b" is the same as "a b" would be if the IFS=" " (which is what it is by default.)
read -a csv_line; - this is saying read in each line, one at a time and create an array where each element is called "csv_line" and send that to the "do" section of our while loop
do echo "${csv_line[0]}";done < file - now we're in the "do" phase, and we're saying echo the 0th element of the array "csv_line". This action is repeated on every line of the file. The < file part is just telling the while loop where to read from. NOTE: remember, in bash, arrays are 0 indexed, so the first column is the 0th element.
So there you have it, pulling out a column from a CSV in the shell. The other solutions are probably more practical, but this one is pure bash.
You could use GNU Awk, see this article of the user guide.
As an improvement to the solution presented in the article (in June 2015), the following gawk command allows double quotes inside double quoted fields; a double quote is marked by two consecutive double quotes ("") there. Furthermore, this allows empty fields, but even this can not handle multiline fields. The following example prints the 3rd column (via c=3) of textfile.csv:
#!/bin/bash
gawk -- '
BEGIN{
FPAT="([^,\"]*)|(\"((\"\")*[^\"]*)*\")"
}
{
if (substr($c, 1, 1) == "\"") {
$c = substr($c, 2, length($c) - 2) # Get the text within the two quotes
gsub("\"\"", "\"", $c) # Normalize double quotes
}
print $c
}
' c=3 < <(dos2unix <textfile.csv)
Note the use of dos2unix to convert possible DOS style line breaks (CRLF i.e. "\r\n") and UTF-16 encoding (with byte order mark) to "\n" and UTF-8 (without byte order mark), respectively. Standard CSV files use CRLF as line break, see Wikipedia.
If the input may contain multiline fields, you can use the following script. Note the use of special string for separating records in output (since the default separator newline could occur within a record). Again, the following example prints the 3rd column (via c=3) of textfile.csv:
#!/bin/bash
gawk -- '
BEGIN{
RS="\0" # Read the whole input file as one record;
# assume there is no null character in input.
FS="" # Suppose this setting eases internal splitting work.
ORS="\n####\n" # Use a special output separator to show borders of a record.
}
{
nof=patsplit($0, a, /([^,"\n]*)|("(("")*[^"]*)*")/, seps)
field=0;
for (i=1; i<=nof; i++){
field++
if (field==c) {
if (substr(a[i], 1, 1) == "\"") {
a[i] = substr(a[i], 2, length(a[i]) - 2) # Get the text within
# the two quotes.
gsub(/""/, "\"", a[i]) # Normalize double quotes.
}
print a[i]
}
if (seps[i]!=",") field=0
}
}
' c=3 < <(dos2unix <textfile.csv)
There is another approach to the problem. csvquote can output the contents of a CSV file modified so that special characters within fields are transformed, allowing the usual Unix text processing tools to be used to select a certain column.
csvquote textfile.csv | cut -d ',' -f 3 | csvquote -u
csvquote can be used to process arbitrary large files.
I needed proper CSV parsing, not cut / awk and prayer. I'm trying this on a mac without csvtool, but macs do come with ruby, so you can do:
echo "require 'csv'; CSV.read('new.csv').each {|data| puts data[34]}" | ruby
I wonder why none of the answers so far have mentioned csvkit.
csvkit is a suite of command-line tools for converting to and working
with CSV
csvkit documentation
I use it exclusively for csv data management and so far I have not found a problem that I could not solve using csvkit.
To extract one or more columns from a csv file you can use the csvcut utility that is part of the toolbox. To extract the second column use this command:
csvcut -c 2 filename_in.csv > filename_out.csv
csvcut reference page
If the strings in the csv are quoted, add the quote character with the q option:
csvcut -q '"' -c 2 filename_in.csv > filename_out.csv
Install with pip install csvkit or sudo apt install csvkit.
A simple solution using awk. Instead of "colNum", put the number of the column you need to print.
cat fileName.csv | awk -F ";" '{ print $colNum }'
csvtool col 2 file.csv
where 2 is the column you are interested in
you can also do
csvtool col 1,2 file.csv
to do multiple columns
You can't do it without a full CSV parser.
If you know your data will not be quoted, then any solution that splits on , will work well (I tend to reach for cut -d, -f1 | sed 1d), as will any of the CSV manipulation tools.
If you want to produce another CSV file, then xsv, csvkit, csvtool, or other CSV manipulation tools are appropriate.
If you want to extract the contents of one single column of a CSV file, unquoting them so that they can be processed by subsequent commands, this Python 1-liner does the trick for CSV files with headers:
python -c 'import csv,sys'$'\n''for row in csv.DictReader(sys.stdin): print(row["message"])'
The "message" inside of the print function selects the column.
If the CSV file doesn't have headers:
python -c 'import csv,sys'$'\n''for row in csv.reader(sys.stdin): print(row[1])'
Python's CSV library supports all kinds of CSV dialects, so if your CSV file uses different conventions, it's possible to support them with relatively little change to the code.
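For example, a semicolon-delimited file can be handled by passing the delimiter explicitly (a sketch; adjust the delimiter and column index to whatever your file actually uses):
python -c 'import csv,sys'$'\n''for row in csv.reader(sys.stdin, delimiter=";"): print(row[1])'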
I've been using this code for a while; it is not "quick" unless you count "cutting and pasting from stackoverflow".
It uses ${##} and ${%%} operators in a loop instead of IFS. It calls 'err' and 'die', and supports only comma, dash, and pipe as SEP chars (that's all I needed).
err() { echo "${0##*/}: Error:" "$@" >&2; }
die() { err "$@"; exit 1; }
# Return Nth field in a csv string, fields numbered starting with 1
csv_fldN() { fldN , "$1" "$2"; }
# Return Nth field in string of fields separated
# by SEP, fields numbered starting with 1
fldN() {
local me="fldN: "
local sep="$1"
local fldnum="$2"
local vals="$3"
case "$sep" in
-|,|\|) ;;
*) die "$me: arg1 sep: unsupported separator '$sep'" ;;
esac
case "$fldnum" in
[0-9]*) [ "$fldnum" -gt 0 ] || { err "$me: arg2 fldnum=$fldnum must be number greater or equal to 0."; return 1; } ;;
*) { err "$me: arg2 fldnum=$fldnum must be number"; return 1;} ;;
esac
[ -z "$vals" ] && err "$me: missing arg2 vals: list of '$sep' separated values" && return 1
fldnum=$(($fldnum - 1))
while [ $fldnum -gt 0 ] ; do
vals="${vals#*$sep}"
fldnum=$(($fldnum - 1))
done
echo ${vals%%$sep*}
}
Example:
$ CSVLINE="example,fields with whitespace,field3"
$ for fno in $(seq 3); do echo field$fno: $(csv_fldN $fno "$CSVLINE"); done
field1: example
field2: fields with whitespace
field3: field3
You can also use a while loop:
IFS=,
while read name val; do
echo "............................"
echo Name: "$name"
done<itemlst.csv
