Bash/*NIX: split a file into multiple files on a substring - bash

Variants of this question have been asked and answered before, but I find that my sed/grep/awk skills are far too rudimentary to work from those to a custom solution since I hardly ever work in shell scripts.
I have a rather large (100K+ lines) text file in which each line defines a GeoJSON object, each such object including a property called "county" (there are, all told, 100 different counties). Here's a snippet:
{"type": "Feature", "properties": {"county":"ALAMANCE", "vBLA": 0, "vWHI": 4, "vDEM": 0, "vREP": 2, "vUNA": 2, "vTOT": 4}, "geometry": {"type":"Polygon","coordinates":[[[-79.537429,35.843303],[-79.542428,35.843303],[-79.542428,35.848302],[-79.537429,35.848302],[-79.537429,35.843303]]]}},
{"type": "Feature", "properties": {"county":"NEW HANOVER", "vBLA": 0, "vWHI": 0, "vDEM": 0, "vREP": 0, "vUNA": 0, "vTOT": 0}, "geometry": {"type":"Polygon","coordinates":[[[-79.532429,35.843303],[-79.537428,35.843303],[-79.537428,35.848302],[-79.532429,35.848302],[-79.532429,35.843303]]]}},
{"type": "Feature", "properties": {"county":"ALAMANCE", "vBLA": 0, "vWHI": 0, "vDEM": 0, "vREP": 0, "vUNA": 0, "vTOT": 0}, "geometry": {"type":"Polygon","coordinates":[[[-79.527429,35.843303],[-79.532428,35.843303],[-79.532428,35.848302],[-79.527429,35.848302],[-79.527429,35.843303]]]}},
I need to split this into 100 separate files, each containing one county's GeoJSONs, and each named xxxx_bins_2016.json (where xxxx is the county's name). I'd also like the final character (comma) at the end of each such file to go away.
I'm doing this in Mac OSX, if that matters. I hope to learn a lot by studying any solutions you could suggest, so if you feel like taking the time to explain the 'why' as well as the 'what' that would be fantastic. Thanks!
EDITED to make clear that there are different county names, some of them two-word names.

jq can kind of do this; it can group the input and output one line of text per group. The shell then takes care of writing each line to an appropriately named file. jq itself doesn't really have the ability to open files for writing that would allow you to do this in a single process.
jq -Rn -c '[inputs[:-1]|fromjson] | group_by(.properties.county)[]' tmp.json |
while IFS= read -r line; do
county=$(jq -r '.[0].properties.county' <<< $line)
jq -r '.[]' <<< "$line" > "$county.txt"
done
[inputs[:-1]|fromjson] reads each line of your file as a string, strips the trailing comma, then parses the line as JSON and wraps the lines into a single array. The resulting array is sorted and grouped by county name, then written to standard output, one group per line.
The shell loop reads each line, extracts the county name from the first element of the group with a call to jq, then uses jq again to write each element of the group to the appropriate file, again one element per line.
(A quick look at https://github.com/stedolan/jq/issues doesn't appear to show any requests yet for an output function that would let you open and write to a file from inside a jq filter. I'm thinking of something like
jq -Rn '... | group_by(.properties.county) | output("\(.properties.county).txt")' tmp.json
without the need for the shell loop.)

If using string parsing rather than proper JSON parsing to extract the county name is acceptable - brittle in general, but would work in this simple case - consider Sam Tolton's GNU awk answer, which has the potential to be by far the simplest and fastest solution.
To complement chepner's excellent answer with a variation that focuses on performance:
jq -Rrn '[inputs[:-1]|fromjson] | .properties.county + "|" + (.|tostring)' file |
awk -F'|' '{ print $2 > ($1 "_bins_2016.json") }'
Shell loops are avoided altogether, which should speed up the operation.
The general idea is:
Use jq to trim the trailing , from each input line, interpret the trimmed string as JSON, extract the county name, then output the trimmed JSON strings prepended with the county name and a distinct separator, |.
Use an awk command to split each line into the prepended county name and the trimmed JSON string, which allows awk to easily construct the output filename and write the JSON string to it.
Note: The awk command keeps all output files open until the script has finished, which means that, in your case, 100 output files will be open simultaneously - a number that shouldn't be a problem, however.
In cases where it is a problem, you can use the following variation, in which jq first sorts the lines by county name, which then allows awk to immediately close the previous output field whenever the next county is reached in the input:
jq -Rrn '
[inputs[:-1]|fromjson] | sort_by(.properties.county)[] |
.properties.county + "|" + (.|tostring)
' file |
awk -F'|' '
prevCounty != $1 { if (outFile) close(outFile); outFile = $1 "_bins_2016.json" }
{ print $2 > outFile; prevCounty = $1 }
'

A simpler version of chepner's answer:
while IFS= read -r line
do
countyName=$(jq --raw-output '.properties.county' <<<"${line: : -1}")
jq <<< "${line: : -1}" >> "$countyName"_bins_2016.json
done<file
The idea is to filter the county name using a jq filter after stripping the , from each line of your input file. Then the line is passed to jq as plain stream to produce a JSON file in prettified format.
If you are from a relatively older version of bash (< 4.0) use "${line%?}" over "${line: : -1}"
For example with the change above, one of your county becomes,
cat ALAMANCE_bins_2016.json
{
"type": "Feature",
"properties": {
"county": "ALAMANCE",
"vBLA": 0,
"vWHI": 0,
"vDEM": 0,
"vREP": 0,
"vUNA": 0,
"vTOT": 0
},
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
-79.527429,
35.843303
],
[
-79.532428,
35.843303
],
[
-79.532428,
35.848302
],
[
-79.527429,
35.848302
],
[
-79.527429,
35.843303
]
]
]
}
}
Note: The current solution could be performance intensive as reading file line by line is an expensive operation, and equally invoking jq for each of the lines.

This will do what you want minus getting rid of the last comma:-
gawk 'match($0, /"county":"([^"]+)/, array){ print >array[1]"_bins_2016.json" }' INPUT_FILE
This will output files in the current path with a filename in the format COUNTRY NAME_bins_2016.json.
The script goes line by line and uses a regex to match the exact term "country":" followed by 1 or more characters that aren't a ". It captures the characters within the quotes and then uses it as part of the filename to append the current line to.
To remove the trailing comma from all .json files in the current path you could use:-
sed -i '$ s/,$//' *.json
If you were certain that the last char was always a comma, a faster solution would be to use truncate:-
truncate -s-1 *.json
Last part taken from this answer: https://stackoverflow.com/a/40568723/1453798

Here is a quickie script that will do the job. It has the virtue of working on most systems without having to install any other tools.
IFS=$'\n'
counties=( $( sed 's/^.*"county":"//;s/".*$//' counties.txt ) )
unset IFS
for county in "${!counties[#]}"
do
county="${counties[$i]}"
filename="$county".out.txt
echo "'$filename'"
grep "\"$county\"" counties.txt > "$filename"
done
The setting of IFS to \n allows the array elements to contain spaces. The sed command strips off all the text up to the start of the county name and all the text after it. The for loop is the form that allows iterating over the array. Finally, the grep command needs to have double quotes around the search string so that counties that are substrings of other counties don't accidentally get put into the wrong file.
See this section of the GNU BASH Reference Manual for more info.

Related

Sorting the contents within a column using Shell Script Line by Line in a File

I am Sorting a File using a column using the command -
cat myFile | sort -u -k3
Now i want to Sort Data within a Column of a File. Can anyone please help and tell me how can i achieve it?
My Data Looks like this in the File names Student.csv -
Name,Age,Marks,Grades
Sam,21,"34,56,21,67","C,B,D,A"
Josh,25,"90,89,78,45","A,A,B,C"
Output-
Name,Age,Marks,Grades
Sam,21,"21,34,56,67","A,B,C,D"
Josh,25,"45,78,89,90","A,A,B,C"
Will Appreciate the help, Thanks
You should export your CSV with a field separator that does not exist within the texts. Otherwise it becomes hugely cumbersome to deal with this.
Afterwards you can easily sort by specifying the separator and the field.
Example if you would use | as separator:
Name|Age|Marks|Grades
Sam|21|"34,56,21,67"|"C,B,D,A"
Josh|25|"90,89,78,45"|"A,A,B,C"
Then execute:
cat myFile | sort -u -k3 -t\|
or:
sort -u -k3 -t\| <myFile
Afterwards you could be putting your semi-colons back:
sort -u -k3 -t\| <myFile | sed 's/|/;/g'
Did it, but I'm too tired to explain how; brain's hitting a brick wall. There's a lot to unpack there, and it'll take half-a-day to explain. I'll write all the steps in a couple hours after I get a nap in, otherwise there's gonna be 50 typos in that description.
cat Student.csv | head -n1 && cat Student.csv | tail -n+2 | awk -F \" '{split($2,a,",");asort(a);b="";for(i in a)b=b a[i] ",";split($4,c,",");asort(c);d="";for(i in c)d=d c[i] ",";printf "%s\"%s\",\"%s\"\n",$1,substr(b,1,length(b)-1),substr(d,1,length(d)-1)}'
Alternatively:
cat Student.csv | tee >(head -n1) >(tail -n+2 | awk -F \" '{split($2,a,",");asort(a);b="";for(i in a)b=b a[i] ",";split($4,c,",");asort(c);d="";for(i in c)d=d c[i] ",";printf "%s\"%s\",\"%s\"\n",$1,substr(b,1,length(b)-1),substr(d,1,length(d)-1)}') >/dev/null ; sleep 0.1
Output:
Name,Age,Marks,Grades
Sam,21,"21,34,56,67","A,B,C,D"
Josh,25,"45,78,89,90","A,A,B,C"
https://www.tutorialspoint.com/awk/index.htm
Edit -- 'kay, the explaination:
cat concatenates (glues) files together, but when you just give it one arg, then that's what it prints out.
You can do the next part in one or two steps, I'll explain the first method. | pipe directs the output to another command. We all know this, or we wouldn't be here right now... however someday, someone will come across this post, and wonder what it does.
head prints out the first few lines of what you give it. Here, I specified -n1 number of lines = one, so it would print out the header:
Name,Age,Marks,Grades
&& continues to the next command, so long as that initial instruction was a success.
cat Student.csv again, but this time piped into tail, which prints the last few lines, of whatever you give it. -n+2 specifies to spit out everything from line number 2, and beyond.
We then pipe those contents into AWK https://en.wikipedia.org/wiki/AWK ...I'm sure you could do it with sed https://en.wikipedia.org/wiki/Sed, and I started with that, but sed tends to be more simple than awk, so you'd need to do far more chained-commands to achieve the same thing. Lisp might be able to do it more concicely, but it sounded like you were asking for shell builtins. Python's also decent with strings, but again, sh.
-F \" delegates a literal " as the field separator, so that we can group the contents into 3 categories:
Sam,21, " 34,56,21,67 " , "C,B,D,A"
$1 = Sam,21,
$2 = 34,56,21,67
$3 = ,
$4 = C,B,D,A
You actually get 4, but I'm throwing out that comma in the third position. It's easy enough to put it back in.
We now need to sort those numbers, so split($2,a,",") returns an array, in this case, named a, from the contents of $2, which has been delimited by the , symbol.
a = [ 34, 56, 21, 67 ]
; separates AWK commands, you can mostly ignore those. If there were simply a space, awk would try to concatenate items together, and we don't want that yet.
Next, array sort asort( a ), the contents of a -- https://www.tutorialspoint.com/awk/awk_string_functions.htm
a = [ 21, 34, 56, 67 ]
Here would be a perfect time for Python's string .join() method https://www.w3schools.com/python/ref_string_join.asp
However, we don't have that available to us, and AWK doens't seem to have it, as far as I know, so we have to roll our own here. So construct string, b, whose contents will be appended by each item in a. Single-quotes often won't do in commandline, so you'll see double-quotes.
b=""
for( i in a ) b=b a[i] ","
b begins empty. Iterating a for-loop over a's contents, we arrive at an appending which includes commas. Leave the trailing comma for now, it'll get trimmed off in a bit.
21,34,56,67,
Exact same procedure for $4, but we name the array c this time, and the string in which those contents are contatenaded with commas, d -- split( $4, c, "," ) ; asort( c ) ; d="" ; for( i in c ) d=d c[i] "," You can name them anything you like, just happened to have ABCD staring me in the face from those grade listings, so that's what I went with.
OK, now we have everything we need.
$1 = Sam,21,
b = 21,34,56,67,
d = A,B,C,D,
Let's format a string so they're all together.
printf "%s\"%s\",\"%s\"\n"
This will print $1 in the first %s string position, then a literal double-quote,
b into the second %s string position, next ",",
followed by d in the third %s position,
all wrapped up with a final double-quote and a newline.
However, b and d both have trailing commas, so we trim those off with AWK's substr() command. -- https://www.tutorialspoint.com/awk/awk_string_functions.htm Knowing where to begin is easy enough, but we need to chop those at one-from-the-end.
substr( b, 1, length(b) -1 )
substr( d, 1, length(d) -1 )
It'd be nice if you could just specify -2, and have it count backwards, like you can in Lua, Python, et al... but that doesn't seem to do in AWK, so whatevs. Ya live, ya learn. And there you have it, all your ducks in a row.
Sam,21,"21,34,56,67","A,B,C,D"
This does, maybe not elegantly, but it's within the required guidelines. I'm sure there's possibilities of code-golfing in there somewhere, but it's solid logic you can follow.

Unix bash - using cut to regex lines in a file, match regex result with another similar line

I have a text file: file.txt, with several thousand lines. It contains a lot of junk lines which I am not interested in, so I use the cut command to regex for the lines I am interested in first. For each entry I am interested in, it will be listed twice in the text file: Once in a "definition" section, another in a "value" section. I want to retrieve the first value from the "definition" section, and then for each entry found there find it's corresponding "value" section entry.
The first entry starts with ' gl_ ', while the 2nd entry would look like ' "gl_ ', starting with a '"'.
This is the code I have so far for looping through the text document, which then retrieves the values I am interested in and appends them to a .csv file:
while read -r line
do
if [[ $line == gl_* ]] ; then (param=$(cut -d'\' -f 1 $line) | def=$(cut -d'\' -f 2 $line) | type=$(cut -d'\' -f 4 $line) | prompt=$(cut -d'\' -f 8 $line))
while read -r glline
do
if [[ $glline == '"'$param* ]] ; then val=$(cut -d'\' -f 3 $glline) |
"$project";"$param";"$val";"$def";"$type";"$prompt" >> /filepath/file.csv
done < file.txt
done < file.txt
This seems to throw some syntax errors related to unexpected tokens near the first 'done' statement.
Example of text that needs to be parsed, and paired:
gl_one\User Defined\1\String\1\\1\Some Text
gl_two\User Defined\1\String\1\\1\Some Text also
gl_three\User Defined\1\Time\1\\1\Datetime now
some\junk
"gl_one\1\Value1
some\junk
"gl_two\1\Value2
"gl_three\1\Value3
So effectively, the while loop reads each line until it hits the first line that starts with 'gl_', which then stores that value (ie. gl_one) as a variable 'param'.
It then starts the nested while loop that looks for the line that starts with a ' " ' in front of the gl_, and is equivalent to the 'param' value. In other words, the
script should couple the lines gl_one and "gl_one, gl_two and "gl_two, gl_three and "gl_three.
The text file is large, and these are settings that have been defined this way. I need to collect the values for each gl_ parameter, to save them together in a .csv file with their corresponding "gl_ values.
Wanted regex output stored in variables would be something like this:
first while loop:
$param = gl_one, $def = User Defined, $type = String, $prompt = Some Text
second while loop:
$val = Value1
Then it stores these variables to the file.csv, with semi-colon separators.
Currently, I have an error for the first 'done' statement, which seems to indicate an issue with the quotation marks. Apart from this,
I am looking for general ideas and comments to the script. I.e, not entirely sure I am looking for the quotation mark parameters "gl_ correctly, or if the
semi-colons as .csv separators are added correctly.
Edit: Overall, the script runs now, but extremely slow due to the inner while loop. Is there any faster way to match the two lines together and add them to the .csv file?
Any ideas and comments?
This will generate a file containing the data you want:
cat file.txt | grep gl_ | sed -E "s/\"//" | sort | sed '$!N;s/\n/\\/' | awk -F'\' '{print $1"; "$5"; "$7"; "$NF}' > /filepath/file.csv
It uses grep to extract all lines containing 'gl_'
then sed to remove the leading '"' from the lines that contain one [I have assumed there are no further '"' in the line]
The lines are sorted
sed removes the return from each pair of lines
awk then prints
the required columns according to your requirements
Output routed to the file.
LANG=C sort -t\\ -sd -k1,1 <file.txt |\
sed '
/^gl_/{ # if definition
N; # append next line to buffer
s/\n"gl_[^\\]*//; # if value, strip first column
t; # and start next loop
}
D; # otherwise, delete the line
' |\
awk -F\\ -v p="$project" -v OFS=\; '{print p,$1,$10,$2,$4,$8 }' \
>>/filepath/file.csv
sort lines so gl_... appears immediately before "gl_... (LANG fixes LC_TYPE) - assumes definition appears before value
sed to help ensure matching definition and value (may still fail if duplicate/missing value), and tidy for awk
awk to pull out relevant fields

How to extract text between two patterns with sed/awk

I know this has been asked 1000 times here, but I read a lot of similar questions and still did not manage to find the right way to do this. I need to extract a number from a line that looks like this:
{"version":"4.9.123M","info":{"version":[2034.2],"description":""},"status":"OK"}
Expected output:
2034.2
This version number will not always be the same, but the rest of the line should.
I have tried working with sed but I am new to this and failed:
sed -e 's/version":[\(.*\),"description/\1/'
output:
sed: -e expression #1, char 35: unterminated `s' command
I think the issue is that there are too many special characters involved in the line and I did not write the command very well.
Since it's JSON, use should use JSON aware tools for processing it. If you prefer, for example, awk, the way is to use GNU awk's JSON extension. This is a small how-to.
First download and compile appropriate versions of GNU awk, Gawkextlib and gawk-json. That's pretty straightforward, actually, just ./configure and make. Then, write some code:
awk '
#load "json" # enable json extension
{
lines=lines $0 # read json file records and buffer to var lines
if(json_fromJSON(lines,data)==1) { # once the json is complete
for(i in data["info"]["version"]) # that seems to be an array so all elements
print data["info"]["version"][i] # are outputed
lines="" # once done with the first json object
} # reset the var for more lines
}' file
Output this time:
2034.2
Explained a bit more:
The JSON file structure can vary from one line to multiple lines, for example:
{"version":"4.9.123M","info":{"version":[2034.2],"description":""},"status":"OK"}
or:
{
"version": "4.9.123M",
"info": {
"version": [
2034.2
],
"description": ""
},
"status": "OK"
}
so we need to buffer the JSON lines with lines=lines $0 until there is a whole valid object in variable lines. We use the extension function json_fromJSON() to determine that validity in if(json_fromJSON(lines,data)==1). While validated the object gets disentangled and stored to array data. For this particular object the structure of the array is:
data["version"]="4.9.123M"
data["info"]["version"][1]="2034.2"
data["info"]["description"]=""
data["status"]="OK"
We could examine the object and produce some output of it with this recursive array scanning function:
awk '
#load "json"
function scan(a,p, q) { # a is array, p path to it, q is qnd *
if(isarray(a))
for(i in a) {
q=p (p==""?"":"->") i
scan(a[i],q)
}
else
print p ":" a
}
{
lines=lines $0
if(json_fromJSON(lines,data)==1)
scan(data) #
}' file.json
Output:
status:OK
version:4.9.123M
info->version->1:2034.2
info->description:
*) quick'n dirty
Here is a brief example of how to output JSON from an array: https://stackoverflow.com/a/58109715/4162356
If the version is always enclosed in [] and no other [ or ] is present in a line ,you can try this logic
STR='{"version":"4.9.123M","info":{"version":[2034.2],"description":""},"status":"OK"}'
echo $STR | awk -F'[' '{print $2}' | awk -F']' '{print $1}'
Simplest Way
Try grep when want to extract simple texts
echo "{"version":"4.9.123M","info":{"version":[2034.2],"description":""},"status":"OK"}"| grep -o "\[.*\]" | sed -e 's/\[\|\]//g'
This should do:
STR='{"version":"4.9.123M","info":{"version":[2034.2],"description":""},"status":"OK"}'
echo "$STR" | awk -F'[][]' '{print $2}'
2034.2

awk sed backreference csv file

A question to extend previous one here. (I prefer asking new question rather editing first one. I may be wrong)
EDIT : ok, I was wrong, I should edit my first question. My bad (SO question is an art, difficult to master)
I have csv file, with semi-column as field delimiter. Here is an extract of csv file :
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
Here is the desired output :
...;field;(:);(n,d) 10000;(:);field;....
...;field;(b) 123.12;(a) 123;(:) 123.00;....
I search a solution to swap 2 patterns in each field.
pattern 1 : any digit, with optional decimal mark (.) and optional decimal digit
e.g : 1 / 1111.00 / 444444444.3 / 32 / 32.6666666 / 1.0 / ....
pattern 2 : any string that begin with left parenthesis, follow by one or more character, ending with right parenthesis
e.g : (n,a,p) / (:) / (llll) / (d) / (123) / (1;2;3) ...
Solutions provided in first question are right for simple file that contain only one column. If I try the solution within csv file, I face multiple failures.
So I try awk similar solution, which is (I think) more "column-oriented".
I have try
awk -F";" '{print gensub(/([[:digit:].]*)(\(.*\))/, "\\2 \\1", "g")}' file
I though by fixing field delimiter (;), "my regex swap" will succes in every field. It was a mistake.
Here is an exemple of failure
;(:);7320000(n,d);(:)
desired output --> ;(:);(n,d) 7320000;(:)
My questions (finally) : why awk fail when it success with one-column file. what is the best tool to face this challenge ?
sed with very long regex ?
awk with very long regex ?
for loop ?
other tools ?
PS : I know I am not clear. I have 2 problems (English language, technical limitations). Sorry.
Your "question" is far too long, cluttered, and containing too many separate questions to wade through but here's how to get the output you want from the input you provided with any sed:
$ sed 's/\([0-9][0-9.]*\)\(([^)]*)\)/\2 \1/g' file
...;field;(:);(n,d) 10000;(:);field;....
...;field;(b) 123.12;(a) 123;(:) 123.00;....
Well, when parsing simple delimetered files without any quoted values, usually awk comes to the rescue:
awk -vFS=';' -vOFS=';' '{
for (i = 1; i < NF; i++) {
split($i, t, "(")
if (length(t[1]) != 0 && length(t[2]) != 0) {
$i="("t[2]" "t[1]
}
}
print
}' <<EOF
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
EOF
However this will fail if fields are quoted, ie. the separator ; comes inside the values...
First we set input and output seapartor as ;
We iterate through all the fields in the line for (i = 1; i < NF; i++)
We split the line over ( character
If the first field splitted over ( is nonzero length and the second field has also nonzero length
We swap the firelds for this fields and add a space (we also remember about the removed ( on the beginning).
And then the line get's printed.
A solution using sed and xargs, but you need to know the number of fields in advance:
{
sed 's/;/\n/g' |
sed 's/\([^(]\{1,\}\)\((.*)\)/\2 \1/' |
xargs -d '\n' -n7 -- printf "%s;%s;%s;%s;%s;%s;%s\n"
} <<EOF
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
EOF
For each ; i do a newline
For each line i substitute the string with at least on character before ( and a string inside ).
I then merge 7 lines using ; as separator with xargs and printf.
This might work for you (GNU sed):
sed -r 's/([0-9]+(\.[0-9]+)?)(\([^)]*\))/\3 \1/g' file
Look for group of numbers (possibly with a decimal point) followed by a pair of parens and rearrange them in the desired fashion, globally through out each line.

How to extract one column of a csv file

If I have a csv file, is there a quick bash way to print out the contents of only any single column? It is safe to assume that each row has the same number of columns, but each column's content would have different length.
You could use awk for this. Change '$2' to the nth column you want.
awk -F "\"*,\"*" '{print $2}' textfile.csv
yes. cat mycsv.csv | cut -d ',' -f3 will print 3rd column.
The simplest way I was able to get this done was to just use csvtool. I had other use cases as well to use csvtool and it can handle the quotes or delimiters appropriately if they appear within the column data itself.
csvtool format '%(2)\n' input.csv
Replacing 2 with the column number will effectively extract the column data you are looking for.
Landed here looking to extract from a tab separated file. Thought I would add.
cat textfile.tsv | cut -f2 -s
Where -f2 extracts the 2, non-zero indexed column, or the second column.
Here is a csv file example with 2 columns
myTooth.csv
Date,Tooth
2017-01-25,wisdom
2017-02-19,canine
2017-02-24,canine
2017-02-28,wisdom
To get the first column, use:
cut -d, -f1 myTooth.csv
f stands for Field and d stands for delimiter
Running the above command will produce the following output.
Output
Date
2017-01-25
2017-02-19
2017-02-24
2017-02-28
To get the 2nd column only:
cut -d, -f2 myTooth.csv
And here is the output
Output
Tooth
wisdom
canine
canine
wisdom
incisor
Another use case:
Your csv input file contains 10 columns and you want columns 2 through 5 and columns 8, using comma as the separator".
cut uses -f (meaning "fields") to specify columns and -d (meaning "delimiter") to specify the separator. You need to specify the latter because some files may use spaces, tabs, or colons to separate columns.
cut -f 2-5,8 -d , myvalues.csv
cut is a command utility and here is some more examples:
SYNOPSIS
cut -b list [-n] [file ...]
cut -c list [file ...]
cut -f list [-d delim] [-s] [file ...]
I think the easiest is using csvkit:
Gets the 2nd column:
csvcut -c 2 file.csv
However, there's also csvtool, and probably a number of other csv bash tools out there:
sudo apt-get install csvtool (for Debian-based systems)
This would return a column with the first row having 'ID' in it.
csvtool namedcol ID csv_file.csv
This would return the fourth row:
csvtool col 4 csv_file.csv
If you want to drop the header row:
csvtool col 4 csv_file.csv | sed '1d'
First we'll create a basic CSV
[dumb#one pts]$ cat > file
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
Then we get the 1st column
[dumb#one pts]$ awk -F , '{print $1}' file
a
1
a
1
Many answers for this questions are great and some have even looked into the corner cases.
I would like to add a simple answer that can be of daily use... where you mostly get into those corner cases (like having escaped commas or commas in quotes etc.,).
FS (Field Separator) is the variable whose value is dafaulted to
space. So awk by default splits at space for any line.
So using BEGIN (Execute before taking input) we can set this field to anything we want...
awk 'BEGIN {FS = ","}; {print $3}'
The above code will print the 3rd column in a csv file.
The other answers work well, but since you asked for a solution using just the bash shell, you can do this:
AirBoxOmega:~ d$ cat > file #First we'll create a basic CSV
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
And then you can pull out columns (the first in this example) like so:
AirBoxOmega:~ d$ while IFS=, read -a csv_line;do echo "${csv_line[0]}";done < file
a
1
a
1
a
1
a
1
a
1
a
1
So there's a couple of things going on here:
while IFS=, - this is saying to use a comma as the IFS (Internal Field Separator), which is what the shell uses to know what separates fields (blocks of text). So saying IFS=, is like saying "a,b" is the same as "a b" would be if the IFS=" " (which is what it is by default.)
read -a csv_line; - this is saying read in each line, one at a time and create an array where each element is called "csv_line" and send that to the "do" section of our while loop
do echo "${csv_line[0]}";done < file - now we're in the "do" phase, and we're saying echo the 0th element of the array "csv_line". This action is repeated on every line of the file. The < file part is just telling the while loop where to read from. NOTE: remember, in bash, arrays are 0 indexed, so the first column is the 0th element.
So there you have it, pulling out a column from a CSV in the shell. The other solutions are probably more practical, but this one is pure bash.
You could use GNU Awk, see this article of the user guide.
As an improvement to the solution presented in the article (in June 2015), the following gawk command allows double quotes inside double quoted fields; a double quote is marked by two consecutive double quotes ("") there. Furthermore, this allows empty fields, but even this can not handle multiline fields. The following example prints the 3rd column (via c=3) of textfile.csv:
#!/bin/bash
gawk -- '
BEGIN{
FPAT="([^,\"]*)|(\"((\"\")*[^\"]*)*\")"
}
{
if (substr($c, 1, 1) == "\"") {
$c = substr($c, 2, length($c) - 2) # Get the text within the two quotes
gsub("\"\"", "\"", $c) # Normalize double quotes
}
print $c
}
' c=3 < <(dos2unix <textfile.csv)
Note the use of dos2unix to convert possible DOS style line breaks (CRLF i.e. "\r\n") and UTF-16 encoding (with byte order mark) to "\n" and UTF-8 (without byte order mark), respectively. Standard CSV files use CRLF as line break, see Wikipedia.
If the input may contain multiline fields, you can use the following script. Note the use of special string for separating records in output (since the default separator newline could occur within a record). Again, the following example prints the 3rd column (via c=3) of textfile.csv:
#!/bin/bash
gawk -- '
BEGIN{
RS="\0" # Read the whole input file as one record;
# assume there is no null character in input.
FS="" # Suppose this setting eases internal splitting work.
ORS="\n####\n" # Use a special output separator to show borders of a record.
}
{
nof=patsplit($0, a, /([^,"\n]*)|("(("")*[^"]*)*")/, seps)
field=0;
for (i=1; i<=nof; i++){
field++
if (field==c) {
if (substr(a[i], 1, 1) == "\"") {
a[i] = substr(a[i], 2, length(a[i]) - 2) # Get the text within
# the two quotes.
gsub(/""/, "\"", a[i]) # Normalize double quotes.
}
print a[i]
}
if (seps[i]!=",") field=0
}
}
' c=3 < <(dos2unix <textfile.csv)
There is another approach to the problem. csvquote can output contents of a CSV file modified so that special characters within field are transformed so that usual Unix text processing tools can be used to select certain column. For example the following code outputs the third column:
csvquote textfile.csv | cut -d ',' -f 3 | csvquote -u
csvquote can be used to process arbitrary large files.
I needed proper CSV parsing, not cut / awk and prayer. I'm trying this on a mac without csvtool, but macs do come with ruby, so you can do:
echo "require 'csv'; CSV.read('new.csv').each {|data| puts data[34]}" | ruby
I wonder why none of the answers so far have mentioned csvkit.
csvkit is a suite of command-line tools for converting to and working
with CSV
csvkit documentation
I use it exclusively for csv data management and so far I have not found a problem that I could not solve using cvskit.
To extract one or more columns from a cvs file you can use the csvcut utility that is part of the toolbox. To extract the second column use this command:
csvcut -c 2 filename_in.csv > filename_out.csv
csvcut reference page
If the strings in the csv are quoted, add the quote character with the q option:
csvcut -q '"' -c 2 filename_in.csv > filename_out.csv
Install with pip install csvkit or sudo apt install csvkit.
Simple solution using awk. Instead of "colNum" put the number of column you need to print.
cat fileName.csv | awk -F ";" '{ print $colNum }'
csvtool col 2 file.csv
where 2 is the column you are interested in
you can also do
csvtool col 1,2 file.csv
to do multiple columns
You can't do it without a full CSV parser.
If you know your data will not be quoted, then any solution that splits on , will work well (I tend to reach for cut -d, -f1 | sed 1d), as will any of the CSV manipulation tools.
If you want to produce another CSV file, then xsv, csvkit, csvtool, or other CSV manipulation tools are appropriate.
If you want to extract the contents of one single column of a CSV file, unquoting them so that they can be processed by subsequent commands, this Python 1-liner does the trick for CSV files with headers:
python -c 'import csv,sys'$'\n''for row in csv.DictReader(sys.stdin): print(row["message"])'
The "message" inside of the print function selects the column.
If the CSV file doesn't have headers:
python -c 'import csv,sys'$'\n''for row in csv.reader(sys.stdin): print(row[1])'
Python's CSV library supports all kinds of CSV dialects, so if your CSV file uses different conventions, it's possible to support them with relatively little change to the code.
Been using this code for a while, it is not "quick" unless you count "cutting and pasting from stackoverflow".
It uses ${##} and ${%%} operators in a loop instead of IFS. It calls 'err' and 'die', and supports only comma, dash, and pipe as SEP chars (that's all I needed).
err() { echo "${0##*/}: Error:" "$#" >&2; }
die() { err "$#"; exit 1; }
# Return Nth field in a csv string, fields numbered starting with 1
csv_fldN() { fldN , "$1" "$2"; }
# Return Nth field in string of fields separated
# by SEP, fields numbered starting with 1
fldN() {
local me="fldN: "
local sep="$1"
local fldnum="$2"
local vals="$3"
case "$sep" in
-|,|\|) ;;
*) die "$me: arg1 sep: unsupported separator '$sep'" ;;
esac
case "$fldnum" in
[0-9]*) [ "$fldnum" -gt 0 ] || { err "$me: arg2 fldnum=$fldnum must be number greater or equal to 0."; return 1; } ;;
*) { err "$me: arg2 fldnum=$fldnum must be number"; return 1;} ;;
esac
[ -z "$vals" ] && err "$me: missing arg2 vals: list of '$sep' separated values" && return 1
fldnum=$(($fldnum - 1))
while [ $fldnum -gt 0 ] ; do
vals="${vals#*$sep}"
fldnum=$(($fldnum - 1))
done
echo ${vals%%$sep*}
}
Example:
$ CSVLINE="example,fields with whitespace,field3"
$ $ for fno in $(seq 3); do echo field$fno: $(csv_fldN $fno "$CSVLINE"); done
field1: example
field2: fields with whitespace
field3: field3
You can also use while loop
IFS=,
while read name val; do
echo "............................"
echo Name: "$name"
done<itemlst.csv

Resources